Hi all,

I am an old programmer working on a new idea.

I am looking to crawl a specific website and all its sub-domains, trawl it for keywords, count the occurrences of each word under the domain, and store the results in a database.

For example, consider IBM's portal (a massive website): I want to check how many webpages contain the word "ThinkPad".

I have no idea where to start. Should I be looking at things like GNU Wget or Abot, or am I looking at writing a search engine? When you enter a word in Google, it tells you the number of results and the time taken, like "2,999 results in 0.003 seconds".

In simple terms: like running grep on a list of files and piping it into wc (word count), except I want to run it on a website and all its sub-domains and files. I would like to be able to define my criteria in an XML or rules file - something I can enhance and manipulate over time.
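That grep-and-count idea can be sketched in a few lines of Python, assuming the site has already been mirrored to disk (for example with `wget --mirror`) and the keywords live in a small XML rules file. The function names, the XML layout (`<rules><keyword>...</keyword></rules>`), and the file names are all illustrative assumptions, not an existing tool:

```python
import os
import xml.etree.ElementTree as ET

def load_keywords(rules_path):
    """Read keywords from a rules file shaped like
    <rules><keyword>ThinkPad</keyword></rules> (layout is an assumption)."""
    tree = ET.parse(rules_path)
    return [kw.text for kw in tree.iter("keyword")]

def count_pages(mirror_dir, keyword):
    """Count how many files under mirror_dir contain the keyword,
    case-insensitively - the 'grep -l | wc -l' of the analogy."""
    needle = keyword.lower()
    pages = 0
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    if needle in fh.read().lower():
                        pages += 1
            except OSError:
                pass  # skip unreadable files
    return pages
```

Keeping the keywords in the XML file means the matching criteria can grow over time (regexes, per-keyword options) without touching the counting code.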

Where should I start?

Thanks,

cbf28
Posted
Comments
Sergey Alexandrovich Kryukov 4-Dec-12 18:47pm    
There can be many approaches. We don't know what you can do and what you cannot. First of all, it would be good if you told us your preferred platform. If you don't, "where should I start" makes little sense. You should not, but you can.

Did you do anything so far? What are the problems?
There is nothing special about that. If a human can browse a Web site, a program can...
--SA
cbf28 5-Dec-12 14:57pm    
Perfectly put - if a human can browse it, so can a bot.

So my platform of choice would be Linux, and the database would be Oracle. The crawlers will gather the data and store it in the database, because I'll be using that data to generate a set of graphs.

The reason I chose Linux is that once I've got my data stored in Oracle, I can use Linux scripts to query and filter the database, leveraging my familiarity with Linux scripting.

The next step would be to update these graphs in real time - but the proof of concept can be script-based for the moment.

So I know how to store and manipulate the data, however I am exploring tools to gather it from the web.
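For the storage side, a minimal sketch of writing the counts from the crawler into a table - here `sqlite3` stands in for Oracle so the example is self-contained; the usual Oracle driver (cx_Oracle) follows the same Python DB-API pattern, though its placeholder syntax differs. The table name and columns are assumptions:

```python
import sqlite3

def store_count(conn, domain, keyword, pages):
    """Upsert one (domain, keyword) page count into a keyword_counts table.
    sqlite3 is used here only as a stand-in for an Oracle connection."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS keyword_counts (
               domain  TEXT,
               keyword TEXT,
               pages   INTEGER,
               PRIMARY KEY (domain, keyword))"""
    )
    # Replace any previous count for this domain/keyword pair.
    conn.execute(
        "INSERT OR REPLACE INTO keyword_counts VALUES (?, ?, ?)",
        (domain, keyword, pages),
    )
    conn.commit()
```

Once the counts are in a table like this, a Linux shell script can query it (via sqlplus against the real Oracle database) to feed the graphs.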

Ideas?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
