I am a complete beginner when it comes to algorithms, but I need to search multiple websites and extract certain data points, eventually sorting them into categories. Is there a simple algorithm I can run to do this, or do I need something completely different?

Sorry if this is too vague. As I mentioned, I am extremely new to algorithms, and this is the only thing I need one for.

Any help and/or clarification is much appreciated.

Thank you.

What I have tried:

I have tried Googling different types of algorithms, only to end up thoroughly confused. I have no background in algorithms, coding, or anything of the kind. I figured I would ask on this forum and see if I can be pointed in the right direction.
Updated 22-Jul-16 4:06am
Comments
Richard Deeming 22-Jul-16 10:08am    
Extracting useful data from websites that don't want you to extract it is extremely difficult, and the result is fragile. You're essentially going to have to parse the HTML to try to find the data, which will need a different set of rules for every site. And as soon as a site changes its markup even slightly, your rules are invalidated and have to be rewritten.
Beginner'sbeginner 22-Jul-16 10:42am    
Hello Richard.
First of all, thank you for taking the time to reply to my question. The data I am trying to extract is not confidential and is available to the public. For example: I am trying to find the cheapest place to buy a certain book and would like to run an algorithm that searches various book store websites to find the best price. Does this help in any way, shape, or form?
Richard Deeming 22-Jul-16 11:33am    
Not really. The data may be public, but that doesn't mean the site wants you to scrape their data to display on your own site.

If the site doesn't provide an API to query their data - and most don't - then you're stuck with trying to parse the HTML pages they return.

If you're lucky, they might use structured data within the HTML to represent their products. If they do, then you just need to work out which format of structured data they're using, and extract that data from the HTML.
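
To make the structured-data case concrete, here is a minimal Python sketch. It assumes the site embeds schema.org-style JSON-LD in a <script type="application/ld+json"> block, which is only one of several structured-data formats a site might use. The sample HTML, the book name, and the helper names (JsonLdExtractor, extract_offers) are made up for illustration; a real page may nest or list its offers differently.

```python
import json
from html.parser import HTMLParser

# Made-up sample page; a real site's JSON-LD may be shaped differently.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Book", "name": "Example Book",
 "offers": {"@type": "Offer", "price": "12.99", "priceCurrency": "USD"}}
</script>
</head><body></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of JSON-LD <script> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed block: skip it rather than crash.


def extract_offers(html):
    """Return (name, price, currency) for each item with a single 'offers' object."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    offers = []
    for item in parser.items:
        if not isinstance(item, dict):
            continue
        offer = item.get("offers")
        if isinstance(offer, dict) and "price" in offer:
            offers.append(
                (item.get("name"), float(offer["price"]), offer.get("priceCurrency"))
            )
    return offers
```

Calling extract_offers(SAMPLE_HTML) gives one tuple holding the book's name, price, and currency. A page without any JSON-LD simply yields an empty list, which is the signal to fall back to site-specific rules.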

If not, then you're stuck with trying to understand their HTML markup to find a set of rules to extract the data. And as soon as they change their markup even slightly, your rules need to be rewritten.
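
As a rough illustration of what "a set of rules" per site looks like, and why it breaks so easily, here is a small Python sketch. Everything in it is hypothetical: the store names, the tag and class names in RULES, and the sample pages. It only reads the text that directly follows the matching opening tag, so it is a sketch of the idea and not a robust scraper.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: the (tag, CSS class) that wraps the price.
RULES = {
    "store-a.example": ("span", "price"),
    "store-b.example": ("div", "book-cost"),
}

# Made-up pages standing in for what each store might return.
PAGES = {
    "store-a.example": '<p>Our price: <span class="price">$12.99</span></p>',
    "store-b.example": '<div class="book-cost big">$10.50</div>',
}


class FirstTextByClass(HTMLParser):
    """Grabs the text right after the first tag carrying a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if self.text is not None or tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self.text = data.strip()
            self._capturing = False


def extract_price(site, html):
    """Apply the site's rule; return a float, or None if the rule no longer matches."""
    tag, css_class = RULES[site]
    parser = FirstTextByClass(tag, css_class)
    parser.feed(html)
    parser.close()
    if not parser.text:
        return None
    try:
        return float(parser.text.lstrip("$£€").replace(",", ""))
    except ValueError:
        return None


def cheapest(pages):
    """Return (site, price) for the lowest price found, or None if nothing matched."""
    prices = {site: extract_price(site, html) for site, html in pages.items()}
    found = {site: p for site, p in prices.items() if p is not None}
    if not found:
        return None
    site = min(found, key=found.get)
    return site, found[site]
```

With the sample pages, cheapest(PAGES) picks store-b.example. If store-a.example renamed its class from "price" to "amount", extract_price would return None for that site until the rule in RULES was updated, which is exactly the maintenance burden described above.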

(PS: Use the "Reply" button next to the comment to post a reply; that way, the author will be notified.)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


