Web scraping to get the main content of a web page?

Question

Hello All,

I have to create a web service that, given any URL, downloads the content of that web page (this can be done using WebClient or WebRequest).
I then need to parse the HTML document I have downloaded (this can be done using HtmlAgilityPack).

But I don't know how to get the relevant content from that page. I mean, if the web page is an article, then I need to extract the article content and leave out all the ads, other links, etc. If the page contains no long text but a lot of links and buttons, then I have to download them along with the CSS and JS.

I tried it using HtmlAgilityPack, but for some sites it is not able to get the real content of the CSS and JS files. When I open these files after downloading, they are either blank or contain some error message.

I searched and found references to things like data mining algorithms and HTML parsing, but I didn't find any code sample, API, or at least a clear example.
Can Readability be used for this?

Please explain anything related to this topic. What should my approach be?

How can I achieve this?

(Also: I need to extract that relevant content, and in some cases the whole web page, and save it for offline reading.)

Thanks in advance.

Posted 23-Jul-12 23:17pm

DeepsMann

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

BobJanova · Answer 1 · 2012-07-23T23:37:00

You need to define 'relevant content' precisely. At the moment you're having trouble because you don't actually know what you want. If you have downloaded the whole page, then you have the complete object tree: you can analyse it, assign values to various metrics for each node, and run a heuristic analysis to find the 'main' node. But you need to define what you are looking for.
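As a rough illustration of scoring nodes with metrics, here is a minimal Python sketch using only the standard library. The metrics (text in direct `<p>` children, penalised by link density), the candidate tag list, and all names in it are assumptions chosen for the example, not a definition of 'relevant content'; a real page would need tuning and more signals.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Hypothetical choice of elements that may act as the 'main' container.
CANDIDATE_TAGS = {"div", "article", "section", "td", "body"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text anywhere in this subtree
        self.link_len = 0  # characters of that text that sit inside <a>


class TreeBuilder(HTMLParser):
    """Builds a small element tree and accumulates text metrics per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        chain = []
        node = self.current
        while node is not None:
            chain.append(node)
            node = node.parent
        tags = {c.tag for c in chain}
        if tags & {"script", "style"}:
            return  # not visible text
        for c in chain:
            c.text_len += n
            if "a" in tags:
                c.link_len += n


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def score(node):
    """Non-link text in direct <p> children, scaled down by link density."""
    if node.text_len == 0:
        return 0.0
    link_density = node.link_len / node.text_len
    paragraph_text = sum(c.text_len - c.link_len
                         for c in node.children if c.tag == "p")
    return paragraph_text * (1.0 - link_density)


def find_main_node(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in walk(builder.root) if n.tag in CANDIDATE_TAGS]
    return max(candidates, key=score, default=None)


SAMPLE = """
<div id="layout">
 <div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
 <div id="content">
  <p>First paragraph of the article, long enough to count as real text.</p>
  <p>Second paragraph with a <a href="/x">link</a> and more prose around it.</p>
 </div>
 <div id="footer"><p><a href="/terms">Terms of use</a></p></div>
</div>
"""

print(find_main_node(SAMPLE).attrs.get("id"))  # prints "content"
```

Scoring only direct paragraph children keeps the outer layout div from winning just because it contains everything; that is one possible design choice, and it will misfire on pages that wrap each paragraph in its own div.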

Because divs can be used for actual divisions and for layout, it can be quite difficult. How do you distinguish between

XML

<pre><div id="layout">
 <div id="content">
  <p>bla bla</p>
 </div>
 <div id="footer">
  <p>some template footer stuff</p>
 </div>
</div></pre>

... and

XML

<pre><div id="content">
 <div id="p1">
  <p>bla bla</p>
 </div>
 <div id="p2">
  <p>yak yak</p>
 </div>
</div></pre>

... where you want only the first paragraph in the top example but both in the second? If the divs have different styles, that can be a clue, but then again you want to include image captions, insets and other divs which may have a different style.

If you know the sites that you're scraping then you can use that information to help you. In the extreme case, you will know that site X puts its main content inside a <div id="content"/> and you can just go straight to that item.
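For the known-site case, going straight to one element can be sketched in a few lines of standard-library Python. The function name `extract_text_by_id` and the sample page are hypothetical, and the depth counting assumes well-formed markup where every non-void start tag has a matching end tag.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collects the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # greater than 0 while inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def extract_text_by_id(html, element_id):
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


PAGE = (
    '<div id="header"><p>Site name</p></div>'
    '<div id="content"><p>bla bla</p><p>yak yak</p></div>'
    '<div id="footer"><p>footer stuff</p></div>'
)

print(extract_text_by_id(PAGE, "content"))  # prints "bla bla yak yak"
```

A per-site table mapping host names to such ids is one way to use this, at the cost of maintaining the table whenever a site changes its template.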

Any page that relies on scripts to work (i.e. to load the primary content) won't be readable unless you run a scripting engine. Not many do that for the main content, but it's something to be aware of.
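One way to at least notice such pages is to check how much visible text the raw download contains. The sketch below is a guess-level heuristic, not a reliable test: the 200-character threshold, the set of tags treated as non-body text, and the name `probably_needs_scripts` are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags whose text is not counted as visible body text (an assumption).
HIDDEN_TAGS = {"script", "style", "noscript", "title"}


class ScriptProbe(HTMLParser):
    """Counts <script> tags and visible text characters in a document."""

    def __init__(self):
        super().__init__()
        self.hidden = []        # stack of currently open hidden tags
        self.visible_chars = 0
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in HIDDEN_TAGS:
            self.hidden.append(tag)

    def handle_endtag(self, tag):
        if self.hidden and self.hidden[-1] == tag:
            self.hidden.pop()

    def handle_data(self, data):
        if not self.hidden:
            self.visible_chars += len(data.strip())


def probably_needs_scripts(html, min_chars=200):
    """True when the page has scripts but almost no visible text of its own."""
    probe = ScriptProbe()
    probe.feed(html)
    probe.close()
    return probe.script_count > 0 and probe.visible_chars < min_chars


# A page whose body is an empty container filled in by a script.
SHELL = (
    '<html><head><title>App</title><script src="app.js"></script></head>'
    '<body><div id="root"></div><noscript>Enable JavaScript</noscript>'
    '</body></html>'
)
# A page that carries its text in the markup and only adds a tracking script.
STATIC = (
    '<html><body><p>' + 'Real article text. ' * 20 + '</p>'
    '<script src="stats.js"></script></body></html>'
)

print(probably_needs_scripts(SHELL), probably_needs_scripts(STATIC))
```

Pages flagged this way would still need a scripting engine or a headless browser to obtain their content; the check only tells you that plain downloading is unlikely to be enough.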

Dave Newton · Answer 2 · 2012-09-26T17:56:00

Solution 2

Try the boilerplate library.
