Hello All,

I have to create a web service that takes any URL and downloads the content of that web page (this can be done using WebClient or WebRequest).
I then need to parse the HTML document I have downloaded (this can be done using HtmlAgilityPack).

But I don't know how to get the relevant content from that page. If the web page is an article, I need to extract the article content and leave out the ads, unrelated links, and so on. If the page contains no long text but a lot of links and buttons, I have to download them along with the CSS and JS.

I tried this using HtmlAgilityPack, but for some sites it is not able to get the real content of the CSS and JS files. When I open these files after downloading, they are either blank or contain an error message.

I searched and found topics such as data mining algorithms and HTML parsing, but I didn't find any code sample, API, or at least a clear example.
Can Readability be used for this?

Please explain anything related to this topic. What should my approach be?

How can I achieve this?

(Also: I need to extract that relevant content, and in some cases the whole web page, and save it for offline reading.)

Thanks in advance.

You need to define 'relevant content' precisely. At the moment you're having trouble because you don't actually know what you want. If you have downloaded the whole page then you have the complete object tree: you can analyse it, assign values to various metrics for each node, and run a heuristic analysis to find the 'main' node. But you need to define what you are looking for.
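As a rough illustration of the "metrics per node" idea, here is a minimal sketch in Python using only the standard library (the same approach should carry over to an HtmlAgilityPack node tree). The metric is an assumption chosen for the example, not an established algorithm: each node is scored by the amount of non-link text in its direct `<p>` children, and the highest-scoring node is taken as the 'main' node. The names `Node`, `TreeBuilder`, `score` and `find_main_node` are made up for this sketch.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
HIDDEN = {"script", "style"}  # their text is not page content


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text anywhere below this node
        self.link_len = 0  # how many of those characters sit inside <a>


class TreeBuilder(HTMLParser):
    """Builds a very forgiving element tree and accumulates text metrics."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        chain = []
        node = self.current
        while node is not None:
            chain.append(node)
            node = node.parent
        if any(x.tag in HIDDEN for x in chain):
            return
        in_link = any(x.tag == "a" for x in chain)
        for x in chain:
            x.text_len += n
            if in_link:
                x.link_len += n


def score(node):
    """Assumed metric: non-link text held in the node's direct <p> children."""
    return sum(c.text_len - c.link_len for c in node.children if c.tag == "p")


def find_main_node(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        s = score(node)
        if s > best_score:
            best, best_score = node, s
    return best


SAMPLE = """
<div id="layout">
 <div id="nav"><p><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a></p></div>
 <div id="content">
  <p>First paragraph of the article, long enough to count as real text.</p>
  <p>Second paragraph with a <a href="/ref">reference</a> and more prose.</p>
 </div>
 <div id="footer"><p>Copyright notice</p></div>
</div>
"""

if __name__ == "__main__":
    print(find_main_node(SAMPLE).attrs.get("id"))
```

A metric this simple is easy to fool (a long footer can outscore a short article, and content split across nested divs is scored per div), which is why the definition of what you are looking for matters more than the code.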

Because divs can be used for actual divisions and for layout, it can be quite difficult. How do you distinguish between

XML
<pre><div id="layout">
 <div id="content">
  <p>bla bla</p>
 </div>
 <div id="footer">
  <p>some template footer stuff</p>
 </div>
</div></pre>


... and

XML
<pre><div id="content">
 <div id="p1">
  <p>bla bla</p>
 </div>
 <div id="p2">
  <p>yak yak</p>
 </div>
</div></pre>


... where you want only the first paragraph in the top example but both in the second? If the divs have different styles then that can be a clue, but then again you want to include image captions, insets and other divs which may have a different style.

If you know the sites that you're scraping then you can use that information to help you. In the extreme case, you will know that site X puts its main content inside a <div id="content"/> and you can just go straight to that item.
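For the known-site case, a minimal sketch could look like the following (Python, standard library only). The `CONTENT_IDS` table, the host name `example.com` and the helper names are hypothetical; in practice you would fill the table in by inspecting each site you scrape. Nesting is tracked only for tags with the same name as the matched element, so unclosed `<p>` tags or stray end tags inside it do not throw the count off.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: host name -> id of the element holding the main content.
CONTENT_IDS = {"example.com": "content"}


class ElementTextById(HTMLParser):
    """Collects the text inside the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.target_tag = None   # tag name of the matched element, once found
        self.depth = 0           # > 0 while inside the matched element
        self.in_skipped = False  # True while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skipped = True
        if self.depth:
            if tag == self.target_tag:
                self.depth += 1
        elif self.target_tag is None and dict(attrs).get("id") == self.target_id:
            self.target_tag = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_skipped = False
        if self.depth and tag == self.target_tag:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text and not self.in_skipped:
            self.parts.append(text)


def extract_known_site(url, html):
    """Returns the main text for a known site, or None if there is no rule for it."""
    target_id = CONTENT_IDS.get(urlparse(url).hostname)
    if target_id is None:
        return None  # unknown site: fall back to a generic heuristic
    parser = ElementTextById(target_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


PAGE = """<div id="layout">
 <div id="content">
  <p>bla bla</p>
 </div>
 <div id="footer">
  <p>some template footer stuff
 </div>
</div>"""

if __name__ == "__main__":
    print(extract_known_site("http://example.com/article", PAGE))
```

Returning `None` for unknown hosts keeps the site-specific shortcut separate from whatever general heuristic you use for everything else.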

Any page that relies on scripts to work (i.e. to load the primary content) won't be readable unless you run a scripting engine. Not many do that for the main content, but it's something to be aware of.
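One crude way to notice such pages before trying to extract anything is sketched below, again in standard-library Python. It only flags a page as probably script-driven when the downloaded HTML contains `<script>` tags but very little visible text. The 200-character threshold and the names `VisibleTextCounter` and `looks_script_driven` are assumptions for the example, and a positive result is a hint, not proof.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts <script> tags and text characters outside script/style/noscript."""

    HIDDEN = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self.hidden_tag = None  # name of the hidden element we are inside, if any
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if self.hidden_tag is None and tag in self.HIDDEN:
            self.hidden_tag = tag

    def handle_endtag(self, tag):
        if tag == self.hidden_tag:
            self.hidden_tag = None

    def handle_data(self, data):
        if self.hidden_tag is None:
            self.text_chars += len(data.strip())


def looks_script_driven(html, min_text_chars=200):
    """True if the page has scripts but less visible text than min_text_chars."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


SHELL_PAGE = ('<html><body><div id="app"></div>'
              '<script src="/bundle.js"></script>'
              '<noscript>Enable JavaScript</noscript></body></html>')
STATIC_PAGE = ('<html><body><p>' + 'word ' * 100 + '</p>'
               '<script>var x = 1;</script></body></html>')

if __name__ == "__main__":
    print(looks_script_driven(SHELL_PAGE), looks_script_driven(STATIC_PAGE))
```

Pages flagged this way would need a real browser or scripting engine to render before any content extraction makes sense.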
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


