NOTICE - The code in this article is no longer viable due to recent (and somewhat radical) changes in the format of the CodeProject pages being scraped. For this reason, I have come up with a completely new article that exploits the new format changes. That article is here:
CodeProject Article Scraper, Revisited
I left this article on the site to give folks the opportunity to compare coding styles, structure changes, and even scraping methodology.
Introduction
This article describes a method for scraping data off the CodeProject My Articles page. There is currently no CodeProject API for retrieving this data, so this is the only way to get the info. Unfortunately, the format of this page could change at any time and may break this code, so it's up to you to stay on top of this issue. This should be quite easy since I've done all the hard work for you - all you have to do is maintain it.
IMPORTANT NOTE: Check the History section at the bottom of this article and make sure you implement the bug fix(es) shown there.
The ArticleData Class
The ArticleData class contains the data for each article scraped off the web page. The most interesting aspect of this class is that it implements IComparable<ArticleData>, so the generic list that contains the ArticleData objects can be sorted on any of the scraped values. There are several ways to sort a generic list, and I used the one that kept the referring code the cleanest. What I'm trying to say is that you should pick the way you want to do it. No method is more correct than any other; the choice is a matter of programmer style and preference more than anything else.
The Way I Did It
I chose to have the ArticleData class implement IComparable<ArticleData> and write the functions necessary to perform the sorting. This keeps the referencing code free of needless clutter, thus making the code easier to read. This is the way I like to do things. In my humble opinion, there is no point in bothering the programmer with needless minutiae. Instead of posting the entire class in this article, I'll simply show you two of the comparison delegates (along with the CompareTo method they sit beside):
public class ArticleData : IComparable<ArticleData>
{
    #region Comparison delegates

    public static Comparison<ArticleData> TitleCompare = delegate(ArticleData p1, ArticleData p2)
    {
        return (p1.SortAscending) ? p1.m_title.CompareTo(p2.m_title)
                                  : p2.m_title.CompareTo(p1.m_title);
    };

    public static Comparison<ArticleData> PageViewsCompare = delegate(ArticleData p1, ArticleData p2)
    {
        return (p1.SortAscending) ? p1.m_pageViews.CompareTo(p2.m_pageViews)
                                  : p2.m_pageViews.CompareTo(p1.m_pageViews);
    };

    public int CompareTo(ArticleData other)
    {
        return ArticleID.CompareTo(other.ArticleID);
    }

    #endregion Comparison delegates

    // ... fields, properties, and the remaining comparison delegates elided ...
}
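Hooking one of these delegates up to a sort is then a one-liner on the generic list. Here's a minimal sketch, assuming the scraped items live in a List<ArticleData> named articles and that SortAscending is a settable property on each item (the delegates above consult it):

// Sort by page views, descending: clear SortAscending on every item,
// then hand the comparison delegate to List<T>.Sort.
foreach (ArticleData item in articles)
{
    item.SortAscending = false;
}
articles.Sort(ArticleData.PageViewsCompare);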
The ArticleUpdate Class
This class is derived from the ArticleData class and, at first blush, it appears to be an exact duplicate of the ArticleData class, but that's not the case. To make the code truly useful, you need a way to identify changes since your last data scrape. For the purposes of this demo, that's what this class enables. I recognize that you might have different reasons for scraping the My Articles page, so you should be prepared to write your own class that performs the functionality your application requires. It's my guess that your implementation will be more extensive than my own.
The class has its own sort delegates. They're similar enough to the ones shown above that I decided not to show them in this article because it would be redundant. The truly interesting methods in this class are:
ApplyChanges
This method is called from the scraper manager object (covered in the next section) when an article is scraped off the web page. If the article already exists in the list, we call this method to shift the previously scraped values into the article's current values and store the newly scraped values as the latest ones. If ANYTHING has changed for the article, this method returns true.
public bool ApplyChanges(ArticleUpdate item, DateTime timeOfUpdate, bool newArticle)
{
    bool changed = false;

    // Shift the previously scraped ("latest") values into the current slots...
    this.m_title       = m_latestTitle;
    this.m_link        = m_latestLink;
    this.m_lastUpdated = m_latestLastUpdated;
    this.m_description = m_latestDescription;
    this.m_pageViews   = m_latestPageViews;
    this.m_rating      = m_latestRating;
    this.m_votes       = m_latestVotes;
    this.m_popularity  = m_latestPopularity;
    this.m_bookmarks   = m_latestBookmarks;

    // ...and store the newly scraped values as the latest ones.
    this.m_latestTitle       = item.m_latestTitle;
    this.m_latestLink        = item.m_latestLink;
    this.m_latestLastUpdated = item.m_latestLastUpdated; // keeps the date pair in
                                                         // step with the other fields
    this.m_latestDescription = item.m_latestDescription;
    this.m_latestPageViews   = item.m_latestPageViews;
    this.m_latestRating      = item.m_latestRating;
    this.m_latestVotes       = item.m_latestVotes;
    this.m_latestPopularity  = item.m_latestPopularity;
    this.m_latestBookmarks   = item.m_latestBookmarks;
    this.m_timeUpdated       = timeOfUpdate;
    this.m_newArticle        = newArticle;

    // Any difference between the two sets of values means a change.
    changed = (this.m_title != m_latestTitle ||
               this.m_link != m_latestLink ||
               this.m_lastUpdated != m_latestLastUpdated ||
               this.m_description != m_latestDescription ||
               this.m_pageViews != m_latestPageViews ||
               this.m_rating != m_latestRating ||
               this.m_votes != m_latestVotes ||
               this.m_popularity != m_latestPopularity ||
               this.m_bookmarks != m_latestBookmarks ||
               this.m_newArticle == true);
    m_changed = changed;
    return changed;
}
PropertyChanged
The PropertyChanged method allows you to see if a specific property has changed. Simply provide the property name and handle the return value (true if the property's value changed).
public bool PropertyChanged(string property)
{
    string originalProperty = property;
    property = property.ToLower();
    switch (property)
    {
        case "title"       : return (Title != LatestTitle);
        case "link"        : return (Link != LatestLink);
        case "description" : return (Description != LatestDescription);
        case "pageviews"   : return (PageViews != LatestPageViews);
        case "rating"      : return (Rating != LatestRating);
        case "votes"       : return (Votes != LatestVotes);
        case "popularity"  : return (Popularity != LatestPopularity);
        case "bookmarks"   : return (Bookmarks != LatestBookmarks);
        case "lastupdated" : return (LastUpdated != LatestLastUpdated);
    }
    throw new Exception(string.Format("Unknown article property - '{0}'", originalProperty));
}
HowChanged
This method accepts a property name and returns a ChangeType enumerator indicating whether the new value is equal to, greater than, or less than the last value that was scraped.
public ChangeType HowChanged(string property)
{
    ChangeType changeType = ChangeType.None;
    string originalProperty = property;
    property = property.ToLower();
    switch (property)
    {
        // Text fields and dates don't get an up/down indicator.
        case "title":
        case "link":
        case "description":
        case "lastupdated":
            break;

        // Page views and votes are treated as monotonically increasing,
        // so any change shows as Up.
        case "pageviews":
            if (PageViews != LatestPageViews)
            {
                changeType = ChangeType.Up;
            }
            break;

        case "votes":
            if (Votes != LatestVotes)
            {
                changeType = ChangeType.Up;
            }
            break;

        // For the remaining fields, remember that the non-"Latest"
        // property holds the OLDER value.
        case "rating":
            if (Rating > LatestRating)
            {
                changeType = ChangeType.Down;
            }
            else if (Rating < LatestRating)
            {
                changeType = ChangeType.Up;
            }
            break;

        case "popularity":
            if (Popularity > LatestPopularity)
            {
                changeType = ChangeType.Down;
            }
            else if (Popularity < LatestPopularity)
            {
                changeType = ChangeType.Up;
            }
            break;

        case "bookmarks":
            if (Bookmarks > LatestBookmarks)
            {
                changeType = ChangeType.Down;
            }
            else if (Bookmarks < LatestBookmarks)
            {
                changeType = ChangeType.Up;
            }
            break;

        default:
            throw new Exception(string.Format("Unknown article property - '{0}'",
                                              originalProperty));
    }
    return changeType;
}
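To show how PropertyChanged and HowChanged pair up when rendering the article table, here's a hypothetical helper for the numeric columns - the icon file names are invented for the example:

// Pick an arrow icon for a numeric field, or null if the value
// hasn't changed since the last scrape.
private string GetChangeIcon(ArticleUpdate article, string property)
{
    if (!article.PropertyChanged(property))
    {
        return null;
    }
    return (article.HowChanged(property) == ChangeType.Up) ? "up.png" : "down.png";
}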
The ArticleScraper Class
To make things easy on myself, I put all of the scraping code into this class. The web page is requested, and then parsed to within an inch of its life. For purposes of this article, I placed no value in determining the category/sub-category under which the article is posted.
The RetrieveArticles method is responsible for making the page request and managing the parsing chores, which are themselves broken up into manageable chunks. During testing of the scraping code, I went to the My Articles page in a web browser and saved the source code to a file. This allowed me to test without having to repeatedly hammer CodeProject during initial development of the parsing code. I decided to leave the code in the class to allow other programmers the same luxury. Here are the important bits (the text file specified in the code is provided with this article's download file):
if (this.ArticleSource == ArticleSource.CodeProject)
{
    // The url is built in pieces - see the note below the listing.
    string url = string.Format("{0}{1}{2}",
                               "http://www.codeproject.com/script/",
                               "Articles/MemberArticles.aspx?amid=",
                               this.UserID);
    Uri uri = new Uri(url);
    WebClient webClient = new WebClient();
    string response = "";
    try
    {
        webClient.Proxy = WebRequest.DefaultWebProxy;
        webClient.Proxy.Credentials = CredentialCache.DefaultCredentials;
        response = webClient.DownloadString(uri);
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace ("throw ex" would).
        throw;
    }
    pageSource = response;
}
else
{
    StringBuilder builder = new StringBuilder("");
    string filename = System.IO.Path.Combine(Application.StartupPath,
                                             "MemberArticles.txt");
    StreamReader reader = null;
    try
    {
        reader = File.OpenText(filename);
        string input = null;
        while ((input = reader.ReadLine()) != null)
        {
            builder.Append(input);
        }
    }
    catch (Exception)
    {
        throw;
    }
    finally
    {
        // OpenText may have thrown before reader was assigned.
        if (reader != null)
        {
            reader.Close();
        }
    }
    pageSource = builder.ToString();
}
Note - The line in the code above that builds the url string is formatted to prevent the containing <pre> tag from potentially forcing this article's page to require horizontal scrolling.
After getting the web page, the pageSource variable should contain something. If it does, we hit the following code (we're still in the RetrieveArticles method):
int articleNumber = 0;
bool found = true;
while (found)
{
    // Each article lives in a span whose id encodes the article's ordinal,
    // e.g. ctl00_MC_AR_ctl00_MAS, ctl00_MC_AR_ctl01_MAS, and so on.
    string articleStart = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS",
                                        string.Format("{0:00}", articleNumber));
    string articleEnd = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS",
                                      string.Format("{0:00}", articleNumber + 1));
    int startIndex = pageSource.IndexOf(articleStart);
    if (startIndex >= 0)
    {
        pageSource = pageSource.Substring(startIndex);
        startIndex = 0;
        int endIndex = pageSource.IndexOf(articleEnd);
        if (endIndex == -1)
        {
            // Last article on the page - fall back to the next table,
            // or failing that, the end of the document.
            endIndex = pageSource.IndexOf("<table");
            if (endIndex == -1)
            {
                endIndex = pageSource.Length - 1;
            }
        }
        string data = pageSource.Substring(0, endIndex);
        if (data != "")
        {
            ProcessArticle(data, articleNumber);
        }
        else
        {
            found = false;
        }
        articleNumber++;
    }
    else
    {
        found = false;
    }
}
CalculateAverages();
I guess I could have used LINQ to scrounge around in the XML, but when you get right down to it, we can't count on the HTML being valid, so it's simply more reliable to parse the text this way. I know, Chris, et al., work hard at making sure everything is just so, but they are merely human, and we know we can't count on humans to do it right every single time.
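Incidentally, if all you want is the list itself, the calling code boils down to very little. Here's a rough sketch - the Articles accessor and the exact way the user ID is supplied are assumptions on my part, so check the download for the real signatures:

// Hypothetical driver: scrape the articles and print a quick summary.
ArticleScraper scraper = new ArticleScraper();
scraper.UserID = "YOUR_MEMBER_ID";                 // see the Closing section
scraper.ArticleSource = ArticleSource.CodeProject; // or use the saved text file
scraper.RetrieveArticles();
foreach (ArticleUpdate article in scraper.Articles) // assumed accessor
{
    Console.WriteLine("{0} ({1} page views)", article.LatestTitle, article.LatestPageViews);
}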
Processing an Article
By "process", I mean parsing out the HTML and digging the actual data out of the article's div. While fairly simple, it is admittedly tedious. We start out by getting the article's URL, which is a straightforward operation:
private string GetArticleLink(string data)
{
    string result = data;
    int hrefIndex = result.IndexOf("href=\"") + 6;
    int endIndex = result.IndexOf("\">", hrefIndex);
    result = result.Substring(hrefIndex, endIndex - hrefIndex).Trim();
    return result;
}
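Fed the anchor tag at the top of an article's block, the method returns everything between href=" and the closing ">. For example (sample markup invented for illustration):

// GetArticleLink("<a href=\"/KB/cs/ArticleScraper.aspx\">My Article</a> ...")
//     returns "/KB/cs/ArticleScraper.aspx"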
Next, we clean the data, starting off by removing all of the HTML tags. A change was made to the source code to make the removal of HTML tags a little smarter. If the article title and/or description contain more than one pointy bracket, this method is almost guaranteed to return only a portion of the actual text of the item in question. If you like, you can google for (and use) one of the many exhaustive HTML parsers available on the net. IMHO, it's not worth the effort considering this class's primary usage and the consistently decent HTML we get from CodeProject.
private string RemoveHtmlTags(string data)
{
    int ltCount = CountChar(data, '<');
    int gtCount = CountChar(data, '>');
    if (ltCount == gtCount)
    {
        data = ForwardStrip(data);
    }
    else if (gtCount > ltCount)
    {
        data = BackwardStrip(ForwardStrip(data));
    }
    else
    {
        data = ForwardStrip(BackwardStrip(data));
    }
    return data;
}
private int CountChar(string data, char value)
{
    int count = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] == value)
        {
            count++;
        }
    }
    return count;
}
private string ForwardStrip(string data)
{
    bool found = true;
    do
    {
        int tagStart = data.IndexOf("<");
        int tagEnd = data.IndexOf(">");
        if (tagEnd >= 0)
        {
            tagEnd += 1;
        }
        found = (tagStart >= 0 && tagEnd >= 0 && tagEnd - tagStart > 1);
        if (found)
        {
            string tag = data.Substring(tagStart, tagEnd - tagStart);
            data = data.Replace(tag, "");
        }
    } while (found);
    return data;
}
private string BackwardStrip(string data)
{
    bool found = true;
    do
    {
        int tagStart = data.LastIndexOf("<");
        int tagEnd = data.LastIndexOf(">");
        if (tagEnd >= 0)
        {
            tagEnd += 1;
        }
        found = (tagStart >= 0 && tagEnd >= 0 && tagEnd - tagStart > 1);
        if (found)
        {
            string tag = data.Substring(tagStart, tagEnd - tagStart);
            data = data.Replace(tag, "");
        }
    } while (found);
    return data;
}
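A quick sanity check of the stripping behavior, called from inside the class, with sample strings invented for the purpose:

// Balanced brackets: the forward pass removes each complete tag in turn.
string html = "<b>My Article</b> - <i>now with generics</i>";
Console.WriteLine(RemoveHtmlTags(html));
// prints: My Article - now with generics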
Then, we remove all the extra stuff left behind:
private string CleanData(string data)
{
    data = RemoveHtmlTags(data);
    // Tabs become our field delimiter; non-breaking spaces become real ones.
    data = data.Replace("\t", "^").Replace("&nbsp;", " ");
    data = data.Replace("\n", "").Replace("\r", "");
    data = data.Replace(" / 5", "");
    // Collapse runs of spaces, then runs of delimiters.
    while (data.IndexOf("  ") >= 0)
    {
        data = data.Replace("  ", " ");
    }
    while (data.IndexOf("^ ^") >= 0)
    {
        data = data.Replace("^ ^", "^");
    }
    while (data.IndexOf("^^") >= 0)
    {
        data = data.Replace("^^", "^");
    }
    // Trim the leading and trailing delimiters.
    data = data.Substring(1);
    data = data.Substring(0, data.Length - 1);
    return data;
}
After this, we're left with a pure list of data that describes the article, delimited with caret characters. All that's left is to create an ArticleUpdate item and store it in our generic list.
private void ProcessArticle(string data, int articleNumber)
{
    string link = GetArticleLink(data);
    data = CleanData(data);
    string[] parts = data.Split('^');
    string title = parts[0];
    string description = parts[7];
    string lastUpdated = GetDataField("Last Update", parts);
    string pageViews = GetDataField("Page Views", parts).Replace(",", "");
    // The Replace below is the fix described in the History section - without
    // it, the rating can't be parsed and retrieval always appears to fail.
    string rating = GetDataField("Rating", parts).Replace("/5", "");
    string votes = GetDataField("Votes", parts).Replace(",", "");
    string popularity = GetDataField("Popularity", parts);
    string bookmarks = GetDataField("Bookmark Count", parts);
    DateTime lastUpdatedDate;

    ArticleUpdate article = new ArticleUpdate();
    article.LatestLink = string.Format("http://www.codeproject.com{0}", link);
    article.LatestTitle = title;
    article.LatestDescription = description;
    if (DateTime.TryParse(lastUpdated, out lastUpdatedDate))
    {
        article.LatestLastUpdated = lastUpdatedDate;
    }
    else
    {
        article.LatestLastUpdated = new DateTime(1990, 1, 1);
    }
    article.LatestPageViews = Convert.ToInt32(pageViews);
    article.LatestRating = Convert.ToDecimal(rating);
    article.LatestVotes = Convert.ToInt32(votes);
    article.LatestPopularity = Convert.ToDecimal(popularity);
    article.LatestBookmarks = Convert.ToInt32(bookmarks);
    AddOrChangeArticle(article);
}
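ProcessArticle leans on a GetDataField helper that I haven't reproduced here (it ships with the download). As a rough approximation of what it has to do - and this is my sketch, not the shipped code - it scans the caret-delimited parts for the field label and returns the value that follows it:

private string GetDataField(string fieldName, string[] parts)
{
    for (int i = 0; i < parts.Length; i++)
    {
        if (parts[i].StartsWith(fieldName))
        {
            // "Page Views: 1,234" style - the value follows the colon...
            int colon = parts[i].IndexOf(':');
            if (colon >= 0)
            {
                return parts[i].Substring(colon + 1).Trim();
            }
            // ...or the label and value landed in adjacent parts.
            if (i + 1 < parts.Length)
            {
                return parts[i + 1].Trim();
            }
        }
    }
    return "";
}

And here, for completeness, is AddOrChangeArticle, which ProcessArticle calls to fold the new data into the list: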
private void AddOrChangeArticle(ArticleUpdate article)
{
    bool found = false;
    DateTime now = DateTime.Now;
    for (int i = 0; i < m_articles.Count; i++)
    {
        ArticleUpdate item = m_articles[i];
        if (item.LatestTitle.ToLower() == article.LatestTitle.ToLower())
        {
            found = true;
            item.ApplyChanges(article, now, false);
            break;
        }
    }
    if (!found)
    {
        article.ApplyChanges(article, now, true);
        m_articles.Add(article);
    }
    // Prune articles that weren't touched by this scrape (they no longer
    // appear on the page); walk backward so RemoveAt doesn't shift
    // unvisited entries.
    for (int i = m_articles.Count - 1; i >= 0; i--)
    {
        ArticleUpdate item = m_articles[i];
        if (item.TimeUpdated != now)
        {
            m_articles.RemoveAt(i);
        }
    }
}
The Sample Application
The sample application is admittedly a rudimentary affair, and is honestly intended to show only one possible way to use the scraping code. I decided to use a WebBrowser control, but about halfway through the app, I began to regret that decision. However, I was afraid I'd become bored with the whole thing, and determined to soldier on.
You'll see that I didn't go to heroic lengths to pretty things up. For instance, I used PNG files for the graphics instead of GIF files. This means the transparency in the PNG files isn't handled correctly on systems running IE6 or earlier.
The application allows you to select the data on which to sort, and in what direction (ascending or descending). The default is the date last updated in descending order so that the newest articles appear first.
The WebBrowser control displays the articles in a table, and uses icons to indicate changed data and certain statistical information regarding articles. The article titles are hyperlinks to the actual article's page, and that page is displayed within the WebBrowser control. To go back to the article display, you have to click the Sort button because I didn't implement any of the forward/back functionality you find in a normal web browser.
The icons used are as follows:
- Indicates a new article. All articles will display as new when you initially start the application.
- Indicates the article with the best rating.
- Indicates the article with the worst rating.
- Indicates the article with the most votes.
- Indicates the article with the most page views.
- Indicates the most popular article.
- Indicates the article with the most bookmarks.
- Indicates that the associated field increased in value.
- Indicates that the associated field decreased in value.
Other controls on the form include the following.
Show New Info Only
This checkbox allows you to filter the list of articles so that only new articles, and articles that have new data, are displayed.
Show Icons
This checkbox allows you to turn the display of icons on and off.
Automatic Refresh
This checkbox allows you to turn the automatic refresh on and off. Once every hour, a BackgroundWorker object is used to refresh the article data.
Button - Refresh From CodeProject
This button allows you to manually refresh the article data (and this button is available even if auto-refresh is turned on).
Lastly, you can specify the user ID of the user for whom you would like to retrieve data. After specifying the new ID, hit the Refresh button.
Closing
This code is only intended to be used to retrieve your own articles - the scraper class accepts the user ID as a parameter, and that ID is currently set to my own. Make sure you change that before you start looking for your own articles.
I've tried to make this as maintainable as possible without forcing the programmer to do conceptual back-flips, but there's no way I can accommodate everyone's reading comprehension levels, so what I guess I'm saying is - you're pretty much on your own. I can't guarantee that I will be able to maintain this article in a timely fashion, but that shouldn't matter. We're all programmers here, and the stuff I've presented isn't rocket science. Besides, you have plenty of examples in the provided classes to modify and/or extend their functionality. Have fun.
Remember also that the PNG files and CSS file need to be in the same folder as the executable, or the application won't find them.
History
02/19/2010 (IMPORTANT!): There is a bug in the code that will cause the program to always tell you that it couldn't retrieve article information. After you download the code, make the following change:
In the file ArticleScraper.cs, find the line that looks like this (in the ProcessArticle() method):
string rating = GetDataField("Rating", parts);
and change it to this:
string rating = GetDataField("Rating", parts).Replace("/5", "");
10/14/2008: Addressed the following:
- Added support for retrieving the web page via a proxy (thanks Pete O'Hanlon!).
- Added code to throw any exception encountered during the web page retrieval process (thanks again Pete O'Hanlon!).
- Added a slightly more thorough HTML parse to handle errant < and > in the title or description of the article (thanks ChandraRam!).
- Embedded the icons as resources in the exe file. They will be copied to the app folder the first time the exe is run.
- Added a new statistic item at the top of the form - "Articles Displayed".
- Enclosed the stuff at the top of the form in group boxes to make it look more organized.
10/13/2008: Addressed the following:
- Added the forgotten mostvotes.png image.
- Modified code to use the mostvotes image.
- Added a textbox to the form to allow you to specify the userID.
- Fixed the form resizing issues.
- The zip file now includes the debug folder with the images and css file.
10/13/2008: Original article posted.