For crawling, you need to fetch HTML files using the class System.Net.HttpWebRequest. Your compile-time type should be System.Net.WebRequest, though, as instances of the derived classes are created by the factory method Create, depending on the URI scheme. See the code sample here: http://msdn.microsoft.com/en-us/library/system.net.webrequest.aspx. You can fetch the resource using the HTTP request method GET.
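The GET fetch described above can be sketched as follows; this is a minimal illustration, and the URL used here is just a reachable placeholder:

```csharp
using System;
using System.IO;
using System.Net;

public class Fetcher
{
    public static string FetchHtml(string url)
    {
        // Compile-time type is the abstract WebRequest; for an
        // "http:" or "https:" URI, Create returns an HttpWebRequest.
        WebRequest request = WebRequest.Create(url);
        request.Method = "GET";

        // Read the whole response body as text.
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    public static void Main()
    {
        string html = FetchHtml("http://example.com/");
        Console.WriteLine(html.Length);
    }
}
```

In a real crawler you would add a timeout and error handling around GetResponse, since remote servers fail routinely.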
The problem is knowing the URL. In many cases, servers generate HTML pages on the fly, in response to some search request, for example. As such requests can be posted in many different ways, this part is application-specific: you will need to learn how to simulate the appropriate POST request for each Web application separately.
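Simulating such a POST could look like the sketch below. The form field name "query" is purely hypothetical; the actual field names and the target URL must be discovered for each Web application, for example by inspecting its search form:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public class PostExample
{
    // URL-encode key/value pairs as application/x-www-form-urlencoded.
    // Arguments alternate: key1, value1, key2, value2, ...
    public static string EncodeForm(params string[] pairs)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < pairs.Length; i += 2)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(pairs[i]))
              .Append('=')
              .Append(Uri.EscapeDataString(pairs[i + 1]));
        }
        return sb.ToString();
    }

    public static string Post(string url, string formData)
    {
        WebRequest request = WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";

        // Write the encoded form data into the request body.
        byte[] body = Encoding.UTF8.GetBytes(formData);
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }
}
```

Usage would be something like Post(searchUrl, EncodeForm("query", "some audio")), with both arguments taken from the specific application you are crawling.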
Now, you need to parse the HTML file and find your audio file references. If you are interested only in audio files and not in any specific document structure, it can be as simple as using a regular expression. Your search criteria can include "http://" at the beginning and the file extensions you're interested in at the end. See the class System.Text.RegularExpressions.Regex, http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx. Your Regex patterns should be easy enough to design. If you face a problem here, ask a follow-up question.
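A minimal sketch of such a Regex-based extraction, assuming a few common audio extensions (the extension list is only an example; extend it as needed):

```csharp
using System;
using System.Text.RegularExpressions;

public class AudioLinkFinder
{
    // Absolute http:// URLs ending in an audio extension.
    // The character class stops at whitespace, quotes and angle
    // brackets, so URLs inside href="..." attributes match cleanly.
    static readonly Regex AudioUrl = new Regex(
        @"http://[^\s""'<>]+\.(?:mp3|wav|ogg|wma)",
        RegexOptions.IgnoreCase);

    public static MatchCollection FindAudioUrls(string html)
    {
        return AudioUrl.Matches(html);
    }

    public static void Main()
    {
        string html = "<a href=\"http://example.com/song.mp3\">song</a>";
        foreach (Match m in FindAudioUrls(html))
            Console.WriteLine(m.Value); // prints http://example.com/song.mp3
    }
}
```

Note this only finds absolute URLs; relative links ("audio/song.mp3") would need to be resolved against the page's base URI, for example with the System.Uri class.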
These techniques will solve your problem only when the HTML statically presents the URLs of the audio records. There are more cryptic scenarios where the user does not click on an anchor link with a static URL of the resource you need. Instead, some JavaScript is called; it forms an HTTP request to the server, which is sent using Ajax to get the file downloaded. In some cases, the user needs to answer a question to confirm the request is not made by a robot. Such scenarios are crackable in principle, but it is not theoretically possible to devise a universal algorithm that could analyze the JavaScript involved and mimic the required user actions in your crawler. In some individual cases you will be able to crack such a scenario using your own brain, but the result cannot be guaranteed in all cases.
—SA