Introduction
If you've ever used the File | Save As... menu in Internet Explorer, you might have noticed a few interesting options IE provides under the Save As Type drop-down box:
- Web Page, complete
- Web Archive, single file
- Web Page, HTML only
- Text File
Most of these are self-explanatory, with the exception of the Web Archive (MHTML) format. What's neat about this format is that it bundles the web page and all of its references into a single compact .MHT file. It's a lot easier to distribute a single self-contained file than it is to distribute an HTML file with a subfolder full of the image/CSS/Flash/XML files it references. In our case, we were generating HTML reports and we needed to check these reports into a document management system which expects a single file. The MHTML (*.mht) format solves this problem beautifully!
This project contains the MhtBuilder class, a 100% .NET managed code solution which can auto-generate an MHT file from a target URL in one line of code. As a bonus, it will also generate all the other formats listed above. And it's completely free, unlike some commercial solutions you might find out there.
Background
I know people assume the worst of Microsoft, but the MHTML format is actually based on RFC 2557, "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)". So it's an actual Internet standard! Web Archive, a.k.a. MHTML, is a remarkably simple plain text format which looks a lot like (and is in fact almost identical to) an email. Here's the header of the MHT file you are viewing at the top of the page:
To generate an MHTML file, we simply merge together all of the files referenced in the HTML. The red line marks the first content block; there will be one content block for each file. We need to follow a few rules, though:
- Use Quoted-Printable encoding for the text formats.
- Use Base64 encoding for the binary formats.
- Make sure the Content-Location has the correct absolute URL for each reference.
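The three rules above can be sketched in a few lines of VB.NET. This is only an illustration of the file layout, not MhtBuilder's actual code: the MhtSketch module, the ToQuotedPrintable and BuildArchive helpers, the boundary string, the header values, and the example.com URLs are all assumptions made up for this example.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module MhtSketch

    ' Hypothetical helper: Quoted-Printable encoding for a text part.
    Public Function ToQuotedPrintable(ByVal text As String) As String
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(text)
        Dim sb As New StringBuilder()
        Dim lineLength As Integer = 0
        Dim i As Integer = 0
        While i < bytes.Length
            Dim b As Byte = bytes(i)
            ' Hard line breaks (CRLF) pass through unchanged.
            If b = 13 AndAlso i + 1 < bytes.Length AndAlso bytes(i + 1) = 10 Then
                sb.Append(vbCrLf)
                lineLength = 0
                i += 2
                Continue While
            End If
            ' Printable ASCII except "=" stays literal; so do spaces and tabs,
            ' unless they sit at the end of a line.
            Dim atLineEnd As Boolean = (i + 1 >= bytes.Length) OrElse bytes(i + 1) = 13
            Dim isLiteral As Boolean = (b >= 33 AndAlso b <= 126 AndAlso b <> 61) OrElse _
                ((b = 32 OrElse b = 9) AndAlso Not atLineEnd)
            Dim token As String = If(isLiteral, Convert.ToChar(b).ToString(), "=" & b.ToString("X2"))
            ' Keep encoded lines within 76 characters using soft breaks ("=" + CRLF).
            If lineLength + token.Length > 75 Then
                sb.Append("=" & vbCrLf)
                lineLength = 0
            End If
            sb.Append(token)
            lineLength += token.Length
            i += 1
        End While
        Return sb.ToString()
    End Function

    ' Hypothetical helper: one text block plus one binary block in a single archive.
    Public Function BuildArchive(ByVal pageUrl As String, ByVal html As String, _
                                 ByVal imageUrl As String, ByVal imageBytes As Byte()) As String
        Const boundary As String = "----=_NextPart_000_0000" ' assumed boundary string
        Dim sb As New StringBuilder()
        ' Email-style headers, then a blank line.
        sb.Append("From: <Saved by MhtSketch>" & vbCrLf)
        sb.Append("Subject: Example archive" & vbCrLf)
        sb.Append("MIME-Version: 1.0" & vbCrLf)
        sb.Append("Content-Type: multipart/related; type=""text/html""; boundary=""" & boundary & """" & vbCrLf)
        sb.Append(vbCrLf)
        ' Text content block: Quoted-Printable, absolute Content-Location.
        sb.Append("--" & boundary & vbCrLf)
        sb.Append("Content-Type: text/html; charset=""utf-8""" & vbCrLf)
        sb.Append("Content-Transfer-Encoding: quoted-printable" & vbCrLf)
        sb.Append("Content-Location: " & pageUrl & vbCrLf)
        sb.Append(vbCrLf)
        sb.Append(ToQuotedPrintable(html) & vbCrLf)
        ' Binary content block: Base64, absolute Content-Location.
        sb.Append("--" & boundary & vbCrLf)
        sb.Append("Content-Type: image/gif" & vbCrLf)
        sb.Append("Content-Transfer-Encoding: base64" & vbCrLf)
        sb.Append("Content-Location: " & imageUrl & vbCrLf)
        sb.Append(vbCrLf)
        sb.Append(Convert.ToBase64String(imageBytes, Base64FormattingOptions.InsertLineBreaks) & vbCrLf)
        ' Closing boundary ends the archive.
        sb.Append("--" & boundary & "--" & vbCrLf)
        Return sb.ToString()
    End Function

    Sub Main()
        Dim html As String = "<html><body><img src=""http://www.example.com/logo.gif""></body></html>"
        ' Placeholder bytes: just the "GIF89a" signature, not a real image.
        Dim fakeImage As Byte() = {71, 73, 70, 56, 57, 97}
        Console.Write(BuildArchive("http://www.example.com/", html, _
                                   "http://www.example.com/logo.gif", fakeImage))
    End Sub

End Module
```

Each referenced file gets its own block between boundary lines, and the Content-Location of the image block matches the absolute URL used in the HTML, which is how the viewer ties the two together.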
Not all websites will tolerate being packaged into an MHTML file. This version of Mht.Builder supports frames and IFrames, but watch out for pages that include lots of complicated JavaScript. You'll want to use the .StripScripts option on sites like that.
Using Mht.Builder
MhtBuilder comes with a complete demo app:
Try it out on your favorite website. The files will be generated by default in the \bin folder of the solution. Just click the View button to launch them. Bear in mind that for the Web Archive and complete tabs, all the content from the target web page must be downloaded to the \bin folder, so it might take a little while! Although I don't provide any feedback events yet, I do emit a lot of progress feedback via Debug.Write, so switch to the debug output tab to see what's happening in real time.
There are four tabs here, just like the four options IE provides in its Save As Type options. In MhtBuilder, these are the four methods being called, in the order they appear on the tabs:
Public Sub SavePageComplete(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageArchive(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePage(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageText(ByVal outputFilePath As String, Optional url As String)
As of Windows XP Service Pack 2, IE applies security restrictions to HTML files opened from disk. In order to avoid this, we need to add the "Mark of the Web" to the file so IE knows what URL it came from, and can thus assign an appropriate security zone to the HTML. That's what the blnAddMark parameter is for; it causes the HTML file to be tagged with this single line at the top:
<!-- saved from url=(0027)http://www.codeproject.com/ -->
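The Mark of the Web is an HTML comment of the form "saved from url=(NNNN)URL", where NNNN is the length of the URL written as a four-digit decimal number. A minimal sketch of generating it follows; the MarkOfTheWebSketch module and BuildMarkOfTheWeb function are hypothetical names for illustration, not MhtBuilder's own members.

```vbnet
Imports System

Module MarkOfTheWebSketch

    ' Hypothetical helper: builds the Mark of the Web comment for a source URL.
    ' The number in parentheses is the URL's character count, zero-padded to 4 digits.
    Public Function BuildMarkOfTheWeb(ByVal url As String) As String
        Return String.Format("<!-- saved from url=({0:D4}){1} -->", url.Length, url)
    End Function

    Sub Main()
        ' Prints: <!-- saved from url=(0027)http://www.codeproject.com/ -->
        Console.WriteLine(BuildMarkOfTheWeb("http://www.codeproject.com/"))
    End Sub

End Module
```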
The other thing we need to do when saving these files is fix up the URLs. Any relative URLs such as:
<img src="/images/standard/logo225x72.gif">
must be converted to absolute URLs like so:
<img src="http://www.codeproject.com/images/standard/logo225x72.gif">
We do this using regular expressions, which gets us a NameValueCollection of all the references we need to fix. We loop through each reference and perform the fixup on the HTML string.
Private Function ExternalHtmlFiles() As Specialized.NameValueCollection
    ' Return the cached collection if we have already parsed this page.
    If Not _ExternalFileCollection Is Nothing Then
        Return _ExternalFileCollection
    End If

    _ExternalFileCollection = New Specialized.NameValueCollection
    Dim r As Regex
    Dim html As String = Me.ToString

    Debug.WriteLine("Resolving all external HTML references from URL:")
    Debug.WriteLine(" " & Me.Url)

    ' src= and background= attributes (single-quoted, double-quoted, or unquoted)
    r = New Regex( _
        "(\ssrc|\sbackground)\s*=\s*((?<Key>'(?<Value>[^']+)')|" & _
        "(?<Key>""(?<Value>[^""]+)"")|(?<Key>(?<Value>[^ \n\r\f]+)))", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)

    ' CSS references: @import, *-image: and background: url(...)
    r = New Regex( _
        "(@import\s|\S+-image:|background:)\s*?(url)*\s*?(?<Key>" & _
        "[""'(]{1,2}(?<Value>[^""')]+)[""')]{1,2})", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)

    ' <link href=...> tags
    r = New Regex( _
        "<link[^>]+?href\s*=\s*(?<Key>" & _
        "('|"")*(?<Value>[^'"">]+)('|"")*)", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)

    ' <frame src=...> and <iframe src=...> tags
    r = New Regex( _
        "<i*frame[^>]+?src\s*=\s*(?<Key>" & _
        "['""]{0,1}(?<Value>[^'""\\>]+)['""]{0,1})", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)

    Return _ExternalFileCollection
End Function
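To make the fixup step itself concrete, here is a small sketch that resolves a relative reference against the page URL with System.Uri and rewrites it in the HTML. It is an assumption about how such a step could look, not the code MhtBuilder uses: the UrlFixupSketch module and both helper functions are hypothetical, the article path in Main is made up, and the regular expression is deliberately simplified to double-quoted src attributes only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UrlFixupSketch

    ' Hypothetical helper: resolve a reference against the URL of the page it came from.
    Public Function ToAbsoluteUrl(ByVal pageUrl As String, ByVal reference As String) As String
        Dim absolute As Uri = Nothing
        If Uri.TryCreate(New Uri(pageUrl), reference, absolute) Then
            Return absolute.AbsoluteUri
        End If
        ' Leave anything we cannot resolve untouched.
        Return reference
    End Function

    ' Hypothetical helper: rewrite every double-quoted src="..." value to an absolute URL.
    Public Function FixSrcAttributes(ByVal pageUrl As String, ByVal html As String) As String
        Dim r As New Regex("(?<Prefix>\ssrc\s*=\s*"")(?<Value>[^""]+)""", RegexOptions.IgnoreCase)
        Return r.Replace(html, _
            Function(m As Match) m.Groups("Prefix").Value & _
                ToAbsoluteUrl(pageUrl, m.Groups("Value").Value) & """")
    End Function

    Sub Main()
        Dim html As String = "<img src=""/images/standard/logo225x72.gif"">"
        ' Prints: <img src="http://www.codeproject.com/images/standard/logo225x72.gif">
        Console.WriteLine(FixSrcAttributes("http://www.codeproject.com/vb/net/article.aspx", html))
    End Sub

End Module
```

Letting System.Uri do the resolution also covers references such as "../logo.gif" or "logo.gif", which are resolved relative to the page's own path rather than the site root.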
We use a similar technique to get a list of all the files we need to download, which are then downloaded via my WebClientEx class. Why use that instead of the built-in Net.WebClient? Good question! Because Net.WebClient doesn't support HTTP compression. My class, on the other hand, does:
Private Function Decompress(ByVal b() As Byte, _
                            ByVal CompressionType As HttpContentEncoding) As Byte()
    ' Wrap the raw bytes in the SharpZipLib stream that matches the content encoding.
    Dim s As Stream
    Select Case CompressionType
        Case HttpContentEncoding.Deflate
            ' Inflater(True) means no zlib header is expected (raw deflate data).
            s = New Zip.Compression.Streams.InflaterInputStream(New MemoryStream(b), _
                New Zip.Compression.Inflater(True))
        Case HttpContentEncoding.Gzip
            s = New GZip.GZipInputStream(New MemoryStream(b))
        Case Else
            ' Not compressed; return the bytes untouched.
            Return b
    End Select

    ' Copy the decompressed data out in fixed-size chunks.
    Dim ms As New MemoryStream
    Const chunkSize As Integer = 2048
    Dim sizeRead As Integer
    Dim unzipBytes(chunkSize - 1) As Byte ' VB array bounds are inclusive
    While True
        sizeRead = s.Read(unzipBytes, 0, chunkSize)
        If sizeRead > 0 Then
            ms.Write(unzipBytes, 0, sizeRead)
        Else
            Exit While
        End If
    End While
    s.Close()

    Return ms.ToArray
End Function
HTTP compression is a no-brainer: it can increase your effective bandwidth by around 75 percent on compressible text content by using standard GZIP compression, courtesy of the SharpZipLib library.
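For readers who would rather not take a dependency on SharpZipLib, the same idea can be sketched with the System.IO.Compression classes that ship with .NET 2.0 and later. This swaps in a different library from the one the article uses, so treat it as an alternative under that assumption; the DecompressSketch module, its function names, and the string-based contentEncoding parameter are hypothetical.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module DecompressSketch

    ' Helper used only to produce test data for the round trip in Main.
    Public Function GzipCompress(ByVal data As Byte()) As Byte()
        Dim ms As New MemoryStream()
        Using gz As New GZipStream(ms, CompressionMode.Compress)
            gz.Write(data, 0, data.Length)
        End Using
        ' ToArray still works after the GZipStream has closed the MemoryStream.
        Return ms.ToArray()
    End Function

    ' Hypothetical helper: decompress a response body according to its Content-Encoding value.
    Public Function Decompress(ByVal data As Byte(), ByVal contentEncoding As String) As Byte()
        Dim source As Stream
        Select Case contentEncoding.ToLowerInvariant()
            Case "gzip"
                source = New GZipStream(New MemoryStream(data), CompressionMode.Decompress)
            Case "deflate"
                ' DeflateStream expects raw deflate data with no zlib header.
                source = New DeflateStream(New MemoryStream(data), CompressionMode.Decompress)
            Case Else
                ' Unknown or absent encoding: return the bytes untouched.
                Return data
        End Select

        Try
            Dim result As New MemoryStream()
            Dim chunk(2047) As Byte
            Dim bytesRead As Integer = source.Read(chunk, 0, chunk.Length)
            While bytesRead > 0
                result.Write(chunk, 0, bytesRead)
                bytesRead = source.Read(chunk, 0, chunk.Length)
            End While
            Return result.ToArray()
        Finally
            source.Dispose()
        End Try
    End Function

    Sub Main()
        Dim original As Byte() = Encoding.UTF8.GetBytes("<html><body>Hello, MHTML</body></html>")
        Dim roundTrip As Byte() = Decompress(GzipCompress(original), "gzip")
        Console.WriteLine(Encoding.UTF8.GetString(roundTrip))
    End Sub

End Module
```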
Conclusion
Creating MHTML files isn't hard, but there are lots of little gotchas when dealing with HTML, regular expressions, and HTTP downloads. I tried to document all the difficult bits in the source code. I've also tested MhtBuilder on dozens of different websites so far with excellent results.
There are many more details and comments in the source code provided at the top of the article, so check it out. Please don't hesitate to provide feedback, good or bad! I hope you enjoyed this article. If you did, you may also like my other articles.
History
- Sunday, September 12, 2004 - Published.
- Monday, March 28, 2005 - Version 2.0
  - Completely rewritten!
  - Autodetection of content encoding (e.g., international web pages), tested against multi-language websites.
  - Now correctly decompresses both types of HTTP compression.
  - Supports completely in-memory operation for server-side use, or on-disk storage for client use.
  - Now works on web pages with frames and IFrames, using recursive retrieval.
  - HTTP authentication and HTTP Proxy support.
  - Allows configuration of browser ID string to retrieve browser-specific content.
  - Basic cookie support (needs enhancement and testing).
  - Much improved regular expressions used for parsing HTML.
  - Extensive use of VB.NET 2005 style XML comments throughout.