Offline Browser using WinInet, URL Moniker and MSHTML APIs
This article describes how to make an offline browser using Visual C++/Win32 APIs.
Introduction
This article demonstrates how to make an offline browser using Visual C++. It uses the following APIs:
- WinInet - downloads the HTML of all the web pages.
- URL Moniker - downloads all the resources, e.g., images, style sheets, etc., to the local folder.
- MSHTML - traverses the HTML DOM (Document Object Model) tree to get the list of all the resources that need to be downloaded.
Below is a brief description of the algorithm:
- Download the HTML of the web page, e.g., www.google.com, and save it to a specified folder on the hard disk.
- Traverse the HTML document and look for the `src` attribute in every tag; the value of the `src` attribute is the URL of a resource. If the URL of the resource is absolute, e.g., www.google.com/images/logo.gif, it can be used as-is; if it is relative, e.g., images/logo.gif, make it absolute using the host name, i.e., its absolute URL becomes <Host Name>/<path>, e.g., www.google.com/images/logo.gif (a sketch of this conversion follows the list).
- Update the `src` attribute to reflect any change in the URL of the resource. Relative URLs remain the same, but absolute URLs are rewritten as relative ones.
- Save the original `src` attribute's value to `srcdump`, just for future reference, so that the original `src` is still available.
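For illustration, here is a minimal sketch of the relative-to-absolute conversion described above. MakeAbsoluteUrl is a hypothetical helper, not part of the article's class, which performs this step internally:

```cpp
#include <string>

//Hypothetical helper: turn a relative URL into an absolute one by
//prepending the scheme and host. A URL is treated as absolute if it
//already contains a scheme separator ("://").
std::string MakeAbsoluteUrl(const std::string& sScheme,   //e.g. "http://"
                            const std::string& sHost,     //e.g. "www.google.com"
                            const std::string& sUrl)      //e.g. "images/logo.gif"
{
    if(sUrl.find("://") != std::string::npos)
        return sUrl;                        //already absolute

    if(!sUrl.empty() && sUrl[0] == '/')
        return sScheme + sHost + sUrl;      //host-relative, e.g. "/images/logo.gif"

    return sScheme + sHost + "/" + sUrl;    //path-relative
}
```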
Background
I'd like to explain the reason/scenario behind the development of this code snippet. I was working on a module that records user interactions with web pages, and I needed to save the web page to the local hard drive without using the web browser's Save As option.
I searched a lot for code that would do this for me but didn't find any helpful material, so I decided to develop it myself. I am uploading it here because it may help others working on related stuff, and to get some feedback on any mistakes I made. I didn't use MFC, just to keep it compatible with plain Win32 applications as well as with MFC.
Not to mention, it is my first ever article.
Using the code
Download HTML of the Web Page:
LoadHtml() works in two modes based on the value of the bDownload argument:

- If bDownload is true, it assumes that the HTML is already loaded using the SetHtml() function; it skips the download code below and just populates the Hostname and Port fields from the URL.
- If bDownload is false, it first downloads the HTML from the specified URL and then populates the Hostname and Port fields.

```cpp
//Download the web page using WinInet
HINTERNET hNet = InternetOpen("Offline Browser",
                              INTERNET_OPEN_TYPE_PROXY, NULL, NULL, 0);
if(hNet == NULL)
    return;

HINTERNET hFile = InternetOpenUrl(hNet, sUrl.c_str(), NULL, 0, 0, 0);
if(hFile == NULL)
{
    InternetCloseHandle(hNet);  //don't leak the session handle
    return;
}

//Read the page in chunks and accumulate it in m_sHtml
while(true)
{
    const int MAX_BUFFER_SIZE = 65536;
    unsigned long nSize = 0;
    char szBuffer[MAX_BUFFER_SIZE + 1];

    BOOL bRet = InternetReadFile(hFile, szBuffer, MAX_BUFFER_SIZE, &nSize);
    if(!bRet || nSize == 0)
        break;

    szBuffer[nSize] = '\0';
    m_sHtml += szBuffer;
}

InternetCloseHandle(hFile);
InternetCloseHandle(hNet);
```
Load HTML into MSHTML Document Interface:
BrowseOffline() assumes that the HTML is already loaded. First, it constructs the HTML DOM tree by loading the HTML into an MSHTML document interface using the following code:

```cpp
//Load the HTML into the HTML document
SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
VARIANT* param;
bstr_t bsData = (LPCTSTR)m_sHtml.c_str();

hr = SafeArrayAccessData(psa, (LPVOID*)&param);
param->vt = VT_BSTR;
param->bstrVal = bsData.copy();   //give the array its own copy to free
SafeArrayUnaccessData(psa);       //release the lock before destroying

//Write the buffer into the document
hr = pDoc->write(psa);

//Closing the document "applies" the written HTML
hr = pDoc->close();

//Don't forget to free the SAFEARRAY!
SafeArrayDestroy(psa);
```
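The snippet above assumes that pDoc already points to an MSHTML document instance. One plausible way to obtain it, assuming the mshtml type library is #imported (the MSHTML:: smart pointers used throughout suggest this), is to co-create the HTMLDocument coclass:

```cpp
//Sketch: obtain an empty MSHTML document to write() into.
//Assumes #import <mshtml.tlb> and that COM has been initialized
//with CoInitialize() on this thread.
MSHTML::IHTMLDocument2Ptr pDoc;
HRESULT hr = pDoc.CreateInstance(__uuidof(MSHTML::HTMLDocument));
if(FAILED(hr) || pDoc == NULL)
    return;
```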
Traverse DOM Tree and download all the resources:
Once the DOM tree is constructed, it's time to traverse it and look for the resources that need downloading.
Currently, I only look for the `src` attribute on all the elements; once a `src` attribute is found, the resource is downloaded and saved to the local folder.
```cpp
//Iterate through all the elements in the document
MSHTML::IHTMLElementCollectionPtr pCollection = pDoc->all;
for(long a = 0; a < pCollection->length; a++)
{
    std::string sValue;
    IHTMLElementPtr pElem = pCollection->item(a);

    //If a src attribute is found, we have a resource to download
    if(GetAttribute(pElem, L"src", sValue))
    {
        //If the resource URL is relative
        if(!IsAbsolute(sValue))
        {
            ..........
        }
        //If the resource URL is absolute
        else
        {
            ..........
        }
    }
}
```
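The loop relies on GetAttribute() and SetAttribute() helpers whose implementation the article doesn't list. Here is a sketch of what they might look like with the #import-generated MSHTML wrappers; the names and flag choices are assumptions:

```cpp
#include <comdef.h>  //_bstr_t, _variant_t
#include <string>

//Possible GetAttribute(): returns true if the element carries a
//non-empty string value for the given attribute.
bool GetAttribute(MSHTML::IHTMLElementPtr pElem, LPCWSTR szName, std::string& sValue)
{
    //lFlags = 2: return the value exactly as it appears in the source
    _variant_t vtValue = pElem->getAttribute(_bstr_t(szName), 2);
    if(vtValue.vt != VT_BSTR || vtValue.bstrVal == NULL)
        return false;

    sValue = (const char*)_bstr_t(vtValue);
    return !sValue.empty();
}

//Possible SetAttribute(): lFlags = 1 means a case-sensitive name.
void SetAttribute(MSHTML::IHTMLElementPtr pElem, LPCWSTR szName, const std::string& sValue)
{
    pElem->setAttribute(_bstr_t(szName), _variant_t(sValue.c_str()), 1);
}
```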
Download Resource with Absolute Path
If the `src` attribute holds an absolute URL of the resource, the following actions are taken:
- Download the resource and save it to the appropriate folder inside the local folder.
- Update the `src` attribute to the relative local path.
- Save the value of the original `src` attribute as `srcdump` for future reference.

```cpp
//If the resource URL is absolute
else
{
    std::string sTemp;

    //Make the URL relative by trimming the host name
    sTemp = TrimHostName(sValue);

    //Create the directories needed to hold this resource
    CreateDirectories(sTemp, m_sDir);

    //Download the resource
    if(DownloadResource(sTemp, sTemp))
    {
        //Update src to the new relative path and keep the original
        //src attribute as srcdump, just for future reference
        if(sTemp[0] == '/')
            sTemp = sTemp.substr(1, sTemp.length() - 1);

        SetAttribute(pElem, L"src", sTemp);
        SetAttribute(pElem, L"srcdump", sValue);
    }
}
```
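TrimHostName() is not listed in the article; a plausible sketch that strips the scheme and host from an absolute URL, leaving the path that serves as the local relative path:

```cpp
#include <string>

//Sketch: "http://www.google.com/images/logo.gif" -> "images/logo.gif".
//The article's real implementation may differ.
std::string TrimHostName(const std::string& sUrl)
{
    std::string::size_type nScheme = sUrl.find("://");
    if(nScheme == std::string::npos)
        return sUrl;                         //already relative

    //The first '/' after the host marks the start of the path
    std::string::size_type nPath = sUrl.find('/', nScheme + 3);
    if(nPath == std::string::npos)
        return "";                           //URL has no path component

    return sUrl.substr(nPath + 1);
}
```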
Download Resource with Relative Path
If the `src` attribute holds a relative URL of the resource, the following actions are taken:
- Construct the absolute URL from the relative URL using the Hostname and Port fields.
- Download the resource and save it to the appropriate folder inside the local folder.
- Update the `src` attribute to the relative local path if required.
- Save the value of the original `src` attribute as `srcdump` for future reference.

```cpp
//If the resource URL is relative
if(!IsAbsolute(sValue))
{
    if(sValue[0] == '/')
        sValue = sValue.substr(1, sValue.length() - 1);

    //Create the directories needed to hold this resource
    CreateDirectories(sValue, m_sDir);

    //Unable to download the resource: rewrite src as the absolute
    //remote URL so the saved page can still fetch it online
    if(!DownloadResource(sValue, sValue))
    {
        std::string sTemp = m_sScheme + m_sHost;
        sTemp += sValue;

        if(sTemp[0] == '/')
            sTemp = sTemp.substr(1, sTemp.length() - 1);

        SetAttribute(pElem, L"src", sTemp);
        SetAttribute(pElem, L"srcdump", sValue);
    }
    //Download succeeded: src already matches the local relative path,
    //so just record the original value in srcdump to keep the DOM consistent
    else
    {
        SetAttribute(pElem, L"srcdump", sValue);
    }
}
```
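IsAbsolute(), which selects between the two branches above, is also not listed. A simple sketch, under the assumption that "absolute" means the URL carries a scheme:

```cpp
#include <string>

//Sketch: a URL is considered absolute if it contains a scheme separator.
bool IsAbsolute(const std::string& sUrl)
{
    return sUrl.find("://") != std::string::npos;
}
```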
Save updated HTML
The original HTML has changed because of the modified `src` values and the added `srcdump` attributes. The updated HTML is finally saved under the name [GUID].html, where GUID is a Globally Unique Identifier generated using CoCreateGuid(). This is just to make sure that it doesn't overwrite any web site already saved in the same folder.

```cpp
//Get the updated HTML with the amendments we made and
//save it to the specified directory
MSHTML::IHTMLDocument3Ptr pDoc3 = pDoc;
MSHTML::IHTMLElementPtr pDocElem;
pDoc3->get_documentElement(&pDocElem);

BSTR bstrHtml;
pDocElem->get_outerHTML(&bstrHtml);

USES_CONVERSION;              //required by the OLE2T macro
std::string sNewHtml((const char*)OLE2T(bstrHtml));
SysFreeString(bstrHtml);      //free the BSTR returned by get_outerHTML

SaveHtml(sNewHtml);
```
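The article doesn't show how the [GUID].html name is produced. A sketch using CoCreateGuid() and StringFromGUID2() follows; MakeGuidFileName and the fallback name are assumptions, not the article's code:

```cpp
#include <windows.h>
#include <objbase.h>
#include <string>
#pragma comment(lib, "ole32.lib")

//Sketch: build a "{GUID}.html" file name so saved pages never collide.
std::string MakeGuidFileName()
{
    GUID guid;
    if(FAILED(CoCreateGuid(&guid)))
        return "saved.html";                  //fallback name (assumption)

    wchar_t szGuid[64] = {0};
    StringFromGUID2(guid, szGuid, 64);        //e.g. "{1A2B3C4D-...}"

    char szName[64] = {0};
    WideCharToMultiByte(CP_ACP, 0, szGuid, -1, szName, sizeof(szName), NULL, NULL);

    return std::string(szName) + ".html";
}
```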
Download Resources
Once we have the absolute URL of the resource, it is straightforward to download it and save it to the appropriate local folder.

```cpp
//Download the specified resource (sTemp is the URL, sTemp2 the local file name)
if(URLDownloadToFile(NULL, sTemp.c_str(), sTemp2.c_str(), 0, NULL) == S_OK)
    return true;
else
    return false;
```
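Worth noting: URLDownloadToFile() is the URL moniker API referred to in the introduction. It is declared in urlmon.h and needs urlmon.lib at link time, and it downloads through the WinInet cache, so a recently visited resource may be served from the cache rather than the network.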
Directory Structure of the Web Site
I've tried to maintain the same directory structure in the local folder as on the website. For example, downloading the resource images/logo.gif first creates a folder images inside the directory specified by the user and then downloads logo.gif into that folder; a sketch of such a helper follows.
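The CreateDirectories() helper itself is not listed; here is a sketch of how the nested folders might be created with the Win32 CreateDirectory() API (the signature is an assumption based on how it is called above):

```cpp
#include <windows.h>
#include <string>

//Sketch: walk the relative resource path and create each missing
//folder under sBaseDir; the segment after the last '/' is the file name.
void CreateDirectories(const std::string& sRelPath, const std::string& sBaseDir)
{
    std::string sPath = sBaseDir;             //e.g. "c:\\MyTemp\\"

    std::string::size_type nStart = 0, nSlash;
    while((nSlash = sRelPath.find('/', nStart)) != std::string::npos)
    {
        sPath += sRelPath.substr(nStart, nSlash - nStart);
        sPath += "\\";
        CreateDirectoryA(sPath.c_str(), NULL); //harmless if it already exists
        nStart = nSlash + 1;
    }
}
```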
Sample Usage
```cpp
COfflineBrowser obj;
char szUrl[1024];

printf("Enter URL: ");
//fgets instead of the unsafe gets(); strip the trailing newline
fgets(szUrl, sizeof(szUrl), stdin);
szUrl[strcspn(szUrl, "\r\n")] = '\0';

obj.SetDir("c:\\MyTemp\\");
obj.LoadHtml(szUrl, true);
obj.BrowseOffline();
```