Offline Browser using WinInet, URL Moniker and MSHTML APIs

22 Mar 2005
This article describes how to make an offline browser using Visual C++/Win32 APIs.

Output of the sample program

Web site saved on the hard disk

Resources saved in appropriate folders

Introduction

This article demonstrates how to build an offline browser using Visual C++. It uses the following APIs:

  1. WinInet - Downloads the HTML of each web page.
  2. URL Moniker - Downloads all the resources (images, style sheets, etc.) to the local folder.
  3. MSHTML - Traverses the HTML DOM (Document Object Model) tree to get the list of resources that need to be downloaded.

Below is a brief description of the algorithm:

  1. Download the HTML of the web page, e.g., www.google.com, and save it to the hard disk in a specified folder.
  2. Traverse the HTML document and look for a src attribute in every tag; the value of src is the URL of a resource. If the URL is absolute, e.g., www.google.com/images/logo.gif, it can be used as is. If it is relative, e.g., images/logo.gif, make it absolute using the host name, i.e., <Host Name>/<path>, which gives www.google.com/images/logo.gif.
  3. Update the src attribute to reflect any change in the URL of the resource. Relative URLs remain the same, but an absolute URL is replaced with a relative one pointing to the local copy.
  4. Save the original value of src in a srcdump attribute so that it remains available for future reference.
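
Step 2 of the algorithm can be sketched in portable C++ as follows. This is only an illustration, not the code from the download: HasScheme() and MakeAbsoluteUrl() are hypothetical helper names, and the sketch assumes site-relative paths only (no "../" segments, no protocol-relative "//host" URLs).

```cpp
#include <string>

// Hypothetical helper: true when the URL starts with its own scheme,
// e.g. "http://...". A "://" that appears later (say, inside a query
// string of a relative URL) does not count.
bool HasScheme(const std::string& sUrl)
{
    std::string::size_type nPos = sUrl.find("://");
    return nPos != std::string::npos && nPos > 0 &&
           sUrl.find_first_of("/?#") == nPos + 1;
}

// Hypothetical helper: builds <scheme><host>/<path> for a relative src value.
// sScheme is expected to look like "http://", sHost like "www.google.com".
std::string MakeAbsoluteUrl(const std::string& sScheme,
                            const std::string& sHost,
                            const std::string& sSrc)
{
    if(HasScheme(sSrc))
        return sSrc;                        // already absolute, leave untouched

    std::string sPath = sSrc;
    if(!sPath.empty() && sPath[0] == '/')   // drop a leading slash so that
        sPath = sPath.substr(1);            // exactly one separator is added

    return sScheme + sHost + "/" + sPath;
}
```

For example, MakeAbsoluteUrl("http://", "www.google.com", "images/logo.gif") yields http://www.google.com/images/logo.gif, while a src value that already starts with a scheme is returned unchanged.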

Background

I'd like to explain the scenario behind the development of this code. I was working on a module that records user interactions with web pages, and I needed to save a web page to the local hard drive without using the web browser's Save As option.

I searched extensively for code that does this but didn't find anything helpful, so I decided to develop it myself. I am uploading it here because it may help others working on related problems, and to get feedback on any mistakes I made. I didn't use MFC so that the code is compatible with plain Win32 applications as well as MFC ones.

Incidentally, this is my first ever article.

Using the code

Download HTML of the Web Page:

LoadHtml() works in two modes based on the value of the bDownload argument:

  1. If bDownload is true, it first downloads the HTML from the specified URL using the following code snippet and then populates the Hostname and Port fields.
  2. If bDownload is false, it assumes that the HTML has already been loaded using the SetHtml() function; it skips the download and just populates the Hostname and Port fields from the URL.
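    //Download Web Page using WININET
    //PRECONFIG uses the system proxy settings; PROXY needs an explicit proxy name
    HINTERNET hNet = InternetOpen("Offline Browser", 
                     INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if(hNet == NULL)
        return;
    
    HINTERNET hFile = InternetOpenUrl(hNet, sUrl.c_str(), NULL, 0, 0, 0); 
    if(hFile == NULL)
    {
        InternetCloseHandle(hNet);
        return;
    }
    
    while(true)
    {
        const int MAX_BUFFER_SIZE = 65536;
        unsigned long nSize = 0;
        char szBuffer[MAX_BUFFER_SIZE];
        BOOL bRet = InternetReadFile(hFile, szBuffer, MAX_BUFFER_SIZE, &nSize);
        if(!bRet || nSize == 0)
            break;
        //Append exactly nSize bytes, so embedded NULs do not truncate the data
        m_sHtml.append(szBuffer, nSize);
    }
    
    //Release the WinInet handles
    InternetCloseHandle(hFile);
    InternetCloseHandle(hNet);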
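
How the Hostname and Port fields might be derived from the URL is sketched below in portable C++. The article does not show this step, so UrlParts and ParseHostAndPort() are hypothetical names, and the defaults (http, port 80, port 443 for https) are assumptions of the sketch. It ignores user info and IPv6 literals. On Windows, WinInet's InternetCrackUrl() is an alternative to hand-written parsing.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical result type; the field names are illustrative only.
struct UrlParts
{
    std::string sScheme;   // e.g. "http://"
    std::string sHost;     // e.g. "www.google.com"
    int         nPort;     // explicit port, or an assumed default
};

UrlParts ParseHostAndPort(const std::string& sUrl)
{
    UrlParts parts;
    parts.sScheme = "http://";             // assumed when the URL has no scheme
    std::string sRest = sUrl;

    std::string::size_type nPos = sRest.find("://");
    if(nPos != std::string::npos)
    {
        parts.sScheme = sRest.substr(0, nPos + 3);
        sRest = sRest.substr(nPos + 3);
    }
    parts.nPort = (parts.sScheme == "https://") ? 443 : 80;

    // host[:port] ends at the first '/', '?' or '#'
    nPos = sRest.find_first_of("/?#");
    std::string sAuthority = sRest.substr(0, nPos);

    // An explicit port follows a ':' in the authority part
    nPos = sAuthority.find(':');
    if(nPos != std::string::npos)
    {
        parts.nPort = std::atoi(sAuthority.substr(nPos + 1).c_str());
        sAuthority = sAuthority.substr(0, nPos);
    }
    parts.sHost = sAuthority;
    return parts;
}
```

With this sketch, http://www.google.com/index.html gives the host www.google.com and port 80, and https://example.com:8443/a gives the host example.com and port 8443.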

Load HTML into MSHTML Document Interface:

BrowseOffline() assumes that the HTML is already loaded. First, it constructs the HTML DOM tree by writing the HTML into an MSHTML document interface (pDoc) using the following code:

//Load HTML to Html Document
SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
VARIANT *param;
bstr_t bsData = (LPCTSTR)m_sHtml.c_str();
hr = SafeArrayAccessData(psa, (LPVOID*)&param);
param->vt = VT_BSTR;
//Give the array its own copy of the string; SafeArrayDestroy() frees it,
//so sharing bsData's buffer would lead to a double free
param->bstrVal = bsData.copy();
hr = SafeArrayUnaccessData(psa);

//write your buffer
hr = pDoc->write(psa);
//closes the document, "applying" your code
hr = pDoc->close();

//Don't forget to free the SAFEARRAY!
SafeArrayDestroy(psa);

Traverse DOM Tree and download all the resources:

Once the DOM tree is constructed, it's time to traverse it and look for the resources that need downloading.

Currently, I only look for the src attribute on each element; once a src attribute is found, the resource it points to is downloaded and saved to the local folder.

//Iterate through all the elements in the document
MSHTML::IHTMLElementCollectionPtr pCollection = pDoc->all;
for(long a=0;a<pCollection->length;a++)
{
    std::string sValue;
    IHTMLElementPtr pElem = pCollection->item( a );
    //If src attribute is found that means we've a resource to download
    if(GetAttribute(pElem, L"src", sValue))
    {
        //If resource URL is relative
        if(!IsAbsolute(sValue))
        {
            ..........
        }
        //If resource URL is absolute
        else
        {
            ..........
        }
    }
}

Download Resource with Relative Path

If the src attribute has a relative URL of the resource, the following actions are taken:

  1. Construct the absolute URL from the relative URL using the Hostname and Port fields.
  2. Download the resource and save it to the appropriate subfolder of the local folder.
  3. Update the src attribute if required.
  4. Save the value of the original src attribute as srcdump for future reference.
    //If resource URL is relative
    if(!IsAbsolute(sValue))
    {
        if(sValue[0] == '/')
            sValue = sValue.substr(1, sValue.length()-1);
        //Create directories needed to hold this resource
        CreateDirectories(sValue, m_sDir);
        //Download the resource
        if(!DownloadResource(sValue, sValue))
        {
            //Download failed: point src at the absolute URL on the original
            //host and keep the original src attribute as srcdump
            std::string sTemp = m_sScheme + m_sHost;
            sTemp += sValue;
            if(sTemp[0] == '/')
                sTemp = sTemp.substr(1, sTemp.length()-1);
            SetAttribute(pElem, L"src", sTemp);
            SetAttribute(pElem, L"srcdump", sValue);
        }
        //Resource downloaded: src stays relative
        else
        {
            //Set srcdump to the same value as src. It serves no purpose other
            //than keeping the HTML DOM consistent
            SetAttribute(pElem, L"srcdump", sValue);
        }
    }

Download Resource with Absolute Path

If the src attribute has an absolute URL of the resource, the following actions are taken:

  1. Download the resource and save it to the appropriate subfolder of the local folder.
  2. Update the src attribute to the relative local path.
  3. Save the value of the original src attribute as srcdump for future reference.
    //If resource URL is absolute
    else
    {
        std::string sTemp;
        //Make URL relative
        sTemp = TrimHostName(sValue);
        //Create directories needed to hold this resource
        CreateDirectories(sTemp, m_sDir);
        //Download the resource
        if(DownloadResource(sTemp, sTemp))
        {
            //Update src to the new src and put the original src attribute as
            //srcdump just for future references
            if(sTemp[0] == '/')
                sTemp = sTemp.substr(1, sTemp.length()-1);
            SetAttribute(pElem, L"src", sTemp);
            SetAttribute(pElem, L"srcdump", sValue);
        }
    }
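
The article does not list TrimHostName(), so here is a minimal portable sketch of what such a helper could look like. TrimHostNameSketch() is a hypothetical name and its behaviour is an assumption: it drops the scheme and host, returns the path without a leading slash, and also strips any query string or fragment so the result can serve as a local file path. The real helper may differ, for example by keeping the leading slash.

```cpp
#include <string>

// Hypothetical stand-in for the article's TrimHostName().
std::string TrimHostNameSketch(const std::string& sUrl)
{
    // Skip the scheme ("http://") when there is one
    std::string::size_type nStart = 0;
    std::string::size_type nPos = sUrl.find("://");
    if(nPos != std::string::npos)
        nStart = nPos + 3;

    // The path begins at the first '/' after the host name
    nPos = sUrl.find('/', nStart);
    if(nPos == std::string::npos)
        return "";                          // URL has a host but no path

    std::string sPath = sUrl.substr(nPos + 1);

    // Drop "?query" and "#fragment" so only the file path remains
    nPos = sPath.find_first_of("?#");
    if(nPos != std::string::npos)
        sPath.erase(nPos);

    return sPath;
}
```

Under these assumptions, both http://www.google.com/images/logo.gif and www.google.com/images/logo.gif are reduced to images/logo.gif.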

Save updated HTML

The original HTML has changed because the src values were updated and the srcdump attribute was added. The updated HTML is saved as [GUID].html, where GUID is a Globally Unique Identifier generated using CoCreateGuid(). This ensures that it doesn't overwrite any existing web site saved in the same folder.

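//Get the updated HTML after our amendments and save it to the specified directory
MSHTML::IHTMLDocument3Ptr pDoc3 = pDoc;
MSHTML::IHTMLElementPtr pDocElem;
pDoc3->get_documentElement(&pDocElem);
BSTR bstrHtml = NULL;
pDocElem->get_outerHTML(&bstrHtml);
//_bstr_t takes ownership (fCopy = false) and frees the BSTR when it goes out
//of scope; it also converts the wide string to const char*
_bstr_t bsHtml(bstrHtml, false);
std::string sNewHtml(bstrHtml ? (const char*)bsHtml : "");
SaveHtml(sNewHtml);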

Download Resources

Once we have the absolute URL of the resource, it is straightforward to download it and save it to the appropriate local folder.

//Download specified resource; true only if the download succeeded
return URLDownloadToFile(NULL, sTemp.c_str(), sTemp2.c_str(), 0, NULL) == S_OK;

Directory Structure of the Web Site

I've tried to maintain the same directory structure in the local folder as on the website. For example, downloading the resource images/logo.gif first creates a folder named images inside the directory specified by the user and then downloads logo.gif into that folder.
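
The folder-creation step can be sketched as a function that lists, in creation order, every directory a resource path needs. This is a portable illustration rather than the article's CreateDirectories(): DirectoriesToCreate() is a hypothetical name, and the sketch assumes forward slashes in the resource path, backslashes in the local root, and no empty path segments.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns the local directories that must exist before
// sRelPath can be saved under sRoot, parent folders first.
std::vector<std::string> DirectoriesToCreate(const std::string& sRoot,
                                             const std::string& sRelPath)
{
    std::vector<std::string> dirs;

    // Make sure the root ends with exactly one separator
    std::string sBase = sRoot;
    if(!sBase.empty() && sBase[sBase.size() - 1] != '\\')
        sBase += '\\';

    // Ignore a leading slash, as the article's code does before saving
    std::string sPath = sRelPath;
    if(!sPath.empty() && sPath[0] == '/')
        sPath = sPath.substr(1);

    // Every '/' closes one directory level: "a/b/c.gif" -> "a", "a/b"
    std::string::size_type nPos = 0;
    while((nPos = sPath.find('/', nPos)) != std::string::npos)
    {
        if(nPos > 0)
        {
            std::string sDir = sPath.substr(0, nPos);
            std::replace(sDir.begin(), sDir.end(), '/', '\\');
            dirs.push_back(sBase + sDir);
        }
        ++nPos;
    }
    return dirs;
}
```

For the root c:\MyTemp\ and the resource images/icons/logo.gif, the result is c:\MyTemp\images followed by c:\MyTemp\images\icons. On Windows, each entry could then be passed to CreateDirectory() in that order.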

Sample Usage

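COfflineBrowser obj;
char szUrl[1024];
printf("Enter URL: ");
//fgets() bounds the read to the buffer size, unlike gets()
if(fgets(szUrl, sizeof(szUrl), stdin) == NULL)
    szUrl[0] = '\0';
//Strip the trailing newline that fgets() keeps (strcspn needs <string.h>)
szUrl[strcspn(szUrl, "\r\n")] = '\0';
obj.SetDir("c:\\MyTemp\\");
obj.LoadHtml(szUrl, true);
obj.BrowseOffline();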

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
Web Developer
Pakistan