
How to Automate Saving Webpages as Single .MHTML Files Using Selenium WebDriver

25 Sep 2023, CPOL, 3 min read
Saving webpages as single self-contained files using Selenium WebDriver
A simple console application which downloads a set of webpages and saves them as single .MHTML files in a specified folder.

Introduction

EdgeSinglePageDownloader is a simple console application that downloads a set of webpages and saves each one as a single .MHTML file in a specified folder using the Edge Selenium WebDriver (it should also work with ChromeDriver). It is a proof of concept that shows how to implement this feature with Selenium WebDriver and how to download and save a set of webpages in a batch.

Background

The Edge Selenium WebDriver is a NuGet package that lets you automate Microsoft Edge by simulating user interaction. The browsed page is saved by sending CTRL + S to Edge to open the Save As dialog, typing a filename, selecting the file format (Webpage, single file .mhtml) and pressing the Save button. Unfortunately, the SendKeys method provided by Selenium WebDriver does not work for this (or at least I was not able to make it work on Windows), so after several attempts I switched to the VBScript SendKeys method, which works flawlessly, with the caveat that it requires Windows as the operating system.

Using the Code

Using the code is very simple. All you have to do is:

  • adjust the DefaultSaveFolder constant to specify the local folder where the .mhtml files are saved; it defaults to c:\temp
    C#
    const string DefaultSaveFolder = "c:\\temp";
  • adjust the initialization of the urlsToSave variable with the URLs you would like to save; by default, the URLs of The Verge and Wired are provided:
    C#
    var urlsToSave = new List<string> 
                     { "https://www.theverge.com", "https://www.wired.com/" };

After that, just run the code: Edge is started, all the URLs in urlsToSave are visited sequentially, and each page is saved in DefaultSaveFolder under the filenames Page_1.mhtml, Page_2.mhtml, ..., Page_n.mhtml.
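
The naming scheme can be illustrated with a minimal sketch. The OutputNaming class and PlanOutputFiles method are hypothetical names introduced here for illustration only; the article's code builds each path inline inside Main.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class OutputNaming
{
    // Hypothetical helper: pairs each URL with its output file.
    // The first URL maps to Page_1.mhtml, the second to Page_2.mhtml, and so on.
    public static List<(string Url, string FilePath)> PlanOutputFiles(
        string saveFolder, IReadOnlyList<string> urls)
    {
        var plan = new List<(string Url, string FilePath)>();
        for (var i = 0; i < urls.Count; i++)
        {
            // Page numbers are 1-based, matching the article's i = 1 counter
            plan.Add((urls[i], Path.Combine(saveFolder, $"Page_{i + 1}.mhtml")));
        }
        return plan;
    }
}
```

Computing the pairs up front makes it easy to log or review which page ends up in which file before the browser is started.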

The code is very simple and is all contained in the Main function and the SaveAsSingleFile helper function.

The Main function performs these steps:

  • Loops over all the URLs in the urlsToSave variable and, for each of them:
    • Instantiates a new EdgeDriver class (provided by Selenium WebDriver), which starts a new Edge browser
    • Makes the browser navigate to the URL by calling EdgeDriver.Navigate().GoToUrl
    • Saves the webpage as a single .mhtml file by calling the SaveAsSingleFile helper function
    • Closes the browser window by calling Driver.Close()
C#
static void Main(string[] args)
{
    var options = new EdgeOptions();

    var service = EdgeDriverService.CreateDefaultService();
    service.EnableVerboseLogging = true;

    WshShell = new WshShellClass();

    var urlsToSave = new List<string> 
        { "https://www.theverge.com", "https://www.wired.com/" };

    var i = 1;
    foreach (var url in urlsToSave)
    {
        Driver = new EdgeDriver(service, options);

        Driver.Navigate().GoToUrl(url);

        SaveAsSingleFile(Path.Combine(DefaultSaveFolder, 
                         $"Page_{i++}.mhtml"),url);

        Driver.Close();
    }
}

The SaveAsSingleFile helper function performs these steps:

  • Checks whether the output directory exists and creates it if not
  • Checks whether the output file already exists and is in the .mhtml format: if so, it returns without saving the page again; otherwise, it deletes the existing file
  • Sends CTRL (^ character) + S to the Edge browser by using WshShell.SendKeys, which opens the "Save as" dialog (image below). Notice that the labels of the "Filename" and "Save as type" fields each have an underlined character, 'n' and 't' respectively; you can move the focus to either control by pressing ALT + that character. These shortcuts are localization-dependent (I am using an English-localized Windows), so they may differ on your Windows installation; adapt them if necessary.
  • Sends ALT (% character) + 'n', followed by the filename passed to the function
  • Sends ALT (% character) + 't', DOWN ARROW to open the list of "Save as type" formats, UP ARROW to select "Webpage, Single File (*.mhtml)", and ENTER (~ character) twice to confirm the file format and press the "Save" button
  • After this, it waits up to 1 minute (specified by the MaxWaitForSaveMSec constant), checking once per second whether the file has been created. When the file appears, it checks that it is in the MHTML format (that is, that it contains the string "Snapshot-Content-Location: {url}"); if not, it goes back to the beginning of the function and repeats the whole procedure.

    Image 1: the Edge "Save as" dialog
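
The key strings described in the steps above can be sketched as a small helper that takes the two localized shortcut characters as parameters. SaveDialogKeys, EscapeLiteral and Build are hypothetical names, not part of the article's code. The escaping step is an addition based on the SendKeys notation, in which the characters + ^ % ~ ( ) { } [ ] have a special meaning and must be wrapped in braces to be typed literally; the default c:\temp\Page_n.mhtml names contain none of them.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SaveDialogKeys
{
    // Characters with a special meaning in the SendKeys notation
    private const string SpecialChars = "+^%~(){}[]";

    // Wraps each special character in braces so that it is typed literally
    public static string EscapeLiteral(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (var c in text)
        {
            if (SpecialChars.IndexOf(c) >= 0)
                sb.Append('{').Append(c).Append('}');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Returns the key strings in the order described in the steps above.
    // filenameKey and typeKey are the underlined (localized) shortcut characters.
    public static IReadOnlyList<string> Build(
        string filename, char filenameKey = 'n', char typeKey = 't')
    {
        return new[]
        {
            "^s",                                       // CTRL + S: open "Save as"
            $"%{filenameKey}{EscapeLiteral(filename)}", // ALT + key, then the filename
            $"%{typeKey}",                              // ALT + key: focus "Save as type"
            "{DOWN}",                                   // open the list of formats
            "{UP}",                                     // select the single-file format
            "~~",                                       // ENTER twice: confirm and save
        };
    }
}
```

Each returned string would be passed to WshShell.SendKeys in order, with a short pause between calls. On a differently localized Windows, only the two shortcut characters passed to Build should need to change.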

C#
static void SaveAsSingleFile(string filename, string url)
{
again:
    if (!Directory.Exists(Path.GetDirectoryName(filename)))
        Directory.CreateDirectory(Path.GetDirectoryName(filename));
        
    if (System.IO.File.Exists(filename))
    {
        // simple check that the existing file format is mhtml, 
        // otherwise delete and re-save it
        if (!System.IO.File.ReadAllText(filename).Contains
                            ($"Snapshot-Content-Location: {url}"))
            System.IO.File.Delete(filename);
        else
            return;
    }
    
    WshShell.SendKeys("^s");
    Thread.Sleep(1000);
    // send alt+n, enter filename
    WshShell.SendKeys($"%n{filename}");
    Thread.Sleep(20);
    // send alt+t, down arrow, up arrow (to select single mhtml), press enter twice
    WshShell.SendKeys($"%t");
    Thread.Sleep(20);
    WshShell.SendKeys($"{{DOWN}}");
    Thread.Sleep(20);
    WshShell.SendKeys($"{{UP}}");
    Thread.Sleep(20);
    WshShell.SendKeys($"~~");
    
    // waits up to MaxWaitForSaveMSec to check that the file is saved correctly
    var endtime = DateTime.Now.AddMilliseconds(MaxWaitForSaveMSec);
    
    while (DateTime.Now < endtime)
    {
        Thread.Sleep(1000);
        // simple check that the file is present and its format is mhtml, 
        // otherwise retry again to save
        if (System.IO.File.Exists(filename))
        {
            if (!System.IO.File.ReadAllText(filename).Contains
                                ($"Snapshot-Content-Location: {url}"))
                goto again;
            else
                break;
        }
    }
}
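
The wait-and-verify step can also be written as a standalone helper. The following is a minimal sketch and not the article's implementation: SavedFileCheck, LooksLikeSnapshotOf and WaitForSnapshot are hypothetical names, and unlike the goto-based code above, this variant only reports success or failure and leaves the decision to retry to the caller. It also assumes that a file still being written by the browser may raise an IOException when read, and treats that case as "not ready yet".

```csharp
using System;
using System.IO;
using System.Threading;

static class SavedFileCheck
{
    // True when the file exists and contains the MHTML marker for the given URL
    public static bool LooksLikeSnapshotOf(string path, string url)
    {
        try
        {
            return File.Exists(path) &&
                   File.ReadAllText(path).Contains($"Snapshot-Content-Location: {url}");
        }
        catch (IOException)
        {
            // Assumption: the browser may still hold the file open while writing it
            return false;
        }
    }

    // Polls until the file passes the check or the timeout elapses
    public static bool WaitForSnapshot(
        string path, string url, TimeSpan timeout, TimeSpan pollInterval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            if (LooksLikeSnapshotOf(path, url))
                return true;
            if (DateTime.UtcNow >= deadline)
                return false;
            Thread.Sleep(pollInterval);
        }
    }
}
```

A caller could use it as WaitForSnapshot(filename, url, TimeSpan.FromMilliseconds(MaxWaitForSaveMSec), TimeSpan.FromSeconds(1)) and repeat the save when it returns false.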

Points of Interest

To use the VBScript SendKeys method, you have to create an instance of the WScript.Shell COM object. The easiest way to do this is to reference its ActiveX control file directly: right-click the project file --> Add --> COM Reference --> Browse --> select C:\Windows\SysWOW64\wshom.ocx.

You should now see Interop.IWshRuntimeLibrary under the Dependencies/COM node in Visual Studio. Click on it and change "Embed Interop Types" from Yes to No (if you use .NET Core).

After doing this, you can instantiate the WScript.Shell COM object with:

C#
var wshShell = new WshShellClass();
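
As an alternative to the COM reference, the object can probably also be created through late binding. This is a sketch of a different technique from the one used in the article, under the assumption that the project targets .NET 5 or later on Windows (where OperatingSystem.IsWindows and dynamic COM dispatch are available). WshShellFactory and TryCreate are hypothetical names.

```csharp
using System;

static class WshShellFactory
{
    // Returns a late-bound WScript.Shell instance, or null when it is not available
    public static dynamic TryCreate()
    {
        // COM activation by ProgID is only supported on Windows
        if (!OperatingSystem.IsWindows())
            return null;

        var type = Type.GetTypeFromProgID("WScript.Shell");
        return type == null ? null : Activator.CreateInstance(type);
    }
}
```

With this variant, a call such as shell.SendKeys("^s") on the returned object is resolved at run time, so there is no compile-time checking of the method name, which the referenced interop assembly does provide.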

History

  • V1.0 (22nd September, 2023): Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Italy
I'm a senior .NET software engineer in the second half of my forties. I started using a computer at the age of six and went through Logo, BASIC, assembly, C/C++ and Java before finally moving to .NET and .NET Core. I am also proficient in databases, especially SQL Server, and in reporting. I have some experience in security too, though mainly from the past: things have become much more difficult and I do not have much time to keep up to date, but I still dabble now and then. Fan of videogames, technology, motorbikes, travelling and comedy.

Email: Federico Di Marco <fededim@gmail.com>
Linkedin: LinkedIn
Github: GitHub
Stackoverflow: StackOverflow
