SiteMapper Tool

gggustafson

5.00/5 (5 votes)

Jun 16, 2014

CPOL

19 min read

22544

530

This article presents a tool named SiteMapper that creates a Google site map and a user traversable tree

Introduction [toc]

This article presents a tool, named SiteMapper that produces Google site maps. The first part of this article is a User Manual for the tool; the second describes its implementation.

Introduction
Table of Contents
User Manual
Implementation
- Data Structures
- Algorithms
Conclusion
References
Development Environment
History

The symbol [toc] returns the reader to the top of the Table of Contents.

User Manual [toc]

Although usually directed to start at a web site's root (usually index.htm or default.html), SiteMapper operates starting at any page in a web site. This starting page is referred to as the topmost file.

If the topmost file is a descendent of the web site's root, SiteMapper will traverse the site from that point, downward. However, SiteMapper may also traverse web pages above the topmost file. This could occur if some page makes a reference to another page higher in the web site's "tree".

For example, most well designed web sites allow a reader to click on some image (usually in the upper left of the page) to return to the web site's home page. This requires that SiteMapper continue traversal downward from that page. But this action could cause a recursion. So there is a built-in mechanism to avoid such recursion. When a page is revisited, the page is marked as a Duplicate Link and further downward traversal from that page is inhibited.

Type of the Topmost File [toc]

When SiteMapper first executes, the Type of Topmost File GroupBox opens and the user must choose the type of topmost file that the tool will access: web page or local file system. The distinction is required to insure that the correct Scheme [^] is supplied in the Uri [^]. The topmost file may be on the local file system or on a site on the web.

Name of the Topmost File [toc]

When a RadioButton in the Type of Topmost File GroupBox is checked, either the File Name or Web Page GroupBox opens. In addition, the Site Map Options, Tree Reporting Options, and Traversal Limits GroupBoxes and the Prepare Site Map Button appear.

If the file system option was chosen, the user must supply the name of the topmost file, located on the local computer. The user may either type the filename directly in the TextBox or click on the Browse button. If the Browse button is clicked, an open file DialogBox will be presented.

If the web page option was chosen, the user must enter the URL [^] of the topmost page, located on the web.

Warning

Although the tool will work against any web site, choosing a web site such as "www.google.com" or "codeproject.com" will probably cause an exception because memory was exceeded. This tool was not developed to support very large web sites, although it could be modified to do so. The traversal limit control will limit the number of pages retrieved and thus avoid (hopefully) an exception.

Whichever topmost filename option is chosen, if, after the filename has been entered in the TextBox and with the cursor still in the TextBox, the Enter key may be pressed to start traversal. Note that the options that will be used during traversal will be those selected at the time that the Enter key was pressed. This shortcut is an alternative to clicking the Prepare Site Map Button.

Site Map Options [toc]

Site map options control what is contained in the site map. Using the default site map options will produce the simplest form of a site map accepted by Google. Using the test web site contained within the SiteMapper project, the default site map options will produce the following:

<?xml version="1.0" encoding="utf-16"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/a.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/c.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/g.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/d.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/h.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/j.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/i.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/e.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/f.html</loc>
  </url>
  <url>
    <loc>file:///C:/Gustafson/Utilities_2014/SiteMapper/TestWebSite/b.html</loc>
  </url>
</urlset>

The description of the site map tags, provided by Google [^], is:

Tag	Required?	Description
<urlset>	Required	Encloses all information about the set of URLs included in the Sitemap.
<url>	Required	Encloses all information about a specific URL.
<loc>	Required	Specifies the URL. For images and video, specifies the landing page (aka play page, referrer page). Must be a unique URL.
<lastmod>	Optional	The date the URL was last modified, in YYYY-MM-DDThh:mmTZD format (time value is optional).
<changefreq>	Optional	Provides a hint about how frequently the page is likely to change. Valid values are: always - use for pages that change every time they are accessed. hourly daily weekly monthly yearly never - use this value for archived URLs.
<priority>	Optional	Describes the priority of a URL relative to all the other URLs on the site. This priority can range from 1.0 (extremely important) to 0.1 (not important at all). Does not affect your site's ranking in Google search results. Because this value is relative to other pages on your site, assigning a high priority (or specifying the same priority for all URLs) will not help your site's search ranking. In addition, setting all pages to the same priority will have no effect.

If a SiteMapper user wants one or more of the optional site map tags (i.e., <lastmod>, <changefreq>, or <priority>), check the appropriate box. When checked, additional information will be solicited.

Using the same test web site as above, a site map that contains all of the site map tags is depicted in the figure to the left.

The resulting site map may be copied (Ctrl-A, Ctrl-C, Ctrl-V) or it may be saved to a file by clicking on the Save Site Map Button. If that button is clicked, a Save As DialogBox will open.

The order in which the pages (Uri's) are presented is the order in which they were encountered during the traversal.

Tree Reporting Options [toc]

Tree Reporting Options control what is contained in the site tree, an interactive tree that provides the structure of the site being mapped.

The tree appears like a TreeView but with an additional expand/collapse symbol to the right of the page name. The expand/collapse symbol appears if one or more of the content CheckBoxes (CSS, Icons, Images, or JavaScript) was checked and that type of content is found in the page.

The user may specify whether or not Broken Links and/or External Pages are to be reported. A broken link is a reference that cannot be retrieved. An external page is one that does not have the topmost page as its base. IsBaseOf [^] is used to determine if a Uri is the base of another Uri. Regardless of these settings, neither broken links nor external pages will be included in the site map. However, if checked, the site tree will display these pages.

An example of a Site Tree, generated using the default site tree options and expanding all children, is depicted in the figure to the left.

The user may specify that the site tree display Duplicate Links. These links occur when a single page is linked to by more than one page on the site.

An example of an expanded Site Tree, generated using all of the site tree options and expanding all children, is depicted in the figure to the right.

When the expand/collapse symbol is clicked, the internal file inclusions of the web page are displayed. These inclusions include:

    CSS files
    Icon files
    Image files
    JavaScript files

For each of the site tree content options checked, during web page examination, the files for each checked content type are recorded. If there are no file inclusions within the web page or if no CheckBoxes were checked, no expand/collapse symbol will appear.

An example of an expanded site tree, generated using all of the site tree options and expanding all children content, is depicted in the figure to the left.

Traversal Limits [toc]

To insure that a very large web site traversal is somewhat limited, the user may specify a value in the Maximum Number of Web Pages NumericUpDown control. When the number of web pages recorded exceeds the value in that control, traversal terminates. The information collected up to that point can be displayed.

Prepare Site Map [toc]

When all of the setup options meet the user's needs, the Prepare Site Map Button is clicked. At that time, the actual web site traversal starts and the Status GroupBox is displayed..

Progress is reported by displaying the web page currently being retrieved as well as by a numeric count of the web pages so far processed. This is displayed in the figure to the left, above. It should be noted that there is a delay between clicking the Prepare Site Map Button and the first reported activity. After the first web page has been retrieved, following pages are retrieved more quickly.

When the web site has been traversed, the status is reported as Site Map Prepared. In addition, two TabPages appear at the top of Site Mapper Form. This is displayed in the figure to the right, above.

Additional Considerations [toc]

When the Site tree is displayed, web page links may be presented in any one of five colors.

Color	Meaning
Black	Normal web page link.
Dodger Blue	Duplicate web page link.
Red	A broken link encountered during the traversal.
Green	An external web page link (i.e., one that is not a descendent of the topmost file).
Dark Violet	Follow web page link. When the label of a link is clicked, the color changes from Black to Dark Violet. When the label is colored Dark Violet, a second click will cause the web page to be displayed in a separate window.

Implementation [toc]

SiteMapper is basically a web crawler that records web pages. Depending upon the user's choice of options, SiteMapper records all of the links contained within them. This section discusses the SiteMapper implementation. First, the data structures that are key to recording the structure of the web site. Then the algorithms that perform the traversal and reporting.

Data Structures [toc]

BrotherTree [toc]

The singularly most important data structure used in SiteMapper is a BrotherTree.

The BrotherTree is made up of BrotherNodes, each one pointing to its parent, children, and siblings (right-brothers).

The BrotherTree for the test web site, used earlier, is depicted in the figure to the left.

With the exception of the root (the top-left node), all nodes have a parent. Although all nodes have a left-brother link, it is not used in SiteMapper

BrotherNode [toc]

As mentioned above, the BrotherTree is made up of BrotherNodes. These nodes contain links to other nodes as well as information about the web page that the node represents.

The following table defines each of the BrotherNode members.

Name	Purpose
Node Uri	The Uri of the web page
Link Text	The text in the link for the web page
Parent	A link to the BrotherNode that references this page
Right Brother	A link to a BrotherNode, discovered later in the traversal, that is a sibling of this node (i.e., referenced by this node's parent)
Left Brother	A link to a BrotherNode, discovered earlier in the traversal, that is a sibling of this node (i.e., referenced by this node's parent)
Child	A link to the BrotherNode that is a child of this node
CSS	List of the CSS files referenced by the web page
Icon	List of the Icon files referenced by the web page
Image	List of the image file referenced by the web page
JavaScript	List of the JavaScript files referenced by the web page
Broken	true, if the web page cannot be retrieved; false, otherwise
Duplicate	true, if the web page is a duplicate reference; false, otherwise
External	true, if the web page is an external reference; false, otherwise
Depth	Indicates the depth of the web page in the BrotherTree
Follow Link	true, if the web page is to be followed if the user clicks a second time on the displayed node label; false, otherwise
Display Children	true, if the children of this node are to be displayed; false, otherwise
Display Content	true, if the contents (lists of CSS, Icon, Image, and JavaScript files) of this node are to be displayed; false, otherwise

BrotherNodes are added to the BrotherTree with the following method.

    // ************************************************** add_node

    /// <summary>
    /// add a node to the BrotherTree
    /// </summary>
    /// <param name="parent">
    /// the BrotherNode that will become the parent of the newly 
    /// added node
    /// </param>
    /// <param name="new_node">
    /// the BrotherNode that is to be added to the BrotherTree
    /// </param>
    /// <remarks>
    /// Root is the getter/setter that is associated with the root 
    /// (topmost) node of the Brother tree. When the first node is 
    /// added, it will have no parent and thus the first node will 
    /// become the root node
    /// </remarks>
    public void add_node ( BrotherNode  parent,
                           BrotherNode  new_node )
        {

        if ( new_node == null )
            {
            throw new ArgumentException ( @"may not be null",
                                          @"new_node" );
            }
        else if ( Root == null )
            {
            Root = new_node;
            }
        else if ( parent == null )
            {
            throw new ArgumentException ( @"may not be null",
                                          @"parent" );
            }
        else
            {
            new_node.Parent = parent;
            if ( parent.Child == null )
                {
                parent.Child = new_node;
                }
            else
                {
                BrotherNode  node = parent.Child;
                BrotherNode  next_node = node.RightBrother;

                while ( next_node != null )
                    {
                    node = next_node;
                    next_node = next_node.RightBrother;
                    }
                node.RightBrother = new_node;
                new_node.LeftBrother = node;
                new_node.Child = null;
                }
            }

        node_count++;
        }

Note that a LeftBrother is included in the BrotherNode, SiteMapper does not currently use it.

List Head and Node [toc]

Each BrotherNode has four lists associated with it. Each list contains included files (i.e., CSS, Icon, Image, and JavaScript) that are associated with the web page.

CSS	Cascading Style Sheet inclusions are detected when the following link tag is encountered and are recorded in the list with the ListType of CSS. <link rel="stylesheet" ... href="?.css" />
Icon	Icons are detected when the following link tag is encountered and are recorded in the list with the ListType of ICON. <link ... rel="[shortcut ]icon" href="?.ico" />
Image	Images are detected when the following img tag is encountered and are recorded in the list with the ListType of IMAGE. <img src="?" alt="?" ... />
JavaScript	JavaScript sources detected when the following script tag is encountered and are recorded in the list with the ListType of JAVASCRIPT. <script ... src="?.js"></script>/>

Algorithms [toc]

Probably the most important algorithm is the one that traverses the web site pages. However, to understand how everything goes together, it may be useful to understand the overall working of SiteMapper.

SiteMapper Overview [toc]

SiteMapper execution can be divided into six distinct steps.

1	Initialize SiteMapper	Housekeeping functions are performed before continuing execution. These tasks include: establishment of event handlers and initializing the GUI (initialize all three TabPages and display the Setup TabPage, limiting the display to the Type of the Topmost File GroupBox).
2	Collect user input	User interaction with SiteMapper is accomplished through the raising and processing of events. As a user checks a RadioButton or a CheckBox, presses a Button, chooses an item from a ComboBox, or sets a value in a NumericUpDown, SiteMapper records the users input. User input is saved in the ApplicationState Class so that it may be accessed from any of the SiteMapper Classes.
3	Verify user input	The design of the GUI is such that, with the exception of the topmost file name TextBox, all user interaction is automatically validated. To simplify the code, event handlers that collect the state of each type of control are defined: BUT_Click for all Buttons, CHKBX_CheckedChanged for all CheckBoxes, and RB_CheckedChanged for all RadioButtons. There are some exceptions but, as the name implies, they are exceptions. The characters in web page names that are entered into the web page topmost file TextBox are not verified (i.e., RFC 1738 - Uniform Resource Locators (URL) [^] is not applied). The characters entered into the filename topmost file TextBox are only limited to 260 characters in length.
4	Traverse the Web Site	To maintain a responsive user interface, the actual traversal of the web pages is performed asynchronously by a BackgroundWorker Class [^] thread.
5	Create Site Map	At the conclusion of the web site traversal, the Site Map is created and placed within a RichTextBox in the Site Map TabPage. Then both the Site Map and Site Tree TabPages are made visible.
6	Create Site Tree	The Site Tree resembles a TreeView but with sufficient differences that a new ScrollableControl, named BrotherTreeView, was developed. When the Site Map TabPage is entered, the BrotherTreeView OnPaint event is raised. The BrotherTreeView also reacts to user input on any part of the Site Tree.

Traversing the Web Site [toc]

Traversing a web site is straight-forward. Actually it's straight-forward only because of the Html Agility Pack [^] (HAP). Unfortunately, as with most CodePlex projects the documentation is non-existent. For documentation, I would strongly recommend Parsing HTML Documents with the Html Agility Pack [^] by Scott Mitchell of 4GuysFromRolla.com.

As mentioned earlier, the traversal is performed in an asynchronous background thread. The input required to traverse a web site is the BrotherNode for the topmost file. That BrotherNode is retrieved from the Thread_Input property Root. To report traversal progress, the Thread_Output properties Activity and WebPage are updated and the ReportProgress event is raised. Both classes are depicted following the TraverseWebSite_BW_DoWork code and its discussion.

    // ********************************* TraverseWebSite_BW_DoWork

    // worker thread that traverses the web site

    void TraverseWebSite_BW_DoWork ( object          sender, 
                                     DoWorkEventArgs e )

        {
        BrotherNode         root = null;
        ThreadInput         thread_input = null;
        ThreadOutput        thread_output = null;
        BackgroundWorker    worker = null;

        worker = ( BackgroundWorker ) sender;
        thread_input = ( ThreadInput ) e.Argument;
        thread_output = new ThreadOutput ( );
        root = thread_input.Root;

        State.site_tree = new BrotherTree ( );
        State.site_tree.add_node ( null, root );

        stack = new Stack<BrotherNode> ( );
        stack.Push ( root );

        visited = new Dictionary < string, BrotherNode > ( );
        visited.Add ( root.NodeUri.AbsoluteUri, root );

        while ( stack.Count > 0 )
            {
            int         count = 0;
            string      extension = String.Empty;
            BrotherNode node = stack.Pop ( );
            Uri         uri = node.NodeUri;

            if ( root.NodeUri.IsBaseOf ( node.NodeUri ) )
                {
                node.External = false;
                }
            else
                {
                node.External = true;
                continue;
                }

            if ( !String.IsNullOrEmpty ( uri.Fragment ) ||
                 ( uri.AbsoluteUri.IndexOf ( @"#" ) > 0 ) ||
                 ( uri.AbsoluteUri.IndexOf ( @"%23" ) > 0 ) )
                {
                continue;
                }

            extension = Path.GetExtension ( uri.AbsoluteUri ).
                                                ToLower ( ).
                                                Trim ( );
                                            
            if ( State.excluded_extension.Contains ( extension ) )
                {
                continue;
                }
            
            if ( worker.CancellationPending )
                {
                e.Cancel = true;
                break;
                }
                
            if ( BrotherTree.NodeCount >= State.maximum_pages )
                {
                stack.Clear ( );
                continue;
                }
                
            thread_output.Activity = @"Retrieving";
            thread_output.WebPage = node.NodeUri.AbsoluteUri;
            worker.ReportProgress ( count++, thread_output );
                                    // from HtmlAgilityPack
            HAP.HtmlDocument  web_page = new HAP.HtmlDocument ( );

            try
                {
                WebRequest  web_request;
                WebResponse web_response;
                
                web_request = WebRequest.Create ( uri.
                                                  AbsoluteUri ); 
                web_response = web_request.GetResponse ( );

                using ( Stream stream = web_response.
                                        GetResponseStream ( ) )
                    {
                    web_page.Load ( stream ); 
                    }
                }
            catch ( WebException we )
                {
                node.Broken = true;
                continue;
                }
                
            extract_hyperlinks ( web_page, root, node );
                
            if ( State.report_css )
                {
                extract_css ( web_page, node );
                }

            if ( State.report_icons )
                {
                extract_icons ( web_page, node );
                }

            if ( State.report_images )
                {
                extract_images ( web_page, node );
                }

            if ( State.report_javascript )
                {
                extract_javascript ( web_page, node );
                }
            }
        }

TraverseWebSite_BW_DoWork performs the following tasks:

The root is extracted from Thread_Input, added to the BrotherTree, pushed on a Stack, and added to the visited Dictionary.
As long as the stack is not empty, the following actions are taken:
- A node is popped off the stack. The first time, the node that is popped off will be the root; thereafter it will be a node that is a descendant of the root.
- If the node is external, the node is so marked and control returns to the top of the loop.
- If the node is a fragment, control returns to the top of the loop.
- If the node's extension is one of the excluded extensions (currently: .pdf, .gif, .jpg, .png, .tif, .zip), control returns to the top of the loop. The excluded extensions are defined in the Constants class.
- If the BrotherTree NodeCount is greater than or equal to the maximum number of pages to be retrieved, the stack is emptied and control returns to the top of the loop. This effectively terminates the loop.
- A progress report is made.
- A new Html Agility Pack HtmlDocument is created.
- The HtmlDocument is filled from a WebResponse; if not successful, the node is marked broken and control returns to the top of the loop.
- The hyperlinks in the current page are extracted and recorded. The code for the extract_hyperlinks method follows.
- Each content type specified by the user is extracted.

The Html Agility Pack has a Load method:

 HAP.HtmlDocument web_page = new HAP.HtmlWeb ( ).Load ( uri.AbsoluteUri );

However, that method will not load a file system web site. For that reason, the traversal method uses WebRequest and WebResponse.

The Thread_Input class is:

namespace DataStructures
    {
    // ********************************************* class ThreadInput

    public class ThreadInput
        {

        public BrotherNode  Root { get; set; }

        } // class ThreadInput

    } // namespace DataStructures

The Thread_Output class is:

namespace DataStructures
    {

    // *****************************_************** class ThreadOutput

    public class ThreadOutput
        {

        public string  Activity { get; set; }
        public string  WebPage { get; set; }

        } // class ThreadOutput

    } // namespace DataStructures

The most important of the extract methods is extract_hyperlinks:

    // **************************************** extract_hyperlinks

    void extract_hyperlinks ( HAP.HtmlDocument  web_page,
                              BrotherNode       root,
                              BrotherNode       parent )
        {
        HAP.HtmlNodeCollection  hyperlinks;

        hyperlinks = web_page.DocumentNode.SelectNodes (
                                                @"//a[@href]" );
        if ( ( hyperlinks == null ) || ( hyperlinks.Count == 0 ) )
            {
            return;
            }

        foreach ( HAP.HtmlNode hyperlink in hyperlinks )
            {
            HAP.HtmlAttribute   attribute;
            string              destination = String.Empty;
            string              extension = String.Empty;
            string              linked_text = String.Empty;
            BrotherNode         new_node = null;
            Uri                 next_uri;

            attribute = hyperlink.Attributes [ @"href" ];
            if ( attribute == null )
                {
                continue;
                }
            if ( String.IsNullOrEmpty ( attribute.Value ) )
                {
                continue;
                }

            destination = attribute.Value;

            next_uri = new_uri ( root.NodeUri, 
                                 destination, 
                                 parent );
            if ( next_uri == null )
                {
                continue;
                }

            if ( !String.IsNullOrEmpty ( hyperlink.InnerText ) )
                {
                linked_text =
                    String.Join (
                          @" ",
                          Regex.Split ( hyperlink.InnerText,
                                        @"(?:\r\n|\n|\r)" ) ).
                    Trim ( );
                }

            if ( visited.ContainsKey ( next_uri.AbsoluteUri ) )
                {
                if ( State.display_duplicate_pages )
                    {
                    BrotherNode  old_node = null;
                    string       uri = next_uri.AbsoluteUri;

                    visited.TryGetValue (     uri,
                                          out old_node );
                    if ( old_node != null )
                        {
                        string  new_parent = String.Empty;
                        string  old_parent = String.Empty;

                        new_node = new BrotherNode ( uri, 
                                                     linked_text,
                                                     parent );
                        new_parent = parent.NodeUri.AbsoluteUri;

                        if ( old_node.Parent == null )
                            {
                            continue;
                            }

                        old_parent = old_node.Parent.NodeUri.
                                                     AbsoluteUri;

                        if ( old_parent.Equals ( new_parent ) )
                            {
                            continue;
                            }

                        if ( new_node.Depth >= old_node.Depth )
                            {
                            old_node.Duplicate = false;
                            new_node.Duplicate = true;

                            State.site_tree.add_node ( parent, 
                                                       new_node );
                            }
                        else
                            {
                            old_node.Duplicate = true;
                            new_node.Duplicate = false;

                            State.site_tree.add_node ( parent, 
                                                       new_node );

                            visited.Remove ( uri );
                            visited.Add ( uri, new_node );
                            }
                        }
                    }
                }
            else
                {
                new_node = new BrotherNode ( next_uri.AbsoluteUri,
                                             linked_text,
                                             parent );
                new_node.External = true;
                if ( root.NodeUri.IsBaseOf ( next_uri ) )
                    {
                    new_node.External = false;
                    stack.Push ( new_node );
                    }
                visited.Add ( next_uri.AbsoluteUri, new_node );
                State.site_tree.add_node ( parent, new_node );
                }

            if ( BrotherTree.NodeCount >= State.maximum_pages )
                {
                stack.Clear ( );
                break;
                }
            }
        }

extract_hyperlinks is passed the following parameters.

Parameter	Class	Description
web_page	HtmlDocument	The Html Agility Pack HtmlDocument of the web page currently being examined by TraverseWebSite_BW_DoWork. Note that the Html Agility Pack HtmlDocument differs from the System.Windows.Forms HtmlDocument. Thus the need for the HAP prefix (using HAP = HtmlAgilityPack;).
root	BrotherNode	The BrotherNode that contains the topmost file name data.
parent	BrotherNode	The BrotherNode of the web page currently being examined by TraverseWebSite_BW_DoWork.

extract_hyperlinks performs the following tasks:

Retrieve all hyperlinks in the web page and store them in the HAP.HtmlNodeCollection hyperlinks collection. The selector "//a[@href]" selects all anchor elements that have an attribute named "href".
If no hyperlinks exist, return.
For each hyperlink in the collection, take the following actions:
- Retrieve the hyperlink attribute "href".
- Retrieve the attribute value, naming it "destination". This destination is the Url of a web page referenced by this web page.
- Create a Uri from the destination using the topmost file's Uri as the base.
  
  For example, if only "page.html" was the destination, and "http://domain.com/directory/index.html" was the root's Uri, the resulting Uri would be "http://domain.com/directory/page.html".
- Extract the link text from the anchor element.
- At this point, a hyperlink (Uri) and its link text have been retrieved. A test to determine if the Uri has already been visited determines the next actions to take.
  - If the Uri has already been visited (i.e., found in the visited Dictionary) and the user does not desire to view duplicate links, the Uri is ignored. Otherwise, the already visited node is retrieved.
    - If the already visited node's parent is null (implying that the already visited node is the BrotherTree root), ignore the Uri.
    - If the already visited node's parent equals the new node's parent, the Uri is ignored.
    - Create a new BrotherNode
    - Which node is to be marked as a duplicate, is based upon the depth of each node.
      - If the new node's depth is greater than or equal to the already visited node's depth, the new node is marked as a duplicate, the already visited node is marked as a non-duplicate, and the new node is added to the BrotherTree.
      - If the new node's depth is less than the already visited node's depth, the already visited node is marked as a duplicate, the new node is marked as a non-duplicate, the new node is added to the BrotherTree, and the already visited node is replaced in the visited Dictionary by the new node.
  - If the Uri has not already been visited (i.e., not found in the visited Dictionary):
    - Create a new BrotherNode
    - Mark the new node's External property based upon the relationship between the new node and the Brothertree root.
    - Add the new node to the visited Dictionary.
    - Add the new node to the BrotherTree.
- If the BrotherTree NodeCount is greater than or equal to the maximum number of pages to be retrieved, the stack is emptied and the extraction of hyperlinks returns.

The other extract methods extract_css, extract_icons, extract_images, and extract_javascript are much simpler. As shown in the following table, the major differences between the extract methods, are the XPath strings used to retrieve the appropriate HtmlNodeCollection.

Method	XPath
extract_css	"//link[@href]" with attributes "rel" and "stylesheet"
extract_icons	"//link[@href]" with attributes "rel" and "icon"
extract_images	"//img[@src]" with attribute "src"
extract_javascript	"//script[@src]"

Whether the method executes is dependent upon the options set by the user in the Tree Reporting Options.

Displaying the Web Site Tree [toc]

The class BrotherTreeView implements a ScrollableControl [^]. The BrotherTreeView is similar to the TreeView [^] class except that it performs special expand and collapse operations when the user clicks on an expand/collapse arrow or on a web page label.

Conclusion [toc]

This article has presented a tool that generates a Google site map for a user-specified web site.

References [toc]

BackgroundWorker Class [^]
Html Agility Pack [^]
IsBaseOf [^]
Parsing HTML Documents with the Html Agility Pack [^]
RFC 1738 - Uniform Resource Locators (URL) [^]
Scheme [^]
ScrollableControl [^]
TreeView [^]
Uri [^]
URL [^]

Development Environment [toc]

SiteMapper control was developed in the following environment:

	Microsoft Windows 7 Professional SP1
	Microsoft Visual Studio 2008 Professional SP1
	Microsoft .Net Framework Version 3.5 SP1
	Microsoft Visual C# 2008

History [toc]

SiteMapper V3.3

06/16/2014

Original Article