
Copy Remote Website to Modify Functionality Locally

1 Aug 2014 | CPOL | 4 min read
A method for duplicating a web page (including all scripts and styles) so that it runs as if it originated from your own server, letting you then modify its server-side and client-side functionality.

Introduction

I made this function when I was asked to create an app to modify the search criteria of a search grid on someone else's site and then only display the grid to the user afterwards. The long and complicated way of doing this would be to create a screen-scraper application and modify the DOM. The simplest way was to copy the entire site and add minimal JavaScript to hide/show specific panels and hard-set what information was passed to the search grid on document load (without having to use screen scraping).

Background

The site I copied already used jQuery. If the site you are referencing does not, you can add jQuery yourself before or after the copy (see the new notes under Points of Interest). You must also add the attributes [id="body" runat="server"] to the <body> tag, so that the copy method can change the contents of the body tag server-side.

Using the Code

The C# code is as follows:

C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Linq;
using Microsoft.Ajax.Utilities;

namespace WebStuff
{
    public partial class Utilities
    {
        const RegexOptions _defaultRxFlags = RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline;
        const StringComparison _ic = StringComparison.CurrentCultureIgnoreCase;
        static Regex _rxPageScriptsOnly = new Regex(@"<(script|style|link)\b[^>]*>[\s\S]*?<\/\1[^>]*>|(<(link|script)[^>]*((\/>)|(>)))", _defaultRxFlags);
        static Regex _rxScriptTags = new Regex(@"<(script|style|link)[^>]*>|<\/(script|style|link)[^>]*>", _defaultRxFlags);
        static Regex _rxBodyOnly = new Regex(@"<(body)\b[^>]*>[\s\S]*?<\/\1[^>]*>", _defaultRxFlags);
        static Regex _rxBodyTags = new Regex(@"<body[^>]*>|<\/body[^>]*>", _defaultRxFlags);
        static Regex _rxScriptVersion = new Regex(@"(\p{P}[\d]+)+(.min)?", _defaultRxFlags);
        static Regex _rxScriptPath = new Regex(@"[^\/]+$", _defaultRxFlags);

        public static void CopyHtmlPage(string url)
        {
            Page page = HttpContext.Current.Handler as Page;

            List<string> parentResidentScripts = new List<string>();
            string resListStr = page.Request.Params["ResidentScripts"];
            if (!string.IsNullOrEmpty(resListStr))
            {
                string[] splits = resListStr.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries);
                foreach (string split in splits)
                {
                    parentResidentScripts.Add(GetScriptBaseName(split));
                }
            }

            HtmlGenericControl body = (HtmlGenericControl)page.FindControl("body");
            if (body == null)
                throw new Exception("No access to modify local <body> tag. Add [id='body' runat='server'] attributes to the <body> tag.");

            Uri location = new Uri(url, UriKind.RelativeOrAbsolute);
            string htmlText = GetResponseText(url);

            Match bodyMatch = _rxBodyOnly.Match(htmlText);
            if (!bodyMatch.Success)
                throw new Exception("Rendered html has no complete <body>[content]</body> element at [" + url + "]");

            {
                StringBuilder bodyText = new StringBuilder(
                    _rxPageScriptsOnly.Replace(
                        _rxBodyTags.Replace(bodyMatch.Value, "").Trim()
                        , "")
                    );
                FixAllLinks(ref bodyText, location); //anchor tags and style attributes

                System.Web.UI.Control newBody = page.ParseControl(bodyText.ToString(), true);
                if (newBody != null)
                    body.Controls.Add(newBody);
            }

            //set up minifier
            Minifier minifier = new Minifier();
            CodeSettings scriptSettings = new CodeSettings();
            scriptSettings.MinifyCode = true;
            scriptSettings.OutputMode = OutputMode.MultipleLines;
            scriptSettings.CollapseToLiteral = true;
            scriptSettings.PreserveImportantComments = false;
            scriptSettings.EvalTreatment = EvalTreatment.Ignore;
            scriptSettings.InlineSafeStrings = true;
            scriptSettings.LocalRenaming = LocalRenaming.CrunchAll;
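            //UserAgent can be null, and its tokens are capitalized ("Safari", "Mozilla"), so compare case-insensitively
            string userAgent = page.Request.UserAgent ?? "";
            scriptSettings.MacSafariQuirks = (new string[] { "safari", "apple" }).Any(w => userAgent.IndexOf(w, StringComparison.OrdinalIgnoreCase) > -1);
            scriptSettings.ConstStatementsMozilla = (new string[] { "mozilla" }).Any(w => userAgent.IndexOf(w, StringComparison.OrdinalIgnoreCase) > -1);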
            scriptSettings.PreserveFunctionNames = true;
            scriptSettings.RemoveFunctionExpressionNames = true;
            scriptSettings.RemoveUnneededCode = false;
            scriptSettings.StripDebugStatements = true;
            scriptSettings.ReorderScopeDeclarations = true;

            //expand script blocks
            MatchCollection pageScripts = _rxPageScriptsOnly.Matches(htmlText);
            int controlIndex = 0;
            foreach (Match pageScript in pageScripts)
            {
                string hrefType = " src=";
                Control newScript = null;
                int checkScriptIndex = pageScript.Value.IndexOf("script", _ic);
                int checkStyleIndex = pageScript.Value.IndexOf("style", _ic);
                int checkLinkIndex = pageScript.Value.IndexOf("link", _ic);
                bool isLinkTag = (checkLinkIndex > -1 && checkLinkIndex < 3);
                bool isScriptTag = (checkScriptIndex > -1 && checkScriptIndex < 3);
                bool isStyleTag = (checkStyleIndex > -1 && checkStyleIndex < 3);
                if (isScriptTag)
                {
                    newScript = new HtmlGenericControl("script");
                    ((HtmlGenericControl)newScript).Attributes.Add("type", "text/javascript");
                }
                else if (isStyleTag || isLinkTag)
                {
                    newScript = new HtmlGenericControl("style");
                    ((HtmlGenericControl)newScript).Attributes.Add("type", "text/css");
                    hrefType = " href=";
                }
                else
                    continue;

                StringBuilder scriptText = new StringBuilder(pageScript.Value);
                string workingText = scriptText.ToString();
                int srcLength = hrefType.Length + 1;
                int srcIndex = workingText.IndexOf(hrefType) + srcLength;
                string encap = workingText.Substring(srcIndex - 1, 1);
                int endIndex = workingText.IndexOf(encap, srcIndex);
                if (isLinkTag && (workingText.IndexOf("text/css", srcIndex, _ic) < 0 || workingText.IndexOf("stylesheet", srcIndex, _ic) < 0)) //not a style link tag
                {
                    if (srcIndex > srcLength && endIndex > srcIndex) //only when an href attribute was actually found
                    {
                        string srcUrl = workingText.Substring(srcIndex, endIndex - srcIndex).Trim();
                        Uri resourceLocation = ResolveUrl(srcUrl, location);
                        scriptText = scriptText.Replace(srcUrl, resourceLocation.ToString());
                    }
                    newScript = new LiteralControl();
                }
                else if (srcIndex > srcLength && srcIndex < workingText.IndexOf(">") && endIndex > srcIndex)
                {
                    string srcUrl = workingText.Substring(srcIndex, endIndex - srcIndex).Trim();

                    //skip adding scripts which exist on calling parent
                    string baseName = GetScriptBaseName(srcUrl);
                    if (parentResidentScripts.Contains(baseName, StringComparer.CurrentCultureIgnoreCase))
                        continue;

                    Uri resourceLocation = ResolveUrl(srcUrl, location);

                    ((HtmlGenericControl)newScript).Attributes.Add("original", resourceLocation.ToString());
                    try
                    {
                        scriptText = new StringBuilder((isScriptTag) ?
                                minifier.MinifyJavaScript(
                                    GetResponseText(resourceLocation.ToString())
                                , scriptSettings)
                                : minifier.MinifyStyleSheet(
                                    GetResponseText(resourceLocation.ToString())
                                )
                            );
                        FixAllLinks(ref scriptText, resourceLocation);
                    }
                    catch (Exception ex)
                    {
                        scriptText.Length = 0;
                        ((HtmlGenericControl)newScript).Attributes.Add("error", ex.Message);
                    }
                }
                else
                {
                    scriptText = new StringBuilder((isScriptTag) ?
                            minifier.MinifyJavaScript(
                                _rxScriptTags.Replace(pageScript.Value, "").Trim()
                            , scriptSettings)
                            : minifier.MinifyStyleSheet(
                                _rxScriptTags.Replace(pageScript.Value, "").Trim()
                            )
                        );
                    FixAllLinks(ref scriptText, location);
                }

                if (scriptText.Length > 0)
                {
                    if (isScriptTag)
                    {
                        scriptText.Insert(0, "<!-- \n");
                        scriptText.Append("\n -->");
                    }

                    if (newScript.GetType() == typeof(HtmlGenericControl))
                        ((HtmlGenericControl)newScript).InnerHtml = scriptText.ToString();
                    if (newScript.GetType() == typeof(LiteralControl))
                        ((LiteralControl)newScript).Text = scriptText.ToString();
                }

                if (pageScript.Index < bodyMatch.Index)
                {
                    page.Header.Controls.AddAt(controlIndex, newScript);
                    controlIndex++;
                }
                else
                {
                    body.Controls.Add(newScript);
                }

            }
        }

        private static string GetScriptBaseName(string scriptUrl)
        {
            string baseName = _rxScriptPath.Match(scriptUrl).Value;
            baseName = _rxScriptVersion.Replace(baseName, "");
            return baseName;
        }

        private static void FixAllLinks(ref StringBuilder fixText, Uri siteUrl)
        {
            FixLinks("url(", ref fixText, siteUrl);
            FixLinks("src=", ref fixText, siteUrl);
            FixLinks("href=", ref fixText, siteUrl);
        }

        private static void FixLinks(string searchType, ref StringBuilder fixText, Uri siteUrl)
        {
            int urlIndex = 0;
            while (urlIndex > -1)
            {
                string workingText = fixText.ToString();
                urlIndex = workingText.IndexOf(searchType, urlIndex, _ic);
                if (urlIndex < 0) continue;
                urlIndex = urlIndex + searchType.Length;

                string urlEncap = fixText[urlIndex].ToString();
                if (urlEncap.Equals(@"\"))
                {
                    urlIndex++;
                    urlEncap += fixText[urlIndex].ToString();
                    urlIndex++;
                }
                else if (!urlEncap.Equals("'") && !urlEncap.Equals("\""))
                    urlEncap = ")";
                else
                    urlIndex++;

                int endIndex = workingText.IndexOf(urlEncap, urlIndex);
                if (endIndex < 0) break; //unterminated URL: stop instead of letting Substring throw
                string srcUrl = workingText.Substring(urlIndex, endIndex - urlIndex);
                if (string.IsNullOrEmpty(srcUrl.Trim()) ||
                    srcUrl.Trim().Equals("#") ||
                    srcUrl.Trim().StartsWith("javascript:", _ic) ||
                    srcUrl.Trim().Equals("/a", _ic))
                    continue;

                Uri resourceLocation = ResolveUrl(srcUrl, siteUrl);
                fixText = fixText.Remove(urlIndex, endIndex - urlIndex);
                fixText = fixText.Insert(urlIndex, resourceLocation.ToString());
            }
        }

        private static Uri ResolveUrl(string srcUrl, Uri siteUrl)
        {
            Uri resourceLocation = null;

            string pathSeperator = "/";
            int sepIndex = srcUrl.IndexOf("\\");
            if (sepIndex > -1 && sepIndex < 10) pathSeperator = "\\";
            //only an absolute URL can be used as-is; relative URLs are resolved against the source site
            bool wellFormed = Uri.TryCreate(srcUrl, UriKind.RelativeOrAbsolute, out resourceLocation)
                && resourceLocation.IsAbsoluteUri;
            if (!wellFormed)
            {
                int lastSep = siteUrl.OriginalString.LastIndexOf(pathSeperator);
                //GetLeftPart(Authority) returns the scheme and host, e.g. "https://www.google.com"
                string resourcePath = ((srcUrl.StartsWith(pathSeperator))
                        ? siteUrl.GetLeftPart(UriPartial.Authority)
                        : siteUrl.OriginalString.Substring(0, lastSep + 1)) + srcUrl;
                resourceLocation = new Uri(resourcePath);
            }

            return resourceLocation;
        }

        public static string GetResponseText(string url)
        {
            WebRequest request = WebRequest.Create(url);
            //dispose the response and the reader even when reading fails
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }

    }
}

To use the copy method, override the OnPreRender method of a page so that your own contents are rendered and processed before the off-site web page is copied.

C#
protected override void OnPreRender(EventArgs e)
{
    base.OnPreRender(e);
    Utilities.CopyHtmlPage("https://www.google.com/finance"); //static method on the Utilities class shown above
}

By placing your scripts in the correct location, you can make them precede or follow the copied HTML code.

ASP.NET
<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SiteCopy.aspx.cs" Inherits="WebStuff.SiteCopy" %>

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <script type="text/javascript">
        /*place scripts in very first element of the header if you want it to run before copied scripts*/
    </script>
    <title>Site Copy Finance Grid</title>
    <script type="text/javascript">
        /*place scripts after the first element that you wish to run after copied scripts*/

        //for example
        //document begin      
        $(document).ready(function () {
            //hide specific elements
            var firstTable = $("table:first");
            var mainRow = firstTable.find("tr:first");
            var columns = mainRow.children();
            columns.eq(0).hide();
            columns.eq(1).hide();

            searchGroup = "Fortune500"; //hardset the search group for the grid
            search();            
        });
    </script>
</head>
<body id="body" runat="server">

</body>
</html>
*[New] This simple HTML page shows how you could call the site copy page and insert the resulting page into a div panel. It also shows how duplicated scripts are skipped: the page passes a list of the scripts it has already loaded.
HTML
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Offsite Copy, Ajax Panel</title>
    <script src='/scripts/jquery-2.1.1.js' type='text/javascript'></script>
</head>
<body>
    <div id="offsite" style="background:url(loading.gif) no-repeat center center; -moz-min-width:20px; -ms-min-width:20px; -o-min-width:20px; -webkit-min-width:20px; min-width:20px;min-height:20px;">
    </div>
    <script type="text/javascript">
        var scripts = $('script[src]');
        var scriptList = "";
        $.each(scripts, function (k, v) {
            scriptList += $(v).attr('src') + ",";
        });
        jQuery.support.cors = true;
        $.ajax({
            type: "GET",
            url: 'http://mysite.com/SiteCopy.aspx',
            data: "ResidentScripts=" + encodeURIComponent(scriptList),
            dataType: "html",
            contentType: "text/html; charset=utf-8",
            cache: false,
            crossDomain: true,
            isLocal: true,
            success: function(data) {
                $('#offsite').html(data);
            },
            error: function(request, error) {
                alert(error + ": " + request.status);
            },
            complete: function () {
                $('#offsite').css('background','none');
            }
        });
    </script>
</body>
</html>

Points of Interest

Now the source site is copied and all script/style elements are unraveled as in-line code. The server copy also takes into account whether a script is in the header or the body and places it accordingly in the duplicate. If an error occurs, the script/style element will have an [error="?"] attribute describing the problem, as well as an [original="?"] attribute indicating the src or href location the element pointed to on the original site before it was unraveled.
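To check a copied page for such failures from the browser, something like the following could be used. It is only a sketch: the function name `listCopyErrors` is made up for illustration and is not part of the article's code. It relies on nothing more than the [error] and [original] attributes described above.

```javascript
// Hypothetical helper (not part of the original code): collects the failures
// that the server-side copy recorded on script/style elements.
// `root` is anything exposing querySelectorAll, normally `document`.
function listCopyErrors(root) {
    var failed = root.querySelectorAll("script[error], style[error]");
    var report = [];
    for (var i = 0; i < failed.length; i++) {
        report.push({
            original: failed[i].getAttribute("original"), // where the resource lived on the source site
            error: failed[i].getAttribute("error")        // exception message from the server
        });
    }
    return report;
}
```

In a browser console, `console.table(listCopyErrors(document));` would then list each resource that could not be unraveled.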

As a note, after I loaded and modified the page to show only what I wanted, I had to write JavaScript that found any links and images with relative URL references and changed them to absolute references pointing at the site I copied from, so that they would display and navigate properly. I could have added server-side code to do this, but I wanted the code to be more client-side configurable after rendering. For example, you could replace an image simply by giving a local file the same name as the image on the source site. *[New] After seeing the advantages of doing this at parse time, I changed the code to rewrite all links so that they point to the copied site.
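The earlier client-side approach mentioned in this note could be sketched roughly as follows. This is not the script that was actually used: the names `toAbsoluteUrl` and `absolutizeLinks` and the `sourceBase` address are assumptions for illustration, and the sketch uses the browser's standard `URL` constructor instead of jQuery.

```javascript
// Hypothetical base address of the copied site (assumption for illustration).
var sourceBase = "https://example.com/finance/";

// Resolve a possibly relative reference against the source site.
// Empty values, fragments and javascript: links are returned unchanged,
// as is anything the URL parser rejects.
function toAbsoluteUrl(ref, base) {
    var trimmed = (ref || "").trim();
    if (trimmed === "" || trimmed.charAt(0) === "#" || /^javascript:/i.test(trimmed)) {
        return ref;
    }
    try {
        return new URL(trimmed, base).href;
    } catch (e) {
        return ref;
    }
}

// Rewrite the href/src attributes of links and images under `root`.
function absolutizeLinks(root, base) {
    var nodes = root.querySelectorAll("a[href], img[src]");
    for (var i = 0; i < nodes.length; i++) {
        var attr = nodes[i].hasAttribute("href") ? "href" : "src";
        nodes[i].setAttribute(attr, toAbsoluteUrl(nodes[i].getAttribute(attr), base));
    }
}
```

On the copied page this would be called once after load, for example `absolutizeLinks(document, sourceBase);`. Leaving selected references relative is what would allow a same-named local file to override an image from the source site.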

*[New] Code was added to process <link> tags that don't need to be unraveled (such as a favicon reference).

*[New] After trying to make this code work from an Ajax call, I found that when certain scripts were present both on the calling page and in the content added to the Ajax panel, some of them would fail to load properly (jQuery in particular). I added a version-independent script checker that compares the script file names sent with the request to those on the copied site and skips any script for which a similar one is already on the calling page. To use it, pass a form or query variable named "ResidentScripts" whose value is a comma-separated list of script paths.
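To illustrate what "version independent" means here, the following sketch mirrors the two regular expressions used by GetScriptBaseName in the C# listing. It is a JavaScript approximation for experimenting in a console, not the server code itself, and the helper name `isResident` is an assumption.

```javascript
// Same patterns as _rxScriptPath and _rxScriptVersion in the C# listing.
var rxScriptPath = /[^\/]+$/;                    // last path segment, e.g. "jquery-2.1.1.js"
var rxScriptVersion = /(\p{P}\d+)+(.min)?/giu;   // e.g. "-2.1.1" or "-1.11.0.min"

// Strip the directory and the version/min suffix from a script URL.
function getScriptBaseName(scriptUrl) {
    var match = rxScriptPath.exec(scriptUrl);
    var fileName = match ? match[0] : "";
    return fileName.replace(rxScriptVersion, "");
}

// Hypothetical helper: true when a script with the same base name appears
// in the comma-separated ResidentScripts value.
function isResident(scriptUrl, residentScripts) {
    var wanted = getScriptBaseName(scriptUrl).toLowerCase();
    return residentScripts.split(",")
        .filter(function (s) { return s.length > 0; })
        .some(function (s) { return getScriptBaseName(s).toLowerCase() === wanted; });
}
```

With these patterns, "/scripts/jquery-2.1.1.js" and a CDN copy named "jquery-1.11.0.min.js" both reduce to "jquery.js", so the second would be skipped. A file named "jquery.min.js" with no version number keeps its ".min" and would not be treated as the same script.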

*[New] I added server-side script minification using the WebGrease toolkit, which I added to the project with the NuGet package manager. You can remove the minification code if you like. It doesn't save much load time, since minifying takes time too; I added it to cut down on the amount of data flowing across the web.

I also had to debug the flow of JavaScript code on the copied site to see which variables to change in order to hard-set the search() function. This is a method for more advanced coders, but not out of the realm of intermediate coder's understanding.

The real beauty of this method is that the entire site is copied on every request, so if the source site changes, your duplicate page displays those changes in real time. You may have to adjust your JavaScript, but that is a smaller and easier change than re-coding a scraper to look for different element names and formats.

History

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) Centeva
United States
I achieved my degree in Electronics Engineering; however, my true passion has always been programming. I started programming at a very young age using Basic on a TRS-80 and saving my programs on audio tape through an audible modem. I moved up to Basic on an Atari 800XL computer, saving my work on 5.25 floppy disks. I then learned Basic on an Apple IIe, saving my work on 3.5 floppies. When I approached high school I began getting into lower level languages such as Borland Pascal on IBM 8086 machines using DOS. Gaining a love of early video games (gotta love Ultima 3 through 7), I endeavored to write my own games and DOS utilities using Borland C++ and Intel x86 Assembly language. I began a career in software engineering during college using everything from REXX on OS/2 to .Net Studio v1.0 (some tech support jobs thrown in here and there). I am now a big proponent of C#; I believe that (standards-wise) it is where C++ should have been many years ago. Today I write everything from native apps for PC, Mac and smart-phones to Web applications. Trends change quickly, but I perceive the most useful forms of programming currently to be Web applications, Cloud services, asynchronous Ajax, and jQuery JavaScript libraries.
