
Steganography 13 - Hiding Binary Data in HTML Documents

13 Mar 2008
Some ideas on how to hide binary data in text documents

Introduction

The last twelve articles dealt only with hiding binary data in binary files. It's getting boring, isn't it? Let's take the first text format that comes to mind and hide binary data in such a document. You are reading an HTML page right now, so HTML is our file format for this article!

Find a Hiding Place

We cannot simply insert extra data into an HTML document: whatever we insert would be visible either in the browser or in the source text as useless clutter. But the order of a tag's attributes can be changed without changing the visible document or the file's size.

<span class="bigText" style="color:#0088ff">
         Text with a CSS class and special color
</span>

<span style="color:#0088ff" class="bigText">
         Do you see the difference?
</span>

The example above shows two variations of the same content. Let's define a very simple key from it:

Key Attribute    Corresponding Attribute
class            style


if( class-attribute before style-attribute ){
    the tag encodes a "1"-bit
}
else{
    the tag encodes a "0"-bit
}

With this key, every combination of class and style stands for one bit. At eight bits per character, we need 80 such spans to hide 10 characters of secret text. That is a lot of carrier text for very little secret text. Fortunately, HTML documents contain other common attribute combinations, especially old HTML that uses inline formatting instead of CSS. Here are a few examples. In each row, key attribute first may mean "1" and corresponding attribute first may mean "0".

Key Attribute    Corresponding Attribute
width            height
src              alt
align            valign
href             target
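
The rule can be sketched as a small C# helper that reads one bit from a single tag. This is only an illustration, not code from the article's download: the names `AttributeBitSketch`, `ReadBit` and `PositionOf` are made up here, and the regular expression is a simplification that can be fooled by attribute-like text inside a quoted value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeBitSketch {
    // Returns 1 if 'key' occurs before 'corresponding' in the tag,
    // 0 if it occurs after it, and null if the tag lacks either attribute.
    public static int? ReadBit(string tag, string key, string corresponding) {
        int keyPosition = PositionOf(tag, key);
        int otherPosition = PositionOf(tag, corresponding);
        if (keyPosition < 0 || otherPosition < 0) {
            return null; // no attribute couple, no bit
        }
        return keyPosition < otherPosition ? 1 : 0;
    }

    // Finds "<whitespace>name=" so that "align" does not match inside "valign".
    private static int PositionOf(string tag, string attributeName) {
        Match match = Regex.Match(
            tag,
            @"\s" + Regex.Escape(attributeName) + @"\s*=",
            RegexOptions.IgnoreCase);
        return match.Success ? match.Index : -1;
    }
}
```

Called with the two span variations shown earlier and the class/style key, the helper returns 1 for the first span and 0 for the second.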

A Short Example

The carrier documents must be quite long, because every tag can only hide a few bits. The home page of pc-errors.de contains just enough attributes to hide 16 ASCII characters. Anyway, a short example document with hiding places for three bytes should be enough. Would you expect secrets in that page?

The page is a typical homepage of a bird fanatic who has never heard of HTML 4 and uses a WYSIWYG editor he found on an old magazine CD. It begins like this:

<html>
<head>
      <title>Canary Birds</title>
      <meta name="author" content="Peter Miller">

      <style>
             .bigText{ font-size:14px; font-weight:bold; }
      </style>
</head>
<body text="#000000" bgcolor="#FFFFFF" link="#FF0000"
                       alink="#FF0000" vlink="#FF0000">
      <div align="center" width="50%">
           <h1>Canaries</h1>
           <span class="bigText" style="color:#0088ff">
             The Finches who got their Name from Islands
             which got their Name from Dogs
           </span>
      </div>

There are five useful attribute couples:

Key Attribute    Corresponding Attribute
name             content
text             bgcolor
alink            vlink
align            width
class            style

Each couple occurs only once, so the first part of the document can hide only five bits. Let's go on with the rest of the page:

      <table width="60%" height="100" cellpadding="4" cellspacing="0"
                bgcolor="white" align="center">

             <tr>
                 <td align="right" valign="middle">
                     <img src="exampleImage.jpg" width="164" height="116"
                             alt="Yellow Bird" title="Yellow Bird" border="0">
                 </td>
                 <td align="left" valign="top">
                     The most canaries are yellow, even though they can have
                     all thinkable patterns of
                     <span class="bigText"
                            style="color:#ffffff; background:#000000">white</span>,
                     <span class="bigText" style="color:#bb0000">red</span> and
                     <span class="bigText" style="color:#888888">grey</span>.
                     <a href="#" target="_blank">click here to see photos.</a>
                 </td>
             </tr>
             <tr>
                 <td align="right" valign="top">
                     Male birds are great singers.
                     <a href="#" target="_blank">click here to listen to a sample.</a>
                 </td>
                 <td align="left" valign="middle">
                     <img src="exampleImage2.jpg" width="164" height="176"
                         alt="Singing Bird" title="A Canary is singing" border="0">
                 </td>
             </tr>
             <tr>
                 <td align="left" valign="top">
                     You cannot keep canaries in a cage all day long.
                     They can get sick, if you don't let them fly.
                 </td>
                 <td align="left" valign="top">
                     Another big mistake is to keep one canary alone.
                     Every birds need at least one partner,
                     loneliness can lead to bad disorders.
                 </td>
             </tr>
             <tr>
                 <td colspan="2">
                     <img src="exampleImage3.jpg" width="194" height="35"
                             alt="Feather" title="A Canary Feather" border="0">
                 </td>
             </tr>
      </table>
</body>
</html>

In this part of the document, additional attribute couples are possible:

Key Attribute    Corresponding Attribute
width            height
src              alt
title            border
cellspacing      cellpadding
bgcolor          align
align            valign
href             target

The combination of width and height occurs four times, which gives a capacity of four bits. src and alt appear three times, for another three bits, and title and border add three more. cellspacing/cellpadding occurs only once, as does bgcolor/align, for two more bits. align/valign adds capacity for six bits, and href/target adds two bits, one per link in the listing. Together with the five bits from above, the document has enough capacity to hide 25 bits: three characters and one unused bit.
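
Counting couples by hand gets tedious for longer pages. The following sketch counts them for an HTML fragment, assuming a key given as a list of attribute name pairs. The names `CapacitySketch` and `CountBits` are hypothetical, and the tag and attribute patterns are deliberately naive (for example, a `>` inside an attribute value would break them). It walks the key rows for each tag, which is a simplification and not necessarily the order the article's program uses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CapacitySketch {
    // Counts how many key couples occur in the opening tags of an HTML fragment.
    // Each attribute of a tag is counted for at most one couple.
    public static int CountBits(string html,
                                IList<KeyValuePair<string, string>> key) {
        int bits = 0;
        foreach (Match tag in Regex.Matches(html, @"<[A-Za-z][^>]*>")) {
            // collect the attribute names of this tag
            var free = new HashSet<string>();
            foreach (Match attribute in
                     Regex.Matches(tag.Value, @"\s([A-Za-z][\w-]*)\s*=")) {
                free.Add(attribute.Groups[1].Value.ToLowerInvariant());
            }
            // every couple found in the tag is worth one bit
            foreach (KeyValuePair<string, string> couple in key) {
                if (free.Contains(couple.Key) && free.Contains(couple.Value)) {
                    free.Remove(couple.Key);
                    free.Remove(couple.Value);
                    bits++;
                }
            }
        }
        return bits;
    }
}
```

Dividing the result by 8 gives the number of whole bytes that fit, and the remainder is the number of unused bits.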

Three characters are not enough for a long letter, but enough to say "no!", or, in ASCII values, "110 111 033" (binary "01101110 01101111 00100001"). Let's go through the document and find the first tag with a usable attribute couple...
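
To check the bit pattern for other short messages, a few lines of C# are enough. This is a minimal sketch, and `MessageBitsSketch` and `ToBitString` are names invented for this illustration. It writes each ASCII byte with the most significant bit first, which is the order used in the walkthrough.

```csharp
using System;
using System.Linq;
using System.Text;

public static class MessageBitsSketch {
    // Converts an ASCII message into groups of eight bits,
    // most significant bit first, separated by spaces.
    public static string ToBitString(string message) {
        byte[] bytes = Encoding.ASCII.GetBytes(message);
        return string.Join(
            " ",
            bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
    }
}
```

`MessageBitsSketch.ToBitString("no!")` returns "01101110 01101111 00100001".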

<meta name="author" content="Peter Miller">

name/content is "1", content/name is "0".
We have to re-order the attributes, to hide a value of "0":

<meta content="Peter Miller" name="author">

One bit is done. Next bit...

<body text="#000000" bgcolor="#FFFFFF" link="#FF0000"
       alink="#FF0000" vlink="#FF0000">

text/bgcolor is "1", bgcolor/text is "0".
alink/vlink is "1", vlink/alink is "0".
We want to hide "1" and "1", so no changes to this line are required.

<body text="#000000" bgcolor="#FFFFFF" link="#FF0000"
       alink="#FF0000" vlink="#FF0000">

... and so on. For every bit, we check one attribute couple and swap its two attributes if necessary. Image tags can carry up to three bits if the deprecated attributes are also present:

<img src="exampleImage.jpg" width="164" height="116" alt="Yellow Bird"
       title="Yellow Bird" border="0">

We want to hide "010".
The first key attribute in this tag is "src",
so we take the corresponding attribute "alt".
The bit to hide is "0", the combination for "0" is alt/src,
so we place the "alt"-attribute before the "src"-attribute.

<img alt="Yellow Bird" src="exampleImage.jpg" width="164" height="116"
        title="Yellow Bird" border="0">

The next key attribute is "width", the corresponding attribute is "height".
Now, the bit to hide is "1", so we put "height" after "width".
The third key attribute is "title", and its corresponding attribute is "border".
To hide a "0", we move "title" behind "border".

<img alt="Yellow Bird" src="exampleImage.jpg" width="164" height="116"
           border="0" title="Yellow Bird">
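
The re-ordering step can be sketched as follows. This is not the HideBit method from the download. `AttributeOrderSketch` and `EncodeBit` are hypothetical names, and the sketch assumes that every attribute has a value and that the tag is not self-closing. It places the corresponding attribute directly after the key attribute for a "1" and directly before it for a "0", which reproduces the order shown in the walkthrough.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeOrderSketch {
    // Matches name="value", name='value' or name=value inside a tag.
    private static readonly Regex AttributePattern = new Regex(
        @"([A-Za-z][\w-]*)\s*=\s*(""[^""]*""|'[^']*'|[^\s>]+)");

    // Returns the tag rebuilt on one line with the couple ordered to encode 'bit'.
    // Tags that lack either attribute are returned unchanged.
    public static string EncodeBit(string tag, string key,
                                   string corresponding, bool bit) {
        Match tagName = Regex.Match(tag, @"^<\s*([A-Za-z][\w-]*)");
        if (!tagName.Success) {
            return tag;
        }

        // list the attributes as (lower-case name, original text) in order
        var attributes = new List<KeyValuePair<string, string>>();
        foreach (Match m in AttributePattern.Matches(tag)) {
            attributes.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value.ToLowerInvariant(), m.Value));
        }

        int second = attributes.FindIndex(a => a.Key == corresponding);
        if (second < 0 || attributes.FindIndex(a => a.Key == key) < 0) {
            return tag;
        }

        // take the corresponding attribute out and put it next to the key
        KeyValuePair<string, string> moved = attributes[second];
        attributes.RemoveAt(second);
        int first = attributes.FindIndex(a => a.Key == key);
        attributes.Insert(bit ? first + 1 : first, moved);

        var parts = new List<string> { "<" + tagName.Groups[1].Value };
        foreach (KeyValuePair<string, string> a in attributes) {
            parts.Add(a.Value);
        }
        return string.Join(" ", parts) + ">";
    }
}
```

Applied to the image tag with src/alt and a "0", then with title/border and a "0", the sketch yields the same attribute order as the last listing, written on a single line.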

No more Examples, Show the Implementation!

Alright, first we need two classes to store HTML tags and their attributes. Attributes don't have many properties: only a name and a value. Each attribute in a tag can be used for only one message bit, so the program has to mark it as handled once it has been used.

public class HtmlAttribute {
        private String name;
        private String value;
        private bool handled;

        public String Name {
            get { return name; }
        }

        public String Value {
            get { return this.value; }
            set { this.value = value; }
        }

        public bool Handled {
            get { return handled; }
            set { this.handled = value; }
        }

        public HtmlAttribute(String name) {
            this.name = name.ToLower();
            this.value = String.Empty;
            handled = false;
        }
}

An HTML tag has a name and a number of attributes. The constructor searches the tag's text for attributes and their values.

public class HtmlTag {
        //backing fields, accessed through the properties below
        private int beginPosition;
        private int endPosition;
        private String name;

        public int BeginPosition {
            get { return beginPosition; }
            set { beginPosition = value; }
        }

        public int EndPosition {
            get { return endPosition; }
            set { endPosition = value; }
        }

        public String Name {
            get { return name; }
        }

        private HtmlAttributeCollection attributes;
        public HtmlAttributeCollection Attributes{
            get{ return attributes; }
        }

        public HtmlTag(String text, int beginPosition, int endPosition) {
            //... complicated lines for splitting tags into attributes ...
            //... you better read it in the full source code ...
        }
}

The Hide method lists all HTML tags, and then loops over the tags and their attributes. Attributes that have already been handled are ignored. If an attribute is still fresh and unused, the method looks it up in the key table...

/// <summary>Hide a message in an HTML document</summary>
/// <param name="sourceFileName">Path and name of the HTML document</param>
/// <param name="destinationFileName">Path
///         and name to save the resulting HTML document</param>
/// <param name="message">The message to hide</param>
/// <param name="keyTable">DataTable with the key attributes</param>
public void Hide(String sourceFileName,
       String destinationFileName,
       Stream message,
       DataTable keyTable)
{
    //read the carrier document
    StreamReader reader = new StreamReader(sourceFileName, Encoding.Default);
    String htmlDocument = reader.ReadToEnd();
    reader.Close();

    message.Position = 0;

    //list the HTML tags
    HtmlTagCollection tags = FindTags(htmlDocument);

    StringBuilder insertTextBuilder = new StringBuilder();
    DataRow[] rows;
    HtmlAttribute secondAttribute;
    int offset = 0;
    int bitIndex = 7;
    int messageByte = 0;

    foreach (HtmlTag tag in tags) {

        insertTextBuilder.Remove(0, insertTextBuilder.Length);
        insertTextBuilder.AppendFormat("<{0}", tag.Name);

        foreach (HtmlAttribute attribute in tag.Attributes) {

            if (!attribute.Handled) { //attribute has not been used, yet

                //find key row for this attribute
                rows =
                  keyTable.Select(String.Format("firstAttribute = '{0}'",
                  attribute.Name));

... If the program finds the attribute's name in the first key column, it is a primary key attribute and its secondary key attribute is looked up in the attribute collection of the current tag. If the secondary key attribute exists, we have found a key attribute couple and are able to hide one bit.

                if (rows.Length > 0) {

                    //find corresponding attribute
                    secondAttribute = FindAttribute(
                                    rows[0]["secondAttribute"].ToString(),
                                    tag.Attributes);

                    if (secondAttribute != null) {

                        if (bitIndex == 7) {
                            //get next message byte
                            bitIndex = 0;
                            messageByte = message.ReadByte();
                        } else {
                            //next bit
                            bitIndex++;
                        }

                        //change the attributes' order
                        HideBit(messageByte,
                                bitIndex,
                                attribute,
                                secondAttribute,
                                insertTextBuilder);

                        //mark both attributes as handled
                        attribute.Handled = true;
                        secondAttribute.Handled = true;
                    }
                }

If the attribute was not a primary key attribute, it can be a secondary key attribute. That means, it will be handled later on, together with its primary key attribute. If the attribute is not found in any key column, it is not meant to be used and must be copied into the new tag as it is.

                if (!attribute.Handled) {
                    //The attribute is not a primary key attribute.
                    //Is it a secondary key attribute?
                    bool copyAttribute = false;
                    rows =
                      keyTable.Select(String.Format("secondAttribute = '{0}'",
                      attribute.Name));

                    if(rows.Length > 0){
                        //if the corresponding first attribute
                        //does not exist in
                        //this tag or has already been used,
                        //this attribute will not be used and must be copied.
                        HtmlAttribute firstAttribute = FindAttribute(
                                      rows[0]["firstAttribute"].ToString(),
                                      tag.Attributes);

                        if (firstAttribute == null) {
                            copyAttribute = true;
                        }else{
                            copyAttribute = firstAttribute.Handled;
                        }
                    }

                    else {
                        //this attribute is not part
                        //of the key and must be copied.
                        copyAttribute = true;
                    }

                    if (copyAttribute) {
                        //copy unused attribute
                        insertTextBuilder.AppendFormat(
                            @" {0}={1}",
                            attribute.Name, attribute.Value);

                        attribute.Handled = true;
                    }
                }
            }
        }

At this point, you see why we saved the start and end positions with every tag. When we're finished with a tag's attributes, we have to replace the old tag with the new one. In case some whitespace got lost along the way, we compare the old length with the new length and add the difference to a running offset. That way, all following tags are still found, even though they have moved.

        //replace old tag with new tag

        tag.BeginPosition += offset;
        tag.EndPosition += offset;

        String insertText = insertTextBuilder.ToString();
        int newLength = insertText.Length;
        if (newLength > 0) {
            int oldLength = tag.EndPosition - tag.BeginPosition;
            htmlDocument = htmlDocument.Remove(tag.BeginPosition, oldLength);
            htmlDocument = htmlDocument.Insert(tag.BeginPosition, insertText);

            offset += (newLength - oldLength);
        }

        if (messageByte < 0) {
            break; //finished
        }
    }

    //save the new document, using the same encoding it was read with
    StreamWriter writer = new StreamWriter(destinationFileName, false, Encoding.Default);
    writer.Write(htmlDocument);
    writer.Close();
}
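
The offset bookkeeping is easier to see in isolation. The sketch below replaces several spans of a string whose positions were recorded before any replacement took place. `OffsetSketch` and `ReplaceSpans` are names made up for this illustration, and the spans are assumed to be sorted by position and non-overlapping.

```csharp
using System;
using System.Collections.Generic;

public static class OffsetSketch {
    // Each replacement is (begin, end, newText), with begin and end measured
    // in the ORIGINAL text; 'end' is exclusive.
    public static string ReplaceSpans(
        string text, IList<Tuple<int, int, string>> replacements) {

        int offset = 0; // how far later positions have shifted so far
        foreach (Tuple<int, int, string> replacement in replacements) {
            int begin = replacement.Item1 + offset;
            int oldLength = replacement.Item2 - replacement.Item1;

            text = text.Remove(begin, oldLength);
            text = text.Insert(begin, replacement.Item3);

            // remember the length difference for all following spans
            offset += replacement.Item3.Length - oldLength;
        }
        return text;
    }
}
```

Without the running offset, the second replacement would land in the wrong place as soon as the first one changed the length of the text.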

How to Reconstruct the Message

Extracting a message is much easier, because we don't have to care about unused attributes. Loop through the tags and attributes, find a primary key attribute, get its corresponding attribute, and compare their positions. That's all.

/// <summary>Extract a hidden message from an HTML document</summary>
/// <param name="sourceFileName">Path and name of the HTML document</param>
/// <param name="message">Empty stream for the message</param>
/// <param name="keyTable">DataTable with the key attributes</param>
public void Extract(String sourceFileName, Stream message, DataTable keyTable) {

    // ... read the carrier document ...
    // ... list the HTML tags ...
    // ... declarations ...

    foreach (HtmlTag tag in tags) {
        foreach (HtmlAttribute attribute in tag.Attributes) {

            if (!attribute.Handled) { //attribute has not been used, yet

                //find key row for this attribute
                rows =
                   keyTable.Select(String.Format("firstAttribute = '{0}'",
                   attribute.Name));
                if (rows.Length > 0) {

                    //find corresponding attribute
                    secondAttribute = FindAttribute(
                                    rows[0]["secondAttribute"].ToString(),
                                    tag.Attributes);

                    if (secondAttribute != null) {

                        attributePosition = htmlDocument.IndexOf(
                                          attribute.Name,
                                          tag.BeginPosition);

                        secondAttributePosition = htmlDocument.IndexOf(
                                                secondAttribute.Name,
                                                tag.BeginPosition);

                        //compare the attributes' positions
                        messageByte = ExtractBit(
                                    attributePosition,
                                    secondAttributePosition,
                                    messageByte,
                                    bitIndex,
                                    message);

As in the previous articles, the Extract method expects to find the message's length before the actual message begins. Because of a document's limited capacity, the length value is only one byte long, not four.

                        //next bit
                        if (bitIndex == 7) {
                            bitIndex = 0;

                            if ((message.Length == 1) && (messageLength == 0)) {
                                //read length
                                message.Position = 0;
                                BinaryReader binaryReader =
                                              new BinaryReader(message);
                                messageLength = binaryReader.ReadByte();
                                reader = null;
                                message.SetLength(0);
                                message.Position = 0;
                            }
                            else if ((messageLength > 0) &&
                                     (message.Length == messageLength)) {
                                break; //finished
                            }

                        } else {
                            bitIndex++;
                        }

                        //mark both attributes as handled
                        attribute.Handled = true;
                        secondAttribute.Handled = true;
                    }
                }
     // ... skip attributes, exit when finished, and so on ...
}
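
The one-byte length prefix can be illustrated separately from the HTML handling. The following sketch turns a sequence of extracted bits into the message bytes. `LengthPrefixSketch` and `Decode` are hypothetical names, and the most-significant-bit-first order is an assumption of this sketch; the ExtractBit method in the download may order the bits differently.

```csharp
using System;
using System.Collections.Generic;

public static class LengthPrefixSketch {
    // Reads one length byte, then that many message bytes, and stops.
    // Bits are assumed to arrive most significant bit first.
    public static byte[] Decode(IEnumerable<bool> bits) {
        var message = new List<byte>();
        int current = 0;   // bits collected for the current byte
        int count = 0;     // number of bits in 'current'
        int length = -1;   // message length, unknown until the first byte is read

        foreach (bool bit in bits) {
            current = (current << 1) | (bit ? 1 : 0);
            count++;
            if (count < 8) {
                continue;
            }

            if (length < 0) {
                length = current;           // first byte is the length
            } else {
                message.Add((byte)current); // following bytes are the message
            }
            current = 0;
            count = 0;

            if (message.Count == length) {
                break; // finished, ignore any remaining bits
            }
        }
        return message.ToArray();
    }
}
```

For the bits 00000001 01101110 followed by any number of unused bits, the sketch returns a single byte with the value 110, the letter "n".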

Building a Key

The key is no longer a binary file; it is a table of attributes. You should build your key with the key editor and save it to an XML file. The *.zip archive contains two example files that may be useful as key templates.

History

  • 14th November, 2004: Initial post
  • 13th March, 2008: Article updated - bug fixed in source archive

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
