Fun with Unicode

Yang Kok Wah


May 19, 2014

CPOL


Type Unicode characters directly into a text-box, including support for surrogate pairs, and create simple web pages that display fanciful fonts.

Introduction

This article shows a technique that lets you type Unicode characters directly into a text-box without using a dedicated IME (Input Method Editor) or the Character Map Tool. It also discusses surrogate pair encoding and the implementation of a fun tool to create simple web pages that can display fanciful fonts.

Background

Some preliminary concepts:

Unicode code point

A Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used. For example, U+222B is the code point for the mathematical integral symbol "∫". Code points in the other planes need five or six hexadecimal digits. For example, the Aegyptus font used later in this article places its Egyptian hieroglyphs at U+F3000 - U+F4B92, which lies in a supplementary private use plane (the Unicode standard's own Egyptian Hieroglyphs block starts at U+13000).
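The "U+" notation can be produced and parsed with a few lines of C#. The following is only a minimal sketch; the class and method names (CodePointNotation, Format, Parse) are made up for illustration and are not part of the demo program.

```csharp
using System;
using System.Globalization;

static class CodePointNotation
{
    // Formats a code point as "U+" followed by at least 4 hex digits.
    public static string Format(int codePoint)
    {
        return "U+" + codePoint.ToString("X4");
    }

    // Parses "U+XXXX" style text; returns -1 if the text is not a
    // valid code point in the range U+0000 - U+10FFFF.
    public static int Parse(string text)
    {
        if (text == null ||
            !text.StartsWith("U+", StringComparison.OrdinalIgnoreCase))
            return -1;

        int value;
        bool ok = int.TryParse(text.Substring(2),
                               NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture,
                               out value);
        return (ok && value >= 0 && value <= 0x10FFFF) ? value : -1;
    }
}
```

For example, CodePointNotation.Format(0x222B) gives "U+222B", and CodePointNotation.Parse("U+1098A") gives 0x1098A.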

Unicode encoding

Unicode code points can be encoded in several standard encoding formats. The two discussed here are UTF-16 and UTF-8.

UTF-16 encodes most code points in two bytes (one 16-bit code unit); the exception is surrogate pairs. The encoding for U+222B is hexadecimal 22 2B if the byte ordering is big endian and hexadecimal 2B 22 if the ordering is little endian. Code points outside the Basic Multilingual Plane are encoded as two code units of 4 hexadecimal digits each (a surrogate pair). See Surrogate Support in Microsoft Products for more details on how to do the encoding.

UTF-8 is an encoding standard that uses 1 to 4 bytes to encode each Unicode code point.
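To see these byte sequences for yourself, the standard System.Text.Encoding classes can be used. The sketch below is illustrative only; the EncodingDemo class and its method names are hypothetical helpers, not code from the demo.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Renders a byte array as space-separated upper-case hex, e.g. "22 2B".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    // UTF-16 with the high-order byte of each code unit first.
    public static string Utf16BigEndian(string s)
    {
        return Hex(Encoding.BigEndianUnicode.GetBytes(s));
    }

    // UTF-16 with the low-order byte first (the byte order Windows uses).
    public static string Utf16LittleEndian(string s)
    {
        return Hex(Encoding.Unicode.GetBytes(s));
    }

    // UTF-8: 1 to 4 bytes per code point.
    public static string Utf8(string s)
    {
        return Hex(Encoding.UTF8.GetBytes(s));
    }
}
```

For "\u222B" these return "22 2B", "2B 22" and "E2 88 AB" respectively. For a code point outside the BMP, such as char.ConvertFromUtf32(0x2040A), the UTF-16 forms are 4 bytes long and the UTF-8 form is "F0 A0 90 8A".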

Glyphs

These are the graphics used to render, on a display, the character that a Unicode code point represents. Note that for some languages, such as Arabic, the glyph used for the same Unicode code point differs depending on the neighbouring characters.

Fonts

These are collections of glyphs, normally grouped together by language or usage. Each glyph in the font file is mapped to a Unicode code point. For some interesting font files, you may want to visit this site: Unicode Fonts for Ancient Scripts

IME (Input Method Editor)

A language-specific tool used to efficiently enter Unicode characters into a text input interface that supports Unicode. In Windows 7, you can install a new IME via Control Panel -> Region and Language -> Keyboard and Language.

Character Map Tool

A generic tool provided by Microsoft that can produce any character in the Basic Multilingual Plane, which you can then copy and paste into a text input interface that supports Unicode. You can access the tool via Start -> All Programs -> Accessories -> System Tools -> Character Map.

Private Character Editor

A little-known tool that can be used to create and edit characters for the Private Use Area, U+E000 - U+F8FF. This area, which is reserved for private use, can hold 6400 characters. The Private Character Editor can be accessed as c:\windows\system32\eudcedit.exe.

The glyphs created are stored in the files c:\windows\fonts\eudc.euf and c:\windows\fonts\eudc.tte. These files are hidden if you try to access them using Explorer. However, you can copy the files out using the command prompt.

To view the glyphs created, you can use Character Map and search for font: All Fonts (Private Characters). Alternatively, you can use our program developed here.
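As a quick check of the numbers above, the sketch below tests whether a code point falls in a private use range. The PrivateUse class is a hypothetical helper, not part of the demo. Besides the BMP range U+E000 - U+F8FF, it also covers the two supplementary private use planes (U+F0000 - U+FFFFD and U+100000 - U+10FFFD), which is where ranges such as U+F1000 used later in this article live.

```csharp
static class PrivateUse
{
    // Number of code points in the BMP Private Use Area (U+E000 - U+F8FF).
    public const int BmpPrivateUseCount = 0xF8FF - 0xE000 + 1;

    // True if the code point lies in any of the three private use ranges.
    public static bool IsPrivateUse(int codePoint)
    {
        return (codePoint >= 0xE000 && codePoint <= 0xF8FF)        // BMP
            || (codePoint >= 0xF0000 && codePoint <= 0xFFFFD)      // plane 15
            || (codePoint >= 0x100000 && codePoint <= 0x10FFFD);   // plane 16
    }
}
```

PrivateUse.BmpPrivateUseCount evaluates to 6400, matching the capacity quoted above.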

Using the code

The code below performs the main task of generating the character for a Unicode code point. It kicks in when the user types into the bottom text-box (textBox3). It checks whether the key typed is a <space> and whether the preceding characters are in the format "U+####" or "U+#####", and if so replaces these characters with the character for the Unicode code point they represent.

Note that the code works for Basic Multilingual Plane "U+####", as well as all the other planes "U+#####" where each Unicode code point is represented by 5 hexadecimal digits.

You would also need a Unicode font for the text-box. I use Arial Unicode MS, 14.25pt font that comes with Windows 7.

        private void HandleKeyPress(object sender, KeyPressEventArgs e)
        {
            TextBox textbox = (TextBox)sender;
            //System.Diagnostics.Debug.WriteLine(textbox.Text + " " + e.KeyChar);
            string s = "";
            
            if (e.KeyChar == ' ' && textbox.SelectionStart >= 6)
            {
                textbox.SelectedText = "";
                
                //n is the number of chars preceding the cursor position 
                //that we want to analyze

                //U+#### is 6 chars and U+##### is 7 chars
                //if possible we will analyze 7 chars
                //Otherwise if cursor position is < 7, we take n as 6
                int n = (textbox.SelectionStart == 6) ? 6 : 7;

                //take n preceding chars from the cursor position
                //this would be the text that we want to analyze
                s = textbox.Text.Substring(textbox.SelectionStart - n, n);

                //n1 is the position of the search pattern header "U+" 
                int n1 = s.ToUpper().IndexOf("U+");

                // System.Diagnostics.Debug.WriteLine(s);

                //if we found "U+" header for the search patterns
                if (n1 >= 0)
                {
                    //get the chars after the "U+" header
                    //s1 are the following chars up till the cursor position
                    //s1 could be a valid Unicode code point
                    string s1 = s.Substring(n1 + 2, s.Length - (n1 + 2));
                    //System.Diagnostics.Debug.WriteLine(s1);
                    
                    //we attempt to encode s1 in utf16 encoding
                    string s2 = "";
                    unicodepoint2utf16(s1, ref s2);

                    //if we have a valid utf16 encoding 
                    //we get actual character from the utf16 string representation
                    if (s2 != "")
                    {
                        uint d = Convert.ToUInt32(s2, 16);
                        uint maskb0 = Convert.ToUInt32("FF000000", 16);
                        uint maskb1 = Convert.ToUInt32("FF0000", 16);
                        uint maskb2 = Convert.ToUInt32("FF00", 16);
                        uint maskb3 = Convert.ToUInt32("FF", 16);
                        byte b0 = (byte)((d & maskb0) >> 24);
                        byte b1 = (byte)((d & maskb1) >> 16);
                        byte b2 = (byte)((d & maskb2) >> 8);
                        byte b3 = (byte)((d & maskb3));

                        byte[] bytes;

                        //b0 is the highest order byte and 
                        //b3 is the lowest order byte
                        //a code unit is 4 hex digits, ie 2 bytes
                        //b0,b1 is the high order code unit
                        //b2,b3 is the low order code unit

                        //Note that Windows uses Little Endian 
                        //for the byte ordering for char
                        //so to encode the code unit (each a char of 2 bytes)
                        //we have to put the lower order byte to the left
                        //the encoding for the code units would be as follows
                        //high order code unit: b1,b0
                        //low order code unit: b3,b2

                        if (b0 == 0 && b1 == 0)
                            //high order code unit is 0000
                            //we only need use the low order code unit
                            bytes = new byte[] { b3, b2 };
                        else
                            //high order code unit <> 0000
                            //this is a surrogate pair encoding
                            //we need 2 code units
                            //b1, b0 for high order code unit
                            //b3, b2 for low order code unit
                            bytes = new byte[] { b1, b0, b3, b2 };

                        UnicodeEncoding u = new UnicodeEncoding();
                        //generate the character from the byte array
                        //Note that if we send in a surrogate pair encoding of 4 bytes
                        //we would get a double char character
                        //a double char character is rendered as one glyph
                        //but has a length of 2,
                        //if s3 holds a double char character, s3.Length is 2
                        string s3 = u.GetString(bytes);
                        //System.Diagnostics.Debug.WriteLine(s3);
                        //(n-n1) is the number of chars that we want to replace
                        //it is the length of the U+#### or U+#####
                        //we select these chars to be replaced 
                        //by s3 (the Unicode character that we generated)
                        textbox.SelectionStart = textbox.SelectionStart - (n - n1);
                        textbox.SelectionLength = (n - n1);
                        textbox.SelectedText = s3;

                        //we have taken care of the <space> entered
                        //do not further process it
                        e.Handled = true;
                    }
                }
            }
        }

The unicodepoint2utf16() function takes as parameters a Unicode code point string and a ref string that is modified to hold the resulting UTF-16 encoding. The resulting UTF-16 string has 4 hexadecimal digits (for U+####) or 8 (for U+#####). In an 8-digit output string, the first 4 hexadecimal digits and the last 4 hexadecimal digits form the surrogate pair for the UTF-16 encoding. For example, U+2040A is encoded as the pair D841, DC0A. A surrogate pair thus consists of 2 code units, whose ranges are:

High: U+D800 - U+DBFF

Low: U+DC00 - U+DFFF

This encoding standard allows for (DBFF-D800 +1)*(DFFF-DC00+1) = 1048576 code points!
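The arithmetic behind a surrogate pair is short enough to show on its own. The sketch below is a standalone illustration with hypothetical names (SurrogateMath, ToSurrogates, FromSurrogates): subtract 0x10000 from the code point, put the top 10 bits of the result into the high code unit and the bottom 10 bits into the low code unit.

```csharp
using System;

static class SurrogateMath
{
    // Splits a code point in U+10000 - U+10FFFF into its surrogate pair.
    public static void ToSurrogates(int codePoint, out char high, out char low)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;          // a 20-bit value
        high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
    }

    // Recombines a surrogate pair into the original code point.
    public static int FromSurrogates(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For U+2040A, the offset is 0x1040A, so the high code unit is 0xD800 + 0x41 = 0xD841 and the low code unit is 0xDC00 + 0x00A = 0xDC0A, which matches the example above. The built-in char.ConvertFromUtf32() and char.ConvertToUtf32() methods perform the same conversions.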

        //unicode point to utf-16 including surrogate pair encoding
        private void unicodepoint2utf16(string unp, ref string utf16)
        {         
            utf16 = "";

            //test for 5 hexadecimal unicode point
            //remove any leading "0" or spaces
            uint testint=0;
            string simplified_unp = "";
            try
            {
                testint = Convert.ToUInt32(unp, 16);
            }
            catch
            {   //not a hexadecimal
                return;
            }
            
            simplified_unp=testint.ToString("x");
       
            if (simplified_unp.Length == 5)
            {
                try
                {
                    uint d = Convert.ToUInt32(simplified_unp, 16);

                    uint d1 = Convert.ToUInt32("10000", 16);
                    uint d2 = d - d1;
                    uint p1 = d2 >> 10;
                    uint m1 = Convert.ToUInt32("1111111111", 2);
                    uint p2 = d2 & m1;
                    uint d800 = Convert.ToUInt32("d800", 16);
                    uint dc00 = Convert.ToUInt32("dc00", 16);
                    uint s1 = d800 + p1;
                    uint s2 = dc00 + p2;
                    utf16 = s1.ToString("x4") + s2.ToString("x4");
                }
                catch
                {
                    return;
                }
                return;
            }

            //for checking of 4 hexadecimals, we include leading 0 but not leading spaces
            if (unp.Length == 4 && unp.TrimStart(' ').Length ==4)
            {
                try
                {
                    uint d = Convert.ToUInt32(simplified_unp, 16);
                    utf16 = d.ToString("x4");
                }
                catch
                {
                    return;
                }
               
            }
        }
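If you do not need the intermediate UTF-16 hex string, the same result can be reached more directly with the framework's char.ConvertFromUtf32() method. The sketch below is an alternative, not the code used in the demo. The CodePointText class and FromHex method are hypothetical names, and unlike unicodepoint2utf16() it accepts 4 to 6 hexadecimal digits and rejects lone surrogate values.

```csharp
using System;
using System.Globalization;

static class CodePointText
{
    // Returns the character (1 or 2 chars) for a string of 4 to 6 hex
    // digits, or null if the digits do not form a valid code point.
    public static string FromHex(string hex)
    {
        if (hex == null || hex.Length < 4 || hex.Length > 6)
            return null;

        int codePoint;
        if (!int.TryParse(hex, NumberStyles.AllowHexSpecifier,
                          CultureInfo.InvariantCulture, out codePoint))
            return null;   // not hexadecimal (spaces and signs are rejected)

        // ConvertFromUtf32 throws for these values, so filter them first.
        if (codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            return null;

        return char.ConvertFromUtf32(codePoint);
    }
}
```

CodePointText.FromHex("265b") returns a one-char string, while CodePointText.FromHex("2040b") returns a two-char string holding the surrogate pair.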

Basic Demo

When the demo starts, the bottom text-box's content will be : ....U+265b<press space to get the character for this unicode>

Press the <space> bar and the code U+265b will be replaced by the character represented by U+265b. Guess what that is?

You can select "Help" from the combo-box to get help on using the top left text-box.

Below are some of the sample Unicode points that you may like to test out:

CJK (Simplified Chinese, meaning East): U+4E1C. Type U+4e1c followed by a space

Greek (Pi): U+03c0. Type U+03c0 followed by a space

Symbols (White Spade): U+2664. Type U+2664 followed by a space

A Unicode code point with 5 hexadecimal digits: U+2040b. Type U+2040b followed by a space

If you have downloaded the Aegyptus font from Unicode Fonts for Ancient Scripts, you can install it by copying the font file to the Windows Fonts directory at c:\windows\fonts. Then change the font for the text-box to Aegyptus. Double-clicking any of the text-boxes pops up the Font dialog, where you can select the font to assign to the text-box.

You may want to try out the Unicode code points shown in the picture below. For example, to get the character of the owl (top row, 9th item after the first item), the Unicode code point would be U+10980 + 9 (hex for 9) = U+10989. So for the double wave (10th item after the first item), it would be U+10980 + A (hex for 10) = U+1098A. You should be able to work out the Unicode code points for the rest of the figures below quite easily.

Type U+#####<space>. For example, typing U+10989 followed by a space will put the owl into the text-box.
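The offset arithmetic above is plain hexadecimal addition. A tiny sketch (RangeOffset and Label are hypothetical names used only for this illustration):

```csharp
static class RangeOffset
{
    // Returns the "U+" label of the item that sits 'offset' positions
    // after the first item of a range starting at 'rangeStart'.
    public static string Label(int rangeStart, int offset)
    {
        return "U+" + (rangeStart + offset).ToString("X4");
    }
}
```

RangeOffset.Label(0x10980, 9) gives "U+10989" (the owl) and RangeOffset.Label(0x10980, 10) gives "U+1098A" (the double wave).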

To use the other 2 text-boxes:

Type into the top left text-box using the keyboard or an IME. You can also get characters from the Character Map Tool and paste them into this text-box. To find out the Unicode code point for any character, click to the right of the character to set the cursor and a tool-tip will pop up showing the Unicode code point. For more features, select Help from the combo-box.

Click the -> button next to this text-box to display all the UTF-16 encoding to the right text-box.

Similarly, you can type space-separated sets of 4 hexadecimal digits of UTF-16 encoding into the right text-box and click the <-- button to see the characters in the left text-box.
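The conversions behind these two buttons and the tool-tip can be sketched as follows. This is only an approximation of what the demo does, written from the description above. The Utf16Hex class and its method names are hypothetical, and a lower-case hex format is assumed.

```csharp
using System;
using System.Text;

static class Utf16Hex
{
    // "->" direction: one 4-digit hex group per UTF-16 code unit.
    public static string ToHex(string text)
    {
        string[] parts = new string[text.Length];
        for (int i = 0; i < text.Length; i++)
            parts[i] = ((int)text[i]).ToString("x4");
        return string.Join(" ", parts);
    }

    // "<--" direction: space-separated hex groups back to characters.
    public static string FromHex(string hex)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string part in hex.Split(new[] { ' ' },
                                          StringSplitOptions.RemoveEmptyEntries))
            sb.Append((char)Convert.ToUInt16(part, 16));
        return sb.ToString();
    }

    // Tool-tip: code point of the character just left of the cursor,
    // or null if there is none or it is an unpaired surrogate.
    public static string CodePointBefore(string text, int cursor)
    {
        if (cursor <= 0 || cursor > text.Length) return null;
        int index = cursor - 1;
        // Step back onto the high surrogate if we landed on the low one.
        if (index > 0 && char.IsLowSurrogate(text[index]) &&
            char.IsHighSurrogate(text[index - 1]))
            index--;
        if (char.IsSurrogate(text[index]) && !char.IsSurrogatePair(text, index))
            return null;
        return "U+" + char.ConvertToUtf32(text, index).ToString("X4");
    }
}
```

Note that a character outside the BMP occupies two positions in the string, so the cursor index after it is 2 higher than the index before it.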

These are the Unicode groups that you can select from the combo-box, each shown with its code point range and the font, size and style used to display it:

Meroitic U+10980 - U+109ff Aegyptus,36,BOLD

Hieroglyphs U+f3000 - U+f4b92 Aegyptus,36,BOLD

Chinese U+4e00 - U+9fa5 Arial Unicode MS,14,REGULAR

Phaistos Disc1 U+F01D0 - U+F01E7 Aegean,36,REGULAR

Phaistos Disc2 U+F0200 - U+F0247 Aegean,36,REGULAR

Cypro-Minoan U+F1000 - U+F1136 Aegean,36,REGULAR

Cypriot Syllabary U+F1700 - U+F1853 Aegean,36,REGULAR

A whole list of other groups has been added. See the Top picture.

Advanced Demo

For this demo, you need both the Aegyptus and the Aegean fonts. They can be downloaded via the links at the top of this article.

After you have downloaded and installed these fonts, you should be able to display all the glyphs for each of the Unicode ranges above.

However, a Windows text-box can only be assigned one font at a time, and currently there is no universal font that supports every possible Unicode code point.

If you type U+10980<space>U+F1000<space> in the bottom text-box, at least one of the two characters will not be displayed correctly. This is because the glyph for U+10980 is found in the Aegyptus font and the glyph for U+F1000 is found in the Aegean font. If you assign the Aegean font to the text-box, U+10980 will not display correctly, and if you assign the Aegyptus font, U+F1000 will not display correctly. Unless you can find a font that supports both of these code points, you do not have a solution.

Ah... but we can use a rich text-box control, right? No. The current version of the rich text-box control does not support surrogate pair encoding, although it supports multiple fonts. Both U+10980 and U+F1000 are encoded using surrogate pairs, so we would not be able to use the rich text-box to display these characters.

One solution to this problem is to use a web browser control. The current version of the web browser control supports surrogate pair encoding. To display characters in the Unicode range correctly, we put the characters within <div> or <span> tags with the correct font assigned in the CSS style for these tags.

<span style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>&#x@unicode@;</b></span> 

<div style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>@block@</b></div>

The above are the templates we use to generate the <span> and <div> tags. We can replace the placeholders (those @xx@ items) using the getHTMLformatEntry(string s1, string font) function below.

        string getHTMLformatEntry(string s1, string font)
        {
/*
    <span style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>&#x@unicode@;</b></span>

*/
            string s = Resource1.sSpan_Template;
            string f = font;
            string[] vf = f.Split(',');
            s=s.Replace("@font@", vf[0]);
            int font_size = (int.Parse(vf[1]) * 3) / 2;
            s = s.Replace("@font-size@", font_size+"");
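            string[] colors = new string[] {"red","green","blue","magenta",
                                            "cyan","black","orange","pink" };
            //the upper bound of Random.Next is exclusive, so pass colors.Length
            //to give every color (including the last one) a chance
            Random r = new Random();
            string color = colors[r.Next(colors.Length)];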
            s=s.Replace("@color@",color);
            s=s.Replace("@unicode@", s1);
            if (vf[2] != "BOLD")
            {
                s = s.Replace("<b>", "");
                s= s.Replace("</b>","");
            }
            return s;
        }

For instance, if we want to display U+F1000, we pass as parameters

s1: "f1000"

font: "Aegean,36,REGULAR"

The output would be:

<span style="font-family:Aegean;color:magenta;font-size:54px">&#xf1000;</span>

The color is randomly assigned, but the rest of the placeholders are replaced by data from the input parameters. The <b> tags are removed because the font style is not BOLD.

Similarly we can replace the placeholders in the <div> template.
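A possible way to fill the <div> template is sketched below. The article does not show this function, so everything here is an assumption: the DivTemplate class name, passing the color in as a parameter, reusing the same 3/2 font-size scaling as getHTMLformatEntry(), and escaping the characters &, < and > in the block text so that they cannot break the markup.

```csharp
static class DivTemplate
{
    // Copy of the <div> template shown earlier in this article.
    const string Template =
        "<div style=\"font-family:@font@;color:@color@;" +
        "font-size:@font-size@px\"><b>@block@</b></div>";

    // font is "name,size,style", e.g. "Aegean,36,REGULAR".
    public static string Fill(string block, string font, string color)
    {
        string[] vf = font.Split(',');
        int fontSize = (int.Parse(vf[1]) * 3) / 2;

        string s = Template.Replace("@font@", vf[0])
                           .Replace("@color@", color)
                           .Replace("@font-size@", fontSize.ToString());
        if (vf[2] != "BOLD")
            s = s.Replace("<b>", "").Replace("</b>", "");

        // Escape markup characters, then insert the text last so that
        // nothing inside it is mistaken for a placeholder or a tag.
        string escaped = block.Replace("&", "&amp;")
                              .Replace("<", "&lt;")
                              .Replace(">", "&gt;");
        return s.Replace("@block@", escaped);
    }
}
```

For example, DivTemplate.Fill("hi", "Aegyptus,36,BOLD", "red") yields a bold <div> in the Aegyptus font at 54px.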

The main difference between the <div> tag and the <span> tag is that a <div> takes up the entire line in the web page (unless we use a table and cells). If we want 2 characters with different fonts to sit side by side, we use <span> tags. The <div> tag is used for a block of characters that all have the same font.

Steps for this Demo

1) Type a message in the bottom text-box on the left

2) Click the -> button next to this text-box

3) Select the "Cypro-Minoan U+F1000 - U+F1136" Unicode range from the combo-box

4) Hold the Alt key and left-click the first character in the top left text-box

5) Select the "Meroitic U+10980 - U+109ff" Unicode range from the combo-box

6) Hold the Alt key and left-click the first character in the top left text-box

Analysis and Explanation

In step 2 when the -> button is clicked, we make use of the <div> template to generate the <div> tag as shown below:

<div style="font-family:Arial Unicode MS;color:black;font-size:14.25px"><b>Demo:
Putting "U+F1000" Aegean font with
"U+10980" side by side</b></div>

In step 4, the mouse click sets the cursor position just after the intended character, and we get the Unicode code point of that character, in this case "f1000". The Alt key indicates that we also want to paste the character into the web browser control. We call the getHTMLformatEntry() function, passing in this code point and the current font ("Aegean,36,REGULAR"), to create the tag below:

<span style="font-family:Aegean;color:cyan;font-size:54px">&#xf1000;</span>

Similarly, step 6 also generates a <span> tag, but now the font is different, and the tag below is generated:

<span style="font-family:Aegyptus;color:black;font-size:54px"><b>&#x10980;</b></span>

Besides these 2 templates, we have another template that we use to create the entire HTML page. The placeholder @@ is replaced by the concatenation of all the previously generated <div> and <span> tags. As tags are generated, we store them in the global variable htmlelements. Replacing @@ with the content of htmlelements gives us a well-formed HTML page that we can use to update the web browser control:

<!DOCTYPE html><html><body>@@</body></html>
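The page assembly can be sketched as below. The article only names the htmlelements variable, so the List<string> type used here, the PageBuilder class and its method names are assumptions made for illustration.

```csharp
using System.Collections.Generic;

static class PageBuilder
{
    // Copy of the page template shown above.
    const string PageTemplate = "<!DOCTYPE html><html><body>@@</body></html>";

    // Concatenates all stored tags and drops them into the page template.
    public static string Build(IEnumerable<string> htmlElements)
    {
        return PageTemplate.Replace("@@", string.Concat(htmlElements));
    }

    // Removes the most recently added tag, if there is one.
    public static void RemoveLast(List<string> htmlElements)
    {
        if (htmlElements.Count > 0)
            htmlElements.RemoveAt(htmlElements.Count - 1);
    }
}
```

In a Windows Forms program, the string returned by Build() could then be handed to the web browser control, for instance through its DocumentText property. Keeping the tags in a list rather than in one long string makes removing the last inserted item a one-line operation.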

After you have completed all 6 steps, click the "View Source" button to view the HTML page in Notepad. The file is created in the current directory and the default name is temp.html.txt. Rename it to temp.html to view the page in any web browser.

Alternatively you can just click "View in External Browser" button to launch the page directly to the default web browser in your system.

I have tested the page created on IE 8 and Chrome successfully. If the referenced fonts are installed in your Windows system, the page should be rendered correctly, as the newer browsers mostly support surrogate pair encoding.

You can also click "Remove Last Insert" to remove the last item you inserted into the web browser.

Finally click "Clear" to clear the content of the web browser control.

Points of Interest

1) The code fragment that enables direct Unicode typing in a text-box is small and simple enough that you can easily include it in your project. To enable this feature in any text-box:

        //To enable Unicode processing
        //**************************************************************
        private UnicodeProcessing uniprocessing = new UnicodeProcessing();
        //***************************************************************

        //To enable you to type unicode directly to the text-box 
        //*****************************************************************************
        textBox3.KeyPress += new KeyPressEventHandler(uniprocessing.HandleKeyPress);
        //******************************************************************************

2) With Version 3, you can create fanciful web pages that have all those interesting glyphs.

Have fun!

History

19 May 2014: Version 1

21 May 2014: Version 2: Add support for surrogate pairs.

23 May 2014: Version 2d: Add in a combo box to select Unicode Range

24 May 2014: Version 3: Add in web browser control to allow for multiple fonts support

26 May 2014: Version 3b: Encapsulate all Unicode processing functions, making it easier to reuse these features. Add more features to HTML processing in the demo, allowing deletion of the last insert. Fix bug to handle leading spaces and commas in the HTML page

28 May 2014: Version 3c: Added in extensive list of character groupings, including the private area U+e000 - U+f8ff. Also include discussion on the Private Character Editor, eudcedit.exe.
