Split grapheme in C#

DebugST

Nov 9, 2021

MIT

Splits strings into grapheme clusters based on [Unicode 14.0.0]. The lookup code can be regenerated automatically from newer versions of the Unicode data files.

Background

First, I would like to thank my friend netero, who helped me a lot in completing this code.

While processing strings, we found that we could not reliably get the length of a string as a user would count it. We looked through a lot of material and found a lot of related code, but the results were not very good: the Unicode data that code was based on was too old, so many characters were handled incorrectly. So we decided to implement this function ourselves.

This is the project: https://github.com/DebugST/STGraphemeSplitter

Cases

string strText = "abc";
Console.WriteLine(strText.Length); // output is: 3

// But when there are some special characters, such as emoji...

strText = "👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈";
Console.WriteLine(strText.Length); // output is: 22

It can be seen that the desired result is 3, but the result is 22. Why is that?

Character Clusters

Character clusters refer to text elements that people intuitively and cognitively consider to be individual characters. A character cluster may be an abstract character, or it may be composed of multiple abstract characters. Character clusters should be the basic unit of text operations.

The reason is that .NET, like many other runtimes, stores strings in memory as UTF-16, so the length it reports is the number of 16-bit code units, not the number of characters. A single 16-bit code unit can only hold values in 0x0000-0xFFFF, which is 65,536 values. That range is not even large enough to hold all Chinese characters.

Coding Range

So the Unicode Consortium came up with a solution: surrogates. Not every value in 0x0000-0xFFFF is used to encode a character directly.

Instead, a block of 2,048 values is set aside as surrogates.

0xD800-0xDBFF are high surrogates. 0xDC00-0xDFFF are low surrogates.

In well-formed text, a high surrogate is always followed by a low surrogate. The low 10 bits of each are combined into a 20-bit value, and 0x10000 is added to get the code point. This makes another 1,048,576 code points available.

So such a character requires two 16-bit code units (four bytes).

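private static int GetCodePoint(string strText, int nIndex) {
    // Not a high surrogate: the code unit is the code point itself.
    if (!char.IsHighSurrogate(strText, nIndex)) {
        return strText[nIndex];
    }
    // A high surrogate at the end of the string has no partner.
    if (nIndex + 1 >= strText.Length) {
        return 0;
    }
    // Combine the low 10 bits of each surrogate and add 0x10000.
    return ((strText[nIndex] & 0x03FF) << 10) + (strText[nIndex + 1] & 0x03FF) + 0x10000;
}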
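To make the arithmetic concrete, here is a minimal sketch of the opposite direction: turning a code point back into a surrogate pair. The names SurrogateSketch, ToSurrogatePair and FromSurrogatePair are illustrative only and are not part of STGraphemeSplitter. The built-in char.ConvertFromUtf32 and char.ConvertToUtf32 do the same job and can be used to cross-check the result.

```csharp
using System;

public static class SurrogateSketch
{
    // Splits a supplementary code point (0x10000-0x10FFFF) into a
    // two-character string holding the high and low surrogate.
    public static string ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x03FF));  // bottom 10 bits
        return new string(new[] { high, low });
    }

    // Same arithmetic as GetCodePoint above, for a pair already known to be valid.
    public static int FromSurrogatePair(char high, char low)
    {
        return ((high & 0x03FF) << 10) + (low & 0x03FF) + 0x10000;
    }
}
```

For U+1F469 (👩), ToSurrogatePair returns the two code units 0xD83D and 0xDC69, and FromSurrogatePair turns them back into 0x1F469.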

As mentioned above, a [high surrogate] is followed by a [low surrogate], so is a character at most two code units, that is, four bytes? No, that is not how it works. Code points in different ranges have different properties, and Unicode determines character clusters based on these properties.
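A quick way to see that surrogate pairs alone do not explain the number 22 is to count code points instead of code units. The sketch below is illustrative (CodePointSketch and CountCodePoints are made-up names, not library APIs) and writes the three emoji from the Cases section with escapes so that the invisible joiners are visible in the source.

```csharp
using System;

public static class CodePointSketch
{
    // The three emoji from the Cases section: woman with red hair,
    // family (woman, woman, boy, boy) and rainbow flag.
    public const string Emoji =
        "\U0001F469\u200D\U0001F9B0" +
        "\U0001F469\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466" +
        "\U0001F3F3\uFE0F\u200D\U0001F308";

    // Counts Unicode code points by treating each valid surrogate pair as one.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the low half of the pair
            }
            count++;
        }
        return count;
    }
}
```

CodePointSketch.Emoji.Length is 22, while CountCodePoints returns 14. That is still not 3, because each emoji here is several code points joined by U+200D (ZWJ).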

Take a very common example: [\r\n]

Most programming languages treat it as two characters, and it is indeed two characters.

But to a human reader, whether it is [\r\n] or [\n], it is always one character: [new line].

So [\r\n] is one character in human perception, not two.

If [\r\n] is not handled as one unit, the following happens:

string strA = "A\r\nB";
string strB = new string(strA.Reverse().ToArray()); // "B\n\rA" (requires using System.Linq)

This is not the result we want. The result we want is "B\r\nA", and Unicode defines exactly this in rule [GB3]: https://www.unicode.org/reports/tr29/#GB3.

Do not break between a CR and LF. Otherwise, break before and after controls.
GB3                     CR   ×   LF
GB4    (Control | CR | LF)   ÷      
GB5                          ÷   (Control | CR | LF)
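As a minimal sketch of what GB3 means in practice, the following reverses a string unit by unit while keeping CR LF together. It handles only this one rule and works on UTF-16 code units, so it is not a replacement for a real grapheme splitter. CrLfSketch and ReverseKeepingCrLf are illustrative names.

```csharp
using System;
using System.Collections.Generic;

public static class CrLfSketch
{
    // Reverses text one unit at a time, where "\r\n" counts as a single unit (GB3 only).
    public static string ReverseKeepingCrLf(string text)
    {
        var units = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\r' && i + 1 < text.Length && text[i + 1] == '\n')
            {
                units.Add("\r\n"); // do not break between CR and LF
                i++;
            }
            else
            {
                units.Add(text[i].ToString());
            }
        }
        units.Reverse(); // in-place List<T>.Reverse
        return string.Concat(units);
    }
}
```

With this, reversing "A\r\nB" gives "B\r\nA" instead of "B\n\rA".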

Characters can also be combined with others, such as: [ā]

It looks like one character, but it is actually a combination of two code points. [a + ̄ = ā] -> "a\u0304"

This is how the 0x0300-0x036F interval is defined in Unicode:

0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X

So "\u0304" has [Extend] attribute, and [Extend] is defined as follows in the split rule:

Do not break before extending characters or ZWJ.
GB9                          ×    (Extend | ZWJ)
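The effect of GB9 can be sketched with a deliberately tiny splitter that knows only two facts: U+0300-U+036F is Extend and U+200D is ZWJ. Everything else about it is an assumption made for illustration. It ignores all other rules and properties, and it works on UTF-16 code units rather than code points. ExtendSketch is a made-up name.

```csharp
using System;
using System.Collections.Generic;

public static class ExtendSketch
{
    // Tiny subset of the real data: only U+0300-U+036F (Extend) and U+200D (ZWJ).
    static bool IsExtendOrZwj(char c)
    {
        return (c >= '\u0300' && c <= '\u036F') || c == '\u200D';
    }

    // Splits text, never breaking before an Extend or ZWJ character (GB9 only).
    public static List<string> Split(string text)
    {
        var clusters = new List<string>();
        int start = 0;
        for (int i = 1; i <= text.Length; i++)
        {
            // Break at the end of the text, or before anything that is not Extend/ZWJ.
            if (i == text.Length || !IsExtendOrZwj(text[i]))
            {
                clusters.Add(text.Substring(start, i - start));
                start = i;
            }
        }
        return clusters;
    }
}
```

Splitting "a\u0304b" this way gives two clusters, "a\u0304" and "b", even though the string has three code units.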

Unicode defines many attributes, and the attributes used to determine the segmentation are as follows:

CR, LF, Control, L, V, LV, LVT, T, 
Extend, ZWJ, SpacingMark, Prepend, Extended_Pictographic, RI

These attribute distribution intervals are also defined by Unicode:

https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
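One common way to use such a list of ranges is a sorted table with a binary search. The sketch below shows the idea with five entries taken from that file. It is not the lookup code the project generates, and the type and member names (BreakProperty, CodeRange, RangeLookupSketch) are hypothetical.

```csharp
using System;

public enum BreakProperty { Other, CR, LF, Extend, ZWJ, Regional_Indicator }

public struct CodeRange
{
    public readonly int Start;
    public readonly int End;
    public readonly BreakProperty Property;

    public CodeRange(int start, int end, BreakProperty property)
    {
        Start = start;
        End = end;
        Property = property;
    }
}

public static class RangeLookupSketch
{
    // A handful of ranges, sorted by Start. The real file has many more.
    static readonly CodeRange[] Ranges =
    {
        new CodeRange(0x000A, 0x000A, BreakProperty.LF),
        new CodeRange(0x000D, 0x000D, BreakProperty.CR),
        new CodeRange(0x0300, 0x036F, BreakProperty.Extend),
        new CodeRange(0x200D, 0x200D, BreakProperty.ZWJ),
        new CodeRange(0x1F1E6, 0x1F1FF, BreakProperty.Regional_Indicator),
    };

    // Binary search over the sorted, non-overlapping ranges.
    public static BreakProperty Lookup(int codePoint)
    {
        int lo = 0, hi = Ranges.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (codePoint < Ranges[mid].Start) hi = mid - 1;
            else if (codePoint > Ranges[mid].End) lo = mid + 1;
            else return Ranges[mid].Property;
        }
        return BreakProperty.Other; // not listed in this tiny table
    }
}
```

For example, Lookup(0x0304) returns Extend, and Lookup('a') returns Other.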

And, the standard to determine whether these characters should be combined is here:

https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

This code is written in accordance with Unicode 14.0.0, the latest version at the time of writing. When Unicode is updated, the project's code generation function can regenerate the lookup code from the new data files. For example:

/// <summary>
/// Build the [GetGraphemeBreakProperty] function and [m_lst_code_range]
/// The current [GetGraphemeBreakProperty] and [m_lst_code_range] were created from:
/// https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
/// https://www.unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
/// The [Extended_Pictographic] type is not in [GraphemeBreakProperty.txt(14.0.0)],
/// so append [emoji-data.txt] to [GraphemeBreakProperty.txt] to create the code
/// </summary>
/// <param name="strText">The text of [GraphemeBreakProperty.txt]</param>
/// <returns>Code</returns>
public static string CreateBreakPropertyCodeFromText(string strText);
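The first step of any such generator is reading the data lines, which look like the "0300..036F ; Extend # ..." line quoted earlier. The following is a minimal sketch of parsing one line. It is not the body of CreateBreakPropertyCodeFromText, and PropertyLineSketch and TryParseLine are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class PropertyLineSketch
{
    // Parses one data line such as "0300..036F    ; Extend # Mn [112] ..."
    // or "000D          ; CR # Cc ...". Returns false for blank and comment-only lines.
    public static bool TryParseLine(string line, out int start, out int end, out string property)
    {
        start = 0;
        end = 0;
        property = string.Empty;

        int hash = line.IndexOf('#');
        if (hash >= 0) line = line.Substring(0, hash); // drop the trailing comment

        string[] parts = line.Split(';');
        if (parts.Length != 2) return false;

        string range = parts[0].Trim();
        property = parts[1].Trim();

        // A range is either "XXXX..YYYY" or a single code point "XXXX".
        int dots = range.IndexOf("..", StringComparison.Ordinal);
        string first = dots < 0 ? range : range.Substring(0, dots);
        string last = dots < 0 ? range : range.Substring(dots + 2);

        return int.TryParse(first, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out start)
            && int.TryParse(last, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out end);
    }
}
```

Each successfully parsed line yields a start, an end and a property name, which a generator could then emit as entries of a range table.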

Demo

string strText = "汉字👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈Abc";
List<string> lst = STGraphemeSplitter.Split(strText);
Console.WriteLine(string.Join(",", lst.ToArray())); //Output: 汉,字,👩‍🦰,👩‍👩‍👦‍👦,🏳️‍🌈,A,b,c

int nLen = STGraphemeSplitter.GetLength(strText);   //Only get length.

foreach (var v in STGraphemeSplitter.GetEnumerator(strText)) {
    Console.WriteLine(v);
}

STGraphemeSplitter.Each(strText, (str, nStart, nLength) => { //faster
    Console.WriteLine(str.Substring(nStart, nLength));
});

//If the above is still not fast enough, create a cache before use.
//An array cache is relatively fast but takes up a lot of memory.
STGraphemeSplitter.CreateArrayCache();
//A dictionary cache is relatively slow but takes up less memory.
STGraphemeSplitter.CreateDictionaryCache();
STGraphemeSplitter.ClearCache();                //Clear all caches

History

9th November, 2021: Initial version