Unicode, UTF-16, .NET string, code point outside the BMP (Basic Multilingual Plane)

Question

Dear experts,
I'm trying to understand .NET strings and Unicode in full detail. From what I have read, .NET strings are UTF-16 encoded.

Based on this knowledge (or assumption), I tried to see how code points outside the BMP are handled by .NET strings, and failed on my very first experiment.

My code

C#

int music = 0x1D161; // U+1D161 = MUSICAL SYMBOL SIXTEENTH NOTE
string s1 = Char.ConvertFromUtf32(music);
textBox1.Text = s1;

With the above code I expected
a.) to see the musical symbol in the text box, but I see only a square;
b.) s1.Length to return 1 (even though the code point needs two code units, a surrogate pair?), but Length returns 2.

Can anybody explain where I'm wrong?

Thank you very much in advance.
Bruno

Posted 15-Jan-15 9:12am

User 11060979

Updated 15-Jan-15 9:42am

v2


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Accepted Answer · 2015-01-15T09:57:00

Solution 1

You should never assume any particular UTF encoding. The whole .NET API is well abstracted from this representation. Remember that with all UTFs except UTF-32, characters are represented using a varying number of bytes. With UTF-16, characters beyond the BMP are encoded using surrogate pairs. You can serialize them using the Encoding class. As to the Unicode code points, they should be understood as pure mathematical integers in natural order, fully abstracted from their computer representation.
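
The point about variable-width encodings can be checked with the Encoding class. The following is a minimal sketch, assuming a console application; the names EncodingSizes and ByteCount are made up for illustration and are not part of .NET.

```csharp
using System;
using System.Text;

static class EncodingSizes
{
    // Number of bytes needed to serialize 'text' in the given encoding (no BOM included).
    public static int ByteCount(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        string latinA = "A";                          // U+0041, inside the BMP
        string note = char.ConvertFromUtf32(0x1D161); // U+1D161, outside the BMP

        foreach (string text in new[] { latinA, note })
        {
            // Encoding.Unicode is the .NET name for UTF-16LE.
            Console.WriteLine("UTF-8: {0}, UTF-16: {1}, UTF-32: {2} bytes",
                ByteCount(Encoding.UTF8, text),
                ByteCount(Encoding.Unicode, text),
                ByteCount(Encoding.UTF32, text));
        }
        // Prints 1, 2, 4 bytes for "A" and 4, 4, 4 bytes for U+1D161.
    }
}
```

Only UTF-32 uses the same number of bytes for both characters; UTF-8 and UTF-16 grow once the code point leaves the BMP.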

As to your particular problem, your approach is correct, because UTF-32LE encodes every value in the range 0..10FFFF exactly as the code point itself. However, I have never seen a Windows font supporting this range for musical notation (http://unicode.org/charts/PDF/U1D100.pdf). Maybe this is the only problem.

As to the second question: yes, the length 1 is correct. The property returns the number of characters, not the number of 16-bit words. You converted the code point to a surrogate pair, right? And this is one character.

—SA

Posted 15-Jan-15 9:57am

Sergey Alexandrovich Kryukov

Comments

[no name] 15-Jan-15 16:05pm

Dear Sergey
Thank you very much for your answer.

"You should never assume any particular UTF encoding": But when I read the MS help, it states that the string class is UTF-16 encoded, and nothing else. That confuses me once again ;)

"Fonts": Great hint, thanks. I will check whether this is my only problem.

"Length":
Length returns 2; I expected 1.

At the moment I don't know how to write this in HTML, but I'm pretty sure that a browser will display the right character.

Simply fighting with the basics.
Thank you again
Bruno

Sergey Alexandrovich Kryukov 15-Jan-15 16:23pm

Again, the API is well abstracted from UTF-16. Look at the class Encoding: it takes care of conversion between bytes and characters, as well as of the real length of text in characters. Will the browser display it? Why are you sure? I just tried, and it's not shown...

You know, characters beyond the BMP are quite rare; I personally almost never come across them. For a long time, NT-based Windows (where Unicode was introduced) did not support them, then supported them partially, and so on. But I used to download some fonts supporting code points beyond the BMP, and it worked. Nevertheless, I have never seen such a font bundled with Windows by default (not 100% sure, though). You can probably find a special font, or even create one (too big a deal, I know), and use it...

Anyway, will you formally accept this answer?

—SA

[no name] 15-Jan-15 16:31pm

Thank you.
Yes, I will accept your answer.
But please give me a hint:
a.) why s1.Length returns 2, which seems to be wrong...
b.) what the syntax (or maybe procedure) is to put such a code point in HTML.

And no, the API (I assume you are also referring to the string class) _is not abstracted_ from UTF-16... I think... until now... I am still trying to understand it :-)

"You know, characters beyond BMP are quite rare...":
That is not an argument. Programmers who used only ASCII 32...127 said the same thing some years ago. If the range beyond the BMP is available, I think I have to take care of it.

Thank you
Bruno

Sergey Alexandrovich Kryukov 15-Jan-15 17:07pm

a) You told me that s1.Length returns 1. Checking... Yes, it's 2. Looks wrong to me, too. Wow...
b) &#x11111; in hex, &#111111; in decimal...
Well, even if it is not completely abstracted, it's still safer to use the abstract approach.
—SA
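
For question b), an HTML numeric character reference is written as "&#x" followed by hexadecimal digits, or "&#" followed by decimal digits, and terminated by ";". Below is a small sketch that builds both forms for a given code point; the class and method names (HtmlReference, ToHexReference, ToDecimalReference) are hypothetical helpers, not an existing API.

```csharp
using System;
using System.Globalization;

static class HtmlReference
{
    // Hexadecimal numeric character reference for a code point.
    public static string ToHexReference(int codePoint)
    {
        return "&#x" + codePoint.ToString("X", CultureInfo.InvariantCulture) + ";";
    }

    // Decimal numeric character reference for a code point.
    public static string ToDecimalReference(int codePoint)
    {
        return "&#" + codePoint.ToString(CultureInfo.InvariantCulture) + ";";
    }

    public static void Main()
    {
        int music = 0x1D161; // MUSICAL SYMBOL SIXTEENTH NOTE, from the question
        Console.WriteLine(ToHexReference(music));     // &#x1D161;
        Console.WriteLine(ToDecimalReference(music)); // &#119137;
    }
}
```

Whether the browser then shows the note or a placeholder square still depends on an installed font covering that range.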

[no name] 15-Jan-15 17:16pm

a.) I never said that s1.Length returns 1, but yes, my question describes it rather awkwardly.

b.) Thanks, I will try this.

I'm just trying to understand the real basics; that's why I sometimes ask strange questions.

My 5 and accepted... and I'm sure I will come back with other strange questions about this :-)
Bruno

Sergey Alexandrovich Kryukov 15-Jan-15 17:55pm

Will be glad to hear from you again :-)
—SA

[no name] 16-Jan-15 6:36am

I have now read through the MSDN documentation again and found:

"The Length property of a string represents the number of Char objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the StringInfo object."

I can no longer find the help page where I read that Length returns the number of code points... maybe that was only in my head :-)

Anyway, I'm happy that it works like this (number of Char objects); otherwise the confusion would be even bigger, at least for me.

Bruno
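
The difference described in the MSDN quote can be seen side by side. This is a minimal sketch assuming a console application; LengthDemo and TextElementCount are illustrative names only. Note that StringInfo counts text elements, so a base character followed by combining marks is also reported as one element.

```csharp
using System;
using System.Globalization;

static class LengthDemo
{
    // Number of text elements in 'text', as reported by StringInfo.
    public static int TextElementCount(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    public static void Main()
    {
        string s1 = char.ConvertFromUtf32(0x1D161);
        Console.WriteLine(s1.Length);            // 2: two Char objects (UTF-16 code units)
        Console.WriteLine(TextElementCount(s1)); // 1: one text element
    }
}
```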

Sergey Alexandrovich Kryukov 16-Jan-15 11:04am

Note that Char.ConvertFromUtf32 returns a string, not a character, which is already weird. Maybe this particular method is limited to the BMP, but then it would make little sense. All right, there are other ways to input this character.
—SA

Sergey Alexandrovich Kryukov 16-Jan-15 11:20am

I see. Look:
char.ConvertFromUtf32(0x1d161) returns a string of two "characters": d834 and dd61.
By range, this is a high surrogate followed by a low surrogate. So, this is wrong. The string is still a sequence of two words, wrongly treated as characters. Now, you can make an array of 4 bytes, 34, d8, 61, dd, and read it as a UTF-16LE string. According to Unicode, it should be exactly one character, but...
—SA
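
The values d834 and dd61 follow directly from the UTF-16 rule for supplementary code points: subtract 0x10000, then put the upper 10 bits on top of 0xD800 and the lower 10 bits on top of 0xDC00. A sketch of that arithmetic, with the hypothetical helper name ToSurrogatePair, plus the 4-byte UTF-16LE check described above:

```csharp
using System;
using System.Text;

static class SurrogateMath
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;             // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));  // upper 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF)); // lower 10 bits
        return new[] { high, low };
    }

    public static void Main()
    {
        char[] pair = ToSurrogatePair(0x1D161);
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]); // D834 DD61

        // The same pair as UTF-16LE bytes: low byte first within each 16-bit unit.
        string fromBytes = Encoding.Unicode.GetString(new byte[] { 0x34, 0xD8, 0x61, 0xDD });
        Console.WriteLine(fromBytes == char.ConvertFromUtf32(0x1D161)); // True
    }
}
```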

Sergey Alexandrovich Kryukov 16-Jan-15 11:49am

Let's see:
int count = Encoding.UTF32.GetCharCount(new byte[] { 0x61, 0xd1, 1, 0 });
It also returns 2 (!) A round trip shows that they are surrogates:

char[] chars = Encoding.UTF32.GetChars(new byte[] { 0x61, 0xd1, 1, 0 }); // 2 chars!
bool isLowSurrogate = char.IsLowSurrogate(chars[1]); // true!
bool isHighSurrogate = char.IsHighSurrogate(chars[0]); // true!

So, I have to agree: strings are not really well abstracted from the encoding, and this is revealed when you go outside the BMP. This was probably done for ease of implementation. In the UI, though, valid surrogate pairs are shown as a single character.

—SA
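
Since the string exposes UTF-16 code units, code that needs code points has to combine surrogate pairs itself. One possible way is sketched below using char.IsSurrogatePair and char.ConvertToUtf32; the class CodePoints and its Enumerate method are assumptions made for this example, not framework members.

```csharp
using System;
using System.Collections.Generic;

static class CodePoints
{
    // Returns the code points of 'text', merging each valid surrogate pair into one value.
    public static List<int> Enumerate(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                result.Add(char.ConvertToUtf32(text, i));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP code unit; an unpaired surrogate is passed through unchanged.
                result.Add(text[i]);
            }
        }
        return result;
    }

    public static void Main()
    {
        string text = "A" + char.ConvertFromUtf32(0x1D161);
        Console.WriteLine(text.Length);        // 3 Char objects
        foreach (int cp in Enumerate(text))
            Console.WriteLine("U+{0:X4}", cp); // U+0041, then U+1D161
    }
}
```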

[no name] 16-Jan-15 12:14pm

Thank you very much for this and the time to care that much about my question!
Well, a "well abstracted string", in the sense we are discussing here, would I think have a serious performance impact. Furthermore, it would make it necessary to extend Char to four bytes.

So I have to take back my comment on your argument some comments above: "You know, characters beyond BMP are quite rare..."

Thank you very much again.
Bruno

Sergey Alexandrovich Kryukov 16-Jan-15 14:31pm

Are you developing a music application (this is something I do sometimes, just a bit)? Maybe you would be better off without any such characters? The Unicode range you are trying to use is not popular enough at this time...
—SA

[no name] 16-Jan-15 14:56pm

No, I'm not developing music applications *). I chose this code fragment from MSDN because of my affinity for music (I play drums and a little piano and guitar), and because I recognize musical notes much better than Egyptian hieroglyphs :-)

*) I have only tried some very basic MIDI things, and a tool to help blind people cut wav/mp3 files. It was a very interesting experience to "show" a "cut mark" audibly.
Bruno

Sergey Alexandrovich Kryukov 16-Jan-15 15:44pm

But this is the same thing I meant. I have some such interests, too.

I also do stuff like that, but very little of it. I've done some microtonal research (a very interesting topic), then started an unrelated thing: developing an analyzer of chord structure with chord/fingering recognition and generation, assisted by MIDI. Eventually I realized that I can do it nearly as fast almost entirely in my head, with the benefit of generating additional associations and ideas along the way, so I stopped this project; perhaps it mostly stimulated my education :-) I also developed some ID tag processing for my own use (based on some CodeProject code, by the way, as was the MIDI work), and stuff like that.

And recently I published a very simple but practical sound recorder which helped me a lot to do some rehearsals; other recorders are no good to use when the hands are busy, so I implemented convenient sound activation. If it can help you, too, you are welcome to try it out; please see my CodeProject article list.

—SA