Dear experts
I'm trying to understand .NET strings and Unicode in detail. From what I have read, .NET strings are UTF-16 encoded.

Based on this "knowledge"/"assumption", I tried to see how code points outside the BMP are handled by .NET strings, and failed on my very first experiment.

My code
C#
int music = 0x1D161; //U+1D161 = MUSICAL SYMBOL SIXTEENTH NOTE
string s1;
s1= Char.ConvertFromUtf32(music);
textBox1.Text = s1;

With the above code I expected
a.) to see the musical symbol in the text box, but I see only a square
b.) s1.Length to return 1 (even though the code point needs two code units, a surrogate pair?), but Length returns 2

Can anybody explain where I'm wrong?

Thank you very much in advance.
Bruno
Updated 15-Jan-15 9:42am

1 solution

You should never assume any particular UTF encoding. The whole .NET API is well abstracted from this representation. Remember that in all UTFs except UTF-32, characters are represented using a varying number of bytes. In UTF-16, characters beyond the BMP are encoded using surrogate pairs. You can serialize them using the Encoding class. As to Unicode code points, they should be understood as pure mathematical integers, in natural order, fully abstracted from their representation in a computer.
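To make the point about varying byte counts concrete, here is a minimal sketch (not part of the original answer) that asks the Encoding class how many bytes a string occupies in UTF-8, UTF-16 and UTF-32. The class name EncodingSizes and the helper ByteCount are made up for illustration; Encoding.GetByteCount itself is the standard .NET call.

```csharp
using System;
using System.Text;

static class EncodingSizes
{
    // Hypothetical helper: number of bytes 'text' occupies in 'encoding'
    // (GetByteCount does not include a byte order mark).
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        string note = char.ConvertFromUtf32(0x1D161); // U+1D161, outside the BMP

        Console.WriteLine(ByteCount(note, Encoding.UTF8));    // 4
        Console.WriteLine(ByteCount(note, Encoding.Unicode)); // 4 (UTF-16: two 16-bit code units)
        Console.WriteLine(ByteCount(note, Encoding.UTF32));   // 4

        Console.WriteLine(ByteCount("A", Encoding.UTF8));     // 1
        Console.WriteLine(ByteCount("A", Encoding.Unicode));  // 2
        Console.WriteLine(ByteCount("A", Encoding.UTF32));    // 4
    }
}
```

Only UTF-32 gives the same size for both strings, which is the sense in which the other encodings are variable-length.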

As to your particular problem, your approach is correct, because UTF-32LE encodes every code point in the range 0..10FFFF exactly as the code point value itself. However, I have never seen a Windows font supporting this range for musical notation (http://unicode.org/charts/PDF/U1D100.pdf). Maybe this is the only problem.
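A small sketch, added here for illustration, shows what "encoded exactly as the code point" looks like in practice, next to the UTF-16 form of the same code point. CodePointLayout, Utf32LeHex and Utf16UnitsHex are hypothetical names; the calls they use (Char.ConvertFromUtf32, Encoding.UTF32, BitConverter.ToString) are standard .NET.

```csharp
using System;
using System.Text;

static class CodePointLayout
{
    // Hex dump of the UTF-32 little-endian bytes of a single code point.
    public static string Utf32LeHex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        byte[] bytes = Encoding.UTF32.GetBytes(s); // Encoding.UTF32 is little-endian, no BOM from GetBytes
        return BitConverter.ToString(bytes);
    }

    // Hex values of the UTF-16 code units that .NET stores for the same code point.
    public static string Utf16UnitsHex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        var parts = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            parts[i] = ((int)s[i]).ToString("X4");
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        Console.WriteLine(Utf32LeHex(0x1D161));    // 61-D1-01-00 (0x0001D161, least significant byte first)
        Console.WriteLine(Utf16UnitsHex(0x1D161)); // D834 DD61 (high and low surrogate)
        Console.WriteLine(Utf16UnitsHex(0x41));    // 0041
    }
}
```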

As to the second question: yes, the length 1 is correct. The property returns the number of characters, not 16-bit words. You converted the code point to a surrogate pair, right? And this is one character.

—SA
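For anyone who wants to check the Length question directly, here is a minimal sketch (not part of the original answer). String.Length is documented as the number of Char objects in the string, that is, UTF-16 code units, so a surrogate pair counts as 2. To count code points or user-visible text elements, something extra is needed. CodePointCount is a hypothetical helper written for this example; StringInfo.LengthInTextElements and Char.IsSurrogatePair are standard .NET members.

```csharp
using System;
using System.Globalization;

static class StringCounts
{
    // Hypothetical helper: counts Unicode code points by treating each
    // valid high/low surrogate pair as a single code point.
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++; // skip the low surrogate of this pair
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s1 = char.ConvertFromUtf32(0x1D161); // MUSICAL SYMBOL SIXTEENTH NOTE

        Console.WriteLine(s1.Length);                                 // 2 (UTF-16 code units)
        Console.WriteLine(char.IsSurrogatePair(s1, 0));               // True
        Console.WriteLine(CodePointCount(s1));                        // 1
        Console.WriteLine(new StringInfo(s1).LengthInTextElements);   // 1
    }
}
```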
 
Comments
[no name] 15-Jan-15 16:05pm    
Dear Sergey
Thank you very much for your answer.

"You should never assume any particular UTF encoding": But when I read the MS help, it states that the string class is UTF-16 encoded, and nothing else. That confuses me once again ;)

"Fonts": Great hint, thanks. I will check whether this is my only problem.

"Length":
Length returns 2; I expected 1.


At the moment I don't know how to write an HTML page, but I'm pretty sure that a browser will display the right character.

Simply fighting with the basics.
Thank you again
Bruno
Sergey Alexandrovich Kryukov 15-Jan-15 16:23pm    
Again, the API is well abstracted from UTF-16. Look at the Encoding class; it takes care of conversion between bytes and characters, as well as the real lengths of text in characters. Will the browser display it? Why are you sure? I just tried, and it is not shown...

You know, characters beyond the BMP are quite rare; I personally almost never come across them. For a long time, NT-based Windows (where Unicode was introduced) did not support them, then supported them partially, and so on. But I have downloaded fonts supporting code points beyond the BMP, and it worked. Nevertheless, I have never seen such a font bundled with Windows by default (not 100% sure, though). You can probably find or even create (too big a deal, I know) a special font and use it...

Anyway, will you formally accept this answer?

—SA
[no name] 15-Jan-15 16:31pm    
Thank you.
Yes, I will accept your answer.
But please give me a hint:
a.) why s1.Length returns 2, which seems to be wrong...
b.) what is the syntax (or maybe procedure) to put such a code point in HTML

And no, the API (I assume you are also referring to the string class) _is not abstracted_ from UTF-16... I think... so far... I am still trying to understand it :-)

"You know, characters beyond BMP are quite rare...":
That is not an argument. Programmers who used ASCII 32...127 said the same some years ago. If it is available (beyond the BMP), I think I have to take care of it.



Thank you
Bruno
Sergey Alexandrovich Kryukov 15-Jan-15 17:07pm    
a) You told me that s1.Length returns 1. Checking up... Yes, it's 2. Looks wrong to me, too. Wow...
b) &#x11111; in hex, &#111111; in decimal...
Well, if it is not completely abstracted, it's still safer to use the abstract approach.
—SA
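Applied to the code point from the question, the HTML numeric character reference syntax can be generated like this. This is a sketch added for illustration: HtmlReference, HexReference and DecimalReference are hypothetical names, and the range check simply assumes the Unicode limit of 0x10FFFF.

```csharp
using System;
using System.Globalization;

static class HtmlReference
{
    // Throws if the value is not inside the Unicode code space.
    private static void Check(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF)
        {
            throw new ArgumentOutOfRangeException("codePoint");
        }
    }

    // Hexadecimal form: &#x followed by hex digits and a semicolon.
    public static string HexReference(int codePoint)
    {
        Check(codePoint);
        return "&#x" + codePoint.ToString("X", CultureInfo.InvariantCulture) + ";";
    }

    // Decimal form: &# followed by decimal digits and a semicolon.
    public static string DecimalReference(int codePoint)
    {
        Check(codePoint);
        return "&#" + codePoint.ToString(CultureInfo.InvariantCulture) + ";";
    }

    public static void Main()
    {
        Console.WriteLine(HexReference(0x1D161));     // &#x1D161;
        Console.WriteLine(DecimalReference(0x1D161)); // &#119137;
    }
}
```

Whether the browser then shows the note or a placeholder box still depends on an installed font that covers the Musical Symbols block.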
[no name] 15-Jan-15 17:16pm    
a.) I never said that s1.Length returns 1, but yes, in my question it is described in a somewhat awkward/hidden way.

b.) Thanks, I will try this.


I am just trying to understand the real basics; that's why there are sometimes strange questions from my side.

My 5 and accepted... and I'm sure I will come back with other strange questions about this :-)
Bruno

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


