I have an input string that contains composite Unicode characters, like:

"Leppӓnen" == "\x004c\x0065\x0070\x0070\x04d3\x006e\x0065\x006e"

I want to convert this to use the precomposed characters, i.e.:

"Leppӓnen" == "\x004c\x0065\x0070\x0070\x00e4\x006e\x0065\x006e"

I have tried:

- String.Normalize() and String.Normalize(NormalizationForm)
- kernel32.dll!WideCharToMultiByte(...)

My last resort will be writing a method to manually look for the normalized versions of these characters and substitute the precomposed characters, but I was hoping there was a framework or Win32 function to do this.

If you have no idea what I'm talking about, see: http://en.wikipedia.org/wiki/Unicode_equivalence
To see the character sets I'm talking about, see: http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
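For readers unfamiliar with the forms involved: the thread's code is .NET, but the same Unicode normalization forms can be sketched with Python's `unicodedata` (both follow the Unicode standard). This is an illustration only, not the poster's code. It also shows the subtlety that trips up the example above: U+04D3 is a *Cyrillic* letter, so canonical normalization will never map it to the Latin U+00E4.

```python
import unicodedata

# Decomposed Latin "ä": base letter 'a' (U+0061) plus combining diaeresis U+0308.
decomposed = "Leppa\u0308nen"    # 9 code points
precomposed = "Lepp\u00e4nen"    # 8 code points, single U+00E4

# NFC composes the combining sequence into the single precomposed character.
assert unicodedata.normalize("NFC", decomposed) == precomposed

# U+04D3 is CYRILLIC SMALL LETTER A WITH DIAERESIS -- a different base letter.
# Canonical normalization never maps it to the Latin U+00E4, which is why a
# Normalize()-style call appears to "do nothing" for that input.
cyrillic = "Lepp\u04d3nen"
assert unicodedata.normalize("NFC", cyrillic) == cyrillic
assert unicodedata.normalize("NFC", cyrillic) != precomposed
```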
Updated 1-Sep-11 3:33am (v3)
Comments
Sergey Alexandrovich Kryukov 31-Aug-11 23:16pm    
Interesting question on this boring topic. One note: I don't think your example is correct in its numeric part. These two variants can hardly be an example of composite vs. precomposed, simply because they formally have the same number of code points.
I tried the two forms in a text editor -- they are rendered correctly, but not recognized as equal (maybe I should have used a different comparison). Sorry, I don't know how to automate the conversion with any ready-to-use methods. By the way, why do you need to convert them to the precomposed form? -- just curious, I have never faced such a problem.
--SA
Yvan Rodrigues 1-Sep-11 9:00am    
You're right, technically the first one is not composite, but rather is a 2-byte precomposed form; whereas the second form is a 1-byte precomposed form (I didn't want to make the topic even MORE boring, but you made me :). The first form is the "standard normalized" version, the second is the "legacy normalized" version. Many Western European accented characters have a legacy normalized form. These were created so that 8-bit systems had a fighting chance of being able to easily interpret most Unicode characters used in the West.

They are the same character but you are correct that the framework (and most Unicode implementations) are not great at evaluating all forms of equivalence.
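The equivalence point above can be shown directly: ordinal comparison treats differently-composed strings as unequal, and canonical equivalence only appears once both sides are brought to the same form (a minimal sketch using Python's `unicodedata`; in .NET, `String.Normalize` plays the analogous role before comparison).

```python
import unicodedata

a = "Lepp\u00e4nen"      # precomposed U+00E4
b = "Leppa\u0308nen"     # 'a' + combining diaeresis U+0308

# Ordinal comparison sees two different code point sequences.
assert a != b

# Canonically equivalent once both sides are normalized to the same form.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(a) == nfc(b)
```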

The reason I need this is that I'm using the iTextSharp library to generate some PDFs and these characters only render correctly if I use the 8-bit form. I believe this is actually the font's fault, not iTextSharp's, but most commercially available fonts don't seem to render the first form of the character correctly -- they just omit it.
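For the PDF/font problem described here, one workaround is to compose first (NFC) and then encode to an 8-bit codepage, so the font receives a single precomposed byte rather than a combining pair. A sketch under that assumption, in Python (in .NET the analogous calls would be `Normalize(NormalizationForm.FormC)` followed by `Encoding.GetEncoding(1252).GetBytes(...)`):

```python
import unicodedata

name = "Leppa\u0308nen"                        # decomposed input
composed = unicodedata.normalize("NFC", name)  # -> "Lepp\u00e4nen"

# After composition the string fits an 8-bit Western codepage, so the font
# sees one precomposed byte (0xE4) instead of a combining pair.
# Encoding the decomposed form directly would raise UnicodeEncodeError,
# because the combining diaeresis U+0308 has no Latin-1 byte.
data = composed.encode("latin-1")
assert data == b"Lepp\xe4nen"
```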

1 solution

You can try the WideCharToMultiByte function from Windows unmanaged code. Reference: http://msdn.microsoft.com/en-us/library/dd374130%28v=vs.85%29.aspx
Comments
Yvan Rodrigues 1-Sep-11 8:51am    
Yeah, I tried that (see above), but if I passed CP_UTF8 and WC_COMPOSITECHECK I would get a Windows error 78 (bad parameters). I also tried other codepages like 1200, 1201 and 65001 but they all resulted in the same error.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)