I have an input string that contains composite Unicode characters, like:

"Leppӓnen" == "\x004c\x0065\x0070\x0070\x04d3\x006e\x0065\x006e"

I want to convert this to use the precomposed characters, i.e.:

"Leppӓnen" == "\x004c\x0065\x0070\x0070\x00e4\x006e\x0065\x006e"

I have tried:

- String.Normalize() and String.Normalize(NormalizationForm)
- kernel32.dll!WideCharToMultiByte(...)

My last resort will be writing a method to manually look for the normalized versions of these characters and substitute the precomposed characters, but I was hoping there was a framework or Win32 function to do this.

If you have no idea what I'm talking about, see: http://en.wikipedia.org/wiki/Unicode_equivalence
To see the character sets I'm talking about, see: http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
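For readers unfamiliar with the forms involved: the thread's code is .NET, but the same Unicode normalization forms can be sketched with Python's `unicodedata` (both follow the Unicode standard). This is an illustration only, not the poster's code. It also shows the subtlety that trips up the example above: U+04D3 is a *Cyrillic* letter, so canonical normalization will never map it to the Latin U+00E4.

```python
import unicodedata

# Decomposed Latin "ä": base letter 'a' (U+0061) plus combining diaeresis U+0308.
decomposed = "Leppa\u0308nen"    # 9 code points
precomposed = "Lepp\u00e4nen"    # 8 code points, single U+00E4

# NFC composes the combining sequence into the single precomposed character.
assert unicodedata.normalize("NFC", decomposed) == precomposed

# U+04D3 is CYRILLIC SMALL LETTER A WITH DIAERESIS -- a different base letter.
# Canonical normalization never maps it to the Latin U+00E4, which is why a
# Normalize()-style call appears to "do nothing" for that input.
cyrillic = "Lepp\u04d3nen"
assert unicodedata.normalize("NFC", cyrillic) == cyrillic
assert unicodedata.normalize("NFC", cyrillic) != precomposed
```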
Updated 1-Sep-11 3:33am (v3)
Comments
Sergey Alexandrovich Kryukov 31-Aug-11 23:16pm    
Interesting question on this boring topic. One note: I don't think your example is correct in its numeric part. These two variants can hardly be an example of composite vs. precomposed, simply because they formally have the same number of code points.
I tried the two forms in a text editor -- they are rendered correctly, but not recognized as equal (maybe I should have used a different comparison). Sorry, I don't know how to automate the conversion with any ready-to-use methods. By the way, why do you need to convert them to the precomposed form? -- just curious, I have never faced such a problem.
--SA
Yvan Rodrigues 1-Sep-11 9:00am    
You're right, technically the first one is not composite, but rather is a 2-byte precomposed form; whereas the second form is a 1-byte precomposed form (I didn't want to make the topic even MORE boring, but you made me :). The first form is the "standard normalized" version, the second is the "legacy normalized" version. Many Western European accented characters have a legacy normalized form. These were created so that 8-bit systems had a fighting chance of being able to easily interpret most Unicode characters used in the West.

They are the same character but you are correct that the framework (and most Unicode implementations) are not great at evaluating all forms of equivalence.
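The equivalence point above can be shown directly: ordinal comparison treats differently-composed strings as unequal, and canonical equivalence only appears once both sides are brought to the same form (a minimal sketch using Python's `unicodedata`; in .NET, `String.Normalize` plays the analogous role before comparison).

```python
import unicodedata

a = "Lepp\u00e4nen"      # precomposed U+00E4
b = "Leppa\u0308nen"     # 'a' + combining diaeresis U+0308

# Ordinal comparison sees two different code point sequences.
assert a != b

# Canonically equivalent once both sides are normalized to the same form.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(a) == nfc(b)
```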

The reason I need this is that I'm using the iTextSharp library to generate some PDFs and these characters only render correctly if I use the 8-bit form. I believe this is actually the font's fault, not iTextSharp's, but most commercially available fonts don't seem to render the first form of the character correctly -- they just omit it.
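For the PDF/font problem described here, one workaround is to compose first (NFC) and then encode to an 8-bit codepage, so the font receives a single precomposed byte rather than a combining pair. A sketch under that assumption, in Python (in .NET the analogous calls would be `Normalize(NormalizationForm.FormC)` followed by `Encoding.GetEncoding(1252).GetBytes(...)`):

```python
import unicodedata

name = "Leppa\u0308nen"                        # decomposed input
composed = unicodedata.normalize("NFC", name)  # -> "Lepp\u00e4nen"

# After composition the string fits an 8-bit Western codepage, so the font
# sees one precomposed byte (0xE4) instead of a combining pair.
# Encoding the decomposed form directly would raise UnicodeEncodeError,
# because the combining diaeresis U+0308 has no Latin-1 byte.
data = composed.encode("latin-1")
assert data == b"Lepp\xe4nen"
```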

1 solution

You can try the WideCharToMultiByte function from Windows unmanaged code. Reference: http://msdn.microsoft.com/en-us/library/dd374130%28v=vs.85%29.aspx
Comments
Yvan Rodrigues 1-Sep-11 8:51am    
Yeah, I tried that (see above), but if I passed CP_UTF8 and WC_COMPOSITECHECK I would get a Windows error 78 (bad parameters). I also tried other codepages like 1200, 1201 and 65001 but they all resulted in the same error.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)