Unknown character in UTF-8 encoded text

Question

I have a file which contains some data. That data is encoded in UTF-8 (without a BOM).

Those bytes are usually no problem to handle. Yet now there is a byte sequence in that file that I can't interpret (and I couldn't find any information about it either).

To examine the data I opened the file in a hex editor. Most UTF-8 sequences were perfectly normal (C3 BC for ü, C3 B6 for ö, etc.).

Yet then there was the following sequence, and I don't know how it maps to the expected character:

C3 83 EF BF BF

From the context I can gather that it should represent the character ü. Yet I've no idea how you could possibly get to that sequence...

Example of how this looks in the file (Hex View):

54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22 
75 65 22 20 2D 3E 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75 
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72 
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3 
BC

Actual text (UTF-8):

Test with char "ue" -> 

Now that strange sequence: Ã it should probably represent the char ü

(Well, looks like CP won't let me display the decoded value of EF BF BF ;) )

I've highlighted the corresponding sections in the Hex View and their representation in the text view.

Now the question:

What should C3 83 EF BF BF represent? I suppose C3 83 translates fine to Ã, but what is EF BF BF? The only thing I found was that if you convert the char 0xFFFF to UTF-8, EF BF BF is the byte sequence that you get. But still: what exactly should it represent?
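
For readers who want to inspect such a sequence themselves, here is a minimal Python sketch (not part of the original post) that decodes the five bytes and prints each resulting code point. It assumes nothing beyond the byte values quoted above; the variable names are only illustrative.

```python
import unicodedata

# The five bytes from the hex view above.
raw = bytes.fromhex("C3 83 EF BF BF")

# Python's UTF-8 decoder accepts noncharacters such as U+FFFF.
text = raw.decode("utf-8")

for ch in text:
    # unicodedata.name() takes a default for code points without a name.
    name = unicodedata.name(ch, "<no name>")
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X}  {category}  {name}")
```

The first code point is U+00C3 (Ã). The second is U+FFFF, which has no character name and is reported with the general category `Cn` (unassigned).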

Posted 3-Dec-13 23:38pm

Nicholas Marty

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Pascal-78 · Accepted Answer · 2013-12-04T01:43:00

Solution 1

I think your sequence C3 83 EF BF BF is the result of a second UTF-8 encoding applied to the "ANSI" sequence C3 BC.

Let me explain:
1) Converting the character C3 (Ã, U+00C3) to UTF-8 gives C3 83.
2) If BC is not defined in the code page, the Unicode result might be FF FF.
3) Encoding that Unicode result as UTF-8 generates EF BF BF.

In conclusion:
C3 BC is converted to Unicode using a code page (I don't know which one, but not UTF-8).
This results in C3 00 FF FF, that is U+00C3 U+FFFF written as UTF-16 little-endian bytes (because BC is not defined in the code page used).
That result is then encoded from Unicode to UTF-8 as
C3 83 EF BF BF

I think the error is in the program generating your source file.
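
The chain described above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the helper `misread_and_reencode` and the set `UNDEFINED_BYTES` are hypothetical, and the real code page used by the faulty program is unknown. The sketch simply treats every byte as its own code point, except for 0xBC, which it maps to U+FFFF.

```python
# Assumption: the unknown code page has no mapping for byte 0xBC.
UNDEFINED_BYTES = {0xBC}

def misread_and_reencode(raw: bytes) -> bytes:
    """Misread UTF-8 bytes as a single-byte code page, then encode as UTF-8."""
    chars = []
    for b in raw:
        if b in UNDEFINED_BYTES:
            chars.append("\uffff")  # the faulty "unknown character" value
        else:
            chars.append(chr(b))    # byte value taken as the code point
    return "".join(chars).encode("utf-8")

original = "ü".encode("utf-8")      # b'\xc3\xbc'
mangled = misread_and_reencode(original)
print(mangled.hex(" ").upper())     # C3 83 EF BF BF
```

If the same helper is applied to the bytes of ö (C3 B6), it returns C3 83 C2 B6, because 0xB6 is not in the assumed set of undefined bytes.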

Posted 4-Dec-13 1:43am

Pascal-78

Comments

Nicholas Marty 4-Dec-13 7:53am

Yeah, most likely the program generating that file is at fault somewhere.

My missing link was that I didn't know (or maybe also forgot) that when converting an unknown character this might result in "FF FF". (I was at least pretty close in finding out that FF FF translates to EF BF BF in UTF-8 ;) )

Your explanation would very well explain the problem here. So thanks for that :)

Pascal-78 4-Dec-13 9:01am

In fact, U+FFFD is the specific code point for the "replacement character", used to replace an unknown or unrepresentable character, and "FF FF" (U+FFFF) is not allowed as a Unicode character. Maybe it's another mistake of the program generating the file.
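
To make the difference between the two code points concrete, here is a small Python sketch (an illustration only, not taken from the thread). It compares the UTF-8 bytes of U+FFFD and U+FFFF, and shows that Python's own decoder substitutes U+FFFD when asked to replace undecodable bytes.

```python
# UTF-8 encodings of the replacement character and of the noncharacter U+FFFF.
replacement = "\ufffd".encode("utf-8")
noncharacter = "\uffff".encode("utf-8")
print(replacement.hex(" ").upper())    # EF BF BD
print(noncharacter.hex(" ").upper())   # EF BF BF

# With errors="replace", each undecodable byte becomes U+FFFD.
decoded = b"\xc3\xbc".decode("ascii", errors="replace")
print([f"U+{ord(c):04X}" for c in decoded])   # ['U+FFFD', 'U+FFFD']
```

A well-behaved converter would therefore have produced EF BF BD rather than EF BF BF for the unknown byte.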