I have a file which contains some data. That data is encoded in UTF-8 (without a BOM).
Those bytes are usually no problem to handle. Yet now there is a byte sequence in that file that I can't interpret, and I couldn't find any information about it either.
To examine the data I opened the file in a hex editor. There were UTF-8 char sequences which were pretty normal (
C3 BC
for
ü
and
C3 B6
for
ö
etc.)
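For reference, the "normal" sequences can be reproduced with a few lines of Python. This is only a minimal sketch: Python is my choice for illustration, and `utf8_hex` is a made-up helper name, not something from the file or its producer.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of `text` as upper-case, space-separated hex."""
    # utf8_hex is a hypothetical helper, introduced only for this illustration.
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    for ch in ("ü", "ö"):
        # Prints the character followed by its UTF-8 byte sequence in hex.
        print(ch, utf8_hex(ch))
```

Both characters come out as two-byte sequences starting with C3, which matches what the hex editor shows.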
But then there was the following sequence, which I don't know how to map to the expected char:
C3 83 EF BF BF
From the context I can gather that it
should represent the character ü. Yet I've no idea how you could possibly get to that sequence...
Example of how this looks in the file (Hex View):
54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22
75 65 22 20 2D 3E 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3
BC
Actual text (UTF-8):
Test with char "ue" ->
Now that strange sequence: Ã it should probably represent the char ü
(Well, it looks like CP won't let me display the decoded value of EF BF BF ;) )
I've highlighted the corresponding sections in the Hex View and their representation in the Text View.
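To double-check the Hex View against the text view, the dump can be decoded programmatically. The following is a rough Python sketch, assuming the bytes are exactly the ones listed in the Hex View; `hex_dump`, `decode_dump` and `non_ascii_code_points` are names I made up for the illustration.

```python
# Bytes copied from the Hex View of the file.
hex_dump = """
54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22
75 65 22 20 2D 3E 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3
BC
"""


def decode_dump(dump: str) -> str:
    """Turn a whitespace-separated hex dump into text, assuming UTF-8."""
    # Remove all whitespace first so fromhex() only sees hex digits.
    raw = bytes.fromhex("".join(dump.split()))
    return raw.decode("utf-8")


def non_ascii_code_points(text: str) -> list[str]:
    """List the code points above U+007F in order of appearance."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0x7F]


if __name__ == "__main__":
    text = decode_dump(hex_dump)
    # repr() keeps characters that cannot be displayed visible as escapes.
    print(repr(text))
    print(non_ascii_code_points(text))
```

The list of code points makes the strange part visible even where the forum or a terminal cannot render the character itself.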
Now the question:
What should
C3 83 EF BF BF
represent? I suppose
C3 83
translates okay to
Ã
but what is
EF BF BF
? The only thing I found was that if you convert the char 0xFFFF to UTF-8
EF BF BF
is the byte sequence that you get. But still: what exactly should it represent?
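Both observations (C3 83 decoding to Ã, and 0xFFFF encoding to EF BF BF) can be checked with a short Python sketch. The helper `utf8_three_byte` is hypothetical and only spells out the standard three-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx); it is meant as an illustration, not as a full encoder.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.

    Illustration only: the surrogate range U+D800..U+DFFF is not rejected here.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point does not use the three-byte form")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


if __name__ == "__main__":
    # C3 83 on its own decodes to U+00C3 (Ã).
    print(bytes.fromhex("C3 83").decode("utf-8"))
    # Manual encoding of 0xFFFF, printed as hex.
    print(utf8_three_byte(0xFFFF).hex(" ").upper())
    # Cross-check against Python's built-in encoder.
    print("\uffff".encode("utf-8") == utf8_three_byte(0xFFFF))
```

So the sequence in the file decodes cleanly as the two code points U+00C3 and U+FFFF; what I still don't understand is how they ended up there in place of ü.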