Hi,

I am stuck on a big issue related to encoding. I have a database column which stores XML as well as plain strings. Previously the data was stored in UTF-8, but now it has been changed to UTF-16, so we need to read both the old and the new data, which are in different formats. I need a way to find the encoding of a byte array. Please provide a solution as soon as possible; it is very urgent.

Thanks
Akanksha

Strictly speaking, there is no regular, 100% certain way to tell the UTF from an encoded array of bytes. You have made a fatal mistake and increased the entropy of the system. This mistake is theoretically not reversible, in the same sense that the entropy of a closed system cannot be reduced.

A serialized Unicode string can be represented as two components: an array of bytes, which can be obtained from System.Text.Encoding.GetBytes(string), and the information about the encoding itself. You can think of this piece of information as a reference to a concrete run-time Encoding class. Please see:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx[^].
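To see why the byte array alone is ambiguous, here is a quick sketch in Python (the same logic ports directly to Encoding.UTF8.GetBytes / Encoding.Unicode.GetBytes in .NET): the same string serializes to entirely different byte arrays depending on the encoding chosen, and nothing in the bytes themselves records which one was used.

```python
# The same string, two different serialized forms.
s = "déjà"

utf8_bytes = s.encode("utf-8")       # 6 bytes: ASCII letters take 1 byte, é/à take 2
utf16_bytes = s.encode("utf-16-le")  # 8 bytes: every BMP character takes 2 bytes

# Without out-of-band encoding information, a reader of the column
# cannot tell which of these two byte arrays represents which format.
print(len(utf8_bytes), len(utf16_bytes))
```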

The part of the Unicode standard dedicated to UTFs suggests a standard mechanism for keeping the encoding information with the string data. This is a certain sequence of bytes called the BOM (Byte Order Mark), different for each UTF, which allows for unambiguous detection of the UTF encoding. Please see:
http://en.wikipedia.org/wiki/Byte_order_mark[^],
http://unicode.org/faq/utf_bom.html[^].
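Had a BOM been written with each value, detection would be trivial. A minimal sketch in Python (the byte signatures are those defined by the Unicode standard; note the UTF-32LE check must precede UTF-16LE, since the latter's BOM is a prefix of the former's):

```python
# Detect a Unicode encoding from its BOM (Byte Order Mark), if one is present.
# Returns a codec name, or None when there is no BOM -- the hard case in this thread.

def detect_bom(data: bytes):
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),  # checked before utf-16-le on purpose
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None
```

For BOM-less data, as in the question, this returns None and the statistical methods below are the fallback.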

It looks like you failed to use this or any other such mechanism, with fatal consequences. This is the second case of a failure like this that I have come across at CodeProject.

I can see some ways to fix it, but it will take some labor. First, a trained human eye can easily detect the encoding just by looking at the bytes rendered in some form. And if the same array of bytes is deserialized into text using two or more different encodings, anyone familiar with the writing system (language) can tell which encoding was correct in no time.

You can automate this process. To do that, you should have a dictionary (or dictionaries) of the languages used in the text and perform a statistical analysis of the text deserialized with each hypothetical encoding. The right encoding would be the one which yields more matches between the text's lexemes and the dictionary entries. You will need to do some research to establish how valid the decisions are at different match levels (say, percentages of matches), using the judgement of a human operator.

When this is done and a confidence level is established, you will need to pass the whole database through this system. In all questionable cases (you should develop the criterion for a certain vs. questionable case in your experimental research), the final decision should be made by a human operator. The problem will be solved when all the texts are converted to a single UTF. Alternatively, a BOM could be added to each value, but I would recommend converting everything to UTF-8 as the result of the fix.
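The dictionary-match idea above can be sketched as follows. This is a simplified illustration, not a production detector: the word list, the naive whitespace tokenization, and the 60% confidence threshold are all assumptions to be tuned in the experimental research described above.

```python
# Sketch: guess the encoding of a byte array by decoding it under each
# candidate encoding and counting dictionary matches among the words.

CANDIDATES = ["utf-8", "utf-16-le"]

def guess_encoding(data: bytes, dictionary: set, threshold: float = 0.6):
    best_enc, best_ratio = None, 0.0
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # a decode failure already rules this candidate out
        words = text.lower().split()
        if not words:
            continue
        hits = sum(w.strip(".,;:!?") in dictionary for w in words)
        ratio = hits / len(words)
        if ratio > best_ratio:
            best_enc, best_ratio = enc, ratio
    # Below the threshold, defer to a human operator, as suggested above.
    return best_enc if best_ratio >= threshold else None
```

Note that ASCII-heavy UTF-16LE text often decodes "successfully" as UTF-8 (the interleaved NUL bytes are valid UTF-8), which is exactly why the dictionary match, not mere decodability, must be the criterion.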

You would need to be much more careful with fusional or agglutinative languages, because in those cases you would need to extract the roots or other lexical units of a word (morphemes) for comparison with a dictionary, which is, in the general case, a serious task in both linguistics and computing. Please see:
http://en.wikipedia.org/wiki/Agglutinative_language[^],
http://en.wikipedia.org/wiki/Fusional_language[^],
http://en.wikipedia.org/wiki/Word_root[^],
http://en.wikipedia.org/wiki/Morpheme[^].

[EDIT]

Actually, there is one more simple statistical criterion which should work pretty well in many situations, provided the data uses either UTF-8 or UTF-16LE.

It will work well if most of the code points fall into one to three Unicode sub-ranges, which usually happens when there is one dominating language in the text. Usually, many characters have code points within the ASCII range, and most of the others fall into the same Unicode sub-range. In UTF-16LE, the sub-range is indicated by the high byte of each 16-bit word (32-bit characters beyond the BMP are rare, so I don't consider them). Therefore, if you look only at the distribution of the high bytes, they will fall into one or two main modes, more rarely three or more. So if you find one mode covering, say, 30% or more of the bytes, the data is more likely UTF-16LE than UTF-8, where the byte distribution also shows modes, but less distinct ones.
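A minimal sketch of this high-byte-mode heuristic in Python. The 30% threshold mirrors the figure suggested above and is a tunable assumption; a real classifier would compare the mode strength against the same statistic computed under the UTF-8 hypothesis rather than use a single cutoff.

```python
from collections import Counter

# Heuristic: in UTF-16LE text dominated by one language, the high byte of
# each 16-bit code unit clusters into one or two values (e.g. 0x00 for
# Latin-script text), producing a single strong mode in the distribution.

def looks_like_utf16le(data: bytes, threshold: float = 0.3) -> bool:
    if not data or len(data) % 2 != 0:
        return False  # UTF-16 text must have an even, non-zero byte count
    high_bytes = data[1::2]  # high byte of each little-endian 16-bit word
    counts = Counter(high_bytes)
    top_fraction = counts.most_common(1)[0][1] / len(high_bytes)
    return top_fraction >= threshold
```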

This trick won't work well on writing systems whose characters span wide Unicode ranges, such as Chinese logograms or Korean Hangul, but it should show good results on most other languages: Western European, Slavic, Ugric, Georgian, Armenian, Arabo-Persian, and of course those using Brahmic scripts (most of the numerous Indian writing systems, Thai, etc.), and many more.

[END EDIT]

Overall, the creation of this software could be done quite quickly, but the research and the final conversion of the data could be more or less expensive. Next time, use your head before doing your work.

—SA
 
Comments
kiquenet.com 22-Oct-13 9:16am    
any final solution with full source code sample working about it ?
Sergey Alexandrovich Kryukov 22-Oct-13 11:38am    
Are you kidding? Why should I do so much work for the OP? I answered the question in detail; it should be quite enough...
—SA
kiquenet.com 22-Oct-13 12:28pm    
Thanks, it's great :-)
Sergey Alexandrovich Kryukov 22-Oct-13 13:39pm    
You are welcome. :-)
—SA
kiquenet.com 23-Oct-13 2:47am    
What's about http://stackoverflow.com/a/19464728/206730 ?
How about discriminating UTF-8/UTF-16 based on the table entry date?
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


