Hi,

I am stuck on a big issue related to encoding. I have a database column which stores XML as well as plain strings. Previously the data was stored in UTF-8, but now it has been changed to UTF-16, so we need to read both the old and the new data, which are in different formats. I need a way to find the encoding of a byte array. Please provide a solution as soon as possible; it is very urgent.

Thanks
Akanksha

Strictly speaking, there is no regular, 100% certain way to tell the UTF from an encoded array of bytes. You have made a fatal mistake and increased the entropy of the system. This mistake is theoretically not reversible, in the same sense that the entropy of a closed system cannot be reduced.

A serialized Unicode string can be represented as two components: an array of bytes, which can be obtained from System.Text.Encoding.GetBytes(string), and the information about the encoding itself. You can think of this piece of information as a reference to a concrete run-time Encoding class. Please see:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx[^].
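To see why the byte array alone is ambiguous, here is a quick sketch in Python (the same logic ports directly to Encoding.UTF8.GetBytes / Encoding.Unicode.GetBytes in .NET): the same string serializes to entirely different byte arrays depending on the encoding chosen, and nothing in the bytes themselves records which one was used.

```python
# The same string, two different serialized forms.
s = "déjà"

utf8_bytes = s.encode("utf-8")       # 6 bytes: ASCII letters take 1 byte, é/à take 2
utf16_bytes = s.encode("utf-16-le")  # 8 bytes: every BMP character takes 2 bytes

# Without out-of-band encoding information, a reader of the column
# cannot tell which of these two byte arrays represents which format.
print(len(utf8_bytes), len(utf16_bytes))
```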

The part of the Unicode standard dedicated to UTFs suggests a standard mechanism for keeping the encoding information with the string data. This is a certain sequence of bytes called the BOM (Byte Order Mark), different for each UTF, which allows for unambiguous detection of the UTF encoding. Please see:
http://en.wikipedia.org/wiki/Byte_order_mark[^],
http://unicode.org/faq/utf_bom.html[^].
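Had a BOM been written with each value, detection would be trivial. A minimal sketch in Python (the byte signatures are those defined by the Unicode standard; note the UTF-32LE check must precede UTF-16LE, since the latter's BOM is a prefix of the former's):

```python
# Detect a Unicode encoding from its BOM (Byte Order Mark), if one is present.
# Returns a codec name, or None when there is no BOM -- the hard case in this thread.

def detect_bom(data: bytes):
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),  # checked before utf-16-le on purpose
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None
```

For BOM-less data, as in the question, this returns None and the statistical methods below are the fallback.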

It looks like you failed to use this or any other such mechanism, with fatal consequences. This is the second case of a failure like this that I have come across at CodeProject.

I can see some ways to fix it, but it will take some labor. First, a trained human eye can easily detect the encoding just by looking at the bytes rendered in some form. And if the same array of bytes is deserialized into text using two or more different encodings, anyone familiar with the writing system (language) can tell which encoding was correct in no time.

You can automate this process. To do that, you should have a dictionary (or dictionaries) of the languages used in the text and perform a statistical analysis of the text deserialized with each hypothetical encoding. The right encoding would be the one which yields more matches between the text's lexemes and the dictionary entries. You will need to do some research to establish how valid the decisions are at different match levels (say, percentages of matches), using the judgement of a human operator.

When this is done and a confidence level is established, you will need to pass the whole database through this system. In all questionable cases (you should develop the criterion for a certain vs. questionable case in your experimental research), the final decision should be made by a human operator. The problem will be solved when all the texts are converted to a single UTF. Alternatively, a BOM could be added to each value, but I would recommend converting everything to UTF-8 as the result of the fix.
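The dictionary-match idea above can be sketched as follows. This is a simplified illustration, not a production detector: the word list, the naive whitespace tokenization, and the 60% confidence threshold are all assumptions to be tuned in the experimental research described above.

```python
# Sketch: guess the encoding of a byte array by decoding it under each
# candidate encoding and counting dictionary matches among the words.

CANDIDATES = ["utf-8", "utf-16-le"]

def guess_encoding(data: bytes, dictionary: set, threshold: float = 0.6):
    best_enc, best_ratio = None, 0.0
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # a decode failure already rules this candidate out
        words = text.lower().split()
        if not words:
            continue
        hits = sum(w.strip(".,;:!?") in dictionary for w in words)
        ratio = hits / len(words)
        if ratio > best_ratio:
            best_enc, best_ratio = enc, ratio
    # Below the threshold, defer to a human operator, as suggested above.
    return best_enc if best_ratio >= threshold else None
```

Note that ASCII-heavy UTF-16LE text often decodes "successfully" as UTF-8 (the interleaved NUL bytes are valid UTF-8), which is exactly why the dictionary match, not mere decodability, must be the criterion.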

You would need to be much more careful with fusional or agglutinative languages, because in those cases you would need to extract the roots or other lexical units of a word (morphemes) for comparison with a dictionary, which is, in the general case, a serious task in both linguistics and computing. Please see:
http://en.wikipedia.org/wiki/Agglutinative_language[^],
http://en.wikipedia.org/wiki/Fusional_language[^],
http://en.wikipedia.org/wiki/Word_root[^],
http://en.wikipedia.org/wiki/Morpheme[^].

[EDIT]

Actually, there is one more simple statistical criterion which should work pretty well in many situations, provided the data uses either UTF-8 or UTF-16LE.

It will work well if most of the code points fall into one to three Unicode sub-ranges, which usually happens when there is one dominating language in the text. Usually, many characters have code points within the ASCII range, and most of the others fall into the same Unicode sub-range. In UTF-16LE, the sub-range is indicated by the high byte of each 16-bit word (32-bit characters beyond the BMP are rare, so I don't consider them). Therefore, if you look only at the distribution of the high bytes, they will fall into one or two main modes, more rarely three or more. So if you find one mode covering, say, 30% or more of the bytes, the data is more likely UTF-16LE than UTF-8, where the byte distribution also shows modes, but less distinct ones.
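A minimal sketch of this high-byte-mode heuristic in Python. The 30% threshold mirrors the figure suggested above and is a tunable assumption; a real classifier would compare the mode strength against the same statistic computed under the UTF-8 hypothesis rather than use a single cutoff.

```python
from collections import Counter

# Heuristic: in UTF-16LE text dominated by one language, the high byte of
# each 16-bit code unit clusters into one or two values (e.g. 0x00 for
# Latin-script text), producing a single strong mode in the distribution.

def looks_like_utf16le(data: bytes, threshold: float = 0.3) -> bool:
    if not data or len(data) % 2 != 0:
        return False  # UTF-16 text must have an even, non-zero byte count
    high_bytes = data[1::2]  # high byte of each little-endian 16-bit word
    counts = Counter(high_bytes)
    top_fraction = counts.most_common(1)[0][1] / len(high_bytes)
    return top_fraction >= threshold
```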

This trick won't work well on writing systems whose characters span wide Unicode ranges, such as Chinese logograms or Korean Hangul, but it should show good results on most other languages: Western European, Slavic, Ugric, Georgian, Armenian, Arabo-Persian, and of course those using Brahmic scripts (most of the numerous Indian writing systems, Thai, etc.), and many more.

[END EDIT]

Overall, the creation of this software could be done quite quickly, but the research and the final conversion of the data could be more or less expensive. Next time, use your head before doing your work.

—SA
 
Comments
kiquenet.com 22-Oct-13 9:16am    
any final solution with full source code sample working about it ?
Sergey Alexandrovich Kryukov 22-Oct-13 11:38am    
Are you kidding? Why should I do so much work for the OP? I answered the question in detail; it should be quite enough...
—SA
kiquenet.com 22-Oct-13 12:28pm    
Thanks, it's great :-)
Sergey Alexandrovich Kryukov 22-Oct-13 13:39pm    
You are welcome. :-)
—SA
kiquenet.com 23-Oct-13 2:47am    
What's about http://stackoverflow.com/a/19464728/206730 ?
How about discriminating UTF-8/UTF-16 based on the table entry date?
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


