I think you are on the confused side about Unicode, like many other programmers. The Unicode character set is nothing more than a table of about 1.1 million code points (0 to 0x10FFFF). Since the range of a char is just 0..255 and the range of a 16-bit wchar_t (as on Windows) is 0..65535, it's obvious that you can store a Unicode character neither in a char nor in such a wchar_t. You need at least 21 bits, in practice a 32-bit integer, to encode one Unicode character (code point) with one integer. For this reason, if you want to use one integer to store any of the Unicode characters, you have to use UTF-32, an encoding that uses no tricks: in UTF-32, one uint32 is one index into the Unicode table. Period. In practice, however, UTF-32 is rarely used because it is memory intensive and wastes a lot of space, especially for languages that use mostly ASCII characters. Because of this, UTF-8 and UTF-16 are more widespread than UTF-32, but in UTF-8 and UTF-16 one integer (uint8 or uint16) alone isn't necessarily an index into the Unicode table. In UTF-8, for example, any byte bigger than 127 means that this byte and the next few bytes (at most 4 bytes in total) together store the bits that form an index into the large Unicode table (http://en.wikipedia.org/wiki/UTF-8[^]). In UTF-16 it is also possible that two wchar_ts together form one index (high and low surrogate pairs in the range 0xD800-0xDFFF, https://en.wikipedia.org/wiki/UTF-16[^]). For this reason some operations on UTF-8 and UTF-16 encoded strings are not efficient. For example, strlen() and wcslen() return the number of chars and wchar_ts in the string instead of the actual number of Unicode characters (which can be less than the number of chars or wchar_ts because of the multi-unit sequences mentioned above). Indexing a Unicode character in the string is also inefficient. In many cases, however, these operations are not required, and there are other operations, for example concatenation, that are just as efficient with these UTF encodings.
Often you are not really interested in the encoding of the string or the Unicode characters in it, so you can handle the string as a big bunch of binary data. In fact, many programs just load strings from some localization database/file and use them to display text on the screen. Only the text renderer/drawer has to be able to decode the UTF encoded binary data (the string) into a sequence of Unicode characters, and for that it needs just a simple iterator that retrieves the Unicode characters from the UTF data in left-to-right order. That can be done efficiently with both UTF-8 and UTF-16, and you don't even have to care about it if you are using, for example, the Windows DrawText() function.
Of course you may want to "procedurally" generate strings in the program, but that is an easier task. Many operations allow you to treat the string as a plain sequence of chars or wchar_ts, which makes your work easier. For example, if you are searching for the next newline in a UTF-8 string, you can safely process the string as a sequence of chars, because all bytes of a multibyte UTF-8 sequence are bigger than 127, so you can search for the next chr(10) without actually interpreting the Unicode characters (the multi-byte/multi-wchar_t UTF-8/UTF-16 sequences) in the encoded string. The same is true for all ASCII characters (<128); this comes in handy, for example, in an XML parser, in which the special characters are ASCII (<>&").
UTF-16 or UTF-8? You can hide this as an implementation detail in your own string class and easily change it later, or you can make it platform dependent. On Linux UTF-8 is the way to go, but you can use UTF-8 even on Windows to store data in memory and convert it to UTF-16 on the fly when you call a Windows function that requires a UTF-16 string. Many make the mistake of calling the ANSI Windows functions with UTF-8 data. You know: almost every Windows function that receives a string parameter has 3 names, e.g. DrawTextA(), DrawTextW() and DrawText(), the last of which is just a macro defined to either DrawTextA or DrawTextW. On Windows NT the A functions just convert the input string to UTF-16 using the current locale of Windows and then call the W version of the function, so don't make the mistake of calling the A functions with UTF-8 strings. It will work if the string contains only ASCII characters (<128), but it won't work with any special chars! On Windows, always call the W functions directly with UTF-16 strings: either store your strings as UTF-16 with a terminating null, or store UTF-8 and write a UTF-16 converter method for your string class that returns a temporary UTF-16 converted string!
The conclusion is that you can simply read/write text from/to files as binary data; the encoding matters only when someone starts processing that binary data as a sequence of Unicode characters. Even if you read in the text file as one big chunk of UTF encoded binary data, you can easily split it into lines (along the chr(10) bytes) without processing the actual Unicode characters on those lines, or you can process a localization text file whose lines contain key=value pairs without caring about UTF at all, because all you have to do is split each line into two parts along an ASCII character ('=').
Another interesting thing is that not all byte sequences (binary data) can be interpreted as valid UTF-8 or UTF-16 strings! It is worth validating a string when you read it from a file; I usually validate strings at runtime only in debug builds to keep release builds fast. In some cases you may need runtime validation even in release builds, but that is rare.
EDIT: Of course, if you want to use the standard library to detect the actual encoding of the file and convert it to the format your program uses (for example UTF-16), then my comments are just details that help you understand what's going on. A text file can store text in several formats. Usually the first few (2-5) bytes of a text file are a special sequence that indicates the encoding of the text that follows; this is called the BOM (Byte Order Mark), and it isn't shown by modern text editors (use a hex editor to check this): http://en.wikipedia.org/wiki/Byte_order_mark[^]
Note that a BOM at the beginning of the file isn't required, but without one a text editor might have a hard time guessing the format (sometimes it's impossible).
If you create the data files for your program yourself, then you can use a fixed format even without a BOM. We often use UTF-8 without a BOM here, and our program allows no other format.