Posted 25 Nov 2019


Doing UTF-8 in Windows

Updated 2 Aug 2020, MIT license, 4 min read
This is (yet another!) article on how to handle UTF-8 encoding on a platform that still encourages the UTF-16 encoding.
In this article, I provide a small library for this purpose. The code works, and it is clean, small and easy to understand. It implements the solution advocated in the UTF-8 Everywhere manifesto.

Background

Let me rehash some of the points made in the manifesto mentioned above:

  • UTF-16 (variously called Unicode, widechar or UCS-2) was introduced back in the early '90s and, at the time, it was believed that its 65,536 code values would be enough for all characters.
  • Except in particular cases, UTF-16 is not more efficient or easier to use than UTF-8. In fact, in many cases, the opposite is true.
  • UTF-16 is also a variable-width encoding (two or four bytes per character), so counting characters is as difficult as in UTF-8.
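To make the last point concrete, here is a minimal sketch of what counting characters in UTF-16 involves. The function name count_utf16 is made up for this illustration and is not part of the library; it assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string.
// A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF)
// encodes a single character.
std::size_t count_utf16 (const std::u16string& s)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < s.size (); ++i)
  {
    bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
    if (high && i + 1 < s.size () && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
      ++i;                  // skip the second half of the pair
    ++count;
  }
  return count;
}
```

For the string u"a\U0001F600", size() reports 3 code units while count_utf16 returns 2, so size() alone cannot be trusted as a character count.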

If you want to work with UTF-8 encoding in Windows (and you should), and you don't want to go insane or have your program crash unexpectedly, you must follow the rules given below:

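  • Define UNICODE and _UNICODE when compiling your program (or select "Use Unicode Character Set" in Visual Studio, which defines both). UNICODE selects the wide-character versions of the Windows API functions and _UNICODE does the same for the C runtime.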
  • Use wchar_t or std::wstring only in arguments to API function calls. Use char or std::string everywhere else.
  • Use widen() and narrow() functions to go between UTF-8 and UTF-16.

The functions provided in this package will make your life much easier.

Calling Library Functions

All functions live in the utf8 namespace and I would advise you not to place a using directive for this namespace. This is because many/most functions have the same name as the traditional C functions. For instance, if you had a function call:

C++
mkdir (folder_name);

and you want to start using UTF-8 characters, you just have to change it to:

C++
utf8::mkdir (folder_name);

Prefixing the function with the namespace makes it obvious what function you are using.

Basic Conversion Functions

Following the same manifesto, the basic conversion functions are narrow(), to go from UTF-16 to UTF-8 and widen() to go in the opposite direction. Their signatures are:

C++
std::string narrow (const wchar_t* s);
std::string narrow (const std::wstring& s);

std::wstring widen (const char* s);
std::wstring widen (const std::string& s);

In addition, there are two more functions for conversion from and to UTF-32:

C++
std::string narrow (const std::u32string& s);
std::u32string runes (const std::string& s);

Internally, the conversion is done using the WideCharToMultiByte and MultiByteToWideChar functions.
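To show what these conversions amount to, here is a portable, hand-written sketch. It is not the library's implementation (which, as noted, relies on the Windows API). The names narrow_sketch and widen_sketch are made up for this illustration, std::u16string stands in for the 16-bit std::wstring of Windows, and the input is assumed to be well formed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for utf8::narrow: UTF-16 code units -> UTF-8 bytes.
// Unpaired surrogates are not diagnosed.
std::string narrow_sketch (const std::u16string& s)
{
  std::string out;
  for (std::size_t i = 0; i < s.size (); ++i)
  {
    char32_t cp = s[i];
    // Combine a high/low surrogate pair into one code point
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size ()
        && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
    {
      cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
      ++i;
    }
    if (cp < 0x80)
      out += static_cast<char> (cp);
    else if (cp < 0x800)
    {
      out += static_cast<char> (0xC0 | (cp >> 6));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
      out += static_cast<char> (0xE0 | (cp >> 12));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
      out += static_cast<char> (0xF0 | (cp >> 18));
      out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  }
  return out;
}

// Hypothetical stand-in for utf8::widen: UTF-8 bytes -> UTF-16 code units.
// No validation is attempted.
std::u16string widen_sketch (const std::string& s)
{
  std::u16string out;
  for (std::size_t i = 0; i < s.size ();)
  {
    unsigned char c = static_cast<unsigned char> (s[i++]);
    char32_t cp;
    int extra;                       // number of continuation bytes to read
    if (c < 0x80)      { cp = c;        extra = 0; }
    else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }
    else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }
    else               { cp = c & 0x07; extra = 3; }
    while (extra-- > 0 && i < s.size ())
      cp = (cp << 6) | (static_cast<unsigned char> (s[i++]) & 0x3F);
    if (cp < 0x10000)
      out += static_cast<char16_t> (cp);
    else
    {
      cp -= 0x10000;                 // split into a surrogate pair
      out += static_cast<char16_t> (0xD800 + (cp >> 10));
      out += static_cast<char16_t> (0xDC00 + (cp & 0x3FF));
    }
  }
  return out;
}
```

A round trip such as widen_sketch (narrow_sketch (s)) gives back the original string for well-formed input. Production code should prefer the library functions, which leave the details to the operating system.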

There are also functions for counting the number of characters in a UTF-8 string (length()), for checking if a string is valid (valid()), and for advancing a pointer/iterator through a character string (next()).
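The sketch below shows one way such helpers can be written. It is an illustration only: the names with the _sketch suffix and the sequence_length helper are made up, the signatures are assumptions rather than the library's, and the validity check is structural (lead and continuation bytes) without rejecting every overlong or surrogate encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the sequence introduced by the
// lead byte c, or 0 if c cannot start a sequence.
static int sequence_length (unsigned char c)
{
  if (c < 0x80) return 1;
  if (c >= 0xC2 && c <= 0xDF) return 2;
  if (c >= 0xE0 && c <= 0xEF) return 3;
  if (c >= 0xF0 && c <= 0xF4) return 4;
  return 0;
}

// Structural validity: every lead byte is followed by the right number
// of continuation bytes (10xxxxxx).
bool valid_sketch (const std::string& s)
{
  for (std::size_t i = 0; i < s.size ();)
  {
    int n = sequence_length (static_cast<unsigned char> (s[i]));
    if (n == 0 || i + n > s.size ())
      return false;
    for (int k = 1; k < n; ++k)
      if ((static_cast<unsigned char> (s[i + k]) & 0xC0) != 0x80)
        return false;
    i += n;
  }
  return true;
}

// Advances p past one character; an invalid lead byte is skipped as a
// single byte so the caller always makes progress.
const char* next_sketch (const char* p)
{
  int n = sequence_length (static_cast<unsigned char> (*p));
  return p + (n ? n : 1);
}

// Counts characters by stepping through the string with next_sketch.
// Meaningful only for strings that pass valid_sketch.
std::size_t length_sketch (const std::string& s)
{
  std::size_t count = 0;
  const char* end = s.data () + s.size ();
  for (const char* p = s.data (); p < end; p = next_sketch (p))
    ++count;
  return count;
}
```

For example, the five-byte string "a\xC3\xA9\xE2\x82\xAC" (the letters a, é and the euro sign) passes valid_sketch and has a length_sketch of 3.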

Wrappers

Pretty much all the other functions are wrappers around traditional C/C++ functions or structures:

  • directory manipulation functions: mkdir, rmdir, chdir, getcwd
  • file operations: fopen, chmod, access, rename, remove
  • streams: ifstream, ofstream, fstream
  • path manipulation functions: splitpath and makepath
  • environment access functions putenv and getenv
  • character classification functions is... (isalnum, isdigit, isalpha, etc.)

The parameters for all these functions mimic the standard parameters. For some of them however, like access, rename, etc., the return type is bool with true indicating success and false indicating failure. This is contrary to standard C functions that return 0 for success. Caveat emptor!
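The difference in return conventions is easy to trip over, so here is a small sketch of it. The sketch namespace and its remove function are made up for this illustration. To stay portable, it calls std::remove directly, whereas a Windows wrapper would presumably widen the name and call a wide-character function.

```cpp
#include <cstdio>
#include <string>

namespace sketch {

// Hypothetical wrapper in the style described above: returns true on
// success, unlike the C function, which returns 0 on success.
inline bool remove (const std::string& filename)
{
  return std::remove (filename.c_str ()) == 0;
}

} // namespace sketch
```

With the C function, the error check reads if (remove (name) != 0); with a wrapper of this kind, it becomes if (!sketch::remove (name)). Copying the old test unchanged would silently invert the logic.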

Return Values

For API functions that return a character string, you would need to set up a wchar_t buffer to receive the value, convert it to UTF-8 using the narrow function and eventually release the buffer. Below is an example of what this would look like. The code retrieves the name of a temporary file:

C++
wstring wpath (_MAX_PATH, L'\0');
wstring wfname (_MAX_PATH, L'\0');

// Both calls write a NUL-terminated string into the supplied buffer
GetTempPath ((DWORD)wpath.size (), &wpath[0]);
GetTempFileName (wpath.c_str (), L"ABC", 1, &wfname[0]);

// Convert up to the terminator only, not all _MAX_PATH characters
string result = utf8::narrow (wfname.c_str ());

This seemed a bit too cumbersome and error-prone, so I made a small object to hold returned values. It has operators to convert it to a wchar_t buffer and then to a UTF-8 string. For lack of a better name, I called it buffer. Using this object, the same code fragment becomes:

C++
utf8::buffer path (_MAX_PATH);
utf8::buffer fname (_MAX_PATH);

GetTempPath (path.size (), path);
GetTempFileName (path, L"ABC", 1, fname);

string result = fname;

Internally, a buffer object contains UTF-16 characters but the string conversion operator invokes the utf8::narrow function to convert the string to UTF-8.
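To illustrate the idea, here is a minimal sketch of such an object. It is not the library's actual class: buffer_sketch, narrow_sketch and fake_get_name are names made up for this example, and fake_get_name merely plays the role of an API function that fills a caller-supplied wide buffer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for utf8::narrow, written out so the sketch compiles on its own.
// Works whether wchar_t holds UTF-16 units (Windows) or whole code points.
static std::string narrow_sketch (const wchar_t* s)
{
  std::string out;
  while (*s)
  {
    unsigned long cp = static_cast<unsigned long> (*s++);
    if (cp >= 0xD800 && cp <= 0xDBFF && *s >= 0xDC00 && *s <= 0xDFFF)
      cp = 0x10000 + ((cp - 0xD800) << 10)
         + (static_cast<unsigned long> (*s++) - 0xDC00);
    if (cp < 0x80)
      out += static_cast<char> (cp);
    else if (cp < 0x800)
    {
      out += static_cast<char> (0xC0 | (cp >> 6));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
      out += static_cast<char> (0xE0 | (cp >> 12));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
      out += static_cast<char> (0xF0 | (cp >> 18));
      out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  }
  return out;
}

// Hypothetical buffer: owns wide-character storage, hands it to an API as
// wchar_t*, and converts the result to UTF-8 on demand.
class buffer_sketch
{
public:
  explicit buffer_sketch (std::size_t size) : data_ (size + 1, L'\0') {}
  std::size_t size () const { return data_.size () - 1; }
  operator wchar_t* () { return data_.data (); }
  operator std::string () const { return narrow_sketch (data_.data ()); }
private:
  std::vector<wchar_t> data_;  // the extra element guards against a missing terminator
};

// Hypothetical API-like function that fills a caller-supplied wide buffer
static void fake_get_name (std::size_t size, wchar_t* out)
{
  const wchar_t name[] = L"caf\u00E9";
  std::size_t i = 0;
  for (; i + 1 < size && name[i]; ++i)
    out[i] = name[i];
  out[i] = L'\0';
}
```

Usage follows the same pattern as the fragment above: declare buffer_sketch name (16);, call fake_get_name (name.size (), name); and then assign std::string result = name; to obtain the UTF-8 text. The storage is released automatically when the object goes out of scope.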

Program Arguments

There are two functions for accessing the UTF-16 encoded program arguments and converting them to UTF-8. The first one, get_argv, returns an argv-like array of pointers to the command line arguments:

C++
int argc;
char **argv = utf8::get_argv (&argc);

The second one directly returns a vector of strings:

C++
std::vector<std::string> argv = utf8::argv ();

When using the first function, you have to call the utf8::free_argv function to release the memory allocated for the argv array.

Conclusion

I hope this article and the included code show that using UTF-8 encoding in Windows programs doesn't have to be too painful.

The next chapters in this series are:

History

  • 02 August, 2020 - Links to other articles in the series, code updated
  • 22 November, 2019 - Initial version

License

This article, along with any associated source code and files, is licensed under The MIT License


Written By
Mircea Neacsu
Canada
Mircea is an OOP (old, opinionated programmer) with more years of experience than he likes to admit. Always open to new things, he is however too bruised to follow any passing fad.

Lately, he hangs around here, hoping that some of the things he learned can be useful to others.

