SEEKING ADVICE REGARDING CONVERSION TO UNICODE
WHY I AM ADOPTING UNICODE SO LATE
[Feel free to skip to "The Situation", below, but if you do, please don't flame me for "waiting so long".]
I'm a former professional software engineer and programmer.
I was "away" from programming for many years due to medical considerations.
I "left" just as the shift to Unicode became mainstream.
Lately, I've begun to tinker with programming.
Since my "return to programming", I've continued to work only with MBCS builds.
[Due to my condition, any conversion to Unicode will be a much bigger deal for me than it would be were I healthy.]
After my "return" (such as it is), I read about Unicode.
It seemed clear to me that UTF-8 was the "way to go".
I assumed that UTF-8 had been adopted pretty much universally.
I was very surprised (and disappointed) to learn that Microsoft had "gone with" UTF-16.
MY CODING SITUATION
My issues are with conversion to Unicode in the following environment:
Language: C++
Target Platform: Native Windows
Development Tool: Visual Studio 2013
I have many lines of C++ source code (hundreds of thousands, almost certainly) in both libraries and applications.
Additionally, I make use of third-party libraries (for which I also have the source code). Some of these are relatively current, as I have done some work (MBCS) since my "return to programming".
THE MAIN PROBLEM WITH UNICODE BUILDS
A) Every interaction with the Windows API requires UTF-16 strings (in a Unicode build, the API functions take wchar_t*).
B) However, the third-party libraries I use accept char*.
For every single "string type" variable, I must decide whether it should be char or TCHAR (or std::string or std::wstring).
For every string literal, I must decide whether it needs to be A) macro wrapped, B) prefixed, C) neither. [Basically, native char*, or UTF-16.]
This is the case not only for new code, but for existing code (existing code probably contains over 100,000 instances of such variables and literals).
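For anyone unfamiliar with the mechanism: this is (roughly) what TCHAR and _T() from <tchar.h> do. The names MY_UNICODE, MY_TCHAR, and MY_T below are stand-ins of my own so the sketch compiles outside Windows; in a real project you would include <tchar.h> and set _UNICODE/_MBCS in the project settings.

```cpp
#include <cstring>
#include <cwchar>

// Stand-ins for TCHAR and _T() from <tchar.h>. Under _UNICODE, TCHAR is
// wchar_t and _T("x") expands to L"x"; under _MBCS, TCHAR is char and
// _T("x") is just "x".
#if defined(MY_UNICODE)
typedef wchar_t MY_TCHAR;
#define MY_T(x) L##x
#else
typedef char MY_TCHAR;
#define MY_T(x) x
#endif

// The same source line yields a char string or a wchar_t string
// depending on which build configuration is active:
const MY_TCHAR* kGreeting = MY_T("hello");
```

This is exactly the per-literal decision described above: a literal left bare stays char*, one prefixed with L is always wchar_t*, and one wrapped in _T() flips with the build.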
[Ignore for now that I want to be able to do both MBCS and Unicode builds. That's not the issue. Pretend I'll be doing only Unicode builds.]
For every single function or method call, I must ensure that the expected string types are passed.
The code base consists of 4 "categories": A) Windows API, B) Third-party API/code, C) my libraries, D) my applications.
Strings are being "passed around all over the place" within a code category and between them.
It would be ever so much better if all functions (Windows API, my code, third-party code) used the same string/character type.
However, I can't very well convert the third-party libraries to use TCHAR, _T(), etc.
I'd have to do that every time a new version of a library was released.
[Incidentally, one library I use is Boost.]
In addition to having to choose the "right" string type for every variable and literal, it appears that I'll have to add a great number of string-conversion calls to my code, continually converting between UTF-16 and UTF-8/ASCII.
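Those conversion calls can at least be confined to a pair of helpers. Here is a minimal, portable sketch (the helper names Utf8ToUtf16/Utf16ToUtf8 are my own); std::codecvt_utf8_utf16 ships with Visual Studio 2013, though it was later deprecated in C++17, where MultiByteToWideChar/WideCharToMultiByte would be the usual choice on Windows:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-16. On Windows, wchar_t holds one UTF-16 code unit, so
// the result can be passed directly to the W-suffixed API functions.
std::wstring Utf8ToUtf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// UTF-16 -> UTF-8, e.g. for handing strings back to char*-based
// third-party libraries.
std::string Utf16ToUtf8(const std::wstring& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(utf16);
}
```

Centralizing the conversions this way doesn't remove the cost, but it does keep the "which encoding is this?" decision out of the call sites.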
It seems to me that Microsoft created a programming nightmare by going with UTF-16.
It further seems to me that they avoided a relatively minor inconvenience to themselves (to maintain DLL compatibility), by inflicting an absurdly high cost on non-Microsoft developers.
Am I missing something?
If Microsoft had just "gone with" UTF-8, UTF-8 could be passed "everywhere". The whole nightmare would not exist. It would have been simplicity itself.
[Yes, some third-party code (at that time) might not have dealt correctly with multi-byte UTF-8 characters. However, it could have been updated to do so.]
How do I approach the conversion?
A) Use a UTF-16 "infrastructure", and write a wrapper around every third-party library?
B) Use a UTF-8 "infrastructure", and write a wrapper around the Windows API?
For example, a CEditUTF8 class, as a drop-in replacement for CEdit.
C) Decide, individually, for every string variable and literal, the appropriate type, and add ad-hoc calls to string conversion functions all over my code?
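To make option (B) concrete, here is a sketch of one such wrapper (the name SetWindowTextUtf8 is hypothetical; the non-Windows branch is a recording stub so the sketch compiles and can be exercised anywhere). The idea is that the rest of the program deals only in UTF-8, and the UTF-16 requirement is confined to a thin layer like this:

```cpp
#include <codecvt>
#include <locale>
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-in for the real Windows API so this compiles off Windows;
// it simply records the last string it was given.
typedef void* HWND;
static std::wstring g_lastText;
static int SetWindowTextW(HWND, const wchar_t* text) {
    g_lastText = text;
    return 1;
}
#endif

// Option B in miniature: callers pass UTF-8; the wrapper converts to
// UTF-16 at the API boundary and forgets about it.
bool SetWindowTextUtf8(HWND hwnd, const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return SetWindowTextW(hwnd, conv.from_bytes(utf8).c_str()) != 0;
}
```

A CEditUTF8 class would be the same pattern applied method-by-method. The cost is writing (and maintaining) a wrapper per API entry point actually used; the benefit is that options (A) and (C) disappear from the bulk of the code.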
It strikes me as Microsoft having made a phenomenally selfish decision by going with UTF-16 (which should have been rejected by the developer community).
Maybe I'm wrong.
Maybe I'm missing something.
It appears to me that even if there were no third-party library issue, the decision to "go with" UTF-16 greatly complicates source code.
The third-party compatibility issue makes the situation at least 10 times worse.
For new code, nearly half my programming effort will be dealing with string-type issues.
In addition to "mixed" strings making code far messier, there is inefficiency due to conversions.
If I adopt solution (C), converting existing code may require inspection of every use of the "char" keyword, every use of string, etc.
Yet another consideration: I prefer to make my library code portable, if possible (to isolate OS-dependent code). This string issue complicates the writing of portable code.
Now, even code that is completely OS-independent must contain "Microsoftisms", to address string/character type issues.
It all could have been so simple.
COMMENTS? SUGGESTIONS? DERISION? LAUGHS? INSULTS? COMMISERATION? AGREEMENT?
P.S.
I wrote a utility to wrap string literals with _T() **. However, it can't determine which literals to wrap (it's not smart enough to parse the source code and determine the types of the variables to which each literal is being assigned). Even if it could, it wouldn't solve the problems of variable types, or conversion calls.
** This wasn't done via regular expression search and replace:
Firstly, there were too many exceptions (include directives, various macros, embedded double quotes, comments - both // and /**/ - and various other "special cases").
Secondly, Visual Studio's search and replace is buggy: search and replace IN FILES does not work if regular expressions are enabled (I have used sed to get around this).