Hi all,
I am using "u_strToUTF8()" in my function "UniCodeStringToString()" to convert a UnicodeString to a std::string. Everything was going fine, but it appears to destroy the value of strings containing Arabic data.
Is there any solution?

Here is the code, just for reference:
C++
string ICUUtils::UniCodeStringToString( UnicodeString sSrc)
{
	UErrorCode code = U_ZERO_ERROR;
	int32_t len = 0;
	// First call: preflight with a NULL buffer to measure the UTF-8 length.
	u_strToUTF8( NULL, 0, &len, sSrc.getTerminatedBuffer(), sSrc.length(), &code );
	// Preflighting reports U_BUFFER_OVERFLOW_ERROR by design; reset it.
	code = U_ZERO_ERROR;
	int32_t cap = len + 1;
	char* ptr = new char[cap];
	memset(ptr, 0, cap);
	// Second call: perform the actual conversion into the sized buffer.
	u_strToUTF8( ptr, cap, &len, sSrc.getTerminatedBuffer(), sSrc.length(), &code );
	string sTemp(ptr, len);
	delete [] ptr;
	return sTemp;
}
Comments
nv3 2-Aug-12 7:08am    
If you put UTF-8 encoded characters into an std::string, I would expect some of the double-byte sequences to look strange when interpreted as 7-bit ASCII. Are you sure you want a UTF-8 encoding, rather than an encoding into your current code page?
pasztorpisti 2-Aug-12 7:34am    
I think it's best to avoid code pages and use UTF in your program. I often store UTF-8 in my strings, but then of course you have to handle it as UTF-8 when calling API functions - for example, converting to UTF-16 when drawing the string or using it as a filename, etc. I think UTF-8 is the least hassle to use in a Unicode program, because in 90% of cases you can use the string as you would an ASCII string (no need for strings with the 'L' prefix; wide chars are a somewhat unnaturally hacked-in feature of C++), and it's easy to port to Linux, where UTF-8 is the native encoding of the OS (that was a wise decision in Linux).
nv3 2-Aug-12 7:43am    
I agree with what you said, but obviously riaz100 tried to output the contents of his string with a simple printf or the like and wasn't prepared for the outcome. If your program expects ASCII and you work with UTF-8, you are in trouble. So he either has to convert the contents of the std::string before output, or convert his UTF-16 string into the local code page.
pasztorpisti 2-Aug-12 7:47am    
Sure, I'm putting together an answer that might help.

Short answer:
UChar is a 16-bit integer, so I guess that UnicodeString stores the string as UTF-16. UTF-16 is actually the native format of Windows (not true for the Win9x versions), so you should just use this string directly with the 'W' (UTF-16) versions of the WinAPI functions (if you don't know what I'm talking about, continue reading).

Long answer:
For backward-compatibility reasons, every Windows function that works with strings has two versions: an ANSI version for backward compatibility, and a UTF-16 (wide-char) version that is actually native to the OS - for example, DrawTextA() and DrawTextW(). One of them expects you to pass in a const char*, the other a const wchar_t*. You may be wondering what DrawTextA() and DrawTextW() are, since you know only a DrawText() function and the MSDN docs also use just DrawText()! In fact DrawText is just a macro that is defined to either DrawTextA or DrawTextW depending on your Visual Studio project settings - more precisely, the character set setting. The same is true of the string parameter of DrawText, LPCTSTR - this LPCTSTR is ultimately defined to either const char* or const wchar_t*. You should read this article to grasp what I'm talking about: What are TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR (etc.)?[^]

After reading that short article, do this: set the character set setting of your project to Unicode, and use the Windows functions directly with sSrc.getTerminatedBuffer() - the UTF-16 string. If you use special (non-ASCII) characters, always use Unicode, not some code-paged legacy encoding that might work only on machines where the language and code page settings are the same as on yours. If you are writing a cross-platform program, it's wise to store UTF-8 everywhere and, on Windows, convert the parameters to UTF-16 for the duration of the function call, converting the result back to UTF-8 if the call returns strings.
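The "store UTF-8, convert at the boundary" idea can be sketched with a minimal UTF-8 to UTF-16 decoder. This is an illustration that assumes well-formed input and has no error handling; in real code you would use ICU's u_strFromUTF8 (or MultiByteToWideChar on Windows) instead:

```cpp
#include <string>
#include <cstdint>

// Minimal UTF-8 -> UTF-16 converter (sketch: assumes well-formed input).
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = (unsigned char)in[i];
        uint32_t cp;
        int extra;  // number of continuation bytes after the lead byte
        if      (b < 0x80) { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra; ++k, ++i)
            cp = (cp << 6) | (in[i] & 0x3F);  // fold in 6 bits per byte
        if (cp < 0x10000) {
            out.push_back((char16_t)cp);      // BMP: one code unit
        } else {                              // outside the BMP: surrogate pair
            cp -= 0x10000;
            out.push_back((char16_t)(0xD800 + (cp >> 10)));
            out.push_back((char16_t)(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the UTF-8 bytes D8 B3 D9 84 D8 A7 D9 85 ("سلام") decode to the four UTF-16 code units U+0633 U+0644 U+0627 U+0645, ready to pass to a 'W' function.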

EDIT: Forgot to mention something: it's very important to switch your "character set" project setting to Unicode, because that makes sure DrawText is actually defined to DrawTextW and not DrawTextA - and the same is true for every other WinAPI function that works with strings. I'm writing this because I once left my character set project setting on Not Set (ANSI) and then used some 'W' function calls directly, without using the macros, which were actually defined to the ANSI function calls. This worked for me, but one day I found myself in front of a very strange bug that took me a whole day to find: my window-creation code and my message loop contained some ANSI stuff, because they used the macro versions of the functions, and this bizarre mixture of ANSI/Unicode caused very strange bugs! Even some window messages and window-message-related structs are ANSI/wide-char dependent. So, it's 2012 - forget about ANSI and go ahead with Unicode.
I don't know where this function comes from, but you may like to take a look at my tip: Handling simple text files in C/C++[^]. It shows how to convert using the standard Microsoft library functions, which also take account of the fact that the source and destination buffers may not be the same length.
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)