Hi all,
I am using "u_strToUTF8()" in my function "UniCodeStringToString()" to convert a UnicodeString to a std::string. Everything was going fine, but it appears to destroy the value of strings containing Arabic data.
Is there any solution?

Here is the code, just for reference:
C++
string ICUUtils::UniCodeStringToString( UnicodeString sSrc)
{
	UErrorCode code = U_ZERO_ERROR;
	int32_t len = 0;
	// First call: preflight with a NULL buffer to measure the UTF-8 length.
	u_strToUTF8( NULL, 0, &len, sSrc.getTerminatedBuffer(), sSrc.length(), &code );
	// Preflighting reports U_BUFFER_OVERFLOW_ERROR by design; reset it.
	code = U_ZERO_ERROR;
	int32_t cap = len + 1;
	char* ptr = new char[cap];
	memset(ptr, 0, cap);
	// Second call: perform the actual conversion into the sized buffer.
	u_strToUTF8( ptr, cap, &len, sSrc.getTerminatedBuffer(), sSrc.length(), &code );
	string sTemp(ptr, len);
	delete [] ptr;
	return sTemp;
}
Comments
nv3 2-Aug-12 7:08am    
If you put UTF-8 encoded characters into an std::string, I would expect some of the double-byte sequences to look strange when interpreted as 7-bit ASCII. Are you sure you want a UTF-8 encoding, rather than an encoding into your current code page?
pasztorpisti 2-Aug-12 7:34am    
I think it's best to avoid code pages and use UTF in your program. I often store UTF-8 in my strings, but then of course you have to handle it as UTF-8 when calling API functions - for example, converting to UTF-16 when drawing the string or using it as a filename, etc. I think UTF-8 is the least hassle to use in a Unicode program, because in 90% of cases you can use the string as you would an ASCII string (no need for strings with the 'L' prefix; wide chars are a somewhat unnaturally hacked-in feature of C++), and it's easy to port to Linux, where UTF-8 is the native encoding of the OS (that was a wise decision in Linux).
nv3 2-Aug-12 7:43am    
I agree with what you said, but obviously riaz100 tried to output the contents of his string with a simple printf or the like and wasn't prepared for the outcome. If your program expects ASCII and you work with UTF-8, you are in trouble. So he either has to convert the contents of the std::string before output, or convert his UTF-16 string into the local code page.
pasztorpisti 2-Aug-12 7:47am    
Sure, I'm putting together an answer that might help.

Short answer:
UChar is a 16-bit integer, so I guess that UnicodeString stores the string as UTF-16. UTF-16 is actually the native format of Windows (not true for the Win9x versions), so you should just use this string directly with the 'W' (UTF-16) versions of the WinAPI functions (if you don't know what I'm talking about, continue reading).

Long answer:
For backward-compatibility reasons, every Windows function that works with strings has two versions: an ANSI version for backward compatibility, and a UTF-16 (wide-char) version that is actually native to the OS - for example, DrawTextA() and DrawTextW(). One of them expects you to pass in a const char*, the other a const wchar_t*. You may be wondering what DrawTextA() and DrawTextW() are, since you know only a DrawText() function and the MSDN docs also use just DrawText()! In fact DrawText is just a macro that is defined to either DrawTextA or DrawTextW depending on your Visual Studio project settings - more precisely, the character set setting. The same is true of the string parameter of DrawText, LPCTSTR - this LPCTSTR is ultimately defined to either const char* or const wchar_t*. You should read this article to grasp what I'm talking about: What are TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR (etc.)?[^]

After reading that short article, do this: set the character set setting of your project to Unicode, and use the Windows functions directly with sSrc.getTerminatedBuffer() - the UTF-16 string. If you use special (non-ASCII) characters, always use Unicode, not some code-paged legacy encoding that might work only on machines where the language and code page settings are the same as on yours. If you are writing a cross-platform program, it's wise to store UTF-8 everywhere and, on Windows, convert the parameters to UTF-16 for the duration of the function call, converting the result back to UTF-8 if the call returns strings.
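The "store UTF-8, convert at the boundary" idea can be sketched with a minimal UTF-8 to UTF-16 decoder. This is an illustration that assumes well-formed input and has no error handling; in real code you would use ICU's u_strFromUTF8 (or MultiByteToWideChar on Windows) instead:

```cpp
#include <string>
#include <cstdint>

// Minimal UTF-8 -> UTF-16 converter (sketch: assumes well-formed input).
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = (unsigned char)in[i];
        uint32_t cp;
        int extra;  // number of continuation bytes after the lead byte
        if      (b < 0x80) { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra; ++k, ++i)
            cp = (cp << 6) | (in[i] & 0x3F);  // fold in 6 bits per byte
        if (cp < 0x10000) {
            out.push_back((char16_t)cp);      // BMP: one code unit
        } else {                              // outside the BMP: surrogate pair
            cp -= 0x10000;
            out.push_back((char16_t)(0xD800 + (cp >> 10)));
            out.push_back((char16_t)(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the UTF-8 bytes D8 B3 D9 84 D8 A7 D9 85 ("سلام") decode to the four UTF-16 code units U+0633 U+0644 U+0627 U+0645, ready to pass to a 'W' function.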

EDIT: Forgot to mention something: it's very important to switch your "character set" project setting to Unicode, because that makes sure DrawText is actually defined to DrawTextW and not DrawTextA - and the same is true for every other WinAPI function that works with strings. I'm writing this because I once left my character set project setting on Not Set (ANSI) and then used some 'W' function calls directly, without using the macros, which were actually defined to the ANSI function calls. This worked for me, but one day I found myself in front of a very strange bug that took me a whole day to find: my window-creation code and my message loop contained some ANSI stuff, because they used the macro versions of the functions, and this bizarre mixture of ANSI/Unicode caused very strange bugs! Even some window messages and window-message-related structs are ANSI/wide-char dependent. So, it's 2012 - forget about ANSI and go ahead with Unicode.
I don't know where this function comes from, but you may like to take a look at my tip: Handling simple text files in C/C++[^]. It shows how to convert using the standard Microsoft library functions, which also take account of the fact that the source and destination buffers may not be the same length.
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)