Latin characters, MultiByteToWideChar and UTF-8

Question

Hello
I need to convert strings that I receive from an HTML page into wide chars (in an ISAPI extension).

Strangely, the code below works with Asian characters but not with Latin accents.
I suspect that the Latin accents should have been encoded, but they are not.

C++

const char * pString = "éàa";
int nSize = MultiByteToWideChar(CP_UTF8,0,pString,-1,NULL,0);
if ( nSize != 0 )
{
    WCHAR * pBuffer= new WCHAR[nSize];
    MultiByteToWideChar(CP_UTF8,0,pString,-1,pBuffer,nSize);
    // pBuffer holds 65533, 65533, 97: the 2 accents are not recognised
    // and come back as U+FFFD (the replacement character)
    delete [] pBuffer;
}

So what do I need to do to convert the string I receive into proper Unicode wide chars?
Thanks in advance,
Jerry

Posted 28-Nov-13 22:12pm

BadJerry

3 solutions

Solution 1

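Your input string pString is not in UTF-8 format, but in a single-byte extended ASCII (ANSI) encoding, where each accented character is one byte above 127. Such bytes are not valid UTF-8 on their own, which is why the conversion returns 65533 (U+FFFD) for them. If you pass valid UTF-8, I would assume that your code would work.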

The accented characters will be represented by 2-byte sequences in UTF-8. You might want to try to insert the proper UTF-8 sequence by using escapes.
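As a minimal sketch of the escape approach, the test string can be spelled out byte by byte so that it no longer depends on how the source file was saved. The function name `accentedSampleUtf8` is hypothetical and only used for illustration.

```cpp
#include <string>

// "éàa" written as explicit UTF-8 bytes:
//   U+00E9 (é) -> C3 A9,  U+00E0 (à) -> C3 A0,  'a' -> 61
// The literal is split before "a" so that the hex escape \xA0
// does not swallow the following 'a' as another hex digit.
inline std::string accentedSampleUtf8()
{
    return "\xC3\xA9\xC3\xA0" "a";
}
```

Passing `accentedSampleUtf8().c_str()` to the original `MultiByteToWideChar(CP_UTF8, ...)` calls should then yield 233, 224, 97 instead of 65533, 65533, 97.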

Posted 28-Nov-13 22:30pm

nv3

Comments

BadJerry 29-Nov-13 8:07am

Hello and thanks. But my problem is that I receive the string like this. Is there a code page I can use instead or a pre-processing step I could use to ensure that I always correctly decode the string?
Thanks in advance

nv3 29-Nov-13 8:16am

If your input is always like this you probably want to use CP_ACP instead of CP_UTF8 in your conversion calls.

But I understand that you sometimes are receiving Chinese text via the same interface and this will probably be encoded as UTF-8. If that is the case, you need to write some code that first detects the encoding of your input and then applies the appropriate CP_ argument for the conversion.

BadJerry 29-Nov-13 8:55am

Thanks again! And how do I detect the right code page? I want to make sure I do not ever lose data of course.

nv3 29-Nov-13 10:39am

Well, in general it is not possible to detect which encoding a byte sequence is written in. In some cases you might be in luck, though: if the string contains a BOM (byte order mark) at its beginning, it can be detected as UTF-8. But not all UTF-8 strings have such a BOM. You can also validate a string from start to end and check whether it is legal UTF-8. If it is not, you again have a clue, but if it is legal UTF-8, that doesn't prove that it was necessarily meant as a UTF-8 string.
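The two checks described here can be sketched in portable C++ roughly as follows. The helper names `hasUtf8Bom` and `isValidUtf8` are hypothetical, and this is only one possible way to write the validation, not code from the thread.

```cpp
#include <cstddef>
#include <string>

// True if the data starts with the UTF-8 byte order mark EF BB BF.
inline bool hasUtf8Bom(const std::string& s)
{
    return s.size() >= 3 &&
           static_cast<unsigned char>(s[0]) == 0xEF &&
           static_cast<unsigned char>(s[1]) == 0xBB &&
           static_cast<unsigned char>(s[2]) == 0xBF;
}

// True if every byte sequence in s is well-formed UTF-8
// (rejects overlong forms, surrogates and values above U+10FFFF).
inline bool isValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;      // continuation bytes expected after the lead byte
        unsigned long cp;       // code point being decoded
        unsigned long minCp;    // smallest code point allowed for this length
        if (c < 0x80)                { ++i; continue; }   // plain ASCII
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; minCp = 0x10000; }
        else return false;      // stray continuation byte or invalid lead byte
        if (n - i <= extra) return false;                 // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;        // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF) return false;    // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}
```

A single-byte "éàa" (bytes E9 E0 61) fails this check, which is the useful clue. The reverse does not hold: the bytes C3 A9 pass the check, yet they could equally be the two Latin-1 characters "Ã©", so a positive result remains a guess.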

Hence, the best solution is that the source of your input string gives you a clue about what type of encoding is used. You could for example design two input functions of your class, one of which takes a UTF-8 argument and the other an ANSI string.

Solution 2

If you receive a string from an HTML page, then the encoding is either specified in the page or determined by some rules (the same way a browser knows which encoding was used).

As far as I know, in an HTML page, UTF-8 is usually specified. If not, then the encoding is supposed to be ASCII and accented characters should have been coded using HTML entities (for example &eacute; for é).

Nevertheless, I think that some browsers will assume the Windows ANSI (Western) code page if nothing is specified and there are some characters >= 128. You should not rely on this.

As mentioned in another solution, it is sometimes possible to guess using some rules, but whenever possible you should not rely on that. Instead, you should know which encoding was used in the source and which one you want for the target, and do the appropriate conversion.

Usually you should use UTF-8 encoding and if working on Windows, you might use UTF-16 too.

Posted 30-Nov-13 3:10am

Philippe Mori

Comments

BadJerry 1-Dec-13 13:06pm

Thanks for this... I have posted something below that seems to work... I still have no idea about how robust it is!

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

BadJerry · Accepted Answer · 2013-12-01T07:11:00

In the end I used the following code to make sure that I store proper UTF-8 strings from what I receive (it's for an ISAPI extension, hence UNICODE is not used).

C++

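#ifndef _UNICODE
// Returns strData unchanged if it is already valid UTF-8; otherwise treats it
// as text in the system ANSI code page (CP_ACP) and re-encodes it as UTF-8.
CString CStaticTools::MakeUTF8Compatible(const CString & strData)
{
	// MB_ERR_INVALID_CHARS makes the call fail (return 0) on malformed UTF-8.
	int nSize = MultiByteToWideChar(CP_UTF8,MB_ERR_INVALID_CHARS,strData,-1,NULL,0);
	if ( nSize != 0 )
		return strData;	// already valid UTF-8

	// Not valid UTF-8: decode it as ANSI instead.
	nSize = MultiByteToWideChar(CP_ACP,MB_ERR_INVALID_CHARS,strData,-1,NULL,0);
	if ( nSize == 0 )
		return strData;	// not decodable as ANSI either, leave as is

	WCHAR * pBuffer= new WCHAR[nSize];
	MultiByteToWideChar(CP_ACP,0,strData,-1,pBuffer,nSize);

	// Re-encode the wide (UTF-16) text as UTF-8.
	int nUtfSize = WideCharToMultiByte(CP_UTF8,0,pBuffer,-1,NULL,0,NULL,NULL);

	if ( nUtfSize == 0)
	{
		delete [] pBuffer;	// array form must match new WCHAR[]
		return strData;
	}

	char * pDest = new char[nUtfSize];

	WideCharToMultiByte(CP_UTF8,0,pBuffer,-1,pDest,nUtfSize,NULL,NULL);

	CString strResult = pDest;

	delete [] pBuffer;
	delete [] pDest;	// array form must match new char[]

	return strResult;
}
#endif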
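To illustrate what the ANSI to UTF-8 re-encoding step does at the byte level, here is a portable sketch that does not use the Win32 calls. It assumes the legacy encoding is ISO-8859-1 (Latin-1), which is only an approximation of CP_ACP: on Western Windows systems the ANSI code page is typically Windows-1252, which differs from Latin-1 in the range 0x80 to 0x9F. The name `latin1ToUtf8` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Re-encodes ISO-8859-1 bytes as UTF-8. Every Latin-1 byte value equals its
// Unicode code point, so bytes below 0x80 are copied unchanged and all other
// bytes expand to a two-byte UTF-8 sequence.
inline std::string latin1ToUtf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

With this sketch, the single-byte input E9 E0 61 ("éàa") becomes C3 A9 C3 A0 61. Applying it to text that is already UTF-8 would double-encode it, which is why the function above first checks for valid UTF-8 and only converts when that check fails.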

Please tell me if I have missed something.
Thanks Philippe (merci!) and nv3!