Hello
I need to translate strings that I receive from an HTML page into wide chars (in an ISAPI extension).

Strangely, the code below works with Asian characters but not with Latin accents.
I suspect that the Latin accents should have been encoded... but they are not.

C++
const char * pString = "éàa";
int nSize = MultiByteToWideChar(CP_UTF8,0,pString,-1,NULL,0);
if ( nSize != 0 )
{
    WCHAR * pBuffer= new WCHAR[nSize];
    MultiByteToWideChar(CP_UTF8,0,pString,-1,pBuffer,nSize);
    // pBuffer holds 65533, 65533, 97: the 2 accents are not recognised
    // (65533 is U+FFFD, the Unicode replacement character)
    delete [] pBuffer;
}


So what do I need to do to transform the string I receive into proper Unicode wide chars?
Thanks in advance,
Jerry

Your input string pString is not in UTF-8 format, but in extended ASCII format. If you pass in legal UTF-8, I would expect your code to work.

The accented characters will be represented by 2-byte sequences in UTF-8. You might want to try to insert the proper UTF-8 sequence by using escapes.
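
To make the escape suggestion concrete, here is a minimal sketch in portable C++ (no Windows API, so it can be tried anywhere). The helper `DecodeUtf8` and the two constants are hypothetical names introduced for illustration, and the assumption is that the original bytes were Windows-1252 (é = E9, à = E0). The decoder is simplified: it does not reject overlong forms or surrogates, and it is not claimed to match MultiByteToWideChar in every case.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 into code points and emits U+FFFD
// (65533) for each byte that does not start a well-formed sequence.
std::vector<std::uint32_t> DecodeUtf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        std::uint32_t cp = 0;
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < s.size());               // sequence not truncated
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;         // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}

// "éàa" as raw Windows-1252 bytes (assumed source encoding): not valid UTF-8.
const std::string kAnsiBytes = "\xE9\xE0" "a";
// The same text written with UTF-8 escapes: two bytes per accented letter.
const std::string kUtf8Bytes = "\xC3\xA9\xC3\xA0" "a";
```

Decoding `kAnsiBytes` gives two replacement characters followed by 97, which is the symptom from the question, while `kUtf8Bytes` decodes to 233, 224, 97.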
 
 
Comments
BadJerry 29-Nov-13 8:07am    
Hello and thanks. But my problem is that I receive the string like this. Is there a code page I can use instead or a pre-processing step I could use to ensure that I always correctly decode the string?
Thanks in advance
nv3 29-Nov-13 8:16am    
If your input is always like this you probably want to use CP_ACP instead of CP_UTF8 in your conversion calls.

But I understand that you sometimes are receiving Chinese text via the same interface and this will probably be encoded as UTF-8. If that is the case, you need to write some code that first detects the encoding of your input and then applies the appropriate CP_ argument for the conversion.
BadJerry 29-Nov-13 8:55am    
Thanks again! And how do I detect the right code page? I want to make sure I do not ever lose data of course.
nv3 29-Nov-13 10:39am    
Well, in general it is not possible to detect which encoding a byte sequence is written in. In some cases you might be in luck, though: if the string contains a BOM (byte order mark) at its beginning, it can be detected as UTF-8. But not all UTF-8 strings have such a BOM. You can also validate a string from start to end and verify whether it is legal UTF-8. If it is not, you again have a clue, but if it is legal UTF-8, that doesn't prove that it was actually meant as a UTF-8 string.
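
The two checks described here could be sketched as follows in portable C++. `HasUtf8Bom` and `IsValidUtf8` are hypothetical helper names, not part of any library; the validator rejects truncated sequences, overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// True if the data starts with the UTF-8 byte order mark EF BB BF.
bool HasUtf8Bom(const std::string& s)
{
    return s.size() >= 3 &&
           static_cast<unsigned char>(s[0]) == 0xEF &&
           static_cast<unsigned char>(s[1]) == 0xBB &&
           static_cast<unsigned char>(s[2]) == 0xBF;
}

// True if the whole string is well-formed UTF-8.
bool IsValidUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;        // continuation bytes expected
        std::uint32_t cp = 0;         // decoded code point
        std::uint32_t smallest = 0;   // smallest value allowed for this length
        if (b < 0x80)                { extra = 0; cp = b;        smallest = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; smallest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; smallest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; smallest = 0x10000; }
        else return false;                              // invalid lead byte

        if (i + extra >= s.size()) return false;        // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < smallest) return false;                // overlong encoding
        if (cp > 0x10FFFF) return false;                // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += extra + 1;
    }
    return true;
}
```

Note the caveat from the text: the bytes C3 A9 pass `IsValidUtf8`, yet they could equally be the two Windows-1252 characters "Ã©", so a successful validation is only a hint.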

Hence, the best solution is that the source of your input string gives you a clue about what type of encoding is used. You could for example design two input functions of your class, one of which takes a UTF-8 argument and the other an ANSI string.
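
The two-entry-point design could look roughly like this. It is a sketch under stated assumptions: the class name `Text` and its members are hypothetical, and ISO-8859-1 stands in for "ANSI" so that the example stays portable. Windows-1252 differs from ISO-8859-1 in the range 0x80 to 0x9F, so on Windows the ANSI path would more likely go through MultiByteToWideChar with CP_ACP.

```cpp
#include <string>

// Hypothetical wrapper that always stores its text as UTF-8.
class Text {
public:
    // The caller states that the bytes are already UTF-8; stored unchanged.
    static Text FromUtf8(const std::string& bytes) { return Text(bytes); }

    // The caller states that the bytes are ISO-8859-1 (assumed stand-in for
    // the ANSI code page). Each byte maps to the code point of the same value.
    static Text FromLatin1(const std::string& bytes)
    {
        std::string utf8;
        utf8.reserve(bytes.size() * 2);
        for (char ch : bytes) {
            const unsigned char b = static_cast<unsigned char>(ch);
            if (b < 0x80) {
                utf8.push_back(static_cast<char>(b));               // ASCII as-is
            } else {
                utf8.push_back(static_cast<char>(0xC0 | (b >> 6))); // lead byte
                utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));
            }
        }
        return Text(utf8);
    }

    const std::string& Utf8() const { return utf8_; }

private:
    explicit Text(const std::string& utf8) : utf8_(utf8) {}
    std::string utf8_;
};
```

Because the encoding is chosen by the caller rather than guessed, no input can be silently misread; the cost is that the caller has to know where the bytes came from.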
If you receive a string from HTML, then the encoding is either specified in the page or determined by some rules (the same way a browser knows which encoding was used).

As far as I know, in an HTML page, UTF-8 is usually specified. If not, then the encoding is supposed to be ASCII and accented characters should have been coded using HTML entities (for example &eacute; for é).
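
As an illustration of the entity approach, the sketch below writes every non-ASCII code point as a numeric character reference such as `&#233;`. The function name `ToAsciiWithEntities` is hypothetical, the input is assumed to be already decoded into code points, and markup characters like `&` and `<` are deliberately left untouched.

```cpp
#include <string>

// Hypothetical helper: returns pure ASCII, replacing each code point above
// 0x7F with a decimal numeric character reference (&#NNN;).
std::string ToAsciiWithEntities(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80)
            out.push_back(static_cast<char>(cp));
        else
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
    }
    return out;
}
```

For the string from the question, `ToAsciiWithEntities(U"\u00E9\u00E0" U"a")` yields `&#233;&#224;a`, which survives any ASCII-compatible transport unchanged.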

Nevertheless, I think that some browsers will assume the Windows ANSI (Western) code page if nothing is specified and there are some characters >= 128. You should not rely on this.

As mentioned in another solution, it is sometimes possible to guess using some rules, but whenever possible you should not rely on that. Instead, you should know which encoding was used in the source and which one you want for the target, and do the appropriate conversion.

Usually you should use UTF-8 encoding and, if working on Windows, you might use UTF-16 too.
 
 
Comments
BadJerry 1-Dec-13 13:06pm    
Thanks for this... I have posted something below that seems to work... I still have no idea about how robust it is!
In the end I have used the following code to make sure that I store proper UTF-8 strings from what I receive (it's for an ISAPI extension, hence UNICODE is not used)
C++
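#ifndef _UNICODE
// Returns strData unchanged if it is already valid UTF-8; otherwise treats it
// as text in the system ANSI code page (CP_ACP) and re-encodes it as UTF-8.
CString CStaticTools::MakeUTF8Compatible(const CString & strData)
{
	// MB_ERR_INVALID_CHARS makes the call fail (return 0) on malformed UTF-8,
	// so a non-zero size means the input already decodes as UTF-8.
	int nSize = MultiByteToWideChar(CP_UTF8,MB_ERR_INVALID_CHARS,strData,-1,NULL,0);
	if ( nSize != 0 )
		return strData;

	// Not valid UTF-8: decode from the ANSI code page into UTF-16.
	nSize = MultiByteToWideChar(CP_ACP,MB_ERR_INVALID_CHARS,strData,-1,NULL,0);
	if ( nSize == 0 )
		return strData;

	WCHAR * pBuffer= new WCHAR[nSize];
	MultiByteToWideChar(CP_ACP,0,strData,-1,pBuffer,nSize);

	// Re-encode the UTF-16 text as UTF-8.
	int nUtfSize = WideCharToMultiByte(CP_UTF8,0,pBuffer,-1,NULL,0,NULL,NULL);
	if ( nUtfSize == 0)
	{
		delete [] pBuffer;
		return strData;
	}

	char * pDest = new char[nUtfSize];
	WideCharToMultiByte(CP_UTF8,0,pBuffer,-1,pDest,nUtfSize,NULL,NULL);

	CString strResult = pDest;

	// Arrays allocated with new[] must be released with delete [].
	delete [] pBuffer;
	delete [] pDest;

	return strResult;
}
#endif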


Please tell me if I have missed something.
Thanks Philippe (merci!) and nv3!
 
 
Comments
nv3 2-Dec-13 7:28am    
Does not look good to me; but perhaps I am misinterpreting your code. If it contained some comments, things would be easier to explain.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)