I've been using WinINet in a client-side application built in Unicode mode. Mostly, it's working.

I read the download into a char* buffer and append it to a std::string after each InternetReadFile call. Since I think I'm getting a UTF-8 encoded web page, I use CA2W(buffer, CP_UTF8) to convert once I've downloaded it all.


However, I'm wondering how I can tell what character encoding the downloaded file uses.

I understand that /if/ it is UTF-8, I should convert it. But what if it isn't UTF-8? What if it is UTF-16 or ISO-8859-1 or ANSI?

I can use HttpQueryInfo to get the content type. Do I need to parse this to find the encoding? I'm going to try that next.
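
As a rough illustration of the parsing step asked about here, the sketch below pulls the charset parameter out of a Content-Type value such as "text/html; charset=UTF-8". It assumes the header text has already been fetched (for example with HttpQueryInfo and HTTP_QUERY_CONTENT_TYPE) and converted to a narrow string; the function name charsetFromContentType is hypothetical and not part of WinINet.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns the charset parameter of a Content-Type
// value in lower case, or an empty string if there is no charset parameter.
std::string charsetFromContentType(const std::string& contentType)
{
    // Lower-case a copy so that "Charset=" and "CHARSET=" also match.
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();
    pos += key.size();

    // The value may be quoted, e.g. charset="utf-8".
    if (pos < lower.size() && lower[pos] == '"') {
        std::string::size_type close = lower.find('"', pos + 1);
        if (close == std::string::npos)
            close = lower.size();
        return lower.substr(pos + 1, close - pos - 1);
    }

    // Otherwise the value ends at the next ';' or whitespace.
    std::string::size_type end = lower.find_first_of("; \t\r\n", pos);
    if (end == std::string::npos)
        end = lower.size();
    return lower.substr(pos, end - pos);
}
```

An empty result only means the server did not declare a charset, which is common, so the caller still needs a fallback. A declared charset can also be wrong, so it is a hint rather than a guarantee.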
Updated 16-Mar-11 23:18pm
Comments
[no name] 17-Mar-11 6:52am    
Text files written as UTF-16 or another Unicode encoding often start with a byte order mark (BOM), a few marker bytes at the very beginning of the file:

UTF-8: EF BB BF
UTF-16BE: FE FF
UTF-16LE: FF FE
UTF-32BE: 00 00 FE FF
UTF-32LE: FF FE 00 00
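
The marker bytes listed above can be checked with a few comparisons. This is a minimal sketch, assuming the raw download is held in a std::string as in the question; the Bom enum and the detectBom name are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for the example.
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Looks at the first bytes of the raw download and reports which
// byte order mark, if any, is present.
Bom detectBom(const std::string& bytes)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(bytes.data());
    const std::size_t n = bytes.size();

    // Test the 4-byte marks first: the UTF-32LE mark (FF FE 00 00)
    // begins with the UTF-16LE mark (FF FE).
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::Utf32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::Utf32LE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;
}
```

Bom::None does not identify an encoding: many UTF-8 pages and all single-byte code pages are served without a BOM. The bytes FF FE 00 00 are also ambiguous in principle, since they could be UTF-16LE text that starts with a NUL character; the sketch treats them as UTF-32LE.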
[no name] 22-Mar-11 4:49am    
Many file formats have a header.
It describes the file format, including the code page, as extended data about the content.

1 solution

If I understand correctly, you are using wininet to download random web pages from the internet and are not sure how to determine their encoding?

That is pretty much impossible to do with 100% accuracy. Your best bet is to check whether the data is valid UTF-8 and, if it is not, look for some sign that it is something else. I would first check the charset in the HTTP Content-Type header, and then also check for the existence of a byte-order mark at the start of the data. There is also the IsTextUnicode WinAPI call, which is used e.g. by Notepad, but even that is not really reliable.
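
The "is it valid UTF-8" check could look roughly like the sketch below. It assumes the whole download is available in a std::string; isValidUtf8 is a name invented for the example, not a Windows API. It rejects stray or missing continuation bytes, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const std::string& bytes)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(bytes.data());
    const std::size_t n = bytes.size();
    std::size_t i = 0;

    while (i < n) {
        const unsigned char c = p[i];
        std::size_t extra;    // continuation bytes expected after the lead byte
        unsigned long cp;     // code point being assembled
        unsigned long min;    // smallest code point allowed for this length

        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;    // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra)
            return false;     // sequence is cut off at the end of the data

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80)
                return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

Pure ASCII also passes this check, which is harmless because ASCII decodes identically as UTF-8. Text in ISO-8859-1 or a Windows code page that contains accented characters will usually fail it, which is the signal to fall back to another code page. A short non-UTF-8 text can still pass by coincidence, so the result remains a heuristic.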
 
Comments
JoeAndrieu 18-Mar-11 4:20am    
Thanks. This should work.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


