Unicode content from InternetReadFile?

Question

I've been using wininet in a client-side application in Unicode mode. Mostly, it's working.

I read the download into a char* buffer and append it to a std::string on each InternetReadFile call. Since I think I'm getting a UTF-8 encoded web page, I use CA2W(buffer, CP_UTF8) to convert after I've downloaded it all.

However, I'm wondering how I know what the charset encoding is of the downloaded file...

I understand that /if/ it is UTF-8, I should convert it. But what if it isn't UTF-8? What if it is UTF-16 or ISO-8859-1 or ANSI?

I can use HttpQueryInfo to get the content type. Do I need to parse this to find the encoding? I'm going to try that next.

Posted 16-Mar-11 22:15pm

JoeAndrieu

Updated 16-Mar-11 23:18pm

Comments

[no name] 17-Mar-11 6:52am

Text files written as UTF-16 or another Unicode encoding often have a few marker bytes (a byte-order mark) at the very beginning of the file:

UTF-8: EF BB BF
UTF-16BE: FE FF
UTF-16LE: FF FE
UTF-32BE: 00 00 FE FF
UTF-32LE: FF FE 00 00
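As a rough illustration of how these marker bytes could be checked against a downloaded buffer, here is a minimal sketch in portable C++. The function name DetectBom and its signature are made up for this example and are not part of WinINet. A missing mark proves nothing, since many pages are sent without one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding implied by a leading byte-order
// mark, or an empty string when none is found. bomLength receives the number
// of marker bytes to skip before decoding the rest of the buffer.
std::string DetectBom(const unsigned char* data, std::size_t size,
                      std::size_t& bomLength)
{
    bomLength = 0;

    // Test the 4-byte marks first: the UTF-32LE mark begins with the same
    // two bytes as the UTF-16LE mark (FF FE). Note that FF FE 00 00 could
    // also be UTF-16LE text starting with a NUL character; this sketch
    // assumes UTF-32LE in that case.
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) {
        bomLength = 4;
        return "UTF-32BE";
    }
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) {
        bomLength = 4;
        return "UTF-32LE";
    }
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        bomLength = 3;
        return "UTF-8";
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        bomLength = 2;
        return "UTF-16BE";
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        bomLength = 2;
        return "UTF-16LE";
    }
    return "";  // no mark: the encoding has to be determined some other way
}
```

It would be called once on the complete download (or at least its first four bytes), before any conversion to wide characters.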

[no name] 22-Mar-11 4:49am

Every file has a header.
It describes the file format, including the code page, as extended data.

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Nemanja Trifunovic · Accepted Answer · 2011-03-17T02:41:00

If I understand correctly, you are using wininet to download random web pages from the internet and are not sure how to determine their encoding?

That is pretty much impossible to do with 100% accuracy. Your best bet is to check whether it is valid UTF-8 and then look for some sign that it is something else. I would first check the charset parameter of the HTTP Content-Type header, and then also check for the existence of a byte-order mark in the file. There is also the IsTextUnicode WinAPI call, which is used e.g. in Notepad, but even that is not really reliable.
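Two of the checks mentioned here can be sketched in portable C++. This is only an illustration under some assumptions: the Content-Type value is assumed to have already been fetched into a std::string (for example via HttpQueryInfo with HTTP_QUERY_CONTENT_TYPE), and the names CharsetFromContentType and IsValidUtf8 are invented for this example. The header parsing is deliberately simple and is not a full parser for HTTP header syntax.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: pulls the charset parameter out of a Content-Type
// value such as "text/html; charset=UTF-8". Returns "" when it is absent.
std::string CharsetFromContentType(const std::string& contentType)
{
    // Search a lower-cased copy so "Charset=" and "CHARSET=" also match.
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();

    // The value may be quoted: charset="iso-8859-1"
    if (pos < contentType.size() && contentType[pos] == '"') {
        std::size_t end = contentType.find('"', pos + 1);
        if (end == std::string::npos) end = contentType.size();
        return contentType.substr(pos + 1, end - pos - 1);
    }
    std::size_t end = contentType.find_first_of("; \t", pos);
    if (end == std::string::npos) end = contentType.size();
    return contentType.substr(pos, end - pos);
}

// Hypothetical helper: true when the bytes form well-formed UTF-8.
// Rejects overlong forms, surrogates and values above U+10FFFF.
bool IsValidUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;       // number of continuation bytes expected
        unsigned long cp = 0;        // decoded code point
        unsigned long minValue = 0;  // smallest value allowed for this length
        if (c < 0x80) { ++i; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; minValue = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; minValue = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; minValue = 0x10000; }
        else return false;           // stray continuation byte or invalid lead

        if (n - i <= extra) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char t = static_cast<unsigned char>(bytes[i + k]);
            if ((t & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

A reasonable order is to trust a declared charset first, then a byte-order mark, and only then fall back to the validity check. Keep in mind that plain ASCII also passes IsValidUtf8, so a positive result only means that decoding as UTF-8 will not fail, not that the author intended UTF-8.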