Parsing unicode files in C++

Question

4.71/5 (3 votes)

Hi, I am really having trouble dealing with adding Unicode support to my code. I have perfectly working ASCII code, but it breaks as soon as some Chinese and Korean characters appear in the input. I searched the web for sample code or a guide, but there is not enough proper information, so I am looking for someone who has already done this and can help me fix it. I have a simple task to do:
- input a Unicode (UTF-16) text file
- scan it line by line and parse each line into tokens
- using delimiters, keep the tokens that are needed and ignore the rest
- store these tokens in an array-like structure and do some string comparisons on them

I am using Windows with the Code::Blocks editor and MinGW.
I am pasting part of the code below. Any help is greatly appreciated, and if you could give me sample code that would be great.
----------------------------------------

#include <algorithm>
#include <cwchar>
#include <cwctype>
#include <fstream>
#include <iostream>
#include <string>
#include <windows.h>

const int MAX_CHARS_PER_LINE = 4072;
const int MAX_TOKENS_PER_LINE = 1;
const wchar_t* const DELIMITER = L"\"";

class IntegrityCheck
{
    public:
        std::wstring Profile_Container[5000][4];
        void Profile_PRD_Parser();
};

 void IntegrityCheck::Profile_PRD_Parser()
{

std::wstring skip (L".exe");
std::wstring databoxtemp[1][1];
int a=-1;

// create a file-reading object
std::wifstream fin("profiles.prd");  // open the input file
std::wofstream fout("out.txt");  // this dumps the parsing output

// read each line of the file
while (!fin.eof())
{
    // read an entire line into memory
    wchar_t buf[MAX_CHARS_PER_LINE];

    fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = wcstok(buf, DELIMITER); // first token

    if (token[0]) // zero if line is blank
    {

        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens

            if (!token[n]) break; // no more tokens

            std::wstring str2 =token[n];

            std::size_t found = str2.find(skip);  // substring comparison against ".exe"

            if (found!=std::string::npos)   // if its exe then it writes in Dxout for same app name on new line
            {
                a++;
                Profile_Container[a][0]=token[n];
                std::transform(Profile_Container[a][2].begin(), Profile_Container[a][2].end(), Profile_Container[a][2].begin(), ::towlower);  // convert the data to lower case (towlower is the wide-character version)

                fout<<Profile_Container[a][0]<<"\t"<<Profile_Container[a][1]<<"\t"<<Profile_Container[a][2]<<"\n"; //write to file
            }

        }
    }

}

fout.close();
fin.close();
}

int main()
{
IntegrityCheck p1;
p1.Profile_PRD_Parser();
}

Posted 3-Dec-13 14:27pm

nxc121

Updated 30-Jun-22 3:22am

Comments

nv3 4-Dec-13 2:46am

And what exactly is the "trouble" you are having?

nxc121 4-Dec-13 4:25am

Well, it does not work :( . When I pass in the Unicode file, it does not parse correctly and I get a blank output file. When I looked at the buffer value, I saw only binary values. Theoretically it should just work, so what am I missing here?

nv3 4-Dec-13 4:47am

Why don't you simply start your program in a debugger and step through it line by line? That will pretty soon tell you what is wrong. You will need this technique in the years to come many times. So, why not getting used to the debugger right now ?

nxc121 4-Dec-13 5:21am

I am stuck at getline(buf.. . I see only binary data inside the buffer, and that cannot be right, can it? If I try to write that value out, I see nothing but blanks. I have been trying different ways for the last 2 days. Yes, I did debug, but I am unable to understand what I see. I can make out that something is going wrong in getline(buf.. , but I don't know what alternative I can use for it. I will start from scratch again. I need to finish this code by the end of this week, so I am looking for a quick fix that can get me going. I would really appreciate it if you could shed some light on where I am getting it wrong.

nv3 4-Dec-13 8:17am

Are you certain that your input file contains wide characters? If not, I would suggest generating a file by writing a short string to a wofstream and then trying to read that back. That will tell you for certain whether you are stumbling because the input file is just an 8-bit file.
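
A minimal sketch of that experiment, assuming the stream is left in its default locale. The helper name bytes_written_by_wofstream and the probe file name are made up for illustration; they are not from the original code.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: write a wide string through std::wofstream,
// then read the file back in binary mode and return the raw bytes on disk.
std::string bytes_written_by_wofstream(const std::wstring& text, const std::string& path) {
    {
        std::wofstream out(path.c_str(), std::ios::binary);
        out << text;
    }   // the stream is flushed and closed here

    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With the default locale on common implementations, writing L"abc" this way leaves 3 bytes on disk, not 6. That suggests wofstream and wifstream convert between wchar_t and narrow bytes rather than reading or writing UTF-16, which would explain why a UTF-16 input file shows up as "binary" data in buf.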

Member 15078716 30-Jun-22 9:40am

@nv3 - He already said that he is using Code::Blocks, which has a debugger that can show up while the compiler is running. If it finds a problem that stops the compiler, then the response is the first line, and he does not have to step through the debugger line by line for that. It often does not tell you what is wrong. It tells you what the compiler had issues with as far as syntax is concerned. It does not tell you if your operating system has a problem with your code. It does not tell you if the file that you are reading is encoded wrong. It does not tell you if your code, which might work well on other systems, does not work well on the system that you are currently using.

"You will need this technique in the years to come many times. So, why not getting used to the debugger right now ?" That is an assumption that he does not do that already and a premeditated assumptive insult to him. Do not do that. It makes you look petty. At the bottom of this page it says, "Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question."

Member 15078716 30-Jun-22 9:28am

@nv3 - He already said it. "Hi I am really having trouble dealing with adding unicode support to my code". Or are you simply posting that question to add to your total posting count? I am guessing that you are.

jeron1 30-Jun-22 12:01pm

Something tells me nv3 doesn't really care about this 9 years later.

Member 15078716 30-Jun-22 13:46pm

It was, to me, an important enough topic that I searched and found it and some like it. I just thought that others might benefit from what I said. I could be wrong.

jeron1 30-Jun-22 13:55pm

The topic certainly may be important; it's the calling somebody out that seems totally unnecessary.

"Or are you simply posting that question to add to your total posting count? I am guessing that you are."

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

KarstenK · Answer 1 · 2013-12-06T04:58:00

Solution 1

You should step through your code in the debugger with a sample input file.

I would change your code in these places:

C++

// read an entire line into memory
    wchar_t buf[MAX_CHARS_PER_LINE] = {0};

C++

// the first token was already read before the loop
for (n = 1; n < MAX_TOKENS_PER_LINE; n++)

pasztorpisti · Answer 2 · 2013-12-08T13:44:00

Switching to UTF-16 is often not the way to go. UTF-8 is often a better choice for many reasons. It may consume a bit more memory for some East Asian languages, but UTF-8 is usually much more desirable if we consider porting to other platforms (like Unix) and integration with legacy code that uses normal char pointers (like your original code). If you use UTF-8, your original code should keep working, because your delimiter (the double quote) and the ".exe" substring are plain ASCII, and UTF-8 never uses ASCII byte values inside a multi-byte character. Original ASCII parsers and text processors often work with UTF-8 data without any problems.
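
To illustrate the point about ASCII parsers, here is a small sketch of a byte-oriented splitter applied to UTF-8 text. The function name split_utf8 is hypothetical, and unlike wcstok it keeps empty tokens. It is only meant to show that splitting on an ASCII delimiter does not cut through a multi-byte character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 encoded line on one ASCII delimiter byte.
// Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they can never be
// mistaken for an ASCII delimiter such as '"'.
std::vector<std::string> split_utf8(const std::string& line, char delim) {
    std::vector<std::string> tokens;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = line.find(delim, start);
        if (pos == std::string::npos) {
            tokens.push_back(line.substr(start));   // last token
            break;
        }
        tokens.push_back(line.substr(start, pos - start));
        start = pos + 1;                            // continue after the delimiter
    }
    return tokens;
}
```

For a line such as name="中.exe" x, splitting on the double quote gives three tokens, and the middle one still contains the complete Chinese character followed by ".exe", so a plain std::string::find(".exe") works on it.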

Instead of porting all of your text-processing logic, just decide that all of your code works with UTF-8, and create importers (converters from a foreign encoding to UTF-8) and exporters (converters from UTF-8 to a foreign encoding) for all other encodings. In your case this means a UTF-16 to UTF-8 converter for loading and a UTF-8 to UTF-16 converter for saving. Better still would be to get the file in UTF-8 format in the first place, so that you don't have to convert at all. Note that UTF files *MAY* start with a BOM, but this isn't required. Some smart text editors (like Notepad++) can often detect the encoding even without a BOM, but if you read the file in binary mode, your loader code may need to handle the BOM (detect or check it, then skip it).
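
A rough sketch of such an importer, under a few assumptions of mine: the file is read in binary mode into a std::string, a missing BOM is treated as little-endian (the usual case on Windows), and unpaired surrogates are replaced by U+FFFD. The names read_file_bytes, append_utf8 and utf16_bytes_to_utf8 are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes (binary mode, no conversion).
std::string read_file_bytes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Append one Unicode code point to a UTF-8 string.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert raw UTF-16 bytes to UTF-8. A BOM, if present, selects the byte
// order and is skipped; without a BOM, little-endian is assumed.
std::string utf16_bytes_to_utf8(const std::string& bytes) {
    std::size_t i = 0;
    bool big_endian = false;
    if (bytes.size() >= 2) {
        unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) { i = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; i = 2; }
    }
    // Fetch the 16-bit code unit that starts at byte offset pos.
    auto unit = [&](std::size_t pos) -> std::uint32_t {
        unsigned char first = static_cast<unsigned char>(bytes[pos]);
        unsigned char second = static_cast<unsigned char>(bytes[pos + 1]);
        return big_endian ? static_cast<std::uint32_t>((first << 8) | second)
                          : static_cast<std::uint32_t>((second << 8) | first);
    };
    std::string out;
    for (; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            std::uint32_t low = unit(i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {   // valid surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;  // unpaired surrogate
        append_utf8(out, cp);
    }
    return out;
}
```

Something like utf16_bytes_to_utf8(read_file_bytes("profiles.prd")) would then give one UTF-8 string that the existing char-based parsing can work on line by line.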

Member 15078716 · Answer 3 · 2022-06-30T03:22:00

Since you are using Code::Blocks (undisclosed version), and
since you are using Microsoft Windows (undisclosed version), and
since you are using MinGW (undisclosed version) with probably GCC (undisclosed version),
all of which is OK, the following will probably help with seeing the Unicode that you are working with.

There is a lot that you have not said, so I will be expansive in my answer. If you already know these things, then that is nice.

I did not see it in your example, so I suggest that you add this to the beginning of your code, before any #include:

#define _UNICODE
#define UNICODE

You do not have to switch to UTF-8, as UTF-16 will (to some extent) work. It is my opinion that UTF-16 is very limited, even though your Windows operating system might be using it in its own processing. Windows has many code pages that can sometimes be loaded when the operating system is being loaded, and these seem to help with those limitations of UTF-16, but I still prefer UTF-8. If it is your choice to use UTF-16, you can stay with it, but characters above U+FFFF take two 16-bit code units each (a surrogate pair), so code that assumes one wchar_t per character might not show them correctly.
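
As a small illustration of that limitation, here is a sketch that counts code points in a UTF-16 string. It assumes a C++11 compiler (for std::u16string and u"" literals), and the function name count_code_points is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;    // skip the low surrogate of the pair
        }
        ++count;
    }
    return count;
}
```

For u"A\U0001F600" the string holds 3 code units but only 2 code points, so anything that indexes or truncates by code unit can split the second character in half.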

If you do not mind some small further adjustments: (Just some suggestions which you might have already known about.)

It looks like you might be using the command line interface setting in Code::Blocks. See the following.

I suggest Settings / Editor / Encoding settings / UTF-8 (If that is OK with you) or UTF-16LE (for Windows) (or BE for Macintosh) if you want to stay with utf-16.

Also, Project / Properties / Build targets / Debug (and Release also) / Platforms / All

Also, Project / Properties / Build targets / Debug (and Release also) / Type / GUI application (if you want to make a program with a Graphical User Interface) or Console application (if you want a console interface)

The advice that I give to you is based upon my using the GUI instead of the CLI because I can produce both a graphical output and a cli output from the GUI application.

__________

And, and this is big, REALLY BIG : Microsoft does not play well with C++11 and even C++17 when the code includes reading and writing Unicode. For that I am using WriteFile(), etc.

Thank you for asking.

As is common: if this answer works for you, please click accept and rate it with the stars.
