Hi, I am having real trouble adding Unicode support to my code. I have perfectly working ASCII code, but it breaks as soon as some Chinese or Korean characters appear in the input. I searched the web for sample code or a guide, but there is not enough proper information, so I am looking for someone who has already done this and can help me fix it. I have a simple task to do:
- read a Unicode (UTF-16) text file
- scan it line by line and parse each line into tokens
- use delimiters to keep the tokens I need and ignore the rest
- store these tokens in an array-like structure and do some string comparisons on them.

I am on Windows and use the Code::Blocks editor with MinGW.
I am pasting part of the code below. Any help is greatly appreciated, and if you could give me sample code that would be great.
----------------------------------------
#include <algorithm>
#include <cwchar>     // wcstok
#include <cwctype>    // towlower
#include <fstream>    // std::wifstream, std::wofstream
#include <iostream>
#include <string>     // std::wstring
#include <windows.h>
const int MAX_CHARS_PER_LINE = 4072;  
const int MAX_TOKENS_PER_LINE = 1;      
const wchar_t* const DELIMITER = L"\"";

class IntegrityCheck
{
    public:
        std::wstring Profile_Container[5000][4];
        void Profile_PRD_Parser();
};

 void IntegrityCheck::Profile_PRD_Parser()
{

std::wstring skip (L".exe");
std::wstring databoxtemp[1][1];
int a=-1;

// create a file-reading object
std::wifstream fin("profiles.prd");  // open the input file
std::wofstream fout("out.txt");      // this receives the parsing output

// read each line of the file; stop as soon as a read fails
wchar_t buf[MAX_CHARS_PER_LINE];
while (fin.getline(buf, MAX_CHARS_PER_LINE))
{
    // buf now holds one entire line

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = wcstok(buf, DELIMITER); // first token

    if (token[0]) // zero if line is blank
    {

        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens

            if (!token[n]) break; // no more tokens

            std::wstring str2 = token[n];

            std::size_t found = str2.find(skip);  // substring comparison against L".exe"

            if (found != std::wstring::npos)   // if it is an exe, write it to the output on a new line
            {
                a++;
                Profile_Container[a][0]=token[n];
                std::transform(Profile_Container[a][2].begin(), Profile_Container[a][2].end(), Profile_Container[a][2].begin(), ::towlower);  // convert all data to lower case (wide-character version of tolower)

                fout<<Profile_Container[a][0]<<"\t"<<Profile_Container[a][1]<<"\t"<<Profile_Container[a][2]<<"\n"; //write to file
            }

        }
    }

}

fout.close();
fin.close();
}

int main()
{
IntegrityCheck p1;
p1.Profile_PRD_Parser();
}     
Updated 30-Jun-22 3:22am
Comments
nv3 4-Dec-13 2:46am    
And what exactly is the "trouble" you are having?
nxc121 4-Dec-13 4:25am    
Well, it does not work :( . When I pass in the Unicode file it does not parse correctly, and I get a blank output file. When I inspected the buffer, I saw only binary values. Theoretically it should just work; what am I missing here?
nv3 4-Dec-13 4:47am    
Why don't you simply start your program in a debugger and step through it line by line? That will pretty soon tell you what is wrong. You will need this technique many times in the years to come, so why not get used to the debugger right now?
nxc121 4-Dec-13 5:21am    
I am stuck at getline(buf.. . I see only binary data inside it; that can't be right, can it? If I try to write that value out I see nothing but blanks. I have been trying different ways for the last 2 days. Yes, I did debug, but I am unable to understand the result. I can make out that something is going wrong in getline(buf.. , but I don't know what alternative I can use for it. I will start from scratch again. I need to finish this code by the end of this week, so I am looking for a quick fix that gets me going. I would really appreciate it if you could shed some light on where I am getting it wrong.
nv3 4-Dec-13 8:17am    
Are you certain that your input file contains wide characters? If not, then I would suggest generating a file by writing a short string to a wofstream and then trying to read that back. That will give you certainty that you are not stumbling because the input file is just an 8-bit file.

You should step through your code in the debugger with a small sample file.

I would change your code in these places:

C++
// read an entire line into memory
    wchar_t buf[MAX_CHARS_PER_LINE] = {0};


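C++
// the first token was already read before the loop, so start at 1
// (note: with MAX_TOKENS_PER_LINE = 1 this loop body never runs, so raise that constant too)
for (n = 1; n < MAX_TOKENS_PER_LINE; n++)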
 
Switching to UTF-16 is often not the way to go; UTF-8 is often a better choice for many reasons. It may consume a bit more memory for some East Asian languages, but UTF-8 is usually much more desirable if we consider porting to other platforms (like Unix) and integration with legacy code that uses normal char pointers (like your original code). If you use UTF-8, your original code will work like a charm. Original ASCII parsers and text processors often work with UTF-8 data without any problems, because every byte of a multi-byte UTF-8 sequence is 0x80 or above and so can never be mistaken for an ASCII delimiter.
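
As a rough illustration of that point, here is a minimal sketch of the asker's token filter written against plain char data that is assumed to already be UTF-8. The function name tokens_containing is made up for this example and is not part of the original code; the quote delimiter and the ".exe" filter come from the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): split a UTF-8 line on an
// ASCII delimiter and keep only the non-empty tokens that contain `needle`.
// This is safe for UTF-8 because the bytes of multi-byte sequences are all
// >= 0x80, so they never compare equal to an ASCII delimiter such as '"'.
std::vector<std::string> tokens_containing(const std::string& line,
                                           char delim,
                                           const std::string& needle)
{
    std::vector<std::string> kept;
    std::size_t start = 0;
    while (start <= line.size()) {
        std::size_t end = line.find(delim, start);
        if (end == std::string::npos) {
            end = line.size();              // last token runs to end of line
        }
        const std::string tok = line.substr(start, end - start);
        if (!tok.empty() && tok.find(needle) != std::string::npos) {
            kept.push_back(tok);            // byte-wise substring match
        }
        start = end + 1;                    // continue after the delimiter
    }
    return kept;
}
```

Calling tokens_containing(line, '"', ".exe") on each line read with std::getline from an ordinary std::ifstream would then stand in for the wcstok loop, with no wide-character types involved. Case conversion of non-ASCII text is a separate problem that this sketch does not address.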

Instead of porting all of your text-processing logic, decide that all of your code works with UTF-8 and create importers (foreign encoding to UTF-8 converters) and exporters (UTF-8 to foreign encoding converters) for all other encodings. In your case this means a UTF-16 to UTF-8 converter for loading and a UTF-8 to UTF-16 converter for saving. Better still would be to get the file in UTF-8 format in the first place; then you don't have to convert at all. Note that UTF files *MAY* start with a BOM, but this isn't required. Some smart text editors (like Notepad++) can often detect the encoding even without a BOM, but handling the BOM (detecting or checking it, then skipping it) in your loader code may be necessary if you read the file in binary mode.
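
A hand-rolled importer along those lines might look like the sketch below. It assumes the whole file has already been read in binary mode into a std::string of raw bytes; the names append_utf8 and utf16_bytes_to_utf8 are invented for this example. It checks for the BOM (FF FE for little-endian, FE FF for big-endian), falls back to little-endian when there is none (an assumption that suits Windows files), combines surrogate pairs, and replaces unpaired surrogates with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical importer: raw UTF-16 file bytes -> UTF-8 string.
std::string utf16_bytes_to_utf8(const std::string& bytes)
{
    std::size_t i = 0;
    bool big_endian = false;                 // assumption: LE when no BOM
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) { i = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { i = 2; big_endian = true; }
    }

    // Read the 16-bit code unit stored at byte offset `pos`.
    auto unit = [&](std::size_t pos) -> std::uint32_t {
        const std::uint32_t a = static_cast<unsigned char>(bytes[pos]);
        const std::uint32_t b = static_cast<unsigned char>(bytes[pos + 1]);
        return big_endian ? ((a << 8) | b) : ((b << 8) | a);
    };

    std::string out;
    while (i + 1 < bytes.size()) {
        std::uint32_t cp = unit(i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            const std::uint32_t lo = unit(i);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {   // valid surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                          // unpaired surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The result can be split into lines and tokenized with ordinary std::string functions. A trailing odd byte is silently ignored here; real loader code would probably want to report that as an error.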
 
Since you are using Code::Blocks (undisclosed version), and
since you are using Microsoft Windows (undisclosed version), and
since you are using MinGW (undisclosed version) with probably GCC (undisclosed version),
all of which is OK, the following will probably help you see the Unicode that you are working with.

There is a lot that you have not said, so I will be expansive in my answer. If you already know these things, then that is nice.



I did not see it in your example, so I suggest that you add this to the beginning of your code:

#define _UNICODE
#define UNICODE


You do not have to switch to UTF-8, as UTF-16 will (to some extent) work. In my opinion UTF-16 is quite limiting, even though Windows uses it for its own internal processing. Windows also has many code pages that can sometimes be loaded when the operating system starts, and these seem to help with some of those limitations, but I still prefer UTF-8. If you choose to stay with UTF-16 you can, but characters outside the Basic Multilingual Plane take two 16-bit code units (a surrogate pair), so code that assumes one wchar_t per character may not handle them correctly.

If you do not mind some small further adjustments: (Just some suggestions which you might have already known about.)

It looks like you might be using the command line interface setting in Code::Blocks. See the following.

I suggest Settings / Editor / Encoding settings / UTF-8 (If that is OK with you) or UTF-16LE (for Windows) (or BE for Macintosh) if you want to stay with utf-16.

Also, Project / Properties / Build targets / Debug (and Release also) / Platforms / All

Also, Project / Properties / Build targets / Debug (and Release also) / Type / GUI application (if you want to make a program with a Graphical User Interface) or Console application (if you want a console interface)

The advice that I give to you is based upon my using the GUI instead of the CLI because I can produce both a graphical output and a cli output from the GUI application.

__________

And, and this is big, REALLY BIG : Microsoft does not play well with C++11 and even C++17 when the code includes reading and writing Unicode. For that I am using WriteFile(), etc.
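
WriteFile() is Windows-only. The same idea of bypassing the stream's character conversion can be sketched portably by building the bytes yourself and writing them through a binary std::ofstream. This is only an illustration under stated assumptions: it uses the C++11 type std::u16string, the function names are made up for this example, and the output is UTF-16LE with a leading BOM.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: serialize UTF-16 code units as little-endian bytes,
// preceded by the FF FE byte order mark.
std::string utf16le_bytes_with_bom(const std::u16string& text)
{
    std::string bytes = "\xFF\xFE";
    for (char16_t u : text) {
        bytes += static_cast<char>(u & 0xFF);          // low byte first
        bytes += static_cast<char>((u >> 8) & 0xFF);   // then high byte
    }
    return bytes;
}

// Hypothetical helper: write the bytes unchanged. Binary mode keeps the
// runtime from translating '\n' or applying any character conversion.
bool write_utf16le_file(const char* path, const std::u16string& text)
{
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return false;                                  // could not open file
    }
    const std::string bytes = utf16le_bytes_with_bom(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

Because the file is opened in binary mode, line endings are written exactly as they appear in the string, so "\r\n" has to be added explicitly if Windows-style line breaks are wanted.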

Thank you for asking.

As is common: If this answer works for you, please click the accept and rate in with the stars.
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


