Hello, I am trying to find a way to compare Unicode strings while ignoring accents and case, so that, for example, the strings 'áíé' and 'AIE' are considered equal.

I have tried boost::locale and also Unicode normalisation, but cannot get either to work correctly.
I think ICU would work, but my boss does not want to link against it because of its size.

locale: Slovak, charset: windows-1250
I am using Windows Vista, but compiling with _WIN32_WINNT = 0x0501
boost: 1.48.0
IDE: VS2005

This is what I do:

C++
static boost::locale::generator gen;
std::locale::global(gen.generate(std::locale(""), ""));

// ...

std::wstring wstr_a = L"Dušan";
std::wstring wstr_b = L"Dusan";

std::wstring wstr_c = L"áíéúó";
std::wstring wstr_d = L"aieuo";

// rslt = 1 => INCORRECT
int rslt = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary, wstr_a, wstr_b);

// rslt1 = 0 => CORRECT
int rslt1 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary, wstr_c, wstr_d);

std::wstring normalized_a = boost::locale::normalize(wstr_a, boost::locale::norm_nfd);
std::wstring normalized_b = boost::locale::normalize(wstr_b, boost::locale::norm_nfd);
std::wstring normalized_c = boost::locale::normalize(wstr_c, boost::locale::norm_nfd);
std::wstring normalized_d = boost::locale::normalize(wstr_d, boost::locale::norm_nfd);

// normalized_a = { 'D', 'u', 's', 0x030c, 'a', 'n' }
// normalized_b = { 'D', 'u', 's', 'a', 'n' }
// normalized_c = { 'a', 0x0301, 'i', 0x0301, 'e', 0x0301, 'u', 0x0301, 'o', 0x0301 }
// normalized_d = { 'a', 'i', 'e', 'u', 'o' }

// rslt2 = 1 => INCORRECT
int rslt2 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary,
  normalized_a, normalized_b);

// rslt3 = 0 => CORRECT
int rslt3 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary,
  normalized_c, normalized_d);


What am I doing wrong?
Should this work, or is it a bug in Boost?

EDIT: While debugging, I found that Boost uses the CompareStringW function in its collator...
Updated 21-May-13 1:41am

1 solution

You can try using WideCharToMultiByte() to convert the strings to ASCII. The WC_COMPOSITECHECK flag converts composite characters to precomposed ones. Then compare the converted ASCII strings case-insensitively:
C++
// 20127 is the US-ASCII code page
char lpszAscii[128];
::WideCharToMultiByte(20127, WC_COMPOSITECHECK, L"Dušan áíéúó", -1, lpszAscii, 128, NULL, NULL);
// Case-insensitive comparison; 0 means the strings are equal
int nCompare = _stricmp(lpszAscii, "Dusan aieuo");

Note, however, that this fails for some characters and for all symbols, which are replaced by '?'. Examples are the German 'ß' and currency symbols such as '€'. If you are limited to code page 1250, you can check every character in that code page and add special handling for those that fail (e.g. replace the Euro symbol with 'EUR').
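
The special handling mentioned above could be sketched as a small pre-pass that runs before the conversion. This is only an illustration: the function name `expand_special` is hypothetical, the 'EUR' replacement comes from the text above, and mapping 'ß' to "ss" is an assumption based on the usual German convention. A real table would need an entry for every character that fails in your data.

```cpp
#include <string>

// Hypothetical pre-pass: expand characters that have no single ASCII
// base letter, so the later ASCII conversion does not turn them into '?'.
// Only two sample entries are handled here.
std::wstring expand_special(const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        switch (in[i]) {
        case L'\u00DF':            // German sharp s (assumed mapping)
            out += L"ss";
            break;
        case L'\u20AC':            // Euro sign, replaced as suggested above
            out += L"EUR";
            break;
        default:                   // everything else is copied unchanged
            out.push_back(in[i]);
            break;
        }
    }
    return out;
}
```

The result of `expand_special()` would then be passed to WideCharToMultiByte() in place of the original string.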
 
 
Comments
Dusan Paulovic 21-May-13 7:38am    
Thanks for the response. Unfortunately, your solution is limited to Windows, and I would like to keep the code portable in case it is ported to other platforms in the future, but it is good to know. Also, I am not limited to code page 1250, even though it is my native one.
Jochen Arndt 21-May-13 7:55am    
Sorry that I could not help. You mentioned that you use Vista, so I assumed this was for Windows only. If you need portable code, you must use a library such as ICU or write platform-dependent code that performs similar conversions (e.g. using iconv on Linux). The above code is not limited to code page 1250, but it works only with Latin-based code pages.
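
One portable possibility, given the NFD output shown in the question, is to drop the combining marks and fold case by hand before comparing. The sketch below is not a replacement for a real collator. It assumes the input is already NFD-normalized (e.g. by boost::locale::normalize), it only strips the Combining Diacritical Marks block U+0300..U+036F, and it relies on std::towlower, whose handling of non-ASCII letters depends on the current C locale. Both function names are hypothetical.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: removes combining diacritical marks (U+0300..U+036F)
// from a string that is ALREADY in NFD form and lowercases the rest.
std::wstring fold_for_compare(const std::wstring& nfd)
{
    std::wstring out;
    out.reserve(nfd.size());
    for (std::wstring::size_type i = 0; i < nfd.size(); ++i) {
        const wchar_t ch = nfd[i];
        if (ch >= 0x0300 && ch <= 0x036F)
            continue;              // skip the accent, keep the base letter
        out.push_back(static_cast<wchar_t>(std::towlower(ch)));
    }
    return out;
}

// Returns true when both NFD strings match after accents and case are ignored.
bool equal_ignoring_accents_and_case(const std::wstring& nfd_a,
                                     const std::wstring& nfd_b)
{
    return fold_for_compare(nfd_a) == fold_for_compare(nfd_b);
}
```

With the values from the question, `equal_ignoring_accents_and_case(normalized_a, normalized_b)` would compare { 'D', 'u', 's', 0x030c, 'a', 'n' } against { 'D', 'u', 's', 'a', 'n' } with the 0x030c removed. Characters that do not decompose into a base letter plus a mark, such as 'ß' or 'ł', are not handled by this approach.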

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


