How to compare Unicode strings ignoring accents?

Question

Hello, I am trying to find a way to compare Unicode strings while ignoring accents and case, so that, for example, the strings 'áíé' and 'AIE' are considered equal.

I have tried boost::locale and also Unicode normalisation, but I cannot get either to work correctly.
I think ICU would work, but my boss does not want to link against it because of its size.

locale: Slovak, charset: windows-1250
I am using Windows Vista, but compiling with _WIN32_WINNT = 0x0501
boost: 1.48.0
IDE: VS2005

This is what I do:

C++

static boost::locale::generator gen;
std::locale::global(gen.generate(std::locale(""), ""));

// ...

std::wstring wstr_a = L"Dušan";
std::wstring wstr_b = L"Dusan";

std::wstring wstr_c = L"áíéúó";
std::wstring wstr_d = L"aieuo";

// rslt = 1 => INCORRECT
int rslt = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary, wstr_a, wstr_b);

// rslt1 = 0 => CORRECT
int rslt1 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary, wstr_c, wstr_d);

std::wstring normalized_a = boost::locale::normalize(wstr_a, boost::locale::norm_nfd);
std::wstring normalized_b = boost::locale::normalize(wstr_b, boost::locale::norm_nfd);
std::wstring normalized_c = boost::locale::normalize(wstr_c, boost::locale::norm_nfd);
std::wstring normalized_d = boost::locale::normalize(wstr_d, boost::locale::norm_nfd);

// normalized_a = { 'D', 'u', 's', 0x030c, 'a', 'n' }
// normalized_b = { 'D', 'u', 's', 'a', 'n' }
// normalized_c = { 'a', 0x0301, 'i', 0x0301, 'e', 0x0301, 'u', 0x0301, 'o', 0x0301 }
// normalized_d = { 'a', 'i', 'e', 'u', 'o' }

// rslt2 = 1 => INCORRECT
int rslt2 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary,
  normalized_a, normalized_b);

// rslt3 = 0 => CORRECT
int rslt3 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary,
  normalized_c, normalized_d);

What am I doing wrong?
Should this work, or is it a bug in Boost?

EDIT: While debugging, I found that Boost uses the CompareStringW function in its collator...
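
Given the NFD output shown in the comments above, one possible workaround is to drop the combining marks (U+0300 to U+036F) from the normalized strings and compare what is left without regard to case. This is only a minimal sketch, not a fix for the collator itself. It assumes the input is already NFD-normalized (for example by boost::locale::normalize), and the names strip_marks and equal_ignoring_accents are made up for illustration:

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: remove combining diacritical marks (U+0300..U+036F)
// and lower-case the remaining characters. Expects NFD-normalized input.
std::wstring strip_marks(const std::wstring& nfd)
{
    std::wstring out;
    for (std::wstring::size_type i = 0; i < nfd.size(); ++i) {
        wchar_t ch = nfd[i];
        if (ch >= 0x0300 && ch <= 0x036F)
            continue;  // skip the accent, keep the base letter
        out += static_cast<wchar_t>(std::towlower(ch));
    }
    return out;
}

// Hypothetical helper: true if both NFD strings match once accents
// and case are ignored.
bool equal_ignoring_accents(const std::wstring& a, const std::wstring& b)
{
    return strip_marks(a) == strip_marks(b);
}
```

Note that std::towlower depends on the current C locale for non-ASCII letters, and that characters without a canonical decomposition (such as 'ß' or 'ł') are left unchanged by this approach.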

Posted 20-May-13 22:31pm

Dusan Paulovic

Updated 21-May-13 1:41am

v3

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Jochen Arndt · Answer 1 · 2013-05-21T01:20:00

You can try WideCharToMultiByte() to convert the strings to ASCII (code page 20127), using precomposed characters, and then compare the converted strings without regard to case:

C++

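// Requires <windows.h> and <string.h>; 20127 is the US-ASCII code page.
char lpszAscii[128];
::WideCharToMultiByte(20127, WC_COMPOSITECHECK, L"Dušan áíéúó", -1, lpszAscii, 128, NULL, NULL);
// _stricmp returns 0 if the strings match ignoring case.
int nCompare = _stricmp(lpszAscii, "Dusan aieuo");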

But note that this fails for some characters and for all symbols, which are replaced by '?'. Examples are the German 'ß' and currency symbols like '€'. If you are limited to code page 1250, you can check every character in that code page and add special handling for those that fail (e.g. replace the Euro symbol with 'EUR').
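
The special handling mentioned above could be done as a pre-pass over the wide string before the conversion. The following is a rough sketch under that assumption. The helper names expand_special and equal_after_expansion are hypothetical, and the two replacements shown ('ß' to "ss", '€' to "EUR") are only examples; the full list would have to come from checking code page 1250.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: replace characters that have no single ASCII
// equivalent with a hand-picked ASCII spelling. Extend the switch as needed.
std::wstring expand_special(const std::wstring& in)
{
    std::wstring out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        switch (in[i]) {
        case L'\u00DF': out += L"ss";  break;  // German sharp s
        case L'\u20AC': out += L"EUR"; break;  // Euro sign
        default:        out += in[i];  break;
        }
    }
    return out;
}

// Hypothetical helper: compare the expanded strings ignoring case.
// Only the case folding of ASCII letters is relied upon here.
bool equal_after_expansion(const std::wstring& a, const std::wstring& b)
{
    const std::wstring x = expand_special(a);
    const std::wstring y = expand_special(b);
    if (x.size() != y.size())
        return false;
    for (std::wstring::size_type i = 0; i < x.size(); ++i)
        if (std::towlower(x[i]) != std::towlower(y[i]))
            return false;
    return true;
}
```

The output of expand_special could then be passed to WideCharToMultiByte() in place of the original string, so that the expanded characters no longer end up as '?'.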