Pattern Matching - Strings

Question

5.00/5 (1 vote)

See more:

I read this piece of code in a c++ book(Sams - C++ Footprint and Performance Optimization).

There is an article suggesting that string comparison can be made faster by treating strings as integer.

I am a little confused about the reason behind it? Is it that by doing so we will lessen the number of machine instructions that will take place for converting each character first to an integer and then to bits or something else?

Can anybody please elaborate?

C++

inline int int_strcmp(char *s1, char *s2, int len)
{
    if ((len & 0x03) != 0)
        return(macro_strcmp(s1, s2));

    for (int i =0; *(int*)&s1[i] == *(int*)&s2[i];)
    {
        i+=4;
        if (i >= len)
            return(true);  // match
    }
    return(false);
}

Article (for reference)

Another way to speed up string comparison is by treating the character arrays as integer arrays . On most systems an integer is four times larger than a character, and so four characters can be compared at the same time. When the strings have a length that is a multiple of the length of an integer, doing integer comparisons can greatly speed up your string comparison. Listing 10.1 shows what an integer string compare function can look like.

Listing 10.1 Integer String Comparison
inline int int_strcmp(char *s1, char *s2, int len)
{
if ((len & 0x03) != 0)
return(macro_strcmp(s1, s2));

for (int i =0; *(int*)&s1[i] == *(int*)&s2[i];)
{
i+=4;
if (i >= len)
return(true); // match
}
return(false);
}

The int_strcmp function quickly checks whether the given string length is a multiple of four. If this is not the case, the previously discussed macro_strcmp function is called. For strings with a correct length, characters are compared in groups of four by casting the character pointers to integer pointers and thus reading integers from the string. The longer the compared strings are—and the more they look alike—the more benefits this int_strcmp function reaps compared to the previous implementations given in this chapter. For our example of finding Small Table articles in the dbase.txt file, this integer implementation is again faster (up to 50%). This is reflected in the results of the 10Source01.cpp program.

The reason why the string lengths have to be a multiple of the integer size for this function to work is that any other size will cause this function to compare beyond the length of a string. A string with a length of six, for instance, will be compared in two loop iterations. The first iteration compares the first four bytes, which is no problem. The second iteration compares the second four bytes, of which only two are part of the actual string. The third byte will be a null indicating the end of the string. The fourth byte, however, is not part of the string. Two strings of six characters that are identical could be found to be different just because of the value of these fourth bytes. The int_compare function can of course be altered to check for different string sizes but it is not very likely that it will still be faster in most string compare cases.

Posted 6-Mar-13 21:57pm

Member 8576081

Updated 6-Mar-13 21:58pm

v2

Add a Solution

2 solutions

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2013-03-06T22:17:00

:laugh: No, it's simpler than that. And also more complicated...

Characters are 8 bit quantities (yes, yes, I know, they don't have to be - let it ride...) and very few "modern" machines work natively in 8 bit quantities - they have instructions to work with them, but they also have 32 or 64 bit wide buses and register sets which can do similar operations on a lot more data.

When you fetch a byte from memory, the system actually fetches either a 32 or 64 bit value, because that is its bus size so the first improvement is to say "If we are pulling a 32 bit number, why pull it again for the next byte?" Just read the data once, and look at all 4 or 8 bytes it contains without going back outside the processor for more data.

But, that's not what this is about. If you are reading 32 bits (and it take no longer than reading 8) why not compare 32 bits instead of 8? It can be done in a single instruction (same as an 8 bit compare) and it takes the same time. So it save three instructions (minimum) per compare. And when you think about string sizes, most of them will be longer than four bytes anyway - so we will save considerable time by doing this.

CPallini · Answer 2 · 2013-03-06T22:21:00

Quote:
am a little confused about the reason behind it? Is it that by doing so we will lessen the number of machine instructions that will take place for converting each character first to an integer and then to bits or something else?

You don't convert each character to an integer, instead you interpret a group of 4 characters at time as a single integer and compare such integer. This way, in a single machine comparison you perform 4 characters comparison.

As a side note, it looks that such int_strcmp doesn't match the behaviour of the standard strcmp. The latter provides more info (I guess they have to do that because conforming to strcmp would be endian-dependent).
Moreover why (the fresh hell) the signature specifies int as return type while the function returns either true or false (doesn't look neat to me)?

Pattern Matching - Strings

2 solutions

Solution 1

Solution 2

Add your solution here

Preview 0