Click here to Skip to main content
15,905,229 members
Please Sign up or sign in to vote.
5.00/5 (1 vote)
See more:
I read this piece of code in a c++ book(Sams - C++ Footprint and Performance Optimization).

There is an article suggesting that string comparison can be made faster by treating strings as integer.

I am a little confused about the reason behind it? Is it that by doing so we will lessen the number of machine instructions that will take place for converting each character first to an integer and then to bits or something else?

Can anybody please elaborate?

C++
inline int int_strcmp(char *s1, char *s2, int len)
{
    if ((len & 0x03) != 0)
        return(macro_strcmp(s1, s2));

    for (int i =0; *(int*)&s1[i] == *(int*)&s2[i];)
    {
        i+=4;
        if (i >= len)
            return(true);  // match
    }
    return(false);
}



Article (for reference)

Another way to speed up string comparison is by treating the character arrays as integer arrays . On most systems an integer is four times larger than a character, and so four characters can be compared at the same time. When the strings have a length that is a multiple of the length of an integer, doing integer comparisons can greatly speed up your string comparison. Listing 10.1 shows what an integer string compare function can look like.

Listing 10.1 Integer String Comparison
inline int int_strcmp(char *s1, char *s2, int len)
{
if ((len & 0x03) != 0)
return(macro_strcmp(s1, s2));

for (int i =0; *(int*)&s1[i] == *(int*)&s2[i];)
{
i+=4;
if (i >= len)
return(true); // match
}
return(false);
}

The int_strcmp function quickly checks whether the given string length is a multiple of four. If this is not the case, the previously discussed macro_strcmp function is called. For strings with a correct length, characters are compared in groups of four by casting the character pointers to integer pointers and thus reading integers from the string. The longer the compared strings are—and the more they look alike—the more benefits this int_strcmp function reaps compared to the previous implementations given in this chapter. For our example of finding Small Table articles in the dbase.txt file, this integer implementation is again faster (up to 50%). This is reflected in the results of the 10Source01.cpp program.

The reason why the string lengths have to be a multiple of the integer size for this function to work is that any other size will cause this function to compare beyond the length of a string. A string with a length of six, for instance, will be compared in two loop iterations. The first iteration compares the first four bytes, which is no problem. The second iteration compares the second four bytes, of which only two are part of the actual string. The third byte will be a null indicating the end of the string. The fourth byte, however, is not part of the string. Two strings of six characters that are identical could be found to be different just because of the value of these fourth bytes. The int_compare function can of course be altered to check for different string sizes but it is not very likely that it will still be faster in most string compare cases.
Posted
Updated 6-Mar-13 21:58pm
v2

:laugh: No, it's simpler than that. And also more complicated...

Characters are 8 bit quantities (yes, yes, I know, they don't have to be - let it ride...) and very few "modern" machines work natively in 8 bit quantities - they have instructions to work with them, but they also have 32 or 64 bit wide buses and register sets which can do similar operations on a lot more data.

When you fetch a byte from memory, the system actually fetches either a 32 or 64 bit value, because that is its bus size so the first improvement is to say "If we are pulling a 32 bit number, why pull it again for the next byte?" Just read the data once, and look at all 4 or 8 bytes it contains without going back outside the processor for more data.

But, that's not what this is about. If you are reading 32 bits (and it take no longer than reading 8) why not compare 32 bits instead of 8? It can be done in a single instruction (same as an 8 bit compare) and it takes the same time. So it save three instructions (minimum) per compare. And when you think about string sizes, most of them will be longer than four bytes anyway - so we will save considerable time by doing this.
 
Share this answer
 
Comments
Member 8576081 7-Mar-13 4:31am    
Thank you for the answer. I got it now.
Quote:
am a little confused about the reason behind it? Is it that by doing so we will lessen the number of machine instructions that will take place for converting each character first to an integer and then to bits or something else?

You don't convert each character to an integer, instead you interpret a group of 4 characters at time as a single integer and compare such integer. This way, in a single machine comparison you perform 4 characters comparison.


As a side note, it looks that such int_strcmp doesn't match the behaviour of the standard strcmp. The latter provides more info (I guess they have to do that because conforming to strcmp would be endian-dependent).
Moreover why (the fresh hell) the signature specifies int as return type while the function returns either true or false (doesn't look neat to me)?
 
Share this answer
 
v2
Comments
Member 8576081 7-Mar-13 4:32am    
@CPallini - Thanks a lot!!
CPallini 7-Mar-13 5:27am    
You are welcome.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900