Click here to Skip to main content
15,868,142 members
Articles / Programming Languages / C++
Article

String Search Using Generated Search Logic

Rate me:
Please Sign up or sign in to vote.
0.00/5 (No votes)
29 Feb 2012CPOL4 min read 24K   135   5   4
This article is about a similar function which is generated using a Linux shell script and provides a similar facility.

Introduction

The problem of searching for a string within another string is important in Computer Science. Most programmers are familiar with the C strstr function which provides the caller with an ability to search for a string needle within a string haystack. This article is about a similar function which is generated using a Linux shell script and provides a similar facility.

Background

The approach taken for implementing this version of string searching is to generate a search function with a variable degree of search branches. The code searches linearly within the string haystack using the multiple search logic each with different offsets into the string within the main search loop.

The design of the search function is as follows:

C++
#include <string.h>
#include <iostream>

using namespace std;

#define STRSTR(S,T,P) \
{ \
    register const char *s = (S), *t = (T); \
    while(*s && *t && *s++ == *t++); \
    if(!*t) { \
        (P)=(S); \
    } else { \
        (P)=NULL; \
    } \
}
 
int
_strstr(const char *str, const char *f)
{
   int len = strlen(str);
   int m = 4;
   int i = len / 4 + 1;
   int *k= new int[m];
   const char *orig = str, *p;
   memset(k,0x0,sizeof(int) * m );
   for(int j = 1; j < m; j++) {
      k[j] += (k[j-1]+i);
      cout << "k[j]: "<< k[j] << endl;
   }
   cout << "i : " << i << endl;
   while(i--) {
      cout << "k[0] " << ((str + k[0]) - orig) << " len " << len << endl;
      if(*str && ((str + k[0]) - orig) < len && *(str + k[0]) == *f) {
         STRSTR(str+k[1],f,p); 
         if(p) {
            cout << "k[0]: " << p << endl;
            return p - orig;
         }
      }
      cout << "k[1] " << ((str + k[1]) - orig) << " len " << len << endl;
      if(*str && ((str + k[1]) - orig) < len && *(str + k[1]) == *f) {
         STRSTR(str+k[1],f,p); 
         if(p) {
            cout << "k[1]: " << p << endl;
            return p - orig;
         }
      }
      cout << "k[2] " << ((str + k[2]) - orig) << " len " << len << endl;
      if(*str && ((str + k[2]) - orig) < len && *(str + k[2]) == *f) {
         STRSTR(str+k[2],f,p); 
         if(p) {
            cout << "k[2]: " << p << endl;
            return p - orig;
         }
      }
      cout << "k[3] " << ((str + k[3]) - orig) << " len " << len << endl;
      if(*str && ((str + k[3]) - orig) < len && *(str + k[3]) == *f) {
         STRSTR(str+k[3],f,p);
         if(p) {
            cout << "k[3]: " << p << endl;
            return p - orig;
         }
      }
      str++;
   }
}

Where in the above code, STRSTR is a macro that tests a possible match by comparing the entire string haystack S with the string needle T to be found. If found, the results are placed into pointer P, otherwise P is NULL. The variable m represents the number of search logics that are generated. In this example, m is equal to four, and there are four if statements of the form:

C++
if(*str && ((str + k[0]) - orig) < len && *(str + k[0]) == *f)

These if statements check the input string haystack represented by pointer 'str' to see if the first character is a match to the first character of the string needle represented by pointer 'f'. However, care must be taken to assert that the test is not out of bounds from the length of string 'str'. This is done in the first part of the test:

C++
*str && ((str + k[0]) - orig) < len

The array k contains m offsets into string 'str'. In this way, during a linear scan of the string 'str', multiple offsets into the string can be scanned in the same pass. This approach is similar to how a loop would be unraveled for optimization. Once there is a possible match, represented by the following condition within the if statement:

C++
*(str + k[0]) == *f

The macro STRSTR is invoked and a test to see if a match is found is made.

Clearly, in the worst case, the algorithm will scan the length of the entire string haystack which is N, and perform a constant m such comparisons where the worst case for the comparisons is given by scanning the length of the entire string needle which is M. Thus, the algorithm is O(N * M). This will happen in cases where the string haystack is of the form 'aaaa' and the string needle is of the form 'ab'. A pass through 'aaaa' will have each of the m search logics match the first character of 'ab' causing STRSTR to execute and scan all of the string needle and failing each time. However, if the string needle does exist within the string haystack, then the algorithm should execute much faster than that as it would likely find a possible match within a pass of the string haystack which is enhanced by the multiple search logics scanning at different positions within the string haystack. In the average case, the code will likely take O(N/2) scan of the haystack (similar to linear search) with all proceeding constant m tests failing and one succeeding taking O(M) time for an average running time of O(N/2 + M). The best case is that the algorithm matches the first characters of the string haystack with the string needle, then immediately goes to STRSTR to scan for the needle in the haystack and returns success which would be O(M) time.

Using the Code

A Bash script called gen_string_match.sh is provided to emit the C++ code. It allows the user to generate as large or as small a number of internal

test branches to run within the main search loop. To run the script and

compile the emitted C++ code, type (on Linux):

C++
./gen_string_matching.sh
g++ strstr.C -D_TEST_STRING_MATCH -o test_string_match

The shell script does not generate a header file, so to use the '_strstr' function, declare in a header:

C++
extern "C" char *_strstr(const char *s, const char *t);

Points of Interest

The code is generated to allow the user to use a function that has many search logics within the linear pass of the string haystack. If this is equal to the length of the string haystack, then the code will match the needle within the first pass.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
United States United States
Dr. Doss holds a PhD in Applied Computer Science. His research interests include Operating Systems, Networking, Theory of NP Completeness,and Software Engineering.
https://www.rdoss.com

Comments and Discussions

 
Suggestionwhich interest? Pin
Armel Asselin5-Mar-12 21:21
Armel Asselin5-Mar-12 21:21 
Questionlittle bug Pin
pirraste5-Mar-12 20:51
pirraste5-Mar-12 20:51 
Maybe a typo: on the first 'if' it tests
STRSTR(str+k[1],f,p);
but it should be
STRSTR(str+k[0],f,p);

I really liked this approach. Smile | :)
QuestionNot C++ Pin
Emilio Garavaglia1-Mar-12 19:59
Emilio Garavaglia1-Mar-12 19:59 
AnswerRe: Not C++ Pin
Wolfgang_Baron5-Mar-12 10:44
professionalWolfgang_Baron5-Mar-12 10:44 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.