Using Morris Pratt String Searching In Byte Arrays

Robert Vandenberg Huang

0/5 (0 vote)

Feb 18, 2014

CPOL

2 min read

11786

A short discussion of string searching in raw data.

Introduction

This is a tip about how I use Morris Pratt string searching algorithm in byte arrays to get the same efficient result. I'm assuming that you've understood the basic concept of this algorithm so it won't be repeated in this topic. Also, you should already know extension methods before you get started with this topic.

Background

In 2013, I built an extension method of Morris Pratt searching algorithm to search specific string in text file. But later, I had an idea and tried to apply it on every enumerable-related type if each element in it is equatable. So I expanded my method and then get the same efficient result if I search it in byte array, which means I can directly search string in raw text file and don't have to use ASCII or Unicode to decode it at first. The program became more efficient in text searching.

Using the Code

It's easy to use these extensions. First of all, let's take a look of their prototypes:

        public static int MorrisPrattSearchFirst<T>(this T[] t, T[] p)
            where T : IEquatable<T> { }

        public static int MorrisPrattSearchFirst<T>(this T[] t, T[] p, int startIndex)
            where T : IEquatable<T> { }

where t is the parent array and p is the specified array for searching. The second method allows users to search p in t starting at non-zero index. Constraint on type parameter T makes sure that each element in both arrays is equatable. Returned value represents the index of first matched result (-1 if not found). Both of them are defined under the CollectionAssistant.IEnumerableExtentions class.

Here's a complete example of using them. First, we search key word "Ishmael" in character arrays from the text file MobyDick.txt. And next search it in ASCII-encoded byte arrays. Then compare the efficiency.

using System;
using System.Diagnostics;
using System.IO;
using System.Text;

namespace CollectionAssistant.Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            // This is a example which demonstrates the differences 
            // between searching text in characters and in raw byte-array.
            string keyWord = "Ishmael";

            // Searching key word in character array. 
            char[] originalCharArray = File.ReadAllText("MobyDick.txt").ToCharArray();
            char[] keyWordCharArray = keyWord.ToCharArray();

            int index = -1;
            int pos = 0;

            Console.WriteLine("Start searching key word in character array.");

            Stopwatch watch = Stopwatch.StartNew();

            do
            {
                // Using the extension method.
                index = originalCharArray.MorrisPrattSearchFirst(keyWordCharArray, pos);

                if (index >= 0)
                {
                    Console.WriteLine("Key word \"{0}\" found. Index: {1}", keyWord, index);
                    pos += (index + keyWordCharArray.Length);
                }
            } while (index >= 0 && pos < originalCharArray.Length);
             
            watch.Stop();

            Console.WriteLine("Elapsed ticks: {0}", watch.ElapsedTicks); 
            Console.ReadKey(true);

            pos = 0; 
             
            Console.WriteLine("Start searching key word in byte array.");

            // Searching key word in byte array.
            byte[] originalByteArray = File.ReadAllBytes("MobyDick.txt");
            byte[] keyWordByteArray = Encoding.Default.GetBytes(keyWord);

            watch.Restart();

            do
            {
                // Using the extension method.
                index = originalByteArray.MorrisPrattSearchFirst(keyWordByteArray, pos);

                if (index >= 0)
                {
                    Console.WriteLine("Key word \"{0}\" found. Index: {1}", keyWord, index);
                    pos += (index + keyWordByteArray.Length);
                }
            } while (index >= 0 && pos < originalByteArray.Length);

            watch.Stop();

            Console.WriteLine("Elapsed ticks: {0}", watch.ElapsedTicks); 
            Console.ReadKey(true);
        }
    }
}

The result shows that both of them performed almost same efficiently: about 3000 ~ 8000 ticks in this case.

Conclusion

Searching text in character arrays and in byte arrays are both efficient while using Morris Pratt algorithm. But sometimes, it's better in byte arrays especially for some situations such as text file or multipart/form-data - which means parent byte array possibly contains both text and non-text data and can't be decoded into character array or string.

You can obtain the complete source code from here.