Memory and time efficient Levenshtein algorithm

Horatiu-Andrei Stoianovici

5.00/5 (7 votes)

29 Jun 2012CPOL3 min read

36.6K

708

This is an alternative for Fast, memory efficient Levenshtein algorithm

Download source code - 1.4 KB

Introduction

The Levenshtein distance is an algorithm that returns the "difference" between two strings. That difference is actually the least amount of add a letter/delete a letter/replace a letter operations you would have to apply to the first string, to make it equal with the second. In web applications this can be useful for searches where the user types a word incorrectly, and you will want to look for close matches, instead of exact matches.

Advantages of the improved version

S1: the first string;
S2: the second string;
N: the length of S1;
M: the length of S2;

The classic algorithm uses a matrix with the size of N*M elements. My improved version uses only two arrays (there is an article describing this improvement here), making it more efficient.

Another improvement is that this version also has a limit parameter.
e.g. While searching for close matches, you certainly wouldn't want "bicycle" to match "hurricane". The difference between the two is too big.
The classic algorithm will compute the distance, even if the two words are very different. This takes A LOT more time than if you would have known that you want to search for differences smaller than limit .

Description of the improved algorithm

First, we see if S1 and S2 are identical (S1==S2). If they are, it would be a waste of time to actually calculate their difference.
After that, we calculate the absolute value of the difference between N and M. If it is greater than the limit, we could be sure that the difference between S1 and S2 is greater that the limit, because we will have to insert at least abs(<code>N-M) letters.
Next, the classic algorithm would examine each letter of S1, and each letter of S2. But that would be a waste of time if we know that the difference should be less than limit. Instead, we will examine for each letter i of S1, only the letters between i-limit and i+limit of S2. It would be useless to examine the letters before i-limit, or after i+limit, because the difference between the first i letters of S1, and the first i+limit+1, i+limit+2... and i+limit-1,i+limit-2... letters of S2 would certainly be bigger than limit. *e.g. You would have to make limit+1 insertions to transform the first i letters of S1 intro the first i+limit+1 letters of S2.* It is also necessary to initialize the array with a huge value in the i+limit-1 and i+limit+1 positions (if these positions exist) to prevent the algorithm from choosing those values in the next step (because that part of the array is "untouched", the values would be 0).
For each i letter of S1, the algorithm sees if there was at least a computed value that is <=limit, otherwise it returns infinite (in my algorithm, 9.999.999)

Using the Code

I have implemented the algorithm in two popular languages for web development: PHP and JavaScript .

The PHP version

The PHP function is compare($string1,$string2,$limit) :

$string1 is the first string to be compared.
$string2 is the second string to be compared.
$limit is optional, but highly recommended: it is the limit of the calculated distance.

The function returns the distance between the two strings, or the value 9.999.999 if the distance is greater than the limit.

Example:

PHP

echo compare("efficient","sufficient",4);
//the output will be 2

PHP

echo compare("malicious","delicious",10);
//the output will be 2

PHP

echo compare("grandma","anathema",5);
//the output will be 5

PHP

echo compare("grandma","anathema",4); 
//the output will be 9999999

The JavaScript version

The JavaScript function is compare(string1,string2,limit) :

string1 is the first string to be compared.
string2 is the second string to be compared.
limit is optional, but highly recommended: it is the limit of the calculated distance.

The function returns the distance between the two strings, or the value 9.999.999 if the distance is greater than the limit.

Example:

PHP

alert(compare("efficient","sufficient",4));
//the output will be 2

PHP

alert(compare("malicious","delicious",10));
//the output will be 2

PHP

alert(compare("grandma","anathema",5));
//the output will be 5

PHP

alert(compare("grandma","anathema",4));
//the output will be 9999999

References

Fast, memory efficient Levenshtein algorithm

History

8 April 2012- published article with source files

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)