Click here to Skip to main content
15,887,135 members
Articles / Programming Languages / C#
Alternative
Article

Memory and time efficient Levenshtein algorithm

Rate me:
Please Sign up or sign in to vote.
5.00/5 (7 votes)
29 Jun 2012CPOL3 min read 35.7K   707   26   7
This is an alternative for "Fast, memory efficient Levenshtein algorithm"

Introduction 

 The Levenshtein distance is an algorithm that returns the "difference" between two strings. That difference is actually the least amount of add a letter/delete a letter/replace a letter operations you would have to apply to the first string, to make it equal with the second. In web applications this can be useful for searches where the user types a word incorrectly, and you will want to look for close matches, instead of exact matches. 

Advantages of the improved version 

  •  S1: the first string;
  •  S2: the second string; 
  •  N: the length of S1
  •  M: the length of S2;  

 The classic algorithm uses a matrix with the size of N*M elements. My improved version uses only two arrays (there is an article describing this improvement here), making it more efficient.

 Another improvement is that this version also has a limit parameter.
e.g. While searching for close matches, you certainly wouldn't want "bicycle" to match "hurricane". The difference between the two is too big.
The classic algorithm will compute the distance, even if the two words are very different. This takes A LOT more time than if you would have known that you want to search for differences smaller than limit .  

Description of the improved algorithm

  • First, we see if S1 and S2 are identical (S1==S2). If they are, it would be a waste of time to actually calculate their difference. 
  • After that, we calculate the absolute value of the difference between N and M. If it is greater than the limit, we could be sure that the difference between S1 and S2 is greater that the limit, because we will have to insert at least abs(<code>N-M) letters. 
  • Next, the classic algorithm would examine each letter of S1, and each letter of S2. But that would be a waste of time if we know that the difference should be less than limit.  Instead, we will examine for each letter i of S1, only the letters between i-limit and i+limit of S2. It would be useless to examine the letters before i-limit, or after i+limit, because the difference between the first i letters of S1, and the first i+limit+1, i+limit+2... and i+limit-1,i+limit-2... letters of S2 would certainly be bigger than limit. *e.g. You would have to make limit+1 insertions to transform the first i letters of S1 intro the first i+limit+1 letters of S2.* It is also necessary to initialize the array with a huge value in the i+limit-1 and i+limit+1 positions (if these positions exist) to prevent the algorithm from choosing those values in the next step (because that part of the array is "untouched", the values would be 0). 
  • For each i letter of S1, the algorithm sees if there was at least a computed value that is <=limit, otherwise it returns infinite (in my algorithm, 9.999.999) 

Using the Code 

I have implemented the algorithm in two popular languages for web development: PHP and JavaScript . 

 The PHP version

The PHP function is compare($string1,$string2,$limit) :

  1. $string1 is the first string to be compared.
  2. $string2 is the second string to be compared.
  3. $limit is optional, but highly recommended: it is the limit of the calculated distance. 

The function returns the distance between the two strings, or the value 9.999.999 if the distance is greater than the limit.

Example:

PHP
echo compare("efficient","sufficient",4);
//the output will be 2  
PHP
echo compare("malicious","delicious",10);
//the output will be 2
PHP
echo compare("grandma","anathema",5);
//the output will be 5
PHP
echo compare("grandma","anathema",4); 
//the output will be 9999999 

 The JavaScript version

The JavaScript function is compare(string1,string2,limit)

  1. string1 is the first string to be compared.
  2. string2 is the second string to be compared.
  3. limit is optional, but highly recommended: it is the limit of the calculated distance.  

The function returns the distance between the two strings, or the value 9.999.999 if the distance is greater than the limit. 

Example:

PHP
alert(compare("efficient","sufficient",4));
//the output will be 2 
PHP
alert(compare("malicious","delicious",10));
//the output will be 2 
PHP
alert(compare("grandma","anathema",5));
//the output will be 5 
PHP
alert(compare("grandma","anathema",4));
//the output will be 9999999  

References

History

  • 8 April 2012- published article with source files  

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Student
Romania Romania
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
QuestionAfter 12 years: Thank you! Pin
oahrens14-Jan-24 1:17
oahrens14-Jan-24 1:17 
Thanks Smile | :) Smile | :) Smile | :) for posting the code here! It was a very good inspiration for my own purposes. Allow me to make a few comments:
  1. For readers who want to orient themselves in the jungle of different implementations of the edit distance: The idea of only examining relevant regions in the inner loop was first published by Esko Ukkonen (Ukkonen E. Algorithms for approximate string matching. Inf Control. 1985;64(1-3):100-18.). This type of algorithm is therefore sometimes called the “Ukkonen algorithm”.

  2. The memory consumption of the function can be reduced by limiting both arrays of dist to the length of 2 * ((limit + Math.abs(n - m)) / 2 + 1) + 1 instead of using the length m + 2. The first and last elements of both arrays are initialized with the "high value". This is the "untouched" region, which now remains unchanged and no longer needs to be redefined every time the outer loop is run.

    To redesign the code according to these principles, the following changes are necessary (line numbers and expressions refer to the JavaScript version):

    a) insert after line 21:
    JavaScript
    var offset = Math.floor((limit + Math.abs(n - m)) / 2);
    b) replace line 22 with:
    JavaScript
    var dist = new Array(new Array(2 * (offset + 1) + 1), new Array(2 * (offset + 1) + 1));
    c) replace lines 24-27 with the new initialization:
    JavaScript
    for (var i = 1; i < dist[1 - current].length - 1; i++) {
        if (i < offset + 2) {
            dist[1 - current][i] = 0;
        } else {
            dist[1 - current][i] = i - offset - 1;
        }
    }
    dist[current][0] = 9999999;
    dist[1 - current][0] = 9999999;
    dist[current][dist[current].length - 1] = 9999999;
    dist[1 - current][dist[1 - current].length - 1] = 9999999;
    d) delete line 31
    e) delete lines 32 and 33 and replace them with:
    JavaScript
    if (i <= offset) {
        dist[current][offset - i + 1] = i;
    }
    f) in line 34 limit is replaced by offset
    g) after line 35 add:
    JavaScript
    var pos = j - (i - offset) + 1;
    h) in lines 36 to 40:
    • when calling dist[current] make the following replacements:
      [j] --> [pos]
      [j - 1] --> [pos - 1]
    • when calling dist[1 - current] make the following replacements:
      [j] --> [pos + 1]
      [j - 1] --> [pos]
    i) delete lines 43 to 46
    j) in lines 51 and 53 m is replaced by pos

  3. Another possibility that Ukkonen presents is, to calculate the exact edit distance by examining only relevant regions in all cases, not just in the case where the distance is below an externally specified limit value.

    To do this, limit is initially set to Math.abs(n - m) + 1 within the function. Alternatively, limit can be set to 4 and, if necessary, doubled until limit is greater than or equal to Math.abs(n - m). The code from lines 22 to 50 (with or without the changes noted above) is then repeated in a loop until dists[1 - current][pos] > limit or dists[1 - current][m] > limit. Meanwhile, limit is doubled after each run of the loop. In the end dists[1 - current][pos] or dists[1 - current][m] is the exact edit distance.

    I found a correct implementation of Ukkonen's algorithm (written in C by Abram Connelly) that follows this approach. This was also my source for some of the ideas outlined above. In the case of other implementations of the Ukkonen algorithm circulating on the Internet, I am not sure whether they are all really correct.Unsure | :~

  4. In my opinion one should add to the comment of wtwhite regarding Myers' bit-parallel algorithm: In fact, this is faster than all variations of Wagner and Fischer's algorithm (which include Ukkonen's algorithm). But the bit-parallel approach can only be used in the special case where the cost of insertions, deletions and replacings when calculating the edit distance are identical. In contrast, with derivatives of the Wagner-Fischer algorithm it is possible to set different costs for these actions. So both have their place.


modified 15-Jan-24 7:10am.

GeneralVB.NET version of your code Pin
Dominic Turner28-Jun-12 2:10
Dominic Turner28-Jun-12 2:10 
QuestionA couple of things Pin
wtwhite30-Apr-12 20:42
wtwhite30-Apr-12 20:42 
AnswerRe: A couple of things Pin
Horatiu-Andrei Stoianovici1-May-12 1:38
Horatiu-Andrei Stoianovici1-May-12 1:38 
GeneralRe: A couple of things Pin
PIEBALDconsult1-May-12 11:01
mvePIEBALDconsult1-May-12 11:01 
GeneralMy vote of 5 Pin
Ilka Guigova30-Apr-12 8:58
Ilka Guigova30-Apr-12 8:58 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.