The Damerau-Levenshtein distance between two strings is the number of additions, deletions, substitutions, and transpositions needed to transform one into the other. It’s an extension of the Levenshtein distance, by incorporating transpositions into the set of operations.
I had a need to use it in a program recently, but I couldn’t find any Python implementation around that wasn’t buggy in trivial cases or just extremely inefficient (strictly, you can implement it neatly by recursion, and it’s quite instructive about the algorithm in some ways, but it takes an age to run when you do and you hit the recursion limit in fairly short order). I’ve put one together myself from the algorithmic definition that I’ve tested to work correctly and reasonably efficiently. I’d suggest working in a C module if you need the best possible efficiency. There are some things in difflib you can work around to doing the same thing as well, particularly good if you do want the required sequence of operations too. I’m only doing the numeric side of things here.
The algorithm is inherently O(N*M) in time, and the naïve version is in space as well. This implementation is O(M) in space, as only the last two rows of the matrix need to be kept around. I have implemented it for Python 2.x, but 2to3 makes an accurate translation if you’re using it with Py3K.
The code is available under the MIT licence, in the hope that it will be useful, but without warranty of any kind. I have also included a codesnippet GUID in line with the linked post, as a sort of experiment. Please leave that comment intact if you’re posting a derivative somewhere, and add your own.
This entry was posted
on Sunday, April 26th, 2009 at 17:02 and is filed under Uncategorized.
You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.
6 Responses to “Python Damerau-Levenshtein distance implementation”
btw: Any idea how to get a ratio function (normalization) akin to that in difflib? I tried: return float((max_length – DL_distance(seq1, seq2)) / max_length)
numerix, it works perfectly well. You need to note the difference between the Levenshtein distance (described in the SPOJ problem and the calculator you link to) and the Damerau-Levenshtein distance, which allows transpositions.
Both algorithms give the correct answer for that pair: delete H, U, R, replace B with K, transpose O and H, replace P with Z. Six steps.
Thanks for this. It is faster than: http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/comment-page-1/
btw: Any idea how to get a ratio function (normalization) akin to that in difflib? I tried: return float((max_length – DL_distance(seq1, seq2)) / max_length)
but it’s not as nice as the one in difflib.
J. Reagle is right according to the speed of the algorithm, but the algorithms doesn’t work correctly.
You find more information about that in my comment to this code: http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/comment-page-1/
numerix, it works perfectly well. You need to note the difference between the Levenshtein distance (described in the SPOJ problem and the calculator you link to) and the Damerau-Levenshtein distance, which allows transpositions.
Both algorithms give the correct answer for that pair: delete H, U, R, replace B with K, transpose O and H, replace P with Z. Six steps.
Thanks Michael, for pointing that out. I was not aware of that difference!
i am trying to convert your code to Cython and have already obtained 20x speed gains. sadly, my translation still has some problems; so i opened a question on http://stackoverflow.com/questions/3431933/how-to-correct-algorithm-malloc-related-bugs-in-this-damerau-levenshtein-edit-di