The Damerau-Levenshtein distance between two strings is the number of additions, deletions, substitutions, and transpositions needed to transform one into the other. It’s an extension of the Levenshtein distance, by incorporating transpositions into the set of operations.
I had a need to use it in a program recently, but I couldn’t find any Python implementation around that wasn’t buggy in trivial cases or just extremely inefficient (strictly, you can implement it neatly by recursion, and it’s quite instructive about the algorithm in some ways, but it takes an age to run when you do and you hit the recursion limit in fairly short order). I’ve put one together myself from the algorithmic definition that I’ve tested to work correctly and reasonably efficiently. I’d suggest working in a C module if you need the best possible efficiency. There are some things in difflib you can work around to doing the same thing as well, particularly good if you do want the required sequence of operations too. I’m only doing the numeric side of things here.
The algorithm is inherently O(N*M) in time, and the naïve version is in space as well. This implementation is O(M) in space, as only the last two rows of the matrix need to be kept around. I have implemented it for Python 2.x, but 2to3 makes an accurate translation if you’re using it with Py3K.
The code is available under the MIT licence, in the hope that it will be useful, but without warranty of any kind. I have also included a codesnippet GUID in line with the linked post, as a sort of experiment. Please leave that comment intact if you’re posting a derivative somewhere, and add your own.
This entry was posted
on Sunday, April 26th, 2009 at 17:02 and is filed under Uncategorized.
You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.
11 Responses to “Python Damerau-Levenshtein distance implementation”
btw: Any idea how to get a ratio function (normalization) akin to that in difflib? I tried: return float((max_length – DL_distance(seq1, seq2)) / max_length)
numerix, it works perfectly well. You need to note the difference between the Levenshtein distance (described in the SPOJ problem and the calculator you link to) and the Damerau-Levenshtein distance, which allows transpositions.
Both algorithms give the correct answer for that pair: delete H, U, R, replace B with K, transpose O and H, replace P with Z. Six steps.
I wanted to port this to be compatible with Python 3. I’m new to 3 myself, but I think I’ve pulled it off by checking a few strings across both the versions.
Here are the changes:
1) thisrow = range(1, len(seq2) + 1) + [0]
becomes
thisrow = list(range(1, len(seq2) + 1)) + [0]
2) for x in xrange(len(seq1)):
becomes
for x in range(len(seq1)):
3) for y in xrange(len(seq2)):
becomes
for y in range(len(seq2)):
I checked ‘spam’ vs. ham (2)
‘hasddddddddddddddddam’ vs ‘ham’ (18)
Here is the full updated code:
____________________________________________
def dameraulevenshtein(seq1, seq2):
“”"Calculate the Damerau-Levenshtein distance between sequences.
This distance is the number of additions, deletions, substitutions,
and transpositions needed to transform the first sequence into the
second. Although generally used with strings, any sequences of
comparable objects will work.
Transpositions are exchanges of *consecutive* characters; all other
operations are self-explanatory.
This implementation is O(N*M) time and O(M) space, for N and M the
lengths of the two sequences.
It works with arbitrary sequences too:
>>> dameraulevenshtein(‘abcd’, ['b', 'a', 'c', 'd', 'e'])
2
“”"
# codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
# Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
# However, only the current and two previous rows are needed at once,
# so we only store those.
oneago = None
thisrow = list(range(1, len(seq2) + 1)) + [0]
for x in range(len(seq1)):
# Python lists wrap around for negative indices, so put the
# leftmost column at the *end* of the list. This matches with
# the zero-indexed strings and saves extra calculation.
twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
for y in range(len(seq2)):
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + (seq1[x] != seq2[y])
thisrow[y] = min(delcost, addcost, subcost)
# This block deals with transpositions
if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
return thisrow[len(seq2) - 1]
Thanks, OneEyedMan. Those changes should be all that’s required, and they’re what 2to3 will give you if you’re converting automatically. The Python 3 version should run fine on Python 2 as well, with just a little waste around the range() calls.
[...] the previus listing I have to make some remarks. In line 2 and 3 we use third party libraries for levenshtain distance and metaphone algorithms. In line 8 we are collecting a list of 70 candidates. This particular [...]
[...] the previous listing I have to make some remarks. In line 2 and 3 we use third party libraries for Levenshtein distance and metaphone algorithms. In line 8 we are collecting a list of 70 candidates. This particular [...]
Thanks for this. It is faster than: http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/comment-page-1/
btw: Any idea how to get a ratio function (normalization) akin to that in difflib? I tried: return float((max_length – DL_distance(seq1, seq2)) / max_length)
but it’s not as nice as the one in difflib.
J. Reagle is right according to the speed of the algorithm, but the algorithms doesn’t work correctly.
You find more information about that in my comment to this code: http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/comment-page-1/
numerix, it works perfectly well. You need to note the difference between the Levenshtein distance (described in the SPOJ problem and the calculator you link to) and the Damerau-Levenshtein distance, which allows transpositions.
Both algorithms give the correct answer for that pair: delete H, U, R, replace B with K, transpose O and H, replace P with Z. Six steps.
Thanks Michael, for pointing that out. I was not aware of that difference!
i am trying to convert your code to Cython and have already obtained 20x speed gains. sadly, my translation still has some problems; so i opened a question on http://stackoverflow.com/questions/3431933/how-to-correct-algorithm-malloc-related-bugs-in-this-damerau-levenshtein-edit-di
I wanted to port this to be compatible with Python 3. I’m new to 3 myself, but I think I’ve pulled it off by checking a few strings across both the versions.
Here are the changes:
1) thisrow = range(1, len(seq2) + 1) + [0]
becomes
thisrow = list(range(1, len(seq2) + 1)) + [0]
2) for x in xrange(len(seq1)):
becomes
for x in range(len(seq1)):
3) for y in xrange(len(seq2)):
becomes
for y in range(len(seq2)):
I checked ‘spam’ vs. ham (2)
‘hasddddddddddddddddam’ vs ‘ham’ (18)
Here is the full updated code:
____________________________________________
def dameraulevenshtein(seq1, seq2):
“”"Calculate the Damerau-Levenshtein distance between sequences.
This distance is the number of additions, deletions, substitutions,
and transpositions needed to transform the first sequence into the
second. Although generally used with strings, any sequences of
comparable objects will work.
Transpositions are exchanges of *consecutive* characters; all other
operations are self-explanatory.
This implementation is O(N*M) time and O(M) space, for N and M the
lengths of the two sequences.
>>> dameraulevenshtein(‘ba’, ‘abc’)
2
>>> dameraulevenshtein(‘fee’, ‘deed’)
2
It works with arbitrary sequences too:
>>> dameraulevenshtein(‘abcd’, ['b', 'a', 'c', 'd', 'e'])
2
“”"
# codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
# Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
# However, only the current and two previous rows are needed at once,
# so we only store those.
oneago = None
thisrow = list(range(1, len(seq2) + 1)) + [0]
for x in range(len(seq1)):
# Python lists wrap around for negative indices, so put the
# leftmost column at the *end* of the list. This matches with
# the zero-indexed strings and saves extra calculation.
twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
for y in range(len(seq2)):
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + (seq1[x] != seq2[y])
thisrow[y] = min(delcost, addcost, subcost)
# This block deals with transpositions
if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
return thisrow[len(seq2) - 1]
Thanks, OneEyedMan. Those changes should be all that’s required, and they’re what 2to3 will give you if you’re converting automatically. The Python 3 version should run fine on Python 2 as well, with just a little waste around the range() calls.
[...] the previus listing I have to make some remarks. In line 2 and 3 we use third party libraries for levenshtain distance and metaphone algorithms. In line 8 we are collecting a list of 70 candidates. This particular [...]
[...] the previous listing I have to make some remarks. In line 2 and 3 we use third party libraries for Levenshtein distance and metaphone algorithms. In line 8 we are collecting a list of 70 candidates. This particular [...]
added to beginning:
seq1 = seq1.translate(string.maketrans(“”,”"), string.punctuation).upper()
seq2 = seq2.translate(string.maketrans(“”,”"), string.punctuation).upper()
to remove punctuation marks and case from the equation.