The Damerau-Levenshtein distance between two strings is the number of additions, deletions, substitutions, and transpositions needed to transform one into the other. It extends the Levenshtein distance by adding transpositions to the set of operations.

I needed it in a program recently, but I couldn’t find any Python implementation around that wasn’t buggy in trivial cases or just extremely inefficient. (Strictly, you **can** implement it neatly by recursion, and that’s quite instructive about the algorithm in some ways, but it takes an age to run when you do and you hit the recursion limit in fairly short order.) I’ve put one together myself from the algorithmic definition, and tested it to work correctly and reasonably efficiently. If you need the best possible efficiency, I’d suggest writing a C module. You can also coax difflib into doing much the same thing, which is particularly good if you do want the required sequence of operations too; I’m only doing the numeric side of things here.

The algorithm is inherently O(N*M) in time, and the naïve version is O(N*M) in space as well. This implementation is O(M) in space, as only the current row and the two previous rows of the matrix need to be kept around. I have implemented it for Python 2.x, but 2to3 makes an accurate translation if you’re using it with Py3K.

def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

The code is available under the MIT licence, in the hope that it will be useful, but without warranty of any kind. I have also included a codesnippet GUID in line with the linked post, as a sort of experiment. Please leave that comment intact if you’re posting a derivative somewhere, and add your own.

Tags: damerau-levenshtein, python

Thanks for this. It is faster than: http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/comment-page-1/

btw: Any idea how to get a ratio function (normalization) akin to that in difflib? I tried: return float((max_length - DL_distance(seq1, seq2)) / max_length)

but it’s not as nice as the one in difflib.
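One likely reason the attempt above comes out oddly on Python 2 is that the division happens between integers *inside* the float() call, so the result is truncated before it’s converted. A sketch of one way to normalise, dividing by the longer length (the dl_ratio() helper name is mine, and dameraulevenshtein() is the function from the post, reproduced here in its Python 3 spelling); note this is a different measure from difflib’s SequenceMatcher.ratio(), which computes 2*M/T over matching blocks, so the two won’t agree exactly:

```python
def dameraulevenshtein(seq1, seq2):
    # The distance function from the post (Python 3 spelling).
    oneago = None
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    for x in range(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in range(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]


def dl_ratio(seq1, seq2):
    # Hypothetical helper: similarity in [0.0, 1.0], 1.0 meaning identical.
    # Dividing by the longer length bounds the result, since one sequence
    # can always be edited into the other in max(len) operations.
    longest = max(len(seq1), len(seq2))
    if longest == 0:
        return 1.0  # two empty sequences are identical
    return 1.0 - dameraulevenshtein(seq1, seq2) / float(longest)


print(dl_ratio('fee', 'fee'))  # 1.0
```

The float() on the divisor is redundant on Python 3 but keeps the same code correct on Python 2.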

J. Reagle is right about the speed of the algorithm, but the algorithm doesn’t work correctly.

You find more information about that in my comment to this code: http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/comment-page-1/

numerix, it works perfectly well. You need to note the difference between the Levenshtein distance (described in the SPOJ problem and the calculator you link to) and the Damerau-Levenshtein distance, which allows transpositions.

Both algorithms give the correct answer for that pair: delete H, U, R, replace B with K, transpose O and H, replace P with Z. Six steps.
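The quickest way to see the difference between the two metrics is a single swapped pair: plain Levenshtein counts 'ab' → 'ba' as two substitutions, while Damerau-Levenshtein counts it as one transposition. A minimal sketch (the levenshtein() helper is my own two-row version for comparison; dameraulevenshtein() is the function from the post, in Python 3 spelling):

```python
def levenshtein(seq1, seq2):
    # Plain Levenshtein: additions, deletions, substitutions only.
    prev = list(range(len(seq2) + 1))
    for x, c1 in enumerate(seq1):
        curr = [x + 1]
        for y, c2 in enumerate(seq2):
            curr.append(min(prev[y + 1] + 1,        # deletion
                            curr[y] + 1,            # addition
                            prev[y] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]


def dameraulevenshtein(seq1, seq2):
    # The function from the post, which also allows transpositions.
    oneago = None
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    for x in range(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in range(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]


print(levenshtein('ab', 'ba'))         # 2: substitute both characters
print(dameraulevenshtein('ab', 'ba'))  # 1: one transposition
```

So a Levenshtein calculator will report larger distances than this code whenever the optimal edit sequence includes a transposition; both are correct for their own definition.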

Thanks Michael, for pointing that out. I was not aware of that difference!

I am trying to convert your code to Cython and have already obtained 20x speed gains. Sadly, my translation still has some problems, so I opened a question on http://stackoverflow.com/questions/3431933/how-to-correct-algorithm-malloc-related-bugs-in-this-damerau-levenshtein-edit-di

I wanted to port this to be compatible with Python 3. I’m new to 3 myself, but I think I’ve pulled it off by checking a few strings across both versions.

Here are the changes:

1) thisrow = range(1, len(seq2) + 1) + [0]

becomes

thisrow = list(range(1, len(seq2) + 1)) + [0]

2) for x in xrange(len(seq1)):

becomes

for x in range(len(seq1)):

3) for y in xrange(len(seq2)):

becomes

for y in range(len(seq2)):

I checked ‘spam’ vs. ‘ham’ (2) and ‘hasddddddddddddddddam’ vs. ‘ham’ (18).

Here is the full updated code:

____________________________________________

def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    for x in range(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in range(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

Thanks, OneEyedMan. Those changes should be all that’s required, and they’re what 2to3 will give you if you’re converting automatically. The Python 3 version should run fine on Python 2 as well, with just a little waste around the range() calls.

[…] the previous listing I have to make some remarks. In line 2 and 3 we use third party libraries for Levenshtein distance and metaphone algorithms. In line 8 we are collecting a list of 70 candidates. This particular […]

added to beginning:

seq1 = seq1.translate(string.maketrans("", ""), string.punctuation).upper()
seq2 = seq2.translate(string.maketrans("", ""), string.punctuation).upper()

to remove punctuation marks and case from the equation.
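Those translate() calls are Python 2 only; in Python 3 the two-argument form of str.translate is gone, and the deletion set moves into the table built by str.maketrans. A sketch of the equivalent normalisation (the normalise() name is mine):

```python
import string


def normalise(seq):
    # Drop punctuation and fold case before comparing, Python 3 style:
    # str.maketrans's third argument lists the characters to delete.
    return seq.translate(str.maketrans('', '', string.punctuation)).upper()


print(normalise("Don't stop!"))  # DONT STOP
```

The normalised strings can then be passed straight to dameraulevenshtein() as before.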

[…] project is based on Michael Homer’s pure Python implementation. It runs in O(N*M) time using O(M) space and supports unicode characters. Since it’s […]

[…] experimented with a number of Python-based Damerau-Levenshtein edit distance implementations, I finally found the one listed below as editdistance_reference(). It seems to deliver correct results and appears to have an efficient […]

[…] distance for the typos and Double Metaphone for spelling (Python implementations here and […]
