Python Damerau-Levenshtein distance implementation

The Damerau-Levenshtein distance between two strings  is the number of additions, deletions, substitutions, and transpositions needed to transform one into the other.  It’s an extension of the Levenshtein distance, by incorporating transpositions into the set of operations.

I had a need to use it in a program recently, but I couldn’t find any Python implementation around that wasn’t buggy in trivial cases or just extremely inefficient (strictly, you can implement it neatly by recursion, and it’s quite instructive about the algorithm in some ways, but it takes an age to run when you do and you hit the recursion limit in fairly short order). I’ve put one together myself from the algorithmic definition that I’ve tested to work correctly and reasonably efficiently. I’d suggest working in a C module if you need the best possible efficiency. There are some things in difflib you can work around to doing the same thing as well, particularly good if you do want the required sequence of operations too. I’m only doing the numeric side of things here.

The algorithm is inherently O(N*M) in time, and the naïve version is in space as well. This implementation is O(M) in space, as only the last two rows of the matrix need to be kept around. I have implemented it for Python 2.x, but 2to3 makes an accurate translation if you’re using it with Py3K.

def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]


The code is available under the MIT licence, in the hope that it will be useful, but without warranty of any kind. I have also included a codesnippet GUID in line with the linked post, as a sort of experiment. Please leave that comment intact if you’re posting a derivative somewhere, and add your own.

Tags: ,

14 Responses to “Python Damerau-Levenshtein distance implementation”

  1. btw: Any idea how to get a ratio function (normalization) akin to that in difflib? I tried: return float((max_length – DL_distance(seq1, seq2)) / max_length)

    but it’s not as nice as the one in difflib.

  2. numerix says:

    J. Reagle is right according to the speed of the algorithm, but the algorithms doesn’t work correctly.
    You find more information about that in my comment to this code: http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/comment-page-1/

  3. Michael says:

    numerix, it works perfectly well. You need to note the difference between the Levenshtein distance (described in the SPOJ problem and the calculator you link to) and the Damerau-Levenshtein distance, which allows transpositions.

    Both algorithms give the correct answer for that pair: delete H, U, R, replace B with K, transpose O and H, replace P with Z. Six steps.

  4. numerix says:

    Thanks Michael, for pointing that out. I was not aware of that difference!

  5. flow says:

    i am trying to convert your code to Cython and have already obtained 20x speed gains. sadly, my translation still has some problems; so i opened a question on http://stackoverflow.com/questions/3431933/how-to-correct-algorithm-malloc-related-bugs-in-this-damerau-levenshtein-edit-di

  6. OneEyedMan says:

    I wanted to port this to be compatible with Python 3. I’m new to 3 myself, but I think I’ve pulled it off by checking a few strings across both the versions.
    Here are the changes:
    1) thisrow = range(1, len(seq2) + 1) + [0]
    becomes
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    2) for x in xrange(len(seq1)):
    becomes
    for x in range(len(seq1)):
    3) for y in xrange(len(seq2)):
    becomes
    for y in range(len(seq2)):

    I checked ‘spam’ vs. ham (2)
    ‘hasddddddddddddddddam’ vs ‘ham’ (18)
    Here is the full updated code:
    ____________________________________________
    def dameraulevenshtein(seq1, seq2):
    “”"Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein(‘ba’, ‘abc’)
    2
    >>> dameraulevenshtein(‘fee’, ‘deed’)
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein(‘abcd’, ['b', 'a', 'c', 'd', 'e'])
    2
    “”"
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    for x in range(len(seq1)):
    # Python lists wrap around for negative indices, so put the
    # leftmost column at the *end* of the list. This matches with
    # the zero-indexed strings and saves extra calculation.
    twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
    for y in range(len(seq2)):
    delcost = oneago[y] + 1
    addcost = thisrow[y - 1] + 1
    subcost = oneago[y - 1] + (seq1[x] != seq2[y])
    thisrow[y] = min(delcost, addcost, subcost)
    # This block deals with transpositions
    if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
    and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
    thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

  7. Michael says:

    Thanks, OneEyedMan. Those changes should be all that’s required, and they’re what 2to3 will give you if you’re converting automatically. The Python 3 version should run fine on Python 2 as well, with just a little waste around the range() calls.

  8. emmaespina says:

    [...] the previus listing I have to make some remarks. In line 2 and 3 we use third party libraries for levenshtain distance and metaphone algorithms. In line 8 we are collecting a list of 70 candidates. This particular [...]

  9. [...] the previous listing I have to make some remarks. In line 2 and 3 we use third party libraries for Levenshtein distance and metaphone algorithms. In line 8 we are collecting a list of 70 candidates. This particular [...]

  10. denny says:

    added to beginning:
    seq1 = seq1.translate(string.maketrans(“”,”"), string.punctuation).upper()
    seq2 = seq2.translate(string.maketrans(“”,”"), string.punctuation).upper()

    to remove punctuation marks and case from the equation.

  11. [...] project is based on Michael Homer’s pure Python implementation. It runs in O(N*M) time using O(M) space and supports unicode characters. Since it’s [...]

  12. [...] experimented with a number of Python-based Damerau-Levenshtein edit distance implementations, I finally found the one listed below as editdistance_reference(). It seems to deliver correct results and appears to have an efficient [...]

  13. […] distance for the typos and Double Metaphone for spelling (Python implementations here and […]

Leave a Reply

You must be logged in to post a comment.