Posts Tagged ‘damerau-levenshtein’

Python Damerau-Levenshtein distance implementation

Sunday, April 26th, 2009

The Damerau-Levenshtein distance between two strings  is the number of additions, deletions, substitutions, and transpositions needed to transform one into the other.  It’s an extension of the Levenshtein distance, by incorporating transpositions into the set of operations.

I had a need to use it in a program recently, but I couldn’t find any Python implementation around that wasn’t buggy in trivial cases or just extremely inefficient (strictly, you can implement it neatly by recursion, and it’s quite instructive about the algorithm in some ways, but it takes an age to run when you do and you hit the recursion limit in fairly short order). I’ve put one together myself from the algorithmic definition that I’ve tested to work correctly and reasonably efficiently. I’d suggest working in a C module if you need the best possible efficiency. There are some things in difflib you can work around to doing the same thing as well, particularly good if you do want the required sequence of operations too. I’m only doing the numeric side of things here.

The algorithm is inherently O(N*M) in time, and the naïve version is in space as well. This implementation is O(M) in space, as only the last two rows of the matrix need to be kept around. I have implemented it for Python 2.x, but 2to3 makes an accurate translation if you’re using it with Py3K.

def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    >>> dameraulevenshtein('fee', 'deed')

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

The code is available under the MIT licence, in the hope that it will be useful, but without warranty of any kind. I have also included a codesnippet GUID in line with the linked post, as a sort of experiment. Please leave that comment intact if you’re posting a derivative somewhere, and add your own.