Archive for April, 2009

s3w 0.5.0

Thursday, April 30th, 2009

There we go, a real release. s3w is a client to access and synchronise with Amazon Web Services’ Simple Storage Service (Amazon S3). It supports both direct access to S3 operations (GET, PUT, LIST, COPY, …) and higher-level functionality like bucket-to-bucket copy and pushing local directory structures into buckets.

This is a fairly early version, but it’s relatively featureful already. As well as the bucket-to-bucket copying I wanted initially, it has a full set of manipulation commands and support for pushing a local directory into a bucket along with suitable metadata to allow recreating it locally with the pull command (so it can be used for backups, as well as for pushing out a website tree). Running just s3w with no arguments will give a list of commands and a brief summary, while comprehensive documentation for each is available with s3w <commandname> --help.

s3w 0.5.0 is now out. s3w depends on Python 2.6 and Boto. There are build and quickstart instructions in the readme.

Development takes place in a Bazaar branch, currently hosted on Launchpad (not necessarily going to stay there). You can access it using bzr branch lp:s3w. Patches welcome! There are some usage examples on the homepage, as well as in the internal documentation. Development is continuing with some new features I have in mind, so there should be another release sometime soon too.

Python Damerau-Levenshtein distance implementation

Sunday, April 26th, 2009

The Damerau-Levenshtein distance between two strings is the number of additions, deletions, substitutions, and transpositions needed to transform one into the other. It extends the Levenshtein distance by adding transpositions to the set of permitted operations.

I needed it in a program recently, but I couldn’t find any Python implementation around that wasn’t buggy on trivial cases or just extremely inefficient. (Strictly, you can implement it neatly by recursion, and doing so is quite instructive about the algorithm, but it takes an age to run and you hit the recursion limit in fairly short order.) I’ve put one together myself from the algorithmic definition, and tested it to work correctly and reasonably efficiently. If you need the best possible efficiency, I’d suggest a C module instead. You can also coax difflib into doing much the same thing, which is particularly good if you want the required sequence of operations too; I’m only doing the numeric side of things here.
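For reference, here is roughly what that neat recursive definition looks like. This is a sketch of my own to illustrate the algorithm, not the implementation below; it is fine for understanding the recurrence, but hopeless for long strings (exponential time):

```python
def dl_recursive(a, b):
    """Naive recursive Damerau-Levenshtein distance (illustrative only)."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[-1] == b[-1] else 1
    d = min(dl_recursive(a[:-1], b) + 1,            # deletion from a
            dl_recursive(a, b[:-1]) + 1,            # addition to a
            dl_recursive(a[:-1], b[:-1]) + cost)    # substitution
    # Transposition: the last two characters are swapped between the strings.
    if len(a) > 1 and len(b) > 1 and a[-1] == b[-2] and a[-2] == b[-1]:
        d = min(d, dl_recursive(a[:-2], b[:-2]) + 1)
    return d
```

Each call strips characters off the ends of the strings, so the recursion depth grows with the input length and the same subproblems are recomputed over and over; the matrix-based version below avoids both problems.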

The algorithm is inherently O(N*M) in time, and the naïve version is O(N*M) in space as well. This implementation is O(M) in space, as only the current and two previous rows of the matrix need to be kept around. I have implemented it for Python 2.x, but 2to3 makes an accurate translation if you’re using it with Py3K.

def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a (len(seq1) + 1) * (len(seq2) + 1)
    # matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]
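For anyone on Python 3 who would rather not run 2to3, here is what the translation comes out as (range is already lazy there, so the first row just needs wrapping in list()), along with a hypothetical "did you mean" helper as one possible use; the helper is illustrative, not from my actual program:

```python
def damerau_levenshtein(seq1, seq2):
    # Python 3 rendering of the algorithm above.
    oneago = None
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    for x in range(len(seq1)):
        # As before, the leftmost column lives at the *end* of the row.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in range(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

def closest(word, vocabulary):
    # Toy "did you mean" helper: pick the entry with the smallest
    # edit distance to the (possibly misspelled) input.
    return min(vocabulary, key=lambda v: damerau_levenshtein(word, v))
```

For example, closest('pusj', ['push', 'pull', 'list']) picks 'push', since a single substitution reaches it.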

The code is available under the MIT licence, in the hope that it will be useful, but without warranty of any kind. I have also included a codesnippet GUID in line with the linked post, as a sort of experiment. Please leave that comment intact if you’re posting a derivative somewhere, and add your own.

Copying buckets on Amazon S3: s3w

Friday, April 24th, 2009

I run some server backups to Amazon Simple Storage Service (S3) using s3sync. We were previously using s3backer and rsync to maintain daily and weekly backups, but that worked out too bandwidth-inefficient: what was saved on storage went out again as incoming bandwidth in the same day.

Amazon has a special incoming bandwidth rate of 3c/GB for this month, for some reason I don’t remember, so I moved the backups to s3sync while it was cheap. It’s a lot more cost-efficient this way, but we lost the daily and weekly backups (unless all the data gets transferred multiple times). What I wanted was a way to clone an entire bucket into another, but I couldn’t find any existing tools to do that.

I had been looking at the excellent Boto AWS library for other reasons already, so I put together a little script with it. I can run s3w copybucket bucket1 bucket2 to do a bucket-to-bucket copy of all the keys without transferring the data out and in again. It copies as efficiently as possible, only overwriting when the existing key content in the destination differs from the source, and using Amazon’s key copying so no bandwidth is used. This version also includes putstdin, getstdout, metadata, and list commands that I found useful for other parts of the backup system and for debugging.
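The "only when different" check is the interesting part. The post doesn’t spell out the mechanism, but one natural way to do it (an assumption on my part, not necessarily what s3w does) is to compare S3 ETags, which for ordinary single-part uploads are just the quoted MD5 of the content:

```python
def needs_copy(src_etag, dst_etag):
    # Hypothetical helper, not s3w's actual code. A destination key
    # that is absent, or whose ETag (the content MD5 for single-part
    # uploads) differs from the source's, needs to be copied.
    if dst_etag is None:
        return True
    return src_etag.strip('"') != dst_etag.strip('"')
```

With a check like this, an unchanged key costs only a metadata lookup, and a changed one costs only a server-side copy, so no object data ever leaves S3.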

s3w 0.0.1

These are very early stages: there’s no installer, and it’s probably full of bugs. I’ve been running it on our backups for a while now without any issues, though. I have a more advanced version locally with some more commands in it, which I’ll release if this goes anywhere.

Sample LocaliseDirs script

Saturday, April 4th, 2009

Every so often people come along wanting to localise (or otherwise change) the names of the /Programs or /System/* directories. We’re not terribly interested in changing our structure, and for the reasons outlined in one of the design documents I don’t think localising them is much use either, but I’ve written a little sample script that does it that somebody could run with if they were motivated.

It is of course entirely possible to use whatever structure you like in Gobo: just create symlinks for your new tree, and optionally use `gobohide -h` to make the real versions disappear. That will suffice for those who just want to change the names for no particular reason (though it never seems to placate them…), but it’s clearly too much of a hassle for localisation.

The LocaliseDirs script parses a simple tab-separated value file mapping each real location to a new name, and creates symlinks to implement that mapping. It saves a list of the changes it made into /.Localise_Journal, so that they can be reverted without guesswork (LocaliseDirs --reset), and it can use that file to hide or show the real directories (--hide and --show).
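To illustrate the idea in miniature (a hypothetical Python sketch, not the actual script): each TSV row maps a real location to a new name, the symlink path follows directly from that, and the journal is simply the list of links created:

```python
import os

def plan_links(rows):
    # rows: (real_location, new_name) pairs parsed from the TSV file.
    # Each new name lives alongside the real directory, so the link
    # path is the real location's parent joined with the new name.
    # Returns (link_path, target) pairs; writing the link paths out
    # afterwards gives the journal needed for a later --reset.
    return [(os.path.join(os.path.dirname(real), name), real)
            for real, name in rows]
```

For instance, a row mapping /Programs to "Programme" yields a planned symlink /Programme pointing at /Programs; reverting is then just removing every path recorded in the journal.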

That script is unsupported and not a strong candidate for inclusion in the tools any time soon, but it’s there for those that might be interested in using or extending it. It’s mostly a proof-of-concept and demonstration of how you could go about it manually. If you do improve it, please send along the patches — if it gets to the point where it doesn’t have obvious flaws any more it could be merged into Scripts proper eventually.