Archive for the ‘Uncategorized’ Category

s3w 0.6.0

Friday, June 12th, 2009

It’s time for another release. s3w is a client to access and synchronise with Amazon Web Services’ Simple Storage Service (Amazon S3). It supports both direct access to S3 operations (GET, PUT, LIST, COPY, …) and higher-level functionality like bucket-to-bucket copy and pushing local directory structures into buckets.

s3w 0.6.0 is now out. As well as the standard array of bug fixes, the major new feature in this release is “stacked push”. In a stacked push, another bucket (or prefix within a bucket) is used as a base, and any files that match the corresponding object in the stacked-upon location won’t be pushed again. Incremental backups are one use for that; I also use it as a quick way to publish slightly different versions of files. Multiple stacked locations can be used, but each new location in the stack adds another network round-trip to slow the process down, so you probably want to keep the number low. They’re searched left-to-right as on the command line, push –stacked base1 –stacked base2 … src/ dest/.

Pull has a corresponding stacked mode for reconstructing the directory tree, which reads in the same order. There’s also new copybucket –move support, which deletes the source keys after they’re copied, and which can be paired with the new copybucket –save-date, which preserves the original Last-Modified date in a special piece of metadata which push will use (incremental backups, again). Etags (file checksums) can be calculated to save redundant pushes as well, which is useful if you have severe clock skew or unreliable modification times.

There are also scattered bug fixes to problems that emerged during the development cycle, and some extra niceties like offering suggestions to mistyped commands or bucket names. The configuration file can customise more behaviour – per-bucket exclusions and custom short names for buckets. See s3w config –help for details of how to set these values – push.bucket.<dom>.excludes (list) and shortname.<shortname> (string bucket name).

s3w depends on Python 2.6 and Boto. There are build and quickstart instructions in the readme. Running just s3w with no arguments will give a list of commands and a brief summary, while comprehensive documentation for each is available with s3w <commandname> –help.

Development takes place in a Bazaar branch, currently hosted on Launchpad (not necessarily going to stay there). You can access it using bzr branch lp:s3w. Patches welcome! There are some usage examples on the homepage, as well as in the internal documentation.

Reviews of books that should never have been written I: Ice Station

Wednesday, June 3rd, 2009

Matthew Reilly’s Ice Station is somewhat in the Crichton or Clancy vein, of which, well, you know. It has the cringing down pat. Nevertheless, can’t blame it for what it is, I knew that going in and I knew I was going to regret it, but the real problem: Reilly appears to have reached adulthood unaware that whales and dolphins have blowholes through which to breathe. I am unable to fathom this.

Regular expression animator

Tuesday, May 5th, 2009

This regular expression animator (including both DFA and NFA traversals) is very cool.

s3w 0.5.0

Thursday, April 30th, 2009

There we go, a real release. s3w is a client to access and synchronise with Amazon Web Services’ Simple Storage Service (Amazon S3). It supports both direct access to S3 operations (GET, PUT, LIST, COPY, …) and higher-level functionality like bucket-to-bucket copy and pushing local directory structures into buckets.

This is a fairly early version, but it’s relatively featureful already. As well as the bucket-to-bucket copying I wanted initially, it has a full set of manipulation commands and support for pushing a local directory into a bucket along with suitable metadata to allow recreating it locally with the pull command (so it can be used for backups, as well as for pushing out a website tree). Running just s3w with no arguments will give a list of commands and a brief summary, while comprehensive documentation for each is available with s3w <commandname> –help.

s3w 0.5.0 is now out. s3w depends on Python 2.6 and Boto. There are build and quickstart instructions in the readme.

Development takes place in a Bazaar branch, currently hosted on Launchpad (not necessarily going to stay there). You can access it using bzr branch lp:s3w. Patches welcome! There are some usage examples on the homepage, as well as in the internal documentation. Development is continuing with some new features I have in mind, so there should be another release sometime soon too.

Python Damerau-Levenshtein distance implementation

Sunday, April 26th, 2009

The Damerau-Levenshtein distance between two strings  is the number of additions, deletions, substitutions, and transpositions needed to transform one into the other.  It’s an extension of the Levenshtein distance, by incorporating transpositions into the set of operations.

I had a need to use it in a program recently, but I couldn’t find any Python implementation around that wasn’t buggy in trivial cases or just extremely inefficient (strictly, you can implement it neatly by recursion, and it’s quite instructive about the algorithm in some ways, but it takes an age to run when you do and you hit the recursion limit in fairly short order). I’ve put one together myself from the algorithmic definition that I’ve tested to work correctly and reasonably efficiently. I’d suggest working in a C module if you need the best possible efficiency. There are some things in difflib you can work around to doing the same thing as well, particularly good if you do want the required sequence of operations too. I’m only doing the numeric side of things here.

The algorithm is inherently O(N*M) in time, and the naïve version is in space as well. This implementation is O(M) in space, as only the last two rows of the matrix need to be kept around. I have implemented it for Python 2.x, but 2to3 makes an accurate translation if you’re using it with Py3K.

def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]


The code is available under the MIT licence, in the hope that it will be useful, but without warranty of any kind. I have also included a codesnippet GUID in line with the linked post, as a sort of experiment. Please leave that comment intact if you’re posting a derivative somewhere, and add your own.

Copying buckets on Amazon S3: s3w

Friday, April 24th, 2009

I run some server backups to Amazon Simple Storage Service (S3) using s3sync. We were previously using s3backer and rsync to maintain daily and weekly backups, but it works out too bandwidth-inefficient to operate that way – what was saved on storage went out on incoming bandwidth in the same day.

Amazon has a special incoming bandwidth rate of 3c/GB for this month, for some reason I don’t remember, so I moved the backups to s3sync while it was cheap. It’s a lot more cost-efficient this way, but we lost the daily and weekly backups (unless all the data gets transferred multiple times). What I wanted was a way to clone an entire bucket into another, but I couldn’t find any existing tools to do that.

I had been looking at the excellent Boto AWS library for other reasons already, so I put together a little script with it. I can run s3w copybucket bucket1 bucket2 to do a bucket-to-bucket copy of all the keys without transferring the data out and in again. It copies as efficiently as possible, only overwriting if the existing key content is different in the destination than the source, and using Amazon’s key copying so there’s no bandwidth used. This version also includes putstdin, getstdout, metadata, and list commands that I found useful for other parts of the backup system and debugging.

s3w 0.0.1

This is very early stages, there’s no installer in that, and it’s probably full of bugs. I’ve been running it on our backups for a while now without any issues though. I have a more advanced version locally with some more commands in it I’ll release if it goes anywhere.