Monday, December 13, 2010

Interval Search

Not that "meta" but interesting:

Someone popped up on Brazilian Python discussion list asking for ideas to optimize this search:
There is a set of merchandise distributors, each being able to deliver goods to one or more zip code ranges.Given a set of ZIP codes, he needs to find out which distributors can make the delivery for a given ZIP code.

For ~100000 zip cods and 1500 ZIP ranges this search takes about 1 min. with pure SQL in a postgresql server with some optimizations.

The implementation here can make the search in ~4s using in memory python data structures.

Leonardo Santagada offered an idea that might be even faster than the one presented here: finding a hash function for the ZIP ranges and building a Btree with then - thereafter, check each ZIP code agains the btree and retrieve the matching ranges.

The problem is that such hashfunction is not trivial (not for me :-p ) to come up with. And event he btree implementation would be an order of magnitude more complicated than this here:
Instead of matching the ZIPs against the set of ranges, one by one, we do match the Ranges against a sorted list of ZIPs - and for each Range get all ZIPs it matches in O(log(N)) time.

I didn't came up with this - credit is due to my friend Humberto Innareli, who had the idea in a chat in the car. (I myself came up with far more complicated, and far worse, ideas in the meantime)

For log(n) searchs in a sorted list, Python's stdlib provides the "bisect" module - its use is straightforward. Some more fancy classes and dictionary manipulation for a comfortable interface, and there it is, in less than 40 lines, running more than 10 times faster than the PostgreSQL solution for the same values. Btw, in the performance test. ZIP ranges are assumed to be no larger than 2/10ths of the total ZIP space, as this is the largest chunk of continuous zip codes mapping to a location in Brazil - if the ranges are all completly random performance drops dramatically - thats is because "harvesting" the zip codes themselves after we found the range of matching ZIPs is made in linear time, and is a function of the range size.

Friday, December 3, 2010

Multiple contexts in 'With' statement: not just for 2.7 and 3.1

As you know one of the syntax inovations for Python 3.1 and 2.7 is teh possiblity of using multiple,
different contexts in a single "with" statement. I.E. it is now possible to write this:

with open("file1") as f1, open("file2") as f2:
    #do stuff

With earlier Python versions it is thought that one is required to write:

with open("file1") as f1:
    with open("file2") as f2:
        #do stuff

But the fact is that if you are using Python 2.5 or 2.6 you are not bound to use this dreaded nested "with" construct.

A simple class could be used to allow the "flat 'with' statement" with no new syntax added - and it can be used in Python 2.6 (and probably 2.5). In fact, I am not not shure if this syntax enhancement is not in direct violation of "Special cases aren't special enough to break the rules". After all, the "for" statement can be used with multiple variables and no one has to write:

for index in range(len(seq)), value in seq: 
    #do stuff

And now, for something completly different, the multicontext class:

# -*- coding: utf-8 -*-
# Author: João S. O. Bueno
# License: Creative Commons Attribuion Required 3.0

class multicontext(object):
    def __init__(self, *args):
        if (len(args) == 1 and 
               (hasattr(args[0], "__len__") or 
                hasattr(args[0], "__iter__"))):
            self.objs = list(args[0])
            self.objs = args
    def __enter__(self):
        return tuple(obj.__enter__() for obj in self.objs)
    def __exit__(self, type_, value, traceback):
        return all([obj.__exit__(type_,value, traceback)
                      for obj in self.objs])

Zip. That is all. No need to upgrade your Python if all you need are flat "with"s.
Usage example and test:

>>> with  multicontext(open("%s.txt" %letter, "wt") for letter in "abc") as (a,b,c):
...   a.write("a")
...   b.write("b")
...   c.write("c")
>>> a
<closed file 'a.txt', mode 'wt' at 0x7f4172f17ad0>
>>> b
<closed file 'b.txt', mode 'wt' at 0x7f4172f177a0>
>>> c
<closed file 'c.txt', mode 'wt' at 0x7f4172f179c0>

(and thanks @rbp for pointing out I was offering no prize for the "contests" here #dislexia)