Monday, December 13, 2010

Interval Search

Not that "meta" but interesting:

Someone popped up on Brazilian Python discussion list asking for ideas to optimize this search:
There is a set of merchandise distributors, each being able to deliver goods to one or more zip code ranges.Given a set of ZIP codes, he needs to find out which distributors can make the delivery for a given ZIP code.

For ~100000 zip cods and 1500 ZIP ranges this search takes about 1 min. with pure SQL in a postgresql server with some optimizations.

The implementation here can make the search in ~4s using in memory python data structures.

Leonardo Santagada offered an idea that might be even faster than the one presented here: finding a hash function for the ZIP ranges and building a Btree with then - thereafter, check each ZIP code agains the btree and retrieve the matching ranges.

The problem is that such hashfunction is not trivial (not for me :-p ) to come up with. And event he btree implementation would be an order of magnitude more complicated than this here:
Instead of matching the ZIPs against the set of ranges, one by one, we do match the Ranges against a sorted list of ZIPs - and for each Range get all ZIPs it matches in O(log(N)) time.

I didn't came up with this - credit is due to my friend Humberto Innareli, who had the idea in a chat in the car. (I myself came up with far more complicated, and far worse, ideas in the meantime)

For log(n) searchs in a sorted list, Python's stdlib provides the "bisect" module - its use is straightforward. Some more fancy classes and dictionary manipulation for a comfortable interface, and there it is, in less than 40 lines, running more than 10 times faster than the PostgreSQL solution for the same values. Btw, in the performance test. ZIP ranges are assumed to be no larger than 2/10ths of the total ZIP space, as this is the largest chunk of continuous zip codes mapping to a location in Brazil - if the ranges are all completly random performance drops dramatically - thats is because "harvesting" the zip codes themselves after we found the range of matching ZIPs is made in linear time, and is a function of the range size.

No comments:

Post a Comment