Monday, December 13, 2010

Interval Search

Not that "meta" but interesting:

Someone popped up on Brazilian Python discussion list asking for ideas to optimize this search:
There is a set of merchandise distributors, each being able to deliver goods to one or more zip code ranges.Given a set of ZIP codes, he needs to find out which distributors can make the delivery for a given ZIP code.

For ~100000 zip cods and 1500 ZIP ranges this search takes about 1 min. with pure SQL in a postgresql server with some optimizations.

The implementation here can make the search in ~4s using in memory python data structures.

Leonardo Santagada offered an idea that might be even faster than the one presented here: finding a hash function for the ZIP ranges and building a Btree with then - thereafter, check each ZIP code agains the btree and retrieve the matching ranges.

The problem is that such hashfunction is not trivial (not for me :-p ) to come up with. And event he btree implementation would be an order of magnitude more complicated than this here:
Instead of matching the ZIPs against the set of ranges, one by one, we do match the Ranges against a sorted list of ZIPs - and for each Range get all ZIPs it matches in O(log(N)) time.

I didn't came up with this - credit is due to my friend Humberto Innareli, who had the idea in a chat in the car. (I myself came up with far more complicated, and far worse, ideas in the meantime)

For log(n) searchs in a sorted list, Python's stdlib provides the "bisect" module - its use is straightforward. Some more fancy classes and dictionary manipulation for a comfortable interface, and there it is, in less than 40 lines, running more than 10 times faster than the PostgreSQL solution for the same values. Btw, in the performance test. ZIP ranges are assumed to be no larger than 2/10ths of the total ZIP space, as this is the largest chunk of continuous zip codes mapping to a location in Brazil - if the ranges are all completly random performance drops dramatically - thats is because "harvesting" the zip codes themselves after we found the range of matching ZIPs is made in linear time, and is a function of the range size.

Friday, December 3, 2010

Multiple contexts in 'With' statement: not just for 2.7 and 3.1

As you know one of the syntax inovations for Python 3.1 and 2.7 is teh possiblity of using multiple,
different contexts in a single "with" statement. I.E. it is now possible to write this:

with open("file1") as f1, open("file2") as f2:
    #do stuff

With earlier Python versions it is thought that one is required to write:

with open("file1") as f1:
    with open("file2") as f2:
        #do stuff

But the fact is that if you are using Python 2.5 or 2.6 you are not bound to use this dreaded nested "with" construct.

A simple class could be used to allow the "flat 'with' statement" with no new syntax added - and it can be used in Python 2.6 (and probably 2.5). In fact, I am not not shure if this syntax enhancement is not in direct violation of "Special cases aren't special enough to break the rules". After all, the "for" statement can be used with multiple variables and no one has to write:

for index in range(len(seq)), value in seq: 
    #do stuff

And now, for something completly different, the multicontext class:

# -*- coding: utf-8 -*-
# Author: João S. O. Bueno
# License: Creative Commons Attribuion Required 3.0

class multicontext(object):
    def __init__(self, *args):
        if (len(args) == 1 and 
               (hasattr(args[0], "__len__") or 
                hasattr(args[0], "__iter__"))):
            self.objs = list(args[0])
            self.objs = args
    def __enter__(self):
        return tuple(obj.__enter__() for obj in self.objs)
    def __exit__(self, type_, value, traceback):
        return all([obj.__exit__(type_,value, traceback)
                      for obj in self.objs])

Zip. That is all. No need to upgrade your Python if all you need are flat "with"s.
Usage example and test:

>>> with  multicontext(open("%s.txt" %letter, "wt") for letter in "abc") as (a,b,c):
...   a.write("a")
...   b.write("b")
...   c.write("c")
>>> a
<closed file 'a.txt', mode 'wt' at 0x7f4172f17ad0>
>>> b
<closed file 'b.txt', mode 'wt' at 0x7f4172f177a0>
>>> c
<closed file 'c.txt', mode 'wt' at 0x7f4172f179c0>

(and thanks @rbp for pointing out I was offering no prize for the "contests" here #dislexia)

Wednesday, November 17, 2010

X-unzip - and peeping into it

Most of us already know the value of the zip function. (and xzip, in python2).

Once in a while, there might be the need of "unzipping" a sequence: that is - de-interleave a sequence zipped inside other one. So that one could get [[0, 1, 2, 3], [10, 11, 12, 13]]
back from [(0, 10), (1, 11), (2, 12), (3, 13)].

Python being what it is, that is an easy, if not trivial task -
we can just fetch all values on the zipped sequence, and append each one on a list enumerated inside a container sequence - it is easy as:

def unzip(seq):
    start = True
    for next in seq:
        if start:
            result = [[] for _ in xrange(len(next))]
            start = False
        for i,v in enumerate(next):
    return result

But... what if the original list, with te values to be de-interleaved is itself huge? Or part of a data-stream generated elsewhere? How can we fetch de-interleaved data, in order, without having to read through all that stream?

Well, the function up to this would return us a container sequence where each object would be a generator itself. And when called upon, this generator would manage to fetch new values from the original stream as needed, and share the values of the other data sequences with its sibling generators.

Fear not! That is a task for a couple dynamically generated generators (DGG's) agressively sharing data in their "nonlocal" variable space (otherwise known as closure).
def xunzip(seq):
    it = iter(seq)
    next =
    width = len(next)
    results = []
    buffer_ = [{"tuple": next, "consumed": 0}]
    # This is a list to workaround the non-existence
    # of the "nonlocal" keyword in python 2
    buffer_base = [0]
    for i in xrange(width):
        def freeze_i(j):
            def unzipper():
                offset = 0
                while True:
                    index = offset - buffer_base[0]
                    if not (offset < len(buffer_) + buffer_base[0]):
                        buffer_.append({"tuple":, "consumed": 0})
                    buffer_[index]["consumed"] += 1
                    value = buffer_[index]["tuple"][j]
                    if buffer_[index]["consumed"] == width:
                        buffer_base[0] += 1
                    yield value
                    offset += 1
            unzipper.func_name = "seq_%d" % j
            return unzipper
    return results
And just to finish today's unzipping party, some introspective code so that we can peep all of it taking place:
if __name__ == "__main__":
    import random
    a = zip(range(10), range(10,20), range(20,30))
    #print unzip(a)
    b = xunzip(a)
    finished_count = [0,0,0]
    unpack = [-1, -1, -1]
    while sum(finished_count) < len(b):
        index = random.randrange(len(b))
            unpack[index] = b[index].next()
            buffer_lenght = len(b[index].gi_frame.f_locals["buffer_"])
        except StopIteration:
            finished_count[index] = 1
            buffer_lenght = "Unknown"
        print unpack, "Buffer lenght: ", buffer_lenght
(Allternatively, fetch all code together from here: )

Thursday, November 11, 2010

"Selfless Python"

Proof of concept of Python methods using no explicit declaration of the parameter Self.

I mean:

class A:
   __metaclass__ = Selfless
   def __init__(a , b):
      self.a = a
      self.b = b

Is made to work from this example.

One common critic to Python is the need of the explicit "self" parameter declaration on function Methods. No matter how this design is optimal in terms of internal consistency, keeping things explicit, and reducing the keyword count, people keep complaining about it.

Given the access granted to the inner workings of the Language, it is easy to arrange an alternate mechanism for class methods that could preclude the declaration of a first parameter to a method. Therefore, people who are seeing complaining could try some shots of "selfless" Python.

My approach here is just as follows: I create a metaclass that uses an alternate way to fetch function methods inside the classes. With the normal "type" metaclass - i.e. normal "new style" classes - any functions declared in a class body are substituted, at class creation time, for a descriptor. When the attribute with the function name in a class, or instance of it, is retrieved, this descriptor creates a method object that wraps the original function. (And yes, the method object is created at attribute access time). This method object, when called, prepends the "instance" object as the first parameter in the list, and forwards the call to the wrapped function.

In this "selfless" proof of concept, I change the functions in the class body for descriptor objects, but descriptors that behave differently: instead of wrapping the function in a method object, they recreate the function when the method is acessed.
A new function object is created with the same code object, the same name, the same closure, but, with a different global dictionary. The global dictionary of the original function is copied and changed to include an entry named "self" which points to the object instance. (Or "None" if the function is retrieved from the class, not from an instance).

Therefore, inside these selfless methods, the name "self" can be used without ever being declared: it will exist as a global vairable at execution time, and point to the instance. Just like the parameter self exists as a local variable and points to the instance in regular Python methods.

This should not be used in production code - it is a good proof of concept - but of course there should be lots of strange and inconsistent behavior associeated with it. One of which is the fact methods can not be used "unbounded" as one does with ordinary python methods, explicitly passing an instance object as the first parameter.

Asa final consideration, Pedro Werneck had an implementation of this "feature" before here: - but unlike Werneck's example, I don't fiddle with the bytecode or anyother thing to "rebuild" the bytecode so that references to an unexisting local self variable are rewritten, and "self" is artificially inserted as a local variable at runtime. The work presented here is an order of magnitude simpler, and possibly less subject to colateral effetcs.

Nonetheless: I warn again - this is just a toy! Don't use this in "real life"!

Other than that, it works finner and in a simpler way than I expected.

# -*- coding: utf-8 -*-
from types import FunctionType

Changes classes so that each method has its instances changed so
that "self" is a global variable pointing to the instance. 

class SelflessDescriptor(object):
    def __init__(self, func):
        self.func = func
    def __get__(self, instance, class_):
        new_globals =  self.func.func_globals.copy()
        new_globals["self"] = instance
        new_func = FunctionType(self.func.func_code,
             new_globals, self.func.func_name,
        return new_func

class Selfless(type):
    def __new__(cls, name, bases, dict_):
        for key, val in dict_.items():
            if isinstance(val, FunctionType):
                dict_[key] = SelflessDescriptor(val)
        return type(name, bases, dict_)

if __name__ == "__main__":
    __metaclass__ = Selfless

    class A:
        def __init__(a, b):
            self.a = a
            self.b = b
    a = A("Hello", "World!")
    print a.a, a.b

Monday, November 8, 2010

Tail Recursion Elimination in Python

This, a while back, was maybe my first hack using introspection that I perceived as "wow, this is just fun".

So, Tail Recursion is a feature on some functional Languages to allow a function that calls itself as its last statement and just returns the value of this last call to its original caller to have no state recorded on itself. All the states needed are passed down on itself on this recursive call. Thus, the systems resources does not have to keep track of all the states it is past while the function descends calling itself - it just has to return to the original caller.

For example, ina factorial function, the last statement might just be something like "factorial(n - 1, init = n)" where the "init" parameter would be used internally to modify the return value (and default to "1" so thatthe original caller would just all "factorial(n)".

So, that is not the way Pythn works. And, by default, Python never will, as you can see on this BDLF's post: (coincidence or not, that was posted two days before I originally got to my "Soft Tail Recursion Concept", as exposed here).

And, although, Python's "hardware" won't allow it, its introspective nature already allow us to do so, via decorators.

here is the idea: You wrap the too-be-tail-recursive function in an object that, when called, takes note of the thread it's running, the depth of the call, and the parameters used. It then calls the original function. If the function calls itself again, when the call is routed through the wrapper object, it will note the depth has increased. It then saves the latest parameters (which contain all the state that is needed to the completion of the task -however deep it is) - and return control to the outer level object - the one that was running before the first (wrapped) function call.

"Return control"? How can it do that? Not usually seeing as "black magic": it just raises an exception, that is captured on the outter execution frame. Since teh code runniong on that frame is on the same object that raised the exception, it does know the parameters to be used in the function call - just that this time around the function call happens at depth "0".

The example bellow is even thread-safe. The limitation is that the decoratro has no way to know if the decorated function is suitable to be used with this resource - note that any statements or expressions that would be run after the recursive call are simply ignored (and making arithmetic operations with the function return value are just that: an expression that is evaluated after the function call).

Saturday, November 6, 2010

Recursive Lambda Functions

This one I had presented at my talk at FISL in 2010: One interestign feature in some functional languages - not all - that is seemingly unavaliable from Python is the ability of anonimous functions to call themselves, using recursion.

You can do that in Python if you assign a name to an anonymous function -- but DUH -- it won't be anomous anymore.

So, it is possible to do:

fact = lambda n: n * fact(n - 1) if n > 1 else 1

But not without depending on the name "fact", external to the function. The porblem to call an anonymous function is to get a reference to itself. Well...for the called code, there is a way to get a reference, not to the calling function, but to the calling code object. The code object is an attribute (f_code) to execution frames in Python. And while one can't execute code_objects directly, given a code object, it is possible to build a function around it -- and this function will work the same as the calling function, and can be called.

To create a function, one just have to call the DFunctionType -- I import it from types, but it one could as well just do "FunctionType = type(lambda: None)" - (It is how it is defined inside the file in the stdlib anyway). The documentation on FuntionType should be enough for anyone: it accepts a code_object and a globals dictionay as mandatory parameters.

So, here we are, for the Python recursive Lambda:

from inspect import currentframe
from types import FunctionType

lambda_cache = {}

def myself (*args, **kw):
    caller_frame = currentframe(1)
    code = caller_frame.f_code
    if not code in lambda_cache:
        lambda_cache[code] = FunctionType(code, caller_frame.f_globals)
    return lambda_cache[code](*args, **kw)

if __name__ == "__main__":
    print "Factorial of 5", (lambda n: n * myself(n - 1) if n > 1 else 1)(5)

(just note that it won't preserve default parameters for the lambda function - I found no way of getting a reference to those)

Friday, November 5, 2010

Python instance methods - How are they different from ordinary functions?

Python OO model works exceptionally well. Its introspection capabilities, and flexibility provided by the descriptors implementation allow for a lot of interesting thing - one of the nicest being what is available through the "property" built-in call.

However, the descriptor model implies in some behaviors that can raise eyebrows, even for experienced users. Like this bit here:
>>> class A(object):
...   def b(self): pass
>>> A.b is A.b
>>> a = A()
>>> a.b is a.b

Eeeeek!  :-) And now what? Well - it happens that while Object methods are not just functions that receive the object instance as its first parameter - that is, methods are objects from a different class than functions - they are not recorded in the class or in the object instance itself. Rather: both bound and unbound methods for new-style classes in Python are created at attribute retrieval-time: The very "a.b" expression above is, internally, a call to a.__getattribute__("b") - and this call uses the descritor mechanism to wrap the function stored in the class __dict__ attribute in an instance method.

So, methods are objects that contain a reference to the original function, but prepends the "self" variable to the function call when they are called for a bound instance:

>>> class A(object):
...   def b(self): pass
>>> a = A()
>>> a.b.im_func
<function b at 0x7f6613a6cd70>
>>> a.b.im_func is A.b.im_func is A.__dict__["b"]

(extra bonus: I just learnd about the possibility of concatenating the "is" operator here as well :-) )

This way of doing things works fine, in almost all circunstances. However, the only kinds of objects that are turned into "properties which are method factories" at class creation time are functions. If you want to have other kind of callable objects (like any instance of a class with a "call") into a  proper method, you have to create the descriptor for that yourself. It is actually quite simple -you just have to "implement the descriptor protocol" - which is another way to say, you just have to add a __get__ method :-) . Let's suppose I want a method for a class that is itself an object which keeps track of how many times it was called:
class CounterFunction(object):
    """Designed to work as function decorator - will not work for methods """
    def __init__(self, function):
        self.func = function
        self.counter = 0
    def __call__(self, *args, **kw):
        self.counter += 1
        return self.func(*args, **kw)

def b(): pass

print b.counter
#-> 2

And if we try to use that as a decorator for a method,
the method will no longer "work":

class C(object):
    def d(self): pass

c = C()
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "<stdin>", line 8, in __call__
#TypeError: d() takes exactly 1 argument (0 given)

However a call like "c.d(c) " would succeed  - the c.d object just does not get its instance prepended to the argument list.

If we do, instead:
from types import MethodType

class CounterFunction(object):
    """Designed to work as function or method decorator """
    def __init__(self, function):
        self.func = function
        self.counter = 0
    def __call__(self, *args, **kw):
        self.counter += 1
        return self.func(*args, **kw)
    def __get__(self, instance, owner):
        return MethodType(self, instance, owner)

class C(object):
    def d(self): pass

c = C()
print c.d.counter
#-> 1

This "__get__" method, as above, does exactly what the authomatically created "getter" for methods does when we use an ordnary method. This behavior is implemented in the metaclass for Python classes, that is, the "type" method. Whena  class object is instantiaded (the moment it is defined) the __new__ method inside type checks which objects in its body dictionary are functions, and replaces them by the descritors with a __get__ method like the one above.

It should be noted that it is an official advise that you can increase performance in certain code patterns by caching an object method in a local variable rather than retrieveing the method each time you want to call it - that is:

for i in xrange(5000):

is actually faster as:

b = a.do_something
for i in xrange(5000):

The exact behavior of methods and descriptors is described in detail here in the official documentation, here, --look at "user methods"  objects:

What if called functions could access the variables from the caller?

Of course, normally you just pass the variables you are interested the called function to access as parameters.

But it raises an interesting development paradigm if every non-local variable is looked up on the  scope of each calling function. Some languages, like Postscript, work this way.

In Python it can be achieved simply by copying the local variables in the code-frame of the caller functions to the global scope of the called one:

This decorator, while solid enough for "production" won't  work for variables in functions defined within other functions that use variable names that exist on the outher functions. Those would still be bound to the defining closure, as Free Variables.  It can be made to work with those, but it would be another beast.

Monday, November 1, 2010

Twitter and Tkinter: Who does not follow you

We are not so "meta" today, and more hands on.

Given the spam avalanche today, with people trying to get me to use a webapp that checks "who within my firends does not follow me" - I thought: That is quite easy -- just pick one set of people and subtract from the other, and voilá!

To get it out of terminal usage, I added in a little bit Tkinter - it is quite minimalist, still the Tkinter Part could contain some interesting patterns that I had to look around to produce (like, using the scrollbar, or updating what is displayed on teh window in the middle of data processing, with the"update") thing )

As for th twitter, we are not using any libraries exteranl to Python - just the rest API for a service that does not need authentication, and good old "urllib". We just loop through the follower and friends page results, and handpick just the users screen name (we could pick URL, and others for a more sofisticated app).

Saturday, October 30, 2010

Creating a Heap Class in one Python Line

Sometimes the stdlib is just strange --
like, the "heapq" module -- it provides some methods to use a list as a Heap, according to some well known algorithms - but it does not provide itself a "Heap" class -  You have to create your Healp as an empty list, and call "heapq.heappush" and "heapq.heappop"  (among other  methods), always passing your "heap" (list) as the first argument.

Oh.. "the object itself" as its first argument -- I had seen that before - so, we could possibly just create a class and in its body provide "pop = heapq.heappop" , and so on -- since these "heap*" methods signature is always "heapq.heap* (heap[, arg])", if I mark these functions as a "Heap class" Method themselves, they should work as methods. Not!

class Heap(list):
    pop = heapq.heappop 

The only members  in a class body that are promoted to "InstanceMethods" are functions - and the hepq methods are "built-in functions".  It is a trick on how Python new style classes work. Without being functions, they are not promoted to "methods", and whenever they are called, the object instance will not be pre-pended to parameters list.

So, the most obvious way to do that, is to create wrapper functions that just call the heapq.methods, and use  those as methods in a class that inherits from "list".

Since the goal is to do that in one line, this is the perfect ocasion to use the new dictionary generators from Python 2.7 to create the class body.

So we try:

Heap = type("Heap", (list,), {item: (lambda self, *args: getattr(heapq, "heap" + item)(self, *args))  for item in ("pop", "push", "pushpop", "replace")   } )

This also does not work!
We  have to keep in mind whenever we generate functions inside a for loop in Python, that functions work in a closure - so, the variables used on the "for" itself will always evaluate to the last value the attained in the "for statement" when the generated functions are called.  Hence, all the members in the class above will wrap "heapq.heapreplace".

The way to get it working is to add yield the generated functions from another level of wrapper functions, so that the variables used in the loop are frozen in this second-level wrappers.

import heapq
Heap = type("Heap", (list,), {item: (lambda item2: (lambda self, *args: getattr(heapq, "heap" + item2)(self, *args)))(item)  for item in ("pop", "push", "pushpop", "replace")})

And now we are in business:

>>> a = Heap()
>>> a.push(10)
>>> a.push(5)
>>> a.push(1)
>>> a.push(20)
>>> a.pop()
>>> a.pop()
>>> a.pushpop(30)