Wednesday, November 17, 2010

X-unzip - and peeping into it

Most of us already know the value of the zip function. (and xzip, in python2).

Once in a while, there might be the need of "unzipping" a sequence: that is - de-interleave a sequence zipped inside other one. So that one could get [[0, 1, 2, 3], [10, 11, 12, 13]]
back from [(0, 10), (1, 11), (2, 12), (3, 13)].

Python being what it is, that is an easy, if not trivial task -
we can just fetch all values on the zipped sequence, and append each one on a list enumerated inside a container sequence - it is easy as:

def unzip(seq):
    start = True
    for next in seq:
        if start:
            result = [[] for _ in xrange(len(next))]
            start = False
        for i,v in enumerate(next):
            result[i].append(v)
    return result


But... what if the original list, with te values to be de-interleaved is itself huge? Or part of a data-stream generated elsewhere? How can we fetch de-interleaved data, in order, without having to read through all that stream?

Well, the function up to this would return us a container sequence where each object would be a generator itself. And when called upon, this generator would manage to fetch new values from the original stream as needed, and share the values of the other data sequences with its sibling generators.

Fear not! That is a task for a couple dynamically generated generators (DGG's) agressively sharing data in their "nonlocal" variable space (otherwise known as closure).
def xunzip(seq):
    it = iter(seq)
    next = it.next()
    width = len(next)
    results = []
    buffer_ = [{"tuple": next, "consumed": 0}]
    # This is a list to workaround the non-existence
    # of the "nonlocal" keyword in python 2
    buffer_base = [0]
    for i in xrange(width):
        def freeze_i(j):
            def unzipper():
                offset = 0
                while True:
                    index = offset - buffer_base[0]
                    if not (offset < len(buffer_) + buffer_base[0]):
                        buffer_.append({"tuple": it.next(), "consumed": 0})
                    buffer_[index]["consumed"] += 1
                    value = buffer_[index]["tuple"][j]
                    if buffer_[index]["consumed"] == width:
                        buffer_.pop(0)
                        buffer_base[0] += 1
                    yield value
                    offset += 1
            unzipper.func_name = "seq_%d" % j
            return unzipper
        results.append(freeze_i(i)())
    return results
And just to finish today's unzipping party, some introspective code so that we can peep all of it taking place:
if __name__ == "__main__":
    import random
    a = zip(range(10), range(10,20), range(20,30))
    #print unzip(a)
    b = xunzip(a)
    finished_count = [0,0,0]
    unpack = [-1, -1, -1]
    while sum(finished_count) < len(b):
        index = random.randrange(len(b))
        try:
            unpack[index] = b[index].next()
            buffer_lenght = len(b[index].gi_frame.f_locals["buffer_"])
        except StopIteration:
            finished_count[index] = 1
            buffer_lenght = "Unknown"
        print unpack, "Buffer lenght: ", buffer_lenght
(Allternatively, fetch all code together from here: http://pastebin.com/fiAq2uy9 )

2 comments:

  1. Can't you simply do

    >>> zipped = list(zip(list1, list2))
    >>> unzipped = zip(*zipped)

    ?

    The "list()" around the first zip is because, in Python 3, "zip" returns a "zip object", not a list. This also ensures (if I'm not mistaken) that the generator-like properties you seek are kept.

    ReplyDelete
  2. While this zip-reapplying does unzip - (and D'UH, I had not even perceived that) - the iterator I get on the long example here is not achieved by Python's 3 zip object (or itertools.izip on python 2.0) :
    It returns an iterator that which, when its items are retrieved, return the full deinterleaved sequences as a list, one at a time.
    The examle in this post returns a container list for one iterator for each data sequence.

    ReplyDelete