Snacker News

Python generators for biologists

August 24, 2017

The use of generators is an intermediate programming topic that comes up again and again in biological codebases. In short, a generator lets you store some code to run later. In biological codebases, generators are excellent for times when you would like to go through all the elements in some collection and operate on all of them, and then collect the result of all the operations into another list.

What’s a generator?

In Python, a generator is a kind of iterable: an object you can iterate over. Unlike a list, a generator is lazy, meaning that it computes the value of each item only when asked for it (not when the generator is constructed).

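A lazy iterable many computational biologists have used is Python's built-in range(). Strictly speaking, range() returns a lazy sequence rather than a generator, but it shows the same idea: the numbers are produced as the loop asks for them.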

>>> R = range(3)
>>> for r in R:
...     print(r)
...
0
1
2
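
To see the laziness directly, here is a minimal sketch using a hand-written generator function. The names squares and computed are hypothetical and exist only for illustration; the list simply records which values have been calculated so far.

```python
computed = []  # records which values have actually been calculated

def squares(limit):
    """Hypothetical generator: yield n * n for n in 0..limit-1."""
    for n in range(limit):
        computed.append(n)  # runs only when a value is requested
        yield n * n

gen = squares(1000)      # builds the generator; the loop body has not run
before = list(computed)  # snapshot: still empty
first = next(gen)        # asks for one value
second = next(gen)       # asks for one more
```

After the two next() calls, computed holds only the two inputs that were requested, even though the generator could produce a thousand values.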

A common use case for generators

Without generators, you create an empty list, iterate through the items in your collection, and then add the results to the list.

results = []
for item in items:
    my_result = item.transform()
    results.append(my_result)

With a generator, you instead define a function that yields the results one at a time.

def results():
    for item in items:
        yield item.transform()

When you call the function, you get a Python generator that you can iterate over. The neat part: the generator items get evaluated lazily (as needed instead of all at once), so that if your transform method is some big ugly piece of scientific code, it gets evaluated only when you actually use the value of the result for something.
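
A small sketch of this behavior, assuming a hypothetical Sample class whose transform method counts how often it is called (a stand-in for a real, expensive scientific calculation). It uses itertools.islice from the standard library to pull just the first two results.

```python
from itertools import islice

class Sample:
    """Hypothetical stand-in for an item with an expensive transform."""
    calls = 0  # class-level counter of transform calls

    def __init__(self, value):
        self.value = value

    def transform(self):
        Sample.calls += 1
        return self.value * 2

items = [Sample(v) for v in range(100)]

def results():
    for item in items:
        yield item.transform()

gen = results()                    # calling the function runs no transform
calls_after_creation = Sample.calls
first_two = list(islice(gen, 2))   # pulls exactly two results
calls_after_two = Sample.calls
```

Creating the generator costs nothing here; the counter only moves when results are actually pulled, and it stops at two even though items holds a hundred entries.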

How generators work

Generators’ lazy evaluation can help make computational biology tasks more efficient. Let’s take a closer look at how they work. We have a collection called items that we wish to iterate over. On each item, we must evaluate a heavy transform function. However, we have a clearly defined stopping criterion, so we can stop processing items once we have enough information (perhaps we have found the one we are looking for).

To set up, we’ll make a dummy type to play with. (Note: this code uses Python type annotations, which name the types of variables and return values. In this case, they say that transform returns a float and that T is a float.)

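def big_long_calculation(n: int) -> float:
    return float(n)  # placeholder for an expensive computation

class Item:
    def __init__(self, n: int) -> None:
        self.n = n

    def transform(self) -> float:
        T: float = big_long_calculation(self.n)
        return T

items = [Item(n) for n in range(10000)]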

First, let’s look at the case where we don’t use generators. Here we have to call the transform method, and therefore big_long_calculation, on every single item in the list before we can look at any result.

results = []
for item in items:
    my_result = item.transform()
    # code is run and value is returned
    results.append(my_result)

results
# we have evaluated `transform` on every item

def get_answers():
    answers = 0
    for result in results:
        answers += result
        if answers > 10:
            # we may have only used the first few items,
            # and transformed thousands of items unnecessarily
            break
    return answers

Using generators, we wait to run the transform method until we are sure we will need to. Our evaluation of each result is only done if we decide that result is needed, as is shown below.

def get_results():
    for item in items:
        yield item.transform()
        # this code is stored, not run

def get_answers():
    results = get_results()
    # create generator, but do not evaluate anything yet
    answers = 0
    for result in results:
        # each loop step asks the generator for one result,
        # so transform runs here, for this one item only
        answers += result
        if answers > 10:
            # stop early; no further items are transformed
            break
    return answers

In this case, if we only need to evaluate a few items in a very long list for our answers variable to exceed 10, then we can save a lot of computational time by using a generator. Lazy evaluation does less work than building a list whenever the generator need not be exhausted, and about the same amount of work when every item has to be evaluated, so generators are often a good default. One caveat: a generator can only be iterated over once, so build a list if you need to reuse the results.
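
The saving can be made concrete by counting calls. The sketch below is an illustration under stated assumptions, not a benchmark: CountingItem and answers_from are hypothetical names, every transform returns 3.0, and the generator is written as a generator expression, which is a compact equivalent of a generator function.

```python
class CountingItem:
    """Hypothetical item whose transform counts its own calls."""
    calls = 0

    def transform(self) -> float:
        CountingItem.calls += 1
        return 3.0  # fixed dummy value standing in for a real result

def answers_from(results) -> float:
    """Sum results until the running total exceeds 10, then stop."""
    answers = 0.0
    for result in results:
        answers += result
        if answers > 10:
            break
    return answers

items = [CountingItem() for _ in range(1000)]

# Eager: the list comprehension transforms every item up front.
CountingItem.calls = 0
eager_answer = answers_from([item.transform() for item in items])
eager_calls = CountingItem.calls

# Lazy: the generator expression transforms items only on demand.
CountingItem.calls = 0
lazy_answer = answers_from(item.transform() for item in items)
lazy_calls = CountingItem.calls
```

Both versions return the same answer, but the eager version calls transform once per item while the lazy version stops after the fourth call, as soon as the running total passes 10.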