August 24, 2017
The use of generators is an intermediate programming topic that comes up again and again in biological codebases. In short, a generator lets you define a computation whose results are produced only when they are asked for. In biological codebases, generators are excellent when you want to go through all the elements of a collection, operate on each one, and collect the results into another list.
In Python, a generator is an iterable, meaning you can loop over it just as you would a list. However, generators are lazy: they compute each value only when it is asked for, not when the generator is constructed.
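To make the laziness concrete, here is a minimal sketch (the name lazy_numbers is illustrative, not part of any library):

```python
def lazy_numbers():
    # Nothing in this body runs until the generator is iterated.
    for n in [1, 2, 3]:
        yield n * 10

gen = lazy_numbers()   # creates the generator; computes nothing yet
print(next(gen))       # runs the body just far enough to yield once: 10
print(list(gen))       # exhausts the rest: [20, 30]
```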
A lazy iterable many computational biologists have used is Python's built-in range() (in Python 3 it is technically a lazy sequence rather than a generator, but it produces its values on demand in the same spirit).
```python
>>> R = range(3)
>>> for r in R:
...     print(r)
0
1
2
```
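A quick way to see this laziness, using only the standard library: a range over a billion integers is created instantly and occupies a small, constant amount of memory, because its values are computed on demand rather than stored.

```python
import sys

big = range(10**9)           # constructed instantly; the integers are not stored
print(sys.getsizeof(big))    # a small, fixed size regardless of the range's length
print(big[123_456_789])      # individual values are computed on demand
```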
Without generators, you create an empty list, iterate through the items in your collection, and then add the results to the list.
```python
results = []
for item in items:
    my_result = item.transform()
    results.append(my_result)
```
Using a generator, you define a generator function that yields individual results:

```python
def results():
    for item in items:
        yield item.transform()
```
When you call the function, you get a Python generator that you can iterate over. The neat part: the generator items get evaluated lazily (as needed instead of all at once), so that if your
transform method is some big ugly piece of scientific code, it gets evaluated only when you actually use the value of the result for something.
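The same idea can also be written as a generator expression, which is often the most compact form. The Item class below is a cheap, illustrative stand-in for real scientific code:

```python
class Item:
    """Illustrative stand-in; a real transform would be expensive."""
    def __init__(self, n: int) -> None:
        self.n = n

    def transform(self) -> float:
        return float(self.n ** 2)

items = [Item(n) for n in range(5)]

# A generator expression: the same laziness as the yield-based function,
# in a single expression. No transform() call happens on this line.
results = (item.transform() for item in items)

print(next(results))   # transform() runs for exactly one item here
```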
Generators’ lazy evaluation can help make computational biology tasks more efficient. Let’s take a closer look at how they work. We have a collection called
items that we wish to iterate over. On each
item, we must evaluate a heavy
transform function. However, we have clearly defined convergence criteria, so we can stop processing the list once we have enough information (perhaps we have found the item we are looking for).
To set up, we'll make a dummy type to play with. (Note: this is Python code annotated with the types of the variables. Here, the annotations say that transform returns a float and that T is a float.)

```python
class Item:
    def __init__(self, n: int) -> None:
        self.n = n

    def transform(self) -> float:
        T: float = big_long_calculation()
        return T

items = [Item(n) for n in range(10000)]
```
First, let’s look at the case where we don’t use generators. Without using a generator, we have to evaluate the
transform method, which calls
big_long_calculation on every single item in the list.
```python
results = []
for item in items:
    my_result = item.transform()  # code is run and value is returned
    results.append(my_result)
# we have now evaluated `transform` on every sample

def get_answers():
    answers = 0
    for result in results:
        answers += result
        if answers > 10:
            break
    # we may have only used the first few results,
    # yet transformed thousands of items unnecessarily
    return answers
```
Using generators, we wait to run the transform method until we are sure we will need its result. Each result is evaluated only when it is actually consumed, as shown below.
```python
def get_results():
    for item in items:
        yield item.transform()  # this code is stored, not run

def get_answers():
    results = get_results()  # create the generator; nothing is evaluated yet
    answers = 0
    for result in results:
        answers += result  # transform is called here, for this one item
        if answers > 10:
            break  # we never transform another item
    return answers
```
In this case, if only a few items of a very long list need to be evaluated before the answers variable exceeds 10, a generator saves a great deal of computation. This lazy evaluation is more efficient than a list whenever the generator need not be exhausted, and no worse than a list when every item must be evaluated, so you may in general prefer generators.
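To see the savings concretely, here is a self-contained sketch in which transform is a cheap stand-in that counts its own calls (the counter is illustrative, not part of the article's code):

```python
class Item:
    calls = 0  # counts how many transform() calls actually execute

    def __init__(self, n: int) -> None:
        self.n = n

    def transform(self) -> float:
        Item.calls += 1
        return float(self.n)  # stand-in for big_long_calculation()

items = [Item(n) for n in range(10000)]

def get_results():
    for item in items:
        yield item.transform()

answers = 0.0
for result in get_results():
    answers += result
    if answers > 10:
        break  # the remaining items are never transformed

print(Item.calls)  # 6: only items 0..5 were transformed, not all 10000
```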