Snacker News

Scaling biology with techniques from computer science

August 1, 2017

Computation has become a central part of biology. These days, if you have biological questions, you are more than likely going to use computers, sequencing, and laboratory automation to answer them.

I don’t think that the marriage of biology and big data is an accident. Biology is chemistry at scale. And these days, what’s exciting about computing is computing at scale.

Biologists are providing the enormous data sets that train the newest generation of algorithms written by computer scientists. At the same time, biologists have something to learn from computer scientists.

Many biologists view the command line and programming languages of computer science as arcane. But that’s changing, both as powerful computing becomes much more friendly with tools like the Jupyter notebook, and as older scientists die (retire) and are replaced with younger scientists who are more familiar with computers.

Yet “more familiar” does not always imply “more advanced”. As computer interfaces become highly tuned to modes of natural human interaction (touch, spoken commands), their users lose habits of thought and record-keeping that are at odds with solving scientific problems. Many of the practices of computer science, such as version control and code testing, are valuable to biologists.

Data storage and sharing

The trend toward open access to biological data is clouded by one issue: a lot of biological data (about 30% overall) is data on humans, some of which carries enough identifying information to be traced back to the humans in question.

There is also an incredibly small subset of biological data that could be used to harm people (I am thinking of things like the sequence of the smallpox virus).

For the majority of data generated by biologists, which is free from both of these concerns, the trend toward making it available is positive. However, much can still be learned.

Version control

Git seemed a little silly when I first learned it. I had done a few programming projects and I had read that version control was what good developers used to keep track of changes to their code. So I tried it out myself, just keeping track of what I was working on, which was a molecular modeling protocol.

Turns out it’s an amazing tool for focusing your work. In one approach, branches of the repo are used to develop features, which are then merged back into the main branch (often called master). This keeps you focused (goal-oriented) when adding to or changing your scientific code.

But the main thing is committing your work. When you have to write a commit message every time you change code, you begin to think in commits, and to think about your code not just at the level of syntax, but at the level of structure, organization, and planning. And this helps you write better code.

Let the software shape you

Try to understand how the computer works on its terms, and you will adopt the mind of a computer scientist. This is really exemplified by learning a new programming language.

I started out programming in Python, and it took me really far. Python has an amazing combination of an excellent package ecosystem, a wide variety of stuff available, easy-to-learn syntax, and really fast implementations of the stuff that needs to be fast. All of this keeps you on Python.

Yet I am a curious person and I am always trying out new programming languages. The language that I use for work is C++, and through some reading online I came across Rust. Rust, like C++, has static types, and programs are compiled before use. But the authors of Rust have made decisions that favor the humans writing the code.

As I learned how to use Rust, I was sucked into the testing philosophy.
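The habit carries over to any language. Here is a minimal sketch in Python (not Rust) of what it looks like to keep small tests right next to a piece of scientific code. The function gc_content and the test names are made up for illustration; they are not from any particular library.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


# The tests live beside the code they check, one small claim per test.
def test_gc_content_counts_both_bases():
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("ATGC") == 0.5


def test_gc_content_ignores_case():
    assert gc_content("atgc") == gc_content("ATGC")


def test_gc_content_rejects_empty_input():
    try:
        gc_content("")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty sequence")


if __name__ == "__main__":
    # Run every test when the file is executed directly.
    test_gc_content_counts_both_bases()
    test_gc_content_ignores_case()
    test_gc_content_rejects_empty_input()
    print("all tests passed")
```

A test runner such as pytest would pick up the test_ functions on its own; the block at the bottom is only there so the file also runs without one.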

The shell

Can we please have a shell in every molecular modeling program? I am not joking. I would really rather memorize an arcane list of commands than deal with your GUI. Seriously. Read some Dieter Rams books.
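It does not take much to get started. Below is a rough sketch of a command shell built on the cmd module from Python's standard library. Everything about the "modeling program" here is hypothetical: the ModelShell class, the add and count commands, and the run_script helper are stand-ins I invented for illustration, not part of any real package.

```python
import cmd


class ModelShell(cmd.Cmd):
    """Tiny command shell for a hypothetical molecular modeling session."""

    intro = "Type help or ? to list commands."
    prompt = "(model) "

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.atoms = []    # (element, x, y, z) tuples
        self.history = []  # every command line, in the order it was given

    def precmd(self, line):
        # Record each command so a session can be reviewed or replayed.
        if line.strip():
            self.history.append(line)
        return line

    def do_add(self, arg):
        """add ELEMENT X Y Z: append an atom to the model."""
        parts = arg.split()
        if len(parts) != 4:
            print("usage: add ELEMENT X Y Z", file=self.stdout)
            return
        element, *coords = parts
        try:
            x, y, z = (float(c) for c in coords)
        except ValueError:
            print("coordinates must be numbers", file=self.stdout)
            return
        self.atoms.append((element, x, y, z))

    def do_count(self, arg):
        """count: print the number of atoms in the model."""
        print(len(self.atoms), file=self.stdout)

    def do_quit(self, arg):
        """quit: leave the shell."""
        return True

    def do_EOF(self, arg):
        """End of input also leaves the shell."""
        return True


def run_script(shell, lines):
    """Feed commands to the shell non-interactively, as a script would."""
    for line in lines:
        line = shell.precmd(line)
        stop = shell.postcmd(shell.onecmd(line), line)
        if stop:
            break
```

Calling ModelShell().cmdloop() starts an interactive session, and run_script(ModelShell(), ["add C 0 0 0", "count"]) drives the same commands from a list. The point of the sketch is that the command history is just text, so a modeling session becomes something you can save, rerun, and put under version control.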