Snacker News

How to become a computational biologist

August 1, 2017

Computation has become a central part of biology. These days, if you have biological questions, you are more than likely going to use computers and laboratory automation to answer them.

What’s exciting about computational biology is exactly what’s exciting about computers today: computing at scale. Biology at scale is different from the biology of fifty years ago. Today, biologists sequence whole communities of organisms via deep sequencing, predict protein structures with computational algorithms, and routinely run experiments on thousands of samples with the help of robotics. What follows is advice for those who seek to use computational tools to ask biological questions.

Learn the Unix shell

The shell, otherwise known as the terminal or the command line, is itself just a computer program. Yet learning to compose detailed commands in it is a great introduction to the philosophy of Unix computing. The vast majority of us learn Bash, the Bourne Again Shell. How well you know Bash is closely tied to how you feel about using the command line: the more you learn, the easier everything else will be.

First things first: you will need a computer that runs some flavor of Unix. The two platforms I am most comfortable with are macOS and GNU/Linux, but there are plenty of others out there. Windows is not recommended as a platform for scientific computing.

You should immediately become familiar with using the shell (Bash) for most of your work, even if you could use other programs to do it. The shell, or command line interface, is a powerful, flexible, and precise way of interacting with your computer. Start with Matt Might’s excellent introduction to Unix, which applies to macOS and GNU/Linux. Then read his guide to settling in.

The important thing to understand is that Bash, the shell, is commonly used on a lot of computers to do the everyday tasks of computing: getting around, dealing with files, finding things, installing software, and of course, running all of the domain-specific software we use. The more you use it this way, the more familiar it will be, and the easier it will be to learn the rest.

Learn one of the Unix text editors

I think you should use whatever computing environment makes you happiest. I personally love the text-based interface of the shell for developing code, while others prefer a graphical editor or integrated development environment such as PyCharm, Visual Studio, Xcode, Atom, Sublime Text, or one of the many others that I have not used personally.

That said: learn to use either Vim or Emacs. These two are the venerable Unix text editors, and you will be using them a lot if you do computational biology. Which one you choose is entirely up to you. But learning one of them is a rite of passage and adds a valuable tool to your kit.

If you’re interested: my favorite text editor is vim

My text editor of choice is vim. If you are interested in learning how to use it, the basics take about half an hour (and mastering it takes the rest of your life). Run vimtutor to learn the basics (it is installed with vim and available from your Bash shell). Then read the Stack Overflow answer “Your problem with Vim is that you don’t grok vi”, which explains vim’s verb-noun philosophy. After you’ve used it for a while, you will love these vim koans.

Learn Python and use Jupyter notebooks

The Python language is ubiquitous in computational biology. Learn it. In case you’re wondering which version of Python to learn (2 or 3), learn Python 3. The reasons for this are in another post.

Jupyter notebooks are interactive documents that run in the web browser and combine live code, output, LaTeX markup, and structured text. They are a mostly-perfect combination of text (full formatting plus real LaTeX rendering), code (in many languages, through pluggable kernels), and data in the form of charts and other visualizations generated inline. Jupyter notebooks are invaluable for data science tasks and for exploring data.

They make it easier to discover new functions in software libraries by binding Tab to autocomplete, and they provide several convenient ways to read code documentation inline with the code you’re writing (e.g. pressing Shift-Tab while editing a function call brings up a floating window with the documentation for that function).

Learn why it’s not true that “Python is slow”

You may hear misinformed people say that Python is slow. This is a surprisingly common attitude, and it can be particularly harmful to novice biologists who are struggling to learn computation in the first place and who balk at having to learn yet another thing (a “fast” language, which usually just means a “compiled” one) before they can get any work done.

Now that the falsehood has gotten halfway around the world, let’s let the truth get its pants on. The truth is that the parts of Python where speed matters, such as the numerical computing library numpy, are implemented as extensions to Python that run at speeds comparable to code written in compiled languages like Rust and C (in numpy’s case, because it is implemented in C).
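
Here is a minimal sketch of that idea. To keep it runnable without installing anything, it uses only the standard library rather than numpy: in CPython, the built-in sum is itself implemented in C, so it hands the loop to compiled code in the same spirit as numpy does for whole arrays. The function names and the list size are made up for illustration, and the timings will vary from machine to machine.

```python
import timeit


def total_with_loop(values):
    """Add the numbers up one at a time in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total


def total_with_builtin(values):
    """Same result, but the loop runs inside CPython's compiled C code."""
    return sum(values)


def compare(n=1_000_000, repeats=3):
    """Return (loop_seconds, builtin_seconds) for summing n integers."""
    values = list(range(n))
    loop_s = timeit.timeit(lambda: total_with_loop(values), number=repeats)
    builtin_s = timeit.timeit(lambda: total_with_builtin(values), number=repeats)
    return loop_s, builtin_s


if __name__ == "__main__":
    loop_s, builtin_s = compare()
    print(f"Python loop:  {loop_s:.3f} s")
    print(f"built-in sum: {builtin_s:.3f} s")
```

Both functions return the same answer; the only difference is where the loop runs. On a typical CPython installation the built-in version should finish noticeably sooner.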

Thus, Python combines the best of both worlds (from a biologist’s point of view). It is easy to learn and write, easy to share, and fast enough for compute-intensive scientific calculations.

But learn a compiled language like Rust anyway

When you have mastered Python, you should learn a compiled, statically typed language like Rust. It may sometimes be useful to write code this way for performance, but it is certainly useful to learn a compiled language as a mental exercise. Learning to think about types, the heap and the stack, data structures, and algorithmic complexity (big-O notation) is a powerful skill for a computational biologist.
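
Thinking about algorithmic complexity does not depend on the language, so here is a small sketch in Python. The function names and gene identifiers are hypothetical. Both functions find the identifiers that two lists share, but the first scans the second list once for every element of the first, while the second builds a set so that each lookup takes roughly constant time.

```python
def shared_ids_quadratic(a, b):
    """O(len(a) * len(b)): each 'x in b' scans the whole list b."""
    return [x for x in a if x in b]


def shared_ids_linear(a, b):
    """O(len(a) + len(b)) on average: set lookups take constant time."""
    b_set = set(b)
    return [x for x in a if x in b_set]


# Hypothetical gene identifiers, for illustration only.
sample_a = ["geneA", "geneB", "geneC", "geneD"]
sample_b = ["geneD", "geneB", "geneX"]

print(shared_ids_quadratic(sample_a, sample_b))  # ['geneB', 'geneD']
print(shared_ids_linear(sample_a, sample_b))     # ['geneB', 'geneD']
```

With four identifiers the difference is invisible. With two lists of a few hundred thousand identifiers each, the quadratic version can take minutes while the linear one still finishes almost instantly.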

Alongside this, you may find it interesting to examine some of the algorithms behind commonly used computational biology tools. Read about why computer scientists are so obsessed with multiplying matrices. Read a beginner’s guide to linkers to understand why it’s always so hard to get scientific software to compile.

Learn about basic software practices like unit testing and version control

Software development practices help programmers write code that is clear, accessible to others, and correct. You will become a much better computational biologist by trying your hand at building software using the tools of professional software developers.
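
To give a flavor of unit testing, here is a minimal sketch that uses Python’s built-in unittest module. The gc_content function and its test cases are hypothetical examples, not taken from any particular tool. The idea is to write down what the code should do for a few small inputs, so that a later change that breaks it is caught right away.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)


class GCContentTest(unittest.TestCase):
    def test_half_gc(self):
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_lowercase_is_accepted(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a file named, say, test_gc.py, the tests can be run from the shell with python -m unittest test_gc.py.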

In conclusion: try to take the good

Don’t be afraid of adopting some computer science practices like code maintainability and version control. Help your fellow biologists do the same. The use of these tools can help us be better biologists, and thus we’ll learn more about the world.