December 20, 2017
It’s well known that the Unix shell allows for some amazing feats of brevity. But did you know it was possible to perform diffs on designed protein structures, transcribe DNA to RNA, reverse complement DNA sequences, and perform other computational biology feats from your command line? What follows is some fun with Unix in computational biology.
A common task in protein design is evaluating various designed structures that are output from a molecular modeling program. You could write a Perl program to do this, as the author of this script did. (The author in this case was a graduate student in computational biology.)
#! /usr/bin/perl
if(!$ARGV[0] or !$ARGV[1]){ die("Incorrect command line: ");}
my @pdb1 = @pdb2 = undef;
my $counter=0;
open( PDB, $ARGV[0]) or die("Could not open file $ARGV[0]!");
foreach $temp (){
if($temp =~ /^ATOM/){
if(substr($temp, 13, 2) =~ /N\s/){
$counter++;
$pdb1[$counter] = substr($temp, 17, 3).substr($temp, 23, 3);
#print "$pdb1[$counter] ";
}
}elsif($temp =~ /^\@ATOM/){
if(substr($temp, 14, 2) =~ /N\s/){
$counter++;
$pdb1[$counter] = substr($temp, 18, 3).substr($temp, 24, 3);
#print "$pdb1[$counter] ";
}
}
}
close (PDB);
$counter=0;
open( PDB, $ARGV[1]) or die("Could not open file $ARGV[0]!");
foreach $temp (){
if($temp =~ /^ATOM/){
if(substr($temp, 13, 2) =~ /N\s/){
$counter++;
$pdb2[1][$counter] = substr($temp, 17, 3).substr($temp, 23, 3);
#print "$pdb2[$counter] ";
}
}
}
close (PDB);
$mutation_counter=0;
$mutation = "select mutated, resi ";
$checker = "Filter: ";
print "PDB1\tPDB2\n";
for($x=1;$x<=$counter;$x++){
if($pdb1[$x] !~ /$pdb2[1][$x]/){print "$pdb1[$x]\t$pdb2[1][$x]\n"; $mutation=$mutation.$x."+"; $checker=$checker.$pdb1[$x].$pdb2[1][$x].","; $mutation_counter++;}
}
print substr($mutation,0,-1),"\n";
print "$checker\n$ARGV[1]\n";
print "Total Mutations: $mutation_counter\n";
exit(0);
I am actually not 100% familiar with the vagaries of Perl syntax (by choice), but I can begin to see the outline of what is happening here. The author is looping over the lines in a file, for each line matching a pattern, and then adding a processed version of the line to a associative array.
Since there are two files involved is this process, it’s not very much of a stretch to think that, when it comes to the second file, we are looping over the lines, matching them with a pattern, and then comparing the value to the corresponding value in the associative array.
Once you think about the program in terms of what it needs to do, it begins to be clear that you can Bash this one. Is it possible to get a list of the differences between the two structures using a single line of awk
?
awk '/^AT.*CA/ && NR==FNR {a[$6]=$4;next} /^AT.*CA/ && $4!=a[$6] {print a[$6] $6 $4 }' file1.txt file2.txt
It’s not exactly one line, but it’s close. Let’s break this down a bit by expanding the notation and putting in a few comments. There are two Awk statements, as you can see below, each combining two statements with the &&
(and) operator.
/^AT.*CA/ && NR==FNR {
a[$6] = $4
# e.g., a[11] = 'ARG'
next
# also evaluate the next statement
}
The first statement evaluates to true if both of the following are true: 1) the line starts with ‘AT’ and contains ‘CA’, and 2) we are reading in the first file of the two files passed to the script (NR==FNR
is true only in the first file). This, at least for valid PDB files with proteins in them,
will pull out the alpha carbons of the protein structure, which is a proxy for the sequence, since there is one alpha carbon per residue. After Awk has split the PDB line on whitespace, $6
contains the residue number and $4
contains the residue three letter code.
The assignment a[$6] = $4
creates the associative array a
implicitly according to Awk’s syntax. For example, if the residue number is 11
, and the residue code is ARG
, then Awk will set the value at the index 11
in the array a
to ARG
, silently converting the string “11” into the integer 11
. The function then continues to the next statement.
/^AT.*CA/ && $4!=a[$6] {
print a[$6] $6 $4
# prints 'ARG' '11' 'LYS'
}
The next statement is true if both of the following are true: 1) the line starts with ‘AT’ and contains ‘CA’, and 2) the three letter code at this position is not the same as the one stored in the associative array we built from the first PDB file. In this case, the mutation is printed in the standard biochemical way.
You may find yourself writing a short Python script to find the reverse complement of a DNA sequence, as the author of this script did. The author here was a high school student learning to code.
def complement(s):
basecomplement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
letters = list(s)
letters = [basecomplement[base] for base in letters]
return ''.join(letters)
def revcom(s):
complement(s[::-1])
return complement(s[::-1])
revcom('ACTAGTCGTAGTCAATTAGTCAGTCAAGAAATAA')
This is a fine solution, though there are a few novice programming mistakes, such as the creation of the letters
object, which is unnecessary because Python str
is an iterable. The call to complement
in the first line of the revcom
function is unnecessary, and the names could be better. The code can be greatly condensed, as in the example below:
def reverse_complement(sequence):
return ''.join([dict(zip('ATCG', 'TAGC'))[n] for n in sequence][::-1])
# the slice [::-1] reverses a list in Python
reverse_complement('ATCTACTCG')
But as long as we’re condensing, why not Bash it? The Unix commands tr
(transliterate) and rev
(reverse) will happily perform this transformation for us with a lot less code.
echo ACTAGTCGTAGTCAATTAGTCAGTCAAGAAATAA | tr ATCG TAGC | rev
I know that this has been a rather silly post. However, I want to point out that the level of implementation for a given task is of interest to computer scientists. Take, for example, the case of two eminent computer scientists (Donald Knuth and Doug McIlroy) showing two very different ways of accomplishing the same task (printing a sorted list of the most commonly used words in a text file) in this paper.