Biopython is a tour-de-force Python library which contains a variety of modules for analyzing and manipulating biological data in Python. While this library has lots of functionality, it is primarily useful for dealing with sequence data and querying online databases (such as NCBI or UniProt) to obtain information about sequences.
This online tutorial describes nearly all the capabilities (with examples!) of things to do with Biopython (ctrl+F is your friend here!).
Download and install
You may need to download and install Biopython on your system. Instructions for this step are available here.
Reading and parsing sequence files
Biopython is an ideal tool for reading and writing sequence data. Biopython has two main modules for this purpose: SeqIO (sequence input-output) and AlignIO (alignment input-output). Each of these modules has two primary (although there are others!) functions: read and parse. The read function will read in a file with a single sequence or alignment, and the parse function will read in a file with multiple sequences or multiple sequence alignments. Each function takes two arguments: the file name and the sequence format (e.g. fasta, phylip, etc.)
Biopython parses sequence files into Biopython data structures, known as SeqRecord objects. Each sequence has several attributes (which you can examine with dir()), but the most important ones are .seq, .id, and .description.
Manipulating Biopython objects
Working with sequence files can be a bit tricky since all sequences are really Biopython SeqRecord objects. Most sequence processing is done with the sequences converted to strings, but to interface with Biopython, we need to convert sequences to SeqRecord objects.
Here are some useful methods that you can use on Seq objects. Note that all of these methods return other Seq objects which can be re-cast to strings with str(), if you need.
Writing sequences to file
Biopython supports sequence records to file in a format that you can specify:
Note that, in many circumstances, it is easiest to write sequences to file using regular file writing, in particular if you want to save in FASTA format. This will require looping over a list (or dictionary!) which contains the sequences to write.
Converting file formats
Biopython makes it very straight-forward to convert between sequence file formats - simply read in a file and write it out in the new format, or use the handy .convert() method. Again, these methods work with both AlignIO() and SeqIO().
Interacting with sequence alignments
Here are some useful methods for manipulating sequence alignments:
Creating new sequence alignments is a bit tricky - you need to create multiple SeqRecord objects and merge them together into a MultipleSequenceAlignment object. We won’t do that here, but you can look it up in the Biopython tutorial.
Querying online databases
Biopython has excellent built-in tools for collecting and parsing data from NCBI and SwissProt/ExPASy. Here, we’ll discuss NCBI querying and parsing. The Biopython online tutorial contains more information about querying and parsing SwissProt information.
Querying NCBI databases
Any (yes, any!) NCBI database can be queried with the Entrez module in Biopython, and you can parse the downloaded data using the SeqIO module. As an example, the code below obtains a record for the human hemoglobin alpha subunit gene with ID “NP_000549.1” (note that the GI ID will also work!) from the protein database:
Once the record has been parsed by SeqIO.read(), we can extract various pieces of information:
More of the information stored in the .features attribute, which is a fairly complicated structure. Each feature contains two important attributes: type which indicates the feature we’re dealing with, and qualifiers, a dictionary of information. To examine this information, you will need to loop over its keys:
To access a particular feature, such as “CDS” (stands for “CoDing Sequence”), use an if statement:
The above CDS feature information tells us that the coding sequence record in NCBI for this gene is NM_000558.4, from positions 67-495. It also provides cross-reference information for searching other databases and gene name(s).
Practice Exercises
Using the included pb2.fasta alignment file, create a new alignment file (also in FASTA format) containing the sequences converted into their reverse complements. For this exercise, try using a dictionary structure to loop over the data. Also, you may find the Biopython .reverse_complement() helpful! Try saving the file and/or converting the resulting file to a different alignment format, such as phylip or Stockholm (see here for available alignment formats in Biopython).
In the pb2.fasta file, you’ll notice that some of the IDs begin with "Hu_" and others begin with "Av_". These indicators tell you whether or not the pb2 sequence is from an influenza strain that infects humans (Hu) or avians (Av). For this exercise, determine the difference in average GC content between human-infecting and avian-infecting sequences. For this exercise, you will need to use the string method .startswith("XXX") to determine if a given sequence ID is human- or avian-infecting. This method returns True or False depending if a given sting starts with the provided argument. For example: