Building Analysis Pipelines
Parts of this lesson borrow heavily from “Bioinformatics Data Skills” by Vince Buffalo. You can buy it here: http://shop.oreilly.com/product/0636920030157.do And you should because it’s awesome
The python sys
module
The sys
module interacts with the python interpreter. This module primarily comes with certain variables (rather than functions) which are particularly useful, two of which are described below.
Passing command-line arguments sys.argv
Often, you’ll want to pass input arguments to a script. All input arguments are stored in the variable sys.argv
(note that you must import the sys
module!). For example, you might have a script called calc_dna.py
which performs certain calculations on a sequence data file, but each time you run the script, you might want to process a different file. One option to pass the file name as an input argument: python calc_dna.py inputfile.fasta
, where inputfile.fasta
is the single command-line argument. If you’ve loaded the sys
module, all command line arguments will be stored in sys.argv
:
Assume the following code is included in the script calc_dna.py.
From the command line you run python calc_dna.py inputfile.fasta
and get back:
['calc_dna.py', 'inputfile.fasta']
Notice that the first entry in sys.argv
is the name of the script. After this come all command line arguments! In addition, all sys.argv
entries will be strings. So remember that if you want to use an input argument as a number, you must convert it to a float or integer.
Generally, you should save the input arguments to a new variable inside the script:
Error checking is often very useful here; you might have a script which requires two command line arguments. For instance, let’s say you have a script (which does something..) which takes a file name and a number as its two arguments. To ensure that this always happens, help yourself out with assertion statements:
Editing the path with sys.path
You can view everything in python’s path by printing the contents of the list sys.path
. This variable will you tell which directories on your computer that the current python interpretter is able to access. To edit this path in place, for instance by adding a directory to the path, simply use .append()
:
## The python os
and shutil
modules
The os
and shutil
modules are useful for interacting with your computer’s operating system (typically UNIX). With these modules, you can run commands from your python script which are analogous to UNIX commands like cd
and pwd
.
Some examples:
Module | Command | Description | Unix equivalent | Example |
---|---|---|---|---|
os |
os.listdir |
List all items in a given directory | ls |
os.listdir("/directory/of/interest/") |
os |
os.remove |
Remove a file | rm |
os.remove("i_hate_this_file.txt") |
os |
os.rmdir |
Remove a directory | rm -r |
os.rmdir("/i/hate/this/directory/") |
os |
os.mkdir |
Create a new directory | mkdir |
os.mkdir("/path/to/brand/new/directory/") |
os |
os.mkdirs |
Create many new directories | mkdir |
os.mkdir("/path/to/a/brand/new/directory/", "/path/to/another/brand/new/directory/") |
os |
os.chdir |
Change directory where python is running | cd |
os.chdir("/another/directory/where/i/want/to/be/") |
shutil |
shutil.copy |
Copy a file | cp |
shutil.copy("old_file.txt", "new_file.txt") |
shutil |
shutil.move |
Move a file | mv |
shutil.move("old_file.txt", "new_file.txt") |
Running external commands with os
You will often want to use Python scripting to automate analyses which use external programs or softwares. You can actually call these programs directly from your python script using the function os.system()
. This function takes a single argument: the command you want to run (as a string). Anything that you could type into the command line can be given to os.system
!
We can also incorporate sys
to provide the input/output file names as arguments!
Finally, you can check that the command has run properly by saving the output of os.system()
(basically, save it to a variable). In UNIX, a returned value of 1 means an error occurred, but a returned value of 0 means everything went fine. Therefore, we want to make sure that os.system()
returns a value of 0, by editing the last few lines:
Python is pretty great, if you’re into that kind of thing. We just talked about how to interface with bash from python. Now lets talk about some of the amazing things you can do with bash.
Part 2: useful UNIX/Bash one-liners: pipes, sed
, awk
, and more!
Why is Unix the language of bioinformatics? Read about the Unix philosophy here: https://en.wikipedia.org/wiki/Unix_philosophy.
Do one thing and do it Well.
Bash is great for getting a quick look at your data and answering questions easily. While you can do all of these things in python with what you’ve already learned, it’s often far easier and quicker to use the huge ecosystem of Unix tools already available and heavily optimized (albeit quirky).
We’re going to look at a lot of commands, so remember, you can always use man
to look up the specifics of bash commands. Try man ls
to look at all the options for ls
. Pressing q
will return you to your terminal.
Lets review some basic bash commands:
We can use head
to quickly look at the first lines in a file
head Mus_musculus.GRCm38.75_chr1.bed
You can change the number of lines returned with the -n
argument.
tail
is just like head
but looks at the end of a file
tail Mus_musculus.GRCm38.75_chr1.bed
history
shows you your recent shell commands
history
and grep
is a powerful way to search files.
But Stephanie will cover grep
in GREAT detail later, so we’ll mostly just think of it as “find” for now.
history | grep tail
Welcome to the new world. The new world of |
or “pipes”
Piping allows you to redirect the output of one unix command to be the input of another command. This is a really powerful idea and critical to effectively using Unix. We’ll come back to it.
less
is a great way to inspect files.
Less actually starts a program that allows you to view text at the console. You won’t be able to edit it. To quit less
just press q
. space
takes you to the next page, and b
goes back.
less contaminated.fasta
wc
outputs the number of words, lines, and characters in the supplied file, or files
The argument -l
restricts this to just lines, what we usually care about.
wc Mus_musculus.GRCm38.75_chr1.bed
wc Mus_musculus.GRCm38.75_chr1.bed contaminated.fasta
So of course, you can imagine using this to answer some questions.
cut
pulls out specific columns (“fields”) from files
It uses tab as the separator by default
cut -f 2 Mus_musculus.GRCm38.75_chr1.bed | head -n 3
The -f
argument specifies which column to keep
column
then makes that output pretty to look at in the console
cat Mus_musculus.GRCm38.75_chr1.gtf | column -t
Only use for looking at data in the terminal. If you output column to a file the variable number of spaces is a mess for other programs to parse.
Now lets get to some Good Stuff
sort
will, of course, “sort” plain text data. But its syntax can be confusing, so get familiar with the man page.
cat example.bed
sort example.bed
sort -k1,1 -k2,2n example.bed
Command | Meaning | Example |
---|---|---|
-b | ignore leading blanks | sort -b filename > filename.sorted |
-r | reverse | sort -r filename > filename.sortedr |
-k POS1 | sort by field/character indicated by POS1 | sort by field 2: sort -k 2 filename sort by second character in field 2: sort -k 2.2 filename |
-k POS1,POS2 | sort based on the characters from POS1 to POS2 | sort by characters in fields 2 and 3: sort -k 2,3 filename sort starting with second character in field 2 up to and including field 3: sort -k 2.2,3 filename |
uniq
takes a file or standard input (from a pipe) and removes consecutive duplicates. It’s very simple, but you’ll use it a lot.
cat letters.txt
uniq letters.txt
sort letters.txt | uniq
sort letters.txt | uniq -c
Command | Meaning | Example |
---|---|---|
-c | prefixes lines with the number of times they occur | uniq -c filename |
-d | prints only repeated lines | uniq -d filename |
-u | prints only unique lines | uniq -u filename > filename.unique |
-f N | skips N number of lines | uniq -f 30 filename |
sort | uniq
and sort | uniq -c
are very commonly used in bioinformatics. Combined with other unix tools like grep
and cut
they are outrageously powerful.
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | cut -f3 | sort | uniq -c
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | cut -f3 | sort | uniq -c | sort -rn
awk
is a programming language onto itself. We’ll keep it (pretty) simple, though, and show how simple awk
“one-liners” can be used as integral components of analysis pipelines.
awk
statements follow the syntax of pattern { action }
. They can be very simple:
awk `{print $0}` example.bed
awk
is designed to process columnar data (data separated by fields). $1
represents the first field, $2
the second, etc.
awk '$3 - $2 > 18' example.bed
# example using HW csv file
# print columns 2 and 3
cat WEEK_06_python5_HW.csv | awk -F, '{ print ($2,$3) }'
# print columns 2 and 3, add a `tab` between the items
cat WEEK_06_python5_HW.csv | awk -F, '{ print ($2"\t"$3) }'
# print columns 1 and 2, adding in a `tab` between, a new field "newline", and a new line at the end
cat WEEK_06_python5_HW.csv | awk -F, '{print ($1"\t"$2 "\tnewline\n")}'
# more complicated, a for loop that prints each item
cat WEEK_06_python5_HW.csv | awk -F, '{for (i=1;i<=4;i++) {print ($i)}}'
# for each line, from items 1 until 4: `for (i=1;i<=4;i++)`
# print each item: `{print $i}`
Last but not least, lets talk about sed
. The Stream editor. Sed reads data from a file or standard input and can edit it a line at a time.
This is mostly based on regular expression (which Stephanie will explain further), so again just think of it as a simple Find and Replace.
head -n 3 chroms.txt
sed 's/chrom/chr/' chroms.txt | head -n 3
- General pseudocode:
sed [-E] command/regex/replacement/optionalflag filename > newfile
- My favorite pseudocode:
sed -E s/OLD/NEW/
- Mac users must include
-E
to access regular expressions sed
does not understand\t
and\n
, see below
# replace first instance of XX with YY for each line
sed s/XX/YY/ filename > newfile
# replace all instances of XX with YY
# - `g` flag means 'global' and searches for all instances of the pattern
sed s/XX/YY/g filename > newfile
# replace all instances of XX with YY and of AA with ZZ
# - `-e` flag lets you execute multiple sed commands at once
sed -e s/XX/YY/g -e s/AA/ZZ/g filename > newfile
# keep only letter and space characters ([a-zA-Z' ']*) that come before a different type of character in each line
# - must escape all () using `\`, ie: \([regex]\)
# - must put whitespace and replacement ('\1\2') in quotes, or else it is interpreted as a separate command
sed -E s/\([a-zA-Z' ']*\)\(.*\)/'\1'/ example
- To insert tabs (
\t
) you’ll have to hitctrl + v
, thenTab
while in the terminal environment - For newline characters (
\n
), you have to code it directly into the line with\ + enter
, for example:
sed -E s/\([0-9]\)/'\1\
'/ examplefile.txt
Reminder: Unix and pipes are awesome. There are computationally efficient and easy to debug. They deserve to be a part of your workflow. But what about when need to do something more? A lot of things to a lot of files… lets talk about scripting in bash.
Part 3: Bash Scripts for Pipelines
Calling all of your shell scripts from Python isn’t really efficient. There’s a fair bit of computational overhead when you ‘call’ the shell from within python. If you have a python analysis script you want to run on a lot of files, it’s better to write a script in bash than to wrap it all in python.
A bash script is really just a list of commands. By convention bash scripts end in .sh
and can be run by typing:
sh my_first_script.sh
Okay take a look at head my_first_script
. How do you want to take a look? What do you see? A similar shebang to a python script, and then some parameters. After the parameters, each line is a separate bash command.
Now lets look at using command line arguments in bash. Try running the second script.
sh my_second_script.sh
#you can also run scripts like this, but you have to set them to be executable
chmod u+x my_second_script.sh
./my_second_script.sh firstarg secondarg thirdarg
So command line arguments are pretty easy. What else is nice about bash scripts?
sh my_third_script.sh
sh my_third_script.sh 15
Now take a look at that script. You can do conditionals and for loops in bash! You can also do while and until loops, and complex conditional statements. The syntax is weird, though (as in -ne instead of python’s !=), so pay attention.
Okay lets get a little bit more real. Lets say you have a pipeline of 10 commands you need to run on a bunch of files. Instead of going command by command and file by file, you can wrap all the commands into a single bash script and call that for each file. Look at this:
sh a_realer_script.sh contaminated.fastq
sh a_realer_script.sh chroms.txt
You can of course write new files inside bash scripts using >
or >>
and even clean up intermediate files with rm
.
Why do we care? It makes it much easier to make your analysis reproducible and consistent- by using scripts and pipelines, you can ensure you’re treating your data in the same way each time. And, mostly importantly: It’s easier!
There are many POWERFUL ways to expand your bash scripting- globbing, find
, xargs
, file arrays, and more! Play around with it. You’ll be glad you did!