Python BLAST Tutorial

Overview

In this tutorial, you will automate BLAST queries with Python. You will learn how to run BLAST locally, multiple times, and how to read BLAST results with Python. In the process, you will build a program pipeline, a concept useful in many biological analyses independent of BLAST.

The tutorial consists of six parts:

Preparations
Running local BLAST manually
Running local BLAST from Python
Running BLAST many times with Python
Reading BLAST output with Biopython
Plotting the results

Case Study: Plasmodium falciparum

We have the hypothesis that Plasmodium falciparum has adapted to the human organism during its long history as a parasite. Specifically, we want to examine whether proteins from Plasmodium are more similar to human proteins than one would expect. If this is true, we could interpret e.g. that more similar proteins help Plasmodium to evade the human immune system.

As a small sample study, we will BLAST a set of peptides from a few Homo sapiens proteins against the proteome of Plasmodium falciparum. As a control, we will use the proteome of Schizosaccharomyces pombe.

1. Preparations

1.1 Check whether BLAST+ is properly installed

Enter the two following commands in a Linux console:

makeblastdb
blastp

Both should result in an error message other than command not found.

1.2 Create a BLAST database for Plasmodium falciparum

Create a BLAST database for the Plasmodium proteins. First, open a console and go to the folder data/ . Type:

makeblastdb -in Plasmodium_falciparum.fasta -dbtype prot

You should see a message similar to:

Adding sequences from FASTA; added 5414 sequences in 0.56993 seconds.

1.3 Create a BLAST database for the control organism

Please create a BLAST database for Schizosaccharomyces pombe as well.

1.4 Questions

What files have appeared in the data/ directory?
Why do we need to create a database first? Why can't BLAST do that right before each query?

2. Running local BLAST manually

Before running a large series of BLAST experiments, we will run a small sequence as a technical proof of concept. We are using a sequence copied from the Plasmodium sequences, so we know that BLAST should generate a 100% match.

2.1 Create a query file

Create an empty file query.seq in a text editor. Write the following peptide sequence into the file:

DAAITAALNANAVK

Make sure that there are no other characters in the file (no empty lines or FASTA deflines). Save the file to the data/ directory.

2.2 Running local BLAST against Plasmodium

Go to a console in the data/ directory and type:

blastp -query query.seq -db Plasmodium_falciparum.fasta -out output.txt -outfmt 7

2.3 Running local BLAST against the control group

Repeat the above query for Schizosaccharomyces pombe.

2.4 Adjust output formats

Insert different numbers (1-7) for the outfmt parameter and re-run the query.

2.5 Questions

Take a look at the BLAST output. Is the result what you would expect?
Does the control group support your assumptions so far?
Which of the output formats do you find the easiest to read?
Which of the output formats is probably the easiest to read for a program?

3. Running local BLAST from Python

Now we are going to do exactly the same operation from a Python program. For this we will need the os module.

3.1 Introduction to the os module

Open the document pipelines/os_module_puzzle.pdf. Do the exercise.

3.2 Running BLAST from Python

Now we will use the function os.system to run BLAST. Create a Python script run_blast.py in the data/ directory. Write the following commands into it:

import os

cmd = "blastp -query query.seq -db Plasmodium_falciparum.fasta -out output.txt -outfmt 7"
os.system(cmd)

Execute the program.

3.3 Customizing the query

In order to make the BLAST command in Python more flexible, we will combine it from variables. Change the code to the following:

db = "Plasmodium_falciparum.fasta"
cmd = "blastp -query query.seq -db " + db + " -out output.txt -outfmt 7"

3.4 More variables

Now add separate variables for the query and the output file name as well.

3.5 Additional examples for using os

In the pipelines/ directory you find more examples using the os module. If you like, try them out as well.

3.6 Questions

Is the output of the Python BLAST run identical to the one you did manually? How can you check that?

4. Running BLAST many times with Python

4.1 Creating query files

The file data/human_peptide.fasta contains about 2000 peptides. We want to run BLAST for each of them. To do so, we need to write each peptide to a separate file.

First, create a new folder for the query files:

mkdir data/queries

The Python script multiblast/split_fasta.py does that using Bio.SeqIO. You can use it by typing in the multiblast/ directory:

python split_fasta.py ../data/human_peptide.fasta ../data/queries

If you want, you can try writing that script by yourself.

4.2 Validate the queries

Make sure that the query files have been generated and that they are not empty. You can check both with:

ls -l data/queries
more data/queries/9568103_99.fasta

4.3 Create output directories

Prepare a place where the results from each BLAST run will be stored:

mkdir data/Plasmodium_out
mkdir data/Pombe_out

4.4 Run BLAST

You can run BLAST for all queries with the script multiblast/run_blast.py. It uses os for three different things:

Reading directory names as command-line parameters
Looping through all files in a directory
Running the BLAST command

However, the program is incomplete.

You need to complete the BLAST command inserting the file names from the given variables. Use the parameter -outfmt 5 in order to create XML output. We will need this later to read it from Biopython.

When everything is done, you should be able to execute the script with:

python run_blast.py ../data/queries/ ../data/Plasmodium_falciparum.fasta ../data/Plasmodium_out/

Inspect the result.

5. Reading BLAST output with Biopython

In this exercise, we will evaluate the results of multiple BLAST runs. To save time compared to a manual evaluation on many files, we will write a Python script to identify the best hits. For that, we need BLAST output in the XML format. You can obtain XML output by adding the -outfmt 5 option.

5.1 Reading XML data

XML is a structured format that is easy for computers to parse. Biopython offers a parser specific for the BLAST output which reads an output file into a neat data structure.

Run the program BLAST_XML/parse_blast_xml.py.

python parse_blast_xml.py

5.2 Read one of your BLAST result files

Adjust the program to read one of your BLAST output files. Try to figure out how many HSPs there are, and how many are below an e-value threshold of 0.001.

5.3 Read all of your BLAST result files

Customize the program to read all of your result files. How many hits do you have in total. What is the hit with the highest score?

References

BLAST+ is a new, faster (C++ based) version that replaces BLAST2, as of Oct 2013. Also see: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download

BLAST