First Steps in Biopython
- Start a Python console
1. Load a sequence file
Load the FASTA file
ap006852.fasta into Biopython.
from Bio import SeqIO records = list(SeqIO.parse("ap006852.fasta", "fasta")) dna = records print(dna.name) print(dna.description) print(dna.seq[:100])
Check whether the following statements are
- The command
print(len(dna))displays the length of the sequence.
recordsresults in a different sequence record.
dna.seq[50:100]displays a different portion of the sequence.
2. Manipulating a sequence
from Bio.Seq import Seq dna = Seq("ACGTTGCAC") print(dna) result = dna.__________() print(result) if result == 'GTGCAACGT': print('OK')
Which of the following needs to be inserted to obtain
3. Calculating GC-content
Look up Section 3.2 of the Biopython documentation on (http://biopython.org/DIST/docs/tutorial/Tutorial.html) to find out how to calculate the GC-content of a sequence.
What is the GC-content of the sequence loaded in task 1?
4. Print annotation of a GenBank file
Load the GenBank file
ap006852.gbk. In contrast to a FastA file, this one contains not only the sequence, but a rich set of annotations. Load the file as follows:
records = list(SeqIO.parse("ap006852.gbk", "genbank")) dna = records
Answer the following questions:
4.1 Which command gives the species the sequence is from?
4.2 Which command produces the bigger ID number (GenBank ID or PubMed ID)
4.3 How can you view a list of available annotation fields?
5. Count atoms in a PDB structure
The following code reads the 3D structure of a tRNA molecule from the file
1ehz.pdb and counts the number of atoms.
There is a bug in the program. Execute the program. Identify the problem and fix it.
from Bio import PDB parser = PDB.PDBParser() struc = parser.get_structure("tRNA", "1ehz.pdb") n_atoms = 0 for model in struc: for chain in model: for residue in chain: for atom in residue: print(residue.resname, residue.id) n_atoms += 1 print(n_atoms)
6. Retrieve entries from NCBI databases
Use the following code to download identifiers (with the
esearch web app) and protein sequences for these identifiers (with the
efetch web app) from the NCBI databases.
The order of lines got messed up! Please sort the lines to make the code work.
identifiers = records['IdList'] from Bio import Entrez records = handle.read() handle = Entrez.efetch(db="protein", id=identifiers, rettype="fasta", retmode="text") open("globins.fasta", "w").write(records) searchresult = Entrez.esearch(db="protein", term="hemoglobin", retmax=5) records = Entrez.read(searchresult) Entrez.email = "[email protected]"