
Category | Programming |
---|---|
Subject | Python |
Difficulty | Undergraduate |
Status | Solved |
More Info | Python Assignment |
Assignment Description
LIFE733 Assignment 3 BioPython team working exercise
Deadline: 4pm 30/04/2018
Hand-in and assessment notes
Grade will come from the following 4 components:
- Overall team score for quality of total analysis pipeline and team documentation (40%)
- Individual score for code written by you as an individual (every Python class MUST have a comment at the top indicating your name) and documentation showing how you tested your parts (60%)
Specifications
- Each team contains 3 or 4 members
- A team of 3 MUST have:
o One team leader
o One BLAST developer
o One Multiple sequence alignment (MSA) developer
- A team of 4 MUST have:
o One team leader
o One BLAST developer
o One Multiple sequence alignment developer
o One Phylogenetics developer and literature mining developer
You have to construct a bioinformatics pipeline using BioPython
The pipeline will take as input any of the following (fewer than 10 items in each case):
- A set of protein sequences in FASTA format
- A set of gene names + species name to look up
- A set of protein identifiers
The pipeline will perform the following steps:
- If protein identifiers are provided as input, it will retrieve these records from the provided FASTA file (uniprot-apicomplexa.fasta), reporting back to the user any records that could not be located
- If a pair of gene name and species name is provided as input, it will attempt to retrieve these pairs from the FASTA file, or report an error if they cannot be found
- It will extract these records to a temporary file (or use the protein sequences if provided in FASTA format) and perform a BLAST search against uniprot-apicomplexa.fasta
- For each input protein sequence, the BLAST step should output those proteins passing a user-selected threshold (e.g. e-value < 1e-10) to a new FASTA file, as well as producing some plots or graphics in suitably named png files
- The FASTA files from the BLAST step are passed to the multiple sequence alignment (MSA) step. An MSA should be performed using each input FASTA file, producing output alignment files and plots to display the quality of the alignments produced.
- (For teams of 4 people) the Phylogenetics developer should process trees (input as .dnd files) to display species names and gene names (communicating with other team members to ensure such data is available). Trees should be reformatted to label branches for particular species with a given colour, shown in the figure legend.
- (For teams of 4 people) the Phylogenetics developer will use the gene name and species names to perform queries in pubmed, to retrieve any articles in which these search terms are found within the abstracts of articles.
- For top marks, I would like to see the results of all steps assembled into one or several pdf files by the team leader’s code, with figure legends and text indicating what each step contains.
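The first two retrieval steps above can be sketched as follows. This is a minimal illustration only, assuming the FASTA headers follow the usual UniProt layout (">db|ACCESSION|ENTRY_NAME ... OS=Species ... GN=Gene ..."). The function names are hypothetical, and plain Python is used so the parsing logic is visible; in the real pipeline Bio.SeqIO would normally do the file reading.

```python
import re

# Assumed UniProt-style header layout; adjust if your FASTA file differs.
HEADER_RE = re.compile(r"^>?(?P<db>\w+)\|(?P<acc>[^|]+)\|(?P<entry>\S+)")
SPECIES_RE = re.compile(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|$)")
GENE_RE = re.compile(r"\bGN=(\S+)")


def parse_header(header):
    """Return accession, gene name and species from one FASTA header line."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"Unrecognised FASTA header: {header!r}")
    species = SPECIES_RE.search(header)
    gene = GENE_RE.search(header)
    return {
        "accession": match.group("acc"),
        "gene": gene.group(1) if gene else None,
        "species": species.group(1) if species else None,
    }


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_by_accession(lines, wanted):
    """Return the records that were found and the identifiers that were not."""
    wanted = set(wanted)
    found = {}
    for header, seq in read_fasta(lines):
        acc = parse_header(header)["accession"]
        if acc in wanted:
            found[acc] = (header, seq)
    missing = sorted(wanted - set(found))
    return found, missing
```

The list of missing identifiers is what would be reported back to the user. A gene name plus species lookup could reuse parse_header and compare the "gene" and "species" fields instead of the accession.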
Hand-in details:
1. Team leader uploads to VITAL a zip containing:
a. All code from the entire team – each class is written by one individual only and flagged as such in a comment at the top
b. Documentation showing testing of the whole pipeline
2. Each other individual uploads a document showing the testing of their individual code
Testing and documentation
- On VITAL I have put a mini-FASTA file called uniprot_apicomplexa_mapk.fasta. For BLAST searching and extraction of gene names etc, use this FASTA file for testing. It contains all “MAPK” genes from Apicomplexan pathogen proteomes.
- I have also uploaded the full FASTA files for all Apicomplexa (~230MB zipped). Only use this file when you’re convinced your code is working on smaller examples.
- As noted above, you will be expected to show how you have tested the routines you have developed using both correct (working) and incorrect (non-working) inputs. Where possible, you should show how you handle incorrect inputs, to give helpful error messages to the end user. Make sure you also document your code well.
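As one possible way to handle correct and incorrect inputs, the sketch below validates a list of user-supplied identifiers and raises an error with a helpful message when something is wrong. The function names, the messages and the duplicate check are assumptions for illustration; only the "fewer than 10" limit comes from the specification.

```python
def validate_identifiers(raw_ids, limit=10):
    """Clean user-supplied identifiers, or raise ValueError with a clear message."""
    cleaned = [item.strip() for item in raw_ids if item.strip()]
    if not cleaned:
        raise ValueError("No identifiers given: supply at least one protein identifier.")
    if len(cleaned) >= limit:
        raise ValueError(
            f"{len(cleaned)} identifiers given, but fewer than {limit} are allowed."
        )
    duplicates = sorted({x for x in cleaned if cleaned.count(x) > 1})
    if duplicates:
        raise ValueError(f"Duplicate identifiers: {', '.join(duplicates)}")
    return cleaned


def check(raw_ids):
    """Run one test case and return a line suitable for the testing document."""
    try:
        return "OK: " + ",".join(validate_identifiers(raw_ids))
    except ValueError as err:
        return f"ERROR: {err}"
```

Calling check with a working input and with several non-working inputs (empty list, too many items, repeated items) gives one line per case that can be pasted into the testing documentation.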
Team responsibilities
1. Team Leader
- The team leader is responsible for developing any code needed to retrieve details from FASTA files and for calling each step of the pipeline, i.e. writing the controlling code that takes the first input, calls each step in turn and produces the final output.
- Specifically, you will need to process input from the user (at the command line) of the three types listed above. If the input is gene names or protein identifiers, you will need to extract the matching records from the FASTA file and make a new temporary FASTA file to pass to the BLAST step.
- For top marks, you should pass a shared pdf object to functions in the other three parts, so that figures can be added to a multiple-page pdf report, which you will control, following this example: http://matplotlib.org/examples/pylab_examples/multipage_pdf.html
- If you cannot make this work, the fall back is to produce a text file for the user as the final output, telling the user where to find results in a variety of png and text files as appropriate
- For top marks, you could add some extra plots to the report showing some extra statistics e.g. time taken to run each step, counts of hits at each step or similar at your discretion
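The controlling code described above might look something like the sketch below. It assumes argparse is used for the command line and that each pipeline step is a function taking the previous step's result. The option names (--fasta, --ids, --genes) and the GENE:SPECIES convention are hypothetical choices, not part of the specification.

```python
import argparse
import time


def parse_inputs(argv):
    """Parse the three input types; every option name here is illustrative."""
    parser = argparse.ArgumentParser(description="Pipeline driver (sketch)")
    parser.add_argument("--fasta", help="FASTA file of query protein sequences")
    parser.add_argument("--ids", nargs="+", default=[], help="protein identifiers")
    parser.add_argument("--genes", nargs="+", default=[], metavar="GENE:SPECIES",
                        help="gene name and species name joined by a colon")
    args = parser.parse_args(argv)
    if not (args.fasta or args.ids or args.genes):
        parser.error("give at least one of --fasta, --ids or --genes")
    pairs = []
    for item in args.genes:
        gene, sep, species = item.partition(":")
        if not sep or not gene or not species:
            parser.error(f"expected GENE:SPECIES, got {item!r}")
        pairs.append((gene, species))
    args.genes = pairs
    return args


def run_steps(steps, data):
    """Call each (name, function) step in turn and record the time each took."""
    timings = {}
    for name, func in steps:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings
```

The timings dictionary returned by run_steps is one source for the extra statistics plot mentioned above, since it records how long each step took in the order the steps ran.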
2. BLAST developer
- The team leader will pass to your code a FASTA file containing n protein sequences, for you to perform a BLAST search via the command line (note: don’t try this with 1000s of proteins; you may want to write code to limit the number of sequences to fewer than 10, say).
- You should write code to process the results from each of the n searches, to extract proteins passing a user-entered threshold, e.g. e-value < 1e-10, and produce n FASTA files to pass on to the next step (MSA).
- For top marks, you should also produce one or two plots for each set of BLAST results written both as png files, and to the pdf object provided by the team leader (if they manage this part), showing for example a histogram of BLAST scores, nicely formatted alignment or other plot of your choice
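A minimal sketch of the thresholding step is given below. It assumes the search is run with the BLAST+ blastp program, that a protein database has already been built from the FASTA file (for example with makeblastdb), and that results are written in tabular form (-outfmt 6) with the default twelve columns. The function names are hypothetical, and BioPython's own BLAST parsers are an alternative to reading the table by hand.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def blastp_command(query_fasta, database, out_path):
    """Build (but do not run) a blastp command line for subprocess.run."""
    return ["blastp", "-query", query_fasta, "-db", database,
            "-outfmt", "6", "-out", out_path]


def hits_passing(handle, max_evalue=1e-10):
    """Map each query id to the subject ids with e-value below max_evalue."""
    passing = defaultdict(list)
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        if float(row["evalue"]) >= max_evalue:
            continue  # hit fails the user-selected threshold
        hits = passing[row["qseqid"]]
        if row["sseqid"] not in hits:  # keep each subject once per query
            hits.append(row["sseqid"])
    return dict(passing)


# Usage with an in-memory table; a real run would open the blastp output file.
example = io.StringIO("q1\ts1\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-30\t100\n")
print(hits_passing(example))
```

Each key of the returned dictionary corresponds to one input protein, so writing one FASTA file per key gives the n files needed by the MSA step.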
3. MSA developer
- You will receive from the BLAST step n FASTA files, on which you should perform n clustalw runs to produce multiple sequence alignments.
o Note I recommend using the following command to produce better trees for phylogenetics: “clustalw2 -INFILE=[inputfilename].fasta -ALIGN -TYPE=protein -CLUSTERING=UPGMA”
- For each alignment, aim to produce a plot similar to the one below as png (and written to the pdf object provided by the team leader if possible), showing the percentage agreement in positions along the alignment length, with a figure legend.
- For top marks, you also need to produce one other plot of some type showing some statistics or nicely formatted view of the alignment, with a figure legend
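The percentage agreement along the alignment could be computed along these lines. This is a sketch under stated assumptions: the aligned sequences are already available as equal-length strings (for example read from the clustalw output with Bio.AlignIO), "agreement" means the share of sequences carrying the most common non-gap residue in a column, and the function names are hypothetical. The command builder simply reproduces the clustalw2 options recommended above.

```python
from collections import Counter


def clustalw_command(fasta_path):
    """Build (but do not run) the recommended clustalw2 command line."""
    return ["clustalw2", f"-INFILE={fasta_path}", "-ALIGN",
            "-TYPE=protein", "-CLUSTERING=UPGMA"]


def percent_agreement(aligned, gap="-"):
    """Per column, the percentage of sequences sharing the commonest residue."""
    if not aligned:
        raise ValueError("The alignment contains no sequences.")
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("Aligned sequences must all have the same length.")
    scores = []
    for column in zip(*aligned):
        # Gaps are ignored when choosing the commonest residue (an assumption).
        residues = Counter(c for c in column if c != gap)
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(100.0 * top / len(aligned))
    return scores
```

The returned list has one value per alignment column, so it can be plotted directly against position along the alignment length.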
4. Phylogenetics developer
- From the MSA step, you will receive n “.dnd” files. You should reformat these to display species and gene names on the branches, adding a suitable figure legend. You should work out how to colour branches according to the different species in the results.
- You should save n figures as png files, and ideally to the pdf object provided by the team leader’s code
- You should query pubmed over the internet to find any abstracts containing the gene name and/or species name, and report back the pubmed ID and article details (author, journal, year, volume, pages) to a text file report.
- For top marks, come up with an extra plot showing numbers of papers found per year or similar for a given query.
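For the pubmed part, a sketch of building the search term and counting papers per year is shown below. It assumes the query is sent with Bio.Entrez (for example Entrez.esearch with db="pubmed"), and that each retrieved article has been reduced to a dictionary with a "year" key. The function names and that record layout are hypothetical. Note that the [Title/Abstract] field tag also matches words in titles, not only abstracts.

```python
from collections import Counter


def pubmed_term(gene, species, require_both=True):
    """Build a PubMed search term restricted to the title/abstract fields."""
    joiner = " AND " if require_both else " OR "
    return joiner.join(f'"{text}"[Title/Abstract]' for text in (gene, species))


def papers_per_year(records):
    """Count articles per publication year, skipping records without a year."""
    counts = Counter(rec["year"] for rec in records if rec.get("year"))
    return dict(sorted(counts.items()))
```

The dictionary from papers_per_year is ordered by year, so its keys and values can be passed straight to a bar plot.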
If anything is unclear, please email me.