TBiB Q4/2006
Exam Project / TBiB 2007: Conservation plots for whole-genome Drosophila alignments
The goal of this exercise is to plot conservation along a
whole-genome alignment of six Drosophila species. The exercise
involves constructing a database containing information about
the alignments, using the database to extract alignments for
selected ranges of genomes, and using Rpy to plot conservation
along the alignments.

The Project Report
To complete the project, you must write a report documenting your solutions. The report should be structured with a section for each exercise describing its solution (see Exercises for details). The report should also describe the overall design of your database and your scripts, and give a short user's guide to them. The path to your code must be included in the report. The project report should be handed in on Friday, June 29, 2007, by email to Søren Besenbacher <besen@daimi.au.dk>.

Motivation
We all think that genome browsers are really cool. They show us all kinds of information about chunks of a selected genome: where the genes are, where DNA is conserved across several genomes, and whatnot. Obviously, this is information we want to be able to browse for our favourite organisms. But what do we do when working on new sequences, or when we have new alignments that are not present in the genome browsers? We have to implement that functionality ourselves. In this project we learn how to implement some of this: how to calculate and plot annotations of a genome. (We do not implement actual web browsing; for future reference you can look at the lecture notes from previous years of this course.)

Administrating multiple multiple (sic) alignments
Unless genomes are very closely related, they cannot be aligned in a single contiguous block. Structural rearrangements mean that contiguous blocks in one genome can be distributed over several different regions and chromosomes in other genomes. Thus, dealing with a whole-genome multiple alignment means we have to deal with multiple multiple alignments.
In this project we will look at genomes from six different Drosophila species. The data can be found in /users/besen/public_html/TBiB2007/CAF1/ and consist of 1411 alignments in FASTA format (numbered from 1 to 1414, excluding 116, 840, and 915). The location of each alignment in the different genomes is described in the file /users/besen/public_html/TBiB2007/CAF1/map.
The first line of the file is a header describing the format of the data.
EXERCISE 1: Construct a SQL database with the information from the map file.

Conservation plots
The first thing we want to plot is conservation, in sliding windows, along a genome. Given a reference genome (species), we want to be able to extract all alignments within a given region of a chromosome, calculate conservation in a sliding window along the alignments within that region, and plot this conservation.
EXERCISE 2: Write a function that, given a species name, chromosome name, start and stop position, returns the IDs of all alignments overlapping the region in question.
EXERCISE 3: Update the function from EXERCISE 2 so that it returns data structures containing the actual multiple alignments overlapping the region. Remember that the first and last alignment might not be completely contained within the region. Remember also that an index into an alignment is given with respect to some reference genome, so the corresponding column in the alignment depends on the gaps in that reference genome. To save memory, you might want to use a generator (remember the yield keyword?) to return the sequence of alignments.
EXERCISE 4: Make a function that uses a sliding window of a certain width in the reference genome to calculate the conservation of the multiple alignments. The output of the function should be a list of floating-point numbers, one for each window. Parameters to the function should include the window size and the size of the overlap between neighbouring windows. When calculating the conservation, you should consider only the columns where the reference genome is not a gap. Here we define the "conservation" of a column as the fraction of symbols (nucleotides or gaps) in the column that are equal to the most frequent symbol in that column. The conservation of a window is the mean conservation of the columns in the window.
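To make the definition of conservation concrete, here is a minimal sketch on a made-up toy alignment. It is only an illustration, not a prescribed solution: the function names, the parameter names, the "-" gap symbol, the case-sensitive comparison of symbols, and the choice to drop a trailing incomplete window are all assumptions, and the alignment is represented simply as a list of equal-length strings with the reference sequence first.

```python
from collections import Counter


def column_conservation(column):
    """Fraction of symbols in the column equal to its most frequent symbol.

    Gaps are counted like any other symbol, as in the definition above.
    """
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def window_conservation(alignment, ref_index=0, window=2, overlap=1, gap="-"):
    """Mean column conservation in sliding windows over the reference.

    alignment: list of equal-length strings (hypothetical representation).
    Only columns where the reference sequence is not a gap are used.
    A trailing window shorter than `window` is dropped (an assumption).
    """
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the window size")
    # zip(*alignment) walks through the alignment column by column.
    columns = [col for col in zip(*alignment) if col[ref_index] != gap]
    scores = [column_conservation(col) for col in columns]
    result = []
    for start in range(0, len(scores) - window + 1, step):
        chunk = scores[start:start + window]
        result.append(sum(chunk) / window)
    return result


# Toy data, invented for illustration only; the first row is the reference.
toy_alignment = ["AC-GT",
                 "ACTGA",
                 "A-TGT"]
# The third column is skipped because the reference has a gap there.
print([round(x, 3) for x in window_conservation(toy_alignment)])
# three windows, each about 0.833
```

In the toy example the four remaining columns have conservation 1, 2/3, 1 and 2/3, so every window of two neighbouring columns averages 5/6. Whether a final partial window should be reported is a design decision the exercise leaves open.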
EXERCISE 5: Combine the functions from EXERCISES 2 to 4 to produce a script that plots conservation along a contiguous region of a reference genome.

Gene annotation
We also want to plot a gene annotation along our genome (so we can spot whether genes and conservation are correlated, and other nifty things). The positions of known genes in the Drosophila melanogaster genome can be found in the file /users/besen/public_html/TBiB2007/CAF1/DroMel_CAF1.gff. Each line in the file contains information about a segment of a gene. The third column of the line tells you whether the segment in question is an exon, a CDS (coding DNA sequence), or something else. We want to be able to show the positions of the CDS segments in our conservation plots, so you can disregard all lines not describing CDS segments. The first column tells which chromosome the gene is found on, and columns 4 and 5 contain the start and stop positions of the segment in the Drosophila melanogaster genome. Column 7 contains a plus or a minus depending on which strand the gene is on. The positions are always given on the positive strand, regardless of the strand of the gene.
EXERCISE 6: Make a database that can hold the relevant information from the DroMel_CAF1.gff file.
EXERCISE 7: Write a function that, given a chromosome name and a start and stop position, returns a list of all the CDS segments found in that region of the Drosophila melanogaster genome. For each CDS in the list we want to know its start and stop positions and which strand it is on.
EXERCISE 8: Update the script from EXERCISE 5 so that it can also plot CDS segments as line segments if the reference genome is Drosophila melanogaster.
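
As a rough illustration of the GFF column layout described under Gene annotation, the sketch below reads CDS lines into a small SQL table and queries a region. Everything beyond the column positions is an assumption made for this example: the use of Python's sqlite3 module as the database, the table and function names, treating "found in that region" as "overlapping the region", and the sample lines, which are invented toy data and not taken from DroMel_CAF1.gff.

```python
import sqlite3


def parse_cds_lines(lines):
    """Yield (chromosome, start, stop, strand) for each CDS line.

    Uses columns 1, 3, 4, 5 and 7 of tab-separated GFF lines.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[2] != "CDS":
            continue  # ignore exons and every other feature type
        yield fields[0], int(fields[3]), int(fields[4]), fields[6]


def build_cds_table(connection, records):
    """Create a hypothetical 'cds' table and fill it with the records."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS cds "
        "(chromosome TEXT, start INTEGER, stop INTEGER, strand TEXT)")
    connection.executemany("INSERT INTO cds VALUES (?, ?, ?, ?)", records)
    connection.commit()


def cds_in_region(connection, chromosome, start, stop):
    """Return (start, stop, strand) for CDS segments overlapping the region."""
    query = ("SELECT start, stop, strand FROM cds "
             "WHERE chromosome = ? AND start <= ? AND stop >= ? "
             "ORDER BY start")
    return connection.execute(query, (chromosome, stop, start)).fetchall()


# Invented sample lines in GFF layout, for illustration only.
sample = [
    "2L\tsrc\texon\t100\t200\t.\t+\t.\tgene1",
    "2L\tsrc\tCDS\t120\t180\t.\t+\t0\tgene1",
    "2L\tsrc\tCDS\t500\t600\t.\t-\t0\tgene2",
    "X\tsrc\tCDS\t150\t160\t.\t+\t0\tgene3",
]
connection = sqlite3.connect(":memory:")
build_cds_table(connection, parse_cds_lines(sample))
print(cds_in_region(connection, "2L", 150, 550))
```

With the toy data, the query for chromosome 2L between 150 and 550 returns the two CDS segments on 2L and skips both the exon line and the segment on X. A real solution would read the lines from the file instead of a list and might store further columns if the report needs them.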