We will focus on writing a function for extracting sequences from the assemblies at Ensembl. At the lecture, we saw that the Caenorhabditis elegans database contains three relevant tables: assembly, which lists contig ids with their placement on the chromosomes; contig, which relates contig ids to dna ids; and dna, which holds the actual sequences.
In these exercises we will use these tables to extract sequences.
Our goal is to write a function that, given a chromosome name, a start index and a stop index, returns the corresponding dna sequence.
EXERCISE DB1X.1: Write an SQL query that joins chromosome and assembly, to let you index chromosomes by name. Write a function for extracting the id from the name. (This is an exercise you might have done at the lectures, if you got that far: DB1.6).
EXERCISE DB1X.2: Write a query that extracts the contigs in a certain range on a specific chromosome. That is, write a query that takes a range and returns all contigs that overlap it: contigs that start before the range ends and do not end before the range starts.
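The overlap test can be tried out on toy data before running it against the real database. The sketch below uses Python's sqlite3 module with an in-memory table. The column names chromosome_id, contig_id, chr_start and chr_end are assumptions modelled on the assembly table from the lecture and should be checked against the actual schema; the function name contigs_in_range is made up for illustration. Coordinates are treated as 1-based and inclusive.

```python
import sqlite3

# Toy stand-in for the assembly table; the column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assembly ("
    " chromosome_id INTEGER, contig_id INTEGER,"
    " chr_start INTEGER, chr_end INTEGER)"
)
conn.executemany(
    "INSERT INTO assembly VALUES (?, ?, ?, ?)",
    [
        (1, 10, 1, 100),
        (1, 11, 101, 200),
        (1, 12, 201, 300),
        (2, 20, 1, 150),
    ],
)


def contigs_in_range(conn, chromosome_id, start, stop):
    """Return the ids of contigs overlapping [start, stop], in chromosome order."""
    cursor = conn.execute(
        "SELECT contig_id FROM assembly"
        " WHERE chromosome_id = ?"
        "   AND chr_start <= ?"  # contig starts before the range ends
        "   AND chr_end >= ?"    # contig does not end before the range starts
        " ORDER BY chr_start",
        (chromosome_id, stop, start),
    )
    return [row[0] for row in cursor]
```

With the toy rows above, contigs_in_range(conn, 1, 150, 250) selects contigs 11 and 12: contig 10 ends at position 100, before the range starts, so it is left out.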
EXERCISE DB1X.3: Join the tables assembly, contig, and dna to get from chromosome ids to contigs to sequences.
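One possible shape for this three-way join is sketched below, again on an in-memory sqlite3 database rather than the real one. The schema is a simplified assumption: only the columns needed for the join are included, and names such as contig_ori and sequence should be verified against the tables shown at the lecture. The function name contig_sequences is hypothetical.

```python
import sqlite3

# Simplified toy versions of the three tables; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assembly (chromosome_id INTEGER, contig_id INTEGER,
                           chr_start INTEGER, chr_end INTEGER,
                           contig_ori INTEGER);
    CREATE TABLE contig (contig_id INTEGER, dna_id INTEGER);
    CREATE TABLE dna (dna_id INTEGER, sequence TEXT);
    INSERT INTO assembly VALUES (1, 10, 1, 4, 1), (1, 11, 5, 8, -1);
    INSERT INTO contig VALUES (10, 100), (11, 101);
    INSERT INTO dna VALUES (100, 'ACGT'), (101, 'TTGA');
""")


def contig_sequences(conn, chromosome_id):
    """Return (chr_start, chr_end, orientation, sequence) for each contig."""
    cursor = conn.execute(
        "SELECT a.chr_start, a.chr_end, a.contig_ori, d.sequence"
        " FROM assembly AS a"
        " JOIN contig AS c ON c.contig_id = a.contig_id"  # assembly -> contig
        " JOIN dna AS d ON d.dna_id = c.dna_id"           # contig -> dna
        " WHERE a.chromosome_id = ?"
        " ORDER BY a.chr_start",
        (chromosome_id,),
    )
    return cursor.fetchall()
```

Ordering by chr_start returns the contigs in the order they appear on the chromosome, which is convenient when the sequences are later pieced together.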
EXERCISE DB1X.4: Wrap the queries from DB1X.2 and DB1X.3 to extract sequences in a given range on a given chromosome. Remember that you must remove the first part of the first contig and the last part of the last contig to trim the sequence to the correct range. Take the orientation of each contig into account: if the orientation is -1, the contig lies on the opposite strand, so you must reverse-complement its sequence rather than use it as stored.
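The trimming arithmetic is easy to get wrong by one, so a small pure-Python sketch may help. It rests on several assumptions that are not stated in the exercise: coordinates are 1-based and inclusive, the whole stored sequence of a contig is placed at chr_start..chr_end, and an orientation of -1 means the stored sequence is on the opposite strand and is therefore reverse-complemented. The function names are made up for illustration.

```python
def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def trim_contig(seq, chr_start, chr_end, ori, start, stop):
    """Return the part of one contig that lies within [start, stop].

    Assumes 1-based inclusive coordinates and that the whole stored
    sequence is placed at chr_start..chr_end on the chromosome.
    """
    if ori == -1:
        # Flip the contig so that it reads along the chromosome.
        seq = reverse_complement(seq)
    first = max(start, chr_start) - chr_start   # 0-based offset into seq
    last = min(stop, chr_end) - chr_start + 1   # exclusive end offset
    if first >= last:
        return ""  # the contig does not overlap the range at all
    return seq[first:last]
```

For a contig "ACGT" placed at positions 1..4 with orientation 1, the range 3..6 keeps "GT". For a contig "TTGA" placed at 5..8 with orientation -1, the sequence is first turned into "TCAA", and the same range keeps "TC".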
EXERCISE DB1X.5: Write a function that returns the results of the query in DB1X.4, concatenated, to form the wanted indexed sequence.
We have written a function that extracts indexed sequences from the Caenorhabditis elegans assembly at Ensembl.
This function is about as difficult as they get in this course. If you managed to do the exercise, feel good! If you didn't, don't feel bad about it. The database access you will need in the next mandatory exercise will not be as bad.