Exercise: Database Access to Ensembl

In this exercise, we continue accessing the Ensembl databases from Python scripts. With the mandatory project due soon, you are busy, so we keep it simple, more or less: the first exercises should be easy, and the later ones are harder. When it gets too complicated, just stop.

Motivation

We will focus on writing a function for extracting sequences from the assemblies at Ensembl. In the lecture, we saw that the Caenorhabditis elegans database has a table, assembly, that lists contig ids with their placement on the chromosomes; a table, contig, that relates contig ids to dna ids; and a table, dna, that holds the actual sequences.

In these exercises we will use these tables to extract sequences.

Extracting Indexed Sequences

Our goal is to write a function that, given a chromosome name, a start index, and a stop index, returns the corresponding DNA sequence.

EXERCISE DB1X.1: Write an SQL query that joins chromosome and assembly, to let you index chromosomes by name. Write a function for extracting the id from the name. (This is an exercise you might have done at the lectures, if you got that far: DB1.6).
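As a starting point, here is a minimal sketch of the name lookup and the join. It runs against a toy in-memory SQLite database rather than the real Ensembl server, and the column names (chromosome_id, name, contig_id, chr_start, chr_end) are assumptions made for illustration; check them against the actual tables before reusing the queries. The helper names chromosome_id and contigs_on are hypothetical.

```python
import sqlite3

# Toy stand-in for the Ensembl tables; all column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chromosome (chromosome_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE assembly (chromosome_id INTEGER, contig_id INTEGER,
                           chr_start INTEGER, chr_end INTEGER);
    INSERT INTO chromosome VALUES (1, 'I'), (2, 'II');
    INSERT INTO assembly VALUES (1, 10, 1, 8), (1, 11, 9, 16), (2, 20, 1, 5);
""")

def chromosome_id(conn, name):
    """Return the id of the chromosome called `name`, or None if unknown."""
    row = conn.execute(
        "SELECT chromosome_id FROM chromosome WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

def contigs_on(conn, name):
    """List (contig_id, chr_start, chr_end) for a chromosome given by name."""
    return conn.execute(
        """SELECT a.contig_id, a.chr_start, a.chr_end
           FROM chromosome AS c
           JOIN assembly AS a ON a.chromosome_id = c.chromosome_id
           WHERE c.name = ?
           ORDER BY a.chr_start""",
        (name,),
    ).fetchall()

print(chromosome_id(conn, "II"))
print(contigs_on(conn, "I"))
```

The `?` placeholders are the sqlite3 style; with the MySQLdb module the placeholder is `%s` instead.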

EXERCISE DB1X.2: Write a query that extracts the contigs in a certain range on a specific chromosome. That is, write a query that takes a range and returns all contigs that overlap it: those that start no later than the end of the range and do not end before the start of the range.
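One way to read "contigs in a range" is as an interval overlap test: a contig is relevant if it starts at or before the stop index and ends at or after the start index. The sketch below shows that test on a toy in-memory SQLite table. The column names and the assumption that coordinates are 1-based and inclusive are mine, not taken from the Ensembl schema, and the function name overlapping_contigs is hypothetical.

```python
import sqlite3

# Toy assembly table; column names and 1-based inclusive coordinates are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assembly (chromosome_id INTEGER, contig_id INTEGER,
                           chr_start INTEGER, chr_end INTEGER);
    INSERT INTO assembly VALUES
        (1, 10, 1, 8), (1, 11, 9, 16), (1, 12, 17, 24), (2, 20, 1, 30);
""")

def overlapping_contigs(conn, chrom_id, start, stop):
    """Return contigs on `chrom_id` that overlap the range [start, stop]."""
    return conn.execute(
        """SELECT contig_id, chr_start, chr_end
           FROM assembly
           WHERE chromosome_id = ?
             AND chr_start <= ?   -- contig begins no later than the range ends
             AND chr_end >= ?     -- contig ends no earlier than the range begins
           ORDER BY chr_start""",
        (chrom_id, stop, start),
    ).fetchall()

print(overlapping_contigs(conn, 1, 5, 12))
```

Note that the parameters are passed as (stop, start), matching the order of the two comparisons in the query.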

EXERCISE DB1X.3: Join the tables assembly, contig, and dna to get from chromosome ids to contigs to sequences.
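The three-way join can be sketched as follows, again on a toy in-memory SQLite database. The linking columns (contig_id, dna_id), the orientation column contig_ori, and the function name contig_sequences are assumptions for illustration; the real tables may name them differently and contain more columns.

```python
import sqlite3

# Toy versions of the three tables; all column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assembly (chromosome_id INTEGER, contig_id INTEGER,
                           chr_start INTEGER, chr_end INTEGER,
                           contig_ori INTEGER);
    CREATE TABLE contig (contig_id INTEGER PRIMARY KEY, dna_id INTEGER);
    CREATE TABLE dna (dna_id INTEGER PRIMARY KEY, sequence TEXT);
    INSERT INTO assembly VALUES (1, 10, 1, 4, 1), (1, 11, 5, 8, -1);
    INSERT INTO contig VALUES (10, 100), (11, 101);
    INSERT INTO dna VALUES (100, 'ACGT'), (101, 'TTGG');
""")

def contig_sequences(conn, chrom_id):
    """Return (chr_start, chr_end, orientation, sequence) per contig."""
    return conn.execute(
        """SELECT a.chr_start, a.chr_end, a.contig_ori, d.sequence
           FROM assembly AS a
           JOIN contig AS c ON c.contig_id = a.contig_id
           JOIN dna AS d ON d.dna_id = c.dna_id
           WHERE a.chromosome_id = ?
           ORDER BY a.chr_start""",
        (chrom_id,),
    ).fetchall()

print(contig_sequences(conn, 1))
```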

EXERCISE DB1X.4: Combine the queries from DB1X.2 and DB1X.3 to extract the sequences in a given range on a given chromosome. Remember that you must remove the first part of the first contig and the last part of the last contig to trim the sequence to the correct range. Be careful to take the orientation of each contig into account: if the orientation is -1, you must reverse it (for a DNA strand, this normally means taking the reverse complement).

EXERCISE DB1X.5: Write a function that returns the results of the query in DB1X.4, concatenated, to form the wanted indexed sequence.
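The trimming, orientation handling, and concatenation can be sketched in plain Python, independent of the database. The sketch assumes that each row has the hypothetical shape (chr_start, chr_end, orientation, sequence), that coordinates are 1-based and inclusive, and that each contig's full sequence covers exactly chr_start..chr_end. It also assumes that orientation -1 means the contig lies on the opposite strand, so it is reverse-complemented; if a plain reversal is what you want, drop the translate call. The names oriented, clip, and extract are made up for this example.

```python
# Translation table for complementing DNA bases (other characters pass through).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def oriented(sequence, orientation):
    """Return the sequence as it reads along the chromosome."""
    if orientation == -1:
        # Assumption: -1 means opposite strand, so reverse-complement.
        return sequence.translate(COMPLEMENT)[::-1]
    return sequence

def clip(chr_start, chr_end, orientation, sequence, start, stop):
    """Return the part of one contig that falls inside [start, stop]."""
    seq = oriented(sequence, orientation)
    lo = max(start, chr_start) - chr_start      # 0-based offset into contig
    hi = min(stop, chr_end) - chr_start + 1     # exclusive end offset
    return seq[lo:hi]

def extract(rows, start, stop):
    """Concatenate the clipped contigs in chromosome order."""
    return "".join(clip(*row, start, stop) for row in sorted(rows))

rows = [(5, 8, -1, "TTGG"), (1, 4, 1, "ACGT")]
print(extract(rows, 3, 6))  # GTCC
```

Here the second contig reads CCAA along the chromosome, so the full toy chromosome is ACGTCCAA and positions 3 to 6 give GTCC. Orienting each contig before slicing keeps the offset arithmetic the same for both orientations.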

Summary

We have written a function that extracts indexed sequences from the Caenorhabditis elegans assembly at Ensembl.

This function is about as complex as they get in this course. If you managed to do the exercise, feel good! If you didn't, don't feel bad about it. The database access you will need in the next mandatory exercise will not be as bad.

Time-stamp: "2003-11-25 15:12:49 mailund"