Tool-Building in Bioinformatics

TBiB Q4/2006


Exam Project / TBiB 2006: A Multiple Alignment Database

In this project we write a web-service for multiple alignment, where the sequences to be aligned are provided by the user, either as the actual sequences or as NCBI gi numbers. Sequences will be added to a database (if they are not already there), and the multiple alignment will either be extracted from the database or created, depending on whether it already exists.

The Project Report

To complete the project, you must do Exercises 1-8 below (7 and 8 are optional) and write a report documenting your solutions. The report should be structured with a section for each exercise describing its solution (see Exercises for details). The report should also describe the overall design of your web-service and give a short user's introduction to it. The report must include the path to your code and the URL of your web-service.

The project report should be handed in on Wednesday, June 28, 2006, to one of the lecturers.

Motivation

We want to combine the code we have written in the last several weeks' exercises into an integrated web-service. This includes the Clustal W wrapper for creating multiple alignments, the NCBI searching, the multiple alignment database, and the CGI interface to Clustal W.

Access to Sequences

First, we focus on providing sequences to our service.

We want two ways of inputting sequences to our service: directly (by providing the sequence itself) or by NCBI sequence ID (which we refer to as NCBI_ID in the following). In both cases, the sequence should be inserted into the underlying database for later use.

EXERCISE 1: Write a function that, given a sequence or an NCBI_ID, inserts the sequence into your database if it is not already there. This includes downloading it from NCBI if it is given as an NCBI_ID. Decide how to handle sequences without an NCBI_ID (i.e. sequences you enter yourself): How do you give them an identifier (for use in your database)? How do you check whether they are already in your database? How do you later extract them from your database?

The report should contain a description of your database, the source code of the function for inserting sequences, and an explanation of the design choices.
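
As an illustration of one possible design for Exercise 1 (not the required solution), the sketch below stores sequences in an SQLite table and derives an identifier for user-entered sequences from a hash of the normalized residues, so that entering the same sequence twice yields the same identifier. The table layout, the "ncbi:" and "local:" prefixes, and the function names are assumptions made for this example. The fetch argument stands in for whatever download function you wrote in the NCBI searching exercise and is assumed to return the bare residues as a string.

```python
import hashlib
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) sequences table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        " seq_id TEXT PRIMARY KEY,"
        " ncbi_id TEXT UNIQUE,"
        " residues TEXT NOT NULL)"
    )
    return conn


def normalize(residues):
    """Remove all whitespace and upper-case, so equal sequences compare equal."""
    return "".join(residues.split()).upper()


def local_id(residues):
    """Identifier for a sequence without an NCBI_ID: a prefix plus a content hash."""
    digest = hashlib.sha1(residues.encode("utf-8")).hexdigest()
    return "local:" + digest[:16]


def insert_sequence(conn, residues=None, ncbi_id=None, fetch=None):
    """Insert the sequence if it is missing and return its database identifier."""
    if ncbi_id is not None:
        seq_id = "ncbi:" + str(ncbi_id)
        found = conn.execute(
            "SELECT 1 FROM sequences WHERE seq_id = ?", (seq_id,)
        ).fetchone()
        if found is None:
            if fetch is None:
                raise ValueError("a fetch function is needed to download from NCBI")
            # Only download when the sequence is not already stored.
            conn.execute(
                "INSERT INTO sequences VALUES (?, ?, ?)",
                (seq_id, str(ncbi_id), normalize(fetch(ncbi_id))),
            )
            conn.commit()
        return seq_id
    if residues is None:
        raise ValueError("provide either residues or ncbi_id")
    residues = normalize(residues)
    seq_id = local_id(residues)
    # The content-derived key makes a repeated insert a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO sequences VALUES (?, NULL, ?)", (seq_id, residues)
    )
    conn.commit()
    return seq_id
```

With this choice, checking whether a user-entered sequence is already present reduces to a primary-key lookup. One trade-off to discuss in the report is that the same sequence entered directly and via an NCBI_ID gets two different identifiers under this scheme.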

EXERCISE 2: Write a function that, given a sequence identifier in the framework you have designed in Exercise 1, extracts the sequence from your database.

The report should contain the source code of the function and an explanation of the design choices.
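
A minimal sketch of such a lookup, assuming a hypothetical SQLite table sequences(seq_id, residues) keyed by your own identifiers. The table and function names are illustrative only; your own schema from Exercise 1 may look different.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory database with one stored sequence (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences (seq_id TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO sequences VALUES (?, ?)", ("local:demo", "ACGT"))
    conn.commit()
    return conn


def get_sequence(conn, seq_id):
    """Return the residues stored under seq_id, or raise KeyError if unknown."""
    row = conn.execute(
        "SELECT residues FROM sequences WHERE seq_id = ?", (seq_id,)
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    return row[0]
```

Raising an exception for an unknown identifier, rather than returning an empty string, lets the calling CGI script distinguish "no such sequence" from "empty sequence"; this is one design choice among several.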

Multiple Alignments

We now turn to building multiple alignments from sequences in the database.

As in the multiple alignment database exercises, we want an association between the multiple alignment and the sequences appearing in it.

EXERCISE 3: Write a function that, given a list of sequence identifiers, either extracts a multiple alignment of the sequences (if one exists) or creates the alignment, inserts it into the database, and returns it. This is a variation of exercise DB2X.5; the difference is that we now use our own variant of sequence identifiers.

The report should contain the source code of the function and an explanation of the design choices.
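
The extract-or-create logic can be sketched as follows. This is an illustration under assumptions, not the solution to DB2X.5: the two tables, the member_key column (the sorted identifiers joined into one string, so that the same set of sequences always maps to the same alignment), and the function names are all invented for the example. The align argument stands in for your Clustal W wrapper and is assumed to take a list of sequence identifiers and return the alignment as text.

```python
import sqlite3

# Hypothetical schema: one row per alignment plus an association table
# linking each alignment to the sequences appearing in it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS alignments (
    aln_id INTEGER PRIMARY KEY,
    member_key TEXT UNIQUE NOT NULL,
    alignment TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS alignment_members (
    aln_id INTEGER NOT NULL REFERENCES alignments(aln_id),
    seq_id TEXT NOT NULL,
    PRIMARY KEY (aln_id, seq_id)
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def get_or_create_alignment(conn, seq_ids, align):
    """Return the stored alignment for seq_ids, creating and storing it if needed."""
    members = sorted(set(seq_ids))  # order and duplicates do not matter
    key = "\n".join(members)
    row = conn.execute(
        "SELECT alignment FROM alignments WHERE member_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    text = align(members)  # only run the aligner on a cache miss
    cur = conn.execute(
        "INSERT INTO alignments (member_key, alignment) VALUES (?, ?)", (key, text)
    )
    conn.executemany(
        "INSERT INTO alignment_members VALUES (?, ?)",
        [(cur.lastrowid, seq_id) for seq_id in members],
    )
    conn.commit()
    return text
```

Treating the identifier list as a set is itself a design decision; if the order of sequences in the alignment output matters to you, the key would have to preserve it.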

EXERCISE 4: Re-do exercises DB2X.3 and DB2X.4 with the new identifiers.

The report should contain the source code of the functions and an explanation of the design choices.

User Interface

We now only need to write a user interface to the functionality above. This will, naturally, be in the form of CGI scripts.

For each exercise, the report should contain the source code of the web-page and CGI script plus screenshots illustrating it in action.

EXERCISE 5: Write a web-page plus CGI script for populating the database with sequences. The web-page plus CGI script should act as an interface to the function developed in Exercise 1.
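
A rough sketch of the CGI side of such a page, using only the standard library. The form field names ncbi_id and sequence, and the store callable (standing in for your Exercise 1 function, assumed to accept residues= or ncbi_id= and return the database identifier), are assumptions for this example. The request handling is kept in a plain function so it can be tested without a web server.

```python
import html
import os
import sys
from urllib.parse import parse_qs


def render_page(title, message):
    """Build a complete CGI response: header, blank line, then a small HTML page."""
    return (
        "Content-Type: text/html; charset=utf-8\r\n\r\n"
        "<html><head><title>{0}</title></head>"
        "<body><h1>{0}</h1><p>{1}</p></body></html>"
    ).format(html.escape(title), html.escape(message))


def handle_form(body, store):
    """Process a URL-encoded form body and return the response text."""
    fields = parse_qs(body)  # blank fields are dropped by default
    ncbi_id = fields.get("ncbi_id", [""])[0].strip()
    sequence = fields.get("sequence", [""])[0].strip()
    if ncbi_id:
        seq_id = store(ncbi_id=ncbi_id)
    elif sequence:
        seq_id = store(residues=sequence)
    else:
        return render_page("Add sequence", "Please provide an NCBI_ID or a sequence.")
    return render_page("Add sequence", "Stored as " + seq_id)


def main(store):
    """CGI entry point for a POST request: read the body from stdin, print the page."""
    length = int(os.environ.get("CONTENT_LENGTH") or 0)
    sys.stdout.write(handle_form(sys.stdin.read(length), store))
```

In the actual script you would call main with your own insertion function. Note that everything echoed back to the browser passes through html.escape, so user input cannot inject markup into the page.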

EXERCISE 6: Write a web-page plus CGI script for displaying multiple alignments. The sequences used can be provided as your database IDs, NCBI_IDs, or explicitly. All sequences that are not already in the database should be inserted and the alignment, if it is not already in the database, should be generated and displayed on a web-page.

EXERCISE 7 (optional): Update the alignment web-page above such that it now contains links to the individual sequences. That is, from the multiple alignment page, it should be possible to click your way to the individual sequences.

EXERCISE 8 (optional): Update the sequence-pages from above so they contain links to pages for all the alignments the sequence appears in. That is, the sequence page should contain a link for each alignment the sequence appears in, and the link should call a script that generates the specified multiple alignment page.
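
For the links in Exercise 8, one possible approach is to query the association between alignments and sequences and emit one link per alignment. The sketch below assumes a hypothetical table alignment_members(aln_id, seq_id) and a hypothetical script name alignment.cgi that takes an aln_id parameter; substitute your own schema and script.

```python
import html
import sqlite3
from urllib.parse import urlencode


def alignment_links(conn, seq_id, script="alignment.cgi"):
    """Return an HTML list with one link per alignment that contains seq_id."""
    rows = conn.execute(
        "SELECT aln_id FROM alignment_members WHERE seq_id = ? ORDER BY aln_id",
        (seq_id,),
    ).fetchall()
    items = []
    for (aln_id,) in rows:
        # urlencode builds the query string; html.escape makes it attribute-safe.
        href = script + "?" + urlencode({"aln_id": aln_id})
        items.append(
            '<li><a href="{}">Alignment {}</a></li>'.format(
                html.escape(href, quote=True), aln_id
            )
        )
    return "<ul>" + "".join(items) + "</ul>"
```

The links for Exercise 7 can be generated the same way in the other direction, by listing the sequences of one alignment and linking each to the sequence page.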