Tool-Building in Bioinformatics

TBiB Q4/2006


NISC BAC Contig Assembly

In this project we will mine the NISC Comparative Vertebrate Sequencing Project and NCBI for BAC contigs, which we will then assemble into longer sequences.

Motivation

We wish to examine the evolution of closely related species, and for that we want a multiple sequence alignment of a large section of their genomes. For distantly related species, chromosome rearrangements and plain old mutations will complicate this, but for closely related species we should be able to build long alignments.

At the NISC Comparative Vertebrate Sequencing Project, we can get a set of sequences for a large number of species, categorized into targets, where the sequences from each target are homologous to a known location on the human genome. The various targets do not have the same coverage, nor are the various species sequenced to the same degree, but even so we should be able to use this data somehow.

Unfortunately, the data comes as unassembled, unaligned BAC contigs, so we need to do the assembly and the alignment ourselves. In this project we will mine the contigs and assemble them, albeit in a rather naive way; in the final project we will place the assembled sequences on the human genome and construct multiple alignments.

Mining NISC and NCBI

The first thing we need to do is to get hold of the sequences.

We can get a list of the organisms here, and for each organism get a list of the targets and the sequences in the target.

Exercise NISC.1: Browse through some of the species and examine the URLs. Do you see a pattern? Can you construct the URLs for fetching the page for a given organism?

Exercise NISC.2: Browse the targets of a few organisms. Do you see a pattern in the URLs that will let you fetch a certain target from a certain organism, without going through the browsing process? Try switching between the clone map and the Summary Table. On the clone map page there is a link called 'Export Data' that leads to a file containing the same information as the Summary Table, but in a text format that is easier to parse. Can you download this file directly by building the right URL?

Now that we can get directly to a summary table, we need to extract the accession numbers for the sequences — with those in hand we can fetch the actual sequences from NCBI, using the module we wrote for the web-services exercises.

As for getting the accession numbers, let us examine a summary table page.

The numbers we are after are the ones in the GenBank column. Some of those are redundant, though, so let us only go for the ones with Status "Sequenced", not those with Status "Redundant".

In other words, what we want is the fourth column of those lines where the fifth column is "Sequenced".

Exercise NISC.3: Examine the text file mentioned in Exercise NISC.2. Can you write a function that extracts the "Sequenced" accession numbers?
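One possible starting point for this exercise is sketched below. It assumes the exported file is tab-separated, with the GenBank accession in the fourth column and the Status in the fifth, as in the summary table; check the real file before relying on that. The function name `sequenced_accessions` and the `sep` parameter are made up for this illustration.

```python
def sequenced_accessions(lines, sep="\t"):
    """Return the GenBank accession numbers of rows with status 'Sequenced'.

    `lines` is any iterable of text lines, for example an open file.
    Assumption: fields are separated by `sep`, column 4 is GenBank and
    column 5 is Status (counting from 1).
    """
    accessions = []
    for line in lines:
        fields = [field.strip() for field in line.rstrip("\n").split(sep)]
        # Skip header rows, short lines, and rows that are not "Sequenced".
        if len(fields) >= 5 and fields[4] == "Sequenced" and fields[3]:
            accessions.append(fields[3])
    return accessions
```

Since the function only needs an iterable of lines, it can be tested on a small hand-written list before it is pointed at a downloaded file.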

With the accession numbers in hand, we can fetch the sequences from NCBI. You have already written a fetch function for this, and luckily it turns out that the Eutils web-service for fetching sequences accepts accession numbers just as well as its own internal IDs, so that fetch function works out of the box.

Exercise NISC.4: Write a function that takes the accession numbers, fetches the sequences, and writes the sequences to files on the local disk (or DAIMI's NFS if you prefer), preferably in files with names that make it easy for you to find the relevant sequences again.
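A minimal sketch of this step follows, written for Python 3. The `fetch_fasta` function is only a stand-in for the fetch function from the web-services exercises; it calls the NCBI EFetch service with `db=nuccore` and asks for FASTA text. The names `fetch_fasta` and `save_sequences`, and the choice of one file per accession named `<accession>.fasta`, are assumptions made for this example.

```python
import os
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_fasta(accession):
    """Fetch one nucleotide record from NCBI EFetch as FASTA text."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    with urllib.request.urlopen(EFETCH_URL + "?" + query) as response:
        return response.read().decode("utf-8")


def save_sequences(accessions, directory, fetch=fetch_fasta):
    """Fetch each accession and write it to <directory>/<accession>.fasta.

    `fetch` can be replaced by your own fetch function; it must take an
    accession number and return the sequence as a string.
    Returns the list of file paths written.
    """
    os.makedirs(directory, exist_ok=True)
    paths = []
    for accession in accessions:
        path = os.path.join(directory, accession + ".fasta")
        with open(path, "w") as out:
            out.write(fetch(accession))
        paths.append(path)
    return paths
```

Naming the files after the accession numbers keeps the mapping back to the summary table trivial. Including the organism and target in the directory name would be one way to keep different targets apart.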

Assembling BAC Clones

For the assembling, we will use a quick and dirty approach.

On the summary table pages, the contigs are listed in their sequential order. For each pair of neighbours in that ordering, we will simply check whether a sufficiently long suffix of the first matches a prefix of the second.

It is quite possible for neighbour contigs not to satisfy this, either because they in fact do not overlap or because of some sequencing error, but it is the criterion we will use. The benefit of this criterion is that it is very conservative: if we assemble contigs using this method, we can be fairly confident that we assemble them correctly. The cost is that we are likely to leave unassembled some contigs that a slightly more intelligent approach could have assembled.

But intelligence can be added later; for now we go with the simple approach.

Exercise NISC.5: Write a function that extracts the last 100 characters of one string and finds the first occurrence of this sub-string in another string.
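As a hedged sketch of what such a function could look like, the version below uses the built-in `str.find`, which returns -1 when the sub-string is absent. The name `find_anchor` and the `length` parameter are made up for this example; making the length a parameter also makes the function easy to test on short strings.

```python
def find_anchor(first, second, length=100):
    """Return the index in `second` of the first occurrence of the last
    `length` characters of `first`.

    Returns -1 if there is no occurrence, or if `first` is shorter than
    `length` and therefore has no full-length anchor.
    """
    if length <= 0 or len(first) < length:
        return -1
    anchor = first[-length:]
    return second.find(anchor)
```

For example, with `length=3` the anchor of `"AAACGT"` is `"CGT"`, which occurs at index 2 in `"TTCGTGG"`.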

Exercise NISC.6: Use the function above to try to assemble two neighbour contigs by

  • Finding the last 100 characters of the first sequence in the second sequence (finding an "anchor" to work with).
  • Testing whether the prefix up to the anchor in the second sequence matches the corresponding suffix of the first sequence, that is: do the characters in the second string up to and including the anchor match the suffix of the same length of the first sequence?
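The two steps above can be sketched as a single function that returns the length of the overlap, or 0 when the test fails. This is only one way to do it: the name `overlap_length` and the `anchor_length` parameter are assumptions for this example, and only the first occurrence of the anchor is considered, as in Exercise NISC.5.

```python
def overlap_length(first, second, anchor_length=100):
    """Return how many characters at the start of `second` repeat the end
    of `first`, or 0 if the two sequences fail the overlap test."""
    if anchor_length <= 0 or len(first) < anchor_length:
        return 0
    # Step 1: find the anchor (the end of `first`) inside `second`.
    anchor = first[-anchor_length:]
    position = second.find(anchor)
    if position == -1:
        return 0
    # Step 2: the prefix of `second` up to and including the anchor must
    # equal the suffix of `first` of the same length.
    overlap = position + anchor_length
    if overlap > len(first):
        return 0
    if first[-overlap:] == second[:overlap]:
        return overlap
    return 0
```

Returning the overlap length rather than just True or False is a design choice: the caller needs that number to know how much of the second sequence to skip when concatenating.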

Exercise NISC.7: Combine the two functions into a function that iterates over all the contigs, tests if they can be assembled (there is an anchor, and the suffix and prefix match), and if so, assembles the sequences by concatenating the first string with the second minus the prefix which is already there as a suffix of the first.

In the last exercise, you did remember the append-list then join idiom, right? If not, did you notice a speedup when you fixed it?
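To illustrate the idiom, here is a sketch of the whole assembly loop. Instead of growing one long string with repeated concatenation, it appends the new pieces to a list and joins them once per assembled sequence. The function names and the `anchor_length` parameter are assumptions for this example, and the overlap test is a compact version of the one described in Exercise NISC.6.

```python
def overlap_length(first, second, anchor_length=100):
    """Length of the overlap between the end of `first` and the start of
    `second`, or 0 if they fail the anchor and prefix/suffix test."""
    if anchor_length <= 0 or len(first) < anchor_length:
        return 0
    position = second.find(first[-anchor_length:])
    if position == -1:
        return 0
    overlap = position + anchor_length
    if overlap <= len(first) and first[-overlap:] == second[:overlap]:
        return overlap
    return 0


def assemble(contigs, anchor_length=100):
    """Merge neighbouring contigs that overlap.

    `contigs` must be in their sequential order. Returns a list of
    assembled sequences; neighbours that cannot be joined start a new one.
    """
    assembled = []
    pieces = []       # parts of the sequence currently being built
    previous = None
    for contig in contigs:
        overlap = 0
        if previous is not None:
            overlap = overlap_length(previous, contig, anchor_length)
        if overlap:
            # Keep only the part of the contig that is not already there.
            pieces.append(contig[overlap:])
        else:
            if pieces:
                assembled.append("".join(pieces))
            pieces = [contig]
        previous = contig
    if pieces:
        assembled.append("".join(pieces))
    return assembled
```

Testing each contig against its immediate predecessor is enough here, because after a successful merge the assembled sequence ends with the whole of the most recently added contig.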

Just for fun, take a look at some of the neighbours you did not assemble using this method. Many of them still look very much like they could be assembled; the suffix of one is almost, but not quite, a prefix of the next. Do you have any suggestions for how to assemble these cases? What are the pros and cons of your suggestions?

Evaluating the Project

If you have signed up for getting credit for the course, you should hand in a short report on this project. I don't want a fully detailed report — we will save that for the final project — but a few pages describing how you solved the different parts of the problem, how you validated your solution, and which suggestions you have for improving or extending your solution, if any.

Your report should also include the source code for your solution.

The reports should be handed in Tuesday, May 16th.

Summary

We have written scripts for mining the NISC Comparative Vertebrate Sequencing Project for maps of BAC clones, and for extracting the sequences from NCBI.

Furthermore, we have written simple scripts for assembling the downloaded sequences into longer continuous genomic sequences.