Mandatory Project 1

The project is to be solved in groups of two or three people. It must be handed in no later than Thursday, November 13th, 2003. The next project will follow immediately.

Project description

  1. Design a data structure class phylogeny that represents a tree. Write a module that can parse a string in the Newick tree format and represent the resulting tree in your phylogeny data structure.

  2. Write a suitable set of modules so that you can translate, in both directions, between any two of the following sequence formats: GDE, Fasta, and GenBank (the formats of the sample sequence files listed at the end).

    You may assume that Fasta files are always DNA (or, if you want to, you can come up with some way of checking what type of sequence it is), and that the first word after the > is a unique identifier for the sequence. All modules should work with strings only; that is, they should not read or write files. Reading a file into a string or a list of strings is handled by a wrapper program (see below).

  3. Write a wrapper program which handles interaction with the user and uses the above modules to translate between sequence formats.
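One possible shape for the wrapper in part 3 is sketched below. It is only an illustration under stated assumptions: the names `FORMATS`, `translate` and `main`, the `--from`/`--to` options, and the pass-through "plain" format are all made up for this example. In a real solution the registry would hold the parse and write functions of your own format modules. The point it shows is the division of labour: only the wrapper touches files, while the format functions see strings.

```python
import argparse


# Stand-in format functions (hypothetical): they only split and join lines.
# Real entries would be the string-based parse/write functions from part 2.
def _parse_lines(text):
    return text.splitlines()


def _write_lines(records):
    return "\n".join(records) + "\n"


# Hypothetical registry: format name -> (parse function, write function).
FORMATS = {"plain": (_parse_lines, _write_lines)}


def translate(text, source, target, formats=FORMATS):
    """Parse text in the source format and write it in the target format."""
    parse, _ = formats[source]
    _, write = formats[target]
    return write(parse(text))


def main(argv=None):
    """Handle the user interaction and all file reading and writing."""
    parser = argparse.ArgumentParser(
        description="Translate between sequence formats.")
    parser.add_argument("infile")
    parser.add_argument("outfile")
    parser.add_argument("--from", dest="source", required=True,
                        choices=sorted(FORMATS))
    parser.add_argument("--to", dest="target", required=True,
                        choices=sorted(FORMATS))
    args = parser.parse_args(argv)

    # The wrapper, not the format modules, reads the file into a string.
    with open(args.infile) as handle:
        text = handle.read()

    result = translate(text, args.source, args.target)

    with open(args.outfile, "w") as handle:
        handle.write(result)
```

A script built this way would call `main()` at the bottom of the file; passing an explicit argument list, as in `main(["in.txt", "out.txt", "--from", "plain", "--to", "plain"])`, makes it easy to try from the interpreter.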

Remarks

You may freely use anything you have seen in the course so far. For example, you might employ the strategy illustrated in the figure below. Part 1 is probably the hardest, so don't save it for the last day. If you're interested, read the hints. If not, just get the test function, which you can use to test your Newick parser.
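To give an idea of the size of part 1, here is a minimal recursive-descent sketch. It is not the official solution and is not tied to the hints or the test function; the names `Phylogeny`, `parse_newick` and `leaf_names` are assumptions chosen for this example. It handles nested parentheses, node labels and branch lengths, and it deliberately ignores quoted labels and bracketed comments, which the full Newick format also allows.

```python
class Phylogeny:
    """A tree node: a name, an optional branch length, and child subtrees."""

    def __init__(self, name="", length=None, children=None):
        self.name = name
        self.length = length
        self.children = children if children is not None else []

    def is_leaf(self):
        return not self.children

    def leaf_names(self):
        """Return the names of all leaves below this node, left to right."""
        if self.is_leaf():
            return [self.name]
        names = []
        for child in self.children:
            names.extend(child.leaf_names())
        return names


def parse_newick(text):
    """Parse a string such as '((A:0.1,B:0.2)X:0.3,C);' into a Phylogeny."""
    text = "".join(text.split())  # simplification: drop all whitespace
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    node, pos = _parse_subtree(text, 0)
    if pos != len(text) - 1:
        raise ValueError("unexpected character %r at position %d"
                         % (text[pos], pos))
    return node


def _parse_subtree(text, pos):
    """Parse one subtree starting at pos; return (node, next position)."""
    children = []
    if text[pos] == "(":
        pos += 1  # skip '('
        while True:
            child, pos = _parse_subtree(text, pos)
            children.append(child)
            if text[pos] == ",":
                pos += 1  # another sibling follows
            elif text[pos] == ")":
                pos += 1  # end of this child list
                break
            else:
                raise ValueError("expected ',' or ')' at position %d" % pos)

    # Optional label (leaf name or internal node name).
    start = pos
    while text[pos] not in ",():;":
        pos += 1
    name = text[start:pos]

    # Optional branch length after ':'.
    length = None
    if text[pos] == ":":
        pos += 1
        start = pos
        while text[pos] not in ",();":
            pos += 1
        length = float(text[start:pos])

    return Phylogeny(name, length, children), pos
```

For instance, `parse_newick("((A:0.1,B:0.2)X:0.3,C);")` gives an unnamed root with two children: an internal node named `X` with branch length 0.3, and a leaf `C` with no branch length.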

Since the formats do not keep the same information about a sequence, translating between them may force you to throw away some information while keeping some. It's up to you what you keep; our own internal format (Isequence.py) keeps the type, ID and name (if given) of a sequence along with the sequence itself. You may use and modify this format as you wish.
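As an illustration of a common internal format, the sketch below stores the type, ID, name and sequence of each entry and converts Fasta text to and from a list of such records. It is not the contents of Isequence.py; `SequenceRecord`, `parse_fasta` and `write_fasta` are hypothetical names, and the 60-character line width is an arbitrary choice. It relies on the assumptions stated in the project description: the first word after the > is the ID, and Fasta sequences are DNA.

```python
class SequenceRecord:
    """Hypothetical internal format: type, ID, optional name, and sequence."""

    def __init__(self, seq_id, sequence, name="", seq_type="DNA"):
        self.seq_id = seq_id
        self.sequence = sequence
        self.name = name
        self.seq_type = seq_type


def parse_fasta(text):
    """Turn a Fasta-formatted string into a list of SequenceRecord objects."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First word is the ID; the rest of the header, if any, is the name.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            name = header[1] if len(header) > 1 else ""
            records.append(SequenceRecord(seq_id, "", name))
        elif records:
            records[-1].sequence += line
        else:
            raise ValueError("sequence data before the first '>' header")
    return records


def write_fasta(records, width=60):
    """Turn a list of SequenceRecord objects into a Fasta-formatted string."""
    lines = []
    for rec in records:
        header = ">" + rec.seq_id
        if rec.name:
            header += " " + rec.name
        lines.append(header)
        # Break the sequence into lines of at most `width` characters.
        for i in range(0, len(rec.sequence), width):
            lines.append(rec.sequence[i:i + width])
    return "\n".join(lines) + "\n"
```

With one parse function and one write function per format, all going through the same record type, translating between any two formats is a parse followed by a write, so three formats need six functions rather than one converter per ordered pair.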

Documentation

Each group should hand in a short report. The report should include a list of the program files and a brief explanation of the role each file plays. The next mandatory projects will use parts of this project; thus, if for some reason any part of this project is not completed, the report should clearly state which part, and why.

The report should also include the full path to a directory in which all relevant program files (and nothing else) are located, so that if this directory and its contents are copied, the programs will work when run from their new location. Make sure the files are readable, and don't change them after the deadline. Give instructions on how to run the main wrapper program.

Each program file should be easily readable Python code. That is, put plenty of comments in the programs, use logical variable names, use empty lines to separate the different parts of the program, and so on. Don't explain the obvious ("here we use a for loop to go through all elements in the list"); just make sure everything is clearly understandable.

Your program should work on all the given sample sequences (see below). That means that if you translate, e.g., the sequences in GDE file 1 to Fasta, then to GenBank format, then back to GDE format, the actual sequences should match exactly the sequences in the original file (of course you may have lost some additional information on the way, but the sequences themselves should be unchanged).
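The round-trip requirement can be checked mechanically. The sketch below is one way to do it, under assumptions: records are plain (ID, sequence) pairs, each step is a (write, parse) pair of string-based functions, and the tab-separated "format" is a made-up stand-in for your real GDE, Fasta and GenBank modules. Only the sequences are compared, since other information may legitimately be lost on the way.

```python
def sequences_survive(records, steps):
    """Push records through each (write, parse) step in turn and report
    whether the sequences at the end equal the sequences at the start."""
    current = records
    for write, parse in steps:
        current = parse(write(current))
    original = [seq for _, seq in records]
    final = [seq for _, seq in current]
    return original == final


# Stand-in format (hypothetical): one "ID<TAB>sequence" line per record.
def write_tabbed(records):
    return "".join("%s\t%s\n" % (seq_id, seq) for seq_id, seq in records)


def parse_tabbed(text):
    return [tuple(line.split("\t")) for line in text.splitlines() if line]
```

With real modules, the list of steps for the example above would hold the Fasta, GenBank and GDE write/parse pairs in that order, applied to the records parsed from GDE file 1.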

Sample sequence files

GDE file 1
GDE file 2
Fasta file 1
Fasta file 2
GenBank file 1
GenBank file 2