Mandatory Project 1
The project is to be solved in groups of two or three people. It
should be handed in no later than Thursday, November 13th, 2003. The
next project will follow immediately.
Project description
- Design a data structure class, phylogeny, that represents a
tree. Write a module that can parse a string in the Newick tree format
and represent the tree in your phylogeny data structure.
- Write a suitable set of modules so that you can translate, in both
directions, between any two of the following sequence formats: Fasta,
GDE, and GenBank.
You may assume that Fasta files are always DNA (or, if you want to,
you can come up with some way of checking what type of sequence it
is), and that the first word after the > is a unique identifier for
the sequence. All modules should work with strings only; i.e., they
should not read or write files (reading a file into a string or a
list of strings is handled by a wrapper program; see below).
- Write a wrapper program which handles interaction with the user
and uses the above modules to translate between sequence formats.
- The program should present the user with a menu like the following
when started, and then prompt the user for a choice:
a) read sequence file, b) write sequence file, c) dna
translation, d) write xml file, e) read newick file.
- In case a) is chosen the user is presented with a new menu, namely
a list of sequence formats which the program can handle.
- After choosing one of these formats, the user is prompted for a
file name. Then the file is read (catch any exceptions and give an
error message), the appropriate filter is applied, and the sequences
read are now represented somehow as 'current sequences'.
- In case b) is chosen, the user chooses from the list of applicable
formats and is prompted for a file name, and the current sequences are
saved in one file in the chosen format. Give a suitable error message in
case there are no current sequences.
- In case c) is chosen, the user is prompted for a file name, and
the current sequences are translated from DNA into amino acid
sequences and written (in any form you wish) to the file. You may
print the amino acid sequences to standard output as well if you
like. If the current sequences are not DNA, the user is told
so.
- In case d) is chosen, the user is prompted for a file name, and
the current sequences are saved in a file in XML format. Give a
suitable error message in case there are no current sequences.
- In case e) is chosen, the user is prompted for a Newick tree file
name. The file is read and the Newick module is applied so that the
tree is now represented using the phylogeny data structure as the
'current tree'.
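As an illustration of case c), the following is a minimal sketch of a strings-only translation helper. It assumes the standard genetic code, translates in the first reading frame only, writes stop codons as '*', and ignores a trailing incomplete codon; none of these choices is prescribed by the project. The names is_dna and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the order TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G and T."""
    return bool(sequence) and set(sequence.upper()) <= set("ACGT")


def translate(dna):
    """Translate a DNA string into amino acids (assumption: frame 1, '*' for stop)."""
    if not is_dna(dna):
        raise ValueError("not a DNA sequence")
    dna = dna.upper()
    # Step through complete codons only; one or two leftover bases are ignored.
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


print(translate("ATGGCCTAA"))  # prints MA*
```

The wrapper program could catch the ValueError and tell the user that the current sequences are not DNA.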
Remarks
You may of course use anything you have seen in the course so far.
For example, you might employ the strategy illustrated in the figure
below. Part 1 is probably the hardest, so don't leave it until
the last day. If you're interested, read the hints. If not, just get the test function, which you can use to test
your Newick parser.
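For orientation only, here is one possible shape for a tree class and a recursive-descent Newick parser. This is a sketch, not the approach from the hints: the names Phylogeny, parse_newick and _parse_node are hypothetical, and quoted labels and bracketed comments are deliberately not handled.

```python
class Phylogeny:
    """A node in a rooted tree; a leaf is a node without children."""

    def __init__(self, name="", length=None, children=None):
        self.name = name
        self.length = length            # branch length to the parent, or None
        self.children = children or []

    def is_leaf(self):
        return not self.children

    def leaf_names(self):
        """Return the names of all leaves below this node, left to right."""
        if self.is_leaf():
            return [self.name]
        names = []
        for child in self.children:
            names.extend(child.leaf_names())
        return names


def parse_newick(text):
    """Parse a Newick string such as '(A:0.1,(B:0.2,C:0.3)D:0.4)root;'."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    node, pos = _parse_node(text, 0)
    if pos != len(text) - 1:
        raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
    return node


def _parse_node(text, pos):
    """Parse one subtree starting at text[pos]; return (node, next position)."""
    children = []
    if text[pos] == "(":
        pos += 1                        # skip '('
        while True:
            child, pos = _parse_node(text, pos)
            children.append(child)
            if text[pos] == ",":
                pos += 1                # another sibling follows
            elif text[pos] == ")":
                pos += 1                # end of this group of children
                break
            else:
                raise ValueError("expected ',' or ')' at position %d" % pos)
    # Optional node label; the final ';' guarantees that the loop terminates.
    start = pos
    while text[pos] not in ",():;":
        pos += 1
    name = text[start:pos].strip()
    # Optional branch length after ':'.
    length = None
    if text[pos] == ":":
        start = pos + 1
        pos = start
        while text[pos] not in ",();":
            pos += 1
        length = float(text[start:pos])
    return Phylogeny(name, length, children), pos


tree = parse_newick("(A:0.1,(B:0.2,C:0.3)D:0.4)root;")
print(tree.leaf_names())  # prints ['A', 'B', 'C']
```

Each call to _parse_node consumes exactly one subtree, so the nesting of parentheses in the string maps directly onto the recursion.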

Since the formats do not all store the same information about a
sequence, translating between them may force you to discard some of
it. It's up to you what you keep; our own
internal format (Isequence.py) keeps the type, ID and name (if given)
of a sequence along with the sequence itself. You may use and modify
this format as you wish.
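To make the idea of an internal format concrete, here is a minimal sketch of a record holding the four pieces of information mentioned above, together with a strings-only Fasta reader and writer. It is not the contents of Isequence.py; the names SequenceRecord, parse_fasta and write_fasta, the field names, and the 60-character line width are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """Hypothetical internal format: ID, sequence, optional name, and type."""
    seq_id: str
    sequence: str
    name: str = ""
    seq_type: str = "DNA"   # the project allows assuming Fasta files are DNA


def parse_fasta(text):
    """Parse Fasta text into records; the first word after '>' is the ID."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            name = header[1] if len(header) > 1 else ""
            records.append(SequenceRecord(seq_id, "", name))
        elif records:
            records[-1].sequence += line
        else:
            raise ValueError("sequence data before the first '>' header")
    return records


def write_fasta(records, width=60):
    """Return Fasta text for the records, wrapping sequences at `width`."""
    lines = []
    for record in records:
        header = ">" + record.seq_id
        if record.name:
            header += " " + record.name
        lines.append(header)
        for i in range(0, len(record.sequence), width):
            lines.append(record.sequence[i:i + width])
    return "\n".join(lines) + "\n"


records = parse_fasta(">seq1 first sequence\nACGT\nTTGA\n>seq2\nGGCC\n")
print(write_fasta(records), end="")
```

Because both functions take and return strings, reading and writing the actual files stays in the wrapper program.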
Documentation
Each group should hand in a short report. The report should include a
list of the program files and a brief explanation of the role each
file plays. The next mandatory projects will use parts of this
project; thus, if for some reason any part of this project is not
completed, the report should clearly state which part, and why.
The report should also include the full path to a directory in which
all relevant program files (and nothing else) are located, such that
if this directory and its contents are copied, the programs will work
when run from their new location. Make sure the files are readable and
don't change them after the deadline. Give instructions on how to run
the main wrapper program.
Each program file should be easily readable Python code. That is, put
plenty of comments in the programs, use logical variable names, put in
empty lines to delimit different parts of the program, etc. Don't
explain the obvious ("here we use a for loop to go through all
elements in the list"), just make sure everything is clearly
understandable.
Your program should work on all the given sample sequences (see
below). That means that if you translate, e.g., the sequences in GDE
file 1 to Fasta, then to GenBank format, then back to GDE format, the
actual sequences should match exactly the sequences in the original
file (of course you may have lost some additional information on the
way, but the sequences themselves should be unchanged).
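The round-trip requirement can be checked mechanically. The sketch below assumes each format module exposes a write function (records to string) and a parse function (string to records), and that a record is an (identifier, sequence) pair; these conventions and the name sequences_survive are hypothetical. A toy tab-separated format stands in for the real Fasta, GDE and GenBank converters.

```python
def sequences_survive(records, steps):
    """Return True if the sequences are unchanged after every conversion step.

    records: list of (identifier, sequence) pairs.
    steps:   list of (write, parse) function pairs, one per format, where
             write(records) returns a string and parse(string) returns records.
    """
    original = [sequence for _identifier, sequence in records]
    current = records
    for write, parse in steps:
        current = parse(write(current))
    return [sequence for _identifier, sequence in current] == original


# Toy stand-in format: one "identifier<TAB>sequence" line per record.
def write_tab(records):
    return "".join("%s\t%s\n" % (identifier, sequence)
                   for identifier, sequence in records)


def parse_tab(text):
    return [tuple(line.split("\t", 1)) for line in text.splitlines() if line]


records = [("seq1", "ACGTTTGA"), ("seq2", "GGCC")]
print(sequences_survive(records, [(write_tab, parse_tab), (write_tab, parse_tab)]))  # prints True
```

With real converters, the steps list would name the chain to test, for example Fasta, then GenBank, then GDE.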
Sample sequence files
GDE file 1
GDE file 2
fasta file 1
fasta file 2
GenBank file 1
GenBank file 2