| ![]() |
TBiB Q4/2006
|
Exercise: Wrapping ClustalW
In this exercise, we wrap communication with
the ClustalW
tool. This lets us import ClustalW's functionality for
multiple alignment and tree building into a python module.
MotivationWe would like a module implementing multiple alignment and tree building, but we do not want to implement the functionality ourselves; the functionality is already found in ClustalW, and we would like simply to reuse it from python. The best way to achieve this is to write a module of functions wrapping calls to ClustalW. The functions will behave similar to command-line calls to ClustalW, but make it easier to use the functionality in larger python scripts. In this exercise we will only wrap a small set of the full ClustalW functionality, but the exercise will (hopefully) teach you enough that, should you later need more of the functionality, you will be able to extend the module to include the needed functionality with little effort. Wrapping ClustalW
Web-resources:
Clustal W. We begin by examining the functionality provided by ClustalW. Clustal W is not installed globally on DAIMI, but I have compiled a version you can use. A Linux version can be found at ~mailund/linux/clustalw1.8, and a Sun version can be found at ~mailund/sunos/clustalw1.8. You can run ClustalW by specifying the complete path directly, i.e. as ~mailund/linux/clustalw1.8/clustalw, (for the Linux executable) or by adding the path ~mailund/linux/clustalw1.8/ to your PATH environment variable. There are two ways of interacting with ClustalW: either interactively through a menu system, or by setting command line options. To enter ClustalW in interactive mode, simply run the program:
$ clustalw
**************************************************************
******** CLUSTAL W (1.8) Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice:
We are, however, not really interested in interacting with ClustalW in this way — we want to interact with python — so we want to use ClustalW through its command line options. To get a list of the options type either clustalw -help or clustalw -options. There is quite a lot of options, but no worries, we will wrap it in small steps, a little functionality at a time. Multiple AlignmentThe first thing we want wrapped is multiple alignment. Examining the ClustalW options, we see that we need to call it with options like: clustalw -align -infile=foo.fasta This will produce a multiple alignment that will be written to the file foo.aln. To wrap ClustalW to build a multiple alignment, it would seem, we need to write the sequences to a file, call ClustalW with the filename as an -infile= option, and read back the resulting file. Exercise CW 1: Write a function align that, given a list of sequences, provides a multiple alignment of the sequences by calling ClustalW. We can specify whether the sequences being aligned are dna sequences or protein sequences using the -type= option: -type=dna for dna sequences and -type=protein for protein. We can select the distance matrix to use with the -matrix= option, for protein sequences, and with the -dnamatrix= option, for dna sequences. Start and extending costs for gaps can be set using -gapopen= and -gapext=. Exercise CW 2: Extend the align function so it selects the correct type, based on the sequences, and calls ClustalW with the correct type option. Consider an appropriate handling of calls with mixed dna and protein sequences. Exercise CW 3: Extend align so distance matrices and gap-costs can be specified as arguments. These arguments should have appropriate default values. Make sure to check that the choice of distance matrix matches the type of alignment. If we examine the output from running ClustalW we will see something like this: $ clustalw -align -infile=foo.fasta CLUSTAL W (1.8) Multiple Sequence Alignments Sequence format is Pearson Sequence 1: foobar1 15 bp Sequence 2: foobar2 15 bp Sequence 3: foobar3 13 bp Sequence 4: foobar4 13 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 93 Sequences (1:3) Aligned. Score: 100 Sequences (1:4) Aligned. Score: 92 Sequences (2:3) Aligned. Score: 92 Sequences (2:4) Aligned. Score: 92 Sequences (3:4) Aligned. Score: 92 Guide tree file created: [foo.dnd] Start of Multiple Alignment There are 3 groups Aligning... Group 1: Sequences: 2 Score:203 Group 2: Sequences: 3 Score:250 Group 3: Sequences: 4 Score:204 Alignment Score 365 CLUSTAL-Alignment file created [foo.aln] We would like to collect the group/score information and the total alignment score, in addition to the multiple alignment. To do that we need to parse the output. Exercise CW 4: Modify the align method such that it returns a tuple (align,score,groupScores) where align is the alignment, score is the total score of the alignment, and groupScores is a list of score-values, order with group 1 first. Tree Building
Web-resources:
The Newick tree format. We can build neighbour-joining trees using ClustalW: call ClustalW with the option -tree and with an input-file containing an alignment. The tree is written in the Newick format to a file with .ph suffix.
$ clustalw -align -type=dna -infile=foo.fasta > /dev/null
$ cat foo.aln
CLUSTAL W (1.8) multiple sequence alignment
foobar1 AGGTTGTATACTATC
foobar3 AGGTTGT--ACTATC
foobar2 AGGTTGTTTACTATC
foobar4 CGGTTGT--ACTATC
****** ******
$ clustalw -tree -infile=foo.aln > /dev/null
$ cat foo.ph
(
(
foobar1:0.01667,
foobar3:-0.01667)
:0.01667,
foobar2:0.01667,
foobar4:0.06026);
The redirection to /dev/null above might look odd to you, if you are not familiar with UNIX. The file /dev/null is a special file, in the sense that it just eats up everything we write to it; think of it as a waste basket for bits. So what happens when we redirection the output of a program to /dev/null is simply that we ignore the output; it is thrown away instead of written to the screen. Exercise CW 5: Write a function, buildTree, that, given a list of sequences, build a neighbour-joining tree using ClustalW. If we already have build an alignment, it is rather silly to have to re-build it just to make a tree. We should also be able to build a tree from an already constructed alignment. Exercise CW 6: Write a function, buildTreeFromAlignment that builds a neighbour-joining tree from a multiple alignment. Exercise CW 7: Re-factor (re-write) the function buildTree to use the functions buildTreeFromAlignment and align. Wrapping Clustal W Data in ClassesOur current design, where we work on lists of sequences and tuples of data, is sub-optimal. In the true spirit of object orientation, we would prefer to collect related data in objects, and to provide the functions on the data as methods on these objects. For this application, we can wrap our input list of sequences in an object on which we can calculate multiple alignments and trees. Exercise CW 8: Write a class, ClustalAlignment, encapsulating a list of sequences, with methods align and buildTree with identical functionality of the old functions with the same names. Replace the old functions with this class. The align method returns a tuple of information about the alignment. A better solution would be to provide a method, getAlignment, that return the alignment alone, and provide other methods for getting the other information. Since we do not want to re-align whenever we need different information about an existing alignment, we store the information in the object for the other accessors to get hold of. Exercise CW 9: Re-factor ClustalAlignment such that align simply stores alignment information in the object, and such that accessors getAlignment, getAlignmentScore, and getGroupScores give access to the information. If any of the accessors are called before an alignment has been build, they should automatically call align with default parameters. Exercise CW 10: Re-factor ClustalAlignment such that buildTree is split into two methods — similar to the split of align — where one method, buildTree builds the tree (calling align if necessary) and another method, getTree returns the tree. If getTree is called before buildTree has been called, it should automatically call buildTree. SummaryWe have written functions wrapping the ClustalW tool, providing multiple alignment and tree building functionality to our python scripts. We can now collect these functions in a module, clustalw, for future use. Make sure to keep this module, and keep it well factored for future use. |