Tool-Building in Bioinformatics

TBiB Q4/2006

BiRC / Courses / TBiB / Lecture Notes / Wrapping ClustalW

Exercise: Wrapping ClustalW

In this exercise, we wrap communication with the ClustalW tool. This lets us import ClustalW's functionality for multiple alignment and tree building into a python module.

Motivation

We would like a module implementing multiple alignment and tree building, but we do not want to implement the functionality ourselves; the functionality is already found in ClustalW, and we would like simply to reuse it from python.

The best way to achieve this is to write a module of functions wrapping calls to ClustalW. The functions will behave similar to command-line calls to ClustalW, but make it easier to use the functionality in larger python scripts.

In this exercise we will only wrap a small set of the full ClustalW functionality, but the exercise will (hopefully) teach you enough that, should you later need more of the functionality, you will be able to extend the module to include the needed functionality with little effort.

Wrapping ClustalW

We begin by examining the functionality provided by ClustalW.

Clustal W is not installed globally on DAIMI, but I have compiled a version you can use. A Linux version can be found at ~mailund/linux/clustalw1.8, and a Sun version can be found at ~mailund/sunos/clustalw1.8.

You can run ClustalW by specifying the complete path directly, i.e. as ~mailund/linux/clustalw1.8/clustalw, (for the Linux executable) or by adding the path ~mailund/linux/clustalw1.8/ to your PATH environment variable.

There are two ways of interacting with ClustalW: either interactively through a menu system, or by setting command line options. To enter ClustalW in interactive mode, simply run the program:

$ clustalw



 **************************************************************
 ******** CLUSTAL W (1.8) Multiple Sequence Alignments  ********
 **************************************************************


     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice:    
  

We are, however, not really interested in interacting with ClustalW in this way — we want to interact with python — so we want to use ClustalW through its command line options. To get a list of the options type either clustalw -help or clustalw -options. There is quite a lot of options, but no worries, we will wrap it in small steps, a little functionality at a time.

Multiple Alignment

The first thing we want wrapped is multiple alignment. Examining the ClustalW options, we see that we need to call it with options like:

  clustalw -align -infile=foo.fasta  

This will produce a multiple alignment that will be written to the file foo.aln.

To wrap ClustalW to build a multiple alignment, it would seem, we need to write the sequences to a file, call ClustalW with the filename as an -infile= option, and read back the resulting file.

Exercise CW 1: Write a function align that, given a list of sequences, provides a multiple alignment of the sequences by calling ClustalW.

We can specify whether the sequences being aligned are dna sequences or protein sequences using the -type= option: -type=dna for dna sequences and -type=protein for protein. We can select the distance matrix to use with the -matrix= option, for protein sequences, and with the -dnamatrix= option, for dna sequences. Start and extending costs for gaps can be set using -gapopen= and -gapext=.

Exercise CW 2: Extend the align function so it selects the correct type, based on the sequences, and calls ClustalW with the correct type option. Consider an appropriate handling of calls with mixed dna and protein sequences.

Exercise CW 3: Extend align so distance matrices and gap-costs can be specified as arguments. These arguments should have appropriate default values. Make sure to check that the choice of distance matrix matches the type of alignment.

If we examine the output from running ClustalW we will see something like this:

$ clustalw -align -infile=foo.fasta



 CLUSTAL W (1.8) Multiple Sequence Alignments



Sequence format is Pearson
Sequence 1: foobar1          15 bp
Sequence 2: foobar2          15 bp
Sequence 3: foobar3          13 bp
Sequence 4: foobar4          13 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  93
Sequences (1:3) Aligned. Score:  100
Sequences (1:4) Aligned. Score:  92
Sequences (2:3) Aligned. Score:  92
Sequences (2:4) Aligned. Score:  92
Sequences (3:4) Aligned. Score:  92
Guide tree        file created:   [foo.dnd]
Start of Multiple Alignment
There are 3 groups
Aligning...
Group 1: Sequences:   2      Score:203
Group 2: Sequences:   3      Score:250
Group 3: Sequences:   4      Score:204
Alignment Score 365
CLUSTAL-Alignment file created  [foo.aln]
  

We would like to collect the group/score information and the total alignment score, in addition to the multiple alignment. To do that we need to parse the output.

Exercise CW 4: Modify the align method such that it returns a tuple (align,score,groupScores) where align is the alignment, score is the total score of the alignment, and groupScores is a list of score-values, order with group 1 first.

Tree Building

We can build neighbour-joining trees using ClustalW: call ClustalW with the option -tree and with an input-file containing an alignment. The tree is written in the Newick format to a file with .ph suffix.

$ clustalw -align -type=dna -infile=foo.fasta > /dev/null
$ cat foo.aln
CLUSTAL W (1.8) multiple sequence alignment

 
foobar1         AGGTTGTATACTATC
foobar3         AGGTTGT--ACTATC
foobar2         AGGTTGTTTACTATC
foobar4         CGGTTGT--ACTATC
                 ******  ******
$ clustalw -tree -infile=foo.aln > /dev/null
$ cat foo.ph
(
(
foobar1:0.01667,
foobar3:-0.01667)
:0.01667,
foobar2:0.01667,
foobar4:0.06026);

The redirection to /dev/null above might look odd to you, if you are not familiar with UNIX. The file /dev/null is a special file, in the sense that it just eats up everything we write to it; think of it as a waste basket for bits. So what happens when we redirection the output of a program to /dev/null is simply that we ignore the output; it is thrown away instead of written to the screen.

Exercise CW 5: Write a function, buildTree, that, given a list of sequences, build a neighbour-joining tree using ClustalW.

If we already have build an alignment, it is rather silly to have to re-build it just to make a tree. We should also be able to build a tree from an already constructed alignment.

Exercise CW 6: Write a function, buildTreeFromAlignment that builds a neighbour-joining tree from a multiple alignment.

Exercise CW 7: Re-factor (re-write) the function buildTree to use the functions buildTreeFromAlignment and align.

Wrapping Clustal W Data in Classes

Our current design, where we work on lists of sequences and tuples of data, is sub-optimal. In the true spirit of object orientation, we would prefer to collect related data in objects, and to provide the functions on the data as methods on these objects.

For this application, we can wrap our input list of sequences in an object on which we can calculate multiple alignments and trees.

Exercise CW 8: Write a class, ClustalAlignment, encapsulating a list of sequences, with methods align and buildTree with identical functionality of the old functions with the same names. Replace the old functions with this class.

The align method returns a tuple of information about the alignment. A better solution would be to provide a method, getAlignment, that return the alignment alone, and provide other methods for getting the other information.

Since we do not want to re-align whenever we need different information about an existing alignment, we store the information in the object for the other accessors to get hold of.

Exercise CW 9: Re-factor ClustalAlignment such that align simply stores alignment information in the object, and such that accessors getAlignment, getAlignmentScore, and getGroupScores give access to the information. If any of the accessors are called before an alignment has been build, they should automatically call align with default parameters.

Exercise CW 10: Re-factor ClustalAlignment such that buildTree is split into two methods — similar to the split of align — where one method, buildTree builds the tree (calling align if necessary) and another method, getTree returns the tree. If getTree is called before buildTree has been called, it should automatically call buildTree.

Summary

We have written functions wrapping the ClustalW tool, providing multiple alignment and tree building functionality to our python scripts.

We can now collect these functions in a module, clustalw, for future use. Make sure to keep this module, and keep it well factored for future use.