| |
Scripting 2005
|
Distributed BLAST and Clustal W

In this project, we use the framework we developed for distributed computing to build a system for aligning the BAC clones from the previous project. We will build distributed BLAST and Clustal W services, using the first to position clones on the human genome and the second to build multiple alignments of the clones.

Motivation

We want to construct multiple sequence alignments of the various vertebrate contigs we constructed in the previous project, exploiting the knowledge about how the contigs fit together that is obtainable from the targets at NISC. Each contig we built in the last project is located in a known region of the human genome, but the exact location of each contig is not known. Using BLAST, we can position the ends of our contigs on the human genome, and in this way position the entire contigs. Having positioned several contigs on the same region of the human genome, we can extract the sequences where the contigs overlap and build a multiple sequence alignment of them using Clustal W.

This process can naturally be parallelized: the BLAST searches for the various contig ends are independent of each other and can run in parallel, and each target can be aligned independently of the others.

Distributed BLAST

We wish to use BLAST to position the contigs on the human genome, and for this we want to implement a distributed BLAST service. To keep it simple, we just want a wrapper around MEGABLAST with output type 3 (i.e. calling MEGABLAST with -D 3). You can find a description of the format, and documentation for MEGABLAST in general, in /users/mailund/hs-genome/blast/megablast.txt. The executable itself (for Linux) is found at /users/mailund/hs-genome/blast/megablast.

Exercise BLAST.1: Design an interface for a MEGABLAST wrapper. You do not need to support any output format other than type 3, but you should consider returning this output in a form other than plain text.
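One way to return the output in a form other than plain text is to parse each line into a typed record. The sketch below assumes that output type 3 has the 12 tab-separated columns of the common BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score). That column list is an assumption and should be checked against megablast.txt. The names `Hit` and `parse_hits` are hypothetical, and the sample lines are made-up data, not real MEGABLAST output.

```python
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    """One alignment line (assumed 12-column tab-delimited layout)."""
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float
    bit_score: float


def parse_hits(text: str) -> Iterator[Hit]:
    """Yield a Hit per data line; skip blank lines and '#' comment lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError(f"expected 12 columns, got {len(f)}: {line!r}")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
                  int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                  float(f[10]), float(f[11]))


# Made-up example lines for illustration only.
SAMPLE = (
    "# hypothetical header comment\n"
    "contig1_left\tchr7\t98.50\t400\t6\t0\t1\t400\t115000\t115399\t0.0\t745.0\n"
    "contig1_right\tchr7\t99.00\t300\t3\t0\t1\t300\t214700\t214999\t1e-150\t560.0\n"
)

hits = list(parse_hits(SAMPLE))
```

Records like these can be filtered or sorted by field name on the client side, for example by `subject_start`, without the client having to know the column order.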
You do not need to support general databases either, but can stick to the human genome chromosomes found in /users/mailund/hs-genome/. Make sure that the database to search is an option to a search. Consider what the option interface should look like (general strings, or something that lets Python validate options?). If you do not support all of MEGABLAST's options, describe why you have chosen the ones you do support (and consider the intended use of the distributed BLAST service described below when making your choice).

Exercise BLAST.2: Implement your wrapper and turn it into an RMI service. To do this you will have to make the wrapper an object if it isn't one already. Consider how input and output should be passed to the service, and which objects should be remote and which should be serialized. Should the results, for example, be serialized as one object and sent back, or should the service provide an iterator over the results, similar to MySQL's various fetch methods? Justify your choices.

Exercise BLAST.3: Build a MEGABLAST client: a script that takes the sequences to be blasted plus any options on the command line and performs the blasting using the distributed service. Consider how the blasting can be parallelized, but be aware that blasting single sequences one at a time can be much less efficient than blasting a set of sequences against the same database in one call to MEGABLAST, distributed or not, so do not try to be too clever here. You might want to consider the getopt module for option parsing.

Distributed Clustal W

You have already implemented a Clustal W wrapper in previous exercises, but we now want to make a distributed version.

Exercise CLUSTALW.1: Turn your wrapper into an RMI service. To do this you will have to make the wrapper an object if it isn't one already. Consider how input and output should be passed to the service, and which objects should be remote and which should be serialized.
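The choice between returning one serialized result and exposing a fetch-style iterator can be made concrete with a small cursor class. This is only a sketch of the idea, independent of the course's RMI framework: `ResultCursor`, `Row`, and the example rows are hypothetical, and the method names are borrowed from the Python DB-API fetch methods. In a distributed setting the cursor would presumably be the remote object, while the rows it returns would be serialized.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical row shape: (query, subject, start, end).
Row = Tuple[str, str, int, int]


class ResultCursor:
    """Hands out result rows one at a time, in batches, or all at once."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: Iterator[Row] = iter(rows)

    def fetchone(self) -> Optional[Row]:
        """Return the next row, or None when the results are exhausted."""
        return next(self._rows, None)

    def fetchmany(self, size: int = 100) -> List[Row]:
        """Return up to `size` rows; an empty list means nothing is left."""
        return list(islice(self._rows, size))

    def fetchall(self) -> List[Row]:
        """Return all remaining rows as one list (the 'one object' option)."""
        return list(self._rows)


# Made-up rows for illustration only.
rows = [("q1", "chr7", 10, 50), ("q2", "chr7", 60, 90), ("q3", "chr7", 95, 120)]
cursor = ResultCursor(rows)
first = cursor.fetchone()      # one row per call: many small round trips
batch = cursor.fetchmany(5)    # a batch per call: fewer, larger round trips
rest = cursor.fetchall()       # nothing left after the batch above
```

With a cursor, the client chooses the batch size and so trades the number of remote calls against the size of each reply; `fetchall` covers the case where a single serialized result is preferred.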
Exercise CLUSTALW.2: Build a Clustal W client: a script that takes the sequences to be aligned plus any options on the command line and performs the alignment using the distributed service.

Multiple Sequence Alignment of NISC BAC Clones (Optional)

This part is optional: you do not need to do it, and it will not affect your grade either way. It is, however, the part that ties the other two parts together, and I bet you will have fun doing it if you choose to. We can use these two services to make multiple sequence alignments of the NISC BAC clones from the previous project in the following way:
Exercise MSA.1*: Place the assembled contigs on the human genome by:
Exercise MSA.2*: Use the placement of the endpoints to find out where a list of contigs on the same target overlap, then cut out the overlapping sequences and align them using Clustal W.

Evaluating the Project

If you have signed up to get credit for the course, you should hand in a short report on this project. You should describe the design decisions you have made (see the various examples above) and any difficulties you have had with the design and implementation. You should also describe how the code was tested. Your report should include the source code for your solution. The reports should be handed in July 1st.

Summary

We have developed distributed services for BLAST and Clustal W in order to parallelize the construction of multiple alignments of the NISC BAC clone contigs from the previous project. We have then (optionally) used these services to build multiple alignments of the clones. |