Introduction
============

This text documents version 1.0 of the alignment program COMBAT (an
acronym for COMBined AlignmenT), which implements the combined
DNA/protein alignment method described in

  Christian N. S. Pedersen, Rune B. Lyngsø and Jotun Hein.
  "Comparison of coding DNA" in Proceedings of the 9th Annual
  Symposium of Combinatorial Pattern Matching (CPM), 1998.

The current implementation is written in C. It computes and outputs an
optimal alignment according to the DNA/protein score function described
in the above paper.

COMBAT runs in time O("length of sequence 1" * "length of sequence 2"),
but with a large constant, and uses space proportional to

  ("length of sequence 1" * "length of sequence 2") * 4 bytes + 3 Mb.

E.g. alignment of sequences of length 300 consumes about 3.3 Mb, and
alignment of sequences of length 2000 consumes about 19 Mb.

Below are instructions on how to compile and use the program. If you
have any ideas to improve the program, or discover any bugs, please
contact:

  Tejs Sharling (tejs@daimi.au.dk)
  Christian N. S. Pedersen (cstorm@daimi.au.dk)

Implementation
==============

COMBAT version 1.0 includes the following files:

  Makefile
  sequence.c   sequence.h
  cost.c       cost.h
  distance.c   distance.h
  combat.c
  combine.c
  convert.c
  input.c      input.h
  my_errors.c  my_errors.h

which compile to three programs: combat, convert and combine.

'combat' takes as input two sequences, an amino-acid distance matrix, a
nucleotide distance matrix and two affine gap functions (one for the
amino-acid level and one for the nucleotide level) and produces an
optimal alignment of the two sequences in the DNA/protein model with
respect to these parameters.

'convert' converts a similarity matrix into a distance matrix, using
the inter-row distance method described in

  W.R. Taylor and D.T. Jones, "Deriving an Amino Acid Distance Matrix"
  in Journal of Theoretical Biology (1993) 164, 65-85.
'combine' takes two sequences and an alignment generated by combat, and
prints the alignment in the usual textual representation. One can
specify the width of the print. The default is 75. If the width is set
to 0, each sequence is printed on a single line.

Compilation
===========

To compile the programs do

  make convert
  make combine
  make combat

or just

  make

Running combat
==============

Run combat with

  combat <control file>

where <control file> is the name of a file that describes which two
sequences to align and which distance matrices and gap-cost functions
to use. The format of <control file> is as follows.

  >inputfile "<sequence file>"
  >outputfile "<alignment file>"
  >distance matrix "<amino-acid matrix file>"
  >nucleotide matrix "<nucleotide matrix file>"
  >gap functions
  protein: Palfa + Pbeta*k
  dna: Dalfa + Dbeta*k

where

---- "<sequence file>" is the name of a file which describes the two
sequences to align. The format of this file is as follows.

  >name of sequence one
  <sequence>
  >name of sequence two
  <sequence>

where [atcgATCG] in <sequence> is considered a nucleotide and
everything else is ignored (a ';' means that the rest of the line is
ignored), e.g.

  >seqA
  agtcgcatgact
  acgactgac
  >seqB
  actatgagga
  ctcgactcggg

---- "<alignment file>" is the name of the file to which the computed
alignment is written. The file will contain data on the form

  x1 y1
  x2 y2
   :  :
  xn ym

that represents the path of the optimal alignment in the alignment
graph. Use 'combine' to produce the usual textual representation of the
alignment (see below).

---- "<amino-acid matrix file>" is the name of the file that describes
the amino-acid distance matrix. The file is on the form:

       Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
  Ala   v
  Cys   v   v
  Asp   v   v   v
   :    :   :   :   .
  Tyr   v   v   v   v   v   v   v   v   v   v   v   v   v   v   v   v   v   v   v   v

where v is any positive real number value, e.g. 23.233, and the
amino-acid names must be omitted. The matrix is required to be a metric
and the maximal distance between any two amino-acids should not exceed
Palfa+2*Pbeta, i.e. v <= Palfa+2*Pbeta for all v.

---- "<nucleotide matrix file>" is the name of the file that describes
the nucleotide distance matrix.
The file is on the form:

       A    T    C    G
  A   0.0
  T   2.0  0.0
  C   2.0  2.0  0.0
  G   2.0  2.0  2.0  0.0

where the A,T,C,G must be omitted.

---- The gap-cost functions can handle any positive real number values,
as long as Palfa>=2*Pbeta and Dalfa>=2*Dbeta, e.g.

  >gap functions
  protein: 20.78 + 7.28*k
  dna: 4.5 + 2.0*k

combat runs in time O(length of sequence 1 * length of sequence 2), but
with a large constant, and uses space proportional to (length of
sequence 1 * length of sequence 2)*1100 bytes, e.g. two sequences of
length 300 need about 100 Mbytes and two sequences of length 600 need
about 400 Mbytes! The space usage can be reduced to linear in the
length of the sequences by an alternative implementation.

Running convert
===============

Run 'convert' on a file that holds the similarity matrix. The matrix in
the file should be on the same form as the distance matrix 'combat'
uses, described earlier.

Running combine
===============

Run 'combine' with

  combine <sequence file> <alignment file>

where the files have the same format as the corresponding files that
'combat' uses, described earlier. 'combine' prints an alignment on the
form:

  >seqA
  agtcgcatgactc---acgactgac---
  >seqB
  act---atga---ggactcgactccggg

It is also possible to view an alignment graphically using gnuplot and
the output from combat directly, e.g. if the output of combat is placed
in 'combat.aln' then type

  gnuplot

and in gnuplot do

  set grid; plot 'combat.aln' with lines

Example
=======

The directory 'example' contains different example files. Go to the
directory and type

  ../combat combat.ctl

to run combat on the sequences in combat.seq using PAM250_distance.m as
amino-acid distance matrix, nucleotide_distance.m as nucleotide
distance matrix, and gap-cost functions:

  protein: 20 + 8*k
  dna: 8 + 2*k

The result in the file combat.aln can be viewed by gnuplot by typing

  gnuplot show

or in textual representation using 'combine' by typing

  ../combine combat.seq combat.aln
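
The parameter constraints stated under "Running combat" (Palfa>=2*Pbeta,
Dalfa>=2*Dbeta, and v <= Palfa+2*Pbeta for every amino-acid distance v)
can be checked before starting a long run. The following is a minimal
sketch in C of such a check. It is not part of the COMBAT sources; the
function names gap_ok and within_bound are hypothetical and chosen only
for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not from the COMBAT sources.
 * Returns 1 if the affine gap function alfa + beta*k uses positive
 * values and satisfies alfa >= 2*beta, and 0 otherwise. */
int gap_ok(double alfa, double beta)
{
    return alfa > 0.0 && beta > 0.0 && alfa >= 2.0 * beta;
}

/* Hypothetical helper, not from the COMBAT sources.
 * Returns 1 if none of the n matrix entries in v exceeds
 * palfa + 2*pbeta, and 0 otherwise. */
int within_bound(const double *v, size_t n, double palfa, double pbeta)
{
    double bound = palfa + 2.0 * pbeta;
    size_t i;

    for (i = 0; i < n; i++) {
        if (v[i] > bound)
            return 0;
    }
    return 1;
}
```

With the gap-cost functions from the example above (protein: 20 + 8*k,
dna: 8 + 2*k) both gap functions pass the check, and the entries of the
amino-acid distance matrix may be at most 20 + 2*8 = 36.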