Beyond Phylogenies

Exercise sheet (mainly shamelessly stolen from Yun S. Song).

Exercise I. Viewing Recombination Animations

In this exercise, we will use an on-line animation program to visualise ancestral histories where recombination events can occur.

  1. Start a web browser and go to http://www.coalescent.dk/.

  2. Click on "Hudson animator" under Animation and then click on "Recombination". On the mid-right side of the window, you should see two boxes where you can specify the number of samples (denoted n) and the scaled recombination rate (denoted ρ).

  3. (ρ = 0):
    To get familiarised with the animator, let us first try the case where the recombination rate is zero.
    1. Set n=9 and ρ=0, then click on the button which says "Recalc". If you want to speed up the animation, you can change the speed value shown on the lower left corner of the animation window.

    2. You should be able to see a tree being constructed. Note that the vertices are colour-coded. Red is used for the initial samples, green for coalescent events, and blue for recombination events (but we are not seeing any of those yet, since ρ is zero).

    3. Move your cursor to any vertex. You should see that the vertex under your cursor, as well as its immediate parent and descendant verticies, get highlighted by enclosed circles. On the lower right hand side of the animation window, you can see the time associated to the vertex. Also shown in that box is a long bar denoting the genetic material associated to the vertex (currently, the information here is quite dull, but it gets more interesting when we add recombinations). If a region is coloured, then it means that the region is ancestral to at least one sample at time 0. In our current example, the entire region is coloured because there is no recombination events. We will return to this point later in the exercise.

    4. Try to move around your cursor to visit different vertices. Move your cursor to the uppermost vertex. Notice that the bar in the lower-right corner assocated to the vertex's genetic material is coloured green, whereas all other vertices have red bars assocated to them. What is special about the uppermost vertex?

  4. (ρ ≠ 0):
    1. Set n=9 and ρ = 1, and then click on "Recalc". If you don't see any blue vertices in the animated history, re-click on "Recalc" until you do.

    2. Move your cursor to a blue vertex in the animated history and examine the box in the lower-right corner. What does it mean?

    3. In the upper-right corner of the animation window, click on the tab denoted "Trees". By clicking at various places in the horizontal region just below the tree, observe how different trees are supported at different positions in the gene.

    4. Go back to the "Animation" section, change ρ to 10, and "Recalc" you animation. Not the increase in the number of recombination vertices.

    5. Move around your cursor to different vertices and examine the box in the lower-right corner. Do you see how ancestral materials are fragmented? Also, some bars may be partially coloured green. Referring back to 3.d. of the current exercise, can you figure out what it indicates?

Exercise II. Generating Simulated Data

Assume the infinite-sites model of mutation, which implies that at most one mutation event per site can occur in the evolutionary history of sequences.

  1. To get an idea of what kind of implication the infinite-sites model of mutation has on ancestral inference, consider the following example:

    Suppose that we are given a sample of four sequences, whose 2 particular polymorphic sites are:

    0 0
    0 1
    1 0
    1 0
    A mutation event at some point in time changed a 0 to a 1 or a 1 to a 0 in each site. Under the infinite-sites model of mutation, can you describe the evolutionary history of both polymorphic sits by a single tree? If not, what could be a possible evolutionary history?

    Side remark: Two polymorphic sits are called incompatible if all four combinations, 00, 01, 10, and 11 occur. If two sites are incompatible, then it indicates that at least one recombination must have occurred in the evolutionary history of the samples such that the breakpoint lies between the two sites.

  2. ms is a useful program for generating polymorphism data under the infinite-sites model of mutation. The basic syntax for generating samples while incorporating recombination is as follows: ms nseq nsim -t θ -r ρ nsites -T
    1. "nseq" denotes the number of sequences in the sample.

    2. "nsim" denotes the number of simulations to perform.

    3. θ is the scaled mutation rate.

    4. ρ is the scaled recombination rate.

    5. The program uses a finite-sites uniform recombination model. "nsites" denote the number of sites to be used in the recombination model. Mutation events still occur according to the infinite-sites model of mutation. (A different simulator, slightly more general and slightly more complicated to use, CoaSim uses both an infinite-sites mutation model and an infinite-sites recombination model, but otherwise simulates histories under the same underlying theoretical model).

    6. -T means you want the local trees to be included in the output.

  3. Try the following command: ms 7 1 -t 3 -r 3 3000 -T

    1. In the output, "segsites" denote the number of polymorphic sites generated.

    2. Note that the simulated polymorphism data for the 7 sequences are shown at the bottom of the output. Do you observe any incompatible pairs of sites?

    3. Local trees in the Newick format are shown immediately after the sign "//". Can you make sense of the tree notation? Try to draw the first tree shown.

      The notation "[k]" preceding a tree means that the tree describes the evolutionary history for the next k sites. What do you get if you add up the integers shown inside "[]"? You should get 3000, the "nsites" chosen for this example.

    4. Even if there are no incompatibilities in the simulated polymorphism data, there may be more than one local tree reported in the output; i.e. different sites could be described by different trees. Can you think of different scenarios which could lead to such cases?

  4. In the infinite-sites model of mutation, the expected number of polymorphic sites for a given number of sequences is proportional to θ. As you carry out the following exercises, compare the changes in the number of polymorphic sites with the number of local trees.

    1. Repeat the following command 10 times: ms 7 1 -t 1 -r 3 3000 -T

    2. Do the same for θ = 10.

    3. Do the same for θ = 100.

  5. See msdoc.pdf for further description of the program.

Exercise III. Counting Local Trees

In this exercise, we will try to examine how the number of local trees depend on recombination and gene conversion rates. In what follows, we will use ms with nseq=7, θ=15, and nsites=3000.

  1. Run 500 simulations and obtain the average number of local trees reported for the following values of ρ:

    1. ρ=0.0001
    2. ρ=1
    3. ρ=10
    4. ρ=100

    Scripting hints: You can use grep and wc to count the number of trees like this:

    ms 7 1 -t 1 -r 3 3000 -T | grep "^\[" | wc -l

    and use this in a script to run your simulations.

    You can script the counting and calculating averaged using a scripting language such as Python

    import sys, os
    if len(sys.argv) != 2:
        print 'Usage:', sys.argv[0], 'rho'
        sys.exit(2)
    
    rho = float(sys.argv[1])
    msCmd = 'ms 7 1 -t 1 -r %f 3000 -T | grep "^\[" | wc -l' % rho
    
    totalCount = 0
    noRuns = 500
    for i in xrange(noRuns):
        totalCount += int(os.popen(msCmd).read())
    print 'Avg:', float(totalCount) / noRuns 

    Perl

    if ($#ARGV != 0) {
        print "Usage: count.pl rho\n";
        exit 2;
    }
    
    $rho = $ARGV[0];
    $msCmd = "ms 7 1 -t 1 -r $rho 3000 -T | grep '^\\[' | wc -l";
    
    $totalCount = 0;
    $noRuns = 500;
    for (my $i = 0; $i < $noRuns; ++$i) {
        $totalCount += `$msCmd`;
    }
    print "Avg: ", $totalCount / $noRuns, "\n";
    

    or Ruby.

    if (ARGV.length != 1) then
        print "Usage: count.rb rho\n"
        exit 2
    end
    
    $rho = ARGV[0]
    
    $totalCount = 0.0
    $noRuns = 500
    $noRuns.times do
      $totalCount += `ms 7 1 -t 1 -r #{$rho} 3000 -T | grep '^\\[' | wc -l`.to_f
    end
    print "Avg: ", $totalCount / $noRuns, "\n" 
  2. Gene conversion can be included in the simulations using option -c as follows: ms nseq nsim -t θ -r ρ nsites -T -c f λ

    Here, f = g/r, where g is the gene conversion rate per site and r is the recombination rate per "nsites". λ denotes the mean gene conversion tract length. In this part of the exercise, let ρ=1 and λ=50. Run 500 simulations and obtain the average number of local trees reported for the following values of f:

    1. f = 0.0001
    2. f = 1
    3. f = 10
    4. f = 100

    How do these compare with part 1.a.?

Exercise IV: Minimal Recombination Histories

We now look at parsimonious evolutionary histories, in the sense of histories that require the minimal number of recombinations (assuming the infinite-sites mutation model).

For this, we use the Beagle suite of software.

  1. The program beagle calculates the minimal number of recombinations absolutely needed to explain a list of sequences.

    The input to beagle is a list of lines similar to the sequences at the end of ms' output — in fact we can directly take these and input them to beagle.

    A script such as the following (or similar in Python or Perl) can translate ms' output in to beagle's input.

    #!/bin/env ruby
    
    STDIN.each { |line| if line =~ /^positions/ then break end }
    STDIN.each { |line| print line } 

    Try simulating a few datasets using ms and then run them through beagle. Start with small datasets! How does the running time of beagle grow with the input size?

  2. beagle counts the minimal number of recombinations by essentially reconstructing all histories with first 0, then 1, then 2, etc. recombinations leading to the observed sequences, until it finds one that leads back to a single ancestral sequence.

    The reason it gets so slow as the size increases is that there are a lot of such histories.

    We can count the number of states (not histories, but sets of sequences) that are reachable between the input sequences and an ancestral sequence reachable by the minimal number of recombinations. For this we use the tool greven (danish for Count, but not as in counting but as in nobility...Rune has a sick sense of humour) that takes input in the same format as beagle.

    Try simulating datasets with ms and run them through greven. This time you should really start with small datasets!

  3. The number of local trees in the ms output tells you how many recombination occurred in the history; there is one more local tree that the number of recombinations.

    We will now try to compare the true number of recombinations with the minimal number of recombinations.

    For this, we need to count the number of local trees at the same time as running the sequences through beagle. We can do this by modifying our scripts from above, e.g. (in Ruby):

    #!/bin/env ruby
    
    if (ARGV.length != 2) then
        print "Usage: count.rb theta rho\n"
        exit 2
    end
    
    $theta = ARGV[0]
    $rho = ARGV[1]
    
    simResult = `./msdir/ms 7 1 -t #{$theta} -r #{$rho} 3000 -T`
    f = File.open('beagle.input','w')
    
    noTrees = 0
    inSeq = false
    simResult.each_line do
      |line|
      noTrees += 1 if line =~ /^\[/
      inSeq = true if line =~ /^positions/
      next         if line =~ /^positions/
      f.print line if inSeq
    end
    
    f.close
    
    print 'True number of recombinations: ', noTrees-1, "\n"
    print `./beagle < beagle.input`  

    Run this for a number of different choices of ρ and θ and calculate the average for each (ρ,θ) point; compare the minimal number of recombinations and the actual number of recombinations.