Publications

Journal Papers
To Appear
A coalescent model incorporating allele specific mutation rates: application to the estimation of effective population size from microsatellite polymorphism data
T. Bataillon, A.A.T. Smith, T.Mailund and A-C. Thuillet
To appear in Genetics
[Abstract]
Recent experimental evidence has demonstrated that mutation rates of microsatellites are allele-specific rather than locus-specific. Although numerous population genetics models have been devised to accommodate the peculiarities of the microsatellite mutational model, allowing for stepwise mutation events, virtually no population genetics model incorporating variation in mutation rate among alleles at a same locus has been studied. We have developed a simple urn model that allows the simulation of microsatellite data from a stationary Wright Fisher population that incorporates allele-specific mutation patterns. An approximate Bayesian framework for the estimation of long term effective size, Ne, is proposed. The method uses rejection sampling to approximate the likelihood of joint summary statistics describing microsatellite polymorphism data. The performance of the method is examined using Monte Carlo simulations of datasets. Last, we illustrate our method by re-analyzing data from an earlier study documenting patterns of microsatellite polymorphism before and after domestication in durum wheat (Triticum dicoiccoides) where the relationship between allele length and allele mutability has been established from direct observations of mutation rates.
2009
Ancestral population genomics: the coalescent hidden Markov model approach
Julien Y Dutheil, Ganesh Ganapathy, Asger Hobolth, Thomas Mailund, Marcy K Uyenoyama, and Mikkel H Schierup
Genetics 183, 259–274 2009. doi:10.1534/genetics.109.103010
[Abstract]
With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their genomes. This feature can be used to infer population genetics parameters in ancestral species. We use a hidden Markov model to infer the genealogical state along a four-species alignment with ILS. Coalescent theory is used to parametrize the hidden Markov model in terms of population genetics quantities. The transition probabilities and emission probabilities can be expressed as functions of the population parameters, speciation times, ancestral effective population sizes and recombination rate, and estimated from a genome alignment of closely related species. We analyze a basic, panmictic demographic model and study its properties using an extensive set of coalescent simulations. We assess the effect of the model assumptions, and demonstrate that the Markov property provides a good approximation to the ancestral recombination graph. Using a too restricted set of possible genealogies, necessary to reduce the computational load, can bias parameter estimates. We propose a simple correction for this bias, and suggest directions for future extensions of the model. We show that the patterns of ILS along a sequence alignment undergoing recombination can also be recovered effciently and that the ancestral recombination rate can be recovered. Finally, we introduce an extension of the basic model that allows for substitution rate heterogeneity, and reanalyze Human-Chimpanzee-Gorilla-Orangutan alignments using the new models.
Using biological networks to search for interacting loci in genomewide association studies
M. Emily, T. Mailund, J. Hein, L. Schauser and M.H. Schierup
European Journal of Human Genetics; doi: 10.1038/ejhg.2009.15
[Abstract]
Genome-wide association studies have identified a large number of Single-Nucleotide Polymorphisms (SNPs) that individually predispose to diseases. Genome-wide association studies have identified a large number of Single-Nucleotide Polymorphisms (SNPs) that individually predispose to diseases. However many genetic risk factors remain unaccounted for. Proteins coded by genes interact in the cell and it is likely that certain variants mainly affect the phenotype in combination with other variants, termed epistasis. An exhaustive search for epistatic effects is computationally demanding, since several billions of SNP pairs exist for typical genotyping chips. In this study, the experimental knowledge on biological networks is used to narrow the search for two-locus epistasis. We provide evidence that this approach is computationally feasible and statistically powerful. By applying this method to the Wellcome Trust Case-Control Consortium (WTCCC) data sets, we report four significant cases of epistasis between unlinked loci, in susceptibility to Crohn's disease, bipolar disorder, hypertension and rheumatoid arthritis.
Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups
M.H. Schierup, T. Mailund, H. Li, J. Wang, A. A. Tjønneland, U. Vogel, L. Bolund and B. Nexø.
BMC Evolutionary Biology, 10(20); doi:10.1186/1471-2350-10-20
[Abstract]
Background. A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes of which 3, ERCC1, ERCC2, and PPP1R13L (aka RAI) are related to DNA repair and cell survival, and one, CD3EAP, aka ASE1, may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents and markers in it are associated with risk of several cancers.
Methods. We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk.
Results. We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the Fst value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping.
Conclusions. The combined observation that a common haplotype causing an increased risk of cancer in Europeans and a high differentiation between human populations is highly unusual and suggests a causal relationship with a recent increase in Europeans caused either by genetic drift overruling selection against the susceptibility variant or a positive selection for the same haplotype. The data does not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.
Local phylogeny mapping of quantitative traits: Higher accuracy and better ranking than single marker association in genomewide scans
S. Besenbacher, T. Mailund and M.H. Schierup
Genetics, 181, 747-753 2009; doi:10.1534/genetics.108.092643
[Abstract]
We present a new method, termed QBlossoc, for linkage disequilibrium (LD) mapping of genetic variants underlying a quantitative trait. The method builds on a previously published method, Blossoc, for LD mapping of case-control studies. The method builds local genealogies along the genome and looks for a significant clustering of quantitative trait values in these trees. We analyse its efficiency in terms of localisation and ranking of true positives among a large number of negatives and compare the results with single marker approaches. Simulation results of markers at densities comparable to genotype chips show that QBlossoc is more accurate in localisation of true positives as expected since it uses the additional information of LD between markers simultaneously. More importantly for genome wide surveys, QBlossoc ranks true positives significantly higher than single marker approaches, again suggesting that a true signal displays itself more strongly in a set of adjacent markers than a spurious (false) signal. The method is both memory and CPU efficient. It has been tested on a real data set of height data for 5000 individuals measured at 317K markers and completed analysis within 5 CPU days. The QBlossoc program with full documentation can be downloaded through www.daimi.au.dk/~mailund/.
A fast algorithm for genome wide haplotype pattern mining
S. Besenbacher, C.N.S. Pedersen and T. Mailund
BMC Bioinformatics, 10(Suppl 1): S74 doi:10.1186/1471-2105-10-S1-S74
[Abstract]
Background: Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a diseases. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach to do exactly this. Originally developed for candidate gene analysis, its running time is not optimised for the much larger genome wide data sets in use today.
Results: We present a new, faster algorithm for the HPM method. The new approach patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method with a factor of 2 on a genome wide dataset with 5009 individuals typed in 491208 markers using default parameters and more if the pattern length is increased.
Conclusions: The new algorithm significantly speeds up the HPM method and makes it feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.
2008
SNPFile - A software library and file format for large scale association mapping and population genetics studies
J. Nielsen and T. Mailund
BMC Bioinformatics 2008, 9:526; doi:10.1186/1471-2105-9-526
[Abstract]
Background: High-throughput genotyping technology has enabled cost effective typing of thousands of individuals in hundred of thousands of markers for use in genome wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
Results: We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods.
Conclusions: The new file format has been very useful for our own studies where it has significantly reduced the informatics burden in keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation with the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. Through scripting interfaces to the file format, we hope to alleviate this in the future.
Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genome Data
Z. Ding, T. Mailund and Y.S. Song
Bioinformatics 24(19):2215-2221; doi:10.1093/bioinformatics/btn406
[Abstract]
Motivation: Recent advances in genotyping technology has made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale data sets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this paper, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological data sets.
Investigating Selection on Viruses: A Statisical Alignment Approach
S. de Groot, T. Mailund, G.A. Lunter and J. Hein
BMC Bioinformatics 9(304) 2008, doi:10.1186/1471-2105-9-304
[Abstract]
Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.
Computing the All-Pairs Quartet Distance on a Set of Evolutionary Trees
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal, and R. Fagerberg
Journal of Bioinformatics and Computational Biology 6(1) 37–50 2008.
Linked version is a preprint of an article submitted for consideration in Journal of Bioinformatics and Computational Biology © 2008 World Scientific Publishing Company
Special Issue journal version of Computing the All-Pairs Quartet Distance on a Set of Evolutionary Trees APBC'07.
[Abstract]
We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations thus performing significantly better on large sets of trees compared to performing distinct pairwise distance calculations, as we illustrate experimentally, where we see a speedup factor of around 130 in the best case.
2007
On Recombination Induced Multiple and Simultaneous Coalescent Events
J. Davies, F. Simancik, R. Lyngsø, T. Mailund, and J. Hein
Genetics 177: 2151–2160. doi:10.1534/genetics.107.071126
[Abstract]
Coalescent Theory is almost ubiquitous in contemporary molecular population genetics. Inherent in most applications is a continuous time approximation that assumes sample size is small relative to the actual population size. This assumption in effect precludes simultaneous and multiple coalescent events, which can constitute an arbitrarily large component when sample size is sufficiently large. In most situations this is justifiably ignored as a large sample size will only have few ancestors a couple of generations back and then the assumption is valid. However, in tracing the evolutionary history of large chromosomal segments, a large recombination rate will consistently keep the number of ancestors large such that multiple and simultaneous coalescent events cannot be ignored. This can create a major disparity between discrete time and continuous time models and we here show its importance illustrated with parameters typical of the human genome. The presence of gene convergence only aggravates its importance. This could seriously undermine the application of coalescent theory to complete genomes. However, it can be shown that multiple and simultaneous coalescent events influences global quantities, such as total number of ancestors, but has negligible effect on local quantities, such as linkage disequilibrium or similarities of close local trees. Reassuringly the majority of applications of coalescent models with recombination are based on local quantities for purposes such as association mapping.
Annotation of Selection Strengths in Viral Genomes
S. McCauley, S. de Groot, T. Mailund, and J. Hein
Bioinformatics 23(22): 2978–2986. doi:10.1093/bioinformatics/btm472.
[Abstract]
Motivation: Viral genomes tend to code in overlapping reading frames to maximize information content. This may result in atypical codon bias and particular evolutionary constraints. Due to the fast mutation rate of viruses, there is additional strong evidence for varying selection between intra- and intergenomic regions. The presence of multiple coding regions complicates the concept of Ka/Ks ratio, and thus begs for an alternative approach when investigating selection strengths. Building on the paper by McCauley & Hein (2006), we develop a method for annotating a viral genome coding in overlapping reading frames. We introduce an evolutionary model capable of accounting for varying levels of selection along the genome, and incorporate it into our prior single sequence HMM methodology, extending it now to a phylogenetic HMM. Given an alignment of several homologous viruses to a reference sequence, we may thus achieve an annotation both of coding regions as well as selection strengths, allowing us to investigate different selection patterns and hypotheses.
Results: We illustrate our method by applying it to a multiple alignment of four HIV2 sequences, as well as four Hepatitis B sequences. We obtain an annotation of the coding regions, as well as a posterior probability for each site of the strength of selection acting on it. From this we may deduce the average posterior selection acting on the different genes. Whilst we are encouraged to see in HIV2, that the known to be conserved genes gag and pol are indeed annotated as such, we also discover several sites of less stringent negative selection within the env gene. The the best of our knowledge we are the first to subsequently provide a full selection annotation of the Hepatitis B genome by explicitly modelling the evolution within overlapping reading frames, and not relying on the simple Ka/Ks ratios.
Comparative Annotation of Viral Genomes with Non-Conserved Gene Structure
S. de Groot, T. Mailund, and J. Hein
Bioinformatics 23(9):1080–1089. doi:10.1093/bioinformatics/btm078.
[Abstract]
Motivation: Detecting genes in viral genomes is a complex task. Due to the biological necessity of them being constrained in length, RNA viruses in particular tend to code in overlapping reading frames. Since one amino acid is encoded by a triplet of nucleic acids, up to three genes may be coded for simultaneously in one direction. Conventional HMM based gene finding algorithms may find it difficult — if not impossible — to identify multiple coding regions, since in general their topologies do not allow for the presence of overlapping or nested genes. Comparative methods have therefore been restricted to likelihood ratio tests on potential regions as to being double or single coding, using the fact that the constrictions forced upon multiple-coding nucleotides will result in atypical sequence evolution. Exploiting these same constraints, we present a hidden Markov model based gene-finding program, which allows for coding in unidirectional nested and overlapping reading frames, to annotate two homologous aligned viral genomes. Our method does not insist on conserved gene structure between the two sequences, thus making it applicable for the pairwise comparison of more distantly related sequences.
Results: We apply our method to 15 pairwise alignments of six different HIV2 genomes. Given sufficient evolutionary distance between the two sequences, we achieve sensitivity of about 84% and specificity of about 97%. We additionally annotate three pairwise alignments of the more distantly related HIV1 and HIV2, as well as of two different Hepatitis Viruses, attaining results of ~87% sensitivity and ~98.5% specificity. We subsequently incorporate prior knowledge by "knowing" the gene structure of one sequence and annotating the other conditional on it. Boosting accuracy close to perfect we demonstrate that conservation of gene structure on top of nucleotide sequence is a valuable source of information, especially in distantly related genomes.
Genomic relationships and speciation times of human, chimpanzee and gorilla inferred from a coalescent hidden Markov model
A. Hobolth, O.F. Christensen, T. Mailund and M.H. Schierup
In PLoS Genetics, 3(2) 2007; doi:10.1371/journal.pgen.0030007
[Abstract]
The genealogical relationship of human, chimpanzee and gorilla varies along the genome. We develop a hidden Markov model (HMM) that incorporates this variation and relate the model parameters to population genetics quantities such as speciation times and ancestral population sizes. The HMM is an analytically tractable approximation to the coalescent process with recombination, and in simulations we see no apparant bias in the HMM estimates. We apply the HMM to four autosomal contiguous human-chimp-gorilla-orangutan alignments comprising a total of 1.9 million base pairs. We find a very recent speciation time of human-chimp (4.1 +/- 0.4 million years), and fairly large ancestral effective population sizes (65,000 +/- 30,000 for the human-chimp ancestor and 45,000 +/- 10,000 for the human-chimp-gorilla ancestor). Furthermore, around 50% of the human genome coalesces with chimpanzee after speciation with gorilla. We also consider 250 thousand base pairs of X-chromosome alignments and find an effective population size much smaller than 75% of the autosomal effective population sizes. Finally, we find that the rate of transitions between different genealogies correlate well with the regionwide present day human recombination rate, but does not correlate with the fine-scale recombination rates and recombination hot spots, suggesting that the latter are evolutionary transient.
Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
In Future Generation Computer Systems 2007 23 580–586. doi:10.1016/j.future.2006.09.003.
Special Issue journal version of Initial Experiences with GeneRecon on MiG GCA'05.
[Abstract]
We report on our experiences so far with running a bioinformatics simulation study on a newly developed grid architecture. We briefly describe the application bioinformaticsan association mapping algorithm for both locating disease loci and separating cases into those diseased due to genetic factors and those diseased solely due to factors environmentand describe the grid architecture and how the application is set up to run on the grid.
2006
Whole genome association mapping by incompatibilities and local perfect phylogenies
T. Mailund, S. Besenbacher, and M.H. Schierup
BMC Bioinformatics 2006 7(454). doi:10.1186/1471-2105-7-454.
[Abstract]
Background: With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis to become the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed.
Results: We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method is investigated on 1) simulated genotype data under different models of disease determination 2) artificial data sets created from the HapMap ressource, and 3) data sets used for testing of other methods in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate. However, when it comes to more complex scenarios of mutation heterogeneity and more complex haplotype structure such as found in the HapMap data our method outperforms SMA as well as other fast, data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM) despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variants in an empirical data set—the ΔF508 mutation for cystic fibrosis—where the susceptibility variant is already known—and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60kb from the CYP2D6 gene.
Conclusions: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
Fast Calculation of the Quartet Distance Between Trees of Arbitrary Degrees
C. Christiansen, T. Mailund, C.N.S. Pedersen, M. Randers, and M. Stissing
Algorithms for Molecular Biology 2006, 1(16); doi:10.1186/1748-7188-1-16.
[Abstract]
Background: A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets—sub-trees induced by four leaves—that differs between the trees. Mostly, these algorithms are restricted to work on binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.
Results: We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees T and T', the algorithm runs in time O(n+|V||V'|min{id, id'}) and space O(n+|V||V'|), where n is the number of leaves in the two trees, V and V' are the non-leaf nodes in T and T', respectively, and id and id' are the maximal number of non-leaf nodes adjacent to a non-leaf node in T and T', respectively. The fastest algorithms previously published for arbitrary degree trees run in O(n3) (independent of the degree of the tree) and O(|V||V'| id id'), respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.
Conclusions: We present a new algorithm for computing the quartet distance between two trees of arbitrary degree. The new algorithm improves the asymptotic running time for computing the quartet distance, compared to previous methods, and experimental results indicate that the new method also performs significantly better in practice.
GeneRecon—A coalescent based tool for fine-scale association mapping
T. Mailund, M.H. Schierup, C.N.S. Pedersen, J.N. Madsen, J. Hein, and L. Schauser
Bioinformatics 2006 22 (18): 2317–2318; doi:10.1093/bioinformatics/btl153.
[Abstract]
GeneRecon is a tool for fine-scale association mapping using a coalescence model. GeneRecon takes as input case-control data from phased or unphased SNP and micro-satellite genotypes. The posterior distribution of disease locus position is obtained by Metropolis Hastings sampling in the state space of genealogies. Input format, search strategy, and the sampled statistics can be configured through the Guile Scheme programming language embedded in GeneRecon, making GeneRecon highly configurable.
The effective size of the Icelandic population and the prospects for LD mapping: inference from unphased microsatellite markers
T. Bataillon, T. Mailund, S. Thorlacius, E. Steingrimsson, T. Rafnar, M.M. Halldorsson, V. Calian, and M.H. Schierup
European Journal of Human Genetics 2006, 14, 1044–1053. doi:10.1038/sj.ejhg.5201669.
[Abstract]
Characterizing the extent of linkage disequilibrium (LD) in the genome is a pre-requisite for association mapping studies. Patterns of LD also contain information about the past demography of populations. On this study we focus on the Icelandic population where LD was investigated in 12 regions of ~15 cM using regularly spaced microsatellite loci. In total 1753 individuals were genotyped for 179 markers. LD was estimated using a composite disequilibrium measure based on unphased data. LD decreases with distance in all 12 regions and can be detected over approximately 4 cM for the given sample size. Differences in the patterns of decrease of LD with distance among genomic regions were mostly due to two regions exhibiting respectively higher and lower proportions of pairs in LD than average within the first 4 cM. We pooled data from all regions except these two and summarized patterns of LD by computing the proportion of pairs of loci exhibiting significant LD (at the 5% level) as a function of distance. We compared observed patterns of LD with simulated datasets obtained under scenarios with varying demography and intensity of recombination. We show that unphased data allow to make inferences on scaled recombination rates from patterns of LD. Patterns of LD in Iceland suggest a genome-wide scaled recombination rate of ρ* = 200 [130–330] per cM which is equivalent to a long term effective population size of ~5000 in the range of estimates recently reported in three populations using extensive SNPs data from the HapMap project. We discuss the implication of our findings for association mapping studies using the Icelandic population and expected pattern of LD between SNPs markers.
Recrafting the Neighbor-Joining Method
T. Mailund, G.S. Brodal, R. Fagerberg, C.N.S. Pedersen, and D. Phillips
BMC Bioinformatics 2006, 7(29). doi:10.1186/1471-2105-7-29.
[Abstract]
Background: The neighbor-joining method by Saitou and Nei is a widely used method for constructing phylogenetic trees. The formulation of the method gives rise to a canonical Θ(n3) algorithm upon which all existing implementations are based.
Methods: In this paper we present techniques for speeding up the canonical neighbor-joining method. Our algorithms construct the same phylogenetic trees as the canonical neighbor-joining method. The best-case running time of our algorithms are O(n2) but the worst-case remains O(n3). We empirically evaluate the performance of our algoritms on distance matrices obtained from the Pfam collection of alignments.
Results: The experiments indicate that the running time of our algorithms evolve as Θ(n2) on the examined instance collection. We also compare the running time with that of the QuickTree tool, a widely used efficient implementation of the canonical neighbor-joining method.
Conclusions: The experiments show that our algorithms also yield a significant speed-up, already for medium sized instances.
2005
CoaSim: A Flexible Environment for Simulating Genetic Data under Coalescent Models
T. Mailund, M.H. Schierup, C.N.S. Pedersen, P.J.M. Mechlenborg, J.N. Madsen, and L. Schauser
BMC Bioinformatics 2005, 6(252). doi:10.1186/1471-2105-6-252.
[Abstract]
Background Coalescent simulations are playing a large role in interpreting large scale intra- polymorphism surveys and for planning and evaluating association studies. Coalescent of data sets under different models can be compared to the actual data to test different evolutionary factors and thus get insight into these.
Results We have created the CoaSim application as a flexible environment for Monte various types of genetic data under equilibrium and non-equilibrium coalescent variety of applications. Interaction with the tool is through the Guile version scripting language. Scheme scripts for many standard and advanced applications these can easily be modified by the user for a much wider range of applications. interface with less functionality and flexibility is also included. It is primarily exploratory and educational tool.
Conclusions CoaSim is a powerful tool because of its flexibility and ease of use. This is illustrated varied uses of the application, e.g. evaluation of association mapping methods, bootstrapping, and design and choice of markers for specific questions.
Pigs in Sequence Space: A 0.66X Coverage Pig Genome Survey based on Shotgun Sequencing
R.Wernersson, M.H. Schierup, F.G. Jørgensen, J. Gorodkin, F. Panitz, H.H. Staerfeldt, O.F. Christensen, T. Mailund, H. Hornshøj, A. Klein, W. Jun, B. Liu, S. Hu, W. Dong, W. Li, G.K.-S. Wong, J. Yu, J. Wang, C. Bendixen, M. Fredholm, S. Brunak, H. Yang and L. Bolund
BMC Genomics 2005, 6(70). doi:10.1186/1471-2164-6-70.
[Abstract]
Background Comparative whole genome analysis of Mammalia can benefit from the addition of more species. The pig is an obvious choice due to its economic and medical importance as well as its evolutionary position in the artiodactyls.
Results We have generated ~ 3.84 million shotgun sequences (0.66X coverage) of the pig genome. The data are hereby released together with an initial evolutionary analysis. The non-repetitive fraction of the sequences was aligned to the UCSC human-mouse alignment and the resulting three-species alignments were annotated using the human genome annotation. Ultra-conserved elements and miRNAs were identified. The results show that for each of these types of orthologous data, pig is much closer to human than mouse is. Purifying selection has been more efficient in pig compared to human, but not as efficient as in mouse, and pig seems to have an isochore structure most similar to the structure in human.
Conclusion The addition of the pig to the set of species sequenced at low coverage adds to the understanding of selective pressures that have acted on the human genome by bisecting the evolutionary branch between human and mouse with the mouse branch being approximately 3 times as long as the human branch, and the joint alignment of the shot-gun sequences to the human-mouse alignment offers a rapid way for the investigator to define specific regions for analysis and resequencing.
RBT—A Tool for Building Refined Buneman Trees
S. Besenbacher, T. Mailund, L. Westh-Nielsen, C.N.S. Pedersen
Bioinformatics, 21(8) 1711–1712, 2005.
[Abstract]
We have developed a tool implementing an efficient algorithm for refined Buneman tree reconstruction. The algorithm—which has the same complexity as the neighbour-joining method and the (plain) Buneman tree construction—enables refined Buneman tree reconstruction on large taxa sets.
2004
QuickJoin—Fast Neighbour-Joining Tree Reconstruction
T. Mailund and C.N.S. Pedersen
Bioinformatics, 20(17): 3261–3262, 2004.
[Abstract]
We have built a tool for fast construction of very large phylogenetic trees. The tool uses heuristics for speeding up the neighbour-joining algorithm—while still constructing the same tree as the original neighbour-joining algorithm—making it possible to construct trees for ~8000 species in less than ten minutes on a single desktop PC. In comparison, the same task takes more than 30 minutes using the QuickTree neighbour-joining implementation.
QDist—Quartet Distance Between Evolutionary Trees
T. Mailund and C.N.S. Pedersen
Bioinformatics, 20(10): 1636–1637, 2004.
[Abstract]
QDist is a program for computing the quartet distance between two unrooted evolutionary trees, i.e. the number of quartet topology differences between the two trees, where a quartet topology is the topological subtree induced by four species. The implementation is based on an algorithm with running time O(n log² n), which makes it practical to compare large trees.
Exploiting Equivalence Reduction and the Sweep-Line Method for Detecting Terminal States
J. Billington, G.E. Gallasch, L.M. Kristensen, and T. Mailund
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans. 34(1): 23–37.
[Abstract]
State space exploration is one of the main approaches to computer-aided verification and analysis of finite-state systems, and can be used to reason about a wide range of properties during the design phase of a system, including desired and undesired terminal states. Several state space reduction methods have been developed to alleviate the state explosion problem inherent in methods based on state space exploration. In this paper we develop algorithms for combining two of these methods: the sweep-line method and the equivalence reduction method. The algorithms allow terminal states of the system to be reported on-the-fly during the state space exploration. We demonstrate the usefulness of the algorithms by applying them to an industrial case study. These experimental results show that the combined methods achieves a better reduction of the state space than when either methods are used in isolation.
Conference Papers
2010
Building very large neighbour-joining trees
M. Simonsen, T. Mailund and C.N.S. Pedersen
Proceedings of Bioinformatics 2010.
[Abstract]
The neighbour-joining method by Saitou and Nei is a widely used method for phylogenetic reconstruction, made popular by a combination of computational efficiency and reasonable accuracy. With its cubic running time by Studier and Kepler, the method scales to hundreds of species, and while it is usually possible to infer phylogenies with thousands of species, tens or hundreds of thousands of species is infeasible. Recently we developed a simple branch and bound heuristic, RapidNJ, which significantly reduces the running time on average. However, the O(n2) space consumption of the RapidNJ method, and the NJ method in general, becomes a problem when inferring phylogenies with 10000+ taxa. In this paper we present two extentions of RapidNJ which reduce memory requirements and enable RapidNJ to infer very large phylogenetic trees efficiently. We also present an improved search heuristic for RapidNJ which improves RapidNJs performance on many data sets of all sizes.
2009
A sub-cubic time algorithm for computing the quartet distance between two general trees
T. Mailund, J. Nielsen, and C.N.S. Pedersen
Proc. of IJCBS 2009.
This is a fixed version of A quadratic time algorithm for computing the quartet distance between two general trees that turned out to be less quadratic than hoped...
[Abstract]
We derive a sub-cubic time and space algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree ≥ 3. The time and space complexity of our algorithm is quadratic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm for computing the quartet distance between general trees independent of the degree of the inner nodes.
A fast algorithm for genome wide haplotype pattern mining
S. Besenbacher, C.N.S. Pedersen and T. Mailund
In the Proc of APBC 2009, Tsinghua University Press, 772-780.
[Abstract]
Background: Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a diseases. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach to do exactly this. Originally developed for candidate gene analysis, its running time is not optimised for the much larger genome wide data sets in use today.
Results: We present a new, faster algorithm for the HPM method. The new approach patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method with a factor of 2 on a genome wide dataset with 5009 individuals typed in 491208 markers using default parameters and more if the pattern length is increased.
Conclusions: The new algorithm significantly speeds up the HPM method and makes it feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.
2008
Rapid neighbour-joining
M. Simonsen, T. Mailund and C.N.S. Pedersen
Proceedings of the Workshop on Algorithms in Bioinformatics (WABI) 2008, LNBI Volume 5251, Springer, 113-122 doi:10.1007/978-3-540-87361-7_10
[Abstract]
The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The cri- teria for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour- joining implementations.
2007
Computing the All-Pairs Quartet Distance on a Set of Evolutionary Trees
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal, and R. Fagerberg
Proceedings of the Asia-Pacific Bioinformatics Conference (APBC) 2007, pp. 91–100, Series on Advances in Bioinformatics and Computational Biology, vol. 5, Imperial College Press.
Presentation slides (first part of): [PDF] [Open Office]
[Abstract]
We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations thus performing significantly better on large sets of trees compared to performing distinct pairwise distance calculations, as we illustrate experimentally, where we see a speedup factor of around 130 in the best case.
Computing the Quartet Distance between Evolutionary Trees of Bounded Degree
M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal, and R. Fagerberg
Proceedings of the Asia-Pacific Bioinformatics Conference (APBC) 2007, pp. 101–110, Series on Advances in Bioinformatics and Computational Biology, vol. 5, Imperial College Press.
Presentation slides (second part of): [PDF] [Open Office]
[Abstract]
We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of n species. The previous best algorithm has running time O(d2n2) when considering trees, where no node is of more than degree d. The algorithm developed herein has running time O(d9n log n)) which makes it the first algorithm for computing the quartet distance between non-binary trees which has a sub-quadratic worst case running time.
2005
Algorithms for Computing the Quartet Distance between Trees of Arbitrary Degree
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of Workshop on Algorithms in Bioinformatics (WABI), 2005, LNBI 3692, pp. 77-88 © Springer-Verlag.
Full length version of Computing the Quartet Distance between Evolutionary Trees of Bounded Degree (extended abstract).
[Abstract]
We present two algorithms for computing the quartet distance between trees of arbitrary degree. The quartet distance between two unrooted evolutionary trees is the number of quartets—sub-trees induced by four leaves—that differs between the trees. Previous algorithms focus on computing the quartet distance between binary trees. In this paper, we present two algorithms for computing the quartet distance between trees of arbitrary degrees. One in time O(n3) and space O(n2) and one in time O(n2d2) and space O(n2), where n is the number of species and d is the maximal degree of the internal nodes of the trees. We experimentally compare the two algorithms.
Quartet Distance between General Trees (extended abstract)
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of International Conference on Numerical Analysis and Applied Mathematics (ICNAAM) 2005, pp. 796-799 © Wiley-VCH Verlag GmbH & Co.
Presentation slides: [PDF] [Open Office]
[Abstract]
We present two algorithms for computing the quartet distance between trees of arbitrary degree. The quartet distance between two unrooted evolutionary trees is the number of quartets—sub-trees induced by four leaves—that differs between the trees. Previous algorithms focus on computing the quartet distance between binary trees. In this paper, we present two algorithms for computing the quartet distance between trees of arbitrary degrees. One in time O(n3) and space O(n2) and one in time O(n2d2) and space O(n2), where n is the number of species and d is the maximal degree of the internal nodes of the trees. We experimentally compare the two algorithms.
Initial experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
Proceedings of The 2005 International Conference on Grid Computing and Applications (GCA'05)
[Abstract]
We report on very early experiences with running a bioinformatics simulation study on a newly developed grid architecture. We briefly describe the bioinformatics application—an association mapping algorithm for both locating disease loci and separating cases into those diseased due to genetic factors and those diseased solely due to environment factors—and describe the grid architecture and how the application is set up to run on the grid.
2004
Obtaining Memory-Efficient Reachability Graph Representations Using the Sweep-Line Method
T. Mailund and M. Westergaard
Proceedings of Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2004), LNCS 2988 pp. 177-191 © Springer-Verlag. The original publication is available on LINK at http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2988&spage=177.
[Abstract]
This paper is concerned with a memory-efficient representation of the reachability graphs. We describe a technique that enables us to represent each reachable marking in a number of bits close to the theoretical minimum needed for explicit state enumeration. The technique maps each state vector into a number between zero and the number of reachable states and uses the sweep-line method to delete the state vectors themselves. A prototype of the proposed technique has been implemented and experimental results are reported.
2003
Efficient Path Finding with the Sweep-Line Method using External Storage
L.M. Kristensen and T. Mailund
Proceedings of International Conference on Formal Engineering Methods (ICFEM 2003) LNCS 2885 pp. 319-337 © Springer-Verlag. The original publication is available on LINK at http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2885&spage=319.
[Abstract]
The sweep-line method deletes states on-the-fly during state space exploration to reclaim memory and thereby reduce peak memory usage. This deletion of states prohibits the immediate generation of, e.g., an error-trace when the violation of a safety property is detected. We address this problem by combining the sweep-line method with storing a spanning tree of the explored state space in external storage on a magnetic disk. We show how this allows us to easily obtain paths in the state space, such as error-traces. A key property of the proposed technique is that it avoids searching in external storage during the state space exploration and gives the same reduction in peak memory usage as the stand-alone sweep-line method. The subsequent generation of the path then requires one seek on disk for each state on the path. We evaluate the proposed technique on a number of example systems by means of an implementation, and compare its performance to a related technique. These practical experiments demonstrate how the suggested technique complements existing techniques based on using external storage.
2002
A Compositional Sweep-Line State Space Exploration Method
L.M. Kristensen and T. Mailund
Proceedings of Formal Description Techniques for Distributed Systems and Communication Protocols (FORTE 2002), LNCS 2529 pp. 327-343 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2529/25290327.pdf.
[Abstract]
State space exploration is a main approach to verification of finite-state systems. The sweep-line method exploits a certain kind of progress present in many systems to reduce peak memory usage during state space exploration. We present a new sweep-line algorithm for a compositional setting where systems are composed of subsystems. The compositional setting makes it possible to divide subsystem progress measures into monotone and non-monotone progress measures to further reduce peak memory usage. We show that in a compositional setting, it is possible to automatically obtain a progress measure for the system by combining the progress measure for each subsystem to a progress measure for the full system.
A Generalised Sweep-Line Method for Safety Properties
L.M. Kristensen and T. Mailund
Proceedings of Formal Methods Europe (FME 2002), LNCS 2391 pp. 549-567 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2391/23910549.pdf.
[Abstract]
The recently developed sweep-line method exploits progress present in many concurrent systems to explore the full state space of the system while storing only small fragments of the state space in memory at a time. A disadvantage of the sweep-line method is that it relies on a monotone and global notion of progress. This prevents the method from being used for many reactive systems. In this paper we generalise the sweep-line method such that it can be used for verifying safety properties of reactive systems exhibiting local progress. The basic idea is to relax the monotone notion of progress and to recognise the situations where this could cause the state space exploration not to terminate. The generalised sweep-line method explores all reachable states of the system, but may explore a state several times. We demonstrate the practical application of the generalised sweep-line method on two case studies demonstrating a reduction in peak memory usage to typically 10% compared to the use of ordinary full state spaces.
Analysing Infinite-State Systems by Combining Equivalence Reduction and the Sweep-Line Method
T. Mailund
Proceedings of International Conference on Application and Theory of Petri Nets (ICATPN 2002), LNCS 2360 pp. 314-333 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2360/23600314.pdf.
[Abstract]
The sweep-line method is a state space exploration method for on-the-fly verification aimed at systems exhibiting progress. Presence of progress in the system makes it possible to delete certain states during state space generation, which reduces the memory used for storing the states. Unfortunately, the same progress that is used to improve memory performance in state space exploration often leads to an infinite state space: The progress in the system is carried over to the states resulting in infinitely many states only distinguished through the progress. A finite state space can be obtained using equivalence reduction, abstracting away the progress, but in its simplest form this removes the progress property required by the sweep-line method. In this paper we examine a new method for using equivalence relations to obtain a finite set of classes, without compromising the progress property essential for the sweep-line method. We evaluate the new method on two case studies, showing significant improvements in performance, and we briefly discuss the new method in the context of Timed Coloured Petri Nets, where the `increasing global time' semantics can be exploited for more efficient analysis than what is achieved using a `delay' semantics.
Sweep-Line State Space Exploration for Coloured Petri Nets
G.E. Gallasch, L.M. Kristensen, and T. Mailund
Proceedings of Fourth Workshop on Practical Use of Coloured Petri Nets and the CPN Tools (CPN 2002). DAIMI PB-560 pp. 101-119, Aarhus, Denmark, 2002.
[Abstract]
The basic idea of the sweep-line state space method is to exploit a formal notion of progress found in many concurrent and distributed systems. Exploiting progress makes it possible to sweep through the state space of a CP-net while storing only a small fragment of the states in memory at any time. Properties of the system can then be verified on-the-fly during the sweep of the state space. This can lead to significant savings in peak memory usage and computation time. Examples of systems possessing progress are transport protocols, transactions protocols, workflow models, and systems modelled with Timed CP-nets. We present Sweep/CPN, a library extension to Design/CPN supporting the sweep-line method, and demonstrate its use.
2001
Condensed State Spaces for Timed Petri Nets
S. Christensen, L.M. Kristensen and T. Mailund
Proceedings of International Conference on Application and Theory of Petri Nets (ICATPN 2001), LNCS 2075 pp. 101-120 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2075/20750101.pdf.
[Abstract]
We present a state space method for Petri nets having a time concept based on a global clock and associating time stamps to tokens. The method is based on equivalence on states and makes it possible to condense the usually infinite state space of such timed Petri nets into a finite state space without loosing analysis power. The practical application of the method is demonstrated on a large example of an audio/video protocol by means of a computer tool implementing the method.
A Sweep-Line Method for State Space Exploration
S. Christensen, L.M. Kristensen and T. Mailund
Proceedings of Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2001), LNCS 2031 pp. 450-464 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2031/20310450.pdf.
[Abstract]
We present a state space exploration method for on-the-fly verification. The method is aimed at systems for which it is possible to define a measure of progress based on the states of the system. The measure of progress makes it possible to delete certain states on-the-fly during state space generation, since these states can never be reached again. This in turn reduces the memory used for state space storage during the task of verification. Examples of progress measures are sequence numbers in communication protocols and time in certain models with time. We illustrate the application of the method on a number of Coloured Petri Net models, and give a first evaluation of its practicality by means of an implementation based on the Design/CPN state space tool. Our experiments show significant reductions in both space and time used during state space exploration. The method is not specific to Coloured Petri Nets but applicable to a wide range of modelling languages.
State Space Methods for Timed Coloured Petri Nets
S. Christensen, K. Jensen, T. Mailund and L.M. Kristensen
Proceedings of 2nd International Colloquium on Petri Net Technologies for Modelling Communication Based Systems. pp. 14-15, Berlin Germany, September 2001.
[Abstract]
We present two recently developed state space methods for timed coloured Petri nets which reconciles state space methods and time concepts of coloured Petri nets. The first method is based on an equivalence relation on states which makes it possible to condense the usually infinite state space of a timed coloured Petri net into a finite condensed state space without serious loss of analysis power. The second method, the sweep-line method, exploits the increasing model time of timed CP-nets to store only small fragments of the full state space in memory at a time thereby alleviating the the state explosion problem. The performance and potential of the methods is demonstrated on an industrial case study by means of a computer tool implementing the two methods.
2000
Separation of Style and Content with XML in an Interchange Format for High-level Petri Nets
T. Mailund and K.H. Mortensen
Proceedings of the Meeting on XML/SGML based Interchange Formats for Petri Nets, pp. 7-12, Aarhus, Denmark, 2000.
[Abstract]
Style sheets have been proposed by the World Wide Web Consortium (W3C) as a means for separating presentation and content in World Wide Web documents, and they are now widely used and generally popular. In this paper we use the same idea for an XML interchange format proposal for high-level Petri nets. There are several benefits of using this design principle: It is easier to exchange content-only for non-graphics tools, and alternative styles can conveniently be replaced to make a new graphical appearance. The ideas presented in this paper are illustrated by means of examples.
1999
Parameterised Coloured Petri Nets
T. Mailund
Proceedings of Second Workshop on Practical Use of Coloured Petri Nets and Design/CPN, DAIMI PB-541 pp. 133-151, Aarhus, Denmark, 1999.
[Abstract]
In this paper we examine coloured Petri nets extended with parameters. We characterise three kinds of parameterisation and formally define Parameterised Coloured Petri Nets. We then discuss how parameterised coloured Petri nets can be used to create libraries of coloured Petri net modules in the same way as libraries for programming languages. Finally we discuss how to implement a simple simulator for such modules.
1998
Textual Interchange Format for High-Level Petri Nets
R.B. Lyngsø and T. Mailund
Proceedings of First Workshop on Practical Use of Coloured Petri Nets and Design/CPN, DAIMI PB-532 pp. 47-64 Aarhus, Denmark, 1998.
[Abstract]
In this paper a text format for high-level Petri net diagrams is presented. The text format is designed to serve as a platform-independent file format for the Design/CPN tool. It is consistent with the forthcoming standard for high-level Petri nets. The text format may also be seen as our contribution to the development of an open, tool-independent interchange format for high-level Petri nets. The text format will make it possible to move Design/CPN diagrams between all supported hardware platforms and versions. It is also designed to be a bridge to other Petri net tools, e.g., other analysis tools which the user may want to use with Design/CPN diagrams. The proposed text format does not address any standardisation for the inscription language used in the diagram. It is, however, possible to extend the format to incorporate such a standardisation. The text format is designed for the exchange of hierarchical coloured Petri nets but the structure is general enough to cope with other high-level Petri nets as well. The text format presented here has been implemented as part of Design/CPN version 3.1.
Technical Reports
2004
SplitDist—Calculating Split-Distances for Sets of Trees
T. Mailund
May 24, 2004.
[Abstract]
We present a tool for comparing a set of input trees, calculating for each pair of trees the split-distances, i.e., the number of splits in one tree not present in the other.
2003
Speeding Up Neighbour-Joining Tree Construction
G.S. Brodal, R. Fagerberg, T. Mailund, C.N.S. Pedersen, and D. Phillips
ALCOMFT-TR-03-102, ALCOM-FT, November 2003.
Presentation slides: [PostScript] [Open Office]
[Abstract]
A widely used method for constructing phylogenetic trees is the neighbour-joining method of Saitou and Nei. We develope heuristics for speeding up the neighbour-joining method which generate the same phylogenetic trees as the original method. All heuristics are based on using a quad-tree to guide the search for the next pair of nodes to join, but differ in the information stored in quad-tree nodes, the way the search is performed, and in the way the quad-tree is updated after a join. We empirically evaluate the performance of the heuristics on distance matrices obtained from the Pfam collection of alignments, and compare the running time with that of the QuickTree tool, a well-known and widely used implementation of the standard neighbour-joining method. The results show that the presented heuristics can give a significant speed-up over the standard neighbour-joining method, already for medium sized instances.
2002
A Compositional Sweep-Line State Space Exploration Method—EXTENDED VERSION
L.M. Kristensen and T. Mailund
Dept. of Computer Science, University of Aarhus, April 2002
[Abstract]
State space exploration is a main approach to verification of finite-state systems. The sweep-line method exploits a certain kind of progress present in many systems to reduce peak memory usage during state space exploration. We present a new sweep-line algorithm for a compositional setting where systems are composed of subsystems. The compositional setting makes it possible to divide subsystem progress measures into monotone and non-monotone progress measures to further reduce peak memory usage. We show that in a compositional setting, it is possible to automatically obtain a progress measure for the system by combining the progress measure for each subsystem to a progress measure for the full system.
Tutorials
2006
Association Mapping: Fundamental Principles and Applications
T. Mailund, M.H. Schierup, J. Hein, L. Schauser, and J.N. Madsen
At Pacific Symposium on Biocomputing (PSB) 2006.
Presentation slides: [PDF] [Open Office]
[Abstract]
The past 15 years have witnessed an explosion in the identification of genes involved in Mendelian diseases. Most of these genes have been identified through family based linkage analysis. In contrast, the unravelling of the genetic basis of common, but complex diseases, such as cancer, diabetes and Alzheimer's disease has made little progress in this period. Complicating factors that make linkage analysis of these diseases impractical are late disease onset, the involvement of multiple low-penetrance genetic factors, ambiguous diagnosis, as well as strong influence of environmental and lifestyle factors. Power analysis and simulations suggest that association studies are the method of choice when attempting to genetically characterize complex diseases. Association studies draw a number of cases and controls from the same population and genotype their DNA at polymorphic loci. Excess correlation amongst alleles from separate loci, called Linkage Disequilibrium (LD), in the case group indicates the presence of a risk factor. Here we discuss the challenge of complex diseases, as well as simple and model based statistical approaches to the mapping of complex diseases.
2005
Association Mapping: Design Issues and Data Analysis Approaches
L. Schauser, T. Mailund, J.N. Madsen, J. Hein and M.H. Schierup
At Pacific Symposium on Biocomputing (PSB) 2005.
[Abstract]
The past 15 years have witnessed an explosion in the identification of genes involved in Mendelian diseases. Most of these genes have been identified through family based linkage analysis. In contrast, the unravelling of the genetic basis of common, but complex diseases, such as cancer, diabetes and Alzheimer's disease has made little progress in this period. Complicating factors that make linkage analysis of these diseases impractical are late disease onset, the involvement of multiple low-penetrance genetic factors, ambiguous diagnosis, as well as strong influence of environmental and lifestyle factors. Power analysis and simulations suggest that association studies are the method of choice when attempting to genetically characterize complex diseases. Association studies draw a number of cases and controls from the same population and genotype their DNA at polymorphic loci. Excess correlation amongst alleles from separate loci, called Linkage Disequilibrium (LD), in the case group indicates the presence of a risk factor. Here we discuss the challenge of complex diseases, as well as simple and model based statistical approaches to the mapping of complex diseases.
1999
Advanced Tutorial on the use of Coloured Petri Nets and Design/CPN
L.M. Kristensen, T. Mailund, and L.M. Wells
Second Workshop and Tutorial on Practical Use of Coloured Petri Nets and Design/CPN, 1999.
[Abstract]
What you always wanted to know about practical use of Design/CPN (but were afraid to ask). The advanced tutorial is aimed at engineers, students and researchers who already have some experience with the use of Coloured Petri Nets and Design/CPN.
Manuals
2006
Getting Started with HapCluster — An introduction to the HapCluster association mapping tool
T. Mailund
[Abstract]
HapCluster++ is a software package for linkage disequilibrium mapping. It is based on a Bayesian Markov-chain Monte Carlo (MCMC) method for finescale linkage-disequilibrium gene mapping using high-density marker maps.
Getting Started with Blossoc — An introduction to the Blossoc association mapping tool
T. Mailund
[Abstract]
Blossoc is a linkage disequilibrium association mapping tool that attempts to build (perfect) genealogies for each site in the input and score these according to non-random clustering of affected individuals, and judge high-scoring areas as likely candidates for containing disease affecting variation. Building the local genealogy trees is based on a number of heuristics that are not guaranteed to build true trees, but have the advantage of more sophisticated methods of being extremely fast.
GeneRecon Users' Manual — A coalescent based tool for fine-scale association mapping
T. Mailund
[Abstract]
GeneRecon is a software package for linkage disequilibrium mapping using coalescent theory. It is based on Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium gene mapping using high-density marker maps. GeneRecon explicitly models the genealogy of a sample of the case chromosomes in the vicinity of a disease locus. Given case and control data in the form of genotype or haplotype information, it estimates a number of parameters, most importantly, the disease position.
Getting Started with GeneRecon — An Introduction to the Association Mapping Tool GeneRecon
T. Mailund and L. Schauser
[Abstract]
GeneRecon is a software package for linkage disequilibrium mapping using coalescent theory. It is based on Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium gene mapping using high-density marker maps. GeneRecon explicitly models the genealogy of a sample of the case chromosomes in the vicinity of a disease locus. Given case and control data in the form of genotype or haplotype information, it estimates a number of parameters, most importantly, the disease position.
2005
CoaSim Guile Manual — Using the Guile-based CoaSim Simulator
T. Mailund
[Abstract]
CoaSim is a tool for simulating the coalescent process with recombination and geneconversion, under either constant population size or exponential population growth. It effectively constructs the ancestral recombination graph for a given number of chromosomes and uses this to simulate samples of SNP and micro-satellite haplotypes or genotypes.
Getting Started with CoaSim — An Introduction to the Simulator CoaSim
T. Mailund
[Abstract]
CoaSim is a tool for simulating the coalescent process with recombination and geneconversion, under either constant population size or exponential population growth. It effectively constructs the ancestral recombination graph for a given number of chromosomes and uses this to simulate samples of SNP and micro-satellite haplotypes or genotypes.
2003
DProg Tutorial/Manual
T. Mailund
[Abstract]
DPROG is a domain specific language for specifying dynamic programming algorithms; given a recursive definition of the problem, the compiler generates code for solving the problem using dynamic programming. This tutorial/manual explains the basics of the language and how the generated code can be used in an application. The manual covers version 0.2 of DPROG; the language might change in later versions.
Dissertations
2003
Sweeping the State Space—A Sweep-Line State Space Exploration Method
T. Mailund
Ph.D. Dissertation, Department of Computer Science, University of Aarhus, February 2003.
[Abstract]
The thesis describes the sweep-line method, a newly developed reduction method for alleviating the state explosion problem inherent in explicit-state state space exploration. The basic idea underlying the sweep-line method is, when calculating the state space, to recognise and delete states that are not reachable from the currently unprocessed states. Intuitively we drag a sweep-line through the state space with the invariant that all states behind the sweep-line have been processed and are unreachable from the states in front of the sweep-line. When calculating the state space of a system we iteratively calculate successors of new states, and we need to distinguish between previously seen states and truly new states to avoid reprocessing the same states indefinitely. When calculating the successors of the known unprocessed states, all successors are in front of the sweep-line. Consequently, the states behind the sweep-line are not needed for distinguishing between new and old states and can therefore be deleted.
2000
Coloured Petri Nets—A Tool in Software Engineering
T. Mailund
Progress report, Ph.D. part A, Department of Computer Science, University of Aarhus, December 2000.
[Abstract]
This progress report presents and summarizes the research work done by the author during part A of his PhD study at the Department of Computer Science, University of Aarhus.
The work described consists of three main parts. The first part is on language constructions for modular coloured Petri net models. Work on parameterisation of modules, and how this can be used to gain higher degrees of reuse, is presented, and work in progress on behaviour preserving module refinements is sketched.
The second part is on state space analysis methods for model verification, where two new reduction methods are described. The first method exploits equivalences to gain finite representations for the usually infinite state spaces of timed coloured Petri nets. The second method exploits the progress, that is present in some models, to sweep through all reachable states without, at any given time, storing the entire state space.
The third and last part is on a text format for exchanging models between different tools. An XML-based text format is described. This text format is now implemented in the tools Design/CPN, as an optional file format, and CPN Tools, as the main file format. Techniques for separating content and graphical style information, using style sheets, is also described.