Publications

Journal Papers
To Appear
The bonobo genome compared with the chimpanzee and human genomes
The bonobo genome analysis consortium
To appear in Nature.
[Abstract]
Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviors, and for some of these traits they show more similarity with humans than with each other. Here, we sequenced and assembled the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that over 3% of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions overlap genes and may eventually help understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.
2012
Cultural phylogenetics of the Tupi language family in lowland South America
R.S. Walker, S. Wichmann, T. Mailund and C.J. Atkisson
PLoS ONE 7(4): e35025. doi:10.1371/journal.pone.0035025
[Abstract]
Background: Recent advances in automated assessment of basic vocabulary lists allow the construction of linguistic phylogenies useful for tracing dynamics of human population expansions, reconstructing ancestral cultures, and modeling transition rates of cultural traits over time.
Methods: Here we investigate the Tupi expansion, a widely-dispersed language family in lowland South America, with a distance-based phylogeny based on 40-word vocabulary lists from 48 languages. We coded 11 cultural traits across the diverse Tupi family including traditional warfare patterns, post-marital residence, corporate structure, community size, paternity beliefs, sibling terminology, presence of canoes, tattooing, shamanism, men's houses, and lip plugs.
Results/Discussion: The linguistic phylogeny supports a Tupi homeland in west-central Brazil with subsequent major expansions across much of lowland South America. Consistently, ancestral reconstructions of cultural traits over the linguistic phylogeny suggest that social complexity has tended to decline through time, most notably in the independent emergence of several nomadic hunter-gatherer societies. Estimated rates of cultural change across the Tupi expansion are on the order of only a few changes per 10,000 years, in accord with previous cultural phylogenetic results in other language families around the world, and indicate a conservative nature to much of human culture.
The sweep-line state space exploration method
K. Jensen, L.M. Kristensen and T. Mailund
Theoretical Computer Science 429 169-179.
[Abstract]
The sweep-line method exploits intrinsic progress in concurrent systems to alleviate the state explosion problem in explicit state model checking. The concept of progress makes it possible to delete states from memory during state space exploration and thereby reduce peak memory usage. The contribution of this paper is twofold. Firstly, we provide a coherent presentation of the sweep-line theory and the many variants of the method that have been developed over the past 10 years since the basic idea of the method was conceived. Secondly, we survey a selection of case studies where the sweep-line method has been put into practical use for the verification of concurrent systems.
Insights into the hominid evolution from the gorilla genome sequence
The gorilla genome analysis consortium
Nature 483 169-175, 2012.
[Abstract]
Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago (Mya). In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating a sequence divergence time 1.75 Mya, but with evidence for more recent genetic exchange and a bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
Extensive X-linked adaptive evolution in central chimpanzees
C. Hvilsom, Y. Qian, T. Bataillon, Y. Li, T. Mailund, B. Salle, F. Carlsen, R. Li, H. Zhang, T. Jiang, H. Jiang, X. Jin, K. Munch, A. Hobolth, H.R. Siegismund, J. Wang, and M.H. Schierup
PNAS 109(6) 2054–2059
[Abstract]
Surveying genome-wide coding variation within and among species gives unprecedented power to study the genetics of adaptation, in particular the proportion of amino acid substitutions fixed by positive selection. Additionally, contrasting the autosomes and the X chromosome holds information on the dominance of beneficial (adaptive) and deleterious mutations. Here we capture and sequence the complete exomes of 12 chimpanzees and present the largest set of protein coding polymorphism to date. We report extensive adaptive evolution specifically targeting the X chromosome of chimpanzees with as much as 30% of all amino acid replacements being adaptive. Adaptive evolution is barely detectable on the autosomes except for a few striking cases of recent selective sweeps associated with immunity gene clusters. We also find much stronger purifying selection than observed in humans, and in contrast to humans, we find that purifying selection is stronger on the X chromosome than on the autosomes in chimpanzees. We therefore conclude that most adaptive mutations are recessive. We also document dramatically reduced synonymous diversity in the chimpanzee X chromosome relative to autosomes and stronger purifying selection than for the human X chromosome. If similar processes were operating in the human-chimpanzee ancestor as in central chimpanzees today, our results therefore provide an explanation for the much-discussed reduction in the human-chimpanzee divergence at the X chromosome.
2011
Local genealogies in a linear mixed model for genome-wide association mapping in complex pedigreed populations
G. Sahana, T. Mailund, M.S. Lund and B. Guldbrandtsen
PLoS ONE 6(11): e27061. doi:10.1371/journal.pone.0027061
[Abstract]
Introduction: The state-of-the-art for dealing with multiple levels of relationship among the samples in genome-wide association studies (GWAS) is unified mixed model analysis (MMA). This approach is very flexible, can be applied to both family-based and population-based samples, and can be extended to incorporate other effects in a straightforward and rigorous fashion. Here, we present a complementary approach, called GENMIX (genealogy based mixed model) which combines advantages from two powerful GWAS methods: genealogy-based haplotype grouping and MMA.
Subjects and Methods: We validated GENMIX using genotyping data of Danish Jersey cattle and simulated phenotype and compared to the MMA. We simulated scenarios for three levels of heritability (0.21, 0.34, and 0.64), seven levels of MAF (0.05, 0.10, 0.15, 0.20, 0.25, 0.35, and 0.45) and five levels of QTL effect (0.1, 0.2, 0.5, 0.7 and 1.0 in phenotypic standard deviation unit). Each of these 105 possible combinations (3 h2 x 7 MAF x 5 effects) of scenarios was replicated 25 times.
Results: GENMIX provides a better ranking of markers close to the causative locusâ location. GENMIX outperformed MMA when the QTL effect was small and the MAF at the QTL was low. In scenarios where MAF was high or the QTL affecting the trait had a large effect both GENMIX and MMA performed similarly.
Conclusion: In discovery studies, where high-ranking markers are identified and later examined in validation studies, we therefore expect GENMIX to enrich candidates brought to follow-up studies with true positives over false positives more than the MMA would.
A sub-cubic time algorithm for computing the quartet distance between two general trees
J. Nielsen, A.K. Kristensen, T. Mailund and C.N.S. Pedersen
Algorithms for Molecular Biology 2011 6:15 doi:10.1186/1748-7188-6-15
[Abstract]
Background: When inferring phylogenetic trees different algorithms may give different trees. To study such effects a measure for the distance between two trees is useful. Quartet distance is one such measure, and is the number of quartet topologies that differ between two trees.
Results: We have derived a new algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree >=3. The time and space complexity of our algorithm is sub-cubic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm so far for computing the quartet distance between general trees independent of the degree of the inner nodes.
Conclusions: We have implemented our algorithm and two of the best competitors. Our new algorithm is significantly faster than the competition and seems to run in close to quadratic time in practice.
On computing the coalescence time density in an isolation-with-migration model with few samples
A. Hobolth, L.N. Andersen and T. Mailund
Genetics 187: 1241–1243
Estimating Divergence Time and Ancestral Effective Population Size of Bornean and Sumatran Orangutan Subspecies using a Coalescent Hidden Markov Model
T. Mailund, J.Y. Dutheil, A. Hobolth, G. Lunter and M.H. Schierup
PLoS Genetics 7(3): e1001319. doi:10.1371/journal.pgen.1001319
[Abstract]
Due to genetic variation in the ancestor of two populations or two species, the divergence time for DNA sequences from two populations is variable along the genome. Within genomic segments all bases will share the same divergence — because they share a most recent common ancestor — when no recombination event has occurred to split them apart. The size of these segments of constant divergence depends on the recombination rate, but also on the speciation time, the effective population size of the ancestral population, as well as demographic effects and selection. Thus, inference of these parameters may be possible if we can decode the divergence times along a genomic alignment. Here, we present a new hidden Markov model that infers the changing divergence (coalescence) times along the genome alignment using a coalescent framework, in order to estimate the speciation time, the recombination rate and the ancestral effective population size. The model is efficient enough to allow inference on whole-genome data sets. We first investigate the power and consistency of the model with coalescent simulations, and then apply it to the whole genome sequences of the two orangutan sub-species, Bornean (P. p. pygmaeus) and Sumatran (P. p. abelii) orangutans from the Orangutan Genome Project. We estimate the speciation time between the two sub-species to be 334 +/- 145 thousand years ago and the effective population size of the ancestral orangutan species to be 26, 800 +/- 6, 700, consistent with recent results based on smaller data sets. We also report a negative correlation between chromosome size and ancestral effective population size which we interpret as a signature of recombination increasing the efficacy of selection.
Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread natural selection
A. Hobolth, J.Y. Dutheil, J. Hawks, M.H. Schierup and T. Mailund
Genome Research 21:349-356
[Abstract]
We use our recently developed coalescent HMM framework on the complete orangutan genome to search for regions of the genome where humans are closer to orangutans than to chimpanzees due to incomplete lineage sorting (ILS) in the ancestor of human and chimpanzees. We find that ILS is present in ~1% of the genome suggesting that the ancestral species of human and chimpanzees never experienced a severe population bottleneck. We validate the existence of ILS with simulation results, site pattern analysis and rare genomic events. The existence of ILS allows us to disentangle the time of isolation of humans and orangutans (the speciation time) from the divergence time, and we find that speciation is as recent as 9-13 mya (depending on the calibration point). The analysis also provide further support for a recent speciation of human and chimpanzee at ~4 mya and a diverse ancestor of human and chimpanzee with an effective population size of ~50,000 individuals. We use Bayesian reconstruction to infer ILS for each nucleotide in the genome in order to infer patterns of selection in the ancestral species. We demonstrate the effect of background selection in the common ancestor of humans and chimpanzees. In agreement with population genetics predictions, ILS is reduced in exons and in gene dense regions when we control for confounding factors such GC content and recombination rate, Finally, we find that the broad scale recombination rate is conserved over the complete ape phylogeny.
Comparative and demographic analysis of orangutan genomes
The orangutan genome consortium
Nature 469, 529-533, 2011, doi:10.1038/nature09687
[Abstract]
"Orangutan" is derived from the Malay man of the forest and aptly describes the Southeast Asian great apes native to Sumatra and Borneo. The endangered orangutan species, Pongo abelii (Sumatran) and Pongo pygmaeus (Bornean), are the most phylogenetically distant great apes from humans, providing a powerful perspective on hominid evolution. Here we present a Sumatran orangutan draft genome assembly and short read sequence data from five Sumatran and five Bornean orangutan genomes. Our analyses reveal that, compared to other primates, the orangutan genome has many unique features. Structural evolution of the orangutan genome has proceeded much more slowly than other great apes, evidenced by fewer rearrangements, less segmental duplication, less gene family turnover and surprisingly quiescent Alu repeats, which played a major role in restructuring other primate genomes. We also characterize the first primate polymorphic neocentromere, found in both orangutan species, emphasizing the gradual evolution of orangutan genome structure. Orangutans have extremely low energy usage for a eutherian mammal1, far lower than their hominid relatives, and adding their genome to the repertoire of sequenced primates illuminates new signals of positive selection in several pathways including glycolipid metabolism. From the population perspective, both orangutan species have deep nucleotide diversity; however, Sumatran individuals possess greater diversity than their Bornean counterparts and more species-specific variation. Our estimate of Bornean/Sumatran speciation time, 400k years ago (ya), is more recent than most previous studies and highlights the complexity of the orangutan speciation process. Despite a smaller modern census population size, Sumatran effective population size (Ne) has expanded exponentially relative to ancestral Ne since the Bornean/Sumatran split, while Bornean Ne declined over the same period. Overall, the resources and analyses presented here offer new opportunities in evolutionary genomics, insights into hominid biology, and an extensive database of variation for conservation efforts.
Inference of large phylogenies using neighbour-joining
M. Simonsen, T. Mailund and C.N.S. Pedersen
Biomedical Engineering Systems and Technologies vol 127, 334-344 2011, Springer-Verlag.
This is a longer version of Building very large neighbour-joining trees selected for publication in a Springer proceeding.
[Abstract]
The neighbour-joining method is a widely used method for phylogenetic reconstruction which scales to thousands of taxa. However, advances in sequencing technology have made data sets with more than 10,000 related taxa widely available. Infernce of such large phylogenies take hours or days using the neighbour-joining method on a normal desktop computer because of the O(n3) running time. RapidNJ is a search heuristic which reduce the running time of the neighbour-joining method significantly but at the cost of an increased memory consumption making inference of large phylogenies infeasible. We present two extensions for RapidNJ which reduce the memory requirements and allows phylogenies with more than 50,000 taxa to be inferred efficiently on a desktop computer. Furthermore, an improved version of the search heuristic is presented which reduce the running time of RapidNJ on many data sets.
2009
Ancestral population genomics: the coalescent hidden Markov model approach
J.Y. Dutheil, G. Ganapathy, A. Hobolth, T. Mailund, M.K. Uyenoyama, and M.H. Schierup
Genetics 183, 259–274 2009. doi:10.1534/genetics.109.103010
[Abstract]
With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their genomes. This feature can be used to infer population genetics parameters in ancestral species. We use a hidden Markov model to infer the genealogical state along a four-species alignment with ILS. Coalescent theory is used to parametrize the hidden Markov model in terms of population genetics quantities. The transition probabilities and emission probabilities can be expressed as functions of the population parameters, speciation times, ancestral effective population sizes and recombination rate, and estimated from a genome alignment of closely related species. We analyze a basic, panmictic demographic model and study its properties using an extensive set of coalescent simulations. We assess the effect of the model assumptions, and demonstrate that the Markov property provides a good approximation to the ancestral recombination graph. Using a too restricted set of possible genealogies, necessary to reduce the computational load, can bias parameter estimates. We propose a simple correction for this bias, and suggest directions for future extensions of the model. We show that the patterns of ILS along a sequence alignment undergoing recombination can also be recovered effciently and that the ancestral recombination rate can be recovered. Finally, we introduce an extension of the basic model that allows for substitution rate heterogeneity, and reanalyze Human-Chimpanzee-Gorilla-Orangutan alignments using the new models.
Using biological networks to search for interacting loci in genomewide association studies
M. Emily, T. Mailund, J. Hein, L. Schauser and M.H. Schierup
European Journal of Human Genetics; doi: 10.1038/ejhg.2009.15
[Abstract]
Genome-wide association studies have identified a large number of Single-Nucleotide Polymorphisms (SNPs) that individually predispose to diseases. Genome-wide association studies have identified a large number of Single-Nucleotide Polymorphisms (SNPs) that individually predispose to diseases. However many genetic risk factors remain unaccounted for. Proteins coded by genes interact in the cell and it is likely that certain variants mainly affect the phenotype in combination with other variants, termed epistasis. An exhaustive search for epistatic effects is computationally demanding, since several billions of SNP pairs exist for typical genotyping chips. In this study, the experimental knowledge on biological networks is used to narrow the search for two-locus epistasis. We provide evidence that this approach is computationally feasible and statistically powerful. By applying this method to the Wellcome Trust Case-Control Consortium (WTCCC) data sets, we report four significant cases of epistasis between unlinked loci, in susceptibility to Crohn's disease, bipolar disorder, hypertension and rheumatoid arthritis.
Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups
M.H. Schierup, T. Mailund, H. Li, J. Wang, A. A. Tjønneland, U. Vogel, L. Bolund and B. Nexø.
BMC Evolutionary Biology, 10(20); doi:10.1186/1471-2350-10-20
[Abstract]
Background. A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes of which 3, ERCC1, ERCC2, and PPP1R13L (aka RAI) are related to DNA repair and cell survival, and one, CD3EAP, aka ASE1, may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents and markers in it are associated with risk of several cancers.
Methods. We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk.
Results. We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the Fst value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping.
Conclusions. The combined observation that a common haplotype causing an increased risk of cancer in Europeans and a high differentiation between human populations is highly unusual and suggests a causal relationship with a recent increase in Europeans caused either by genetic drift overruling selection against the susceptibility variant or a positive selection for the same haplotype. The data does not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.
Local phylogeny mapping of quantitative traits: Higher accuracy and better ranking than single marker association in genomewide scans
S. Besenbacher, T. Mailund and M.H. Schierup
Genetics, 181, 747-753 2009; doi:10.1534/genetics.108.092643
[Abstract]
We present a new method, termed QBlossoc, for linkage disequilibrium (LD) mapping of genetic variants underlying a quantitative trait. The method builds on a previously published method, Blossoc, for LD mapping of case-control studies. The method builds local genealogies along the genome and looks for a significant clustering of quantitative trait values in these trees. We analyse its efficiency in terms of localisation and ranking of true positives among a large number of negatives and compare the results with single marker approaches. Simulation results of markers at densities comparable to genotype chips show that QBlossoc is more accurate in localisation of true positives as expected since it uses the additional information of LD between markers simultaneously. More importantly for genome wide surveys, QBlossoc ranks true positives significantly higher than single marker approaches, again suggesting that a true signal displays itself more strongly in a set of adjacent markers than a spurious (false) signal. The method is both memory and CPU efficient. It has been tested on a real data set of height data for 5000 individuals measured at 317K markers and completed analysis within 5 CPU days. The QBlossoc program with full documentation can be downloaded through www.daimi.au.dk/~mailund/.
A fast algorithm for genome wide haplotype pattern mining
S. Besenbacher, C.N.S. Pedersen and T. Mailund
BMC Bioinformatics, 10(Suppl 1): S74 doi:10.1186/1471-2105-10-S1-S74
[Abstract]
Background: Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a diseases. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach to do exactly this. Originally developed for candidate gene analysis, its running time is not optimised for the much larger genome wide data sets in use today.
Results: We present a new, faster algorithm for the HPM method. The new approach patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method with a factor of 2 on a genome wide dataset with 5009 individuals typed in 491208 markers using default parameters and more if the pattern length is increased.
Conclusions: The new algorithm significantly speeds up the HPM method and makes it feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.
2008
SNPFile - A software library and file format for large scale association mapping and population genetics studies
J. Nielsen and T. Mailund
BMC Bioinformatics 2008, 9:526; doi:10.1186/1471-2105-9-526
[Abstract]
Background: High-throughput genotyping technology has enabled cost effective typing of thousands of individuals in hundred of thousands of markers for use in genome wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
Results: We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods.
Conclusions: The new file format has been very useful for our own studies where it has significantly reduced the informatics burden in keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation with the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. Through scripting interfaces to the file format, we hope to alleviate this in the future.
Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genome Data
Z. Ding, T. Mailund and Y.S. Song
Bioinformatics 24(19):2215-2221; doi:10.1093/bioinformatics/btn406
[Abstract]
Motivation: Recent advances in genotyping technology has made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale data sets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this paper, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological data sets.
Investigating Selection on Viruses: A Statisical Alignment Approach
S. de Groot, T. Mailund, G.A. Lunter and J. Hein
BMC Bioinformatics 9(304) 2008, doi:10.1186/1471-2105-9-304
[Abstract]
Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.
Computing the All-Pairs Quartet Distance on a Set of Evolutionary Trees
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal, and R. Fagerberg
Journal of Bioinformatics and Computational Biology 6(1) 37–50 2008.
Linked version is a preprint of an article submitted for consideration in Journal of Bioinformatics and Computational Biology © 2008 World Scientific Publishing Company
Special Issue journal version of Computing the All-Pairs Quartet Distance on a Set of Evolutionary Trees APBC'07.
[Abstract]
We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations thus performing significantly better on large sets of trees compared to performing distinct pairwise distance calculations, as we illustrate experimentally, where we see a speedup factor of around 130 in the best case.
2007
On Recombination Induced Multiple and Simultaneous Coalescent Events
J. Davies, F. Simancik, R. Lyngsø, T. Mailund, and J. Hein
Genetics 177: 2151–2160. doi:10.1534/genetics.107.071126
[Abstract]
Coalescent Theory is almost ubiquitous in contemporary molecular population genetics. Inherent in most applications is a continuous time approximation that assumes sample size is small relative to the actual population size. This assumption in effect precludes simultaneous and multiple coalescent events, which can constitute an arbitrarily large component when sample size is sufficiently large. In most situations this is justifiably ignored as a large sample size will only have few ancestors a couple of generations back and then the assumption is valid. However, in tracing the evolutionary history of large chromosomal segments, a large recombination rate will consistently keep the number of ancestors large such that multiple and simultaneous coalescent events cannot be ignored. This can create a major disparity between discrete time and continuous time models and we here show its importance illustrated with parameters typical of the human genome. The presence of gene convergence only aggravates its importance. This could seriously undermine the application of coalescent theory to complete genomes. However, it can be shown that multiple and simultaneous coalescent events influences global quantities, such as total number of ancestors, but has negligible effect on local quantities, such as linkage disequilibrium or similarities of close local trees. Reassuringly the majority of applications of coalescent models with recombination are based on local quantities for purposes such as association mapping.
Annotation of Selection Strengths in Viral Genomes
S. McCauley, S. de Groot, T. Mailund, and J. Hein
Bioinformatics 23(22): 2978–2986. doi:10.1093/bioinformatics/btm472.
[Abstract]
Motivation: Viral genomes tend to code in overlapping reading frames to maximize information content. This may result in atypical codon bias and particular evolutionary constraints. Due to the fast mutation rate of viruses, there is additional strong evidence for varying selection between intra- and intergenomic regions. The presence of multiple coding regions complicates the concept of Ka/Ks ratio, and thus begs for an alternative approach when investigating selection strengths. Building on the paper by McCauley & Hein (2006), we develop a method for annotating a viral genome coding in overlapping reading frames. We introduce an evolutionary model capable of accounting for varying levels of selection along the genome, and incorporate it into our prior single sequence HMM methodology, extending it now to a phylogenetic HMM. Given an alignment of several homologous viruses to a reference sequence, we may thus achieve an annotation both of coding regions as well as selection strengths, allowing us to investigate different selection patterns and hypotheses.
Results: We illustrate our method by applying it to a multiple alignment of four HIV2 sequences, as well as four Hepatitis B sequences. We obtain an annotation of the coding regions, as well as a posterior probability for each site of the strength of selection acting on it. From this we may deduce the average posterior selection acting on the different genes. Whilst we are encouraged to see in HIV2, that the known to be conserved genes gag and pol are indeed annotated as such, we also discover several sites of less stringent negative selection within the env gene. The the best of our knowledge we are the first to subsequently provide a full selection annotation of the Hepatitis B genome by explicitly modelling the evolution within overlapping reading frames, and not relying on the simple Ka/Ks ratios.
Comparative Annotation of Viral Genomes with Non-Conserved Gene Structure
S. de Groot, T. Mailund, and J. Hein
Bioinformatics 23(9):1080–1089. doi:10.1093/bioinformatics/btm078.
[Abstract]
Motivation: Detecting genes in viral genomes is a complex task. Due to the biological necessity of them being constrained in length, RNA viruses in particular tend to code in overlapping reading frames. Since one amino acid is encoded by a triplet of nucleic acids, up to three genes may be coded for simultaneously in one direction. Conventional HMM based gene finding algorithms may find it difficult — if not impossible — to identify multiple coding regions, since in general their topologies do not allow for the presence of overlapping or nested genes. Comparative methods have therefore been restricted to likelihood ratio tests on potential regions as to being double or single coding, using the fact that the constrictions forced upon multiple-coding nucleotides will result in atypical sequence evolution. Exploiting these same constraints, we present a hidden Markov model based gene-finding program, which allows for coding in unidirectional nested and overlapping reading frames, to annotate two homologous aligned viral genomes. Our method does not insist on conserved gene structure between the two sequences, thus making it applicable for the pairwise comparison of more distantly related sequences.
Results: We apply our method to 15 pairwise alignments of six different HIV2 genomes. Given sufficient evolutionary distance between the two sequences, we achieve sensitivity of about 84% and specificity of about 97%. We additionally annotate three pairwise alignments of the more distantly related HIV1 and HIV2, as well as of two different Hepatitis Viruses, attaining results of ~87% sensitivity and ~98.5% specificity. We subsequently incorporate prior knowledge by "knowing" the gene structure of one sequence and annotating the other conditional on it. Boosting accuracy close to perfect we demonstrate that conservation of gene structure on top of nucleotide sequence is a valuable source of information, especially in distantly related genomes.
Genomic relationships and speciation times of human, chimpanzee and gorilla inferred from a coalescent hidden Markov model
A. Hobolth, O.F. Christensen, T. Mailund and M.H. Schierup
In PLoS Genetics, 3(2) 2007; doi:10.1371/journal.pgen.0030007
[Abstract]
The genealogical relationship of human, chimpanzee and gorilla varies along the genome. We develop a hidden Markov model (HMM) that incorporates this variation and relate the model parameters to population genetics quantities such as speciation times and ancestral population sizes. The HMM is an analytically tractable approximation to the coalescent process with recombination, and in simulations we see no apparant bias in the HMM estimates. We apply the HMM to four autosomal contiguous human-chimp-gorilla-orangutan alignments comprising a total of 1.9 million base pairs. We find a very recent speciation time of human-chimp (4.1 +/- 0.4 million years), and fairly large ancestral effective population sizes (65,000 +/- 30,000 for the human-chimp ancestor and 45,000 +/- 10,000 for the human-chimp-gorilla ancestor). Furthermore, around 50% of the human genome coalesces with chimpanzee after speciation with gorilla. We also consider 250 thousand base pairs of X-chromosome alignments and find an effective population size much smaller than 75% of the autosomal effective population sizes. Finally, we find that the rate of transitions between different genealogies correlate well with the regionwide present day human recombination rate, but does not correlate with the fine-scale recombination rates and recombination hot spots, suggesting that the latter are evolutionary transient.
Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
In Future Generation Computer Systems 2007 23 580–586. doi:10.1016/j.future.2006.09.003.
Special Issue journal version of Initial Experiences with GeneRecon on MiG GCA'05.
[Abstract]
We report on our experiences so far with running a bioinformatics simulation study on a newly developed grid architecture. We briefly describe the application bioinformaticsan association mapping algorithm for both locating disease loci and separating cases into those diseased due to genetic factors and those diseased solely due to factors environmentand describe the grid architecture and how the application is set up to run on the grid.
2006
Whole genome association mapping by incompatibilities and local perfect phylogenies
T. Mailund, S. Besenbacher, and M.H. Schierup
BMC Bioinformatics 2006 7(454). doi:10.1186/1471-2105-7-454.
[Abstract]
Background: With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis to become the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed.
Results: We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method is investigated on 1) simulated genotype data under different models of disease determination 2) artificial data sets created from the HapMap ressource, and 3) data sets used for testing of other methods in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate. However, when it comes to more complex scenarios of mutation heterogeneity and more complex haplotype structure such as found in the HapMap data our method outperforms SMA as well as other fast, data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM) despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variants in an empirical data set—the ΔF508 mutation for cystic fibrosis—where the susceptibility variant is already known—and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60kb from the CYP2D6 gene.
Conclusions: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
Fast Calculation of the Quartet Distance Between Trees of Arbitrary Degrees
C. Christiansen, T. Mailund, C.N.S. Pedersen, M. Randers, and M. Stissing
Algorithms for Molecular Biology 2006, 1(16); doi:10.1186/1748-7188-1-16.
[Abstract]
Background: A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets—sub-trees induced by four leaves—that differs between the trees. Mostly, these algorithms are restricted to work on binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.
Results: We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees T and T', the algorithm runs in time O(n+|V||V'|min{id, id'}) and space O(n+|V||V'|), where n is the number of leaves in the two trees, V and V' are the non-leaf nodes in T and T', respectively, and id and id' are the maximal number of non-leaf nodes adjacent to a non-leaf node in T and T', respectively. The fastest algorithms previously published for arbitrary degree trees run in O(n3) (independent of the degree of the tree) and O(|V||V'| id id'), respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.
Conclusions: We present a new algorithm for computing the quartet distance between two trees of arbitrary degree. The new algorithm improves the asymptotic running time for computing the quartet distance, compared to previous methods, and experimental results indicate that the new method also performs significantly better in practice.
GeneRecon—A coalescent based tool for fine-scale association mapping
T. Mailund, M.H. Schierup, C.N.S. Pedersen, J.N. Madsen, J. Hein, and L. Schauser
Bioinformatics 2006 22 (18): 2317–2318; doi:10.1093/bioinformatics/btl153.
[Abstract]
GeneRecon is a tool for fine-scale association mapping using a coalescence model. GeneRecon takes as input case-control data from phased or unphased SNP and micro-satellite genotypes. The posterior distribution of disease locus position is obtained by Metropolis Hastings sampling in the state space of genealogies. Input format, search strategy, and the sampled statistics can be configured through the Guile Scheme programming language embedded in GeneRecon, making GeneRecon highly configurable.
The effective size of the Icelandic population and the prospects for LD mapping: inference from unphased microsatellite markers
T. Bataillon, T. Mailund, S. Thorlacius, E. Steingrimsson, T. Rafnar, M.M. Halldorsson, V. Calian, and M.H. Schierup
European Journal of Human Genetics 2006, 14, 1044–1053. doi:10.1038/sj.ejhg.5201669.
[Abstract]
Characterizing the extent of linkage disequilibrium (LD) in the genome is a pre-requisite for association mapping studies. Patterns of LD also contain information about the past demography of populations. On this study we focus on the Icelandic population where LD was investigated in 12 regions of ~15 cM using regularly spaced microsatellite loci. In total 1753 individuals were genotyped for 179 markers. LD was estimated using a composite disequilibrium measure based on unphased data. LD decreases with distance in all 12 regions and can be detected over approximately 4 cM for the given sample size. Differences in the patterns of decrease of LD with distance among genomic regions were mostly due to two regions exhibiting respectively higher and lower proportions of pairs in LD than average within the first 4 cM. We pooled data from all regions except these two and summarized patterns of LD by computing the proportion of pairs of loci exhibiting significant LD (at the 5% level) as a function of distance. We compared observed patterns of LD with simulated datasets obtained under scenarios with varying demography and intensity of recombination. We show that unphased data allow to make inferences on scaled recombination rates from patterns of LD. Patterns of LD in Iceland suggest a genome-wide scaled recombination rate of ρ* = 200 [130–330] per cM which is equivalent to a long term effective population size of ~5000 in the range of estimates recently reported in three populations using extensive SNPs data from the HapMap project. We discuss the implication of our findings for association mapping studies using the Icelandic population and expected pattern of LD between SNPs markers.
Recrafting the Neighbor-Joining Method
T. Mailund, G.S. Brodal, R. Fagerberg, C.N.S. Pedersen, and D. Phillips
BMC Bioinformatics 2006, 7(29). doi:10.1186/1471-2105-7-29.
[Abstract]
Background: The neighbor-joining method by Saitou and Nei is a widely used method for constructing phylogenetic trees. The formulation of the method gives rise to a canonical Θ(n3) algorithm upon which all existing implementations are based.
Methods: In this paper we present techniques for speeding up the canonical neighbor-joining method. Our algorithms construct the same phylogenetic trees as the canonical neighbor-joining method. The best-case running time of our algorithms are O(n2) but the worst-case remains O(n3). We empirically evaluate the performance of our algoritms on distance matrices obtained from the Pfam collection of alignments.
Results: The experiments indicate that the running time of our algorithms evolve as Θ(n2) on the examined instance collection. We also compare the running time with that of the QuickTree tool, a widely used efficient implementation of the canonical neighbor-joining method.
Conclusions: The experiments show that our algorithms also yield a significant speed-up, already for medium sized instances.
2005
CoaSim: A Flexible Environment for Simulating Genetic Data under Coalescent Models
T. Mailund, M.H. Schierup, C.N.S. Pedersen, P.J.M. Mechlenborg, J.N. Madsen, and L. Schauser
BMC Bioinformatics 2005, 6(252). doi:10.1186/1471-2105-6-252.
[Abstract]
Background Coalescent simulations are playing a large role in interpreting large scale intra- polymorphism surveys and for planning and evaluating association studies. Coalescent of data sets under different models can be compared to the actual data to test different evolutionary factors and thus get insight into these.
Results We have created the CoaSim application as a flexible environment for Monte various types of genetic data under equilibrium and non-equilibrium coalescent variety of applications. Interaction with the tool is through the Guile version scripting language. Scheme scripts for many standard and advanced applications these can easily be modified by the user for a much wider range of applications. interface with less functionality and flexibility is also included. It is primarily exploratory and educational tool.
Conclusions CoaSim is a powerful tool because of its flexibility and ease of use. This is illustrated varied uses of the application, e.g. evaluation of association mapping methods, bootstrapping, and design and choice of markers for specific questions.
Pigs in Sequence Space: A 0.66X Coverage Pig Genome Survey based on Shotgun Sequencing
R.Wernersson, M.H. Schierup, F.G. Jørgensen, J. Gorodkin, F. Panitz, H.H. Staerfeldt, O.F. Christensen, T. Mailund, H. Hornshøj, A. Klein, W. Jun, B. Liu, S. Hu, W. Dong, W. Li, G.K.-S. Wong, J. Yu, J. Wang, C. Bendixen, M. Fredholm, S. Brunak, H. Yang and L. Bolund
BMC Genomics 2005, 6(70). doi:10.1186/1471-2164-6-70.
[Abstract]
Background Comparative whole genome analysis of Mammalia can benefit from the addition of more species. The pig is an obvious choice due to its economic and medical importance as well as its evolutionary position in the artiodactyls.
Results We have generated ~ 3.84 million shotgun sequences (0.66X coverage) of the pig genome. The data are hereby released together with an initial evolutionary analysis. The non-repetitive fraction of the sequences was aligned to the UCSC human-mouse alignment and the resulting three-species alignments were annotated using the human genome annotation. Ultra-conserved elements and miRNAs were identified. The results show that for each of these types of orthologous data, pig is much closer to human than mouse is. Purifying selection has been more efficient in pig compared to human, but not as efficient as in mouse, and pig seems to have an isochore structure most similar to the structure in human.
Conclusion The addition of the pig to the set of species sequenced at low coverage adds to the understanding of selective pressures that have acted on the human genome by bisecting the evolutionary branch between human and mouse with the mouse branch being approximately 3 times as long as the human branch, and the joint alignment of the shot-gun sequences to the human-mouse alignment offers a rapid way for the investigator to define specific regions for analysis and resequencing.
RBT—A Tool for Building Refined Buneman Trees
S. Besenbacher, T. Mailund, L. Westh-Nielsen, C.N.S. Pedersen
Bioinformatics, 21(8) 1711–1712, 2005.
[Abstract]
We have developed a tool implementing an efficient algorithm for refined Buneman tree reconstruction. The algorithm—which has the same complexity as the neighbour-joining method and the (plain) Buneman tree construction—enables refined Buneman tree reconstruction on large taxa sets.
2004
QuickJoin—Fast Neighbour-Joining Tree Reconstruction
T. Mailund and C.N.S. Pedersen
Bioinformatics, 20(17): 3261–3262, 2004.
[Abstract]
We have built a tool for fast construction of very large phylogenetic trees. The tool uses heuristics for speeding up the neighbour-joining algorithm—while still constructing the same tree as the original neighbour-joining algorithm—making it possible to construct trees for ~8000 species in less than ten minutes on a single desktop PC. In comparison, the same task takes more than 30 minutes using the QuickTree neighbour-joining implementation.
QDist—Quartet Distance Between Evolutionary Trees
T. Mailund and C.N.S. Pedersen
Bioinformatics, 20(10): 1636–1637, 2004.
[Abstract]
QDist is a program for computing the quartet distance between two unrooted evolutionary trees, i.e. the number of quartet topology differences between the two trees, where a quartet topology is the topological subtree induced by four species. The implementation is based on an algorithm with running time O(n log² n), which makes it practical to compare large trees.
Exploiting Equivalence Reduction and the Sweep-Line Method for Detecting Terminal States
J. Billington, G.E. Gallasch, L.M. Kristensen, and T. Mailund
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans. 34(1): 23–37.
[Abstract]
State space exploration is one of the main approaches to computer-aided verification and analysis of finite-state systems, and can be used to reason about a wide range of properties during the design phase of a system, including desired and undesired terminal states. Several state space reduction methods have been developed to alleviate the state explosion problem inherent in methods based on state space exploration. In this paper we develop algorithms for combining two of these methods: the sweep-line method and the equivalence reduction method. The algorithms allow terminal states of the system to be reported on-the-fly during the state space exploration. We demonstrate the usefulness of the algorithms by applying them to an industrial case study. These experimental results show that the combined methods achieves a better reduction of the state space than when either methods are used in isolation.
Conference Papers
To Appear
Using colored Petri nets to construct coalescent hidden Markov models: Automatic translation from demopgrahic specifications to efficient inference methods
T. Mailund, A.E. Halager, and M. Westergaard
To appear at the 2012 International Conference on Applications and Theory of Petri Nets (ICATPN).
[Abstract]
Biotechnological improvements over the last decade has made it economically and technologically feasible to collect large DNA sequence data from many closely related species. This enables us to study the detailed evolutionary history of recent speciation and demographics. Sophisticated statistical methods are needed, however, to extract the information that DNA sequences hold, and a limiting factor in this is dealing with the large state space that the ancestry of large DNA sequences spans. Recently a new analysis method, CoalHMMs, has been developed, that makes it computationally feasible to scan full genome sequences -- the complete genetic information of a species -- and extract genetic histories from this. Applying this methodology, however, requires that the full state space of ancestral histories can be constructed. This is not feasible to do manually, but by applying formal methods such as Petri nets it is possible to build sophisticated evolutionary histories and automatically derive the analysis models needed. In this paper we describe how to use colored stochastic Petri nets to build CoalHMMs for complex demographic scenarios.
2010
HMMlib: A C++ library for general hidden Markov models exploiting modern CPUs
A. Sand, A.T. Brask, C.N.S. Pedersen and T. Mailund
Proceedings of HiBi 2010.
[Abstract]
We present a C++ library for constructing and analyzing general hidden Markov models. The library consists of a number of template classes and generic functions, parameterized with the precision of floating point types and different types of hardware acceleration.
A new powerful method for genome-wide association mapping using local phylogenies in a mixed model
G. Sahana, T. Mailund, M.S. Lund and B. Guldbrandtsen
Proceedings of WCGALP 2010.
[Abstract]
Local phylogenies are phylogenies for the longest chromosome region around a marker that does not require recurrent mutation or recombination (Mailund et al. 2006). Local phylogeny-based association mapping was introduced by Mailund et al. (2006) for a case-control study and Besenbacher et al. (2009) for complex traits and implemented in the software Blossoc. Ledur et al. (2009) used Blossoc for QTL mapping using the dataset simulated for QTLMAS-XII workshop by Lund et al. (2009). Blossoc was found to perform best out of several approaches used to analyze the common dataset (Crooks et al. 2009). However, the method developed can be applied only for random samples which is not always available in human genetics and generally unavailable in livestock populations. Ledur et al. (2009) corrected the data for the pedigree and then analyzed as if it was a random sample. However, precorrecting the phenotypes for pedigree is not an optimal solution because it reduces power to identify QTL (Aulchenko et al. 2007). Therefore, a method which can utilize the phylogeny information and simultaneously able to account multiple levels of relatedness among individuals in the model, is expected to have more power when the samples actually come from a complex pedigree. Yu et al. (2006) present a unified mixed-model method for association mapping that accounts for multiple levels of relatedness. The mixed-model approach has the added advantage of being flexible, as it can be applied to both family-based and population samples. Therefore, combing local phylogeny with mixed-model will give a powerful and flexible method for association mapping. We present here a new method, PHYLOMIX (phylogy based mixed-model), which uses local phylogenies in a variance component based model for association mapping for complex traits, thereby combining a phylogeny-based approach with a mixed-model.
Building very large neighbour-joining trees
M. Simonsen, T. Mailund and C.N.S. Pedersen
Proceedings of Bioinformatics 2010.
[Abstract]
The neighbour-joining method by Saitou and Nei is a widely used method for phylogenetic reconstruction, made popular by a combination of computational efficiency and reasonable accuracy. With its cubic running time by Studier and Kepler, the method scales to hundreds of species, and while it is usually possible to infer phylogenies with thousands of species, tens or hundreds of thousands of species is infeasible. Recently we developed a simple branch and bound heuristic, RapidNJ, which significantly reduces the running time on average. However, the O(n2) space consumption of the RapidNJ method, and the NJ method in general, becomes a problem when inferring phylogenies with 10000+ taxa. In this paper we present two extentions of RapidNJ which reduce memory requirements and enable RapidNJ to infer very large phylogenetic trees efficiently. We also present an improved search heuristic for RapidNJ which improves RapidNJs performance on many data sets of all sizes.
2009
A sub-cubic time algorithm for computing the quartet distance between two general trees
T. Mailund, J. Nielsen, and C.N.S. Pedersen
Proc. of IJCBS 2009.
This is a fixed version of A quadratic time algorithm for computing the quartet distance between two general trees that turned out to be less quadratic than hoped...
[Abstract]
We derive a sub-cubic time and space algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree ≥ 3. The time and space complexity of our algorithm is quadratic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm for computing the quartet distance between general trees independent of the degree of the inner nodes.
A fast algorithm for genome wide haplotype pattern mining
S. Besenbacher, C.N.S. Pedersen and T. Mailund
In the Proc of APBC 2009, Tsinghua University Press, 772-780.
[Abstract]
Background: Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a diseases. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach to do exactly this. Originally developed for candidate gene analysis, its running time is not optimised for the much larger genome wide data sets in use today.
Results: We present a new, faster algorithm for the HPM method. The new approach patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method with a factor of 2 on a genome wide dataset with 5009 individuals typed in 491208 markers using default parameters and more if the pattern length is increased.
Conclusions: The new algorithm significantly speeds up the HPM method and makes it feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.
2008
Rapid neighbour-joining
M. Simonsen, T. Mailund and C.N.S. Pedersen
Proceedings of the Workshop on Algorithms in Bioinformatics (WABI) 2008, LNBI Volume 5251, Springer, 113-122 doi:10.1007/978-3-540-87361-7_10
[Abstract]
The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The cri- teria for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour- joining implementations.
2007
Computing the All-Pairs Quartet Distance on a Set of Evolutionary Trees
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal, and R. Fagerberg
Proceedings of the Asia-Pacific Bioinformatics Conference (APBC) 2007, pp. 91–100, Series on Advances in Bioinformatics and Computational Biology, vol. 5, Imperial College Press.
Presentation slides (first part of): [PDF] [Open Office]
[Abstract]
We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations thus performing significantly better on large sets of trees compared to performing distinct pairwise distance calculations, as we illustrate experimentally, where we see a speedup factor of around 130 in the best case.
Computing the Quartet Distance between Evolutionary Trees of Bounded Degree
M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal, and R. Fagerberg
Proceedings of the Asia-Pacific Bioinformatics Conference (APBC) 2007, pp. 101–110, Series on Advances in Bioinformatics and Computational Biology, vol. 5, Imperial College Press.
Presentation slides (second part of): [PDF] [Open Office]
[Abstract]
We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of n species. The previous best algorithm has running time O(d2n2) when considering trees, where no node is of more than degree d. The algorithm developed herein has running time O(d9n log n)) which makes it the first algorithm for computing the quartet distance between non-binary trees which has a sub-quadratic worst case running time.
2005
Algorithms for Computing the Quartet Distance between Trees of Arbitrary Degree
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of Workshop on Algorithms in Bioinformatics (WABI), 2005, LNBI 3692, pp. 77-88 © Springer-Verlag.
Full length version of Computing the Quartet Distance between Evolutionary Trees of Bounded Degree (extended abstract).
[Abstract]
We present two algorithms for computing the quartet distance between trees of arbitrary degree. The quartet distance between two unrooted evolutionary trees is the number of quartets—sub-trees induced by four leaves—that differs between the trees. Previous algorithms focus on computing the quartet distance between binary trees. In this paper, we present two algorithms for computing the quartet distance between trees of arbitrary degrees. One in time O(n3) and space O(n2) and one in time O(n2d2) and space O(n2), where n is the number of species and d is the maximal degree of the internal nodes of the trees. We experimentally compare the two algorithms.
Quartet Distance between General Trees (extended abstract)
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of International Conference on Numerical Analysis and Applied Mathematics (ICNAAM) 2005, pp. 796-799 © Wiley-VCH Verlag GmbH & Co.
Presentation slides: [PDF] [Open Office]
[Abstract]
We present two algorithms for computing the quartet distance between trees of arbitrary degree. The quartet distance between two unrooted evolutionary trees is the number of quartets—sub-trees induced by four leaves—that differs between the trees. Previous algorithms focus on computing the quartet distance between binary trees. In this paper, we present two algorithms for computing the quartet distance between trees of arbitrary degrees. One in time O(n3) and space O(n2) and one in time O(n2d2) and space O(n2), where n is the number of species and d is the maximal degree of the internal nodes of the trees. We experimentally compare the two algorithms.
Initial experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
Proceedings of The 2005 International Conference on Grid Computing and Applications (GCA'05)
[Abstract]
We report on very early experiences with running a bioinformatics simulation study on a newly developed grid architecture. We briefly describe the bioinformatics application—an association mapping algorithm for both locating disease loci and separating cases into those diseased due to genetic factors and those diseased solely due to environment factors—and describe the grid architecture and how the application is set up to run on the grid.
2004
Obtaining Memory-Efficient Reachability Graph Representations Using the Sweep-Line Method
T. Mailund and M. Westergaard
Proceedings of Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2004), LNCS 2988 pp. 177-191 © Springer-Verlag. The original publication is available on LINK at http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2988&spage=177.
[Abstract]
This paper is concerned with a memory-efficient representation of the reachability graphs. We describe a technique that enables us to represent each reachable marking in a number of bits close to the theoretical minimum needed for explicit state enumeration. The technique maps each state vector into a number between zero and the number of reachable states and uses the sweep-line method to delete the state vectors themselves. A prototype of the proposed technique has been implemented and experimental results are reported.
2003
Efficient Path Finding with the Sweep-Line Method using External Storage
L.M. Kristensen and T. Mailund
Proceedings of International Conference on Formal Engineering Methods (ICFEM 2003) LNCS 2885 pp. 319-337 © Springer-Verlag. The original publication is available on LINK at http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=2885&spage=319.
[Abstract]
The sweep-line method deletes states on-the-fly during state space exploration to reclaim memory and thereby reduce peak memory usage. This deletion of states prohibits the immediate generation of, e.g., an error-trace when the violation of a safety property is detected. We address this problem by combining the sweep-line method with storing a spanning tree of the explored state space in external storage on a magnetic disk. We show how this allows us to easily obtain paths in the state space, such as error-traces. A key property of the proposed technique is that it avoids searching in external storage during the state space exploration and gives the same reduction in peak memory usage as the stand-alone sweep-line method. The subsequent generation of the path then requires one seek on disk for each state on the path. We evaluate the proposed technique on a number of example systems by means of an implementation, and compare its performance to a related technique. These practical experiments demonstrate how the suggested technique complements existing techniques based on using external storage.
2002
A Compositional Sweep-Line State Space Exploration Method
L.M. Kristensen and T. Mailund
Proceedings of Formal Description Techniques for Distributed Systems and Communication Protocols (FORTE 2002), LNCS 2529 pp. 327-343 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2529/25290327.pdf.
[Abstract]
State space exploration is a main approach to verification of finite-state systems. The sweep-line method exploits a certain kind of progress present in many systems to reduce peak memory usage during state space exploration. We present a new sweep-line algorithm for a compositional setting where systems are composed of subsystems. The compositional setting makes it possible to divide subsystem progress measures into monotone and non-monotone progress measures to further reduce peak memory usage. We show that in a compositional setting, it is possible to automatically obtain a progress measure for the system by combining the progress measure for each subsystem to a progress measure for the full system.
A Generalised Sweep-Line Method for Safety Properties
L.M. Kristensen and T. Mailund
Proceedings of Formal Methods Europe (FME 2002), LNCS 2391 pp. 549-567 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2391/23910549.pdf.
[Abstract]
The recently developed sweep-line method exploits progress present in many concurrent systems to explore the full state space of the system while storing only small fragments of the state space in memory at a time. A disadvantage of the sweep-line method is that it relies on a monotone and global notion of progress. This prevents the method from being used for many reactive systems. In this paper we generalise the sweep-line method such that it can be used for verifying safety properties of reactive systems exhibiting local progress. The basic idea is to relax the monotone notion of progress and to recognise the situations where this could cause the state space exploration not to terminate. The generalised sweep-line method explores all reachable states of the system, but may explore a state several times. We demonstrate the practical application of the generalised sweep-line method on two case studies demonstrating a reduction in peak memory usage to typically 10% compared to the use of ordinary full state spaces.
Analysing Infinite-State Systems by Combining Equivalence Reduction and the Sweep-Line Method
T. Mailund
Proceedings of International Conference on Application and Theory of Petri Nets (ICATPN 2002), LNCS 2360 pp. 314-333 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2360/23600314.pdf.
[Abstract]
The sweep-line method is a state space exploration method for on-the-fly verification aimed at systems exhibiting progress. Presence of progress in the system makes it possible to delete certain states during state space generation, which reduces the memory used for storing the states. Unfortunately, the same progress that is used to improve memory performance in state space exploration often leads to an infinite state space: The progress in the system is carried over to the states resulting in infinitely many states only distinguished through the progress. A finite state space can be obtained using equivalence reduction, abstracting away the progress, but in its simplest form this removes the progress property required by the sweep-line method. In this paper we examine a new method for using equivalence relations to obtain a finite set of classes, without compromising the progress property essential for the sweep-line method. We evaluate the new method on two case studies, showing significant improvements in performance, and we briefly discuss the new method in the context of Timed Coloured Petri Nets, where the `increasing global time' semantics can be exploited for more efficient analysis than what is achieved using a `delay' semantics.
Sweep-Line State Space Exploration for Coloured Petri Nets
G.E. Gallasch, L.M. Kristensen, and T. Mailund
Proceedings of Fourth Workshop on Practical Use of Coloured Petri Nets and the CPN Tools (CPN 2002). DAIMI PB-560 pp. 101-119, Aarhus, Denmark, 2002.
[Abstract]
The basic idea of the sweep-line state space method is to exploit a formal notion of progress found in many concurrent and distributed systems. Exploiting progress makes it possible to sweep through the state space of a CP-net while storing only a small fragment of the states in memory at any time. Properties of the system can then be verified on-the-fly during the sweep of the state space. This can lead to significant savings in peak memory usage and computation time. Examples of systems possessing progress are transport protocols, transactions protocols, workflow models, and systems modelled with Timed CP-nets. We present Sweep/CPN, a library extension to Design/CPN supporting the sweep-line method, and demonstrate its use.
2001
Condensed State Spaces for Timed Petri Nets
S. Christensen, L.M. Kristensen and T. Mailund
Proceedings of International Conference on Application and Theory of Petri Nets (ICATPN 2001), LNCS 2075 pp. 101-120 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2075/20750101.pdf.
[Abstract]
We present a state space method for Petri nets having a time concept based on a global clock and associating time stamps to tokens. The method is based on equivalence on states and makes it possible to condense the usually infinite state space of such timed Petri nets into a finite state space without loosing analysis power. The practical application of the method is demonstrated on a large example of an audio/video protocol by means of a computer tool implementing the method.
A Sweep-Line Method for State Space Exploration
S. Christensen, L.M. Kristensen and T. Mailund
Proceedings of Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2001), LNCS 2031 pp. 450-464 © Springer-Verlag. The original publication is available on LINK at http://link.springer.de/link/service/series/0558/papers/2031/20310450.pdf.
[Abstract]
We present a state space exploration method for on-the-fly verification. The method is aimed at systems for which it is possible to define a measure of progress based on the states of the system. The measure of progress makes it possible to delete certain states on-the-fly during state space generation, since these states can never be reached again. This in turn reduces the memory used for state space storage during the task of verification. Examples of progress measures are sequence numbers in communication protocols and time in certain models with time. We illustrate the application of the method on a number of Coloured Petri Net models, and give a first evaluation of its practicality by means of an implementation based on the Design/CPN state space tool. Our experiments show significant reductions in both space and time used during state space exploration. The method is not specific to Coloured Petri Nets but applicable to a wide range of modelling languages.
State Space Methods for Timed Coloured Petri Nets
S. Christensen, K. Jensen, T. Mailund and L.M. Kristensen
Proceedings of 2nd International Colloquium on Petri Net Technologies for Modelling Communication Based Systems. pp. 14-15, Berlin Germany, September 2001.
[Abstract]
We present two recently developed state space methods for timed coloured Petri nets which reconciles state space methods and time concepts of coloured Petri nets. The first method is based on an equivalence relation on states which makes it possible to condense the usually infinite state space of a timed coloured Petri net into a finite condensed state space without serious loss of analysis power. The second method, the sweep-line method, exploits the increasing model time of timed CP-nets to store only small fragments of the full state space in memory at a time thereby alleviating the the state explosion problem. The performance and potential of the methods is demonstrated on an industrial case study by means of a computer tool implementing the two methods.
2000
Separation of Style and Content with XML in an Interchange Format for High-level Petri Nets
T. Mailund and K.H. Mortensen
Proceedings of the Meeting on XML/SGML based Interchange Formats for Petri Nets, pp. 7-12, Aarhus, Denmark, 2000.
[Abstract]
Style sheets have been proposed by the World Wide Web Consortium (W3C) as a means for separating presentation and content in World Wide Web documents, and they are now widely used and generally popular. In this paper we use the same idea for an XML interchange format proposal for high-level Petri nets. There are several benefits of using this design principle: It is easier to exchange content-only for non-graphics tools, and alternative styles can conveniently be replaced to make a new graphical appearance. The ideas presented in this paper are illustrated by means of examples.
1999
Parameterised Coloured Petri Nets
T. Mailund
Proceedings of Second Workshop on Practical Use of Coloured Petri Nets and Design/CPN, DAIMI PB-541 pp. 133-151, Aarhus, Denmark, 1999.
[Abstract]
In this paper we examine coloured Petri nets extended with parameters. We characterise three kinds of parameterisation and formally define Parameterised Coloured Petri Nets. We then discuss how parameterised coloured Petri nets can be used to create libraries of coloured Petri net modules in the same way as libraries for programming languages. Finally we discuss how to implement a simple simulator for such modules.
1998
Textual Interchange Format for High-Level Petri Nets
R.B. Lyngsø and T. Mailund
Proceedings of First Workshop on Practical Use of Coloured Petri Nets and Design/CPN, DAIMI PB-532 pp. 47-64 Aarhus, Denmark, 1998.
[Abstract]
In this paper a text format for high-level Petri net diagrams is presented. The text format is designed to serve as a platform-independent file format for the Design/CPN tool. It is consistent with the forthcoming standard for high-level Petri nets. The text format may also be seen as our contribution to the development of an open, tool-independent interchange format for high-level Petri nets. The text format will make it possible to move Design/CPN diagrams between all supported hardware platforms and versions. It is also designed to be a bridge to other Petri net tools, e.g., other analysis tools which the user may want to use with Design/CPN diagrams. The proposed text format does not address any standardisation for the inscription language used in the diagram. It is, however, possible to extend the format to incorporate such a standardisation. The text format is designed for the exchange of hierarchical coloured Petri nets but the structure is general enough to cope with other high-level Petri nets as well. The text format presented here has been implemented as part of Design/CPN version 3.1.
Technical Reports
2004
SplitDist—Calculating Split-Distances for Sets of Trees
T. Mailund
May 24, 2004.
[Abstract]
We present a tool for comparing a set of input trees, calculating for each pair of trees the split-distances, i.e., the number of splits in one tree not present in the other.
2003
Speeding Up Neighbour-Joining Tree Construction
G.S. Brodal, R. Fagerberg, T. Mailund, C.N.S. Pedersen, and D. Phillips
ALCOMFT-TR-03-102, ALCOM-FT, November 2003.
Presentation slides: [PostScript] [Open Office]
[Abstract]
A widely used method for constructing phylogenetic trees is the neighbour-joining method of Saitou and Nei. We develope heuristics for speeding up the neighbour-joining method which generate the same phylogenetic trees as the original method. All heuristics are based on using a quad-tree to guide the search for the next pair of nodes to join, but differ in the information stored in quad-tree nodes, the way the search is performed, and in the way the quad-tree is updated after a join. We empirically evaluate the performance of the heuristics on distance matrices obtained from the Pfam collection of alignments, and compare the running time with that of the QuickTree tool, a well-known and widely used implementation of the standard neighbour-joining method. The results show that the presented heuristics can give a significant speed-up over the standard neighbour-joining method, already for medium sized instances.
2002
A Compositional Sweep-Line State Space Exploration Method—EXTENDED VERSION
L.M. Kristensen and T. Mailund
Dept. of Computer Science, University of Aarhus, April 2002
[Abstract]
State space exploration is a main approach to verification of finite-state systems. The sweep-line method exploits a certain kind of progress present in many systems to reduce peak memory usage during state space exploration. We present a new sweep-line algorithm for a compositional setting where systems are composed of subsystems. The compositional setting makes it possible to divide subsystem progress measures into monotone and non-monotone progress measures to further reduce peak memory usage. We show that in a compositional setting, it is possible to automatically obtain a progress measure for the system by combining the progress measure for each subsystem to a progress measure for the full system.
Tutorials
2006
Association Mapping: Fundamental Principles and Applications
T. Mailund, M.H. Schierup, J. Hein, L. Schauser, and J.N. Madsen
At Pacific Symposium on Biocomputing (PSB) 2006.
Presentation slides: [PDF] [Open Office]
[Abstract]
The past 15 years have witnessed an explosion in the identification of genes involved in Mendelian diseases. Most of these genes have been identified through family based linkage analysis. In contrast, the unravelling of the genetic basis of common, but complex diseases, such as cancer, diabetes and Alzheimer's disease has made little progress in this period. Complicating factors that make linkage analysis of these diseases impractical are late disease onset, the involvement of multiple low-penetrance genetic factors, ambiguous diagnosis, as well as strong influence of environmental and lifestyle factors. Power analysis and simulations suggest that association studies are the method of choice when attempting to genetically characterize complex diseases. Association studies draw a number of cases and controls from the same population and genotype their DNA at polymorphic loci. Excess correlation amongst alleles from separate loci, called Linkage Disequilibrium (LD), in the case group indicates the presence of a risk factor. Here we discuss the challenge of complex diseases, as well as simple and model based statistical approaches to the mapping of complex diseases.
2005
Association Mapping: Design Issues and Data Analysis Approaches
L. Schauser, T. Mailund, J.N. Madsen, J. Hein and M.H. Schierup
At Pacific Symposium on Biocomputing (PSB) 2005.
[Abstract]
The past 15 years have witnessed an explosion in the identification of genes involved in Mendelian diseases. Most of these genes have been identified through family based linkage analysis. In contrast, the unravelling of the genetic basis of common, but complex diseases, such as cancer, diabetes and Alzheimer's disease has made little progress in this period. Complicating factors that make linkage analysis of these diseases impractical are late disease onset, the involvement of multiple low-penetrance genetic factors, ambiguous diagnosis, as well as strong influence of environmental and lifestyle factors. Power analysis and simulations suggest that association studies are the method of choice when attempting to genetically characterize complex diseases. Association studies draw a number of cases and controls from the same population and genotype their DNA at polymorphic loci. Excess correlation amongst alleles from separate loci, called Linkage Disequilibrium (LD), in the case group indicates the presence of a risk factor. Here we discuss the challenge of complex diseases, as well as simple and model based statistical approaches to the mapping of complex diseases.
1999
Advanced Tutorial on the use of Coloured Petri Nets and Design/CPN
L.M. Kristensen, T. Mailund, and L.M. Wells
Second Workshop and Tutorial on Practical Use of Coloured Petri Nets and Design/CPN, 1999.
[Abstract]
What you always wanted to know about practical use of Design/CPN (but were afraid to ask). The advanced tutorial is aimed at engineers, students and researchers who already have some experience with the use of Coloured Petri Nets and Design/CPN.
Manuals
2006
Getting Started with HapCluster — An introduction to the HapCluster association mapping tool
T. Mailund
[Abstract]
HapCluster++ is a software package for linkage disequilibrium mapping. It is based on a Bayesian Markov-chain Monte Carlo (MCMC) method for finescale linkage-disequilibrium gene mapping using high-density marker maps.
Getting Started with Blossoc — An introduction to the Blossoc association mapping tool
T. Mailund
[Abstract]
Blossoc is a linkage disequilibrium association mapping tool that attempts to build (perfect) genealogies for each site in the input and score these according to non-random clustering of affected individuals, and judge high-scoring areas as likely candidates for containing disease affecting variation. Building the local genealogy trees is based on a number of heuristics that are not guaranteed to build true trees, but have the advantage of more sophisticated methods of being extremely fast.
GeneRecon Users' Manual — A coalescent based tool for fine-scale association mapping
T. Mailund
[Abstract]
GeneRecon is a software package for linkage disequilibrium mapping using coalescent theory. It is based on Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium gene mapping using high-density marker maps. GeneRecon explicitly models the genealogy of a sample of the case chromosomes in the vicinity of a disease locus. Given case and control data in the form of genotype or haplotype information, it estimates a number of parameters, most importantly, the disease position.
Getting Started with GeneRecon — An Introduction to the Association Mapping Tool GeneRecon
T. Mailund and L. Schauser
[Abstract]
GeneRecon is a software package for linkage disequilibrium mapping using coalescent theory. It is based on Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium gene mapping using high-density marker maps. GeneRecon explicitly models the genealogy of a sample of the case chromosomes in the vicinity of a disease locus. Given case and control data in the form of genotype or haplotype information, it estimates a number of parameters, most importantly, the disease position.
2005
CoaSim Guile Manual — Using the Guile-based CoaSim Simulator
T. Mailund
[Abstract]
CoaSim is a tool for simulating the coalescent process with recombination and geneconversion, under either constant population size or exponential population growth. It effectively constructs the ancestral recombination graph for a given number of chromosomes and uses this to simulate samples of SNP and micro-satellite haplotypes or genotypes.
Getting Started with CoaSim — An Introduction to the Simulator CoaSim
T. Mailund
[Abstract]
CoaSim is a tool for simulating the coalescent process with recombination and geneconversion, under either constant population size or exponential population growth. It effectively constructs the ancestral recombination graph for a given number of chromosomes and uses this to simulate samples of SNP and micro-satellite haplotypes or genotypes.
2003
DProg Tutorial/Manual
T. Mailund
[Abstract]
DPROG is a domain specific language for specifying dynamic programming algorithms; given a recursive definition of the problem, the compiler generates code for solving the problem using dynamic programming. This tutorial/manual explains the basics of the language and how the generated code can be used in an application. The manual covers version 0.2 of DPROG; the language might change in later versions.
Dissertations
2003
Sweeping the State Space—A Sweep-Line State Space Exploration Method
T. Mailund
Ph.D. Dissertation, Department of Computer Science, University of Aarhus, February 2003.
[Abstract]
The thesis describes the sweep-line method, a newly developed reduction method for alleviating the state explosion problem inherent in explicit-state state space exploration. The basic idea underlying the sweep-line method is, when calculating the state space, to recognise and delete states that are not reachable from the currently unprocessed states. Intuitively we drag a sweep-line through the state space with the invariant that all states behind the sweep-line have been processed and are unreachable from the states in front of the sweep-line. When calculating the state space of a system we iteratively calculate successors of new states, and we need to distinguish between previously seen states and truly new states to avoid reprocessing the same states indefinitely. When calculating the successors of the known unprocessed states, all successors are in front of the sweep-line. Consequently, the states behind the sweep-line are not needed for distinguishing between new and old states and can therefore be deleted.
2000
Coloured Petri Nets—A Tool in Software Engineering
T. Mailund
Progress report, Ph.D. part A, Department of Computer Science, University of Aarhus, December 2000.
[Abstract]
This progress report presents and summarizes the research work done by the author during part A of his PhD study at the Department of Computer Science, University of Aarhus.
The work described consists of three main parts. The first part is on language constructions for modular coloured Petri net models. Work on parameterisation of modules, and how this can be used to gain higher degrees of reuse, is presented, and work in progress on behaviour preserving module refinements is sketched.
The second part is on state space analysis methods for model verification, where two new reduction methods are described. The first method exploits equivalences to gain finite representations for the usually infinite state spaces of timed coloured Petri nets. The second method exploits the progress, that is present in some models, to sweep through all reachable states without, at any given time, storing the entire state space.
The third and last part is on a text format for exchanging models between different tools. An XML-based text format is described. This text format is now implemented in the tools Design/CPN, as an optional file format, and CPN Tools, as the main file format. Techniques for separating content and graphical style information, using style sheets, is also described.