Publications
To Appear
A coalescent model incorporating allele specific mutation rates:
application to the estimation of effective population size from microsatellite
polymorphism data
T. Bataillon, A.A.T. Smith, T.Mailund and A-C. Thuillet
To appear in Genetics
Recent experimental evidence has demonstrated that mutation rates of
microsatellites are allele-specific rather than locus-specific. Although
numerous population genetics models have been devised
to accommodate the peculiarities of the microsatellite mutational model,
allowing for stepwise mutation events, virtually no population genetics model
incorporating variation in mutation rate among alleles at a same locus has
been studied. We have developed a simple urn model that allows the
simulation of microsatellite data from a stationary Wright Fisher population
that incorporates allele-specific mutation patterns. An approximate Bayesian
framework for the estimation of long term effective size,
Ne, is proposed. The method uses rejection sampling
to approximate the likelihood of joint summary statistics describing
microsatellite polymorphism data. The performance of the method
is examined using Monte Carlo simulations of datasets. Last, we illustrate
our method by re-analyzing data from an earlier study documenting patterns
of microsatellite polymorphism before and after domestication in durum
wheat (Triticum dicoiccoides) where the relationship between allele length
and allele mutability has been established from direct observations of mutation
rates.
2009
Julien Y Dutheil, Ganesh Ganapathy, Asger Hobolth,
Thomas Mailund, Marcy K Uyenoyama, and
Mikkel H Schierup
Genetics 183, 259–274
2009. doi:10.1534/genetics.109.103010
With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their
genomes. This feature can be used to infer population genetics parameters in ancestral species.
We use a hidden Markov model to infer the genealogical state along a four-species alignment with
ILS. Coalescent theory is used to parametrize the hidden Markov model in terms of population
genetics quantities. The transition probabilities and emission probabilities can be expressed as
functions of the population parameters, speciation times, ancestral effective population sizes
and recombination rate, and estimated from a genome alignment of closely related species. We
analyze a basic, panmictic demographic model and study its properties using an extensive set of
coalescent simulations. We assess the effect of the model assumptions, and demonstrate that the
Markov property provides a good approximation to the ancestral recombination graph. Using a
too restricted set of possible genealogies, necessary to reduce the computational load, can bias
parameter estimates. We propose a simple correction for this bias, and suggest directions for future
extensions of the model. We show that the patterns of ILS along a sequence alignment undergoing
recombination can also be recovered effciently and that the ancestral recombination rate can be
recovered. Finally, we introduce an extension of the basic model that allows for substitution rate
heterogeneity, and reanalyze Human-Chimpanzee-Gorilla-Orangutan alignments using the new
models.
M. Emily, T. Mailund, J. Hein, L. Schauser and M.H. Schierup
European Journal of Human Genetics; doi: 10.1038/ejhg.2009.15
Genome-wide association studies have identified a large number of Single-Nucleotide Polymorphisms (SNPs) that individually predispose to diseases. Genome-wide association studies have identified a large number of Single-Nucleotide Polymorphisms (SNPs) that individually predispose to diseases. However many genetic risk factors remain unaccounted for. Proteins coded by genes interact in the cell and it is likely that certain variants mainly affect the phenotype in combination with other variants, termed epistasis. An exhaustive search for epistatic effects is computationally demanding, since several billions of SNP pairs exist for typical genotyping chips. In this study, the experimental knowledge on biological networks is used to narrow the search for two-locus epistasis. We provide evidence that this approach is computationally feasible and statistically powerful. By applying this method to the Wellcome Trust Case-Control Consortium (WTCCC) data sets, we report four significant cases of epistasis between unlinked loci, in susceptibility to Crohn's disease, bipolar disorder, hypertension and rheumatoid arthritis.
M.H. Schierup, T. Mailund, H. Li, J. Wang, A. A. Tjønneland, U. Vogel, L. Bolund
and B. Nexø.
BMC Evolutionary Biology, 10(20); doi:10.1186/1471-2350-10-20
Background. A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes of which 3, ERCC1, ERCC2, and PPP1R13L (aka RAI) are related to DNA repair and cell survival, and one, CD3EAP, aka ASE1, may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents and markers in it are associated with risk of several cancers.
Methods. We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk.
Results. We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the Fst value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping.
Conclusions. The combined observation that a common haplotype causing an increased risk of cancer in Europeans and a high differentiation between human populations is highly unusual and suggests a causal relationship with a recent increase in Europeans caused either by genetic drift overruling selection against the susceptibility variant or a positive selection for the same haplotype. The data does not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.
S. Besenbacher, T. Mailund and M.H. Schierup
We present a new method, termed QBlossoc, for linkage disequilibrium (LD)
mapping of genetic variants underlying a quantitative trait. The method builds on
a previously published method, Blossoc, for LD mapping of case-control studies.
The method builds local genealogies along the genome and looks for a significant
clustering of quantitative trait values in these trees. We analyse its efficiency in terms
of localisation and ranking of true positives among a large number of negatives and
compare the results with single marker approaches. Simulation results of markers
at densities comparable to genotype chips show that QBlossoc is more accurate in
localisation of true positives as expected since it uses the additional information of
LD between markers simultaneously. More importantly for genome wide surveys,
QBlossoc ranks true positives significantly higher than single marker approaches,
again suggesting that a true signal displays itself more strongly in a set of adjacent
markers than a spurious (false) signal. The method is both memory and CPU
efficient. It has been tested on a real data set of height data for 5000 individuals
measured at 317K markers and completed analysis within 5 CPU days.
The QBlossoc program with full documentation can be downloaded through
www.daimi.au.dk/~mailund/.
S. Besenbacher, C.N.S. Pedersen and T. Mailund
BMC Bioinformatics, 10(Suppl 1): S74 doi:10.1186/1471-2105-10-S1-S74
Background: Identifying the genetic components of
common diseases has long been an important area of research.
Recently, genotyping technology has reached the level where it is
cost effective to genotype single nucleotide polymorphism (SNP)
markers covering the entire genome, in thousands of individuals, and
analyse such data for markers associated with a diseases. The
statistical power to detect association, however, is limited when
markers are analysed one at a time. This can be alleviated by
considering multiple markers simultaneously. The Haplotype
Pattern Mining (HPM) method is a machine learning approach to do
exactly this. Originally developed for candidate gene analysis, its
running time is not optimised for the much larger genome wide data
sets in use today.
Results: We present a new, faster algorithm for the
HPM method. The new approach patterns of haplotype diversity in the
genome: locally in the genome, the number of observed haplotypes is
much smaller than the total number of possible haplotypes. We show
that the new approach speeds up the HPM method with a factor of 2 on
a genome wide dataset with 5009 individuals typed in 491208 markers
using default parameters and more if the pattern length is
increased.
Conclusions: The new algorithm significantly speeds up
the HPM method and makes it feasible to apply HPM to whole genome
association mapping with thousands of individuals and hundreds of
thousands of markers.
2008
J. Nielsen and T. Mailund
BMC Bioinformatics 2008, 9:526; doi:10.1186/1471-2105-9-526
Background: High-throughput genotyping technology has enabled cost effective typing of thousands of individuals in hundred of thousands of markers for use in genome wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
Results: We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods.
Conclusions: The new file format has been very useful for our own studies where it has significantly reduced the informatics burden in keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation with the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. Through scripting interfaces to the file format, we hope to alleviate this in the future.
Z. Ding, T. Mailund and Y.S. Song
Motivation:
Recent advances in genotyping technology has made data
acquisition for whole-genome association study cost effective, and a
current active area of research is developing efficient methods to
analyze such large-scale data sets. Most sophisticated association
mapping methods that are currently available take phased
haplotype data as input. However, phase information is not readily
available from sequencing methods and inferring the phase via
computational approaches is time-consuming, taking days to phase a
single chromosome.
Results: In this paper, we devise an efficient method for
scanning unphased whole-genome data for association. Our approach
combines a recently found linear-time algorithm for phasing
genotypes on trees with a recently proposed tree-based method for
association mapping. From unphased genotype data, our algorithm
builds local phylogenies along the genome, and scores each
tree according to the clustering of cases and controls. We assess
the performance of our new method on both simulated and real
biological data sets.
S. de Groot, T. Mailund, G.A. Lunter and J. Hein
Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative
selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal,
and R. Fagerberg
Journal of Bioinformatics and Computational Biology 6(1) 37–50 2008.
Linked version is a preprint of an article submitted for consideration in Journal of Bioinformatics and Computational Biology © 2008
World Scientific Publishing Company
Special Issue journal version of Computing the All-Pairs Quartet Distance
on a Set of Evolutionary Trees APBC'07.
We present two algorithms for calculating the quartet
distance between all pairs of trees in a set of binary
evolutionary trees on a common set of species. The algorithms
exploit common substructure among the trees to speed up the
pairwise distance calculations thus performing significantly
better on large sets of trees compared to performing distinct
pairwise distance calculations, as we illustrate
experimentally, where we see a speedup factor of around 130
in the best case.
2007
J. Davies, F. Simancik, R. Lyngsø, T. Mailund, and J. Hein
Coalescent Theory is almost ubiquitous in contemporary
molecular population genetics. Inherent in most
applications is a continuous time approximation that
assumes sample size is small relative to the actual
population size. This assumption in effect precludes
simultaneous and multiple coalescent events, which can
constitute an arbitrarily large component when sample size
is sufficiently large. In most situations this is
justifiably ignored as a large sample size will only
have few ancestors a couple of generations back and then
the assumption is valid. However, in tracing the
evolutionary history of large chromosomal segments, a large
recombination rate will consistently keep the number of
ancestors large such that multiple and simultaneous
coalescent events cannot be ignored. This can create a
major disparity between discrete time and continuous time
models and we here show its importance illustrated with
parameters typical of the human genome. The presence of
gene convergence only aggravates its importance. This could
seriously undermine the application of coalescent theory to
complete genomes. However, it can be shown that multiple
and simultaneous coalescent events influences global
quantities, such as total number of ancestors, but has
negligible effect on local quantities, such as linkage
disequilibrium or similarities of close local
trees. Reassuringly the majority of applications of
coalescent models with recombination are based on local
quantities for purposes such as association mapping.
S. McCauley, S. de Groot, T. Mailund, and J. Hein
Motivation: Viral genomes tend to code in overlapping
reading frames to maximize information content. This may
result in atypical codon bias and particular evolutionary
constraints. Due to the fast mutation rate of viruses, there
is additional strong evidence for varying selection between
intra- and intergenomic regions. The presence of multiple
coding regions complicates the concept of
Ka/Ks ratio, and thus begs for
an alternative approach when investigating selection
strengths. Building on the paper by McCauley & Hein
(2006), we develop a method for annotating a viral genome
coding in overlapping reading frames. We introduce an
evolutionary model capable of accounting for varying levels
of selection along the genome, and incorporate it into our
prior single sequence HMM methodology, extending it now to a
phylogenetic HMM. Given an alignment of several homologous
viruses to a reference sequence, we may thus achieve an
annotation both of coding regions as well as selection
strengths, allowing us to investigate different selection
patterns and hypotheses.
Results: We illustrate our method by applying it to a
multiple alignment of four HIV2 sequences, as well as four
Hepatitis B sequences. We obtain an annotation of the coding
regions, as well as a posterior probability for each site of
the strength of selection acting on it. From this we may
deduce the average posterior selection acting on the
different genes. Whilst we are encouraged to see in HIV2,
that the known to be conserved genes gag and
pol are indeed annotated as such, we also discover
several sites of less stringent negative selection within the
env gene. The the best of our knowledge we are the
first to subsequently provide a full selection annotation of
the Hepatitis B genome by explicitly modelling the evolution
within overlapping reading frames, and not relying on the
simple Ka/Ks ratios.
S. de Groot, T. Mailund, and J. Hein
Motivation: Detecting genes in viral genomes is a
complex task. Due to the biological necessity of them
being constrained in length, RNA viruses in particular tend
to code in overlapping reading frames. Since one amino acid
is encoded by a triplet of nucleic acids, up to three genes
may be coded for simultaneously in one
direction. Conventional HMM based gene finding algorithms
may find it difficult — if not impossible — to
identify multiple coding regions, since in general their
topologies do not allow for the presence of overlapping or
nested genes. Comparative methods have therefore been
restricted to likelihood ratio tests on potential regions
as to being double or single coding, using the fact that
the constrictions forced upon multiple-coding nucleotides
will result in atypical sequence evolution. Exploiting
these same constraints, we present a hidden Markov model
based gene-finding program, which allows for coding in
unidirectional nested and overlapping reading frames, to
annotate two homologous aligned viral genomes. Our method
does not insist on conserved gene structure between the two
sequences, thus making it applicable for the pairwise
comparison of more distantly related sequences.
Results: We apply our method to 15 pairwise
alignments of six different HIV2 genomes. Given sufficient
evolutionary distance between the two sequences, we achieve
sensitivity of about 84% and specificity of about 97%. We
additionally annotate three pairwise alignments of the more
distantly related HIV1 and HIV2, as well as of two
different Hepatitis Viruses, attaining results of ~87%
sensitivity and ~98.5% specificity. We subsequently
incorporate prior knowledge by "knowing" the gene structure
of one sequence and annotating the other conditional on
it. Boosting accuracy close to perfect we demonstrate that
conservation of gene structure on top of nucleotide
sequence is a valuable source of information, especially in
distantly related genomes.
A. Hobolth, O.F. Christensen, T. Mailund and
M.H. Schierup
The genealogical relationship of human, chimpanzee and
gorilla varies along the genome. We develop a hidden Markov
model (HMM) that incorporates this variation and relate the
model parameters to population genetics quantities such
as speciation times and ancestral population sizes. The HMM
is an analytically tractable approximation to the
coalescent process with recombination, and in simulations we
see no apparant bias in the HMM estimates. We apply the HMM
to four autosomal contiguous human-chimp-gorilla-orangutan
alignments comprising a total of 1.9 million base pairs. We find
a very recent speciation time of human-chimp (4.1 +/- 0.4
million years), and fairly large ancestral effective
population sizes (65,000 +/- 30,000 for the human-chimp
ancestor and 45,000 +/- 10,000 for the human-chimp-gorilla
ancestor). Furthermore, around 50% of the
human genome coalesces with chimpanzee after speciation with
gorilla. We also consider 250 thousand base pairs of
X-chromosome alignments and find an effective population
size much smaller than 75% of the autosomal effective population
sizes. Finally, we find that the rate of transitions
between different genealogies correlate well with the regionwide
present day human recombination rate, but does not
correlate with the fine-scale recombination rates and
recombination hot spots, suggesting that the latter are
evolutionary transient.
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and
H.H. Karlsen
In Future Generation Computer Systems 2007
23
580–586.
doi:10.1016/j.future.2006.09.003.
Special Issue journal version of
Initial Experiences with GeneRecon
on MiG GCA'05.
We report on our experiences so far with running a
bioinformatics simulation study on a newly developed grid
architecture. We briefly describe the application
bioinformaticsan association mapping algorithm for both
locating disease loci and separating cases into those
diseased due to genetic factors and those diseased solely due
to factors environmentand describe the grid architecture and
how the application is set up to run on the grid.
2006
T. Mailund, S. Besenbacher, and M.H. Schierup
Background: With current technology, vast amounts of
data can be cheaply and efficiently produced in association
studies, and to prevent data analysis to become the
bottleneck of studies, fast and efficient analysis methods
that scale to such data set sizes must be developed.
Results: We present a fast method for accurate
localisation of disease causing variants in high density
case-control association mapping experiments with large
numbers of cases and controls. The method searches for
significant clustering of case chromosomes in the "perfect"
phylogenetic tree defined by the largest region around each
marker that is compatible with a single phylogenetic tree.
This perfect phylogenetic tree is treated as a decision tree
for determining disease status, and scored by its accuracy as
a decision tree. The rationale for this is that the perfect
phylogeny near a disease affecting mutation should provide
more information about the affected/unaffected classification
than random trees. If regions of compatibility contain few
markers, due to e.g. large marker spacing, the algorithm can
allow the inclusion of incompatibility markers in order to
enlarge the regions prior to estimating their
phylogeny. Haplotype data and phased genotype data can be
analysed. The power and efficiency of the method is
investigated on 1) simulated genotype data under different
models of disease determination 2) artificial data sets
created from the HapMap ressource, and 3) data sets used for
testing of other methods in order to compare with these. Our
method has the same accuracy as single marker association
(SMA) in the simplest case of a single disease causing
mutation and a constant recombination rate. However, when it
comes to more complex scenarios of mutation heterogeneity and
more complex haplotype structure such as found in the HapMap
data our method outperforms SMA as well as other fast, data
mining approaches such as HapMiner and Haplotype Pattern
Mining (HPM) despite being significantly faster. For unphased
genotype data, an initial step of estimating the phase only
slightly decreases the power of the method. The method was
also found to accurately localise the known susceptibility
variants in an empirical data set—the ΔF508 mutation
for cystic fibrosis—where the susceptibility variant is
already known—and to find significant signals for
association between the CYP2D6 gene and poor drug metabolism,
although for this dataset the highest association score is
about 60kb from the CYP2D6 gene.
Conclusions: Our method has been implemented in the
Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome
wide chip-based surveys of 3 million SNPs in 1000 cases and
1000 controls can be analysed in less than two CPU hours.
C. Christiansen, T. Mailund, C.N.S. Pedersen, M. Randers,
and M. Stissing
Background:
A number of algorithms have been developed for calculating the
quartet distance between two evolutionary trees on the same set of
species. The quartet distance is the number of quartets—sub-trees
induced by four leaves—that differs between the trees. Mostly,
these algorithms are restricted to work on binary trees, but
recently we have developed algorithms that work on trees of
arbitrary degree.
Results: We present a fast algorithm for computing the
quartet distance between trees of arbitrary degree. Given input
trees T and T', the algorithm runs in time
O(n+|V||V'|min{id, id'}) and space O(n+|V||V'|),
where n is the number of leaves in the two trees,
V and V' are the non-leaf nodes in T and
T', respectively, and id and id' are the
maximal number of non-leaf nodes adjacent to a non-leaf node in
T and T', respectively. The fastest algorithms
previously published for arbitrary degree trees run in
O(n3) (independent of the degree of the tree)
and O(|V||V'| id id'), respectively. We
experimentally compare the algorithm with existing algorithms for
computing the quartet distance for general trees.
Conclusions:
We present a new algorithm for computing the quartet distance
between two trees of arbitrary degree. The new algorithm improves
the asymptotic running time for computing the quartet distance,
compared to previous methods, and experimental results indicate that
the new method also performs significantly better in
practice.
T. Mailund,
M.H. Schierup,
C.N.S. Pedersen,
J.N. Madsen,
J. Hein, and
L. Schauser
GeneRecon is a tool for fine-scale association mapping
using a coalescence model. GeneRecon takes as input
case-control data from phased or unphased SNP and
micro-satellite genotypes. The posterior distribution of
disease locus position is obtained by Metropolis Hastings
sampling in the state space of genealogies. Input format,
search strategy, and the sampled statistics can be
configured through the Guile Scheme programming language
embedded in GeneRecon, making GeneRecon highly
configurable.
T. Bataillon, T. Mailund, S. Thorlacius, E. Steingrimsson,
T. Rafnar, M.M. Halldorsson, V. Calian, and M.H. Schierup
Characterizing the extent of linkage disequilibrium (LD) in
the genome is a pre-requisite for association mapping
studies. Patterns of LD also contain information about the
past demography of populations. On this study we focus on
the Icelandic population where LD was investigated in 12
regions of ~15 cM using regularly spaced microsatellite
loci. In total 1753 individuals were genotyped for 179
markers. LD was estimated using a composite disequilibrium
measure based on unphased data. LD decreases with distance
in all 12 regions and can be detected over approximately 4
cM for the given sample size. Differences in the patterns
of decrease of LD with distance among genomic regions were
mostly due to two regions exhibiting respectively higher
and lower proportions of pairs in LD than average within
the first 4 cM. We pooled data from all regions except
these two and summarized patterns of LD by computing the
proportion of pairs of loci exhibiting significant LD (at
the 5% level) as a function of distance. We compared
observed patterns of LD with simulated datasets obtained
under scenarios with varying demography and intensity of
recombination. We show that unphased data allow to make
inferences on scaled recombination rates from patterns of
LD. Patterns of LD in Iceland suggest a genome-wide scaled
recombination rate of ρ* = 200 [130–330]
per cM which is equivalent to a long term effective
population size of ~5000 in the range of estimates recently
reported in three populations using extensive SNPs data
from the HapMap project. We discuss the implication of our
findings for association mapping studies using the
Icelandic population and expected pattern of LD between
SNPs markers.
T. Mailund, G.S. Brodal, R. Fagerberg, C.N.S. Pedersen, and
D. Phillips
Background:
The neighbor-joining method by Saitou and Nei is a widely
used method for constructing phylogenetic trees. The
formulation of the method gives rise to a canonical
Θ(n3) algorithm upon which all
existing implementations are based.
Methods: In this paper we present techniques for
speeding up the canonical neighbor-joining method. Our
algorithms construct the same phylogenetic trees as the
canonical neighbor-joining method. The best-case running
time of our algorithms are O(n2)
but the worst-case remains
O(n3). We empirically evaluate
the performance of our algoritms on distance matrices
obtained from the Pfam collection of alignments.
Results: The experiments indicate that the running time
of our algorithms evolve as Θ(n2) on
the examined instance collection. We also compare the running
time with that of the QuickTree tool, a widely used efficient
implementation of the canonical neighbor-joining method.
Conclusions: The experiments show that our algorithms
also yield a significant speed-up, already for medium sized
instances.
2005
T. Mailund, M.H. Schierup, C.N.S. Pedersen,
P.J.M. Mechlenborg, J.N. Madsen, and L. Schauser
Background
Coalescent simulations are playing a large role in
interpreting large scale intra- polymorphism surveys and for
planning and evaluating association studies. Coalescent of
data sets under different models can be compared to the
actual data to test different evolutionary factors and thus
get insight into these.
Results
We have created the CoaSim application as a flexible
environment for Monte various types of genetic data under
equilibrium and non-equilibrium coalescent variety of
applications. Interaction with the tool is through the Guile
version scripting language. Scheme scripts for many standard
and advanced applications these can easily be modified by the
user for a much wider range of applications. interface with
less functionality and flexibility is also included. It is
primarily exploratory and educational tool.
Conclusions
CoaSim is a powerful tool because of its flexibility and ease
of use. This is illustrated varied uses of the application,
e.g. evaluation of association mapping methods,
bootstrapping, and design and choice of markers for specific
questions.
R.Wernersson,
M.H. Schierup,
F.G. Jørgensen,
J. Gorodkin,
F. Panitz,
H.H. Staerfeldt,
O.F. Christensen,
T. Mailund,
H. Hornshøj,
A. Klein,
W. Jun,
B. Liu,
S. Hu,
W. Dong,
W. Li,
G.K.-S. Wong,
J. Yu,
J. Wang,
C. Bendixen,
M. Fredholm,
S. Brunak,
H. Yang
and L. Bolund
Background
Comparative whole genome analysis of Mammalia can benefit
from the addition of more species. The pig is an obvious
choice due to its economic and medical importance as well as
its evolutionary position in the artiodactyls.
Results
We have generated ~ 3.84 million shotgun sequences (0.66X
coverage) of the pig genome. The data are hereby released
together with an initial evolutionary analysis.
The non-repetitive fraction of the sequences was aligned to
the UCSC human-mouse alignment and the resulting
three-species alignments were annotated using the human
genome annotation. Ultra-conserved elements and miRNAs were
identified. The results show that for each of these types of
orthologous data, pig is much closer to human than mouse
is. Purifying selection has been more efficient in pig
compared to human, but not as efficient as in mouse, and pig
seems to have an isochore structure most similar to the
structure in human.
Conclusion
The addition of the pig to the set of species sequenced at
low coverage adds to the understanding of selective pressures
that have acted on the human genome by bisecting the
evolutionary branch between human and mouse with the mouse
branch being approximately 3 times as long as the human
branch, and the joint alignment of the shot-gun sequences to
the human-mouse alignment offers a rapid way for the
investigator to define specific regions for analysis and
resequencing.
S. Besenbacher, T. Mailund, L. Westh-Nielsen, C.N.S. Pedersen
Bioinformatics, 21(8) 1711–1712, 2005.
We have developed a tool implementing an efficient algorithm
for refined Buneman tree reconstruction. The algorithm—which
has the same complexity as the neighbour-joining method and the
(plain) Buneman tree construction—enables refined Buneman tree
reconstruction on large taxa sets.
2004
T. Mailund and C.N.S. Pedersen
Bioinformatics, 20(17): 3261–3262, 2004.
We have built a tool for fast construction of very large
phylogenetic trees. The tool uses heuristics for speeding
up the neighbour-joining algorithm—while still
constructing the same tree as the original neighbour-joining
algorithm—making it possible to construct trees for ~8000
species in less than ten minutes on a single desktop PC. In
comparison, the same task takes more than 30 minutes using
the QuickTree neighbour-joining implementation.
T. Mailund and C.N.S. Pedersen
Bioinformatics, 20(10): 1636–1637, 2004.
QDist is a program for computing the quartet distance between
two unrooted evolutionary trees, i.e. the number of quartet
topology differences between the two trees, where a quartet
topology is the topological subtree induced by four species.
The implementation is based on an algorithm with running time
O(n log² n), which makes it practical to compare large
trees.
J. Billington, G.E. Gallasch, L.M. Kristensen, and
T. Mailund
IEEE Transactions on Systems, Man, and
Cybernetics, Part A: Systems and Humans. 34(1): 23–37.
State space exploration is one of the main approaches to
computer-aided verification and analysis of finite-state
systems, and can be used to reason about a wide range of
properties during the design phase of a system, including
desired and undesired terminal states. Several state space
reduction methods have been developed to alleviate the state
explosion problem inherent in methods based on state space
exploration. In this paper we develop algorithms for
combining two of these methods: the sweep-line method and
the equivalence reduction method. The algorithms allow
terminal states of the system to be reported on-the-fly
during the state space exploration. We demonstrate the
usefulness of the algorithms by applying them to an
industrial case study. These experimental results show that
the combined methods achieves a better reduction of the
state space than when either methods are used in isolation.
2010
M. Simonsen, T. Mailund and C.N.S. Pedersen
Proceedings of Bioinformatics 2010.
The neighbour-joining method by Saitou and Nei is a widely
used method for phylogenetic reconstruction, made popular by a
combination of computational efficiency and reasonable
accuracy. With its cubic running time by Studier and Kepler,
the method scales to hundreds of species, and while it is
usually possible to infer phylogenies with thousands of
species, tens or hundreds of thousands of species is
infeasible. Recently we developed a simple branch and bound
heuristic, RapidNJ, which significantly reduces the running
time on average. However, the O(n2) space consumption of the
RapidNJ method, and the NJ method in general, becomes a
problem when inferring phylogenies with 10000+ taxa.
In this paper we present two extentions of RapidNJ which reduce memory
requirements and enable RapidNJ to infer very large
phylogenetic trees efficiently. We also present an improved
search heuristic for RapidNJ which improves RapidNJs
performance on many data sets of all sizes.
2009
T. Mailund, J. Nielsen, and C.N.S. Pedersen
We derive a sub-cubic time and space algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree ≥ 3. The time and space complexity of our algorithm is quadratic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm for computing the quartet distance between general trees independent of the degree of the inner nodes.
A fast algorithm for genome wide haplotype pattern mining
S. Besenbacher, C.N.S. Pedersen and T. Mailund
In the Proc of APBC 2009, Tsinghua University Press, 772-780.
Background: Identifying the genetic components of
common diseases has long been an important area of research.
Recently, genotyping technology has reached the level where it is
cost effective to genotype single nucleotide polymorphism (SNP)
markers covering the entire genome, in thousands of individuals, and
analyse such data for markers associated with a diseases. The
statistical power to detect association, however, is limited when
markers are analysed one at a time. This can be alleviated by
considering multiple markers simultaneously. The Haplotype
Pattern Mining (HPM) method is a machine learning approach to do
exactly this. Originally developed for candidate gene analysis, its
running time is not optimised for the much larger genome wide data
sets in use today.
Results: We present a new, faster algorithm for the
HPM method. The new approach patterns of haplotype diversity in the
genome: locally in the genome, the number of observed haplotypes is
much smaller than the total number of possible haplotypes. We show
that the new approach speeds up the HPM method with a factor of 2 on
a genome wide dataset with 5009 individuals typed in 491208 markers
using default parameters and more if the pattern length is
increased.
Conclusions: The new algorithm significantly speeds up
the HPM method and makes it feasible to apply HPM to whole genome
association mapping with thousands of individuals and hundreds of
thousands of markers.
2008
M. Simonsen, T. Mailund and C.N.S. Pedersen
The neighbour-joining method reconstructs phylogenies by
iteratively joining pairs of nodes until a single node remains. The cri-
teria for which pair of nodes to merge is based on both the distance
between the pair and the average distance to the rest of the nodes. In
this paper, we present a new search strategy for the optimisation criteria
used for selecting the next pair to merge and we show empirically that
the new search strategy is superior to other state-of-the-art neighbour-
joining implementations.
2007
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal,
and R. Fagerberg
Proceedings of the Asia-Pacific
Bioinformatics Conference (APBC) 2007, pp. 91–100,
Series on Advances in Bioinformatics and Computational
Biology, vol. 5, Imperial College Press.
We present two algorithms for calculating the quartet
distance between all pairs of trees in a set of binary
evolutionary trees on a common set of species. The algorithms
exploit common substructure among the trees to speed up the
pairwise distance calculations thus performing significantly
better on large sets of trees compared to performing distinct
pairwise distance calculations, as we illustrate
experimentally, where we see a speedup factor of around 130
in the best case.
M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal,
and R. Fagerberg
Proceedings of the Asia-Pacific
Bioinformatics Conference (APBC) 2007, pp. 101–110,
Series on Advances in Bioinformatics and Computational
Biology, vol. 5, Imperial College Press.
We present an algorithm for calculating the quartet distance
between two evolutionary trees of bounded degree on a common
set of n species. The previous best algorithm has running
time O(d2n2) when considering
trees, where no node is of more than degree d. The
algorithm developed herein has running time
O(d9n log n)) which makes it the first
algorithm for computing the quartet distance between
non-binary trees which has a sub-quadratic worst case
running time.
2005
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of Workshop on Algorithms in
Bioinformatics (WABI), 2005,
LNBI 3692, pp. 77-88 © Springer-Verlag.
Full length version of Computing the Quartet Distance between
Evolutionary Trees of Bounded Degree (extended abstract).
We present two algorithms for computing the quartet distance
between trees of arbitrary degree. The quartet distance between two
unrooted evolutionary trees is the number of quartets—sub-trees
induced by four leaves—that differs between the trees.
Previous algorithms focus on computing the quartet distance
between binary trees. In this paper, we present two algorithms for
computing the
quartet distance between trees of arbitrary degrees. One in time
O(n3) and space O(n2) and
one in time O(n2d2) and space
O(n2), where n is the number of
species and d is the maximal
degree of the internal nodes of the trees. We experimentally compare
the two algorithms.
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of International Conference on Numerical
Analysis and Applied Mathematics (ICNAAM) 2005, pp. 796-799 ©
Wiley-VCH Verlag GmbH & Co.
We present two algorithms for computing the quartet distance
between trees of arbitrary degree. The quartet distance between two
unrooted evolutionary trees is the number of quartets—sub-trees
induced by four leaves—that differs between the trees.
Previous algorithms focus on computing the quartet distance
between binary trees. In this paper, we present two algorithms for
computing the
quartet distance between trees of arbitrary degrees. One in time
O(n3) and space O(n2) and
one in time O(n2d2) and space
O(n2), where n is the number of
species and d is the maximal
degree of the internal nodes of the trees. We experimentally compare
the two algorithms.
T. Mailund,
C.N.S. Pedersen,
J. Bardino,
B. Vinter,
and H.H. Karlsen
Proceedings of The 2005 International
Conference on Grid Computing and Applications
(GCA'05)
We report on very early experiences with running a
bioinformatics simulation study on a newly developed grid
architecture. We briefly describe the bioinformatics
application—an association mapping algorithm for both
locating disease loci and separating cases into those
diseased due to genetic factors and those diseased solely
due to environment factors—and describe the grid
architecture and how the application is set up to run on the
grid.
2004
T. Mailund and M. Westergaard
This paper is concerned with a memory-efficient
representation of the reachability graphs. We describe a
technique that enables us to represent each reachable
marking in a number of bits close to the theoretical minimum
needed for explicit state enumeration. The technique maps
each state vector into a number between zero and the number
of reachable states and uses the sweep-line method to delete
the state vectors themselves. A prototype of the proposed
technique has been implemented and experimental results are
reported.
2003
L.M. Kristensen and T. Mailund
The sweep-line method deletes states on-the-fly during state
space exploration to reclaim memory and thereby reduce peak
memory usage. This deletion of states prohibits the immediate
generation of, e.g., an error-trace when the violation of a
safety property is detected. We address this problem by
combining the sweep-line method with storing a spanning tree
of the explored state space in external storage on a magnetic
disk. We show how this allows us to easily obtain paths in the
state space, such as error-traces. A key property of the
proposed technique is that it avoids searching in external
storage during the state space exploration and gives the same
reduction in peak memory usage as the stand-alone sweep-line
method. The subsequent generation of the path then requires
one seek on disk for each state on the path. We evaluate the
proposed technique on a number of example systems by means of
an implementation, and compare its performance to a related
technique. These practical experiments demonstrate how the
suggested technique complements existing techniques based on
using external storage.
2002
L.M. Kristensen and T. Mailund
State space exploration is a main approach to verification of
finite-state systems. The sweep-line method exploits a certain
kind of progress present in many systems to reduce peak memory
usage during state space exploration. We present a new
sweep-line algorithm for a compositional setting where systems
are composed of subsystems. The compositional setting makes it
possible to divide subsystem progress measures into monotone and
non-monotone progress measures to further reduce peak memory
usage. We show that in a compositional setting, it is possible
to automatically obtain a progress measure for the system by
combining the progress measure for each subsystem to a progress
measure for the full system.
L.M. Kristensen and T. Mailund
The recently developed sweep-line method exploits progress
present in many concurrent systems to explore the full state
space of the system while storing only small fragments of
the state space in memory at a time. A disadvantage of the
sweep-line method is that it relies on a monotone and global
notion of progress. This prevents the method from being used
for many reactive systems. In this paper we generalise the
sweep-line method such that it can be used for verifying
safety properties of reactive systems exhibiting local
progress. The basic idea is to relax the monotone notion of
progress and to recognise the situations where this could
cause the state space exploration not to terminate. The
generalised sweep-line method explores all reachable states
of the system, but may explore a state several times. We
demonstrate the practical application of the generalised
sweep-line method on two case studies demonstrating a
reduction in peak memory usage to typically 10% compared
to the use of ordinary full state spaces.
T. Mailund
The sweep-line method is a state space exploration method
for on-the-fly verification aimed at systems exhibiting
progress. Presence of progress in the system makes it
possible to delete certain states during state space
generation, which reduces the memory used for storing the
states. Unfortunately, the same progress that is used to
improve memory performance in state space exploration often
leads to an infinite state space: The progress in the system
is carried over to the states resulting in infinitely many
states only distinguished through the progress. A finite
state space can be obtained using equivalence reduction,
abstracting away the progress, but in its simplest form this
removes the progress property required by the sweep-line
method. In this paper we examine a new method for using
equivalence relations to obtain a finite set of classes,
without compromising the progress property essential for the
sweep-line method. We evaluate the new method on two case
studies, showing significant improvements in performance,
and we briefly discuss the new method in the context of
Timed Coloured Petri Nets, where the `increasing
global time' semantics can be exploited for more efficient
analysis than what is achieved using a `delay' semantics.
G.E. Gallasch, L.M. Kristensen, and T. Mailund
Proceedings of Fourth Workshop on Practical Use of
Coloured Petri Nets and the CPN Tools (CPN 2002).
DAIMI PB-560
pp. 101-119, Aarhus, Denmark, 2002.
The basic idea of the sweep-line state space method is to
exploit a formal notion of progress found in many concurrent
and distributed systems. Exploiting progress makes it
possible to sweep through the state space of a CP-net while
storing only a small fragment of the states in memory at any
time. Properties of the system can then be verified
on-the-fly during the sweep of the state space. This can
lead to significant savings in peak memory usage and
computation time. Examples of systems possessing progress
are transport protocols, transactions protocols, workflow
models, and systems modelled with Timed CP-nets. We present
Sweep/CPN, a library extension to
Design/CPN supporting the sweep-line method, and
demonstrate its use.
2001
S. Christensen, L.M. Kristensen and T. Mailund
We present a state space method for Petri nets having a
time concept based on a global clock and associating time
stamps to tokens. The method is based on equivalence on
states and makes it possible to condense the usually
infinite state space of such timed Petri nets into a finite
state space without loosing analysis power. The practical
application of the method is demonstrated on a large
example of an audio/video protocol by means of a computer
tool implementing the method.
S. Christensen, L.M. Kristensen and T. Mailund
We present a state space exploration method for on-the-fly
verification. The method is aimed at systems for which it is
possible to define a measure of progress based on the states
of the system. The measure of progress makes it possible to
delete certain states on-the-fly during state space
generation, since these states can never be reached
again. This in turn reduces the memory used for state space
storage during the task of verification. Examples of
progress measures are sequence numbers in communication
protocols and time in certain models with time. We
illustrate the application of the method on a number of
Coloured Petri Net models, and give a first evaluation of
its practicality by means of an implementation based on the
Design/CPN state space tool. Our experiments show significant
reductions in both space and time used during state space
exploration. The method is not specific to Coloured Petri
Nets but applicable to a wide range of modelling languages.
S. Christensen, K. Jensen, T. Mailund and L.M. Kristensen
Proceedings of 2nd International
Colloquium on Petri Net Technologies for Modelling
Communication Based Systems.
pp. 14-15, Berlin Germany, September 2001.
We present two recently developed state space methods for
timed coloured Petri nets which reconciles state space
methods and time concepts of coloured Petri nets. The first
method is based on an equivalence relation on states which
makes it possible to condense the usually infinite state
space of a timed coloured Petri net into a finite condensed
state space without serious loss of analysis power. The
second method, the sweep-line method, exploits the
increasing model time of timed CP-nets to store only small
fragments of the full state space in memory at a time
thereby alleviating the the state explosion problem. The
performance and potential of the methods is demonstrated on
an industrial case study by means of a computer tool
implementing the two methods.
2000
T. Mailund and K.H. Mortensen
Proceedings of the Meeting on XML/SGML based
Interchange Formats for Petri Nets,
pp. 7-12, Aarhus, Denmark, 2000.
Style sheets have been proposed by the World Wide Web
Consortium (W3C) as a means for separating presentation and
content in World Wide Web documents, and they are now
widely used and generally popular. In this paper we use
the same idea for an XML interchange format proposal for
high-level Petri nets. There are several benefits of using
this design principle: It is easier to exchange content-only
for non-graphics tools, and alternative styles can
conveniently be replaced to make a new graphical
appearance. The ideas presented in this paper are
illustrated by means of examples.
1999
Parameterised Coloured Petri Nets
T. Mailund
Proceedings of Second Workshop on Practical Use of Coloured
Petri Nets and Design/CPN,
DAIMI PB-541
pp. 133-151, Aarhus, Denmark, 1999.
In this paper we examine coloured Petri nets extended with
parameters. We characterise three kinds of parameterisation
and formally define Parameterised Coloured Petri
Nets. We then discuss how parameterised coloured Petri
nets can be used to create libraries of coloured Petri net
modules in the same way as libraries for programming
languages. Finally we discuss how to implement a simple
simulator for such modules.
1998
Textual Interchange Format for High-Level Petri Nets
R.B. Lyngsø and T. Mailund
Proceedings of First Workshop on Practical Use of Coloured
Petri Nets and Design/CPN,
DAIMI PB-532
pp. 47-64 Aarhus, Denmark, 1998.
In this paper a text format for high-level Petri net
diagrams is presented. The text format is designed to serve
as a platform-independent file format for the Design/CPN
tool. It is consistent with the forthcoming standard for
high-level Petri nets. The text format may also be seen as
our contribution to the development of an open,
tool-independent interchange format for high-level Petri
nets. The text format will make it possible to move
Design/CPN diagrams between all supported hardware platforms
and versions. It is also designed to be a bridge to other
Petri net tools, e.g., other analysis tools which the user
may want to use with Design/CPN diagrams. The proposed text
format does not address any standardisation for the
inscription language used in the diagram. It is, however,
possible to extend the format to incorporate such a
standardisation. The text format is designed for the
exchange of hierarchical coloured Petri nets but the
structure is general enough to cope with other high-level
Petri nets as well. The text format presented here has been
implemented as part of Design/CPN version 3.1.
2004
T. Mailund
May 24, 2004.
We present a tool for comparing a set of input
trees, calculating for each pair of trees the split-distances, i.e.,
the number of splits in one tree not present in the other.
2003
G.S. Brodal, R. Fagerberg, T. Mailund, C.N.S. Pedersen, and
D. Phillips
ALCOMFT-TR-03-102, ALCOM-FT, November 2003.
A widely used method for constructing phylogenetic trees is
the neighbour-joining method of Saitou and Nei. We develope
heuristics for speeding up the neighbour-joining method which
generate the same phylogenetic trees as the original method.
All heuristics are based on using a quad-tree to guide the
search for the next pair of nodes to join, but differ in the
information stored in quad-tree nodes, the way the search is
performed, and in the way the quad-tree is updated after a
join.
We empirically evaluate the performance of the heuristics on
distance matrices obtained from the Pfam collection of
alignments, and compare the running time with that of the
QuickTree tool, a well-known and widely used implementation of
the standard neighbour-joining method. The results show that
the presented heuristics can give a significant speed-up over
the standard neighbour-joining method, already for medium
sized instances.
2002
L.M. Kristensen and T. Mailund
Dept. of Computer Science, University of Aarhus, April 2002
State space exploration is a main approach to verification
of finite-state systems. The sweep-line method exploits a
certain kind of progress present in many systems to reduce
peak memory usage during state space exploration. We present
a new sweep-line algorithm for a compositional setting where
systems are composed of subsystems. The compositional
setting makes it possible to divide subsystem progress
measures into monotone and non-monotone progress measures to
further reduce peak memory usage. We show that in a
compositional setting, it is possible to automatically
obtain a progress measure for the system by combining the
progress measure for each subsystem to a progress measure
for the full system.
2006
Association Mapping: Fundamental Principles and Applications
T. Mailund, M.H. Schierup, J. Hein, L. Schauser, and J.N. Madsen
At Pacific Symposium on Biocomputing (PSB) 2006.
The past 15 years have witnessed an explosion in the
identification of genes involved in Mendelian diseases. Most of
these genes have been identified through family based linkage
analysis. In contrast, the unravelling of the genetic basis of
common, but complex diseases, such as cancer, diabetes and
Alzheimer's disease has made little progress in this
period. Complicating factors that make linkage analysis of these
diseases impractical are late disease onset, the involvement of
multiple low-penetrance genetic factors, ambiguous diagnosis, as
well as strong influence of environmental and lifestyle
factors. Power analysis and simulations suggest that association
studies are the method of choice when attempting to genetically
characterize complex diseases. Association studies draw a number
of cases and controls from the same population and genotype their
DNA at polymorphic loci. Excess correlation amongst alleles from
separate loci, called Linkage Disequilibrium (LD), in the case
group indicates the presence of a risk factor. Here we discuss
the challenge of complex diseases, as well as simple and model
based statistical approaches to the mapping of complex diseases.
2005
L. Schauser, T. Mailund, J.N. Madsen, J. Hein and M.H. Schierup
At Pacific Symposium on Biocomputing (PSB) 2005.
The past 15 years have witnessed an explosion in the
identification of genes involved in Mendelian diseases. Most of
these genes have been identified through family based linkage
analysis. In contrast, the unravelling of the genetic basis of
common, but complex diseases, such as cancer, diabetes and
Alzheimer's disease has made little progress in this
period. Complicating factors that make linkage analysis of these
diseases impractical are late disease onset, the involvement of
multiple low-penetrance genetic factors, ambiguous diagnosis, as
well as strong influence of environmental and lifestyle
factors. Power analysis and simulations suggest that association
studies are the method of choice when attempting to genetically
characterize complex diseases. Association studies draw a number
of cases and controls from the same population and genotype their
DNA at polymorphic loci. Excess correlation amongst alleles from
separate loci, called Linkage Disequilibrium (LD), in the case
group indicates the presence of a risk factor. Here we discuss
the challenge of complex diseases, as well as simple and model
based statistical approaches to the mapping of complex diseases.
1999
L.M. Kristensen, T. Mailund, and L.M. Wells
Second Workshop and Tutorial on Practical Use of Coloured Petri
Nets and Design/CPN, 1999.
What you always wanted to know about practical use of
Design/CPN (but were afraid to ask).
The advanced tutorial is aimed at engineers, students and
researchers who already have some experience with the use of
Coloured Petri Nets and Design/CPN.
2006
T. Mailund
HapCluster++ is a software package for linkage disequilibrium
mapping. It is based on a Bayesian Markov-chain Monte Carlo
(MCMC) method for finescale linkage-disequilibrium gene
mapping using high-density marker maps.
T. Mailund
Blossoc is a linkage disequilibrium association mapping
tool that attempts to build (perfect) genealogies for each
site in the input and score these according to non-random
clustering of affected individuals, and judge high-scoring
areas as likely candidates for containing disease affecting
variation. Building the local genealogy trees is based on a
number of heuristics that are not guaranteed to build true
trees, but have the advantage of more sophisticated methods
of being extremely fast.
T. Mailund
GeneRecon is a software package for linkage disequilibrium
mapping using coalescent theory. It is based on Bayesian
Markov-chain Monte Carlo (MCMC) method for fine-scale
linkage-disequilibrium gene mapping using high-density
marker maps. GeneRecon explicitly models the genealogy of a
sample of the case chromosomes in the vicinity of a disease
locus. Given case and control data in the form of genotype
or haplotype information, it estimates a number of
parameters, most importantly, the disease position.
T. Mailund and L. Schauser
GeneRecon is a software package for linkage disequilibrium
mapping using coalescent theory. It is based on Bayesian
Markov-chain Monte Carlo (MCMC) method for fine-scale
linkage-disequilibrium gene mapping using high-density
marker maps. GeneRecon explicitly models the genealogy of a
sample of the case chromosomes in the vicinity of a disease
locus. Given case and control data in the form of genotype
or haplotype information, it estimates a number of
parameters, most importantly, the disease position.
2005
T. Mailund
CoaSim is a tool for simulating the coalescent process with
recombination and geneconversion, under either constant
population size or exponential population growth. It
effectively constructs the ancestral recombination graph for a
given number of chromosomes and uses this to simulate samples of
SNP and micro-satellite haplotypes or genotypes.
T. Mailund
CoaSim is a tool for simulating the coalescent process with
recombination and geneconversion, under either constant
population size or exponential population growth. It
effectively constructs the ancestral recombination graph for a
given number of chromosomes and uses this to simulate samples of
SNP and micro-satellite haplotypes or genotypes.
2003
T. Mailund
DPROG is a domain specific language for specifying dynamic
programming algorithms; given a recursive definition of the
problem, the compiler generates code for solving the problem
using dynamic programming.
This tutorial/manual explains the basics of the language and
how the generated code can be used in an application. The
manual covers version 0.2 of DPROG; the language might change
in later versions.
2003
T. Mailund
Ph.D. Dissertation, Department of Computer Science,
University of Aarhus, February 2003.
The thesis describes the sweep-line method, a
newly developed reduction method for
alleviating the state explosion problem
inherent in explicit-state state space
exploration. The basic idea underlying the
sweep-line method is, when calculating the
state space, to recognise and delete states
that are not reachable from the currently
unprocessed states. Intuitively we drag a
sweep-line through the state space with the
invariant that all states behind the
sweep-line have been processed and are
unreachable from the states in front of the
sweep-line. When calculating the state space
of a system we iteratively calculate
successors of new states, and we need to
distinguish between previously seen states
and truly new states to avoid reprocessing
the same states indefinitely. When
calculating the successors of the known
unprocessed states, all successors are in
front of the sweep-line. Consequently, the
states behind the sweep-line are not needed
for distinguishing between new and old states
and can therefore be deleted.
2000
T. Mailund
Progress report, Ph.D. part A, Department of Computer Science,
University of Aarhus, December 2000.
This progress report presents and summarizes the research
work done by the author during part A of his PhD study at
the Department of Computer Science, University of Aarhus.
The work described consists of three main parts. The first
part is on language constructions for modular coloured Petri
net models. Work on parameterisation of modules, and how
this can be used to gain higher degrees of reuse, is
presented, and work in progress on behaviour preserving
module refinements is sketched.
The second part is on state space analysis methods for model
verification, where two new reduction methods are described.
The first method exploits equivalences to gain finite
representations for the usually infinite state spaces of
timed coloured Petri nets. The second method exploits the
progress, that is present in some models, to sweep through
all reachable states without, at any given time, storing the
entire state space.
The third and last part is on a text format for exchanging
models between different tools. An XML-based text format is
described. This text format is now implemented in the tools
Design/CPN, as an optional file format, and CPN Tools, as the
main file format. Techniques for separating content and
graphical style information, using style sheets, is also
described.