SLIDE 1 ML tree inference using gap-coding
Derrick J. Zwickl and Mark T. Holder
- Dept. Ecology and Evolutionary Biology
- Univ. Kansas
Lawrence, Kansas
SLIDE 2 Acknowledgments
- Derrick Zwickl
- Jeet Sukumaran, Jamie Oaks, Tracy Heath, and Jiaye Yu
- NSF ATOL
SLIDE 3 Talk outline
- Background
- Simple approaches based on gap-coding:
– Dollo Indel Mixture model.
– Mkv with ascertainment bias correction for at least one residue in a column.
- Some very preliminary results from an implementation in GARLI.
SLIDE 4
- Unaligned sequences, U, are observed.
- The tree, T, that maximizes Pr(U|T) is the ML tree (or we could maximize Pr(T|U) in a Bayesian framework).
- We could use MCMC to marginalize over every possible alignment, A:
$\Pr(U \mid T) = \sum_{A} \Pr(U, A \mid T)$
- Too computationally intensive for trees with hundreds or thousands of leaves.
- See Ben Redelings' talk tomorrow!
SLIDE 5 Two-phase methods
- 1. Infer a single alignment, A, from the unaligned sequences, U.
- 2. Infer tree, T, given the alignment, A, and sequences. For example, the tree that maximizes Pr(U, A|T).
Weaknesses:
- 1. Ignores alignment uncertainty by treating {U, A} as if it were the known data.
- 2. Ignores the fact that progressive alignment strategies use a guide tree to infer A.
SLIDE 6 Two-phase methods with gaps as missing data
- 1. Infer a single alignment, A. Map {U, A} → Z where Z is an aligned sequence matrix in which all gaps are treated as missing data.
- 2. Infer tree, T, given Z. For example, the tree that maximizes Pr(Z|T, θ).
Weaknesses:
- 1. Ignores alignment uncertainty by treating {U, A} as if it were the known data.
- 2. Ignores the fact that progressive alignment strategies use a guide tree to infer A.
- 3. Ignores phylogenetic information from indels.
SLIDE 7 Addressing alignment uncertainty without marginalizing over all plausible alignments
- Masking or “culling”
- Wheeler’s “elision” method of concatenating alternative alignments
- SATé – iteratively considering multiple alignments and trees. Results in an estimate of a tree and alignment pair.
- Resampling procedures which are cell-based rather than column-based:
– Jamie Oaks (KU grad. student) is currently testing a cell-based jackknifing procedure that will try to let alignment uncertainty affect clade support.
SLIDE 8 This talk: ML methods that do not ignore information from gaps
- tree inference with the alignment treated as fixed.
- considering single-residue indel events (considering multiple-residue indel events means that columns of an alignment are no longer independent).
- focusing on methods that fit well with:
– large-scale tree inference software (GARLI) and
– a high-performance likelihood calculation library (lib-beagle).
SLIDE 9 General strategies for fixed alignment, single-residue indel handling
Rather than {U, A} → Z where Z is an aligned matrix with gaps converted to missing data:
- Analyze the aligned sequences, {U, A}, and treat gaps as a “5th base” or “21st amino acid”
- “Gap-code” A as a binary matrix (1 for presence of a residue, 0 for a gap). Next, analyze Z under a model of substitution and A under a model with a state space of {0, 1}
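The gap-coding step above can be sketched in a few lines of Python. This is only an illustration, not GARLI's implementation: the function name `gap_code`, the `-` gap symbol, the `?` missing-data symbol, and the toy alignment are all assumptions made for the example.

```python
def gap_code(alignment, gap="-", missing="?"):
    """Split aligned sequences {U, A} into (Z, A).

    Z: the sequences with every gap replaced by the missing-data symbol.
    A: a binary presence/absence matrix (1 = residue, 0 = gap).
    """
    z = {}
    a = {}
    for taxon, seq in alignment.items():
        z[taxon] = "".join(missing if c == gap else c for c in seq)
        a[taxon] = [0 if c == gap else 1 for c in seq]
    return z, a


# Toy alignment (hypothetical data, for illustration only).
alignment = {
    "taxon1": "ACG-T",
    "taxon2": "A--GT",
    "taxon3": "ACGGT",
}
z, a = gap_code(alignment)
for taxon in alignment:
    print(taxon, z[taxon], a[taxon])
```

Z would then be scored under the substitution model and A under the {0, 1} model.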
SLIDE 10 “5th base” coding
Weaknesses:
- The simplest form would allow evolutionary transitions such as C → G → - → G. But the first and second G residues should not be considered to be homologous!
- Gap states evolve in a “5th-base” coding, but it is odd to model the evolution of a site that is not there.
- Indels of multiple residues are not handled nicely.
SLIDE 11 Better “5th base” coding
Rivas and Eddy (2008):
- provide a nice discussion of these issues
- develop techniques for computing the transition probability matrices for “extended” state models
- point out the need to correct for ascertainment bias
- develop altered versions of Felsenstein’s pruning algorithm so that at most one branch per tree can have an insertion at a site
- implemented in their software DNAMLE
SLIDE 12 Stochastic Dollo
Alekseyenko, Lee, and Suchard (2008) generalize “stochastic Dollo” models that prohibit re-evolution of a complex character. Their approach could be used to model insertion of a site.
[Figure: Figure 4 of Alekseyenko et al. (2008)]
SLIDE 13 Dollo indel mixture model - motivation
Like Rivas and Eddy (2008) and Alekseyenko et al. (2008), we would like an indel model that strictly enforces homology. But we would like most of the likelihood calculations to use the standard pruning algorithm of Felsenstein, so we can use highly optimized code (lib-beagle, GARLI . . .).
SLIDE 14 Dollo indel mixture model - separating substitution calculations from indel calculations
$a_i$ refers to the gap states at column $i$ of an alignment. $u_i$ is the sequence states at this position of the aligned matrix. $z_i$ is the states with gaps mapped to missing.
$\Pr(u_i, a_i \mid T) = \Pr(z_i, a_i \mid T) = \Pr(z_i \mid T, a_i) \Pr(a_i \mid T)$
We can calculate $\Pr(z_i \mid T, a_i)$ from the standard pruning algorithm.
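A short worked step, assuming that alignment columns are independent (which the restriction to single-residue indels in this talk is meant to allow): taking logs of the per-column factorization and summing over columns splits the total log-likelihood into a substitution term and an indel term.

```latex
\ln \Pr(U, A \mid T)
  = \sum_i \ln \Pr(u_i, a_i \mid T)
  = \underbrace{\sum_i \ln \Pr(z_i \mid T, a_i)}_{\text{standard pruning algorithm}}
  + \underbrace{\sum_i \ln \Pr(a_i \mid T)}_{\text{indel model}}
```

Under this assumption the first sum can be handed to existing optimized likelihood code, and only the second sum needs new calculations.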
SLIDE 15 Dollo indel mixture model - gap coding
For a residue there are 3 states indicating presence/absence:
never had it → has it → lost it
But if we “gap-code” a site we see two observable states:
0 = never had it OR lost it
1 = has it
We can use a model with 3 possible states and the modified pruning algorithm to ensure that there can be no more than one insertion per column when we calculate $\Pr(a_i \mid T)$.
SLIDE 16 Dollo indel mixture model - mixture part
Rather than infer the insertion rate, we can use a simple mixture approach:
$\Pr(u_i, a_i \mid T) = \Pr(u_i, a_i \mid T, 0\ \text{ins.}) \Pr(0\ \text{ins.}) + \Pr(u_i, a_i \mid T, 1\ \text{ins.}) \Pr(1\ \text{ins.})$
If $\phi$ is the probability of a site having an insertion:
$\Pr(u_i, a_i \mid T) = \Pr(u_i, a_i \mid T, 0\ \text{ins.})(1 - \phi) + \Pr(u_i, a_i \mid T, 1\ \text{ins.})\phi$
The max. likelihood estimate of $\phi$ can be found very quickly.
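One way the estimate of φ could be obtained quickly is a few EM iterations over the per-column likelihoods. The sketch below is an assumption about how this might be done, not a description of GARLI's code: the function names and the toy per-column likelihoods `lik0` (0 insertions) and `lik1` (1 insertion) are made up for illustration.

```python
import math


def estimate_phi(lik0, lik1, iterations=200, phi=0.5):
    """EM estimate of the mixture weight phi.

    lik0[i], lik1[i]: likelihood of column i given 0 or 1 insertions
    (assumed precomputed; each column needs at least one nonzero value).
    """
    for _ in range(iterations):
        # E-step: posterior probability that each column had one insertion.
        post = [phi * l1 / ((1.0 - phi) * l0 + phi * l1)
                for l0, l1 in zip(lik0, lik1)]
        # M-step: phi is the average posterior probability.
        phi = sum(post) / len(post)
    return phi


def log_likelihood(lik0, lik1, phi):
    """Total log-likelihood of the columns under the two-component mixture."""
    return sum(math.log((1.0 - phi) * l0 + phi * l1)
               for l0, l1 in zip(lik0, lik1))


# Hypothetical per-column likelihoods; column 3 cannot be explained
# without an insertion, so its 0-insertion likelihood is zero.
lik0 = [0.02, 0.03, 0.0, 0.01]
lik1 = [0.001, 0.002, 0.004, 0.02]
phi_hat = estimate_phi(lik0, lik1)
print(phi_hat, log_likelihood(lik0, lik1, phi_hat))
```

Because the tree-dependent terms are fixed while φ is updated, each iteration is just a pass over the columns.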
SLIDE 17 Mk not-fully-absent variant
For comparison, we can also analyze the gap-coded data using:
- 2-state model that allows re-insertion,
- standard pruning algorithm.
We should condition our inference on the fact that our data are censored because $a_i$ cannot be all gaps. This is a slight tweak of the Mkv model (the “morphology model”) of Lewis (2001).
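The conditioning described above can be sketched as follows: compute the likelihood of a gap-coded column with the pruning algorithm, then divide by one minus the likelihood of the unobservable all-gap column. This is a minimal illustration under stated assumptions (a symmetric 2-state Mk model with equal root frequencies, and a made-up three-taxon tree with made-up branch lengths); none of the names come from GARLI.

```python
import math
from itertools import product


def transition(t):
    """Symmetric 2-state Mk transition probabilities for branch length t."""
    same = 0.5 + 0.5 * math.exp(-2.0 * t)
    return [[same, 1.0 - same], [1.0 - same, same]]


def partial(node, pattern):
    """Conditional likelihoods [L(0), L(1)] at a node (Felsenstein pruning).

    A leaf is a taxon name; an internal node is a list of (child, brlen).
    """
    if isinstance(node, str):
        return [1.0 if pattern[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, brlen in node:
        p = transition(brlen)
        child_partial = partial(child, pattern)
        for s in (0, 1):
            result[s] *= sum(p[s][c] * child_partial[c] for c in (0, 1))
    return result


def site_likelihood(tree, pattern):
    """Unconditioned likelihood with root frequencies of 0.5 (an assumption)."""
    return sum(0.5 * x for x in partial(tree, pattern))


def conditioned_likelihood(tree, pattern):
    """Likelihood conditioned on the column not being all gaps."""
    all_gaps = {taxon: 0 for taxon in pattern}
    return site_likelihood(tree, pattern) / (1.0 - site_likelihood(tree, all_gaps))


# Hypothetical tree: (A:0.1, (B:0.2, C:0.2):0.05)
tree = [("A", 0.1), ([("B", 0.2), ("C", 0.2)], 0.05)]
taxa = ["A", "B", "C"]
# The conditioned likelihoods of all observable columns should sum to 1.
total = sum(conditioned_likelihood(tree, dict(zip(taxa, states)))
            for states in product((0, 1), repeat=3) if any(states))
print(total)
```

The final sum is a sanity check that the conditioned probabilities form a proper distribution over the columns that can actually appear in an alignment.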
SLIDE 18 Very preliminary results - simulation study
- Simulation on a tree estimated for 64 species of Rana.
- Sweeps of simulations over indel rates, indel lengths, and tree depths.
- Tree inference in GARLI under:
– GTR model ignoring indels,
– GTR + Dollo Indel Mixture Model (DIMM), and
– GTR + Mkv variant
SLIDE 19 Results on true alignment
[Figure: tree error (y-axis, 0.02 to 0.1) versus deletion + insertion rate (x-axis: 0.04, 0.08, 0.16, 0.32) for analyses of the true alignment, in four panels for tree depths 2.0, 1.0, 0.5, and 0.1. Series: dnaMix, dna, dnaMk.]
SLIDE 20 Results on MAFFT alignment
[Figure: tree error (y-axis, 0.1 to 0.6) versus deletion + insertion rate (x-axis: 0.04, 0.08, 0.16, 0.32) for analyses of the MAFFT alignment, in four panels for tree depths 2.0, 1.0, 0.5, and 0.1. Series: dnaMix, dna, dnaMk.]
SLIDE 21 Preliminary Results Summary
- DIMM and Mkv variant are able to use indel information to improve tree inference when the alignment is reliable.
- DIMM is very sensitive to alignment error (though the MAFFT alignments are quite compressed for these simulations).
- Mkv variant is much less sensitive to alignment error.
- Analyses with Prank alignments are currently underway.
- Analyses using these gap-coding methods are available in GARLI (demo later today; note: not fully optimized).
SLIDE 22 Future work
- 1. Multiple tiers of “masking”-like approaches
- 2. Comparing Mkv variant and DIMM to identify compression
- 3. Elision-like approaches in ML
- 4. Integration of these approaches within SATé
SLIDE 23 Future work - richer masking/culling to classify columns
Rather than:
- included
- excluded
We can:
- included, with DIMM
- included, with Mkv variant
- included, gaps ignored
- excluded
SLIDE 24 Future work - identifying compression
Mkv allows any number of insertions in a column, while DIMM forces there to be 0 or 1. We can compare partial likelihoods from both models to identify sites that show strong evidence of multiple insertion events.
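A rough sketch of what such a comparison might look like, assuming per-column log-likelihoods are available from both models. The function name, the threshold of 2.0 log units, and the numbers are hypothetical; how the two models' likelihoods should be put on a comparable scale is left open here.

```python
def flag_compressed_columns(lnl_mkv, lnl_dimm, threshold=2.0):
    """Return indices of columns that the Mkv variant fits much better than DIMM.

    A large difference suggests that more than one insertion is needed
    to explain the column. The threshold is an arbitrary illustration.
    """
    return [i for i, (m, d) in enumerate(zip(lnl_mkv, lnl_dimm))
            if m - d > threshold]


# Hypothetical per-column log-likelihoods.
lnl_mkv = [-3.1, -2.0, -4.5]
lnl_dimm = [-3.0, -6.5, -4.9]
flagged = flag_compressed_columns(lnl_mkv, lnl_dimm)
print(flagged)
```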
SLIDE 25
SLIDE 26 Future work - more appropriate “Elision” method for ML
We will be adding the ability to score multiple possible alignments to GARLI. The likelihoods, rather than the log-likelihoods, will be added across alternatives (see Redelings and Suchard review chapter 2008).
[Figure: panel (e) of a figure from M. S. Y. Lee, TREE, 2001, showing two alternative alignments of the same sequences over columns 1-12, with regions labeled X, Y, and Z.]
            Alignment 1     Alignment 2
Outgroup    TAGAGCACTCAG    TAGAGCACTCAG
Taxon A     TAGAGCACTCAG    TAGAGCACTCAG
Taxon B     TAGTGAAGCCAG    TAGTGAAGCCAG
Taxon C     TAGTGAAGCCAG    TAGTGAAGCCAG
Taxon D     TAG---AGCCAG    TAGAGC---CAG
Taxon E     TAG---AGCCAG    TAGAGC---CAG
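Adding likelihoods rather than log-likelihoods needs some care numerically, because the likelihoods themselves underflow. A common way to do it is the log-sum-exp trick, sketched below. This is an illustration of the arithmetic only, not GARLI's planned implementation; the function name and the log-likelihood values are made up.

```python
import math


def log_sum_likelihoods(log_likelihoods):
    """Return ln(sum_k exp(lnL_k)) without underflow (log-sum-exp trick)."""
    m = max(log_likelihoods)
    return m + math.log(sum(math.exp(x - m) for x in log_likelihoods))


# Hypothetical log-likelihoods of one tree under three alternative alignments.
lnls = [-10234.7, -10236.1, -10240.3]
combined = log_sum_likelihoods(lnls)
print(combined)
```

Summing the log-likelihoods instead would correspond to multiplying the likelihoods, which treats the alternative alignments as independent data rather than as alternatives.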
SLIDE 27 Future work - integration with SATé
- In the next few weeks we will be adding GARLI as a tree inference option for SATé.
- The subalignment stage of SATé should be an excellent step to incorporate automated masking, compression correction, or elision recoding into SATé.
SLIDE 28 Thanks!
Questions, suggestions, collaborations, and donations of test datasets are always welcome!