ML tree inference using gap-coding
Derrick J. Zwickl and Mark T. Holder
Dept. Ecology and Evolutionary Biology, Univ. Kansas, Lawrence, Kansas

Acknowledgments
• Derrick Zwickl
• Jeet Sukumaran, Jamie Oaks, Tracy Heath, and Jiaye Yu
• NSF ATOL

Talk outline
• Background
• Simple approaches based on gap-coding:
– Dollo Indel Mixture model.
– Mkv with ascertainment bias correction for at least one residue in a column.
• Some very preliminary results from an implementation in GARLI.
• Future work

• Unaligned sequences, U, are observed.
• The tree, T, which maximizes Pr(U | T) is the ML tree (or we could maximize Pr(T | U) in a Bayesian framework).
• We could use MCMC to marginalize over every possible alignment, A:
$$\Pr(U \mid T) = \sum_{A} \Pr(U, A \mid T)$$
• Too computationally intensive for trees with hundreds or thousands of leaves.
• See Ben Redelings' talk tomorrow!

Two-phase methods
1. Infer a single alignment, A, from the unaligned sequences, U.
2. Infer the tree, T, given the alignment, A, and the sequences. For example, the tree that maximizes Pr(U, A | T).
Weaknesses:
1. Ignores alignment uncertainty by treating {U, A} as if it were the known data.
2. Ignores the fact that progressive alignment strategies use a guide tree to infer A.

Two-phase methods with gaps as missing data
1. Infer a single alignment, A. Map {U, A} → Z, where Z is an aligned sequence matrix in which all gaps are treated as missing data.
2. Infer the tree, T, given Z. For example, the tree that maximizes Pr(Z | T, θ).
Weaknesses:
1. Ignores alignment uncertainty by treating {U, A} as if it were the known data.
2. Ignores the fact that progressive alignment strategies use a guide tree to infer A.
3. Ignores phylogenetic information from indels.

Addressing alignment uncertainty without marginalizing over all plausible alignments
• Masking or “culling”
• Wheeler’s “elision” method of concatenating alternative alignments
• SATé: iteratively considering multiple alignments and trees. Results in an estimate of a tree and alignment pair.
• Resampling procedures which are cell-based rather than column-based:
– Jamie Oaks (KU grad. student) is currently testing a cell-based jackknifing procedure that will try to let alignment uncertainty affect clade support.

This talk: ML methods that do not ignore information from gaps
• tree inference with the alignment treated as fixed.
• considering single-residue indel events (considering multiple-residue indel events means that columns of an alignment are no longer independent).
• focussing on methods that fit well with:
– large-scale tree inference software (GARLI) and
– a high-performance likelihood calculation library (lib-beagle).

General strategies for fixed alignment, single-residue indel handling
Rather than {U, A} → Z, where Z is an aligned matrix with gaps converted to missing data:
• Analyze the aligned sequences, {U, A}, and treat gaps as a “5th base” or “21st amino acid”
• “Gap-code” A as a binary matrix (1 for presence of a residue, 0 for a gap). Next, analyze Z under a model of substitution and A under a model with a state space of {0, 1}
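The gap-coding step can be illustrated with a minimal Python sketch. It assumes the alignment is held as a dictionary from taxon name to aligned sequence, that "-" marks a gap, and that "?" marks missing data. The function name `gap_code` and these conventions are illustrative assumptions, not part of GARLI.

```python
from typing import Dict, List, Tuple

GAP = "-"      # assumed gap symbol in the aligned input
MISSING = "?"  # assumed missing-data symbol in the output matrix Z


def gap_code(alignment: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, List[int]]]:
    """Split an alignment into (Z, A).

    Z keeps the residues but recodes every gap as missing data.
    A is the binary gap-coded matrix: 1 for a residue, 0 for a gap.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    z = {name: seq.replace(GAP, MISSING) for name, seq in alignment.items()}
    a = {name: [0 if ch == GAP else 1 for ch in seq] for name, seq in alignment.items()}
    return z, a


# Two rows taken from the example alignment shown at the end of the talk.
aligned = {"Taxon_C": "TAGTGAAGCCAG", "Taxon_D": "TAG---AGCCAG"}
z, a = gap_code(aligned)
```

Z would then be analyzed under the substitution model and A under the two-state model, as described on the slide.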

“5th base” coding
Weaknesses:
• The simplest form would allow evolutionary transitions such as C → G → - → G. But the first and second G residues should not be considered to be homologous!
• Gap states evolve in a “5th-base” coding, but it is odd to model the evolution of a site that is not there.
• Indels of multiple residues are not handled nicely.

Better “5th base” coding
Rivas and Eddy (2008):
• provide a nice discussion of these issues
• develop techniques for computing the transition probability matrices for “extended” state models
• point out the need to correct for ascertainment bias
• develop altered versions of Felsenstein’s pruning algorithm so that at most one branch per tree can have an insertion at a site
• implemented in their software DNAML E

Stochastic Dollo Alekseyenko, Lee, and Suchard (2008) generalize “stochastic Dollo” models that prohibit re-evolution of a complex character. Their approach could be used to model insertion of a site: Their Figure 4:

Dollo indel mixture model - motivation
Like Rivas and Eddy (2008) and Alekseyenko et al. (2008), we would like an indel model that strictly enforces homology. But we would like most of the likelihood calculations to use the standard pruning algorithm of Felsenstein, so we can use highly optimized code (lib-beagle, GARLI, ...).

Dollo indel mixture model - separating substitution calculations from indel calculations
$a_i$ refers to the gap state at column $i$ of an alignment.
$u_i$ is the sequence state at this position of the aligned matrix.
$z_i$ is the sequence state with gaps mapped to missing data.
$$\Pr(u_i, a_i \mid T) = \Pr(z_i, a_i \mid T) = \Pr(z_i \mid T, a_i)\,\Pr(a_i \mid T)$$
We can calculate $\Pr(z_i \mid T, a_i)$ from the standard pruning algorithm.

Dollo indel mixture model - gap coding
For a residue there are 3 states indicating presence/absence:
never had it → has it → lost it
But if we “gap-code” a site we see two observable states:
0 = never had it OR lost it
1 = has it
We can use a model with 3 possible states and the modified pruning algorithm to ensure that there can be no more than one insertion per column when we calculate $\Pr(a_i \mid T)$.

Dollo indel mixture model - mixture part
Rather than infer the insertion rate, we can use a simple mixture approach:
$$\Pr(u_i, a_i \mid T) = \Pr(u_i, a_i \mid T, 0\ \text{ins.})\,\Pr(0\ \text{ins.}) + \Pr(u_i, a_i \mid T, 1\ \text{ins.})\,\Pr(1\ \text{ins.})$$
If $\phi$ is the probability of a site having an insertion:
$$\Pr(u_i, a_i \mid T) = \Pr(u_i, a_i \mid T, 0\ \text{ins.})\,(1 - \phi) + \Pr(u_i, a_i \mid T, 1\ \text{ins.})\,\phi$$
The max. likelihood estimate of $\phi$ can be found very quickly.
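One reason the estimate of φ is cheap: for a fixed tree, the two per-site component likelihoods do not depend on φ, so they can be computed once and reused. The sketch below assumes those per-site values are already available as two lists and uses a standard EM update for a two-component mixture. This is one common way to maximize such a likelihood and is not necessarily what GARLI does. The function names and the numbers in the example are hypothetical.

```python
import math
from typing import Sequence


def mixture_log_likelihood(l0: Sequence[float], l1: Sequence[float], phi: float) -> float:
    """Sum over sites of log[(1 - phi) * L0_i + phi * L1_i]."""
    return sum(math.log((1.0 - phi) * a + phi * b) for a, b in zip(l0, l1))


def estimate_phi(l0: Sequence[float], l1: Sequence[float],
                 phi: float = 0.5, tol: float = 1e-10, max_iter: int = 10000) -> float:
    """EM estimate of phi, holding the per-site component likelihoods fixed.

    l0[i] stands for Pr(u_i, a_i | T, 0 ins.) and l1[i] for Pr(u_i, a_i | T, 1 ins.).
    """
    for _ in range(max_iter):
        # E-step: posterior probability that each site had one insertion.
        weights = [phi * b / ((1.0 - phi) * a + phi * b) for a, b in zip(l0, l1)]
        # M-step: the new phi is the average posterior weight.
        new_phi = sum(weights) / len(weights)
        if abs(new_phi - phi) < tol:
            return new_phi
        phi = new_phi
    return phi


# Made-up per-site likelihoods, for illustration only.
site_l0 = [0.02, 0.001, 0.03, 0.025]
site_l1 = [0.001, 0.02, 0.002, 0.001]
phi_hat = estimate_phi(site_l0, site_l1)
```

Because the log-likelihood is concave in φ, any one-dimensional optimizer would serve equally well here.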

Mk not-fully-absent variant
For comparison, we can also analyze the gap-coded data using:
• a 2-state model that allows re-insertion,
• the standard pruning algorithm.
We should condition our inference on the fact that our data are censored, because $a_i$ cannot be all gaps. This is a slight tweak of the Mkv model (the “morphology model”) of Lewis (2001).
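The conditioning step can be sketched in Python for a toy tree. The sketch assumes a symmetric 2-state Mk model with equal root frequencies and branch lengths in expected changes per site, and it represents a tree as nested lists of (child, branch length) pairs. All names and the example tree are illustrative assumptions, not GARLI's implementation.

```python
import math
from itertools import product

# A leaf is a taxon name; an internal node is a list of (child, branch_length) pairs.
TREE = [("A", 0.1), ([("B", 0.2), ("C", 0.05)], 0.3)]


def transition(t):
    """Transition matrix of the symmetric 2-state Mk model for branch length t."""
    same = 0.5 + 0.5 * math.exp(-2.0 * t)
    return [[same, 1.0 - same], [1.0 - same, same]]


def leaf_names(node):
    """List the taxon names below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child, _ in node for name in leaf_names(child)]


def partials(node, pattern):
    """Standard pruning: likelihood of the data below node for each state at node."""
    if isinstance(node, str):
        return [1.0 if pattern[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, t in node:
        p = transition(t)
        below = partials(child, pattern)
        for s in (0, 1):
            result[s] *= p[s][0] * below[0] + p[s][1] * below[1]
    return result


def pattern_likelihood(tree, pattern):
    """Unconditional probability of a 0/1 column (root frequencies 0.5 each)."""
    return sum(0.5 * x for x in partials(tree, pattern))


def conditioned_likelihood(tree, pattern):
    """Probability of a column given that it is not all gaps (all 0)."""
    all_gaps = {name: 0 for name in leaf_names(tree)}
    return pattern_likelihood(tree, pattern) / (1.0 - pattern_likelihood(tree, all_gaps))
```

A quick sanity check on this sketch is that the conditioned probabilities of all columns containing at least one residue sum to 1 on a small tree.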

Very preliminary results - simulation study
• Simulation on a tree estimated for 64 species of Rana.
• Simulation sweeps over indel rates, indel lengths, and tree depths.
• Tree inference in GARLI under:
– GTR model ignoring indels,
– GTR + Dollo Indel Mixture Model (DIMM), and
– GTR + Mkv variant

Results on true alignment
[Figure: four panels (tree depth 0.1, 0.5, 1.0, and 2.0) plotting error (y-axis, 0 to 0.1) against deletion + insertion rate (x-axis: 0.04, 0.08, 0.16, 0.32) for the dna, dnaMix, and dnaMk analyses.]

Results on MAFFT alignment
[Figure: four panels (tree depth 0.1, 0.5, 1.0, and 2.0) plotting error (y-axis, 0 to 0.6) against deletion + insertion rate (x-axis: 0.04, 0.08, 0.16, 0.32) for the dna, dnaMix, and dnaMk analyses.]

Preliminary Results Summary
• DIMM and the Mkv variant are able to use indel information to improve tree inference when the alignment is reliable.
• DIMM is sensitive to alignment error (though the MAFFT alignments are quite compressed for these simulations).
• The Mkv variant is much less sensitive to alignment error.
• Analyses with Prank alignments are currently underway.
• Analyses using these gap-coding methods are available in GARLI (demo later today; note: not fully optimized).

Future work
1. Multiple tiers of “masking”-like approaches
2. Comparing the Mkv variant and DIMM to identify compression
3. Elision-like approaches in ML
4. Integration of these approaches within SATé

Future work - richer masking/culling to classify columns
Rather than:
• included
• excluded
We can:
• included, with DIMM
• included, with Mkv variant
• included, gaps ignored
• excluded

Future work - identifying compression
Mkv allows any number of insertions in a column, while DIMM forces there to be 0 or 1. We can compare partial likelihoods from both models to identify sites that show strong evidence of multiple insertion events.
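One simple way to act on such a comparison is sketched below. It assumes per-column log-likelihoods are available from both models and flags columns where the Mkv variant fits better than DIMM by more than a cutoff. The function name and the cutoff of 2.0 log units are hypothetical choices for illustration. The two models condition on the data differently, so this should be read as a heuristic screen and not as a formal test.

```python
from typing import List, Sequence


def flag_multiple_insertion_sites(loglik_mkv: Sequence[float],
                                  loglik_dimm: Sequence[float],
                                  threshold: float = 2.0) -> List[int]:
    """Return 0-based column indices where Mkv beats DIMM by more than threshold."""
    if len(loglik_mkv) != len(loglik_dimm):
        raise ValueError("both models must be scored on the same columns")
    return [
        i
        for i, (mkv, dimm) in enumerate(zip(loglik_mkv, loglik_dimm))
        if mkv - dimm > threshold
    ]


# Made-up per-column log-likelihoods, for illustration only.
flagged = flag_multiple_insertion_sites([-10.0, -3.0, -7.5], [-10.5, -9.0, -7.4])
```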

Future work - more appropriate “Elision” method for ML
We will be adding the ability to score multiple possible alignments to GARLI. The likelihoods, rather than the log-likelihoods, will be added across alternatives (see the Redelings and Suchard review chapter, 2008).

Two alternative alignments of the same sequences, columns 1 to 12 (panel (e), with regions labeled X, Y, Z; from a figure in M. S. Y. Lee, TREE, 2001):

            (left)          (right)
Outgroup    TAGAGCACTCAG    TAGAGCACTCAG
Taxon A     TAGAGCACTCAG    TAGAGCACTCAG
Taxon B     TAGTGAAGCCAG    TAGTGAAGCCAG
Taxon C     TAGTGAAGCCAG    TAGTGAAGCCAG
Taxon D     TAG---AGCCAG    TAGAGC---CAG
Taxon E     TAG---AGCCAG    TAGAGC---CAG
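Adding likelihoods across alternative alignments has to be done on the log scale in practice, because the likelihoods themselves underflow. The sketch below assumes one log-likelihood per alternative alignment for a given tree and combines them with the log-sum-exp trick. The function names are illustrative, and any weighting of the alternatives is left out.

```python
import math
from typing import Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Compute log(sum(exp(v))) without underflow."""
    largest = max(values)
    if largest == -math.inf:
        return -math.inf
    return largest + math.log(sum(math.exp(v - largest) for v in values))


def summed_alignment_score(log_likelihoods: Sequence[float]) -> float:
    """Log of the SUM of likelihoods across alternative alignments."""
    return log_sum_exp(log_likelihoods)


def concatenated_alignment_score(log_likelihoods: Sequence[float]) -> float:
    """Score from concatenating alignments: the log-likelihoods are added."""
    return sum(log_likelihoods)


# Hypothetical log-likelihoods of one tree under two alternative alignments.
scores = [-5000.0, -5010.0]
summed = summed_alignment_score(scores)
concatenated = concatenated_alignment_score(scores)
```

With these numbers the summed score stays close to the better alignment's log-likelihood, whereas concatenation adds the two log-likelihoods, which amounts to multiplying the likelihoods.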
