On the identifiability of two tree mixtures for group-based models

SLIDE 1

On the identifiability of two tree mixtures for group-based models

  • E. Allman (1)
  • S. Petrović (2)
  • J. Rhodes (1)
  • S. Sullivant (3)

(1) University of Alaska Fairbanks
(2) University of Illinois Chicago
(3) North Carolina State University

Phylomania 2010 – Hobart, Tasmania November 2010

On the identifiability of two tree mixtures for group-based models 1/33

SLIDE 2

Today’s talk:

◮ identifiability of 2-tree mixture models
◮ work dates from 2009 and before
◮ focus today on algebraic techniques (technical at times)

SLIDE 3

Background

Interest sparked by papers/conversations

◮ Kolaczkowski and Thornton: 2004, Nature
◮ Mossel and Vigoda; Ronquist et al.: 2005, 2006, Science
◮ Štefankovič and Vigoda: 2007, JCB, Phylogeny of Mixture Models: Robustness of Maximum Likelihood and Non-identifiable Distributions
◮ Matsen and Steel: 2007, Sys. Bio., Phylogenetic mixtures on a single tree can mimic a tree of another topology
◮ Matsen, Mossel, and Steel: 2008, BMB, Mixed-up trees: the structure of phylogenetic mixtures
◮ Junhyong Kim

SLIDE 4

Due to incomplete lineage sorting, or other biological phenomena, sequence data may have evolved along two or more trees.

[Figure: a species tree containing two gene trees, Gene 1 and Gene 2]

Q: Is it theoretically possible to identify the two trees giving rise to the expected pattern frequencies?
Q′: If so, what about the numerical parameters for these trees?

SLIDE 6

Modeling sequence evolution along a tree(s)

For a fixed tree $T$ and a model of sequence evolution (GTR, GTR+Γ, JC, ...), the distribution of states at the leaves of $T$ is a function $\psi_T$ of the model's parameters.

  • E.g., the GTR model on an $n$-taxon tree $T$ has the parameterization map

\[
\psi_T : S_T \longrightarrow \Delta^{4^n - 1}, \qquad (\pi, Q, \{t_e\}) \longmapsto P = (p_{i_1 \cdots i_n}),
\]

where $p_{i_1 \cdots i_n}$ is the expected frequency of pattern $i = i_1 \cdots i_n$ at the leaves of $T$.

SLIDE 7

Mixture models

Modeling sequence evolution along two or more trees requires using a mixture model.

  • E.g., suppose $T_1$ and $T_2$ are two $n$-taxon trees. Then the distribution is a point in the image of

\[
\psi_{T_1,T_2} : S_{T_1} \times S_{T_2} \times [0,1] \longrightarrow \Delta^{4^n - 1}, \qquad (s_1, s_2, w) \longmapsto P = (p_{i_1 \cdots i_n}),
\]

where $P = w\,\psi_{T_1}(s_1) + (1 - w)\,\psi_{T_2}(s_2)$ is the weighted sum of the distributions for parameter choices on $T_1$ and $T_2$.

SLIDE 8

Group-based models

Today: focus on group-based models

  • Cavender-Farris-Neyman (CFN)
  • Jukes-Cantor (JC)
  • Kimura 2-Parameter (K2P)
  • Kimura 3-Parameter (K3P)

These models, as well as the general Markov model (GM), have an algebraic structure useful for analysis.

SLIDE 9

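Model parameters $\pi$, $\{M_e\}$ on tree $T$

[Figure: a rooted tree with root distribution $\pi$, edge matrices $M_1, \dots, M_4$, and leaves $S_1$, $S_2$, $S_3$]

Summing over the state $l$ at the root and the state $m$ at the internal node, the pattern frequencies

\[
p_{ijk} = \sum_{l=1}^{4} \sum_{m=1}^{4} \pi_l \, M_1(l,m) \, M_2(m,i) \, M_3(m,j) \, M_4(l,k)
\]

are polynomials in the parameters, which leads to a polynomial parameterization map $\psi_T$. Thus, any mixture distribution $P_{T_1,T_2} \in \operatorname{Im} \psi_{T_1,T_2}$ is also parameterized by polynomials.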
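The double sum for $p_{ijk}$ can be evaluated directly. The following is a minimal Python sketch, assuming Jukes-Cantor edge matrices and a uniform root distribution purely for illustration; the helper names `jc_matrix`, `pattern_probs`, and `mixture`, and all numerical values, are hypothetical and do not come from the talk. Exact rational arithmetic is used so that the entries visibly sum to 1.

```python
from fractions import Fraction
from itertools import product

STATES = range(4)  # the four nucleotide states, in any fixed order


def jc_matrix(b):
    """Jukes-Cantor transition matrix: off-diagonal entries b, diagonal 1 - 3b."""
    return [[1 - 3 * b if r == c else b for c in STATES] for r in STATES]


def pattern_probs(pi, M1, M2, M3, M4):
    """p[i, j, k] = sum over l, m of pi[l] M1[l][m] M2[m][i] M3[m][j] M4[l][k]."""
    return {
        (i, j, k): sum(
            pi[l] * M1[l][m] * M2[m][i] * M3[m][j] * M4[l][k]
            for l in STATES
            for m in STATES
        )
        for i, j, k in product(STATES, repeat=3)
    }


def mixture(w, P1, P2):
    """Weighted sum w * P1 + (1 - w) * P2 of two pattern distributions."""
    return {key: w * P1[key] + (1 - w) * P2[key] for key in P1}


# Illustrative parameter values (assumptions, not taken from the slides).
pi = [Fraction(1, 4)] * 4
P1 = pattern_probs(pi, *(jc_matrix(Fraction(1, n)) for n in (10, 20, 30, 40)))
P2 = pattern_probs(pi, *(jc_matrix(Fraction(1, n)) for n in (8, 16, 24, 32)))
P = mixture(Fraction(1, 3), P1, P2)
```

Every entry of `P1`, `P2`, and `P` is a polynomial expression in the matrix entries, the root distribution, and the weight, which is the point of the slide. A 3-leaf tree has only one topology, so this mixture uses two parameter choices on the same tree.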

SLIDE 10

Mixture varieties

Extending the parameterization to complex parameters, define

\[ V_{T_1} * V_{T_2} = \operatorname{Im} \psi_{T_1,T_2}, \]

the phylogenetic mixture variety.

(Point: This allows ideas from algebraic geometry to be used.)

SLIDE 11

Algebraic geometry reminders

◮ Fundamental correspondence:

\[ \text{Geometry} \longleftrightarrow \text{Algebra}, \qquad V \longleftrightarrow I_V \]

Corresponding to any phylogenetic variety $V$ is its ideal $I_V$ of phylogenetic invariants: the ideal of polynomials $f$ in the pattern frequencies $p_i$ such that $f(P) = 0$ for every $P \in V$.

◮ Inclusion-reversing correspondence:

\[ V_1 \subseteq V_2 \iff I_{V_2} \subseteq I_{V_1} \]

SLIDE 12

More notation

For stochastic parameter choices, denote the collection of joint distributions by $M_{T_1} * M_{T_2}$. Note that $M_{T_1} * M_{T_2} \subsetneq V_{T_1} * V_{T_2}$.

Though the varieties are used for proofs because of their algebraic structure (dimension, good intersection properties, etc.), all results today hold for the stochastic distributions.

SLIDE 13

Monomial parameterization

Hendy, Penny, Székely, Erdős, Evans, Speed, Sturmfels, Sullivant:

Group-based models can be diagonalized by means of the discrete Fourier transform over $G$ (Hadamard transform).

In the Fourier coordinates, group-based models give rise to toric varieties. (In this setting, $\psi_T$ is parameterized by monomials.)

Moreover, the discrete Fourier transform is a linear change of variables, so it behaves well with respect to taking mixtures of group-based models:

\[ F(M_{T_1}) * F(M_{T_2}) = F(M_{T_1} * M_{T_2}) \]
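The diagonalization claim can be checked by hand on a single edge. The sketch below is an illustration under stated assumptions: the identification of A, C, G, T with the elements of $\mathbb{Z}_2 \times \mathbb{Z}_2$ in that order (A as the identity) is one possible labeling, and the function names are hypothetical. It conjugates a Jukes-Cantor transition matrix by the $4 \times 4$ Hadamard matrix and exposes the diagonal of Fourier parameters.

```python
from fractions import Fraction

# Elements of Z2 x Z2 in the order A, C, G, T (A is the identity); an assumed labeling.
GROUP = [(0, 0), (0, 1), (1, 0), (1, 1)]


def character(g, h):
    """Character of Z2 x Z2: chi_g(h) = (-1)^(g . h)."""
    return (-1) ** (g[0] * h[0] + g[1] * h[1])


# 4 x 4 Hadamard matrix; it is symmetric and satisfies H * H = 4 * I.
H = [[character(g, h) for h in GROUP] for g in GROUP]


def matmul(X, Y):
    """Product of two 4 x 4 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def jukes_cantor(b):
    """Jukes-Cantor transition matrix: off-diagonal entries b, diagonal 1 - 3b."""
    return [[1 - 3 * b if i == j else b for j in range(4)] for i in range(4)]


def fourier(M):
    """Conjugate M by the Hadamard matrix: H M H^{-1} = H M H / 4."""
    return [[x / 4 for x in row] for row in matmul(matmul(H, M), H)]


D = fourier(jukes_cantor(Fraction(1, 10)))
```

For a Jukes-Cantor matrix with off-diagonal entry $b$, the result is diagonal with entries $1, 1 - 4b, 1 - 4b, 1 - 4b$: the parameter for the identity is 1 and the three non-identity parameters coincide.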

SLIDE 14

Fourier coordinates

For each split $A|B$ in $T$, introduce a set of Fourier parameters $\{a^{A|B}_g : g \in G\}$.

Theorem (Hendy-Penny)

In the Fourier coordinates, a group-based phylogenetic model is given parametrically by:

\[
q_{g_1, \dots, g_n} =
\begin{cases}
\prod_{A|B \in \Sigma(T)} a^{A|B}_{\sum_{a \in A} g_a} & \text{if } g_1 + \cdots + g_n = 0, \\
0 & \text{if } g_1 + \cdots + g_n \neq 0.
\end{cases}
\]

'Coordinates' in this parameterization are called q-coordinates.

SLIDE 15

Fourier coordinates

For JC, K2P, we take $G = \mathbb{Z}_2 \times \mathbb{Z}_2 = \{A, C, G, T\}$.

◮ For the K2P model, we have $a^{A|B}_G = a^{A|B}_T$ for all $A|B$.
◮ For the JC model, we have $a^{A|B}_C = a^{A|B}_G = a^{A|B}_T$ for all $A|B$.

SLIDE 16

Tree parameter identifiability (stochastic version)

Definition

The tree parameters $T_1, \dots, T_k$ in a $k$-class phylogenetic mixture model are identifiable if for all $P \in M_{T_1} * \cdots * M_{T_k}$ there does not exist another set of $k$ trees $T'_1, \dots, T'_k$ such that $P \in M_{T'_1} * \cdots * M_{T'_k}$.

SLIDE 17

Tree parameter identifiability (geometric version)

[Figure: the mixture varieties $V_{T_1} * V_{T_2}$ and $V_{T_3} * V_{T_i}$]

Definition

The tree parameters in a $k$-class phylogenetic mixture model are generically identifiable if for all non-equal multisets $\{T_1, \dots, T_k\}$ and $\{T'_1, \dots, T'_k\}$,

\[
\dim\bigl( V_{T_1} * \cdots * V_{T_k} \cap V_{T'_1} * \cdots * V_{T'_k} \bigr) < \dim\bigl( V_{T_1} * \cdots * V_{T_k} \bigr).
\]

SLIDE 18

Generic identifiability of tree parameters

An immediate consequence of the geometric definition

\[
\dim\bigl( V_{T_1} * V_{T_2} \cap V_{T'_1} * V_{T'_2} \bigr) < \dim\bigl( V_{T_1} * V_{T_2} \bigr)
\]

is that tree parameters are generically identifiable for stochastic parameter choices too.

That is, the trees giving rise to $M_{T_1} * M_{T_2}$ are identifiable, except on a non-generic set $E$ of stochastic parameters $(s_1, s_2, w)$ of Lebesgue measure zero where

\[
\psi_{T_1,T_2}(s_1, s_2, w) = \psi_{T'_1,T'_2}(s'_1, s'_2, w').
\]

($E$ is the set of bad parameters.)

SLIDE 19

Algebraic methods for proofs

Use

◮ dimension counts for phylogenetic varieties
◮ all phylogenetic mixture varieties are irreducible, since they are parameterized
◮ two irreducible varieties of the same dimension either coincide or intersect in a subvariety of lower dimension (analogy with linear spaces)
  $\Longrightarrow$ if two phylogenetic varieties are distinct, then parameters will be generically identifiable
◮ two varieties $V_1$ and $V_2$ are distinct if $I_{V_1} \neq I_{V_2}$, and $V_1 \not\subseteq V_2$ if there exists an invariant $f_2 \in I_{V_2} \setminus I_{V_1}$

SLIDE 20

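[Figure: the mixture varieties $V_{T_1} * V_{T_2}$ and $V_{T_3} * V_{T_i}$ are distinct, since $I_{V_{T_1} * V_{T_2}} \neq I_{V_{T_3} * V_{T_i}}$]

[Figure: $V_{T_3}$ is not contained in $V_{T_1} * V_{T_2}$, since $\exists f \in I_{V_{T_1} * V_{T_2}} \setminus I_{V_{T_3}}$]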

SLIDE 21

Algebraic methods for proofs

Use

◮ group-based models (JC and K2P) have linear invariants which can be used to construct invariants for 2-tree mixtures
◮ computational algebra packages like Singular

SLIDE 22

Main theorem (tree parameters)

Theorem

The tree parameters of the 2-tree mixture model $M_{T_1} * M_{T_2}$ are generically identifiable under the Jukes-Cantor and Kimura 2-parameter models if $T_1$, $T_2$ are binary with $n \geq 4$ leaves.

Strategy: Prove the theorem for quartets ($n = 4$), then lift to larger trees.

SLIDE 23

Identifiability of quartet trees

Proposition

Let $T_1 = 12|34$, $T_2 = 14|23$, $T_3 = 13|24$. Then

\[ \ell(q) = q_{GGGG} + q_{GTGT} - q_{GGTT} - q_{GTTG} \]

satisfies $\ell(q) = 0$ for all $q \in M_{T_1} * M_{T_2}$, but $\ell(q) \neq 0$ for some $q \in M_{T_3}$, for the JC and K2P models.

Corollary

Generic identifiability of tree parameters holds for $n = 4$.

A few details ...

SLIDE 24

If $q = w q_1 + (1 - w) q_2$, then since $\ell$ is linear,

\[ \ell(q) = \ell(w q_1 + (1 - w) q_2) = w\,\ell(q_1) + (1 - w)\,\ell(q_2). \]

[Figure: the four terms $q_{GGGG}$, $q_{GTGT}$, $q_{GGTT}$, $q_{GTTG}$ of $\ell(q)$ drawn on each of the quartet trees $T_1$, $T_2$, $T_3$; the signed sum is $= 0$ on $T_1$, $= 0$ on $T_2$, and $\neq 0$ on $T_3$]
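The vanishing of $\ell$ on $T_1$ and $T_2$ but not on $T_3$ can be checked numerically from the Hendy-Penny monomial parameterization. This is a minimal sketch under stated assumptions: the labeling of $\mathbb{Z}_2 \times \mathbb{Z}_2$ by A, C, G, T with A as the identity, the normalization $a_A = 1$, the Jukes-Cantor constraint $a_C = a_G = a_T$, and all function names are illustrative choices rather than the authors' code.

```python
from fractions import Fraction
import random

# Z2 x Z2 with A as the identity; this labeling is an assumption.
ELEM = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
NAME = {v: k for k, v in ELEM.items()}


def group_sum(letters):
    """Sum of group elements, returned as a letter."""
    letters = list(letters)
    x = sum(ELEM[g][0] for g in letters) % 2
    y = sum(ELEM[g][1] for g in letters) % 2
    return NAME[(x, y)]


def quartet_splits(cherry):
    """One side of each split: the four leaf splits plus the internal split."""
    return [(0,), (1,), (2,), (3,), cherry]


T1 = quartet_splits((0, 1))  # 12|34
T2 = quartet_splits((0, 3))  # 14|23
T3 = quartet_splits((0, 2))  # 13|24


def random_jc_params(splits, rng):
    """JC Fourier parameters: a_C = a_G = a_T on every split; a_A = 1 is assumed."""
    params = {}
    for s in splits:
        a = Fraction(rng.randint(1, 99), 100)
        params[s] = {"A": Fraction(1), "C": a, "G": a, "T": a}
    return params


def q_coord(pattern, splits, params):
    """Hendy-Penny monomial for one pattern such as 'GTGT'."""
    if group_sum(pattern) != "A":
        return Fraction(0)
    value = Fraction(1)
    for s in splits:
        value *= params[s][group_sum(pattern[i] for i in s)]
    return value


def ell(splits, params):
    """The linear form q_GGGG + q_GTGT - q_GGTT - q_GTTG on one tree."""
    q = lambda pattern: q_coord(pattern, splits, params)
    return q("GGGG") + q("GTGT") - q("GGTT") - q("GTTG")


rng = random.Random(0)
l1 = ell(T1, random_jc_params(T1, rng))
l2 = ell(T2, random_jc_params(T2, rng))
l3 = ell(T3, random_jc_params(T3, rng))
w = Fraction(1, 3)
l_mix = w * l1 + (1 - w) * l2  # value of ell on the mixture, by linearity
```

On $T_1$ and $T_2$ the internal split sees the identity in two of the four patterns and a non-identity element in the other two, with opposite signs, so the terms cancel. On $T_3$ both positive terms see the identity and both negative terms see a non-identity element, so $\ell$ is nonzero whenever the internal-edge parameter differs from 1 and the leaf parameters are nonzero.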

SLIDE 25

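Generic identifiability of tree parameters holds for $n = 4$.

Case: two different trees in mixture

◮ The linear invariant $\ell \in I_{V_{T_1} * V_{T_2}} \setminus I_{V_{T_3}}$. Thus, $V_{T_3} \not\subseteq V_{T_1} * V_{T_2}$.
◮ Since $V_{T_3} \subset V_{T_3} * V_{T_i}$, we have $I_{V_{T_3} * V_{T_i}} \subset I_{V_{T_3}}$ and thus $\ell \in I_{V_{T_1} * V_{T_2}} \setminus I_{V_{T_3} * V_{T_i}}$.
◮ Since $V_{T_1} * V_{T_2}$ and $V_{T_3} * V_{T_i}$ are irreducible of the same dimension with different ideals, they are distinct.

[Figure: $V_{T_3}$ is not contained in $V_{T_1} * V_{T_2}$, since $\exists\, \ell \in I_{V_{T_1} * V_{T_2}} \setminus I_{V_{T_3}}$]

[Figure: $V_{T_1} * V_{T_2}$ and $V_{T_3} * V_{T_i}$ are distinct, since $I_{V_{T_1} * V_{T_2}} \neq I_{V_{T_3} * V_{T_i}}$]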

SLIDE 26

Generic identifiability of continuous parameters

Definition

The continuous parameters of a 2-tree mixture model are generically identifiable if for generic choices of $(s_1, s_2, w)$,

\[ \psi_{T_1,T_2}(s_1, s_2, w) = \psi_{T_1,T_2}(s'_1, s'_2, w') \]

implies $(s_1, s_2, w) = (s'_1, s'_2, w')$ or, in the case where $T_1 = T_2$, that $(s_1, s_2, w) = (s'_2, s'_1, 1 - w')$.

SLIDE 27

Main theorem (continuous parameters)

Theorem*

The continuous parameters of the 2-tree mixture model $M_{T_1} * M_{T_2}$ are generically identifiable under the Jukes-Cantor and Kimura 2-parameter models if $T_1$, $T_2$ are binary with $n \geq 5$ leaves.

Definition

Theorem* means that the result holds with high probability.

Note: If $T_1 = T_2$, no $*$ needed.

SLIDE 28

Theorem* ?

Proposition:

Let $\psi : \mathbb{C}^d \to \mathbb{C}^m$ be a polynomial (or rational) map. Then for some $k \in \{1, 2, 3, \dots, \infty\}$, $\psi$ is generically $k$-to-1.

That is, except for some exceptional set $E$ of parameter space, the map will be $k$-to-1. Moreover, $E$ is of Lebesgue measure 0 within the parameter space.

For example, take $\psi : \mathbb{C} \to \mathbb{C}$ given by $\psi(z) = z^2$ or $\psi(z) = 1/z^2$. Taking $E = \{0\}$, we get $k = 2$.

SLIDE 30

Polynomial maps are generically k − to − 1

  1. To prove* the Theorem* for a particular tree, repeatedly generate random rational parameter choices $\theta$, then symbolically solve the simultaneous polynomial system $\psi(t) = \psi(\theta)$ and hope for one solution. (One solution means that the parameters are 'probably' generically identifiable.)
  2. We check this using the software Singular, for JC and K2P on 4- and 5-taxon trees.
  3. Recovering parameters uniquely on quartets $\Longrightarrow$ recover parameters on arbitrarily sized trees.
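The random-fiber experiment in step 1 can be illustrated on a toy map. The sketch below is not the Singular computation from the talk: it assumes a hypothetical stand-in map $\psi(t_1, t_2) = (t_1 + t_2,\ t_1 t_2)$, chosen because its fibers can be solved exactly with the quadratic formula and because it shows the same label-swapping symmetry as a mixture on a single tree. All function names are illustrative.

```python
from fractions import Fraction
import math
import random


def psi(t):
    """Toy polynomial map (t1, t2) -> (t1 + t2, t1 * t2); symmetric under swapping."""
    return (t[0] + t[1], t[0] * t[1])


def rational_sqrt(x):
    """Exact square root of a Fraction, or None if x is not a rational square."""
    if x < 0:
        return None
    n, d = math.isqrt(x.numerator), math.isqrt(x.denominator)
    return Fraction(n, d) if n * n == x.numerator and d * d == x.denominator else None


def rational_fiber(value):
    """All rational t with psi(t) == value: t1, t2 are the roots of x^2 - s x + p."""
    s, p = value
    root = rational_sqrt(s * s - 4 * p)
    if root is None:
        return set()
    r1, r2 = (s + root) / 2, (s - root) / 2
    return {(r1, r2), (r2, r1)}


def fiber_sizes(trials, rng):
    """Draw random rational theta and record the size of the fiber over psi(theta)."""
    sizes = []
    for _ in range(trials):
        theta = (Fraction(rng.randint(1, 999), 1000),
                 Fraction(rng.randint(1, 999), 1000))
        sizes.append(len(rational_fiber(psi(theta))))
    return sizes


sizes = fiber_sizes(20, random.Random(1))
```

For this toy map a random $\theta$ gives a fiber of size 2, the point and its swap, so the map is generically 2-to-1, or 1-to-1 up to label swapping. The exceptional set is the diagonal $t_1 = t_2$, where the fiber has a single point. Repeated random trials only suggest the generic fiber size, which is the reason for the asterisk on Theorem*.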

SLIDE 31

Why n = 5 in Theorem*?

Proposition*

For $T$ a 4-taxon tree under the Jukes-Cantor model, the continuous parameters in $M_T * M_T$ are not generically identifiable. The map $\psi_{T,T}$ is generically 6-to-1 (up to label swapping).

For biologically relevant parameters, we observed between 1 and 4 biologically relevant preimages.

SLIDE 32

Another Mathematical Surprise

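[Figure: three 5-taxon trees, with leaf labels reading 1 2 3 4 5 ($T_1$), 1 2 4 5 3 ($T_2$), and 1 4 2 3 5 ($T_3$)]

Theorem

For the Jukes-Cantor model, $V_{T_2} \subseteq V_{T_1} * V_{T_3}$.

Q: Is $M_2 \subseteq M_1 * M_3$?
A*: No, unless you allow 0 and/or infinite branch lengths in $T_1$ and $T_3$.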

SLIDE 34

Mixtures of many trees

Recent work of Rhodes and Sullivant has advanced these results:

Theorem

◮ Under the general Markov model of sequence evolution, the tree parameter and continuous parameters are generically identifiable for a $k$-class mixture on the same tree, provided $k < 4^{\lceil n/4 \rceil - 1}$.

SLIDE 35

Open problems

◮ Develop methods to remove * from Theorem*
◮ Beyond group-based models: GTR, rate variation
◮ Arbitrary k-tree mixtures

SLIDE 36

Thank you.
