SLIDE 1

Change-point Detection on a Tree to Study Evolutionary Adaptation from Present-day Species

Cécile Ané 1,2, Paul Bastide 3,4, Mahendra Mariadassou 4, Stéphane Robin 3

1 Department of Statistics, University of Wisconsin–Madison, WI 53706, USA
2 Department of Botany, University of Wisconsin–Madison, WI 53706, USA
3 UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005 Paris, France
4 MaIAGE, INRA, Université Paris-Saclay, 78352 Jouy-en-Josas, France

19 April 2016

SLIDE 2

Stochastic Processes on Trees Identifiability Problems and Counting Issues Statistical Inference Turtles Data Set

Introduction

[Figure: turtles phylogenetic tree with habitats (Jaffe et al., 2011); tips include Dermochelys coriacea and Homopus areolatus.]

How can we explain the diversity while accounting for the phylogenetic correlations? Modelling: a shifted stochastic process on the phylogeny.

CA, PB, MM, SR Change-point Detection on a Tree 2/19

SLIDE 3

Outline

1. Stochastic Processes on Trees
2. Identifiability Problems and Counting Issues
3. Statistical Inference
4. Turtles Data Set


SLIDE 4

Stochastic Process on a Tree

(Felsenstein, 1985)

[Figure: a trait simulated along a tree rooted at R with tips A–H; t is the time elapsed from the root and t_AB the time shared by tips A and B.]

Only tip values are observed.

Brownian motion:

Var[A | R] = σ² t
Cov[A, B | R] = σ² t_AB

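The BM covariance above can be made concrete: the tip covariance matrix is σ² times the matrix of shared times. A minimal Python sketch (the talk's implementation is in R; the 3-tip tree and all numeric values here are invented for illustration):

```python
import numpy as np

# Toy ultrametric tree of height 1 with tips A, B, C:
# A and B diverge at time 0.7; C splits off at the root.
# shared_time[i][j] = time from the root to the most recent common ancestor.
shared_time = np.array([
    [1.0, 0.7, 0.0],   # Var[A | R] = sigma^2 * t, with t = 1
    [0.7, 1.0, 0.0],   # Cov[A, B | R] = sigma^2 * t_AB, with t_AB = 0.7
    [0.0, 0.0, 1.0],   # C shares no time with A or B below the root
])
sigma2 = 2.0
V = sigma2 * shared_time   # BM covariance of the tip values

# Sample correlated tip values around the root state.
rng = np.random.default_rng(0)
root_state = 5.0
tips = rng.multivariate_normal(root_state * np.ones(3), V)
print(V)
print(tips)
```

Shift models (below) only change the mean vector; this covariance structure is untouched.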

SLIDE 5

BM vs OU

[Figure: sample paths W(t) of a BM (left) and of an OU process reverting to its optimum β (right).]

BM:
- Equation: dW(t) = σ dB(t)
- Stationary state: none
- Variance: Σ_ij = σ² t_ij

OU:
- Equation: dW(t) = σ dB(t) + α [β(t) − W(t)] dt
- Expectation: E[W(t)] = e^{−αt} W(0) + (1 − e^{−αt}) β
- Stationary state: mean μ = β0, variance γ² = σ²/(2α)
- Phylogenetic half-life: t_{1/2} = ln(2)/α
- Variance: Σ_ij = γ² e^{−α(t_i + t_j)} (e^{2α t_ij} − 1)

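As a sanity check on the OU formulas: setting t_i = t_j = t_ij = t in Σ_ij recovers the single-tip variance γ²(1 − e^{−2αt}), which tends to the stationary variance γ² = σ²/(2α). A small Python sketch (parameter values are illustrative, not from the talk):

```python
import math

# Illustrative OU parameters
alpha, sigma2 = 3.0, 0.5
gamma2 = sigma2 / (2 * alpha)      # stationary variance gamma^2
half_life = math.log(2) / alpha    # phylogenetic half-life t_{1/2}

def ou_cov(t_i, t_j, t_ij):
    """Covariance of tips i and j started at the root; t_ij = shared time."""
    return gamma2 * math.exp(-alpha * (t_i + t_j)) * (math.exp(2 * alpha * t_ij) - 1)

# Single-tip variance at time t = 1: gamma^2 * (1 - e^{-2 alpha t}).
var_t = ou_cov(1.0, 1.0, 1.0)
print(gamma2, half_life, var_t)
```

Note that two tips with no shared time (t_ij = 0) are uncorrelated, since e^0 − 1 = 0.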

SLIDE 6

Shifts

[Figure: a shift of size δ on one branch of the tree, and the resulting trait trajectories.]

BM — shift in the mean: m_child = m_parent + δ

OU — shift in the optimal value: β_child = β_parent + δ



SLIDE 10

Linear Regression Model

[Figure: a tree with internal nodes Z1–Z4, tips Y1–Y5, and shifts δ1, δ2, δ3 on three branches.]

Δ = (μ, δ1, δ2, δ3)ᵀ,  TΔ = (μ + δ2, μ, μ + δ1 + δ3, μ + δ1, μ + δ1)ᵀ

T is the 0/1 incidence matrix of the tree (one row per tip Y1–Y5, one column per node Z1–Z4, Y1–Y5): T_ij = 1 if node j lies on the path from the root to tip i.

BM: Y = TΔ_BM + E_BM

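The design matrix T can be built by walking from each tip up to the root. A Python sketch on a hypothetical 5-tip tree (node labels, branch layout, and the shift position are invented for illustration; the actual package is in R):

```python
import numpy as np

# Hypothetical rooted tree: parent of each node, root parent = -1.
# Nodes 0..4 are the tips, nodes 5..8 are internal (8 is the root).
parent = {0: 5, 1: 5, 2: 6, 3: 6, 4: 7, 5: 7, 6: 8, 7: 8, 8: -1}

def incidence(parent, tips):
    """T[i, j] = 1 if node j is on the path from the root to tip i."""
    nodes = sorted(parent)
    T = np.zeros((len(tips), len(nodes)), dtype=int)
    for row, tip in enumerate(tips):
        node = tip
        while node != -1:
            T[row, nodes.index(node)] = 1
            node = parent[node]
    return T

T = incidence(parent, tips=[0, 1, 2, 3, 4])

# One shift on the branch above internal node 5: every tip below it moves.
delta = np.zeros(9)
delta[5] = 2.5
mu = 1.0
tip_means = mu + T @ delta   # BM: E[Y] = T * Delta, root value folded into mu
print(tip_means)
```

Only tips 0 and 1 descend from node 5, so only their means move by 2.5.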

SLIDE 11

Linear Regression Model

[Figure: the same tree, with shifts δ1, δ2, δ3.]

Δ = (λ, δ1, δ2, δ3)ᵀ,  TW(α)Δ = (λ + w5 δ2, λ, λ + w2 δ1 + w7 δ3, λ + w2 δ1, λ + w2 δ1)ᵀ

W(α) = Diag( 1 − e^{−α(h − t_pa(i))}, 1 ≤ i ≤ m + n ),  λ = μ e^{−αh} + β0 (1 − e^{−αh})

BM: Y = TΔ_BM + E_BM
OU: Y = TW(α)Δ_OU + E_OU


SLIDE 12

Equivalencies

Number of shifts K fixed, several equivalent solutions.

[Figure: two different shift allocations giving the same tip means (μ, μ + δ1, μ + δ2): shifts δ1 and δ2 on sister branches, or a shift δ1 followed by a shift δ2 − δ1.]

Problem of over-parametrization: parsimonious configurations.



SLIDE 14

Parsimonious Solution: Definition

Definition (Parsimonious Allocation). Given a coloring of the tips, an allocation of the shifts is parsimonious if it produces that coloring with a minimal number of shifts.



SLIDE 21

Equivalent Parsimonious Allocations

Definition (Equivalency). Two allocations are said to be equivalent (noted ∼) if they are both parsimonious and give the same colors at the tips.

Find one solution: several existing dynamic programming algorithms (Fitch, Sankoff; see Felsenstein, 2004).

Enumerate all solutions: a new recursive algorithm, adapted from the previous ones (and implemented in R).


SLIDE 22

Equivalent Parsimonious Solutions for an OU Model.

[Figure: equivalent allocations and values of the shifts — OU. Equivalent allocations can carry very different shift values.]


SLIDE 23

Collection of Models

New problem: the number of equivalence classes |S_K^PI|.

Is |S_K^PI| = C(m + n − 1, K), i.e. (# of edges) choose (# of shifts)? In general, no: the count depends on the topology of the tree.

A recursive algorithm computes |S_K^PI| (implemented in R).

Binary tree: |S_K^PI| = C(2n − 2 − K, K), i.e. (# of edges − # of shifts) choose (# of shifts).
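For binary trees the count is a single binomial coefficient, so it is easy to tabulate. A short sketch (the function name is ours):

```python
from math import comb

def n_classes_binary(n, K):
    """|S_K^PI| for a rooted binary tree with n tips: C(2n - 2 - K, K)."""
    return comb(2 * n - 2 - K, K)

# A rooted binary tree with n tips has 2n - 2 edges; the count is smaller
# than C(2n - 2, K) (all placements of K shifts on distinct edges) because
# different placements can be equivalent.
print([n_classes_binary(10, K) for K in range(4)])
```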

SLIDE 24

EM Algorithm: number of shifts K fixed

[Figure: tree with internal nodes Z1–Z4, tips Y1–Y5, a shift δ on the branch of length ℓ7 leading to Y3, and a branch of length ℓ4 leading to Z4.]

Y3 | Z2 ∼ N(Z2 + δ, ℓ7 σ²)
Z4 | Z1 ∼ N(Z1, ℓ4 σ²)

log p_θ(Y) = E_θ[log p_θ(Z, Y) | Y] − E_θ[log p_θ(Z | Y) | Y]

p_θ(Z, Y) = p_θ(Z1) ∏_{1 < j ≤ m} p_θ(Z_j | Z_pa(j)) ∏_{1 ≤ i ≤ n} p_θ(Y_i | Z_pa(i))

EM algorithm (maximize E_θ[log p_θ(Z, Y) | Y]):
- E step: given θ_h, compute p_{θ_h}(Z | Y).
- M step: θ_{h+1} = argmax_θ E_{θ_h}[log p_θ(Z, Y) | Y].

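Since (Z, Y) is jointly Gaussian given the parameters, the E step p_θ(Z | Y) reduces to Gaussian conditioning (a Schur complement); the actual EM uses an upward–downward recursion on the tree, but a direct computation shows the idea. The toy numbers below (one internal node Z at time 0.5, two tips at time 1, BM with σ² = 1, root fixed at 0) are ours:

```python
import numpy as np

# Joint covariance blocks from the shared times under BM:
# Var[Z] = 0.5, Var[Y_i] = 1, Cov[Z, Y_i] = 0.5, Cov[Y1, Y2] = 0.5.
Szz = np.array([[0.5]])
Szy = np.array([[0.5, 0.5]])
Syy = np.array([[1.0, 0.5],
                [0.5, 1.0]])

y = np.array([1.0, 3.0])                          # observed tip values
mean_z = Szy @ np.linalg.solve(Syy, y)            # E[Z | Y] (zero prior means)
var_z = Szz - Szy @ np.linalg.solve(Syy, Szy.T)   # Var[Z | Y], Schur complement
print(mean_z, var_z)
```

Here E[Z | Y] = 4/3: the tip average 2, shrunk toward the root value 0.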

SLIDE 25

Model Selection on K

[Figure: simulated OU (α = 3, γ² = 0.1); estimated shift allocations for increasing numbers of shifts K, and the corresponding log-likelihood curve against K.]

Ŷ_K = argmax_{η ∈ S_K^PI} −(n/2) log( ‖Y − Ŷ_η‖²_V / n )

LL = −(n/2) log( ‖Y − Ŷ_K‖²_V / n )

The log-likelihood LL keeps increasing with K, so it cannot by itself select the number of shifts.


SLIDE 35

Model Selection: Penalized Likelihood

Idea:

K̂ = argmax_{0 ≤ K ≤ p−1} { −(n/2) log( ‖Y − Ŷ_K‖²_V / n ) − (1/2) pen′(K) }

[Figure: log-likelihood and penalized criteria (LL, AIC, BIC, LINselect) plotted against K.]

Penalties:
- AIC: K + 3
- BIC: (1/2)(K + 3) log(n)
- LINselect: pen(n, K, |S_K^PI|)
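The penalized criteria can be compared on a grid of K. A Python sketch where the LL values are hypothetical stand-ins for the profile log-likelihood, and the ½ factor of the slide is folded into pen(K):

```python
import math

n = 64                      # number of tips (assumed)
# Hypothetical profile log-likelihoods LL(K): increasing, diminishing returns.
LL = [-120.0, -95.0, -80.0, -74.0, -71.5, -70.0, -69.2, -68.9]

def select_K(LL, pen):
    """Pick K maximizing the penalized log-likelihood LL(K) - pen(K)."""
    crit = [ll - pen(K) for K, ll in enumerate(LL)]
    return crit.index(max(crit))

aic = lambda K: K + 3.0                     # AIC-type penalty
bic = lambda K: 0.5 * (K + 3) * math.log(n) # BIC-type penalty
print(select_K(LL, aic), select_K(LL, bic))
```

With these numbers BIC, whose per-shift cost ½ log(64) ≈ 2.08 exceeds AIC's cost of 1, stops one shift earlier than AIC.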

SLIDE 39

Proposition: LINselect Penalty

Proposition (Form of the Penalty and guarantees, α known). Under our setting Y = TW(α)Δ + γE with E ∼ N(0, V), define the penalty:

pen(K) = A (n − K − 1)/(n − K − 2) · EDkhi[ K + 2, n − K − 2, exp( −log |S_K^PI| − 2 log(K + 2) ) ]

If κ < 1 and p ≤ min( κn / (2 + log(2) + log(n)), n − 7 ), we get:

E[ ‖E[Y] − Ŷ_K̂‖²_V / γ² ] ≤ C(A, κ) inf_{η ∈ M} { ‖E[Y] − Y*_η‖²_V / γ² + (K_η + 2)(3 + log(n)) }

with C(A, κ) a constant depending on A and κ only.

Based on Baraud et al. (2009).


SLIDE 40

Turtles Dataset

[Figure: turtles phylogeny (Jaffe et al., 2011). Colors: habitats (Freshwater, Island, Mainland, Saltwater). Boxes: selected EM regimes. Pictured species: Chelonia mydas, Geochelone nigra abingdoni, Chitra indica.]

                  Habitat    EM
No. of shifts     16         5
No. of regimes    4          6
lnL               133.86     97.59
ln 2/α (%)        7.44       5.43
σ²/2α             0.33       0.22
CPU t (min)       65.25      134.49



SLIDE 44

Conclusion and Perspectives

A general inference framework for trait evolution models.

Conclusions:
- Some problems of identifiability arise.
- An EM algorithm can be written to maximize the likelihood.
- Adaptation of model selection results to a non-iid framework.

R code available on GitHub:

https://github.com/pbastide/Phylogenetic-EM

Perspectives:
- Multivariate traits.
- Deal with uncertainty (tree, data).
- Use fossil records.


SLIDE 45

Bibliography

- Y. Baraud, C. Giraud, and S. Huet. Gaussian Model Selection with an Unknown Variance. The Annals of Statistics, 37(2):630–672, 2009.
- J.-P. Baudry, C. Maugis, and B. Michel. Slope Heuristics: Overview and Implementation. Statistics and Computing, 22(2):455–470, 2012.
- V. Brault, J.-P. Baudry, C. Maugis, and B. Michel. capushe: Capushe, Data-Driven Slope Estimation and Dimension Jump. R package version 1.0, 2012.
- J. Felsenstein. Phylogenies and the Comparative Method. The American Naturalist, 125(1):1–15, 1985.
- J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Sunderland, USA, 2004.
- A. L. Jaffe, G. J. Slater, and M. E. Alfaro. The Evolution of Island Gigantism and Body Size Variation in Tortoises and Turtles. Biology Letters, 7(4):558–561, 2011.
- P. Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin Heidelberg, 2007.
- J. C. Uyeda and L. J. Harmon. A Novel Bayesian Method for Inferring and Interpreting the Dynamics of Adaptive Landscapes from Phylogenetic Comparative Data. Systematic Biology, 63(6):902–918, 2014.

Photo credits:
- "Parrot-beaked Tortoise Homopus areolatus CapeTown 8" by Abu Shawka (own work), CC0, via Wikimedia Commons.
- "Leatherback sea turtle Tinglar, USVI (5839996547)" by U.S. Fish and Wildlife Service Southeast Region, uploaded by AlbertHerring, CC BY 2.0, via Wikimedia Commons.
- "Hawaii turtle 2" by Brocken Inaglory, CC BY-SA 3.0, via Wikimedia Commons.
- "Dudhwalive chitra" by Krishna Kumar Mishra (own work), CC BY 3.0, via Wikimedia Commons.
- "Lonesome George in profile" by Mike Weston (Flickr: Lonesome George 2), CC BY 2.0, via Wikimedia Commons.
- "Florida Box Turtle Digon3a" by Jonathan Zander (Digon3), derivative work: Materialscientist.

SLIDE 46

Thank you for listening

SLIDE 47

References Inference Identifiability Issues Simulations Results Multivariate

Appendices

5. Inference: Model Selection
6. Identifiability Issues: Cardinal of Equivalence Classes; Number of Tree-Compatible Clusterings
7. Simulations Results
8. Multivariate: Models; Inference


SLIDE 48

Model Selection with Unknown Variance

Theorem (Baraud et al., 2009). Under the following setting: Y′ = E[Y′] + γE′ with E′ ∼ N(0, I_n), and S′ = {S′_η, η ∈ M}. If D_η = dim(S′_η), N_η = n − D_η ≥ 7, max(L_η, D_η) ≤ κn with κ < 1, and:

Ω′ = Σ_{η ∈ M} (D_η + 1) e^{−L_η} < +∞

If:

η̂ = argmin_{η ∈ M} ‖Y′ − Ŷ′_η‖² ( 1 + pen(η)/N_η )

with:

pen(η) = pen_{A,L}(η) = A (N_η/(N_η − 1)) EDkhi[ D_η + 1, N_η − 1, e^{−L_η} ],  A > 1

Then:

E[ ‖E[Y′] − Ŷ′_η̂‖² / γ² ] ≤ C(A, κ) [ inf_{η ∈ M} ( ‖E[Y′] − Y′_η‖² / γ² + max(L_η, D_η) ) + Ω′ ]

SLIDE 49

IID Framework (α = 0)

Assume K_η = D_η − 1 ≤ p − 1 ≤ n − 8 for all η ∈ M. Then:

Ω′ = Σ_{η ∈ M} (D_η + 1) e^{−L_η} = Σ_{η ∈ M} (K_η + 2) e^{−L_η}
   = Σ_{K=0}^{p−1} |S_K^PI| (K + 2) e^{−L_K}
   = Σ_{K=0}^{p−1} |S_K^PI| (K + 2) e^{−( log |S_K^PI| + 2 log(K + 2) )}
   = Σ_{K=0}^{p−1} 1/(K + 2) ≤ log(p) ≤ log(n)

And:

L_K ≤ log C(n + m − 1, K) + 2 log(K + 2) ≤ K log(n + m − 1) + 2(K + 1) ≤ p (2 + log(2n − 2))

Hence, if p ≤ min( κn / (2 + log(2) + log(n)), n − 7 ), then max(L_η, D_η) ≤ κn for any η ∈ M.

SLIDE 50

Non-IID Framework (α = 0)

Cholesky decomposition: V = LLᵀ,  Y′ = L⁻¹Y,  s′ = L⁻¹s,  E′ = L⁻¹E

Then Y′ = E[Y′] + γE′ with E′ ∼ N(0, I_n), and S′_η = L⁻¹S_η.

Ŷ′_η = Proj_{S′_η} Y′ = argmin_{a′ ∈ S′_η} ‖Y − La′‖²_V = L⁻¹ Ŷ_η

‖E[Y] − Ŷ_η̂‖²_V = ‖E[Y′] − Ŷ′_η̂‖²,  ‖Y − Ŷ_η‖²_V = ‖Y′ − Ŷ′_η‖²

Crit_MC(η) = ‖Y′ − Ŷ′_η‖² ( 1 + pen_{A,L}(η)/N_η ) = ‖Y − Ŷ_η‖²_V ( 1 + pen_{A,L}(η)/N_η )
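The whitening step can be checked numerically: with Y′ = L⁻¹Y, the Mahalanobis norm ‖Y − a‖²_V = (Y − a)ᵀ V⁻¹ (Y − a) equals the Euclidean norm of the whitened residual. A sketch with a random positive-definite V (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed tip covariance V: any symmetric positive-definite matrix works.
A = rng.normal(size=(4, 4))
V = A @ A.T + 4 * np.eye(4)

L = np.linalg.cholesky(V)      # V = L L^T
Y = rng.normal(size=4)
a = rng.normal(size=4)         # a candidate fitted value on the original scale

# Whitened versions: Y' = L^{-1} Y, a' = L^{-1} a.
Yp = np.linalg.solve(L, Y)
ap = np.linalg.solve(L, a)

mahalanobis = (Y - a) @ np.linalg.solve(V, Y - a)   # ||Y - a||_V^2
euclidean = np.sum((Yp - ap) ** 2)                  # ||Y' - a'||^2
print(mahalanobis, euclidean)
```

Equality holds because V⁻¹ = L⁻ᵀ L⁻¹, which is what lets the iid theorem of the previous slide be applied.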

SLIDE 51

Cardinal of Equivalence Classes

Initialization, for the tips: S_i(k) = 0 if k is the tip's color and ∞ otherwise; T_i(k) = 1 if k is the tip's color and 0 otherwise.

Propagation, for a node i with daughters i_1, …, i_L and color k:

K^l_k = argmin_{1 ≤ p ≤ K} { S_{i_l}(p) + 1{p ≠ k} }

S_i(k) = Σ_{l=1}^{L} [ S_{i_l}(p_l) + 1{p_l ≠ k} ],  for any (p_1, …, p_L) ∈ K^1_k × ⋯ × K^L_k

T_i(k) = Σ_{(p_1, …, p_L) ∈ K^1_k × ⋯ × K^L_k} ∏_{l=1}^{L} T_{i_l}(p_l) = ∏_{l=1}^{L} Σ_{p_l ∈ K^l_k} T_{i_l}(p_l)

Termination: sum over the root vector.

[Figure: cost vectors S and count vectors T propagated from the tips to the root on a small example.]

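The propagation rules amount to a Sankoff-like recursion carrying, for each node and color, the minimal number of shifts S and the number of minimizing allocations T. A toy Python version (tree encoding and function name are ours; the talk's implementation is in R):

```python
INF = float("inf")

def count_allocations(tree, K):
    """Return (S, T) indexed by root color: minimal number of changes and
    number of minimizing allocations of the subtree.

    `tree` is either an int tip color in {0, ..., K-1} or a list of subtrees.
    """
    if isinstance(tree, int):                  # tip: its color is observed
        S = [0 if k == tree else INF for k in range(K)]
        T = [1 if k == tree else 0 for k in range(K)]
        return S, T
    S = [0] * K
    T = [1] * K
    for child in tree:
        Sc, Tc = count_allocations(child, K)
        for k in range(K):
            # cost of coloring the child p under a parent colored k: +1 if p != k
            costs = [Sc[p] + (p != k) for p in range(K)]
            best = min(costs)
            S[k] += best
            T[k] *= sum(Tc[p] for p in range(K) if costs[p] == best)
    return S, T

# 3-tip tree ((A, B), C) with colors A = B = 0, C = 1, and K = 2 colors.
S, T = count_allocations([[0, 0], 1], K=2)
m = min(S)
total = sum(t for s, t in zip(S, T) if s == m)  # sum over the root vector
print(m, total)
```

On this 3-tip example one shift suffices, and two parsimonious allocations produce the coloring: the shift may sit above the cherry or above the lone tip.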

SLIDE 55

Linking Shifts and Clustering

Assumption ("No Homoplasy"): 1 shift = 1 new color.

[Figure: an allocation where two shifts carry the same color, so that the No Homoplasy hypothesis is not respected.]

Proposition: "K shifts ⟺ K + 1 clusters".

SLIDE 58

Definitions

T a rooted tree with n tips.
N^(T)_K = |C_K|: the number of possible partitions of the tips into K tree-compatible clusters.
A^(T)_K: the number of possible marked partitions.

Example: partitions into two groups for a binary tree T3 with 3 tips. Difference between N^(T3)_2 and A^(T3)_2:
- N^(T3)_2 = 3: partitions 1 and 2 are equivalent.
- A^(T3)_2 = 4: one marked color ("white = ancestral state").


SLIDE 59

General Formula (Binary Case)

If T is a binary tree, consider T_ℓ and T_r its left and right sub-trees. Then:

N^(T)_K = Σ_{k1+k2=K} N^(Tℓ)_{k1} N^(Tr)_{k2} + Σ_{k1+k2=K+1} A^(Tℓ)_{k1} A^(Tr)_{k2}

A^(T)_K = Σ_{k1+k2=K} [ A^(Tℓ)_{k1} N^(Tr)_{k2} + N^(Tℓ)_{k1} A^(Tr)_{k2} ] + Σ_{k1+k2=K+1} A^(Tℓ)_{k1} A^(Tr)_{k2}

We get:

N^(T)_{K+1} = N^(n)_{K+1} = C(2n − 2 − K, K)  and  A^(T)_{K+1} = A^(n)_{K+1} = C(2n − 1 − K, K)
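The binary recursion can be cross-checked against the closed forms N^(n)_{K+1} = C(2n − 2 − K, K) and A^(n)_{K+1} = C(2n − 1 − K, K). A Python sketch on a 4-tip balanced tree (tree encoding and function name are ours):

```python
from math import comb

def NA(tree):
    """Return (N, A) as lists indexed by the number of clusters K."""
    if tree == "leaf":
        return [0, 1], [0, 1]          # a single tip forms one cluster
    (Nl, Al), (Nr, Ar) = NA(tree[0]), NA(tree[1])
    size = len(Nl) + len(Nr) - 1
    N, A = [0] * size, [0] * size
    for k1 in range(1, len(Nl)):
        for k2 in range(1, len(Nr)):
            # k1 + k2 = K terms: the two root colors differ or not
            N[k1 + k2] += Nl[k1] * Nr[k2]
            A[k1 + k2] += Al[k1] * Nr[k2] + Nl[k1] * Ar[k2]
            # k1 + k2 = K + 1 terms: the marked colors merge at the root
            N[k1 + k2 - 1] += Al[k1] * Ar[k2]
            A[k1 + k2 - 1] += Al[k1] * Ar[k2]
    return N, A

# Balanced binary tree with n = 4 tips.
tree4 = (("leaf", "leaf"), ("leaf", "leaf"))
N, A = NA(tree4)
n = 4
closed_N = [comb(2 * n - 2 - K, K) for K in range(n)]   # N_{K+1}
print(N, A, closed_N)
```

Both lists match the closed forms, e.g. N = (1, 5, 6, 1) clusterings into 1–4 groups for n = 4.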

SLIDE 60

Recursion Formula (General Case)

If a node defines a tree T with p daughter sub-trees T_1, …, T_p, then we get the following recursion formulas:

N^(T)_K = Σ_{k1+⋯+kp=K, ki≥1} ∏_{i=1}^{p} N^(Ti)_{ki} + Σ_{I ⊆ {1,…,p}, |I|≥2} Σ_{k1+⋯+kp=K+|I|−1, ki≥1} ∏_{i∈I} A^(Ti)_{ki} ∏_{i∉I} N^(Ti)_{ki}

A^(T)_K = Σ_{I ⊆ {1,…,p}, |I|≥1} Σ_{k1+⋯+kp=K+|I|−1, ki≥1} ∏_{i∈I} A^(Ti)_{ki} ∏_{i∉I} N^(Ti)_{ki}

No general closed formula: the result depends on the topology of the tree.

SLIDE 61

Simulations Design

(Uyeda and Harmon, 2014)

- Topology of the tree fixed (unit height, λ = 0.1, with 64, 128, 256 taxa).
- Initial optimal value fixed: β0 = 0.
- One "base" scenario: α_b = 3, γ²_b = 0.5, K_b = 5.
- α ∈ log(2)/{0.01, 0.05, 0.1, 0.2, 0.23, 0.3, 0.5, 0.75, 1, 2, 10}.
- γ² ∈ {0.3, 0.6, 3, 6, 12, 18, 30, 60, 150}/(2α_b).
- K ∈ {0, 1, 2, 3, 4, 5, 8, 11, 16}.
- Shift values ∼ (1/2) N(4, 1) + (1/2) N(−4, 1).
- Shifts randomly placed at regular intervals separated by 0.1 unit length.
- n = 200 repetitions: 16,200 configurations.
- CPU time on the MIGALE cluster (Jouy-en-Josas): α known, 6 minutes per estimation (66 days in total); α unknown, 52 minutes per estimation (570 days in total).

CA, PB, MM, SR Change-point Detection on a Tree 12/28

slide-62
SLIDE 62


Log-Likelihood

[Figure: log-likelihood against t1/2 = ln(2)/α, γ², and K, with α known vs. estimated.]

Log-likelihood for a tree with 256 tips. Solid black dots mark the median log-likelihood at the true parameters.

slide-63
SLIDE 63


Number of Shifts

[Figure: selected number of shifts K̂ against t1/2 = ln(2)/α, γ², and K, for ntaxa = 64, 128, 256, with α known vs. estimated.]

slide-64
SLIDE 64


One Example

[Figure: one example of simulated and estimated shift configurations on the tree, with shift values annotated.]

slide-65
SLIDE 65


Adjusted Rand Index

[Figure: adjusted Rand index (ARI) between true and estimated clusterings against t1/2 = ln(2)/α, γ², and K, for ntaxa = 64, 128, 256, with K and α known vs. estimated.]

slide-66
SLIDE 66


Parameters: β0

[Figure: estimation error β̂0 − β0 against t1/2 = ln(2)/α, γ², and K, for ntaxa = 64, 128, 256, with K and α known vs. estimated.]

slide-67
SLIDE 67


Parameters: α

[Figure: relative error (t̂1/2 − t1/2)/t1/2 against t1/2 = ln(2)/α, γ², and K, for ntaxa = 64, 128, 256, with K known vs. estimated.]

slide-68
SLIDE 68


Parameters: γ2

[Figure: relative error (γ̂² − γ²)/γ² against t1/2 = ln(2)/α, γ², and K, for ntaxa = 64, 128, 256, with K and α known vs. estimated.]

slide-69
SLIDE 69


BM Model

Data: n vectors of p traits at the tips, Yi = (Yi1, …, Yip)ᵀ.
SDE: dW(t) = Σ dBt, with rate matrix R = ΣΣᵀ (p × p).
Covariances: Cov[Yil; Yjq] = tij Rlq for tips i, j and traits l, q, hence Var[vec(Y)] = Cn ⊗ R.
Shifts: K shifts δ1, …, δK, each a vector of size p → all traits shift at the same time.
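The Kronecker structure Var[vec(Y)] = Cn ⊗ R is easy to materialize. A stdlib-only sketch (the 3-tip tree and trait correlation are made-up illustration values); vec(Y) lists the p traits of tip 1, then tip 2, and so on:

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of lists."""
    p, q = len(B), len(B[0])
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(q)]
            for i in range(len(A)) for k in range(p)]

# Shared-time matrix C_n for the 3-tip tree ((a:1, b:1):1, c:2):
# a and b share 1 unit of history, c shares none; tree height is 2.
C = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 2.0]]
# Trait rate matrix R = Sigma Sigma^T for p = 2 correlated traits.
R = [[1.0, 0.5],
     [0.5, 1.0]]

# 6 x 6 covariance of vec(Y): block (i, j) equals t_ij * R.
V = kron(C, R)
```

Entry [i·p + l][j·q + q-index] of V is C[i][j]·R[l][q], i.e. exactly Cov[Yil; Yjq] = tij Rlq from the slide.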

slide-70
SLIDE 70


BM Model

Linear model representation: vec(Y) = vec(∆Tᵀ) + E, with E ∼ N(0, V = Cn ⊗ R), where T is the incidence matrix of the tree and ∆ collects the shift vectors.
Incomplete-data representation: each node is Gaussian given its parent, e.g. Y3 | Z2 ∼ N(Z2 + δ, ℓ7R).
[Figure: tree with tips Y1, …, Y5, internal nodes Z1, …, Z4, and a branch of length ℓ7 carrying the shift δ.]

slide-71
SLIDE 71


OU Model: General Case

Data: n vectors of p traits at the tips, Yi = (Yi1, …, Yip)ᵀ.
SDE: dW(t) = −A(W(t) − β(t)) dt + Σ dBt, with A a p × p "selection strength" matrix.
Covariances:

$$\operatorname{Cov}[X_i; X_j] = e^{-A t_i}\, \Gamma\, e^{-A^T t_j} + e^{-A (t_i - t_{ij})} \left( \int_0^{t_{ij}} e^{-A v}\, \Sigma \Sigma^T\, e^{-A^T v}\, dv \right) e^{-A^T (t_j - t_{ij})}$$

Shifts: K shifts δ1, …, δK, each a vector of size p → on the optimal values β(t).

slide-72
SLIDE 72


OU Model: A scalar

Assumption: A = αIp ("scalar" OU).
Stationary state: S = R/(2α).
Fixed root: for tips i, j and traits l, q,

$$\operatorname{Cov}[Y_{il}; Y_{jq}] = \frac{1}{2\alpha}\, e^{-2\alpha h} \left( e^{2\alpha t_{ij}} - 1 \right) R_{lq}$$

→ the model can be reduced to a BM on a re-scaled tree.
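Under the scalar assumption the OU covariance is a deterministic transform of the shared times, so one can precompute a re-scaled tree and reuse any BM machinery on it. A sketch (the function name is ours), with the sanity check that the transform tends to tij as α → 0, where the OU degenerates to a BM:

```python
from math import exp

def rescaled_time(t_ij: float, alpha: float, h: float) -> float:
    """Map a shared time t_ij on a tree of height h to the BM scale.

    For the scalar OU (A = alpha * I_p) with a fixed root:
    Cov[Y_il; Y_jq] = rescaled_time(t_ij, alpha, h) * R_lq.
    """
    return exp(-2 * alpha * h) * (exp(2 * alpha * t_ij) - 1) / (2 * alpha)

# As alpha -> 0 the factor tends to t_ij (BM limit).
print(rescaled_time(1.5, 1e-8, 2.0))   # ~ 1.5
# Strong selection shrinks the covariance contributed by early history.
print(rescaled_time(0.1, 5.0, 2.0))
```

The diagonal case tij = h gives (1 − e^{−2αh})/(2α), which is bounded by the stationary variance factor 1/(2α), matching the stationary state on the slide.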

slide-73
SLIDE 73


EM algorithm

BM: natural generalization of the univariate case.
OU: M step intractable in general.
Incomplete-data model: can readily handle missing data.

slide-74
SLIDE 74


Model Selection

The previous (univariate) criterion cannot be applied. Solution: a method based on the "slope heuristic".
Massart (2007): oracle inequality with known variance; penalty known up to a multiplicative constant.
Baudry et al. (2012): slope-heuristic method to calibrate that constant.
Implemented in capushe (Brault et al., 2012).
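The slope heuristic itself is simple to sketch: regress the maximized log-likelihood on the penalty shape over the most complex models, then penalize with twice the fitted slope. A toy stdlib version (a caricature of what capushe does, with made-up numbers and our own function name):

```python
def select_by_slope_heuristic(pen_shape, loglik, n_fit=6):
    """Pick a model index via the slope heuristic.

    pen_shape: penalty shape pen_shape(m) per model (increasing).
    loglik:    maximized log-likelihood per model.
    For overly complex models the log-likelihood grows roughly
    linearly in pen_shape; the fitted slope s calibrates the
    penalty 2 * s * pen_shape(m).
    """
    xs, ys = pen_shape[-n_fit:], loglik[-n_fit:]
    xbar, ybar = sum(xs) / n_fit, sum(ys) / n_fit
    s = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    crit = [-l + 2 * s * p for l, p in zip(loglik, pen_shape)]
    return min(range(len(crit)), key=crit.__getitem__)

# Toy profile: big gains up to the 4th model, then +1 per extra unit.
pen = list(range(1, 11))
ll = [-100, -60, -40, -35, -34, -33, -32, -31, -30, -29]
print(select_by_slope_heuristic(pen, ll))  # -> 3, i.e. pen_shape = 4
```

The fitted slope on the linear tail is 1, so the criterion −loglik + 2·pen bottoms out where the likelihood gain per unit of penalty shape drops below the calibrated slope.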

slide-75
SLIDE 75


Model Selection: Toy Example

[Figure: trait values along the tree for three traits, with the simulated shift values annotated.]

Figure: Simulated Process.

slide-76
SLIDE 76


Model Selection: Toy Example

[Figure: capushe diagnostic plots — the contrast −γn(ŝm) against the penalty shape penshape(m) with the regression line computed on 6 points, the successive slope values, and the models selected for each number of points used in the regression.]

Figure: capushe output for the penalized log-likelihood.

slide-77
SLIDE 77


Model Selection: Toy Example

[Figure: estimated trait values along the tree for three traits, with the estimated shift values annotated.]

Figure: Reconstructed Process.