[PPT] - Which is more useful? Reality Detailed map Detailed public PowerPoint Presentation

SLIDE 1

Which is more useful?

“Reality”, Detailed map, Detailed public transportation, Simplified metro

Saturday, July 22, 17

SLIDE 2

Models don’t need to reflect reality

A model is an intentional simplification of a complex situation designed to eliminate extraneous detail in order to focus attention on the essentials of the situation. (Daniel L. Hartl, 2000)

"The most that can be expected from any model is that it can supply a useful approximation to reality: All models are wrong; some models are useful". (George E. P. Box, 1987)

A model is a simplification or approximation of reality and hence will not reflect all of reality ... While a model can never be “truth,” a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless. (Burnham and Anderson, 2002)

Model selection is a process of seeking the least inadequate model from a predefined set, all of which may be grossly inadequate as a representation of reality. (J. J. Welch, 2006)


SLIDE 4

Why do models matter?

Model-based methods including ML and Bayesian inference (typically) make a consistent estimate of the phylogeny (estimate converges to true tree as number of sites increases toward infinity)

... even when you’re in the “Felsenstein Zone” (Felsenstein, 1978)

[Figure: four-taxon tree with tips labeled A, C, B, D]


SLIDE 5

In the Felsenstein Zone

[Figure: proportion of correct trees (0.25 to 1.00) vs. sequence length (2,500 to 10,000) for parsimony and ML-GTR]

Simulation model = GTR


SLIDE 10

Why do models matter (continued)?

Parsimony is inconsistent in the Felsenstein zone (and other scenarios)

Likelihood is consistent in any “zone” (when certain requirements are met)

But this guarantee requires that the model be specified correctly! Likelihood can also be inconsistent if the model is oversimplified

Real data always evolve according to processes more complex than any computationally feasible model would permit, so we have to choose “good” rather than “correct” models


SLIDE 14

What is a “good” model?

A model that appropriately balances fit of the data with simplicity (parsimony, in a different sense), i.e., if a simpler model fits the data almost as well as a more complex model, prefer the simpler one

[Figure: the same eight data points (y vs. x) fitted with a straight line and with a degree-7 polynomial]

Linear fit: $y = 1.30 + 0.965x$ ($r^2 = 0.963$)

Degree-7 polynomial fit: $y = -330 + 134x - 15.5x^2 + 0.816x^3 - 0.0225x^4 + 0.000335x^5 - 0.00000255x^6 + 0.00000000777x^7$ ($r^2 = 1.000$)
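The contrast between the two fits can be reproduced with a minimal sketch. The eight data points below are hypothetical (the slide's actual values are not recoverable), and the function names are illustrative only. A degree-7 polynomial through eight points with distinct x values is simply the interpolating polynomial, so it is built here by Lagrange interpolation rather than by least squares.

```python
# Hypothetical data: eight noisy points scattered around y = x.
XS = [20, 30, 40, 50, 60, 70, 80, 90]
YS = [22, 27, 43, 46, 62, 66, 83, 85]


def linear_fit(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial passing through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def r_squared(model, xs, ys):
    """Coefficient of determination of `model` on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


line = linear_fit(XS, YS)
poly = interpolating_polynomial(XS, YS)
print("r^2 linear:    ", round(r_squared(line, XS, YS), 3))
print("r^2 degree-7:  ", round(r_squared(poly, XS, YS), 3))
# Predictions just outside the observed range show the cost of over-fitting.
print("linear at x=100:  ", round(line(100), 1))
print("degree-7 at x=100:", round(poly(100), 1))
```

The polynomial reproduces every observed point exactly, yet its prediction one step beyond the data is far from anything the data suggest, whereas the two-parameter line stays close to the trend.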


SLIDE 18

What is a “good” model?

Parsimony in statistics represents a tradeoff between bias and variance as a function of the dimension of the model. A good model is a balance between under- and over-fitting. (Burnham and Anderson, 1998)

[Figure: linear fit ($r^2 = 0.963$) vs. degree-7 polynomial fit ($r^2 = 1.000$) to the same eight data points]


SLIDE 19

Why models don’t have to be perfect

Assertion: In most situations, phylogenetic inference is relatively robust to model misspecification, as long as critical factors influencing sequence evolution are accommodated

Caveat: There are some kinds of model misspecification that are very difficult to overcome (e.g., “heterotachy”)

E.g.: Likelihood can be consistent in the Felsenstein zone, but will be inconsistent if a single set of branch lengths is assumed when there are actually two sets of branch lengths (Chang 1996) (“heterotachy”)

[Figure: two four-taxon trees (A, B, C, D) with different branch lengths, one applying to half of the sites and one to the other half]


SLIDE 20

GTR Family of Reversible DNA Substitution Models

GTR (general time-reversible)
SYM
TrN (Tamura-Nei)
K3ST (Kimura 3-subst. type)
HKY85 (Hasegawa-Kishino-Yano) / F84 (Felsenstein)
K2P (Kimura 2-parameter)
F81 (Felsenstein)
JC (Jukes-Cantor)

Restrictions leading from each model to its special cases:

– GTR to SYM: equal base frequencies
– GTR to TrN: 3 substitution types (transversions, 2 transition classes)
– SYM to K3ST: 3 substitution types (transitions, 2 transversion classes)
– TrN to HKY85/F84: 2 substitution types (transitions vs. transversions)
– K3ST to K2P: 2 substitution types (transitions vs. transversions)
– HKY85/F84 to K2P: equal base frequencies
– HKY85/F84 to F81: single substitution type
– K2P to JC: single substitution type
– F81 to JC: equal base frequencies
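To make the nesting concrete, here is a minimal sketch of how a GTR rate matrix can be assembled from six exchangeabilities and four base frequencies, with the simpler models obtained by constraining those inputs. The function name, the dictionary layout, and the normalization to a mean rate of 1 are illustrative choices, not something taken from the slides.

```python
BASES = "ACGT"
PAIRS = ["AC", "AG", "AT", "CG", "CT", "GT"]  # unordered base pairs


def gtr_rate_matrix(exchangeabilities, freqs):
    """Build a reversible rate matrix Q with q[i][j] = r_ij * pi_j (i != j).

    exchangeabilities: dict mapping each pair in PAIRS to a relative rate.
    freqs: dict mapping each base to its equilibrium frequency (sums to 1).
    The matrix is rescaled so the expected substitution rate is 1.
    """
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                pair = "".join(sorted(i + j))
                q[i][j] = exchangeabilities[pair] * freqs[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
    for i in BASES:
        for j in BASES:
            q[i][j] /= mean_rate
    return q


# JC: a single substitution type and equal base frequencies.
jc = gtr_rate_matrix({p: 1.0 for p in PAIRS}, {b: 0.25 for b in BASES})

# HKY85-style: transitions (A<->G, C<->T) share one rate, transversions another.
kappa = 4.0  # hypothetical transition/transversion rate ratio
hky_rates = {p: (kappa if p in ("AG", "CT") else 1.0) for p in PAIRS}
hky = gtr_rate_matrix(hky_rates, {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})

print(round(jc["A"]["C"], 4), round(hky["A"]["G"], 4), round(hky["A"]["C"], 4))
```

Setting all exchangeabilities equal and all frequencies to 0.25 gives Jukes-Cantor; relaxing one constraint at a time walks back up the family toward GTR.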


SLIDE 21

Among site rate heterogeneity

Proportion of invariable sites

– Some sites extremely unlikely to change due to strong functional or structural constraint (Hasegawa et al., 1985)

Gamma-distributed rates

– Rate variation assumed to follow a gamma distribution with shape parameter α

Site-specific rates (another way to model ASRV)

– Different relative rates assumed for pre-assigned subsets of sites

[Figure: example alignment divided into subsets of sites]

Lemur  AAGCTTCATAG  TTGCATCATCCA  …TTACATCATCCA
Homo   AAGCTTCACCG  TTGCATCATCCA  …TTACATCCTCAT
Pan    AAGCTTCACCG  TTACGCCATCCA  …TTACATCCTCAT
Goril  AAGCTTCACCG  TTACGCCATCCA  …CCCACGGACTTA
Pongo  AAGCTTCACCG  TTACGCCATCCT  …GCAACCACCCTC
Hylo   AAGCTTTACAG  TTACATTATCCG  …TGCAACCGTCCT
Maca   AAGCTTTTCCG  TTACATTATCCG  …CGCAACCATCCT


SLIDE 22

Modeling ASRV with gamma distribution

…can also include a proportion of “invariable” sites (pinv)

[Figure: gamma distributions of relative rates (x-axis: rate; y-axis: frequency) for α = 0.5, α = 2, α = 50, and α = 200]
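A small simulation can illustrate what the shape parameter does. This is a sketch under stated assumptions: the gamma distribution is parameterized with shape α and scale 1/α so that the mean relative rate is 1 (and the variance is 1/α), invariable sites are given rate 0 with probability pinv, and no rescaling is applied after mixing in the invariable sites. The function name and seed are illustrative.

```python
import random


def draw_site_rates(alpha, pinv, n_sites, seed=1):
    """Draw relative rates for n_sites under a gamma + invariable-sites model.

    Each site is invariable (rate 0) with probability pinv; otherwise its rate
    is drawn from a gamma distribution with shape alpha and scale 1/alpha.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < pinv:
            rates.append(0.0)
        else:
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates


def mean_and_variance(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


for alpha in (0.5, 2.0, 50.0, 200.0):
    mean, var = mean_and_variance(draw_site_rates(alpha, 0.0, 20000))
    # Small alpha: strong rate heterogeneity; large alpha: nearly equal rates.
    print(f"alpha={alpha:>5}: mean={mean:.2f}, variance={var:.3f}")
```

Small α concentrates most sites near rate 0 with a few fast sites, while large α approaches the equal-rates case.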


SLIDE 23

Performance of ML when its model is violated

[Figure: proportion of correct trees (0.1 to 1) vs. sequence length (100 to 10,000) for ML under GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, and HKYer, and for parsimony; panels include α = 0.5, pinv = 0.5; α = 1.0, pinv = 0.5; and α = 1.0, pinv = 0.2]


SLIDE 24

“MODERATE”–Felsenstein zone

α = 1.0, pinv=0.5

[Figure: proportion of correct trees (0.1 to 1) vs. sequence length (100 to 100,000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G, and parsimony]


SLIDE 25

“MODERATE”–Inverse-Felsenstein zone

[Figure: proportion of correct trees (0.1 to 1) vs. sequence length (100 to 100,000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G, and parsimony]


SLIDE 26

Likelihood ratio tests

$\delta = -2(\ln L_0 - \ln L_1)$

If model L0 is nested within model L1, δ is (asymptotically) distributed as χ² with degrees of freedom equal to the difference in the number of free parameters

Model selection criteria


SLIDE 28

[Figure: histogram of $\delta = -2(\ln L_0 - \ln L_1)$ for JC vs K80 models, with the χ² 0.05 and 0.01 critical values (3.84 and 6.64) marked]
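As a worked illustration of the test, the sketch below computes δ and its χ² tail probability for the one-degree-of-freedom case (JC vs K80 differ by a single free parameter). The log-likelihood values are hypothetical. For 1 degree of freedom the χ² tail probability equals erfc(√(δ/2)), which avoids any dependency beyond the standard library; other degrees of freedom would need a general χ² routine.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """delta = -2 (ln L0 - ln L1), where model 0 is nested within model 1."""
    return -2.0 * (lnl_null - lnl_alt)


def chi2_tail_1df(delta):
    """P(X >= delta) for X ~ chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(delta / 2.0))


# Hypothetical maximized log-likelihoods for the same data and tree.
lnl_jc = -5230.0   # null model (JC)
lnl_k80 = -5215.5  # alternative model (K80), one extra free parameter

delta = lrt_statistic(lnl_jc, lnl_k80)
p_value = chi2_tail_1df(delta)
print(f"delta = {delta:.1f}, p = {p_value:.2e}")
print("reject JC at 0.05:", delta > 3.84)
print("reject JC at 0.01:", delta > 6.64)
```

With these made-up numbers δ far exceeds both critical values, so the simpler model would be rejected in favor of K80.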


SLIDE 34

Akaike information criterion (AIC)
AICc (corrected AIC)
Bayesian information criterion (BIC)

Model selection criteria

$\mathrm{AIC}_i = -2 \ln L_i + 2K$

where K is the number of free parameters estimated

$\mathrm{BIC}_i = -2 \ln L_i + K \ln n$

where K is the number of free parameters estimated and n is the “sample size” (typically number of sites)
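The criteria are easy to compare side by side. In this sketch the log-likelihoods, the candidate set, and the alignment length are hypothetical; K counts only substitution-model parameters, on the assumption that branch lengths are shared by all candidates. The AICc formula used, AIC + 2K(K+1)/(n − K − 1), is the standard small-sample correction and is not spelled out on the slide.

```python
import math


def aic(lnl, k):
    return -2.0 * lnl + 2.0 * k


def aicc(lnl, k, n):
    # Standard small-sample correction (assumed; not shown on the slide).
    return aic(lnl, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(lnl, k, n):
    return -2.0 * lnl + k * math.log(n)


# Hypothetical candidates: name -> (maximized ln L, free parameters K).
CANDIDATES = {
    "JC": (-5230.0, 0),
    "HKY+G": (-5105.0, 5),
    "GTR+I+G": (-5098.0, 10),
}
N_SITES = 1000  # "sample size" n


def best_by(score):
    """Name of the candidate with the lowest score(lnl, k)."""
    return min(CANDIDATES, key=lambda name: score(*CANDIDATES[name]))


print("AIC  prefers:", best_by(aic))
print("AICc prefers:", best_by(lambda lnl, k: aicc(lnl, k, N_SITES)))
print("BIC  prefers:", best_by(lambda lnl, k: bic(lnl, k, N_SITES)))
```

Because ln n exceeds 2 for any alignment longer than about 7 sites, BIC charges more per parameter than AIC, so the two criteria can disagree on the same likelihoods, as they do in this example.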


SLIDE 35

AIC vs. BIC

– BIC performs well when true model is contained in model set, and among a set of simple models, AIC often selects a more complex model than the truth (indeed, AIC is formally statistically inconsistent)

– But in phylogenetics, no model is as complex as the truth, and the true model will never be contained in the model set.

– BIC often chooses models that seem too simple, however.


SLIDE 37

Partitioned Models

Many authors have emphasized the importance of modeling heterogeneity among genes or other subsets of the data appropriately

“...data partitioning is more an art than a science, and it should rely on our knowledge of the biological system...”

Yang and Rannala (2012; Nature Rev. Genet. 13:303-314)


SLIDE 38

Ways to partition

By gene
By codon
By gene/codon combination
Stems vs. loops (probably not advisable; e.g., Simon et al., 2006)
Coding vs. noncoding


SLIDE 40

Naive partitioning

Run ModelTest/JModelTest; estimate a model (from the GTR+I+G family) separately for each gene/subset

Perform an ML/Bayesian analysis, assigning the chosen models to each

Too many parameters! 1-10 parameters for each gene; amount of data available to estimate each parameter does not increase


SLIDE 42

Over-Partitioning

Consider the following (contrived) example:

Gene A: HKY+G, π = (0.26, 0.24, 0.23, 0.27), λ=1.1, α=3.0
Gene B: GTR, π = (0.25, 0.24, 0.25, 0.26), (a,b,c,d,e) = (1.1, 1.2, 0.9, 1.1, 0.95)
Gene C: JC+I (pinv=0.05)

These are all GTR models that are not far from the Jukes-Cantor model, but they all have different names

Better to estimate one GTR model (even with 5+3+1+1=10 parameters, estimated from all data) than 3 separate models with 2+5+1=8 parameters (but only one gene’s worth of data for each model)


SLIDE 43

How to find optimal partitionings?

Consider a data set with 3 genes, A, B, and C:

[Figure: the five possible partitioning schemes for genes A, B, and C]

For each partitioning scheme, evaluate some set of models from the GTR+I+G family (e.g., 56 models) according to AIC or BIC

Choose a combination of partitioning scheme and model for subsequent partitioned-model analyses

Rob Lanfear’s PartitionFinder (http://www.robertlanfear.com/partitionfinder/) automates this process; method now also available in PAUP* test versions


SLIDE 44

How many partitionings?

In general, the number of partitionings on n subsets is a “Bell number”

N     Bell number
2     2
3     5
4     15
5     52
6     203
7     877
8     4140
12    4 x 10^6
60    9.8 x 10^59

Obviously, there are too many partitioning schemes to evaluate them all for more than a few subsets.
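The growth in the number of schemes can be checked directly. The sketch below enumerates every partitioning of a small set of genes and computes Bell numbers with the Bell triangle recurrence; the function names are illustrative.

```python
def set_partitions(items):
    """Yield every way of splitting `items` into non-empty blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # `first` either forms its own block...
        yield [[first]] + partial
        # ...or joins one of the existing blocks.
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]


def bell(n):
    """Number of partitions of an n-element set (Bell triangle)."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


for scheme in set_partitions("ABC"):
    print(scheme)

for n in (2, 3, 4, 5, 6, 7, 8, 12, 60):
    print(n, f"{bell(n):.3g}" if n >= 12 else bell(n))
```

Exhaustive enumeration is practical only for a handful of subsets; beyond that, a heuristic search over schemes is needed.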


SLIDE 58

Greedy algorithm when there are too many

[Figure: greedy search over partitioning schemes for four subsets (A, B, C, D); rejected schemes are marked ×]

$1 + n(n^2 - 1)/6 = 11$ schemes (for n = 4)

For 1265 genes, there would still be 337,380,561 schemes to evaluate!

Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695–1701
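The scheme count follows from the structure of the search: one starting scheme with every subset separate, then all pairwise merges of the current blocks at each step until a single block remains. The sketch below is a generic illustration of that idea, not PartitionFinder's implementation. The scoring function is a toy stand-in (hypothetical per-gene rates, squared-deviation misfit plus a per-block penalty) for a real AIC or BIC computed from likelihoods, and the search is run to a single block so that the number of evaluated schemes matches the formula.

```python
from itertools import combinations

# Hypothetical per-gene summary used only by the toy score below.
RATES = {"A": 1.0, "B": 1.1, "C": 3.0, "D": 3.2}


def toy_score(scheme, penalty=0.5):
    """Lower is better: within-block misfit plus a penalty per block."""
    misfit = 0.0
    for block in scheme:
        values = [RATES[gene] for gene in block]
        mean = sum(values) / len(values)
        misfit += sum((v - mean) ** 2 for v in values)
    return misfit + penalty * len(scheme)


def greedy_partition_search(subsets, score):
    """Merge the best-scoring pair of blocks at each step; track the best scheme."""
    scheme = [frozenset([s]) for s in subsets]
    best_scheme, best_score = scheme, score(scheme)
    evaluated = 1  # the fully separated starting scheme
    while len(scheme) > 1:
        candidates = []
        for a, b in combinations(scheme, 2):
            merged = [blk for blk in scheme if blk not in (a, b)] + [a | b]
            candidates.append((score(merged), merged))
            evaluated += 1
        step_score, scheme = min(candidates, key=lambda c: c[0])
        if step_score < best_score:
            best_scheme, best_score = scheme, step_score
    return best_scheme, best_score, evaluated


best, value, count = greedy_partition_search("ABCD", toy_score)
print(sorted(sorted(block) for block in best), round(value, 3), count)

n = 1265
print(1 + n * (n ** 2 - 1) // 6)  # schemes evaluated for 1265 genes
```

For four subsets the search scores 1 + 6 + 3 + 1 = 11 schemes instead of all 15, and the cubic growth of that count explains why even the greedy search becomes expensive for thousands of genes.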


SLIDE 61

How to partition thousands of genes (or other subsets)?

Cluster analysis

Li, Lu, and Orti (2008)

Estimate model parameters on a shared model; similar subsets will have similar parameter estimates and will cluster together.

Problem? Similar models (in the sense of predicting similar site pattern frequencies) can have different parameter MLEs. Must use same model for all subsets.

Lanfear et al. (PartitionFinder2)

Hierarchical (or non-hierarchical k-means) clustering using same idea as Li et al. (very efficient implementation)
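The clustering idea can be sketched with a plain k-means on per-subset parameter estimates. Everything here is illustrative: the gene names and the (α, κ) estimates are hypothetical, the initialization simply takes the first k points, and the published methods differ in which parameters they cluster and how they scale them.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point.

    Initial centers are the first k points (a deterministic, simplistic choice).
    """
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical per-gene estimates under one shared model: (alpha, kappa).
ESTIMATES = {
    "gene1": (0.50, 2.0),
    "gene2": (0.60, 2.1),
    "gene3": (3.00, 8.0),
    "gene4": (3.10, 7.9),
    "gene5": (0.55, 1.9),
}

labels = kmeans(list(ESTIMATES.values()), k=2)
for gene, label in zip(ESTIMATES, labels):
    print(gene, "-> cluster", label)
```

Genes whose estimates fall close together end up in the same cluster and would share one model in the partitioned analysis, which is why the estimates must all come from the same model to be comparable.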

