part II codon substitution models and the analysis of natural - - PDF document

part ii
SMART_READER_LITE
LIVE PREVIEW

part II codon substitution models and the analysis of natural - - PDF document

2015-07-21 part II codon substitution models and the analysis of natural selection pressure Joseph P. Bielawski Department of Biology Department of Mathematics & Statistics Dalhousie University part II model based inference 1. 3


slide-1
SLIDE 1

2015-07-21 1 codon substitution models and the analysis of natural selection pressure

Joseph P. Bielawski Department of Biology Department of Mathematics & Statistics Dalhousie University

part II

1. 3 inference tasks 2. example gene analysis (& experimental validation) 3. MLE instabilities 4. false biological conclusions 5. example evolutionary survey (& experimental validation) 6. closing thoughts part II model based inference

slide-2
SLIDE 2

2015-07-21 2

3 analytical tasks

Task 1. Parameter estimation (e.g., ω) Task 2. Hypothesis testing Task 3. Make predictions (e.g., sites having ω > 1 )

  • 1. three tasks

model based inference

t, κ, ω = unknown constants estimated by ML

π’s = empirical [GY: F3×4 or F61 in Lab]

use a numerical hill-climbing algorithm to maximize the likelihood function

  • 1. analytical task-1

task 1: parameter estimation

slide-3
SLIDE 3

2015-07-21 3

Parameters: t and ω Gene: acetylcholine α receptor

mouse human common ancestor lnL = -2399

  • 1. analytical task-1

parameter estimation How do we know that the estimate is significant?

Task 1. Parameter estimation (e.g., ω) ✔ Task 2. Hypothesis testing Task 3. Prediction / Site identification LRT

  • 1. three tasks
slide-4
SLIDE 4

2015-07-21 4

H0: uniform selective pressure among sites (M0) H1: variable selective pressure among sites (M3)

Compare 2Δl = 2(l1 - l0) with a χ2 distribution

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Model 0

= 0.65

ω ˆ

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Model 3

= 0.01 = 0.90 = 5.55

ω ˆ ω ˆ ω ˆ

LRT No. 1: Does selection pressure vary among sites?

  • 1. analytical task-2

H0: variable selective pressure but NO positive selection (M1) H1: variable selective pressure with positive selection (M2)

Compare 2Δl = 2(l1 - l0) with a χ2 distribution

0.1 0.2 0.3 0.4 0.5 0.6 0.7

Model 1a

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Model 2a

= 0.5 (ω = 1) = 3.25

ω ˆ ω ˆ

(ω = 1) = 0.5

ω ˆ

LRT No. 2: Have some sites evolved under positive selection?

  • 1. analytical task-2
slide-5
SLIDE 5

2015-07-21 5

0.05 0.1 0.15 0.2

0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 frequency

simulated

2Δℓ

Data from: Anisimova, Bielawski, and Yang (2001) MBE 18: 1585-1592 simulated

2 4

χ

the LRT does not follow the χ2 distribution

  • 1. analytical task-2

Number of cases out of 100 for which the null hypothesis was rejected at the α = 1% (5%) significance levels Exp. Simulation LRT Simulation parameters Type I error at α = 1% (5%) Taxa κ ω t N = 100 N = 500 A…. M0 M0 & M3 6 2 0.40 0.11 0 (0) 0 (0) 1.1 0 (0) 0 (0) 11 0 (0) 0 (0) B…. M0 M0 & M3 17 2 0.40 2.11 0 (0) 0 (1) 8.44 0 (0) 0 (1) 16.88 0 (1) 0 (0) C… M0 M0 & M3 5 5 0.25 0.91 0 (0) 0 (0) 9.1 0 (0) 0 (1) 18.2 0 (1) 2 (3) D… M7 M7 & M8 6 2 p = 0.41 q = 1.10 0.11 N/A 0 (0) 1.1 N/A 1 (5) 11 N/A 1 (4)

Data from: Anisimova, Bielawski, and Yang (2001) MBE 18: 1585-1592

the LRT is conservative

NOTE: Here t denotes total tree length (sum of all branch lengths in the tree)

  • 1. analytical task-2
slide-6
SLIDE 6

2015-07-21 6

Power of the LRT: Number of replicates out of 100 in which positive selection was indicated by parameter estimates (P+), or detected by the LRT at the 1% (P+s, 0.01) and 5% (P+s, 0.05, in parentheses) significance levels

Simulation LRT Simulation parameters P+ P+s, 0.01 (0.05) Taxa κ ω distribution t LC = 100 LC = 500 LC = 100 LC = 500 M3 M0 & M3 17 2 ω0 = 0.018, p0 = 0.386 ω1 = 0.304, p1 = 0.535 ω2 = 1.691, p2 = 0.079 0.38 61 80 10 (17) 66 (72) 2.11 93 100 91 (92) 100 (100) 8.44 99 100 99 (99) 100 (100) 16.88 99 99 99 (99) 99 (99) 105.5 31 58 31 (31) 58 (58)

Data from: Anisimova, Bielawski, and Yang (2001) MBE 18: 1585-1592

the LRT can be powerful

NOTE: Here t denotes total tree length (sum of all branch lengths in the tree)

  • 1. analytical task-2

How do we identify the selected sites?

Task 1. Parameter estimation (e.g., ω) ✔ Task 2. Hypothesis testing ✔ Task 3. Prediction / Site identification

Bayes’ rule

  • 1. three tasks
slide-7
SLIDE 7

2015-07-21 7

Which sites have ω > 1 ?

GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC ... ... ... G.C ... ... ... T.. ..T ... ... ... ... ... ... ... ... ... .GC A.. ... ... ... ..C ..T ... ... ... ... A.. ... A.T ... ... .AA ... A.C ... AGC ... ... ..C ... G.A .AT ... ..A ... ... A.. ... AA. TG. ... ..G ... A.. ..T .GC ..T ... ..C ..G GA. ..T ... ... ..T C.. ..G ..A ... AT. ... ..T ... ..G ..A .GC ...

model: 9% have ω > 1 Bayes’ rule: site 4, 12 & 13 structure: sites are in contact

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
  • 1. analytical task-3

Suppose that a population consists of 60% males and 40% females, and a disease occurs at the rate 1% in males and 0.1% in females.

P(D) = P(M)P(D|M) + P(F)P(D|F)

Q1: What is the probability that any individual carries the disease? A1: 0.6 × 0.01 + 0.4 × 0.001 = 0.0064

See Yang and Bielawski (2000) TREE 15:496-503 for a detailed presentation of this example

Bayes’ rule: yet another (silly) example of

  • 1. analytical task-3
slide-8
SLIDE 8

2015-07-21 8 Q2: Given that an individual carries the disease, what is the probability that it is a male? A2: 0.6 × 0.01/0.0064 = 0.94

P(M) P(D|M) P(M|D) = P(D)

Bayes’ rule: yet another (silly) example of

See Yang and Bielawski (2000) TREE 15:496-503 for a detailed presentation of this example

  • 1. analytical task-3

Pr(θ|D) = Pr(D|θ) Pr(θ)

  • θ Pr(D|θ) Pr(θ)
1

Bayes’ rule in statistics

Likelihood of hypothesis θ Prior probability of hypothesis θ Posterior probability

  • f hypothesis θ

Marginal probability

  • f the data (marginalizing
  • ver hypotheses)

From Paul Lewis’ lecture ….

slide-9
SLIDE 9

2015-07-21 9

= 0.03 = 0.40 = 14.1

ω0 ω2 ω1

P(xh) =

i=0 K−1

∑ p(ω i)P(xh |ω i)

  • 1. analytical task-3

identifying selected sites under a codon model

= 0.85 = 0.10 = 0.05

p0 p1 p2

Prior Likelihood Total probability Likelihood of hypothesis (ω2) Prior probability of hypothesis (ω2) Posterior probability of hypothesis (ω2) Marginal probability (Total probability) of the data

P(ω2 | xh) = P(ω2)P xh |ω2

( )

P(ωi)P xh |ωi

( )

i=0 K−1

  • 1. analytical task-3

Bayes’ rule for identifying selected sites

Site class 0: ω0 = .03, 85% of codon sites Site class 1: ω1 = .40, 10% of codon sites Site class 2: ω2 = 14, 05% of codon sites

? ?

slide-10
SLIDE 10

2015-07-21 10

!" !#$" !#%" !#&" !#'" ("

(" &" ((" (&" $(" $&" )(" )&" %(" %&" *(" *&" &(" &&" +(" +&" '(" '&" ,(" ,&" (!(" (!&" (((" ((&" ($(" ($&" ()(" ()&" (%(" (%&" (*(" (*&" (&(" (&&" (+(" (+&" ('(" ('&" (,(" (,&" $!(" ,-./012%34514/67%837/56% 956:38430%837/56% >5:;38/58%.85?-?/1/;2%
  • 1. analytical task-3

Bayes’ rule for identifying selected sites

Site class 0: ω0 = .03 (strong purifying selection)

Site class 1: ω1 = .40 (weak purifying selection) Site class 2: ω2 = 14 (positive selection)

NOTE: The posterior probability should NOT be interpreted as a “P-value”; it can be interpreted as a measure of relative support, although there is rarely any attempt at “calibration”

Empirical Bayes

Naive Empirical Bayes

  • (NEB)
  • Nielsen and Yang, 1998
  • assumes no MLE errors

Bayes Empirical Bayes

  • (BEB)
  • Yang et al., 2005
  • accommodate MLE errors
  • 1. analytical task-3

Bayes’ rule for identifying selected sites

slide-11
SLIDE 11

2015-07-21 11

* exception: extreme parameter estimates

1. assign a prior to ω distribution parameters 2. fix branch lengths to MLEs 3. integrate over uncertainty 4. BEB is faster than “Full Bayes” (FB)

See: Yang Z, Wong WS, Nielsen R. 2005. Bayes empirical bayes inference of amino acid sites under positive selection. Mol Biol Evol. 22(4):1107-1118.

  • 1. analytical task-3

Bayes Empirical Bayes (BEB)

False classification rates

Small datasets: FB/BEB < NEB Large datasets: FB/BEB ≈ NEB*

Task 1. Parameter estimation (e.g., ω) Task 2. Hypothesis testing Task 3. Prediction / Site identification

model based inference progress …

let’s put this into practice …

slide-12
SLIDE 12

2015-07-21 12

1. 3 inference tasks ✔ 2. example gene analysis (& experimental validation) 3. MLE instabilities 4. false biological conclusions 5. example evolutionary survey (& experimental validation) 6. closing thoughts progress model based inference

Red/blue colour morphs of the great star coal Montastraea cavernosa

  • Is color diversity tuned by natural selection?
  • Is there a relationship between colour and endosymbiotic algae?
  • 2. example analysis

colour diversity of coral pigments

See Field et al. 2006 J. Mol. Evol. 62(3):332-9 for details.

slide-13
SLIDE 13

2015-07-21 13

ω0 = 0.02 ω1 = 0.5 ω2 = 1.93

p0 p1 p2 P(ω2 | xh) = p2P xh |ω2

( )

P xh

( )

= p2P xh |ω2

( )

piP xh |ωi

( )

i= 0 K−1

Bayes’ rule:

  • 2. example: GFP

signal 1: long term (diversifying) selection

sites in red correspond to the protein- binding region of non-colored homologs

  • f these GFP proteins

“long-term” selection

ω0 ω1 ω2

See Field et al. 2006 J. Mol. Evol. 62(3):332-9 for details. = 0.03 = 0.59 = 2.2

Foreground branch

signal 2: episodic selection

Baye’s rule:

See Field et al. 2006 J. Mol. Evol. 62(3):332-9 for details.

  • 2. example analysis
slide-14
SLIDE 14

2015-07-21 14

Bacteria were engineered to express the extant and ancestral GFP-like proteins. These bacteria were then cultured in a pattern that corresponded to the GFP-LIKE gene tree. Ugalde JA, Chang BS, Matz MV. Evolution of coral pigments recreated.

  • Science. (2003). 305:1433.

just for fun ….

  • 2. example analysis

1. 3 inference tasks ✔ 2. example gene analysis (& experimental validation) ✔ 3. MLE instabilities 4. false biological conclusions 5. example evolutionary survey (& experimental validation) 6. closing thoughts model based inference progress

slide-15
SLIDE 15

2015-07-21 15

Normal MLE uncertainty (M2a)

  • large sample size with regularity conditions
  • MLEs approximately unbiased and minimum

variance MLE instabilities (M2a)

  • small sample sizes and θ on boundary
  • regularity does not hold
  • continuous θ has been discretized (e.g., M2a)
  • non-Gaussian, over-dispersed, divergence

among datasets

p Frequency 0.00 0.10 0.20 5 10 15 20 25 30 w Frequency 3 4 5 6 7 8 5 10 15 20 p Frequency 0.0 0.4 0.8 20 40 60 80 w Frequency 5 10 15 50 100 150

ˆ θ ~ N θ, I ˆ θ

( )

−1

( )

  • 3. MLE instabilities

pω>1 ¡ pω>1 ¡

ω>1 ¡ ω>1 ¡ * distributions obtained from simulation Original Data sites Bootstrap sample #1 Bootstrap sample #2

sample same number

  • f sites, with replacement

sample same number

  • f sites, with replacement

(and so on)

T

^

T(1) T(2)

Slide from Joe Felsenstein

Non-parametric bootstrapping

(via Mark Holder)

X 2 X 1 X B θ 1 θ 2 θ B

slide-16
SLIDE 16

2015-07-21 16

estimate the distribution

  • 4. Bootstrapping & Bagging!

!! !!!!!!!!!!!!!!!!!!!!!!!!!!⋯!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!⋯!!!!!!!!!! !!

! ! > 1|!, !! !!!!!!!! ! ! > 1|!, !! !!!!!⋯!!!!!!! ! ! > 1|!, !! ! Original(data:""X"is"an"alignment"with" h = 1 … N"codon"sites" Bootstrapping:""sample"N"sites"with" replacement"to"get"B"bootstrap"replicates"" MLE(distribu4on:"analyze"each"bootstrap" replicate"to"es6mate"parameters"θi Posterior(prob:"use"θi to"compute"PP" for"each"site,"h,"of"the"original(data( "

We use bagging to aggregate information across the bootstrap replicates. Bagging can improve classification when there is a large amount of variability in a statistic [4]. !"# !!

! ! > 1|!,!! !!!!!!!!!!!"!!!!!!!!!!"# !! ! ! > 1|!,!! !!!!!!!

aggregate"by"averaging" the"posteriors aggregate"by"using"the" median(posterior

We use bootstrapping to estimate the variability of the MLEs:

new methods (SBA): classify sites based on these statistics

  • 3. MLE instabilities

Example: mixture parameters (p0 + p1 + p2 = 1.0) for codon model M2a

p0

5 5 5 5 10 15 20 25

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8

Example: bootstrap distribution for the mixture parameters p0 and p1 of codon model M2a Cause: too complex a model and too little signal in the data

p0 ¡ p1 ¡

Kernel smoothing is a technique for estimating the density of a random variable from “noisy” data.

p0

5 5 5 5 10 15 20 25

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8

10 20 30 40 40 50 60 70 80 90

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.4 0.8

bootstrap distribution smoothed distribution (λ = 0.4)

p0 ¡ p1 ¡

MLE instabilities

Much better approximation

  • f the “infinite sample”

bootstrap distribution

p0 ¡ p1 ¡

kernal smoothing

  • 3. MLE instabilities
slide-17
SLIDE 17

2015-07-21 17

  • 4. Bootstrapping & Bagging!

!! !!!!!!!!!!!!!!!!!!!!!!!!!!⋯!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!⋯!!!!!!!!!! !!

! ! > 1|!, !! !!!!!!!! ! ! > 1|!, !! !!!!!⋯!!!!!!! ! ! > 1|!, !! ! Original(data:""X"is"an"alignment"with" h = 1 … N"codon"sites" Bootstrapping:""sample"N"sites"with" replacement"to"get"B"bootstrap"replicates"" MLE(distribu4on:"analyze"each"bootstrap" replicate"to"es6mate"parameters"θi Posterior(prob:"use"θi to"compute"PP" for"each"site,"h,"of"the"original(data( "

We use bagging to aggregate information across the bootstrap replicates. Bagging can improve classification when there is a large amount of variability in a statistic [4]. !"# !!

! ! > 1|!,!! !!!!!!!!!!!"!!!!!!!!!!"# !! ! ! > 1|!,!! !!!!!!!

aggregate"by"averaging" the"posteriors aggregate"by"using"the" median(posterior

We use bootstrapping to estimate the variability of the MLEs:

new methods (SBA): classify sites based on these statistics

smoothed bootstrap aggregation

  • 3. MLE instabilities
  • 0.0

0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

50% ω=0 50% ω=1.5

False Positive Rate True Positive Rate Reference: “worst case” Reference: “best case” NEB BEB

ROC curve: “hard: dataset

  • 3. MLE instabilities
slide-18
SLIDE 18

2015-07-21 18

  • 0.0

0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

50% ω=0 50% ω=1.5

False Positive Rate True Positive Rate

  • Approximation

Based on bagging the posterior mean omega Bootstrap aggregating (bagging)*

ROC curve: “hard: dataset

  • 3. MLE instabilities
  • 0.0

0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

45% ω=0 45% ω=1 10% ω=5

False Positive Rate True Positive Rate

  • ROC curve: “easy” dataset
  • 3. MLE instabilities
slide-19
SLIDE 19

2015-07-21 19

This dataset is “notorious” for excessive false positives under NEB [2, 7]. NEB indicates that all sites (including 158 invariant sites) are under positive selection, with PP=1.0, in tax Inference of positive selection: 158 invariant sites: NEB: PP = 1.0 BEB: 0.55 < PP < 0.61 ave-SBA: 0.28 < PP < 0.32 med-SBA: 0.22 < PP < 0.28 21 sites with nonsynonymous variation: NEB: PP = 1.0 BEB: 0.91 < PP < 0.93 ave-SBA: 0.57 < PP < 0.62 med-SBA: 0.63 < PP < 0.70

0" 0.2" 0.4" 0.6" 0.8" 1" Typical"gene" tax"gene"

MLEs: codon model M2a tax has extreme mixture:

p0=0; p1=0; p2=1.0

(MLE instabilities?)

ω0<1 ω1=1 ω1>1

Bootstrapping & Bagging Real Data: HTLV-I tax gene

(spectacular failure of the NEB method)

  • 3. MLE instabilities
  • 3. MLE instabilities

Using multiple models is not a “talisman” against false positives.

slide-20
SLIDE 20

2015-07-21 20

1. 3 inference tasks ✔ 2. example gene analysis (& experimental validation) ✔ 3. MLE instabilities ✔ 4. false biological conclusions 5. example evolutionary survey (& experimental validation) 6. closing thoughts progress model based inference

“A model is an intentional simplification of a complex situation designed to eliminate extraneous detail in order to focus attention

  • n the essentials of the situation”

(Daniel L. Hartl)

  • 4. false conclusions

“know your data”

“ Remember that all models are

wrong; the practical question is how wrong do they have to be to not be useful.

“”

(George E. P. Box)

slide-21
SLIDE 21

2015-07-21 21

How wrong is too wrong?

”know your data”

When you make false biological conclusions !!!

  • 4. false conclusions

sequence evolution is complex 1. codon usage 2. process variation among sites 3. process variation over time 4. recombination

  • 4. false conclusions
slide-22
SLIDE 22

2015-07-21 22

Sums of codon usage counts the GstD gene of Drosophila

  • Phe F TTT 0 | Ser S TCT 0 | Tyr Y TAT 1 | Cys C TGT 0

TTC 27 | TCC 15 | TAC 22 | TGC 6 Leu L TTA 0 | TCA 0 | *** * TAA 0 | *** * TGA 0 TTG 1 | TCG 1 | TAG 0 | Trp W TGG 8

  • Leu L CTT 2 | Pro P CCT 1 | His H CAT 0 | Arg R CGT 1

CTC 2 | CCC 15 | CAC 4 | CGC 7 CTA 0 | CCA 3 | Gln Q CAA 0 | CGA 0 CTG 29 | CCG 1 | CAG 14 | CGG 0

  • Ile I ATT 4 | Thr T ACT 2 | Asn N AAT 5 | Ser S AGT 1

ATC 12 | ACC 11 | AAC 17 | AGC 4 ATA 0 | ACA 2 | Lys K AAA 1 | Arg R AGA 0 Met M ATG 4 | ACG 4 | AAG 37 | AGG 1

  • Val V GTT 0 | Ala A GCT 0 | Asp D GAT 2 | Gly G GGT 4

GTC 2 | GCC 38 | GAC 11 | GGC 6 GTA 1 | GCA 2 | Glu E GAA 0 | GGA 11 GTG 25 | GCG 3 | GAG 30 | GGG 0

  • how to model codon frequencies?
  • 4. false conclusions

biological conclusions are affected

Estimation bias for the

d

N

/ d

S

ratio

Simulation: GC3 = 89.5% (ENC = 28.3) 0.0 0.5 1.0 1.5 2.0 2.5 1 2 3 4

dN/dS = 0.01 dN/dS = 0.10 dN/dS = 0.30

Positive selection Purifying selection Sequence divergence (t) dN/dS unpublished simulation study

positive selection in highly biased gene sequences?

  • 4. false conclusions
slide-23
SLIDE 23

2015-07-21 23

substitution rates are proportional to empirical frequency of: Goldman and Yang 1994 (GY): Muse and Gaut 1994 (MG): target codon target nucleotide

See Rodrique et al. (2008) for a comparison of GY and MG style codon models that suggests the MG style, combined with parameters for codon preferences, might be the most desirable core-model for future development.

how to model codon frequencies? “know your data”

Δ at codon position 1st 2nd 3rd GY

πCAA πACA πAAC

MG

πc

1

πc

2

πc

3

AAA → CAA AAA → ACA AAA → AAC

Example: A → C

how to model codon frequencies?

  • 4. false conclusions
slide-24
SLIDE 24

2015-07-21 24

Δ at codon position 1st 2nd 3rd GY

πCAA πACA πAAC

MG

πc

1

πc

2

πc

3

CNF

AAA → CAA AAA → ACA AAA → AAC

Example: A → C

conditional nucleotide frequencies (CNF)

πC

1 π AA 2,3

πC

2 π A,A 1,3

πC

3 π AA 1,2

For more details: Yap et al. (2010) mbe 27: 726-734.

  • 4. false conclusions

for extreme bias such as in pathogen genomes

sequence evolution is complex 1. codon usage 2. process variation among sites 3. process variation over time 4. recombination

  • 4. false conclusions
slide-25
SLIDE 25

2015-07-21 25

M-series models (and many others): Biological interpretation of differences among sites in ω requires that such differences are due to selection pressure alone. Should we be concerned?

sequence evolution is complex

  • 4. false conclusions

cell membrane in grey; helix structures span the membrane: Hydrophobic amino acids here loop structures extend into cytoplasm: Hydrophilic amino acids here loop structures extend into extra-cellular space: Hydrophilic amino acids here

ω0 π0 κ0 c0 ω1 π1 κ1 c1 ω2 π2 κ2 c2

partition sites within a gene:

GY-type codon models: variable ω’s + c’s among sites = variable dN & dS among sites

transmenbrane proteins

  • 4. false conclusions
slide-26
SLIDE 26

2015-07-21 26

error site class 1 site class 2 M2a*† 7.2% (7.7%) ω0 = 0.06; [w1 = 1] p0 = 0.83; p1 = 0.13 ω2 = 2.48 [p2 = 0.04] M3† 6.4% ω1 = 0.07 p1 = 0.86 ω2 = 1.45 [p2 = 0.14] M8*† 5.5% (6.0%) p = 0.33, q = 2.51 p1 = 0.91 ω2 = 1.92 [p2 = 0.09] Soft LiBaC 1 10.1% ω1 = 0.04 {p1 = 0.79} ω2 = 0.72 {p2 = 0.21} Soft LiBaC 2 9.6% ω1 = 0.05 {p1 = 0.80} ω2 = 0.72 {p2 = 0.20} * LRTs for positive selection were highly significant in all cases. † High posterior probabilities for sites having ω > 1 in all replications.

Site class 1: 900 sites Beta(0.5,4.5) Mean ω = 0.1 c = 1 π’s = empirical Site class 2: 100 sites Beta(4.5,0.5) Mean ω = 0.9 c = 5 π’s = 1/61

realistic simulation w/ no positive selection

This simulation is covered in: Bao, Gu, Dunn, Bielawski (2008) Mol Biol Evol 25:1995-2007

  • 4. false conclusions

modeling process variation among sites

  • 4. false conclusions

process variation among sites software & references

  • synonymous rate
  • nonsynonymous rate

several methods in: HyPhy: Kosakovsky Pond et al. (2005) Datamonkey: Delport et al. (2010)

  • baseline DNA/RNA substitution rate
  • nonsynonymous rate

MultiLayer: Rubinstein et al. (2011)

  • baseline DNA/RNA substitution rate
  • transition/transversion ratio
  • codon frequencies
  • nonsynonymous rate

LiBaC: Bao et al. (2008)

slide-27
SLIDE 27

2015-07-21 27

sequence evolution is complex 1. codon usage 2. process varies among sites 3. process varies over time 4. recombination

  • 4. false conclusions

Branch t N S dN dS N*dN S*dS

18..19 Fequal 6.118 375.3 134.7 0.5244 6.2595 196.8 843.4 F3x4 10.461 401.2 108.8 0.5446 14.3384 218.5 1559.9 F61 10.416 411.7 98.3 0.5504 15.7089 226.6 1544.2 9..15* Fequal 1.622 400.8 139.2 0.0648 1.9111 26 266 F3x4 8.07 409.1 130.9 0.0624 10.9047 25.5 1427.1 F61 15.343 413.2 126.8 0.0611 21.5828 25.3 2736.4

Table: Highly biased branch-specific summary statistics under M0

* Data for branch 9..15 obtained from a separate analysis of ONLY LL adapted Prochlorococcus

[insert your profanity here] … that’s a lot of substitutions

GC ≈ 29% 18..19 HL adapted GC ≈ 45% LL adapted 9..15
  • 4. false conclusions
slide-28
SLIDE 28

2015-07-21 28

Null Alternate % P<0.05 M1 Model A 100 Model A(ω=1) Model A 30 (60) M3(k=2) Model B 100 MLE Results: 100% of estimated values of ω were >1 in the foreground branch LTR results:

* t = 1.5


GC = 29% ω0 = 0.0 (24%) ω1 = 0.2 (53%) ω2 = 0.9 (23%) GC ≈ 45% ω0 = 0.0 (24%) ω1 = 0.2 (53%) ω2 = 0.1 (23%) *Foreground branch: ω = 0.9 and t = 1.5

simulate under the null (no positive selection)

This issue is partly covered in: Bay and Bielawski (2013) JME , 76(4):205-215

  • 4. false conclusions

sequence evolution is complex 1. codon usage 2. variation among sites 3. variation over time 4. recombination

  • 4. false conclusions
slide-29
SLIDE 29

2015-07-21 29

recombination

Note: Recombination adds among site-wise variation relative to both process and phylogeny! See Sullivan et al. 2006 PLoS Biology 4: e234 for details.

  • 4. false conclusions

Caution: high levels of recombination lead to type I errors

  • low recombination: LRT is robust (<15% of branches)
  • high recombination: LRT type I error rate as high as 90%
  • Bayesian site identification is less sensitive than LRT

For more details: Anisimova, Nielsen and Yang (2003) Genetics, 164:1229-1236. For a solution: Scheffler et al. (2006) Bioinformatics, 22:2493-2499.

performance: model misspecification

  • 4. false conclusions
slide-30
SLIDE 30

2015-07-21 30

1. 3 inference tasks ✔ 2. example gene analysis (& experimental validation) ✔ 3. MLE instabilities ✔ 4. false biological conclusions ✔ 5. example evolutionary survey (& experimental validation) 6. closing thoughts progress model based inference “Identification and assessment of NR2C1 (TR2) as a modulator of pluripotentiality during hominid evolution”

Baker et al. submitted to MBE

  • 6. evolutionary survey

Experimental assessment of evolutionary change in pluripotentiality (1) evolutionary survey (2) rank candidates Sample, process, and align 48 nuclear receptors

Foreground branch only
slide-31
SLIDE 31

2015-07-21 31

intracellular transcription factors that directly regulate gene expression reproduction, development, and endocrine signaling

  • 6. evolutionary survey

nuclear receptor (NR) family

hormones & behavior

Phase 1: 1. processing and Q.C. 2. structurally-aware modelling of codon evolution 3. test each gene for 5 different evolutionary scenarios 4. use “FDR control” to identify families of genes associated with different evolutionary scenarios

  • 6. evolutionary survey

analysis of 48 primate NRs

slide-32
SLIDE 32

2015-07-21 32

1B) H2: ωM = ωGA ≠ ωHC = ωH 1C) H3: ωM = ωGA = ωHC ≠ ωH 1D) H4: ωM ≠ ωGA = ωH 1E) H5: ωM = ωGA ≠ ωHC 1A) H1: ωM ≠ ωGA = ωHC = ωH KEY ωM : ωMammal ωGA : ωGreatApe ωHC : ωHuman-Chimpanzee ωH : ωHuman

episodic models long-term shift models

  • 6. evolutionary survey

number of unique candidate genes: 9 evolutionary scenario LRT: p < 0.05 FDR: q < 0.05 episodic models H1: great apes 1 gene 0 genes H2: human-chimp 0 genes 0 genes H3: humans 4 genes 1 genes long-term shift H4: great apes 10 genes 5 genes H5: human-chimp 8 genes 5 genes

  • 6. evolutionary survey

phase 1: analysis of 48 NRs

slide-33
SLIDE 33

2015-07-21 33

phase 2: reliability and robustness assessment 1. alignment (independent evaluations) 2. recombination 3. MG94 style codon model 4. alternative tree topologies 5. robustness to variation in baseline DNA/RNA rates 6. bootstrapping

  • 6. evolutionary survey

analysis of 9 NRs 1. alignment assessment ✔ 2. no evidence of recombination ✔ 3. alternate framework (MG94): LRTs and MLEs robust ✔ 4. alternate tree topology: LRTs and MLEs robust ✔ 5. variation in baseline DNA/RNA rate of evolution: ✔ 6. bootstrap distribution ✗ nuclear receptor NR1D1: positive selection along human lineage at 1% of sites

nr1d1_bs_mB3_initals_results.txt
  • 5E-2
5E-2 0.1 0.15 0.2 0.25 0.3 0.05 0.1 0.2 0.15 0.25

pFG# Density#

instabilities in the MLEs

FULL STOP FULL STOP

slide-34
SLIDE 34

2015-07-21 34

gene status note ESSRA excluded MLE instabilities ESSRB excluded MLE instabilities NR1D1 excluded MLE instabilities NR2E3 excluded recombination RARG excluded MLE instabilities ESSRB ranked 4th PGR ranked 3rd RORA ranked 2nd NR2C1 ranked 1st

nr2c1_bsD5.txt 0.1 0.2 0.3 0.4 0.5 0.6 0.7 1 2 3 4 5

Density ¡ pSHIFT ¡ NR2C1

  • 6. evolutionary survey

identifying candidates for further study

  • Implicated in

regulation of neural stem cells

  • Embryonic stem cell

maintenance

  • Neuronal

differentiation

What is the function of NR2C1?

image: C. Tworney

dffdasfdsfsgfdsgfdsgf

slide-35
SLIDE 35

2015-07-21 35

  • 6. evolutionary survey

ancestral gene reconstruction

hNR2C1 cNR2C1 aNR2C1

NR2C1 from extant humans

Inferred NR2C1 for the human- chimpanzee common ancestor 7-16 million years Synthetic biology

to program cells to become factories…

NR2C1 from extant chimpanzees

Use genetic system… to make ancestral & extant proteins

OCT4 & Nanog

image: White 2015

Nanog

KD + Chimp KD+ Human KD + Ancestral Nonsense Knockdown 25 50 75 100 125

  • 6. evolutionary survey

Can NR2C1 regulate pluripotentiality via Oct4 or Nanog?

NR2C1

slide-36
SLIDE 36

2015-07-21 36

OCT4 & Nanog

image: White 2015

Oct4

KD + Chimp KD+ Human KD + Ancestral Nonsense Knockdown 25 50 75 100 125 100 125

  • 6. evolutionary survey

Can NR2C1 regulate pluripotentiality via Oct4 or Nanog?

NR2C1

Photo: Tom Maynard

~ 20,000 ES cells

  • 6. evolutionary survey

Can NR2C1 impact a phenotype of pluripotentiality?

average size of ES cell clonal colonies is a proxy for their capacity for self renewal

5 10 15 20

B

Cells / colony (thousands) at iteration #4 KD + Chimpanzee KD+ Human KD + Ancestral Nonsense Knockdown

slide-37
SLIDE 37

2015-07-21 37

Chimpanzee Ancestral Control Human 60 80 100 120 Luciferase activity (% of control)

  • 6. evolutionary survey

Can NR2C1 impact expression via the Pepck promoter?

HEK-293 cells plasmids encoding aNR2C1, cNR2C1, or hNR2C1

Expression of Pepck increases as development proceeds, including the developing neural tube. Pepck expression is considered a proxy for cell differentiation, including neuronal cells

Pepck

aNR2C1' hNR2C1' cNR2C1'

Substitution Domain S101.102INDEL NTD V242M Hinge T254A Hinge S274N Hinge G514S LBD NR2C1&&&&Human&substitutions&&&&&&&&&&&&&&&&& Substitution Domain H37R NTD N38T NTD L94M NTD S148A DBD T225A Hinge NR2C1&&&&Chimpanzee&substitutions&&&&&&&&&&&&& Oct4 expression ! Nanog expression " colony size ! PEPCK promoter activity " Oct4 expression " Nanog expression " colony size " PEPCK promoter activity ! Oct4 expression " Nanog expression " colony size " PEPCK promoter activity ! NTD:%N%terminal%domain%(A/B)% DBD:%DNA%binding%domain%(C)% Hinge:%flexible%hinge%domain%(D)% LBD:%Ligand%binding%domain%(E)% CTD:%C%terminal%domain%(F)%

  • 6. evolutionary survey

substitutions at different sites lead to divergence from ancestor

slide-38
SLIDE 38

2015-07-21 38

1. evolution of NR2C1 appears associated with evolution of several aspects of ES cell regulation 2. broad sense parallel evolution at phenotypic level 3. evolution at gene level involved different sites 4. experimental assays corroborate signal for functional divergence derived from the evolutionary analysis 5. targeted evolutionary surveys in combination with experimental analysis can contribute to studies of gene regulation evolution with respect to developmental phenotypes

  • 6. evolutionary survey

1. 3 inference tasks ✔ 2. example gene analysis (& experimental validation) ✔ 3. MLE instabilities ✔ 4. false biological conclusions ✔ 5. example evolutionary survey (& experimental validation) ✔ 6. closing thoughts progress model based inference

slide-39
SLIDE 39

2015-07-21 39 YOU are in the best position to make informed decisions about how to analyze YOUR data.

“ Thinking just once is so much

better than just using 20 different models.

“”

  • Christopher Jones
  • 6. evolutionary survey

closing thoughts