
SLIDE 1


Functional Divergence Topic 3B: Testing adaptive macroevolution [Part 2]

Why are assumptions worth worrying about? Because they can lead to qualitatively different biological conclusions !!!

SLIDE 2

Can model assumptions affect the results for a particular gene?

Estimation of dS and dN between Drosophila melanogaster and D. simulans GstD1 genes

| ts/tv bias | Codon bias | κ | S | N | dS | dN | ω |
|---|---|---|---|---|---|---|---|
| no | no | 1.0 | 152.9 | 447.1 | 0.0776 | 0.0213 | 0.274 |
| yes | no | 1.88 | 165.8 | 434.2 | 0.0691 | 0.0221 | 0.320 |
| no | 3 × 4 | 1.0 | 70.6 | 529.4 | 0.1605 | 0.0189 | 0.118 |
| yes | 3 × 4 | 2.71 | 73.4 | 526.6 | 0.1526 | 0.0193 | 0.127 |
| no | empirical | 1.0 | 40.5 | 559.5 | 0.3198 | 0.0201 | 0.063 |
| yes | empirical | 2.53 | 45.2 | 554.8 | 0.3041 | 0.0204 | 0.067 |

(Data from: Bielawski and Yang, in Statistical Methods in Molecular Evolution, Springer-Verlag Series in Statistics in Health and Medicine, New York, in press.)
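The ω column can be checked directly from the dN and dS columns, since ω = dN/dS. The short Python sketch below recomputes ω for each set of model assumptions and reports how widely the estimate ranges. One assumption is made: for the "yes / no" row it takes dS = 0.0691 and dN = 0.0221, the ordering that reproduces the reported ω = 0.320. The variable and function names are illustrative only.

```python
# Rows of the GstD1 table: (ts/tv bias, codon bias, dS, dN, reported omega).
# Assumption: in the "yes / no" row, dS = 0.0691 and dN = 0.0221,
# the ordering that reproduces the reported omega of 0.320.
rows = [
    ("no", "no", 0.0776, 0.0213, 0.274),
    ("yes", "no", 0.0691, 0.0221, 0.320),
    ("no", "3x4", 0.1605, 0.0189, 0.118),
    ("yes", "3x4", 0.1526, 0.0193, 0.127),
    ("no", "empirical", 0.3198, 0.0201, 0.063),
    ("yes", "empirical", 0.3041, 0.0204, 0.067),
]


def omega(d_n, d_s):
    """Ratio of nonsynonymous to synonymous substitution rates."""
    return d_n / d_s


# Recompute omega from dN and dS for every row.
computed = [omega(d_n, d_s) for _, _, d_s, d_n, _ in rows]
reported = [row[4] for row in rows]

# How much does the reported estimate change across model assumptions?
fold_range = max(reported) / min(reported)

for (tstv, codon, _, _, rep), comp in zip(rows, computed):
    print(f"ts/tv={tstv:3s} codon bias={codon:9s} "
          f"omega computed={comp:.3f} reported={rep:.3f}")
print(f"largest / smallest reported omega = {fold_range:.1f}")
```

Every estimate stays well below 1, so the conclusion of purifying selection holds for this gene even though the value of ω changes several-fold.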

SLIDE 3


OK, that was a quantitative difference, but it did not lead to a qualitative difference in the biological conclusion

Isochores and the vertebrate genome

Isochore families (>300 kb):

GC poor: L1 and L2
GC rich: H1 and H2
GC very rich: H3

[Figure: isochore families (L1, L2, H1, H2, H3) in cold-blooded and warm-blooded vertebrate genomes]

SLIDE 4


Origins of isochores

1. Natural selection: Bernardi and Bernardi 1986; Galtier and Mouchiroud 1998; Eyre-Walker 1999

2. Mutation pressure: Filipski 1988; Wolfe and Sharp 1993; Francino and Ochman 1999

What is the genomic relationship between dS and GC content?

Most studies; Miyata et al. 1989; Bernardi et al. 1993; Matassi et al. 1999; Eyre-Walker 1994

[Figure: three candidate relationships (panels 1, 2, 3) between dS (y-axis) and GC3 (x-axis)]

SLIDE 5

Mammalian nuclear genes: Artiodactyla vs. Primates (82 nuclear genes)

Simple model: r² = 0.0228, P = 0.1759

Model with ts/tv and codon bias: r² = 0.53, P < 0.0001

[Figure: dS (y-axis) plotted against GC3 (x-axis) under each model]

(Data from: Bielawski, Dunn, and Yang (2000) Genetics 156: 1299-1308)

Is my favorite gene evolving under positive selection pressure?

SLIDE 6

Estimation bias for the dN/dS ratio

Simulation: GC3 = 89.5% (ENC = 28.3)

[Figure: estimated dN/dS (y-axis) against sequence divergence t (x-axis) for simulated values dN/dS = 0.01, 0.10, and 0.30; plot regions labeled positive selection and purifying selection]

The dN/dS (ω) ratio is a valuable index of selection pressure! Computing the dN/dS (ω) ratio can be tricky!

SLIDE 7


Another problem:

In a pairwise analysis we must average the ω ratio over:

1. all sites
2. the entire evolutionary history

[Figure: pairwise comparison of codons CCT and CAG; labels t0, t1, k]

Pairwise analysis does not detect much adaptive evolution

In a large-scale pairwise database search, only 17 out of 3,595 genes (<0.5%) were found to be under positive selection (Endo et al. 1996 MBE 13: 685-690)

SLIDE 8


The problem of averaging over sites:

[Figure: codon sequence (ATG CTT GTG CTA ... CGC TAA) with each site labeled by selection regime: purifying (dN/dS < 1), neutral (dN/dS = 1), or adaptive (dN/dS > 1)]

75% strongly purifying: ω = 0.005
20% weakly purifying: ω = 0.50
5% adaptive: ω = 3.5

The average is a weighted sum over all three categories of sites:

(3.5 × 0.05) + (0.5 × 0.20) + (0.005 × 0.75) = 0.279

When we average over all three classes of sites, we do NOT detect positive selection: the average over all sites indicates that purifying selection dominates, with ω = 0.28
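The weighted sum over site classes can be reproduced in a few lines. This is a minimal Python sketch using the three class proportions and ω values from the slide, with the adaptive class taken as ω = 3.5, the value used in the slide's arithmetic. The names `site_classes` and `mean_omega` are illustrative.

```python
# (proportion of sites, omega) for each class, taken from the slide.
site_classes = [
    (0.75, 0.005),  # strongly purifying
    (0.20, 0.50),   # weakly purifying
    (0.05, 3.5),    # adaptive
]


def mean_omega(classes):
    """Proportion-weighted average of omega over site classes."""
    total = sum(p for p, _ in classes)
    return sum(p * w for p, w in classes) / total


avg = mean_omega(site_classes)
has_adaptive_class = any(w > 1.0 for _, w in site_classes)

print(f"average omega over sites: {avg:.4f}")
print(f"adaptive class present: {has_adaptive_class}, average > 1: {avg > 1.0}")
```

The average falls below 1 even though 5% of sites have ω well above 1, which is exactly why a single gene-wide ω can hide positive selection.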

SLIDE 9


The problem of averaging over time:

[Figure: β globin gene cluster on Chrom. 11 (ε, γG, γA, δ, β), with duplication events dated 35 mya, 40–80 mya, 100–140 mya, and 150–200 mya]

| ω | Fraction of t | b.l. (my) |
|---|---|---|
| 0.5 | 0.062 | 35 |
| 1.2 | 0.212 | 120 |
| 0.2 | 0.106 | 60 |
| 0.2 | 0.106 | 60 |
| 0.5 | 0.150 | 85 |
| 0.5 | 0.062 | 35 |
| 0.2 | 0.097 | 55 |
| 0.2 | 0.203 | 115 |
| Total | T = 1 | T = 565 my |

Grey branches: ω = 0.2
Black branches: ω = 0.5
Blue branches: ω = 1.2

Again, if we average over the tree, we do NOT detect positive selection; ω = 0.49.
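The tree-wide average of ω = 0.49 can be reproduced by weighting each branch's ω by its share of the total branch length. The Python sketch below assumes the eight (ω, branch length in my) pairs listed on the slide, which sum to 565 my. The names `branches` and `time_averaged_omega` are illustrative.

```python
# (omega, branch length in millions of years) for the eight branches.
branches = [
    (0.5, 35),
    (1.2, 120),
    (0.2, 60),
    (0.2, 60),
    (0.5, 85),
    (0.5, 35),
    (0.2, 55),
    (0.2, 115),
]


def time_averaged_omega(branch_list):
    """Average omega over the tree, weighting each branch by its length."""
    total_length = sum(length for _, length in branch_list)
    weighted = sum(w * length for w, length in branch_list)
    return weighted / total_length


total_my = sum(length for _, length in branches)
avg_omega = time_averaged_omega(branches)

print(f"total branch length: {total_my} my")
print(f"time-averaged omega: {avg_omega:.2f}")
```

One branch has ω = 1.2 over 120 my, yet the average over the whole tree remains below 1.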

We have the technology…

SLIDE 10


A real dataset: let’s do it!

SLIDE 11

What is the DAZ gene family? Two members:

DAZL1:

autosome [3p24]
present in all vertebrates

DAZ:

Y chromosome [Yq11.23]
present only in Old World Monkeys

DAZ evolved via a chromosomal translocation event

[Figure: O.W.M. carry DAZL1 (3p24) and DAZ (Yq11.23); N.W.M. and all other vertebrates carry DAZL1 only. Gene duplication via translocation to the Y chromosome, 40 MYA]

SLIDE 12


DAZ = Deleted in AZoospermia

Azoospermia is the most common form of male infertility
AZF (azoospermic factor) locus on Y chromosome

~15% of infertile men have a deletion in AZF

the deleted AZF region contains one or more genes crucial for spermatogenesis
one AZF subregion (AZFc) contains the DAZ gene

At first, DAZ was thought to be functional:

DAZ and DAZL1: expressed only in germ cells
DAZ: expression highest in spermatogonia
Elimination of DAZL1 in mice = azoospermia
Human DAZ rescues azoospermic mice

Evolutionary analysis of DAZ family offers surprising conclusion (Agulnick et al. 1998)

Similar rates among three codon positions
Similar rates between introns and exons
High rates of nonsynonymous substitution (ω about 1)

Surprising conclusions:

1. No functional constraints on primate DAZ (young pseudogene)
2. DAZ plays no role in human spermatogenesis

Method problem?

Pairwise estimation of dN and dS
Simple model [ts = tv; equal frequencies; JC69 correction]

SLIDE 13

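[Figure: phylogeny of the DAZ gene family (DAZL1: Mus; DAZL1: Human; DAZL1: Macaca; DAZ: Macaca; DAZ: Human) with branches labeled a to g; branch lengths at synonymous sites, scale bar 0.1; the chromosomal translocation event is marked on the tree]

Did selection pressure change following the translocation event?

Probabilistic models can permit different ωs on different branches

[Figure: example tree with tips x1 to x4 and internal nodes j and k; branches t1 and t2 share ω1, branches t0, t3, and t4 share ω0]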

SLIDE 14


Variable selective pressure among lineages of the DAZ gene family

[Figure: DAZ gene family tree (DAZL1: Mus; DAZL1: Human; DAZL1: Macaca; DAZ: Macaca; DAZ: Human; scale bar 0.1) with branch-specific ω estimates 0.10, 1.44, 0.35, 1.14, 0.35, 3.47, and 0.001]

| Model | p | Parameters for branches | ℓ |
|---|---|---|---|
| One-ratio | 1 | ω0 = 0.295 for all branches | −1442.44 |
| Free ratios | 7 | ω0 = 0.100 for branch a; ω1 = 3.474 for branch b; ω2 = 0.001 for branch c; ω3 = 1.444 for branch d; ω4 = 0.350 for branch e; ω5 = 0.355 for branch f; ω6 = 1.144 for branch g | −1426.40 |

The free ratios model has a higher likelihood, but it also has more parameters; how do we know if the gain in likelihood is significant?

Increasing model complexity (adding parameters to a nested model) will never decrease the maximum likelihood score

SLIDE 15


Likelihood ratio test (LRT)

Test statistic = 2∆ℓ = 2(ℓ1(θ1) − ℓ0(θ0)), where ℓ0 is the maximum log likelihood under H0 given parameters θ0 and ℓ1 is the maximum log likelihood under H1 given parameters θ1

Degrees of freedom = difference in the number of parameters between the two models

The LRT tests for a significant gain in likelihood score

Likelihood ratio test, one-ratio vs. free-ratios: 2δ = 14.2, df = 6, P = 0.014

We test the increase in the likelihood score with an LRT

Remember: the ω estimates for the one-ratio and free-ratios models are an average over all sites in the gene!

SLIDE 16


Does selection pressure vary among sites?

[Figure: distribution of ω across three site classes with proportions p0, p1, and p2]

Site class 1: ω0 < 1, 75% of codon sites
Site class 2: ω1 = 1, 20% of codon sites
Site class 3: ω2 > 1, 5% of codon sites

We can formulate this in terms of a probability distribution:

$$P(x_h) = \sum_{i=0}^{K-1} p_i \, P(x_h \mid \omega_i)$$

H0: uniform selective pressure among sites (M0)
H1: variable selective pressure among sites (M3)

Compare 2∆ℓ = 2(ℓ1 − ℓ0) with a χ² distribution

Model 0: ω̂ = 0.59
Model 3: ω̂ = 0.09, 0.64, 5.64 (one estimate per site class)

Note: the above are the MLEs for DAZ (plotted as ω ratio against proportion of sites)
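To make the mixture over site classes concrete, the Python sketch below evaluates P(x_h) = Σ p_i P(x_h | ω_i) for two sites and sums the log likelihood across them. The class proportions are the illustrative 75%, 20%, and 5% from the slide; the conditional likelihoods P(x_h | ω_i) are made-up numbers for illustration only, since in practice they come from the codon model and the tree. All names are hypothetical.

```python
import math


def site_likelihood(proportions, cond_likelihoods):
    """P(x_h) = sum over classes i of p_i * P(x_h | omega_i)."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return sum(p * lik for p, lik in zip(proportions, cond_likelihoods))


# Class proportions p0, p1, p2 (illustrative values from the slide).
proportions = [0.75, 0.20, 0.05]

# Hypothetical conditional likelihoods P(x_h | omega_i) for two sites.
# These numbers are invented for the example, not estimated from data.
cond = [
    [1e-3, 4e-4, 1e-5],  # site best explained by the omega < 1 class
    [1e-6, 2e-5, 3e-4],  # site best explained by the omega > 1 class
]

site_liks = [site_likelihood(proportions, c) for c in cond]

# Sites are treated as independent, so log likelihoods add across sites.
log_lik = sum(math.log(lik) for lik in site_liks)

print(site_liks, log_lik)
```

Under M0 the same calculation collapses to a single class (K = 1), which is why M0 is nested within M3 and the two can be compared with an LRT.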

SLIDE 17

Are some sites subject to positive selection?

[Figure: distribution of ω ratio among sites under M7 (beta) and M8 (beta&ω), where M8 adds a class of sites with ω > 1]

H0: Beta distributed variable selective pressure (M7)
H1: Beta plus positive selection (M8)

Compare 2∆ℓ = 2(ℓ1 − ℓ0) with a χ² distribution

Note: the above are plots of the MLEs for DAZ

We use the LRT to test two hypotheses:

H1: Selection pressure varies among sites in DAZ (M0 vs. M3)
H2: Some sites in DAZ evolved under positive selection (M7 vs. M8)

| | M0 vs. M3 | M7 vs. M8 |
|---|---|---|
| Tree A | 8.94* | 6.82* |
| Tree B | 19.14* | 12.16* |

Note: * significant at 5% level (χ² 5% critical value = 5.99, df = 2)
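The comparison of each 2∆ℓ value with the χ² distribution can be sketched as follows. The example assumes df = 2, as stated in the note to the table, and uses the fact that the upper-tail probability of a χ² variable with 2 degrees of freedom is exp(−x/2); other degrees of freedom would need a general χ² routine. The function name `chi2_sf_df2` and the dictionary layout are illustrative.

```python
import math


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 df.

    This closed form, exp(-x / 2), holds only for df = 2.
    """
    return math.exp(-x / 2.0)


# 5% critical value for df = 2, obtained by inverting exp(-x / 2) = 0.05.
critical_5pct = -2.0 * math.log(0.05)

# LRT statistics (2 * delta log-likelihood) reported for DAZ.
lrt_stats = {
    ("Tree A", "M0 vs. M3"): 8.94,
    ("Tree A", "M7 vs. M8"): 6.82,
    ("Tree B", "M0 vs. M3"): 19.14,
    ("Tree B", "M7 vs. M8"): 12.16,
}

p_values = {key: chi2_sf_df2(stat) for key, stat in lrt_stats.items()}

print(f"critical value (5%, df = 2): {critical_5pct:.2f}")
for (tree, test), stat in lrt_stats.items():
    p = p_values[(tree, test)]
    verdict = "reject H0" if stat > critical_5pct else "do not reject H0"
    print(f"{tree}, {test}: 2*dl = {stat:.2f}, P = {p:.4f}, {verdict}")
```

All four statistics exceed the critical value, matching the asterisks in the table.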

SLIDE 18


Reconciling the different conclusions

Patterns observed by Agulnick et al. (1998) were an artefact of averaging over sites, and between sequences, in a gene that had strong spatial and temporal heterogeneity in selective constraints

This example is not unique:

– HIV in humans (Crandall et al. 1999; Zanotto et al. 1999)
– κ-casein gene in bovids (Ward et al. 1997)

Conclusion: DAZ is again implicated as a possible cause of the most common form of human male infertility

Branch-site models allow variation among branches and sites:

– Yang and Nielsen 2002
– Bielawski and Yang 2004
– Bielawski and Yang: double gamma (+)

SLIDE 19