Functional Divergence
Topic 3B: Testing adaptive macroevolution [Part 2]

Why are assumptions worth worrying about? Because they can lead to qualitatively different biological conclusions!

Can model assumptions affect the results for a particular gene?
Estimation of dS and dN between Drosophila melanogaster and D. simulans GstD1 genes
(The first two columns define the method: whether it corrects for ts/tv bias and for codon bias.)

ts/tv bias   Codon bias   κ      S       N       dS       dN       ω
no           no           1.0    152.9   447.1   0.0776   0.0213   0.274
yes          no           1.88   165.8   434.2   0.0691   0.0221   0.320
no           3 × 4        1.0    70.6    529.4   0.1605   0.0189   0.118
yes          3 × 4        2.71   73.4    526.6   0.1526   0.0193   0.127
no           empirical    1.0    40.5    559.5   0.3198   0.0201   0.063
yes          empirical    2.53   45.2    554.8   0.3041   0.0204   0.067
(Data from: Bielawski and Yang, In Statistical methods in Molecular Evolution, Springer Verlag Series in Statistics in Health and Medicine. New York, New York. In Press).
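To make the size of the effect concrete, the sketch below recomputes ω = dN/dS for each row of the GstD1 table and reports how far the estimate moves between methods. It is only an illustration of the arithmetic. The names `ROWS`, `omega`, and `omega_spread` are made up for this example, and the second row is assumed to have dS = 0.0691 and dN = 0.0221, the ordering that reproduces its reported ω of 0.320.

```python
# Rows of the GstD1 table: (ts/tv bias, codon bias, dS, dN, reported omega).
# Assumption: row 2 uses dS = 0.0691 and dN = 0.0221, which is the ordering
# consistent with its reported omega of 0.320.
ROWS = [
    ("no", "no", 0.0776, 0.0213, 0.274),
    ("yes", "no", 0.0691, 0.0221, 0.320),
    ("no", "3x4", 0.1605, 0.0189, 0.118),
    ("yes", "3x4", 0.1526, 0.0193, 0.127),
    ("no", "empirical", 0.3198, 0.0201, 0.063),
    ("yes", "empirical", 0.3041, 0.0204, 0.067),
]


def omega(d_s, d_n):
    """Return the omega ratio dN/dS."""
    return d_n / d_s


def omega_spread(rows):
    """Return (smallest omega, largest omega, fold difference) across methods."""
    values = [omega(d_s, d_n) for _, _, d_s, d_n, _ in rows]
    return min(values), max(values), max(values) / min(values)


if __name__ == "__main__":
    for tstv, codon, d_s, d_n, reported in ROWS:
        print(f"ts/tv={tstv:3s} codon={codon:9s} "
              f"omega={omega(d_s, d_n):.3f} (reported {reported:.3f})")
    low, high, fold = omega_spread(ROWS)
    print(f"omega ranges from {low:.3f} to {high:.3f} ({fold:.1f}-fold)")
```

Every method still gives ω well below 1, so the estimates differ in size but not in the qualitative conclusion.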
OK, that was a quantitative difference, but it did not lead to a qualitative difference in the biological conclusion
Isochores and the vertebrate genome
Isochore families (>300 kb)
- GC poor: L1 and L2
- GC rich: H1 and H2
- GC very rich: H3
[Figure: isochore families (L1, L2, H1, H2, H3) in cold-blooded versus warm-blooded vertebrate genomes]
Origins of isochores
- 1. Natural selection: Bernardi and Bernardi 1986; Galtier and Mouchiroud 1998; Eyre-Walker 1999
- 2. Mutation pressure: Filipski 1988; Wolfe and Sharp 1993; Francino and Ochman 1999

What is the genomic relationship between dS and GC content?
[Figure: three alternative relationships (1, 2, 3) between dS and GC3; panel labels: Most studies; Miyata et al. 1989; Bernardi et al. 1993; Matassi et al. 1999; Eyre-Walker 1994]
Mammalian nuclear genes: Artiodactyla vs. Primates (82 nuclear genes)
[Figure: two scatter plots of dS (y-axis, 0.0 to 1.8) against GC3 (x-axis, 0.2 to 1.0)]
- Simple model: r2 = 0.0228, P = 0.1759
- Model with ts/tv and codon bias: r2 = 0.53, P < 0.0001
(Data from: Bielawski, Dunn, and Yang (2000) Genetics 156: 1299-1308)
Is my favorite gene evolving under positive selection pressure?
Estimation bias for the dN/dS ratio
Simulation: GC3 = 89.5% (ENC = 28.3)
[Figure: dN/dS (y-axis, 0.0 to 2.5) against sequence divergence t (x-axis, 1 to 4) for simulations with dN/dS = 0.01, 0.10, and 0.30; regions of the plot are labelled positive selection and purifying selection]
The dN/dS (ω) ratio is a valuable index of selection pressure! Computing the dN/dS (ω) ratio can be tricky!
Another problem:
In a pairwise analysis we must average the ω ratio over:
- 1. all sites
- 2. the entire evolutionary history
[Figure: pairwise comparison of two codons, CCT and CAG, joined at node k by branches t0 and t1]
Pairwise analysis does not detect much adaptive evolution
In a large-scale pairwise database search, only 17 of 3,595 genes (<0.5%) were found to be under positive selection (Endo et al. 1996, MBE 13: 685-690).
The problem of averaging over sites:
ATG CTT GTG CTA CTT GTG CTA CTT GTG CTA CTT GTG CTA ATG CTT GTG CTA CTT GTG CTA CTT GTG CTA CTT GTG CTA CGC TAA
[Figure: codon sites 1 to 101 along the sequence, each classed as purifying (dN/dS < 1), neutral (dN/dS = 1), or adaptive (dN/dS > 1)]

Site classes:
- 75% strongly purifying: ω = 0.005
- 20% weakly purifying: ω = 0.50
- 5% adaptive: ω > 3.5

The average is a weighted sum over all three categories of sites (using ω = 3.5 for the adaptive class):
(3.5 × 0.05) + (0.5 × 0.20) + (0.005 × 0.75) = 0.279

When we average over all three classes of sites we do NOT detect positive selection: the average over all sites indicates that purifying selection dominates, with ω = 0.28.
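The same weighted sum can be written as a few lines of Python. This is a minimal sketch of the arithmetic on the slide, not part of any analysis package. `SITE_CLASSES` and `average_omega` are names invented here, and the adaptive class is assumed to take ω = 3.5 exactly.

```python
# (proportion of sites, omega) for each site class on the slide.
# Assumption: the adaptive class is given omega = 3.5 exactly.
SITE_CLASSES = [
    (0.75, 0.005),  # strongly purifying
    (0.20, 0.50),   # weakly purifying
    (0.05, 3.5),    # adaptive
]


def average_omega(site_classes):
    """Return the proportion-weighted mean omega over all site classes."""
    total = sum(p for p, _ in site_classes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("site-class proportions must sum to 1")
    return sum(p * w for p, w in site_classes)


if __name__ == "__main__":
    print(f"average omega over sites = {average_omega(SITE_CLASSES):.3f}")
```

Although 5% of the sites have ω well above 1, the gene-wide average stays below 1, which is why a single ω per gene can hide positive selection.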
The problem of averaging over time:
[Figure: the β globin gene cluster (ε, γG, γA, δ, β) on chromosome 11, with divergence dates of 40-80 mya, 150-200 mya, 100-140 mya, and 35 mya marked on the tree]

Grey branches: ω = 0.2
Black branches: ω = 0.5
Blue branches: ω = 1.2

ω     Fraction of t   b.l. (my)
0.5   0.062           35
1.2   0.212           120
0.2   0.106           60
0.2   0.106           60
0.5   0.150           85
0.5   0.062           35
0.2   0.097           55
0.2   0.203           115
Total (T = 1)         (T = 565 my)

Again, if we average over the tree, we do NOT detect positive selection; ω = 0.49.
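The tree-wide average can be checked the same way, weighting each branch's ω by its length. The sketch below uses the eight (ω, branch length) pairs from the β globin slide. `BRANCHES` and `tree_average_omega` are illustrative names, and the branch order is simply the order in which the values appear.

```python
# (omega, branch length in millions of years) for the eight branches
# listed on the beta globin slide.
BRANCHES = [
    (0.5, 35), (1.2, 120), (0.2, 60), (0.2, 60),
    (0.5, 85), (0.5, 35), (0.2, 55), (0.2, 115),
]


def tree_average_omega(branches):
    """Return the branch-length-weighted mean omega over the whole tree."""
    total_length = sum(length for _, length in branches)
    weighted_sum = sum(w * length for w, length in branches)
    return weighted_sum / total_length


if __name__ == "__main__":
    total = sum(length for _, length in BRANCHES)
    print(f"total tree length T = {total} my")
    print(f"average omega over the tree = {tree_average_omega(BRANCHES):.2f}")
```

One branch has ω = 1.2, yet it covers only about a fifth of the tree, so the average over the tree stays near 0.49.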
We have the technology…
A real dataset: let’s do it!
What is the DAZ gene family?
Two members:
DAZL1:
- autosome [3p24]
- present in all vertebrates
DAZ:
- Y chromosome [Yq11.23]
- present only in Old World Monkeys

DAZ evolved via a chromosomal translocation event
[Figure: Old World monkeys (O.W.M.) carry DAZL1 at 3p24 and DAZ at Yq11.23; New World monkeys (N.W.M.) and all other vertebrates carry only DAZL1]
- Gene duplication via translocation to the Y chromosome; 40 MYA
DAZ = Deleted in AZoospermia
- Azoospermia is the most common form of male infertility
- AZF (azoospermic factor): locus on the Y chromosome
- ~15% of infertile men have a deletion in AZF
- the deleted AZF region contains a gene or genes crucial for spermatogenesis
- one of these regions (AZFc) contains the DAZ gene
At first, DAZ was thought to be functional:
- DAZ and DAZL1: expressed only in germ cells
- DAZ: expression highest in spermatogonia
- Elimination of DAZL1 in mice = azoospermia
- Human DAZ rescues azoospermic mice
Evolutionary analysis of DAZ family offers surprising conclusion (Agulnick et al. 1998)
- Similar rates among three codon positions
- Similar rates between introns and exons
- High rates of nonsynonymous substitution (ω about 1)
Surprising conclusions:
- 1. No functional constraints on primate DAZ (young pseudogene)
- 2. DAZ plays no role in human spermatogenesis

Method problem?
- Pairwise estimation of dN and dS
- Simple model [ts = tv; equal frequencies; JC69 correction]
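[Figure: phylogeny of the DAZ gene family with branches labelled a to g; tips are DAZL1 Mus, DAZL1 Human, DAZL1 Macaca, DAZ Macaca, and DAZ Human; branch lengths are measured at synonymous sites (scale bar 0.1); the chromosomal translocation event is marked on the tree]

Did selection pressure change following the translocation event?
Probabilistic models can permit different ωs on different branches.
[Figure: example tree with tips x1 to x4 and internal nodes j and k; branches t1 and t2 have ω1, and branches t0, t3, and t4 have ω0]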
Variable selective pressure among lineages of the DAZ gene family
[Figure: DAZ family phylogeny (DAZL1 Mus, DAZL1 Human, DAZL1 Macaca, DAZ Macaca, DAZ Human; scale bar 0.1) annotated with the branch-specific ω estimates 0.10, 3.47, 0.001, 1.44, 0.35, 0.35, and 1.14]

Model         p   ℓ          Parameters for branches
One-ratio     1   -1442.44   ω0 = 0.295 for all branches
Free ratios   7   -1426.40   ω0 = 0.100 for branch a
                             ω1 = 3.474 for branch b
                             ω2 = 0.001 for branch c
                             ω3 = 1.444 for branch d
                             ω4 = 0.350 for branch e
                             ω5 = 0.355 for branch f
                             ω6 = 1.144 for branch g

The free ratios model has a higher likelihood, but it also has more parameters; how do we know if the gain in likelihood is significant? Adding parameters to a nested model never decreases the maximum likelihood score, so a higher likelihood alone is not evidence for the more complex model.
Likelihood ratio test (LRT)
Test statistic = 2Δℓ = 2(ℓ1(θ1) - ℓ0(θ0)), where ℓ0 is the maximum log likelihood under H0 given parameters θ0, and ℓ1 is the maximum log likelihood under H1 given parameters θ1
Degrees of freedom = difference in the number of parameters between the two models
The LRT tests for a significant gain in likelihood score
Likelihood ratio test, one-ratio vs. free ratios: 2Δℓ = 14.2, df = 6, P = 0.014
We test the increase in the likelihood score with an LRT.
Remember: the above estimates are an average over all sites in the gene!
Does selection pressure vary among sites?
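[Figure: proportions of codon sites (p0, p1, p2) in three site classes]
- Site class 1: ω0 < 1, 75% of codon sites (p0)
- Site class 2: ω1 = 1, 20% of codon sites (p1)
- Site class 3: ω2 > 1, 5% of codon sites (p2)

We can formulate this in terms of a probability distribution:

$$P(x_h) = \sum_{i=0}^{K-1} p_i \, P(x_h \mid \omega_i)$$

where x_h is the data at site h, p_i is the proportion of sites in class i, and K is the number of site classes.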
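The site-class mixture, in which the probability of the data at site h is the sum over classes of p_i × P(x_h | ω_i), can be sketched in Python as follows. This only illustrates the mixing step. The per-class conditional likelihoods in `HYPOTHETICAL_SITES` are invented numbers, not values from the DAZ data, because a real analysis would compute P(x_h | ω_i) from a codon model on the tree. The function names are also made up for this example.

```python
import math

# Site-class proportions p0, p1, p2 taken from the slide (75%, 20%, 5%).
PROPORTIONS = [0.75, 0.20, 0.05]

# Hypothetical conditional likelihoods P(x_h | omega_i) for two sites.
# These numbers are invented purely to illustrate the calculation.
HYPOTHETICAL_SITES = [
    [0.020, 0.004, 0.001],  # a site best explained by the omega < 1 class
    [0.001, 0.006, 0.030],  # a site best explained by the omega > 1 class
]


def site_likelihood(proportions, cond_likelihoods):
    """Return P(x_h) = sum over classes i of p_i * P(x_h | omega_i)."""
    if len(proportions) != len(cond_likelihoods):
        raise ValueError("need one conditional likelihood per site class")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("site-class proportions must sum to 1")
    return sum(p * lik for p, lik in zip(proportions, cond_likelihoods))


def log_likelihood(proportions, sites):
    """Return the log likelihood of all sites, assuming sites are independent."""
    return sum(math.log(site_likelihood(proportions, s)) for s in sites)


if __name__ == "__main__":
    for h, site in enumerate(HYPOTHETICAL_SITES, start=1):
        print(f"site {h}: P(x_h) = {site_likelihood(PROPORTIONS, site):.5f}")
    print(f"log likelihood = {log_likelihood(PROPORTIONS, HYPOTHETICAL_SITES):.4f}")
```

Because each site's likelihood is averaged over the classes, no site has to be assigned to a class in advance.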
H0: uniform selective pressure among sites (M0)
H1: variable selective pressure among sites (M3)
Compare 2Δℓ = 2(ℓ1 - ℓ0) with a χ2 distribution

[Figure: distribution of ω among sites under each model, plotted from the MLEs for DAZ]
- Model 0: ω̂ = 0.59
- Model 3: ω̂0 = 0.09, ω̂1 = 0.64, ω̂2 = 5.64
Are some sites subject to positive selection?
H0: Beta distributed variable selective pressure (M7)
H1: Beta plus positive selection (M8)
Compare 2Δℓ = 2(ℓ1 - ℓ0) with a χ2 distribution

[Figure: distribution of ω among sites under M7 (beta) and M8 (beta&ω, with an extra class of sites at ω > 1), plotted from the MLEs for DAZ]

We use the LRT to test two hypotheses:
H1: Selection pressure varies among sites in DAZ (M0 vs. M3)
H2: Some sites in DAZ evolved under positive selection (M7 vs. M8)

LRT statistics (2Δℓ):
          M0 vs. M3   M7 vs. M8
Tree A    8.94*       6.82*
Tree B    19.14*      12.16*
Note: * significant at the 5% level (χ2 5% critical value = 5.99, df = 2)
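The comparison against the χ2 distribution can be sketched without any statistics library, because a χ2 distribution with 2 degrees of freedom has the closed-form upper tail P(X > x) = exp(-x/2). The sketch below takes df = 2 as stated in the slide's note and applies it to the four reported LRT statistics. `chi2_sf_df2`, `critical_value_df2`, and `LRT_STATISTICS` are names made up for this illustration.

```python
import math

# Reported LRT statistics (2 * delta log likelihood) from the slide.
LRT_STATISTICS = {
    ("Tree A", "M0 vs. M3"): 8.94,
    ("Tree A", "M7 vs. M8"): 6.82,
    ("Tree B", "M0 vs. M3"): 19.14,
    ("Tree B", "M7 vs. M8"): 12.16,
}


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 df: exp(-x/2)."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


def critical_value_df2(alpha):
    """Chi-square critical value for 2 df, found by inverting exp(-x/2)."""
    return -2.0 * math.log(alpha)


if __name__ == "__main__":
    crit = critical_value_df2(0.05)
    print(f"5% critical value (df = 2) = {crit:.2f}")
    for (tree, test), stat in LRT_STATISTICS.items():
        verdict = "significant" if stat > crit else "not significant"
        print(f"{tree}, {test}: 2dl = {stat:.2f}, "
              f"P = {chi2_sf_df2(stat):.4f} ({verdict})")
```

For other degrees of freedom the tail probability has no equally simple form, and a library routine such as `scipy.stats.chi2.sf` would normally be used instead.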
Reconciling the different conclusions
- Patterns observed by Agulnick et al. (1998) were an artefact of averaging over sites, and between sequences, in a gene that had strong spatial and temporal heterogeneity in selective constraints
- This example is not unique: