Idea Transcript
Hypothesis Testing to Detect Positive Selection Asif Tamuri EMBL-EBI Molecular Phylogenetics Course Agricultural Institute of Slovenia, Ljubljana
Outline • Introduction • Codon substitution models & ω • Tests - Lineage-specific adaptation - Site-specific adaptation - Branch-site models • Final words
Natural selection • Phenotypic evolution
driven mostly by selection
• Fitness of the phenotype/ morphological trait
Mutations and fitness fWT = 1 fMUT = 1 + s s = fMUT − fWT
s
0 <
s> s≈0
S = 2Ns
0
Neutral theory
Neutral
Neutral Nearly Neutral 0.5
Mutation
Neutral : Adaptive 999 : 1
4
3
2
Mutation
5
Neutral : Adaptive 0.5 Neutral 999 :: 1Adaptive : Adaptive 0.4 0.5 Deleterious 0.3 0.2 0.1
1
0
0
1
0 0
10
20
Substitution
Substitution
Neutral : Adaptive 24 : 1
999 : 1 9:1
0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0
Neutral : Adaptive : Adaptive 1.5 1.5Neutral Deleterious 1.5 24 24 : 1: 1 : Adaptive 3:1 1 1 1
0.5
0.5 0.5
Nearly Neutral
Nearly Neutral Adaptive
Adaptiv
0.5
Deleterious : Adaptive 0.5 Deleterious :9Adaptive :1 0.5 Neutral : Adaptive 0.4 9:1 9:1
0.4 0.4 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 00 0
1.5 1.5 1.5 1
0.5
N
0.4 0.3 0.2 0.1 0
Deleterious : Adaptive Deleterious : Adaptive Neutral3: :Adaptive 1:1 3 1: 4
N
1.5
1
1
0.5 0.5 0.5
0.5
1
0 0 0 00 000 –30 20 –10 10 –20 –10 –20 30 –30 –30 –10 –10 0 –20 –20 –30 –20 –10 00 0 10 10 20 20 30 30–30 –20 –10 30 –30
Ns
A
Ns
Ns
0 20 –10 10 30 –30 –20 10 –20 2020 3030 –30 10
0
–10
Fig. 8.1 Distributions of fitness effects (Ns) among mutations and substitutions under neutral, nearly neutral, and ad Akashi, H. (1999). Gene, 238(1), 39–51. ularamong evolution. One of possible distribution is shown for each model, area under the summing to 1. Ta ness effects (Ns) mutations and substitutions under neutral, nearly neutral, andthe adaptive models ofcurves molecFig. 8.1 Distributions fitness effects (Ns) among mutations and with substitutions under neutral, nearly neutral,
Sequence diversity •
What is the underlying cause of the sequence diversity we observe?
•
Natural selection
•
-
Most mutations are deleterious Substitution are adaptive
Neutral theory
-
Most mutations are deleterious Most substitutions are nearly neutral Fate of neutral mutations is determined by drift
Molecular adaptation • Defensive immunity • Evading defensive immunity
• Reproduction • Digestion • Many more... Yang, Z., & Bielawski, J. P. (2000). Trends in ecology & evolution, 15(12), 496–503. Pedersen et al. (2012) BMC Genomics 2012, 13:694!
Synonymous and nonsynonymous substitutions
Measuring selection using ω
•d
S (Ks)
site
: synonymous substitutions per
•d
N (KA):
nonsynonymous substitutions per site
•ω=d
N/
dS : nonsynonymous/ synonymous rate ratio
Measuring selection using ω •
•
Amino acid change:
-
neutral: fixed at the same rate as a synonymous mutations, ω = 1.
-
deleterious: purifying selection will reduce its fixation rate, ω < 1.
-
selective advantage: fixed with a higher rate than a synonymous mutations, ω > 1.
ω significantly higher than one is evidence for diversifying selection.
Codon substitution rates Synonymous transition Synonymous transversion Nonsynonymous transition Nonsynonymous transversion
Transition: Purine Purine (A Pyrimidine
GTG ATG TTG
G)
Pyrimidine (C
Adapted from Yang (2006)
T)
CTT
CTC
CTG
CTA CCG
CGG CAG
Basic model of codon substitution if i and j differ at more than one position,
0, j,
qij
for synonymous transversi on,
j,
for synonymous transition ,
j,
for nonsynonym ous transversi on, j,
for nonsynonym ous transition.
• κ: transition/transversion rate ratio • π : biased codon usage for codon j j
Goldman and Yang (1994) Mol Biol Evol 11:725-736
Codon frequencies Fequal
1/61
F1x4
f(C) f(T) f(G)
F3x4
f1(C) f2(T) f3(G)
F61
f(CTG)
ML estimation of dN and dS • Transition probabilities P(t) = {p (t)} = exp(Qt) • Imagine two sequences: - Probability of a site with ij
codons i and j: πi pij(t)
-
Given t, κ and ω, calculate the log-likelihood = log{πi pij(t)}
-
Maximise
i
t
j
Maximum likelihood estimation lnL
#
"
Selective pressure on phylogenies • Adaptive evolution is difficult to detect using only two sequences
• It averages selective pressure over entire evolutionary history & over all sites
• To increase power of the method, allow selection pressure to vary over sites or branches of multiple sequences on a phylogeny
Lineage-specific adaptation •
Most selection pressure is episodic, happening at some particular moment in evolutionary history
•
A lineage that underwent Darwinian selection may have a dN/dS rate different from other lineages or greater than one.
•
Models allowing variable dN/dS ratios among lineages
-
Ratio for few lineages different from background Ratio of lineage of interest is greater than 1 Variable among evolutionary lineages
Yang, Z. (1998). Mol Biol Evol, 15(5), 568–573. Yang, Z. & Nielsen, R. (1998) Journal of Molecular Evolution 46:409-418.
Branch model • •
H0: Same selective pressure across entire tree. H1: Selective pressure change in one part of the tree.
-
Two set of branches with independent ω ratios Two different Q rate matrices → two P(t)-s Maximise log-likelihood wrt all parameters
ω
vs.
ω0
ω1
LRT ω
• • •
vs.
ω0
ω1
H0: Same selective pressure across entire tree. H1: Selective pressure change in one part of the tree. Likelihood Ratio Test:
-
H0 has p parameters with log-likelihood ℓ0 H1 has q parameters with log-likelihood ℓ1 Compare 2(ℓ1 - ℓ0) with χ2 distribution with d.o.f. = p - q to test whether to reject null
Lysozyme example !c
!h
Yang, Z. (1998). Mol Biol Evol, 15(5), 568–573.
Table 1 Log Likelihood Values and Parameter Estimates Under Different Models Model Large data set (n 19) A. One ratio: 0 H B. Two ratios: 0 H, C. Two ratios: 0 C, D. Two ratios: 0, H E. Three ratios: 0, H, F. Two ratios: G. Two ratios: H. Two ratios: I. Three ratios: J. Three ratios:
p
ˆ
ˆ0
......... C .......... H .......... C .......... C ...........
35 36 36 36 37
1043.84 1041.70 1039.92 1037.59 1037.04
4.157 4.163 4.186 4.199 4.196
0.574 0.489 0.484 0.392 0.392
1. . . . . . . 1 ...... 0 C, H 1 ...... 0, H C 1. . . . . . . . 0, H, C 1, C. . . . . . . . 0, H
35 35 35 36 36
1042.50 1042.29 1040.32 1037.92 1039.49
4.074 4.058 3.974 4.101 4.063
0.488 0.484 0.392 0.392 0.392
0
H,
C
C
ˆH ˆ0 ˆ0 7.166 ˆ0 1 1 1
ˆC ˆ0 3.383 ˆ0 ˆH 3.516 1 ˆ0 1 1 3.448
Yang, Z. (1998). Mol Biol Evol, 15(5), 568–573.
Table 2 Likelihood Ratio Statistics (2
Null Hypothesis Tested A. ( H B. C C. C D. H E. H A.( H B. C C. C D. H E. H
C)
Assumption Models Made Compared
0
........ 0 ........ 0 ........ 0 ........ 0
C)
1.. 1.. .. ..... 1.. .. ..... 1. . . . . . . . . 1.........
* t (P ** Extremely
) for Testing Hypotheses
5%;
H
C
H
0
free
H C C
0
free
H
C
H
0
free
H C C
0
free
2 1
t (P
3.84). 1%;
A A C A B
and D and B and E and C and E
D and H B and F E and I C and G E and J 2 1
Large Data Set (n 19)
Small Data Set (n 7)
12.50** 4.28* 5.76* 7.84** 9.32**
8.78** 2.76 3.96* 5.88* 7.08**
5.46* 1.60 1.76 4.74* 4.90*
5.46* 1.68 1.84 4.60* 4.76*
6.63).
Yang, Z. (1998). Mol Biol Evol, 15(5), 568–573.
Site-specific adaptation •
Previous model assumes all sites are under the same selection pressure (averaging ω across all sites).
•
Unrealistic; sites have various functional and structural roles, and so likely to be under different selective pressure.
•
Many sites might be under strong purifying selection, ω close to 0.
‘Site model’ allows for variable selection intensity among amino acid sites.
Nielsen, R. & Yang, Z. (1998) Genetics 148:929-936 Yang, Z. et al. (2000). Genetics, 155(1), 431–449.
Random-sites model • Use statistical distribution to describe variation of ω among sites.
- e.g. classes of sites with different ω • Test: 1. Do sites exist where ω > 1? (LRT) 2. If so, identify positively selected sites (Bayes)
Simple model • • •
H0: M1 (neutral)
-
proportion p0 of conserved sites, ω0 = 0 proportion p1 of neutral sites, ω1 = 1
H1: M2 (selection)
-
as H0, and... proportion p2 of sites with ω2 estimated from data
Does M2 fit data significantly better than M1 (LRT) and is M2 estimated ω2 greater than one?
Nielsen, R. & Yang, Z. (1998) Genetics 148:929-936
More models TABLE 2
Models of variable
p
Model code M0 M1 M2 M3
(one-ratio) (neutral) (selection) (discrete)
Parameters
M5 (gamma) M6 (2gamma) M7 (beta) M8 (beta& ) M9 (beta&gamma) M10 (beta&gamma 1) M11 (beta&normal 1)
1 1 3 2K 1 (K 3) K 1 (K 5) 2 4 2 4 5 5 5
M12 (0&2normal 1)
5
p0, p1,
2
M13 (3normal 0)
6
p0, p1,
2
M4 (freqs)
p, number of parameters in the
ratios among sites
Notes One ratio for all sites p1 1 p0, 0 0, 1 1 p2 1 p0 p1, 0 0, 1 1 pK 1 1 p0 p1 . . . pK 2
p0 p0, p1, 2 p0, p1, . . . , pK 2, 0, 1, . . . , K 1 p0, p1, . . . , pK 2 , p0, 0, p, q p0, p, q, p0, p, q, p0, p, q, p0, p, q,
0
,
The
1
, , , ,
,
,
1
,
0
2
,
1
2
k
are fixed at 0, 1⁄3, 2⁄3, 1, and 3
From ( , ) p0 from ( 0, 0) and p1 1 p0 from ( 1, 1) From (p, q) p0 from (p, q) and 1 p0 with p0 from (p, q) and 1 p0 from ( , ) p0 from (p, q) and 1 p0 from 1 ( , ) p0 from (p, q) and 1 p0 from ( , 2), truncated to 1 p0 with 0 0 and 1 p0 from the mixture: p1 from (1, 21), and 1 p1 from ( 2, 22), both normals truncated to 1 p0 from (0, 20), p1 from (1, 21), and p2 1 p0 p1 from ( 2, 22), all normals truncated to 1
distribution.
Yang, Z. et al. (2000). Genetics, 155(1), 431–449.
Beta distribution
https://en.wikipedia.org/wiki/Beta_distribution
Common tests • •
*
Variable selective pressure among sites?
-
M0 (one ratio) and M3 (discrete)
Sites evolving by positive selection?
-
M1 (neutral) and M2 (selection) [ω0 = 0]
-
M7 (beta) and M8 (beta&ω)
M1a (NearlyNeutral) and M2a (PositiveSelection) [ω0 < 1] *
M8a (beta&ωs=1) and M8 (beta&ω) ✝
Yang, Z. et al. (2005). Mol Biol Evol, 22(4), 1107–1118. ✝ Swanson, W. J. et al. (2003). Mol Biol Evol, 20(1), 18–20.
Identification of sites evolving under positive selection • LRT is significant for alternative model and estimated ω > 1
• Empirical Bayes (EB) approach - Calculate posterior probability of each site for each ω class - which class does the site most likely belong to?
-
Positive selection acting on sites with high posterior probability for the “ω > 1” site class
EB procedure • Imagine K site classes with ω ratios (ω , 0
ω1, ..., ωk-1) and proportions (p0, p1, ..., pk-1)
• Posterior probability that site with data x is from site class i: P(x | ω i )pi P(ω i | x) = = P(x)
∑
P(x | ω i )pi
K −1 j=0
P(x | ω j )p j
Class I MHC example Table 2 Log-likelihood Values and Parameter Estimates Under Random-sites Models for the Class I MHC Alleles Model Code
p
M7 (beta). . . . . . . . . M8 (beta & ) . . . .
393 395
7498.97 7232.68
= 0.88
pˆ 0.103, qˆ 0.354 pˆ0 0.900, (pˆ1 0.100), pˆ 0.710, ˆ 5.122
Positively Selected Sites 0.168, qˆ
Not allowed 9F 24A 45M 62G 63E 67V 69A 70H 71S 77D 80T 81L 82R 94T 95V 97R 99Y 113Y 114H 116Y 151H 152V 156L 163T 167W
= 5.12
Probability
< 0.55
Estimates of Parameters
Site Yang, Z., & Swanson, W. J. (2002). Mol Biol Evol, 19(1), 49–57. Yang, Z. (2002). Current Opinion in Genetics & Development, 12(6), 688–694.
Variations of site model • Random-effects model • Fixed-sites model - Sites are partitioned by user • One ω per site - e.g. SLR Massingham & Goldman (2005). Genetics, 169(3), 1753–1762.
Branch-site models • • • •
Site class 0 - conserved Site class 1 - neutral Site class 2a - variable class 1 Site class 2b - variable class 2
•
Null:
• •
LRT
Yang, Z., & Nielsen, R. (2002). Mol Biol Evol, 19(6), 908–917. Yang, Z. et al. (2005). Mol Biol Evol, 22(4), 1107–1118. Yang, Z., & dos Reis, M. (2011). Mol Biol Evol, 28(3), 1217–1228.
!2 = 1 (fixed) Alternative: !2 >= 1 (estimated)
Notes
• Two sequences are not enough! • Branch models use one ω for all sites • Site models use distribution of ω for sites but unchanged in time
-
Good power if this is true
• Branch-site models - Some power in detecting episodic adaptation - Sensitive to sequence and alignment errors • Sequences need to be selected carefully
•
•
•
Parameters on the boundary affects the LRT
-
χ2 approximation is not expected to hold The test is conservative i.e. false positive rate will be less than significance level
Bayesian prediction of sites
-
no practical from a few highly similar sequences to increase accuracy, increase number of lineages sensitive to sample errors in parameter estimates and ω distribution
LRT seems robust to model assumptions, but test!
-
Use multiple models (M0-M3, M1a-M2a, M7-M8) Check for local optima Check effect of ω distribution, or codon frequency method, on parameters estimation and site identification
Limitations • Episodic selection • At a positively selected site, all amino acids are assumed to be equally advantageous
• ω > 1 is a conservative measure
(averaging across branches/sites)
• Ignores errors in the phylogeny
References Yang, Z. (2012) PAML User Guide Yang, Z., & Reis, dos, M. (2011). Statistical properties of the branch-site test of positive selection. Molecular biology and evolution, 28(3), 1217–1228. Yang, Z. (2006). Computational molecular evolution. Oxford University Press. Bielawski, J. P. & Yang, Z. (2005). Maximum Likelihood Methods for Detecting Adaptive Protein Evolution. Ch. 5 in Statistical Methods in Molecular Evolution ed. Nielsen, R. Springer, NY. Yang, Z., Wong, W. S. W., & Nielsen, R. (2005). Bayes empirical bayes inference of amino acid sites under positive selection. Molecular biology and evolution, 22(4), 1107– 1118. Yang, Z., & Nielsen, R. (2002). Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Molecular biology and evolution, 19(6), 908–917. Yang, Z., & Bielawski, J. P. (2000). Statistical methods for detecting molecular adaptation. Trends in ecology & evolution, 15(12), 496–503.
Yang, Z. et al. (2000). Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 155(1), 431–449. Nielsen, R. & Yang, Z. (1998). Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929-936. Yang, Z. (1998). Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Molecular biology and evolution, 15(5), 568–573. Yang, Z. & Nielsen, R. (1998). Synonymous and nonsynonymous rate variation in nuclear genes of mammals. Journal of Molecular Evolution 46:409-418. Goldman, N., & Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular biology and evolution, 11(5), 725–736.