Selection Detection and Two-Sample-Testing: Generalized Greenwood Statistics and their Applications
Ðan Daniel Erdmann-Pham, Jonathan Terhorst & Yun S. Song
University of California, Berkeley
July 9, 2019 SPA 2019
Motivation Framework Application
Generalized Greenwood Statistics
Population Genetics: Detecting Selective Pressure
Neutral Tree
◮ At each depth, leaf set sizes are approximately equidistributed
Tree with Selection
◮ Leaf set sizes are highly unbalanced close to the root
◮ Given a tree, how can we tell whether it was generated under selection or not?
◮ Data allows computation of sum of squares of leaf set sizes
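To make the last bullet concrete, here is a minimal Python sketch of the sum of squares of leaf set sizes. It assumes a tree encoded as nested tuples and measures depth in edges from the root, which is a simplification; the encoding and the helper names (`leaf_count`, `leaf_set_sizes`, `sum_of_squares`) are illustrative assumptions, not taken from the talk.

```python
def leaf_count(node):
    """Number of leaves below a node; anything that is not a tuple is a leaf."""
    if not isinstance(node, tuple):
        return 1
    return sum(leaf_count(child) for child in node)


def leaf_set_sizes(tree, depth):
    """Leaf set sizes of the subtrees rooted `depth` edges below the root."""
    level = [tree]
    for _ in range(depth):
        nxt = []
        for node in level:
            # Leaves reached before `depth` are carried along as singletons.
            nxt.extend(node if isinstance(node, tuple) else [node])
        level = nxt
    return [leaf_count(node) for node in level]


def sum_of_squares(sizes):
    return sum(s * s for s in sizes)


# Two hypothetical 8-leaf trees: one balanced, one maximally unbalanced.
balanced = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
caterpillar = ("a", ("b", ("c", ("d", ("e", ("f", ("g", "h")))))))

print(sum_of_squares(leaf_set_sizes(balanced, 1)))     # sizes [4, 4] -> 32
print(sum_of_squares(leaf_set_sizes(caterpillar, 1)))  # sizes [1, 7] -> 50
```

With the same number of leaves and the same number of groups, the unbalanced tree gives the larger sum of squares, which is the signal the statistic is meant to pick up.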
Two-Sample Tests: Comparing $\{X_k\}_{k\in[n]}$ and $\{Y_k\}_{k\in[m]}$
$X_k \sim Y_k$ (Null)
$\mathbb{E}[X_k] \neq \mathbb{E}[Y_k]$ (Alternative)
$\operatorname{Var}[X_k] \neq \operatorname{Var}[Y_k]$ (Alternative)
How can we test the hypothesis that $\{X_k\}$ and $\{Y_k\}$ are identically distributed?
Balls and bins
Bins: $S^{n,k}_1,\ S^{n,k}_2,\ \dots,\ S^{n,k}_k$
◮ $S^{n,k} \sim U$ (... Distribution)
◮ Can we perform hypothesis testing based on $\|S^{n,k}\|_2^2$?
What is the distribution of $\|S^{n,k}\|_2^2$?
Limit as n → ∞ for fixed k
◮ Greenwood Statistic (Greenwood ’46)
◮ Some moments, CLT, statistical efficiency (Moran ’47, ’51, ’53)
◮ Geometry: intersection of L1 and L2 balls
◮ Up to k = 3 (Gardner ’52)
◮ Large deviations (Schechtman, Zinn ’00)
◮ Tabulation of z-scores up to k = 20 (Burrows ’79, Currie ’81, Stephens ’81)
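The question of how the squared norm of the bin counts is distributed can be explored by simulation. The slide's description of the law of the bin counts is cut off, so the sketch below assumes that they are k positive integers summing to n, drawn uniformly over all such vectors; the function names and the quantile helper are hypothetical.

```python
import random


def uniform_composition(n, k, rng):
    """Draw k positive bin counts summing to n, uniformly over all such vectors.

    Assumption: requires 1 <= k <= n.
    """
    # Choose k-1 distinct cut points among the n-1 gaps between n balls in a row.
    cuts = sorted(rng.sample(range(1, n), k - 1))
    edges = [0] + cuts + [n]
    return [b - a for a, b in zip(edges, edges[1:])]


def squared_l2_norm(counts):
    """Sum of squared bin counts."""
    return sum(c * c for c in counts)


def null_quantile(n, k, level, draws=10000, seed=0):
    """Monte Carlo estimate of a quantile of the squared norm under uniformity."""
    rng = random.Random(seed)
    stats = sorted(
        squared_l2_norm(uniform_composition(n, k, rng)) for _ in range(draws)
    )
    return stats[min(draws - 1, int(level * draws))]


# Simulated critical value for an upper-tail test at the 5% level.
critical_value = null_quantile(n=20, k=5, level=0.95, draws=2000)
```

An observed sum of squares above `critical_value` would then count as evidence against uniformity, under the stated assumption about the null law.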
Balls and bins
$X_k \sim Y_k$ (Null)
Bins: $S^{n,m+1}_1,\ \dots,\ S^{n,m+1}_j,\ \dots,\ S^{n,m+1}_m$
◮ $S^{n,k} \sim U$ (... Distribution)
◮ Can we perform hypothesis testing based on $\|S^{n,k}\|_{p,w}^p$?
What is the distribution of $\|S^{n,k}\|_{p,w}^p$?
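In the two-sample setting, one natural reading of the balls-and-bins picture is that the n observations of one sample are the balls and the m+1 gaps cut out by the other sample are the bins. The sketch below follows that reading. The function names are hypothetical, and the weighted norm is assumed to mean the sum over bins of w_j |S_j|^p, a convention the slides do not spell out.

```python
from bisect import bisect_right


def spacing_counts(xs, ys):
    """Count how many xs fall into each of the len(ys) + 1 gaps cut out by ys."""
    cuts = sorted(ys)
    counts = [0] * (len(cuts) + 1)
    for x in xs:
        # bisect_right returns the number of cut points <= x, i.e. the gap index.
        counts[bisect_right(cuts, x)] += 1
    return counts


def weighted_p_norm_power(counts, p=2, weights=None):
    """Sum of w_j * |S_j|**p (assumed convention); unit weights by default."""
    if weights is None:
        weights = [1.0] * len(counts)
    return sum(w * abs(c) ** p for w, c in zip(weights, counts))


xs = [0.1, 0.5, 0.9, 1.5]
ys = [0.3, 1.0]
counts = spacing_counts(xs, ys)          # [1, 2, 1]
stat = weighted_p_norm_power(counts)     # 1 + 4 + 1 = 6.0
```

With p = 2 and unit weights this reduces to the sum of squared bin counts from the earlier slides.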
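Results
Observation (Recursion) Let $G(x) = \sum_{m=0}^{\infty} \mathrm{Li}_{-2m}(x)/m!$, then
$$\mathbb{E}\,\|S^{n,k}\|_2^{2m} = \frac{m!}{\binom{n-1}{k-1}}\,[x^n]\,\bigl(S_m(x)\bigr)^{\star(k)}.$$
Corollaries (Discrete)
$O\!\left(\frac{n}{\varepsilon}\log\frac{n}{\varepsilon} + \frac{n}{\varepsilon}\log k\right)$ tests
limits: CLT, LLN, large deviations
Corollaries (Continuous)
approximation: $\|F_{n,k} - F_k\|_\infty \in O(n^{-1})$
$F_{n,k} - F_k \ge 0$
$F_k \in C^{k-3}([0,1])$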
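For small n and k the moments of the sum of squares can be checked exactly by building the bins up one at a time. The sketch below is a plain dynamic programme, not the generating-function recursion stated on the slide. It assumes the bin counts are uniform over compositions of n into k positive parts, which is what a binomial normaliser with n−1 over k−1 would suggest; all names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import comb


def sum_of_squares_counts(n, k):
    """Map each value of the sum of squares to the number of compositions
    of n into k positive parts that attain it (requires 1 <= k <= n)."""
    # layer maps (balls used, sum of squares) -> number of ways so far
    layer = Counter({(0, 0): 1})
    for bins_left in range(k, 0, -1):
        nxt = Counter()
        for (total, sq), ways in layer.items():
            # Leave at least one ball for each bin still to be filled.
            max_size = n - total - (bins_left - 1)
            for size in range(1, max_size + 1):
                nxt[(total + size, sq + size * size)] += ways
        layer = nxt
    return {sq: ways for (total, sq), ways in layer.items() if total == n}


def moment(n, k, m):
    """Exact m-th moment of the sum of squares under the uniform assumption."""
    counts = sum_of_squares_counts(n, k)
    numerator = sum(ways * sq**m for sq, ways in counts.items())
    return Fraction(numerator, comb(n - 1, k - 1))


# n = 4, k = 2: compositions (1,3), (2,2), (3,1) give 10, 8, 10.
print(sum_of_squares_counts(4, 2))  # {10: 2, 8: 1}
print(moment(4, 2, 1))              # 28/3
```

Exact values of this kind are useful as a cross-check for any faster implementation of the recursion.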
Comparing Non-Parametric Two-Sample Tests
[Figure legend: $\|S^{n,k}\|_2^2$, Kolmogorov-Smirnov]
Figure: Hypothesis testing based on $\|S^{n,k}\|_2^2$ is more sensitive to variance changes than other common two-sample tests.
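A test of this kind can be calibrated by simulation, because for continuous data the bin counts depend only on the relative ordering of the two samples, so their null distribution does not depend on the underlying law. The sketch below uses that fact with uniform draws. It assumes an upper-tail rejection rule (large sums of squares count against the null), and the function names, sample sizes and variances are illustrative, not the settings behind the figure.

```python
import random
from bisect import bisect_right


def greenwood_stat(xs, ys):
    """Sum of squared counts of xs in the gaps cut out by the sorted ys."""
    cuts = sorted(ys)
    counts = [0] * (len(cuts) + 1)
    for x in xs:
        counts[bisect_right(cuts, x)] += 1
    return sum(c * c for c in counts)


def monte_carlo_p_value(xs, ys, draws=2000, seed=0):
    """Upper-tail p-value, simulating the null with i.i.d. uniform samples."""
    rng = random.Random(seed)
    observed = greenwood_stat(xs, ys)
    n, m = len(xs), len(ys)
    hits = 0
    for _ in range(draws):
        sim_x = [rng.random() for _ in range(n)]
        sim_y = [rng.random() for _ in range(m)]
        if greenwood_stat(sim_x, sim_y) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (draws + 1)


# Hypothetical variance-only change: same mean, three times the spread.
rng = random.Random(1)
ys = [rng.gauss(0.0, 1.0) for _ in range(40)]
xs_wide = [rng.gauss(0.0, 3.0) for _ in range(40)]
p_wide = monte_carlo_p_value(xs_wide, ys)
```

A variance increase pushes many observations into the two outermost gaps, which inflates the sum of squares even when the means agree.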
Comparing Non-Parametric Two-Sample Tests
[Figure legend: $\|S^{n,k}\|_2^2$, Kolmogorov-Smirnov; panels: Null, Alternative]
Figure: Hypothesis testing based on $\|S^{n,k}\|_2^2$ is more sensitive to compound mean and variance changes than other common two-sample tests, for randomly generated null and alternative of common support.
Comparing Non-Parametric Two-Sample Tests
[Figure legend: $\|S^{n,k}\|_2^2$, Kolmogorov-Smirnov; panels: Null, Alternative]
Figure: Hypothesis testing based on $\|S^{n,k}\|_2^2$ is more sensitive to compound mean and variance changes than other common two-sample tests, for randomly generated null and alternative of distinct support.
New Perspectives on old Questions
What happened?
functions of moments
continuous problem
What happens now?
alternatives
Acknowledgements
Jonathan Terhorst, Yun Song, Jonathan Fischer
Funding: German Academic Scholarship Foundation