Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates
George H. Chen, Assistant Professor of Information Systems, Carnegie Mellon University
June 11, 2019
Survival Analysis

Example training data: each patient has a feature vector X (e.g., gluten allergy; immunosuppressant; low resting heart rate, high BMI, irregular heart beat) and an observed time Y (time of death: Day 2, Day 10, Day ≥ 6). When we stop collecting training data, not everyone has died!

Goal: Estimate S(t|x) = P(survive beyond time t | feature vector x)
Model: Generate each data point (X, Y, δ) as follows, where T is the survival time and C is the censoring time:
- If T ≤ C: set Y = T, δ = 1
- Otherwise: set Y = C, δ = 0
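The censoring mechanism can be sketched as follows. The uniform feature distribution and the exponential survival/censoring times are purely illustrative assumptions, not part of the model on the slides; only the min/indicator structure comes from the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data_point(rng):
    # Sample latent times T (survival) and C (censoring); only
    # Y = min(T, C) and delta = 1{T <= C} are observed.
    X = rng.uniform(size=2)                   # feature vector (illustrative)
    T = rng.exponential(scale=1.0 + X.sum())  # survival time, depends on X
    C = rng.exponential(scale=2.0)            # censoring time
    if T <= C:
        return X, T, 1   # death observed: Y = T, delta = 1
    return X, C, 0       # censored:       Y = C, delta = 0

data = [generate_data_point(rng) for _ in range(5)]
```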
Estimator (Beran 1981): find the k training points closest to x, then apply the Kaplan-Meier estimator to those k data points to obtain Ŝ(t | x). The kernel variant is similar.

Error: for time horizon τ,

  sup_{t ∈ [0, τ]} |Ŝ(t|x) − S(t|x)|

Assumptions:
- Enough of the n training data have Y values > τ
- The feature space is a separable metric space (intrinsic dimension d) equipped with a Borel probability measure
- Survival and censoring times are continuous r.v.'s in time & smooth w.r.t. the feature space (Hölder index α)
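A minimal sketch of Beran's estimator under this setup: pick the k nearest neighbors of x, then run Kaplan-Meier on their (Y, δ) pairs. Euclidean distance and the array layout here are illustrative choices, not prescribed by the estimator.

```python
import numpy as np

def knn_kaplan_meier(x, X_train, Y_train, delta_train, k, t):
    # Beran (1981): Kaplan-Meier restricted to the k nearest neighbors of x.
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]              # indices of the k nearest points
    Y, d = Y_train[idx], delta_train[idx]
    S = 1.0
    for s in np.sort(np.unique(Y[d == 1])):  # distinct observed death times
        if s > t:
            break
        at_risk = np.sum(Y >= s)             # neighbors still at risk just before s
        deaths = np.sum((Y == s) & (d == 1))
        S *= 1.0 - deaths / at_risk          # Kaplan-Meier product factor
    return S
```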
Result: the k-NN estimator with k = Θ̃(n^(2α/(2α+d))) has strong consistency rate

  sup_{t ∈ [0, τ]} |Ŝ(t|x) − S(t|x)| ≤ Õ(n^(−α/(2α+d)))

If there is no censoring, the problem reduces to conditional CDF estimation
→ the error upper bound, up to a log factor, matches the conditional CDF estimation lower bound of Chagny & Roche (2014)

Proof ideas also give finite sample rates for:

This is the most general finite sample theory for k-NN and kernel survival estimators
→ existing kernel results hold only for Euclidean feature spaces (Dabrowska 1989, Van Keilegom & Veraverbeke 1996, Van Keilegom 1998)
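For concreteness, a small numerical sketch of the theoretically motivated choice of k and the resulting rate. The values α = 1 (Hölder smoothness) and d = 2 (intrinsic dimension) are assumed for illustration, and the Θ̃/Õ statements hide constants and log factors.

```python
def theory_k(n, alpha=1.0, d=2):
    # k ~ n^(2*alpha / (2*alpha + d)), constants/logs omitted
    return round(n ** (2 * alpha / (2 * alpha + d)))

def theory_rate(n, alpha=1.0, d=2):
    # error ~ n^(-alpha / (2*alpha + d)), constants/logs omitted
    return n ** (-alpha / (2 * alpha + d))

for n in (100, 1000, 10000):
    print(n, theory_k(n), theory_rate(n))
```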
[Figure: concordance indices (c-index, roughly 0.60–0.70) on dataset "gbsg2", comparing adaptive kernel, random survival forest, kernel (triangle, L1/L2), kernel (box, L1/L2), k-NN (triangle, L1/L2), k-NN (L1/L2), and Cox]

Distance/kernel choice matters a lot in practice.
Learning the kernel typically has the best performance (but no theory yet!)
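The c-index used in the comparison above can be computed as in this minimal O(n²) sketch of Harrell's concordance index for right-censored data; real evaluations typically use a library implementation with tie handling and faster pairing.

```python
def concordance_index(Y, delta, risk):
    # A pair (i, j) is comparable if the earlier time is an observed death
    # (delta[i] = 1); it is concordant if the patient with higher predicted
    # risk (higher = predicted to die sooner) actually died earlier.
    num, den = 0.0, 0
    n = len(Y)
    for i in range(n):
        for j in range(n):
            if delta[i] == 1 and Y[i] < Y[j]:  # comparable pair
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5                 # ties count half
    return num / den if den else float("nan")
```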