Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates (PowerPoint presentation transcript)

Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates. George H. Chen, Assistant Professor of Information Systems, Carnegie Mellon University. June 11, 2019.


SLIDE 1

Nearest Neighbor and Kernel Survival Analysis:
Nonasymptotic Error Bounds and Strong Consistency Rates

George H. Chen
Assistant Professor of Information Systems
Carnegie Mellon University

June 11, 2019

SLIDES 2-5

Survival Analysis

[Figure: three example patients with features such as gluten allergy, immunosuppressant use, low resting heart rate, high BMI, and irregular heartbeat, with times of death Day 2, Day 10, and Day ≥ 6]

Each patient has a feature vector X and an observed time Y.

When we stop collecting training data, not everyone has died! The "Day ≥ 6" patient is still alive at day 6, so their time of death is censored.

Goal: Estimate S(t | x) = P(survive beyond time t | feature vector x)
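As a sketch, the three patients above can be encoded as (Y, δ) pairs; note the feature-to-patient assignment below is purely illustrative, since the slide does not specify which features belong to which patient.

```python
# Hypothetical encoding of the slide's three example patients.
# delta = 1: death observed at day Y; delta = 0: censored
# (we only know death happens on day >= Y).
patients = [
    {"x": ["gluten allergy", "immunosuppressant"], "Y": 2,  "delta": 1},
    {"x": ["low resting heart rate", "high BMI"],  "Y": 10, "delta": 1},
    {"x": ["irregular heartbeat"],                 "Y": 6,  "delta": 0},  # "Day >= 6"
]

# Patients still alive when data collection stopped:
censored = [p for p in patients if p["delta"] == 0]
```

This (Y, δ) representation is exactly the observed-time/event-indicator pair that the problem setup below formalizes.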

SLIDES 6-23

Problem Setup

Model: Generate each data point (X, Y, δ) as follows:

  • 1. Sample feature vector X ∼ P_X
  • 2. Sample time of death T ∼ P_{T|X}
  • 3. Sample time of censoring C ∼ P_{C|X}
  • 4. If death happens before censoring (T ≤ C): set Y = T, δ = 1. Otherwise: set Y = C, δ = 0

Estimator (Beran 1981): to predict at a test point x, find the k training points closest to x, then apply the Kaplan-Meier estimator to those k data points to obtain Ŝ(t | x). The kernel variant is similar.

Error: sup_{t ∈ [0,τ]} |Ŝ(t|x) − S(t|x)| for time horizon τ

Assumptions:

  • The feature space is a separable metric space (intrinsic dimension d), and P_X is a Borel probability measure
  • T is a continuous random variable in time and smooth with respect to the feature space (Hölder index α)
  • Enough of the n training data have Y values > τ
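The four-step generative model above can be sketched in a few lines. The specific distributions below (uniform features, exponential death and censoring times) are illustrative assumptions of mine, not choices from the talk.

```python
# Minimal sketch of the (X, Y, delta) data-generating model.
import numpy as np

rng = np.random.default_rng(0)

def sample_point(rng):
    x = rng.uniform(0.0, 1.0)      # 1. X ~ P_X      (here: uniform on [0, 1])
    t = rng.exponential(1.0 + x)   # 2. T ~ P_{T|X}  (here: exponential, mean 1 + x)
    c = rng.exponential(2.0)       # 3. C ~ P_{C|X}  (here: exponential, mean 2)
    if t <= c:                     # 4. death observed before censoring
        return x, t, 1             #    Y = T, delta = 1
    return x, c, 0                 #    otherwise Y = C, delta = 0 (censored)

data = [sample_point(rng) for _ in range(5)]
```

Each generated triple has the same shape as a training point in the problem setup: a feature, an observed time, and an event indicator.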

SLIDES 24-33

Theory (Informal)

The k-NN estimator with k = Θ̃(n^{2α/(2α+d)}) has strong consistency rate:

    sup_{t ∈ [0,τ]} |Ŝ(t|x) − S(t|x)| ≤ Õ(n^{−α/(2α+d)})

If there is no censoring, the problem reduces to conditional CDF estimation
→ the error upper bound, up to a log factor, matches the conditional CDF estimation lower bound of Chagny & Roche 2014

The proof ideas also give finite-sample rates for:

  • Kernel Kaplan-Meier estimators
  • k-NN & kernel Nelson-Aalen cumulative hazard estimators (of −log S(t | x))
  • A generalization bound for automatically choosing k using validation data

This is the most general finite-sample theory for k-NN and kernel survival estimators: existing kernel results hold only for Euclidean feature spaces (Dabrowska 1989, Van Keilegom & Veraverbeke 1996, Van Keilegom 1998).
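The k-NN Kaplan-Meier (Beran) estimator that this theory analyzes can be sketched as follows. This is a minimal implementation of my own, assuming Euclidean feature vectors and no tied observed times; it is not the talk's reference code.

```python
import numpy as np

def knn_kaplan_meier(X, Y, delta, x, k, t_grid):
    """k-NN Kaplan-Meier (Beran 1981): run the Kaplan-Meier product-limit
    estimator on the k training points nearest to the test point x."""
    X, Y, delta = np.asarray(X, float), np.asarray(Y, float), np.asarray(delta, int)
    # Find the k nearest neighbors of x (Euclidean distance for this sketch).
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    order = np.argsort(Y[idx])
    y, d = Y[idx][order], delta[idx][order]

    surv = np.ones(len(t_grid))
    for j, t in enumerate(t_grid):
        s, at_risk = 1.0, k
        for yi, di in zip(y, d):
            if yi > t:
                break
            if di == 1:                   # observed death at time yi
                s *= 1.0 - 1.0 / at_risk  # Kaplan-Meier product-limit factor
            at_risk -= 1                  # deaths and censorings both leave the risk set
        surv[j] = s
    return surv

# Toy example: the 3 neighbors of x = 0 have times (1, death), (2, death), (3, censored).
X = [[0.0], [0.0], [0.0], [10.0]]
Y = [1.0, 2.0, 3.0, 100.0]
delta = [1, 1, 0, 1]
surv = knn_kaplan_meier(X, Y, delta, np.array([0.0]), k=3, t_grid=[0.0, 1.5, 2.5])
```

On the toy example, the estimated survival curve steps down only at observed death times: S(0) = 1, S(1.5) = 2/3, S(2.5) = 1/3.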

SLIDES 34-36

Experiments

[Figure: concordance indices (c-index, roughly 0.60 to 0.70) on dataset "gbsg2" for: adaptive kernel, random survival forest, kernel (triangle) L1, kernel (triangle) L2, kernel (box) L1, kernel (box) L2, k-NN (triangle) L1, k-NN (triangle) L2, k-NN L1, k-NN L2, and Cox]

Distance/kernel choice matters a lot in practice

Learning the kernel typically has the best performance (but no theory yet!)
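The concordance index used to score these methods can be sketched as follows: a minimal O(n²) version of Harrell's c-index, where `risk` is any risk score (higher means predicted to die sooner). The function and variable names are my own, not from the talk.

```python
def concordance_index(y, delta, risk):
    """Harrell's c-index: among comparable pairs (the earlier observed time
    must be a death), the fraction where the higher-risk subject dies first.
    Tied risk scores count as 1/2. Returns a value in [0, 1]."""
    num = den = 0.0
    n = len(y)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable only if i's death is observed
            # strictly before j's observed time.
            if delta[i] == 1 and y[i] < y[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

# A perfect risk ordering gives c-index 1.0; a reversed ordering gives 0.0.
perfect = concordance_index([1, 2, 3], [1, 1, 1], [3, 2, 1])
```

Censored subjects only enter as the later member of a pair, which is why the c-index (unlike plain rank correlation) remains well defined under censoring.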