ROBUST LOCATION AND SCALE ESTIMATION WITH CENSORED OUTCOMES
Jerome H. Friedman, Stanford University


SLIDE 1

ROBUST LOCATION AND SCALE ESTIMATION WITH CENSORED OUTCOMES
Jerome H. Friedman, Stanford University

SLIDE 2

MACHINE LEARNING

y = F(x, z)
y = outcome variable
x = (x_1, ..., x_p) observed predictor variables
z = (z_1, z_2, ...) other variables

Goal: estimate E[y | x] given data {y_i, x_i}, i = 1, ..., N

SLIDE 3

STATISTICAL MODEL

y = f(x) + s(x) · ε
f(x) = E[y | x], the location function
s(x) > 0, the scale function
ε = random variable, E[ε | x] = 0
Prediction: ŷ = f(x)
s(x) · ε = "irreducible error" (unavoidable)

SLIDE 4

REDUCIBLE ERROR

r(x) = E |f(x) − f̂(x)|
f(x) = optimal location (target) function
f̂(x) = estimate based on training data & ML method
ML goal: methods to reduce r(x)
Statistics goal: methods to estimate r(x)
Prediction error (y) = Reducible + Irreducible
Usually: Irreducible s(x) >> Reducible r(x)

SLIDE 5

USUAL ASSUMPTIONS

s(x) = s = constant (homoscedasticity)
ε ~ N(0, 1) (normality)

SLIDE 6

HOMOSCEDASTICITY

F(x, z) = f(x) + g(z) (additive) and p(x, z) = p(x) p(z) ⇒ scale[g(z) | x] = constant
Not very likely

SLIDE 7

[Figure: logistic densities P(y | x), each with loc = 5 and scale = 0.1, 0.25, 0.5, 1]

SLIDE 8

NORMALITY (not very likely either)

Tukey: "small residuals ≈ normal, larger have heavier tails."
Heterodistributionality

SLIDE 9

HETERODISTRIBUTIONALITY

Robustness: choose a compromise p̄(ε) with good properties for other distributions
p̄(ε) = normal: not good!

SLIDE 10

LOGISTIC DISTRIBUTION

ε | x = (y − f(x)) / s(x)
p̄(ε) = e^(−ε) / [s (1 + e^(−ε))²]
small |ε| ~ normal, large |ε| ~ exponential

[Figure: density of the logistic distribution for ε]
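The tail behavior can be checked numerically. A minimal Python sketch (helper names are mine, not from the talk) comparing the logistic density with the normal:

```python
import math

def logistic_density(eps, s=1.0):
    # p(eps) = e^(-eps) / (s * (1 + e^(-eps))^2); written with |eps|
    # for numerical stability (the density is symmetric in eps)
    e = math.exp(-abs(eps))
    return e / (s * (1.0 + e) ** 2)

def normal_density(eps):
    return math.exp(-0.5 * eps * eps) / math.sqrt(2.0 * math.pi)

# Near zero the two are close; far out the logistic decays like e^(-|eps|)
# (exponential tails) while the normal decays like e^(-eps^2 / 2).
for eps in (0.0, 1.0, 4.0, 8.0):
    print(f"{eps:4.1f}  logistic={logistic_density(eps):.6f}  normal={normal_density(eps):.6f}")
```

This is exactly the compromise the slide describes: near-normal behavior for small residuals, much fatter (exponential) tails for large ones.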

SLIDE 11

Prediction: ŷ = f̂(x)

f̂(x) = argmin over f ∈ F of Σ_{i=1}^N [ε_i + 2 log(1 + e^(−ε_i))],  ε_i = (y_i − f(x_i)) / s(x_i)

Minimized at f(x_i) = y_i, independent of s(x_i)
1/s(x_i) ~ "weight" for observation i: controls its relative influence on the fit
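A minimal sketch of this per-observation criterion (my own helper name, not code from the talk), showing that it is minimized at f = y regardless of s, while 1/s sets how much a given residual costs:

```python
import math

def logistic_fit_loss(y, f, s):
    """rho(eps) = eps + 2*log(1 + e^(-eps)) with eps = (y - f)/s,
    computed in a numerically stable form."""
    eps = (y - f) / s
    # log(1 + e^(-eps)) = max(-eps, 0) + log1p(e^(-|eps|))
    return eps + 2.0 * (max(-eps, 0.0) + math.log1p(math.exp(-abs(eps))))

# Minimum is at f = y, with value 2*log(2), independent of s ...
print(logistic_fit_loss(3.0, 3.0, 0.5))
print(logistic_fit_loss(3.0, 3.0, 2.0))
# ... while away from the minimum, a smaller s (larger weight 1/s)
# makes the same residual cost more:
print(logistic_fit_loss(3.0, 1.0, 0.5), logistic_fit_loss(3.0, 1.0, 2.0))
```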

SLIDE 12

Using an incorrect s(x) to estimate f(x) increases variance, not bias
⇒ assuming s(x) = constant is usually not too bad

SLIDE 13

ESTIMATE ŝ(x)

(1) Improve f̂(x) in high-variance settings
(2) Important inferential statistic:
    (a) prediction interval ~ accuracy of the ŷ-prediction; logistic: IQR[y | f(x)] = 2 s(x) log(3)
    (b) can affect decisions
(3) Crucial with censoring

SLIDE 14

CENSORING (y-value partially known)

Data: {y_i, x_i}, i = 1..N  →  {a_i, b_i, x_i}, i = 1..N, with a_i ≤ y_i ≤ b_i
a_i = b_i = y_i ⇒ y-value known
a_i = −∞ ⇒ censored below b_i
b_i = ∞ ⇒ censored above a_i
Otherwise: interval censored [a_i, b_i]

SLIDE 15

Special case: {a_i, b_i} → K disjoint intervals (bins)
K = 2 ⇒ usual binary logistic regression
K > 2 ⇒ ordered multiclass logistic regression

SLIDE 16

LIKELIHOOD

Pr(a ≤ y ≤ b) = 1 / (1 + e^(−(b−f)/s)) − 1 / (1 + e^(−(a−f)/s))

Depends strongly on both f and s
⇒ need to estimate both f(x) and s(x)
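This probability is just a difference of two logistic CDF values; a minimal sketch (helper names and example numbers are mine):

```python
import math

def logistic_cdf(t):
    # numerically stable 1 / (1 + e^(-t))
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def interval_prob(a, b, f, s):
    """Pr(a <= y <= b) for a logistic with location f and scale s."""
    return logistic_cdf((b - f) / s) - logistic_cdf((a - f) / s)

print(interval_prob(-math.inf, 2.0, 0.0, 1.0))  # censored below 2
print(interval_prob(1.0, 3.0, 2.0, 0.5))        # interval censored
```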

SLIDE 17

[Figure: logistic probability densities with f = 5]

SLIDE 18

EXERCISE

[f̂(x), ŝ(x)] = argmin over (f, s) ∈ F of Σ_{i=1}^N L[a_i, b_i, f(x_i), s(x_i)]

L(a, b, f, s) = −log[ 1 / (1 + e^(−(b−f)/s)) − 1 / (1 + e^(−(a−f)/s)) ]

SLIDE 19

PROBLEM

L(a, b, f, s) is NOT convex in s, but IS convex in t = 1/s ⇒ solve for t
Constraint t > 0 ⇒ solve for log(t) = −log(s)
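A quick numerical illustration of the reparameterization (the censoring interval and test points are hypothetical values of my choosing): a midpoint convexity check of the censored loss along t = 1/s.

```python
import math

def cdf(u):
    # numerically stable logistic CDF
    return 1.0 / (1.0 + math.exp(-u)) if u >= 0 else math.exp(u) / (1.0 + math.exp(u))

def nll_in_t(a, b, f, t):
    """L(a, b, f, s) rewritten with t = 1/s: -log[cdf((b-f)t) - cdf((a-f)t)]."""
    return -math.log(cdf((b - f) * t) - cdf((a - f) * t))

a, b, f = 1.0, 3.0, 0.0          # hypothetical interval-censored observation
t1, t2 = 0.5, 4.0
mid = nll_in_t(a, b, f, 0.5 * (t1 + t2))
avg = 0.5 * (nll_in_t(a, b, f, t1) + nll_in_t(a, b, f, t2))
print(mid <= avg)                # midpoint convexity holds along this segment
```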

SLIDE 20

GRADIENT BOOSTED TREE ENSEMBLES (Ann. Statist. 29, 1189-1232 (2001))

f̂(x) = Σ_{k=1}^{K_f} T_k^(f)(x)
log(ŝ(x)) = Σ_{k=1}^{K_s} T_k^(s)(x)
T_k(x) = CART tree

SLIDE 21

ITERATIVE GRADIENT BOOSTING

Start: ŝ(x) = constant
Loop {
    f̂(x) = tree-boost f(x) given ŝ(x)
    log(ŝ(x)) = tree-boost log(s(x)) given f̂(x)
} until no change
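A toy version of the alternation, assuming no censoring and replacing each CART tree with a constant gradient step, so this is a sketch of the scheme, not Friedman's tree implementation:

```python
import math

def fit_alternating(y, n_outer=10, n_boost=100, lr=0.1):
    """Alternate boosting of f_hat and log(s_hat) on uncensored data.
    Per-observation loss: log(s) + eps + 2*log(1 + e^(-eps)), eps = (y - f)/s.
    Each 'boost' here adds a constant step along the negative mean gradient;
    the real method fits a CART tree to the pointwise negative gradients."""
    f, log_s = sum(y) / len(y), 0.0          # start: s_hat = constant
    for _ in range(n_outer):
        for _ in range(n_boost):             # boost f given s_hat
            s = math.exp(log_s)
            # d(loss)/df = -tanh(eps/2) / s
            g = sum(-math.tanh((yi - f) / (2.0 * s)) / s for yi in y) / len(y)
            f -= lr * g
        for _ in range(n_boost):             # boost log(s_hat) given f_hat
            s = math.exp(log_s)
            # d(loss)/d(log s) = 1 - eps * tanh(eps/2)
            g = sum(1.0 - ((yi - f) / s) * math.tanh((yi - f) / (2.0 * s))
                    for yi in y) / len(y)
            log_s -= lr * g
    return f, math.exp(log_s)

f_hat, s_hat = fit_alternating([-2.0, -1.0, 0.0, 1.0, 2.0])
print(round(f_hat, 4), round(s_hat, 4))      # location near 0 for symmetric data
```

Replacing the constant step with a regression tree fit to the pointwise negative gradients recovers the tree-boost steps of the slide.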

SLIDE 22

DIAGNOSTICS

(1) median[y | f(x)] = f(x)
(2) median[|y − f(x)| | s(x)] = s(x) · log(3)
(3) #(y_i ∈ [u, v] | f_i ∈ [g, h]) = Σ_{f_i ∈ [g,h]} [ 1 / (1 + e^(−(v−f_i)/s_i)) − 1 / (1 + e^(−(u−f_i)/s_i)) ],
    where f_i = f̂(x_i), s_i = ŝ(x_i)
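Diagnostic (3) compares an observed count against its model expectation; a small sketch with hypothetical fitted pairs (f_i, s_i) of my choosing:

```python
import math

def logistic_cdf(t):
    # numerically stable 1 / (1 + e^(-t))
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def expected_count(u, v, fitted):
    """Expected #{y_i in [u, v]}: sum of Pr(u <= y <= v | f_i, s_i)
    over the observations whose predictions fall in the chosen f-slice."""
    return sum(logistic_cdf((v - f) / s) - logistic_cdf((u - f) / s)
               for f, s in fitted)

# Hypothetical (f_i, s_i) pairs for observations with f_i in some slice [g, h]
fitted = [(1.0, 0.5), (1.2, 0.4), (1.4, 0.6)]
print(expected_count(0.5, 1.5, fitted))   # compare with the observed count
```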

SLIDE 23

California Housing Price Data (STATLIB Repository)

N = 20640 CA neighborhoods (1990 census block groups)
y = Median House Value
x = (Median Income, Housing Median Age, Ave No Rooms, Ave No Bedrooms, Population, Ave Occupancy, Latitude, Longitude)

SLIDE 24

[Figure: CA housing prices: histogram of neighborhood medians; y-value vs predicted location f; absolute scaled residual |y − f| / log(3) vs predicted scale; predicted scale vs predicted location]

SLIDE 25

[Figure: CA housing location model: relative importance (income, long, lat, occup, rooms, bedrooms, age, pop); partial dependence of location on income, longitude, and latitude]

SLIDE 26

[Figure: CA housing log(scale) model: relative importance (occup, lat, income, pop, long, rooms, bedrooms, age); partial dependence of log(scale) on occupancy, latitude, and income]

SLIDE 27

QUESTIONNAIRE DATA

N = 8857, p = 13
y = AGE ∈ {14-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-∞}

SLIDE 28

x = (Occupation, Type of Home, Sex, Marital Status, Education, Income, Lived in BA, Dual Incomes, Persons in Household, Persons in Household < 18, Householder Status, Ethnicity, Language)

SLIDE 29

[Figure: AGE predictions: observed probability of each y-bin for observations grouped by predicted location bin (bin counts: 285, 484, 847, 868, 211, 41, 216)]

SLIDE 30

[Figure: AGE: predicted scale vs predicted location]

SLIDE 31

[Figure: AGE location model: relative importance (occ, edu, status, Inc, Mstat, kids, Fsize, Lived); partial dependence of location on occ, edu, and status]

SLIDE 32

[Figure: AGE scale model: relative importance (status, Mstat, occ, Fsize, Lived, kids, edu, dual); partial dependence of log(scale) on status, Mstat, and occ]

SLIDE 33

Wine Quality Data (Irvine Repository)

N = 6497 samples of Portuguese "Vinho Verde"
ỹ = Quality: integer (1, 2, ..., 10), median of at least 3 expert evaluations
ỹ = k ⇒ y ∈ [k − 1/2, k + 1/2]

SLIDE 34

x = (Fixed acidity, Volatile acidity, Citric acid, Residual sugar, Chlorides, Free sulfur dioxide, Total sulfur dioxide, Density, pH, Sulphates, Alcohol)

SLIDE 35

[Figure: Wine quality data: observed probability of each y-bin for observations grouped by predicted location bin (bin counts: 685, 1103, 267)]

SLIDE 36

[Figure: Wine quality data: predicted scale vs predicted location]

SLIDE 37

[Figure: Wine location model: relative importance (alcohol, volatile.acidity, total.sulfur.dioxide, residual.sugar, density, sulphates, chlorides, citric.acid); partial dependence of location on alcohol, volatile.acidity, and total.sulfur.dioxide]

SLIDE 38

[Figure: Wine log(scale) model: relative importance (alcohol, density, free.sulfur.dioxide, fixed.acidity, residual.sugar, chlorides, pH, volatile.acidity); partial dependence of log(scale) on alcohol, density, and free.sulfur.dioxide]

SLIDE 39

ORDERED MULTICLASS LOGISTIC REGRESSION

y_i ∈ {C_1 < C_2 < ... < C_K}
Interval censored: {a_i, b_i} → K disjoint intervals (bins) with boundaries {b_0, b_1, ..., b_K}, b_0 = −∞, b_K = ∞
bins ~ classes with separating boundaries
b = {b_1, b_2, ..., b_(K−1)} unknown (overall location & scale arbitrary)

SLIDE 40

OPTIMAL SCALING (aka ACE)

[b̂, f̂(x), ŝ(x)] = argmin over b and (f, s) ∈ F of Σ_{i=1}^N L[b_(k(i)−1), b_(k(i)), f(x_i), s(x_i)]

L(u, v, f, s) = −log[ 1 / (1 + e^((f−v)/s)) − 1 / (1 + e^((f−u)/s)) ]

Alternating optimization; for the boundaries b, solve for each k:

Σ_{i=1}^N [ 1 / (1 + e^((f(x_i)−b_k)/s(x_i))) − Σ_{j=1}^k I(y_i = c_j) ] = 0

SLIDE 41

[Figure: Wine quality data, optimal scaling: estimated vs original boundaries at input and after iterations 1-3]

SLIDE 42

[Figure: Wine quality data, optimal scaling with a different set of input boundaries: estimated vs original boundaries at input and after iterations 1-3]

SLIDE 43

ASYMMETRIC ERRORS

y | x = f(x) + s_l(x) · ε for ε ≤ 0,  f(x) + s_u(x) · ε for ε > 0
f(x) = mode at x
s_l(x) = lower scale at x
s_u(x) = upper scale at x
ε ~ standard logistic distribution

SLIDE 44

[Figure: asymmetric logistic density with f = 0, s_l = 1/4, s_u = 1]

SLIDE 45

EXERCISE (no censoring)

[f̂(x), ŝ_l(x), ŝ_u(x)] = argmin over (f, s_l, s_u) ∈ F of Σ_{i=1}^N L[y_i, f(x_i), s_l(x_i), s_u(x_i)]

L[y, f, s_l, s_u] = L[y, f, s_l] · I[y − f ≤ 0] + L[y, f, s_u] · I[y − f > 0]
L(y, f, s) = log(s) + (y − f)/s + 2 log(1 + e^(−(y−f)/s))

Iterative gradient boosting
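A minimal sketch of the piecewise loss (helper names mine), reusing the symmetric loss with the scale chosen by the residual's sign:

```python
import math

def loss_sym(y, f, s):
    """log(s) + eps + 2*log(1 + e^(-eps)), eps = (y - f)/s, stable form."""
    eps = (y - f) / s
    return math.log(s) + eps + 2.0 * (max(-eps, 0.0) + math.log1p(math.exp(-abs(eps))))

def loss_asym(y, f, s_l, s_u):
    """Lower scale for residuals at or below f, upper scale above."""
    return loss_sym(y, f, s_l) if y <= f else loss_sym(y, f, s_u)

# With s_l < s_u, a negative residual costs more than a positive one of the same size
print(loss_asym(-1.0, 0.0, 0.25, 1.0), loss_asym(1.0, 0.0, 0.25, 1.0))
```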

SLIDE 46

ASYMMETRIC DIAGNOSTICS

(1) median[y | f(x), s_l(x), s_u(x)] =
    f(x) + s_u(x) log[(3 s_u(x) − s_l(x)) / (s_u(x) + s_l(x))]  if s_l(x) ≤ s_u(x)
    f(x) − s_l(x) log[(3 s_l(x) − s_u(x)) / (s_u(x) + s_l(x))]  if s_l(x) > s_u(x)
(2) median_{y ≤ f(x)}[|y − f(x)| | s_l(x)] / log(3) = s_l(x)
(3) median_{y > f(x)}[|y − f(x)| | s_u(x)] / log(3) = s_u(x)
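Diagnostic (1)'s closed-form median can be cross-checked against the CDF of the two-piece density. The CDF helper below is my own derivation, assuming the density puts mass s_l/(s_l + s_u) below the mode:

```python
import math

def asym_median(f, s_l, s_u):
    """Closed-form median from diagnostic (1)."""
    if s_l <= s_u:
        return f + s_u * math.log((3.0 * s_u - s_l) / (s_u + s_l))
    return f - s_l * math.log((3.0 * s_l - s_u) / (s_u + s_l))

def asym_cdf(y, f, s_l, s_u):
    """CDF of the asymmetric logistic: logistic pieces glued at the mode f,
    normalized so total mass is 1 (mass s_l/(s_l + s_u) falls below f)."""
    c = 2.0 / (s_l + s_u)
    if y <= f:
        return c * s_l / (1.0 + math.exp(-(y - f) / s_l))
    return c * (0.5 * s_l + s_u * (1.0 / (1.0 + math.exp(-(y - f) / s_u)) - 0.5))

m = asym_median(0.0, 0.25, 1.0)
print(round(m, 4), round(asym_cdf(m, 0.0, 0.25, 1.0), 4))   # CDF at the median is 0.5
```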

SLIDE 47

Million Song Dataset (Irvine Repository)

N = 515345 songs (463715 train, 51630 test)
y = year released (1922-2011)
x = 90 attributes (Echo Nest API): 12 timbre averages, 78 timbre covariances

SLIDE 48

[Figure: MSD song release: histograms of release year and predicted mode; predicted mode and predicted median vs release year]

SLIDE 49

[Figure: histograms of predicted lower and upper scales; absolute scaled residual |y − f| / log(3) vs predicted scale for each]

SLIDE 50

[Figure: lower and upper scale vs predicted mode; upper scale vs lower scale]

SLIDE 51

[Figure: truncated logistic probability of release year, 2005-2012]

SLIDE 52

Predict y | x, settle for p(y | x)

1. ROBUST (logistic) estimate: location f(x) & scale s(x)
2. General censoring: a_i ≤ y_i ≤ b_i (ubiquitous)
3. Graphical diagnostics
4. Ordered multi-class classification
5. Asymmetric p(y | x): f(x), s_l(x), s_u(x)
SLIDE 53

REFERENCES

Gradient boosting: Ann. Statist. 29, 1189-1232 (2001)
Optimal scaling (ACE): J. Amer. Statist. Assoc. 80, 580-598 (1985)
GIFI: Nonlinear Multivariate Analysis. Wiley, N.Y. (1990)
Slides: http://statweb.stanford.edu/~jhf/talks/kdd.pdf