ROBUST LOCATION AND SCALE ESTIMATION WITH CENSORED OUTCOMES
Jerome H. Friedman
Stanford University
MACHINE LEARNING
y = F(x, z)
y = outcome variable
x = (x_1, ..., x_p) observed predictor variables
z = (z_1, z_2, ...) other variables
Goal: estimate E[y | x] given data {y_i, x_i}, i = 1, ..., N
STATISTICAL MODEL
y = f(x) + s(x) · ε
f(x) = E[y | x]  location function
s(x) > 0  scale function
ε = random variable, E[ε | x] = 0
Prediction: ŷ = f(x)
s(x) · ε = "irreducible error" (unavoidable)
REDUCIBLE ERROR
r(x) = E | f(x) − f̂(x) |
f(x) = optimal location (target) function
f̂(x) = estimate based on training data & ML method
ML goal: methods to reduce r(x)
Statistics goal: methods to estimate r(x)
Prediction error (y) = Reducible + Irreducible
Usually: Irreducible s(x) >> Reducible r(x)
USUAL ASSUMPTIONS
s(x) = s = constant (homoscedasticity)
ε ∼ N(0, 1) (normality)
HOMOSCEDASTICITY
Requires F(x, z) = f(x) + g(z) (additive) and p(x, z) such that
scale[g(z) | x] = constant
Not very likely
[Figure: P(y | x) densities, loc = 5, with scale = 0.1, 0.25, 0.5, 1]
NORMALITY - not very likely either
Tukey: "small residuals ≃ normal, larger have heavier tails" (heterodistributionality)
Robustness: choose a compromise p̄(ε) with good properties for other distributions
p̄(ε) = normal: not good!
LOGISTIC DISTRIBUTION
ε | x = (y − f(x)) / s(x)
p̄(ε) = e^{−ε} / [ s (1 + e^{−ε})² ]
small |ε| ∼ normal, large |ε| ∼ exponential
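As a quick numerical sanity check (a sketch, not from the slides), the density above integrates to one and has exponential rather than Gaussian tails:

```python
import numpy as np

# Logistic density from the slide: p = e^{-eps} / [s (1 + e^{-eps})^2],
# eps = (y - f)/s. Check that it integrates to 1 on a fine grid.
def logistic_density(y, f=0.0, s=1.0):
    eps = (y - f) / s
    return np.exp(-eps) / (s * (1.0 + np.exp(-eps)) ** 2)

y = np.linspace(-40.0, 40.0, 800_001)
mass = np.sum(logistic_density(y)) * (y[1] - y[0])
print(mass)                      # ~ 1
# for large |eps| the density behaves like e^{-|eps|} (exponential tails)
```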
[Figure: standard logistic density p̄(ε)]
Prediction: ŷ = f̂(x)
f̂(x) = arg min_{f∈F} Σ_{i=1}^{N} [ ε_i + 2 log(1 + e^{−ε_i}) ],  ε_i = (y_i − f(x_i)) / s(x_i)
Minimized at f(x_i) = y_i, independent of s(x_i)
1/s(x_i) ∼ "weight" for obs i: controls the relative influence of obs i on the fit
Using an incorrect s(x) to estimate f(x) increases variance, not bias
⇒ assuming s(x) = constant is usually not too bad
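A small numerical sketch of this criterion (hypothetical values, not the author's code): the per-observation loss is symmetric in ε and minimized at f = y for any s, while 1/s only rescales its curvature (the observation's weight):

```python
import numpy as np

# Per-observation location loss from the slide: eps + 2*log(1 + e^{-eps}).
def location_loss(y, f, s):
    eps = (y - f) / s
    return eps + 2.0 * np.log1p(np.exp(-eps))

y_obs = 3.0
fs = np.linspace(1.0, 5.0, 4001)
for s in (0.5, 1.0, 2.0):
    best = fs[np.argmin(location_loss(y_obs, fs, s))]
    print(s, best)               # minimizer stays at y_obs = 3.0 for every s
```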
ESTIMATE ŝ(x)
(1) Improves f̂(x) in high-variance settings
(2) Important inferential statistic:
    (a) prediction interval ∼ accuracy of ŷ-prediction: logistic IQR[y | f(x)] = 2 s(x) log(3)
    (b) can affect decisions
(3) Crucial with censoring
CENSORING (y-value partially known)
Data: {y_i, x_i}_{i=1}^N → {a_i, b_i, x_i}_{i=1}^N with a_i ≤ y_i ≤ b_i
a_i = b_i = y_i ⇒ y-value known
a_i = −∞ ⇒ censored below b_i
b_i = ∞ ⇒ censored above a_i
Otherwise: interval censored [a_i, b_i]
Special Case
{a_i, b_i} → K disjoint intervals (bins):
K = 2 ⇒ usual binary logistic regression
K > 2 ⇒ ordered multiclass logistic regression
LIKELIHOOD
Pr(a ≤ y ≤ b) = 1/(1 + e^{−(b−f)/s}) − 1/(1 + e^{−(a−f)/s})
Depends strongly on both f and s
⇒ need to estimate both f(x) and s(x)
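A minimal sketch of this likelihood term (helper names are mine, not the author's): each censored observation contributes −log Pr(a ≤ y ≤ b); the encodings a = −∞ and b = ∞ cover below- and above-censoring, while an exact observation (a = b) needs the density instead:

```python
import numpy as np

def logistic_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

# -log Pr(a <= y <= b) for one observation under logistic(f, s).
def censored_nll(a, b, f, s):
    p = logistic_cdf((b - f) / s) - logistic_cdf((a - f) / s)
    return -np.log(p)

# right-censored at the location: Pr(y >= f) = 1/2
val = censored_nll(0.0, np.inf, 0.0, 1.0)
print(val)                       # log(2)
```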
[Figure: logistic density p(y | x), f = 5]
EXERCISE
[f̂(x), ŝ(x)] = arg min_{(f,s)∈F} Σ_{i=1}^N L[a_i, b_i, f(x_i), s(x_i)]
L(a, b, f, s) = − log [ 1/(1 + e^{−(b−f)/s}) − 1/(1 + e^{−(a−f)/s}) ]
PROBLEM
L(a, b, f, s) is NOT convex in s, but IS convex in t = 1/s ⇒ solve for t
Constraint t > 0 ⇒ solve for log(t) = − log(s)
GRADIENT BOOSTED TREE ENSEMBLES
Friedman, Ann. Statist. 29, 1189-1232 (2001)
f̂(x) = Σ_{k=1}^{K_f} T_k^{(f)}(x)
log(ŝ(x)) = Σ_{k=1}^{K_s} T_k^{(s)}(x)
T_k(x) = CART tree
ITERATIVE GRADIENT BOOSTING
Start: ŝ(x) = constant
Loop {
    f̂(x) = tree-boost f(x) given ŝ(x)
    log(ŝ(x)) = tree-boost log(s(x)) given f̂(x)
} Until no change
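The loop above can be sketched end to end. This is an illustrative stand-in under my own assumptions (1-D x, no censoring, 10-bin piecewise-constant fits in place of CART trees), not the author's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, n)
y = 2.0 * x + (0.2 + 0.8 * x) * rng.logistic(size=n)   # scale grows with x
leaf = np.minimum((x * 10).astype(int), 9)             # fixed 10-leaf "tree"

def loss(f, log_s):
    eps = (y - f) / np.exp(log_s)
    return np.mean(log_s + eps + 2.0 * np.log1p(np.exp(-eps)))

def fit_leaves(neg_grad, nu=0.1):
    # one boosting step: leaf-wise mean of the negative gradient, shrunk by nu
    return nu * np.array([neg_grad[leaf == b].mean() for b in range(10)])[leaf]

f = np.full(n, np.median(y))       # start: constant location
log_s = np.zeros(n)                # start: s(x) = 1
before = loss(f, log_s)

for _ in range(20):                # outer loop (fixed here; "until no change")
    for _ in range(10):            # boost f given s
        eps = (y - f) / np.exp(log_s)
        f = f + fit_leaves(np.tanh(eps / 2.0) / np.exp(log_s))
    for _ in range(10):            # boost log(s) given f
        eps = (y - f) / np.exp(log_s)
        log_s = log_s + fit_leaves(eps * np.tanh(eps / 2.0) - 1.0)

print(before, loss(f, log_s))      # loss decreases; s-hat grows with x
```

Each inner pass fits the negative gradient of the logistic loss: tanh(ε/2)/s for f, and ε·tanh(ε/2) − 1 for log(s).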
DIAGNOSTICS
(1) median[y | f(x)] = f(x)
(2) median[ |y − f(x)| | s(x) ] = s(x) · log(3)
(3) #(y_i ∈ [u, v] | f_i ∈ [g, h]) = Σ_{f_i∈[g,h]} [ 1/(1 + e^{−(v−f_i)/s_i}) − 1/(1 + e^{−(u−f_i)/s_i}) ]
    (f_i = f̂(x_i), s_i = ŝ(x_i))
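Diagnostic (2) rests on the fact that the median absolute value of a standard logistic variate is log(3); a quick simulation check (a sketch, not from the slides):

```python
import numpy as np

# median |eps| / log(3) for standard logistic eps should be 1
rng = np.random.default_rng(1)
eps = rng.logistic(size=200_000)
ratio = np.median(np.abs(eps)) / np.log(3.0)
print(ratio)                     # ~ 1
```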
California Housing Price Data (STATLIB Repository)
N = 20460 CA neighborhoods (1990 census block groups)
y = Median House Value
x = (Median Income, Housing Median Age, Ave No Rooms, Ave No Bedrooms, Population, Ave Occupancy, Latitude, Longitude)
CA housing prices
[Figure: histogram of neighborhood medians; y vs. predicted location f; abs((y − f)/log(3)) vs. predicted scale; predicted scale vs. predicted location]
CA housing: location model
[Figure: relative importances (income, long, lat, occup, rooms, bedrooms, age, pop) and partial dependence of location on income, long, lat]
CA housing: log(scale) model
[Figure: relative importances (occup, lat, income, pop, long, rooms, bedrooms, age) and partial dependence of log(scale) on occup, lat, income]
QUESTIONNAIRE DATA
N = 8857, p = 13
y = AGE ∈ {14-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-∞}
x = (Occupation, Type of Home, Sex, Marital Status, Education, Income, Lived in BA, Dual Incomes, Persons in Household, Persons in Household < 18, Householder Status, Ethnicity, Language)
[Figure: probability of y in each AGE bin for observations grouped by predicted f - bin 1 (n = 285), bin 2 (484), bin 3 (847), bin 4 (868), bin 5 (211), bin 6 (41), bin 7 (216)]
AGE predictions
[Figure: predicted scale vs. predicted location]
AGE: location model
[Figure: relative importances (occ, edu, status, Inc, Mstat, kids, Fsize, Lived) and partial dependence of location on occ, edu, status]
AGE: scale model
[Figure: relative importances (status, Mstat, occ, Fsize, Lived, kids, edu, dual) and partial dependence of log(scale) on status, Mstat, occ]
Wine Quality Data (Irvine Repository)
N = 6497 samples of Portuguese "Vinho Verde"
ỹ = Quality: integer (1, 2, ..., 10), median of at least 3 expert evaluations
ỹ = k ⇒ y ∈ [k − 1/2, k + 1/2]
x = (Fixed acidity, Volatile acidity, Citric acid, Residual sugar, Chlorides, Free sulfur dioxide, Total sulfur dioxide, Density, pH, Sulfates, Alcohol)
[Figure: probability of y in each quality bin for observations grouped by predicted f - bin 3 (n = 685), bin 4 (1103), bin 5 (267)]
Wine quality data
[Figure: predicted scale vs. predicted location]
Wine: location model
[Figure: relative importances (alcohol, volatile.acidity, total.sulfur.dioxide, residual.sugar, density, sulphates, chlorides, citric.acid) and partial dependence of location on alcohol, volatile.acidity, total.sulfur.dioxide]
Wine: log(scale) model
[Figure: relative importances (alcohol, density, free.sulfur.dioxide, fixed.acidity, residual.sugar, chlorides, pH, volatile.acidity) and partial dependence of log(scale) on alcohol, density, free.sulfur.dioxide]
ORDERED MULTICLASS LOGISTIC REGRESSION
y_i ∈ {C_1 < C_2 < ... < C_K}
Interval censored: {a_i, b_i} → K disjoint intervals (bins) with boundaries {b_0, b_1, ..., b_K}, b_0 = −∞, b_K = ∞
Bins ∼ classes with separating boundaries
b = {b_1, b_2, ..., b_{K−1}} unknown (overall location & scale arbitrary)
OPTIMAL SCALING (aka ACE)
[b̂, f̂(x), ŝ(x)] = arg min_{b, (f,s)∈F} Σ_{i=1}^N L[b_{k(i)−1}, b_{k(i)}, f(x_i), s(x_i)]
L(u, v, f, s) = − log [ 1/(1 + e^{(f−v)/s}) − 1/(1 + e^{(f−u)/s}) ]
Alternating optimization, update for b_k:
Σ_{i=1}^N [ 1/(1 + e^{(f(x_i)−b_k)/s(x_i)}) − Σ_{j=1}^k I(y_i = C_j) ] = 0
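The b_k update can be solved by one-dimensional root finding, since the left side is monotone increasing in b_k. A sketch with hypothetical helper names (not the author's code):

```python
import numpy as np

# Solve sum_i [ 1/(1 + exp((f_i - b_k)/s_i)) - I(y_i <= C_k) ] = 0 for b_k
# by bisection; the sum is monotone increasing in b_k.
def solve_boundary(f, s, at_or_below_k, lo=-50.0, hi=50.0):
    target = np.sum(at_or_below_k)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (1.0 + np.exp((f - mid) / s))) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# toy check: f = 0, s = 1, half the observations at or below class k => b_k = 0
b_k = solve_boundary(np.zeros(100), np.ones(100), np.arange(100) < 50)
print(b_k)                       # ~ 0
```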
[Figure: original vs. input boundaries, then original vs. estimated boundaries after iterations 1, 2, 3]
Wine quality data - optimal scaling
[Figure: original vs. input boundaries, then original vs. estimated boundaries after iterations 1, 2, 3]
ASYMMETRIC ERRORS
y | x = f(x) + s_l(x) · ε if ε ≤ 0,  f(x) + s_u(x) · ε if ε > 0
f(x) = mode at x
s_l(x) = lower scale at x
s_u(x) = upper scale at x
ε ∼ standard logistic distribution
[Figure: asymmetric logistic density, f = 0, s_l = 1/4, s_u = 1]
EXERCISE (no censoring)
[f̂(x), ŝ_l(x), ŝ_u(x)] = arg min_{(f,s_l,s_u)∈F} Σ_{i=1}^N L[y_i, f(x_i), s_l(x_i), s_u(x_i)]
L[y, f, s_l, s_u] = L(y, f, s_l) · I[y − f ≤ 0] + L(y, f, s_u) · I[y − f > 0]
L(y, f, s) = log(s) + (y − f)/s + 2 log(1 + e^{−(y−f)/s})
Iterative gradient boosting
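A sketch of this loss (function names are mine): the one-piece logistic loss L(y, f, s) is applied with s_l below the mode and s_u above it, and reduces to the symmetric case when s_l = s_u:

```python
import numpy as np

def one_sided_nll(y, f, s):
    # symmetric logistic loss from the slide: log(s) + eps + 2*log(1 + e^{-eps})
    eps = (y - f) / s
    return np.log(s) + eps + 2.0 * np.log1p(np.exp(-eps))

def asymmetric_nll(y, f, sl, su):
    return np.where(y - f <= 0.0,
                    one_sided_nll(y, f, sl),
                    one_sided_nll(y, f, su))

# s_l = s_u recovers the symmetric loss
print(asymmetric_nll(1.0, 0.0, 2.0, 2.0), one_sided_nll(1.0, 0.0, 2.0))
```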
ASYMMETRIC DIAGNOSTICS
(1) median[y | f(x), s_l(x), s_u(x)] =
    f(x) + s_u(x) · log[ (3 s_u(x) − s_l(x)) / (s_u(x) + s_l(x)) ],  s_l(x) ≤ s_u(x)
    f(x) − s_l(x) · log[ (3 s_l(x) − s_u(x)) / (s_u(x) + s_l(x)) ],  s_l(x) > s_u(x)
(2) median_{y ≤ f(x)}[ |y − f(x)| | s_l(x) ] / log(3) = s_l(x)
(3) median_{y > f(x)}[ |y − f(x)| | s_u(x) ] / log(3) = s_u(x)
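Diagnostic (1) can be checked by simulation. The sampler below assumes the two-piece logistic density these formulas imply (the side of the mode is chosen with probability proportional to that side's scale, then a half-logistic magnitude is drawn); the constants are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
f, sl, su = 0.0, 0.25, 1.0
n = 400_000
lower = rng.random(n) < sl / (sl + su)       # lower side w.p. s_l/(s_l + s_u)
mag = np.abs(rng.logistic(size=n))           # half-logistic magnitude
y = np.where(lower, f - sl * mag, f + su * mag)

pred = f + su * np.log((3.0 * su - sl) / (su + sl))   # s_l <= s_u branch
print(np.median(y), pred)                    # should agree closely
```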
(1/2) Million Song Dataset (Irvine Repository)
N = 515345 songs (463715 train, 51630 test)
y = year released (1922-2011)
x = 90 attributes (Echo Nest API): 12 timbre averages, 78 timbre covariances
MSD song release
[Figure: histogram of release years; histogram of predicted mode year | x; release year vs. predicted mode; release year vs. predicted median]
[Figure: histograms of predicted lower and upper scales; abs((y − f)/log(3)) vs. predicted scale; lower and upper scales vs. predicted mode; upper scale vs. lower scale]
[Figure: truncated logistic - probability by year, 2005-2012]
SUMMARY: predict y | x, settle for p(y | x)
1. ROBUST (logistic) estimates: location f(x) & scale s(x)
2. General censoring: a_i ≤ y_i ≤ b_i (ubiquitous)
3. Graphical diagnostics
4. Ordered multiclass classification
5. Asymmetric p(y | x): f(x), s_l(x), s_u(x)