An improper estimator with optimal excess risk in misspecified density estimation and logistic regression


  1. An improper estimator with optimal excess risk in misspecified density estimation and logistic regression
Jaouad Mourtada∗, Stéphane Gaïffas†
StatMathAppli 2019, Fréjus
∗ CMAP, École polytechnique; † LPSM, Université Paris-Diderot
On arXiv soon.

  2. Predictive density estimation

  3. Predictive density estimation: setting
• Space $\mathcal{Z}$; i.i.d. sample $Z_1^n = (Z_1, \dots, Z_n) \sim P^n$, with $P$ an unknown distribution on $\mathcal{Z}$.
• Given $Z_1^n$, predict a new sample $Z \sim P$ (probabilistic prediction).
• $f$ a density on $\mathcal{Z}$ (w.r.t. a base measure $\mu$), $z \in \mathcal{Z}$; log-loss $\ell(f, z) = -\log f(z)$; risk $R(f) = \mathbb{E}[\ell(f, Z)]$ where $Z \sim P$.
• Family $\mathcal{F}$ of densities on $\mathcal{Z}$ = statistical model.
• Goal: find a density $\hat g_n = \hat g_n(Z_1^n)$ with small excess risk $\mathbb{E}[R(\hat g_n)] - \inf_{f \in \mathcal{F}} R(f)$.

  4. On the logarithmic loss $\ell(f, z) = -\log f(z)$
• Standard loss function, connected to lossless compression;
• Minimizing the risk amounts to maximizing the joint probability assigned to a large test sample $(Z'_1, \dots, Z'_m) \sim P^m$:
$$\prod_{j=1}^{m} f(Z'_j) = \exp\Big(-\sum_{j=1}^{m} \ell(f, Z'_j)\Big) = \exp\big[-m\,(R(f) + o(1))\big].$$
• Letting $p = dP/d\mu$ be the true density,
$$R(f) - R(p) = \mathbb{E}_{Z \sim P}\Big[\log \frac{p(Z)}{f(Z)}\Big] =: \mathrm{KL}(p, f) \geqslant 0.$$
The risk is minimized by the true density, $f^* = p$, and the excess risk is the Kullback-Leibler divergence (relative entropy).
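As a sanity check on the identity $R(f) - R(p) = \mathrm{KL}(p, f)$, a minimal Monte Carlo sketch with two Gaussian densities; the particular densities and sample size are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# True density p = N(0, 1); misspecified candidate f = N(0.5, 1).
mu_f = 0.5
z = rng.normal(0.0, 1.0, size=1_000_000)  # Z ~ P

def log_gauss(z, mu):
    return -0.5 * (z - mu) ** 2 - 0.5 * np.log(2 * np.pi)

# Empirical excess risk R(f) - R(p) = E[log p(Z) - log f(Z)].
excess = np.mean(log_gauss(z, 0.0) - log_gauss(z, mu_f))

# Closed-form KL(N(0,1), N(mu_f,1)) = mu_f^2 / 2.
print(excess, mu_f ** 2 / 2)  # both ~ 0.125
```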

  5. Well-specified case: asymptotic optimality of the MLE
Here, assume that $p \in \mathcal{F}$ (well-specified model), with $\mathcal{F}$ a regular parametric family/model of dimension $d$. The Maximum Likelihood Estimator (MLE) $\hat f_n$, defined by
$$\hat f_n := \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell(f, Z_i) = \operatorname*{argmax}_{f \in \mathcal{F}} \prod_{i=1}^{n} f(Z_i),$$
satisfies, as $n \to \infty$,
$$R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f) = \mathrm{KL}(p, \hat f_n) = \frac{d}{2n} + o\Big(\frac{1}{n}\Big).$$
The $d/(2n)$ rate is asymptotically optimal (locally asymptotically minimax, in the sense of Hájek and Le Cam): the MLE is efficient.
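For a concrete instance, in the Gaussian location model $\{\mathcal{N}(\mu, I_d) : \mu \in \mathbb{R}^d\}$ one has $\mathrm{KL}(p, \hat f_n) = \|\hat\mu_n - \mu\|^2/2$ in closed form, and a short simulation recovers the $d/(2n)$ rate; the dimension and sample sizes below are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_reps = 5, 200, 2000
mu = np.zeros(d)  # true mean; p = N(mu, I_d) lies in the model

kl = np.empty(n_reps)
for r in range(n_reps):
    sample = rng.normal(size=(n, d)) + mu
    mu_hat = sample.mean(axis=0)              # MLE of the mean
    kl[r] = 0.5 * np.sum((mu_hat - mu) ** 2)  # KL(p, f_hat) in closed form

print(kl.mean(), d / (2 * n))  # both ~ 0.0125
```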

  6. Misspecified case (statistical learning viewpoint)
The assumption $p \in \mathcal{F}$ is restrictive and generally not satisfied: the model is chosen by the statistician, as a simplification of the truth. General misspecified case where $p \notin \mathcal{F}$: the model $\mathcal{F}$ is false but useful, and the excess risk remains a relevant objective.
The MLE $\hat f_n$ can degrade under model misspecification:
$$R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f) = \frac{d_{\mathrm{eff}}}{2n} + o\Big(\frac{1}{n}\Big),$$
where $d_{\mathrm{eff}} = \mathrm{Tr}[H^{-1} G]$, with $G = \mathbb{E}[\nabla \ell(f^*, Z)\, \nabla \ell(f^*, Z)^\top]$ and $H = \nabla^2 R(f^*)$.
In the misspecified case, $d_{\mathrm{eff}}$ depends on $P$, and we may have $d_{\mathrm{eff}} \gg d$ (see the sketch below).
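To make $d_{\mathrm{eff}} = \mathrm{Tr}[H^{-1}G]$ concrete, one can estimate $G$ and $H$ by Monte Carlo for the Gaussian linear model used later in the talk, where $H = \mathbb{E}[XX^\top]$ and $G = \mathbb{E}[(Y - \langle \beta^*, X \rangle)^2 XX^\top]$; the heteroscedastic noise below is an illustrative choice of misspecified $P$, not an example from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 500_000

X = rng.normal(size=(n, d))
# Misspecified P: the noise scale depends on ||X||, so Y|X is not N(<b*, X>, 1),
# yet beta_star still minimizes R over the model (the noise is centered given X).
beta_star = np.ones(d)
noise_sd = np.linalg.norm(X, axis=1)
Y = X @ beta_star + noise_sd * rng.normal(size=n)

resid = Y - X @ beta_star
H = X.T @ X / n                           # Hessian of R at beta*
G = (X * resid[:, None] ** 2).T @ X / n   # covariance of the loss gradient
d_eff = np.trace(np.linalg.solve(H, G))
print(d_eff, d)  # ~35 = d(d+2) here, vs d = 5; they coincide when well-specified
```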

  7. Cumulative risk/regret and online-to-batch conversion
Well-established theory (Merhav 1998; Cesa-Bianchi & Lugosi 2006) for controlling the cumulative excess risk
$$\mathrm{Regret}_n = \sum_{t=1}^{n} \ell(\hat g_{t-1}, Z_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f, Z_t);$$
for a bounded family $\mathcal{F}$, the minimax regret is $(d \log n)/2 + O(1)$. This implies an excess risk of $(d \log n)/(2n) + O(1/n)$ for the averaged predictor
$$\bar g_n = \frac{1}{n+1} \sum_{t=0}^{n} \hat g_t.$$
⊕ Valid under model misspecification (distribution-free);
⊖ Suboptimal rate for individual risk, inefficient predictor; infinite regret for unbounded families (e.g. Gaussian); computational complexity.
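The conversion itself is mechanical: run the sequential predictor on growing prefixes of the data and average the resulting densities into one mixture. A minimal sketch for the Bernoulli model, with Laplace's add-one rule as an illustrative choice of sequential predictor $\hat g_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.binomial(1, 0.7, size=100)  # i.i.d. Bernoulli(0.7) sample

def g_t(t):
    """Sequential predictor after seeing z_1..z_t: Laplace's add-one rule."""
    p1 = (z[:t].sum() + 1) / (t + 2)
    return np.array([1 - p1, p1])   # density on {0, 1}

n = len(z)
# Online-to-batch: the averaged predictor is the mixture of g_0, ..., g_n.
g_bar = np.mean([g_t(t) for t in range(n + 1)], axis=0)
print(g_bar)  # still a proper density, since it averages densities
```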

  8. The Sample Minimax Predictor

  9. The Sample Minimax Predictor (SMP)
We introduce the Sample Minimax Predictor, given by:
$$\hat f_n = \operatorname*{argmin}_{g} \sup_{z \in \mathcal{Z}} \big[\ell(g, z) - \ell(\hat f_n^z, z)\big], \qquad \hat f_n(z) = \frac{\hat f_n^z(z)}{\int_{\mathcal{Z}} \hat f_n^{z'}(z')\, \mu(dz')},$$
where
$$\hat f_n^z = \operatorname*{argmin}_{f \in \mathcal{F}} \Big\{\sum_{i=1}^{n} \ell(f, Z_i) + \ell(f, z)\Big\}.$$
• In general, $\hat f_n \notin \mathcal{F}$: improper predictor.
• Conditional variant $\hat f_n(y \mid x)$ for conditional density estimation.
• Regularized variant.
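For intuition, the SMP has a simple closed form in the Bernoulli model $\mathcal{F} = \{\mathrm{Ber}(\theta) : \theta \in [0,1]\}$: the augmented-sample MLE is $\hat\theta^z = (S_n + z)/(n+1)$ with $S_n = \sum_i Z_i$, and normalizing over $z \in \{0, 1\}$ yields Laplace's add-one rule $(S_n + 1)/(n + 2)$. A minimal sketch of that computation (this worked example is mine, following the definition above):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.binomial(1, 0.3, size=20)
n, S = len(z), z.sum()

def f_z(virtual):
    """Augmented-sample MLE density evaluated at the virtual point z = virtual."""
    theta = (S + virtual) / (n + 1)
    return theta if virtual == 1 else 1 - theta

num = np.array([f_z(0), f_z(1)])
smp = num / num.sum()             # normalize over z' in {0, 1}
print(smp[1], (S + 1) / (n + 2))  # SMP coincides with Laplace's add-one rule
```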

  10. Excess risk bound for the SMP
$$\hat f_n(z) = \frac{\hat f_n^z(z)}{\int_{\mathcal{Z}} \hat f_n^{z'}(z')\, \mu(dz')} \qquad (1)$$
Theorem (M., Gaïffas, Scornet, 2019). The SMP $\hat f_n$ of (1) satisfies:
$$\mathbb{E}\big[R(\hat f_n)\big] - \inf_{f \in \mathcal{F}} R(f) \leqslant \mathbb{E}_{Z_1^n}\Big[\log \int_{\mathcal{Z}} \hat f_n^z(z)\, \mu(dz)\Big]. \qquad (2)$$
• Analogous excess risk bound in the conditional case.
• Typically a simple $d/n + o(n^{-1})$ bound for standard models (Gaussian, multinomial), even in the misspecified case.

  11. Application: Gaussian linear model

  12. Gaussian linear model
• Conditional density estimation problem.
• Probabilistic prediction of a response $Y \in \mathbb{R}$ given covariates $X \in \mathbb{R}^d$. The risk of a conditional density $f(y \mid x)$ is $R(f) = \mathbb{E}[\ell(f(X), Y)] = \mathbb{E}[-\log f(Y \mid X)]$.
• $\mathcal{F} = \{f_\beta : \beta \in \mathbb{R}^d\}$ with $f_\beta(\cdot \mid x) = \mathcal{N}(\langle \beta, x \rangle, 1)$, so that (up to an additive constant) $\ell(f_\beta, (x, y)) = \frac{1}{2}(y - \langle \beta, x \rangle)^2$.
• The MLE is $\hat f_n(\cdot \mid x) = \mathcal{N}(\langle \hat\beta_n, x \rangle, 1)$, with $\hat\beta_n$ the ordinary least squares estimator:
$$\hat\beta_n = \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \sum_{i=1}^{n} (Y_i - \langle \beta, X_i \rangle)^2 = \Big(\sum_{i=1}^{n} X_i X_i^\top\Big)^{-1} \sum_{i=1}^{n} Y_i X_i.$$

  13. SMP for the Gaussian linear model
Let $\Sigma = \mathbb{E}[XX^\top]$ and $\hat\Sigma_n = n^{-1}\sum_{i=1}^{n} X_i X_i^\top$ denote the true/sample covariance matrices.
Theorem (Distribution-free excess risk for SMP). The SMP is $\hat f_n(\cdot \mid x) = \mathcal{N}\big(\langle \hat\beta_n, x \rangle,\ 1 + \langle (n\hat\Sigma_n)^{-1} x, x \rangle\big)$. If $\mathbb{E}[Y^2] < +\infty$, then
$$\mathbb{E}\big[R(\hat f_n)\big] - \inf_{\beta \in \mathbb{R}^d} R(\beta) \leqslant \mathbb{E}\Big[-\log\Big(1 - \underbrace{\big\langle (n\hat\Sigma_n + XX^\top)^{-1} X,\, X \big\rangle}_{\text{"leverage score"}}\Big)\Big],$$
which is twice the minimax risk in the well-specified case.
• Smaller than $\mathbb{E}[\mathrm{Tr}(\Sigma^{1/2} \hat\Sigma_n^{-1} \Sigma^{1/2})]/n \sim d/n$ under a regularity assumption on $P_X$ ($\Sigma^{-1/2} X$ not too close to any hyperplane).
• By contrast, for the MLE: $\mathbb{E}[R(\hat f_n)] - R(\beta^*) \sim \mathbb{E}\big[(Y - \langle \beta^*, X \rangle)^2 \|\Sigma^{-1/2} X\|^2\big]/(2n)$.
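A minimal sketch of this predictor on synthetic data (the data-generating process is an illustrative choice): relative to the plug-in MLE $\mathcal{N}(\langle \hat\beta_n, x \rangle, 1)$, the only change is the inflated predictive variance $1 + \langle (n\hat\Sigma_n)^{-1} x, x \rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
Y = X @ beta_true + rng.normal(size=n)

Sigma_n = X.T @ X / n                             # sample covariance (uncentered)
beta_hat = np.linalg.solve(n * Sigma_n, X.T @ Y)  # OLS / Gaussian MLE

def smp_density(x, y):
    """SMP conditional density: N(<beta_hat, x>, 1 + <(n Sigma_n)^{-1} x, x>)."""
    var = 1.0 + x @ np.linalg.solve(n * Sigma_n, x)
    return np.exp(-0.5 * (y - beta_hat @ x) ** 2 / var) / np.sqrt(2 * np.pi * var)

x_test = rng.normal(size=d)
print(smp_density(x_test, 0.0))  # heavier-tailed than the MLE plug-in N(., 1)
```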

  14. Application to logistic regression

  15. Logistic regression: setting
• Binary label $Y \in \{-1, 1\}$, covariates $X \in \mathbb{R}^d$. The risk of a conditional density $f(\pm 1 \mid x)$ is $R(f) = \mathbb{E}[-\log f(Y \mid X)]$.
• $\mathcal{F} = \{f_\beta : \beta \in \mathbb{R}^d\}$, a family of conditional densities of $Y \mid X$:
$$f_\beta(y \mid x) = P_\beta(Y = y \mid X = x) = \sigma(y \langle \beta, x \rangle), \quad y \in \{-1, 1\},$$
with $\sigma(u) = e^u/(1 + e^u)$ the sigmoid function. For $\beta, x \in \mathbb{R}^d$ and $y \in \{\pm 1\}$,
$$\ell(\beta, (x, y)) = \log\big(1 + e^{-y \langle \beta, x \rangle}\big).$$
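In code, the logistic log-loss and its gradient take a few lines, and the SMP sketch on a later slide reuses them; a minimal numpy sketch with function names of my choosing:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def logistic_loss(beta, X, y):
    """Sum over the sample of log(1 + exp(-y_i <beta, x_i>)), with y_i in {-1, +1}."""
    margins = y * (X @ beta)
    return np.sum(np.logaddexp(0.0, -margins))  # numerically stable log(1 + e^{-m})

def logistic_grad(beta, X, y):
    """Gradient of the loss: -sum_i sigma(-y_i <beta, x_i>) y_i x_i."""
    margins = y * (X @ beta)
    return -(X * (y * sigmoid(-margins))[:, None]).sum(axis=0)
```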

  16. Limitations of MLE and proper (plug-in) predictors
• The MLE $f_{\hat\beta_n}(y \mid x) = \sigma(y \langle \hat\beta_n, x \rangle)$ is not fully satisfying for prediction:
– Ill-defined when the sets $\{X_i : Y_i = 1\}$ and $\{X_i : Y_i = -1\}$ are linearly separated, where it yields 0 or 1 probabilities ($\Rightarrow$ infinite risk).
– Risk $d_{\mathrm{eff}}/(2n)$; if $\|X\| \leqslant R$, $d_{\mathrm{eff}}$ may be as large as $d\, e^{\|\beta^*\| R}$.¹
• Lower bound (Hazan et al., 2014) of $\min(BR/\sqrt{n},\ d\, e^{BR}/n)$ for any proper (within class) predictor.
• Better $O(d \log(BRn)/n)$ rate through online-to-batch conversion, with an improper predictor (Foster et al., 2018). But computationally expensive (posterior sampling).
¹ Bach & Moulines (2013); see also Ostrovskii & Bach (2018).

  17. Sample Minimax Predictor for logistic regression
The SMP writes:
$$\hat f_n(y \mid x) = \frac{\hat f_n^{(x,y)}(y \mid x)}{\hat f_n^{(x,-1)}(-1 \mid x) + \hat f_n^{(x,1)}(1 \mid x)},$$
where $\hat f_n^{(x,y)}$ is the MLE obtained when adding $(x, y)$ to the sample.
• Well-defined, even in the separated case; invariant under linear transformations of $X$ ("prior-free"). Never outputs a 0 probability.
• Computationally reasonable: the prediction is obtained by solving two logistic regressions (replaces sampling by optimization); see the sketch below.
• NB: still more expensive than simple logistic regression (the logistic regression solution must be updated for each test input $x$).
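A minimal sketch of the logistic SMP, using the ridge penalty of the next slide so that each refit is always well-posed; the dataset, the optimizer, and the plug-in estimate of $R$ are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic(X, y, lam):
    """Penalized logistic MLE: argmin_b sum_i log(1+exp(-y_i<b,x_i>)) + lam||b||^2/2."""
    def obj(b):
        m = y * (X @ b)
        return np.sum(np.logaddexp(0.0, -m)) + 0.5 * lam * b @ b
    def grad(b):
        m = y * (X @ b)
        return -(X * (y * sigmoid(-m))[:, None]).sum(axis=0) + lam * b
    return minimize(obj, np.zeros(X.shape[1]), jac=grad, method="L-BFGS-B").x

def smp_predict(X, y, x, lam):
    """SMP probability of y = +1 at x: two refits with virtual points (x, -1), (x, +1)."""
    num = {}
    for y_virt in (-1.0, 1.0):
        X_aug = np.vstack([X, x])
        y_aug = np.append(y, y_virt)
        b = fit_logistic(X_aug, y_aug, lam)
        num[y_virt] = sigmoid(y_virt * (x @ b))  # f^{(x, y_virt)}(y_virt | x)
    return num[1.0] / (num[1.0] + num[-1.0])

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = np.where(X @ np.ones(d) + rng.normal(size=n) > 0, 1.0, -1.0)
# lambda = 2R^2/(n+1), with R estimated by max_i ||X_i|| (an approximation).
lam = 2 * np.max(np.linalg.norm(X, axis=1)) ** 2 / (n + 1)
print(smp_predict(X, y, rng.normal(size=d), lam))  # strictly inside (0, 1)
```

Each prediction costs two regularized logistic fits, matching the "replaces sampling by optimization" point above.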

  18. Excess risk bound for the penalized SMP
Theorem (M., Gaïffas, Scornet, 2019). Assume that $\|X\| \leqslant R$ a.s. and let $\lambda = 2R^2/(n+1)$. Then the logistic SMP with penalty $\lambda \|\beta\|^2/2$ satisfies, for every $\beta \in \mathbb{R}^d$:
$$\mathbb{E}\big[R(\hat f_{\lambda,n})\big] - R(\beta) \leqslant 3\,\frac{d + \|\beta\|^2 R^2}{n}. \qquad (3)$$
Remark. Fast rate under no assumption on $\mathcal{L}(Y \mid X)$.
If $R = O(\sqrt{d})$ and $\|\beta^*\| = O(1)$, this gives the optimal $O(d/n)$ excess risk.
Recall the $\min(BR/\sqrt{n},\ d\, e^{BR}/n)$ lower bound for proper predictors (incl. Ridge logistic regression). Also better than the $O(d \log n / n)$ rate from online-to-batch conversion, but with a worse dependence on $\|\beta^*\|$.

  19. Conclusion

  20. Conclusion
Sample Minimax Predictor: a procedure for predictive density estimation. General excess risk bound, which typically does not degrade under model misspecification.
Gaussian linear model: tight bound, within a factor of 2 of minimax.
Logistic regression: simple predictor; bypasses lower bounds for proper (plug-in) predictors (removes the exponential factor for worst-case distributions).
Next directions:
• Other GLMs?
• Online logistic regression (individual sequences)?
• Application to statistical learning with other loss functions?

  21. Thank you!
