a chaining algorithm for online nonparametric regression


  1. a chaining algorithm for online nonparametric regression. Pierre Gaillard, December 2, 2015, University of Copenhagen. This is joint work with Sébastien Gerchinovitz.

  2. table of contents: 1. Online prediction of arbitrary sequences. 2. Finite reference class: prediction with expert advice. 3. Large reference class. 4. Extensions, current (and future) work.

  3. online prediction of arbitrary sequences

  4. the framework of this talk. Sequential prediction of arbitrary time series [N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. 2006]:
     - a time series $y_1, \dots, y_n \in Y = [-B, B]$ is to be predicted step by step;
     - covariates $x_1, \dots, x_n \in X$ are sequentially available.
     At each forecasting instance $t = 1, \dots, n$:
     - the environment reveals $x_t \in X$;
     - the player is asked to form a prediction $\hat{y}_t$ of $y_t$ based on the past observations $y_1, \dots, y_{t-1}$ and the current and past covariates $x_1, \dots, x_t$;
     - the environment reveals $y_t$.
     Goal: minimize the cumulative loss $\hat{L}_n = \sum_{t=1}^n (\hat{y}_t - y_t)^2$.
     Difficulty: no stochastic assumption on the time series, neither on the observations $(y_t)$ nor on the covariates $(x_t)$.

  5. the framework of this talk. Sequential prediction of arbitrary time series: at each instance $t = 1, \dots, n$, the environment reveals $x_t \in X$, the player predicts $\hat{y}_t$, and the environment reveals $y_t \in Y = [-B, B]$.
     Solution: produce the prediction as a function of the covariate, $\hat{y}_t = \hat{f}_t(x_t)$.
     Goal: minimize our regret against a reference function class $F \subset Y^X$:
       $\mathrm{Reg}_n(F) \stackrel{\text{def}}{=} \underbrace{\sum_{t=1}^n \big(\hat{f}_t(x_t) - y_t\big)^2}_{\text{our performance}} \;-\; \underbrace{\inf_{f \in F} \sum_{t=1}^n \big(f(x_t) - y_t\big)^2}_{\text{reference performance}}$

  6. the framework of this talk. Same setting, with the goal of a sublinear regret against the reference class $F \subset Y^X$:
       $\mathrm{Reg}_n(F) \stackrel{\text{def}}{=} \underbrace{\sum_{t=1}^n \big(\hat{f}_t(x_t) - y_t\big)^2}_{\text{our performance}} \;-\; \underbrace{\inf_{f \in F} \sum_{t=1}^n \big(f(x_t) - y_t\big)^2}_{\text{reference performance}} \;=\; o(n)$
     where $\hat{y}_t = \hat{f}_t(x_t)$ is produced as a function of the covariate $x_t$.
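To make the protocol and the regret definition above concrete, here is a minimal Python sketch (not from the talk): the forecaster interface (`predict`/`update`), the data arrays, and the reference class are all placeholders.

```python
import numpy as np

def online_protocol(forecaster, xs, ys):
    """Run the sequential protocol: at round t the forecaster sees x_t
    (plus the past), predicts, and only then y_t is revealed."""
    predictions = []
    for x_t, y_t in zip(xs, ys):
        y_hat = forecaster.predict(x_t)   # may use x_1..x_t and y_1..y_{t-1}
        predictions.append(y_hat)
        forecaster.update(x_t, y_t)       # y_t is revealed after predicting
    return np.asarray(predictions)

def regret(predictions, xs, ys, reference_class):
    """Cumulative square loss of the forecaster minus that of the best
    reference function, both evaluated in hindsight on the same data."""
    our_loss = np.sum((predictions - np.asarray(ys)) ** 2)
    ref_loss = min(np.sum((np.array([f(x) for x in xs]) - np.asarray(ys)) ** 2)
                   for f in reference_class)
    return our_loss - ref_loss
```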

  7. finite reference class: prediction with expert advice

  8. a strategy for finite F: the exponentially weighted average forecaster (EWA) [Littlestone and Warmuth (1994); Vovk (1990)].
     Assumption: $F = \{f_1, \dots, f_K\} \subset Y^X$ is finite.
     At each forecasting instance $t$:
     - assign to each function $f_k$ the weight
       $\hat{p}_{k,t} = \dfrac{\exp\big(-\eta \sum_{s=1}^{t-1} (y_s - f_k(x_s))^2\big)}{\sum_{j=1}^K \exp\big(-\eta \sum_{s=1}^{t-1} (y_s - f_j(x_s))^2\big)}$
     - form the function $\hat{f}_t = \sum_{k=1}^K \hat{p}_{k,t} f_k$ and predict $\hat{y}_t = \hat{f}_t(x_t)$.
     Performance: if $Y = [-B, B]$ and $\eta = 1/(8B^2)$, then
       $\mathrm{Reg}_n(F) = \sum_{t=1}^n \big(y_t - \hat{f}_t(x_t)\big)^2 - \inf_{f \in F} \sum_{t=1}^n \big(y_t - f(x_t)\big)^2 \leq 8B^2 \log K.$
     If $B$ is not known in advance, $\eta$ can be tuned online (doubling trick).
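A minimal Python sketch of the EWA forecaster described on this slide, with square loss and the learning rate $\eta = 1/(8B^2)$; the expert functions and the data sequences are placeholders.

```python
import numpy as np

def ewa_forecast(experts, xs, ys, B):
    """Exponentially weighted average forecaster with square loss.

    experts: list of K functions f_k mapping a covariate to a prediction in [-B, B].
    Returns the predictions yhat_t = sum_k p_{k,t} f_k(x_t), for t = 1..n.
    """
    eta = 1.0 / (8.0 * B ** 2)            # learning rate from the slide
    cum_losses = np.zeros(len(experts))   # sum_{s<t} (y_s - f_k(x_s))^2 for each k
    predictions = []
    for x_t, y_t in zip(xs, ys):
        logw = -eta * cum_losses          # weights p_{k,t} proportional to exp(-eta * past loss)
        p = np.exp(logw - logw.max())
        p /= p.sum()
        expert_preds = np.array([f(x_t) for f in experts])
        predictions.append(float(p @ expert_preds))
        cum_losses += (y_t - expert_preds) ** 2   # y_t revealed: update cumulative losses
    return np.asarray(predictions)
```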

  9. proof.
     1. Upper bound the instantaneous loss: for $\eta \leq 1/(8B^2)$,
       $\big(y_t - \hat{f}_t(x_t)\big)^2 = \Big(y_t - \sum_{k=1}^K \hat{p}_{k,t} f_k(x_t)\Big)^2 \leq -\frac{1}{\eta} \log\Big(\sum_{k=1}^K \hat{p}_{k,t}\, e^{-\eta (y_t - f_k(x_t))^2}\Big)$
       $= -\frac{1}{\eta} \log\Big(\frac{\hat{p}_{k,t}\, e^{-\eta (y_t - f_k(x_t))^2}}{\hat{p}_{k,t+1}}\Big)$  (by definition of $\hat{p}_{k,t+1}$)
       $= \big(y_t - f_k(x_t)\big)^2 + \frac{1}{\eta} \log\frac{\hat{p}_{k,t+1}}{\hat{p}_{k,t}}$
     2. Sum over all $t$; the sum telescopes:
       $\sum_{t=1}^n \Big[\big(y_t - \hat{f}_t(x_t)\big)^2 - \big(y_t - f_k(x_t)\big)^2\Big] \leq \frac{1}{\eta} \log\frac{\hat{p}_{k,n+1}}{\hat{p}_{k,1}} \leq \frac{1}{\eta} \log K = 8B^2 \log K$
       (since $\hat{p}_{k,n+1} \leq 1$ and $\hat{p}_{k,1} = 1/K$).
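The first inequality of step 1 is stated without justification on the slide; it follows from the exp-concavity of the square loss on $[-B, B]$ when $\eta \leq 1/(8B^2)$. A short derivation of that step (a standard argument, not from the slides):

```latex
% Why (y_t - \sum_k \hat p_{k,t} f_k(x_t))^2 \le -\frac1\eta \log \sum_k \hat p_{k,t} e^{-\eta (y_t - f_k(x_t))^2}:
% the map z \mapsto e^{-\eta (y - z)^2} is concave on [-B, B] whenever |y| \le B and \eta \le 1/(8B^2)
% (its second derivative is (4\eta^2 (y-z)^2 - 2\eta)\, e^{-\eta(y-z)^2} \le 0 since (y-z)^2 \le 4B^2 \le 1/(2\eta)).
% Jensen's inequality applied to the weights \hat p_{k,t} therefore gives
\[
\sum_{k=1}^K \hat p_{k,t}\, e^{-\eta (y_t - f_k(x_t))^2}
\;\le\; \exp\!\Big(-\eta \big(y_t - \textstyle\sum_{k=1}^K \hat p_{k,t} f_k(x_t)\big)^2\Big),
\]
% and applying -\frac1\eta \log(\cdot) to both sides (which reverses the inequality) yields step 1.
```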

  10. large reference class

  11. approximate F by a finite class (Vovk 2001).
     1. Approximate $F$ by a finite set $F_\varepsilon$ such that $\forall f \in F,\ \exists f_\varepsilon \in F_\varepsilon,\ \|f - f_\varepsilon\|_\infty \leq \varepsilon$.  (1)
        Such a set $F_\varepsilon$ is called an $\varepsilon$-net of $F$.
     2. Run EWA on $F_\varepsilon$.
     Definition (metric entropy). The cardinality of the smallest $\varepsilon$-net $F_\varepsilon$ satisfying (1) is denoted $N_\infty(F, \varepsilon)$; the metric entropy of $F$ is $\log N_\infty(F, \varepsilon)$.
     Regret bound of order (forgetting constants):
       $\mathrm{Reg}_n(F) = \mathrm{Reg}_n(F_\varepsilon) + \Big[\inf_{f_\varepsilon \in F_\varepsilon} \sum_{t=1}^n \big(y_t - f_\varepsilon(x_t)\big)^2 - \inf_{f \in F} \sum_{t=1}^n \big(y_t - f(x_t)\big)^2\Big] \lesssim \underbrace{\log N_\infty(F, \varepsilon)}_{\text{regret of EWA on } F_\varepsilon} + \underbrace{\varepsilon n}_{\text{approximation of } F \text{ by } F_\varepsilon}$
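A toy Python illustration (not from the talk) of the two-step recipe on this slide, discretize then run EWA, using the hypothetical class of constant predictors $f_c(x) = c$, $c \in [-B, B]$, whose $\varepsilon$-net in sup norm is simply a grid of constants; it reuses the `ewa_forecast` sketch given after slide 8.

```python
import numpy as np

# Toy reference class: constant predictors f_c(x) = c with c in [-B, B].
# An eps-net of this class in sup norm is a grid of constants with spacing 2*eps,
# so N_inf(F, eps) <= B/eps + 1 and the metric entropy grows like log(1/eps).
def epsilon_net_of_constants(B, eps):
    grid = np.arange(-B, B + eps, 2 * eps)
    return [(lambda x, c=c: c) for c in grid]   # each net point is a constant function

B, eps, n = 1.0, 0.05, 500
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=n)
ys = np.clip(0.3 + 0.1 * rng.standard_normal(n), -B, B)   # arbitrary bounded sequence

net = epsilon_net_of_constants(B, eps)
preds = ewa_forecast(net, xs, ys, B)   # EWA sketch from the example after slide 8
# As on the slide, the regret against the full class splits into the regret of EWA
# on the net (<= 8 B^2 log |net|) plus an approximation term of order eps * n.
```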

  12. examples of reference classes: the parametric case.
     If $N_\infty(F, \varepsilon) \lesssim \varepsilon^{-p}$ for some $p > 0$ as $\varepsilon \to 0$, then with $\varepsilon \approx 1/n$:
       $\mathrm{Reg}_n(F) \lesssim \log N_\infty(F, \varepsilon) + \varepsilon n \approx \log(\varepsilon^{-p}) + \varepsilon n \approx p \log(n)$
     Example. Assume you have $d \geq 1$ black-box forecasters $\varphi_1, \dots, \varphi_d \in Y^X$. Then:
     - linear regression in a compact ball: $F = \big\{\sum_{j=1}^d u_j \varphi_j : u \in \Theta\big\}$ with $\Theta \subset \mathbb{R}^d$ compact, for which $N_\infty(F, \varepsilon) \lesssim \varepsilon^{-d}$;
     - sparse linear regression: $F = \big\{\sum_{j=1}^d u_j \varphi_j : u \in [0,1]^d \text{ s.t. } \|u\|_1 = 1 \text{ and } \|u\|_0 = s\big\}$, for which $\log N_\infty(F, \varepsilon) \lesssim \log\binom{d}{s} + s \log\big(1 + 1/(\varepsilon \sqrt{s})\big)$ [F. Gao, C.-K. Ing, and Y. Yang. "Metric entropy and sparse linear approximation of ℓq-hulls for 0 < q ≤ 1". Journal of Approximation Theory (2013)], leading to $\mathrm{Reg}_n(F) \lesssim s \log(1 + dn/s)$.

  13. examples of reference classes: the parametric case (continued). The rate $\mathrm{Reg}_n(F) \lesssim p \log(n)$ obtained with $\varepsilon \approx 1/n$ is optimal.
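As a sanity check (not on the slide), optimizing the bound $p \log(1/\varepsilon) + \varepsilon n$ exactly over $\varepsilon$ confirms the choice $\varepsilon \approx 1/n$ and the $p \log n$ rate:

```latex
% Minimize  h(\varepsilon) = p \log(1/\varepsilon) + \varepsilon n  over  \varepsilon > 0.
% Setting the derivative to zero:  h'(\varepsilon) = -p/\varepsilon + n = 0  gives  \varepsilon^\star = p/n,  so that
\[
\inf_{\varepsilon > 0} \Big\{ p \log\tfrac{1}{\varepsilon} + \varepsilon n \Big\}
= p \log\tfrac{n}{p} + p \;\approx\; p \log n ,
\]
% matching the choice \varepsilon \approx 1/n made on the slide, up to the constant p.
```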

  14. what if F is nonparametric?
     If $\log N_\infty(F, \varepsilon) \lesssim \varepsilon^{-p}$ for some $p > 0$ as $\varepsilon \to 0$, then with $\varepsilon = n^{-1/(p+1)}$:
       $\mathrm{Reg}_n(F) \lesssim \log N_\infty(F, \varepsilon) + \varepsilon n \lesssim \varepsilon^{-p} + \varepsilon n \approx n^{\frac{p}{p+1}}$
     Example.
     - 1-Lipschitz ball on $[0,1]$: $F = \big\{f \in Y^X : |f(x) - f(y)| \leq \|x - y\| \ \forall x, y \in X \subset [0,1]\big\}$. Then $\log N_\infty(F, \varepsilon) \approx \varepsilon^{-1}$, so $\mathrm{Reg}_n(F) \lesssim \sqrt{n}$.
     - Hölder ball on $X \subset [0,1]$ with regularity $\beta = q + \alpha > 1/2$: $F = \big\{f \in Y^X : \forall x, y \in X,\ |f^{(q)}(x) - f^{(q)}(y)| \leq |x - y|^\alpha \text{ and } \forall k \leq q,\ \|f^{(k)}\|_\infty \leq B\big\}$. Then $\log N_\infty(F, \varepsilon) \approx \varepsilon^{-1/\beta}$ [G. G. Lorentz. "Metric Entropy, Widths, and Superpositions of Functions". Amer. Math. Monthly (1962)], so $\mathrm{Reg}_n(F) \lesssim n^{\frac{1}{1+\beta}}$.

  15. what if F is nonparametric? (continued) The rates above are suboptimal:
     - for $\log N_\infty(F, \varepsilon) \lesssim \varepsilon^{-p}$, the rate $n^{\frac{p}{p+1}}$ should be compared with the optimal rate, $n^{\frac{p}{p+2}}$ if $p < 2$ and $n^{\frac{p-1}{p}}$ if $p > 2$;
     - for the 1-Lipschitz ball, $\sqrt{n}$ is suboptimal: the optimal rate is $n^{1/3}$;
     - for the Hölder ball with regularity $\beta$, $n^{\frac{1}{1+\beta}}$ is suboptimal: the optimal rate is $n^{\frac{1}{1+2\beta}}$.
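The single-scale trade-off behind the $n^{p/(p+1)}$ rate, worked out explicitly (a standard balancing argument, not detailed on the slide):

```latex
% Balance the two terms of  \varepsilon^{-p} + \varepsilon n  over  \varepsilon > 0:
% they are equal when  \varepsilon^{-p} = \varepsilon n,  i.e.  \varepsilon = n^{-1/(p+1)},  giving
\[
\inf_{\varepsilon > 0} \big\{ \varepsilon^{-p} + \varepsilon n \big\} \;\approx\; n^{\frac{p}{p+1}} .
\]
% E.g. p = 1 (1-Lipschitz ball) gives \varepsilon = n^{-1/2} and a regret of order \sqrt{n},
% whereas the optimal rate in that case is n^{1/3}: a single scale \varepsilon is too crude.
```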

  16. minimax rates.
     Theorem (Rakhlin and Sridharan 2014 [A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". COLT (2014)]). The minimax rate of the regret is of order
       $\inf_{\gamma \geq \varepsilon \geq 0} \Big\{ \log N_{\mathrm{seq}}(F, \gamma) + \sqrt{n} \int_\varepsilon^\gamma \sqrt{\log N_{\mathrm{seq}}(F, \tau)}\, d\tau + \varepsilon n \Big\}$
     where $\log N_{\mathrm{seq}}(F, \varepsilon) \leq \log N_\infty(F, \varepsilon)$ is the sequential entropy of $F$.
     Interpretation of the three terms:
     - $\log N_\infty(F, \gamma)$: regret of EWA against a $\gamma$-net ➝ crude approximation;
     - $\varepsilon n$: approximation error of the $\varepsilon$-net ➝ fine approximation;
     - $\sqrt{n} \int_\varepsilon^\gamma \sqrt{\log N_\infty(F, \tau)}\, d\tau$: from the large scale $\gamma$ to the small scale $\varepsilon$.
     This last term is a Dudley entropy integral, which appears in:
     - chaining to bound the supremum of a stochastic process (Dudley 1967);
     - statistical learning with i.i.d. data to derive risk bounds (e.g., Massart 2007; Rakhlin et al. 2013);
     - online learning with arbitrary sequences (Opper and Haussler 1997; Cesa-Bianchi and Lugosi 1999).
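Plugging an entropy growth $\log N_{\mathrm{seq}}(F, \tau) \approx \tau^{-p}$ into the bound above (a back-of-the-envelope computation, up to constants) shows how the extra entropy-integral term recovers the optimal rates quoted on slide 15:

```latex
% Take \log N_{\mathrm{seq}}(F,\tau) \approx \tau^{-p} in the bound above.
% Case p < 2: the integral \int_\varepsilon^\gamma \tau^{-p/2}\,d\tau stays bounded as \varepsilon \to 0,
% so take \varepsilon \to 0 and balance \gamma^{-p} against \sqrt{n}\,\gamma^{1-p/2}:
\[
\gamma \approx n^{-\frac{1}{p+2}}
\quad\Longrightarrow\quad
\mathrm{Reg}_n(F) \lesssim n^{\frac{p}{p+2}}
\qquad (\text{e.g. } p = 1:\ n^{1/3} \text{ instead of } \sqrt{n}).
\]
% Case p > 2: the integral is dominated by its lower end, \approx \sqrt{n}\,\varepsilon^{1-p/2};
% balancing it against \varepsilon n gives \varepsilon \approx n^{-1/p} and
\[
\mathrm{Reg}_n(F) \lesssim n^{\frac{p-1}{p}} .
\]
```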
