
SLIDE 1

a chaining algorithm for online nonparametric regression

Pierre Gaillard December 2, 2015

University of Copenhagen. This is joint work with Sébastien Gerchinovitz.

SLIDE 2

table of contents

  • 1. Online prediction of arbitrary sequences
  • 2. Finite reference class: prediction with expert advice
  • 3. Large reference class
  • 4. Extensions, current (and future) work


SLIDE 3

online prediction of arbitrary sequences

SLIDE 4

the framework of this talk

Sequential prediction of arbitrary time series [1]:

  • a time series y1, . . . , yn ∈ Y = [−B, B] is to be predicted step by step
  • covariates x1, . . . , xn ∈ X are sequentially available

At each forecasting instance t = 1, . . . , n:

  • the environment reveals xt ∈ X
  • the player is asked to form a prediction ŷt of yt based on
    – the past observations y1, . . . , yt−1
    – the current and past covariates x1, . . . , xt
  • the environment reveals yt

Goal: minimize the cumulative loss L̂n = ∑_{t=1}^n (ŷt − yt)².

Difficulty: no stochastic assumption on the time series

  • neither on the observations (yt)
  • nor on the covariates (xt)

[1] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. 2006.


SLIDE 6

the framework of this talk

Sequential prediction of arbitrary time series:

  • a time series y1, . . . , yn ∈ Y = [−B, B] is to be predicted step by step
  • covariates x1, . . . , xn ∈ X are sequentially available

At each forecasting instance t = 1, . . . , n:

  • the environment reveals xt ∈ X
  • solution: produce the prediction as a function of xt, i.e. ŷt = f̂t(xt)
  • the environment reveals yt

Goal: minimize our regret against a reference function class F ⊂ Y^X:

  Regn(F) := ∑_{t=1}^n (f̂t(xt) − yt)²   [our performance]
           − inf_{f∈F} ∑_{t=1}^n (f(xt) − yt)²   [reference performance]
           = o(n)   ← the goal
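To make the protocol and the regret concrete, here is a minimal Python sketch of one run of the game against a finite reference class. The `forecaster` object (with `predict`/`update` methods), the covariate and observation sequences, and the reference functions are illustrative placeholders, not the algorithms of this talk.

```python
import numpy as np

def online_protocol(xs, ys, forecaster, reference_class):
    """One run of the game: x_t is revealed, the player predicts, then y_t is
    revealed.  Returns the regret Reg_n(F) against the best fixed function of
    the finite reference class F (a list of callables)."""
    player_loss = 0.0
    ref_losses = np.zeros(len(reference_class))
    for x_t, y_t in zip(xs, ys):
        y_hat = forecaster.predict(x_t)       # prediction formed before y_t is revealed
        player_loss += (y_hat - y_t) ** 2
        ref_losses += np.array([(f(x_t) - y_t) ** 2 for f in reference_class])
        forecaster.update(x_t, y_t)           # the player learns from the revealed y_t
    return player_loss - ref_losses.min()
```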

SLIDE 7

finite reference class: prediction with expert advice

SLIDE 8

a strategy for finite F

Assumption: F = {f1, . . . , fK} ⊂ Y^X is finite.

The exponentially weighted average forecaster (EWA) [1]: at each forecasting instance t,

  • assign to each function fk the weight
      pk,t = exp( −η ∑_{s=1}^{t−1} (ys − fk(xs))² ) / ∑_{j=1}^K exp( −η ∑_{s=1}^{t−1} (ys − fj(xs))² )
  • form the function f̂t = ∑_{k=1}^K pk,t fk and predict ŷt = f̂t(xt)

Performance: if Y = [−B, B] and η = 1/(8B²),

  Regn(F) := ∑_{t=1}^n (yt − f̂t(xt))² − inf_{f∈F} ∑_{t=1}^n (yt − f(xt))² ⩽ 8B² log K

If B is not known in advance, η can be tuned online (doubling trick).

[1] Littlestone and Warmuth (1994); Vovk (1990)
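A minimal Python sketch of EWA with square loss over a finite class, following the weight formula above; it fits the `predict`/`update` interface of the earlier protocol sketch. The class and variable names are illustrative, and η = 1/(8B²) is the tuning from the slide.

```python
import numpy as np

class EWA:
    """Exponentially weighted average forecaster over a finite class
    F = {f_1, ..., f_K} with square loss."""

    def __init__(self, experts, B):
        self.experts = experts                      # list of callables f_k: X -> [-B, B]
        self.eta = 1.0 / (8.0 * B ** 2)             # tuning from the slide
        self.cum_losses = np.zeros(len(experts))    # sum_{s<t} (y_s - f_k(x_s))^2

    def predict(self, x_t):
        # p_{k,t} proportional to exp(-eta * cumulative loss); shifting by the
        # minimum only improves numerical stability, the weights are unchanged
        w = np.exp(-self.eta * (self.cum_losses - self.cum_losses.min()))
        p = w / w.sum()
        return float(np.dot(p, [f(x_t) for f in self.experts]))  # y_hat = sum_k p_{k,t} f_k(x_t)

    def update(self, x_t, y_t):
        self.cum_losses += np.array([(y_t - f(x_t)) ** 2 for f in self.experts])
```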

SLIDE 9

proof

  • 1. Upper bound the instantaneous loss: for η ⩽ 1/(8B²) (exp-concavity of the square loss on [−B, B]),

      (yt − f̂t(xt))² = ( yt − ∑_{k=1}^K pk,t fk(xt) )²
                     ⩽ −(1/η) log( ∑_{k=1}^K pk,t e^{−η (yt − fk(xt))²} )
                     = −(1/η) log( (pk,t / pk,t+1) e^{−η (yt − fk(xt))²} )   [by definition of pk,t+1, for any k]
                     = (yt − fk(xt))² + (1/η) log( pk,t+1 / pk,t )

  • 2. Sum over all t; the sum telescopes:

      ∑_{t=1}^n (yt − f̂t(xt))² − (yt − fk(xt))² ⩽ (1/η) log( pk,n+1 / pk,1 ) ⩽ (log K)/η = 8B² log K

    since pk,n+1 ⩽ 1 and pk,1 = 1/K.

SLIDE 10

large reference class

SLIDE 11

approximate F by a finite class (Vovk 2001)

  • 1. Approximate F by a finite set Fε such that
        ∀f ∈ F, ∃fε ∈ Fε : ∥f − fε∥∞ ⩽ ε.   (1)
      Such a set Fε is called an ε-net of F.
  • 2. Run EWA on Fε.

Definition (metric entropy). The cardinality of the smallest ε-net Fε satisfying (1) is denoted N∞(F, ε). The metric entropy of F is log N∞(F, ε).

Regret bound of order (forgetting constants):

  Regn(F) = Regn(Fε) + [ inf_{fε∈Fε} ∑_{t=1}^n (yt − fε(xt))² − inf_{f∈F} ∑_{t=1}^n (yt − f(xt))² ]
          ≲ log N∞(F, ε)   [regret of EWA on Fε]   +   εn   [approximation of F by Fε]


SLIDE 13

examples of reference classes: the parametric case

If N∞(F, ε) ≲ ε^{−p} for some p > 0 as ε → 0,

  Regn(F) ≲ log N∞(F, ε) + εn ≈ log(ε^{−p}) + εn ≈ p log(n)   for ε ≈ 1/n   ➝ optimal

Example. Assume you have d ⩾ 1 black-box forecasters φ1, . . . , φd ∈ Y^X.

  • linear regression in a compact ball: F = { ∑_{j=1}^d uj φj : u ∈ Θ ⊂ R^d, Θ compact }
      → N∞(F, ε) ≲ ε^{−d}
  • sparse linear regression: F = { ∑_{j=1}^d uj φj : u ∈ [0, 1]^d s.t. ∥u∥1 = 1 and ∥u∥0 = s }
      Then [2], log N∞(F, ε) ≲ log (d choose s) + s log(1 + 1/(ε√s))
      → Regn(F) ≲ s log(1 + dn/s)

[2] F. Gao, C.-K. Ing, and Y. Yang. "Metric entropy and sparse linear approximation of ℓq-hulls for 0 < q ⩽ 1". In: Journal of Approximation Theory (2013).
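As a rough illustration of the ε^{−d} covering bound for the compact-ball example, here is a sketch that builds a finite ε-net by placing the coefficient vector on a grid and then exposes it as a list of functions on which EWA (previous sketch) can be run. It assumes sup-norm-bounded features ∥φj∥∞ ⩽ 1 and a box [−R, R]^d instead of a general compact Θ, so the bounds and names are illustrative.

```python
import itertools
import numpy as np

def linear_eps_net(phis, R, eps):
    """Sketch: eps-net of F = {x -> sum_j u_j phi_j(x) : u in [-R, R]^d},
    assuming sup_x |phi_j(x)| <= 1.  Coefficients are put on a grid of spacing
    2*eps/d, so that ||f_u - f_v||_inf <= ||u - v||_1 <= eps between any f_u
    and its closest grid function.  Cardinality ~ (R d / eps)^d, i.e. eps^{-d}
    up to constants."""
    d = len(phis)
    step = 2.0 * eps / d
    grid_1d = np.arange(-R, R + step, step)
    net = []
    for u in itertools.product(grid_1d, repeat=d):   # exponential in d: only for tiny d
        u = np.array(u)
        net.append(lambda x, u=u: float(np.dot(u, [phi(x) for phi in phis])))
    return net   # feed this finite class to the EWA sketch above
```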


SLIDE 15

what if F is nonparametric?

If log N∞(F, ε) ≲ ε^{−p} for some p > 0 as ε → 0,

  Regn(F) ≲ log N∞(F, ε) + εn ≲ ε^{−p} + εn ≈ n^{p/(p+1)}   for ε = n^{−1/(p+1)}

  ➝ suboptimal: the minimax rate is n^{p/(p+2)} if p < 2 and n^{(p−1)/p} if p > 2.

Example

  • 1-Lipschitz ball on [0, 1]:
      F = { f ∈ Y^X : |f(x) − f(y)| ⩽ |x − y| for all x, y ∈ X ⊂ [0, 1] }
      Then log N∞(F, ε) ≈ ε^{−1} → Regn(F) ≲ √n   ➝ suboptimal: n^{1/3}
  • Hölder ball on X ⊂ [0, 1] with regularity β = q + α > 1/2:
      F = { f ∈ Y^X : |f^{(q)}(x) − f^{(q)}(y)| ⩽ |x − y|^α for all x, y ∈ X, and ∥f^{(k)}∥∞ ⩽ B for all k ⩽ q }
      Then [3] log N∞(F, ε) ≈ ε^{−1/β} → Regn(F) ≲ n^{1/(1+β)}   ➝ suboptimal: n^{1/(1+2β)}.

[3] G. G. Lorentz. "Metric Entropy, Widths, and Superpositions of Functions". In: Amer. Math. Monthly 69.6 (1962).

SLIDE 16

minimax rates

Theorem (Rakhlin and Sridharan 2014 [4])
The minimax rate of the regret is of order

  inf_{γ⩾ε⩾0} { log N^seq(F, γ) + √n ∫_ε^γ √(log N^seq(F, τ)) dτ + εn }

where log N^seq(F, ε) ⩽ log N∞(F, ε) is the sequential entropy of F.

  • log N∞(F, γ): regret of EWA against a γ-net ➝ crude approximation
  • εn: approximation error of the ε-net ➝ fine approximation
  • √n ∫_ε^γ √(log N∞(F, τ)) dτ: from the large scale γ to the small scale ε.

This last term is a Dudley entropy integral, which appears in

  • chaining to bound the supremum of a stochastic process (Dudley 1967)
  • statistical learning with i.i.d. data to derive risk bounds (e.g., Massart 2007; Rakhlin et al. 2013)
  • online learning with arbitrary sequences (Opper and Haussler 1997; Cesa-Bianchi and Lugosi 1999)

[4] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT (2014).

SLIDE 17

minimax rates

Theorem (Rakhlin and Sridharan 2014 [4])
The minimax rate of the regret is of order

  inf_{γ⩾ε⩾0} { log N∞(F, γ) + √n ∫_ε^γ √(log N∞(F, τ)) dτ + εn }

if log N∞(F, ε) ≈ log N^seq(F, ε).

  • log N∞(F, γ): regret of EWA against a γ-net ➝ crude approximation
  • εn: approximation error of the ε-net ➝ fine approximation
  • √n ∫_ε^γ √(log N∞(F, τ)) dτ: from the large scale γ to the small scale ε.

This last term is a Dudley entropy integral, which appears in

  • chaining to bound the supremum of a stochastic process (Dudley 1967)
  • statistical learning with i.i.d. data to derive risk bounds (e.g., Massart 2007; Rakhlin et al. 2013)
  • online learning with arbitrary sequences (Opper and Haussler 1997; Cesa-Bianchi and Lugosi 1999)

[4] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT (2014).

SLIDE 18

minimax rates

Theorem (Rakhlin and Sridharan 2014 [4])
The minimax rate of the regret is of order

  inf_{γ⩾ε⩾0} { log N∞(F, γ) + √n ∫_ε^γ √(log N∞(F, τ)) dτ + εn }

if log N∞(F, ε) ≈ log N^seq(F, ε).

Example: let p ∈ (0, 2) and F be such that log N∞(F, ε) ≈ ε^{−p} as ε → 0. The minimax regret is then of order

  γ^{−p} + √n ∫_ε^γ τ^{−p/2} dτ + εn = γ^{−p} + √n γ^{1−p/2} + 0 ≈ n^{p/(p+2)}

for the optimal choices ε = 0 and γ ≈ n^{−1/(p+2)}.

[4] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT (2014).

SLIDE 19

our contributions

Main algorithm, which:

  • achieves the Dudley-type regret bound
      Regn ≲ log N∞(F, γ) + √n ∫_ε^γ √(log N∞(F, τ)) dτ + εn
  • has an efficient version for Hölder classes on [0, 1] (at the cost of a log factor)

Key subroutine (Multi-variable EG) to go from scale γ to scale ε.

  Function class         Metric entropy                           Regret of EWA     Our regret
  (general)              ε^{−p}, p ∈ (0, 2)                       n^{p/(p+1)}       n^{p/(p+2)}
  Lipschitz on [0, 1]    ε^{−1}                                   n^{1/2}           n^{1/3}
  β-Hölder on [0, 1]     ε^{−1/β}, β > 1/2                        n^{1/(β+1)}       n^{1/(2β+1)}
  Sparse lin. reg.       log (d choose s) + s log(1 + 1/(ε√s))    s log(1 + dn/s)   s log(1 + dn/s)

SLIDE 20

the chaining argument

Instead of competing directly with Fε for small ε (too many functions):

  • 1. create a "chain" of refining approximations π0(f) ∈ F(0), π1(f) ∈ F(1), . . . of any function f ∈ Fε such that, for all k ⩾ 0,
        sup_{f∈F} ∥πk(f) − f∥∞ ⩽ γ/2^k   and   Card F(k) = N∞(F, γ/2^k),
  • 2. compete with the chains:
        inf_{f∈Fε} ∑_{t=1}^n (yt − f(xt))² = inf_{f∈Fε} ∑_{t=1}^n ( yt − π0(f)(xt) − ∑_{k⩾0} [πk+1(f) − πk(f)](xt) )²,
      where π0(f) ∈ F(0), each increment πk+1(f) − πk(f) belongs to G(k) := {πk+1(f) − πk(f) : f ∈ F}, and the increments are small: ∥πk+1(f) − πk(f)∥∞ ⩽ 3γ/2^{k+1}.
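A small sketch of how such a chain can be read off once nested nets F(k) are available: project f onto each net in sup norm (approximated on a finite grid of evaluation points) and take differences of consecutive projections. The nets and the evaluation grid are assumed to be given; this is an illustration of the decomposition, not the construction used in the paper.

```python
import numpy as np

def chain_projections(f, nets, grid):
    """Sketch of the chaining decomposition.  `nets` is a list [F0, F1, ...]
    where F_k is assumed to be a (gamma / 2^k)-net in sup norm, and `grid` is
    a finite set of points used to approximate sup distances.  Returns pi_0(f)
    and the increments pi_{k+1}(f) - pi_k(f), all evaluated on the grid."""
    def sup_dist(g, h):
        return np.max(np.abs(np.array([g(x) - h(x) for x in grid])))

    projections = [min(F_k, key=lambda g: sup_dist(f, g)) for F_k in nets]   # pi_k(f)
    base = np.array([projections[0](x) for x in grid])                       # pi_0(f), lives in F(0)
    increments = [
        np.array([projections[k + 1](x) - projections[k](x) for x in grid])  # lives in G(k)
        for k in range(len(nets) - 1)
    ]
    # triangle inequality: ||pi_{k+1}(f) - pi_k(f)||_inf <= gamma/2^{k+1} + gamma/2^k = 3*gamma/2^{k+1}
    return base, increments
```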

SLIDE 21

compete with the chains: two aggregation levels

  inf_{f∈Fε} ∑_{t=1}^n (yt − f(xt))² = inf_{f∈Fε} ∑_{t=1}^n ( yt − π0(f)(xt) − ∑_{k⩾0} [πk+1(f) − πk(f)](xt) )²

with π0(f) ∈ F(0) and small increments πk+1(f) − πk(f) ∈ G(k) of sup norm ⩽ 3γ/2^{k+1}. It thus suffices to compete with

  inf_{f0∈F(0)} inf_{g0∈G(0), . . . , gKε∈G(Kε)} ∑_{t=1}^n ( yt − (f0 + g0 + · · · + gKε)(xt) )²

  • low-scale aggregation: run gradient descents simultaneously over all the gk.
      Regret cost: √n ∫_ε^γ √(log N∞(F, τ)) dτ
  • high-scale aggregation: run EWA to be competitive against all f0 ∈ F(0).
      Regret cost: log(Card F(0)) = log N∞(F, γ)

SLIDE 22

the exponentiated gradient forecaster (eg)

Let ∆N := {u ∈ R^N_+ : ∑_{i=1}^N ui = 1}.

Setting: at each step t ⩾ 1, the player plays ût ∈ ∆N and the environment chooses a convex and differentiable loss function ℓt : ∆N → R.

The Exponentiated Gradient forecaster [5]: at each forecasting instance t ⩾ 1, define ût ∈ ∆N component-wise by

  ûk,t := (1/Zt) exp( −η ∑_{s=1}^{t−1} [∇ℓs(ûs)]k )

where Zt is a normalization factor.

Regret bound: if ∥∇ℓt∥∞ ⩽ G, then for η = G^{−1} √(2 log(N)/n),

  ∑_{t=1}^n ℓt(ût) ⩽ min_{u∈∆N} ∑_{t=1}^n ℓt(u) + G √(2n log N)

If G is small, G √(n log N) can be better than B² log N.

[5] Kivinen and Warmuth (1997); Cesa-Bianchi (1999)
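A minimal Python sketch of the EG update above. It assumes the horizon n and the gradient bound G are known for the tuning, as on the slide; the class and variable names are illustrative.

```python
import numpy as np

class ExponentiatedGradient:
    """Sketch of EG on the simplex: weights proportional to
    exp(-eta * sum of past loss gradients)."""

    def __init__(self, N, G, n):
        self.eta = np.sqrt(2.0 * np.log(N) / n) / G   # tuning from the slide
        self.grad_sum = np.zeros(N)                   # sum_{s<t} grad l_s(u_s)

    def play(self):
        # shifting by the minimum only improves numerical stability
        w = np.exp(-self.eta * (self.grad_sum - self.grad_sum.min()))
        return w / w.sum()                            # u_t in the simplex

    def update(self, grad_t):
        self.grad_sum += grad_t                       # gradient of l_t at the played point u_t
```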

SLIDE 23

exponentiated gradient simultaneously on all chain links

Let ∆N denote the simplex in R^N. Goal: minimize a sequence of multi-variable losses (u^(1), . . . , u^(K)) ↦ ℓt(u^(1), . . . , u^(K)) simultaneously over all variables (u^(1), . . . , u^(K)) ∈ ∆N1 × · · · × ∆NK.

The Multi-variable Exponentiated Gradient forecaster

  input: tuning parameters η^(1), . . . , η^(K) > 0
  initialization: û1^(k) := (1/Nk, . . . , 1/Nk) ∈ ∆Nk for all k = 1, . . . , K
  for each round t = 2, 3, . . . : compute the weight vectors (ût^(1), . . . , ût^(K)) ∈ ∆N1 × · · · × ∆NK as follows (Zt^(k) is a normalization factor):

      û^(k)_{t,i} := (1/Zt^(k)) exp( −η^(k) ∑_{s=1}^{t−1} [∇_{u^(k)} ℓs(ûs^(1), . . . , ûs^(K))]i ),   i ∈ {1, . . . , Nk}

Regret bound: if the ℓt are jointly convex and differentiable with ∥∇_{u^(k)} ℓt∥∞ ⩽ G^(k), then Multi-variable EG tuned with η^(k) = √(2 log(Nk)/n) / G^(k) satisfies:

  ∑_{t=1}^n ℓt(ût^(1), . . . , ût^(K)) − min_{u^(1), . . . , u^(K)} ∑_{t=1}^n ℓt(u^(1), . . . , u^(K)) ⩽ √(2n) ∑_{k=1}^K G^(k) √(log Nk)

SLIDE 24

how to use it to compete with chains?

Remember the goal:

  inf_{f0∈F(0)} inf_{g0∈G(0), . . . , gKε∈G(Kε)} ∑_{t=1}^n ( yt − (f0 + g0 + · · · + gKε)(xt) )²

  • low-scale aggregation: Multi-variable EG with the loss functions
      ℓt(u^(0), . . . , u^(Kε)) = ( yt − f0(xt) − ∑_{k=0}^{Kε} u^(k) · g^(k)(xt) )²
    where g^(k)(xt) denotes the vector (g(xt))_{g∈G(k)}.
      Regret cost: √n ∫_ε^γ √(log N∞(F, τ)) dτ
  • high-scale aggregation: run EWA to be competitive against all f0 ∈ F(0).
      Regret cost: log(Card F(0)) = log N∞(F, γ)
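Continuing the sketches above, here is one round of the low-scale aggregation: one EG instance (from the sketch after SLIDE 22) per chain level, all updated with the gradient of the shared square loss. The value f0(xt) and the vectors g^(k)(xt) are assumed to be given; the high-scale EWA aggregation over F(0) would run on top of this and is not shown.

```python
import numpy as np

def multivariable_eg_step(eg_instances, f0_xt, g_values, y_t):
    """One round of Multi-variable EG for the chaining loss
        l_t(u^(0), ..., u^(K)) = (y_t - f0(x_t) - sum_k u^(k) . g^(k)(x_t))^2.
    `eg_instances` holds one ExponentiatedGradient per level k and
    `g_values[k]` is the vector (g(x_t))_{g in G(k)}."""
    us = [eg.play() for eg in eg_instances]                              # current weights u_t^(k)
    prediction = f0_xt + sum(np.dot(u, g) for u, g in zip(us, g_values)) # formed before y_t is used
    residual = y_t - prediction
    for eg, g in zip(eg_instances, g_values):
        eg.update(-2.0 * residual * g)   # coordinate-wise gradient of l_t with respect to u^(k)
    return prediction
```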

SLIDE 25

main result

Theorem (G. and Gerchinovitz 2015)
Let B > 0, n ⩾ 1, and γ ∈ (B/n, B). Assume that max_{1⩽t⩽n} |yt| ⩽ B and that sup_{f∈F} ∥f∥∞ ⩽ B. Then Chaining EWA, well tuned (depending on γ, B, and n), satisfies:

  Regn(F) ⩽ B² ( 5 + 50 log N∞(F, γ) ) + 120 B √n ∫_0^{γ/2} √(log N∞(F, ε)) dε

Remarks:

  • a bound with ε ≠ 0 and an additional εn term can also be obtained
  • calibration in B and n should be possible by the doubling trick and clipping

SLIDE 26

efficient implementation for lipschitz functions

The idea is to design computationally manageable coverings F(k), k ⩾ 0:

  • approximate any Lipschitz function f : [0, 1] → [−B, B] with piecewise-constant functions (level k = 0);
  • refine the approximation via a dyadic discretization (levels k ⩾ 1).

(figure: dyadic discretization of [0, 1] into nested subintervals)

At each round t, the point xt falls into exactly one subinterval at each level k ➟ no need to update all coefficients ➟ manageable complexity.

For Hölder functions: piecewise constant ➝ piecewise polynomials.
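A small sketch of why the update cost stays manageable: at each level k the dyadic cell containing xt is found in constant time, so only one coefficient per level is touched. How these cells index the coefficients of the full algorithm is not shown here; the helper names are illustrative.

```python
def dyadic_index(x, k):
    """Index of the dyadic subinterval of [0, 1) of length 2^{-k} containing x."""
    return min(int(x * (1 << k)), (1 << k) - 1)   # clip x = 1.0 into the last cell

def touched_cells(x, K):
    """Cells to visit at round t: exactly one per level k = 0, ..., K,
    so the per-round update cost is O(K) coefficients instead of O(2^K)."""
    return [(k, dyadic_index(x, k)) for k in range(K + 1)]

# Example: x_t = 0.3 touches cells (0, 0), (1, 0), (2, 1), (3, 2), ...
```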

SLIDE 27

complexity

  Function class         Time complexity    Space complexity
  Lipschitz on [0, 1]    n^{4/3} log n      n^{4/3} log n
  β-Hölder on [0, 1]     poly(n)            poly(n)

Can be extended to Lipschitz functions on [0, 1]^d at small cost. We lose a log factor in the regret bound.

  • B. Guedj and his PhD student are implementing it for numerical experiments.

SLIDE 28

extensions, current (and future) work

SLIDE 29

extension to general loss functions

Goal: minimize the regret Regn = ∑_{t=1}^n ℓt(f̂t) − inf_{f∈F} ∑_{t=1}^n ℓt(f) for generic sequences of loss functions (ℓt).

If the loss functions ℓt are convex and Lipschitz, we can achieve

  Regn(F) ≲ √n ∫_ε^1 √(log N∞(F, τ)) dτ + εn

(the large-scale term log N∞(F, γ) is no longer available: it relied on the exp-concavity of the square loss).

  Lipschitz class on [0, 1]^d    Metric entropy    EWA regret         Our regret
  d = 1                          ε^{−1}            n^{2/3}            n^{1/2}
  d = 2                          ε^{−2}            n^{3/4}            n^{1/2} log n
  d ⩾ 3                          ε^{−d}            n^{(d+1)/(d+2)}    n^{(d−1)/d}

First constructive algorithm to achieve the optimal [6] rates. The rate n^{(d+1)/(d+2)} was achieved by G. and Baudin (2014) and Hazan and Megiddo (2007).

[6] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression with General Loss Functions". In: arXiv (2015).

SLIDE 30

extension to lipschitz bandits?

Setting: at each step t ⩾ 1,

  • simultaneously, the player plays x̂t ∈ F (possibly random) and the environment chooses a convex Lipschitz loss ℓt (hidden)
  • then the player suffers and observes only ℓt(x̂t)

Goal: minimize the expected regret Regn(F) := E[ ∑_{t=1}^n ℓt(x̂t) ] − inf_{x∈F} ∑_{t=1}^n ℓt(x)

                               Full information                     Bandit feedback
  Finite (Card F = K) + EWA    √(n log K)                           √(nK)
  ε-net + EWA                  √(n log N∞(F, ε)) + εn               √(n N∞(F, ε)) + εn
  Chaining                     √n ∫_ε^1 √(log N∞(F, τ)) dτ + εn     √n ∫_ε^1 √(N∞(F, τ)) dτ + εn ?

For F ⊂ [0, 1] this would lead to O(√n) regret → first constructive algorithm!

Difficulty: taking advantage of small loss ranges in bandits.

SLIDE 31

other related open questions

  • Get the sequential entropy N^seq(F, ε) instead of the metric entropy N∞(F, ε)
  • Efficient versions for other function classes
    – step-wise Lipschitz functions ➝ application to classification
    – generalized additive models ➝ useful to predict electricity consumption
  • Similar results with other algorithms (kernel regression)

Thank you!

SLIDE 32

references

  • N. Cesa-Bianchi. "Analysis of two gradient-based algorithms for on-line regression". In: J. Comput. System Sci. 59.3 (1999), pp. 392–411.
  • N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • P. Gaillard and P. Baudin. "A consistent deterministic regression tree for non-parametric prediction of time series". http://arxiv.org/abs/1405.1533. 2014.
  • P. Gaillard and S. Gerchinovitz. "A Chaining Algorithm for Online Nonparametric Regression". In: Proceedings of COLT'15. Vol. 40. JMLR: Workshop and Conference Proceedings, 2015, pp. 764–796.
  • F. Gao, C.-K. Ing, and Y. Yang. "Metric entropy and sparse linear approximation of ℓq-hulls for 0 < q ⩽ 1". In: Journal of Approximation Theory 166 (2013), pp. 42–55.
  • E. Hazan and N. Megiddo. "Online Learning with Prior Knowledge". In: Proceedings of the 20th Annual Conference on Learning Theory (COLT'07). Ed. by N. H. Bshouty and C. Gentile. Vol. 4539. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007, pp. 499–513.

SLIDE 33

  • J. Kivinen and M. K. Warmuth. "Exponentiated Gradient Versus Gradient Descent for Linear Predictors". In: Information and Computation 132.1 (1997), pp. 1–63.
  • N. Littlestone and M. K. Warmuth. "The Weighted Majority Algorithm". In: Information and Computation 108.2 (1994), pp. 212–261.
  • G. G. Lorentz. "Metric Entropy, Widths, and Superpositions of Functions". In: Amer. Math. Monthly 69.6 (1962), pp. 469–485.
  • A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT 35 (2014), pp. 1232–1264.
  • A. Rakhlin and K. Sridharan. "Online Nonparametric Regression with General Loss Functions". In: arXiv (2015).
  • V. Vovk. "Aggregating Strategies". In: Proceedings of the Third Workshop on Computational Learning Theory. 1990, pp. 371–386.
  • V. Vovk. "Competitive on-line statistics". In: International Statistical Review 69.2 (2001), pp. 213–248.