
SLIDE 1

a chaining algorithm for online nonparametric regression

Pierre Gaillard December 2, 2015

University of Copenhagen. This is joint work with Sébastien Gerchinovitz.

SLIDE 2

table of contents

  • 1. Online prediction of arbitrary sequences
  • 2. Finite reference class: prediction with expert advice
  • 3. Large reference class
  • 4. Extensions, current (and future) work


SLIDE 3

online prediction of arbitrary sequences

SLIDE 4

the framework of this talk

Sequential prediction of arbitrary time series [1]:

  • a time series y1, . . . , yn ∈ Y = [−B, B] is to be predicted step by step
  • covariates x1, . . . , xn ∈ X are sequentially available

At each forecasting instance t = 1, . . . , n:

  • the environment reveals xt ∈ X
  • the player is asked to form a prediction ŷt of yt based on
    – the past observations y1, . . . , yt−1
    – the current and past covariates x1, . . . , xt
  • the environment reveals yt

Goal: minimize the cumulative loss L̂n = ∑_{t=1}^n (ŷt − yt)².

Difficulty: no stochastic assumption on the time series

  • neither on the observations (yt)
  • nor on the covariates (xt)

[1] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. 2006.


SLIDE 6

the framework of this talk

Sequential prediction of arbitrary time series:

  • a time series y1, . . . , yn ∈ Y = [−B, B] is to be predicted step by step
  • covariates x1, . . . , xn ∈ X are sequentially available

At each forecasting instance t = 1, . . . , n:

  • the environment reveals xt ∈ X
  • solution: produce the prediction as a function of xt, i.e. ŷt = f̂t(xt)
  • the environment reveals yt

Goal: minimize our regret against a reference function class F ⊂ Y^X:

  Regn(F) := ∑_{t=1}^n (f̂t(xt) − yt)²   [our performance]
           − inf_{f∈F} ∑_{t=1}^n (f(xt) − yt)²   [reference performance]
           = o(n)   ← the goal
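To make the protocol and the regret concrete, here is a minimal Python sketch of one run of the game against a finite reference class. The `forecaster` object (with `predict`/`update` methods), the covariate and observation sequences, and the reference functions are illustrative placeholders, not the algorithms of this talk.

```python
import numpy as np

def online_protocol(xs, ys, forecaster, reference_class):
    """One run of the game: x_t is revealed, the player predicts, then y_t is
    revealed.  Returns the regret Reg_n(F) against the best fixed function of
    the finite reference class F (a list of callables)."""
    player_loss = 0.0
    ref_losses = np.zeros(len(reference_class))
    for x_t, y_t in zip(xs, ys):
        y_hat = forecaster.predict(x_t)       # prediction formed before y_t is revealed
        player_loss += (y_hat - y_t) ** 2
        ref_losses += np.array([(f(x_t) - y_t) ** 2 for f in reference_class])
        forecaster.update(x_t, y_t)           # the player learns from the revealed y_t
    return player_loss - ref_losses.min()
```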

SLIDE 7

finite reference class: prediction with expert advice

SLIDE 8

a strategy for finite F

Assumption: F = {f1, . . . , fK} ⊂ Y^X is finite.

The exponentially weighted average forecaster (EWA) [1]: at each forecasting instance t,

  • assign to each function fk the weight
      pk,t = exp( −η ∑_{s=1}^{t−1} (ys − fk(xs))² ) / ∑_{j=1}^K exp( −η ∑_{s=1}^{t−1} (ys − fj(xs))² )
  • form the function f̂t = ∑_{k=1}^K pk,t fk and predict ŷt = f̂t(xt)

Performance: if Y = [−B, B] and η = 1/(8B²),

  Regn(F) := ∑_{t=1}^n (yt − f̂t(xt))² − inf_{f∈F} ∑_{t=1}^n (yt − f(xt))² ⩽ 8B² log K

If B is not known in advance, η can be tuned online (doubling trick).

[1] Littlestone and Warmuth (1994); Vovk (1990)
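A minimal Python sketch of EWA with square loss over a finite class, following the weight formula above; it fits the `predict`/`update` interface of the earlier protocol sketch. The class and variable names are illustrative, and η = 1/(8B²) is the tuning from the slide.

```python
import numpy as np

class EWA:
    """Exponentially weighted average forecaster over a finite class
    F = {f_1, ..., f_K} with square loss."""

    def __init__(self, experts, B):
        self.experts = experts                      # list of callables f_k: X -> [-B, B]
        self.eta = 1.0 / (8.0 * B ** 2)             # tuning from the slide
        self.cum_losses = np.zeros(len(experts))    # sum_{s<t} (y_s - f_k(x_s))^2

    def predict(self, x_t):
        # p_{k,t} proportional to exp(-eta * cumulative loss); shifting by the
        # minimum only improves numerical stability, the weights are unchanged
        w = np.exp(-self.eta * (self.cum_losses - self.cum_losses.min()))
        p = w / w.sum()
        return float(np.dot(p, [f(x_t) for f in self.experts]))  # y_hat = sum_k p_{k,t} f_k(x_t)

    def update(self, x_t, y_t):
        self.cum_losses += np.array([(y_t - f(x_t)) ** 2 for f in self.experts])
```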

SLIDE 9

proof

  • 1. Upper bound the instantaneous loss: for η ⩽ 1/(8B²) (exp-concavity of the square loss on [−B, B]),

      (yt − f̂t(xt))² = ( yt − ∑_{k=1}^K pk,t fk(xt) )²
                     ⩽ −(1/η) log( ∑_{k=1}^K pk,t e^{−η (yt − fk(xt))²} )
                     = −(1/η) log( (pk,t / pk,t+1) e^{−η (yt − fk(xt))²} )   [by definition of pk,t+1, for any k]
                     = (yt − fk(xt))² + (1/η) log( pk,t+1 / pk,t )

  • 2. Sum over all t; the sum telescopes:

      ∑_{t=1}^n (yt − f̂t(xt))² − (yt − fk(xt))² ⩽ (1/η) log( pk,n+1 / pk,1 ) ⩽ (log K)/η = 8B² log K

    since pk,n+1 ⩽ 1 and pk,1 = 1/K.

SLIDE 10

large reference class

SLIDE 11

approximate F by a finite class (Vovk 2001)

  • 1. Approximate F by a finite set Fε such that
        ∀f ∈ F, ∃fε ∈ Fε : ∥f − fε∥∞ ⩽ ε.   (1)
      Such a set Fε is called an ε-net of F.
  • 2. Run EWA on Fε.

Definition (metric entropy). The cardinality of the smallest ε-net Fε satisfying (1) is denoted N∞(F, ε). The metric entropy of F is log N∞(F, ε).

Regret bound of order (forgetting constants):

  Regn(F) = Regn(Fε) + [ inf_{fε∈Fε} ∑_{t=1}^n (yt − fε(xt))² − inf_{f∈F} ∑_{t=1}^n (yt − f(xt))² ]
          ≲ log N∞(F, ε)   [regret of EWA on Fε]   +   εn   [approximation of F by Fε]


SLIDE 13

examples of reference classes: the parametric case

If N∞(F, ε) ≲ ε^{−p} for some p > 0 as ε → 0,

  Regn(F) ≲ log N∞(F, ε) + εn ≈ log(ε^{−p}) + εn ≈ p log(n)   for ε ≈ 1/n   ➝ optimal

Example. Assume you have d ⩾ 1 black-box forecasters φ1, . . . , φd ∈ Y^X.

  • linear regression in a compact ball: F = { ∑_{j=1}^d uj φj : u ∈ Θ ⊂ R^d, Θ compact }
      → N∞(F, ε) ≲ ε^{−d}
  • sparse linear regression: F = { ∑_{j=1}^d uj φj : u ∈ [0, 1]^d s.t. ∥u∥1 = 1 and ∥u∥0 = s }
      Then [2], log N∞(F, ε) ≲ log (d choose s) + s log(1 + 1/(ε√s))
      → Regn(F) ≲ s log(1 + dn/s)

[2] F. Gao, C.-K. Ing, and Y. Yang. "Metric entropy and sparse linear approximation of ℓq-hulls for 0 < q ⩽ 1". In: Journal of Approximation Theory (2013).
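As a rough illustration of the ε^{−d} covering bound for the compact-ball example, here is a sketch that builds a finite ε-net by placing the coefficient vector on a grid and then exposes it as a list of functions on which EWA (previous sketch) can be run. It assumes sup-norm-bounded features ∥φj∥∞ ⩽ 1 and a box [−R, R]^d instead of a general compact Θ, so the bounds and names are illustrative.

```python
import itertools
import numpy as np

def linear_eps_net(phis, R, eps):
    """Sketch: eps-net of F = {x -> sum_j u_j phi_j(x) : u in [-R, R]^d},
    assuming sup_x |phi_j(x)| <= 1.  Coefficients are put on a grid of spacing
    2*eps/d, so that ||f_u - f_v||_inf <= ||u - v||_1 <= eps between any f_u
    and its closest grid function.  Cardinality ~ (R d / eps)^d, i.e. eps^{-d}
    up to constants."""
    d = len(phis)
    step = 2.0 * eps / d
    grid_1d = np.arange(-R, R + step, step)
    net = []
    for u in itertools.product(grid_1d, repeat=d):   # exponential in d: only for tiny d
        u = np.array(u)
        net.append(lambda x, u=u: float(np.dot(u, [phi(x) for phi in phis])))
    return net   # feed this finite class to the EWA sketch above
```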


SLIDE 15

what if F is nonparametric?

If log N∞(F, ε) ≲ ε^{−p} for some p > 0 as ε → 0,

  Regn(F) ≲ log N∞(F, ε) + εn ≲ ε^{−p} + εn ≈ n^{p/(p+1)}   for ε = n^{−1/(p+1)}

  ➝ suboptimal: the minimax rate is n^{p/(p+2)} if p < 2 and n^{(p−1)/p} if p > 2.

Example

  • 1-Lipschitz ball on [0, 1]:
      F = { f ∈ Y^X : |f(x) − f(y)| ⩽ |x − y| for all x, y ∈ X ⊂ [0, 1] }
      Then log N∞(F, ε) ≈ ε^{−1} → Regn(F) ≲ √n   ➝ suboptimal: n^{1/3}
  • Hölder ball on X ⊂ [0, 1] with regularity β = q + α > 1/2:
      F = { f ∈ Y^X : |f^{(q)}(x) − f^{(q)}(y)| ⩽ |x − y|^α for all x, y ∈ X, and ∥f^{(k)}∥∞ ⩽ B for all k ⩽ q }
      Then [3] log N∞(F, ε) ≈ ε^{−1/β} → Regn(F) ≲ n^{1/(1+β)}   ➝ suboptimal: n^{1/(1+2β)}.

[3] G. G. Lorentz. "Metric Entropy, Widths, and Superpositions of Functions". In: Amer. Math. Monthly 69.6 (1962).

SLIDE 16

minimax rates

Theorem (Rakhlin and Sridharan 2014 [4])
The minimax rate of the regret is of order

  inf_{γ⩾ε⩾0} { log N^seq(F, γ) + √n ∫_ε^γ √(log N^seq(F, τ)) dτ + εn }

where log N^seq(F, ε) ⩽ log N∞(F, ε) is the sequential entropy of F.

  • log N∞(F, γ): regret of EWA against a γ-net ➝ crude approximation
  • εn: approximation error of the ε-net ➝ fine approximation
  • √n ∫_ε^γ √(log N∞(F, τ)) dτ: from the large scale γ to the small scale ε.

This last term is a Dudley entropy integral, which appears in

  • chaining to bound the supremum of a stochastic process (Dudley 1967)
  • statistical learning with i.i.d. data to derive risk bounds (e.g., Massart 2007; Rakhlin et al. 2013)
  • online learning with arbitrary sequences (Opper and Haussler 1997; Cesa-Bianchi and Lugosi 1999)

[4] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT (2014).

SLIDE 17

minimax rates

Theorem (Rakhlin and Sridharan 2014 [4])
The minimax rate of the regret is of order

  inf_{γ⩾ε⩾0} { log N∞(F, γ) + √n ∫_ε^γ √(log N∞(F, τ)) dτ + εn }

if log N∞(F, ε) ≈ log N^seq(F, ε).

  • log N∞(F, γ): regret of EWA against a γ-net ➝ crude approximation
  • εn: approximation error of the ε-net ➝ fine approximation
  • √n ∫_ε^γ √(log N∞(F, τ)) dτ: from the large scale γ to the small scale ε.

This last term is a Dudley entropy integral, which appears in

  • chaining to bound the supremum of a stochastic process (Dudley 1967)
  • statistical learning with i.i.d. data to derive risk bounds (e.g., Massart 2007; Rakhlin et al. 2013)
  • online learning with arbitrary sequences (Opper and Haussler 1997; Cesa-Bianchi and Lugosi 1999)

[4] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT (2014).

SLIDE 18

minimax rates

Theorem (Rakhlin and Sridharan 2014 [4])
The minimax rate of the regret is of order

  inf_{γ⩾ε⩾0} { log N∞(F, γ) + √n ∫_ε^γ √(log N∞(F, τ)) dτ + εn }

if log N∞(F, ε) ≈ log N^seq(F, ε).

Example: let p ∈ (0, 2) and F be such that log N∞(F, ε) ≈ ε^{−p} as ε → 0. The minimax regret is then of order

  γ^{−p} + √n ∫_ε^γ τ^{−p/2} dτ + εn = γ^{−p} + √n γ^{1−p/2} + 0 ≈ n^{p/(p+2)}

for the optimal choices ε = 0 and γ ≈ n^{−1/(p+2)}.

[4] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT (2014).

SLIDE 19

our contributions

Main algorithm, which:

  • achieves the Dudley-type regret bound
      Regn ≲ log N∞(F, γ) + √n ∫_ε^γ √(log N∞(F, τ)) dτ + εn
  • has an efficient version for Hölder classes on [0, 1] (at the cost of a log factor)

Key subroutine (Multi-variable EG) to go from scale γ to scale ε.

  Function class         Metric entropy                           Regret of EWA     Our regret
  (general)              ε^{−p}, p ∈ (0, 2)                       n^{p/(p+1)}       n^{p/(p+2)}
  Lipschitz on [0, 1]    ε^{−1}                                   n^{1/2}           n^{1/3}
  β-Hölder on [0, 1]     ε^{−1/β}, β > 1/2                        n^{1/(β+1)}       n^{1/(2β+1)}
  Sparse lin. reg.       log (d choose s) + s log(1 + 1/(ε√s))    s log(1 + dn/s)   s log(1 + dn/s)

SLIDE 20

the chaining argument

Instead of competing directly with Fε for small ε (too many functions):

  • 1. create a "chain" of refining approximations π0(f) ∈ F(0), π1(f) ∈ F(1), . . . of any function f ∈ Fε such that, for all k ⩾ 0,
        sup_{f∈F} ∥πk(f) − f∥∞ ⩽ γ/2^k   and   Card F(k) = N∞(F, γ/2^k),
  • 2. compete with the chains:
        inf_{f∈Fε} ∑_{t=1}^n (yt − f(xt))² = inf_{f∈Fε} ∑_{t=1}^n ( yt − π0(f)(xt) − ∑_{k⩾0} [πk+1(f) − πk(f)](xt) )²,
      where π0(f) ∈ F(0), each increment πk+1(f) − πk(f) belongs to G(k) := {πk+1(f) − πk(f) : f ∈ F}, and the increments are small: ∥πk+1(f) − πk(f)∥∞ ⩽ 3γ/2^{k+1}.
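A small sketch of how such a chain can be read off once nested nets F(k) are available: project f onto each net in sup norm (approximated on a finite grid of evaluation points) and take differences of consecutive projections. The nets and the evaluation grid are assumed to be given; this is an illustration of the decomposition, not the construction used in the paper.

```python
import numpy as np

def chain_projections(f, nets, grid):
    """Sketch of the chaining decomposition.  `nets` is a list [F0, F1, ...]
    where F_k is assumed to be a (gamma / 2^k)-net in sup norm, and `grid` is
    a finite set of points used to approximate sup distances.  Returns pi_0(f)
    and the increments pi_{k+1}(f) - pi_k(f), all evaluated on the grid."""
    def sup_dist(g, h):
        return np.max(np.abs(np.array([g(x) - h(x) for x in grid])))

    projections = [min(F_k, key=lambda g: sup_dist(f, g)) for F_k in nets]   # pi_k(f)
    base = np.array([projections[0](x) for x in grid])                       # pi_0(f), lives in F(0)
    increments = [
        np.array([projections[k + 1](x) - projections[k](x) for x in grid])  # lives in G(k)
        for k in range(len(nets) - 1)
    ]
    # triangle inequality: ||pi_{k+1}(f) - pi_k(f)||_inf <= gamma/2^{k+1} + gamma/2^k = 3*gamma/2^{k+1}
    return base, increments
```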

SLIDE 21

compete with the chains: two aggregation levels

  inf_{f∈Fε} ∑_{t=1}^n (yt − f(xt))² = inf_{f∈Fε} ∑_{t=1}^n ( yt − π0(f)(xt) − ∑_{k⩾0} [πk+1(f) − πk(f)](xt) )²

with π0(f) ∈ F(0) and small increments πk+1(f) − πk(f) ∈ G(k) of sup norm ⩽ 3γ/2^{k+1}. It thus suffices to compete with

  inf_{f0∈F(0)} inf_{g0∈G(0), . . . , gKε∈G(Kε)} ∑_{t=1}^n ( yt − (f0 + g0 + · · · + gKε)(xt) )²

  • low-scale aggregation: run gradient descents simultaneously over all the gk.
      Regret cost: √n ∫_ε^γ √(log N∞(F, τ)) dτ
  • high-scale aggregation: run EWA to be competitive against all f0 ∈ F(0).
      Regret cost: log(Card F(0)) = log N∞(F, γ)

SLIDE 22

the exponentiated gradient forecaster (eg)

Let ∆N := {u ∈ R^N_+ : ∑_{i=1}^N ui = 1}.

Setting: at each step t ⩾ 1, the player plays ût ∈ ∆N and the environment chooses a convex and differentiable loss function ℓt : ∆N → R.

The Exponentiated Gradient forecaster [5]: at each forecasting instance t ⩾ 1, define ût ∈ ∆N component-wise by

  ûk,t := (1/Zt) exp( −η ∑_{s=1}^{t−1} [∇ℓs(ûs)]k )

where Zt is a normalization factor.

Regret bound: if ∥∇ℓt∥∞ ⩽ G, then for η = G^{−1} √(2 log(N)/n),

  ∑_{t=1}^n ℓt(ût) ⩽ min_{u∈∆N} ∑_{t=1}^n ℓt(u) + G √(2n log N)

If G is small, G √(n log N) can be better than B² log N.

[5] Kivinen and Warmuth (1997); Cesa-Bianchi (1999)
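A minimal Python sketch of the EG update above. It assumes the horizon n and the gradient bound G are known for the tuning, as on the slide; the class and variable names are illustrative.

```python
import numpy as np

class ExponentiatedGradient:
    """Sketch of EG on the simplex: weights proportional to
    exp(-eta * sum of past loss gradients)."""

    def __init__(self, N, G, n):
        self.eta = np.sqrt(2.0 * np.log(N) / n) / G   # tuning from the slide
        self.grad_sum = np.zeros(N)                   # sum_{s<t} grad l_s(u_s)

    def play(self):
        # shifting by the minimum only improves numerical stability
        w = np.exp(-self.eta * (self.grad_sum - self.grad_sum.min()))
        return w / w.sum()                            # u_t in the simplex

    def update(self, grad_t):
        self.grad_sum += grad_t                       # gradient of l_t at the played point u_t
```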

SLIDE 23

exponentiated gradient simultaneously on all chain links

Let ∆N denote the simplex in R^N. Goal: minimize a sequence of multi-variable losses (u^(1), . . . , u^(K)) ↦ ℓt(u^(1), . . . , u^(K)) simultaneously over all variables (u^(1), . . . , u^(K)) ∈ ∆N1 × · · · × ∆NK.

The Multi-variable Exponentiated Gradient forecaster

  input: tuning parameters η^(1), . . . , η^(K) > 0
  initialization: û1^(k) := (1/Nk, . . . , 1/Nk) ∈ ∆Nk for all k = 1, . . . , K
  for each round t = 2, 3, . . . : compute the weight vectors (ût^(1), . . . , ût^(K)) ∈ ∆N1 × · · · × ∆NK as follows (Zt^(k) is a normalization factor):

      û^(k)_{t,i} := (1/Zt^(k)) exp( −η^(k) ∑_{s=1}^{t−1} [∇_{u^(k)} ℓs(ûs^(1), . . . , ûs^(K))]i ),   i ∈ {1, . . . , Nk}

Regret bound: if the ℓt are jointly convex and differentiable with ∥∇_{u^(k)} ℓt∥∞ ⩽ G^(k), then Multi-variable EG tuned with η^(k) = √(2 log(Nk)/n) / G^(k) satisfies:

  ∑_{t=1}^n ℓt(ût^(1), . . . , ût^(K)) − min_{u^(1), . . . , u^(K)} ∑_{t=1}^n ℓt(u^(1), . . . , u^(K)) ⩽ √(2n) ∑_{k=1}^K G^(k) √(log Nk)

SLIDE 24

how to use it to compete with chains?

Remember the goal:

  inf_{f0∈F(0)} inf_{g0∈G(0), . . . , gKε∈G(Kε)} ∑_{t=1}^n ( yt − (f0 + g0 + · · · + gKε)(xt) )²

  • low-scale aggregation: Multi-variable EG with the loss functions
      ℓt(u^(0), . . . , u^(Kε)) = ( yt − f0(xt) − ∑_{k=0}^{Kε} u^(k) · g^(k)(xt) )²
    where g^(k)(xt) denotes the vector (g(xt))_{g∈G(k)}.
      Regret cost: √n ∫_ε^γ √(log N∞(F, τ)) dτ
  • high-scale aggregation: run EWA to be competitive against all f0 ∈ F(0).
      Regret cost: log(Card F(0)) = log N∞(F, γ)
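Continuing the sketches above, here is one round of the low-scale aggregation: one EG instance (from the sketch after SLIDE 22) per chain level, all updated with the gradient of the shared square loss. The value f0(xt) and the vectors g^(k)(xt) are assumed to be given; the high-scale EWA aggregation over F(0) would run on top of this and is not shown.

```python
import numpy as np

def multivariable_eg_step(eg_instances, f0_xt, g_values, y_t):
    """One round of Multi-variable EG for the chaining loss
        l_t(u^(0), ..., u^(K)) = (y_t - f0(x_t) - sum_k u^(k) . g^(k)(x_t))^2.
    `eg_instances` holds one ExponentiatedGradient per level k and
    `g_values[k]` is the vector (g(x_t))_{g in G(k)}."""
    us = [eg.play() for eg in eg_instances]                              # current weights u_t^(k)
    prediction = f0_xt + sum(np.dot(u, g) for u, g in zip(us, g_values)) # formed before y_t is used
    residual = y_t - prediction
    for eg, g in zip(eg_instances, g_values):
        eg.update(-2.0 * residual * g)   # coordinate-wise gradient of l_t with respect to u^(k)
    return prediction
```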

SLIDE 25

main result

Theorem (G. and Gerchinovitz 2015)
Let B > 0, n ⩾ 1, and γ ∈ (B/n, B). Assume that max_{1⩽t⩽n} |yt| ⩽ B and that sup_{f∈F} ∥f∥∞ ⩽ B. Then Chaining EWA, well tuned (depending on γ, B, and n), satisfies:

  Regn(F) ⩽ B² ( 5 + 50 log N∞(F, γ) ) + 120 B √n ∫_0^{γ/2} √(log N∞(F, ε)) dε

Remarks:

  • a bound with ε ≠ 0 and an additional εn term can also be obtained
  • calibration in B and n should be possible by the doubling trick and clipping

SLIDE 26

efficient implementation for lipschitz functions

The idea is to design computationally manageable coverings F(k), k ⩾ 0:

  • approximate any Lipschitz function f : [0, 1] → [−B, B] with piecewise-constant functions (level k = 0);
  • refine the approximation via a dyadic discretization (levels k ⩾ 1).

(figure: dyadic discretization of [0, 1] into nested subintervals)

At each round t, the point xt falls into exactly one subinterval at each level k ➟ no need to update all coefficients ➟ manageable complexity.

For Hölder functions: piecewise constant ➝ piecewise polynomials.
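A small sketch of why the update cost stays manageable: at each level k the dyadic cell containing xt is found in constant time, so only one coefficient per level is touched. How these cells index the coefficients of the full algorithm is not shown here; the helper names are illustrative.

```python
def dyadic_index(x, k):
    """Index of the dyadic subinterval of [0, 1) of length 2^{-k} containing x."""
    return min(int(x * (1 << k)), (1 << k) - 1)   # clip x = 1.0 into the last cell

def touched_cells(x, K):
    """Cells to visit at round t: exactly one per level k = 0, ..., K,
    so the per-round update cost is O(K) coefficients instead of O(2^K)."""
    return [(k, dyadic_index(x, k)) for k in range(K + 1)]

# Example: x_t = 0.3 touches cells (0, 0), (1, 0), (2, 1), (3, 2), ...
```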

SLIDE 27

complexity

  Function class         Time complexity    Space complexity
  Lipschitz on [0, 1]    n^{4/3} log n      n^{4/3} log n
  β-Hölder on [0, 1]     poly(n)            poly(n)

Can be extended to Lipschitz functions on [0, 1]^d at small cost. We lose a log factor in the regret bound.

  • B. Guedj and his PhD student are implementing it for numerical experiments.

SLIDE 28

extensions, current (and future) work

SLIDE 29

extension to general loss functions

Goal: minimize the regret Regn = ∑_{t=1}^n ℓt(f̂t) − inf_{f∈F} ∑_{t=1}^n ℓt(f) for generic sequences of loss functions (ℓt).

If the loss functions ℓt are convex and Lipschitz, we can achieve

  Regn(F) ≲ √n ∫_ε^1 √(log N∞(F, τ)) dτ + εn

(the large-scale term log N∞(F, γ) is no longer available: it relied on the exp-concavity of the square loss).

  Lipschitz class on [0, 1]^d    Metric entropy    EWA regret         Our regret
  d = 1                          ε^{−1}            n^{2/3}            n^{1/2}
  d = 2                          ε^{−2}            n^{3/4}            n^{1/2} log n
  d ⩾ 3                          ε^{−d}            n^{(d+1)/(d+2)}    n^{(d−1)/d}

First constructive algorithm to achieve the optimal [6] rates. The rate n^{(d+1)/(d+2)} was achieved by G. and Baudin (2014) and Hazan and Megiddo (2007).

[6] A. Rakhlin and K. Sridharan. "Online Nonparametric Regression with General Loss Functions". In: arXiv (2015).

SLIDE 30

extension to lipschitz bandits?

Setting: at each step t ⩾ 1,

  • simultaneously, the player plays x̂t ∈ F (possibly random) and the environment chooses a convex Lipschitz loss ℓt (hidden)
  • then the player suffers and observes only ℓt(x̂t)

Goal: minimize the expected regret Regn(F) := E[ ∑_{t=1}^n ℓt(x̂t) ] − inf_{x∈F} ∑_{t=1}^n ℓt(x)

                               Full information                     Bandit feedback
  Finite (Card F = K) + EWA    √(n log K)                           √(nK)
  ε-net + EWA                  √(n log N∞(F, ε)) + εn               √(n N∞(F, ε)) + εn
  Chaining                     √n ∫_ε^1 √(log N∞(F, τ)) dτ + εn     √n ∫_ε^1 √(N∞(F, τ)) dτ + εn ?

For F ⊂ [0, 1] this would lead to O(√n) regret → first constructive algorithm!

Difficulty: taking advantage of small loss ranges in bandits.

SLIDE 31

other related open questions

  • Get the sequential entropy N^seq(F, ε) instead of the metric entropy N∞(F, ε)
  • Efficient versions for other function classes
    – step-wise Lipschitz functions ➝ application to classification
    – generalized additive models ➝ useful to predict electricity consumption
  • Similar results with other algorithms (kernel regression)

Thank you!

SLIDE 32

references

  • N. Cesa-Bianchi. "Analysis of two gradient-based algorithms for on-line regression". In: J. Comput. System Sci. 59.3 (1999), pp. 392–411.
  • N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • P. Gaillard and P. Baudin. "A consistent deterministic regression tree for non-parametric prediction of time series". http://arxiv.org/abs/1405.1533. 2014.
  • P. Gaillard and S. Gerchinovitz. "A Chaining Algorithm for Online Nonparametric Regression". In: Proceedings of COLT'15. Vol. 40. JMLR: Workshop and Conference Proceedings, 2015, pp. 764–796.
  • F. Gao, C.-K. Ing, and Y. Yang. "Metric entropy and sparse linear approximation of ℓq-hulls for 0 < q ⩽ 1". In: Journal of Approximation Theory 166 (2013), pp. 42–55.
  • E. Hazan and N. Megiddo. "Online Learning with Prior Knowledge". In: Proceedings of the 20th Annual Conference on Learning Theory (COLT'07). Ed. by N. H. Bshouty and C. Gentile. Vol. 4539. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007, pp. 499–513.

SLIDE 33

  • J. Kivinen and M. K. Warmuth. "Exponentiated Gradient Versus Gradient Descent for Linear Predictors". In: Information and Computation 132.1 (1997), pp. 1–63.
  • N. Littlestone and M. K. Warmuth. "The Weighted Majority Algorithm". In: Information and Computation 108.2 (1994), pp. 212–261.
  • G. G. Lorentz. "Metric Entropy, Widths, and Superpositions of Functions". In: Amer. Math. Monthly 69.6 (1962), pp. 469–485.
  • A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT 35 (2014), pp. 1232–1264.
  • A. Rakhlin and K. Sridharan. "Online Nonparametric Regression with General Loss Functions". In: arXiv (2015).
  • V. Vovk. "Aggregating Strategies". In: Proceedings of the Third Workshop on Computational Learning Theory. 1990, pp. 371–386.
  • V. Vovk. "Competitive on-line statistics". In: International Statistical Review 69.2 (2001), pp. 213–248.