Adaptive inference and its relations to sequential decision making
SLIDE 1

Adaptive inference and its relations to sequential decision making

Alexandra Carpentier¹, OvGU Magdeburg

Based on joint works with Olga Klopp, Samory Kpotufe, Andrea Locatelli, Matthias Löffler, Richard Nickl

Criteo, Oct. 2nd, 2019

¹Partly funded by the DFG EN CA1488, the CRC 1294, the GK 2297, the GK 2433.


SLIDE 3

Non-Convex Optimization

Problem

Finding/Exploiting the maximum M(f) of an unknown function f.

Question

Can we design algorithms that adapt to the difficulty of the problem?


SLIDE 5

Non-Convex Optimization

Depending on the difficulty of the problem, we would hope to get different performances :

Question

Can we adapt to the hyperparameters?

SLIDE 6

Scope of this talk

Talk :
◮ Presentation of adaptive inference in statistics.
◮ Adaptivity in continuously armed bandits.

SLIDE 7

ADAPTIVE INFERENCE

SLIDE 12

Adaptive inference for non-parametric regression

Problem : Non-parametric regression

[Figure: scatter plot of noisy observations (Xi, Yi)]

Inference (estimation + uncertainty quantification) of the function?

The Model : f is a function on [0, 1]^d. We observe n data samples (Xi, Yi)_{i ≤ n} :

    Yi = f(Xi) + εi,  i = 1, ..., n,

where Xi ∼iid U([0, 1]^d) and the εi are independent centered noise with |εi| ≤ 1.

C(α) = Hölder ball of smoothness α. E.g. for α ≤ 1 :

    C(α) = { f : |f(x) − f(y)| ≤ ‖x − y‖_∞^α }.
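To make the sampling model concrete, here is a minimal simulation sketch (Python). The peak function f and all names are illustrative assumptions, not from the talk; the chosen f satisfies the Hölder condition just defined.

```python
import numpy as np

# Illustrative alpha-Hoelder test function: a single peak at x = (0.5, ..., 0.5).
# |f(x) - f(y)| <= ||x - y||_inf^alpha holds for alpha <= 1.
def f(x, alpha=0.6):
    return 1.0 - np.min(np.abs(x - 0.5), axis=-1) ** alpha

d, n = 2, 1000
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(n, d))    # X_i ~ U([0,1]^d)
eps = rng.uniform(-1.0, 1.0, size=n)      # independent, centered, |eps_i| <= 1
Y = f(X) + eps                            # Y_i = f(X_i) + eps_i
```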

SLIDE 13

Adaptive inference for non-parametric regression

Problem : Non-parametric regression

[Figure: scatter plot of noisy observations (Xi, Yi)]

Inference (estimation + uncertainty quantification) of the function?

Question : If f ∈ C(α), then the "optimal" precision of inference should depend on α. Inference adaptive to α?

SLIDE 14

Adaptive inference

Adaptive estimation and confidence statements :

See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.

◮ "Large" sets C0 ⊂ C1, e.g. C0 = C(γ) and C1 = C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1.
◮ Receive a dataset of n i.i.d. entries according to Pf.

Adaptive inference : adaptation to the set Ch when f ∈ Ch, h ∈ {0, 1}.

[Figure: nested classes C0 ⊂ C1]

SLIDE 15

Adaptive inference

Setting and references as on SLIDE 14.

Estimation :
◮ Minimax-optimal estimation errors r0 (over C0) and r1 (over C1) in ‖·‖ norm.

Minimax-optimal estimation error :

    rh = inf_{f̃ estimator} sup_{f ∈ Ch} Ef ‖f̃ − f‖,  h ∈ {0, 1}.

Minimax-optimal ‖·‖_∞ estimation error in non-parametric regression over C(α) :

    (log(n)/n)^{α/(2α+d)}.

See [Lepski, 1990-92, etc].


SLIDE 17

Adaptive inference

Setting and references as on SLIDE 14.

Adaptive estimation :
◮ Minimax-optimal estimation errors r0 (over C0) and r1 (over C1) in ‖·‖ norm.
◮ In many models an adaptive estimator f̂ exists :

    sup_{f ∈ Ch} Ef ‖f̂ − f‖ ≤ rh,  ∀h ∈ {0, 1}.

Adaptive estimators exist in non-parametric regression. See [Lepski, 1990-92, Donoho and Johnstone, 1998, etc].
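As a rough illustration of how such adaptive estimators work, here is a Lepski-type bandwidth selection at a single point. This is a heavily simplified one-dimensional caricature with illustrative constants, not the exact procedure of the cited papers.

```python
import numpy as np

def local_average(x0, X, Y, h):
    """Average the Y_i whose X_i fall in the window [x0 - h, x0 + h]."""
    mask = np.abs(X - x0) <= h
    m = int(mask.sum())
    return (Y[mask].mean() if m > 0 else 0.0), max(m, 1)

def lepski_estimate(x0, X, Y, c=1.0):
    """Lepski-type rule: starting from the largest bandwidth, keep the first
    (largest) h whose estimate agrees with all smaller-bandwidth estimates up
    to their stochastic errors. This balances the unknown bias (growing with
    h) against the noise (shrinking with h) without knowing alpha."""
    n = len(X)
    hs = [2.0 ** -j for j in range(1, max(int(np.log2(n)) // 2, 2) + 1)]
    vals = {}
    for h in hs:  # hs runs from the largest to the smallest bandwidth
        v, m = local_average(x0, X, Y, h)
        vals[h] = (v, c * np.sqrt(np.log(n) / m))  # (estimate, noise level)
    for i, h in enumerate(hs):
        if all(abs(vals[h][0] - vals[h2][0]) <= vals[h][1] + vals[h2][1]
               for h2 in hs[i + 1:]):
            return vals[h][0]
    return vals[hs[-1]][0]
```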


SLIDE 19

Adaptive inference

Setting and references as on SLIDE 14.

Adaptive and honest confidence sets :
◮ Minimax-optimal estimation errors r0, r1 in ‖·‖ norm.
◮ Confidence set Ĉ : contains f and has adaptive diameter.

η-adaptive and honest confidence set :

    Honesty :    inf_{f ∈ C1} Pf(f ∈ Ĉ) ≥ 1 − η.
    Adaptivity : sup_{f ∈ Ch} Ef |Ĉ| ≤ rh,  ∀h ∈ {0, 1}.


SLIDE 21

Adaptive inference

Setting and references as on SLIDE 14.

In non-parametric regression : adaptive and honest confidence sets do not exist.

See [Cai and Low (2004)], [Hoffmann and Nickl (2011)], etc.

Indeed, the minimax rate for testing between C0 = C(γ) and C1 = C(α) in the ‖·‖_∞ norm is

    (log(n)/n)^{α/(2α+d)} = r1 ≫ r0.

This is a common situation, the adaptive inference paradox - see [Giné and Nickl, 2011], [C., Klopp, Löffler, Nickl, 2017] for a systematic study and relations to a testing problem.


SLIDE 23

Subtle problem : Matrix completion

Problem / Application : Recommendation systems (e.g. Netflix).

[Figure: partially observed ratings matrix with customers Alice, Bob, Carine, Daniel, Ed]

Inference (estimation + uncertainty quantification) of the matrix?

Trace Regression Model : f is a matrix of dimension d × d. We observe n data samples (Xi, Yi)_{i ≤ n} :

    Yi = f_{Xi} + εi,  i = 1, ..., n,

where Xi ∼iid U({1, ..., d}^2) and the εi are independent centered noise with |εi| ≤ 1.

[Figure: customers × products matrix with sampled entries]

SLIDE 24

Subtle problem : Matrix completion

Problem / Application : Recommendation systems (e.g. Netflix).

[Figure: partially observed ratings matrix with customers Alice, Bob, Carine, Daniel, Ed]

Inference (estimation + uncertainty quantification) of the matrix?

Bernoulli Model : f is a matrix of dimension d × d. We observe

    Y_{i,j} = (f_{i,j} + ε_{i,j}) B_{i,j},  (i, j) ∈ {1, ..., d}^2,

where B_{i,j} ∼iid Bernoulli(n/d^2) and the ε_{i,j} are independent centered noise with |ε_{i,j}| ≤ 1.

[Figure: customers × products matrix with Bernoulli-sampled entries]


SLIDE 27

Subtle problem : Matrix completion

Problem / Application : Recommendation systems (e.g. Netflix).

[Figure: partially observed ratings matrix with customers Alice, Bob, Carine, Daniel, Ed]

Inference (estimation + uncertainty quantification) of the matrix?

High-dimensional regime : d^2 ≥ n. Let, for 1 ≤ k ≤ d,

    C(k) = { f : rank(f) ≤ k, ‖f‖_∞ ≤ 1 }.

Question : If f ∈ C(k), then the "optimal" precision of inference should depend on k. Inference adaptive to k?
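For concreteness, a minimal sketch of the two sampling schemes on a rank-k ground truth (Python; all names and the construction of f are illustrative assumptions):

```python
import numpy as np

d, n, k = 50, 2000, 3
rng = np.random.default_rng(0)

# A hypothetical ground-truth matrix f in C(k): rank k and ||f||_inf <= 1.
U, V = rng.normal(size=(d, k)), rng.normal(size=(d, k))
f = U @ V.T
f /= np.abs(f).max()

# Trace regression: n entries drawn uniformly at random (with replacement).
rows = rng.integers(0, d, size=n)              # X_i ~ U({1,...,d}^2)
cols = rng.integers(0, d, size=n)
Y_trace = f[rows, cols] + rng.uniform(-1.0, 1.0, size=n)

# Bernoulli model: each entry revealed independently with probability n/d^2.
B = rng.random((d, d)) < n / d**2              # B_{i,j} ~ Bernoulli(n/d^2)
Y_bern = (f + rng.uniform(-1.0, 1.0, size=(d, d))) * B
```

Note the structural difference that the proof sketch below exploits: under the Bernoulli scheme no entry is ever sampled twice, whereas trace regression can resample the same entry.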

SLIDE 28

Adaptive estimation

There exists an adaptive estimator f̂ of f ∈ C(k) that achieves the minimax-optimal error rk over all C(k) :

    E ‖f̂ − f‖_F ≤ d √(kd/n) =: rk,

where ‖·‖_F is the Frobenius norm; see [Keshavan et al., 2009, Cai et al., 2010, Koltchinskii et al., 2011, Klopp and Gaiffas, 2015]. In terms of estimation of f, the trace regression and Bernoulli models are equivalent.

Question : Adaptive and honest confidence set scaling with rk?
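As an illustration of how such estimators can be computed, here is a sketch of matrix completion by iteratively soft-thresholding singular values, in the spirit of the nuclear-norm methods cited above (an illustrative variant with an untuned λ, not the exact estimator of those papers):

```python
import numpy as np

def soft_impute(Y, B, lam, iters=200):
    """Fill the unobserved entries with the current fit, then shrink all
    singular values by lam (soft-thresholding). The shrinkage promotes low
    rank without requiring the rank k to be known in advance."""
    Z = np.zeros_like(Y)
    for _ in range(iters):
        filled = np.where(B, Y, Z)               # observed data + current fit
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt  # soft-threshold the spectrum
    return Z
```

With Y_bern and B from the previous sketch, f_hat = soft_impute(Y_bern, B, lam) for a λ matched to the noise level (e.g. chosen by cross-validation) would be a natural starting point.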



SLIDE 31

Confidence sets : Trace Regression Model

Theorem (C., Klopp, Löffler and Nickl, 2016)

In the matrix completion "trace regression" model, η-adaptive and honest confidence sets exist. The dimension reduction in the smaller model is not too radical.


SLIDE 33

Confidence sets : Bernoulli Model

Theorem (C., Klopp, Löffler and Nickl, 2016)

◮ Bernoulli Model with known noise variance : adaptive and honest confidence sets exist.
◮ Bernoulli Model with unknown noise variance : adaptive and honest confidence sets do not exist.

The two models are not equivalent in this case!


SLIDE 37

(Simplified) Idea of the proof : Unknown variance

No entries are sampled twice! First example : rank one.

H0 : Random opinions! [Figure: customers × products matrix of i.i.d. ± signs]
H1 : Rank one opinions. [Figure: customers × products matrix with rank-one ± sign structure]

The test looks at cycles, i.e. 2×2 patterns of entries sampled in two common rows and two common columns. With high probability there are fewer than n^4/d^4 such cycles, so distinguishability is possible only if n ≫ d.
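This cycle argument can be made concrete: under H1 with sign structure f_{ij} = ui·vj, the product of the four signs around any fully observed 2×2 pattern is always +1, while under H0 it is a fair ±1 coin; the scarcity of such patterns when n ≪ d is what makes the two hypotheses indistinguishable. A brute-force sketch of the statistic (illustrative, not the talk's exact construction):

```python
import numpy as np
from itertools import combinations

def cycle_statistic(S, B):
    """Average the product of signs around every fully observed 2x2 pattern
    (i, j), (i, j'), (i', j), (i', j'). Under rank-one sign structure every
    product equals +1; under i.i.d. random signs it is an unbiased coin."""
    d = S.shape[0]
    prods = []
    for i, i2 in combinations(range(d), 2):
        cols = np.flatnonzero(B[i] & B[i2])   # columns observed in both rows
        for j, j2 in combinations(cols.tolist(), 2):
            prods.append(S[i, j] * S[i, j2] * S[i2, j] * S[i2, j2])
    return (float(np.mean(prods)), len(prods)) if prods else (0.0, 0)
```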


SLIDE 41

(Simplified) Idea of the proof : Unknown variance

No entries are sampled twice! General case : rank k.

H0 : Random opinions! [Figure: customers × products matrix of i.i.d. ± signs]
H1 : Rank k opinions. [Figure: customers × products matrix with low-rank ± sign structure]

There are fewer than n^4/(d^4 k^3) correct cycles (taking rank groups into account), so distinguishability is possible only if n ≫ k^{3/4} d.

SLIDE 42

Conclusion on adaptive inference

Adaptive inference paradox : adaptive estimation is generally possible, while adaptive uncertainty quantification mostly is not. We have seen that in non-parametric regression with the L∞ norm :
◮ adaptive estimation is possible;
◮ adaptive and honest confidence sets do not exist.
A typical example of the adaptive inference paradox.

SLIDE 43

ADAPTIVITY IN X-ARMED BANDITS

SLIDE 44

Non-Convex Optimization

Problem

Finding/Exploiting the maximum M(f) of an unknown function f.

Question

Can we design algorithms that adapt to the difficulty of the problem?


SLIDE 46

Non-Convex Optimization

Depending on the difficulty of the problem, we would hope to get different performances :

Question

Can we adapt to the hyperparameters?


SLIDE 53

X-armed bandit problem

Game :
◮ Parameters : a function f with M(f) = max_x f(x), and a budget n.
◮ For t = 1, ..., n :
  ◮ the learner picks Xt ∈ [0, 1]^d,
  ◮ and receives Yt = f(Xt) + ǫt, where the ǫt are independent noise with E ǫt = 0 and |ǫt| ≤ 1.
◮ Output x(n) ∈ [0, 1]^d.

Performance measures :
◮ Simple regret : rn = M(f) − f(x(n)).
◮ Cumulative regret : Rn = n M(f) − Σ_{t=1}^n f(Xt).

[Figure: function with maximum M(f) and sampled points]
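A sketch of the protocol and of the two performance measures (Python; the learner interface with pick/update/recommend is an illustrative assumption):

```python
import numpy as np

def play(learner, f, M, n, rng):
    """Run the X-armed bandit protocol for n rounds and return both regrets.
    M = M(f) is used only to score the run; it is never shown to the learner."""
    cum_payoff = 0.0
    for t in range(n):
        x = learner.pick()                     # learner picks X_t in [0,1]^d
        y = f(x) + rng.uniform(-1.0, 1.0)      # receives Y_t = f(X_t) + noise
        learner.update(x, y)
        cum_payoff += f(x)
    x_n = learner.recommend()                  # output x(n)
    return M - f(x_n), n * M - cum_payoff      # (simple, cumulative) regret
```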


SLIDE 55

Classical result for stochastic bandits

K-armed stochastic bandits : in the discrete case - f constant by parts on K known sets - this is the classical stochastic bandit problem. The minimax regret satisfies (up to logarithmic terms)

    inf_{algo A} sup_{K-discrete f} rn(A, f) ≈ √(K/n),   and   inf_{algo A} sup_{K-discrete f} Rn(A, f) ≈ √(nK).

Idea

Approximate the continuous function f in Kn 'relevant' parts - this will depend on the regularity of f.
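For reference, a minimal UCB sketch for the discrete K-armed case, which attains the √(nK) cumulative-regret rate up to logarithmic factors (illustrative constants):

```python
import numpy as np

def ucb(means, n, rng):
    """UCB1-style strategy on K arms with bounded centered noise; returns the
    cumulative (pseudo-)regret n * max(means) - sum of the pulled means."""
    K = len(means)
    counts, sums, pulled = np.zeros(K), np.zeros(K), 0.0
    for t in range(n):
        if t < K:
            a = t                                 # pull each arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(n) / counts)
            a = int(np.argmax(sums / counts + bonus))
        counts[a] += 1.0
        sums[a] += means[a] + rng.uniform(-1.0, 1.0)
        pulled += means[a]
    return n * max(means) - pulled
```

The continuous problem is then reduced to this one by discretizing [0, 1]^d into Kn cells whose size must be matched to the (unknown) regularity of f, which is exactly where adaptivity to α enters.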



SLIDE 60

Assumptions on f

Parametrize f from easy to hard problems.

◮ Regularity condition : ∃α > 0 s.t. ∀x, y : |f(x) − f(y)| ≤ ‖x − y‖_∞^α. For α ≤ 1 this is a Hölder assumption.

◮ Margin condition : ∃β ≥ 0 s.t. Vol({x : M(f) − f(x) ≤ ∆}) ≤ ∆^β. No restriction for β = 0; larger β corresponds to an easier problem.
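For intuition on the margin condition, a short worked example with an illustrative single-peak function (an assumption, not from the talk):

```latex
% Illustrative example: for f(x) = M(f) - \|x - x^*\|_\infty^{a}
% on [0,1]^d with a \ge \alpha:
\[
\mathrm{Vol}\big(\{x : M(f) - f(x) \le \Delta\}\big)
  = \mathrm{Vol}\big(\{x : \|x - x^*\|_\infty \le \Delta^{1/a}\}\big)
  \le \big(2\Delta^{1/a}\big)^d = 2^d \, \Delta^{d/a},
\]
% so, up to the constant, the margin condition holds with \beta = d/a.
% Sharper peaks (small a) give larger \beta, i.e. easier problems; the
% regularity constraint a \ge \alpha then forces \beta \le d/\alpha.
```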


SLIDE 62

Lower bounds

Define P(α, β) as the class of functions that satisfy these assumptions.

Theorem : Lower bound for α, β known [Bubeck et al. 11]

For any strategy that performs at most n noisy function evaluations, it holds that

    sup_{P ∈ P(α,β)} E_P[rn] ≥ n^{−α/(2α+d−αβ)} =: rα,β,

    sup_{P ∈ P(α,β)} E_P[Rn] ≥ n^{1−α/(2α+d−αβ)} =: Rα,β,

up to a constant that does not depend on the strategy; note that Rα,β = n rα,β.

Goal : design procedures without access to α, β with optimal regret.

SLIDE 63

Case α known

See e.g. [Agrawal (1995), Kleinberg (2004), Auer et al. (2007), Kleinberg et al. (2008), Bubeck et al. (2011a,b,c), Cope (2009), Munos (2014), Valko et al. (2015)], etc.

Optimistic strategies (e.g. HOO in [Bubeck et al. 11]) : use the knowledge of α to construct local (multi-scale) upper-confidence bounds on f, and choose the next Xt optimistically.

Our strategy : similar intuition, but works hierarchically (at a single scale), only refining the partition in promising cells of the previous partition.
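A heavily simplified one-dimensional sketch of this single-scale refinement idea (illustrative constants and budget accounting, not the exact algorithm of the paper):

```python
import numpy as np

def hierarchical_optimize(f, alpha, n, rng, per_cell=50):
    """Sample the midpoint of each active cell, discard cells whose empirical
    mean is provably below the best one (Hoelder bias of a cell + stochastic
    error), split the survivors in two, and repeat until the budget is spent."""
    cells = [(0.0, 1.0)]
    budget, best_x = n, 0.5
    while budget >= per_cell * len(cells):
        stats = []
        for lo, hi in cells:
            mid = (lo + hi) / 2.0
            ys = f(mid) + rng.uniform(-1.0, 1.0, size=per_cell)
            stats.append((ys.mean(), mid, lo, hi))
            budget -= per_cell
        bias = (cells[0][1] - cells[0][0]) ** alpha      # cell width^alpha
        noise = np.sqrt(2.0 * np.log(max(n, 2)) / per_cell)
        top = max(s[0] for s in stats)
        best_x = max(stats)[1]
        keep = [s for s in stats if s[0] >= top - 2.0 * (bias + noise)]
        cells = [c for s in keep for c in ((s[2], s[1]), (s[1], s[3]))]
    return best_x
```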


SLIDE 65

Case α known

Theorem : Upper bound for α known [Locatelli, C, 2018]

Our strategy for optimisation is such that, with probability at least 1 − n^{−1},

    sup_{P ∈ P(α,β)} rn ≤ r̃α,β,   and   E_P Rn ≤ R̃α,β.

See also [Bubeck et al (2011), Minsker (2013), Bull (2014), Valko et al (2015)], etc.


SLIDE 68

Adaptivity?

The algorithm naturally adapts to β but needs α as a parameter.

Question

Can we adapt to α?

Reminder from adaptive inference :
◮ Adaptive estimation is possible in non-parametric regression.
◮ Adaptive and honest confidence sets do not exist in non-parametric regression.

Question

Is the X-armed bandit problem closer to adaptive estimation, or to adaptive and honest uncertainty quantification?

SLIDE 71

Adaptivity for simple regret (optimisation)

α-adaptive strategy :
◮ Split the budget into log^2(n) chunks of the same size.
◮ Run the previous strategy with αi = i/log(n) for all i.
◮ Aggregate the recommendations (see the sketch after this slide).

Idea : there exists αi* such that α − 1/log(n) ≤ αi* ≤ α.

◮ Strategy 1 : cross-validate [Grill et al. 15].
◮ Strategy 2 : recommend x(n) ∈ ∩_{i ≤ Î} s_{n,αi} [LCK 17].

This recovers the optimal rate, but adaptively!

[Figure: three runs with α too large, just right, too small]
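A sketch of the budget-splitting scheme, reusing hierarchical_optimize from the sketch after SLIDE 63. The final scoring step is one simple cross-validation flavour of the aggregation options above; the chunking and all constants are illustrative assumptions.

```python
import numpy as np

def adapt_to_alpha(f, n, rng):
    """Run the known-alpha strategy once per candidate alpha_i = i/log(n),
    each on an equal share of the budget, then score the candidate
    recommendations on held-out samples and keep the best one."""
    L = max(int(np.log(n)), 2)
    chunk = max(n // (L + 1), 1)      # one chunk per run, one for scoring
    candidates = [hierarchical_optimize(f, i / L, chunk, rng)
                  for i in range(1, L + 1)]
    m = max(chunk // len(candidates), 1)
    scores = [np.mean(f(x) + rng.uniform(-1.0, 1.0, size=m))
              for x in candidates]
    return candidates[int(np.argmax(scores))]
```

Since some αi is within 1/log(n) of the true α, at least one run behaves nearly as if α were known, and the scoring step loses at most a polylogarithmic factor.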
SLIDE 73

Adaptivity for simple regret (optimisation)

α-adaptive strategy as on SLIDE 71 (split the budget, run the known-α strategy with each αi = i/log(n), aggregate).

Theorem : Upper bound on the simple regret [Locatelli, C, 2018]

Our strategy yields, with high probability,

    sup_{α,β ∈ S} sup_{P ∈ P(α,β)} rn / (log(n)^u rα,β) ≤ c,

for constants u, c > 0.
SLIDE 74

Adaptivity for cumulative regret

Intuition : the previous strategy favors exploration (and so has linear cumulative regret). Can we adaptively balance exploration and exploitation?

Theorem : Impossibility result for adaptive cumulative regret [Locatelli, C, 2018]

Fix γ > α > 0 and β. Any strategy whose cumulative regret is (near-)optimally bounded by R̃α,β uniformly over P(α, β) is such that

    sup_{P ∈ P(γ,β)} E_P[Rn] ≥ R̃α,β.

In fact something more refined holds for any algorithm with a given rate on P(γ, β) or P(α, β).


SLIDE 77

Conclusion

Adaptivity is possible for the simple regret but not for the cumulative regret.
◮ Simple regret is in essence closer to adaptive estimation : adaptation is possible.
◮ Cumulative regret is in essence closer to adaptive and honest confidence sets : adaptation is impossible.
More systematic relation to adaptivity in active learning?