Adaptive inference and its relations to sequential decision making
SLIDE 1

Adaptive inference and its relations to sequential decision making

Alexandra Carpentier¹, OvGU Magdeburg

Based on joint works with Olga Klopp, Samory Kpotufe, Andrea Locatelli, Matthias Löffler, Richard Nickl

Criteo, Oct. 2nd, 2019

¹Partly funded by the DFG EN CA1488, the CRC 1294, the GK 2297, the GK 2433.


SLIDE 3

Non-Convex Optimization

Problem

Finding/Exploiting the maximum M(f) of an unknown function f.

Question

Can we design algorithms that adapt to the difficulty of the problem?


SLIDE 5

Non-Convex Optimization

Depending on the difficulty of the problem, we would hope to get different performances :

Question

Can we adapt to the hyperparameters?

SLIDE 6

Scope of this talk

Talk :
◮ Presentation of adaptive inference in statistics.
◮ Adaptivity in continuously armed bandits.

SLIDE 7

ADAPTIVE INFERENCE

SLIDE 12

Adaptive inference for non-parametric regression

Problem : Non-parametric regression

[Figure: scatter plot of noisy observations (Xi, Yi)]

Inference (estimation + uncertainty quantification) of the function?

The Model : f is a function on [0, 1]^d. We observe n data samples (Xi, Yi)_{i ≤ n} :

    Yi = f(Xi) + εi,  i = 1, ..., n,

where Xi ∼iid U([0, 1]^d) and the εi are independent centered noise with |εi| ≤ 1.

C(α) = Hölder ball of smoothness α. E.g. for α ≤ 1 :

    C(α) = { f : |f(x) − f(y)| ≤ ‖x − y‖_∞^α }.
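To make the sampling model concrete, here is a minimal simulation sketch (Python). The peak function f and all names are illustrative assumptions, not from the talk; the chosen f satisfies the Hölder condition just defined.

```python
import numpy as np

# Illustrative alpha-Hoelder test function: a single peak at x = (0.5, ..., 0.5).
# |f(x) - f(y)| <= ||x - y||_inf^alpha holds for alpha <= 1.
def f(x, alpha=0.6):
    return 1.0 - np.min(np.abs(x - 0.5), axis=-1) ** alpha

d, n = 2, 1000
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(n, d))    # X_i ~ U([0,1]^d)
eps = rng.uniform(-1.0, 1.0, size=n)      # independent, centered, |eps_i| <= 1
Y = f(X) + eps                            # Y_i = f(X_i) + eps_i
```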

SLIDE 13

Adaptive inference for non-parametric regression

Problem : Non-parametric regression

[Figure: scatter plot of noisy observations (Xi, Yi)]

Inference (estimation + uncertainty quantification) of the function?

Question : If f ∈ C(α), then the "optimal" precision of inference should depend on α. Inference adaptive to α?

SLIDE 14

Adaptive inference

Adaptive estimation and confidence statements :

See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.

◮ "Large" sets C0 ⊂ C1, e.g. C0 = C(γ) and C1 = C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1.
◮ Receive a dataset of n i.i.d. entries according to Pf.

Adaptive inference : adaptation to the set Ch when f ∈ Ch, h ∈ {0, 1}.

[Figure: nested classes C0 ⊂ C1]

SLIDE 15

Adaptive inference

Setting and references as on SLIDE 14.

Estimation :
◮ Minimax-optimal estimation errors r0 (over C0) and r1 (over C1) in ‖·‖ norm.

Minimax-optimal estimation error :

    rh = inf_{f̃ estimator} sup_{f ∈ Ch} Ef ‖f̃ − f‖,  h ∈ {0, 1}.

Minimax-optimal ‖·‖_∞ estimation error in non-parametric regression over C(α) :

    (log(n)/n)^{α/(2α+d)}.

See [Lepski, 1990-92, etc].


SLIDE 17

Adaptive inference

Setting and references as on SLIDE 14.

Adaptive estimation :
◮ Minimax-optimal estimation errors r0 (over C0) and r1 (over C1) in ‖·‖ norm.
◮ In many models an adaptive estimator f̂ exists :

    sup_{f ∈ Ch} Ef ‖f̂ − f‖ ≤ rh,  ∀h ∈ {0, 1}.

Adaptive estimators exist in non-parametric regression. See [Lepski, 1990-92, Donoho and Johnstone, 1998, etc].
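As a rough illustration of how such adaptive estimators work, here is a Lepski-type bandwidth selection at a single point. This is a heavily simplified one-dimensional caricature with illustrative constants, not the exact procedure of the cited papers.

```python
import numpy as np

def local_average(x0, X, Y, h):
    """Average the Y_i whose X_i fall in the window [x0 - h, x0 + h]."""
    mask = np.abs(X - x0) <= h
    m = int(mask.sum())
    return (Y[mask].mean() if m > 0 else 0.0), max(m, 1)

def lepski_estimate(x0, X, Y, c=1.0):
    """Lepski-type rule: starting from the largest bandwidth, keep the first
    (largest) h whose estimate agrees with all smaller-bandwidth estimates up
    to their stochastic errors. This balances the unknown bias (growing with
    h) against the noise (shrinking with h) without knowing alpha."""
    n = len(X)
    hs = [2.0 ** -j for j in range(1, max(int(np.log2(n)) // 2, 2) + 1)]
    vals = {}
    for h in hs:  # hs runs from the largest to the smallest bandwidth
        v, m = local_average(x0, X, Y, h)
        vals[h] = (v, c * np.sqrt(np.log(n) / m))  # (estimate, noise level)
    for i, h in enumerate(hs):
        if all(abs(vals[h][0] - vals[h2][0]) <= vals[h][1] + vals[h2][1]
               for h2 in hs[i + 1:]):
            return vals[h][0]
    return vals[hs[-1]][0]
```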


SLIDE 19

Adaptive inference

Setting and references as on SLIDE 14.

Adaptive and honest confidence sets :
◮ Minimax-optimal estimation errors r0, r1 in ‖·‖ norm.
◮ Confidence set Ĉ : contains f and has adaptive diameter.

η-adaptive and honest confidence set :

    Honesty :    inf_{f ∈ C1} Pf(f ∈ Ĉ) ≥ 1 − η.
    Adaptivity : sup_{f ∈ Ch} Ef |Ĉ| ≤ rh,  ∀h ∈ {0, 1}.


SLIDE 21

Adaptive inference

Setting and references as on SLIDE 14.

In non-parametric regression : adaptive and honest confidence sets do not exist.

See [Cai and Low (2004)], [Hoffmann and Nickl (2011)], etc.

Indeed, the minimax rate for testing between C0 = C(γ) and C1 = C(α) in the ‖·‖_∞ norm is

    (log(n)/n)^{α/(2α+d)} = r1 ≫ r0.

This is a common situation, the adaptive inference paradox - see [Giné and Nickl, 2011], [C., Klopp, Löffler, Nickl, 2017] for a systematic study and relations to a testing problem.


SLIDE 23

Subtle problem : Matrix completion

Problem / Application : Recommendation systems (e.g. Netflix).

[Figure: partially observed ratings matrix with customers Alice, Bob, Carine, Daniel, Ed]

Inference (estimation + uncertainty quantification) of the matrix?

Trace Regression Model : f is a matrix of dimension d × d. We observe n data samples (Xi, Yi)_{i ≤ n} :

    Yi = f_{Xi} + εi,  i = 1, ..., n,

where Xi ∼iid U({1, ..., d}^2) and the εi are independent centered noise with |εi| ≤ 1.

[Figure: customers × products matrix with sampled entries]

SLIDE 24

Subtle problem : Matrix completion

Problem / Application : Recommendation systems (e.g. Netflix).

[Figure: partially observed ratings matrix with customers Alice, Bob, Carine, Daniel, Ed]

Inference (estimation + uncertainty quantification) of the matrix?

Bernoulli Model : f is a matrix of dimension d × d. We observe

    Y_{i,j} = (f_{i,j} + ε_{i,j}) B_{i,j},  (i, j) ∈ {1, ..., d}^2,

where B_{i,j} ∼iid Bernoulli(n/d^2) and the ε_{i,j} are independent centered noise with |ε_{i,j}| ≤ 1.

[Figure: customers × products matrix with Bernoulli-sampled entries]


SLIDE 27

Subtle problem : Matrix completion

Problem / Application : Recommendation systems (e.g. Netflix).

[Figure: partially observed ratings matrix with customers Alice, Bob, Carine, Daniel, Ed]

Inference (estimation + uncertainty quantification) of the matrix?

High-dimensional regime : d^2 ≥ n. Let, for 1 ≤ k ≤ d,

    C(k) = { f : rank(f) ≤ k, ‖f‖_∞ ≤ 1 }.

Question : If f ∈ C(k), then the "optimal" precision of inference should depend on k. Inference adaptive to k?
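For concreteness, a minimal sketch of the two sampling schemes on a rank-k ground truth (Python; all names and the construction of f are illustrative assumptions):

```python
import numpy as np

d, n, k = 50, 2000, 3
rng = np.random.default_rng(0)

# A hypothetical ground-truth matrix f in C(k): rank k and ||f||_inf <= 1.
U, V = rng.normal(size=(d, k)), rng.normal(size=(d, k))
f = U @ V.T
f /= np.abs(f).max()

# Trace regression: n entries drawn uniformly at random (with replacement).
rows = rng.integers(0, d, size=n)              # X_i ~ U({1,...,d}^2)
cols = rng.integers(0, d, size=n)
Y_trace = f[rows, cols] + rng.uniform(-1.0, 1.0, size=n)

# Bernoulli model: each entry revealed independently with probability n/d^2.
B = rng.random((d, d)) < n / d**2              # B_{i,j} ~ Bernoulli(n/d^2)
Y_bern = (f + rng.uniform(-1.0, 1.0, size=(d, d))) * B
```

Note the structural difference that the proof sketch below exploits: under the Bernoulli scheme no entry is ever sampled twice, whereas trace regression can resample the same entry.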

SLIDE 28

Adaptive estimation

There exists an adaptive estimator f̂ of f ∈ C(k) that achieves the minimax-optimal error rk over all C(k) :

    E ‖f̂ − f‖_F ≤ d √(kd/n) =: rk,

where ‖·‖_F is the Frobenius norm; see [Keshavan et al., 2009, Cai et al., 2010, Koltchinskii et al., 2011, Klopp and Gaiffas, 2015]. In terms of estimation of f, the trace regression and Bernoulli models are equivalent.

Question : Adaptive and honest confidence set scaling with rk?
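As an illustration of how such estimators can be computed, here is a sketch of matrix completion by iteratively soft-thresholding singular values, in the spirit of the nuclear-norm methods cited above (an illustrative variant with an untuned λ, not the exact estimator of those papers):

```python
import numpy as np

def soft_impute(Y, B, lam, iters=200):
    """Fill the unobserved entries with the current fit, then shrink all
    singular values by lam (soft-thresholding). The shrinkage promotes low
    rank without requiring the rank k to be known in advance."""
    Z = np.zeros_like(Y)
    for _ in range(iters):
        filled = np.where(B, Y, Z)               # observed data + current fit
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt  # soft-threshold the spectrum
    return Z
```

With Y_bern and B from the previous sketch, f_hat = soft_impute(Y_bern, B, lam) for a λ matched to the noise level (e.g. chosen by cross-validation) would be a natural starting point.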



SLIDE 31

Confidence sets : Trace Regression Model

Theorem (C., Klopp, Löffler and Nickl, 2016)

In the matrix completion "trace regression" model, η-adaptive and honest confidence sets exist. The dimension reduction in the smaller model is not too radical.


SLIDE 33

Confidence sets : Bernoulli Model

Theorem (C., Klopp, Löffler and Nickl, 2016)

◮ Bernoulli Model with known noise variance : adaptive and honest confidence sets exist.
◮ Bernoulli Model with unknown noise variance : adaptive and honest confidence sets do not exist.

The two models are not equivalent in this case!


SLIDE 37

(Simplified) Idea of the proof : Unknown variance

No entries are sampled twice! First example : rank one.

H0 : Random opinions! [Figure: customers × products matrix of i.i.d. ± signs]
H1 : Rank one opinions. [Figure: customers × products matrix with rank-one ± sign structure]

The test looks at cycles, i.e. 2×2 patterns of entries sampled in two common rows and two common columns. With high probability there are fewer than n^4/d^4 such cycles, so distinguishability is possible only if n ≫ d.
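This cycle argument can be made concrete: under H1 with sign structure f_{ij} = ui·vj, the product of the four signs around any fully observed 2×2 pattern is always +1, while under H0 it is a fair ±1 coin; the scarcity of such patterns when n ≪ d is what makes the two hypotheses indistinguishable. A brute-force sketch of the statistic (illustrative, not the talk's exact construction):

```python
import numpy as np
from itertools import combinations

def cycle_statistic(S, B):
    """Average the product of signs around every fully observed 2x2 pattern
    (i, j), (i, j'), (i', j), (i', j'). Under rank-one sign structure every
    product equals +1; under i.i.d. random signs it is an unbiased coin."""
    d = S.shape[0]
    prods = []
    for i, i2 in combinations(range(d), 2):
        cols = np.flatnonzero(B[i] & B[i2])   # columns observed in both rows
        for j, j2 in combinations(cols.tolist(), 2):
            prods.append(S[i, j] * S[i, j2] * S[i2, j] * S[i2, j2])
    return (float(np.mean(prods)), len(prods)) if prods else (0.0, 0)
```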


SLIDE 41

(Simplified) Idea of the proof : Unknown variance

No entries are sampled twice! General case : rank k.

H0 : Random opinions! [Figure: customers × products matrix of i.i.d. ± signs]
H1 : Rank k opinions. [Figure: customers × products matrix with low-rank ± sign structure]

There are fewer than n^4/(d^4 k^3) correct cycles (taking rank groups into account), so distinguishability is possible only if n ≫ k^{3/4} d.

SLIDE 42

Conclusion on adaptive inference

Adaptive inference paradox : adaptive estimation is generally possible, while adaptive uncertainty quantification mostly is not. We have seen that in non-parametric regression with the L∞ norm :
◮ adaptive estimation is possible;
◮ adaptive and honest confidence sets do not exist.
A typical example of the adaptive inference paradox.

SLIDE 43

ADAPTIVITY IN X-ARMED BANDITS

SLIDE 44

Non-Convex Optimization

Problem

Finding/Exploiting the maximum M(f) of an unknown function f.

Question

Can we design algorithms that adapt to the difficulty of the problem?


SLIDE 46

Non-Convex Optimization

Depending on the difficulty of the problem, we would hope to get different performances :

Question

Can we adapt to the hyperparameters?


SLIDE 53

X-armed bandit problem

Game :
◮ Parameters : a function f with M(f) = max_x f(x), and a budget n.
◮ For t = 1, ..., n :
  ◮ the learner picks Xt ∈ [0, 1]^d,
  ◮ and receives Yt = f(Xt) + ǫt, where the ǫt are independent noise with E ǫt = 0 and |ǫt| ≤ 1.
◮ Output x(n) ∈ [0, 1]^d.

Performance measures :
◮ Simple regret : rn = M(f) − f(x(n)).
◮ Cumulative regret : Rn = n M(f) − Σ_{t=1}^n f(Xt).

[Figure: function with maximum M(f) and sampled points]
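A sketch of the protocol and of the two performance measures (Python; the learner interface with pick/update/recommend is an illustrative assumption):

```python
import numpy as np

def play(learner, f, M, n, rng):
    """Run the X-armed bandit protocol for n rounds and return both regrets.
    M = M(f) is used only to score the run; it is never shown to the learner."""
    cum_payoff = 0.0
    for t in range(n):
        x = learner.pick()                     # learner picks X_t in [0,1]^d
        y = f(x) + rng.uniform(-1.0, 1.0)      # receives Y_t = f(X_t) + noise
        learner.update(x, y)
        cum_payoff += f(x)
    x_n = learner.recommend()                  # output x(n)
    return M - f(x_n), n * M - cum_payoff      # (simple, cumulative) regret
```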


SLIDE 55

Classical result for stochastic bandits

K-armed stochastic bandits : in the discrete case - f constant by parts on K known sets - this is the classical stochastic bandit problem. The minimax regret satisfies (up to logarithmic terms)

    inf_{algo A} sup_{K-discrete f} rn(A, f) ≈ √(K/n),   and   inf_{algo A} sup_{K-discrete f} Rn(A, f) ≈ √(nK).

Idea

Approximate the continuous function f in Kn 'relevant' parts - this will depend on the regularity of f.
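For reference, a minimal UCB sketch for the discrete K-armed case, which attains the √(nK) cumulative-regret rate up to logarithmic factors (illustrative constants):

```python
import numpy as np

def ucb(means, n, rng):
    """UCB1-style strategy on K arms with bounded centered noise; returns the
    cumulative (pseudo-)regret n * max(means) - sum of the pulled means."""
    K = len(means)
    counts, sums, pulled = np.zeros(K), np.zeros(K), 0.0
    for t in range(n):
        if t < K:
            a = t                                 # pull each arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(n) / counts)
            a = int(np.argmax(sums / counts + bonus))
        counts[a] += 1.0
        sums[a] += means[a] + rng.uniform(-1.0, 1.0)
        pulled += means[a]
    return n * max(means) - pulled
```

The continuous problem is then reduced to this one by discretizing [0, 1]^d into Kn cells whose size must be matched to the (unknown) regularity of f, which is exactly where adaptivity to α enters.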



SLIDE 60

Assumptions on f

Parametrize f from easy to hard problems.

◮ Regularity condition : ∃α > 0 s.t. ∀x, y : |f(x) − f(y)| ≤ ‖x − y‖_∞^α. For α ≤ 1 this is a Hölder assumption.

◮ Margin condition : ∃β ≥ 0 s.t. Vol({x : M(f) − f(x) ≤ ∆}) ≤ ∆^β. No restriction for β = 0; larger β corresponds to an easier problem.
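For intuition on the margin condition, a short worked example with an illustrative single-peak function (an assumption, not from the talk):

```latex
% Illustrative example: for f(x) = M(f) - \|x - x^*\|_\infty^{a}
% on [0,1]^d with a \ge \alpha:
\[
\mathrm{Vol}\big(\{x : M(f) - f(x) \le \Delta\}\big)
  = \mathrm{Vol}\big(\{x : \|x - x^*\|_\infty \le \Delta^{1/a}\}\big)
  \le \big(2\Delta^{1/a}\big)^d = 2^d \, \Delta^{d/a},
\]
% so, up to the constant, the margin condition holds with \beta = d/a.
% Sharper peaks (small a) give larger \beta, i.e. easier problems; the
% regularity constraint a \ge \alpha then forces \beta \le d/\alpha.
```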


SLIDE 62

Lower bounds

Define P(α, β) as the class of functions that satisfy these assumptions.

Theorem : Lower bound for α, β known [Bubeck et al. 11]

For any strategy that performs at most n noisy function evaluations, it holds that

    sup_{P ∈ P(α,β)} E_P[rn] ≥ n^{−α/(2α+d−αβ)} =: rα,β,

    sup_{P ∈ P(α,β)} E_P[Rn] ≥ n^{1−α/(2α+d−αβ)} =: Rα,β,

up to a constant that does not depend on the strategy; note that Rα,β = n rα,β.

Goal : design procedures without access to α, β with optimal regret.

SLIDE 63

Case α known

See e.g. [Agrawal (1995), Kleinberg (2004), Auer et al. (2007), Kleinberg et al. (2008), Bubeck et al. (2011a,b,c), Cope (2009), Munos (2014), Valko et al. (2015)], etc.

Optimistic strategies (e.g. HOO in [Bubeck et al. 11]) : use the knowledge of α to construct local (multi-scale) upper-confidence bounds on f, and choose the next Xt optimistically.

Our strategy : similar intuition, but works hierarchically (at a single scale), only refining the partition in promising cells of the previous partition.
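A heavily simplified one-dimensional sketch of this single-scale refinement idea (illustrative constants and budget accounting, not the exact algorithm of the paper):

```python
import numpy as np

def hierarchical_optimize(f, alpha, n, rng, per_cell=50):
    """Sample the midpoint of each active cell, discard cells whose empirical
    mean is provably below the best one (Hoelder bias of a cell + stochastic
    error), split the survivors in two, and repeat until the budget is spent."""
    cells = [(0.0, 1.0)]
    budget, best_x = n, 0.5
    while budget >= per_cell * len(cells):
        stats = []
        for lo, hi in cells:
            mid = (lo + hi) / 2.0
            ys = f(mid) + rng.uniform(-1.0, 1.0, size=per_cell)
            stats.append((ys.mean(), mid, lo, hi))
            budget -= per_cell
        bias = (cells[0][1] - cells[0][0]) ** alpha      # cell width^alpha
        noise = np.sqrt(2.0 * np.log(max(n, 2)) / per_cell)
        top = max(s[0] for s in stats)
        best_x = max(stats)[1]
        keep = [s for s in stats if s[0] >= top - 2.0 * (bias + noise)]
        cells = [c for s in keep for c in ((s[2], s[1]), (s[1], s[3]))]
    return best_x
```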


SLIDE 65

Case α known

Theorem : Upper bound for α known [Locatelli, C, 2018]

Our strategy for optimisation is such that, with probability at least 1 − n^{−1},

    sup_{P ∈ P(α,β)} rn ≤ r̃α,β,   and   E_P Rn ≤ R̃α,β.

See also [Bubeck et al (2011), Minsker (2013), Bull (2014), Valko et al (2015)], etc.


SLIDE 68

Adaptivity?

The algorithm naturally adapts to β but needs α as a parameter.

Question

Can we adapt to α?

Reminder from adaptive inference :
◮ Adaptive estimation is possible in non-parametric regression.
◮ Adaptive and honest confidence sets do not exist in non-parametric regression.

Question

Is the X-armed bandit problem closer to adaptive estimation, or to adaptive and honest uncertainty quantification?

SLIDE 71

Adaptivity for simple regret (optimisation)

α-adaptive strategy :
◮ Split the budget into log^2(n) chunks of the same size.
◮ Run the previous strategy with αi = i/log(n) for all i.
◮ Aggregate the recommendations (see the sketch after this slide).

Idea : there exists αi* such that α − 1/log(n) ≤ αi* ≤ α.

◮ Strategy 1 : cross-validate [Grill et al. 15].
◮ Strategy 2 : recommend x(n) ∈ ∩_{i ≤ Î} s_{n,αi} [LCK 17].

This recovers the optimal rate, but adaptively!

[Figure: three runs with α too large, just right, too small]
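A sketch of the budget-splitting scheme, reusing hierarchical_optimize from the sketch after SLIDE 63. The final scoring step is one simple cross-validation flavour of the aggregation options above; the chunking and all constants are illustrative assumptions.

```python
import numpy as np

def adapt_to_alpha(f, n, rng):
    """Run the known-alpha strategy once per candidate alpha_i = i/log(n),
    each on an equal share of the budget, then score the candidate
    recommendations on held-out samples and keep the best one."""
    L = max(int(np.log(n)), 2)
    chunk = max(n // (L + 1), 1)      # one chunk per run, one for scoring
    candidates = [hierarchical_optimize(f, i / L, chunk, rng)
                  for i in range(1, L + 1)]
    m = max(chunk // len(candidates), 1)
    scores = [np.mean(f(x) + rng.uniform(-1.0, 1.0, size=m))
              for x in candidates]
    return candidates[int(np.argmax(scores))]
```

Since some αi is within 1/log(n) of the true α, at least one run behaves nearly as if α were known, and the scoring step loses at most a polylogarithmic factor.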
SLIDE 73

Adaptivity for simple regret (optimisation)

α-adaptive strategy as on SLIDE 71 (split the budget, run the known-α strategy with each αi = i/log(n), aggregate).

Theorem : Upper bound on the simple regret [Locatelli, C, 2018]

Our strategy yields, with high probability,

    sup_{α,β ∈ S} sup_{P ∈ P(α,β)} rn / (log(n)^u rα,β) ≤ c,

for constants u, c > 0.
SLIDE 74

Adaptivity for cumulative regret

Intuition : the previous strategy favors exploration (and so has linear cumulative regret). Can we adaptively balance exploration and exploitation?

Theorem : Impossibility result for adaptive cumulative regret [Locatelli, C, 2018]

Fix γ > α > 0 and β. Any strategy whose cumulative regret is (near-)optimally bounded by R̃α,β uniformly over P(α, β) is such that

    sup_{P ∈ P(γ,β)} E_P[Rn] ≥ R̃α,β.

In fact something more refined holds for any algorithm with a given rate on P(γ, β) or P(α, β).


SLIDE 77

Conclusion

Adaptivity is possible for the simple regret but not for the cumulative regret.
◮ Simple regret is in essence closer to adaptive estimation : adaptation is possible.
◮ Cumulative regret is in essence closer to adaptive and honest confidence sets : adaptation is impossible.
More systematic relation to adaptivity in active learning?