SLIDE 1 Adaptive inference and its relations to sequential decision making
Alexandra Carpentier¹, OvGU Magdeburg. Based on joint works with Olga Klopp, Samory Kpotufe, Andréa Locatelli, Matthias Löffler
Criteo, Oct. 2nd, 2019
¹Partly funded by the DFG Emmy Noether grant CA 1488, the CRC 1294, the GK 2297, and the GK 2433.
SLIDE 2
Non-Convex Optimization
Problem
Finding/Exploiting the maximum M(f) of an unknown function f.
SLIDE 3
Non-Convex Optimization
Problem
Finding/Exploiting the maximum M(f) of an unknown function f.
Question
Can we design algorithms that adapt to the difficulty of the problem?
SLIDE 4
Non-Convex Optimization
Depending on the difficulty of the problem, we would hope to get different performances :
SLIDE 5
Non-Convex Optimization
Depending on the difficulty of the problem, we would hope to get different performances :
Question
Can we adapt to the hyperparameters?
SLIDE 6
Scope of this talk
Talk : ◮ Presentation of adaptive inference in statistics. ◮ Adaptivity in continuously armed bandits.
SLIDE 7
ADAPTIVE INFERENCE
SLIDE 8
Adaptive inference for non-parametric regression
Problem : Non-parametric regression
[Figure: scatter plot of noisy observations]
Inference (estimation + uncertainty quantification) of the function?
SLIDE 9
Adaptive inference for non-parametric regression
Problem : Non-parametric regression
[Figure: scatter plot of noisy observations]
Inference (estimation + uncertainty quantification) of the function?
SLIDE 10 Adaptive inference for non-parametric regression
Problem : Non-parametric regression
[Figure: scatter plot of noisy observations]
Inference (estimation + uncertainty quantification) of the function? The Model : f is a function on [0, 1]^d. n observed data samples (Xi, Yi)_{i≤n} : Yi = f(Xi) + εi, i = 1, . . . , n, where Xi ∼iid U([0, 1]^d) and εi is independent centered noise s.t. |εi| ≤ 1.
SLIDE 11 Adaptive inference for non-parametric regression
Problem : Non-parametric regression
[Figure: scatter plot of noisy observations]
Inference (estimation + uncertainty quantification) of the function? The Model : f is a function on [0, 1]^d. n observed data samples (Xi, Yi)_{i≤n} : Yi = f(Xi) + εi, i = 1, . . . , n, where Xi ∼iid U([0, 1]^d) and εi is independent centered noise s.t. |εi| ≤ 1.
SLIDE 12 Adaptive inference for non-parametric regression
Problem : Non-parametric regression
[Figure: scatter plot of noisy observations]
Inference (estimation + uncertainty quantification) of the function? The Model : f is a function on [0, 1]^d. n observed data samples (Xi, Yi)_{i≤n} : Yi = f(Xi) + εi, i = 1, . . . , n, where Xi ∼iid U([0, 1]^d) and εi is independent centered noise s.t. |εi| ≤ 1.
C(α) = {Hölder ball (α)}. E.g. for α ≤ 1 : {f : |f(x) − f(y)| ≤ ‖x − y‖∞^α}.
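To make the model concrete, here is a minimal simulation sketch (assuming Python with NumPy and d = 1; the piecewise-constant estimator and the bin-width choice are illustrative, not the estimators of the references below):

import numpy as np

rng = np.random.default_rng(0)
n, alpha = 2000, 0.5
f = lambda x: np.abs(x - 0.5) ** alpha           # an example Hölder(alpha) function

X = rng.uniform(0.0, 1.0, n)                     # Xi ~ U[0,1]
eps = rng.uniform(-1.0, 1.0, n)                  # centered noise with |eps| <= 1
Y = f(X) + eps

# Partition estimator: m bins, width balancing bias (1/m)^alpha vs noise sqrt(m/n)
m = int(np.ceil(n ** (1.0 / (2 * alpha + 1))))
bins = np.clip((X * m).astype(int), 0, m - 1)
f_hat = np.array([Y[bins == j].mean() if np.any(bins == j) else 0.0
                  for j in range(m)])

grid = (np.arange(m) + 0.5) / m                  # bin midpoints
print("sup-norm error:", np.max(np.abs(f_hat - f(grid))))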
SLIDE 13
Adaptive inference for non-parametric regression
Problem : Non-parametric regression
[Figure: scatter plot of noisy observations]
Inference (estimation + uncertainty quantification) of the function? Question : If f ∈ C(α), then the “optimal” precision of inference should depend on α. Inference adaptive to α?
SLIDE 14 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf Adaptive inference : Adaptation to the set Ch when f ∈ Ch, h ∈ {0, 1}.
[Figure: nested sets C0 ⊂ C1]
SLIDE 15 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf Estimation : ◮ Minimax-optimal estimation errors r0 (over C0) and r1 (over C1) in ‖.‖ norm
Minimax-opt. est. error
rh = inf_{f̃ est.} sup_{f∈Ch} Ef ‖f̃ − f‖, h ∈ {0, 1}.
Minimax-optimal ‖.‖∞ est. error in non-param. reg. C(α) : (log(n)/n)^{α/(2α+d)}.
See [Lepski, 1990-92, etc].
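A quick heuristic behind this rate (a sketch added for intuition, not the formal argument of the references): a local average with bandwidth h has bias of order h^α and stochastic sup-norm error of order √(log(n)/(n h^d)); balancing the two terms gives

h^{\alpha} \asymp \sqrt{\frac{\log n}{n h^{d}}}
\;\iff\; h \asymp \Big(\frac{\log n}{n}\Big)^{\frac{1}{2\alpha + d}}
\;\implies\; \|\hat f - f\|_{\infty} \asymp \Big(\frac{\log n}{n}\Big)^{\frac{\alpha}{2\alpha + d}}.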
SLIDE 16 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf Adaptive estimation : ◮ Minimax-optimal estimation errors r0 (over C0) and r1 (over C1) in ‖.‖ norm
[Figure: nested sets C0 ⊂ C1 with minimax radii r0 ≪ r1 around the estimators f̂]
SLIDE 17 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf Adaptive estimation : ◮ Minimax-optimal estimation errors r0 (over C0) and r1 (over C1) in ‖.‖ norm ◮ In many models : an adaptive estimator f̂ exists
Adaptive estimation
sup_{f∈Ch} Ef ‖f̂ − f‖ ≤ rh, ∀h ∈ {0, 1}.
Adaptive estimators exist in non-param. reg. See [Lepski, 1990-92, Donoho and Johnstone, 1998, etc].
SLIDE 18 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf Adaptive and honest confidence sets : ◮ Minimax-optimal estimation errors r0, r1 in ‖.‖ norm ◮ Confidence set Ĉ : contains f and has adaptive diameter
[Figure: confidence sets Ĉ with adaptive diameter (r0 over C0, r1 over C1) around the estimators f̂]
SLIDE 19 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf Adaptive and honest confidence sets : ◮ Minimax-optimal estimation errors r0, r1 in ‖.‖ norm ◮ Confidence set Ĉ : contains f and has adaptive diameter
η-adapt. and honest conf. set
Honesty : inf_{f∈C1} Pf(f ∈ Ĉ) ≥ 1 − η. Adaptivity : sup_{f∈Ch} Ef diam(Ĉ) ≤ rh, ∀h ∈ {0, 1}.
SLIDE 20 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf Adaptive and honest confidence sets : ◮ Minimax-optimal estimation errors r0, r1 in ‖.‖ norm ◮ Confidence set Ĉ : contains f and has adaptive diameter
[Figure: confidence sets Ĉ with adaptive diameter (r0 over C0, r1 over C1) around the estimators f̂]
SLIDE 21 Adaptive inference Adaptive estimation and confidence statements :
See [Lepski, 1990-92], [Juditsky and Lambert-Lacroix, 1994], [Donoho and Johnstone, 1990-92], [Low, 2004-06], [Birgé and Massart, 1994-00], [Giné and Nickl, 2010], etc.
◮ “Large” sets C0 ⊂ C1 e.g. C0 =: C(γ) and C1 =: C(α) with α < γ.
◮ Associated probability distributions Pf for f ∈ C1 ◮ Receive a dataset of n i.i.d. entries according to Pf
In non-parametric regression : Adaptive and honest confidence sets do not exist.
See [Cai and Low (2004)], [Hoffmann and Nickl (2011)], etc.
Indeed the minimax rate for testing between C0 = C(γ) and C1 = C(α) in ‖.‖∞ norm is (log(n)/n)^{α/(2α+d)} = r1 ≫ r0. Common situation, the adaptive inference paradox - see [Giné and Nickl, 2011], [C, Klopp, Löffler, 2017] for a systematic study and relations to a testing problem.
SLIDE 22 Subtle problem : Matrix completion
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix?
SLIDE 23 Subtle problem : Matrix completion
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix? Trace Regression Model : f is a matrix of dimension d × d. n observed data samples (Xi, Yi)_{i≤n} : Yi = f_{Xi} + εi, i = 1, . . . , n, where Xi ∼iid U({1, . . . , d}²) and εi is independent centered noise s.t. |εi| ≤ 1.
SLIDE 24 Subtle problem : Matrix completion
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix? Bernoulli Model : f is a matrix of dimension d × d. Data Yi,j = (fi,j + εi,j)Bi,j, (i, j) ∈ {1, . . . , d}², where Bi,j ∼iid B(n/d²) and εi,j is independent centered noise such that |εi,j| ≤ 1.
SLIDE 25 Subtle problem : Matrix completion
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix?
High dimensional regime : d² ≥ n.
SLIDE 26 Subtle problem : Matrix completion
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix? Let, for 1 ≤ k ≤ d, C(k) = {f : rank(f) ≤ k, ‖f‖∞ ≤ 1}.
High dimensional regime : d² ≥ n.
SLIDE 27 Subtle problem : Matrix completion
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix? Let, for 1 ≤ k ≤ d, C(k) = {f : rank(f) ≤ k, ‖f‖∞ ≤ 1}. Question : If f ∈ C(k), then the “optimal” precision of inference should depend on k. Inference adaptive to k?
High dimensional regime : d² ≥ n.
SLIDE 28 Adaptive estimation
There exists an adaptive estimator f̂ of f ∈ C(k) that achieves the minimax-optimal error rk over all C(k) : E (1/d)‖f̂ − f‖F ≲ √(kd/n) := rk, where ‖.‖F is the Frobenius norm, [Keshavan et al., 2009, Cai et al., 2010, Koltchinskii et al., 2011, Klopp and Gaïffas, 2015]. In terms of estimation of f, the two models are equivalent. Question : Adaptive and honest confidence set scaling with rk?
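A minimal sketch of such an estimator in the Bernoulli model (assuming Python with NumPy; the hard rank truncation and the singular-value threshold are illustrative stand-ins for the penalized estimators of the cited works):

import numpy as np

rng = np.random.default_rng(0)
d, k, n = 200, 3, 20000

# Rank-k ground truth with ||f||_inf <= 1
U = rng.choice([-1.0, 1.0], size=(d, k))
V = rng.choice([-1.0, 1.0], size=(d, k))
f = U @ V.T / k

p = n / d ** 2                                   # Bernoulli sampling probability
B = rng.random((d, d)) < p                       # observed entries
Y = (f + rng.uniform(-1.0, 1.0, (d, d))) * B     # noisy, partially observed data

# Rescale by 1/p (entrywise unbiased for f), then truncate the SVD.
M = Y / p
Uh, s, Vh = np.linalg.svd(M, full_matrices=False)
k_hat = max(1, int(np.sum(s > 2.5 * np.sqrt(d / p))))   # heuristic rank selection
f_hat = (Uh[:, :k_hat] * s[:k_hat]) @ Vh[:k_hat]

print("selected rank:", k_hat)
print("normalized Frobenius error:", np.linalg.norm(f_hat - f) / d)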
SLIDE 29 Adaptive estimation
There exists an adaptive estimator f̂ of f ∈ C(k) that achieves the minimax-optimal error rk over all C(k) : E (1/d)‖f̂ − f‖F ≲ √(kd/n) := rk, where ‖.‖F is the Frobenius norm, [Keshavan et al., 2009, Cai et al., 2010, Koltchinskii et al., 2011, Klopp and Gaïffas, 2015]. In terms of estimation of f, the two models are equivalent. Question : Adaptive and honest confidence set scaling with rk?
SLIDE 30 Matrix completion : Trace regression
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix? Trace Regression Model : f is a matrix of dimension d × d. n observed data samples (Xi, Yi)_{i≤n} : Yi = f_{Xi} + εi, i = 1, . . . , n, where Xi ∼iid U({1, . . . , d}²) and εi is independent centered noise s.t. |εi| ≤ 1.
SLIDE 31 Confidence sets : Trace Regression Model
Theorem (C., Klopp, Löffler)
In the matrix completion “trace regression” model, η-adaptive and honest confidence sets exist. The dimension reduction in the smaller model is not too radical.
SLIDE 32 Matrix completion : Bernoulli Model
Problem : Application : Recommendation system (e.g. Netflix). [Figure: partially observed ratings matrix, customers (Alice, Bob, Carine, Daniel, Ed) × products]
Inference (estimation + uncertainty quantification) of the matrix? Bernoulli Model : f is a matrix of dimension d × d. Data Yi,j = (fi,j + εi,j)Bi,j, (i, j) ∈ {1, . . . , d}², where Bi,j ∼iid B(n/d²) and εi,j is independent centered noise such that |εi,j| ≤ 1.
SLIDE 33 Confidence sets : Bernoulli Model
Theorem (C., Klopp, Löffler)
◮ Bernoulli Model with known noise variance : adaptive and honest confidence sets exist. ◮ Bernoulli Model with unknown noise variance : adaptive and honest confidence sets do not exist. The two models are not equivalent in this case!
SLIDE 34 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! First example : rank one. H0 : Random opinions! H1 : Rank one opinions. [Figure: customers × products matrices under H0 and H1]
SLIDE 35 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! First example : rank one. H0 : Random opinions! H1 : Rank one opinions. [Figure: customers × products matrices under H0 and H1]
SLIDE 36 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! First example : rank one. H0 : Random opinions! H1 : Rank one opinions. [Figure: signs of observed entries along a length-4 cycle, under H0 and H1]
SLIDE 37 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! First example : rank one. H0 : Random opinions! H1 : Rank one opinions. [Figure: signs of observed entries along a length-4 cycle, under H0 and H1]
Less than n⁴/d⁴ such cycles whp → distinguishability only if n ≫ d.
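A toy sketch of this cycle argument (assuming Python with NumPy; a hypothetical illustration of the proof heuristic, not the exact test statistic of the paper). Under H1 with rank-one signs f = u vᵀ, the product of signs around any fully observed 2 × 2 “cycle” equals +1, while under H0 it is a fair coin flip:

import numpy as np

rng = np.random.default_rng(1)
d, n = 100, 4000
p = n / d ** 2

def cycle_statistic(M, observed, max_cycles=2000):
    """Average sign product over fully observed 2x2 submatrices ("cycles")."""
    prods = []
    rows, cols = np.where(observed)
    for a in range(len(rows)):
        i1, j1 = rows[a], cols[a]
        for b in range(a + 1, len(rows)):
            i2, j2 = rows[b], cols[b]
            if i1 != i2 and j1 != j2 and observed[i1, j2] and observed[i2, j1]:
                prods.append(np.sign(M[i1, j1] * M[i2, j2] * M[i1, j2] * M[i2, j1]))
                if len(prods) >= max_cycles:
                    return np.mean(prods)
    return np.mean(prods) if prods else 0.0

observed = rng.random((d, d)) < p
u, v = np.sign(rng.standard_normal(d)), np.sign(rng.standard_normal(d))
stat_H1 = cycle_statistic(np.outer(u, v) * observed, observed)                        # close to +1
stat_H0 = cycle_statistic(np.sign(rng.standard_normal((d, d))) * observed, observed)  # close to 0
print(stat_H1, stat_H0)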
SLIDE 38 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! General case : rank k. H0 : Random opinions! H1 : Rank-k opinions. [Figure: customers × products matrices under H0 and H1]
SLIDE 39 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! General case : rank k. H0 : Random opinions! H1 : Rank-k opinions. [Figure: customers × products matrices under H0 and H1]
SLIDE 40 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! General case : rank k. H0 : Random opinions! H1 : Rank-k opinions. [Figure: signs of observed entries along a length-4 cycle, under H0 and H1]
SLIDE 41 (Simplified) Idea of the proof : Unknown variance
No entries sampled twice! General case : rank k. H0 : Random opinions! H1 : Rank-k opinions. [Figure: signs of observed entries along a length-4 cycle, under H0 and H1]
Less than n⁴/(d⁴k³) correct cycles (taking rank groups into account) → distinguishability only if n ≫ k^{3/4} d.
SLIDE 42
Conclusion on adaptive inference
Adaptive inference paradox : adaptive estimation is generally possible, while adaptive uncertainty quantification mostly is not. We have seen that in non-parametric regression with the L∞ norm : ◮ Adaptive estimation is possible ◮ Adaptive and honest confidence sets do not exist. A typical example of the adaptive inference paradox.
SLIDE 43
ADAPTIVITY IN X-ARMED BANDITS
SLIDE 44
Non-Convex Optimization
Problem
Finding/Exploiting the maximum M(f) of an unknown function f.
Question
Can we design algorithms that adapt to the difficulty of the problem?
SLIDE 45
Non-Convex Optimization
Problem
Finding/Exploiting the maximum M(f) of an unknown function f.
Question
Can we design algorithms that adapt to the difficulty of the problem?
SLIDE 46
Non-Convex Optimization
Depending on the difficulty of the problem, we would hope to get different performances :
Question
Can we adapt to the hyperparameters?
SLIDE 47 X-armed bandit problem
Game : ◮ Parameters : function f with M(f) = max_x f(x), horizon n ◮ for t = 1, . . . , n
◮ learner picks Xt ∈ [0, 1]^d ◮ receives Yt = f(Xt) + εt, with E εt = 0, |εt| ≤ 1
◮ output x(n) ∈ [0, 1]^d
SLIDE 48 X-armed bandit problem
Game : ◮ Parameters : function f with M(f) = max_x f(x), horizon n ◮ for t = 1, . . . , n
◮ learner picks Xt ∈ [0, 1]^d ◮ receives Yt = f(Xt) + εt, with E εt = 0, |εt| ≤ 1
◮ output x(n) ∈ [0, 1]^d
[Figure: function f with maximum M(f) and the points sampled so far]
SLIDE 49 X-armed bandit problem
Game : ◮ Parameters : function f with M(f) = max_x f(x), horizon n ◮ for t = 1, . . . , n
◮ learner picks Xt ∈ [0, 1]^d ◮ receives Yt = f(Xt) + εt, with E εt = 0, |εt| ≤ 1
◮ output x(n) ∈ [0, 1]^d
[Figure: function f with maximum M(f) and the points sampled so far]
SLIDE 50 X-armed bandit problem
Game : ◮ Parameters : function f with M(f) = max_x f(x), horizon n ◮ for t = 1, . . . , n
◮ learner picks Xt ∈ [0, 1]^d ◮ receives Yt = f(Xt) + εt, with E εt = 0, |εt| ≤ 1
◮ output x(n) ∈ [0, 1]^d
[Figure: function f with maximum M(f) and the points sampled so far]
SLIDE 51 X-armed bandit problem
Game : ◮ Parameters : function f with M(f) = max_x f(x), horizon n ◮ for t = 1, . . . , n
◮ learner picks Xt ∈ [0, 1]^d ◮ receives Yt = f(Xt) + εt, with E εt = 0, |εt| ≤ 1
◮ output x(n) ∈ [0, 1]^d
[Figure: function f with maximum M(f) and the points sampled so far]
SLIDE 52 X-armed bandit problem
Game : ◮ Parameters : function f with M(f) = max_x f(x), horizon n ◮ for t = 1, . . . , n
◮ learner picks Xt ∈ [0, 1]^d ◮ receives Yt = f(Xt) + εt, with E εt = 0, |εt| ≤ 1
◮ output x(n) ∈ [0, 1]^d
[Figure: function f with maximum M(f) and the points sampled so far]
SLIDE 53 X-armed bandit problem
Game : ◮ Parameters : function f with M(f) = max_x f(x), horizon n ◮ for t = 1, . . . , n
◮ learner picks Xt ∈ [0, 1]^d ◮ receives Yt = f(Xt) + εt, with E εt = 0, |εt| ≤ 1
◮ output x(n) ∈ [0, 1]^d Performance measures : ◮ Simple regret : rn = M(f) − f(x(n)) ◮ Cumulative regret : Rn = n·M(f) − Σ_{t=1}^{n} f(Xt)
[Figure: function f with maximum M(f) and the points sampled so far]
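A minimal harness for this game (assuming Python with NumPy and d = 1; the uniform-random learner is a placeholder baseline, only meant to make the protocol and the two regrets concrete):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
f = lambda x: 1.0 - np.abs(x - 0.3)              # unknown to the learner
M = 1.0                                          # M(f) = max_x f(x)

picks, rewards = [], []
for t in range(n):
    x_t = rng.uniform(0.0, 1.0)                  # placeholder: pure exploration
    y_t = f(x_t) + rng.uniform(-1.0, 1.0)        # E[eps_t] = 0, |eps_t| <= 1
    picks.append(x_t)
    rewards.append(y_t)

x_n = picks[int(np.argmax(rewards))]             # naive recommendation x(n)
print("simple regret:", M - f(x_n))
print("cumulative regret:", n * M - sum(f(x) for x in picks))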
SLIDE 54
Classical result for stochastic bandits
K−armed stochastic bandits : in the discrete case - f piecewise constant on K known sets - this is classical stochastic bandits.
Idea
Approximate the continuous function f by Kn ‘relevant’ pieces - how many will depend on the regularity of f.
SLIDE 55 Classical result for stochastic bandits
K−armed stochastic bandits : in the discrete case - f piecewise constant on K known sets - this is classical stochastic bandits. The minimax regret satisfies (up to logarithmic terms)
inf_{algo A} sup_{K−discrete f} rn(A, f) ≈ √(K/n), and inf_{algo A} sup_{K−discrete f} Rn(A, f) ≈ √(nK).
Idea
Approximate the continuous function f by Kn ‘relevant’ pieces - how many will depend on the regularity of f.
SLIDE 56 Classical result for stochastic bandits
K−armed stochastic bandits : in the discrete case - f piecewise constant on K known sets - this is classical stochastic bandits. The minimax regret satisfies (up to logarithmic terms)
inf_{algo A} sup_{K−discrete f} rn(A, f) ≈ √(K/n), and inf_{algo A} sup_{K−discrete f} Rn(A, f) ≈ √(nK).
Idea
Approximate the continuous function f by Kn ‘relevant’ pieces - how many will depend on the regularity of f.
SLIDE 57
Assumptions on f
Parametrize f from easy to hard problems.
SLIDE 58 Assumptions on f
Parametrize f from easy to hard problems. ◮ Regularity condition : ∃α > 0 s.t. ∀x, y : |f(x) − f(y)| ≤ ‖x − y‖∞^α. For α ≤ 1, Hölder continuity.
SLIDE 59 Assumptions on f
Parametrize f from easy to hard problems. ◮ Regularity condition : ∃α > 0 s.t. ∀x, y : |f(x) − f(y)| ≤ ‖x − y‖∞^α. For α ≤ 1, Hölder continuity.
[Figure: functions with small vs. large α]
SLIDE 60 Assumptions on f
Parametrize f from easy to hard problems. ◮ Regularity condition : ∃α > 0 s.t. ∀x, y : |f(x) − f(y)| ≤ ‖x − y‖∞^α. For α ≤ 1, Hölder continuity. ◮ Margin condition : ∃β ≥ 0 s.t. Vol({x : M(f) − f(x) ≤ ∆}) ≤ ∆^β. No restriction for β = 0; larger β corresponds to an easier problem.
SLIDE 61 Assumptions on f
Parametrize f from easy to hard problems. ◮ Regularity condition : ∃α > 0 s.t. ∀x, y : |f(x) − f(y)| ≤ ‖x − y‖∞^α. For α ≤ 1, Hölder continuity. ◮ Margin condition : ∃β ≥ 0 s.t. Vol({x : M(f) − f(x) ≤ ∆}) ≤ ∆^β. No restriction for β = 0; larger β corresponds to an easier problem.
[Figure: functions with small vs. large β]
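A worked example added for intuition (my own illustration, under the two conditions above): for the single-peak function f(x) = M(f) − ‖x − x*‖∞^α on [0, 1]^d, the near-optimal set is an ℓ∞ ball, so

\mathrm{Vol}\big(\{x : M(f) - f(x) \le \Delta\}\big)
= \mathrm{Vol}\big(\{x : \|x - x^{\star}\|_{\infty} \le \Delta^{1/\alpha}\}\big)
\asymp \Delta^{d/\alpha},

i.e. β = d/α here; in general the two conditions can only hold together when αβ ≤ d.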
SLIDE 62 Lower bounds
Define P(α, β) the class of functions that satisfy these assumptions.
Theorem: Lower-bound for α, β known [Bubeck et al. 11]
For any strategy that performs at most n noisy function evaluations, it holds that:
sup_{P∈P(α,β)} EP[rn] ≥ c · n^{−α/(2α+d−αβ)} := rα,β,
sup_{P∈P(α,β)} EP[Rn] ≥ c · n^{1−α/(2α+d−αβ)} := Rα,β,
where c does not depend on the strategy; note that Rα,β = n·rα,β. Goal : design procedures without access to α, β with optimal regret.
SLIDE 63 Case α known
See e.g. [Agrawal (1995), Kleinberg (2004), Auer et al. (2007), Kleinberg et al. (2008), Bubeck et al. (2011a,b,c), Cope (2009), Munos (2014), Valko et al. (2015)], etc. Optimistic strategies (e.g. HOO in [Bubeck et al. 11]) : use the knowledge of α to construct local (multi-scale) upper-confidence bounds on f, and choose the next Xt optimistically. Our strategy : similar intuition, but works hierarchically (at a single scale), only refining the partition in promising cells of the previous partition.
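A compact sketch of such a single-scale refine-and-eliminate strategy (assuming Python with NumPy, d = 1 and α known; the cell widths, per-cell sample sizes and elimination threshold are simplified stand-ins for the tuned constants of [Locatelli, C, 2018]):

import numpy as np

rng = np.random.default_rng(0)

def refine_and_eliminate(f, n, alpha, depth_max=8):
    """Sample each active cell, keep only cells whose UCB reaches the best LCB."""
    cells = [(0.0, 1.0)]
    budget_per_round = n // depth_max
    for depth in range(depth_max):
        width = 2.0 ** -(depth + 1)
        # split every surviving cell in two
        cells = [(lo, (lo + hi) / 2) for lo, hi in cells] + \
                [((lo + hi) / 2, hi) for lo, hi in cells]
        m = max(1, budget_per_round // len(cells))   # pulls per cell
        conf = np.sqrt(2 * np.log(n) / m)            # Hoeffding-type radius
        means = []
        for lo, hi in cells:
            x = (lo + hi) / 2
            means.append(np.mean(f(x) + rng.uniform(-1, 1, m)))
        means = np.array(means)
        # keep cells that could still contain the max, up to bias width^alpha
        keep = means + conf + width ** alpha >= np.max(means - conf)
        cells = [c for c, k in zip(cells, keep) if k]
    lo, hi = cells[int(np.argmax(means[keep]))]
    return (lo + hi) / 2                             # recommendation x(n)

f = lambda x: 1.0 - np.abs(x - np.pi / 10) ** 0.5
print(refine_and_eliminate(f, n=20000, alpha=0.5))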
SLIDE 64 Case α known
See e.g. [Agrawal (1995), Kleinberg (2004), Auer et al. (2007), Kleinberg et al. (2008), Bubeck et al. (2011a,b,c), Cope (2009), Munos (2014), Valko et al. (2015)], etc. Optimistic strategies (e.g. HOO in [Bubeck et al. 11]) : use the knowledge of α to construct local (multi-scale) upper-confidence bounds on f, and choose the next Xt optimistically. Our strategy : similar intuition, but works hierarchically (at a single scale), only refining the partition in promising cells of the previous partition.
[Figure: multi-scale optimistic bounds vs. single-scale refined partition]
SLIDE 65
Case α known
Theorem: Upper-bound for α known [Locatelli, C, 2018]
Our strategy for optimisation is such that, with probability at least 1 − n^{−1},
sup_{P∈P(α,β)} rn ≤ r̃α,β, and EP Rn ≤ R̃α,β. See also [Bubeck et al (2011), Minsker (2013), Bull (2014), Valko et al (2015)] etc.
SLIDE 66
Adaptivity?
The algorithm naturally adapts to β but needs α as a parameter.
Question
Can we adapt to α?
SLIDE 67
Adaptivity?
The algorithm naturally adapts to β but needs α as a parameter.
Question
Can we adapt to α? Reminder from adaptive inference : ◮ Adaptive estimation is possible in non-parametric regression ◮ Adaptive and honest confidence sets do not exist in non-parametric regression
SLIDE 68
Adaptivity?
The algorithm naturally adapts to β but needs α as a parameter.
Question
Can we adapt to α? Reminder from adaptive inference : ◮ Adaptive estimation is possible in non-parametric regression ◮ Adaptive and honest confidence sets do not exist in non-parametric regression
Question
Is the X-armed bandit problem closer to adaptive estimation or adaptive and honest uncertainty quantification?
SLIDE 69 Adaptivity for simple regret (optimisation)
α-Adaptive strategy : ◮ Split the budget in log²(n) chunks of the same size ◮ Run the previous strategy with αi = i/log(n) for all i ◮ Aggregate recommendations. Idea : ∃ αi∗ s.t. α − 1/log(n) ≤ αi∗ ≤ α.
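A schematic of this budget-splitting wrapper (assuming Python with NumPy; run_strategy(f, budget, alpha) stands for the α-known routine sketched earlier, and the final re-evaluation step follows the spirit of the cross-validation strategy of [Grill et al. 15]):

import numpy as np

def adapt_to_alpha(f, n, run_strategy, rng):
    """Run the alpha-known routine on a grid alpha_i = i / log(n), then cross-validate."""
    log_n = int(np.ceil(np.log(n)))
    alphas = [(i + 1) / log_n for i in range(log_n ** 2)]   # grid with step 1/log(n)
    chunk = (n // 2) // len(alphas)                         # half the budget for the runs
    candidates = [run_strategy(f, chunk, a) for a in alphas]
    # spend the other half re-evaluating each candidate recommendation
    m = (n // 2) // len(candidates)
    scores = [np.mean(f(x) + rng.uniform(-1.0, 1.0, m)) for x in candidates]
    return candidates[int(np.argmax(scores))]               # final recommendation x(n)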
SLIDE 70 Adaptivity for simple regret (optimisation)
α-Adaptive strategy : ◮ Split the budget in log²(n) chunks of the same size ◮ Run the previous strategy with αi = i/log(n) for all i ◮ Aggregate recommendations. Idea : ∃ αi∗ s.t. α − 1/log(n) ≤ αi∗ ≤ α.
[Figure: three runs of the subroutine: αi too small / just right / too large]
SLIDE 71 Adaptivity for simple regret (optimisation)
α-Adaptive strategy : ◮ Split the budget in log²(n) chunks of the same size ◮ Run the previous strategy with αi = i/log(n) for all i ◮ Aggregate recommendations. Idea : ∃ αi∗ s.t. α − 1/log(n) ≤ αi∗ ≤ α.
◮ Strategy 1 : cross-validate [Grill et al. 15] ◮ Strategy 2 : recommend x(n) ∈ ⋂_{i≤Î} Sn,αi [LCK 17]
Recovers the optimal rate, but adaptively!
[Figure: three runs of the subroutine: αi too small / just right / too large]
SLIDE 72 Adaptivity for simple regret (optimisation)
α-Adaptive strategy : ◮ Split the budget in log²(n) chunks of the same size ◮ Run the previous strategy with αi = i/log(n) for all i ◮ Aggregate recommendations. Idea : ∃ αi∗ s.t. α − 1/log(n) ≤ αi∗ ≤ α.
◮ Strategy 1 : cross-validate [Grill et al. 15] ◮ Strategy 2 : recommend x(n) ∈ ⋂_{i≤Î} Sn,αi [LCK 17]
Recovers the optimal rate, but adaptively!
[Figure: three runs of the subroutine: αi too small / just right / too large]
SLIDE 73 Adaptivity for simple regret (optimisation)
α-Adaptive strategy : ◮ Split the budget in log²(n) chunks of the same size ◮ Run the previous strategy with αi = i/log(n) for all i ◮ Aggregate recommendations. Idea : ∃ αi∗ s.t. α − 1/log(n) ≤ αi∗ ≤ α.
Theorem: Upper-bound simple regret [Locatelli, C, 2018]
Our strategy yields, w.h.p.,
sup_{α,β∈S} sup_{P∈P(α,β)} rn / (log(n)^u · rα,β) ≤ C,
i.e. the optimal simple regret rate, simultaneously over all (α, β) ∈ S, up to polylog factors.
[Figure: three runs of the subroutine: αi too small / just right / too large]
SLIDE 74
Adaptivity for Cumulative regret
Intuition : the previous strategy favours exploration (it can incur linear cumulative regret). Can we adaptively balance exploration and exploitation?
Theorem: Impossibility result for adaptive cumulative regret [Locatelli, C, 2018]
Fix γ > α > 0 and β. Any strategy with (near-)optimal regret bounded by R̃α,β uniformly over P(α, β) is such that
sup_{P∈P(γ,β)} EP[Rn] ≥ R̃α,β.
In fact something more refined holds for any algorithm with a given rate on P(γ, β) or P(α, β).
SLIDE 75
Adaptivity for Cumulative regret
Intuition : the previous strategy favours exploration (it can incur linear cumulative regret). Can we adaptively balance exploration and exploitation?
Theorem: Impossibility result for adaptive cumulative regret [Locatelli, C, 2018]
Fix γ > α > 0 and β. Any strategy with (near-)optimal regret bounded by R̃α,β uniformly over P(α, β) is such that
sup_{P∈P(γ,β)} EP[Rn] ≥ R̃α,β.
In fact something more refined holds for any algorithm with a given rate on P(γ, β) or P(α, β).
SLIDE 76
Adaptivity for Cumulative regret
Intuition : the previous strategy favours exploration (it can incur linear cumulative regret). Can we adaptively balance exploration and exploitation?
Theorem: Impossibility result for adaptive cumulative regret [Locatelli, C, 2018]
Fix γ > α > 0 and β. Any strategy with (near-)optimal regret bounded by R̃α,β uniformly over P(α, β) is such that
sup_{P∈P(γ,β)} EP[Rn] ≥ R̃α,β.
In fact something more refined holds for any algorithm with a given rate on P(γ, β) or P(α, β).
SLIDE 77
Conclusion
Adaptivity is possible for simple regret but not for cumulative regret. ◮ Simple regret is in essence closer to adaptive estimation : adaptation is possible ◮ Cumulative regret is in essence closer to adaptive and honest confidence sets : adaptation is impossible. More systematic relation to adaptivity in active learning?