Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality - PowerPoint PPT Presentation

slide-1
SLIDE 1

Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality

Kwang-Sung Jun, joint work with Chicheng Zhang


slide-2
SLIDE 2

Structured bandits

  • Input: arm set 𝒝, hypothesis class ℱ ⊂ {g : 𝒝 → ℝ}
  • Initialize: the environment chooses g* ∈ ℱ (unknown to the learner)

For t = 1, …, n:

  • Learner: chooses an arm b_t ∈ 𝒝
  • Environment: generates the reward r_t = g*(b_t) + (zero-mean stochastic noise)
  • Learner: receives r_t

  • Goal: minimize the cumulative regret

𝔼[Reg_n] = 𝔼[ n · max_{b∈𝒝} g*(b) − Σ_{t=1}^{n} g*(b_t) ]

  • Note: fixed arm set (= non-contextual), realizability g* ∈ ℱ

E.g., linear: 𝒝 = {b_1, …, b_L} ⊂ ℝ^d, ℱ = {b ↦ θ^⊤b : θ ∈ ℝ^d}

ℱ is "the set of possible configurations of the mean rewards".
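The protocol above is easy to simulate. A minimal sketch in Python, with a trivial round-robin learner as a stand-in (all names here are ours, not the paper's):

```python
import numpy as np

def run_bandit(g_star, learner, n, rng):
    """Play n rounds of the structured-bandit protocol; g_star maps
    arm index -> mean reward.  Returns the (pseudo-)regret."""
    best = max(g_star)
    regret = 0.0
    for t in range(n):
        b = learner.choose()                 # learner picks an arm b_t
        r = g_star[b] + rng.normal()         # reward = g*(b_t) + zero-mean noise
        learner.observe(b, r)                # learner receives r_t
        regret += best - g_star[b]           # regret accumulates the gap
    return regret

class UniformLearner:
    """Trivial baseline: pulls arms round-robin, ignores rewards."""
    def __init__(self, L):
        self.L, self.t = L, 0
    def choose(self):
        b = self.t % self.L
        self.t += 1
        return b
    def observe(self, b, r):
        pass

rng = np.random.default_rng(0)
reg = run_bandit([1.0, 0.5, 0.25], UniformLearner(3), 300, rng)
```

With 300 rounds split evenly over the three arms, the pseudo-regret is 100·0 + 100·0.5 + 100·0.75 = 125, independent of the noise draws.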

slide-3
SLIDE 3

Structured bandits

  • Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
  • Naive strategy: UCB ⟹ O((L/Δ) · log n) regret bound (instance-dependent)
  • Scales with the number of arms L.
  • Instead, the complexity of the hypothesis class ℱ should appear.
  • The asymptotically optimal regret is well-defined.
  • E.g., linear bandits: c* · log n for some well-defined c* ≪ L/Δ.

The goal of this paper: achieve asymptotic optimality with improved finite-time regret for any ℱ. (The worst-case regret is beyond the scope.)

[Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.

slide-4
SLIDE 4

Asymptotic optimality (instance-dependent)

  • Optimism in the face of uncertainty (e.g., UCB, Thompson sampling) ⟹ optimal asymptotic / worst-case regret in L-armed bandits.
  • Linear bandits: optimal worst-case rate = d√n.
  • Asymptotically optimal regret? ⟹ No! (AISTATS'17)

[Figure: a two-dimensional example with features (sweet, sour); arm features (1,0), (0.95, 0.1), (1,0); mean reward = 1·sweet + 0·sour. "Do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness."]

slide-5
SLIDE 5
Asymptotic optimality: lower bound

  • 𝔼[Reg_n] ≥ c(g*) · log n (asymptotically), where

c(g*) = min_{δ_1,…,δ_L ≥ 0} Σ_{b=1}^{L} δ_b · Δ_b
s.t. ∀h ∈ 𝒟(g*): Σ_{b=1}^{L} δ_b · KL_ξ(g*(b), h(b)) ≥ 1, and δ_{b*(g*)} = 0.

Here 𝒟(g*) is the set of "competing" hypotheses, KL_ξ is the KL divergence under the noise distribution ξ, and Δ_b = max_{c∈𝒝} g*(c) − g*(b).

  • δ* = (δ*_1, …, δ*_L) ≥ 0: the solution.
  • To be optimal, we must pull arm b about δ*_b · log n times.
  • E.g., δ*_lemon = 8, δ*_orange = 0 ⟹ lemon is the informative arm!
  • When c(g*) = 0: bounded regret! (except for pathological cases [Lattimore14])

[Lattimore14] Lattimore & Munos, Bounded regret for finite-armed structured bandits, 2014.

slide-6
SLIDE 6

Existing asymptotically optimal algorithms

  • Mostly use forced exploration. [Lattimore+17, Combes+17, Hao+20]

⟹ ensures every arm's pull count is an unbounded function of n, such as log t / log log t.

⟹ 𝔼[Reg_n] ≲ c(g*) · log n + L · log n / log log n

  • Issues:
  • 1. L appears in the regret* ⟹ what if L is exponentially large?
  • 2. Cannot achieve bounded regret when c(g*) = 0.
  • Parallel studies avoid forced exploration, but still depend on L. [Menard+20, Degenne+20]

*Dependence on L can be avoided in special cases (e.g., linear).

slide-7
SLIDE 7

Contribution

Research Question: Assume ℱ is finite. Can we design an algorithm that

  • enjoys asymptotic optimality,
  • adapts to bounded regret whenever possible,
  • does not necessarily depend on L?

Proposed algorithm: CRush Optimism with Pessimism (CROP)

  • No forced exploration 😁
  • The regret scales not with L but with L_ω ≤ L (defined in the paper).
  • An interesting log log n term in the regret*

*It's necessary (will be updated in the arXiv version).
slide-8
SLIDE 8

Preliminaries


slide-9
SLIDE 9

Assumptions

  • β„± < ∞
  • The noise model

𝑠

" = π‘”βˆ— 𝑏" + 𝜊"

where 𝜊" is 1-sub-Gaussian. (generalized to 𝜏! in the paper)

  • Notations: π‘βˆ— 𝑔 ≔ arg max

$βˆˆπ’ 𝑔 𝑏 ,

πœˆβˆ— 𝑔 ≔ 𝑔 π‘βˆ— 𝑔

  • 𝑔 supports arm 𝑏

⟺ π‘βˆ— 𝑔 = 𝑏

  • 𝑔 supports reward 𝑀

⟺ πœˆβˆ— 𝑔 = 𝑀

  • [Assumption] Every 𝑔 ∈ β„± has a unique best arm (i. e. , π‘βˆ— 𝑔

= 1 )

9

slide-10
SLIDE 10

Competing hypotheses

  • π’Ÿ π‘”βˆ— consists of 𝑔 ∈ β„± such that
  • (1) assigns the same reward to the best arm π‘βˆ—(π‘”βˆ—)
  • (2) but supports a different arm π‘βˆ— 𝑔 β‰  π‘βˆ—(π‘”βˆ—)
  • Importance: it’s why we get log(π‘œ) regret!

10

= π‘”βˆ—

𝑔

$

arms

mean reward

1 2 3

𝑔

%

𝑔

&

𝑔

'

𝑔

(

1

𝑔

)

.75 .5 .25

slide-11
SLIDE 11

βˆ€π‘• ∈ π’Ÿ π‘”βˆ— , 3

+,! "

𝛿+ β‹… π‘”βˆ— 𝑏 βˆ’ 𝑕 𝑏

2

2 β‰₯ 1

Lower bound revisited

11

”competing” hypotheses

  • s. t.

𝛿+βˆ— 1βˆ— = 0 𝑑 π‘”βˆ— ≔ min

&!,…,&" )* 3 +,! "

𝛿+ β‹… Ξ”+ 𝛿+ ln π‘œ samples for each 𝑏 ∈ 𝒝 can distinguish π‘”βˆ— from 𝑕 confidently.

Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.

Finds arm pull allocations that (1) eliminate competing hypotheses and (2) β€˜reward’-efficient

  • Assume Gaussian rewards.
  • 𝔽 Reg# β‰₯ 𝑑 π‘”βˆ— β‹… log π‘œ , asymptotically.

Ξ”+ = max

.βˆˆπ’ π‘”βˆ— 𝑐

βˆ’ π‘”βˆ—(𝑏)

slide-12
SLIDE 12

Example: Cheating code

  • πœ— > 0: very small (like 0.0001)
  • Ξ› > 0: not too small (like 0.5)
  • The lower bound: Θ

+.2! ) 4!

ln π‘œ

  • UCB:

Θ

) 5 ln π‘œ

  • Exponential gap in 𝐿!

12

A1 A2 A3 A4 A5 A6 𝑔

)

𝟐 1 βˆ’ πœ— 1 βˆ’ πœ— 1 βˆ’ πœ— 𝑔

%

1 βˆ’ πœ— 𝟐 1 βˆ’ πœ— 1 βˆ’ πœ— Ξ› 𝑔

&

1 βˆ’ πœ— 1 βˆ’ πœ— 𝟐 1 βˆ’ πœ— Ξ› 𝑔

$

1 βˆ’ πœ— 1 βˆ’ πœ— 1 βˆ’ πœ— 𝟐 Ξ› Ξ› 𝑔

'

𝟐 + 𝛝 1 1 βˆ’ πœ— 1 βˆ’ πœ— 𝑔

(

1 𝟐 + 𝛝 1 βˆ’ πœ— 1 βˆ’ πœ— Ξ› 𝑔

*

1 βˆ’ πœ— 1 βˆ’ πœ— 𝟐 + 𝛝 1 Ξ› … … … … … … …

cheating arms log2 𝐿* base arms 𝐿*

{1 βˆ’ πœ—, 1, 1 + πœ—} rewards: 0, Ξ›

slide-13
SLIDE 13

The function classes

  • π’Ÿ π‘”βˆ— : Competing ⟹ cannot distinguishable using π‘βˆ—(π‘”βˆ—), but supports a different arm
  • 𝒠 π‘”βˆ— : Docile ⟹ distinguishable using π‘βˆ—(π‘”βˆ—)
  • β„° π‘”βˆ— : Equivalent ⟹ supports π‘βˆ—(π‘”βˆ—) and the reward πœˆβˆ—(π‘”βˆ—)
  • [Proposition 2] β„± = π’Ÿ π‘”βˆ— βˆͺ 𝒠 π‘”βˆ— βˆͺ β„°(π‘”βˆ—)

(disjoint union)
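Under the noiseless definitions above, the partition of Proposition 2 can be computed directly. A sketch with our own names, using exact equality where the slide says "assigns the same reward":

```python
import numpy as np

def partition(f_class, g_star):
    """Split f_class into (competing, docile, equivalent) w.r.t. g_star."""
    g = np.asarray(g_star, float)
    b_star = int(np.argmax(g))
    competing, docile, equivalent = [], [], []
    for h in f_class:
        h = np.asarray(h, float)
        if int(np.argmax(h)) == b_star and h[b_star] == g[b_star]:
            equivalent.append(h)       # supports b*(g*) and mu*(g*)
        elif h[b_star] == g[b_star]:
            competing.append(h)        # agrees at b*(g*), different best arm
        else:
            docile.append(h)           # already disagrees at b*(g*)
    return competing, docile, equivalent
```

E.g., with g* = (1, .75, .5): a hypothesis (1, 1.2, .5) is competing, (.9, .75, .5) is docile, and (1, .5, .25) is equivalent.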

[Figure: the partition of ℱ into ℰ*, 𝒟*, and 𝒠*, with per-class regret contributions: Θ(log n) from 𝒟*, Θ(1) from 𝒠*, and a possible Θ(log log n) from ℰ*.]

slide-14
SLIDE 14

CRush Optimism with Pessimism (CROP)


slide-15
SLIDE 15

CROP: Overview

  • The confidence set:

W_t(g) := Σ_{s=1}^{t} (r_s − g(b_s))²
ℱ_t := {g ∈ ℱ : W_{t−1}(g) − min_{h∈ℱ} W_{t−1}(h) ≤ γ_t := Θ(ln(t · |ℱ|))}

(the minimizer is the ERM; confidence level 1 − poly(1/t))

  • Four important branches:
  • Exploit, Feasible, Fallback, Conflict
  • Exploit:
  • Does every g ∈ ℱ_t support the same best arm?
  • If yes, pull that arm.
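A sketch of this confidence-set computation; the radius constant `c` is our assumption standing in for the Θ(·) choice:

```python
import numpy as np

def confidence_set(f_class, arms_pulled, rewards, t, c=8.0):
    """Keep hypotheses whose cumulative squared loss W_{t-1}(g) is within
    gamma_t = c * ln(t * |F|) of the empirical risk minimizer (ERM)."""
    losses = []
    for g in f_class:
        g = np.asarray(g, float)
        preds = g[np.asarray(arms_pulled, int)]          # g(b_s) per pull
        losses.append(np.sum((np.asarray(rewards, float) - preds) ** 2))
    losses = np.array(losses)
    gamma_t = c * np.log(max(t, 2) * len(f_class))
    return [g for g, w in zip(f_class, losses) if w - losses.min() <= gamma_t]
```

With little data the whole class survives; as evidence accumulates, badly fitting hypotheses fall out of ℱ_t.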

slide-16
SLIDE 16

CROP v1

At time t:

  • Maintain a confidence set ℱ_t ⊆ ℱ.
  • If every g ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
  • Else: (Feasible)
  • Compute the pessimism: ḡ_t = arg min_{g∈ℱ_t} max_{b∈𝒝} g(b) (break ties by the cumulative loss)
  • Compute δ* := the solution of the optimization problem c(ḡ_t).
  • (Tracking) Pull b_t = arg min_{b∈𝒝} pull_count(b) / δ*_b.

  • Cf. optimism: ĝ_t = arg max_{g∈ℱ_t} max_{b∈𝒝} g(b).

slide-17
SLIDE 17

Why pessimism?

  • Suppose ℱ_t = {g₁, g₂, g₃}.
  • If I knew g*, I could track δ(g*) (= the solution of c(g*)).
  • Which g should I track?
  • Pessimism: either does the right thing, or eliminates itself.
  • Other choices: may get stuck (so does the ERM).

Key idea: the lower-bound constraints prescribe how to distinguish g* from hypotheses supporting higher rewards.

[Table: mean rewards over arms A1–A5: g₁ = (1, .99, .98, –, –), g₂ = (.98, .99, .98, .25, –), g₃ = (.97, .97, .98, .25, .25).]

slide-18
SLIDE 18

But we may still get stuck.

  • Due to docile hypotheses.
  • We must do something else: the fallback allocation

ω(g) := arg min_δ  Δ_max(g) · δ_{b*(g)} + Σ_{b ≠ b*(g)} Δ_b(g) · δ_b
s.t. ∀h ∈ 𝒟(g) ∪ {h ∈ 𝒠(g) : μ*(h) ≥ μ*(g)}: Σ_b δ_b · (g(b) − h(b))² / 2 ≥ 1,
and δ ≥ max{δ(g), ϱ(g)}.

  • It includes docile hypotheses whose best rewards are higher than μ*(g).

[Table: mean rewards over arms A1–A5: g₁ = (1, .99, .98, –, –), g₂ = (.98, .99, .98, .25, –), g₃ = (.97, .97, 1, .25, .25), g₄ = (.97, .97, 1, .2499, .25).]

slide-19
SLIDE 19

When to fall back to ω(g)

  • ℬ_t := {(b*(g), μ*(g)) : g ∈ ℱ_t} ⟹ induces a partition of ℱ_t.
  • Optimistic set ℱ̂_t: the block containing the optimism.
  • Pessimistic set ℱ̄_t: the block containing the pessimism.
  • Condition: use δ(ḡ_t) if

∀g ∈ ℱ̂_t: Σ_b δ_b(ḡ_t) · (g(b) − ḡ_t(b))² / 2 ≥ 1;

  • otherwise, fall back to ω(ḡ_t).
  • Then, we never get stuck:
  • crush optimism with pessimism (or end up crushing pessimism itself).

[Figure: a partition of ℱ_t by (best arm, best reward), highlighting the optimistic and pessimistic sets.]
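The partition induced by ℬ_t and its two distinguished blocks can be sketched as (names are ours):

```python
import numpy as np

def partition_blocks(F_t):
    """Group hypotheses in F_t by their (best arm, best reward) pair."""
    blocks = {}
    for g in F_t:
        g = np.asarray(g, float)
        key = (int(np.argmax(g)), float(g.max()))   # (b*(g), mu*(g))
        blocks.setdefault(key, []).append(g)
    return blocks

def optimistic_pessimistic(F_t):
    """The blocks containing the optimism (largest best reward) and the
    pessimism (smallest best reward), respectively."""
    blocks = partition_blocks(F_t)
    hi = max(blocks, key=lambda k: k[1])
    lo = min(blocks, key=lambda k: k[1])
    return blocks[hi], blocks[lo]
```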

slide-20
SLIDE 20

CROP v2

At time t:

  • Maintain a confidence set ℱ_t ⊆ ℱ.
  • If every g ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
  • Else if δ(ḡ_t) is sufficient to eliminate the optimistic set ℱ̂_t:
  • (Feasible) ρ_t = δ(ḡ_t)
  • Else:
  • (Fallback) ρ_t = ω(ḡ_t)
  • (Tracking) Pull b_t = arg min_{b∈𝒝} pull_count(b) / ρ_{t,b}.

slide-21
SLIDE 21

Still, we may not be asymptotically optimal

  • Issue: which informative arm to pull?
  • If we follow ḡ_t:
  • when g* = ḡ_t, it's fine;
  • when g* = g₃, we have suboptimal_const · log n regret (and it can be made arbitrarily suboptimal).
  • Intuition: to guard against Θ(n) regret, we aim to be (1 − 1/n)-confident;
to guard against Θ(log n) regret (w/ a suboptimal constant), we aim to be (1 − 1/☐)-confident.
  • Solution: construct a (1 − 1/log n)-confident set.

[Table: mean rewards over arms A1–A5: g₁ = (1, .99, .98, –, –), g₂ = (.98, .99, .98, .25, –), g₃ = (.98, .99, .98, .25, .50).]

slide-22
SLIDE 22
We build a refined confidence set

  • ℱ̇_t = {g ∈ ℱ̄_t : W_{t−1}(g) − W_{t−1}(ḡ_t) ≤ γ̇_t = O(log(|ℱ| log t))}

(confidence level 1 − poly(1/log t))

  • We have ℱ̇_t ⊆ ℱ̄_t ⊂ ℱ_t.
  • Ask: compute δ(g) for every g ∈ ℱ̇_t. Do they all agree, up to constant scaling?
  • YES: run CROP v2.
  • NO: set ρ_t = ϱ(ḡ_t): distinguish those that give conflicting advice!

ϱ(g) = arg min_{δ_1,…,δ_L ≥ 0} Σ_{b=1}^{L} δ_b · Δ_b
s.t. δ_{b*(g)} = 0 and, ∀h ∈ ℰ(g) with δ(h) not proportional to δ(g): Σ_{b=1}^{L} δ_b · (g(b) − h(b))² / 2 ≥ 1.
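The Conflict test reduces to checking whether two allocation vectors are positive multiples of each other. A sketch (the tolerance is our choice):

```python
import numpy as np

def proportional(d1, d2, tol=1e-9):
    """True iff d2 = s * d1 for some scalar s > 0 (agreement up to
    constant scaling), treating two zero vectors as proportional."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    if (d1 > 0).tolist() != (d2 > 0).tolist():
        return False               # different supports cannot be proportional
    if not d1.any():
        return True                # both identically zero
    s = d2[d1 > 0][0] / d1[d1 > 0][0]
    return s > 0 and np.allclose(d2, s * d1, atol=tol)
```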

slide-23
SLIDE 23

CROP v3 (final)

At time t:

  • Maintain a confidence set ℱ_t ⊆ ℱ.
  • If every g ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
  • Else if ∃g, h ∈ ℱ̇_t such that δ(g) and δ(h) are not proportional to each other:
  • (Conflict) ρ_t = ϱ(ḡ_t)
  • Else if δ(ḡ_t) is sufficient to eliminate the optimistic set ℱ̂_t:
  • (Feasible) ρ_t = δ(ḡ_t)
  • Else:
  • (Fallback) ρ_t = ω(ḡ_t)
  • (Tracking) Pull b_t = arg min_{b∈𝒝} pull_count(b) / ρ_{t,b}.


slide-24
SLIDE 24

Main results


slide-25
SLIDE 25

Main result

  • Effective number of arms: L_ω = the number of arms b with ω_b(g) ≠ 0 for some g ∈ ℱ.
  • [Theorem 1] Anytime regret of CROP:

𝔼[Reg_n] = O(Q₁ · ln n + Q₂ · ln ln n + Q₃ · ln|ℱ| + L_ω), where

Q₁ = Σ_b Δ_b · δ_b(g*)  (from Feasible)
Q₂ = Σ_b Δ_b · max_{g∈ℰ(g*)} ϱ_b(g)  (from Conflict)
Q₃ = Σ_b Δ_b · max_{g∈ℱ} ω_b(g)  (from Fallback, mainly)

  • [Corollary 1] If Q₁ = 0, then Q₂ = 0. Thus, bounded regret.

slide-26
SLIDE 26

Example: Cheating code

  • L_ω ≈ log₂ L
  • CROP: (log₂ L / Λ²) · ln n
  • Forced exploration: (log₂ L / Λ²) · ln n + L
  • If Λ = .5, L = 2^E, and n = L: e3 vs 3e
  • Exponential improvement!

[Table: mean rewards of hypotheses g₁–g₇ over L' base arms (values 1 and 1 ± ε, best arm in bold) and log₂ L' cheating arms (values in {0, Λ}) encoding the best base arm's index in binary.]

slide-27
SLIDE 27

Lower bound

  • We pull some uninformative arm log(log n) times. Is it necessary?
  • Existing lower bounds say: it can be anywhere between Θ(1) and o(log n).
  • Question: say an algorithm A is asymptotically optimal. Can it pull all uninformative arms O(1) times?
  • [Theorem 2] The answer is NO. There exists an ℱ for which there exists an uninformative arm b with 𝔼[pull_count_n(b)] ≥ c · ln ln n. (Conditions are more relaxed in the paper; will be updated in the arXiv version in a few days.)

slide-28
SLIDE 28

The risk of naively mimicking the oracle

  • The oracle: knows g*.
  • At time t:
  • If ∀b: pull_count_t(b) ≥ δ_b(g*) · ln t:
  • (Exploit) pull b*(g*).
  • Else:
  • (Explore) track δ(g*).
  • Most existing algorithms try to mimic the oracle!
  • E.g., replace g* with the ERM + forced exploration.
  • CROP is not an exception.
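The oracle above, as a sketch (names are ours; `delta_star` is the allocation δ(g*) and `b_star` the best arm b*(g*)):

```python
import numpy as np

def oracle_step(t, pull_count, delta_star, b_star):
    """Exploit once every arm's pull count has caught up with
    delta_b(g*) * ln t; otherwise track delta(g*)."""
    need = [d * np.log(max(t, 2)) for d in delta_star]
    if all(pc >= nd for pc, nd in zip(pull_count, need)):
        return b_star                              # Exploit
    ratios = [pc / d if d > 0 else float("inf")
              for pc, d in zip(pull_count, delta_star)]
    return int(np.argmin(ratios))                  # Explore: track delta(g*)
```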

slide-29
SLIDE 29

The risk of naively mimicking the oracle

  • Regret of UCB: O(min{(L/ε) · ln n, ε·n}).
  • Regret of the oracle: O(min{(log₂ L / Λ²) · ln n, n}) ⟹ linear worst-case regret!
  • Intuitively, if n is small, pulling an ε-optimal arm is great!

[Figure: the cheating-code reward table again (base arms and cheating arms), next to a sketch of regret vs. time for the oracle and UCB.]

slide-30
SLIDE 30
It may not be the end of optimism

  • Can we achieve the best of both worlds? I.e., O(min{(log₂ L / Λ²) · ln n, ε·n})?
  • Yes, if we know ε.

[Figure: regret vs. time; run UCB first, then run an asymptotically optimal (AO) algorithm.]
slide-31
SLIDE 31

Summary

  • CROP: asymptotically optimal, adapts to bounded regret, with improved finite-time regret.
  • Provides considerations for avoiding forced exploration.
  • Reveals the danger of naively mimicking the oracle.
  • What next?
  • The worst-case regret simultaneously?
  • Can we use pessimism for linear bandits?
  • Can we avoid solving the optimization problem entirely?
  • Lower bounds for finite-time instance-dependent regret?
  • No explicit specification of the confidence-set construction/width?