Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Kwang-Sung Jun, joint work with Chicheng Zhang

Structured bandits

• Input: arm set $\mathcal{A}$ and hypothesis class $\mathcal{F} \subset (\mathcal{A} \to \mathbb{R})$, "the set of possible configurations of the mean rewards".
  E.g., linear: $\mathcal{A} = \{a_1, \dots, a_K\} \subset \mathbb{R}^d$, $\mathcal{F} = \{a \mapsto \theta^\top a : \theta \in \mathbb{R}^d\}$.
• Initialize: the environment chooses $f^* \in \mathcal{F}$ (unknown to the learner).
• For $t = 1, \dots, n$:
  • Learner: chooses an arm $a_t \in \mathcal{A}$.
  • Environment: generates the reward $r_t = f^*(a_t) + (\text{zero-mean stochastic noise})$.
  • Learner: receives $r_t$.
• Goal: minimize the cumulative regret
  $\mathbb{E}\,\mathrm{Reg}_n = \mathbb{E}\left[ n \cdot \max_{a \in \mathcal{A}} f^*(a) - \sum_{t=1}^{n} f^*(a_t) \right]$
• Note: fixed arm set (= non-contextual), realizability $f^* \in \mathcal{F}$.
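The protocol above can be sketched as a small simulation. This is only an illustration under assumptions: the mean-reward vector, the Gaussian noise, and the two toy policies (`always_first`, `round_robin`) are hypothetical choices, and the function returns the regret of one realized arm sequence rather than its expectation.

```python
import random

def run_bandit(f_star, policy, n, noise_sd=1.0, seed=0):
    """Play n rounds and return n * max_a f*(a) - sum_t f*(a_t)."""
    rng = random.Random(seed)
    best_mean = max(f_star)
    history = []   # (arm, reward) pairs observed so far
    regret = 0.0
    for t in range(1, n + 1):
        arm = policy(t, history)                          # learner chooses a_t
        reward = f_star[arm] + rng.gauss(0.0, noise_sd)   # r_t = f*(a_t) + noise
        history.append((arm, reward))                     # learner receives r_t
        regret += best_mean - f_star[arm]
    return regret

def always_first(t, history):
    """Hypothetical policy: always pull arm 0."""
    return 0

def round_robin(t, history):
    """Hypothetical policy: cycle through the 5 arms."""
    return (t - 1) % 5

f_star = [1.0, 0.99, 0.98, 0.0, 0.0]   # illustrative mean rewards (assumption)
print(run_bandit(f_star, always_first, n=10))
print(run_bandit(f_star, round_robin, n=10))
```

Both toy policies ignore the rewards, so their regret here is deterministic; a learning policy would read `history` before choosing.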

Structured bandits

• Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
• Naive strategy: UCB ⟹ $\frac{K}{\Delta} \log n$ regret bound (instance-dependent).
  • Scales with the number of arms $K$.
  • Instead, the complexity of the hypothesis class $\mathcal{F}$ should appear.
• The asymptotically optimal regret is well-defined.
  • E.g., linear bandits: $c^* \cdot \log n$ for some well-defined $c^* \ll \frac{K}{\Delta}$.

The goal of this paper: achieve asymptotic optimality with improved finite-time regret for any $\mathcal{F}$ (the worst-case regret is beyond the scope).

[Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.

Asymptotic optimality (instance-dependent)

• Optimism in the face of uncertainty (e.g., UCB, Thompson sampling) ⟹ optimal asymptotic / worst-case regret in $K$-armed bandits.
• Linear bandits: optimal worst-case rate $= d\sqrt{n}$.
• Asymptotically optimal regret? ⟹ No! (AISTATS'17)

Example: Do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness.

[Figure: arms plotted in the (sweet, sour) plane, with point labels (1,0) and (0.95, 0.1); mean reward = 1*sweet + 0*sour]

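Asymptotic optimality: lower bound

• $\mathbb{E}\,\mathrm{Reg}_n \ge c(f^*) \cdot \log n$ (asymptotically), where $\Delta_a = \max_{b \in \mathcal{A}} f^*(b) - f^*(a)$ and

  $c(f^*) = \min_{\gamma_1, \dots, \gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \cdot \Delta_a$
  s.t. $\gamma_{a^*(f^*)} = 0$,
       $\forall g \in \mathcal{C}(f^*): \ \sum_{a=1}^{K} \gamma_a \cdot \mathrm{KL}_\nu(f^*(a), g(a)) \ge 1$.

  Here $\mathcal{C}(f^*)$ is the set of "competing" hypotheses and $\mathrm{KL}_\nu$ is the KL divergence with noise distribution $\nu$.
• $\gamma^* = (\gamma_1^*, \dots, \gamma_K^*) \ge 0$: the solution.
• To be optimal, we must pull arm $a$ like $\gamma_a^* \cdot \log n$ times.
• E.g., $\gamma^*_{\text{lemon}} = 8$, $\gamma^*_{\text{orange}} = 0$ ⟹ lemon is the informative arm!
• When $c(f^*) = 0$: bounded regret! (except for pathological ones [Lattimore14])

[Lattimore14] Lattimore & Munos, Bounded regret for finite-armed structured bandits, 2014.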

Existing asymptotically optimal algorithms

• Mostly use forced exploration [Lattimore+17, Combes+17, Hao+20]
  ⟹ ensures every arm's pull count is an unbounded function of $n$, such as $\frac{\log n}{1 + \log\log n}$
  ⟹ $\mathbb{E}\,\mathrm{Reg}_n \lesssim c(f^*) \cdot \log n + K \cdot \frac{\log n}{1 + \log\log n}$.
• Issues:
  1. $K$ appears in the regret* ⟹ what if $K$ is exponentially large?
  2. Cannot achieve bounded regret when $c(f^*) = 0$.
• Parallel studies avoid forced exploration, but still depend on $K$ [Menard+20, Degenne+20].

*Dependence on $K$ can be avoided in special cases (e.g., linear).
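To get a feel for the forced-exploration term, the sketch below evaluates $\frac{\log n}{1 + \log\log n}$ for a few horizons. The function name and the chosen values of `n` are illustrative assumptions; the point is only that the per-arm count keeps growing with `n` (so regret cannot stay bounded) while shrinking relative to `log n`.

```python
import math

def forced_exploration_pulls(n):
    """Per-arm forced pull count log n / (1 + log log n); needs n >= 3."""
    return math.log(n) / (1.0 + math.log(math.log(n)))

for n in (10**2, 10**4, 10**8):
    per_arm = forced_exploration_pulls(n)
    # Columns: horizon, forced pulls per arm, ratio to log n.
    print(n, round(per_arm, 2), round(per_arm / math.log(n), 3))
```

Multiplying the middle column by the number of arms gives the extra regret term, which is why an exponentially large $K$ is a concern.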

Contribution

Research question: assume $\mathcal{F}$ is finite. Can we design an algorithm that
• enjoys asymptotic optimality,
• adapts to bounded regret whenever possible,
• does not necessarily depend on $K$?

Proposed algorithm: CRush Optimism with Pessimism (CROP)
• No forced exploration 😁
• The regret scales not with $K$ but with $K_\psi \le K$ (defined in the paper).
• An interesting $\log\log n$ term in the regret*

*It's necessary (will be updated on arXiv).

Preliminaries

Assumptions

• $|\mathcal{F}| < \infty$
• The noise model: $r_t = f^*(a_t) + \xi_t$, where $\xi_t$ is 1-sub-Gaussian (generalized to $\sigma^2$ in the paper).
• Notation: $a^*(f) := \arg\max_{a \in \mathcal{A}} f(a)$, $\mu^*(f) := f(a^*(f))$.
  • $f$ supports arm $a$ ⟺ $a^*(f) = a$
  • $f$ supports reward $v$ ⟺ $\mu^*(f) = v$
• [Assumption] Every $f \in \mathcal{F}$ has a unique best arm (i.e., $|a^*(f)| = 1$).

Competing hypotheses

• $\mathcal{C}(f^*)$ consists of $f \in \mathcal{F}$ such that
  • (1) $f$ assigns the same reward as $f^*$ to the best arm $a^*(f^*)$,
  • (2) but $f$ supports a different arm: $a^*(f) \ne a^*(f^*)$.
• Importance: it's why we get $\log(n)$ regret!

[Figure: mean reward (0 to 1) versus arms 1, 2, 3 for hypotheses $f_1, \dots, f_6$, one of which is marked as $f^*$]

Lower bound revisited

• $\mathbb{E}\,\mathrm{Reg}_n \ge c(f^*) \cdot \log n$, asymptotically.
• Assume Gaussian rewards. With $\Delta_a = \max_{b \in \mathcal{A}} f^*(b) - f^*(a)$,

  $c(f^*) := \min_{\gamma_1, \dots, \gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \cdot \Delta_a$
  s.t. $\gamma_{a^*(f^*)} = 0$,
       $\forall g \in \mathcal{C}(f^*): \ \sum_{a=1}^{K} \gamma_a \cdot \frac{(f^*(a) - g(a))^2}{2} \ge 1$.

• The constraint over the "competing" hypotheses says: $\gamma_a \ln n$ samples for each $a \in \mathcal{A}$ can distinguish $f^*$ from $g$ confidently.
• The optimization finds arm pull allocations that (1) eliminate competing hypotheses and (2) are "reward"-efficient.

Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
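As a worked instance of the Gaussian lower-bound program, the sketch below handles only the special case of a single competing hypothesis, which is an assumption made to avoid needing an LP solver. With one constraint, the minimum puts all mass on the arm with the smallest cost per unit of information, $\Delta_a / \frac{(f^*(a) - g(a))^2}{2}$. The reward vectors and the function name are illustrative, not taken from the paper's code.

```python
def allocation_single_competitor(f_star, g):
    """Solve min sum_a gamma_a * Delta_a
    s.t. gamma_{a*} = 0 and sum_a gamma_a * (f*(a) - g(a))^2 / 2 >= 1,
    for exactly one competing hypothesis g (assumption of this sketch)."""
    k = len(f_star)
    best_mean = max(f_star)
    a_star = f_star.index(best_mean)
    delta = [best_mean - x for x in f_star]                   # gaps Delta_a
    kl = [(x - y) ** 2 / 2.0 for x, y in zip(f_star, g)]      # Gaussian KL per arm
    informative = [a for a in range(k) if a != a_star and kl[a] > 0]
    if not informative:
        raise ValueError("g cannot be distinguished without pulling a*(f*)")
    arm = min(informative, key=lambda a: delta[a] / kl[a])    # cheapest information
    gamma = [0.0] * k
    gamma[arm] = 1.0 / kl[arm]                                # makes the constraint tight
    return gamma, gamma[arm] * delta[arm]

f_star = [0.98, 0.99, 0.98, 0.25, 0.0]   # illustrative; best arm is index 1
g = [1.0, 0.99, 0.98, 0.0, 0.0]          # same reward at index 1, best arm is index 0
gamma, c_value = allocation_single_competitor(f_star, g)
print(gamma, c_value)
```

In this instance the informative arm is a clearly suboptimal one (index 3): its gap is large, but it separates the two hypotheses far faster than the near-optimal arm 0.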

Example: Cheating code

• $K_0$ base arms with rewards in $\{1 - \epsilon, 1, 1 + \epsilon\}$; $\log_2 K_0$ cheating arms with rewards in $\{0, \Lambda\}$.
• $\epsilon > 0$: very small (like 0.0001)
• $\Lambda > 0$: not too small (like 0.5)

         A1      A2      A3      A4      A5   A6
  f_1    1*      1−ε     1−ε     1−ε     0    0
  f_2    1−ε     1*      1−ε     1−ε     0    Λ
  f_3    1−ε     1−ε     1*      1−ε     Λ    0
  f_4    1−ε     1−ε     1−ε     1*      Λ    Λ
  f_5    1+ε*    1       1−ε     1−ε     0    0
  f_6    1       1+ε*    1−ε     1−ε     0    Λ
  f_7    1−ε     1−ε     1+ε*    1       Λ    0
  …      …       …       …       …       …    …

  A1 to A4 are base arms and A5 to A6 are cheating arms; * marks each hypothesis's best arm.

• The lower bound: $\Theta\left(\frac{\log_2 K_0}{\Lambda^2} \ln n\right)$
• UCB: $\Theta\left(\frac{K_0}{\epsilon} \ln n\right)$
• Exponential gap in $K_0$!
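The first $K_0$ rows of the cheating-code table can be generated programmatically. The sketch below assumes the pattern visible in rows $f_1$ to $f_4$ generalizes to any power-of-two $K_0$ (the cheating arms spell the index of the best base arm in binary); the rows with $1 + \epsilon$ are left out because the slide elides part of them. Function names are hypothetical, and the printed rates drop the constants hidden in $\Theta(\cdot)$.

```python
def cheating_code_class(k0, eps, lam):
    """Build f_1..f_{K0}: base arm i pays 1, the others 1 - eps,
    and the cheating arms encode i in binary with rewards {0, lam}."""
    bits = k0.bit_length() - 1
    if k0 < 2 or 2 ** bits != k0:
        raise ValueError("k0 must be a power of two")
    hypotheses = []
    for i in range(k0):
        base = [1.0 if a == i else 1.0 - eps for a in range(k0)]
        code = [lam if (i >> (bits - 1 - b)) & 1 else 0.0 for b in range(bits)]
        hypotheses.append(base + code)
    return hypotheses

def decode_best_arm(cheating_means, lam):
    """Read the best base arm's index off the cheating arms' mean rewards."""
    index = 0
    for mean in cheating_means:
        index = 2 * index + (1 if mean == lam else 0)
    return index

eps, lam, k0 = 0.0001, 0.5, 4
F = cheating_code_class(k0, eps, lam)
for f in F:
    assert decode_best_arm(f[k0:], lam) == f.index(max(f))

# Rates without constants: log2(K0) / Lambda^2 versus K0 / eps.
print((k0.bit_length() - 1) / lam ** 2, k0 / eps)
```

Pulling the few cheating arms identifies the best base arm with an information rate governed by $\Lambda$, whereas comparing base arms directly has to resolve gaps of size $\epsilon$.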

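The function classes

Each class is listed with its regret contribution:
• $\mathcal{C}(f^*)$: Competing ⟹ cannot be distinguished from $f^*$ using $a^*(f^*)$, but supports a different arm. Regret contribution: $\Theta(\log n)$.
• $\mathcal{D}(f^*)$: Docile ⟹ distinguishable using $a^*(f^*)$. Regret contribution: $\Theta(1)$.
• $\mathcal{E}(f^*)$: Equivalent ⟹ supports $a^*(f^*)$ and the reward $\mu^*(f^*)$. Regret contribution: can be $\Theta(\log\log n)$.
• [Proposition 2] $\mathcal{F} = \mathcal{C}(f^*) \cup \mathcal{D}(f^*) \cup \mathcal{E}(f^*)$ (disjoint union).

[Figure: the mean-reward plot over arms 1, 2, 3 with hypotheses $f_1, \dots, f_6$, partitioned into the regions $\mathcal{E}^*$, $\mathcal{C}^*$, and $\mathcal{D}^*$]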
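The three classes can be written as a single classification rule, which also makes the disjoint-union claim easy to see: every hypothesis falls into exactly one branch. This is a minimal sketch; the reward vectors and the choice of which one plays the role of $f^*$ are illustrative assumptions, and exact float comparison is used only because the values are hand-written.

```python
def best_arm(f):
    """a*(f): index of the (assumed unique) best arm."""
    return max(range(len(f)), key=lambda a: f[a])

def classify(f, f_star):
    """Return which of C(f*), D(f*), E(f*) the hypothesis f belongs to."""
    a_star = best_arm(f_star)
    if f[a_star] != f_star[a_star]:
        return "docile"       # pulling a*(f*) alone tells f apart from f*
    if best_arm(f) != a_star:
        return "competing"    # same reward at a*(f*), but a different best arm
    return "equivalent"       # supports a*(f*) and the reward mu*(f*)

hypotheses = {
    "f1": (1.0, 0.99, 0.98, 0.0, 0.0),
    "f2": (0.98, 0.99, 0.98, 0.25, 0.0),
    "f3": (0.97, 0.97, 1.0, 0.25, 0.25),
    "f4": (0.97, 0.97, 1.0, 0.2499, 0.25),
}
f_star = hypotheses["f2"]   # assumption for illustration
labels = {name: classify(f, f_star) for name, f in hypotheses.items()}
print(labels)
```

Note that $f^*$ itself lands in the equivalent class, consistent with the definition above.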

CRush Optimism with Pessimism (CROP)

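CROP: Overview

• The confidence set (confidence level: $1 - \mathrm{poly}(1/t)$):
  $L_t(f) := \sum_{s=1}^{t} (r_s - f(a_s))^2$
  $\mathcal{F}_t := \{ f \in \mathcal{F} : L_{t-1}(f) - \min_{g \in \mathcal{F}} L_{t-1}(g) \le \beta_t \}$, with $\beta_t := \Theta(\ln(t|\mathcal{F}|))$,
  where the minimizer of $L_{t-1}$ over $\mathcal{F}$ is the ERM.
• Four important branches: Exploit, Feasible, Fallback, Conflict.
• Exploit:
  • Does every $f \in \mathcal{F}_t$ support the same best arm?
  • If yes, pull that arm.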
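The confidence set is straightforward to compute for a finite class. The sketch below follows the squared-loss definition above; the constant in front of $\ln(t|\mathcal{F}|)$ is not given on the slide, so `beta_const = 2.0` is a placeholder assumption, as are the function names and the example data.

```python
import math

def cumulative_loss(f, history):
    """L(f) = sum over observed (arm, reward) pairs of (reward - f(arm))^2."""
    return sum((r - f[a]) ** 2 for a, r in history)

def confidence_set(hypotheses, history, t, beta_const=2.0):
    """Indices of hypotheses whose loss is within beta_t of the ERM's loss.

    history holds rounds 1..t-1; beta_t = beta_const * ln(t * |F|),
    where beta_const is a placeholder for the unspecified Theta constant."""
    losses = [cumulative_loss(f, history) for f in hypotheses]
    erm_loss = min(losses)
    beta_t = beta_const * math.log(t * len(hypotheses))
    return [i for i, loss in enumerate(losses) if loss - erm_loss <= beta_t]

hypotheses = [
    (1.0, 0.99, 0.98, 0.0, 0.0),
    (0.98, 0.99, 0.98, 0.25, 0.0),
    (0.97, 0.97, 0.98, 0.25, 0.25),
]
# Noiseless toy data: arm index 3 pulled 400 times with reward 0.25.
history = [(3, 0.25)] * 400
print(confidence_set(hypotheses, history, t=401))
```

In this toy run the first hypothesis predicts 0 on the pulled arm and is eliminated, while the other two agree on that arm and both remain.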

CROP v1

At time $t$:
• Maintain a confidence set $\mathcal{F}_t \subseteq \mathcal{F}$.
• If every $f \in \mathcal{F}_t$ agrees on the best arm:
  • (Exploit) pull that arm.
• Else: (Feasible)
  • Compute the pessimism: $\bar{f}_t = \arg\min_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$ (break ties by the cumulative loss).
    Cf. optimism: $\tilde{f}_t = \arg\max_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$.
  • Compute $\gamma^* :=$ solution of the optimization problem $c(\bar{f}_t)$.
  • (Tracking) Pull $a_t = \arg\min_{a \in \mathcal{A}} \frac{\text{pull\_count}(a)}{\gamma^*_a}$.
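The v1 rule above can be sketched as a single decision function. Several pieces are assumptions of this sketch rather than the paper's implementation: the optimization $c(\cdot)$ is passed in as a hypothetical callable `solve_allocation`, arms with $\gamma^*_a = 0$ are skipped in the tracking step (their ratio is undefined), and an all-zero allocation raises an error since v1 as stated does not say what to do then.

```python
def best_arm(f):
    """a*(f): index of the (assumed unique) best arm."""
    return max(range(len(f)), key=lambda a: f[a])

def crop_v1_step(conf_set, pull_counts, solve_allocation, cum_loss):
    """Choose the next arm from the confidence set following the v1 rule."""
    supported = {best_arm(f) for f in conf_set}
    if len(supported) == 1:
        return supported.pop()                       # Exploit
    # Feasible: pessimism, ties broken by cumulative loss.
    f_bar = min(conf_set, key=lambda f: (max(f), cum_loss(f)))
    gamma = solve_allocation(f_bar, conf_set)        # hypothetical solver for c(f_bar)
    tracked = [a for a, g in enumerate(gamma) if g > 0]
    if not tracked:
        raise ValueError("all-zero allocation is not covered by this sketch")
    # Tracking: pull the arm furthest behind its target proportion.
    return min(tracked, key=lambda a: pull_counts[a] / gamma[a])

f1 = (1.0, 0.99, 0.98, 0.0, 0.0)
f2 = (0.98, 0.99, 0.98, 0.25, 0.0)
f3 = (0.97, 0.97, 0.98, 0.25, 0.25)

def toy_allocation(f_bar, conf_set):
    """Stand-in allocation used only to exercise the tracking step."""
    return [2.0, 1.0, 0.0, 0.0, 0.0]

print(crop_v1_step([f2], [0] * 5, toy_allocation, lambda f: 0.0))
print(crop_v1_step([f1, f2, f3], [4, 1, 0, 0, 0], toy_allocation, lambda f: 0.0))
```

Injecting the solver keeps the control flow separate from the optimization, so the same function can be exercised with a stub before a real solver is available.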

Why pessimism?

  Arms   A1    A2    A3    A4    A5
  f_1    1     .99   .98   0     0
  f_2    .98   .99   .98   .25   0
  f_3    .97   .97   .98   .25   .25

• Suppose $\mathcal{F}_t = \{f_1, f_2, f_3\}$.
• If I knew $f^*$, I could track $\gamma(f^*)$ (= the solution of $c(f^*)$).
• Which $f$ should I track?
  • Pessimism: either does the right thing, or eliminates itself.
  • Other choices: may get stuck (so does ERM).

Key idea: the lower-bound constraints prescribe how to distinguish $f^*$ from those supporting higher rewards.
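For the three hypotheses in the table, the optimistic and pessimistic choices can be computed directly. This small sketch only selects the hypothesis; the helper names are illustrative.

```python
F_t = {
    "f1": (1.0, 0.99, 0.98, 0.0, 0.0),
    "f2": (0.98, 0.99, 0.98, 0.25, 0.0),
    "f3": (0.97, 0.97, 0.98, 0.25, 0.25),
}

def optimism(conf_set):
    """Hypothesis with the highest best-arm reward."""
    return max(conf_set, key=lambda name: max(conf_set[name]))

def pessimism(conf_set):
    """Hypothesis with the lowest best-arm reward."""
    return min(conf_set, key=lambda name: max(conf_set[name]))

print(optimism(F_t), pessimism(F_t))
```

Here optimism selects f1 (best reward 1) and pessimism selects f3 (best reward .98); every other member of the set supports a strictly higher reward than the pessimistic choice, which is the situation the key idea above refers to.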

But we may still get stuck.

  Arms         A1    A2    A3    A4      A5
  f_1          1     .99   .98   0       0
  f_2 (= f*)   .98   .99   .98   .25     0
  f_3          .97   .97   1     .25     .25
  f_4          .97   .97   1     .2499   .25

• Due to docile hypotheses.
• We must do something else:

  $\psi(f) := \arg\min_{\gamma \in [0,\infty)^K} \ \Delta_{\min}(f) \cdot \gamma_{a^*(f)} + \sum_{a \ne a^*(f)} \Delta_a(f) \cdot \gamma_a$
  s.t. $\forall g \in \mathcal{C}(f) \cup \mathcal{D}(f)$ with $\mu^*(g) \ge \mu^*(f)$: $\sum_a \gamma_a \cdot \frac{(f(a) - g(a))^2}{2} \ge 1$,
       $\gamma \ge \max\{\gamma(f), \phi(f)\}$.

• Includes docile hypotheses with best rewards higher than $\mu^*(f)$.
