Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality (presentation transcript)

  1. Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality. Kwang-Sung Jun, joint work with Chicheng Zhang.

  2. Structured bandits. E.g., linear: $\mathcal{A} = \{a_1, \dots, a_K\} \subset \mathbb{R}^d$, $\mathcal{F} = \{a \mapsto \theta^\top a : \theta \in \mathbb{R}^d\}$.
  • Input: arm set $\mathcal{A}$, hypothesis class $\mathcal{F} \subset (\mathcal{A} \to \mathbb{R})$, "the set of possible configurations of the mean rewards".
  • Initialize: the environment chooses $f^* \in \mathcal{F}$ (unknown to the learner).
  For $t = 1, \dots, n$:
  • Learner: chooses an arm $a_t \in \mathcal{A}$.
  • Environment: generates the reward $r_t = f^*(a_t) + \text{(zero-mean stochastic noise)}$.
  • Learner: receives $r_t$.
  • Goal: minimize the cumulative regret $\mathbb{E}[\mathrm{Reg}_n] = \mathbb{E}\left[ n \cdot \max_{a \in \mathcal{A}} f^*(a) - \sum_{t=1}^{n} f^*(a_t) \right]$.
  • Note: fixed arm set (= non-contextual); realizability ($f^* \in \mathcal{F}$).
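A minimal Python sketch of this interaction protocol (the `learner` interface and all names here are illustrative assumptions, not code from the paper):

```python
import numpy as np

def run_bandit(arms, f_star, learner, n, rng):
    """Simulate the structured bandit protocol for n rounds.

    arms:    list of arms (e.g., vectors in R^d for linear bandits)
    f_star:  true mean-reward function, a member of the hypothesis class F
    learner: object with choose(t) -> arm index and observe(idx, reward)
    """
    regret = 0.0
    best_mean = max(f_star(a) for a in arms)
    for t in range(1, n + 1):
        idx = learner.choose(t)                    # learner picks an arm
        reward = f_star(arms[idx]) + rng.normal()  # zero-mean noise
        learner.observe(idx, reward)
        regret += best_mean - f_star(arms[idx])    # cumulative (pseudo-)regret
    return regret
```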

  3. Structured bandits.
  • Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
  • Naive strategy: UCB ⟹ $\frac{K}{\Delta} \log n$ regret bound (instance-dependent), which scales with the number of arms $K$.
  • Instead, the complexity of the hypothesis class $\mathcal{F}$ should appear.
  • The asymptotically optimal regret is well-defined. E.g., linear bandits: $c^* \cdot \log n$ for some well-defined $c^* \ll \frac{K}{\Delta}$.
  The goal of this paper: achieve asymptotic optimality with improved finite-time regret for any $\mathcal{F}$ (the worst-case regret is beyond the scope).
  [Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.

  4. Asymptotic optimality (instance-dependent).
  • Optimism in the face of uncertainty (e.g., UCB, Thompson sampling) ⟹ optimal asymptotic / worst-case regret in $K$-armed bandits.
  • Linear bandits: optimal worst-case rate $= d\sqrt{n}$. Asymptotically optimal regret? ⟹ No! (AISTATS'17)
  • Intuition: do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness.
  [Figure: two arms plotted on sweet/sour axes at (1, 0) and (0.95, 0.1); mean reward = 1*sweet + 0*sour.]

  5. Asymptotic optimality: lower bound.
  • $\mathbb{E}[\mathrm{Reg}_n] \ge c(f^*) \cdot \log n$ (asymptotically), where $\Delta_a = \max_{b \in \mathcal{A}} f^*(b) - f^*(a)$ and
    $c(f^*) = \min_{\delta_1, \dots, \delta_K \ge 0} \; \sum_{a=1}^{K} \delta_a \cdot \Delta_a$
    s.t. $\delta_{a^*(f^*)} = 0$ and $\forall g \in \mathcal{C}(f^*): \; \sum_{a=1}^{K} \delta_a \cdot \mathrm{KL}\big(\nu_{f^*(a)}, \nu_{g(a)}\big) \ge 1$,
    where $\mathcal{C}(f^*)$ is the set of "competing" hypotheses and the KL divergence is w.r.t. the noise distribution $\nu$.
  • $\delta^* = (\delta^*_1, \dots, \delta^*_K) \ge 0$: the solution. To be optimal, we must pull arm $a$ about $\delta^*_a \cdot \log n$ times.
  • E.g., $\delta^*_{\mathrm{lemon}} = 8$, $\delta^*_{\mathrm{orange}} = 0$ ⟹ lemon is the informative arm!
  • When $c(f^*) = 0$: bounded regret! (except for pathological instances [Lattimore14])
  [Lattimore14] Lattimore & Munos, Bounded regret for finite-armed structured bandits, 2014.

  6. Existing asymptotically optimal algorithms.
  • Mostly use forced exploration [Lattimore+17, Combes+17, Hao+20],
    which ensures every arm's pull count is an unbounded function of $n$ such as $\frac{\log n}{\log\log n}$
    ⟹ $\mathbb{E}[\mathrm{Reg}_n] \lesssim c(f^*) \cdot \log n + K \cdot \frac{\log n}{\log\log n}$.
  • Issues:
    1. $K$ appears in the regret* ⟹ what if $K$ is exponentially large?
    2. Cannot achieve bounded regret when $c(f^*) = 0$.
  • Parallel studies avoid forced exploration, but still depend on $K$ [Menard+20, Degenne+20].
  *Dependence on $K$ can be avoided in special cases (e.g., linear).

  7. Contribution.
  Research question: assume $\mathcal{F}$ is finite. Can we design an algorithm that
  • enjoys asymptotic optimality,
  • adapts to bounded regret whenever possible,
  • and does not necessarily depend on $K$?
  Proposed algorithm: CRush Optimism with Pessimism (CROP).
  • No forced exploration 😁
  • The regret scales not with $K$ but with a smaller quantity $K' \le K$ (defined in the paper).
  • An interesting $\log\log n$ term in the regret* (*it's necessary; to be updated on arXiv).

  8. Preliminaries.

  9. Assumptions.
  • $|\mathcal{F}| < \infty$.
  • The noise model: $r_t = f^*(a_t) + \eta_t$, where $\eta_t$ is 1-sub-Gaussian (generalized to $\sigma^2$ in the paper).
  • Notations: $a^*(f) := \arg\max_{a \in \mathcal{A}} f(a)$, $\mu^*(f) := f(a^*(f))$.
  • "$f$ supports arm $a$" ⟺ $a^*(f) = a$; "$f$ supports reward $w$" ⟺ $\mu^*(f) = w$.
  • [Assumption] Every $f \in \mathcal{F}$ has a unique best arm (i.e., $|a^*(f)| = 1$).
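In code form, these notations might look like the following sketch (names are illustrative, not from the paper), treating a hypothesis as a vector of mean rewards over a finite arm set:

```python
import numpy as np

# A hypothesis f is represented as a length-K array of mean rewards, f[a].

def best_arm(f):
    """a*(f): the (assumed unique) reward-maximizing arm of hypothesis f."""
    return int(np.argmax(f))

def best_reward(f):
    """mu*(f) := f(a*(f)), the optimal mean reward under f."""
    return float(np.max(f))

def supports_arm(f, a):
    """f 'supports' arm a iff a is f's best arm."""
    return best_arm(f) == a

def supports_reward(f, w):
    """f 'supports' reward w iff its optimal mean reward equals w."""
    return best_reward(f) == w
```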

  10. Competing hypotheses.
  • $\mathcal{C}(f^*)$ consists of the $g \in \mathcal{F}$ that
    (1) assign the same reward to the best arm $a^*(f^*)$,
    (2) but support a different arm: $a^*(g) \ne a^*(f^*)$.
  • Importance: it's why we get $\log(n)$ regret!
  [Figure: mean-reward curves of six hypotheses $f_1, \dots, f_6$ over arms 1, 2, 3, one of which equals $f^*$.]

  11. Lower bound revisited.
  • $\mathbb{E}[\mathrm{Reg}_n] \ge c(f^*) \cdot \log n$, asymptotically. Assume Gaussian rewards.
  • $\Delta_a = \max_{b \in \mathcal{A}} f^*(b) - f^*(a)$,
    $c(f^*) := \min_{\delta_1, \dots, \delta_K \ge 0} \; \sum_{a=1}^{K} \delta_a \cdot \Delta_a$
    s.t. $\delta_{a^*(f^*)} = 0$ and $\forall g \in \mathcal{C}(f^*): \; \sum_{a=1}^{K} \delta_a \cdot \frac{(f^*(a) - g(a))^2}{2} \ge 1$ ("competing" hypotheses).
  • Interpretation: $\delta_a \ln n$ samples for each $a \in \mathcal{A}$ can distinguish $f^*$ from $g$ confidently.
  • Finds arm-pull allocations that (1) eliminate competing hypotheses and (2) are 'reward'-efficient.
  Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
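For Gaussian noise, both the objective and the constraints above are linear in $\delta$, so $c(f^*)$ can be computed as a linear program. A sketch under that assumption (using scipy; the function name and interface are mine, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def c_lower_bound(f_star, competing):
    """Solve the lower-bound LP: minimize sum_a delta_a * Delta_a
    s.t. delta_{a*} = 0 and, for each competing hypothesis g,
         sum_a delta_a * (f_star[a] - g[a])^2 / 2 >= 1.
    Returns (c(f*), delta*)."""
    K = len(f_star)
    if not competing:
        return 0.0, np.zeros(K)           # no competing hypotheses: bounded regret
    a_star = int(np.argmax(f_star))
    gaps = np.max(f_star) - f_star        # Delta_a
    # linprog minimizes gaps @ delta subject to A_ub @ delta <= b_ub,
    # so each ">= 1" constraint is negated into a "<= -1" row.
    A_ub = np.array([-(f_star - g) ** 2 / 2 for g in competing])
    b_ub = -np.ones(len(competing))
    bounds = [(0, 0) if a == a_star else (0, None) for a in range(K)]
    res = linprog(gaps, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.fun, res.x
```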

  12. Example: cheating code.
  • $K'$ base arms and $\log_2 K'$ cheating arms. Base-arm rewards lie in $\{1-\epsilon, 1, 1+\epsilon\}$; cheating-arm rewards lie in $\{0, \Lambda\}$.
  • $\epsilon > 0$: very small (like 0.0001). $\Lambda > 0$: not too small (like 0.5).
  • The lower bound: $\Theta\big(\frac{\log_2 K'}{\Lambda^2} \ln n\big)$. UCB: $\Theta\big(\frac{K'}{\epsilon^2} \ln n\big)$ ⟹ an exponential gap in $K'$!
  With $K' = 4$ base arms (A1–A4) and $\log_2 4 = 2$ cheating arms (A5, A6):

  |       | A1    | A2    | A3    | A4    | A5 | A6 |
  | $f_1$ | 1     | 1−ε   | 1−ε   | 1−ε   | 0  | 0  |
  | $f_2$ | 1−ε   | 1     | 1−ε   | 1−ε   | 0  | Λ  |
  | $f_3$ | 1−ε   | 1−ε   | 1     | 1−ε   | Λ  | 0  |
  | $f_4$ | 1−ε   | 1−ε   | 1−ε   | 1     | Λ  | Λ  |
  | $f_5$ | 1+ε   | 1     | 1−ε   | 1−ε   | 0  | 0  |
  | $f_6$ | 1     | 1+ε   | 1−ε   | 1−ε   | 0  | Λ  |
  | $f_7$ | 1−ε   | 1−ε   | 1+ε   | 1     | Λ  | 0  |
  | …     | …     | …     | …     | …     | …  | …  |

  The cheating arms encode, in binary via $\Lambda$ vs 0, which base arm is best, so pulling just the $\log_2 K'$ cheating arms identifies the best arm.
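A sketch of how the first block of this instance could be generated (my construction from the table above; defaults and names are illustrative, and the paper's exact constants may differ):

```python
import numpy as np

def cheating_code_instance(K_base, eps=1e-4, lam=0.5):
    """Build hypotheses f_1..f_{K_base} of the cheating-code example.

    Hypothesis f_i makes base arm i best (reward 1) and writes i in
    binary on the log2(K_base) cheating arms using rewards {0, lam}.
    Assumes K_base is a power of two.
    """
    n_cheat = int(np.log2(K_base))
    F = []
    for i in range(K_base):
        base = np.full(K_base, 1.0 - eps)
        base[i] = 1.0                                  # unique best base arm
        bits = [(i >> b) & 1 for b in reversed(range(n_cheat))]
        cheat = lam * np.array(bits, dtype=float)      # binary code of i
        F.append(np.concatenate([base, cheat]))
    return F
```

For `K_base=4` this reproduces rows $f_1$ through $f_4$ of the table.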

  13. The function classes and their regret contributions.
  • $\mathcal{C}(f^*)$: Competing ⟹ not distinguishable using $a^*(f^*)$, but supports a different arm. Regret contribution: $\Theta(\log n)$.
  • $\mathcal{D}(f^*)$: Docile ⟹ distinguishable using $a^*(f^*)$. Regret contribution: $\Theta(1)$.
  • $\mathcal{E}(f^*)$: Equivalent ⟹ supports $a^*(f^*)$ and the reward $\mu^*(f^*)$. Regret contribution: can be $\Theta(\log\log n)$.
  • [Proposition 2] $\mathcal{F} = \mathcal{C}(f^*) \cup \mathcal{D}(f^*) \cup \mathcal{E}(f^*)$ (disjoint union).
  [Figure: the hypotheses of slide 10 partitioned into $\mathcal{C}(f^*)$, $\mathcal{D}(f^*)$, and $\mathcal{E}(f^*)$, alongside their mean-reward curves over arms 1, 2, 3.]
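One way this partition could be computed for a finite $\mathcal{F}$, using the `best_arm` helper sketched after slide 9 (the set names follow the slide; the code itself is an illustrative assumption):

```python
def partition_classes(f_star, F, tol=1e-12):
    """Split F into competing C, docile D, and equivalent E w.r.t. f_star."""
    a_star = best_arm(f_star)
    C, D, E = [], [], []
    for g in F:
        if abs(g[a_star] - f_star[a_star]) > tol:
            D.append(g)   # docile: exposed just by pulling a*(f*)
        elif best_arm(g) != a_star:
            C.append(g)   # competing: agrees at a*(f*), different best arm
        else:
            E.append(g)   # equivalent: supports a*(f*) and its reward
    return C, D, E
```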

  14. CRush Optimism with Pessimism (CROP).

  15. CROP: overview.
  • The confidence set (confidence level $1 - 1/\mathrm{poly}(t)$):
    $L_t(f) := \sum_{s=1}^{t} \big(r_s - f(a_s)\big)^2$ (cumulative squared loss),
    $\mathcal{F}_t := \big\{ f \in \mathcal{F} : L_{t-1}(f) - \min_{g \in \mathcal{F}} L_{t-1}(g) \le \gamma_t := \Theta(\ln t) \big\}$ (the minimizer is the ERM).
  • Four important branches: Exploit, Feasible, Fallback, Conflict.
  • Exploit: does every $f \in \mathcal{F}_t$ support the same best arm? If yes, pull that arm.
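A sketch of the confidence-set computation (squared-loss form as above; `gamma` stands in for the paper's $\Theta(\ln t)$ schedule, and all names are mine):

```python
import numpy as np

def confidence_set(F, arm_history, reward_history, gamma):
    """Keep hypotheses whose cumulative squared loss is within gamma of the ERM."""
    arms = np.array(arm_history)
    rewards = np.array(reward_history)
    losses = np.array([np.sum((rewards - f[arms]) ** 2) for f in F])
    return [f for f, L in zip(F, losses) if L - losses.min() <= gamma]
```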

  16. CROP v1. At time $t$:
  • Maintain a confidence set $\mathcal{F}_t \subseteq \mathcal{F}$.
  • If every $f \in \mathcal{F}_t$ agrees on the best arm: (Exploit) pull that arm.
  • Else (Feasible):
    • Compute the pessimism: $\hat{f}_t = \arg\min_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$ (break ties by the cumulative loss).
      Cf. optimism: $\tilde{f}_t = \arg\max_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$.
    • Compute $\delta^* :=$ the solution of the optimization problem $c(\hat{f}_t)$.
    • (Tracking) Pull $a_t = \arg\min_{a \in \mathcal{A}} \frac{\mathrm{pull\_count}(a)}{\delta^*_a}$.
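Putting the pieces together, a simplified sketch of this v1 decision rule (reusing the earlier helper sketches; `c_lower_bound` is the LP from slide 11, `competing_set` is an assumed callback returning $\mathcal{C}(f)$, and the full algorithm's Fallback/Conflict branches are omitted):

```python
import numpy as np

def crop_v1_step(F_t, pull_counts, competing_set):
    """One decision of the simplified CROP loop: Exploit, or Feasible + Tracking."""
    best_arms = {best_arm(f) for f in F_t}
    if len(best_arms) == 1:
        return best_arms.pop()              # Exploit: everyone agrees
    # Feasible: pessimism = hypothesis with the smallest optimal reward
    f_hat = min(F_t, key=best_reward)       # tie-breaking by loss omitted here
    _, delta = c_lower_bound(f_hat, competing_set(f_hat))
    # Tracking: pull the arm furthest behind its target allocation delta_a
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = np.where(delta > 0, pull_counts / delta, np.inf)
    return int(np.argmin(ratios))
```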

  17. Why pessimism?

  | Arms  | A1  | A2  | A3  | A4  | A5  |
  | $f_1$ | 1   | .99 | .98 | 0   | 0   |
  | $f_2$ | .98 | .99 | .98 | .25 | 0   |
  | $f_3$ | .97 | .97 | .98 | .25 | .25 |

  • Suppose $\mathcal{F}_t = \{f_1, f_2, f_3\}$.
  • If I knew $f^*$, I could track $\delta(f^*)$ (= the solution of $c(f^*)$). Which $f$ should I track?
  • Pessimism: either does the right thing, or eliminates itself. Other choices may get stuck (so does the ERM).
  • Key idea: the lower-bound constraints prescribe how to distinguish $f^*$ from those supporting higher rewards.
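To make the selection concrete, a tiny numeric check of the pessimistic choice on the table above (illustrative only):

```python
import numpy as np

F_t = {
    "f1": np.array([1.00, 0.99, 0.98, 0.00, 0.00]),
    "f2": np.array([0.98, 0.99, 0.98, 0.25, 0.00]),
    "f3": np.array([0.97, 0.97, 0.98, 0.25, 0.25]),
}
# Pessimism picks the hypothesis with the smallest optimal reward:
# the max rewards are 1.00, 0.99, 0.98, so f3 is selected.
f_hat = min(F_t, key=lambda name: F_t[name].max())
print(f_hat)  # -> f3
```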

  18. But we may still get stuck.

  | Arms        | A1  | A2  | A3 | A4    | A5  |
  | $f_1$       | 1   | .99 | .98 | 0     | 0   |
  | $f_2 = f^*$ | .98 | .99 | .98 | .25   | 0   |
  | $f_3$       | .97 | .97 | 1   | .25   | .25 |
  | $f_4$       | .97 | .97 | 1   | .2499 | .25 |

  • Due to docile hypotheses, tracking $\delta(\hat{f}_t)$ alone can fail to eliminate everything. We must do something else.
  • Fix: solve an augmented allocation whose constraints also cover the docile hypotheses with best rewards higher than $\mu^*(f)$:
    $\omega(f) := \arg\min_{\delta \in [0,\infty)^K} \; \Delta_{\max}(f) \cdot \delta_{a^*(f)} + \sum_{a \ne a^*(f)} \Delta_a(f) \cdot \delta_a$
    s.t. $\forall g \in \mathcal{C}(f) \cup \mathcal{D}(f)$ with $\mu^*(g) \ge \mu^*(f)$: $\sum_{a} \delta_a \cdot \frac{(f(a) - g(a))^2}{2} \ge 1$,
    and $\delta \ge \max\{\delta(f), \varrho(f)\}$ elementwise (with $\varrho(f)$ defined in the paper).
  • Unlike $c(f)$, this includes the docile hypotheses whose best rewards are higher than $\mu^*(f)$.
