  1. Multi-armed Bandits Prof. Kuan-Ting Lai 2020/3/12

  2. k-armed Bandit Problem • Play k bandit machines and find a way to win the most money! • Note: assume you have unlimited money and never go bankrupt! https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b

  3. 10-armed Testbed • Each bandit machine has its own reward distribution
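
A minimal sketch of the 10-armed testbed, assuming (as in Sutton & Barto) that each arm's true value is drawn from a standard normal and each reward is normal around that value with unit variance:

    import numpy as np

    k = 10
    q_star = np.random.normal(0.0, 1.0, k)  # true action values, one per arm

    def pull(a):
        # Reward is noisy: drawn around the arm's true value with unit variance
        return np.random.normal(q_star[a], 1.0)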

  4. Action-value Function • $Q_t(a)$: the estimated value (expected reward) of action a at time t • Let $q_*(a)$ be the true (optimal) action-value function: $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$

  5. ε-greedy • Greedy action: always select the action with the maximum estimated value, $A_t \doteq \operatorname*{argmax}_a Q_t(a)$ • ε-greedy: select the greedy action 1 − ε of the time and a random action ε of the time
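
A minimal ε-greedy selection sketch (the names Q and eps are ours, not the slides'):

    import numpy as np

    def epsilon_greedy(Q, eps):
        # Explore uniformly at random with probability eps...
        if np.random.random() < eps:
            return np.random.randint(len(Q))
        # ...otherwise exploit the current best estimate
        return int(np.argmax(Q))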

  6. Performance of ε-greedy • Average rewards over 2000 runs with ε = 0, 0.1, 0.01

  7. Optimal Actions Selected by ε-greedy • Percentage of optimal actions selected over 2000 runs with ε = 0, 0.1, 0.01

  8. Update $Q_t(a)$ • Let $Q_n$ denote the estimate of an action's value after it has been selected n − 1 times: $Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$

  9. Deriving Update Rule • Rearranging the sample average gives $Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$, which requires memory only for $Q_n$, $n$, and the latest reward $R_n$
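
A sketch of the incremental update, assuming arrays Q (estimates) and N (pull counts) for each arm:

    import numpy as np

    k = 10
    Q = np.zeros(k)             # value estimates
    N = np.zeros(k, dtype=int)  # times each arm has been selected

    def update(a, r):
        # Q_{n+1} = Q_n + (1/n)(R_n - Q_n): no need to store past rewards
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]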

  10. Tracking a Nonstationary Problem • Use a constant step size $\alpha \in (0, 1]$: $Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$ • A constant step size never completely converges, but that is desirable when the reward distributions change over time

  11. Exponential Recency-weighted Average • $Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i$ • The weights sum to one: $(1 - \alpha)^n + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} = 1$
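
A sketch of the constant step-size variant (alpha = 0.1 is an arbitrary choice), with a quick numerical check that the recency weights above sum to one:

    import numpy as np

    alpha = 0.1

    def update_constant(Q, a, r):
        # Recent rewards get exponentially more weight than old ones
        Q[a] += alpha * (r - Q[a])

    # Check: (1-alpha)^n + sum_i alpha*(1-alpha)^(n-i) == 1, e.g. for n = 5
    n = 5
    weights = [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    assert abs(sum(weights) - 1.0) < 1e-12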

  12. Optimistic Initial Values • Optimistically high initial estimates encourage early exploration, but the effect is temporary • In practice we should not care about the initial values too much
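
A sketch of optimistic initialization (the +5 starting value is the one used in Sutton & Barto's example); even purely greedy selection then explores, because every untried arm still looks better than reality:

    import numpy as np

    k = 10
    Q = np.full(k, 5.0)  # wildly optimistic initial estimates

    # Greedy selection: each arm gets tried early, since its estimate only
    # drops toward the true value after it has been pulled
    a = int(np.argmax(Q))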

  13. Upper-Confidence-Bound Action Selection • $N_t(a)$: number of times that action a has been selected prior to time t • $A_t \doteq \operatorname*{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$ • Not practical for large state spaces
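
A UCB selection sketch (c is the exploration parameter; arms never tried are treated as maximally uncertain and picked first):

    import numpy as np

    def ucb_select(Q, N, t, c=2.0):
        # An untried arm has an unbounded confidence term: try it first
        if np.any(N == 0):
            return int(np.argmin(N))
        return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))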

  14. Gradient Bandit Algorithms • Learn a numerical preference $H_t(a)$ for each action and turn preferences into probabilities with the soft-max function: $\pi_t(a) \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$ • $\pi_t(a)$ is the probability of taking action a at time t

  15. Selecting Actions Based on $\pi_t(a)$
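
A soft-max selection sketch from preferences H (subtracting the max before exponentiating is a standard numerical-stability trick, not something the slides require):

    import numpy as np

    def softmax_select(H):
        # pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))
        pi = np.exp(H - np.max(H))
        pi /= pi.sum()
        # Sample an action according to its probability
        return np.random.choice(len(H), p=pi)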

  16. Gradient Ascent • Update each preference in the direction of the gradient of the expected reward: $H_{t+1}(a) \doteq H_t(a) + \alpha \, \frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}$

  17. Calculating Gradient • Adding a baseline $B_t$ does not change the gradient, because the probabilities sum to one: $\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_x \left( q_*(x) - B_t \right) \frac{\partial \pi_t(x)}{\partial H_t(a)}$

  18. Convert Equation into Expectation • Multiply each term by $\pi_t(x)/\pi_t(x)$ so the sum becomes an expectation over the action actually taken: $\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)} = \mathbb{E}\!\left[ (R_t - B_t)\, \frac{\partial \pi_t(A_t)/\partial H_t(a)}{\pi_t(A_t)} \right]$ • Choose the baseline $B_t = \bar{R}_t$, the average of the rewards so far

  19. Calculating Gradient of Softmax • $\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x) \left( \mathbb{1}_{a=x} - \pi_t(a) \right)$

  20. Final Result • The gradient bandit update $H_{t+1}(a) \doteq H_t(a) + \alpha (R_t - \bar{R}_t)\left( \mathbb{1}_{a=A_t} - \pi_t(a) \right)$ is stochastic gradient ascent on the expected reward!
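
A sketch of one full gradient bandit step implementing the update above (alpha is the step size; r_bar is the running average reward used as the baseline):

    import numpy as np

    def gradient_bandit_step(H, a, r, r_bar, alpha=0.1):
        # Current soft-max policy over the preferences
        pi = np.exp(H - np.max(H))
        pi /= pi.sum()
        # One-hot indicator for the action actually taken
        one_hot = np.zeros_like(H)
        one_hot[a] = 1.0
        # H_{t+1}(a) = H_t(a) + alpha*(R_t - Rbar_t)*(1{a=A_t} - pi_t(a))
        H += alpha * (r - r_bar) * (one_hot - pi)
        return H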

  21. Reference • Chapter 2 of Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, MIT Press, Nov. 2018
