

SLIDE 1

Multi-armed Bandits

  • Prof. Kuan-Ting Lai
  • 2020/3/12

SLIDE 2

k-armed Bandit Problem

  • Play k armed bandit machines and find a way to win the most money!
  • Note: assume you have unlimited money and never go bankrupt!

https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b

SLIDE 3

10-armed Testbed

  • Each bandit machine has its own reward distribution
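
A minimal Python sketch of such a testbed, following the setup in the Sutton & Barto reference (each true value q*(a) drawn from N(0, 1), each reward drawn from N(q*(a), 1)); the Testbed class name is just illustrative:

```python
import numpy as np

class Testbed:
    """k-armed testbed: each arm's true value q*(a) is drawn once from
    N(0, 1); pulling arm a returns a noisy reward from N(q*(a), 1)."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # hidden true action values

    def pull(self, a):
        return self.rng.normal(self.q_star[a], 1.0)      # one noisy reward

bandit = Testbed(seed=0)
print(bandit.q_star)   # the 10 hidden reward means
print(bandit.pull(3))  # a sample reward from arm 3
```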

SLIDE 4

Action-value Function

  • Let π‘Ÿβˆ— 𝑏 be the true (optimal) action-value function

π‘Ÿβˆ— 𝑏 ← 𝐹[𝑆𝑒|𝐡𝑒 = 𝑏]

  • 𝑅𝑒 𝑏 : The estimated value (reward) of action a at time t
SLIDE 5

ε-greedy

  • Greedy action
    − Always select the action with the max estimated value:

A_t ≐ argmax_a Q_t(a)

  • ε-greedy
    − Select the greedy action (1 − ε) of the time; select a random action ε of the time
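
A one-function sketch of ε-greedy selection (epsilon_greedy is an illustrative name; Q is the array of current estimates Q_t(a)):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon pick a uniformly random arm (explore);
    otherwise pick the arm with the highest estimate (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```

With epsilon = 0 this reduces to the pure greedy rule above.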

SLIDE 6

Performance of ε-greedy

  • Average rewards over 2000 runs with ε = 0, 0.1, 0.01

SLIDE 7

Optimal Actions Selected by ε-greedy

  • Percentage of optimal actions selected over 2000 runs with ε = 0, 0.1, 0.01
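
A sketch of the experiment behind these two figures, combining the testbed and ε-greedy sketches above (sample-average updates are used here and derived on the next slides; runs is reduced from the slides' 2000 for speed):

```python
import numpy as np

def run_experiment(eps, runs=200, steps=1000, k=10, seed=0):
    """Average reward per time step for epsilon-greedy on the k-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(size=k)           # fresh bandit per run
        Q, N = np.zeros(k), np.zeros(k)
        for t in range(steps):
            a = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(Q))
            r = rng.normal(q_star[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]         # incremental sample average
            avg_reward[t] += r
    return avg_reward / runs

for eps in (0.0, 0.1, 0.01):
    print(eps, run_experiment(eps)[-1])       # average reward at the last step
```
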
SLIDE 8

Update Q_t(a)

  • Let Q_n denote the estimate of an action's value after it has been selected n − 1 times, i.e. the sample average of its first n − 1 rewards:

Q_n ≐ (R_1 + R_2 + ⋯ + R_{n−1}) / (n − 1)

SLIDE 9

Deriving Update Rule

  • Rewriting the sample average incrementally:

Q_{n+1} = Q_n + (1/n) (R_n − Q_n)

  • Requires only memory of Q_n and R_n
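
A direct translation of the incremental rule (class name illustrative):

```python
class IncrementalAverage:
    """Maintains Q_n as a running sample average:
    Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""

    def __init__(self):
        self.q = 0.0   # current estimate Q_n
        self.n = 0     # number of rewards seen so far

    def update(self, reward):
        self.n += 1
        self.q += (reward - self.q) / self.n
        return self.q
```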

SLIDE 10

Tracking a Nonstationary Problem

  • Use a constant step-size α ∈ (0, 1]:

Q_{n+1} = Q_n + α (R_n − Q_n)

  • A constant step-size doesn't converge, so the estimate keeps tracking the most recent rewards

SLIDE 11

Exponential Recency-weighted Average

Q_{n+1} = (1 − α)^n Q_1 + Σ_{i=1}^{n} α (1 − α)^{n−i} R_i

  • The weights sum to 1:

(1 − α)^n + Σ_{i=1}^{n} α (1 − α)^{n−i} = 1
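
A quick numerical check of both identities, assuming the constant step-size update Q_{n+1} = Q_n + α(R_n − Q_n) from the previous slide:

```python
import numpy as np

alpha, n = 0.1, 50
rng = np.random.default_rng(0)
rewards = rng.normal(size=n)

q = q1 = 0.0
for r in rewards:                 # incremental constant step-size updates
    q += alpha * (r - q)

# Direct exponential recency-weighted average: newer rewards get larger weights
weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))
q_direct = (1 - alpha) ** n * q1 + weights @ rewards

print(np.isclose(q, q_direct))                            # True
print(np.isclose((1 - alpha) ** n + weights.sum(), 1.0))  # weights sum to 1
```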

SLIDE 12

Optimistic Initial Values

  • Setting initial estimates optimistically encourages early exploration, since every sampled reward is a disappointment
  • Still, we should not care too much about initial values in practice
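
A sketch of optimistic initialization (reusing the hypothetical Testbed above): pure greedy selection still explores at first, because every sampled reward falls short of the optimistic estimate and pulls it back down:

```python
import numpy as np

k = 10
Q = np.full(k, 5.0)   # true values are ~N(0, 1), so +5 is wildly optimistic
N = np.zeros(k)

def greedy_step(bandit):
    a = int(np.argmax(Q))        # pure greedy; optimism drives early exploration
    r = bandit.pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]    # sample-average update deflates Q[a]
    return a, r
```
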
SLIDE 13

Upper-Confidence-Bound Action Selection

  • N_t(a): the number of times that action a has been selected prior to time t
  • Not practical for large state spaces

A_t ≐ argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ]
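
A sketch of the selection rule (ucb_action is an illustrative name; arms with N_t(a) = 0 are treated as maximizing, as in the book's convention):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ]; untried arms go first."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```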

SLIDE 14

Gradient Bandit Algorithms

  • Learn a numerical preference H_t(a) for each action and turn it into action probabilities with the soft-max function:

π_t(a) ≐ e^{H_t(a)} / Σ_b e^{H_t(b)}

  • π_t(a) is the probability of taking action a at time t
SLIDE 15

Selecting Actions Based on π_t(a)

SLIDE 16

Gradient Ascent

  • Update each preference in the direction of the gradient of the expected reward:

H_{t+1}(a) ≐ H_t(a) + α ∂E[R_t] / ∂H_t(a)

SLIDE 17

Calculating Gradient

  • Expand the expectation: E[R_t] = Σ_x π_t(x) q*(x)
  • Adding a baseline B_t does not change the gradient, because Σ_x ∂π_t(x)/∂H_t(a) = 0:

∂E[R_t]/∂H_t(a) = Σ_x ( q*(x) − B_t ) ∂π_t(x)/∂H_t(a)
SLIDE 18

Convert Equation into Expectation

  • Multiply by π_t(x)/π_t(x) to rewrite the sum as an expectation over A_t
  • Choose the baseline B_t = R̄_t, the average of the rewards up through time t
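
Written out, the two bullets correspond to this step of the derivation (a reconstruction, following the Sutton & Barto reference):

```latex
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
  = \sum_x \pi_t(x)\,\bigl(q_*(x) - B_t\bigr)\,
    \frac{\partial \pi_t(x)/\partial H_t(a)}{\pi_t(x)}
  = \mathbb{E}\!\left[\bigl(q_*(A_t) - B_t\bigr)\,
    \frac{\partial \pi_t(A_t)/\partial H_t(a)}{\pi_t(A_t)}\right]
```
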
SLIDE 19

Calculating Gradient of Softmax
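
The identity this slide derives (a reconstruction; 1_{a=x} is 1 when a = x and 0 otherwise):

```latex
\frac{\partial \pi_t(x)}{\partial H_t(a)}
  = \pi_t(x)\,\bigl(\mathbf{1}_{a=x} - \pi_t(a)\bigr)
```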

SLIDE 20

Final Result

  • Gradient bandit algorithm = stochastic gradient ascent on the expected reward!

H_{t+1}(a) = H_t(a) + α ( R_t − R̄_t ) ( 1{a = A_t} − π_t(a) )
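
A sketch of one step of the resulting algorithm, reusing the hypothetical softmax and Testbed sketches above (the caller increments t and keeps H and r_bar between steps):

```python
import numpy as np

def gradient_bandit_step(H, r_bar, t, bandit, rng, alpha=0.1):
    """H[a] += alpha * (R_t - R_bar_t) * (1{a == A_t} - pi_t[a])."""
    pi = softmax(H)
    a = int(rng.choice(len(H), p=pi))     # sample A_t ~ pi_t
    r = bandit.pull(a)                    # reward from the testbed
    r_bar += (r - r_bar) / t              # incremental baseline R_bar_t
    one_hot = np.zeros(len(H))
    one_hot[a] = 1.0
    H += alpha * (r - r_bar) * (one_hot - pi)
    return H, r_bar
```
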
SLIDE 21

Reference

  • Chapter 2, Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, Nov. 2018