

SLIDE 1

Module 11 Introduction to Reinforcement Learning

CS 886: Sequential Decision Making and Reinforcement Learning, University of Waterloo

SLIDE 2

Machine Learning

  • Supervised Learning

– Teacher tells learner what to remember

  • Reinforcement Learning

– Environment provides hints to learner

  • Unsupervised Learning

– Learner discovers on its own

SLIDE 3

Animal Psychology

  • Negative reinforcements:

– Pain and hunger

  • Positive reinforcements:

– Pleasure and food

  • Reinforcements used to train animals
  • Let’s do the same with computers!
SLIDE 4

RL Examples

  • Game playing (backgammon, solitaire)
  • Operations research (pricing, vehicle routing)
  • Elevator scheduling
  • Helicopter control
  • Spoken dialog systems
SLIDE 5

Reinforcement Learning

  • Definition:

– Markov decision process with unknown transition and reward models

  • Set of states S
  • Set of actions A

– Actions may be stochastic

  • Set of reinforcement signals (rewards)

– Rewards may be delayed

SLIDE 6

Policy optimization

  • Markov Decision Process:

– Find optimal policy given transition and reward model
– Execute policy found

  • Reinforcement learning:

– Learn an optimal policy while interacting with the environment

SLIDE 7

Reinforcement Learning Problem

[Diagram: agent–environment loop. The agent sends actions a0, a1, a2, … to the environment and receives states s0, s1, s2, … and rewards r0, r1, r2, …]

Goal: Learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots$
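To make the objective concrete, here is a minimal Python sketch of the discounted return; the discount factor and reward sequence below are illustrative assumptions, not from the slides.

# Discounted return: r0 + gamma*r1 + gamma^2*r2 + ...
# gamma and the reward sequence are illustrative values only.
gamma = 0.9
rewards = [0, 0, 1, 0, 1]   # r0, r1, r2, r3, r4

g = sum(gamma**t * r for t, r in enumerate(rewards))
print(g)                    # 0.9**2 + 0.9**4 = 1.4661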

SLIDE 8

Example: Inverted Pendulum

  • State: $x(t), x'(t), \theta(t), \theta'(t)$

– Cart position and velocity, pole angle and angular velocity

  • Action: force $F$
  • Reward: 1 for any step where the pole is balanced

Problem: Find $\pi : S \rightarrow A$ that maximizes the rewards

SLIDE 9

Types of RL

  • Passive vs Active learning

– Passive learning: the agent executes a fixed policy and tries to evaluate it
– Active learning: the agent updates its policy as it learns

  • Model based vs model free

– Model-based: learn the transition and reward model and use it to determine the optimal policy
– Model free: derive the optimal policy without learning the model

SLIDE 10

Passive Learning

  • Transition and reward model known:

– Evaluate $\pi$: $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s)) \, V^\pi(s') \quad \forall s$

  • Transition and reward model unknown:

– Estimate the value of the policy as the agent executes it: $V^\pi(s) = E_\pi\left[ \sum_t \gamma^t R(s_t, \pi(s_t)) \right]$
– Model based vs model free

SLIDE 11

Passive learning

[Figure: 4×3 grid world with a fixed policy shown as arrows (up/right/left) in each cell; terminal states +1 at (4,3) and −1 at (4,2)]

γ = 1; reward is −0.04 for non-terminal states; we do not know the transition probabilities.

Observed episodes:

(1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (4,3) +1
(1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (4,3) +1
(1,1) (2,1) (3,1) (3,2) (4,2) −1

What is the value V(s) of being in state s?
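One simple way to answer this from the episodes alone (not named on the slide) is a Monte Carlo estimate: average the observed return that follows each visit to a state. A minimal sketch using the three episodes above, with γ = 1 and the −0.04 step reward as on the slide:

from collections import defaultdict

# The three observed episodes; the last state of each is terminal.
episodes = [
    [(1,1),(1,2),(1,3),(1,2),(1,3),(2,3),(3,3),(4,3)],
    [(1,1),(1,2),(1,3),(2,3),(3,3),(3,2),(3,3),(4,3)],
    [(1,1),(2,1),(3,1),(3,2),(4,2)],
]
terminal_reward = {(4,3): 1.0, (4,2): -1.0}
step_reward = -0.04

returns = defaultdict(list)
for ep in episodes:
    for i, s in enumerate(ep):
        # Return from s: -0.04 for each remaining non-terminal state
        # (including s itself), plus the terminal reward; gamma = 1.
        g = step_reward * (len(ep) - 1 - i) + terminal_reward[ep[-1]]
        returns[s].append(g)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V[(1, 1)])   # average of 0.72, 0.72, -1.16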

SLIDE 12

Passive ADP

  • Adaptive dynamic programming (ADP)

– Model-based
– Learn transition probabilities and rewards from observations
– Then update the values of the states

SLIDE 13

ADP Example

[Figure: same 4×3 grid world and the three observed episodes as on the previous slide]

γ = 1; reward is −0.04 for non-terminal states.

From the episodes: P((2,3) | (1,3), r) = 2/3 and P((1,2) | (1,3), r) = 1/3 (here r denotes the action "right"). We need to learn all the transition probabilities! Use this information in

$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s)) \, V^\pi(s')$

SLIDE 14

Passive ADP

PassiveADP(π)
  Repeat
    Execute π(s); observe s′ and r
    Update counts: n(s) ← n(s) + 1,  n(s, s′) ← n(s, s′) + 1
    Update transition: Pr(s′ | s, π(s)) ← n(s, s′) / n(s)  ∀s′
    Update reward: R(s, π(s)) ← [r + (n(s) − 1) R(s, π(s))] / n(s)
    Solve: V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′ | s, π(s)) V^π(s′)  ∀s
    s ← s′
  Until convergence of V^π
  Return V^π
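A Python sketch of this loop, assuming a hypothetical environment object with reset() and step(s, a) → (s′, r) methods; the interface and the iterative solver are illustrative assumptions, since the slide just says "Solve".

from collections import defaultdict

def passive_adp(env, policy, states, gamma, steps=10000, tol=1e-8):
    n = defaultdict(int)       # n(s)
    n2 = defaultdict(int)      # n(s, s')
    R = defaultdict(float)     # learned R(s, pi(s))
    V = {s: 0.0 for s in states}
    s = env.reset()            # assumed interface
    for _ in range(steps):
        s2, r = env.step(s, policy[s])     # assumed interface
        n[s] += 1
        n2[(s, s2)] += 1
        R[s] += (r - R[s]) / n[s]          # running average of rewards
        while True:                        # solve the Bellman equations
            delta = 0.0
            for x in states:
                if n[x] == 0:
                    continue               # never visited: keep V[x] = 0
                v = R[x] + gamma * sum(n2[(x, y)] / n[x] * V[y]
                                       for y in states)
                delta = max(delta, abs(v - V[x]))
                V[x] = v
            if delta < tol:
                break
        s = s2
    return V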

SLIDE 15

Passive TD

  • Temporal difference (TD)

– Model free

  • At each time step

– Observe: s, a, s′, r
– Update V^π(s) after each move:

V^π(s) ← V^π(s) + α (R(s, π(s)) + γ V^π(s′) − V^π(s))

where α is the learning rate and the term in parentheses is the temporal difference.

SLIDE 16

TD Convergence

Theorem: If α is appropriately decreased with the number of times a state is visited, then V^π(s) converges to the correct value.

  • α must satisfy:

– $\sum_t \alpha_t = \infty$
– $\sum_t \alpha_t^2 < \infty$

  • Often α(s) = 1/n(s), where n(s) = # of times s is visited
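As a quick sanity check, the schedule $\alpha_t = 1/t$ satisfies both conditions:

$\sum_{t=1}^{\infty} \frac{1}{t} = \infty \ \text{(harmonic series diverges)}, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty$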
SLIDE 17

Passive TD

PassiveTD(π, V^π)
  Repeat
    Execute π(s); observe s′ and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update value: V^π(s) ← V^π(s) + α (r + γ V^π(s′) − V^π(s))
    s ← s′
  Until convergence of V^π
  Return V^π
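A direct Python transcription; the environment interface with reset() and step(s, a) → (s′, r) is an assumption, and a fixed step budget stands in for "until convergence".

from collections import defaultdict

def passive_td(env, policy, gamma, steps=100000):
    V = defaultdict(float)     # V^pi(s), initialized to 0
    n = defaultdict(int)       # visit counts n(s)
    s = env.reset()            # assumed interface
    for _ in range(steps):
        s2, r = env.step(s, policy[s])   # assumed interface
        n[s] += 1
        alpha = 1.0 / n[s]               # decaying learning rate
        V[s] += alpha * (r + gamma * V[s2] - V[s])   # TD update
        s = s2
    return V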

SLIDE 18

Comparison

  • Model free approach:

– Less computation per time step

  • Model based approach:

– Fewer time steps before convergence

SLIDE 19

Active Learning

  • Ultimately, we are interested in improving π
  • Transition and reward model known:

$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) \, V^*(s') \right]$

  • Transition and reward model unknown:

– Improve the policy as the agent executes it
– Model based vs model free

SLIDE 20

Q-learning (aka active temporal difference)

  • Q-function: $Q : S \times A \rightarrow \mathbb{R}$

– Value of a state–action pair
– Policy π(s) = argmax_a Q(s, a) is the greedy policy w.r.t. Q

  • Bellman’s equation:

$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) \, \max_{a'} Q^*(s', a')$

SLIDE 21

Q-Learning

Qlearning(s, Q*)
  Repeat
    Select and execute a; observe s′ and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update Q-value: Q*(s, a) ← Q*(s, a) + α (r + γ max_{a′} Q*(s′, a′) − Q*(s, a))
    s ← s′
  Until convergence of Q*
  Return Q*
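A Python sketch of the same loop, with ε-greedy exploration standing in for "Select and execute a" (exploration methods are discussed a few slides later); the environment interface is an assumption.

import random
from collections import defaultdict

def q_learning(env, actions, gamma, epsilon=0.1, steps=100000):
    Q = defaultdict(float)     # Q(s, a), initialized to 0
    n = defaultdict(int)       # visit counts n(s)
    s = env.reset()            # assumed interface
    for _ in range(steps):
        if random.random() < epsilon:                 # explore
            a = random.choice(actions)
        else:                                         # exploit
            a = max(actions, key=lambda b: Q[(s, b)])
        s2, r = env.step(s, a)                        # assumed interface
        n[s] += 1
        alpha = 1.0 / n[s]
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q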

SLIDE 22

Q-learning example

[Figure: Q-values around states s1 and s2, before the update (Q(s1, right) = 73; Q(s2, ·) = 66, 81, 100) and after the update (Q(s1, right) = 81.5)]

γ = 0.9, α = 0.5, r = 0 for non-terminal states

Q(s1, right) ← Q(s1, right) + α (r + γ max_{a′} Q(s2, a′) − Q(s1, right))
             = 73 + 0.5 (0 + 0.9 max{66, 81, 100} − 73)
             = 73 + 0.5 (17) = 81.5
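The same update, checked numerically:

gamma, alpha = 0.9, 0.5
q_s1_right, r = 73.0, 0.0
q_s2 = [66.0, 81.0, 100.0]   # Q(s2, a') for the three actions

print(q_s1_right + alpha * (r + gamma * max(q_s2) - q_s1_right))  # 81.5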

SLIDE 23

Q-Learning

Qlearning(s, Q*)
  Repeat
    Select and execute a; observe s′ and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update Q-value: Q*(s, a) ← Q*(s, a) + α (r + γ max_{a′} Q*(s′, a′) − Q*(s, a))
    s ← s′
  Until convergence of Q*
  Return Q*

SLIDE 24

Exploration vs Exploitation

  • If an agent always chooses the action with the highest value then it is exploiting

– The learned model is not the real model
– Leads to suboptimal results

  • By taking random actions (pure exploration) an agent may learn the model

– But what is the use of learning a complete model if parts of it are never used?

  • Need a balance between exploitation and exploration

SLIDE 25

Common exploration methods

  • ε-greedy:

– With probability ε execute a random action
– Otherwise execute the best action a* = argmax_a Q(s, a)

  • Boltzmann exploration:

$\Pr(a) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}$
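Both rules as small Python functions (a sketch: Q is assumed to be a dictionary keyed by (state, action) pairs, and T is the temperature):

import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon pick a random action, else the greedy one.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, T=1.0):
    # Sample an action with probability proportional to e^(Q(s,a)/T).
    weights = [math.exp(Q[(s, a)] / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

Lower temperatures T make Boltzmann exploration greedier; higher temperatures make it closer to uniform random.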

SLIDE 26

Exploration and Q-learning

  • Q-learning converges to the optimal Q-values if

– Every state is visited infinitely often (due to exploration)
– The action selection becomes greedy as time approaches infinity
– The learning rate α is decreased fast enough, but not too fast

SLIDE 27

Model-based Active RL

  • Idea: at each step

– Execute action
– Observe resulting state and reward
– Update model
– Update policy π

SLIDE 28

Model-based Active RL

ModelBasedActiveRL(s)
  Repeat
    Select and execute a; observe s′ and r
    Update counts: n(s, a) ← n(s, a) + 1,  n(s, a, s′) ← n(s, a, s′) + 1
    Update transition: Pr(s′ | s, a) ← n(s, a, s′) / n(s, a)  ∀s′
    Update reward: R(s, a) ← [r + (n(s, a) − 1) R(s, a)] / n(s, a)
    Solve: V*(s) = max_a [ R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V*(s′) ]  ∀s
    s ← s′
  Until convergence of V*
  Return V*
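A Python sketch combining model learning with value iteration on the learned model. The environment interface and the purely random action selection are assumptions; in practice the action should be chosen by one of the exploration methods above.

import random
from collections import defaultdict

def model_based_active_rl(env, actions, states, gamma,
                          steps=10000, tol=1e-8):
    n_sa = defaultdict(int)    # n(s, a)
    n_sas = defaultdict(int)   # n(s, a, s')
    R = defaultdict(float)     # learned R(s, a)
    V = {s: 0.0 for s in states}
    s = env.reset()            # assumed interface
    for _ in range(steps):
        a = random.choice(actions)   # placeholder exploration policy
        s2, r = env.step(s, a)       # assumed interface
        n_sa[(s, a)] += 1
        n_sas[(s, a, s2)] += 1
        R[(s, a)] += (r - R[(s, a)]) / n_sa[(s, a)]
        while True:                  # value iteration on the learned model
            delta = 0.0
            for x in states:
                vals = [R[(x, b)] + gamma *
                        sum(n_sas[(x, b, y)] / n_sa[(x, b)] * V[y]
                            for y in states)
                        for b in actions if n_sa[(x, b)] > 0]
                if not vals:
                    continue         # no action tried in x yet
                v = max(vals)
                delta = max(delta, abs(v - V[x]))
                V[x] = v
            if delta < tol:
                break
        s = s2
    return V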

SLIDE 29

Summary

  • We can optimize a policy by RL when the transition and reward functions are unknown
  • Comparison:

– Model free: computationally cheaper
– Model-based: faster convergence

  • Active learning:

– Exploration/exploitation dilemma