Module 11: Introduction to Reinforcement Learning
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Machine Learning
- Supervised Learning
– Teacher tells learner what to remember
- Reinforcement Learning
– Environment provides hints to learner
- Unsupervised Learning
– Learner discovers on its own
Animal Psychology
- Negative reinforcements:
– Pain and hunger
- Positive reinforcements:
– Pleasure and food
- Reinforcements used to train animals
- Let’s do the same with computers!
RL Examples
- Game playing (backgammon, solitaire)
- Operations research (pricing, vehicle routing)
- Elevator scheduling
- Helicopter control
- Spoken dialog systems
Reinforcement Learning
- Definition:
– Markov decision process with unknown transition and reward models
- Set of states S
- Set of actions A
– Actions may be stochastic
- Set of reinforcement signals (rewards)
– Rewards may be delayed
Policy optimization
- Markov Decision Process:
– Find the optimal policy given the transition and reward model
– Execute the policy found
- Reinforcement learning:
– Learn an optimal policy while interacting with the environment
Reinforcement Learning Problem
[Figure: agent–environment interaction loop; the agent observes states s_0, s_1, s_2, …, receives rewards r_0, r_1, r_2, …, and chooses actions a_0, a_1, a_2, …]
Goal: Learn to choose actions that maximize r_0 + γr_1 + γ²r_2 + γ³r_3 + ⋯
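A quick numeric check of the discounted objective above; the reward sequence is made up purely for illustration.

```python
# Discounted return r_0 + γ r_1 + γ² r_2 + ... for a made-up reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical r_0..r_3
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```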
Example: Inverted Pendulum
- State: x(t), x′(t), θ(t), θ′(t)
- Action: force F
- Reward: 1 for any step where the pole is balanced
- Problem: find π: S → A that maximizes rewards
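To make the state/action/reward interface concrete, here is a minimal environment sketch. The dynamics are deliberately simplified (pole angle only, frictionless, Euler integration) and every name is a hypothetical choice for illustration, not the actual benchmark dynamics.

```python
import math

class SimplePendulum:
    """Pole balancing with simplified, frictionless dynamics (an assumption)."""

    def __init__(self, g=9.8, length=1.0, mass=1.0, dt=0.02):
        self.g, self.l, self.m, self.dt = g, length, mass, dt
        self.theta, self.theta_dot = 0.05, 0.0  # start near upright

    def step(self, force):
        # theta'' = (g/l) sin(theta) + force/(m l^2), integrated with Euler steps
        theta_ddot = (self.g / self.l) * math.sin(self.theta) \
                     + force / (self.m * self.l ** 2)
        self.theta_dot += theta_ddot * self.dt
        self.theta += self.theta_dot * self.dt
        reward = 1.0 if abs(self.theta) < 0.5 else 0.0  # +1 while balanced
        return (self.theta, self.theta_dot), reward
```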
Types of RL
- Passive vs active learning
– Passive learning: the agent executes a fixed policy and tries to evaluate it
– Active learning: the agent updates its policy as it learns
- Model-based vs model-free
– Model-based: learn the transition and reward model and use it to determine the optimal policy
– Model-free: derive the optimal policy without learning the model
Passive Learning
- Transition and reward model known:
– Evaluate π (a solver sketch follows below):
V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′|s, π(s)) V^π(s′)   ∀s
- Transition and reward model unknown:
– Estimate the value of the policy as the agent executes it:
V^π(s) = E_π[ Σ_t γ^t R(s_t, π(s_t)) ]
– Model-based vs model-free
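A minimal sketch of exact policy evaluation for the known-model case, solving the linear system above directly. The array layout (P for the fixed policy's transition matrix, R for its reward vector) is an assumption for illustration.

```python
import numpy as np

def evaluate_policy(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Solve V = R + gamma * P @ V, i.e. (I - gamma * P) V = R.

    P[s, s'] = Pr(s'|s, pi(s)) under the fixed policy; R[s] = R(s, pi(s)).
    """
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```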
Passive learning
[Figure: 4×3 grid world with a fixed policy shown as arrows (left, up, right); terminal states +1 at (4,3) and −1 at (4,2)]
γ = 1; reward is −0.04 for non-terminal states; the transition probabilities are unknown.
Observed trajectories:
(1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (4,3)+1
(1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (4,3)+1
(1,1) (2,1) (3,1) (3,2) (4,2)−1
What is the value V(s) of being in state s?
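One way to answer this without a model is direct (Monte Carlo) estimation: average the observed returns from each visit to a state. A minimal sketch over the three trajectories above, assuming γ = 1 and −0.04 per non-terminal step; the names are hypothetical.

```python
from collections import defaultdict

def mc_estimate(trajectories, step_reward=-0.04):
    """Every-visit Monte Carlo value estimate with gamma = 1."""
    returns = defaultdict(list)
    for states, terminal_reward in trajectories:
        for i, s in enumerate(states):
            steps_left = len(states) - 1 - i          # non-terminal steps remaining
            g = steps_left * step_reward + terminal_reward
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

trajs = [
    ([(1,1),(1,2),(1,3),(1,2),(1,3),(2,3),(3,3),(4,3)], +1),
    ([(1,1),(1,2),(1,3),(2,3),(3,3),(3,2),(3,3),(4,3)], +1),
    ([(1,1),(2,1),(3,1),(3,2),(4,2)], -1),
]
print(mc_estimate(trajs))
```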
Passive ADP
- Adaptive dynamic programming (ADP)
– Model-based
– Learn transition probabilities and rewards from observations
– Then update the values of the states
ADP Example
[Figure: the same 4×3 grid world and fixed policy as before]
γ = 1; reward is −0.04 for non-terminal states.
Observed trajectories:
(1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (4,3)+1
(1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (4,3)+1
(1,1) (2,1) (3,1) (3,2) (4,2)−1
From the counts: P((2,3)|(1,3), r) = 2/3 and P((1,2)|(1,3), r) = 1/3.
We need to learn all the transition probabilities! Use this information in
V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′|s, π(s)) V^π(s′)
Passive ADP
PassiveADP(π)
  Repeat
    Execute π(s); observe s′ and r
    Update counts: N(s) ← N(s) + 1,  N(s, s′) ← N(s, s′) + 1
    Update transition: Pr(s′|s, π(s)) ← N(s, s′) / N(s)   ∀s′
    Update reward: R(s, π(s)) ← [r + (N(s) − 1) R(s, π(s))] / N(s)
    Solve: V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′|s, π(s)) V^π(s′)   ∀s
    s ← s′
  Until convergence of V^π
  Return V^π
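A minimal executable sketch of passive ADP, assuming tabular states, a fixed policy pi (an array mapping state to action), and a hypothetical env.step(action) -> (next_state, reward) interface.

```python
import numpy as np

def passive_adp(env, pi, n_states, s, gamma, steps=10_000):
    N = np.zeros(n_states)                 # visit counts N(s)
    Nss = np.zeros((n_states, n_states))   # transition counts N(s, s')
    R = np.zeros(n_states)                 # running-average reward R(s, pi(s))
    for _ in range(steps):
        s2, r = env.step(pi[s])            # execute pi(s), observe s' and r
        N[s] += 1
        Nss[s, s2] += 1
        R[s] += (r - R[s]) / N[s]          # incremental mean, as in the slide's update
        s = s2
    P = Nss / np.maximum(N[:, None], 1)    # maximum-likelihood transition model
    V = np.linalg.solve(np.eye(n_states) - gamma * P, R)
    return V
```

The slide re-solves for V^π after every step; solving once at the end (or periodically) is a common shortcut that does not change the estimated model.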
Passive TD
- Temporal difference (TD)
– Model-free
- At each time step
– Observe s, a, s′, r
– Update V^π(s) after each move:
V^π(s) ← V^π(s) + α (R(s, π(s)) + γ V^π(s′) − V^π(s))
where α is the learning rate and the bracketed term is the temporal difference
TD Convergence
Theorem: If α is decreased appropriately with the number of times a state is visited, then V^π(s) converges to the correct value.
- α must satisfy:
– Σ_t α_t = ∞
– Σ_t α_t² < ∞
- Often α(s) = 1/n(s)
– where n(s) = # of times s is visited
Passive TD
PassiveTD(π, V^π)
  Repeat
    Execute π(s); observe s′ and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update value: V^π(s) ← V^π(s) + α (r + γ V^π(s′) − V^π(s))
    s ← s′
  Until convergence of V^π
  Return V^π
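A minimal sketch of passive TD(0) under the same hypothetical tabular/env interface as the ADP sketch above.

```python
import numpy as np

def passive_td(env, pi, n_states, s, gamma, steps=100_000):
    V = np.zeros(n_states)
    n = np.zeros(n_states)
    for _ in range(steps):
        s2, r = env.step(pi[s])
        n[s] += 1
        alpha = 1.0 / n[s]                          # meets the convergence conditions
        V[s] += alpha * (r + gamma * V[s2] - V[s])  # temporal-difference update
        s = s2
    return V
```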
Comparison
- Model-free approach:
– Less computation per time step
- Model-based approach:
– Fewer time steps before convergence
Active Learning
- Ultimately, we are interested in improving π
- Transition and reward model known (a value iteration sketch follows below):
V*(s) = max_a [ R(s, a) + γ Σ_{s′} Pr(s′|s, a) V*(s′) ]
- Transition and reward model unknown:
– Improve the policy as the agent executes it
– Model-based vs model-free
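For the known-model case, the fixed point above can be computed by value iteration. A minimal sketch; the array layout (P[a] as a per-action transition matrix, R as a state-by-action reward table) is an assumption for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' P[a][s,s'] V(s')]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new
```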
Q-learning (aka active temporal difference)
- Q-function: Q: S × A → ℝ
– Value of a state-action pair
– The policy π(s) = argmax_a Q(s, a) is the greedy policy w.r.t. Q
- Bellman's equation:
Q*(s, a) = R(s, a) + γ Σ_{s′} Pr(s′|s, a) max_{a′} Q*(s′, a′)
Q-Learning
Qlearning(s, Q*)
  Repeat
    Select and execute a; observe s′ and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update Q-value: Q*(s, a) ← Q*(s, a) + α (r + γ max_{a′} Q*(s′, a′) − Q*(s, a))
    s ← s′
  Until convergence of Q*
  Return Q*
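A minimal tabular Q-learning sketch. Action selection is ε-greedy here (exploration is discussed a few slides below); the env interface and all names are hypothetical.

```python
import numpy as np

def q_learning(env, n_states, n_actions, s, gamma, eps=0.1, steps=100_000):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    n = np.zeros(n_states)
    for _ in range(steps):
        # epsilon-greedy: random action with probability eps, else greedy
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2, r = env.step(a)
        n[s] += 1
        alpha = 1.0 / n[s]
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # target uses max over a'
        s = s2
    return Q
```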
Q-learning example
[Figure: before/after snapshots of a grid fragment; the arrow from s1 to s2 has Q(s1, right) = 73 before the update, the actions out of s2 have Q-values 100, 66, and 81, and after the update Q(s1, right) = 81.5]
γ = 0.9, α = 0.5, r = 0 for non-terminal states
Q(s1, right) ← Q(s1, right) + α (r + γ max_{a′} Q(s2, a′) − Q(s1, right))
= 73 + 0.5 (0 + 0.9 max{66, 81, 100} − 73)
= 73 + 0.5 (17) = 81.5
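A one-line check of the arithmetic above.

```python
q = 73 + 0.5 * (0 + 0.9 * max(66, 81, 100) - 73)
print(q)  # 81.5
```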
Exploration vs Exploitation
- If an agent always chooses the action with the highest value, then it is exploiting
– The learned model is not the real model
– This leads to suboptimal results
- By taking random actions (pure exploration), an agent may learn the model
– But what is the use of learning a complete model if parts of it are never used?
- A balance between exploitation and exploration is needed
Common exploration methods
- ε-greedy:
– With probability ε, execute a random action
– Otherwise, execute the best action a* = argmax_a Q(s, a)
- Boltzmann exploration:
Pr(a) = e^{Q(s,a)/T} / Σ_{a} e^{Q(s,a)/T}
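A minimal sketch of both rules; Q is a 1-D array of Q-values for the current state and T is the Boltzmann temperature (both hypothetical names).

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, eps):
    if rng.random() < eps:
        return int(rng.integers(len(Q)))  # explore: uniformly random action
    return int(Q.argmax())                # exploit: greedy action

def boltzmann(Q, T):
    z = np.exp((Q - Q.max()) / T)         # subtract the max for numerical stability
    return int(rng.choice(len(Q), p=z / z.sum()))
```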
Exploration and Q-learning
- Q-learning converges to optimal Q-values if
– Every state is visited infinitely often (due to exploration)
– The action selection becomes greedy as time approaches infinity
– The learning rate α is decreased fast enough, but not too fast (the conditions on Σ_t α_t above)
Model-based Active RL
- Idea: at each step
– Execute an action
– Observe the resulting state and reward
– Update the model
– Update the policy π
Model-based Active RL
ModelBasedActiveRL(s)
  Repeat
    Select and execute a; observe s′ and r
    Update counts: N(s, a) ← N(s, a) + 1,  N(s, a, s′) ← N(s, a, s′) + 1
    Update transition: Pr(s′|s, a) ← N(s, a, s′) / N(s, a)   ∀s′
    Update reward: R(s, a) ← [r + (N(s, a) − 1) R(s, a)] / N(s, a)
    Solve: V*(s) = max_a [ R(s, a) + γ Σ_{s′} Pr(s′|s, a) V*(s′) ]   ∀s
    s ← s′
  Until convergence of V*
  Return V*
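A minimal sketch tying the pieces together, reusing the hypothetical value_iteration sketch from the active-learning slide. Re-solving after every step, as the slide does, is expensive; real implementations typically re-plan less often.

```python
import numpy as np

def model_based_active_rl(env, n_states, n_actions, s, gamma, eps=0.1, steps=5_000):
    rng = np.random.default_rng(0)
    Nsa = np.zeros((n_states, n_actions))
    Nsas = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    policy = np.zeros(n_states, dtype=int)
    for _ in range(steps):
        # epsilon-greedy around the current planned policy, to keep exploring
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(policy[s])
        s2, r = env.step(a)
        Nsa[s, a] += 1
        Nsas[s, a, s2] += 1
        R[s, a] += (r - R[s, a]) / Nsa[s, a]        # running-average reward
        P = Nsas / np.maximum(Nsa[:, :, None], 1)   # maximum-likelihood transitions
        _, policy = value_iteration([P[:, a2, :] for a2 in range(n_actions)], R, gamma)
        s = s2
    return policy
```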
Summary
- We can optimize a policy by RL when the transition and reward functions are unknown
- Comparison:
– Model-free: computationally cheaper
– Model-based: faster convergence
- Active learning:
– Requires balancing exploration and exploitation