Reinforcement Learning (RL)
CE-717: Machine Learning
Sharif University of Technology
- M. Soleymani
Fall 2018
Most slides have been taken from Klein and Abbeel, CS188, UC Berkeley.
The environment's response affects our subsequent actions.
We find out the effects of our actions later.
(state, action, reward)
The agent is not told which action is the correct one to achieve its goal.
The agent's objective is to maximize the amount of reward it receives over time.
[We discuss only fully observable environments.]
Actions influence later perceptions (inputs).
Delayed reward: actions may affect not only the immediate reward but also subsequent rewards.
Learning by trial and error
Opportunity for active exploration
Needs a trade-off between exploration and exploitation
Transition and reward functions
Agent observes state s_t ∈ S, then chooses action a_t ∈ A, then receives reward r_t, and the state changes to s_{t+1}.
Stochastic transition and/or reward function
Discounted return: R_t = Σ_{k=0}^∞ γ^k r_{t+k}, where 0 ≤ γ < 1 is the discount factor.
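To make the discounting concrete, here is a minimal Python sketch (not from the slides; the function name and reward values are illustrative) that evaluates this sum for a finite observed reward sequence:

```python
# Minimal sketch: discounted return sum_k gamma^k * r_{t+k} for a finite
# observed reward sequence (illustrative values, not from the slides).
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```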
[Racing car example: states are Cool, Warm, and Overheated]
Recycling robot example: A(high) = {search, wait}, A(low) = {search, wait, recharge}
e.g., P(s_{t+1} = high | s_t = high, a_t = search)
Training data are of the form: sequences of (state, action, reward).
Value iteration and policy iteration are dynamic programming approaches.
Q-learning is a temporal-difference method.
State value function for policy π:
V^π(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, π ]
= E[ r_t + γ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, π ]
= Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_{t+1} = s', π ) ]
= Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
where P^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a)
Action value function:
Q^π(s, a) = E[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, a_t = a, π ]
= Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_{t+1} = s', π ) ]
= Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
V^π(s) = R^π(s) + γ Σ_{s'} P^π_{ss'} V^π(s')
We use the same policy no matter what the initial state of the MDP is.
s is a state
(s, a) is a q-state
(s, a, s') is a transition
Bellman equations characterize the optimal values.
Value iteration computes them.
Value iteration is just a fixed-point solution method.
… though the Vk vectors are also interpretable as time-limited values
Key idea: time-limited values. Define Vk(s) to be the optimal value of s if the game ends in k more time steps.
Start with V0(s) = 0: no time steps left means an expected reward sum of zero.
Given the vector of Vk(s) values, do one ply of expectimax from each state:
Vk+1(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ Vk(s') ]
Repeat until convergence.
Complexity of each iteration: O(S²A).
Theorem: will converge to unique optimal values.
Basic idea: approximations get refined towards optimal values
Policy may converge long before values do
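As a concrete illustration, here is a minimal value-iteration sketch in Python on the racing example above. The dictionary format and the exact transition probabilities and rewards are assumptions for illustration, not taken verbatim from the slides.

```python
# Toy MDP: T[s][a] is a list of (prob, next_state, reward) triples.
# Numbers are assumed for illustration (racing example: Cool/Warm/Overheated).
T = {
    'cool': {'slow': [(1.0, 'cool', 1)],
             'fast': [(0.5, 'cool', 2), (0.5, 'warm', 2)]},
    'warm': {'slow': [(0.5, 'cool', 1), (0.5, 'warm', 1)],
             'fast': [(1.0, 'overheated', -10)]},
    'overheated': {},                         # terminal: no actions
}
gamma = 0.9
V = {s: 0.0 for s in T}                       # V_0(s) = 0
for _ in range(100):                          # repeat until (near) convergence
    V = {s: max((sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                 for a in T[s]), default=0.0)
         for s in T}
print(V)
```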
Let's imagine we have the optimal values V*(s). How should we act?
It’s not obvious!
We need to do a mini-expectimax (one step): π*(s) = argmax_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]. This is called policy extraction, since it gets the policy implied by the values.
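A minimal sketch of policy extraction, assuming the toy-MDP dictionary format used in the value-iteration sketch above:

```python
# One-step look-ahead over known values V: pi(s) = argmax_a sum over outcomes.
def extract_policy(T, V, gamma=0.9):
    policy = {}
    for s in T:
        if not T[s]:                          # terminal state: no action
            continue
        policy[s] = max(T[s], key=lambda a: sum(
            p * (r + gamma * V[s2]) for p, s2, r in T[s][a]))
    return policy
```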
Let’s imagine we have the optimal q-values: How should we act?
Completely trivial to decide!
Important lesson: actions are easier to select from q-values than values!
How do we know the Vk vectors are going to converge?
Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values
Case 2: If the discount is less than 1
Sketch: for any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees.
The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
That last layer is at best all RMAX
It is at worst RMIN
But everything is discounted by γ^k that far out
So Vk and Vk+1 are at most γ^k max|R| different
So as k increases, the values converge.
But we must still visit each state infinitely often.
Value iteration is expensive in time and memory.
Value iteration repeats the Bellman updates:
Problem 1: it's slow, O(S²A) per iteration.
Problem 2: the "max" at each state rarely changes.
Problem 3: the policy often converges long before the values.
Repeat until the policy converges:
1) Policy evaluation: compute the value function for the current policy π (i.e., V^π), and set V ← V^π.
2) Policy improvement: for each s ∈ S, update
π(s) ← argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values.
This is policy iteration
It's still optimal! Can converge (much) faster under some conditions.
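A minimal policy-iteration sketch, again assuming the toy-MDP dictionary format from the earlier sketches (approximate evaluation via a fixed number of sweeps is an assumption for brevity):

```python
def policy_iteration(T, gamma=0.9, eval_sweeps=50):
    states = [s for s in T if T[s]]
    pi = {s: next(iter(T[s])) for s in states}      # arbitrary initial policy
    while True:
        # Step 1: policy evaluation (fixed pi, no max over actions)
        V = {s: 0.0 for s in T}
        for _ in range(eval_sweeps):
            V = {s: (sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][pi[s]])
                     if T[s] else 0.0)
                 for s in T}
        # Step 2: policy improvement (one-step look-ahead)
        new_pi = {s: max(T[s], key=lambda a: sum(
            p * (r + gamma * V[s2]) for p, s2, r in T[s][a])) for s in states}
        if new_pi == pi:                             # policy converged
            return pi, V
        pi = new_pi
```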
Another basic operation: compute the utility of a state s under a fixed policy π.
Recursive relation (one-step look-ahead / Bellman equation):
V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
V(s) = expected total discounted rewards starting in s and following π.
How do we calculate the V's for a fixed policy π?
Idea 1: turn the recursive Bellman equations into updates (like value iteration, but with the policy's action instead of a max). Efficiency: O(S²) per iteration.
Idea 2: without the maxes, the Bellman equations are just a linear system; solve it directly.
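A sketch of Idea 2 with numpy: for a fixed policy the Bellman equations are linear, V = R_π + γ T_π V, so solve (I − γ T_π) V = R_π directly. The two-state numbers here are made up for illustration:

```python
import numpy as np

gamma = 0.9
T_pi = np.array([[0.5, 0.5],      # row i: transition probs from state i
                 [0.0, 1.0]])     # under the fixed policy pi
R_pi = np.array([1.0, 0.0])       # expected immediate reward under pi
V = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
print(V)
```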
Evaluation: for the fixed current policy π, find values with policy evaluation:
Iterate until values converge:
Improvement: For fixed values, get a better policy using policy extraction
One-step look-ahead:
Both value iteration and policy iteration compute the same thing (all optimal values).
In value iteration:
Every iteration updates both the values and (implicitly) the policy.
We don't track the policy, but taking the max over actions implicitly recomputes it.
In policy iteration:
We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them).
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass).
The new policy will be better (or we’re done)
Both are dynamic programs for solving MDPs
So you want to….
Compute optimal values: use value iteration or policy iteration.
Compute values for a particular policy: use policy evaluation.
Turn your values into a policy: use policy extraction (one-step look-ahead).
Still assume a Markov decision process (MDP):
A set of states s ∈ S
A set of actions (per state) a ∈ A
A model T(s,a,s')
A reward function R(s,a,s')
Still looking for a policy π(s). New twist: we don't know T or R.
I.e., we don't know which states are good or what the actions do. We must actually try out actions and states to learn.
Basic idea:
Receive feedback in the form of rewards.
The agent's utility is defined by the reward function.
Must (learn to) act so as to maximize expected rewards.
All learning is based on observed samples of outcomes!
[Agent-environment loop: the agent sends actions a; the environment returns the next state s and reward r]
Autonomous helicopter: a self-reliant agent must learn from its own experiences, eliminating hand-coding of control strategies.
That wasn’t planning, it was learning!
Specifically, reinforcement learning.
There was an MDP, but you couldn't solve it with just computation.
You needed to actually act to figure it out.
Important ideas in reinforcement learning that came up
Exploration: you have to try unknown actions to get information.
Exploitation: eventually, you have to use what you know.
Regret: even if you learn intelligently, you make mistakes.
Sampling: because of chance, you have to try things repeatedly.
Difficulty: learning can be much harder than solving a known MDP.
Model-Based Idea:
Learn an approximate model based on experiences.
Solve for values as if the learned model were correct.
Step 1: Learn empirical MDP model
Count outcomes s' for each (s, a).
Normalize to give an estimate of T(s, a, s').
Discover each R(s, a, s') when we experience (s, a, s').
Step 2: Solve the learned MDP
For example, use value iteration, as before.
T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25
…
R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10
…
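A minimal sketch of Step 1 (the transitions below are made up for illustration; they are not the episodes from the slides):

```python
from collections import Counter, defaultdict

# Observed transitions (s, a, s', r); illustrative data only.
experience = [('B', 'east', 'C', -1), ('C', 'east', 'D', -1),
              ('B', 'east', 'C', -1), ('C', 'east', 'A', -1)]

counts = defaultdict(Counter)
R_hat = {}
for s, a, s2, r in experience:
    counts[(s, a)][s2] += 1            # count outcomes s' for each (s, a)
    R_hat[(s, a, s2)] = r              # discover R(s,a,s') when experienced

# Normalize counts to estimate T(s, a, s').
T_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}
print(T_hat[('C', 'east')])            # here: {'D': 0.5, 'A': 0.5}
```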
We still assume an MDP:
A set of states s ∈ S
A set of actions (per state) a ∈ A
A model T(s,a,s')
A reward function R(s,a,s')
Still looking for a policy π(s).
New twist: we don't know T or R, so we must try out actions.
Big idea: compute all averages over T using sample outcomes.
Goal: compute values for each state under π.
Idea: average together observed sample values.
Act according to π. Every time you visit a state, write down what the sum of discounted rewards turned out to be.
Average those samples
This is called direct evaluation
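A minimal direct-evaluation sketch (the episode data and γ = 1 are assumptions for illustration):

```python
from collections import defaultdict

gamma = 1.0
episodes = [[('B', -1), ('C', -1), ('D', 10)],     # (state, reward) pairs
            [('E', -1), ('C', -1), ('D', 10)]]

totals, visits = defaultdict(float), defaultdict(int)
for ep in episodes:
    G = 0.0
    for s, r in reversed(ep):          # return-to-go from each visited state
        G = r + gamma * G
        totals[s] += G
        visits[s] += 1

V = {s: totals[s] / visits[s] for s in totals}
print(V)                               # average observed return per state
```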
What's good: it's easy to understand; it doesn't require any knowledge of T, R; it eventually computes the correct average values, using just sample transitions.
What's bad: it wastes information about state connections; each state must be learned separately; so, it takes a long time to learn.
If B and E both go to C under this policy, how can their values be different?
Simplified Bellman updates calculate V for a fixed policy:
Key question: how can we do this update to V without knowing T and R?
We want to improve our estimate of V by computing these averages. Idea: take samples of outcomes s' (by doing the action!) and average them.
Almost! But we can’t rewind time to get sample after sample from state s.
Big idea: learn from every experience!
Update V(s) each time we experience a transition (s, a, s’, r)
Likely outcomes s’ will contribute updates more often
Temporal difference learning of values
Policy still fixed, still doing evaluation!
Move values toward value of whatever successor occurs: running average
Exponential moving average
The running interpolation update: x̄_n = (1 − α) · x̄_{n−1} + α · x_n
Makes recent samples more important. Forgets about the past (distant past values were wrong anyway).
Decreasing learning rate (alpha) can give converging averages
V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
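A one-function sketch of this update (the dictionary-based value table and names are assumptions):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # One TD(0) step: move V(s) toward the sample target r + gamma * V(s').
    target = r + gamma * V.get(s_next, 0.0)
    old = V.get(s, 0.0)
    V[s] = old + alpha * (target - old)
    return V

V = td_update({}, 'B', -1, 'C')        # e.g., after observing (B, -1, C)
```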
Value iteration: find successive (depth-limited) values
Start with V0(s) = 0, which we know is right
Given Vk, calculate the depth k+1 values for all states:
Vk+1(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ Vk(s') ]
But Q-values are more useful, so compute them instead
Start with Q0(s,a) = 0, which we know is right
Given Qk, calculate the depth k+1 q-values for all q-states:
Qk+1(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Qk(s',a') ]
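A minimal Q-value-iteration sketch, assuming the same toy-MDP dictionary format as the earlier value-iteration sketch:

```python
def q_value_iteration(T, gamma=0.9, iters=100):
    Q = {s: {a: 0.0 for a in T[s]} for s in T}       # Q_0(s, a) = 0
    for _ in range(iters):
        Q = {s: {a: sum(p * (r + gamma * max(Q[s2].values(), default=0.0))
                        for p, s2, r in T[s][a])
                 for a in T[s]}
             for s in T}
    return Q
```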
We’d like to do Q-value updates to each Q-state:
But can’t compute this update without knowing T, R
Instead, compute average as we go
Receive a sample transition (s,a,r,s’)
This sample suggests: Q(s,a) ≈ r + γ max_{a'} Q(s',a')
But we want to average over results from (s,a) (Why?)
So keep a running average: Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
Learn Q(s,a) values as you go
Receive a sample (s, a, s', r).
Consider your old estimate: Q(s,a).
Consider your new sample estimate: sample = r + γ max_{a'} Q(s',a').
Incorporate the new estimate into a running average: Q(s,a) ← (1 − α) Q(s,a) + α · sample.
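A one-function sketch of this running-average update (the dict-of-dicts Q-table and names are assumptions):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    next_vals = Q.get(s_next, {}).values()
    sample = r + gamma * (max(next_vals) if next_vals else 0.0)
    old = Q.setdefault(s, {}).setdefault(a, 0.0)
    Q[s][a] = (1 - alpha) * old + alpha * sample     # running average
    return Q
```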
Full reinforcement learning: optimal policies (like value iteration)
You don't know the transitions T(s,a,s').
You don't know the rewards R(s,a,s').
You choose the actions now.
Goal: learn the optimal policy / values.
In this case:
Learner makes choices!
Fundamental tradeoff: exploration vs. exploitation.
This is NOT offline planning! You actually take actions in the world and find out what happens.
Every time step, flip a coin.
With (small) probability ε, act randomly.
With (large) probability 1 − ε, act on the current policy.
You do eventually explore the space, but you keep thrashing around once learning is done.
One solution: lower ε over time. Another solution: exploration functions.
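A minimal ε-greedy selection sketch (q_values maps each available action to its current estimate; the names are assumptions):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:                # small prob.: explore
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)       # otherwise: exploit
```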
Initialize Q(s, a) arbitrarily.
Repeat (for each episode):
  Initialize s.
  Repeat (for each step of the episode):
    Choose a from s using a policy derived from Q (e.g., ε-greedy).
    Take action a; observe r and s'.
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
    s ← s'
  until s is terminal.
The learning rate is decreased fast enough but not too fast
Even if you learn the optimal policy, you still make mistakes along the way!
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards.
Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal.
Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.
Amazing result: Q-learning converges to the optimal policy, even if you're acting suboptimally!
This is called off-policy learning. Caveats: you have to explore enough, and you have to eventually make the learning rate small enough (but not decrease it too quickly).
Unknown MDP, model-based:
  Goal: Compute V*, Q*, π*         Technique: VI/PI on approx. MDP
  Goal: Evaluate a fixed policy π  Technique: PE on approx. MDP
Unknown MDP, model-free:
  Goal: Compute V*, Q*, π*         Technique: Q-learning
  Goal: Evaluate a fixed policy π  Technique: Value learning
Basic Q-learning keeps a table of all q-values. In realistic situations, we cannot possibly learn about every single state: we may not even visit some states, and there is a computation and memory problem (too many states to hold the table).
Instead, we want to generalize:
Learn about some small number of training states from experience
Generalize that experience to new, similar situations
This is a fundamental idea in machine learning, and we’ll see it over and over again
Solution: describe a state using a vector of features (properties).
Features are functions from states to real numbers (often 0/1) that capture important properties of the state
Example features:
Distance to closest ghost
Distance to closest dot
Number of ghosts
1 / (dist to dot)²
Is Pacman in a tunnel? (0/1)
… etc.
Is it the exact state on this slide?
Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
Advantage: our experience is summed up in a few powerful numbers.
Disadvantage: states may share features but actually be very different in value!
In addition to requiring less space, this representation lets us generalize to new, similar states.
Q-learning with linear Q-functions:
transition = (s, a, r, s')
difference = [ r + γ max_{a'} Q(s',a') ] − Q(s,a)
wi ← wi + α · difference · fi(s,a)
Intuitive interpretation:
Adjust weights of active features
E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
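A minimal sketch of the weight update above; the feature function and its values are assumptions for illustration:

```python
def q_value(w, f_sa):
    # Linear Q: Q(s,a) = sum_i w_i * f_i(s,a).
    return sum(wi * fi for wi, fi in zip(w, f_sa))

def approx_q_update(w, f_sa, r, max_q_next, alpha=0.05, gamma=0.9):
    # One approximate Q-learning step: w_i += alpha * difference * f_i(s,a).
    difference = (r + gamma * max_q_next) - q_value(w, f_sa)
    return [wi + alpha * difference * fi for wi, fi in zip(w, f_sa)]
```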
[Deep Q-network: the network maps state s_t to outputs Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)]
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]