CS 4803 / 7643: Deep Learning
Zsolt Kira Georgia Tech
Topics:
– Policy Gradients – Actor Critic
Administrative: PS3/HW3 due Tuesday 03/31. PS4/HW4 is optional and due 04/03. There are lots of bonus/extra-credit questions.
Following a policy $\pi$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
How good is a state? The value function at state s is the expected cumulative (discounted) reward from state s, following the policy thereafter: $V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s, following the policy thereafter: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
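To make "expected cumulative reward" concrete, here is a minimal Python sketch (with a made-up reward sequence) of the discounted return that these expectations average over:

```python
# Discounted return G = sum_t gamma^t * r_t for one sampled trajectory.
# The reward sequence below is made up for illustration.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 10.0]  # r_0, r_1, ..., r_4

G = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(G)  # 0.9^2 * 1 + 0.9^4 * 10 = 0.81 + 6.561 = 7.371
```

$V^\pi(s)$ and $Q^\pi(s,a)$ are the expectations of this quantity over trajectories obtained by following $\pi$.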
Given an optimal policy $\pi^*$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
How good is a state? The optimal value function at state s is the expected cumulative reward from state s when acting optimally thereafter: $V^*(s) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter: $Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter. Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$
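As a quick sketch, extracting this optimal policy from an (already computed, made-up) Q-table is just an argmax over actions:

```python
import numpy as np

# Hypothetical Q*-table: Q[s, a], 3 states x 2 actions (values made up).
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 4.2]])

# Optimal policy: pi*(s) = argmax_a Q*(s, a)
pi_star = Q.argmax(axis=1)
print(pi_star)  # [1 0 1]
```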
Slide credit: Pieter Abbeel
[NOTE: Here we are showing calculations for the action we know is the argmax (go right), but in general we have to compute this for each action and return the max.]
Noise = 0.2, Discount = 0.9, Living reward = 0
Slide Credit: http://ai.berkeley.edu
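For example, a one-step backup for the "go right" action under these dynamics might look like the following sketch (the successor values are hypothetical; with noise 0.2 the intended move succeeds with probability 0.8 and slips to each side with probability 0.1):

```python
# One-step Bellman backup for a single (state, action) pair, as in the
# gridworld above: Noise = 0.2, Discount = 0.9, Living reward = 0.
# The successor-state values below are hypothetical.
gamma = 0.9
living_reward = 0.0

# (probability, value of resulting state) for action "right"
outcomes = [(0.8, 1.0),   # intended move: reach the state worth 1.0
            (0.1, 0.0),   # slip to one side
            (0.1, 0.0)]   # slip to the other side

q_right = sum(p * (living_reward + gamma * v) for p, v in outcomes)
print(q_right)  # 0.8 * 0.9 * 1.0 ≈ 0.72
```

In general we would compute such a Q-value for every action and return the max.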
[Figure: Gridworld demo, V values]
Slide Credit: http://ai.berkeley.edu
It's easier to select actions from Q-values than from values!
Slide Credit: http://ai.berkeley.edu
Sample-based policy evaluation: take samples of outcomes $s'$ (by doing the action!) and average:
$V^\pi(s) \approx \frac{1}{n} \sum_i \text{sample}_i$, where $\text{sample}_i = r + \gamma V^\pi(s'_i)$
[Diagram: from state $s$, take action $\pi(s)$, observe sampled successors $s'_1, s'_2, s'_3$]
What's the difficulty of this algorithm? The idea almost works, but we can't rewind time to get sample after sample from state s.
Temporal difference learning: [Diagram: state $s$, action $\pi(s)$, observed successor $s'$]
Sample of V(s): $\text{sample} = r + \gamma V^\pi(s')$
Update to V(s): $V^\pi(s) \leftarrow (1 - \alpha)\, V^\pi(s) + \alpha \cdot \text{sample}$
Same update: $V^\pi(s) \leftarrow V^\pi(s) + \alpha\, (\text{sample} - V^\pi(s))$
Image Credits: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
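A minimal tabular sketch of this TD update (the states, reward, and learning rate below are made up for illustration):

```python
# Tabular TD(0): nudge V(s) toward the sampled target r + gamma * V(s').
from collections import defaultdict

V = defaultdict(float)  # value estimates, default 0
gamma, alpha = 0.9, 0.1

def td_update(s, r, s_next):
    sample = r + gamma * V[s_next]          # sample of V(s)
    V[s] = V[s] + alpha * (sample - V[s])   # same update, error form

td_update(s="A", r=1.0, s_next="B")
print(V["A"])  # 0 + 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```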
Q-learning: move the predicted Q-value toward the target Q-value.
Target Q-value: $y = r + \gamma \max_{a'} Q(s', a')$ (the max picks the action with the best Q-value)
Predicted Q-value: $Q(s, a)$
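In tabular form, the same update can be sketched as follows (the environment loop and exploration strategy are omitted; all names are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)], default 0
gamma, alpha = 0.9, 0.1
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    # Target Q-value: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    # Move the predicted Q-value toward the target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def greedy_action(s):
    # Pick the action with the best Q-value.
    return max(actions, key=lambda a: Q[(s, a)])
```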
Experience replay: continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$ as game episodes are played, older samples discarded. Train the Q-network on random minibatches of transitions drawn from the replay memory, instead of consecutive samples.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
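A minimal sketch of such a replay memory (the capacity and batch size here are arbitrary choices):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        # deque drops the oldest transitions once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random minibatch, instead of consecutive (correlated) samples
        return random.sample(self.buffer, batch_size)
```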
Transition function and reward function:
– known: use value / policy iteration to obtain the "optimal" policy
– unknown: estimate Q-values from data (previous class: Q-learning)
– unknown: estimate the transition and reward functions from data, then use value / policy iteration (homework!)
– unknown: directly optimize the policy from data (this class!)
Mathematically, we can write: $J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}[r(\tau)]$, where $r(\tau)$ is the reward of a trajectory $\tau$.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Expected reward: $J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}[r(\tau)] = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Now let's differentiate this: $\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau;\theta)\, d\tau$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Intractable! The gradient of an expectation is problematic when p depends on θ.
However, we can use a nice trick: $\nabla_\theta p(\tau;\theta) = p(\tau;\theta)\, \frac{\nabla_\theta p(\tau;\theta)}{p(\tau;\theta)} = p(\tau;\theta)\, \nabla_\theta \log p(\tau;\theta)$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
If we inject this back: $\nabla_\theta J(\theta) = \int_\tau \left( r(\tau)\, \nabla_\theta \log p(\tau;\theta) \right) p(\tau;\theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau;\theta)}\left[ r(\tau)\, \nabla_\theta \log p(\tau;\theta) \right]$, which we can estimate with Monte Carlo sampling!
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Can we compute those quantities without knowing the transition probabilities? We have: $p(\tau;\theta) = \prod_{t \ge 0} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Thus: $\log p(\tau;\theta) = \sum_{t \ge 0} \left[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \right]$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
And when differentiating: $\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Doesn’t depend on transition probabilities!
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Therefore, when sampling a trajectory $\tau$, we can estimate the gradient with: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
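In an autodiff framework, this estimator is typically implemented as a surrogate loss whose gradient matches the expression above. A minimal PyTorch sketch for a single sampled trajectory (the inputs are placeholders):

```python
import torch

def pg_loss(log_probs, traj_reward):
    """Surrogate policy-gradient loss for one trajectory.

    log_probs:   tensor of log pi_theta(a_t | s_t), one entry per step (with grad)
    traj_reward: scalar r(tau), total reward of the trajectory
    Minimizing this loss ascends  sum_t r(tau) * grad log pi_theta(a_t | s_t).
    """
    return -(traj_reward * log_probs.sum())
```

Averaging this loss over a batch of sampled trajectories gives the Monte Carlo estimate of the gradient above.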
– Run the policy and sample trajectories
– Compute the policy gradient
– Update the policy
Slide credit: Sergey Levine
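Putting this loop together, a self-contained toy sketch might look like the following (the environment and policy below are made up purely to make the code runnable; this is not the homework setup):

```python
import torch
import torch.nn as nn

class ToyEnv:
    """Hypothetical 1-state environment: action 1 pays +1, action 0 pays 0."""
    def reset(self):
        self.t = 0
        return torch.zeros(4)                           # dummy 4-dim state
    def step(self, a):
        self.t += 1
        return torch.zeros(4), float(a), self.t >= 10   # (state, reward, done)

env = ToyEnv()
policy = nn.Linear(4, 2)                                # state -> action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:                                     # 1. run the policy, sample a trajectory
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)

    # 2. compute the policy gradient via the surrogate loss sketched above
    loss = -(sum(rewards) * torch.stack(log_probs).sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                    # 3. update the policy
```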
Image Credit: http://karpathy.github.io/2016/05/31/rl/
(C) Dhruv Batra
Formalizes the notion of "trial and error": make actions that led to high reward more probable, and actions that led to low reward less probable.
Homework!
Gradient estimator: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
First idea: Push up the probabilities of an action seen, only by the cumulative future reward from that state: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left( \sum_{t' \ge t} r_{t'} \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Second idea: Use a discount factor $\gamma$ to ignore delayed effects: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left( \sum_{t' \ge t} \gamma^{t'-t}\, r_{t'} \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
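Both ideas amount to weighting each step's log-probability by the discounted reward-to-go from that step; a small sketch of that computation:

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```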
A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of?
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
A: Q-function and value function!
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Intuitively, we are happy with an action $a_t$ in a state $s_t$ if $Q^\pi(s_t, a_t) - V^\pi(s_t)$ is large. On the contrary, we are unhappy with an action if it's small.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Using this, we get the estimator: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left( Q^\pi(s_t, a_t) - V^\pi(s_t) \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
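A common actor-critic realization of this estimator (a sketch; the critic values come from a hypothetical learned value network) uses sampled returns in place of $Q^\pi$ and the critic in place of $V^\pi$:

```python
import torch
import torch.nn.functional as F

def actor_critic_loss(log_probs, returns, values):
    """Sketch of an advantage-weighted policy-gradient loss.

    log_probs: log pi_theta(a_t | s_t) for each step (tracks gradients)
    returns:   discounted rewards-to-go, a sample-based stand-in for Q^pi(s_t, a_t)
    values:    critic predictions V(s_t) from a hypothetical learned value network
    """
    advantages = returns - values.detach()       # Q - V; detach so the actor term
                                                 # doesn't backprop into the critic
    actor_loss = -(advantages * log_probs).sum()
    critic_loss = F.mse_loss(values, returns)    # regress the critic toward returns
    return actor_loss + critic_loss
```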
This difference is the advantage, $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$: "how much better is an action than expected?"