CS 4803 / 7643: Deep Learning. Topics: Policy Gradients, Actor-Critic


slide-1
SLIDE 1

CS 4803 / 7643: Deep Learning

Zsolt Kira Georgia Tech

Topics:

– Policy Gradients
– Actor-Critic

slide-2
SLIDE 2

2

Administrative

  • PS3/HW3 due Tuesday 03/31
  • PS4/HW4 is optional and due 04/03
  • There are lots of bonus/Extra credit questions there!
  • Sessions with Facebook for project (fill out spreadsheet)
slide-3
SLIDE 3

3

Administrative

  • How to ask questions during live lecture:
  • Use Q&A window (other students can upvote)
  • Raise hands
slide-4
SLIDE 4

4

Topics we’ll cover

  • Overview of RL
  • RL vs other forms of learning
  • RL “API”
  • Applications
  • Framework: Markov Decision Processes (MDPs)
  • Definitions and notations
  • Policies and Value Functions
  • Solving MDPs
  • Value Iteration (recap)
  • Q-Value Iteration (new)
  • Policy Iteration
  • Reinforcement learning
  • Value-based RL (Q-learning, Deep-Q Learning)
  • Policy-based RL (Policy gradients)
  • Actor-Critic
slide-5
SLIDE 5

5

  • Markov Decision Processes (MDP):
  • States:
  • Actions:
  • Rewards:
  • Transition Function:
  • Discount Factor:

Recap: MDPs
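In standard notation (the slide's own symbols did not survive extraction, so this is the usual convention rather than a verbatim copy), an MDP is the tuple

(\mathcal{S}, \mathcal{A}, R, T, \gamma): \quad s \in \mathcal{S}, \quad a \in \mathcal{A}, \quad r = R(s, a, s'), \quad T(s, a, s') = p(s' \mid s, a), \quad \gamma \in [0, 1].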

slide-6
SLIDE 6

6

Value Function

Following a policy π that produces sample trajectories s_0, a_0, r_0, s_1, a_1, …

How good is a state? The value function at state s is the expected cumulative reward from state s (following the policy thereafter). How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter).
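In the usual notation (a reconstruction consistent with the verbal definitions above, not a verbatim copy of the slide):

V^{\pi}(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s,\ \pi\Big], \qquad Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s,\ a_{0} = a,\ \pi\Big].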

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-7
SLIDE 7

7

Optimal Quantities

Given the optimal policy π* that produces sample trajectories s_0, a_0, r_0, s_1, a_1, …

How good is a state? The optimal value function at state s is the expected cumulative reward from state s, acting optimally thereafter. How good is a state-action pair? The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s, acting optimally thereafter.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-8
SLIDE 8

8

Recap: Optimal Value Function

The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter

slide-9
SLIDE 9

9

Recap: Optimal Value Function

The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter. Optimal policy:
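In standard notation (reconstructed from the verbal definitions above; the slide's exact symbols may differ):

V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).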

slide-10
SLIDE 10

Bellman Optimality Equations

  • Relations:
  • Recursive optimality equations:
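A standard way to write the recursive (Bellman) optimality equations, assuming the MDP notation used earlier:

V^{*}(s) = \max_{a} Q^{*}(s, a), \qquad Q^{*}(s, a) = \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^{*}(s')\big].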

10

slide-11
SLIDE 11

Value Iteration (VI)

11 Slide credit: Pieter Abbeel

[NOTE: Here we are showing calculations for the action we know is the argmax (go right), but in general we have to calculate this for each action and return the max.]
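The value iteration update being illustrated, in the usual notation (not a verbatim copy of the slide):

V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V_{k}(s')\big].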

slide-12
SLIDE 12
slide-13
SLIDE 13

Snapshot of Demo – Gridworld V Values

Noise = 0.2 Discount = 0.9 Living reward = 0

Slide Credit: http://ai.berkeley.edu

slide-14
SLIDE 14

Computing Actions from Values

  • Let’s imagine we have the optimal values V*(s)
  • How should we act?
  • It’s not obvious!
  • We need to do a one step calculation
  • This is called policy extraction, since it gets the policy implied by the values
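The one-step calculation, in standard form (assuming the Bellman notation above):

\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^{*}(s')\big].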

Slide Credit: http://ai.berkeley.edu

slide-15
SLIDE 15

Snapshot of Demo – Gridworld Q Values

Noise = 0.2 Discount = 0.9 Living reward = 0

Slide Credit: http://ai.berkeley.edu

slide-16
SLIDE 16

Computing Actions from Q-Values

  • Let’s imagine we have the optimal q-values:
  • How should we act?
  • Completely trivial to decide!
  • Important lesson: actions are easier to select from q-values than from values!

Slide Credit: http://ai.berkeley.edu

slide-17
SLIDE 17

Recap: Learning Based Methods

  • Typically, we don’t know the environment
  • Transition function unknown: how do actions affect the environment?
  • Reward function unknown: what/when are the good actions?

17

slide-18
SLIDE 18

Recap: Learning Based Methods

  • Typically, we don’t know the environment
  • Transition function unknown: how do actions affect the environment?
  • Reward function unknown: what/when are the good actions?
  • But, we can learn by trial and error.
  • Gather experience (data) by performing actions.
  • Approximate unknown quantities from data.

18

slide-19
SLIDE 19

Sample-Based Policy Evaluation?

  • We want to improve our estimate of V by computing these averages:
  • Idea: Take samples of outcomes s’ (by doing the action!) and average

[Diagram: from state s, take action π(s), observe sampled outcomes s'_1, s'_2, s'_3.]

What’s the difficulty of this algorithm? Almost! But we can’t rewind time to get sample after sample from state s.

slide-20
SLIDE 20

Temporal Difference Learning

  • Big idea: learn from every experience!
  • Update V(s) each time we experience a transition (s, a, s’, r)
  • Likely outcomes s’ will contribute updates more often
  • Temporal difference learning of values
  • Policy still fixed, still doing evaluation!
  • Move values toward the value of whatever successor occurs: running average

[Diagram: transition s, π(s), s'. Sample of V(s); update to V(s); same update in incremental form, see the equations below.]
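The sample and update being described, written out in the usual TD(0) form (a reconstruction, since the slide's equations did not survive extraction):

\text{sample} = r + \gamma\, V^{\pi}(s'), \qquad V^{\pi}(s) \leftarrow (1 - \alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample} \;\;\equiv\;\; V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,\big(\text{sample} - V^{\pi}(s)\big).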

slide-21
SLIDE 21

Deep Q-Learning

  • Q-Learning with linear function approximators
  • Has some theoretical guarantees
  • Deep Q-Learning: Fit a deep Q-Network
  • Works well in practice
  • Q-Network can take RGB images

21

Image Credits: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-22
SLIDE 22
  • Collect a dataset
  • Loss for a single data point:
  • Act optimally according to the learnt Q function:

Recap: Deep Q-Learning

22

(Equation annotations: target Q-value; predicted Q-value; pick the action with the best Q-value.)
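A standard way to write the per-sample loss and the greedy action (the target-network parameters \theta^{-} are a common refinement and may not appear on the slide):

y_{i} = r + \gamma \max_{a'} Q(s', a';\, \theta^{-}) \quad (\text{target Q-value}), \qquad L_{i}(\theta) = \big(y_{i} - Q(s, a;\, \theta)\big)^{2} \quad (Q(s, a; \theta) \text{ is the predicted Q-value}), \qquad \pi(s) = \arg\max_{a} Q(s, a;\, \theta).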

slide-23
SLIDE 23

Exploration Problem

  • What should the behavior policy used to collect experience be?
  • Greedy? -> Local minima, no exploration
  • An exploration strategy (see the sketch below):
23
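A common exploration strategy here is ε-greedy; the sketch below is a minimal illustration under that assumption (the helper q_values is hypothetical, not from the slides):

```python
import random

def epsilon_greedy_action(q_values, state, num_actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy w.r.t. Q)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                # explore
    q = q_values(state)                                     # sequence of Q(s, a), one entry per action
    return max(range(num_actions), key=lambda a: q[a])      # exploit: argmax_a Q(s, a)
```

In practice ε is usually annealed from a large value toward a small one as training progresses.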
slide-24
SLIDE 24

Experience Replay

  • Address this problem using experience replay
  • A replay buffer stores transitions
  • Continually update the replay buffer as game (experience) episodes are played; older samples are discarded
  • Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples (a minimal sketch follows)
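A minimal sketch of such a buffer, assuming nothing beyond the description above (capacity and field names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)         # oldest transitions are discarded automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))  # store one transition per step played

    def sample(self, batch_size):
        # Random minibatch: decorrelates consecutive samples before training the Q-network
        return random.sample(self.buffer, batch_size)
```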

24 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-25
SLIDE 25

25

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.

slide-26
SLIDE 26

26

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.
Transition function and reward function unknown: estimate Q-values from data, then obtain the “optimal” policy (previous class: Q-learning).

slide-27
SLIDE 27

27

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.
Transition function and reward function unknown: either estimate the transition and reward functions from data and then run value / policy iteration (Homework!), or estimate Q-values from data.

slide-28
SLIDE 28

28

Getting to the optimal policy

Transition function and reward function known: use value / policy iteration and obtain the “optimal” policy.
Transition function and reward function unknown: estimate the transition and reward functions from data, estimate Q-values from data, or (this class!) learn the policy directly.

slide-29
SLIDE 29

29

  • Class of policies defined by parameters θ
  • E.g., θ can be the parameters of a linear transformation, a deep network, etc.

Learning the optimal policy

slide-30
SLIDE 30

30

  • Class of policies defined by parameters θ
  • E.g., θ can be the parameters of a linear transformation, a deep network, etc.
  • Want to maximize:

Learning the optimal policy

slide-31
SLIDE 31

31

  • Class of policies defined by parameters θ
  • E.g., θ can be the parameters of a linear transformation, a deep network, etc.
  • Want to maximize (see the expression below):
  • In other words,

Learning the optimal policy
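A standard way to write the objective (notation reconstructed, not copied from the slide): maximize the expected total discounted reward over trajectories induced by the policy,

J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\, \theta)}\Big[\sum_{t \ge 0} \gamma^{t} r_{t}\Big], \qquad \theta^{*} = \arg\max_{\theta} J(\theta).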

slide-32
SLIDE 32

32

Learning the optimal policy

Sample a few trajectories by acting according to the current policy π_θ

slide-33
SLIDE 33

REINFORCE algorithm

Mathematically, we can write the objective as an expectation over trajectories, where r(τ) is the reward of a trajectory τ

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-34
SLIDE 34

REINFORCE algorithm

Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-35
SLIDE 35

REINFORCE algorithm

Now let’s differentiate this: Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-36
SLIDE 36

REINFORCE algorithm

Intractable! The gradient of an expectation is problematic when p depends on θ

Now let’s differentiate this: Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-37
SLIDE 37

REINFORCE algorithm

Intractable! The gradient of an expectation is problematic when p depends on θ

Now let’s differentiate this: However, we can use a nice trick: Expected reward:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-38
SLIDE 38

REINFORCE algorithm

Intractable! The gradient of an expectation is problematic when p depends on θ. We can estimate it with Monte Carlo sampling.

Now let’s differentiate this: However, we can use a nice trick: If we inject this back: Expected reward:
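Written out, the trick is the log-derivative (likelihood-ratio) identity ∇p = p ∇ log p, which turns the gradient of an expectation back into an expectation (this is the standard derivation; symbols may differ slightly from the slide):

\nabla_{\theta} J(\theta) = \nabla_{\theta} \int p(\tau; \theta)\, r(\tau)\, d\tau = \int p(\tau; \theta)\, \nabla_{\theta} \log p(\tau; \theta)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p(\tau; \theta)}\big[r(\tau)\, \nabla_{\theta} \log p(\tau; \theta)\big],

which can then be estimated with Monte Carlo samples of trajectories.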

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-39
SLIDE 39

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-40
SLIDE 40

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have: Thus:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-41
SLIDE 41

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have: Thus: And when differentiating:

Doesn’t depend on transition probabilities!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-42
SLIDE 42

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have: Thus: And when differentiating: Therefore, when sampling a trajectory τ, we can estimate the gradient of J(θ) with

Doesn’t depend on transition probabilities!
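Concretely (standard derivation, reconstructed): the trajectory probability factorizes into transition terms and policy terms, and only the policy terms depend on θ, so

p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_{t}, a_{t})\; \pi_{\theta}(a_{t} \mid s_{t}), \qquad \nabla_{\theta} \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}),

giving the sampled estimate \nabla_{\theta} J(\theta) \approx r(\tau) \sum_{t \ge 0} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}).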

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-43
SLIDE 43

43

Policy Gradients

Doesn’t depend on Transition probabilities!

slide-44
SLIDE 44

44

Policy Gradients

slide-45
SLIDE 45

45

Policy Gradients

slide-46
SLIDE 46

46

  • 1. Sample trajectories by acting according to π_θ
  • 2. Compute the policy gradient (as derived above)
  • 3. Update the policy parameters θ

REINFORCE

[Loop diagram: run the policy and sample trajectories → compute the policy gradient → update the policy.] A minimal sketch of one iteration follows. Slide credit: Sergey Levine
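A minimal PyTorch-style sketch of one iteration of this loop, assuming a policy network policy(s) that returns a torch categorical distribution and a hypothetical env whose reset()/step() return a state and (next_state, reward, done); names and details are illustrative, not from the slides:

```python
import torch

def reinforce_iteration(policy, optimizer, env, gamma=0.99):
    """One REINFORCE iteration: sample a trajectory, compute the policy gradient, update the policy."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:                                   # 1. run the policy and sample a trajectory
        dist = policy(torch.as_tensor(s, dtype=torch.float32))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)

    # 2. policy gradient: grad J ~ r(tau) * sum_t grad log pi_theta(a_t | s_t)
    r_tau = sum((gamma ** t) * r for t, r in enumerate(rewards))
    loss = -r_tau * torch.stack(log_probs).sum()      # negate: optimizers minimize

    optimizer.zero_grad()                             # 3. update the policy parameters theta
    loss.backward()
    optimizer.step()
```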

slide-47
SLIDE 47

Pong from pixels

47

Image Credit: http://karpathy.github.io/2016/05/31/rl/

slide-48
SLIDE 48

Pong from pixels

48

Image Credit: http://karpathy.github.io/2016/05/31/rl/

slide-49
SLIDE 49

Intuition

(C) Dhruv Batra 49

slide-50
SLIDE 50

50

Policy Gradients

Formalizes notion of “trial and error”:

  • If reward is high, probability of actions seen is increased
  • If reward is low, probability of actions seen is reduced
  • But in expectation, it averages out
slide-51
SLIDE 51

51

Issues with Policy Gradients

  • Credit assignment is hard!
  • Which specific action led to the increase in reward?
  • Suffers from high variance, leading to unstable training
  • How to reduce the variance?
  • Subtract a constant from the reward!
slide-52
SLIDE 52

52

Issues with Policy Gradients

  • Credit assignment is hard!
  • Which specific action led to the increase in reward?
  • Suffers from high variance, leading to unstable training
  • How to reduce the variance?
  • Subtract a constant from the reward!
  • Why does it work?
  • What is the best choice of b?

Homework!

slide-53
SLIDE 53

Variance reduction

Gradient estimator: First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-54
SLIDE 54

Variance reduction

Gradient estimator: First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state. Second idea: Use a discount factor γ to ignore delayed effects
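Combining the two ideas, the estimator weights each log-probability by the discounted reward-to-go from that time step (standard form, reconstructed):

\nabla_{\theta} J(\theta) \approx \sum_{t \ge 0} \Big(\sum_{t' \ge t} \gamma^{\,t' - t}\, r_{t'}\Big)\, \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}).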

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-55
SLIDE 55

55

Issues with Policy Gradients

  • Credit assignment is hard!
  • Which specific action led to the increase in reward?
  • Suffers from high variance, leading to unstable training
slide-56
SLIDE 56

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-57
SLIDE 57

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-58
SLIDE 58

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large. On the contrary, we are unhappy with an action if it’s small.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-59
SLIDE 59

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-60
SLIDE 60

60

  • Learn both policy and Q function
  • Use the “actor” to sample trajectories
  • Use the Q function to “evaluate” or “critique” the policy

Actor-Critic

slide-61
SLIDE 61

61

  • Learn both policy and Q function
  • Use the “actor” to sample trajectories
  • Use the Q function to “evaluate” or “critique” the policy
  • REINFORCE:
  • Actor-critic:

Actor-Critic

slide-62
SLIDE 62

62

  • Learn both policy and Q function
  • Use the “actor” to sample trajectories
  • Use the Q function to “evaluate” or “critique” the policy
  • REINFORCE:
  • Actor-critic:
  • Q function is unknown too! Update using

Actor-Critic

slide-63
SLIDE 63

63

  • Initialize s, (policy network) and (Q network)

Actor-Critic

slide-64
SLIDE 64

64

  • Initialize s, (policy network) and (Q network)
  • sample action

Actor-Critic

slide-65
SLIDE 65

65

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state

Actor-Critic

slide-66
SLIDE 66

66

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state
  • evaluate “actor” using “critic”

Actor-Critic

slide-67
SLIDE 67

67

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state
  • evaluate “actor” using “critic” and update policy:

Actor-Critic

slide-68
SLIDE 68

68

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state
  • evaluate “actor” using “critic” and update policy:
  • Update “critic”:
  • Recall Q-learning

Actor-Critic

slide-69
SLIDE 69

69

  • Initialize s, the policy network π_θ, and the Q-network Q_φ
  • Sample an action a ~ π_θ(a|s)
  • For each step:
  • Sample reward and next state
  • Evaluate the “actor” using the “critic” and update the policy:
  • Update the “critic”:
  • Recall Q-learning
  • Update the Q-network accordingly (see the sketch below)

Actor-Critic
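A minimal PyTorch-style sketch of one inner step of this loop, under the assumptions that actor(s) returns an action distribution, critic(s, a) returns a scalar Q-value tensor, and env.step(a) returns (next_state, reward, done). It uses a SARSA-style TD target r + γ Q(s', a') with a' drawn from the current policy, whereas the slide's critic update may instead use the Q-learning max; all names and details are illustrative, not from the slides:

```python
def actor_critic_step(actor, critic, actor_opt, critic_opt, env, s, a, gamma=0.99):
    """One actor-critic step: sample a transition, update the actor (policy), then the critic (Q)."""
    # s, a: current state and action (a is assumed to be a tensor, as returned by .sample())
    s_next, r, done = env.step(a)            # sample reward and next state
    a_next = actor(s_next).sample()          # a' ~ pi_theta(. | s')

    # Actor: ascend Q_phi(s, a) * grad log pi_theta(a | s); the critic's value is held fixed
    actor_loss = -critic(s, a).detach() * actor(s).log_prob(a)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Critic: TD target r + gamma * Q_phi(s', a'), squared-error loss as in Q-learning
    target = r + gamma * (0.0 if done else critic(s_next, a_next).detach())
    critic_loss = (critic(s, a) - target) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    return s_next, a_next, done
```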
slide-70
SLIDE 70

70

Actor-critic

  • In general, replacing the policy evaluation or the “critic” leads to different flavors of the actor-critic

  • REINFORCE:
  • Q – Actor Critic
slide-71
SLIDE 71

How to choose the baseline?

A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large. On the contrary, we are unhappy with an action if it’s small. Using this, we get the estimator:
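The resulting estimator, in standard form (reconstructed; notation may differ from the slide):

\nabla_{\theta} J(\theta) \approx \sum_{t \ge 0} \big(Q^{\pi}(s_{t}, a_{t}) - V^{\pi}(s_{t})\big)\, \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}).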

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

slide-72
SLIDE 72

72

Actor-critic

  • In general, replacing the policy evaluation or the “critic” leads to different flavors of the actor-critic

  • REINFORCE:
  • Q – Actor Critic
  • Advantage Actor Critic:

“How much better is an action than expected?”
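The quantity in quotes is the advantage function; in standard notation (not copied from the slide), A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s), and the advantage actor-critic weights \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) by an estimate of A^{\pi}(s_{t}, a_{t}).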

slide-73
SLIDE 73

73

  • Policy Learning:
  • Policy gradients
  • REINFORCE
  • Reducing Variance (Homework!)
  • Actor-Critic:
  • Other ways of performing “policy evaluation”
  • Variants of Actor-critic

Summary