SLIDE 1

CS 4803 / 7643: Deep Learning

Topics:
– Dynamic Programming (Q-Value Iteration)
– Reinforcement Learning (Intro, Q-Learning, DQNs)

Nirbhay Modhe, Georgia Tech

SLIDE 2

Topics we’ll cover

  • Overview of RL
    – RL vs other forms of learning
    – RL “API”
    – Applications
  • Framework: Markov Decision Processes (MDPs)
    – Definitions and notations
    – Policies and Value Functions
  • Solving MDPs
    – Value Iteration (recap)
    – Q-Value Iteration (new)
    – Policy Iteration
  • Reinforcement learning
    – Value-based RL (Q-Learning, Deep Q-Learning)
    – Policy-based RL (policy gradients)

SLIDE 4

Recap

SLIDE 5

Recap

  • Markov Decision Process (MDP)
    – Defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{T}, \gamma)$:
      $\mathcal{S}$: set of possible states [start state = $s_0$, optional terminal / absorbing state]
      $\mathcal{A}$: set of possible actions
      $\mathcal{R}(s, a, s')$: distribution of reward given a (state, action, next state) tuple
      $\mathbb{T}(s, a, s')$: transition probability distribution, also written as $p(s' \mid s, a)$
      $\gamma$: discount factor

SLIDE 6

Recap

  • Markov Decision Process (MDP)
    – Defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{T}, \gamma)$ (see Slide 5); a concrete sketch follows below
  • Value functions, optimal quantities, Bellman equations
  • Algorithms for solving MDPs
    – Value Iteration

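To make the recap concrete, here is a minimal Python sketch of a toy MDP; the two states, the actions, and all transition probabilities and rewards are invented for illustration (they are not from the slides).

```python
# A toy two-state MDP written out explicitly (illustrative only).
# T[s][a] is a list of (next_state, probability) pairs, R maps
# (state, action, next_state) tuples to rewards, gamma is the discount factor.
S = ["s0", "s1"]
A = ["stay", "move"]
T = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
R = {("s0", "move", "s1"): 1.0}  # every other transition has reward 0
gamma = 0.9

def reward(s, a, s_next):
    return R.get((s, a, s_next), 0.0)
```
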
SLIDE 9

Value Function

Following policy $\pi$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, r_1, \dots$

How good is a state? The value function at state $s$ is the expected cumulative reward from state $s$ (and following the policy thereafter):

$V^{\pi}(s) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi \right]$

How good is a state-action pair? The Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ (and following the policy thereafter):

$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi \right]$

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

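Because $V^{\pi}$ is an expectation over sampled trajectories, it can be estimated by averaging discounted returns over Monte Carlo rollouts. A sketch reusing the toy MDP above (the policy, episode count, and horizon are illustrative assumptions):

```python
import random

def sample_next(s, a):
    # Draw the next state from the transition distribution T(s, a, .)
    next_states, probs = zip(*T[s][a])
    return random.choices(next_states, weights=probs)[0]

def estimate_value(s, pi, n_episodes=1000, horizon=50):
    # Average the discounted return sum_t gamma^t r_t over sampled trajectories.
    total = 0.0
    for _ in range(n_episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            a = pi(state)
            s_next = sample_next(state, a)
            ret += discount * reward(state, a, s_next)
            discount *= gamma
            state = s_next
        total += ret
    return total / n_episodes

print(estimate_value("s0", lambda s: "move"))  # V^pi(s0) for the always-move policy
```
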
SLIDE 11

Optimal Quantities

Given an optimal policy $\pi^{*}$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \dots$

How good is a state? The optimal value function at state $s$ is the expected cumulative reward from state $s$, acting optimally thereafter:

$V^{*}(s) = \max_{\pi} \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi \right]$

How good is a state-action pair? The optimal Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$, acting optimally thereafter:

$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi \right]$

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 13

Bellman Optimality Equations

  • Relations:
    $V^{*}(s) = \max_{a} Q^{*}(s, a)$,  $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$
  • Recursive optimality equations:
    $Q^{*}(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]$
    $V^{*}(s) = \max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V^{*}(s') \right]$

SLIDE 19

Value Iteration (VI)

  • Based on the Bellman optimality equation
  • Algorithm
    – Initialize the values of all states (e.g., $V_0(s) = 0$)
    – While not converged, for each state:
      $V_{i+1}(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V_i(s') \right]$
    – Repeat until convergence (no change in values)

Time complexity per iteration? (Homework)

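A direct translation of this loop into Python, reusing the toy MDP above (the convergence tolerance is an illustrative choice; this version updates $V$ in place, which also converges):

```python
def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in S}  # initialize all state values to zero
    while True:
        delta = 0.0
        for s in S:
            # Bellman optimality backup: max over actions of expected reward + discounted value
            best = max(
                sum(p * (reward(s, a, ns) + gamma * V[ns]) for ns, p in T[s][a])
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # converged: values no longer change
            return V

print(value_iteration())
```
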
SLIDE 20

Q-Value Iteration

  • Value Iteration update:
    $V_{i+1}(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V_i(s') \right]$
  • Q-Value Iteration update:
    $Q_{i+1}(s, a) \leftarrow \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q_i(s', a') \right]$

The algorithm is the same as value iteration, but it loops over actions as well as states.


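The corresponding Q-value iteration sketch: the only changes from the value iteration code above are that the table is indexed by (state, action) pairs and the max moves inside the backup.

```python
def q_value_iteration(tol=1e-8):
    Q = {(s, a): 0.0 for s in S for a in A}
    while True:
        delta = 0.0
        for s in S:
            for a in A:
                # Backup: expected reward plus discounted value of the best next action
                new = sum(
                    p * (reward(s, a, ns) + gamma * max(Q[(ns, b)] for b in A))
                    for ns, p in T[s][a]
                )
                delta = max(delta, abs(new - Q[(s, a)]))
                Q[(s, a)] = new
        if delta < tol:
            return Q
```
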
SLIDE 22

Policy Iteration

(C) Dhruv Batra

SLIDE 25

Policy Iteration

  • Policy iteration: start with an arbitrary policy $\pi_0$ and refine it.
  • Involves repeating two steps:
    – Policy Evaluation: compute $V^{\pi_i}$ (similar to VI)
    – Policy Refinement: greedily change actions as per $V^{\pi_i}$, giving $\pi_{i+1}$
  • Why do policy iteration?
    – $\pi_i$ often converges to $\pi^{*}$ much sooner than $V^{\pi_i}$ converges to $V^{*}$

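A sketch of the two alternating steps on the toy MDP above; iterative policy evaluation with an illustrative tolerance is used here, though the slides do not prescribe a particular evaluation method.

```python
def policy_iteration(tol=1e-8):
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        # Policy Evaluation: iterate the fixed-policy backup to get V^pi
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                v = sum(p * (reward(s, pi[s], ns) + gamma * V[ns])
                        for ns, p in T[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy Refinement: act greedily with respect to V^pi
        new_pi = {
            s: max(A, key=lambda a: sum(p * (reward(s, a, ns) + gamma * V[ns])
                                        for ns, p in T[s][a]))
            for s in S
        }
        if new_pi == pi:  # policy stable => optimal
            return pi, V
        pi = new_pi
```
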
SLIDE 26

Summary

  • Value Iteration
    – Bellman update to state value estimates
  • Q-Value Iteration
    – Bellman update to (state, action) value estimates
  • Policy Iteration
    – Policy evaluation + refinement

SLIDE 27

Learning Based Methods

SLIDE 29

Learning Based Methods

  • Typically, we don’t know the environment:
    – $\mathbb{T}(s, a, s')$ is unknown: we don’t know how actions affect the environment.
    – $\mathcal{R}(s, a, s')$ is unknown: we don’t know what/when the good actions are.
  • But we can learn by trial and error:
    – Gather experience (data) by performing actions.
    – Approximate unknown quantities from data.

Reinforcement Learning

SLIDE 30

Learning Based Methods: Reinforcement Learning

(C) Dhruv Batra

  • Old Dynamic Programming Demo
    – https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
  • RL Demo
    – https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html

SLIDE 34

(Deep) Learning Based Methods

  • In addition to not knowing the environment, sometimes the state space is too large.
  • A value iteration update takes $O(|\mathcal{S}|^2 |\mathcal{A}|)$ time.
    – Not scalable to high-dimensional states, e.g. RGB images.
  • Solution: Deep Learning!
    – Use deep neural networks to learn low-dimensional representations.

Deep Reinforcement Learning

SLIDE 39

Reinforcement Learning

  • Value-based RL
    – (Deep) Q-Learning: approximating $Q^{*}(s, a)$ with a deep Q-network
  • Policy-based RL
    – Directly approximate the optimal policy $\pi^{*}$ with a parametrized policy $\pi_{\theta}$
  • Model-based RL
    – Approximate the transition function and reward function
    – Plan by looking ahead into the (approximate) future!

(C) Dhruv Batra

Homework!

SLIDE 40

Value-based Reinforcement Learning

Deep Q-Learning

SLIDE 42

Deep Q-Learning

  • Q-Learning with linear function approximators
    – Has some theoretical guarantees
  • Deep Q-Learning: fit a deep Q-network $Q(s, a; \theta)$
    – Works well in practice
    – The Q-network can take RGB images as input

Image Credits: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 44

Deep Q-Learning

  • Assume we have collected a dataset $\{(s_i, a_i, s'_i, r_i)\}$.
  • We want a Q-function that satisfies the Bellman optimality (Q-value) equation:
    $Q(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a') \right]$
  • Loss for a single data point:
    $L_i(\theta) = \big(\, \underbrace{r_i + \gamma \max_{a'} Q(s'_i, a'; \theta)}_{\text{target Q-value}} - \underbrace{Q(s_i, a_i; \theta)}_{\text{predicted Q-value}} \,\big)^2$

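A minimal PyTorch sketch of this loss; the tiny fully-connected network, the 4-dimensional state, and the 2 actions are illustrative assumptions, not the slides' model.

```python
import torch
import torch.nn as nn

# Illustrative Q-network: maps a 4-dim state to one Q-value per action (2 here).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def dqn_loss(s, a, r, s_next, gamma=0.99):
    # Predicted Q-value: Q(s_i, a_i; theta), gathered at the taken actions
    q_pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # Target Q-value: r_i + gamma * max_a' Q(s'_i, a'; theta);
    # no gradient flows through the target term
    with torch.no_grad():
        q_target = r + gamma * q_net(s_next).max(dim=1).values
    return ((q_target - q_pred) ** 2).mean()
```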

SLIDE 49

Deep Q-Learning

  • Minibatch of transitions $\{(s_i, a_i, s'_i, r_i)\}$
  • Forward pass: state → Q-network → Q-values per action
  • Compute loss: $\sum_i L_i(\theta)$
  • Backward pass: compute $\partial L / \partial \theta$ and update $\theta$

SLIDE 50

Deep Q-Learning

  • In practice, for stability:
    – Freeze $Q_{\text{old}}$ and update only $Q_{\text{new}}$’s parameters
    – Set $Q_{\text{old}} \leftarrow Q_{\text{new}}$ at regular intervals

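A sketch of this target-network trick (the sync interval is an illustrative hyperparameter; `q_net` is the network from the loss sketch above). In `dqn_loss`, the target term would then be computed with `q_target_net` instead of `q_net`.

```python
import copy

# Frozen copy of the online network, refreshed only at intervals.
q_target_net = copy.deepcopy(q_net)
for p in q_target_net.parameters():
    p.requires_grad_(False)

SYNC_EVERY = 1000  # illustrative choice

def maybe_sync(step):
    # Q_old <- Q_new at regular intervals
    if step % SYNC_EVERY == 0:
        q_target_net.load_state_dict(q_net.state_dict())
```
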
SLIDE 51

How to gather experience? (This is why RL is hard.)

SLIDE 53

How To Gather Experience?

[Figure: a loop of Environment → Data → Train → Update]

Challenge 1: Exploration vs. exploitation
Challenge 2: Non-iid, highly correlated data

SLIDE 55

Exploration Problem

  • What should the data-gathering policy be?
    – Greedy? → local minima, no exploration
  • An exploration strategy: $\epsilon$-greedy
    – With probability $1 - \epsilon$ take the greedy action $\arg\max_a Q(s, a)$; with probability $\epsilon$ take a random action.

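A sketch of ε-greedy action selection with the `q_net` from the earlier sketch; ε is fixed here for simplicity, though DQN-style implementations typically anneal it over training.

```python
import random
import torch

def epsilon_greedy(state, epsilon=0.1, n_actions=2):
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: random action
    with torch.no_grad():                   # exploit: greedy action argmax_a Q(s, a)
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```
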
SLIDE 56

Correlated Data Problem

  • Samples are correlated ⇒ high-variance gradients ⇒ inefficient learning
  • The current Q-network parameters determine the next training samples ⇒ can lead to bad feedback loops
    – e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 59

Experience Replay

  • Address this problem using experience replay
    – A replay buffer stores transitions $(s_t, a_t, r_t, s_{t+1})$
    – Continually update the replay buffer as game (experience) episodes are played; discard older samples
    – Train the Q-network on random minibatches of transitions drawn from the replay memory, instead of consecutive samples

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

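A minimal replay buffer sketch; the capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def __len__(self):
        return len(self.buffer)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random minibatch breaks the temporal correlation of consecutive samples
        return random.sample(self.buffer, batch_size)
```
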
SLIDE 60

Q-Learning Algorithm

[Algorithm figure: the full deep Q-learning loop, with callouts marking epsilon-greedy action selection, the Q update, and experience replay]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

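Since the algorithm itself is a figure on the slide, here is a hedged sketch of how the pieces above fit together in a DQN-style loop. The environment, hyperparameters, and warm-up threshold are all illustrative assumptions (a dummy stand-in env is defined so the sketch is self-contained), and terminal-state masking is omitted for brevity; this is not the exact pseudocode from the slide.

```python
import torch

class DummyEnv:
    """Stand-in for a Gym-style environment (illustrative; replace with a real env)."""
    def reset(self):
        return [0.0, 0.0, 0.0, 0.0]
    def step(self, action):
        return [0.0, 0.0, 0.0, 0.0], 0.0, False, {}

env = DummyEnv()
buffer = ReplayBuffer()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

state = env.reset()
for step in range(100_000):
    action = epsilon_greedy(torch.as_tensor(state, dtype=torch.float32))
    next_state, r, done, info = env.step(action)
    buffer.push(state, action, r, next_state, done)   # experience replay
    state = env.reset() if done else next_state

    if len(buffer) >= 1_000:                          # warm-up before training
        batch = buffer.sample()
        s, a, rew, s2, d = zip(*batch)
        loss = dqn_loss(torch.as_tensor(s, dtype=torch.float32),
                        torch.as_tensor(a),
                        torch.as_tensor(rew, dtype=torch.float32),
                        torch.as_tensor(s2, dtype=torch.float32))
        optimizer.zero_grad()
        loss.backward()                               # Q update
        optimizer.step()
    maybe_sync(step)                                  # refresh the frozen target network
```
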
SLIDE 61

Case Study: Playing Atari Games

  • Objective: complete the game with the highest score
  • State: raw pixel inputs from the game state
  • Action: game controls, e.g. Left, Right, Up, Down
  • Reward: score increase/decrease at each time step

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 62

Playing Atari Games

  • Q-network architecture
  • State:
    – Stack of 4 image frames, after grayscale conversion, down-sampling, and cropping (84 × 84 × 4)
  • Last FC layer has #(actions) output dimensions (predicts one Q-value per action)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

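The architecture diagram on the slide is an image; the sketch below follows the original DQN network (Mnih et al., 2013) that these CS 231n slides describe, so the filter counts are taken from that paper rather than read off the slide.

```python
import torch.nn as nn

atari_q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),   # input: 4 stacked 84x84 grayscale frames
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),                  # spatial size: 84 -> 20 -> 9
    nn.ReLU(),
    nn.Linear(256, 4),                           # one Q-value per action (4 assumed)
)
```
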
SLIDE 63

Atari Games: Pong, Breakout

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 64

Summary

In today’s class, we looked at:

  • Dynamic Programming
    – Q-Value Iteration
    – Policy Iteration
  • Reinforcement Learning (RL)
    – The challenges of (deep) learning based methods
    – Value-based RL algorithms
      • Deep Q-Learning

Next class:
  – Policy-based RL algorithms

SLIDE 65

Thanks!

(C) Dhruv Batra