

SLIDE 1

Deep RL

Robert Platt Northeastern University

SLIDE 2

Q-learning

[Diagram: the Q-learning loop. The World emits a state; the Q-function maps (state, action) to values; argmax over those values picks the action sent back to the World; the Q-function is improved with the update rule.]

SLIDE 3

Q-learning

[Diagram: the same Q-learning loop as the previous slide, with the Q-function and update rule called out.]

SLIDE 4

Deep Q-learning (DQN)

[Diagram: the same loop, but the Q-function is now a neural network whose outputs are the values of the different possible discrete actions.]

SLIDE 5

Deep Q-learning (DQN)

[Diagram: the same loop with a neural-network Q-function.]

But, why would we want to do this?

SLIDE 6

Where does “state” come from?

[Diagram: the agent takes actions a and perceives states s and rewards r from the environment.]

Earlier, we dodged this question: “it’s part of the MDP problem statement.” But that’s a cop-out. How do we get state? Typically we can’t use “raw” sensor data as the state with a tabular Q-function – it’s too big (e.g., Pacman has something like 2^(num pellets) + … states).

SLIDE 7

Where does “state” come from?

[Diagram: the agent takes actions a and perceives states s and rewards r from the environment.]

Earlier, we dodged this question: “it’s part of the MDP problem statement.” But that’s a cop-out. How do we get state? Typically we can’t use “raw” sensor data as the state with a tabular Q-function – it’s too big (e.g., Pacman has something like 2^(num pellets) + … states).

Is it possible to do RL WITHOUT hand-coding states?

SLIDE 8

DQN

SLIDE 9

Instead of a state, we have an image – in practice, it could be a history of the k most recent images stacked as a single k-channel image. Hopefully this new image representation is Markov – in some domains, it might not be!

DQN

SLIDE 10

Q-function

[Network diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output]

DQN

SLIDE 11

Q-function

[Network diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output]

DQN

SLIDE 12

Q-function

[Network diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output. The number of output nodes equals the number of actions.]

DQN
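To make the picture concrete, here is a minimal PyTorch sketch of this kind of Q-network. The specific layer sizes and the 84x84, k-frame input are assumptions in the spirit of the original Atari DQN, not taken from the slides.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of k images to one Q-value per discrete action."""
    def __init__(self, k_frames: int, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(k_frames, 16, kernel_size=8, stride=4),  # Conv 1
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),        # Conv 2
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256),   # FC 1 (9x9 feature maps for 84x84 inputs)
            nn.ReLU(),
            nn.Linear(256, num_actions),  # Output: one Q-value per action
        )

    def forward(self, x):
        return self.fc(self.conv(x))

# e.g. 4 stacked 84x84 frames and 6 actions -> a [1, 6] tensor of Q-values
q_net = QNetwork(k_frames=4, num_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))
```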

SLIDE 13

Q-function updates in DQN

Here’s the standard Q-learning update equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

SLIDE 14

Here’s the standard Q-learning update equation:

Q-function updates in DQN

SLIDE 15

Here’s the standard Q-learning update equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Q-function updates in DQN

SLIDE 16

Here’s the standard Q-learning update equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Let’s call r + γ max_a' Q(s',a') the “target”.

This equation adjusts Q(s,a) in the direction of the target.

Q-function updates in DQN

SLIDE 17

Here’s the standard Q-learning update equation, rewritten with the “target” called out. We’re going to accomplish this same thing in a different way using neural networks...

This equation adjusts Q(s,a) in the direction of the target.

Q-function updates in DQN

SLIDE 18

Use this loss function:

L(w) = ½ [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

Q-function updates in DQN

SLIDE 19

Use this loss function (shown above). Notice that Q is now parameterized by the weights, w.

Q-function updates in DQN

SLIDE 20

Use this loss function (shown above). I’m including the bias terms in the weights, w.

Q-function updates in DQN

SLIDE 21

Use this loss function; the quantity r + γ max_a' Q(s',a'; w) inside the squared error is the “target”.

Q-function updates in DQN

SLIDE 22

Use this loss function, with the target called out (as above).

Question

What’s this called?

SLIDE 23

Use this loss function, with the target called out (as above).

Q-function updates in DQN

We’re going to optimize this loss function using the following gradient:

∇_w L(w) = − [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

SLIDE 24

Use this loss function, with the target called out (as above).

Think-pair-share

We’re going to optimize this loss function using the following gradient:

∇_w L(w) = − [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

What’s wrong with this?

SLIDE 25

Use this loss function, with the target called out (as above).

Q-function updates in DQN

We’re going to optimize this loss function using the following gradient:

∇_w L(w) = − [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

What’s wrong with this?

We call this the semi-gradient rather than the gradient – the target also depends on w, but we treat it as a constant when differentiating. Semi-gradient descent still converges, and this is often more convenient.
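For concreteness, here is one way the semi-gradient shows up in code (a PyTorch sketch; q_net and the batch tensors are assumed names): the target is computed under torch.no_grad(), so backpropagating the loss differentiates only through Q(s,a;w) and yields the semi-gradient above rather than the full gradient.

```python
import torch
import torch.nn.functional as F

def semi_gradient_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    """Squared TD-error loss for a batch of transitions.

    The target r + gamma * max_a' Q(s',a'; w) is treated as a constant
    (no_grad), so autograd produces the semi-gradient, not the full gradient.
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; w)
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values            # max_a' Q(s', a'; w)
        target = r + gamma * (1.0 - done) * q_next           # no bootstrap at terminal states
    return F.mse_loss(q_sa, target)
```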

SLIDE 26

“Barebones” DQN

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        [one semi-gradient step on the loss – see “Where” below]
        s ← s'
    Until s is terminal

Where:

w ← w + α [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

SLIDE 27

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        [one semi-gradient step on the loss – see “Where” below]
        s ← s'
    Until s is terminal

Where:

w ← w + α [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

This is all that changed relative to standard Q-learning.

“Barebones” DQN
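A rough Python sketch of the barebones loop (assumptions: env follows the classic Gym reset/step interface, q_net is a network like the one sketched earlier, and semi_gradient_loss is the loss sketch from above; there is no replay buffer or target network yet):

```python
import random
import torch

def barebones_dqn(env, q_net, num_episodes, epsilon=0.1, lr=1e-3, gamma=0.99):
    """Act epsilon-greedily and take one semi-gradient step per transition."""
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action from the current Q-network
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                with torch.no_grad():
                    a = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).argmax(1).item()
            s_next, r, done, _ = env.step(a)
            # one semi-gradient step on this single transition
            loss = semi_gradient_loss(
                q_net,
                torch.as_tensor(s, dtype=torch.float32).unsqueeze(0),
                torch.as_tensor([a]),
                torch.as_tensor([r], dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0),
                torch.as_tensor([float(done)]),
                gamma,
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
            s = s_next
```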

SLIDE 28

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H). Demo!

SLIDE 29

Think-pair-share

Suppose the “barebones” DQN algorithm with this DQN network experiences the following transition: which weights in the network could be updated on this iteration?

SLIDE 30

Experience replay

Deep learning typically assumes independent, identically distributed (IID) training data

SLIDE 31

Experience replay

[the “barebones” DQN training loop from before]

Deep learning typically assumes independent, identically distributed (IID) training data. But is this true in the deep RL scenario?

SLIDE 32

Experience replay

[the “barebones” DQN training loop from before]

Deep learning typically assumes independent, identically distributed (IID) training data. But is this true in the deep RL scenario?

Our solution: buffer experiences and then “replay” them during training.

SLIDE 33

Experience replay

Initialize Q(s,a;w) with random weights; initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add this experience (s, a, r, s') to the replay buffer D
        If mod(step, trainfreq) == 0:          (train every trainfreq steps)
            Sample batch B from D
            Take one step of gradient descent on the loss w.r.t. the batch B
        s ← s'
    Until s is terminal

SLIDE 34

Experience replay

[same training loop with a replay buffer as on the previous slide]

Where the batch loss is the average squared TD error over the batch B:

L(w) = (1/|B|) Σ_{(s,a,r,s') ∈ B} ½ [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

SLIDE 35

Experience replay

[same training loop with a replay buffer as on the previous slide]

Where the batch loss is the average squared TD error over the batch B:

L(w) = (1/|B|) Σ_{(s,a,r,s') ∈ B} ½ [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

Buffers like this are pretty common in DL
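A minimal sketch of such a buffer (names are assumptions; a deque gives the usual drop-the-oldest behavior when the buffer is full):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions, sampled uniformly at random."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experience is evicted when full

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)    # list of transitions -> batched components
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```

In the loop above, every transition is add()ed, and every trainfreq steps a batch is sample()d for one gradient step.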

SLIDE 36

Think-pair-share

[same training loop with a replay buffer as above]

What do you think are the tradeoffs between:
– a large replay buffer vs. a small replay buffer?
– a large batch size vs. a small batch size?

SLIDE 37

With target network

Initialize Q(s,a;w) with random weights; initialize the target network weights w⁻ ← w; initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add this experience (s, a, r, s') to the replay buffer D
        If mod(step, trainfreq) == 0:
            Sample batch B from D and take one step of gradient descent on the loss over B
        If mod(step, copyfreq) == 0:
            Copy the online weights into the target network: w⁻ ← w
        s ← s'
    Until s is terminal

Where the target in the loss is now computed from the target network: r + γ max_a' Q(s',a'; w⁻)

Target network helps stabilize DL – why?
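A hedged sketch of the two pieces the target network adds (assumed names, not the lecture's exact code): the target in the loss comes from a frozen copy Q(·,·; w⁻), and every copyfreq steps the online weights are copied into that copy.

```python
import copy
import torch
import torch.nn.functional as F

def make_target_network(q_net):
    """Create the frozen copy Q(., .; w-) used to compute targets."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def loss_with_target_net(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; w)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values          # max_a' Q(s', a'; w-)
        target = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, target)

def sync_target_network(q_net, target_net):
    """Every copyfreq steps: w- <- w."""
    target_net.load_state_dict(q_net.state_dict())
```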

SLIDE 38

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H). Demo!

SLIDE 39

Comparison: replay vs no replay

(Avg final score achieved)

SLIDE 40

Double DQN

Recall the problem of maximization bias: using the same Q-estimates both to select the maximizing action and to evaluate it tends to overestimate action values.

SLIDE 41

Double DQN

Recall the problem of maximization bias. Our solution from the TD lecture was double Q-learning: keep two Q-functions and use one to select the maximizing action and the other to evaluate it. Can we adapt this to the DQN setting?

SLIDE 42

Double DQN

[same training loop with replay buffer and target network as before]

Where the target now uses the online network to select the action and the target network to evaluate it:

r + γ Q(s', argmax_a' Q(s',a'; w); w⁻)
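A sketch of how only the target changes for Double DQN (assumed names, building on the target-network sketch above): the online network w selects the argmax action, and the target network w⁻ evaluates it.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """r + gamma * Q(s', argmax_a' Q(s', a'; w); w-)"""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)          # select with the online net
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)    # evaluate with the target net
        return r + gamma * (1.0 - done) * q_eval
```

Plain DQN uses target_net(s_next).max(dim=1).values, so the same network both selects and evaluates the maximizing action; splitting those two roles is what reduces the maximization bias.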

SLIDE 43

Think-pair-share

[same training loop, with the Double DQN target defined as on the previous slide]

  • 1. In what sense is this double Q-learning?
  • 2. What are the pros/cons vs. the earlier version of double-Q?
  • 3. Why not convert the original double-Q algorithm into a deep version?

SLIDE 44

Double DQN

SLIDE 45

Double DQN

SLIDE 46

Prioritized Replay Buffer

[same training loop with a replay buffer as before]

Previously, the batch B was sampled from D uniformly at random. Can we do better by sampling the batch intelligently?

SLIDE 47

Prioritized Replay Buffer

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

SLIDE 48

Question

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

Why is the sampling method particularly important in this domain?

SLIDE 49

Prioritized Replay Buffer

[Plot: number of updates needed to learn the true value function, as a function of replay buffer size. A larger replay buffer corresponds to larger values of n in cliffworld. The black line selects minibatches randomly; the blue line greedily selects the transitions that minimize the loss over the entire buffer.]

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

SLIDE 50

Prioritized Replay Buffer

[Plot: number of updates needed to learn the true value function, as a function of replay buffer size. A larger replay buffer corresponds to larger values of n in cliffworld. The black line selects minibatches randomly; the blue line greedily selects the transitions that minimize the loss over the entire buffer.]

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

Minimizing the loss over the entire buffer is impractical. Can we achieve something similar online?

SLIDE 51

Question

Idea: sample the elements of the minibatch by drawing samples with probability P(i) ∝ p_i, where p_i denotes the priority of sample i – simplest case: p_i = |δ_i| + ε (this is “proportional” sampling).

Problem: since we’re changing the distribution of updates performed, this is off policy – we need to weight the sampled updates…

Question: qualitatively, how should we re-weight experiences? – e.g., how should we re-weight an experience that prioritized replay does not sample often?

SLIDE 52

Prioritized Replay Buffer

Idea: sample the elements of the minibatch by drawing samples with probability P(i) ∝ p_i, where p_i denotes the priority of sample i – simplest case: p_i = |δ_i| + ε (this is “proportional” sampling).

Problem: since we’re changing the distribution of updates performed, this is off policy – we need to weight the sampled updates.

SLIDE 53

Prioritized Replay Buffer

Idea: sample the elements of the minibatch by drawing samples with probability P(i) ∝ p_i, where p_i denotes the priority of sample i – simplest case: p_i = |δ_i| + ε (this is “proportional” sampling).

Problem: since we’re changing the distribution of updates performed, this is off policy – we need to weight the sampled updates.

Why is epsilon needed?
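A simplified NumPy sketch of proportional prioritized sampling (assumptions: priorities live in a flat array rather than a sum-tree, and the importance weights shown are the standard (1/(N·P(i)))^β correction, one way to answer the re-weighting question above):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, eps=1e-3, beta=0.4):
    """Proportional prioritized sampling with importance weights.

    priority     p_i = |delta_i| + eps   (eps keeps zero-TD-error samples alive)
    probability  P_i = p_i / sum_k p_k
    weight       w_i = (1 / (N * P_i))**beta, normalized so the max weight is 1
    """
    priorities = np.abs(np.asarray(td_errors)) + eps
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    weights = (1.0 / (len(probs) * probs[idx])) ** beta
    weights /= weights.max()      # rarely-sampled experiences get the largest weights
    return idx, weights
```

The returned weights multiply each sampled transition's loss term, partially undoing the skew that prioritization introduces into the update distribution.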

SLIDE 54

Prioritized Replay Buffer

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

The prioritized buffer is not as good as the oracle, but it is better than uniform sampling...

SLIDE 55

Prioritized Replay Buffer

– averaged results over 57 Atari games

SLIDE 56

Dueling networks for Q-learning

Recall architecture of Q-network:

SLIDE 57

Dueling networks for Q-learning

This is a more common way of drawing it:

[Diagram: stack of images → CONV layers → fully connected layers → Q-values]

SLIDE 58

Dueling networks for Q-learning

This is a more common way of drawing it:

[Diagram: stack of images → CONV layers → fully connected layers → Q-values]

We’re going to express the Q-function using a new network architecture.

SLIDE 59

Dueling networks for Q-learning

Decompose Q as: Q(s,a) = V(s) + A(s,a)

where A is the advantage function.

SLIDE 60

Think-pair-share

Decompose Q as: Q(s,a) = V(s) + A(s,a)

where A is the advantage function.

  • 1. Why might this decomposition be better?
  • 2. Is A always positive, negative, or either? Why?

SLIDE 61

Intuition

SLIDE 62

Dueling networks for Q-learning

Notice that the decomposition of Q into V and A is not unique, given Q targets only. Therefore, constrain the advantage stream, e.g.:

Q(s,a) = V(s) + ( A(s,a) − max_a' A(s,a') )

SLIDE 63

Question

Notice that the decomposition of Q into V and A is not unique, given Q targets only. Therefore, constrain the advantage stream, e.g.:

Q(s,a) = V(s) + ( A(s,a) − max_a' A(s,a') )

Why does this help?

SLIDE 64

Dueling networks for Q-learning

Notice that the decomposition of Q into V and A is not unique, given Q targets only. Actually, the constraint used in practice subtracts the mean advantage:

Q(s,a) = V(s) + ( A(s,a) − (1/|A|) Σ_a' A(s,a') )
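A minimal PyTorch sketch of the dueling head (layer sizes are assumptions): a shared trunk feeds separate V and A streams, and the mean advantage is subtracted so the decomposition is identifiable.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')"""
    def __init__(self, feature_dim, num_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(                 # V(s): one number per state
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(             # A(s, a): one number per action
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, features):
        v = self.value(features)                    # [B, 1]
        a = self.advantage(features)                # [B, num_actions]
        return v + a - a.mean(dim=1, keepdim=True)  # broadcast V across actions
```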

SLIDE 65

Dueling networks for Q-learning

Action set: left, right, up, down, no-op (with an arbitrary number of no-op actions).
SE: squared error relative to the true value function.
Compare dueling networks with single-stream networks (all networks are three-layer MLPs).
Increasing the number of actions above corresponds to increasing the number of no-op actions.
Conclusion: dueling networks can help a lot for large numbers of actions.

SLIDE 66

Dueling networks for Q-learning

Change in avg rewards for 57 ALE domains versus DQN w/ a single-stream network.

SLIDE 67

Asynchronous methods

Idea: run multiple RL agents in parallel
– all agents run against their own environments and Q fn
– periodically, all agents synch w/ a global Q fn

Instantiations of the idea:
– asynchronous Q-learning
– asynchronous SARSA
– asynchronous advantage actor critic (A3C)

SLIDE 68

Asynchronous Q-learning

[Diagram: each learner accumulates gradients against shared Q-functions; periodically, a batch of weight updates is applied and the target network is updated.]

SLIDE 69

Asynchronous Q-learning

Why does this approach help?

SLIDE 70

Asynchronous Q-learning

Why does this approach help? It helps decorrelate training data:
– standard DQN relies on the replay buffer and the target network to decorrelate data
– asynchronous methods accomplish the same thing by having multiple learners
– this makes it feasible to use on-policy methods like SARSA (why?)
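A schematic sketch of one asynchronous Q-learning worker (a rough sketch under several assumptions: a classic Gym-style env, a shared global_net/target_net/optimizer guarded by a lock, and Python threads as the parallelism mechanism; real implementations differ, e.g. by using lock-free updates):

```python
import threading
import torch

def async_q_worker(make_env, global_net, target_net, opt, lock, num_steps,
                   sync_every=5, copy_every=1000, gamma=0.99, epsilon=0.1):
    """One learner thread: acts in its own environment, accumulates semi-gradients
    for the shared network, and applies them as a batch every sync_every steps."""
    env = make_env()                                 # each learner gets its own environment
    s = env.reset()
    for t in range(1, num_steps + 1):
        s_t = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():                        # epsilon-greedy action from the shared net
            a = env.action_space.sample() if torch.rand(1).item() < epsilon \
                else global_net(s_t).argmax(1).item()
        s_next, r, done, _ = env.step(a)
        with lock:                                   # serialize access to the shared weights
            q_sa = global_net(s_t)[0, a]
            with torch.no_grad():
                s_n = torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0)
                target = r + gamma * (0.0 if done else target_net(s_n).max().item())
            ((target - q_sa) ** 2).backward()        # gradients accumulate across steps
            if t % sync_every == 0 or done:
                opt.step()                           # apply the accumulated batch of updates
                opt.zero_grad()
            if t % copy_every == 0:
                target_net.load_state_dict(global_net.state_dict())
        s = env.reset() if done else s_next

# Launch several workers that all share global_net, target_net, opt, and the lock:
# threads = [threading.Thread(target=async_q_worker,
#                             args=(make_env, global_net, target_net, opt, lock, 100000))
#            for _ in range(8)]
```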

SLIDE 71

Asynchronous Q-learning

Different numbers of learners versus wall clock time

SLIDE 72

Asynchronous Q-learning

Different numbers of learners versus number of SGD steps across all threads – speedup is not just due to greater computational efficiency

SLIDE 73

Combine all these ideas!