DS595/CS525 Reinforcement Learning


SLIDE 1

Welcome to DS595/CS525 Reinforcement Learning

Prof. Yanhua Li
Time: 6:00pm – 8:50pm, R (Thursdays), Zoom Lecture, Fall 2020

This lecture will be recorded!!!

SLIDE 2

No Quiz Today

SLIDE 3

Project 3 due today

SLIDE 4

Next Thursday: No class Happy Thanksgiving

SLIDE 5

Project 4 is available (starts Thursday 10/29)

v https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project4

v Important Dates:
  v Project Proposal: Thursday 11/12/2020
  v Progress report: Thursday 11/26/2020
  v Final Project:
    § Tuesday 12/8/2020: team project report due
    § Thursday 12/10/2020: Virtual Poster Session

SLIDE 6

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL
SLIDE 7

Reinforcement Learning

v Single Agent
  § Tabular representation of reward
    • Model-based control
    • Model-free control (MC, SARSA, Q-Learning)
  § Function representation of reward
    • 1. Linear value function approximation (MC, SARSA, Q-Learning)
    • 2. Value function approximation (Deep Q-Learning, Double DQN, Prioritized DQN, Dueling DQN)
    • 3. Policy function approximation (Policy Gradient, PPO, TRPO)
    • 4. Actor-Critic methods (A2C, A3C, Pathwise Derivative PG)
  § Advanced topics in RL (Sparse Rewards)
  § Review of Deep Learning: basis for non-linear function approximation (used in 2-4)

v Inverse Reinforcement Learning
  § Linear reward function learning: Imitation learning, Apprenticeship learning, Inverse reinforcement learning, MaxEnt IRL, MaxCausalEnt IRL, MaxRelEnt IRL
  § Non-linear reward function learning: Generative Adversarial Imitation Learning (GAIL), Adversarial Inverse Reinforcement Learning (AIRL)
  § Review of Generative Adversarial Nets: basis for non-linear IRL

v Multiple Agents
  § Multi-Agent Reinforcement Learning: Multi-agent Actor-Critic, etc.
  § Multi-Agent Inverse Reinforcement Learning: MA-GAIL, MA-AIRL, AMA-GAIL

v Applications

SLIDE 8

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL

v Project #4 progress update

SLIDE 9

Model-free RL Algorithms

v Value-based (Learned Value Function)
v Policy-based (Learned Policy Function)
v Actor-Critic (Learned both Value and Policy Functions)

SLIDE 10

Model-free RL Algorithms

v Value-based (Learned Value Function)
  § Deep Q-Learning (DQN)
  § Double DQN
  § Dueling DQN
  § Prioritized DQN

v Policy-based (Learned Policy Function)
  § Basic Policy Gradient Algorithm
  § REINFORCE
  § Vanilla, PPO, TRPO, PPO2

v Actor-Critic (Learned both Value and Policy Functions)
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

SLIDE 11

SLIDE 12

Basic Policy Gradient Algorithm

The algorithm alternates between two phases: Data Collection (run the current policy to gather trajectories) and Model Update (a gradient step on the collected data). Each batch of data is only used once and then discarded, since the method is on-policy; the Monte-Carlo return gives an unbiased estimator of the policy gradient.
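To make the loop concrete, here is a minimal REINFORCE-style sketch in PyTorch; the policy network, optimizer, and gym-style environment are illustrative assumptions, not taken from the slides.

import torch

def reinforce_update(policy_net, optimizer, env, gamma=0.99):
    # --- Data collection: run one episode with the current policy ---
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # --- Compute discounted returns G_t (unbiased Monte-Carlo estimate) ---
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # --- Model update: gradient ascent on sum_t log pi(a_t|s_t) * G_t ---
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # On-policy: the collected episode is discarded after this single update.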

SLIDE 13

SLIDE 14

Epsilon Greedy vs. Boltzmann Exploration
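For contrast, a small illustrative sketch of the two action-selection rules over a vector of Q-values (function names are ours, not from the slides):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a uniformly random action,
    # otherwise pick the greedy (max-Q) action.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample actions with probability proportional to exp(Q/T):
    # high-Q actions are favored, but every action keeps some probability.
    prefs = np.asarray(q_values) / temperature
    prefs -= prefs.max()              # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(q_values), p=probs))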

SLIDE 15

SLIDE 16

SLIDE 17

SLIDE 18

A2C algorithm: combine value function approximation (the critic) with the policy gradient (the actor).
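A minimal one-step A2C update sketch, assuming separate actor (state to action logits) and critic (state to V(s)) networks; all names and the single-transition form are illustrative assumptions:

import torch
import torch.nn.functional as F

def a2c_update(actor, critic, optimizer, s, a, r, s_next, done, gamma=0.99):
    s, s_next = map(lambda x: torch.as_tensor(x, dtype=torch.float32), (s, s_next))

    # Critic: TD target r + gamma * V(s'), and advantage A = target - V(s)
    v = critic(s)
    with torch.no_grad():
        target = r + gamma * critic(s_next) * (1.0 - float(done))
    advantage = (target - v).detach()

    # Actor: policy-gradient step weighted by the advantage; the critic
    # replaces the Monte-Carlo return, reducing variance.
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -dist.log_prob(torch.as_tensor(a)) * advantage
    critic_loss = F.mse_loss(v, target)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()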

SLIDE 19

SLIDE 20

From A2C to A3C: Asynchronous Advantage Actor-Critic (A3C). Multiple workers collect experience and compute gradients in parallel, each asynchronously updating a shared global actor-critic network.

SLIDE 21

SLIDE 22

Pathwise Derivative Policy Gradient

References:
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, "Deterministic Policy Gradient Algorithms", ICML, 2014.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR, 2016.

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

Replace the ε-greedy policy with a policy network π: the actor outputs the action directly, so no arg max over actions is needed.
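A sketch of the resulting actor update in the DDPG style of the Lillicrap et al. paper cited above (the critic signature and all names are illustrative assumptions): the policy network outputs the action, and its parameters are trained by backpropagating Q(s, π(s)) through the critic.

import torch

def actor_update(actor, critic, actor_optimizer, states):
    # Pathwise derivative: backpropagate through the critic into the
    # actor's parameters, ascending Q(s, pi(s)).
    actions = actor(states)                 # pi(s), continuous actions
    loss = -critic(states, actions).mean()  # maximize Q => minimize -Q
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()                  # only actor params are stepped;
                                            # the critic's weights stay unchanged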

SLIDE 31

Model-free RL Algorithms

v Value-based (Learned Value Function)
  § Deep Q-Learning (DQN)
  § Double DQN
  § Dueling DQN
  § Prioritized DQN

v Policy-based (Learned Policy Function)
  § Basic Policy Gradient Algorithm
  § REINFORCE
  § Vanilla, PPO, TRPO, PPO2

v Actor-Critic (Learned both Value and Policy Functions)
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

SLIDE 32

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL
SLIDE 33

Multi-step DQN

Store N-step transitions in the experience buffer and bootstrap only after N rewards. The step count N balances between MC (large N: less bias, more variance) and TD (N = 1: more bias, less variance).
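A sketch of the N-step target: with N = 1 it reduces to the usual TD target, and for large N it approaches the Monte-Carlo return (an illustrative helper, not project code):

def n_step_target(rewards, q_next, gamma=0.99):
    """N-step DQN target:
    r_t + gamma*r_{t+1} + ... + gamma^(N-1)*r_{t+N-1} + gamma^N * max_a Q(s_{t+N}, a)

    rewards: the N rewards stored in the experience buffer for this transition.
    q_next:  max_a Q_target(s_{t+N}, a), the bootstrap value after N steps.
    """
    n = len(rewards)
    g = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted N-step return
    return g + gamma**n * q_next                          # bootstrap after N steps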

SLIDE 34

Noisy Net for DQN

v Noise on Action (Epsilon Greedy)
v Noise on Parameters (Noisy Net): inject noise into the parameters of the Q-function at the beginning of each episode. The noise does NOT change within an episode.

References: https://arxiv.org/abs/1706.01905, https://arxiv.org/abs/1706.10295
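A minimal sketch of a noisy linear layer in the spirit of the NoisyNet paper above, simplified to independent Gaussian noise (the paper also proposes a factorized variant); the class name and sizes are illustrative:

import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    # w = mu_w + sigma_w * eps_w, with eps resampled once per episode,
    # so exploration is consistent within an episode.
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0))
        self.reset_noise()

    def reset_noise(self):
        # Call at the start of each episode; noise stays fixed in between.
        self.eps_w = torch.randn_like(self.mu_w)
        self.eps_b = torch.randn_like(self.mu_b)

    def forward(self, x):
        w = self.mu_w + self.sigma_w * self.eps_w
        b = self.mu_b + self.sigma_b * self.eps_b
        return x @ w.t() + b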

SLIDE 35

Noisy Net

Noise on the action gives random exploration; noise on the parameters gives systematic exploration (within an episode, the same state always maps to the same action).

SLIDE 36

Demo

https://blog.openai.com/better-exploration-with-parameter-noise/ (Which one is action noise vs. parameter noise?)

SLIDE 37

Distributional Q-function

Different return distributions can have the same expected value. [Figure: two different return distributions, both with expected value 10.]

SLIDE 38

Distributional Q-function

A standard Q-network takes state s and has 3 outputs (one Q-value per action). A distributional Q-network takes the same state s and has 15 outputs: each of the 3 actions gets a distribution over 5 bins.
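A sketch of such a distributional output head (sizes match the slide's 3 actions x 5 bins; the atom value range and all names are our assumptions):

import torch
import torch.nn as nn

class DistributionalHead(nn.Module):
    # 3 actions x 5 bins = 15 outputs; each action gets a probability
    # distribution over fixed return values (the bin "atoms").
    def __init__(self, hidden_dim=128, n_actions=3, n_bins=5, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.n_actions, self.n_bins = n_actions, n_bins
        self.fc = nn.Linear(hidden_dim, n_actions * n_bins)
        self.atoms = torch.linspace(v_min, v_max, n_bins)  # return value per bin

    def forward(self, h):
        logits = self.fc(h).view(-1, self.n_actions, self.n_bins)
        probs = logits.softmax(dim=-1)            # one distribution per action
        q_values = (probs * self.atoms).sum(-1)   # expected value per action
        return probs, q_values                    # act greedily on q_values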

SLIDE 39

Demo

https://youtu.be/yFBwyPuO2Vg

SLIDE 40

Rainbow

Rainbow combines the DQN improvements above (Double DQN, prioritized replay, dueling networks, multi-step targets, distributional Q-learning, noisy nets) in a single agent: https://arxiv.org/abs/1710.02298

SLIDE 41

Continuous Actions

In a continuous action space, the arg max over actions in Q-learning becomes an optimization problem in its own right.

Solution 1: Sample a set of actions and see which sampled action obtains the largest Q value.

Solution 2: Use gradient ascent to solve the optimization problem.

Pros and cons???

SLIDE 42

Continuous Actions

Solution 3: Design a network architecture that makes the optimization easy. The network takes state s and outputs a vector μ(s), a matrix Σ(s), and a scalar V(s), with Q(s, a) = −(a − μ(s))ᵀ Σ(s) (a − μ(s)) + V(s), so the maximizing action is simply a = μ(s).
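A sketch of this architecture in the spirit of Normalized Advantage Functions (layer sizes and names are illustrative assumptions). Building Σ(s) as L Lᵀ keeps it positive semi-definite, so Q is concave in a and μ(s) is the exact maximizer:

import torch
import torch.nn as nn

class QuadraticQ(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)               # vector mu(s)
        self.L = nn.Linear(hidden, action_dim * action_dim)   # builds matrix Sigma(s)
        self.v = nn.Linear(hidden, 1)                         # scalar V(s)
        self.action_dim = action_dim

    def forward(self, s, a):
        h = self.body(s)
        mu = self.mu(h)
        # Sigma(s) = L L^T is positive semi-definite, making Q concave in a.
        L = self.L(h).view(-1, self.action_dim, self.action_dim)
        sigma = L @ L.transpose(1, 2)
        d = (a - mu).unsqueeze(-1)
        quad = (d.transpose(1, 2) @ sigma @ d).squeeze(-1)
        return self.v(h) - quad   # Q(s,a) = V(s) - (a - mu)^T Sigma (a - mu)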

SLIDE 43

Continuous Actions

Solution 4: Don't use Q-learning. Switch to policy-based or actor-critic methods: policy-based learns an actor, value-based learns a critic, and actor-critic combines both.

https://www.youtube.com/watch?v=ZhsEKTo7V04

SLIDE 44

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL

v Project #4 progress update

SLIDE 45

Sparse Reward

Reward Shaping

SLIDE 46

Reward Shaping

SLIDE 47

Reward Shaping

VizDoom

https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg

SLIDE 48

Reward Shaping

https://openreview.net/pdf?id=Hk3mPK5gg

The agent gets a shaping reward whenever it moves closer to the goal. Designing such rewards needs domain knowledge.
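One standard way to encode "reward when closer" without changing the optimal policy is potential-based shaping (Ng et al., 1999): add F(s, s') = γΦ(s') − Φ(s). A sketch with a hypothetical distance-to-goal potential (both functions are illustrative, not from the slides):

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    # Potential-based shaping: adding F(s, s') = gamma*phi(s') - phi(s)
    # to the environment reward preserves the optimal policy.
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: closer to the goal => higher potential,
# so moving closer yields a positive shaping bonus.
def phi(state, goal=(0.0, 0.0)):
    return -((state[0] - goal[0])**2 + (state[1] - goal[1])**2) ** 0.5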

SLIDE 49

Curiosity

https://arxiv.org/abs/1705.05363

[Diagram: the actor repeatedly interacts with the environment; an ICM (intrinsic curiosity module) computes an additional intrinsic reward at each step, and the actor is updated with both the environment reward and the intrinsic reward.]

SLIDE 50

Intrinsic Curiosity Module

Network 1 takes the current state and action (s_t, a_t) and predicts the next state s_{t+1}; the prediction error (the "diff") is used as an intrinsic reward, which encourages exploration of hard-to-predict states. Problem: some states are hard to predict but not important (trivial events).

SLIDE 51

SLIDE 52

Intrinsic Curiosity Module

Improved design: a feature extractor maps states to features φ(s). Network 1 (the forward model) predicts φ(s_{t+1}) from φ(s_t) and a_t, and the diff in feature space is the intrinsic reward; Network 2 (the inverse model) predicts a_t from φ(s_t) and φ(s_{t+1}), so the learned features keep only what the agent's actions can influence and filter out trivial events.
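A compressed sketch of this two-network design, following the paper linked above (layer sizes and all names are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, state_dim, n_actions, feat_dim=32):
        super().__init__()
        # Feature extractor: maps raw states to features phi(s)
        self.encoder = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU())
        # Network 1 (forward model): predict phi(s_{t+1}) from phi(s_t), a_t
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        # Network 2 (inverse model): predict a_t from phi(s_t), phi(s_{t+1})
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)
        self.n_actions = n_actions

    def forward(self, s, a, s_next):
        # a: LongTensor of discrete action indices
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        a_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        # Intrinsic reward: forward prediction error (the "diff") in feature space
        intrinsic_reward = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(-1)
        # Inverse loss keeps features about what the action can influence,
        # filtering out hard-to-predict but trivial events.
        inverse_loss = F.cross_entropy(a_logits, a)
        return intrinsic_reward, inverse_loss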

SLIDE 53

Sparse Reward

Curriculum Learning

SLIDE 54

Curriculum Learning

v Start from simple training examples, then make the tasks harder and harder.

VizDoom

SLIDE 55

Sparse Reward

Hierarchical Reinforcement Learning

SLIDE 56

https://arxiv.org/abs/1805.08180

SLIDE 57

Reinforcement Learning

v Single Agent
  § Tabular representation of reward
    • Model-based control
    • Model-free control (MC, SARSA, Q-Learning)
  § Function representation of reward
    • 1. Linear value function approximation (MC, SARSA, Q-Learning)
    • 2. Value function approximation (Deep Q-Learning, Double DQN, Prioritized DQN, Dueling DQN)
    • 3. Policy function approximation (Policy Gradient, PPO, TRPO)
    • 4. Actor-Critic methods (A2C, A3C, Pathwise Derivative PG)
  § Advanced topics in RL (Sparse Rewards)
  § Review of Deep Learning: basis for non-linear function approximation (used in 2-4)

v Inverse Reinforcement Learning
  § Linear reward function learning: Imitation learning, Apprenticeship learning, Inverse reinforcement learning, MaxEnt IRL, MaxCausalEnt IRL, MaxRelEnt IRL
  § Non-linear reward function learning: Generative Adversarial Imitation Learning (GAIL), Adversarial Inverse Reinforcement Learning (AIRL)
  § Review of Generative Adversarial Nets: basis for non-linear IRL

v Multiple Agents
  § Multi-Agent Reinforcement Learning: Multi-agent Actor-Critic, etc.
  § Multi-Agent Inverse Reinforcement Learning: MA-GAIL, MA-AIRL, AMA-GAIL

v Applications

SLIDE 58

Questions?