DS595/CS525 Reinforcement Learning


SLIDE 1

Welcome to DS595/CS525 Reinforcement Learning

Prof. Yanhua Li
Time: 6:00pm – 8:50pm, R (Thursdays), Zoom Lecture, Fall 2020

This lecture will be recorded!!!

SLIDE 2

No Quiz Today

SLIDE 3

Project 3 due today

SLIDE 4

Next Thursday: No class Happy Thanksgiving

SLIDE 5

Project 4 is available (starts Thursday 10/29)

v https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project4

v Important Dates:
  v Project Proposal: Thursday 11/12/2020
  v Progress report: Thursday 11/26/2020
  v Final Project:
    § Tuesday 12/8/2020: team project report due
    § Thursday 12/10/2020: Virtual Poster Session

SLIDE 6

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL
SLIDE 7

Reinforcement Learning

v Single Agent
  § Tabular representation of reward
    • Model-based control
    • Model-free control (MC, SARSA, Q-Learning)
  § Function representation of reward
    • 1. Linear value function approximation (MC, SARSA, Q-Learning)
    • 2. Value function approximation (Deep Q-Learning, Double DQN, Prioritized DQN, Dueling DQN)
    • 3. Policy function approximation (Policy Gradient, PPO, TRPO)
    • 4. Actor-Critic methods (A2C, A3C, Pathwise Derivative PG)
  § Advanced topics in RL (Sparse Rewards)
  § Review of Deep Learning: basis for non-linear function approximation (used in 2-4)

v Inverse Reinforcement Learning
  § Linear reward function learning: Imitation learning, Apprenticeship learning, Inverse reinforcement learning, MaxEnt IRL, MaxCausalEnt IRL, MaxRelEnt IRL
  § Non-linear reward function learning: Generative Adversarial Imitation Learning (GAIL), Adversarial Inverse Reinforcement Learning (AIRL)
  § Review of Generative Adversarial Nets: basis for non-linear IRL

v Multiple Agents
  § Multi-Agent Reinforcement Learning: Multi-agent Actor-Critic, etc.
  § Multi-Agent Inverse Reinforcement Learning: MA-GAIL, MA-AIRL, AMA-GAIL

v Applications

SLIDE 8

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL

v Project #4 progress update

SLIDE 9

Model-free RL Algorithms

v Value-based (Learned Value Function)
v Policy-based (Learned Policy Function)
v Actor-Critic (Learned both Value and Policy Functions)

SLIDE 10

Model-free RL Algorithms

v Value-based (Learned Value Function)
  § Deep Q-Learning (DQN)
  § Double DQN
  § Dueling DQN
  § Prioritized DQN

v Policy-based (Learned Policy Function)
  § Basic Policy Gradient Algorithm
  § REINFORCE
  § Vanilla, PPO, TRPO, PPO2

v Actor-Critic (Learned both Value and Policy Functions)
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

SLIDE 11

SLIDE 12

Basic Policy Gradient Algorithm

The algorithm alternates between two phases: Data Collection (run the current policy to gather trajectories) and Model Update (a gradient step on the collected data). Each batch of data is only used once and then discarded, since the method is on-policy; the Monte-Carlo return gives an unbiased estimator of the policy gradient.
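To make the loop concrete, here is a minimal REINFORCE-style sketch in PyTorch; the policy network, optimizer, and gym-style environment are illustrative assumptions, not taken from the slides.

import torch

def reinforce_update(policy_net, optimizer, env, gamma=0.99):
    # --- Data collection: run one episode with the current policy ---
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # --- Compute discounted returns G_t (unbiased Monte-Carlo estimate) ---
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # --- Model update: gradient ascent on sum_t log pi(a_t|s_t) * G_t ---
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # On-policy: the collected episode is discarded after this single update.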

SLIDE 13

SLIDE 14

Epsilon Greedy vs. Boltzmann Exploration
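For contrast, a small illustrative sketch of the two action-selection rules over a vector of Q-values (function names are ours, not from the slides):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a uniformly random action,
    # otherwise pick the greedy (max-Q) action.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample actions with probability proportional to exp(Q/T):
    # high-Q actions are favored, but every action keeps some probability.
    prefs = np.asarray(q_values) / temperature
    prefs -= prefs.max()              # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(q_values), p=probs))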

SLIDE 15

SLIDE 16

SLIDE 17

SLIDE 18

A2C algorithm: combine value function approximation (the critic) with the policy gradient (the actor).
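A minimal one-step A2C update sketch, assuming separate actor (state to action logits) and critic (state to V(s)) networks; all names and the single-transition form are illustrative assumptions:

import torch
import torch.nn.functional as F

def a2c_update(actor, critic, optimizer, s, a, r, s_next, done, gamma=0.99):
    s, s_next = map(lambda x: torch.as_tensor(x, dtype=torch.float32), (s, s_next))

    # Critic: TD target r + gamma * V(s'), and advantage A = target - V(s)
    v = critic(s)
    with torch.no_grad():
        target = r + gamma * critic(s_next) * (1.0 - float(done))
    advantage = (target - v).detach()

    # Actor: policy-gradient step weighted by the advantage; the critic
    # replaces the Monte-Carlo return, reducing variance.
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -dist.log_prob(torch.as_tensor(a)) * advantage
    critic_loss = F.mse_loss(v, target)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()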

SLIDE 19

SLIDE 20

From A2C to A3C: Asynchronous Advantage Actor-Critic (A3C). Multiple workers collect experience and compute gradients in parallel, each asynchronously updating a shared global actor-critic network.

SLIDE 21

SLIDE 22

Pathwise Derivative Policy Gradient

References:
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, "Deterministic Policy Gradient Algorithms", ICML, 2014.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR, 2016.

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

Replace the ε-greedy policy with a policy network π: the actor outputs the action directly, so no arg max over actions is needed.
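A sketch of the resulting actor update in the DDPG style of the Lillicrap et al. paper cited above (the critic signature and all names are illustrative assumptions): the policy network outputs the action, and its parameters are trained by backpropagating Q(s, π(s)) through the critic.

import torch

def actor_update(actor, critic, actor_optimizer, states):
    # Pathwise derivative: backpropagate through the critic into the
    # actor's parameters, ascending Q(s, pi(s)).
    actions = actor(states)                 # pi(s), continuous actions
    loss = -critic(states, actions).mean()  # maximize Q => minimize -Q
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()                  # only actor params are stepped;
                                            # the critic's weights stay unchanged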

SLIDE 31

Model-free RL Algorithms

v Value-based (Learned Value Function)
  § Deep Q-Learning (DQN)
  § Double DQN
  § Dueling DQN
  § Prioritized DQN

v Policy-based (Learned Policy Function)
  § Basic Policy Gradient Algorithm
  § REINFORCE
  § Vanilla, PPO, TRPO, PPO2

v Actor-Critic (Learned both Value and Policy Functions)
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

SLIDE 32

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL
SLIDE 33

Multi-step DQN

Store N-step transitions in the experience buffer and bootstrap only after N rewards. The step count N balances between MC (large N: less bias, more variance) and TD (N = 1: more bias, less variance).
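A sketch of the N-step target: with N = 1 it reduces to the usual TD target, and for large N it approaches the Monte-Carlo return (an illustrative helper, not project code):

def n_step_target(rewards, q_next, gamma=0.99):
    """N-step DQN target:
    r_t + gamma*r_{t+1} + ... + gamma^(N-1)*r_{t+N-1} + gamma^N * max_a Q(s_{t+N}, a)

    rewards: the N rewards stored in the experience buffer for this transition.
    q_next:  max_a Q_target(s_{t+N}, a), the bootstrap value after N steps.
    """
    n = len(rewards)
    g = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted N-step return
    return g + gamma**n * q_next                          # bootstrap after N steps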

SLIDE 34

Noisy Net for DQN

v Noise on Action (Epsilon Greedy)
v Noise on Parameters (Noisy Net): inject noise into the parameters of the Q-function at the beginning of each episode. The noise does NOT change within an episode.

References: https://arxiv.org/abs/1706.01905, https://arxiv.org/abs/1706.10295
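A minimal sketch of a noisy linear layer in the spirit of the NoisyNet paper above, simplified to independent Gaussian noise (the paper also proposes a factorized variant); the class name and sizes are illustrative:

import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    # w = mu_w + sigma_w * eps_w, with eps resampled once per episode,
    # so exploration is consistent within an episode.
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0))
        self.reset_noise()

    def reset_noise(self):
        # Call at the start of each episode; noise stays fixed in between.
        self.eps_w = torch.randn_like(self.mu_w)
        self.eps_b = torch.randn_like(self.mu_b)

    def forward(self, x):
        w = self.mu_w + self.sigma_w * self.eps_w
        b = self.mu_b + self.sigma_b * self.eps_b
        return x @ w.t() + b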

SLIDE 35

Noisy Net

Noise on the action gives random exploration; noise on the parameters gives systematic exploration (within an episode, the same state always maps to the same action).

SLIDE 36

Demo

https://blog.openai.com/better-exploration-with-parameter-noise/ (Which one is action noise vs. parameter noise?)

SLIDE 37

Distributional Q-function

Different return distributions can have the same expected value. [Figure: two different return distributions, both with expected value 10.]

SLIDE 38

Distributional Q-function

A standard Q-network takes state s and has 3 outputs (one Q-value per action). A distributional Q-network takes the same state s and has 15 outputs: each of the 3 actions gets a distribution over 5 bins.
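A sketch of such a distributional output head (sizes match the slide's 3 actions x 5 bins; the atom value range and all names are our assumptions):

import torch
import torch.nn as nn

class DistributionalHead(nn.Module):
    # 3 actions x 5 bins = 15 outputs; each action gets a probability
    # distribution over fixed return values (the bin "atoms").
    def __init__(self, hidden_dim=128, n_actions=3, n_bins=5, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.n_actions, self.n_bins = n_actions, n_bins
        self.fc = nn.Linear(hidden_dim, n_actions * n_bins)
        self.atoms = torch.linspace(v_min, v_max, n_bins)  # return value per bin

    def forward(self, h):
        logits = self.fc(h).view(-1, self.n_actions, self.n_bins)
        probs = logits.softmax(dim=-1)            # one distribution per action
        q_values = (probs * self.atoms).sum(-1)   # expected value per action
        return probs, q_values                    # act greedily on q_values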

SLIDE 39

Demo

https://youtu.be/yFBwyPuO2Vg

SLIDE 40

Rainbow

Rainbow combines the DQN improvements above (Double DQN, prioritized replay, dueling networks, multi-step targets, distributional Q-learning, noisy nets) in a single agent: https://arxiv.org/abs/1710.02298

SLIDE 41

Continuous Actions

In a continuous action space, the arg max over actions in Q-learning becomes an optimization problem in its own right.

Solution 1: Sample a set of actions and see which sampled action obtains the largest Q value.

Solution 2: Use gradient ascent to solve the optimization problem.

Pros and cons???

SLIDE 42

Continuous Actions

Solution 3: Design a network architecture that makes the optimization easy. The network takes state s and outputs a vector μ(s), a matrix Σ(s), and a scalar V(s), with Q(s, a) = −(a − μ(s))ᵀ Σ(s) (a − μ(s)) + V(s), so the maximizing action is simply a = μ(s).
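A sketch of this architecture in the spirit of Normalized Advantage Functions (layer sizes and names are illustrative assumptions). Building Σ(s) as L Lᵀ keeps it positive semi-definite, so Q is concave in a and μ(s) is the exact maximizer:

import torch
import torch.nn as nn

class QuadraticQ(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)               # vector mu(s)
        self.L = nn.Linear(hidden, action_dim * action_dim)   # builds matrix Sigma(s)
        self.v = nn.Linear(hidden, 1)                         # scalar V(s)
        self.action_dim = action_dim

    def forward(self, s, a):
        h = self.body(s)
        mu = self.mu(h)
        # Sigma(s) = L L^T is positive semi-definite, making Q concave in a.
        L = self.L(h).view(-1, self.action_dim, self.action_dim)
        sigma = L @ L.transpose(1, 2)
        d = (a - mu).unsqueeze(-1)
        quad = (d.transpose(1, 2) @ sigma @ d).squeeze(-1)
        return self.v(h) - quad   # Q(s,a) = V(s) - (a - mu)^T Sigma (a - mu)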

SLIDE 43

Continuous Actions

Solution 4: Don't use Q-learning. Switch to policy-based or actor-critic methods: policy-based learns an actor, value-based learns a critic, and actor-critic combines both.

https://www.youtube.com/watch?v=ZhsEKTo7V04

SLIDE 44

This Lecture

v Actor-Critic methods
  § A2C
  § A3C
  § Pathwise Derivative Policy Gradient

v Advanced RL Techniques
  § Advanced techniques for DQN
    • Multi-step DQN, Noisy net DQN
    • Distributional DQN
    • DQN for continuous action space
  § Sparse Reward
    • Reward shaping, Curiosity module
    • Curriculum learning, Hierarchical RL

v Project #4 progress update

SLIDE 45

Sparse Reward

Reward Shaping

SLIDE 46

Reward Shaping

SLIDE 47

Reward Shaping

VizDoom

https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg

SLIDE 48

Reward Shaping

https://openreview.net/pdf?id=Hk3mPK5gg

The agent gets a shaping reward whenever it moves closer to the goal. Designing such rewards needs domain knowledge.
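One standard way to encode "reward when closer" without changing the optimal policy is potential-based shaping (Ng et al., 1999): add F(s, s') = γΦ(s') − Φ(s). A sketch with a hypothetical distance-to-goal potential (both functions are illustrative, not from the slides):

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    # Potential-based shaping: adding F(s, s') = gamma*phi(s') - phi(s)
    # to the environment reward preserves the optimal policy.
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: closer to the goal => higher potential,
# so moving closer yields a positive shaping bonus.
def phi(state, goal=(0.0, 0.0)):
    return -((state[0] - goal[0])**2 + (state[1] - goal[1])**2) ** 0.5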

SLIDE 49

Curiosity

https://arxiv.org/abs/1705.05363

[Diagram: the actor repeatedly interacts with the environment; an ICM (intrinsic curiosity module) computes an additional intrinsic reward at each step, and the actor is updated with both the environment reward and the intrinsic reward.]

SLIDE 50

Intrinsic Curiosity Module

Network 1 takes the current state and action (s_t, a_t) and predicts the next state s_{t+1}; the prediction error (the "diff") is used as an intrinsic reward, which encourages exploration of hard-to-predict states. Problem: some states are hard to predict but not important (trivial events).

SLIDE 51

SLIDE 52

Intrinsic Curiosity Module

Improved design: a feature extractor maps states to features φ(s). Network 1 (the forward model) predicts φ(s_{t+1}) from φ(s_t) and a_t, and the diff in feature space is the intrinsic reward; Network 2 (the inverse model) predicts a_t from φ(s_t) and φ(s_{t+1}), so the learned features keep only what the agent's actions can influence and filter out trivial events.
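A compressed sketch of this two-network design, following the paper linked above (layer sizes and all names are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, state_dim, n_actions, feat_dim=32):
        super().__init__()
        # Feature extractor: maps raw states to features phi(s)
        self.encoder = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU())
        # Network 1 (forward model): predict phi(s_{t+1}) from phi(s_t), a_t
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        # Network 2 (inverse model): predict a_t from phi(s_t), phi(s_{t+1})
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)
        self.n_actions = n_actions

    def forward(self, s, a, s_next):
        # a: LongTensor of discrete action indices
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        a_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        # Intrinsic reward: forward prediction error (the "diff") in feature space
        intrinsic_reward = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(-1)
        # Inverse loss keeps features about what the action can influence,
        # filtering out hard-to-predict but trivial events.
        inverse_loss = F.cross_entropy(a_logits, a)
        return intrinsic_reward, inverse_loss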

SLIDE 53

Sparse Reward

Curriculum Learning

SLIDE 54

Curriculum Learning

v Start from simple training examples, then make the tasks harder and harder.

VizDoom

SLIDE 55

Sparse Reward

Hierarchical Reinforcement Learning

SLIDE 56

https://arxiv.org/abs/1805.08180

SLIDE 57

Reinforcement Learning

v Single Agent
  § Tabular representation of reward
    • Model-based control
    • Model-free control (MC, SARSA, Q-Learning)
  § Function representation of reward
    • 1. Linear value function approximation (MC, SARSA, Q-Learning)
    • 2. Value function approximation (Deep Q-Learning, Double DQN, Prioritized DQN, Dueling DQN)
    • 3. Policy function approximation (Policy Gradient, PPO, TRPO)
    • 4. Actor-Critic methods (A2C, A3C, Pathwise Derivative PG)
  § Advanced topics in RL (Sparse Rewards)
  § Review of Deep Learning: basis for non-linear function approximation (used in 2-4)

v Inverse Reinforcement Learning
  § Linear reward function learning: Imitation learning, Apprenticeship learning, Inverse reinforcement learning, MaxEnt IRL, MaxCausalEnt IRL, MaxRelEnt IRL
  § Non-linear reward function learning: Generative Adversarial Imitation Learning (GAIL), Adversarial Inverse Reinforcement Learning (AIRL)
  § Review of Generative Adversarial Nets: basis for non-linear IRL

v Multiple Agents
  § Multi-Agent Reinforcement Learning: Multi-agent Actor-Critic, etc.
  § Multi-Agent Inverse Reinforcement Learning: MA-GAIL, MA-AIRL, AMA-GAIL

v Applications

SLIDE 58

Questions?