Deep Reinforcement Learning and Complex Environments - Raia Hadsell (PowerPoint PPT Presentation)


SLIDE 1

Deep Reinforcement Learning 
 and Complex Environments

Raia Hadsell

SLIDE 2

End-to-end Deep Learning for robots?

slide from V. Vanhoucke


SLIDE 6

2010: Speech Recognition. Audio → Acoustic Model → Phonetic Model → Language Model → Text
2012: Computer Vision. Pixels → Key Points → SIFT features → Deformable Part Model → Labels
2014: Machine Translation. Text → Reordering → Phrase Table/Dictionary → Language Model → Text
2017: Robotics? Sensors → Perception → World Model → Planning → Control → Action

[In each case, the hand-engineered middle stages were replaced by a single Deep Net]

End-to-end Deep Learning for robots?

slide from V. Vanhoucke

SLIDE 7

General Artificial Intelligence

Robotics is different

[Diagram: supervised deep learning maps data to LABELS]

SLIDE 8

General Artificial Intelligence

Robotics is different

[Diagram: a robot must close the loop from SENSORS to ACTIONS]

SLIDE 9

General Artificial Intelligence

Deep Reinforcement Learning

[Diagram: agent-environment loop; the agent (a neural network) receives observations and reward from the environment and sends back actions in pursuit of a goal]
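The loop in this diagram can be sketched in a few lines of code (a toy sketch with a hypothetical `env.reset()`/`env.step()` interface, not any specific library):

```python
def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe, act, receive reward, repeat.

    Hypothetical interface: env.reset() -> obs,
    env.step(action) -> (obs, reward, done)."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)            # the neural network picks an action
        obs, reward, done = env.step(action)
        total += reward                 # reward signals progress toward the goal
        if done:
            break
    return total
```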

SLIDE 10

General Atari Player

[Mnih et al, Playing Atari with Deep Reinforcement Learning, 2013]

SLIDE 11

9-DoF random reacher

SLIDE 12

Deep RL — Raia Hadsell

  • Can deep RL agents learn multiple tasks?
  • Can deep RL agents learn efficiently?
  • Can deep RL agents learn from real data?
  • Can deep RL agents learn continuous control?
SLIDE 13

Lab Mazes · StreetLearn · Parkour · Multiple Tasks & Lifelong Learning

SLIDE 14

Raia Hadsell 2017

Lifelong Learning - 3 challenges

  • 1. Catastrophic forgetting
  • 2. Positive transfer
  • 3. Specialization and generalization
SLIDE 15

Catastrophic forgetting

  • Well-known phenomenon
  • Especially severe in Deep RL
SLIDE 17

Catastrophic forgetting

SLIDE 19

Elastic Weight Consolidation

[Diagram: in parameter space, EWC finds θ* inside the overlap of the low-error regions for Task A and Task B; plain SGD drifts out of Task A's region, and an L2 penalty fails to reach Task B's region]

James Kirkpatrick et al (2017), “Overcoming Catastrophic Forgetting in NNs”
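The EWC idea reduces to a quadratic penalty anchoring the parameters that mattered for Task A. A minimal numpy sketch (the `lam` coefficient and the diagonal Fisher values are illustrative assumptions, not values from the talk):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1000.0):
    """EWC penalty: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2.

    fisher is a diagonal Fisher information estimate from Task A, so
    weights important for Task A are anchored more strongly."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

def ewc_grad(theta, theta_star, fisher, lam=1000.0):
    """Gradient of the penalty; added to the Task-B loss gradient."""
    return lam * fisher * (theta - theta_star)
```

During Task B training, each step minimizes `task_b_loss + ewc_penalty(...)`, pulling important weights back toward θ* while leaving unimportant ones free.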

SLIDE 20

What if my tasks really don’t get along?

SLIDE 21

Progressive Nets

  • add columns for new tasks
  • freeze params of learnt columns
  • layer-wise neural connections


→ capacity for task-specific features
→ enables deep compositionality
→ precludes forgetting


What if my tasks really don’t get along?

[Diagram: a single network column trained on the first task]

Andrei Rusu et al (2016), “Progressive Neural Networks”
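The three bullets above can be sketched as a forward pass in which frozen columns contribute lateral activations to the newest column. A toy numpy sketch (two-layer columns and the 0.1 init scale are assumptions; the paper's adapter modules are omitted for brevity):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ProgressiveNet:
    """Minimal progressive-network sketch with 2-layer MLP columns.

    Each new column receives lateral inputs from the hidden activations
    of all previously trained (frozen) columns."""

    def __init__(self, in_dim, hid, out_dim, rng):
        self.in_dim, self.hid, self.out_dim, self.rng = in_dim, hid, out_dim, rng
        self.columns = []  # earlier entries are frozen when a new task starts

    def add_column(self):
        k = len(self.columns)  # number of existing frozen columns
        self.columns.append({
            "W1": self.rng.standard_normal((self.hid, self.in_dim)) * 0.1,
            # lateral weights from each frozen column's hidden layer
            "U": [self.rng.standard_normal((self.hid, self.hid)) * 0.1
                  for _ in range(k)],
            "W2": self.rng.standard_normal((self.out_dim, self.hid)) * 0.1,
        })

    def forward(self, x):
        """Output of the newest column; frozen columns only feed laterals."""
        hiddens = []
        for col in self.columns:
            h = col["W1"] @ x
            for U, h_prev in zip(col["U"], hiddens):
                h = h + U @ h_prev  # lateral connection from an earlier column
            h = relu(h)
            hiddens.append(h)
        return self.columns[-1]["W2"] @ hiddens[-1]
```

Because only the newest column's parameters are trained, earlier tasks cannot be forgotten, while their features remain usable through the laterals.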


SLIDE 24

Sim-to-Real

[Diagram: three progressive columns with lateral connections. Column 1: Task A in simulation; column 2: Task A on the real robot; column 3: Task B on the robot]

SLIDE 25

What if my tasks really don’t get along?

SLIDE 26

Distral (Distill and Transfer Learning)

  • Task-specific networks plus shared network
  • KL Divergence constraint
  • Regularisation in policy space rather than parameter space
  • Shared policy as a communication channel between tasks

[Diagram: task-specific policies π1…π4, each coupled to a shared distilled policy π0 through a KL term]

Yee Whye Teh et al (2017), “Distral: Robust Multitask Reinforcement Learning”

SLIDE 28

Distral (Distill and Transfer Learning)

[Diagram: π1 π2 π3 π4 each coupled to shared π0 by a KL term; distillation & regularisation]

  • Task-specific networks plus shared network
  • Regularisation in policy space rather than parameter space
  • Shared policy as a communication channel between tasks

→ Distillation of knowledge into the shared model enables transfer across tasks
→ Regularisation of the shared model gives stability and robustness

Yee Whye Teh et al (2017), “Distral: Robust Multitask Reinforcement Learning”
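The KL coupling in the diagram can be written down directly. A numpy sketch of the regulariser for discrete action spaces (the coefficient `c_kl` is an assumed placeholder for the paper's trade-off constants):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def distral_regulariser(task_logits, shared_logits, c_kl=0.5):
    """Sum over tasks of KL(pi_i || pi_0): the Distral coupling term.

    task_logits: list of per-task action logits.
    shared_logits: logits of the shared (distilled) policy pi_0."""
    pi0 = softmax(shared_logits)
    return c_kl * sum(kl(softmax(l), pi0) for l in task_logits)
```

Minimizing this term with respect to π0 distills the task policies into the shared model; minimizing it with respect to each πi regularises the tasks toward the shared policy.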


SLIDE 31

Lab Mazes & Auxiliary Learning · StreetLearn · Parkour · Multiple Tasks & Lifelong Learning

SLIDE 32

Navigation mazes

Game episode:
  1. Random start
  2. Find the goal (+10)
  3. Teleport randomly
  4. Re-find the goal (+10)
  5. Repeat (limited time)

Variants:
  • Static maze, static goal
  • Static maze, random goal
  • Random maze

10800 steps/episode · 3600 steps/episode
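The episode protocol above can be mimicked in a toy grid world (the random-walk policy, grid size, and step count are illustrative; the real agent acts in a 3D maze):

```python
import random

def maze_episode(size=5, episode_len=200, goal_reward=10, seed=0):
    """Toy version of the episode protocol: random spawn, +10 on reaching
    the goal, teleport, and repeat until time runs out."""
    rng = random.Random(seed)
    goal = (size - 1, size - 1)
    pos = (rng.randrange(size), rng.randrange(size))   # 1. random start
    total = 0
    for _ in range(episode_len):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        pos = (min(max(pos[0] + dx, 0), size - 1),
               min(max(pos[1] + dy, 0), size - 1))
        if pos == goal:                                # 2. find the goal (+10)
            total += goal_reward
            pos = (rng.randrange(size), rng.randrange(size))  # 3. teleport
    return total                                       # 5. repeat, limited time
```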

SLIDE 33

Nav agent architecture

  • 1. Convolutional encoder and RGB inputs


Piotr Mirowski, Razvan Pascanu et al (2017) “Learning to navigate in complex environments”

SLIDE 34

Nav agent architecture

  • 1. Convolutional encoder and RGB inputs
  • 2. Single or stacked LSTM with skip connection


Piotr Mirowski, Razvan Pascanu et al (2017) “Learning to navigate in complex environments”

SLIDE 35

Nav agent architecture

  • 1. Convolutional encoder and RGB inputs
  • 2. Stacked LSTM
  • 3. Additional inputs (reward, action, and velocity)


Piotr Mirowski, Razvan Pascanu et al (2017) “Learning to navigate in complex environments”

SLIDE 36

Nav agent architecture

  • 1. Convolutional encoder and RGB inputs
  • 2. Stacked LSTM
  • 3. Additional inputs (reward, action, and velocity)
  • 4. RL: Asynchronous advantage actor critic (A3C)


Piotr Mirowski, Razvan Pascanu et al (2017) “Learning to navigate in complex environments”
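A3C workers compute n-step returns and advantages from short rollouts before sending gradients to the shared parameters; a sketch of that bookkeeping (function name and interface are illustrative, not from the talk):

```python
def a3c_targets(rewards, values, bootstrap, gamma=0.99):
    """n-step returns and advantages for an A3C rollout (sketch).

    rewards: rewards collected over the rollout, oldest first.
    values:  V(s_t) estimates for the same steps.
    bootstrap: V of the state after the rollout (0 if terminal)."""
    R = bootstrap
    returns, advantages = [], []
    # accumulate the discounted return backwards through the rollout
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)   # advantage drives the policy gradient
    return returns[::-1], advantages[::-1]
```

The policy head is updated with the advantages and the value head regressed toward the returns, asynchronously across many workers.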

SLIDE 37

Nav agent architecture

  • 1. Convolutional encoder and RGB inputs
  • 2. Stacked LSTM
  • 3. Additional inputs (reward, action, and velocity)
  • 4. RL: Asynchronous advantage actor critic (A3C)
  • 5. Aux task 1: Depth predictors



Piotr Mirowski, Razvan Pascanu et al (2017) “Learning to navigate in complex environments”

SLIDE 38

Nav agent architecture

  • 1. Convolutional encoder and RGB inputs
  • 2. Stacked LSTM
  • 3. Additional inputs (reward, action, and velocity)
  • 4. RL: Asynchronous advantage actor critic (A3C)
  • 5. Aux task 1: Depth predictor
  • 6. Aux task 2: Loop closure predictor


Piotr Mirowski, Razvan Pascanu et al (2017) “Learning to navigate in complex environments”
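The auxiliary tasks add supervised losses on top of the RL objective. A numpy sketch of the three terms (the depth-bin count and the beta weights are assumptions, not values from the talk):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def depth_aux_loss(logits, depth_bins):
    """Aux task 1 sketch: depth prediction as classification over
    quantized depth bins. logits: (pixels, n_bins); depth_bins: integer
    targets per pixel."""
    lp = log_softmax(logits)
    return -lp[np.arange(len(depth_bins)), depth_bins].mean()

def loop_closure_loss(logit, label):
    """Aux task 2 sketch: binary logistic loss on 'have I been here before?'."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def total_loss(a3c_loss, d1, d2, loop, beta_d=0.33, beta_l=1.0):
    """Combined objective: RL loss plus weighted auxiliary terms."""
    return a3c_loss + beta_d * (d1 + d2) + beta_l * loop
```

The auxiliary gradients shape the shared representation even when reward is sparse, which is the point of adding them.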

SLIDE 39

Variations in architecture

  • a. FF A3C: enc(x_t) → π, V
  • b. LSTM A3C: enc(x_t) → LSTM → π, V
  • c. Nav A3C: enc(x_t), r_{t-1}, {v_t, a_{t-1}} → stacked LSTM → π, V
  • d. Nav A3C+D1D2L: Nav A3C plus depth predictions (D1, D2) and loop-closure prediction (L)

SLIDE 40

Results on large maze with static goal [plot; reward markers +10 and +1]


SLIDE 43

Lab Mazes & Auxiliary Learning · StreetLearn & Real-World RL · Parkour · Multiple Tasks & Lifelong Learning

SLIDE 44

Navigation mazes in the real world?

SLIDE 45

StreetView as an RL environment: StreetLearn

Observations:
  • RGB image cropped from panorama (84x84)
  • Goal location

Actions: move to next node, rotate view 20° or 60°

SLIDE 46

left or right?

StreetView as an RL environment: StreetLearn

SLIDE 47

Looks like a road, but it’s a park entrance

StreetView as an RL environment: StreetLearn

SLIDE 48

west side highway

StreetView as an RL environment: StreetLearn

SLIDE 49

curved roads and tunnels

StreetView as an RL environment: StreetLearn

SLIDE 50

really, tunnels!

StreetView as an RL environment: StreetLearn

SLIDE 51

StreetLearn: The Courier Task

  1. Spawn randomly and navigate to a random target location.
  2. Start receiving reward when close to the target (within 400m).
  3. If the target is reached (100m), navigate to a new random target.
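The three rules above suggest a reward function of roughly this shape (the linear shaping between 400m and 100m is an assumption; the slide only states when reward starts and when the goal counts):

```python
def courier_reward(distance_m, reward_radius=400.0, goal_radius=100.0):
    """Sketch of the courier reward: zero far away, shaped reward inside
    400 m, goal event inside 100 m. Returns (reward, goal_reached)."""
    if distance_m <= goal_radius:
        return 1.0, True           # goal reached: a new target is sampled
    if distance_m <= reward_radius:
        # reward ramps up as the agent closes in (assumed linear shaping)
        frac = (reward_radius - distance_m) / (reward_radius - goal_radius)
        return frac, False
    return 0.0, False              # too far: no reward signal yet
```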

SLIDE 52

Agent architecture

[Diagram: image → CNN; inputs r_{t-1}, a_{t-1} and target feed an LSTM producing the policy (π, V); a Relative pathway LSTM predicts local graph neighbours, and a Global pathway LSTM predicts absolute heading]


SLIDE 58

Lab Mazes & Auxiliary Learning · StreetLearn & Real-World RL · Parkour & Continuous Control · Multiple Tasks & Lifelong Learning

SLIDE 59

Proprioceptive and exteroceptive observations

Proprioceptive -- “near the body”:
  • Joint angles & velocities
  • Touch sensors
  • Positions and velocities of limbs in body coordinate frame

SLIDE 60

Proprioceptive and exteroceptive observations

Proprioceptive -- “near the body”:
  • Joint angles & velocities
  • Touch sensors
  • Positions and velocities of limbs in body coordinate frame

Exteroceptive -- “away from the body”:
  • Position / velocity in global coordinate frame
  • Task-related (e.g. goal position)
  • Vision
SLIDE 61

Rich environments for skill discovery: setup

Training:
  • Proximal policy optimization [Schulman et al.]
  • Batched policy gradient
  • Trust region (“gradient-based TRPO”)
  • High-performance implementation:
    ○ Distributed (multiple workers)
    ○ Synchronous gradient updates

[Diagram: policy maps proprioception and terrain observations to actions]

Nicolas Heess, et al. 2016: “Learning and transfer of modulated locomotor controllers”
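PPO's trust region is enforced through a clipped surrogate objective; a numpy sketch (eps=0.2 is the common default from Schulman et al., not a value stated in the talk):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r = pi_new(a|s) / pi_old(a|s). Clipping removes the incentive
    to move the policy far from the data-collecting policy."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))
```

Each batched policy-gradient step ascends this objective over the synchronously gathered rollouts, giving TRPO-like stability without a second-order solver.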

SLIDE 62

Single uniform reward, based on forward progress

Nicolas Heess, et al. 2017: 
 “Emergence of Locomotion Behaviours in Rich Environments”

SLIDE 63

Humanoid: learned behaviors

  • 27 DoFs
  • 21 actuators

Nicolas Heess, et al. 2017: 
 “Emergence of Locomotion Behaviours in Rich Environments”

SLIDE 64

  • Can deep RL agents learn multiple tasks?
  • Can deep RL agents learn efficiently?
  • Can deep RL agents learn from real data?
  • Can deep RL agents learn continuous control?
SLIDE 65

Thank you!

Overcoming Catastrophic Forgetting in Neural Networks, 2016

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell

Progressive Neural Networks, 2016

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, Raia Hadsell

Distral: Robust Multitask Reinforcement Learning, 2017

Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, Razvan Pascanu

Learning to navigate in complex environments, 2017

Piotr Mirowski*, Razvan Pascanu*, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, Raia Hadsell

Learning and transfer of modulated locomotor controllers, 2016

Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, David Silver

Emergence of Locomotion Behaviours in Rich Environments, 2017

Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, David Silver