Deep Learning for Robotics (Pieter Abbeel)



SLIDE 1

Deep Learning for Robotics
Pieter Abbeel

SLIDE 2

Reinforcement Learning (RL)

• Robotics
• Marketing / Advertising
• Dialogue
• Optimizing operations / logistics
• Queue management

Robot + Environment

π_θ(a|s): probability of taking action a in state s

\max_\theta \; \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]
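To make the objective concrete, here is a minimal REINFORCE-style sketch of optimizing it; this is an illustrative toy (the environment interface, the tabular softmax policy, and all names are assumptions, not anything from the talk).

```python
# Minimal REINFORCE sketch for max_theta E[ sum_t R(s_t) | pi_theta ].
# Illustrative only: assumes a toy discrete environment exposing
# reset() -> state and step(a) -> (state, reward, done).
import numpy as np

def softmax_policy(theta, s):
    logits = theta[s]                          # one row of logits per state
    p = np.exp(logits - logits.max())
    return p / p.sum()                         # pi_theta(a | s)

def reinforce_update(env, theta, lr=0.01, horizon=100):
    s, states, actions, rewards = env.reset(), [], [], []
    for _ in range(horizon):                   # roll out one episode
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)      # sample a ~ pi_theta(. | s)
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    returns = np.cumsum(rewards[::-1])[::-1]   # return-to-go at each step
    for s_t, a_t, G_t in zip(states, actions, returns):
        p = softmax_policy(theta, s_t)
        grad_log_pi = -p
        grad_log_pi[a_t] += 1.0                # d log pi(a_t|s_t) / d logits
        theta[s_t] += lr * G_t * grad_log_pi   # gradient ascent on the objective
    return sum(rewards)
```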

SLIDE 3

Reinforcement Learning (RL)

• Goal:  \max_\theta \; \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

  π_θ(a|s): probability of taking action a in state s (Robot + Environment)

• Additional challenges:
  • Stability
  • Credit assignment
  • Exploration

SLIDE 4

Reinforcement Learning (RL)

• Goal:  \max_\theta \; \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

  π_θ(a|s): probability of taking action a in state s (Robot + Environment)

SLIDE 5

Deep RL Success Stories

• DQN: Mnih et al., NIPS 2013 / Nature 2015
• Silver et al., Nature 2015
• Gu et al., NIPS 2014
• TRPO: Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015
• Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016
• A3C: Mnih et al., 2016
• Levine*, Finn*, Darrell, Abbeel, JMLR 2016

SLIDE 6

Speed of Learning

                          Deep RL (DQN)    Human
Score:                    18.9             9.3
Experience (real time):   40 days          2 hours
                          “Slow”           “Fast”

SLIDE 7

Starting Observations

• TRPO, DQN, A3C are fully general RL algorithms
  • i.e., for any MDP that can be mathematically defined, these algorithms are equally applicable
• MDPs encountered in the real world = a tiny, tiny subset of all MDPs that could be defined
• Can we design “fast” RL algorithms that take advantage of such knowledge?

SLIDE 8

Research Questions

• How do we acquire a good prior for real-world MDPs?
  • Or, for starters, for real-games MDPs?
• How do we design algorithms that make use of such prior information?

Key idea: learn a fast RL algorithm that encodes this prior.

SLIDE 9

Formulation

• Given: a distribution over relevant MDPs
• Train the fast RL algorithm to be fast on a training set of MDPs

SLIDE 10

Formulation

SLIDE 11

Learning the Fast RL Algorithm

• Representation of the fast RL algorithm:
  • RNN = generic computation architecture
  • different weights in the RNN mean a different RL algorithm
  • different activations in the RNN mean a different current policy
• Training setup: (see the sketch below)
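A schematic of what that setup can look like in code; this is a sketch under assumptions (the GRU cell, layer sizes, the exact inputs), not the released RL² implementation.

```python
# Sketch of the RL^2 agent: the RNN is the "fast" RL algorithm.
# Assumptions, not the paper's code: GRU cell, discrete actions, and an
# input of (observation, one-hot previous action, previous reward, done).
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + n_actions + 2, hidden_dim)
        self.pi = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, prev_action_onehot, prev_reward, done, h):
        x = torch.cat([obs, prev_action_onehot, prev_reward, done], dim=-1)
        h = self.rnn(x, h)   # hidden activations = state of the fast learner
        return torch.distributions.Categorical(logits=self.pi(h)), h

# Training loop shape: sample an MDP from the training distribution, run
# several episodes while PRESERVING h across episode boundaries (so the agent
# can exploit what it learned earlier in the trial), reset h only between
# MDPs, and update the weights with a slow RL algorithm such as TRPO.
```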

SLIDE 12

Alternative View on RL²

• RNN = policy for acting in a POMDP
• Part of what’s not observed in the POMDP is which MDP the agent is in

SLIDE 13

Related Work

• Wang et al. (2016), Learning to Reinforcement Learn, in submission to ICLR 2017
• Chen et al. (2016), Learning to Learn for Global Optimization of Black Box Functions
• Andrychowicz et al. (2016), Learning to Learn by Gradient Descent by Gradient Descent
• Santoro et al. (2016), One-shot Learning with Memory-Augmented Neural Networks
• Larochelle et al. (2008), Zero-data Learning of New Tasks
• Younger et al. (2001), Meta Learning with Backpropagation
• Schmidhuber et al. (1996), Simple Principles of Metalearning

SLIDE 14

RL²: Fast RL by Slow RL

• Key insights:
  • We represent the AI agent as a recurrent neural net (RNN)
    • i.e., the RNN is the “fast” RL algorithm
    • different weights in the RNN mean a different RL algorithm
  • To discover good weights for the RNN (i.e., to discover the fast RL algorithm), train with classical (“slow”) RL

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]

SLIDE 15

Evaluation

• Multi-Armed Bandits
  • Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …
  • Example: a 5-armed bandit

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
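As a reference point, a minimal sketch of one of those hand-designed baselines (UCB1) on a 5-armed Bernoulli bandit; the arm probabilities and all names here are made up for illustration.

```python
# Minimal UCB1 sketch, one of the human-designed bandit baselines named above.
# Assumes rewards in [0, 1]; arm probabilities below are illustrative.
import math
import random

def ucb1(arms, n_steps):
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    total = 0.0
    for t in range(1, n_steps + 1):
        if t <= len(arms):
            a = t - 1                                  # play each arm once
        else:
            a = max(range(len(arms)),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = arms[a]()                                  # pull arm a
        counts[a] += 1; sums[a] += r; total += r
    return total

probs = [0.1, 0.3, 0.5, 0.7, 0.9]                      # 5-armed Bernoulli bandit
arms = [lambda p=p: float(random.random() < p) for p in probs]
print(ucb1(arms, 1000))
```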

SLIDE 16

Evaluation

• Multi-Armed Bandits

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]

SLIDE 17

Evaluation

• Multi-Armed Bandits

SLIDE 18

Evaluation: Tabular MDPs

• Provably (asymptotically) optimal algorithms: BEB, PSRL, UCRL2, …
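For contrast with the learned algorithm, a sketch of one of those tabular baselines (PSRL): sample an MDP from the posterior, solve it, act greedily for an episode, update the posterior. The priors, the environment interface, and all names are assumptions for illustration.

```python
# PSRL sketch (posterior sampling for RL), one of the baselines named above.
# Assumptions: Dirichlet posterior over transitions, Beta posterior over
# Bernoulli rewards, toy env with reset() -> s and step(a) -> (s, r, done).
import numpy as np

def psrl_episode(env, t_counts, r_ab, horizon, gamma=0.99):
    S, A, _ = t_counts.shape
    # 1) Sample one MDP from the posterior.
    P = np.array([[np.random.dirichlet(t_counts[s, a] + 1.0)
                   for a in range(A)] for s in range(S)])       # (S, A, S)
    R = np.random.beta(r_ab[..., 0] + 1.0, r_ab[..., 1] + 1.0)  # (S, A)
    # 2) Solve the sampled MDP with value iteration.
    V = np.zeros(S)
    for _ in range(200):
        Q = R + gamma * (P @ V)        # (S, A)
        V = Q.max(axis=1)
    policy = Q.argmax(axis=1)
    # 3) Act greedily w.r.t. the sample; update the posterior counts.
    s = env.reset()
    for _ in range(horizon):
        a = policy[s]
        s2, r, done = env.step(a)      # r assumed to be 0 or 1 here
        t_counts[s, a, s2] += 1
        r_ab[s, a, 0] += r
        r_ab[s, a, 1] += 1 - r
        s = s2
        if done:
            break
```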

SLIDE 19

Evaluation: Tabular MDPs

SLIDE 20

Evaluation: Tabular MDPs

SLIDE 21

Evaluation: Visual Navigation

(built on top of ViZDoom)

[Panels: agent’s view; small maze; large maze]

SLIDE 22

Evaluation: Visual Navigation

[Panels: after learning; before learning]

SLIDE 23

Evaluation: Visual Navigation

• Visual navigation (built on top of ViZDoom)
• Occasional “bad” behavior

SLIDE 24

Evaluation: Visual Navigation

SLIDE 25

Evaluation

SLIDE 26

OpenAI Universe

SLIDE 27

Outline

• RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel)
• Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever)
• Variational Lossy Autoencoder (Xi Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel)
• Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine)

SLIDE 28

Third Person Imitation Learning

• First person imitation learning
  • Demonstrate with the robot itself
  • E.g., drive a car, tele-operate a robot, etc.
• Third person imitation learning
  • Robot watches demonstrations
  • Challenges:
    • Different viewpoint
    • Different incarnation (human vs. robot)

SLIDE 29

Third Person Imitation Learning

• Example problem settings:
  • Third-person view
  • Robot environment

SLIDE 30

Basic Ideas

• Generative Adversarial Imitation Learning (Ho et al., 2016; Finn et al., 2016)
  • Reward = defined by a learned classifier distinguishing expert from robot behavior
  → Optimizing such a reward makes the robot perform like the expert
  → Works well for first person imitation learning
  BUT: in the third person setting, the classifier will simply identify expert vs. robot environment, and the robot can never match the expert
• Domain confusion loss (Tzeng et al., 2015)
  • Deep-learn a feature representation from which it isn’t possible to distinguish the environment
  BUT: competes too directly with the first objective
• Let the first objective take multiple frames as input, i.e., see behavior (see the sketch below)
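A sketch of how those pieces might fit together; the layer sizes, the gradient-reversal trick for the domain confusion term, and all names are assumptions for illustration, not the paper's architecture.

```python
# Sketch (hypothetical, simplified): a GAIL-style discriminator on
# multi-frame feature stacks, plus a domain classifier trained through a
# gradient-reversal layer so the features cannot reveal the environment.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # reverse gradients into the features

class ThirdPersonDiscriminator(nn.Module):
    def __init__(self, frame_feat_dim, n_frames=4, lam=0.5):
        super().__init__()
        self.lam = lam
        d = frame_feat_dim * n_frames  # multiple frames: behavior, not appearance
        self.features = nn.Sequential(nn.Linear(d, 256), nn.ReLU())
        self.expert_head = nn.Linear(256, 1)   # expert vs. novice (reward signal)
        self.domain_head = nn.Linear(256, 1)   # expert env vs. robot env

    def forward(self, frame_stack):
        z = self.features(frame_stack)
        expert_logit = self.expert_head(z)
        domain_logit = self.domain_head(GradReverse.apply(z, self.lam))
        return expert_logit, domain_logit

# Training intuition: minimize BCE on both heads; the reversal layer makes the
# shared features maximally confusing for the domain head (domain confusion),
# while the expert head still scores behavior across the stacked frames.
```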

SLIDE 31

Architecture


SLIDE 32

Learning Curves


SLIDE 33

Domain Classification Accuracy


SLIDE 34

Does the algorithm we propose benefit from both domain confusion and the multi-time-step input?


SLIDE 35

How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment? (Domain confusion weight λ)


SLIDE 36

How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment? (Number of lookahead frames)


SLIDE 37

Results

[Panels: third person view on expert; imitator]

SLIDE 38

Outline

• RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel)
• Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever)
• Probabilistically Safe Policy Transfer (David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel)
• Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine)

SLIDE 39

Risky Robotics Tasks

• Autonomous driving
• Robots interacting around / with people
• Robots manipulating fragile objects
• Robot can damage itself

Question: How do we train a robot for these tasks?

SLIDE 40

Previous Approaches

• Isolated training environment (e.g., a cage)
  • May not represent the test environment
  • No human interaction / collaboration in an isolated environment
• Train in simulation
  • May not represent the test environment
• Watch the robot carefully and try to press the kill switch in time
  • Requires careful observation and prediction of robot actions by a human, who may not react fast enough

SLIDE 41

Our Approach

• Operate in the test environment, initially with low torques
  • Use the lowest torques at which the task can still be completed, safely but slowly
  • Assumption: low torques are safer but less efficient
• Increase the torque limit as the robot demonstrates that it can operate safely
• How do we define safety?

SLIDE 42

How to Define Safety?

• D_safe defines our “safety budget”: how much expected damage we can afford
• Example for an autonomous car:
  • low risk of hitting a pedestrian at low speeds
  • even lower risk of killing a pedestrian

[Figure: expected damage]

slide-43
SLIDE 43

How to Define Safety?

• Higher torques -> lower time per task -> more benefit
• Higher torques -> more damage
• Torques are clipped at T_lim
• Assume a binary probability of being unsafe:
  • the probability of failure to be safe (w.l.o.g. assume α = 1)
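Under the Gaussian-policy picture on the following slides, this budget check can be written down directly; a minimal sketch, assuming a 1-D Gaussian torque command clipped at ±t_lim (all names are illustrative, not the paper's code).

```python
# Expected-damage sketch under a binary unsafe event with alpha = 1:
# expected damage = alpha * P(unsafe). Hypothetical illustration assuming a
# 1-D Gaussian torque command clipped at +/- t_lim.
from scipy.stats import norm

def expected_damage(mu, sigma, t_lim, alpha=1.0):
    p_fail = norm.cdf(-t_lim, mu, sigma) + (1.0 - norm.cdf(t_lim, mu, sigma))
    return alpha * p_fail              # damage-if-unsafe times P(unsafe)

D_SAFE = 0.05                          # illustrative safety budget
print(expected_damage(mu=1.0, sigma=2.0, t_lim=6.0) <= D_SAFE)
```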

SLIDE 44

How to Define Safety?

• High probability of failure -> low torques
• Low probability of failure -> higher torques

[Figure: expected damage]

SLIDE 45

Overall Algorithm

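A high-level sketch of the loop this slide's figure describes, under simplifying assumptions (1-D Gaussian torque policy, worst-case failure accounting); this is an interpretation, not the authors' released code.

```python
# Sketch of the adaptive torque-limit step (an interpretation under
# simplifying assumptions, not the authors' released code).
from scipy.stats import norm

def clip_mass(mu, sigma, t_lim):
    """Probability a Gaussian torque command lands beyond +/- t_lim."""
    return norm.cdf(-t_lim, mu, sigma) + (1.0 - norm.cdf(t_lim, mu, sigma))

def adapted_limit(mu, sigma, t_lim, d_budget, t_step=0.5, t_max=10.0):
    # Worst case: every sample the higher limit stops clipping becomes a new
    # failure, i.e. the mass between the old and new limits.
    d_from_limit = clip_mass(mu, sigma, t_lim) - clip_mass(mu, sigma, t_lim + t_step)
    if d_from_limit <= d_budget:
        return min(t_lim + t_step, t_max)
    return t_lim

# Per iteration: run the policy at the current limit, take a TRPO step, then
# raise the limit only if the predicted failure-rate increase (from the limit
# change plus the policy update, see the following slides) fits within D_safe.
```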

SLIDE 46

Predicting Failure Increases

• Effect from adjusting the torque limit
• Effect from updating the policy

SLIDE 47

Predicting Failure Increases

• Effect from changing the torque limit
• Effect from updating the policy

[Figure: PDFs of the old and new policy over torque, with their intersection marked; tail masses F(-T_lim) and 1 - F(T_lim)]

SLIDE 48

Predicting Failure Increases: Adjusting the Torque Limit

• The policy is represented as a Gaussian
• Due to the torque limits, the applied torque is given by a truncated Gaussian
  • the clipped tail masses are F(-T_lim) and 1 - F(T_lim), where F is the Gaussian CDF

SLIDE 49

Predicting Failure Increases: Adjusting the Torque Limit

• Key observation: the sampled torque only changes for torques that fall outside the old limit
• If all such changes lead to new failures, then, due to the torque-limit adjustment, the failure rate can increase by at most the probability mass between the old and new limits (cf. the F(-T_lim) and 1 - F(T_lim) tails)
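The formula lost from this slide can plausibly be reconstructed from the truncated-Gaussian picture (an assumption consistent with the F(-T_lim) and 1 - F(T_lim) annotations): for a raised limit T' > T_lim,

\Delta p_{\text{fail}}^{\text{limit}} \;\le\; \big[F(-T_{\text{lim}}) - F(-T')\big] + \big[F(T') - F(T_{\text{lim}})\big],

where F is the CDF of the Gaussian torque command.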

SLIDE 50

Predicting Failure Increases: Updating the Policy

• Updating the policy can also lead to failures
• TRPO policy adjustment:
• The failure rate can then increase by:
• Total increase in failure rate:

[Figure: PDFs of the old and new policy over torque, with their intersection marked]
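The formulas lost from this slide are not recoverable from the extraction; one standard way to bound the policy-update term, offered only as a plausible reading since TRPO constrains the KL divergence between successive policies, is via Pinsker's inequality:

\Delta p_{\text{fail}}^{\text{update}} \;\le\; \mathrm{TV}(\pi_{\text{old}}, \pi_{\text{new}}) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(\pi_{\text{old}} \,\|\, \pi_{\text{new}})},
\qquad
\Delta p_{\text{fail}} \;\le\; \Delta p_{\text{fail}}^{\text{limit}} + \Delta p_{\text{fail}}^{\text{update}}.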

SLIDE 51

Adaptive vs. Fixed Torque Limits

[Figure: torque limit (N·m) vs. number of iterations]

SLIDE 52

Adaptive vs. Fixed Torque Limits

[Figure: expected damage vs. number of iterations, with the D_safe budget marked]

SLIDE 53

Ablation

[Figure: expected damage vs. number of iterations (500 to 2500), each panel against the D_safe budget: our full method; V2: not predicting the effect of the torque-limit increase; V3: not predicting the effect of the policy update; V4: not predicting either effect]

SLIDE 54

Varying D_safe

[Figure: expected damage vs. number of iterations for several values of D_safe]

SLIDE 55

Sim2Real: Safe Policy Transfer

[Panels: simulation; reality]

SLIDE 56

Outline

• RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel)
• Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever)
• Probabilistically Safe Policy Transfer (David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel)
• Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine)

SLIDE 57

Tensegrity Robotics: NASA SuperBall

• Rigid rods connected by elastic cables
• Controlled by motors that extend / contract the cables
• Properties:
  • Lightweight
  • Low cost
  • Capable of withstanding significant impact
• NASA investigates them for space exploration
• Major challenge: control

SLIDE 58

NASA SUPERball – after training with Guided Policy Search

SLIDE 59

Summary

• RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel)
• Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever)
• Probabilistically Safe Policy Transfer (David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel)
• Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine)

SLIDE 60

Frontiers

• Shared and transfer learning
• Memory
• Estimation
• Temporal hierarchy / goal setting
• Safe learning
• Value alignment
• Applications

SLIDE 61

Thank you
