Deep Learning for Robotics
Pieter Abbeel
Reinforcement Learning (RL)

- Robotics
- Marketing / Advertising
- Dialogue
- Optimizing operations / logistics
- Queue management
- …

Robot + Environment: $\pi_\theta(a|s)$ = probability of taking action a in state s

$$\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$$
Reinforcement Learning (RL)

- Goal: $\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$, where $\pi_\theta(a|s)$ is the probability of taking action a in state s (robot + environment)
- Additional challenges:
  - Stability
  - Credit assignment
  - Exploration
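To ground the objective, here is a minimal REINFORCE-style sketch of optimizing $\max_\theta \mathbb{E}[\sum_t R(s_t) \mid \pi_\theta]$ with a linear-softmax policy. It is illustrative only, not from the talk: `env` is assumed to be a Gym-style environment whose `step` returns `(next_state, reward, done)`.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def episode(env, theta, horizon):
    """Roll out pi_theta(a|s) = softmax(theta^T s) for up to `horizon` steps."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        probs = softmax(theta.T @ s)            # pi_theta(a|s)
        a = np.random.choice(len(probs), p=probs)
        s2, r, done = env.step(a)               # assumed Gym-style interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
        if done:
            break
    return states, actions, rewards

def reinforce_update(env, theta, horizon, lr=0.01):
    """One REINFORCE step on max_theta E[sum_t R(s_t) | pi_theta]."""
    states, actions, rewards = episode(env, theta, horizon)
    returns = np.cumsum(rewards[::-1])[::-1]    # reward-to-go
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta.T @ s)
        # grad log pi_theta(a|s) for a linear-softmax policy:
        # outer(s, onehot(a) - probs)
        dlog = np.outer(s, -probs)
        dlog[:, a] += s
        grad += G * dlog
    return theta + lr * grad
```

The high variance of this estimator is exactly why stability, credit assignment, and exploration appear as challenges on the slide.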
Deep RL Success Stories

- DQN: Mnih et al., NIPS 2013 / Nature 2015
- TRPO: Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015
- Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016
- A3C: Mnih et al., 2016
- Silver et al., Nature 2015
- Gu et al., NIPS 2014
- Levine*, Finn*, Darrell, Abbeel, JMLR 2016
Speed of Learning: Deep RL (DQN) vs. Human

                                     Deep RL (DQN)   Human
Score                                18.9            9.3
Experience (measured in real time)   40 days         2 hours
                                     "Slow"          "Fast"
Starting Observations

- TRPO, DQN, A3C are fully general RL algorithms
  - i.e., for any MDP that can be mathematically defined, these algorithms are equally applicable
- MDPs encountered in the real world are a tiny, tiny subset of all MDPs that could be defined
- Can we design "fast" RL algorithms that take advantage of such knowledge?
Research Questions

- How to acquire a good prior for real-world MDPs?
  - Or, for starters, e.g., for real-games MDPs?
- How to design algorithms that make use of such prior information?

Key idea: learn a fast RL algorithm that encodes this prior.
Formulation

- Given: a distribution over relevant MDPs
- Train the fast RL algorithm to be fast on a training set of MDPs
Learning the Fast RL Algorithm

- Representation of the fast RL algorithm:
  - RNN = generic computation architecture
  - different weights in the RNN mean a different RL algorithm
  - different activations in the RNN mean a different current policy
- Training setup (sketched below)
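The training setup is easiest to see as pseudocode. Below is a minimal sketch of one RL² trial, assuming hypothetical `rnn_policy` and `sample_mdp` interfaces (illustrative, not the paper's code): the hidden state persists across episodes of the same sampled MDP, the RNN is fed its own rewards and termination flags, and the trial's total reward is what the outer "slow" RL (TRPO in the paper) maximizes over the RNN weights.

```python
def rl2_trial(rnn_policy, sample_mdp, episodes_per_trial=2, horizon=100):
    """Collect one RL^2 trial: several episodes of the SAME sampled MDP.

    The RNN hidden state persists across episode boundaries within the
    trial, so its activations can implement a "fast" RL algorithm that
    adapts to the sampled MDP. Hidden state is reset between trials.
    """
    env = sample_mdp()                    # draw one MDP from the distribution
    h = rnn_policy.initial_state()        # reset only at trial start
    total_reward = 0.0
    for _ in range(episodes_per_trial):
        s, r, done = env.reset(), 0.0, False
        for _ in range(horizon):
            # Input includes the last reward and done flag so the RNN can
            # learn from its own experience, as an RL algorithm would.
            a, h = rnn_policy.step(obs=s, last_reward=r, last_done=done, state=h)
            s, r, done = env.step(a)
            total_reward += r
            if done:
                break
    return total_reward   # maximized over RNN weights by the outer "slow" RL
```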
Alternative View on RL²

- RNN = policy for acting in a POMDP
- Part of what's not observed in the POMDP is which MDP the agent is in
Related Work

- Wang et al. (2016), Learning to Reinforcement Learn, in submission to ICLR 2017
- Chen et al. (2016), Learning to Learn for Global Optimization of Black Box Functions
- Andrychowicz et al. (2016), Learning to Learn by Gradient Descent by Gradient Descent
- Santoro et al. (2016), One-shot Learning with Memory-Augmented Neural Networks
- Larochelle et al. (2008), Zero-data Learning of New Tasks
- Younger et al. (2001), Meta Learning with Backpropagation
- Schmidhuber et al. (1996), Simple Principles of Metalearning
RL²: Fast RL by Slow RL

Key insights:

- We represent the AI agent as a Recurrent Neural Net (RNN)
  - i.e., the RNN is the "fast" RL algorithm
  - different weights in the RNN mean a different RL algorithm
- To discover good weights for the RNN (i.e., to discover the fast RL algorithm), train with classical ("slow") RL

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Evaluation: Multi-Armed Bandits

- Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …

[Image: 5-armed bandit (source: ebay)]

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
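For reference, one of these human-designed baselines is easy to state. Here is a minimal Thompson-sampling sketch for a 5-armed Bernoulli bandit, the textbook algorithm rather than code from the paper; `pull_arm` is an illustrative callback returning 0 or 1.

```python
import numpy as np

def thompson_bernoulli(pull_arm, n_arms=5, n_pulls=500):
    """Thompson sampling with a Beta(1,1) prior on each arm's success rate."""
    successes = np.ones(n_arms)   # Beta alpha (starts at the prior)
    failures = np.ones(n_arms)    # Beta beta
    total = 0
    for _ in range(n_pulls):
        # Sample a plausible success rate per arm from its posterior,
        # then greedily pick the arm whose sample is largest.
        samples = np.random.beta(successes, failures)
        arm = int(np.argmax(samples))
        reward = pull_arm(arm)    # 0 or 1
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += reward
    return total
```

The RL² question is whether a trained RNN can rediscover behavior of this quality from the MDP distribution alone.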
Evaluation: Tabular MDPs

- Provably (asymptotically) optimal algorithms: BEB, PSRL, UCRL2, …
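PSRL, one of the listed baselines, is the MDP analogue of Thompson sampling: sample an MDP from the posterior, solve it, act under the resulting policy for an episode, update the posterior. A hedged, minimal tabular sketch follows; integer states, the Gym-style `env`, the Dirichlet transition posterior, and the empirical mean-reward estimate are illustrative simplifications, not the paper's exact setup.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """Solve a tabular MDP: P[s,a,s'] transition probs, R[s,a] mean rewards."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V        # Q[s,a] = R[s,a] + gamma * sum_s' P V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)          # greedy policy, one action per state

def psrl_episode(counts, reward_sums, env, horizon=50):
    """One PSRL episode. counts[s,a,s'] are transition counts,
    reward_sums[s,a] are summed observed rewards."""
    S, A, _ = counts.shape
    # Posterior sample: Dirichlet over next-state distributions
    # (the +1 acts as the prior); rewards use a simple empirical mean.
    P = np.array([[np.random.dirichlet(counts[s, a] + 1) for a in range(A)]
                  for s in range(S)])
    R = reward_sums / np.maximum(counts.sum(axis=2), 1)
    policy = value_iteration(P, R)
    s = env.reset()
    for _ in range(horizon):
        a = policy[s]
        s2, r, done = env.step(a)
        counts[s, a, s2] += 1        # Bayesian update via sufficient statistics
        reward_sums[s, a] += r
        s = s2
        if done:
            break
```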
Evaluation: Visual Navigation

- Built on top of ViZDoom

[Figures: agent's view; small maze; large maze]
[Videos: before learning vs. after learning]

- Occasional "bad" behavior
Evaluation: OpenAI Universe
Outline

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Variational Lossy Autoencoder
  Xi Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Third-Person Imitation Learning

- First-person imitation learning
  - Demonstrate with the robot itself
  - E.g., drive a car, tele-operate a robot, etc.
- Third-person imitation learning
  - Robot watches demonstrations
  - Challenges:
    - Different viewpoint
    - Different incarnation (human vs. robot)
Third-Person Imitation Learning

- Example problem settings:

[Figures: third-person view vs. robot environment]
Basic Ideas

- Generative Adversarial Imitation Learning (Ho et al., 2016; Finn et al., 2016)
  - Reward = defined by a learned classifier distinguishing expert from robot behavior
  - → Optimizing such a reward makes the robot perform like the expert
  - → Works well for first-person imitation learning
  - BUT: in the third-person setting, the classifier will simply identify expert vs. robot environment, and the robot can never match the expert
- Domain confusion loss (Tzeng et al., 2015)
  - Deep-learn a feature representation from which it isn't possible to distinguish the environment
  - BUT: competes too directly with the first objective
- Let the first objective have multiple frames, i.e., see behavior (sketched below)
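A hedged sketch of how these pieces can fit together (illustrative PyTorch, not the paper's code): a shared feature extractor over a stack of frames, a discriminator head whose output defines the imitation reward, and a domain-classifier head trained through a gradient-reversal layer so the features cannot reveal which environment a clip came from. All module and function names here are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients on the
    backward pass, so the feature extractor learns to FOOL the domain head."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -ctx.lam * g, None

class ThirdPersonDiscriminator(nn.Module):
    def __init__(self, frame_dim, n_frames=4, hidden=128, lam=1.0):
        super().__init__()
        self.lam = lam
        # Multi-frame input: classify *behavior over time*, not single images.
        self.features = nn.Sequential(
            nn.Linear(frame_dim * n_frames, hidden), nn.ReLU())
        self.expert_head = nn.Linear(hidden, 1)   # expert vs. novice behavior
        self.domain_head = nn.Linear(hidden, 1)   # which environment/viewpoint

    def forward(self, frames):                    # frames: (B, n_frames*frame_dim)
        z = self.features(frames)
        expert_logit = self.expert_head(z)
        # Domain confusion: gradient reversal pushes `features` to be
        # uninformative about the domain while the head still tries to classify.
        domain_logit = self.domain_head(GradReverse.apply(z, self.lam))
        return expert_logit, domain_logit

def imitation_reward(expert_logit):
    """GAIL-style reward: high when the behavior looks expert-like."""
    return torch.log(torch.sigmoid(expert_logit) + 1e-8)
```

The gradient-reversal weight corresponds to the λ hyperparameter whose sensitivity is examined below.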
Architecture

[Figure: network architecture]

Learning Curves

[Figure: learning curves]

Domain Classification Accuracy

[Figure: domain classification accuracy]

- Does the algorithm we propose benefit from both domain confusion and the multi-time-step input?
- How sensitive is our proposed algorithm to the hyperparameters used in deployment? (Domain confusion weight λ)
- How sensitive is our proposed algorithm to the hyperparameters used in deployment? (Number of lookahead frames)
Results

[Videos: third-person view on expert; imitator]
Outline

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Probabilistically Safe Policy Transfer
  David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Risky Robotics Tasks

- Autonomous driving
- Robots interacting around / with people
- Robots manipulating fragile objects
- Robot can damage itself

Question: how to train a robot for these tasks?
Previous Approaches

- Isolated training environment (e.g., cage)
  - May not represent the test environment
  - No human interaction / collaboration in an isolated environment
- Train in simulation
  - May not represent the test environment
- Watch the robot carefully, try to press the kill switch in time
  - Requires careful observation and prediction of robot actions by a human; may not react fast enough
Our Approach

- Operate in the test environment, initially with low torques
  - Use the lowest torques at which the task can still be completed, safely and slowly
  - Assumption: low torques are safer but less efficient
- Increase the torque limit as the robot demonstrates that it can operate safely
- How do we define safety?
How to Define Safety?

- D_safe defines our "safety budget": how much expected damage we can afford
- Example for an autonomous car:
  - Low risk of hitting a pedestrian at low speeds
  - Even lower risk of killing a pedestrian
How to Define Safety?

Expected damage:

- Higher torques → lower time per task → more benefit
- Higher torques → more damage
- Torques are clipped at T_lim
- Assume a binary notion of being unsafe, with some probability of failing to be safe
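Under the binary-failure assumption, the safety-budget bookkeeping is simple. A minimal sketch, where `p_fail`, `damage_per_failure`, and `D_safe` are illustrative stand-ins for the slide's quantities:

```python
def expected_damage(p_fail, damage_per_failure=1.0):
    """Binary safe/unsafe model: E[damage] = P(failure) * damage-if-failure.
    With the damage scale normalized to 1 (the slides' "wlog alpha = 1"),
    expected damage reduces to the failure probability itself."""
    return p_fail * damage_per_failure

def within_budget(p_fail, D_safe, damage_per_failure=1.0):
    """Safety-budget check: proceed (e.g., raise the torque limit) only
    while the predicted expected damage stays within D_safe."""
    return expected_damage(p_fail, damage_per_failure) <= D_safe
```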
How to Define Safety?

(w.l.o.g. assume α = 1)

- High probability of failure → low torques
- Low probability of failure → higher torques
Overall Algorithm

- Effect from adjusting the torque limit
- Effect from updating the policy

(both effects are combined in the sketch below)
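A hedged sketch of one iteration of the overall loop; all function names are hypothetical stand-ins for the paper's components, passed in as callbacks so the skeleton stays self-contained:

```python
def safe_transfer_step(policy_update_fn, limit_risk_fn, policy_risk_fn,
                       policy, T_lim, p_fail, D_safe, dT=0.5):
    """One iteration of adaptive torque-limit training, schematically.

    limit_risk_fn  -> predicted failure increase from raising the limit
                      (effect (a); see the tail-mass sketch further below)
    policy_risk_fn -> predicted failure increase from the policy update
                      (effect (b), e.g. bounding the TRPO step)
    """
    dp_limit = limit_risk_fn(policy, T_lim, T_lim + dT)     # effect (a)
    new_policy = policy_update_fn(policy)                   # TRPO step
    dp_policy = policy_risk_fn(policy, new_policy, T_lim)   # effect (b)

    # Only raise the limit if the *predicted* failure rate stays in budget.
    if p_fail + dp_limit + dp_policy <= D_safe:
        T_lim += dT
    return new_policy, T_lim
```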
Predicting Failure Increases: Adjusting the Torque Limit

- The policy is represented as a Gaussian
- Due to torque limits, the applied torque is given by a truncated Gaussian

[Figure: old vs. new policy torque distributions with intersection marked; tail masses F(-T_lim) and 1 - F(T_lim) highlighted]
Predicting Failure Increases: Adjusting the Torque Limit

- Key observation: the sampled torque only changes for torques that the old limit was clipping
- If all such changes lead to new failures, then, due to the torque limit adjustment, the failure rate can increase by at most the clipped tail mass, F(-T_lim) + (1 - F(T_lim))
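Since the applied torque is the policy's Gaussian clipped to [-T_lim, T_lim], the only torques affected by raising the limit are those in the clipped tails, whose mass is the F(-T_lim) and 1 - F(T_lim) marked on the slide's figure. A minimal sketch of the resulting worst-case bound using `scipy.stats.norm`; note the slide's exact expression was an image, so this reconstruction from the tail masses is an assumption:

```python
from scipy.stats import norm

def limit_change_risk(mu, sigma, T_old):
    """Worst-case increase in failure rate from raising the torque limit.

    The applied torque is a Gaussian policy output clipped to [-T_old, T_old];
    raising the limit only changes torques that were being clipped, i.e. the
    tail mass beyond +/- T_old. If every such change led to a new failure,
    the failure rate would increase by at most that mass.
    """
    lower_tail = norm.cdf(-T_old, loc=mu, scale=sigma)       # F(-T_lim)
    upper_tail = 1.0 - norm.cdf(T_old, loc=mu, scale=sigma)  # 1 - F(T_lim)
    return lower_tail + upper_tail
```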
Predicting Failure Increases: Updating the Policy

- Updating the policy can also lead to failures
- TRPO policy adjustment: [equation on slide]
- The failure rate can then increase by: [equation on slide]
- Total increase in failure rate: [equation on slide]
[Figure: old vs. new policy torque distributions with intersection marked]
Adaptive vs Fixed Torque Limits

[Figure: torque limit (N·m) vs. number of iterations]
Adaptive vs Fixed Torque Limits

[Figure: expected damage vs. number of iterations, with D_safe marked]
Ablation

[Figure: expected damage vs. number of iterations (500–2500), each panel against D_safe:
- Our full method
- V2: not predicting the effect of the torque limit increase
- V3: not predicting the effect of the policy update
- V4: not predicting either effect]
Varying D_safe

[Figure: expected damage vs. number of iterations for several values of D_safe]
Sim2Real: Safe Policy Transfer

[Videos: simulation vs. reality]
Outline

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Probabilistically Safe Policy Transfer
  David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Tensegrity Robotics: NASA SuperBall

- Rigid rods connected by elastic cables
- Controlled by motors that extend / contract cables
- Properties:
  - Lightweight
  - Low cost
  - Capable of withstanding significant impact
- NASA investigates them for space exploration
- Major challenge: control

[Video: NASA SUPERball after training with Guided Policy Search]
Summary

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Probabilistically Safe Policy Transfer
  David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Frontiers

- Shared and transfer learning
- Memory
- Estimation
- Temporal hierarchy / goal setting
- Safe learning
- Value alignment
- Applications
Thank you

Pieter Abbeel, UC Berkeley / OpenAI / Gradescope