Deep Learning for Robotics
Pieter Abbeel
UC Berkeley / OpenAI / Gradescope
Outline
- Some deep learning successes
- Deep reinforcement learning
- Current directions
Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope
Object Detection in Computer Vision
- State-of-the-art object detection until 2012:
  Input Image -> Hand-engineered features (SIFT, HOG, DAISY, …) -> Support Vector Machine (SVM) -> “cat” “dog” “car” …
- Deep Supervised Learning (Krizhevsky, Sutskever, Hinton 2012; also LeCun, Bengio, Ng, Darrell, …):
  Input Image -> 8-layer neural network with 60 million parameters to learn -> “cat” “dog” “car” …
  - ~1.2 million training images from ImageNet [Deng, Dong, Socher, Li, Li, Fei-Fei, 2009]
Performance
graph credit Matt Zeiler, Clarifai
AlexNet
Speech Recognition
graph credit Matt Zeiler, Clarifai
MS COCO Image Captioning Challenge
Karpathy & Fei-Fei, 2015; Donahue et al., 2015; Xu et al, 2015; many more
Visual QA Challenge
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
Unsupervised Learning
- Variational Autoencoders [Kingma and Welling, 2014]
  - DRAW [Gregor et al, 2015]
  - …
- Generative Adversarial Networks [Goodfellow et al, 2014]
  - DC-GAN [Radford, Metz, Chintala, 2016]
  - InfoGAN [Chen, Duan, Houthooft, Schulman, Sutskever, Abbeel, 2016]
  - …
- Pixel RNN [van den Oord et al, 2016]
  - Pixel CNN [van den Oord et al, 2016]
  - …
Image Generation – DC-GAN
Training
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, 2016
Comparison with Real Images
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, 2016
[Chen, Duan, Houthooft, Schulman, Sutskever, Abbeel, 2016]
InfoGAN
Robotics
- Current state-of-the-art robotics:
  Percepts -> Hand-engineered state-estimation -> Hand-engineered control policy class, hand-tuned (or learned) 10’ish free parameters -> Motor commands
- Deep reinforcement learning:
  Percepts -> Many-layer neural network with many parameters to learn -> Motor commands
Deep Learning for Estimation
Backprop KF [Haarnoja, Ajay, Levine, Abbeel, 2016]
Deep Tracking [Ondruska, Posner, 2016]
SE3 Nets [Byravan, Fox, 2016]
Structured Variational Autoencoders [Johnson, Duvenaud, Wiltschko, Datta, Adams, 2016]
Deep Estimation for Grasping/Control
Deep Learning for Detecting Robotic Grasps [Lenz, Lee, Saxena, RSS 2013]
Big Data for Grasp Planning [Kappler, Bohg, Schaal, 2015]
DeepMPC [Lenz, Knepper, Saxena, RSS 2015]
Dexnet Grasp Transfer [Mahler, …, Goldberg, 2015]
Deep Reinforcement Learning (RL)
- Goal: $\max_{\theta} \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$
  - $\pi_\theta(a \mid s)$: probability of taking action a in state s
  - Diagram label: Robot + Environment
- Additional challenges:
  - Stability
  - Credit assignment
  - Exploration
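The goal above is an expectation over trajectories, so it can be estimated by averaging the summed rewards of sampled rollouts. The following is a minimal sketch under loud assumptions: the two-state toy environment, the reward function, the one-parameter policy, and the names `reward`, `step`, `sample_action`, `rollout_return`, and `estimate_objective` are all hypothetical and not from the talk.

```python
import random

def reward(state):
    # Hypothetical reward R(s): 1 in state 1, 0 in state 0.
    return 1.0 if state == 1 else 0.0

def step(state, action):
    # Hypothetical deterministic dynamics: the action chooses the next state.
    return action

def sample_action(theta, state, rng):
    # Toy policy: pi_theta(a=1 | s) = theta for every state.
    return 1 if rng.random() < theta else 0

def rollout_return(theta, horizon, rng):
    # Sum of R(s_t) for t = 0, ..., H along one sampled trajectory.
    state, total = 0, 0.0
    for _ in range(horizon + 1):
        total += reward(state)
        state = step(state, sample_action(theta, state, rng))
    return total

def estimate_objective(theta, horizon=10, n_rollouts=2000, seed=0):
    # Monte Carlo estimate of E[sum_t R(s_t) | pi_theta].
    rng = random.Random(seed)
    returns = [rollout_return(theta, horizon, rng) for _ in range(n_rollouts)]
    return sum(returns) / n_rollouts
```

In this toy problem the exact value is `horizon * theta`, so the estimate can be checked against a known answer; in a real robot or simulator only the sampled estimate is available.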
From Pixels to Actions?
Pong Enduro Beamrider Q*bert
[Source: Mnih et al., Nature 2015 (DeepMind) ]
Deep Q-Network (DQN): From Pixels to Joystick Commands
- 32 8x8 filters with stride 4 + ReLU
- 64 4x4 filters with stride 2 + ReLU
- 64 3x3 filters with stride 1 + ReLU
- fully connected 512 units + ReLU
- fully connected output units, one per action
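To see how the three convolutional layers shrink the image before the fully connected layers, the feature-map sizes can be worked out by hand. This sketch assumes an 84x84 input and no padding; neither is stated on the slide, and the helper names `conv_out` and `dqn_feature_sizes` are hypothetical.

```python
def conv_out(size, kernel, stride):
    # Output width of a convolution with no padding.
    return (size - kernel) // stride + 1

def dqn_feature_sizes(input_size=84):
    # (filters, kernel, stride) for the three convolutional layers listed above.
    layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
    shapes = []
    size = input_size  # assumed input width and height
    for filters, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
        shapes.append((filters, size, size))
    return shapes

shapes = dqn_feature_sizes()
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)  # [(32, 20, 20), (64, 9, 9), (64, 7, 7)]
print(flat)    # 3136 inputs to the 512-unit fully connected layer
```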
How About Continuous Control, e.g., Locomotion?
[Figure: policy network diagram with labels: input layer (joint angles and kinematics), fully connected layer (30 units), mean parameters, standard deviations, sampling, control]
Neural network architecture:
- Input: joint angles and velocities
- Output: joint torques
Robot models in physics simulator (MuJoCo, from Emo Todorov)
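The diagram labels suggest a policy that produces mean parameters and standard deviations and then samples a control. A minimal sketch of that last step follows, assuming independent Gaussians per joint; the network that would produce `mean` is omitted, and the function names are hypothetical.

```python
import math
import random

def gaussian_policy_sample(mean, std, rng):
    # Draw one torque per joint from independent Gaussians.
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

def gaussian_log_prob(action, mean, std):
    # log pi(a | s) for a diagonal Gaussian; needed by policy gradient methods.
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )

rng = random.Random(0)
mean = [0.0, 1.0]   # assumed network output for two joints
std = [0.1, 0.2]    # assumed standard deviations
torques = gaussian_policy_sample(mean, std, rng)
log_p = gaussian_log_prob(torques, mean, std)
```

Sampling gives exploration during training, and the log-probability is the quantity whose gradient appears in policy gradient estimators.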
Challenges with Q-Learning
- How to score every possible action?
- How to ensure monotonic progress?
Policy Optimization
$\max_{\theta} \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$
- Often simpler to represent good policies than good value functions
- True objective of expected cost is optimized (vs. a surrogate like Bellman error)
- Existing work: (natural) policy gradients
- Challenges: good, large step directions
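Trust Region Policy Optimization
[Schulman, Levine, Moritz, Jordan, Abbeel, 2015]
$\max_{\theta} \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$
- Trust Region
- Surrogate Loss
$\max_{\delta\theta} \; \hat{g}^{\top} \delta\theta \quad \text{s.t.} \quad \mathrm{KL}\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon$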
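The constrained step above has a closed form if the KL term is replaced by its second-order approximation $\tfrac{1}{2}\,\delta\theta^{\top} F\, \delta\theta$, giving $\delta\theta = \sqrt{2\varepsilon / (\hat{g}^{\top} F^{-1} \hat{g})}\; F^{-1}\hat{g}$. The sketch below illustrates only that simplified step, assuming a diagonal matrix $F$; the diagonal assumption and the names `trust_region_step` and `approx_kl` are hypothetical, and this is not the full method from the cited paper.

```python
import math

def trust_region_step(g, fisher_diag, epsilon):
    # Direction F^{-1} g for an assumed diagonal F.
    direction = [gi / fi for gi, fi in zip(g, fisher_diag)]
    # Quadratic form g^T F^{-1} g.
    quad = sum(gi * di for gi, di in zip(g, direction))
    if quad <= 0.0:
        return [0.0 for _ in g]
    # Scale so the approximate KL of the step equals epsilon.
    scale = math.sqrt(2.0 * epsilon / quad)
    return [scale * di for di in direction]

def approx_kl(step, fisher_diag):
    # Second-order approximation 0.5 * step^T F step.
    return 0.5 * sum(fi * si * si for fi, si in zip(fisher_diag, step))

g = [1.0, 2.0]
fisher_diag = [1.0, 4.0]
step_vec = trust_region_step(g, fisher_diag, epsilon=0.01)
```

The step uses the whole trust region: its approximate KL equals `epsilon`, which is what allows large steps without leaving the region where the local model is trusted.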
Generalized Advantage Estimation (GAE)
[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]
- Generalized Advantage Estimation
  - Exponential interpolation between actor-critic and Monte Carlo estimates
  - Trust region approach to (high-dimensional) value function estimation
Objective: $\max_{\theta} \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$
Gradient: $\mathbb{E}\left[\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{k=t}^{H} R(s_k) - V(s_t)\right)\right]$
The term in parentheses is a single sample estimate of the advantage.
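The interpolation between actor-critic and Monte Carlo estimates can be sketched as an exponentially weighted sum of one-step temporal-difference residuals. The discount `gamma` and the mixing parameter `lam` do not appear on the slide and are assumptions here, as is the function name `gae_advantages`.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # values holds V(s_0), ..., V(s_T): one more entry than rewards.
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual (the actor-critic style estimate).
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # assumed value estimates, terminal value 0
monte_carlo = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
one_step = gae_advantages(rewards, values, gamma=1.0, lam=0.0)
```

With `lam=1` and `gamma=1` the result is the summed future reward minus `V(s_t)`, matching the single sample estimate in the gradient above; with `lam=0` it reduces to the one-step residual.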
Learning Locomotion
[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]
Atari Games
Pong Enduro Beamrider Q*bert
- Deep Q-Network (DQN) [Mnih et al, 2013/2015]
- Dagger with Monte Carlo Tree Search [Xiao-Xiao et al, 2014]
- Trust Region Policy Optimization (TRPO) [Schulman, Levine, Moritz, Jordan, Abbeel, 2015]
- A3C [Mnih et al, 2016]
Deep RL Benchmarking
- Tasks
- Algorithms
- Experimental setup
Deep RL Benchmarking -- Tasks
- 1. Basic tasks
- 2. Locomotion
- 3. Hierarchical
- 4. Partially observable (sensing, delayed action, sysID)
- 5. Driving…
Deep RL Benchmarking -- Algorithms
- Reinforce
- Truncated Natural Policy Gradient
- Reward-Weighted Regression (RWR)
- Relative Entropy Policy Search (REPS)
- Trust-Region Policy Optimization (TRPO)
- Cross-Entropy Method (CEM)
- Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
- Deep Deterministic Policy Gradients (DDPG)
- …
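Of the algorithms listed, the Cross-Entropy Method is short enough to sketch in full. It treats the policy parameters as a black box: sample parameter vectors from a Gaussian, keep the best-scoring fraction, and refit the Gaussian to them. In the sketch below the `score` function stands in for the return of a policy rollout, and all names, defaults, and the toy quadratic score are hypothetical.

```python
import random

def cross_entropy_method(score, dim, iterations=30, population=50,
                         elite_frac=0.2, init_std=1.0, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate parameter vectors from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the highest-scoring candidates.
        samples.sort(key=score, reverse=True)
        elite = samples[:n_elite]
        # Refit the mean and standard deviation to the elite set.
        mean = [sum(x[i] for x in elite) / n_elite for i in range(dim)]
        std = [
            (sum((x[i] - mean[i]) ** 2 for x in elite) / n_elite) ** 0.5 + 1e-3
            for i in range(dim)
        ]
    return mean

def toy_score(x):
    # Hypothetical stand-in for a rollout return, maximized at (1.0, -0.5).
    return -((x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2)

best = cross_entropy_method(toy_score, dim=2)
```

No gradients are needed, which makes the method simple to apply, at the cost of many evaluations of `score` per iteration.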
Benchmarking [Duan et al, ICML 2016]
rllab
[Duan et al]
OpenAI Gym
How About Real Robotic Visuo-Motor Skills?
Guided Policy Search
policy search (RL): complex dynamics, complex policy: HARD
supervised learning: complex dynamics, complex policy: EASY
trajectory optimization: complex dynamics, complex policy: EASY
[Diagram labels: trajectory optimization, supervised learning, general-purpose neural network controller]
Instrumented Training
training time test time
Deep Spatial Neural Net Architecture
(92,000 parameters)
πθ
[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Experimental Tasks
[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Learning
[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Learned Skills
[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Experiments: Learned Neural Network Policy
[Kahn, Zhang, Levine, Abbeel 2016]
Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY
Supersizing Self-Supervision: Learning to Grasp from 50K Tries and 700 Robot Hours [Pinto, Gupta, ICRA 2016]
Learning Hand-Eye Coordination with Deep Learning and Large Scale Data Collection [Pastor, Krizhevsky, Quillen, Levine, 2016]
Learning to Poke by Poking: Experiential Learning of Intuitive Physics [Agrawal, Nair, Abbeel, Malik, Levine, 2016]
Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY
- Multi-task, Multi-robot
- Simulation -> real world
Deep Domain Confusion [Tzeng, Hoffman, Saenko, Darrell, 2014]
Combining Model-based Policy Search with Online Model Learning [Mordatch, Mishra, Eppner, Abbeel ICRA 2016]
Progressive Neural Networks [Rusu, Rabinowitz, …, Hadsell, 2016]
Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY
[Houthooft, Chen, Duan, Schulman, Turck, Abbeel, 2016] Swimmer + Food Collection
Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY
Difficult to specify reward / cost function
Existing work on inverse optimal control / inverse RL
BUT, how to scale up to perceptual spaces?
Guided Cost Learning [Finn, Levine, Abbeel, ICML 2016]
Model-free Imitation Learning with Policy Optimization [Ho, Ermon, ICML 2016]
Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY
Acknowledgements
- Colleagues: Trevor Darrell, Ken Goldberg, Michael Jordan, Stuart Russell, Ilya Sutskever
- Post-docs: Sergey Levine, Igor Mordatch, Aviv Tamar, Dave Held
- Students: John Schulman, Chelsea Finn, Rocky Duan, Peter Chen, Rein Houthooft, Gregory Kahn, Tianhao Zhang