Deep Learning for Robotics, Pieter Abbeel, UC Berkeley / OpenAI / Gradescope



SLIDE 1

Deep Learning for Robotics
Pieter Abbeel, UC Berkeley / OpenAI / Gradescope

SLIDE 2

Outline

  • Some deep learning successes
  • Deep reinforcement learning
  • Current directions

Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

SLIDE 3

Object Detection in Computer Vision

  • State-of-the-art object detection until 2012:
    Input image -> hand-engineered features (SIFT, HOG, DAISY, …) -> Support Vector Machine (SVM) -> “cat”, “dog”, “car”, …
  • Deep Supervised Learning (Krizhevsky, Sutskever, Hinton 2012; also LeCun, Bengio, Ng, Darrell, …):
    Input image -> 8-layer neural network with 60 million parameters to learn -> “cat”, “dog”, “car”, …
      • ~1.2 million training images from ImageNet [Deng, Dong, Socher, Li, Li, Fei-Fei, 2009]

SLIDE 4

SLIDE 5

Performance

graph credit Matt Zeiler, Clarifai

SLIDE 6

Performance

graph credit Matt Zeiler, Clarifai

SLIDE 7

Performance

graph credit Matt Zeiler, Clarifai

AlexNet

SLIDE 8

Performance

graph credit Matt Zeiler, Clarifai

AlexNet

SLIDE 9

Performance

graph credit Matt Zeiler, Clarifai

AlexNet

SLIDE 10

Speech Recognition

graph credit Matt Zeiler, Clarifai

SLIDE 11

MS COCO Image Captioning Challenge

Karpathy & Fei-Fei, 2015; Donahue et al., 2015; Xu et al., 2015; many more

SLIDE 12

Visual QA Challenge

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh

SLIDE 13

Unsupervised Learning

  • Variational Autoencoders [Kingma and Welling, 2014]
      • DRAW [Gregor et al., 2015]
      • …
  • Generative Adversarial Networks [Goodfellow et al., 2014]
      • DC-GAN [Radford, Metz, Chintala, 2016]
      • InfoGAN [Chen, Duan, Houthooft, Schulman, Sutskever, Abbeel, 2016]
      • …
  • Pixel RNN [van den Oord et al., 2016]
      • Pixel CNN [van den Oord et al., 2016]
      • …

SLIDE 14

Image Generation – DC-GAN

SLIDE 15

Training

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, 2016

SLIDE 16

Comparison with Real Images

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, 2016

SLIDE 17

InfoGAN

[Chen, Duan, Houthooft, Schulman, Sutskever, Abbeel, 2016]

SLIDE 18

Robotics

  • Current state-of-the-art robotics:
    Percepts -> hand-engineered state estimation -> hand-engineered control policy class with hand-tuned (or learned) 10’ish free parameters -> motor commands
  • Deep reinforcement learning:
    Percepts -> many-layer neural network with many parameters to learn -> motor commands

SLIDE 19

Deep Learning for Estimation

Backprop KF [Haarnoja, Ajay, Levine, Abbeel, 2016]
Deep Tracking [Ondruska, Posner, 2016]
SE3 Nets [Byravan, Fox, 2016]
Structured Variational Autoencoders [Johnson, Duvenaud, Wiltschko, Datta, Adams, 2016]

SLIDE 20

Deep Estimation for Grasping/Control

Deep Learning for Detecting Robotic Grasps [Lenz, Lee, Saxena, RSS 2013]
Big Data for Grasp Planning [Kappler, Bohg, Schaal, 2015]
DeepMPC [Lenz, Knepper, Saxena, RSS 2015]
Dexnet Grasp Transfer [Mahler, …, Goldberg, 2015]

SLIDE 21

Deep Reinforcement Learning (RL)

  • Goal: $\max_\theta \; \mathbb{E}\left[ \sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta \right]$
    where $\pi_\theta(a \mid s)$ is the probability of taking action $a$ in state $s$
    Diagram labels: $\pi_\theta(a \mid s)$, Robot + Environment
  • Additional challenges:
      • Stability
      • Credit assignment
      • Exploration
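The objective on this slide is an expectation over trajectories, so in practice it is estimated by sampling rollouts. The sketch below is only an illustration under stated assumptions: the toy chain environment, the reward `R(s) = 1 if s >= 3`, and the one-parameter Bernoulli policy are all hypothetical and do not come from the talk.

```python
import random


def rollout_return(theta, horizon, rng):
    """Sum of R(s_t) for t = 0..horizon under a one-parameter policy.

    Hypothetical toy chain: the state is an integer position starting at 0.
    The policy moves right (+1) with probability theta, otherwise left (-1).
    Reward (assumed): R(s) = 1 if s >= 3 else 0.
    """
    state, total = 0, 0.0
    for _ in range(horizon + 1):
        total += 1.0 if state >= 3 else 0.0          # R(s_t)
        state += 1 if rng.random() < theta else -1   # sample action, step
    return total


def estimate_objective(theta, horizon=10, n_rollouts=2000, seed=0):
    """Monte Carlo estimate of E[sum_t R(s_t) | pi_theta]."""
    rng = random.Random(seed)
    returns = [rollout_return(theta, horizon, rng) for _ in range(n_rollouts)]
    return sum(returns) / n_rollouts
```

Comparing `estimate_objective(0.5)` with `estimate_objective(0.9)` shows how the estimate changes with the policy parameter, which is the quantity a policy search method tries to increase.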

SLIDE 22

From Pixels to Actions?

Pong, Enduro, Beamrider, Q*bert

SLIDE 23

Deep Q-Network (DQN): From Pixels to Joystick Commands

  • 32 8x8 filters with stride 4 + ReLU
  • 64 4x4 filters with stride 2 + ReLU
  • 64 3x3 filters with stride 1 + ReLU
  • fully connected 512 units + ReLU
  • fully connected output units, one per action

[Source: Mnih et al., Nature 2015 (DeepMind)]
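To make the layer list concrete, the following sketch traces feature-map sizes and parameter counts through the network. The slide does not state the input size, so the 84x84 input with 4 stacked frames, the absence of padding, and `n_actions=4` are assumptions here, and the helper names are illustrative only.

```python
def conv_out(size, kernel, stride):
    """Spatial size after a convolution with no padding."""
    return (size - kernel) // stride + 1


def dqn_shapes(input_size=84, in_channels=4, n_actions=4):
    """Return (feature-map shapes, flattened size, total parameter count).

    input_size, in_channels and n_actions are assumptions, not slide content.
    """
    # (filters, kernel, stride) for the three convolutional layers on the slide
    conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
    size, channels, params = input_size, in_channels, 0
    shapes = []
    for filters, kernel, stride in conv_layers:
        params += kernel * kernel * channels * filters + filters  # weights + biases
        size, channels = conv_out(size, kernel, stride), filters
        shapes.append((size, size, channels))
    flat = size * size * channels
    params += flat * 512 + 512              # fully connected layer, 512 units
    params += 512 * n_actions + n_actions   # one output unit per action
    return shapes, flat, params
```

Under these assumptions almost all parameters sit in the first fully connected layer, not in the convolutions.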

SLIDE 24

[Source: Mnih et al., Nature 2015 (DeepMind)]

SLIDE 25

How About Continuous Control, e.g., Locomotion?

Robot models in physics simulator (MuJoCo, from Emo Todorov)

Neural network architecture:
  • Input: joint angles and velocities
  • Output: joint torques

Diagram: Joint angles and kinematics -> Input layer -> Fully connected layer (30 units) -> Mean parameters and standard deviations -> Sampling -> Control
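A stochastic policy of this shape can be sketched as a small network that outputs a mean per joint torque and samples around it. This is a minimal illustration, not the architecture used in the talk: the tanh nonlinearity, the state-independent log standard deviations, the weight initialization, and all function names are assumptions.

```python
import math
import random


def init_policy(obs_dim, act_dim, hidden=30, seed=0):
    """Random small weights for a one-hidden-layer Gaussian policy (assumed init)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden, obs_dim), "b1": [0.0] * hidden,
        "W2": matrix(act_dim, hidden), "b2": [0.0] * act_dim,
        "log_std": [0.0] * act_dim,   # one standard deviation per action dimension
    }


def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def act(params, obs, rng):
    """Sample an action; return (action, mean, log-probability of the action)."""
    hidden = [math.tanh(v) for v in affine(params["W1"], params["b1"], obs)]
    mean = affine(params["W2"], params["b2"], hidden)
    std = [math.exp(ls) for ls in params["log_std"]]
    action = [rng.gauss(m, s) for m, s in zip(mean, std)]
    log_prob = sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for a, m, s in zip(action, mean, std)
    )
    return action, mean, log_prob
```

The returned log-probability is the quantity whose gradient appears in policy gradient estimators.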

SLIDE 26

Challenges with Q-Learning

  • How to score every possible action?
  • How to ensure monotonic progress?

SLIDE 27

Policy Optimization

$\max_\theta \; \mathbb{E}\left[ \sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta \right]$

  • Often simpler to represent good policies than good value functions
  • True objective of expected cost is optimized (vs. a surrogate like Bellman error)
  • Existing work: (natural) policy gradients
      • Challenges: good, large step directions

SLIDE 28

Trust Region Policy Optimization

$\max_\theta \; \mathbb{E}\left[ \sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta \right]$

  • Trust Region
  • Surrogate Loss

$\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \mathrm{KL}\big(P(\tau;\theta) \,\|\, P(\tau;\theta+\delta\theta)\big) \le \epsilon$

[Schulman, Levine, Moritz, Jordan, Abbeel, 2015]
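One common way to solve a constrained step of this form is to approximate the KL term by a quadratic, KL ≈ ½ δθᵀ F δθ with F a Fisher information matrix, which gives δθ = sqrt(2ε / (ĝᵀ F⁻¹ ĝ)) F⁻¹ ĝ. The sketch below assumes that approximation and uses conjugate gradient so that only products F·v are needed. It omits the line search and surrogate-loss check of a full implementation, and the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, iters=50, tol=1e-12):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x (x starts at zero)
    p = list(b)            # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def trust_region_step(g, fisher_vec, epsilon):
    """Maximize g^T d subject to 0.5 * d^T F d <= epsilon (quadratic KL model)."""
    direction = conjugate_gradient(fisher_vec, g)   # approximately F^{-1} g
    g_dot_dir = dot(g, direction)
    if g_dot_dir <= 0.0:                            # zero gradient: no step
        return [0.0] * len(g)
    scale = math.sqrt(2.0 * epsilon / g_dot_dir)
    return [scale * d for d in direction]
```

With a diagonal F, the step is larger along directions where F is small, that is, where changing the parameters changes the trajectory distribution less.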

SLIDE 29

Generalized Advantage Estimation (GAE)

  • Generalized Advantage Estimation
      • Exponential interpolation between actor-critic and Monte Carlo estimates
      • Trust region approach to (high-dimensional) value function estimation

Objective: $\max_\theta \; \mathbb{E}\left[ \sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta \right]$

Gradient: $\mathbb{E}\left[ \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{k=t}^{H} R(s_k) - V(s_t) \right) \right]$
where the term in parentheses is a single-sample estimate of the advantage

[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]
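The interpolation between actor-critic and Monte Carlo estimates can be sketched as a backward recursion over temporal-difference residuals. The discount `gamma` and interpolation weight `lam` are not shown on the slide; they are taken here as assumptions following the usual presentation of GAE, and the default values are arbitrary.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra entry for the state after the
             last reward (use 0.0 there if the episode terminated).
    lam = 0 gives the one-step actor-critic estimate; lam = 1 gives the
    Monte Carlo return minus the baseline V(s_t).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `gamma=1.0, lam=1.0` with a terminal value of zero recovers the single-sample estimate in the gradient formula, the sum of remaining rewards minus `V(s_t)`.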

SLIDE 30

Learning Locomotion

[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]

SLIDE 31

Atari Games

  • Deep Q-Network (DQN) [Mnih et al., 2013/2015]
  • Dagger with Monte Carlo Tree Search [Xiao-Xiao et al., 2014]
  • Trust Region Policy Optimization (TRPO) [Schulman, Levine, Moritz, Jordan, Abbeel, 2015]
  • A3C [Mnih et al., 2016]

Pong, Enduro, Beamrider, Q*bert

SLIDE 32

Deep RL Benchmarking

  • Tasks
  • Algorithms
  • Experimental setup

SLIDE 33

Deep RL Benchmarking -- Tasks

  • 1. Basic tasks
  • 2. Locomotion
  • 3. Hierarchical
  • 4. Partially observable: sensing, delayed action, sysID
  • 5. Driving…

SLIDE 34

Deep RL Benchmarking -- Algorithms

  • Reinforce
  • Truncated Natural Policy Gradient
  • Reward-Weighted Regression (RWR)
  • Relative Entropy Policy Search (REPS)
  • Trust-Region Policy Optimization (TRPO)
  • Cross-Entropy Method (CEM)
  • Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
  • Deep Deterministic Policy Gradients (DDPG)
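Of the algorithms listed, the Cross-Entropy Method is the simplest to show in a few lines: sample parameter vectors from a Gaussian, keep the best-scoring fraction, and refit the Gaussian to them. The sketch below is a generic version with a diagonal Gaussian. The population size, elite fraction, noise floor, and the toy quadratic score in the usage note are assumptions, not the benchmark's settings.

```python
import random
import statistics


def cross_entropy_method(score, dim, iters=30, pop=50, elite_frac=0.2,
                         init_std=2.0, seed=0):
    """Maximize score(theta) over vectors of length dim with a diagonal Gaussian."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [init_std] * dim
    n_elite = max(2, int(pop * elite_frac))
    for _ in range(iters):
        # Sample a population of candidate parameter vectors.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(pop)]
        # Keep the highest-scoring candidates.
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit the sampling distribution to the elite set.
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elite)]  # small floor
    return mean
```

For a policy search problem, `score` would be the average return of rollouts with parameters `theta`; a toy stand-in is `lambda th: -((th[0] - 1.0) ** 2 + (th[1] + 0.5) ** 2)`, whose maximum is at `(1.0, -0.5)`.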

SLIDE 35

Benchmarking [Duan et al., ICML 2016]

SLIDE 36

rllab

[Duan et al.]

SLIDE 37

OpenAI Gym

SLIDE 38

How About Real Robotic Visuo-Motor Skills?

SLIDE 39

Guided Policy Search

  • policy search (RL): complex dynamics, complex policy (HARD)
  • supervised learning: complex dynamics, complex policy (EASY)
  • trajectory optimization: complex dynamics, complex policy (EASY)

Diagram labels: supervised learning, trajectory optimization, general-purpose neural network controller

SLIDE 40

Instrumented Training

training time test time

SLIDE 41

Deep Spatial Neural Net Architecture

$\pi_\theta$ (92,000 parameters)

[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]

SLIDE 42

Experimental Tasks

[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]

SLIDE 43

Learning

[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]

SLIDE 44

Learned Skills

[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]

SLIDE 45

Experiments: Learned Neural Network Policy

[Kahn, Zhang, Levine, Abbeel 2016]

SLIDE 46

Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY

Supersizing Self-Supervision: Learning to Grasp from 50K Tries and 700 Robot Hours [Pinto, Gupta, ICRA 2016]
Learning Hand-Eye Coordination with Deep Learning and Large Scale Data Collection [Pastor, Krizhevsky, Quillen, Levine, 2016]
Learning to Poke by Poking: Experiential Learning of Intuitive Physics [Agrawal, Nair, Abbeel, Malik, Levine, 2016]

SLIDE 47

Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY

  • Multi-task, Multi-robot
  • Simulation -> real world

Deep Domain Confusion [Tzeng, Hoffman, Saenko, Darrell, 2014]
Combining Model-based Policy Search with Online Model Learning [Mordatch, Mishra, Eppner, Abbeel, ICRA 2016]
Progressive Neural Networks [Rusu, Rabinowitz, …, Hadsell, 2016]

SLIDE 48

Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY

Swimmer + Food Collection
[Houthooft, Chen, Duan, Schulman, Turck, Abbeel, 2016]

SLIDE 49

Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY

  • Difficult to specify reward / cost function
  • Existing work on inverse optimal control / inverse RL
  • BUT, how to scale up to perceptual spaces?

Guided Cost Learning [Finn, Levine, Abbeel, ICML 2016]
Model-free Imitation Learning with Policy Optimization [Ho, Ermon, ICML 2016]

SLIDE 50

Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY

SLIDE 51

Acknowledgements

  • Colleagues: Trevor Darrell, Ken Goldberg, Michael Jordan, Stuart Russell, Ilya Sutskever
  • Post-docs: Sergey Levine, Igor Mordatch, Aviv Tamar, Dave Held
  • Students: John Schulman, Chelsea Finn, Rocky Duan, Peter Chen, Rein Houthooft, Gregory Kahn, Tianhao Zhang