Deep Learning for Robotics
Pieter Abbeel
Reinforcement Learning (RL)

- Robotics
- Marketing / Advertising
- Dialogue
- Optimizing operations / logistics
- Queue management
- …

Robot + Environment: $\pi_\theta(a|s)$ = probability of taking action a in state s

$$\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$$
Reinforcement Learning (RL)

- Goal: $\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \,\middle|\, \pi_\theta\right]$, where $\pi_\theta(a|s)$ is the probability of taking action a in state s (robot + environment)
- Additional challenges:
  - Stability
  - Credit assignment
  - Exploration
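To ground the objective, here is a minimal REINFORCE-style sketch of optimizing $\max_\theta \mathbb{E}[\sum_t R(s_t) \mid \pi_\theta]$ with a linear-softmax policy. It is illustrative only, not from the talk: `env` is assumed to be a Gym-style environment whose `step` returns `(next_state, reward, done)`.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def episode(env, theta, horizon):
    """Roll out pi_theta(a|s) = softmax(theta^T s) for up to `horizon` steps."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        probs = softmax(theta.T @ s)            # pi_theta(a|s)
        a = np.random.choice(len(probs), p=probs)
        s2, r, done = env.step(a)               # assumed Gym-style interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
        if done:
            break
    return states, actions, rewards

def reinforce_update(env, theta, horizon, lr=0.01):
    """One REINFORCE step on max_theta E[sum_t R(s_t) | pi_theta]."""
    states, actions, rewards = episode(env, theta, horizon)
    returns = np.cumsum(rewards[::-1])[::-1]    # reward-to-go
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta.T @ s)
        # grad log pi_theta(a|s) for a linear-softmax policy:
        # outer(s, onehot(a) - probs)
        dlog = np.outer(s, -probs)
        dlog[:, a] += s
        grad += G * dlog
    return theta + lr * grad
```

The high variance of this estimator is exactly why stability, credit assignment, and exploration appear as challenges on the slide.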
Deep RL Success Stories

- DQN: Mnih et al., NIPS 2013 / Nature 2015
- TRPO: Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015
- Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016
- A3C: Mnih et al., 2016
- Silver et al., Nature 2015
- Gu et al., NIPS 2014
- Levine*, Finn*, Darrell, Abbeel, JMLR 2016
Speed of Learning: Deep RL (DQN) vs. Human

                                     Deep RL (DQN)   Human
Score                                18.9            9.3
Experience (measured in real time)   40 days         2 hours
                                     "Slow"          "Fast"
Starting Observations

- TRPO, DQN, A3C are fully general RL algorithms
  - i.e., for any MDP that can be mathematically defined, these algorithms are equally applicable
- MDPs encountered in the real world are a tiny, tiny subset of all MDPs that could be defined
- Can we design "fast" RL algorithms that take advantage of such knowledge?
Research Questions

- How to acquire a good prior for real-world MDPs?
  - Or, for starters, e.g., for real-games MDPs?
- How to design algorithms that make use of such prior information?

Key idea: learn a fast RL algorithm that encodes this prior.
Formulation

- Given: a distribution over relevant MDPs
- Train the fast RL algorithm to be fast on a training set of MDPs
Learning the Fast RL Algorithm

- Representation of the fast RL algorithm:
  - RNN = generic computation architecture
  - different weights in the RNN mean a different RL algorithm
  - different activations in the RNN mean a different current policy
- Training setup (sketched below)
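The training setup is easiest to see as pseudocode. Below is a minimal sketch of one RL² trial, assuming hypothetical `rnn_policy` and `sample_mdp` interfaces (illustrative, not the paper's code): the hidden state persists across episodes of the same sampled MDP, the RNN is fed its own rewards and termination flags, and the trial's total reward is what the outer "slow" RL (TRPO in the paper) maximizes over the RNN weights.

```python
def rl2_trial(rnn_policy, sample_mdp, episodes_per_trial=2, horizon=100):
    """Collect one RL^2 trial: several episodes of the SAME sampled MDP.

    The RNN hidden state persists across episode boundaries within the
    trial, so its activations can implement a "fast" RL algorithm that
    adapts to the sampled MDP. Hidden state is reset between trials.
    """
    env = sample_mdp()                    # draw one MDP from the distribution
    h = rnn_policy.initial_state()        # reset only at trial start
    total_reward = 0.0
    for _ in range(episodes_per_trial):
        s, r, done = env.reset(), 0.0, False
        for _ in range(horizon):
            # Input includes the last reward and done flag so the RNN can
            # learn from its own experience, as an RL algorithm would.
            a, h = rnn_policy.step(obs=s, last_reward=r, last_done=done, state=h)
            s, r, done = env.step(a)
            total_reward += r
            if done:
                break
    return total_reward   # maximized over RNN weights by the outer "slow" RL
```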
Alternative View on RL²

- RNN = policy for acting in a POMDP
- Part of what's not observed in the POMDP is which MDP the agent is in
Related Work

- Wang et al. (2016), Learning to Reinforcement Learn, in submission to ICLR 2017
- Chen et al. (2016), Learning to Learn for Global Optimization of Black Box Functions
- Andrychowicz et al. (2016), Learning to Learn by Gradient Descent by Gradient Descent
- Santoro et al. (2016), One-shot Learning with Memory-Augmented Neural Networks
- Larochelle et al. (2008), Zero-data Learning of New Tasks
- Younger et al. (2001), Meta Learning with Backpropagation
- Schmidhuber et al. (1996), Simple Principles of Metalearning
RL²: Fast RL by Slow RL

Key insights:

- We represent the AI agent as a Recurrent Neural Net (RNN)
  - i.e., the RNN is the "fast" RL algorithm
  - different weights in the RNN mean a different RL algorithm
- To discover good weights for the RNN (i.e., to discover the fast RL algorithm), train with classical ("slow") RL

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Evaluation: Multi-Armed Bandits

- Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …

[Image: 5-armed bandit (source: ebay)]

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
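For reference, one of these human-designed baselines is easy to state. Here is a minimal Thompson-sampling sketch for a 5-armed Bernoulli bandit, the textbook algorithm rather than code from the paper; `pull_arm` is an illustrative callback returning 0 or 1.

```python
import numpy as np

def thompson_bernoulli(pull_arm, n_arms=5, n_pulls=500):
    """Thompson sampling with a Beta(1,1) prior on each arm's success rate."""
    successes = np.ones(n_arms)   # Beta alpha (starts at the prior)
    failures = np.ones(n_arms)    # Beta beta
    total = 0
    for _ in range(n_pulls):
        # Sample a plausible success rate per arm from its posterior,
        # then greedily pick the arm whose sample is largest.
        samples = np.random.beta(successes, failures)
        arm = int(np.argmax(samples))
        reward = pull_arm(arm)    # 0 or 1
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += reward
    return total
```

The RL² question is whether a trained RNN can rediscover behavior of this quality from the MDP distribution alone.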
Evaluation: Tabular MDPs

- Provably (asymptotically) optimal algorithms: BEB, PSRL, UCRL2, …
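PSRL, one of the listed baselines, is the MDP analogue of Thompson sampling: sample an MDP from the posterior, solve it, act under the resulting policy for an episode, update the posterior. A hedged, minimal tabular sketch follows; integer states, the Gym-style `env`, the Dirichlet transition posterior, and the empirical mean-reward estimate are illustrative simplifications, not the paper's exact setup.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """Solve a tabular MDP: P[s,a,s'] transition probs, R[s,a] mean rewards."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V        # Q[s,a] = R[s,a] + gamma * sum_s' P V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)          # greedy policy, one action per state

def psrl_episode(counts, reward_sums, env, horizon=50):
    """One PSRL episode. counts[s,a,s'] are transition counts,
    reward_sums[s,a] are summed observed rewards."""
    S, A, _ = counts.shape
    # Posterior sample: Dirichlet over next-state distributions
    # (the +1 acts as the prior); rewards use a simple empirical mean.
    P = np.array([[np.random.dirichlet(counts[s, a] + 1) for a in range(A)]
                  for s in range(S)])
    R = reward_sums / np.maximum(counts.sum(axis=2), 1)
    policy = value_iteration(P, R)
    s = env.reset()
    for _ in range(horizon):
        a = policy[s]
        s2, r, done = env.step(a)
        counts[s, a, s2] += 1        # Bayesian update via sufficient statistics
        reward_sums[s, a] += r
        s = s2
        if done:
            break
```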
Evaluation: Visual Navigation

- Built on top of ViZDoom

[Figures: agent's view; small maze; large maze]
[Videos: before learning vs. after learning]

- Occasional "bad" behavior
Evaluation: OpenAI Universe
Outline

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Variational Lossy Autoencoder
  Xi Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Third-Person Imitation Learning

- First-person imitation learning
  - Demonstrate with the robot itself
  - E.g., drive a car, tele-operate a robot, etc.
- Third-person imitation learning
  - Robot watches demonstrations
  - Challenges:
    - Different viewpoint
    - Different incarnation (human vs. robot)
Third-Person Imitation Learning

- Example problem settings:

[Figures: third-person view vs. robot environment]
Basic Ideas

- Generative Adversarial Imitation Learning (Ho et al., 2016; Finn et al., 2016)
  - Reward = defined by a learned classifier distinguishing expert from robot behavior
  - → Optimizing such a reward makes the robot perform like the expert
  - → Works well for first-person imitation learning
  - BUT: in the third-person setting, the classifier will simply identify expert vs. robot environment, and the robot can never match the expert
- Domain confusion loss (Tzeng et al., 2015)
  - Deep-learn a feature representation from which it isn't possible to distinguish the environment
  - BUT: competes too directly with the first objective
- Let the first objective have multiple frames, i.e., see behavior (sketched below)
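A hedged sketch of how these pieces can fit together (illustrative PyTorch, not the paper's code): a shared feature extractor over a stack of frames, a discriminator head whose output defines the imitation reward, and a domain-classifier head trained through a gradient-reversal layer so the features cannot reveal which environment a clip came from. All module and function names here are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients on the
    backward pass, so the feature extractor learns to FOOL the domain head."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -ctx.lam * g, None

class ThirdPersonDiscriminator(nn.Module):
    def __init__(self, frame_dim, n_frames=4, hidden=128, lam=1.0):
        super().__init__()
        self.lam = lam
        # Multi-frame input: classify *behavior over time*, not single images.
        self.features = nn.Sequential(
            nn.Linear(frame_dim * n_frames, hidden), nn.ReLU())
        self.expert_head = nn.Linear(hidden, 1)   # expert vs. novice behavior
        self.domain_head = nn.Linear(hidden, 1)   # which environment/viewpoint

    def forward(self, frames):                    # frames: (B, n_frames*frame_dim)
        z = self.features(frames)
        expert_logit = self.expert_head(z)
        # Domain confusion: gradient reversal pushes `features` to be
        # uninformative about the domain while the head still tries to classify.
        domain_logit = self.domain_head(GradReverse.apply(z, self.lam))
        return expert_logit, domain_logit

def imitation_reward(expert_logit):
    """GAIL-style reward: high when the behavior looks expert-like."""
    return torch.log(torch.sigmoid(expert_logit) + 1e-8)
```

The gradient-reversal weight corresponds to the λ hyperparameter whose sensitivity is examined below.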
Architecture

[Figure: network architecture]

Learning Curves

[Figure: learning curves]

Domain Classification Accuracy

[Figure: domain classification accuracy]

- Does the algorithm we propose benefit from both domain confusion and the multi-time-step input?
- How sensitive is our proposed algorithm to the hyperparameters used in deployment? (Domain confusion weight λ)
- How sensitive is our proposed algorithm to the hyperparameters used in deployment? (Number of lookahead frames)
Results

[Videos: third-person view on expert; imitator]
Outline

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Probabilistically Safe Policy Transfer
  David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Risky Robotics Tasks

- Autonomous driving
- Robots interacting around / with people
- Robots manipulating fragile objects
- Robot can damage itself

Question: how to train a robot for these tasks?
Previous Approaches

- Isolated training environment (e.g., cage)
  - May not represent the test environment
  - No human interaction / collaboration in an isolated environment
- Train in simulation
  - May not represent the test environment
- Watch the robot carefully, try to press the kill switch in time
  - Requires careful observation and prediction of robot actions by a human; may not react fast enough
Our Approach

- Operate in the test environment, initially with low torques
  - Use the lowest torques at which the task can still be completed, safely and slowly
  - Assumption: low torques are safer but less efficient
- Increase the torque limit as the robot demonstrates that it can operate safely
- How do we define safety?
How to Define Safety?

- D_safe defines our "safety budget": how much expected damage we can afford
- Example for an autonomous car:
  - Low risk of hitting a pedestrian at low speeds
  - Even lower risk of killing a pedestrian
How to Define Safety?

Expected damage:

- Higher torques → lower time per task → more benefit
- Higher torques → more damage
- Torques are clipped at T_lim
- Assume a binary notion of being unsafe, with some probability of failing to be safe
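Under the binary-failure assumption, the safety-budget bookkeeping is simple. A minimal sketch, where `p_fail`, `damage_per_failure`, and `D_safe` are illustrative stand-ins for the slide's quantities:

```python
def expected_damage(p_fail, damage_per_failure=1.0):
    """Binary safe/unsafe model: E[damage] = P(failure) * damage-if-failure.
    With the damage scale normalized to 1 (the slides' "wlog alpha = 1"),
    expected damage reduces to the failure probability itself."""
    return p_fail * damage_per_failure

def within_budget(p_fail, D_safe, damage_per_failure=1.0):
    """Safety-budget check: proceed (e.g., raise the torque limit) only
    while the predicted expected damage stays within D_safe."""
    return expected_damage(p_fail, damage_per_failure) <= D_safe
```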
How to Define Safety?

(w.l.o.g. assume α = 1)

- High probability of failure → low torques
- Low probability of failure → higher torques
Overall Algorithm

- Effect from adjusting the torque limit
- Effect from updating the policy

(both effects are combined in the sketch below)
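A hedged sketch of one iteration of the overall loop; all function names are hypothetical stand-ins for the paper's components, passed in as callbacks so the skeleton stays self-contained:

```python
def safe_transfer_step(policy_update_fn, limit_risk_fn, policy_risk_fn,
                       policy, T_lim, p_fail, D_safe, dT=0.5):
    """One iteration of adaptive torque-limit training, schematically.

    limit_risk_fn  -> predicted failure increase from raising the limit
                      (effect (a); see the tail-mass sketch further below)
    policy_risk_fn -> predicted failure increase from the policy update
                      (effect (b), e.g. bounding the TRPO step)
    """
    dp_limit = limit_risk_fn(policy, T_lim, T_lim + dT)     # effect (a)
    new_policy = policy_update_fn(policy)                   # TRPO step
    dp_policy = policy_risk_fn(policy, new_policy, T_lim)   # effect (b)

    # Only raise the limit if the *predicted* failure rate stays in budget.
    if p_fail + dp_limit + dp_policy <= D_safe:
        T_lim += dT
    return new_policy, T_lim
```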
Predicting Failure Increases: Adjusting the Torque Limit

- The policy is represented as a Gaussian
- Due to torque limits, the applied torque is given by a truncated Gaussian

[Figure: old vs. new policy torque distributions with intersection marked; tail masses F(-T_lim) and 1 - F(T_lim) highlighted]
Predicting Failure Increases: Adjusting the Torque Limit

- Key observation: the sampled torque only changes for torques that the old limit was clipping
- If all such changes lead to new failures, then, due to the torque limit adjustment, the failure rate can increase by at most the clipped tail mass, F(-T_lim) + (1 - F(T_lim))
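Since the applied torque is the policy's Gaussian clipped to [-T_lim, T_lim], the only torques affected by raising the limit are those in the clipped tails, whose mass is the F(-T_lim) and 1 - F(T_lim) marked on the slide's figure. A minimal sketch of the resulting worst-case bound using `scipy.stats.norm`; note the slide's exact expression was an image, so this reconstruction from the tail masses is an assumption:

```python
from scipy.stats import norm

def limit_change_risk(mu, sigma, T_old):
    """Worst-case increase in failure rate from raising the torque limit.

    The applied torque is a Gaussian policy output clipped to [-T_old, T_old];
    raising the limit only changes torques that were being clipped, i.e. the
    tail mass beyond +/- T_old. If every such change led to a new failure,
    the failure rate would increase by at most that mass.
    """
    lower_tail = norm.cdf(-T_old, loc=mu, scale=sigma)       # F(-T_lim)
    upper_tail = 1.0 - norm.cdf(T_old, loc=mu, scale=sigma)  # 1 - F(T_lim)
    return lower_tail + upper_tail
```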
Predicting Failure Increases: Updating the Policy

- Updating the policy can also lead to failures
- TRPO policy adjustment: [equation on slide]
- The failure rate can then increase by: [equation on slide]
- Total increase in failure rate: [equation on slide]
[Figure: old vs. new policy torque distributions with intersection marked]
Adaptive vs Fixed Torque Limits

[Figure: torque limit (N·m) vs. number of iterations]
Adaptive vs Fixed Torque Limits

[Figure: expected damage vs. number of iterations, with D_safe marked]
Ablation

[Figure: expected damage vs. number of iterations (500–2500), each panel against D_safe:
- Our full method
- V2: not predicting the effect of the torque limit increase
- V3: not predicting the effect of the policy update
- V4: not predicting either effect]
Varying D_safe

[Figure: expected damage vs. number of iterations for several values of D_safe]
Sim2Real: Safe Policy Transfer

[Videos: simulation vs. reality]
Outline

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Probabilistically Safe Policy Transfer
  David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Tensegrity Robotics: NASA SuperBall

- Rigid rods connected by elastic cables
- Controlled by motors that extend / contract cables
- Properties:
  - Lightweight
  - Low cost
  - Capable of withstanding significant impact
- NASA investigates them for space exploration
- Major challenge: control

[Video: NASA SUPERball after training with Guided Policy Search]
Summary

- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
- Third-Person Imitation Learning
  Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever
- Probabilistically Safe Policy Transfer
  David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel
- Deep Reinforcement Learning for Tensegrity Robot Locomotion
  X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine
Frontiers

- Shared and transfer learning
- Memory
- Estimation
- Temporal hierarchy / goal setting
- Safe learning
- Value alignment
- Applications
Thank you

Pieter Abbeel, UC Berkeley / OpenAI / Gradescope