What Would it Take to Train an Agent to Play with a Shape-Sorter?
Feryal Behbahani
Shape sorter?
- Simple children's toy: put shapes into the correct holes
- Trivial for adults
- Yet children cannot fully solve it until around 2 years of age (!)
Requirements
- Recognize different shapes
- Grasp objects and manipulate them
- Understand the task and how to succeed
- Mentally / physically rotate shapes into position
- Move precisely to fit the object into the hole
How to do it?
- Classical robotic control pipeline approach
- Deep robotic end-to-end learning
Classical pipeline: Observations → State estimation → Modeling & prediction → Planning → Low-level control → Controls
End-to-end learning: Observations → Controls
Using simulations as a proxy
- How many samples do we need to train a good behaviour?
– Real robot/car: stuck at real-time speed
– MuJoCo simulator: up to 10,000x real time
Finger tracking with CyberGlove synced with 3D reconstruction in MuJoCo
Real Jaco arm vs. MuJoCo simulation
Udacity car simulator [Todorov et al., 2012 & Behbahani et al., 2016]
Deep Reinforcement Learning for control
- Agent-environment loop: the environment sends observations and a reward to the agent, and the agent sends actions back to the environment.
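A minimal sketch of this loop in Python, assuming a gym-style environment interface (`env.reset()` / `env.step()`) and a placeholder random agent; none of these names come from the talk:

```python
import numpy as np


class RandomAgent:
    """Picks a random velocity for each actuator (placeholder agent)."""

    def __init__(self, n_actuators=9, n_velocities=5):
        self.n_actuators = n_actuators
        self.n_velocities = n_velocities

    def act(self, observation, reward):
        # One of 5 discrete velocities per actuator, chosen uniformly.
        return np.random.randint(self.n_velocities, size=self.n_actuators)


def run_episode(env, agent, max_steps=100):
    """One pass around the agent-environment loop."""
    observation = env.reset()
    reward, total_reward = 0.0, 0.0
    for _ in range(max_steps):
        action = agent.act(observation, reward)           # agent -> environment
        observation, reward, done, _ = env.step(action)   # environment -> agent
        total_reward += reward
        if done:
            break
    return total_reward
```

A learned agent would simply replace `RandomAgent` in this loop; the interaction protocol stays the same.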
Learning to reach
- Let's first try to reach a target and grasp it.
- The agent should be able to do this regardless of the object's location.
Task and setup
- Reach red target
– Reward of 1 if the target is inside the hand (sketched below)
– Random target position each episode, within a 40 x 40 x 40 cm workspace
- Observation space:
– Two camera views
- Action space:
– Joint velocities: 9 actuators, 5 possible velocities each
[Figure: View 1 | View 2 | Random agent]
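A small sketch of the sparse reward and per-episode target sampling described above; the `grasp_radius_cm` threshold and the function names are assumptions, not details from the talk:

```python
import numpy as np

WORKSPACE_CM = 40.0  # targets sampled inside a 40 x 40 x 40 cm cube


def sample_target(rng):
    # New random target position at the start of each episode.
    return rng.uniform(low=0.0, high=WORKSPACE_CM, size=3)


def sparse_reward(hand_position, target_position, grasp_radius_cm=5.0):
    # Reward of 1 only when the target is inside the hand, 0 otherwise.
    # `grasp_radius_cm` is an assumed threshold, not from the talk.
    inside_hand = np.linalg.norm(hand_position - target_position) < grasp_radius_cm
    return 1.0 if inside_hand else 0.0
```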
Agent architecture
- Inputs:
– 64 x 64 x 6 channels
- Vision
– ConvNet, 2 layers
– ReLU activations
- LSTM (recurrent core)
– 128 units
- Policy
– Softmax per actuator (5 values)
- Value
– Linear layer to scalar
Vision → LSTM → Policy / Value (sketched below)
For each timestep t, compute the policy $\pi(a_t \mid s_t)$ and the value estimate $V(s_t)$.
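A hedged PyTorch sketch of this architecture. Only the numbers stated on the slide (64 x 64 x 6 input, 2 conv layers with ReLU, 128-unit LSTM, a 5-way softmax per actuator for 9 actuators, and a linear value head) come from the talk; the convolution kernel sizes and strides are guesses:

```python
import torch
import torch.nn as nn


class ReacherAgent(nn.Module):
    """ConvNet -> LSTM -> per-actuator policy heads + value head (sketch)."""

    def __init__(self, n_actuators=9, n_velocities=5, lstm_units=128):
        super().__init__()
        # Vision: 2 conv layers with ReLU (kernel sizes/strides are assumptions).
        self.vision = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.vision(torch.zeros(1, 6, 64, 64)).shape[1]
        # Recurrent core: 128-unit LSTM.
        self.lstm = nn.LSTMCell(feat_dim, lstm_units)
        # Policy: an independent softmax over 5 velocities for each actuator.
        self.policy_heads = nn.ModuleList(
            [nn.Linear(lstm_units, n_velocities) for _ in range(n_actuators)]
        )
        # Value: linear layer to a scalar.
        self.value = nn.Linear(lstm_units, 1)

    def forward(self, obs, state=None):
        # obs: (batch, 6, 64, 64) -- two stacked 64x64 RGB camera views.
        features = self.vision(obs)
        h, c = self.lstm(features, state)
        logits = [head(h) for head in self.policy_heads]   # 9 x (batch, 5)
        value = self.value(h).squeeze(-1)                  # (batch,)
        return logits, value, (h, c)
```

A fresh LSTM state would be used at the start of each episode and the returned `(h, c)` threaded through successive calls.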
Asynchronous Advantage Actor-Critic (A3C)
The agent acts for T timesteps (e.g., T = 100), then for each t computes the n-step return and advantage:
$$R_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T)$$
$$A_t = R_t - V(s_t)$$
Compute the loss gradient:
$$g = \nabla_\theta \sum_{t=1}^{T} \left[ -\log \pi(a_t \mid s_t)\, A_t + \big(V(s_t) - R_t\big)^2 \right]$$
Plug g into a stochastic gradient descent optimiser (e.g. RMSprop)
[Diagram: actor-critic loop. The actor sends actions to the environment; the critic evaluates states and its value estimate is used to update the actor network.]
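A sketch of the return, advantage, and loss computation above in plain PyTorch; the value-loss weighting is an assumption, and the entropy bonus usually added in A3C is omitted:

```python
import torch


def a3c_loss(log_probs, values, rewards, bootstrap_value, gamma=0.99):
    """n-step actor-critic loss for one rollout of T steps (sketch).

    log_probs:       T tensors, log pi(a_t | s_t) of the actions taken
    values:          T tensors, V(s_t)
    rewards:         T floats, r_t
    bootstrap_value: detached V(s_T) for the state after the last step
    """
    T = len(rewards)
    # n-step returns, computed backwards: R_t = r_t + gamma * R_{t+1},
    # starting from R_T = V(s_T).
    returns, R = [], bootstrap_value
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R
        returns.append(R)
    returns.reverse()

    policy_loss = torch.zeros(())
    value_loss = torch.zeros(())
    for t in range(T):
        advantage = returns[t] - values[t].detach()   # A_t = R_t - V(s_t)
        policy_loss = policy_loss - log_probs[t] * advantage
        value_loss = value_loss + (values[t] - returns[t]) ** 2
    # The 0.5 weighting (and the omitted entropy bonus) are assumptions.
    return policy_loss + 0.5 * value_loss
```

The resulting scalar is backpropagated to obtain g, which is then handed to the shared optimiser as described on the slide.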
Multiple workers interact with their own environments and send gradient updates asynchronously. This helps with robustness and experience diversity [Mnih et al., 2016; Rusu et al., 2016].
Results
- Successfully learns to reach all target locations with sparse rewards after ~6 million training steps
- Each episode can last up to 100 steps; once the behaviour is learned, reaching takes ~7 steps
- Domain randomisation for robustness when transferring to the real world (sketched below)
[Figure: camera side views]
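Domain randomisation is usually implemented by re-sampling visual properties of the simulator at the start of every episode; a rough sketch, where `sim` and its setters are hypothetical wrappers and not the MuJoCo API:

```python
import numpy as np


def randomise_domain(sim, rng):
    """Re-sample visual properties of the simulator each episode (sketch).

    `sim` is a hypothetical simulator wrapper; the setters and ranges below
    are illustrative, not the ones used in the talk.
    """
    # Random object colours so the policy cannot latch onto texture cues.
    sim.set_object_colours(rng.uniform(0.0, 1.0, size=(sim.n_objects, 3)))
    # Jitter lighting and camera pose so small calibration errors on the
    # real robot do not break the learned behaviour.
    sim.set_light_direction(rng.uniform(-1.0, 1.0, size=3))
    sim.set_camera_jitter(position_cm=rng.normal(0.0, 2.0, size=3),
                          rotation_deg=rng.uniform(-5.0, 5.0, size=3))
```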
Place shape into its correct position
- Tries to place the object in the correct place but struggles to fit it in
Deep RL end-to-end limitations
- Defining the reward function is more of an art than a science!
- Very sample inefficient
- Learning vision from scratch every time
- The policy does not transfer effectively to slightly different situations (e.g. moving the target by a few centimeters)
End-to-end learning: Observations → Controls
A great recent overview of DRL methods →
Possible solutions
[e.g. Levine et al., 2016 & Mirowski et al., 2016]
[Architecture diagram: Vision + joint angles & velocities → LSTM → Policy / Value]
Learning with auxiliary information
- Leverage extra information in simulation, forcing the agent to make sense of the geometry of what it sees. This accelerates and stabilises reinforcement learning.
- Auxiliary task: predict auxiliary information (e.g. depth) from the visual input (sketched below).
- Auxiliary input: leverage information that is available only within simulation and learn to cope without it.
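One common way to realise this is an extra head on the shared representation that predicts, for example, a depth map, trained with a supervised loss alongside the RL loss; a sketch with assumed layer sizes and loss weight:

```python
import torch
import torch.nn as nn


class DepthAuxiliaryHead(nn.Module):
    """Predicts a coarse depth map from the agent's LSTM state (sketch)."""

    def __init__(self, lstm_units=128, depth_shape=(16, 16)):
        super().__init__()
        self.depth_shape = depth_shape
        self.head = nn.Linear(lstm_units, depth_shape[0] * depth_shape[1])

    def forward(self, h):
        return self.head(h).view(-1, *self.depth_shape)


def auxiliary_loss(depth_pred, depth_target, weight=0.1):
    # Depth targets are only available inside the simulator; at test time
    # the auxiliary head is simply ignored. The 0.1 weight is an assumption.
    return weight * nn.functional.mse_loss(depth_pred, depth_target)
```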
Possible solutions
Separating learning vision from the control problem
- Avoid learning vision from scratch every time; focus on the task at hand.
- Requires a "general" vision module that is useful on many possible tasks.
- Observations → general-purpose pretrained vision module → policy → controls
- Learn a robust and transferable vision module (see the sketch below), e.g. [Higgins et al., 2017 & Finn et al., 2017].
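In practice this can mean freezing a pretrained encoder and training only the control head on top of it; a sketch assuming a generic `pretrained_encoder`, not a specific model from the cited papers:

```python
import torch.nn as nn


def build_policy_on_frozen_vision(pretrained_encoder, feature_dim,
                                  n_actuators=9, n_velocities=5):
    """Reuse a general-purpose vision module; train only the policy (sketch)."""
    for p in pretrained_encoder.parameters():
        p.requires_grad = False      # vision is not relearned per task
    policy = nn.Sequential(
        nn.Linear(feature_dim, 128), nn.ReLU(),
        nn.Linear(128, n_actuators * n_velocities),
    )
    return nn.ModuleDict({"vision": pretrained_encoder, "policy": policy})
```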
Possible solutions
Learning from Demonstrations
- Imitation learning: directly copy the expert, e.g. supervised learning on (state, action) training data, yielding a policy that reproduces expert actions (sketched below).
- Inverse RL: first infer what the expert is trying to do (learn its reward function r), then learn your own optimal policy to achieve it using RL.
[e.g. Ho et al., 2016 & Wang et al., 2017]
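In its simplest form, imitation learning is just supervised learning on the expert's (state, action) pairs; a minimal behavioural-cloning sketch, where the policy network and dataset tensors are placeholders:

```python
import torch
import torch.nn as nn


def behavioural_cloning(policy, expert_states, expert_actions,
                        epochs=10, lr=1e-3):
    """Fit the policy to reproduce expert actions (supervised learning)."""
    optimiser = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = policy(expert_states)           # (N, n_actions) action scores
        loss = loss_fn(logits, expert_actions)   # expert_actions: (N,) int labels
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    return policy
```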
Modelling deformable objects is challenging!
- Current simulators fail to capture the full variability of deformable objects, and even small differences can break the robot!
- World's first cat-petting robotic arm!
Thank you
Dr Anil Bharath, Kai Arulkumaran
feryal.github.io @feryalmp @feryal feryal@morpheuslabs.co.uk