Maximum Entropy Framework: Inverse RL, Soft Optimality, and More
Chelsea Finn and Sergey Levine UC Berkeley
5/20/2017
Introductions
Chelsea Finn, PhD student
Sergey Levine, assistant professor
Mnih et al. ’15 video from Montessori New Zealand
what is the reward?
[Figure: reinforcement learning loop: agent and reward]
In the real world, humans don’t get a score.
a reward function is essential for RL; in real-world domains, the reward/cost is often difficult to specify
Kohl & Stone, ’04 Mnih et al. ’15 Silver et al. ‘16 Tesauro ’95
One approach: Mimic actions of human expert
Can we reason about human decision-making?
behavioral cloning: + simple, sometimes works well (a minimal sketch follows below)
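A minimal sketch of behavioral cloning as supervised regression on demonstrated state-action pairs; the linear least-squares policy and the array names S, A are illustrative stand-ins for a learned network and a real dataset:

```python
import numpy as np

# Hypothetical demonstration data: states S (N x d_s), expert actions A (N x d_a).
# Behavioral cloning reduces imitation to supervised learning: fit a policy
# that maps demonstrated states to the expert's actions.
def behavioral_cloning(S, A):
    # Least-squares fit of a linear policy a = W^T s; a stand-in for
    # training a neural network with mean-squared error.
    W, *_ = np.linalg.lstsq(S, A, rcond=None)
    return lambda s: s @ W
```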
Mombaur et al. ‘09 Muybridge (c. 1870) Ziebart ‘08 Li & Todorov ‘06
some mistakes matter more than others!
behavior is stochastic, but good behavior is still the most likely
no assumption of optimal behavior!
how to do inference?
“optimistic” transition (not a good idea!)
Ziebart et al. ’10, “Modeling Interaction via the Principle of Maximum Causal Entropy”
summary: probabilistic model for optimal control (similar to HMM, EKF, etc.)
inference: dynamic programming, value iteration (a sketch follows below)
variants: …
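A tabular sketch of that inference step, assuming a discounted MDP with known rewards R and dynamics P (array shapes are illustrative); the only change from standard value iteration is that the hard max over actions becomes a log-sum-exp:

```python
import numpy as np

def soft_value_iteration(R, P, gamma=0.99, iters=200):
    """Soft (maximum-entropy) value iteration.

    R: rewards, shape (S, A); P: dynamics, shape (S, A, S).
    The log-sum-exp backup is the backward inference pass of the
    probabilistic model of optimal control sketched above.
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)          # Q(s,a) = r(s,a) + γ E[V(s')]
        m = Q.max(axis=1)                # stabilize the log-sum-exp
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
    return V, Q
```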
Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from demonstrations
Challenges:
- underdefined problem
- difficult to evaluate a learned reward
- demonstrations may not be precisely optimal
given: expert demonstrations
goal: recover the cost/reward function
(IOC/IRL; Kalman ’64, Ng & Russell ’00)
Ng & Russell ’00: expert actions should have higher value than other actions
Abbeel & Ng ’04: policy optimized w.r.t. the learned cost should match feature counts of expert trajectories (see the sketch below)
Ratliff et al. ’06: max-margin formulation between value of expert actions and other actions
How to handle ambiguity and suboptimality?
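The feature-matching idea admits a one-line justification: with a linear cost, matching expected feature counts matches the expert’s expected cost for every possible weight vector (notation ψ, f as in the slides that follow):

```latex
% linear cost c_\psi(\tau) = \psi^\top f(\tau); feature matching implies cost matching:
\mathbb{E}_{\pi}\big[f(\tau)\big] = \mathbb{E}_{\pi_{\mathrm{expert}}}\big[f(\tau)\big]
\;\Longrightarrow\;
\mathbb{E}_{\pi}\big[c_\psi(\tau)\big] = \mathbb{E}_{\pi_{\mathrm{expert}}}\big[c_\psi(\tau)\big]
\quad \text{for all } \psi .
```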
Whiteboard
(Ziebart et al. ’08)
Notation:
r_ψ(τ): reward with parameters ψ [linear case: r_ψ(τ) = ψ^T f(τ)]
D = {τ_i}: dataset of demonstrations
handle ambiguity using probabilistic model of behavior
(Ziebart et al. ’08)
Whiteboard
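A sketch of the whiteboard derivation in the notation above, following the MaxEnt model of Ziebart et al. ’08: demonstrations are exponentially more likely the higher their reward, and the maximum-likelihood gradient contrasts demonstration features with the model’s own expectations:

```latex
p(\tau \mid \psi) = \frac{1}{Z(\psi)} \exp\big(r_\psi(\tau)\big),
\qquad
Z(\psi) = \int \exp\big(r_\psi(\tau)\big)\, d\tau

\mathcal{L}(\psi) = \frac{1}{N}\sum_{i=1}^{N} r_\psi(\tau_i) - \log Z(\psi),
\qquad
\nabla_\psi \mathcal{L}
= \mathbb{E}_{\mathcal{D}}\big[\nabla_\psi r_\psi(\tau)\big]
- \mathbb{E}_{p(\tau \mid \psi)}\big[\nabla_\psi r_\psi(\tau)\big]
```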
Goals:
Guided Cost Learning (Finn et al., ICML 2016)
guided cost learning algorithm:
initial policy π0
→ generate policy samples from π
→ update reward using samples & demos
→ update π w.r.t. reward
(a sketch of this loop follows below)
[Figure: neural-network cost c_θ(x): inputs x_1, …, x_n mapped through hidden layers h^(1), h^(2), h^(3); loop between policy π and reward r]
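A minimal sketch of this alternating loop; the helpers passed in (sample_fn, reward_step, policy_step) are hypothetical stand-ins for trajectory sampling, the sample-and-demo reward update, and a partial policy-optimization step:

```python
def guided_cost_learning(demos, policy, reward,
                         sample_fn, reward_step, policy_step, n_iters=100):
    """Alternate between updating the reward and the policy."""
    for _ in range(n_iters):
        samples = sample_fn(policy)                   # generate policy samples from π
        reward = reward_step(reward, demos, samples)  # update reward using samples & demos
        policy = policy_step(policy, reward)          # update π w.r.t. reward
    return policy, reward
```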
guided cost learning as a GAN:
initial policy π0
→ generate policy samples from π (generator)
→ update reward using samples & demos (discriminator)
→ update π w.r.t. reward (partially optimize)
update reward in inner loop of policy optimization
[Figure: same cost network c_θ(x); policy π plays the generator, reward r the discriminator]
same structure in Ho et al., ICML ’16, NIPS ’16:
initial policy π0
→ generate policy samples from π (generator)
→ update reward using samples & demos (discriminator)
→ update π w.r.t. reward (partially optimize)
[Figure: cost network c_θ(x); generator/discriminator loop as in guided cost learning]
Real-world Tasks: dish placement, pouring almonds
dish placement: state includes goal plate pose; pouring: state includes unsupervised visual features [Finn et al. ’16]; action: joint torques
Comparisons: Relative Entropy IRL (Boularias et al. ’11), Path Integral IRL (Kalakrishnan et al. ’13)
update reward using samples & demos; generate policy samples from q (importance-sampling sketch below)
[Figure: neural-network cost c_θ(x); reward r]
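Like guided cost learning, these comparison methods handle the intractable partition function with importance sampling under the sampling distribution q; a sketch in the notation above:

```latex
Z(\psi) \;\approx\; \frac{1}{M} \sum_{j=1}^{M}
\frac{\exp\big(r_\psi(\tau_j)\big)}{q(\tau_j)},
\qquad \tau_j \sim q .
```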
Conclusion: We can recover successful policies for new positions. Is the reward function also useful for new scenarios?
Note: normally the GAN discriminator is discarded
Strengths
Limitations
teaching or teleoperation (first person)
Similarly, GANs learn an objective for generative modeling. (Goodfellow et al. ’14)
Zhu et al. ’17; Isola et al. ’17; Arjovsky et al. ’17
[Figure: GAN diagram: noise → generator → generated samples; discriminator compares real vs. generated]
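For reference, the minimax objective of Goodfellow et al. ’14, matching the noise / real / generated labels in the figure:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```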
Connection between Inverse RL and GANs
correspondence (Finn*, Christiano*, Abbeel, Levine, arXiv ’16): policy π ~ q(τ) ↔ generator G; reward r ↔ discriminator D; trajectory τ ↔ sample x
the discriminator only needs to learn the data distribution: θ is independent of the generator density
Connection between Inverse RL and GANs
correspondence (Finn*, Christiano*, Abbeel, Levine, arXiv ’16): policy π ~ q(τ) ↔ generator G; cost c ↔ discriminator D; trajectory τ ↔ sample x
the generator objective is entropy-regularized RL
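A sketch of why: with the discriminator parameterized as on the next slide, the generator’s GAN loss reduces, up to constants, to entropy-regularized RL on the learned cost (as argued in Finn*, Christiano* et al.):

```latex
\mathcal{L}_{\mathrm{gen}}(q)
\;=\; \mathbb{E}_{\tau \sim q}\big[c_\theta(\tau)\big]
\;-\; \mathcal{H}\big(q(\tau)\big) \;+\; \mathrm{const}
```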
GANs for training EBMs
(Finn*, Christiano*, Abbeel, Levine, arXiv ’16)
correspondence: energy E ↔ discriminator D; sampler q(x) ↔ generator G
MaxEnt IRL is an energy-based model: use the generator’s density q(x) to form a consistent estimator of the energy function
D(x) = \frac{\frac{1}{Z}\exp(-E_\theta(x))}{\frac{1}{Z}\exp(-E_\theta(x)) + q(x)}
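A numerical sketch of this discriminator, computed in log space for stability; energy and log_q are hypothetical callables standing in for E_θ(x) and log q(x), and log_Z is the current estimate of log Z:

```python
import numpy as np

def ebm_discriminator(x, energy, log_q, log_Z):
    """D(x) = p_θ(x) / (p_θ(x) + q(x)), where p_θ(x) = exp(-E_θ(x)) / Z.

    Since D = 1 / (1 + q/p), this is a sigmoid of log p_θ(x) - log q(x).
    """
    log_p = -energy(x) - log_Z
    return 1.0 / (1.0 + np.exp(-(log_p - log_q(x))))
```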
Dai et al., ICLR submission ‘17
Kim & Bengio, ICLR Workshop ’16; Zhao et al., arXiv ’16; Zhai et al., ICLR submission ’17
hypotheses?
Stochastic energy-based policies
Tuomas Haarnoja, Haoran Tang
Wang & Liu, ‘17
Stochastic energy-based policies aid exploration
Stochastic energy-based policies provide pretraining
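A tabular sketch of such a policy, π(a|s) ∝ exp(Q(s,a)), with soft Q-values assumed given; because every reasonably valued action keeps probability mass, the policy stays multimodal, which is what aids exploration and pretraining:

```python
import numpy as np

def sample_soft_policy(Q_s, rng=None):
    """Sample an action from π(a|s) ∝ exp(Q(s,a)) for a single state.

    Q_s: soft Q-values for one state, shape (A,).
    """
    rng = rng or np.random.default_rng()
    logits = Q_s - Q_s.max()                       # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(Q_s), p=probs)
```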
Sallans & Hinton. Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task. 2000.
Nachum et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning. 2017.
Peters et al. Relative Entropy Policy Search. 2010.
O’Donoghue et al. Combining Policy Gradient and Q-Learning. 2017.
(beyond robotic manipulation and control)
Kitani et al. ’14: Model human pedestrian interactions
Ziebart et al. ’08: Predict taxi driver route preferences
Dragan et al. ’13: Generating human-legible motion
Li et al. ’17: Learn objective for dialog generation