Maximum Entropy Inverse RL, Adversarial Imitation Learning
Deep Reinforcement Learning and Control, Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Reinforcement learning / optimal control: given a dynamics model T (a probability distribution over next states given the current state and action) and a reward function R (which describes the desirability of states), compute a controller/policy π that prescribes the action to take in each state.
Diagram: Pieter Abbeel
IRL reverses the diagram: given a finite set of demonstration trajectories from an expert policy π*, recover the reward R and the policy π. In contrast to the DAGGER setup, we cannot interactively query the expert for additional labels.
Mathematically, imitation boils down to a distribution matching problem: the learner needs to come up with a reward/policy whose induced state-action trajectory distribution matches the expert's trajectory distribution.
Features f can be: # bridges crossed, # miles of interstate, # stoplights, traffic, width, tolls, etc.
Feature matching constraint:
$$\sum_{\text{Path } \tau_i} P(\tau_i)\, f_{\tau_i} = \tilde{f}$$
where $\tilde{f}$ are the demonstrated feature counts.
"If a driver uses 136.3 miles of interstate and crosses 12 bridges in a month's worth of trips, the model should also use 136.3 miles of interstate and 12 bridges in expectation for those same start-destination pairs."
A policy induces a distribution over trajectories:
$$p(\tau) = p(s_1)\prod_t p(a_t \mid s_t)\, P(s_{t+1}\mid s_t, a_t)$$
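As a concrete toy illustration of these feature counts, the sketch below (Python/NumPy) computes the demonstrated feature counts $\tilde f$ from a few made-up expert routes and the expected feature counts under a candidate trajectory distribution; all state names, features, and probabilities are hypothetical.

```python
import numpy as np

# Hypothetical per-state features: [bridges crossed, miles of interstate, stoplights]
features = {
    "A": np.array([0.0, 1.2, 1.0]),
    "B": np.array([1.0, 0.0, 2.0]),
    "C": np.array([0.0, 3.5, 0.0]),
}

def traj_features(traj):
    """Feature counts of a trajectory: f_tau = sum of per-state features."""
    return sum(features[s] for s in traj)

# Demonstrated trajectories D (expert routes), as state sequences.
demos = [["A", "C"], ["A", "B", "C"]]
f_demo = np.mean([traj_features(t) for t in demos], axis=0)   # demonstrated counts

# A candidate distribution P(tau) over an enumerable set of paths.
paths = [["A", "C"], ["A", "B", "C"], ["B", "B"]]
P = np.array([0.5, 0.3, 0.2])
f_model = sum(p * traj_features(t) for p, t in zip(P, paths))  # expected counts

print("demonstrated feature counts:", f_demo)
print("expected model feature counts:", f_model)
# Feature matching asks that these two vectors agree (in expectation).
```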
However, many distributions over paths can match the feature counts, and some will be very different from the observed behavior. For example, the model could produce a policy that avoids the interstate and bridges on all routes except a single one that drives 136 miles of interstate and crosses 12 bridges.
The principle of maximum entropy: the probability distribution which best represents the current state of knowledge is the one with largest entropy, among all distributions consistent with the precisely stated prior data (testable information). Another way of stating this: take precisely stated prior data or testable information about a probability distribution; consider the set of all trial probability distributions that would encode the prior data; the distribution with maximal information entropy is the best choice. Here the distribution is over paths, i.e., configurations over time.
Let's pick the policy (trajectory distribution) that satisfies the feature count constraints without over-committing!
Maximize the entropy over paths, i.e., stay as uniform as possible, while matching the feature counts and remaining a valid probability distribution:
$$\max_P\; -\sum_\tau P(\tau)\log P(\tau)$$
$$\text{s.t.}\quad \sum_\tau P(\tau)\, f_\tau = \tilde{f},\qquad \sum_\tau P(\tau) = 1$$
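A minimal numeric sketch of this constrained optimization (Python/NumPy with SciPy's SLSQP solver; the four paths, their scalar feature counts, and the demonstrated count are made-up): maximize the entropy of $P$ subject to feature matching and normalization, and check that the solution has the exponential form derived below.

```python
import numpy as np
from scipy.optimize import minimize

f = np.array([1.0, 2.0, 3.0, 4.0])    # feature count of each path (toy values)
f_demo = 2.2                           # demonstrated feature count

def neg_entropy(P):
    P = np.clip(P, 1e-12, 1.0)
    return np.sum(P * np.log(P))       # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda P: np.sum(P) - 1.0},        # valid distribution
    {"type": "eq", "fun": lambda P: np.dot(P, f) - f_demo},  # feature matching
]
res = minimize(neg_entropy, np.full(4, 0.25), bounds=[(0, 1)] * 4,
               constraints=constraints)
P = res.x

# The maximum entropy solution is exponential in the feature counts:
# P(tau) proportional to exp(-lambda * f_tau) for some multiplier lambda.
lam = -(np.log(P[1]) - np.log(P[0])) / (f[1] - f[0])
print("P:", np.round(P, 4))
print("exponential-form check:", np.round(np.exp(-lam * f) / np.sum(np.exp(-lam * f)), 4))
```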
Cost of a trajectory (linear in features):
$$c_\theta(\tau) = \theta^\top f_\tau = \sum_{s\in\tau}\theta^\top f_s$$
Constraint: match the cost of the expert trajectories in expectation:
$$\int p(\tau)\, c_\theta(\tau)\, d\tau = \frac{1}{|D|}\sum_{\tau^*\in D} c_\theta(\tau^*) = \tilde{c}$$
Maximum entropy formulation:
$$\min_p\; -H(p(\tau))\quad \text{s.t.}\quad \int p(\tau)\, c_\theta(\tau)\, d\tau = \tilde{c},\quad \int p(\tau)\, d\tau = 1$$
Lagrangian:
$$\mathcal{L}(p,\lambda) = \int p(\tau)\log p(\tau)\, d\tau + \lambda_1\Big(\int p(\tau)\, c_\theta(\tau)\, d\tau - \tilde{c}\Big) + \lambda_0\Big(\int p(\tau)\, d\tau - 1\Big)$$
Setting the derivative with respect to p to zero:
$$\frac{\partial \mathcal{L}}{\partial p} = \log p(\tau) + 1 + \lambda_1 c_\theta(\tau) + \lambda_0 = 0 \;\iff\; \log p(\tau) = -1 - \lambda_1 c_\theta(\tau) - \lambda_0$$
$$\Rightarrow\; p(\tau) = e^{-1-\lambda_0-\lambda_1 c_\theta(\tau)}$$
The maximum entropy distribution over paths is therefore exponential in the (negated) cost:
$$p(\tau) \propto e^{-c_\theta(\tau)}$$
Maximizing the entropy of the distribution over paths, subject to the cost/feature constraints from observed data, implies that we maximize the likelihood of the observed data under this maximum entropy (exponential family) distribution (Jaynes 1957).
$$P(\tau_i \mid \theta) = \frac{1}{Z(\theta)}\, e^{-c_\theta(\tau_i)} = \frac{1}{Z(\theta)}\, e^{-\sum_{s_j\in\tau_i}\theta^\top f_{s_j}}$$
Maximum likelihood estimation of $\theta$:
$$\max_\theta\; \log\prod_{\tau^*\in D} p(\tau^*) \;\iff\; \max_\theta\; \sum_{\tau^*\in D}\log p(\tau^*)$$
$$\max_\theta\; \sum_{\tau^*\in D}\log\frac{e^{-c_\theta(\tau^*)}}{Z}$$
$$\max_\theta\; \sum_{\tau^*\in D} -c_\theta(\tau^*) \;-\; \sum_{\tau^*\in D}\log\Big(\sum_\tau e^{-c_\theta(\tau)}\Big)$$
$$\max_\theta\; \sum_{\tau^*\in D} -c_\theta(\tau^*) \;-\; |D|\log\Big(\sum_\tau e^{-c_\theta(\tau)}\Big)$$
$$\min_\theta\; \sum_{\tau^*\in D} c_\theta(\tau^*) \;+\; |D|\log\Big(\sum_\tau e^{-c_\theta(\tau)}\Big) \;\equiv\; J(\theta)$$
Gradient:
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\frac{dc_\theta(\tau^*)}{d\theta} + |D|\,\frac{1}{\sum_\tau e^{-c_\theta(\tau)}}\sum_\tau e^{-c_\theta(\tau)}\Big(-\frac{dc_\theta(\tau)}{d\theta}\Big)$$
$$= \sum_{\tau^*\in D}\frac{dc_\theta(\tau^*)}{d\theta} \;-\; |D|\sum_\tau p(\tau\mid\theta)\,\frac{dc_\theta(\tau)}{d\theta}$$
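A minimal sketch of this objective and its gradient for a small enumerable set of trajectories with linear costs $c_\theta(\tau) = \theta^\top f_\tau$ (Python/NumPy; the feature counts and the demonstration set are made-up): the gradient is the demonstrated term minus $|D|$ times the model expectation, exactly as above, and plain gradient descent on $J$ drives the model's expected feature counts toward the demonstrated ones.

```python
import numpy as np

# Made-up feature counts f_tau for three enumerable trajectories (2 features each).
F = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
demo_idx = [0, 2, 2]                       # demonstrated trajectories tau* in D

def J_and_grad(theta):
    c = F @ theta                          # c_theta(tau) = theta^T f_tau
    logZ = np.log(np.sum(np.exp(-c)))
    J = np.sum(c[demo_idx]) + len(demo_idx) * logZ
    p = np.exp(-c - logZ)                  # p(tau | theta), exponential in -cost
    grad = F[demo_idx].sum(axis=0) - len(demo_idx) * (p @ F)
    return J, grad

theta = np.zeros(2)
for _ in range(500):                       # gradient descent on J(theta)
    _, g = J_and_grad(theta)
    theta -= 0.1 * g

p = np.exp(-(F @ theta)); p /= p.sum()
print("learned cost weights:", np.round(theta, 3))
print("model E[f]:", np.round(p @ F, 3),
      " demo mean f:", np.round(F[demo_idx].mean(axis=0), 3))
```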
Successful imitation boils down to learning a policy that matches the expert's state visitation distribution (or state-action visitation distribution).
Since the cost decomposes over the states of a trajectory,
$$c_\theta(\tau) = \sum_{s\in\tau} c_\theta(s), \qquad p(\tau) \propto e^{-c_\theta(\tau)} = e^{-\sum_{s\in\tau} c_\theta(s)},$$
the gradient can be written in terms of state visitation distributions (or state-action visitations, with $\sum_{s,a} p(s,a\mid\theta)\,\frac{dc_\theta(s,a)}{d\theta}$ in the second term):
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\sum_{s\in\tau^*}\frac{dc_\theta(s)}{d\theta} \;-\; |D|\sum_s p(s\mid\theta)\,\frac{dc_\theta(s)}{d\theta}$$
In the tabular case and for known dynamics we can compute the state visitation distributions with dynamic programming, assuming we have obtained the policy $p(a\mid s)$:
$$\mu_1(s) = p(s_1 = s)$$
$$\text{for } t = 1,\dots,T:\qquad \mu_{t+1}(s) = \sum_a\sum_{s'}\mu_t(s')\, p(a\mid s')\, p(s\mid s', a)$$
$$p(s\mid\theta, T) = \sum_t \mu_t(s)$$
For linear costs, $c_\theta(s) = \theta^\top f_s$:
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\sum_{s\in\tau^*} f_s \;-\; |D|\sum_s p(s\mid\theta, T)\, f_s$$
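A minimal sketch of this computation in a toy tabular MDP (Python/NumPy; the dynamics, policy, initial distribution, per-state features, and demonstration sequences are all made-up, and the policy for the current cost is taken as given, as in the slide): run the forward recursion for the time-indexed densities $\mu_t$, sum them into $p(s\mid\theta, T)$, and form the linear-cost gradient.

```python
import numpy as np

S, A, T = 3, 2, 5                      # states, actions, horizon
rng = np.random.default_rng(0)

# Made-up inputs: dynamics P[s, a, s'] = p(s'|s, a), policy pi[s, a] = p(a|s),
# initial state distribution, and per-state features f_s (2 features per state).
P = rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
mu1 = np.array([1.0, 0.0, 0.0])
F = rng.random((S, 2))

# Forward DP: mu_{t+1}(s) = sum_a sum_{s'} mu_t(s') p(a|s') p(s|s', a)
mu = np.zeros((T, S))
mu[0] = mu1
for t in range(T - 1):
    mu[t + 1] = np.einsum("s,sa,sax->x", mu[t], pi, P)
p_s = mu.sum(axis=0)                   # p(s | theta, T) = sum_t mu_t(s)

# Linear-cost gradient: demonstrated feature counts minus |D| * expected counts.
demos = [[0, 1, 1, 2, 2], [0, 0, 1, 2, 2]]       # made-up expert state sequences
f_demo = sum(F[s] for traj in demos for s in traj)
grad = f_demo - len(demos) * (p_s @ F)
print("expected state visitations p(s|theta,T):", np.round(p_s, 3))
print("gradient of J w.r.t. theta:", np.round(grad, 3))
```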
Using the time-indexed state densities (known dynamics, linear costs):
$$\nabla_\theta J(\theta) = \sum_{\tau^*\in D}\sum_{s_t\in\tau^*}\frac{dc_\theta(s_t)}{d\theta} \;-\; |D|\sum_s p(s\mid\theta, T)\,\frac{dc_\theta(s)}{d\theta}$$
Demonstrated behavior: bridges crossed: 3; miles of interstate: 20.7; stoplights: 10.
Model behavior (expectation): bridges crossed: ? (cost weight 5.0); miles of interstate: ? (cost weight 3.0); stoplights: ?
Demonstrated behavior: bridges crossed: 3; miles of interstate: 20.7; stoplights: 10.
Model behavior (expectation): bridges crossed: 4.7 (+1.7, cost weight 5.0); miles of interstate: 16.2 (-4.5, cost weight 3.0); stoplights: 7.4 (-2.6).
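Connecting these numbers to the linear-cost gradient above (a worked step; the step size is unspecified, so the exact updated weights do not follow from it): the per-demonstration gradient of $J$ with respect to each feature weight is the demonstrated count minus the model's expected count, i.e., the negatives of the differences shown above,
$$\frac{1}{|D|}\nabla_\theta J = \tilde f - \mathbb{E}_{p(\tau\mid\theta)}[f] = \begin{pmatrix} 3 - 4.7 \\ 20.7 - 16.2 \\ 10 - 7.4 \end{pmatrix} = \begin{pmatrix} -1.7 \\ +4.5 \\ +2.6 \end{pmatrix},$$
so a descent step $\theta \leftarrow \theta - \alpha\,\nabla_\theta J$ raises the cost weight on bridges (the model crosses too many) and lowers it on interstate miles and stoplights (the model uses too few), pushing the expected counts toward the demonstrated ones.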
Demonstrated behavior: bridges crossed: 3; miles of interstate: 20.7; stoplights: 10.
Model behavior (expectation): bridges crossed: 4.7 (cost weight 5.0 -> 7.2); miles of interstate: 16.2 (cost weight 3.0 -> 1.1); stoplights: 7.4.
Next: approximating the partition function Z with samples: Boularias et al. 2011, Kalakrishnan et al. 2013, Finn et al. 2016.
Maximum likelihood over the demonstrations:
$$\max_\theta\; \sum_{\tau\in D}\log p(\tau), \qquad p(\tau) = \frac{1}{Z}\exp(-C_\theta(\tau)), \qquad Z = \int \exp(-C_\theta(\tau))\, d\tau$$
The cost of a trajectory decomposes over the costs of individual state-action pairs:
$$C_\theta(\tau) = \sum_t c_\theta(x_t, u_t)$$
Before, the cost was linear in hand-designed features: $c_\theta(x_t, u_t) = \theta^\top f(x_t, u_t)$.
Written in the form of a loss function, the intractable term is the partition function:
$$Z = \int \exp(-C_\theta(\tau))\, d\tau$$
What should the sampling distribution q be? One that puts its mass on low-cost trajectories (which have much higher likelihood under the model), guided by the current estimate of the cost.
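A minimal sketch of the resulting sample-based objective (Python/NumPy; the per-step cost, the demonstration trajectories, and the Gaussian sampling distribution q are all made-up): the partition function is estimated by importance sampling with trajectories drawn from q, weighted by $\exp(-C_\theta(\tau))/q(\tau)$. The specific estimators in the papers cited above differ in their details.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2])                    # current cost parameters

def traj_cost(traj, theta):
    """C_theta(tau) = sum_t c_theta(x_t) with a made-up per-step cost."""
    feats = np.stack([traj, traj ** 2], axis=-1)  # toy per-step features of x_t
    return float(np.sum(feats @ theta))

# Demonstrations and samples from the current sampling distribution q (std normal).
demos = [rng.normal(0.0, 0.5, size=10) for _ in range(5)]
samples = [rng.normal(0.0, 1.0, size=10) for _ in range(50)]
log_q = np.array([np.sum(-0.5 * s ** 2 - 0.5 * np.log(2 * np.pi)) for s in samples])

# Importance-sampled partition function: Z ~= (1/N) sum_j exp(-C(tau_j)) / q(tau_j)
costs = np.array([traj_cost(s, theta) for s in samples])
log_w = -costs - log_q
log_Z = np.logaddexp.reduce(log_w) - np.log(len(samples))

# Negative log-likelihood of the demonstrations under p(tau) = exp(-C(tau)) / Z
demo_costs = np.array([traj_cost(d, theta) for d in demos])
nll = np.mean(demo_costs) + log_Z
print("estimated log Z:", round(log_Z, 3), " demo NLL:", round(nll, 3))
```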
Guided cost learning alternates three steps: generate policy samples from q (this can be any method that, given rewards, computes a policy, i.e., the forward RL problem); update the cost using the samples and the demonstrations (given expert demonstrations and policy-sampled trajectories, improve the rewards/costs, i.e., inverse RL); update q with respect to the current cost.
Diagram from Chelsea Finn
[Figure: a neural network cost $c_\theta(x)$ with inputs $x_1 \dots x_n$ and hidden layers $h^{(1)}, h^{(2)}, h^{(3)}$; the policy q and the cost c are optimized in alternation. Diagram from Chelsea Finn]
Generator vs. discriminator (generative adversarial networks): noise $z \sim \text{uniform}([0,1])$ is mapped by the generator to a sample; the discriminator sees real data $x$ and generated samples, and $D(x)$ is the probability that $x$ came from the data rather than the generator.
$$\min_G\max_D\; \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$$
Recipe for success: samples from the model should be indistinguishable from samples from the data, i.e., $x\sim p_{\text{data}}(x)$ vs. $x\sim p_{\text{model}}(x)$ with $p_{\text{model}} \approx p_{\text{data}}$.
Contrast with likelihood-based deep generative models. Deep Boltzmann machines maximize
$$\max_\theta\; \frac{1}{m}\sum_{i=1}^m \log p(x^{(i)}), \qquad \frac{d}{d\theta_i}\log p(x) = \frac{d}{d\theta_i}\Big[\log\sum_h \tilde{p}(h, x) - \log Z(\theta)\Big], \qquad \frac{d}{d\theta_i}\log Z(\theta) = \frac{\frac{d}{d\theta_i} Z(\theta)}{Z(\theta)},$$
which involves the intractable partition function $Z(\theta)$. Deep directed latent variable models factorize as
$$p(x, h) = p(x\mid h^{(1)})\, p(h^{(1)}\mid h^{(2)})\cdots p(h^{(L-1)}\mid h^{(L)})\, p(h^{(L)}),$$
with an intractable posterior $p(h\mid x)$.
The Variational Autoencoder model: Kingma and Welling, Auto-Encoding Variational Bayes, International Conference on Learning Representations (ICLR) 2014; Rezende et al., Stochastic backpropagation and approximate variational inference in deep latent Gaussian models, arXiv. They use a reparametrization that allows them to train very efficiently with gradient backpropagation.
[Figure: GAN setup. Input noise z passes through a differentiable function G to give x sampled from the model; x sampled from the data and x sampled from the model are fed to a differentiable function D, which tries to distinguish model samples from data.]
$$\min_G\max_D\; V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$$
In practice the generator instead maximizes $\mathbb{E}_{z\sim p_z(z)}[\log D(G(z))]$ (the non-saturating objective).
Adapted from Ian Goodfellow
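A minimal sketch of this two-player objective as a training loop (PyTorch; the 1-D Gaussian data, network sizes, and learning rates are arbitrary choices for illustration): the discriminator ascends $V(D,G)$ while the generator uses the non-saturating variant noted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
eps = 1e-8

for step in range(2000):
    x = torch.randn(64, 1) + 3.0           # real data x ~ p_data, here N(3, 1)
    z = torch.rand(64, 1)                   # noise z ~ uniform([0, 1])

    # Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))]
    d_loss = -(torch.log(D(x) + eps) + torch.log(1 - D(G(z).detach()) + eps)).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step (non-saturating): ascend E[log D(G(z))]
    g_loss = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

print("mean of generated samples:", G(torch.rand(1000, 1)).mean().item())
```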
[Figure: GAN training dynamics, showing the data distribution, the model distribution, and the discriminator output for a poorly fit model, after updating D, after updating G, and at the mixed strategy equilibrium. Diagram from Ian Goodfellow]
Examples: male -> female; anybody -> Tom Cruise. Adversarial Inverse Graphics Networks, Tung et al. 2017.
Generative adversarial imitation learning: find a policy $\pi_\theta$ that makes it impossible for a discriminator network to distinguish between trajectory chunks visited by the expert and those visited by the learner's policy $\pi_\theta$. Recall the GAN objective:
$$\min_G\max_D\; V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$$
The imitation analogue is
$$\min_{\pi_\theta}\max_D\; \mathbb{E}_{\pi^*}[\log D(s)] + \mathbb{E}_{\pi_\theta}[\log(1 - D(s))],$$
where D outputs 1 if the state comes from the demonstration policy. The reward for the policy optimization is how well the learner matched the demo trajectory distribution, in other words how well it confused the discriminator: $\log D(s)$.
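A minimal sketch of the discriminator side of this objective (PyTorch; the expert and policy state batches are made-up stand-ins, and the policy update that would consume the rewards, e.g., a policy-gradient step, is omitted): D is trained toward 1 on expert states and 0 on policy states, and the learner's per-state reward is $\log D(s)$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim = 4
D = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
eps = 1e-8

# Made-up batches standing in for expert-visited and policy-visited states.
expert_states = torch.randn(128, state_dim) + 1.0
policy_states = torch.randn(128, state_dim)

# Discriminator update: ascend E_pi*[log D(s)] + E_pi_theta[log(1 - D(s))]
d_loss = -(torch.log(D(expert_states) + eps).mean()
           + torch.log(1 - D(policy_states) + eps).mean())
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Reward handed to the policy optimizer: how well each visited state
# confuses the discriminator, r(s) = log D(s).
with torch.no_grad():
    rewards = torch.log(D(policy_states) + eps).squeeze(-1)
print("mean imitation reward:", rewards.mean().item())
```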
Generative Adversarial Imitation Learning, Ho and Ermon, NIPS 2016.