# Maximum Entropy Inverse Reinforcement Learning


1. Maximum Entropy Inverse Reinforcement Learning. Maximilian Luz, Algorithms for Imitation Learning, Summer Semester 2019, MLR/IPVS.

2. Outline
   - Nomenclature
   - Basis: Feature Expectation Matching, Principle of Maximum Entropy
   - Maximum Entropy IRL: Algorithm and Derivation
   - Extensions
   - Demonstration

3. Nomenclature

4. Nomenclature (i)

   Markov Decision Process (MDP):
   - States: $S = \{s_i\}_i$
   - Actions: $A = \{a_i\}_i$
   - Transition dynamics: $T = p(s_{t+1} \mid s_t, a_t)$
   - Reward: $R\colon S \to \mathbb{R}$

   Trajectories and demonstrations:
   - Trajectory: $\tau = \big((s_1, a_1), (s_2, a_2), \ldots, s_{|\tau|}\big)$
   - Demonstrations: $D = \{\tau_i\}_i$

5. Nomenclature (ii)

   Features:
   - $\phi\colon S \to \mathbb{R}^d$
   - $\phi(\tau) = \sum_{s_t \in \tau} \phi(s_t)$

   Policies (stochastic):
   - $\pi(a_j \mid s_i)$
   - Learner policy $\pi_L$, expert policy $\pi_E$
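The trajectory feature count $\phi(\tau)$ is just a sum of per-state feature vectors. A minimal NumPy sketch; the feature matrix and trajectory below are hypothetical, chosen only to illustrate the definition:

```python
import numpy as np

# Hypothetical 4-state MDP with 2-dimensional state features.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.0, 0.0]])  # phi[s] is the feature vector of state s

def trajectory_features(trajectory, phi):
    """Sum the state features along a trajectory: phi(tau) = sum of phi(s_t)."""
    return sum(phi[s] for s, _ in trajectory)

# Trajectory as (state, action) pairs; the final state carries action None.
tau = [(0, 1), (2, 0), (3, None)]
print(trajectory_features(tau, phi))  # -> [2. 1.]
```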

6. Basis

7. Feature Expectation Matching

   Idea: The learner should visit the same features as the expert (in expectation).

   Feature expectation matching [Abbeel and Ng 2004]:
   $$\mathbb{E}_{\pi_E}[\phi(\tau)] = \mathbb{E}_{\pi_L}[\phi(\tau)], \qquad \mathbb{E}_{\pi_L}[\phi(\tau)] = \sum_{\tau \in \mathcal{T}} p(\tau) \cdot \phi(\tau)$$

   Note: We want to find a reward $R\colon S \to \mathbb{R}$ defining $\pi_L(a \mid s)$ and thus $p(\tau)$.

   Observation: Matching feature expectations implies optimality for a linear (unknown) reward [Abbeel and Ng 2004]:
   $$R(s) = \omega^\top \phi(s), \qquad \omega \in \mathbb{R}^d \text{ (reward parameters)}$$
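The expert side $\mathbb{E}_{\pi_E}[\phi(\tau)]$ is typically estimated empirically from the demonstrations. A minimal sketch, assuming (hypothetically) that demonstrations are given as plain state sequences and features as a per-state matrix:

```python
import numpy as np

phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])  # hypothetical per-state features

def empirical_feature_expectation(demonstrations, phi):
    """Monte-Carlo estimate of E_pi_E[phi(tau)]: average the summed
    state features over all demonstrated trajectories."""
    f = np.zeros(phi.shape[1])
    for tau in demonstrations:
        for s in tau:
            f += phi[s]
    return f / len(demonstrations)

demos = [[0, 2], [1, 2]]  # two demonstrations as state sequences
print(empirical_feature_expectation(demos, phi))  # -> [1.5 1.5]
```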

8. Feature Expectation Matching: Problem

   Problem: Multiple (infinitely many) solutions, i.e. the problem is ill-posed in the sense of Hadamard.
   - Multiple reward functions $R$ lead to the same policy $\pi$.
   - But why? Reward shaping [Ng et al. 1999]: certain transformations of the reward leave the optimal policy unchanged.

   Idea (Ziebart et al. 2008): Regularize by maximizing the entropy $H(p)$.

9. Shannon's Entropy

   $$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

   Here $-\log_2 p(x)$ is the optimal encoding length of event $x$, so $H(p)$ is the expected information received when observing $x \in \mathcal{X}$, i.e. a measure of uncertainty:
   - A single certain event: no uncertainty, $H(p)$ minimal.
   - Uniformly random events: $H(p)$ maximal.

   (Figure: optimal encoding length $-\log_2 p(x)$ over the probability of occurrence $p(x)$, with two example distributions.)
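The two extreme cases above can be checked directly with a small sketch of $H(p)$ (using the convention $0 \log 0 := 0$):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x), in bits (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # skip zero-probability events
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([1.0, 0.0, 0.0, 0.0]))      # certain outcome -> 0.0 (minimal)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform -> 2.0 (maximal for 4 events)
```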

10. Principle of Maximum Entropy [Jaynes 1957]

    Consider a problem with multiple solutions $p, q, \ldots$ (e.g. feature expectation matching):
    - The solutions $p, q$ represent the same partial information, differing only in bias relative to a baseline.
    - Maximizing entropy therefore minimizes bias.

11. Maximum Entropy IRL

12. Constrained Optimization Problem

    $$\begin{aligned}
    \arg\max_p \; & H(p) && \text{(entropy)} \\
    \text{subject to} \; & \mathbb{E}_{\pi_E}[\phi(\tau)] = \mathbb{E}_{\pi_L}[\phi(\tau)] && \text{(feature matching)} \\
    & \textstyle\sum_{\tau \in \mathcal{T}} p(\tau) = 1, \quad \forall \tau \in \mathcal{T}\colon p(\tau) > 0 && \text{(probability distribution)}
    \end{aligned}$$

13. Solution: Deterministic Dynamics

    Solution via Lagrange multipliers [Ziebart et al. 2008], for deterministic transition dynamics:
    $$p(\tau \mid \omega) = \frac{1}{Z(\omega)} \exp\big(\omega^\top \phi(\tau)\big), \qquad Z(\omega) = \sum_{\tau \in \mathcal{T}} \exp\big(\omega^\top \phi(\tau)\big)$$
    where $Z(\omega)$ is the partition function (normalization) and $\omega$ are the Lagrange multipliers for feature matching. Thus $p(\tau) \propto \exp(R(\tau))$ with $R(\tau) = \omega^\top \phi(\tau)$.
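For an MDP small enough to enumerate its trajectories, this distribution is just a softmax over trajectory returns $\omega^\top \phi(\tau)$. A sketch under that assumption, with a made-up feature matrix holding one row $\phi(\tau)$ per trajectory:

```python
import numpy as np

def maxent_trajectory_distribution(features, omega):
    """p(tau|omega) = exp(omega^T phi(tau)) / Z(omega) over an enumerable
    trajectory set; features[i] is phi(tau_i)."""
    scores = features @ omega
    scores -= scores.max()        # shift for numerical stability (cancels in Z)
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical: three trajectories with scalar feature counts 1, 2, 3.
F = np.array([[1.0], [2.0], [3.0]])
print(maxent_trajectory_distribution(F, np.array([0.0])))  # omega = 0 -> uniform
print(maxent_trajectory_distribution(F, np.array([1.0])))  # favors high phi(tau)
```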

14. Solution: Stochastic Dynamics

    Stochastic transition dynamics, via adaptation of the deterministic solution [Ziebart et al. 2008]:
    $$p(\tau \mid \omega) = \frac{1}{Z_s(\omega)} \exp\big(\omega^\top \phi(\tau)\big) \prod_{s_t, a_t, s_{t+1} \in \tau} p(s_{t+1} \mid s_t, a_t)$$
    The product is the combined transition probability of the trajectory; the adaptation assumes limited transition randomness.

    Problem: The adaptation introduces a bias [Osa et al. 2018; Ziebart 2010], effectively optimizing
    $$\tilde{R}(\tau) = \omega^\top \phi(\tau) + \sum_{s_t, a_t, s_{t+1} \in \tau} \log p(s_{t+1} \mid s_t, a_t)$$

    Solution: Maximum Causal Entropy IRL (Ziebart 2010, not covered here).

15. Likelihood and Gradient

    Obtain the parameters by maximizing the (log-)likelihood:
    $$\omega^* = \arg\max_\omega \mathcal{L}(\omega) = \arg\max_\omega \sum_{\tau \in D} \log p(\tau \mid \omega)$$

    Observations:
    - Convex, so it can be optimized via gradient ascent.
    - Maximizing the likelihood is equivalent to minimizing a KL-divergence [Bishop 2006], i.e. an M-projection onto the manifold of maximum entropy distributions [Osa et al. 2018].

    Gradient [Ziebart et al. 2008]:
    $$\nabla \mathcal{L}(\omega) = \mathbb{E}_D[\phi(\tau)] - \sum_{\tau \in \mathcal{T}} p(\tau \mid \omega)\, \phi(\tau) = \mathbb{E}_D[\phi(\tau)] - \sum_{s_i \in S} D_{s_i}\, \phi(s_i)$$
    The first term "counts" features in $D$; the sum over all trajectories is computationally infeasible, so it is rewritten in terms of the state visitation frequency $D_{s_i}$.
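Given the state visitation frequencies, the rewritten gradient is a one-liner. A sketch with hypothetical inputs (one-hot features, hand-picked expert expectation and visitation counts):

```python
import numpy as np

def likelihood_gradient(expert_features, svf, phi):
    """Gradient of the MaxEnt log-likelihood:
    grad L(omega) = E_D[phi(tau)] - sum_s D_s * phi(s),
    where svf[s] is the state visitation frequency D_s."""
    return expert_features - svf @ phi

phi = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot state features
expert = np.array([3.0, 1.0])             # empirical E_D[phi(tau)]
svf = np.array([2.0, 2.0])                # current D_s under p(tau|omega)
print(likelihood_gradient(expert, svf, phi))  # -> [1. -1.]
```

With one-hot features the gradient is literally "expert visit counts minus learner visit counts", which matches the feature-matching intuition.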

16. State Visitation Frequency

    Observation: The maximum entropy policy can be computed from the reward $R$:
    $$\pi_{ME}(a_j \mid s_i, \omega) \propto \sum_{\tau \in \mathcal{T}\colon (s_i, a_j) \in \tau_{t=1}} p(\tau \mid \omega)$$

    Idea: Split into sub-problems:
    1. Backward pass: Compute the policy $\pi_{ME}(a \mid s, \omega)$.
    2. Forward pass: Compute the state visitation frequency from $\pi_{ME}(a \mid s, \omega)$.

17. State Visitation Frequency: Backward Pass

    Recursively expanding the observation yields
    $$\pi_{ME}(a_j \mid s_i, \omega) = \frac{Z_{s_i, a_j}}{Z_{s_i}}$$
    with
    $$Z_{s_i, a_j} = \sum_{s_k \in S} p(s_k \mid s_i, a_j) \cdot \exp\big(\omega^\top \phi(s_i)\big) \cdot Z_{s_k}, \qquad Z_{s_i} = \sum_{a_j \in A} Z_{s_i, a_j} \quad \text{(normalization)}$$

    Algorithm (parallels value iteration):
    1. Initialize $Z_{s_k} = 1$ for all terminal states $s_k \in S_{\text{terminal}}$.
    2. Compute $Z_{s_i, a_j}$ and $Z_{s_i}$ by recursively backing up from the terminal states.
    3. Compute $\pi_{ME}(a_j \mid s_i, \omega) = Z_{s_i, a_j} / Z_{s_i}$.
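A minimal NumPy sketch of this backward pass, not the reference implementation: the transition tensor, chain MDP, and iteration count below are made up, and the recursion is simply run for a fixed number of steps instead of to convergence:

```python
import numpy as np

def backward_pass(P, reward, terminal, n_iters=50):
    """Backward pass sketch: compute pi(a|s) = Z_{s,a} / Z_s from the
    recursion Z_{s,a} = sum_s' p(s'|s,a) * exp(reward(s)) * Z_{s'}.
    P[s, a, s2] is the transition probability p(s2 | s, a)."""
    n_states, n_actions, _ = P.shape
    zs = np.zeros(n_states)
    zs[terminal] = 1.0                           # Z = 1 at terminal states
    for _ in range(n_iters):
        zsa = np.exp(reward)[:, None] * (P @ zs)  # Z_{s,a}, shape (S, A)
        zs = zsa.sum(axis=1)                      # Z_s
    return zsa / zs[:, None]                      # pi(a|s)

# Hypothetical 3-state chain: action 0 stays, action 1 moves right; state 2 absorbing.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = P[1, 0, 1] = P[2, 0, 2] = 1.0
P[0, 1, 1] = P[1, 1, 2] = P[2, 1, 2] = 1.0
pi = backward_pass(P, reward=np.array([0.0, 0.0, 1.0]), terminal=[2])
print(pi)  # each row sums to 1; moving toward the rewarding state is preferred
```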

18. State Visitation Frequency: Forward Pass

    Idea: Propagate the starting-state probabilities $p_0(s)$ forward via the policy $\pi_{ME}(a \mid s, \omega)$.

    Algorithm:
    1. Initialize $D_{s_i, 0} = p_0(s_i) = p(\tau \in \mathcal{T}\colon s_i \in \tau_{t=1})$.
    2. Recursively compute
       $$D_{s_k, t+1} = \sum_{s_i \in S} \sum_{a_j \in A} D_{s_i, t} \cdot \pi_{ME}(a_j \mid s_i) \cdot p(s_k \mid a_j, s_i)$$
    3. Sum over $t$, i.e. $D_{s_i} = \sum_{t = 0, \ldots} D_{s_i, t}$.
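The forward pass can be sketched as repeated propagation of a state distribution; the two-state MDP and the fixed step count below are hypothetical, chosen so the result is easy to verify by hand:

```python
import numpy as np

def forward_pass(P, policy, p0, n_steps=20):
    """Forward pass sketch: propagate start-state probabilities through
    the policy to get visitation frequencies D_s = sum_t D_{s,t}."""
    d_t = p0.copy()                 # D_{s,0}
    svf = d_t.copy()
    for _ in range(n_steps - 1):
        # D_{s',t+1} = sum_s sum_a D_{s,t} * pi(a|s) * p(s'|s,a)
        d_t = np.einsum('s,sa,sak->k', d_t, policy, P)
        svf += d_t
    return svf

# Hypothetical 2-state MDP: action 0 stays, action 1 swaps states.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
policy = np.full((2, 2), 0.5)       # uniformly random policy
print(forward_pass(P, policy, p0=np.array([1.0, 0.0]), n_steps=4))  # -> [2.5 1.5]
```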

19. Summary

    Algorithm: Iterate until convergence:
    1. Compute the policy $\pi_{ME}(a \mid s, \omega)$ (backward pass).
    2. Compute the state visitation frequency $D_{s_i}$ (forward pass).
    3. Compute the gradient $\nabla \mathcal{L}(\omega)$ of the likelihood.
    4. Gradient-based optimization step, e.g. $\omega \leftarrow \omega + \eta \nabla \mathcal{L}(\omega)$.

    Assumptions:
    - Known transition dynamics $T = p(s_{t+1} \mid s_t, a_t)$.
    - Linear reward $R(s) = \omega^\top \phi(s)$.

    Other drawbacks:
    - Limited transition randomness.
    - Need to "solve" the MDP once per iteration.
    - Reward bias for stochastic transition dynamics.
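The four iterated steps can be put together into one compact sketch. This is my illustration, not the deck's or the linked repository's implementation: the two-state MDP, one-hot features, expert feature vector, learning rate, and fixed inner iteration counts are all made up:

```python
import numpy as np

def maxent_irl(P, phi, expert_features, p0, terminal, lr=0.1, n_iters=100):
    """MaxEnt IRL loop sketch: backward pass (policy), forward pass
    (state visitation frequency), then a gradient ascent step on omega.
    P[s, a, s2] = p(s2 | s, a); phi[s] is the feature vector of state s."""
    n_states, n_actions, _ = P.shape
    omega = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        reward = phi @ omega
        # 1. Backward pass: partition values -> policy pi(a|s).
        zs = np.zeros(n_states)
        zs[terminal] = 1.0
        for _ in range(2 * n_states):
            zsa = np.exp(reward)[:, None] * (P @ zs)
            zs = zsa.sum(axis=1)
        policy = zsa / zs[:, None]
        # 2. Forward pass: expected state visitation frequencies D_s.
        d_t = p0.copy()
        svf = p0.copy()
        for _ in range(2 * n_states):
            d_t = np.einsum('s,sa,sak->k', d_t, policy, P)
            svf += d_t
        # 3./4. Gradient ascent: grad L = E_D[phi(tau)] - sum_s D_s phi(s).
        omega += lr * (expert_features - svf @ phi)
    return omega

# Hypothetical 2-state MDP: action 0 stays, action 1 moves right; state 1 absorbing.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 1] = 1.0
omega = maxent_irl(P, np.eye(2), expert_features=np.array([2.0, 3.0]),
                   p0=np.array([1.0, 0.0]), terminal=[1])
print(omega)
```

Note how both assumptions from the summary appear directly in the code: the transition tensor `P` must be known, and the reward enters only through the linear form `phi @ omega`.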

20. Extensions
    - Maximum Causal Entropy IRL [Ziebart 2010]
    - Maximum Entropy Deep IRL [Wulfmeier et al. 2015]
    - Maximum Entropy IRL in Continuous State Spaces with Path Integrals [Aghasadeghi and Bretl 2011]

21. Demonstration: github.com/qzed/irl-maxent

22. References
    - Abbeel, Pieter and Andrew Y. Ng (2004). "Apprenticeship Learning via Inverse Reinforcement Learning". In: Proc. 21st Intl. Conference on Machine Learning (ICML '04).
    - Aghasadeghi, N. and T. Bretl (Sept. 2011). "Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals". In: Intl. Conference on Intelligent Robots and Systems (IROS 2011), pp. 1561-1566.
    - Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag New York Inc.
    - Jaynes, E. T. (May 1957). "Information Theory and Statistical Mechanics". In: Physical Review 106.4, pp. 620-630.
    - Ng, Andrew Y., Daishi Harada, and Stuart J. Russell (1999). "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping". In: Proc. 16th Intl. Conference on Machine Learning (ICML '99), pp. 278-287.
    - Osa, Takayuki et al. (2018). "An Algorithmic Perspective on Imitation Learning". In: Foundations and Trends in Robotics 7.1-2, pp. 1-179.
    - Wulfmeier, Markus, Peter Ondruska, and Ingmar Posner (2015). "Maximum Entropy Deep Inverse Reinforcement Learning". In: Computing Research Repository. arXiv: 1507.04888.
    - Ziebart, Brian D. (2010). "Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy". PhD thesis. Carnegie Mellon University.
    - Ziebart, Brian D. et al. (2008). "Maximum Entropy Inverse Reinforcement Learning". In: Proc. 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pp. 1433-1438.