# Maximum Entropy Inverse Reinforcement Learning


1. Maximum Entropy Inverse Reinforcement Learning. Maximilian Luz, Algorithms for Imitation Learning, Summer Semester 2019, MLR/IPVS.

2. Outline
   - Nomenclature
   - Basis: Feature Expectation Matching, Principle of Maximum Entropy
   - Maximum Entropy IRL: Algorithm and Derivation
   - Extensions
   - Demonstration

3. Nomenclature

4. Nomenclature (i)

   Markov Decision Process (MDP):
   - States: $S = \{s_i\}_i$
   - Actions: $A = \{a_i\}_i$
   - Transition dynamics: $T = p(s_{t+1} \mid s_t, a_t)$
   - Reward: $R\colon S \to \mathbb{R}$

   Trajectories and demonstrations:
   - Trajectory: $\tau = \big((s_1, a_1), (s_2, a_2), \ldots, s_{|\tau|}\big)$
   - Demonstrations: $D = \{\tau_i\}_i$

5. Nomenclature (ii)

   Features:
   - $\phi\colon S \to \mathbb{R}^d$
   - $\phi(\tau) = \sum_{s_t \in \tau} \phi(s_t)$

   Policies (stochastic):
   - $\pi(a_j \mid s_i)$
   - Learner policy $\pi_L$, expert policy $\pi_E$
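The trajectory feature count $\phi(\tau)$ is just a sum of per-state feature vectors. A minimal NumPy sketch; the feature matrix and trajectory below are hypothetical, chosen only to illustrate the definition:

```python
import numpy as np

# Hypothetical 4-state MDP with 2-dimensional state features.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.0, 0.0]])  # phi[s] is the feature vector of state s

def trajectory_features(trajectory, phi):
    """Sum the state features along a trajectory: phi(tau) = sum of phi(s_t)."""
    return sum(phi[s] for s, _ in trajectory)

# Trajectory as (state, action) pairs; the final state carries action None.
tau = [(0, 1), (2, 0), (3, None)]
print(trajectory_features(tau, phi))  # -> [2. 1.]
```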

6. Basis

7. Feature Expectation Matching

   Idea: The learner should visit the same features as the expert (in expectation).

   Feature expectation matching [Abbeel and Ng 2004]:
   $$\mathbb{E}_{\pi_E}[\phi(\tau)] = \mathbb{E}_{\pi_L}[\phi(\tau)], \qquad \mathbb{E}_{\pi_L}[\phi(\tau)] = \sum_{\tau \in \mathcal{T}} p(\tau) \cdot \phi(\tau)$$

   Note: We want to find a reward $R\colon S \to \mathbb{R}$ defining $\pi_L(a \mid s)$ and thus $p(\tau)$.

   Observation: Matching feature expectations implies optimality for a linear (unknown) reward [Abbeel and Ng 2004]:
   $$R(s) = \omega^\top \phi(s), \qquad \omega \in \mathbb{R}^d \text{ (reward parameters)}$$
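The expert side $\mathbb{E}_{\pi_E}[\phi(\tau)]$ is typically estimated empirically from the demonstrations. A minimal sketch, assuming (hypothetically) that demonstrations are given as plain state sequences and features as a per-state matrix:

```python
import numpy as np

phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])  # hypothetical per-state features

def empirical_feature_expectation(demonstrations, phi):
    """Monte-Carlo estimate of E_pi_E[phi(tau)]: average the summed
    state features over all demonstrated trajectories."""
    f = np.zeros(phi.shape[1])
    for tau in demonstrations:
        for s in tau:
            f += phi[s]
    return f / len(demonstrations)

demos = [[0, 2], [1, 2]]  # two demonstrations as state sequences
print(empirical_feature_expectation(demos, phi))  # -> [1.5 1.5]
```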

8. Feature Expectation Matching: Problem

   Problem: Multiple (infinitely many) solutions, i.e. the problem is ill-posed in the sense of Hadamard.
   - Multiple reward functions $R$ lead to the same policy $\pi$.
   - But why? Reward shaping [Ng et al. 1999]: certain transformations of the reward leave the optimal policy unchanged.

   Idea (Ziebart et al. 2008): Regularize by maximizing the entropy $H(p)$.

9. Shannon's Entropy

   $$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

   Here $-\log_2 p(x)$ is the optimal encoding length of event $x$, so $H(p)$ is the expected information received when observing $x \in \mathcal{X}$, i.e. a measure of uncertainty:
   - A single certain event: no uncertainty, $H(p)$ minimal.
   - Uniformly random events: $H(p)$ maximal.

   (Figure: optimal encoding length $-\log_2 p(x)$ over the probability of occurrence $p(x)$, with two example distributions.)
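The two extreme cases above can be checked directly with a small sketch of $H(p)$ (using the convention $0 \log 0 := 0$):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x), in bits (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # skip zero-probability events
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([1.0, 0.0, 0.0, 0.0]))      # certain outcome -> 0.0 (minimal)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform -> 2.0 (maximal for 4 events)
```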

10. Principle of Maximum Entropy [Jaynes 1957]

    Consider a problem with multiple solutions $p, q, \ldots$ (e.g. feature expectation matching):
    - The solutions $p, q$ represent the same partial information, differing only in bias relative to a baseline.
    - Maximizing entropy therefore minimizes bias.

11. Maximum Entropy IRL

12. Constrained Optimization Problem

    $$\begin{aligned}
    \arg\max_p \; & H(p) && \text{(entropy)} \\
    \text{subject to} \; & \mathbb{E}_{\pi_E}[\phi(\tau)] = \mathbb{E}_{\pi_L}[\phi(\tau)] && \text{(feature matching)} \\
    & \textstyle\sum_{\tau \in \mathcal{T}} p(\tau) = 1, \quad \forall \tau \in \mathcal{T}\colon p(\tau) > 0 && \text{(probability distribution)}
    \end{aligned}$$

13. Solution: Deterministic Dynamics

    Solution via Lagrange multipliers [Ziebart et al. 2008], for deterministic transition dynamics:
    $$p(\tau \mid \omega) = \frac{1}{Z(\omega)} \exp\big(\omega^\top \phi(\tau)\big), \qquad Z(\omega) = \sum_{\tau \in \mathcal{T}} \exp\big(\omega^\top \phi(\tau)\big)$$
    where $Z(\omega)$ is the partition function (normalization) and $\omega$ are the Lagrange multipliers for feature matching. Thus $p(\tau) \propto \exp(R(\tau))$ with $R(\tau) = \omega^\top \phi(\tau)$.
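For an MDP small enough to enumerate its trajectories, this distribution is just a softmax over trajectory returns $\omega^\top \phi(\tau)$. A sketch under that assumption, with a made-up feature matrix holding one row $\phi(\tau)$ per trajectory:

```python
import numpy as np

def maxent_trajectory_distribution(features, omega):
    """p(tau|omega) = exp(omega^T phi(tau)) / Z(omega) over an enumerable
    trajectory set; features[i] is phi(tau_i)."""
    scores = features @ omega
    scores -= scores.max()        # shift for numerical stability (cancels in Z)
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical: three trajectories with scalar feature counts 1, 2, 3.
F = np.array([[1.0], [2.0], [3.0]])
print(maxent_trajectory_distribution(F, np.array([0.0])))  # omega = 0 -> uniform
print(maxent_trajectory_distribution(F, np.array([1.0])))  # favors high phi(tau)
```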

14. Solution: Stochastic Dynamics

    Stochastic transition dynamics, via adaptation of the deterministic solution [Ziebart et al. 2008]:
    $$p(\tau \mid \omega) = \frac{1}{Z_s(\omega)} \exp\big(\omega^\top \phi(\tau)\big) \prod_{s_t, a_t, s_{t+1} \in \tau} p(s_{t+1} \mid s_t, a_t)$$
    The product is the combined transition probability of the trajectory; the adaptation assumes limited transition randomness.

    Problem: The adaptation introduces a bias [Osa et al. 2018; Ziebart 2010], effectively optimizing
    $$\tilde{R}(\tau) = \omega^\top \phi(\tau) + \sum_{s_t, a_t, s_{t+1} \in \tau} \log p(s_{t+1} \mid s_t, a_t)$$

    Solution: Maximum Causal Entropy IRL (Ziebart 2010, not covered here).

15. Likelihood and Gradient

    Obtain the parameters by maximizing the (log-)likelihood:
    $$\omega^* = \arg\max_\omega \mathcal{L}(\omega) = \arg\max_\omega \sum_{\tau \in D} \log p(\tau \mid \omega)$$

    Observations:
    - Convex, so it can be optimized via gradient ascent.
    - Maximizing the likelihood is equivalent to minimizing a KL-divergence [Bishop 2006], i.e. an M-projection onto the manifold of maximum entropy distributions [Osa et al. 2018].

    Gradient [Ziebart et al. 2008]:
    $$\nabla \mathcal{L}(\omega) = \mathbb{E}_D[\phi(\tau)] - \sum_{\tau \in \mathcal{T}} p(\tau \mid \omega)\, \phi(\tau) = \mathbb{E}_D[\phi(\tau)] - \sum_{s_i \in S} D_{s_i}\, \phi(s_i)$$
    The first term "counts" features in $D$; the sum over all trajectories is computationally infeasible, so it is rewritten in terms of the state visitation frequency $D_{s_i}$.
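Given the state visitation frequencies, the rewritten gradient is a one-liner. A sketch with hypothetical inputs (one-hot features, hand-picked expert expectation and visitation counts):

```python
import numpy as np

def likelihood_gradient(expert_features, svf, phi):
    """Gradient of the MaxEnt log-likelihood:
    grad L(omega) = E_D[phi(tau)] - sum_s D_s * phi(s),
    where svf[s] is the state visitation frequency D_s."""
    return expert_features - svf @ phi

phi = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot state features
expert = np.array([3.0, 1.0])             # empirical E_D[phi(tau)]
svf = np.array([2.0, 2.0])                # current D_s under p(tau|omega)
print(likelihood_gradient(expert, svf, phi))  # -> [1. -1.]
```

With one-hot features the gradient is literally "expert visit counts minus learner visit counts", which matches the feature-matching intuition.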

16. State Visitation Frequency

    Observation: The maximum entropy policy can be computed from the reward $R$:
    $$\pi_{ME}(a_j \mid s_i, \omega) \propto \sum_{\tau \in \mathcal{T}\colon (s_i, a_j) \in \tau_{t=1}} p(\tau \mid \omega)$$

    Idea: Split into sub-problems:
    1. Backward pass: Compute the policy $\pi_{ME}(a \mid s, \omega)$.
    2. Forward pass: Compute the state visitation frequency from $\pi_{ME}(a \mid s, \omega)$.

17. State Visitation Frequency: Backward Pass

    Recursively expanding the observation yields
    $$\pi_{ME}(a_j \mid s_i, \omega) = \frac{Z_{s_i, a_j}}{Z_{s_i}}$$
    with
    $$Z_{s_i, a_j} = \sum_{s_k \in S} p(s_k \mid s_i, a_j) \cdot \exp\big(\omega^\top \phi(s_i)\big) \cdot Z_{s_k}, \qquad Z_{s_i} = \sum_{a_j \in A} Z_{s_i, a_j} \quad \text{(normalization)}$$

    Algorithm (parallels value iteration):
    1. Initialize $Z_{s_k} = 1$ for all terminal states $s_k \in S_{\text{terminal}}$.
    2. Compute $Z_{s_i, a_j}$ and $Z_{s_i}$ by recursively backing up from the terminal states.
    3. Compute $\pi_{ME}(a_j \mid s_i, \omega) = Z_{s_i, a_j} / Z_{s_i}$.
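A minimal NumPy sketch of this backward pass, not the reference implementation: the transition tensor, chain MDP, and iteration count below are made up, and the recursion is simply run for a fixed number of steps instead of to convergence:

```python
import numpy as np

def backward_pass(P, reward, terminal, n_iters=50):
    """Backward pass sketch: compute pi(a|s) = Z_{s,a} / Z_s from the
    recursion Z_{s,a} = sum_s' p(s'|s,a) * exp(reward(s)) * Z_{s'}.
    P[s, a, s2] is the transition probability p(s2 | s, a)."""
    n_states, n_actions, _ = P.shape
    zs = np.zeros(n_states)
    zs[terminal] = 1.0                           # Z = 1 at terminal states
    for _ in range(n_iters):
        zsa = np.exp(reward)[:, None] * (P @ zs)  # Z_{s,a}, shape (S, A)
        zs = zsa.sum(axis=1)                      # Z_s
    return zsa / zs[:, None]                      # pi(a|s)

# Hypothetical 3-state chain: action 0 stays, action 1 moves right; state 2 absorbing.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = P[1, 0, 1] = P[2, 0, 2] = 1.0
P[0, 1, 1] = P[1, 1, 2] = P[2, 1, 2] = 1.0
pi = backward_pass(P, reward=np.array([0.0, 0.0, 1.0]), terminal=[2])
print(pi)  # each row sums to 1; moving toward the rewarding state is preferred
```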

18. State Visitation Frequency: Forward Pass

    Idea: Propagate the starting-state probabilities $p_0(s)$ forward via the policy $\pi_{ME}(a \mid s, \omega)$.

    Algorithm:
    1. Initialize $D_{s_i, 0} = p_0(s_i) = p(\tau \in \mathcal{T}\colon s_i \in \tau_{t=1})$.
    2. Recursively compute
       $$D_{s_k, t+1} = \sum_{s_i \in S} \sum_{a_j \in A} D_{s_i, t} \cdot \pi_{ME}(a_j \mid s_i) \cdot p(s_k \mid a_j, s_i)$$
    3. Sum over $t$, i.e. $D_{s_i} = \sum_{t = 0, \ldots} D_{s_i, t}$.
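The forward pass can be sketched as repeated propagation of a state distribution; the two-state MDP and the fixed step count below are hypothetical, chosen so the result is easy to verify by hand:

```python
import numpy as np

def forward_pass(P, policy, p0, n_steps=20):
    """Forward pass sketch: propagate start-state probabilities through
    the policy to get visitation frequencies D_s = sum_t D_{s,t}."""
    d_t = p0.copy()                 # D_{s,0}
    svf = d_t.copy()
    for _ in range(n_steps - 1):
        # D_{s',t+1} = sum_s sum_a D_{s,t} * pi(a|s) * p(s'|s,a)
        d_t = np.einsum('s,sa,sak->k', d_t, policy, P)
        svf += d_t
    return svf

# Hypothetical 2-state MDP: action 0 stays, action 1 swaps states.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
policy = np.full((2, 2), 0.5)       # uniformly random policy
print(forward_pass(P, policy, p0=np.array([1.0, 0.0]), n_steps=4))  # -> [2.5 1.5]
```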

19. Summary

    Algorithm: Iterate until convergence:
    1. Compute the policy $\pi_{ME}(a \mid s, \omega)$ (backward pass).
    2. Compute the state visitation frequency $D_{s_i}$ (forward pass).
    3. Compute the gradient $\nabla \mathcal{L}(\omega)$ of the likelihood.
    4. Gradient-based optimization step, e.g. $\omega \leftarrow \omega + \eta \nabla \mathcal{L}(\omega)$.

    Assumptions:
    - Known transition dynamics $T = p(s_{t+1} \mid s_t, a_t)$.
    - Linear reward $R(s) = \omega^\top \phi(s)$.

    Other drawbacks:
    - Limited transition randomness.
    - Need to "solve" the MDP once per iteration.
    - Reward bias for stochastic transition dynamics.
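The four iterated steps can be put together into one compact sketch. This is my illustration, not the deck's or the linked repository's implementation: the two-state MDP, one-hot features, expert feature vector, learning rate, and fixed inner iteration counts are all made up:

```python
import numpy as np

def maxent_irl(P, phi, expert_features, p0, terminal, lr=0.1, n_iters=100):
    """MaxEnt IRL loop sketch: backward pass (policy), forward pass
    (state visitation frequency), then a gradient ascent step on omega.
    P[s, a, s2] = p(s2 | s, a); phi[s] is the feature vector of state s."""
    n_states, n_actions, _ = P.shape
    omega = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        reward = phi @ omega
        # 1. Backward pass: partition values -> policy pi(a|s).
        zs = np.zeros(n_states)
        zs[terminal] = 1.0
        for _ in range(2 * n_states):
            zsa = np.exp(reward)[:, None] * (P @ zs)
            zs = zsa.sum(axis=1)
        policy = zsa / zs[:, None]
        # 2. Forward pass: expected state visitation frequencies D_s.
        d_t = p0.copy()
        svf = p0.copy()
        for _ in range(2 * n_states):
            d_t = np.einsum('s,sa,sak->k', d_t, policy, P)
            svf += d_t
        # 3./4. Gradient ascent: grad L = E_D[phi(tau)] - sum_s D_s phi(s).
        omega += lr * (expert_features - svf @ phi)
    return omega

# Hypothetical 2-state MDP: action 0 stays, action 1 moves right; state 1 absorbing.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 1] = 1.0
omega = maxent_irl(P, np.eye(2), expert_features=np.array([2.0, 3.0]),
                   p0=np.array([1.0, 0.0]), terminal=[1])
print(omega)
```

Note how both assumptions from the summary appear directly in the code: the transition tensor `P` must be known, and the reward enters only through the linear form `phi @ omega`.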

20. Extensions
    - Maximum Causal Entropy IRL [Ziebart 2010]
    - Maximum Entropy Deep IRL [Wulfmeier et al. 2015]
    - Maximum Entropy IRL in Continuous State Spaces with Path Integrals [Aghasadeghi and Bretl 2011]

21. Demonstration: github.com/qzed/irl-maxent

22. References
    - Abbeel, Pieter and Andrew Y. Ng (2004). "Apprenticeship Learning via Inverse Reinforcement Learning". In: Proc. 21st Intl. Conference on Machine Learning (ICML '04).
    - Aghasadeghi, N. and T. Bretl (Sept. 2011). "Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals". In: Intl. Conference on Intelligent Robots and Systems (IROS 2011), pp. 1561-1566.
    - Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag New York Inc.
    - Jaynes, E. T. (May 1957). "Information Theory and Statistical Mechanics". In: Physical Review 106.4, pp. 620-630.
    - Ng, Andrew Y., Daishi Harada, and Stuart J. Russell (1999). "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping". In: Proc. 16th Intl. Conference on Machine Learning (ICML '99), pp. 278-287.
    - Osa, Takayuki et al. (2018). "An Algorithmic Perspective on Imitation Learning". In: Foundations and Trends in Robotics 7.1-2, pp. 1-179.
    - Wulfmeier, Markus, Peter Ondruska, and Ingmar Posner (2015). "Maximum Entropy Deep Inverse Reinforcement Learning". In: Computing Research Repository. arXiv: 1507.04888.
    - Ziebart, Brian D. (2010). "Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy". PhD thesis. Carnegie Mellon University.
    - Ziebart, Brian D. et al. (2008). "Maximum Entropy Inverse Reinforcement Learning". In: Proc. 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pp. 1433-1438.