Maximum Entropy Reinforcement Learning, CMU 10-403, Katerina Fragkiadaki (PowerPoint PPT Presentation)


  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Maximum Entropy Reinforcement Learning, CMU 10-403, Katerina Fragkiadaki

  2. RL objective
     \pi^* = \arg\max_\pi \, \mathbb{E}_\pi \Big[ \sum_t R(s_t, a_t) \Big]
     \pi^* = \arg\max_\pi \, \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \Big[ \sum_t R(s_t, a_t) \Big]

  3. MaxEntRL objective: promoting stochastic policies
     \pi^* = \arg\max_\pi \, \mathbb{E}_\pi \Big[ \sum_{t=1}^{T} R(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot|s_t)) \Big]   (reward + entropy)
     Why?
     • Better exploration
     • Learning alternative ways of accomplishing the task
     • Better generalization, e.g., in the presence of obstacles a stochastic policy may still succeed.
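A minimal numeric sketch of how the entropy bonus enters this objective, assuming a discrete-action policy given as per-state probability vectors (the function name and inputs are illustrative, not from the slides): each step contributes the reward plus α times the policy entropy at the visited state.

```python
import numpy as np

def maxent_return(rewards, policy_probs_at_states, alpha=0.1):
    """Entropy-augmented return sum_t [ r_t + alpha * H(pi(.|s_t)) ] for one trajectory.

    rewards: list of scalar rewards r_t
    policy_probs_at_states: list of arrays, pi(.|s_t) over discrete actions
    """
    total = 0.0
    for r, probs in zip(rewards, policy_probs_at_states):
        entropy = -np.sum(probs * np.log(probs + 1e-12))  # H(pi(.|s_t))
        total += r + alpha * entropy
    return total

# Example: two steps; the uniform policy at the first state earns a larger bonus
print(maxent_return([1.0, 0.5], [np.array([0.5, 0.5]), np.array([0.9, 0.1])]))
```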

  4. Principle of Maximum Entropy
     Policies that generate similar rewards should be equally probable; we do not want to commit to a single one. Why?
     • Better exploration
     • Learning alternative ways of accomplishing the task
     • Better generalization, e.g., in the presence of obstacles a stochastic policy may still succeed.
     Reference: Reinforcement Learning with Deep Energy-Based Policies, Haarnoja et al.

  5. We have seen this before (the entropy bonus in A3C):
     d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i|s_i; \theta') \big( R - V(s_i; \theta'_v) \big) + \beta \, \nabla_{\theta'} \mathcal{H}(\pi(s_t; \theta'))
     "We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by (Williams & Peng, 1991)." Mnih et al., Asynchronous Methods for Deep Reinforcement Learning.

  6. MaxEntRL objective: promoting stochastic policies
     \pi^* = \arg\max_\pi \, \mathbb{E}_\pi \Big[ \sum_{t=1}^{T} R(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot|s_t)) \Big]   (reward + entropy)
     How can we maximize such an objective?

  7. Recall: Back-up Diagrams
     q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s, a) \sum_{a' \in \mathcal{A}} \pi(a'|s') \, q_\pi(s', a')

  8. Back-up Diagrams for the MaxEnt Objective
     \mathcal{H}(\pi(\cdot|s')) = -\mathbb{E}_{a' \sim \pi(\cdot|s')} \log \pi(a'|s')

  9. Back-up Diagrams for the MaxEnt Objective
     Each successor action now contributes the bonus -\log \pi(a'|s'):
     q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s, a) \sum_{a' \in \mathcal{A}} \pi(a'|s') \big( q_\pi(s', a') - \log \pi(a'|s') \big)

  10. (Soft) policy evaluation
      Bellman backup equation:
      q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s, a) \sum_{a' \in \mathcal{A}} \pi(a'|s') \, q_\pi(s', a')
      Soft Bellman backup equation:
      q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s, a) \sum_{a' \in \mathcal{A}} \pi(a'|s') \big( q_\pi(s', a') - \log \pi(a'|s') \big)
      Bellman backup update operator:
      Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) ]
      Soft Bellman backup update operator:
      Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) - \log \pi(a_{t+1}|s_{t+1}) ]
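A sketch of the soft Bellman backup as a tabular soft policy evaluation loop, assuming a small MDP given as numpy arrays (the names R, T, pi are mine) and the slide's implicit temperature of 1:

```python
import numpy as np

def soft_policy_evaluation(R, T, pi, gamma=0.99, iters=1000):
    """Tabular soft policy evaluation via repeated soft Bellman backups.

    R:  [S, A]    rewards r(s, a)
    T:  [S, A, S] transition probabilities T(s'|s, a)
    pi: [S, A]    policy pi(a|s)
    Returns q: [S, A] soft action values.
    """
    S, A = R.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        # v_soft(s') = sum_a' pi(a'|s') * ( q(s', a') - log pi(a'|s') )
        v_soft = np.sum(pi * (q - np.log(pi + 1e-12)), axis=1)  # shape [S]
        q = R + gamma * T @ v_soft                               # shape [S, A]
    return q
```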

  11. The soft Bellman backup update operator is a contraction
      Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) - \log \pi(a_{t+1}|s_{t+1}) ]
      = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho} \big[ \mathbb{E}_{a_{t+1} \sim \pi} [ Q(s_{t+1}, a_{t+1}) - \log \pi(a_{t+1}|s_{t+1}) ] \big]
      = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho, a_{t+1} \sim \pi} Q(s_{t+1}, a_{t+1}) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho} \mathbb{E}_{a_{t+1} \sim \pi} [ -\log \pi(a_{t+1}|s_{t+1}) ]
      = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho, a_{t+1} \sim \pi} Q(s_{t+1}, a_{t+1}) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho} \mathcal{H}(\pi(\cdot|s_{t+1}))
      Rewrite the reward as r_{soft}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho} \mathcal{H}(\pi(\cdot|s_{t+1})). Then we recover the standard Bellman backup operator, which we know is a contraction.

  12. Soft Bellman backup update operator (with temperature \alpha)
      Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1}|s_{t+1}) ]
      = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho} \big[ \mathbb{E}_{a_{t+1} \sim \pi} [ Q(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1}|s_{t+1}) ] \big]
      = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho, a_{t+1} \sim \pi} Q(s_{t+1}, a_{t+1}) + \gamma \alpha \, \mathbb{E}_{s_{t+1} \sim \rho} \mathbb{E}_{a_{t+1} \sim \pi} [ -\log \pi(a_{t+1}|s_{t+1}) ]
      = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho, a_{t+1} \sim \pi} Q(s_{t+1}, a_{t+1}) + \gamma \alpha \, \mathbb{E}_{s_{t+1} \sim \rho} \mathcal{H}(\pi(\cdot|s_{t+1}))
      We know that Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho} [ V(s_{t+1}) ], which means that
      V(s_t) = \mathbb{E}_{a_t \sim \pi} [ Q(s_t, a_t) - \alpha \log \pi(a_t|s_t) ]
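The soft state value identity above in code, for a single discrete-action row of a Q table (a hypothetical helper, not from the slides; the entropy term rewards spreading probability over similarly valued actions):

```python
import numpy as np

def soft_state_value(q_row, pi_row, alpha=0.2):
    """V(s) = E_{a~pi}[ Q(s, a) - alpha * log pi(a|s) ] for one state."""
    return np.sum(pi_row * (q_row - alpha * np.log(pi_row + 1e-12)))

# Example: two actions with similar Q values, policy keeps both reasonably likely
print(soft_state_value(np.array([1.0, 0.9]), np.array([0.6, 0.4])))
```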

  13. Review: Policy Iteration (unknown dynamics)
      Policy iteration alternates between two steps:
      1. Policy evaluation: fix the policy, apply the Bellman backup operator until convergence:
         q_\pi(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s', a'} [ q_\pi(s', a') ]
      2. Policy improvement: update the policy.

  14. Soft Policy Iteration
      Soft policy iteration alternates between two steps:
      1. Soft policy evaluation: fix the policy, apply the soft Bellman backup operator until convergence:
         q_\pi(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s', a'} \big[ q_\pi(s', a') - \alpha \log \pi(a'|s') \big]
         This converges to q_\pi.
      2. Soft policy improvement: update the policy:
         \pi' = \arg\min_{\pi_k \in \Pi} D_{KL}\Big( \pi_k(\cdot|s_t) \,\Big\|\, \frac{\exp(Q^\pi(s_t, \cdot))}{Z^\pi(s_t)} \Big)
      This leads to a sequence of policies with monotonically increasing soft Q values. So far this concerns tabular methods; next we will use function approximation for the policy and the action values. A tabular sketch of the full loop follows below.
      Reference: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al.
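A tabular soft policy iteration sketch, assuming discrete states and actions given as numpy arrays (names R, T are mine). The improvement step uses the closed-form KL minimizer over an unrestricted discrete policy class, i.e. pi(a|s) proportional to exp(Q(s,a)/alpha):

```python
import numpy as np

def soft_policy_iteration(R, T, alpha=1.0, gamma=0.99, outer_iters=50, eval_iters=200):
    """Alternate soft policy evaluation and soft policy improvement on a tabular MDP.

    R: [S, A] rewards, T: [S, A, S] transition probabilities.
    """
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)   # start from the uniform policy
    q = np.zeros((S, A))
    for _ in range(outer_iters):
        # 1. Soft policy evaluation with the policy held fixed
        for _ in range(eval_iters):
            v = np.sum(pi * (q - alpha * np.log(pi + 1e-12)), axis=1)
            q = R + gamma * T @ v
        # 2. Soft policy improvement: pi(a|s) proportional to exp(Q(s,a)/alpha)
        logits = q / alpha
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi, q
```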

  15. Review: Policy Improvement theorem for deterministic policies
      Let \pi, \pi' be any pair of deterministic policies such that, for all s \in \mathcal{S}: q_\pi(s, \pi'(s)) \ge v_\pi(s). Then \pi' must be as good as or better than \pi, that is, v_{\pi'}(s) \ge v_\pi(s).

  16. Review: Policy Improvement theorem for deterministic policies
      Let \pi, \pi' be any pair of deterministic policies such that, for all s \in \mathcal{S}: q_\pi(s, \pi'(s)) \ge v_\pi(s). Then \pi' must be as good as or better than \pi, that is, v_{\pi'}(s) \ge v_\pi(s).

  17. Review: Policy Improvement theorem for deterministic policies
      Let \pi, \pi' be any pair of deterministic policies such that, for all s \in \mathcal{S}: q_\pi(s, \pi'(s)) \ge v_\pi(s). Then \pi' must be as good as or better than \pi, that is, v_{\pi'}(s) \ge v_\pi(s).
      The analogous soft improvement step projects onto the policy class:
      \pi' = \arg\min_{\pi_k \in \Pi} D_{KL}\Big( \pi_k(\cdot|s_t) \,\Big\|\, \frac{\exp(Q^\pi(s_t, \cdot))}{Z^\pi(s_t)} \Big)

  18. SoftMax
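For a discrete action space and an unrestricted policy class, the KL projection on the previous slides is solved exactly by the softmax of the Q values, which is presumably what this one-word slide refers to. A small numeric check with hypothetical numbers (the softmax of Q attains KL = 0 against the target exp(Q)/Z, any other distribution does worse):

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))

# Target density exp(Q(s, .)) / Z for one state with three actions
Q = np.array([2.0, 1.0, 0.5])
target = np.exp(Q) / np.exp(Q).sum()

softmax_pi = np.exp(Q - Q.max())
softmax_pi /= softmax_pi.sum()

other = np.array([0.5, 0.3, 0.2])
print(kl(softmax_pi, target), kl(other, target))   # ~0.0 vs. > 0
```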

  19. Soft Policy Iteration - Approximation
      Use function approximators for the policy and the action-value function: \pi_\phi(a_t|s_t), Q_\theta(s_t, a_t)

  20. Soft Policy Iteration - Approximation
      Use function approximators for the policy and the action-value function: \pi_\phi(a_t|s_t), Q_\theta(s_t, a_t)
      1. Learning the state-action value function: semi-gradient method (the bootstrapped Bellman target is treated as a constant, so no gradient flows through it).
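A sketch of a semi-gradient soft Q update in PyTorch, consistent with the soft value identity from slide 12 but not necessarily the exact form on this slide. The network interfaces (`q_net(s, a)`, `target_q_net`, `policy.sample`) and the replay-batch layout are assumptions of mine:

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, target_q_net, policy, batch, gamma=0.99, alpha=0.2):
    """Semi-gradient soft Bellman error: the target is held constant."""
    s, a, r, s_next, done = batch          # tensors sampled from a replay buffer D
    with torch.no_grad():                  # no gradient flows through the target
        a_next, logp_next = policy.sample(s_next)     # a' and log pi(a'|s')
        q_next = target_q_net(s_next, a_next)
        v_next = q_next - alpha * logp_next           # soft state value V(s')
        target = r + gamma * (1.0 - done) * v_next
    return F.mse_loss(q_net(s, a), target)
```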

  21. Soft Policy Iteration - Approximation
      Use function approximators for the policy and the action-value function: \pi_\phi(a_t|s_t), Q_\theta(s_t, a_t)
      2. Learning the policy:
      Z_\theta(s_t) = \int_{\mathcal{A}} \exp(Q_\theta(s_t, a_t)) \, da_t   (independent of \phi)
      \nabla_\phi J_\pi(\phi) = \nabla_\phi \, \mathbb{E}_{s_t \sim D} \, \mathbb{E}_{a_t \sim \pi_\phi(\cdot|s_t)} \Big[ \log \frac{\pi_\phi(a_t|s_t)}{\exp(Q_\theta(s_t, a_t)) / Z_\theta(s_t)} \Big]
      The difficulty: the variable \phi with respect to which we take the gradient also parametrizes the distribution inside the expectation, so we cannot simply push the gradient through the expectation.

  22. Soft Policy Iteration - Approximation
      2. Learning the policy (continued). Reparametrization trick: the policy becomes a deterministic function of the state and of Gaussian noise drawn from a fixed distribution:
      a_t = f_\phi(s_t, \epsilon) = \mu_\phi(s_t) + \epsilon \, \Sigma_\phi(s_t), \quad \epsilon \sim \mathcal{N}(0, I)
      \nabla_\phi J_\pi(\phi) = \nabla_\phi \, \mathbb{E}_{s_t \sim D} \, \mathbb{E}_{a_t \sim \pi_\phi(\cdot|s_t)} \Big[ \log \frac{\pi_\phi(a_t|s_t)}{\exp(Q_\theta(s_t, a_t)) / Z_\theta(s_t)} \Big]
        = \nabla_\phi \, \mathbb{E}_{s_t \sim D, \, \epsilon \sim \mathcal{N}(0, I)} \Big[ \log \frac{\pi_\phi(f_\phi(s_t, \epsilon)|s_t)}{\exp(Q_\theta(s_t, f_\phi(s_t, \epsilon))) / Z_\theta(s_t)} \Big]
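A PyTorch sketch of the reparametrized policy objective. The point is that the action is a deterministic function of s_t and ε, so gradients flow from Q_θ back into ϕ. The `policy_net` returning a mean and log-std and the `q_net(states, actions)` interface are assumed, a temperature α is included, and the tanh squashing correction used in practice is omitted:

```python
import torch

def policy_loss(policy_net, q_net, states, alpha=0.2):
    """Reparametrized policy objective:
    E_{s ~ D, eps ~ N(0, I)} [ alpha * log pi(f_phi(s, eps)|s) - Q(s, f_phi(s, eps)) ].
    """
    mu, log_std = policy_net(states)          # Gaussian policy head
    std = log_std.exp()
    eps = torch.randn_like(mu)                # eps ~ N(0, I), fixed distribution
    actions = mu + eps * std                  # a = f_phi(s, eps), differentiable in phi
    logp = torch.distributions.Normal(mu, std).log_prob(actions).sum(dim=-1)
    return (alpha * logp - q_net(states, actions)).mean()
```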

  23. Composability of Maximum Entropy Policies
      Imagine we want to satisfy two objectives at the same time, e.g., pick an object up while avoiding an obstacle. We would learn a policy that maximizes the (averaged) sum of the corresponding reward functions:
      r_C(s, a) = \frac{1}{C} \sum_{i=1}^{C} r_i(s, a)
      MaxEnt policies allow us to approximate the resulting policy's optimal Q by simply averaging the constituent Qs:
      Q^*_C(s, a) \approx \frac{1}{C} \sum_{i=1}^{C} Q^*_i(s, a)
      We can theoretically bound the suboptimality of the resulting policy with respect to the policy trained directly on the combined reward. We cannot do this for deterministic policies.
      Reference: Composable Deep Reinforcement Learning for Robotic Manipulation, Haarnoja et al.
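A sketch of composition at evaluation time, assuming tabular Q functions for the constituent tasks (the arrays and task names are hypothetical): average the Qs and act with the corresponding MaxEnt policy.

```python
import numpy as np

def composed_policy(q_tables, alpha=1.0):
    """Approximate the optimal Q of the averaged reward by averaging constituent Qs,
    then act with the MaxEnt policy pi(a|s) proportional to exp(Q_C(s,a)/alpha)."""
    q_c = np.mean(q_tables, axis=0)                   # [S, A], (1/C) * sum_i Q_i
    logits = q_c / alpha
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

# Example: "reach the goal" and "avoid the obstacle" as two hypothetical one-state tasks
print(composed_policy([np.array([[1.0, 0.0]]), np.array([[0.2, 0.8]])]))
```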

  24. https://www.youtube.com/watch?time_continue=82&v=FmMPHL3TcrE
