SLIDE 1

CS885 Reinforcement Learning Module 2: June 6, 2020

Maximum Entropy Reinforcement Learning Haarnoja, Tang et al. (2017) Reinforcement Learning with Deep Energy Based Policies, ICML. Haarnoja, Zhou et al. (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML.


SLIDE 2

Maximum Entropy RL

  • Why do several implementations of important RL baselines (e.g., A2C, PPO) add an entropy regularizer?
  • Why is maximizing entropy desirable in RL?
  • What is the Soft Actor Critic algorithm?

SLIDE 3

Reinforcement Learning

Deterministic Policies

  • There always exists an optimal deterministic policy
  • Search space is smaller for deterministic than for stochastic policies
  • Practitioners prefer deterministic policies

Stochastic Policies

  • Search space is continuous for stochastic policies (helps with gradient descent)
  • More robust (less likely to overfit)
  • Naturally incorporate exploration
  • Facilitate transfer learning
  • Mitigate local optima

SLIDE 4

Encouraging Stochasticity

Standard MDP

  • States: $S$
  • Actions: $A$
  • Reward: $R(s, a)$
  • Transition: $\Pr(s' \mid s, a)$
  • Discount: $\gamma$

Soft MDP

  • States: $S$
  • Actions: $A$
  • Reward: $R(s, a) + \lambda H(\pi(\cdot \mid s))$
  • Transition: $\Pr(s' \mid s, a)$
  • Discount: $\gamma$

SLIDE 5

Entropy

  • Entropy: measure of uncertainty
    – Information theory: expected # of bits needed to communicate the result of a sample

$$H(p) = -\sum_x p(x) \log p(x)$$

[Figure: plot of the entropy $H(p)$ as a function of $p(x)$]

SLIDE 6

Optimal Policy

  • Standard MDP

$$\pi^{*} = \operatorname*{argmax}_{\pi} \sum_{t=0}^{\infty} \gamma^{t}\, E_{s_t, a_t \mid \pi}\!\left[ R(s_t, a_t) \right]$$

  • Soft MDP

$$\pi^{*}_{\text{soft}} = \operatorname*{argmax}_{\pi} \sum_{t=0}^{\infty} \gamma^{t}\, E_{s_t, a_t \mid \pi}\!\left[ R(s_t, a_t) + \lambda H(\pi(\cdot \mid s_t)) \right]$$

$\pi^{*}_{\text{soft}}$ is known as the maximum entropy policy or the entropy-regularized policy.

SLIDE 7

Q-function

  • Standard MDP

$$Q^{\pi}(s_0, a_0) = R(s_0, a_0) + \sum_{t=1}^{\infty} \gamma^{t}\, E_{s_t, a_t \mid s_0, a_0, \pi}\!\left[ R(s_t, a_t) \right]$$

  • Soft MDP

$$Q^{\pi}_{\text{soft}}(s_0, a_0) = R(s_0, a_0) + \sum_{t=1}^{\infty} \gamma^{t}\, E_{s_t, a_t \mid s_0, a_0, \pi}\!\left[ R(s_t, a_t) + \lambda H(\pi(\cdot \mid s_t)) \right]$$

NB: There is no entropy term with the first reward since the first action is given rather than chosen according to $\pi$.

SLIDE 8

Greedy Policy

  • Standard MDP (deterministic policy)

$$\pi_{\text{greedy}}(s) = \operatorname*{argmax}_{a} Q(s, a)$$

  • Soft MDP (stochastic policy)

$$\pi_{\text{greedy}}(\cdot \mid s) = \operatorname*{argmax}_{\pi} \sum_a \pi(a \mid s)\, Q(s, a) + \lambda H(\pi(\cdot \mid s))$$

$$= \frac{\exp(Q(s, \cdot)/\lambda)}{\sum_{a} \exp(Q(s, a)/\lambda)} = \operatorname{softmax}(Q(s, \cdot)/\lambda)$$

When $\lambda \to 0$, softmax becomes the regular max.
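A minimal sketch of this softmax (Boltzmann) policy in Python, with the usual max-subtraction trick for numerical stability (function and variable names are my own):

```python
import numpy as np

def soft_greedy_policy(q_values, lam):
    """Softmax policy: pi(a|s) proportional to exp(Q(s,a)/lambda).

    Subtracting the max before exponentiating avoids overflow."""
    z = (q_values - np.max(q_values)) / lam
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 3.0])
for lam in (10.0, 1.0, 0.1):
    print(lam, soft_greedy_policy(q, lam))
# As lambda -> 0, the probability mass concentrates on argmax_a Q(s,a);
# as lambda grows, the policy approaches uniform.
```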

SLIDE 9

Derivation

  • Concave objective (can find global maximum)

$$J(\pi, Q) = \sum_a \pi(a \mid s)\, Q(s, a) + \lambda H(\pi(\cdot \mid s)) = \sum_a \pi(a \mid s) \left[ Q(s, a) - \lambda \log \pi(a \mid s) \right]$$

  • Partial derivative

$$\frac{\partial J}{\partial \pi(a \mid s)} = Q(s, a) - \lambda \left( \log \pi(a \mid s) + 1 \right)$$

  • Setting the derivative to 0 and isolating $\pi(a \mid s)$ yields

$$\pi(a \mid s) = \exp(Q(s, a)/\lambda - 1) \propto \exp(Q(s, a)/\lambda)$$

  • Hence, after normalizing over actions,

$$\pi_{\text{greedy}}(\cdot \mid s) = \frac{\exp(Q(s, \cdot)/\lambda)}{\sum_{a} \exp(Q(s, a)/\lambda)} = \operatorname{softmax}(Q(s, \cdot)/\lambda)$$

SLIDE 10

Greedy Value Function

  • What is the value function induced by the greedy policy?
  • Standard MDP:

$$V(s) = \max_a Q(s, a)$$

  • Soft MDP:

$$V_{\text{soft}}(s) = \lambda H(\pi_{\text{greedy}}(\cdot \mid s)) + \sum_a \pi_{\text{greedy}}(a \mid s)\, Q_{\text{soft}}(s, a) = \lambda \log \sum_a \exp\!\left( Q_{\text{soft}}(s, a)/\lambda \right) = \widetilde{\max}_a\, Q_{\text{soft}}(s, a)$$

where $\widetilde{\max}$ denotes this soft maximum (a log-sum-exp); when $\lambda \to 0$, $\widetilde{\max}$ becomes the regular max.
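The following sketch (Python, using `scipy.special.logsumexp`; names are my own) checks numerically that the soft maximum approaches the regular max as $\lambda$ shrinks:

```python
import numpy as np
from scipy.special import logsumexp

def soft_max_value(q_values, lam):
    """V_soft(s) = lambda * log sum_a exp(Q(s,a)/lambda), a log-sum-exp."""
    return lam * logsumexp(np.asarray(q_values) / lam)

q = np.array([1.0, 2.0, 3.0])
for lam in (10.0, 1.0, 0.1, 0.01):
    print(lam, soft_max_value(q, lam))
# The value shrinks toward max(q) = 3.0 as lambda -> 0, and exceeds it by
# up to lambda * log(|A|) for large lambda (the entropy bonus of a uniform policy).
```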

SLIDE 11

Derivation

$$\begin{aligned}
V_{\text{soft}}(s) &= \lambda H(\pi_{\text{greedy}}(\cdot \mid s)) + \sum_a \pi_{\text{greedy}}(a \mid s)\, Q_{\text{soft}}(s, a) \\
&= \lambda H(\pi_{\text{greedy}}(\cdot \mid s)) + \sum_a \pi_{\text{greedy}}(a \mid s)\, \lambda \left[ \log \pi_{\text{greedy}}(a \mid s) + \log \sum_{a'} \exp(Q_{\text{soft}}(s, a')/\lambda) \right] \\
&= \lambda H(\pi_{\text{greedy}}(\cdot \mid s)) + \lambda \sum_a \pi_{\text{greedy}}(a \mid s) \log \pi_{\text{greedy}}(a \mid s) + \lambda \log \sum_{a'} \exp(Q_{\text{soft}}(s, a')/\lambda) \\
&= \lambda H(\pi_{\text{greedy}}(\cdot \mid s)) - \lambda H(\pi_{\text{greedy}}(\cdot \mid s)) + \lambda \log \sum_{a'} \exp(Q_{\text{soft}}(s, a')/\lambda) \\
&= \lambda \log \sum_{a'} \exp(Q_{\text{soft}}(s, a')/\lambda) \\
&= \widetilde{\max}_a\, Q_{\text{soft}}(s, a)
\end{aligned}$$

since $\pi_{\text{greedy}}(a \mid s) = \frac{\exp(Q_{\text{soft}}(s, a)/\lambda)}{\sum_{a'} \exp(Q_{\text{soft}}(s, a')/\lambda)}$.

SLIDE 12

Soft Q-Value Iteration

SoftQValueIteration(MDP, $\lambda$)
  Initialize $Q^{0}_{\text{soft}}$ to any Q-function
  $n \leftarrow 0$
  Repeat
    $Q^{n+1}_{\text{soft}}(s, a) \leftarrow R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \widetilde{\max}_{a'}\, Q^{n}_{\text{soft}}(s', a') \quad \forall s, a$
    $n \leftarrow n + 1$
  Until $\left\| Q^{n}_{\text{soft}}(s, a) - Q^{n-1}_{\text{soft}}(s, a) \right\|_\infty \le \epsilon$
  Extract policy: $\pi_{\text{greedy}}(\cdot \mid s) = \operatorname{softmax}(Q^{n}_{\text{soft}}(s, \cdot)/\lambda)$

Soft Bellman equation:

$$Q^{*}_{\text{soft}}(s, a) = R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \widetilde{\max}_{a'}\, Q^{*}_{\text{soft}}(s', a')$$
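Putting the backup and the stopping test together, here is a minimal tabular sketch in Python (NumPy/SciPy; function and array names are my own, and `R`, `P` are assumed to be dense arrays):

```python
import numpy as np
from scipy.special import logsumexp

def soft_q_value_iteration(R, P, gamma, lam, eps=1e-6):
    """Tabular soft Q-value iteration.

    R: (S, A) reward matrix, P: (S, A, S) transition probabilities,
    gamma: discount, lam: entropy weight lambda, eps: stopping tolerance."""
    Q = np.zeros_like(R)
    while True:
        V = lam * logsumexp(Q / lam, axis=1)     # soft max over actions
        Q_new = R + gamma * (P @ V)              # soft Bellman backup
        delta = np.max(np.abs(Q_new - Q))
        Q = Q_new
        if delta <= eps:
            break
    # Extract the greedy softmax policy pi(a|s) ∝ exp(Q(s,a)/lambda)
    pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / lam)
    pi /= pi.sum(axis=1, keepdims=True)
    return Q, pi
```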

SLIDE 13

Soft Q-learning

  • Q-learning based on Soft Q-Value Iteration
  • Replace expectations by samples
  • Represent Q-function by a function approximator (e.g., neural network)
  • Do gradient updates based on temporal differences

SLIDE 14

Soft Q-learning (soft variant of DQN)

Initialize weights $\mathbf{w}$ and $\bar{\mathbf{w}}$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Select action $a$ and execute it
  Receive immediate reward $r$
  Observe new state $s'$
  Add $(s, a, s', r)$ to experience buffer
  Sample mini-batch of experiences from buffer
  For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch
    Gradient: $\frac{\partial Err}{\partial \mathbf{w}} = \left[ Q^{\mathbf{w}}_{\text{soft}}(\hat{s}, \hat{a}) - \hat{r} - \gamma\, \widetilde{\max}_{a'}\, Q^{\bar{\mathbf{w}}}_{\text{soft}}(\hat{s}', a') \right] \frac{\partial Q^{\mathbf{w}}_{\text{soft}}(\hat{s}, \hat{a})}{\partial \mathbf{w}}$
    Update weights: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial Err}{\partial \mathbf{w}}$
  Update state: $s \leftarrow s'$
  Every $c$ steps, update target: $\bar{\mathbf{w}} \leftarrow \mathbf{w}$
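For concreteness, a minimal sketch of the corresponding TD loss in PyTorch for discrete actions; `qnet` and `target_qnet` are assumed names for the online and target networks, and minimizing this loss with any optimizer reproduces the gradient step above:

```python
import torch

def soft_q_loss(qnet, target_qnet, batch, gamma, lam):
    """Mean squared TD error for soft Q-learning on a mini-batch.

    qnet, target_qnet: modules mapping states (B, state_dim) -> Q-values (B, A).
    batch: tensors (s, a, r, s_next) with a of dtype long."""
    s, a, r, s_next = batch
    q = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_w(s_hat, a_hat)
    with torch.no_grad():
        # soft max over next actions with the target network:
        # lambda * log sum_a' exp(Q_wbar(s', a') / lambda)
        v_next = lam * torch.logsumexp(target_qnet(s_next) / lam, dim=1)
        target = r + gamma * v_next
    return torch.nn.functional.mse_loss(q, target)
```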

SLIDE 15

Soft Actor Critic

  • In practice, actor-critic techniques tend to perform better than Q-learning.
  • Can we derive a soft actor-critic algorithm?
  • Yes. The idea:
    – Critic: soft Q-function
    – Actor: (greedy) softmax policy

SLIDE 16

Soft Policy Iteration

SoftPolicyIteration(MDP, $\lambda$)
  Initialize $\pi_0$ to any policy
  $n \leftarrow 0$
  Repeat
    Policy evaluation: repeat until convergence
      $Q^{\pi_n}_{\text{soft}}(s, a) \leftarrow R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a) \sum_{a'} \pi_n(a' \mid s') \left[ Q^{\pi_n}_{\text{soft}}(s', a') + \lambda H(\pi_n(\cdot \mid s')) \right] \quad \forall s, a$
    Policy improvement:
      $\pi_{n+1}(a \mid s) \leftarrow \operatorname{softmax}\!\left( Q^{\pi_n}_{\text{soft}}(s, a)/\lambda \right) = \frac{\exp(Q^{\pi_n}_{\text{soft}}(s, a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda)} \quad \forall s, a$
    $n \leftarrow n + 1$
  Until $\left\| Q^{\pi_n}_{\text{soft}}(s, a) - Q^{\pi_{n-1}}_{\text{soft}}(s, a) \right\|_\infty \le \epsilon$
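A minimal tabular sketch of this loop in Python (names are my own; unlike the pseudocode above, which stops on the Q-function difference, this sketch stops when the policy stops changing):

```python
import numpy as np

def soft_policy_iteration(R, P, gamma, lam, eps=1e-6):
    """Tabular soft policy iteration.

    R: (S, A) rewards, P: (S, A, S) transitions, lam: entropy weight."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)                # start from the uniform policy
    Q = np.zeros((S, A))
    while True:
        # Policy evaluation: iterate the soft Bellman backup for pi
        while True:
            H = -np.sum(pi * np.log(pi), axis=1)          # entropy of pi(.|s)
            V = np.sum(pi * Q, axis=1) + lam * H          # soft state value
            Q_new = R + gamma * (P @ V)
            delta = np.max(np.abs(Q_new - Q))
            Q = Q_new
            if delta <= eps:
                break
        # Policy improvement: softmax of Q/lambda
        pi_new = np.exp((Q - Q.max(axis=1, keepdims=True)) / lam)
        pi_new /= pi_new.sum(axis=1, keepdims=True)
        if np.max(np.abs(pi_new - pi)) <= eps:
            return Q, pi_new
        pi = pi_new
```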

SLIDE 17

Policy Improvement

Theorem 1: Let $Q^{\pi_n}_{\text{soft}}(s, a)$ be the Q-function of $\pi_n$.
Let $\pi_{n+1}(a \mid s) = \operatorname{softmax}\!\left( Q^{\pi_n}_{\text{soft}}(s, a)/\lambda \right)$.
Then $Q^{\pi_{n+1}}_{\text{soft}}(s, a) \ge Q^{\pi_n}_{\text{soft}}(s, a) \;\; \forall s, a$.

Proof: first show that

$$\sum_a \pi_n(a \mid s)\, Q^{\pi_n}_{\text{soft}}(s, a) + \lambda H(\pi_n(\cdot \mid s)) \le \sum_a \pi_{n+1}(a \mid s)\, Q^{\pi_n}_{\text{soft}}(s, a) + \lambda H(\pi_{n+1}(\cdot \mid s))$$

then use this inequality to show that $Q^{\pi_{n+1}}_{\text{soft}}(s, a) \ge Q^{\pi_n}_{\text{soft}}(s, a) \;\; \forall s, a$.

SLIDE 18

Inequality Derivation

$$\begin{aligned}
\sum_a \pi_n(a \mid s)\, Q^{\pi_n}_{\text{soft}}(s, a) + \lambda H(\pi_n(\cdot \mid s))
&= \sum_a \pi_n(a \mid s) \left[ Q^{\pi_n}_{\text{soft}}(s, a) - \lambda \log \pi_n(a \mid s) \right] \\
&= \sum_a \pi_n(a \mid s) \left[ \lambda \log \pi_{n+1}(a \mid s) + \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda) - \lambda \log \pi_n(a \mid s) \right] \\
&= \lambda \sum_a \pi_n(a \mid s) \left[ \log \frac{\pi_{n+1}(a \mid s)}{\pi_n(a \mid s)} + \log \sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda) \right] \\
&= -\lambda\, KL\!\left( \pi_n(\cdot \mid s) \,\middle\|\, \pi_{n+1}(\cdot \mid s) \right) + \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda) \\
&\le \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda) \\
&= \sum_a \pi_{n+1}(a \mid s)\, \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda) \\
&= \sum_a \pi_{n+1}(a \mid s) \left[ Q^{\pi_n}_{\text{soft}}(s, a) - \lambda \log \pi_{n+1}(a \mid s) \right] \\
&= \sum_a \pi_{n+1}(a \mid s)\, Q^{\pi_n}_{\text{soft}}(s, a) + \lambda H(\pi_{n+1}(\cdot \mid s))
\end{aligned}$$

since $\pi_{n+1}(a \mid s) = \frac{\exp(Q^{\pi_n}_{\text{soft}}(s, a)/\lambda)}{\sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda)}$, i.e., $\lambda \log \pi_{n+1}(a \mid s) = Q^{\pi_n}_{\text{soft}}(s, a) - \lambda \log \sum_{a'} \exp(Q^{\pi_n}_{\text{soft}}(s, a')/\lambda)$. The inequality step uses $\sum_a \pi_n(a \mid s) \log \frac{\pi_{n+1}(a \mid s)}{\pi_n(a \mid s)} = -KL(\pi_n \,\|\, \pi_{n+1}) \le 0$.
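A quick numerical sanity check of this inequality for a single state (a sketch in Python; all names and the random instances are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
Q = rng.normal(size=5)                        # Q^{pi_n}_soft(s, .) for one state
pi_n = rng.dirichlet(np.ones(5))              # arbitrary current policy
pi_next = np.exp(Q / lam) / np.exp(Q / lam).sum()   # softmax-improved policy

def objective(pi):
    # sum_a pi(a|s) Q(s,a) + lam * H(pi(.|s))
    return pi @ Q - lam * np.sum(pi * np.log(pi))

assert objective(pi_n) <= objective(pi_next) + 1e-12
print(objective(pi_n), objective(pi_next))
```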

SLIDE 19

Proof Derivation

$$\begin{aligned}
Q^{\pi_n}_{\text{soft}}(s, a)
&= R(s, a) + \gamma\, E_{s'}\!\left[ E_{a' \sim \pi_n}\!\left[ Q^{\pi_n}_{\text{soft}}(s', a') \right] + \lambda H(\pi_n(\cdot \mid s')) \right] \\
&\le R(s, a) + \gamma\, E_{s'}\!\left[ E_{a' \sim \pi_{n+1}}\!\left[ Q^{\pi_n}_{\text{soft}}(s', a') \right] + \lambda H(\pi_{n+1}(\cdot \mid s')) \right] \\
&\le \cdots \le Q^{\pi_{n+1}}_{\text{soft}}(s, a)
\end{aligned}$$

The first inequality holds since $E_{a' \sim \pi_n}\!\left[ Q^{\pi_n}_{\text{soft}}(s', a') \right] + \lambda H(\pi_n(\cdot \mid s')) \le E_{a' \sim \pi_{n+1}}\!\left[ Q^{\pi_n}_{\text{soft}}(s', a') \right] + \lambda H(\pi_{n+1}(\cdot \mid s'))$ (the inequality of the previous slide). The remaining steps repeatedly expand $Q^{\pi_n}_{\text{soft}}(s', a') \le R(s', a') + \gamma\, E_{s''}\!\left[ E_{a'' \sim \pi_{n+1}}\!\left[ Q^{\pi_n}_{\text{soft}}(s'', a'') \right] + \lambda H(\pi_{n+1}(\cdot \mid s'')) \right]$, yielding $Q^{\pi_{n+1}}_{\text{soft}}(s, a)$ in the limit.

SLIDE 20

Convergence to Optimal $Q^{*}_{\text{soft}}$ and $\pi^{*}_{\text{soft}}$

  • Theorem 2: When $\epsilon = 0$, soft policy iteration converges to the optimal $Q^{*}_{\text{soft}}$ and $\pi^{*}_{\text{soft}}$.
  • Proof:
    – We know that $Q^{\pi_{n+1}}_{\text{soft}}(s, a) \ge Q^{\pi_n}_{\text{soft}}(s, a) \;\; \forall s, a$ according to Theorem 1.
    – Since the Q-functions are upper bounded by $\left( \max_{s,a} R(s, a) + \lambda H(\text{uniform}) \right) / (1 - \gamma)$, soft policy iteration converges.
    – At convergence, $Q^{\pi_{n+1}}_{\text{soft}} = Q^{\pi_n}_{\text{soft}}$, and therefore the Q-function satisfies Bellman's equation:

$$Q^{\pi_{n+1}}_{\text{soft}}(s, a) = Q^{\pi_n}_{\text{soft}}(s, a) = R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \widetilde{\max}_{a'}\, Q^{\pi_{n+1}}_{\text{soft}}(s', a')$$

SLIDE 21

Soft Actor-Critic

  • RL version of soft policy iteration
  • Use neural networks to represent the policy and value function
  • At each policy improvement step, project the new policy into the space of parameterized neural nets

SLIDE 22

Soft Actor Critic (SAC)

Initialize weights $\mathbf{w}$, $\bar{\mathbf{w}}$, $\theta$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Sample action $a \sim \pi_\theta(\cdot \mid s)$ and execute it
  Receive immediate reward $r$
  Observe new state $s'$
  Add $(s, a, s', r)$ to experience buffer
  Sample mini-batch of experiences from buffer
  For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch
    Sample $\hat{a}' \sim \pi_\theta(\cdot \mid \hat{s}')$
    Gradient: $\frac{\partial Err}{\partial \mathbf{w}} = \left[ Q^{\mathbf{w}}_{\text{soft}}(\hat{s}, \hat{a}) - \hat{r} - \gamma \left( Q^{\bar{\mathbf{w}}}_{\text{soft}}(\hat{s}', \hat{a}') + \lambda H(\pi_\theta(\cdot \mid \hat{s}')) \right) \right] \frac{\partial Q^{\mathbf{w}}_{\text{soft}}(\hat{s}, \hat{a})}{\partial \mathbf{w}}$
    Update weights: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial Err}{\partial \mathbf{w}}$
    Update policy: $\theta \leftarrow \theta - \alpha \frac{\partial}{\partial \theta} KL\!\left( \pi_\theta \,\middle\|\, \operatorname{softmax}(Q^{\mathbf{w}}_{\text{soft}}/\lambda) \right)$
  Update state: $s \leftarrow s'$
  Every $c$ steps, update target: $\bar{\mathbf{w}} \leftarrow \mathbf{w}$
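For concreteness, here is a minimal sketch of the two losses for a discrete-action variant of SAC in PyTorch. All names (`qnet`, `target_qnet`, `policy_net`) are assumptions, and unlike the pseudocode above, the expectation over next actions is computed exactly over the discrete action distribution rather than sampled:

```python
import torch
import torch.nn.functional as F

def sac_losses(qnet, target_qnet, policy_net, batch, gamma, lam):
    """Critic and actor losses for one mini-batch.

    qnet, target_qnet: state -> Q-values (B, A); policy_net: state -> logits (B, A)."""
    s, a, r, s_next = batch

    # Critic: TD target = r + gamma * (E_{a'~pi}[Q_wbar(s', a')] + lam * H(pi(.|s'))),
    # with the expectation taken exactly over the discrete action set.
    with torch.no_grad():
        log_pi_next = F.log_softmax(policy_net(s_next), dim=1)
        pi_next = log_pi_next.exp()
        v_next = (pi_next * (target_qnet(s_next) - lam * log_pi_next)).sum(dim=1)
        target = r + gamma * v_next
    q = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = F.mse_loss(q, target)

    # Actor: minimize KL(pi_theta || softmax(Q_w/lam)); up to a constant in
    # theta this equals minimizing E_{a~pi}[lam * log pi(a|s) - Q_w(s, a)].
    log_pi = F.log_softmax(policy_net(s), dim=1)
    pi = log_pi.exp()
    actor_loss = (pi * (lam * log_pi - qnet(s).detach())).sum(dim=1).mean()

    return critic_loss, actor_loss
```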

SLIDE 23

Empirical Results

  • Comparison on several robotics tasks

From Haarnoja, Zhou et al. (2018)

SLIDE 24

Robustness to Environment Changes

Check out this video: https://youtu.be/KOObeIjzXTY

Using Soft Actor Critic (SAC), Minitaur learns to walk quickly and to generalize to environments with challenges that it was not trained to deal with!