SLIDE 1

Control Regularization for Reduced Variance Reinforcement Learning

Richard Cheng, Abhinav Verma, Gábor Orosz, Swarat Chaudhuri, Yisong Yue, Joel W. Burdick

SLIDE 2

Reinforcement Learning

Reinforcement learning (RL) studies how to use data from interactions with the environment to learn an optimal policy. Policy gradient-based optimization requires no prior information, only sampled trajectories:

Policy: $\pi_\theta(a \mid s) : S \times A \to [0, 1]$

Trajectory: $\tau : (s_t, a_t, \ldots, s_{t+N}, a_{t+N})$

Reward optimization: $\max_\theta J(\theta) = \max_\theta \, \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]$

(Figure from Sergey Levine)

(Williams, 1992; Sutton et al., 1999; Baxter and Bartlett, 2000; Greensmith et al., 2004)
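To make the policy-gradient setup concrete, here is a minimal REINFORCE-style sketch (Williams, 1992). The tabular softmax policy, toy dynamics, reward, and hyperparameters are illustrative assumptions, not details from the talk:

```python
import numpy as np

# Minimal REINFORCE-style policy gradient sketch (Williams, 1992).
# Toy environment and hyperparameters are assumptions for illustration.

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
gamma, lr = 0.99, 0.1
theta = np.zeros((n_states, n_actions))  # policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(T=50):
    """Sample one trajectory tau = (s_0, a_0, r_0, ...) from pi_theta."""
    s, traj = 0, []
    for _ in range(T):
        a = rng.choice(n_actions, p=softmax(theta[s]))
        r = 1.0 if a == s % n_actions else 0.0  # toy reward
        traj.append((s, a, r))
        s = (s + 1) % n_states                  # toy dynamics
    return traj

for episode in range(200):
    G = 0.0
    for s, a, r in reversed(rollout()):
        G = r + gamma * G                        # return-to-go
        grad_logp = -softmax(theta[s])           # grad of log pi wrt theta[s]
        grad_logp[a] += 1.0
        theta[s] += lr * G * grad_logp           # REINFORCE update
```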

SLIDE 3

Variance in Reinforcement Learning

Policy gradients allow us to optimize the policy with no prior information, using only sampled trajectories from interactions. However, RL methods suffer from high variance in learning (Islam et al., 2017; Henderson et al., 2018), which has motivated a range of variance-reduction techniques (Greensmith et al., 2004; Zhao et al., 2012; Zhao et al., 2015; Thodoroff et al., 2018).

[Figure from Alex Irpan: inverted pendulum learning curves over 10 random seeds]

SLIDE 4

[Figure from Kris Hauser]

However, is this necessary or even desirable?

In many tasks we already have a control prior, $a = u_{\mathrm{prior}}(s)$.

Cartpole: an LQR controller designed on the linearized dynamics $s_{t+1} \approx f(s_t) + g(s_t)\,a_t$.

The nominal controller is stable, but it is based on:

  • An error-prone model
  • Linearized dynamics

(A sketch of such an LQR prior follows below.)
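As an illustration of where such a prior might come from, here is a sketch of an LQR controller computed from cartpole dynamics linearized about the upright equilibrium. The masses, lengths, and cost matrices are assumed toy values; this is not the talk's or the paper's code:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Sketch: an LQR control prior from cartpole dynamics linearized about
# the upright equilibrium. All numeric values are assumed toy parameters.

m_c, m_p, l, g = 1.0, 0.1, 0.5, 9.81  # cart mass, pole mass, length, gravity

# Small-angle linearization x_dot = A x + B a,
# state x = [position, velocity, pole angle, angular velocity].
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, -m_p * g / m_c, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, (m_c + m_p) * g / (m_c * l), 0.0]])
B = np.array([[0.0], [1.0 / m_c], [0.0], [-1.0 / (m_c * l)]])

Q = np.diag([1.0, 1.0, 10.0, 1.0])  # state cost (penalize pole angle most)
R = np.array([[0.1]])               # control cost

# Solve the continuous-time algebraic Riccati equation; K = R^{-1} B^T P.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)

def u_prior(s):
    """Stabilizing LQR feedback on the linearized model."""
    return float(-K @ s)
```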
SLIDE 5

Regularization with a Control Prior

Combine the control prior, $u_{\mathrm{prior}}(s)$, with the learned controller, $u_{\theta_k}(s)$, sampled from $\pi_{\theta_k}(a \mid s)$; $\pi_{\theta_k}$ is learned in the same manner as before, but with samples drawn from the new, mixed distribution:

$$u_k(s) = \frac{1}{1+\lambda}\, u_{\theta_k}(s) + \frac{\lambda}{1+\lambda}\, u_{\mathrm{prior}}(s)$$

$\lambda$ is a regularization parameter weighting the prior vs. the learned controller (Johannink et al., 2018; Silver et al., 2019).

Under the assumption of Gaussian exploration noise (i.e. $\pi_\theta(a \mid s)$ has a Gaussian distribution), the mixing can be equivalently expressed as a constrained optimization problem: pick the action closest to the learned controller, subject to staying within a $\lambda$-dependent distance of the control prior.
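A minimal sketch of the mixing step, assuming a 1-D action and Gaussian exploration noise; `u_theta` and `u_prior` are placeholder callables, not names from the paper's codebase:

```python
import numpy as np

# Sketch of the control-regularized action (assumed 1-D action,
# Gaussian exploration noise, placeholder policy/prior functions).

rng = np.random.default_rng(1)

def mixed_action(s, u_theta, u_prior, lam, sigma=0.1):
    """Sample from the regularized policy.

    The sampled learned action N(u_theta(s), sigma^2) is averaged with
    the deterministic prior, so the mixed action is Gaussian with mean
    (u_theta(s) + lam * u_prior(s)) / (1 + lam) and std sigma / (1 + lam).
    """
    u_learned = u_theta(s) + sigma * rng.standard_normal()
    return (u_learned + lam * u_prior(s)) / (1.0 + lam)
```

For $\lambda = 0$ this recovers the pure RL policy; as $\lambda \to \infty$ it reduces to the control prior.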
SLIDE 6

Interpretation of the Prior

Theorem 1. Using the mixed policy above, the variance of each policy gradient step is reduced by the factor $\frac{1}{(1+\lambda)^2}$.

However, this may introduce bias into the policy, bounded in terms of $D_{TV}(\pi_{\theta}, \pi_{\mathrm{prior}})$, where $D_{TV}$ represents the total variation distance between the two policies.
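The variance factor can be checked in one line from the Gaussian-mixing form above (a sketch under that assumption, not the paper's full proof): with $\tilde{u}_{\theta} \sim \mathcal{N}(u_{\theta}(s), \sigma^2)$ and a deterministic prior,

$$\operatorname{Var}\left[\frac{1}{1+\lambda}\,\tilde{u}_{\theta} + \frac{\lambda}{1+\lambda}\,u_{\mathrm{prior}}(s)\right] = \frac{1}{(1+\lambda)^2}\operatorname{Var}\left[\tilde{u}_{\theta}\right] = \frac{\sigma^2}{(1+\lambda)^2},$$

and this shrinkage of the exploration noise is what drives the reduced variance of the policy gradient estimates.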

SLIDE 7

Interpretation of the Prior

The variance-bias tradeoff manifests in how much room the policy has to explore:

  • Strong regularization: the control prior heavily constrains exploration. The policy stabilizes to the red trajectory, but may miss the green one.
  • Weak regularization: greater room for exploration, but the policy may not stabilize around the red trajectory.

SLIDE 8

Stability Properties from the Prior

Regularization allows us to "capture" stability properties from a robust control prior.

Theorem 2. Assume a stabilizing $\mathcal{H}_\infty$ control prior within the set $\mathcal{C}$ for the dynamical system $s_{t+1} = f(s_t) + g(s_t)\,a_t$. Then asymptotic stability and forward invariance of the set $\mathcal{S}_{st} \subseteq \mathcal{C}$ is guaranteed under the regularized policy, for all $s \in \mathcal{C}$.

Cartpole: with a robust control prior, the regularized controller always remains near the equilibrium point, even during learning.
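The following toy simulation illustrates this behavior (a sketch under assumed dynamics: a 1-D unstable linear system with a stabilizing linear prior, not the paper's cartpole experiment):

```python
import numpy as np

# Toy numerical check (illustrative assumptions throughout): an unstable
# 1-D linear system with a stabilizing linear prior. Even while the
# "learned" policy is still pure noise, strong regularization keeps the
# state near the equilibrium.

rng = np.random.default_rng(2)
a, b = 1.1, 1.0   # open-loop unstable: |a| > 1
k = 0.8           # prior gain, stabilizing since |a - b*k| < 1
lam = 4.0         # strong regularization

s = 1.0
for t in range(100):
    u = (rng.normal() + lam * (-k * s)) / (1.0 + lam)  # mixed action
    s = a * s + b * u

print(f"|s| after 100 steps: {abs(s):.3f}")  # remains small
```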

SLIDE 9

Results

Two experimental domains: a chain of cars following each other, where the goal is to optimize the fuel efficiency of the middle car; and a simulated racecar, where the goal is to minimize lap time.

Control Regularization helps by providing:

  • Reduced variance
  • Higher rewards
  • Faster learning
  • Potential safety guarantees

However, high regularization also leads to potential bias.

See the poster for similar results on the CartPole domain.

Code at: https://github.com/rcheng805/CORE-RL

Poster number: 42