Fingerprint Policy Optimisation for Robust Reinforcement Learning - PowerPoint PPT Presentation



SLIDE 1

Fingerprint Policy Optimisation for Robust Reinforcement Learning

Supratik Paul, Michael A. Osborne, Shimon Whiteson

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement #637713)

Poster #50

SLIDE 6

Motivation

  • Environment variable (EV)
  • E.g. wind conditions
  • Controllable during learning but not during execution
  • Objective: Find ρ* = argmax_ρ J(ρ) = argmax_ρ E_{θ ~ p(θ)}[R(ρ)]
  • Need to account for rare events
  • E.g. rare wind conditions leading to a crash
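To make the rare-event issue concrete, here is a minimal numeric sketch (the functions R and J, the crash penalty, and the two-point EV distribution are all illustrative, not from the paper): a 2%-probability EV mode can dominate the objective.

```python
# Illustrative sketch, not from the paper: a discrete EV theta with a rare
# "crash" mode (probability 0.02) that dominates the objective
# J(rho) = E_{theta ~ p(theta)}[R(rho, theta)].

def R(rho, theta):
    """Toy return: theta = 1 is the rare crash EV with a large penalty."""
    return -100.0 if theta == 1 else 1.0 - (rho - theta) ** 2

def J(rho, p_rare=0.02):
    """Exact expectation over the two-point EV distribution."""
    return (1.0 - p_rare) * R(rho, 0) + p_rare * R(rho, 1)

print(J(0.0))  # 0.98 * 1.0 + 0.02 * (-100.0) = -1.02
```

Even though the crash EV occurs only 2% of the time, its penalty flips the sign of the objective, so a policy tuned only to the common case can be badly suboptimal.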
SLIDE 11

Naïve application of policy gradients

  • Monte Carlo estimate of the policy gradient has very high variance
  • ⟹ Doomed to failure

[Figure: sampled trajectories ~ ρ, with the rare-events region marked]
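The variance claim can be checked with a small simulation (toy returns, not the paper's setup): batches of 100 sampled EVs give Monte Carlo estimates whose spread is on the order of the true value itself, because most batches contain no rare event at all.

```python
import numpy as np

# Sketch with made-up numbers: estimate J = E[R] by Monte Carlo when a rare
# EV (probability 0.02) carries an extreme return. Most batches of 100
# samples miss the event entirely, so the estimator's spread swamps |J|.
rng = np.random.default_rng(0)
p_rare, batch = 0.02, 100

def mc_estimate():
    rare = rng.random(batch) < p_rare          # which samples hit the rare EV
    return np.where(rare, -100.0, 1.0).mean()  # batch-average return

estimates = np.array([mc_estimate() for _ in range(1000)])
true_J = (1 - p_rare) * 1.0 + p_rare * (-100.0)  # exact value: -1.02
print(true_J, estimates.std())  # std exceeds |true_J|
```

The same effect hits the gradient estimate: the rare-event term is either absent from a batch or enormous, so successive gradient steps point in wildly different directions.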

SLIDE 15

Fingerprint Policy Optimisation (FPO)

At each iteration, select the parameters ω of a sampling distribution q_ω(θ) over EVs such that it maximises the one-step expected return

SLIDE 19

Fingerprint Policy Optimisation (FPO)

ρ is high dimensional ⟹ use a low-dimensional representation, the "fingerprint"

  • ρ' = ρ + α∇J(ρ)
  • J(ρ') = f(ρ, ω)
  • Model J(ρ') as a Gaussian process with inputs (ρ, ω)
  • Use Bayesian optimisation to select ω|ρ = argmax_ω f(ρ, ω)

SLIDE 23

Policy fingerprints

  • Disambiguation, not accurate representation
  • State/action fingerprints: Gaussians fitted to the stationary state/action distribution induced by ρ
  • A gross simplification, but good at disambiguating between policies
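As a sketch (illustrative, not the authors' exact construction), fitting a diagonal Gaussian to visited states reduces a policy to per-dimension means and standard deviations:

```python
import numpy as np

# Illustrative sketch: summarise a policy by a diagonal Gaussian fitted to
# states sampled from the stationary distribution it induces. The
# fingerprint is the concatenated per-dimension mean and std.

def state_fingerprint(states):
    """states: (T, state_dim) array of states visited under the policy."""
    states = np.asarray(states, dtype=float)
    return np.concatenate([states.mean(axis=0), states.std(axis=0)])

fp = state_fingerprint([[0.0, 1.0], [2.0, 1.0]])
print(fp)  # [1. 1. 1. 0.]: means (1, 1), stds (1, 0)
```

Two different policies can share such a summary, but that is acceptable here: the fingerprint only needs to tell apart the policies that actually arise during one optimisation run, not to reconstruct them.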

SLIDE 24

Results

Half Cheetah
  • Velocity target = 2 with probability 98%, with 'normal' reward
  • Velocity target = 4 with probability 2%, with a significantly higher reward

Ant
  • Reward proportional to velocity
  • 5% chance that velocity > 2 leads to joint damage with a large negative reward
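The Half Cheetah variant's reward structure can be sketched as follows (the constants, the bonus magnitude, and the function names are guesses for illustration only; the paper defines the exact rewards):

```python
import numpy as np

# Hypothetical sketch of the modified Half Cheetah task described above:
# the velocity target is itself the EV, equal to 2 with probability 0.98
# and to the rare, higher-paying target 4 with probability 0.02.

def sample_velocity_target(rng):
    """Draw the EV (velocity target) for one episode."""
    return 4.0 if rng.random() < 0.02 else 2.0

def reward(velocity, target, rare_bonus=10.0):
    tracking = -abs(velocity - target)  # reward for tracking the target
    return tracking + (rare_bonus if target == 4.0 else 0.0)

rng = np.random.default_rng(0)
targets = [sample_velocity_target(rng) for _ in range(10_000)]
print(sum(t == 4.0 for t in targets) / len(targets))  # close to 0.02
```

A risk-neutral learner that only ever sees the 98% case will settle on tracking velocity 2; exploiting the rare high-reward target requires exactly the kind of rare-event-aware training that FPO provides.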

SLIDE 25

Fingerprint Policy Optimisation for Robust Reinforcement Learning

Supratik Paul, Michael A. Osborne, Shimon Whiteson

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement #637713)

Poster #50