slide-1
SLIDE 1

DualDICE

Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Ofir Nachum,* Yinlam Chow,* Bo Dai, Lihong Li

Google Research

*Equal contribution

slide-10
SLIDE 10

Reinforcement Learning

  • A policy acts on an environment (see the rollout sketch after this slide).

[diagram] s0 ∼ β (initial state distribution); at each step t, the policy samples a_t ∼ π(·|s_t), the environment returns a reward r_t ∼ R(·|s_t, a_t) and a next state s_{t+1} ∼ T(·|s_t, a_t).

  • Question: What is the value (average reward) of the policy?
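The interaction loop on this slide is easy to make concrete. Below is a minimal sketch, not from the paper, of a rollout in a randomly generated toy MDP; all names here (N_STATES, rollout, the Dirichlet-sampled tables) are illustrative assumptions.

```python
# Toy sketch of the slide's interaction loop: s0 ~ beta, then repeatedly
# a_t ~ pi(.|s_t), r_t = R(s_t, a_t), s_{t+1} ~ T(.|s_t, a_t).
# The MDP below is randomly generated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 3, 2

beta = rng.dirichlet(np.ones(N_STATES))                # initial distribution beta(s)
pi = rng.dirichlet(np.ones(N_ACTIONS), size=N_STATES)  # target policy pi(a|s)
T = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # T(s'|s,a)
R = rng.random((N_STATES, N_ACTIONS))                  # mean reward R(s,a)

def rollout(policy, horizon=5):
    """Sample a trajectory of (s, a, r, s') transitions under `policy`."""
    s = rng.choice(N_STATES, p=beta)
    trajectory = []
    for _ in range(horizon):
        a = rng.choice(N_ACTIONS, p=policy[s])
        r = R[s, a]
        s_next = rng.choice(N_STATES, p=T[s, a])
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory

print(rollout(pi))
```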
slide-14
SLIDE 14

Off-policy Policy Estimation

  • Want to estimate the average discounted per-step reward of the policy, ρ(π) = (1 − γ) E[Σ_t γ^t r_t].
  • Only have access to a finite experience dataset D = {(s_i, a_i, r_i, s′_i)}, where transitions (s, a) are drawn from some unknown distribution dD.
  • Don't even know the behavior policy! (See the dataset sketch below.)

[diagram] D as a stream of transition tuples: (s, a, r, s′) (s, a, r, s′) (s, a, r, s′) …
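To make the behavior-agnostic setting concrete, here is a short continuation of the toy sketch above: the dataset is just a bag of transition tuples, and the estimator never sees the policy that generated them. The behavior policy `mu` below is a hypothetical stand-in.

```python
# Continuing the toy MDP above: build a dataset D of (s, a, r, s') tuples.
# The behavior policy `mu` is unknown to the estimator -- we generate data
# with it here but never use it again (the behavior-agnostic setting).
from collections import namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

mu = rng.dirichlet(np.ones(N_ACTIONS), size=N_STATES)  # unknown behavior policy

def make_dataset(num_transitions=5000):
    data = []
    while len(data) < num_transitions:
        data.extend(Transition(*t) for t in rollout(mu))
    return data[:num_transitions]

D = make_dataset()
print(len(D), D[0])
```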

slide-21
SLIDE 21

Reduction of OPE to Density Ratio Estimation

  • Can write ρ(π) = E_{(s,a)∼dπ}[r(s, a)], where dπ is the discounted on-policy distribution, dπ(s, a) = (1 − γ) Σ_t γ^t Pr(s_t = s, a_t = a | π).
  • Using the importance weighting trick, we have ρ(π) = E_{(s,a)∼dD}[wπ/D(s, a) · r(s, a)], with wπ/D(s, a) = dπ(s, a) / dD(s, a).
  • Given a finite dataset, this corresponds to the weighted average ρ̂(π) = (1/N) Σ_i wπ/D(s_i, a_i) · r_i (see the sketch below).
  • Problem reduces to estimating the weights (density ratios) wπ/D.
  • Difficult because we don't have access to the environment, and we don't have explicit knowledge of dD(s, a), only samples.
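Once the weights are in hand, the reduction is one line of code. The sketch below continues the toy example; `weights` stands in for whatever per-transition estimate of dπ(s, a)/dD(s, a) is available, and producing good weights is exactly the problem the next section addresses.

```python
# OPE as a weighted average: rho_hat(pi) = (1/N) * sum_i w(s_i, a_i) * r_i.
# `weights` is a hypothetical per-transition estimate of d_pi(s,a)/d_D(s,a).
def ope_estimate(dataset, weights):
    rewards = np.array([t.r for t in dataset])
    return float(np.mean(weights * rewards))

# With all weights equal to 1 this is just the behavior data's average reward;
# correct density ratios re-weight it to the target policy's value rho(pi).
print(ope_estimate(D, np.ones(len(D))))
```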

slide-31
SLIDE 31
The DualDICE Objective

  • Define the zero-reward Bellman operator as Bπν(s, a) := γ E_{s′∼T(·|s,a), a′∼π(·|s′)}[ν(s′, a′)].
  • Objective: min over ν of ½ E_{(s,a)∼dD}[((ν − Bπν)(s, a))²] − (1 − γ) E_{s0∼β, a0∼π(·|s0)}[ν(s0, a0)] (a tabular sketch follows this slide).

[diagram] along a trajectory s0, s1, s2, …: the first term minimizes the squared Bellman error on the off-policy data; the second term maximizes the initial “nu-values”. At the optimum, the Bellman residual (ν* − Bπν*)(s, a) equals the density ratio wπ/D(s, a).

  • Nice: The objective is based on expectations over dD, β, and π, all of which we have access to.
  • Extension 1: Can remove the appearance of the Bellman operator from both the objective and its solution by applying the Fenchel conjugate!
  • Extension 2: Can generalize this result to any convex function (not just the square)!
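In the tabular case the objective above can be solved in closed form, which makes the "Bellman residual = density ratio" claim easy to verify. The sketch below is a simplification under stated assumptions (exact dynamics, a synthetic fully supported dD, the plain squared-error objective rather than the Fenchel or convex extensions); it is not the paper's function-approximation algorithm.

```python
# Tabular check of the DualDICE objective: minimizing
#   J(nu) = 1/2 E_{(s,a)~dD}[((nu - B nu)(s,a))^2] - (1-gamma) E_{beta,pi}[nu(s0,a0)]
# is a linear least-squares problem, and the optimal Bellman residual
# nu* - B nu* equals the density ratio d_pi / d_D exactly.
import numpy as np

rng = np.random.default_rng(0)
S, A_n, gamma = 3, 2, 0.95
SA = S * A_n

beta = rng.dirichlet(np.ones(S))              # initial state distribution
pi = rng.dirichlet(np.ones(A_n), size=S)      # target policy pi(a|s)
T = rng.dirichlet(np.ones(S), size=(S, A_n))  # dynamics T(s'|s,a)

# P[(s,a),(s',a')] = T(s'|s,a) * pi(a'|s'): state-action transition matrix.
P = np.einsum("xay,yb->xayb", T, pi).reshape(SA, SA)
mu0 = (beta[:, None] * pi).reshape(SA)        # distribution of (s0, a0)

# A synthetic, fully supported off-policy data distribution dD(s,a).
dD = rng.dirichlet(np.ones(SA))

# Zero-reward Bellman operator: B nu = gamma * P @ nu; residual operator I - B.
Resid = np.eye(SA) - gamma * P

# First-order condition of J: Resid^T diag(dD) Resid nu = (1 - gamma) mu0.
nu = np.linalg.solve(Resid.T @ np.diag(dD) @ Resid, (1 - gamma) * mu0)
w = Resid @ nu                                # Bellman residuals = weights

# Ground truth: d_pi solves (I - gamma * P^T) d_pi = (1 - gamma) * mu0.
d_pi = np.linalg.solve(np.eye(SA) - gamma * P.T, (1 - gamma) * mu0)
print(np.allclose(w, d_pi / dD))              # -> True
```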

slide-33
SLIDE 33

DualDICE Results

  • DualDICE accuracy during training compared to existing methods.

East Exhibition Hall B+C

Poster #205