

SLIDE 1

Lecture 2: Gradient Estimators

CSC 2547, Spring 2018
David Duvenaud

Based mainly on slides by Will Grathwohl, Dami Choi, Yuhuai Wu and Geoff Roeder

SLIDE 2

Where do we see this guy?

  • Just about everywhere!
  • Variational Inference
  • Reinforcement Learning
  • Hard Attention
  • And so many more!

L(θ) = E_{p(b|θ)}[f(b)]

SLIDE 3

Gradient-based optimization

  • Gradient-based optimization is the standard method used today to optimize expectations
  • Necessary if models are neural-net based
  • Very rarely can this gradient be computed analytically

SLIDE 4

Otherwise, we estimate…

  • A number of approaches exist to estimate this gradient
  • They make varying levels of assumptions about the distribution and function being optimized
  • Most popular methods either make strong assumptions or suffer from high variance

SLIDE 5

REINFORCE (Williams, 1992)

  • Unbiased
  • Has few requirements
  • Easy to compute
  • Suffers from high variance

ĝ_REINFORCE[f] = f(b) ∂/∂θ log p(b|θ),   b ∼ p(b|θ)
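To make the estimator concrete, here is a minimal numpy sketch for a single Bernoulli variable (the choice of f and all names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(b):                          # black-box function; only point evaluations needed
    return (0.45 - b) ** 2

def reinforce_grad(theta, n_samples=100_000):
    p = 1.0 / (1.0 + np.exp(-theta))                 # p(b = 1 | theta), theta a logit
    b = (rng.random(n_samples) < p).astype(float)    # b ~ p(b | theta)
    score = b - p                  # d/dtheta log p(b|theta) for a Bernoulli logit
    return np.mean(f(b) * score)   # unbiased, but high variance per sample

print(reinforce_grad(theta=0.0))   # true gradient: 0.1 * p * (1 - p) = 0.025
```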

SLIDE 6

Reparameterization (Kingma & Welling, 2014)

  • Lower variance empirically
  • Unbiased
  • Makes stronger assumptions
  • Requires that f(b) is known and differentiable
  • Requires that p(b|θ) is reparameterizable

ĝ_reparam[f] = (∂f/∂b)(∂b/∂θ),   b = T(θ, ε), ε ∼ p(ε)
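A minimal sketch (my example, not from the slides), taking p(b|θ) = N(θ, 1) so that b = T(θ, ε) = θ + ε:

```python
import torch

theta = torch.tensor(0.5, requires_grad=True)

def f(b):                        # must be known and differentiable
    return (b - 2.0) ** 2

eps = torch.randn(100_000)       # eps ~ p(eps) = N(0, 1)
b = theta + eps                  # b = T(theta, eps), pathwise dependence on theta
f(b).mean().backward()
print(theta.grad)                # ~ dE[f]/dtheta = 2 * (theta - 2) = -3
```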

SLIDE 7

Concrete (Maddison et al., 2016)

  • Works well in practice
  • Low variance from reparameterization
  • Biased
  • Adds a temperature hyper-parameter
  • Requires that f(b) is known and differentiable
  • Requires that p(z|θ) is reparameterizable
  • Requires that f(b) behaves predictably outside of its domain

ĝ_concrete[f] = (∂f/∂σ(z/t)) (∂σ(z/t)/∂θ),   z = T(θ, ε), ε ∼ p(ε)
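A minimal sketch of the binary case (names are mine): relax the hard sample H(z) to σ(z/t) and backpropagate through the relaxation.

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)   # logit of p(b = 1)
t = 0.5                                         # temperature hyper-parameter

def f(b):                        # must behave predictably for b in (0, 1)
    return (0.45 - b) ** 2

u = torch.rand(100_000)
z = theta + torch.log(u) - torch.log1p(-u)      # z = T(theta, eps), logistic noise
b_soft = torch.sigmoid(z / t)                   # relaxed sample in (0, 1)
f(b_soft).mean().backward()
print(theta.grad)                # low variance, but biased for the true 0.025
```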

SLIDE 8

Control Variates

  • Allow us to reduce the variance of a Monte Carlo estimator
  • Variance is reduced if corr(g, c) > 0
  • Does not change the bias

ĝ_new(b) = ĝ(b) − c(b) + E_{p(b)}[c(b)]
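A toy numpy illustration of the identity above (my own example): estimate E[e^b] for b ∼ Uniform(0, 1) using c(b) = b, whose mean 1/2 is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.random(100_000)

g = np.exp(b)                    # raw estimator samples
g_new = g - b + 0.5              # g(b) - c(b) + E[c(b)]: same expectation

print(g.mean(), g.var())         # ~1.718, var ~0.24
print(g_new.mean(), g_new.var()) # same mean, var ~0.04, since corr(g, c) is high
```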

SLIDE 9

Putting it all together

  • We would like a general gradient estimator that is:
  • unbiased
  • low variance
  • usable when f(b) is unknown
  • usable when p(b|θ) is discrete

SLIDE 12

Backpropagation Through the Void

Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud

SLIDE 13

Our Approach

ĝ_LAX = ĝ_REINFORCE[f] − ĝ_REINFORCE[c_φ] + ĝ_reparam[c_φ]

SLIDE 14

Our Approach

  • Start with the REINFORCE estimator for f(b)
  • We introduce a new function c_φ(b)
  • We subtract the REINFORCE estimator of its gradient and add the reparameterization estimator
  • Can be thought of as using the REINFORCE estimator of c_φ(b) as a control variate (see the sketch after the equations)

ĝ_LAX = ĝ_REINFORCE[f] − ĝ_REINFORCE[c_φ] + ĝ_reparam[c_φ]
      = [f(b) − c_φ(b)] ∂/∂θ log p(b|θ) + ∂/∂θ c_φ(b)
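A minimal single-sample sketch of this formula (my construction, assuming p(b|θ) = N(θ, 1) and a tiny MLP surrogate; not the authors' code):

```python
import torch

theta = torch.tensor(0.3, requires_grad=True)
c_phi = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 1))

def f(b):                                   # may be a black box
    return (b - 2.0) ** 2

eps = torch.randn(())
b = theta + eps                             # b = T(theta, eps), reparameterized
log_p = -0.5 * (b.detach() - theta) ** 2    # log N(b; theta, 1) up to a constant
c = c_phi(b.reshape(1)).squeeze()

# [f(b) - c_phi(b)] d/dtheta log p(b|theta)  +  d/dtheta c_phi(b)
surrogate = (f(b.detach()) - c.detach()) * log_p + c
g_lax, = torch.autograd.grad(surrogate, theta, create_graph=True)
print(g_lax)   # create_graph=True lets us differentiate g_lax w.r.t. phi below
```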

SLIDE 15

Optimizing the Control Variate

  • For any unbiased estimator ĝ, we can get Monte Carlo estimates of the gradient of the variance of ĝ
  • Use these to optimize c_φ

∂/∂φ Variance(ĝ) = E[∂/∂φ ĝ²]
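Continuing the LAX sketch above: since the estimator is unbiased for every φ, E[ĝ] does not depend on φ, so a single-sample gradient of ĝ² with respect to the surrogate's parameters is a stochastic descent direction on the variance.

```python
opt = torch.optim.Adam(c_phi.parameters(), lr=1e-2)
opt.zero_grad()
(g_lax ** 2).backward()    # E[d/dphi g^2] = d/dphi Var(g), since E[g] is phi-free
opt.step()                 # one stochastic step reducing Var(g_lax)
```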

SLIDE 16

What about discrete b?

SLIDE 17

Extension to discrete p(b|θ)

  • When b is discrete, we introduce a relaxed distribution p(z|θ) and a function H(z) where H(z) = b ∼ p(b|θ)
  • We use the conditioning scheme introduced in REBAR (Tucker et al., 2017)
  • Unbiased for all c_φ

ĝ_RELAX = [f(b) − c_φ(z̃)] ∂/∂θ log p(b|θ) + ∂/∂θ c_φ(z) − ∂/∂θ c_φ(z̃)

b = H(z),   z ∼ p(z|θ),   z̃ ∼ p(z|b, θ)
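A single-sample sketch of RELAX for one Bernoulli variable (my restatement; the conditional noise construction below is my derivation for the logistic parameterization and should be treated as an assumption, not the authors' code):

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)
c_phi = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 1))

def f(b):
    return (0.45 - b) ** 2

def logistic(theta, u):
    return theta + torch.log(u) - torch.log1p(-u)

p = torch.sigmoid(theta)
u, v = torch.rand(()), torch.rand(())
z = logistic(theta, u)                       # z ~ p(z|theta)
b = (z > 0).float()                          # b = H(z), hard sample
# z_tilde ~ p(z|b, theta): squeeze v into the noise region consistent with b
u_cond = b * (1 - p + v * p) + (1 - b) * v * (1 - p)
z_tilde = logistic(theta, u_cond)

log_p = b * torch.log(p) + (1 - b) * torch.log1p(-p)   # log p(b|theta)
c_z = c_phi(z.reshape(1)).squeeze()
c_zt = c_phi(z_tilde.reshape(1)).squeeze()

surrogate = (f(b) - c_zt.detach()) * log_p + c_z - c_zt
g_relax, = torch.autograd.grad(surrogate, theta, create_graph=True)
print(g_relax)   # unbiased for any c_phi; g_relax**2 trains c_phi as above
```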

SLIDE 18

A Simple Example

  • Used to validate REBAR (which used t = 0.45)
  • We use t = 0.499
  • REBAR and REINFORCE fail due to noise outweighing signal
  • Can RELAX improve?

E_{p(b|θ)}[(t − b)²]
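A quick sanity check (my own arithmetic, not on the slide) of why t near 0.5 is hard: the true gradient signal nearly vanishes while individual REINFORCE samples stay O(1).

```latex
L(p) = \mathbb{E}_{p(b|\theta)}\!\left[(t-b)^2\right]
     = p(1-t)^2 + (1-p)\,t^2
     = t^2 + (1-2t)\,p
\qquad\Rightarrow\qquad
\frac{dL}{dp} = 1 - 2t = 0.002 \ \text{at } t = 0.499.
```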

SLIDE 19
  • RELAX outperforms baselines
  • Considerably reduced variance!
  • RELAX learns a reasonable surrogate
SLIDE 20

Analyzing the Surrogate

  • REBAR’s fixed surrogate cannot produce consistent and correct gradients
  • RELAX learns to balance REINFORCE variance and reparameterization variance

SLIDE 21

A More Interesting Application

  • Discrete VAE
  • Latent state is 200 Bernoulli variables
  • Discrete sampling makes the reparameterization estimator unusable

log p(x) ≥ L(θ) = E_{q(b|x)}[log p(x|b) + log p(b) − log q(b|x)]

c_φ(z) = f(σ_λ(z)) + r_ρ(z)
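A sketch of this surrogate (sizes and the `elbo` callable are placeholders, not the authors' code): the ELBO f evaluated at a tempered sigmoid of the relaxed latents, plus a free-form residual network.

```python
import torch
import torch.nn as nn

class RelaxSurrogate(nn.Module):
    def __init__(self, elbo, n_latents=200):
        super().__init__()
        self.elbo = elbo                               # f: relaxed b in (0,1)^200 -> ELBO
        self.log_temp = nn.Parameter(torch.zeros(()))  # lambda, a learned temperature
        self.r_rho = nn.Sequential(nn.Linear(n_latents, 200), nn.ReLU(),
                                   nn.Linear(200, 1))

    def forward(self, z):                              # z: (batch, n_latents) relaxed sample
        b_soft = torch.sigmoid(z / self.log_temp.exp())    # sigma_lambda(z)
        return self.elbo(b_soft) + self.r_rho(z).squeeze(-1)
```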

SLIDE 22

Results

SLIDE 23

Reinforcement Learning

  • Policy gradient methods are very popular today (A2C, A3C, ACKTR)
  • Seeks to find argmax_θ E_{τ∼π(τ|θ)}[R(τ)]
  • Does this by estimating ∂/∂θ E_{τ∼π(τ|θ)}[R(τ)]
  • R is not known, so many popular estimators cannot be used

SLIDE 24

Actor Critic

  • c_φ(s_t) is an estimate of the value function
  • This is exactly the REINFORCE estimator, using an estimate of the value function as a control variate
  • Why not use the action in the control variate?
  • Dependence on the action would add bias

ĝ_AC = Σ_{t=1}^{T} ∂ log π(a_t|s_t, θ)/∂θ [ Σ_{t'=t}^{T} r_{t'} − c_φ(s_t) ]
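A sketch of this estimator for one recorded trajectory (names are mine; `log_probs` holds log π(a_t|s_t, θ), `values` holds c_φ(s_t), both length-T tensors). Minimizing the returned loss follows ĝ_AC.

```python
import torch

def actor_critic_loss(log_probs, rewards, values):
    # reward-to-go at each t: sum_{t'=t}^{T} r_{t'}
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    advantage = (returns - values).detach()   # state-only baseline: adds no bias
    return -(log_probs * advantage).sum()     # its gradient is -g_AC
```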

SLIDE 25

LAX for RL

  • Allows for action dependence in the control variate
  • Remains unbiased
  • A similar extension is available for discrete action spaces

ĝ_LAX = Σ_{t=1}^{T} { ∂ log π(a_t|s_t, θ)/∂θ [ Σ_{t'=t}^{T} r_{t'} − c_φ(s_t, a_t) ] + ∂/∂θ c_φ(s_t, a_t) }
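A sketch of the RL variant (my restatement, not the authors' code). Here `c_detached` is c_φ(s_t, a_t) evaluated with a_t treated as a constant, and `c_pathwise` is the same quantity with a_t reparameterized (e.g. a_t = μ_θ(s_t) + σ·ε) so that gradients reach θ; the extra pathwise term is what restores unbiasedness.

```python
import torch

def lax_rl_loss(log_probs, rewards, c_detached, c_pathwise):
    # reward-to-go at each t: sum_{t'=t}^{T} r_{t'}
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    score_term = log_probs * (returns - c_detached).detach()
    return -(score_term + c_pathwise).sum()   # its gradient is -g_LAX
```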

SLIDE 26

Results

  • Improved performance
  • Lower variance gradient estimates
SLIDE 27

Future Work

  • What does the optimal surrogate look like?
  • Many possible variations of LAX and RELAX
  • Which provides the best tradeoff between variance, ease of implementation, scope of application, and performance?
  • RL
  • Incorporate other variance-reduction techniques (GAE, reward bootstrapping, trust regions)
  • Ways to train the surrogate off-policy
  • Applications
  • Inference of graph structure (coming soon)
  • Inference of discrete neural network architecture components (coming soon)
SLIDE 28

Directions

  • The surrogate can take any form
  • It can rely on global information even if the forward pass only uses local info
  • It can depend on order even if the forward pass is invariant
  • Reparameterization can take many forms; there is ongoing work on reparameterizing through rejection sampling, or distributions on permutations

SLIDE 29

Reparameterizing the Birkhoff Polytope for Variational Permutation Inference

SLIDE 30

Learning Latent Permutations with Gumbel-Sinkhorn Networks

SLIDE 31

Why are we optimizing policies anyways?

  • Next week: Variational optimization