SLIDE 1

Learning Fast-Mixing Models for Structured Prediction

Jacob Steinhardt Percy Liang

Stanford University

{jsteinhardt,pliang}@cs.stanford.edu

July 8, 2015

  • J. Steinhardt & P. Liang (Stanford)

Fast-Mixing Models July 8, 2015 1 / 11

SLIDE 6

Structured Prediction Task

[Figure: layout of the keys a to z]

x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a

Goal: fit maximum likelihood model $p_\theta(z \mid x)$. Two routes:

  • Use simple model $u$, exact inference
  • Use expressive model, Gibbs sampling (transition kernel $A$)

Can we get the best of both worlds?
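
The first route relies on dynamic programming. As a minimal sketch (not the authors' code), assume the simple model is a chain with hypothetical per-position scores `unary[t, k]` and transition scores `pairwise[i, j]`. The forward recursion below computes the log-normalizer exactly and is compared against brute-force enumeration on toy sizes.

```python
import itertools
import numpy as np

def log_partition_dp(unary, pairwise):
    """Forward recursion: log of the sum over all label sequences of exp(score)."""
    alpha = unary[0].copy()  # log-scores of all length-1 prefixes
    for t in range(1, len(unary)):
        # alpha_new[j] = logsumexp_i(alpha[i] + pairwise[i, j]) + unary[t, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + pairwise, axis=0) + unary[t]
    return np.logaddexp.reduce(alpha)

def log_partition_brute(unary, pairwise):
    """Same quantity by enumerating every sequence (only feasible for toy sizes)."""
    length, k = unary.shape
    scores = []
    for z in itertools.product(range(k), repeat=length):
        s = sum(unary[t, z[t]] for t in range(length))
        s += sum(pairwise[z[t - 1], z[t]] for t in range(1, length))
        scores.append(s)
    return np.logaddexp.reduce(np.array(scores))

rng = np.random.default_rng(0)
unary = rng.normal(size=(5, 3))     # 5 positions, 3 labels (hypothetical sizes)
pairwise = rng.normal(size=(3, 3))
log_z_dp = log_partition_dp(unary, pairwise)
log_z_brute = log_partition_brute(unary, pairwise)
print(log_z_dp, log_z_brute)
```

The recursion costs O(L K^2) for L positions and K labels, whereas enumeration costs O(K^L); that gap is what makes exact inference practical only for simple models.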

SLIDE 19

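Strong Doeblin Chains

Definition (Doeblin, 1940)

A chain $\tilde{A}$ is strong Doeblin with parameter $\epsilon$ if

$$\tilde{A}(z_t \mid z_{t-1}) = \epsilon\, u(z_t) + (1-\epsilon)\, A(z_t \mid z_{t-1})$$

for some $u$, $A$.

[Figure: a chain whose successive transitions are drawn as u, A, A, A, u, A, u, A, A, ···]

All Doeblin chains mix quickly:

Proposition

If $\tilde{A}$ is $\epsilon$ strong Doeblin, then its mixing time is at most $\frac{1}{\epsilon}$.

Moreover, the stationary distribution is $A^T u$, where $T \sim \mathrm{Geometric}(\epsilon)$.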
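
The definition and proposition can be checked numerically on a small finite chain. The sketch below is illustrative only: it assumes column-stochastic matrices (column j is the distribution of $z_t$ given $z_{t-1} = j$), random toy values for $u$ and $A$, and a geometric variable counted from 0, so that the stationary distribution is $\epsilon \sum_{t \ge 0} (1-\epsilon)^t A^t u$. The function names are hypothetical.

```python
import numpy as np

def doeblin_kernel(u, A, eps):
    """Return eps * u 1^T + (1 - eps) * A; every column stays a distribution."""
    return eps * np.outer(u, np.ones(len(u))) + (1.0 - eps) * A

def doeblin_stationary(u, A, eps):
    """eps * sum_{t>=0} (1 - eps)^t A^t u, computed as eps * (I - (1 - eps) A)^{-1} u."""
    return eps * np.linalg.solve(np.eye(len(u)) - (1.0 - eps) * A, u)

def tv(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
n, eps = 5, 0.2
u = rng.dirichlet(np.ones(n))
A = rng.dirichlet(np.ones(n), size=n).T  # transpose so that columns sum to 1
A_tilde = doeblin_kernel(u, A, eps)
pi_tilde = doeblin_stationary(u, A, eps)

# Start from a point mass and track the distance to stationarity.
p = np.zeros(n)
p[0] = 1.0
gaps = []
for k in range(1, 11):
    p = A_tilde @ p
    gaps.append(tv(p, pi_tilde))
    print(k, gaps[-1], (1.0 - eps) ** k)
```

Each step shrinks the total variation distance by at least a factor of $1-\epsilon$, which is one way to see why on the order of $1/\epsilon$ steps suffice.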

SLIDE 28

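A Strong Doeblin Family

Let $\theta$ parameterize a distribution $u_\theta$ and transition matrix $A_\theta$.

$$\tilde{A}_\theta = \epsilon\, u_\theta + (1-\epsilon)\, A_\theta$$

Here $\tilde{\pi}_\theta$ is the stationary distribution of $\tilde{A}_\theta$ and $\pi_\theta$ is the stationary distribution of $A_\theta$.

Three model families:

  • $F_0 = \{u_\theta\}_{\theta \in \Theta}$
  • $F = \{\pi_\theta\}_{\theta \in \Theta}$
  • $\tilde{F} = \{\tilde{\pi}_\theta\}_{\theta \in \Theta}$

$\tilde{F}$ parameterizes computationally tractable distributions!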
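
To make the relationship between the three families concrete, the following sketch compares $u$, $\pi$ and $\tilde{\pi}$ on a toy finite chain for several values of $\epsilon$. It assumes column-stochastic matrices and random toy parameters; `stationary_of` and `doeblin_stationary` are hypothetical helper names, not part of any released code.

```python
import numpy as np

def stationary_of(A):
    """Stationary distribution of a column-stochastic matrix (eigenvalue-1 eigenvector)."""
    w, v = np.linalg.eig(A)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def doeblin_stationary(u, A, eps):
    """Stationary distribution of eps * u + (1 - eps) * A in closed form."""
    return eps * np.linalg.solve(np.eye(len(u)) - (1.0 - eps) * A, u)

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(1)
n = 4
u = rng.dirichlet(np.ones(n))
A = rng.dirichlet(np.ones(n), size=n).T  # columns sum to 1
pi = stationary_of(A)

for eps in (1.0, 0.5, 0.1, 1e-4):
    pi_tilde = doeblin_stationary(u, A, eps)
    print(f"eps={eps:g}  TV to u: {tv(pi_tilde, u):.4f}  TV to pi: {tv(pi_tilde, pi):.4f}")
```

At $\epsilon = 1$ the mixture equals $u$, and as $\epsilon$ shrinks it approaches $\pi$, so in this toy setting $\tilde{F}$ sits between $F_0$ and $F$.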

SLIDE 31

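Strategy

  • Parameterize strong Doeblin distributions $\tilde{\pi}_\theta$
  • Maximize log-likelihood: $L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \tilde{\pi}_\theta(z^{(i)})$

Issue: hard to compute $\nabla L(\theta)$

Insight: interpret Markov chain as latent variable model:

$$p_\theta:\quad z_1 \xrightarrow{A_\theta} z_2 \xrightarrow{A_\theta} \cdots \xrightarrow{A_\theta} z_T$$

Observe: $\tilde{\pi}_\theta(z) = p_\theta(z_T = z)$, $T \sim \mathrm{Geometric}(\epsilon)$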
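
The latent variable view also gives a direct sampler. The sketch below is an illustration under stated assumptions: a finite state space, column-stochastic $A$, $z_1 \sim u$, and $T$ geometric on {1, 2, ...} so that the chain takes $T - 1$ steps of $A$. The name `sample_doeblin` and the toy parameters are hypothetical.

```python
import numpy as np

def sample_doeblin(u, A, eps, rng):
    """Draw T ~ Geometric(eps), z_1 ~ u, apply A for T - 1 steps, return z_T."""
    T = rng.geometric(eps)  # NumPy's geometric has support {1, 2, ...}
    z = rng.choice(len(u), p=u)
    for _ in range(T - 1):
        z = rng.choice(len(u), p=A[:, z])
    return z

rng = np.random.default_rng(0)
n, eps = 3, 0.3
u = rng.dirichlet(np.ones(n))
A = rng.dirichlet(np.ones(n), size=n).T  # columns sum to 1

samples = np.array([sample_doeblin(u, A, eps, rng) for _ in range(10000)])
empirical = np.bincount(samples, minlength=n) / len(samples)

# Closed-form stationary distribution of eps * u + (1 - eps) * A for comparison.
exact = eps * np.linalg.solve(np.eye(n) - (1.0 - eps) * A, u)
print(empirical, exact)
```

Each sample needs only about $1/\epsilon$ transitions on average, and the draws are independent rather than correlated states of a single long chain.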

SLIDE 34

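Learning Updates

Recall latent variable model:

$$p_\theta:\quad z_1 \xrightarrow{A_\theta} z_2 \xrightarrow{A_\theta} \cdots \xrightarrow{A_\theta} z_T$$

Lemma

For any fixed $z$,

$$\frac{\partial \log p_\theta(z_T = z)}{\partial \theta} = \mathbb{E}_{z_{1:T-1} \sim p_\theta(\cdot \mid z_T = z)}\left[\frac{\partial \log p_\theta(z_{1:T})}{\partial \theta}\right].$$

Upshot: just need to sample trajectories that end at $z$.

$\Longrightarrow$ importance sampling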
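
The lemma is an instance of a standard identity for latent variable models, and it can be verified by brute force on a tiny chain. The sketch below assumes a fixed length T, a hypothetical softmax parameterization of $u_\theta$ and $A_\theta$, and finite-difference gradients on both sides; it enumerates every trajectory rather than using importance sampling.

```python
import itertools
import numpy as np

K, T = 3, 3  # number of states and chain length (toy values)

def unpack(theta):
    """Split a flat vector into u (K,) and a column-stochastic A (K, K) via softmax."""
    logits_u, logits_A = theta[:K], theta[K:].reshape(K, K)
    u = np.exp(logits_u - logits_u.max())
    A = np.exp(logits_A - logits_A.max(axis=0))
    return u / u.sum(), A / A.sum(axis=0)

def log_joint(theta, path):
    """log p_theta(z_1, ..., z_T) for one full trajectory."""
    u, A = unpack(theta)
    lp = np.log(u[path[0]])
    for prev, cur in zip(path[:-1], path[1:]):
        lp += np.log(A[cur, prev])
    return lp

def log_marginal(theta, z):
    """log p_theta(z_T = z), summing over all trajectories that end at z."""
    prefixes = itertools.product(range(K), repeat=T - 1)
    return np.log(sum(np.exp(log_joint(theta, p + (z,))) for p in prefixes))

def num_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of a scalar function."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.random.default_rng(0).normal(size=K + K * K)
z = 1

# Left-hand side: gradient of the log marginal.
lhs = num_grad(lambda th: log_marginal(th, z), theta)

# Right-hand side: posterior-weighted average of joint log-likelihood gradients.
rhs = np.zeros_like(theta)
for prefix in itertools.product(range(K), repeat=T - 1):
    path = prefix + (z,)
    posterior = np.exp(log_joint(theta, path) - log_marginal(theta, z))
    rhs += posterior * num_grad(lambda th: log_joint(th, path), theta)

print(np.max(np.abs(lhs - rhs)))
```

In practice the sum over trajectories is intractable, which is why the expectation is estimated from sampled trajectories that end at $z$.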

SLIDE 36

Experiments: Task

Task from before:

[Figure: layout of the keys a to z]

x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a

Note $y$ is a deterministic function $y = f(z)$.

Goal: learn model $p(z \mid x)$ that maximizes

$$p(y \mid x) = \sum_{z \in f^{-1}(y)} p(z \mid x)$$
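
A small sketch of the map $f$ and of the marginal likelihood follows. The collapse rule is an assumption read off the example above (drop `#`, and merge a hyphen-joined run into one character); the toy distribution `p_z` is made up for illustration.

```python
def f(z_tokens):
    """Collapse an alignment into its output string (assumed convention)."""
    return "".join(tok.split("-")[0] for tok in z_tokens if tok != "#")

def marginal(p_z, y):
    """p(y | x): sum of p(z | x) over every alignment z with f(z) == y."""
    return sum(prob for z, prob in p_z.items() if f(z) == y)

# The alignment from the slide collapses to the intended word.
z_example = tuple("b # # a # # n-n-n # a-a # # n-n a".split())
y_example = f(z_example)
print(y_example)

# Hypothetical distribution over three alignments of a length-3 input.
p_z = {
    ("a", "#", "b"): 0.5,
    ("a-a", "b"): 0.2,
    ("b", "#", "a"): 0.3,
}
print(marginal(p_z, "ab"))
```

Because many alignments map to the same $y$, the objective sums over the whole preimage $f^{-1}(y)$ rather than scoring a single $z$.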

SLIDE 42

Experiments: Setup

Models:

  • $u(z \mid x)$ (bigram, DP) [Figure: graphical model over z1, z2, z3, z4]
  • $A(z_t \mid z_{t-1}, x)$ (dictionary, Gibbs) [Figure: graphical model over z1, z2, z3, z4]

Comparisons:

  • basic: Gibbs sampling ($A$) (compute gradients assuming exact inference)
  • $u_\theta$-Gibbs: Gibbs with random restarts from $u$
  • Doeblin: our method
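
One way to read "Gibbs with random restarts from $u$" is as a simulation of the kernel $\epsilon u + (1-\epsilon) A$: at each step, restart from $u$ with probability $\epsilon$, otherwise take an ordinary transition. The sketch below illustrates only that sampling scheme on a toy finite chain, not the experimental setup or the gradient computation; all names and values are hypothetical.

```python
import numpy as np

def run_with_restarts(u, A, eps, steps, rng):
    """Simulate eps * u + (1 - eps) * A and return the state visit frequencies."""
    n = len(u)
    counts = np.zeros(n)
    z = rng.choice(n, p=u)
    for _ in range(steps):
        if rng.random() < eps:
            z = rng.choice(n, p=u)        # restart from the tractable model u
        else:
            z = rng.choice(n, p=A[:, z])  # ordinary transition, e.g. one Gibbs move
        counts[z] += 1
    return counts / steps

rng = np.random.default_rng(0)
n, eps = 3, 0.2
u = rng.dirichlet(np.ones(n))
A = rng.dirichlet(np.ones(n), size=n).T  # columns sum to 1

freq = run_with_restarts(u, A, eps, steps=20000, rng=rng)
exact = eps * np.linalg.solve(np.eye(n) - (1.0 - eps) * A, u)
print(freq, exact)
```

The long-run frequencies approach the stationary distribution of the restart kernel, not that of $A$ alone, which is why the restart probability has to be accounted for during learning.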

SLIDE 44

Experiments: Results

[Figure: Swipe Typing (Test Accuracy). Horizontal axis: training passes (5 to 15). Vertical axis: accuracy (0.00 to 0.25). Curves: Doeblin, $u_\theta$-Gibbs, basic.]

SLIDE 46

Discussion

Summary:

  • Strong Doeblin property enables fast mixing
  • Interpolates between tractability and expressivity
  • Provides better learning updates at training time

Also in paper:

  • Theoretical analysis of strong Doeblin family
  • Multi-stage Doeblin chains

SLIDE 49

Discussion

Related work:

  • Policy gradient (Sutton et al., 1999)
  • Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al., 2011; Huang et al., 2012)
  • Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran & Tweedie, 1998)

Future work:

  • Explore other tractable families
  • Learn multi-stage chains

Reproducible experiments on CodaLab: codalab.org/worksheets
