

SLIDE 1

Minimum Stein Discrepancy Estimators

François-Xavier Briol
University of Cambridge & The Alan Turing Institute
ICML Workshop on “Stein’s Method for Machine Learning and Statistics”
15th June 2019

SLIDE 2

Collaborators

  • Alessandro Barp (ICL)
  • Andrew Duncan (ICL)
  • Mark Girolami (U. Cambridge)
  • Lester Mackey (Microsoft)

Barp, A., Briol, F-X., Duncan, A., Girolami, M., Mackey, L. (2019). Minimum Stein Discrepancy Estimators. (preprint available at https://fxbriol.github.io)

SLIDE 3

Statistical Inference for Unnormalised Models

Motivation: Suppose we observe some data {x_1, ..., x_n}. Given a parametric family of distributions {P_θ : θ ∈ Θ} with densities denoted p_θ, we seek the θ* ∈ Θ which best approximates the empirical distribution

    Q_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}.

Challenge: For complex models, we often only have access to the likelihood in unnormalised form:

    p_\theta(x) = \frac{\tilde{p}_\theta(x)}{C},

where C > 0 is unknown and \tilde{p}_\theta can be evaluated pointwise. Examples include models of natural images, large graphical models, deep energy models, etc.
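
For concreteness, a deep energy model is a typical instance: \tilde{p}_\theta(x) = exp(−E_θ(x)) for some parametrised energy E_θ, which can be evaluated pointwise even though C = ∫ exp(−E_θ(x)) dx is intractable. A minimal sketch, where the toy quadratic energy is a hypothetical stand-in for, e.g., a neural network:

```python
import numpy as np

def energy(x, theta):
    # Hypothetical energy E_theta(x); in practice this could be a deep network.
    return 0.5 * np.sum(theta * x ** 2)

def unnormalised_density(x, theta):
    # p~_theta(x) = exp(-E_theta(x)); the normalising constant C is unknown,
    # so only ratios and log-gradients of p_theta are accessible.
    return np.exp(-energy(x, theta))
```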


SLIDE 6

Minimum Discrepancy Estimators

Let D be a function such that D(Q || P_θ) ≥ 0 measures the discrepancy between a distribution Q and P_θ. We say that \hat{\theta}_n ∈ Θ is a minimum discrepancy estimator if

    \hat{\theta}_n \in \mathrm{argmin}_{\theta \in \Theta} \, D(Q_n \| P_\theta).

This includes, but is not limited to:

  1. KL divergence, or other Bregman divergences
  2. Wasserstein distance, or Sinkhorn divergence
  3. Maximum Mean Discrepancy
  4. ...

Question: Which discrepancy should we use for unnormalised models?
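
Schematically, every minimum discrepancy estimator follows the same recipe. A minimal sketch (the names are illustrative, and the grid search stands in for the gradient-based optimisation used in practice):

```python
import numpy as np

def minimum_discrepancy_estimate(discrepancy, data, theta_grid):
    # discrepancy(data, theta) should return an estimate of D(Q_n || P_theta);
    # the estimator is simply the minimiser over the parameter space.
    values = [discrepancy(data, theta) for theta in theta_grid]
    return theta_grid[int(np.argmin(values))]
```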


SLIDE 9

Score Matching Estimators

The score matching estimator [Hyvärinen, 2006] is based on the Fisher divergence:

    SM(Q \| P_\theta) := \int_{\mathcal{X}} \| \nabla \log q(x) - \nabla \log p_\theta(x) \|_2^2 \, Q(dx)
                       = \int_{\mathcal{X}} \big( \| \nabla \log p_\theta(x) \|_2^2 + 2 \Delta \log p_\theta(x) \big) \, Q(dx) + Z,

where Z ∈ ℝ is independent of θ. This is one of the most competitive methods to date, with applications to inference in natural images, deep energy models and directional statistics.

Several Failure Modes: This approach requires second-order derivatives and struggles with heavy-tailed data [Swersky, 2011].
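
To make the second form concrete, here is a minimal sketch for the (hypothetical) isotropic Gaussian location model p_θ = N(θ, I), where ∇_x log p_θ(x) = θ − x and Δ_x log p_θ(x) = −d, so the objective never touches the normalising constant:

```python
import numpy as np

def score_matching_objective(X, theta):
    # Empirical version of the second form of SM(Q || P_theta), up to the
    # theta-independent constant Z: mean of ||grad log p||_2^2 + 2 * Laplacian(log p).
    n, d = X.shape
    score = theta - X                                    # grad_x log p_theta(x_i), shape (n, d)
    return np.mean(np.sum(score ** 2, axis=1)) - 2 * d   # Laplacian is -d for N(theta, I)

# Minimising over theta recovers the sample mean (the MLE in this model):
X = np.random.default_rng(0).normal(loc=1.5, size=(500, 2))
grid = np.linspace(0.0, 3.0, 61)
best = min(grid, key=lambda t: score_matching_objective(X, np.full(2, t)))
print(best)  # close to 1.5
```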


SLIDE 12

Minimum Stein Discrepancy Estimators

Let Γ(𝒴) := {f : 𝒳 → 𝒴}. A function class 𝒢 ⊂ Γ(ℝ^d) is a Stein class, with corresponding Stein operator S_{P_θ} : 𝒢 ⊂ Γ(ℝ^d) → Γ(ℝ^d), if

    \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[f] \, dP_\theta = 0 \quad \forall f \in \mathcal{G}.

This leads to the notion of Stein discrepancy (SD) [Gorham, 2015]:

    SD_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta)
      := \sup_{f \in \mathcal{S}_{P_\theta}[\mathcal{G}]} \Big| \int_{\mathcal{X}} f \, dP_\theta - \int_{\mathcal{X}} f \, dQ \Big|
       = \sup_{g \in \mathcal{G}} \Big| \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[g] \, dQ \Big|,

on which we base our minimum Stein discrepancy estimators:

    \hat{\theta}_n \in \mathrm{argmin}_{\theta \in \Theta} \, SD_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q_n \| P_\theta).
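
The canonical example is the Langevin Stein operator: for sufficiently regular g with p_θ g vanishing at the boundary of 𝒳, the Stein identity follows from a single application of the divergence theorem (a standard computation, sketched here):

    \mathcal{S}_{P_\theta}[g](x)
      := \nabla \log p_\theta(x) \cdot g(x) + \nabla \cdot g(x)
       = \tfrac{1}{p_\theta(x)} \nabla \cdot \big( p_\theta(x) \, g(x) \big),
    \qquad
    \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[g] \, \mathrm{d}P_\theta
      = \int_{\mathcal{X}} \nabla \cdot ( p_\theta \, g ) \, \mathrm{d}x = 0.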


SLIDE 14

Score Matching Estimators are Minimum Stein Discrepancy Estimators

Consider the Stein operator

    \mathcal{S}^{m}_{p}[g] := \frac{1}{p_\theta} \nabla \cdot (p_\theta g)

and the Stein class

    \mathcal{G} = \big\{ g = (g_1, \ldots, g_d) \in C^1(\mathcal{X}, \mathbb{R}^d) \cap L^2(\mathcal{X}; Q) : \| g \|_{L^2(\mathcal{X}; Q)} \leq 1 \big\}.

In this case, the Stein discrepancy is the score matching divergence: SD_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta) = SM(Q \| P_\theta). Our paper also shows that several other popular estimators for unnormalised models, including contrastive divergence and minimum probability flow, are minimum SD estimators.
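
To see why this operator recovers score matching: integrating by parts under Q (assuming q g vanishes at the boundary),

    \int_{\mathcal{X}} \mathcal{S}^{m}_{p}[g] \, \mathrm{d}Q
      = \int_{\mathcal{X}} \big( \nabla \log p_\theta \cdot g + \nabla \cdot g \big) \, \mathrm{d}Q
      = \int_{\mathcal{X}} \big( \nabla \log p_\theta - \nabla \log q \big) \cdot g \, \mathrm{d}Q,

and taking the supremum over the unit ball of L²(𝒳; Q) gives ‖∇ log p_θ − ∇ log q‖_{L²(𝒳;Q)}, whose square is exactly the Fisher divergence SM(Q ‖ P_θ).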


SLIDE 17

Minimum Diffusion Kernel Stein Discrepancy Estimators

More general Stein operators were considered in [Gorham, 2016]:

    \mathcal{S}^{m}_{p_\theta}[g] := \frac{1}{p_\theta} \nabla \cdot (p_\theta m g), \qquad
    \mathcal{S}^{m}_{p_\theta}[A] := \frac{1}{p_\theta} \nabla \cdot (p_\theta m A),

where g ∈ Γ(ℝ^d), A ∈ Γ(ℝ^{d×d}), and m ∈ Γ(ℝ^{d×d}). Taking 𝒢 to be the unit ball of a vector-valued RKHS ℋ_K, we get a diffusion kernel Stein discrepancy, which generalises the KSD:

    DKSD_{K,m}(Q \| P)^2 = \int_{\mathcal{X}} \int_{\mathcal{X}} k^0(x, y) \, dQ(x) \, dQ(y),

where (the superscripts 1, 2 indicating action on the first and second arguments of K)

    k^0(x, y) := \mathcal{S}^{m,2}_{p} \mathcal{S}^{m,1}_{p} K(x, y)
               = \frac{1}{p(x) p(y)} \nabla_y \cdot \nabla_x \cdot \big( p(x) \, m(x) K(x, y) m(y)^\top p(y) \big).

SLIDE 18

Diffusion Kernel Stein Discrepancy Estimators

We therefore end up with the following estimators:

    \hat{\theta}^{DKSD}_n \in \mathrm{argmin}_{\theta \in \Theta} \, DKSD_{K,m}(Q_n \| P_\theta)^2,
    \quad \text{where} \quad
    DKSD_{K,m}(Q_n \| P_\theta)^2 = \frac{2}{n(n-1)} \sum_{1 \leq i < j \leq n} k^0(x_i, x_j).

Proposition (DKSD as statistical divergence): Suppose K is IPD and in the Stein class of Q, and m(x) is invertible. If ∇ log p − ∇ log q ∈ L¹(Q), then DKSD_{K,m}(Q ‖ P)² = 0 iff Q = P.

Proposition (IPD matrix kernels): (i) Let K = diag(k_1, ..., k_d). Then K is IPD iff each kernel k_i is IPD. (ii) Let K = Bk for B symmetric positive definite. Then K is IPD iff k is IPD.
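
As a minimal sketch of the U-statistic in the simplest special case m(x) = I and K = k I_d, with a scalar Gaussian RBF kernel k, k⁰ reduces to the usual Langevin KSD Stein kernel (the kernel and bandwidth choices below are illustrative):

```python
import numpy as np

def stein_kernel(x, y, score, sigma=1.0):
    # Langevin Stein kernel k0(x, y) for the RBF base kernel k(x, y) = exp(-||x-y||^2 / (2 sigma^2)):
    # k0 = k * [ s(x).s(y) + (x-y).(s(x)-s(y))/sigma^2 + d/sigma^2 - ||x-y||^2/sigma^4 ],
    # where s = grad log p.
    d = x.size
    diff = x - y
    k = np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))
    sx, sy = score(x), score(y)
    return k * (np.dot(sx, sy)
                + np.dot(diff, sx - sy) / sigma ** 2
                + d / sigma ** 2
                - np.dot(diff, diff) / sigma ** 4)

def ksd_squared_ustat(X, score, sigma=1.0):
    # U-statistic: 2 / (n(n-1)) * sum_{i < j} k0(x_i, x_j)
    n = X.shape[0]
    total = sum(stein_kernel(X[i], X[j], score, sigma)
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))

# Example: data from a standard Gaussian, whose score is grad log p(x) = -x;
# the statistic concentrates near zero because Q = P here.
X = np.random.default_rng(1).normal(size=(200, 2))
print(ksd_squared_ustat(X, score=lambda x: -x))
```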


SLIDE 21

Consistency & Asymptotic Normality

Theorem (Consistency and Asymptotic Normality of DKSD): Under smoothness and integrability conditions on K, m and θ ↦ P_θ and their derivatives, \hat{\theta}^{DKSD}_n converges to θ* a.s. Furthermore,

    \sqrt{n} \big( \hat{\theta}^{DKSD}_n - \theta^* \big) \to \mathcal{N}\big( 0, \, g^{-1}_{DKSD}(\theta^*) \, \Sigma \, g^{-1}_{DKSD}(\theta^*) \big),

where

    \Sigma = \int_{\mathcal{X}} \Big( \int_{\mathcal{X}} \nabla_{\theta^*} k^0(x, y) \, dQ(y) \Big) \Big( \int_{\mathcal{X}} \nabla_{\theta^*} k^0(x, z) \, dQ(z) \Big)^\top dQ(x)

and

    g_{DKSD}(\theta)_{ij} = \int_{\mathcal{X}} \int_{\mathcal{X}} \big( \nabla_x \partial_{\theta_j} \log p_\theta \big)^\top m_\theta(x) K(x, y) m_\theta(y)^\top \big( \nabla_y \partial_{\theta_i} \log p_\theta \big) \, dP_\theta(x) \, dP_\theta(y).

Important Remark: The choice of kernel K and diffusion matrix m will have a significant impact on the performance of these estimators!


SLIDE 23

Robustness of DKSD

The influence function describes infinitesimal corruption of the data and is given by IF(z, Q) := ∂_t θ_{Q_t} |_{t=0} when it exists, where Q_t = (1 − t)Q + tδ_z for t ∈ [0, 1]. An estimator is said to be bias robust if IF(z, Q) is bounded in z.

Proposition (Robustness of DKSD estimators): The influence function of DKSD is given by

    IF(z, P_\theta) = g_{DKSD}(\theta)^{-1} \int_{\mathcal{X}} \nabla_\theta k^0(z, y) \, dP_\theta(y).

In particular, there are various conditions on m and K which can guarantee that sup_{z ∈ 𝒳} IF(z, P_θ) < ∞.

Important Remark: Once again, carefully choosing K and m can lead to good robustness properties.
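
The definition also suggests a direct numerical check: re-weight the sample so that mass t sits on a contamination point z, and difference the resulting estimates. A minimal sketch (the weighted sample mean is purely a hypothetical stand-in estimator; its influence function z − E_Q[X] is unbounded in z, so it is not bias robust):

```python
import numpy as np

def influence_function(estimator, X, z, t=1e-3):
    # Finite-difference approximation of IF(z, Q_n) = d/dt theta_{(1-t) Q_n + t delta_z} |_{t=0}.
    n = len(X)
    theta_q = estimator(X, np.full(n, 1.0 / n))        # estimate under Q_n
    Xz = np.vstack([X, z])                             # append the contamination point z
    w = np.append(np.full(n, (1.0 - t) / n), t)        # weights of (1 - t) Q_n + t delta_z
    return (estimator(Xz, w) - theta_q) / t

weighted_mean = lambda X, w: w @ X                     # hypothetical stand-in estimator
X = np.random.default_rng(2).normal(size=(100, 1))
print(influence_function(weighted_mean, X, z=np.array([[50.0]])))  # grows linearly in z
```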


SLIDE 26

Implementation of Minimum DKSD Estimators

In order to implement our DKSD estimators \hat{\theta}^{DKSD}_n ∈ argmin_{θ ∈ Θ} DKSD_{K,m}(Q_n ‖ P_θ)², we propose to use stochastic optimisation. In particular, we can exploit the geometry induced by DKSD to obtain an efficient algorithm akin to stochastic natural gradient descent [Amari, 1998]:

    \hat{\theta}_{t+1} = \hat{\theta}_t - \gamma_t \, g^{-1}_{DKSD}(\theta_t) \, \nabla_{\theta_t} DKSD(Q_n \| P_\theta)^2,

which approximates the gradient flow

    \dot{\theta}(t) = -g^{-1}_{DKSD}(\theta(t)) \, \nabla_\theta DKSD(Q \| P_{\theta(t)})^2,

using U-statistic estimates of the metric tensor and gradient.
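
A minimal sketch of the preconditioned update (the loss and metric are generic callables; finite-difference gradients stand in for the U-statistic estimates of ∇_θ DKSD² and g_DKSD used in the paper):

```python
import numpy as np

def natural_gradient_descent(loss, metric, theta0, n_steps=200, gamma=0.1, eps=1e-5):
    # theta_{t+1} = theta_t - gamma * g(theta_t)^{-1} * grad loss(theta_t)
    theta = np.asarray(theta0, dtype=float)
    for t in range(n_steps):
        grad = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                         for e in np.eye(theta.size)])  # central-difference gradient
        step = np.linalg.solve(metric(theta), grad)     # g^{-1} grad without forming g^{-1}
        theta = theta - gamma * step                    # use a decaying gamma_t with stochastic gradients
    return theta

# Toy usage: quadratic loss with an anisotropic (constant) metric.
theta_hat = natural_gradient_descent(
    loss=lambda th: float(np.sum((th - 3.0) ** 2)),
    metric=lambda th: np.diag([1.0, 4.0]),
    theta0=np.zeros(2),
)
print(theta_hat)  # approaches [3., 3.]
```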

SLIDE 27

Application 1: Models with Rough Densities

Model: p_\theta(x) \propto \big( \| x - \theta_1 \|_2 / \theta_2 \big)^{s - d/2} \, K_{s - d/2}\big( \| x - \theta_1 \|_2 / \theta_2 \big), where K_{s - d/2} is a modified Bessel function of the second kind.

Parameters: (θ*_1, θ*_2) = (0, 1), with s varying.

Number of samples: n = 100.

SLIDE 28

Application 2: Models with Heavy Tails

Model: p_\theta(x) \propto (1/\theta_2) \big( 1 + (1/\nu) \big( \| x - \theta_1 \|_2 / \theta_2 \big)^2 \big)^{-(\nu + 1)/2}.

Diffusion Matrix: m_\theta(x) = 1 + \big( \| x - \theta_1 \|_2 / \theta_2 \big)^2.

Parameters: ν = 5, (θ*_1, θ*_2) = (25, 10).

Number of Samples: left panel, n = 100; right panel, n = 1000.

SLIDE 29

Summary & Conclusions

In this talk, we have:

  • Introduced a class of minimum Stein discrepancy estimators, and focused on a particular subclass called minimum DKSD estimators.
  • Shown that this class includes many popular estimators for unnormalised models, including score matching, contrastive divergence and minimum probability flow.
  • Discussed consistency, a CLT, and robustness of minimum DKSD estimators, and highlighted the importance of the choice of kernel and operator.
  • Demonstrated the advantage of these estimators for rough densities and heavy-tailed distributions.

Take-home message: The flexibility offered by the choice of Stein class and operator allows us to tailor the estimators to the model of interest.

Barp, A., Briol, F-X., Duncan, A., Girolami, M., Mackey, L. (2019). Minimum Stein Discrepancy Estimators. (preprint: https://fxbriol.github.io)
