Understanding and Accelerating Particle-Based Variational Inference (PowerPoint PPT Presentation)

SLIDE 1

Understanding and Accelerating Particle-Based Variational Inference

Chang Liu†, Jingwei Zhuo†, Pengyu Cheng‡, Ruiyi Zhang‡, Jun Zhu†§, Lawrence Carin‡§

ICML 2019

† Tsinghua University   ‡ Duke University   § Corresponding authors

SLIDE 2

Introduction

Particle-based Variational Inference Methods (ParVIs):

  • Represent the variational distribution $q$ by particles; update the particles to minimize $\mathrm{KL}_p(q)$.

  • More flexible than classical VIs; more particle-efficient than MCMCs.

Related Work:

  • Stein Variational Gradient Descent (SVGD) [3] simulates the gradient flow (steepest descending curves) of $\mathrm{KL}_p$ on $\mathcal{P}_{\mathcal{H}}(\mathcal{X})$ [2] (a minimal update sketch follows after this list).

  • The Blob and DGF methods [1] simulate the gradient flow of $\mathrm{KL}_p$ on the Wasserstein space $\mathcal{P}_2(\mathcal{X})$.
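Below is a minimal SVGD update sketch in numpy (our own illustration, not the authors' code); the function name `svgd_velocity`, the RBF bandwidth, and the toy standard-normal target are our choices.

```python
import numpy as np

def svgd_velocity(X, grad_log_p, h=1.0):
    """SVGD velocity: v(x_i) = (1/N) sum_j [ K(x_j, x_i) grad log p(x_j) + grad_{x_j} K(x_j, x_i) ],
    with an RBF kernel K(x, y) = exp(-||x - y||^2 / (2h))."""
    diff = X[None, :, :] - X[:, None, :]          # diff[j, i] = x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h))
    grad_K = K[..., None] * diff / h              # grad_K[j, i] = d/dx_j K(x_j, x_i)
    scores = grad_log_p(X)                        # (N, D) scores at the particles
    return (K.T @ scores + grad_K.sum(axis=0)) / X.shape[0]

# Usage on a toy standard-normal target: particles drift toward N(0, I).
X = np.random.randn(50, 2) + 5.0
for _ in range(500):
    X = X + 0.1 * svgd_velocity(X, lambda Z: -Z)
```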


SLIDE 3

ParVIs Approximate P2(X) (Wasserstein) Gradient Flow

Remark 1. Existing ParVI methods approximate the Wasserstein gradient flow by smoothing either the density or functions.

Smoothing the density:

  • Blob [1] partially smooths the density:
    $v_{\mathrm{GF}} = -\nabla \frac{\delta}{\delta q}\,\mathbb{E}_q[\log(q/p)] \;\Longrightarrow\; v_{\mathrm{Blob}} = -\nabla \frac{\delta}{\delta q}\,\mathbb{E}_q[\log(\tilde q/p)]$.

  • GFSD fully smooths the density:
    $v_{\mathrm{GF}} := \nabla \log p - \nabla \log q \;\Longrightarrow\; v_{\mathrm{GFSD}} := \nabla \log p - \nabla \log \tilde q$.

Smoothing functions:

  • SVGD restricts the optimization domain from $L^2_q$ to $\mathcal{H}^D$.

  • GFSF smooths functions in a similar way (a matrix-form sketch follows below): $\hat v_{\mathrm{GFSF}} = \hat g + \hat K' \hat K^{-1}$ (note $\hat v_{\mathrm{SVGD}} = \hat v_{\mathrm{GFSF}} \hat K$), where
    $\hat g_{:,i} = \nabla_{x^{(i)}} \log p(x^{(i)})$, $\quad \hat K_{ij} = K(x^{(i)}, x^{(j)})$, $\quad \hat K'_{:,i} = \sum_j \nabla_{x^{(j)}} K(x^{(j)}, x^{(i)})$.
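A minimal numpy sketch of the GFSF velocity from the matrices above (our own code, not the authors'). We store particles as rows rather than columns, so the slide's $\hat g + \hat K' \hat K^{-1}$ becomes $\hat g + \hat K^{-1} \hat K'$ with a symmetric kernel; the ridge term `alpha` is a stabilization choice we add, not something stated on the slide.

```python
import numpy as np

def gfsf_velocity(X, grad_log_p, h=1.0, alpha=1e-3):
    """GFSF velocity with an RBF kernel; rows of X are particles."""
    N = X.shape[0]
    diff = X[None, :, :] - X[:, None, :]                   # diff[j, i] = x_i - x_j
    K_hat = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h))  # K_hat[j, i] = K(x_j, x_i)
    Kp_hat = (K_hat[..., None] * diff / h).sum(axis=0)     # row i: sum_j grad_{x_j} K(x_j, x_i)
    g_hat = grad_log_p(X)                                   # row i: grad log p(x_i)
    v = g_hat + np.linalg.solve(K_hat + alpha * np.eye(N), Kp_hat)
    # The SVGD velocity (up to the 1/N factor) is recovered as K_hat @ v.
    return v
```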


SLIDE 4

ParVIs Approximate P2(X) Gradient Flow by Smoothing

  • Equivalence: each smoothing-function objective has the form $\mathbb{E}_q[\mathcal{L}(v)]$ with $\mathcal{L}: L^2_q \to L^2_q$ linear, hence
    $\mathbb{E}_{\tilde q}[\mathcal{L}(v)] = \mathbb{E}_{q * K}[\mathcal{L}(v)] = \mathbb{E}_q[\mathcal{L}(v) * K] = \mathbb{E}_q[\mathcal{L}(v * K)]$
    (a toy numerical check of this identity follows below).

  • Necessity: $\operatorname{grad} \mathrm{KL}_p(q)$ is undefined at $q = \hat q := \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}$.

Theorem 2 (Necessity of smoothing for SVGD). For $q = \hat q$ and $v \in L^2_p$, the problem
$$\max_{v \in L^2_p,\; \|v\|_{L^2_p} = 1} \langle v_{\mathrm{GF}}, v \rangle_{L^2_{\hat q}}$$
has no optimal solution.

ParVIs rely on the smoothing assumption! No free lunch!
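As a sanity check of the middle identity $\mathbb{E}_{\tilde q}[f] = \mathbb{E}_q[f * K]$ for an empirical $q$ and a Gaussian kernel, here is a toy numerical verification (our own illustration; the test function $f(t) = t^2$, the particles, and the bandwidth are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)        # particles defining the empirical q
h = 0.3                         # Gaussian kernel variance

# Left side: E_{q~}[f] by Monte Carlo; sampling from q~ = q * K means
# picking a particle and adding kernel noise.
samples = x[rng.integers(0, x.size, 200_000)] + np.sqrt(h) * rng.normal(size=200_000)
lhs = (samples ** 2).mean()

# Right side: E_q[f * K]; for f(t) = t^2 and a Gaussian kernel,
# (f * K)(y) = y^2 + h in closed form.
rhs = (x ** 2 + h).mean()

print(lhs, rhs)                 # the two agree up to Monte Carlo error
```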


SLIDE 5

Bandwidth Selection via the Heat Equation

Note. Under the dynamics $\mathrm{d}x = -\nabla \log q_t(x)\,\mathrm{d}t$, the density $q_t$ evolves according to the heat equation (HE): $\partial_t q_t(x) = \Delta q_t(x)$. This yields a principled bandwidth-selection criterion (a sketch of one possible implementation follows the figure).

Figure 1: Comparison of the HE-based criterion (bottom row) with the median method (top row) for bandwidth selection, shown for SVGD, Blob, GFSD, and GFSF (columns).
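The slide does not spell out the HE criterion itself; the sketch below is one plausible way to operationalize it in 1-D (entirely our construction: the Gaussian KDE, the grid, the step size, and the least-squares mismatch are all our choices). The idea: move the particles one small step along $-\nabla \log \tilde q_h$, and pick the bandwidth $h$ for which the resulting change of the smoothed density best matches the heat-equation prediction $\partial_t q = \Delta q$.

```python
import numpy as np

def kde(xs, particles, h):
    """Gaussian KDE q~_h, its gradient, and its Laplacian on the grid xs."""
    d = xs[:, None] - particles[None, :]
    phi = np.exp(-d ** 2 / (2 * h)) / np.sqrt(2 * np.pi * h)
    q = phi.mean(axis=1)
    dq = (phi * (-d / h)).mean(axis=1)
    lap_q = (phi * (d ** 2 / h ** 2 - 1.0 / h)).mean(axis=1)
    return q, dq, lap_q

def he_discrepancy(particles, h, dt=1e-3):
    xs = np.linspace(particles.min() - 3.0, particles.max() + 3.0, 400)
    q0, _, lap_q0 = kde(xs, particles, h)
    q_p, dq_p, _ = kde(particles, particles, h)
    moved = particles - dt * dq_p / q_p            # one step of dx = -grad log q~ dt
    q1, _, _ = kde(xs, moved, h)
    # mismatch between the observed density change and the HE prediction
    return np.trapz(((q1 - q0) / dt - lap_q0) ** 2, xs)

def select_bandwidth(particles, candidates):
    return min(candidates, key=lambda h: he_discrepancy(particles, h))

# Usage: pick h from a log-spaced grid for a toy particle set.
h_star = select_bandwidth(np.random.randn(100), np.logspace(-2, 1, 20))
```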


SLIDE 6

Nesterov’s Acceleration Method on Riemannian Manifolds

  • Riemannian Accelerated Gradient (RAG) [4] (with simplification):
    $q_k = \operatorname{Exp}_{r_{k-1}}(\varepsilon v_{k-1})$,
    $r_k = \operatorname{Exp}_{q_k}\!\Big(-\Gamma^{q_k}_{r_{k-1}}\Big[\tfrac{k-1}{k}\operatorname{Exp}^{-1}_{r_{k-1}}(q_{k-1}) - \tfrac{k+\alpha-2}{k}\,\varepsilon v_{k-1}\Big]\Big)$.

  • Riemannian Nesterov's method (RNes) [5] (with simplification):
    $q_k = \operatorname{Exp}_{r_{k-1}}(\varepsilon v_{k-1})$,
    $r_k = \operatorname{Exp}_{q_k}\!\Big(c_1 \operatorname{Exp}^{-1}_{q_k}\Big(\operatorname{Exp}_{r_{k-1}}\big[(1-c_2)\operatorname{Exp}^{-1}_{r_{k-1}}(q_{k-1}) + c_2 \operatorname{Exp}^{-1}_{r_{k-1}}(q_k)\big]\Big)\Big)$.

  • Inverse exponential map: computationally expensive.

Proposition 3 (Inverse exponential map). For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big(\operatorname{Exp}^{-1}_q(r)\big)(x^{(i)}) \approx y^{(i)} - x^{(i)}$.

  • Parallel transport: hard to implement.

Proposition 4 (Parallel transport). For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big(\Gamma^r_q(v)\big)(y^{(i)}) \approx v(x^{(i)})$ for all $v \in T_q\mathcal{P}_2$.
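In code, Propositions 3 and 4 say that on particle representations the expensive Riemannian primitives collapse to plain vector arithmetic (our paraphrase; the exponential map below is the usual particle translation used in Algorithm 1):

```python
import numpy as np

def exp_map(q_particles, tangent):            # Exp_q(v): translate each particle along v
    return q_particles + tangent

def inv_exp_map(q_particles, r_particles):    # Prop. 3: (Exp_q^{-1} r)(x_i) ~= y_i - x_i
    return r_particles - q_particles

def parallel_transport(v_at_q):               # Prop. 4: transported vectors are unchanged
    return v_at_q
```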


SLIDE 7

Acceleration Framework for ParVIs

Algorithm 1 The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov’s method (WNes)

1: WAG: select acceleration factor $\alpha > 3$; WNes: select or calculate $c_1, c_2 \in \mathbb{R}_+$;
2: Initialize $\{x_0^{(i)}\}_{i=1}^N$ distinctly; let $y_0^{(i)} = x_0^{(i)}$;
3: for $k = 1, 2, \dots, k_{\max}$ do
4:   for $i = 1, \dots, N$ do
5:     Find $v(y_{k-1}^{(i)})$ by SVGD/Blob/DGF/GFSD/GFSF;
6:     $x_k^{(i)} = y_{k-1}^{(i)} + \varepsilon\, v(y_{k-1}^{(i)})$;
7:     $y_k^{(i)} = x_k^{(i)} + \begin{cases} \text{WAG:} & \frac{k-1}{k}\,(y_{k-1}^{(i)} - x_{k-1}^{(i)}) + \frac{k+\alpha-2}{k}\,\varepsilon\, v(y_{k-1}^{(i)}) \\ \text{WNes:} & c_1(c_2 - 1)\,(x_k^{(i)} - x_{k-1}^{(i)}) \end{cases}$
8:   end for
9: end for
10: Return $\{x_{k_{\max}}^{(i)}\}_{i=1}^N$.
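A minimal numpy sketch of Algorithm 1 (our own code, not the authors' release), using the SVGD velocity from the earlier sketch as the base ParVI and the flat particle arithmetic of Propositions 3 and 4; the step size, $\alpha$, $c_1$, $c_2$, and the toy target are illustrative choices.

```python
import numpy as np

def svgd_velocity(X, grad_log_p, h=1.0):
    """Base ParVI velocity (SVGD with an RBF kernel), as in the earlier sketch."""
    diff = X[None, :, :] - X[:, None, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h))
    grad_K = K[..., None] * diff / h
    return (K.T @ grad_log_p(X) + grad_K.sum(axis=0)) / X.shape[0]

def accelerated_parvi(grad_log_p, x0, eps=0.1, kmax=500,
                      method="WNes", alpha=4.0, c1=0.9, c2=1.5):
    """Algorithm 1: WAG/WNes acceleration wrapped around a ParVI velocity field."""
    x_prev, y = x0.copy(), x0.copy()
    for k in range(1, kmax + 1):
        v = svgd_velocity(y, grad_log_p)          # line 5: velocity at y_{k-1}
        x = y + eps * v                           # line 6
        if method == "WAG":                       # line 7, WAG branch
            y = x + (k - 1) / k * (y - x_prev) + (k + alpha - 2) / k * eps * v
        else:                                     # line 7, WNes branch
            y = x + c1 * (c2 - 1) * (x - x_prev)
        x_prev = x
    return x_prev                                 # line 10: particles at k_max

# Usage on a toy standard-normal target (illustrative parameters):
particles = accelerated_parvi(lambda Z: -Z, np.random.randn(50, 2) + 4.0)
```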

SLIDE 8

Bayesian Logistic Regression (BLR)

Figure 2: Acceleration effect of WAG and WNes on BLR on the Covertype dataset, measured by prediction accuracy on the test set. Panels: SVGD, Blob, GFSD, and GFSF, each comparing the WGD, PO, WAG, and WNes variants. Each curve is averaged over 10 runs.


SLIDE 9

Latent Dirichlet Allocation (LDA)

Figure 3: Acceleration effect of WAG and WNes on LDA, measured by hold-out perplexity. Panels: SVGD, Blob, GFSD, and GFSF, each comparing the WGD, PO, WAG, and WNes variants. Curves are averaged over 10 runs.

Figure 4: Comparison of SVGD (WGD and WNes) and SGNHT (sequential and parallel) on LDA, as representatives of ParVIs and MCMCs, measured by hold-out perplexity. Averaged over 10 runs.


SLIDE 10

Summary

Contributions (in theory):

  • ParVIs approximate the Wasserstein gradient flow by a compulsory smoothing assumption.

  • ParVIs either smooth the density or smooth functions, and the two are equivalent.

Contributions (in practice):

  • Two new ParVIs (GFSF and GFSD).
  • A principled bandwidth selection method for the smoothing kernel.
  • An acceleration framework for general ParVIs.


SLIDE 11

References

[1] Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.

[2] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118–3126, 2017.

[3] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.

SLIDE 12

References (continued)

[4] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875–4884, 2017.

[5] Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Conference on Learning Theory, pages 1703–1723, 2018.