SLIDE 1

Meta Particle Flow for Sequential Bayesian Inference

Le Song
Associate Professor, CSE
Associate Director, Machine Learning Center
Georgia Institute of Technology

Joint work with Xinshi Chen and Hanjun Dai

SLIDE 2

Bayesian Inference

Infer the posterior distribution of the unknown parameter π’š given

  • Prior distribution 𝜌(𝑦)
  • Likelihood function π‘ž(𝑝|𝑦)
  • Observations 𝑝1, 𝑝2, … , 𝑝𝑛

π‘ž 𝑦 𝑝1:𝑛 = 1 𝑨 𝜌 𝑦

𝑗=1 𝑛

π‘ž(𝑝𝑗|𝑦) 𝑨 = 𝜌 𝑦

𝑗=1 𝑛

π‘ž(𝑝𝑗|𝑦) 𝑒𝑦

A challenging computational problem for high-dimensional $y$

[Graphical model: parameter $y$ generating observations $p_1, p_2, \dots, p_n$]
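As a concrete illustration of the formula above, the posterior and its normalizer $Z$ can be approximated on a dense grid when $y$ is one-dimensional. This is a minimal sketch with an assumed Gaussian prior and likelihood and made-up observations, purely to show the mechanics; it is not part of the talk.

```python
# Grid approximation of q(y | p_1:n) = (1/Z) * rho(y) * prod_j q(p_j | y)
import numpy as np
from scipy.stats import norm

y = np.linspace(-5, 5, 2001)                 # grid over the parameter
rho = norm.pdf(y, loc=0.0, scale=1.0)        # prior rho(y) = N(0, 1), assumed
obs = np.array([0.8, 1.1, 0.5])              # illustrative observations p_1..p_n

log_post = np.log(rho)
for p in obs:
    log_post += norm.logpdf(p, loc=y, scale=1.0)   # + log q(p_j | y)

unnorm = np.exp(log_post - log_post.max())
Z = np.trapz(unnorm, y)                      # grid approximation of the integral
posterior = unnorm / Z                       # q(y | p_1:n) on the grid
print("posterior mean ~", np.trapz(y * posterior, y))
```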

SLIDE 3

Gaussian Mixture Model

  • prior $(y_1, y_2) \sim \rho(y) = \mathcal{N}(0, I)$
  • observations $p \mid y_1, y_2 \sim q(p \mid y_1, y_2) = \tfrac{1}{2}\mathcal{N}(y_1, 1) + \tfrac{1}{2}\mathcal{N}(y_1 + y_2, 1)$

  • With $(y_1, y_2) = (1, -2)$, the resulting posterior will have two modes: $(1, -2)$ and $(-1, 2)$
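A quick numerical way to see the two-mode claim: swapping $(1, -2)$ for $(-1, 2)$ swaps the roles of the two mixture components and leaves the prior term unchanged, so the unnormalized posterior is identical at both points. The sketch below checks this with simulated observations; the data and seed are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def log_unnorm_post(y1, y2, obs):
    log_prior = norm.logpdf(y1) + norm.logpdf(y2)           # N(0, I) prior
    lik = 0.5 * norm.pdf(obs, loc=y1, scale=1.0) \
        + 0.5 * norm.pdf(obs, loc=y1 + y2, scale=1.0)        # mixture likelihood
    return log_prior + np.sum(np.log(lik))

rng = np.random.default_rng(0)
# simulate observations from the model with true parameters (1, -2)
comp = rng.integers(0, 2, size=100)
obs = np.where(comp == 0, rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100))

print(log_unnorm_post(1.0, -2.0, obs))   # the two symmetric modes ...
print(log_unnorm_post(-1.0, 2.0, obs))   # ... have identical posterior value
```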

Challenges in Bayesian Inference

Fitting even a single posterior $q(y \mid p_{1:n})$ is already not easy.

[Results reported by Dai et al. (2016)]

[Figure: posterior density plots. (a) True posterior (b) Stochastic Variational Inference (c) Stochastic Gradient Langevin Dynamics (d) Gibbs Sampling (e) One-pass SMC]

SLIDE 4

Fundamental Principle for Machine Learning

Lots of applications in machine learning

  • Hidden Markov model
  • Topic modeling
  • Uncertainty quantification
  • $p_{t+1} = Q\, p_{t-\upsilon}\, \exp\!\big(-z_{t-\upsilon}/z_0\big)\, f_t + p_t\, \exp(-\varepsilon\, \vartheta_t)$
  • $f_t \sim \Gamma\big(\sigma_q^{-2}, \sigma_q^2\big)$, $\vartheta_t \sim \Gamma\big(\sigma_e^{-2}, \sigma_e^2\big)$
  • parameters $y = \big(Q,\, z_0,\, \sigma_q^2,\, \sigma_e^2,\, \upsilon,\, \varepsilon\big)$

[Figures: a topic-model plate diagram (topic, word) and a state-space model linking true locations $y_1, y_2, \dots, y_n$ to sensor measurements $p_1, p_2, \dots, p_n$]

SLIDE 5

Sequential Bayesian Inference

[Figure: the prior $\rho(y)$ is updated to $q(y \mid p_1)$, $q(y \mid p_{1:2})$, …, $q(y \mid p_{1:n})$ as the observations $p_1, p_2, \dots, p_n$ arrive]

$$\underbrace{q(y \mid p_{1:n})}_{\text{updated posterior}} \;\propto\; \underbrace{q(y \mid p_{1:n-1})}_{\text{current posterior}}\; \underbrace{q(p_n \mid y)}_{\text{likelihood}}$$

A grid-based check of this recursion is sketched at the end of this slide.

Online Bayesian Inference

  • Observations 𝑝1, 𝑝2, … , 𝑝𝑛 arrive sequentially

An ideal algorithm should:

  • Efficiently update $q(y \mid p_{1:n})$ to $q(y \mid p_{1:n+1})$ when $p_{n+1}$ is observed
  • Without storing all historical observations 𝑝1, 𝑝2, … , 𝑝𝑛
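A minimal grid-based sketch of the recursion above: folding in one likelihood at a time reproduces the batch posterior, so no history needs to be stored once the current posterior is represented. The 1-D Gaussian prior/likelihood and the observations are illustrative assumptions, not from the talk.

```python
import numpy as np
from scipy.stats import norm

y = np.linspace(-5, 5, 2001)
obs = np.array([0.3, -0.2, 0.9, 0.4])

def normalize(density):
    return density / np.trapz(density, y)

# sequential: start from the prior and fold in one likelihood at a time
post_seq = normalize(norm.pdf(y, 0.0, 1.0))
for p in obs:
    post_seq = normalize(post_seq * norm.pdf(p, loc=y, scale=1.0))

# batch: prior times all likelihoods at once
post_batch = normalize(norm.pdf(y, 0.0, 1.0) * np.prod(
    [norm.pdf(p, loc=y, scale=1.0) for p in obs], axis=0))

print(np.max(np.abs(post_seq - post_batch)))   # ~1e-16: the two agree
```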
SLIDE 6

Related Work

  • MCMC

– requires a complete scan of the data

  • Variational Inference (VI)

– requires re-optimization for every new observation

  • Stochastic approximate inference

– prescribed algorithms to optimize the final posterior $q(y \mid p_{1:N})$
– cannot exploit the structure of the sequential inference problem

SLIDE 7

Related Work

βœ“ Sequential Monte Carlo (Doucet et al., 2001; Balakrishnan & Madigan, 2006)
– the state of the art for online Bayesian inference
– but suffers from the path degeneracy problem in high dimensions
– rejuvenation steps can help but violate the online constraints (Canini et al., 2009)

Can we learn to perform efficient and effective sequential Bayesian update?

SLIDE 8

Operator View

βœ“ Kernel Bayes' Rule (Fukumizu et al., 2012)
– the posterior is represented as an embedding $\nu_n = \mathbb{E}_{q(y \mid p_{1:n})}[\phi(y)]$
– the Bayes update is viewed as an operator in a reproducing kernel Hilbert space (RKHS): $\nu_{n+1} = \mathcal{T}(\nu_n, p_{n+1})$ (current embedding to updated embedding)
– conceptually nice but limited in practice


SLIDE 9

Our Approach: Bayesian Inference as Particle Flow

$$\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\} \sim \rho(y) \quad\xrightarrow{\;\, y_1^m \,=\, y_0^m \,+\, \int_0^T g(\mathcal{Y}_0,\, p_1,\, y(t))\, dt \;\,}\quad \mathcal{Y}_1 = \{y_1^1, \dots, y_1^M\} \sim q(y \mid p_1)$$

Particle Flow

  • Start with $M$ particles $\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\}$, sampled i.i.d. from the prior $\rho(y)$
  • Transport the particles to the next posterior via the solution of an initial value problem (IVP):

$$\frac{dy}{dt} = g\big(\mathcal{Y}_0,\, p_1,\, y(t)\big), \;\; \forall t \in [0, T], \qquad y(0) = y_0^m \;\;\Longrightarrow\;\; \text{solution } y_1^m = y(T)$$

SLIDE 10

Flow Property

  • The Continuity Equation expresses the law of local conservation of mass:
– mass can neither be created nor destroyed
– nor can it 'teleport' from one place to another

$$\frac{\partial r(y, t)}{\partial t} = -\nabla_y \cdot (r\, g)$$

  • Theorem. If $\frac{dy}{dt} = g$, then the change in log-density follows the differential equation

$$\frac{d \log r(y, t)}{dt} = -\nabla_y \cdot g$$

  • Notation
– $\frac{dr}{dt}$ is the material derivative, which defines the rate of change of $r$ for a given particle as it moves along its trajectory $y = y(t)$
– $\frac{\partial r}{\partial t}$ is the partial derivative, which defines the rate of change of $r$ at a particular point $y$

SLIDE 11

Particle Flow for Sequential Bayesian Inference

Each new observation $p_{n+1}$ triggers one flow that updates both the particles and their (negative) log-densities:

$$y_{n+1}^m = y_n^m + \int_0^T g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t)\big)\, dt$$

$$-\log r_{n+1}^m = -\log r_n^m + \int_0^T \nabla_y \cdot g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t)\big)\, dt$$

$$\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\} \;\xrightarrow{\,\int_0^T g(\mathcal{Y}_0, p_1, y(t))\, dt\,}\; \mathcal{Y}_1 = \{y_1^1, \dots, y_1^M\} \;\xrightarrow{\,\int_0^T g(\mathcal{Y}_1, p_2, y(t))\, dt\,}\; \mathcal{Y}_2 \;\xrightarrow{\,\int_0^T g(\mathcal{Y}_2, p_3, y(t))\, dt\,}\; \cdots$$

mirroring the chain of posteriors $\rho(y) \to q(y \mid p_1) \to q(y \mid p_{1:2}) \to \cdots \to q(y \mid p_{1:n})$.

  • Other ODE-based approaches (e.g. the Neural ODE of Chen et al., 2018) are not designed for the sequential case.
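A sketch of the sequential bookkeeping implied by the two update equations above: one flow per observation, with $-\log r$ accumulated from the divergence along the same trajectory. The toy velocity (and its hand-computed derivative) stands in for the learned $g$; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, steps, dt = 200, 50, 0.02
particles = rng.normal(0.0, 1.0, size=M)                     # Y_0 ~ prior N(0, 1)
neg_log_r = 0.5 * np.log(2 * np.pi) + 0.5 * particles**2     # -log rho(y_0^m)

observations = [1.5, 0.8, 2.1]
for p in observations:
    Y_prev = particles.copy()                                 # particle set Y_n
    for _ in range(steps):
        vel = 0.5 * (p - particles)                           # toy g(Y_n, p_{n+1}, y(t))
        div = -0.5 * np.ones_like(particles)                  # d(vel)/dy for this toy g
        particles = particles + dt * vel                      # particle update
        neg_log_r = neg_log_r + dt * div                      # -log r_{n+1} update
print(particles.mean(), neg_log_r[:3])
```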
SLIDE 12

Does a Shared Flow Velocity π’ˆ Exist?

A simple Gaussian Example

  • Prior $\rho(y) = \mathcal{N}(0, \tau_0)$, likelihood $q(p \mid y) = \mathcal{N}(y, \sigma)$, observation $p = 0$
  • $\Longrightarrow$ posterior $q(y \mid p = 0) = \mathcal{N}\big(0,\, \tfrac{\sigma \cdot \tau_0}{\sigma + \tau_0}\big)$

  • Does a shared $g$ exist for priors with different $\tau_0$? What form does it take?
– E.g. a $g$ of the form $g(p, y(t))$ will not be able to handle different $\tau_0$.

𝑦 π‘ˆ = 𝑦 0 +

π‘ˆ

𝑔(π‘—π‘œπ‘žπ‘£π‘’π‘‘)𝑒𝑒 𝜌(𝑦) 𝑦 0 ∼ π‘ž(𝑦|𝑝1) 𝑦 𝑒 ∼

Does a shared flow velocity 𝑔 exist for different Bayesian inference tasks involving different priors and different observations?

SLIDE 13

Existence: Connection to Stochastic Flow

  • Langevin dynamics is a stochastic process

$$dy(t) = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big)\, dt + \sqrt{2}\, dW(t),$$

where $W(t)$ is a standard Brownian motion.

  • Property. If the potential function $\Psi(y) := -\log\big(\rho(y)\, q(p \mid y)\big)$ is smooth and $e^{-\Psi} \in L^1(\mathbb{R}^d)$, the Fokker–Planck equation has a unique stationary solution in the form of the Gibbs distribution,

$$r(y, \infty) = \frac{e^{-\Psi}}{Z} = \frac{\rho(y)\, q(p \mid y)}{Z} = q(y \mid p)$$

SLIDE 14

Existence: Connection to Stochastic Flow

  • The probability density $r(y, t)$ of $y(t)$ follows a deterministic evolution according to the Fokker–Planck equation

$$\frac{\partial r}{\partial t} = -\nabla_y \cdot \big(r\, \nabla_y \log(\rho(y)\, q(p \mid y))\big) + \Delta_y r(y, t),$$

which is in the form of the Continuity Equation.

  • Theorem. When the deterministic transformation of the random variable $y(t)$ follows

$$\frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - \nabla_y \log r(y, t),$$

its probability density $r(y, t)$ converges to the posterior $q(y \mid p)$ as $t \to \infty$.

  • Indeed, the right-hand side of the Fokker–Planck equation can be rewritten as

$$-\nabla_y \cdot \Big(r\,\big(\underbrace{\nabla_y \log \rho(y)\, q(p \mid y) - \nabla_y \log r(y, t)}_{g}\big)\Big),$$

which identifies the bracketed term as the flow velocity $g$.
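A rough 1-D sketch of this deterministic flow. The score $\nabla_y \log r(y, t)$ is not available in general; here it is crudely approximated from the current particles with a Gaussian kernel density estimate, which is an assumption of this sketch rather than the construction used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
p_obs, tau0, sigma = 1.0, 1.0, 0.5
M, dt, n_steps, h = 300, 0.02, 500, 0.25      # particles, step, steps, KDE bandwidth

def grad_log_joint(y):
    return -y / tau0 + (p_obs - y) / sigma

def kde_score(y, particles):
    # grad_y log of (1/M) sum_j N(y; y_j, h^2), evaluated for each entry of y
    diff = particles[None, :] - y[:, None]                    # (len(y), M)
    w = np.exp(-0.5 * (diff / h) ** 2)
    return (w * diff / h**2).sum(axis=1) / w.sum(axis=1)

particles = rng.normal(0.0, np.sqrt(tau0), size=M)            # start at the prior
for _ in range(n_steps):
    particles += dt * (grad_log_joint(particles) - kde_score(particles, particles))

# compare against the exact posterior N(tau0*p/(tau0+sigma), tau0*sigma/(tau0+sigma))
print(particles.mean(), particles.var())
```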

SLIDE 15

Existence: Closed-Loop to Open-Loop Conversion

  • The Fokker–Planck equation leads to a closed-loop flow, which depends not just on $\rho(y)$ and $q(p \mid y)$ but also on the flow state $r(y, t)$.
  • Is there an equivalent form, independent of $r(y, t)$, which can achieve the same flow?

Optimization problem:

$$\min_{x} \;\; d\big(r(y, \infty),\, q(y \mid p)\big) \qquad \text{s.t.} \quad \frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - x$$

  • Positive answer: there exists a fixed and deterministic flow velocity $g$ of the form

$$\frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - x^*\big(\rho(y),\, q(p \mid y),\, y,\, t\big)$$

SLIDE 16

Parameterization

  • 𝝆 π’š ⟹ 𝓨

– use samples 𝒴 as surrogates, feature space embedding – Ideally, if πœˆπ’΄ is an injective mapping from the space of probability measures

  • 𝒒 𝒑|π’š ⟹ (𝒑, π’š(𝒖)) for a fixed likelihood function
  • With two neural networks $\phi$ and $h$, the overall parameterization is

$$\nu_{\mathcal{Y}}(q) := \int_{\mathcal{Y}} \phi(y)\, q(y)\, dy \;\approx\; \frac{1}{M}\sum_{m=1}^{M} \phi(y^m), \qquad y^m \sim \rho$$

$$g\big(\mathcal{Y}, p, y(t), t\big) = h\Big(\frac{1}{M}\sum_{m=1}^{M} \phi(y^m),\; p,\; y(t),\; t\Big)$$

mirroring the open-loop form $\frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - x^*\big(\rho(y),\, q(p \mid y),\, y,\, t\big)$.
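A possible reading of this parameterization in code: an embedding network phi is averaged over the particle set, and a second network h maps the mean embedding together with the observation, the current state, and time to a velocity. The layer sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

d, emb_dim, obs_dim = 2, 16, 1     # parameter dim, embedding dim, observation dim

phi = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, emb_dim))
h = nn.Sequential(nn.Linear(emb_dim + obs_dim + d + 1, 64), nn.Tanh(),
                  nn.Linear(64, d))

def g(particles, p, y, t):
    """Flow velocity g(Y, p, y(t), t) = h(mean_m phi(y^m), p, y, t)."""
    mean_emb = phi(particles).mean(dim=0)                      # (emb_dim,)
    batch = y.shape[0]
    inp = torch.cat([mean_emb.expand(batch, -1),
                     p.expand(batch, -1),
                     y,
                     torch.full((batch, 1), t)], dim=1)
    return h(inp)

particles = torch.randn(100, d)          # Y ~ prior samples
p = torch.tensor([[0.7]])                # one observation
velocity = g(particles, p, particles, 0.0)
print(velocity.shape)                    # (100, d)
```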

SLIDE 17

Multi-task Framework

  • Training set $\mathcal{D}_{\text{train}}$ for meta learning
– containing multiple inference tasks, with diverse priors and observations
  • Each task $\mathcal{U} \in \mathcal{D}_{\text{train}}$ consists of a prior, a likelihood, and $N$ observations:

$$\mathcal{U} := \big(\rho(y),\; q(\cdot \mid y),\; \{p_1, p_2, \dots, p_N\}\big)$$

Loss Function

  • Minimize the negative evidence lower bound (ELBO) for each task $\mathcal{U}$:

$$\mathcal{L}(\mathcal{U}) = \sum_{n=1}^{N} \mathrm{KL}\big(r_n(y)\,\|\,q(y \mid p_{1:n})\big) = \sum_{n=1}^{N}\sum_{m=1}^{M} \big[\log r_n^m - \log q\big(y_n^m,\, p_{1:n}\big)\big] + \text{const}$$

Meta Learning

During training, each Bayesian update within a task is carried out by the shared flow:

$$y_{n+1}^m = y_n^m + \int_0^T g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t),\, t\big)\, dt$$

$$-\log r_{n+1}^m = -\log r_n^m + \int_0^T \nabla_y \cdot g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t),\, t\big)\, dt$$
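A sketch of how the per-task loss could be assembled once the flow has produced, for each step $n$, the particle positions $y_n^m$ and their tracked log-densities $\log r_n^m$; here both are replaced by placeholder arrays so the arithmetic is self-contained, and the Gaussian prior/likelihood inside the log-joint is an assumption.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
obs = np.array([0.4, 1.2, -0.3])                 # p_1, ..., p_N of one task
N, M = len(obs), 50

def log_joint(y, n):
    # log rho(y) + sum_{j<=n} log q(p_j | y), with Gaussian prior/likelihood assumed
    return norm.logpdf(y) + norm.logpdf(obs[:n, None], loc=y).sum(axis=0)

loss = 0.0
for n in range(1, N + 1):
    y_n = rng.normal(size=M)                     # placeholder for the flowed particles
    log_r_n = norm.logpdf(y_n)                   # placeholder for the tracked log r_n^m
    loss += np.mean(log_r_n - log_joint(y_n, n)) # KL(r_n || q(.|p_1:n)) up to a constant
print(loss)
```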

SLIDE 18

Experiments: Benefit in High Dimensions

Multivariate Gaussian Model

  • prior $y \sim \mathcal{N}(\mu_y, \Sigma_y)$
  • observations $p \mid y \sim \mathcal{N}(y, \Sigma_p)$
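Since this model is conjugate, the exact posterior after $n$ observations is available in closed form and can serve as ground truth against which particle approximations are compared; a minimal sketch with illustrative $\mu_y$, $\Sigma_y$, $\Sigma_p$.

```python
import numpy as np

d, n = 3, 10
rng = np.random.default_rng(0)
mu_y, Sigma_y = np.zeros(d), np.eye(d)          # prior N(mu_y, Sigma_y)
Sigma_p = 0.5 * np.eye(d)                       # observation noise N(y, Sigma_p)

y_true = rng.multivariate_normal(mu_y, Sigma_y)
X = rng.multivariate_normal(y_true, Sigma_p, size=n)   # observations p_1..p_n

# posterior precision and mean for a Gaussian likelihood with known covariance
prec = np.linalg.inv(Sigma_y) + n * np.linalg.inv(Sigma_p)
Sigma_post = np.linalg.inv(prec)
mu_post = Sigma_post @ (np.linalg.inv(Sigma_y) @ mu_y
                        + np.linalg.inv(Sigma_p) @ X.sum(axis=0))
print(mu_post, np.diag(Sigma_post))
```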

Experiment Setting

  • The training set only contains sequences of 10 observations, but a diverse set of prior distributions.
  • The testing set contains 25 different sequences of 100 observations.

Result

  • As the dimension of the model increases, the advantage of our MPF operator grows.
SLIDE 19

Gaussian Mixture Model

  • prior $y_1, y_2 \sim \mathcal{N}(0, 1)$
  • observations $p \mid y_1, y_2 \sim \tfrac{1}{2}\mathcal{N}(y_1, 1) + \tfrac{1}{2}\mathcal{N}(y_1 + y_2, 1)$
  • With $(y_1, y_2) = (1, -2)$, the resulting posterior will have two modes: $(1, -2)$ and $(-1, 2)$

Experiments: Benefit for Multimodal Posteriors

Fitting even a single posterior $q(y \mid p_{1:n})$ is already not easy.

[Results reported by Dai et al. (2016)]

[Figure: posterior density plots. (a) True posterior (b) Stochastic Variational Inference (c) Stochastic Gradient Langevin Dynamics (d) Gibbs Sampling (e) One-pass SMC]

SLIDE 20

Gaussian Mixture Model

  • prior $y_1, y_2 \sim \mathcal{N}(0, 1)$
  • observations $p \mid y_1, y_2 \sim \tfrac{1}{2}\mathcal{N}(y_1, 1) + \tfrac{1}{2}\mathcal{N}(y_1 + y_2, 1)$
  • With $(y_1, y_2) = (1, -2)$, the resulting posterior will have two modes: $(1, -2)$ and $(-1, 2)$

Experiments: Benefit for Multimodal Posteriors

Our more challenging experimental setting:

  • The learned MPF operator is tested on sequences that are not observed in the training set.
  • It needs to fit all intermediate posteriors $q(y \mid p_1), q(y \mid p_{1:2}), \dots, q(y \mid p_{1:n})$.

Visualization of the evolution of posterior density from left to right.

SLIDE 21

Hidden Markov Model – Linear Dynamical System

Experiments: Hidden Markov Model

[Graphical model: hidden states $y_1, y_2, \dots, y_n$ with observations $p_1, p_2, \dots, p_n$]

  • $y_n = B\, y_{n-1} + \vartheta_n$, with $\vartheta_n \sim \mathcal{N}(0, \Sigma_1)$
  • $p_n = C\, y_n + \varepsilon_n$, with $\varepsilon_n \sim \mathcal{N}(0, \Sigma_2)$

Transition sampling + MPF operator:
1. $\tilde{y}_n^m = B\, y_{n-1}^m + \vartheta_n$
2. $y_n^m = \mathcal{F}\big(\tilde{\mathcal{Y}}_n,\, \tilde{y}_n^m,\, p_n\big)$, a Bayesian update from $q(y_n \mid p_{1:n-1})$ to $q(y_n \mid p_{1:n})$

Marginal posteriors update: $q(y_1 \mid p_1) \to q(y_2 \mid p_{1:2}) \to \cdots \to q(y_n \mid p_{1:n})$
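A sketch of the two-step loop for a 1-D linear dynamical system. Step 1 is the transition sampling above; step 2, the learned MPF operator $\mathcal{F}$, is stood in for by a simple importance-weight-and-resample update (a bootstrap-filter style stand-in), which is an assumption of this sketch and not the operator of the talk.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
B_coef, C_coef, s1, s2 = 0.9, 1.0, 0.3, 0.5      # 1-D B, C, sqrt(Sigma_1), sqrt(Sigma_2)
M, n_steps = 500, 20

# simulate a trajectory and its observations
y_true, obs = 0.0, []
for _ in range(n_steps):
    y_true = B_coef * y_true + rng.normal(0, s1)
    obs.append(C_coef * y_true + rng.normal(0, s2))

particles = rng.normal(0, 1, size=M)             # samples of y_0
for p_n in obs:
    # 1. transition sampling: y_n^m = B y_{n-1}^m + theta_n
    particles = B_coef * particles + rng.normal(0, s1, size=M)
    # 2. stand-in for the MPF Bayesian update from q(y_n|p_1:n-1) to q(y_n|p_1:n)
    w = norm.pdf(p_n, loc=C_coef * particles, scale=s2)
    w /= w.sum()
    particles = rng.choice(particles, size=M, p=w)
print(particles.mean(), particles.std())
```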

SLIDE 22

Experiments: Generalization Across Multiple Tasks

Bayesian Logistic Regression on the MNIST dataset, 8 vs. 6

  • About 1.6M training images and 1932 testing images
  • Each data point $p_n := (\varrho_n, t_n)$ (feature, label)
  • Logistic regression: $z = \sigma(y^{\top} \varrho_n)$
  • Likelihood function: $q(p_n \mid y) = z^{t_n} (1 - z)^{1 - t_n}$

Multi-task Environment

  • We reduce the dimension of the images to 50 by PCA
  • We rotate the first two components by an angle $\omega$ drawn from $[-15Β°, 15Β°]$

– The first two components account for more variability in the data

  • Different rotation angle πœ” ⟹ different decision boundary ⟹ different tasks
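A sketch of how one such task could be constructed: PCA to 50 dimensions via an SVD of the centered data, then a rotation of the first two components by the task's angle $\omega$. Random data stands in for the MNIST images, and the helper make_task is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))                 # placeholder for flattened images

# PCA to 50 dimensions via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:50].T                               # (n_samples, 50) PCA features

def make_task(Z, omega_deg):
    """Rotate the first two PCA components by omega to build one task's features."""
    a = np.deg2rad(omega_deg)
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    Z_task = Z.copy()
    Z_task[:, :2] = Z[:, :2] @ R.T
    return Z_task

omega = rng.uniform(-15, 15)                     # one task's rotation angle
print(make_task(Z, omega).shape)
```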

Multi-task Training

  • MPF operator will be learned from multiple training tasks with different πœ”
  • Use the learned MPF as an online-learning algorithm during testing
SLIDE 23

Experiments: Generalization Across Multiple Tasks

Testing as Online learning:

(1) All algorithms start with a set of particles sampled from the prior.
(2) Each algorithm makes predictions for the encountered batch of 32 images.
(3) All algorithms observe the true labels of the encountered batch.
(4) Each algorithm updates its particles and then makes predictions for the next batch.
(predict β†’ observe true labels β†’ update particles β†’ predict β†’ observe true labels β†’ update β€¦)

[Plots: test accuracy over the online sequence for two tasks, accuracy $s_1$ and accuracy $s_2$]
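A sketch of this predict, observe, update loop. Predictions average the logistic likelihood over the particles; the particle update is a crude gradient step used only as a placeholder for the learned MPF operator, and the synthetic features and labels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, n_batches, batch_size = 50, 100, 5, 32
w_true = rng.normal(size=d)
particles = rng.normal(size=(M, d))              # particles over the weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

accuracies = []
for _ in range(n_batches):
    feats = rng.normal(size=(batch_size, d))
    labels = (sigmoid(feats @ w_true) > 0.5).astype(int)
    # 1. predict on the encountered batch by averaging over particles
    probs = sigmoid(feats @ particles.T).mean(axis=1)
    accuracies.append(np.mean((probs > 0.5).astype(int) == labels))
    # 2. observe the true labels, 3. update the particles (placeholder update:
    #    a crude gradient step on the batch log-likelihood, NOT the MPF operator)
    for _ in range(10):
        p_hat = sigmoid(feats @ particles.T)                   # (batch, M)
        grad = feats.T @ (labels[:, None] - p_hat)             # (d, M)
        particles += 0.1 * grad.T / batch_size
print(accuracies)
```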

SLIDE 24

Conclusion and Future Work

  • An ODE-based Bayesian operator for sequential Bayesian inference
  • Existence
  • Parametrization
  • Meta-learning framework
  • Future work
  • Architecture
  • Stable flow
  • Improve training

[Recap: $\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\} \sim \rho(y)$ is transported by $y_1^m = y_0^m + \int_0^T g(\mathcal{Y}_0, p_1, y(t), t)\, dt$ into $\mathcal{Y}_1 = \{y_1^1, \dots, y_1^M\} \sim q(y \mid p_1)$]