SLIDE 1

Meta Particle Flow for Sequential Bayesian Inference

Le Song
Associate Professor, CSE
Associate Director, Machine Learning Center
Georgia Institute of Technology

Joint work with Xinshi Chen and Hanjun Dai

SLIDE 2

Bayesian Inference

Infer the posterior distribution of the unknown parameter π’š given

  • Prior distribution 𝜌(𝑦)
  • Likelihood function π‘ž(𝑝|𝑦)
  • Observations 𝑝1, 𝑝2, … , 𝑝𝑛

π‘ž 𝑦 𝑝1:𝑛 = 1 𝑨 𝜌 𝑦

𝑗=1 𝑛

π‘ž(𝑝𝑗|𝑦) 𝑨 = 𝜌 𝑦

𝑗=1 𝑛

π‘ž(𝑝𝑗|𝑦) 𝑒𝑦

A challenging computational problem for high-dimensional $y$

[Graphical model: parameter $y$ generating observations $p_1, p_2, \dots, p_n$]
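As a concrete illustration of the formula above, the posterior and its normalizer $Z$ can be approximated on a dense grid when $y$ is one-dimensional. This is a minimal sketch with an assumed Gaussian prior and likelihood and made-up observations, purely to show the mechanics; it is not part of the talk.

```python
# Grid approximation of q(y | p_1:n) = (1/Z) * rho(y) * prod_j q(p_j | y)
import numpy as np
from scipy.stats import norm

y = np.linspace(-5, 5, 2001)                 # grid over the parameter
rho = norm.pdf(y, loc=0.0, scale=1.0)        # prior rho(y) = N(0, 1), assumed
obs = np.array([0.8, 1.1, 0.5])              # illustrative observations p_1..p_n

log_post = np.log(rho)
for p in obs:
    log_post += norm.logpdf(p, loc=y, scale=1.0)   # + log q(p_j | y)

unnorm = np.exp(log_post - log_post.max())
Z = np.trapz(unnorm, y)                      # grid approximation of the integral
posterior = unnorm / Z                       # q(y | p_1:n) on the grid
print("posterior mean ~", np.trapz(y * posterior, y))
```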

SLIDE 3

Gaussian Mixture Model

  • prior $(y_1, y_2) \sim \rho(y) = \mathcal{N}(0, I)$
  • observations $p \mid y_1, y_2 \sim q(p \mid y_1, y_2) = \tfrac{1}{2}\mathcal{N}(y_1, 1) + \tfrac{1}{2}\mathcal{N}(y_1 + y_2, 1)$

  • With $(y_1, y_2) = (1, -2)$, the resulting posterior will have two modes: $(1, -2)$ and $(-1, 2)$
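A quick numerical way to see the two-mode claim: swapping $(1, -2)$ for $(-1, 2)$ swaps the roles of the two mixture components and leaves the prior term unchanged, so the unnormalized posterior is identical at both points. The sketch below checks this with simulated observations; the data and seed are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def log_unnorm_post(y1, y2, obs):
    log_prior = norm.logpdf(y1) + norm.logpdf(y2)           # N(0, I) prior
    lik = 0.5 * norm.pdf(obs, loc=y1, scale=1.0) \
        + 0.5 * norm.pdf(obs, loc=y1 + y2, scale=1.0)        # mixture likelihood
    return log_prior + np.sum(np.log(lik))

rng = np.random.default_rng(0)
# simulate observations from the model with true parameters (1, -2)
comp = rng.integers(0, 2, size=100)
obs = np.where(comp == 0, rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100))

print(log_unnorm_post(1.0, -2.0, obs))   # the two symmetric modes ...
print(log_unnorm_post(-1.0, 2.0, obs))   # ... have identical posterior value
```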

Challenges in Bayesian Inference

Fitting even a single posterior $q(y \mid p_{1:n})$ is already not easy.

[Results reported by Dai et al. (2016)]

[Figure: posterior density plots. (a) True posterior (b) Stochastic Variational Inference (c) Stochastic Gradient Langevin Dynamics (d) Gibbs Sampling (e) One-pass SMC]

SLIDE 4

Fundamental Principle for Machine Learning

Lots of applications in machine learning

  • Hidden Markov model
  • Topic modeling
  • Uncertainty quantification
  • $p_{t+1} = Q\, p_{t-\upsilon}\, \exp\!\big(-z_{t-\upsilon}/z_0\big)\, f_t + p_t\, \exp(-\varepsilon\, \vartheta_t)$
  • $f_t \sim \Gamma\big(\sigma_q^{-2}, \sigma_q^2\big)$, $\vartheta_t \sim \Gamma\big(\sigma_e^{-2}, \sigma_e^2\big)$
  • parameters $y = \big(Q,\, z_0,\, \sigma_q^2,\, \sigma_e^2,\, \upsilon,\, \varepsilon\big)$

[Figures: a topic-model plate diagram (topic, word) and a state-space model linking true locations $y_1, y_2, \dots, y_n$ to sensor measurements $p_1, p_2, \dots, p_n$]

SLIDE 5

Sequential Bayesian Inference

[Figure: the prior $\rho(y)$ is updated to $q(y \mid p_1)$, $q(y \mid p_{1:2})$, …, $q(y \mid p_{1:n})$ as the observations $p_1, p_2, \dots, p_n$ arrive]

$$\underbrace{q(y \mid p_{1:n})}_{\text{updated posterior}} \;\propto\; \underbrace{q(y \mid p_{1:n-1})}_{\text{current posterior}}\; \underbrace{q(p_n \mid y)}_{\text{likelihood}}$$

A grid-based check of this recursion is sketched at the end of this slide.

Online Bayesian Inference

  • Observations 𝑝1, 𝑝2, … , 𝑝𝑛 arrive sequentially

An ideal algorithm should:

  • Efficiently update $q(y \mid p_{1:n})$ to $q(y \mid p_{1:n+1})$ when $p_{n+1}$ is observed
  • Without storing all historical observations 𝑝1, 𝑝2, … , 𝑝𝑛
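A minimal grid-based sketch of the recursion above: folding in one likelihood at a time reproduces the batch posterior, so no history needs to be stored once the current posterior is represented. The 1-D Gaussian prior/likelihood and the observations are illustrative assumptions, not from the talk.

```python
import numpy as np
from scipy.stats import norm

y = np.linspace(-5, 5, 2001)
obs = np.array([0.3, -0.2, 0.9, 0.4])

def normalize(density):
    return density / np.trapz(density, y)

# sequential: start from the prior and fold in one likelihood at a time
post_seq = normalize(norm.pdf(y, 0.0, 1.0))
for p in obs:
    post_seq = normalize(post_seq * norm.pdf(p, loc=y, scale=1.0))

# batch: prior times all likelihoods at once
post_batch = normalize(norm.pdf(y, 0.0, 1.0) * np.prod(
    [norm.pdf(p, loc=y, scale=1.0) for p in obs], axis=0))

print(np.max(np.abs(post_seq - post_batch)))   # ~1e-16: the two agree
```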
SLIDE 6

Related Work

  • MCMC

– requires a complete scan of the data

  • Variational Inference (VI)

– requires re-optimization for every new observation

  • Stochastic approximate inference

– prescribed algorithms to optimize the final posterior $q(y \mid p_{1:N})$
– cannot exploit the structure of the sequential inference problem

SLIDE 7

Related Work

βœ“ Sequential Monte Carlo (Doucet et al., 2001; Balakrishnan & Madigan, 2006)
– the state of the art for online Bayesian inference
– but suffers from the path degeneracy problem in high dimensions
– rejuvenation steps can help but violate the online constraints (Canini et al., 2009)

Can we learn to perform efficient and effective sequential Bayesian update?

SLIDE 8

Operator View

βœ“ Kernel Bayes' Rule (Fukumizu et al., 2012)
– the posterior is represented as an embedding $\nu_n = \mathbb{E}_{q(y \mid p_{1:n})}[\phi(y)]$
– the Bayes update is viewed as an operator in a reproducing kernel Hilbert space (RKHS): $\nu_{n+1} = \mathcal{T}(\nu_n, p_{n+1})$ (current embedding to updated embedding)
– conceptually nice but limited in practice


SLIDE 9

Our Approach: Bayesian Inference as Particle Flow

$$\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\} \sim \rho(y) \quad\xrightarrow{\;\, y_1^m \,=\, y_0^m \,+\, \int_0^T g(\mathcal{Y}_0,\, p_1,\, y(t))\, dt \;\,}\quad \mathcal{Y}_1 = \{y_1^1, \dots, y_1^M\} \sim q(y \mid p_1)$$

Particle Flow

  • Start with $M$ particles $\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\}$, sampled i.i.d. from the prior $\rho(y)$
  • Transport the particles to the next posterior via the solution of an initial value problem (IVP):

$$\frac{dy}{dt} = g\big(\mathcal{Y}_0,\, p_1,\, y(t)\big), \;\; \forall t \in [0, T], \qquad y(0) = y_0^m \;\;\Longrightarrow\;\; \text{solution } y_1^m = y(T)$$

SLIDE 10

Flow Property

  • The Continuity Equation expresses the law of local conservation of mass:
– mass can neither be created nor destroyed
– nor can it 'teleport' from one place to another

$$\frac{\partial r(y, t)}{\partial t} = -\nabla_y \cdot (r\, g)$$

  • Theorem. If $\frac{dy}{dt} = g$, then the change in log-density follows the differential equation

$$\frac{d \log r(y, t)}{dt} = -\nabla_y \cdot g$$

  • Notation
– $\frac{dr}{dt}$ is the material derivative, which defines the rate of change of $r$ for a given particle as it moves along its trajectory $y = y(t)$
– $\frac{\partial r}{\partial t}$ is the partial derivative, which defines the rate of change of $r$ at a particular point $y$

SLIDE 11

Particle Flow for Sequential Bayesian Inference

Each new observation $p_{n+1}$ triggers one flow that updates both the particles and their (negative) log-densities:

$$y_{n+1}^m = y_n^m + \int_0^T g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t)\big)\, dt$$

$$-\log r_{n+1}^m = -\log r_n^m + \int_0^T \nabla_y \cdot g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t)\big)\, dt$$

$$\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\} \;\xrightarrow{\,\int_0^T g(\mathcal{Y}_0, p_1, y(t))\, dt\,}\; \mathcal{Y}_1 = \{y_1^1, \dots, y_1^M\} \;\xrightarrow{\,\int_0^T g(\mathcal{Y}_1, p_2, y(t))\, dt\,}\; \mathcal{Y}_2 \;\xrightarrow{\,\int_0^T g(\mathcal{Y}_2, p_3, y(t))\, dt\,}\; \cdots$$

mirroring the chain of posteriors $\rho(y) \to q(y \mid p_1) \to q(y \mid p_{1:2}) \to \cdots \to q(y \mid p_{1:n})$.

  • Other ODE-based approaches (e.g. the Neural ODE of Chen et al., 2018) are not designed for the sequential case.
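A sketch of the sequential bookkeeping implied by the two update equations above: one flow per observation, with $-\log r$ accumulated from the divergence along the same trajectory. The toy velocity (and its hand-computed derivative) stands in for the learned $g$; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, steps, dt = 200, 50, 0.02
particles = rng.normal(0.0, 1.0, size=M)                     # Y_0 ~ prior N(0, 1)
neg_log_r = 0.5 * np.log(2 * np.pi) + 0.5 * particles**2     # -log rho(y_0^m)

observations = [1.5, 0.8, 2.1]
for p in observations:
    Y_prev = particles.copy()                                 # particle set Y_n
    for _ in range(steps):
        vel = 0.5 * (p - particles)                           # toy g(Y_n, p_{n+1}, y(t))
        div = -0.5 * np.ones_like(particles)                  # d(vel)/dy for this toy g
        particles = particles + dt * vel                      # particle update
        neg_log_r = neg_log_r + dt * div                      # -log r_{n+1} update
print(particles.mean(), neg_log_r[:3])
```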
SLIDE 12

Does a Shared Flow Velocity π’ˆ Exist?

A simple Gaussian Example

  • Prior $\rho(y) = \mathcal{N}(0, \tau_0)$, likelihood $q(p \mid y) = \mathcal{N}(y, \sigma)$, observation $p = 0$
  • $\Longrightarrow$ posterior $q(y \mid p = 0) = \mathcal{N}\big(0,\, \tfrac{\sigma \cdot \tau_0}{\sigma + \tau_0}\big)$

  • Does a shared $g$ exist for priors with different $\tau_0$? What form does it take?
– E.g. a $g$ of the form $g(p, y(t))$ will not be able to handle different $\tau_0$.

𝑦 π‘ˆ = 𝑦 0 +

π‘ˆ

𝑔(π‘—π‘œπ‘žπ‘£π‘’π‘‘)𝑒𝑒 𝜌(𝑦) 𝑦 0 ∼ π‘ž(𝑦|𝑝1) 𝑦 𝑒 ∼

Does a shared flow velocity 𝑔 exist for different Bayesian inference tasks involving different priors and different observations?

SLIDE 13

Existence: Connection to Stochastic Flow

  • Langevin dynamics is a stochastic process

$$dy(t) = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big)\, dt + \sqrt{2}\, dW(t),$$

where $W(t)$ is a standard Brownian motion.

  • Property. If the potential function $\Psi(y) := -\log\big(\rho(y)\, q(p \mid y)\big)$ is smooth and $e^{-\Psi} \in L^1(\mathbb{R}^d)$, the Fokker–Planck equation has a unique stationary solution in the form of the Gibbs distribution,

$$r(y, \infty) = \frac{e^{-\Psi}}{Z} = \frac{\rho(y)\, q(p \mid y)}{Z} = q(y \mid p)$$

SLIDE 14

Existence: Connection to Stochastic Flow

  • The probability density $r(y, t)$ of $y(t)$ follows a deterministic evolution according to the Fokker–Planck equation

$$\frac{\partial r}{\partial t} = -\nabla_y \cdot \big(r\, \nabla_y \log(\rho(y)\, q(p \mid y))\big) + \Delta_y r(y, t),$$

which is in the form of the Continuity Equation.

  • Theorem. When the deterministic transformation of the random variable $y(t)$ follows

$$\frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - \nabla_y \log r(y, t),$$

its probability density $r(y, t)$ converges to the posterior $q(y \mid p)$ as $t \to \infty$.

  • Indeed, the right-hand side of the Fokker–Planck equation can be rewritten as

$$-\nabla_y \cdot \Big(r\,\big(\underbrace{\nabla_y \log \rho(y)\, q(p \mid y) - \nabla_y \log r(y, t)}_{g}\big)\Big),$$

which identifies the bracketed term as the flow velocity $g$.
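A rough 1-D sketch of this deterministic flow. The score $\nabla_y \log r(y, t)$ is not available in general; here it is crudely approximated from the current particles with a Gaussian kernel density estimate, which is an assumption of this sketch rather than the construction used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
p_obs, tau0, sigma = 1.0, 1.0, 0.5
M, dt, n_steps, h = 300, 0.02, 500, 0.25      # particles, step, steps, KDE bandwidth

def grad_log_joint(y):
    return -y / tau0 + (p_obs - y) / sigma

def kde_score(y, particles):
    # grad_y log of (1/M) sum_j N(y; y_j, h^2), evaluated for each entry of y
    diff = particles[None, :] - y[:, None]                    # (len(y), M)
    w = np.exp(-0.5 * (diff / h) ** 2)
    return (w * diff / h**2).sum(axis=1) / w.sum(axis=1)

particles = rng.normal(0.0, np.sqrt(tau0), size=M)            # start at the prior
for _ in range(n_steps):
    particles += dt * (grad_log_joint(particles) - kde_score(particles, particles))

# compare against the exact posterior N(tau0*p/(tau0+sigma), tau0*sigma/(tau0+sigma))
print(particles.mean(), particles.var())
```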

SLIDE 15

Existence: Closed-Loop to Open-Loop Conversion

  • The Fokker–Planck equation leads to a closed-loop flow, which depends not just on $\rho(y)$ and $q(p \mid y)$ but also on the flow state $r(y, t)$.
  • Is there an equivalent form, independent of $r(y, t)$, which can achieve the same flow?

Optimization problem:

$$\min_{x} \;\; d\big(r(y, \infty),\, q(y \mid p)\big) \qquad \text{s.t.} \quad \frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - x$$

  • Positive answer: there exists a fixed and deterministic flow velocity $g$ of the form

$$\frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - x^*\big(\rho(y),\, q(p \mid y),\, y,\, t\big)$$

SLIDE 16

Parameterization

  • 𝝆 π’š ⟹ 𝓨

– use samples 𝒴 as surrogates, feature space embedding – Ideally, if πœˆπ’΄ is an injective mapping from the space of probability measures

  • 𝒒 𝒑|π’š ⟹ (𝒑, π’š(𝒖)) for a fixed likelihood function
  • With two neural networks $\phi$ and $h$, the overall parameterization is

$$\nu_{\mathcal{Y}}(q) := \int_{\mathcal{Y}} \phi(y)\, q(y)\, dy \;\approx\; \frac{1}{M}\sum_{m=1}^{M} \phi(y^m), \qquad y^m \sim \rho$$

$$g\big(\mathcal{Y}, p, y(t), t\big) = h\Big(\frac{1}{M}\sum_{m=1}^{M} \phi(y^m),\; p,\; y(t),\; t\Big)$$

mirroring the open-loop form $\frac{dy}{dt} = \nabla_y \log\big(\rho(y)\, q(p \mid y)\big) - x^*\big(\rho(y),\, q(p \mid y),\, y,\, t\big)$.
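A possible reading of this parameterization in code: an embedding network phi is averaged over the particle set, and a second network h maps the mean embedding together with the observation, the current state, and time to a velocity. The layer sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

d, emb_dim, obs_dim = 2, 16, 1     # parameter dim, embedding dim, observation dim

phi = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, emb_dim))
h = nn.Sequential(nn.Linear(emb_dim + obs_dim + d + 1, 64), nn.Tanh(),
                  nn.Linear(64, d))

def g(particles, p, y, t):
    """Flow velocity g(Y, p, y(t), t) = h(mean_m phi(y^m), p, y, t)."""
    mean_emb = phi(particles).mean(dim=0)                      # (emb_dim,)
    batch = y.shape[0]
    inp = torch.cat([mean_emb.expand(batch, -1),
                     p.expand(batch, -1),
                     y,
                     torch.full((batch, 1), t)], dim=1)
    return h(inp)

particles = torch.randn(100, d)          # Y ~ prior samples
p = torch.tensor([[0.7]])                # one observation
velocity = g(particles, p, particles, 0.0)
print(velocity.shape)                    # (100, d)
```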

SLIDE 17

Multi-task Framework

  • Training set $\mathcal{D}_{\text{train}}$ for meta learning
– containing multiple inference tasks, with diverse priors and observations
  • Each task $\mathcal{U} \in \mathcal{D}_{\text{train}}$ consists of a prior, a likelihood, and $N$ observations:

$$\mathcal{U} := \big(\rho(y),\; q(\cdot \mid y),\; \{p_1, p_2, \dots, p_N\}\big)$$

Loss Function

  • Minimize the negative evidence lower bound (ELBO) for each task $\mathcal{U}$:

$$\mathcal{L}(\mathcal{U}) = \sum_{n=1}^{N} \mathrm{KL}\big(r_n(y)\,\|\,q(y \mid p_{1:n})\big) = \sum_{n=1}^{N}\sum_{m=1}^{M} \big[\log r_n^m - \log q\big(y_n^m,\, p_{1:n}\big)\big] + \text{const}$$

Meta Learning

During training, each Bayesian update within a task is carried out by the shared flow:

$$y_{n+1}^m = y_n^m + \int_0^T g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t),\, t\big)\, dt$$

$$-\log r_{n+1}^m = -\log r_n^m + \int_0^T \nabla_y \cdot g\big(\mathcal{Y}_n,\, p_{n+1},\, y(t),\, t\big)\, dt$$
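A sketch of how the per-task loss could be assembled once the flow has produced, for each step $n$, the particle positions $y_n^m$ and their tracked log-densities $\log r_n^m$; here both are replaced by placeholder arrays so the arithmetic is self-contained, and the Gaussian prior/likelihood inside the log-joint is an assumption.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
obs = np.array([0.4, 1.2, -0.3])                 # p_1, ..., p_N of one task
N, M = len(obs), 50

def log_joint(y, n):
    # log rho(y) + sum_{j<=n} log q(p_j | y), with Gaussian prior/likelihood assumed
    return norm.logpdf(y) + norm.logpdf(obs[:n, None], loc=y).sum(axis=0)

loss = 0.0
for n in range(1, N + 1):
    y_n = rng.normal(size=M)                     # placeholder for the flowed particles
    log_r_n = norm.logpdf(y_n)                   # placeholder for the tracked log r_n^m
    loss += np.mean(log_r_n - log_joint(y_n, n)) # KL(r_n || q(.|p_1:n)) up to a constant
print(loss)
```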

SLIDE 18

Experiments: Benefit in High Dimensions

Multivariate Gaussian Model

  • prior $y \sim \mathcal{N}(\mu_y, \Sigma_y)$
  • observations $p \mid y \sim \mathcal{N}(y, \Sigma_p)$
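Since this model is conjugate, the exact posterior after $n$ observations is available in closed form and can serve as ground truth against which particle approximations are compared; a minimal sketch with illustrative $\mu_y$, $\Sigma_y$, $\Sigma_p$.

```python
import numpy as np

d, n = 3, 10
rng = np.random.default_rng(0)
mu_y, Sigma_y = np.zeros(d), np.eye(d)          # prior N(mu_y, Sigma_y)
Sigma_p = 0.5 * np.eye(d)                       # observation noise N(y, Sigma_p)

y_true = rng.multivariate_normal(mu_y, Sigma_y)
X = rng.multivariate_normal(y_true, Sigma_p, size=n)   # observations p_1..p_n

# posterior precision and mean for a Gaussian likelihood with known covariance
prec = np.linalg.inv(Sigma_y) + n * np.linalg.inv(Sigma_p)
Sigma_post = np.linalg.inv(prec)
mu_post = Sigma_post @ (np.linalg.inv(Sigma_y) @ mu_y
                        + np.linalg.inv(Sigma_p) @ X.sum(axis=0))
print(mu_post, np.diag(Sigma_post))
```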

Experiment Setting

  • The training set only contains sequences of 10 observations, but a diverse set of prior distributions.
  • The testing set contains 25 different sequences of 100 observations.

Result

  • As the dimension of the model increases, the advantage of our MPF operator grows.
SLIDE 19

Gaussian Mixture Model

  • prior $y_1, y_2 \sim \mathcal{N}(0, 1)$
  • observations $p \mid y_1, y_2 \sim \tfrac{1}{2}\mathcal{N}(y_1, 1) + \tfrac{1}{2}\mathcal{N}(y_1 + y_2, 1)$
  • With $(y_1, y_2) = (1, -2)$, the resulting posterior will have two modes: $(1, -2)$ and $(-1, 2)$

Experiments: Benefit for Multimodal Posteriors

Fitting even a single posterior $q(y \mid p_{1:n})$ is already not easy.

[Results reported by Dai et al. (2016)]

[Figure: posterior density plots. (a) True posterior (b) Stochastic Variational Inference (c) Stochastic Gradient Langevin Dynamics (d) Gibbs Sampling (e) One-pass SMC]

SLIDE 20

Gaussian Mixture Model

  • prior $y_1, y_2 \sim \mathcal{N}(0, 1)$
  • observations $p \mid y_1, y_2 \sim \tfrac{1}{2}\mathcal{N}(y_1, 1) + \tfrac{1}{2}\mathcal{N}(y_1 + y_2, 1)$
  • With $(y_1, y_2) = (1, -2)$, the resulting posterior will have two modes: $(1, -2)$ and $(-1, 2)$

Experiments: Benefit for Multimodal Posteriors

Our more challenging experimental setting:

  • The learned MPF operator is tested on sequences that are not observed in the training set.
  • It needs to fit all intermediate posteriors $q(y \mid p_1), q(y \mid p_{1:2}), \dots, q(y \mid p_{1:n})$.

Visualization of the evolution of posterior density from left to right.

SLIDE 21

Hidden Markov Model – Linear Dynamical System

Experiments: Hidden Markov Model

[Graphical model: hidden states $y_1, y_2, \dots, y_n$ with observations $p_1, p_2, \dots, p_n$]

  • $y_n = B\, y_{n-1} + \vartheta_n$, with $\vartheta_n \sim \mathcal{N}(0, \Sigma_1)$
  • $p_n = C\, y_n + \varepsilon_n$, with $\varepsilon_n \sim \mathcal{N}(0, \Sigma_2)$

Transition sampling + MPF operator:
1. $\tilde{y}_n^m = B\, y_{n-1}^m + \vartheta_n$
2. $y_n^m = \mathcal{F}\big(\tilde{\mathcal{Y}}_n,\, \tilde{y}_n^m,\, p_n\big)$, a Bayesian update from $q(y_n \mid p_{1:n-1})$ to $q(y_n \mid p_{1:n})$

Marginal posteriors update: $q(y_1 \mid p_1) \to q(y_2 \mid p_{1:2}) \to \cdots \to q(y_n \mid p_{1:n})$
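A sketch of the two-step loop for a 1-D linear dynamical system. Step 1 is the transition sampling above; step 2, the learned MPF operator $\mathcal{F}$, is stood in for by a simple importance-weight-and-resample update (a bootstrap-filter style stand-in), which is an assumption of this sketch and not the operator of the talk.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
B_coef, C_coef, s1, s2 = 0.9, 1.0, 0.3, 0.5      # 1-D B, C, sqrt(Sigma_1), sqrt(Sigma_2)
M, n_steps = 500, 20

# simulate a trajectory and its observations
y_true, obs = 0.0, []
for _ in range(n_steps):
    y_true = B_coef * y_true + rng.normal(0, s1)
    obs.append(C_coef * y_true + rng.normal(0, s2))

particles = rng.normal(0, 1, size=M)             # samples of y_0
for p_n in obs:
    # 1. transition sampling: y_n^m = B y_{n-1}^m + theta_n
    particles = B_coef * particles + rng.normal(0, s1, size=M)
    # 2. stand-in for the MPF Bayesian update from q(y_n|p_1:n-1) to q(y_n|p_1:n)
    w = norm.pdf(p_n, loc=C_coef * particles, scale=s2)
    w /= w.sum()
    particles = rng.choice(particles, size=M, p=w)
print(particles.mean(), particles.std())
```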

SLIDE 22

Experiments: Generalization Across Multiple Tasks

Bayesian Logistic Regression on the MNIST dataset, 8 vs. 6

  • About 1.6M training images and 1932 testing images
  • Each data point $p_n := (\varrho_n, t_n)$ (feature, label)
  • Logistic regression: $z = \sigma(y^{\top} \varrho_n)$
  • Likelihood function: $q(p_n \mid y) = z^{t_n} (1 - z)^{1 - t_n}$

Multi-task Environment

  • We reduce the dimension of the images to 50 by PCA
  • We rotate the first two components by an angle $\omega$ drawn from $[-15Β°, 15Β°]$

– The first two components account for more variability in the data

  • Different rotation angle πœ” ⟹ different decision boundary ⟹ different tasks
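A sketch of how one such task could be constructed: PCA to 50 dimensions via an SVD of the centered data, then a rotation of the first two components by the task's angle $\omega$. Random data stands in for the MNIST images, and the helper make_task is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))                 # placeholder for flattened images

# PCA to 50 dimensions via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:50].T                               # (n_samples, 50) PCA features

def make_task(Z, omega_deg):
    """Rotate the first two PCA components by omega to build one task's features."""
    a = np.deg2rad(omega_deg)
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    Z_task = Z.copy()
    Z_task[:, :2] = Z[:, :2] @ R.T
    return Z_task

omega = rng.uniform(-15, 15)                     # one task's rotation angle
print(make_task(Z, omega).shape)
```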

Multi-task Training

  • MPF operator will be learned from multiple training tasks with different πœ”
  • Use the learned MPF as an online-learning algorithm during testing
SLIDE 23

Experiments: Generalization Across Multiple Tasks

Testing as Online learning:

(1) All algorithms start with a set of particles sampled from the prior.
(2) Each algorithm makes predictions for the encountered batch of 32 images.
(3) All algorithms observe the true labels of the encountered batch.
(4) Each algorithm updates its particles and then makes predictions for the next batch.
(predict β†’ observe true labels β†’ update particles β†’ predict β†’ observe true labels β†’ update β€¦)

[Plots: test accuracy over the online sequence for two tasks, accuracy $s_1$ and accuracy $s_2$]
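A sketch of this predict, observe, update loop. Predictions average the logistic likelihood over the particles; the particle update is a crude gradient step used only as a placeholder for the learned MPF operator, and the synthetic features and labels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, n_batches, batch_size = 50, 100, 5, 32
w_true = rng.normal(size=d)
particles = rng.normal(size=(M, d))              # particles over the weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

accuracies = []
for _ in range(n_batches):
    feats = rng.normal(size=(batch_size, d))
    labels = (sigmoid(feats @ w_true) > 0.5).astype(int)
    # 1. predict on the encountered batch by averaging over particles
    probs = sigmoid(feats @ particles.T).mean(axis=1)
    accuracies.append(np.mean((probs > 0.5).astype(int) == labels))
    # 2. observe the true labels, 3. update the particles (placeholder update:
    #    a crude gradient step on the batch log-likelihood, NOT the MPF operator)
    for _ in range(10):
        p_hat = sigmoid(feats @ particles.T)                   # (batch, M)
        grad = feats.T @ (labels[:, None] - p_hat)             # (d, M)
        particles += 0.1 * grad.T / batch_size
print(accuracies)
```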

SLIDE 24

Conclusion and Future Work

  • An ODE-based Bayesian operator for sequential Bayesian inference
  • Existence
  • Parametrization
  • Meta-learning framework
  • Future work
  • Architecture
  • Stable flow
  • Improve training

[Recap: $\mathcal{Y}_0 = \{y_0^1, \dots, y_0^M\} \sim \rho(y)$ is transported by $y_1^m = y_0^m + \int_0^T g(\mathcal{Y}_0, p_1, y(t), t)\, dt$ into $\mathcal{Y}_1 = \{y_1^1, \dots, y_1^M\} \sim q(y \mid p_1)$]