Meta Particle Flow for Sequential Bayesian Inference
Le Song, Associate Professor, CSE; Associate Director, Machine Learning Center; Georgia Institute of Technology
Joint work with Xinshi Chen and Hanjun Dai
Bayesian Inference
Infer the posterior distribution of an unknown parameter x given observations o_1, ..., o_m:
  p(x | o_{1:m}) = (1/Z) π(x) ∏_{i=1}^m p(o_i | x),   where   Z = ∫ π(x) ∏_{i=1}^m p(o_i | x) dx
Computing the normalization constant Z is a challenging computational problem for high-dimensional x.
π¦ π1 π2 ππ
β¦β¦ β¦β¦
Gaussian Mixture Model
Likelihood: p(o | x) = ½ 𝒩(o; x_1, 1) + ½ 𝒩(o; x_1 + x_2, 1)
Fitting even a single posterior p(x | o_{1:m}) is already not easy.
[Results reported by Dai et al. (2016)]
β2 β1.5 β1 β0.5 0.5 1 1.5 2 β3 β2 β1 1 2 3 1 2 3 4 5 6 7 8 9 10 x 10 β3(a) True posterior (b) Stochastic Variational Inference (c) Stochastic Gradient Langevin Dynamics (d) Gibbs Sampling (e) One-pass SMC
Lots of applications in machine learning
π§π’βπ π§0
ππ’ + ππ’ exp βπππ’ ,
β2, ππ 2 , ππ’ βΌ Ξ ππ β2, ππ 2
2, ππ 2, π, π
π π¨ π π½
π π
π¦ π¦π π1 π2 ππ
β¦β¦
π¦2 π¦1
sensor measure true location topic word
π¦ π1 π2 ππ
π π¦ π1 π π¦ π1:2 π π¦ π1:π β¦ β¦β¦ β¦β¦ prior π(π¦) π1 π2 ππ
π π¦ π1:π β π π¦ π1:πβ1 π ππ π¦
updated posterior current posterior likelihood
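A minimal sketch of this recursion for a 1-D model with an assumed Gaussian prior and likelihood, updating the posterior on a grid one observation at a time.

import numpy as np
from scipy.stats import norm

grid = np.linspace(-5, 5, 2001)
dx = grid[1] - grid[0]
posterior = norm.pdf(grid, loc=0.0, scale=2.0)        # start from the prior pi(x)
posterior /= posterior.sum() * dx

for o_t in [0.8, 1.1, 0.5, 1.4]:                      # observations arriving online
    posterior *= norm.pdf(o_t, loc=grid, scale=1.0)   # multiply by the likelihood p(o_t | x)
    posterior /= posterior.sum() * dx                 # renormalize: the updated posterior
    print(f"after o_t = {o_t:.1f}: posterior mean = {(grid * posterior).sum() * dx:+.3f}")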
Online Bayesian Inference
An ideal algorithm should avoid the drawbacks of standard approaches (e.g., variational inference and MCMC), which
– require a complete scan of the data
– require re-optimization for every new observation
– are prescribed algorithms that optimize the final posterior p(x | o_{1:m})
– cannot exploit the structure of the sequential inference problem
π¦ π1 π2 ππ
π π¦ π1 π π¦ π1:2 π π¦ π1:π β¦ β¦β¦ β¦β¦ prior π(π¦) π1 π2 ππ
οΌ Sequential monte Carlo (Doucet et al., 2001; Balakrishnan&Madigan, 2006) β the state of art for online Bayesian Inference β but suffers from path degeneracy problem in high dimensions β rejuvenation steps can help but will violate online constraints (Canini et al., 2009)
π¦ π1 π2 ππ
π π¦ π1 π π¦ π1:2 π π¦ π1:π β¦ β¦β¦ β¦β¦ prior π(π¦) π1 π2 ππ Can we learn to perform efficient and effective sequential Bayesian update?
οΌ Kernel Bayesβ Rule (Fukumizu et al., 2012) β the posterior is represented as an embedding ππ = π½π(π¦|π1:π)[π π¦ ] β ππ+1 = π§( ππ , ππ+1) β views the Bayes update as an operator in reproducing kernel Hilbert space (RKHS) β conceptually nice but is limited in practice
π¦ π1 π2 ππ
π π¦ π1 π π¦ π1:2 π π¦ π1:π β¦ β¦β¦ β¦β¦ prior π(π¦) π1 π2 ππ
updated embedding current embedding
π΄0 = {π¦0
1, β¦ , π¦0 π}
π΄0 βΌ π(π¦) π¦1
π = π¦0 π + π
π π΄0, π1, π¦(π’) ππ’ π΄1 = {π¦1
1, β¦ , π¦1 π}
π΄1 βΌ π(π¦|π1)
Particle Flow
π΄0 = {π¦0
1, β¦ , π¦0 π}, sampled i.i.d. from prior π(π¦)
βΉ solution π¦1
π = π¦(π)
ππ¦ ππ’ = π π΄0, π1, π¦(π’) , βπ’ β [0, π] and π¦ 0 = π¦0
π
β Mass can neither be created nor destroyed β nor can it βteleportβ from one place to another ππ π¦, π’ ππ’ = βπΌ
π¦ β (ππ)
ππ’ = π, then the change in log-density follows the differential equation
π log π π¦, π’ ππ’ = βπΌ
π¦ β π
β ππ
ππ’ is material derivative that defines the rate of change of π in a given particle as
it moves along its trajectory π¦ = π¦(π’) β ππ
ππ’ is partial derivative that defines the rate of change of π at a particular point π¦
Particle Flow for Sequential Bayesian Inference
π¦π+1
π
= π¦π
π + π
π π΄π, ππ+1, π¦(π’) ππ’ βlog ππ+1
π
= βlog ππ
π + π
πΌ
π¦ β π π΄π, ππ+1, π¦(π’) ππ’
π΄0 = {π¦0
1, β¦ , π¦0 π}
π΄1 = {π¦1
1, β¦ , π¦1 π}
π΄2 = {π¦2
1, β¦ , π¦2 π}
β¦β¦
π
π π΄0, π1, π¦(π’) ππ’
π
π π΄1, π2, π¦(π’) ππ’
π
π π΄2, π3, π¦(π’) ππ’
π¦ π1 π2 ππ
π π¦ π1 π π¦ π1:2 π π¦ π1:π β¦ β¦β¦ β¦β¦ prior π(π¦) π1 π2 ππ
A simple Gaussian Example
π+π0)
β E.g. π in the form of π(π, π¦(π’)) wonβt be able to handle different π0.
π¦ π = π¦ 0 +
π
π(ππππ£π’π‘)ππ’ π(π¦) π¦ 0 βΌ π(π¦|π1) π¦ π’ βΌ
Does a shared flow velocity f exist for different Bayesian inference tasks, involving different priors and different observations?
ππ¦ π’ = πΌ
π¦ log π π¦ π(π|π¦) ππ’ +
2 ππ₯ π’ , where ππ₯(π’) is a standard Brownian motion.
πβΞ¨ β π1(βπ) , the Fokker-Planck equation has a unique stationary solution in the form of Gibbs distribution, π π¦, β = πβΞ¨ π = π π¦ π π π¦ π = π(π¦|π)
the Fokker-Planck equation ππ ππ’ = βπΌ
π¦ β ππΌ π¦ log π π¦ π π π¦
+ Ξπ¦π π¦, π’ which is in the form of Continuity Equation.
ππ¦ ππ’ = πΌ
π¦ log π π¦ π(π|π¦) β πΌ π¦ log π π¦, π’ ,
its probability density π(π¦, π’) converges to the posterior π(π¦|π) as π’ β β. = βπΌ
π¦ β (π(πΌ π¦ log π π¦ π(π|π¦) β πΌ π¦ log π(π¦, π’))),
π
Closed Loop to Open Loop
The deterministic flow above is a closed-loop (feedback) control: the velocity depends not only on the position x and time s, but also on the flow state ρ(x, s).
Optimization problem:
  min_f  D( ρ(x, T), p(x | 𝒪) )
  s.t.   dx/ds = ∇_x log[ π(x) p(𝒪 | x) ] − f
The optimal open-loop velocity depends on the prior and the likelihood instead of the evolving density:
  dx/ds = ∇_x log[ π(x) p(𝒪 | x) ] − f*( π(x), p(𝒪 | x), x, s )
Parameterization
β use samples π΄ as surrogates, feature space embedding β Ideally, if ππ΄ is an injective mapping from the space of probability measures
ππ΄ π β
π΄
π π¦ π π¦ ππ¦ β 1 π
π=1 π
π π¦π , π¦π βΌ π
π π΄, π, π¦ π’ , π’ = β 1 π
π=1 π
π π¦π , π, π¦ π’ , π’ ππ¦ ππ’ = πΌ
π¦ log π π¦ π(π|π¦) β π₯β(π π¦ , π π π¦ , π¦, π’)
Multi-task Framework
β containing multiple inference tasks, with diverse prior and observations
Loss Function
β π° =
π=1 π
πΏπ(ππ(π¦)||π(π¦|π1:π)) =
π=1 π π=1 π
log ππ
π β log π(π¦π π , π1:π) + ππππ‘π’.
prior likelihood π observations
π° β (π π¦ , π β π¦ , π1, π2, β¦ , ππ ) π¦π+1
π
= π¦π
π + π
π π΄π, ππ+1, π¦ π’ , π’ ππ’ βlog ππ+1
π
= βlog ππ
π + π
πΌ
π¦ β π π΄π, ππ+1, π¦ π’ , π’ ππ’
Multivariate Gaussian Model
Experiment Setting
– multiple inference tasks with different prior and observation distributions
Result
[Result figures omitted]
Gaussian Mixture Model
Likelihood: p(o | x) = ½ 𝒩(o; x_1, 1) + ½ 𝒩(o; x_1 + x_2, 1)
Our more challenging experimental setting:
Visualization of the evolution of posterior density from left to right.
Hidden Markov Model – Linear Dynamical System
π¦π π1 π2 ππ
β¦β¦
π¦2 π¦1
Transition sampling + MPF operator:
1. x_t^i = A x_{t−1}^i + ε_t    (sample the transition: particles for p(x_t | o_{1:t−1}))
2. x_t^i ← ℱ( X_t, x_t^i, o_t )    (apply the MPF operator: a Bayesian update from p(x_t | o_{1:t−1}) to p(x_t | o_{1:t}))
Marginal posteriors update:  p(x_1 | o_1) → p(x_2 | o_{1:2}) → ... → p(x_t | o_{1:t})
Bayesian Logistic Regression on MNIST dataset 8 vs 6
– each observation is a (feature, label) pair
Multi-task Environment
β The first two components account for more variability in the data
Multi-task Training
Testing as online learning:
(1) All algorithms start with a set of particles sampled from the prior.
(2) Each algorithm makes predictions for the encountered batch of 32 images.
(3) All algorithms observe the true labels of the encountered batch.
(4) Each algorithm updates its particles and then makes predictions for the next batch.
(predict → observe true labels → update particles → predict → observe true labels → update → ……)
[Plots: online prediction accuracy over batches for two test tasks]
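A minimal sketch of this online protocol with randomly generated stand-in data; update_particles below is a simple reweight-and-resample placeholder, where the talk's method would instead apply the learned particle-flow operator.

import numpy as np

rng = np.random.default_rng(0)
d, n_particles = 20, 256
particles = rng.normal(0.0, 1.0, size=(n_particles, d))        # weight vectors sampled from the prior

def predict(particles, X):                                      # Bayesian model averaging over particles
    p = 1.0 / (1.0 + np.exp(-X @ particles.T))                  # (batch, n_particles)
    return (p.mean(axis=1) > 0.5).astype(int)

def update_particles(particles, X, y):                          # placeholder update: reweight + resample
    logits = X @ particles.T
    loglik = (y[:, None] * logits - np.logaddexp(0, logits)).sum(axis=0)
    w = np.exp(loglik - loglik.max())
    w /= w.sum()
    return particles[rng.choice(n_particles, size=n_particles, p=w)]

for batch in range(5):                                          # a stream of batches of 32 images
    X = rng.normal(0.0, 1.0, size=(32, d))
    y = (rng.random(32) < 0.5).astype(int)
    y_hat = predict(particles, X)                               # 1. predict
    acc = (y_hat == y).mean()                                   # 2. observe the true labels, score
    particles = update_particles(particles, X, y)               # 3. update the particles
    print(f"batch {batch}: accuracy {acc:.2f}")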
π΄0 = {π¦0
1, β¦ , π¦0 π}
π΄0 βΌ π(π¦) π¦1
π = π¦0 π + π
π π΄0, π1, π¦ π’ , π’ ππ’ π΄1 = {π¦1
1, β¦ , π¦1 π}
π΄1 βΌ π(π¦|π1)