Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani - PowerPoint PPT Presentation

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and Discovery Carnegie Mellon University, USA http://www.cs.cmu.edu/ ∼ zoubin/ Work with: Iain Murray and Hyun-Chul Kim Aug 2003, CMU ML Lunch

Undirected Graphical Models An Undirected Graphical Model (UGM; or Markov Network) is a graphical representation of the dependence relationships between a set of random variables. In an UGM, the joint probability over M variables x = [ x 1 , . . . , x M ] , can be written in a factored form: J p ( x ) = 1 � g j ( x C j ) Z j =1 Here the g j are non-negative potential functions over subsets of variables C j ⊆ { 1 , . . . , M } and the notation: x S ≡ [ x m : m ∈ S ] . � � The normalization constant (a.k.a. partition function) is Z = g j ( x C j ) j x We represent this type of probabilistic model graphically. Graph Definition: Let each variable be a node. Connect nodes i and k if there exists a set C j such that both i ∈ C j and k ∈ C j . These sets form the cliques of the graph (fully connected subgraphs).

Undirected Graphical Models: An Example B A C D E p ( A, B, C, D, E ) = 1 Z g ( A, C ) g ( B, C, D ) g ( C, D, E ) Markov Property: Every node is conditionally independent from its non- neighbors given its neighbors. Conditional Independence: A ⊥ ⊥ B | C ⇔ p ( A | B, C ) = p ( A | C ) for p ( B, C ) > 0 also A ⊥ ⊥ B | C ⇔ p ( A, B | C ) = p ( A | C ) p ( B | C ) .

Applications of Undirected Graphical Models • Markov Random Fields in Vision, Bioinformatics • Conditional Random Fields, and Exponential Language Models, e.g.: �� p ( s ) = 1 Zp 0 ( s ) exp λ i f i ( s ) i • Products of Experts: p ( x ) = 1 � p j ( x | θ j ) Z j • Semi-Supervised Learning: ⋆ Boltzmann Machines

Boltzmann Machines Undirected graph over a vector of binary variables s i ∈ { 0 , 1 } . Variables can be hidden or visible (observed).   p ( s | W ) = 1   � Z exp W ij s i s j   j<i where Z is the partition function (normalizer) Maximum Likelihood Learning Algorithm: a gradient version of EM • E step involves computing averages w.r.t. p ( s H | s V , W ) (“clamped phase”). This could be done via an exact message passing algorithm (e.g. Junction Tree) or more usually an approximate method such as Gibbs sampling. • M step also requires gradients w.r.t. Z , which can be computed by averages w.r.t. p ( s | W ) (“unclamped phase”). Hebbian and anti-Hebbian rule: ∆ W ij = η [ � s i s j � c − � s i s j � u ]

Bayesian Learning Prior over parameters: p ( W ) Posterior over parameters, given data set S = { s (1) , . . . s ( N ) } , p ( W |S ) = p ( W ) p ( S| W ) p ( S ) Model Comparison (for example for two different graph structures m , m ′ ) using Bayes factors: p ( m |S ) p ( S| m ) p ( m ′ |S ) = p ( m ) p ( m ′ ) p ( S| m ′ ) where the marginal likelihood is: � p ( S| m ) = p ( S| W, m ) p ( W | m ) dW

Why Bayesian Learning? • Useful prior knowledge can be included (e.g. sparsity, domain knowledge) • Avoids overfitting (because nothing needs to be fit) • Error bars on all parameters, and predictions • Model and feature selection

A Simple Idea Define the following joint distribution of weights W and matrix of binary variables S , organized into N rows (data vectors) and M columns (features, variables). Some variables on some data points may be hidden and some may be observed.   M N M p ( S, W ) = 1  − 1   � � � W 2 Z exp ij + W ij s ni s nj 2 σ 2  i,j =1 n =1 j<i � dW � S exp { . . . } is a nasty partition function. Where Z = Gibbs sampling in this model is very easy! • Gibbs sample s ni given all other s and W : Bernouilli, easy as usual. • Gibbs sample W given s : diagonal multivariate Gaussian, easy as well. What is wrong with this approach?

...a Strange Prior on W   M N M p ( S, W ) = 1  − 1   � � � W 2 Z exp ij + W ij s ni s nj 2 σ 2  i,j =1 n =1 j<i This defines a Boltzmann machine for the data given W , but defines a somewhat strange and hard to compute “prior” on the weights. What is the prior on W?     � � � p ( S, W ) ∝ N (0 , σ 2 I ) p ( W ) = exp W ij s ni s nj   n,j<i S S where the second factor is data-size dependent, so it’s not a valid hierarchical Bayesian model of the kind W → S . The second factor can be written as: N           � � � � = Z ( W ) N exp  = exp W ij s ni s nj W ij s i s j     s S n,j<i j<i This will not work!

Three Families of Approximations In order to do Bayesian inference in undirected models with nontrivial partition functions we can develop three classes of methods:     � � • Approximate Partition Function: Z ( W ) = exp W ij s i s j   j<i s • Approximate Ratio of Partition Functions.     Z ( W )   � � ( W ij − W ′ p ( s | W ) Z ( W ′ ) =  exp ij ) s i s j    j<i s ∂ ln Z ( W ) � • Approximate Gradients. = p ( s | W ) s i s j ∂W ij s The above quantities can be approximated using modern tools developed in the machine learning/statistics/physics communities. Surprisingly, none of the following methods have been explored!

I. Metropolis with Nested Sampling Simplest sampling approach: Metropolis Sampling • Start with random weight matrix W • Perturb it with a small-radius Gaussian proposal distribution W → W ′ • Accept the change with probability min [1 , a ] , where � Z ( W )   � N a = p ( S | W ′ ) p ( W ′ ) p ( W ′ )   s ( n ) s ( n ) � W ′ � � p ( S | W ) p ( W ) = exp ij − W ij i j Z ( W ′ ) p ( W )   n,i<j The partition function ratio is nasty. But one can estimate it using an MCMC sampling inner loop: ��   � s exp j<i W ij s i s j � � Z ( W )   � ( W ij − W ′ � = Z ( W ′ ) = exp ij ) s i s j �� j<i W ′ � s exp ij s i s j   j<i p ( s | W ′ ) too slow: inner loop can take exponential time

II. Naive Mean-Field Metropolis Same as above, but use naive mean-field to estimate the partition function. Jensen’s inequality gives us: � � ln Z ( W ) = ln exp { W ij s i s j } j<i s � � ≥ q ( s ) W ij s i s j + H ( q ) = F ( W, q ) j<i s i (1 − m i ) (1 − s i ) and H is the entropy. i m s i where q ( s ) = � Gradient-based variant: use expectations to compute approximate gradients

III. Tree Mean-Field Metropolis Same as above, but use tree-structured mean-field to estimate the partition function. Jensen’s inequality gives us: � � ln Z ( W ) = ln exp { W ij s i s j } j<i s � � ≥ q ( s ) W ij s i s j + H ( q ) = F ( W, q ) j<i s where q ( s ) ∈ Q t ree , the set of tree-structured distributions and H is the entropy. Gradient-based variant: use expectations to compute approximate gradients

IV. Loopy Metropolis Belief Propagation (BP) is an exact method for inference on trees. Run belief propagation (BP) on the (loopy) graph and use the Bethe free energy as an estimate of Z ( W ) . Loopy BP provides on non-trees: 1. approximate marginals b i ≈ p ( s i | W ) 2. approximate pairwise marginals b ij ≈ p ( s i , s j | W ) These marginals are fixed points of the Bethe Free energy F Bethe = U − H Bethe ≈ − log Z ( W ) where U is the expected energy and the approximate entropy is: � � � � H Bethe = − b ij ( s i , s j ) log b ij ( s i , s j ) − (1 − ne( i )) b i ( s i ) log b i ( s i ) . s i ,s j s i ( ij ) i Gradient-based variant: use expectations to compute approximate gradients

V. The Langevin MCMC Sampling Procedure So far, we’ve been describing Metropolis procedures, but these suffer from random walk behaviour. Langevin makes use of gradient information and resembles noisy steepest descent. This is uncorrected Langevin: ij = W ij + ǫ 2 ∂ W ′ log p ( S, W ) + ǫ n ij 2 ∂W ij where n ∼ N (0 , 1) . There are many ways of estimating gradients, but we use a method based on Contrastive Divergence (Hinton, 2000).

VI. Pseudo-Likelihood Based Approximations   p ( s | W ) = 1   � Z exp W ij s i s j   j<i The pseudo-likelihood is defined as exp { 1 � j � = i W ij s j } 2 s i � � p ( s | W ) ≈ p ( s i | s \ i , W ) = 1 + exp { 1 � j � = i W ij s j } 2 i i   � � 1   � = exp W ij s i s j i (1 + exp { 1 � � j � = i W ij s j } ) 2   j<i Therefore the use of pseudo-likelihood corresponds to:     1   � � Z ( W ) ≈  1 + exp W ij s j  2   i j � = i Has not been tried yet—one can design and compare many other approaches.

Naive Mean Field vs Tree Mean Field Approximation more sparse (n=10, 0.3 large weights) more dense (n = 10, 0.6 large weights) 14 22 mf tree mf 13 tree 20 12 18 approx F < log Z(W) 11 16 log Z(W) 10 14 9 12 8 10 7 6 8 6 7 8 9 10 11 12 13 14 8 10 12 14 16 18 20 22 true log Z(W) approx F < log Z(W) The tree based approximation found an MST and then used Wiegerinck’s (UAI, 2000) variational approximation.

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani - PowerPoint PPT Presentation

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and Discovery Carnegie Mellon University,

Undirected Graphical Models Aaron Courville, Universit de Montral 2 (UNDIRECTED) GRAPHICAL

Probabilistic Graphical Models Probabilistic Graphical Models parameter learning in undirected

Undirected Graphical Models: Markov Random Fields Probabilistic Graphical Models Sharif

Graphical Models Graphical Models Bayesian Networks Siamak Ravanbakhsh Fall 2019 Previously on

Probabilistic Graphical Models Probabilistic Graphical Models Undirected Models Fall 2019

Probabilistic Graphical Models 10-708 Learning Completely Observed Learning Completely Observed

Graphical Models Graphical Models Relationship between the directed & undirected models

Directed Graphical Models + Undirected Graphical Models Matt Gormley Lecture 7 Sep. 18, 2019

Probabilistic Graphical Models Probabilistic Graphical Models Structure learning in Bayesian

Undirected Graphical Models Undirected Graphs Chris Williams, School of Informatics, University

Undirected graphical models Graph G : arbitrary undirected graph Useful when variables interact

Probabilistic Graphical Models Part II: Undirected Graphical Models Selim Aksoy Department of

Probabilistic Graphical Models Part II: Undirected Graphical Models Selim Aksoy Department of

Transforming Graphical System Models to Graphical Attack Models ! Joint work with Marieta

AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG)

Probabilistic Graphical Models Probabilistic Graphical Models Parameter learning in Bayesian

Physical Defaults & the Robust Probing Model Sebastian Faust, Vincent Grosso, Santos Merino

Designing Norms Whats the friendliest place on the web that CS 278 | Stanford University |

Cloud overview "The illiterate of the 21st century will not be those who cannot read and

Geometry of Voting Christian Klamler - University of Graz Estoril, 12 April 2010 Introduction 2

Using Seaborn Styles DATA VIS UALIZ ATION W ITH S EABORN Chris Moftt Instructor Setting

The neutron beta decay correlation program at SNS Dinko Po cani c University of Virginia

Reparameterization Gradient for Non-differentiable Models Wonyeol Lee Hangyeol Yu Hongseok

Privacy and Security of Online Social Networks at the Workplace Seda Grses COSIC/ESAT K.U.

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani - PowerPoint PPT Presentation

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and Discovery Carnegie Mellon University,

Undirected Graphical Models Aaron Courville, Universit de Montral 2 (UNDIRECTED) GRAPHICAL

Probabilistic Graphical Models Probabilistic Graphical Models parameter learning in undirected

Undirected Graphical Models: Markov Random Fields Probabilistic Graphical Models Sharif

Graphical Models Graphical Models Bayesian Networks Siamak Ravanbakhsh Fall 2019 Previously on

Probabilistic Graphical Models Probabilistic Graphical Models Undirected Models Fall 2019

Probabilistic Graphical Models 10-708 Learning Completely Observed Learning Completely Observed

Graphical Models Graphical Models Relationship between the directed &amp; undirected models

Directed Graphical Models + Undirected Graphical Models Matt Gormley Lecture 7 Sep. 18, 2019

Probabilistic Graphical Models Probabilistic Graphical Models Structure learning in Bayesian

Undirected Graphical Models Undirected Graphs Chris Williams, School of Informatics, University

Undirected graphical models Graph G : arbitrary undirected graph Useful when variables interact

Probabilistic Graphical Models Part II: Undirected Graphical Models Selim Aksoy Department of

Probabilistic Graphical Models Part II: Undirected Graphical Models Selim Aksoy Department of

Transforming Graphical System Models to Graphical Attack Models ! Joint work with Marieta

AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG)

Probabilistic Graphical Models Probabilistic Graphical Models Parameter learning in Bayesian

Physical Defaults &amp; the Robust Probing Model Sebastian Faust, Vincent Grosso, Santos Merino

Designing Norms Whats the friendliest place on the web that CS 278 | Stanford University |

Cloud overview &quot;The illiterate of the 21st century will not be those who cannot read and

Geometry of Voting Christian Klamler - University of Graz Estoril, 12 April 2010 Introduction 2

Using Seaborn Styles DATA VIS UALIZ ATION W ITH S EABORN Chris Moftt Instructor Setting

The neutron beta decay correlation program at SNS Dinko Po cani c University of Virginia

Reparameterization Gradient for Non-differentiable Models Wonyeol Lee Hangyeol Yu Hongseok

Privacy and Security of Online Social Networks at the Workplace Seda Grses COSIC/ESAT K.U.

Graphical Models Graphical Models Relationship between the directed & undirected models

Physical Defaults & the Robust Probing Model Sebastian Faust, Vincent Grosso, Santos Merino

Cloud overview "The illiterate of the 21st century will not be those who cannot read and