
Undirected Graphical Models: Undirected Graphs. Chris Williams, School of Informatics, University of Edinburgh (PowerPoint PPT presentation)



  1. Undirected Graphical Models: Undirected Graphs
     Chris Williams, School of Informatics, University of Edinburgh

     Overview
     • Undirected graphs
     • Conditional independence
     • Potential functions, energy functions
     • Examples: multivariate Gaussian, MRF
     • Boltzmann machines, learning rule
     • Reading: Jordan section 2.2 [chs 19, 20 for additional reading (not examinable)]

     Undirected graphs
     • A graph G = (X, E)
     • X is a set of nodes, in one-to-one correspondence with a set of random variables
     • E is a set of undirected edges between the nodes

     Global conditional independence
     • Consider arbitrary disjoint index subsets A, B and C
     • If every path from a node in X_A to a node in X_C includes at least one node in X_B, then I(X_A, X_C | X_B)
     • This is a naïve graph-theoretic separation condition (cf. d-separation)
     [Figure: nodes X_A, X_B, X_C, with X_B separating X_A from X_C]
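The separation condition lends itself to a simple test: delete the conditioning nodes B and check whether A and C are still connected. Below is a minimal Python sketch (my own illustration, not from the slides); the edge list follows the six-node example graph used later, and the choices of A, B, C are illustrative.

```python
# Sketch (not from the slides): test the separation condition by removing the
# conditioning set B and checking whether any path from A to C survives.
from collections import deque

def separated(edges, A, B, C):
    """Return True if every path from a node in A to a node in C passes through B."""
    # Build adjacency lists, dropping all nodes in B.
    adj = {}
    for u, v in edges:
        if u in B or v in B:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Breadth-first search from A; if we never reach C, then A and C are separated by B.
    seen, frontier = set(A), deque(A)
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v in C:
                return False
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Edges of the six-node example graph from the slides.
edges = [(1, 2), (1, 3), (3, 5), (2, 5), (5, 6), (2, 6), (2, 4)]
print(separated(edges, A={1}, B={2, 3}, C={4, 6}))  # True: {2, 3} blocks all paths
print(separated(edges, A={1}, B={3}, C={6}))        # False: the path 1-2-6 avoids node 3
```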

  2. Graphs and Cliques
     • For directed graphs we use P(X) = ∏_i P(X_i | Pa_i), which gives a notion of locality
     • For undirected graphs, locality depends on the notion of cliques
     • A clique of a graph is a fully-connected set of nodes
     • A maximal clique is a clique which cannot be extended to include additional nodes without losing the property of being fully connected

     Parameterization
     • The conditional independence properties of undirected graphs imply a representation of the joint probability as a product of local functions defined on the maximal cliques of the graph:
       p(x) = (1/Z) ∏_{C ∈ 𝒞} ψ_{X_C}(x_C),   with   Z = Σ_x ∏_{C ∈ 𝒞} ψ_{X_C}(x_C)
     • Each ψ_{X_C}(x_C) is a strictly positive, real-valued function, but otherwise arbitrary
     • Z is called the partition function
     • The equivalence of the conditional independence properties and the clique factorization form is the Hammersley-Clifford theorem
     • Potential functions are in general neither conditional nor marginal probabilities
     • They have a natural interpretation as agreement, constraint, or energy
     • A potential function favours certain local configurations by assigning them larger values
     • Global configurations that have high probability are, roughly speaking, those that satisfy as many of the favoured local configurations as possible
     [Figure: six-node example graph x1, ..., x6 with P(x) = ψ(x1,x2) ψ(x1,x3) ψ(x3,x5) ψ(x2,x5,x6) ψ(x2,x4) / Z]
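To make the clique factorization concrete, here is a small brute-force sketch (my own illustration, not from the slides) that evaluates p(x) = (1/Z) ∏_C ψ_{X_C}(x_C) over all 2^6 states of six binary variables, using arbitrary made-up potentials on the maximal cliques of the example graph.

```python
# Sketch (not from the slides): brute-force evaluation of the clique
# factorization p(x) = (1/Z) * prod_C psi_C(x_C) for six binary variables.
import itertools
import math

# Arbitrary strictly positive potentials, one per maximal clique of the
# example graph (0-based indices: 0..5 correspond to x1..x6).
potentials = {
    (0, 1):    lambda a, b: math.exp(1.0 if a == b else -1.0),   # psi(x1, x2)
    (0, 2):    lambda a, b: math.exp(0.5 * a * b),                # psi(x1, x3)
    (2, 4):    lambda a, b: math.exp(-0.3 * (a - b) ** 2),        # psi(x3, x5)
    (1, 4, 5): lambda a, b, c: 1.0 + a * b * c,                   # psi(x2, x5, x6)
    (1, 3):    lambda a, b: 2.0 if a == b else 0.5,               # psi(x2, x4)
}

def unnormalised(x):
    """Product of clique potentials evaluated at the full configuration x."""
    p = 1.0
    for clique, psi in potentials.items():
        p *= psi(*(x[i] for i in clique))
    return p

# Partition function: sum of the unnormalised product over all 2^6 states.
states = list(itertools.product([0, 1], repeat=6))
Z = sum(unnormalised(x) for x in states)

p = {x: unnormalised(x) / Z for x in states}
print("Z =", Z)
print("sum of p(x) =", sum(p.values()))   # ~1.0, as required of a distribution
```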

  3. Energy functions
     • Enforce positivity by defining ψ_{X_C}(x_C) = exp{ -H_{X_C}(x_C) }
     • The negative sign is conventional (high probability corresponds to low energy)
     • p(x) = (1/Z) ∏_{C ∈ 𝒞} ψ_{X_C}(x_C) = (1/Z) ∏_{C ∈ 𝒞} exp{ -H_{X_C}(x_C) }
     • Energy H(x) = Σ_{C ∈ 𝒞} H_{X_C}(x_C)
     • Boltzmann distribution: p(x) = (1/Z) exp{ -H(x) }

     Local Markov Property
     • Denote all nodes by V
     • For a vertex a, let ∂a denote the boundary of a, i.e. the set of vertices in V \ a that are neighbours of a
     • Local Markov property: for any vertex a, the conditional distribution of X_a given X_{V \ a} depends only on X_{∂a}

     Example I: Multivariate Gaussian
     • p(x) ∝ exp{ -(1/2) xᵀ Σ⁻¹ x }
     • It is the zeros in Σ⁻¹ that define the missing edges in the graph and hence the conditional independence structure
     [Figure: the same six-node example graph with P(x) = ψ(x1,x2) ψ(x1,x3) ψ(x3,x5) ψ(x2,x5,x6) ψ(x2,x4) / Z]
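The remark about zeros in Σ⁻¹ can be checked numerically. The following sketch (my own illustration using numpy, not from the slides) builds a precision matrix for the chain x1 - x2 - x3 with a zero in the (1,3) entry and shows that the conditional covariance of (x1, x3) given x2 has a zero off-diagonal entry, even though the marginal covariance does not.

```python
# Sketch (not from the slides): in a zero-mean Gaussian, a zero entry in the
# precision matrix Sigma^{-1} means the corresponding pair of variables is
# conditionally independent given all the others.  Chain graph x1 - x2 - x3.
import numpy as np

K = np.array([[ 2.0, -1.0,  0.0],    # precision matrix Sigma^{-1}
              [-1.0,  2.5, -1.0],    # K[0, 2] = 0  <=>  no edge x1 - x3
              [ 0.0, -1.0,  2.0]])

Sigma = np.linalg.inv(K)
print("Sigma =\n", Sigma)            # generally has no zeros: x1 and x3 are
                                     # marginally dependent

# The conditional covariance of (x1, x3) given x2 is the inverse of the
# corresponding sub-block of the precision matrix.
K_13 = K[np.ix_([0, 2], [0, 2])]
cond_cov = np.linalg.inv(K_13)
print("Cov(x1, x3 | x2) =\n", cond_cov)   # off-diagonal entry is 0:
                                          # x1 is independent of x3 given x2
```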

  4. Example II: Markov Random Field
     • Discrete random variables
     • Ising model in statistical physics (spins up/down)
     • MRF models are used in image analysis, e.g. segmentation into regions: define energies such that blocks of the same label are preferred (Geman and Geman, 1984)

     Boltzmann machines
     • Hinton and Sejnowski, 1983
     • Binary units ±1
     • p(x) = (1/Z) exp{ (1/2) Σ_{ij} w_ij x_i x_j }
     • w_ij = w_ji and w_ii = 0
     • Set x_0 = 1 (bias unit)
     • (1/2) Σ_{ij} w_ij x_i x_j = Σ_{i<j} w_ij x_i x_j
     • Can have hidden units as well as output (visible) units
     • The potential function is not an arbitrary function of the cliques, but is based only on pairwise links (this can be generalized)
     • P(X_i = 1 | rest) = σ(2 h_i), where h_i = Σ_j w_ij x_j
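The logistic conditional P(X_i = 1 | rest) = σ(2 h_i) follows from the pairwise energy; the sketch below (my own check, not from the slides, with random example weights) verifies it against the exact conditional obtained by enumerating the two possible values of X_i.

```python
# Sketch (not from the slides): check P(X_i = 1 | rest) = sigma(2 h_i) for a
# small Boltzmann machine against the exact conditional from the joint.
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))
W = (W + W.T) / 2.0            # symmetric weights, w_ij = w_ji
np.fill_diagonal(W, 0.0)       # no self-connections, w_ii = 0

def unnorm(x):                 # exp{ 0.5 * sum_ij w_ij x_i x_j }, x_i in {-1, +1}
    return np.exp(0.5 * x @ W @ x)

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, -1.0, -1.0, 1.0])   # arbitrary setting of the units
i = 2

# Exact conditional by enumerating X_i in {-1, +1} with the rest held fixed.
x_plus, x_minus = x.copy(), x.copy()
x_plus[i], x_minus[i] = +1.0, -1.0
exact = unnorm(x_plus) / (unnorm(x_plus) + unnorm(x_minus))

# Logistic form from the slide: sigma(2 h_i) with h_i = sum_j w_ij x_j.
h_i = W[i] @ x                 # w_ii = 0, so x_i's own value does not contribute
print(exact, sigma(2 * h_i))   # the two numbers agree
```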

  5. Boltzmann machine learning rule
     Denote the visible units by x and the hidden units by y.

     p(x, y) = (1/Z) exp{ Σ_k θ_k φ_k(x, y) }
     p(x) = (1/Z) Σ_y exp{ Σ_k θ_k φ_k(x, y) }
     log p(x) = log Σ_y exp{ Σ_k θ_k φ_k(x, y) } - log Z

     This is the general form of a log-linear model.

     ∂ log p(x) / ∂θ_l = Σ_y φ_l(x, y) p(y | x) - Σ_{x,y} φ_l(x, y) p(x, y)
                       = ⟨φ_l(x, y)⟩_+ - ⟨φ_l(x, y)⟩_-

     • Features φ_k(x, y) are the pairwise potentials for a Boltzmann machine
     • Parameters θ_k correspond to the weights in the Boltzmann machine
     • ⟨·⟩_+ denotes the clamped phase (with x clamped on the visible units); ⟨·⟩_- denotes the free-running phase (all units unclamped)
     • Learning stops when the statistics match in both phases
     • The statistics could be computed exactly (using the junction tree algorithm), but often this is intractable, so stochastic sampling is used
     • The Boltzmann machine learning rule is gradient based; one can also use Iterative Scaling algorithms (see Jordan ch 20) to update the θ_k
     • Boltzmann machine learning can be slow due to the need to use MCMC techniques: the gradient is the difference of two noisy estimates
     • Hinton (1999, 2000) introduced the Products of Experts (PoE) architecture, which may get round some of these difficulties

     Gibbs sampler
     Loop T times
       for each unit i to be sampled:
         compute h_i and sample from P(X_i | rest)
       end for
     end loop
     • This is a Markov Chain Monte Carlo (MCMC) method; under general conditions it converges to the correct distribution as T → ∞
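As a rough illustration of the learning rule (my own sketch, not from the slides), the code below trains a fully-visible Boltzmann machine, a deliberate simplification with no hidden units so that the clamped statistics ⟨x_i x_j⟩_+ are plain data averages, and estimates the free-running statistics ⟨x_i x_j⟩_- with the Gibbs sampler described above.

```python
# Sketch (not from the slides): Boltzmann machine learning rule
#   delta w_ij  proportional to  <x_i x_j>_+  -  <x_i x_j>_-
# for the fully-visible case: clamped statistics come straight from the data,
# and only the free-running phase needs Gibbs sampling.
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

def gibbs_sweeps(W, x, n_sweeps):
    """Gibbs sampling: resample each unit from P(X_i | rest) = sigma(2 h_i)."""
    x = x.copy()
    for _ in range(n_sweeps):
        for i in range(len(x)):
            h_i = W[i] @ x
            x[i] = 1.0 if rng.random() < sigma(2 * h_i) else -1.0
    return x

def learn(data, n_iters=200, lr=0.05, n_chains=50, n_sweeps=5):
    n = data.shape[1]
    W = np.zeros((n, n))
    clamped = data.T @ data / len(data)            # <x_i x_j>_+ from the data
    chains = rng.choice([-1.0, 1.0], size=(n_chains, n))
    for _ in range(n_iters):
        # Free-running phase: estimate <x_i x_j>_- by Gibbs sampling.
        chains = np.array([gibbs_sweeps(W, c, n_sweeps) for c in chains])
        free = chains.T @ chains / n_chains
        W += lr * (clamped - free)                 # gradient ascent on log-likelihood
        np.fill_diagonal(W, 0.0)                   # keep w_ii = 0
    return W

# Toy data: three strongly correlated +-1 units (each copies a common sign,
# flipped independently with probability 0.1).
base = rng.choice([-1.0, 1.0], size=(500, 1))
data = np.hstack([base, base, base]) * np.where(rng.random((500, 3)) < 0.9, 1.0, -1.0)
W = learn(data)
print(W)   # the learned off-diagonal weights should come out positive
```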

