Probabilistic & Unsupervised Learning
Convex Algorithms in Approximate Inference
Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2017

Convexity

A convex function $f : X \to \mathbb{R}$ is one where
$$f(\alpha x_1 + (1 - \alpha) x_2) \le \alpha f(x_1) + (1 - \alpha) f(x_2)$$
for any $x_1, x_2 \in X$ and $0 \le \alpha \le 1$.

[Figure: between $x_1$ and $x_2$, the chord value $\alpha f(x_1) + (1 - \alpha) f(x_2)$ lies on or above the function value $f(\alpha x_1 + (1 - \alpha) x_2)$.]

A convex function has a global infimum (unless it is not bounded below), any local minimum is a global minimum, and there are efficient algorithms to find a minimum subject to convex constraints. Examples: linear programs (LP), quadratic programs (QP), second-order cone programs (SOCP), semi-definite programs (SDP), geometric programs.
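The defining inequality can be probed numerically. The following is a minimal sketch, not part of the original slides: the helper names `violates_convexity` and `looks_convex`, the sampling interval and the tolerance are all assumptions made for illustration. A randomised check of this kind can only refute convexity on the sampled range, never prove it.

```python
import math
import random

def violates_convexity(f, x1, x2, alpha, tol=1e-9):
    """True if f breaks f(a*x1 + (1-a)*x2) <= a*f(x1) + (1-a)*f(x2)."""
    lhs = f(alpha * x1 + (1 - alpha) * x2)
    rhs = alpha * f(x1) + (1 - alpha) * f(x2)
    return lhs > rhs + tol

def looks_convex(f, lo=-3.0, hi=3.0, trials=10000, seed=0):
    """Randomised test on [lo, hi]: returns False as soon as a violation is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        alpha = rng.random()
        if violates_convexity(f, x1, x2, alpha):
            return False
    return True

square_ok = looks_convex(lambda x: x * x)   # x^2 is convex everywhere
sine_ok = looks_convex(math.sin)            # sin is concave on [0, pi]
```

Here `square_ok` comes out true and `sine_ok` false, since the sine function lies above its chords on part of the sampled interval.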

Convexity and Approximate Inference

The theory of convex functions and convex spaces has long been central to optimisation. It has recently also found application in the theory of free energy and approximation:
◮ Linear programming relaxation as an approximate method to find the MAP assignment in Markov random fields.
◮ Attractive Markov random fields: the binary case is exact and related to a maximum flow-minimum cut problem in graph theory (a linear program); the method is approximate otherwise.
◮ Unified view of approximate inference as optimisation on the marginal polytope.
◮ Tree-structured convex upper bounds on the log partition function (convexified belief propagation).
◮ Learning graphical models using maximum margin principles and convex approximate inference.

LP Relaxation for Markov Random Fields

Consider a discrete Markov random field (MRF) with pairwise interactions:
$$p(X) = \frac{1}{Z} \prod_{(ij)} f_{ij}(X_i, X_j) \prod_i f_i(X_i) = \frac{1}{Z} \exp\Big( \sum_{(ij)} E_{ij}(X_i, X_j) + \sum_i E_i(X_i) \Big)$$

The problem is to find the most likely configuration $X^{\mathrm{MAP}}$:
$$X^{\mathrm{MAP}} = \operatorname*{argmax}_{X} \sum_{(ij)} E_{ij}(X_i, X_j) + \sum_i E_i(X_i)$$

Reformulate in terms of indicator variables:
$$b_i(k) = \delta(X_i = k), \qquad b_{ij}(k, l) = \delta(X_i = k)\,\delta(X_j = l)$$
where $\delta(\cdot) = 1$ if the argument is true, 0 otherwise. Each $b_i(k)$ is an indicator for whether variable $X_i$ takes on value $k$.

The indicator variables need to satisfy certain constraints:
◮ $b_i(k), b_{ij}(k, l) \in \{0, 1\}$: indicator variables are binary variables.
◮ $\sum_k b_i(k) = 1$: $X_i$ takes on exactly one value.
◮ $\sum_l b_{ij}(k, l) = b_i(k)$: pairwise indicators are consistent with single-site indicators.
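The indicator reformulation can be checked by brute force on a toy model. This is a sketch under assumptions: the three-variable model, its tables `E_node` and `E_edge`, and all function names are hypothetical and chosen only for illustration. It confirms that every assignment maps to indicators that satisfy the constraints and give the same objective value.

```python
from itertools import product

K = 3                                   # states per variable (hypothetical)
n = 3                                   # number of variables (hypothetical)
edges = [(0, 1), (1, 2), (0, 2)]        # pairwise interactions (ij)
# Hypothetical log-potentials E_i(k) and E_ij(k, l).
E_node = {i: [0.5 * i - 0.3 * k for k in range(K)] for i in range(n)}
E_edge = {e: [[1.0 if k == l else -0.2 * (k + l) for l in range(K)]
              for k in range(K)] for e in edges}

def direct_objective(x):
    """Sum of E_ij(x_i, x_j) over edges plus sum of E_i(x_i) over nodes."""
    return (sum(E_edge[(i, j)][x[i]][x[j]] for i, j in edges)
            + sum(E_node[i][x[i]] for i in range(n)))

def to_indicators(x):
    """b_i(k) = [x_i == k] and b_ij(k, l) = [x_i == k][x_j == l]."""
    b_node = {i: [int(x[i] == k) for k in range(K)] for i in range(n)}
    b_edge = {(i, j): [[int(x[i] == k) * int(x[j] == l) for l in range(K)]
                       for k in range(K)] for i, j in edges}
    return b_node, b_edge

def indicator_objective(b_node, b_edge):
    """The same objective, written as a linear function of the indicators."""
    pair = sum(b_edge[e][k][l] * E_edge[e][k][l]
               for e in edges for k in range(K) for l in range(K))
    single = sum(b_node[i][k] * E_node[i][k]
                 for i in range(n) for k in range(K))
    return pair + single

def satisfies_constraints(b_node, b_edge):
    """Normalisation and pairwise/single-site consistency."""
    one_value = all(sum(b_node[i]) == 1 for i in range(n))
    consistent = all(sum(b_edge[(i, j)][k]) == b_node[i][k]
                     for i, j in edges for k in range(K))
    return one_value and consistent

all_match = all(
    abs(direct_objective(x) - indicator_objective(*to_indicators(x))) < 1e-12
    and satisfies_constraints(*to_indicators(x))
    for x in product(range(K), repeat=n))
x_map = max(product(range(K), repeat=n), key=direct_objective)
```

Brute-force enumeration is only feasible for tiny models; it is used here purely to confirm that the two objectives agree.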

LP Relaxation for Markov Random Fields

The MAP assignment problem is equivalent to:
$$\operatorname*{argmax}_{\{b_i, b_{ij}\}} \sum_{(ij)} \sum_{k,l} b_{ij}(k, l)\, E_{ij}(k, l) + \sum_i \sum_k b_i(k)\, E_i(k)$$
with constraints:
$$\forall i, j, k, l: \quad b_i(k), b_{ij}(k, l) \in \{0, 1\}, \qquad \sum_k b_i(k) = 1, \qquad \sum_l b_{ij}(k, l) = b_i(k)$$

The linear programming relaxation for MRFs is:
$$\operatorname*{argmax}_{\{b_i, b_{ij}\}} \sum_{(ij)} \sum_{k,l} b_{ij}(k, l)\, E_{ij}(k, l) + \sum_i \sum_k b_i(k)\, E_i(k)$$
with constraints:
$$\forall i, j, k, l: \quad b_i(k), b_{ij}(k, l) \in [0, 1], \qquad \sum_k b_i(k) = 1, \qquad \sum_l b_{ij}(k, l) = b_i(k)$$
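Relaxing $\{0, 1\}$ to $[0, 1]$ can admit fractional points that score higher than any real assignment. The sketch below illustrates this on a hypothetical model that is not from the slides: three binary variables on a triangle whose potentials reward disagreement. All names and values are assumptions. The feasibility check also enforces the symmetric consistency condition $\sum_k b_{ij}(k, l) = b_j(l)$, which is assumed here to be implied by the constraints above.

```python
from itertools import product

edges = [(0, 1), (1, 2), (0, 2)]        # a triangle (hypothetical model)
n, K = 3, 2
# Hypothetical repulsive potentials: reward 1 when neighbours disagree.
E_edge = [[0.0, 1.0], [1.0, 0.0]]       # E_ij(k, l), shared by every edge
E_node = [0.0, 0.0]                     # E_i(k), shared by every node

def lp_objective(b_node, b_edge):
    pair = sum(b_edge[e][k][l] * E_edge[k][l]
               for e in edges for k in range(K) for l in range(K))
    single = sum(b_node[i][k] * E_node[k]
                 for i in range(n) for k in range(K))
    return pair + single

def is_feasible(b_node, b_edge, tol=1e-12):
    """Relaxed constraints: values in [0, 1], normalisation, consistency."""
    values = [v for b in b_node.values() for v in b]
    values += [v for b in b_edge.values() for row in b for v in row]
    if any(v < -tol or v > 1 + tol for v in values):
        return False
    if any(abs(sum(b_node[i]) - 1) > tol for i in range(n)):
        return False
    for i, j in edges:
        for k in range(K):
            row = sum(b_edge[(i, j)][k])                        # sum over l
            col = sum(b_edge[(i, j)][l][k] for l in range(K))   # sum over k
            if abs(row - b_node[i][k]) > tol or abs(col - b_node[j][k]) > tol:
                return False
    return True

def integral_point(x):
    """Indicator variables of an ordinary assignment x."""
    b_node = {i: [float(x[i] == k) for k in range(K)] for i in range(n)}
    b_edge = {(i, j): [[float(x[i] == k and x[j] == l) for l in range(K)]
                       for k in range(K)] for i, j in edges}
    return b_node, b_edge

# Best value over the 8 real assignments: at most two edges can disagree.
best_integral = max(lp_objective(*integral_point(x))
                    for x in product(range(K), repeat=n))

# A fractional point: every variable is "half 0, half 1", every edge disagrees.
frac_node = {i: [0.5, 0.5] for i in range(n)}
frac_edge = {e: [[0.0, 0.5], [0.5, 0.0]] for e in edges}
frac_value = lp_objective(frac_node, frac_edge)
```

The fractional point is feasible for the relaxation and attains 3, while the best integral assignment attains 2, so on this non-attractive model the LP optimum does not correspond to any configuration.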

LP Relaxation for Markov Random Fields

◮ The LP relaxation is a linear program, which can be solved efficiently.
◮ If the solution is integral, i.e. each $b_i(k), b_{ij}(k, l) \in \{0, 1\}$, then the solution corresponds to the MAP solution $X^{\mathrm{MAP}}$.
◮ The LP relaxation is a zero-temperature version of the Bethe free energy formulation of loopy BP, where the Bethe entropy term can be ignored.
◮ If the MRF is binary and attractive, then a slightly different reformulation of the LP relaxation will always give the MAP solution.
◮ Next: we show how to find the MAP solution directly for binary attractive MRFs using network flow.
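One way to write out the zero-temperature remark is sketched below. The notation is assumed here rather than taken from the slides: $T$ is a temperature, $\mathcal{L}$ is the set of $\{b_i, b_{ij}\}$ satisfying the relaxed constraints, $H(\cdot)$ is the Shannon entropy and $d_i$ is the number of neighbours of variable $i$.

```latex
\max_{b \in \mathcal{L}} \;
  \sum_{(ij)} \sum_{k,l} b_{ij}(k,l)\, E_{ij}(k,l)
  + \sum_i \sum_k b_i(k)\, E_i(k)
  + T \, H_{\mathrm{Bethe}}(b),
\qquad
H_{\mathrm{Bethe}}(b) = \sum_{(ij)} H(b_{ij}) - \sum_i (d_i - 1)\, H(b_i)
```

With $T = 1$ this objective is the negative Bethe free energy, whose stationary points correspond to fixed points of loopy BP. As $T \to 0$ the entropy term vanishes and what remains is the linear objective of the LP relaxation.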

Attractive Binary MRFs and Max Flow-Min Cut

Binary MRFs:
$$p(X) = \frac{1}{Z} \exp\Big( \sum_{(ij)} W_{ij}\, \delta(X_i = X_j) + \sum_i c_i X_i \Big)$$

The binary MRF is attractive if $W_{ij} \ge 0$ for all $i, j$.
◮ Neighbouring variables 'prefer' to be in the same state.
◮ No loss of generality: any Boltzmann machine with positive interactions can be reparametrised to this form.
◮ Many practical MRFs are attractive, e.g. image segmentation, webpage classification.
◮ The MAP assignment of $X$ can be found efficiently by converting the problem into a maximum flow-minimum cut program.
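The reparametrisation of a Boltzmann machine can be made explicit as follows. This is a sketch in which the couplings $J_{ij}$, the biases $h_i$ and the neighbour set $\mathrm{ne}(i)$ are notation assumed here, with the Boltzmann machine's log-probability taken to be $\sum_{(ij)} J_{ij} x_i x_j + \sum_i h_i x_i$ up to a constant.

```latex
\begin{align*}
\delta(x_i = x_j) &= x_i x_j + (1 - x_i)(1 - x_j) = 2 x_i x_j - x_i - x_j + 1
  \qquad (x_i, x_j \in \{0, 1\}) \\
\Rightarrow\quad J_{ij}\, x_i x_j
  &= \tfrac{1}{2} J_{ij}\, \delta(x_i = x_j)
   + \tfrac{1}{2} J_{ij}\, (x_i + x_j) - \tfrac{1}{2} J_{ij} \\
\sum_{(ij)} J_{ij}\, x_i x_j + \sum_i h_i x_i
  &= \sum_{(ij)} W_{ij}\, \delta(x_i = x_j) + \sum_i c_i x_i + \text{constant},
  \qquad W_{ij} = \tfrac{1}{2} J_{ij}, \quad
  c_i = h_i + \tfrac{1}{2} \sum_{j \in \mathrm{ne}(i)} J_{ij}
\end{align*}
```

In particular $J_{ij} \ge 0$ gives $W_{ij} \ge 0$, so positive interactions map to an attractive MRF in the form above.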

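Attractive Binary MRFs and Max Flow-Min Cut

The MAP problem:
$$\operatorname*{argmax}_{x} \sum_{(ij)} W_{ij}\, \delta(x_i = x_j) + \sum_i c_i x_i$$

Construct a network as follows:
1. Edges $(ij)$ are undirected with weight $\lambda_{ij} = W_{ij}$;
2. Add a source $s$ and a sink $t$ node;
3. $c_i > 0$: Connect the source node to variable $i$ with weight $\lambda_{si} = c_i$;
4. $c_j < 0$: Connect variable $j$ to the sink node with weight $\lambda_{jt} = -c_j$.

[Figure: example network in which neighbouring variables $i$ and $j$ are joined by an edge of weight $W_{ij}$, variables with positive bias are attached to the source with weight $+c_i$, and variables with negative bias are attached to the sink with weight $-c_j$.]

A cut is a partition of the nodes into $S$ and $T$ with $s \in S$ and $t \in T$. The weight of the cut is
$$\Lambda(S, T) = \sum_{i \in S,\, j \in T} \lambda_{ij}$$

The minimum cut problem is to find the cut with minimum weight.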
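The construction above can be sketched in a few lines. The four-variable model (`W`, `c`) and the function names are hypothetical, chosen only for illustration, and the minimum cut is found by enumerating every partition, which is only viable for tiny networks.

```python
from itertools import product

# Hypothetical attractive binary MRF on 4 variables (all W_ij >= 0).
W = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 1.5, (0, 3): 0.5}
c = [3.0, -0.4, -2.0, 0.3]
n = len(c)

def build_network(W, c):
    """Directed arc weights {(u, v): lambda_uv} with source 's' and sink 't'."""
    cap = {}
    for (i, j), w in W.items():
        cap[(i, j)] = w              # an undirected edge is one arc each way
        cap[(j, i)] = w
    for i, ci in enumerate(c):
        if ci > 0:
            cap[('s', i)] = ci       # lambda_si = c_i
        elif ci < 0:
            cap[(i, 't')] = -ci      # lambda_it = -c_i
    return cap

def cut_weight(cap, S):
    """Lambda(S, T): total weight of arcs from S to T, T being all other nodes."""
    return sum(w for (u, v), w in cap.items() if u in S and v not in S)

cap = build_network(W, c)

# Enumerate every cut: S always contains 's' and never 't'.
cuts = []
for x in product([0, 1], repeat=n):
    S = {'s'} | {i for i in range(n) if x[i] == 1}
    cuts.append((cut_weight(cap, S), x))
min_weight, x_min = min(cuts)
```

Each tuple `x` records which variables sit on the source side of the cut, so `x_min` describes the minimum cut for this toy network.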

Attractive Binary MRFs and Max Flow-Min Cut

Identify an assignment $X = x$ with a cut:
$$S = \{s\} \cup \{i : x_i = 1\}, \qquad T = \{t\} \cup \{j : x_j = 0\}$$

The weight of the cut is:
$$\Lambda(S, T) = \sum_{(ij)} W_{ij}\, \delta(x_i \ne x_j) + \sum_i (1 - x_i) \max(0, c_i) + \sum_j x_j \max(0, -c_j) = -\sum_{(ij)} W_{ij}\, \delta(x_i = x_j) - \sum_i x_i c_i + \text{constant}$$

The second equality uses $\delta(x_i \ne x_j) = 1 - \delta(x_i = x_j)$ and $\max(0, c_i) - \max(0, -c_i) = c_i$. So finding the minimum cut corresponds to finding the MAP assignment.

How do we find the minimum cut? The minimum cut problem is dual to the maximum flow problem, i.e. find the maximum flow allowable from the source to the sink through the network. This can be solved extremely efficiently (see the Wikipedia entry).

The framework can be generalised to general attractive MRFs (beyond the binary case), but it is then no longer exact.
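As an illustration of the max-flow route, the sketch below runs a plain breadth-first augmenting-path method (Edmonds-Karp, chosen here for brevity and not named in the slides) on a hypothetical four-variable model. It reads the assignment off the set of nodes still reachable from the source and compares it with brute-force MAP. All names and values are assumptions.

```python
from collections import deque
from itertools import product

# Hypothetical attractive binary MRF (all W_ij >= 0).
W = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 1.5, (0, 3): 0.5}
c = [3.0, -0.4, -2.0, 0.3]
n = len(c)

def objective(x):
    """Sum of W_ij [x_i == x_j] plus sum of c_i x_i."""
    return (sum(w for (i, j), w in W.items() if x[i] == x[j])
            + sum(ci * xi for ci, xi in zip(c, x)))

def residual_network(W, c):
    """Residual capacities r[u][v] for the source/sink network built from (W, c)."""
    r = {u: {} for u in ['s', 't'] + list(range(len(c)))}
    def add_arc(u, v, w):
        r[u][v] = r[u].get(v, 0.0) + w
        r[v].setdefault(u, 0.0)      # reverse arc, used to undo flow
    for (i, j), w in W.items():
        add_arc(i, j, w)
        add_arc(j, i, w)
    for i, ci in enumerate(c):
        if ci > 0:
            add_arc('s', i, ci)
        elif ci < 0:
            add_arc(i, 't', -ci)
    return r

def reachable_parents(r, eps=1e-12):
    """BFS from 's' along arcs with spare capacity; returns parent pointers."""
    parent = {'s': None}
    queue = deque(['s'])
    while queue:
        u = queue.popleft()
        for v, spare in r[u].items():
            if spare > eps and v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def max_flow_min_cut(W, c):
    r = residual_network(W, c)
    flow = 0.0
    while True:
        parent = reachable_parents(r)
        if 't' not in parent:
            break                    # no augmenting path left: flow is maximal
        path, v = [], 't'
        while parent[v] is not None: # walk back from the sink to the source
            path.append((parent[v], v))
            v = parent[v]
        push = min(r[u][v] for u, v in path)
        for u, v in path:
            r[u][v] -= push
            r[v][u] += push
        flow += push
    return flow, set(parent)         # nodes still reachable from 's' form S

flow, S = max_flow_min_cut(W, c)
x_flow = tuple(1 if i in S else 0 for i in range(n))
x_brute = max(product([0, 1], repeat=n), key=objective)
```

On this toy model the assignment read off the source side of the cut agrees with exhaustive search. Practical implementations would use a dedicated max-flow solver rather than this simple loop.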

◮ Convexity in exponential family inference and learning
