Probabilistic Graphical Models
Exponential family & Variational Inference I
Siamak Ravanbakhsh, Fall 2019

Learning objectives
- entropy
- exponential family distributions
- duality in exponential families
Information content

The information in observing X = x, written I(X = x), should satisfy:
- I(X = x) = 0 ⇔ P(X = x) = 1
- A = a ⊥ B = b ⇒ I(A = a, B = b) = I(A = a) + I(B = b)

These conditions lead to I(X = x) ≜ log( 1 / P(X = x) ) = − log(P(X = x))
Entropy: the expected information content

H(P) ≜ E[I(X)] = − Σ_{x∈Val(X)} P(X = x) log(P(X = x))

0 ≤ H(P) ≤ log(∣Val(X)∣)
and the maximum is achieved by the uniform distribution
Entropy as optimal expected code length, e.g., using Huffman coding

Val(X) = {a, b, c, d, e, f}
P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = 1/32, P(f) = 1/32

H(P) = − (1/2) log(1/2) − (1/4) log(1/4) − (1/8) log(1/8) − (1/16) log(1/16) − 2 · (1/32) log(1/32)
     = 1/2 + 1/2 + 3/8 + 1/4 + 5/16 = 31/16 bits   (log base 2)

an optimal code for transmitting X:
a → 0, b → 10, c → 110, d → 1110, e → 11110, f → 11111

average length? Σ_x P(x) · length(x); the contribution from X = a is (1/2) · 1, and the total is again 31/16 = H(P).
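As a sanity check, here is a minimal Python sketch (mine, not from the slides) computing both quantities for this example:

```python
# Check that the entropy of P equals the expected length of the code above.
import math

P = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16, "e": 1/32, "f": 1/32}
code = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "11110", "f": "11111"}

# entropy in bits: H(P) = -sum_x P(x) log2 P(x)
H = -sum(p * math.log2(p) for p in P.values())

# average code length: sum_x P(x) * len(code(x))
avg_len = sum(P[x] * len(code[x]) for x in P)

print(H, avg_len)  # both print 1.9375 = 31/16 bits
```

Both values agree, as expected for a distribution whose probabilities are powers of 1/2.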
Cross entropy

For X ∼ p:
H(p, q) ≜ − Σ_{x∈Val(X)} p(x) log(q(x))
here − log(q(x)) is the optimal code length for X = x according to q, so H(p, q) is the expected code length when coding for q while X ∼ p.

Kullback-Leibler divergence (relative entropy):
D(p∥q) ≜ H(p, q) − H(p) = Σ_{x∈Val(X)} p(x) log( p(x) / q(x) )
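A minimal sketch of these definitions in Python (the helper names are my own, not from the slides), assuming distributions are given as aligned probability vectors with q(x) > 0:

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log2 q(x); terms with p(x) = 0 contribute 0
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    # D(p || q) = H(p, q) - H(p) >= 0, with equality iff p == q
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(kl(p, q), kl(q, p))  # note: KL is not symmetric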
Entropy in statistical physics

16 microstates: positions of 4 particles in a top/bottom box
5 macrostates: indistinguishable states, assuming exchangeable particles

With Val(X) = {top, bottom}, each macrostate is a distribution; we can assume 5 different distributions:
p(top) ∈ {0, 1/4, 1/2, 3/4, 1}

Which distribution is more likely?
Entropy of a macrostate: (normalized) log number of its microstates

H_macrostate = (1/N) ln( N! / (N_t! N_b!) ) = (1/N) ( ln(N!) − ln(N_t!) − ln(N_b!) )

Assume a large number of particles N, and use Stirling's approximation ln(N!) ≃ N ln(N) − N:

H_macrostate ≃ − (N_t/N) ln(N_t/N) − (N_b/N) ln(N_b/N)
             = − Σ_{x∈{top,bottom}} p(x) ln(p(x))    where p(top) = N_t/N

so the most likely macrostate (the one with the most microstates) is the distribution with maximum entropy.
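The Stirling step can be checked numerically; a short sketch (the particle counts below are arbitrary choices for illustration):

```python
# The normalized log count of microstates approaches the entropy
# -p ln p - (1-p) ln(1-p) as N grows.
import math

def macrostate_entropy(N, Nt):
    # (1/N) ln( N! / (Nt! (N - Nt)!) ), computed exactly via log-gamma
    return (math.lgamma(N + 1) - math.lgamma(Nt + 1)
            - math.lgamma(N - Nt + 1)) / N

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

for N in [4, 100, 10_000, 1_000_000]:
    Nt = N // 4                      # macrostate with p(top) = 1/4
    print(N, macrostate_entropy(N, Nt), binary_entropy(0.25))
```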
For continuous domains

Discretize Val(X) into bins of width Δ; then ∫_{iΔ}^{(i+1)Δ} p(x)dx = p(x_i)Δ for some x_i ∈ (iΔ, (i+1)Δ).

H ≈ − Σ_i p(x_i)Δ log( p(x_i)Δ ) = − Σ_i p(x_i)Δ log(p(x_i)) − log(Δ)

Ignoring the − log(Δ) term (which diverges as Δ → 0) gives the differential entropy
H(p) ≜ − ∫_{Val(X)} p(x) log(p(x)) dx
High entropy distribution:
- more information in observing X ∼ p
- it's a more likely "macrostate"
- the least amount of assumption about p

When optimizing for p(x) subject to constraints, maximize the entropy:

arg max_p H(p)
s.t.  E_p[ϕ_k(X)] = μ_k  ∀k
      p(x) > 0  ∀x
      ∫_{Val(X)} p(x)dx = 1

Using Lagrange multipliers, the solution has the form p(x) ∝ exp( Σ_k θ_k ϕ_k(x) )
The exponential family

p(x; θ) = h(x) exp( ⟨θ, ϕ(x)⟩ − A(θ) )
- h(x): base measure
- ϕ(x): sufficient statistics
- A(θ): log-partition function
A(θ) = ln( ∫_{Val(X)} h(x) exp( Σ_k θ_k ϕ_k(x) ) dx )
⟨θ, ϕ(x)⟩ = Σ_k θ_k ϕ_k(x) is the inner product of two vectors θ, ϕ(x) ∈ ℝ^D

Example: Gaussian
p(x; μ, σ²) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )
sufficient statistics: ϕ(x) = [x, x²]
parameters: [μ/σ², −1/(2σ²)]
log-partition function: A = ½ ( ln(2πσ²) + μ²/σ² )
Natural parameters

Can we simply define η(θ) ∈ ℝ^D to be the new parameters θ?

For the Gaussian: θ = [θ₁, θ₂] = [μ/σ², −1/(2σ²)] with ϕ(x) = [x, x²], and
A(θ) = ½ ( ln(−π/θ₂) − θ₁²/(2θ₂) )

The base measure h(x) can be absorbed as a sufficient statistic ln h(x) with fixed θ = 1.
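A minimal sketch (mine, not from the slides) checking the natural-parameter form of the Gaussian against the usual density:

```python
# theta = (mu/sigma^2, -1/(2 sigma^2)), phi(x) = (x, x^2), h(x) = 1.
import math

def gaussian_natural(mu, var):
    return (mu / var, -1.0 / (2.0 * var))

def log_partition(t1, t2):
    # A(theta) = (1/2) ( ln(-pi / theta2) - theta1^2 / (2 theta2) )
    return 0.5 * (math.log(-math.pi / t2) - t1 ** 2 / (2.0 * t2))

def density(x, mu, var):
    t1, t2 = gaussian_natural(mu, var)
    return math.exp(t1 * x + t2 * x * x - log_partition(t1, t2))

x, mu, var = 0.7, 1.0, 2.0
usual = math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
print(density(x, mu, var), usual)  # should match
```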
Example: Bernoulli, x ∈ {0, 1}
sufficient statistic: ϕ(x) = x
natural parameter: θ = ln( μ / (1 − μ) )
log-partition function: A(θ) = log(1 + e^θ)

Example: categorical, x ∈ {1, … , D}
sufficient statistics: ϕ(x) = [I(x = 1), … , I(x = D)]
natural parameters: θ_d = ln(μ_d)

Example: Beta
p(x; α, β) = ( Γ(α + β) / (Γ(α)Γ(β)) ) x^{α−1} (1 − x)^{β−1}
sufficient statistics: [ln(x), ln(1 − x)]
natural parameters: [α − 1, β − 1]
motivation: when discussing Bayesian inference
image: wikipedia
Example: Poisson
probability of x events happening in a fixed period;
events happen independently with rate λ (rate parameter);
similar to a binomial with a large number of trials (λ ≈ nμ)

p(x; λ) = λ^x e^{−λ} / x!
sufficient statistic: x
natural parameter: θ = ln(λ)
base measure: h(x) = 1/x!
log-partition function: A(θ) = exp(θ)
image: wikipedia
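A quick sketch (not from the slides) confirming that the exponential-family factorization reproduces the Poisson pmf:

```python
# h(x) = 1/x!, phi(x) = x, theta = ln(lambda), A(theta) = exp(theta)
import math

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, theta):
    h = 1.0 / math.factorial(x)              # base measure
    return h * math.exp(theta * x - math.exp(theta))

lam = 3.5
for x in range(5):
    print(poisson_pmf(x, lam), poisson_expfam(x, math.log(lam)))  # identical
```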
Example: exponential distribution
time between events in a Poisson process; memoryless property
p(x; λ) = λ e^{−λx},  Val(X) = ℝ⁺
sufficient statistic: x, natural parameter: θ = −λ
log-partition function: A(θ) = − ln(−θ)

Example: geometric distribution
number of Bernoulli trials until success; memoryless property
p(k; μ) = (1 − μ)^{k−1} μ  where 0 < μ < 1,  Val(X) = ℕ
related to the exponential distribution via (1 − μ) ≡ e^{−λ}
image: wikipedia
Example: Ising model
p(x; θ) ∝ exp( Σ_{i,j} θ_{i,j} x_i x_j )
for i = j this encodes the local field θ_{i,i} x_i
[figure: a 2D Ising grid; image: wainwright & jordan]
Example: mixture of Gaussians
p(x, y) = p(x) p(y ∣ x), with mixture component x ∈ {1, … , D} and y ∣ x = d ∼ N(μ_d, σ_d²)
sufficient statistics: products of [I(x = 1), … , I(x = D)] with [y, y²]
natural parameters for each component in the mixture: [μ_1/σ_1², −1/(2σ_1²), … , μ_D/σ_D², −1/(2σ_D²)]
More general forms

sufficient statistics ϕ_k defined over cliques D_k of the undirected graph:
A(θ) = ln( Σ_{x∈Val(X)} exp( − Σ_k θ_k ϕ_k(D_k) ) )
the familiar log-sum-exp form
image: Michael Jordan's draft
Example: sufficient statistics, natural parameters, and mean parameters for a pair of binary variables

sufficient statistics: I(X1 = 0, X2 = 0), I(X1 = 1, X2 = 0), I(X1 = 0, X2 = 1), I(X1 = 1, X2 = 1)
natural parameters: θ_{1,2,0,0}, θ_{1,2,1,0}, θ_{1,2,0,1}, θ_{1,2,1,1}
mean parameters:
μ_{1,2,0,0} = P(X1 = 0, X2 = 0)
μ_{1,2,1,0} = P(X1 = 1, X2 = 0)
μ_{1,2,0,1} = P(X1 = 0, X2 = 1)
μ_{1,2,1,1} = P(X1 = 1, X2 = 1)
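A brute-force sketch of the mapping from natural to mean parameters for this two-variable model (the θ values below are hypothetical, chosen only for illustration):

```python
import math

theta = {(0, 0): 0.3, (1, 0): -0.2, (0, 1): 0.5, (1, 1): 1.0}

# p(x1, x2) proportional to exp(theta[x1, x2]); A is the log normalizer
A = math.log(sum(math.exp(t) for t in theta.values()))
mu = {ab: math.exp(t - A) for ab, t in theta.items()}

print(mu)                # mean parameters: the pairwise marginals
print(sum(mu.values()))  # = 1
```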
Mean parameters

μ = E_p[ϕ(x)] for any distribution p (not necessarily of the exponential family form p_θ)

mean parameter space: M = { E_p[ϕ(x)] ∀p }
M is also convex

example (Gaussian): M is characterized by the constraint Σ − μμᵀ ⪰ 0
[figure: the sets Θ and M; image: wainwright & jordan]

example: X1, X2 ∈ {0, 1} with sufficient statistics I[X1 = 1], I[X2 = 1], I(X1 = 1, X2 = 1)
mean parameters: μ_1 = E[X1], μ_2 = E[X2], μ_{1,2} = E[X1 X2]
M = { E_p[ϕ(x)] ∀p } = conv{ (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1) }
the convex hull of the sufficient statistics over all x ∈ Val(X)
[figure: this polytope; image: wainwright & jordan]
For a discrete model p_θ with ϕ_k(x) = I(x_i = r, x_j = s), the mean parameters are the pairwise marginals μ_k = P(x_i = r, x_j = s).
Maximum likelihood as moment matching

given samples X1, X2, … , Xn ∼ p_θ:
empirical mean parameters: μ̂ = (1/n) Σ_{i=1}^n ϕ(X_i)
maximum likelihood: find θ s.t. E_{p_θ}[ϕ(x)] = μ̂

Projecting a distribution p onto a family Q:
I-projection: arg min_{q∈Q} D(q∥p)
M-projection: arg min_{q∈Q} D(p∥q)
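To make the moment-matching recipe concrete, a minimal sketch for the Bernoulli family (the sample size and seed are arbitrary choices of mine):

```python
# mu_hat is the empirical mean of the sufficient statistic phi(x) = x;
# the MLE natural parameter is the backward mapping theta = ln(mu/(1-mu)).
import math
import random

random.seed(0)
true_mu = 0.7
samples = [1 if random.random() < true_mu else 0 for _ in range(10_000)]

mu_hat = sum(samples) / len(samples)          # empirical mean parameter
theta_hat = math.log(mu_hat / (1 - mu_hat))   # solves E_{p_theta}[x] = mu_hat

print(mu_hat, theta_hat, math.log(true_mu / (1 - true_mu)))
```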
[figure: I-projection vs. M-projection of a bimodal p; the I-projection exhibits mode-seeking behavior, while the M-projection covers the modes]
M-projection onto fully factorized distributions

Q = { q : q(x) = ∏_k q(x_k) }, each factor q(x_k) otherwise unrestricted

D(p∥q) = E_p[ln p(x)] − Σ_k E_p[ln q(x_k)]
       = E_p[ ln( p(x) / ∏_k p(x_k) ) ] + Σ_k E_p[ ln( p(x_k) / q(x_k) ) ]
       = D(p∥q_M) + Σ_k D( p(x_k) ∥ q(x_k) )

where q_M(x) ≜ ∏_k p(x_k); each KL term is minimized by q(x_k) = p(x_k), so the M-projection matches the marginals of p.
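A short numerical check of this decomposition for two binary variables (a sketch assuming NumPy is available; the random distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2)); p /= p.sum()       # arbitrary joint p(x1, x2)
q1, q2 = rng.random(2), rng.random(2)
q1 /= q1.sum(); q2 /= q2.sum()
q = np.outer(q1, q2)                       # factorized q(x1) q(x2)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

p1, p2 = p.sum(axis=1), p.sum(axis=0)      # marginals of p
qM = np.outer(p1, p2)                      # the M-projection

lhs = kl(p, q)
rhs = kl(p, qM) + kl(p1, q1) + kl(p2, q2)
print(lhs, rhs)                            # equal
```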
M-projection onto an exponential family Q = {q_θ}

consider two distributions:
q_θ has the same moments as p: E_{q_θ}[ϕ(x)] = E_p[ϕ(x)]
q_θ′ has different moments

D(p∥q_θ′) − D(p∥q_θ) = ⟨E_p[ϕ(x)], θ − θ′⟩ − A(θ) + A(θ′)
                     = ⟨E_{q_θ}[ϕ(x)], θ − θ′⟩ − A(θ) + A(θ′)
                     = D(q_θ∥q_θ′) ≥ 0

so q_θ is the projection (note that p can have any form)
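The identity above can be checked numerically. A sketch for the one-parameter family q_θ(x) ∝ exp(θx) on {0, 1, 2}, assuming NumPy and SciPy are available (the choice of p and θ′ is arbitrary):

```python
import numpy as np
from scipy.optimize import brentq

xs = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.2, 0.3])        # arbitrary p, not in the family

def q(theta):
    w = np.exp(theta * xs)
    return w / w.sum()

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# moment matching: find theta with E_{q_theta}[x] = E_p[x]
m = float(p @ xs)
theta = brentq(lambda t: q(t) @ xs - m, -10, 10)
theta_p = 1.3                        # any other member of the family

# D(p||q_theta') - D(p||q_theta) = D(q_theta||q_theta')
print(kl(p, q(theta_p)) - kl(p, q(theta)), kl(q(theta), q(theta_p)))
```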
Variational formulations of the projections

I-projection: arg min_{q∈Q} D(q∥p) = arg min_{q∈Q} E_q[− ln(p)] − H(q)    (energy − entropy)

exponential family form:
A(θ) = max_{μ∈M} ⟨μ, θ⟩ − A*(μ)
⟨μ, θ⟩ is the negative energy and A*(μ) is the negative entropy, where
A*(μ) = max_{θ∈Θ} ⟨μ, θ⟩ − A(θ)    (the convex conjugate of A)

M-projection: arg min_{q∈Q} D(p∥q) = arg min_{q∈Q} E_p[− ln(q)]    (expected negative log-likelihood; aka moment matching)
ideas based on moment matching are also applied to inference

but we saw that M-projection gives correct marginals, why use I-projection?
(computing the M-projection requires the moments of p, i.e., inference in p itself)

forward mapping: θ ↦ μ = ∫ p_θ(x) ϕ(x) dx, from Θ to M
backward mapping: μ ↦ θ(μ), from M to Θ
[figure: the mappings between Θ and M; image: wainwright & jordan]
Example: duality for the Bernoulli

Θ = ℝ,  A(θ) = log(1 + exp(θ))

forward mapping: ∇_θ A(θ) = exp(θ) / (1 + exp(θ)) = μ, the mean parameter
backward mapping: θ*(μ) = ln(μ) − ln(1 − μ)
conjugate dual: A*(μ) = μ ln(μ) + (1 − μ) ln(1 − μ), the negative entropy
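A minimal sketch (not from the slides) verifying the conjugate duality A(θ) = max_{μ∈M} ⟨μ, θ⟩ − A*(μ) for the Bernoulli by grid search:

```python
import math

def A(theta):
    return math.log(1 + math.exp(theta))

def A_star(mu):
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

theta = 0.8
mus = [i / 10_000 for i in range(1, 10_000)]   # grid over M = (0, 1)
best = max(mu * theta - A_star(mu) for mu in mus)
print(best, A(theta))                          # essentially equal

# the maximizer is the forward mapping mu = grad A(theta) = sigmoid(theta)
mu_star = max(mus, key=lambda mu: mu * theta - A_star(mu))
print(mu_star, 1 / (1 + math.exp(-theta)))
```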
KL divergence within the exponential family

D(q_{θ1}∥q_{θ2}) = A(θ2) − A(θ1) − ⟨μ1, θ2 − θ1⟩, where μ1 = ∇_θ A(θ1) ∈ M
it depends on q_{θ1} only through μ1 (not on θ1 directly) and takes the form of a Bregman divergence for the convex function A
[figure: the Bregman divergence as the gap between A(θ2) and the tangent of A at θ1; image: wainwright & jordan]

recovering θ from μ is a familiar optimization! (the moment-matching problem behind maximum likelihood and the mapping θ → μ)
the KL divergence is not symmetric: which one to use? is this the "right" one?
Variational inference via the forward mapping

A(θ) = max_{μ∈M} ⟨θ, μ⟩ − A*(μ)
the maximizing μ is the forward mapping μ = ∇_θ A(θ), from Θ to M
e.g., it gives us the marginals in the Ising model
[figure: the forward mapping ∇_θ A(θ) from Θ to M; image: wainwright & jordan]
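For a model small enough to enumerate, the forward mapping can be computed by brute force. A sketch for a tiny Ising-style grid, assuming NumPy is available (the 2x2 grid, random parameters, and seed are choices of mine for illustration):

```python
# grad A(theta) yields the marginals mu_i = P(x_i = 1) and
# mu_ij = P(x_i = 1, x_j = 1); here we compute them by enumeration.
import itertools
import numpy as np

n = 4                                       # 2x2 grid, variables 0..3
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
rng = np.random.default_rng(0)
theta_node = rng.normal(size=n)             # local fields
theta_edge = rng.normal(size=len(edges))    # couplings

# enumerate all 2^n states; p(x) ∝ exp(sum_i th_i x_i + sum_ij th_ij x_i x_j)
states = np.array(list(itertools.product([0, 1], repeat=n)))
scores = states @ theta_node + np.array(
    [sum(t * x[i] * x[j] for t, (i, j) in zip(theta_edge, edges)) for x in states])
p = np.exp(scores - scores.max())
p /= p.sum()

mu_node = p @ states                                             # P(x_i = 1)
mu_edge = [float(p @ (states[:, i] * states[:, j])) for i, j in edges]
print(mu_node, mu_edge)
```

Exact enumeration is exponential in n, which is exactly why the variational view of A(θ) matters for larger models.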