Graphical Models
Exponential family & Variational Inference I
Siamak Ravanbakhsh Winter 2018
Learning objectives:
- entropy
- the exponential family of distributions
- duality in the exponential family: relationship between the two parametrizations
- inference and learning as mappings between the two
- relative entropy and two types of projections
A measure of information I(X = x) should satisfy two characteristics: information is non-negative, and information from independent events is additive:

I(X = x) = 0 ⇔ P(X = x) = 1
A = a ⊥ B = b ⇒ I(A = a, B = b) = I(A = a) + I(B = b)

The definition follows from these characteristics:

I(X = x) ≜ log( 1 / P(X = x) ) = −log(P(X = x))
Entropy: the expected amount of information in an observation X = x:

H(P) ≜ E[I(X)] = −∑_{x ∈ Val(X)} P(X = x) log(P(X = x))

Interpretation: the expected code length in transmitting X (repeatedly), e.g., using Huffman coding. Entropy achieves its maximum for the uniform distribution:

0 ≤ H(P) ≤ log(|Val(X)|)
Example: Val(X) = {a, b, c, d, e, f} with
P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = P(f) = 1/32

H(P) = −(1/2) log(1/2) − (1/4) log(1/4) − (1/8) log(1/8) − (1/16) log(1/16) − (1/16) log(1/32)
     = 1/2 + 1/2 + 3/8 + 1/4 + 5/16 = 31/16 bits

An optimal code for transmitting X:
a → 0, b → 10, c → 110, d → 1110, e → 11110, f → 11111

Average length? The contribution to the average length from X = a is (1/2)·1; adding the contributions of the remaining symbols gives 31/16 bits, matching H(P).
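To make the bookkeeping concrete, here is a small Python sketch (mine, not from the slides) that computes H(P) and the expected length of the code above; both come out to 31/16 = 1.9375 bits.

```python
import numpy as np

# the six-symbol distribution from the slide
p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/16, 'e': 1/32, 'f': 1/32}
code = {'a': '0', 'b': '10', 'c': '110', 'd': '1110', 'e': '11110', 'f': '11111'}

# entropy in bits: H(P) = -sum_x P(x) log2 P(x)
H = -sum(px * np.log2(px) for px in p.values())

# expected code length: sum_x P(x) * len(code(x))
L = sum(p[x] * len(code[x]) for x in p)

print(H, L)  # both equal 31/16 = 1.9375 bits
```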
Cross entropy: what if we used a code designed for q? The average code length when transmitting X ∼ p is

H(p, q) ≜ −∑_{x ∈ Val(X)} p(x) log(q(x))

where −log(q(x)) is the optimal code length for X = x according to q.
The extra amount of information transmitted is the Kullback-Leibler divergence, or relative entropy:

D(p ∥ q) ≜ ∑_{x ∈ Val(X)} p(x) ( log(p(x)) − log(q(x)) ) = H(p, q) − H(p)

Some properties: non-negative, zero iff p = q, and asymmetric.

For the uniform distribution u(x) = 1/N:

D(p ∥ u) = ∑_x p(x) ( log(p(x)) − log(1/N) ) = log(N) − H(p)
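A quick numerical sanity check (a sketch, not part of the slides): cross entropy, KL divergence, and the identity D(p∥u) = log(N) − H(p) for the uniform u.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    # average code length when X ~ p but the code is designed for q
    return -np.sum(p * np.log2(q))

def kl(p, q):
    # D(p || q) = H(p, q) - H(p)
    return np.sum(p * (np.log2(p) - np.log2(q)))

p = np.array([1/2, 1/4, 1/8, 1/16, 1/32, 1/32])
u = np.full(6, 1/6)                      # uniform distribution

print(kl(p, u))                          # extra bits when coding p with u's code
print(np.log2(6) - entropy(p))           # log(N) - H(p): the same number
print(cross_entropy(p, u) - entropy(p))  # H(p,u) - H(p): also the same
```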
Entropy in physics: 16 microstates, the positions of 4 particles in a top/bottom box, with Val(X) = {top, bottom}. There are 5 macrostates: indistinguishable states assuming exchangeable particles; each macrostate corresponds to a different distribution over Val(X).

Entropy of a macrostate: the (normalized) log number of its microstates,

H = (1/N) ln( N! / (N_t! N_b!) ) = (1/N) ( ln(N!) − ln(N_t!) − ln(N_b!) )

Assume a large number of particles N and use ln(N!) ≃ N ln(N) − N:

H = −(N_t/N) ln(N_t/N) − (N_b/N) ln(N_b/N)

where N_t/N plays the role of P(X = top).
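As a small illustration (my own sketch, not from the slides), the normalized log microstate count approaches the binary entropy of N_t/N as N grows; gammaln handles the large factorials.

```python
import numpy as np
from scipy.special import gammaln

def macrostate_entropy(N, Nt):
    # (1/N) * ln( N! / (Nt! * (N - Nt)!) ), using gammaln for large factorials
    return (gammaln(N + 1) - gammaln(Nt + 1) - gammaln(N - Nt + 1)) / N

def binary_entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

for N in [10, 100, 10_000]:
    Nt = int(0.3 * N)                     # macrostate with 30% of particles on top
    print(N, macrostate_entropy(N, Nt), binary_entropy(Nt / N))
```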
Differential entropy (continuous domains): use nats instead of bits.

Divide the domain Val(X) into small bins of width Δ; for each bin there exists an x_i ∈ (iΔ, (i+1)Δ) with

∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i) Δ

H_Δ(p) = −∑_i p(x_i)Δ ln(p(x_i)Δ) = −ln(Δ) − ∑_i p(x_i)Δ ln(p(x_i))

Ignore the −ln(Δ) term and take the limit Δ → 0:

H(p) ≜ −∫_{Val(X)} p(x) ln(p(x)) dx
Maximize the entropy subject to constraints:

arg max_p H(p)
s.t. E_p[ϕ_k(X)] = μ_k ∀k
     p(x) > 0 ∀x,  ∫_{Val(X)} p(x) dx = 1

Solving with Lagrange multipliers gives

p(x) ∝ exp( ∑_k θ_k ϕ_k(x) )
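A sketch of the result on a toy problem of my own (not from the slides): the maximum-entropy distribution on {1,…,6} with the single constraint E[X] = 4.5 has the exponential-family form p(x) ∝ exp(θx), and the Lagrange multiplier θ can be found by one-dimensional root finding.

```python
import numpy as np
from scipy.optimize import brentq

xs = np.arange(1, 7)          # support {1, ..., 6}
target_mean = 4.5             # the moment constraint E[X] = 4.5

def mean_under_theta(theta):
    # max-entropy solution has exponential-family form p(x) ∝ exp(theta * x)
    w = np.exp(theta * xs)
    p = w / w.sum()
    return p @ xs

# solve E_theta[X] = target_mean for the Lagrange multiplier theta
theta = brentq(lambda t: mean_under_theta(t) - target_mean, -10, 10)
p = np.exp(theta * xs); p /= p.sum()
print(theta, p, p @ xs)       # p is a "loaded die" with mean 4.5
```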
An exponential family has the following form:

p(x; θ) = h(x) exp( ⟨θ, ϕ(x)⟩ − A(θ) )

h(x) is the base measure, ϕ(x) the sufficient statistics, ⟨θ, ϕ(x)⟩ the inner product of two vectors, and A(θ) the log-partition function

A(θ) = ln( ∫_{Val(X)} h(x) exp( ∑_k θ_k ϕ_k(x) ) dx )

with a convex parameter space θ ∈ Θ = {θ ∈ ℜ^D | A(θ) < ∞}.

Example, the univariate Gaussian:

p(x; μ, σ²) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) ),   (μ, σ²) ∈ ℜ × ℜ⁺

For the moment form, the natural parameters, sufficient statistics, and log-partition function are

η(μ, σ²) = [μ/σ², −1/(2σ²)],   ϕ(x) = [x, x²],   A = (1/2)( ln(2πσ²) + μ²/σ² )
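A quick numerical check (sketch, with parameter values of my choosing) that these natural parameters, sufficient statistics, and A(θ) reproduce the Gaussian density.

```python
import numpy as np

mu, sigma2 = 1.3, 0.7
theta = np.array([mu / sigma2, -1.0 / (2 * sigma2)])      # natural parameters
A = 0.5 * (np.log(2 * np.pi * sigma2) + mu**2 / sigma2)   # log-partition function

x = np.linspace(-3, 5, 7)
phi = np.stack([x, x**2])                 # sufficient statistics [x, x^2]
p_expfam = np.exp(theta @ phi - A)        # exp(<theta, phi(x)> - A(theta)), h(x) = 1
p_gauss = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(np.allclose(p_expfam, p_gauss))     # True
```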
Bernoulli, take 1: conventional form (mean parametrization)

p(x; μ) = μ^x (1 − μ)^(1−x),   μ ∈ (0, 1)
η(μ) = [ln(μ), ln(1 − μ)],   ϕ(x) = [I(x = 1), I(x = 0)]

Can we simply define η(θ) to be the new θ? When using natural parameters, the natural parameter space needs to be convex:

θ ∈ Θ = {θ ∈ ℜ^D | A(θ) < ∞}

This is the linear exponential family p(x; θ) = exp( ⟨θ, ϕ(x)⟩ − A(θ) ), where Θ is a convex set; a base measure h(x) can be absorbed as a sufficient statistic ln(h(x)) whose parameter is fixed to θ = 1.
Natural parameters in the univariate Gaussian:

θ = [μ/σ², −1/(2σ²)],   ϕ(x) = [x, x²]
A(θ) = −(1/2)( ln(−θ₂/π) + θ₁²/(2θ₂) )
θ ∈ Θ = ℜ × ℜ⁻, a convex set.
Bernoulli, take 2: keep the conventional form (mean parametrization)

p(x; μ) = μ^x (1 − μ)^(1−x)

and use ϕ(x) = [I(x = 1), I(x = 0)] with θ = [ln(μ), ln(1 − μ)] as natural parameters. However, the resulting Θ is not a convex set.
Bernoulli, take 3: let the parameters range over all of ℜ²:

θ ∈ ℜ²,   ϕ(x) = [I(x = 1), I(x = 0)],   p(x; μ) = μ^x (1 − μ)^(1−x)

This parametrization is redundant, or overcomplete:

p(x; [θ₁, θ₂]) = p(x; [θ₁ + c, θ₂ + c])

A parametrization is redundant iff ∃θ ≠ 0 s.t. ⟨θ, ϕ(x)⟩ = c ∀x.
Bernoulli, take 4: conventional form (mean parametrization) p(x; μ) = μ^x (1 − μ)^(1−x) with

ϕ(x) = [I(x = 1)],   θ = ln( μ / (1 − μ) ),   A(θ) = log(1 + e^θ)

Θ = ℜ is convex and this parametrization is minimal.
More generally, for a categorical variable x ∈ {1, …, D} with p(x; μ) = ∏_d μ_d^{I(x=d)}:

ϕ(x) = [I(x = 2), …, I(x = D)],   θ = [ln(μ₂/μ₁), …, ln(μ_D/μ₁)]

have a minimal linear exp-family form

p(x; θ) = exp( ⟨θ, ϕ(x)⟩ − A(θ) )
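A small sketch (parameter values are my own, not from the slides) contrasting the overcomplete take-3 parametrization with the minimal take-4 one: shifting both overcomplete parameters by a constant leaves the distribution unchanged, and the minimal parameter is their difference.

```python
import numpy as np

def bernoulli_overcomplete(theta):
    # p(x) ∝ exp(theta[0]*I(x=1) + theta[1]*I(x=0)): the overcomplete "take 3"
    logits = np.array([theta[1], theta[0]])      # index 0 -> x=0, index 1 -> x=1
    w = np.exp(logits - logits.max())            # stable softmax
    return w / w.sum()                           # [p(x=0), p(x=1)]

def bernoulli_minimal(theta):
    # "take 4": p(x=1) = exp(theta - A(theta)) with A(theta) = log(1 + e^theta)
    p1 = np.exp(theta - np.log1p(np.exp(theta)))
    return np.array([1 - p1, p1])

theta = np.array([0.4, -1.1])
c = 3.0
print(bernoulli_overcomplete(theta))             # a distribution over {0, 1} ...
print(bernoulli_overcomplete(theta + c))         # ... unchanged by the shift by c
print(bernoulli_minimal(theta[0] - theta[1]))    # minimal parameter: theta1 - theta2
```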
Beta distribution, for the shape parameters:

p(x; α, β) = ( Γ(α+β) / (Γ(α)Γ(β)) ) x^(α−1) (1 − x)^(β−1),   (α, β) ∈ ℜ⁺ × ℜ⁺

linear exp-family form with

θ = [α − 1, β − 1],   ϕ(x) = [ln(x), ln(1 − x)],   where θ ∈ (−1, +∞) × (−1, +∞)

(image: wikipedia)
Exponential distribution, for the rate parameter:

p(x; λ) = λ e^(−λx),   λ ∈ ℜ⁺

linear exp-family form p(x; θ) = h(x) exp( ⟨θ, ϕ(x)⟩ − A(θ) ) with

h(x) = 1,   ϕ(x) = x,   θ = −λ,   A(θ) = −ln(−θ),   where θ ∈ ℜ⁻

(image: wikipedia)
p(x; θ) = h(x) exp(⟨θ, ϕ(x)⟩ − A(θ))
ln(λ) x
p(x; λ) =
x! λ e
x −λ
linear exp-family form where θ ∈ ℜ
exp(θ) x! 1
image: wikipedia
λ ∈ ℜ+
for the rate parameter
where θ ∈ ℜ
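A quick check (sketch) that this factorization matches scipy's Poisson pmf.

```python
import numpy as np
from math import factorial
from scipy.stats import poisson

lam = 2.5
theta = np.log(lam)                 # natural parameter
A = np.exp(theta)                   # log-partition function A(theta) = e^theta = lambda

for x in range(8):
    h = 1.0 / factorial(x)          # base measure h(x) = 1/x!
    p_expfam = h * np.exp(theta * x - A)
    print(np.isclose(p_expfam, poisson.pmf(x, lam)))  # True for every x
```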
Pairwise MRF with binary variables x_i ∈ {0, 1} (e.g., a 2D Ising grid):

p(x; θ) = exp( ∑_{i, j ≤ i} θ_{i,j} x_i x_j − A(θ) )

For i = j the term θ_{i,i} x_i x_i = θ_{i,i} x_i (since x_i ∈ {0, 1}) encodes the local field.
(image: wainwright & jordan)
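For a tiny model we can compute A(θ) and the node marginals by brute-force enumeration; the sketch below uses a 4-variable cycle with couplings of my own choosing (not from the slides).

```python
import numpy as np
from itertools import product
from scipy.special import logsumexp

n = 4                                          # 4 binary variables on a small cycle
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]       # pairwise couplings
theta_node = np.array([0.2, -0.5, 0.1, 0.3])   # local fields (i = j terms)
theta_edge = {e: 0.8 for e in edges}           # coupling strengths

configs = np.array(list(product([0, 1], repeat=n)))   # all 2^n microstates

def score(x):
    # <theta, phi(x)> for the pairwise binary MRF
    s = theta_node @ x
    s += sum(theta_edge[(i, j)] * x[i] * x[j] for i, j in edges)
    return s

scores = np.array([score(x) for x in configs])
A = logsumexp(scores)                      # log-partition function
p = np.exp(scores - A)                     # p(x; theta)

node_marginals = p @ configs               # mean parameters E[x_i] = P(x_i = 1)
print(A, node_marginals)
```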
Mixture of Gaussians: X is discrete and p(x, y) = p(x) p(y | x).

sufficient statistics: [I(x = 1), …, I(x = D)] for the mixture indicator and [y, y²] for each component
natural parameters: θ = [θ₁, …, θ_D, μ₁/σ₁², …, μ_D/σ_D², −1/(2σ₁²), …, −1/(2σ_D²)]
(natural params for each component in the mixture)
(image: wainwright & jordan)
More general forms: the log-linear form for positive distributions, with features ϕ_k defined on cliques D_k of the undirected graph:

p(x; θ) = exp( ∑_k θ_k ϕ_k(D_k) − A(θ) ),   θ ∈ ℜ^K

A(θ) = ln( ∑_{x ∈ Val(X)} exp( ∑_k θ_k ϕ_k(D_k) ) )

the familiar log-sum-exp form.
(image: Michael Jordan's draft)
Discrete distributions:

p(x; θ) = exp( ∑_k θ_k ϕ_k(D_k) − A(θ) )

For a pair of binary variables X₁, X₂:

sufficient statistics: I(X₁=0, X₂=0), I(X₁=1, X₂=0), I(X₁=0, X₂=1), I(X₁=1, X₂=1)
natural params.: θ_{1,2,0,0}, θ_{1,2,1,0}, θ_{1,2,0,1}, θ_{1,2,1,1}
mean parameters: μ_{1,2,0,0} = P(X₁=0, X₂=0), μ_{1,2,1,0} = P(X₁=1, X₂=0), μ_{1,2,0,1} = P(X₁=0, X₂=1), μ_{1,2,1,1} = P(X₁=1, X₂=1)

Mean parameters are the marginals.
(image: Michael Jordan's draft)
Mean parameters:

μ = E_{p_θ}[ϕ(x)] is the mean parameter of the natural parameter θ
mean parameter space: M = {E_p[ϕ(x)] for any distribution p}

For minimal sufficient statistics, θ ∈ Θ ⇔ μ ∈ M (the two parametrizations identify the same distribution). M is also convex. Why? A convex combination of mean parameters is realized by the corresponding mixture distribution, so M and Θ are both convex sets.

Multivariate Gaussian:

sufficient statistics: ϕ₁(X) = X, ϕ₂(X) = XX^T
natural parameters: η = Σ⁻¹μ, Λ = Σ⁻¹
mean parameters: E[X] = μ = Λ⁻¹η and E[XX^T], with Σ = E[XX^T] − μμ^T
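A sketch of the two parametrizations for a 2-dimensional Gaussian of my choosing, converting back and forth with numpy.

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# natural parameters (information form): eta = Sigma^{-1} mu, Lambda = Sigma^{-1}
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu

# mean parameters: E[X] = mu and E[X X^T] = Sigma + mu mu^T
second_moment = Sigma + np.outer(mu, mu)

# backward: recover (mu, Sigma) from the mean parameters
mu_back = np.linalg.solve(Lam, eta)                 # = Lambda^{-1} eta
Sigma_back = second_moment - np.outer(mu_back, mu_back)

print(np.allclose(mu_back, mu), np.allclose(Sigma_back, Sigma))   # True True
```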
For variables with a finite domain Val(X):

M = {E_p[ϕ(x)] for any p} = conv{ϕ(x) for x ∈ Val(X)}

so the mean parameter space is a convex polytope, the marginal polytope.

Example with 2 variables X₁, X₂ ∈ {0, 1}:

sufficient statistics: I[X₁ = 1], I[X₂ = 1], I[X₁ = 1, X₂ = 1]
mean parameters: μ₁ = E[X₁], μ₂ = E[X₂], μ_{1,2} = E[X₁X₂]

M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
(image: wainwright & jordan)
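Membership in the marginal polytope can be checked as a small linear feasibility problem (a sketch; the test points are my own examples): μ ∈ M iff some distribution over the four configurations has these expected sufficient statistics.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# sufficient statistics phi(x) = (I[x1=1], I[x2=1], I[x1=1, x2=1]) for x in {0,1}^2
configs = list(product([0, 1], repeat=2))
V = np.array([[x1, x2, x1 * x2] for x1, x2 in configs]).T   # 3 x 4 vertex matrix

def in_marginal_polytope(mu):
    # feasible iff there is p >= 0 with sum(p) = 1 and V @ p = mu
    A_eq = np.vstack([V, np.ones((1, 4))])
    b_eq = np.append(mu, 1.0)
    res = linprog(c=np.zeros(4), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4)
    return res.success

print(in_marginal_polytope(np.array([0.5, 0.5, 0.4])))   # True
print(in_marginal_polytope(np.array([0.5, 0.5, 0.6])))   # False: mu_12 > min(mu_1, mu_2)
```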
Recap: motivate entropy from physics and information theory; derive the exponential family using entropy; examples (famous univariate distributions, minimal & overcomplete parametrizations, discrete MRF, multivariate Gaussian); expected sufficient statistics and natural parameters identify the same distribution.
Two mappings between μ and θ:

Inference: θ ⇒ μ = E_{p_θ}[ϕ(x)]. For ϕ_k(x) = I(x_i = r, x_j = s), the mean parameters are marginals.

Learning: given samples X₁, X₂, …, Xₙ ∼ p_θ, calculate the expected sufficient statistics

μ̂ = (1/n) ∑_{i=1}^n ϕ(Xᵢ)

then find θ s.t. E_{p_θ}[ϕ(x)] = μ̂.
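A minimal sketch of learning by moment matching for the univariate Gaussian (my own example): estimate μ̂ from samples, then invert the mean-to-natural-parameter map in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=100_000)   # X_1, ..., X_n ~ p_theta

# expected sufficient statistics mu_hat = (1/n) * sum_i phi(X_i), phi(x) = [x, x^2]
mu_hat = np.array([samples.mean(), (samples**2).mean()])

# backward mapping for the Gaussian: mean = mu_hat[0], variance = E[x^2] - E[x]^2
mean = mu_hat[0]
var = mu_hat[1] - mu_hat[0]**2
theta = np.array([mean / var, -1.0 / (2 * var)])          # natural parameters

print(mean, var, theta)   # close to the generating parameters (2.0, 1.5^2)
```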
Consider the log-partition function

A(θ) = log ∫_{Val(X)} exp( ⟨θ, ϕ(x)⟩ ) dx

Its derivative gives the forward mapping:

∇_θ A(θ) = ∫_{Val(X)} p_θ(x) ϕ(x) dx = μ

A(θ) is convex and its conjugate dual is the negative entropy:

A*(μ) = max_{θ ∈ Θ} ⟨μ, θ⟩ − A(θ) = −H(p_{θ(μ)})
A(θ) = max_{μ ∈ M} ⟨μ, θ⟩ − A*(μ)

(image: wainwright & jordan)
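For a small discrete family (with an arbitrary ϕ of my choosing, not from the slides) the forward mapping can be checked numerically: the finite-difference gradient of A(θ) matches E_θ[ϕ(x)].

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
Phi = rng.standard_normal((10, 3))     # phi(x) for 10 discrete states, 3 statistics
theta = np.array([0.3, -0.7, 0.5])

def A(theta):
    # log-partition function: log sum_x exp(<theta, phi(x)>)
    return logsumexp(Phi @ theta)

# expected sufficient statistics under p_theta
p = np.exp(Phi @ theta - A(theta))
mu = p @ Phi

# finite-difference gradient of A
eps = 1e-6
grad = np.array([(A(theta + eps * e) - A(theta - eps * e)) / (2 * eps)
                 for e in np.eye(3)])

print(np.allclose(mu, grad, atol=1e-5))   # True: the forward mapping is the gradient of A
```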
Bernoulli forward mapping: p(x; θ) = exp( θx − ln(1 + exp(θ)) ), so A(θ) = ln(1 + exp(θ)) and Θ = ℜ. The mean parameter is

∇_θ A(θ) = exp(θ) / (1 + exp(θ)) = μ

Conjugate dual: substitute the maximizing θ = ln( μ / (1 − μ) ) into

A*(μ) = max_{θ ∈ ℜ} ⟨μ, θ⟩ − ln(1 + exp(θ))

to get

A*(μ) = μ ln(μ) + (1 − μ) ln(1 − μ)

the negative entropy! The backward mapping μ → θ is easy in the univariate case: a closed-form mapping.
(image: wainwright & jordan)
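A sketch of the Bernoulli mappings in code: forward μ = ∇A(θ), backward θ = ln(μ/(1−μ)), and A*(μ) evaluated at the maximizer equals the negative entropy.

```python
import numpy as np

def A(theta):                      # log-partition: ln(1 + e^theta)
    return np.log1p(np.exp(theta))

def forward(theta):                # mean parameter mu = grad A = sigmoid(theta)
    return np.exp(theta) / (1 + np.exp(theta))

def backward(mu):                  # closed-form backward mapping
    return np.log(mu / (1 - mu))

def A_star(mu):                    # conjugate dual evaluated at the maximizing theta
    theta = backward(mu)
    return mu * theta - A(theta)

mu = 0.3
print(forward(backward(mu)))                                # 0.3: the maps invert each other
print(A_star(mu), mu*np.log(mu) + (1-mu)*np.log(1-mu))      # both equal the negative entropy
```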
A(θ) = max_{μ ∈ M} ⟨μ, θ⟩ − A*(μ), and the maximizing μ is ∇_θ A(θ).

In (high-dimensional) graphical models, M is difficult to specify (exponentially many facets) and the entropy −A*(μ) doesn't have a simple form. Approximating both gives (approximate) variational inference; e.g., it gives us marginals in the Ising model.
(image: wainwright & jordan)
Relative entropy of p(x; θ₁) and p(x; θ₂):

D(θ₁ ∥ θ₂) = ⟨μ₁, θ₁ − θ₂⟩ − A(θ₁) + A(θ₂),   where μ₁ = ∇_θ A(θ₁)

Alternative form, using A*(μ₁) = ⟨μ₁, θ₁⟩ − A(θ₁):

D(μ₁ ∥ θ₂) = A*(μ₁) − ⟨μ₁, θ₂⟩ + A(θ₂)

which does not depend on θ₁ and takes the form of a Bregman divergence. Then

min_{μ₁ ∈ M} D(μ₁ ∥ θ₂)  ⇔  max_{μ₁ ∈ M} ⟨μ₁, θ₂⟩ − A*(μ₁)   (A(θ₂) is a constant)

the familiar optimization! So the mapping θ → μ is minimizing the KL-divergence.

KL is not symmetric: which one to use? Is this the "right" one?
(image: wainwright & jordan)
Project p into a convex set of distributions Q.

I-projection (information projection): q_I ≜ arg min_{q ∈ Q} D(q ∥ p) = arg min_{q ∈ Q} −H(q) + E_q[−ln(p)]
It has mode-seeking behavior.

M-projection (moment projection): q_M ≜ arg min_{q ∈ Q} D(p ∥ q) = arg min_{q ∈ Q} −E_p[ln(q)]
Example: p(a₀, b₀) = .45, p(a₀, b₁) = .05, p(a₁, b₀) = .05, p(a₁, b₁) = .45.
Project into a q with factorized form q(a, b) = q(a)q(b).

M-projection: q_M(a₀) = q_M(a₁) = .5 and q_M(b₀) = q_M(b₁) = .5 (the marginals of p)
I-projection: q_I(a₁) = q_I(b₁) = .75 (equivalently q_I(a₀) = q_I(b₀) = .25), concentrating on one mode: mode-seeking behavior
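A numerical version of this example (a sketch; the optimizer and starting point are my choices): search over factorized q(a)q(b), minimizing each KL direction.

```python
import numpy as np
from scipy.optimize import minimize

# joint p over (a, b): rows index a in {0,1}, columns index b in {0,1}
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])

def q_joint(s, t):
    # factorized q(a,b) = q(a) q(b) with q(a=1) = s, q(b=1) = t
    qa, qb = np.array([1 - s, s]), np.array([1 - t, t])
    return np.outer(qa, qb)

def kl(r, w):                      # D(r || w) for joint tables
    return np.sum(r * (np.log(r) - np.log(w)))

# M-projection: minimize D(p || q)  -> matches the marginals of p
res_M = minimize(lambda v: kl(p, q_joint(*v)), x0=[0.6, 0.6],
                 bounds=[(1e-6, 1 - 1e-6)] * 2)
# I-projection: minimize D(q || p)  -> mode-seeking
res_I = minimize(lambda v: kl(q_joint(*v), p), x0=[0.6, 0.6],
                 bounds=[(1e-6, 1 - 1e-6)] * 2)

print(res_M.x)   # ~ [0.5, 0.5]
print(res_I.x)   # ~ [0.75, 0.75] (or [0.25, 0.25] from a different start)
```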
M-projection of p into a q with factorized form q(x) = ∏_k q(x_k) (and otherwise unrestricted): the proof gives q_M(x) = ∏_k p(x_k), the product of marginals of p.

D(p ∥ q) = E_p[ln p(x)] − ∑_k E_p[ln q(x_k)]
         = E_p[ ln( p(x) / ∏_k p(x_k) ) ] + ∑_k E_p[ ln( p(x_k) / q(x_k) ) ]
         = D(p ∥ q_M) + ∑_k D( p(x_k) ∥ q(x_k) )

minimized when the last term is zero, i.e., q = q_M.
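A quick numerical check of this decomposition (sketch) for a random joint over two small variables.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((3, 4)); p /= p.sum()          # random joint p(x1, x2)

# marginals of p, and an arbitrary factorized q(x) = q(x1) q(x2)
p1, p2 = p.sum(axis=1), p.sum(axis=0)
q1 = rng.random(3); q1 /= q1.sum()
q2 = rng.random(4); q2 /= q2.sum()

def kl(a, b):
    return np.sum(a * (np.log(a) - np.log(b)))

lhs = kl(p, np.outer(q1, q2))                            # D(p || q)
rhs = kl(p, np.outer(p1, p2)) + kl(p1, q1) + kl(p2, q2)  # D(p || q_M) + sum_k D(p_k || q_k)
print(np.isclose(lhs, rhs))                              # True
```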
M-projection of p into an exponential family q_θ(x) = exp( ⟨θ, ϕ(x)⟩ − A(θ) ): the projection is given by moment matching,

E_{q_θ}[ϕ(x)] = E_p[ϕ(x)]

Proof: let θ satisfy the moment-matching condition above; then for any θ′

D(p ∥ q_{θ′}) − D(p ∥ q_θ) = ⟨E_p[ϕ(x)], θ − θ′⟩ − A(θ) + A(θ′)
                           = ⟨E_{q_θ}[ϕ(x)], θ − θ′⟩ − A(θ) + A(θ′) = D(q_θ ∥ q_{θ′}) ≥ 0

So the M-projection produces a distribution with the same μ.
Recall the conjugate duality:

A*(μ) = max_{θ ∈ Θ} ⟨μ, θ⟩ − A(θ)
A(θ) = max_{μ ∈ M} ⟨μ, θ⟩ − A*(μ)

The variational approach to inference corresponds to I-projection; maximum likelihood learning corresponds to M-projection.
(image: wainwright & jordan)
Summary: intuition for entropy & relative entropy; derivation of the exponential family; examples of linear exponential families; mean & natural parametrizations; inference and learning as a mapping between the two; relation to conjugate duality; relation to information and moment projections.