Graphical Models: Exponential Family & Variational Inference I - PowerPoint PPT Presentation



  1. Graphical Models: Exponential Family & Variational Inference I. Siamak Ravanbakhsh, Winter 2018.

  2. Learning objectives: entropy; the exponential family distribution; duality in the exponential family; the relationship between the two parametrizations; inference and learning as mappings between the two; relative entropy and two types of projections.

  3. A measure of information. We want a measure of information I(X = x) such that: observing a less probable event gives more information; information is non-negative, with I(X = x) = 0 ⇔ P(X = x) = 1; and information from independent events is additive: A = a ⊥ B = b ⇒ I(A = a, B = b) = I(A = a) + I(B = b).

  4. A measure of information. We want a measure of information I(X = x) such that: observing a less probable event gives more information; information is non-negative, with I(X = x) = 0 ⇔ P(X = x) = 1; and information from independent events is additive: A = a ⊥ B = b ⇒ I(A = a, B = b) = I(A = a) + I(B = b). The definition follows from these characteristics: I(X = x) ≜ log(1 / P(X = x)) = −log(P(X = x)).

  5. Entropy: information theory. The information in the observation X = x is I(X = x) ≜ −log(P(X = x)). Entropy is the expected amount of information: H(P) ≜ E[I(X)] = −∑_{x ∈ Val(X)} P(X = x) log(P(X = x)). It is the expected code length when transmitting X (repeatedly), e.g., using Huffman coding, and it achieves its maximum for the uniform distribution: 0 ≤ H(P) ≤ log(|Val(X)|).

  6. Entropy: example. Val(X) = {a, b, c, d, e, f} with P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = P(f) = 1/32. An optimal code for transmitting X: a → 0, b → 10, c → 110, d → 1110, e → 11110, f → 11111. Average length? H(P) = −(1/2)log(1/2) − (1/4)log(1/4) − (1/8)log(1/8) − (1/16)log(1/16) − 2·(1/32)log(1/32) = 1 15/16 bits; the contributions to the average length are 1/2 (from X = a), 1/2, 3/8, 1/4, and 5/16 (from e and f together).
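
A quick numeric check of this slide's arithmetic (a minimal Python sketch I've added; the distribution and code lengths are taken from the slide):

```python
import math

# Distribution from the slide: P(a)=1/2, P(b)=1/4, P(c)=1/8, P(d)=1/16, P(e)=P(f)=1/32
p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16, "e": 1/32, "f": 1/32}
# Lengths of the optimal code shown on the slide (a -> 0, b -> 10, ..., f -> 11111)
code_len = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 5}

entropy = -sum(px * math.log2(px) for px in p.values())   # H(P) in bits
avg_len = sum(p[x] * code_len[x] for x in p)              # expected code length

print(entropy)   # 1.9375 = 1 15/16 bits
print(avg_len)   # 1.9375; matches H(P) because all probabilities are powers of 2
```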

  7. Relative entropy: information theory. What if we used a code designed for q? The average code length when transmitting X ∼ p is the cross entropy H(p, q) ≜ −∑_{x ∈ Val(X)} p(x) log(q(x)), where −log(q(x)) is the optimal code length for X = x according to q.

  8. Relative entropy: information theory. What if we used a code designed for q? The average code length when transmitting X ∼ p is the cross entropy H(p, q) ≜ −∑_{x ∈ Val(X)} p(x) log(q(x)), where −log(q(x)) is the optimal code length for X = x according to q. The extra amount of information transmitted is the Kullback-Leibler divergence, or relative entropy: D(p ∥ q) ≜ ∑_{x ∈ Val(X)} p(x)(log(p(x)) − log(q(x))).

  9. Relative entropy: information theory. Kullback-Leibler divergence: D(p ∥ q) ≜ ∑_{x ∈ Val(X)} p(x)(log(p(x)) − log(q(x))). Some properties: it is non-negative, and zero iff p = q; it is asymmetric; and for the uniform distribution u over N values, D(p ∥ u) = ∑_x p(x)(log(p(x)) − log(1/N)) = log(N) − H(p).
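
A small sketch (mine, not from the slides) computing the cross entropy and KL divergence, and checking the D(p ∥ u) = log(N) − H(p) identity for the six-symbol distribution from the earlier example:

```python
import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p)

def kl(p, q):
    # D(p || q) = sum_x p(x) * (log p(x) - log q(x))
    return sum(px * (math.log2(px) - math.log2(qx)) for px, qx in zip(p, q))

p = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
u = [1/6] * 6                       # uniform distribution over N = 6 values

print(kl(p, u))                     # extra bits paid when coding p with u's code
print(math.log2(6) - entropy(p))    # log(N) - H(p), the same value
print(kl(p, p))                     # 0.0: the divergence is zero iff p = q
```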

  10. Entropy: physics. 16 microstates: the positions of 4 particles in a top/bottom box. 5 macrostates: states that are indistinguishable assuming exchangeable particles (0, 1, 2, 3, or 4 particles in the top box).

  11. Entropy: physics. 16 microstates: the positions of 4 particles in a top/bottom box. 5 macrostates: states that are indistinguishable assuming exchangeable particles. With Val(X) = {top, bottom}, each macrostate corresponds to a different distribution over Val(X), so there are 5 different distributions: macrostate ≡ distribution.

  12. Entropy: physics. 16 microstates: the positions of 4 particles in a top/bottom box. 5 macrostates: states that are indistinguishable assuming exchangeable particles. With Val(X) = {top, bottom}, each macrostate corresponds to a different distribution over Val(X): macrostate ≡ distribution. Entropy of a macrostate: the (normalized) log number of its microstates.

  13. Entropy: physics. Entropy of a macrostate: the normalized log number of its microstates. Assume a large number of particles N, with N_t in the top box and N_b in the bottom box: H = (1/N) ln(N! / (N_t! N_b!)) = (1/N)(ln(N!) − ln(N_t!) − ln(N_b!)). Use Stirling's approximation ln(N!) ≃ N ln(N) − N.

  14. Entropy: physics. Entropy of a macrostate: the normalized log number of its microstates. Assume a large number of particles N: H = (1/N) ln(N! / (N_t! N_b!)) = (1/N)(ln(N!) − ln(N_t!) − ln(N_b!)). Applying Stirling's approximation ln(N!) ≃ N ln(N) − N gives H = −(N_t/N) ln(N_t/N) − (N_b/N) ln(N_b/N), which is the entropy of the distribution with P(X = top) = N_t/N, in nats instead of bits.
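
A numeric sanity check (my own sketch, not from the slides): compare the exact normalized log number of microstates with the Stirling-approximation entropy for a moderately large N; the particle counts are assumed example values.

```python
import math

N, N_t = 1000, 300            # assumed example: 1000 particles, 300 in the top box
N_b = N - N_t

# (1/N) * (ln N! - ln N_t! - ln N_b!), computed exactly via log-gamma
exact = (math.lgamma(N + 1) - math.lgamma(N_t + 1) - math.lgamma(N_b + 1)) / N

# Stirling-approximation result: -(N_t/N) ln(N_t/N) - (N_b/N) ln(N_b/N)
p_top = N_t / N
stirling = -p_top * math.log(p_top) - (1 - p_top) * math.log(1 - p_top)

print(exact, stirling)        # ~0.607 vs ~0.611 nats; they converge as N grows
```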

  15. Differential entropy (continuous domains). Divide the domain Val(X) into small bins of width Δ; for each bin there exists x_i ∈ (iΔ, (i+1)Δ) such that ∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i)Δ. Then H_Δ(p) = −∑_i p(x_i)Δ ln(p(x_i)Δ) = −ln(Δ) − ∑_i p(x_i)Δ ln(p(x_i)). Ignore the −ln(Δ) term and take the limit Δ → 0 to get H(p) ≜ −∫_{Val(X)} p(x) ln(p(x)) dx.
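
A small illustration (mine, not from the slides): approximate the differential entropy of a standard Gaussian with the binned sum above, and compare it with the known closed form (1/2) ln(2πe) nats for σ = 1.

```python
import math

def pdf(x, mu=0.0, sigma=1.0):
    # Standard Gaussian density
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

delta = 1e-3
xs = [-10 + i * delta for i in range(int(20 / delta))]      # bins covering [-10, 10]

# -sum_i p(x_i) * delta * ln(p(x_i)), i.e. the binned sum with the -ln(delta) term dropped
approx = -sum(pdf(x) * delta * math.log(pdf(x)) for x in xs)

print(approx)                                 # ~1.4189
print(0.5 * math.log(2 * math.pi * math.e))   # 1.4189..., the closed-form value
```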

  16. Max-entropy distribution. Maximize the entropy subject to constraints: arg max_p H(p) subject to p(x) > 0 ∀x, ∫_{Val(X)} p(x) dx = 1, and E_p[φ_k(X)] = μ_k ∀k.

  17. Max-entropy distribution. Maximize the entropy subject to constraints: arg max_p H(p) subject to p(x) > 0 ∀x, ∫_{Val(X)} p(x) dx = 1, and E_p[φ_k(X)] = μ_k ∀k. Using Lagrange multipliers, the solution has the form p(x) ∝ exp(∑_k θ_k φ_k(x)).
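
A sketch of the Lagrange-multiplier step (my reconstruction of the standard derivation; the multipliers λ and θ_k are not written out on the slide):

```latex
% Lagrangian with multiplier \lambda for normalization and \theta_k for the moment constraints
L(p,\lambda,\theta) = -\int p(x)\ln p(x)\,dx
  + \lambda\Big(\int p(x)\,dx - 1\Big)
  + \sum_k \theta_k\Big(\int p(x)\,\phi_k(x)\,dx - \mu_k\Big)

% Setting the functional derivative with respect to p(x) to zero:
\frac{\partial L}{\partial p(x)} = -\ln p(x) - 1 + \lambda + \sum_k \theta_k \phi_k(x) = 0
\;\Rightarrow\; p(x) = \exp\Big(\lambda - 1 + \sum_k \theta_k \phi_k(x)\Big)
               \propto \exp\Big(\sum_k \theta_k \phi_k(x)\Big)
```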

  18. Exponential family. An exponential family has the following form: p(x; θ) = h(x) exp(⟨η(θ), φ(x)⟩ − A(θ)), where h(x) is the base measure, φ(x) are the sufficient statistics, ⟨·, ·⟩ is the inner product of two vectors, and A(θ) is the log-partition function A(θ) = ln(∫_{Val(X)} h(x) exp(∑_k η_k(θ) φ_k(x)) dx), with a convex parameter space θ ∈ Θ = {θ ∈ ℜ^D | A(θ) < ∞}.

  19. Example: univariate Gaussian. Moment form: p(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)). Matching p(x; μ, σ²) = h(x) exp(⟨η(θ), φ(x)⟩ − A(θ)) gives φ(x) = [x, x²], η(μ, σ²) = [μ/σ², −1/(2σ²)], A(μ, σ²) = (1/2)(ln(2πσ²) + μ²/σ²), and h(x) = 1, for (μ, σ²) ∈ ℜ × ℜ⁺.
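
A quick numeric check (my own sketch, not from the slides) that this exponential-family parametrization reproduces the moment-form density:

```python
import math

def gaussian_moment(x, mu, var):
    # Moment form: N(x; mu, var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gaussian_expfam(x, mu, var):
    # eta = [mu/var, -1/(2*var)], phi(x) = [x, x^2], A = 0.5*(ln(2*pi*var) + mu^2/var), h(x) = 1
    eta = (mu / var, -1 / (2 * var))
    phi = (x, x ** 2)
    A = 0.5 * (math.log(2 * math.pi * var) + mu ** 2 / var)
    return math.exp(eta[0] * phi[0] + eta[1] * phi[1] - A)

for x in (-1.0, 0.3, 2.5):
    print(gaussian_moment(x, mu=1.0, var=4.0), gaussian_expfam(x, mu=1.0, var=4.0))  # identical pairs
```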

  20. Example: Bernoulli. Conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x). Matching p(x; μ) = h(x) exp(⟨η(θ), φ(x)⟩ − A(θ)) gives φ(x) = [I(x = 1), I(x = 0)] and η(μ) = [ln(μ), ln(1 − μ)], for μ ∈ (0, 1).

  21. Linear exponential family. When using natural parameters: p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)). Why not simply define the natural parameters η(θ) to be the new θ? The natural parameter space needs to be convex: θ ∈ Θ = {θ ∈ ℜ^D | A(θ) < ∞}.

  22. Linear exponential family. When using natural parameters: p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)); the base measure h(x) can be absorbed as an extra sufficient statistic ln(h(x)) whose natural parameter is fixed to 1. Why not simply define the natural parameters η(θ) to be the new θ? The natural parameter space needs to be convex: θ ∈ Θ = {θ ∈ ℜ^D | A(θ) < ∞}.

  23. Example: univariate Gaussian, take 2. Using natural parameters in the univariate Gaussian: p(x; θ) = exp(⟨θ, φ(x)⟩ − A(θ)) with φ(x) = [x, x²], θ = [μ/σ², −1/(2σ²)], and A(θ) = −θ₁²/(4θ₂) + (1/2) ln(−π/θ₂), where Θ = ℜ × ℜ⁻ is a convex set.
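
Since the slide's expression for A(θ) is partly garbled in this transcript, here is a small check (my reconstruction) that A(θ) = −θ₁²/(4θ₂) + (1/2) ln(−π/θ₂) agrees with the moment-form log-partition (1/2)(ln(2πσ²) + μ²/σ²):

```python
import math

mu, var = 1.5, 0.7                      # assumed example values
theta1, theta2 = mu / var, -1 / (2 * var)

A_natural = -theta1 ** 2 / (4 * theta2) + 0.5 * math.log(-math.pi / theta2)
A_moment = 0.5 * (math.log(2 * math.pi * var) + mu ** 2 / var)

print(A_natural, A_moment)              # identical values
```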

  24. Example: Bernoulli, take 2. Conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x). Matching p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) gives φ(x) = [I(x = 1), I(x = 0)] and θ = [ln(μ), ln(1 − μ)].

  25. Example: Bernoulli, take 2. Conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x). Matching p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) gives φ(x) = [I(x = 1), I(x = 0)] and θ = [ln(μ), ln(1 − μ)]. However, the resulting set Θ = {(ln(μ), ln(1 − μ)) : μ ∈ (0, 1)} is a curve in ℜ², not a convex set.

  26. Example: Bernoulli, take 3. Conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x). Take p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) with φ(x) = [I(x = 1), I(x = 0)] and θ ∈ ℜ².

  27. Example: Bernoulli, take 3. Conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x). Take p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) with φ(x) = [I(x = 1), I(x = 0)] and θ ∈ ℜ². This parametrization is redundant, or overcomplete: p(x; [θ₁, θ₂]) = p(x; [θ₁ + c, θ₂ + c]) for any constant c. In general, a parametrization is redundant iff there exists a nonzero θ such that ⟨θ, φ(x)⟩ = c for all x.
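
A small demonstration (my own sketch; the log-partition A(θ) = ln(e^θ₁ + e^θ₂) is implied rather than written on the slide) that shifting both natural parameters by the same constant leaves the Bernoulli distribution unchanged:

```python
import math

def bernoulli_overcomplete(x, theta1, theta2):
    # phi(x) = [I(x=1), I(x=0)], A(theta) = ln(e^theta1 + e^theta2), h(x) = 1
    A = math.log(math.exp(theta1) + math.exp(theta2))
    return math.exp(theta1 * (x == 1) + theta2 * (x == 0) - A)

for c in (0.0, 3.0, -5.0):
    print(bernoulli_overcomplete(1, 0.4 + c, -1.1 + c))   # same probability for every shift c
```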

  28. Example: Bernoulli, take 4. Conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x). Take p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) with φ(x) = [I(x = 1)], θ = ln(μ/(1 − μ)), and A(θ) = log(1 + e^θ).

  29. Example: Bernoulli, take 4. Conventional form (mean parametrization): p(x; μ) = μ^x (1 − μ)^(1−x). Take p(x; θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) with φ(x) = [I(x = 1)], θ = ln(μ/(1 − μ)), and A(θ) = log(1 + e^θ). Here Θ = ℜ is convex and this parametrization is minimal.
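
A final sketch (mine, not from the slides) of the minimal parametrization: θ is the log-odds, and p(x = 1; θ) recovers μ via the logistic sigmoid.

```python
import math

def bernoulli_minimal(x, theta):
    # phi(x) = I(x=1), A(theta) = log(1 + e^theta), h(x) = 1
    A = math.log(1 + math.exp(theta))
    return math.exp(theta * (x == 1) - A)

mu = 0.8
theta = math.log(mu / (1 - mu))                       # natural parameter: ln(mu/(1-mu))
print(bernoulli_minimal(1, theta), bernoulli_minimal(0, theta))   # 0.8, 0.2
```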
