 
Graphical Models
Exponential family & Variational Inference I
Siamak Ravanbakhsh, Winter 2018

Learning objectives
- entropy
- exponential family distribution
- duality in exponential family
- relationship between two parametrizations
- inference and learning as mapping between the two
- relative entropy and two types of projections

A measure of information
a measure of information $I(X = x)$:
- observing a less probable event gives more information
- information is non-negative, and $I(X = x) = 0 \Leftrightarrow P(X = x) = 1$
- information from independent events is additive: $A = a \perp B = b \Rightarrow I(A = a, B = b) = I(A = a) + I(B = b)$
the definition follows from these characteristics:
$$I(X = x) \triangleq \log\left(\frac{1}{P(X = x)}\right) = -\log(P(X = x))$$

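The three characteristics of $I(X = x)$ can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (bits); the function name `information_bits` and the probabilities are hypothetical choices made for illustration.

```python
import math

def information_bits(prob):
    """Self-information I(X = x) = log2(1 / P(X = x)), measured in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must lie in (0, 1]")
    return math.log2(1.0 / prob)

# Hypothetical probabilities of two independent events A = a and B = b.
p_a, p_b = 0.5, 0.125

info_certain = information_bits(1.0)      # a certain event carries no information
info_a = information_bits(p_a)            # more probable event
info_b = information_bits(p_b)            # less probable event, more information
info_joint = information_bits(p_a * p_b)  # independence: P(a, b) = P(a) P(b)

print(info_certain, info_a, info_b, info_joint)
```

The joint information equals the sum of the two individual values because the logarithm turns the product of probabilities into a sum.
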
Entropy: information theory
- information in the observation $X = x$ is $I(X = x) \triangleq -\log(P(X = x))$
- entropy: expected amount of information
$$H(P) \triangleq E[I(X)] = -\sum_{x \in Val(X)} P(X = x)\log(P(X = x))$$
- expected code length in transmitting X (repeatedly), e.g., using Huffman coding
- achieves its maximum for the uniform distribution: $0 \le H(P) \le \log(|Val(X)|)$

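A short sketch of the entropy definition and its bounds, assuming base-2 logarithms and a domain of four values; the three distributions are made up for illustration.

```python
import math

def entropy_bits(probs):
    """H(P) = -sum_x P(x) log2 P(x); outcomes with P(x) = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0.0)

n = 4
deterministic = [1.0, 0.0, 0.0, 0.0]   # lower bound: H = 0
skewed = [0.7, 0.1, 0.1, 0.1]          # somewhere in between
uniform = [1.0 / n] * n                # upper bound: H = log2(n)

h_det = entropy_bits(deterministic)
h_skew = entropy_bits(skewed)
h_uni = entropy_bits(uniform)

print(h_det, h_skew, h_uni, math.log2(n))
```
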
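Entropy: example
$Val(X) = \{a, b, c, d, e, f\}$
$P(a) = \frac{1}{2},\ P(b) = \frac{1}{4},\ P(c) = \frac{1}{8},\ P(d) = \frac{1}{16},\ P(e) = P(f) = \frac{1}{32}$
an optimal code for transmitting X:
- a → 0
- b → 10
- c → 110
- d → 1110
- e → 11110
- f → 11111
average length? (logarithms in base 2)
$$H(P) = -\frac{1}{2}\log\left(\frac{1}{2}\right) - \frac{1}{4}\log\left(\frac{1}{4}\right) - \frac{1}{8}\log\left(\frac{1}{8}\right) - \frac{1}{16}\log\left(\frac{1}{16}\right) - \frac{1}{16}\log\left(\frac{1}{32}\right) = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{1}{4} + \frac{5}{16} = 1\frac{15}{16}$$
the first term, $\frac{1}{2}$, is the contribution to the average length from X = a
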
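The arithmetic of this example can be reproduced directly. This sketch uses the probabilities and codewords from the slide; the variable names are illustrative, and the prefix check is an added sanity test rather than part of the original example.

```python
import math

probs = {"a": 1 / 2, "b": 1 / 4, "c": 1 / 8, "d": 1 / 16, "e": 1 / 32, "f": 1 / 32}
code = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "11110", "f": "11111"}

# Entropy in bits and the average codeword length under P.
entropy_bits = sum(-p * math.log2(p) for p in probs.values())
avg_length = sum(probs[s] * len(code[s]) for s in probs)

# A code is uniquely decodable on the fly if no codeword is a prefix of another.
words = list(code.values())
prefix_free = not any(
    u != v and v.startswith(u) for u in words for v in words
)

print(entropy_bits, avg_length, prefix_free)  # 1.9375 1.9375 True
```

The average length matches the entropy exactly here because every probability is a power of 1/2, so each codeword length equals $-\log_2 P(x)$.
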
Relative entropy: information theory
what if we used a code designed for q? the average code length when transmitting $X \sim p$ is the cross entropy
$$H(p, q) \triangleq -\sum_{x \in Val(X)} p(x)\log(q(x))$$
here $\log(q(x))$ is the negative of the optimal code length for X = x according to q
the extra amount of information transmitted:
$$D(p \| q) \triangleq \sum_{x \in Val(X)} p(x)\left(\log(p(x)) - \log(q(x))\right)$$
this is the Kullback-Leibler divergence or relative entropy

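A small numerical sketch of the statement that the KL divergence is the extra code length paid for using a code built for q. The distributions `p` and `q` are hypothetical; logarithms are in base 2.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average length of q's code under p."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) (log2 p(x) - log2 q(x))."""
    return sum(pi * (math.log2(pi) - math.log2(qi)) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.5, 0.25, 0.125, 0.125]   # true source distribution (assumed)
q = [0.25, 0.25, 0.25, 0.25]    # the code was designed for a uniform source

h_p = cross_entropy(p, p)       # entropy of p: 1.75 bits
h_pq = cross_entropy(p, q)      # 2 bits per symbol with the uniform code
extra = kl_divergence(p, q)     # 0.25 bits wasted per symbol

print(h_p, h_pq, extra)
```
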
Relative entropy: information theory
Kullback-Leibler divergence:
$$D(p \| q) \triangleq \sum_{x \in Val(X)} p(x)\left(\log(p(x)) - \log(q(x))\right)$$
some properties:
- non-negative, and zero iff p = q
- asymmetric
- for the uniform distribution u over N values: $D(p \| u) = \sum_x p(x)\left(\log(p(x)) - \log\left(\frac{1}{N}\right)\right) = \log(N) - H(p)$

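The listed properties can be spot-checked on a pair of example distributions. This is only a sketch on hypothetical values of `p` and `q` (natural logarithms), not a proof.

```python
import math

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0.0)

def kl(p, q):
    """D(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.7, 0.2, 0.1]
q = [0.2, 0.5, 0.3]
n = len(p)
u = [1.0 / n] * n   # uniform distribution over N = 3 values

d_pq, d_qp, d_pp = kl(p, q), kl(q, p), kl(p, p)
d_pu = kl(p, u)

print(d_pq, d_qp)                      # both positive, but not equal
print(d_pp)                            # zero when the two arguments coincide
print(d_pu, math.log(n) - entropy(p))  # the two values agree
```
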
Entropy: physics
- 16 microstates: position of 4 particles in top/bottom box
- 5 macrostates: indistinguishable states assuming exchangeable particles
- with $Val(X) = \{top, bottom\}$ we can assume 5 different distributions (macrostate ≡ distribution)
- entropy of a macrostate: (normalized) log number of its microstates

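The counts of 16 microstates and 5 macrostates can be enumerated directly. A minimal sketch, assuming a macrostate is identified by the number of particles in the top box; the variable names are illustrative.

```python
import math
from collections import Counter
from itertools import product

n_particles = 4

# Every assignment of the 4 distinguishable particles to a box is a microstate.
microstates = list(product(("top", "bottom"), repeat=n_particles))

# With exchangeable particles only the number in the top box matters.
macrostates = Counter(state.count("top") for state in microstates)

for n_top in sorted(macrostates):
    count = macrostates[n_top]
    normalized_log = math.log(count) / n_particles   # entropy of the macrostate, in nats
    print(n_top, count, round(normalized_log, 4))
```

The macrostate with two particles in each box has the most microstates (6) and therefore the largest entropy.
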
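Entropy: physics
entropy of a macrostate: normalized log #microstates
assume a large number of particles N, with $N_t$ in the top box and $N_b$ in the bottom box:
$$H = \frac{1}{N}\ln\left(\frac{N!}{N_t!\,N_b!}\right) = \frac{1}{N}\left(\ln(N!) - \ln(N_t!) - \ln(N_b!)\right)$$
using $\ln(N!) \simeq N\ln(N) - N$:
$$H \simeq -\frac{N_t}{N}\ln\left(\frac{N_t}{N}\right) - \frac{N_b}{N}\ln\left(\frac{N_b}{N}\right)$$
- $\frac{N_t}{N}$ plays the role of $P(X = top)$
- nats instead of bits
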
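The quality of the large-N approximation can be examined numerically. This sketch compares the exact normalized log number of microstates with the entropy formula; it uses `math.lgamma(k + 1)` for $\ln(k!)$, and the 30/70 split between boxes is an arbitrary assumption.

```python
import math

def normalized_log_microstates(n_top, n_bottom):
    """(1/N) ln( N! / (N_t! N_b!) ), computed via lgamma(k + 1) = ln(k!)."""
    n = n_top + n_bottom
    log_count = math.lgamma(n + 1) - math.lgamma(n_top + 1) - math.lgamma(n_bottom + 1)
    return log_count / n

def entropy_nats(n_top, n_bottom):
    """-(N_t/N) ln(N_t/N) - (N_b/N) ln(N_b/N)."""
    n = n_top + n_bottom
    return sum(-(k / n) * math.log(k / n) for k in (n_top, n_bottom) if k > 0)

gaps = {}
for n in (10, 1000, 100000):
    n_top = (3 * n) // 10            # 30% of the particles in the top box
    n_bottom = n - n_top
    exact = normalized_log_microstates(n_top, n_bottom)
    approx = entropy_nats(n_top, n_bottom)
    gaps[n] = approx - exact
    print(n, round(exact, 6), round(approx, 6))
```

The gap between the two quantities shrinks as N grows, which is what the approximation relies on.
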
Differential entropy (continuous domains)
- divide the domain $Val(X)$ using small bins of width $\Delta$
- $\exists x_i \in (i\Delta, (i+1)\Delta)$ such that $\int_{i\Delta}^{(i+1)\Delta} p(x)\,dx = p(x_i)\Delta$
- $H_\Delta(p) = -\sum_i p(x_i)\Delta \ln(p(x_i)\Delta) = -\ln(\Delta) - \sum_i p(x_i)\Delta \ln(p(x_i))$
- ignore the $-\ln(\Delta)$ term and take the limit $\Delta \to 0$ to get
$$H(p) \triangleq -\int_{Val(X)} p(x)\ln(p(x))\,dx$$

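The binning argument can be illustrated on a standard Gaussian. In this sketch the bin midpoints stand in for the points $x_i$ (an approximation), the domain is truncated to [-10, 10], and the closed form $\frac{1}{2}\ln(2\pi e)$ used for comparison is the standard differential entropy of a unit-variance Gaussian, which is not stated in the slides.

```python
import math

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def binned_entropies(pdf, lo, hi, delta):
    """Return (H_delta, sum_i -p(x_i) delta ln p(x_i)) using bin midpoints as x_i."""
    n_bins = int(round((hi - lo) / delta))
    discrete, differential = 0.0, 0.0
    for i in range(n_bins):
        x_i = lo + (i + 0.5) * delta
        density = pdf(x_i)
        mass = density * delta                      # approximate probability of bin i
        if mass > 0.0:
            discrete -= mass * math.log(mass)       # entropy of the discretized variable
            differential -= mass * math.log(density)
    return discrete, differential

delta = 0.01
h_discrete, h_differential = binned_entropies(gaussian_pdf, -10.0, 10.0, delta)
closed_form = 0.5 * math.log(2.0 * math.pi * math.e)

print(h_discrete, h_differential - math.log(delta))  # differ only by the -ln(delta) term
print(h_differential, closed_form)
```

The discrete entropy grows without bound as the bins shrink, which is why the $-\ln(\Delta)$ term is dropped in the definition.
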
max-entropy distribution
maximize the entropy subject to constraints:
$$\arg\max_p H(p)$$
- $p(x) > 0 \quad \forall x$
- $\int_{Val(X)} p(x)\,dx = 1$
- $E_p[\phi_k(X)] = \mu_k \quad \forall k$
using Lagrange multipliers:
$$p(x) \propto \exp\left(\sum_k \theta_k \phi_k(x)\right)$$

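A sketch of the result on a finite domain, where the integral becomes a sum. The setting is hypothetical: a six-sided die with a single statistic $\phi(x) = x$ and target mean 4.5. The multiplier $\theta$ is found by bisection, which works here because the mean of $p(x) \propto \exp(\theta x)$ increases with $\theta$.

```python
import math

values = [1, 2, 3, 4, 5, 6]
target_mean = 4.5          # assumed constraint E[X] = 4.5

def exp_family(theta):
    """p(x) proportional to exp(theta * x) on the finite domain."""
    weights = [math.exp(theta * x) for x in values]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p):
    return sum(pi * x for pi, x in zip(p, values))

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0.0)

# Bisection on theta until the moment constraint is met.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(exp_family(mid)) < target_mean:
        lo = mid
    else:
        hi = mid
theta = 0.5 * (lo + hi)
p_maxent = exp_family(theta)

# Another distribution with the same mean, for comparison only.
p_other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]

print(theta, mean(p_maxent))
print(entropy(p_maxent), entropy(p_other))
```

The comparison distribution satisfies the same mean constraint but has lower entropy, consistent with the exponential form being the maximizer.
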
Exponential family
an exponential family has the following form:
$$p(x; \theta) = h(x)\exp\left(\langle \eta(\theta), \phi(x) \rangle - A(\theta)\right)$$
- $h(x)$: base measure
- $\phi(x)$: sufficient statistics
- $A(\theta)$: log-partition function
- $\langle \cdot, \cdot \rangle$: the inner product of two vectors
$$A(\theta) = \ln\left(\int_{Val(X)} h(x)\exp\left(\sum_k \eta_k(\theta)\,\phi_k(x)\right) dx\right)$$
with a convex parameter space $\theta \in \Theta = \{\theta \in \Re^D \mid A(\theta) < \infty\}$

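The role of the log-partition function as a normalizer can be seen on a small finite domain. This is a sketch under stated assumptions: the domain {0, ..., 4}, the statistics $\phi(x) = [x, x^2]$, $h(x) = 1$, $\eta(\theta) = \theta$, and the parameter values are all made up for illustration, and a sum replaces the integral.

```python
import math

def inner(theta, stats):
    return sum(t * s for t, s in zip(theta, stats))

def log_partition(theta, phi, h, domain):
    """A(theta) = ln sum_x h(x) exp(<theta, phi(x)>), via a stable log-sum-exp."""
    scores = [math.log(h(x)) + inner(theta, phi(x)) for x in domain]
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))

def density(x, theta, phi, h, domain):
    """p(x; theta) = h(x) exp(<theta, phi(x)> - A(theta))."""
    return h(x) * math.exp(inner(theta, phi(x)) - log_partition(theta, phi, h, domain))

domain = range(5)                # hypothetical finite Val(X)
phi = lambda x: (x, x * x)       # sufficient statistics
h = lambda x: 1.0                # base measure
theta = (0.8, -0.3)              # parameters (arbitrary)

probs = [density(x, theta, phi, h, domain) for x in domain]
print(probs, sum(probs))
```
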
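Example: univariate Gaussian
moment form:
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
exponential family form: $p(x; \mu, \sigma^2) = h(x)\exp\left(\langle \eta(\theta), \phi(x) \rangle - A(\theta)\right)$ with
- $h(x) = 1$
- $\phi(x) = [x, x^2]$
- $\eta(\mu, \sigma^2) = \left[\frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2}\right]$
- $A = \frac{1}{2}\left(\ln(2\pi\sigma^2) + \frac{\mu^2}{\sigma^2}\right)$
for $(\mu, \sigma^2) \in \Re \times \Re^+$
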
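The two forms of the Gaussian density can be compared numerically. A minimal sketch: the exponential-family version uses $h(x) = 1$, $\phi(x) = [x, x^2]$, $\eta = [\mu/\sigma^2, -1/(2\sigma^2)]$ and $A = \frac{1}{2}(\ln(2\pi\sigma^2) + \mu^2/\sigma^2)$; the test points and parameter values are arbitrary.

```python
import math

def gaussian_moment_form(x, mu, sigma2):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_expfam_form(x, mu, sigma2):
    eta = (mu / sigma2, -1.0 / (2.0 * sigma2))   # parameters eta(mu, sigma^2)
    phi = (x, x * x)                             # sufficient statistics
    log_partition = 0.5 * (math.log(2.0 * math.pi * sigma2) + mu * mu / sigma2)
    inner = eta[0] * phi[0] + eta[1] * phi[1]
    return math.exp(inner - log_partition)       # base measure h(x) = 1

mu, sigma2 = 1.5, 2.0
pairs = [
    (gaussian_moment_form(x, mu, sigma2), gaussian_expfam_form(x, mu, sigma2))
    for x in (-3.0, 0.0, 1.5, 4.0)
]
for moment_value, expfam_value in pairs:
    print(moment_value, expfam_value)
```
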
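Example: Bernoulli
conventional form (mean parametrization):
$$p(x; \mu) = \mu^{x}(1 - \mu)^{1 - x}$$
exponential family form: $p(x; \mu) = h(x)\exp\left(\langle \eta(\theta), \phi(x) \rangle - A(\theta)\right)$ with
- $h(x) = 1$
- $\phi(x) = [\mathbb{I}(x = 1), \mathbb{I}(x = 0)]$
- $\eta(\mu) = [\ln(\mu), \ln(1 - \mu)]$
- $A = 0$, since the inner product alone already reproduces the density
for $\mu \in (0, 1)$
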
Linear exponential family
when using natural parameters:
$$p(x; \theta) = h(x)\exp\left(\langle \theta, \phi(x) \rangle - A(\theta)\right)$$
- natural parameters: simply define $\eta(\theta)$ to be the new $\theta$
- base measure $h(x)$: can absorb it as a sufficient statistic ($\ln h(x)$) with $\theta = 1$
- the natural parameter space needs to be convex: $\theta \in \Theta = \{\theta \in \Re^D \mid A(\theta) < \infty\}$

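Example: univariate Gaussian, take 2
natural parameters in the univariate Gaussian:
$$p(x; \theta) = \exp\left(\langle \theta, \phi(x) \rangle - A(\theta)\right)$$
- $\theta = \left[\frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2}\right]$
- $\phi(x) = [x, x^2]$
- $A(\theta) = -\frac{1}{2}\left(\ln\left(\frac{-\theta_2}{\pi}\right) + \frac{\theta_1^2}{2\theta_2}\right)$
where $\theta \in \Re \times \Re^-$ is a convex set
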
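The natural parametrization of the Gaussian can be checked against the moment form. This sketch assumes $\theta = [\mu/\sigma^2, -1/(2\sigma^2)]$ and the log-partition function obtained by substituting these into $\frac{1}{2}(\ln(2\pi\sigma^2) + \mu^2/\sigma^2)$; the function names and test values are illustrative.

```python
import math

def to_natural(mu, sigma2):
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def to_moment(theta1, theta2):
    sigma2 = -1.0 / (2.0 * theta2)
    return theta1 * sigma2, sigma2

def log_partition(theta1, theta2):
    """A(theta) = -1/2 (ln(-theta2 / pi) + theta1^2 / (2 theta2)); infinite outside Theta."""
    if theta2 >= 0.0:
        return math.inf
    return -0.5 * (math.log(-theta2 / math.pi) + theta1 ** 2 / (2.0 * theta2))

mu, sigma2 = 1.5, 2.0
theta = to_natural(mu, sigma2)
a_natural = log_partition(*theta)
a_moment = 0.5 * (math.log(2.0 * math.pi * sigma2) + mu * mu / sigma2)
print(a_natural, a_moment)

# Convexity spot check: the midpoint of two valid parameters is valid,
# and A at the midpoint does not exceed the average of A at the endpoints.
theta_a, theta_b = to_natural(0.0, 1.0), to_natural(3.0, 0.5)
theta_mid = tuple(0.5 * (a + b) for a, b in zip(theta_a, theta_b))
a_mid = log_partition(*theta_mid)
a_avg = 0.5 * (log_partition(*theta_a) + log_partition(*theta_b))
print(a_mid, a_avg)
```
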
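Example: Bernoulli, take 2
conventional form (mean parametrization):
$$p(x; \mu) = \mu^{x}(1 - \mu)^{1 - x}$$
natural parameters: $p(x; \theta) = h(x)\exp\left(\langle \theta, \phi(x) \rangle - A(\theta)\right)$ with
- $\phi(x) = [\mathbb{I}(x = 1), \mathbb{I}(x = 0)]$
- $\theta = [\ln(\mu), \ln(1 - \mu)]$
however $\Theta$ is not a convex set: its points lie on the curve $e^{\theta_1} + e^{\theta_2} = 1$
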
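Non-convexity of this parameter set can be shown with a single counterexample. In the sketch below, two valid parameter vectors are averaged and the midpoint no longer corresponds to any $\mu \in (0, 1)$; the choice of $\mu = 0.2$ and $\mu = 0.8$ is arbitrary.

```python
import math

def natural_params(mu):
    """theta = [ln(mu), ln(1 - mu)] for mu in (0, 1)."""
    return (math.log(mu), math.log(1.0 - mu))

def is_valid(theta, tol=1e-12):
    """theta corresponds to some mu exactly when exp(theta1) + exp(theta2) = 1."""
    return abs(math.exp(theta[0]) + math.exp(theta[1]) - 1.0) < tol

theta_a = natural_params(0.2)
theta_b = natural_params(0.8)
midpoint = tuple(0.5 * (a + b) for a, b in zip(theta_a, theta_b))
midpoint_total = math.exp(midpoint[0]) + math.exp(midpoint[1])

print(is_valid(theta_a), is_valid(theta_b), is_valid(midpoint))
print(midpoint_total)   # close to 0.8, not 1
```
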
Example: Bernoulli, take 3
conventional form (mean parametrization):
$$p(x; \mu) = \mu^{x}(1 - \mu)^{1 - x}$$
natural parameters: $p(x; \theta) = h(x)\exp\left(\langle \theta, \phi(x) \rangle - A(\theta)\right)$ with
- $\theta \in \Re^2$
- $\phi(x) = [\mathbb{I}(x = 1), \mathbb{I}(x = 0)]$
this parametrization is redundant or overcomplete:
$$p(x; [\theta_1, \theta_2]) = p(x; [\theta_1 + c, \theta_2 + c])$$
redundant iff $\exists \theta \neq 0$ s.t. $\forall x \ \langle \theta, \phi(x) \rangle = c$

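The shift invariance of the overcomplete parametrization can be verified directly. This sketch assumes $h(x) = 1$, so that normalization gives $A(\theta) = \ln(e^{\theta_1} + e^{\theta_2})$; that expression is derived here and does not appear in the slides, and the parameter values are arbitrary.

```python
import math

def phi(x):
    """Indicator statistics [I(x = 1), I(x = 0)]."""
    return (1.0 if x == 1 else 0.0, 1.0 if x == 0 else 0.0)

def bernoulli_overcomplete(x, theta):
    log_partition = math.log(math.exp(theta[0]) + math.exp(theta[1]))
    stats = phi(x)
    return math.exp(theta[0] * stats[0] + theta[1] * stats[1] - log_partition)

theta = (0.3, -1.2)
c = 5.0
shifted = (theta[0] + c, theta[1] + c)

original_probs = [bernoulli_overcomplete(x, theta) for x in (0, 1)]
shifted_probs = [bernoulli_overcomplete(x, shifted) for x in (0, 1)]
print(original_probs, shifted_probs)

# The direction [1, 1] has a constant inner product with phi(x) for every x.
constant_products = [sum(phi(x)) for x in (0, 1)]
print(constant_products)
```

Adding c to both parameters multiplies every unnormalized weight by $e^c$, which the log-partition function cancels.
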
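Example: Bernoulli, take 4
conventional form (mean parametrization):
$$p(x; \mu) = \mu^{x}(1 - \mu)^{1 - x}$$
natural parameters: $p(x; \theta) = h(x)\exp\left(\langle \theta, \phi(x) \rangle - A(\theta)\right)$ with
- $\phi(x) = [\mathbb{I}(x = 1)]$
- $\theta = \left[\ln\frac{\mu}{1 - \mu}\right]$
- $A(\theta) = \log(1 + e^{\theta})$
$\Theta$ is convex and this parametrization is minimal
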
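The minimal parametrization can be checked against the conventional form. A short sketch, assuming $h(x) = 1$, $\theta = \ln\frac{\mu}{1-\mu}$ and $A(\theta) = \log(1 + e^{\theta})$; the helper names are illustrative.

```python
import math

def logit(mu):
    """Natural parameter theta = ln(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def log_partition(theta):
    """A(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def bernoulli_minimal(x, theta):
    indicator = 1.0 if x == 1 else 0.0           # phi(x) = [I(x = 1)]
    return math.exp(theta * indicator - log_partition(theta))

def bernoulli_mean_form(x, mu):
    return mu ** x * (1.0 - mu) ** (1 - x)

checks = []
for mu in (0.1, 0.5, 0.9):
    theta = logit(mu)
    for x in (0, 1):
        checks.append((bernoulli_minimal(x, theta), bernoulli_mean_form(x, mu)))

# Mapping back from theta to mu is the logistic (sigmoid) function.
recovered_mu = 1.0 / (1.0 + math.exp(-logit(0.9)))
print(checks)
print(recovered_mu)
```
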