Exponential Family Distributions CMSC 691 UMBC Exponential Family - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Exponential Family Distributions

CMSC 691 UMBC

slide-2
SLIDE 2

Exponential Family Form

slide-3
SLIDE 3

Exponential Family Form

Support function

  • Formally necessary, often irrelevant (e.g., Gaussian distributions), except when it isn't (e.g., Dirichlet distributions)

slide-4
SLIDE 4

Exponential Family Form

Distribution Parameters

  • Natural parameters
  • Feature weights
slide-5
SLIDE 5

Exponential Family Form

Sufficient statistics

  • Feature function(s)
slide-6
SLIDE 6

Exponential Family Form

Log-normalizer

slide-7
SLIDE 7

Exponential Family Form

Log-normalizer

Continuous x:

$A(\theta) = \log \int h(x') \exp\left(\theta^T f(x')\right) dx'$

Discrete x:

$A(\theta) = \log \sum_{x'} h(x') \exp\left(\theta^T f(x')\right)$
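
For the discrete case, the log-normalizer can be computed by direct summation when the support is small and finite. The sketch below is only an illustration: the function names (`log_normalizer`, `density`) and the toy feature function are assumptions made for this example and do not come from the slides.

```python
import math

def log_normalizer(theta, f, h, support):
    """A(theta) = log sum_{x'} h(x') * exp(theta^T f(x')) over a finite support."""
    terms = [math.log(h(x)) + sum(t * fi for t, fi in zip(theta, f(x)))
             for x in support]
    m = max(terms)  # subtract the max for numerical stability (log-sum-exp)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def density(x, theta, f, h, support):
    """p_theta(x) = h(x) * exp(theta^T f(x) - A(theta))."""
    score = sum(t * fi for t, fi in zip(theta, f(x)))
    return h(x) * math.exp(score - log_normalizer(theta, f, h, support))

# Toy setup (hypothetical): x in {0, 1, 2}, f(x) = (x, x^2), h(x) = 1.
support = [0, 1, 2]
f = lambda x: (x, x * x)
h = lambda x: 1.0
theta = (0.5, -0.3)

# Subtracting A(theta) is exactly what makes these values sum to one.
probs = [density(x, theta, f, h, support) for x in support]
```

For continuous x the sum becomes an integral, which usually needs a closed form or numerical quadrature.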

slide-8
SLIDE 8

Why Bother with This?

  • A common form for common distributions
  • “Easily” compute gradients of likelihood w.r.t. parameters
  • “Easily” compute expectations, especially entropy and KL divergence
  • “Easy” posterior inference via conjugate distributions

slide-9
SLIDE 9

Why? Capture Common Distributions

  • Bernoulli/Binomial
  • Categorical/Multinomial
  • Poisson
  • Normal
  • Gamma
  • …

$p_\theta(x) = h(x) \exp\left(\theta^T f(x) - A(\theta)\right)$

These can all be written in this “common” form (with different h, f, and A functions).

See a good stats book, or https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions
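
As a small check of the claim that familiar distributions fit the common form, the sketch below writes a Bernoulli both ways and compares them. It uses the standard mapping (natural parameter = log-odds, f(x) = x, h(x) = 1, A(θ) = log(1 + e^θ)). The function names are illustrative assumptions, not taken from the slides.

```python
import math

def bernoulli_traditional(x, p):
    """Traditional form: p^x * (1 - p)^(1 - x), for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_expfam(x, p):
    """Exponential family form: h(x) * exp(theta * f(x) - A(theta))."""
    theta = math.log(p / (1 - p))        # natural parameter (log-odds)
    A = math.log(1 + math.exp(theta))    # log-normalizer
    h = 1.0                              # support function
    return h * math.exp(theta * x - A)   # sufficient statistic f(x) = x

# Compare the two forms on a few parameter values.
pairs = [(bernoulli_traditional(x, p), bernoulli_expfam(x, p))
         for p in (0.2, 0.5, 0.9) for x in (0, 1)]
```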

slide-10
SLIDE 10

Why? Capture Common Distributions

Discrete/Categorical

(Finite distributions)

“Traditional” Form

$p_\pi(X = k) = \prod_j \pi_j^{\mathbb{1}[k = j]}$

where $\mathbb{1}[c] = 1$ if $c$ is true and $0$ if $c$ is false.

Exponential Family Form

$\theta = \,???$, $f(x) = \,???$, $h(x) = \,???$

slide-11
SLIDE 11

Why? Capture Common Distributions

Discrete/Categorical

(Finite distributions)

“Traditional” Form

$p_\pi(X = k) = \prod_j \pi_j^{\mathbb{1}[k = j]}$

where $\mathbb{1}[c] = 1$ if $c$ is true and $0$ if $c$ is false.

How do we find this? $ab = \exp(\log a + \log b)$

slide-12
SLIDE 12

Why? Capture Common Distributions

Discrete/Categorical

(Finite distributions)

“Traditional” Form

$p_\pi(X = k) = \prod_j \pi_j^{\mathbb{1}[k = j]}$

where $\mathbb{1}[c] = 1$ if $c$ is true and $0$ if $c$ is false.

How do we find this? $ab = \exp(\log a + \log b)$

$p_\pi(X = k) = \prod_j \pi_j^{\mathbb{1}[k = j]} = \exp\left(\sum_j \mathbb{1}[k = j] \log \pi_j\right)$

slide-13
SLIDE 13

Why? Capture Common Distributions

Discrete/Categorical

(Finite distributions)

“Traditional” Form

$p_\pi(X = k) = \prod_j \pi_j^{\mathbb{1}[k = j]}$

where $\mathbb{1}[c] = 1$ if $c$ is true and $0$ if $c$ is false.

How do we find this? $ab = \exp(\log a + \log b)$

$p_\pi(X = k) = \prod_j \pi_j^{\mathbb{1}[k = j]} = \exp\left(\sum_j \mathbb{1}[k = j] \log \pi_j\right) = \exp\left(\left[\mathbb{1}[k = 1], \ldots, \mathbb{1}[k = K]\right]^T \log \pi\right)$

slide-14
SLIDE 14

Why? Capture Common Distributions

Discrete/Categorical

(Finite distributions)

“Traditional” Form

$p_\pi(X = k) = \prod_j \pi_j^{\mathbb{1}[k = j]}$

where $\mathbb{1}[c] = 1$ if $c$ is true and $0$ if $c$ is false.

Exponential Family Form

$\theta = [\log \pi_1, \ldots, \log \pi_K]$

$f(x) = [\mathbb{1}[x = 1], \ldots, \mathbb{1}[x = K]]$

$h(x) = 1$
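
The categorical result can be checked numerically: with θ = log π, a one-hot f(x), and h(x) = 1, the inner product θᵀf(x) just picks out log π_x. The sketch below assumes categories are numbered 1..K as on the slides. The helper names are hypothetical.

```python
import math

def one_hot(x, K):
    """Sufficient statistic f(x) = [1[x = 1], ..., 1[x = K]] for x in 1..K."""
    return [1.0 if x == k else 0.0 for k in range(1, K + 1)]

def categorical_expfam(x, pi):
    """p(X = x) written as h(x) * exp(theta^T f(x) - A(theta))."""
    theta = [math.log(p) for p in pi]           # natural parameters
    f = one_hot(x, len(pi))                     # sufficient statistics
    A = math.log(sum(math.exp(t) for t in theta))  # equals 0 when pi sums to 1
    return 1.0 * math.exp(sum(t * fi for t, fi in zip(theta, f)) - A)

pi = [0.2, 0.5, 0.3]
recovered = [categorical_expfam(k, pi) for k in (1, 2, 3)]
```

Because π already sums to one, the log-normalizer is zero under this parameterization, so no explicit A term is needed in the derivation.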

slide-15
SLIDE 15

Why? Capture Common Distributions

Gaussian

“Traditional” Form

Exponential Family Form

$h(x) = 1$
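
The formulas on this slide did not survive extraction. The sketch below shows one common way to write a univariate Gaussian in exponential family form that is consistent with h(x) = 1: the 2π constant is folded into the log-normalizer. The parameterization θ = (μ/σ², −1/(2σ²)), f(x) = (x, x²) is an assumption here. It is a standard choice, but not necessarily the one used on the original slide.

```python
import math

def gaussian_traditional(x, mu, sigma2):
    """Usual density: exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_expfam(x, mu, sigma2):
    """Same density as h(x) * exp(theta^T f(x) - A(theta)) with h(x) = 1."""
    theta = (mu / sigma2, -1.0 / (2 * sigma2))   # natural parameters
    f = (x, x * x)                               # sufficient statistics
    # A(theta) = mu^2 / (2 sigma^2) + 0.5 * log(2 pi sigma^2), written in theta.
    A = -theta[0] ** 2 / (4 * theta[1]) + 0.5 * math.log(-math.pi / theta[1])
    return math.exp(theta[0] * f[0] + theta[1] * f[1] - A)

# Compare the two forms at a few points.
checks = [(gaussian_traditional(x, 1.5, 0.8), gaussian_expfam(x, 1.5, 0.8))
          for x in (-2.0, 0.0, 1.5, 3.0)]
```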

slide-16
SLIDE 16

Why? Capture Common Distributions

Dirichlet

“Traditional” Form

Exponential Family Form

$h(x) = 1$, if we assume $x \in \Delta^{K-1}$

$h(x) = \mathbb{1}\left[\sum_k x_k = 1\right]$, if we explicitly enforce $x \in \Delta^{K-1}$
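
To illustrate why the support function matters for the Dirichlet, the sketch below enforces the simplex constraint explicitly through h(x). It assumes natural parameters α − 1, sufficient statistics log x, and log-normalizer A(α) = Σ_k log Γ(α_k) − log Γ(Σ_k α_k). The function names and the tolerance `tol` are choices made for this example only.

```python
import math

def dirichlet_log_normalizer(alpha):
    """A(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def support_indicator(x, tol=1e-9):
    """h(x) = 1 if x lies on the simplex (positive entries summing to 1), else 0."""
    return 1.0 if all(xi > 0 for xi in x) and abs(sum(x) - 1.0) < tol else 0.0

def dirichlet_density(x, alpha):
    """h(x) * exp((alpha - 1)^T log x - A(alpha))."""
    h = support_indicator(x)
    if h == 0.0:
        return 0.0  # outside the simplex: the support function zeroes the density
    score = sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))
    return h * math.exp(score - dirichlet_log_normalizer(alpha))

inside = dirichlet_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])   # uniform on the simplex
outside = dirichlet_density([0.5, 0.6], [1.0, 1.0])            # does not sum to 1
```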

slide-17
SLIDE 17

Why? Capture Common Distributions

  • Discrete (finite distributions)
  • Dirichlet (distributions over (finite) distributions)
  • Gaussian
  • Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …

slide-18
SLIDE 18

Why? “Easy” Gradients

Gradient of likelihood

slide-19
SLIDE 19

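Why? “Easy” Gradients

Gradient of (log-)likelihood = observed sufficient statistics (feature counts) − expected sufficient statistics (feature counts):

$\nabla_\theta \sum_{i=1}^{N} \log p_\theta(x_i) = \sum_{i=1}^{N} f(x_i) - N\, \mathbb{E}_{x \sim p_\theta}[f(x)]$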

slide-20
SLIDE 20

Why? “Easy” Gradients

Gradient of likelihood

  • Observed sufficient statistics: “count” w.r.t. the empirical distribution
  • Expected sufficient statistics: “count” w.r.t. the current model parameters
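
The "observed minus expected" structure can be sketched for a categorical model with unconstrained natural parameters, where A(θ) is a log-sum-exp and the expected sufficient statistics are the softmax of θ. This is an illustrative setup, not the slides' own code. Categories are 0-based here for indexing convenience, and all names are hypothetical.

```python
import math

def log_sum_exp(v):
    m = max(v)
    return m + math.log(sum(math.exp(t - m) for t in v))

def log_likelihood(theta, data):
    """sum_i [theta^T f(x_i) - A(theta)] with one-hot f and h(x) = 1."""
    return sum(theta[x] for x in data) - len(data) * log_sum_exp(theta)

def gradient(theta, data):
    """Observed feature counts minus expected feature counts."""
    A = log_sum_exp(theta)
    expected = [math.exp(t - A) for t in theta]   # E_theta[f(x)] (softmax)
    observed = [0.0] * len(theta)
    for x in data:                                # sum_i f(x_i): empirical counts
        observed[x] += 1.0
    return [o - len(data) * e for o, e in zip(observed, expected)]

theta = [0.1, -0.4, 0.7]
data = [0, 2, 2, 1, 2]
grad = gradient(theta, data)
```

The gradient vanishes exactly when the model's expected counts match the empirical counts, which is the usual moment-matching condition for maximum likelihood.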

slide-21
SLIDE 21

Why? “Easy” Expectations

$\mathbb{E}_{x \sim p_\theta}[f(x)] = \nabla_\theta A(\theta)$

(expectation of the sufficient statistics = gradient of the log normalizer)
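
This identity can be checked on a Poisson distribution, assuming the standard exponential family form θ = log λ, f(x) = x, h(x) = 1/x!, A(θ) = e^θ. The sketch compares a truncated sum for E[f(x)] with a finite-difference derivative of A. The truncation point and step size are arbitrary choices for this example.

```python
import math

def A(theta):
    """Poisson log-normalizer in natural parameters: A(theta) = exp(theta)."""
    return math.exp(theta)

def expected_stat(theta, max_x=200):
    """E[f(x)] = E[x] under p_theta(x) = (1/x!) exp(theta * x - A(theta)), truncated."""
    total = 0.0
    for x in range(max_x):
        log_p = -math.lgamma(x + 1) + theta * x - A(theta)
        total += x * math.exp(log_p)
    return total

def grad_A(theta, eps=1e-6):
    """Central finite-difference approximation to dA/dtheta."""
    return (A(theta + eps) - A(theta - eps)) / (2 * eps)

theta = math.log(3.5)          # corresponds to rate lambda = 3.5
lhs = expected_stat(theta)     # expectation of the sufficient statistic
rhs = grad_A(theta)            # gradient of the log-normalizer
```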

slide-22
SLIDE 22

Why Bother with This?

  • A common form for common distributions
  • “Easily” compute gradients of likelihood w.r.t. parameters
  • “Easily” compute expectations, especially entropy and KL divergence
  • “Easy” posterior inference via conjugate distributions

slide-23
SLIDE 23

Conjugate Distributions

  • Let $\theta \sim p$, and let $x \mid \theta \sim q$
  • If p is the conjugate prior for q, then the posterior distribution $p(\theta \mid x)$ is of the same type/family as the prior $p(\theta)$

slide-24
SLIDE 24

Why? “Easy” Posterior Inference

slide-25
SLIDE 25

Why? “Easy” Posterior Inference

p is the conjugate prior for q

slide-26
SLIDE 26

Why? “Easy” Posterior Inference

p is the conjugate prior for q

Posterior p has same form as prior p

slide-27
SLIDE 27

Why? “Easy” Posterior Inference

p is the conjugate prior for q

Posterior p has same form as prior p

All exponential family models have a conjugate prior (in theory)

slide-28
SLIDE 28

Why? “Easy” Posterior Inference

p is the conjugate prior for q

Posterior p has same form as prior p

Posterior          Likelihood             Prior
Dirichlet (Beta)   Discrete (Bernoulli)   Dirichlet (Beta)
Normal             Normal (fixed var.)    Normal
Gamma              Exponential            Gamma
…
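
As one concrete instance of a conjugate pair, the sketch below works through the Gamma/Exponential case. It assumes the shape/rate parameterization of the Gamma, under which the posterior is Gamma(shape + N, rate + Σ_i x_i). The slides do not say which parameterization is intended, and the function names are hypothetical.

```python
import math

def gamma_log_pdf(lam, shape, rate):
    """Log density of Gamma(shape, rate) evaluated at lam > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(lam) - rate * lam)

def exponential_log_lik(data, lam):
    """Log-likelihood of i.i.d. Exponential(rate=lam) observations."""
    return len(data) * math.log(lam) - lam * sum(data)

def posterior_params(shape, rate, data):
    """Conjugate update: add the count to the shape and the data sum to the rate."""
    return shape + len(data), rate + sum(data)

data = [0.5, 1.2, 0.3]
post_shape, post_rate = posterior_params(2.0, 1.0, data)

# prior * likelihood should differ from the claimed posterior only by a constant
# in lam, so these log-differences should all be (numerically) equal.
diffs = [gamma_log_pdf(l, 2.0, 1.0) + exponential_log_lik(data, l)
         - gamma_log_pdf(l, post_shape, post_rate) for l in (0.3, 1.0, 2.5)]
```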

slide-29
SLIDE 29

Conjugate Prior Example

  • $p(\theta) = \mathrm{Dir}(\alpha)$, $q(x_i \mid \theta) = \mathrm{Cat}(\theta)$ i.i.d.
  • Let $f(x)$ be the Cat sufficient statistic function
  • $p(\theta \mid x) = \mathrm{Dir}(\alpha + \sum_i f(x_i))$

$p(\theta \mid x_1, \ldots, x_N) \propto q(x_1, \ldots, x_N \mid \theta)\, p(\theta)$

slide-30
SLIDE 30

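Conjugate Prior Example

  • $p(\theta) = \mathrm{Dir}(\alpha)$, $q(x_i \mid \theta) = \mathrm{Cat}(\theta)$ i.i.d.
  • Let $f(x)$ be the Cat sufficient statistic function
  • $p(\theta \mid x) = \mathrm{Dir}(\alpha + \sum_i f(x_i))$

$p(\theta \mid x_1, \ldots, x_N) \propto q(x_1, \ldots, x_N \mid \theta)\, p(\theta)$

$= \prod_i \exp\left((\log \theta)^T f(x_i)\right) \exp\left((\alpha - 1)^T \log \theta - A(\alpha)\right)$

Rewrite q and p with exponential family forms. Here $A(\alpha)$ is the Dirichlet log-normalizer; it does not depend on $\theta$.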

slide-31
SLIDE 31

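Conjugate Prior Example

  • $p(\theta) = \mathrm{Dir}(\alpha)$, $q(x_i \mid \theta) = \mathrm{Cat}(\theta)$ i.i.d.
  • Let $f(x)$ be the Cat sufficient statistic function
  • $p(\theta \mid x) = \mathrm{Dir}(\alpha + \sum_i f(x_i))$

$p(\theta \mid x_1, \ldots, x_N) \propto q(x_1, \ldots, x_N \mid \theta)\, p(\theta)$

$= \prod_i \exp\left((\log \theta)^T f(x_i)\right) \exp\left((\alpha - 1)^T \log \theta - A(\alpha)\right)$

$= \exp\left(\sum_i (\log \theta)^T f(x_i) + \left((\alpha - 1)^T \log \theta - A(\alpha)\right)\right)$

Replace with specific natural parameters and sufficient statistic functions. Here $A(\alpha)$ is the Dirichlet log-normalizer; it does not depend on $\theta$.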

slide-32
SLIDE 32

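Conjugate Prior Example

  • $p(\theta) = \mathrm{Dir}(\alpha)$, $q(x_i \mid \theta) = \mathrm{Cat}(\theta)$ i.i.d.
  • Let $f(x)$ be the Cat sufficient statistic function
  • $p(\theta \mid x) = \mathrm{Dir}(\alpha + \sum_i f(x_i))$

$p(\theta \mid x_1, \ldots, x_N) \propto q(x_1, \ldots, x_N \mid \theta)\, p(\theta)$

$= \prod_i \exp\left((\log \theta)^T f(x_i)\right) \exp\left((\alpha - 1)^T \log \theta - A(\alpha)\right)$

$= \exp\left(\sum_i (\log \theta)^T f(x_i) + \left((\alpha - 1)^T \log \theta - A(\alpha)\right)\right)$

Notice common terms that can be simplified together: both $\sum_i f(x_i)$ and $(\alpha - 1)$ multiply $\log \theta$.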

slide-33
SLIDE 33

Conjugate Prior Example

  • $p(\theta) = \mathrm{Dir}(\alpha)$, $q(x_i \mid \theta) = \mathrm{Cat}(\theta)$ i.i.d.
  • Let $f(x)$ be the Cat sufficient statistic function
  • $p(\theta \mid x) = \mathrm{Dir}(\alpha + \sum_i f(x_i))$

$p(\theta \mid x_1, \ldots, x_N) \propto q(x_1, \ldots, x_N \mid \theta)\, p(\theta)$

$= \prod_i \exp\left((\log \theta)^T f(x_i)\right) \exp\left((\alpha - 1)^T \log \theta - A(\alpha)\right)$

$= \exp\left(\sum_i (\log \theta)^T f(x_i) + \left((\alpha - 1)^T \log \theta - A(\alpha)\right)\right)$

$= \exp\left(\left(\alpha - 1 + \sum_i f(x_i)\right)^T \log \theta - A(\alpha)\right)$

Group common terms

slide-34
SLIDE 34

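Conjugate Prior Example

  • $p(\theta) = \mathrm{Dir}(\alpha)$, $q(x_i \mid \theta) = \mathrm{Cat}(\theta)$ i.i.d.
  • Let $f(x)$ be the Cat sufficient statistic function
  • $p(\theta \mid x) = \mathrm{Dir}(\alpha + \sum_i f(x_i))$

$p(\theta \mid x_1, \ldots, x_N) \propto q(x_1, \ldots, x_N \mid \theta)\, p(\theta)$

$= \prod_i \exp\left((\log \theta)^T f(x_i)\right) \exp\left((\alpha - 1)^T \log \theta - A(\alpha)\right)$

$= \exp\left(\sum_i (\log \theta)^T f(x_i) + \left((\alpha - 1)^T \log \theta - A(\alpha)\right)\right)$

$= \exp\left(\left(\alpha - 1 + \sum_i f(x_i)\right)^T \log \theta - A(\alpha)\right)$

Group common terms

Notice: this is the form of a Dirichlet. As a function of $\theta$, it is proportional to $\mathrm{Dir}(\alpha + \sum_i f(x_i))$, since $A(\alpha)$ is constant in $\theta$.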
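
In practice, the derivation above reduces the Dirichlet/Categorical posterior update to adding observed category counts to the prior parameters. A minimal sketch, assuming categories are numbered 1..K as on the earlier slides (the function names are hypothetical):

```python
def one_hot(x, K):
    """Cat sufficient statistic f(x) = [1[x = 1], ..., 1[x = K]] for x in 1..K."""
    return [1.0 if x == k else 0.0 for k in range(1, K + 1)]

def dirichlet_posterior(alpha, data):
    """Posterior parameters alpha + sum_i f(x_i) for i.i.d. categorical data."""
    K = len(alpha)
    posterior = [float(a) for a in alpha]
    for x in data:
        for k, fk in enumerate(one_hot(x, K)):
            posterior[k] += fk   # accumulate the sufficient statistics
    return posterior

def dirichlet_mean(alpha):
    """Mean of Dir(alpha): alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1.0, 1.0, 1.0]          # uniform prior over the simplex
data = [1, 3, 3, 2, 3]           # observed categories
posterior = dirichlet_posterior(prior, data)
posterior_mean = dirichlet_mean(posterior)
```

No integration is needed: the only data-dependent quantity is the sum of sufficient statistics.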

slide-35
SLIDE 35

Why Bother with This?

  • A common form for common distributions
  • “Easily” compute gradients of likelihood w.r.t. parameters
  • “Easily” compute expectations, especially entropy and KL divergence
  • “Easy” posterior inference via conjugate distributions