Exponential Family Distributions
CMSC 691, UMBC
Exponential Family Form

p_η(y) = h(y) exp(η^T f(y) − B(η))
Support function h(y)
- Formally necessary, often irrelevant (e.g., Gaussian distributions), except when it isn't (e.g., Dirichlet distributions)
Exponential Family Form
Distribution parameters η
- Natural parameters
- Feature weights
Exponential Family Form
Sufficient statistics f(y)
- Feature function(s)
Exponential Family Form
Log-normalizer B(η)
Exponential Family Form
Log-normalizer

Discrete y:
B(η) = log Σ_{y′} h(y′) exp(η^T f(y′))

Continuous y:
B(η) = log ∫ h(y′) exp(η^T f(y′)) dy′
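As a sketch of the discrete case (assuming NumPy and SciPy are available; `log_normalizer` and its arguments are names invented for this example), the sum inside the log can be computed stably with log-sum-exp:

```python
import numpy as np
from scipy.special import logsumexp

def log_normalizer(eta, f, support):
    """B(eta) = log sum over y' in the support of exp(eta^T f(y')), taking h(y') = 1."""
    scores = np.array([eta @ f(y) for y in support])
    return logsumexp(scores)

# Sanity check with a Bernoulli: f(y) = [y], eta = [log(p/(1-p))],
# for which B(eta) = log(1 + e^eta) = -log(1 - p).
p = 0.3
eta = np.array([np.log(p / (1 - p))])
f = lambda y: np.array([float(y)])
B = log_normalizer(eta, f, support=[0, 1])
```

The `logsumexp` call avoids overflow when the scores η^T f(y′) are large, which a naive `log(sum(exp(...)))` would not.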
Why Bother with This?
- A common form for common distributions
- "Easily" compute gradients of likelihood w.r.t. parameters
- "Easily" compute expectations, especially entropy and KL divergence
- "Easy" posterior inference via conjugate distributions
Why? Capture Common Distributions

Bernoulli/Binomial, Categorical/Multinomial, Poisson, Normal, Gamma, …

p_η(y) = h(y) exp(η^T f(y) − B(η))

These can all be written in this "common" form (different h, f, and B functions)
See a good stats book, or https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions
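To make the common form concrete, here is a minimal sketch of the Bernoulli case (the function name is hypothetical; the h, f, η, B mapping is the standard one from the table linked above):

```python
import numpy as np

# Bernoulli(p) in exponential family form:
#   h(y) = 1, f(y) = y, eta = log(p / (1 - p)), B(eta) = log(1 + e^eta)
def bernoulli_expfam(y, p):
    eta = np.log(p / (1 - p))
    B = np.log(1 + np.exp(eta))
    return np.exp(eta * y - B)      # h(y) = 1

p = 0.25
p0, p1 = bernoulli_expfam(0, p), bernoulli_expfam(1, p)
```

Evaluating at y = 0 and y = 1 recovers 1 − p and p, confirming that only h, f, and B change from distribution to distribution.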
Why? Capture Common Distributions

Discrete/Categorical
(Finite distributions)

"Traditional" Form:
p_θ(y = k) = Π_j θ_j^{1[y = j]},   where 1[s] = 1 if s is true, 0 if s is false

Exponential Family Form:
η = ???   f(y) = ???   h(y) = ???
Why? Capture Common Distributions

Discrete/Categorical
(Finite distributions)

"Traditional" Form:
p_θ(y = k) = Π_j θ_j^{1[y = j]},   where 1[s] = 1 if s is true, 0 if s is false

How do we find this? Use ab = exp(log a + log b)
Why? Capture Common Distributions

Discrete/Categorical
(Finite distributions)

"Traditional" Form:
p_θ(y = k) = Π_j θ_j^{1[y = j]},   where 1[s] = 1 if s is true, 0 if s is false

p_θ(y = k) = Π_j θ_j^{1[y = j]}
           = exp(Σ_j 1[y = j] log θ_j)

How do we find this? Use ab = exp(log a + log b)
Why? Capture Common Distributions

Discrete/Categorical
(Finite distributions)

"Traditional" Form:
p_θ(y = k) = Π_j θ_j^{1[y = j]},   where 1[s] = 1 if s is true, 0 if s is false

p_θ(y = k) = Π_j θ_j^{1[y = j]}
           = exp(Σ_j 1[y = j] log θ_j)
           = exp((1[y = 1], …, 1[y = L])^T log θ)

How do we find this? Use ab = exp(log a + log b)
Why? Capture Common Distributions

Discrete/Categorical
(Finite distributions)

"Traditional" Form:
p_θ(y = k) = Π_j θ_j^{1[y = j]},   where 1[s] = 1 if s is true, 0 if s is false

Exponential Family Form:
η = (log θ_1, …, log θ_L)   f(y) = (1[y = 1], …, 1[y = L])   h(y) = 1
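The categorical mapping above can be checked numerically in a few lines (a sketch assuming NumPy; variable names are invented for the example):

```python
import numpy as np

theta = np.array([0.2, 0.5, 0.3])     # categorical parameters (sum to 1)
L = len(theta)
eta = np.log(theta)                   # natural parameters: (log theta_1, ..., log theta_L)

def f(y):
    return np.eye(L)[y]               # sufficient statistics: one-hot indicators

# h(y) = 1; since theta already sums to 1, B(eta) = log sum_j exp(eta_j) = 0,
# so the density is just exp(eta^T f(y))
p_expfam = np.array([np.exp(eta @ f(k)) for k in range(L)])
```

Each `p_expfam[k]` equals `theta[k]`, matching the "traditional" form exactly.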
Why? Capture Common Distributions
Gaussian
"Traditional" Form / Exponential Family Form

h(y) = 1
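The slide leaves the Gaussian mapping implicit; a sketch using the standard one-dimensional parameterization η = (μ/σ², −1/(2σ²)), f(y) = (y, y²), with the 1/√(2π) constant absorbed into B(η) so that h(y) = 1 as stated (the function name is invented for this example):

```python
import numpy as np

def gaussian_expfam_density(y, mu, sigma):
    eta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])   # natural parameters
    fy = np.array([y, y**2])                                  # sufficient statistics
    # log-normalizer; the 1/sqrt(2*pi) constant is absorbed here so that h(y) = 1
    B = (-eta[0]**2 / (4 * eta[1])
         - 0.5 * np.log(-2 * eta[1])
         + 0.5 * np.log(2 * np.pi))
    return np.exp(eta @ fy - B)                               # h(y) = 1

dens = gaussian_expfam_density(1.0, mu=0.5, sigma=2.0)
```

Evaluating this agrees with the usual N(μ, σ²) density formula, since −η₁²/(4η₂) = μ²/(2σ²) and −½ log(−2η₂) = log σ.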
Why? Capture Common Distributions
Dirichlet
"Traditional" Form / Exponential Family Form

h(y) = 1,   if we assume y ∈ Δ^{L−1}

h(y) = 1[Σ_j y_j = 1],   if we explicitly enforce y ∈ Δ^{L−1}
Why? Capture Common Distributions
Discrete (finite distributions), Dirichlet (distributions over (finite) distributions), Gaussian, Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …
Why? "Easy" Gradients

Gradient of the log-likelihood:
∇_η log p_η(y) = f(y) − E_{p_η}[f(y)]
Why? "Easy" Gradients

Gradient of the log-likelihood:
∇_η log p_η(y) = f(y) − E_{p_η}[f(y)]

Observed sufficient statistics (feature counts): f(y)
Expected sufficient statistics (feature counts): E_{p_η}[f(y)]
Why? "Easy" Gradients

Gradient of the log-likelihood:
∇_η log p_η(y) = f(y) − E_{p_η}[f(y)]

Observed sufficient statistics f(y): the "count" w.r.t. the empirical distribution
Expected sufficient statistics E_{p_η}[f(y)]: the "count" w.r.t. the current model parameters
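The observed-minus-expected form of the gradient can be verified against a finite-difference approximation, here for a 3-way categorical (a sketch assuming NumPy; `log_p` and the other names are invented for the example):

```python
import numpy as np

eta = np.log(np.array([0.2, 0.5, 0.3]))   # natural parameters of a 3-way categorical
L = len(eta)

def log_p(eta, y):
    # log p_eta(y) = eta_y - B(eta), with B(eta) = log sum_j exp(eta_j)
    return eta[y] - np.log(np.sum(np.exp(eta)))

y_obs = 1
f_obs = np.eye(L)[y_obs]                         # observed sufficient statistics f(y)
expected = np.exp(eta) / np.sum(np.exp(eta))     # E[f(y)] under the current model
grad = f_obs - expected                          # observed minus expected

# finite-difference check of the same gradient
eps = 1e-6
num = np.array([(log_p(eta + eps * np.eye(L)[j], y_obs)
                 - log_p(eta - eps * np.eye(L)[j], y_obs)) / (2 * eps)
                for j in range(L)])
```

The analytic gradient `grad` and the numerical gradient `num` agree to within the finite-difference error.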
Why? "Easy" Expectations

The expectation of the sufficient statistics is the gradient of the log-normalizer:
E_{p_η}[f(y)] = ∇_η B(η)
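This identity can also be checked numerically: differentiate B(η) by finite differences and compare to the model's expected sufficient statistics (a sketch assuming NumPy, with one-hot f and h = 1; names are invented for the example):

```python
import numpy as np

eta = np.array([0.1, -0.4, 0.7])      # arbitrary natural parameters

def B(eta):
    # log-normalizer of a categorical with one-hot f and h(y) = 1
    return np.log(np.sum(np.exp(eta)))

expected_f = np.exp(eta - B(eta))     # E[f(y)], i.e., softmax(eta)

eps = 1e-6                            # numerical gradient of B
grad_B = np.array([(B(eta + eps * np.eye(3)[j]) - B(eta - eps * np.eye(3)[j])) / (2 * eps)
                   for j in range(3)])
```

Both vectors are the model's probability vector, so computing expectations reduces to differentiating B.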
Why Bother with This?
- A common form for common distributions
- "Easily" compute gradients of likelihood w.r.t. parameters
- "Easily" compute expectations, especially entropy and KL divergence
- "Easy" posterior inference via conjugate distributions
Conjugate Distributions
- Let θ ∼ p, and let y | θ ∼ q
- If p is the conjugate prior for q, then the posterior distribution p(θ | y) is of the same type/family as the prior p(θ)
Why? "Easy" Posterior Inference

Why? "Easy" Posterior Inference

p is the conjugate prior for q

Why? "Easy" Posterior Inference

p is the conjugate prior for q
Posterior p has same form as prior p

Why? "Easy" Posterior Inference

p is the conjugate prior for q
Posterior p has same form as prior p
All exponential family models have a conjugate prior (in theory)

Why? "Easy" Posterior Inference

p is the conjugate prior for q
Posterior p has same form as prior p

Posterior | Likelihood | Prior
Dirichlet (Beta) | Discrete (Bernoulli) | Dirichlet (Beta)
Normal | Normal (fixed var.) | Normal
Gamma | Exponential | Gamma
…
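The first table row, in its two-outcome (Beta/Bernoulli) special case, makes the "same form" claim concrete: the posterior update is just addition of counts to the prior parameters (a sketch assuming NumPy; the data and parameter values are made up for the example):

```python
import numpy as np

# Beta(a, b) prior on the Bernoulli parameter; after observing the data, the
# posterior is Beta(a + #successes, b + #failures) -- same family as the prior
a, b = 2.0, 2.0
data = np.array([1, 1, 0, 1, 0, 1, 1])
a_post = a + data.sum()               # add the number of 1s
b_post = b + len(data) - data.sum()   # add the number of 0s
```

No normalizing integral is ever computed; conjugacy turns posterior inference into bookkeeping on sufficient statistics.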
Conjugate Prior Example
- p(θ) = Dir(β), q(y_i | θ) = Cat(θ) i.i.d.
- Let f(y) be the Cat sufficient statistic function
- p(θ | y) = Dir(β + Σ_i f(y_i))

p(θ | y_1, …, y_N) ∝ p(y_1, …, y_N | θ) p(θ)
Conjugate Prior Example
- p(θ) = Dir(β), q(y_i | θ) = Cat(θ) i.i.d.
- Let f(y) be the Cat sufficient statistic function
- p(θ | y) = Dir(β + Σ_i f(y_i))

p(θ | y_1, …, y_N) ∝ p(y_1, …, y_N | θ) p(θ)
= [Π_i exp((log θ)^T f(y_i))] exp((β − 1)^T log θ − B(β))

Rewrite q and p with exponential family forms
Conjugate Prior Example
- p(θ) = Dir(β), q(y_i | θ) = Cat(θ) i.i.d.
- Let f(y) be the Cat sufficient statistic function
- p(θ | y) = Dir(β + Σ_i f(y_i))

p(θ | y_1, …, y_N) ∝ p(y_1, …, y_N | θ) p(θ)
= [Π_i exp((log θ)^T f(y_i))] exp((β − 1)^T log θ − B(β))
= exp(Σ_i (log θ)^T f(y_i) + (β − 1)^T log θ − B(β))

Replace with specific natural parameters and sufficient statistic functions
Conjugate Prior Example
- p(θ) = Dir(β), q(y_i | θ) = Cat(θ) i.i.d.
- Let f(y) be the Cat sufficient statistic function
- p(θ | y) = Dir(β + Σ_i f(y_i))

p(θ | y_1, …, y_N) ∝ p(y_1, …, y_N | θ) p(θ)
= [Π_i exp((log θ)^T f(y_i))] exp((β − 1)^T log θ − B(β))
= exp(Σ_i (log θ)^T f(y_i) + (β − 1)^T log θ − B(β))

Notice common terms that can be simplified together
Conjugate Prior Example
- p(θ) = Dir(β), q(y_i | θ) = Cat(θ) i.i.d.
- Let f(y) be the Cat sufficient statistic function
- p(θ | y) = Dir(β + Σ_i f(y_i))

p(θ | y_1, …, y_N) ∝ p(y_1, …, y_N | θ) p(θ)
= [Π_i exp((log θ)^T f(y_i))] exp((β − 1)^T log θ − B(β))
= exp(Σ_i (log θ)^T f(y_i) + (β − 1)^T log θ − B(β))
= exp((β − 1 + Σ_i f(y_i))^T log θ − B(β))

Group common terms
Conjugate Prior Example
- p(θ) = Dir(β), q(y_i | θ) = Cat(θ) i.i.d.
- Let f(y) be the Cat sufficient statistic function
- p(θ | y) = Dir(β + Σ_i f(y_i))

p(θ | y_1, …, y_N) ∝ p(y_1, …, y_N | θ) p(θ)
= [Π_i exp((log θ)^T f(y_i))] exp((β − 1)^T log θ − B(β))
= exp(Σ_i (log θ)^T f(y_i) + (β − 1)^T log θ − B(β))
= exp((β − 1 + Σ_i f(y_i))^T log θ − B(β))

Group common terms
Notice: this is the form of a Dirichlet, with parameter β + Σ_i f(y_i)
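The derivation above reduces, in code, to summing the one-hot sufficient statistics and adding them to the prior parameters (a sketch assuming NumPy; the prior values and observations are made up for the example):

```python
import numpy as np

beta = np.array([1.0, 2.0, 1.0])           # Dirichlet prior parameters
ys = [0, 2, 2, 1, 2]                        # observed categories y_1, ..., y_N
counts = np.bincount(ys, minlength=3)       # sum_i f(y_i): summed one-hot vectors
beta_post = beta + counts                   # posterior: Dir(beta + sum_i f(y_i))
```

`np.bincount` computes Σ_i f(y_i) directly, so the whole posterior update is one vector addition.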
Why Bother with This?
- A common form for common distributions
- "Easily" compute gradients of likelihood w.r.t. parameters
- "Easily" compute expectations, especially entropy and KL divergence
- "Easy" posterior inference via conjugate distributions