Exponential Family & Generalized Linear Models (GLIMs)
Probabilistic Graphical Models
Sharif University of Technology, Spring 2017
Soleymani
Outline
- Exponential family
  - Many standard distributions are in this family
  - Similarities among learning algorithms for different models in this family:
    - ML estimation has a simple form for exponential families: moment matching of sufficient statistics
    - Bayesian learning is simplest for exponential families
    - They have a maximum entropy interpretation
- GLIMs: a way to parameterize conditional distributions that have an exponential-family distribution on a variable for each value of its parents
Exponential family: canonical parameterization
$$p(\boldsymbol{y}|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\, h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}, \qquad Z(\boldsymbol{\theta}) = \int h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}\, d\boldsymbol{y}$$

$$p(\boldsymbol{y}|\boldsymbol{\theta}) = h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}) - \ln Z(\boldsymbol{\theta})\}$$

- $T: \mathcal{Y} \to \mathbb{R}^L$: sufficient statistics function
- $\boldsymbol{\theta}$: natural or canonical parameters
- $h: \mathcal{Y} \to \mathbb{R}^+$: reference measure, independent of the parameters
- $Z$: normalization factor or partition function ($0 < Z(\boldsymbol{\theta}) < \infty$)
- $A(\boldsymbol{\theta}) = \ln Z(\boldsymbol{\theta})$: log partition function
Example: Bernoulli
$$p(y|\pi) = \pi^y (1-\pi)^{1-y} = \exp\left\{ y \ln\frac{\pi}{1-\pi} + \ln(1-\pi) \right\}$$

- $\theta = \ln\frac{\pi}{1-\pi} \;\Rightarrow\; \pi = \frac{e^\theta}{e^\theta + 1} = \frac{1}{1+e^{-\theta}}$
- $T(y) = y$
- $A(\theta) = -\ln(1-\pi) = \ln(1+e^\theta)$
- $h(y) = 1$
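A minimal numeric check of this form (a sketch in Python; the function and variable names are ours, not from the slides):

```python
import numpy as np

def bernoulli_expfam(y, theta):
    """Bernoulli pmf in exponential-family form:
    p(y|theta) = h(y) * exp(theta * T(y) - A(theta)),
    with T(y) = y, h(y) = 1, A(theta) = log(1 + e^theta)."""
    A = np.log1p(np.exp(theta))              # log partition function
    return np.exp(theta * y - A)

pi = 0.3
theta = np.log(pi / (1 - pi))                # canonical parameter (logit)
for y in (0, 1):
    standard = pi**y * (1 - pi)**(1 - y)
    print(y, bernoulli_expfam(y, theta), standard)  # the two forms agree
```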
Example: Gaussian
$$p(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\}$$

- $\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix} \;\Rightarrow\; \mu = -\frac{\theta_1}{2\theta_2},\quad \sigma^2 = -\frac{1}{2\theta_2}$
- $T(y) = \begin{bmatrix} y \\ y^2 \end{bmatrix}$
- $A(\boldsymbol{\theta}) = -\ln\left( \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\mu^2}{2\sigma^2} \right\} \right) = \frac{1}{2}\ln 2\pi - \frac{1}{2}\ln(-2\theta_2) - \frac{\theta_1^2}{4\theta_2}$
- $h(y) = 1$
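A quick round-trip between the moment and natural parameterizations (a sketch; the variable names are illustrative):

```python
import numpy as np

# Round-trip between (mu, sigma^2) and natural parameters (theta1, theta2).
mu, sigma2 = 1.5, 0.8
theta1, theta2 = mu / sigma2, -1.0 / (2.0 * sigma2)
print(-theta1 / (2.0 * theta2), -1.0 / (2.0 * theta2))  # recovers (1.5, 0.8)

# Log partition function in terms of the natural parameters:
A = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * theta2) - theta1**2 / (4 * theta2)
# It should equal ln(sqrt(2*pi)*sigma) + mu^2 / (2*sigma^2):
print(A, 0.5 * np.log(2 * np.pi * sigma2) + mu**2 / (2 * sigma2))
```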
Example: Multinomial
$$p(\boldsymbol{y}|\boldsymbol{\pi}) = \prod_{l=1}^{L} \pi_l^{y_l}, \qquad \sum_{l=1}^{L} \pi_l = 1$$

$$p(\boldsymbol{y}|\boldsymbol{\pi}) = \exp\left\{ \sum_{l=1}^{L} y_l \ln\pi_l \right\} = \exp\left\{ \sum_{l=1}^{L-1} y_l \ln\pi_l + \left(1 - \sum_{l=1}^{L-1} y_l\right) \ln\left(1 - \sum_{l=1}^{L-1} \pi_l\right) \right\}$$

- $\boldsymbol{\theta} = [\theta_1, \dots, \theta_{L-1}]^T = \left[ \ln\frac{\pi_1}{1-\sum_{l=1}^{L-1}\pi_l}, \dots, \ln\frac{\pi_{L-1}}{1-\sum_{l=1}^{L-1}\pi_l} \right]^T = \left[ \ln\frac{\pi_1}{\pi_L}, \dots, \ln\frac{\pi_{L-1}}{\pi_L} \right]^T$
  $\;\Rightarrow\; \pi_l = \frac{e^{\theta_l}}{\sum_{k=1}^{L} e^{\theta_k}}$ (with $\theta_L = 0$)
- $T(\boldsymbol{y}) = [y_1, \dots, y_{L-1}]^T$
- $A(\boldsymbol{\theta}) = -\ln \pi_L = -\ln\left(1 - \sum_{l=1}^{L-1}\pi_l\right) = \ln \sum_{k=1}^{L} e^{\theta_k}$
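The map from natural parameters back to the moment parameters is the softmax; a small sketch (with `theta_L` fixed to 0, per the convention above):

```python
import numpy as np

def softmax(theta):
    """Map natural parameters to multinomial moment parameters,
    pi_l = exp(theta_l) / sum_k exp(theta_k), computed stably."""
    t = theta - np.max(theta)        # shift for numerical stability
    e = np.exp(t)
    return e / e.sum()

pi = np.array([0.2, 0.5, 0.3])
theta = np.append(np.log(pi[:-1] / pi[-1]), 0.0)  # theta_L fixed to 0
print(softmax(theta))                # recovers [0.2, 0.5, 0.3]
```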
Well-behaved parameter space
- Multiple exponential-family parameterizations may encode the same set of distributions.
- We want the parameter space $\{\boldsymbol{\theta} : 0 < Z(\boldsymbol{\theta}) < \infty\}$ to be:
  - A convex set
  - Non-redundant: $\boldsymbol{\theta} \neq \boldsymbol{\theta}' \Rightarrow p(\boldsymbol{y}|\boldsymbol{\theta}) \neq p(\boldsymbol{y}|\boldsymbol{\theta}')$
  - Such that the function from the moment parameters to $\boldsymbol{\theta}$ is invertible
    - Example: the invertible function from $\pi$ to $\theta$ in the Bernoulli example, $\pi = \frac{1}{1+e^{-\theta}}$
Examples of non-exponential distributions
- Uniform (the support depends on the parameter)
- Laplace
- Student's t-distribution
Moments
$$A(\boldsymbol{\theta}) = \ln Z(\boldsymbol{\theta}), \qquad Z(\boldsymbol{\theta}) = \int h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}\, d\boldsymbol{y}$$

$$\nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \frac{\nabla_{\boldsymbol{\theta}} Z(\boldsymbol{\theta})}{Z(\boldsymbol{\theta})} = \frac{\int h(\boldsymbol{y})\, T(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}\, d\boldsymbol{y}}{Z(\boldsymbol{\theta})} = \int T(\boldsymbol{y})\, \frac{h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}}{Z(\boldsymbol{\theta})}\, d\boldsymbol{y} = E_{p(\boldsymbol{y}|\boldsymbol{\theta})}[T(\boldsymbol{y})]$$

$$\Rightarrow\; \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[T(\boldsymbol{y})]$$

$$\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[T(\boldsymbol{y}) T(\boldsymbol{y})^T] - E_{\boldsymbol{\theta}}[T(\boldsymbol{y})]\, E_{\boldsymbol{\theta}}[T(\boldsymbol{y})]^T = \mathrm{Cov}_{\boldsymbol{\theta}}[T(\boldsymbol{y})]$$

- The first derivative of $A(\boldsymbol{\theta})$ is the mean of the sufficient statistics.
- The $i$-th derivative gives the $i$-th centered moment of the sufficient statistics.
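A finite-difference check of these identities for the Bernoulli family (a sketch; the step sizes are illustrative):

```python
import numpy as np

# Check dA/dtheta = E[T(y)] for the Bernoulli family,
# where A(theta) = log(1 + e^theta) and T(y) = y.
def A(theta):
    return np.log1p(np.exp(theta))

theta, eps = 0.7, 1e-6
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)   # finite difference
mean_T = 1.0 / (1.0 + np.exp(-theta))                # E[y] = pi
print(dA, mean_T)                                    # both ~0.668

# Second derivative equals Var[T(y)] = pi * (1 - pi):
eps2 = 1e-4
d2A = (A(theta + eps2) - 2 * A(theta) + A(theta - eps2)) / eps2**2
print(d2A, mean_T * (1 - mean_T))
```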
Properties
- The moment parameters $\boldsymbol{\mu}$ can be derived as a function of the natural (canonical) parameters:
  $$\boldsymbol{\mu} \equiv E_{\boldsymbol{\theta}}[T(\boldsymbol{y})] \;\Rightarrow\; \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \boldsymbol{\mu}$$
- $A(\boldsymbol{\theta})$ is convex, since $\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \mathrm{Cov}_{\boldsymbol{\theta}}[T(\boldsymbol{y})] \succeq 0$
  - A covariance matrix is always positive semi-definite, so the Hessian $\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta})$ is positive semi-definite, and hence $A(\boldsymbol{\theta}) = \ln Z(\boldsymbol{\theta})$ is a convex function of $\boldsymbol{\theta}$.
- For many distributions, the familiar parameters are exactly these moments: $\boldsymbol{\mu} \equiv E_{\boldsymbol{\theta}}[T(\boldsymbol{y})]$.
Exponential family: moment parameterization
- A distribution in the exponential family can also be parameterized by the moment parameterization:
  $$p(\boldsymbol{y}|\boldsymbol{\mu}) = \frac{1}{Z(\boldsymbol{\mu})}\, h(\boldsymbol{y}) \exp\{\psi(\boldsymbol{\mu})^T T(\boldsymbol{y})\}, \qquad Z(\boldsymbol{\mu}) = \int h(\boldsymbol{y}) \exp\{\psi(\boldsymbol{\mu})^T T(\boldsymbol{y})\}\, d\boldsymbol{y}$$
- If $\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) \succ 0$, then $\nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta})$ is strictly increasing, so $\psi^{-1}(\boldsymbol{\theta}) = \boldsymbol{\mu} = \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta})$ is one-to-one.
- The mapping between the moments and the canonical parameters is therefore invertible (a 1-to-1 relationship):
  $$\boldsymbol{\theta} = \psi(\boldsymbol{\mu}), \qquad \boldsymbol{\mu} = \psi^{-1}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) \equiv E_{\boldsymbol{\theta}}[T(\boldsymbol{y})]$$
  where $\psi$ maps the moment parameters $\boldsymbol{\mu}$ (the expected sufficient statistics) to the canonical parameters.
Sufficiency
- A statistic is a function of a random variable.
- Suppose that the distribution of $Y$ depends on a parameter $\theta$.
- "$T(Y)$ is a sufficient statistic for $\theta$ if there is no information in $Y$ regarding $\theta$ beyond that in $T(Y)$."
- Sufficiency in both the frequentist and Bayesian frameworks implies a factorization of $p(y|\theta)$ (Neyman factorization theorem):
  $$p(y|\theta) = g(T(y), \theta)\, h(y, T(y))$$
Sufficient statistic
- Sufficient statistic and the exponential family:
  $$p(\boldsymbol{y}|\boldsymbol{\theta}) = h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}) - A(\boldsymbol{\theta})\}$$
- The sufficient statistic under i.i.d. sampling is obtained easily: for a set $\mathcal{D}$ of $N$ observations,
  $$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{n=1}^{N} h(\boldsymbol{y}^{(n)}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}^{(n)}) - A(\boldsymbol{\theta})\} = \left[\prod_{n=1}^{N} h(\boldsymbol{y}^{(n)})\right] \exp\left\{ \boldsymbol{\theta}^T \sum_{n=1}^{N} T(\boldsymbol{y}^{(n)}) - N A(\boldsymbol{\theta}) \right\}$$
  which is itself in the exponential family, with sufficient statistic $\sum_{n=1}^{N} T(\boldsymbol{y}^{(n)})$.
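A sketch of accumulating the sufficient statistics of an i.i.d. sample, here for the Gaussian family with $T(y) = (y, y^2)$ (the data and names are illustrative):

```python
import numpy as np

# Sufficient statistics under i.i.d. sampling: just sum T(y) over the data.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=0.9, size=1000)

T_sum = np.array([data.sum(), (data**2).sum()])  # sum_n T(y^(n))
print(T_sum / len(data))  # average sufficient stats ~ (mu, mu^2 + sigma^2)
```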
MLE for exponential family
$$\ell(\boldsymbol{\theta}; \mathcal{D}) = \ln p(\mathcal{D}|\boldsymbol{\theta}) = \ln \prod_{n=1}^{N} h(\boldsymbol{y}^{(n)}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}^{(n)}) - A(\boldsymbol{\theta})\} = \sum_{n=1}^{N} \ln h(\boldsymbol{y}^{(n)}) + \boldsymbol{\theta}^T \sum_{n=1}^{N} T(\boldsymbol{y}^{(n)}) - N A(\boldsymbol{\theta})$$

$$\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}; \mathcal{D}) = 0 \;\Rightarrow\; \sum_{n=1}^{N} T(\boldsymbol{y}^{(n)}) - N \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = 0$$

$$\Rightarrow\; \nabla_{\boldsymbol{\theta}} A(\widehat{\boldsymbol{\theta}}) = E_{\widehat{\boldsymbol{\theta}}}[T(\boldsymbol{y})] = \frac{\sum_{n=1}^{N} T(\boldsymbol{y}^{(n)})}{N}$$

- Moment matching: the expected sufficient statistics are matched to their empirical average.
- $\ell$ is a concave function, so this stationary point is the global maximum.
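For the Bernoulli family the moment-matching equation solves in closed form; a sketch (simulated data, illustrative names):

```python
import numpy as np

# MLE by moment matching for the Bernoulli family:
# solve dA/dtheta = mean(T), i.e. pi_hat = mean(y).
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=5000)

pi_hat = y.mean()                          # matched moment E[T(y)] = E[y]
theta_hat = np.log(pi_hat / (1 - pi_hat))  # corresponding natural parameter
print(pi_hat, theta_hat)                   # pi_hat ~ 0.3
```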
Maximum entropy models
- Among all distributions with certain moments of interest, the exponential family is the most random (it imposes the least additional assumptions or structure).
- Out of all distributions that reproduce the observed sufficient statistics, the exponential-family distribution (roughly) makes the fewest additional assumptions.
- The unique distribution maximizing the entropy, subject to the constraint that these moments are exactly matched, is an exponential-family distribution.
Maximum entropy
- Constraints:
  $$E[g_l] = \sum_{\boldsymbol{y}} g_l(\boldsymbol{y})\, p(\boldsymbol{y}) = F_l$$
  where $g_l(\boldsymbol{y})$ is an arbitrary function and $F_l$ is a constant.
- Maximum entropy (maxent): pick the distribution with maximum entropy subject to the constraints.
- Lagrangian:
  $$J(p, \boldsymbol{\lambda}) = -\sum_{\boldsymbol{y}} p(\boldsymbol{y}) \log p(\boldsymbol{y}) + \lambda_0 \left(1 - \sum_{\boldsymbol{y}} p(\boldsymbol{y})\right) + \sum_{l} \lambda_l \left( F_l - \sum_{\boldsymbol{y}} g_l(\boldsymbol{y})\, p(\boldsymbol{y}) \right)$$
  $$\nabla J = 0 \;\Rightarrow\; p(\boldsymbol{y}) = \frac{1}{Z} \exp\left\{ -\sum_{l} \lambda_l g_l(\boldsymbol{y}) \right\}, \qquad Z = \sum_{\boldsymbol{y}} \exp\left\{ -\sum_{l} \lambda_l g_l(\boldsymbol{y}) \right\}$$
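A small worked instance (our own toy example, not from the slides): the maxent distribution over a 6-sided die with a single mean constraint has the tilted form above, and the multiplier can be found by bisection:

```python
import numpy as np

# Maxent over a 6-sided die subject to E[y] = 4.5 (a sketch):
# the solution is p(y) ∝ exp(-lambda * y); solve for lambda by bisection.
vals = np.arange(1, 7, dtype=float)

def mean_under(lam):
    w = np.exp(-lam * vals)
    p = w / w.sum()
    return p, p @ vals

lo, hi = -10.0, 10.0           # E[y] decreases as lambda increases
for _ in range(100):
    mid = 0.5 * (lo + hi)
    _, m = mean_under(mid)
    if m > 4.5:
        lo = mid               # mean too large -> need a larger lambda
    else:
        hi = mid
p, m = mean_under(0.5 * (lo + hi))
print(p, m)                    # tilted distribution with mean ~4.5
```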
Maximum entropy: constraints
- The constants in the constraints, $F_l$, measure the empirical counts on the training data:
  $$F_l = \frac{\sum_{n=1}^{N} g_l(\boldsymbol{y}^{(n)})}{N}$$
- These constraints also ensure consistency automatically.
Exponential family: summary
- Many famous distributions are in the exponential family.
- Important properties for learning with exponential families:
  - The gradient of the log partition function gives the expected sufficient statistics, or moments.
  - The moments of any distribution in the exponential family can therefore be computed easily by taking derivatives of the log normalizer.
  - The Hessian of the log partition function is positive semi-definite, so the log partition function is convex.
  - Among all distributions with certain moments of interest, the exponential family has the highest entropy.
- Exponential families are important for modeling the distributions of Markov networks.
Generalized linear models (GLIMs)
- GLIMs model the conditional relationship between $y$ and $\boldsymbol{x}$.
- Examples:
  - Linear regression: $p(y|\boldsymbol{x}, \boldsymbol{w}, \sigma^2) = \mathcal{N}(y|\boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
  - Discriminative linear classifiers (two-class):
    - Logistic regression: $p(y|\boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y|\sigma(\boldsymbol{w}^T\boldsymbol{x}))$
    - Probit regression: $p(y|\boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y|\Phi(\boldsymbol{w}^T\boldsymbol{x}))$, where $\Phi$ is the cdf of $\mathcal{N}(0,1)$
Generalized linear models (GLIMs)
- $p(y|\boldsymbol{x})$ is a generalized linear model if:
  - $\boldsymbol{x}$ enters the model via a linear combination $\xi = \boldsymbol{w}^T\boldsymbol{x}$
  - The conditional mean of $p(y|\boldsymbol{x})$ is expressed as $f(\boldsymbol{w}^T\boldsymbol{x})$, where $f$ is called the response function: $\mu = E[y|\boldsymbol{x}] = f(\boldsymbol{w}^T\boldsymbol{x})$
  - The distribution of $y$ is characterized by an exponential-family distribution (with conditional mean $f(\boldsymbol{w}^T\boldsymbol{x})$)
- We have two choices in the specification of a GLIM (see the sketch after this list):
  - The choice of the exponential-family distribution, usually constrained by the nature of $y$
  - The choice of the response function $f$: the principal degree of freedom in the specification of a GLIM. However, we need to impose constraints on this function (e.g., $f$ must map into $[0,1]$ for a Bernoulli distribution on $y$).
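A sketch of the two choices for a few standard pairings (the Poisson row is a standard example we add here, not on the slides):

```python
import numpy as np

# Typical (family, response function) pairings in a GLIM.
# The response function maps the linear predictor xi = w^T x to the mean mu.
responses = {
    "gaussian":  lambda xi: xi,                      # identity (linear regression)
    "bernoulli": lambda xi: 1 / (1 + np.exp(-xi)),   # logistic sigmoid, in [0,1]
    "poisson":   lambda xi: np.exp(xi),              # exp keeps the mean positive
}

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
xi = w @ x
for family, f in responses.items():
    print(family, f(xi))
```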
The relation between variables in a GLIM

(Figure omitted: the chain $\boldsymbol{x} \xrightarrow{\;\boldsymbol{w}\;} \xi = \boldsymbol{w}^T\boldsymbol{x} \xrightarrow{\;f\;} \mu \xrightarrow{\;\text{exp. family}\;} y$, from input to linear predictor to conditional mean to response.)
Canonical response function
- Canonical response function: $f(\cdot) = \psi^{-1}(\cdot)$, i.e., $\xi = \theta$
- In this case, the choice of the exponential-family density completely determines the GLIM.
- The constraints on the range of $f$ are automatically satisfied: the values $\mu = f(\theta)$ are guaranteed to be possible values of the conditional expectation, since
  $$f(\theta) = \psi^{-1}(\theta) = \frac{dA(\theta)}{d\theta} = E[y|\theta]$$
Log likelihood for GLIMs
$$\ell(\boldsymbol{\theta}; \mathcal{D}) = \ln p(\mathcal{D}|\boldsymbol{\theta}) = \ln \prod_{n=1}^{N} h(y^{(n)}) \exp\{\theta^{(n)} y^{(n)} - A(\theta^{(n)})\} = \sum_{n=1}^{N} \ln h(y^{(n)}) + \sum_{n=1}^{N} \left[ \theta^{(n)} y^{(n)} - A(\theta^{(n)}) \right]$$

- $\theta^{(n)} = \psi(\mu^{(n)})$ and $\mu^{(n)} = f(\boldsymbol{w}^T \boldsymbol{x}^{(n)})$
- In the case of the canonical response function, $\theta^{(n)} = \boldsymbol{w}^T \boldsymbol{x}^{(n)}$:
  $$\ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \ln h(y^{(n)}) + \boldsymbol{w}^T \sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)} - \sum_{n=1}^{N} A(\boldsymbol{w}^T \boldsymbol{x}^{(n)})$$
  where $\sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)}$ is the sufficient statistic for $\boldsymbol{w}$.
Gradient of log likelihood
$$\nabla_{\boldsymbol{w}}\, \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \frac{d\ell}{d\theta^{(n)}}\, \nabla_{\boldsymbol{w}} \theta^{(n)} = \sum_{n=1}^{N} \left( y^{(n)} - \frac{dA(\theta^{(n)})}{d\theta^{(n)}} \right) \nabla_{\boldsymbol{w}} \theta^{(n)} = \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \frac{d\theta^{(n)}}{d\mu^{(n)}}\, \frac{d\mu^{(n)}}{d\xi^{(n)}}\, \boldsymbol{x}^{(n)}$$

- In the case of the canonical response function ($\theta^{(n)} = \xi^{(n)}$):
  $$\nabla_{\boldsymbol{w}}\, \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \boldsymbol{x}^{(n)}, \qquad \mu^{(n)} = f(\boldsymbol{w}^T \boldsymbol{x}^{(n)})$$
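A direct transcription of the canonical-response gradient (a sketch; `glim_gradient` and the toy data are ours):

```python
import numpy as np

# Gradient of the GLIM log likelihood under a canonical response function:
# grad = sum_n (y_n - mu_n) x_n, here instantiated for logistic regression.
def glim_gradient(w, X, y, f):
    mu = f(X @ w)              # conditional means mu^(n) = f(w^T x^(n))
    return X.T @ (y - mu)      # sum over data of (y - mu) * x

sigmoid = lambda xi: 1 / (1 + np.exp(-xi))
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])   # toy design matrix
y = np.array([1.0, 0.0, 1.0])
print(glim_gradient(np.zeros(2), X, y, sigmoid))
```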
Online learning for GLIMs
- An LMS-like algorithm as a generic stochastic gradient method for GLIMs (ascent on the log likelihood, equivalently descent on the negative log likelihood):
  $$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left( y^{(n)} - \mu^{(n),t} \right) \boldsymbol{x}^{(n)}, \qquad \mu^{(n),t} = f\big( (\boldsymbol{w}^t)^T \boldsymbol{x}^{(n)} \big)$$
- If we do not use the canonical response function, only scaling coefficients due to the derivatives of $f(\cdot)$ and $\psi(\cdot)$ are additionally incorporated into the step size.
- Similar to the Least Mean Squares (LMS) algorithm.
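A sketch of the resulting online update for logistic regression (the step size and data are illustrative):

```python
import numpy as np

# Online (stochastic-gradient) learning for a GLIM, here logistic
# regression with a fixed step size rho.
sigmoid = lambda xi: 1 / (1 + np.exp(-xi))

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(2000, 2))
y = rng.binomial(1, sigmoid(X @ w_true)).astype(float)

w, rho = np.zeros(2), 0.1
for n in range(len(y)):                   # one pass over the data
    mu = sigmoid(w @ X[n])                # current conditional mean
    w = w + rho * (y[n] - mu) * X[n]      # LMS-like update
print(w)                                  # roughly approaches w_true
```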
Batch learning for GLIMs: Newton-Raphson
For the canonical response function:

$$\nabla_{\boldsymbol{w}}\, \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \boldsymbol{x}^{(n)} = \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{\mu})$$

$$\boldsymbol{H} = \frac{\partial^2 \ell}{\partial \boldsymbol{w}\, \partial \boldsymbol{w}^T} = \frac{\partial}{\partial \boldsymbol{w}^T} \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \boldsymbol{x}^{(n)} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{\partial \mu^{(n)}}{\partial \boldsymbol{w}^T} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\theta^{(n)}} \frac{\partial \theta^{(n)}}{\partial \boldsymbol{w}^T}$$

Since $\theta^{(n)} = \boldsymbol{w}^T \boldsymbol{x}^{(n)}$:

$$\boldsymbol{H} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\theta^{(n)}}\, \boldsymbol{x}^{(n)T} = -\boldsymbol{X}^T \boldsymbol{W} \boldsymbol{X}, \qquad \boldsymbol{W} = \mathrm{diag}\left( \frac{d\mu^{(1)}}{d\theta^{(1)}}, \dots, \frac{d\mu^{(N)}}{d\theta^{(N)}} \right)$$

where

$$\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}, \qquad \boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}, \qquad \frac{d\mu^{(n)}}{d\theta^{(n)}} = \frac{d^2 A}{d\theta^{(n)2}}$$
Batch learning for GLIMs: Newton-Raphson

$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \left( \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{\mu}_t) = \left( \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X} \right)^{-1} \left[ \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X}\, \boldsymbol{w}^t + \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{\mu}_t) \right]$$

$$\Rightarrow\; \boldsymbol{w}^{t+1} = \left( \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{z}_t, \qquad \boldsymbol{z}_t = \boldsymbol{X} \boldsymbol{w}^t + \boldsymbol{W}_t^{-1} (\boldsymbol{y} - \boldsymbol{\mu}_t)$$

- Iteratively Reweighted Least Squares (IRLS): each step solves a weighted least-squares problem with the "working response" $\boldsymbol{z}_t$.
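A generic IRLS sketch under these formulas (the helper name `irls` and its signature are ours; `df` supplies the diagonal of $\boldsymbol{W}$):

```python
import numpy as np

# IRLS for a GLIM with canonical response. `f` is the response function
# and `df` computes d(mu)/d(theta), the diagonal of W.
def irls(X, y, f, df, n_iter=20):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta = X @ w                      # linear predictors
        mu = f(theta)                      # conditional means
        W = df(theta)                      # weights d(mu)/d(theta), length N
        z = theta + (y - mu) / W           # working response z_t
        # weighted least squares: w = (X^T W X)^{-1} X^T W z
        XtW = X.T * W                      # multiplies columns by the weights
        w = np.linalg.solve(XtW @ X, XtW @ z)
    return w
```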
Linear regression
- Cost function (according to MLE where $p(y|\boldsymbol{x}) = \mathcal{N}(y|\boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$):
  $$J(\boldsymbol{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{x}^{(n)} - y^{(n)} \right)^2$$
  $$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \boldsymbol{0} \;\Rightarrow\; \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y}$$
- Online learning (LMS):
  $$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left( y^{(n)} - (\boldsymbol{w}^t)^T \boldsymbol{x}^{(n)} \right) \boldsymbol{x}^{(n)}$$
- IRLS, with the canonical response function $\mu(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} = \theta(\boldsymbol{x})$, so $\frac{d\mu}{d\theta} = 1 \Rightarrow \boldsymbol{W} = \boldsymbol{I}$:
  $$\boldsymbol{w}^{t+1} = \left( \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{z}_t = \left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \left( \boldsymbol{X}\boldsymbol{w}^t + \boldsymbol{y} - \boldsymbol{\mu}_t \right) = \left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y}$$
  so IRLS converges in a single step.
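A sketch confirming the closed form (illustrative data):

```python
import numpy as np

# Linear regression via the normal equations; one IRLS step (W = I)
# gives the same answer, per the derivation above.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
print(w_ols)                                # close to w_true
```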
Logistic regression
$$\mu(\boldsymbol{x}) = \frac{1}{1 + e^{-\theta(\boldsymbol{x})}}$$

- Canonical response function: $\theta = \xi = \boldsymbol{w}^T\boldsymbol{x}$
- IRLS: $\frac{d\mu}{d\theta} = \mu(1-\mu)$, so
  $$\boldsymbol{W} = \mathrm{diag}\left( \mu^{(1)}(1-\mu^{(1)}), \dots, \mu^{(N)}(1-\mu^{(N)}) \right)$$
  $$\boldsymbol{w}^{t+1} = \left( \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{z}_t, \qquad \boldsymbol{z}_t = \boldsymbol{X}\boldsymbol{w}^t + \boldsymbol{W}_t^{-1} (\boldsymbol{y} - \boldsymbol{\mu}_t)$$
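A usage sketch, reusing the `irls` helper defined after the Newton-Raphson slide (assumed to be in scope):

```python
import numpy as np

# Running the IRLS sketch above for logistic regression, where
# d(mu)/d(theta) = mu * (1 - mu).
sigmoid = lambda t: 1 / (1 + np.exp(-t))
dmu = lambda t: sigmoid(t) * (1 - sigmoid(t))

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2))
w_true = np.array([1.0, -2.0])
y = rng.binomial(1, sigmoid(X @ w_true)).astype(float)

w_hat = irls(X, y, sigmoid, dmu)   # `irls` is the sketch defined earlier
print(w_hat)                       # close to w_true
```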