

SLIDE 1

Exponential family & Generalized Linear Models (GLIMs)

Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani

SLIDE 2

Outline

 Exponential family
   Many standard distributions are in this family
   Similarities among learning algorithms for different models in this family:
     ML estimation has a simple form for exponential families: moment matching of sufficient statistics
     Bayesian learning is simplest for exponential families
     They have a maximum entropy interpretation
 GLIMs as a way to parameterize conditional distributions that have an exponential family distribution on a variable for each value of its parents

SLIDE 3

Exponential family: canonical parameterization

$$p(\boldsymbol{y} \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\, h(\boldsymbol{y})\, \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}, \qquad Z(\boldsymbol{\theta}) = \int h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}\, d\boldsymbol{y}$$

$$p(\boldsymbol{y} \mid \boldsymbol{\theta}) = h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}) - \ln Z(\boldsymbol{\theta})\}$$

 $T: \mathcal{Y} \rightarrow \mathbb{R}^L$: sufficient statistics function
 $\boldsymbol{\theta}$: natural or canonical parameters
 $h: \mathcal{Y} \rightarrow \mathbb{R}^+$: reference measure, independent of the parameters
 $Z$: normalization factor or partition function ($0 < Z(\boldsymbol{\theta}) < \infty$); $A(\boldsymbol{\theta}) = \ln Z(\boldsymbol{\theta})$ is the log partition function

SLIDE 4

Example: Bernoulli

$$p(y \mid \mu) = \mu^y (1-\mu)^{1-y} = \exp\left\{ y \ln\frac{\mu}{1-\mu} + \ln(1-\mu) \right\}$$

 $\theta = \ln\dfrac{\mu}{1-\mu} \;\Rightarrow\; \mu = \dfrac{e^{\theta}}{e^{\theta}+1} = \dfrac{1}{1+e^{-\theta}}$
 $T(y) = y$
 $A(\theta) = -\ln(1-\mu) = \ln(1+e^{\theta})$
 $h(y) = 1$
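To make the mapping concrete, here is a minimal NumPy sketch (an illustration, not from the slides) checking that the canonical form reproduces the standard Bernoulli pmf:

```python
import numpy as np

def bernoulli_pmf(y, mu):
    # Standard parameterization: p(y | mu) = mu^y (1 - mu)^(1 - y)
    return mu**y * (1 - mu)**(1 - y)

def bernoulli_canonical(y, theta):
    # Canonical form: p(y | theta) = h(y) exp{theta T(y) - A(theta)}
    # with T(y) = y, h(y) = 1, A(theta) = ln(1 + e^theta)
    return np.exp(theta * y - np.log1p(np.exp(theta)))

mu = 0.3
theta = np.log(mu / (1 - mu))          # natural parameter
for y in (0, 1):
    assert np.isclose(bernoulli_pmf(y, mu), bernoulli_canonical(y, theta))
print("Bernoulli canonical form matches for mu =", mu)
```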
SLIDE 5

Example: Gaussian

$$p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\}$$

 $\boldsymbol{\theta} = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = \begin{pmatrix} \mu/\sigma^2 \\ -1/2\sigma^2 \end{pmatrix} \;\Rightarrow\; \mu = -\dfrac{\theta_1}{2\theta_2}, \quad \sigma^2 = -\dfrac{1}{2\theta_2}$
 $T(y) = \begin{pmatrix} y \\ y^2 \end{pmatrix}$
 $A(\boldsymbol{\theta}) = -\ln\left[ \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\dfrac{\mu^2}{2\sigma^2} \right\} \right] = \dfrac{1}{2}\ln 2\pi - \dfrac{1}{2}\ln(-2\theta_2) - \dfrac{\theta_1^2}{4\theta_2}$
 $h(y) = 1$
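The same check for the Gaussian (again a hedged sketch; the test values are arbitrary):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gaussian_canonical(y, theta1, theta2):
    # p(y | theta) = exp{theta1*y + theta2*y^2 - A(theta)}, h(y) = 1
    A = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * theta2) - theta1**2 / (4 * theta2)
    return np.exp(theta1 * y + theta2 * y**2 - A)

mu, sigma2 = 1.5, 0.8
theta1, theta2 = mu / sigma2, -1 / (2 * sigma2)   # natural parameters
ys = np.linspace(-3, 5, 7)
assert np.allclose(gaussian_pdf(ys, mu, sigma2),
                   gaussian_canonical(ys, theta1, theta2))
print("Gaussian canonical form matches")
```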
SLIDE 6

Example: Multinomial

$$p(\boldsymbol{y} \mid \boldsymbol{\pi}) = \prod_{l=1}^{L} \pi_l^{y_l}$$

$$p(\boldsymbol{y} \mid \boldsymbol{\pi}) = \exp\left\{ \sum_{l=1}^{L} y_l \ln \pi_l \right\} = \exp\left\{ \sum_{l=1}^{L-1} y_l \ln \pi_l + \left( 1 - \sum_{l=1}^{L-1} y_l \right) \ln\left( 1 - \sum_{l=1}^{L-1} \pi_l \right) \right\}$$

 $\boldsymbol{\theta} = (\theta_1, \dots, \theta_{L-1})^T = \left( \ln\dfrac{\pi_1}{1-\sum_{l=1}^{L-1}\pi_l}, \dots, \ln\dfrac{\pi_{L-1}}{1-\sum_{l=1}^{L-1}\pi_l} \right)^T$
 Equivalently, $\boldsymbol{\theta} = \left( \ln\dfrac{\pi_1}{\pi_L}, \dots, \ln\dfrac{\pi_{L-1}}{\pi_L} \right)^T \;\Rightarrow\; \pi_l = \dfrac{e^{\theta_l}}{\sum_{k=1}^{L} e^{\theta_k}}$ (taking $\theta_L = 0$)
 $T(\boldsymbol{y}) = (y_1, \dots, y_{L-1})^T$
 $A(\boldsymbol{\theta}) = -\ln \pi_L = -\ln\left( 1 - \sum_{l=1}^{L-1} \pi_l \right) = \ln \sum_{k=1}^{L} e^{\theta_k}$
 $h(\boldsymbol{y}) = 1$, with the constraint $\sum_{l=1}^{L} \pi_l = 1$
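A similar sketch for the multinomial, verifying the softmax inverse mapping and the canonical form (the probability vector below is an arbitrary choice, not from the slides):

```python
import numpy as np

def multinomial_pmf(y, pi):
    # One-hot y: p(y | pi) = prod_l pi_l^{y_l}
    return np.prod(pi**y)

def multinomial_canonical(y, theta):
    # theta_l = ln(pi_l / pi_L), theta_L = 0; A(theta) = ln sum_k e^{theta_k}
    A = np.log(np.sum(np.exp(theta)))
    return np.exp(theta @ y - A)

pi = np.array([0.2, 0.5, 0.3])
theta = np.log(pi / pi[-1])            # natural parameters, theta_L = 0
softmax = np.exp(theta) / np.exp(theta).sum()
assert np.allclose(softmax, pi)        # inverse mapping recovers pi
y = np.array([0, 1, 0])                # one-hot observation
assert np.isclose(multinomial_pmf(y, pi), multinomial_canonical(y, theta))
print("Multinomial canonical form matches")
```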

SLIDE 7

Well-behaved parameter space

 Multiple exponential families may encode the same set of distributions
 We want the parameter space $\{\boldsymbol{\theta} : 0 < Z(\boldsymbol{\theta}) < \infty\}$ to be:
   A convex set
   Non-redundant: $\boldsymbol{\theta} \neq \boldsymbol{\theta}' \Rightarrow p(\boldsymbol{y} \mid \boldsymbol{\theta}) \neq p(\boldsymbol{y} \mid \boldsymbol{\theta}')$
 The function from the moment parameters to $\boldsymbol{\theta}$ is then invertible
   Example: the invertible function from $\mu$ to $\theta$ in the Bernoulli example, $\mu = \dfrac{1}{1+e^{-\theta}}$

SLIDE 8

Examples of non-exponential distributions

 Uniform
 Laplace
 Student t-distribution

SLIDE 9

Moments

$$A(\boldsymbol{\theta}) = \ln Z(\boldsymbol{\theta}), \qquad Z(\boldsymbol{\theta}) = \int h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}\, d\boldsymbol{y}$$

$$\nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \frac{\nabla_{\boldsymbol{\theta}} Z(\boldsymbol{\theta})}{Z(\boldsymbol{\theta})} = \frac{\int h(\boldsymbol{y})\, T(\boldsymbol{y})\, \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}\, d\boldsymbol{y}}{Z(\boldsymbol{\theta})} = \int T(\boldsymbol{y})\, \frac{h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y})\}}{Z(\boldsymbol{\theta})}\, d\boldsymbol{y} = \mathbb{E}_{p(\boldsymbol{y}|\boldsymbol{\theta})}[T(\boldsymbol{y})]$$

$$\Rightarrow\; \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})]$$

$$\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})\, T(\boldsymbol{y})^T] - \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})]\, \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})]^T = \mathrm{Cov}_{\boldsymbol{\theta}}[T(\boldsymbol{y})]$$

The first derivative of $A(\boldsymbol{\theta})$ is the mean of the sufficient statistics; the $i$-th derivative gives the $i$-th centered moment of the sufficient statistics.
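These identities are easy to verify numerically. The sketch below (illustrative, not from the slides) applies finite differences to the Bernoulli log partition $A(\theta) = \ln(1+e^{\theta})$ and recovers the mean and variance of $T(y) = y$:

```python
import numpy as np

A = lambda th: np.log1p(np.exp(th))    # log partition of the Bernoulli family

theta, eps = 0.7, 1e-5
mu = 1 / (1 + np.exp(-theta))          # E[T(y)] = sigmoid(theta)

dA  = (A(theta + eps) - A(theta - eps)) / (2 * eps)              # first derivative
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2  # second derivative

assert np.isclose(dA, mu, atol=1e-6)              # mean of sufficient statistic
assert np.isclose(d2A, mu * (1 - mu), atol=1e-4)  # variance of sufficient statistic
print(f"dA = {dA:.6f} (mean), d2A = {d2A:.6f} (variance)")
```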
SLIDE 10

Properties

 The moment parameters $\boldsymbol{\mu}$ can be derived as a function of the natural (canonical) parameters:
$$\nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})], \qquad \boldsymbol{\mu} \equiv \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})] \;\Rightarrow\; \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \boldsymbol{\mu}$$
 $A(\boldsymbol{\theta})$ is convex, since $\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = \mathrm{Cov}_{\boldsymbol{\theta}}[T(\boldsymbol{y})] \succeq 0$
   The covariance matrix is always positive semi-definite, so the Hessian $\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta})$ is positive semi-definite, and hence $A(\boldsymbol{\theta}) = \ln Z(\boldsymbol{\theta})$ is a convex function of $\boldsymbol{\theta}$

For many distributions, we have $\boldsymbol{\mu} \equiv \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})]$.

SLIDE 11

Exponential family: moment parameterization

 A distribution in the exponential family can also be parameterized by the moment parameterization:
$$p(\boldsymbol{y} \mid \boldsymbol{\mu}) = \frac{1}{Z(\boldsymbol{\mu})}\, h(\boldsymbol{y})\, \exp\{\psi(\boldsymbol{\mu})^T T(\boldsymbol{y})\}, \qquad Z(\boldsymbol{\mu}) = \int h(\boldsymbol{y}) \exp\{\psi(\boldsymbol{\mu})^T T(\boldsymbol{y})\}\, d\boldsymbol{y}$$
 If $\nabla^2_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) \succ 0$, then $\nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta})$ is strictly increasing, so $\psi^{-1}(\boldsymbol{\theta}) = \boldsymbol{\mu} = \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta})$ is strictly increasing and thus one-to-one
 The mapping from the moments to the canonical parameters is therefore invertible (a one-to-one relationship): $\boldsymbol{\theta} = \psi(\boldsymbol{\mu})$
   $\psi$ maps the moment parameters (the expected sufficient statistics) to the canonical parameters: $\boldsymbol{\mu} \equiv \mathbb{E}_{\boldsymbol{\theta}}[T(\boldsymbol{y})] = \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta})$, i.e., $\boldsymbol{\mu} = \psi^{-1}(\boldsymbol{\theta})$

SLIDE 12

Sufficiency

 A statistic is a function of a random variable
 Suppose that the distribution of $Y$ depends on a parameter $\theta$
 "$T(Y)$ is a sufficient statistic for $\theta$ if there is no information in $Y$ regarding $\theta$ beyond that in $T(Y)$"
 Sufficiency in both the frequentist and Bayesian frameworks implies a factorization of $p(y \mid \theta)$ (Neyman factorization theorem):
$$p(y, T(y), \theta) = g(T(y), \theta)\, h(y, T(y))$$
$$p(y, \theta) = g(T(y), \theta)\, h(y, T(y))$$
$$p(y \mid \theta) = g'(T(y), \theta)\, h(y, T(y))$$

SLIDE 13

Sufficient statistic

 Sufficient statistics and the exponential family:
$$p(\boldsymbol{y} \mid \boldsymbol{\theta}) = h(\boldsymbol{y}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}) - A(\boldsymbol{\theta})\}$$
 In the case of i.i.d. sampling, a sufficient statistic can be obtained easily for a set of $N$ observations from the distribution:
$$p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} h(\boldsymbol{y}^{(n)}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}^{(n)}) - A(\boldsymbol{\theta})\} = \left[ \prod_{n=1}^{N} h(\boldsymbol{y}^{(n)}) \right] \exp\left\{ \boldsymbol{\theta}^T \sum_{n=1}^{N} T(\boldsymbol{y}^{(n)}) - N A(\boldsymbol{\theta}) \right\}$$
$\mathcal{D}$ itself has an exponential family distribution with sufficient statistic $\sum_{n=1}^{N} T(\boldsymbol{y}^{(n)})$
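The practical consequence is that the likelihood depends on the data only through $\sum_n T(\boldsymbol{y}^{(n)})$. A small sketch for the Bernoulli case (the data are synthetic, for illustration only):

```python
import numpy as np

def bernoulli_loglik(data, theta):
    # log p(D | theta) = theta * sum(T(y)) - N * A(theta), with T(y) = y
    return theta * np.sum(data) - len(data) * np.log1p(np.exp(theta))

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=20)
shuffled = rng.permutation(data)       # same sum of sufficient statistics

theta = 0.4
# Any two datasets with equal sum(T) and equal N give identical likelihoods.
assert np.isclose(bernoulli_loglik(data, theta), bernoulli_loglik(shuffled, theta))
print("Likelihood depends on the data only through sum(T(y)) =", data.sum())
```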

SLIDE 14

MLE for exponential family

$$\ell(\boldsymbol{\theta}; \mathcal{D}) = \ln p(\mathcal{D} \mid \boldsymbol{\theta}) = \ln \prod_{n=1}^{N} h(\boldsymbol{y}^{(n)}) \exp\{\boldsymbol{\theta}^T T(\boldsymbol{y}^{(n)}) - A(\boldsymbol{\theta})\} = \sum_{n=1}^{N} \ln h(\boldsymbol{y}^{(n)}) + \boldsymbol{\theta}^T \sum_{n=1}^{N} T(\boldsymbol{y}^{(n)}) - N A(\boldsymbol{\theta})$$

$$\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}; \mathcal{D}) = 0 \;\Rightarrow\; \sum_{n=1}^{N} T(\boldsymbol{y}^{(n)}) - N\, \nabla_{\boldsymbol{\theta}} A(\boldsymbol{\theta}) = 0$$

$$\Rightarrow\; \nabla_{\boldsymbol{\theta}} A(\hat{\boldsymbol{\theta}}) = \frac{\sum_{n=1}^{N} T(\boldsymbol{y}^{(n)})}{N} \;\Rightarrow\; \mathbb{E}_{\hat{\boldsymbol{\theta}}}[T(\boldsymbol{y})] = \frac{\sum_{n=1}^{N} T(\boldsymbol{y}^{(n)})}{N}$$

Moment matching: the expected sufficient statistics are set equal to their empirical average. The log likelihood is a concave function.
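For the Bernoulli family this moment matching has a closed form: $\sigma(\hat{\theta})$ must equal the sample mean. The sketch below (synthetic data and a brute-force grid search, both illustrative assumptions) confirms it:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=1000)

# Moment matching: E_theta[T(y)] = sigmoid(theta) must equal the sample mean.
mean_T = data.mean()
theta_hat = np.log(mean_T / (1 - mean_T))   # closed-form ML estimate

# Verify against a brute-force grid search over the log likelihood.
thetas = np.linspace(-3, 3, 10001)
loglik = thetas * data.sum() - len(data) * np.log1p(np.exp(thetas))
assert np.isclose(thetas[np.argmax(loglik)], theta_hat, atol=1e-3)
print(f"ML estimate theta = {theta_hat:.4f}, sample mean = {mean_T:.4f}")
```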

SLIDE 15

Maximum entropy models

 Among all distributions with certain moments of interest, the exponential family is the most random (it imposes the fewest assumptions or the least structure)
 Out of all distributions that reproduce the observed sufficient statistics, the exponential family distribution (roughly) makes the fewest additional assumptions
 The unique distribution maximizing the entropy, subject to the constraint that these moments are exactly matched, is an exponential family distribution

SLIDE 16

Maximum entropy

 Constraints: $\mathbb{E}[g_l] = \sum_{\boldsymbol{y}} g_l(\boldsymbol{y})\, p(\boldsymbol{y}) = G_l$, where $g_l(\boldsymbol{y})$ is an arbitrary function and $G_l$ is a constant
 Maximum entropy (maxent): pick the distribution with maximum entropy subject to the constraints, via the Lagrangian
$$L(p, \boldsymbol{\lambda}) = -\sum_{\boldsymbol{y}} p(\boldsymbol{y}) \log p(\boldsymbol{y}) + \lambda_0 \left( 1 - \sum_{\boldsymbol{y}} p(\boldsymbol{y}) \right) + \sum_{l} \lambda_l \left( G_l - \sum_{\boldsymbol{y}} g_l(\boldsymbol{y})\, p(\boldsymbol{y}) \right)$$
$$\nabla L = 0 \;\Rightarrow\; p(\boldsymbol{y}) = \frac{1}{Z} \exp\left\{ -\sum_{l} \lambda_l\, g_l(\boldsymbol{y}) \right\}, \qquad Z = \sum_{\boldsymbol{y}} \exp\left\{ -\sum_{l} \lambda_l\, g_l(\boldsymbol{y}) \right\}$$
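As a worked example, the sketch below fits a maxent distribution over a six-sided die constrained to have mean 4.5, by gradient ascent on the multiplier (the example, the step size, and the sign convention, with $-\lambda_l$ absorbed into $\lambda_l$, are assumptions for illustration, not from the slides):

```python
import numpy as np

# Maxent over a six-sided die, constrained to have mean 4.5.
ys = np.arange(1, 7, dtype=float)    # support
G = 4.5                              # target moment E[y]

lam = 0.0
for _ in range(5000):
    p = np.exp(lam * ys)
    p /= p.sum()                     # p(y) proportional to exp(lam * g(y)), g(y) = y
    lam += 0.1 * (G - p @ ys)        # ascend until the constraint E[y] = G holds

print(np.round(p, 4))                # exponentially tilted, not uniform
assert np.isclose(p @ ys, G, atol=1e-6)
```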

SLIDE 17

Maximum entropy: constraints

 Constants in the constraints:
   $G_l$ measures the empirical counts on the training data:
$$G_l = \frac{1}{N} \sum_{n=1}^{N} g_l(\boldsymbol{y}^{(n)})$$
   These constraints also ensure consistency automatically

SLIDE 18

Exponential family: summary

 Many famous distributions are in the exponential family
 Important properties for learning with exponential families:
   Gradients of the log partition function give the expected sufficient statistics (moments)
   Moments of any distribution in the exponential family can be easily computed by taking derivatives of the log normalizer
   The Hessian of the log partition function is positive semi-definite, so the log partition function is convex
   Among all distributions with certain moments of interest, the exponential family has the highest entropy
   Exponential families are important for modeling the distributions of Markov networks

SLIDE 19

Generalized linear models (GLIMs)

 Conditional relationship between the response $y$ and the input $\boldsymbol{x}$
 Examples:
   Linear regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}, \sigma^2) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
   Discriminative linear classifiers (two-class):
     Logistic regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \sigma(\boldsymbol{w}^T\boldsymbol{x}))$, where $\sigma$ is the logistic sigmoid
     Probit regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \Phi(\boldsymbol{w}^T\boldsymbol{x}))$, where $\Phi$ is the cdf of $\mathcal{N}(0, 1)$
SLIDE 20

Generalized linear models (GLIMs)

 $p(y \mid \boldsymbol{x})$ is a generalized linear model if:
   $\boldsymbol{x}$ enters the model via a linear combination $\xi = \boldsymbol{w}^T\boldsymbol{x}$
   The conditional mean of $p(y \mid \boldsymbol{x})$ is expressed as $g(\boldsymbol{w}^T\boldsymbol{x})$:
     $g$ is called the response function
     $\mu = \mathbb{E}[y \mid \boldsymbol{x}] = g(\boldsymbol{w}^T\boldsymbol{x})$
   The distribution of $y$ is characterized by an exponential family distribution (with conditional mean $g(\boldsymbol{w}^T\boldsymbol{x})$)
 We have two choices in the specification of a GLIM (see the sketch after this list):
   The choice of the exponential family distribution
     Usually constrained by the nature of $y$
   The choice of the response function $g$
     The principal degree of freedom in the specification of a GLIM
     However, we need to impose constraints on this function (e.g., $g$ must take values in $[0,1]$ for a Bernoulli distribution on $y$)
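A minimal code sketch of these two choices (all names here are illustrative assumptions, not from the slides): a GLIM is fixed by picking an exponential family for $y$ and a response function $g$:

```python
import numpy as np

# A GLIM is specified by (1) an exponential family for y and (2) a response
# function g giving the conditional mean mu = g(w^T x).

def make_glim_mean(g):
    """Return a function computing mu^(n) = g(w^T x^(n)) for each row of X."""
    def conditional_mean(w, X):
        return g(X @ w)
    return conditional_mean

sigmoid = lambda xi: 1 / (1 + np.exp(-xi))   # response for a Bernoulli GLIM
identity = lambda xi: xi                      # response for a Gaussian GLIM

logistic_mean = make_glim_mean(sigmoid)
linear_mean = make_glim_mean(identity)

X = np.array([[1.0, 2.0], [1.0, -1.0]])       # rows are inputs x^(n)
w = np.array([0.5, -0.25])
print(logistic_mean(w, X))   # means in [0, 1], valid for Bernoulli y
print(linear_mean(w, X))     # unconstrained means, valid for Gaussian y
```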

SLIDE 21

The relation between variables in a GLIM


SLIDE 22

Canonical response function

 Canonical response function: $g(\cdot) = \psi^{-1}(\cdot)$, i.e., $\xi = \theta$
   In this case, the choice of the exponential family density completely determines the GLIM
   The constraints on the range of $g$ are automatically satisfied:
     $\mu = g(\theta)$ is guaranteed to be a possible value of the conditional expectation (i.e., $g(\theta) = \psi^{-1}(\theta) = \dfrac{dA(\theta)}{d\theta} = \mathbb{E}[y \mid \theta]$)

SLIDE 23

Log likelihood for GLIMs

$$\ell(\boldsymbol{\theta}; \mathcal{D}) = \ln p(\mathcal{D} \mid \boldsymbol{\theta}) = \ln \prod_{n=1}^{N} h(y^{(n)}) \exp\{\theta^{(n)} y^{(n)} - A(\theta^{(n)})\} = \sum_{n=1}^{N} \ln h(y^{(n)}) + \sum_{n=1}^{N} \left( \theta^{(n)} y^{(n)} - A(\theta^{(n)}) \right)$$

 $\theta^{(n)} = \psi(\mu^{(n)})$ and $\mu^{(n)} = g(\boldsymbol{w}^T \boldsymbol{x}^{(n)})$
 In the case of the canonical response function, $\theta^{(n)} = \boldsymbol{w}^T \boldsymbol{x}^{(n)}$:
$$\ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \ln h(y^{(n)}) + \boldsymbol{w}^T \sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)} - \sum_{n=1}^{N} A(\boldsymbol{w}^T \boldsymbol{x}^{(n)})$$
$\sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)}$ are the sufficient statistics for $\boldsymbol{w}$

SLIDE 24

Gradient of log likelihood

$$\nabla_{\boldsymbol{w}}\, \ell(\boldsymbol{\theta}; \mathcal{D}) = \sum_{n=1}^{N} \frac{d\ell}{d\theta^{(n)}}\, \nabla_{\boldsymbol{w}} \theta^{(n)} = \sum_{n=1}^{N} \left( y^{(n)} - \frac{dA(\theta^{(n)})}{d\theta^{(n)}} \right) \nabla_{\boldsymbol{w}} \theta^{(n)} = \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \frac{d\theta^{(n)}}{d\mu^{(n)}} \frac{d\mu^{(n)}}{d\xi^{(n)}}\, \boldsymbol{x}^{(n)}$$

 In the case of the canonical response function, $\theta^{(n)} = \xi^{(n)}$:
$$\nabla_{\boldsymbol{w}}\, \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \boldsymbol{x}^{(n)}, \qquad \mu^{(n)} = g(\boldsymbol{w}^T \boldsymbol{x}^{(n)})$$

SLIDE 25

Online learning for GLIMs

 An LMS-like algorithm as a generic stochastic gradient update for GLIMs:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left( y^{(n)} - \mu_t^{(n)} \right) \boldsymbol{x}^{(n)}, \qquad \mu_t^{(n)} = g\left( (\boldsymbol{w}^t)^T \boldsymbol{x}^{(n)} \right)$$
 If we do not use the canonical response function, only scaling coefficients due to the derivatives of $g(\cdot)$ and $\psi(\cdot)$ will additionally be incorporated into the step size

This update has the same form as the Least Mean Squares (LMS) algorithm.
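A sketch of this online update for logistic regression, where $g$ is the sigmoid (the synthetic data, step size, and decay schedule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

g = lambda xi: 1 / (1 + np.exp(-xi))   # canonical response for Bernoulli

w, rho = np.zeros(d), 0.1
for epoch in range(50):
    for n in rng.permutation(N):
        mu_n = g(w @ X[n])                 # mu^(n) = g(w^T x^(n))
        w += rho * (y[n] - mu_n) * X[n]    # w <- w + rho (y - mu) x
    rho *= 0.95                            # decay the step size

print("estimated w:", np.round(w, 2), " true w:", w_true)
```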

SLIDE 26

Batch learning for GLIMs: Newton-Raphson

 For the canonical response function:
$$\nabla_{\boldsymbol{w}}\, \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \boldsymbol{x}^{(n)} = \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{\mu})$$
$$H = \frac{d^2 \ell}{d\boldsymbol{w}\, d\boldsymbol{w}^T} = \frac{d}{d\boldsymbol{w}^T} \sum_{n=1}^{N} \left( y^{(n)} - \mu^{(n)} \right) \boldsymbol{x}^{(n)} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\boldsymbol{w}^T} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\theta^{(n)}} \frac{d\theta^{(n)}}{d\boldsymbol{w}^T}$$
 Since $\theta^{(n)} = \boldsymbol{w}^T \boldsymbol{x}^{(n)}$:
$$H = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\theta^{(n)}}\, \boldsymbol{x}^{(n)T} = -\boldsymbol{X}^T \boldsymbol{W} \boldsymbol{X}, \qquad \boldsymbol{W} = \mathrm{diag}\left( \frac{d\mu^{(1)}}{d\theta^{(1)}}, \dots, \frac{d\mu^{(N)}}{d\theta^{(N)}} \right)$$
where
$$\boldsymbol{X} = \begin{pmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{pmatrix}, \qquad \boldsymbol{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix}, \qquad \frac{d\mu^{(n)}}{d\theta^{(n)}} = \frac{d^2 A}{d\theta^{(n)2}}$$

SLIDE 27

Batch learning for GLIMs: Newton-Raphson

$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + (\boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X})^{-1} \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{\mu}_t) = (\boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X})^{-1} \left[ \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X} \boldsymbol{w}^t + \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{\mu}_t) \right]$$

$$\Rightarrow\; \boldsymbol{w}^{t+1} = (\boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{z}_t, \qquad \boldsymbol{z}_t = \boldsymbol{X} \boldsymbol{w}^t + \boldsymbol{W}_t^{-1} (\boldsymbol{y} - \boldsymbol{\mu}_t)$$

Iteratively Reweighted Least Squares (IRLS): each step solves a weighted least-squares problem with weights $\boldsymbol{W}_t$ and working responses $\boldsymbol{z}_t$.
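A generic IRLS sketch for canonical-response GLIMs (the function name and the numerical safeguard are assumptions, not from the slides):

```python
import numpy as np

def irls(X, y, g, dg, n_iter=25):
    """Generic IRLS for a canonical-response GLIM (illustrative sketch).

    g  : response function, mu = g(theta)
    dg : its derivative, d mu / d theta
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta = X @ w
        mu = g(theta)
        Wdiag = np.clip(dg(theta), 1e-10, None)   # diagonal of W_t, kept positive
        z = theta + (y - mu) / Wdiag              # working responses z_t
        # Solve the weighted least-squares problem (X^T W X) w = X^T W z
        XtW = X.T * Wdiag
        w = np.linalg.solve(XtW @ X, XtW @ z)
    return w
```

For logistic regression, $g$ is the sigmoid and $d\mu/d\theta = \mu(1-\mu)$; for linear regression, $g$ is the identity and $d\mu/d\theta = 1$, so one step reproduces least squares.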

SLIDE 28

Linear regression

 Cost function (according to MLE, where $p(y \mid \boldsymbol{x}) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$):
$$J(\boldsymbol{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{x}^{(n)} - y^{(n)} \right)^2$$
$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \boldsymbol{0} \;\Rightarrow\; \boldsymbol{w} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$$
 Online learning (LMS):
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left( y^{(n)} - (\boldsymbol{w}^t)^T \boldsymbol{x}^{(n)} \right) \boldsymbol{x}^{(n)}$$
 IRLS:
$$\boldsymbol{w}^{t+1} = (\boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{z}_t = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \left( \boldsymbol{X} \boldsymbol{w}^t + \boldsymbol{y} - \boldsymbol{\mu}_t \right) = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$$

With the canonical response function, $\mu(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} = \theta(\boldsymbol{x})$, so $\dfrac{d\mu}{d\theta} = 1 \Rightarrow \boldsymbol{W} = \boldsymbol{I}$, and IRLS converges in a single step.
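A numerical confirmation of this one-step behavior (the synthetic data are an illustrative assumption):

```python
import numpy as np

# With identity response, one IRLS step from w = 0 already gives the
# normal-equations solution (W = I, z_0 = y).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

w0 = np.zeros(4)
z0 = X @ w0 + (y - X @ w0)                 # working response with W = I
w_irls_1step = np.linalg.solve(X.T @ X, X.T @ z0)

assert np.allclose(w_normal_eq, w_irls_1step)
print("One IRLS step reproduces least squares:", np.round(w_irls_1step, 2))
```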

SLIDE 29

Logistic regression

$$\mu(\boldsymbol{x}) = \frac{1}{1 + e^{-\theta(\boldsymbol{x})}}$$

 Canonical response function: $\theta = \xi = \boldsymbol{w}^T\boldsymbol{x}$
 IRLS: $\dfrac{d\mu}{d\theta} = \mu(1-\mu)$, so
$$\boldsymbol{W} = \mathrm{diag}\left( \mu^{(1)}(1-\mu^{(1)}), \dots, \mu^{(N)}(1-\mu^{(N)}) \right)$$
$$\boldsymbol{w}^{t+1} = (\boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{W}_t \boldsymbol{z}_t, \qquad \boldsymbol{z}_t = \boldsymbol{X} \boldsymbol{w}^t + \boldsymbol{W}_t^{-1} (\boldsymbol{y} - \boldsymbol{\mu}_t)$$
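A self-contained IRLS run for logistic regression (the synthetic data are an illustrative assumption; compare the update inside the loop with the formulas above):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 400, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -1.0, 0.5])
sigmoid = lambda t: 1 / (1 + np.exp(-t))
y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(d)
for _ in range(20):
    mu = sigmoid(X @ w)
    Wdiag = np.clip(mu * (1 - mu), 1e-10, None)   # d mu / d theta = mu (1 - mu)
    z = X @ w + (y - mu) / Wdiag                   # working responses z_t
    XtW = X.T * Wdiag
    w = np.linalg.solve(XtW @ X, XtW @ z)          # weighted least-squares step

print("IRLS estimate:", np.round(w, 2), " true w:", w_true)
```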

SLIDE 30

References


 Jordan, Chapter 8.
 Koller & Friedman, Sections 8.1-8.3.