SLIDE 1


CS480/680 Machine Learning Lecture 6: January 23rd, 2020

Maximum A Posteriori & Maximum Likelihood

Zahra Sheikhbahaee

Source: A Tutorial on Energy-Based Learning (LeCun et al.)

University of Waterloo

SLIDE 2

Outline

  • Probabilistic Modeling
  • Gibbs Distribution
  • Maximum A Posteriori Estimation
  • Maximum Likelihood Estimation


SLIDE 3

Probabilistic Modeling

Goal: Given a set of observations $S = \{(x_i, y_i) : i = 1, \dots, P\}$ (the training set), we want to produce a model for regression, classification, or decision making that predicts the best $Y$ from $X$. That is, we want to estimate a function that computes the conditional distribution $P(Y|X)$ for any given $X$. We write this function as $P(Y|X, S)$.

  • Design architecture: We decompose $P(Y|X, S)$ into two parts:
$$P(Y|X, S) = \int P(Y|X, W)\, P(W|S)\, dW$$
Our estimate of $P(Y|X, W)$ will be among a family of functions $f(W, Y, X)$, one for each value of the parameter vector $W$. The internal structure of the parameterized function $f(W, Y, X)$ is called the architecture (e.g. logistic regressors, neural networks, etc.).


SLIDE 4

Gibbs Distribution

  • Energy function: for convenience we will often define $f(W, Y, X)$ as the normalized exponential of an energy function $E(W, Y, X)$:
$$P(Y|X, W) \approx f(W, Y, X) = \frac{\exp(-\beta E(W, Y, X))}{Z(W, X, \beta)}$$
$\beta$: an arbitrary positive constant (the inverse temperature).
$Z(W, X, \beta)$: a normalization term (the partition function),
$$Z(W, X, \beta) = \int \exp(-\beta E(W, y, X))\, dy$$
$Z$ ensures that our estimate of $P(Y|X, W)$ is normalized. High-probability states correspond to low-energy configurations.
Condition: we can only transform energies into probabilities if $\int \exp(-\beta E(W, y, X))\, dy$ converges.
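To make the mapping from energies to probabilities concrete, here is a minimal sketch for a finite answer space, where the partition integral becomes a sum (the energy values and $\beta$ below are arbitrary illustrations, not from the slides):

```python
import numpy as np

def gibbs_distribution(energies, beta=1.0):
    """Turn energies E(W, y, x), one per candidate answer y, into P(y | x, W)."""
    # Shift by the minimum energy for numerical stability; the shift cancels
    # because the partition function is scaled by the same factor.
    e = beta * (np.asarray(energies) - np.min(energies))
    unnormalized = np.exp(-e)
    return unnormalized / unnormalized.sum()  # divide by the partition function Z

# Low energy -> high probability; a larger beta sharpens the distribution.
print(gibbs_distribution([0.5, 2.0, 3.0], beta=1.0))   # spread out
print(gibbs_distribution([0.5, 2.0, 3.0], beta=10.0))  # nearly all mass on y = 0
```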


SLIDE 5

Probabilistic Modeling

𝑄 ( 𝑍 |π‘Œ, 𝑇) = 3 𝑄 ( 𝑍 |π‘Œ, 𝑋 ) 𝑄 ( 𝑋|𝑇 )𝑒𝑋

  • The energy includes "hidden" variables $W$ whose value is never given to us.
  • Learning: $P(W|S)$ is the result of a learning procedure that assigns a probability (or an energy) to each possible value of $W$ as a function of the training set. The learning procedure will assign high probabilities to values of $W$ that assign high combined probability (low combined energy) to the observed data.


SLIDE 6

Likelihood of Observations

$$P(W|S) = P(W|\mathcal{Y}, \mathcal{X}) = \frac{P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})}{P(\mathcal{Y}|\mathcal{X})}$$
where $\mathcal{X} = (x_1, x_2, \dots, x_P)$ and $\mathcal{Y} = (y_1, y_2, \dots, y_P)$, and the denominator is a normalization term:
$$P(\mathcal{Y}|\mathcal{X}) = \int P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})\, dW$$
Sample independence: We assume that samples are independent, so the conditional probability of the training set under the model is a product over samples:
$$P(y_1, \dots, y_P \mid x_1, \dots, x_P, W) = \prod_{i=1}^{P} P(y_i \mid x_i, W) = \prod_{i=1}^{P} f(W, y_i, x_i) = \exp\Big(-\beta \sum_{i=1}^{P} \Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]\Big)$$
where we have $Z(W, x_i, \beta) = \int \exp(-\beta E(W, y, x_i))\, dy$.
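As a quick sanity check of the last identity (the product of per-sample Gibbs factors rewritten as a single exponential), here is a small numerical sketch; the energy table and labels are made-up stand-ins for $E(W, y, x_i)$ and $y_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0

# Made-up energy table standing in for E(W, y, x_i): 5 samples, 4 candidate answers.
energies = rng.normal(size=(5, 4))
true_y = rng.integers(0, 4, size=5)       # observed answers y_i

# Left-hand side: product over samples of exp(-beta * E) / Z(W, x_i, beta).
Z = np.exp(-beta * energies).sum(axis=1)
product = np.prod(np.exp(-beta * energies[np.arange(5), true_y]) / Z)

# Right-hand side: exp(-beta * sum_i [E_i + (1/beta) log Z_i]).
exponent = -beta * (energies[np.arange(5), true_y] + np.log(Z) / beta).sum()

print(np.allclose(product, np.exp(exponent)))  # True
```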


SLIDE 7

Choosing a Regularizer

$$P(W|S) = P(W|\mathcal{Y}, \mathcal{X}) = \frac{P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})}{P(\mathcal{Y}|\mathcal{X})}$$
The term $P(W|\mathcal{X})$ is an arbitrary prior distribution over the values of $W$ that we can choose freely. We will often represent this prior as the normalized exponential of a penalty term or regularizer $H(W)$. The term $H(W)$ is used to embed our prior knowledge about which energy functions in our family are preferable to others in the absence of training data:
$$P(W) = \frac{1}{Z_H} \exp(-H(W))$$

Parameters that produce low values of the regularizer will be favored over parameters that produce large values: for "good" models (e.g. simple, smooth, well-behaved) the regularizer is small; for "bad" models it is large. A concrete choice is sketched below.
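As an illustration (not on the slide), a common choice is the quadratic regularizer $H(W) = \frac{\lambda}{2}\|W\|^2$, which makes the prior a zero-mean Gaussian:
$$P(W) = \frac{1}{Z_H}\exp\Big(-\frac{\lambda}{2}\|W\|^2\Big), \qquad Z_H = \int \exp\Big(-\frac{\lambda}{2}\|w\|^2\Big)\, dw = \Big(\frac{2\pi}{\lambda}\Big)^{d/2}$$
for $W \in \mathbb{R}^d$. A larger $\lambda$ expresses a stronger preference for small, "simple" weight vectors.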


SLIDE 8

Posterior of a Parameter

  • The probability of a particular parameter value $W$ given the observations $S$ is
$$P(W|S) = \frac{\exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)\Big\}\Big)}{Z_W(S, \beta)}$$
  • $E(W, Y, X)$ can be a linear combination of basis functions. The advantage of the energy-based approach is that it puts very little restriction on the nature of $\mathcal{E} = \{E(W, Y, X) : W \in \mathcal{W}\}$.
  • $H(W)$ is the regularizer that contains our preferences for "good" models over "bad" ones. Our choice of $H(W)$ is somewhat arbitrary, but some choices work better than others for particular applications.


SLIDE 9

Posterior of a Parameter

𝑄 𝑋 𝑇 = exp[βˆ’π›Ύ{βˆ‘%WR

U [𝐹 𝑋, 𝑧%, 𝑦% + 1

𝛾 log π‘Ž@ 𝑋, 𝑦%, 𝛾 ] + 𝐼 𝑋 }] π‘ŽF(𝑇, 𝛾) π‘ŽF(𝑇, 𝛾) is the normalization term that ensures that the integral of P 𝑋 𝑇

  • ver 𝑋 is 1.

π‘ŽF(𝑇, 𝛾) is the integral over 𝑋 of the numerator. π‘Ž@ 𝑋, 𝑦%, 𝛾 are the normalization terms (one for each sample) that ensure that the integral 𝑄 𝑍 𝑦%, 𝑋 over 𝑍 is 1: π‘Ž@ 𝑋, 𝑦%, 𝛾 = 3 exp(βˆ’π›ΎπΉ 𝑋, 𝑍, 𝑦% 𝑒𝑍

Ξ² is a positive constant that we are free to choose as we like or that we can estimate. It reflects the reliability of the

  • data. Low values should be used to get probability estimates with noisy data. Large values should be used to get

good discrimination. We can estimate Ξ² through learning too (we can fold it into E, as a component of W).
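To illustrate that last remark under an assumed toy energy (a linear energy $E(W, y, x) = -W_y \cdot x$, which is not specified on the slide), scaling $\beta$ is exactly equivalent to scaling $W$; this is the sense in which $\beta$ can be folded into the energy:

```python
import numpy as np

def gibbs(energies, beta):
    p = np.exp(-beta * (energies - energies.min()))   # shifted for stability
    return p / p.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))        # hypothetical weights, one row per answer y
x = rng.normal(size=4)

# With a linear energy E(W, y, x) = -W[y] @ x, we have beta * E(W, y, x)
# = E(beta * W, y, x): the inverse temperature can be absorbed into W.
print(np.allclose(gibbs(-W @ x, beta=5.0), gibbs(-(5.0 * W) @ x, beta=1.0)))  # True
```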


SLIDE 10

Intractability of Bayesian Learning

The Bayesian predictive distribution:
$$P(y^* \mid x^*, (x_1, y_1), \dots, (x_P, y_P)) = \int f(W, y^*, x^*)\, P(W \mid (x_1, y_1), \dots, (x_P, y_P))\, dW$$
  • To compute the distribution of $y^*$ for a particular input $x^*$, we are supposed to integrate the product of two complicated functions over all possible values of $W$.
  • This is totally intractable in general.
  • There are special classes of functions $f$ for which the integral is tractable, but that class is fairly restricted.


SLIDE 11

Tractable Learning Methods

1. Maximum A Posteriori Estimation: simply replace the distribution $P(W|S)$ by a Dirac delta function centered on its mode (maximum).
2. Maximum Likelihood Estimation: same as above, but drop the regularizer.
3. Restricted class of functions: simply restrict yourself to special forms of $f(W, Y, X)$ for which the integral can be computed analytically (e.g. Gaussians).
4. Sampling: draw a bunch of samples of $W$ from the distribution $P(W|S)$, and replace the integral by a sum over those samples (a sketch follows below).
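Here is a minimal sketch of option 4 under toy assumptions: a linear-energy Gibbs model, and parameter samples that we simply pretend were drawn from $P(W|S)$ (in practice they would come from, e.g., MCMC):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gibbs model: f(W, y, x) with linear energy E(W, y, x) = -W[y] @ x, beta = 1.
def f(W, y, x):
    p = np.exp(W @ x)              # exp(-E) for each candidate answer y
    return p[y] / p.sum()

# Pretend these were drawn from P(W|S), e.g. by MCMC (assumption of the sketch).
W_samples = [rng.normal(size=(3, 4)) for _ in range(100)]

x_star = rng.normal(size=4)
# Replace the integral over W with an average over the posterior samples.
p_star = np.mean([[f(W, y, x_star) for y in range(3)] for W in W_samples], axis=0)
print(p_star, p_star.sum())       # a distribution over the 3 candidate answers
```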


SLIDE 12

Maximum A Posteriori Estimation

  • Assume that the mode (maximum) of $P(W|S)$ is so much larger than all other values that we can view $P(W|S)$ as a Dirac delta function centered around its maximum:
$$P_{MAP}(W|S) \approx \delta(W - W_{MAP}), \qquad W_{MAP} = \arg\max_W P(W|S)$$
With this approximation, we get simply:
$$P(Y|X, S) = P(Y|X, W_{MAP})$$
If we take the limit $\beta \to \infty$, $P(W|S)$ does converge to a delta function around its maximum, so the MAP approximation is simply the large-$\beta$ limit.


SLIDE 13

Computing $W_{MAP}$

$$W_{MAP} = \arg\max_W P(W|S) = \arg\max_W \frac{1}{Z_W(S, \beta)} \exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)\Big\}\Big)$$
$$= \arg\max_W \exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)\Big\}\Big)$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big] + H(W)$$
We can drop $Z_W(S, \beta)$ because it does not depend on $W$, and we can take the log because log is monotonic. To find the MAP parameter estimate, we need to find the value of $W$ that minimizes:
$$L_{MAP}(W) = \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big] + H(W)$$
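A minimal sketch of minimizing $L_{MAP}$ under assumptions not fixed by the slide: a finite answer space, the linear energy $E(W, y, x) = -W_y \cdot x$, $\beta = 1$, the quadratic regularizer $H(W) = \frac{\lambda}{2}\|W\|^2$, and made-up data. Under these choices $L_{MAP}$ is exactly L2-regularized softmax cross-entropy:

```python
import numpy as np

rng = np.random.default_rng(0)
P, d, k, beta, lam = 200, 5, 3, 1.0, 0.1

# Made-up data standing in for the training set S (assumption for the sketch).
X = rng.normal(size=(P, d))
Y = rng.integers(0, k, size=P)

def L_MAP(W):
    """sum_i [E(W,y_i,x_i) + (1/beta) log sum_y exp(-beta E(W,y,x_i))] + H(W),
    with E(W, y, x) = -W[y] @ x and H(W) = (lam / 2) * ||W||^2."""
    scores = X @ W.T                                  # scores[i, y] = -E(W, y, x_i)
    log_Z = np.log(np.exp(beta * scores).sum(axis=1)) / beta
    data_term = (-scores[np.arange(P), Y] + log_Z).sum()
    return data_term + 0.5 * lam * (W ** 2).sum()

def num_grad(f, W, eps=1e-5):
    """Central-difference gradient; used here for clarity instead of autodiff."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        dW = np.zeros_like(W)
        dW[idx] = eps
        g[idx] = (f(W + dW) - f(W - dW)) / (2 * eps)
    return g

W = np.zeros((k, d))
for _ in range(100):
    W -= 0.01 * num_grad(L_MAP, W)                    # plain gradient descent
print(L_MAP(W))  # lower than the initial loss L_MAP(np.zeros((k, d)))
```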


SLIDE 14

Maximum Likelihood Estimation

  • Maximum Likelihood Estimation: We ignore $H(W)$. This is equivalent to finding the $W$ that maximizes $P(\mathcal{Y}|\mathcal{X}, W)$ (the likelihood of the data) instead of $P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})$ (the unnormalized posterior of the parameter).
  • We assume that the mode (maximum) of $P(W|S)$ is so much larger than all other values that we can view $P(W|S)$ as a Dirac delta function centered around its maximum, and we assume that the prior $P(W)$ has no influence on the result:
$$P(W|S) \approx \delta(W - W_{MLE}), \qquad W_{MLE} = \arg\max_W P(\mathcal{Y}|\mathcal{X}, W)$$
With this approximation, we get simply:
$$P(Y|X, S) = P(Y|X, W_{MLE})$$


SLIDE 15

Computing $W_{MLE}$

$$W_{MLE} = \arg\max_W \frac{1}{Z_W(S, \beta)} \exp\Big(-\beta\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]\Big)$$
$$= \arg\max_W \exp\Big(-\beta\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]\Big)$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big]$$
We need to find the value of $W$ that minimizes:
$$L_{MLE}(W) = \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big]$$
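Minimizing $L_{MLE}$ is the same as maximizing the likelihood, since $\log \prod_i P(y_i|x_i, W) = -\beta\, L_{MLE}(W)$. A small numerical check of this, with a made-up energy table standing in for $E(W, y, x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.5

# Made-up energies standing in for E(W, y, x_i): 6 samples, 4 candidate answers.
energies = rng.normal(size=(6, 4))
true_y = rng.integers(0, 4, size=6)

# L_MLE = sum_i [E(W, y_i, x_i) + (1/beta) log sum_y exp(-beta * E(W, y, x_i))]
Z = np.exp(-beta * energies).sum(axis=1)
L_MLE = (energies[np.arange(6), true_y] + np.log(Z) / beta).sum()

# Log-likelihood: log prod_i P(y_i | x_i, W), with P = exp(-beta * E) / Z.
log_lik = np.log(np.exp(-beta * energies[np.arange(6), true_y]) / Z).sum()

print(np.allclose(L_MLE, -log_lik / beta))  # True: minimizing L_MLE maximizes likelihood
```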


SLIDE 16

Negative Log Likelihood

  • All of the terms in $L_{MLE}(W)$ have analogs and interpretations in statistical physics and thermodynamics.
  • $\sum_i E(W, y_i, x_i)$ is analogous to the average energy of a thermodynamical system where each sample is analogous to a particle in an ideal gas.
  • $-\frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy$ is the Helmholtz free energy $\mathcal{F}_\beta(W, x_i)$ of a thermodynamical system, so the second term of $L_{MLE}$ contributes $-\sum_i \mathcal{F}_\beta(W, x_i)$.
  • $L_{MLE}(W)$ is analogous to the product of the entropy by the temperature.
  • The above is a form of the well-known thermodynamic equation (derived below):
$$\text{Temperature} \times \text{Entropy} = \text{Average Energy} - \text{Free Energy}$$
  • MAP/MLE estimation is like entropy minimization.
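A sketch of where that identity comes from, for the Gibbs distribution $p(y) = \exp(-\beta E(y))/Z$ over a single input (with temperature $T = 1/\beta$ and $\langle \cdot \rangle$ the average under $p$):
$$S = -\int p(y)\log p(y)\, dy = \int p(y)\big[\beta E(y) + \log Z\big]\, dy = \beta \langle E \rangle + \log Z$$
$$\Rightarrow\quad T S = \langle E \rangle + \frac{1}{\beta}\log Z = \langle E \rangle - \mathcal{F}_\beta, \qquad \mathcal{F}_\beta = -\frac{1}{\beta}\log Z$$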


SLIDE 17

Formulate Regression as an Energy-based Model


  • The energy function is the square error between the output of a regression function $G_W(x)$ and the variable to be predicted $y$:
$$E(W, y, x) = \frac{1}{2}\| G_W(x) - y \|^2$$
  • The negative log-likelihood (a Gaussian integral with a constant variance) becomes the energy loss:
$$L_{energy}(E, S) = \frac{1}{P}\sum_{i=1}^{P} E(W, y_i, x_i) = \frac{1}{2P}\sum_{i=1}^{P}\| G_W(x_i) - y_i \|^2$$
  • We take $G_W$ to be a linear function, $G_W(x) = W^\top \Phi(x)$, where $\Phi(x)$ is a set of $N$ features.
  • This loss will push down on the energy of the desired answer. The energy loss will not pull up on any other energy; it only works with architectures designed in such a way that pushing down on $E(W, y, x)$ automatically makes the energies of the other answers larger.
  • Training with the energy loss reduces to the least-squares minimization problem, which is convex:
$$W^* = \arg\min_W \Big[\frac{1}{2P}\sum_{i=1}^{P}\| W^\top \Phi(x_i) - y_i \|^2\Big]$$
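A minimal sketch of that convex problem, with assumed polynomial features $\Phi(x) = (1, x, x^2)$ and synthetic data; since the loss is least squares, the minimizer solves the normal equations, which `np.linalg.lstsq` computes directly:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 100

# Synthetic 1-D data with assumed features Phi(x) = (1, x, x^2).
x = rng.uniform(-1, 1, size=P)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=P)
Phi = np.stack([np.ones(P), x, x**2], axis=1)

# The energy loss is convex in W; the minimizer solves the normal equations
# (Phi^T Phi) W = Phi^T y, which lstsq solves as a least-squares problem.
W_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(W_star)  # close to the generating coefficients [1.0, -2.0, 0.5]
```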