CS480/680 Machine Learning, Winter 2020
Lecture 6: January 23, 2020
Maximum A Posteriori & Maximum Likelihood
Zahra Sheikhbahaee, University of Waterloo
Source: A Tutorial on Energy-Based Learning
Goal: Given a set of observations (the training set) $S = \{(y_i, z_i) : i = 1, \dots, n\}$, we want to produce a model for regression, classification, or decision making that predicts the best $Z$ from $Y$. That is, we want to estimate a function that computes the conditional distribution $P(Z \mid Y)$ for any given $Y$. We write this function as $P(Z \mid Y, S)$.

We split $P(Z \mid Y, S)$ into two parts:
$P(Z \mid Y, S) = \int P(Z \mid Y, W)\, P(W \mid S)\, dW$
Our estimate of $P(Z \mid Y, W)$ is chosen from a family of functions parameterized by a vector $W$. The internal structure of the parameterized function $P(Z \mid Y, W)$ is called the architecture (e.g. logistic regressors, neural networks, etc.).
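As a concrete illustration of this decomposition (my own sketch, not from the lecture), the integral over $W$ can be approximated by Monte Carlo, assuming we already have samples drawn from the posterior $P(W \mid S)$ and a function that evaluates $P(z \mid y, W)$; the names predictive and gaussian_likelihood below are hypothetical.

import numpy as np

def predictive(z, y, posterior_samples, likelihood):
    # Monte Carlo estimate of P(z | y, S) = integral of P(z | y, W) P(W | S) dW,
    # averaging the conditional likelihood over samples assumed to come from P(W | S).
    vals = np.array([likelihood(z, y, w) for w in posterior_samples])
    return vals.mean()

# Toy usage: a 1-D Gaussian conditional model P(z | y, w) = N(z; w*y, 1).
def gaussian_likelihood(z, y, w):
    return np.exp(-0.5 * (w * y - z) ** 2) / np.sqrt(2.0 * np.pi)

w_samples = np.random.normal(loc=1.0, scale=0.1, size=200)  # stand-in for posterior samples of W
print(predictive(2.0, 2.0, w_samples, gaussian_likelihood))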
We represent $P(Z \mid Y, W)$ as the normalized exponential of an energy function $E(W, Z, Y)$:
$P(Z \mid Y, W) = \dfrac{\exp(-\beta E(W, Z, Y))}{Z_E(W, Y, \beta)}$
$\beta$: an arbitrary positive constant (the inverse temperature).
$Z_E(W, Y, \beta)$: a normalization term (the partition function),
$Z_E(W, Y, \beta) = \int \exp(-\beta E(W, z, Y))\, dz$
$Z_E$ ensures that our estimate of $P(Z \mid Y, W)$ is normalized. High-probability states correspond to low-energy configurations.
Condition: we can only transform energies into probabilities if $\int e^{-\beta E(W, z, Y)}\, dz$ converges.
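For a finite set of candidate answers, this Gibbs transformation is just a softmax over negative energies. A minimal, numerically stable sketch (my own code; the energies below are made up):

import numpy as np
from scipy.special import logsumexp

def gibbs_probabilities(energies, beta=1.0):
    # P(z | y, W) = exp(-beta * E(W, z, y)) / Z_E(W, Y, beta), computed in log space.
    log_num = -beta * np.asarray(energies, dtype=float)
    log_partition = logsumexp(log_num)       # log Z_E over the finite answer set
    return np.exp(log_num - log_partition)

# Low energy -> high probability; a larger beta (lower temperature) sharpens the distribution.
print(gibbs_probabilities([0.1, 1.0, 3.0], beta=1.0))
print(gibbs_probabilities([0.1, 1.0, 3.0], beta=5.0))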
$P(Z \mid Y, S) = \int P(Z \mid Y, W)\, P(W \mid S)\, dW$
The distribution $P(W \mid S)$ is never given to us.
Learning: $P(W \mid S)$ is the result of a learning procedure that assigns a probability (or an energy) to each possible value of $W$ as a function of the training set $S$. The learning procedure will assign high probabilities to values of $W$ that assign high combined probability (low combined energy) to the observed data.
$P(W \mid S) = P(W \mid B, A) = \dfrac{P(B \mid A, W)\, P(W \mid A)}{P(B \mid A)}$
where $A = (y_1, y_2, \dots, y_n)$, $B = (z_1, z_2, \dots, z_n)$, and the denominator is a normalization term
$P(B \mid A) = \int P(B \mid A, w)\, P(w \mid A)\, dw$
Sample independence: we assume that the samples are independent, so the conditional probability of the training set under the model is a product over samples:
$P(z_1, \dots, z_n \mid y_1, \dots, y_n, W) = \prod_{i=1}^{n} P(z_i \mid y_i, W) = \exp\Big(-\beta \sum_{i=1}^{n} \big[E(W, z_i, y_i) + \tfrac{1}{\beta} \log Z_E(W, y_i, \beta)\big]\Big)$
where $Z_E(W, y_i, \beta) = \int \exp(-\beta E(W, z, y_i))\, dz$.
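Each bracketed term $E(W, z_i, y_i) + \frac{1}{\beta}\log Z_E(W, y_i, \beta)$ is, up to the factor $\beta$, the negative log-likelihood of one sample. A rough Python sketch for a discrete answer space (the energy function and data below are my own toy choices):

import numpy as np
from scipy.special import logsumexp

def per_sample_nll(w, z_i, y_i, z_candidates, energy, beta=1.0):
    # E(W, z_i, y_i) + (1/beta) * log Z_E(W, y_i, beta), with the partition function
    # approximated by a sum over a finite set of candidate answers z.
    log_partition = logsumexp([-beta * energy(w, z, y_i) for z in z_candidates])
    return energy(w, z_i, y_i) + log_partition / beta

def training_set_nll(w, data, z_candidates, energy, beta=1.0):
    # Sample independence: the joint negative log-likelihood is a sum over samples.
    return sum(per_sample_nll(w, z, y, z_candidates, energy, beta) for y, z in data)

# Toy usage with a quadratic energy E(w, z, y) = 0.5 * (w*y - z)^2.
energy = lambda w, z, y: 0.5 * (w * y - z) ** 2
print(training_set_nll(1.0, [(1.0, 1.0), (2.0, 2.5)],
                       z_candidates=np.linspace(-3, 3, 61), energy=energy))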
$P(W \mid S) = P(W \mid B, A) = \dfrac{P(B \mid A, W)\, P(W \mid A)}{P(B \mid A)}$
The term $P(W \mid A)$ is an arbitrary prior distribution over the values of $W$ that we can choose freely. We will often represent this prior as the normalized exponential of a penalty term, or regularizer, $H(W)$. The term $H(W)$ is used to embed our prior knowledge about which energy functions in our family are preferable to others in the absence of training data:
$P(W) = \dfrac{1}{Z_H}\, e^{-H(W)}$
Parameters that produce low values of the regularizer will be favored over parameters that produce large values: for "good" models (e.g. simple, smooth, well-behaved) the regularizer is small; for "bad" models it is large.
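A familiar special case (my own example, not on the slide): a quadratic regularizer corresponds to a zero-mean Gaussian prior on $W$,
$H(W) = \frac{\lambda}{2}\,\|W\|^2 \quad\Longrightarrow\quad P(W) = \frac{1}{Z_H}\, e^{-\frac{\lambda}{2}\|W\|^2}$
i.e. a Gaussian with covariance $\lambda^{-1} I$, so parameter vectors with small norm ("simple" models) receive high prior probability.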
Combining the likelihood and the prior, the posterior over the parameters is
$P(W \mid S) = \dfrac{\exp\Big(-\beta\Big\{\sum_{i=1}^{n}\big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log Z_E(W, y_i, \beta)\big] + H(W)\Big\}\Big)}{Z_W(S, \beta)}$
$Z_W(S, \beta)$ is the normalization term that ensures that the integral of $P(W \mid S)$ over $W$ is 1; it is the integral over $W$ of the numerator. The $Z_E(W, y_i, \beta)$ are the normalization terms (one for each sample) that ensure that the integral of $P(z \mid y_i, W)$ over $z$ is 1:
$Z_E(W, y_i, \beta) = \int \exp(-\beta E(W, z, y_i))\, dz$
$\beta$ is a positive constant that we are free to choose as we like, or that we can estimate. It reflects how reliably the energies discriminate between good and bad answers (a larger $\beta$ gives sharper discrimination). We can also estimate $\beta$ through learning (we can fold it into $E$, as a component of $W$).
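For example (a one-line restatement of the last remark), folding $\beta$ into the energy amounts to defining
$E'(W', z, y) = \beta\, E(W, z, y), \quad W' = (W, \beta), \quad\text{so that}\quad \exp(-\beta E(W, z, y)) = \exp(-E'(W', z, y))$
after which one can work with $\beta = 1$ and let the learning procedure adjust the scale of the energy.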
Maximum a posteriori (MAP) estimate:
$W_{\mathrm{MAP}} = \arg\max_{W} P(W \mid S)$
$W_{\mathrm{MAP}} = \arg\max_{W} P(W \mid S)$
$= \arg\max_{W} \dfrac{1}{Z_W(S, \beta)} \exp\Big(-\beta\Big\{\sum_{i=1}^{n}\big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log Z_E(W, y_i, \beta)\big] + H(W)\Big\}\Big)$
$= \arg\max_{W} \exp\Big(-\beta\Big\{\sum_{i=1}^{n}\big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log Z_E(W, y_i, \beta)\big] + H(W)\Big\}\Big)$ (the factor $1/Z_W(S, \beta)$ does not depend on $W$)
$= \arg\min_{W}\ \sum_{i=1}^{n}\big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log Z_E(W, y_i, \beta)\big] + H(W)$ (since $\exp(-\beta\,\cdot)$ is monotonically decreasing)
$= \arg\min_{W}\ \sum_{i=1}^{n}\Big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log \int \exp(-\beta E(W, z, y_i))\, dz\Big] + H(W)$
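Putting the pieces together, here is a sketch of minimizing this MAP objective numerically under my own toy assumptions (quadratic energy with $\Phi(y) = y$, a finite grid of candidate answers, and an L2 regularizer); it is not code from the lecture:

import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize

def map_objective(w, data, z_grid, beta=1.0, lam=0.1):
    # Sum over samples of E(W, z_i, y_i) + (1/beta) * log Z_E(W, y_i, beta), plus H(W).
    energy = lambda w, z, y: 0.5 * (np.dot(w, y) - z) ** 2   # toy quadratic energy
    total = 0.0
    for y, z in data:
        log_partition = logsumexp([-beta * energy(w, zc, y) for zc in z_grid])
        total += energy(w, z, y) + log_partition / beta
    return total + 0.5 * lam * np.dot(w, w)                  # H(W) = 0.5 * lam * ||w||^2

data = [(np.array([1.0, 0.5]), 1.2), (np.array([0.3, 2.0]), 0.7)]   # made-up (y_i, z_i) pairs
z_grid = np.linspace(-3.0, 3.0, 61)
result = minimize(map_objective, x0=np.zeros(2), args=(data, z_grid))
print("approximate W_MAP:", result.x)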
Maximum likelihood (ML) estimate:
$W_{\mathrm{ML}} = \arg\max_{W} P(B \mid A, W)$
$W_{\mathrm{ML}} = \arg\max_{W} P(B \mid A, W)$
$= \arg\max_{W} \exp\Big(-\beta \sum_{i=1}^{n}\big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log Z_E(W, y_i, \beta)\big]\Big)$
$= \arg\min_{W}\ \sum_{i=1}^{n}\big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log Z_E(W, y_i, \beta)\big]$
$= \arg\min_{W}\ \sum_{i=1}^{n}\Big[E(W, z_i, y_i) + \tfrac{1}{\beta}\log \int \exp(-\beta E(W, z, y_i))\, dz\Big]$
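Equivalently (a restatement, not an additional result), $W_{\mathrm{ML}}$ minimizes the average negative log-likelihood loss, and the MAP estimate differs from it only by the regularizer:
$W_{\mathrm{ML}} = \arg\min_{W} \mathcal{L}_{\mathrm{nll}}(W, S), \qquad \mathcal{L}_{\mathrm{nll}}(W, S) = \frac{1}{n}\sum_{i=1}^{n}\Big[E(W, z_i, y_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, z, y_i))\, dz\Big]$
since rescaling the objective by $1/n$ does not change the minimizer; adding $H(W)$ to this objective gives $W_{\mathrm{MAP}}$.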
There is an analogy with physics and thermodynamics: we can view the learning problem as a system in which each sample is analogous to a particle in an ideal gas. The term $\frac{1}{\beta}\log \int \exp(-\beta E(W, z, y_i))\, dz$ is analogous (up to sign) to the Helmholtz free energy, and the usual thermodynamic identity holds:
Temperature $\times$ Entropy $=$ Average Energy $-$ Free Energy
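For reference (standard statistical-mechanics conventions, added here for clarity), the quantities in this identity, written for the per-sample Gibbs distribution $P(z \mid y_i, W)$, are
$F_\beta = -\frac{1}{\beta}\log\int \exp(-\beta E(W, z, y_i))\, dz, \qquad U = \langle E \rangle = \int E(W, z, y_i)\, P(z \mid y_i, W)\, dz, \qquad T\, S_{\mathrm{ent}} = U - F_\beta \;\; (T = 1/\beta)$
so the $\frac{1}{\beta}\log Z_E$ term appearing in the loss is the negative of the conventional free energy $F_\beta$.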
Example: regression with a quadratic energy. Let $z$ be the variable to be predicted, and take
$E(W, z, y) = \frac{1}{2}\,\|G_W(y) - z\|^2$
The negative log-likelihood then reduces to the energy loss, because the log partition function is a Gaussian integral with constant variance (it does not depend on $W$):
$\mathcal{L}_{\mathrm{energy}}(E, S) = \frac{1}{n}\sum_{i=1}^{n} E(W, z_i, y_i) = \frac{1}{2n}\sum_{i=1}^{n} \|G_W(y_i) - z_i\|^2$
We take $G$ to be a linear function, $G_W(y) = W^{\top}\Phi(y)$, where $\Phi(y)$ is a set of features. The energy loss only pushes down on the energy of the desired answer; it does not pull up on any other energy. It therefore only works with architectures designed so that pushing down on $E(W, z, y)$ automatically makes the energies of the other answers larger. Training with the energy loss reduces to a least-squares minimization problem, which is convex:
$W^{*} = \arg\min_{W}\Big[\frac{1}{2n}\sum_{i=1}^{n}\|W^{\top}\Phi(y_i) - z_i\|^2\Big]$
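A minimal numerical check of this convex least-squares problem (my own toy feature map and data, not from the lecture): with $G_W(y) = W^{\top}\Phi(y)$, the minimizer is given by the usual normal equations, here via np.linalg.lstsq.

import numpy as np

phi = lambda y: np.array([1.0, y, y ** 2])                 # hypothetical feature map Phi(y)
ys = np.array([0.0, 0.5, 1.0, 1.5, 2.0])                   # made-up inputs y_i
zs = np.array([0.1, 0.6, 1.8, 3.2, 5.1])                   # made-up targets z_i

Phi = np.stack([phi(y) for y in ys])                       # design matrix, one row per sample
W_star, *_ = np.linalg.lstsq(Phi, zs, rcond=None)          # minimizes sum_i ||W^T Phi(y_i) - z_i||^2
print("W* =", W_star)
print("training loss =", 0.5 * np.mean((Phi @ W_star - zs) ** 2))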