SLIDE 1


CS480/680 Machine Learning Lecture 6: January 23rd, 2020

Maximum A Posteriori & Maximum Likelihood

Zahra Sheikhbahaee

Source: A Tutorial on Energy-Based Learning (LeCun et al.)

University of Waterloo

SLIDE 2

Outline

  • Probabilistic Modeling
  • Gibbs Distribution
  • Maximum A Posteriori Estimation
  • Maximum Likelihood Estimation


SLIDE 3

Probabilistic Modeling

Goal: Given a set of observations $S = \{(x_i, y_i) : i = 1, \dots, P\}$ (the training set), we want to produce a model for regression, classification, or decision making that predicts the best $Y$ from $X$. That is, we want to estimate a function that computes the conditional distribution $P(Y|X)$ for any given $X$. We write this function as $P(Y|X, S)$.

  • Design architecture: We decompose $P(Y|X, S)$ into two parts:
$$P(Y|X, S) = \int P(Y|X, W)\, P(W|S)\, dW$$
Our estimate of $P(Y|X, W)$ will be among a family of functions $f(W, Y, X)$, one for each value of the parameter vector $W$. The internal structure of the parameterized function $f(W, Y, X)$ is called the architecture (e.g. logistic regressors, neural networks, etc.).


SLIDE 4

Gibbs Distribution

  • Energy function: for convenience we will often define $f(W, Y, X)$ as the normalized exponential of an energy function $E(W, Y, X)$:
$$P(Y|X, W) \approx f(W, Y, X) = \frac{\exp(-\beta E(W, Y, X))}{Z(W, X, \beta)}$$
$\beta$: an arbitrary positive constant (the inverse temperature).
$Z(W, X, \beta)$: a normalization term (the partition function),
$$Z(W, X, \beta) = \int \exp(-\beta E(W, y, X))\, dy$$
$Z$ ensures that our estimate of $P(Y|X, W)$ is normalized. High-probability states correspond to low-energy configurations.
Condition: we can only transform energies into probabilities if $\int \exp(-\beta E(W, y, X))\, dy$ converges.
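To make the mapping from energies to probabilities concrete, here is a minimal sketch for a finite answer space, where the partition integral becomes a sum (the energy values and $\beta$ below are arbitrary illustrations, not from the slides):

```python
import numpy as np

def gibbs_distribution(energies, beta=1.0):
    """Turn energies E(W, y, x), one per candidate answer y, into P(y | x, W)."""
    # Shift by the minimum energy for numerical stability; the shift cancels
    # because the partition function is scaled by the same factor.
    e = beta * (np.asarray(energies) - np.min(energies))
    unnormalized = np.exp(-e)
    return unnormalized / unnormalized.sum()  # divide by the partition function Z

# Low energy -> high probability; a larger beta sharpens the distribution.
print(gibbs_distribution([0.5, 2.0, 3.0], beta=1.0))   # spread out
print(gibbs_distribution([0.5, 2.0, 3.0], beta=10.0))  # nearly all mass on y = 0
```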


SLIDE 5

Probabilistic Modeling

𝑄 ( 𝑍 |π‘Œ, 𝑇) = 3 𝑄 ( 𝑍 |π‘Œ, 𝑋 ) 𝑄 ( 𝑋|𝑇 )𝑒𝑋

  • The energy includes "hidden" variables $W$ whose value is never given to us.
  • Learning: $P(W|S)$ is the result of a learning procedure that assigns a probability (or an energy) to each possible value of $W$ as a function of the training set. The learning procedure will assign high probabilities to values of $W$ that assign high combined probability (low combined energy) to the observed data.


SLIDE 6

Likelihood of Observations

$$P(W|S) = P(W|\mathcal{Y}, \mathcal{X}) = \frac{P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})}{P(\mathcal{Y}|\mathcal{X})}$$
where $\mathcal{X} = (x_1, x_2, \dots, x_P)$ and $\mathcal{Y} = (y_1, y_2, \dots, y_P)$, and the denominator is a normalization term:
$$P(\mathcal{Y}|\mathcal{X}) = \int P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})\, dW$$
Sample independence: We assume that samples are independent, so the conditional probability of the training set under the model is a product over samples:
$$P(y_1, \dots, y_P \mid x_1, \dots, x_P, W) = \prod_{i=1}^{P} P(y_i \mid x_i, W) = \prod_{i=1}^{P} f(W, y_i, x_i) = \exp\Big(-\beta \sum_{i=1}^{P} \Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]\Big)$$
where we have $Z(W, x_i, \beta) = \int \exp(-\beta E(W, y, x_i))\, dy$.
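As a quick sanity check of the last identity (the product of per-sample Gibbs factors rewritten as a single exponential), here is a small numerical sketch; the energy table and labels are made-up stand-ins for $E(W, y, x_i)$ and $y_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0

# Made-up energy table standing in for E(W, y, x_i): 5 samples, 4 candidate answers.
energies = rng.normal(size=(5, 4))
true_y = rng.integers(0, 4, size=5)       # observed answers y_i

# Left-hand side: product over samples of exp(-beta * E) / Z(W, x_i, beta).
Z = np.exp(-beta * energies).sum(axis=1)
product = np.prod(np.exp(-beta * energies[np.arange(5), true_y]) / Z)

# Right-hand side: exp(-beta * sum_i [E_i + (1/beta) log Z_i]).
exponent = -beta * (energies[np.arange(5), true_y] + np.log(Z) / beta).sum()

print(np.allclose(product, np.exp(exponent)))  # True
```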


SLIDE 7

Choosing a Regularizer

$$P(W|S) = P(W|\mathcal{Y}, \mathcal{X}) = \frac{P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})}{P(\mathcal{Y}|\mathcal{X})}$$
The term $P(W|\mathcal{X})$ is an arbitrary prior distribution over the values of $W$ that we can choose freely. We will often represent this prior as the normalized exponential of a penalty term or regularizer $H(W)$. The term $H(W)$ is used to embed our prior knowledge about which energy functions in our family are preferable to others in the absence of training data:
$$P(W) = \frac{1}{Z_H} \exp(-H(W))$$

Parameters that produce low values of the regularizer will be favored over parameters that produce large values: for "good" models (e.g. simple, smooth, well-behaved) the regularizer is small; for "bad" models it is large. A concrete choice is sketched below.
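As an illustration (not on the slide), a common choice is the quadratic regularizer $H(W) = \frac{\lambda}{2}\|W\|^2$, which makes the prior a zero-mean Gaussian:
$$P(W) = \frac{1}{Z_H}\exp\Big(-\frac{\lambda}{2}\|W\|^2\Big), \qquad Z_H = \int \exp\Big(-\frac{\lambda}{2}\|w\|^2\Big)\, dw = \Big(\frac{2\pi}{\lambda}\Big)^{d/2}$$
for $W \in \mathbb{R}^d$. A larger $\lambda$ expresses a stronger preference for small, "simple" weight vectors.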


SLIDE 8

Posterior of a Parameter

  • The probability of a particular parameter value $W$ given the observations $S$ is
$$P(W|S) = \frac{\exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)\Big\}\Big)}{Z_W(S, \beta)}$$
  • $E(W, Y, X)$ can be a linear combination of basis functions. The advantage of the energy-based approach is that it puts very little restriction on the nature of $\mathcal{E} = \{E(W, Y, X) : W \in \mathcal{W}\}$.
  • $H(W)$ is the regularizer that contains our preferences for "good" models over "bad" ones. Our choice of $H(W)$ is somewhat arbitrary, but some choices work better than others for particular applications.


SLIDE 9

Posterior of a Parameter

𝑄 𝑋 𝑇 = exp[βˆ’π›Ύ{βˆ‘%WR

U [𝐹 𝑋, 𝑧%, 𝑦% + 1

𝛾 log π‘Ž@ 𝑋, 𝑦%, 𝛾 ] + 𝐼 𝑋 }] π‘ŽF(𝑇, 𝛾) π‘ŽF(𝑇, 𝛾) is the normalization term that ensures that the integral of P 𝑋 𝑇

  • ver 𝑋 is 1.

π‘ŽF(𝑇, 𝛾) is the integral over 𝑋 of the numerator. π‘Ž@ 𝑋, 𝑦%, 𝛾 are the normalization terms (one for each sample) that ensure that the integral 𝑄 𝑍 𝑦%, 𝑋 over 𝑍 is 1: π‘Ž@ 𝑋, 𝑦%, 𝛾 = 3 exp(βˆ’π›ΎπΉ 𝑋, 𝑍, 𝑦% 𝑒𝑍

Ξ² is a positive constant that we are free to choose as we like or that we can estimate. It reflects the reliability of the

  • data. Low values should be used to get probability estimates with noisy data. Large values should be used to get

good discrimination. We can estimate Ξ² through learning too (we can fold it into E, as a component of W).
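To illustrate that last remark under an assumed toy energy (a linear energy $E(W, y, x) = -W_y \cdot x$, which is not specified on the slide), scaling $\beta$ is exactly equivalent to scaling $W$; this is the sense in which $\beta$ can be folded into the energy:

```python
import numpy as np

def gibbs(energies, beta):
    p = np.exp(-beta * (energies - energies.min()))   # shifted for stability
    return p / p.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))        # hypothetical weights, one row per answer y
x = rng.normal(size=4)

# With a linear energy E(W, y, x) = -W[y] @ x, we have beta * E(W, y, x)
# = E(beta * W, y, x): the inverse temperature can be absorbed into W.
print(np.allclose(gibbs(-W @ x, beta=5.0), gibbs(-(5.0 * W) @ x, beta=1.0)))  # True
```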


SLIDE 10

Intractability of Bayesian Learning

The Bayesian predictive distribution:
$$P(y^* \mid x^*, (x_1, y_1), \dots, (x_P, y_P)) = \int f(W, y^*, x^*)\, P(W \mid (x_1, y_1), \dots, (x_P, y_P))\, dW$$
  • To compute the distribution of $y^*$ for a particular input $x^*$, we are supposed to integrate the product of two complicated functions over all possible values of $W$.
  • This is totally intractable in general.
  • There are special classes of functions $f$ for which the integral is tractable, but that class is fairly restricted.


SLIDE 11

Tractable Learning Methods

1. Maximum A Posteriori Estimation: simply replace the distribution $P(W|S)$ by a Dirac delta function centered on its mode (maximum).
2. Maximum Likelihood Estimation: same as above, but drop the regularizer.
3. Restricted class of functions: simply restrict yourself to special forms of $f(W, Y, X)$ for which the integral can be computed analytically (e.g. Gaussians).
4. Sampling: draw a bunch of samples of $W$ from the distribution $P(W|S)$, and replace the integral by a sum over those samples (a sketch follows below).
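Here is a minimal sketch of option 4 under toy assumptions: a linear-energy Gibbs model, and parameter samples that we simply pretend were drawn from $P(W|S)$ (in practice they would come from, e.g., MCMC):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gibbs model: f(W, y, x) with linear energy E(W, y, x) = -W[y] @ x, beta = 1.
def f(W, y, x):
    p = np.exp(W @ x)              # exp(-E) for each candidate answer y
    return p[y] / p.sum()

# Pretend these were drawn from P(W|S), e.g. by MCMC (assumption of the sketch).
W_samples = [rng.normal(size=(3, 4)) for _ in range(100)]

x_star = rng.normal(size=4)
# Replace the integral over W with an average over the posterior samples.
p_star = np.mean([[f(W, y, x_star) for y in range(3)] for W in W_samples], axis=0)
print(p_star, p_star.sum())       # a distribution over the 3 candidate answers
```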


SLIDE 12

Maximum A Posteriori Estimation

  • Assume that the mode (maximum) of $P(W|S)$ is so much larger than all other values that we can view $P(W|S)$ as a Dirac delta function centered around its maximum:
$$P_{MAP}(W|S) \approx \delta(W - W_{MAP}), \qquad W_{MAP} = \arg\max_W P(W|S)$$
With this approximation, we get simply:
$$P(Y|X, S) = P(Y|X, W_{MAP})$$
If we take the limit $\beta \to \infty$, $P(W|S)$ does converge to a delta function around its maximum, so the MAP approximation is simply the large-$\beta$ limit.


SLIDE 13

Computing $W_{MAP}$

$$W_{MAP} = \arg\max_W P(W|S) = \arg\max_W \frac{1}{Z_W(S, \beta)} \exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)\Big\}\Big)$$
$$= \arg\max_W \exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)\Big\}\Big)$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big] + H(W)$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big] + H(W)$$
We can drop $Z_W(S, \beta)$ because it does not depend on $W$, and we can take the log because log is monotonic. To find the MAP parameter estimate, we need to find the value of $W$ that minimizes:
$$L_{MAP}(W) = \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big] + H(W)$$
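A minimal sketch of minimizing $L_{MAP}$ under assumptions not fixed by the slide: a finite answer space, the linear energy $E(W, y, x) = -W_y \cdot x$, $\beta = 1$, the quadratic regularizer $H(W) = \frac{\lambda}{2}\|W\|^2$, and made-up data. Under these choices $L_{MAP}$ is exactly L2-regularized softmax cross-entropy:

```python
import numpy as np

rng = np.random.default_rng(0)
P, d, k, beta, lam = 200, 5, 3, 1.0, 0.1

# Made-up data standing in for the training set S (assumption for the sketch).
X = rng.normal(size=(P, d))
Y = rng.integers(0, k, size=P)

def L_MAP(W):
    """sum_i [E(W,y_i,x_i) + (1/beta) log sum_y exp(-beta E(W,y,x_i))] + H(W),
    with E(W, y, x) = -W[y] @ x and H(W) = (lam / 2) * ||W||^2."""
    scores = X @ W.T                                  # scores[i, y] = -E(W, y, x_i)
    log_Z = np.log(np.exp(beta * scores).sum(axis=1)) / beta
    data_term = (-scores[np.arange(P), Y] + log_Z).sum()
    return data_term + 0.5 * lam * (W ** 2).sum()

def num_grad(f, W, eps=1e-5):
    """Central-difference gradient; used here for clarity instead of autodiff."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        dW = np.zeros_like(W)
        dW[idx] = eps
        g[idx] = (f(W + dW) - f(W - dW)) / (2 * eps)
    return g

W = np.zeros((k, d))
for _ in range(100):
    W -= 0.01 * num_grad(L_MAP, W)                    # plain gradient descent
print(L_MAP(W))  # lower than the initial loss L_MAP(np.zeros((k, d)))
```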


SLIDE 14

Maximum Likelihood Estimation

  • Maximum Likelihood Estimation: We ignore $H(W)$. This is equivalent to finding the $W$ that maximizes $P(\mathcal{Y}|\mathcal{X}, W)$ (the likelihood of the data) instead of $P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})$ (the unnormalized posterior of the parameter).
  • We assume that the mode (maximum) of $P(W|S)$ is so much larger than all other values that we can view $P(W|S)$ as a Dirac delta function centered around its maximum, and we assume that the prior $P(W)$ has no influence on the result:
$$P(W|S) \approx \delta(W - W_{MLE}), \qquad W_{MLE} = \arg\max_W P(\mathcal{Y}|\mathcal{X}, W)$$
With this approximation, we get simply:
$$P(Y|X, S) = P(Y|X, W_{MLE})$$


SLIDE 15

Computing $W_{MLE}$

$$W_{MLE} = \arg\max_W \frac{1}{Z_W(S, \beta)} \exp\Big(-\beta\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]\Big)$$
$$= \arg\max_W \exp\Big(-\beta\sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]\Big)$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log Z(W, x_i, \beta)\Big]$$
$$= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big]$$
We need to find the value of $W$ that minimizes:
$$L_{MLE}(W) = \sum_{i=1}^{P}\Big[E(W, y_i, x_i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy\Big]$$
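Minimizing $L_{MLE}$ is the same as maximizing the likelihood, since $\log \prod_i P(y_i|x_i, W) = -\beta\, L_{MLE}(W)$. A small numerical check of this, with a made-up energy table standing in for $E(W, y, x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.5

# Made-up energies standing in for E(W, y, x_i): 6 samples, 4 candidate answers.
energies = rng.normal(size=(6, 4))
true_y = rng.integers(0, 4, size=6)

# L_MLE = sum_i [E(W, y_i, x_i) + (1/beta) log sum_y exp(-beta * E(W, y, x_i))]
Z = np.exp(-beta * energies).sum(axis=1)
L_MLE = (energies[np.arange(6), true_y] + np.log(Z) / beta).sum()

# Log-likelihood: log prod_i P(y_i | x_i, W), with P = exp(-beta * E) / Z.
log_lik = np.log(np.exp(-beta * energies[np.arange(6), true_y]) / Z).sum()

print(np.allclose(L_MLE, -log_lik / beta))  # True: minimizing L_MLE maximizes likelihood
```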


SLIDE 16

Negative Log Likelihood

  • All of the terms in $L_{MLE}(W)$ have analogs and interpretations in statistical physics and thermodynamics.
  • $\sum_i E(W, y_i, x_i)$ is analogous to the average energy of a thermodynamical system where each sample is analogous to a particle in an ideal gas.
  • $-\frac{1}{\beta}\log \int \exp(-\beta E(W, y, x_i))\, dy$ is the Helmholtz free energy $\mathcal{F}_\beta(W, x_i)$ of a thermodynamical system, so the second term of $L_{MLE}$ contributes $-\sum_i \mathcal{F}_\beta(W, x_i)$.
  • $L_{MLE}(W)$ is analogous to the product of the entropy by the temperature.
  • The above is a form of the well-known thermodynamic equation (derived below):
$$\text{Temperature} \times \text{Entropy} = \text{Average Energy} - \text{Free Energy}$$
  • MAP/MLE estimation is like entropy minimization.
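A sketch of where that identity comes from, for the Gibbs distribution $p(y) = \exp(-\beta E(y))/Z$ over a single input (with temperature $T = 1/\beta$ and $\langle \cdot \rangle$ the average under $p$):
$$S = -\int p(y)\log p(y)\, dy = \int p(y)\big[\beta E(y) + \log Z\big]\, dy = \beta \langle E \rangle + \log Z$$
$$\Rightarrow\quad T S = \langle E \rangle + \frac{1}{\beta}\log Z = \langle E \rangle - \mathcal{F}_\beta, \qquad \mathcal{F}_\beta = -\frac{1}{\beta}\log Z$$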


SLIDE 17

Formulate Regression as an Energy-based Model


  • The energy function is the square error between the output of a regression function $G_W(x)$ and the variable to be predicted $y$:
$$E(W, y, x) = \frac{1}{2}\| G_W(x) - y \|^2$$
  • The negative log-likelihood (a Gaussian integral with a constant variance) becomes the energy loss:
$$L_{energy}(E, S) = \frac{1}{P}\sum_{i=1}^{P} E(W, y_i, x_i) = \frac{1}{2P}\sum_{i=1}^{P}\| G_W(x_i) - y_i \|^2$$
  • We take $G_W$ to be a linear function, $G_W(x) = W^\top \Phi(x)$, where $\Phi(x)$ is a set of $N$ features.
  • This loss will push down on the energy of the desired answer. The energy loss will not pull up on any other energy; it only works with architectures designed in such a way that pushing down on $E(W, y, x)$ automatically makes the energies of the other answers larger.
  • Training with the energy loss reduces to the least-squares minimization problem, which is convex:
$$W^* = \arg\min_W \Big[\frac{1}{2P}\sum_{i=1}^{P}\| W^\top \Phi(x_i) - y_i \|^2\Big]$$
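A minimal sketch of that convex problem, with assumed polynomial features $\Phi(x) = (1, x, x^2)$ and synthetic data; since the loss is least squares, the minimizer solves the normal equations, which `np.linalg.lstsq` computes directly:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 100

# Synthetic 1-D data with assumed features Phi(x) = (1, x, x^2).
x = rng.uniform(-1, 1, size=P)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=P)
Phi = np.stack([np.ones(P), x, x**2], axis=1)

# The energy loss is convex in W; the minimizer solves the normal equations
# (Phi^T Phi) W = Phi^T y, which lstsq solves as a least-squares problem.
W_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(W_star)  # close to the generating coefficients [1.0, -2.0, 0.5]
```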