

1. Machine Learning - MT 2016, 3. Maximum Likelihood. Varun Kanade, University of Oxford. October 17, 2016.

2. Outline
Probabilistic Perspective of Machine Learning
◮ Probabilistic Formulation of the Linear Model
◮ Maximum Likelihood Estimate
◮ Relation to the Least Squares Estimate

3. Outline
◮ Probability Review
◮ Linear Regression and Maximum Likelihood
◮ Information, Entropy, KL Divergence

4. Univariate Gaussian (Normal) Distribution
The univariate normal distribution X ∼ N(µ, σ²) is defined by the following density function:
p(x) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))
Here µ is the mean and σ² is the variance. The density satisfies
∫_{−∞}^{∞} p(x) dx = 1,  ∫_{−∞}^{∞} x p(x) dx = µ,  ∫_{−∞}^{∞} (x − µ)² p(x) dx = σ²
[Figure: the density curve, showing the mean µ, the standard deviation σ, and Pr(a ≤ x ≤ b) as the area under the curve between a and b.]
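These three identities can be checked numerically by evaluating the density on a fine grid and integrating. A minimal Python/NumPy sketch (the values of µ and σ and the grid are arbitrary choices for illustration):

```python
import numpy as np

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# The three integrals from the slide, approximated with the trapezoidal rule
print(np.trapz(p, x))                   # ~ 1        (normalisation)
print(np.trapz(x * p, x))               # ~ mu       (mean)
print(np.trapz((x - mu) ** 2 * p, x))   # ~ sigma^2  (variance)
```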

5. Sampling from a Gaussian Distribution
To sample from X ∼ N(µ, σ²), it suffices to sample from Y ∼ N(0, 1), since setting Y = (X − µ)/σ standardises X.
Cumulative distribution function:
Φ(x; 0, 1) = ∫_{−∞}^{x} (1 / √(2π)) exp(−t² / 2) dt
[Figure: two panels; on the density, the shaded area up to a equals Φ(a); on the CDF, drawing y ∼ Unif([0, 1]) on the vertical axis and mapping it back through Φ gives a sample.]
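One way to turn this slide into code is inverse-transform sampling: draw y ∼ Unif([0, 1]), apply Φ⁻¹ to obtain a standard normal sample, then rescale by σ and shift by µ. A minimal sketch, assuming scipy.stats.norm.ppf is used for Φ⁻¹ (µ, σ and the sample size are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

u = rng.uniform(0.0, 1.0, size=100000)   # y ~ Unif([0, 1])
z = norm.ppf(u)                          # z = Phi^{-1}(y) ~ N(0, 1)
x = mu + sigma * z                       # X = mu + sigma * Z ~ N(mu, sigma^2)

print(x.mean(), x.std())                 # approximately mu and sigma
```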

6. Bivariate Normal (Gaussian) Distribution
Suppose X₁ ∼ N(µ₁, σ₁²) and X₂ ∼ N(µ₂, σ₂²) are independent. The joint probability distribution p(x₁, x₂) is a bivariate normal distribution:
p(x₁, x₂) = p(x₁) · p(x₂)
= (1 / (√(2π) σ₁)) exp(−(x₁ − µ₁)² / (2σ₁²)) · (1 / (√(2π) σ₂)) exp(−(x₂ − µ₂)² / (2σ₂²))
= (1 / (2π (σ₁² σ₂²)^{1/2})) exp(−[(x₁ − µ₁)² / (2σ₁²) + (x₂ − µ₂)² / (2σ₂²)])
= (1 / (2π |Σ|^{1/2})) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))
where
x = [x₁, x₂]ᵀ,  µ = [µ₁, µ₂]ᵀ,  Σ = [[σ₁², 0], [0, σ₂²]]
Note: All equiprobable points lie on an ellipse.
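The factorisation can be verified numerically: the product of the two univariate densities should equal the bivariate density with the diagonal Σ above. A small sketch (the parameter values and test point are made up):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu1, s1 = 0.0, 1.0
mu2, s2 = 1.0, 2.0

mu = np.array([mu1, mu2])
Sigma = np.diag([s1 ** 2, s2 ** 2])      # independence => diagonal covariance

x = np.array([0.3, -1.2])                # an arbitrary test point
lhs = norm.pdf(x[0], mu1, s1) * norm.pdf(x[1], mu2, s2)
rhs = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
print(lhs, rhs)                          # the two values agree
```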

7. Covariance and Correlation
For random variables X and Y, the covariance measures how the random variables change jointly:
cov(X, Y) = E[(X − E[X]) · (Y − E[Y])]
Covariance depends on the scale. The (Pearson) correlation coefficient normalises the covariance to give a value between −1 and +1:
corr(X, Y) = cov(X, Y) / √(var(X) · var(Y)),
where var(X) = E[(X − E[X])²] and var(Y) = E[(Y − E[Y])²].
Independent variables are uncorrelated, but the converse is not true!
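The final remark deserves a concrete example: with X ∼ N(0, 1) and Y = X², the two variables are clearly dependent, yet cov(X, Y) = E[X³] = 0. A quick numerical sketch (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000000)
y = x ** 2                               # deterministic function of x, so dependent

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / np.sqrt(x.var() * y.var())
print(cov_xy, corr_xy)                   # both approximately 0: uncorrelated yet dependent
```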

8. Multivariate Gaussian Distribution
Suppose x is a D-dimensional random vector. The covariance matrix consists of all pairwise covariances:
cov(x) = E[(x − E[x])(x − E[x])ᵀ] =
[ var(X₁)        cov(X₁, X₂)   ···  cov(X₁, X_D) ]
[ cov(X₂, X₁)    var(X₂)       ···  cov(X₂, X_D) ]
[     ⋮               ⋮          ⋱        ⋮       ]
[ cov(X_D, X₁)   cov(X_D, X₂)  ···  var(X_D)     ]
If µ = E[x] and Σ = cov(x), the multivariate normal is defined by the density
N(µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))
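The density formula translates directly into code. A sketch that evaluates it at one point and compares against scipy.stats.multivariate_normal (the particular µ, Σ and test point are made up, with Σ chosen to be positive definite):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
x = np.array([0.5, 0.5, -0.5])
D = len(mu)

diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
dens = np.exp(-0.5 * quad) / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(dens, multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # the two agree
```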

9. Suppose you are given three independent samples: x₁ = 0.3, x₂ = 1.4, and x₃ = 1.7.
You know that the data is generated from either N(0, 1) or N(2, 1). Let θ represent the parameters (µ, σ) of the two distributions. Then the probability of observing the data with parameter θ is called the likelihood:
p(x₁, x₂, x₃ | θ) = p(x₁ | θ) · p(x₂ | θ) · p(x₃ | θ)
We have to choose between θ = (0, 1) and θ = (2, 1). Which one is more likely?
[Figure: the two densities with µ = 0 and µ = 2, with x₁, x₂, x₃ marked on the horizontal axis.]
Maximum Likelihood Estimation (MLE): pick the parameter θ that maximises the likelihood.
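For the three samples above, the comparison is a one-liner; the squared deviations sum to less around µ = 2, so θ = (2, 1) attains the higher likelihood. A short sketch:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.3, 1.4, 1.7])

def likelihood(mu, sigma):
    # independent samples => product of the individual densities
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

print(likelihood(0, 1))   # theta = (0, 1)
print(likelihood(2, 1))   # theta = (2, 1), the larger of the two likelihoods
```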

10. Outline
◮ Probability Review
◮ Linear Regression and Maximum Likelihood
◮ Information, Entropy, KL Divergence

11. Linear Regression
Linear model: y = w₀x₀ + w₁x₁ + ··· + w_D x_D + ε = w · x + ε
Noise/uncertainty: model y given x, w as a random variable with mean wᵀx:
E[y | x, w] = wᵀx
We will be specific in choosing the distribution of y given x and w. Let us assume that given x, w, y is normal with mean wᵀx and variance σ²:
p(y | w, x) = N(wᵀx, σ²), equivalently y = wᵀx + N(0, σ²)
Alternatively, we may view this model as ε ∼ N(0, σ²) (Gaussian noise).
Discriminative framework: throughout this lecture, think of the inputs x₁, . . . , x_N as fixed.
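The generative story y = wᵀx + ε with ε ∼ N(0, σ²) is easy to simulate, which is useful for testing the estimators on the later slides. A minimal sketch (the dimensions, the true parameters, and the absence of an explicit bias term x₀ are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
w_true, sigma_true = np.array([1.5, -2.0, 0.5]), 0.3

X = rng.standard_normal((N, D))               # inputs, treated as fixed
eps = rng.normal(0.0, sigma_true, size=N)     # Gaussian noise
y = X @ w_true + eps                          # y_i = w^T x_i + eps_i
```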

12. Likelihood of Linear Regression (Gaussian Noise Model)
Suppose we observe data {(x_i, y_i)}_{i=1}^N. What is the likelihood of observing the data for model parameters w, σ?
MLE estimator: find parameters which maximise the likelihood (the product of the "likelihood density" segments).
Least squares estimator: find parameters which minimise the sum of squares of the residuals (the sum of squares of the segments).

13. Likelihood of Linear Regression (Gaussian Noise Model)
Suppose we observe data {(x_i, y_i)}_{i=1}^N. What is the likelihood of observing the data for model parameters w, σ?
p(y₁, . . . , y_N | x₁, . . . , x_N, w, σ) = ∏_{i=1}^{N} p(y_i | x_i, w, σ)
According to the model, y_i ∼ wᵀx_i + N(0, σ²), so
p(y₁, . . . , y_N | x₁, . . . , x_N, w, σ) = ∏_{i=1}^{N} (1 / √(2πσ²)) exp(−(y_i − wᵀx_i)² / (2σ²))
= (1 / (2πσ²))^{N/2} exp(−(1 / (2σ²)) ∑_{i=1}^{N} (y_i − wᵀx_i)²)
We want to find parameters w and σ that maximise the likelihood.

14. Likelihood of Linear Regression (Gaussian Noise Model)
Let us consider the likelihood p(y | X, w, σ):
p(y₁, . . . , y_N | x₁, . . . , x_N, w, σ) = (1 / (2πσ²))^{N/2} exp(−(1 / (2σ²)) ∑_{i=1}^{N} (y_i − wᵀx_i)²)
As log : R⁺ → R is an increasing function, we can instead maximise the log of the likelihood (called the log-likelihood), which results in a simpler mathematical expression:
LL(y₁, . . . , y_N | x₁, . . . , x_N, w, σ) = −(N/2) log(2πσ²) − (1 / (2σ²)) ∑_{i=1}^{N} (y_i − wᵀx_i)²
In vector form,
LL(y | X, w, σ) = −(N/2) log(2πσ²) − (1 / (2σ²)) (Xw − y)ᵀ(Xw − y)
Let's first find w that maximises the log-likelihood.
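In vector form the log-likelihood is a one-liner in code. A sketch of a helper that evaluates it for given w and σ (the function name and interface are ad hoc):

```python
import numpy as np

def log_likelihood(w, sigma, X, y):
    # LL(y | X, w, sigma) = -N/2 * log(2*pi*sigma^2) - ||Xw - y||^2 / (2*sigma^2)
    resid = X @ w - y
    N = len(y)
    return -0.5 * N * np.log(2 * np.pi * sigma ** 2) - resid @ resid / (2 * sigma ** 2)
```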

15. Maximum Likelihood and Least Squares Estimates
We'd like to find w that maximises the log-likelihood
LL(y | X, w, σ) = −(N/2) log(2πσ²) − (1 / (2σ²)) (Xw − y)ᵀ(Xw − y)
Alternatively, we can minimise the negative log-likelihood
NLL(y | X, w, σ) = (1 / (2σ²)) (Xw − y)ᵀ(Xw − y) + (N/2) log(2πσ²)
Recall the objective function we used for the least squares estimate in the previous lecture:
L(w) = (1 / (2N)) (Xw − y)ᵀ(Xw − y)
For minimising with respect to w, the two objectives are the same up to constant additive and multiplicative factors!

16. Maximum Likelihood Estimate for Linear Regression
As the maximum likelihood estimator w_ML coincides with the least squares estimator, we have
w_ML = (XᵀX)⁻¹ Xᵀy
Recall the form of the negative log-likelihood:
NLL(y | X, w, σ) = (1 / (2σ²)) (Xw − y)ᵀ(Xw − y) + (N/2) log(2πσ²)
We can also find the maximum likelihood estimate for σ. It is an exercise on sheet 2 to show that the MLE of σ is given by
σ²_ML = (1/N) (Xw_ML − y)ᵀ(Xw_ML − y)
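Both closed-form estimates fit in a few lines. A sketch that solves the normal equations with np.linalg.solve rather than forming the explicit inverse (a standard numerical choice, not something prescribed by the slides):

```python
import numpy as np

def fit_mle(X, y):
    # w_ML = (X^T X)^{-1} X^T y, computed by solving the normal equations
    w_ml = np.linalg.solve(X.T @ X, X.T @ y)
    # sigma^2_ML = (1/N) * ||X w_ML - y||^2
    resid = X @ w_ml - y
    sigma2_ml = resid @ resid / len(y)
    return w_ml, sigma2_ml
```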

17. Prediction using the MLE for Linear Regression
Given training data {(x_i, y_i)}_{i=1}^N, we can obtain the MLE estimates w_ML and σ_ML. On a new point x_new, we can use these to make a prediction and also give confidence intervals:
ŷ_new = w_ML · x_new
y_new ∼ ŷ_new + N(0, σ²_ML)
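A sketch of how such a prediction might be produced, with a 95% predictive interval derived from y_new ∼ ŷ_new + N(0, σ²_ML) (the 1.96 multiplier is the usual Gaussian 95% value; uncertainty in w_ML itself is ignored here):

```python
import numpy as np

def predict_with_interval(x_new, w_ml, sigma2_ml, z=1.96):
    # x_new is assumed to have the same dimension as w_ml
    y_hat = w_ml @ x_new                     # point prediction w_ML . x_new
    half_width = z * np.sqrt(sigma2_ml)      # from y_new ~ y_hat + N(0, sigma^2_ML)
    return y_hat, (y_hat - half_width, y_hat + half_width)
```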

18. Summary: MLE for Linear Regression (Gaussian Noise)
Model
◮ Linear model: y = w · x + ε
◮ Explicitly model ε ∼ N(0, σ²)
Maximum Likelihood Estimation
◮ Every w, σ defines a probability distribution over the observed data
◮ Pick w and σ that maximise the likelihood of observing the data
Algorithm
◮ As in the previous lecture, we have closed-form expressions
◮ The algorithm simply implements elementary matrix operations

19. Outliers and the Laplace Distribution
If the data has outliers, we can model the noise using a distribution that has heavier tails.
For the linear model y = w · x + ε, use ε ∼ Lap(0, b), where the density function for Lap(µ, b) is given by
p(x) = (1 / (2b)) exp(−|x − µ| / b)
[Figure: Laplace and normal distributions with the same mean and variance.]

20. Maximum Likelihood for the Laplace Noise Model
Given data {(x_i, y_i)}_{i=1}^N, let us express the likelihood of observing the data in terms of the model parameters w and b:
p(y₁, . . . , y_N | x₁, . . . , x_N, w, b) = ∏_{i=1}^{N} (1 / (2b)) exp(−|y_i − wᵀx_i| / b)
= (1 / (2b)^N) exp(−(1/b) ∑_{i=1}^{N} |y_i − wᵀx_i|)
As in the case of the Gaussian noise model, we look at the negative log-likelihood:
NLL(y | X, w, b) = (1/b) ∑_{i=1}^{N} |y_i − wᵀx_i| + N log(2b)
Thus, the maximum likelihood estimate of w in this case can be obtained by minimising the sum of the absolute values of the residuals, which is the same objective we discussed in the last lecture in the context of fitting a linear model that is robust to outliers.
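Unlike the Gaussian case there is no closed form, but the sum of absolute residuals can be minimised with a generic optimiser. A sketch using scipy.optimize.minimize (the Nelder-Mead method, the least-squares starting point, and the helper name are arbitrary choices; a linear-programming formulation would work as well):

```python
import numpy as np
from scipy.optimize import minimize

def fit_laplace_mle(X, y):
    # Minimising sum_i |y_i - w^T x_i| gives the MLE of w under Laplace noise
    def sum_abs_resid(w):
        return np.sum(np.abs(X @ w - y))
    w0 = np.linalg.lstsq(X, y, rcond=None)[0]    # least-squares starting point
    res = minimize(sum_abs_resid, w0, method="Nelder-Mead")
    w_mle = res.x
    b_mle = np.mean(np.abs(X @ w_mle - y))       # MLE of the scale b is the mean absolute residual
    return w_mle, b_mle
```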

21. Outline
◮ Probability Review
◮ Linear Regression and Maximum Likelihood
◮ Information, Entropy, KL Divergence
