

  1. COMP24111: Machine Learning and Optimisation Chapter 3: Logistic Regression Dr. Tingting Mu Email: tingting.mu@manchester.ac.uk

2. Outline
• Understand the concept of likelihood.
• Know some simple ways to build a likelihood function for classification and regression.
• Understand the logistic regression model.
• Understand the Newton-Raphson update and iterative reweighted least squares.
• Understand the linear basis function model (a nonlinear model).

3. Linear Regression: Least Squares (Chapter 2)
• The model assumes a linear relationship between the input variables and the estimated output variable: $\hat{y} = \mathbf{w}^T \tilde{\mathbf{x}}$.
• Model parameters are fitted by minimising the sum-of-squares error. A different way to interpret this?
[Figure: training samples plotted in the (x, y) plane with the fitted regression line.]
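
For reference, a minimal NumPy sketch of such a least-squares fit (not from the slides; data and values are made up for illustration):

```python
# Sketch: least-squares fit of a linear model y_hat = w^T x~ on made-up data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 3.0, size=50)                 # hypothetical 1-D inputs
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)    # hypothetical noisy targets

X = np.column_stack([np.ones_like(x), x])          # augmented input x~ (bias + feature)
w, *_ = np.linalg.lstsq(X, y, rcond=None)          # minimises the sum-of-squares error
print("fitted w:", w)                              # close to [2, 3]
```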

4. Probabilistic View
• Assume the output variable is a random number.
• It is generated by adding noise to a linear function: $y = f(\mathbf{x}) + \text{noise}$, i.e. $y = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$.
• Optimise $\mathbf{w}$ by maximising the chances of observing the training samples.
[Figure: the same training samples, annotated "What is the chance we observe this sample?"]

5. Likelihood
• In an informal context, likelihood means probability.
• It is a function of the parameters of a statistical model, computed with the given data.
• A more formal definition: the likelihood of a set of parameter values ($\mathbf{w}$) given the observed data ($\mathbf{x}$) is the probability assumed for the observed data given those parameter values:
$$\text{Likelihood}(\mathbf{w} \mid \mathbf{x}) = p(\mathbf{x} \mid \mathbf{w}),$$
written $L(\mathbf{w})$ for simplicity.
• Maximum likelihood estimator: the model parameters are optimised so that the probability of observing the training data is maximised.
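
As a toy illustration (not from the slides; made-up coin-flip data), a likelihood can be evaluated on a grid of parameter values and maximised directly:

```python
# Toy sketch: likelihood of a coin's head-probability theta given observed
# flips (hypothetical data). The maximiser is the fraction of heads.
import numpy as np

flips = np.array([1, 1, 0, 1, 0, 1, 1, 1])   # 1 = head, 0 = tail (made up)
thetas = np.linspace(0.01, 0.99, 99)

# L(theta) = prod_i theta^y_i * (1 - theta)^(1 - y_i)
L = np.array([np.prod(t**flips * (1 - t)**(1 - flips)) for t in thetas])
print("maximum likelihood estimate:", thetas[np.argmax(L)])   # 0.75 = 6/8 heads
```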

6. Maximum Likelihood for Linear Regression
• The output variable is a random number: $y = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$.
• The noise is a random number. It follows a Gaussian distribution and has zero mean ($\mu = 0$). The standard deviation quantifies the amount of variation of a set of data values.
• Gaussian distribution with mean $\mu$, variance $\sigma^2$ and standard deviation $\sigma$:
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
[Figure: Gaussian density curves for ($\mu=0$, $\sigma=1$), ($\mu=0$, $\sigma=2$) and ($\mu=1$, $\sigma=1$), from https://kanbanize.com/blog/normal-gaussian-distribution-over-cycle-time/]
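
A minimal sketch (illustration only) of this density, implemented directly from the formula:

```python
# Sketch: the univariate Gaussian density exactly as defined above.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """N(x | mu, sigma^2) = (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-10, 10, 5)
print(gaussian_pdf(x, mu=0.0, sigma=1.0))   # points on the mu=0, sigma=1 curve
```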

7. Maximum Likelihood for Linear Regression
• Because $y = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$, the output variable also follows a Gaussian distribution, and its mean is $\mu = \mathbf{w}^T \tilde{\mathbf{x}}$:
$$p(y \mid \mathbf{x}, \mathbf{w}, \beta) = N(y \mid \mathbf{w}^T \tilde{\mathbf{x}}, \beta^{-1}),$$
where $\beta$ is the noise precision (inverse variance), $\beta^{-1} = \sigma^2$.
• Probability of observing the $i$-th training sample: $p(y_i \mid \mathbf{x}_i, \mathbf{w}, \beta) = N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1})$.
• Probability of observing all the $N$ training samples:
$$p(Y \mid X, \mathbf{w}, \beta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \mathbf{w}, \beta) = \prod_{i=1}^{N} N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1})$$

8. Maximum Likelihood for Linear Regression
• Likelihood function, using $N(x \mid \mu, \beta^{-1}) = \sqrt{\beta / 2\pi} \, \exp\left(-\beta (x-\mu)^2 / 2\right)$:
$$L(\mathbf{w}, \beta) = \prod_{i=1}^{N} N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1})$$
• Log-likelihood function, obtained by taking the logarithm of the likelihood function:
$$O(\mathbf{w}, \beta) = \ln L(\mathbf{w}, \beta) = \sum_{i=1}^{N} \ln N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1}) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \frac{\beta}{2}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T \tilde{\mathbf{x}}_i\right)^2$$
The final sum is the sum-of-squares error function from Chapter 2.
• Optimising $\mathbf{w}$ by maximising the log-likelihood function is therefore equivalent to minimising the sum-of-squares error function, under the assumption of additive zero-mean Gaussian noise.
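
A short sketch (made-up data) that evaluates this log-likelihood and checks that the least-squares solution also maximises it for a fixed $\beta$:

```python
# Sketch: the log-likelihood O(w, beta) above; maximising it in w is the same
# as minimising the sum-of-squares error.
import numpy as np

def log_likelihood(w, beta, X, y):
    """O = N/2 ln(beta) - N/2 ln(2 pi) - beta/2 * sum_i (y_i - w^T x_i)^2."""
    n = len(y)
    sq_err = np.sum((y - X @ w)**2)
    return 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi) - 0.5 * beta * sq_err

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.uniform(1, 3, 30)])    # augmented inputs
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 0.5, 30)        # noisy targets

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)                 # least-squares fit
print(log_likelihood(w_ls, 4.0, X, y) >= log_likelihood(np.zeros(2), 4.0, X, y))  # True
```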

9. Multivariate Gaussian Distribution
• Multivariate Gaussian distribution for a $d$-dimensional variable $\mathbf{x}$, with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$:
$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} \left|\Sigma\right|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
• Covariance measures the joint variability of two random variables: $\text{cov}(x, y) = E\left[(x - E[x])(y - E[y])\right]$.
• A bivariate example, with correlation $\rho$ between $x_1$ and $x_2$:
$$N\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}\right) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right)$$
[Figure: surface and contour plots of $p(x_1, x_2)$ for the three cases below.]
Case 1: $\mu_1 = \mu_2 = 0$, $\sigma_1 = \sigma_2 = 1$, $\rho = 0$.
Case 2: $\mu_1 = \mu_2 = 0$, $\sigma_1 = \sigma_2 = 1$, $\rho = 0.5$.
Case 3: $\mu_1 = \mu_2 = 1$, $\sigma_1 = 0.2$, $\sigma_2 = 1$, $\rho = 0$.
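
A small sketch (illustration only; SciPy assumed available) evaluating the bivariate density for Case 2 and checking its value at the mean against the closed form:

```python
# Sketch: bivariate Gaussian density via scipy, using the Case 2 parameters.
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2, s1, s2, rho = 0.0, 0.0, 1.0, 1.0, 0.5
mean = np.array([mu1, mu2])
cov = np.array([[s1**2,         rho * s1 * s2],
                [rho * s1 * s2, s2**2        ]])         # covariance matrix Sigma

print(multivariate_normal(mean, cov).pdf([0.0, 0.0]))    # density at the mean
print(1 / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2)))   # closed form: same value
```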

10. Maximum Likelihood for Binary Classification (Gaussian Distribution)
• The probability of observing a sample belonging to one of the two possible classes follows the Bernoulli distribution (a simple probabilistic model for flipping coins):
$$p = \theta_1^y \, \theta_2^{1-y} = \begin{cases} \theta_1, & \text{if } y = 1 \text{ (head)}, \\ \theta_2, & \text{if } y = 0 \text{ (tail)}, \end{cases} \qquad p(\mathbf{x}, y) = \theta(\mathbf{x}, 1)^y \, \theta(\mathbf{x}, 0)^{1-y} = \begin{cases} \theta(\mathbf{x}, 1), & \text{if } y = 1, \\ \theta(\mathbf{x}, 0), & \text{if } y = 0. \end{cases}$$
• Samples from each class are random variables following a Gaussian distribution, with prior class probabilities $p(C_1) = \alpha$ and $p(C_2) = 1 - \alpha$:
$$\theta(\mathbf{x}, 1) = p(C_1)\, p(\mathbf{x} \mid C_1) = \alpha \, N(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma), \qquad \theta(\mathbf{x}, 0) = p(C_2)\, p(\mathbf{x} \mid C_2) = (1 - \alpha) \, N(\mathbf{x} \mid \boldsymbol{\mu}_2, \Sigma)$$
Assume different classes have different mean vectors ($\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$), but the same covariance matrix $\Sigma$.

11. Maximum Likelihood for Binary Classification (Gaussian Distribution)
• Likelihood function:
$$L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{i=1}^{N} p(\mathbf{x}_i, y_i) = \prod_{i=1}^{N} \left[\alpha \, N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \Sigma)\right]^{y_i} \left[(1-\alpha) \, N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \Sigma)\right]^{1-y_i}$$
• Log-likelihood function:
$$O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \ln\left\{\prod_{i=1}^{N} \left[\alpha \, N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \Sigma)\right]^{y_i} \left[(1-\alpha) \, N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \Sigma)\right]^{1-y_i}\right\}$$
$$= \sum_{i=1}^{N} \left[y_i \ln \alpha + (1 - y_i) \ln(1 - \alpha)\right] + \sum_{i=1}^{N} y_i \ln N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \Sigma) + \sum_{i=1}^{N} (1 - y_i) \ln N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \Sigma)$$
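
A sketch (assumed array shapes; SciPy's Gaussian log-density) evaluating this log-likelihood:

```python
# Sketch: the log-likelihood O(alpha, mu1, mu2, Sigma) above.
# X is an (N, d) array of samples, y a 0/1 label vector (1 = class 1).
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(alpha, mu1, mu2, Sigma, X, y):
    ln_n1 = multivariate_normal(mu1, Sigma).logpdf(X)   # ln N(x_i | mu1, Sigma)
    ln_n2 = multivariate_normal(mu2, Sigma).logpdf(X)   # ln N(x_i | mu2, Sigma)
    return np.sum(y * np.log(alpha) + (1 - y) * np.log(1 - alpha)
                  + y * ln_n1 + (1 - y) * ln_n2)
```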

12. Maximum Likelihood for Binary Classification (Gaussian Distribution)
• We need to decide the optimal setting of the following model parameters: the class prior $\alpha$, the mean vector $\boldsymbol{\mu}_1$ of class 1, the mean vector $\boldsymbol{\mu}_2$ of class 2, and the shared covariance matrix $\Sigma$ for both classes.
• Optimal parameters are obtained by setting the gradients to zero (a code sketch follows below):
$$\frac{\partial O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma)}{\partial \alpha} = 0 \;\Rightarrow\; \alpha^* = \frac{N_1}{N}$$
The prior probability of a class is simply the fraction of the training samples in that class.
$$\frac{\partial O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma)}{\partial \boldsymbol{\mu}_1} = 0 \;\Rightarrow\; \boldsymbol{\mu}_1^* = \frac{1}{N_1}\sum_{i=1}^{N} y_i \mathbf{x}_i, \qquad \frac{\partial O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma)}{\partial \boldsymbol{\mu}_2} = 0 \;\Rightarrow\; \boldsymbol{\mu}_2^* = \frac{1}{N_2}\sum_{i=1}^{N} (1 - y_i) \mathbf{x}_i$$
The mean vector of each class is simply the average of the training samples in that class.
$$\frac{\partial O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma)}{\partial \Sigma} = 0 \;\Rightarrow\; \Sigma^* = \frac{N_1}{N}\Sigma_1 + \frac{N_2}{N}\Sigma_2, \quad \text{where } \Sigma_C = \frac{1}{N_C}\sum_{i \in \text{class } C} (\mathbf{x}_i - \boldsymbol{\mu}_C)(\mathbf{x}_i - \boldsymbol{\mu}_C)^T, \; C = 1, 2$$
The covariance matrix is simply a weighted average of the covariance matrices associated with each of the two classes.
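
A sketch of these closed-form estimates on made-up two-feature data:

```python
# Sketch: closed-form ML estimates alpha*, mu1*, mu2*, Sigma* on made-up data.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], 1.0, size=(20, 2))   # hypothetical class-1 samples
X2 = rng.normal([3.0, 3.0], 1.0, size=(20, 2))   # hypothetical class-2 samples
X = np.vstack([X1, X2])
y = np.array([1] * 20 + [0] * 20)                # y_i = 1 for class 1, 0 for class 2

n, n1 = len(y), int(y.sum())
n2 = n - n1
alpha = n1 / n                                   # alpha* = N1 / N
mu1 = (y[:, None] * X).sum(axis=0) / n1          # mu1* = (1/N1) sum_i y_i x_i
mu2 = ((1 - y)[:, None] * X).sum(axis=0) / n2    # mu2* = (1/N2) sum_i (1-y_i) x_i

# Sigma* = (N1/N) Sigma_1 + (N2/N) Sigma_2, with per-class covariances Sigma_C.
S1 = (X1 - mu1).T @ (X1 - mu1) / n1
S2 = (X2 - mu2).T @ (X2 - mu2) / n2
Sigma = (n1 / n) * S1 + (n2 / n) * S2
print(alpha, mu1, mu2, Sigma, sep="\n")
```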

13. Example: Binary Classification
• 20 training samples from class A, each characterised by 2 features.
• 20 training samples from class B, each characterised by 2 features.
[Figure: training samples and separation boundary.]
Red region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \Sigma^*) < p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \Sigma^*)$
Blue region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \Sigma^*) \ge p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \Sigma^*)$
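
This decision rule can be sketched as below (illustration only; alpha, mu1, mu2 and Sigma as estimated in the previous sketch):

```python
# Sketch: classify a point by comparing the two joint probabilities.
from scipy.stats import multivariate_normal

def classify(x, alpha, mu1, mu2, Sigma):
    """Return 'A' if p(y=class A, x) >= p(y=class B, x), else 'B'."""
    p_a = alpha * multivariate_normal(mu1, Sigma).pdf(x)         # class-A joint
    p_b = (1 - alpha) * multivariate_normal(mu2, Sigma).pdf(x)   # class-B joint
    return "A" if p_a >= p_b else "B"

# Example, reusing alpha, mu1, mu2, Sigma from the estimation sketch above:
# print(classify([1.5, 1.5], alpha, mu1, mu2, Sigma))
```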

14. • We just used the following model to formulate a likelihood function for binary classification:
$$p(\mathbf{x}, y) = \theta(\mathbf{x}, 1)^y \, \theta(\mathbf{x}, 0)^{1-y}, \qquad L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{i=1}^{N} p(\mathbf{x}_i, y_i),$$
where $\theta(\mathbf{x}, 1) = \alpha \, N(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma)$ is the probability of observing $(\mathbf{x}, \text{class 1})$, and $\theta(\mathbf{x}, 0) = (1 - \alpha) \, N(\mathbf{x} \mid \boldsymbol{\mu}_2, \Sigma)$ is the probability of observing $(\mathbf{x}, \text{class 2})$.
• Is there another way to formulate the likelihood function for classification?

15. Logistic Regression: Binary Classification
• Another way to construct the likelihood function: given the class label $y \in \{0, 1\}$, model the class probability directly,
$$p(y \mid \mathbf{x}) = \theta(y=1 \mid \mathbf{x})^y \, \theta(y=0 \mid \mathbf{x})^{1-y} = \theta(y=1 \mid \mathbf{x})^y \left[1 - \theta(y=1 \mid \mathbf{x})\right]^{1-y},$$
where, for an observed sample $\mathbf{x}$, $\theta(y=1 \mid \mathbf{x})$ is the probability that it is from class 1, $\theta(y=0 \mid \mathbf{x})$ is the probability that it is from class 0, and $\theta(y=0 \mid \mathbf{x}) + \theta(y=1 \mid \mathbf{x}) = 1$.
$$\text{Likelihood} = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i)$$
• We directly model
$$\theta(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \tilde{\mathbf{x}}) = \frac{1}{1 + \exp(-\mathbf{w}^T \tilde{\mathbf{x}})},$$
where $\mathbf{w}^T \tilde{\mathbf{x}}$ is a linear model as learned in Chapter 2, and $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is called the logistic sigmoid function.
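
A minimal sketch (made-up inputs) of the sigmoid model and the Bernoulli log-likelihood it induces:

```python
# Sketch: logistic sigmoid and the resulting Bernoulli log-likelihood.
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, y):
    """sum_i [ y_i ln theta_i + (1 - y_i) ln(1 - theta_i) ], theta_i = sigma(w^T x_i)."""
    theta = sigmoid(X @ w)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

X = np.array([[1.0, 0.5],
              [1.0, 2.5]])   # augmented inputs x~ (bias + one feature, made up)
y = np.array([0, 1])
print(log_likelihood(np.array([-3.0, 2.0]), X, y))
```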
