COMP24111: Machine Learning and Optimisation, Chapter 3: Logistic Regression



SLIDE 1

COMP24111: Machine Learning and Optimisation
Chapter 3: Logistic Regression

Dr. Tingting Mu
Email: tingting.mu@manchester.ac.uk

SLIDE 2

Outline

  • Understand the concept of likelihood.
  • Know some simple ways to build a likelihood function for classification and regression.
  • Understand the logistic regression model.
  • Understand the Newton-Raphson update and iterative reweighted least squares.
  • Understand the linear basis function model (a nonlinear model).

SLIDE 3

Linear Regression: Least Squares (Chapter 2)

  • The model assumes a linear relationship between the input variables and the estimated output variable: $\hat{y} = \mathbf{w}^T \tilde{\mathbf{x}}$.
  • Model parameters are fitted by minimising the sum-of-squares error.

[Figure: training samples (x, y) with the fitted regression line.]

A different way to interpret this?

SLIDE 4

Probabilistic View

  • Assume the output variable is a random number.
  • It is generated by adding noise to a linear function:

$$y = f(\mathbf{x}) + \text{noise} = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$$

  • Optimise w by maximising the chance of observing the training samples.
  • What is the chance we observe this sample?

[Figure: training samples (x, y) with the fitted line.]

SLIDE 5

Likelihood

  • In informal contexts, likelihood means probability.
  • It is a function of the parameters of a statistical model, computed with the given data.
  • A more formal definition: the likelihood of a set of parameter values (w) given the observed data (x) is the probability assumed for the observed data given those parameter values:

$$\text{Likelihood}(\mathbf{w} \mid \mathbf{x}) = p(\mathbf{x} \mid \mathbf{w}),$$ written $L(\mathbf{w})$ for simplicity.

  • Maximum likelihood estimator: the model parameters are optimised so that the probability of observing the training data is maximised.

SLIDE 6

Maximum Likelihood for Linear Regression

  • The output variable is a random number: $y = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$.
  • Noise is a random number. It follows a Gaussian distribution with zero mean (μ = 0).

Gaussian distribution with mean μ, variance σ² (standard deviation σ):

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Standard deviation quantifies the amount of variation of a set of data values.

[Figure: Gaussian densities p(x) for (μ=0, σ=1), (μ=0, σ=2), (μ=1, σ=1), from https://kanbanize.com/blog/normal-gaussian-distribution-over-cycle-time/]
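To make the density concrete, here is a minimal NumPy sketch (illustrative, not part of the slides) that evaluates $N(x \mid \mu, \sigma^2)$ for the three parameter settings shown in the figure; the helper name gaussian_pdf is my own.

```python
# A minimal sketch (illustrative): the univariate Gaussian density on this slide.
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """N(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

x = np.linspace(-5, 5, 201)
for mu, sigma in [(0, 1), (0, 2), (1, 1)]:   # the three curves shown in the figure
    p = gaussian_pdf(x, mu, sigma)
    print(f"mu={mu}, sigma={sigma}: peak density = {p.max():.3f}")
```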

SLIDE 7

Maximum Likelihood for Linear Regression

  • Because $y = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$, the output variable also follows a Gaussian distribution and its mean is $\mu = \mathbf{w}^T \tilde{\mathbf{x}}$:

$$p(y \mid \mathbf{x}, \mathbf{w}, \beta) = N(y \mid \mathbf{w}^T \tilde{\mathbf{x}}, \beta^{-1}),$$ where β is the noise precision (inverse variance), β⁻¹ = σ².

  • Probability of observing the i-th training sample:

$$p(y_i \mid \mathbf{x}_i, \mathbf{w}, \beta) = N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1})$$

  • Probability of observing all the N training samples:

$$p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \mathbf{w}, \beta) = \prod_{i=1}^{N} N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1})$$

SLIDE 8

Maximum Likelihood for Linear Regression

  • Likelihood function:

$$L(\mathbf{w}, \beta) = \prod_{i=1}^{N} N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1}), \qquad \text{where } N(x \mid \mu, \beta^{-1}) = \frac{1}{\sqrt{2\pi\beta^{-1}}} \exp\left(-\frac{\beta (x-\mu)^2}{2}\right)$$

  • Log-likelihood function: taking the logarithm of the likelihood function,

$$O(\mathbf{w}, \beta) = \ln L(\mathbf{w}, \beta) = \sum_{i=1}^{N} \ln N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta \cdot \frac{1}{2} \sum_{i=1}^{N} \left(y_i - \mathbf{w}^T \tilde{\mathbf{x}}_i\right)^2$$

The last term contains the sum-of-squares error function from Chapter 2.

  • Optimising w by maximising the log-likelihood function is therefore equivalent to minimising the sum-of-squares error function, under the assumption of additive zero-mean Gaussian noise.
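As a quick numerical check of this equivalence, the sketch below fits w by the normal equations and compares the negative log-likelihood at that solution with the one at the true generating weights. It is illustrative only: the synthetic data, the noise level, and names such as X_tilde and w_ml are my own assumptions.

```python
# A minimal sketch (illustrative): for zero-mean Gaussian noise, the w that
# maximises the log-likelihood is the least-squares solution.
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(1, 3, size=N)
X_tilde = np.column_stack([np.ones(N), x])          # augmented inputs [1, x]
w_true = np.array([1.0, 3.5])
y = X_tilde @ w_true + rng.normal(0, 0.5, size=N)   # y = w^T x~ + Gaussian noise

# Least-squares / maximum-likelihood solution via the normal equations.
w_ml = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

def neg_log_likelihood(w, beta=1.0 / 0.5 ** 2):
    r = y - X_tilde @ w
    return -(N / 2) * np.log(beta) + (N / 2) * np.log(2 * np.pi) + 0.5 * beta * (r @ r)

print("w_ml:", w_ml)
print("NLL at w_ml   :", neg_log_likelihood(w_ml))
print("NLL at w_true :", neg_log_likelihood(w_true))  # never lower than at w_ml
```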

SLIDE 9

Multivariate Gaussian Distribution

  • Multivariate Gaussian distribution with mean vector μ and covariance matrix Σ:

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2}\right)$$

  • A bivariate example:

$$N\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}\right) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right)$$

Here ρ is the correlation between x1 and x2. Covariance measures the joint variability of two random variables: cov(x, y) = E[(x − E[x])(y − E[y])].

[Figure: contour and surface plots of p(x1, x2) for three cases. Case 1: μ1 = μ2 = 0, σ1 = σ2 = 1, ρ = 0. Case 2: μ1 = μ2 = 0, σ1 = σ2 = 1, ρ = 0.5. Case 3: μ1 = μ2 = 1, σ1 = 0.2, σ2 = 1, ρ = 0.]

SLIDE 10

Maximum Likelihood for Binary Classification (Gaussian Distribution)

  • The probability of observing a sample belonging to one of the two possible classes follows the Bernoulli distribution (a simple probabilistic model for flipping coins):

$$p(\mathbf{x}, y) = \theta(\mathbf{x}, 1)^y \, \theta(\mathbf{x}, 0)^{1-y} = \begin{cases} \theta(\mathbf{x}, 1), & \text{if } y = 1, \\ \theta(\mathbf{x}, 0), & \text{if } y = 0. \end{cases}$$

Coin flip analogy: $p = \theta_1^y \theta_2^{1-y} = \theta_1$ if y = 1 (head), $\theta_2$ if y = 0 (tail).

  • Samples from each class are random variables following a Gaussian distribution:

$$\theta(\mathbf{x}, 1) = p(C_1)\, p(\mathbf{x} \mid C_1) = \alpha\, N(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}), \qquad \theta(\mathbf{x}, 0) = p(C_2)\, p(\mathbf{x} \mid C_2) = (1-\alpha)\, N(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$

Prior class probabilities: $p(C_1) = \alpha$, $p(C_2) = 1 - \alpha$. Assume the two classes have different mean vectors (μ1 and μ2) but share the same covariance matrix Σ.

SLIDE 11

Maximum Likelihood for Binary Classification (Gaussian Distribution)

  • Likelihood function:

$$L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{i=1}^{N} p(\mathbf{x}_i, y_i) = \prod_{i=1}^{N} \left[\alpha N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{y_i} \left[(1-\alpha) N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1-y_i}$$

  • Log-likelihood function:

$$O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \ln L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \left[y_i \ln\alpha + (1-y_i)\ln(1-\alpha)\right] + \sum_{i=1}^{N} y_i \ln N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) + \sum_{i=1}^{N} (1-y_i) \ln N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$

SLIDE 12

Maximum Likelihood for Binary Classification (Gaussian Distribution)

  • We need to decide the optimal setting of the following model parameters: α (class prior), μ1 (mean vector of class 1), μ2 (mean vector of class 2), and Σ (shared covariance matrix for both classes).
  • Optimal parameters are obtained by setting the gradients to zero:

$$\frac{\partial O}{\partial \alpha} = 0 \;\Rightarrow\; \alpha^* = \frac{N_1}{N}$$

$$\frac{\partial O}{\partial \boldsymbol{\mu}_1} = 0 \;\Rightarrow\; \boldsymbol{\mu}_1^* = \frac{1}{N_1} \sum_{i=1}^{N} y_i \mathbf{x}_i, \qquad \frac{\partial O}{\partial \boldsymbol{\mu}_2} = 0 \;\Rightarrow\; \boldsymbol{\mu}_2^* = \frac{1}{N_2} \sum_{i=1}^{N} (1-y_i) \mathbf{x}_i$$

$$\frac{\partial O}{\partial \boldsymbol{\Sigma}} = 0 \;\Rightarrow\; \boldsymbol{\Sigma}^* = \frac{N_1}{N} \boldsymbol{\Sigma}_1 + \frac{N_2}{N} \boldsymbol{\Sigma}_2, \quad \text{where } \boldsymbol{\Sigma}_C = \frac{1}{N_C} \sum_{i \in \text{Class } C} (\mathbf{x}_i - \boldsymbol{\mu}_C)(\mathbf{x}_i - \boldsymbol{\mu}_C)^T, \; C = 1, 2$$

  • The prior probability of a class is simply the fraction of the training samples in that class.
  • The mean vector of each class is simply the average of the training samples in that class.
  • The covariance matrix is a weighted average of the covariance matrices associated with each of the two classes.
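These closed-form estimates are easy to compute directly. The sketch below is a minimal illustration; the synthetic data and the helper name fit_gaussian_classifier are assumptions, not part of the slides.

```python
# A minimal sketch (illustrative): closed-form maximum-likelihood estimates for
# the shared-covariance Gaussian class-conditional model on this slide.
import numpy as np

def fit_gaussian_classifier(X, y):
    """X: (N, d) feature matrix, y: (N,) labels in {0, 1} (1 = class 1, 0 = class 2)."""
    N = len(y)
    N1, N2 = y.sum(), N - y.sum()
    alpha = N1 / N                                   # alpha* = N1 / N
    mu1 = (y[:, None] * X).sum(axis=0) / N1          # mu1* = (1/N1) sum_i y_i x_i
    mu2 = ((1 - y)[:, None] * X).sum(axis=0) / N2    # mu2* = (1/N2) sum_i (1 - y_i) x_i
    S1 = np.cov(X[y == 1].T, bias=True)              # per-class covariance Sigma_1
    S2 = np.cov(X[y == 0].T, bias=True)              # per-class covariance Sigma_2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2            # shared covariance Sigma*
    return alpha, mu1, mu2, Sigma

# Tiny synthetic example with two 2-D Gaussian classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (20, 2)), rng.normal([3, 3], 1.0, (20, 2))])
y = np.array([1] * 20 + [0] * 20)
alpha, mu1, mu2, Sigma = fit_gaussian_classifier(X, y)
print("alpha* =", alpha, "\nmu1* =", mu1, "\nmu2* =", mu2, "\nSigma* =\n", Sigma)
```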

SLIDE 13

Example: Binary Classification

20 training samples from class A, each characterised by 2 features. 20 training samples from class B, each characterised by 2 features.

[Figure: training samples and separation boundary.]

Red region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*) < p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*)$

Blue region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*) \geq p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*)$

SLIDE 14

  • We just used the following model to formulate a likelihood function for binary classification:

$$p(\mathbf{x}, y) = \theta(\mathbf{x}, 1)^y \, \theta(\mathbf{x}, 0)^{1-y}, \qquad \theta(\mathbf{x}, 1) = \alpha N(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}), \quad \theta(\mathbf{x}, 0) = (1-\alpha) N(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}),$$

$$L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{i=1}^{N} p(\mathbf{x}_i, y_i)$$

Here θ(x, 1) is the probability of observing (x, class 1), and θ(x, 0) is the probability of observing (x, class 2).

  • Is there another way to formulate the likelihood function for classification?

SLIDE 15

Logistic Regression: Binary Classification

  • Another way to construct the likelihood function is, given class label y ∈ {0, 1}:

$$p(y \mid \mathbf{x}) = \theta(y=1 \mid \mathbf{x})^y \left[\theta(y=0 \mid \mathbf{x})\right]^{1-y} = \theta(y=1 \mid \mathbf{x})^y \left[1 - \theta(y=1 \mid \mathbf{x})\right]^{1-y}$$

$$\text{Likelihood} = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i)$$

Here θ(y = 1 | x) is the probability that an observed sample x is from class 1, and θ(y = 0 | x) is the probability that it is from class 0, with θ(y = 0 | x) + θ(y = 1 | x) = 1.

  • We directly model

$$\theta(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \tilde{\mathbf{x}}) = \frac{1}{1 + \exp(-\mathbf{w}^T \tilde{\mathbf{x}})}$$

where $\mathbf{w}^T \tilde{\mathbf{x}}$ is the linear model learned in Chapter 2, and $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is called the logistic sigmoid function.

SLIDE 16

Logistic Regression: Binary Classification

  • Logistic regression likelihood function:

$$p(y \mid \mathbf{x}) = \theta(y=1 \mid \mathbf{x})^y \left[1 - \theta(y=1 \mid \mathbf{x})\right]^{1-y}, \qquad L(\mathbf{w}) = \prod_{i=1}^{N} \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)^{y_i} \left[1 - \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)\right]^{1-y_i}$$

with the logistic sigmoid function $\sigma(\mathbf{w}^T \tilde{\mathbf{x}}) = \frac{1}{1 + \exp(-\mathbf{w}^T \tilde{\mathbf{x}})}$.

  • The cross-entropy error function is the negative logarithm of the likelihood:

$$O(\mathbf{w}) = -\ln L(\mathbf{w}) = -\sum_{i=1}^{N} y_i \ln \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i) - \sum_{i=1}^{N} (1 - y_i) \ln\left[1 - \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)\right]$$

  • The optimal setting of w is found by maximising the likelihood (equivalent to minimising the cross-entropy error). The gradient of the error function with respect to w is given by

$$\nabla O(\mathbf{w}) = \sum_{i=1}^{N} \left(\sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i) - y_i\right) \tilde{\mathbf{x}}_i$$

Remember the linear least squares model? Its gradient has the same form: $\nabla O(\mathbf{w}) = \sum_{i=1}^{N} \left(\tilde{\mathbf{x}}_i^T \mathbf{w} - y_i\right) \tilde{\mathbf{x}}_i$.
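A minimal sketch of this optimisation using plain gradient descent on the cross-entropy error; the learning rate, iteration count, synthetic data, and function names are illustrative assumptions (the slides themselves move on to the Newton-Raphson update next).

```python
# A minimal sketch (illustrative): binary logistic regression trained by
# gradient descent using the gradient sum_i (sigma(w^T x_i) - y_i) x_i.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, lr=0.1, n_iters=1000):
    """X: (N, d) inputs, y: (N,) labels in {0, 1}. Returns the weight vector w."""
    X_tilde = np.column_stack([np.ones(len(X)), X])   # prepend a bias term
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        pi = sigmoid(X_tilde @ w)                      # pi_i = sigma(w^T x_i)
        grad = X_tilde.T @ (pi - y)                    # sum_i (sigma(w^T x_i) - y_i) x_i
        w -= lr * grad / len(y)                        # averaged gradient step
    return w

# Tiny synthetic check: two overlapping 1-D classes.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])[:, None]
y = np.array([0] * 50 + [1] * 50)
w = fit_logistic_gd(X, y)
acc = ((sigmoid(np.column_stack([np.ones(100), X]) @ w) >= 0.5) == y).mean()
print("w =", w, "training accuracy:", acc)
```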

SLIDE 17

Newton-Raphson Update

  • Gradient descent update: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla O(\mathbf{w}^{(t)})$, where η > 0 is the learning rate.
  • Newton-Raphson update: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1} \nabla O(\mathbf{w}^{(t)})$, where H is the Hessian matrix.
  • The Hessian matrix $\mathbf{H} = [H_{ij}]$ consists of the second-order derivatives of the objective function: $H_{ij} = \frac{\partial^2 O}{\partial w_i \partial w_j}$.

Example: applying the Newton-Raphson update to the linear least squares model we learned in Chapter 2,

$$O(\mathbf{w}) = \frac{1}{2}\mathbf{Y}^T\mathbf{Y} - \mathbf{w}^T \tilde{\mathbf{X}}^T \mathbf{Y} + \frac{1}{2}\mathbf{w}^T \tilde{\mathbf{X}}^T \tilde{\mathbf{X}} \mathbf{w}, \qquad \nabla O(\mathbf{w}) = -\tilde{\mathbf{X}}^T \mathbf{Y} + \tilde{\mathbf{X}}^T \tilde{\mathbf{X}} \mathbf{w}, \qquad \mathbf{H} = \nabla\left(\nabla O(\mathbf{w})\right) = \tilde{\mathbf{X}}^T \tilde{\mathbf{X}}$$

The update becomes

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1}\left(-\tilde{\mathbf{X}}^T \mathbf{Y} + \tilde{\mathbf{X}}^T \tilde{\mathbf{X}} \mathbf{w}^{(t)}\right) = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{Y}$$

It finds the optimal solution in one iteration!
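The one-step behaviour is easy to verify numerically; the sketch below (synthetic data and variable names are my own assumptions) takes a single Newton-Raphson step from w = 0 and compares it with the normal-equations solution.

```python
# A minimal sketch (illustrative): one Newton-Raphson step on the linear least
# squares objective lands exactly on the normal-equations solution.
import numpy as np

rng = np.random.default_rng(0)
X_tilde = np.column_stack([np.ones(30), rng.uniform(0, 5, 30)])  # augmented inputs
Y = X_tilde @ np.array([2.0, -1.0]) + rng.normal(0, 0.3, 30)

w = np.zeros(2)                                   # arbitrary starting point
grad = -X_tilde.T @ Y + X_tilde.T @ X_tilde @ w   # gradient of O(w)
H = X_tilde.T @ X_tilde                           # Hessian of O(w)
w_next = w - np.linalg.solve(H, grad)             # one Newton-Raphson step

w_normal_eq = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
print(np.allclose(w_next, w_normal_eq))           # True: optimal in one iteration
```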

SLIDE 18

Iterative Reweighted Least Squares

  • Optimise the logistic regression model using the Newton-Raphson update.

Notation: $\pi_i = \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)$ (scalar), $\boldsymbol{\pi} = [\pi_1, \pi_2, \ldots, \pi_N]^T$ (column vector), and $\mathbf{S} = \mathrm{diag}\left(\pi_1(1-\pi_1), \pi_2(1-\pi_2), \ldots, \pi_N(1-\pi_N)\right)$ (diagonal matrix).

– Gradient vector and Hessian matrix:

$$\nabla O(\mathbf{w}) = \sum_{i=1}^{N} (\pi_i - y_i)\tilde{\mathbf{x}}_i = \tilde{\mathbf{X}}^T (\boldsymbol{\pi} - \mathbf{Y}), \qquad \mathbf{H} = \sum_{i=1}^{N} \pi_i (1-\pi_i) \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^T = \tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}}$$

– Newton-Raphson update:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\nabla O(\mathbf{w}^{(t)}) = \mathbf{w}^{(t)} - (\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T (\boldsymbol{\pi} - \mathbf{Y}) = (\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}})^{-1}\left[\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}} \mathbf{w}^{(t)} + \tilde{\mathbf{X}}^T (\mathbf{Y} - \boldsymbol{\pi})\right] = (\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{z}, \quad \text{where } \mathbf{z} = \mathbf{S}\tilde{\mathbf{X}}\mathbf{w}^{(t)} + (\mathbf{Y} - \boldsymbol{\pi})$$

The logistic regression model optimised through the Newton-Raphson update is known as iterative reweighted least squares (IRLS). With iteration-indexed quantities:

$$\mathbf{w}^{(t+1)} = (\tilde{\mathbf{X}}^T \mathbf{S}^{(t)} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{z}^{(t)}, \qquad \mathbf{z}^{(t)} = \mathbf{S}^{(t)}\tilde{\mathbf{X}}\mathbf{w}^{(t)} + (\mathbf{Y} - \boldsymbol{\pi}^{(t)}),$$

$$\boldsymbol{\pi}^{(t)} = [\pi_1^{(t)}, \pi_2^{(t)}, \ldots, \pi_N^{(t)}]^T, \quad \pi_i^{(t)} = \sigma(\tilde{\mathbf{x}}_i^T \mathbf{w}^{(t)}), \qquad \mathbf{S}^{(t)} = \mathrm{diag}(s_1^{(t)}, s_2^{(t)}, \ldots, s_N^{(t)}), \quad s_i^{(t)} = \pi_i^{(t)}(1 - \pi_i^{(t)})$$
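A minimal NumPy sketch of the IRLS loop above; the helper name fit_logistic_irls, the iteration count, and the synthetic data are assumptions, and no safeguards against perfectly separable data are included.

```python
# A minimal sketch (illustrative): iterative reweighted least squares (IRLS)
# for binary logistic regression, following the Newton-Raphson update above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(X_tilde, Y, n_iters=10):
    """X_tilde: (N, d+1) augmented inputs, Y: (N,) labels in {0, 1}."""
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        pi = sigmoid(X_tilde @ w)                   # pi_i = sigma(x_i^T w)
        s = pi * (1.0 - pi)                         # diagonal of S
        z = s * (X_tilde @ w) + (Y - pi)            # z = S X w + (Y - pi)
        H = X_tilde.T @ (s[:, None] * X_tilde)      # H = X^T S X
        w = np.linalg.solve(H, X_tilde.T @ z)       # w = (X^T S X)^{-1} X^T z
    return w

# Tiny synthetic usage example with two overlapping 1-D classes.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
X_tilde = np.column_stack([np.ones(100), x])
Y = np.array([0] * 50 + [1] * 50)
print("IRLS weights:", fit_logistic_irls(X_tilde, Y))
```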

SLIDE 19

Iris Classification Example

Model: $y = \mathbf{w}^T \tilde{\mathbf{x}} = w_0 + w_1 x_1 + w_2 x_2$.

[Figure: Virginica and Versicolour samples plotted by sepal length in cm and petal width in cm, with separation boundaries for w optimised by the normal equations and for w optimised by the IRLS update.]

SLIDE 20

Multi-class Logistic Regression

  • Each sample has c binary output variables, each indicating whether the sample belongs to a class:

$$\mathbf{y} = [y_1, y_2, \ldots, y_c], \quad y_k \in \{0, 1\}$$

  • The probability of observing the output y given its input x is

$$p(\mathbf{y} \mid \mathbf{x}) = \theta(C_1 \mid \mathbf{x})^{y_1}\, \theta(C_2 \mid \mathbf{x})^{y_2} \cdots \theta(C_c \mid \mathbf{x})^{y_c} = \prod_{k=1}^{c} \theta(C_k \mid \mathbf{x})^{y_k}, \qquad \text{where } \sum_{k=1}^{c} \theta(C_k \mid \mathbf{x}) = 1$$

  • Likelihood function: $\text{Likelihood} = \prod_{i=1}^{N} p(\mathbf{y}_i \mid \mathbf{x}_i)$.
  • θ(Ck | x) is the probability that an observed sample x is from class k. Here we use the softmax function to estimate this probability from the linear model $a_k = \mathbf{w}_k^T \tilde{\mathbf{x}}$:

$$\theta(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_{j=1}^{c} \exp(a_j)}, \quad \text{where } a_k = \mathbf{w}_k^T \tilde{\mathbf{x}}.$$

SLIDE 21

Multi-class Logistic Regression

  • Likelihood function computed over N observed training samples:

$$L(\mathbf{W}) = \prod_{i=1}^{N} \prod_{k=1}^{c} \left(\frac{\exp(\mathbf{w}_k^T \tilde{\mathbf{x}}_i)}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^T \tilde{\mathbf{x}}_i)}\right)^{y_{ik}}$$

  • Cross-entropy error function:

$$O(\mathbf{W}) = -\ln L(\mathbf{W}) = -\sum_{i=1}^{N} \sum_{k=1}^{c} y_{ik} \ln\left(\frac{\exp(\mathbf{w}_k^T \tilde{\mathbf{x}}_i)}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^T \tilde{\mathbf{x}}_i)}\right)$$

  • Gradient of the error function with respect to the coefficient vector wk, writing $\pi_{ik} = \theta(C_k \mid \mathbf{x}_i)$:

$$\frac{\partial O(\mathbf{W})}{\partial \mathbf{w}_k} = \sum_{i=1}^{N} (\pi_{ik} - y_{ik})\, \tilde{\mathbf{x}}_i$$
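As an illustration, the sketch below trains a softmax (multi-class logistic regression) model by plain gradient descent using exactly this gradient; the data, learning rate, and function names are my own assumptions rather than part of the lecture.

```python
# A minimal sketch (illustrative): multi-class logistic (softmax) regression
# trained by gradient descent on the cross-entropy error above.
import numpy as np

def softmax(A):
    """Row-wise softmax of the activation matrix A = X W (shape (N, c))."""
    A = A - A.max(axis=1, keepdims=True)            # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_gd(X_tilde, Y, lr=0.1, n_iters=500):
    """X_tilde: (N, d+1) augmented inputs, Y: (N, c) one-hot labels."""
    N, c = Y.shape
    W = np.zeros((X_tilde.shape[1], c))             # one weight vector w_k per class
    for _ in range(n_iters):
        Pi = softmax(X_tilde @ W)                   # pi_ik = theta(C_k | x_i)
        grad = X_tilde.T @ (Pi - Y)                 # column k is sum_i (pi_ik - y_ik) x_i
        W -= lr * grad / N
    return W

# Tiny usage example with three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(3), 30)
Y = np.eye(3)[labels]                               # one-hot encoding
X_tilde = np.column_stack([np.ones(len(X)), X])
W = fit_softmax_gd(X_tilde, Y)
print("training accuracy:", (softmax(X_tilde @ W).argmax(axis=1) == labels).mean())
```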

SLIDE 22

Linear Basis Function Model

  • So far we have worked with a linear combination of the input variables. This gives a linear model:

$$\hat{y} = \mathbf{w}^T \tilde{\mathbf{x}}, \quad \text{where } \mathbf{w} = [w_0, w_1, w_2, \ldots, w_d]^T \text{ and } \tilde{\mathbf{x}} = \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix}.$$

  • Nonlinear model: assume the estimated output variable is a linear combination of fixed nonlinear functions (basis functions) of the input variables:

$$\hat{y} = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_D \phi_D(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}),$$

$$\text{where } \mathbf{w} = [w_0, w_1, w_2, \ldots, w_D]^T \text{ and } \boldsymbol{\phi}(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_D(\mathbf{x})]^T.$$

A basis function example (μi and σi are basis function parameters):

$$\phi_i(\mathbf{x}) = \exp\left(-\frac{\sum_{j=1}^{d}(x_j - \mu_{ij})^2}{2\sigma_i^2}\right) = \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma_i^2}\right).$$

Another basis function example: for a single input variable, $\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]^T$. This is known as polynomial regression. The case D = 1 reduces to linear regression.

$\{\phi_i(\mathbf{x})\}_{i=1}^{D}$ can be viewed as a feature extractor.
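A short sketch of polynomial regression as a linear least squares fit on these basis features; the degree values echo the D = 1, 3, 7 examples on the following slides, while the data and helper names are illustrative assumptions.

```python
# A minimal sketch (illustrative): polynomial regression as a linear least
# squares model on the basis features phi(x) = [1, x, x^2, ..., x^D].
import numpy as np

def poly_features(x, D):
    """Map a 1-D input array x to the (N, D+1) design matrix [1, x, ..., x^D]."""
    return np.vander(x, N=D + 1, increasing=True)

def fit_poly(x, y, D):
    Phi = poly_features(x, D)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]    # least-squares weights

# Fit a noisy sine curve with increasing polynomial degree D.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 25)
y = np.sin(x) + rng.normal(0, 0.1, size=x.shape)
for D in (1, 3, 7):
    w = fit_poly(x, y, D)
    y_hat = poly_features(x, D) @ w
    print(f"D={D}: training RMSE = {np.sqrt(np.mean((y - y_hat) ** 2)):.3f}")
```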

SLIDE 23

A Regression Example

Curve fitting task: construct a curve that has the best fit to a series of data points. Method: incorporate basis functions into a linear least squares model,

$$\hat{y} = \mathbf{w}^T \boldsymbol{\phi}(x), \qquad \boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]^T.$$

[Figure: training samples and the fitted regression curve for D = 1, D = 3, and D = 7.]

SLIDE 24

A Regression Example

Testing the fitted curve with new points:

$$\hat{y} = \mathbf{w}^T \boldsymbol{\phi}(x), \qquad \boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]^T.$$

[Figure: training samples, testing samples, the ground-truth curve, and the fitted regression curve for D = 1, D = 3, and D = 7.]

SLIDE 25

Iris Classification Example

Iris classification task: incorporate basis functions into the logistic regression model,

$$\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T.$$

[Figure: training samples and separation boundary.]

SLIDE 26

Summary

  • In this lecture, we have learned:
– the concept of likelihood,
– simple ways to build a likelihood function,
– the logistic regression model (IRLS algorithm),
– the linear basis function model (a nonlinear extension of a linear model).

  • In the next lecture, we will talk about support vector machines.