SLIDE 1

Machine Learning

Lecture 03: Logistic Regression and Gradient Descent

Nevin L. Zhang (lzhang@cse.ust.hk)
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

This set of notes is based on internet resources and:

- K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press. (Chapter 8)
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org. (Chapter 4)
- Andrew Ng. Lecture Notes on Machine Learning. Stanford.
- Hal Daume III. A Course in Machine Learning. http://ciml.info/

SLIDE 2

Logistic Regression

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 3

Logistic Regression

Recap of Probabilistic Linear Regression

Training set: D = {(xi, yi)}_{i=1}^N, where yi ∈ ℝ.

Probabilistic model: p(y|x, θ) = N(y|µ(x), σ²) = N(y|w⊤x, σ²)

Learning: determine w by minimizing the cross entropy loss:

J(w) = −(1/N) Σ_{i=1}^N log p(yi|xi, w)

Point estimation of y: ŷ = µ(x) = w⊤x

SLIDE 4

Logistic Regression

Logistic Regression (for Classification)

Training set: D = {(xi, yi)}_{i=1}^N, where yi ∈ {0, 1}.

Probabilistic model: p(y|x, w) = Ber(y|σ(w⊤x))

σ(z) is the sigmoid (logistic) function; its inverse is the logit function:

σ(z) = 1/(1 + exp(−z)) = e^z/(e^z + 1)

It maps the real line ℝ into (0, 1). It is not to be confused with the variance σ² in the Gaussian distribution.
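As a quick illustration, here is a minimal NumPy sketch of σ(z); the clipping bound of 500 is an arbitrary guard against overflow in exp:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into (0, 1)."""
    z = np.clip(z, -500, 500)          # guard against overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```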

SLIDE 5

Logistic Regression

Logistic Regression

The model:

p(y|x, w) = Ber(y|σ(w⊤x))
p(y = 1|x, w) = σ(w⊤x) = 1/(1 + exp(−w⊤x))
p(y = 0|x, w) = 1 − σ(w⊤x) = exp(−w⊤x)/(1 + exp(−w⊤x))

Consider the logit of p(y = 1|x, w):

log [p(y = 1|x, w)/(1 − p(y = 1|x, w))] = log [p(y = 1|x, w)/p(y = 0|x, w)] = log exp(w⊤x) = w⊤x.

So, a linear model for the logit; hence the name logistic regression.

SLIDE 6

Logistic Regression

Logistic Regression: Decision Rule

To classify instances, we obtain a point estimate of y:

ŷ = arg max_y p(y|x, w)

In other words, the decision/classification rule is: ŷ = 1 iff p(y = 1|x, w) > 0.5

This is called the optimal Bayes classifier: suppose the same situation occurs many times. You will make some mistakes no matter which decision rule you use, but the probability of a mistake is minimized by the rule above.

SLIDE 7

Logistic Regression

Logistic Regression is a Linear Classifier

In fact, the decision/classification rule of logistic regression is equivalent to:

ŷ = 1 iff w⊤x > 0

So it is a linear classifier with a linear decision boundary. Example: whether a student is admitted based on the results of two exams:
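Continuing the admission example, here is a minimal sketch of the rule ŷ = 1 iff w⊤x > 0; the weights and exam scores are made-up values for illustration, and x0 = 1 is the bias feature:

```python
import numpy as np

def predict(w, X):
    """Return 1 where w^T x > 0, else 0 (rows of X are instances)."""
    return (X @ w > 0).astype(int)

# Hypothetical weights and two applicants' exam scores (x0 = 1 is the bias feature)
w = np.array([-10.0, 0.1, 0.1])
X = np.array([[1.0, 30.0, 40.0],   # w^T x = -3 -> predict 0 (reject)
              [1.0, 60.0, 70.0]])  # w^T x =  3 -> predict 1 (admit)
print(predict(w, X))  # [0 1]
```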

SLIDE 8

Logistic Regression

Logistic Regression: Example

Solid black dots are the data: those at the bottom are the SAT scores of applicants rejected by a university, and those at the top are the SAT scores of applicants it accepted. The red circles are the predicted probabilities that the applicants would be accepted.

SLIDE 9

Logistic Regression

Logistic Regression: 2D Example

The decision boundary p(y = 1|x, w) = 0.5 is a straight line in the feature space.

SLIDE 10

Logistic Regression

Parameter Learning

We would like to find the MLE of w, i.e., the values of w that minimize the cross entropy loss:

J(w) = −(1/N) Σ_{i=1}^N log P(yi|xi, w)

Consider a general training example (x, y). Because y is binary, we have

log P(y|x, w) = y log ŷ + (1 − y) log(1 − ŷ)   (where ŷ = P(y = 1|x, w))
             = y log σ(w⊤x) + (1 − y) log(1 − σ(w⊤x)).

Hence,

J(w) = −(1/N) Σ_{i=1}^N [yi log σ(w⊤xi) + (1 − yi) log(1 − σ(w⊤xi))]

Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use optimization algorithms to compute it:

- Gradient descent
- Newton's method
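For concreteness, here is a minimal NumPy sketch of this loss; eps is an arbitrary guard against taking log(0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def cross_entropy_loss(w, X, y, eps=1e-12):
    """J(w) = -(1/N) sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = sigmoid(X @ w)               # p_i = sigma(w^T x_i)
    p = np.clip(p, eps, 1 - eps)     # keep the logarithms finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```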

SLIDE 11

Gradient Descent

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 12

Gradient Descent

Gradient Descent

Consider a function y = J(w) of a scalar variable w. The derivative of J(w) is defined as:

J′(w) = dJ(w)/dw = lim_{ε→0} [J(w + ε) − J(w)]/ε

When ε is small, we have

J(w + ε) ≈ J(w) + εJ′(w)

This equation tells us how to reduce J(w) by changing w in small steps:

- If J′(w) > 0, make ε negative, i.e., decrease w.
- If J′(w) < 0, make ε positive, i.e., increase w.

In other words, move in the opposite direction of the derivative (gradient).

SLIDE 13

Gradient Descent

Gradient Descent

To implement the idea, we update w as follows:

w ← w − αJ′(w)

The term −J′(w) means that we move in the opposite direction of the derivative, and α determines how much we move in that direction. α is called the step size in optimization and the learning rate in machine learning.
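A minimal sketch of this update on a toy objective; the function J(w) = (w − 3)², the initial value, and the learning rate are all arbitrary choices for illustration:

```python
# Minimize J(w) = (w - 3)^2, whose derivative is J'(w) = 2(w - 3).
def J_prime(w):
    return 2.0 * (w - 3.0)

w, alpha = 0.0, 0.1                 # arbitrary initial value and learning rate
for _ in range(100):
    w = w - alpha * J_prime(w)      # move against the derivative
print(w)                            # close to the minimizer w* = 3
```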

SLIDE 14

Gradient Descent

Gradient Descent

Consider a function y = J(w), where w = (w0, w1, …, wD)⊤. The gradient of J at w is defined as

∇J = (∂J/∂w0, ∂J/∂w1, …, ∂J/∂wD)⊤

The gradient is the direction along which J increases the fastest. If we want to reduce J as fast as possible, we move in the opposite direction of the gradient.

SLIDE 15

Gradient Descent

Gradient Descent

The method of steepest descent (gradient descent) for minimizing J(w):

1. Initialize w.
2. Repeat until convergence:

   w ← w − α∇J(w)

The learning rate α usually changes from iteration to iteration.

SLIDE 16

Gradient Descent

Choice of Learning Rate

A constant learning rate is difficult to set:

- Too small, and convergence will be very slow.
- Too large, and the method can fail to converge at all.

It is better to vary the learning rate. We will discuss this more later.

SLIDE 17

Gradient Descent

Gradient Descent

Gradient descent can get stuck at local minima or saddle points. Nonetheless, it usually works well in practice.

SLIDE 18

Gradient Descent for Logistic Regression

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 19

Gradient Descent for Logistic Regression

Derivative of σ(z)

To apply gradient descent to logistic regression, we need to compute the partial derivative of J(w) w.r.t. each weight wj. Before doing that, first consider the derivative of the sigmoid function:

σ′(z) = dσ(z)/dz = (d/dz)[1/(1 + e^(−z))]
      = −[1/(1 + e^(−z))²] · d(1 + e^(−z))/dz
      = [1/(1 + e^(−z))²] · e^(−z)
      = [1/(1 + e^(−z))] · [1 − 1/(1 + e^(−z))]
      = σ(z)(1 − σ(z))

SLIDE 20

Gradient Descent for Logistic Regression

Derivative of log P(y|x, w)

Let z = w⊤x, with x = [x0, x1, …, xD]⊤ and w = [w0, w1, …, wD]⊤. By the chain rule,

∂σ(z)/∂wj = (dσ(z)/dz)(∂z/∂wj) = σ(z)(1 − σ(z))xj

Hence,

∂/∂wj log P(y|x, w) = ∂/∂wj [y log σ(z) + (1 − y) log(1 − σ(z))]
 = [y/σ(z)] ∂σ(z)/∂wj − [(1 − y)/(1 − σ(z))] ∂σ(z)/∂wj
 = [y(1 − σ(z)) − (1 − y)σ(z)] xj
 = [y − σ(z)] xj.

SLIDE 21

Gradient Descent for Logistic Regression

Derivative of the Cross Entropy Loss

Write the i-th training example as xi = [xi,0, xi,1, …, xi,D]⊤ and let zi = w⊤xi. Then

∂J(w)/∂wj = −(1/N) Σ_{i=1}^N ∂/∂wj log P(yi|xi, w) = −(1/N) Σ_{i=1}^N [yi − σ(zi)] xi,j

SLIDE 22

Gradient Descent for Logistic Regression

Batch Gradient Descent

The batch gradient descent algorithm for logistic regression:

Repeat until convergence:

 wj ← wj + α (1/N) Σ_{i=1}^N [yi − σ(w⊤xi)] xi,j

Interpretation (assuming the xi are positive vectors):

- If the predicted value σ(w⊤xi) is smaller than the actual value yi, there is reason to increase wj. The increment is proportional to xi,j.
- If the predicted value σ(w⊤xi) is larger than the actual value yi, there is reason to decrease wj. The decrement is proportional to xi,j.
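A minimal NumPy sketch of this algorithm; the learning rate, iteration count, and zero initialization are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (N, D+1) design matrix (first column all ones for the bias),
    y: (N,) labels in {0, 1}.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        error = y - sigmoid(X @ w)        # y_i - sigma(w^T x_i), shape (N,)
        w += alpha * (X.T @ error) / N    # update every w_j at once
    return w
```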

SLIDE 23

Gradient Descent for Logistic Regression

Stochastic Gradient Descent

Batch gradient descent is costly when N is large. In that case, we usually use stochastic gradient descent:

Repeat until convergence:

 Randomly choose B ⊂ {1, 2, …, N}
 wj ← wj + α (1/|B|) Σ_{i∈B} [yi − σ(w⊤xi)] xi,j

The randomly picked subset B is called a minibatch. Its size is called the batch size.
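A sketch of one such update; the batch size of 32 is an arbitrary choice, and sampling without replacement assumes batch_size ≤ N:

```python
import numpy as np

def sgd_step(w, X, y, alpha=0.1, batch_size=32, rng=np.random.default_rng(0)):
    """One stochastic gradient descent step on a random minibatch B."""
    idx = rng.choice(len(y), size=batch_size, replace=False)   # pick B
    Xb, yb = X[idx], y[idx]
    error = yb - 1.0 / (1.0 + np.exp(-(Xb @ w)))               # y_i - sigma(w^T x_i)
    return w + alpha * (Xb.T @ error) / batch_size
```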

SLIDE 24

Gradient Descent for Logistic Regression

L2 Regularization

Just as we prefer ridge regression to linear regression, we prefer to add a regularization term to the objective function of logistic regression:

J(w) ← J(w) + (λ/2) ||w||₂²

In this case, we have

∂J(w)/∂wj = −(1/N) Σ_{i=1}^N [yi − σ(zi)] xi,j + λwj

The weight update formula is:

wj ← wj + α[−λwj + (1/|B|) Σ_{i∈B} (yi − σ(w⊤xi)) xi,j]

SLIDE 25

Gradient Descent for Logistic Regression

L2 Regularization

Update rule without regularization:

 wj ← wj + α (1/|B|) Σ_{i∈B} [yi − σ(w⊤xi)] xi,j

Update rule with regularization:

 wj ← (1 − αλ)wj + α (1/|B|) Σ_{i∈B} [yi − σ(w⊤xi)] xi,j

Regularization forces the weights to be smaller: |(1 − αλ)wj| < |wj| since 0 < αλ < 1.
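The only change from the unregularized step is the (1 − αλ) shrink factor, as this minimal sketch shows; λ = 0.01 is a made-up value, and for simplicity the sketch also decays the bias weight, which in practice is often excluded:

```python
import numpy as np

def sgd_step_l2(w, Xb, yb, alpha=0.1, lam=0.01):
    """Minibatch update with L2 regularization (weight decay)."""
    error = yb - 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return (1 - alpha * lam) * w + alpha * (Xb.T @ error) / len(yb)
```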

SLIDE 26

Newton’s Method

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 27

Newton’s Method

Newton’s Method

Consider again a function J(w) of a scalar variable w. We want to minimize J(w). At a minimum point, the derivative J′(w) = 0. Let f(w) = J′(w). We want to find points where f(w) = 0. Newton's method (aka the Newton-Raphson method) is a commonly used method for this problem:

Start with some initial value w.
Repeat the following update until convergence:

 w ← w − f(w)/f′(w)
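A minimal sketch of this root-finding loop; the tolerance, iteration cap, and example function are arbitrary choices for illustration:

```python
def newton_root(f, f_prime, w0, tol=1e-10, max_iters=100):
    """Find a root of f starting from w0 via w <- w - f(w)/f'(w)."""
    w = w0
    for _ in range(max_iters):
        step = f(w) / f_prime(w)
        w -= step
        if abs(step) < tol:          # stop once the updates become tiny
            break
    return w

# Example: the root of f(w) = w^2 - 2 is sqrt(2)
print(newton_root(lambda w: w**2 - 2, lambda w: 2 * w, w0=1.0))  # ~1.41421356
```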

SLIDE 28

Newton’s Method

Newton’s Method

Interpretation of Newton's method: approximate the function f by the linear function that is tangent to f at the current guess wi:

y = f′(wi)w + f(wi) − f′(wi)wi

Solve for where this linear function equals zero to get the next iterate:

f′(wi)w + f(wi) − f′(wi)wi = 0, which gives wi+1 = wi − f(wi)/f′(wi)

SLIDE 29

Newton’s Method

Newton’s Method

To minimize J(w) using Newton's method:

Start with some initial value w.
Repeat the following update until convergence:

 w ← w − J′(w)/J″(w)

where J″ is the second derivative of J.

SLIDE 30

Newton’s Method

Newton’s Method

Now consider minimizing a function J(w) of a vector w = (w0, w1, …, wD)⊤. The "first derivative" of J(w) is the gradient vector ∇J(w). The "second derivative" of J(w) is the Hessian matrix H = [Hij], where

Hij = ∂²J(w)/∂wi∂wj

To minimize J(w) using Newton's method:

Start with some initial value w.
Repeat the following update until convergence:

 w ← w − H⁻¹∇J(w)
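A minimal sketch of one such update; solving the linear system Hs = ∇J(w) is preferred to forming H⁻¹ explicitly, and grad and hess are assumed to be user-supplied functions:

```python
import numpy as np

def newton_step(w, grad, hess):
    """One Newton update: w <- w - H^{-1} grad J(w).

    grad(w) returns the gradient vector; hess(w) returns the Hessian matrix.
    """
    return w - np.linalg.solve(hess(w), grad(w))
```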

SLIDE 31

Newton’s Method

Newton’s Method: Another Perspective

By Taylor expansion, we have

J(w + ε) ≈ J(w) + J′(w)ε + (1/2)J″(w)ε²

We want dJ(w + ε)/dε = 0, so J′(w) + J″(w)ε = 0. Solving this equation, we get

ε = −J′(w)/J″(w)

SLIDE 32

Newton’s Method

Newton’s Method vs Gradient Descent

In gradient descent, we have only a first-order term in the approximation (gradient):

J(w + ε) ≈ J(w) + εJ′(w)

In Newton's method, we also have a second-order term in the approximation (curvature):

J(w + ε) ≈ J(w) + J′(w)ε + (1/2)J″(w)ε²

SLIDE 33

Newton’s Method

Newton’s Method vs Gradient Descent

Use of curvature allows us to better predict how J(w) changes when we change w:

1. With negative curvature, J actually decreases faster than the gradient predicts.
2. With no curvature, the gradient predicts the decrease correctly.
3. With positive curvature, the function decreases more slowly than expected.

The use of curvature mitigates the problems in cases 1 and 3.

SLIDE 34

Newton’s Method

Newton’s Method vs Gradient Descent

Newton's method (red) typically enjoys faster convergence than (batch) gradient descent (green), and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires computing and inverting the Hessian matrix.

SLIDE 35

Softmax Regression

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 36

Softmax Regression

Multi-Class Logistic Regression

Training set: D = {(xi, yi)}_{i=1}^N, where yi ∈ {1, 2, …, C} and C ≥ 2.

Multi-class logistic regression (aka softmax regression) uses the following probability model:

There is a weight vector wc = (wc,0, wc,1, …, wc,D)⊤ for each class c, and W = (w1, …, wC) is the weight matrix. The probability of a particular class c given x is:

p(y = c|x, W) = (1/Z(x, W)) exp(wc⊤x)

where Z(x, W) = Σ_{c′=1}^C exp(wc′⊤x) is the normalization constant that ensures the probabilities of all classes sum to 1. It is called the partition function.

The classification rule is:

ŷ = arg max_y p(y|x, W)
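A minimal sketch of this model; here W stores one weight vector per row, and subtracting the max score is a standard numerical-stability trick that does not change the probabilities:

```python
import numpy as np

def softmax_probs(W, x):
    """p(y = c | x, W) for all classes c; W has one row per class."""
    scores = W @ x                          # w_c^T x for every class c
    scores -= scores.max()                  # stability trick; cancels in the ratio
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # divide by the partition function Z

def classify(W, x):
    return np.argmax(softmax_probs(W, x))   # arg max_c p(y = c | x, W)
```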

SLIDE 37

Softmax Regression

Relationship with Logistic Regression

When C = 2, name the two classes {0, 1} instead of {1, 2}:

p(y = 0|x, W) = exp(w0⊤x) / [exp(w0⊤x) + exp(w1⊤x)],  p(y = 1|x, W) = exp(w1⊤x) / [exp(w0⊤x) + exp(w1⊤x)]

Dividing both numerator and denominator by exp(w1⊤x), we get

p(y = 0|x, W) = exp((w0 − w1)⊤x) / [exp((w0 − w1)⊤x) + 1],  p(y = 1|x, W) = 1 / [exp((w0 − w1)⊤x) + 1]

Letting w = w1 − w0, we recover the logistic regression model:

p(y = 0|x, w) = exp(−w⊤x)/(1 + exp(−w⊤x)),  p(y = 1|x, w) = 1/(1 + exp(−w⊤x))

SLIDE 38

Softmax Regression

Cross Entropy Loss

The cross entropy loss of the softmax regression model is:

J(W) = −(1/N) Σ_{i=1}^N log P(yi|xi, W)
     = −(1/N) Σ_{i=1}^N Σ_{c=1}^C 1(yi = c) log P(yi = c|xi, W)
     = −(1/N) Σ_{i=1}^N Σ_{c=1}^C 1(yi = c) log [exp(wc⊤xi) / Σ_{c′=1}^C exp(wc′⊤xi)]
     = −(1/N) Σ_{i=1}^N Σ_{c=1}^C 1(yi = c) [wc⊤xi − log Σ_{c′=1}^C exp(wc′⊤xi)]
     = −(1/N) Σ_{i=1}^N [Σ_{c=1}^C 1(yi = c) wc⊤xi − log Σ_{c′=1}^C exp(wc′⊤xi)]

SLIDE 39

Softmax Regression

Partial Derivative of Loss

We would like to find the MLE of W. To do so, we need to minimize the cross entropy loss. There is no closed-form solution, so we resort to gradient descent. The partial derivative of the cross entropy loss w.r.t. each wc,j is:

∂J(W)/∂wc,j = −(1/N) Σ_{i=1}^N [∂/∂wc,j Σ_{c″=1}^C 1(yi = c″) wc″⊤xi − ∂/∂wc,j log Σ_{c′=1}^C exp(wc′⊤xi)]
 = −(1/N) Σ_{i=1}^N [1(yi = c) xi,j − (exp(wc⊤xi) / Σ_{c′=1}^C exp(wc′⊤xi)) xi,j]
 = −(1/N) Σ_{i=1}^N [1(yi = c) − p(yi = c|xi, W)] xi,j

Compare this to the gradient of logistic regression:

∂J(w)/∂wj = −(1/N) Σ_{i=1}^N [yi − σ(w⊤xi)] xi,j

SLIDE 40

Softmax Regression

Gradient of Loss

The gradient of the cross entropy loss w.r.t. each wc is:

∇wc J(W) = (∂J(W)/∂wc,0, ∂J(W)/∂wc,1, …, ∂J(W)/∂wc,D)⊤ = −(1/N) Σ_{i=1}^N [1(yi = c) − p(yi = c|xi, W)] xi

Update rule in gradient descent:

wc ← wc + α (1/N) Σ_{i=1}^N [1(yi = c) − p(yi = c|xi, W)] xi
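A minimal sketch of this update for all classes at once; the one-hot matrix Y encodes 1(yi = c), and the layout convention (rows of W are classes) is the same assumption as in the earlier sketches:

```python
import numpy as np

def softmax_gd_step(W, X, y, alpha=0.1):
    """One gradient descent step for softmax regression.

    W: (C, D) weight matrix, X: (N, D) inputs, y: (N,) class indices 0..C-1.
    """
    N, C = len(y), W.shape[0]
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # P[i, c] = p(y_i = c | x_i, W)
    Y = np.zeros((N, C))
    Y[np.arange(N), y] = 1.0                      # one-hot encoding of 1(y_i = c)
    return W + alpha * ((Y - P).T @ X) / N        # w_c += alpha * mean_i [1(y_i = c) - p] x_i
```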

SLIDE 41

Optimization Approach to Classification

Outline

1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification

SLIDE 42

Optimization Approach to Classification

The Probabilistic Approach to Classification

So far, we have been talking about the probabilistic approach to classification:

Start with a training set D = {(xi, yi)}_{i=1}^N, where yi ∈ {0, 1}.

Learn a probabilistic model p(y|x, w) = Ber(y|σ(w⊤x)) by minimizing the cross entropy loss:

J(w) = −(1/N) Σ_{i=1}^N log P(yi|xi, w)

Classify new instances using: ŷ = 1 iff p(y = 1|x, w) > 0.5

SLIDE 43

Optimization Approach to Classification

The Optimization Approach

Next, we will briefly talk about the optimization approach to classification:

Start with a training set D = {(xi, yi)}_{i=1}^N, where yi ∈ {−1, 1}.

Learn a linear classifier

ŷ = sign(w⊤x + b)

where x = (x1, …, xD)⊤ and w = (w1, …, wD)⊤. We drop the convention x0 = 1 and use b in place of w0.

SLIDE 44

Optimization Approach to Classification

The Sign Function

sign(x) = 1 if x > 0; sign(x) = 0 if x = 0; sign(x) = −1 if x < 0.

SLIDE 45

Optimization Approach to Classification

The Optimization Approach

We determine w and b by minimizing the training/empirical error

L(w, b) = (1/N) Σ_{i=1}^N L0/1(yi, ŷi)

where L0/1(yi, ŷi) = 1(yi ≠ ŷi) is called the zero/one loss function.

Let z = w⊤x + b. It is easy to see that

L0/1(y, ŷ) = 1(y(w⊤x + b) ≤ 0) = 1(yz ≤ 0)

Overloading the notation, we also write this as L0/1(y, z). So the loss function of linear classifiers is often written as

L(w, b) = (1/N) Σ_{i=1}^N 1(yi(w⊤xi + b) ≤ 0)

SLIDE 46

Optimization Approach to Classification

A Problem with the Zero/One Loss Function

The zero/one loss function, as a function of w and b, has zero gradient almost everywhere and is not convex. Hence it is difficult to optimize. (In the figure, the x-axis is yz.)

SLIDE 47

Optimization Approach to Classification

Convex Function

A real-valued function f is convex if the line segment between any two points on the graph of the function lies above or on the graph.

For any two points x1 < x2 and any t ∈ [0, 1],

f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)

f is strictly convex if, for t ∈ (0, 1), equality holds only when x1 = x2.

SLIDE 48

Optimization Approach to Classification

Surrogate Loss Functions

Convex functions are easy to minimize. Intuitively, if you drop a ball anywhere on the graph of a convex function, it will eventually roll to the minimum. This is not true for non-convex functions. Since the zero/one loss is hard to optimize, we want to approximate it by a convex function and optimize that function instead. The approximating function is called a surrogate loss. A surrogate loss needs to be an upper bound on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss.

SLIDE 49

Optimization Approach to Classification

Logistic Loss

One common surrogate loss function is the logistic loss:

Llog(y, z) = (1/log 2) log(1 + exp(−yz))
           = (1/log 2) log(1 + exp(−(w⊤x + b)))  if y = 1
             (1/log 2) log(1 + exp(w⊤x + b))     if y = −1

On the other hand, each term in the NLL of logistic regression has the following form:

−log P(y|x, w) = −[y log σ(w⊤x) + (1 − y) log(1 − σ(w⊤x))]
               = −log[1/(1 + exp(−w⊤x))]        if y = 1
                 −log(1 − 1/(1 + exp(−w⊤x)))    if y = 0
               = log(1 + exp(−w⊤x))             if y = 1
                 log(1 + exp(w⊤x))              if y = 0

So, up to the constant factor 1/log 2, logistic regression is the same as a linear classifier with the logistic surrogate loss.

SLIDE 50

Optimization Approach to Classification

Other Surrogate Loss Functions

- Zero/one loss: L0/1(y, z) = 1(yz ≤ 0)
- Logistic loss: Llog(y, z) = (1/log 2) log(1 + exp(−yz))
- Hinge loss: Lhin(y, z) = max{0, 1 − yz}
- Exponential loss: Lexp(y, z) = exp(−yz)
- Squared loss: Lsqr(y, z) = (y − z)²
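For reference, here is a minimal NumPy sketch of these losses as functions of the label y ∈ {−1, +1} and the score z = w⊤x + b; the sample points are arbitrary:

```python
import numpy as np

# Each loss takes the label y in {-1, +1} and the score z = w^T x + b.
zero_one    = lambda y, z: np.where(y * z <= 0, 1.0, 0.0)
logistic    = lambda y, z: np.log1p(np.exp(-y * z)) / np.log(2)
hinge       = lambda y, z: np.maximum(0.0, 1.0 - y * z)
exponential = lambda y, z: np.exp(-y * z)
squared     = lambda y, z: (y - z) ** 2

z = np.linspace(-2, 2, 5)
print(logistic(1, z))   # upper-bounds zero_one(1, z) everywhere
```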

SLIDE 51

Optimization Approach to Classification

Loss and Error

When a surrogate loss function is used, there are two ways to measure how a classifier performs on the training set: the training loss and the training error. Likewise, there are two ways to measure how it performs on the test set: the test loss and the test error. As in the case of regression, the test loss/error can be improved using regularization.
