SLIDE 1

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 24: Logistic Regression

SLIDE 2

Binary Logistic Regression

In logistic regression, we are given a set of d predictor or independent variables X1, X2, …, Xd, and a binary or Bernoulli response variable Y that takes on only two values, namely 0 and 1. Since there are only two outcomes for the response variable Y, its probability mass function for X̃ = x̃ is given as

P(Y = 1 | X̃ = x̃) = π(x̃)
P(Y = 0 | X̃ = x̃) = 1 − π(x̃)

where π(x̃) is the unknown true parameter value, denoting the probability of Y = 1 given X̃ = x̃.

SLIDE 3

Binary Logistic Regression

Instead of directly predicting the response value, the goal is to learn the probability P(Y = 1 | X̃ = x̃), which is also the expected value of Y given X̃ = x̃. Since P(Y = 1 | X̃ = x̃) is a probability, it is not appropriate to use the linear regression model directly: a linear function f(x̃) can be arbitrarily large or arbitrarily small, whereas the output must be a valid probability value. The name "logistic regression" comes from the logistic function (also called the sigmoid function), which "squashes" any scalar input z to a value between 0 and 1:

θ(z) = 1 / (1 + exp{−z}) = exp{z} / (1 + exp{z})    (1)
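As a quick illustration (not from the slides), here is a minimal Python sketch of the logistic function in (1); the function name and the numerically stable two-branch form are our own choices:

import math

def sigmoid(z: float) -> float:
    """Logistic function θ(z) = 1 / (1 + exp(-z)), computed stably for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) would overflow; use the equivalent
    # form exp(z) / (1 + exp(z)) instead.
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0.0))    # 0.5, the threshold value
print(sigmoid(100.0))  # ~1.0
print(sigmoid(-100.0)) # ~0.0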

SLIDE 4

Logistic Function

[Figure: the logistic function θ(z) plotted for z ranging from −∞ to +∞; θ(z) increases from 0 to 1, crossing 0.5 at z = 0.]

SLIDE 5

Logistic Function

Example

The figure shows the plot of the logistic function for z ranging from −∞ to +∞. In particular, consider what happens when z is −∞, +∞, and 0:

θ(−∞) = 1 / (1 + exp{∞}) = 1/∞ = 0
θ(+∞) = 1 / (1 + exp{−∞}) = 1/1 = 1
θ(0) = 1 / (1 + exp{0}) = 1/2 = 0.5

As desired, θ(z) lies in the range [0, 1], and z = 0 is the "threshold" value in the sense that for z > 0 we have θ(z) > 0.5, and for z < 0 we have θ(z) < 0.5. Thus, interpreting θ(z) as a probability, the larger the z value, the higher the probability. Another interesting property of the logistic function is that

1 − θ(z) = 1 − exp{z} / (1 + exp{z}) = (1 + exp{z} − exp{z}) / (1 + exp{z}) = 1 / (1 + exp{z}) = θ(−z)    (2)
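A quick numeric check of the threshold value and of the symmetry property (2), reusing the sigmoid() sketch above:

# Verify θ(0) = 0.5 and 1 - θ(z) = θ(-z) at a few sample points.
for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs((1.0 - sigmoid(z)) - sigmoid(-z)) < 1e-12
print(sigmoid(0.0))  # 0.5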

SLIDE 6

Binary Logistic Regression

Using the logistic function, we define the logistic regression model as follows:

P(Y = 1 | X̃ = x̃) = π(x̃) = θ(f(x̃)) = θ(ω̃ᵀx̃) = exp{ω̃ᵀx̃} / (1 + exp{ω̃ᵀx̃})    (3)

Thus, the probability that the response is Y = 1 is the output of the logistic function for the input ω̃ᵀx̃. On the other hand, the probability for Y = 0 is given as

P(Y = 0 | X̃ = x̃) = 1 − P(Y = 1 | X̃ = x̃) = θ(−ω̃ᵀx̃) = 1 / (1 + exp{ω̃ᵀx̃})

that is, 1 − θ(z) = θ(−z) for z = ω̃ᵀx̃. Combining these two cases, the full logistic regression model is given as

P(Y | X̃ = x̃) = θ(ω̃ᵀx̃)^Y · θ(−ω̃ᵀx̃)^(1−Y)    (4)

since Y is a Bernoulli random variable that takes on either the value 1 or 0. We can observe that P(Y | X̃ = x̃) = θ(ω̃ᵀx̃) when Y = 1 and P(Y | X̃ = x̃) = θ(−ω̃ᵀx̃) when Y = 0, as desired.
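A minimal sketch of the model in (3), assuming an augmented input vector whose first component is the constant 1 (the bias input); the function name predict_proba and the sample numbers are our own, and it reuses sigmoid() from above:

def predict_proba(w: list[float], x: list[float]) -> float:
    """P(Y = 1 | x) = θ(wᵀx) for an augmented point x (x[0] == 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

# Example with a hypothetical weight vector:
w = [-1.0, 2.0, 0.5]
x = [1.0, 0.3, -0.2]  # augmented: leading 1 is the bias input
print(predict_proba(w, x))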

SLIDE 7

Log-Odds Ratio

Define the odds ratio for the occurrence of Y = 1 as follows:

odds(Y = 1 | X̃ = x̃) = P(Y = 1 | X̃ = x̃) / P(Y = 0 | X̃ = x̃)
                      = θ(ω̃ᵀx̃) / θ(−ω̃ᵀx̃)
                      = ( exp{ω̃ᵀx̃} / (1 + exp{ω̃ᵀx̃}) ) · (1 + exp{ω̃ᵀx̃})
                      = exp{ω̃ᵀx̃}    (5)

The logarithm of the odds ratio, called the log-odds ratio, is therefore given as:

ln( odds(Y = 1 | X̃ = x̃) ) = ln( P(Y = 1 | X̃ = x̃) / (1 − P(Y = 1 | X̃ = x̃)) )
                           = ln( exp{ω̃ᵀx̃} )
                           = ω̃ᵀx̃ = ω0·x0 + ω1·x1 + ⋯ + ωd·xd    (6)

The log-odds ratio function is also called the logit function, defined as

logit(z) = ln( z / (1 − z) )

It is the inverse of the logistic function.
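A short sketch of the logit function, with a check that it inverts the logistic function; this reuses the sigmoid() sketch above:

import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1): ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# logit is the inverse of the logistic function: logit(θ(z)) = z.
for z in (-2.0, 0.0, 1.5):
    assert abs(logit(sigmoid(z)) - z) < 1e-9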

SLIDE 8

Log-Odds Ratio

We can see that

ln( odds(Y = 1 | X̃ = x̃) ) = logit( P(Y = 1 | X̃ = x̃) )

The logistic regression model is therefore based on the assumption that the log-odds ratio for Y = 1 given X̃ = x̃ is a linear function (or a weighted sum) of the independent attributes. In particular, consider the effect of attribute Xi by fixing the values of all other attributes; we get

ln( odds(Y = 1 | X̃ = x̃) ) = ωi·xi + C
⟹ odds(Y = 1 | X̃ = x̃) = exp{ωi·xi + C} = exp{ωi·xi} · exp{C} ∝ exp{ωi·xi}

where C is a constant comprising the fixed attributes. The regression coefficient ωi can therefore be interpreted as the change in the log-odds ratio for Y = 1 per unit change in Xi, or equivalently, the odds ratio for Y = 1 increases exponentially per unit change in Xi.
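As an illustration with our own numbers (not from the slides): a coefficient of ωi = 0.7 means each unit increase in Xi multiplies the odds of Y = 1 by exp{0.7} ≈ 2.01, i.e., roughly doubles them:

import math

# Hypothetical coefficient: each unit increase in X_i scales the odds
# of Y = 1 by exp(w_i).
w_i = 0.7
print(math.exp(w_i))  # ≈ 2.01: the odds roughly double per unit change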

SLIDE 9

Maximum Likelihood Estimation

We will use the maximum likelihood approach to learn the weight vector w̃. The likelihood is defined as the probability of the observed data given the estimated parameters w̃:

L(w̃) = P(Y | w̃) = ∏_{i=1}^n P(yi | x̃i) = ∏_{i=1}^n θ(w̃ᵀx̃i)^{yi} · θ(−w̃ᵀx̃i)^{1−yi}

Instead of trying to maximize the likelihood, we can maximize the logarithm of the likelihood, called the log-likelihood, to convert the product into a summation as follows:

ln(L(w̃)) = ∑_{i=1}^n [ yi · ln(θ(w̃ᵀx̃i)) + (1 − yi) · ln(θ(−w̃ᵀx̃i)) ]    (7)

SLIDE 10

Maximum Likelihood Estimation

The negative of the log-likelihood can also be considered as an error function, the cross-entropy error function, given as follows:

E(w̃) = −ln(L(w̃)) = ∑_{i=1}^n [ yi · ln( 1 / θ(w̃ᵀx̃i) ) + (1 − yi) · ln( 1 / (1 − θ(w̃ᵀx̃i)) ) ]    (8)

The task of maximizing the log-likelihood is therefore equivalent to minimizing the cross-entropy error.
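A minimal sketch of the cross-entropy error (8) over a dataset, reusing predict_proba() from above; the small clamp away from 0 and 1 is our own addition to avoid log(0):

import math

def cross_entropy(w, X, y, eps=1e-12):
    """E(w) = -sum_i [ y_i ln θ(wᵀx_i) + (1 - y_i) ln (1 - θ(wᵀx_i)) ]."""
    err = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, xi)         # θ(wᵀx_i)
        p = min(max(p, eps), 1.0 - eps)  # clamp away from 0 and 1
        err -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return err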

SLIDE 11

Maximum Likelihood Estimation

Typically, to obtain the optimal weight vector w̃, we would differentiate the log-likelihood function with respect to w̃, set the result to 0, and then solve for w̃. However, for the log-likelihood formulation there is no closed-form solution for the weight vector w̃. Instead, we use an iterative gradient ascent method to compute the optimal value. The gradient ascent method relies on the gradient of the log-likelihood function, which can be obtained by taking its partial derivative with respect to w̃, as follows:

∇(w̃) = ∂/∂w̃ [ ln(L(w̃)) ] = ∂/∂w̃ ∑_{i=1}^n [ yi · ln(θ(zi)) + (1 − yi) · ln(θ(−zi)) ]    (9)

where zi = w̃ᵀx̃i.

SLIDE 12

Maximum Likelihood Estimation

The gradient ascent method starts at some initial estimate for w̃, denoted w̃⁰. At each step t, the method moves in the direction of steepest ascent, which is given by the gradient vector. Thus, given the current estimate w̃ᵗ, we can obtain the next estimate as follows:

w̃ᵗ⁺¹ = w̃ᵗ + η · ∇(w̃ᵗ)    (10)

Here, η > 0 is a user-specified parameter called the learning rate. It should not be too large, otherwise the estimates will vary wildly from one iteration to the next; and it should not be too small, otherwise it will take a long time to converge. At the optimal value of w̃, the gradient will be zero, i.e., ∇(w̃) = 0, as desired.
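A minimal batch gradient ascent sketch under these definitions, reusing predict_proba() from above. It sums the per-point gradient (yi − θ(w̃ᵀx̃i))·x̃i over all points, the closed form stated in (11) on the next slide; the function name and default parameters are our own:

def batch_gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iter=10_000):
    """Batch gradient ascent on the log-likelihood.

    X: list of augmented points (x[0] == 1); y: list of 0/1 labels.
    The gradient of ln L(w) is sum_i (y_i - θ(wᵀx_i)) · x_i.
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(max_iter):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            r = yi - predict_proba(w, xi)  # y_i - θ(wᵀx_i)
            for j in range(d):
                grad[j] += r * xi[j]
        w_new = [wj + eta * gj for wj, gj in zip(w, grad)]
        # Stop when the weights barely move between iterations.
        if sum((a - b) ** 2 for a, b in zip(w_new, w)) ** 0.5 <= eps:
            return w_new
        w = w_new
    return w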

SLIDE 13

Stochastic Gradient Ascent

The gradient ascent method computes the gradient by considering all the data points, and it is therefore called batch gradient ascent. For large datasets, it is typically much faster to compute the gradient by considering only one (randomly chosen) point at a time. The weight vector is updated after each such partial gradient step, giving rise to stochastic gradient ascent (SGA) for computing the optimal weight vector w̃. Given a randomly chosen point x̃i, the point-specific gradient is given as

∇(w̃, x̃i) = ( yi − θ(w̃ᵀx̃i) ) · x̃i    (11)

Unlike batch gradient ascent, which updates w̃ by considering all the points, stochastic gradient ascent updates the weight vector after observing each point, and the updated values are used immediately in the next update. Computing the full gradient in the batch approach can be very expensive. In contrast, computing the partial gradient at each point is very fast, and due to the stochastic updates to w̃, SGA is typically much faster than the batch approach for very large datasets.

SLIDE 14

Binary Logistic Regression

Once the model has been trained, we can predict the response for any new augmented test point z̃ as follows:

ŷ = 1 if θ(w̃ᵀz̃) ≥ 0.5;  ŷ = 0 if θ(w̃ᵀz̃) < 0.5    (12)

SLIDE 15

Logistic Regression: Stochastic Gradient Ascent

LogisticRegression-SGA (D, η, ε):

foreach xi ∈ D do
    x̃i ← (1, xiᵀ)ᵀ    // map to R^{d+1}
t ← 0    // step/iteration counter
w̃⁰ ← (0, …, 0)ᵀ ∈ R^{d+1}    // initial weight vector
repeat
    w̃ ← w̃ᵗ    // make a copy of w̃ᵗ
    foreach x̃i ∈ D̃ in random order do
        ∇(w̃, x̃i) ← (yi − θ(w̃ᵀx̃i)) · x̃i    // compute gradient at x̃i
        w̃ ← w̃ + η · ∇(w̃, x̃i)    // update estimate for w̃
    w̃ᵗ⁺¹ ← w̃    // update w̃ᵗ⁺¹
    t ← t + 1
until ‖w̃ᵗ − w̃ᵗ⁻¹‖ ≤ ε
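A self-contained, runnable Python sketch of this algorithm; the function name, the max_epochs safeguard, and the default parameters are our own choices:

import math
import random

def logistic_regression_sga(D, labels, eta=0.01, eps=1e-6, max_epochs=1000):
    """Stochastic gradient ascent for binary logistic regression.

    D: list of raw points (each a list of d features); labels: 0/1 values.
    Each point is augmented with a leading 1 for the bias term.
    """
    X = [[1.0] + list(x) for x in D]  # map to R^{d+1}
    d = len(X[0])
    w = [0.0] * d
    for _ in range(max_epochs):
        w_prev = list(w)
        order = list(range(len(X)))
        random.shuffle(order)  # visit the points in random order
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i]))
            # Stable logistic function θ(z):
            p = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
            r = labels[i] - p  # y_i - θ(wᵀx_i)
            w = [wj + eta * r * xj for wj, xj in zip(w, X[i])]
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_prev)))
        if delta <= eps:
            break
    return w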

SLIDE 16

Logistic Regression

[Figure: Iris principal components data with the fitted logistic regression surface over axes X1, X2, and Y. Misclassified points are shown in dark gray. Circles denote Iris-virginica; triangles denote the other two Iris types.]

SLIDE 17

Linear Regression

[Figure: Iris principal components data with the plane of best fit from linear regression over axes X1, X2, and Y. Misclassified points are shown in dark gray. Circles denote Iris-virginica; triangles denote the other two Iris types.]

SLIDE 18

Logistic Regression

Example

The figure shows the output of logistic regression modeling on the Iris principal components data, where the independent attributes X1 and X2 represent the first two principal components, and the binary response variable Y represents the type of Iris flower: Y = 1 corresponds to Iris-virginica, whereas Y = 0 corresponds to the two other Iris types, namely Iris-setosa and Iris-versicolor. The fitted logistic model is given as

w̃ = (w0, w1, w2)ᵀ = (−6.79, −5.07, −3.29)ᵀ
P(Y = 1 | x̃) = θ(w̃ᵀx̃) = 1 / (1 + exp{6.79 + 5.07·x1 + 3.29·x2})

The figure plots P(Y = 1 | x̃) for various values of x̃. Given x̃, if P(Y = 1 | x̃) ≥ 0.5, then we predict ŷ = 1; otherwise we predict ŷ = 0.

SLIDE 19

Logistic Regression

Example

The figure shows that five points (in dark gray) are misclassified. For example, for x̃ = (1, −0.52, −1.19)ᵀ we have:

P(Y = 1 | x̃) = θ(w̃ᵀx̃) = θ(−0.24) = 0.44
P(Y = 0 | x̃) = 1 − P(Y = 1 | x̃) = 0.56

Thus, the predicted response for x̃ is ŷ = 0, whereas the true class is y = 1. The plane of best fit in linear regression is shown in the figure, with the weight vector

w̃ = (0.333, −0.167, 0.074)ᵀ
ŷ = f(x̃) = 0.333 − 0.167·x1 + 0.074·x2

Since the response variable Y is binary, we predict the class as ŷ = 1 if f(x̃) ≥ 0.5, and ŷ = 0 otherwise. The linear regression model misclassifies 17 points (dark gray points). Since there are n = 150 points in total, this results in a training-set or in-sample accuracy of 88.7% for linear regression, while logistic regression misclassifies only 5 points, an accuracy of 96.7%, which is a much better fit.
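A quick check of this example point against the fitted weights from the previous slide, reusing the sigmoid() sketch above:

w = [-6.79, -5.07, -3.29]  # fitted (w0, w1, w2) from the slide
x = [1.0, -0.52, -1.19]    # augmented example point
z = sum(wi * xi for wi, xi in zip(w, x))
print(round(z, 2))           # ≈ -0.24
print(round(sigmoid(z), 2))  # ≈ 0.44, so the prediction is ŷ = 0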

SLIDE 20

Multiclass Logistic Regression

We now generalize logistic regression to the case where the response variable Y can take on K distinct nominal categorical values, called classes, i.e., Y ∈ {c1, c2, …, cK}. We model Y as a K-dimensional multivariate Bernoulli random variable. Since Y can assume only one of the K values, we use the one-hot encoding approach to map each categorical value ci to the K-dimensional binary vector

ei = (0, …, 0, 1, 0, …, 0)ᵀ

whose ith element eii = 1, and all other elements eij = 0, so that ∑_{j=1}^K eij = 1. The probability mass function for Y given X̃ = x̃ is

P(Y = ei | X̃ = x̃) = πi(x̃), for i = 1, 2, …, K

where πi(x̃) is the (unknown) probability of observing class ci given X̃ = x̃. Thus, there are K unknown parameters, which must satisfy the following constraint:

∑_{i=1}^K πi(x̃) = ∑_{i=1}^K P(Y = ei | X̃ = x̃) = 1
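A minimal sketch of this one-hot encoding; the function name is our own:

def one_hot(c, classes):
    """Map a categorical value c to the binary vector e_i with a single 1."""
    e = [0] * len(classes)
    e[classes.index(c)] = 1
    return e

print(one_hot("c2", ["c1", "c2", "c3"]))  # [0, 1, 0]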

SLIDE 21

Multiclass Logistic Regression

Given that only one element of Y is 1, the probability mass function of Y can be written compactly as

P(Y | X̃ = x̃) = ∏_{j=1}^K ( πj(x̃) )^{Yj}    (13)

Note that if Y = ei, only Yi = 1 and the rest of the elements Yj = 0 for j ≠ i.

In multiclass logistic regression, we select one of the values, say cK, as a reference or base class, and consider the log-odds ratio of the other classes with respect to cK; we assume that each of these log-odds ratios is linear in X̃, but with a different augmented weight vector ω̃i for class ci. That is, the log-odds ratio of class ci with respect to class cK is assumed to satisfy

ln( odds(Y = ei | X̃ = x̃) ) = ln( P(Y = ei | X̃ = x̃) / P(Y = eK | X̃ = x̃) ) = ln( πi(x̃) / πK(x̃) ) = ω̃iᵀx̃ = ωi0·x0 + ωi1·x1 + ⋯ + ωid·xd

where ωi0 = βi is the true bias value for class ci.

SLIDE 22

Multiclass Logistic Regression

Setting ω̃K = 0, we have exp{ω̃Kᵀx̃} = 1, and thus we can write the full model for multiclass logistic regression as follows:

πi(x̃) = exp{ω̃iᵀx̃} / ∑_{j=1}^K exp{ω̃jᵀx̃}, for all i = 1, 2, …, K    (14)

This function is also called the softmax function. When K = 2, this formulation yields exactly the same model as in binary logistic regression.
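A minimal sketch of the softmax in (14); subtracting the maximum score before exponentiating is a standard numerical-stability trick and is our own addition:

import math

def softmax(scores):
    """π_i = exp(z_i) / sum_j exp(z_j), computed stably by shifting by max(z)."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([2.0, 1.0, 0.0]))  # probabilities summing to 1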

SLIDE 23

Maximum Likelihood Estimation

To find the K sets of regression weight vectors w̃i, for i = 1, 2, …, K, we use the gradient ascent approach to maximize the log-likelihood function. The likelihood of the data is given as

L(W̃) = P(Y | W̃) = ∏_{i=1}^n P(yi | X̃ = x̃i) = ∏_{i=1}^n ∏_{j=1}^K ( πj(x̃i) )^{yij}

where W̃ = {w̃1, w̃2, …, w̃K} is the set of K weight vectors. The log-likelihood is then given as:

ln(L(W̃)) = ∑_{i=1}^n ∑_{j=1}^K yij · ln(πj(x̃i)) = ∑_{i=1}^n ∑_{j=1}^K yij · ln( exp{w̃jᵀx̃i} / ∑_{a=1}^K exp{w̃aᵀx̃i} )    (15)

Note that the negative of the log-likelihood function can be regarded as an error function, commonly known as cross-entropy error. For stochastic gradient ascent, we update the weight vectors by considering only one point at a time.

SLIDE 24

Maximum Likelihood Estimation

The gradient of the log-likelihood function with respect to w̃j at a given point x̃i is given as

∇(w̃j, x̃i) = ( yij − πj(x̃i) ) · x̃i    (16)

which results in the following update rule for the jth weight vector:

w̃jᵗ⁺¹ = w̃jᵗ + η · ∇(w̃jᵗ, x̃i)    (17)

where w̃jᵗ denotes the estimate of w̃j at step t, and η is the learning rate. Once the model has been trained, we can predict the class for any new augmented test point z̃ as follows:

ŷ = arg max_{ci} { πi(z̃) } = arg max_{ci} { exp{w̃iᵀz̃} / ∑_{j=1}^K exp{w̃jᵀz̃} }    (18)

That is, we evaluate the softmax function, and then predict the class of z̃ as the one with the highest probability.
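A sketch of the prediction rule (18), reusing the softmax() sketch above; the function name is our own:

def predict_class(W, z, classes):
    """Pick the class with the highest softmax probability.

    W: list of K augmented weight vectors; z: augmented test point.
    """
    scores = [sum(wj * zj for wj, zj in zip(w, z)) for w in W]
    probs = softmax(scores)
    return classes[max(range(len(probs)), key=probs.__getitem__)]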

SLIDE 25

Multiclass Logistic Regression Algorithm

LogisticRegression-MultiClass (D, η, ε):

foreach (xiᵀ, yi) ∈ D do
    x̃i ← (1, xiᵀ)ᵀ    // map to R^{d+1}
    yi ← ej if yi = cj    // map yi to a K-dimensional Bernoulli vector
t ← 0    // step/iteration counter
foreach j = 1, 2, …, K do w̃jᵗ ← (0, …, 0)ᵀ ∈ R^{d+1}
repeat
    foreach j = 1, 2, …, K − 1 do w̃j ← w̃jᵗ    // make a copy of w̃jᵗ
    foreach x̃i ∈ D̃ in random order do
        foreach j = 1, 2, …, K − 1 do
            πj(x̃i) ← exp{w̃jᵀx̃i} / ∑_{a=1}^K exp{w̃aᵀx̃i}
            ∇(w̃j, x̃i) ← (yij − πj(x̃i)) · x̃i    // gradient at w̃j
            w̃j ← w̃j + η · ∇(w̃j, x̃i)    // update estimate for w̃j
    foreach j = 1, 2, …, K − 1 do w̃jᵗ⁺¹ ← w̃j    // update w̃jᵗ⁺¹
    t ← t + 1
until ∑_{j=1}^{K−1} ‖w̃jᵗ − w̃jᵗ⁻¹‖ ≤ ε
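A runnable Python sketch of this multiclass algorithm, reusing one_hot() and softmax() from above; all other names and default parameters are our own:

import random

def logistic_regression_multiclass(D, labels, classes, eta=0.01, eps=1e-6, max_epochs=1000):
    """Multiclass logistic regression via stochastic gradient ascent.

    D: raw points; labels: class values drawn from `classes`; the last
    class is the base class, whose weight vector stays fixed at zero.
    """
    X = [[1.0] + list(x) for x in D]           # augment with the bias input
    Y = [one_hot(c, classes) for c in labels]  # K-dim Bernoulli vectors
    K, d = len(classes), len(X[0])
    W = [[0.0] * d for _ in range(K)]          # w_K stays all-zero
    for _ in range(max_epochs):
        W_prev = [list(w) for w in W]
        order = list(range(len(X)))
        random.shuffle(order)                  # visit the points in random order
        for i in order:
            scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for w in W]
            probs = softmax(scores)            # π_j(x_i) for all j
            for j in range(K - 1):             # base class is not updated
                r = Y[i][j] - probs[j]         # y_ij - π_j(x_i)
                W[j] = [wj + eta * r * xj for wj, xj in zip(W[j], X[i])]
        delta = sum(
            sum((a - b) ** 2 for a, b in zip(W[j], W_prev[j])) ** 0.5
            for j in range(K - 1)
        )
        if delta <= eps:
            break
    return W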

SLIDE 26

Multiclass logistic regression

[Figure: Iris principal components data with the multiclass decision surfaces π1(x̃), π2(x̃), π3(x̃) over axes X1, X2, and Y. Misclassified points are shown in dark gray. All the points actually lie in the (X1, X2) plane, but c1 and c2 are shown displaced along Y with respect to the base class c3 purely for illustration purposes.]

SLIDE 27

Multiclass logistic regression

Example

Consider the Iris dataset, with n = 150 points in a 2D space spanned by the first two principal components, as shown in the figure. Here, the response variable takes on three values: Y = c1 corresponds to Iris-setosa (shown as squares), Y = c2 corresponds to Iris-versicolor (circles), and Y = c3 corresponds to Iris-virginica (triangles). Thus, we map Y = c1 to e1 = (1, 0, 0)ᵀ, Y = c2 to e2 = (0, 1, 0)ᵀ, and Y = c3 to e3 = (0, 0, 1)ᵀ. The multiclass logistic model uses Y = c3 (Iris-virginica; triangles) as the reference or base class. The fitted model is given as:

w̃1 = (−3.52, 3.62, 2.61)ᵀ
w̃2 = (−6.95, −5.18, −3.40)ᵀ
w̃3 = (0, 0, 0)ᵀ

SLIDE 28

Multiclass logistic regression

Example

The figure plots the decision surfaces corresponding to the softmax functions:

π1(x̃) = exp{w̃1ᵀx̃} / (1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃})
π2(x̃) = exp{w̃2ᵀx̃} / (1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃})
π3(x̃) = 1 / (1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃})

The surfaces indicate regions where one class dominates the others. It is important to note that the points for c1 and c2 are shown displaced along Y to emphasize the contrast with c3, which is the reference class. Overall, the training-set accuracy for the multiclass logistic classifier is 96.7%, since it misclassifies only five points (shown in dark gray). For example, for the point x̃ = (1, −0.52, −1.19)ᵀ, we have:

π1(x̃) ≈ 0    π2(x̃) = 0.448    π3(x̃) = 0.552

Thus, the predicted class is ŷ = arg max_{ci} {πi(x̃)} = c3, whereas the true class is y = c2.
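Checking this example point against the fitted multiclass weights, reusing the softmax() sketch above:

W = [[-3.52, 3.62, 2.61],    # w̃1 (Iris-setosa vs. base)
     [-6.95, -5.18, -3.40],  # w̃2 (Iris-versicolor vs. base)
     [0.0, 0.0, 0.0]]        # w̃3, the base class
x = [1.0, -0.52, -1.19]      # augmented example point
scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
print([round(p, 3) for p in softmax(scores)])  # ≈ [0.0, 0.448, 0.552]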

SLIDE 29

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 24: Logistic Regression
