SLIDE 1

Logistic Regression

  • Dr. Besnik Fetahu
SLIDE 2

Supervised Classification


X = {x^(1), …, x^(n)}    (input instances)
Y = {T, F}    (output labels, i.e. classes)
S = {(x^(i), y^(i))}_{i=1}^{m}    (training IID examples: input-target samples)
f(x^(i)) → y^(i)    (learn a function that maps x^(i) to y^(i))

SLIDE 3

Generative vs. Discriminative Classifiers

  • Generative and discriminative models are two different kinds of machine learning models used for classification
  • Generative models (e.g. Naïve Bayes) learn the joint distribution P(x, y):
  • How are the observations of the different classes generated? P(x|Y=y)
  • Discriminative models (e.g. Logistic Regression) learn only how to distinguish between the different classes:
  • Which features best distinguish the different classes? P(Y=y|x)

SLIDE 4

Generative vs. Discriminative Classifiers


Generative: will try to model what horses look like! Discriminative: will try to map horse instances to the correct class!

SLIDE 5

Generative Models

SLIDE 6

Naïve Bayes

  • For an input instance x (e.g. a document), predict the class y (e.g. the topic):

y_max = argmax_{y ∈ Y} P(Y = y | x)
      = argmax_{y ∈ Y} P(x | Y = y) P(Y = y) / P(x)
      = argmax_{y ∈ Y} P(x | Y = y) P(Y = y)
      = argmax_{y ∈ Y} P(x_1 … x_k | Y = y) P(Y = y)

P(x | Y = y) is the likelihood; P(Y = y) is the prior.

SLIDE 7

Naïve Bayes

y_max = argmax_{y ∈ Y} P(x_1 … x_k | Y = y) P(Y = y)
      = argmax_{y ∈ Y} P(x_1 | y) · … · P(x_k | y) · P(y)
      = argmax_{y ∈ Y} P(y) Π_{i=1}^{k} P(x_i | y)

Feature independence assumption
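In practice this product is computed in log space to avoid floating-point underflow. A minimal sketch, with hypothetical priors and per-class word likelihoods (not from the slides):

```python
import math

# Hypothetical toy model: priors P(y) and per-class word likelihoods P(x_i | y).
priors = {"sports": 0.5, "politics": 0.5}
likelihoods = {
    "sports":   {"game": 0.10, "vote": 0.01},
    "politics": {"game": 0.01, "vote": 0.10},
}

def nb_predict(words):
    """argmax_y log P(y) + sum_i log P(x_i | y), using the independence assumption."""
    scores = {
        y: math.log(priors[y]) + sum(math.log(likelihoods[y][w]) for w in words)
        for y in priors
    }
    return max(scores, key=scores.get)

print(nb_predict(["game", "game", "vote"]))  # -> "sports"
```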

SLIDE 8

Generative Classifiers

  • Generative models try to model the input space (e.g. what are the characteristics of instances belonging to some class y)
  • Use the Bayes rule to make predictions
  • By modelling P(x|y), generative models solve an intermediate problem that is not directly related to P(y|x): which class does x belong to?
  • The number of parameters, O(|X| · n · |Y|), is linear in the feature space and the number of classes
  • Describe how likely a class y is to generate some instance x (the likelihood term)

SLIDE 9

Discriminative Models

SLIDE 10

Discriminative Models

  • Map the input instance features to the correct target label!
  • Discriminative models optimize directly for accuracy in predicting the right class.
  • Assign high weights to features of the input instances that have a high ability to discriminate between the different classes.
  • Logistic regression is a discriminative model
  • Use a sigmoid or softmax function to determine the right class for P(y|x)

SLIDE 11

Logistic Regression

  • What do we need for a logistic regression model in the binary case?
  • Feature representation: x^(i) = [x^(i)_1 … x^(i)_k]
  • Classification function: sigmoid function
  • Objective function for learning (loss function)
  • Algorithm for optimizing the loss function
  • LR learns a set of feature weights w and a bias factor b based on some training data for the classification task.

SLIDE 12

LR – Classification

  • Classification: z = (Σ_{i=1}^{k} w_i x_i) + b
  • w represents the importance of the individual features for our input space (e.g. “awesome” is important in determining positive sentiment)
  • b is the bias term, also called the intercept

SLIDE 13

LR – Classification

  • Classification: z = (Σ_{i=1}^{k} w_i x_i) + b
  • To classify, we push z through a sigmoid function (aka the logistic function):

σ(z) = 1 / (1 + e^(−z))

SLIDE 14

LR – Classification

σ(z) = 1 / (1 + e^(−z))

SLIDE 15

LR – Classification

  • How can we classify through the sigmoid function?

P(y = 1) = σ(w · x + b) = 1 / (1 + e^(−(w · x + b)))
P(y = 0) = 1 − σ(w · x + b) = e^(−(w · x + b)) / (1 + e^(−(w · x + b)))

ŷ = 1 if P(y = 1 | x) > 0.5, else 0    (decision boundary)
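A minimal sketch of this decision rule (the weights, bias, and input below are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x, threshold=0.5):
    """P(y=1|x) = sigmoid(w·x + b); predict 1 above the decision boundary."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return (1 if p > threshold else 0), p

label, p = predict(w=[0.8, -0.4], b=0.1, x=[1.0, 2.0])
print(label, round(p, 3))  # 1 0.525
```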

SLIDE 16

LR – Feature Space

SLIDE 17

LR – Classification Example

  • Assume we know the optimal w and b:

w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7]
b = 0.1

P(Y = 1 | x) = σ(w · x + b)
             = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.15] + 0.1)
             = σ(0.805) ≈ 0.69
P(Y = 0 | x) = 1 − σ(w · x + b) ≈ 0.31
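A quick check of this arithmetic (values copied from the slide above):

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
x = [3, 2, 1, 3, 0, 4.15]
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # dot product plus bias
p1 = 1.0 / (1.0 + math.exp(-z))                # P(Y=1|x)
print(round(z, 3), round(p1, 2), round(1 - p1, 2))  # 0.805 0.69 0.31
```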

SLIDE 18

LR – Feature Design/Engineering

  • Design features based on the training set
  • Features should reflect linguistic intuitions (e.g. a document with positive sentiment will contain more words that have a prior positive sentiment)
  • n-gram features to capture contextual/topical information in NLP tasks
  • POS tags to capture stylistic information
  • What features would be useful to determine sentence boundaries?
  • How about correlated features?

SLIDE 19

How do we learn the parameters of LR?

SLIDE 20

Cross-entropy loss function

  • Why do we need a loss function? L(ŷ, y) = how much does our prediction ŷ differ from y
  • What function can we use for L?
  • MSE (mean squared error), used in regression, is very hard to optimize for probabilistic output.
  • Conditional maximum likelihood?
  • Choose w, b such that they maximize the log probability of the true labels in the training data (the negative log likelihood loss is also called the cross-entropy loss)

SLIDE 21

Cross-entropy loss function

  • The binary labelling case can be expressed in terms of the Bernoulli distribution:

p(y|x) = ŷ^y (1 − ŷ)^(1−y)
log p(y|x) = log[ŷ^y (1 − ŷ)^(1−y)] = y log ŷ + (1 − y) log(1 − ŷ)

This is the log likelihood that should be maximized, such that w, b maximize the probability of our predictions being close to the true labels.

SLIDE 22

Cross-entropy loss function

  • To obtain a loss function that we can minimize, we flip the sign of the log likelihood:

L_CE(ŷ, y) = − log p(y|x) = −[y log ŷ + (1 − y) log(1 − ŷ)]

ŷ = σ(w · x + b)    (the LR model)

L_CE(w, b) = −[y log σ(w · x + b) + (1 − y) log(1 − σ(w · x + b))]
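A minimal sketch of this loss for a single example (the weights, input, and labels below are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(w, b, x, y):
    """L_CE = -[y log(y_hat) + (1 - y) log(1 - y_hat)], with y_hat = sigmoid(w·x + b)."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Near-zero loss when the prediction matches the label, large loss when it does not.
print(binary_cross_entropy([2.0], 0.0, [3.0], y=1))  # sigmoid(6) ≈ 0.998 -> loss ≈ 0.002
print(binary_cross_entropy([2.0], 0.0, [3.0], y=0))  # same prediction, wrong label -> loss ≈ 6.0
```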

SLIDE 23

Cross-entropy loss function

  • Why do we need to minimize the negative log likelihood?
  • A perfect classifier would assign probability close to 1 to the correct class (y=1 or y=0)
  • The closer our prediction is to 1 the better the classifier; vice versa, the closer it is to zero the worse it is.
  • The loss goes to zero for perfect classification, whereas it goes to infinity for the cases where we get everything wrong (log 0)
  • Since the two probabilities in our loss function sum to one, maximizing the probability of the correct label comes at the expense of the wrong label.

SLIDE 24

Cross-entropy loss function

  • Loss function for the entire training set:

Cost(w, b) = (1/m) Σ_{i=1}^{m} L_CE(ŷ^(i), y^(i))
           = −(1/m) Σ_{i=1}^{m} [ y^(i) log σ(w · x^(i) + b) + (1 − y^(i)) log(1 − σ(w · x^(i) + b)) ]

SLIDE 25

How can we find the minimum?

SLIDE 26

Gradient Descent – GD

  • Optimal parameters for our loss function:

θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} L_CE(y^(i), x^(i); θ)

  • GD finds the minimum of a function by figuring out in which direction in the parameter space the function’s slope is rising most steeply, and moving in the opposite direction.
  • In the case of convex functions, GD finds the global optimum (minimum)
  • Cross-entropy loss is a convex function

SLIDE 27

Gradient Descent – GD

SLIDE 28

Gradient Descent – GD

  • GD computes the gradient of the loss function at a given point and then moves in the opposite direction, s.t. the loss function is minimized
  • The magnitude of the move in gradient descent is determined by the value of the slope (or derivative), weighted by some learning rate
  • In the case of a function with one parameter:

w_{t+1} = w_t − η (d/dw) f(x; w)
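A minimal sketch of this one-parameter update rule, assuming a hypothetical loss f(w) = (w − 3)²:

```python
def df(w):
    return 2 * (w - 3)  # derivative of f(w) = (w - 3)^2

w, eta = 0.0, 0.1
for t in range(50):
    w -= eta * df(w)   # w_{t+1} = w_t - eta * f'(w_t)
print(round(w, 4))     # ≈ 3.0, the minimum of f
```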

SLIDE 29

Gradient Descent – GD


Gradient descent with small (top) and large (bottom) learning rates. Source: Andrew Ng’s Machine Learning course on Coursera

However, with each time step the gradient becomes smaller and smaller; thus there is no need to adaptively adjust the learning rate, as the steps shrink on their own once the slope becomes less steep.

SLIDE 30

Gradient Descent – GD

  • The cross-entropy loss function has many variables as parameters whose optimal values GD needs to find; thus, we operate in an N-dimensional space
  • The gradient expresses the directional components of the sharpest slope along each of those N dimensions

SLIDE 31

Gradient Descent – GD

  • Through GD we answer the question: “How much would a small change in w_i influence the total loss L?”

θ_{t+1} = θ_t − η ∇L(f(x; θ), y)

∇_θ L(f(x; θ), y) = [ ∂L(f(x; θ), y)/∂w_1, ∂L(f(x; θ), y)/∂w_2, …, ∂L(f(x; θ), y)/∂w_n ]ᵀ

SLIDE 32

Gradient Descent – GD

  • GD in the case of the cross-entropy loss:

Cost(w, b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log σ(w · x^(i) + b) + (1 − y^(i)) log(1 − σ(w · x^(i) + b)) ]

∂Cost(w, b)/∂w_j = (1/m) Σ_{i=1}^{m} [ σ(w · x^(i) + b) − y^(i) ] x_j^(i)
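Putting the cost and its gradient together, a minimal sketch of batch gradient descent for LR on a hypothetical toy dataset:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-feature dataset: label 1 for large x, 0 for small x.
data = [([0.5], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
w, b, eta, m = [0.0], 0.0, 0.5, len(data)

for epoch in range(1000):
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, y in data:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y  # sigma(w·x+b) - y
        grad_w = [gj + err * xj / m for gj, xj in zip(grad_w, x)]
        grad_b += err / m
    w = [wj - eta * gj for wj, gj in zip(w, grad_w)]  # move against the gradient
    b -= eta * grad_b

# The learned boundary w·x + b = 0 falls between the two groups of points.
print([round(wj, 2) for wj in w], round(b, 2))
```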

SLIDE 33

Gradient Descent – GD

Use the following derivatives to derive the partial derivative of the cross-entropy loss function:

dσ(z)/dz = σ(z)(1 − σ(z))
(d/dx) ln(x) = 1/x

∂Cost(w, b)/∂w_j = (1/m) Σ_{i=1}^{m} [ σ(w · x^(i) + b) − y^(i) ] x_j^(i)

SLIDE 34


Gradient Descent – GD

SLIDE 35

Gradient Descent – GD Example

  • Sentiment classification where each document has only two features (assume the true label is y = 1):

x = [x_1 = 3, x_2 = 2]    (x_1: count of positive lexicon words; x_2: count of negative lexicon words)
w_1 = w_2 = b = 0    (weights are initialized to zero)
η = 0.1    (learning rate)

∇_{w,b} = [ ∂L_CE(w,b)/∂w_1, ∂L_CE(w,b)/∂w_2, ∂L_CE(w,b)/∂b ]ᵀ
        = [ (σ(w · x + b) − y) x_1, (σ(w · x + b) − y) x_2, σ(w · x + b) − y ]ᵀ
        = [ (σ(0) − 1) x_1, (σ(0) − 1) x_2, σ(0) − 1 ]ᵀ
        = [ −0.5 x_1, −0.5 x_2, −0.5 ]ᵀ
        = [ −1.5, −1.0, −0.5 ]ᵀ

θ_2 = [ w_1, w_2, b ]ᵀ − η [ −1.5, −1.0, −0.5 ]ᵀ = [ 0.15, 0.10, 0.05 ]ᵀ
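A minimal sketch reproducing this single update step (y = 1, as assumed above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = [3.0, 2.0], 1             # positive/negative lexicon counts, true label
w, b, eta = [0.0, 0.0], 0.0, 0.1

err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y  # sigma(0) - 1 = -0.5
grad = [err * x[0], err * x[1], err]                         # [-1.5, -1.0, -0.5]
w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]
b -= eta * grad[2]
print([round(wj, 2) for wj in w], round(b, 2))  # [0.15, 0.1] 0.05
```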

SLIDE 36

Regularization

SLIDE 37

Regularization

  • If a feature perfectly predicts the class in the training data, the weight of that feature will be very high
  • In many cases this leads to overfitting and models that are not robust to noise and do not generalize well
  • Regularization is a way to avoid overfitting

ŵ = argmax_w Σ_{i=1}^{m} log P(y^(i) | x^(i)) − α R(w)

R(w) is the regularization term

SLIDE 38

Regularization

  • L1 regularization – lasso:
  • Represents the Manhattan distance, and is the sum of the absolute values of the weights

R(w) = ||w||_1 = Σ_{i=1}^{m} |w_i|

  • L2 regularization – ridge:
  • Represents the (squared) Euclidean distance, and is the sum of the squares of the weight values

R(w) = ||w||_2^2 = Σ_{i=1}^{m} w_i^2
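A minimal sketch of the two penalty terms; α and the weights below are hypothetical values:

```python
def l1(w):
    """Lasso penalty: R(w) = ||w||_1, the sum of absolute weight values."""
    return sum(abs(wi) for wi in w)

def l2(w):
    """Ridge penalty: R(w) = ||w||_2^2, the sum of squared weight values."""
    return sum(wi * wi for wi in w)

def regularized_objective(log_likelihood, w, alpha=0.01, penalty=l2):
    # Maximize: sum_i log P(y_i | x_i) - alpha * R(w)
    return log_likelihood - alpha * penalty(w)

w = [2.5, -5.0, -1.2]
print(l1(w), l2(w))  # 8.7 32.69
```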

SLIDE 39

Multinomial LR

SLIDE 40

Multinomial LR

  • What if we have more than 2 classes?
  • We can adapt the LR model to classify more than two classes by changing its classification function to the softmax function:

softmax(z_i) = e^{z_i} / Σ_{j=1}^{k} e^{z_j}

P(Y = c | x) = e^{w_c · x + b_c} / Σ_{j=1}^{k} e^{w_j · x + b_j}
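A minimal sketch of the softmax; subtracting max(z) before exponentiating is a standard numerical-stability trick, not something from the slides:

```python
import math

def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # probabilities sum to 1; the largest logit gets the largest share
```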

SLIDE 41

Features in Multinomial LR

SLIDE 42

Learning in Multinomial LR

  • What is the loss function for the multinomial LR?

L_CE(ŷ, y) = − Σ_{k=1}^{K} 1{y = k} log P(Y = k | x)
           = − Σ_{k=1}^{K} 1{y = k} log ( e^{w_k · x + b_k} / Σ_{j=1}^{K} e^{w_j · x + b_j} )

∂L_CE/∂w_k = −(1{y = k} − P(Y = k | x)) x
           = −( 1{y = k} − e^{w_k · x + b_k} / Σ_{j=1}^{K} e^{w_j · x + b_j} ) x
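A minimal sketch of this loss for one example, reusing the softmax above (the logits are hypothetical):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_ce(logits, true_class):
    """-log P(Y = true_class | x): only the true class survives the 1{y = k} indicator."""
    return -math.log(softmax(logits)[true_class])

# logits z_k = w_k · x + b_k for each of 3 classes (hypothetical values).
print(multinomial_ce([2.0, 1.0, 0.1], true_class=0))  # small loss: correct class scores highest
print(multinomial_ce([2.0, 1.0, 0.1], true_class=2))  # larger loss
```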

SLIDE 43

Resources

  • https://web.stanford.edu/~jurafsky/slp3/5.pdf

SLIDE 44

Upcoming Lecture

  • Penn Treebank
  • HMM POS tagger
  • LR POS tagger
  • Extra: CRF POS tagger
