SLIDE 1

Logistic Regression

  • Dr. Besnik Fetahu
SLIDE 2

Supervised Classification


X = {x^(1), …, x^(n)}    (input instances)
Y = {T, F}    (output labels, i.e. classes)
S = {(x^(i), y^(i))}_{i=1}^{m}    (training IID examples: input-target samples)
f(x^(i)) → y^(i)    (learn a function that maps x^(i) to y^(i))

SLIDE 3

Generative vs. Discriminative Classifiers

  • Generative and discriminative models are two different kinds of machine learning models used for classification
  • Generative models (e.g. Naïve Bayes) learn the joint distribution P(x, y):
  • How are the observations of the different classes generated? P(x|Y=y)
  • Discriminative models (e.g. Logistic Regression) learn only how to distinguish between the different classes:
  • Which features best distinguish the different classes? P(Y=y|x)

SLIDE 4

Generative vs. Discriminative Classifiers


Generative: will try to model what horses look like! Discriminative: will try to map horse instances to the correct class!

SLIDE 5

Generative Models

SLIDE 6

Naïve Bayes

  • For an input instance x (e.g. a document), predict the class y (e.g. the topic):

y_max = argmax_{y ∈ Y} P(Y = y | x)
      = argmax_{y ∈ Y} P(x | Y = y) P(Y = y) / P(x)
      = argmax_{y ∈ Y} P(x | Y = y) P(Y = y)
      = argmax_{y ∈ Y} P(x_1 … x_k | Y = y) P(Y = y)

P(x | Y = y) is the likelihood; P(Y = y) is the prior.

SLIDE 7

Naïve Bayes

y_max = argmax_{y ∈ Y} P(x_1 … x_k | Y = y) P(Y = y)
      = argmax_{y ∈ Y} P(x_1 | y) · … · P(x_k | y) · P(y)
      = argmax_{y ∈ Y} P(y) Π_{i=1}^{k} P(x_i | y)

Feature independence assumption
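In practice this product is computed in log space to avoid floating-point underflow. A minimal sketch, with hypothetical priors and per-class word likelihoods (not from the slides):

```python
import math

# Hypothetical toy model: priors P(y) and per-class word likelihoods P(x_i | y).
priors = {"sports": 0.5, "politics": 0.5}
likelihoods = {
    "sports":   {"game": 0.10, "vote": 0.01},
    "politics": {"game": 0.01, "vote": 0.10},
}

def nb_predict(words):
    """argmax_y log P(y) + sum_i log P(x_i | y), using the independence assumption."""
    scores = {
        y: math.log(priors[y]) + sum(math.log(likelihoods[y][w]) for w in words)
        for y in priors
    }
    return max(scores, key=scores.get)

print(nb_predict(["game", "game", "vote"]))  # -> "sports"
```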

SLIDE 8

Generative Classifiers

  • Generative models try to model the input space (e.g. what are the characteristics of instances belonging to some class y)
  • Use the Bayes rule to make predictions
  • By modelling P(x|y), generative models solve an intermediate problem that is not directly related to P(y|x): which class does x belong to?
  • The number of parameters, O(|X| · n · |Y|), is linear in the feature space and the number of classes
  • Describe how likely a class y is to generate some instance x (the likelihood term)

SLIDE 9

Discriminative Models

SLIDE 10

Discriminative Models

  • Map the input instance features to the correct target label!
  • Discriminative models optimize directly for accuracy in predicting the right class.
  • Assign high weights to features of the input instances that have a high ability to discriminate between the different classes.
  • Logistic regression is a discriminative model
  • Use a sigmoid or softmax function to determine the right class for P(y|x)

SLIDE 11

Logistic Regression

  • What do we need for a logistic regression model in the binary case?
  • Feature representation: x^(i) = [x^(i)_1 … x^(i)_k]
  • Classification function: sigmoid function
  • Objective function for learning (loss function)
  • Algorithm for optimizing the loss function
  • LR learns a set of feature weights w and a bias factor b based on some training data for the classification task.

SLIDE 12

LR – Classification

  • Classification: z = (Σ_{i=1}^{k} w_i x_i) + b
  • w represents the importance of the individual features for our input space (e.g. “awesome” is important in determining positive sentiment)
  • b is the bias term, also called the intercept

SLIDE 13

LR – Classification

  • Classification: z = (Σ_{i=1}^{k} w_i x_i) + b
  • To classify, we push z through a sigmoid function (aka the logistic function):

σ(z) = 1 / (1 + e^(−z))

SLIDE 14

LR – Classification

σ(z) = 1 / (1 + e^(−z))

SLIDE 15

LR – Classification

  • How can we classify through the sigmoid function?

P(y = 1) = σ(w · x + b) = 1 / (1 + e^(−(w · x + b)))
P(y = 0) = 1 − σ(w · x + b) = e^(−(w · x + b)) / (1 + e^(−(w · x + b)))

ŷ = 1 if P(y = 1 | x) > 0.5, else 0    (decision boundary)
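A minimal sketch of this decision rule (the weights, bias, and input below are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x, threshold=0.5):
    """P(y=1|x) = sigmoid(w·x + b); predict 1 above the decision boundary."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return (1 if p > threshold else 0), p

label, p = predict(w=[0.8, -0.4], b=0.1, x=[1.0, 2.0])
print(label, round(p, 3))  # 1 0.525
```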

SLIDE 16

LR – Feature Space

SLIDE 17

LR – Classification Example

  • Assume we know the optimal w and b:

w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7]
b = 0.1

P(Y = 1 | x) = σ(w · x + b)
             = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.15] + 0.1)
             = σ(0.805) ≈ 0.69
P(Y = 0 | x) = 1 − σ(w · x + b) ≈ 0.31
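A quick check of this arithmetic (values copied from the slide above):

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
x = [3, 2, 1, 3, 0, 4.15]
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # dot product plus bias
p1 = 1.0 / (1.0 + math.exp(-z))                # P(Y=1|x)
print(round(z, 3), round(p1, 2), round(1 - p1, 2))  # 0.805 0.69 0.31
```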

SLIDE 18

LR – Feature Design/Engineering

  • Design features based on the training set
  • Features should reflect linguistic intuitions (e.g. a document with positive sentiment will contain more words that have a prior positive sentiment)
  • n-gram features to capture contextual/topical information in NLP tasks
  • POS tags to capture stylistic information
  • What features would be useful to determine sentence boundaries?
  • How about correlated features?

SLIDE 19

How do we learn the parameters of LR?

SLIDE 20

Cross-entropy loss function

  • Why do we need a loss function? L(ŷ, y) = how much does our prediction ŷ differ from y
  • What function can we use for L?
  • MSE (mean squared error), used in regression, is very hard to optimize for probabilistic output.
  • Conditional maximum likelihood?
  • Choose w, b such that they maximize the log probability of the true labels in the training data (the negative log likelihood loss is also called the cross-entropy loss)

SLIDE 21

Cross-entropy loss function

  • The binary labelling case can be expressed in terms of the Bernoulli distribution:

p(y|x) = ŷ^y (1 − ŷ)^(1−y)
log p(y|x) = log[ŷ^y (1 − ŷ)^(1−y)] = y log ŷ + (1 − y) log(1 − ŷ)

This is the log likelihood that should be maximized, such that w, b maximize the probability of our predictions being close to the true labels.

SLIDE 22

Cross-entropy loss function

  • To obtain a loss function that we can minimize, we flip the sign of the log likelihood:

L_CE(ŷ, y) = − log p(y|x) = −[y log ŷ + (1 − y) log(1 − ŷ)]

ŷ = σ(w · x + b)    (the LR model)

L_CE(w, b) = −[y log σ(w · x + b) + (1 − y) log(1 − σ(w · x + b))]
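A minimal sketch of this loss for a single example (the weights, input, and labels below are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(w, b, x, y):
    """L_CE = -[y log(y_hat) + (1 - y) log(1 - y_hat)], with y_hat = sigmoid(w·x + b)."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Near-zero loss when the prediction matches the label, large loss when it does not.
print(binary_cross_entropy([2.0], 0.0, [3.0], y=1))  # sigmoid(6) ≈ 0.998 -> loss ≈ 0.002
print(binary_cross_entropy([2.0], 0.0, [3.0], y=0))  # same prediction, wrong label -> loss ≈ 6.0
```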

SLIDE 23

Cross-entropy loss function

  • Why do we need to minimize the negative log likelihood?
  • A perfect classifier would assign probability close to 1 to the correct class (y=1 or y=0)
  • The closer our prediction is to 1 the better the classifier; vice versa, the closer it is to zero the worse it is.
  • The loss goes to zero for perfect classification, whereas it goes to infinity for the cases where we get everything wrong (log 0)
  • Since the two probabilities in our loss function sum to one, maximizing the probability of the correct label comes at the expense of the wrong label.

SLIDE 24

Cross-entropy loss function

  • Loss function for the entire training set:

Cost(w, b) = (1/m) Σ_{i=1}^{m} L_CE(ŷ^(i), y^(i))
           = −(1/m) Σ_{i=1}^{m} [ y^(i) log σ(w · x^(i) + b) + (1 − y^(i)) log(1 − σ(w · x^(i) + b)) ]

SLIDE 25

How can we find the minimum?

SLIDE 26

Gradient Descent – GD

  • Optimal parameters for our loss function:

θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} L_CE(y^(i), x^(i); θ)

  • GD finds the minimum of a function by figuring out in which direction in the parameter space the function’s slope is rising most steeply, and moving in the opposite direction.
  • In the case of convex functions, GD finds the global optimum (minimum)
  • Cross-entropy loss is a convex function

SLIDE 27

Gradient Descent – GD

SLIDE 28

Gradient Descent – GD

  • GD computes the gradient of the loss function at a given point and then moves in the opposite direction, s.t. the loss function is minimized
  • The magnitude of the move in gradient descent is determined by the value of the slope (or derivative), weighted by some learning rate
  • In the case of a function with one parameter:

w_{t+1} = w_t − η (d/dw) f(x; w)
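A minimal sketch of this one-parameter update rule, assuming a hypothetical loss f(w) = (w − 3)²:

```python
def df(w):
    return 2 * (w - 3)  # derivative of f(w) = (w - 3)^2

w, eta = 0.0, 0.1
for t in range(50):
    w -= eta * df(w)   # w_{t+1} = w_t - eta * f'(w_t)
print(round(w, 4))     # ≈ 3.0, the minimum of f
```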

SLIDE 29

Gradient Descent – GD


Gradient descent with small (top) and large (bottom) learning rates. Source: Andrew Ng’s Machine Learning course on Coursera

However, with each time step the gradient becomes smaller and smaller; thus there is no need to adaptively adjust the learning rate, as the steps shrink on their own once the slope becomes less steep.

SLIDE 30

Gradient Descent – GD

  • The cross-entropy loss function has many variables as parameters whose optimal values GD needs to find; thus, we operate in an N-dimensional space
  • The gradient expresses the directional components of the sharpest slope along each of those N dimensions

SLIDE 31

Gradient Descent – GD

  • Through GD we answer the question: “How much would a small change in w_i influence the total loss L?”

θ_{t+1} = θ_t − η ∇L(f(x; θ), y)

∇_θ L(f(x; θ), y) = [ ∂L(f(x; θ), y)/∂w_1, ∂L(f(x; θ), y)/∂w_2, …, ∂L(f(x; θ), y)/∂w_n ]ᵀ

SLIDE 32

Gradient Descent – GD

  • GD in the case of the cross-entropy loss:

Cost(w, b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log σ(w · x^(i) + b) + (1 − y^(i)) log(1 − σ(w · x^(i) + b)) ]

∂Cost(w, b)/∂w_j = (1/m) Σ_{i=1}^{m} [ σ(w · x^(i) + b) − y^(i) ] x_j^(i)
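Putting the cost and its gradient together, a minimal sketch of batch gradient descent for LR on a hypothetical toy dataset:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-feature dataset: label 1 for large x, 0 for small x.
data = [([0.5], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
w, b, eta, m = [0.0], 0.0, 0.5, len(data)

for epoch in range(1000):
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, y in data:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y  # sigma(w·x+b) - y
        grad_w = [gj + err * xj / m for gj, xj in zip(grad_w, x)]
        grad_b += err / m
    w = [wj - eta * gj for wj, gj in zip(w, grad_w)]  # move against the gradient
    b -= eta * grad_b

# The learned boundary w·x + b = 0 falls between the two groups of points.
print([round(wj, 2) for wj in w], round(b, 2))
```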

SLIDE 33

Gradient Descent – GD

Use the following derivatives to derive the partial derivative of the cross-entropy loss function:

dσ(z)/dz = σ(z)(1 − σ(z))
(d/dx) ln(x) = 1/x

∂Cost(w, b)/∂w_j = (1/m) Σ_{i=1}^{m} [ σ(w · x^(i) + b) − y^(i) ] x_j^(i)

SLIDE 34


Gradient Descent – GD

SLIDE 35

Gradient Descent – GD Example

  • Sentiment classification where each document has only two features (assume the true label is y = 1):

x = [x_1 = 3, x_2 = 2]    (x_1: count of positive lexicon words; x_2: count of negative lexicon words)
w_1 = w_2 = b = 0    (weights are initialized to zero)
η = 0.1    (learning rate)

∇_{w,b} = [ ∂L_CE(w,b)/∂w_1, ∂L_CE(w,b)/∂w_2, ∂L_CE(w,b)/∂b ]ᵀ
        = [ (σ(w · x + b) − y) x_1, (σ(w · x + b) − y) x_2, σ(w · x + b) − y ]ᵀ
        = [ (σ(0) − 1) x_1, (σ(0) − 1) x_2, σ(0) − 1 ]ᵀ
        = [ −0.5 x_1, −0.5 x_2, −0.5 ]ᵀ
        = [ −1.5, −1.0, −0.5 ]ᵀ

θ_2 = [ w_1, w_2, b ]ᵀ − η [ −1.5, −1.0, −0.5 ]ᵀ = [ 0.15, 0.10, 0.05 ]ᵀ
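A minimal sketch reproducing this single update step (y = 1, as assumed above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = [3.0, 2.0], 1             # positive/negative lexicon counts, true label
w, b, eta = [0.0, 0.0], 0.0, 0.1

err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y  # sigma(0) - 1 = -0.5
grad = [err * x[0], err * x[1], err]                         # [-1.5, -1.0, -0.5]
w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]
b -= eta * grad[2]
print([round(wj, 2) for wj in w], round(b, 2))  # [0.15, 0.1] 0.05
```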

SLIDE 36

Regularization

SLIDE 37

Regularization

  • If a feature perfectly predicts the class in the training data, the weight of that feature will be very high
  • In many cases this leads to overfitting and models that are not robust to noise and do not generalize well
  • Regularization is a way to avoid overfitting

ŵ = argmax_w Σ_{i=1}^{m} log P(y^(i) | x^(i)) − α R(w)

R(w) is the regularization term

SLIDE 38

Regularization

  • L1 regularization – lasso:
  • Represents the Manhattan distance, and is the sum of the absolute values of the weights

R(w) = ||w||_1 = Σ_{i=1}^{m} |w_i|

  • L2 regularization – ridge:
  • Represents the (squared) Euclidean distance, and is the sum of the squares of the weight values

R(w) = ||w||_2^2 = Σ_{i=1}^{m} w_i^2
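A minimal sketch of the two penalty terms; α and the weights below are hypothetical values:

```python
def l1(w):
    """Lasso penalty: R(w) = ||w||_1, the sum of absolute weight values."""
    return sum(abs(wi) for wi in w)

def l2(w):
    """Ridge penalty: R(w) = ||w||_2^2, the sum of squared weight values."""
    return sum(wi * wi for wi in w)

def regularized_objective(log_likelihood, w, alpha=0.01, penalty=l2):
    # Maximize: sum_i log P(y_i | x_i) - alpha * R(w)
    return log_likelihood - alpha * penalty(w)

w = [2.5, -5.0, -1.2]
print(l1(w), l2(w))  # 8.7 32.69
```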

SLIDE 39

Multinomial LR

SLIDE 40

Multinomial LR

  • What if we have more than 2 classes?
  • We can adapt the LR model to classify more than two classes by changing its classification function to the softmax function:

softmax(z_i) = e^{z_i} / Σ_{j=1}^{k} e^{z_j}

P(Y = c | x) = e^{w_c · x + b_c} / Σ_{j=1}^{k} e^{w_j · x + b_j}
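A minimal sketch of the softmax; subtracting max(z) before exponentiating is a standard numerical-stability trick, not something from the slides:

```python
import math

def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # probabilities sum to 1; the largest logit gets the largest share
```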

SLIDE 41

Features in Multinomial LR

SLIDE 42

Learning in Multinomial LR

  • What is the loss function for the multinomial LR?

L_CE(ŷ, y) = − Σ_{k=1}^{K} 1{y = k} log P(Y = k | x)
           = − Σ_{k=1}^{K} 1{y = k} log ( e^{w_k · x + b_k} / Σ_{j=1}^{K} e^{w_j · x + b_j} )

∂L_CE/∂w_k = −(1{y = k} − P(Y = k | x)) x
           = −( 1{y = k} − e^{w_k · x + b_k} / Σ_{j=1}^{K} e^{w_j · x + b_j} ) x
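A minimal sketch of this loss for one example, reusing the softmax above (the logits are hypothetical):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_ce(logits, true_class):
    """-log P(Y = true_class | x): only the true class survives the 1{y = k} indicator."""
    return -math.log(softmax(logits)[true_class])

# logits z_k = w_k · x + b_k for each of 3 classes (hypothetical values).
print(multinomial_ce([2.0, 1.0, 0.1], true_class=0))  # small loss: correct class scores highest
print(multinomial_ce([2.0, 1.0, 0.1], true_class=2))  # larger loss
```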

SLIDE 43

Resources

  • https://web.stanford.edu/~jurafsky/slp3/5.pdf

SLIDE 44

Upcoming Lecture

  • Penn Treebank
  • HMM POS tagger
  • LR POS tagger
  • Extra: CRF POS tagger
