SLIDE 1

Machine Learning for Computational Linguistics

Classification

Çağrı Çöltekin

University of Tübingen, Seminar für Sprachwissenschaft

May 3, 2016

SLIDE 2


Practical issues

▶ Homework 1: try to program it without help from specialized libraries (like NLTK)
▶ Time to think about projects: a short proposal is due towards the end of May.


SLIDE 3


The problem

[Figure: scatter plot of labeled training instances (+ and −) in the x1–x2 feature space, with an unlabeled instance marked '?']

▶ The response (outcome) is a label. In the example: positive (+) or negative (−)
▶ Given the features (x1 and x2), we want to predict the label of the unknown instance (?)
▶ Note: regression is not a good idea here


SLIDE 4


The problem (with a single predictor)

[Figure: the same problem with a single predictor; instances plotted along x1, with labels coded as y = 1 (+) and y = 0 (−)]


SLIDE 5


A quick survey of some solutions

Decision trees

[Figure: the scatter plot partitioned at thresholds a1 (on x1) and a2 (on x2), alongside the corresponding decision tree: test x1 < a1, then x2 < a2, with leaves labeled + and −]
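The slide does not tie this to any particular library, but as a runnable illustration, here is a toy decision tree with R's rpart package (the data points and settings are invented for this sketch):

    library(rpart)
    d <- data.frame(x1 = c(1, 2, 1, 4, 5, 4),
                    x2 = c(1, 1, 2, 4, 4, 5),
                    label = factor(c("+", "+", "+", "-", "-", "-")))
    # method = "class" requests a classification tree; minsplit = 2 lets
    # rpart split even on this tiny toy set
    tree <- rpart(label ~ x1 + x2, data = d, method = "class",
                  control = rpart.control(minsplit = 2))
    predict(tree, data.frame(x1 = 2, x2 = 2), type = "class")  # expected: "+"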


SLIDE 6


A quick survey of some solutions

Instance/memory based methods

[Figure: the same scatter plot; the '?' instance is labeled according to its nearest labeled neighbors]

▶ No training: just memorize the instances
▶ During test time, decide based on the k nearest neighbors
▶ Like decision trees, kNN is non-parametric
▶ It can also be used for regression
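For concreteness, a minimal kNN sketch using R's class package (the toy points are again invented):

    library(class)  # provides knn()
    train <- matrix(c(1, 1,  2, 1,  1, 2,   # three '+' instances
                      4, 4,  5, 4,  4, 5),  # three '-' instances
                    ncol = 2, byrow = TRUE)
    labels <- factor(c("+", "+", "+", "-", "-", "-"))
    # classify a new point by majority vote among its k = 3 nearest neighbors
    knn(train, test = matrix(c(2, 2), ncol = 2), cl = labels, k = 3)  # "+"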


SLIDE 7


A quick survey of some solutions

(Linear) discriminant functions

[Figure: scatter plot with a linear discriminant separating the + instances from the − instances]

▶ Find a discriminant function f that separates the training instances best (for some definition of 'best')
▶ Use the discriminant to predict the label of unknown instances:
\[ \hat{y} = \begin{cases} + & f(x) > 0 \\ - & f(x) < 0 \end{cases} \]



SLIDE 9


A quick survey of some solutions

Probability-based solutions

[Figure: scatter plot with estimated class-conditional densities over the + and − instances; a new '?' instance is assigned to the class with the higher density]

▶ Estimate the distributions p(x|y = +) and p(x|y = −) from the training data
▶ Assign the new items to the class c with the highest p(x|y = c)



SLIDE 11


A quick survey of some solutions

Artificial neural networks

[Figure: the scatter plot alongside a small neural network with inputs x1 and x2 and output y]


SLIDE 12


Logistic regression

▶ Logistic regression is a classification method
▶ In logistic regression, we fit a model that predicts P(y|x)
▶ Alternatively, logistic regression is an extension of linear regression. It is a member of the family of models called generalized linear models


SLIDE 13


A simple example

We would like to guess whether a child would develop dyslexia or not, based on a test applied to pre-verbal children. Here is a simplified version of the problem:

▶ We test children when they are less than 2 years of age.
▶ We want to predict the diagnosis from the test score
▶ The data looks like:

    Test score   Dyslexia
            82          0
            22          1
            62          1
           ...        ...

* The research question is from a real study by Ben Maasen and his colleagues. Data is fake as usual.

SLIDE 14


Example: fitting ordinary least squares regression

[Figure: OLS regression line fit to the binary outcome; x-axis: test score (20–100), y-axis: P(dyslexia|score)]

Problems:

▶ The probability values are not bounded between 0 and 1
▶ Residuals will be large for correct predictions
▶ Residuals are not distributed normally


SLIDE 15


Example: transforming the output variable

Instead of predicting the probability p, we predict logit(p):
\[ \hat{y} = \operatorname{logit}(p) = \log\frac{p}{1-p} = w_0 + w_1x \]

▶ p/(1−p) (the odds) is bounded between 0 and ∞
▶ log p/(1−p) (the log odds) is bounded between −∞ and ∞
▶ we can estimate logit(p) with regression, and convert it to a probability using the inverse of the logit,
\[ \hat{p} = \frac{e^{w_0+w_1x}}{1+e^{w_0+w_1x}} = \frac{1}{1+e^{-w_0-w_1x}} \]
which is called the logistic function (or sometimes the sigmoid function, with some ambiguity).
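As a quick numeric check of this algebra in R (the function names are mine, not the lecture's):

    logit    <- function(p) log(p / (1 - p))   # probability -> log odds
    logistic <- function(x) 1 / (1 + exp(-x))  # log odds -> probability (inverse logit)
    p <- c(0.1, 0.5, 0.9)
    all.equal(logistic(logit(p)), p)           # TRUE: the two functions are inverses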


SLIDE 16


Logit function

[Figure: the logit curve; x-axis: p from 0.0 to 1.0, y-axis: logit(p) from −4 to 4]
\[ \operatorname{logit}(p) = \log\frac{p}{1-p} \]


SLIDE 17


Logistic function

[Figure: the logistic curve; x-axis: x from −4 to 4, y-axis: logistic(x) from 0.0 to 1.0]
\[ \operatorname{logistic}(x) = \frac{1}{1+e^{-x}} \]


SLIDE 18


Logistic regression as a generalized linear model

Logistic regression is a special case of generalized linear models (GLMs). GLMs are expressed as
\[ g(y) = Xw + \epsilon \]

▶ The function g() is called the link function
▶ ϵ is distributed according to a distribution from the exponential family
▶ For logistic regression, g() is the logit function and ϵ is distributed binomially
▶ For linear regression, g() is the identity function and ϵ is distributed normally


SLIDE 19


Interpreting the dyslexia example

glm(formula = diag ~ score, family = binomial, data = dys)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.90079    2.31737   2.978  0.00290 **
score       -0.14491    0.04493  -3.225  0.00126 **

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 54.548  on 39  degrees of freedom
Residual deviance: 30.337  on 38  degrees of freedom
AIC: 34.337

Number of Fisher Scoring iterations: 5
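A sketch of how this model would be fit and used in R (assuming a data frame dys with columns score and diag, as in the call above):

    m <- glm(diag ~ score, family = binomial, data = dys)
    summary(m)   # produces the output shown above
    # predicted probability of dyslexia for a (hypothetical) score of 50:
    predict(m, newdata = data.frame(score = 50), type = "response")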


SLIDE 20


Interpreting the dyslexia example

[Figure: fitted logistic curve over the data; x-axis: test score (20–100), y-axis: P(dyslexia|score)]
\[ \operatorname{logit}(p) = 6.9 - 0.14x \qquad p = \frac{1}{1+e^{-6.9+0.14x}} \]


SLIDE 21


How to fit a logistic regression model

Reminder:
\[ P(y=1|x) = p = \frac{1}{1+e^{-wx}} \qquad P(y=0|x) = 1-p = \frac{e^{-wx}}{1+e^{-wx}} \]
The likelihood of the training set is
\[ L(w) = \prod_i P(y_i|x_i) = \prod_i p^{y_i}(1-p)^{1-y_i} \]
In practice, maximizing the log likelihood is more practical:
\[ \hat{w} = \arg\max_w \log L(w) = \arg\max_w \sum_i \left( y_i \log p + (1-y_i)\log(1-p) \right) \]
To maximize, we find the gradient:
\[ \nabla \log L(w) = \sum_i \left( y_i - \frac{1}{1+e^{-wx_i}} \right) x_i \]
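A direct R transcription of these formulas (the names and data layout are my own: X is an n×d matrix including an intercept column, y a 0/1 vector):

    logistic <- function(z) 1 / (1 + exp(-z))

    log_lik <- function(w, X, y) {
      p <- logistic(X %*% w)
      sum(y * log(p) + (1 - y) * log(1 - p))
    }

    grad_log_lik <- function(w, X, y) {
      p <- logistic(X %*% w)
      t(X) %*% (y - p)   # = sum_i (y_i - p_i) x_i
    }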


SLIDE 22


How to fit a logistic regression model (2)

▶ Bad news is that there is no analytic solution to the set of equations ∇ log L(w) = 0
▶ Good news is that the negative log likelihood is a convex function: there is a single global optimum
▶ We can use iterative methods such as gradient descent to find parameters that maximize the (log) likelihood
▶ In practice, it is more common to minimize the negative log likelihood J(w) = −log L(w). J(w) is called the loss function, cost function, or objective function
▶ Using gradient descent, we repeat
\[ w \leftarrow w - \alpha \nabla J(w) \]
until convergence. α is called the learning rate.
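Continuing the sketch above, a bare-bones gradient descent loop (the learning rate and stopping rule are illustrative choices, not the lecture's):

    fit_logreg <- function(X, y, alpha = 0.1, tol = 1e-6, max_iter = 10000) {
      w <- rep(0, ncol(X))            # start at the zero vector
      for (i in seq_len(max_iter)) {
        g <- grad_log_lik(w, X, y)    # gradient of log L(w)
        w_new <- w + alpha * g        # ascending log L(w) = descending J(w) = -log L(w)
        if (sum(abs(w_new - w)) < tol) break
        w <- w_new
      }
      w
    }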


SLIDE 23


An example with two predictors

Call:
glm(formula = label ~ x1 + x2, family = binomial, data = d)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.09692    4.74728   0.020   0.9837
x1          -2.53416    1.69222  -1.498   0.1343
x2           2.57632    1.36655   1.885   0.0594 .

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19.408  on 13  degrees of freedom
Residual deviance:  7.987  on 11  degrees of freedom
AIC: 13.987

Number of Fisher Scoring iterations: 6


SLIDE 24


An example with two predictors (2)

[Figure: the two classes in the x1–x2 plane (both axes 1–5) with the fitted decision boundary]
\[ 0.1 - 2.53x_1 + 2.58x_2 = 0 \qquad p = \frac{1}{1+e^{-(0.1-2.53x_1+2.58x_2)}} \]
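The boundary is just the set of points where the fitted probability is 0.5; a small sketch of reading it off the coefficients (assumes the data frame d from the previous slide):

    w <- coef(glm(label ~ x1 + x2, family = binomial, data = d))
    # p = 0.5 exactly where w[1] + w[2]*x1 + w[3]*x2 = 0, so:
    boundary_x2 <- function(x1) -(w[1] + w[2] * x1) / w[3]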


SLIDE 25


More than two classes

▶ Some algorithms can naturally be extended to multiple labels
▶ Others tend to work well in binary classification
▶ Any binary classifier can be turned into a k-way classifier by either of the following (a code sketch follows the list):
  ▶ training k one-vs.-rest (OvR) or one-vs.-all (OvA) classifiers. Decisions are made based on the class with the highest confidence score. This approach is feasible for classifiers that assign a weight or probability to the individual classes
  ▶ training k(k−1)/2 one-vs.-one (OvO) classifiers. Decisions are made based on majority voting
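A schematic one-vs.-rest sketch with logistic regression as the base classifier (it assumes a data frame d with a factor column label and feature columns x1 and x2; all names are illustrative):

    ovr_fit <- function(d, classes) {
      lapply(classes, function(k) {
        d$is_k <- as.integer(d$label == k)   # this class vs. the rest
        glm(is_k ~ x1 + x2, family = binomial, data = d)
      })
    }

    ovr_predict <- function(models, classes, newdata) {
      # one probability per class; the highest-scoring class wins
      scores <- sapply(models, predict, newdata = newdata, type = "response")
      classes[max.col(matrix(scores, nrow = nrow(newdata)))]
    }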


SLIDE 31


One vs. Rest

[Figure: three classes (+, −, ×) in the x1–x2 plane with three one-vs.-rest decision boundaries]

▶ For 3 classes we fit 3 classifiers, each separating one class from the rest
▶ Some regions of the feature space will be ambiguous
▶ We can assign labels based on a probability or weight value, if the classifier returns one
▶ One-vs.-one and majority voting is another option


SLIDE 32

Multi-class logistic regression

▶ Generalizing logistic regression for more than two classes is straightforward
▶ We estimate
\[ P(C_k|x) = \frac{e^{w_kx}}{\sum_j e^{w_jx}} \]
where C_k is the kth class and j iterates over all classes.
▶ This model is also known as a log-linear model, maximum entropy model, or Boltzmann machine
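The normalization above is the softmax function; a tiny self-contained R illustration (the weight values are invented):

    softmax <- function(z) exp(z) / sum(exp(z))   # scores -> probabilities summing to 1

    # illustrative weights for 3 classes over (intercept, x1, x2)
    W <- matrix(c( 0.5, -1.0,  2.0,
                  -0.3,  0.8, -0.5,
                   0.1,  0.2,  0.3), nrow = 3, byrow = TRUE)
    x <- c(1, 0.7, -1.2)       # intercept term plus two feature values
    p <- softmax(W %*% x)      # P(C_k | x) for k = 1, 2, 3
    sum(p)                     # 1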
