SLIDE 1

Logistic Regression

Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata

September 1, 2014

SLIDE 2

Recall: Linear Regression

[Figure: scatter plot of Power (bhp) vs. Engine displacement (cc)]

§ Assume: the relation is linear
§ Then, for a given x (= 1800), predict the value of y
§ Both the dependent and the independent variables are continuous
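As a refresher in code, a minimal R sketch of this setup; the data points below are invented for illustration:

    # Hypothetical engine data: displacement (cc) vs. power (bhp)
    cars_df <- data.frame(disp  = c(800, 1000, 1200, 1500, 1800, 2000, 2400),
                          power = c(38, 55, 64, 85, 104, 120, 158))

    fit <- lm(power ~ disp, data = cars_df)          # assume a linear relation
    predict(fit, newdata = data.frame(disp = 1800))  # predict y at x = 1800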

SLIDE 3


Scenario: Heart disease vs. Age

[Figure: training set of points along the Age (X) axis, each labeled Heart disease (Y) = No or Yes]

The task: calculate P(Y = Yes | X)

§ Age (numerical): independent variable
§ Heart disease (Yes/No): dependent variable with two classes
§ Task: Given a new person's age, predict whether (s)he has heart disease

SLIDE 4

Scenario: Heart disease vs. Age

[Figure: the same training set, with a curve estimating P(Y = Yes | X) over Age]

§ Calculate P(Y = Yes | X) for different ranges of X
§ A curve that estimates the probability P(Y = Yes | X)
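A minimal R sketch of the first step, on an invented training set: estimate P(Y = Yes | X) empirically within age ranges.

    # Hypothetical training set: age and heart-disease label
    age <- c(25, 30, 34, 38, 45, 52, 58, 61, 66, 70)
    hd  <- c("No", "No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes")

    # Empirical P(Y = Yes | X) within 20-year age ranges
    bins <- cut(age, breaks = c(20, 40, 60, 80))
    tapply(hd == "Yes", bins, mean)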


SLIDE 5

The Logistic function

Logistic function of t: takes values between 0 and 1

[Figure: the logistic curve L(t)]

$\mathrm{Logistic}(t) = \dfrac{e^t}{1+e^t} = \dfrac{1}{1+e^{-t}}$

If t is a linear function of x,

$t = \beta_0 + \beta_1 x$

the logistic function becomes:

$F(x) = \dfrac{1}{1+e^{-(\beta_0+\beta_1 x)}}$

F(x) gives the probability of the dependent variable Y taking one value against the other
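A small R sketch of the function (plogis is R's built-in version of the same curve); the β values here are made up:

    logistic <- function(t) 1 / (1 + exp(-t))

    # With t = beta0 + beta1 * x, using illustrative values beta0 = -6, beta1 = 0.12
    Fx <- function(x) logistic(-6 + 0.12 * x)
    Fx(c(20, 50, 80))                   # probabilities increase with x
    all.equal(logistic(2), plogis(2))   # TRUE: matches the built-in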

SLIDE 6

The Likelihood function

§ Let a discrete random variable X have a probability distribution p(x; θ) that depends on a parameter θ
§ In the case of the Bernoulli distribution:

  $p(x;\theta) = \theta^x (1-\theta)^{1-x}$

  – For x = 1, p(x; θ) = θ
  – For x = 0, p(x; θ) = 1 − θ

§ Intuitively, the likelihood measures "how likely" the observed outcomes are under the parameter θ
§ Given a set of data points x1, x2, …, xn, the likelihood function is defined as:

  $l(\theta) = \prod_{i=1}^{n} p(x_i;\theta)$

SLIDE 7

About the Likelihood function

§ The actual value does not have any meaning; only the relative likelihood matters, as we want to estimate the parameter θ
  – Constant factors do not matter
§ Likelihood is not a probability density function
  – The sum (or integral) does not add up to 1
§ In practice it is often easier to work with the log-likelihood
  – Provides the same relative comparison
  – The expression becomes a sum


$l(\theta) = \prod_{i=1}^{n} p(x_i;\theta)$

$L(\theta) = \ln\big(l(\theta)\big) = \ln\!\left(\prod_{i=1}^{n} p(x_i;\theta)\right) = \sum_{i=1}^{n} \ln\big(p(x_i;\theta)\big)$

SLIDE 8

Example

§ Experiment: a coin toss, not known to be unbiased
§ Random variable X takes value 1 for a head and 0 for a tail
§ Data: 100 outcomes, 75 heads, 25 tails


$L(\theta) = 75\ln(\theta) + 25\ln(1-\theta)$

§ Relative likelihood: if L(θ1) > L(θ2), then θ1 explains the observed data better than θ2
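The same computation as a short R sketch: evaluate L(θ) at a few values and locate the maximum numerically.

    # Log-likelihood for 75 heads and 25 tails
    L <- function(theta) 75 * log(theta) + 25 * log(1 - theta)

    L(0.5)                                                         # -69.31
    L(0.75)                                                        # -56.23: better
    optimize(L, interval = c(0.01, 0.99), maximum = TRUE)$maximum  # about 0.75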

SLIDE 9

Maximum likelihood estimate

§ Maximum likelihood estimation: estimating the set of values for the parameters (for example, θ) which maximizes the likelihood function
§ Estimate:


$\operatorname{argmax}_{\theta}\big[L(\theta)\big] = \operatorname{argmax}_{\theta}\left[\sum_{i=1}^{n} \ln\big(p(x_i;\theta)\big)\right]$

§ One method: Newton’s method

– Start with some value of θ and iteratively improve
– Converge when the improvement is negligible

§ May not always converge

SLIDE 10

Taylor's theorem

§ If f is a
  – real-valued function,
  – k times differentiable at a point a, for an integer k > 0,
  then f has a polynomial approximation at a
§ In other words, there exists a function hk such that


$f(x) = \underbrace{f(a) + \frac{f'(a)}{1!}(x-a) + \cdots + \frac{f^{(k)}(a)}{k!}(x-a)^k}_{P(x)} + h_k(x)(x-a)^k \quad\text{and}\quad \lim_{x\to a} h_k(x) = 0$

where P(x) is the polynomial approximation (the k-th order Taylor polynomial).
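A quick numerical illustration in R (my own example, not from the slides): the second-order Taylor polynomial of log(x) around a = 1.

    f <- function(x) log(x)               # f'(x) = 1/x, f''(x) = -1/x^2
    taylor2 <- function(x, a = 1) f(a) + (x - a) / a - (x - a)^2 / (2 * a^2)

    x <- 1.1
    c(exact = f(x), approx = taylor2(x))  # 0.0953 vs 0.0950; the gap vanishes as x -> a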

SLIDE 11

Newton's method

§ Finding the global maximum w* of a function f of one variable
§ Assumptions:
  1. The function f is smooth
  2. The derivative of f at w* is 0, and the second derivative is negative
§ Start with a value w = w0
§ Near the maximum, approximate the function using a second-order Taylor polynomial


$f(w) \approx f(w_0) + (w-w_0)\left.\dfrac{df}{dw}\right|_{w=w_0} + \dfrac{1}{2}(w-w_0)^2 \left.\dfrac{d^2 f}{dw^2}\right|_{w=w_0}$

$\phantom{f(w)} \approx f(w_0) + (w-w_0)\,f'(w_0) + \dfrac{1}{2}(w-w_0)^2\,f''(w_0)$

§ Using the gradient descent approach, iteratively estimate the maximum of f

SLIDE 12

Newton's method

§ Take the derivative w.r.t. w, and set it to zero at a point w1


$f(w) \approx f(w_0) + (w-w_0)\,f'(w_0) + \dfrac{1}{2}(w-w_0)^2\,f''(w_0)$

$f'(w_1) \approx 0 = f'(w_0) + \dfrac{1}{2}\,f''(w_0) \times 2\,(w_1 - w_0) \;\Rightarrow\; w_1 = w_0 - \dfrac{f'(w_0)}{f''(w_0)}$

Iteratively: $w_{n+1} = w_n - \dfrac{f'(w_n)}{f''(w_n)}$

§ Converges very fast, if at all
§ Use the optim function in R
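A minimal R sketch, reusing the coin-toss log-likelihood from the earlier example: Newton's iteration written out, then the same maximization via optim.

    L   <- function(theta)  75 * log(theta) + 25 * log(1 - theta)
    dL  <- function(theta)  75 / theta     - 25 / (1 - theta)      # L'(theta)
    d2L <- function(theta) -75 / theta^2   - 25 / (1 - theta)^2    # L''(theta)

    theta <- 0.5                                  # starting value w0
    for (i in 1:10) theta <- theta - dL(theta) / d2L(theta)
    theta                                         # 0.75, the maximum likelihood estimate

    # The same with optim; fnscale = -1 turns the minimizer into a maximizer
    optim(0.5, L, method = "Brent", lower = 0.01, upper = 0.99,
          control = list(fnscale = -1))$par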

SLIDE 13

Logistic Regression: Estimating β0 and β1

§ Logistic function


$F(x) = \dfrac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} = \dfrac{1}{1+e^{-(\beta_0+\beta_1 x)}}$

§ Log-likelihood function

– Say we have n data points x1, x2, …, xn
– Outcomes y1, y2, …, yn, each either 0 or 1
– Each yi = 1 with probability p(xi) and 0 with probability 1 − p(xi)

$L(\beta) = \ln\big(l(\beta)\big) = \sum_{i=1}^{n}\Big[\, y_i \ln p(x_i) + (1-y_i)\ln\big(1-p(x_i)\big) \Big] = \sum_{i=1}^{n}\Big[\, y_i(\beta_0+\beta_1 x_i) - \ln\big(1+e^{\beta_0+\beta_1 x_i}\big) \Big]$
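A sketch in R of maximizing exactly this log-likelihood with optim, on simulated data (the "true" β values below are invented); glm is shown only as a cross-check.

    set.seed(1)
    age <- runif(200, 20, 80)
    y   <- rbinom(200, 1, plogis(-6 + 0.12 * age))   # simulate 0/1 outcomes

    loglik <- function(beta) {                       # L(beta) from the slide
      t <- beta[1] + beta[2] * age
      sum(y * t - log(1 + exp(t)))
    }
    fit <- optim(c(0, 0), loglik, control = list(fnscale = -1))
    fit$par                                          # estimates of beta0, beta1
    coef(glm(y ~ age, family = binomial))            # should agree closely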

SLIDE 14

Visualization

[Figure: training points along Age (X) with Heart disease (Y) = No/Yes, and a candidate curve with probability levels 0.25, 0.5, 0.75 marked]

§ Fit some curve with parameters β0 and β1

SLIDE 15

Visualization

[Figure: the same plot, with the curve adjusted to better separate the No and Yes points]

§ Fit some curve with parameters β0 and β1
§ Iteratively adjust the curve and the probabilities of each point being classified as one class vs. the other
§ For a single independent variable x, the separation is a point x = a
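To spell out where the separation point a comes from (a step the slide leaves implicit): it is the x at which the two classes are equally probable.

$F(a) = \dfrac{1}{1+e^{-(\beta_0+\beta_1 a)}} = 0.5 \;\iff\; \beta_0 + \beta_1 a = 0 \;\iff\; a = -\dfrac{\beta_0}{\beta_1}$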

SLIDE 16

Two independent variables

Separation is a line where the probability becomes 0.5

[Figure: points plotted over Age (Years) and Income (thousand rupees), with probability lines at 0.25, 0.5, 0.75]
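In symbols, with x1 = Age and x2 = Income, the 0.5 contour is where the linear term vanishes, which is a straight line in the (x1, x2) plane:

$F(x_1,x_2) = \dfrac{1}{1+e^{-(\beta_0+\beta_1 x_1+\beta_2 x_2)}} = 0.5 \;\iff\; \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$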

SLIDE 17

CLASSIFICATION

Wrapping up classification


SLIDE 18

Binary and Multi-class classification

§ Binary classification:

– Target class has two values
– Example: Heart disease Yes / No

§ Multi-class classification:

– Target class can take more than two values
– Example: text classification into several labels (topics)

§ Many classifiers are simple to use for binary classification tasks
§ How to apply them for multi-class problems?


SLIDE 19

Compound and Monolithic classifiers

§ Compound models

– By combining binary submodels
– 1-vs-all: for each class c, determine if an observation belongs to c or some other class (see the sketch after this list)
– 1-vs-last

§ Monolithic models (a single classifier)

– Examples: decision trees, k-NN

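A minimal R sketch of 1-vs-all with logistic submodels, on simulated three-class data (all names and values invented for illustration):

    set.seed(2)
    d <- data.frame(x1  = rnorm(150), x2 = rnorm(150),
                    cls = sample(c("a", "b", "c"), 150, replace = TRUE))

    # One binary logistic model per class: class k vs. all other classes
    classes <- unique(d$cls)
    models  <- lapply(classes, function(k)
      glm(I(cls == k) ~ x1 + x2, family = binomial, data = d))

    # Predict the class whose submodel assigns the highest probability
    probs <- sapply(models, predict, type = "response")
    pred  <- classes[max.col(probs)]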