  1. Logistic Regression
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
September 1, 2014

  2. Recall: Linear Regression
[Figure: scatter plot of Power (bhp) against Engine displacement (cc), with a fitted straight line]
§ Assume: the relation is linear
§ Then for a given x (= 1800), predict the value of y
§ Both the dependent and the independent variables are continuous

  3. Scenario: Heart disease vs. Age
[Figure: training set plotted as Heart disease (Y: Yes/No) against Age (X)]
§ Training set:
  – Age (numerical): independent variable
  – Heart disease (Yes/No): dependent variable with two classes
§ Task: given a new person's age, predict if (s)he has heart disease
§ The task: calculate P(Y = Yes | X)

  4. Scenario: Heart disease vs. Age
[Figure: the same training set as slide 3, with a smooth curve fitted through the points]
§ Calculate P(Y = Yes | X) for different ranges of X
§ A curve that estimates the probability P(Y = Yes | X)

  5. The Logistic function
§ Logistic function of t: takes values between 0 and 1
$$\mathrm{Logistic}(t) = \frac{e^t}{1+e^t} = \frac{1}{1+e^{-t}}$$
§ If t is a linear function of x, t = β0 + β1 x, the logistic function becomes
$$F(x) = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$$
§ F(x) is the probability of the dependent variable Y taking one value against the other
[Figure: the logistic curve L(t)]
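A minimal sketch of this function in Python (the function names are illustrative; the piecewise form avoids overflow for large negative t):

```python
import math

def logistic(t):
    """Logistic(t) = 1 / (1 + e^{-t}), computed in a numerically stable way."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)          # for t < 0, use e^t / (1 + e^t) to avoid overflow
    return z / (1.0 + z)

def F(x, b0, b1):
    """Probability of the positive class, with t = b0 + b1 * x."""
    return logistic(b0 + b1 * x)

print(logistic(0.0))             # 0.5
print(F(50, -6.0, 0.11))         # made-up b0, b1, not fitted values
```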

  6. The Likelihood function
§ Let a discrete random variable X have a probability distribution p(x; θ) that depends on a parameter θ
§ In the case of the Bernoulli distribution:
$$p(x;\theta) = \theta^x (1-\theta)^{1-x}$$
§ Intuitively, the likelihood measures how well the parameter θ explains an observed outcome
  – For x = 1, p(x; θ) = θ
  – For x = 0, p(x; θ) = 1 − θ
§ Given a set of data points x1, x2, …, xn, the likelihood function is defined as:
$$l(\theta) = \prod_{i=1}^{n} p(x_i;\theta)$$
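A small sketch (assumed toy outcomes) evaluating the Bernoulli likelihood for a few candidate values of θ:

```python
def likelihood(xs, theta):
    """l(theta) = product over the data of theta^x * (1 - theta)^(1 - x)."""
    l = 1.0
    for x in xs:
        l *= theta ** x * (1.0 - theta) ** (1 - x)
    return l

data = [1, 0, 1, 1, 0, 1]            # assumed toy outcomes
for theta in (0.3, 0.5, 0.7):
    print(theta, likelihood(data, theta))
# Only the relative ordering of these values matters.
```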

  7. About the Likelihood function
$$l(\theta) = \prod_{i=1}^{n} p(x_i;\theta)$$
§ The actual value does not have any meaning in itself; only the relative likelihood matters, as we want to estimate the parameter θ
  – Constant factors do not matter
§ Likelihood is not a probability density function
  – The sum (or integral) does not add up to 1
§ In practice it is often easier to work with the log-likelihood
  – Provides the same relative comparison
  – The expression becomes a sum:
$$L(\theta) = \ln l(\theta) = \ln\left(\prod_{i=1}^{n} p(x_i;\theta)\right) = \sum_{i=1}^{n} \ln p(x_i;\theta)$$
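The same comparison in log space, as a sketch reusing the toy data above; the ordering of the candidates is unchanged because ln is monotonically increasing:

```python
import math

def log_likelihood(xs, theta):
    """L(theta) = sum over the data of ln p(x_i; theta)."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1.0 - theta)
               for x in xs)

data = [1, 0, 1, 1, 0, 1]
for theta in (0.3, 0.5, 0.7):
    print(theta, log_likelihood(data, theta))   # same ordering as l(theta)
```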

  8. Example
§ Experiment: a coin toss, not known to be unbiased
§ Random variable X takes value 1 if head and 0 if tail
§ Data: 100 outcomes, 75 heads, 25 tails
$$L(\theta) = 75\ln(\theta) + 25\ln(1-\theta)$$
§ Relative likelihood: if L(θ1) > L(θ2), then θ1 explains the observed data better than θ2
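A quick numeric check of this example (a sketch using a plain grid search, not one of the methods discussed next):

```python
import math

def L(theta):
    return 75 * math.log(theta) + 25 * math.log(1.0 - theta)

# Evaluate L on a grid of candidate theta values and keep the best.
best_L, best_theta = max((L(k / 1000.0), k / 1000.0) for k in range(1, 1000))
print(best_theta, best_L)   # maximum near theta = 0.75, the observed fraction of heads
```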

  9. Maximum likelihood estimate
§ Maximum likelihood estimation: estimating the set of values for the parameters (for example, θ) which maximizes the likelihood function
§ Estimate:
$$\operatorname{argmax}_\theta L(\theta) = \operatorname{argmax}_\theta \left[\sum_{i=1}^{n} \ln p(x_i;\theta)\right]$$
§ One method: Newton's method
  – Start with some value of θ and iteratively improve
  – Converge when the improvement is negligible
§ May not always converge

  10. Taylor's theorem
§ If f is a
  – Real-valued function
  – k times differentiable at a point a, for an integer k > 0
  then f has a polynomial approximation at a
§ In other words, there exists a function h_k such that
$$f(x) = \underbrace{f(a) + \frac{f'(a)}{1!}(x-a) + \dots + \frac{f^{(k)}(a)}{k!}(x-a)^k}_{P(x)} + h_k(x)(x-a)^k$$
and
$$\lim_{x\to a} h_k(x) = 0$$
where P(x) is the polynomial approximation (the k-th order Taylor polynomial)
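A small numeric illustration (with sin as an assumed example function, since its derivatives are known in closed form) of how a second-order Taylor polynomial approximates f near a:

```python
import math

def f(w):
    return math.sin(w)

def taylor2(w, a):
    """Second-order Taylor polynomial of sin around a:
    sin(a) + cos(a)(w - a) - sin(a)(w - a)^2 / 2."""
    return math.sin(a) + math.cos(a) * (w - a) - 0.5 * math.sin(a) * (w - a) ** 2

a = 1.0
for w in (1.0, 1.1, 1.5):
    print(w, f(w), taylor2(w, a))   # the approximation degrades away from a
```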

  11. Newton's method
§ Finding the maximum w* of a function f of one variable
§ Assumptions:
  1. The function f is smooth
  2. The derivative of f at w* is 0, and the second derivative is negative
§ Start with a value w = w0
§ Near the maximum, approximate the function using a second-order Taylor polynomial:
$$f(w) \approx f(w_0) + (w-w_0)\left.\frac{df}{dw}\right|_{w=w_0} + \frac{1}{2}(w-w_0)^2 \left.\frac{d^2 f}{dw^2}\right|_{w=w_0}$$
$$\approx f(w_0) + (w-w_0)f'(w_0) + \frac{1}{2}(w-w_0)^2 f''(w_0)$$
§ Iteratively estimate the maximum of f using this approximation

  12. Newton's method
$$f(w) \approx f(w_0) + (w-w_0)f'(w_0) + \frac{1}{2}(w-w_0)^2 f''(w_0)$$
§ Take the derivative with respect to w, and set it to zero at a point w1:
$$f'(w_1) \approx 0 = f'(w_0) + \frac{1}{2} f''(w_0) \times 2(w_1 - w_0) \;\Rightarrow\; w_1 = w_0 - \frac{f'(w_0)}{f''(w_0)}$$
§ Iteratively:
$$w_{n+1} = w_n - \frac{f'(w_n)}{f''(w_n)}$$
§ Converges very fast, if at all
§ In practice, use the optim function in R
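A minimal sketch of this iteration in Python, applied to the coin-toss log-likelihood of slide 8, with L' and L'' written out by hand (not the R optim call mentioned above):

```python
def newton_maximize(d1, d2, w, tol=1e-10, max_iter=100):
    """Iterate w <- w - f'(w)/f''(w) until the improvement is negligible."""
    for _ in range(max_iter):
        step = d1(w) / d2(w)
        w -= step
        if abs(step) < tol:
            break
    return w

# L(theta) = 75 ln(theta) + 25 ln(1 - theta)
def d1(t):
    return 75.0 / t - 25.0 / (1.0 - t)                # L'(theta)

def d2(t):
    return -75.0 / t ** 2 - 25.0 / (1.0 - t) ** 2     # L''(theta), always negative

print(newton_maximize(d1, d2, 0.5))                   # converges to 0.75
```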

  13. Logistic Regression: Estimating β0 and β1
§ Logistic function:
$$F(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$$
§ Log-likelihood function
  – Say we have n data points x1, x2, …, xn
  – Outcomes y1, y2, …, yn, each either 0 or 1
  – Each yi equals 1 with probability p(xi) and 0 with probability 1 − p(xi), where p(xi) = F(xi)
$$L(\beta) = \ln l(\beta) = \sum_{i=1}^{n}\left[ y_i \ln p(x_i) + (1-y_i)\ln(1-p(x_i)) \right]$$
$$= \sum_{i=1}^{n}\left[ y_i(\beta_0+\beta_1 x_i) - \ln\left(1+e^{\beta_0+\beta_1 x_i}\right) \right]$$
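A minimal sketch fitting β0 and β1 by maximizing this log-likelihood; it uses plain gradient ascent rather than Newton's method, with made-up toy data and the age rescaled so that a fixed learning rate stays stable:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def fit_logistic(xs, ys, lr=0.1, iters=50000):
    """Gradient ascent on L(beta): dL/db0 = sum(y - p), dL/db1 = sum((y - p) x)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Assumed toy data: heart disease (1/0) against age / 100.
ages = [0.25, 0.32, 0.40, 0.45, 0.51, 0.58, 0.63, 0.70]
disease = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(ages, disease)
print(b0, b1)
```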

  14. Visualization
§ Fit a curve with parameters β0 and β1
[Figure: a logistic curve over the heart-disease data, Heart disease (Y) against Age (X), with probability levels 0.25, 0.5, and 0.75 marked]

  15. Visualization
§ Fit a curve with parameters β0 and β1
§ Iteratively adjust the curve and the probabilities of a point being classified as one class vs. the other
[Figure: the adjusted logistic curve, with probability levels 0.25, 0.5, and 0.75 marked]
§ For a single independent variable x, the separation is a point x = a where the estimated probability crosses 0.5, i.e. a = −β0/β1

  16. Two independent variables
§ The separation is a line where the probability becomes 0.5
[Figure: Income (thousand rupees) against Age (years), with a separating line and probability levels 0.25, 0.5, and 0.75 marked]
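With two variables the model is F(x1, x2) = 1/(1 + e^{-(β0 + β1 x1 + β2 x2)}), so the probability is exactly 0.5 where β0 + β1 x1 + β2 x2 = 0. A small sketch (made-up coefficients, not fitted values) computing that line:

```python
def boundary_income(age, b0, b1, b2):
    """Solve b0 + b1*age + b2*income = 0 for income: the p = 0.5 line."""
    return -(b0 + b1 * age) / b2

b0, b1, b2 = -6.0, 0.08, 0.01     # made-up illustrative coefficients
for age in (30, 50, 70):
    print(age, boundary_income(age, b0, b1, b2))
```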

  17. Wrapping up: Classification

  18. Binary and Multi-class classification
§ Binary classification:
  – Target class has two values
  – Example: Heart disease Yes/No
§ Multi-class classification:
  – Target class can take more than two values
  – Example: text classification into several labels (topics)
§ Many classifiers are simple to use for binary classification tasks
§ How to apply them to multi-class problems?

  19. Compound and Monolithic classifiers
§ Compound models
  – Built by combining binary submodels
  – 1-vs-all: for each class c, determine if an observation belongs to c or to some other class
  – 1-vs-last
§ Monolithic models (a single classifier)
  – Examples: decision trees, k-NN
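A minimal sketch of the 1-vs-all scheme (toy data and helper names are assumed; any binary learner would do, here the single-feature logistic fit from earlier):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def fit_logistic(xs, ys, lr=0.1, iters=5000):
    """Binary logistic fit by gradient ascent (same sketch as before)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

def fit_one_vs_all(xs, labels):
    """One binary submodel per class c: does an observation belong to c or not?"""
    return {c: fit_logistic(xs, [1 if l == c else 0 for l in labels])
            for c in set(labels)}

def predict(models, x):
    """Pick the class whose submodel assigns the highest probability."""
    return max(models, key=lambda c: sigmoid(models[c][0] + models[c][1] * x))

xs = [0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9, 1.0]     # assumed toy feature
labels = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
models = fit_one_vs_all(xs, labels)
print(predict(models, 0.15), predict(models, 0.55), predict(models, 0.95))
```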
