

SLIDE 1

Learning From Data Lecture 9 Logistic Regression and Gradient Descent

Logistic Regression · Gradient Descent

  • M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: Linear Classification and Regression

The linear signal: $s = \mathbf{w}^T\mathbf{x}$

Good Features are Important. Before looking at the data, we can reason that symmetry and intensity should be good features, based on our knowledge of the problem.

Algorithms.
  • Linear Classification: the pocket algorithm can tolerate errors; simple and efficient.
  • Linear Regression: single-step learning, $\mathbf{w} = X^{\dagger}\mathbf{y} = (X^T X)^{-1} X^T \mathbf{y}$; a very efficient $O(Nd^2)$ exact algorithm.
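To make the single-step solution concrete, here is a minimal NumPy sketch of the pseudo-inverse computation; the variable names and the toy data matrix X and targets y are made up purely for illustration.

```python
import numpy as np

# Toy data (made up): N examples, a bias coordinate x_0 = 1 plus d features.
N, d = 100, 2
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])   # N x (d+1) data matrix
y = rng.normal(size=N)                                       # N real-valued targets

# Single-step learning: w = X^dagger y = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

# The same answer via a least-squares routine (numerically more robust).
w_alt, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_alt)
```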


SLIDE 3

Predicting a Probability

Will someone have a heart attack over the next year?

age 62 years, gender male, blood sugar 120 mg/dL, HDL 50, LDL 120, mass 190 lbs, height 5′ 10′′, ...

Classification: Yes/No
Logistic Regression: likelihood of heart attack

logistic regression ≡ y ∈ [0, 1]

$$h(\mathbf{x}) = \theta\!\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(\mathbf{w}^T\mathbf{x})$$


SLIDE 4

Predicting a Probability

Will someone have a heart attack over the next year?

age 62 years, gender male, blood sugar 120 mg/dL, HDL 50, LDL 120, mass 190 lbs, height 5′ 10′′, ...

Classification: Yes/No
Logistic Regression: likelihood of heart attack

logistic regression ≡ y ∈ [0, 1]

$$h(\mathbf{x}) = \theta\!\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(\mathbf{w}^T\mathbf{x})$$

[Plot: the logistic (sigmoid) function $\theta(s)$ versus $s$.]

$$\theta(s) = \frac{e^{s}}{1+e^{s}} = \frac{1}{1+e^{-s}}, \qquad \theta(-s) = \frac{e^{-s}}{1+e^{-s}} = \frac{1}{1+e^{s}} = 1 - \theta(s).$$
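A minimal NumPy sketch of the logistic function, together with a check of the identity θ(−s) = 1 − θ(s); the helper name `theta` and the sample values of s are my own.

```python
import numpy as np

def theta(s):
    """Logistic (sigmoid) function: theta(s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

s = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])        # illustrative signal values
print(theta(s))                                   # values in (0, 1); theta(0) = 0.5
assert np.allclose(theta(-s), 1.0 - theta(s))     # the identity theta(-s) = 1 - theta(s)
```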


SLIDE 5

The Data is Still Binary, ±1

$\mathcal{D} = (\mathbf{x}_1, y_1 = \pm 1), \cdots, (\mathbf{x}_N, y_N = \pm 1)$

$\mathbf{x}_n$ ← a person's health information
$y_n = \pm 1$ ← did they have a heart attack or not

We cannot measure a probability. We can only see the occurrence of an event and try to infer a probability.


SLIDE 6

The Target Function is Inherently Noisy

$f(\mathbf{x}) = \mathbb{P}[y = +1 \mid \mathbf{x}]$. The data is generated from a noisy target function:

$$P(y \mid \mathbf{x}) = \begin{cases} f(\mathbf{x}) & \text{for } y = +1; \\ 1 - f(\mathbf{x}) & \text{for } y = -1. \end{cases}$$


SLIDE 7

What Makes an h Good?

'Fitting' the data means finding a good h. h is good if:

$$h(\mathbf{x}_n) \approx 1 \text{ whenever } y_n = +1; \qquad h(\mathbf{x}_n) \approx 0 \text{ whenever } y_n = -1.$$

A simple error measure that captures this:

$$E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N}\left(h(\mathbf{x}_n) - \tfrac{1}{2}(1 + y_n)\right)^2.$$

Not very convenient (hard to minimize).
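As a quick illustration, a NumPy sketch of this squared error measure, assuming h outputs values in [0, 1]; the function name `squared_error` and the toy predictions and labels are made up.

```python
import numpy as np

def squared_error(h_vals, y):
    """E_in(h) = (1/N) * sum_n (h(x_n) - (1 + y_n)/2)^2, with y_n in {-1, +1}."""
    targets = (1 + y) / 2            # maps y = +1 -> 1 and y = -1 -> 0
    return np.mean((h_vals - targets) ** 2)

# Toy example (made up): predicted probabilities and +/-1 labels.
h_vals = np.array([0.9, 0.2, 0.6, 0.1])
y = np.array([+1, -1, +1, -1])
print(squared_error(h_vals, y))
```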


SLIDE 8

The Cross Entropy Error Measure

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\ln\left(1 + e^{-y_n \cdot \mathbf{w}^T\mathbf{x}_n}\right)$$

It looks complicated and ugly (ln, e^{(·)}, ...). But,
  – it is based on an intuitive probabilistic interpretation of h.
  – it is very convenient and mathematically friendly ('easy' to minimize).

Verify: $y_n = +1$ encourages $\mathbf{w}^T\mathbf{x}_n \gg 0$, so $\theta(\mathbf{w}^T\mathbf{x}_n) \approx 1$; $y_n = -1$ encourages $\mathbf{w}^T\mathbf{x}_n \ll 0$, so $\theta(\mathbf{w}^T\mathbf{x}_n) \approx 0$.
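A minimal NumPy sketch of this cross-entropy error; the function name `cross_entropy_error` and the toy data are made up for illustration.

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w.x_n)), with y_n in {-1, +1}."""
    signals = X @ w                            # the linear signals w^T x_n
    return np.mean(np.log1p(np.exp(-y * signals)))

# Toy data (made up): 4 examples with a bias coordinate x_0 = 1 plus 2 features.
X = np.array([[1.0,  2.0,  1.0],
              [1.0, -1.0,  0.5],
              [1.0,  0.3, -2.0],
              [1.0, -2.0, -1.0]])
y = np.array([+1, -1, +1, -1])
w = np.zeros(3)
print(cross_entropy_error(w, X, y))            # equals ln(2) when w = 0
```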


SLIDE 9

The Probabilistic Interpretation

Suppose that $h(\mathbf{x}) = \theta(\mathbf{w}^T\mathbf{x})$ closely captures $\mathbb{P}[+1 \mid \mathbf{x}]$:

$$P(y \mid \mathbf{x}) = \begin{cases} \theta(\mathbf{w}^T\mathbf{x}) & \text{for } y = +1; \\ 1 - \theta(\mathbf{w}^T\mathbf{x}) & \text{for } y = -1. \end{cases}$$


SLIDE 10

The Probabilistic Interpretation

So, if $h(\mathbf{x}) = \theta(\mathbf{w}^T\mathbf{x})$ closely captures $\mathbb{P}[+1 \mid \mathbf{x}]$:

$$P(y \mid \mathbf{x}) = \begin{cases} \theta(\mathbf{w}^T\mathbf{x}) & \text{for } y = +1; \\ \theta(-\mathbf{w}^T\mathbf{x}) & \text{for } y = -1. \end{cases}$$


SLIDE 11

The Probabilistic Interpretation

So, if $h(\mathbf{x}) = \theta(\mathbf{w}^T\mathbf{x})$ closely captures $\mathbb{P}[+1 \mid \mathbf{x}]$:

$$P(y \mid \mathbf{x}) = \begin{cases} \theta(\mathbf{w}^T\mathbf{x}) & \text{for } y = +1; \\ \theta(-\mathbf{w}^T\mathbf{x}) & \text{for } y = -1. \end{cases}$$

. . . or, more compactly,

$$P(y \mid \mathbf{x}) = \theta(y \cdot \mathbf{w}^T\mathbf{x})$$


SLIDE 12

The Likelihood

$P(y \mid \mathbf{x}) = \theta(y \cdot \mathbf{w}^T\mathbf{x})$

Recall: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ are independently generated.

Likelihood: the probability of getting the $y_1, \ldots, y_N$ in $\mathcal{D}$ from the corresponding $\mathbf{x}_1, \ldots, \mathbf{x}_N$:

$$P(y_1, \ldots, y_N \mid \mathbf{x}_1, \ldots, \mathbf{x}_N) = \prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n).$$

The likelihood measures the probability that the data were generated if f were h.


SLIDE 13

Maximizing The Likelihood (why?)

$$\max_{\mathbf{w}} \prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n)$$
$$\Longleftrightarrow\; \max_{\mathbf{w}} \ln \prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) \;\equiv\; \max_{\mathbf{w}} \sum_{n=1}^{N} \ln P(y_n \mid \mathbf{x}_n)$$
$$\Longleftrightarrow\; \min_{\mathbf{w}} \; -\frac{1}{N}\sum_{n=1}^{N} \ln P(y_n \mid \mathbf{x}_n) \;\equiv\; \min_{\mathbf{w}} \; \frac{1}{N}\sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid \mathbf{x}_n)}$$
$$\equiv\; \min_{\mathbf{w}} \; \frac{1}{N}\sum_{n=1}^{N} \ln \frac{1}{\theta(y_n \cdot \mathbf{w}^T\mathbf{x}_n)} \qquad \leftarrow \text{we specialize to our ``model'' here}$$
$$\equiv\; \min_{\mathbf{w}} \; \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot \mathbf{w}^T\mathbf{x}_n}\right)$$

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot \mathbf{w}^T\mathbf{x}_n}\right)$$
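A small numerical check, on made-up data and weights, that minimizing E_in is the same as maximizing the likelihood: the negative average log-likelihood under P(y | x) = θ(y · wᵀx) coincides with the cross-entropy error above.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

# Made-up data and weights, purely to check the algebra numerically.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])   # toy inputs
y = rng.choice([-1, 1], size=5)                              # toy +/-1 labels
w = rng.normal(size=3)                                       # arbitrary weights

# Negative average log-likelihood with P(y | x) = theta(y * w.x) ...
neg_avg_log_lik = -np.mean(np.log(theta(y * (X @ w))))
# ... equals the cross-entropy error (1/N) sum_n ln(1 + exp(-y_n * w.x_n)).
cross_entropy = np.mean(np.log1p(np.exp(-y * (X @ w))))
assert np.isclose(neg_avg_log_lik, cross_entropy)
```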


SLIDE 14

How To Minimize Ein(w)

Classification – PLA/Pocket (iterative).
Regression – pseudo-inverse (analytic), from solving $\nabla_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) = 0$.
Logistic Regression – analytic won't work. Numerically/iteratively set $\nabla_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) \to 0$.


SLIDE 15

Finding The Best Weights - Hill Descent

A ball on a complicated hilly terrain rolls down to a local valley (this is called a local minimum).

Questions:
  • How do we get to the bottom of the deepest valley?
  • How do we do this when we don't have gravity?


SLIDE 16

Our Ein Has Only One Valley

[Plot: in-sample error $E_{\text{in}}$ versus weights $\mathbf{w}$, showing a single valley.]

. . . because Ein(w) is a convex function of w.

(So, who cares if it looks ugly!)


SLIDE 17

How to “Roll Down”?

Assume you are at weights $\mathbf{w}(t)$ and you take a step of size $\eta$ in the direction $\hat{\mathbf{v}}$:

$$\mathbf{w}(t+1) = \mathbf{w}(t) + \eta\,\hat{\mathbf{v}}$$

We get to pick $\hat{\mathbf{v}}$ ← what's the best direction to take the step?

Pick $\hat{\mathbf{v}}$ to make $E_{\text{in}}(\mathbf{w}(t+1))$ as small as possible.


SLIDE 18

The Gradient is the Fastest Way to Roll Down

Approximating the change in $E_{\text{in}}$:

$$\Delta E_{\text{in}} = E_{\text{in}}(\mathbf{w}(t+1)) - E_{\text{in}}(\mathbf{w}(t)) = E_{\text{in}}(\mathbf{w}(t) + \eta\,\hat{\mathbf{v}}) - E_{\text{in}}(\mathbf{w}(t))$$
$$= \eta\,\nabla E_{\text{in}}(\mathbf{w}(t))^T \hat{\mathbf{v}} + O(\eta^2) \qquad \text{(Taylor's approximation)}$$
$$\geq -\eta\,\|\nabla E_{\text{in}}(\mathbf{w}(t))\| \qquad \leftarrow \text{attained at } \hat{\mathbf{v}} = -\frac{\nabla E_{\text{in}}(\mathbf{w}(t))}{\|\nabla E_{\text{in}}(\mathbf{w}(t))\|}$$

The best (steepest) direction to move is the negative gradient:

$$\hat{\mathbf{v}} = -\frac{\nabla E_{\text{in}}(\mathbf{w}(t))}{\|\nabla E_{\text{in}}(\mathbf{w}(t))\|}$$
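A small NumPy sketch, on made-up data, of this claim: it computes the gradient of the cross-entropy error (which works out to $-\frac{1}{N}\sum_n y_n \mathbf{x}_n\, \theta(-y_n \mathbf{w}^T\mathbf{x}_n)$ by differentiating the $E_{\text{in}}$ above), checks it against finite differences, and verifies that a small step along the negative normalized gradient decreases $E_{\text{in}}$.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def E_in(w, X, y):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad_E_in(w, X, y):
    # dE_in/dw = -(1/N) * sum_n y_n x_n theta(-y_n * w.x_n)
    return -(y[:, None] * X * theta(-y * (X @ w))[:, None]).mean(axis=0)

# Made-up data and a random starting point.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.choice([-1, 1], size=50)
w = rng.normal(size=3)

# Check the analytic gradient against central finite differences.
g = grad_E_in(w, X, y)
g_fd = np.array([(E_in(w + 1e-6 * e, X, y) - E_in(w - 1e-6 * e, X, y)) / 2e-6
                 for e in np.eye(3)])
assert np.allclose(g, g_fd, atol=1e-6)

# A small step along the negative normalized gradient decreases E_in.
eta = 0.01
v_hat = -g / np.linalg.norm(g)
assert E_in(w + eta * v_hat, X, y) < E_in(w, X, y)
```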


SLIDE 19

“Rolling Down” ≡ Iterating the Negative Gradient

w(0)
  ↓ ← negative gradient
w(1)
  ↓ ← negative gradient
w(2)
  ↓ ← negative gradient
w(3)
  ↓
  . . .

(η = 0.5; 15 steps)


SLIDE 20

The ‘Goldilocks’ Step Size

[Three panels plotting in-sample error $E_{\text{in}}$ versus weights $\mathbf{w}$:]
  • η too small: η = 0.1; 75 steps
  • η too large: η = 2; 10 steps
  • variable ηt (just right): 10 steps


SLIDE 21

Fixed Learning Rate Gradient Descent

Choose the step size proportional to the size of the gradient:

$$\eta_t = \eta \cdot \|\nabla E_{\text{in}}(\mathbf{w}(t))\|$$

($\|\nabla E_{\text{in}}(\mathbf{w}(t))\| \to 0$ when closer to the minimum.)

Then the move becomes

$$\eta_t\,\hat{\mathbf{v}} = -\eta_t \cdot \frac{\nabla E_{\text{in}}(\mathbf{w}(t))}{\|\nabla E_{\text{in}}(\mathbf{w}(t))\|} = -\eta \cdot \|\nabla E_{\text{in}}(\mathbf{w}(t))\| \cdot \frac{\nabla E_{\text{in}}(\mathbf{w}(t))}{\|\nabla E_{\text{in}}(\mathbf{w}(t))\|} = -\eta \cdot \nabla E_{\text{in}}(\mathbf{w}(t)),$$

i.e., a fixed learning rate $\eta$ applied directly to the gradient.

1: Initialize at step t = 0 to $\mathbf{w}(0)$.
2: for t = 0, 1, 2, . . . do
3:   Compute the gradient $\mathbf{g}_t = \nabla E_{\text{in}}(\mathbf{w}(t))$.  ← (Ex. 3.7 in LFD)
4:   Move in the direction $\mathbf{v}_t = -\mathbf{g}_t$.
5:   Update the weights: $\mathbf{w}(t+1) = \mathbf{w}(t) + \eta\,\mathbf{v}_t$.
6:   Iterate 'until it is time to stop'.
7: end for
8: Return the final weights.

Gradient descent can minimize any smooth function, for example

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot \mathbf{w}^T\mathbf{x}_n}\right) \qquad \leftarrow \text{logistic regression}$$
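A minimal NumPy sketch of the fixed-learning-rate gradient descent listing above, applied to logistic regression; the function names, the fixed number of steps used as the stopping rule, and the toy data are my own, made up for illustration.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_E_in(w, X, y):
    # Gradient of E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w.x_n)).
    return -(y[:, None] * X * theta(-y * (X @ w))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.1, max_steps=1000):
    w = np.zeros(X.shape[1])                # 1: initialize w(0)
    for _ in range(max_steps):              # 2: for t = 0, 1, 2, ...
        g = grad_E_in(w, X, y)              # 3: compute the gradient g_t
        w = w - eta * g                     # 4-5: move in direction -g_t, update weights
    return w                                # 8: return the final weights

# Toy data (made up): bias coordinate x_0 = 1 plus two features, noisy linear labels.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=100))
print(gradient_descent(X, y))
```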


SLIDE 22

Stochastic Gradient Descent (SGD)

A variation of GD that considers only the error on one data point.

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot \mathbf{w}^T\mathbf{x}_n}\right) = \frac{1}{N}\sum_{n=1}^{N} e(\mathbf{w}, \mathbf{x}_n, y_n)$$

  • Pick a random data point (x∗, y∗)
  • Run an iteration of GD on e(w, x∗, y∗)

$$\mathbf{w}(t+1) \leftarrow \mathbf{w}(t) - \eta\,\nabla_{\mathbf{w}}\, e(\mathbf{w}, \mathbf{x}_*, y_*)$$

  • 1. The 'average' move is the same as GD;
  • 2. Computation: a fraction 1/N cheaper per step;
  • 3. Stochastic: helps escape local minima;
  • 4. Simple;
  • 5. Similar to PLA.

Logistic Regression:

$$\mathbf{w}(t+1) \leftarrow \mathbf{w}(t) + \frac{\eta\, y_* \mathbf{x}_*}{1 + e^{\,y_* \mathbf{w}^T\mathbf{x}_*}}$$

(Recall PLA: $\mathbf{w}(t+1) \leftarrow \mathbf{w}(t) + y_* \mathbf{x}_*$)
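A minimal NumPy sketch of this SGD update for logistic regression; the loop structure, step count, and toy data are my own, made up for illustration.

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, steps=10000, seed=0):
    """SGD for logistic regression: one randomly picked data point per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        n = rng.integers(len(y))                         # pick a random data point
        x_n, y_n = X[n], y[n]
        # w <- w + eta * y_n * x_n / (1 + exp(y_n * w.x_n))
        w += eta * y_n * x_n / (1.0 + np.exp(y_n * (w @ x_n)))
    return w

# Toy data (made up): bias coordinate plus two features, noisy linear labels.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = np.sign(X @ np.array([-0.5, 2.0, 1.0]) + 0.3 * rng.normal(size=200))
print(sgd_logistic(X, y))
```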


SLIDE 23

Stochastic Gradient Descent

[Two panels comparing GD and SGD on the same data (N = 10); settings shown: η = 6, 10 steps and η = 2, 30 steps.]
