

  1. Summary
     - Linearly separable classification problems.
     - Logistic loss ℓ_log and (empirical) risk R̂_log.
     - Gradient descent.

  2. (Slide from last time) Classification
     For now, let's consider binary classification: Y = {−1, +1}.
     A linear predictor w ∈ R^d classifies according to sign(w^T x) ∈ {−1, +1}.
     Given ((x_i, y_i))_{i=1}^n and a predictor w ∈ R^d, we want sign(w^T x_i) and y_i to agree.
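
     As an illustration (not from the slides), a minimal sketch of classifying by the sign of w^T x in PyTorch; the data and predictor below are random placeholders.

         import torch

         torch.manual_seed(0)
         n, d = 8, 3
         X = torch.randn(n, d)          # one example x_i per row
         w = torch.randn(d)             # a linear predictor
         y_hat = torch.sign(X @ w)      # predictions in {-1, +1} (torch.sign gives 0 only if w^T x_i is exactly 0)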

  3. (Slide from last time) Logistic loss 1
     Let's state our classification goal with a generic margin loss ℓ:
         R̂_ℓ(w) = (1/n) Σ_{i=1}^n ℓ(y_i w^T x_i);
     the key properties we want:
     - ℓ is continuous;
     - ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_zo(z) for some c > 0 and every z ∈ R, which implies R̂_ℓ(w) ≥ c·R̂_zo(w);
     - ℓ'(0) < 0 (pushes stuff from wrong to right).

  4. (Slide from last time) Logistic loss 1, continued
     Same setup and properties as the previous slide, plus examples.
     - Squared loss, written in margin form: ℓ_ls(z) := (1 − z)^2;
       note ℓ_ls(yŷ) = (1 − yŷ)^2 = y^2 (1 − yŷ)^2 = (y − ŷ)^2, using y^2 = 1 for y ∈ {−1, +1}.
     - Logistic loss: ℓ_log(z) = ln(1 + exp(−z)).
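
     To make the lower-bound property concrete (an added sketch, not from the slides), the code below computes the empirical logistic and zero-one risks on synthetic data; since ℓ_log(z) = ln(1 + exp(−z)) ≥ ln(2)·1[z ≤ 0], the constant c = ln 2 works for the logistic loss.

         import torch

         torch.manual_seed(0)
         n, d = 100, 5
         X = torch.randn(n, d)
         y = torch.sign(torch.randn(n))                        # synthetic labels in {-1, +1}
         w = torch.randn(d)

         margins = y * (X @ w)                                 # y_i w^T x_i
         risk_log = torch.log1p(torch.exp(-margins)).mean()    # empirical logistic risk
         risk_zo = (margins <= 0).float().mean()               # empirical zero-one risk
         assert risk_log >= torch.log(torch.tensor(2.0)) * risk_zo   # bound with c = ln 2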

  5. (Slide from last time) Logistic loss 2
     [Figure: side-by-side plots of the logistic loss (left) and the squared loss (right).]

  6. (Slide from last time) Logistic loss 3
     [Figure: side-by-side plots of the logistic loss (left) and the squared loss (right).]

  7. (Slide from last time) Gradient descent 1
     Given a function F : R^d → R, gradient descent is the iteration
         w_{i+1} := w_i − η_i ∇_w F(w_i),
     where w_0 is given, and η_i is a learning rate / step size.
     [Figure: contour plot illustrating gradient descent.]

  8. (Slide from last time) Gradient descent 1, continued
     Same iteration and figure as the previous slide, plus the question: does this work for least squares?

  9. (Slide from last time) Gradient descent 1, continued
     Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.

  10. (Slide from last time) Gradient descent 2
      Gradient descent is the iteration: w_{i+1} := w_i − η_i ∇_w R̂_log(w_i).
      - Note ℓ'_log(z) = −1/(1 + exp(z)), and use the chain rule (hw1!).
      - Or use pytorch:

            def GD(X, y, loss, step=0.1, n_iters=10000):
                w = torch.zeros(X.shape[1], requires_grad=True)
                for i in range(n_iters):
                    l = loss(X, y, w).mean()        # average loss over the data
                    l.backward()                    # populate w.grad
                    with torch.no_grad():
                        w -= step * w.grad          # gradient step
                        w.grad.zero_()              # reset the gradient for the next iteration
                return w
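
      A hedged usage sketch (not from the slides): one way to call the GD routine above with the logistic loss written in torch. The helper name logistic_loss and the synthetic, linearly separable data are illustrative assumptions.

          import torch

          def logistic_loss(X, y, w):
              # per-example loss ln(1 + exp(-y_i w^T x_i)), labels y in {-1, +1}
              return torch.log1p(torch.exp(-y * (X @ w)))

          torch.manual_seed(0)
          n, d = 200, 4
          X = torch.randn(n, d)
          w_true = torch.randn(d)
          y = torch.sign(X @ w_true)                                # linearly separable labels
          w_hat = GD(X, y, logistic_loss, step=0.1, n_iters=1000)   # GD as defined on the slide above
          print(logistic_loss(X, y, w_hat).mean())                  # empirical logistic risk at the GD output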

  11. Part 2 of logistic regression...

  12. 5. A maximum likelihood derivation

  13. MLE and ERM
      We've studied an ERM perspective on logistic regression:
      - Form the empirical logistic risk R̂_log(w) = (1/n) Σ_{i=1}^n ln(1 + exp(−y_i w^T x_i)).
      - Approximately solve arg min_{w ∈ R^d} R̂_log(w) via gradient descent (or another convex optimization technique).
      We only justified it with "popularity"! Today we'll derive R̂_log via Maximum Likelihood Estimation (MLE).
      1. We form a model for Pr[Y = 1 | X = x], parameterized by w.
      2. We form a full-data log-likelihood (equivalent to R̂_log).
      Let's first describe the distributions underlying the data.

  14. Learning prediction functions
      IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).
      - X takes values in 𝒳. E.g., 𝒳 = R^d.
      - Y takes values in 𝒴. E.g., (regression problems) 𝒴 = R; (classification problems) 𝒴 = {1, ..., K} or 𝒴 = {0, 1} or 𝒴 = {−1, +1}.
      1. We observe (X_1, Y_1), ..., (X_n, Y_n), and then choose a prediction function (i.e., predictor) f̂ : 𝒳 → 𝒴. This is called "learning" or "training".
      2. At prediction time, we observe X and form the prediction f̂(X).
      3. The outcome is Y, and
         - the squared loss is (f̂(X) − Y)^2 (regression problems);
         - the zero-one loss is 1{f̂(X) ≠ Y} (classification problems).
      Note: the expected zero-one loss is E[1{f̂(X) ≠ Y}] = P(f̂(X) ≠ Y), which we also call the error rate.
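
      A small illustration (not from the slides) of the error rate as the empirical mean of the zero-one loss; the labels and predictions below are random placeholders.

          import torch

          torch.manual_seed(0)
          y = torch.randint(0, 2, (1000,))             # outcomes Y_i in {0, 1}
          y_pred = torch.randint(0, 2, (1000,))        # predictions f_hat(X_i), random placeholders here
          error_rate = (y_pred != y).float().mean()    # empirical zero-one loss, estimating P(f_hat(X) != Y)
          print(error_rate)                            # roughly 0.5 for random predictions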

  15. Distributions over labeled examples
      𝒳: space of possible side-information (feature space). 𝒴: space of possible outcomes (label space or output space).
      The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
      1. Marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
      2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.
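
      To make the two-part decomposition concrete (an added sketch, not from the slides), the code below draws X from a marginal distribution and then draws Y from a conditional distribution that depends on x; both distributions are arbitrary choices for illustration.

          import torch

          torch.manual_seed(0)
          n, d = 5, 3
          X = torch.randn(n, d)                                      # X_i drawn from the marginal P_X (standard Gaussian here)
          p = torch.where(X[:, 0] > 0,
                          torch.tensor(0.9), torch.tensor(0.2))      # P(Y = 1 | X = x_i), an arbitrary conditional
          Y = torch.bernoulli(p)                                     # Y_i drawn from P_{Y|X=x_i}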

  16. Optimal classifier
      For binary classification, what function f : 𝒳 → {0, 1} has the smallest risk (i.e., error rate) R(f) := P(f(X) ≠ Y)?
      - Conditional on X = x, the minimizer ŷ of the conditional risk ŷ ↦ P(ŷ ≠ Y | X = x) is
            ŷ := 1 if P(Y = 1 | X = x) > 1/2,   ŷ := 0 if P(Y = 1 | X = x) ≤ 1/2.
      - Therefore, the function f⋆ : 𝒳 → {0, 1} with
            f⋆(x) := 1 if P(Y = 1 | X = x) > 1/2,   f⋆(x) := 0 otherwise,   for x ∈ 𝒳,
        has the smallest risk.
      - f⋆ is called the Bayes (optimal) classifier.
      For 𝒴 = {1, ..., K}: f⋆(x) = arg max_{y ∈ 𝒴} P(Y = y | X = x), x ∈ 𝒳.
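
      As an illustration (not from the slides), the Bayes classifier's thresholding rule given the conditional probabilities P(Y = 1 | X = x); the function name and example values are placeholders.

          import torch

          def bayes_classifier(eta):
              # eta[i] = P(Y = 1 | X = x_i); predict 1 exactly when this exceeds 1/2
              return (eta > 0.5).long()

          eta = torch.tensor([0.1, 0.5, 0.9])
          print(bayes_classifier(eta))   # tensor([0, 0, 1]); the tie at 1/2 goes to class 0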

  17. Logistic regression
      Suppose 𝒳 = R^d and 𝒴 = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a particular form:
          Y | X = x ∼ Bern(η_w(x)),   x ∈ R^d,
      with
          η_w(x) := logistic(x^T w),   x ∈ R^d
      (with parameters w ∈ R^d), and
          logistic(z) := 1/(1 + e^{−z}) = e^z/(1 + e^z),   z ∈ R.
      [Figure: plot of the logistic function, increasing from 0 to 1, with value 1/2 at z = 0.]
      - The conditional distribution of Y given X is Bernoulli; the marginal distribution of X is not specified.
      - With least squares, Y | X = x was N(w^T x, σ^2).
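
      A minimal sketch (not from the slides) of the conditional probability function η_w(x) = logistic(x^T w) and of sampling a label from the model, using torch.sigmoid for the logistic function.

          import torch

          torch.manual_seed(0)
          d = 3
          w = torch.randn(d)                 # model parameters
          x = torch.randn(d)                 # a feature vector
          eta = torch.sigmoid(x @ w)         # eta_w(x) = logistic(x^T w) = P(Y = 1 | X = x)
          y = torch.bernoulli(eta)           # Y | X = x ~ Bern(eta_w(x))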

  18. MLE for logistic regression
      Log-likelihood of w in the iid logistic regression model, given data (X_i, Y_i) = (x_i, y_i) for i = 1, ..., n:
          Σ_{i=1}^n ln( η_w(x_i)^{y_i} (1 − η_w(x_i))^{1 − y_i} )
            = Σ_{i=1}^n [ y_i ln η_w(x_i) + (1 − y_i) ln(1 − η_w(x_i)) ]
            = Σ_{i=1}^n [ −y_i ln(1 + exp(−w^T x_i)) − (1 − y_i) ln(1 + exp(w^T x_i)) ]
            = −Σ_{i=1}^n ln(1 + exp(−(2y_i − 1) w^T x_i)),
      and the old form is recovered with the labels ỹ_i := 2y_i − 1 ∈ {−1, +1}.
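
      As a sanity check (an added sketch, not from the slides), the code below numerically verifies the last equality: the negative log-likelihood with {0, 1} labels equals the sum of logistic losses after relabeling ỹ_i = 2y_i − 1 ∈ {−1, +1}. Double precision is used so the comparison is tight.

          import torch

          torch.manual_seed(0)
          n, d = 50, 4
          X = torch.randn(n, d, dtype=torch.float64)
          y01 = torch.randint(0, 2, (n,)).double()               # labels in {0, 1}
          w = torch.randn(d, dtype=torch.float64)

          eta = torch.sigmoid(X @ w)                             # eta_w(x_i)
          neg_log_lik = -(y01 * torch.log(eta) + (1 - y01) * torch.log(1 - eta)).sum()
          y_pm = 2 * y01 - 1                                     # relabel to {-1, +1}
          logistic_sum = torch.log1p(torch.exp(-y_pm * (X @ w))).sum()
          assert torch.allclose(neg_log_lik, logistic_sum)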

  19. Log-odds function and classifier
      An equivalent way to characterize the logistic regression model: the log-odds function, given by
          log-odds_β(x) = ln( η_β(x) / (1 − η_β(x)) ) = ln( (e^{x^T β} / (1 + e^{x^T β})) / (1 / (1 + e^{x^T β})) ) = x^T β,
      is a linear function¹, parameterized by β ∈ R^d.
      ¹ Some authors allow an affine function; we can get this using affine expansion.

  20. Log-odds function and classifier, continued
      Same as the previous slide, plus: the Bayes optimal classifier f_β : R^d → {0, 1} in the logistic regression model is
          f_β(x) = 0 if x^T β ≤ 0,   f_β(x) = 1 if x^T β > 0.

  21. Log-odds function and classifier, continued
      Same as the previous slide, plus: such classifiers are called linear classifiers.
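
      A short numeric check (not from the slides) that the log-odds of the model probability is the linear function x^T β, and that thresholding the log-odds at 0 gives the same classifier as thresholding η_β at 1/2.

          import torch

          torch.manual_seed(0)
          d = 4
          beta = torch.randn(d, dtype=torch.float64)
          x = torch.randn(d, dtype=torch.float64)

          eta = torch.sigmoid(x @ beta)                    # eta_beta(x)
          log_odds = torch.log(eta / (1 - eta))            # ln(eta / (1 - eta))
          assert torch.allclose(log_odds, x @ beta)        # equals x^T beta
          assert (x @ beta > 0) == (eta > 0.5)             # the same linear classifier either way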
