

  1. Summary
     - Linearly separable classification problems.
     - Logistic loss ℓ_log and (empirical) risk R̂_log.
     - Gradient descent.

  2. (Slide from last time) Classification
     For now, let's consider binary classification: Y = {−1, +1}.
     A linear predictor w ∈ R^d classifies according to sign(w^T x) ∈ {−1, +1}.
     Given ((x_i, y_i))_{i=1}^n and a predictor w ∈ R^d, we want sign(w^T x_i) and y_i to agree.
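
     As an illustration (not from the slides), a minimal sketch of classifying by the sign of w^T x in PyTorch; the data and predictor below are random placeholders.

         import torch

         torch.manual_seed(0)
         n, d = 8, 3
         X = torch.randn(n, d)          # one example x_i per row
         w = torch.randn(d)             # a linear predictor
         y_hat = torch.sign(X @ w)      # predictions in {-1, +1} (torch.sign gives 0 only if w^T x_i is exactly 0)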

  3. (Slide from last time) Logistic loss 1
     Let's state our classification goal with a generic margin loss ℓ:
         R̂_ℓ(w) = (1/n) Σ_{i=1}^n ℓ(y_i w^T x_i);
     the key properties we want:
     - ℓ is continuous;
     - ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_zo(z) for some c > 0 and every z ∈ R, which implies R̂_ℓ(w) ≥ c·R̂_zo(w);
     - ℓ'(0) < 0 (pushes stuff from wrong to right).

  4. (Slide from last time) Logistic loss 1, continued
     Same setup and properties as the previous slide, plus examples.
     - Squared loss, written in margin form: ℓ_ls(z) := (1 − z)^2;
       note ℓ_ls(yŷ) = (1 − yŷ)^2 = y^2 (1 − yŷ)^2 = (y − ŷ)^2, using y^2 = 1 for y ∈ {−1, +1}.
     - Logistic loss: ℓ_log(z) = ln(1 + exp(−z)).
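
     To make the lower-bound property concrete (an added sketch, not from the slides), the code below computes the empirical logistic and zero-one risks on synthetic data; since ℓ_log(z) = ln(1 + exp(−z)) ≥ ln(2)·1[z ≤ 0], the constant c = ln 2 works for the logistic loss.

         import torch

         torch.manual_seed(0)
         n, d = 100, 5
         X = torch.randn(n, d)
         y = torch.sign(torch.randn(n))                        # synthetic labels in {-1, +1}
         w = torch.randn(d)

         margins = y * (X @ w)                                 # y_i w^T x_i
         risk_log = torch.log1p(torch.exp(-margins)).mean()    # empirical logistic risk
         risk_zo = (margins <= 0).float().mean()               # empirical zero-one risk
         assert risk_log >= torch.log(torch.tensor(2.0)) * risk_zo   # bound with c = ln 2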

  5. (Slide from last time) Logistic loss 2
     [Figure: side-by-side plots of the logistic loss (left) and the squared loss (right).]

  6. (Slide from last time) Logistic loss 3
     [Figure: side-by-side plots of the logistic loss (left) and the squared loss (right).]

  7. (Slide from last time) Gradient descent 1
     Given a function F : R^d → R, gradient descent is the iteration
         w_{i+1} := w_i − η_i ∇_w F(w_i),
     where w_0 is given, and η_i is a learning rate / step size.
     [Figure: contour plot illustrating gradient descent.]

  8. (Slide from last time) Gradient descent 1, continued
     Same iteration and figure as the previous slide, plus the question: does this work for least squares?

  9. (Slide from last time) Gradient descent 1, continued
     Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.

  10. (Slide from last time) Gradient descent 2
      Gradient descent is the iteration: w_{i+1} := w_i − η_i ∇_w R̂_log(w_i).
      - Note ℓ'_log(z) = −1/(1 + exp(z)), and use the chain rule (hw1!).
      - Or use pytorch:

            def GD(X, y, loss, step=0.1, n_iters=10000):
                w = torch.zeros(X.shape[1], requires_grad=True)
                for i in range(n_iters):
                    l = loss(X, y, w).mean()        # average loss over the data
                    l.backward()                    # populate w.grad
                    with torch.no_grad():
                        w -= step * w.grad          # gradient step
                        w.grad.zero_()              # reset the gradient for the next iteration
                return w
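
      A hedged usage sketch (not from the slides): one way to call the GD routine above with the logistic loss written in torch. The helper name logistic_loss and the synthetic, linearly separable data are illustrative assumptions.

          import torch

          def logistic_loss(X, y, w):
              # per-example loss ln(1 + exp(-y_i w^T x_i)), labels y in {-1, +1}
              return torch.log1p(torch.exp(-y * (X @ w)))

          torch.manual_seed(0)
          n, d = 200, 4
          X = torch.randn(n, d)
          w_true = torch.randn(d)
          y = torch.sign(X @ w_true)                                # linearly separable labels
          w_hat = GD(X, y, logistic_loss, step=0.1, n_iters=1000)   # GD as defined on the slide above
          print(logistic_loss(X, y, w_hat).mean())                  # empirical logistic risk at the GD output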

  11. Part 2 of logistic regression...

  12. 5. A maximum likelihood derivation

  13. MLE and ERM
      We've studied an ERM perspective on logistic regression:
      - Form the empirical logistic risk R̂_log(w) = (1/n) Σ_{i=1}^n ln(1 + exp(−y_i w^T x_i)).
      - Approximately solve arg min_{w ∈ R^d} R̂_log(w) via gradient descent (or another convex optimization technique).
      We only justified it with "popularity"! Today we'll derive R̂_log via Maximum Likelihood Estimation (MLE).
      1. We form a model for Pr[Y = 1 | X = x], parameterized by w.
      2. We form a full-data log-likelihood (equivalent to R̂_log).
      Let's first describe the distributions underlying the data.

  14. Learning prediction functions
      IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).
      - X takes values in 𝒳. E.g., 𝒳 = R^d.
      - Y takes values in 𝒴. E.g., (regression problems) 𝒴 = R; (classification problems) 𝒴 = {1, ..., K} or 𝒴 = {0, 1} or 𝒴 = {−1, +1}.
      1. We observe (X_1, Y_1), ..., (X_n, Y_n), and then choose a prediction function (i.e., predictor) f̂ : 𝒳 → 𝒴. This is called "learning" or "training".
      2. At prediction time, we observe X and form the prediction f̂(X).
      3. The outcome is Y, and
         - the squared loss is (f̂(X) − Y)^2 (regression problems);
         - the zero-one loss is 1{f̂(X) ≠ Y} (classification problems).
      Note: the expected zero-one loss is E[1{f̂(X) ≠ Y}] = P(f̂(X) ≠ Y), which we also call the error rate.
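
      A small illustration (not from the slides) of the error rate as the empirical mean of the zero-one loss; the labels and predictions below are random placeholders.

          import torch

          torch.manual_seed(0)
          y = torch.randint(0, 2, (1000,))             # outcomes Y_i in {0, 1}
          y_pred = torch.randint(0, 2, (1000,))        # predictions f_hat(X_i), random placeholders here
          error_rate = (y_pred != y).float().mean()    # empirical zero-one loss, estimating P(f_hat(X) != Y)
          print(error_rate)                            # roughly 0.5 for random predictions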

  15. Distributions over labeled examples
      𝒳: space of possible side-information (feature space). 𝒴: space of possible outcomes (label space or output space).
      The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
      1. Marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
      2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.
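
      To make the two-part decomposition concrete (an added sketch, not from the slides), the code below draws X from a marginal distribution and then draws Y from a conditional distribution that depends on x; both distributions are arbitrary choices for illustration.

          import torch

          torch.manual_seed(0)
          n, d = 5, 3
          X = torch.randn(n, d)                                      # X_i drawn from the marginal P_X (standard Gaussian here)
          p = torch.where(X[:, 0] > 0,
                          torch.tensor(0.9), torch.tensor(0.2))      # P(Y = 1 | X = x_i), an arbitrary conditional
          Y = torch.bernoulli(p)                                     # Y_i drawn from P_{Y|X=x_i}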

  16. Optimal classifier
      For binary classification, what function f : 𝒳 → {0, 1} has the smallest risk (i.e., error rate) R(f) := P(f(X) ≠ Y)?
      - Conditional on X = x, the minimizer ŷ of the conditional risk ŷ ↦ P(ŷ ≠ Y | X = x) is
            ŷ := 1 if P(Y = 1 | X = x) > 1/2,   ŷ := 0 if P(Y = 1 | X = x) ≤ 1/2.
      - Therefore, the function f⋆ : 𝒳 → {0, 1} with
            f⋆(x) := 1 if P(Y = 1 | X = x) > 1/2,   f⋆(x) := 0 otherwise,   for x ∈ 𝒳,
        has the smallest risk.
      - f⋆ is called the Bayes (optimal) classifier.
      For 𝒴 = {1, ..., K}: f⋆(x) = arg max_{y ∈ 𝒴} P(Y = y | X = x), x ∈ 𝒳.
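
      As an illustration (not from the slides), the Bayes classifier's thresholding rule given the conditional probabilities P(Y = 1 | X = x); the function name and example values are placeholders.

          import torch

          def bayes_classifier(eta):
              # eta[i] = P(Y = 1 | X = x_i); predict 1 exactly when this exceeds 1/2
              return (eta > 0.5).long()

          eta = torch.tensor([0.1, 0.5, 0.9])
          print(bayes_classifier(eta))   # tensor([0, 0, 1]); the tie at 1/2 goes to class 0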

  17. Logistic regression
      Suppose 𝒳 = R^d and 𝒴 = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a particular form:
          Y | X = x ∼ Bern(η_w(x)),   x ∈ R^d,
      with
          η_w(x) := logistic(x^T w),   x ∈ R^d
      (with parameters w ∈ R^d), and
          logistic(z) := 1/(1 + e^{−z}) = e^z/(1 + e^z),   z ∈ R.
      [Figure: plot of the logistic function, increasing from 0 to 1, with value 1/2 at z = 0.]
      - The conditional distribution of Y given X is Bernoulli; the marginal distribution of X is not specified.
      - With least squares, Y | X = x was N(w^T x, σ^2).
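
      A minimal sketch (not from the slides) of the conditional probability function η_w(x) = logistic(x^T w) and of sampling a label from the model, using torch.sigmoid for the logistic function.

          import torch

          torch.manual_seed(0)
          d = 3
          w = torch.randn(d)                 # model parameters
          x = torch.randn(d)                 # a feature vector
          eta = torch.sigmoid(x @ w)         # eta_w(x) = logistic(x^T w) = P(Y = 1 | X = x)
          y = torch.bernoulli(eta)           # Y | X = x ~ Bern(eta_w(x))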

  18. MLE for logistic regression
      Log-likelihood of w in the iid logistic regression model, given data (X_i, Y_i) = (x_i, y_i) for i = 1, ..., n:
          Σ_{i=1}^n ln( η_w(x_i)^{y_i} (1 − η_w(x_i))^{1 − y_i} )
            = Σ_{i=1}^n [ y_i ln η_w(x_i) + (1 − y_i) ln(1 − η_w(x_i)) ]
            = Σ_{i=1}^n [ −y_i ln(1 + exp(−w^T x_i)) − (1 − y_i) ln(1 + exp(w^T x_i)) ]
            = −Σ_{i=1}^n ln(1 + exp(−(2y_i − 1) w^T x_i)),
      and the old form is recovered with the labels ỹ_i := 2y_i − 1 ∈ {−1, +1}.
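
      As a sanity check (an added sketch, not from the slides), the code below numerically verifies the last equality: the negative log-likelihood with {0, 1} labels equals the sum of logistic losses after relabeling ỹ_i = 2y_i − 1 ∈ {−1, +1}. Double precision is used so the comparison is tight.

          import torch

          torch.manual_seed(0)
          n, d = 50, 4
          X = torch.randn(n, d, dtype=torch.float64)
          y01 = torch.randint(0, 2, (n,)).double()               # labels in {0, 1}
          w = torch.randn(d, dtype=torch.float64)

          eta = torch.sigmoid(X @ w)                             # eta_w(x_i)
          neg_log_lik = -(y01 * torch.log(eta) + (1 - y01) * torch.log(1 - eta)).sum()
          y_pm = 2 * y01 - 1                                     # relabel to {-1, +1}
          logistic_sum = torch.log1p(torch.exp(-y_pm * (X @ w))).sum()
          assert torch.allclose(neg_log_lik, logistic_sum)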

  19. Log-odds function and classifier
      An equivalent way to characterize the logistic regression model: the log-odds function, given by
          log-odds_β(x) = ln( η_β(x) / (1 − η_β(x)) ) = ln( (e^{x^T β} / (1 + e^{x^T β})) / (1 / (1 + e^{x^T β})) ) = x^T β,
      is a linear function¹, parameterized by β ∈ R^d.
      ¹ Some authors allow an affine function; we can get this using affine expansion.

  20. Log-odds function and classifier, continued
      Same as the previous slide, plus: the Bayes optimal classifier f_β : R^d → {0, 1} in the logistic regression model is
          f_β(x) = 0 if x^T β ≤ 0,   f_β(x) = 1 if x^T β > 0.

  21. Log-odds function and classifier, continued
      Same as the previous slide, plus: such classifiers are called linear classifiers.
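
      A short numeric check (not from the slides) that the log-odds of the model probability is the linear function x^T β, and that thresholding the log-odds at 0 gives the same classifier as thresholding η_β at 1/2.

          import torch

          torch.manual_seed(0)
          d = 4
          beta = torch.randn(d, dtype=torch.float64)
          x = torch.randn(d, dtype=torch.float64)

          eta = torch.sigmoid(x @ beta)                    # eta_beta(x)
          log_odds = torch.log(eta / (1 - eta))            # ln(eta / (1 - eta))
          assert torch.allclose(log_odds, x @ beta)        # equals x^T beta
          assert (x @ beta > 0) == (eta > 0.5)             # the same linear classifier either way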
