

  1. Logistic regression CS 446

  2. 1. Linear classifiers

  3. Linear regression
     Last two lectures, we studied linear regression; the output/label space Y was R.
     [Figure: scatter plot of delay versus duration.]

  4. Linear classification
     Today, the goal is a linear classifier; the output/label space Y is discrete.
     [Figure: scatter plot of a two-class dataset in the plane.]

  5. Notation
     For now, let's consider binary classification: Y = {-1, +1}.
     A linear predictor w ∈ R^d classifies according to sign(w^T x) ∈ {-1, +1}.
     Given ((x_i, y_i))_{i=1}^n and a predictor w ∈ R^d, we want sign(w^T x_i) and y_i to agree.
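
A minimal numpy sketch of this setup (the arrays X, y, and w below are made-up illustrations, not data from the slides): a predictor w labels each example x_i by sign(w^T x_i), and we check how often that sign agrees with y_i.

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5]])  # rows are examples x_i in R^d
    y = np.array([1, 1, -1])                               # labels in {-1, +1}
    w = np.array([0.5, 0.3])                               # a candidate linear predictor

    predictions = np.sign(X @ w)           # sign(w^T x_i) for every example
    agreement = np.mean(predictions == y)  # fraction of examples where sign(w^T x_i) = y_i
    print(predictions, agreement)          # [ 1.  1. -1.]  1.0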

  6. Geometry of linear classifiers
     A hyperplane in R^d is a linear subspace of dimension d-1.
     ◮ A hyperplane in R^2 is a line.
     ◮ A hyperplane in R^3 is a plane.
     ◮ As a linear subspace, a hyperplane always contains the origin.
     A hyperplane H can be specified by a (non-zero) normal vector w ∈ R^d.
     The hyperplane with normal vector w is the set of points orthogonal to w:
         H = { x ∈ R^d : x^T w = 0 }.
     Given w and its corresponding H: H splits R^d into the points labeled positive,
     { x : w^T x > 0 }, and those labeled negative, { x : w^T x < 0 }.
     [Figure: a hyperplane H in the plane with normal vector w.]

  7. Classification with a hyperplane
     [Figure: the hyperplane H, its normal vector w, and the line span{w}.]

  8. Classification with a hyperplane
     The projection of x onto span{w} (a line) has coordinate ‖x‖_2 · cos(θ), where
         cos(θ) = x^T w / (‖w‖_2 ‖x‖_2).
     (The distance from x to the hyperplane is ‖x‖_2 · |cos(θ)|.)
     [Figure: x, its angle θ with w, and its projection onto span{w}.]

  9. Classification with a hyperplane
     The projection of x onto span{w} (a line) has coordinate ‖x‖_2 · cos(θ), where
         cos(θ) = x^T w / (‖w‖_2 ‖x‖_2).
     (The distance from x to the hyperplane is ‖x‖_2 · |cos(θ)|.)
     The decision boundary is the hyperplane (oriented by w):
         x^T w > 0  ⟺  ‖x‖_2 · cos(θ) > 0  ⟺  x is on the same side of H as w.
     [Figure: x, its angle θ with w, and its projection onto span{w}.]

  10. Classification with a hyperplane
     The projection of x onto span{w} (a line) has coordinate ‖x‖_2 · cos(θ), where
         cos(θ) = x^T w / (‖w‖_2 ‖x‖_2).
     (The distance from x to the hyperplane is ‖x‖_2 · |cos(θ)|.)
     The decision boundary is the hyperplane (oriented by w):
         x^T w > 0  ⟺  ‖x‖_2 · cos(θ) > 0  ⟺  x is on the same side of H as w.
     What should we do if we want a hyperplane decision boundary that doesn't (necessarily)
     go through the origin?
     [Figure: x, its angle θ with w, and its projection onto span{w}.]
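
A rough numerical check of these formulas (the vectors w and x are made up for illustration): the signed coordinate of x along span{w} is x^T w / ‖w‖_2 = ‖x‖_2 cos(θ), and its absolute value is the distance from x to H.

    import numpy as np

    w = np.array([3.0, 4.0])   # normal vector of the hyperplane H = {z : w^T z = 0}
    x = np.array([2.0, 1.0])   # a point to classify

    coord = (x @ w) / np.linalg.norm(w)  # signed coordinate ||x||_2 * cos(theta) along span{w}
    distance = abs(coord)                # distance from x to the hyperplane H
    same_side_as_w = (x @ w) > 0         # x lands on the same side of H as w exactly when coord > 0
    print(coord, distance, same_side_as_w)   # 2.0 2.0 True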

  11. Linear separability
     Is it always possible to find w with sign(w^T x_i) = y_i for all i?
     Is it always possible to find a hyperplane separating the data?
     (Appending a 1 to each x_i means the hyperplane need not go through the origin; see the
     sketch below.)
     [Figure: a linearly separable dataset and a dataset that is not linearly separable.]
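
One way to read "appending 1" in code (array names are illustrative): augmenting each example with a constant feature lets a hyperplane through the origin in R^{d+1} act as an affine decision boundary in R^d.

    import numpy as np

    X = np.array([[0.2, 0.1], [0.9, 0.8], [1.5, 1.4]])  # original features in R^2
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])    # append a constant 1 to each example

    w_aug = np.array([1.0, 1.0, -2.0])   # last coordinate plays the role of an offset/bias
    print(np.sign(X_aug @ w_aug))        # boundary x1 + x2 - 2 = 0, which misses the origin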

  12. Decision boundary with quadratic feature expansion
     [Figure: two panels, an elliptical decision boundary and a hyperbolic decision boundary.]

  13. Decision boundary with quadratic feature expansion
     The same feature expansions we saw for linear regression models can also be used here
     to "upgrade" linear classifiers.
     [Figure: two panels, an elliptical decision boundary and a hyperbolic decision boundary.]
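
A sketch of one such expansion (the helper quadratic_features is a made-up name, and the degree-2 monomials below are one common choice): a classifier that is linear in the expanded features can have an elliptical or hyperbolic boundary in the original coordinates.

    import numpy as np

    def quadratic_features(X):
        # map (x1, x2) to (1, x1, x2, x1^2, x1*x2, x2^2)
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

    X = np.array([[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]])
    Phi = quadratic_features(X)
    w = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # boundary x1^2 + x2^2 = 1, a circle
    print(np.sign(Phi @ w))                        # linear in Phi, quadratic in the original x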

  14. Finding linear classifiers with ERM
     Why not feed our goal into an optimization package, in the form
         argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n 1[sign(w^T x_i) ≠ y_i] ?

  15. Finding linear classifiers with ERM
     Why not feed our goal into an optimization package, in the form
         argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n 1[sign(w^T x_i) ≠ y_i] ?
     ◮ Discrete/combinatorial search; often NP-hard.

  16. Relaxing the ERM problem
     Let's remove one source of discreteness:
         (1/n) Σ_{i=1}^n 1[sign(w^T x_i) ≠ y_i]   →   (1/n) Σ_{i=1}^n 1[y_i (w^T x_i) ≤ 0].
     Did we lose something in this process? Should it be ">" or "≥"?

  17. Relaxing the ERM problem
     Let's remove one source of discreteness:
         (1/n) Σ_{i=1}^n 1[sign(w^T x_i) ≠ y_i]   →   (1/n) Σ_{i=1}^n 1[y_i (w^T x_i) ≤ 0].
     Did we lose something in this process? Should it be ">" or "≥"?
     y_i (w^T x_i) is the (unnormalized) margin of w on example i; we have written this
     problem with a margin loss:
         R_zo(w) = (1/n) Σ_{i=1}^n ℓ_zo(y_i w^T x_i),   where ℓ_zo(z) = 1[z ≤ 0].
     (The remainder of the lecture will use single-parameter margin losses.)
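
A short sketch of the relaxed objective in margin form (function and array names are illustrative): the margin of w on example i is y_i w^T x_i, and ℓ_zo simply counts nonpositive margins.

    import numpy as np

    def zero_one_risk(w, X, y):
        # empirical risk with the margin-form zero-one loss, l_zo(z) = 1[z <= 0]
        margins = y * (X @ w)          # y_i * (w^T x_i) for every example
        return np.mean(margins <= 0)   # fraction of nonpositive margins

    X = np.array([[1.0, 0.0], [0.0, 1.0], [-2.0, -1.0]])
    y = np.array([1, -1, -1])
    print(zero_one_risk(np.array([1.0, -1.0]), X, y))  # 0.0: this w separates the data

This objective is still discontinuous and piecewise constant in w, which is why the next section replaces ℓ_zo with a smoother margin loss.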

  18. 2. Logistic loss and risk

  19. Logistic loss
     Let's state our classification goal with a generic margin loss ℓ:
         R_ℓ(w) = (1/n) Σ_{i=1}^n ℓ(y_i w^T x_i);
     the key properties we want:
     ◮ ℓ is continuous;
     ◮ ℓ(z) ≥ c · 1[z ≤ 0] = c · ℓ_zo(z) for some c > 0 and all z ∈ R, which implies
       R_ℓ(w) ≥ c · R_zo(w);
     ◮ ℓ'(0) < 0 (pushes examples from the wrong side toward the right side).

  20. Logistic loss
     Let's state our classification goal with a generic margin loss ℓ:
         R_ℓ(w) = (1/n) Σ_{i=1}^n ℓ(y_i w^T x_i);
     the key properties we want:
     ◮ ℓ is continuous;
     ◮ ℓ(z) ≥ c · 1[z ≤ 0] = c · ℓ_zo(z) for some c > 0 and all z ∈ R, which implies
       R_ℓ(w) ≥ c · R_zo(w);
     ◮ ℓ'(0) < 0 (pushes examples from the wrong side toward the right side).
     Examples.
     ◮ Squared loss, written in margin form: ℓ_ls(z) := (1 - z)^2;
       note ℓ_ls(y ŷ) = (1 - y ŷ)^2 = y^2 (1 - y ŷ)^2 = (y - ŷ)^2 (using y^2 = 1).
     ◮ Logistic loss: ℓ_log(z) = ln(1 + exp(-z)).
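
A sketch of the two margin losses (names are illustrative; ln(1 + exp(-z)) is computed with np.logaddexp to avoid overflow for very negative z):

    import numpy as np

    def logistic_loss(z):
        # l_log(z) = ln(1 + exp(-z)), computed stably as logaddexp(0, -z)
        return np.logaddexp(0.0, -z)

    def squared_loss_margin(z):
        # l_ls(z) = (1 - z)^2; with z = y*yhat and y in {-1,+1} this equals (y - yhat)^2
        return (1.0 - z) ** 2

    z = np.array([-5.0, 0.0, 5.0])
    print(logistic_loss(z))        # large on the wrong side, ln(2) at the boundary, small on the right side
    print(squared_loss_margin(z))  # also penalizes confidently correct predictions (z >> 1)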

  21. Squared and logistic losses on linearly separable data I
     [Figure: two panels comparing fits on linearly separable data; panels: logistic loss,
     squared loss.]

  22. Squared and logistic losses on linearly separable data II
     [Figure: two panels comparing fits on linearly separable data; panels: logistic loss,
     squared loss.]

  23. Logistic risk and separation
     If there exists a perfect linear separator, empirical logistic risk minimization should
     find it.
     Theorem.

  24. Logistic risk and separation
     If there exists a perfect linear separator, empirical logistic risk minimization should
     find it.
     Theorem. If there exists w̄ with y_i w̄^T x_i > 0 for all i, then every w with
         R_log(w) < ln(2)/(2n) + inf_v R_log(v)
     also satisfies y_i w^T x_i > 0 for all i.

  25. Logistic risk and separation
     If there exists a perfect linear separator, empirical logistic risk minimization should
     find it.
     Theorem. If there exists w̄ with y_i w̄^T x_i > 0 for all i, then every w with
         R_log(w) < ln(2)/(2n) + inf_v R_log(v)
     also satisfies y_i w^T x_i > 0 for all i.
     Proof.
     Step 1: low risk implies few mistakes. For any w with y_j w^T x_j ≤ 0 for some j,
         R_log(w) ≥ (1/n) ln(1 + exp(-y_j w^T x_j)) ≥ ln(2)/n.
     By the contrapositive, any w with R_log(w) < ln(2)/n makes no mistakes.
     Step 2: inf_v R_log(v) = 0. Note:
         0 ≤ inf_v R_log(v) ≤ inf_{r>0} (1/n) Σ_{i=1}^n ln(1 + exp(-r y_i w̄^T x_i)) = 0.
     Combining the two steps: the assumed bound gives R_log(w) < ln(2)/(2n) < ln(2)/n, so w
     makes no mistakes. This completes the proof.
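
Step 2 of the proof is easy to see numerically (a rough sketch on made-up separable data; w_bar is a perfect separator): scaling w_bar by a growing r > 0 drives the empirical logistic risk toward 0.

    import numpy as np

    X = np.array([[1.0, 0.5], [2.0, 1.0], [-1.0, -0.5], [-2.0, -1.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w_bar = np.array([1.0, 1.0])   # satisfies y_i * w_bar^T x_i > 0 for every i

    def logistic_risk(w):
        # (1/n) sum_i ln(1 + exp(-y_i w^T x_i))
        return np.mean(np.logaddexp(0.0, -y * (X @ w)))

    for r in [1.0, 10.0, 100.0]:
        print(r, logistic_risk(r * w_bar))  # the risk shrinks toward 0 as r grows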

  26. 3. Minimizing the empirical logistic risk

  27. Least squares and logistic ERM
     Least squares:
     ◮ Take the gradient of ‖Aw - b‖^2 and set it to 0; obtain the normal equations
       A^T A w = A^T b.
     ◮ One choice is the minimum-norm solution A^+ b.

  28. Least squares and logistic ERM
     Least squares:
     ◮ Take the gradient of ‖Aw - b‖^2 and set it to 0; obtain the normal equations
       A^T A w = A^T b.
     ◮ One choice is the minimum-norm solution A^+ b.
     Logistic loss:
     ◮ Take the gradient of R_log(w) = (1/n) Σ_{i=1}^n ln(1 + exp(-y_i w^T x_i)) and set it
       to 0 ???

  29. Least squares and logistic ERM
     Least squares:
     ◮ Take the gradient of ‖Aw - b‖^2 and set it to 0; obtain the normal equations
       A^T A w = A^T b.
     ◮ One choice is the minimum-norm solution A^+ b.
     Logistic loss:
     ◮ Take the gradient of R_log(w) = (1/n) Σ_{i=1}^n ln(1 + exp(-y_i w^T x_i)) and set it
       to 0 ???
     Remark. Is A^+ b a "closed form expression"?
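
For contrast with the logistic case, the least-squares side really does have a direct expression: a minimum-norm solution is A^+ b, which numpy exposes via the pseudoinverse (A and b below are made up).

    import numpy as np

    A = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 5.0]])  # design matrix
    b = np.array([1.0, 2.0, 2.0])                       # targets

    w = np.linalg.pinv(A) @ b     # minimum-norm least-squares solution A^+ b
    print(w)
    print(A.T @ (A @ w - b))      # satisfies the normal equations: A^T(Aw - b) is (numerically) 0

One reading of the remark above: the pseudoinverse is itself computed by an iterative numerical routine (an SVD), so "closed form" is not as clear-cut as it sounds.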

  30. Decreasing R
     We need to move down the contours of R_log:
     [Figure: contour plot of R_log.]

  31. Gradient descent
     Given a function F : R^d → R, gradient descent is the iteration
         w_{i+1} := w_i - η_i ∇_w F(w_i),
     where w_0 is given and η_i is a learning rate / step size.
     [Figure: contour plot of R_log.]
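
Putting the pieces together, a minimal gradient-descent sketch for the empirical logistic risk (the data, step size, and iteration count are arbitrary choices for illustration). Since ℓ_log'(z) = -1/(1 + exp(z)), the gradient is ∇R_log(w) = -(1/n) Σ_i y_i x_i / (1 + exp(y_i w^T x_i)).

    import numpy as np

    X = np.array([[1.0, 0.5], [2.0, 1.0], [-1.0, -0.5], [-2.0, -1.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    def risk_and_gradient(w):
        margins = y * (X @ w)
        risk = np.mean(np.logaddexp(0.0, -margins))              # (1/n) sum ln(1 + exp(-y_i w^T x_i))
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)   # gradient of the empirical logistic risk
        return risk, grad

    w = np.zeros(2)   # w_0
    eta = 0.5         # constant learning rate / step size
    for t in range(200):
        risk, grad = risk_and_gradient(w)
        w = w - eta * grad                  # w_{t+1} = w_t - eta * grad R_log(w_t)
    print(w, risk, np.sign(X @ w) == y)     # the risk decreases and the signs match the labels

On separable data like this, the risk keeps decreasing toward 0 while ‖w‖ keeps growing, since (as in Step 2 of the theorem's proof) the infimum of the empirical logistic risk is 0 and is not attained by any finite w.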
