  1. CS 6316 Machine Learning: Linear Predictors
     Yangfeng Ji, Department of Computer Science, University of Virginia

  2. Overview
     1. Review: Linear Functions
     2. Perceptron
     3. Logistic Regression
     4. Linear Regression

  3. Review: Linear Functions

  4. Linear Predictors
     Linear predictors discussed in this course:
     ◮ halfspace predictors
     ◮ logistic regression classifiers
     ◮ linear SVMs (lecture on support vector machines)
     ◮ naive Bayes classifiers (lecture on generative models)
     ◮ linear regression predictors

  5. Linear Predictors
     Linear predictors discussed in this course:
     ◮ halfspace predictors
     ◮ logistic regression classifiers
     ◮ linear SVMs (lecture on support vector machines)
     ◮ naive Bayes classifiers (lecture on generative models)
     ◮ linear regression predictors
     A common core form of these linear predictors is
        h_{w,b}(x) = ⟨w, x⟩ + b = ∑_{i=1}^{d} w_i x_i + b    (1)
     where w is the weight vector and b is the bias.

  6. Alternative Form
     Given the original definition of a linear function
        h_{w,b}(x) = ⟨w, x⟩ + b = ∑_{i=1}^{d} w_i x_i + b,    (2)
     we can redefine it in a more compact form:
        w ← (w_1, w_2, ..., w_d, b)^T
        x ← (x_1, x_2, ..., x_d, 1)^T
     and then
        h_{w,b}(x) = ⟨w, x⟩    (3)
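
     A quick way to check this reformulation numerically is to fold the bias into the weight vector and append a constant 1 to each input. The sketch below does exactly that; the function names and example values are illustrative, not from the slides.

```python
import numpy as np

def linear_with_bias(w, b, x):
    # Original form: h_{w,b}(x) = <w, x> + b
    return np.dot(w, x) + b

def linear_compact(w_aug, x_aug):
    # Compact form: h(x) = <w', x'> with w' = (w_1, ..., w_d, b), x' = (x_1, ..., x_d, 1)
    return np.dot(w_aug, x_aug)

w, b = np.array([1.0, 1.0]), -0.5
x = np.array([0.3, 0.4])

w_aug = np.append(w, b)       # (w_1, ..., w_d, b)
x_aug = np.append(x, 1.0)     # (x_1, ..., x_d, 1)

assert np.isclose(linear_with_bias(w, b, x), linear_compact(w_aug, x_aug))
```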

  7. Linear Functions
     Consider a two-dimensional case with w = (1, 1, −0.5):
        f(x) = w^T x = x_1 + x_2 − 0.5    (4)
     Different values of f(x) map to different areas of this 2-D space. For example, the following equation defines the blue line L:
        f(x) = w^T x = 0    (5)
     (figure: the line L in the (x_1, x_2) plane)

  8. Properties of Linear Functions (II)
     For any two points x and x′ lying on the line,
        f(x) − f(x′) = w^T x − w^T x′ = 0    (6)
     i.e., w^T (x − x′) = 0, so the weight direction is orthogonal to the line L.
     (figure: two points x and x′ on the line in the (x_1, x_2) plane)
     [Friedman et al., 2001, Section 4.5]

  9. Properties of Linear Functions (III)
     Furthermore,
        f(x) = x_1 + x_2 − 0.5 = 0    (7)
     separates the 2-D space ℝ² into two halfspaces.
     (figure: the two halfspaces f(x) > 0 and f(x) < 0 on either side of the line)

  10. Properties of Linear Functions (IV)
     From the perspective of linear projection, f(x) = 0 defines the vectors in this 2-D space whose projections onto the direction (1, 1) have the same magnitude 0.5:
        x_1 + x_2 − 0.5 = 0  ⇒  (x_1, x_2) · (1, 1)^T = 0.5    (8)
     (figure: the direction (1, 1) and a point x on the line)
     [Friedman et al., 2001, Section 4.5]

  11. Properties of Linear Functions (IV)
     From the perspective of linear projection, f(x) = 0 defines the vectors in this 2-D space whose projections onto the direction (1, 1) have the same magnitude 0.5:
        x_1 + x_2 − 0.5 = 0  ⇒  (x_1, x_2) · (1, 1)^T = 0.5    (8)
     This idea can be generalized to compute the distance between a point and a line.
     [Friedman et al., 2001, Section 4.5]

  12. Properties of Linear Functions (IV)
     The distance of a point x to the line L: f(x) = ⟨w, x⟩ = 0 is given by
        f(x) / ‖w‖_2 = ⟨ w / ‖w‖_2 , x ⟩    (9)
     (figure: a point x and its distance to the line in the (x_1, x_2) plane)
     [Friedman et al., 2001, Section 4.5]
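
     As a concrete check of Eq. 9, the sketch below computes the signed distance from a point to the line x_1 + x_2 − 0.5 = 0 in two equivalent ways. Following Friedman et al., Section 4.5, the norm is taken over the non-bias part of w, which gives the geometric distance; the example values are our own.

```python
import numpy as np

w = np.array([1.0, 1.0, -0.5])   # augmented weights (w_1, w_2, b)
x = np.array([1.0, 2.0, 1.0])    # augmented point (x_1, x_2, 1)

# Signed distance from x to the line <w, x> = 0 (Eq. 9),
# normalizing by the norm of the non-bias part of w.
w_dir = w[:2]
signed_dist = np.dot(w, x) / np.linalg.norm(w_dir)

# Equivalent form: project (x_1, x_2) onto the unit normal direction and shift by the bias.
unit_normal = w_dir / np.linalg.norm(w_dir)
signed_dist_alt = np.dot(unit_normal, x[:2]) + w[2] / np.linalg.norm(w_dir)

assert np.isclose(signed_dist, signed_dist_alt)
print(signed_dist)   # 2.5 / sqrt(2), about 1.77
```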

  13. Perceptron

  14. Halfspace Hypothesis Class
     ◮ X = ℝ^d
     ◮ Y = {−1, +1}
     ◮ Halfspace hypothesis class
          H_half = { sign(⟨w, x⟩) : w ∈ ℝ^d }    (10)
       which is an infinite hypothesis space.
     The sign function y = sign(x) returns +1 when x > 0 and −1 when x < 0.

  15. Linearly Separable Cases
     In a linearly separable case, the algorithm can find a hyperplane that separates all positive examples from the negative examples.
     (figure: positive and negative points separated by a line in the (x_1, x_2) plane)
     The definition of linearly separable cases is with respect to the training set S, not the distribution D.

  16. Prediction Rule
     The prediction rule of a half-space predictor is based on the sign of h(x) = sign(⟨w, x⟩):
        h(x) = +1 if ⟨w, x⟩ > 0,  −1 if ⟨w, x⟩ < 0    (11)
     (figure: the two halfspaces ⟨w, x⟩ > 0 (+) and ⟨w, x⟩ < 0 (−))

  17. Prediction Rule
     The prediction rule of a half-space predictor is based on the sign of h(x) = sign(⟨w, x⟩):
        h(x) = +1 if ⟨w, x⟩ > 0,  −1 if ⟨w, x⟩ < 0    (11)
     or, if y′ ∈ {−1, +1} and y′ ⟨w, x⟩ > 0, then
        h(x) = y′    (12)
     (figure: the two halfspaces ⟨w, x⟩ > 0 (+) and ⟨w, x⟩ < 0 (−))
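
     A small sketch of this prediction rule, checking that the sign form (Eq. 11) and the y′⟨w, x⟩ > 0 form (Eq. 12) agree; the example vectors and the tie-breaking at ⟨w, x⟩ = 0 are our own choices, since the slides leave the boundary case unspecified.

```python
import numpy as np

def predict_sign(w, x):
    # Eq. 11: h(x) = +1 if <w, x> > 0, -1 if <w, x> < 0
    return 1 if np.dot(w, x) > 0 else -1

def predict_margin(w, x):
    # Eq. 12: h(x) is the label y' in {-1, +1} with y' * <w, x> > 0
    for y_prime in (+1, -1):
        if y_prime * np.dot(w, x) > 0:
            return y_prime
    return -1  # convention for <w, x> == 0 (not fixed on the slide)

w = np.array([1.0, 1.0, -0.5])
for x in [np.array([0.8, 0.9, 1.0]), np.array([0.1, 0.2, 1.0])]:
    assert predict_sign(w, x) == predict_margin(w, x)
```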

  18. Perceptron Algorithm
     The perceptron algorithm is defined as
       1: Input: S = {(x_1, y_1), ..., (x_m, y_m)}
       2: Initialize w^(0) = (0, ..., 0)
       9: Output: w^(T)

  19. Perceptron Algorithm
     The perceptron algorithm is defined as
       1: Input: S = {(x_1, y_1), ..., (x_m, y_m)}
       2: Initialize w^(0) = (0, ..., 0)
       3: for t = 1, 2, ..., T do
       4:   i ← t mod m
       8: end for
       9: Output: w^(T)

  20. Perceptron Algorithm
     The perceptron algorithm is defined as
       1: Input: S = {(x_1, y_1), ..., (x_m, y_m)}
       2: Initialize w^(0) = (0, ..., 0)
       3: for t = 1, 2, ..., T do
       4:   i ← t mod m
       5:   if y_i ⟨w^(t), x_i⟩ ≤ 0 then
       6:     w^(t+1) ← w^(t) + y_i x_i   // updating rule
       7:   end if
       8: end for
       9: Output: w^(T)

  21. Perceptron Algorithm
     The perceptron algorithm is defined as
       1: Input: S = {(x_1, y_1), ..., (x_m, y_m)}
       2: Initialize w^(0) = (0, ..., 0)
       3: for t = 1, 2, ..., T do
       4:   i ← t mod m
       5:   if y_i ⟨w^(t), x_i⟩ ≤ 0 then
       6:     w^(t+1) ← w^(t) + y_i x_i   // updating rule
       7:   end if
       8: end for
       9: Output: w^(T)
     Exercise: Implement this algorithm on a simple example.
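
     One way to approach the exercise is the short NumPy sketch below, following the pseudocode line by line. The toy dataset and the bias handling (appending a constant 1 feature, as in the "Alternative Form" slide) are our own choices for illustration.

```python
import numpy as np

def perceptron(X, y, T=1000):
    """Perceptron as in the pseudocode: cycle through examples,
    update w whenever y_i * <w, x_i> <= 0, return the final w."""
    m, d = X.shape
    w = np.zeros(d)                      # line 2: w^(0) = (0, ..., 0)
    for t in range(1, T + 1):            # line 3
        i = t % m                        # line 4: i <- t mod m (0-based here)
        if y[i] * np.dot(w, X[i]) <= 0:  # line 5: mistake on example i?
            w = w + y[i] * X[i]          # line 6: updating rule
    return w                             # line 9: output w^(T)

# Toy linearly separable data (illustrative), with a constant 1 appended
# so the bias is folded into w.
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([-1, +1, +1, +1])           # separable by x_1 + x_2 > 0.5

w = perceptron(X, y, T=100)
print(w, np.sign(X @ w) == y)            # all predictions should match y
```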

  22. Two Questions
     The updating rule can be broken down into two cases:
        w^(t+1) ← w^(t) + y_i x_i    (13)
     ◮ For y_i = +1: w^(t+1) ← w^(t) + x_i
     ◮ For y_i = −1: w^(t+1) ← w^(t) − x_i

  23. Two Questions
     The updating rule can be broken down into two cases:
        w^(t+1) ← w^(t) + y_i x_i    (13)
     ◮ For y_i = +1: w^(t+1) ← w^(t) + x_i
     ◮ For y_i = −1: w^(t+1) ← w^(t) − x_i
     Two questions:
     ◮ How does the updating rule help?
     ◮ How many updating steps does the algorithm need?

  24. The Updating Rule
     At time step t, given the training example (x_i, y_i) and the current weight w^(t):
        y_i ⟨w^(t+1), x_i⟩ = y_i ⟨w^(t) + y_i x_i, x_i⟩    (14)
                           = y_i ⟨w^(t), x_i⟩ + ‖x_i‖²    (15)
     (the last step uses y_i² = 1)
     ◮ w^(t+1) gives a higher value of y_i ⟨w^(t+1), x_i⟩ on predicting x_i than w^(t)
     ◮ the update is affected by the norm of x_i, ‖x_i‖²
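
     A quick numerical check of Eq. 15; the vectors and label below are arbitrary values chosen for illustration.

```python
import numpy as np

x_i = np.array([0.5, -1.0, 1.0])
y_i = -1
w_t = np.array([0.2, 0.3, -0.1])

w_next = w_t + y_i * x_i  # perceptron update

lhs = y_i * np.dot(w_next, x_i)
rhs = y_i * np.dot(w_t, x_i) + np.dot(x_i, x_i)  # Eq. 15: the score grows by ||x_i||^2
assert np.isclose(lhs, rhs)
```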

  25. Theorem
     Assume that {(x_i, y_i)}_{i=1}^m is separable. Let
     ◮ B = min{ ‖w‖ : ∀ i ∈ [m], y_i ⟨w, x_i⟩ ≥ 1 }, and
     ◮ R = max_i ‖x_i‖.
     Then, the Perceptron algorithm stops after at most (RB)² iterations, and when it stops it holds that
        ∀ i ∈ [m], y_i ⟨w^(t), x_i⟩ > 0    (16)
     ◮ A realizable case with an infinite hypothesis space
     ◮ Training finishes in finitely many steps
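
     To get a feel for the bound, the sketch below counts perceptron mistakes on the toy dataset from the earlier sketch and compares them to (R·‖w*‖)², where w* is any separating vector rescaled so that min_i y_i⟨w*, x_i⟩ = 1. Since B is the minimum such norm, ‖w*‖ ≥ B, so this is a looser version of the theorem's bound; the particular w* is our own choice.

```python
import numpy as np

X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([-1, +1, +1, +1])

# Count mistakes made by the perceptron (same loop as before).
w = np.zeros(X.shape[1])
mistakes = 0
for t in range(1, 1001):
    i = t % len(y)
    if y[i] * np.dot(w, X[i]) <= 0:
        w = w + y[i] * X[i]
        mistakes += 1

# Any separating w*, rescaled so that min_i y_i <w*, x_i> = 1,
# has norm >= B, so (R * ||w*||)^2 upper-bounds the theorem's (RB)^2.
w_star = np.array([1.0, 1.0, -0.5])
w_star = w_star / np.min(y * (X @ w_star))
R = np.max(np.linalg.norm(X, axis=1))
print(mistakes, (R * np.linalg.norm(w_star)) ** 2)  # mistakes <= bound
```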

  26. Example
     (figure-only slides: a sequence of snapshots illustrating the perceptron updating rule on a 2-D example; see [Bishop, 2006, Page 195]. Images not recoverable from the extracted text.)

  30. The XOR Example: a Non-separable Case
     ◮ X_1, X_2 ∈ {0, 1}
     ◮ the XOR operation is defined as Y = X_1 ⊕ X_2, where
          Y = 1 if X_1 ≠ X_2
          Y = 0 if X_1 = X_2
     (figure: the four XOR points in the (x_1, x_2) plane; no single line separates the two classes)
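
     As an illustration of non-separability, the sketch below runs the same perceptron loop on the four XOR points (with labels mapped to {−1, +1} and a bias feature appended, both our own encoding choices) and shows that it never reaches a mistake-free pass.

```python
import numpy as np

# XOR points with a constant 1 appended; labels mapped 0 -> -1, 1 -> +1.
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([-1, +1, +1, -1])   # Y = X_1 XOR X_2

w = np.zeros(3)
for t in range(1, 10001):
    i = t % 4
    if y[i] * np.dot(w, X[i]) <= 0:
        w = w + y[i] * X[i]

# Even after many passes, some training example is still misclassified.
print(np.sign(X @ w) == y)       # never all True: XOR is not linearly separable
```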

  31. The XOR Example: Further Comment

  32. Logistic Regression

  33. Hypothesis Class
     ◮ The hypothesis class of logistic regression is defined as
          H_LR = { σ(⟨w, x⟩) : w ∈ ℝ^d }    (17)
     ◮ The sigmoid function σ(a), with a ∈ ℝ:
          σ(a) = 1 / (1 + exp(−a))    (18)
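
     A minimal sketch of the sigmoid and the corresponding logistic-regression score from Eqs. 17 and 18; the function names are ours.

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a)), Eq. 18
    return 1.0 / (1.0 + np.exp(-a))

def logistic_score(w, x):
    # A hypothesis in H_LR: sigma(<w, x>), Eq. 17
    return sigmoid(np.dot(w, x))

print(sigmoid(0.0))                          # 0.5: the point where <w, x> = 0
print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # monotone in its argument
```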

  34. Unified Form for Logistic Predictors
     ◮ A unified form for y ∈ {−1, +1}:
          h(x, y) = 1 / (1 + exp(−y ⟨w, x⟩))    (19)
       which is similar to the half-space predictors

  35. Unified Form for Logistic Predictors
     ◮ A unified form for y ∈ {−1, +1}:
          h(x, y) = 1 / (1 + exp(−y ⟨w, x⟩))    (19)
       which is similar to the half-space predictors
     ◮ Prediction
       1. Compute the values from Eq. 19 with y ∈ {−1, +1}
       2. Pick the y that has the bigger value:
            y = +1 if h(x, +1) > h(x, −1),  −1 if h(x, +1) < h(x, −1)    (20)
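
     This two-step prediction rule can be sketched directly; the example weights and input below are illustrative.

```python
import numpy as np

def h(w, x, y):
    # Unified form, Eq. 19: h(x, y) = 1 / (1 + exp(-y <w, x>))
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

def predict(w, x):
    # Eq. 20: pick the label with the larger value of h(x, y)
    return +1 if h(w, x, +1) > h(w, x, -1) else -1

w = np.array([1.0, 1.0, -0.5])
x = np.array([0.2, 0.1, 1.0])
print(h(w, x, +1), h(w, x, -1), predict(w, x))
# Since h(x, +1) + h(x, -1) = 1, this is equivalent to thresholding h(x, +1) at 0.5.
```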

  36. A Predictor
     Take a closer look at the unified definition of h(x, y):
     ◮ When y = +1:
          h_w(x, +1) = 1 / (1 + exp(−⟨w, x⟩))
     ◮ When y = −1:
          h_w(x, −1) = 1 / (1 + exp(⟨w, x⟩))
                     = exp(−⟨w, x⟩) / (1 + exp(−⟨w, x⟩))
                     = 1 − 1 / (1 + exp(−⟨w, x⟩))
                     = 1 − h_w(x, +1)

  37. A Linear Classifier?
     To justify that this is a linear classifier, let's look at the decision boundary given by
        h(x, +1) = h(x, −1)    (21)
     Specifically, we have
        1 / (1 + exp(−⟨w, x⟩)) = 1 / (1 + exp(⟨w, x⟩))
        exp(−⟨w, x⟩) = exp(⟨w, x⟩)
        −⟨w, x⟩ = ⟨w, x⟩
        2 ⟨w, x⟩ = 0
     The decision boundary ⟨w, x⟩ = 0 is a straight line.

  38. Risk/Loss Function
     For a given training example (x, y), the risk/loss function is defined as the negative log of h(x, y):
        L(h_w, (x, y)) = −log( 1 / (1 + exp(−y ⟨w, x⟩)) )
                       = log(1 + exp(−y ⟨w, x⟩))    (22)
     Intuitively, minimizing the risk will increase the value of h(x, y).
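
     A small sketch of the logistic loss from Eq. 22. Here np.logaddexp is used so the computation stays stable when −y⟨w, x⟩ is large; the helper name and example values are our own.

```python
import numpy as np

def logistic_loss(w, x, y):
    # Eq. 22: L(h_w, (x, y)) = log(1 + exp(-y <w, x>)),
    # computed as logaddexp(0, -y <w, x>) to avoid overflow.
    return np.logaddexp(0.0, -y * np.dot(w, x))

w = np.array([1.0, 1.0, -0.5])
x = np.array([0.2, 0.1, 1.0])
print(logistic_loss(w, x, +1))   # larger loss: this w prefers y = -1 for this x
print(logistic_loss(w, x, -1))   # smaller loss for the favored label
```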
