
COMS 4721: Machine Learning for Data Science, Lecture 8, 2/14/2017



  1. COMS 4721: Machine Learning for Data Science, Lecture 8, 2/14/2017

Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University

  2. LINEAR CLASSIFICATION

  3. BINARY CLASSIFICATION

We focus on binary classification, with input x_i ∈ R^d and output y_i ∈ {±1}.

◮ We define a classifier f, which makes the prediction y_i = f(x_i, Θ) based on a function of x_i and parameters Θ. In other words, f : R^d → {−1, +1}.

Last lecture, we discussed the Bayes classification framework.

◮ Here, Θ contains: (1) class prior probabilities on y, (2) parameters for the class-dependent distribution on x.

This lecture we'll introduce the linear classification framework.

◮ In this approach the prediction is linear in the parameters Θ.
◮ In fact, there is an intersection between the two that we discuss next.

  4. A BAYES CLASSIFIER

Bayes decisions

With the Bayes classifier we predict the class of a new x to be the most probable label given the model and training data (x_1, y_1), ..., (x_n, y_n).

In the binary case, we declare class y = 1 if

$$ p(x \mid y = 1)\,\underbrace{P(y = 1)}_{\pi_1} \;>\; p(x \mid y = 0)\,\underbrace{P(y = 0)}_{\pi_0}
   \quad \Longleftrightarrow \quad \ln \frac{p(x \mid y = 1)\,P(y = 1)}{p(x \mid y = 0)\,P(y = 0)} > 0. $$

This second expression is referred to as the log odds.

  5. A BAYES CLASSIFIER

Gaussian with shared covariance

Let's look at the log odds for the special case where p(x | y) = N(x | µ_y, Σ) (i.e., a single Gaussian with a shared covariance matrix):

$$ \ln \frac{p(x \mid y = 1)\,P(y = 1)}{p(x \mid y = 0)\,P(y = 0)}
   = \underbrace{\ln \frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a constant, call it } w_0}
   \;+\; x^T \underbrace{\Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a vector, call it } w}. $$

This is also called "linear discriminant analysis" (LDA).

  6. A BAYES CLASSIFIER

So we can write the decision rule for this Bayes classifier as a linear one:

f(x) = sign(x^T w + w_0).

◮ This is what we saw last lecture (but now class 0 is called −1).
◮ The Bayes classifier produced a linear decision boundary in the data space when Σ_1 = Σ_0.
◮ w and w_0 are obtained through a specific equation.

[Figure: two Gaussian class densities with P(ω_1) = P(ω_2) = 0.5 and the linear decision boundary separating regions R_1 and R_2.]

  7. LINEAR CLASSIFIERS

This Bayes classifier is one instance of a linear classifier

f(x) = sign(x^T w + w_0),

where

$$ w_0 = \ln \frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0),
   \qquad w = \Sigma^{-1}(\mu_1 - \mu_0), $$

with MLE used to find values for π_y, µ_y and Σ.

Setting w_0 and w this way may be too restrictive:

◮ This Bayes classifier assumes a single Gaussian with shared covariance.
◮ Maybe if we relax what values w_0 and w can take, we can do better.
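
To make the MLE plug-in concrete, here is a minimal NumPy sketch (not from the lecture; the function names `fit_lda_classifier` and `lda_predict` are illustrative) that computes w and w_0 from labeled data using the formulas above.

```python
import numpy as np

def fit_lda_classifier(X, y):
    """Shared-covariance Gaussian Bayes classifier (LDA) fit by MLE.

    X : (n, d) array of inputs; y : (n,) array of labels in {-1, +1}.
    Returns (w, w0) for the linear decision rule sign(x^T w + w0).
    """
    X1, X0 = X[y == 1], X[y == -1]
    pi1, pi0 = len(X1) / len(X), len(X0) / len(X)        # class priors
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)          # class means
    # MLE of the shared covariance: pooled within-class scatter divided by n
    S = ((X1 - mu1).T @ (X1 - mu1) + (X0 - mu0).T @ (X0 - mu0)) / len(X)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu0)
    w0 = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ S_inv @ (mu1 - mu0)
    return w, w0

def lda_predict(X, w, w0):
    """Predict labels in {-1, +1} with the linear rule from the slide."""
    return np.sign(X @ w + w0)
```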

  8. LINEAR CLASSIFIERS (BINARY CASE)

Definition: Binary linear classifier
A binary linear classifier is a function of the form f(x) = sign(x^T w + w_0), where w ∈ R^d and w_0 ∈ R.

Since the goal is to learn w, w_0 from data, we are assuming that linear separability in x is an accurate property of the classes.

Definition: Linear separability
Two sets A, B ⊂ R^d are called linearly separable if

$$ x^T w + w_0 \;\begin{cases} > 0 & \text{if } x \in A \;\; (\text{e.g., class } +1) \\ < 0 & \text{if } x \in B \;\; (\text{e.g., class } -1) \end{cases} $$

The pair (w, w_0) defines an affine hyperplane. It is important to develop the right geometric understanding of what this is doing.
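
As a direct translation of the separability definition, a small sketch (assuming NumPy arrays; the name `linearly_separates` is illustrative):

```python
import numpy as np

def linearly_separates(w, w0, A, B):
    """Check the definition above: does the affine hyperplane (w, w0) put
    every point of A strictly on the positive side and every point of B
    strictly on the negative side?

    A : (n_A, d) array, B : (n_B, d) array, w : (d,) array, w0 : scalar.
    """
    return bool(np.all(A @ w + w0 > 0) and np.all(B @ w + w0 < 0))
```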

  9. HYPERPLANES

Geometric interpretation of linear classifiers:

A hyperplane in R^d is a linear subspace of dimension (d − 1).

◮ An R^2-hyperplane is a line.
◮ An R^3-hyperplane is a plane.
◮ As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be represented by a vector w as follows:

$$ H = \{\, x \in \mathbb{R}^d \mid x^T w = 0 \,\}. $$

[Figure: a hyperplane H through the origin in the (x_1, x_2) plane, with normal vector w.]

  10. WHICH SIDE OF THE PLANE ARE WE ON?

Distance from the plane

◮ How close is a point x to H?
◮ Cosine rule: x^T w = ‖x‖_2 ‖w‖_2 cos θ.
◮ The distance of x to the hyperplane is ‖x‖_2 · |cos θ| = |x^T w| / ‖w‖_2. So |x^T w| gives a sense of distance.

Which side of the hyperplane?

◮ The cosine satisfies cos θ > 0 if θ ∈ (−π/2, π/2).
◮ So the sign of cos θ tells us the side of H, and by the cosine rule sign(cos θ) = sign(x^T w).

[Figure: a point x at angle θ to the normal vector w of the hyperplane H.]
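
A quick numeric illustration of these two facts, assuming a toy w and x (the values are made up):

```python
import numpy as np

w = np.array([2.0, 1.0])   # normal vector defining H = {x : x^T w = 0}
x = np.array([1.0, 3.0])   # query point

side = np.sign(x @ w)                        # sign(cos theta): which side of H
distance = abs(x @ w) / np.linalg.norm(w)    # |x^T w| / ||w||_2: distance to H
print(side, distance)
```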

  11. AFFINE HYPERPLANES

Affine hyperplanes

◮ An affine hyperplane H is a hyperplane translated (shifted) using a scalar w_0.
◮ Think of: H = { x | x^T w + w_0 = 0 }.
◮ Setting w_0 > 0 moves the hyperplane in the opposite direction of w (w_0 < 0 in the figure, where the shift from the origin is −w_0 / ‖w‖_2).

Which side of the hyperplane now?

◮ The plane has been shifted by distance −w_0 / ‖w‖_2 in the direction w.
◮ For a given w, w_0 and input x, the inequality x^T w + w_0 > 0 says that x is on the far side of the affine hyperplane H in the direction w points.

  12. CLASSIFICATION WITH AFFINE HYPERPLANES

[Figure: an affine hyperplane H with normal vector w, shifted from the origin by −w_0 / ‖w‖_2; points with sign(x^T w + w_0) > 0 lie on the side w points toward, and points with sign(x^T w + w_0) < 0 lie on the other side.]

  13. POLYNOMIAL GENERALIZATIONS

The same generalizations from regression also hold for classification:

◮ (left) A linear classifier using x = (x_1, x_2).
◮ (right) A linear classifier using x = (x_1, x_2, x_1^2, x_2^2). The decision boundary is linear in R^4, but isn't when plotted in R^2.

  14. ANOTHER BAYES CLASSIFIER

Gaussian with different covariance

Let's look at the log odds for the general case where p(x | y) = N(x | µ_y, Σ_y) (i.e., now each class has its own covariance):

$$ \ln \frac{p(x \mid y = 1)\,P(y = 1)}{p(x \mid y = 0)\,P(y = 0)}
   = \underbrace{(\text{something complicated not involving } x)}_{\text{a constant}}
   + \underbrace{x^T \big(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0\big)}_{\text{a part that's linear in } x}
   + \underbrace{x^T \big(\tfrac{1}{2}\Sigma_0^{-1} - \tfrac{1}{2}\Sigma_1^{-1}\big)\, x}_{\text{a part that's quadratic in } x}. $$

Also called "quadratic discriminant analysis," but it's linear in the weights.

  15. ANOTHER BAYES CLASSIFIER

◮ We also saw this last lecture.
◮ Notice that f(x) = sign(x^T A x + x^T b + c) is linear in A, b, c.
◮ When x ∈ R^2, rewrite as x ← (x_1, x_2, 2x_1x_2, x_1^2, x_2^2) and do linear classification in R^5 (see the sketch after this slide).

Whereas the Bayes classifier with shared covariance is a version of linear classification, using different covariances is like polynomial classification.

[Figure: the class densities and quadratic decision boundary from last lecture.]
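
A minimal sketch of the feature map mentioned above (the function name `quadratic_features` is illustrative); any linear classifier applied to the output is then operating in R^5:

```python
import numpy as np

def quadratic_features(X):
    """Map 2-D inputs (x1, x2) to (x1, x2, 2*x1*x2, x1^2, x2^2), so that a
    quadratic decision boundary in R^2 becomes a linear one in R^5.

    X : (n, 2) array; returns an (n, 5) array.
    """
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, 2 * x1 * x2, x1**2, x2**2])
```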

  16. LEAST SQUARES ON {−1, +1}

How do we define more general classifiers of the form f(x) = sign(x^T w + w_0)?

◮ One simple idea is to treat classification as a regression problem:
  1. Let y = (y_1, ..., y_n)^T, where y_i ∈ {−1, +1} is the class of x_i.
  2. Add a dimension equal to 1 to each x_i and construct the matrix X = [x_1, ..., x_n]^T.
  3. Learn the least squares weight vector w = (X^T X)^{−1} X^T y.
  4. For a new point x_0, declare y_0 = sign(x_0^T w) (here w_0 is included in w).
◮ Another option: instead of LS, use ℓ_p regularization.
◮ These are "baseline" options. We can use them, along with k-NN, to get a quick sense of what performance we're aiming to beat.
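
The four steps above in a minimal NumPy sketch (the function names are illustrative; `np.linalg.solve` is used rather than forming the inverse explicitly, and X^T X is assumed invertible):

```python
import numpy as np

def least_squares_classifier(X, y):
    """Baseline classifier: regress labels in {-1, +1} on the inputs.

    Appends a constant-1 feature so the offset w0 is folded into w,
    then solves w = (X^T X)^{-1} X^T y.
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # step 2: add the 1 dimension
    w = np.linalg.solve(X1.T @ X1, X1.T @ y)        # step 3: least squares solution
    return w

def ls_predict(X, w):
    """Step 4: declare y0 = sign(x0^T w) for each new point."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(X1 @ w)
```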

  17. SENSITIVITY TO OUTLIERS

Least squares can do well, but it is sensitive to outliers. In general we can find better classifiers that focus more on the decision boundary.

◮ (left) Least squares (purple) does well compared with another method.
◮ (right) Least squares does poorly because of outliers.

[Figure: two scatter plots comparing decision boundaries, without and with outliers.]

  18. THE PERCEPTRON ALGORITHM

  19. EASY CASE: LINEARLY SEPARABLE DATA

(Assume each data point x_i has a 1 attached.)

Suppose there is a linear classifier with zero training error:

y_i = sign(x_i^T w), for all i.

Then the data is "linearly separable."

[Figure, left: the two classes can be separated with a line; an infinite number of such lines can be found.]

  20. PERCEPTRON (ROSENBLATT, 1958)

Using the linear classifier y = f(x) = sign(x^T w), the Perceptron seeks to minimize

$$ \mathcal{L} = - \sum_{i=1}^{n} \big( y_i \cdot x_i^T w \big)\, \mathbb{1}\{\, y_i \neq \operatorname{sign}(x_i^T w) \,\}. $$

Because y ∈ {−1, +1},

$$ y_i \cdot x_i^T w \;\text{ is }\; \begin{cases} > 0 & \text{if } y_i = \operatorname{sign}(x_i^T w) \\ < 0 & \text{if } y_i \neq \operatorname{sign}(x_i^T w) \end{cases} $$

By minimizing L we're trying to always predict the correct label.
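
Written out as code, the objective looks like this (a sketch assuming NumPy arrays; the function name is illustrative):

```python
import numpy as np

def perceptron_loss(w, X, y):
    """L = - sum_i (y_i * x_i^T w) * 1{ y_i != sign(x_i^T w) }.

    X : (n, d) inputs (with the 1 attached); y : (n,) labels in {-1, +1}.
    Only misclassified points contribute, each by the positive amount
    -y_i * x_i^T w.
    """
    margins = y * (X @ w)                   # y_i * x_i^T w for each i
    misclassified = y != np.sign(X @ w)     # indicator 1{ y_i != sign(x_i^T w) }
    return -np.sum(margins[misclassified])
```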

  21. LEARNING THE PERCEPTRON

◮ Unlike other techniques we've talked about, we can't find the minimum of L by taking a derivative and setting it to zero: ∇_w L = 0 cannot be solved for w analytically. However, ∇_w L does tell us the direction in which L is increasing in w.
◮ Therefore, for a sufficiently small η, if we update w′ ← w − η ∇_w L, then L(w′) < L(w), i.e., we have a better value for w.
◮ This is a very general method for optimizing an objective function called gradient descent. The Perceptron uses a "stochastic" version of this.
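
As an illustration of the stochastic version mentioned above, here is a minimal NumPy sketch (not the lecture's pseudocode; the function name, the step size `eta`, and the stopping rule are assumptions made for the example):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_steps=10000, seed=0):
    """Stochastic gradient descent on the perceptron loss above.

    X : (n, d) inputs with the constant-1 feature attached; y : (n,) labels
    in {-1, +1}. For a misclassified point i the gradient contribution is
    -y_i * x_i, so the stochastic update is w <- w + eta * y_i * x_i.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        mis = np.where(y != np.sign(X @ w))[0]   # currently misclassified points
        if mis.size == 0:                        # zero training error: stop
            break
        i = rng.choice(mis)                      # pick one misclassified point
        w = w + eta * y[i] * X[i]                # one stochastic gradient step
    return w
```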
