
CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 3

CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 3 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 5, 2015 Announcements Homework 2 will be out tomorrow No class next week Course project proposal due next Monday 2


  1. CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 3 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 5, 2015

  2. Announcements
     • Homework 2 will be out tomorrow
     • No class next week
     • Course project proposal due next Monday

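  3. Methods to Learn
     Data types: Matrix Data; Text Data; Set Data; Sequence Data; Time Series; Graph & Network; Images
     • Classification: Decision Tree, Naïve Bayes, Logistic Regression, SVM, kNN (Matrix Data); HMM (Sequence Data); Label Propagation* (Graph & Network); Neural Network (Images)
     • Clustering: K-means, hierarchical clustering, DBSCAN, Mixture Models, kernel k-means* (Matrix Data); PLSA (Text Data); SCAN*, Spectral Clustering* (Graph & Network)
     • Frequent Pattern Mining: Apriori, FP-growth (Set Data); GSP, PrefixSpan (Sequence Data)
     • Prediction: Linear Regression (Matrix Data); Autoregression (Time Series)
     • Similarity Search: DTW (Time Series); P-PageRank (Graph & Network)
     • Ranking: PageRank (Graph & Network)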

  4. Matrix Data: Classification: Part 3
     • SVM (Support Vector Machine)
     • kNN (k Nearest Neighbor)
     • Other Issues
     • Summary

  5. Math Review
     • Vector: $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
     • Subtracting two vectors: $\mathbf{x} = \mathbf{b} - \mathbf{a}$
     • Dot product: $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$
     • Geometric interpretation: projection
     • If $\mathbf{a}$ and $\mathbf{b}$ are orthogonal, $\mathbf{a} \cdot \mathbf{b} = 0$
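
The vector operations reviewed on this slide can be sketched in a few lines of plain Python. The helper names `dot` and `subtract` are illustrative choices, not part of the lecture.

```python
def dot(a, b):
    """Dot product of two equal-length vectors: the sum of a_i * b_i."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(ai * bi for ai, bi in zip(a, b))


def subtract(b, a):
    """Component-wise difference b - a."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [bi - ai for ai, bi in zip(a, b)]


a = [1.0, 0.0]
b = [0.0, 2.0]
print(dot(a, b))       # the two axes are orthogonal, so the dot product is zero
print(subtract(b, a))  # the vector pointing from a to b
```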

  6. Math Review (Cont.)
     • Plane/Hyperplane: $a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = c$
     • Line (n=2), plane (n=3), hyperplane (higher dimensions)
     • Normal of a plane: $\mathbf{n} = (a_1, a_2, \ldots, a_n)$, a vector which is perpendicular to the surface

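  7. Math Review (Cont.)
     • Define a plane using normal $\mathbf{n} = (a, b, c)$ and a point $(x_0, y_0, z_0)$ in the plane: $(a, b, c) \cdot (x_0 - x, y_0 - y, z_0 - z) = 0 \Rightarrow ax + by + cz = a x_0 + b y_0 + c z_0 \; (= d)$
     • Distance from a point $(x_0, y_0, z_0)$ to a plane $ax + by + cz = d$, where $(x, y, z)$ is any point in the plane: $(x_0 - x, y_0 - y, z_0 - z) \cdot \frac{(a, b, c)}{\|(a, b, c)\|} = \frac{a x_0 + b y_0 + c z_0 - d}{\sqrt{a^2 + b^2 + c^2}}$ (this value is signed; take the absolute value for the unsigned distance)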
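
As a small illustration of the point-to-plane distance formula, the sketch below evaluates it for a plane given by its normal and offset. The function name `signed_distance` and the sample numbers are assumptions made for this example.

```python
import math


def signed_distance(point, normal, d):
    """Signed distance from `point` to the plane normal . x = d.

    The sign tells which side of the plane the point lies on;
    use abs() for the unsigned distance.
    """
    numerator = sum(n_i * p_i for n_i, p_i in zip(normal, point)) - d
    norm = math.sqrt(sum(n_i * n_i for n_i in normal))
    return numerator / norm


# Plane 3x + 4y + 0z = 10 and the origin, which lies on its negative side.
print(signed_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), 10.0))
```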

  8. Linear Classifier
     • Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
     • A separating hyperplane can be written as a linear combination of attributes: $W \cdot X + b = 0$, where $W = \{w_1, w_2, \ldots, w_n\}$ is a weight vector and $b$ is a scalar (bias)
     • For 2-D it can be written as $w_0 + w_1 x_1 + w_2 x_2 = 0$ (here $w_0$ plays the role of the bias $b$)
     • Classification: $w_0 + w_1 x_1 + w_2 x_2 > 0 \Rightarrow y_i = +1$; $w_0 + w_1 x_1 + w_2 x_2 \le 0 \Rightarrow y_i = -1$
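
The classification rule on this slide can be written as a short function. This is a minimal sketch; the name `classify` and the sample weights are made up for illustration.

```python
def classify(w, b, x):
    """Return +1 if w . x + b > 0, otherwise -1 (the rule on the slide)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


# Hypothetical 2-D hyperplane x1 + x2 - 1 = 0.
w, b = [1.0, 1.0], -1.0
print(classify(w, b, [2.0, 2.0]))  # positive side
print(classify(w, b, [0.0, 0.0]))  # negative side
```

Points exactly on the hyperplane (score 0) fall into the -1 class, matching the "≤ 0" case above.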

  9. Perceptron
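
The body of this slide is not preserved in the transcript. For orientation only, here is a sketch of the textbook perceptron update rule (add `lr * y * x` to the weights whenever a point is misclassified). It is an assumption that the slide showed this variant; the function name, learning rate, and toy data are all illustrative.

```python
def train_perceptron(data, epochs=100, lr=1.0):
    """Train a linear classifier with the classic perceptron rule.

    `data` is a list of (x, y) pairs with y in {+1, -1}.
    Stops early once a full pass makes no mistakes.
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


# Hypothetical linearly separable toy set.
toy = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((-1.0, -1.0), -1), ((0.0, -2.0), -1)]
w, b = train_perceptron(toy)
print(w, b)
```

The perceptron stops at the first separating hyperplane it finds, which motivates the question of which hyperplane is best.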

  10. Example

  11. Can we do better?
      • Which hyperplane to choose?

  12. SVM: Margins and Support Vectors
      [Figure: a small-margin and a large-margin separating hyperplane, with the support vectors marked]

  13. SVM: When Data Is Linearly Separable
      • Let the data D be $(X_1, y_1), \ldots, (X_{|D|}, y_{|D|})$, where each $X_i$ is a training tuple with associated class label $y_i$
      • There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
      • SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)

  14. SVM: Linearly Separable
      • A separating hyperplane can be written as $W \cdot X + b = 0$
      • The hyperplanes defining the sides of the margin, e.g.: $H_1: w_0 + w_1 x_1 + w_2 x_2 \ge 1$ for $y_i = +1$, and $H_2: w_0 + w_1 x_1 + w_2 x_2 \le -1$ for $y_i = -1$
      • Any training tuples that fall on hyperplanes $H_1$ or $H_2$ (i.e., the sides defining the margin) are support vectors
      • This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers

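  15. Maximum Margin Calculation
      • $\mathbf{w}$: decision hyperplane normal vector
      • $\mathbf{x}_i$: data point i
      • $y_i$: class of data point i (+1 or -1)
      • Margin hyperplanes: $\mathbf{w}^T \mathbf{x}_a + b = 1$ and $\mathbf{w}^T \mathbf{x}_b + b = -1$; decision hyperplane: $\mathbf{w}^T \mathbf{x} + b = 0$
      • Margin: $\rho = \frac{2}{\|\mathbf{w}\|}$
      • Hint: what is the distance between $\mathbf{x}_a$ and the hyperplane $\mathbf{w}^T \mathbf{x} + b = -1$ (on which $\mathbf{x}_b$ lies)?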
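
One way to follow the hint is to apply the point-to-hyperplane distance formula from the math review, written here for the hyperplane $\mathbf{w}^T \mathbf{x} + b + 1 = 0$. This is a worked sketch, not a reproduction of the slide's own derivation.

```latex
% x_a lies on the upper margin hyperplane, so w^T x_a + b = 1.
% Its distance to the lower margin hyperplane w^T x + b = -1 is
\[
\rho
= \frac{\lvert \mathbf{w}^T \mathbf{x}_a + b + 1 \rvert}{\lVert \mathbf{w} \rVert}
= \frac{\lvert 1 + 1 \rvert}{\lVert \mathbf{w} \rVert}
= \frac{2}{\lVert \mathbf{w} \rVert}.
\]
```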

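  16. SVM as a Quadratic Programming Problem
      • QP
        Objective: find $\mathbf{w}$ and $b$ such that $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized
        Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $\mathbf{w}^T \mathbf{x}_i + b \ge 1$ if $y_i = 1$; $\mathbf{w}^T \mathbf{x}_i + b \le -1$ if $y_i = -1$
      • A better form
        Objective: find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized
        Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$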
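
To make the two forms concrete, the sketch below checks the combined constraint $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for a candidate hyperplane and reports both the margin $2/\|\mathbf{w}\|$ and the objective $\frac{1}{2}\mathbf{w}^T\mathbf{w}$. It does not solve the QP; the function names and the two-point toy data are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def is_feasible(w, b, data):
    """True if every (x, y) pair satisfies y * (w . x + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in data)


def margin(w):
    """Geometric margin 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


def objective(w):
    """Quadratic objective 0.5 * w . w; a smaller value means a larger margin."""
    return 0.5 * dot(w, w)


toy = [((1.0, 1.0), 1), ((-1.0, -1.0), -1)]
for w in ([0.5, 0.5], [1.0, 1.0], [0.25, 0.25]):
    print(w, is_feasible(w, 0.0, toy), margin(w), objective(w))
```

Among the feasible candidates, the one with the smaller objective has the larger margin, which is why minimizing $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ is equivalent to maximizing $2/\|\mathbf{w}\|$.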

  17. Solve QP
      • This is now optimizing a quadratic function subject to linear constraints
      • Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
      • The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every constraint in the primal problem.

  18. Lagrange Formulation

  19. Primal Form and Dual Form
      • Primal
        Objective: find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized
        Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$
      • Equivalent under some conditions: KKT conditions
      • Dual
        Objective: find $\alpha_1, \ldots, \alpha_n$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized
        Constraints: (1) $\sum_i \alpha_i y_i = 0$; (2) $\alpha_i \ge 0$ for all $\alpha_i$
      • More derivations: http://cs229.stanford.edu/notes/cs229-notes3.pdf
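
The step from the primal to the dual can be sketched with the standard Lagrangian argument. The derivation below uses the same symbols as the primal and dual forms; it is an outline of the usual textbook route, and the linked notes give the full treatment.

```latex
% Lagrangian of the primal, with one multiplier alpha_i >= 0 per constraint:
\[
L(\mathbf{w}, b, \boldsymbol{\alpha})
= \tfrac{1}{2}\mathbf{w}^T\mathbf{w}
- \sum_i \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right]
\]
% Setting the derivatives with respect to w and b to zero:
\[
\frac{\partial L}{\partial \mathbf{w}} = 0
\;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i,
\qquad
\frac{\partial L}{\partial b} = 0
\;\Rightarrow\; \sum_i \alpha_i y_i = 0
\]
% Substituting both back into L eliminates w and b:
\[
Q(\boldsymbol{\alpha})
= \sum_i \alpha_i
- \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j
\]
```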

  20. The Optimization Problem Solution
      • The solution has the form: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $b = y_k - \mathbf{w}^T \mathbf{x}_k$ for any $\mathbf{x}_k$ such that $\alpha_k \ne 0$
      • Each non-zero $\alpha_i$ indicates that the corresponding $\mathbf{x}_i$ is a support vector.
      • Then the classifying function will have the form: $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
      • Notice that it relies on an inner product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$
      • We will return to this later.
      • Also keep in mind that solving the optimization problem involved computing the inner products $\mathbf{x}_i^T \mathbf{x}_j$ between all pairs of training points.
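
The sketch below shows how $\mathbf{w}$, $b$, and $f(\mathbf{x})$ follow from a set of multipliers. The multipliers are not computed by a QP solver here: for the two-point toy set they were worked out by hand ($\alpha_1 = \alpha_2 = 0.25$ maximizes $Q$ under $\sum_i \alpha_i y_i = 0$). All names and data are illustrative assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def recover_w_b(alphas, xs, ys, tol=1e-12):
    """w = sum_i alpha_i y_i x_i; b = y_k - w . x_k for a support vector k."""
    dim = len(xs[0])
    w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]
    k = next(i for i, a in enumerate(alphas) if a > tol)  # any non-zero alpha
    b = ys[k] - dot(w, xs[k])
    return w, b


def decision(x, alphas, xs, ys, b):
    """f(x) = sum_i alpha_i y_i (x_i . x) + b, using inner products only."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, ys, xs)) + b


xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]  # hand-derived for this toy set, not solver output

w, b = recover_w_b(alphas, xs, ys)
print(w, b)
print(decision((2.0, 0.0), alphas, xs, ys, b))
```

Both training points have non-zero multipliers, so both are support vectors, and `decision` never needs `w` explicitly.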

  21. Sec. 15.2.1 Soft Margin Classification
      • If the training data is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
      • Allow some errors: let some points be moved to where they belong, at a cost
      • Still, try to minimize training set errors, and to place the hyperplane "far" from each class (large margin)
      [Figure: two points on the wrong side of their margin, with slacks $\xi_i$ and $\xi_j$]

  22. Sec. 15.2.1 Soft Margin Classification Mathematically
      • The old formulation: find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized and, for all $\{(\mathbf{x}_i, y_i)\}$, $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$
      • The new formulation incorporating slack variables: find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_i \xi_i$ is minimized and, for all $\{(\mathbf{x}_i, y_i)\}$, $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
      • Parameter $C$ can be viewed as a way to control overfitting
      • A regularization term (L1 regularization)
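
For a fixed hyperplane, the smallest slack that satisfies both constraints is $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$. The sketch below uses that fact to evaluate the soft-margin objective for two values of $C$. The hyperplane and data points are made-up illustrations, and nothing is being optimized.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 with y * (w . x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, data, C):
    """0.5 * w . w + C * (sum of slacks) for a fixed hyperplane."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in data)


w, b = [0.5, 0.5], 0.0
data = [
    ((1.0, 1.0), 1),     # on the margin: slack 0
    ((-1.0, -1.0), -1),  # on the margin: slack 0
    ((0.5, 0.5), -1),    # misclassified noisy point: slack 1.5
    ((1.0, 0.0), 1),     # correct side but inside the margin: slack 0.5
]
for C in (1.0, 10.0):
    print(C, soft_margin_objective(w, b, data, C))
```

A larger $C$ makes each unit of slack more expensive, pushing the optimizer toward fitting the training points at the expense of margin width.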

  23. Sec. 15.2.1 Soft Margin Classification: Solution
      • The dual problem for soft margin classification: find $\alpha_1, \ldots, \alpha_N$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$; (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
      • Neither slack variables $\xi_i$ nor their Lagrange multipliers appear in the dual problem!
      • Again, $\mathbf{x}_i$ with non-zero $\alpha_i$ will be support vectors.
      • Solution to the dual problem is: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $b = y_k (1 - \xi_k) - \mathbf{w}^T \mathbf{x}_k$, where $k = \arg\max_{k'} \alpha_{k'}$
      • $\mathbf{w}$ is not needed explicitly for classification: $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$

  24. Sec. 15.1 Classification with SVMs
      • Given a new point $\mathbf{x}$, we can score its projection onto the hyperplane normal
      • I.e., compute score: $\mathbf{w}^T \mathbf{x} + b = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
      • Decide class based on whether the score is < or > 0
      • Can set a confidence threshold $t$: score > $t$: yes; score < $-t$: no; else: don't know
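
The thresholded decision rule can be expressed directly in code. This is a minimal sketch; the function name `decide` and the sample scores are illustrative.

```python
def decide(score, t=0.0):
    """Map an SVM score to 'yes', 'no', or "don't know" using threshold t >= 0."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"


for s in (1.5, -1.5, 0.3):
    print(s, decide(s, t=1.0))
```

With `t = 0` the rule reduces to the plain sign test, except that a score of exactly 0 is reported as "don't know".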

  25. Sec. 15.2.1 Linear SVMs: Summary
      • The classifier is a separating hyperplane.
      • The most "important" training points are the support vectors; they define the hyperplane.
      • Quadratic optimization algorithms can identify which training points $\mathbf{x}_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.
      • Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
        Find $\alpha_1, \ldots, \alpha_N$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$; (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
        $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
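
To illustrate that the dual touches the training points only through inner products, the sketch below evaluates $Q(\boldsymbol{\alpha})$ from a precomputed matrix of pairwise inner products and checks the two dual constraints. It evaluates candidates rather than solving the problem; the function names and the toy data are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def gram_matrix(xs):
    """All pairwise inner products x_i . x_j."""
    return [[dot(xi, xj) for xj in xs] for xi in xs]


def dual_objective(alphas, ys, K):
    """Q(alpha) = sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K[i][j]."""
    n = len(alphas)
    quad = sum(
        alphas[i] * alphas[j] * ys[i] * ys[j] * K[i][j]
        for i in range(n)
        for j in range(n)
    )
    return sum(alphas) - 0.5 * quad


def satisfies_constraints(alphas, ys, C, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(a * y for a, y in zip(alphas, ys))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alphas)
    return balanced and boxed


xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
K = gram_matrix(xs)
for a in (0.2, 0.25, 0.3):
    print(a, satisfies_constraints([a, a], ys, C=1.0), dual_objective([a, a], ys, K))
```

On this toy set the objective peaks at $\alpha_1 = \alpha_2 = 0.25$; after `gram_matrix` has run, the raw coordinates are no longer needed.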

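  26. Sec. 15.2.3 Non-linear SVMs
      • Datasets that are linearly separable (with some noise) work out great [Figure: two classes along the x axis, separable by a single threshold]
      • But what are we going to do if the dataset is just too hard? [Figure: two classes along the x axis that no single threshold separates]
      • How about mapping data to a higher-dimensional space? [Figure: the same points plotted as $x^2$ against $x$]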

  27. Sec. 15.2.3 Non-linear SVMs: Feature spaces
      • General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$
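
A one-dimensional toy case shows the idea. The labels below cannot be separated by any single threshold on $x$, but after the assumed mapping $\varphi(x) = (x, x^2)$ a hand-picked hyperplane separates them. The data, the mapping, and the weights are all illustrative choices rather than anything learned by an SVM.

```python
def phi(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)


def linear_predict(w, b, z):
    """Linear classifier in feature space: +1 if w . z + b > 0, else -1."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score > 0 else -1


def threshold_separable(xs, ys):
    """True if one threshold on x separates the labels (distinct xs assumed).

    In 1-D this holds exactly when the labels, sorted by x, switch at most once.
    """
    labels = [y for _, y in sorted(zip(xs, ys))]
    changes = sum(1 for l1, l2 in zip(labels, labels[1:]) if l1 != l2)
    return changes <= 1


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]  # positive class on both ends

print(threshold_separable(xs, ys))  # not separable in the original space

w, b = (0.0, 1.0), -2.5  # hand-picked: x^2 - 2.5 = 0
predictions = [linear_predict(w, b, phi(x)) for x in xs]
print(predictions == ys)  # separable after the mapping
```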
