SLIDE 1

CS145: INTRODUCTION TO DATA MINING

5: Vector Data: Support Vector Machine

Instructor: Yizhou Sun

yzsun@cs.ucla.edu

October 18, 2017

SLIDE 2

Announcements

  • Homework 1
  • Due end of the day of this Thursday (11:59pm)

  • Reminder of late submission policy
  • Maximum score = original score * (1 − t/24), where t is the number of hours late
  • E.g., if you are t = 12 hours late, at most half the score can be obtained; if you are 24 hours late, a score of 0 will be given.

SLIDE 3

Methods to Learn: Last Lecture

|                         | Vector Data                                              | Set Data           | Sequence Data   | Text Data            |
| Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN         |                    |                 | Naïve Bayes for Text |
| Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models |                    |                 | PLSA                 |
| Prediction              | Linear Regression; GLM*                                  |                    |                 |                      |
| Frequent Pattern Mining |                                                          | Apriori; FP growth | GSP; PrefixSpan |                      |
| Similarity Search       |                                                          |                    | DTW             |                      |

SLIDE 4

Methods to Learn

|                         | Vector Data                                              | Set Data           | Sequence Data   | Text Data            |
| Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN         |                    |                 | Naïve Bayes for Text |
| Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models |                    |                 | PLSA                 |
| Prediction              | Linear Regression; GLM*                                  |                    |                 |                      |
| Frequent Pattern Mining |                                                          | Apriori; FP growth | GSP; PrefixSpan |                      |
| Similarity Search       |                                                          |                    | DTW             |                      |

SLIDE 5

Support Vector Machine

  • Introduction
  • Linear SVM
  • Non-linear SVM
  • Scalability Issues*
  • Summary

SLIDE 6

Math Review

  • Vector
  • 𝒙 = (x1, x2, …, xn)
  • Subtracting two vectors: 𝒙 = 𝒃 − 𝒂
  • Dot product
  • 𝒂 ⋅ 𝒃 = ∑ ajbj
  • Geometric interpretation: projection
  • If 𝒂 and 𝒃 are orthogonal, 𝒂 ⋅ 𝒃 = 0

SLIDE 7

Math Review (Cont.)

  • Plane/Hyperplane
  • a1x1 + a2x2 + ⋯ + anxn = c
  • Line (n = 2), plane (n = 3), hyperplane (higher dimensions)
  • Normal of a plane
  • 𝒏 = (a1, a2, …, an)
  • a vector which is perpendicular to the surface

SLIDE 8

Math Review (Cont.)

  • Define a plane using a normal 𝒏 = (a, b, c) and a point (x0, y0, z0) in the plane:
  • (a, b, c) ⋅ (x0 − x, y0 − y, z0 − z) = 0 ⇒ ax + by + cz = ax0 + by0 + cz0 (= d)
  • Distance from a point (x0, y0, z0) to the plane ax + by + cz = d:
  • (x0 − x, y0 − y, z0 − z) ⋅ (a, b, c)/||(a, b, c)|| = (ax0 + by0 + cz0 − d) / √(a² + b² + c²)
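A quick numeric check of the distance formula above, as a minimal numpy sketch (the plane coefficients and the query point are made-up illustration values, not from the slides):

```python
# Distance from a point to the plane ax + by + cz = d, using the formula above.
import numpy as np

n = np.array([1.0, 2.0, 2.0])   # normal (a, b, c) of the plane
d = 6.0                         # plane offset
p = np.array([3.0, 4.0, 5.0])   # query point (x0, y0, z0)

# signed distance = (a*x0 + b*y0 + c*z0 - d) / sqrt(a^2 + b^2 + c^2)
dist = (n @ p - d) / np.linalg.norm(n)
print(dist)                     # (3 + 8 + 10 - 6) / 3 = 5.0
```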

SLIDE 9

Linear Classifier

  • Given a training dataset {(xi, yi)}, i = 1, …, N

A separating hyperplane can be written as a linear combination of attributes W ● X + b = 0 where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)

For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0

Classification: w0 + w1 x1 + w2 x2 > 0 => yi= +1 w0 + w1 x1 + w2 x2 ≤ 0 => yi= –1
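A minimal sketch of this 2-D decision rule in numpy (the weights below are arbitrary illustration values, not learned from data):

```python
# Predict +1 when w0 + w1*x1 + w2*x2 > 0, else -1, as in the classification rule above.
import numpy as np

w0, w = -1.0, np.array([2.0, 1.0])   # bias and weight vector (illustrative values)

def classify(x):
    """Return +1 if the point falls on the positive side of the hyperplane, else -1."""
    return 1 if w0 + w @ x > 0 else -1

print(classify(np.array([1.0, 0.5])))   # 2*1 + 1*0.5 - 1 = 1.5 > 0  -> +1
print(classify(np.array([0.0, 0.5])))   # 0 + 0.5 - 1 = -0.5 <= 0    -> -1
```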

SLIDE 10

Recall

  • Is the decision boundary for logistic regression linear?
  • Is the decision boundary for decision tree linear?

SLIDE 11

Simple Linear Classifier: Perceptron

Loss function: max{0, −yi ⋅ wTxi}
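A minimal training sketch for the perceptron under this loss, assuming the classic update rule w ← w + η·yi·xi on misclassified points (the toy data below is made up; the bias is folded into w via a constant feature):

```python
# Perceptron training matching the loss max{0, -y_i * w^T x_i}.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant 1 feature for the bias

w, eta = np.zeros(Xb.shape[1]), 1.0
for _ in range(10):                          # a few passes over the data
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:               # misclassified point (positive loss)
            w += eta * yi * xi               # perceptron update
print(w, np.sign(Xb @ w))                    # learned weights and training predictions
```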

SLIDE 12

More on Sign Function

SLIDE 13

Example

SLIDE 14

Support Vector Machine

  • Introduction
  • Linear SVM
  • Non-linear SVM
  • Scalability Issues*
  • Summary

SLIDE 15

Can we do better?

  • Which hyperplane to choose?

SLIDE 16

SVM—Margins and Support Vectors

(Figure: support vectors; small margin vs. large margin.)

SLIDE 17


SVM—When Data Is Linearly Separable

Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi. There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data). SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

SLIDE 18


SVM—Linearly Separable

 A separating hyperplane can be written as

W ● X + b = 0

 The hyperplanes defining the sides of the margin, e.g.:

H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1

 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors

 This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints  Quadratic Programming (QP)  Lagrangian multipliers

SLIDE 19

Maximum Margin Calculation

  • w: decision hyperplane normal vector
  • xi: data point i
  • yi: class of data point i (+1 or -1)

wTx + b = 0
wTxa + b = 1
wTxb + b = -1

Margin: ρ = 2/||w||
Hint: what is the distance between xa and the hyperplane wTx + b = -1?
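A quick numeric check of the margin formula (w and b below are arbitrary illustration values): take a point on wTx + b = 1 and measure its distance to wTx + b = -1 with the point-to-plane formula from the math review.

```python
# Verify that the distance between the two margin hyperplanes equals 2 / ||w||.
import numpy as np

w, b = np.array([3.0, 4.0]), 1.0
xa = np.array([1.0, 1.0])
xa = xa + (1 - (w @ xa + b)) / (w @ w) * w   # move xa onto the plane w^T x + b = 1

# distance from xa to the plane w^T x + b = -1, via the point-to-plane formula
dist = (w @ xa + b - (-1)) / np.linalg.norm(w)
print(dist, 2 / np.linalg.norm(w))           # both are 0.4 = 2 / ||w||
```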

SLIDE 20

SVM as a Quadratic Programming

  • QP form:
    Objective: Find w and b such that ρ = 2/||w|| is maximized;
    Constraints: For all {(xi, yi)}: wTxi + b ≥ 1 if yi = +1; wTxi + b ≤ -1 if yi = -1
  • A better form:
    Objective: Find w and b such that Φ(w) = ½ wTw is minimized;
    Constraints: For all {(xi, yi)}: yi(wTxi + b) ≥ 1
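A minimal sketch of the "better form" QP, assuming the cvxpy package is available; the toy data below is made up and linearly separable:

```python
# Minimize 1/2 w^T w subject to y_i (w^T x_i + b) >= 1 on a tiny toy dataset.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]            # y_i (w^T x_i + b) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value)    # maximum-margin hyperplane for this toy data
```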

SLIDE 21

Solve QP

  • This is now optimizing a quadratic function subject to linear constraints
  • Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
  • The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:

SLIDE 22

Lagrange Formulation

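The Lagrangian itself appears only as an image on this slide; for reference, a reconstruction of the standard form it denotes, connecting the primal above to the dual on the next slide:

```latex
% Standard Lagrangian for the primal SVM problem (a reconstruction, not copied
% from the slide image). One multiplier alpha_i >= 0 per margin constraint.
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\,\mathbf{w}^{T}\mathbf{w}
  - \sum_{i} \alpha_i \left[ y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \right],
  \qquad \alpha_i \ge 0
% Stationarity gives the relations used on the following slides:
%   dL/dw = 0  =>  w = sum_i alpha_i y_i x_i
%   dL/db = 0  =>  sum_i alpha_i y_i = 0
```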

SLIDE 23

Primal Form and Dual Form

  • More derivations:

http://cs229.stanford.edu/notes/cs229-notes3.pdf

Primal:
Objective: Find w and b such that Φ(w) = ½ wTw is minimized;
Constraints: for all {(xi, yi)}: yi(wTxi + b) ≥ 1

Dual:
Objective: Find α1…αn such that Q(α) = Σαi - ½ ΣΣ αiαj yiyj xiTxj is maximized;
Constraints: (1) Σαiyi = 0, (2) αi ≥ 0 for all αi

Primal and dual are equivalent under some conditions: the KKT conditions

SLIDE 24

The Optimization Problem Solution

  • The solution has the form:
    w = Σαiyixi
    b = yk - wTxk for any xk such that αk ≠ 0
  • Each non-zero αi indicates that the corresponding xi is a support vector.
  • Then the classifying function will have the form:
    f(x) = ΣαiyixiTx + b
  • Notice that it relies on an inner product between the test point x and the support vectors xi
  • We will return to this later.
  • Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
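A small sketch of this solution form: given the αi (assumed to come from some QP solver), recover w, pick b from any support vector, and build f(x). The function name and arguments are illustrative, not from the slides.

```python
# Recover the SVM solution from dual variables alpha, training data X, labels y.
import numpy as np

def svm_from_alphas(alpha, X, y, tol=1e-8):
    sv = alpha > tol                               # non-zero alpha_i -> support vectors
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
    k = np.argmax(sv)                              # any x_k with alpha_k != 0
    b = y[k] - w @ X[k]                            # b = y_k - w^T x_k
    def f(x):                                      # f(x) = sum_i alpha_i y_i x_i^T x + b
        return np.sign((alpha[sv] * y[sv]) @ (X[sv] @ x) + b)
    return w, b, f
```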

SLIDE 25

Soft Margin Classification

  • If the training data is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples.
  • Allow some errors
  • Let some points be moved to where they belong, at a cost
  • Still, try to minimize training set errors, and to place the hyperplane “far” from each class (large margin)

(Figure: slack variables ξi and ξj.)

  • Sec. 15.2.1
SLIDE 26


Soft Margin Classification Mathematically

  • The old formulation:
    Find w and b such that Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}: yi(wTxi + b) ≥ 1
  • The new formulation incorporating slack variables:
    Find w and b such that Φ(w) = ½ wTw + CΣξi is minimized, and for all {(xi, yi)}: yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0 for all i
  • Parameter C can be viewed as a way to control overfitting
  • A regularization term (L1 regularization on the slack variables)
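A hedged sketch of soft-margin training with scikit-learn (assumed installed); C is the slack penalty from the formulation above, and the toy data is made up. Smaller C tolerates more margin violations, larger C penalizes them more heavily.

```python
# Fit a linear soft-margin SVM for several values of C and inspect the solution.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.0, 3.0], [1.5, 2.5], [1.0, 0.5], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.coef_, clf.intercept_, clf.support_)   # weights, bias, support-vector indices
```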

  • Sec. 15.2.1
SLIDE 27


Soft Margin Classification – Solution

  • The dual problem for soft margin classification:
    Find α1…αN such that Q(α) = Σαi - ½ ΣΣ αiαj yiyj xiTxj is maximized, and
    (1) Σαiyi = 0  (2) 0 ≤ αi ≤ C for all αi
  • Neither the slack variables ξi nor their Lagrange multipliers appear in the dual problem!
  • Again, xi with non-zero αi will be support vectors.
  • If 0 < αi < C, ξi = 0
  • If αi = C, ξi > 0
  • Solution to the problem is:
    w = Σαiyixi
    b = yk - wTxk for any xk such that 0 < αk < C
    f(x) = ΣαiyixiTx + b
    (w is not needed explicitly for classification!)

  • Sec. 15.2.1
SLIDE 28


Classification with SVMs

  • Given a new point x, we can score its projection onto the hyperplane normal:
  • I.e., compute score: wTx + b = ΣαiyixiTx + b
  • Decide class based on whether the score is < or > 0
  • Can set confidence threshold t:
    Score > t: yes
    Score < -t: no
    Else: don’t know
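A small sketch of this thresholded decision rule, assuming clf is a fitted scikit-learn linear SVM (e.g., from the soft-margin sketch above); the function name and threshold value are illustrative.

```python
# Score a new point with w^T x + b and apply a confidence threshold t.
def classify_with_reject(clf, x, t=0.5):
    score = clf.decision_function(x.reshape(1, -1))[0]   # w^T x + b
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"
```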

  • Sec. 15.1
SLIDE 29


Linear SVMs: Summary

  • The classifier is a separating hyperplane.
  • The most “important” training points are the support vectors; they define the hyperplane.
  • Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi.
  • Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

Find α1…αN such that Q(α) = Σαi - ½ ΣΣ αiαj yiyj xiTxj is maximized, and
(1) Σαiyi = 0  (2) 0 ≤ αi ≤ C for all αi

f(x) = ΣαiyixiTx + b

  • Sec. 15.2.1
SLIDE 30

Support Vector Machine

  • Introduction
  • Linear SVM
  • Non-linear SVM
  • Scalability Issues*
  • Summary

SLIDE 31


Non-linear SVMs

  • Datasets that are linearly separable (with some noise) work out great:
  • But what are we going to do if the dataset is just too hard?
  • How about … mapping data to a higher-dimensional space:

(Figure: data plotted along x, then re-plotted in the (x, x2) space.)

  • Sec. 15.2.3
SLIDE 32


Non-linear SVMs: Feature spaces

  • General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

  • Sec. 15.2.3
SLIDE 33


The “Kernel Trick”

  • The linear classifier relies on an inner product between vectors K(xi, xj) = xiTxj
  • If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)Tφ(xj)
  • A kernel function is some function that corresponds to an inner product in some expanded feature space.
  • Example:
    2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)²
    Need to show that K(xi, xj) = φ(xi)Tφ(xj):
    K(xi, xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
    = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
    = φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
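A quick numeric check of this example: the kernel value computed on the original 2-D vectors matches the inner product of the explicit feature maps (the two vectors below are arbitrary):

```python
# Verify (1 + x^T z)^2 == phi(x)^T phi(z) for the explicit 6-dimensional feature map.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([0.7, -1.2]), np.array([2.0, 0.5])
print((1 + xi @ xj) ** 2)        # kernel on the original 2-D vectors
print(phi(xi) @ phi(xj))         # same value via the explicit feature map
```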

  • Sec. 15.2.3
SLIDE 34


SVM: Different Kernel functions

 Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)TΦ(Xj)

 Typical Kernel Functions

 *SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
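The kernel formulas themselves are only an image on this slide; below is a sketch of the kernels typically listed here (polynomial of degree h, Gaussian radial basis function, sigmoid), written as plain Python functions. The parameter names are conventional, not taken from the slide.

```python
# Standard kernel functions: polynomial, Gaussian RBF, and sigmoid.
import numpy as np

def poly_kernel(x, z, h=2):
    return (x @ z + 1) ** h                                      # (X_i . X_j + 1)^h

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))      # exp(-||X_i - X_j||^2 / (2 sigma^2))

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ z) - delta)                      # tanh(kappa X_i . X_j - delta)
```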

SLIDE 35


Non-linear SVM

  • Replace the inner product with kernel functions
  • Optimization problem:
    Find α1…αN such that Q(α) = Σαi - ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, and
    (1) Σαiyi = 0  (2) 0 ≤ αi ≤ C for all αi
  • Decision boundary:
    f(x) = ΣαiyiK(xi, x) + b
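A hedged end-to-end sketch with scikit-learn (assumed installed): an RBF-kernel SVM separates concentric circles that no linear hyperplane can.

```python
# Compare a linear SVM and an RBF-kernel SVM on non-linearly-separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(linear.score(X, y), rbf.score(X, y))   # the RBF kernel fits far better than linear
```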

  • Sec. 15.2.1
SLIDE 36

Support Vector Machine

  • Introduction
  • Linear SVM
  • Non-linear SVM
  • Scalability Issues*
  • Summary

SLIDE 37


*Scaling SVM by Hierarchical Micro-Clustering

  • SVM is not scalable to the number of data objects in terms of training time and memory usage
  • H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD'03
  • CB-SVM (Clustering-Based SVM)
  • Given a limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and training speed
  • Use micro-clustering to effectively reduce the number of points to be considered
  • When deriving support vectors, de-cluster the micro-clusters near “candidate vectors” to ensure high classification accuracy

SLIDE 38


*CF-Tree: Hierarchical Micro-cluster

Read the data set once, construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory

Micro-clustering: Hierarchical indexing structure

 provide finer samples closer to the boundary and coarser samples farther from the boundary

SLIDE 39


*Selective Declustering: Ensure High Accuracy

  • The CF tree is a suitable base structure for selective declustering
  • De-cluster only the clusters Ei such that Di - Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
  • i.e., decluster only the clusters whose subclusters may be support clusters of the boundary
  • “Support cluster”: a cluster whose centroid is a support vector
SLIDE 40


*CB-SVM Algorithm: Outline

  • Construct two CF-trees from the positive and negative data sets independently
  • Needs one scan of the data set
  • Train an SVM from the centroids of the root entries
  • De-cluster the entries near the boundary into the next level
  • The children entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries
  • Train an SVM again from the centroids of the entries in the training set
  • Repeat until nothing is accumulated
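A runnable illustration of the CB-SVM idea under stand-in assumptions: KMeans micro-clusters take the place of the CF-tree, and a single de-clustering round replaces the hierarchical descent. This is a simplification of the outline above, not the KDD'03 algorithm itself; all data and thresholds are made up.

```python
# Train on cluster centroids first, then "de-cluster" near the boundary and retrain.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=2000, centers=[[-2.0, 0.0], [2.0, 0.0]],
                  cluster_std=1.2, random_state=0)
y = 2 * y - 1                                        # class labels in {-1, +1}

# Micro-cluster each class; keep centroids and point-to-cluster assignments.
centroids, labels, assign = [], [], []
for cls in (-1, 1):
    km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X[y == cls])
    centroids.append(km.cluster_centers_)
    labels.append(np.full(20, cls))
    assign.append(km.labels_)
centroids, labels = np.vstack(centroids), np.concatenate(labels)

# Train on centroids only, then replace near-boundary clusters by their points.
svm = SVC(kernel="linear").fit(centroids, labels)
near = np.abs(svm.decision_function(centroids)) < 1.0

train_X, train_y = [centroids[~near]], [labels[~near]]
for i, cls in enumerate((-1, 1)):
    cls_near = near[i * 20:(i + 1) * 20]             # near-boundary clusters of this class
    mask = np.isin(assign[i], np.where(cls_near)[0]) # points belonging to those clusters
    train_X.append(X[y == cls][mask])
    train_y.append(np.full(mask.sum(), cls))
svm2 = SVC(kernel="linear").fit(np.vstack(train_X), np.concatenate(train_y))
print(svm2.score(X, y))                              # accuracy of the refined SVM on all points
```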
SLIDE 41


*Accuracy and Scalability on Synthetic Dataset

  • Experiments on large synthetic data sets show better accuracy than random sampling approaches, and far better scalability than the original SVM algorithm

SLIDE 42

Support Vector Machine

  • Introduction
  • Linear SVM
  • Non-linear SVM
  • Scalability Issues*
  • Summary

SLIDE 43

Summary

  • Support Vector Machine
  • Linear classifier; support vectors; kernel SVM

SLIDE 44


SVM Related Links

  • SVM Website: http://www.kernel-machines.org/
  • Representative implementations
  • LIBSVM: an efficient implementation of SVM, with multi-class classification, nu-SVM, one-class SVM, and various interfaces for Java, Python, etc.
  • SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only in C
  • SVM-torch: another recent implementation, also written in C
  • From classification to regression and ranking:
  • http://www.dainf.ct.utfpr.edu.br/~kaestner/Mineracao/hwanjoyu-svmtutorial.pdf