CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 3

CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 3 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 5, 2015

Announcements
• Homework 2 will be out tomorrow
• No class next week
• Course project proposal due next Monday

Methods to Learn
• Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation* (graph & network data); Neural Network (images)
• Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data); PLSA (text data); SCAN*; Spectral Clustering* (graph & network data)
• Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
• Prediction: Linear Regression (matrix data); Autoregression (time series)
• Similarity Search: DTW (time series); P-PageRank (graph & network data)
• Ranking: PageRank (graph & network data)

Matrix Data: Classification: Part 3
• SVM (Support Vector Machine)
• kNN (k Nearest Neighbor)
• Other Issues
• Summary

Math Review
• Vector
  • $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
  • Subtracting two vectors: $\mathbf{x} = \mathbf{b} - \mathbf{a}$
• Dot product
  • $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$
  • Geometric interpretation: projection
  • If $\mathbf{a}$ and $\mathbf{b}$ are orthogonal, $\mathbf{a} \cdot \mathbf{b} = 0$

Math Review (Cont.)
• Plane/Hyperplane
  • $a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = c$
  • Line (n=2), plane (n=3), hyperplane (higher dimensions)
• Normal of a plane
  • $\mathbf{n} = (a_1, a_2, \ldots, a_n)$
  • a vector which is perpendicular to the surface

Math Review (Cont.)
• Define a plane using normal $\mathbf{n} = (a, b, c)$ and a point $(x_0, y_0, z_0)$ in the plane:
  • $(a, b, c) \cdot (x_0 - x,\ y_0 - y,\ z_0 - z) = 0 \Rightarrow ax + by + cz = ax_0 + by_0 + cz_0 \ (= d)$
• Distance from a point $(x_0, y_0, z_0)$ to a plane $ax + by + cz = d$, where $(x, y, z)$ is any point in the plane:
  • $\left| (x_0 - x,\ y_0 - y,\ z_0 - z) \cdot \frac{(a, b, c)}{\|(a, b, c)\|} \right| = \frac{|a x_0 + b y_0 + c z_0 - d|}{\sqrt{a^2 + b^2 + c^2}}$
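The distance formula above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `dot` and `distance_to_plane` and the example plane are illustrative assumptions, not part of the lecture.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def distance_to_plane(normal, d, point):
    """Distance from `point` to the hyperplane normal . x = d."""
    return abs(dot(normal, point) - d) / math.sqrt(dot(normal, normal))


# Hypothetical example: plane with normal (1, 2, 2) passing through (1, 0, 0).
normal = (1.0, 2.0, 2.0)
d = dot(normal, (1.0, 0.0, 0.0))  # d = a*x0 + b*y0 + c*z0 for a point in the plane

dist_far = distance_to_plane(normal, d, (3.0, 2.0, 2.0))
dist_in_plane = distance_to_plane(normal, d, (1.0, 0.0, 0.0))
print(dist_far, dist_in_plane)
```

The same function works in any dimension, which is how it is used for the margin of a hyperplane later in the lecture.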

Linear Classifier
• Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• A separating hyperplane can be written as a linear combination of attributes: $W \cdot X + b = 0$, where $W = \{w_1, w_2, \ldots, w_n\}$ is a weight vector and $b$ a scalar (bias)
• For 2-D it can be written as $w_0 + w_1 x_1 + w_2 x_2 = 0$ (with $w_0$ playing the role of the bias $b$)
• Classification:
  • $w_0 + w_1 x_1 + w_2 x_2 > 0 \Rightarrow y_i = +1$
  • $w_0 + w_1 x_1 + w_2 x_2 \le 0 \Rightarrow y_i = -1$
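The classification rule on this slide amounts to taking the sign of a weighted sum. A minimal sketch, assuming hand-picked weights (the function name `predict` and the numbers are illustrative, not from the lecture):

```python
def predict(w0, w, x):
    """Return +1 if w0 + w . x > 0, else -1 (the rule stated on the slide)."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else -1


# Hypothetical 2-D hyperplane: -1 + x1 + x2 = 0
w0, w = -1.0, (1.0, 1.0)

label_above = predict(w0, w, (2.0, 2.0))   # score = 3
label_below = predict(w0, w, (0.0, 0.0))   # score = -1
label_on = predict(w0, w, (0.5, 0.5))      # score = 0, falls in the "<= 0" case
print(label_above, label_below, label_on)
```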

Perceptron

Example

Can we do better?
• Which hyperplane to choose?

SVM: Margins and Support Vectors
[Figure: a separating hyperplane with a small margin and one with a large margin; support vectors marked]

SVM: When Data Is Linearly Separable
• Let data D be $(X_1, y_1), \ldots, (X_{|D|}, y_{|D|})$, where the $X_i$ are the training tuples and the $y_i$ their associated class labels
• There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
• SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)

SVM: Linearly Separable
• A separating hyperplane can be written as $W \cdot X + b = 0$
• The hyperplanes defining the sides of the margin, e.g.:
  • $H_1$: $w_0 + w_1 x_1 + w_2 x_2 \ge 1$ for $y_i = +1$, and
  • $H_2$: $w_0 + w_1 x_1 + w_2 x_2 \le -1$ for $y_i = -1$
• Any training tuples that fall on hyperplanes $H_1$ or $H_2$ (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem:
  • Quadratic objective function and linear constraints
  • Quadratic Programming (QP)
  • Lagrangian multipliers

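Maximum Margin Calculation
• $\mathbf{w}$: decision hyperplane normal vector
• $\mathbf{x}_i$: data point i
• $y_i$: class of data point i (+1 or -1)
• Decision hyperplane: $\mathbf{w}^T \mathbf{x} + b = 0$; margin boundaries: $\mathbf{w}^T \mathbf{x}_a + b = 1$ and $\mathbf{w}^T \mathbf{x}_b + b = -1$
• Margin: $\rho = \frac{2}{\|\mathbf{w}\|}$
• Hint: what is the distance between $\mathbf{x}_a$ and the hyperplane $\mathbf{w}^T \mathbf{x} + b = -1$?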
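One way to work out the hint is to apply the point-to-plane distance formula from the math review. The short derivation below assumes only that $\mathbf{x}_a$ lies on the boundary $\mathbf{w}^T \mathbf{x} + b = 1$, so that $\mathbf{w}^T \mathbf{x}_a + b = 1$:

```latex
\begin{aligned}
\operatorname{dist}\bigl(\mathbf{x}_a,\ \{\mathbf{x} : \mathbf{w}^T \mathbf{x} + b = -1\}\bigr)
  &= \frac{\lvert \mathbf{w}^T \mathbf{x}_a + b - (-1) \rvert}{\lVert \mathbf{w} \rVert} \\
  &= \frac{\lvert 1 + 1 \rvert}{\lVert \mathbf{w} \rVert}
   = \frac{2}{\lVert \mathbf{w} \rVert} = \rho
\end{aligned}
```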

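SVM as a Quadratic Programming Problem
• QP
  • Objective: Find $\mathbf{w}$ and $b$ such that $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized;
  • Constraints: For all $\{(\mathbf{x}_i, y_i)\}$: $\mathbf{w}^T \mathbf{x}_i + b \ge 1$ if $y_i = 1$; $\mathbf{w}^T \mathbf{x}_i + b \le -1$ if $y_i = -1$
• A better form
  • Objective: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized;
  • Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$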
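To make the "better form" concrete, the sketch below evaluates the objective and checks the constraints for a few candidate $(\mathbf{w}, b)$ pairs. It does not solve the QP; the toy dataset, the candidates, and the helper names are assumptions chosen for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_objective(w):
    """Phi(w) = 1/2 * w^T w."""
    return 0.5 * dot(w, w)


def is_feasible(w, b, X, y, tol=1e-9):
    """True if y_i * (w^T x_i + b) >= 1 holds for every training point."""
    return all(yi * (dot(w, xi) + b) >= 1 - tol for xi, yi in zip(X, y))


# Hypothetical linearly separable toy data.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]

# Hand-picked candidates (w, b); none of them comes from a solver.
candidates = {
    "tight": ((0.5, 0.5), 0.0),
    "loose": ((1.0, 1.0), 0.0),
    "too_small": ((0.25, 0.25), 0.0),
}

for name, (w, b) in candidates.items():
    print(name, is_feasible(w, b, X, y), primal_objective(w))
```

Among the feasible candidates, the one with the smaller objective has the larger margin $2 / \|\mathbf{w}\|$, which is why minimizing $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ is equivalent to maximizing $\rho$.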

Solve QP
• This is now optimizing a quadratic function subject to linear constraints
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
• The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every constraint in the primal problem

Lagrange Formulation

Primal Form and Dual Form
• Primal
  • Objective: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized;
  • Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$
• Equivalent under some conditions: KKT conditions
• Dual
  • Objective: Find $\alpha_1 \ldots \alpha_n$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and
  • Constraints: (1) $\sum_i \alpha_i y_i = 0$; (2) $\alpha_i \ge 0$ for all $\alpha_i$
• More derivations: http://cs229.stanford.edu/notes/cs229-notes3.pdf
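The dual objective can be evaluated directly from the inner products of the training points. The sketch below does this for a two-point toy problem; the dataset and the candidate values of $\alpha$ are illustrative assumptions, and the code only evaluates $Q(\boldsymbol{\alpha})$ rather than maximizing it.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(alphas, X, y):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(X)
    quad = sum(
        alphas[i] * alphas[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alphas) - 0.5 * quad


def satisfies_dual_constraints(alphas, y, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and alpha_i >= 0."""
    balanced = abs(sum(a * yi for a, yi in zip(alphas, y))) <= tol
    return balanced and all(a >= -tol for a in alphas)


# Hypothetical two-point problem: one positive and one negative example.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]

# With alpha_1 = alpha_2 = a, Q reduces to 2a - 4a^2, which peaks at a = 0.25.
for a in (0.1, 0.25, 0.4):
    print(a, satisfies_dual_constraints([a, a], y), dual_objective([a, a], X, y))
```

For this toy problem the maximum dual value, 0.25, equals the minimum primal value $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ at $\mathbf{w} = (0.5, 0.5)$.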

The Optimization Problem Solution
• The solution has the form: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $b = y_k - \mathbf{w}^T \mathbf{x}_k$ for any $\mathbf{x}_k$ such that $\alpha_k \ne 0$
• Each non-zero $\alpha_i$ indicates that the corresponding $\mathbf{x}_i$ is a support vector.
• Then the classifying function will have the form: $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
• Notice that it relies on an inner product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$
• We will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products $\mathbf{x}_i^T \mathbf{x}_j$ between all pairs of training points.
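Given the multipliers, the formulas on this slide recover $\mathbf{w}$ and $b$ and classify a new point using only inner products. A minimal sketch, assuming a two-point toy problem whose multipliers ($\alpha_1 = \alpha_2 = 0.25$) were worked out by hand rather than by a QP solver; all names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def recover_w(alphas, X, y):
    """w = sum_i alpha_i y_i x_i."""
    dim = len(X[0])
    return [sum(a * yi * xi[k] for a, xi, yi in zip(alphas, X, y)) for k in range(dim)]


def recover_b(alphas, X, y, w, tol=1e-12):
    """b = y_k - w^T x_k for some k with alpha_k != 0 (here: the first such k)."""
    k = next(i for i, a in enumerate(alphas) if a > tol)
    return y[k] - dot(w, X[k])


def decision(x, alphas, X, y, b):
    """f(x) = sum_i alpha_i y_i x_i^T x + b, using inner products only."""
    return sum(a * yi * dot(xi, x) for a, xi, yi in zip(alphas, X, y)) + b


# Hypothetical toy problem; alphas are hand-derived for this data.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alphas = [0.25, 0.25]

w = recover_w(alphas, X, y)
b = recover_b(alphas, X, y, w)
print(w, b, decision((2.0, 0.0), alphas, X, y, b), decision((-3.0, 1.0), alphas, X, y, b))
```

Note that `decision` never uses `w`: it needs only the training points with non-zero multipliers, which is the property the kernel trick builds on.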

Soft Margin Classification (Sec. 15.2.1)
• If the training data is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
• Allow some errors
  • Let some points be moved to where they belong, at a cost
• Still, try to minimize training set errors, and to place the hyperplane "far" from each class (large margin)
[Figure: points on the wrong side of the margin, with their slack variables $\xi_i$ and $\xi_j$ marked]

Soft Margin Classification Mathematically (Sec. 15.2.1)
• The old formulation: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$
• The new formulation incorporating slack variables: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_i \xi_i$ is minimized and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all i
• Parameter C can be viewed as a way to control overfitting
  • A regularization term (L1 regularization)
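For a fixed $(\mathbf{w}, b)$, the smallest slack that satisfies both constraints is $\xi_i = \max(0,\ 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$. The sketch below uses that fact to evaluate the soft-margin objective for different values of C; the toy data (with one deliberately noisy point) and the hand-picked hyperplane are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, X, y):
    """Smallest xi_i >= 0 with y_i (w^T x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]


def soft_margin_objective(w, b, X, y, C):
    """Phi(w) = 1/2 w^T w + C * sum_i xi_i."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))


# Hypothetical data: the last negative example sits on the positive side.
X = [(2.0, 2.0), (1.0, 1.0), (-1.0, -1.0), (0.5, 0.5)]
y = [1, 1, -1, -1]
w, b = (0.5, 0.5), 0.0  # hand-picked hyperplane, not a solver output

xi = slacks(w, b, X, y)
obj_small_c = soft_margin_objective(w, b, X, y, C=1.0)
obj_large_c = soft_margin_objective(w, b, X, y, C=10.0)
print(xi, obj_small_c, obj_large_c)
```

A larger C makes the same violation more expensive, so the optimizer is pushed toward fitting the noisy point at the expense of a smaller margin.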

Soft Margin Classification: Solution (Sec. 15.2.1)
• The dual problem for soft margin classification: Find $\alpha_1 \ldots \alpha_N$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$; (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
• Neither slack variables $\xi_i$ nor their Lagrange multipliers appear in the dual problem!
• Again, $\mathbf{x}_i$ with non-zero $\alpha_i$ will be support vectors.
• Solution to the dual problem is:
  • $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
  • $b = y_k(1 - \xi_k) - \mathbf{w}^T \mathbf{x}_k$ where $k = \arg\max_{k'} \alpha_{k'}$
• $\mathbf{w}$ is not needed explicitly for classification: $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$

Classification with SVMs (Sec. 15.1)
• Given a new point $\mathbf{x}$, we can score its projection onto the hyperplane normal:
  • I.e., compute score: $\mathbf{w}^T \mathbf{x} + b = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
  • Decide class based on whether < or > 0
• Can set confidence threshold t.
  • Score > t: yes
  • Score < -t: no
  • Else: don't know
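The thresholded decision rule is short enough to write out directly. In this sketch the function name `classify`, the string labels, and the example scores are illustrative assumptions:

```python
def classify(score, t=0.0):
    """Map an SVM score to 'yes', 'no', or "don't know" using threshold t >= 0."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"


# Hypothetical scores w^T x + b for four test points, with threshold t = 1.
decisions = [classify(s, t=1.0) for s in (2.3, -1.7, 0.4, -1.0)]
print(decisions)
```

With t = 0 the rule reduces to the plain sign test, except that a score of exactly 0 is left undecided.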

Linear SVMs: Summary (Sec. 15.2.1)
• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points $\mathbf{x}_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
  • Find $\alpha_1 \ldots \alpha_N$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$; (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
  • $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$

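Non-linear SVMs (Sec. 15.2.3)
• Datasets that are linearly separable (with some noise) work out great: [Figure: 1-D data along the x axis]
• But what are we going to do if the dataset is just too hard? [Figure: 1-D data along the x axis that no single threshold separates]
• How about mapping data to a higher-dimensional space: [Figure: the same data plotted with axes $x$ and $x^2$]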

Non-linear SVMs: Feature spaces (Sec. 15.2.3)
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$
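A small example of the idea, assuming the mapping $\varphi(x) = (x, x^2)$ on 1-D data: the classes cannot be split by a single threshold on $x$, but a hand-picked line in the mapped space separates them. The dataset, the weights, and the helper names are illustrative assumptions.

```python
def phi(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)


def label_changes(xs, ys):
    """Count label flips along the sorted x axis.

    A single threshold can separate 1-D data only if there is at most one flip.
    """
    ordered = [yi for _, yi in sorted(zip(xs, ys))]
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)


# Hypothetical 1-D data: positives on the outside, negatives in the middle.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# Hand-picked hyperplane in feature space: 0*x + 1*x^2 - 2.5 = 0.
w, b = (0.0, 1.0), -2.5


def predict(x):
    """Linear classifier applied to phi(x)."""
    z = phi(x)
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1


flips = label_changes(xs, ys)
predictions = [predict(x) for x in xs]
print(flips, predictions)
```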
