Kernels + K-Means
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Matt Gormley, Lecture 29, April 25, 2018

Reminders
– Homework 8: Reinforcement Learning, Out:
Hard-margin SVM (Primal) Hard-margin SVM (Lagrangian Dual)
Solving the dual problem gives the same answer as the primal (i.e., the minimum of one is the maximum of the other). The support vectors are the training examples for which α(i) ≠ 0.
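The primal and dual formulations themselves appear as figures in the slides; for reference, a standard statement (a sketch, assuming training data {(x^(i), y^(i))} with y^(i) ∈ {−1, +1}) is:

Primal:  \min_{w,\, b}\ \tfrac{1}{2}\lVert w \rVert_2^2 \quad \text{s.t.}\quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 \ \ \forall i

Dual:  \max_{\alpha}\ \sum_i \alpha^{(i)} - \tfrac{1}{2}\sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(x^{(i)}\right)^T x^{(j)} \quad \text{s.t.}\quad \alpha^{(i)} \ge 0 \ \ \forall i,\ \ \sum_i \alpha^{(i)} y^{(i)} = 0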
Hard-margin SVM (Primal) vs. Soft-margin SVM (Primal)
Question: If the data are not linearly separable, can we still use an SVM? Not with the hard-margin version: it will never find a feasible solution. In the soft-margin version, we add "slack variables" that allow some points to violate the large-margin constraints. The constant C dictates how large we allow the slack variables to be.
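The soft-margin formulation is likewise shown graphically in the slides; in standard notation (a sketch, writing the slack variables as ξ_i):

\min_{w,\, b,\, \xi}\ \tfrac{1}{2}\lVert w \rVert_2^2 + C \sum_i \xi_i \quad \text{s.t.}\quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \ \ \forall i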
Hard-margin SVM (Primal), Soft-margin SVM (Primal), Hard-margin SVM (Lagrangian Dual), Soft-margin SVM (Lagrangian Dual)
We can also work with the dual of the soft-margin SVM.
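The soft-margin dual is shown as a figure in the slides; its standard form (a sketch) is the hard-margin dual with each α(i) additionally bounded above by C:

\max_{\alpha}\ \sum_i \alpha^{(i)} - \tfrac{1}{2}\sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(x^{(i)}\right)^T x^{(j)} \quad \text{s.t.}\quad 0 \le \alpha^{(i)} \le C \ \ \forall i,\ \ \sum_i \alpha^{(i)} y^{(i)} = 0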
Multiclass SVMs: The SVM is inherently a binary classification method, but it can be extended to handle K-class classification in many ways.
1. One-vs-rest:
– build K binary classifiers
– train the kth classifier to predict whether an instance has label k or something else
– predict the class with the largest score (see the sketch after this list)
2. One-vs-one:
– build (K choose 2) binary classifiers
– train one classifier for distinguishing between each pair of labels
– predict the class with the most "votes" across all classifiers
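A minimal one-vs-rest sketch, assuming scikit-learn's LinearSVC as the binary base learner (the library choice and the names ovr_fit / ovr_predict are illustrative, not part of the lecture):

```python
import numpy as np
from sklearn.svm import LinearSVC  # any binary SVM (or other scorer) would work

def ovr_fit(X, y, K):
    """Train K binary classifiers; the kth one separates label k from everything else."""
    return [LinearSVC().fit(X, (y == k).astype(int)) for k in range(K)]

def ovr_predict(classifiers, X):
    """Predict the class whose binary classifier gives the largest score."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)
```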
Support Vector Machines: You should be able to…
1. Motivate the learning of a decision boundary with large margin
2. Compare the decision boundary learned by SVM with that of Perceptron
3. Distinguish unconstrained and constrained optimization
4. Compare linear and quadratic mathematical programs
5. Derive the hard-margin SVM primal formulation
6. Derive the Lagrangian dual for a hard-margin SVM
7. Describe the mathematical properties of support vectors and provide an intuitive explanation of their role
8. Draw a picture of the weight vector, bias, decision boundary, training examples, support vectors, and margin of an SVM
9. Employ slack variables to obtain the soft-margin SVM
10. Implement an SVM learner using a black-box quadratic programming (QP) solver
Example: pixel representation for Facial Recognition.
The kernel trick:
1. Rewrite the algorithm so that we only work with dot products xᵀz of feature vectors.
2. Replace the dot products xᵀz with a kernel function k(x, z), where k(x, z) = φ(x)ᵀφ(z) for some function φ: X → R^D. This way we only compute the φ dot product implicitly.
The same trick can be applied to many algorithms (see the kernel perceptron sketch below):
– classification: perceptron, SVM, …
– regression: ridge regression, …
– clustering: k-means, …
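A minimal kernel perceptron sketch, assuming labels y ∈ {−1, +1} and an RBF kernel as one possible choice (the function names are illustrative):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """The weight vector w = sum_i alpha_i y_i phi(x_i) is never formed explicitly;
    all computation goes through kernel evaluations k(x_i, x_j)."""
    n = len(X)
    alpha = np.zeros(n)                                       # mistake counts
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])
            if y[i] * score <= 0:                             # mistake on example i
                alpha[i] += 1
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x):
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train))
    return 1 if score >= 0 else -1
```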
In the hard-margin SVM (Lagrangian dual), the training examples appear only through dot products. After feature engineering, each input vector x is replaced by a feature vector, so the dual involves dot products of feature vectors in the transformed space. We could replace the dot product of the two feature vectors in the transformed space with a function k(x, z).
Slide from Nina Balcan
Example (polynomial kernel): For n = 2, d = 2, the kernel K(x, z) = (x · z)^d corresponds to the feature map
φ: R² → R³, (x₁, x₂) → φ(x) = (x₁², x₂², √2·x₁x₂)
[Figure: in the original (x₁, x₂) space the X and O points are not linearly separable; after mapping to the φ-space they are.]
The implicit dot product equals the kernel:
φ(x) · φ(z) = (x₁², x₂², √2·x₁x₂) · (z₁², z₂², √2·z₁z₂) = (x₁z₁ + x₂z₂)² = (x · z)² = K(x, z)
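A quick numeric check of this identity (a standalone sketch; the test vectors are arbitrary):

```python
import numpy as np

# Explicit feature map for the d = 2, n = 2 polynomial kernel above.
phi = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))   # explicit dot product in the phi-space -> 1.0
print((x @ z) ** 2)      # implicit kernel in the original space -> 1.0
```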
Common kernels, with the kernel function (implicit dot product) and the corresponding feature space (explicit dot product):
– Linear: k(x, z) = xᵀz; same as the original input space
– Polynomial (v1): k(x, z) = (xᵀz)^d; all polynomials of degree d
– Polynomial (v2): k(x, z) = (xᵀz + 1)^d; all polynomials up to degree d
– Gaussian: k(x, z) = exp(−‖x − z‖² / (2σ²)); infinite dimensional space
– Hyperbolic Tangent (Sigmoid): k(x, z) = tanh(a·xᵀz + c); with SVM, this is equivalent to a 2-layer neural network
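A minimal sketch of these kernel functions as code (the default hyperparameter values d, σ, a, c are arbitrary choices for illustration):

```python
import numpy as np

linear   = lambda x, z: x @ z
poly_v1  = lambda x, z, d=2: (x @ z) ** d              # polynomials of degree d
poly_v2  = lambda x, z, d=2: (x @ z + 1) ** d          # polynomials up to degree d
gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
sigmoid  = lambda x, z, a=1.0, c=0.0: np.tanh(a * (x @ z) + c)
```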
RBF Kernel: [figure sequence]
RBF Kernel: KNN vs. SVM [figure sequence]
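A minimal sketch of fitting an SVM with an RBF kernel (assuming scikit-learn; the toy data and the gamma values are arbitrary illustrations of how the kernel width changes the fit):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)   # circular classes, not linearly separable

for gamma in [0.1, 1.0, 10.0]:                   # larger gamma -> more local decision boundary
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(gamma, clf.score(X, y), len(clf.support_))
```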
Kernel methods takeaways:
– The hard-margin and soft-margin SVM objectives are the training criteria.
– The SVM learns a linear classifier (in the feature space).
– The SVM optimization problems can be handed to black-box Quadratic Programming (QP) solvers.
– The solution is determined by the support vectors.
– Kernels let us work in a high-dimensional feature space without explicitly representing that space.
– The kernel trick can be applied to many other algorithms.
K-Means Outline:
– Coordinate Descent; Block Coordinate Descent
– Inputs and Outputs; Objective-based Clustering
– K-Means Objective; Computational Complexity; K-Means Algorithm / Lloyd's Method
– Initialization methods: Random; Farthest Point; K-Means++
Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would we want to do this?
Useful for, among other things, representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
Slide courtesy of Nina Balcan
Example: clustering users of a social network based on their profile (community detection), e.g., the Facebook network or the Twitter network.
Slide courtesy of Nina Balcan
Example: Given a set of datapoints, K-Means (Lloyd's method) proceeds as follows:
1. Select initial centers at random.
2. Assign each point to its nearest center.
3. Recompute the optimal centers given the fixed clustering.
4. Repeat steps 2 and 3 (assign each point to its nearest center, then recompute the optimal centers) until the assignments stop changing. We get a good quality solution in this example.
Slides courtesy of Nina Balcan
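A minimal sketch of this loop (Lloyd's method) in code, assuming Euclidean distance and random initialization; the function name and defaults are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate between assigning points to their nearest center and
    recomputing each center as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # assignments have stabilized
            break
        centers = new_centers
    return centers, labels
```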
Lloyd's method always converges, but it may converge to a local optimum that is different from the global optimum, and in fact could be arbitrarily worse in terms of its score.
Local optimum: every point is assigned to its nearest center and every center is the mean value of its points, yet the clustering can be arbitrarily worse than the optimal solution.
This bad performance can happen even with well-separated Gaussian clusters: some Gaussians get combined into a single cluster.
Slides courtesy of Nina Balcan
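The "score" referred to above is the K-Means objective: the sum of squared distances from each point to its assigned center. In standard notation (with z^(i) denoting the cluster assignment of point i and μ_j the centers; the notation is chosen here for illustration):

\min_{\mu_1, \dots, \mu_K,\ z^{(1)}, \dots, z^{(N)}}\ \sum_{i=1}^{N} \left\lVert x^{(i)} - \mu_{z^{(i)}} \right\rVert_2^2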
K-Means: You should be able to…
1. Distinguish between coordinate descent and block coordinate descent
2. Define an objective function that gives rise to a "good" clustering
3. Apply block coordinate descent to an objective function preferring each point to be close to its nearest cluster center
4. Implement the K-Means algorithm
5. Connect the nonconvexity of the K-Means objective function with the (possibly) poor performance of random initialization