Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 4
Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams, Percy Liang)
Review: linear regression, basis function regression, polynomial regression.
[Figure: polynomial regression example for N samples, fitting a degree M = 3 polynomial to data (x, t).]
Define a kernel function k(x, x′) := φ(x)^⊤φ(x′); k can be cheaper to evaluate than φ!
MAP / expected value for the weights (requires inversion of a D×D matrix):
Φ := Φ(X),   A := Φ^⊤Φ + λI,   E[w | y] = A^{-1} Φ^⊤ y
Alternate representation (requires inversion of an N×N matrix):
K := ΦΦ^⊤,   A^{-1} Φ^⊤ = Φ^⊤ (K + λI)^{-1}
Predictive posterior (using the kernel function):
E[f(x*) | y] = φ(x*)^⊤ E[w | y] = φ(x*)^⊤ Φ^⊤ (K + λI)^{-1} y = Σ_{n,m} k(x*, x_n) [(K + λI)^{-1}]_{nm} y_m
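As a concrete illustration of the N×N (dual) form above, here is a minimal NumPy sketch of kernelized ridge regression; the rbf_kernel helper, the regularization value, and the toy data are illustrative choices, not part of the slides.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.6):
    # illustrative kernel choice; any valid kernel k(x, x') works here
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def fit_dual(X, y, lam=0.1):
    # solve the N x N system (K + lam I) alpha = y instead of the D x D one
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_star, X, alpha):
    # E[f(x*) | y] = sum_n k(x*, x_n) [(K + lam I)^{-1} y]_n
    return rbf_kernel(X_star, X) @ alpha

# toy usage
X = np.random.randn(50, 1)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(50)
alpha = fit_dual(X, y)
y_star = predict(np.linspace(-2, 2, 5)[:, None], X, alpha)
```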
Equivalent view: regularized least squares in an RKHS H,
f* = argmin_{f ∈ H} ( Σ_{i=1}^n (y_i − ⟨f, φ(x_i)⟩_H)² + λ ‖f‖²_H ).
[Figure: kernel ridge regression fits with σ = 0.6 for λ = 0.1, λ = 10, and λ = 1e−07.]
Closed-Form Solution
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figures: Gaussian process regression examples; axes: input x vs. function value.]
p(y* | x*, x, y) = N( k(x*, x)^⊤ [K + σ²_noise I]^{-1} y ,  k(x*, x*) + σ²_noise − k(x*, x)^⊤ [K + σ²_noise I]^{-1} k(x*, x) )
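A small NumPy sketch of this predictive distribution, assuming a squared-exponential kernel; the se_kernel helper and its hyperparameters are illustrative assumptions rather than anything fixed by the slides.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, sigma_f=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq / lengthscale**2)

def gp_predict(x_star, x, y, sigma_noise=0.1):
    K = se_kernel(x, x)                                  # N x N
    k_star = se_kernel(x, x_star)                        # N x M
    A = K + sigma_noise**2 * np.eye(len(y))
    mean = k_star.T @ np.linalg.solve(A, y)              # predictive mean
    cov = (se_kernel(x_star, x_star)                     # predictive covariance
           + sigma_noise**2 * np.eye(len(x_star))
           - k_star.T @ np.linalg.solve(A, k_star))
    return mean, cov
```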
Characteristic Lengthscales
[Figure: the mean posterior predictive function plotted for three different lengthscales, labeled "too long", "about right", and "too short"; axes: input x vs. function value y.]
Borrowing from: Arthur Gretton (Gatsby, UCL)
Definition (Inner product). Let H be a vector space over R. A function ⟨·,·⟩_H : H × H → R is an inner product on H if
1. Linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H
2. Symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H
3. ⟨f, f⟩_H ≥ 0, and ⟨f, f⟩_H = 0 if and only if f = 0.
Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H
Fourier modes define a vector space
Definition (Kernel). Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that ∀x, x′ ∈ X, k(x, x′) := ⟨φ(x), φ(x′)⟩_H.
Almost no conditions on X (e.g., X itself doesn't need an inner product; X could be a set of documents). A single kernel can correspond to several possible feature maps. A trivial example for X := R: φ₁(x) = x and φ₂(x) = (x/√2, x/√2).
Theorem (Sums of kernels are kernels). Given α > 0 and k, k₁, k₂ all kernels on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?)
Theorem (Mappings between spaces). Let X and X̃ be sets, and let A : X → X̃ be a map. Given a kernel k on X̃, then k(A(x), A(x′)) is a kernel on X.
Example: k(x, x′) = x² (x′)².
Theorem (Products of kernels are kernels). Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X. (Proof: main idea only!)
Theorem (Polynomial kernels). Let x, x′ ∈ R^d for d ≥ 1, let m ≥ 1 be an integer and c ≥ 0 a non-negative real. Then k(x, x′) := (⟨x, x′⟩ + c)^m is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x′⟩ raised to integer powers. These individual terms are valid kernels by the product rule.
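To make the expansion argument concrete, the following sketch checks numerically (for m = 2 and an assumed c) that (⟨x, x′⟩ + c)² equals the inner product of explicit monomial features; the feature map phi below is one illustrative choice.

```python
import numpy as np

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
c = 1.0

k_direct = (x @ xp + c) ** 2

def phi(v):
    # explicit features for the degree-2 polynomial kernel: all pairwise
    # products, the linear terms scaled by sqrt(2c), and the constant c
    return np.concatenate([np.outer(v, v).ravel(), np.sqrt(2 * c) * v, [c]])

assert np.isclose(k_direct, phi(x) @ phi(xp))
```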
Definition. The space ℓ² (square-summable sequences) comprises all sequences a := (a_i)_{i≥1} for which ‖a‖²_{ℓ²} = Σ_{i=1}^∞ a_i² < ∞.
Definition. Given a sequence of functions (φ_i(x))_{i≥1} in ℓ², where φ_i : X → R is the i-th coordinate of φ(x), then k(x, x′) := Σ_{i=1}^∞ φ_i(x) φ_i(x′) is a kernel. (1)
Why square-summable? By Cauchy-Schwarz, |Σ_{i=1}^∞ φ_i(x) φ_i(x′)| ≤ ‖φ(x)‖_{ℓ²} ‖φ(x′)‖_{ℓ²}, so the sequence defining the inner product converges for all x, x′ ∈ X.
Definition (Taylor series kernel). For r ∈ (0, ∞], with a_n ≥ 0 for all n ≥ 0, let f(z) = Σ_{n=0}^∞ a_n z^n for |z| < r, z ∈ R. Define X to be the √r-ball in R^d, so ‖x‖ < √r. Then k(x, x′) = f(⟨x, x′⟩) = Σ_{n=0}^∞ a_n ⟨x, x′⟩^n is a kernel.
Example (Exponential kernel). k(x, x′) := exp(⟨x, x′⟩).
Example (Gaussian kernel). The Gaussian kernel on R^d is defined as k(x, x′) := exp(−γ² ‖x − x′‖²). Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.
(Also known as the Radial Basis Function (RBF) kernel; related variants: Squared Exponential (SE), Automatic Relevance Determination (ARD).)
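The exercise hinted at above can be checked numerically: the sketch below writes the Gaussian kernel as the exponential kernel times a mapping applied to each argument, i.e. a product of valid kernels. The constant and test vectors are arbitrary illustrative choices.

```python
import numpy as np

gamma = 0.7                      # any positive constant (standing in for the gamma^2 above)
x, xp = np.random.randn(3), np.random.randn(3)

# direct evaluation of the Gaussian / RBF kernel
k_direct = np.exp(-gamma * np.sum((x - xp) ** 2))

# same value via the construction from the slide: the exponential kernel
# exp(2*gamma*<x, x'>) times the map x -> exp(-gamma*||x||^2) applied to
# each argument (products and mappings of kernels are kernels)
k_composed = (np.exp(-gamma * x @ x)
              * np.exp(-gamma * xp @ xp)
              * np.exp(2 * gamma * x @ xp))

assert np.isclose(k_direct, k_composed)
```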
Base kernels (one-dimensional inputs):
Squared-exp (SE):  k(x, x′) = σ_f² exp( −(x − x′)² / (2ℓ²) )
Periodic (Per):    k(x, x′) = σ_f² exp( −(2/ℓ²) sin²(π (x − x′) / p) )
Linear (Lin):      k(x, x′) = σ_f² (x − c)(x′ − c)
Products of base kernels are also kernels: Lin × Lin, SE × Per, Lin × SE, Lin × Per.
[Figure: plots of the base kernels and their products, as functions of x − x′ (or of x with x′ = 1).]
source: David Duvenaud (PhD Thesis)
Definition (Positive definite functions). A symmetric function k : X × X → R is positive definite if ∀n ≥ 1, ∀(a₁, ..., a_n) ∈ R^n, ∀(x₁, ..., x_n) ∈ X^n,
Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0.
The function k(·,·) is strictly positive definite if, for mutually distinct x_i, equality holds only when all the a_i are zero.
Theorem. Let H be a Hilbert space, X a non-empty set, and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite.
Proof.
Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) = Σ_{i=1}^n Σ_{j=1}^n ⟨a_i φ(x_i), a_j φ(x_j)⟩_H = ‖ Σ_{i=1}^n a_i φ(x_i) ‖²_H ≥ 0.
The reverse also holds: a positive definite k(x, x′) is the inner product in a unique H (Moore-Aronszajn theorem: coming later!).
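A quick numerical illustration of the theorem: build a Gram matrix from an explicit feature map and check the positive-definiteness condition. The feature map and the random data below are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

# Gram matrix of an explicit feature map phi(x) = (x, x^2): inner products in H
Phi = np.hstack([X, X**2])
K = Phi @ Phi.T

# positive definiteness: sum_ij a_i a_j k(x_i, x_j) >= 0 for any coefficients a
a = rng.standard_normal(20)
assert a @ K @ a >= -1e-10

# equivalently, the symmetric Gram matrix has no negative eigenvalues
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```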
Borrowing from: Percy Liang (Stanford)
Idea: project a high-dimensional vector onto a lower-dimensional subspace, e.g. x ∈ R^361 → z = U^⊤x ∈ R^10.
Given n data points in d dimensions: x₁, ..., x_n ∈ R^d, collected as X = (x₁ ··· x_n) ∈ R^{d×n}. (Note: this is the transpose of the X used in regression!)
Want to reduce the dimensionality from d to k.
Choose k directions u₁, ..., u_k, collected as U = (u₁ ··· u_k) ∈ R^{d×k}.
For each u_j, compute the "similarity" z_j = u_j^⊤ x.
Project x down to z = (z₁, ..., z_k)^⊤ = U^⊤x. How to choose U?
Optimize two equivalent objectives
U serves two functions:
- Project: z_j = u_j^⊤ x, i.e. z = U^⊤x.
- Reconstruct: x̃ = Uz = Σ_{j=1}^k z_j u_j.
Want the reconstruction error ‖x − x̃‖ to be small.
Objective: minimize the total squared reconstruction error
min_{U ∈ R^{d×k}} Σ_{i=1}^n ‖x_i − UU^⊤x_i‖².
Empirical distribution: uniform over x₁, ..., x_n.
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i).
Variance (think sum of squares if centered): v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(x_i)².
Assume the data is centered: Ê[x] = 0 (what is Ê[U^⊤x]?).
Objective: maximize the variance of the projected data
max_{U ∈ R^{d×k}, U^⊤U = I} Ê[‖U^⊤x‖²].
Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small).
Pythagorean decomposition: x = UU^⊤x + (I − UU^⊤)x, with ‖UU^⊤x‖² + ‖(I − UU^⊤)x‖² = ‖x‖².
Take expectations; note that the rotation U does not affect length:
Ê[‖x‖²] = Ê[‖U^⊤x‖²] + Ê[‖x − UU^⊤x‖²].
Minimize reconstruction error ↔ Maximize captured variance.
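This decomposition can be verified directly on data; in the sketch below the orthonormal U is taken (arbitrarily, for illustration) from an SVD, but the identity holds for any U with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))      # rows are data points here, for convenience
X -= X.mean(axis=0)                    # center, so E_hat[x] = 0

# any U with orthonormal columns works; here: the top-2 right singular vectors
U = np.linalg.svd(X, full_matrices=False)[2].T[:, :2]    # d x k, U^T U = I

total    = np.mean(np.sum(X**2, axis=1))                 # E_hat[||x||^2]
captured = np.mean(np.sum((X @ U)**2, axis=1))           # E_hat[||U^T x||^2]
residual = np.mean(np.sum((X - X @ U @ U.T)**2, axis=1)) # E_hat[||x - U U^T x||^2]

assert np.isclose(total, captured + residual)
```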
Data: X = (x₁ ··· x_n) ∈ R^{d×n}.  Orthonormal basis: U = (u₁ ··· u_k) ∈ R^{d×k}.
Change of basis: z = U^⊤x, i.e. z_j = u_j^⊤x, giving z = (z₁, ..., z_k)^⊤.
Inverse change of basis: x̃ = Uz = Σ_{j=1}^k z_j u_j.
Eigenvectors of the covariance: eigen-decomposition (1/n) XX^⊤ = UΛU^⊤, with Λ = diag(λ₁, λ₂, ..., λ_d).
Claim: eigenvectors of a symmetric matrix are orthogonal (argument from Stack Exchange).
Idea: take the top-k eigenvectors to maximize the captured variance.
Truncated basis: U = (u₁ ··· u_k) ∈ R^{d×k}, with truncated eigenvalues Λ^(k) = diag(λ₁, λ₂, ..., λ_k).
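A minimal sketch of this recipe, assuming the columns of X have already been centered (the variable names are illustrative):

```python
import numpy as np

def pca_eig(X, k):
    """PCA of a d x n matrix X with centered columns, via the covariance eigendecomposition."""
    d, n = X.shape
    C = (X @ X.T) / n                    # d x d covariance matrix
    lam, U = np.linalg.eigh(C)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:k]    # indices of the top-k eigenvalues
    U_k, lam_k = U[:, order], lam[order]
    Z = U_k.T @ X                        # k x n projected coordinates
    return U_k, lam_k, Z
```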
Example. Data: three varieties of wheat (Kama, Rosa, Canadian). Attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, length of groove.
[Figure: projections onto the top 2 principal components vs. the bottom 2 components.]
Computing the truncated basis U = (u₁ ··· u_k) ∈ R^{d×k} from the data X = (x₁ ··· x_n) ∈ R^{d×n}: using the eigen-value decomposition of the covariance matrix.
Alternative: using the singular-value decomposition (computed, e.g., with the power method).
Idea: decompose a d × d matrix M into (unitary matrix)(diagonal matrix)(unitary matrix).
For the d × n data matrix: X = U_{d×d} Σ_{d×n} V^⊤_{n×n}, with U and V unitary and Σ a (rectangular) diagonal matrix of singular values.
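The same truncated basis can be read off the SVD of the (centered) data matrix; a small sketch, with np.linalg.svd standing in for the power-method computation mentioned above:

```python
import numpy as np

def pca_svd(X, k):
    """Same U_k and Z as the eigendecomposition route, via X = U Sigma V^T (X is d x n, centered)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]                 # top-k left singular vectors = top-k eigenvectors of XX^T
    Z = np.diag(s[:k]) @ Vt[:k]    # k x n, equal to U_k^T X
    return U_k, Z                  # note: eigenvalues of (1/n) XX^T are s**2 / n
```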
Eigen-faces: X_{d×n} ≈ U_{d×k} Z_{k×n}.
Idea: z_i is a more "meaningful" representation of the i-th face than x_i.
Can use z_i for nearest-neighbor classification.
Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k.
Why no time savings for a linear classifier?
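A sketch of the nearest-neighbor idea in the reduced space; U_k, Z_train, and labels are assumed to come from a previous PCA fit on the training faces, and the function name is illustrative.

```python
import numpy as np

def nearest_face(x_query, U_k, Z_train, labels):
    # project the query once: O(dk); then compare against n stored k-vectors: O(nk)
    z_query = U_k.T @ x_query                                # k-dimensional code
    dists = np.sum((Z_train - z_query[:, None])**2, axis=0)  # distances to all training codes
    return labels[np.argmin(dists)]
```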
[Figure: scree plot of the eigenvalues λ_i against the component index i.]
Example: word-document count data, X_{d×n} ≈ U_{d×k} Z_{k×n}, where the rows of X count word occurrences (e.g. "stocks", "chairman", "the", "wins", "game") in each of the n documents.
How to measure similarity between two documents? z₁^⊤z₂ is probably better than x₁^⊤x₂.
Applications: information retrieval. Note: no computational savings here; the original x is already sparse.
Summary: PCA minimizes reconstruction error; it can be computed from the covariance matrix or via the SVD (which differ in time complexity); applications include anomaly detection, etc.
Probabilistic PCA: generative model [Tipping and Bishop, 1999].
For each data point i = 1, ..., n:
- Draw the latent vector: z_i ∼ N(0, I_{k×k})
- Create the data point: x_i ∼ N(U z_i, σ² I_{d×d})
PCA finds the U that maximizes the likelihood of the data: max_U p(X | U).
Advantages: handles missing data (e.g. for collaborative filtering); extends to factor analysis (replace σ² I_{d×d} with an arbitrary diagonal matrix).
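A minimal sketch of sampling from this generative model (the parameter values and function name are placeholders, not part of the slides):

```python
import numpy as np

def sample_ppca(U, sigma, n, seed=0):
    """Draw n points from the Tipping & Bishop generative model."""
    rng = np.random.default_rng(seed)
    d, k = U.shape
    Z = rng.standard_normal((k, n))                   # z_i ~ N(0, I_{k x k})
    X = U @ Z + sigma * rng.standard_normal((d, n))   # x_i ~ N(U z_i, sigma^2 I_{d x d})
    return X, Z
```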
[Figure: a dataset where PCA is effective vs. one where PCA is ineffective.]
Problem: the PCA subspace is linear, S = {x = Uz : z ∈ R^k}. In this example: S = {(x₁, x₂) : x₂ = (u₂/u₁) x₁}.
Desired solution: S = {(x₁, x₂) : x₂ = (u₂/u₁) x₁²}, which the broken (linear) solution cannot represent.
We can get this as S = {φ(x) = Uz} with the feature map φ(x) = (x₁², x₂)^⊤.
Linear dimensionality reduction in φ(x) space ⇔ nonlinear dimensionality reduction in x space.
Idea: use kernels.
Representer theorem: the principal direction lies in the span of the data, u = Xα = Σ_{i=1}^n α_i x_i (from the eigenvalue problem XX^⊤u = λu).
Kernel function: k(x₁, x₂) such that the kernel matrix K, with K_ij = k(x_i, x_j), is positive semi-definite.
Substituting u = Xα:
max_{‖u‖=1} u^⊤XX^⊤u = max_{α^⊤X^⊤Xα=1} α^⊤(X^⊤X)(X^⊤X)α = max_{α^⊤Kα=1} α^⊤K²α.
Direct method: the kernel PCA objective max_{α^⊤Kα=1} α^⊤K²α ⇒ the kernel PCA eigenvalue problem Kα = λ′α.
Modular method (if you don't want to think about kernels): find vectors x′₁, ..., x′_n such that x′_i^⊤x′_j = K_ij = φ(x_i)^⊤φ(x_j).
Key: use any vectors that preserve inner products. One possibility is the Cholesky decomposition K = X′^⊤X′.
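A sketch of the direct method; the centering of the kernel matrix is a standard practical detail that the slides do not spell out and is included here as an assumption.

```python
import numpy as np

def kernel_pca(K, k):
    """Direct method: eigenvectors of the (centered) n x n kernel matrix give the alphas."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                            # center the implicit features
    lam, A = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:k]         # top-k eigenpairs (assumed > 0)
    lam_k, A_k = lam[order], A[:, order]
    A_k = A_k / np.sqrt(lam_k)                # scale so each u = X alpha has unit norm
    Z = Kc @ A_k                              # n x k projections of the training points
    return A_k, Z
```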
Often, each data point consists of two views:
- x: pixels (or other visual features); y: text around the image
- x: signal at time t; y: signal at time t + 1
- x: features of a word/object, etc.; y: features of the context in which it appears
Goal: reduce the dimensionality of the two views jointly.
Setup: input data (x₁, y₁), ..., (x_n, y_n) (matrices X, Y). Goal: find a pair of projections (u, v).
[Figure: independent vs. joint dimensionality reduction; in this example x and y are paired by brightness.]
Definitions:
Variance: v̂ar(u^⊤x) = u^⊤XX^⊤u
Covariance: ĉov(u^⊤x, v^⊤y) = u^⊤XY^⊤v
Correlation: ĉorr(u^⊤x, v^⊤y) = ĉov(u^⊤x, v^⊤y) / ( √v̂ar(u^⊤x) √v̂ar(v^⊤y) )
Objective: maximize the correlation between the projected views, max_{u,v} ĉorr(u^⊤x, v^⊤y).
Properties: the correlation is invariant to rescaling of u and v.
PCA on the views separately: no covariance term
max_{u,v} [ u^⊤XX^⊤u / (u^⊤u) + v^⊤YY^⊤v / (v^⊤v) ]
PCA on the concatenation (X^⊤, Y^⊤)^⊤: includes the covariance term
max_{u,v} [ u^⊤XX^⊤u + 2 u^⊤XY^⊤v + v^⊤YY^⊤v ] / ( u^⊤u + v^⊤v )
Maximum covariance: drop the variance terms
max_{u,v} u^⊤XY^⊤v / ( √(u^⊤u) √(v^⊤v) )
Maximum correlation (CCA): divide out the variance terms
max_{u,v} u^⊤XY^⊤v / ( √(u^⊤XX^⊤u) √(v^⊤YY^⊤v) )
Extreme examples of degeneracy: projections with correlation 1 and with correlation 0.
Problem: if X or Y has rank n, then any (u, v) is optimal (correlation 1, with u = X^{†⊤}Yv) ⇒ CCA is meaningless!
Solution: regularization (interpolate between maximum covariance and maximum correlation):
max_{u,v} u^⊤XY^⊤v / ( √(u^⊤(XX^⊤ + λI)u) √(v^⊤(YY^⊤ + λI)v) )
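A sketch of solving this regularized objective for the top direction pair, by whitening each view and taking the leading singular pair of the cross term; X and Y are assumed to be centered d_x × n and d_y × n matrices, and the function name is illustrative.

```python
import numpy as np

def regularized_cca(X, Y, lam=1e-3):
    """Top pair (u, v) for the regularized CCA objective; X, Y are centered, one column per data point."""
    Cxx = X @ X.T + lam * np.eye(X.shape[0])
    Cyy = Y @ Y.T + lam * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    # whiten each view, then take the leading singular pair of the cross term
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))     # Cxx = L L^T, Wx = L^{-1}
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    A, s, Bt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    u = Wx.T @ A[:, 0]                              # undo the whitening change of variables
    v = Wy.T @ Bt[0]
    return u, v, s[0]                               # s[0] is the (regularized) correlation
```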