slide-1
SLIDE 1

Data Mining II

  • Prof. Dr. Karsten Borgwardt, Department Biosystems, ETH Zürich Basel, Spring Semester 2016

D-BSSE

slide-2
SLIDE 2

Our course - The team

  • Dr. Damian Roqueiro, Dr. Dean Bodenham, Dr. Dominik Grimm, Dr. Xiao He

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 2 / 117

slide-3
SLIDE 3

Our course - Background information

Schedule
Lecture: Wednesdays 14:15-15:50
Tutorial: Wednesdays 16:00-16:45
Room: Manser
Written exam to get the certificate in summer 2016

Structure
Key topics: dimensionality reduction, text mining, graph mining, association rules, and others
Biweekly homework to apply the algorithms in practice; weekly tutorials, rotating between example tutorials and homework solution tutorials

Moodle link: https://moodle-app2.let.ethz.ch/course/view.php?id=2072

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 3 / 117

slide-4
SLIDE 4

Our course - Background information

More information

Homework: Avoid plagiarism and be aware of the consequences. Next week, March 2, we will host the Exam Inspection of the Data Mining 1 Exam, from 15:30-16:30 in this room.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 4 / 117

slide-5
SLIDE 5
  • 1. Dimensionality Reduction

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 5 / 117

slide-6
SLIDE 6

Why Dimensionality Reduction?

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 6 / 117

slide-7
SLIDE 7

Why Dimensionality Reduction?

Definition

Dimensionality reduction is the task of finding an r-dimensional representation of a d-dimensional dataset, with r ≪ d, such that the d-dimensional information is maximally preserved.

Motivation

uncovering the intrinsic dimensionality of the data, visualization, reduction of redundancy, correlation and noise, computational or memory savings.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 7 / 117

slide-8
SLIDE 8

Why Dimensionality Reduction?

[Figure: scatter plots of a 3D dataset (axes x, y, z) containing five classes (class 1 to class 5).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 8 / 117

slide-9
SLIDE 9

Why Dimensionality Reduction?

[Figure: the same dataset projected onto its first two principal components (transformed X and Y values), colored by class.]

Variance explained: PC1: 0.55, PC2: 0.32, PC3: 0.13

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 9 / 117

slide-10
SLIDE 10

Reminder: Feature Selection

Feature Selection versus Dimensionality Reduction

Feature selection tries to select a subset of relevant variables: for reduced computational, experimental and storage costs, and for better data understanding and visualization.
Univariate feature selection ignores correlations or even redundancies between the features. The same underlying signal can be represented by several features.
Dimensionality reduction tries to find these underlying signals and to represent the original data in terms of these signals.
Unlike feature selection, these underlying signals are not identical to the original features, but typically combinations of them.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 10 / 117

slide-11
SLIDE 11

1.1 Principal Component Analysis

based on:
Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press 2014, Chapter 23.1
Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.2

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 11 / 117

slide-12
SLIDE 12

Dimensionality Reduction: Principal Component Analysis

Goals

To understand why principal component analysis can be interpreted as a compression-recovery scheme.
To understand the link between principal component analysis and the eigenvectors and eigenvalues of the covariance matrix.
To understand what a principal component is.
To understand how to compute principal component analysis efficiently, both for large n and d.
To learn an objective way to determine the number of principal components.
To understand in which sense principal component analysis captures the major axes of variation in the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 12 / 117

slide-13
SLIDE 13

Dimensionality Reduction: Principal Component Analysis

Concept

Let x1, . . . , xn be n vectors in Rd. We assume that they are mean-centered. The goal of PCA is to reduce the dimensionality of these vectors using a linear transformation. The matrix W ∈ Rr×d, where r < d, induces a mapping x ↦ Wx, where Wx is the lower-dimensional representation of x. A second matrix U ∈ Rd×r can be used to approximately recover each original vector x from its compressed form. That is, for a compressed vector y = Wx, where y is in the low-dimensional space Rr, we can construct x̃ = Uy, so that x̃ is the recovered version of x in the original space Rd.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 13 / 117

slide-14
SLIDE 14

Principal Component Analysis

Objective

In PCA, the corresponding objective of finding the compression matrix W and the recovery matrix U is then phrased as
\[
\operatorname*{arg\,min}_{W \in \mathbb{R}^{r\times d},\, U \in \mathbb{R}^{d\times r}} \; \sum_{i=1}^{n} \| x_i - U W x_i \|_2^2 \tag{1}
\]
That is, PCA tries to minimize the total squared distance between the original vectors and the recovered vectors.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 14 / 117

slide-15
SLIDE 15

Principal Component Analysis

How to find U and W?

Lemma

Let (U, W) be a solution to Objective (1). Then the columns of U are orthonormal (that is, U⊤U is the r × r identity matrix) and W = U⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 15 / 117

slide-16
SLIDE 16

Principal Component Analysis

Proof

We make the following assumptions: Choose any U and W and consider the mapping x → UW x. The range of this mapping, R = {UW x : x ∈ Rd} is an r-dimensional linear subspace of Rd. Let V ∈ Rd×r be a matrix whose columns form an orthonormal basis of this subspace (i.e. V ⊤V = I). Each vector in R can be written as V y, where y ∈ Rr.

For every x ∈ Rd and y ∈ Rr we have
\[
\|x - Vy\|_2^2 = \|x\|^2 + \|y\|^2 - 2\, y^\top (V^\top x) \tag{2}
\]
This expression is minimized for y = V⊤x.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 16 / 117

slide-17
SLIDE 17

Principal Component Analysis

Proof

Continued: Therefore, for each x we have that
\[
V V^\top x = \operatorname*{arg\,min}_{\tilde{x} \in \mathcal{R}} \|x - \tilde{x}\|_2^2 \tag{3}
\]
This holds for all xi, so we can replace U, W by V, V⊤ without increasing the objective:
\[
\sum_{i=1}^{n} \|x_i - U W x_i\|_2^2 \;\ge\; \sum_{i=1}^{n} \|x_i - V V^\top x_i\|_2^2 \tag{4}
\]
As this holds for any U, W, the proof is complete.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 17 / 117

slide-18
SLIDE 18

Principal Component Analysis

How to find U and W ?

Due to the previous lemma, we can rewrite Objective (1) as
\[
\operatorname*{arg\,min}_{U \in \mathbb{R}^{d\times r}:\, U^\top U = I} \; \sum_{i=1}^{n} \|x_i - U U^\top x_i\|_2^2 \tag{5}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 18 / 117

slide-19
SLIDE 19

Principal Component Analysis

How to find U and W ?

We can further simplify the objective by the following transformation:
\[
\begin{aligned}
\|x - U U^\top x\|^2 &= \|x\|^2 - 2\, x^\top U U^\top x + x^\top U U^\top U U^\top x && (6)\\
&= \|x\|^2 - x^\top U U^\top x && (7)\\
&= \|x\|^2 - \operatorname{trace}(U^\top x x^\top U). && (8)
\end{aligned}
\]
Note: The trace of a matrix is the sum of its diagonal entries.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 19 / 117

slide-20
SLIDE 20

Principal Component Analysis

How to find U and W ?

This allows us to turn Objective (5) into a trace maximization problem:
\[
\operatorname*{arg\,max}_{U \in \mathbb{R}^{d\times r}:\, U^\top U = I} \; \operatorname{trace}\!\left( U^\top \sum_{i=1}^{n} x_i x_i^\top \, U \right) \tag{9}
\]
If we define \(\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top\), this is equivalent (up to a constant factor) to
\[
\operatorname*{arg\,max}_{U \in \mathbb{R}^{d\times r}:\, U^\top U = I} \; \operatorname{trace}\!\left( U^\top \Sigma U \right) \tag{10}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 20 / 117

slide-21
SLIDE 21

Principal Component Analysis

Theorem

Let Σ = VDV ⊤ be the spectral decomposition of Σ. D is a diagonal matrix, such that Di,i is the i-th largest eigenvalue of Σ. The columns of V are the corresponding eigenvectors, and V ⊤V = VV ⊤ = I. Then the solution to Objective (10) is the matrix U whose columns are the r first eigenvectors of Σ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 21 / 117

slide-22
SLIDE 22

Principal Component Analysis

Proof

Choose a matrix U ∈ Rd×r with orthonormal columns and let B = V⊤U. Then VB = VV⊤U = U. It follows that
\[
U^\top \Sigma U = B^\top V^\top V D V^\top V B = B^\top D B \tag{11}
\]
and therefore
\[
\operatorname{trace}(U^\top \Sigma U) = \sum_{j=1}^{d} D_{j,j} \sum_{i=1}^{r} B_{j,i}^2. \tag{12}
\]
Note that B⊤B = U⊤VV⊤U = U⊤U = I. Hence the columns of B are orthonormal and \(\sum_{j=1}^{d} \sum_{i=1}^{r} B_{j,i}^2 = r\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 22 / 117

slide-23
SLIDE 23

Principal Component Analysis

Proof

Define \(\tilde{B} \in \mathbb{R}^{d\times d}\) to be a matrix whose first r columns are the columns of B and which satisfies \(\tilde{B}^\top \tilde{B} = I\). Then for every j, \(\sum_{i=1}^{d} \tilde{B}_{j,i}^2 = 1\), which implies that \(\sum_{i=1}^{r} B_{j,i}^2 \le 1\).

It follows that
\[
\operatorname{trace}(U^\top \Sigma U) \;\le\; \max_{\beta \in [0,1]^d : \|\beta\|_1 \le r} \sum_{j=1}^{d} D_{j,j}\,\beta_j \;=\; \sum_{j=1}^{r} D_{j,j}.
\]
Therefore, for every matrix U ∈ Rd×r with orthonormal columns, the inequality \(\operatorname{trace}(U^\top \Sigma U) \le \sum_{j=1}^{r} D_{j,j}\) holds.

But if we set U to the matrix with the r leading eigenvectors of Σ as its columns, we obtain \(\operatorname{trace}(U^\top \Sigma U) = \sum_{j=1}^{r} D_{j,j}\) and thereby the optimal solution.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 23 / 117

slide-24
SLIDE 24

Principal Component Analysis

Runtime properties

The overall runtime of PCA is O(d³ + d²n):
O(d³) for calculating the eigendecomposition of Σ,
O(d²n) for constructing the matrix Σ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 24 / 117

slide-25
SLIDE 25

Principal Component Analysis

Speed-up for d >> n

Often the number of features d greatly exceeds the number of samples n. The standard runtime of O(d³ + d²n) is very expensive for large d. However, there is a workaround to perform the same calculations in O(n³ + n²d).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 25 / 117

slide-26
SLIDE 26

Principal Component Analysis

Speed-up for d >> n

Workaround:

Let X ∈ Rd×n, so that \(\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top\), or equivalently \(\Sigma = \frac{1}{n} X X^\top\).
Consider K = X⊤X, such that K_{ij} = ⟨x_i, x_j⟩.
Suppose v* is an eigenvector of K: Kv* = λv* for some λ ∈ R. Multiplying the equation by \(\frac{1}{n}X\) and using the definition of K, we obtain
\[
\frac{1}{n} X X^\top X v^* = \frac{1}{n} \lambda X v^*. \tag{13}
\]
But, using the definition of Σ, we get that \(\Sigma (X v^*) = \frac{\lambda}{n} (X v^*)\).
Hence \(\frac{X v^*}{\|X v^*\|}\) is an eigenvector of Σ with eigenvalue \(\frac{\lambda}{n}\).
Therefore, one can calculate the PCA solution by computing the eigendecomposition of K rather than of Σ.
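To make the workaround concrete, here is a minimal NumPy sketch (not part of the original slides) that numerically checks the correspondence between eigenpairs of K = X⊤X and of Σ = (1/n)XX⊤; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 50                      # many more features than samples
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)   # mean-center the samples (columns)

Sigma = X @ X.T / n                  # d x d covariance matrix (expensive for large d)
K = X.T @ X                          # n x n Gram matrix (cheap)

lam, V = np.linalg.eigh(K)           # eigenpairs of K, ascending order
v_star, lam_star = V[:, -1], lam[-1] # leading eigenpair of K

u = X @ v_star
u /= np.linalg.norm(u)               # candidate eigenvector of Sigma

# Sigma u should equal (lam_star / n) * u, as derived on the slide
print(np.allclose(Sigma @ u, (lam_star / n) * u))   # True (up to numerical error)
```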

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 26 / 117

slide-27
SLIDE 27

Principal Component Analysis

Pseudocode

Require: A matrix X of n examples ∈ Rd×n, number of components r
1: for i = 1, . . . , n set x_i ← x_i − (1/n) Σ_{j=1}^{n} x_j
2: if n > d then
3:   Σ = (1/n) X X⊤
4:   Compute the r leading eigenvectors v_1, . . . , v_r of Σ.
5: else
6:   K = X⊤X
7:   Compute the r leading eigenvectors v*_1, . . . , v*_r of K.
8:   for i = 1, . . . , r set v_i := X v*_i / ||X v*_i||
9: end if
10: return Compression matrix W = [v_1, . . . , v_r]⊤ or compressed points WX
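A possible NumPy translation of this pseudocode is sketched below; it is not from the slides, and the function and variable names are chosen for illustration only.

```python
import numpy as np

def pca(X, r):
    """PCA following the pseudocode above.

    X : (d, n) array with one example per column.
    r : number of principal components.
    Returns the (r, d) compression matrix W.
    """
    d, n = X.shape
    X = X - X.mean(axis=1, keepdims=True)      # step 1: mean-center the examples

    if n > d:
        Sigma = X @ X.T / n                    # d x d covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        V = eigvecs[:, ::-1][:, :r]            # r leading eigenvectors of Sigma
    else:
        K = X.T @ X                            # n x n Gram matrix
        eigvals, eigvecs = np.linalg.eigh(K)
        V_star = eigvecs[:, ::-1][:, :r]       # r leading eigenvectors of K
        V = X @ V_star                         # map back to R^d ...
        V /= np.linalg.norm(V, axis=0)         # ... and normalize each column

    return V.T                                 # compression matrix W, shape (r, d)

# usage: Z = pca(X, 2) @ X gives the compressed points W X
```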

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 27 / 117

slide-28
SLIDE 28

Principal Component Analysis

How to choose the number of principal components?

The principal components should capture α per cent of the total variance in the data (α is often set to 90%).
Total variance in the data: \(\sum_{i=1}^{d} \lambda_i\).
Total variance captured by the first r eigenvectors: \(\sum_{i=1}^{r} \lambda_i\).
The variance explained, α, is the ratio between the two:
\[
\alpha = \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{d} \lambda_i}.
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 28 / 117
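As an illustration (not from the slides), this rule can be implemented directly on the eigenvalue spectrum; `eigvals` is assumed to hold the eigenvalues of Σ in arbitrary order.

```python
import numpy as np

def choose_r(eigvals, alpha=0.9):
    """Smallest r such that the first r eigenvalues explain at least alpha of the variance."""
    lam = np.sort(eigvals)[::-1]               # eigenvalues in decreasing order
    explained = np.cumsum(lam) / np.sum(lam)   # cumulative variance explained
    return int(min(np.searchsorted(explained, alpha) + 1, len(lam)))
```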

slide-29
SLIDE 29

Principal Component Analysis

Theorem

The variance captured by the first r eigenvectors of Σ is the sum of its r largest eigenvalues, \(\sum_{i=1}^{r} \lambda_i\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 29 / 117

slide-30
SLIDE 30

Principal Component Analysis

Proof

The variance in a (mean-centered) dataset X is defined as
\[
\begin{aligned}
\operatorname{var}(X) &= \frac{1}{n}\sum_{i=1}^{n} \|x_i - 0\|^2 && (14)\\
&= \frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2 = \frac{1}{n}\sum_{i=1}^{n} \langle x_i, x_i\rangle && (15)\\
&= \operatorname{tr}(\Sigma) = \operatorname{tr}(V D V^\top) = \operatorname{tr}(V^\top V D) = \operatorname{tr}(D) && (16)\\
&= \sum_{i=1}^{d} D_{i,i} = \sum_{i=1}^{d} \lambda_i. && (17)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 30 / 117

slide-31
SLIDE 31

Principal Component Analysis

Proof - Continued

The variance in a projected dataset WX, with W = [v_1, . . . , v_r]⊤, is defined as
\[
\begin{aligned}
\operatorname{var}(WX) &= \frac{1}{n}\sum_{j=1}^{n} \|W x_j\|^2 = \frac{1}{n}\sum_{j=1}^{n} \langle W x_j, W x_j\rangle = \frac{1}{n}\sum_{j=1}^{n} x_j^\top W^\top W x_j && (18)\\
&= \frac{1}{n}\sum_{j=1}^{n} x_j^\top (v_1 v_1^\top + \ldots + v_r v_r^\top)\, x_j = \sum_{i=1}^{r} v_i^\top \left(\sum_{j=1}^{n} \frac{1}{n}\, x_j x_j^\top\right) v_i && (19)\\
&= v_1^\top \Sigma v_1 + \ldots + v_r^\top \Sigma v_r = \sum_{i=1}^{r} v_i^\top \Sigma v_i = \sum_{i=1}^{r} v_i^\top \lambda_i v_i = \sum_{i=1}^{r} \lambda_i. && (20)
\end{aligned}
\]
Therefore the variance explained can be written as a ratio of sums over eigenvalues of the covariance matrix Σ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 31 / 117

slide-32
SLIDE 32

Principal Component Analysis

Geometric Interpretation

An alternative interpretation of PCA is that it finds the major axis of variation in the data. This means that the 1st principal component defines the direction in the data with the greatest variance. The 2nd principal component defines a direction that i) is orthogonal to the 1st principal component and ii) captures the major direction of the remaining variance in the data. In general, the i-th principal component is orthogonal to all previous i − 1 principal components and represents the direction of maximum variance remaining in the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 32 / 117

slide-33
SLIDE 33

Principal Component Analysis

Proof: Variance Maximization along 1st principal component

We start by trying to find one principal component v_1 that maximizes the variance of X:
\[
\operatorname*{arg\,max}_{v_1} \operatorname{Var}(v_1^\top X) = \operatorname*{arg\,max}_{v_1} v_1^\top \Sigma v_1
\]
To avoid picking a v_1 with arbitrarily large entries, we enforce \(v_1^\top v_1 = 1\).
We form the Lagrangian to solve this problem
\[
v_1^\top \Sigma v_1 - \lambda (v_1^\top v_1 - 1) \tag{21}
\]
and take the derivative with respect to v_1 and set it to zero:
\[
\frac{\partial}{\partial v_1}\left( v_1^\top \Sigma v_1 - \lambda (v_1^\top v_1 - 1) \right) = 0 \tag{22}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 33 / 117

slide-34
SLIDE 34

Principal Component Analysis

Proof: Variance Maximization along 1st principal component

\[
\begin{aligned}
\frac{\partial}{\partial v_1}\left( v_1^\top \Sigma v_1 - \lambda (v_1^\top v_1 - 1) \right) &= 0 && (23)\\
2 \Sigma v_1 - 2 \lambda v_1 &= 0 && (24)\\
\Sigma v_1 &= \lambda v_1. && (25)
\end{aligned}
\]
The solution v_1 is an eigenvector of Σ. As \(v_1^\top \Sigma v_1 = v_1^\top \lambda v_1 = \lambda v_1^\top v_1 = \lambda\), the variance is maximized by picking the eigenvector corresponding to the largest eigenvalue. Hence the 1st eigenvector v_1 is the principal component/direction that maximizes the variance of \(v_1^\top X\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 34 / 117

slide-35
SLIDE 35

Principal Component Analysis

Proof: Variance Maximization along 2nd principal component

We will now show the solution for the 2nd principal component. The second direction of projection should be independent of the first one: \(\operatorname{cov}(v_2^\top X, v_1^\top X) = 0\).
This can be written as
\[
\operatorname{cov}(v_2^\top X, v_1^\top X) = v_2^\top X X^\top v_1 = v_2^\top \Sigma v_1 = v_2^\top \lambda v_1 = \lambda\, v_2^\top v_1 = 0.
\]
We form the Lagrangian \(v_2^\top \Sigma v_2 - \lambda (v_2^\top v_2 - 1) - \rho (v_2^\top v_1)\) and set the derivative with respect to v_2 to zero: \(2\Sigma v_2 - 2\lambda v_2 - \rho v_1 = 0\). If we multiply this from the left by \(v_1^\top\), the first two terms vanish: \(-\rho = 0\).
As ρ = 0, we are left with \(\Sigma v_2 = \lambda v_2\), showing that v_2 is again an eigenvector of Σ, and we again pick the eigenvector of the now second-largest eigenvalue to maximize the variance along the second principal component. The proofs for the other principal components k > 2 follow the same scheme.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 35 / 117

slide-36
SLIDE 36

Principal Component Analysis

[Figure: 2D scatter plot of a dataset (x vs. y).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 36 / 117

slide-37
SLIDE 37

Principal Component Analysis

[Figure: the same 2D dataset with the directions of PC1 and PC2 overlaid.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 37 / 117

slide-38
SLIDE 38

Principal Component Analysis

[Figure: the dataset projected onto its principal components (transformed X and Y values).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 38 / 117

slide-39
SLIDE 39

Principal Component Analysis

Summary

Principal Component Analysis (Pearson, 1901) is the optimal way to perform dimensionality reduction to a lower-dimensional space with subsequent reconstruction, in the sense of minimal squared reconstruction error.
The principal components are the eigenvectors of the covariance matrix of the data.
The eigenvalues capture the variance that is described by the corresponding eigenvector.
The number of principal components can be chosen such that they capture α per cent of the total variance.
The principal components capture the orthogonal directions of maximum variance in the data.
PCA can be computed in O(min(n³ + n²d, d³ + d²n)).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 39 / 117

slide-40
SLIDE 40

1.2 Singular Value Decomposition

based on:
Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 2.4.3.2
Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.4
Dr. R. Costin, Lecture Notes of MATHEMATICS 5101: Linear Mathematics in Finite Dimensions, OSU 2013

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 40 / 117

slide-41
SLIDE 41

Singular Value Decomposition

Goals

Get to know a fundamental matrix decomposition that is widely used throughout data mining. Learn how SVD can be used to obtain a low-rank approximation to a given matrix. Get to know how to compute Principal Component Analysis via SVD. Learn how SVD can speed up fundamental matrix operations in data mining.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 41 / 117

slide-42
SLIDE 42

Singular Value Decomposition

Definition

A Singular Value Decomposition is defined as a factorization of a given matrix D ∈ Rn×d into three matrices:
\[
D = L \Delta R^\top, \tag{26}
\]
where
L is an n × n matrix with orthonormal columns, the left singular vectors,
∆ is an n × d diagonal matrix containing the singular values, and
R is a d × d matrix with orthonormal columns, the right singular vectors.
The singular values are non-negative and by convention arranged in decreasing order. Note that ∆ is not (necessarily) a square matrix.
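For concreteness, a short NumPy sketch (not part of the slides) that computes this factorization and checks the shapes of the three factors:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 4))                # n = 6 examples, d = 4 features

L, delta, Rt = np.linalg.svd(D)                # full SVD: L is 6x6, Rt is 4x4
Delta = np.zeros_like(D)                       # n x d matrix ...
Delta[:len(delta), :len(delta)] = np.diag(delta)   # ... with the singular values on its diagonal

print(np.allclose(D, L @ Delta @ Rt))          # True: D = L Delta R^T
print(delta)                                   # non-negative, in decreasing order
```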

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 42 / 117

slide-43
SLIDE 43

Singular Value Decomposition

[Figure: block-matrix view of the decomposition D = L Δ R⊤, where D is n × d, L is n × n, Δ is an n × d diagonal matrix with δ₁, δ₂, . . . on its diagonal, and R⊤ is d × d.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 43 / 117

slide-44
SLIDE 44

Singular Value Decomposition

Reduced SVD

One can discard the singular vectors that correspond to zero singular values to obtain the reduced SVD:
\[
D = L_r \Delta_r R_r^\top,
\]
where L_r is the n × r matrix of the left singular vectors, R_r is the d × r matrix of the right singular vectors, and ∆_r is the r × r diagonal matrix containing the positive singular values.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 44 / 117

slide-45
SLIDE 45

Singular Value Decomposition

Reduced SVD

The reduced SVD gives rise to the spectral decomposition of D:
\[
\begin{aligned}
D = L_r \Delta_r R_r^\top &= [l_1, l_2, \ldots, l_r]
\begin{pmatrix}
\delta_1 & & & \\
& \delta_2 & & \\
& & \ddots & \\
& & & \delta_r
\end{pmatrix}
\begin{pmatrix}
r_1^\top \\ r_2^\top \\ \vdots \\ r_r^\top
\end{pmatrix} && (27),(28)\\
&= \delta_1 l_1 r_1^\top + \delta_2 l_2 r_2^\top + \ldots + \delta_r l_r r_r^\top = \sum_{i=1}^{r} \delta_i l_i r_i^\top && (29)
\end{aligned}
\]
Hence D is represented as a sum of rank-one matrices of the form \(\delta_i l_i r_i^\top\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 45 / 117

slide-46
SLIDE 46

Singular Value Decomposition

Eckart-Young Theorem

By selecting the r largest singular values δ1, δ2, . . . , δr, and the corresponding left and right singular vectors, we obtain the best rank-r approximation to the original matrix D.

Theorem (Eckart-Young Theorem, 1936)

If D_r is the matrix defined as \(\sum_{i=1}^{r} \delta_i l_i r_i^\top\), then D_r is the rank-r matrix that minimizes the objective \(\|D - D_r\|_F\). The Frobenius norm of a matrix A, \(\|A\|_F\), is defined as
\[
\|A\|_F = \sqrt{\sum_{i} \sum_{j} A_{i,j}^2}.
\]
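A small NumPy sketch (an illustration, not from the slides) of the rank-r approximation \(D_r = \sum_{i=1}^{r}\delta_i l_i r_i^\top\) and its Frobenius error:

```python
import numpy as np

def best_rank_r(D, r):
    """Best rank-r approximation of D in Frobenius norm (Eckart-Young)."""
    L, delta, Rt = np.linalg.svd(D, full_matrices=False)
    return L[:, :r] @ np.diag(delta[:r]) @ Rt[:r, :]   # sum of delta_i * l_i * r_i^T for i <= r

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 30))
D2 = best_rank_r(D, 2)
print(np.linalg.matrix_rank(D2))            # 2
print(np.linalg.norm(D - D2, 'fro'))        # minimal over all rank-2 matrices
```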

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 46 / 117

slide-47
SLIDE 47

Singular Value Decomposition

Proof of Eckart-Young Theorem (Part 1/2)

Assume D is of rank k (k > r). Since L and R are orthogonal, \(\|D - D_r\|_F = \|L\Delta R^\top - D_r\|_F = \|\Delta - L^\top D_r R\|_F\). Denoting N = L⊤D_rR, we can compute the squared Frobenius norm as
\[
\|\Delta - N\|_F^2 = \sum_{i,j} |\Delta_{i,j} - N_{i,j}|^2 = \sum_{i=1}^{k} |\delta_i - N_{i,i}|^2 + \sum_{i>k} |N_{i,i}|^2 + \sum_{i\neq j} |N_{i,j}|^2.
\]
This is minimized if all off-diagonal terms of N and all N_{i,i} for i > k are zero. Since N can have rank at most r, the minimum of \(\sum_i |\delta_i - N_{i,i}|^2\) is obtained for N_{i,i} = δ_i for i = 1, . . . , r, while all other N_{i,i} are zero.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 47 / 117

slide-48
SLIDE 48

Singular Value Decomposition

Proof of Eckart-Young Theorem (Part 2/2)

We do not need the full L and R for computing D_r, only their first r columns. This can be seen by splitting L and R into blocks, L = [L_r, L_0] and R = [R_r, R_0], where ∆_r below denotes the singular-value matrix with all but the r largest singular values set to zero:
\[
\begin{aligned}
D_r = L \Delta_r R^\top &= [L_r, L_0]\, \Delta_r \begin{pmatrix} R_r^\top \\ R_0^\top \end{pmatrix} && (30)\\
&= [L_r, L_0] \begin{pmatrix} \Delta_r R_r^\top \\ 0 \end{pmatrix} = L_r \Delta_r R_r^\top. && (31)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 48 / 117

slide-49
SLIDE 49

Singular Value Decomposition

Geometric interpretation: Column and row space

Any n × d matrix D represents a linear transformation, D : Rd → Rn, from the space of d-dimensional vectors to the space of n-dimensional vectors, because for any x ∈ Rd, there exists a y ∈ Rn, such that Dx = y. The column space of D is defined as the set of all vectors y ∈ Rn such that Dx = y over all possible x ∈ Rd. The row space of D is defined as the set of all vectors x ∈ Rd such that D⊤y = x over all possible y ∈ Rn. The row space of D is the column space of D⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 49 / 117

slide-50
SLIDE 50

Singular Value Decomposition

Geometric interpretation: Null spaces

The set of all vectors x ∈ Rd, such that Dx = 0 is called the (right) null space of D. The set of all vectors y ∈ Rn, such that D⊤y = 0 is called the left null space of D.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 50 / 117

slide-51
SLIDE 51

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

SVD gives a basis for each of the four spaces associated with a matrix D. If D has rank r, then it has only r independent columns and only r independent rows. The r left singular vectors l1, l2, . . . , lr corresponding to the r nonzero singular values of D represent a basis for the column space of D. The remaining n − r left singular vectors lr+1, . . . , ln represent a basis for the left null space of D. The r right singular vectors r1, r2, . . . , rr, corresponding to the r non-zero singular values represent a basis for the row space of D. The remaining d − r right singular vectors rr+1, . . . , rd represent a basis for the (right) null space of D.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 51 / 117

slide-52
SLIDE 52

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

Proof of right null space:
\[
\begin{aligned}
Dx &= 0 && (32)\\
L^\top L \Delta R^\top x &= 0 && (33)\\
R^\top x &=: \tilde{x} && (34)\\
\Delta \tilde{x} &= (\delta_1 \tilde{x}_1, \delta_2 \tilde{x}_2, \ldots, \delta_r \tilde{x}_r, 0, \ldots, 0)^\top = 0 && (35)\\
x &= R \tilde{x} && (36)
\end{aligned}
\]
The weights for the first r right singular vectors r_1, . . . , r_r are zero. Hence x is a linear combination of the d − r right singular vectors r_{r+1}, . . . , r_d.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 52 / 117

slide-53
SLIDE 53

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

Consider the reduced SVD expression from Equation (27). Right-multiplying both sides by R_r and noting that \(R_r^\top R_r = I\), we obtain:
\[
\begin{aligned}
D R_r &= L_r \Delta_r R_r^\top R_r && (37)\\
D R_r &= L_r \Delta_r && (38)\\
D R_r &= L_r \begin{pmatrix} \delta_1 & & & \\ & \delta_2 & & \\ & & \ddots & \\ & & & \delta_r \end{pmatrix} && (39)\\
D\,[r_1, r_2, \ldots, r_r] &= [\delta_1 l_1, \delta_2 l_2, \ldots, \delta_r l_r] && (40)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 53 / 117

slide-54
SLIDE 54

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

Hence Dri = δili for all i = 1, . . . , r. That means, SVD is a special factorization of the matrix D such that any basis vector ri for the row space is mapped to the corresponding basis vector li for the column space, scaled by δi. SVD can be thought of as a mapping from an orthonormal basis (r1, r2, . . . , rr) in Rd, the row space, to an orthonormal basis (l1, l2, . . . , lr) in Rn, the column space, with the corresponding axes scaled according to the singular values δ1, δ2, . . . , δr.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 54 / 117

slide-55
SLIDE 55

Singular Value Decomposition

Link to PCA

There is a direct link between PCA and SVD, which we will elucidate next. This link allows the computation of PCA via SVD. We assume that D is a d × n matrix.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 55 / 117

slide-56
SLIDE 56

Singular Value Decomposition

Link to PCA: The covariance matrix via SVD

The columns of the matrix L, which are also the left singular vectors, are the orthonormal eigenvectors of DD⊤.
Proof: \(D D^\top = L \Delta (R^\top R) \Delta^\top L^\top = L \Delta \Delta^\top L^\top\)
Note that R⊤R = I. The diagonal entries of ∆∆⊤, that is, the squared nonzero singular values, are therefore the nonzero eigenvalues of DD⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 56 / 117

slide-57
SLIDE 57

Singular Value Decomposition

Link to PCA: The covariance matrix via SVD

The covariance matrix of mean-centered data is \(\frac{1}{n} D D^\top\), and the left singular vectors of the SVD are eigenvectors of DD⊤. It follows that the eigenvectors of PCA are the same as the left singular vectors of the SVD for mean-centered data. Furthermore, the squared singular values of the SVD are n times the eigenvalues of PCA.
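This correspondence can be checked numerically; the following sketch (illustrative, not from the slides) assumes D is a d × n matrix of mean-centered data, as on this slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
D = rng.standard_normal((d, n))
D -= D.mean(axis=1, keepdims=True)             # mean-center the n samples

# PCA: eigendecomposition of the covariance matrix (1/n) D D^T
Sigma = D @ D.T / n
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in decreasing order

# SVD of the data matrix
L, delta, Rt = np.linalg.svd(D, full_matrices=False)

# Left singular vectors = PCA eigenvectors (up to sign), and delta^2 = n * eigenvalues
print(np.allclose(delta**2 / n, eigvals))
print(np.allclose(np.abs(L), np.abs(eigvecs)))
```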

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 57 / 117

slide-58
SLIDE 58

Singular Value Decomposition

Link to PCA: The kernel matrix via SVD

The columns of the matrix R, which are the right singular vectors, are the orthonormal eigenvectors of D⊤D.
Proof: \(D^\top D = R \Delta^\top (L^\top L) \Delta R^\top = R \Delta^\top \Delta R^\top\)
Note that L⊤L = I. The diagonal entries of ∆⊤∆, that is, the squared nonzero singular values of ∆, are therefore the nonzero eigenvalues of D⊤D.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 58 / 117

slide-59
SLIDE 59

Singular Value Decomposition

Link to PCA: The kernel matrix via SVD

The (linear) kernel matrix on D is D⊤D, and the right singular vectors of the SVD are eigenvectors of D⊤D.
It follows that each eigenvector v of PCA can be expressed in terms of a right singular vector r of the SVD:
\[
v = \frac{D r}{\|D r\|}
\]
Again, the squared singular values of the SVD are n times the eigenvalues of PCA \(\left(\lambda_{\Sigma} = \frac{\lambda_K}{n} = \frac{\delta^2}{n}\right)\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 59 / 117

slide-60
SLIDE 60

Singular Value Decomposition

Applications of SVD and PCA beyond dimensionality reduction

Noise reduction: Removing smaller eigenvectors or singular vectors often improves data quality.
Data imputation: The reduced SVD matrices L_r, ∆_r and R_r can be computed even for incomplete matrices and used to impute missing values (→ link prediction).
Linear equations: Obtain a basis of the solution space of Dx = 0.
Matrix inversion: Assume D is square and invertible. Then D = L∆R⊤ ⇒ D⁻¹ = R∆⁻¹L⊤.
Powers of a matrix: Assume D is square and positive semidefinite. Then D = L∆L⊤ and Dᵏ = L∆ᵏL⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 60 / 117

slide-61
SLIDE 61

Singular Value Decomposition

Summary

Singular Value Decomposition (SVD) is a decomposition of a given matrix D into three submatrices. SVD can be used to obtain an optimal low-rank approximation of a matrix in terms of the Frobenius norm. SVD generates bases for the four spaces associated with a matrix. D can be thought of as a mapping between the basis vectors of its row space and the scaled basis vectors of its column space. SVD can be used to implement PCA. SVD is used throughout data mining for efficient matrix computations.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 61 / 117

slide-62
SLIDE 62

1.3 Kernel Principal Component Analysis

based on:
B. Schölkopf, A. Smola, K.R. Müller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Max Planck Institute for Biological Cybernetics, Technical Report No. 44, 1996
Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.3

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 62 / 117

slide-63
SLIDE 63

Kernel Principal Component Analysis

[Figure: Nonlinear 2D dataset (x vs. y).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 63 / 117

slide-64
SLIDE 64

Kernel Principal Component Analysis

[Figure: a second nonlinear 2D dataset (x vs. y).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 64 / 117

slide-65
SLIDE 65

Kernel Principal Component Analysis

Goals

Learn how to perform non-linear dimensionality reduction. Learn how to perform PCA in feature space solely in terms of kernel functions. Learn how to compute the projections of points onto principal components in feature space.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 65 / 117

slide-66
SLIDE 66

Kernel Principal Component Analysis

Kernel Principal Component Analysis (Schölkopf, Smola, Müller, 1996)

PCA is restrictive in the sense that it only allows for linear dimensionality reduction. What about non-linear dimensionality reduction?
Idea: Move the computation of principal components to feature space. This approach exists and is called Kernel PCA (Schölkopf, Smola and Müller, 1996)!
Define a mapping φ : X → H, x ↦ φ(x).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 66 / 117

slide-67
SLIDE 67

Kernel Principal Component Analysis

Kernel Principal Component Analysis

We assume that we are dealing with centered data: \(\sum_{i=1}^{n} \varphi(x_i) = 0\).
The covariance matrix then takes the form \(C = \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i)\varphi(x_i)^\top\).
Then we have to find eigenvalues λ ≥ 0 and nonzero eigenvectors v ∈ H \ {0} satisfying λv = Cv.
All solutions v with λ ≠ 0 lie in the span of φ(x_1), . . . , φ(x_n), due to the fact that
\[
\lambda v = C v = \frac{1}{n}\sum_{i=1}^{n} \left(\varphi(x_i)^\top v\right) \varphi(x_i).
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 67 / 117

slide-68
SLIDE 68

Kernel Principal Component Analysis

Kernel Principal Component Analysis

The first useful consequence is that we can consider the following equations instead:
\[
\lambda \langle \varphi(x_j), v\rangle = \langle \varphi(x_j), C v\rangle \quad \text{for all } j = 1, \ldots, n,
\]
and the second is that there exist coefficients \(\alpha_j\ (j = 1, \ldots, n)\) such that \(v = \sum_{i=1}^{n} \alpha_i \varphi(x_i)\).
Combining these two consequences, we get for all j = 1, . . . , n:
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \varphi(x_j), \varphi(x_i)\rangle = \frac{1}{n} \sum_{i=1}^{n} \alpha_i \left\langle \varphi(x_j), \sum_{k=1}^{n} \varphi(x_k) \langle \varphi(x_k), \varphi(x_i)\rangle \right\rangle.
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 68 / 117

slide-69
SLIDE 69

Kernel Principal Component Analysis

Kernel Principal Component Analysis

Combining these two consequences, we get for all j = 1, . . . , n:
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \varphi(x_j), \varphi(x_i)\rangle = \frac{1}{n} \sum_{i=1}^{n} \alpha_i \sum_{k=1}^{n} \langle \varphi(x_j), \varphi(x_k)\rangle \langle \varphi(x_k), \varphi(x_i)\rangle.
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 69 / 117

slide-70
SLIDE 70

Kernel Principal Component Analysis

Kernel PCA as an eigenvector problem

In terms of the n × n Gram matrix \(K_{j,i} := \langle \varphi(x_j), \varphi(x_i)\rangle\), this can be rewritten as
\[
n\lambda K \alpha = K^2 \alpha, \tag{41}
\]
where α denotes the column vector with entries α_1, . . . , α_n. To find solutions of Equation (41), we solve the eigenvalue problem
\[
n\lambda \alpha = K \alpha, \tag{42}
\]
which we obtain by multiplying (41) by K⁻¹ from the left.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 70 / 117

slide-71
SLIDE 71

Kernel Principal Component Analysis

Normalizing the coefficients

We require the eigenvectors v^k to have unit length, that is, \(\langle v^k, v^k\rangle = 1\) for all k = 1, . . . , r. That means that
\[
\begin{aligned}
1 &= \langle v^k, v^k\rangle && (43)\\
&= \sum_{i,j=1}^{n} \alpha_i^k \alpha_j^k \langle \varphi(x_i), \varphi(x_j)\rangle && (44)\\
&= \sum_{i,j=1}^{n} \alpha_i^k \alpha_j^k K_{i,j} && (45)\\
&= \langle \alpha^k, K \alpha^k\rangle = \lambda_k \langle \alpha^k, \alpha^k\rangle. && (46)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 71 / 117

slide-72
SLIDE 72

Kernel Principal Component Analysis

Normalizing the coefficients

As eigenvectors of K, the α^k have unit norm. Therefore we have to rescale them by \(\frac{1}{\sqrt{\lambda_k}}\) to enforce that their norm is \(\frac{1}{\sqrt{\lambda_k}}\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 72 / 117

slide-73
SLIDE 73

Kernel Principal Component Analysis

Projecting points onto principal components

A point x can be projected onto the principal component v^k (for k = 1, . . . , r) via
\[
\langle v^k, \varphi(x)\rangle = \sum_{i=1}^{n} \alpha_i^k \langle \varphi(x_i), \varphi(x)\rangle,
\]
resulting in an r-dimensional representation of x based on Kernel PCA in H.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 73 / 117

slide-74
SLIDE 74

Kernel Principal Component Analysis

How to center the kernel matrix in H?

The kernel matrix K can be centered via
\[
\begin{aligned}
\tilde{K} &= K - \tfrac{1}{n} 1_{n\times n} K - \tfrac{1}{n} K\, 1_{n\times n} + \tfrac{1}{n^2} 1_{n\times n} K\, 1_{n\times n} && (47)\\
&= \left(I - \tfrac{1}{n} 1_{n\times n}\right) K \left(I - \tfrac{1}{n} 1_{n\times n}\right) && (48)
\end{aligned}
\]
using the notation \((1_{n\times n})_{ij} := 1\) for all i, j.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 74 / 117

slide-75
SLIDE 75

Kernel Principal Component Analysis

Pseudocode: KernelPCA(X, r)

Require: A matrix X of n examples ∈ Rd×n, number of components r
1: K_{i,j} = k(x_i, x_j) for i, j = 1, . . . , n
2: K := (I − (1/n) 1_{n×n}) K (I − (1/n) 1_{n×n})
3: (λ_1, . . . , λ_r) = eigenvalues(K)
4: (α_1, . . . , α_r) = eigenvectors(K)
5: α_i := (1/√λ_i) α_i for all i = 1, . . . , r
6: A = (α_1, . . . , α_r)
7: return Set of projected points: Z = {z_i | z_i = A⊤K(:, i), for i = 1, . . . , n}
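A compact NumPy sketch of this pseudocode (illustrative only; the RBF kernel and its bandwidth parameter are assumptions, not part of the slides):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix for X with one example per column (d x n)."""
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * X.T @ X       # pairwise squared distances
    return np.exp(-gamma * d2)

def kernel_pca(X, r, gamma=1.0):
    n = X.shape[1]
    K = rbf_kernel(X, gamma)                           # step 1: kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ K @ H                                      # step 2: double-center K
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals = eigvals[::-1][:r]                        # steps 3-4: r leading eigenpairs
    A = eigvecs[:, ::-1][:, :r] / np.sqrt(eigvals)     # step 5: rescale alpha_i by 1/sqrt(lambda_i)
    return A.T @ K                                     # step 7: projected points, shape (r, n)
```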

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 75 / 117

slide-76
SLIDE 76

Kernel Principal Component Analysis

Summary

Non-linear dimensionality reduction can be performed in feature space in KernelPCA. At the heart of KernelPCA is finding eigenvectors of the kernel matrix. One can compute projections of the data onto the principal components explicitly, but not the principal components themselves (unless function φ is explicitly known).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 76 / 117

slide-77
SLIDE 77

1.4 Multidimensional Scaling

based on:
Wolfgang Karl Härdle, Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17
Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapters 14.8 and 14.9

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 77 / 117

slide-78
SLIDE 78

Multidimensional Scaling

Pairwise distance matrix between six objects (diagonal entries are zero):

      0   214   279   610   596   237
    214     0   492   533   496   444
    279   492     0   520   772   140
    610   533   520     0   521   687
    596   496   772   521     0   771
    237   444   140   687   771     0

Source of table and subsequent figure: Wolfgang Karl Härdle, Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 78 / 117

slide-79
SLIDE 79

Multidimensional Scaling

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 79 / 117

slide-80
SLIDE 80

Multidimensional Scaling

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 80 / 117

slide-81
SLIDE 81

Multidimensional Scaling

Goals

Find a low-dimensional representation of data for which only distances, similarities or dissimilarities are given.
Understand the link between multidimensional scaling and PCA.

Applications

Visualize similarities or dissimilarities between high-dimensional objects, e.g. DNA sequences, protein structures

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 81 / 117

slide-82
SLIDE 82

Multidimensional Scaling

Setting

We assume that we are given the pairwise distances, similarities or dissimilarities d_{i,j} between all pairs of points in a dataset. Note: we do not need to know the actual coordinates of the points, as in PCA.
The goal in multidimensional scaling (MDS) is to
1. find a low-dimensional representation of the data points,
2. which maximally preserves the pairwise distances.
The solution is only determined up to rotation, reflection and shift.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 82 / 117

slide-83
SLIDE 83

Multidimensional Scaling

Optimization problem

We assume that the original data objects are x_1, . . . , x_n ∈ Rd, and the distance between x_i and x_j is d_{i,j}. The goal is to find a lower-dimensional representation of these n objects: z_1, . . . , z_n ∈ Rr. Hence the objective in metric MDS is to minimize the so-called stress function S_M:
\[
\operatorname*{arg\,min}_{z_1,\ldots,z_n} S_M(z_1, \ldots, z_n) = \sum_{i \neq j} (d_{ij} - \|z_i - z_j\|)^2 \tag{49}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 83 / 117

slide-84
SLIDE 84

Multidimensional Scaling

Classic scaling

Definition

An n × n distance matrix D_{ij} = d_{ij} is Euclidean if for some points x_1, . . . , x_n ∈ Rd:
\[
d_{ij}^2 = (x_i - x_j)^\top (x_i - x_j).
\]

Theorem

Define an n × n matrix A with \(A_{ij} = -\frac{1}{2} d_{ij}^2\) and B = HAH, where H is the centering matrix. D is Euclidean if and only if B is positive semidefinite. If D is the distance matrix of a data matrix X, then B = HXX⊤H. B is called the inner product matrix.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 84 / 117

slide-85
SLIDE 85

Multidimensional Scaling

Classical scaling - From distances to inner products: Overview

If distances are Euclidean, we can convert them to centered inner products. Let \(d_{ij}^2 = \|x_i - x_j\|^2\) be the matrix of pairwise squared Euclidean distances. Then we write
\[
d_{ij}^2 = \|x_i - \bar{x}\|^2 + \|x_j - \bar{x}\|^2 - 2\langle x_i - \bar{x},\, x_j - \bar{x}\rangle \tag{50}
\]
Defining \(A_{ij} = \{-\frac{1}{2} d_{ij}^2\}\), we double-center A to obtain B:
\[
B = HAH, \tag{51}
\]
where \(H = I - \frac{1}{n} 1 1^\top\). B is then the matrix of centered inner products.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 85 / 117

slide-86
SLIDE 86

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

The task of MDS is to find the original Euclidean coordinates from a given distance matrix. The squared Euclidean distance between the i-th and j-th points is given by
\[
d_{ij}^2 = \sum_{k=1}^{d} (x_{ik} - x_{jk})^2
\]
The general term b_{ij} of B is given by
\[
b_{ij} = \sum_{k=1}^{d} x_{ik} x_{jk} = x_i^\top x_j
\]
D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 86 / 117

slide-87
SLIDE 87

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

It is possible to derive B from the known squared distances d_{ij}, and then, from B, the unknown coordinates:
\[
d_{ij}^2 = x_i^\top x_i + x_j^\top x_j - 2 x_i^\top x_j = b_{ii} + b_{jj} - 2 b_{ij} \tag{52}
\]
Centering of the coordinate matrix X implies that \(\sum_{i=1}^{n} b_{ij} = 0\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 87 / 117

slide-88
SLIDE 88

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

Summing over i and j, we obtain
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} d_{ij}^2 &= \frac{1}{n}\sum_{i=1}^{n} b_{ii} + b_{jj} && (53)\\
\frac{1}{n}\sum_{j=1}^{n} d_{ij}^2 &= b_{ii} + \frac{1}{n}\sum_{j=1}^{n} b_{jj} && (54)\\
\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 &= \frac{2}{n}\sum_{i=1}^{n} b_{ii} && (55)\\
b_{ij} &= -\frac{1}{2}\left( d_{ij}^2 - d_{i\bullet}^2 - d_{\bullet j}^2 + d_{\bullet\bullet}^2 \right) && (56)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 88 / 117

slide-89
SLIDE 89

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

With \(a_{ij} = -\frac{1}{2} d_{ij}^2\), and
\[
\begin{aligned}
a_{i\bullet} &= \frac{1}{n}\sum_{j=1}^{n} a_{ij} && (57)\\
a_{\bullet j} &= \frac{1}{n}\sum_{i=1}^{n} a_{ij} && (58)\\
a_{\bullet\bullet} &= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}, && (59)
\end{aligned}
\]
we get \(b_{ij} = a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 89 / 117

slide-90
SLIDE 90

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

Define the matrix A as (a_{ij}) and observe that
\[
B = HAH \tag{60}
\]
The inner product matrix B can be expressed as
\[
B = XX^\top, \tag{61}
\]
where X = (x_1, . . . , x_n)⊤ is the n × p matrix of coordinates. The rank of B is then
\[
\operatorname{rank}(B) = \operatorname{rank}(XX^\top) = \operatorname{rank}(X) = p. \tag{62}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 90 / 117

slide-91
SLIDE 91

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

B is symmetric, positive semidefinite and of rank p, and therefore has p non-zero eigenvalues. B can now be written as
\[
B = V_p \Lambda_p V_p^\top, \tag{63}
\]
where Λ_p = diag(λ_1, . . . , λ_p) is the diagonal matrix of the eigenvalues of B, and V_p = (v_1, . . . , v_p) is the matrix of corresponding eigenvectors. Hence the coordinate matrix X containing the point configuration in Rp is given by
\[
X = V_p \Lambda_p^{\frac{1}{2}} \tag{64}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 91 / 117

slide-92
SLIDE 92

Multidimensional Scaling

Lower-dimensional approximation

What if we want the representation of X to be lower-dimensional than p (r-dimensional, where r < p)?
Minimize \(\operatorname*{arg\,min}_{Z : ZZ^\top = B_r} \|B - B_r\|_F\).
This is achieved if B_r is the rank-r approximation of B based on SVD. Then
\[
Z = V_r \Lambda_r^{\frac{1}{2}}. \tag{65}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 92 / 117

slide-93
SLIDE 93

Multidimensional Scaling

Classical scaling - Pseudocode

The algorithm for recovering coordinates from distances D_{ij} = d_{ij} between pairs of points is as follows:

1. Form the matrix \(A_{ij} = \{-\frac{1}{2} d_{ij}^2\}\).
2. Form the matrix B = HAH, where H is the centering matrix \(H = I - \frac{1}{n} 1 1^\top\).
3. Find the spectral decomposition of B, B = VΛV⊤, where Λ is the diagonal matrix formed from the eigenvalues of B, and V is the matrix of corresponding eigenvectors.
4. If the points were originally in a p-dimensional space, the first p eigenvalues of B are nonzero and the remaining n − p are zero. Discard the zero eigenvalues from Λ (rename it Λ_p), and discard the corresponding eigenvectors from V (rename it V_p).
5. Find \(X = V_p \Lambda_p^{\frac{1}{2}}\); the coordinates of the points are then given by the rows of X.
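A direct NumPy translation of these five steps (a sketch, not from the slides; the function name is illustrative):

```python
import numpy as np

def classical_mds(D, r):
    """Recover r-dimensional coordinates from an n x n distance matrix D."""
    n = D.shape[0]
    A = -0.5 * D**2                                 # step 1
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    B = H @ A @ H                                   # step 2
    eigvals, eigvecs = np.linalg.eigh(B)            # step 3: spectral decomposition
    idx = np.argsort(eigvals)[::-1][:r]             # step 4: keep the r largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])  # step 5: rows are the recovered points
```

For example, it could be applied with r = 2 to the 6 × 6 distance table shown on slide 78 to obtain a planar configuration of the six objects (up to rotation, reflection and shift).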

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 93 / 117

slide-94
SLIDE 94

Multidimensional Scaling

Non-metric multidimensional scaling

Classic scaling, and more general metric scaling, approximate actual dissimilarities or similarities between the data. Non-metric scaling effectively uses a ranking of the dissimilarities to obtain a low-dimensional approximation of the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 94 / 117

slide-95
SLIDE 95

Multidimensional Scaling

Methods for nonlinear dimensionality reduction

The underlying idea of these methods is that the data lie close to an intrinsically low-dimensional nonlinear manifold embedded in a high-dimensional space. The methods "flatten" the manifold and thereby create a low-dimensional representation of the data and their relative locations on the manifold. They tend to be useful for systems with high signal-to-noise ratios.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 95 / 117

slide-96
SLIDE 96

Multidimensional Scaling

ISOMAP (Tenenbaum et al., 2000)

Isometric feature mapping (ISOMAP) constructs a graph to approximate the geodesic distance between points along the manifold:
Find all points in the ε-neighborhood of a point.
Connect the point to its neighbors by an edge.
The distance between all non-adjacent points is approximated by the shortest-path (geodesic) distance along the neighborhood graph.
Finally, classical scaling is applied to the graph distances.

Hence ISOMAP can be thought of as multidimensional scaling on an ε-neighborhood graph.
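A minimal sketch of these steps (illustrative only; it uses a k-nearest-neighbor graph instead of an ε-neighborhood and SciPy's shortest-path routine, which are assumptions, not part of the slides):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, r=2):
    """X: (n, d) data matrix. Returns an (n, r) embedding."""
    n = X.shape[0]
    dist = cdist(X, X)                                    # pairwise Euclidean distances
    graph = np.full((n, n), np.inf)                       # inf marks non-edges
    for i in range(n):                                    # keep only edges to the k nearest neighbors
        nn = np.argsort(dist[i])[1:n_neighbors + 1]
        graph[i, nn] = dist[i, nn]
        graph[nn, i] = dist[i, nn]
    # n_neighbors must be large enough to keep the graph connected
    geo = shortest_path(graph, method='D')                # geodesic (shortest-path) distances
    # classical scaling on the geodesic distances
    A = -0.5 * geo**2
    H = np.eye(n) - np.ones((n, n)) / n
    B = H @ A @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:r]
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])
```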

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 96 / 117

slide-97
SLIDE 97

Multidimensional Scaling

ISOMAP (Tenenbaum et al., 2000)

1. Find the neighbours of each point
2. Connect the point to its neighbours by an edge
3. Compute shortest paths between non-adjacent points
4. MDS on graph distances

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 97 / 117

slide-98
SLIDE 98

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

Key idea: Local linear embedding (LLE) approximates each point as a linear combination of its neighbors. Then a lower dimensional representation is constructed that best preserves these local approximations.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 98 / 117

slide-99
SLIDE 99

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

1. Find the k-nearest neighbours of each point
2. Compute weights for each point (linear combination of its neighbours)
3. Compute embedding coordinates for fixed weights

[Figure: a point x0 approximated as a weighted combination of its neighbours x1, x2, x3 with weights w1, w2, w3: x0 = x1·w1 + x2·w2 + x3·w3.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 99 / 117

slide-100
SLIDE 100

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

For each data point x_i in d dimensions, we find its k nearest neighbors N(i) in Euclidean distance. We approximate each point by an affine mixture of the points in its neighborhood:
\[
\min_{W_{il}} \left\| x_i - \sum_{l \in N(i)} w_{il}\, x_l \right\|^2 \tag{66}
\]
over weights w_{il} satisfying \(w_{il} = 0\) for \(l \notin N(i)\) and \(\sum_{l=1}^{N} w_{il} = 1\). Here w_{il} is the contribution of point l to the reconstruction of point i. Note that, to have any hope of a unique solution, we must have k < d.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 100 / 117

slide-101
SLIDE 101

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

In the final step, we find points y_i in a space of dimension r < d to minimize
\[
\sum_{i=1}^{N} \left\| y_i - \sum_{l=1}^{N} w_{il}\, y_l \right\|^2 \tag{67}
\]
with w_{il} fixed.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 101 / 117

slide-102
SLIDE 102

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

In step 3, the following expression is minimized:
\[
\begin{aligned}
&\operatorname{tr}\left( (Y - WY)^\top (Y - WY) \right) && (68)\\
=\; &\operatorname{tr}\left( Y^\top (I - W)^\top (I - W)\, Y \right) && (69)
\end{aligned}
\]
where W is N × N and Y is N × r, for some small r < d. The solutions are the trailing eigenvectors of M = (I − W)⊤(I − W). The constant vector 1 is a trivial eigenvector with eigenvalue 0, which is discarded; the solution is given by the next r eigenvectors. As a side effect, 1⊤Y = 0, that is, the embedding coordinates are centered.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 102 / 117

slide-103
SLIDE 103

Multidimensional Scaling

Local MDS (Chen and Buja, 2008)

Local MDS defines N to be the symmetric set of nearby pairs of points: an edge exists between two points (i, i′) if i is among the k nearest neighbors of i′, or vice versa. Then the stress function to be minimized is
\[
S_L(z_1, \ldots, z_N) = \sum_{(i,i')\in N} (d_{ii'} - \|z_i - z_{i'}\|)^2 + \sum_{(i,i')\notin N} w\,(D - \|z_i - z_{i'}\|)^2 \tag{70}
\]
D is a large constant and w is a weight. A large choice of D means that non-neighbors should be far apart. A small choice of w ensures that non-neighbors do not dominate the overall objective function.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 103 / 117

slide-104
SLIDE 104

Multidimensional Scaling

Local MDS (Chen and Buja, 2008)

Setting w ∼ 1/D and letting D → ∞, we obtain:
\[
\begin{aligned}
S_L(z_1, \ldots, z_N) &= \sum_{(i,i')\in N} (d_{ii'} - \|z_i - z_{i'}\|)^2 - \tau \sum_{(i,i')\notin N} \|z_i - z_{i'}\| && (71)\\
&\quad + w \sum_{(i,i')\notin N} \|z_i - z_{i'}\|^2 && (72)
\end{aligned}
\]
where τ = 2wD. The last term vanishes for D → ∞. The first term tries to preserve local structure, and the second term encourages the representations z_i, z_{i'} for non-neighbor pairs (i, i′) to be further apart. Local MDS optimizes (z_1, . . . , z_N) for fixed values of k and τ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 104 / 117

slide-105
SLIDE 105

Multidimensional Scaling

Summary

Multidimensional scaling learns a low-dimensional representation of the data given only dissimilarity or similarity scores.
The solution of classic scaling is given by the scaled eigenvectors of the centered inner product matrix; these coincide with the projections onto the principal components in PCA.
ISOMAP, Local Linear Embedding and Local MDS try to approximate local distances rather than all pairwise distances, to better capture the underlying manifold by nonlinear dimensionality reduction.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 105 / 117

slide-106
SLIDE 106

1.5 Self-Organizing Maps

based on: Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapter 14.4

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 106 / 117

slide-107
SLIDE 107

Self-Organizing Maps

Goals

Understand the key idea behind self-organizing maps.
Understand the computation of self-organizing maps.
Understand the link to k-means.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 107 / 117

slide-108
SLIDE 108

Self-Organizing Maps

Key idea (Kohonen, 1982)

The idea is to learn a map of the data, i.e. a low-dimensional embedding.
Self-Organizing Maps can be thought of as a constrained version of k-means clustering, in which the prototypes are encouraged to lie in a one- or two-dimensional manifold in the feature space.
This manifold is also referred to as a constrained topological map, since the original high-dimensional observations can be mapped down onto the two-dimensional coordinate system.
The original SOM algorithm was online, but batch versions have been proposed in the meantime.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 108 / 117

slide-109
SLIDE 109

Self-Organizing Maps

Key idea (Kohonen, 1982)

We consider a SOM with a two-dimensional rectangular grid of k prototypes m_j ∈ Rd. The choice of a grid is arbitrary; other choices such as hexagonal grids are also possible.
Each of the k prototypes is parameterized with respect to an integer coordinate pair ℓ_j ∈ Q1 × Q2. Here Q1 = {1, 2, . . . , q1}, Q2 = {1, 2, . . . , q2}, and k = q1q2.
One can think of the prototypes as "buttons" arranged in a regular pattern. Intuitively, the SOM tries to bend the plane so that the buttons approximate the data points as well as possible. Once the model is fit, the observations can be mapped onto the two-dimensional grid.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 109 / 117

slide-110
SLIDE 110

Self-Organizing Maps


FIGURE 14.15. Simulated data in three classes, near the surface of a half-sphere.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 110 / 117

slide-111
SLIDE 111

Self-Organizing Maps

FIGURE 14.16. Self-organizing map applied to the half-sphere data example. The left panel is the initial configuration, the right panel the final one. The 5 × 5 grid of prototypes is indicated by circles, and the points that project to each prototype are plotted randomly within the corresponding circle.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 111 / 117

slide-112
SLIDE 112

FIGURE 14.17. Wiremesh representation of the fitted SOM model in R³. The lines represent the horizontal and vertical edges of the topological lattice. The double lines indicate that the surface was folded diagonally back on itself in order to model the red points. The cluster members have been jittered to indicate their color, and the purple points are the node centers.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 112 / 117

slide-113
SLIDE 113

Self-Organizing Maps

Key idea (Kohonen, 1982)

The observations x_i are processed one at a time. We find the closest prototype m_j to x_i in Euclidean distance in Rd, and then for all neighbors m_k of m_j, we move m_k toward x_i via the update
\[
m_k \leftarrow m_k + \alpha\, (x_i - m_k)
\]
The neighbors of m_j are defined to be all m_k such that the distance between ℓ_j and ℓ_k is small. α ∈ R is the learning rate and determines the scale of the step. The simplest approach uses Euclidean distance, and "small" is determined by a threshold ε. The neighborhood always includes the closest prototype itself.
Notice that the distance is defined in the space Q1 × Q2 of integer topological coordinates of the prototypes, rather than in the feature space Rd. The effect of the update is to move the prototypes closer to the data, but also to maintain a smooth two-dimensional spatial relationship between the prototypes.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 113 / 117

slide-114
SLIDE 114

Self-Organizing Maps

Practical considerations

The performance is influenced by the learning rate α and the distance threshold ε. Typically α is decreased from 1 to 0 over a few thousand iterations. Similarly, ε is decreased linearly from a starting value R to 1 over a few thousand iterations.
The description above refers to the simplest instance of SOM. In more advanced approaches, the update step is weighted by distance:
\[
m_k \leftarrow m_k + \alpha\, h(\|\ell_j - \ell_k\|)\,(x_i - m_k), \tag{73}
\]
where the neighborhood function h gives more weight to prototypes m_k with indices ℓ_k closer to ℓ_j than to those farther away.
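A bare-bones sketch of the online update on a q1 × q2 grid (illustrative only; the constant learning rate and fixed threshold are simplifications of the decreasing schedules described above, and all names are assumptions):

```python
import numpy as np

def som(X, q1=5, q2=5, n_iter=2000, alpha=0.5, eps=2.0):
    """X: (n, d) data. Returns the (q1*q2, d) prototypes and their grid coordinates."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    grid = np.array([(i, j) for i in range(q1) for j in range(q2)], dtype=float)  # l_j coordinates
    M = X[rng.choice(n, q1 * q2, replace=True)].copy()     # initialize prototypes from the data

    for t in range(n_iter):                                 # in practice, decay alpha and eps over t
        x = X[rng.integers(n)]                              # process one observation at a time
        j = np.argmin(np.linalg.norm(M - x, axis=1))        # closest prototype m_j in R^d
        neighbors = np.linalg.norm(grid - grid[j], axis=1) <= eps   # small distance on the grid
        M[neighbors] += alpha * (x - M[neighbors])          # m_k <- m_k + alpha (x_i - m_k)
    return M, grid
```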

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 114 / 117

slide-115
SLIDE 115

Self-Organizing Maps

Link to k-means

If ε is chosen so small that each neighborhood contains only one point, then the spatial connection between prototypes is lost. In that case, SOM is an online version of k-means, and converges to a local minimum of the k-means objective.
Since SOM is a constrained version of k-means, it is important to check whether the constraint is reasonable. One can do this by computing the reconstruction error ||x − m_j|| with respect to the closest prototype, across all points. This is necessarily smaller for k-means, but should not be much smaller if the SOM is a reasonable approximation.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 115 / 117

slide-116
SLIDE 116

FIGURE 14.18. Half-sphere data: reconstruction error for the SOM as a function of iteration. Error for k-means clustering is indicated by the horizontal line.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 116 / 117

slide-117
SLIDE 117

Self-Organizing Maps

Summary

Self-Organizing Maps are a variant of k-means for nonlinear dimensionality reduction and visualisation. As in k-means, points are iteratively assigned to cluster means. Unlike k-means, updates of cluster means are constrained and synchronized based on a user-defined lattice between cluster means.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 117 / 117