SLIDE 1

Geometric Data Analysis

Multidimensional Scaling

MAT 6480W / STT 6705V

Guy Wolf guy.wolf@umontreal.ca

Université de Montréal, Fall 2019

SLIDE 2

Outline

1 Multidimensional scaling (MDS): Gram matrix; Double-centering; Stress function

2 Distance metrics: Minkowski distances; Mahalanobis distance; Hamming distance

3 Similarities and dissimilarities: Gaussian affinities; Cosine similarities; Jaccard index

4 Dynamic time-warp: Comparing misaligned signals; Computing DTW dissimilarity

5 Combining similarities

SLIDE 3

Multidimensional scaling

What if we cannot compute a covariance matrix? Consider a k-dimensional rigid body: all we need to know are the distances between its parts. We can ignore its position and orientation and find the most “efficient” way to place it in ℝᵏ.

SLIDE 4

Multidimensional scaling

        ⎛  ·    ⋯   d_1j  ⋯   d_1m ⎞
        ⎜  ⋮    ⋱               ⋮  ⎟
D =     ⎜ d_i1       ⋱        d_im ⎟
        ⎜  ⋮               ⋱    ⋮  ⎟
        ⎝ d_m1  ⋯   d_mj  ⋯    ·  ⎠

Find y_1, …, y_m ∈ ℝᵏ such that ‖y_i − y_j‖ = d_ij = ‖x_i − x_j‖.

Multidimensional scaling

Given an m × m matrix D of distances between m objects, find k-dimensional coordinates that preserve these distances.

SLIDE 5

Multidimensional scaling

Gram matrix

A distance matrix is not convenient to embed directly in ℝᵏ, but embedding inner products is a simpler task.

Gram matrix

A matrix G that contains inner products g_ij = ⟨x_i, x_j⟩ is a Gram matrix. Using the spectral theorem we can decompose G = ΦΛΦᵀ and get

⟨x_i, x_j⟩ = g_ij = Σ_{q=1}^{m} λ_q Φ[i, q] Φ[j, q] = ⟨Φ[i, ·]Λ^{1/2}, Φ[j, ·]Λ^{1/2}⟩

Similar to PCA, we can truncate small eigenvalues and use the k biggest eigenpairs.

SLIDE 6-8

Multidimensional scaling

Spectral embedding

G = ΦΛΦᵀ with eigenvalues λ_1 ≥ λ_2 ≥ λ_3 ≥ ⋯ ≥ λ_k > 0 and corresponding eigenvectors φ_1, φ_2, φ_3, …, φ_k of the m × m Gram matrix. Each data point is then embedded as

x ↦ Φ(x) = [λ_1^{1/2} φ_1(x), λ_2^{1/2} φ_2(x), λ_3^{1/2} φ_3(x), …, λ_k^{1/2} φ_k(x)]ᵀ

SLIDE 9-12

Multidimensional scaling

Double-centering

Notice that given a distance metric that is equivalent to Euclidean distances, we can write:

‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩

But then:

mean_x(‖x − y‖²) = mean(‖z‖²) + ‖y‖² − 2⟨z̄, y⟩
mean_y(‖x − y‖²) = mean(‖z‖²) + ‖x‖² − 2⟨x, z̄⟩
mean_{x,y}(‖x − y‖²) = 2 mean(‖z‖²) − 2⟨z̄, z̄⟩

where z̄ and mean(‖z‖²) are the mean and mean squared norm of the data.

SLIDE 13

Multidimensional scaling

Double-centering

Thus, if we set

g(x, y) = −½ [ ‖x − y‖² − mean_x(‖x − y‖²) − mean_y(‖x − y‖²) + mean_{x,y}(‖x − y‖²) ]

we get a Gram matrix, since:

g(x, y) = (⟨x, y⟩ − ⟨x, z̄⟩) − (⟨z̄, y⟩ − ⟨z̄, z̄⟩) = ⟨x − z̄, y − z̄⟩

Therefore, we can compute G = −½ J D⁽²⁾ J, where J = Id − (1/m) 1 1ᵀ and D⁽²⁾ is the matrix of squared distances.

SLIDE 14

Multidimensional scaling

Classic MDS

Classic MDS is computed with the following algorithm:

MDS algorithm

1 Formulate squared distances
2 Build the Gram matrix by double-centering
3 Compute the SVD (or eigendecomposition)
4 Assign coordinates based on the eigenvalues and eigenvectors (see the sketch below)

Exercise: show that for centered data in Euclidean space this embedding is identical to PCA.
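To make the steps concrete, here is a minimal NumPy sketch of classical MDS, assuming a precomputed m × m distance matrix; the function name and test data are illustrative, not from the slides.

import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed m points in R^k from an m x m distance matrix D."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m   # centering matrix J = Id - (1/m) 1 1^T
    G = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    lam, Phi = np.linalg.eigh(G)          # eigendecomposition (ascending order)
    idx = np.argsort(lam)[::-1][:k]       # keep the k largest eigenpairs
    lam, Phi = np.clip(lam[idx], 0, None), Phi[:, idx]
    return Phi * np.sqrt(lam)             # row i is Phi[i, :] Lambda^{1/2}

# Distances of a planar configuration are reproduced up to rotation/reflection
X = np.random.default_rng(0).normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)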

SLIDE 15

Multidimensional scaling

Stress function

What if we are not given a distance metric, but just dissimilarities?

Stress function

A function that quantifies the disagreement between given dissimilarities and embedded Euclidean distances.

Examples

Stress functions

Metric MDS stress: Σ_{i<j} (d̂_ij − f(d_ij))² / Σ_{i<j} d_ij², where f is a predetermined monotonically increasing function

Kruskal’s stress-1: √( Σ_{i<j} (d̂_ij − f(d_ij))² / Σ_{i<j} d̂_ij² ), where f is optimized, but still monotonically increasing

Sammon’s stress: (Σ_{i<j} d_ij)⁻¹ Σ_{i<j} (d̂_ij − d_ij)² / d_ij
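As a sanity check, Kruskal’s stress-1 for a candidate embedding can be computed directly; a minimal sketch with an identity fitting function f (the helper name is illustrative):

import numpy as np

def kruskal_stress1(D, Y, f=lambda d: d):
    """Kruskal's stress-1 between dissimilarities D and an embedding Y."""
    iu = np.triu_indices(D.shape[0], k=1)        # pairs with i < j
    D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    num = np.sum((D_hat[iu] - f(D[iu])) ** 2)
    return np.sqrt(num / np.sum(D_hat[iu] ** 2))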

SLIDE 16

Multidimensional scaling

Non-metric MDS

Non-metric, or non-classical, MDS is computed by the following algorithm:

Non-metric MDS algorithm

1 Formulate a dissimilarity matrix D.
2 Find an initial configuration (e.g., using classical MDS) with distance matrix D̂.
3 Minimize STRESS_D(f, D̂) by optimizing the fitting function f.
4 Minimize STRESS_D(f, D̂) by optimizing the configuration and the resulting D̂.
5 Iterate the previous two steps until the stress is lower than a stopping threshold (see the sketch below).
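In practice this loop is typically delegated to a library; a brief sketch using scikit-learn’s MDS estimator (assuming scikit-learn is installed; the data here is illustrative):

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # dissimilarity matrix

# metric=False selects non-metric (Kruskal-style) MDS on precomputed dissimilarities
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
Y = nmds.fit_transform(D)
print(nmds.stress_)   # final stress of the fitted configuration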

SLIDE 17

Distance metrics

Metric spaces

Consider a dataset X as an arbitrary collection of data points

Distance metric

A distance metric is a function d : X × X → [0, ∞) that satisfies three conditions for any x, y, z ∈ X:

1 d(x, y) = 0 ⇔ x = y
2 d(x, y) = d(y, x)
3 d(x, y) ≤ d(x, z) + d(z, y)

The set X of data points together with an appropriate distance metric d(·, ·) is called a metric space.

SLIDE 18

Distance metrics

Euclidean distance

When X ⊂ ℝⁿ we can consider Euclidean distances:

Euclidean distance

The distance between x, y ∈ X is defined by ‖x − y‖₂ = √( Σ_{i=1}^{n} (x[i] − y[i])² )

One of the classic, most common distance metrics
Often inappropriate in realistic settings without proper preprocessing & feature extraction
Also used for least mean square error optimizations
Proximity requires all attributes to have equally small differences

SLIDE 19

Distance metrics

Manhattan distances

Manhattan distance

The Manhattan distance between x, y ∈ X is defined by ‖x − y‖₁ = Σ_{i=1}^{n} |x[i] − y[i]|. This distance is also called the taxicab or cityblock distance.

[Figure: grid illustration of taxicab paths, taken from Wikipedia]

SLIDE 20

Distance metrics

Minkowski (ℓp) distance

Minkowski distance

The Minkowski distance between x, y ∈ X ⊂ ℝⁿ is defined by ‖x − y‖ₚᵖ = Σ_{i=1}^{n} |x[i] − y[i]|ᵖ for some p > 0. This is also called the ℓp distance. Three popular Minkowski distances are:

p = 1 Manhattan distance: ‖x − y‖₁ = Σ_{i=1}^{n} |x[i] − y[i]|
p = 2 Euclidean distance: ‖x − y‖₂ = √( Σ_{i=1}^{n} |x[i] − y[i]|² )
p = ∞ Supremum/ℓmax distance: ‖x − y‖_∞ = sup_{1≤i≤n} |x[i] − y[i]|
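These cases map directly onto SciPy’s pairwise-distance metrics; a quick check (assuming scipy is available; the points are illustrative):

import numpy as np
from scipy.spatial.distance import cdist

x = np.array([[0.0, 0.0]])
y = np.array([[3.0, 4.0]])

print(cdist(x, y, metric="cityblock"))        # p = 1: 7.0
print(cdist(x, y, metric="euclidean"))        # p = 2: 5.0
print(cdist(x, y, metric="chebyshev"))        # p = ∞: 4.0
print(cdist(x, y, metric="minkowski", p=3))   # general ℓp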

SLIDE 21

Distance metrics

Normalization & standardization

Minkowski distances require normalization to deal with varying magnitudes, scaling, distribution or measurement units.

Min-max normalization

minmax(x)[i] = (x[i] − mᵢ) / rᵢ, where mᵢ and rᵢ are the min value and range of attribute i.

Z-score standardization

zscore(x)[i] = (x[i] − μᵢ) / σᵢ, where μᵢ and σᵢ are the mean and STD of attribute i.

log attenuation

logatt(x)[i] = sgn(x[i]) log(|x[i]| + 1)
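A minimal NumPy sketch of the three transforms, applied per attribute (column); the function names are illustrative:

import numpy as np

def minmax(X):
    m, r = X.min(axis=0), np.ptp(X, axis=0)          # per-attribute min and range
    return (X - m) / r

def zscore(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)      # per-attribute mean and STD

def logatt(X):
    return np.sign(X) * np.log(np.abs(X) + 1)        # sign-preserving log attenuation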

SLIDE 22-24

Distance metrics

Mahalanobis distance

Mahalanobis distances

The Mahalanobis distance is defined by mahal(x, y) = √( (x − y) Σ⁻¹ (x − y)ᵀ ), where Σ is the covariance matrix of the data and data points are represented as row vectors.

When all attributes are independent with unit standard deviation (e.g., z-scored), then Σ = Id and we get the Euclidean distance.

When all attributes are independent with variances σᵢ², then Σ = diag(σ₁², …, σₙ²) and we get mahal(x, y) = √( Σ_{i=1}^{n} ((x[i] − y[i]) / σᵢ)² ), which is the Euclidean distance between z-scored data points.
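A small SciPy sketch checking the z-score equivalence on (approximately) independent attributes; the data is illustrative:

import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 2.0, 0.5])   # independent attributes

VI = np.linalg.inv(np.cov(X, rowvar=False))   # SciPy expects the inverse covariance
x, y = X[0], X[1]

d_mahal = mahalanobis(x, y, VI)
d_zscore = np.linalg.norm((x - y) / X.std(axis=0, ddof=1))  # Euclidean on z-scores
print(d_mahal, d_zscore)   # nearly equal, since the sample covariance is ~diagonal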

SLIDE 25

Distance metrics

Mahalanobis distance

Example: for

Σ = ⎛ 0.3  0.2 ⎞
    ⎝ 0.2  0.3 ⎠

and the points x = (0, 1), y = (0.5, 0.5), z = (1.5, 1.5), the squared Mahalanobis distances are mahal²(x, y) = 5 and mahal²(y, z) = 4, so y is closer to z than to x under Σ, the opposite of the Euclidean ordering.

SLIDE 26

Distance metrics

Hamming distance

When the data contains nominal values, we can use Hamming distances:

Hamming distances

The Hamming distance is defined as hamm(x, y) = Σ_{i=1}^{n} 1{x[i] ≠ y[i]} for data points x, y that contain n nominal attributes. This distance is equivalent to the ℓ1 distance on a binary-flag representation.

Example

If x = (‘big’, ‘black’, ‘cat’), y = (‘small’, ‘black’, ‘rat’), and z = (‘big’, ‘blue’, ‘bulldog’), then hamm(x, y) = hamm(x, z) = 2 and hamm(y, z) = 3.
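The example can be verified in a couple of lines (a trivial sketch):

import numpy as np

x = np.array(["big", "black", "cat"])
y = np.array(["small", "black", "rat"])
z = np.array(["big", "blue", "bulldog"])

hamm = lambda a, b: int(np.sum(a != b))    # count of differing attributes
print(hamm(x, y), hamm(x, z), hamm(y, z))  # 2 2 3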

SLIDE 27

Similarities and dissimilarities

Similarities / affinities

Similarities or affinities quantify whether, or how much, data points are similar.

Similarity/affinity measure

We will consider a similarity or affinity measure as a function a : X × X → [0, 1] such that for every x, y ∈ X:

a(x, x) = a(y, y) = 1
a(x, y) = a(y, x)

Dissimilarities quantify the opposite notion, and typically take values in [0, ∞), although they are sometimes normalized to finite ranges. Distances can serve as a way to measure dissimilarities.

SLIDE 28

Similarities and dissimilarities

Simple similarity measures

SLIDE 29

Similarities and dissimilarities

Correlation

SLIDE 30

Similarities and dissimilarities

Gaussian affinities

Given a distance metric d(x, y), we can use it to formulate Gaussian affinities.

Gaussian affinities

Gaussian affinities are defined as k(x, y) = exp(−d(x, y)²/ε) given a distance metric d. Essentially, data points are similar if they are within the same spherical neighborhood w.r.t. the distance metric, whose radius is determined by ε. For Euclidean distances they are also known as RBF (radial basis function) affinities.
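A minimal NumPy sketch building the affinity matrix for Euclidean distances (the function name and ε value are illustrative):

import numpy as np

def gaussian_affinity(X, eps):
    """RBF affinity matrix K[i, j] = exp(-||x_i - x_j||^2 / eps)."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    return np.exp(-D2 / eps)

X = np.random.default_rng(0).normal(size=(50, 3))
K = gaussian_affinity(X, eps=1.0)   # symmetric, with ones on the diagonal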

SLIDE 31-34

Similarities and dissimilarities

Cosine similarities

Another similarity metric in Euclidean space is based on the inner product (i.e., dot product): ⟨x, y⟩ = ‖x‖ ‖y‖ cos(∠xy)

Cosine similarities

The cosine similarity between x, y ∈ X ⊂ ℝⁿ is defined as cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖)

[Figure: vectors at increasing angles, illustrating decreasing cosine similarity]
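A one-function sketch (the values are illustrative):

import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity <x, y> / (||x|| ||y||)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707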

SLIDE 35

Similarities and dissimilarities

Jaccard index

For data with n binary attributes we consider two similarity metrics:

Simple matching coefficient

SMC(x, y) = ( Σ_{i=1}^{n} x[i]∧y[i] + Σ_{i=1}^{n} ¬x[i]∧¬y[i] ) / n

Jaccard coefficient

J(x, y) = Σ_{i=1}^{n} x[i]∧y[i] / Σ_{i=1}^{n} x[i]∨y[i]

The Jaccard coefficient can be extended to continuous attributes:

Tanimoto (extended Jaccard) coefficient

T(x, y) = ⟨x, y⟩ / ( ‖x‖² + ‖y‖² − ⟨x, y⟩ )
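A minimal sketch of the three coefficients on boolean/continuous vectors (the helper names are illustrative):

import numpy as np

def smc(x, y):
    return np.mean(x == y)                     # matches (including 0-0) over n

def jaccard(x, y):
    return np.sum(x & y) / np.sum(x | y)       # 1-1 matches over any-1 positions

def tanimoto(x, y):
    xy = np.dot(x, y)
    return xy / (np.dot(x, x) + np.dot(y, y) - xy)

x = np.array([1, 0, 1, 1, 0], dtype=bool)
y = np.array([1, 0, 0, 1, 1], dtype=bool)
print(smc(x, y), jaccard(x, y))                # 0.6 0.5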

SLIDE 36-39

Dynamic time-warp

Comparing misaligned signals

Theoretically: use a time offset to align the signals.
Realistically: which offset to use?

[Figure: two similar but misaligned signals, with several candidate alignment offsets]

SLIDE 40-42

Dynamic time-warp

Adaptive alignment

[Figure: adaptive, non-uniform alignment progressively warping one signal onto the other]

SLIDE 43

Dynamic time-warp

Computing DTW dissimilarity

[Figure: pairwise difference matrix between signals x and y]

Pairwise diff. matrix: each cell (i, j) holds the difference x[i] − y[j] between two signal entries.

SLIDE 44

Dynamic time-warp

Computing DTW dissimilarity

[Figure: an alignment path through the pairwise difference matrix]

Alignment path: get from start to end of both signals.

SLIDE 45

Dynamic time-warp

Computing DTW dissimilarity

[Figure: the diagonal 1:1 alignment path]

1:1 alignment: trivial, nothing is modified by the alignment. Aligned distance² = ‖x − y‖²

SLIDE 46

Dynamic time-warp

Computing DTW dissimilarity

[Figure: an offset-shifted alignment path]

Time offset: works sometimes, but not always optimal. Aligned distance² = ?

SLIDE 47

Dynamic time-warp

Computing DTW dissimilarity

[Figure: an alignment path along the matrix boundary]

Extreme offset: complete misalignment, the worst alignment alternative. Aligned distance² = ‖x‖² + ‖y‖²

SLIDE 48

Dynamic time-warp

Computing DTW dissimilarity

[Figure: the minimal-cost alignment path]

Optimal alignment: optimize the alignment by minimizing the aligned distance. Aligned distance² = min over all alignment paths.

SLIDE 49

Dynamic time-warp

Dynamic programming algorithm

Dynamic Programming

A method for solving complex problems by breaking them down into simpler subproblems. Applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure. Gives better performance than naive methods that do not exploit the subproblem overlap.

SLIDE 50

Dynamic time-warp

Dynamic programming algorithm

DTW Algorithm:

For each signal-time i and each signal-time j:
  Set cost ← (x[i] − y[j])²
  Set the optimal distance at stage [i, j] to:
    DTW[i, j] ← cost + min{ DTW[i, j−1], DTW[i−1, j−1], DTW[i−1, j] }

Optimal distance: DTW[m, n] (where m & n are the lengths of the signals).
Optimal alignment: backtrack the path leading to DTW[m, n] via the min-cost choices of the algorithm. A runnable sketch follows.
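A direct NumPy translation of the recurrence (a minimal sketch; the boundary row and column are initialized to ∞ so every path must start at [0, 0]):

import numpy as np

def dtw(x, y):
    """DTW dissimilarity between 1-D signals x and y via dynamic programming."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i, j - 1],       # repeat x[i]
                                 D[i - 1, j - 1],   # advance both
                                 D[i - 1, j])       # repeat y[j]
    return D[m, n]

# A time-shifted sine is far closer under DTW than under the 1:1 alignment
t = np.linspace(0, 2 * np.pi, 100)
x, y = np.sin(t), np.sin(t + 0.5)
print(dtw(x, y), np.sum((x - y) ** 2))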

SLIDE 51

Dynamic time-warp

Remark about earth-mover distances (EMD)

What is the cost of transforming one distribution to another?

EMDₚᵖ(x, y) = min{ Σ_{i=1}^{n} Σ_{j=1}^{n} |i − j|ᵖ Ω_ij : Σ_{j=1}^{n} Ω_ij = x[i] ∧ Σ_{i=1}^{n} Ω_ij = y[j] }

where Ω is a moving strategy (transferring Ω_ij mass from i to j). Can be solved with the Hungarian algorithm, but more efficient methods exist that rely on wavelets and mathematical analysis.
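For p = 1 on one-dimensional histograms this coincides with SciPy’s Wasserstein distance; a quick check (the bin positions and weights are illustrative):

import numpy as np
from scipy.stats import wasserstein_distance

x = np.array([0.5, 0.3, 0.2, 0.0])   # two histograms over the same n bins
y = np.array([0.0, 0.2, 0.3, 0.5])
bins = np.arange(len(x))

# EMD with ground cost |i - j| (p = 1) between the two distributions
print(wasserstein_distance(bins, bins, u_weights=x, v_weights=y))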

SLIDE 52

Combining similarities

To combine similarities of different attributes we can consider several alternatives:

1 Transform all the attributes to conform to the same similarity/distance metric.

2 Use a weighted average to combine similarities, a(x, y) = Σ_{i=1}^{n} wᵢ aᵢ(x, y), or distances, d²(x, y) = Σ_{i=1}^{n} wᵢ dᵢ²(x, y), with Σ_{i=1}^{n} wᵢ = 1.

3 Consider asymmetric attributes by defining binary flags δᵢ(x, y) ∈ {0, 1} that mark whether two data points share comparable information in affinity i, and then combine only the comparable information by a(x, y) = Σ_{i=1}^{n} wᵢ δᵢ(x, y) aᵢ(x, y) / Σ_{i=1}^{n} δᵢ(x, y). A sketch of options 2 and 3 follows.
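A minimal sketch of options 2 and 3 on stacked per-attribute affinity matrices (the shapes and the helper are illustrative):

import numpy as np

def combine_similarities(A, w, delta=None):
    """A: (n, m, m) per-attribute similarities; w: weights summing to 1;
    delta: optional (n, m, m) binary comparability flags (option 3)."""
    w = np.asarray(w)[:, None, None]
    if delta is None:
        return np.sum(w * A, axis=0)                              # weighted average
    return np.sum(w * delta * A, axis=0) / np.sum(delta, axis=0)  # comparable info only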

SLIDE 53

Summary

Multidimensional scaling (MDS) provides an alternative to PCA based on relations/distances in the data. It aims to rigidly embed data in low dimensions while minimizing the distortion of its intrinsic structure. The classic (linear) approach uses the leading eigenvalues of a Gram matrix and the entries of the corresponding eigenvectors. Other approaches minimize the stress between the original (dis)similarities in the data and the distances in the embedded coordinates. The MDS approach offers flexibility, since one can choose among many possible metrics (e.g., Euclidean, Mahalanobis, Hamming, Gaussian, Cosine, Jaccard) to quantify relations in the data. The exact choice of metric can be adapted to both the task and the input data.