Geometric Data Analysis
Multidimensional Scaling
MAT 6480W / STT 6705V
Guy Wolf guy.wolf@umontreal.ca
Université de Montréal, Fall 2019
Outline
1. Multidimensional scaling (MDS): Gram matrix, double-centering, stress function
2. Distance metrics: Minkowski distances, Mahalanobis distance, Hamming distance
3. Similarities and dissimilarities: Gaussian affinities, cosine similarities, Jaccard index
4. Dynamic time-warp: comparing misaligned signals, computing DTW dissimilarity
5. Combining similarities
What if we cannot compute a covariance matrix?

Consider a k-dimensional rigid body: all we need to know are the distances between its parts. We can ignore its position and orientation and find the most "efficient" way to place it in $\mathbb{R}^k$.

Given an $m \times m$ matrix $D$ of distances between $m$ objects,
$$D = \begin{pmatrix} d_{11} & \cdots & d_{1j} & \cdots & d_{1m} \\ \vdots & & \vdots & & \vdots \\ d_{i1} & \cdots & d_{ij} & \cdots & d_{im} \\ \vdots & & \vdots & & \vdots \\ d_{m1} & \cdots & d_{mj} & \cdots & d_{mm} \end{pmatrix}$$
find k-dimensional coordinates $y_1, \ldots, y_m$ that preserve these distances:
$$\|y_i - y_j\| = d_{ij} = \|x_i - x_j\|.$$
Gram matrix
A distance matrix is not convenient to embed directly in $\mathbb{R}^k$, but embedding inner products is a simpler task.

Gram matrix
A matrix $G$ that contains inner products $g_{ij} = \langle x_i, x_j \rangle$ is a Gram matrix. Using the spectral theorem we can decompose $G = \Phi \Lambda \Phi^T$ and get
$$\langle x_i, x_j \rangle = g_{ij} = \sum_{q=1}^{m} \lambda_q \Phi[i,q]\,\Phi[j,q] = \left\langle \Phi[i,\cdot]\,\Lambda^{1/2},\ \Phi[j,\cdot]\,\Lambda^{1/2} \right\rangle.$$

Similar to PCA, we can truncate small eigenvalues and use the $k$ biggest eigenpairs.
Spectral embedding

Keeping the top eigenpairs $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_k > 0$ of $G$ with eigenvectors $\varphi_1, \varphi_2, \varphi_3, \ldots, \varphi_k$, each data point is embedded as
$$x \mapsto \Phi(x) = \left[\lambda_1^{1/2}\varphi_1(x),\ \lambda_2^{1/2}\varphi_2(x),\ \lambda_3^{1/2}\varphi_3(x),\ \ldots,\ \lambda_k^{1/2}\varphi_k(x)\right]^T.$$
Double-centering
Notice that given a distance metric that is equivalent to Euclidean distances, we can write
$$\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y \rangle.$$
But then:
$$\operatorname{mean}_x(\|x - y\|^2) = z^2 + \|y\|^2 - 2\langle z, y \rangle$$
$$\operatorname{mean}_y(\|x - y\|^2) = z^2 + \|x\|^2 - 2\langle x, z \rangle$$
$$\operatorname{mean}_{x,y}(\|x - y\|^2) = 2z^2 - 2\langle z, z \rangle$$
where $z$ and $z^2$ are the mean and mean squared norm of the data.
Thus, if we set
$$g(x, y) = -2^{-1}\big[\|x - y\|^2 - \operatorname{mean}_x(\|x - y\|^2) - \operatorname{mean}_y(\|x - y\|^2) + \operatorname{mean}_{x,y}(\|x - y\|^2)\big]$$
then
$$g(x, y) = (\langle x, y \rangle - \langle x, z \rangle) - (\langle z, y \rangle - \langle z, z \rangle) = \langle x - z,\ y - z \rangle.$$
Therefore, we can compute $G = -\frac{1}{2} J D^{(2)} J$, where $J = \operatorname{Id} - \frac{1}{m}\mathbf{1}\mathbf{1}^T$ and $D^{(2)}$ holds the squared distances.
Classic MDS
Classic MDS is computed with the following algorithm:

MDS algorithm
1. Formulate squared distances
2. Build the Gram matrix by double-centering
3. SVD (or eigendecomposition)
4. Assign coordinates based on eigenvalues and eigenvectors

Exercise: show that for centered data in Euclidean space this embedding is identical to PCA.
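The four steps translate almost line-for-line into numpy; the following is a minimal sketch (the function name and the toy usage at the end are illustrative, not from the slides):

```python
import numpy as np

def classical_mds(D, k):
    """Classical MDS: embed an m x m distance matrix D into k dimensions."""
    m = D.shape[0]
    D2 = D ** 2                                 # step 1: squared distances
    J = np.eye(m) - np.ones((m, m)) / m         # J = Id - (1/m) 1 1^T
    G = -0.5 * J @ D2 @ J                       # step 2: double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(G)        # step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1][:k]       # keep the k biggest eigenpairs
    lam, phi = eigvals[order], eigvecs[:, order]
    return phi * np.sqrt(np.maximum(lam, 0))    # step 4: coordinates Phi Lambda^(1/2)

# Toy usage: distances from random points in R^5, embedded into R^2.
X = np.random.randn(100, 5)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, 2)
```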
Stress function
What if we are not given a distance metric, but just dissimilarities?

Stress function
A function that quantifies the disagreement between given dissimilarities and embedded Euclidean distances.

Examples of stress functions
- Metric MDS stress: $\sqrt{\sum_{ij} (\hat{d}_{ij} - f(d_{ij}))^2}$, where $f$ is a predetermined monotonically increasing function
- Kruskal's stress-1: $\sqrt{\frac{\sum_{ij} (\hat{d}_{ij} - f(d_{ij}))^2}{\sum_{ij} \hat{d}_{ij}^2}}$, where $f$ is optimized, but still monotonically increasing
- Sammon's stress: $\left(\sum_{i<j} d_{ij}\right)^{-1} \sum_{i<j} \frac{(\hat{d}_{ij} - d_{ij})^2}{d_{ij}}$

Here $d_{ij}$ are the given dissimilarities and $\hat{d}_{ij}$ the embedded distances.
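A small numpy sketch of two of these stress functions (hypothetical helper names; `D` holds the given dissimilarities, `D_hat` the embedded distances, and the identity map stands in for f):

```python
import numpy as np

def kruskal_stress1(D, D_hat, f=lambda d: d):
    """Kruskal's stress-1 between dissimilarities D and embedded distances D_hat."""
    return np.sqrt(((D_hat - f(D)) ** 2).sum() / (D_hat ** 2).sum())

def sammon_stress(D, D_hat):
    """Sammon's stress; sums run over the upper triangle (i < j), with d_ij > 0."""
    iu = np.triu_indices_from(D, k=1)
    d, d_hat = D[iu], D_hat[iu]
    return (1.0 / d.sum()) * (((d_hat - d) ** 2) / d).sum()
```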
Non-metric MDS
Non-metric, or non-classical, MDS is computed by the following algorithm:

Non-metric MDS algorithm
1. Formulate a dissimilarity matrix $D$.
2. Find an initial configuration (e.g., using classical MDS) with distance matrix $\hat{D}$.
3. Minimize $\mathrm{STRESS}_D(f, \hat{D})$ by optimizing the fitting function $f$.
4. Minimize $\mathrm{STRESS}_D(f, \hat{D})$ by optimizing the configuration and resulting $\hat{D}$.
5. Iterate the previous two steps until the stress is lower than a stopping threshold.
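A bare-bones sketch of the configuration-optimization step (step 4), assuming `scipy` is available and freezing f to the identity; a full implementation would also re-fit a monotone f in step 3, e.g., by isotonic regression:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def nonmetric_mds_step(D, Y0):
    """One configuration-optimization pass: move the m x k configuration Y0
    to reduce stress-1 against D. f is frozen to the identity here."""
    m, k = Y0.shape

    def stress(y_flat):
        D_hat = squareform(pdist(y_flat.reshape(m, k)))  # embedded distances
        return np.sqrt(((D_hat - D) ** 2).sum() / (D_hat ** 2).sum())

    res = minimize(stress, Y0.ravel(), method="L-BFGS-B")
    return res.x.reshape(m, k)
```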
Metric spaces
Consider a dataset $X$ as an arbitrary collection of data points.

Distance metric
A distance metric is a function $d : X \times X \to [0, \infty)$ that satisfies three conditions for any $x, y, z \in X$:
1. $d(x, y) = 0 \Leftrightarrow x = y$
2. $d(x, y) = d(y, x)$
3. $d(x, y) \leq d(x, z) + d(z, y)$

The set $X$ of data points together with an appropriate distance metric $d(\cdot, \cdot)$ is called a metric space.
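For a finite dataset, the three conditions can be checked directly on a distance matrix; a small numpy sketch (hypothetical helper, vectorized over all triples for the triangle inequality):

```python
import numpy as np

def is_metric(D, tol=1e-12):
    """Check the three metric axioms on a finite m x m distance matrix D."""
    n = len(D)
    off_diag = ~np.eye(n, dtype=bool)
    identity = np.allclose(np.diag(D), 0) and np.all(D[off_diag] > 0)  # d(x,y)=0 iff x=y
    symmetry = np.allclose(D, D.T)                                     # d(x,y)=d(y,x)
    # Triangle inequality over all triples: D[i,j] <= D[i,k] + D[k,j].
    triangle = np.all(D[:, :, None] <= D[:, None, :] + D.T[None, :, :] + tol)
    return bool(identity and symmetry and triangle)
```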
Euclidean distance
When $X \subset \mathbb{R}^n$ we can consider Euclidean distances:

Euclidean distance
The distance between $x, y \in X$ is defined by $\|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (x[i] - y[i])^2}$.

- One of the classic, most common distance metrics
- Often inappropriate in realistic settings without proper preprocessing & feature extraction
- Also used for least mean square error optimizations
- Proximity requires all attributes to have equally small differences
Manhattan distances
Manhattan distance
The Manhattan distance between $x, y \in X$ is defined by $\|x - y\|_1 = \sum_{i=1}^{n} |x[i] - y[i]|$. This distance is also called taxicab or cityblock distance.

[Figure taken from Wikipedia]
Minkowski (ℓp) distance
Minkowski distance
The Minkowski distance between $x, y \in X \subset \mathbb{R}^n$ is defined by $\|x - y\|_p^p = \sum_{i=1}^{n} |x[i] - y[i]|^p$ for some $p > 0$. This is also called the $\ell_p$ distance.

Three popular Minkowski distances are:
- $p = 1$, Manhattan distance: $\|x - y\|_1 = \sum_{i=1}^{n} |x[i] - y[i]|$
- $p = 2$, Euclidean distance: $\|x - y\|_2 = \sqrt{\sum_{i=1}^{n} |x[i] - y[i]|^2}$
- $p = \infty$, supremum/$\ell_{\max}$ distance: $\|x - y\|_\infty = \sup_{1 \leq i \leq n} |x[i] - y[i]|$
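A minimal numpy sketch covering all three cases (hypothetical helper; `np.inf` selects the supremum distance):

```python
import numpy as np

def minkowski(x, y, p):
    """l_p (Minkowski) distance between x and y; p = np.inf gives the supremum."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return diff.max() if np.isinf(p) else (diff ** p).sum() ** (1.0 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(x, y, 1), minkowski(x, y, 2), minkowski(x, y, np.inf))  # 3.0 ~2.236 2.0
```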
Normalization & standardization
Minkowski distances require normalization to deal with varying magnitudes, scaling, distributions, or measurement units.

Min-max normalization
$\operatorname{minmax}(x)[i] = \frac{x[i] - m_i}{r_i}$, where $m_i$ and $r_i$ are the min value and range of attribute $i$.

Z-score standardization
$\operatorname{zscore}(x)[i] = \frac{x[i] - \mu_i}{\sigma_i}$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of attribute $i$.

Log attenuation
$\operatorname{logatt}(x)[i] = \operatorname{sgn}(x[i]) \log(|x[i]| + 1)$
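Minimal numpy sketches of the three transformations, applied per attribute (column) of a data matrix; they assume non-constant attributes so the denominators are nonzero:

```python
import numpy as np

def minmax(X):
    """Min-max normalization of each attribute (column) of X."""
    m = X.min(axis=0)
    return (X - m) / (X.max(axis=0) - m)

def zscore(X):
    """Z-score standardization of each attribute (column) of X."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def logatt(X):
    """Log attenuation, applied elementwise."""
    return np.sign(X) * np.log(np.abs(X) + 1)
```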
Mahalanobis distance
Mahalanobis distances
The Mahalanobis distance is defined by $\operatorname{mahal}(x, y) = \sqrt{(x - y)\,\Sigma^{-1}\,(x - y)^T}$, where $\Sigma$ is the covariance matrix of the data and data points are represented as row vectors.

When all attributes are independent with unit standard deviation (e.g., z-scored), then $\Sigma = \operatorname{Id}$ and we get the Euclidean distance. When all attributes are independent with variances $\sigma_i^2$, then $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ and we get $\operatorname{mahal}(x, y) = \sqrt{\sum_{i=1}^{n} \left(\frac{x[i] - y[i]}{\sigma_i}\right)^2}$, which is the Euclidean distance between z-scored data points.
Example
With $\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$, $x = (0, 1)$, $y = (0.5, 0.5)$, and $z = (1.5, 1.5)$, we get $\operatorname{mahal}(x, y)^2 = 5$ and $\operatorname{mahal}(y, z)^2 = 4$.
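A short numpy sketch (hypothetical helper) that also reproduces the example's numbers; the printed values are the squared distances reported above:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between row vectors x and y under covariance cov."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

cov = np.array([[0.3, 0.2], [0.2, 0.3]])
x, y, z = np.array([0.0, 1.0]), np.array([0.5, 0.5]), np.array([1.5, 1.5])
print(mahalanobis(x, y, cov) ** 2)  # 5.0
print(mahalanobis(y, z, cov) ** 2)  # 4.0
```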
Hamming distance
When the data contains nominal values, we can use Hamming distances:

Hamming distance
The Hamming distance is defined as $\operatorname{hamm}(x, y) = \sum_{i=1}^{n} \mathbf{1}[x[i] \neq y[i]]$ for data points $x, y$ that contain $n$ nominal attributes. This distance is equivalent to the $\ell_1$ distance with a binary flag representation.

Example
If x = ('big', 'black', 'cat'), y = ('small', 'black', 'rat'), and z = ('big', 'blue', 'bulldog'), then hamm(x, y) = hamm(x, z) = 2 and hamm(y, z) = 3.
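A minimal sketch (hypothetical helper) that reproduces the example:

```python
def hamming(x, y):
    """Count the attributes on which x and y disagree."""
    return sum(a != b for a, b in zip(x, y))

x = ('big', 'black', 'cat')
y = ('small', 'black', 'rat')
z = ('big', 'blue', 'bulldog')
print(hamming(x, y), hamming(x, z), hamming(y, z))  # 2 2 3
```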
Similarities / affinities
Similarities or affinities quantify whether, or how much, data points are similar.
Similarity/affinity measure
We will consider a similarity or affinity measure as a function $a : X \times X \to [0, 1]$ such that for every $x, y \in X$:
- $a(x, x) = a(y, y) = 1$
- $a(x, y) = a(y, x)$

Dissimilarities quantify the opposite notion, and typically take values in $[0, \infty)$, although they are sometimes normalized to finite ranges. Distances can serve as a way to measure dissimilarities.
Simple similarity measures
Correlation
Gaussian affinities
Given a distance metric $d(x, y)$, we can use it to formulate Gaussian affinities.

Gaussian affinities
Gaussian affinities are defined as $k(x, y) = \exp\left(-\frac{d(x, y)^2}{2\varepsilon}\right)$ given a distance metric $d$. Essentially, data points are similar if they are within the same spherical neighborhood w.r.t. the distance metric, whose radius is determined by $\varepsilon$. For Euclidean distances they are also known as RBF (radial basis function) affinities.
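A one-line numpy sketch (hypothetical helper; the Euclidean norm is the default stand-in for d):

```python
import numpy as np

def gaussian_affinity(x, y, eps, d=lambda u, v: np.linalg.norm(u - v)):
    """k(x, y) = exp(-d(x, y)^2 / (2 * eps)); Euclidean d gives the RBF affinity."""
    return np.exp(-d(x, y) ** 2 / (2 * eps))
```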
Cosine similarities

Another similarity metric in Euclidean space is based on the inner product (i.e., dot product) $\langle x, y \rangle = \|x\| \|y\| \cos(\angle xy)$.

Cosine similarities
The cosine similarity between $x, y \in X \subset \mathbb{R}^n$ is defined as $\cos(x, y) = \frac{\langle x, y \rangle}{\|x\| \, \|y\|}$.
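A minimal numpy sketch (hypothetical helper; assumes nonzero vectors):

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(x, y) = <x, y> / (||x|| ||y||)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```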
Jaccard index
For data with $n$ binary attributes we consider two similarity metrics:

Simple matching coefficient
$\operatorname{SMC}(x, y) = \frac{\sum_{i=1}^{n} x[i] \wedge y[i] \,+\, \sum_{i=1}^{n} \neg x[i] \wedge \neg y[i]}{n}$

Jaccard coefficient
$J(x, y) = \frac{\sum_{i=1}^{n} x[i] \wedge y[i]}{\sum_{i=1}^{n} x[i] \vee y[i]}$

The Jaccard coefficient can be extended to continuous attributes:

Tanimoto (extended Jaccard) coefficient
$T(x, y) = \frac{\langle x, y \rangle}{\|x\|^2 + \|y\|^2 - \langle x, y \rangle}$
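Minimal numpy sketches of the three coefficients (hypothetical helpers; SMC and Jaccard assume binary vectors):

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient of two binary vectors."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    return ((x & y).sum() + (~x & ~y).sum()) / len(x)

def jaccard(x, y):
    """Jaccard coefficient of two binary vectors."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    return (x & y).sum() / (x | y).sum()

def tanimoto(x, y):
    """Tanimoto (extended Jaccard) coefficient of two continuous vectors."""
    xy = np.dot(x, y)
    return xy / (np.dot(x, x) + np.dot(y, y) - xy)
```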
Comparing misaligned signals

Theoretically: use a time offset to align the signals.
Realistically: which offset should we use?
Adaptive alignment
Computing DTW dissimilarity

Given two signals x and y, build a pairwise difference matrix whose cell $[i, j]$ holds the difference $x[i] - y[j]$ between two signal entries. An alignment path gets from the start of this matrix to its end. Some alternatives:
- 1:1 alignment: trivial; nothing is modified by the alignment. Aligned distance: $\|x - y\|^2$.
- Time offset: works sometimes, but is not always optimal.
- Extreme offset: complete misalignment, the worst alignment alternative. Aligned distance: $\|x\|^2 + \|y\|^2$.
- Optimal alignment: optimize the alignment by minimizing the aligned distance.
Dynamic programming algorithm
Dynamic Programming
A method for solving complex problems by breaking them down into simpler subproblems. Applicable to problems exhibiting overlapping subproblems and optimal substructure. It performs better than naive methods that do not utilize the subproblem overlap.
DTW algorithm
For each signal time i and each signal time j:
1. Set cost ← (x[i] − y[j])²
2. Set the optimal distance at stage [i, j] to DTW[i, j] ← cost + min{DTW[i, j−1], DTW[i−1, j−1], DTW[i−1, j]}

Optimal distance: DTW[m, n] (where m & n are the lengths of the signals).
Optimal alignment: backtrack the path leading to DTW[m, n] via the min-cost choices of the algorithm.
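The recursion translates directly into a small numpy implementation; a minimal sketch using the squared-difference cost above, with the boundary handled by an infinite border:

```python
import numpy as np

def dtw(x, y):
    """DTW dissimilarity between 1-D signals x and y with cost (x[i] - y[j])^2."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)   # infinite border handles the first row/column
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i, j - 1], D[i - 1, j - 1], D[i - 1, j])
    return D[m, n]   # backtracking the min-cost choices recovers the alignment path
```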
Remark about earth-mover distances (EMD)
What is the cost of transforming one distribution into another?
$$\operatorname{EMD}_p^p(x, y) = \min\Big\{ \sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^p\, \Omega_{ij} \;:\; \sum_{j=1}^{n} \Omega_{ij} = x[i] \ \wedge\ \sum_{i=1}^{n} \Omega_{ij} = y[j] \Big\}$$
where $\Omega$ is a moving strategy (transferring $\Omega_{ij}$ mass from $i$ to $j$). It can be solved with the Hungarian algorithm, but more efficient methods exist that rely on wavelets and mathematical analysis.
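In one dimension with p = 1 and unit-spaced bins, the minimization has a well-known closed form: the ℓ1 distance between cumulative sums. A sketch under those assumptions (hypothetical helper):

```python
import numpy as np

def emd_1d(x, y):
    """EMD_1 between two histograms on the same unit-spaced 1-D grid,
    assuming equal total mass: EMD_1 = sum(|cumsum(x) - cumsum(y)|)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    assert np.isclose(x.sum(), y.sum()), "distributions must have equal mass"
    return np.abs(np.cumsum(x) - np.cumsum(y)).sum()
```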
Combining similarities

To combine similarities of different attributes we can consider several alternatives:
1. Transform all the attributes to conform to the same similarity/distance metric.
2. Use a weighted average to combine similarities, $a(x, y) = \sum_{i=1}^{n} w_i a_i(x, y)$, or distances, $d^2(x, y) = \sum_{i=1}^{n} w_i d_i^2(x, y)$, with $\sum_{i=1}^{n} w_i = 1$.
3. Consider asymmetric attributes by defining binary flags $\delta_i(x, y) \in \{0, 1\}$ that mark whether two data points share comparable information in affinity $i$, and then combine only the comparable information by $a(x, y) = \frac{\sum_{i=1}^{n} w_i \delta_i(x, y)\, a_i(x, y)}{\sum_{i=1}^{n} \delta_i(x, y)}$.
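A minimal numpy sketch of alternatives 2 and 3 (hypothetical helper; `affinities` is a list of per-attribute functions a_i, and `flags` optionally supplies the δ_i):

```python
import numpy as np

def combined_affinity(x, y, affinities, weights, flags=None):
    """Combine per-attribute affinities a_i(x, y) with weights w_i (summing to 1).
    If flags (the delta_i) are given, average only over comparable attributes."""
    a = np.array([a_i(x, y) for a_i in affinities])
    w = np.asarray(weights, float)
    if flags is None:
        return float(w @ a)                        # a(x,y) = sum_i w_i a_i(x,y)
    d = np.array([d_i(x, y) for d_i in flags], float)
    return float((w * d * a).sum() / d.sum())      # assumes at least one delta_i = 1
```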
Summary

- Multidimensional scaling (MDS) provides an alternative to PCA based on relations/distances in the data.
- It aims to rigidly embed data in low dimensions while minimizing the distortion of its intrinsic structure.
- The classic (linear) approach uses the leading eigenvalues of a Gram matrix and the entries of the corresponding eigenvectors.
- Other approaches minimize stress between the original (dis)similarities in the data and the distances in the embedded coordinates.
- The MDS approach offers flexibility, since one can choose among many possible metrics (e.g., Euclidean, Mahalanobis, Hamming, Gaussian, cosine, Jaccard) to quantify relations in the data.
- The exact choice of metric can be adapted to both the task and the input data.