Distances & Similarities, CPSC/AMTH 445a/545a, Guy Wolf

Introduction to Data Mining Distances & Similarities CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Distances & Similarities Yale - Fall 2016 1 / 22

Outline
1. Distance metrics
   Minkowski distances
   Euclidean distance
   Manhattan distance
   Normalization & standardization
   Mahalanobis distance
   Hamming distance
2. Similarities and dissimilarities
   Correlation
   Gaussian affinities
   Cosine similarities
   Jaccard index
3. Dynamic time-warp
   Comparing misaligned signals
   Computing DTW dissimilarity
4. Combining similarities

Distance metrics: Metric spaces
Consider a dataset $X$ as an arbitrary collection of data points.
Distance metric: A distance metric is a function $d : X \times X \to [0, \infty)$ that satisfies three conditions for any $x, y, z \in X$:
1. $d(x, y) = 0 \Leftrightarrow x = y$
2. $d(x, y) = d(y, x)$
3. $d(x, y) \le d(x, z) + d(z, y)$
The set $X$ of data points together with an appropriate distance metric $d(\cdot, \cdot)$ is called a metric space.

Distance metrics: Euclidean distance
When $X \subset \mathbb{R}^n$ we can consider Euclidean distances.
Euclidean distance: The distance between $x, y \in X$ is defined by $\|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (x[i] - y[i])^2}$.
- One of the classic and most common distance metrics
- Often inappropriate in realistic settings without proper preprocessing & feature extraction
- Also used for least mean square error optimizations
- Proximity requires all attributes to have equally small differences

Distance metrics: Manhattan distance
Manhattan distance: The Manhattan distance between $x, y \in X$ is defined by $\|x - y\|_1 = \sum_{i=1}^{n} |x[i] - y[i]|$.
This distance is also called taxicab or cityblock distance.
(Figure taken from Wikipedia.)

Distance metrics: Minkowski ($\ell_p$) distance
Minkowski distance: The Minkowski distance between $x, y \in X \subset \mathbb{R}^n$ is defined by $\|x - y\|_p = \left( \sum_{i=1}^{n} |x[i] - y[i]|^p \right)^{1/p}$ for some $p > 0$ (the triangle inequality, and hence the metric property, holds only for $p \ge 1$). This is also called the $\ell_p$ distance.
Three popular Minkowski distances are:
- $p = 1$, Manhattan distance: $\|x - y\|_1 = \sum_{i=1}^{n} |x[i] - y[i]|$
- $p = 2$, Euclidean distance: $\|x - y\|_2 = \sqrt{\sum_{i=1}^{n} |x[i] - y[i]|^2}$
- $p = \infty$, supremum / $\ell_{\max}$ distance: $\|x - y\|_\infty = \sup_{1 \le i \le n} |x[i] - y[i]|$
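The three special cases can be sketched in a few lines of plain Python. The function name `minkowski` and the sample points are illustrative choices, not part of the lecture; this is a minimal sketch rather than a reference implementation.

```python
import math

def minkowski(x, y, p):
    """l_p distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # p = infinity: the largest coordinate-wise difference
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical sample points: coordinate differences are 4 and 3
x = (0.0, 3.0)
y = (4.0, 0.0)

manhattan = minkowski(x, y, 1)         # 4 + 3
euclidean = minkowski(x, y, 2)         # sqrt(16 + 9)
supremum = minkowski(x, y, math.inf)   # max(4, 3)
print(manhattan, euclidean, supremum)
```

For the same pair of points the distance shrinks as p grows, since larger p gives more weight to the single largest coordinate difference.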

Distance metrics: Normalization & standardization
Minkowski distances require normalization to deal with varying magnitudes, scaling, distribution or measurement units.
Min-max normalization: $\mathrm{minmax}(x)[i] = \frac{x[i] - m_i}{r_i}$, where $m_i$ and $r_i$ are the min value and range of attribute $i$.
Z-score standardization: $\mathrm{zscore}(x)[i] = \frac{x[i] - \mu_i}{\sigma_i}$, where $\mu_i$ and $\sigma_i$ are the mean and STD of attribute $i$.
Log attenuation: $\mathrm{logatt}(x)[i] = \mathrm{sgn}(x[i]) \log(|x[i]| + 1)$
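The three transformations can be sketched per attribute as follows. Each function takes the list of values of one attribute across the dataset; the function names, the use of the population standard deviation, and the natural logarithm are assumptions made for this illustration.

```python
import math

def minmax(column):
    """Rescale one attribute to [0, 1] using its min and range."""
    m = min(column)
    r = max(column) - m
    if r == 0:
        raise ValueError("attribute is constant; range is zero")
    return [(v - m) / r for v in column]

def zscore(column):
    """Center one attribute and divide by its standard deviation.
    Assumption: population STD (divide by n, not n - 1)."""
    n = len(column)
    mu = sum(column) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / n)
    if sigma == 0:
        raise ValueError("attribute is constant; STD is zero")
    return [(v - mu) / sigma for v in column]

def logatt(column):
    """Sign-preserving log attenuation: sgn(v) * log(|v| + 1)."""
    return [math.copysign(math.log(abs(v) + 1), v) for v in column]

# Hypothetical attribute values
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
print(minmax(heights))
print(zscore(heights))
print(logatt([-1000.0, -1.0, 0.0, 1.0, 1000.0]))
```

Min-max and z-score depend on statistics of the whole attribute, whereas log attenuation is applied to each value independently.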

Distance metrics: Mahalanobis distance
Mahalanobis distance: The Mahalanobis distance is defined by $\mathrm{mahal}(x, y) = \sqrt{(x - y) \Sigma^{-1} (x - y)^T}$, where $\Sigma$ is the covariance matrix of the data and data points are represented as row vectors.
When all attributes are independent with unit standard deviation (e.g., z-scored), then $\Sigma = \mathrm{Id}$ and we get the Euclidean distance.

When all attributes are independent with variances $\sigma_i^2$, then $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ and we get $\mathrm{mahal}(x, y) = \sqrt{\sum_{i=1}^{n} \left( \frac{x[i] - y[i]}{\sigma_i} \right)^2}$, which is the Euclidean distance between z-scored data points.
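The diagonal case can be checked numerically. The sketch below assumes hypothetical variances and points; it compares the Mahalanobis distance under a diagonal covariance with the Euclidean distance after dividing each coordinate by its standard deviation (the mean subtraction of a z-score cancels in the difference x - y, so it is omitted).

```python
import math

def mahal_diag(x, y, variances):
    """Mahalanobis distance when Sigma = diag(variances)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def scale_by_std(x, variances):
    """Divide each coordinate by its standard deviation."""
    return [a / math.sqrt(v) for a, v in zip(x, variances)]

# Hypothetical data: the second attribute has a much larger spread
variances = (4.0, 100.0)
x = (1.0, 10.0)
y = (3.0, 40.0)

d_mahal = mahal_diag(x, y, variances)
d_scaled = euclidean(scale_by_std(x, variances), scale_by_std(y, variances))
print(d_mahal, d_scaled)  # both equal sqrt(10) up to rounding
```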

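Distance metrics: Mahalanobis distance (example)
$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$, $x = (0, 1)$, $y = (0.5, 0.5)$, $z = (1.5, 1.5)$
$d(x, y) = 5$, $d(y, z) = 4$
These are the values of the quadratic form $(u - v) \Sigma^{-1} (u - v)^T$; taking the square root as in the definition gives $\mathrm{mahal}(x, y) = \sqrt{5} \approx 2.24$ and $\mathrm{mahal}(y, z) = 2$. In Euclidean distance $x$ is closer to $y$ than $z$ is, but in Mahalanobis distance it is farther.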
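The example values can be reproduced with a small sketch for the 2-by-2 case. The helper names `inverse_2x2` and `mahal_squared` are illustrative; the covariance matrix and points are those of the slide.

```python
import math

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahal_squared(x, y, sigma_inv):
    """Quadratic form (x - y) Sigma^{-1} (x - y)^T."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * sigma_inv[i][j] * diff[j]
               for i in range(n) for j in range(n))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

sigma = [[0.3, 0.2], [0.2, 0.3]]
sigma_inv = inverse_2x2(sigma)   # approximately [[6, -4], [-4, 6]]

x, y, z = (0.0, 1.0), (0.5, 0.5), (1.5, 1.5)

q_xy = mahal_squared(x, y, sigma_inv)   # approximately 5
q_yz = mahal_squared(y, z, sigma_inv)   # approximately 4
print(q_xy, math.sqrt(q_xy))
print(q_yz, math.sqrt(q_yz))
print(euclidean(x, y), euclidean(y, z))
```

The pair (y, z) lies along the direction in which the attributes are positively correlated, so its Mahalanobis distance is smaller even though its Euclidean distance is larger.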

Distance metrics: Hamming distance
When the data contains nominal values, we can use Hamming distances.
Hamming distance: The Hamming distance is defined as $\mathrm{hamm}(x, y) = \sum_{i=1}^{n} \mathbb{1}[x[i] \neq y[i]]$ for data points $x, y$ that contain $n$ nominal attributes, i.e., the number of attributes on which the two points differ.
This distance is equivalent to the $\ell_1$ distance with binary flag representation.
Example: If x = ('big', 'black', 'cat'), y = ('small', 'black', 'rat'), and z = ('big', 'blue', 'bulldog'), then hamm(x, y) = hamm(x, z) = 2 and hamm(y, z) = 3.
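The example can be reproduced directly. The sketch below also illustrates the link to the l1 distance under one specific assumption about the "binary flag representation", namely one-hot encoding of each nominal attribute; under that encoding each mismatch flips two flags, so the l1 distance is twice the Hamming distance. The helper names are illustrative.

```python
def hamming(x, y):
    """Number of positions at which two equal-length tuples differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(a != b for a, b in zip(x, y))

def one_hot(point, vocabularies):
    """Concatenate one binary flag per possible value of each attribute."""
    flags = []
    for value, vocab in zip(point, vocabularies):
        flags.extend(1 if value == v else 0 for v in vocab)
    return flags

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

x = ('big', 'black', 'cat')
y = ('small', 'black', 'rat')
z = ('big', 'blue', 'bulldog')

print(hamming(x, y), hamming(x, z), hamming(y, z))

# Values observed for each attribute across the three points
vocabularies = [sorted({p[i] for p in (x, y, z)}) for i in range(3)]
l1_xy = l1(one_hot(x, vocabularies), one_hot(y, vocabularies))
print(l1_xy)  # twice hamming(x, y) under one-hot encoding
```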

Similarities and dissimilarities: Similarities / affinities
Similarities or affinities quantify whether, or how much, data points are similar.
Similarity/affinity measure: We will consider a similarity or affinity measure as a function $a : X \times X \to [0, 1]$ such that for every $x, y \in X$:
- $a(x, x) = a(y, y) = 1$
- $a(x, y) = a(y, x)$
Dissimilarities quantify the opposite notion, and typically take values in $[0, \infty)$, although they are sometimes normalized to finite ranges.
Distances can serve as a way to measure dissimilarities.

Similarities and dissimilarities: Simple similarity measures

Similarities and dissimilarities: Correlation

Similarities and dissimilarities: Gaussian affinities
Given a distance metric $d(x, y)$, we can use it to formulate Gaussian affinities.
Gaussian affinities: Gaussian affinities are defined as $k(x, y) = \exp\left( -\frac{d(x, y)^2}{2\varepsilon} \right)$ given a distance metric $d$.
Essentially, data points are similar if they are within the same spherical neighborhoods w.r.t. the distance metric, whose radius is determined by $\varepsilon$.
For Euclidean distances they are also known as RBF (radial basis function) affinities.
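A minimal sketch of the Gaussian affinity, assuming the kernel has the form exp(-d(x, y)^2 / (2 eps)) and using the Euclidean distance by default. The parameter name `eps` and the sample points are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gaussian_affinity(x, y, eps, dist=euclidean):
    """exp(-d(x, y)^2 / (2 * eps)) for a given distance function."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.exp(-dist(x, y) ** 2 / (2.0 * eps))

# Hypothetical points at Euclidean distance 1 from each other
x = (0.0, 0.0)
y = (1.0, 0.0)

k_self = gaussian_affinity(x, x, eps=0.5)    # identical points
k_narrow = gaussian_affinity(x, y, eps=0.5)  # small neighborhood
k_wide = gaussian_affinity(x, y, eps=50.0)   # large neighborhood
print(k_self, k_narrow, k_wide)
```

The affinity always lies in (0, 1], equals 1 only at distance zero, and is symmetric whenever the underlying distance is; increasing eps widens the neighborhood in which points count as similar.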

Similarities and dissimilarities: Cosine similarities
Another similarity metric in Euclidean space is based on the inner product (i.e., dot product) $\langle x, y \rangle = \|x\| \, \|y\| \cos(\angle xy)$.
Cosine similarities: The cosine similarity between $x, y \in X \subset \mathbb{R}^n$ is defined as $\cos(x, y) = \frac{\langle x, y \rangle}{\|x\| \, \|y\|}$.
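A minimal sketch of the cosine similarity in plain Python. The function name and the sample vectors are illustrative; the zero-vector check is an added guard, since the ratio is undefined when either norm is zero.

```python
import math

def cosine_similarity(x, y):
    """Inner product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_x * norm_y)

# Hypothetical vectors
same_direction = cosine_similarity((1.0, 2.0), (2.0, 4.0))   # parallel
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 1.0))       # right angle
opposite = cosine_similarity((1.0, 0.0), (-1.0, 0.0))        # antiparallel
print(same_direction, orthogonal, opposite)
```

The value depends only on the angle between the vectors, not on their lengths, so rescaling either vector by a positive constant leaves it unchanged.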

Similarities and dissimilarities: Jaccard index
For data with $n$ binary attributes we consider two similarity metrics:
Simple matching coefficient: $\mathrm{SMC}(x, y) = \frac{\sum_{i=1}^{n} x[i] \wedge y[i] + \sum_{i=1}^{n} \neg x[i] \wedge \neg y[i]}{n}$
Jaccard coefficient: $J(x, y) = \frac{\sum_{i=1}^{n} x[i] \wedge y[i]}{\sum_{i=1}^{n} x[i] \vee y[i]}$
The Jaccard coefficient can be extended to continuous attributes:
Tanimoto (extended Jaccard) coefficient: $T(x, y) = \frac{\langle x, y \rangle}{\|x\|^2 + \|y\|^2 - \langle x, y \rangle}$
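The three coefficients can be sketched as follows. The function names and sample vectors are illustrative; the guards against empty or all-zero inputs are additions, since the ratios are undefined there.

```python
def smc(x, y):
    """Simple matching coefficient: fraction of attributes that agree."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(x, y) if bool(a) == bool(b))
    return matches / len(x)

def jaccard(x, y):
    """Jaccard coefficient: shared 1s over attributes with at least one 1."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    if either == 0:
        raise ValueError("Jaccard coefficient is undefined for two all-zero vectors")
    return both / either

def tanimoto(x, y):
    """Extended Jaccard: <x, y> / (||x||^2 + ||y||^2 - <x, y>)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denom == 0:
        raise ValueError("Tanimoto coefficient is undefined for two zero vectors")
    return dot / denom

# Hypothetical sparse binary vectors with no shared 1s
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))

# On binary data the Tanimoto coefficient coincides with Jaccard
u = [1, 1, 0, 0]
v = [1, 0, 1, 0]
print(jaccard(u, v), tanimoto(u, v))
```

The sparse pair shows the practical difference: SMC is high because the many shared zeros count as matches, while Jaccard ignores them and reports no overlap.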
