Exploring the MNIST dataset
ADVANCED DIMENSIONALITY REDUCTION IN R
Federico Castanedo
Data Scientist at DataRobot
Techniques covered:

- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Generalized Low Rank Models (GLRM)

Advantages of dimensionality reduction techniques:

- Feature selection
- Data compressed into a few important features
- Memory-saving and speeding up of machine learning models
- Visualisation of high dimensional datasets
- Imputing missing data (GLRM)
70,000 images of handwritten digits (0-9), each 28x28 pixels
Samples of handwritten digits
First values
head(mnist[, 1:6])

  label pixel0 pixel1 pixel2 pixel3 pixel4
1     1      0      0      0      0      0
2     0      0      0      0      0      0
3     1      0      0      0      0      0
4     4      0      0      0      0      0
5     0      0      0      0      0      0
6     0      0      0      0      0      0
Values of pixels 400 to 405 for the first record

mnist[1, 402:407]

  pixel400 pixel401 pixel402 pixel403 pixel404 pixel405
1        0        0        0       20      206      254
Basic statistics of pixel 408 for digits of label 1

summary(mnist[mnist$label == 1, 408])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0   253.0   253.0   246.5   254.0   255.0

Basic statistics of pixel 408 for digits of label 0

summary(mnist[mnist$label == 0, 408])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   4.517   0.000 255.000
The similarity between MNIST digits can be computed using a distance metric. A metric is a function whose output, for any given points x, y, z, satisfies:

- Non-negativity: d(x, y) >= 0
- Identity: d(x, y) = 0 if and only if x = y
- Symmetry: d(x, y) = d(y, x)
- Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
Euclidean distance in two dimensions: d(P, Q) = sqrt((p1 - q1)^2 + (p2 - q2)^2). Can be generalized to n dimensions.
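As a quick sanity check of the formula, we can compare a hand-computed Euclidean distance with R's built-in dist() on a pair of toy points (hypothetical values, not taken from the MNIST data):

```r
# Hand-computed Euclidean distance vs. dist() on two toy points
# (hypothetical example, not from the slides)
p <- c(1, 2, 3)
q <- c(4, 6, 3)

manual   <- sqrt(sum((p - q)^2))           # sqrt(9 + 16 + 0) = 5
via_dist <- as.numeric(dist(rbind(p, q)))  # same result via dist()
```

Both give 5, which is why dist() can be applied directly to the 784-dimensional pixel rows.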
Euclidean distance between the last 6 digits of mnist_sample
distances <- dist(mnist_sample[195:200, -1])
distances

         195      196      197      198      199
196 2582.812
197 2549.652 2520.634
198 1823.275 2286.126 2498.119
199 2537.907 2064.515 2317.869 2304.517
200 2362.112 2539.937 2756.149 2379.478 2593.528
Plot of the distances using heatmap()
heatmap(as.matrix(distances), Rowv = NA, symm = TRUE,
        labRow = mnist_sample$label[195:200],
        labCol = mnist_sample$label[195:200])
Heatmap of the Euclidean distance
Minkowski distance: d(P, Q) = (sum_i |P_i - Q_i|^p)^(1/p)

Example: Minkowski distance of order 3

distances <- dist(mnist_sample[195:200, -1], method = "minkowski", p = 3)
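The formula can be verified against dist() by computing one Minkowski distance by hand (toy vectors, hypothetical values):

```r
# Minkowski distance of order p computed from the formula and
# checked against dist() (toy vectors, hypothetical)
p_ord <- 3
x <- c(0, 3, 4)
y <- c(2, 1, 1)

manual   <- sum(abs(x - y)^p_ord)^(1 / p_ord)  # (8 + 8 + 27)^(1/3)
via_dist <- as.numeric(dist(rbind(x, y), method = "minkowski", p = p_ord))
```

Setting p = 2 recovers the Euclidean distance, and p = 1 the Manhattan distance.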
Manhattan distance (Minkowski distance of order 1)

distances <- dist(mnist_sample[195:200, -1], method = "manhattan")
Kullback-Leibler (KL) divergence:

- Not a metric, since it does not satisfy the symmetry and triangle inequality properties
- Measures differences between probability distributions
- A divergence of 0 indicates that the two distributions are identical
- A common distance measure in machine learning (t-SNE). For example, in decision trees it is called Information Gain
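The asymmetry is easy to see by computing the divergence directly from its definition, KL(P || Q) = sum(P * log(P / Q)), on two toy distributions (hypothetical values chosen to make the effect visible):

```r
# KL divergence computed directly from its definition.
# Toy distributions (hypothetical); both must sum to 1.
kl <- function(P, Q) sum(P * log(P / Q))

P <- c(0.5, 0.4, 0.1)
Q <- c(0.1, 0.3, 0.6)

kl(P, Q)  # not equal to kl(Q, P): the divergence is not symmetric
kl(Q, P)
```

This is also why the MNIST rows below are shifted by 1 and divided by their row totals first: the divergence is only defined for strictly positive probability vectors.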
Load the philentropy package and get the last 6 MNIST records
library(philentropy) mnist_6 <- mnist_sample[195:200, -1]
Add 1 to all records to avoid NaN and compute the totals per row
mnist_6 <- mnist_6 + 1 sums <- rowSums(mnist_6)
Compute the KL divergence
distances <- distance(mnist_6 / sums, method = "kullback-leibler")
heatmap(as.matrix(distances), Rowv = NA, symm = TRUE,
        labRow = mnist_sample$label[195:200],
        labCol = mnist_sample$label[195:200])
Heatmap of the KL divergence
Distance metrics cannot deal well with high-dimensional datasets. This problem is known as the curse of dimensionality. The problem of finding similar digits can instead be solved with dimensionality reduction techniques such as PCA and t-SNE.
- Coined by Richard Bellman
- Describes the problems that arise when the number of dimensions grows
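One way to see the effect on distances is a small simulation (hypothetical, not from the slides): as the number of dimensions grows, pairwise Euclidean distances between random points concentrate, so the gap between the nearest and farthest neighbour shrinks relative to the average distance.

```r
# Distance concentration sketch: relative spread of pairwise
# Euclidean distances between n random points in d dimensions
set.seed(42)
relative_spread <- function(d, n = 100) {
  x <- matrix(runif(n * d), nrow = n)
  dists <- as.numeric(dist(x))
  (max(dists) - min(dists)) / mean(dists)
}

relative_spread(2)     # wide relative spread in 2 dimensions
relative_spread(1000)  # much narrower in 1000 dimensions
```

With 784 pixel dimensions, MNIST sits well inside the regime where raw distances lose discriminating power.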
PCA: a linear feature extraction technique that creates new, independent features
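A minimal sketch of this idea on toy data (hypothetical, two correlated columns): prcomp() builds the new features as linear combinations of the original columns, and the resulting principal components are uncorrelated.

```r
# Minimal PCA sketch on toy data: two strongly correlated columns
# collapse onto one principal component
set.seed(1)
x1  <- rnorm(100)
x2  <- 2 * x1 + rnorm(100, sd = 0.1)  # strongly correlated with x1
toy <- data.frame(x1, x2)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)
summary(pca)                 # PC1 captures almost all of the variance
cor(pca$x[, 1], pca$x[, 2])  # the new features are uncorrelated
```

The same mechanics apply to the 784 pixel columns of MNIST below, just at a much larger scale.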
PCA with default parameters
pca_result <- prcomp(mnist[, -1])
PCA with two principal components
pca_result <- prcomp(mnist[, -1], rank. = 2)
summary(pca_result)

Importance of first k=2 (out of 784) components:
                             PC1      PC2
Standard deviation     578.60227 495.8680
Proportion of Variance   0.09749   0.0716
Cumulative Proportion    0.09749   0.1691
Plot of the first two principal components

plot(pca_result$x[, 1:2], pch = as.character(mnist$label),
     col = mnist$label + 1, main = "PCA output")
Plot of the t-SNE embedding

plot(tsne$tsne_x, tsne$tsne_y, pch = as.character(mnist$label),
     col = mnist$label + 1, main = "t-SNE output")