Exploring the MNIST dataset


  1. Exploring the MNIST dataset
     Advanced Dimensionality Reduction in R
     Federico Castanedo, Data Scientist at DataRobot

  2. Why do we need dimensionality reduction techniques?
     - t-Distributed Stochastic Neighbor Embedding (t-SNE)
     - Generalized Low Rank Models (GLRM)
     Advantages of dimensionality reduction techniques:
     - Feature selection
     - Data compressed into a few important features
     - Memory saving and speeding up of machine learning models
     - Visualisation of high-dimensional datasets
     - Imputing missing data (GLRM)

  3. MNIST dataset
     - 70,000 images of handwritten digits (0-9)
     - 28x28 pixels

  4. Several digits
     [Figure: samples of handwritten digits]

  5. Pixel values
     First values:
         head(mnist[, 1:6])

           label pixel0 pixel1 pixel2 pixel3 pixel4
         1     1      0      0      0      0      0
         2     0      0      0      0      0      0
         3     1      0      0      0      0      0
         4     4      0      0      0      0      0
         5     0      0      0      0      0      0
         6     0      0      0      0      0      0

  6. Pixel values
     Values of pixels 400 to 405 for the first record:
         mnist[1, 402:407]

           pixel400 pixel401 pixel402 pixel403 pixel404 pixel405
         1        0        0        0       20      206      254
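
Each record holds a label followed by 784 pixel intensities (28 x 28), which is why pixel 400 sits in column 402. The following is a minimal sketch of how one such row could be turned back into an image matrix. The data frame `toy` is synthetic and only stands in for `mnist`, and the row-by-row pixel order is an assumption about how the images were flattened.

```r
# Synthetic stand-in for one MNIST record: a label plus 784 pixel columns.
set.seed(1)
pixels <- sample(0:255, 28 * 28, replace = TRUE)
toy <- as.data.frame(t(c(7, pixels)))
names(toy) <- c("label", paste0("pixel", 0:783))

# Column k + 2 holds pixel k, because column 1 is the label and
# pixel numbering starts at 0.
names(toy)[402]

# Rebuild the 28 x 28 grid (assuming pixels were stored row by row).
img <- matrix(unlist(toy[1, -1]), nrow = 28, ncol = 28, byrow = TRUE)
dim(img)
```

Under the same assumption, `image(t(img[28:1, ]))` would draw the grid with the first row at the top.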

  7. Pixel statistics
     Basic statistics of column 408 (pixel406 in the column naming above) for digits of label 1:
         summary(mnist[mnist$label==1, 408])

            Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
             0.0   253.0   253.0   246.5   254.0   255.0
     Basic statistics of the same column for digits of label 0:
         summary(mnist[mnist$label==0, 408])

            Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
           0.000   0.000   0.000   4.517   0.000 255.000

  8. Let's practice!

  9. Distance metrics

  10. Distance metrics to compute similarity
      The similarity between MNIST digits can be computed using a distance metric.
      A metric is a function d whose output, for any given points x, y, z, satisfies:
      1. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
      2. Symmetry: d(x, y) = d(y, x)
      3. Non-negativity and identity: d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
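
The three properties can be checked numerically for the Euclidean distance. This is a small sketch on three random points (synthetic data, not MNIST); a check on one sample illustrates the properties but does not prove them.

```r
# Three random points in 5 dimensions (synthetic, for illustration only).
set.seed(42)
pts <- matrix(rnorm(15), nrow = 3, dimnames = list(c("x", "y", "z"), NULL))

# Full symmetric matrix of pairwise Euclidean distances.
d <- as.matrix(dist(pts))

# 1. Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
triangle_ok <- d["x", "z"] <= d["x", "y"] + d["y", "z"]
# 2. Symmetry: d(x, y) = d(y, x)
symmetric_ok <- isTRUE(all.equal(d, t(d)))
# 3. Non-negativity, and zero distance from each point to itself
nonneg_ok <- all(d >= 0) && all(diag(d) == 0)

c(triangle = triangle_ok, symmetric = symmetric_ok, nonneg = nonneg_ok)
```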

  11. Euclidean distance
      - Euclidean distance in two dimensions
      - Can be generalized to n dimensions
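
As a hedged illustration of the generalization, the sketch below writes the Euclidean distance as the square root of the summed squared coordinate differences and compares it with `dist()`. The helper `euclidean()` and the example vectors are made up for this illustration.

```r
# Euclidean distance between two numeric vectors of equal length
# (hypothetical helper, not part of the original slides).
euclidean <- function(p, q) sqrt(sum((p - q)^2))

# Two dimensions: the classic 3-4-5 right triangle.
euclidean(c(0, 0), c(3, 4))

# The same function works unchanged in n dimensions.
p <- c(1, 2, 3, 4)
q <- c(2, 4, 6, 8)
manual <- euclidean(p, q)
builtin <- as.numeric(dist(rbind(p, q)))
all.equal(manual, builtin)
```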

  12. Euclidean distance in R
      Euclidean distance between the last 6 digits of mnist_sample:
          distances <- dist(mnist_sample[195:200, -1])
          distances

                   195      196      197      198      199
          196 2582.812
          197 2549.652 2520.634
          198 1823.275 2286.126 2498.119
          199 2537.907 2064.515 2317.869 2304.517
          200 2362.112 2539.937 2756.149 2379.478 2593.528

  13. Plotting distances
      Plot of the distances using heatmap():
          heatmap(as.matrix(distances), Rowv = NA, symm = TRUE,
                  labRow = mnist_sample$label[195:200],
                  labCol = mnist_sample$label[195:200])

  14. Heatmap of the Euclidean distance
      [Figure: heatmap of the pairwise Euclidean distances]

  15. Minkowski family of distances
      Minkowski: $d = \left( \sum_i |P_i - Q_i|^p \right)^{1/p}$
      Example: Minkowski distance of order 3
          distances <- dist(mnist_sample[195:200, -1], method = "minkowski", p = 3)

  16. Manhattan distance
      Manhattan distance (Minkowski distance of order 1):
          distances <- dist(mnist_sample[195:200, -1], method = "manhattan")
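
To show how the order p ties the family together, here is a short sketch that implements the Minkowski formula directly and cross-checks it against `dist()`. The helper `minkowski()` and the vectors `P` and `Q` are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Minkowski distance of order p between two numeric vectors
# (hypothetical helper written for this illustration).
minkowski <- function(P, Q, p) sum(abs(P - Q)^p)^(1 / p)

P <- c(0, 0, 0)
Q <- c(1, 2, 2)

minkowski(P, Q, p = 1)  # Manhattan: 1 + 2 + 2
minkowski(P, Q, p = 2)  # Euclidean: sqrt(1 + 4 + 4)
minkowski(P, Q, p = 3)  # order 3: (1 + 8 + 8)^(1/3)

# Cross-check against dist() for orders 1 and 3.
pts <- rbind(P, Q)
all.equal(minkowski(P, Q, 1), as.numeric(dist(pts, method = "manhattan")))
all.equal(minkowski(P, Q, 3), as.numeric(dist(pts, method = "minkowski", p = 3)))
```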

  17. Kullback-Leibler (KL) divergence
      - Not a metric, since it satisfies neither the symmetry property nor the triangle inequality
      - Measures differences between probability distributions
      - A divergence of 0 indicates that the two distributions are identical
      - A common dissimilarity measure in machine learning (for example, in t-SNE); in decision trees it is called Information Gain
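
A minimal base-R sketch of the discrete KL divergence makes the asymmetry visible. The helper `kl_divergence()` and the two toy distributions are assumptions for illustration; it uses the natural logarithm and requires strictly positive probabilities.

```r
# KL divergence D(P || Q) for two discrete distributions with positive
# entries (hypothetical helper, natural logarithm).
kl_divergence <- function(p, q) sum(p * log(p / q))

p <- c(0.5, 0.3, 0.2)
q <- c(0.1, 0.1, 0.8)

# Identical distributions give a divergence of 0.
kl_divergence(p, p)

# Swapping the arguments changes the result, so it is not symmetric.
kl_pq <- kl_divergence(p, q)
kl_qp <- kl_divergence(q, p)
c(kl_pq = kl_pq, kl_qp = kl_qp)
```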

  18. Kullback-Leibler (KL) divergence in R
      Load the philentropy package and get the last 6 MNIST records:
          library(philentropy)
          mnist_6 <- mnist_sample[195:200, -1]
      Add 1 to all records to avoid NaN and compute the totals per row:
          mnist_6 <- mnist_6 + 1
          sums <- rowSums(mnist_6)
      Compute the KL divergence:
          distances <- distance(mnist_6/sums, method = "kullback-leibler")
          # Label with records 195:200 so the labels match the 6 rows plotted
          heatmap(as.matrix(distances), Rowv = NA, symm = TRUE,
                  labRow = mnist_sample$label[195:200],
                  labCol = mnist_sample$label[195:200])

  19. Heatmap of the KL divergence
      [Figure: heatmap of the pairwise KL divergences]

  20. Let's practice!

  21. Dimensionality reduction: PCA and t-SNE

  22. Dimensionality reduction
      - Distance metrics do not cope well with high-dimensional datasets. This problem is known as the curse of dimensionality.
      - The problem of finding similar digits can be solved with dimensionality reduction techniques such as PCA and t-SNE.

  23. Curse of dimensionality
      - Coined by Richard Bellman
      - Describes the problems that arise when the number of dimensions grows
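
One commonly cited symptom is that pairwise distances become nearly indistinguishable as dimensions are added. The sketch below illustrates this on uniformly random points (synthetic data, not MNIST). The helper `relative_contrast()` and the chosen dimensions are assumptions made for this example.

```r
# Relative contrast: (farthest - nearest) / nearest pairwise distance
# among n uniformly random points in `dims` dimensions.
relative_contrast <- function(dims, n = 100) {
  pts <- matrix(runif(n * dims), nrow = n)
  d <- as.numeric(dist(pts))
  (max(d) - min(d)) / min(d)
}

set.seed(123)
contrast <- sapply(c(2, 10, 100, 1000), relative_contrast)
names(contrast) <- c("2", "10", "100", "1000")
round(contrast, 2)
```

A shrinking contrast means the nearest and farthest neighbours look almost equally far away, which is what makes raw distances less useful for finding similar digits.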

  24. Principal component analysis (PCA)
      Linear feature extraction technique: creates new, uncorrelated features (the principal components)

  25. PCA in R
      PCA with default parameters:
          pca_result <- prcomp(mnist[, -1])
      PCA with two principal components:
          pca_result <- prcomp(mnist[, -1], rank = 2)
          summary(pca_result)

          Importance of first k=2 (out of 784) components:
                                       PC1      PC2
          Standard deviation     578.60227 495.8680
          Proportion of Variance   0.09749   0.0716
          Cumulative Proportion    0.09749   0.1691
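
The quantities in the summary can be reproduced from the `prcomp()` result itself. This is a sketch on synthetic data (a random 200 x 10 matrix built from two latent factors, standing in for the pixel columns), so the numbers will not match the MNIST output above.

```r
# Synthetic stand-in for the pixel columns: 200 rows, 10 correlated features.
set.seed(7)
latent <- matrix(rnorm(200 * 2), ncol = 2)
loadings <- matrix(rnorm(2 * 10), nrow = 2)
x <- latent %*% loadings + matrix(rnorm(200 * 10, sd = 0.1), ncol = 10)

# Keep only the first two principal components.
pca <- prcomp(x, rank. = 2)
dim(pca$x)

# Proportion of variance per component; sdev still covers all 10 components.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(prop_var)[1:2]

# The principal component scores are uncorrelated with each other.
round(cor(pca$x)[1, 2], 10)
```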

  26. plot(pca_result$x[, 1:2],
           pch = as.character(mnist$label),
           col = mnist$label + 1,  # + 1: colour 0 is the background colour
           main = "PCA output")

  27. plot(tsne$tsne_x, tsne$tsne_y,
           pch = as.character(mnist$label),
           col = mnist$label + 1,
           main = "t-SNE output")

  28. Let's practice!
