Large-Scale Machine Learning I: Scalability Issues


  1. Large-Scale Machine Learning I. Scalability issues. Jean-Philippe Vert, jean-philippe.vert@{mines-paristech,curie,ens}.fr. 1 / 76

  2. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Scalability issues. 2 / 76

  3. Acknowledgement. In the preparation of these slides I got inspiration and copied several slides from several sources: Sanjiv Kumar's "Large-scale machine learning" course: http://www.sanjivk.com/EECS6898/lectures.html ; Ala Al-Fuqaha's "Data mining" course: https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/SimilarityAnalysis.pdf ; Léon Bottou's "Large-scale machine learning revisited" conference: https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf 3 / 76

  4. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Scalability issues. 4 / 76

  5. 5 / 76

  6. Perception 6 / 76

  7. Communication 7 / 76

  8. Mobility 8 / 76

  9. Health https://pct.mdanderson.org 9 / 76

  10. Reasoning 10 / 76

  11. A common process: learning from data. https://www.linkedin.com/pulse/supervised-machine-learning-pega-decisioning-solution-nizam-muhammad Given examples (training data), make a machine learn how to predict on new samples, or discover patterns in data. Statistics + optimization + computer science. Gets better with more training examples and bigger computers. 11 / 76

  12. Large-scale ML? [Diagram: data matrix X with n samples and d dimensions; label matrix Y with n samples and t tasks.] Iris dataset: n = 150, d = 4, t = 1. Cancer drug sensitivity: n = 1k, d = 1M, t = 100. Imagenet: n = 14M, d = 60k+, t = 22k. Shopping, e-marketing: n = O(M), d = O(B), t = O(100M). Astronomy, GAFA, web...: n = O(B), d = O(B), t = O(B). 12 / 76

  13. Today's goals: 1 Review a few standard ML techniques; 2 Introduce a few ideas and techniques to scale them to modern, big datasets. 13 / 76

  14. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Scalability issues. 14 / 76

  15. Main ML paradigms: Unsupervised learning (Dimension reduction; Clustering; Density estimation; Feature learning); Supervised learning (Regression; Classification; Structured output classification); Semi-supervised learning; Reinforcement learning. 15 / 76

  16. Main ML paradigms: Unsupervised learning (Dimension reduction: PCA; Clustering: k-means; Density estimation; Feature learning); Supervised learning (Regression: OLS, ridge regression; Classification: kNN, logistic regression, SVM; Structured output classification); Semi-supervised learning; Reinforcement learning. 16 / 76

  17. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Scalability issues. 17 / 76

  18. Motivation. [Diagram: dimension reduction maps the n × d data matrix X to an n × k matrix X' with k < d.] Dimension reduction: preprocessing (remove noise, keep signal); visualization (k = 2, 3); discover structure. 18 / 76

  19. PCA definition. [Figure: data cloud with principal directions PC1 and PC2.] Training set S = {x_1, ..., x_n} ⊂ R^d. For i = 1, ..., k ≤ d, PC_i is the linear projection onto the direction that captures the largest amount of variance and is orthogonal to the previous ones:
$$u_i \in \operatorname*{argmax}_{\|u\|=1,\ u \perp \{u_1,\dots,u_{i-1}\}} \sum_{j=1}^{n}\Big( x_j^\top u - \frac{1}{n}\sum_{l=1}^{n} x_l^\top u \Big)^2$$
19 / 76
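A step the slide leaves implicit, connecting this definition to the next one (the notation X̃ is introduced there): the inner term (1/n) Σ_l x_l^⊤ u is the mean of the projections, so the objective is a sum of squared centered projections and can be rewritten with the centered data matrix,
$$\sum_{j=1}^{n}\Big( x_j^\top u - \frac{1}{n}\sum_{l=1}^{n} x_l^\top u \Big)^2 = \sum_{j=1}^{n}\big( \tilde{x}_j^\top u \big)^2 = \| \tilde{X} u \|^2 = u^\top \tilde{X}^\top \tilde{X}\, u, \qquad \tilde{x}_j = x_j - \frac{1}{n}\sum_{l=1}^{n} x_l .$$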

  20. PCA solution. [Figure: PC1 and PC2 directions on the data cloud.] Let X̃ be the centered n × d data matrix. PCA solves, for i = 1, ..., k ≤ d:
$$u_i \in \operatorname*{argmax}_{\|u\|=1,\ u \perp \{u_1,\dots,u_{i-1}\}} u^\top \tilde{X}^\top \tilde{X}\, u$$
Solution: u_i is the i-th eigenvector of C = X̃^⊤ X̃, the empirical covariance matrix. 20 / 76
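This characterization can be checked numerically. The following R sketch (ours, not from the slides) centers the Iris data used later in the deck, eigendecomposes C = X̃^⊤ X̃ directly, and compares the directions with those returned by princomp(); a 1/n factor in front of C would not change the eigenvectors.

    ## Sketch: PCA via an explicit eigendecomposition of C = t(X.tilde) %*% X.tilde
    X <- as.matrix(log(iris[, 1:4]))
    X.tilde <- scale(X, center = TRUE, scale = FALSE)   # centered n x d data matrix
    C <- t(X.tilde) %*% X.tilde                         # d x d matrix of this slide
    e <- eigen(C, symmetric = TRUE)                     # eigenvectors, sorted by eigenvalue
    U <- e$vectors                                      # columns are u_1, ..., u_d
    m <- princomp(log(iris[, 1:4]))
    ## Same directions as princomp's loadings, up to the sign of each column:
    max(abs(abs(U) - abs(unclass(m$loadings))))         # numerically zero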

  21. PCA example. [Scatter plot: the Iris dataset projected on PC1 and PC2, colored by species (setosa, versicolor, virginica).]
> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> m <- princomp(log(iris[,1:4]))
21 / 76
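The scatter plot on this slide can be reproduced from the princomp object m; the plotting calls below are one possible choice of ours, not part of the original slide.

    ## Plot the first two principal component scores, colored by species
    plot(m$scores[, 1], m$scores[, 2],
         col = as.integer(iris$Species), pch = 19,
         xlab = "PC1", ylab = "PC2")
    legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)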

  22. PCA complexity. Memory: store X and C: O(max(nd, d^2)). Compute C: O(nd^2). Compute k eigenvectors of C (power method): O(kd^2). Computing C is more expensive than computing its eigenvectors (n > k)! Example: n = 1B, d = 100M. Store C: 40,000 TB. Compute C: 2 × 10^25 FLOPS = 20 yottaFLOPS (about 300 years of the most powerful supercomputer in 2016). 22 / 76
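To illustrate the O(kd^2) eigenvector step, here is a minimal R sketch (ours) of the power method for the leading eigenvector of C: each iteration is one matrix-vector product, i.e. O(d^2) operations. Further eigenvectors would additionally require deflation or orthogonalization against the ones already found, which is not shown.

    ## Power method: repeated multiplication by C converges (generically)
    ## to the eigenvector associated with the largest eigenvalue.
    power.method <- function(C, n.iter = 100) {
      u <- rnorm(ncol(C))                   # random starting direction
      for (it in seq_len(n.iter)) {
        u <- as.vector(C %*% u)             # O(d^2) per iteration
        u <- u / sqrt(sum(u^2))             # renormalize to unit length
      }
      u
    }
    u1 <- power.method(C)                   # C as in the sketch after slide 20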

  23. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Scalability issues. 23 / 76

  24. Motivation. [Scatter plot: the Iris dataset projected on PC1 and PC2, unlabeled.] Unsupervised learning: discover groups; reduce dimension. 24 / 76

  25. Motivation. [Scatter plot: the Iris dataset on PC1 and PC2, colored by k-means clusters 1-5 (k = 5).] Unsupervised learning: discover groups; reduce dimension. 24 / 76

  26. k-means definition. Training set S = {x_1, ..., x_n} ⊂ R^d. Given k, find C = (C_1, ..., C_n) ∈ {1, ..., k}^n that solves
$$\min_{C} \sum_{i=1}^{n} \left\| x_i - \mu_{C_i} \right\|^2 ,$$
where μ_c is the barycentre of the data in class c. This is an NP-hard problem. k-means finds an approximate solution by iterating:
1 Assignment step: fix μ, optimize C: for all i = 1, ..., n, $C_i \leftarrow \operatorname*{argmin}_{c \in \{1,\dots,k\}} \| x_i - \mu_c \|$
2 Update step: for all c = 1, ..., k, $\mu_c \leftarrow \frac{1}{|\{j : C_j = c\}|} \sum_{j : C_j = c} x_j$
25 / 76
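To make the two alternating steps concrete, here is a minimal R sketch of our own, written for clarity rather than speed and assuming no cluster ever becomes empty; the built-in kmeans() used on the next slide is the idiomatic choice in practice.

    simple.kmeans <- function(X, k, n.iter = 50) {
      X  <- as.matrix(X)
      mu <- X[sample(nrow(X), k), , drop = FALSE]          # random initial centres
      for (it in seq_len(n.iter)) {
        ## Assignment step: fix mu, send each point to its nearest centre
        d2 <- as.matrix(dist(rbind(mu, X)))[-(1:k), 1:k]   # n x k distances to centres
        C  <- apply(d2, 1, which.min)
        ## Update step: each centre becomes the barycentre of its cluster
        for (cl in seq_len(k)) mu[cl, ] <- colMeans(X[C == cl, , drop = FALSE])
      }
      list(cluster = C, centers = mu)
    }
    res <- simple.kmeans(log(iris[, 1:4]), 3)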

  27. k-means example. [Scatter plot: the Iris dataset on PC1 and PC2, with points grouped by the k-means clusters below.]
> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46
26 / 76
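A figure like the one on slide 25 can be obtained by overlaying these cluster labels on the PCA scores; the plotting call below is our own choice, reusing the princomp object m from slide 21.

    ## Colour the PC1/PC2 scores by k-means cluster label
    plot(m$scores[, 1], m$scores[, 2],
         col = irisCluster$cluster, pch = 19,
         xlab = "PC1", ylab = "PC2", main = "Iris k-means clusters")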
