

slide-1
SLIDE 1

Large-Scale Machine Learning

I. Scalability issues

Jean-Philippe Vert jean-philippe.vert@{mines-paristech,curie,ens}.fr

1 / 76

slide-2
SLIDE 2

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

2 / 76

slide-3
SLIDE 3

Acknowledgement

In the preparation of these slides I got inspiration and copied several slides from several sources:
• Sanjiv Kumar's "Large-scale machine learning" course: http://www.sanjivk.com/EECS6898/lectures.html
• Ala Al-Fuqaha's "Data mining" course: https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/SimilarityAnalysis.pdf
• Léon Bottou's "Large-scale machine learning revisited" lecture: https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf

3 / 76

slide-4
SLIDE 4

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

4 / 76

slide-5
SLIDE 5

5 / 76

slide-6
SLIDE 6

Perception

6 / 76

slide-7
SLIDE 7

Communication

7 / 76

slide-8
SLIDE 8

Mobility

8 / 76

slide-9
SLIDE 9

Health

https://pct.mdanderson.org

9 / 76

slide-10
SLIDE 10

Reasoning

10 / 76

slide-11
SLIDE 11

A common process: learning from data

https://www.linkedin.com/pulse/supervised-machine-learning-pega-decisioning-solution-nizam-muhammad

• Given examples (training data), make a machine learn how to predict on new samples, or discover patterns in data
• Statistics + optimization + computer science
• Gets better with more training examples and bigger computers

11 / 76

slide-12
SLIDE 12

Large-scale ML?

[Figure: data matrix X (n samples × d dimensions) and label matrix Y (n samples × t tasks)]

• Iris dataset: n = 150, d = 4, t = 1
• Cancer drug sensitivity: n = 1k, d = 1M, t = 100
• ImageNet: n = 14M, d = 60k+, t = 22k
• Shopping, e-marketing: n = O(M), d = O(B), t = O(100M)
• Astronomy, GAFA, web...: n = O(B), d = O(B), t = O(B)

12 / 76

slide-13
SLIDE 13

Today’s goals

1. Review a few standard ML techniques
2. Introduce a few ideas and techniques to scale them to modern, big datasets

13 / 76

slide-14
SLIDE 14

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

14 / 76

slide-15
SLIDE 15

Main ML paradigms

Unsupervised learning
• Dimension reduction
• Clustering
• Density estimation
• Feature learning

Supervised learning
• Regression
• Classification
• Structured output classification

Semi-supervised learning
Reinforcement learning

15 / 76

slide-16
SLIDE 16

Main ML paradigms

Unsupervised learning
• Dimension reduction: PCA
• Clustering: k-means
• Density estimation
• Feature learning

Supervised learning
• Regression: OLS, ridge regression
• Classification: kNN, logistic regression, SVM
• Structured output classification

Semi-supervised learning
Reinforcement learning

16 / 76

slide-17
SLIDE 17

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

17 / 76

slide-18
SLIDE 18

Motivation

[Figure: data matrix X (n × d) reduced to X′ (n × k), with k < d]

• Dimension reduction
• Preprocessing (remove noise, keep signal)
• Visualization (k = 2, 3)
• Discover structure

18 / 76

slide-19
SLIDE 19

PCA definition

[Figure: 2D point cloud with principal directions PC1 and PC2]

Training set S = {x_1, . . . , x_n} ⊂ R^d. For i = 1, . . . , k ≤ d, PC_i is the linear projection onto the direction that captures the largest amount of variance and is orthogonal to the previous ones:
u_i ∈ argmax_{‖u‖=1, u⊥{u_1,...,u_{i−1}}}  Σ_{i=1}^n ( x_i⊤u − (1/n) Σ_{j=1}^n x_j⊤u )²

19 / 76

slide-20
SLIDE 20

PCA solution

[Figure: principal directions PC1 and PC2]

Let X̃ be the centered n × d data matrix. PCA solves, for i = 1, . . . , k ≤ d:
u_i ∈ argmax_{‖u‖=1, u⊥{u_1,...,u_{i−1}}}  u⊤X̃⊤X̃u
Solution: u_i is the i-th eigenvector of C = X̃⊤X̃, the empirical covariance matrix.
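A minimal R sketch of this eigendecomposition view of PCA (the function pca_eigen and its variable names are mine, for illustration only):

# PCA via eigendecomposition of the empirical covariance matrix
pca_eigen <- function(X, k) {
  Xc <- scale(X, center = TRUE, scale = FALSE)   # centered data matrix
  C  <- crossprod(Xc)                            # C = X~' X~  (d x d), costs O(n d^2)
  e  <- eigen(C, symmetric = TRUE)               # eigenvectors sorted by eigenvalue
  U  <- e$vectors[, 1:k, drop = FALSE]           # top-k principal directions u_1, ..., u_k
  list(directions = U, scores = Xc %*% U)        # projections of the samples
}

# Example on the iris data used in the slides
m <- pca_eigen(log(iris[, 1:4]), k = 2)
head(m$scores)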

20 / 76

slide-21
SLIDE 21

PCA example

[Figure: iris samples projected on PC1 and PC2, colored by species (setosa, versicolor, virginica)]

> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> m <- princomp(log(iris[,1:4]))

21 / 76

slide-22
SLIDE 22

PCA complexity

• Memory: store X and C: O(max(nd, d²))
• Compute C: O(nd²)
• Compute k eigenvectors of C (power method): O(kd²)
• Computing C is more expensive than computing its eigenvectors (when n > k)!
• Example: n = 1B, d = 100M
  • Store C: 40,000 TB
  • Compute C: 2 × 10²⁵ FLOPs = 20 yottaFLOPs (about 300 years on the most powerful supercomputer in 2016)

22 / 76

slide-23
SLIDE 23

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

23 / 76

slide-24
SLIDE 24

Motivation

[Figure: iris data projected on PC1 and PC2]

• Unsupervised learning
• Discover groups
• Reduce dimension

24 / 76

slide-25
SLIDE 25

Motivation

[Figure: iris data projected on PC1 and PC2, colored by k-means clusters 1–5 (k = 5)]

• Unsupervised learning
• Discover groups
• Reduce dimension

24 / 76

slide-26
SLIDE 26

k-means definition

Training set S = {x_1, . . . , x_n} ⊂ R^d. Given k, find an assignment C = (C_1, . . . , C_n) ∈ {1, . . . , k}^n that solves
min_C Σ_{i=1}^n ‖x_i − μ_{C_i}‖²
where μ_i is the barycentre of the data in cluster i. This is an NP-hard problem. k-means finds an approximate solution by iterating:

1. Assignment step: fix μ, optimize C:
   ∀i = 1, . . . , n,  C_i ← argmin_{c∈{1,...,k}} ‖x_i − μ_c‖²
2. Update step:
   ∀i = 1, . . . , k,  μ_i ← (1/|C_i|) Σ_{j: C_j=i} x_j
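A compact R sketch of these two alternating steps (the function lloyd_kmeans and its random initialization are illustrative, not from the slides):

lloyd_kmeans <- function(X, k, iters = 20) {
  X  <- as.matrix(X)
  mu <- X[sample(nrow(X), k), , drop = FALSE]            # initial centers
  for (it in 1:iters) {
    # Assignment step: O(ndk) -- nearest center for each sample
    d2 <- sapply(1:k, function(c) rowSums(sweep(X, 2, mu[c, ])^2))
    C  <- max.col(-d2)                                   # argmin over clusters
    # Update step: O(nd) -- barycentre of each non-empty cluster
    for (c in 1:k) if (any(C == c)) mu[c, ] <- colMeans(X[C == c, , drop = FALSE])
  }
  list(cluster = C, centers = mu)
}

res <- lloyd_kmeans(log(iris[, 1:4]), k = 3)
table(res$cluster, iris$Species)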

25 / 76

slide-27
SLIDE 27

k-means example

[Figure: iris samples on PC1 and PC2]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46

26 / 76

slide-28
SLIDE 28

k-means example

[Figure: iris k-means clustering with k = 2, shown on PC1 and PC2]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46

26 / 76

slide-29
SLIDE 29

k-means example

[Figure: iris k-means clustering with k = 3, shown on PC1 and PC2]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46

26 / 76

slide-30
SLIDE 30

k-means example

[Figure: iris k-means clustering with k = 4, shown on PC1 and PC2]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46

26 / 76

slide-31
SLIDE 31

k-means example

[Figure: iris k-means clustering with k = 5, shown on PC1 and PC2]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46

26 / 76

slide-32
SLIDE 32

k-means complexity

• Each update step: O(nd)
• Each assignment step: O(ndk)

27 / 76

slide-33
SLIDE 33

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

28 / 76

slide-34
SLIDE 34

Motivation

[Figure: scatter plot of y vs x]

Predict a continuous output Y ∈ R from an input X ∈ R^d

29 / 76

slide-35
SLIDE 35

Motivation

[Figure: scatter plot of y vs x]

Predict a continuous output Y ∈ R from an input X ∈ R^d

29 / 76

slide-36
SLIDE 36

Ridge regression (Hoerl and Kennard, 1970)

Training set S = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ R^d × R. Fit a linear function: f_β(x) = β⊤x. Goodness of fit is measured by the residual sum of squares:
RSS(β) = Σ_{i=1}^n (y_i − f_β(x_i))²
Ridge regression minimizes the regularized RSS:
min_β RSS(β) + λ Σ_{i=1}^d β_i²

30 / 76

slide-37
SLIDE 37

Solution

Let X = (x_1, . . . , x_n) be the n × p data matrix, and Y = (y_1, . . . , y_n)⊤ ∈ R^n the response vector.

31 / 76

slide-38
SLIDE 38

Solution

Let X = (x_1, . . . , x_n) be the n × p data matrix, and Y = (y_1, . . . , y_n)⊤ ∈ R^n the response vector. The penalized risk can be written in matrix form:
R(β) + λΩ(β) = (1/n) Σ_{i=1}^n (f_β(x_i) − y_i)² + λ Σ_{i=1}^p β_i²
             = (1/n) (Y − Xβ)⊤(Y − Xβ) + λβ⊤β .

31 / 76

slide-39
SLIDE 39

Solution

Let X = (x_1, . . . , x_n) be the n × p data matrix, and Y = (y_1, . . . , y_n)⊤ ∈ R^n the response vector. The penalized risk can be written in matrix form:
R(β) + λΩ(β) = (1/n) Σ_{i=1}^n (f_β(x_i) − y_i)² + λ Σ_{i=1}^p β_i²
             = (1/n) (Y − Xβ)⊤(Y − Xβ) + λβ⊤β .
Explicit minimizer:
β̂_λ^ridge = argmin_{β∈R^p} { R(β) + λΩ(β) } = (X⊤X + λnI)⁻¹ X⊤Y .
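A short R sketch of this closed-form solution on a random toy dataset (the function ridge_fit and the toy data are illustrative):

ridge_fit <- function(X, y, lambda) {
  X <- as.matrix(X); n <- nrow(X); p <- ncol(X)
  # beta_hat = (X'X + lambda * n * I)^(-1) X'y
  solve(crossprod(X) + lambda * n * diag(p), crossprod(X, y))
}

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X %*% c(1, -1, 0, 0, 2) + rnorm(100)
ridge_fit(X, y, lambda = 0.1)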

31 / 76

slide-40
SLIDE 40

Limit cases

β̂_λ^ridge = (X⊤X + λnI)⁻¹ X⊤Y

Corollary
• As λ → 0, β̂_λ^ridge → β̂^OLS (low bias, high variance).
• As λ → +∞, β̂_λ^ridge → 0 (high bias, low variance).

32 / 76

slide-41
SLIDE 41

Ridge regression example

(From Hastie et al., 2001)

33 / 76

slide-42
SLIDE 42

Ridge regression with correlated features

Ridge regression is particularly useful in the presence of correlated features:

> library(MASS)  # for the lm.ridge command
> x1 <- rnorm(20)
> x2 <- rnorm(20, mean = x1, sd = .01)
> y <- rnorm(20, mean = 3 + x1 + x2)
> lm(y ~ x1 + x2)$coef
(Intercept)          x1          x2
   3.070699   25.797872  -23.748019
> lm.ridge(y ~ x1 + x2, lambda = 1)
                   x1        x2
  3.066027   1.015862  0.956560

34 / 76

slide-43
SLIDE 43

Ridge regression complexity

• Compute X⊤X: O(nd²)
• Invert (X⊤X + λI): O(d³)
• Computing X⊤X is more expensive than inverting it when n > d!

35 / 76

slide-44
SLIDE 44

Generalization: ℓ2-regularized learning

A general ℓ2-penalized estimator is of the form
min_β { R(β) + λ‖β‖₂² } ,   (1)
where R(β) = (1/n) Σ_{i=1}^n ℓ(f_β(x_i), y_i) for some general loss function ℓ.
• Ridge regression corresponds to the particular loss ℓ(u, y) = (u − y)².
• For general convex losses, problem (1) is strictly convex and has a unique global minimum, which can usually be found by numerical algorithms for convex optimization.
• Complexity: typically a constant factor more than ridge regression (e.g., by iteratively approximating smooth losses by quadratic functions)

36 / 76

slide-45
SLIDE 45

Losses for regression

• Square loss: ℓ(u, y) = (u − y)²
• ε-insensitive loss: ℓ(u, y) = (|u − y| − ε)₊
• Huber loss: mixed quadratic/linear

37 / 76

slide-46
SLIDE 46

Choice of λ

[Figure: prediction error vs model complexity for training and test samples; low complexity = high bias / low variance, high complexity = low bias / high variance]

38 / 76

slide-47
SLIDE 47

Cross-validation

A simple and systematic procedure to estimate the risk (and to optimize the model's parameters):

1. Randomly divide the training set (of size n) into K (almost) equal portions, each of size n/K
2. For each portion, fit the model with different parameters on the K − 1 other groups and test its performance on the left-out group
3. Average performance over the K groups, and take the parameter with the smallest average error

Taking K = 5 or 10 is recommended as a good default choice. Complexity: multiplied by K.
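A minimal R sketch of K-fold cross-validation for the ridge parameter λ (the function cv_ridge and the candidate grid are illustrative; it reuses the closed-form ridge solution sketched above):

cv_ridge <- function(X, y, lambdas, K = 5) {
  X <- as.matrix(X); n <- nrow(X)
  folds <- sample(rep(1:K, length.out = n))              # random partition into K groups
  err <- sapply(lambdas, function(lam) {
    mean(sapply(1:K, function(k) {
      tr <- folds != k                                   # fit on the K-1 other groups
      beta <- solve(crossprod(X[tr, ]) + lam * sum(tr) * diag(ncol(X)),
                    crossprod(X[tr, ], y[tr]))
      mean((y[!tr] - X[!tr, ] %*% beta)^2)               # test on the left-out group
    }))
  })
  lambdas[which.min(err)]                                # parameter with smallest average error
}

# e.g. cv_ridge(X, y, lambdas = 10^seq(-3, 3)) with X, y from the ridge example above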

39 / 76

slide-48
SLIDE 48

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

40 / 76

slide-49
SLIDE 49

Motivation

• Predict the category of a data point
• 2 or more (sometimes many) categories

41 / 76

slide-50
SLIDE 50

Motivation

• Predict the category of a data point
• 2 or more (sometimes many) categories

41 / 76

slide-51
SLIDE 51

Motivation

• Predict the category of a data point
• 2 or more (sometimes many) categories

41 / 76

slide-52
SLIDE 52

Motivation

• Predict the category of a data point
• 2 or more (sometimes many) categories

41 / 76

slide-53
SLIDE 53

k-nearest neighbors (kNN)

[Figure: kNN decision boundary] (Hastie et al. The elements of statistical learning. Springer, 2001.)

• Training set S = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ R^d × {−1, 1}
• No training
• Given a new point x ∈ R^d, predict the majority class among its k nearest neighbors (take k odd)
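A tiny brute-force R sketch of this prediction rule (the function knn_predict is mine, for illustration only):

knn_predict <- function(x, X, y, k = 5) {
  # squared distances from the query point to all n training points: O(nd)
  d2 <- rowSums(sweep(as.matrix(X), 2, x)^2)
  nn <- order(d2)[1:k]                    # indices of the k nearest neighbors
  names(which.max(table(y[nn])))          # majority vote among their labels
}

# Example: classify one iris flower from its 4 measurements
knn_predict(c(5.0, 3.4, 1.5, 0.2), iris[, 1:4], iris$Species, k = 7)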

42 / 76

slide-54
SLIDE 54

kNN properties

Uniform Bayes consistency (Stone, 1977)
• Take k = √n (for example)
• Let P be any distribution over (X, Y) pairs
• Assume the training data are random pairs sampled i.i.d. according to P
• Then the kNN classifier f̂_n satisfies almost surely:
  lim_{n→+∞} P(f̂_n(X) ≠ Y) = inf_{f measurable} P(f(X) ≠ Y)

But "no free lunch": the speed of convergence to the best classifier can be arbitrarily slow

43 / 76

slide-55
SLIDE 55

kNN complexity

Complexity:
• Memory: storing X is O(nd)
• Training time: 0 (the best!)
• Prediction: O(nd) for each test point (ouch!)

44 / 76

slide-56
SLIDE 56

Linear models for classification

• Training set S = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ R^d × {−1, 1}
• Fit a linear function f_β(x) = β⊤x
• The prediction on a new point x ∈ R^d is:
  +1 if f_β(x) > 0 ,  −1 otherwise.

45 / 76

slide-57
SLIDE 57

The 0/1 loss

The 0/1 loss measures whether a prediction is correct or not:
ℓ_{0/1}(f(x), y) = 1(yf(x) < 0) = 0 if y = sign(f(x)),  1 otherwise.

It is then tempting to learn f_β(x) = β⊤x by solving:
min_{β∈R^p}  (1/n) Σ_{i=1}^n ℓ_{0/1}(f_β(x_i), y_i)   [misclassification rate]   + λ‖β‖₂²   [regularization]

However:
• The problem is non-smooth, and typically NP-hard to solve
• The regularization has no effect since the 0/1 loss is invariant by scaling of β
• In fact, no function achieves the minimum when λ > 0 (why?)

46 / 76

slide-58
SLIDE 58

The logistic loss

An alternative is to define a probabilistic model of y parametrized by f(x), e.g.:
∀y ∈ {−1, 1} ,  p(y | f(x)) = 1 / (1 + e^{−yf(x)}) = σ(yf(x))

[Figure: the sigmoid functions σ(u) and σ(−u)]

The logistic loss is the negative conditional likelihood:
ℓ_logistic(f(x), y) = − ln p(y | f(x)) = ln(1 + e^{−yf(x)})

47 / 76

slide-59
SLIDE 59

Ridge logistic regression (Le Cessie and van Houwelingen, 1992)

min_{β∈R^p} J(β) = (1/n) Σ_{i=1}^n ln(1 + e^{−y_i β⊤x_i}) + λ‖β‖₂²

• Can be interpreted as a regularized conditional maximum likelihood estimator
• No explicit solution, but a smooth convex optimization problem that can be solved numerically

48 / 76

slide-60
SLIDE 60

Solving ridge logistic regression

min_β J(β) = (1/n) Σ_{i=1}^n ln(1 + e^{−y_i β⊤x_i}) + λ‖β‖₂²

No explicit solution, but a convex problem with:
∇J(β) = −(1/n) Σ_{i=1}^n  y_i x_i / (1 + e^{y_i β⊤x_i}) + 2λβ
       = −(1/n) Σ_{i=1}^n  y_i [1 − P_β(y_i | x_i)] x_i + 2λβ

∇²J(β) = (1/n) Σ_{i=1}^n  x_i x_i⊤ e^{y_i β⊤x_i} / (1 + e^{y_i β⊤x_i})² + 2λI
        = (1/n) Σ_{i=1}^n  P_β(1 | x_i) (1 − P_β(1 | x_i)) x_i x_i⊤ + 2λI

49 / 76

slide-61
SLIDE 61

Solving ridge logistic regression (cont.)

min_β J(β) = (1/n) Σ_{i=1}^n ln(1 + e^{−y_i β⊤x_i}) + λ‖β‖₂²

• The solution can then be found by Newton-Raphson iterations:
  β_new ← β_old − [∇²J(β_old)]⁻¹ ∇J(β_old) .
• Each step is equivalent to solving a weighted ridge regression problem (left as an exercise)
• This method is therefore called iteratively reweighted least squares (IRLS).
• Complexity: O(iterations × (nd² + d³))
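A compact R sketch of these Newton-Raphson updates, directly implementing the gradient and Hessian above (the function ridge_logistic and the toy data are illustrative):

ridge_logistic <- function(X, y, lambda, iters = 25) {
  X <- as.matrix(X); n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  for (it in 1:iters) {
    m    <- drop(X %*% beta) * y                 # margins y_i * beta' x_i
    prob <- 1 / (1 + exp(-m))                    # P_beta(y_i | x_i)
    grad <- -colMeans(y * (1 - prob) * X) + 2 * lambda * beta
    W    <- prob * (1 - prob)                    # per-sample weights
    hess <- crossprod(X, W * X) / n + 2 * lambda * diag(p)
    beta <- beta - solve(hess, grad)             # Newton-Raphson step
  }
  drop(beta)
}

# Toy example with labels y in {-1, +1}
set.seed(1)
X <- matrix(rnorm(200 * 3), 200, 3)
y <- ifelse(drop(X %*% c(2, -1, 0)) + rnorm(200) > 0, 1, -1)
ridge_logistic(X, y, lambda = 0.1)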

50 / 76

slide-62
SLIDE 62

Large-margin classifiers

For any f : R^d → R, the margin of f on an (x, y) pair is yf(x). Large-margin classifiers fit a classifier by maximizing the margins on the training set:
min_β Σ_{i=1}^n φ(y_i f_β(x_i)) + λβ⊤β
for a convex, non-increasing function φ : R → R₊

51 / 76

slide-63
SLIDE 63

Loss function examples

Loss         | Method                        | φ(u)
0-1          | none                          | 1(u ≤ 0)
Hinge        | Support vector machine (SVM)  | max(1 − u, 0)
Logistic     | Logistic regression           | log(1 + e⁻ᵘ)
Square       | Ridge regression              | (1 − u)²
Exponential  | Boosting                      | e⁻ᵘ
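A small R sketch of these margin-based losses, convenient for plotting and comparing them (purely illustrative):

loss_01    <- function(u) as.numeric(u <= 0)
loss_hinge <- function(u) pmax(1 - u, 0)
loss_logis <- function(u) log(1 + exp(-u))
loss_sq    <- function(u) (1 - u)^2
loss_exp   <- function(u) exp(-u)

u <- seq(-2, 3, by = 0.01)
matplot(u, cbind(loss_01(u), loss_hinge(u), loss_logis(u), loss_sq(u), loss_exp(u)),
        type = "l", ylab = "phi(u)", ylim = c(0, 4))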

52 / 76

slide-64
SLIDE 64

Which ϕ?

Computation
• φ convex means we need to solve a convex optimization problem.
• A "good" φ may be one which allows for fast optimization.

Theory
• Most φ lead to consistent estimators (see next slides)
• Some may be more efficient

53 / 76

slide-65
SLIDE 65

A tiny bit of learning theory

Assumptions and notations
• Let P be an (unknown) distribution on X × Y, and η(x) = P(Y = 1 | X = x) a measurable version of the conditional distribution of Y given X
• Assume the training set S_n = (X_i, Y_i)_{i=1,...,n} consists of i.i.d. random pairs sampled according to P.
• The risk of a classifier f : X → R is R(f) = P(sign(f(X)) ≠ Y)
• The Bayes risk is R* = inf_{f measurable} R(f), which is attained for f*(x) = η(x) − 1/2
• The empirical risk of a classifier f : X → R is R_n(f) = (1/n) Σ_{i=1}^n 1(sign(f(X_i)) ≠ Y_i)

54 / 76

slide-66
SLIDE 66

ϕ-risk

Let the empirical φ-risk be the empirical risk optimized by a large-margin classifier:
R_n^φ(f) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i))
It is the empirical version of the φ-risk
R_φ(f) = E[φ(Y f(X))]
Can we hope to have a small risk R(f) if we focus instead on the φ-risk R_φ(f)?

55 / 76

slide-67
SLIDE 67

A small ϕ-risk ensures a small 0/1 risk

Theorem (?)
Let φ : R → R₊ be convex, non-increasing, differentiable at 0 with φ′(0) < 0. Let f : X → R be measurable such that
R_φ(f) = min_{g measurable} R_φ(g) = R*_φ .
Then
R(f) = min_{g measurable} R(g) = R* .

Remarks:
• This tells us that, if we know P, then minimizing the φ-risk is a good idea even if our focus is on the classification error.
• The assumptions on φ can be relaxed; the result holds for the broader class of classification-calibrated loss functions (?).
• More generally, we can show that if R_φ(f) − R*_φ is small, then R(f) − R* is small too (?).

56 / 76

slide-68
SLIDE 68

A small ϕ-risk ensures a small 0/1 risk

Proof sketch: condition on X = x:
R_φ(f | X = x) = E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x))
R_φ(−f | X = x) = E[φ(−Y f(X)) | X = x] = η(x) φ(−f(x)) + (1 − η(x)) φ(f(x))
Therefore:
R_φ(f | X = x) − R_φ(−f | X = x) = [2η(x) − 1] × [φ(f(x)) − φ(−f(x))]
This must be a.s. ≤ 0 because R_φ(f) ≤ R_φ(−f), which implies:
• if η(x) > 1/2, φ(f(x)) ≤ φ(−f(x)) ⇒ f(x) ≥ 0
• if η(x) < 1/2, φ(f(x)) ≥ φ(−f(x)) ⇒ f(x) ≤ 0
These inequalities are in fact strict thanks to the assumptions we made on φ (left as exercise).

57 / 76
slide-69
SLIDE 69

SVM (Boser et al., 1992)

min_{β∈R^p} Σ_{i=1}^n max(0, 1 − y_i β⊤x_i) + λβ⊤β

• A non-smooth convex optimization problem (convex quadratic program)
• Equivalent to the dual problem
  max_{α∈R^n} 2α⊤Y − α⊤XX⊤α   s.t.  0 ≤ y_i α_i ≤ 1/(2λ) for i = 1, . . . , n
• The solution β* of the primal is obtained from the solution α* of the dual:
  β* = X⊤α* ,   f_{β*}(x) = (β*)⊤x = (α*)⊤Xx
• Training complexity: O(n²) to store XX⊤, O(n³) to find α*
• Prediction: O(d) for (β*)⊤x, O(nd) for (α*)⊤Xx

58 / 76

slide-70
SLIDE 70

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

59 / 76

slide-71
SLIDE 71

Motivation

[Figure: nonlinear regression data, y vs x]

60 / 76

slide-72
SLIDE 72

Model

Learn a function f : R^d → R of the form
f(x) = Σ_{i=1}^n α_i K(x_i, x)
for a positive definite (p.d.) kernel K : R^d × R^d → R, such as:
• Linear: K(x, x′) = x⊤x′
• Polynomial: K(x, x′) = (x⊤x′ + c)^p
• Gaussian: K(x, x′) = exp(−‖x − x′‖² / (2σ²))
• Min/max: K(x, x′) = Σ_{i=1}^d min(|x_i|, |x′_i|) / max(|x_i|, |x′_i|)
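A short R sketch of the Gaussian kernel and its Gram matrix (the helper names gauss_kernel and gram are mine; the same helpers are reused in the kernel ridge regression sketch later):

gauss_kernel <- function(x, z, sigma = 1) exp(-sum((x - z)^2) / (2 * sigma^2))

# n x n Gram matrix on a training set X: [K_n]_ij = K(x_i, x_j)
gram <- function(X, sigma = 1) {
  X <- as.matrix(X); n <- nrow(X)
  outer(1:n, 1:n, Vectorize(function(i, j) gauss_kernel(X[i, ], X[j, ], sigma)))
}

K <- gram(log(iris[, 1:4]), sigma = 1)
dim(K)   # 150 x 150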

61 / 76

slide-73
SLIDE 73

Feature space

A function K : R^d × R^d → R is a p.d. kernel if and only if there exists a mapping Φ : R^d → R^D, for some D ∈ N ∪ {+∞}, such that
∀x, x′ ∈ R^d ,  K(x, x′) = Φ(x)⊤Φ(x′)
Surprise: all functions on the previous slide are kernels! (sometimes with D = +∞)
Exercise: can you prove it?

62 / 76

slide-74
SLIDE 74

Example: polynomial kernel

[Figure: mapping from R² (coordinates x₁, x₂) to the feature space]

For x = (x₁, x₂)⊤ ∈ R², let Φ(x) = (x₁², √2 x₁x₂, x₂²) ∈ R³:
K(x, x′) = x₁²x′₁² + 2x₁x₂x′₁x′₂ + x₂²x′₂²
         = (x₁x′₁ + x₂x′₂)²
         = (x⊤x′)² .
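A two-line R check of this identity (illustrative):

phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
x <- c(1.2, -0.7); xp <- c(0.3, 2.5)
sum(phi(x) * phi(xp))      # Phi(x)' Phi(x')
sum(x * xp)^2              # (x' x')^2 -- same value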

63 / 76

slide-75
SLIDE 75

From α ∈ Rn to β ∈ RD

Σ_{i=1}^n α_i K(x_i, x) = Σ_{i=1}^n α_i Φ(x_i)⊤Φ(x) = β⊤Φ(x)
for β = Σ_{i=1}^n α_i Φ(x_i).

[Figure: mapping from R² to the feature space]

64 / 76

slide-76
SLIDE 76

Learning

[Figure: mapping from R² to the feature space]

We can learn f(x) = Σ_{i=1}^n α_i K(x_i, x) by fitting a linear model β⊤Φ(x) in the feature space. Example: ridge regression / logistic regression / SVM:
min_{β∈R^D} Σ_{i=1}^n ℓ(y_i, β⊤Φ(x_i)) + λβ⊤β
But D can be very large, even infinite...

65 / 76

slide-77
SLIDE 77

Kernel tricks

• K(x, x′) = Φ(x)⊤Φ(x′) can be quick to compute even if D is large (even infinite)
• For a set of training samples {x_1, . . . , x_n} ⊂ R^d, let K_n be the n × n Gram matrix: [K_n]_ij = K(x_i, x_j)
• For β = Σ_{i=1}^n α_i Φ(x_i) we have β⊤Φ(x_i) = [Kα]_i and β⊤β = α⊤Kα
• We can therefore solve the equivalent problem in α ∈ R^n:
  min_{α∈R^n} Σ_{i=1}^n ℓ(y_i, [Kα]_i) + λα⊤Kα

66 / 76

slide-78
SLIDE 78

Example: kernel ridge regression (KRR)

min_{β∈R^D} Σ_{i=1}^n (y_i − β⊤Φ(x_i))² + λβ⊤β

• Solve in R^D:  β̂ = (Φ(X)⊤Φ(X) + λI)⁻¹ Φ(X)⊤Y   (a D × D system)
• Solve in R^n:  α̂ = (K + λI)⁻¹ Y   (an n × n system)
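A short R sketch of the n × n formulation with the Gaussian kernel, reusing the gauss_kernel and gram helpers sketched earlier (entirely illustrative, on a 1D toy problem):

krr_fit <- function(X, y, lambda, sigma = 1) {
  K <- gram(X, sigma)                              # n x n Gram matrix
  alpha <- solve(K + lambda * diag(nrow(K)), y)    # alpha_hat = (K + lambda I)^(-1) Y
  function(x) sum(alpha * apply(as.matrix(X), 1, gauss_kernel, z = x, sigma = sigma))
}

# 1D toy example: fit a nonlinear function
x <- seq(0, 10, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.2)
f <- krr_fit(matrix(x), y, lambda = 0.1, sigma = 1)
f(5)   # predicted value at x = 5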

67 / 76

slide-79
SLIDE 79

KRR with Gaussian RBF kernel

min_{β∈R^D} Σ_{i=1}^n (y_i − β⊤Φ(x_i))² + λβ⊤β ,   with K(x, x′) = exp(−‖x − x′‖² / (2σ²))

[Figure: nonlinear toy data, y vs x]

68 / 76

slide-80
SLIDE 80

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 1000]

68 / 76

slide-81
SLIDE 81

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 100]

68 / 76

slide-82
SLIDE 82

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 10]

68 / 76

slide-83
SLIDE 83

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 1]

68 / 76

slide-84
SLIDE 84

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 0.1]

68 / 76

slide-85
SLIDE 85

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 0.01]

68 / 76

slide-86
SLIDE 86

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 0.001]

68 / 76

slide-87
SLIDE 87

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 0.0001]

68 / 76

slide-88
SLIDE 88

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 0.00001]

68 / 76

slide-89
SLIDE 89

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 0.000001]

68 / 76

slide-90
SLIDE 90

KRR with Gaussian RBF kernel

Same objective and Gaussian RBF kernel as above.

[Figure: KRR fit with lambda = 0.0000001]

68 / 76

slide-91
SLIDE 91

Complexity

[Figure: KRR fit with lambda = 1]

• Compute K: O(dn²)
• Store K: O(n²)
• Solve for α: O(n^{2∼3})
• Compute f(x) for one x: O(nd)
• Impractical for n > 10∼100k

69 / 76

slide-92
SLIDE 92

Outline

1. Introduction
2. Standard machine learning
   • Dimension reduction: PCA
   • Clustering: k-means
   • Regression: ridge regression
   • Classification: kNN, logistic regression and SVM
   • Nonlinear models: kernel methods
3. Scalability issues

70 / 76

slide-93
SLIDE 93

What is ”large-scale”?

• Data cannot fit in RAM
• Algorithm cannot run on a single machine in reasonable time (algorithm-dependent)
• Sometimes even O(n) is too large! (e.g., nearest neighbor in a database of O(B+) items)
• Many tasks / parameters (e.g., image categorization in O(10M) classes)
• Streams of data

71 / 76

slide-94
SLIDE 94

Things to worry about

• Training time (usually offline)
• Memory requirements
• Test time

Complexities so far:
Method               | Memory | Training time | Test time
PCA                  | O(d²)  | O(nd²)        | O(d)
k-means              | O(nd)  | O(ndk)        | O(kd)
Ridge regression     | O(d²)  | O(nd²)        | O(d)
kNN                  | O(nd)  | 0             | O(nd)
Logistic regression  | O(nd)  | O(nd²)        | O(d)
SVM, kernel methods  | O(n²)  | O(n³)         | O(nd)

72 / 76

slide-95
SLIDE 95

Techniques for large-scale ML

• Understand modern architectures, and how to distribute data / computation (cf. C. Azencott)
• Trade optimization accuracy for speed (cf. F. Bach)
• Know the tricks, e.g., for deep learning (cf. F. Moutarde)
• Randomization helps (cf. Friday)

73 / 76

slide-96
SLIDE 96

References I

  • D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins.
  • J. Comput. Syst. Sci., 66(4):671–687, 2003. doi: 10.1016/S0022-0000(03)00025-4. URL

http://dx.doi.org/10.1016/S0022-0000(03)00025-4.

  • N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest
  • neighbors. SIAM J. Comput., 39(1):302–322, 2009. doi: 10.1137/060673096. URL

http://dx.doi.org/10.1137/060673096.

  • B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin
  • classifiers. In Proceedings of the 5th annual ACM workshop on Computational Learning

Theory, pages 144–152, New York, NY, USA, 1992. ACM Press. URL http://www.clopinet.com/isabelle/Papers/colt92.ps.Z.

  • L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In J. C. Platt, D. Koller,
  • Y. Singer, and S. T. Roweis, editors, Adv. Neural. Inform. Process Syst., volume 20, pages

161–168. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf.

  • A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the

Compression and Complexity of Sequences, pages 21–29, 1997. doi: 10.1109/SEQUEN.1997.666900. URL http://dx.doi.org/10.1109/SEQUEN.1997.666900.

  • M. X. Goemans and D. P. Williamson. A general approximation technique for constrained

forest problems. SIAM J. Comput., 24(2):296–317, apr 1995. doi: 10.1137/S0097539793242618. URL http://dx.doi.org/10.1137/S0097539793242618.

74 / 76

slide-97
SLIDE 97

References II

  • T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining,

inference, and prediction. Springer, 2001.

  • A. E. Hoerl and R. W. Kennard. Ridge regression : biased estimation for nonorthogonal
  • problems. Technometrics, 12(1):55–67, 1970.
  • W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space.
  • Contemp. Math., 26:189–206, 1984. doi: 10.1090/conm/026/737400. URL

http://dx.doi.org/10.1090/conm/026/737400.

  • S. Le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression. Appl. Statist.,

41(1):191–201, 1992. URL http://www.jstor.org/stable/2347628.

  • P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671–680, Raleigh, NC, 2010.
  • P. Li, A. Owen, and C.-H. Zhang. One permutation hashing. In F. Pereira, C. J. C. Burges,
  • L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing

Systems 25, pages 3113–3121. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4778-one-permutation-hashing.pdf.

  • A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt,
  • D. Koller, Y. Singer, and S. Roweis, editors, Adv. Neural. Inform. Process Syst., volume 20,

pages 1177–1184. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.

  • Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan. Hash kernels for

structured data. Journal of Machine Learning Research, 10:2615–2637, 2009.

75 / 76

slide-98
SLIDE 98

References III

  • C. Stone. Consistent nonparametric regression. Ann. Stat., 8:1348–1360, 1977. URL

http://links.jstor.org/sici?sici=0090-5364%28197707%295%3A4%3C595%3ACNR%3E2.0.CO%3B2-O.

76 / 76