Large-Scale Machine Learning
Jean-Philippe Vert (jean-philippe.vert@{mines-paristech,curie,ens}.fr)

Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression)


  1. k-means example: the Iris dataset. [Figure: the 150 iris samples plotted in the plane of their first two principal components, PC1 vs. PC2.]

     > irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
     > table(irisCluster$cluster, iris$Species)
         setosa versicolor virginica
       1      0         48         4
       2     50          0         0
       3      0          2        46

  2. k-means example: Iris k-means, k = 2. [Figure: the same PC1/PC2 projection, points colored by cluster.]

  3. k-means example: Iris k-means, k = 3. [Figure: same projection, three clusters; this is the clustering summarized by the table above.]

  4. k-means example: Iris k-means, k = 4. [Figure: same projection, four clusters.]

  5. k-means example: Iris k-means, k = 5. [Figure: same projection, five clusters.]

  6. k-means complexity. Each assignment step: O(ndk) (distance from each of the n points to each of the k centroids in dimension d). Each centroid update step: O(nd).
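
      To make the two costs concrete, here is an illustrative Lloyd-iteration sketch in R (a toy implementation, not the slides' code; base R's kmeans() is what the example above actually uses):

        # Minimal k-means (Lloyd's algorithm) sketch; X is an n x d numeric matrix
        lloyd_kmeans <- function(X, k, iters = 20) {
          n <- nrow(X)
          centers <- X[sample(n, k), , drop = FALSE]          # random initialization
          for (it in seq_len(iters)) {
            # Assignment step, O(ndk): squared distance of every point to every center
            d2 <- outer(rowSums(X^2), rep(1, k)) - 2 * X %*% t(centers) +
                  outer(rep(1, n), rowSums(centers^2))
            cluster <- max.col(-d2)                            # nearest center for each point
            # Update step, O(nd): each centroid becomes the mean of its assigned points
            for (j in seq_len(k)) {
              if (any(cluster == j))
                centers[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
            }
          }
          list(cluster = cluster, centers = centers)
        }

        # Usage on the slides' data:
        # res <- lloyd_kmeans(as.matrix(log(iris[, 1:4])), k = 3)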

  7. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Large-scale machine learning; 4 Conclusion.

  8. Motivation (regression). [Figure: scatter plot of noisy (x, y) data, x in [0, 5], y in [0, 12].] Predict a continuous output from an input.

  9. Second build of the same slide and figure.

  10. Model. Training set S = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × R. Fit a linear function f_β(x) = β^T x. Goodness of fit is measured by the residual sum of squares:
      RSS(β) = Σ_{i=1}^n (y_i - f_β(x_i))^2
      Ridge regression minimizes the regularized RSS:
      min_β RSS(β) + λ Σ_{i=1}^d β_i^2
      Solution (set the gradient to 0):
      β̂ = (X^T X + λ I)^{-1} X^T Y
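
      A minimal R sketch of this closed-form solution (the function name and toy data are illustrative, not from the slides):

        # Ridge regression via the closed form beta = (X'X + lambda I)^{-1} X'Y
        ridge_fit <- function(X, y, lambda) {
          d <- ncol(X)
          # crossprod(X) = X'X costs O(n d^2); solving the d x d system costs O(d^3)
          solve(crossprod(X) + lambda * diag(d), crossprod(X, y))
        }

        # Toy usage on simulated data
        set.seed(1)
        X <- matrix(rnorm(100 * 5), 100, 5)
        y <- drop(X %*% c(1, -2, 0, 0, 3)) + rnorm(100)
        beta_hat <- ridge_fit(X, y, lambda = 0.1)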

  11. Ridge regression complexity. Computing X^T X: O(n d^2). Inverting (X^T X + λ I): O(d^3). So when n > d, computing X^T X is more expensive than inverting it!

  12. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Large-scale machine learning; 4 Conclusion.

  13. Motivation (classification). Predict the category of a data point; there are 2 or more (sometimes many) categories. [Figure: example data to classify.]

  14.-16. Further builds of the same slide (additional illustrative figures).

  17. k-nearest neighbors (kNN). [Figure: two-class simulated data with a kNN decision boundary (Hastie et al., The Elements of Statistical Learning, Springer, 2001).] Training set S = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × {-1, 1}. No training. Given a new point x ∈ R^d, predict the majority class among its k nearest neighbors (take k odd).
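
      For concreteness, a possible R sketch using knn() from the class package (the data choice and train/test split are illustrative, not from the slides):

        library(class)                                   # provides knn()

        # Two-class toy problem: setosa (+1) vs. the rest (-1), first two iris features
        X <- as.matrix(iris[, 1:2])
        y <- factor(ifelse(iris$Species == "setosa", 1, -1))

        # No training phase: knn() simply stores the training data, O(nd) memory
        set.seed(0)
        train_idx <- sample(nrow(X), 100)
        pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
                    cl = y[train_idx], k = 5)            # k odd to avoid ties
        mean(pred == y[-train_idx])                      # test accuracy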

  18. kNN properties. Universal Bayes consistency [Stone, 1977]: take k = sqrt(n) (for example), let P be any distribution over (X, Y) pairs, and assume the training data are random pairs sampled i.i.d. according to P. Then the kNN classifier f̂_n satisfies, almost surely,
      lim_{n → +∞} P( f̂_n(X) ≠ Y ) = inf_{f measurable} P( f(X) ≠ Y ).
      Complexity: memory: storing X is O(nd); training time: 0; prediction: O(nd) for each test point.

  19. Linear models for classification. Training set S = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × {-1, 1}. Fit a linear function f_β(x) = β^T x. The prediction on a new point x ∈ R^d is +1 if f_β(x) > 0, and -1 otherwise.

  20. Large-margin classifiers. For any f: R^d → R, the margin of f on an (x, y) pair is y f(x). Large-margin classifiers fit a classifier by maximizing the margins on the training set:
      min_β Σ_{i=1}^n ℓ(y_i f_β(x_i)) + λ β^T β
      for a convex, non-increasing loss function ℓ: R → R_+.

  21. Loss function examples.
      Loss      ℓ(u)               Method
      0-1       1(u ≤ 0)           none
      Hinge     max(1 - u, 0)      Support vector machine (SVM)
      Logistic  log(1 + e^{-u})    Logistic regression
      Square    (1 - u)^2          Ridge regression
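
      These losses are easy to write down; a small illustrative R sketch defines them and plots their shapes as functions of the margin:

        # Margin-based losses as functions of the margin u = y * f(x)
        loss_01       <- function(u) as.numeric(u <= 0)
        loss_hinge    <- function(u) pmax(1 - u, 0)
        loss_logistic <- function(u) log(1 + exp(-u))
        loss_square   <- function(u) (1 - u)^2

        # Compare the shapes of the convex surrogates to the 0-1 loss
        u <- seq(-2, 2, by = 0.01)
        matplot(u, cbind(loss_01(u), loss_hinge(u), loss_logistic(u), loss_square(u)),
                type = "l", lty = 1, xlab = "margin u", ylab = "loss")
        legend("topright", c("0-1", "hinge", "logistic", "square"), col = 1:4, lty = 1)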

  22. Ridge logistic regression [Le Cessie and van Houwelingen, 1992].
      min_{β ∈ R^d} J(β) = Σ_{i=1}^n ln(1 + e^{-y_i β^T x_i}) + λ β^T β
      Can be interpreted as a regularized conditional maximum likelihood estimator. There is no explicit solution, but this is a smooth convex optimization problem that can be solved numerically by Newton-Raphson iterations:
      β_new ← β_old - [∇²_β J(β_old)]^{-1} ∇_β J(β_old).
      Each iteration amounts to solving a weighted ridge regression problem, hence the name iteratively reweighted least squares (IRLS). Complexity: O(iterations × (n d^2 + d^3)).
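
      An illustrative Newton/IRLS sketch of this procedure in R (not the slides' code; the function name and toy data are assumptions):

        # Newton-Raphson / IRLS sketch for ridge logistic regression (illustrative, not optimized)
        # Objective: J(beta) = sum_i log(1 + exp(-y_i * x_i' beta)) + lambda * beta' beta
        ridge_logistic <- function(X, y, lambda, iters = 25) {
          d <- ncol(X)
          beta <- rep(0, d)
          sigmoid <- function(t) 1 / (1 + exp(-t))
          for (it in seq_len(iters)) {
            m <- y * drop(X %*% beta)                       # margins y_i * x_i' beta
            p <- sigmoid(m)
            grad <- drop(t(X) %*% ((p - 1) * y)) + 2 * lambda * beta
            w <- p * (1 - p)                                # IRLS weights
            H <- t(X) %*% (w * X) + 2 * lambda * diag(d)    # Hessian: X'WX + 2*lambda*I
            beta <- beta - solve(H, grad)                   # Newton step, O(n d^2 + d^3)
          }
          beta
        }

        # Toy usage with labels in {-1, +1}
        set.seed(2)
        X <- matrix(rnorm(200 * 3), 200, 3)
        y <- ifelse(drop(X %*% c(2, -1, 0)) + rnorm(200) > 0, 1, -1)
        beta_hat <- ridge_logistic(X, y, lambda = 1)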

  23. SVM [Boser et al., 1992].
      min_{β ∈ R^d} Σ_{i=1}^n max(0, 1 - y_i β^T x_i) + λ β^T β
      A non-smooth convex optimization problem (convex quadratic program). Equivalent to the dual problem
      max_{α ∈ R^n} 2 α^T Y - α^T X X^T α   s.t.  0 ≤ y_i α_i ≤ 1/(2λ) for i = 1, ..., n.
      The solution β* of the primal is obtained from the solution α* of the dual: β* = X^T α*, and f_{β*}(x) = (β*)^T x = (α*)^T X x.
      Training complexity: O(n^2) to store X X^T, O(n^3) to find α*. Prediction: O(d) for (β*)^T x, O(nd) for (α*)^T X x.

  24. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Large-scale machine learning; 4 Conclusion.

  25. Motivation. [Figure: scatter plot of (x, y) data, x in [0, 10], y roughly in [-1, 2], whose trend motivates a nonlinear model.]

  26. Model. Learn a function f: R^d → R of the form
      f(x) = Σ_{i=1}^n α_i K(x_i, x)
      for a positive definite (p.d.) kernel K: R^d × R^d → R, such as:
      Linear:     K(x, x') = x^T x'
      Polynomial: K(x, x') = (x^T x' + c)^p
      Gaussian:   K(x, x') = exp(-||x - x'||^2 / (2σ^2))
      Min/max:    K(x, x') = Σ_{i=1}^d min(|x_i|, |x'_i|) / max(|x_i|, |x'_i|)
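
      As an illustration, a minimal R sketch of the Gram matrix of the Gaussian RBF kernel (the helper name gaussian_gram is made up here; packages such as kernlab offer optimized alternatives):

        # Gram matrix of the Gaussian RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
        gaussian_gram <- function(X, Z = X, sigma = 1) {
          # Squared Euclidean distances via ||x - z||^2 = ||x||^2 - 2 x'z + ||z||^2
          d2 <- outer(rowSums(X^2), rep(1, nrow(Z))) - 2 * X %*% t(Z) +
                outer(rep(1, nrow(X)), rowSums(Z^2))
          exp(-d2 / (2 * sigma^2))
        }

        K <- gaussian_gram(as.matrix(iris[, 1:4]), sigma = 2)   # 150 x 150 Gram matrix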

  27. Feature space. A function K: R^d × R^d → R is a p.d. kernel if and only if there exists a mapping Φ: R^d → R^D, for some D ∈ N ∪ {+∞}, such that
      K(x, x') = Φ(x)^T Φ(x')   for all x, x' ∈ R^d.
      f is then a linear function in R^D:
      f(x) = Σ_{i=1}^n α_i K(x_i, x) = Σ_{i=1}^n α_i Φ(x_i)^T Φ(x) = β^T Φ(x)
      for β = Σ_{i=1}^n α_i Φ(x_i). [Figure: illustration of a feature map Φ from R^2 (coordinates x1, x2) to a higher-dimensional feature space.]

  28. Learning. [Figure: the same feature-map illustration as on the previous slide.] We can learn f(x) = Σ_{i=1}^n α_i K(x_i, x) by fitting a linear model β^T Φ(x) in the feature space. Example: ridge regression / logistic regression / SVM:
      min_{β ∈ R^D} Σ_{i=1}^n ℓ(y_i, β^T Φ(x_i)) + λ β^T β
      But D can be very large, even infinite...

  29. Kernel tricks. K(x, x') = Φ(x)^T Φ(x') can be quick to compute even if D is large (even infinite). For a set of training samples {x_1, ..., x_n} ⊂ R^d, let K_n be the n × n Gram matrix [K_n]_ij = K(x_i, x_j). For β = Σ_{i=1}^n α_i Φ(x_i) we have
      β^T Φ(x_i) = [K α]_i   and   β^T β = α^T K α.
      We can therefore solve the equivalent problem in α ∈ R^n:
      min_{α ∈ R^n} Σ_{i=1}^n ℓ(y_i, [K α]_i) + λ α^T K α

  30. Example: kernel ridge regression (KRR).
      min_{β ∈ R^D} Σ_{i=1}^n (y_i - β^T Φ(x_i))^2 + λ β^T β
      Solve in R^D (a D × D linear system):  β̂ = (Φ(X)^T Φ(X) + λ I)^{-1} Φ(X)^T Y
      Solve in R^n (an n × n linear system): α̂ = (K + λ I)^{-1} Y
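
      Putting the last two slides together, a possible KRR sketch in R (it reuses the gaussian_gram helper sketched above; data and names are illustrative, the toy data only loosely mimic the slides' figure):

        # Kernel ridge regression: alpha = (K + lambda I)^{-1} Y, f(x) = sum_i alpha_i K(x_i, x)
        krr_fit <- function(X, y, lambda, sigma = 1) {
          K <- gaussian_gram(X, sigma = sigma)                  # O(d n^2) time, O(n^2) memory
          alpha <- solve(K + lambda * diag(nrow(X)), y)         # O(n^3)
          list(alpha = alpha, X = X, sigma = sigma)
        }
        krr_predict <- function(fit, Xnew) {
          gaussian_gram(Xnew, fit$X, sigma = fit$sigma) %*% fit$alpha   # O(nd) per test point
        }

        # Toy usage on noisy sinusoidal data
        set.seed(3)
        x <- matrix(runif(100, 0, 10), ncol = 1)
        y <- sin(x) + 0.3 * rnorm(100)
        fit <- krr_fit(x, y, lambda = 0.1, sigma = 1)
        yhat <- krr_predict(fit, matrix(seq(0, 10, 0.1), ncol = 1))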

  31. KRR with Gaussian RBF kernel:
      min_{β ∈ R^D} Σ_{i=1}^n (y_i - β^T Φ(x_i))^2 + λ β^T β,   with K(x, x') = exp(-||x - x'||^2 / (2σ^2)).
      [Figure: the (x, y) data from the motivation slide.]

  32.-42. The same slide repeated with the fitted KRR function overlaid, for decreasing regularization: lambda = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001. As lambda decreases, the fit goes from very smooth (underfitting) to wiggly enough to chase the noise (overfitting).

  43. Complexity of kernel methods. [Figure: the KRR fit with lambda = 1.] Compute K: O(d n^2). Store K: O(n^2). Solve for α: O(n^2) to O(n^3), depending on the solver. Compute f(x) for one x: O(nd). Impractical for n larger than roughly 10k-100k.

  44. Outline: 1 Introduction; 2 Standard machine learning (Dimension reduction: PCA; Clustering: k-means; Regression: ridge regression; Classification: kNN, logistic regression and SVM; Nonlinear models: kernel methods); 3 Large-scale machine learning (Scalability issues; The tradeoffs of large-scale learning; Random projections; Random features; Approximate NN; Shingling, hashing, sketching); 4 Conclusion.

  45. Outline: 1 Introduction; 2 Standard machine learning; 3 Large-scale machine learning (Scalability issues; The tradeoffs of large-scale learning; Random projections; Random features; Approximate NN; Shingling, hashing, sketching); 4 Conclusion.

  46. What is "large-scale"? Data cannot fit in RAM. The algorithm cannot run on a single machine in reasonable time (algorithm-dependent). Sometimes even O(n) is too large (e.g., nearest-neighbor search in a database of O(1B+) items). Many tasks / parameters (e.g., image categorization with O(10M) classes). Streams of data.

  47. Things to worry about: training time (usually offline), memory requirements, test time. Complexities so far:
      Method                 Memory    Training time    Test time
      PCA                    O(d^2)    O(n d^2)         O(d)
      k-means                O(nd)     O(ndk)           O(kd)
      Ridge regression       O(d^2)    O(n d^2)         O(d)
      kNN                    O(nd)     0                O(nd)
      Logistic regression    O(nd)     O(n d^2)         O(d)
      SVM, kernel methods    O(n^2)    O(n^3)           O(nd)

  48. Techniques for large-scale machine learning. Good baselines: subsample the data and run a standard method; split the data and run on several machines (depends on the algorithm). Beyond that: revisit standard algorithms and implementations with scalability in mind; trade exactness for scalability; compress, sketch, or hash the data in a smart way.

  49. Outline: 1 Introduction; 2 Standard machine learning; 3 Large-scale machine learning (Scalability issues; The tradeoffs of large-scale learning; Random projections; Random features; Approximate NN; Shingling, hashing, sketching); 4 Conclusion.

  50. Motivation. Classical learning theory analyzes the trade-off between the approximation error (how well we approximate the true function) and the estimation error (how well we estimate the parameters). But reaching the best trade-off for a given n may be impossible with limited computational resources. We should include the computational budget in the trade-off, and see which optimization algorithm gives the best trade-off. Seminal paper of Bottou and Bousquet [2008].

  51. Classical ERM setting. Goal: learn a function f: R^d → Y (Y = R or {-1, 1}). P is an unknown distribution over R^d × Y. Training set: S = {(X_1, Y_1), ..., (X_n, Y_n)} ⊂ R^d × Y, sampled i.i.d. following P. Fix a class of functions F ⊂ {f: R^d → R}. Choose a loss ℓ(y, f(x)). Learning by empirical risk minimization:
      f_n ∈ argmin_{f ∈ F} R_n[f] = (1/n) Σ_{i=1}^n ℓ(Y_i, f(X_i))
      Hope that f_n has a small risk: R[f_n] = E ℓ(Y, f_n(X)).

  52. Classical ERM setting (continued). The best possible risk is R* = min_{f: R^d → Y} R[f]. The best achievable risk over F is R*_F = min_{f ∈ F} R[f]. We then have the decomposition
      R[f_n] - R* = (R[f_n] - R*_F) + (R*_F - R*)
      where the first term is the estimation error ε_est and the second the approximation error ε_app.

  53. Optimization error. Solving the ERM problem may be hard (when n and d are large). Instead we usually find an approximate solution f̃_n that satisfies R_n[f̃_n] ≤ R_n[f_n] + ρ. The excess risk of f̃_n is then
      ε = R[f̃_n] - R* = (R[f̃_n] - R[f_n]) + ε_est + ε_app
      where the first term is the optimization error ε_opt.

  54. A new trade-off: ε = ε_app + ε_est + ε_opt.
      Problem: choose F, n, and ρ to make ε as small as possible, subject to a limit on n and on the computation time T.
      Table 1: typical variations when F, n, and ρ increase.
                                       F     n     ρ
      E_app (approximation error)      ↘
      E_est (estimation error)         ↗     ↘
      E_opt (optimization error)       ···   ···   ↗
      T (computation time)             ↗     ↗     ↘
      Large-scale or small-scale? Small-scale when the constraint on n is active; large-scale when the constraint on T is active.

  55. Comparing optimization methods for solving
      min_{β ∈ B ⊂ R^d} R_n[f_β] = Σ_{i=1}^n ℓ(y_i, f_β(x_i))
      Gradient descent (GD): β_{t+1} ← β_t - η ∂R_n(f_{β_t}) / ∂β
      Second-order gradient descent (2GD), assuming the Hessian H is known: β_{t+1} ← β_t - H^{-1} ∂R_n(f_{β_t}) / ∂β
      Stochastic gradient descent (SGD): β_{t+1} ← β_t - (η / t) ∂ℓ(y_t, f_{β_t}(x_t)) / ∂β
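
      To make the SGD update concrete, an illustrative R sketch for the logistic loss with an η/t step size (a toy implementation, not the experimental code of the paper):

        # SGD for logistic regression: one randomly drawn example per update, step size eta/t
        sgd_logistic <- function(X, y, eta = 1, epochs = 5) {
          n <- nrow(X); d <- ncol(X)
          beta <- rep(0, d)
          t <- 0
          for (ep in seq_len(epochs)) {
            for (i in sample(n)) {                            # pass over shuffled data
              t <- t + 1
              m <- y[i] * sum(X[i, ] * beta)                  # margin of the current example
              grad <- -y[i] * X[i, ] / (1 + exp(m))           # gradient of log(1 + exp(-m))
              beta <- beta - (eta / t) * grad                 # O(d) per update
            }
          }
          beta
        }

        # Toy usage with labels in {-1, +1}
        set.seed(4)
        X <- matrix(rnorm(1000 * 10), 1000, 10)
        y <- ifelse(drop(X %*% rnorm(10)) > 0, 1, -1)
        beta_sgd <- sgd_logistic(X, y)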

  56. Results [Bottou and Bousquet, 2008].
      Algorithm | Cost of one iteration | Iterations to reach ρ | Time to reach accuracy ρ   | Time to reach E ≤ c(E_app + ε)
      GD        | O(nd)                 | O(κ log(1/ρ))         | O(ndκ log(1/ρ))            | O(d^2 κ ε^{-1/α} log^2(1/ε))
      2GD       | O(d^2 + nd)           | O(log log(1/ρ))       | O((d^2 + nd) log log(1/ρ)) | O(d^2 ε^{-1/α} log(1/ε) log log(1/ε))
      SGD       | O(d)                  | ν κ^2 / ρ + o(1/ρ)    | O(d ν κ^2 / ρ)             | O(d ν κ^2 / ε)
      2SGD      |                       |                       |                            |
      Here α ∈ [1/2, 1] comes from the bound on ε_est and depends on the data; in the last column, n and ρ are optimized to reach ε for each method. 2GD optimizes much faster than GD, but the gain on the final learning time is limited by the ε^{-1/α} factor coming from the estimation error. SGD: its optimization speed is catastrophic, but its learning speed is the best, and it is independent of α. This suggests that SGD is very competitive (and it has become the de facto standard in large-scale ML).

  57. Illustration: https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf

  58. Outline: 1 Introduction; 2 Standard machine learning; 3 Large-scale machine learning (Scalability issues; The tradeoffs of large-scale learning; Random projections; Random features; Approximate NN; Shingling, hashing, sketching); 4 Conclusion.

  59. Motivation (dimension reduction). High dimension d affects the scalability of algorithms (e.g., O(nd) for kNN or O(d^3) for ridge regression). High-dimensional data are hard to visualize. (Sometimes) counterintuitive phenomena arise in high dimension, e.g., concentration of measure for Gaussian data. [Figure: histograms of ||x||/sqrt(k) for Gaussian samples with d = 1, 10, and 100; the distribution concentrates as d grows.] Statistical inference degrades when d increases (curse of dimension).

  60. Dimension reduction with PCA. [Figure: 2-D point cloud with its principal axes PC1 and PC2.] PCA projects the data onto k < d dimensions that capture the largest amount of variance. It also minimizes the total reconstruction error:
      min_{S_k} Σ_{i=1}^n ||x_i - Π_{S_k}(x_i)||^2
      But it is computationally expensive: O(n d^2). And there is no theoretical guarantee on distance preservation.
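
      For reference, PCA is one call in R; a minimal sketch on the iris features used in the k-means figures above:

        # PCA of the log-transformed iris features (the k-means figures plot such a PC1/PC2 plane)
        pca <- prcomp(log(iris[, 1:4]), center = TRUE, scale. = FALSE)
        scores <- pca$x[, 1:2]                   # n x 2 projections onto PC1 and PC2
        summary(pca)                             # proportion of variance captured by each PC
        plot(scores, xlab = "PC1", ylab = "PC2")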

  61. Linear dimension reduction: X' = X × R, where X' is n × k, X is n × d, and R is d × k. Can we find R efficiently? Can we preserve distances, i.e., ||f(x_i) - f(x_j)|| ≈ ||x_i - x_j|| for all i, j = 1, ..., n? Note: when d > n, we can take k = n and preserve all distances exactly (kernel trick).

  62. Random projections. Simply take a random projection matrix:
      f(x) = (1/√k) R^T x,   with R_ij ~ N(0, 1).
      Theorem [Johnson and Lindenstrauss, 1984]: for any ε > 0 and n ∈ N, take
      k ≥ 4 (ε^2/2 - ε^3/3)^{-1} log(n) ≈ ε^{-2} log(n).
      Then the following holds with probability at least 1 - 1/n:
      (1 - ε) ||x_i - x_j||^2 ≤ ||f(x_i) - f(x_j)||^2 ≤ (1 + ε) ||x_i - x_j||^2   for all i, j = 1, ..., n.
      Note that k does not depend on d! For n = 1M and ε = 0.1, k ≈ 5K; for n = 1B and ε = 0.1, k ≈ 8K.
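
      An illustrative R sketch of such a Gaussian random projection, with an empirical check of the pairwise-distance distortion (the toy sizes are arbitrary and do not match the theorem's constants):

        # Gaussian random projection f(x) = R^T x / sqrt(k), and empirical distortion check
        set.seed(5)
        n <- 200; d <- 5000; k <- 500
        X <- matrix(rnorm(n * d), n, d)                 # toy high-dimensional data
        R <- matrix(rnorm(d * k), d, k)                 # R_ij ~ N(0, 1)
        Xproj <- X %*% R / sqrt(k)                      # n x k projected data

        orig <- as.vector(dist(X))                      # pairwise Euclidean distances
        proj <- as.vector(dist(Xproj))                  # distances after projection
        summary(proj / orig)                            # ratios should be close to 1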

  63. Proof (1/3). For a single dimension, q_j = r_j^T u:
      E(q_j) = E(r_j)^T u = 0,   E(q_j^2) = u^T E(r_j r_j^T) u = ||u||^2.
      For the k-dimensional projection f(u) = (1/√k) R^T u:
      ||f(u)||^2 = (1/k) Σ_{j=1}^k q_j^2  ~  (||u||^2 / k) χ^2(k)
      E ||f(u)||^2 = (1/k) Σ_{j=1}^k E(q_j^2) = ||u||^2
      It remains to show that ||f(u)||^2 is concentrated around its mean.

  64. Proof (2/3).
      P( ||f(u)||^2 > (1 + ε) ||u||^2 )
        = P( χ^2(k) > (1 + ε) k )
        = P( e^{λ χ^2(k)} > e^{λ(1+ε)k} )
        ≤ E( e^{λ χ^2(k)} ) e^{-λ(1+ε)k}                      (Markov)
        = (1 - 2λ)^{-k/2} e^{-λ(1+ε)k}                        (MGF of χ^2(k), for 0 ≤ λ < 1/2)
        = ( (1 + ε) e^{-ε} )^{k/2}                            (take λ = ε / (2(1+ε)))
        ≤ e^{-(ε^2/2 - ε^3/3) k / 2}                          (use log(1+x) ≤ x - x^2/2 + x^3/3)
        = n^{-2}                                              (take k = 4 log(n) / (ε^2/2 - ε^3/3))
      Similarly, P( ||f(u)||^2 < (1 - ε) ||u||^2 ) < n^{-2}.

  65. Proof (3/3). Apply the bound with u = x_i - x_j and use the linearity of f to show that, for any (x_i, x_j) pair, the probability of a large distortion is ≤ 2 n^{-2}. Union bound: over all n(n-1)/2 pairs, the probability that at least one pair suffers a large distortion is smaller than
      (n(n-1)/2) × 2 n^{-2} = 1 - 1/n.

  66. Scalability. With n = O(1B) and d = O(1M), we get k = O(10K). Memory: we need to store R, which is O(dk) ≈ 40 GB. Computation: X × R costs O(ndk). Other random matrices R have similar properties but better scalability, e.g.:
      "Add or subtract" [Achlioptas, 2003]: R_ij = +1 with probability 1/2, -1 with probability 1/2; 1 bit per entry, size ≈ 1.25 GB.
      Fast Johnson-Lindenstrauss transform [Ailon and Chazelle, 2009]: R = PHD, and f(x) can be computed in O(d log d).

  67. Outline: 1 Introduction; 2 Standard machine learning; 3 Large-scale machine learning (Scalability issues; The tradeoffs of large-scale learning; Random projections; Random features; Approximate NN; Shingling, hashing, sketching); 4 Conclusion.

  68. Motivation. [Diagram: a kernel feature map Φ sends R^d to a high-dimensional space R^D, while a JL random projection maps to a low-dimensional R^k. Random features? That is, can we map directly to R^k while approximating the kernel?]

  69. Fourier feature space. Example: the Gaussian kernel
      e^{-||x - x'||^2 / 2} = (2π)^{-d/2} ∫_{R^d} e^{i ω^T (x - x')} e^{-||ω||^2 / 2} dω
                            = E_ω [ cos(ω^T (x - x')) ]
                            = E_{ω,b} [ 2 cos(ω^T x + b) cos(ω^T x' + b) ]
      with ω ~ p(dω) = (2π)^{-d/2} e^{-||ω||^2 / 2} dω and b ~ U([0, 2π]).
      This is of the form K(x, x') = Φ(x)^T Φ(x') with D = +∞:
      Φ: R^d → L^2( (R^d, p(dω)) × ([0, 2π], U) )

  70. Random Fourier features [Rahimi and Recht, 2008]. For i = 1, ..., k, sample randomly
      (ω_i, b_i) ~ p(dω) × U([0, 2π])
      and create random features: for all x ∈ R^d,
      f_i(x) = sqrt(2/k) cos(ω_i^T x + b_i)

  71. Random Fourier features [Rahimi and Recht, 2008]. For any x, x' ∈ R^d, it holds that
      E[ f(x)^T f(x') ] = E[ Σ_{i=1}^k f_i(x) f_i(x') ]
                        = (1/k) Σ_{i=1}^k E[ 2 cos(ω^T x + b) cos(ω^T x' + b) ]
                        = K(x, x')
      and by Hoeffding's inequality,
      P( | f(x)^T f(x') - K(x, x') | > ε ) ≤ 2 e^{-k ε^2 / 2}.
      This allows us to approximate learning with the Gaussian kernel by a simple linear model in k dimensions!
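
      A small R sketch of this construction for the Gaussian kernel with σ = 1 (the comparison against the exact Gram matrix is an added illustration, not from the slides):

        # Random Fourier features approximating the Gaussian kernel exp(-||x - x'||^2 / 2)
        rff_map <- function(X, k) {
          d <- ncol(X)
          W <- matrix(rnorm(d * k), d, k)               # omega_i ~ N(0, I_d), i.e., p(d omega)
          b <- runif(k, 0, 2 * pi)                      # b_i ~ U([0, 2*pi])
          sqrt(2 / k) * cos(sweep(X %*% W, 2, b, "+"))  # n x k matrix of features f_i(x)
        }

        set.seed(6)
        X <- matrix(rnorm(50 * 5), 50, 5)
        Z <- rff_map(X, k = 2000)
        K_approx <- Z %*% t(Z)                          # f(x)^T f(x') for all pairs
        K_exact  <- exp(-as.matrix(dist(X))^2 / 2)      # exact Gaussian kernel, sigma = 1
        max(abs(K_approx - K_exact))                    # small for large k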

  72. Generalization. A translation-invariant (t.i.) kernel is of the form K(x, x') = φ(x - x').
      Bochner's theorem: for a continuous function φ: R^d → R, K is p.d. if and only if φ is the Fourier-Stieltjes transform of a symmetric and positive finite Borel measure μ ∈ M(R^d):
      φ(x) = ∫_{R^d} e^{-i ω^T x} dμ(ω)
      Just sample ω_i ~ dμ(ω) / μ(R^d) and b_i ~ U([0, 2π]) to approximate any t.i. kernel K with random features sqrt(2/k) cos(ω_i^T x + b_i).

  73. Examples: K(x, x') = φ(x - x') = ∫_{R^d} e^{-i ω^T (x - x')} dμ(ω)
      Kernel     φ(x)                          μ(dω)
      Gaussian   exp(-||x||^2 / 2)             (2π)^{-d/2} exp(-||ω||^2 / 2)
      Laplace    exp(-||x||_1)                 Π_{i=1}^d 1 / (π (1 + ω_i^2))
      Cauchy     Π_{i=1}^d 2 / (1 + x_i^2)     e^{-||ω||_1}

  74. Performance [Rahimi and Recht, 2008]. [Figure: empirical results from the paper.]
