k-means example: Iris dataset

[Figure: Iris data projected on the first two principal components (PC1 vs. PC2).]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46

k-means example: Iris, k-means with k = 2
[Figure: the same PCA projection, points colored by cluster (Cluster 1, Cluster 2).]

k-means example: Iris, k-means with k = 3
[Figure: the same PCA projection, points colored by cluster (Clusters 1-3).]

k-means example: Iris, k-means with k = 4
[Figure: the same PCA projection, points colored by cluster (Clusters 1-4).]

k-means example: Iris, k-means with k = 5
[Figure: the same PCA projection, points colored by cluster (Clusters 1-5).]

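A rough sketch of how plots like these could be produced (an assumed illustration, not the slides' own plotting code): project the log-transformed Iris features with PCA and color the points by k-means cluster.

# Sketch: PCA projection of log(iris) colored by k-means clusters for k = 2..5
X <- log(iris[, 1:4])
pca <- prcomp(X, center = TRUE, scale. = FALSE)

for (k in 2:5) {
  cl <- kmeans(X, centers = k, nstart = 20)
  plot(pca$x[, 1], pca$x[, 2], col = cl$cluster, pch = 19,
       xlab = "PC1", ylab = "PC2",
       main = paste("Iris k-means, k =", k))
}
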
k-means complexity

Each update step: O(nd)
Each assignment step: O(ndk)

Outline

1. Introduction
2. Standard machine learning
   Dimension reduction: PCA
   Clustering: k-means
   Regression: ridge regression
   Classification: kNN, logistic regression and SVM
   Nonlinear models: kernel methods
3. Large-scale machine learning
4. Conclusion

Motivation

[Figure: scatter plot of a continuous response y against an input x.]

Predict a continuous output from an input

Model

Training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$
Fit a linear function: $f_\beta(x) = \beta^\top x$
Goodness of fit measured by residual sum of squares:
$$RSS(\beta) = \sum_{i=1}^n \left(y_i - f_\beta(x_i)\right)^2$$
Ridge regression minimizes the regularized RSS:
$$\min_\beta \; RSS(\beta) + \lambda \sum_{i=1}^d \beta_i^2$$
Solution (set gradient to 0):
$$\hat\beta = \left(X^\top X + \lambda I\right)^{-1} X^\top Y$$

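A minimal sketch of the closed-form ridge solution above, on simulated data (the data and variable names are assumptions for illustration, not from the slides):

# Ridge closed form: beta_hat = (X'X + lambda I)^{-1} X'Y on simulated data
set.seed(1)
n <- 100; d <- 5; lambda <- 1
X <- matrix(rnorm(n * d), n, d)
beta_true <- rnorm(d)
Y <- X %*% beta_true + rnorm(n)

beta_hat <- solve(t(X) %*% X + lambda * diag(d), t(X) %*% Y)
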
Ridge regression complexity

Compute $X^\top X$: O(nd^2)
Invert $X^\top X + \lambda I$: O(d^3)

When n > d, computing $X^\top X$ is more expensive than inverting it!

Outline

1. Introduction
2. Standard machine learning
   Dimension reduction: PCA
   Clustering: k-means
   Regression: ridge regression
   Classification: kNN, logistic regression and SVM
   Nonlinear models: kernel methods
3. Large-scale machine learning
4. Conclusion

Motivation

Predict the category of a data point
2 or more (sometimes many) categories

k-nearest neighbors (kNN)

[Figure: two-class point cloud with a kNN decision boundary (Hastie et al., The Elements of Statistical Learning, Springer, 2001).]

Training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \{-1, 1\}$
No training
Given a new point $x \in \mathbb{R}^d$, predict the majority class among its k nearest neighbors (take k odd)

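A minimal sketch of the kNN prediction rule (an assumed illustration; labels are taken to be integers such as -1/+1, and the function name is hypothetical):

# Predict the majority label among the k nearest training points (O(nd) per query)
knn_predict <- function(X_train, y_train, x_new, k = 5) {
  d2 <- rowSums(sweep(X_train, 2, x_new)^2)          # squared distances to x_new
  nn <- order(d2)[1:k]                               # indices of the k nearest neighbors
  as.integer(names(which.max(table(y_train[nn]))))   # majority vote
}

# Example usage:
# X_train <- matrix(rnorm(200), 100, 2); y_train <- ifelse(X_train[, 1] > 0, 1, -1)
# knn_predict(X_train, y_train, c(0.5, 0), k = 7)
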
kNN properties

Uniform Bayes consistency [Stone, 1977]
Take $k = \sqrt{n}$ (for example)
Let P be any distribution over (X, Y) pairs
Assume training data are random pairs sampled i.i.d. according to P
Then the kNN classifier $\hat{f}_n$ satisfies, almost surely:
$$\lim_{n \to +\infty} P\left(\hat{f}_n(X) \neq Y\right) = \inf_{f \text{ measurable}} P\left(f(X) \neq Y\right)$$

Complexity:
Memory: store X, O(nd)
Training time: 0
Prediction: O(nd) for each test point

Linear models for classification

Training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \{-1, 1\}$
Fit a linear function $f_\beta(x) = \beta^\top x$
The prediction on a new point $x \in \mathbb{R}^d$ is:
$$\begin{cases} +1 & \text{if } f_\beta(x) > 0, \\ -1 & \text{otherwise.} \end{cases}$$

Large-margin classifiers

For any $f: \mathbb{R}^d \to \mathbb{R}$, the margin of f on an (x, y) pair is $y f(x)$
Large-margin classifiers fit a classifier by maximizing the margins on the training set:
$$\min_\beta \; \sum_{i=1}^n \ell\left(y_i f_\beta(x_i)\right) + \lambda \beta^\top \beta$$
for a convex, non-increasing loss function $\ell: \mathbb{R} \to \mathbb{R}^+$

Loss function examples

Loss       $\ell(u)$            Method
0-1        $1(u \leq 0)$        none
Hinge      $\max(1 - u, 0)$     Support vector machine (SVM)
Logistic   $\log(1 + e^{-u})$   Logistic regression
Square     $(1 - u)^2$          Ridge regression

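To make the table concrete, a small sketch (an assumed illustration, not from the slides) that plots the four losses as functions of the margin u = y f(x):

# The 0-1, hinge, logistic and square losses as functions of the margin
u <- seq(-2, 3, by = 0.01)
loss_01       <- as.numeric(u <= 0)
loss_hinge    <- pmax(1 - u, 0)
loss_logistic <- log(1 + exp(-u))
loss_square   <- (1 - u)^2

matplot(u, cbind(loss_01, loss_hinge, loss_logistic, loss_square),
        type = "l", lty = 1, xlab = "margin u", ylab = "loss",
        main = "0-1, hinge, logistic and square losses")
legend("topright", c("0-1", "hinge", "logistic", "square"), col = 1:4, lty = 1)
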
Ridge logistic regression [Le Cessie and van Houwelingen, 1992]

$$\min_{\beta \in \mathbb{R}^p} J(\beta) = \sum_{i=1}^n \ln\left(1 + e^{-y_i \beta^\top x_i}\right) + \lambda \beta^\top \beta$$

Can be interpreted as a regularized conditional maximum likelihood estimator
No explicit solution, but a smooth convex optimization problem that can be solved numerically by Newton-Raphson iterations:
$$\beta^{new} \leftarrow \beta^{old} - \left(\nabla^2_\beta J\left(\beta^{old}\right)\right)^{-1} \nabla_\beta J\left(\beta^{old}\right).$$
Each iteration amounts to solving a weighted ridge regression problem, hence the name iteratively reweighted least squares (IRLS).
Complexity: O(iterations × (nd^2 + d^3))

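A minimal IRLS sketch for this problem (an assumed illustration with a hypothetical function name, not the slides' code); each Newton step solves the weighted ridge regression mentioned above:

# IRLS for ridge logistic regression, labels y in {-1, +1}
ridge_logistic_irls <- function(X, y, lambda, n_iter = 25) {
  d <- ncol(X)
  beta <- rep(0, d)
  for (it in 1:n_iter) {
    p <- 1 / (1 + exp(-X %*% beta))               # P(Y = +1 | x) under the current model
    w <- as.vector(p * (1 - p))                   # Newton weights
    grad <- -t(X) %*% ((y == 1) - p) + 2 * lambda * beta
    hess <- t(X) %*% (w * X) + 2 * lambda * diag(d)
    beta <- beta - as.vector(solve(hess, grad))   # one weighted ridge step
  }
  beta
}
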
SVM [Boser et al., 1992]

$$\min_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^n \max\left(0, 1 - y_i \beta^\top x_i\right) + \lambda \beta^\top \beta$$

A non-smooth convex optimization problem (convex quadratic program)
Equivalent to the dual problem
$$\max_{\alpha \in \mathbb{R}^n} \; 2\alpha^\top Y - \alpha^\top X X^\top \alpha \quad \text{s.t.} \quad 0 \leq y_i \alpha_i \leq \frac{1}{2\lambda} \text{ for } i = 1, \ldots, n$$

The solution $\beta^*$ of the primal is obtained from the solution $\alpha^*$ of the dual:
$$\beta^* = X^\top \alpha^* \qquad f_{\beta^*}(x) = (\beta^*)^\top x = (\alpha^*)^\top X x$$

Training complexity: O(n^2) to store $XX^\top$, O(n^3) to find $\alpha^*$
Prediction: O(d) for $(\beta^*)^\top x$, O(nd) for $(\alpha^*)^\top X x$

Outline

1. Introduction
2. Standard machine learning
   Dimension reduction: PCA
   Clustering: k-means
   Regression: ridge regression
   Classification: kNN, logistic regression and SVM
   Nonlinear models: kernel methods
3. Large-scale machine learning
4. Conclusion

Motivation

[Figure: scatter plot of y against x (x from 0 to 10) showing a nonlinear relationship.]

Model

Learn a function $f: \mathbb{R}^d \to \mathbb{R}$ of the form
$$f(x) = \sum_{i=1}^n \alpha_i K(x_i, x)$$
for a positive definite (p.d.) kernel $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, such as

Linear       $K(x, x') = x^\top x'$
Polynomial   $K(x, x') = \left(x^\top x' + c\right)^p$
Gaussian     $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
Min/max      $K(x, x') = \sum_{i=1}^d \frac{\min(|x_i|, |x'_i|)}{\max(|x_i|, |x'_i|)}$

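As a small sketch (an assumed helper, not from the slides), the Gaussian kernel above can be evaluated on all pairs of rows of two matrices as follows:

# Gram matrix of the Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
gaussian_gram <- function(X1, X2, sigma = 1) {
  sq1 <- rowSums(X1^2)
  sq2 <- rowSums(X2^2)
  d2 <- outer(sq1, sq2, "+") - 2 * X1 %*% t(X2)   # pairwise squared distances
  exp(-d2 / (2 * sigma^2))
}
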
Feature space

A function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a p.d. kernel if and only if there exists a mapping $\Phi: \mathbb{R}^d \to \mathbb{R}^D$, for some $D \in \mathbb{N} \cup \{+\infty\}$, such that
$$\forall x, x' \in \mathbb{R}^d, \quad K(x, x') = \Phi(x)^\top \Phi(x')$$

f is then a linear function in $\mathbb{R}^D$:
$$f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) = \sum_{i=1}^n \alpha_i \Phi(x_i)^\top \Phi(x) = \beta^\top \Phi(x)$$
for $\beta = \sum_{i=1}^n \alpha_i \Phi(x_i)$.

[Figure: illustration of a feature map $\Phi$ from the input space $\mathbb{R}^2$ (coordinates $x_1, x_2$) to a feature space.]

Learning

[Figure: the same feature-map illustration as on the previous slide.]

We can learn $f(x) = \sum_{i=1}^n \alpha_i K(x_i, x)$ by fitting a linear model $\beta^\top \Phi(x)$ in the feature space
Example: ridge regression / logistic regression / SVM
$$\min_{\beta \in \mathbb{R}^D} \; \sum_{i=1}^n \ell\left(y_i, \beta^\top \Phi(x_i)\right) + \lambda \beta^\top \beta$$
But D can be very large, even infinite...

Kernel tricks

$K(x, x') = \Phi(x)^\top \Phi(x')$ can be quick to compute even if D is large (even infinite)
For a set of training samples $\{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, let $K_n$ be the $n \times n$ Gram matrix: $[K_n]_{ij} = K(x_i, x_j)$
For $\beta = \sum_{i=1}^n \alpha_i \Phi(x_i)$ we have
$$\beta^\top \Phi(x_i) = [K\alpha]_i \quad \text{and} \quad \beta^\top \beta = \alpha^\top K \alpha$$
We can therefore solve the equivalent problem in $\alpha \in \mathbb{R}^n$:
$$\min_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^n \ell\left(y_i, [K\alpha]_i\right) + \lambda \alpha^\top K \alpha$$

Example: kernel ridge regression (KRR)

$$\min_{\beta \in \mathbb{R}^D} \; \sum_{i=1}^n \left(y_i - \beta^\top \Phi(x_i)\right)^2 + \lambda \beta^\top \beta$$

Solve in $\mathbb{R}^D$:
$$\hat\beta = \big(\underbrace{\Phi(X)^\top \Phi(X) + \lambda I}_{D \times D}\big)^{-1} \Phi(X)^\top Y$$

Solve in $\mathbb{R}^n$:
$$\hat\alpha = \big(\underbrace{K + \lambda I}_{n \times n}\big)^{-1} Y$$

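A minimal KRR sketch in the alpha parametrization (hypothetical function names; it reuses the gaussian_gram() helper sketched above, and is an assumed illustration rather than the slides' code):

# Fit: alpha = (K + lambda I)^{-1} Y; predict: f(x) = sum_i alpha_i K(x_i, x)
krr_fit <- function(X, Y, lambda, sigma = 1) {
  K <- gaussian_gram(X, X, sigma)
  alpha <- solve(K + lambda * diag(nrow(X)), Y)
  list(alpha = alpha, X = X, sigma = sigma)
}

krr_predict <- function(model, X_new) {
  Kx <- gaussian_gram(X_new, model$X, model$sigma)
  as.vector(Kx %*% model$alpha)
}
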
KRR with Gaussian RBF kernel

$$\min_{\beta \in \mathbb{R}^D} \; \sum_{i=1}^n \left(y_i - \beta^\top \Phi(x_i)\right)^2 + \lambda \beta^\top \beta \qquad K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

[Figure sequence: the fitted KRR function on the (x, y) data from the motivation slide, one panel per regularization value, lambda = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 1e-4, 1e-5, 1e-6, 1e-7.]

Complexity

Compute K: O(dn^2)
Store K: O(n^2)
Solve for alpha: O(n^{2~3})
Compute f(x) for one x: O(nd)

Impractical for n > 10~100k

Outline

1. Introduction
2. Standard machine learning
   Dimension reduction: PCA
   Clustering: k-means
   Regression: ridge regression
   Classification: kNN, logistic regression and SVM
   Nonlinear models: kernel methods
3. Large-scale machine learning
   Scalability issues
   The tradeoffs of large-scale learning
   Random projections
   Random features
   Approximate NN
   Shingling, hashing, sketching
4. Conclusion

Outline

1. Introduction
2. Standard machine learning
3. Large-scale machine learning
   Scalability issues
   The tradeoffs of large-scale learning
   Random projections
   Random features
   Approximate NN
   Shingling, hashing, sketching
4. Conclusion

What is "large-scale"?

Data cannot fit in RAM
Algorithm cannot run on a single machine in reasonable time (algorithm-dependent)
Sometimes even O(n) is too large! (e.g., nearest neighbor in a database of O(1B+) items)
Many tasks / parameters (e.g., image categorization in O(10M) classes)
Streams of data

Things to worry about

Training time (usually offline)
Memory requirements
Test time

Complexities so far

Method                Memory   Training time   Test time
PCA                   O(d^2)   O(nd^2)         O(d)
k-means               O(nd)    O(ndk)          O(kd)
Ridge regression      O(d^2)   O(nd^2)         O(d)
kNN                   O(nd)    0               O(nd)
Logistic regression   O(nd)    O(nd^2)         O(d)
SVM, kernel methods   O(n^2)   O(n^3)          O(nd)

Techniques for large-scale machine learning

Good baselines:
  Subsample data and run standard method
  Split and run on several machines (depends on algorithm)
Need to revisit standard algorithms and implementation, taking into account scalability
Trade exactness for scalability
Compress, sketch, hash data in a smart way

Outline

1. Introduction
2. Standard machine learning
3. Large-scale machine learning
   Scalability issues
   The tradeoffs of large-scale learning
   Random projections
   Random features
   Approximate NN
   Shingling, hashing, sketching
4. Conclusion

Motivation

Classical learning theory analyzes the trade-off between:
  approximation error (how well we approximate the true function)
  estimation error (how well we estimate the parameters)
But reaching the best trade-off for a given n may be impossible with limited computational resources
We should include the computational budget in the trade-off, and see which optimization algorithm gives the best trade-off!
Seminal paper of Bottou and Bousquet [2008]

Classical ERM setting

Goal: learn a function $f: \mathbb{R}^d \to \mathcal{Y}$ ($\mathcal{Y} = \mathbb{R}$ or $\{-1, 1\}$)
P unknown distribution over $\mathbb{R}^d \times \mathcal{Y}$
Training set: $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} \subset \mathbb{R}^d \times \mathcal{Y}$, i.i.d. following P
Fix a class of functions $\mathcal{F} \subset \{f: \mathbb{R}^d \to \mathbb{R}\}$
Choose a loss $\ell(y, f(x))$
Learning by empirical risk minimization:
$$f_n \in \arg\min_{f \in \mathcal{F}} R_n[f] = \frac{1}{n} \sum_{i=1}^n \ell\left(Y_i, f(X_i)\right)$$
Hope that $f_n$ has a small risk: $R[f_n] = \mathbb{E}\,\ell\left(Y, f_n(X)\right)$

Classical ERM setting

The best possible risk is $R^* = \min_{f: \mathbb{R}^d \to \mathcal{Y}} R[f]$
The best achievable risk over $\mathcal{F}$ is $R^*_{\mathcal{F}} = \min_{f \in \mathcal{F}} R[f]$
We then have the decomposition
$$R[f_n] - R^* = \underbrace{R[f_n] - R^*_{\mathcal{F}}}_{\text{estimation error } \epsilon_{est}} + \underbrace{R^*_{\mathcal{F}} - R^*}_{\text{approximation error } \epsilon_{app}}$$

Optimization error

Solving the ERM problem may be hard (when n and d are large)
Instead we usually find an approximate solution $\tilde{f}_n$ that satisfies
$$R_n[\tilde{f}_n] \leq R_n[f_n] + \rho$$
The excess risk of $\tilde{f}_n$ is then
$$\epsilon = R[\tilde{f}_n] - R^* = \underbrace{R[\tilde{f}_n] - R[f_n]}_{\text{optimization error } \epsilon_{opt}} + \epsilon_{est} + \epsilon_{app}$$

A new trade-off

$$\epsilon = \epsilon_{app} + \epsilon_{est} + \epsilon_{opt}$$

Problem: choose $\mathcal{F}$, n, $\rho$ to make $\epsilon$ as small as possible, subject to a limit on n and on the computation time T

Table 1: Typical variations when F, n, and ρ increase.

                              F     n     ρ
E_app (approximation error)   ↘
E_est (estimation error)      ↗     ↘
E_opt (optimization error)    ···   ···   ↗
T (computation time)          ↗     ↗     ↘

Large-scale or small-scale?
Small-scale when the constraint on n is active
Large-scale when the constraint on T is active

Comparing optimization methods

$$\min_{\beta \in \mathbb{R}^d} R_n[f_\beta] = \sum_{i=1}^n \ell\left(y_i, f_\beta(x_i)\right)$$

Gradient descent (GD):
$$\beta_{t+1} \leftarrow \beta_t - \eta \frac{\partial R_n(f_{\beta_t})}{\partial \beta}$$

Second-order gradient descent (2GD), assuming the Hessian H is known:
$$\beta_{t+1} \leftarrow \beta_t - H^{-1} \frac{\partial R_n(f_{\beta_t})}{\partial \beta}$$

Stochastic gradient descent (SGD):
$$\beta_{t+1} \leftarrow \beta_t - \frac{\eta}{t} \frac{\partial \ell\left(y_t, f_{\beta_t}(x_t)\right)}{\partial \beta}$$

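A minimal SGD sketch for a linear model with the logistic loss, using the eta/t step size shown above (the function name and sampling scheme are assumptions for illustration):

# One random training example per step; each step costs O(d)
sgd_logistic <- function(X, y, eta = 1, n_steps = 10000) {
  beta <- rep(0, ncol(X))
  for (t in 1:n_steps) {
    i <- sample(nrow(X), 1)                        # pick one training example
    margin <- y[i] * sum(beta * X[i, ])
    grad <- -y[i] * X[i, ] / (1 + exp(margin))     # gradient of log(1 + exp(-margin))
    beta <- beta - (eta / t) * grad
  }
  beta
}
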
Results [Bottou and Bousquet, 2008]

Algorithm: cost of one iteration; iterations to reach ρ; time to reach accuracy ρ; time to reach E ≤ c(E_app + ε)

GD:    O(nd);       O(κ log(1/ρ));          O(ndκ log(1/ρ));            O(d²κ ε^{-1/α} log²(1/ε))
2GD:   O(d² + nd);  O(log log(1/ρ));        O((d² + nd) log log(1/ρ));  O(d² ε^{-1/α} log(1/ε) log log(1/ε))
SGD:   O(d);        νκ²/ρ + o(1/ρ);         O(dνκ²/ρ);                  O(dνκ²/ε)
2SGD:  O(d²);       ν/ρ + o(1/ρ);           O(d²ν/ρ);                   O(d²ν/ε)

α ∈ [1/2, 1] comes from the bound on ε_est and depends on the data
In the last column, n and ρ are optimized to reach ε for each method
2GD optimizes much faster than GD, but the gain on the final performance is limited, because of the ε^{-1/α} factor coming from the estimation error
SGD: optimization speed is catastrophic, but learning speed is the best, and independent of α
This suggests that SGD is very competitive (and it has become the de facto standard in large-scale ML)

Illustration

https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf

Outline

1. Introduction
2. Standard machine learning
3. Large-scale machine learning
   Scalability issues
   The tradeoffs of large-scale learning
   Random projections
   Random features
   Approximate NN
   Shingling, hashing, sketching
4. Conclusion

Motivation

High dimension d affects the scalability of algorithms, e.g., O(nd) for kNN or O(d^3) for ridge regression
Hard to visualize
(Sometimes) counterintuitive phenomena in high dimension, e.g., concentration of measure for Gaussian data

[Figure: histograms of ||x||/sqrt(k) for Gaussian data with d = 1, 10, 100; the distribution concentrates as d grows.]

Statistical inference degrades when d increases (curse of dimension)

Dimension reduction with PCA

[Figure: data cloud with its first two principal directions PC1 and PC2.]

Projects data onto k < d dimensions that capture the largest amount of variance
Also minimizes the total reconstruction error:
$$\min_{S_k} \sum_{i=1}^n \left\|x_i - \Pi_{S_k}(x_i)\right\|^2$$
But computationally expensive: O(nd^2)
No theoretical guarantee on distance preservation

Linear dimension reduction

$$\underbrace{X'}_{n \times k} = \underbrace{X}_{n \times d} \times \underbrace{R}_{d \times k}$$

Can we find R efficiently?
Can we preserve distances?
$$\forall i, j = 1, \ldots, n, \quad \|f(x_i) - f(x_j)\| \approx \|x_i - x_j\|$$
Note: when d > n, we can take k = n and preserve all distances exactly (kernel trick)

Random projections

Simply take a random projection matrix:
$$f(x) = \frac{1}{\sqrt{k}} R^\top x \quad \text{with} \quad R_{ij} \sim \mathcal{N}(0, 1)$$

Theorem [Johnson and Lindenstrauss, 1984]
For any $\epsilon > 0$ and $n \in \mathbb{N}$, take
$$k \geq 4\left(\epsilon^2/2 - \epsilon^3/3\right)^{-1} \log(n) \approx \epsilon^{-2} \log(n).$$
Then the following holds with probability at least $1 - 1/n$:
$$(1 - \epsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \epsilon)\|x_i - x_j\|^2 \quad \forall i, j = 1, \ldots, n$$

k does not depend on d!
n = 1M, ε = 0.1  =>  k ≈ 5K
n = 1B, ε = 0.1  =>  k ≈ 8K

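A small sketch (an assumed illustration) of a Gaussian random projection, together with an empirical check of how well pairwise distances are preserved:

# Project n points from R^d to R^k with f(x) = R^T x / sqrt(k), R_ij ~ N(0, 1)
set.seed(42)
n <- 200; d <- 1000; k <- 100
X <- matrix(rnorm(n * d), n, d)

R <- matrix(rnorm(d * k), d, k)
X_proj <- X %*% R / sqrt(k)

ratio <- as.vector(dist(X_proj)) / as.vector(dist(X))   # distortion of each pairwise distance
summary(ratio)                                           # concentrated around 1 for k large enough
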
Proof (1/3)

For a single dimension, $q_j = r_j^\top u$:
$$\mathbb{E}(q_j) = \mathbb{E}(r_j)^\top u = 0 \qquad \mathbb{E}(q_j^2) = u^\top \mathbb{E}\left(r_j r_j^\top\right) u = \|u\|^2$$

For the k-dimensional projection $f(u) = \frac{1}{\sqrt{k}} R^\top u$:
$$\|f(u)\|^2 = \frac{1}{k} \sum_{j=1}^k q_j^2 \sim \frac{\|u\|^2}{k} \chi^2(k)$$
$$\mathbb{E}\|f(u)\|^2 = \frac{1}{k} \sum_{j=1}^k \mathbb{E}(q_j^2) = \|u\|^2$$

Need to show that $\|f(u)\|^2$ is concentrated around its mean

Proof (2/3)

$$\begin{aligned}
P\left(\|f\|^2 > (1 + \epsilon)\|u\|^2\right)
&= P\left(\chi^2(k) > (1 + \epsilon)k\right) \\
&= P\left(e^{\lambda \chi^2(k)} > e^{\lambda(1 + \epsilon)k}\right) \\
&\leq \mathbb{E}\left(e^{\lambda \chi^2(k)}\right) e^{-\lambda(1 + \epsilon)k} && \text{(Markov)} \\
&= (1 - 2\lambda)^{-k/2} e^{-\lambda(1 + \epsilon)k} && \text{(MGF of } \chi^2(k) \text{ for } 0 \leq \lambda \leq 1/2\text{)} \\
&= \left((1 + \epsilon)e^{-\epsilon}\right)^{k/2} && \text{(take } \lambda = \epsilon/(2(1 + \epsilon))\text{)} \\
&\leq e^{-\left(\epsilon^2/2 - \epsilon^3/3\right)k/2} && \text{(use } \log(1 + x) \leq x - x^2/2 + x^3/3\text{)} \\
&= n^{-2} && \text{(take } k = 4\log(n)/\left(\epsilon^2/2 - \epsilon^3/3\right)\text{)}
\end{aligned}$$

Similarly we get $P\left(\|f\|^2 < (1 - \epsilon)\|u\|^2\right) < n^{-2}$

Proof (3/3)

Apply with $u = x_i - x_j$ and use the linearity of f to show that for an $(x_i, x_j)$ pair, the probability of large distortion is $\leq 2n^{-2}$

Union bound: over all $n(n-1)/2$ pairs, the probability that at least one has large distortion is smaller than
$$\frac{n(n-1)}{2} \times \frac{2}{n^2} = 1 - \frac{1}{n}$$

Scalability

n = O(1B); d = O(1M)  =>  k = O(10K)
Memory: need to store R, O(dk) ≈ 40 GB
Computation: X × R in O(ndk)

Other random matrices R have similar properties but better scalability, e.g.:
"Add or subtract" [Achlioptas, 2003], 1 bit/entry, size ≈ 1.25 GB:
$$R_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$$
Fast Johnson-Lindenstrauss transform [Ailon and Chazelle, 2009], where R = PHD; compute f(x) in O(d log d)

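A sketch of the "add or subtract" projection (an assumed illustration; the 1/sqrt(k) scaling is kept as for the Gaussian case):

# Sign random projection of Achlioptas (2003): entries +1/-1 with probability 1/2
random_sign_projection <- function(X, k) {
  d <- ncol(X)
  R <- matrix(sample(c(-1, 1), d * k, replace = TRUE), d, k)
  X %*% R / sqrt(k)
}
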
Outline

1. Introduction
2. Standard machine learning
3. Large-scale machine learning
   Scalability issues
   The tradeoffs of large-scale learning
   Random projections
   Random features
   Approximate NN
   Shingling, hashing, sketching
4. Conclusion

Motivation

[Figure: diagram comparing the JL random projection, which maps $\mathbb{R}^d$ to $\mathbb{R}^k$, with the kernel feature map $\Phi$, which maps $\mathbb{R}^d$ to $\mathbb{R}^D$.]

Random features?

Fourier feature space

Example: Gaussian kernel
$$e^{-\frac{\|x - x'\|^2}{2}} = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i\omega^\top(x - x')} e^{-\frac{\|\omega\|^2}{2}} d\omega
= \mathbb{E}_\omega \cos\left(\omega^\top(x - x')\right)
= \mathbb{E}_{\omega, b}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top x' + b\right)\right]$$
with
$$\omega \sim p(d\omega) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{\|\omega\|^2}{2}} d\omega, \qquad b \sim U([0, 2\pi]).$$

This is of the form $K(x, x') = \Phi(x)^\top \Phi(x')$ with $D = +\infty$:
$$\Phi: \mathbb{R}^d \to L^2\left((\mathbb{R}^d, p(d\omega)) \times ([0, 2\pi], U)\right)$$

Random Fourier features [Rahimi and Recht, 2008]

For i = 1, ..., k, sample randomly:
$$(\omega_i, b_i) \sim p(d\omega) \times U([0, 2\pi])$$
Create random features:
$$\forall x \in \mathbb{R}^d, \quad f_i(x) = \sqrt{\frac{2}{k}} \cos\left(\omega_i^\top x + b_i\right)$$

Random Fourier features [Rahimi and Recht, 2008]

For any $x, x' \in \mathbb{R}^d$, it holds that
$$\mathbb{E}\left[f(x)^\top f(x')\right] = \mathbb{E}\left[\sum_{i=1}^k f_i(x) f_i(x')\right] = \frac{1}{k}\sum_{i=1}^k \mathbb{E}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top x' + b\right)\right] = K(x, x')$$
and by Hoeffding's inequality,
$$P\left(\left|f(x)^\top f(x') - K(x, x')\right| > \epsilon\right) \leq 2 e^{-\frac{k\epsilon^2}{2}}$$

This allows us to approximate learning with the Gaussian kernel by a simple linear model in k dimensions!

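A minimal sketch of random Fourier features for the Gaussian kernel exp(-||x - x'||^2 / 2) (hypothetical function name; the quick numerical check at the end is an assumption for illustration, not from the slides):

# z(x)^T z(x') approximates K(x, x') = exp(-||x - x'||^2 / 2)
rff_features <- function(X, k) {
  d <- ncol(X)
  W <- matrix(rnorm(d * k), d, k)              # omega_i ~ N(0, I_d)
  b <- runif(k, 0, 2 * pi)                     # b_i ~ U([0, 2*pi])
  sqrt(2 / k) * cos(sweep(X %*% W, 2, b, "+"))
}

# Quick check of the approximation on random data:
set.seed(0)
X <- matrix(rnorm(10 * 5), 10, 5)
Z <- rff_features(X, k = 2000)
K_exact <- exp(-as.matrix(dist(X))^2 / 2)
max(abs(Z %*% t(Z) - K_exact))                 # small for large k
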
Generalization

A translation-invariant (t.i.) kernel is of the form $K(x, x') = \varphi(x - x')$

Bochner's theorem
For a continuous function $\varphi: \mathbb{R}^d \to \mathbb{R}$, K is p.d. if and only if $\varphi$ is the Fourier-Stieltjes transform of a symmetric and positive finite Borel measure $\mu \in \mathcal{M}(\mathbb{R}^d)$:
$$\varphi(x) = \int_{\mathbb{R}^d} e^{-i\omega^\top x} d\mu(\omega)$$

Just sample $\omega_i \sim \frac{d\mu(\omega)}{\mu(\mathbb{R}^d)}$ and $b_i \sim U([0, 2\pi])$ to approximate any t.i. kernel K with the random features $\sqrt{\frac{2}{k}} \cos\left(\omega_i^\top x + b_i\right)$

Examples

$$K(x, x') = \varphi(x - x') = \int_{\mathbb{R}^d} e^{-i\omega^\top(x - x')} d\mu(\omega)$$

Kernel     $\varphi(x)$                            $\mu(d\omega)$
Gaussian   $\exp\left(-\frac{\|x\|^2}{2}\right)$   $(2\pi)^{-d/2} \exp\left(-\frac{\|\omega\|^2}{2}\right)$
Laplace    $\exp\left(-\|x\|_1\right)$             $\prod_{i=1}^d \frac{1}{\pi\left(1 + \omega_i^2\right)}$
Cauchy     $\prod_{i=1}^d \frac{2}{1 + x_i^2}$     $e^{-\|\omega\|_1}$

Performance [Rahimi and Recht, 2008]

[Figure: empirical performance results reported by Rahimi and Recht (2008) for random features.]
