Distributed Machine Learning and Big Data
Sourangshu Bhattacharya
Dept. of Computer Science and Engineering, IIT Kharagpur
http://cse.iitkgp.ac.in/~sourangshu/
August 21, 2015
Outline
1 Machine Learning and Big Data
    Support Vector Machines
    Stochastic Sub-gradient descent
    Distributed Optimization
2 ADMM
    Convergence
    Distributed Loss Minimization
    Results
    Development of ADMM
3 Applications and extensions
    Weighted Parameter Averaging
    Fully-distributed SVM
What is Big Data?
6 Billion web queries per day.
10 Billion display advertisements per day.
30 Billion text ads per day.
150 Million credit card transactions per day.
100 Billion emails per day.
Machine Learning on Big Data
Classification - spam / no spam - 100B emails.
Multi-label classification - image tagging - 14M images, 10K tags.
Regression - CTR estimation - 10B ad views.
Ranking - web search - 6B queries.
Recommendation - online shopping - 1.7B views in the US.
Classification example
Email spam classification.
Features (u_i): vector of counts of all words.
No. of features (d): words in the vocabulary (~100,000).
No. of non-zero features per email: 100.
No. of emails per day: 100 M.
Size of training set using 30 days of data: 6 TB (assuming 20 B per data entry).
Time taken to read the data once: 41.67 hrs (at 20 MB per second).
Solution: use multiple computers.
Big Data Paradigm
3V's - Volume, Variety, Velocity.
Distributed system. Chance of failure:
    Computers:                        1      10     100
    Chance of a failure in an hour:   0.01   0.09   0.63
Communication efficiency - data locality.
Many systems: Hadoop, Spark, GraphLab, etc.
Goal: implement machine learning algorithms on big data systems.
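The failure figures in the table are what one obtains by assuming independent machine failures with a per-machine failure probability of 0.01 per hour; a minimal sketch reproducing them (the independence assumption and the value 0.01 are read off the table, the rest is illustrative):

def failure_prob(m, p=0.01):
    # Probability that at least one of m machines fails in an hour,
    # assuming independent failures with per-machine probability p.
    return 1.0 - (1.0 - p) ** m

for m in (1, 10, 100):
    print(m, round(failure_prob(m), 3))   # 0.01, 0.096, 0.634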
Binary Classification Problem
A set of labeled datapoints S = {(u_i, v_i), i = 1, ..., n}, with u_i ∈ R^d and v_i ∈ {+1, −1}.
Linear predictor function: v = sign(x^T u).
Error function: E = Σ_{i=1}^{n} 1(v_i x^T u_i ≤ 0).
Logistic Regression
Probability of v is given by: P(v | u, x) = σ(v x^T u) = 1 / (1 + e^{−v x^T u}).
Learning problem: given dataset S, estimate x.
Minimizing the regularized negative log likelihood:
x* = argmin_x Σ_{i=1}^{n} log(1 + e^{−v_i x^T u_i}) + (λ/2) x^T x
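A minimal NumPy sketch of the regularized logistic-regression objective and its gradient exactly as written above; the variable names and the use of NumPy are my own choices, not part of the slides:

import numpy as np

def logistic_objective(x, U, v, lam):
    # U: n x d matrix of feature vectors u_i; v: labels in {+1, -1}
    margins = v * (U @ x)                       # v_i * x^T u_i
    loss = np.sum(np.log1p(np.exp(-margins)))   # sum_i log(1 + e^{-v_i x^T u_i})
    return loss + 0.5 * lam * x @ x             # + (lambda/2) x^T x

def logistic_gradient(x, U, v, lam):
    margins = v * (U @ x)
    s = -v / (1.0 + np.exp(margins))            # derivative of each log term
    return U.T @ s + lam * x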
Convex Function
f is a convex function if, for all x_1, x_2 and all t ∈ [0, 1]:
f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)
Convex Optimization
Convex optimization problem:
minimize_x f(x)
subject to: g_i(x) ≤ 0, ∀ i = 1, ..., k
where f and the g_i are convex functions.
For convex optimization problems, local optima are also global optima.
Optimization Algorithm: Gradient Descent (figure)
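A minimal sketch of gradient descent with a fixed step size; the step size, iteration count, and toy example are illustrative assumptions, not part of the slides:

import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    # x_{k+1} = x_k - step * grad(x_k)
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Toy usage: minimize f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: x - c, np.zeros(2))   # converges to c

The same routine could be applied to the logistic-regression objective of the previous slide by passing in its gradient, e.g. lambda x: logistic_gradient(x, U, v, lam) from the sketch above.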
Support Vector Machines: Classification Problem (figure)
SVM
Separating hyperplane: x^T u = 0.
Parallel hyperplanes (defining the margin): x^T u = ±1.
Margin (perpendicular distance between the parallel hyperplanes): 2/‖x‖.
Correct classification of training datapoints: v_i x^T u_i ≥ 1, ∀ i.
Allowing error (slack) ξ_i: v_i x^T u_i ≥ 1 − ξ_i, ∀ i.
Max-margin formulation:
min_{x, ξ} (1/2)‖x‖² + C Σ_{i=1}^{n} ξ_i
subject to: v_i x^T u_i ≥ 1 − ξ_i, ξ_i ≥ 0, ∀ i = 1, ..., n
SVM: dual
Lagrangian:
L = (1/2) x^T x + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} α_i (1 − ξ_i − v_i x^T u_i) − Σ_{i=1}^{n} µ_i ξ_i
Dual problem: (x*, α*, µ*) = max_{α, µ} min_x L(x, α, µ)
For a strictly convex problem, primal and dual solutions coincide (strong duality).
KKT conditions:
x = Σ_{i=1}^{n} α_i v_i u_i
C = α_i + µ_i
SVM: dual
The dual problem:
max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j v_i v_j u_i^T u_j
subject to: 0 ≤ α_i ≤ C, ∀ i
The dual is a quadratic programming problem in n variables.
It can be solved even if only the kernel function k(u_i, u_j) = u_i^T u_j is given; it is dimension agnostic.
Many efficient algorithms exist for solving it, e.g. SMO (Platt, 1999).
Worst-case complexity is O(n³), usually O(n²).
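To make the "kernel only" point concrete, here is a sketch that evaluates the dual objective given just a precomputed kernel matrix K (with K[i, j] = k(u_i, u_j)); the feature vectors themselves are never touched. Function and variable names are illustrative:

import numpy as np

def dual_objective(alpha, v, K):
    # sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j v_i v_j K[i, j]
    av = alpha * v
    return alpha.sum() - 0.5 * (av @ K @ av)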
SVM
A more compact form: min_x Σ_{i=1}^{n} max(0, 1 − v_i x^T u_i) + (λ/2)‖x‖²
Or: min_x Σ_{i=1}^{n} l(x, u_i, v_i) + λ Ω(x)
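A minimal sketch of the compact hinge-loss objective above in NumPy; names are illustrative and the code is only a direct transcription of the formula:

import numpy as np

def svm_objective(x, U, v, lam):
    # sum_i max(0, 1 - v_i x^T u_i) + (lambda/2) ||x||^2
    hinge = np.maximum(0.0, 1.0 - v * (U @ x))
    return hinge.sum() + 0.5 * lam * x @ x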
Multi-class classification
There are m classes, v_i ∈ {1, ..., m}.
Most popular prediction scheme: v = argmax_{j ∈ {1,...,m}} x_j^T u_i.
Given example (u_i, v_i), we want x_{v_i}^T u_i ≥ x_j^T u_i, ∀ j ∈ {1, ..., m}.
Using a margin of at least 1, the loss is
l(u_i, v_i) = max_{j ∈ {1,...,m}, j ≠ v_i} max{0, 1 − (x_{v_i}^T u_i − x_j^T u_i)}
Given dataset D, solve the problem
min_{x_1,...,x_m} Σ_{i ∈ D} l(u_i, v_i) + λ Σ_{j=1}^{m} ‖x_j‖²
This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.
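A small sketch of the multi-class hinge loss above for one example; it assumes 0-indexed class labels and stacks the class weight vectors x_1, ..., x_m as rows of a matrix, both of which are implementation choices rather than part of the slides:

import numpy as np

def multiclass_hinge(X, u, v):
    # X: m x d matrix whose rows are the class weight vectors x_1, ..., x_m
    # loss = max(0, 1 - (x_v^T u - max_{j != v} x_j^T u))
    scores = X @ u
    true_score = scores[v]
    best_other = np.delete(scores, v).max()
    return max(0.0, 1.0 - (true_score - best_other))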
General Learning Problems
Support Vector Machines:
min_x Σ_{i=1}^{n} max{0, 1 − v_i x^T u_i} + (λ/2)‖x‖²
Logistic Regression:
min_x Σ_{i=1}^{n} log(1 + exp(−v_i x^T u_i)) + (λ/2)‖x‖²
General form:
min_x Σ_{i=1}^{n} l(x, u_i, v_i) + λ Ω(x)
l: loss function, Ω: regularizer.
Sub-gradient Descent
A sub-gradient of a (possibly non-differentiable) convex function f at a point x_0 is a vector g such that:
f(x) − f(x_0) ≥ g^T (x − x_0), for all x.
Sub-gradient Descent
Randomly initialize x_0.
Iterate: x_k = x_{k−1} − t_k g(x_{k−1}), k = 1, 2, 3, ..., where g is a sub-gradient of f.
Step size: t_k = 1/√k.
Keep the best iterate: f(x_best^(k)) = min_{i=1,...,k} f(x_i).
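A minimal sketch of this rule for the SVM objective from the earlier slide, using the full-batch sub-gradient of the hinge loss; the sub-gradient formula (−v_i u_i on margin-violating examples, plus λx from the regularizer) follows directly from the objective, while names, initialization, and the iteration count are illustrative assumptions:

import numpy as np

def svm_subgradient(x, U, v, lam):
    # sub-gradient of sum_i max(0, 1 - v_i x^T u_i) + (lambda/2)||x||^2
    viol = (v * (U @ x)) < 1.0                  # margin-violating examples
    return -(U[viol].T @ v[viol]) + lam * x

def subgradient_descent(U, v, lam, iters=1000):
    x = np.zeros(U.shape[1])
    x_best, f_best = x.copy(), np.inf
    for k in range(1, iters + 1):
        x = x - (1.0 / np.sqrt(k)) * svm_subgradient(x, U, v, lam)   # t_k = 1/sqrt(k)
        f = np.maximum(0.0, 1.0 - v * (U @ x)).sum() + 0.5 * lam * x @ x
        if f < f_best:                          # keep the best iterate seen so far
            x_best, f_best = x.copy(), f
    return x_best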
Sub-gradient Descent (figure)
Stochastic Sub-gradient Descent
Convergence rate is O(1/√k).
Each iteration of batch sub-gradient descent takes O(n) time.
Reduce the per-iteration cost by calculating the sub-gradient using a subset of examples - stochastic sub-gradient.
Inherently serial.
Typical O(1/ε²) behaviour: O(1/ε²) iterations to reach tolerance ε.
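A sketch of the stochastic variant, which replaces the full sum by the sub-gradient of a single randomly drawn example per iteration (a mini-batch would work the same way); the names, step-size schedule, and single-example sampling are assumptions, not prescriptions from the slides:

import numpy as np

def stochastic_subgradient_descent(U, v, lam, iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = U.shape
    x = np.zeros(d)
    for k in range(1, iters + 1):
        i = rng.integers(n)                      # pick one example at random
        g = lam * x                              # sub-gradient of (lambda/2)||x||^2
        if v[i] * (U[i] @ x) < 1.0:              # hinge term active for this example
            g = g - v[i] * U[i]
        # Note: this treats the loss as an average over examples;
        # scaling the data term by n recovers the summed objective.
        x = x - (1.0 / np.sqrt(k)) * g           # t_k = 1/sqrt(k)
    return x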
Stochastic Sub-gradient Descent (figure)
Distributed gradient descent
Divide the dataset into m parts C_1, ..., C_m. Each part is processed on one computer, so there are m computers in total.
There is one central computer; all computers can communicate with the central computer via the network.
Define loss(x) = Σ_{j=1}^{m} Σ_{i ∈ C_j} l_i(x) + λ Ω(x), where l_i(x) = l(x, u_i, v_i).
The gradient (in case of a differentiable loss):
∇loss(x) = Σ_{j=1}^{m} Σ_{i ∈ C_j} ∇l_i(x) + λ ∇Ω(x)
Compute ∇l^j(x) = Σ_{i ∈ C_j} ∇l_i(x) on the j-th computer and communicate it to the central computer.
Distributed gradient descent
Compute ∇loss(x) = Σ_{j=1}^{m} ∇l^j(x) + λ ∇Ω(x) at the central computer.
The gradient descent update: x_{k+1} = x_k − α ∇loss(x_k), with α chosen by a (distributed) line search algorithm.
For non-differentiable loss functions, we can use a distributed sub-gradient descent algorithm.
Slow for most practical problems. For achieving ε tolerance:
Gradient descent (logistic regression): O(1/ε) iterations.
Sub-gradient descent (stochastic sub-gradient descent): O(1/ε²) iterations.
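A minimal single-process sketch of this scheme: the data is split into m shards, each "worker" computes its partial gradient Σ_{i ∈ C_j} ∇l_i(x), and the "central computer" sums the partial gradients, adds the regularizer gradient, and takes the step. In a real deployment the inner loop would run across machines (e.g. as a Hadoop or Spark job); here the parallelism is only simulated, and the fixed step size stands in for the line search mentioned above:

import numpy as np

def partial_gradient(x, U_part, v_part):
    # worker j: sum of logistic-loss gradients over its shard C_j
    s = -v_part / (1.0 + np.exp(v_part * (U_part @ x)))
    return U_part.T @ s

def distributed_gradient_descent(U, v, lam, m=4, step=0.1, iters=100):
    shards = np.array_split(np.arange(len(v)), m)        # C_1, ..., C_m
    x = np.zeros(U.shape[1])
    for _ in range(iters):
        # "map": each worker computes its partial gradient on its shard
        partials = [partial_gradient(x, U[idx], v[idx]) for idx in shards]
        # "reduce": the central computer sums them and adds the regularizer gradient
        grad = np.sum(partials, axis=0) + lam * x
        x = x - step * grad
    return x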