Distributed Machine Learning and Big Data
Sourangshu Bhattacharya
Dept. of Computer Science and Engineering, IIT Kharagpur
http://cse.iitkgp.ac.in/~sourangshu/
August 21, 2015
Outline
1 Machine Learning and Big Data
    Support Vector Machines
    Stochastic Sub-gradient descent
    Distributed Optimization
2 ADMM
    Convergence
    Distributed Loss Minimization
    Results
    Development of ADMM
3 Applications and extensions
    Weighted Parameter Averaging
    Fully-distributed SVM
What is Big Data?
6 Billion web queries per day.
10 Billion display advertisements per day.
30 Billion text ads per day.
150 Million credit card transactions per day.
100 Billion emails per day.
Machine Learning on Big Data
Classification - spam / no spam - 100B emails.
Multi-label classification - image tagging - 14M images, 10K tags.
Regression - CTR estimation - 10B ad views.
Ranking - web search - 6B queries.
Recommendation - online shopping - 1.7B views in the US.
Classification example
Email spam classification.
Features (u_i): vector of counts of all words.
No. of features (d): words in the vocabulary (~100,000).
No. of non-zero features per email: 100.
No. of emails per day: 100 M.
Size of training set using 30 days of data: 6 TB (assuming 20 B per data entry).
Time taken to read the data once: 41.67 hrs (at 20 MB per second).
Solution: use multiple computers.
Big Data Paradigm
3V's - Volume, Variety, Velocity.
Distributed system. Chance of failure:
    Computers:                        1      10     100
    Chance of a failure in an hour:   0.01   0.09   0.63
Communication efficiency - data locality.
Many systems: Hadoop, Spark, GraphLab, etc.
Goal: implement machine learning algorithms on big data systems.
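The failure figures in the table are what one obtains by assuming independent machine failures with a per-machine failure probability of 0.01 per hour; a minimal sketch reproducing them (the independence assumption and the value 0.01 are read off the table, the rest is illustrative):

def failure_prob(m, p=0.01):
    # Probability that at least one of m machines fails in an hour,
    # assuming independent failures with per-machine probability p.
    return 1.0 - (1.0 - p) ** m

for m in (1, 10, 100):
    print(m, round(failure_prob(m), 3))   # 0.01, 0.096, 0.634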
Binary Classification Problem
A set of labeled datapoints S = {(u_i, v_i), i = 1, ..., n}, with u_i ∈ R^d and v_i ∈ {+1, −1}.
Linear predictor function: v = sign(x^T u).
Error function: E = Σ_{i=1}^{n} 1(v_i x^T u_i ≤ 0).
Logistic Regression
Probability of v is given by: P(v | u, x) = σ(v x^T u) = 1 / (1 + e^{−v x^T u}).
Learning problem: given dataset S, estimate x.
Minimizing the regularized negative log likelihood:
x* = argmin_x Σ_{i=1}^{n} log(1 + e^{−v_i x^T u_i}) + (λ/2) x^T x
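A minimal NumPy sketch of the regularized logistic-regression objective and its gradient exactly as written above; the variable names and the use of NumPy are my own choices, not part of the slides:

import numpy as np

def logistic_objective(x, U, v, lam):
    # U: n x d matrix of feature vectors u_i; v: labels in {+1, -1}
    margins = v * (U @ x)                       # v_i * x^T u_i
    loss = np.sum(np.log1p(np.exp(-margins)))   # sum_i log(1 + e^{-v_i x^T u_i})
    return loss + 0.5 * lam * x @ x             # + (lambda/2) x^T x

def logistic_gradient(x, U, v, lam):
    margins = v * (U @ x)
    s = -v / (1.0 + np.exp(margins))            # derivative of each log term
    return U.T @ s + lam * x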
Convex Function
f is a convex function if, for all x_1, x_2 and all t ∈ [0, 1]:
f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)
Convex Optimization
Convex optimization problem:
minimize_x f(x)
subject to: g_i(x) ≤ 0, ∀ i = 1, ..., k
where f and the g_i are convex functions.
For convex optimization problems, local optima are also global optima.
Optimization Algorithm: Gradient Descent (figure)
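A minimal sketch of gradient descent with a fixed step size; the step size, iteration count, and toy example are illustrative assumptions, not part of the slides:

import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    # x_{k+1} = x_k - step * grad(x_k)
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Toy usage: minimize f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: x - c, np.zeros(2))   # converges to c

The same routine could be applied to the logistic-regression objective of the previous slide by passing in its gradient, e.g. lambda x: logistic_gradient(x, U, v, lam) from the sketch above.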
Support Vector Machines: Classification Problem (figure)
SVM
Separating hyperplane: x^T u = 0.
Parallel hyperplanes (defining the margin): x^T u = ±1.
Margin (perpendicular distance between the parallel hyperplanes): 2/‖x‖.
Correct classification of training datapoints: v_i x^T u_i ≥ 1, ∀ i.
Allowing error (slack) ξ_i: v_i x^T u_i ≥ 1 − ξ_i, ∀ i.
Max-margin formulation:
min_{x, ξ} (1/2)‖x‖² + C Σ_{i=1}^{n} ξ_i
subject to: v_i x^T u_i ≥ 1 − ξ_i, ξ_i ≥ 0, ∀ i = 1, ..., n
SVM: dual
Lagrangian:
L = (1/2) x^T x + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} α_i (1 − ξ_i − v_i x^T u_i) − Σ_{i=1}^{n} µ_i ξ_i
Dual problem: (x*, α*, µ*) = max_{α, µ} min_x L(x, α, µ)
For a strictly convex problem, primal and dual solutions coincide (strong duality).
KKT conditions:
x = Σ_{i=1}^{n} α_i v_i u_i
C = α_i + µ_i
SVM: dual
The dual problem:
max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j v_i v_j u_i^T u_j
subject to: 0 ≤ α_i ≤ C, ∀ i
The dual is a quadratic programming problem in n variables.
It can be solved even if only the kernel function k(u_i, u_j) = u_i^T u_j is given; it is dimension agnostic.
Many efficient algorithms exist for solving it, e.g. SMO (Platt, 1999).
Worst-case complexity is O(n³), usually O(n²).
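To make the "kernel only" point concrete, here is a sketch that evaluates the dual objective given just a precomputed kernel matrix K (with K[i, j] = k(u_i, u_j)); the feature vectors themselves are never touched. Function and variable names are illustrative:

import numpy as np

def dual_objective(alpha, v, K):
    # sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j v_i v_j K[i, j]
    av = alpha * v
    return alpha.sum() - 0.5 * (av @ K @ av)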
SVM
A more compact form: min_x Σ_{i=1}^{n} max(0, 1 − v_i x^T u_i) + (λ/2)‖x‖²
Or: min_x Σ_{i=1}^{n} l(x, u_i, v_i) + λ Ω(x)
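A minimal sketch of the compact hinge-loss objective above in NumPy; names are illustrative and the code is only a direct transcription of the formula:

import numpy as np

def svm_objective(x, U, v, lam):
    # sum_i max(0, 1 - v_i x^T u_i) + (lambda/2) ||x||^2
    hinge = np.maximum(0.0, 1.0 - v * (U @ x))
    return hinge.sum() + 0.5 * lam * x @ x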
Multi-class classification
There are m classes, v_i ∈ {1, ..., m}.
Most popular prediction scheme: v = argmax_{j ∈ {1,...,m}} x_j^T u_i.
Given example (u_i, v_i), we want x_{v_i}^T u_i ≥ x_j^T u_i, ∀ j ∈ {1, ..., m}.
Using a margin of at least 1, the loss is
l(u_i, v_i) = max_{j ∈ {1,...,m}, j ≠ v_i} max{0, 1 − (x_{v_i}^T u_i − x_j^T u_i)}
Given dataset D, solve the problem
min_{x_1,...,x_m} Σ_{i ∈ D} l(u_i, v_i) + λ Σ_{j=1}^{m} ‖x_j‖²
This can be extended to many settings, e.g. sequence labeling, learning to rank, etc.
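A small sketch of the multi-class hinge loss above for one example; it assumes 0-indexed class labels and stacks the class weight vectors x_1, ..., x_m as rows of a matrix, both of which are implementation choices rather than part of the slides:

import numpy as np

def multiclass_hinge(X, u, v):
    # X: m x d matrix whose rows are the class weight vectors x_1, ..., x_m
    # loss = max(0, 1 - (x_v^T u - max_{j != v} x_j^T u))
    scores = X @ u
    true_score = scores[v]
    best_other = np.delete(scores, v).max()
    return max(0.0, 1.0 - (true_score - best_other))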
General Learning Problems
Support Vector Machines:
min_x Σ_{i=1}^{n} max{0, 1 − v_i x^T u_i} + (λ/2)‖x‖²
Logistic Regression:
min_x Σ_{i=1}^{n} log(1 + exp(−v_i x^T u_i)) + (λ/2)‖x‖²
General form:
min_x Σ_{i=1}^{n} l(x, u_i, v_i) + λ Ω(x)
l: loss function, Ω: regularizer.
Sub-gradient Descent
A sub-gradient of a (possibly non-differentiable) convex function f at a point x_0 is a vector g such that:
f(x) − f(x_0) ≥ g^T (x − x_0), for all x.
Sub-gradient Descent
Randomly initialize x_0.
Iterate: x_k = x_{k−1} − t_k g(x_{k−1}), k = 1, 2, 3, ..., where g is a sub-gradient of f.
Step size: t_k = 1/√k.
Keep the best iterate: f(x_best^(k)) = min_{i=1,...,k} f(x_i).
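A minimal sketch of this rule for the SVM objective from the earlier slide, using the full-batch sub-gradient of the hinge loss; the sub-gradient formula (−v_i u_i on margin-violating examples, plus λx from the regularizer) follows directly from the objective, while names, initialization, and the iteration count are illustrative assumptions:

import numpy as np

def svm_subgradient(x, U, v, lam):
    # sub-gradient of sum_i max(0, 1 - v_i x^T u_i) + (lambda/2)||x||^2
    viol = (v * (U @ x)) < 1.0                  # margin-violating examples
    return -(U[viol].T @ v[viol]) + lam * x

def subgradient_descent(U, v, lam, iters=1000):
    x = np.zeros(U.shape[1])
    x_best, f_best = x.copy(), np.inf
    for k in range(1, iters + 1):
        x = x - (1.0 / np.sqrt(k)) * svm_subgradient(x, U, v, lam)   # t_k = 1/sqrt(k)
        f = np.maximum(0.0, 1.0 - v * (U @ x)).sum() + 0.5 * lam * x @ x
        if f < f_best:                          # keep the best iterate seen so far
            x_best, f_best = x.copy(), f
    return x_best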
Sub-gradient Descent (figure)
Stochastic Sub-gradient Descent
Convergence rate is O(1/√k).
Each iteration of batch sub-gradient descent takes O(n) time.
Reduce the per-iteration cost by calculating the sub-gradient using a subset of examples - stochastic sub-gradient.
Inherently serial.
Typical O(1/ε²) behaviour: O(1/ε²) iterations to reach tolerance ε.
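A sketch of the stochastic variant, which replaces the full sum by the sub-gradient of a single randomly drawn example per iteration (a mini-batch would work the same way); the names, step-size schedule, and single-example sampling are assumptions, not prescriptions from the slides:

import numpy as np

def stochastic_subgradient_descent(U, v, lam, iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = U.shape
    x = np.zeros(d)
    for k in range(1, iters + 1):
        i = rng.integers(n)                      # pick one example at random
        g = lam * x                              # sub-gradient of (lambda/2)||x||^2
        if v[i] * (U[i] @ x) < 1.0:              # hinge term active for this example
            g = g - v[i] * U[i]
        # Note: this treats the loss as an average over examples;
        # scaling the data term by n recovers the summed objective.
        x = x - (1.0 / np.sqrt(k)) * g           # t_k = 1/sqrt(k)
    return x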
Stochastic Sub-gradient Descent (figure)
Distributed gradient descent
Divide the dataset into m parts C_1, ..., C_m. Each part is processed on one computer, so there are m computers in total.
There is one central computer; all computers can communicate with the central computer via the network.
Define loss(x) = Σ_{j=1}^{m} Σ_{i ∈ C_j} l_i(x) + λ Ω(x), where l_i(x) = l(x, u_i, v_i).
The gradient (in case of a differentiable loss):
∇loss(x) = Σ_{j=1}^{m} Σ_{i ∈ C_j} ∇l_i(x) + λ ∇Ω(x)
Compute ∇l^j(x) = Σ_{i ∈ C_j} ∇l_i(x) on the j-th computer and communicate it to the central computer.
Distributed gradient descent
Compute ∇loss(x) = Σ_{j=1}^{m} ∇l^j(x) + λ ∇Ω(x) at the central computer.
The gradient descent update: x_{k+1} = x_k − α ∇loss(x_k), with α chosen by a (distributed) line search algorithm.
For non-differentiable loss functions, we can use a distributed sub-gradient descent algorithm.
Slow for most practical problems. For achieving ε tolerance:
Gradient descent (logistic regression): O(1/ε) iterations.
Sub-gradient descent (stochastic sub-gradient descent): O(1/ε²) iterations.
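A minimal single-process sketch of this scheme: the data is split into m shards, each "worker" computes its partial gradient Σ_{i ∈ C_j} ∇l_i(x), and the "central computer" sums the partial gradients, adds the regularizer gradient, and takes the step. In a real deployment the inner loop would run across machines (e.g. as a Hadoop or Spark job); here the parallelism is only simulated, and the fixed step size stands in for the line search mentioned above:

import numpy as np

def partial_gradient(x, U_part, v_part):
    # worker j: sum of logistic-loss gradients over its shard C_j
    s = -v_part / (1.0 + np.exp(v_part * (U_part @ x)))
    return U_part.T @ s

def distributed_gradient_descent(U, v, lam, m=4, step=0.1, iters=100):
    shards = np.array_split(np.arange(len(v)), m)        # C_1, ..., C_m
    x = np.zeros(U.shape[1])
    for _ in range(iters):
        # "map": each worker computes its partial gradient on its shard
        partials = [partial_gradient(x, U[idx], v[idx]) for idx in shards]
        # "reduce": the central computer sums them and adds the regularizer gradient
        grad = np.sum(partials, axis=0) + lam * x
        x = x - step * grad
    return x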