

SLIDE 1

Distributed Training for Large-scale Logistic Models

Siddharth Gopal

Carnegie Mellon University

21 Aug 2013

Joint work with Yiming Yang, presented at ICML’13.

SLIDE 2

Outline of the Talk

Logistic Models
Maximum Likelihood Estimation
Parallelization
Experiments

SLIDE 3

Logistic Models

Logistic models describe the probability of an outcome Y given a predictor x:

P(Y = y | x; w) ∝ exp(w⊤ φ(y, x))

This family subsumes multinomial logistic regression, conditional random fields, and maximum-entropy models. For example, in multinomial logistic regression,

P(Y = k | x; w) = exp(w_k⊤ x) / Σ_{j=1}^{K} exp(w_j⊤ x)
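As a quick illustration of the multinomial case, here is a minimal NumPy sketch (the function name, shapes, and example values are illustrative assumptions, not from the talk):

```python
import numpy as np

def softmax_probs(W, x):
    """P(Y = k | x; W) for multinomial logistic regression.

    W : (K, D) array with one weight vector w_k per class.
    x : (D,) feature vector.
    Returns a length-K vector of class probabilities.
    """
    scores = W @ x                  # w_k^T x for every class k
    scores -= scores.max()          # shift by the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example: 3 classes, 4 features; the probabilities sum to 1.
rng = np.random.default_rng(0)
print(softmax_probs(rng.normal(size=(3, 4)), rng.normal(size=4)))
```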

SLIDE 4

Focus of the Talk

Train logistic models on large-scale data. What is large-scale?
  - A large number of training examples
  - High dimensionality
  - A large number of outcomes

SLIDE 6

Motivation

Some commonly used datasets on the web:

Dataset            #Instances   #Labels   #Features   #Parameters
ODP subset         93,805       12,294    347,256     4,269,165,264
Wikipedia subset   2,365,436    325,056   1,617,899   525,907,777,344
Image-net          14,197,122   21,841
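Assuming one weight vector per label (so #Parameters = #Labels × #Features), the parameter counts in the table can be checked directly; a tiny sanity check:

```python
# One weight vector per label => #Parameters = #Labels * #Features
print(12_294 * 347_256)        # 4269165264     (ODP subset)
print(325_056 * 1_617_899)     # 525907777344   (Wikipedia subset)
```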

SLIDE 7

Motivation

Some commonly used datasets on the web:

Dataset            #Instances   #Labels   #Features   #Parameters
ODP subset         93,805       12,294    347,256     4,269,165,264
Wikipedia subset   2,365,436    325,056   1,617,899   525,907,777,344
Image-net          14,197,122   21,841

How can we parallelize the training of such models?
How can we optimize different subsets of parameters simultaneously?

SLIDE 8

Maximum Likelihood Estimation (MLE)

Typical MLE setup: N training examples, K classes. x_i denotes the i-th training example, and the indicator variable y_ik denotes whether x_i belongs to class k. Estimate the parameters w by maximizing the regularized log-likelihood:

max_w  Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log P(y_ik | x_i; w) − (λ/2) ‖w‖²

SLIDE 9

Maximum Likelihood Estimation (MLE)

Typical MLE setup: N training examples, K classes. x_i denotes the i-th training example, and the indicator variable y_ik denotes whether x_i belongs to class k. Estimate the parameters w by maximizing the regularized log-likelihood:

max_w  Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log P(y_ik | x_i; w) − (λ/2) ‖w‖²

Equivalently, minimize the regularized negative log-likelihood:

[OPT1]  min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)
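A minimal NumPy sketch of the [OPT1] objective (the names, array shapes, and dense computation are illustrative assumptions, not from the talk):

```python
import numpy as np

def opt1_objective(W, X, Y, lam):
    """[OPT1]: lam/2 * ||W||^2 - sum_i sum_k y_ik w_k^T x_i
               + sum_i log sum_k exp(w_k^T x_i).

    W : (K, D) class weight vectors, X : (N, D) examples,
    Y : (N, K) one-hot indicators,   lam : L2 regularization strength.
    """
    scores = X @ W.T                                   # entries w_k^T x_i, shape (N, K)
    m = scores.max(axis=1, keepdims=True)              # stabilize the log-sum-exp
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return 0.5 * lam * np.sum(W ** 2) - np.sum(Y * scores) + np.sum(lse)
```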

SLIDE 10

Parallelization

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

SLIDE 11

Parallelization

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

The log-sum-exp (LSE) term couples all the class-level parameters w_k together.

SLIDE 12

Parallelization

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

The log-sum-exp (LSE) term couples all the class-level parameters w_k together. Idea: replace the LSE with a parallelizable function. This replacement should be an upper bound on the LSE, and it should not make the optimization harder, e.g., by introducing non-convexity or non-differentiability.

SLIDE 13

Bound 1 - Piecewise Linear Bound (Hsiung et al)

Properties used: the LSE is a convex function, and a convex function can be approximated to any precision by piecewise linear functions.

max_j {a_j⊤ γ + b_j}  ≤  log Σ_{k=1}^{K} exp(γ_k)  ≤  max_{j′} {c_{j′}⊤ γ + d_{j′}},    a_j, c_{j′} ∈ ℝ^K,  b_j, d_{j′} ∈ ℝ

(lower bound ≤ LSE ≤ upper bound)

SLIDE 14

Bound 1 - Piecewise Linear Bound (Hsiung et al)

max_j {a_j⊤ γ + b_j}  ≤  log Σ_{k=1}^{K} exp(γ_k)  ≤  max_{j′} {c_{j′}⊤ γ + d_{j′}},    a_j, c_{j′} ∈ ℝ^K,  b_j, d_{j′} ∈ ℝ

Advantages: the bound can be made arbitrarily accurate by increasing the number of pieces.
Disadvantages: the max function makes the objective non-differentiable; the number of variational parameters grows with the approximation level; the variational parameters are hard to optimize.

SLIDE 15

Bound 2 - Double Majorization (Bouchard 2007)

The LSE is bounded by

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i + Σ_{k=1}^{K} log(1 + exp(w_k⊤ x_i − a_i)),    a_i ∈ ℝ

Advantages: the bound is parallelizable, it is an upper bound, and it is differentiable and convex.
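A quick numerical check that the double-majorization inequality holds for any choice of a_i (a minimal sketch; the helper names and random scores are illustrative):

```python
import numpy as np

def log_sum_exp(s):
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def bouchard_bound(s, a):
    """Upper bound on log-sum-exp (Bouchard, 2007):
    LSE(s) <= a + sum_k log(1 + exp(s_k - a)) for any real a."""
    return a + np.log1p(np.exp(s - a)).sum()

rng = np.random.default_rng(1)
s = rng.normal(size=10)                    # scores w_k^T x_i for one example
for a in (-2.0, 0.0, 3.0):                 # the inequality holds for every a
    assert bouchard_bound(s, a) >= log_sum_exp(s)
    print(a, bouchard_bound(s, a) - log_sum_exp(s))   # gap is always >= 0
```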

SLIDE 16

Bound 2 - Double Majorization (Bouchard 2007)

Disadvantage: the bound is not tight enough.

[Figure "Efficiency of Bound": function value vs. iteration for the log-sum-exp objective and the upper-bounded objective, showing the gap between the true objective and the upper-bounded objective on the 20-newsgroups dataset.]

SLIDE 17

Bound 3 - Log Concavity

A well-known bound using the concavity of the log function: log(x) ≤ ax − log(a) − 1 for all x, a > 0.

[Figure "Log Concavity Bound": log(x) plotted against the linear upper bounds ax − log(a) − 1 for a = 0.3, a = 2, and a = 0.02.]

SLIDE 18

Bound 3 - Log Concavity

Applying this to the LSE function:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

Advantages: the bound is parallelizable, it is differentiable, optimizing the variational parameter a_i is easy, and the bound is exact at a_i = 1 / Σ_{k=1}^{K} exp(w_k⊤ x_i).
Disadvantage: the combined objective is non-convex.
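A minimal numerical sketch of this bound and its optimal variational parameter (helper names and example values are assumptions, not from the talk):

```python
import numpy as np

def lse(s):
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def log_concavity_bound(s, a):
    """log sum_k exp(s_k) <= a * sum_k exp(s_k) - log(a) - 1 for any a > 0,
    obtained from log(x) <= a*x - log(a) - 1 at x = sum_k exp(s_k)."""
    return a * np.exp(s).sum() - np.log(a) - 1.0

rng = np.random.default_rng(2)
s = rng.normal(size=5)                           # scores w_k^T x_i for one example
a_star = 1.0 / np.exp(s).sum()                   # optimal variational parameter
print(lse(s), log_concavity_bound(s, a_star))    # equal: the bound is exact here
print(log_concavity_bound(s, 2.0) - lse(s))      # nonnegative gap for any a > 0
```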

SLIDE 19

Reaching Optimality

MLE estimation [OPT1]:

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

Log-concavity bound:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

SLIDE 20

Reaching Optimality

MLE estimation [OPT1]:

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

Log-concavity bound:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

Combined objective:

F(W, A) = (λ/2) Σ_{k=1}^{K} ‖w_k‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} [ a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1 ]
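A minimal NumPy sketch of F(W, A) as written above, with the same minus sign on the y_ik term as in [OPT1] (shapes and names are illustrative assumptions):

```python
import numpy as np

def combined_objective(W, A, X, Y, lam):
    """F(W, A) = lam/2 * sum_k ||w_k||^2 - sum_i sum_k y_ik w_k^T x_i
                 + sum_i [ a_i * sum_k exp(w_k^T x_i) - log(a_i) - 1 ].

    W : (K, D) class weight vectors, A : (N,) positive variational parameters,
    X : (N, D) examples, Y : (N, K) one-hot indicators, lam : L2 strength.
    """
    scores = X @ W.T                        # w_k^T x_i, shape (N, K)
    sum_exp = np.exp(scores).sum(axis=1)    # sum_k exp(w_k^T x_i), shape (N,)
    return (0.5 * lam * np.sum(W ** 2)
            - np.sum(Y * scores)
            + np.sum(A * sum_exp - np.log(A) - 1.0))
```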

SLIDE 21

Reaching Optimality

MLE estimation [OPT1]:

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

Log-concavity bound:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

Combined objective:

F(W, A) = (λ/2) Σ_{k=1}^{K} ‖w_k‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} [ a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1 ]

Despite the non-convexity, we can show that the combined objective has a unique minimum, and that this minimum coincides with the optimal MLE solution.

SLIDE 22

Reaching Optimality

An iterative, parallel block coordinate descent algorithm converges to this unique minimum.

Algorithm 1: Parallel block coordinate descent
  Initialize: t ← 0, A_0 ← 1/K, W_0 ← 0.
  While not converged:
    In parallel: W_{t+1} ← arg min_W F(W, A_t)
    A_{t+1} ← arg min_A F(W_{t+1}, A)
    t ← t + 1
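A rough Python sketch of Algorithm 1, with a plain gradient step standing in for the exact arg-min over W and the closed-form A-step (the learning rate, iteration count, and all names are assumptions, not from the talk):

```python
import numpy as np

def block_coordinate_descent(X, Y, lam, n_iters=50, lr=0.1):
    """Alternate a W-step (separable over classes, hence parallelizable)
    with the closed-form A-step a_i = 1 / sum_k exp(w_k^T x_i)."""
    N, D = X.shape
    K = Y.shape[1]
    W = np.zeros((K, D))
    A = np.full(N, 1.0 / K)                 # the bound is exact at W = 0
    for _ in range(n_iters):
        # W-step: with A fixed, F(W, A) separates over classes, so each w_k
        # could be minimized on a different machine; here, one gradient step.
        scores = X @ W.T                    # (N, K)
        grad = lam * W + (A[:, None] * np.exp(scores) - Y).T @ X
        W -= lr * grad
        # A-step: closed-form minimizer of F(W, A) in A.
        A = 1.0 / np.exp(X @ W.T).sum(axis=1)
    return W, A
```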

SLIDE 23

Experimental Comparison

Datasets

Dataset       #Instances   #Leaf-labels   #Features   #Parameters     Parameter Size (approx.)
CLEF          10,000       63             80          5,040           40 KB
NEWS20        11,260       20             53,975      1,079,500       4 MB
LSHTC-small   4,463        1,139          51,033      227,760,279     911 MB
LSHTC-large   93,805       12,294         347,256     4,269,165,264   17 GB

Optimization methods compared:
  - Double Majorization bound (DM)
  - Log-Concavity bound (LC)
  - Limited-memory BFGS (LBFGS), the most widely used method
  - Alternating Direction Method of Multipliers (ADMM)

SLIDE 24

Time Complexity

[Four panels: difference from the optimum vs. time taken (secs) for ADMM, LC, LBFGS, and DM on the NEWS-20, CLEF, and LSHTC-small datasets; objective vs. time taken (secs) on the LSHTC-large dataset.]

Figure: The difference from the true optimum vs. time.

SLIDE 25

Conclusion

We discussed multiple ways to perform distributed training of large-scale logistic models. The LC method seems to offer the best trade-off between accuracy and time. Several open questions remain:

  - the effect of the regularization parameter λ, and
  - the effect of the correlation between the parameters.

SLIDE 26

Binary vs. Multiclass

[Figure: accuracy vs. lambda (regularization parameter) for binary logistic regression and multiclass logistic regression.]
