Distributed Training for Large-scale Logistic Models


  1. Distributed Training for Large-scale Logistic Models. Siddharth Gopal, Carnegie Mellon University, 21 Aug 2013. Joint work with Yiming Yang; presented at ICML'13.

  2. Outline of the Talk: Logistic Models, Maximum Likelihood Estimation, Parallelization, Experiments.

  3. Logistic Models. Logistic models describe the probability of an outcome Y given a predictor x:

      P(Y = y \mid x; w) \propto \exp\!\left( w^\top \phi(y, x) \right)

     This subsumes multinomial logistic regression, conditional random fields, and maximum entropy models. For example, in multinomial logistic regression,

      P(Y = k \mid x; w) = \frac{\exp(w_k^\top x)}{\sum_j \exp(w_j^\top x)}
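
  Aside (not from the slides): a minimal numpy sketch of the multinomial-logistic probability above, assuming a weight matrix W with one row w_k per class.

      import numpy as np

      def class_probabilities(W, x):
          """P(Y = k | x; W) = exp(w_k^T x) / sum_j exp(w_j^T x).

          W : (K, D) array with one weight vector per class; x : (D,) feature vector.
          """
          scores = W @ x                # w_k^T x for every class k
          scores -= scores.max()        # shift scores for numerical stability
          exp_scores = np.exp(scores)
          return exp_scores / exp_scores.sum()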

  4. Focus of the Talk. Train logistic models on large-scale data. What is "large-scale"? A large number of training examples, high dimensionality, and a large number of outcomes.

  6. Motivation. Some commonly used data on the web:

     Dataset             #Instances    #Labels    #Features    #Parameters
     ODP subset              93,805     12,294      347,256      4,269,165,264
     Wikipedia subset      2,365,436    325,056    1,617,899    525,907,777,344
     Image-net            14,197,122     21,841            -                  -

  7. Motivation (contd.). Given data at this scale (the table above): How can we parallelize the training of such models? How can we optimize different subsets of parameters simultaneously?
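
  Aside (my note, not on the slides): #Parameters in the table is #Labels x #Features, i.e. one weight vector w_k per label. For the ODP subset, 12,294 x 347,256 = 4,269,165,264; for the Wikipedia subset, 325,056 x 1,617,899 = 525,907,777,344.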

  8. Maximum Likelihood Estimation (MLE). Typical MLE setup: N training examples, K classes; x_i denotes the i-th training example, and the indicator variable y_{ik} denotes whether x_i belongs to class k. Estimate the parameters w by maximizing the penalized log-likelihood:

      \max_w \; \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log P(y_{ik} \mid x_i; w) \;-\; \frac{\lambda}{2} \|w\|^2

  9. Maximum Likelihood Estimation (MLE), contd. For multinomial logistic regression this is equivalent to minimizing the regularized negative log-likelihood:

      [OPT1]   \min_w \; \frac{\lambda}{2} \|w\|^2 \;-\; \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^\top x_i \;+\; \sum_{i=1}^{N} \log\!\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right)
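
  Aside (my sketch, not from the talk): the OPT1 objective in numpy, assuming labels are given as a one-hot indicator matrix Y.

      import numpy as np
      from scipy.special import logsumexp

      def opt1_objective(W, X, Y, lam):
          """Regularized negative log-likelihood [OPT1].

          W : (K, D) class weight vectors, X : (N, D) examples,
          Y : (N, K) indicators y_ik, lam : regularization strength lambda.
          """
          scores = X @ W.T                          # (N, K) matrix of w_k^T x_i
          reg = 0.5 * lam * np.sum(W ** 2)          # (lambda / 2) ||w||^2
          fit = -np.sum(Y * scores)                 # -sum_i sum_k y_ik w_k^T x_i
          lse = np.sum(logsumexp(scores, axis=1))   # sum_i log sum_k exp(w_k^T x_i)
          return reg + fit + lse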

  10. Parallelization. Recall the objective:

      \min_w \; \frac{\lambda}{2} \|w\|^2 \;-\; \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^\top x_i \;+\; \sum_{i=1}^{N} \log\!\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right)

  11. Parallelization (contd.). The log-sum-exp (LSE) term in this objective couples all the class-level parameters w_k together.

  12. Parallelization (contd.). Since the LSE term couples all the class-level parameters w_k, the idea is to replace the LSE by a parallelizable function that:
      - is an upper bound on the LSE, and
      - does not make the optimization harder, e.g. by introducing non-convexity or non-differentiability.

  13. Bound 1: Piecewise Linear Bound (Hsiung et al.). Properties used: the LSE is a convex function, and a convex function can be approximated to any precision by piecewise linear functions:

      \max_j \{ a_j^\top \gamma + b_j \} \;\le\; \log\!\left( \sum_{k=1}^{K} \exp(\gamma_k) \right) \;\le\; \max_{j'} \{ c_{j'}^\top \gamma + d_{j'} \}, \qquad a_j, c_{j'} \in \mathbb{R}^K, \; b_j, d_{j'} \in \mathbb{R}

      [Figure: the LSE function with its piecewise linear lower and upper bounds]

  14. Bound 1: Piecewise Linear Bound (contd.). Advantages: the bound can be made arbitrarily accurate by increasing the number of pieces. Disadvantages: the max function makes the objective non-differentiable, the number of variational parameters grows with the approximation level, and optimizing the variational parameters is hard.

  15. Bound 2: Double Majorization (Bouchard 2007). The LSE is bounded by

      \log\!\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right) \;\le\; a_i + \sum_{k=1}^{K} \log\!\left( 1 + \exp(w_k^\top x_i - a_i) \right), \qquad a_i \in \mathbb{R}

     Advantages: the bound is parallelizable, it is an upper bound, and it is differentiable and convex.
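
  Aside (my sketch, not from the talk): a quick numerical check of Bouchard's bound on random scores; here gamma_k stands in for w_k^T x_i and a is an arbitrary real variational parameter.

      import numpy as np

      rng = np.random.default_rng(0)
      gamma = rng.normal(size=100)     # stand-in for the K scores w_k^T x_i
      a = rng.normal()                 # any real value of the variational parameter

      lse = np.log(np.sum(np.exp(gamma)))
      bound = a + np.sum(np.log1p(np.exp(gamma - a)))
      assert lse <= bound + 1e-9       # the upper bound holds for every real a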

  16. Bound 2: Double Majorization (contd.). Disadvantage: the bound is not tight enough.

      [Figure "Efficiency of Bound": the gap between the true objective (log-sum-exp) and the upper-bounded objective over iterations on the 20-newsgroup dataset]

  17. Bound 3: Log Concavity. A relatively well-known bound using the concavity of the log function:

      \log(x) \;\le\; a x - \log(a) - 1 \qquad \forall \, x, a > 0

      [Figure "Log Concavity Bound": log(x) together with its linear upper bounds for a = 0.02, 0.3, and 2]
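
  One way to see this bound (a standard argument, not spelled out on the slide): the right-hand side is the tangent line to log(x) at x_0 = 1/a, and a concave function lies below all of its tangents:

      \log(x) \;\le\; \log(x_0) + \frac{x - x_0}{x_0} \;=\; -\log(a) + a x - 1, \qquad x_0 = \tfrac{1}{a},

  with equality exactly at x = 1/a.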

  18. Bound 3: Log Concavity (contd.). Applying it to the LSE function:

      \log\!\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right) \;\le\; a_i \sum_{k=1}^{K} \exp(w_k^\top x_i) - \log(a_i) - 1

     Advantages: the bound is parallelizable, it is differentiable, optimizing the variational parameter a_i is easy, and the upper bound is exact at

      a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}

     Disadvantage: the combined objective is non-convex.
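
  Aside (my sketch, not from the talk): checking the LSE bound above and its tightness at the stated optimal a_i for a single example.

      import numpy as np

      rng = np.random.default_rng(1)
      scores = rng.normal(size=50)              # the K values w_k^T x_i for one example
      S = np.sum(np.exp(scores))

      lse = np.log(S)

      a_opt = 1.0 / S                           # a_i at which the bound is exact
      assert np.isclose(lse, a_opt * S - np.log(a_opt) - 1)

      a_other = 0.5 * a_opt                     # any other a_i > 0 still upper-bounds the LSE
      assert lse <= a_other * S - np.log(a_other) - 1 + 1e-9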

  19. Reaching Optimality. MLE estimation:

      \min_w \; \frac{\lambda}{2} \|w\|^2 \;-\; \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^\top x_i \;+\; \sum_{i=1}^{N} \log\!\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right)

     Log-concavity bound:

      \log\!\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right) \;\le\; a_i \sum_{k=1}^{K} \exp(w_k^\top x_i) - \log(a_i) - 1

  20. Reaching Optimality (contd.). Plugging the log-concavity bound into the MLE objective gives the combined objective

      F(W, A) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 + \sum_{i=1}^{N} \left( -\sum_{k=1}^{K} y_{ik} \, w_k^\top x_i + a_i \sum_{k=1}^{K} \exp(w_k^\top x_i) - \log(a_i) - 1 \right)

  21. Reaching Optimality (contd.). Despite the non-convexity of F(W, A), we can show that the combined objective has a unique minimum, and this minimum coincides with the optimal MLE solution.
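
  One step that makes the alternating scheme on the next slide concrete (standard calculus, not spelled out on the slides): for fixed W, setting the derivative of F with respect to a_i to zero gives a closed-form update,

      \frac{\partial F}{\partial a_i} = \sum_{k=1}^{K} \exp(w_k^\top x_i) - \frac{1}{a_i} = 0
      \quad\Longrightarrow\quad
      a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)},

  which is exactly the value at which the bound is tight, so the A-step recovers the true MLE objective at the current W.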

  22. Reaching Optimality (contd.). An iterative, parallel block coordinate descent algorithm converges to the unique minimum.

      Algorithm 1: Parallel block coordinate descent
      Initialize: t ← 0, A_0 ← 1/K, W_0 ← 0
      While not converged:
          In parallel: W_{t+1} ← argmin_W F(W, A_t)
          A_{t+1} ← argmin_A F(W_{t+1}, A)
          t ← t + 1
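
  Aside (my own single-machine sketch, not the paper's distributed implementation): one way to realize Algorithm 1 in numpy, assuming a simple gradient-descent W-step with a hypothetical learning rate lr. Once A is fixed, F(W, A) separates over the classes, so in a distributed setting each w_k (or block of w_k's) can be updated on a separate worker.

      import numpy as np

      def fit_logconcavity_bound(X, Y, lam, n_outer=50, n_inner=25, lr=1e-3):
          """Block coordinate descent on F(W, A).

          X : (N, D) examples, Y : (N, K) indicators y_ik, lam : regularizer lambda.
          """
          N, D = X.shape
          K = Y.shape[1]
          W = np.zeros((K, D))                      # W_0 <- 0
          A = np.full(N, 1.0 / K)                   # A_0 <- 1/K
          for _ in range(n_outer):
              # W-step: minimize F(W, A) with A fixed; the K rows of the
              # gradient are independent, hence parallelizable across classes.
              for _ in range(n_inner):
                  E = np.exp(X @ W.T)               # (N, K), entries exp(w_k^T x_i)
                  grad = lam * W + (A[:, None] * E - Y).T @ X
                  W -= lr * grad
              # A-step: closed form, a_i = 1 / sum_k exp(w_k^T x_i).
              A = 1.0 / np.exp(X @ W.T).sum(axis=1)
          return W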

  23. Experimental Comparison.

     Datasets:
     Dataset        #Instances   #Leaf-labels   #Features    #Parameters     Parameter size (approx.)
     CLEF               10,000             63          80           5,040    40 KB
     NEWS20             11,260             20      53,975       1,079,500    4 MB
     LSHTC-small         4,463          1,139      51,033     227,760,279    911 MB
     LSHTC-large        93,805         12,294     347,256   4,269,165,264    17 GB

     Optimization methods:
     - Double Majorization bound (DM)
     - Log-concavity bound (LC)
     - Limited-memory BFGS (LBFGS), the most widely used method
     - Alternating Direction Method of Multipliers (ADMM)
