

SLIDE 1

Distributed Training for Large-scale Logistic Models

Siddharth Gopal

Carnegie Mellon University

21 Aug 2013

Joint work with Yiming Yang, presented at ICML’13.

SLIDE 2

Outline of the Talk

Logistic Models
Maximum Likelihood Estimation
Parallelization
Experiments

SLIDE 3

Logistic Models

Logistic models describe the probability of an outcome Y given a predictor x:

P(Y = y | x; w) ∝ exp(w⊤ φ(y, x))

This family subsumes multinomial logistic regression, conditional random fields, and maximum-entropy models. For example, in multinomial logistic regression,

P(Y = k | x; w) = exp(w_k⊤ x) / Σ_{j=1}^{K} exp(w_j⊤ x)
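As a quick illustration of the multinomial case, here is a minimal NumPy sketch (the function name, shapes, and example values are illustrative assumptions, not from the talk):

```python
import numpy as np

def softmax_probs(W, x):
    """P(Y = k | x; W) for multinomial logistic regression.

    W : (K, D) array with one weight vector w_k per class.
    x : (D,) feature vector.
    Returns a length-K vector of class probabilities.
    """
    scores = W @ x                  # w_k^T x for every class k
    scores -= scores.max()          # shift by the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example: 3 classes, 4 features; the probabilities sum to 1.
rng = np.random.default_rng(0)
print(softmax_probs(rng.normal(size=(3, 4)), rng.normal(size=4)))
```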

SLIDE 4

Focus of the Talk

Train logistic models on large-scale data. What is large-scale?
  - A large number of training examples
  - High dimensionality
  - A large number of outcomes

SLIDE 6

Motivation

Some commonly used datasets on the web:

Dataset            #Instances   #Labels   #Features   #Parameters
ODP subset         93,805       12,294    347,256     4,269,165,264
Wikipedia subset   2,365,436    325,056   1,617,899   525,907,777,344
Image-net          14,197,122   21,841
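Assuming one weight vector per label (so #Parameters = #Labels × #Features), the parameter counts in the table can be checked directly; a tiny sanity check:

```python
# One weight vector per label => #Parameters = #Labels * #Features
print(12_294 * 347_256)        # 4269165264     (ODP subset)
print(325_056 * 1_617_899)     # 525907777344   (Wikipedia subset)
```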

SLIDE 7

Motivation

Some commonly used datasets on the web:

Dataset            #Instances   #Labels   #Features   #Parameters
ODP subset         93,805       12,294    347,256     4,269,165,264
Wikipedia subset   2,365,436    325,056   1,617,899   525,907,777,344
Image-net          14,197,122   21,841

How can we parallelize the training of such models?
How can we optimize different subsets of parameters simultaneously?

SLIDE 8

Maximum Likelihood Estimation (MLE)

Typical MLE setup: N training examples, K classes. x_i denotes the i-th training example, and the indicator variable y_ik denotes whether x_i belongs to class k. Estimate the parameters w by maximizing the regularized log-likelihood:

max_w  Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log P(y_ik | x_i; w) − (λ/2) ‖w‖²

SLIDE 9

Maximum Likelihood Estimation (MLE)

Typical MLE setup: N training examples, K classes. x_i denotes the i-th training example, and the indicator variable y_ik denotes whether x_i belongs to class k. Estimate the parameters w by maximizing the regularized log-likelihood:

max_w  Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log P(y_ik | x_i; w) − (λ/2) ‖w‖²

Equivalently, minimize the regularized negative log-likelihood:

[OPT1]  min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)
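A minimal NumPy sketch of the [OPT1] objective (the names, array shapes, and dense computation are illustrative assumptions, not from the talk):

```python
import numpy as np

def opt1_objective(W, X, Y, lam):
    """[OPT1]: lam/2 * ||W||^2 - sum_i sum_k y_ik w_k^T x_i
               + sum_i log sum_k exp(w_k^T x_i).

    W : (K, D) class weight vectors, X : (N, D) examples,
    Y : (N, K) one-hot indicators,   lam : L2 regularization strength.
    """
    scores = X @ W.T                                   # entries w_k^T x_i, shape (N, K)
    m = scores.max(axis=1, keepdims=True)              # stabilize the log-sum-exp
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return 0.5 * lam * np.sum(W ** 2) - np.sum(Y * scores) + np.sum(lse)
```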

SLIDE 10

Parallelization

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

SLIDE 11

Parallelization

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

The log-sum-exp (LSE) term couples all the class-level parameters w_k together.

SLIDE 12

Parallelization

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

The log-sum-exp (LSE) term couples all the class-level parameters w_k together. Idea: replace the LSE with a parallelizable function. This replacement should be an upper bound on the LSE, and it should not make the optimization harder, e.g., by introducing non-convexity or non-differentiability.

SLIDE 13

Bound 1 - Piecewise Linear Bound (Hsiung et al)

Properties used: the LSE is a convex function, and a convex function can be approximated to any precision by piecewise linear functions.

max_j {a_j⊤ γ + b_j}  ≤  log Σ_{k=1}^{K} exp(γ_k)  ≤  max_{j′} {c_{j′}⊤ γ + d_{j′}},    a_j, c_{j′} ∈ ℝ^K,  b_j, d_{j′} ∈ ℝ

(lower bound ≤ LSE ≤ upper bound)

SLIDE 14

Bound 1 - Piecewise Linear Bound (Hsiung et al)

max_j {a_j⊤ γ + b_j}  ≤  log Σ_{k=1}^{K} exp(γ_k)  ≤  max_{j′} {c_{j′}⊤ γ + d_{j′}},    a_j, c_{j′} ∈ ℝ^K,  b_j, d_{j′} ∈ ℝ

Advantages: the bound can be made arbitrarily accurate by increasing the number of pieces.
Disadvantages: the max function makes the objective non-differentiable; the number of variational parameters grows with the approximation level; the variational parameters are hard to optimize.

SLIDE 15

Bound 2 - Double Majorization (Bouchard 2007)

The LSE is bounded by

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i + Σ_{k=1}^{K} log(1 + exp(w_k⊤ x_i − a_i)),    a_i ∈ ℝ

Advantages: the bound is parallelizable, it is an upper bound, and it is differentiable and convex.
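A quick numerical check that the double-majorization inequality holds for any choice of a_i (a minimal sketch; the helper names and random scores are illustrative):

```python
import numpy as np

def log_sum_exp(s):
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def bouchard_bound(s, a):
    """Upper bound on log-sum-exp (Bouchard, 2007):
    LSE(s) <= a + sum_k log(1 + exp(s_k - a)) for any real a."""
    return a + np.log1p(np.exp(s - a)).sum()

rng = np.random.default_rng(1)
s = rng.normal(size=10)                    # scores w_k^T x_i for one example
for a in (-2.0, 0.0, 3.0):                 # the inequality holds for every a
    assert bouchard_bound(s, a) >= log_sum_exp(s)
    print(a, bouchard_bound(s, a) - log_sum_exp(s))   # gap is always >= 0
```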

SLIDE 16

Bound 2 - Double Majorization (Bouchard 2007)

Disadvantage: the bound is not tight enough.

[Figure "Efficiency of Bound": function value vs. iteration for the log-sum-exp objective and the upper-bounded objective, showing the gap between the true objective and the upper-bounded objective on the 20-newsgroups dataset.]

SLIDE 17

Bound 3 - Log Concavity

A well-known bound using the concavity of the log function: log(x) ≤ ax − log(a) − 1 for all x, a > 0.

[Figure "Log Concavity Bound": log(x) plotted against the linear upper bounds ax − log(a) − 1 for a = 0.3, a = 2, and a = 0.02.]

SLIDE 18

Bound 3 - Log Concavity

Applying this to the LSE function:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

Advantages: the bound is parallelizable, it is differentiable, optimizing the variational parameter a_i is easy, and the bound is exact at a_i = 1 / Σ_{k=1}^{K} exp(w_k⊤ x_i).
Disadvantage: the combined objective is non-convex.
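A minimal numerical sketch of this bound and its optimal variational parameter (helper names and example values are assumptions, not from the talk):

```python
import numpy as np

def lse(s):
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def log_concavity_bound(s, a):
    """log sum_k exp(s_k) <= a * sum_k exp(s_k) - log(a) - 1 for any a > 0,
    obtained from log(x) <= a*x - log(a) - 1 at x = sum_k exp(s_k)."""
    return a * np.exp(s).sum() - np.log(a) - 1.0

rng = np.random.default_rng(2)
s = rng.normal(size=5)                           # scores w_k^T x_i for one example
a_star = 1.0 / np.exp(s).sum()                   # optimal variational parameter
print(lse(s), log_concavity_bound(s, a_star))    # equal: the bound is exact here
print(log_concavity_bound(s, 2.0) - lse(s))      # nonnegative gap for any a > 0
```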

SLIDE 19

Reaching Optimality

MLE estimation [OPT1]:

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

Log-concavity bound:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

SLIDE 20

Reaching Optimality

MLE estimation [OPT1]:

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

Log-concavity bound:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

Combined objective:

F(W, A) = (λ/2) Σ_{k=1}^{K} ‖w_k‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} [ a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1 ]
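A minimal NumPy sketch of F(W, A) as written above, with the same minus sign on the y_ik term as in [OPT1] (shapes and names are illustrative assumptions):

```python
import numpy as np

def combined_objective(W, A, X, Y, lam):
    """F(W, A) = lam/2 * sum_k ||w_k||^2 - sum_i sum_k y_ik w_k^T x_i
                 + sum_i [ a_i * sum_k exp(w_k^T x_i) - log(a_i) - 1 ].

    W : (K, D) class weight vectors, A : (N,) positive variational parameters,
    X : (N, D) examples, Y : (N, K) one-hot indicators, lam : L2 strength.
    """
    scores = X @ W.T                        # w_k^T x_i, shape (N, K)
    sum_exp = np.exp(scores).sum(axis=1)    # sum_k exp(w_k^T x_i), shape (N,)
    return (0.5 * lam * np.sum(W ** 2)
            - np.sum(Y * scores)
            + np.sum(A * sum_exp - np.log(A) - 1.0))
```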

SLIDE 21

Reaching Optimality

MLE estimation [OPT1]:

min_w  (λ/2) ‖w‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} log Σ_{k=1}^{K} exp(w_k⊤ x_i)

Log-concavity bound:

log Σ_{k=1}^{K} exp(w_k⊤ x_i)  ≤  a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1

Combined objective:

F(W, A) = (λ/2) Σ_{k=1}^{K} ‖w_k‖² − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik w_k⊤ x_i + Σ_{i=1}^{N} [ a_i Σ_{k=1}^{K} exp(w_k⊤ x_i) − log(a_i) − 1 ]

Despite the non-convexity, we can show that the combined objective has a unique minimum, and that this minimum coincides with the optimal MLE solution.

SLIDE 22

Reaching Optimality

An iterative, parallel block coordinate descent algorithm converges to this unique minimum.

Algorithm 1: Parallel block coordinate descent
  Initialize: t ← 0, A_0 ← 1/K, W_0 ← 0.
  While not converged:
    In parallel: W_{t+1} ← arg min_W F(W, A_t)
    A_{t+1} ← arg min_A F(W_{t+1}, A)
    t ← t + 1
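A rough Python sketch of Algorithm 1, with a plain gradient step standing in for the exact arg-min over W and the closed-form A-step (the learning rate, iteration count, and all names are assumptions, not from the talk):

```python
import numpy as np

def block_coordinate_descent(X, Y, lam, n_iters=50, lr=0.1):
    """Alternate a W-step (separable over classes, hence parallelizable)
    with the closed-form A-step a_i = 1 / sum_k exp(w_k^T x_i)."""
    N, D = X.shape
    K = Y.shape[1]
    W = np.zeros((K, D))
    A = np.full(N, 1.0 / K)                 # the bound is exact at W = 0
    for _ in range(n_iters):
        # W-step: with A fixed, F(W, A) separates over classes, so each w_k
        # could be minimized on a different machine; here, one gradient step.
        scores = X @ W.T                    # (N, K)
        grad = lam * W + (A[:, None] * np.exp(scores) - Y).T @ X
        W -= lr * grad
        # A-step: closed-form minimizer of F(W, A) in A.
        A = 1.0 / np.exp(X @ W.T).sum(axis=1)
    return W, A
```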

SLIDE 23

Experimental Comparison

Datasets

Dataset       #Instances   #Leaf-labels   #Features   #Parameters     Parameter Size (approx.)
CLEF          10,000       63             80          5,040           40 KB
NEWS20        11,260       20             53,975      1,079,500       4 MB
LSHTC-small   4,463        1,139          51,033      227,760,279     911 MB
LSHTC-large   93,805       12,294         347,256     4,269,165,264   17 GB

Optimization methods compared:
  - Double Majorization bound (DM)
  - Log-Concavity bound (LC)
  - Limited-memory BFGS (LBFGS), the most widely used method
  - Alternating Direction Method of Multipliers (ADMM)

SLIDE 24

Time Complexity

[Four panels: difference from the optimum vs. time taken (secs) for ADMM, LC, LBFGS, and DM on the NEWS-20, CLEF, and LSHTC-small datasets; objective vs. time taken (secs) on the LSHTC-large dataset.]

Figure: The difference from the true optimum vs. time.

SLIDE 25

Conclusion

We discussed multiple ways to perform distributed training of large-scale logistic models. The LC method seems to offer the best trade-off between accuracy and time. Several open questions remain:

  - the effect of the regularization parameter λ, and
  - the effect of the correlation between the parameters.

SLIDE 26

Binary vs. Multiclass

[Figure: accuracy vs. lambda (regularization parameter) for binary logistic regression and multiclass logistic regression.]
