Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient algorithms
Ashok Vardhan Makkuva
University of Illinois at Urbana-Champaign Joint work with Sewoong Oh, Sreeram Kannan, Pramod Viswanath
Mixture-of-Experts (MoE)
Jacobs, Jordan, Nowlan and Hinton, 1991
Figure: Two-expert MoE. The input x is routed by the gate f(w⊺x), with weights f and 1 − f, between the experts g(a₁⊺x) and g(a₂⊺x) to produce the output y.
f = sigmoid; g = linear, tanh, ReLU, leaky ReLU
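A minimal sketch of sampling from this generative model (parameter names and the choice g = tanh are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_moe(X, w, a1, a2, sigma=0.5, g=np.tanh, rng=None):
    """Draw y from the 2-expert MoE: the gate f(w'x) picks expert g(a1'x) or g(a2'x)."""
    rng = np.random.default_rng() if rng is None else rng
    gate = sigmoid(X @ w)                      # P(expert 1 | x), shape (n,)
    pick1 = rng.random(len(X)) < gate          # latent expert assignment per sample
    means = np.where(pick1, g(X @ a1), g(X @ a2))
    return means + sigma * rng.standard_normal(len(X))

# toy usage with random parameters
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d))
w, a1, a2 = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
y = sample_moe(X, w, a1, a2, rng=rng)
```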
Outrageously large neural networks (Shazeer et al., 2017)
Figure: Gated Recurrent Unit (GRU)
Key features: gating mechanism, long-term memory
Gates: z_t, r_t ∈ [0,1]ᵈ depend on the input x_t and the past h_{t−1}
States: h_t, h̃_t ∈ ℝᵈ
Update equations for each t:
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
h̃_t = f(A x_t + r_t ⊙ B h_{t−1})
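A minimal NumPy sketch of one step of these updates; the slides only say that z_t, r_t depend on x_t and h_{t−1}, so the standard sigmoid-of-affine gates and the weight names Wz, Uz, Wr, Ur below are assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gru_cell(x_t, h_prev, A, B, Wz, Uz, Wr, Ur, f=np.tanh):
    """One GRU step following the update equations above.
    The sigmoid-of-affine gates and the names Wz, Uz, Wr, Ur are assumptions."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)       # update gate, entries in [0, 1]
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)       # reset gate, entries in [0, 1]
    h_tilde = f(A @ x_t + r_t * (B @ h_prev))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde   # new state h_t
```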
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ (1 − r_t) ⊙ f(A x_t) + z_t ⊙ r_t ⊙ f(A x_t + B h_{t−1})
Figure: h_t as a gated tree of experts: NN-1(x_t, h_{t−1}) weighted by 1 − z_t; under weight z_t, NN-2(x_t, h_{t−1}) weighted by 1 − r_t and NN-3(x_t, h_{t−1}) weighted by r_t.
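A minimal sketch of this three-expert form (the gate values z_t, r_t are taken as given; names are illustrative):

```python
import numpy as np

def gru_as_moe(x_t, h_prev, A, B, z_t, r_t, f=np.tanh):
    """The GRU step written as the three-expert mixture above (gate values given)."""
    expert1 = h_prev                      # keep the old state
    expert2 = f(A @ x_t)                  # read only the new input
    expert3 = f(A @ x_t + B @ h_prev)     # read both input and old state
    return (1 - z_t) * expert1 + z_t * (1 - r_t) * expert2 + z_t * r_t * expert3
```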
No provable learning algorithms for parameters¹
¹ 20 years of MoE; MoE: a literature survey
Figure: Two-expert MoE with gate f(w⊺x), weights f and 1 − f, over experts g(a₁⊺x) and g(a₂⊺x) ⇔
P(y ∣ x) = f(w⊺x) ⋅ N(y ∣ g(a₁⊺x), σ²) + (1 − f(w⊺x)) ⋅ N(y ∣ g(a₂⊺x), σ²)
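This conditional density is straightforward to evaluate; a minimal sketch (illustrative parameter names, g = tanh assumed):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gauss_pdf(y, mean, sigma):
    return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def moe_density(y, x, w, a1, a2, sigma=0.5, g=np.tanh):
    """P(y | x) = f(w'x) N(y | g(a1'x), sigma^2) + (1 - f(w'x)) N(y | g(a2'x), sigma^2)."""
    p1 = sigmoid(x @ w)                   # gate probability of expert 1
    return p1 * gauss_pdf(y, g(x @ a1), sigma) + (1 - p1) * gauss_pdf(y, g(x @ a2), sigma)
```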
Open question
Given n i.i.d. samples (x⁽ⁱ⁾, y⁽ⁱ⁾), does there exist an efficient learning algorithm with provable theoretical guarantees to learn the regressors a₁, a₂ and the gating parameter w?
Mixture of classification (w) and regression (a₁, a₂) problems
Key observation
If we know the regressors, learning the gating parameter is easy, and vice versa. How do we break the gridlock?
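To illustrate one direction of this observation: with the regressors a₁, a₂ known and fixed, the gating parameter w can be fit by ordinary gradient ascent on the log-likelihood, much like logistic regression (a sketch with assumed names; not the exact procedure from the talk):

```python
import numpy as np

def fit_gating_given_experts(X, y, a1, a2, sigma=0.5, g=np.tanh, lr=0.1, iters=500):
    """Gradient ascent on the MoE log-likelihood over w only, with a1, a2 known."""
    n, d = X.shape
    w = np.zeros(d)
    m1, m2 = g(X @ a1), g(X @ a2)                      # expert means, fixed
    l1 = np.exp(-(y - m1) ** 2 / (2 * sigma ** 2))     # unnormalized Gaussian likelihoods
    l2 = np.exp(-(y - m2) ** 2 / (2 * sigma ** 2))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))             # gate f(w'x)
        mix = p * l1 + (1 - p) * l2                    # per-sample mixture likelihood
        # d/dw log(mix) = (l1 - l2) * p * (1 - p) * x / mix
        grad = ((l1 - l2) * p * (1 - p) / mix) @ X / n
        w += lr * grad
    return w
```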
Recall the model for MoE:
P(y ∣ x) = f(w⊺x) ⋅ N(y ∣ g(a₁⊺x), σ²) + (1 − f(w⊺x)) ⋅ N(y ∣ g(a₂⊺x), σ²)
Main message
We propose a novel algorithm with the first recovery guarantees
We learn (a₁, a₂) and w separately:
First recover (a₁, a₂) without knowing w at all
Later learn w using traditional methods like EM
Global consistency guarantees (population setting)
Algorithm pipeline: samples (x, y) → cubic transform / score function → tensor decomposition → {â₁, â₂} → EM → ŵ
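A sketch of the tensor-decomposition step in this pipeline. Constructing the third-order tensor from samples (the cubic transform of y combined with score functions of x) follows the paper and is not reproduced here; given such a symmetric tensor whose rank-1 components are the regressors, a standard tensor power method with deflation recovers them:

```python
import numpy as np

def tensor_power_method(T, n_components=2, n_init=10, n_iters=100, rng=None):
    """Extract rank-1 components of a symmetric 3rd-order tensor T (d x d x d)
    via power iteration with deflation (standard method; whitening omitted)."""
    rng = np.random.default_rng() if rng is None else rng
    d = T.shape[0]
    T = T.copy()
    components = []
    for _ in range(n_components):
        best_v, best_lam = None, -np.inf
        for _ in range(n_init):                        # random restarts
            v = rng.standard_normal(d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T, v, v)   # T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)
            if lam > best_lam:
                best_v, best_lam = v, lam
        components.append((best_lam, best_v))
        # deflate: subtract the recovered rank-1 term before the next sweep
        T -= best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return components
```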
Figure: Plot of parameter estimation error, Spectral+EM vs. EM; (a) 3 mixtures, (b) 4 mixtures.
Algorithmic innovation: first provably consistent algorithms for MoE in 25+ years
Global convergence: our guarantees are global, with no need for careful initialization