Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient algorithms
Ashok Vardhan Makkuva
University of Illinois at Urbana-Champaign Joint work with Sewoong Oh, Sreeram Kannan, Pramod Viswanath
Mixture-of-Experts (MoE)
Jacobs, Jordan, Nowlan and Hinton, 1991
Figure: Two-expert MoE. The input x is routed by the gate f(w⊺x), with weights f and 1 − f, between the experts g(a₁⊺x) and g(a₂⊺x) to produce the output y.
f = sigmoid; g = linear, tanh, ReLU, leaky ReLU
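A minimal sketch of sampling from this generative model (parameter names and the choice g = tanh are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_moe(X, w, a1, a2, sigma=0.5, g=np.tanh, rng=None):
    """Draw y from the 2-expert MoE: the gate f(w'x) picks expert g(a1'x) or g(a2'x)."""
    rng = np.random.default_rng() if rng is None else rng
    gate = sigmoid(X @ w)                      # P(expert 1 | x), shape (n,)
    pick1 = rng.random(len(X)) < gate          # latent expert assignment per sample
    means = np.where(pick1, g(X @ a1), g(X @ a2))
    return means + sigma * rng.standard_normal(len(X))

# toy usage with random parameters
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d))
w, a1, a2 = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
y = sample_moe(X, w, a1, a2, rng=rng)
```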
Outrageously large neural networks (Shazeer et al., 2017)
Figure: Gated Recurrent Unit (GRU)
Key features: gating mechanism, long-term memory
Gates: z_t, r_t ∈ [0,1]ᵈ depend on the input x_t and the past h_{t−1}
States: h_t, h̃_t ∈ ℝᵈ
Update equations for each t:
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
h̃_t = f(A x_t + r_t ⊙ B h_{t−1})
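A minimal NumPy sketch of one step of these updates; the slides only say that z_t, r_t depend on x_t and h_{t−1}, so the standard sigmoid-of-affine gates and the weight names Wz, Uz, Wr, Ur below are assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gru_cell(x_t, h_prev, A, B, Wz, Uz, Wr, Ur, f=np.tanh):
    """One GRU step following the update equations above.
    The sigmoid-of-affine gates and the names Wz, Uz, Wr, Ur are assumptions."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)       # update gate, entries in [0, 1]
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)       # reset gate, entries in [0, 1]
    h_tilde = f(A @ x_t + r_t * (B @ h_prev))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde   # new state h_t
```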
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ (1 − r_t) ⊙ f(A x_t) + z_t ⊙ r_t ⊙ f(A x_t + B h_{t−1})
Figure: h_t as a gated tree of experts: NN-1(x_t, h_{t−1}) weighted by 1 − z_t; under weight z_t, NN-2(x_t, h_{t−1}) weighted by 1 − r_t and NN-3(x_t, h_{t−1}) weighted by r_t.
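A minimal sketch of this three-expert form (the gate values z_t, r_t are taken as given; names are illustrative):

```python
import numpy as np

def gru_as_moe(x_t, h_prev, A, B, z_t, r_t, f=np.tanh):
    """The GRU step written as the three-expert mixture above (gate values given)."""
    expert1 = h_prev                      # keep the old state
    expert2 = f(A @ x_t)                  # read only the new input
    expert3 = f(A @ x_t + B @ h_prev)     # read both input and old state
    return (1 - z_t) * expert1 + z_t * (1 - r_t) * expert2 + z_t * r_t * expert3
```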
No provable learning algorithms for parameters¹
¹ 20 years of MoE; MoE: a literature survey
Figure: Two-expert MoE with gate f(w⊺x), weights f and 1 − f, over experts g(a₁⊺x) and g(a₂⊺x) ⇔
P(y ∣ x) = f(w⊺x) ⋅ N(y ∣ g(a₁⊺x), σ²) + (1 − f(w⊺x)) ⋅ N(y ∣ g(a₂⊺x), σ²)
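This conditional density is straightforward to evaluate; a minimal sketch (illustrative parameter names, g = tanh assumed):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gauss_pdf(y, mean, sigma):
    return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def moe_density(y, x, w, a1, a2, sigma=0.5, g=np.tanh):
    """P(y | x) = f(w'x) N(y | g(a1'x), sigma^2) + (1 - f(w'x)) N(y | g(a2'x), sigma^2)."""
    p1 = sigmoid(x @ w)                   # gate probability of expert 1
    return p1 * gauss_pdf(y, g(x @ a1), sigma) + (1 - p1) * gauss_pdf(y, g(x @ a2), sigma)
```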
Open question
Given n i.i.d. samples (x⁽ⁱ⁾, y⁽ⁱ⁾), does there exist an efficient learning algorithm with provable theoretical guarantees to learn the regressors a₁, a₂ and the gating parameter w?
Mixture of classification (w) and regression (a₁, a₂) problems
Key observation
If we know the regressors, learning the gating parameter is easy, and vice versa. How do we break the gridlock?
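To illustrate one direction of this observation: with the regressors a₁, a₂ known and fixed, the gating parameter w can be fit by ordinary gradient ascent on the log-likelihood, much like logistic regression (a sketch with assumed names; not the exact procedure from the talk):

```python
import numpy as np

def fit_gating_given_experts(X, y, a1, a2, sigma=0.5, g=np.tanh, lr=0.1, iters=500):
    """Gradient ascent on the MoE log-likelihood over w only, with a1, a2 known."""
    n, d = X.shape
    w = np.zeros(d)
    m1, m2 = g(X @ a1), g(X @ a2)                      # expert means, fixed
    l1 = np.exp(-(y - m1) ** 2 / (2 * sigma ** 2))     # unnormalized Gaussian likelihoods
    l2 = np.exp(-(y - m2) ** 2 / (2 * sigma ** 2))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))             # gate f(w'x)
        mix = p * l1 + (1 - p) * l2                    # per-sample mixture likelihood
        # d/dw log(mix) = (l1 - l2) * p * (1 - p) * x / mix
        grad = ((l1 - l2) * p * (1 - p) / mix) @ X / n
        w += lr * grad
    return w
```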
Recall the model for MoE:
P(y ∣ x) = f(w⊺x) ⋅ N(y ∣ g(a₁⊺x), σ²) + (1 − f(w⊺x)) ⋅ N(y ∣ g(a₂⊺x), σ²)
Main message
We propose a novel algorithm with the first recovery guarantees
We learn (a₁, a₂) and w separately:
First recover (a₁, a₂) without knowing w at all
Later learn w using traditional methods like EM
Global consistency guarantees (population setting)
Algorithm pipeline: samples (x, y) → cubic transform / score function → tensor decomposition → {â₁, â₂} → EM → ŵ
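A sketch of the tensor-decomposition step in this pipeline. Constructing the third-order tensor from samples (the cubic transform of y combined with score functions of x) follows the paper and is not reproduced here; given such a symmetric tensor whose rank-1 components are the regressors, a standard tensor power method with deflation recovers them:

```python
import numpy as np

def tensor_power_method(T, n_components=2, n_init=10, n_iters=100, rng=None):
    """Extract rank-1 components of a symmetric 3rd-order tensor T (d x d x d)
    via power iteration with deflation (standard method; whitening omitted)."""
    rng = np.random.default_rng() if rng is None else rng
    d = T.shape[0]
    T = T.copy()
    components = []
    for _ in range(n_components):
        best_v, best_lam = None, -np.inf
        for _ in range(n_init):                        # random restarts
            v = rng.standard_normal(d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T, v, v)   # T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)
            if lam > best_lam:
                best_v, best_lam = v, lam
        components.append((best_lam, best_v))
        # deflate: subtract the recovered rank-1 term before the next sweep
        T -= best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return components
```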
Figure: Plot of parameter estimation error, Spectral+EM vs. EM; (a) 3 mixtures, (b) 4 mixtures.
Algorithmic innovation: first provably consistent algorithms for MoE in 25+ years
Global convergence: our guarantees are global, with no need for careful initialization