SLIDE 1

Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient algorithms

Ashok Vardhan Makkuva

University of Illinois at Urbana-Champaign

Joint work with Sewoong Oh, Sreeram Kannan, and Pramod Viswanath

SLIDE 2

Mixture-of-Experts (MoE)

Jacobs, Jordan, Nowlan and Hinton, 1991

Figure: Two-expert MoE architecture. The input x is fed to two experts g(a_1^T x) and g(a_2^T x); a gate f(w^T x) combines their outputs with weights f and 1 − f to produce y.

f = sigmoid; g = linear, tanh, ReLU, leaky ReLU
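A minimal sketch of the gate's convex combination of the two expert outputs shown in the figure, assuming a sigmoid gate and a tanh expert nonlinearity (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def moe_mean(x, w, a1, a2, g=np.tanh):
    """Mean output of the 2-expert MoE: the gate f(w^T x) mixes g(a1^T x) and g(a2^T x)."""
    gate = sigmoid(w @ x)                             # f(w^T x), a weight in [0, 1]
    return gate * g(a1 @ x) + (1 - gate) * g(a2 @ x)  # f * expert1 + (1 - f) * expert2
```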

SLIDE 3

Motivation-I: Modern relevance of MoE

"Outrageously large neural networks: the sparsely-gated Mixture-of-Experts layer" (Shazeer et al., 2017)

SLIDE 4

Motivation-II: Gated RNNs

Figure: Gated Recurrent Unit (GRU)

Key features: gating mechanism, long-term memory

SLIDE 5

Motivation-II: GRU

Gates: z_t, r_t ∈ [0,1]^d depend on the input x_t and the past state h_{t−1}

States: h_t, h̃_t ∈ R^d

Update equations for each t:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
h̃_t = f(A x_t + r_t ⊙ B h_{t−1})
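A minimal numpy sketch of one such update step. The slide does not spell out how z_t and r_t are computed, so the standard sigmoid gate parametrization is assumed here, with hypothetical gate weight matrices Wz, Uz, Wr, Ur:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gru_step(x_t, h_prev, A, B, Wz, Uz, Wr, Ur, f=np.tanh):
    """One GRU update following the equations above.

    z_t, r_t in [0,1]^d gate the update; h_tilde is the candidate state.
    """
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)       # update gate (assumed standard form)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)       # reset gate (assumed standard form)
    h_tilde = f(A @ x_t + r_t * (B @ h_prev))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde   # h_t
```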

SLIDE 6

MoE: Building blocks of GRU

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ (1 − r_t) ⊙ f(A x_t) + z_t ⊙ r_t ⊙ f(A x_t + B h_{t−1})

Figure: h_t written as a gated combination of three networks NN-1, NN-2, NN-3 of (x_t, h_{t−1}), with weights 1 − z_t, z_t ⊙ (1 − r_t), and z_t ⊙ r_t.
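A small sketch of this three-expert view of the GRU state update, with the gate values z_t and r_t taken as given (names are illustrative):

```python
import numpy as np

def gru_step_as_moe(x_t, h_prev, z_t, r_t, A, B, f=np.tanh):
    """GRU state update written as a gated mixture of three 'experts'.

    Expert 1: keep the previous state h_{t-1}
    Expert 2: f(A x_t)               (input only)
    Expert 3: f(A x_t + B h_{t-1})   (input and previous state)
    """
    expert1 = h_prev
    expert2 = f(A @ x_t)
    expert3 = f(A @ x_t + B @ h_prev)
    return (1 - z_t) * expert1 + z_t * (1 - r_t) * expert2 + z_t * r_t * expert3
```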

SLIDE 8

What is known about MoE?

No provable learning algorithms for the parameters¹

¹ "Twenty years of Mixture of Experts"; "Mixture of Experts: a literature survey"

SLIDE 9

Open problem for 25+ years

Figure: Two-expert MoE. The gate f(w^T x) mixes the experts g(a_1^T x) and g(a_2^T x) with weights f and 1 − f to produce y. Equivalently,

P(y | x) = f(w^T x) · N(y | g(a_1^T x), σ²) + (1 − f(w^T x)) · N(y | g(a_2^T x), σ²)

Open question

Given n i.i.d. samples (x^(i), y^(i)), does there exist an efficient learning algorithm with provable theoretical guarantees to learn the regressors a_1, a_2 and the gating parameter w?
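For concreteness, a minimal sketch of the data-generating process behind this question, assuming x ~ N(0, I) and g = tanh (neither choice is fixed by the slide; names are illustrative):

```python
import numpy as np

def sample_moe(n, w, a1, a2, g=np.tanh, sigma=0.5, seed=0):
    """Draw n i.i.d. samples (x, y) from the 2-expert MoE likelihood above.

    For each x, expert 1 is chosen with probability f(w^T x); y is the chosen
    expert's output g(a^T x) plus N(0, sigma^2) noise.
    """
    rng = np.random.default_rng(seed)
    d = len(w)
    X = rng.standard_normal((n, d))        # assumption: x ~ N(0, I)
    gate = 1.0 / (1.0 + np.exp(-X @ w))    # f(w^T x)
    pick_first = rng.random(n) < gate      # latent expert choice
    means = np.where(pick_first, g(X @ a1), g(X @ a2))
    y = means + sigma * rng.standard_normal(n)
    return X, y
```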

SLIDE 10

Modular structure

Mixture of classification (w) and regression (a_1, a_2) problems

SLIDE 11

Key observation

If we know the regressors, learning the gating parameter is easy and vice-versa. How to break the gridlock?

SLIDE 12

Breaking the gridlock: An overview

Recall the model for MoE:

P(y | x) = f(w^T x) · N(y | g(a_1^T x), σ²) + (1 − f(w^T x)) · N(y | g(a_2^T x), σ²)

Main message

  • We propose a novel algorithm with the first recovery guarantees
  • We learn (a_1, a_2) and w separately
  • First recover (a_1, a_2) without knowing w at all
  • Later learn w using traditional methods like EM
  • Global consistency guarantees (population setting)

SLIDE 13

Algorithm

Samples (x, y) → cubic transform and score functions → tensor decomposition → {â_1, â_2} → EM → ŵ
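A hedged sketch of the last stage only (not the paper's exact procedure): with the regressor estimates â_1, â_2 held fixed, the gating parameter can be fit by EM, where the M-step reduces to a weighted logistic regression, solved below by a few gradient-ascent steps. Names and step sizes are illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def em_for_gate(X, y, a1_hat, a2_hat, g=np.tanh, sigma=0.5,
                n_em=50, n_grad=25, lr=0.1):
    """Fit the gating parameter w by EM, holding the regressor estimates fixed.

    E-step: posterior probability that each sample came from expert 1.
    M-step: weighted logistic regression for w (a few gradient-ascent steps).
    """
    n, d = X.shape
    w = np.zeros(d)
    # Gaussian likelihoods under each expert, up to a shared constant that cancels.
    lik1 = np.exp(-(y - g(X @ a1_hat)) ** 2 / (2 * sigma ** 2))
    lik2 = np.exp(-(y - g(X @ a2_hat)) ** 2 / (2 * sigma ** 2))
    for _ in range(n_em):
        gate = sigmoid(X @ w)
        resp = gate * lik1 / (gate * lik1 + (1 - gate) * lik2 + 1e-12)  # E-step
        for _ in range(n_grad):                                          # M-step
            w += lr / n * X.T @ (resp - sigmoid(X @ w))
    return w
```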

SLIDE 14

Comparison with EM

Figure: Parameter estimation error vs. number of EM iterations, comparing Spectral+EM with plain EM. (a) 3 mixtures; (b) 4 mixtures.

SLIDE 15

Summary

  • Algorithmic innovation: first provably consistent algorithms for MoE in 25+ years
  • Global convergence: our algorithms work with global initializations

SLIDE 16

Conclusion

SLIDE 17

Poster #210 Thank you!
