SLIDE 1

Breaking the Softmax Bottleneck via Monotonic Functions

Octavian Ganea, Sylvain Gelly, Gary Bécigneul, Aliaksei Severyn

SLIDE 2

Softmax Layer (for Language Models)

  • Natural language as conditional distributions
  • Parametric distributions & softmax:

(equation image: parametric softmax over next word given context)

SLIDE 3

Softmax Layer (for Language Models)

  • Natural language as conditional distributions
  • Parametric distributions & softmax:

(equation image: parametric softmax over next word given context)

  • Challenge: can we always find model parameters such that, for every context c, the model distribution matches the true next-word distribution P*(· | c)?

SLIDE 4

Softmax Layer (for Language Models)

  • Natural language as conditional distributions
  • Parametric distributions & softmax (see the sketch below):

(equation image: parametric softmax over next word given context)

  • Challenge: can we always find model parameters such that, for every context c, the model distribution matches the true next-word distribution P*(· | c)?
  • No, not when embedding size < label cardinality (vocab size)!
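
For concreteness, here is a minimal NumPy sketch of the parametric softmax layer described above; the names (h_c for the context embedding, W for the output word embeddings) and shapes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax_lm(h_c, W):
    """Next-word distribution P(. | c) = softmax(W @ h_c).

    h_c : (d,)   context embedding produced by the language model
    W   : (M, d) output word embeddings (M = vocabulary size)
    """
    logits = W @ h_c
    logits = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# toy usage: 4-dimensional context embedding, 10-word vocabulary
rng = np.random.default_rng(0)
probs = softmax_lm(rng.normal(size=4), rng.normal(size=(10, 4)))
assert np.isclose(probs.sum(), 1.0)
```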

SLIDE 5

What is the Softmax Bottleneck (Yang et al., '18)?

  • log-P matrix: A of size N x M (contexts x words), with entries A_{c,x} = log P*(x | c)

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018

Label cardinality = Vocabulary size

SLIDE 6

What is the Softmax Bottleneck (Yang et al., '18)?

  • log-P matrix: A of size N x M (contexts x words), with entries A_{c,x} = log P*(x | c)
  • Then: the softmax model can only produce log-P matrices of rank at most d + 1 (d = embedding size), since its log-probabilities are H W^T plus per-row normalization constants

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018

Number of labels = Vocabulary size

SLIDE 7

What is the Softmax Bottleneck (Yang et al., '18)?

  • log-P matrix: A of size N x M (contexts x words), with entries A_{c,x} = log P*(x | c)
  • Then: softmax can match A only if rank(A) <= d + 1, because its log-probabilities are H W^T plus per-row normalization constants (see the numerical sketch below)

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018
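
A small numerical illustration of the rank argument (my own example, not from the talk): with embedding size d much smaller than the vocabulary size M, the logit matrix H W^T has rank at most d, and log-softmax only subtracts a per-row constant, so the model's log-P matrix has rank at most d + 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 50, 30, 5                 # contexts, vocabulary size, embedding size (d << M)

H = rng.normal(size=(N, d))         # context embeddings
W = rng.normal(size=(M, d))         # word embeddings
logits = H @ W.T                    # (N, M) logit matrix, rank <= d

# log-softmax subtracts a per-row constant, so rank(log P) <= d + 1,
# far below M: this is the softmax bottleneck.
logP = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(np.linalg.matrix_rank(logits), np.linalg.matrix_rank(logP))   # 5 6
```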

SLIDE 8

Breaking the Softmax Bottleneck [1]

  • MoS [1]: Mixture of K Softmaxes

[1] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018

SLIDE 9

Breaking the Softmax Bottleneck [1]

  • MoS [1]: Mixture of K Softmaxes
  • Improves perplexity

[1] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018

SLIDE 10

Breaking the Softmax Bottleneck [1]

  • MoS [1]: Mixture of K Softmaxes (see the sketch below)
  • Improves perplexity
  • Slower than vanilla softmax: 2 - 6.4x
  • GPU Memory: M x N x K tensor

(comparison against vanilla softmax)

[1] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al., ICLR 2018
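
A sketch of what a Mixture of Softmaxes [1] output layer computes, under the usual formulation with K component context vectors and mixture weights; names and shapes here are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_softmaxes(h_ck, pi_c, W):
    """MoS: P(. | c) = sum_k pi_{c,k} * softmax(W @ h_{c,k}).

    h_ck : (K, d) one context vector per mixture component
    pi_c : (K,)   mixture weights (non-negative, sum to 1)
    W    : (M, d) shared word embeddings
    """
    components = softmax(h_ck @ W.T, axis=-1)   # (K, M): one softmax per component
    return pi_c @ components                    # (M,) convex combination

# toy usage; note the per-context (K, M) tensor behind the M x N x K memory cost
rng = np.random.default_rng(0)
K, d, M = 3, 8, 20
p = mixture_of_softmaxes(rng.normal(size=(K, d)),
                         softmax(rng.normal(size=K)),
                         rng.normal(size=(M, d)))
assert np.isclose(p.sum(), 1.0)
```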

SLIDE 11

Breaking the Softmax Bottleneck [2]

  • Sig-Softmax [2]: p_i ∝ exp(z_i) · σ(z_i)

[2] Sigsoftmax: Reanalysis of the Softmax Bottleneck, S. Kanai et al., NIPS 2018

SLIDE 12

Breaking the Softmax Bottleneck [2]

  • Sig-Softmax [2]: p_i ∝ exp(z_i) · σ(z_i)
  • Small improvement over vanilla softmax

[2] Sigsoftmax: Reanalysis of the Softmax Bottleneck, S. Kanai et al., NIPS 2018

SLIDE 13

Breaking the Softmax Bottleneck [2]

  • Sig-Softmax [2]: p_i ∝ exp(z_i) · σ(z_i) (sketched below)
  • Small improvement over vanilla softmax

Can we learn the best non-linearity to deform the logits?

[2] Sigsoftmax: Reanalysis of the Softmax Bottleneck, S. Kanai et al., NIPS 2018
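
For reference, a minimal sketch of the sigsoftmax output of [2] as commonly stated: each probability is proportional to exp(z_i) times sigmoid(z_i), which makes the log-output non-linear in the logits.

```python
import numpy as np

def sigsoftmax(logits):
    """Sigsoftmax [2]: p_i proportional to exp(z_i) * sigmoid(z_i)."""
    g = np.exp(logits) / (1.0 + np.exp(-logits))   # exp(z) * sigmoid(z)
    return g / g.sum()

p = sigsoftmax(np.array([2.0, -1.0, 0.5]))
assert np.isclose(p.sum(), 1.0)
```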

SLIDE 14

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits
SLIDE 15

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits

The learned function should be:

SLIDE 16

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits

The learned function should be:
  1. With unbounded image set -- to model sparse distributions

SLIDE 17

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits

The learned function should be:
  1. With unbounded image set -- to model sparse distributions
  2. Continuous and (piecewise) differentiable -- for backprop

SLIDE 18

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits

The learned function should be:
  1. With unbounded image set -- to model sparse distributions
  2. Continuous and (piecewise) differentiable -- for backprop
  3. Non-linear -- to break the softmax bottleneck

SLIDE 19

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits

The learned function should be:
  1. With unbounded image set -- to model sparse distributions
  2. Continuous and (piecewise) differentiable -- for backprop
  3. Non-linear -- to break the softmax bottleneck
  4. Monotonic -- to preserve the ranking of the logits

SLIDE 20

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits

The learned function should be:
  1. With unbounded image set -- to model sparse distributions
  2. Continuous and (piecewise) differentiable -- for backprop
  3. Non-linear -- to break the softmax bottleneck
  4. Monotonic -- to preserve the ranking of the logits
  5. Fast and memory efficient -- comparable with vanilla softmax

SLIDE 21

Can we do better?

  • Our idea: learn a pointwise monotonic function on top of the logits (minimal sketch below)

The learned function should be:
  1. With unbounded image set -- to model sparse distributions
  2. Continuous and (piecewise) differentiable -- for backprop
  3. Non-linear -- to break the softmax bottleneck
  4. Monotonic -- to preserve the ranking of the logits
  5. Fast and memory efficient -- comparable with vanilla softmax

Theorem: these properties are not restrictive in terms of rank deficiency
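
A minimal sketch of the general recipe, under the assumption that the deformed logits are renormalized with a standard softmax: a learned pointwise increasing function phi is applied to every logit, preserving their ranking while making log P non-linear in the logits. The particular phi below is a toy stand-in; the next slides describe the actual parameterizations.

```python
import numpy as np

def deformed_softmax(logits, phi):
    """P(. | c) = softmax(phi(logits)) with phi pointwise and increasing."""
    g = phi(logits)                  # elementwise, so the logit ranking is preserved
    g = g - g.max()
    p = np.exp(g)
    return p / p.sum()

# toy increasing non-linearity (illustrative only): derivative 1 + 0.3 z^2 > 0
phi = lambda z: z + 0.1 * z**3
p = deformed_softmax(np.array([2.0, -1.0, 0.5]), phi)
assert np.isclose(p.sum(), 1.0)
```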

SLIDE 22

Learnable parametric monotonic real functions

  • A neural network with one hidden layer and positive (constrained) weights [3] (see the sketch below)
  • Universal approximator for all monotonic functions (when the hidden width K is large enough!)

[3] Monotone and Partially Monotone Neural Networks, Daniels and Velikova, 2010, IEEE Transactions on Neural Networks
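
One possible way to parameterize such a monotone function, following the construction cited above [3]: a one-hidden-layer network whose input and output weights are kept positive (here via exponentiation of unconstrained parameters). The names and the choice of tanh are illustrative assumptions.

```python
import numpy as np

def monotone_net(z, a_raw, w_raw, b):
    """phi(z) = sum_k exp(a_raw_k) * tanh(exp(w_raw_k) * z + b_k).

    Positive outer and inner weights make every term (and hence phi)
    non-decreasing in z; with enough hidden units K this family can
    approximate continuous monotonic functions on a bounded interval [3].
    """
    a = np.exp(a_raw)                                  # (K,) positive output weights
    w = np.exp(w_raw)                                  # (K,) positive input weights
    return (a * np.tanh(w * np.asarray(z)[..., None] + b)).sum(axis=-1)

rng = np.random.default_rng(0)
K = 16
params = rng.normal(size=(3, K))                       # a_raw, w_raw, b
values = monotone_net(np.linspace(-3, 3, 7), *params)
assert np.all(np.diff(values) >= 0)                    # monotonically increasing
```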

SLIDE 23

Synthetic Experiment

  • Goal: separate softmax bottleneck from context embedding bottleneck
SLIDE 24

Synthetic Experiment

  • Goal: separate softmax bottleneck from context embedding bottleneck
  • Sample N different categorical distributions with M outcomes:
SLIDE 25

Synthetic Experiment

  • Goal: separate softmax bottleneck from context embedding bottleneck
  • Sample N different categorical distributions with M outcomes:
  • Goal: fit the sampled target distributions with the model
SLIDE 26

Synthetic Experiment

  • Goal: separate softmax bottleneck from context embedding bottleneck
  • Sample N different categorical distributions with M outcomes:
  • Goal: fit the sampled target distributions with the model
  • Independent context embeddings; shared word embeddings (one possible setup sketched below)
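
The slides do not spell out the sampling scheme, so this is only a hedged sketch of one plausible setup: target categorical distributions drawn from a Dirichlet whose concentration matches the β = 0.01 mentioned on the next slides (an assumption), one free embedding per context (no context-encoder bottleneck), and shared word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 1000, 100, 10            # contexts, outcomes, embedding size
beta = 0.01                        # assumed Dirichlet concentration (peaky targets)

# N target categorical distributions over M outcomes
P_true = rng.dirichlet(beta * np.ones(M), size=N)      # (N, M)

# independent (free) context embeddings and shared word embeddings to be fitted
H = rng.normal(size=(N, d))
W = rng.normal(size=(M, d))

# fitting would minimize cross-entropy between the targets and softmax(H W^T)
# (or its monotonically deformed variant) over H and W
logits = H @ W.T
logQ = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print("initial cross-entropy:", -(P_true * logQ).sum(axis=1).mean())
```
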
SLIDE 27

Synthetic Experiments - Mode Matching (𝜷=0.01)

  • Percentage of contexts c for which the model's most likely word matches the mode of the true distribution

(plots: Vanilla Softmax, Monotonic fn (K=100), and their ratio 2nd / 1st; x-axis: num words M)

SLIDE 28

Synthetic Experiments - Mode Matching (𝜷=0.01)

  • Percentage of contexts c for which the model's most likely word matches the mode of the true distribution

(plots: Vanilla Softmax, Monotonic fn (K=100), and their ratio 2nd / 1st; x-axis: num words M)

  • Similar results for cross-entropy and other values of 𝜷 (metric sketched below)
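
Under the reading above (mode matching = agreement of the argmax word), the reported percentage could be computed as below; P_true and P_model stand for N x M arrays like those in the earlier synthetic-setup sketch.

```python
import numpy as np

def mode_match_rate(P_true, P_model):
    """Percentage of contexts whose model argmax equals the true mode."""
    return 100.0 * float((P_true.argmax(axis=1) == P_model.argmax(axis=1)).mean())
```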

SLIDE 29

Piecewise Linear Increasing Functions (PLIF)

  • NN w/ 1 hidden layer ⇒ memory hungry:

○ Tensor of size N x M x K on GPU, where K >= 1000

SLIDE 30

Piecewise Linear Increasing Functions (PLIF)

  • NN w/ 1 hidden layer ⇒ memory hungry:

○ Tensor of size N x M x K on GPU, where K >= 1000

  • PLIF:
SLIDE 31

Piecewise Linear Increasing Functions (PLIF)

  • NN w/ 1 hidden layer ⇒ memory hungry:

○ Tensor of size N x M x K on GPU, where K >= 1000

  • PLIF (sketched below):
  • Forward & backward passes: just a lookup into two K-dimensional vectors
  • Memory and running time very efficient (comparable with vanilla softmax)
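
A hedged sketch of how a PLIF over a fixed grid of K bins could be evaluated as a lookup into two K-dimensional vectors (per-bin slopes and cumulative offsets); the grid bounds, the exp reparameterization, and the clipping outside the grid are my assumptions, not details from the slides (a real PLIF would extrapolate linearly outside the grid to keep an unbounded image).

```python
import numpy as np

def make_plif(slopes_raw, lo=-10.0, hi=10.0):
    """Piecewise linear increasing function on K equal bins over [lo, hi].

    Evaluation is just a lookup into two K-dimensional vectors:
    positive per-bin slopes (via exp) and cumulative offsets that make
    the function continuous and strictly increasing on the grid.
    """
    K = slopes_raw.shape[0]
    edges = np.linspace(lo, hi, K + 1)                    # bin edges
    slopes = np.exp(slopes_raw)                           # (K,) positive slopes
    widths = np.diff(edges)
    offsets = np.concatenate(([0.0], np.cumsum(slopes * widths)[:-1]))  # (K,)

    def plif(z):
        zc = np.clip(z, lo, hi - 1e-9)                    # simple clipping at the grid ends
        idx = ((zc - lo) / (hi - lo) * K).astype(int)     # bin lookup
        return offsets[idx] + slopes[idx] * (zc - edges[idx])
    return plif

rng = np.random.default_rng(0)
phi = make_plif(rng.normal(size=64))
z = np.linspace(-12.0, 12.0, 200)
assert np.all(np.diff(phi(z)) >= 0)                       # non-decreasing everywhere
```
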
SLIDE 32

Language Modeling Results

(results table not captured; GPU memory annotations: N x M vs. N x M x K)

SLIDE 33

Thank you!

Poster #23