slide-1
SLIDE 1

CS11-747 Neural Networks for NLP

Language Modeling, Efficiency/Training Tricks

Graham Neubig

Site https://phontron.com/class/nn4nlp2020/

slide-2
SLIDE 2

Are These Sentences OK?

  • Jane went to the store.
  • store to Jane went the.
  • Jane went store.
  • Jane goed to the store.
  • The store went to Jane.
  • The food truck went to Jane.
slide-3
SLIDE 3

Language Modeling: Calculating the Probability of a Sentence

P(X) = ∏_{i=1}^{I} P(x_i | x_1, …, x_{i−1})

Next word: x_i        Context: x_1, …, x_{i−1}

The big problem: How do we predict P(x_i | x_1, …, x_{i−1})?!?!
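To make the chain rule concrete, here is a minimal sketch (not from the original slides) that scores a sentence by summing per-word log-probabilities; `next_word_probs` is a hypothetical stand-in for any model that returns a next-word distribution given a context.

```python
import math

def sentence_log_prob(words, next_word_probs):
    """Score a sentence with the chain rule:
    log P(X) = sum_i log P(x_i | x_1, ..., x_{i-1})."""
    logp = 0.0
    for i, word in enumerate(words):
        context = words[:i]                              # x_1, ..., x_{i-1}
        p = next_word_probs(context).get(word, 1e-10)    # P(x_i | context)
        logp += math.log(p)
    return logp

# Toy example with a (hypothetical) context-independent distribution:
vocab_probs = {"Jane": 0.1, "went": 0.1, "to": 0.2, "the": 0.2, "store": 0.1, ".": 0.2}
model = lambda context: vocab_probs
print(sentence_log_prob(["Jane", "went", "to", "the", "store", "."], model))
```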

slide-4
SLIDE 4

Covered Concept Tally

slide-5
SLIDE 5

Review: Count-based Language Models

slide-6
SLIDE 6

Count-based Language Models

  • Count up the frequency and divide:
  • Add smoothing, to deal with zero counts:

P_ML(x_i | x_{i−n+1}, …, x_{i−1}) := c(x_{i−n+1}, …, x_i) / c(x_{i−n+1}, …, x_{i−1})

P(x_i | x_{i−n+1}, …, x_{i−1}) = λ P_ML(x_i | x_{i−n+1}, …, x_{i−1}) + (1 − λ) P(x_i | x_{i−n+2}, …, x_{i−1})

  • Modified Kneser-Ney smoothing
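A minimal sketch of the two formulas above (maximum-likelihood counting plus interpolation with the lower-order model), shown for the bigram/unigram case; the corpus and λ are made up, and a real system would add Kneser-Ney style discounting.

```python
from collections import Counter

def train_counts(corpus, n):
    """Count n-grams and their (n-1)-gram prefixes."""
    ngrams, prefixes = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(padded)):
            ngrams[tuple(padded[i - n + 1:i + 1])] += 1
            prefixes[tuple(padded[i - n + 1:i])] += 1
    return ngrams, prefixes

def p_ml(word, context, ngrams, prefixes):
    """P_ML(x_i | context) = c(context, x_i) / c(context)."""
    denom = prefixes[tuple(context)]
    return ngrams[tuple(context) + (word,)] / denom if denom else 0.0

def p_interp(word, context, bigrams, bigram_prefixes, unigrams, lam=0.8):
    """Interpolate the bigram ML estimate with a unigram fallback (n = 2 case)."""
    p_uni = unigrams[(word,)] / sum(unigrams.values())
    return lam * p_ml(word, context, bigrams, bigram_prefixes) + (1 - lam) * p_uni

corpus = [["jane", "went", "to", "the", "store"], ["jane", "went", "home"]]
bigrams, bigram_prefixes = train_counts(corpus, 2)
unigrams, _ = train_counts(corpus, 1)
print(p_interp("went", ["jane"], bigrams, bigram_prefixes, unigrams))
```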
slide-7
SLIDE 7

A Refresher on Evaluation

  • Log-likelihood:

  • Per-word Log Likelihood:

  • Per-word (Cross) Entropy:

  • Perplexity:


LL(E_test) = Σ_{E ∈ E_test} log P(E)

WLL(E_test) = (1 / Σ_{E ∈ E_test} |E|) Σ_{E ∈ E_test} log P(E)

H(E_test) = (1 / Σ_{E ∈ E_test} |E|) Σ_{E ∈ E_test} −log₂ P(E)

ppl(E_test) = 2^{H(E_test)} = e^{−WLL(E_test)}
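As a sanity check on these definitions, the following sketch computes per-word log likelihood and perplexity from per-sentence log-probabilities; the numbers are invented.

```python
import math

def perplexity(sentence_log_probs, sentence_lengths):
    """ppl = exp(-WLL), where WLL is the per-word log likelihood (natural log)."""
    total_logp = sum(sentence_log_probs)    # sum over E of log P(E)
    total_words = sum(sentence_lengths)     # sum over E of |E|
    wll = total_logp / total_words          # per-word log likelihood
    return math.exp(-wll)

# Two made-up test sentences with log P(E) = -12.0 and -7.5, lengths 6 and 4:
print(perplexity([-12.0, -7.5], [6, 4]))    # ≈ exp(1.95) ≈ 7.0
```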

slide-8
SLIDE 8

What Can we Do w/ LMs?

  • Score sentences:
  • Generate sentences:

Jane went to the store . → high
store to Jane went the . → low
(same as calculating loss for training)

while didn’t choose end-of-sentence symbol:
    calculate probability
    sample a new word from the probability distribution
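The generation pseudocode above can be written out as a short sketch; `next_word_probs` is a hypothetical model interface that returns a word→probability dict for the current context.

```python
import random

def generate(next_word_probs, eos="</s>", max_len=50):
    """Sample words until the end-of-sentence symbol is chosen."""
    sent = []
    while len(sent) < max_len:
        probs = next_word_probs(sent)                       # calculate probability
        words, weights = zip(*probs.items())
        word = random.choices(words, weights=weights)[0]    # sample a new word
        if word == eos:
            break
        sent.append(word)
    return sent

# Toy distribution that ignores the context (illustration only):
toy = lambda context: {"Jane": 0.2, "went": 0.2, "to": 0.2, "the": 0.1, "store": 0.1, "</s>": 0.2}
print(" ".join(generate(toy)))
```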

slide-9
SLIDE 9

Problems and Solutions?

  • Cannot share strength among similar words
    she bought a car        she purchased a car
    she bought a bicycle    she purchased a bicycle
    → solution: class-based language models
  • Cannot condition on context with intervening words
    Dr. Jane Smith        Dr. Gertrude Smith
    → solution: skip-gram language models
  • Cannot handle long-distance dependencies
    for tennis class he wanted to buy his own racquet
    for programming class he wanted to buy his own computer
    → solution: cache, trigger, topic, syntactic models, etc.

slide-10
SLIDE 10

An Alternative:
 Featurized Log-Linear Models

slide-11
SLIDE 11

An Alternative:
 Featurized Models

  • Calculate features of the context
  • Based on the features, calculate probabilities
  • Optimize feature weights using gradient descent, etc.

slide-12
SLIDE 12

Example:

Previous words: “giving a”
Words we’re predicting: a, the, talk, gift, hat, …

How likely are they?                                  b            = [ 3.0,  2.5, −0.2,  0.1,  1.2, …]
How likely given prev. word is “a”?                   w_{1,a}      = [−6.0, −5.1,  0.2,  0.1,  0.5, …]
How likely given 2nd prev. word is “giving”?          w_{2,giving} = [−0.2, −0.3,  1.0,  2.0, −1.2, …]
Total score: s = b + w_{1,a} + w_{2,giving}           = [−3.2, −2.9,  1.0,  2.2,  0.6, …]

slide-13
SLIDE 13

Softmax

  • Convert scores into probabilities by taking the exponent and normalizing (softmax)

P(x_i | x_{i−n+1}^{i−1}) = exp( s(x_i | x_{i−n+1}^{i−1}) ) / Σ_{x̃_i} exp( s(x̃_i | x_{i−n+1}^{i−1}) )

s = [−3.2, −2.9, 1.0, 2.2, 0.6, …]  →  p = [0.002, 0.003, 0.329, 0.444, 0.090, …]

slide-14
SLIDE 14

A Computation Graph View

giving, a → lookup2, lookup1 → sum + bias = scores → softmax → probs

Each vector is the size of the output vocabulary.
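A minimal sketch of this computation graph for the featurized log-linear LM: look up one score vector per context position, add them to a bias, and apply a softmax. The vocabulary and weights here are made up for illustration.

```python
import numpy as np

vocab = ["a", "the", "talk", "gift", "hat"]
V = len(vocab)
rng = np.random.default_rng(0)

b = rng.normal(size=V)           # bias: how likely is each word overall
W1 = rng.normal(size=(V, V))     # scores conditioned on the previous word
W2 = rng.normal(size=(V, V))     # scores conditioned on the 2nd-previous word

def log_linear_probs(prev2, prev1):
    """scores = lookup2 + lookup1 + bias; probs = softmax(scores)."""
    s = W2[vocab.index(prev2)] + W1[vocab.index(prev1)] + b
    e = np.exp(s - s.max())       # subtract max for numerical stability
    return e / e.sum()

print(dict(zip(vocab, log_linear_probs("gift", "a").round(3))))
```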

slide-15
SLIDE 15

A Note: “Lookup”

  • Lookup can be viewed as “grabbing” a single vector from a big matrix of word embeddings (num. words × vector size), e.g. lookup(2) grabs the vector for word 2
  • Similarly, it can be viewed as multiplying the embedding matrix by a “one-hot” vector
  • The former tends to be faster
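The equivalence on this slide can be checked directly: indexing a row of the embedding matrix gives the same vector as multiplying a one-hot vector by it, without the full matrix product. Sizes here are arbitrary.

```python
import numpy as np

num_words, vec_size = 1000, 64
E = np.random.randn(num_words, vec_size)   # embedding matrix: num. words x vector size

word_id = 2
one_hot = np.zeros(num_words)
one_hot[word_id] = 1.0

grab = E[word_id]        # lookup(2): "grab" a single row
matmul = one_hot @ E     # multiply by a one-hot vector

print(np.allclose(grab, matmul))   # True; the lookup skips the full matrix product
```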
slide-16
SLIDE 16

Training a Model

  • Reminder: to train, we calculate a “loss function” (a measure of how bad our predictions are), and move the parameters to reduce the loss

  • The most common loss function for probabilistic models is “negative log likelihood”

p = [0.002, 0.003, 0.329, 0.444, 0.090, …]

If element 3 (or zero-indexed, 2) is the correct answer:
ℓ = −log(0.329) ≈ 1.112
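Written out as a tiny check, the loss is simply the negative log of the probability assigned to the correct word:

```python
import math

p = [0.002, 0.003, 0.329, 0.444, 0.090]   # model's predicted distribution (truncated)
correct = 2                                # zero-indexed position of the correct word

loss = -math.log(p[correct])
print(round(loss, 3))                      # 1.112
```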

slide-17
SLIDE 17

Parameter Update

  • Back propagation allows us to calculate the derivative of the loss with respect to the parameters: ∂ℓ/∂θ

  • Simple stochastic gradient descent optimizes parameters according to the following rule

θ ← θ − α ∂ℓ/∂θ

slide-18
SLIDE 18

Choosing a Vocabulary

slide-19
SLIDE 19

Unknown Words

  • Necessity for UNK words
  • We won’t have all the words in the world in training data
  • Larger vocabularies require more memory and computation time

  • Common ways:
  • Frequency threshold (usually words with frequency <= 1 become UNK)
  • Rank threshold
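A minimal sketch of the frequency-threshold approach above: words seen at most once in the training data are replaced by an UNK token (the threshold is configurable).

```python
from collections import Counter

def build_vocab(corpus, min_count=2, unk="<unk>"):
    """Keep words with frequency >= min_count; everything else maps to UNK."""
    counts = Counter(w for sent in corpus for w in sent)
    return {w for w, c in counts.items() if c >= min_count} | {unk}

def apply_vocab(sent, vocab, unk="<unk>"):
    return [w if w in vocab else unk for w in sent]

corpus = [["jane", "went", "to", "the", "store"], ["jane", "went", "home"]]
vocab = build_vocab(corpus)
print(apply_vocab(["jane", "bought", "a", "car"], vocab))
# ['jane', '<unk>', '<unk>', '<unk>']
```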
slide-20
SLIDE 20

Evaluation and Vocabulary

  • Important: the vocabulary must be the same over models you compare
  • Or more accurately, all models must be able to generate the test set (it’s OK if they can generate more than the test set, but not less)
  • e.g. Comparing a character-based model to a word-based model is fair, but not vice-versa

slide-21
SLIDE 21

Let’s try it out! (loglin-lm.py)

slide-22
SLIDE 22

What Problems are Handled?

  • Cannot share strength among similar words
    she bought a car        she purchased a car
    she bought a bicycle    she purchased a bicycle
    → not solved yet 😟
  • Cannot condition on context with intervening words
    Dr. Jane Smith        Dr. Gertrude Smith
    → solved! 😁
  • Cannot handle long-distance dependencies
    for tennis class he wanted to buy his own racquet
    for programming class he wanted to buy his own computer
    → not solved yet 😟

slide-23
SLIDE 23

Beyond Linear Models

slide-24
SLIDE 24

Linear Models can’t Learn Feature Combinations

  • These can’t be expressed by linear features:
    students take tests → high        students write tests → low
    teachers take tests → low         teachers write tests → high
  • What can we do?
  • Remember combinations as features (individual scores for “students take”, “teachers write”)
    → Feature space explosion!
  • Neural nets

slide-25
SLIDE 25

Neural Language Models

  • (See Bengio et al. 2004)

giving, a → lookup, lookup → tanh(W1*h + b1) → W·(hidden) + bias = scores → softmax → probs
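A sketch of the forward pass in this figure (in the spirit of nn-lm.py, but simplified, with made-up sizes and random weights): concatenate the two context embeddings, apply a tanh hidden layer, then a softmax over output scores.

```python
import numpy as np

V, emb, hid = 1000, 64, 128
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, emb))        # word embeddings (lookup table)
W1 = rng.normal(scale=0.1, size=(hid, 2 * emb))  # hidden layer weights
b1 = np.zeros(hid)
W  = rng.normal(scale=0.1, size=(V, hid))        # output (softmax) weights
b  = np.zeros(V)                                  # output bias

def nn_lm_probs(prev2_id, prev1_id):
    x = np.concatenate([E[prev2_id], E[prev1_id]])   # lookup + lookup
    h = np.tanh(W1 @ x + b1)                         # tanh(W1*h + b1) in the figure
    scores = W @ h + b                               # + bias = scores
    e = np.exp(scores - scores.max())
    return e / e.sum()                               # softmax -> probs

print(nn_lm_probs(3, 17).sum())   # ≈ 1.0
```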

slide-26
SLIDE 26

Where is Strength Shared?

giving, a → lookup, lookup → tanh(W1*h + b1) → W·(hidden) + bias = scores → softmax → probs

Word embeddings: similar input words get similar vectors.
Similar output words get similar rows in the softmax matrix W.
Similar contexts get similar hidden states.

slide-27
SLIDE 27
What Problems are Handled?

  • Cannot share strength among similar words
    she bought a car        she purchased a car
    she bought a bicycle    she purchased a bicycle
    → solved, and similar contexts as well! 😁
  • Cannot condition on context with intervening words
    Dr. Jane Smith        Dr. Gertrude Smith
    → solved! 😁
  • Cannot handle long-distance dependencies
    for tennis class he wanted to buy his own racquet
    for programming class he wanted to buy his own computer
    → not solved yet 😟

slide-28
SLIDE 28

Let’s Try it Out! (nn-lm.py)

slide-29
SLIDE 29

Tying Input/Output Embeddings

  • We can share parameters between the input and output embeddings (Press et al. 2016, inter alia)

giving, a → pick row, pick row → tanh(W1*h + b1) → W·(hidden) + bias = scores → softmax → probs

Want to try? Delete the input embeddings, and instead pick a row from the softmax matrix.
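A sketch of the tying idea in this figure: drop the separate input embedding table and reuse rows of the output softmax matrix W as input embeddings (the “pick row” step). The shapes below are made up; with tying, the embedding size must match the row size of W.

```python
import numpy as np

V, hid = 1000, 128
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, hid))   # output softmax matrix, reused as embeddings
W1 = rng.normal(scale=0.1, size=(hid, 2 * hid))
b1, b = np.zeros(hid), np.zeros(V)

def tied_lm_probs(prev2_id, prev1_id):
    x = np.concatenate([W[prev2_id], W[prev1_id]])   # "pick row" from the softmax matrix
    h = np.tanh(W1 @ x + b1)
    scores = W @ h + b                               # the same W scores the output
    e = np.exp(scores - scores.max())
    return e / e.sum()

print(tied_lm_probs(3, 17).argmax())
```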

slide-30
SLIDE 30

Optimizers

slide-31
SLIDE 31

Standard SGD

  • Reminder: standard stochastic gradient descent does

g_t = ∇_θ ℓ(θ_{t−1})          (gradient of loss)
θ_t = θ_{t−1} − η g_t          (η: learning rate)

  • There are many other optimization options! (see Ruder 2016 in references)
slide-32
SLIDE 32

SGD With Momentum

  • Remember gradients from past time steps

v_t = γ v_{t−1} + η g_t        (γ: momentum conservation parameter; v_{t−1}: previous momentum)
θ_t = θ_{t−1} − v_t

  • Intuition: prevent instability resulting from sudden changes
slide-33
SLIDE 33

Adagrad

  • Adaptively reduce the learning rate based on the accumulated variance of the gradients

G_t = G_{t−1} + g_t ⊙ g_t                      (accumulate the squared current gradient)
θ_t = θ_{t−1} − η / (√G_t + ε) ⊙ g_t           (ε: small constant)

  • Intuition: frequently updated parameters (e.g. common word embeddings) should be updated less
  • Problem: the learning rate continuously decreases, and training can stall -- fixed by using a rolling average in AdaDelta and RMSProp
slide-34
SLIDE 34

Adam

  • Most standard optimization option in NLP and beyond
  • Considers rolling average of gradient, and momentum

m_t = β₁ m_{t−1} + (1 − β₁) g_t               (momentum)
v_t = β₂ v_{t−1} + (1 − β₂) g_t ⊙ g_t          (rolling average of gradient)

  • Correction of bias early in training:

m̂_t = m_t / (1 − β₁^t)
v̂_t = v_t / (1 − β₂^t)

  • Final update:

θ_t = θ_{t−1} − η / (√v̂_t + ε) · m̂_t
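The Adam equations above, written out as a sketch with typical default hyperparameters (β₁ = 0.9, β₂ = 0.999); this is an illustration of the update rule, not a replacement for a framework's built-in optimizer.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # momentum
    v = beta2 * v + (1 - beta2) * grad * grad     # rolling average of squared gradient
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 101):                           # minimize ||theta||^2 as a toy objective
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                      # moves toward [0, 0]
```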
slide-35
SLIDE 35

Training Tricks

slide-36
SLIDE 36

Shuffling the Training Data

  • Stochastic gradient methods update the parameters a little bit at a time
  • What if we have the sentence “I love this sentence so much!” at the end of the training data 50 times?
  • To train correctly, we should randomly shuffle the order at each time step
slide-37
SLIDE 37

Simple Methods to Prevent Over-fitting

  • Neural nets have tons of parameters: we want to prevent them from over-fitting
  • Early stopping:
  • monitor performance on held-out development data and stop training when it starts to get worse
  • Learning rate decay:
  • gradually reduce learning rate as training continues, or
  • reduce learning rate when dev performance plateaus
  • Patience:
  • learning can be unstable, so sometimes avoid stopping or decay until the dev performance gets worse n times (see the sketch below)
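A minimal sketch combining the three ideas above (early stopping, learning-rate decay, patience). The `train_one_epoch` and `evaluate_dev` callables are hypothetical stand-ins for a real training loop and dev evaluation.

```python
def train_with_patience(train_one_epoch, evaluate_dev, lr=0.1,
                        decay=0.5, patience=3, max_epochs=50):
    """Decay the learning rate when dev loss fails to improve `patience` times
    in a row; stop entirely after a second round of stagnation."""
    best_dev, bad_epochs, decays = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch(lr)
        dev_loss = evaluate_dev()
        if dev_loss < best_dev:
            best_dev, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                if decays >= 1:            # already decayed once: early stop
                    break
                lr *= decay                # learning rate decay
                decays, bad_epochs = decays + 1, 0
    return best_dev

# Toy usage with fake dev losses:
losses = iter([3.0, 2.5, 2.4, 2.45, 2.46, 2.47, 2.2, 2.25, 2.26, 2.27, 2.3])
print(train_with_patience(lambda lr: None, lambda: next(losses, 2.5)))
```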

slide-38
SLIDE 38

Which One to Use?

  • Adam is usually fast to converge and stable
  • But simple SGD tends to do very well in terms of generalization (Wilson et al. 2017)
  • You should use learning rate decay (e.g. see the machine translation results of Denkowski & Neubig 2017)

slide-39
SLIDE 39

Dropout

(Srivastava+ 14)

  • Neural nets have lots of parameters, and are prone to overfitting
  • Dropout: randomly zero out nodes in the hidden layer with probability p at training time only
  • Because the number of nodes at training/test time is different, scaling is necessary:
  • Standard dropout: scale by p at test time
  • Inverted dropout: scale by 1/(1-p) at training time (see the sketch below)
  • An alternative: DropConnect (Wan+ 2013) instead zeros out weights in the NN
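A sketch of inverted dropout as described above: at training time, zero out each hidden unit with probability p and scale the survivors by 1/(1-p), so no scaling is needed at test time.

```python
import numpy as np

def inverted_dropout(h, p=0.5, train=True, rng=np.random.default_rng(0)):
    """Randomly zero units with probability p; scale by 1/(1-p) at training time only."""
    if not train or p == 0.0:
        return h                          # test time: use the full network unchanged
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(10)
print(inverted_dropout(h, p=0.5))          # roughly half zeros, survivors scaled to 2.0
print(inverted_dropout(h, train=False))    # unchanged at test time
```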

slide-40
SLIDE 40

Let’s Try it Out! (nn-lm-optim.py)

slide-41
SLIDE 41

Efficiency Tricks:
 Operation Batching

slide-42
SLIDE 42

Efficiency Tricks:
 Mini-batching

  • On modern hardware 10 operations of size 1 is much slower than 1 operation of size 10
  • Minibatching combines together smaller operations into one big one

slide-43
SLIDE 43

Minibatching

slide-44
SLIDE 44

Manual Mini-batching

  • Group together similar operations (e.g. loss calculations for a single word) and execute them all together
  • In the case of a feed-forward language model, each word prediction in a sentence can be batched
  • For recurrent neural nets, etc., this is more complicated
  • How this works depends on the toolkit
  • Most toolkits require you to add an extra dimension representing the batch size
  • DyNet has special minibatch operations for lookup and loss functions; everything else is automatic

slide-45
SLIDE 45

Mini-batched Code Example
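The original slide shows toolkit-specific code that did not survive extraction; as a stand-in, here is a sketch of what mini-batching buys for the feed-forward LM above: one matrix-matrix product over a batch of contexts instead of a Python loop of matrix-vector products. All sizes and weights are made up.

```python
import numpy as np

V, emb, hid, batch = 1000, 64, 128, 32
rng = np.random.default_rng(0)
E, W1, W = rng.normal(size=(V, emb)), rng.normal(size=(hid, 2 * emb)), rng.normal(size=(V, hid))

contexts = rng.integers(0, V, size=(batch, 2))   # a batch of (prev2, prev1) word ids
targets = rng.integers(0, V, size=batch)

# Batched forward pass: an extra leading "batch" dimension on every tensor.
x = np.concatenate([E[contexts[:, 0]], E[contexts[:, 1]]], axis=1)   # (batch, 2*emb)
h = np.tanh(x @ W1.T)                                                # (batch, hid)
scores = h @ W.T                                                     # (batch, V)
scores -= scores.max(axis=1, keepdims=True)
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch), targets].mean()                  # mean NLL over the batch
print(loss)
```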

slide-46
SLIDE 46

Let’s Try it Out! (nn-lm-batch.py)

slide-47
SLIDE 47

Automatic Optimization

slide-48
SLIDE 48

Automatic Mini-batching!

  • TensorFlow Fold, DyNet Autobatching (see Neubig et al. 2017)
  • Try it with the --dynet-autobatch command line option
slide-49
SLIDE 49

Autobatching Usage

  • for each minibatch:
  • for each data point in mini-batch:
  • define/add data
  • sum losses
  • forward (autobatch engine does magic!)
  • backward
  • update
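The loop above, written as a DyNet-flavored sketch; `compute_loss` is a hypothetical function that builds the loss expression for one data point, and exact API details should be checked against the DyNet documentation.

```python
import dynet as dy   # assumes DyNet is installed; run with --dynet-autobatch 1

def train_epoch(minibatches, compute_loss, trainer):
    """Schematic autobatching loop: build many small loss expressions,
    sum them, and let the autobatch engine batch the forward/backward pass."""
    for minibatch in minibatches:
        dy.renew_cg()                                   # start a fresh computation graph
        losses = [compute_loss(x) for x in minibatch]   # define/add each data point
        total_loss = dy.esum(losses)                    # sum losses
        total_loss.forward()                            # forward (autobatch engine does magic!)
        total_loss.backward()                           # backward
        trainer.update()                                # update
```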
slide-50
SLIDE 50

Speed Improvements

slide-51
SLIDE 51

Code-level Optimization

  • e.g. TorchScript provides a restricted representation of a PyTorch module that can be run efficiently in C++
slide-52
SLIDE 52

A Case Study: Regularizing and Optimizing LSTM Language Models (Merity et al. 2017)

slide-53
SLIDE 53

Regularizing and Optimizing LSTM Language Models (Merity et al. 2017)

  • Uses LSTMs as a backbone (discussed later)
  • A number of tricks to improve stability and prevent overfitting:
  • DropConnect regularization
  • SGD w/ averaging, triggered when the model is close to convergence
  • Dropout on recurrent connections and embeddings
  • Weight tying
  • Independently tuned embedding and hidden layer sizes
  • Regularization of the activations of the network
  • Strong baseline for language modeling, SOTA at the time (without a special model, just training methods)

slide-54
SLIDE 54

Questions?