CS11-747 Neural Networks for NLP
Debugging Neural Networks for NLP
Graham Neubig
Site https://phontron.com/class/nn4nlp2019/
In Neural Networks, Tuning is Paramount!
Everything is a hyperparameter: network size/depth, small vs. large model, and many other choices.
Debugging neural networks can be difficult!
Identifying training-time problems: look at the loss calculated on the training set. Is the loss going down, and does it approach zero if you run training long enough (e.g. 20-30 epochs)?
Networks with many layers can be hard to train due to vanishing gradients.
Solutions: Residual Connections (He et al. 2015), Highway Networks (Srivastava et al. 2015)
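A minimal NumPy sketch of the residual idea (function names are mine, not from the lecture): the layer computes `h + f(h)`, so the identity path lets gradients flow through many layers without vanishing.

```python
import numpy as np

def plain_layer(h, W):
    """A plain tanh layer: stacking many of these shrinks gradients."""
    return np.tanh(W @ h)

def residual_layer(h, W):
    """Residual connection (He et al. 2015): output = input + layer(input).
    The identity term gives gradients a direct path through the layer."""
    return h + np.tanh(W @ h)

rng = np.random.default_rng(0)
h = rng.normal(size=4)
W = rng.normal(scale=0.1, size=(4, 4))
out = residual_layer(h, W)
```

Note that if the layer computes nothing useful (e.g. `W = 0`), the residual layer simply passes its input through unchanged.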
Reminder: optimizers. Training a neural network is an optimization problem.
- SGD: take a step in the direction of the negative gradient.
- Momentum: exponentially smooth gradients over previous steps to prevent sudden changes.
- Adagrad: adapt the learning rate per parameter, lowering it for frequently updated parameters (as measured by the variance of the gradient).
- Adam: combines momentum and gradient variance.
(See Ruder 2016 reference for more details)
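A sketch of the update rules above in NumPy (a simplified illustration of the standard formulas, not the lecture's code):

```python
import numpy as np

def sgd_update(param, grad, lr=0.1):
    """Plain SGD: step in the direction of the negative gradient."""
    return param - lr * grad

def momentum_update(param, grad, velocity, lr=0.1, beta=0.9):
    """SGD with momentum: exponentially smooth gradients over previous
    steps to prevent sudden changes of direction."""
    velocity = beta * velocity + grad
    return param - lr * velocity, velocity

def adam_update(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: combines momentum (first moment m) with a per-parameter
    scale from the gradient's second moment v, lowering the effective
    rate for frequently/strongly updated parameters."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```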
Learning rate: if it is too high, training can diverge; if too low, training is slow. A common strategy is to start with a higher rate and then gradually decrease it (SGD default is 0.1, Adam 0.001).
Initialization matters too: a bad starting point can give different sized gradients that vanish or explode. Common schemes:
- Gaussian: initialize according to a zero-mean Gaussian distribution.
- Uniform: initialize uniformly within a range.
- Glorot (Xavier): initialize in a uniform manner, where the range is specified according to net size.
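The Glorot range and a simple decay schedule can be sketched as follows (the halve-per-epoch interval here is an illustrative choice, not a recommendation from the slides):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Glorot/Xavier initialization: uniform within a range chosen from
    the layer size, keeping gradient magnitudes roughly constant."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def decayed_lr(initial_lr, epoch, decay=0.5):
    """Start high, then gradually decrease: halve the rate each epoch."""
    return initial_lr * decay ** epoch

W = glorot_uniform(100, 100, np.random.default_rng(0))
```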
Minibatching requires padding sentences to the same length and masking the loss calculation on padded positions:

  sentence 1: this is an example </s>
  sentence 2: this is another </s> </s>   (final </s> is padding)
  mask 1:     1 1 1 1 1
  mask 2:     1 1 1 1 0
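A minimal sketch of a masked loss in NumPy (assuming per-position log probabilities of the correct words are already computed):

```python
import numpy as np

def masked_nll(log_probs, mask):
    """Average negative log likelihood over real tokens only.
    log_probs: (batch, time) log p of the correct word at each position
    mask:      (batch, time) 1 for real tokens, 0 for padding"""
    return -(log_probs * mask).sum() / mask.sum()

# two sentences; the second has one padded position
log_probs = np.log(np.array([[0.5, 0.5, 0.5, 0.5],
                             [0.5, 0.5, 0.5, 0.1]]))
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 1, 0]])
loss = masked_nll(log_probs, mask)  # the p=0.1 padding position is ignored
```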
Bucketing/sorting: with sentences of very different lengths, too much padding can result in slow training. To fix this, sort the data so that similarly sized sentences are in the same batch.
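A simple length-sorted batching sketch (one of several possible bucketing strategies; in practice batches are usually also shuffled between epochs):

```python
def length_batches(sentences, batch_size):
    """Sort by length so similarly sized sentences share a batch,
    minimizing the padding needed within each batch."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

data = [["a", "b", "c"], ["a"], ["a", "b"], ["a", "b", "c", "d"]]
batches = length_batches(data, 2)
```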
Beware of training/decoding mismatches: loss calculation and generation are often implemented in different functions (e.g. calc_loss() and generate() functions), and any mismatch between the two is a major source of bugs! For example, training is often batched while generation is not.
A good unit test: generate an output, compute the log probabilities of its words, and sum them; the result should match the output of the loss calculation (up to numerical precision).
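The unit test above can be sketched with a toy model (the fixed probability table and function names here are illustrative stand-ins for a real model's calc_loss()/generate()):

```python
import math

# toy "model": fixed next-word distributions given the previous word
PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"dog": 0.7, "</s>": 0.3},
    "a":   {"dog": 0.5, "</s>": 0.5},
    "dog": {"</s>": 1.0},
}

def calc_loss(words):
    """Training side: negative log likelihood of a full sequence."""
    loss, prev = 0.0, "<s>"
    for w in words + ["</s>"]:
        loss -= math.log(PROBS[prev][w])
        prev = w
    return loss

def generate():
    """Decoding side: greedy search that also returns its own score."""
    words, score, prev = [], 0.0, "<s>"
    while prev != "</s>":
        w, p = max(PROBS[prev].items(), key=lambda kv: kv[1])
        score += math.log(p)
        if w != "</s>":
            words.append(w)
        prev = w
    return words, score

# unit test: the loss of the generated output should equal the negated
# generation score, up to numerical precision
words, score = generate()
assert abs(calc_loss(words) + score) < 1e-10
```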
Another mismatch: at training time we feed your model the reference word at each time step (teacher forcing), but at test time it must condition on its own previous predictions.
Beam search: instead of committing to the single best word at each step, maintain several paths (hypotheses). As the beam grows, model scores should get better (almost all the time); another useful sanity check is that you get a better overall model score with larger sizes.
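A compact beam search sketch over the same kind of toy next-word table (the table and names are illustrative), including the larger-beam-never-scores-worse check:

```python
import math

PROBS = {  # toy next-word distributions
    "<s>": {"b": 0.6, "a": 0.4},
    "a": {"</s>": 1.0},
    "b": {"c": 0.9, "</s>": 0.1},
    "c": {"</s>": 1.0},
}

def beam_search(beam_size, max_len=5):
    """Keep the `beam_size` best partial paths instead of only one;
    returns the (log score, path) of the best finished hypothesis."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, path in beams:
            for w, p in PROBS[path[-1]].items():
                new = (score + math.log(p), path + [w])
                (finished if w == "</s>" else candidates).append(new)
        beams = sorted(candidates, reverse=True)[:beam_size]
        if not beams:
            break
    return max(finished) if finished else max(beams)

# sanity check: a larger beam should not find a worse model score
assert beam_search(2)[0] >= beam_search(1)[0]
```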
Debugging at test time often starts with looking at outputs and realizing something is wrong, e.g. outputs such as "> went to the store yesterday", "> bought a dog", or rare entities like "Nebraska at Kearney".
Quantitative analysis: measure the phenomenon you chose to focus on. Is that phenomenon getting better? E.g., is accuracy on low frequency words increasing? If syntax should be getting better, are you doing better on long-distance dependencies?
Accuracy is a symptom of a different problem; one common culprit is overfitting. Neural networks are powerful enough to simply memorize their training data: even given random labels (no relationship b/t input and labels), a network can still learn to fit the training set (Zhang et al. 2017).
Adaptive optimizers such as Adam may also have a stronger tendency to overfit on small data (Wilson et al. 2017).
Regularization methods help prevent models from over-fitting:
- Early stopping: monitor held-out development data and stop training when it starts to get worse.
- Learning rate decay: reduce the learning rate when you start overfitting the development set (the "newbob" learning rate schedule), optionally restoring the best model so far, and continue training; shown effective for NMT (Denkowski and Neubig 2017).
- Dropout: neural networks have many parameters and are thus prone to overfitting. Randomly zero out each node in the hidden layer with probability p at training time only. Because the number of active nodes at training and test time is different, scaling is necessary: either scale outputs down at test time, or scale the surviving nodes up by 1/(1-p) at training time (inverted dropout).
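Inverted dropout can be sketched in a few lines of NumPy (a simplified illustration; real frameworks handle this inside their layer implementations):

```python
import numpy as np

def dropout(h, p, train, rng=None):
    """Inverted dropout: zero each node with probability p at training
    time only, scaling survivors by 1/(1-p) so each node's expected
    value matches between training and test time."""
    if not train:
        return h  # no test-time scaling needed with inverted dropout
    if rng is None:
        rng = np.random.default_rng()
    keep = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * keep / (1.0 - p)

out = dropout(np.ones(10000), 0.5, train=True, rng=np.random.default_rng(0))
```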
Mismatch between loss and evaluation: we use maximum likelihood for training, but evaluate with metrics like accuracy or BLEU. As likelihood improves, accuracy can get worse; likewise, a larger beam can yield a worse BLEU score! The model assigns high probability to overly short hypotheses, a bigger search finds them, but BLEU likes correct-length sentences.
Possible solutions include training techniques to be discussed in future classes.
In conclusion: everything is a hyper-parameter. Debug systematically, look at your data and outputs, and your networks will usually respond!