

  1. CS11-747 Neural Networks for NLP: Language Modeling, Efficiency/Training Tricks. Graham Neubig. Site: https://phontron.com/class/nn4nlp2020/

  2. Are These Sentences OK? • Jane went to the store. • store to Jane went the. • Jane went store. • Jane goed to the store. • The store went to Jane. • The food truck went to Jane.

  3. Language Modeling: Calculating the Probability of a Sentence
     $P(X) = \prod_{i=1}^{I} P(x_i \mid x_1, \ldots, x_{i-1})$   (next word given the context)
     The big problem: How do we predict $P(x_i \mid x_1, \ldots, x_{i-1})$?!?!

  4. Covered Concept Tally

  5. Review: Count-based Language Models

  6. Count-based Language Models
     • Count up the frequency and divide: $P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) := \frac{c(x_{i-n+1}, \ldots, x_i)}{c(x_{i-n+1}, \ldots, x_{i-1})}$
     • Add smoothing, to deal with zero counts: $P(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) = \lambda P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) + (1 - \lambda) P(x_i \mid x_{i-n+2}, \ldots, x_{i-1})$
     • Modified Kneser-Ney smoothing
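As a concrete illustration of the two formulas above, here is a minimal sketch of an interpolated bigram model (my own function names; the toy corpus, the fixed weight lam, and the lack of Kneser-Ney discounting are simplifying assumptions, not the course code):

```python
from collections import Counter

def train_counts(corpus):
    # Count unigrams and bigrams over a whitespace-tokenized corpus.
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def interp_prob(word, prev, unigrams, bigrams, lam=0.9):
    # Mix the bigram ML estimate with the unigram fallback distribution.
    p_uni = unigrams[word] / sum(unigrams.values())
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

unigrams, bigrams = train_counts(["jane went to the store", "jane went home"])
print(interp_prob("to", "went", unigrams, bigrams))
```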

  7. A Refresher on Evaluation
     • Log-likelihood: $LL(\mathcal{E}_{test}) = \sum_{E \in \mathcal{E}_{test}} \log P(E)$
     • Per-word Log Likelihood: $WLL(\mathcal{E}_{test}) = \frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} \log P(E)$
     • Per-word (Cross) Entropy: $H(\mathcal{E}_{test}) = \frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} -\log_2 P(E)$
     • Perplexity: $ppl(\mathcal{E}_{test}) = 2^{H(\mathcal{E}_{test})} = e^{-WLL(\mathcal{E}_{test})}$
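Because these four quantities are easy to mix up, here is a small sketch (mine, not from the slides) computing them from per-sentence natural-log probabilities and sentence lengths; the example numbers are made up:

```python
import math

def eval_metrics(log_probs, lengths):
    total_words = sum(lengths)
    ll = sum(log_probs)                            # log-likelihood (natural log)
    wll = ll / total_words                         # per-word log likelihood
    entropy = -ll / (total_words * math.log(2))    # per-word cross entropy, in bits
    ppl = 2 ** entropy                             # equals math.exp(-wll)
    return ll, wll, entropy, ppl

print(eval_metrics(log_probs=[-12.3, -8.7], lengths=[5, 3]))
```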

  8. What Can we Do w/ LMs?
     • Score sentences:
       Jane went to the store . → high
       store to Jane went the . → low
       (same as calculating loss for training)
     • Generate sentences:
       while didn't choose end-of-sentence symbol:
         calculate probability
         sample a new word from the probability distribution
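The generation loop above in code form, assuming an already-trained model object `lm` with a hypothetical next_word_distribution(context) method returning a {word: probability} dict (this interface is an assumption for the sketch, not the course API):

```python
import random

def generate(lm, max_len=50):
    sent = ["<s>"]
    while sent[-1] != "</s>" and len(sent) < max_len:
        dist = lm.next_word_distribution(sent)                # calculate probability
        words, probs = zip(*dist.items())
        sent.append(random.choices(words, weights=probs)[0])  # sample a new word
    return [w for w in sent[1:] if w != "</s>"]
```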

  9. Problems and Solutions?
     • Cannot share strength among similar words:
       she bought a car / she bought a bicycle / she purchased a car / she purchased a bicycle
       → solution: class-based language models
     • Cannot condition on context with intervening words:
       Dr. Jane Smith / Dr. Gertrude Smith
       → solution: skip-gram language models
     • Cannot handle long-distance dependencies:
       for tennis class he wanted to buy his own racquet / for programming class he wanted to buy his own computer
       → solution: cache, trigger, topic, syntactic models, etc.

  10. An Alternative: Featurized Log-Linear Models

  11. An Alternative: Featurized Models
     • Calculate features of the context
     • Based on the features, calculate probabilities
     • Optimize feature weights using gradient descent, etc.

  12. Example: Previous words: "giving a"
     For each word we're predicting: b (how likely is it overall?), w_1,a (how likely is it given the previous word is "a"?), w_2,giving (how likely is it given the 2nd-previous word is "giving"?), and s (total score).

        word    b      w_1,a   w_2,giving    s
        a       3.0    -6.0    -0.2         -3.2
        the     2.5    -5.1    -0.3         -2.9
        talk   -0.2     0.2     1.0          1.0
        gift    0.1     0.1     2.0          2.2
        hat     1.2     0.5    -1.2          0.6
        …       …       …       …            …

  13. Softmax
     • Convert scores into probabilities by taking the exponent and normalizing (softmax):
       $P(x_i \mid x_{i-n+1}^{i-1}) = \frac{e^{s(x_i \mid x_{i-n+1}^{i-1})}}{\sum_{\tilde{x}_i} e^{s(\tilde{x}_i \mid x_{i-n+1}^{i-1})}}$
     • e.g. s = [-3.2, -2.9, 1.0, 2.2, 0.6, …] → p = [0.002, 0.003, 0.329, 0.444, 0.090, …]
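A minimal NumPy sketch of the softmax itself; note that the probabilities on the slide are normalized over the whole vocabulary (including the "…" entries), so running this on the five visible scores alone will not reproduce them exactly:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

s = np.array([-3.2, -2.9, 1.0, 2.2, 0.6])
print(softmax(s))   # a distribution over these five words only
```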

  14. A Computation Graph View
     lookup2(giving) + lookup1(a) + bias = scores; softmax(scores) = probs
     Each vector is the size of the output vocabulary.
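The same graph as code, assuming NumPy weight matrices with one row per context word (my own variable names, in the spirit of the course's loglin-lm.py but not the actual file):

```python
import numpy as np

V = 1000                       # vocabulary size (toy value)
W1 = np.zeros((V, V))          # row v: scores contributed when the previous word is v
W2 = np.zeros((V, V))          # row v: scores contributed when the 2nd-previous word is v
b = np.zeros(V)                # bias: how likely each word is overall

def loglin_probs(prev2_id, prev1_id):
    # Each lookup grabs one length-V row; the rows and the bias are simply added.
    scores = W2[prev2_id] + W1[prev1_id] + b
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```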

  15. A Note: "Lookup"
     • Lookup can be viewed as "grabbing" a single vector from a big matrix of word embeddings (num. words × vector size), e.g. lookup(2) grabs row 2
     • Similarly, it can be viewed as multiplying the matrix by a "one-hot" vector (length num. words, with a 1 in position 2 and 0s elsewhere)
     • The former tends to be faster
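The two views in a few lines of NumPy (toy sizes chosen for the sketch):

```python
import numpy as np

E = np.random.randn(5, 3)      # embedding matrix: num. words x vector size

row = E[2]                     # view 1: grab row 2 directly

one_hot = np.zeros(5)          # view 2: multiply by a one-hot vector
one_hot[2] = 1.0
via_matmul = one_hot @ E

assert np.allclose(row, via_matmul)   # same result; direct indexing is faster
```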

  16. Training a Model
     • Reminder: to train, we calculate a "loss function" (a measure of how bad our predictions are), and move the parameters to reduce the loss
     • The most common loss function for probabilistic models is "negative log likelihood"
     • e.g. for p = [0.002, 0.003, 0.329, 0.444, 0.090, …], if element 3 (or zero-indexed, 2) is the correct answer, the loss is −log 0.329 = 1.112
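The same arithmetic as code (natural log, matching the 1.112 on the slide):

```python
import math

p = [0.002, 0.003, 0.329, 0.444, 0.090]
correct = 2                       # zero-indexed position of the correct word
loss = -math.log(p[correct])      # negative log likelihood
print(loss)                       # ≈ 1.112
```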

  17. Parameter Update
     • Back propagation allows us to calculate the derivative of the loss with respect to the parameters: $\frac{\partial \ell}{\partial \theta}$
     • Simple stochastic gradient descent optimizes parameters according to the following rule: $\theta \leftarrow \theta - \alpha \frac{\partial \ell}{\partial \theta}$
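A sketch of that update rule for NumPy parameter arrays, assuming the gradients have already been computed by backprop (the learning rate value here is arbitrary):

```python
import numpy as np

def sgd_update(params, grads, alpha=0.1):
    # In-place update of each parameter array: theta <- theta - alpha * dl/dtheta
    for theta, grad in zip(params, grads):
        theta -= alpha * grad

W = np.ones((2, 2))
sgd_update([W], [np.full((2, 2), 0.5)])
print(W)   # every entry is now 1 - 0.1 * 0.5 = 0.95
```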

  18. Choosing a Vocabulary

  19. Unknown Words
     • Necessity for UNK words:
       • We won't have all the words in the world in training data
       • Larger vocabularies require more memory and computation time
     • Common ways:
       • Frequency threshold (usually UNK <= 1)
       • Rank threshold
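A sketch of the frequency-threshold option (min_count=2 means words seen at most once become UNK; a rank threshold would instead keep only the k most frequent words):

```python
from collections import Counter

def build_vocab(corpus, min_count=2):
    counts = Counter(w for sent in corpus for w in sent.split())
    vocab = {"<unk>": 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def to_ids(sent, vocab):
    # Map every out-of-vocabulary word to the UNK id.
    return [vocab.get(w, vocab["<unk>"]) for w in sent.split()]
```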

  20. Evaluation and Vocabulary
     • Important: the vocabulary must be the same over models you compare
     • Or more accurately, all models must be able to generate the test set (it's OK if they can generate more than the test set, but not less)
     • e.g. comparing a character-based model to a word-based model is fair, but not vice-versa

  21. Let’s try it out! ( loglin-lm.py )

  22. What Problems are Handled?
     • Cannot share strength among similar words:
       she bought a bicycle / she bought a car / she purchased a car / she purchased a bicycle
       → not solved yet 😟
     • Cannot condition on context with intervening words:
       Dr. Jane Smith / Dr. Gertrude Smith
       → solved! 😁
     • Cannot handle long-distance dependencies:
       for tennis class he wanted to buy his own racquet / for programming class he wanted to buy his own computer
       → not solved yet 😟

  23. Beyond Linear Models

  24. Linear Models can't Learn Feature Combinations
     students take tests → high    teachers take tests → low
     students write tests → low    teachers write tests → high
     • These can't be expressed by linear features
     • What can we do?
       • Remember combinations as features (individual scores for "students take", "teachers write") → feature space explosion!
       • Neural nets

  25. Neural Language Models (See Bengio et al. 2004)
     • Look up the embeddings of the context words (giving, a) and concatenate them into h
     • Hidden layer: tanh(W_1 h + b_1)
     • W_softmax · hidden + bias = scores; softmax(scores) = probs
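A compact sketch of this architecture in PyTorch (my own layer names and sizes, in the spirit of the course's nn-lm.py but not the actual file):

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    # Two-word context -> concatenated embeddings -> tanh hidden layer -> scores.
    def __init__(self, vocab_size, emb_size=64, hid_size=128, context=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.hidden = nn.Linear(context * emb_size, hid_size)
        self.out = nn.Linear(hid_size, vocab_size)     # W_softmax and bias

    def forward(self, context_ids):                    # shape: (batch, context)
        h = self.emb(context_ids).view(context_ids.size(0), -1)  # concatenate
        return self.out(torch.tanh(self.hidden(h)))    # scores (softmax is in the loss)

lm = FeedForwardLM(vocab_size=1000)
scores = lm(torch.tensor([[3, 7]]))                    # e.g. ids for "giving a"
loss = nn.CrossEntropyLoss()(scores, torch.tensor([42]))   # NLL of the true next word
```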

  26. Where is Strength Shared?
     • Word embeddings: similar input words get similar vectors
     • Similar contexts get similar hidden states (through tanh(W_1 h + b_1))
     • Similar output words get similar rows in the softmax matrix

  27. What Problems are Handled?
     • Cannot share strength among similar words:
       she bought a bicycle / she bought a car / she purchased a car / she purchased a bicycle
       → solved, and similar contexts as well! 😁
     • Cannot condition on context with intervening words:
       Dr. Jane Smith / Dr. Gertrude Smith
       → solved! 😁
     • Cannot handle long-distance dependencies:
       for tennis class he wanted to buy his own racquet / for programming class he wanted to buy his own computer
       → not solved yet 😟

  28. Let’s Try it Out! ( nn-lm.py )

  29. Tying Input/Output Embeddings
     • We can share parameters between the input and output embeddings (Press et al. 2016, inter alia): instead of a separate input lookup, pick the word's row from the softmax matrix
     • Want to try? Delete the input embeddings, and instead pick a row from the softmax matrix.
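One way to wire this into the feed-forward sketch above (an illustration of the idea, not the paper's exact recipe): let the embedding table and the output layer share one weight matrix, which requires the hidden size to equal the embedding size.

```python
import torch
import torch.nn as nn

class TiedFeedForwardLM(nn.Module):
    # The input "lookup" now picks a row of the softmax matrix, since both layers
    # point at the same underlying (vocab_size x emb_size) parameter.
    def __init__(self, vocab_size, emb_size=64, context=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.hidden = nn.Linear(context * emb_size, emb_size)
        self.out = nn.Linear(emb_size, vocab_size)     # softmax matrix + bias
        self.out.weight = self.emb.weight              # tie the two matrices

    def forward(self, context_ids):
        h = self.emb(context_ids).view(context_ids.size(0), -1)
        return self.out(torch.tanh(self.hidden(h)))

lm = TiedFeedForwardLM(vocab_size=1000)
print(lm(torch.tensor([[3, 7]])).shape)                # torch.Size([1, 1000])
```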

  30. Optimizers
