SI425 : NLP Set 14 Neural NLP Fall 2020 : Chambers


  1. SI425 : NLP Set 14 Neural NLP Fall 2020 : Chambers

  2. Why are these so different?
     • Last time: Word2Vec learned word embeddings
     • This time: use word embeddings as input to classifiers

  3. Recall: Logistic Regression. x = “it was the best of times it was the worst of times”
        feature:      it    was   the   best  of    he    she   times pizza ok    worst
        count x:      2     1     2     1     2     0     0     2     0     0     1
        Dickens w:   -0.1   0.05  0.0   0.42  0.12  0.3   0.2   1.1  -1.5  -0.2   0.3
        f(x) = 3.87

  4. Recall: Logistic Regression. x = “it was the best of times it was the worst of times”, now represented as a dense vector rather than word counts:
        input x:      1.3  -0.8   0.4  -1.1  -3.8   2.5   0.9   1.2  -0.3  -0.8   1.0
        Dickens w:   -0.1   0.05  0.0   0.42  0.12  0.3   0.2   1.1  -1.5  -0.2   0.3
        f(x) = 3.87
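
  On both slides the classifier's score is just a dot product between the Dickens weight vector w and the input vector x. A minimal sketch with the slide's numbers (the slide's 3.87 may also fold in a bias term, so the raw dot product here is illustrative rather than an exact reproduction):

      import numpy as np

      # Word-count features for "it was the best of times it was the worst of times"
      # and the learned "Dickens" weights, both copied from the slide.
      x = np.array([2, 1, 2, 1, 2, 0, 0, 2, 0, 0, 1], dtype=float)
      w = np.array([-0.1, 0.05, 0.0, 0.42, 0.12, 0.3, 0.2, 1.1, -1.5, -0.2, 0.3])

      score = w.dot(x)   # one real-valued score for the Dickens class
      print(score)       # a dense embedding x would be scored the same way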

  5. Logistic Regression -> Neural Network
     • This is a visualization of logistic regression.
     • Each arrow is one learned weight for Dickens.
     • Input is a word embedding, or an entire sentence embedding!

  6. Logistic Regression -> Neural Network
     • A full feed-forward neural network!
     • A vector of weights for each author.
     • W is thus a matrix of weights; each row is one author’s weight vector.

  7. Feed-Forward Neural Network
     • The final prediction layer usually has a softmax normalization.
     • The yellow boxes are just scores from dot products.
     softmax(c) = e^(score_c) / Σ_i e^(score_i)
     softmax(w_c · x) = e^(w_c · x) / Σ_i e^(w_i · x)
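
  A minimal sketch of that normalization, assuming the per-class scores (the dot products w_c · x) have already been computed; the numbers here are made up:

      import numpy as np

      def softmax(scores):
          # Subtracting the max is for numerical stability; it does not change the result.
          exp = np.exp(scores - np.max(scores))
          return exp / exp.sum()

      scores = np.array([3.87, 1.2, -0.5])   # hypothetical scores w_c . x, one per author
      print(softmax(scores))                 # probabilities over authors, summing to 1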

  8. Logistic Regression -> Neural Network
     • Logistic regression is just a feed-forward neural network!
     • The softmax produces your probabilities.
     [Diagram: input X feeds the weights, and a softmax turns the scores into your probabilities]

  9. Multi-layer neural network
     • Add another layer.
     • The middle layer represents “hidden” attributes of the texts.
     • The numbers optimize themselves to identify these attributes.
     [Diagram: input X, a hidden layer, then a softmax output]
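
  A minimal sketch of that two-layer forward pass, with hypothetical sizes and random stand-ins for the learned weights (the sigmoid hidden layer is just one common choice of nonlinearity):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def softmax(scores):
          exp = np.exp(scores - np.max(scores))
          return exp / exp.sum()

      rng = np.random.default_rng(0)
      x  = rng.normal(size=11)         # input vector (hypothetical size)
      W1 = rng.normal(size=(5, 11))    # input -> hidden weights
      W2 = rng.normal(size=(3, 5))     # hidden -> output weights, one row per author

      h = sigmoid(W1 @ x)              # middle layer: "hidden" attributes of the text
      probs = softmax(W2 @ h)          # one probability per author
      print(probs)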

  10. Input to neural networks: where does X come from?
     • These can be standard features that we’ve discussed.
     • But often the input is not hand-crafted features; instead, it’s word embeddings.

  11. How do we embed more than one word? (a small sketch of both options follows below)
     • Additive
     • Concatenate (if your input will be fixed size)
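
  A minimal sketch of both options, assuming we already have Word2Vec-style vectors for each word (the 4-dimensional embeddings below are made up for illustration):

      import numpy as np

      # Hypothetical 4-dimensional embeddings; real ones would come from Word2Vec.
      embeddings = {
          "the":   np.array([0.1, -0.2,  0.0,  0.3]),
          "best":  np.array([0.9,  0.4, -0.1,  0.2]),
          "times": np.array([0.5,  0.1,  0.7, -0.3]),
      }
      words = ["the", "best", "times"]

      # Additive: one 4-d vector, regardless of how many words there are.
      additive = sum(embeddings[w] for w in words)

      # Concatenate: a 12-d vector, only sensible when the input length is fixed.
      concatenated = np.concatenate([embeddings[w] for w in words])

      print(additive.shape, concatenated.shape)   # (4,) (12,)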

  12. Problems with additive
     • Function words (“the”) can add noise to the final embedding.
     • All words receive equal treatment in the final embedding.
     • Long text leads to a less meaningful final embedding.

  13. Solution: Recurrent Neural Networks
     • RNN: Recurrent Neural Network (a family of neural networks)

  14. Solution: Recurrent Neural Networks
     • Key idea: reuse the same W_h and W_e at every time step.
     Probability prediction: softmax(W_c · h_t + c_2)
     Hidden states: h_t = σ(W_h · h_(t-1) + W_e · x_t + c)
     Inputs x_t: Word2Vec embeddings
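
  A minimal sketch of those two equations, with hypothetical dimensions and random stand-ins for the learned weights W_h, W_e, W_c and the biases:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def softmax(scores):
          exp = np.exp(scores - np.max(scores))
          return exp / exp.sum()

      hidden_dim, embed_dim, num_classes = 8, 4, 2
      rng = np.random.default_rng(0)
      W_h = rng.normal(size=(hidden_dim, hidden_dim))
      W_e = rng.normal(size=(hidden_dim, embed_dim))
      W_c = rng.normal(size=(num_classes, hidden_dim))
      c   = np.zeros(hidden_dim)
      c2  = np.zeros(num_classes)

      # Run the recurrence over a sequence of (hypothetical) word embeddings.
      sentence = [rng.normal(size=embed_dim) for _ in range(5)]
      h = np.zeros(hidden_dim)
      for x_t in sentence:
          h = sigmoid(W_h @ h + W_e @ x_t + c)   # h_t = sigma(W_h h_{t-1} + W_e x_t + c)

      probs = softmax(W_c @ h + c2)              # softmax(W_c . h_t + c_2)
      print(probs)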

  15. RNN pros and cons
     Benefits
     • Handles any length input.
     • Uses the same weights for all time steps. Repeated words are treated the same regardless of position.
     • Model size is the same regardless of length.
     • Hidden state weights can learn to drop irrelevant history.
     Drawbacks
     • Information farthest back tends to get lost.

  16. Training RNNs
     • Find a big text dataset with labeled passages!
     [Example from the slide: a short passage (“Laith and I Rlly …”) labeled with positive sentiment, from the classes Positive / Negative]

  17. Training RNNs
     • Compute the loss from the predicted probabilities.
     Predicted: softmax(W_c · h_t + c_2) = [0.8, 0.2] over [Positive, Negative]
     Gold: [1.0, 0.0]
     Approximate loss: Gold − softmax(W_c · h_t + c_2) = 0.2

  18. Training RNNs
     • Compute the loss from the predicted probabilities (the same [0.8, 0.2] vs. gold [1.0, 0.0] example as the previous slide).
     • Update the weights! (back propagation)
     • This is like logistic regression: take derivatives. The loss is passed backward through each layer of computation. Beyond the scope of this class.
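
  A minimal sketch of the loss on these two slides. The “gold minus predicted” difference is the slides' approximation; in practice a cross-entropy loss is usually what gets minimized, and the framework computes the backward pass automatically:

      import numpy as np

      predicted = np.array([0.8, 0.2])   # softmax(W_c . h_t + c_2) for [Positive, Negative]
      gold      = np.array([1.0, 0.0])   # one-hot gold label

      error = gold - predicted           # [0.2, -0.2]; the 0.2 gap on the gold class is the slide's approximate loss
      cross_entropy = -np.sum(gold * np.log(predicted))   # ~0.22, the loss typically minimized in practice
      print(error, cross_entropy)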

  19. RNN applications
     • Standard classifiers from text to label (e.g., predicting a label like “Funny”).
     • Language generation: “My roommate started _______ ?”

  20. RNN applications
     • Part of speech tagging!

  21. Takeaways
     • You can represent any piece of text as a vector.
     • Once you have a vector, you can make predictions from it using simple classifiers (logistic regression).
     • RNNs encode a passage into a single hidden state vector.
     • RNNs are just a smarter way to combine words than simple addition (they’re weighted addition!).


  23. RNNs Improved
     • Recurrent Neural Networks refer to any architecture that reuses its weights over a sequence of input.
     • There are many improvements to the basic RNN that we discussed last time.

  24. Bidirectional RNNs
     • Run an RNN backwards in addition to forwards.
     • Use both final hidden states to then make your prediction.
     A Keras sketch follows below.
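
  In Keras (the same library used in the LSTM example on slide 29), the bidirectional version is a one-line change; a minimal sketch with hypothetical sizes:

      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

      vocab_size, embedding_dim, seq_len, units, nb_labels = 10000, 100, 50, 64, 2   # hypothetical

      model = Sequential()
      model.add(Embedding(vocab_size, embedding_dim, input_length=seq_len))
      # Bidirectional runs the LSTM forwards and backwards and concatenates the
      # two final hidden states before the prediction layer.
      model.add(Bidirectional(LSTM(units)))
      model.add(Dense(nb_labels, activation='softmax'))
      model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
      model.summary()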

  25. Bidirectional RNNs
     • A simplified diagram, although less precise on how it works.
     • Arrows between layers typically mean “fully connected”: each cell is connected to each cell in the next layer with a learned weight.

  26. LSTM (a type of RNN)
     • LSTM: Long Short-Term Memory
     • The LSTM replaces a base RNN’s hidden state with a “black box” that contains several hidden state vectors, all connected through different types of weights.
     https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714

  27. LSTM
     • Maintains memory so it can represent words farther away.
     • Has a “forget” mechanism so it can learn to ignore certain inputs.
     • Bi-directional typically still helps, despite longer memory.

  28. ELMo (AllenNLP, 2018) https://arxiv.org/pdf/1802.05365.pdf
     • Stacked LSTMs!
     • Output representation is a linear combination of all hidden states below.

  29. Can I use an LSTM?
     • Sure! (TensorFlow and Keras)
     • Yes, there is a learning curve to understanding most neural net code.

      # Imports added for completeness; vocab_in_size, embedding_dim, len_input_train,
      # units, nb_labels, BATCH_SIZE, and the training arrays are assumed to be
      # defined earlier in the notebook.
      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import Embedding, LSTM, Dense

      model_lstm = Sequential()
      model_lstm.add(Embedding(vocab_in_size, embedding_dim, input_length=len_input_train))
      model_lstm.add(LSTM(units))
      model_lstm.add(Dense(nb_labels, activation='softmax'))
      model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
      model_lstm.summary()
      history_lstm = model_lstm.fit(input_data_train, intent_data_label_cat_train,
                                    epochs=10, batch_size=BATCH_SIZE)
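
  Once trained, predictions come from the same object; a brief sketch, where input_data_test is a hypothetical held-out set encoded exactly like input_data_train:

      predicted_probs  = model_lstm.predict(input_data_test)   # input_data_test is hypothetical
      predicted_labels = predicted_probs.argmax(axis=-1)        # index of the most probable label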

  30. Text: today’s common input
     • RNNs like LSTMs focus on representing language in a high-dimensional space … a vector (embedding).
     • If we can convert any text to a vector, then we can more easily focus on whatever our goal is:
       • Sentiment prediction
       • Language generation (word prediction)
       • Political stance detection
       • Author detection
       • Information extraction
       • …

  31. Text: today’s common input
     [Example from the slide: a passage classified as FUNNY]

  32. Text: today’s common input
     1. Word2Vec word representations
     2. Embedding addition/concatenation
     3. RNNs
     4. Transformers (the latest craze)
     Each step gives better text representations: more effective embeddings for NLP tasks.

  33. The Transformer (Google Research 2017) https://arxiv.org/pdf/1706.03762.pdf
     • Problem with RNN: loses context with text distance.
     • Solution: combine all words at once at each position.

  34. Transformer
     • Instead of sequentially adding the surrounding context, add everything at once but with specialized word weights.
     • The word “it” will add information to itself from other relevant words in the sentence, but weighted by importance to itself.

  35. Self-Attention
     • The three vectors (query, key, value) are just transformations from X1; think of them as variants of X1.
     • Example from the diagram: add 88% of V1 and 12% of V2!

  36. Transformer
     • Self-Attention is the mechanism to convert an input vector X into an output vector Z based on all other input vectors.
     • Step 1: Transform X1 into Z1 by adding up all the other X inputs.
     • Step 2: Transform Z1 to R1 with a normal feed-forward layer.
     A small sketch of the computation follows below.
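
  A minimal numpy sketch of the scaled dot-product self-attention from the cited paper, with random stand-ins for the learned projection matrices:

      import numpy as np

      def softmax(scores, axis=-1):
          exp = np.exp(scores - scores.max(axis=axis, keepdims=True))
          return exp / exp.sum(axis=axis, keepdims=True)

      rng = np.random.default_rng(0)
      d_model, d_k, n_words = 8, 4, 3
      X = rng.normal(size=(n_words, d_model))   # one row per word (X1, X2, X3)

      # Q, K, V are just linear transformations of X ("variants of X").
      W_q = rng.normal(size=(d_model, d_k))
      W_k = rng.normal(size=(d_model, d_k))
      W_v = rng.normal(size=(d_model, d_k))
      Q, K, V = X @ W_q, X @ W_k, X @ W_v

      # Each word weights every word's V by how relevant it is to itself,
      # e.g. "add 88% of V1 and 12% of V2" for the first word.
      weights = softmax(Q @ K.T / np.sqrt(d_k))
      Z = weights @ V                            # output vectors Z1, Z2, Z3
      print(weights[0], Z.shape)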

  37. Transformer Network
     • Make it deep of course!
     [Diagram: several Transformer layers stacked on top of each other]

  38. Details
     • Self-Attention first proposed here: https://arxiv.org/pdf/1706.03762.pdf
     • A great overview of Transformers is here: http://jalammar.github.io/illustrated-transformer/
     • Lots of details obviously not included in this brief lecture.

  39. Who is using Transformers?
     • BERT (Google 2019)
     • BART (Facebook 2019)
     • GPT, GPT-2, GPT-3 (OpenAI 2019-20); GPT-2 has 1.5 billion parameters, GPT-3 has 175 billion
     • XLNet (CMU, Google AI)
     • M2M-100 (Facebook)
     • T5 (Google AI 2020)

  40. BERT
     • BERT is freely available. Lots of tutorials, e.g. https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
     • Encode your text with BERT, and use the [CLS] token as your text representation.
     • Put a classification layer on top, and you’re ready!
     • If just classifying, you can do this, although it needs significant computation power.
     A sketch with the Hugging Face library follows below.
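
  One way to do this with the Hugging Face transformers library (a sketch, not the lecture's prescribed recipe; the model name and output fields are the library's standard ones, and the classifier on top is left as a single Keras Dense layer for illustration):

      import tensorflow as tf
      from transformers import BertTokenizer, TFBertModel

      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
      model = TFBertModel.from_pretrained("bert-base-uncased")

      text = "it was the best of times, it was the worst of times"
      inputs = tokenizer(text, return_tensors="tf")
      outputs = model(inputs)

      # Every BERT input starts with the [CLS] token; its final hidden state is
      # used as the representation of the whole text.
      cls_vector = outputs.last_hidden_state[:, 0, :]   # shape (1, 768) for bert-base
      print(cls_vector.shape)

      # A classification layer on top could be as simple as:
      # classifier = tf.keras.layers.Dense(num_labels, activation="softmax")
      # probs = classifier(cls_vector)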

  41. BERT as text classification: http://jalammar.github.io/illustrated-bert/

  42. GPT-2
     • Online example: https://transformer.huggingface.co/doc/gpt2-large
     • Training this is beyond your computer’s capabilities. Using a trained GPT-2 is not.
     Install and Run instructions:
      pip install tensorflow==1.14.0
      git clone https://github.com/openai/gpt-2.git
      cd gpt-2
      pip install -r requirements.txt
      python download_model.py 345M
      python src/interactive_conditional_samples.py --top_k 40 --length 50 --model_name 345M
