SI425: NLP, Set 14: Neural NLP, Fall 2020: Chambers
Why are these so different?
Last time: Word2Vec learned word embeddings. This time: use word embeddings as input to classifiers.
Recall: Logistic Regression
x = “it was the best of times it was the worst of times”
[Figure: feature counts for words in x (e.g., "it", "was", "the", "best", "worst", "times"), a learned weight vector w for the author class Dickens, and the resulting score f(x) = w ∙ x = 3.87]
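To make the recap concrete, here is a minimal sketch of that score computation in Python; the words, counts, and weights below are made-up stand-ins, not the exact values from the slide.

# Sketch: score one class (e.g., "Dickens") as a dot product of feature counts and weights.
counts  = {"it": 2, "was": 2, "the": 2, "best": 1, "worst": 1, "times": 2}   # features of x
weights = {"it": -0.1, "was": 0.05, "the": 0.0, "best": 0.42, "worst": 0.3, "times": 1.1}  # w for Dickens

score = sum(counts[w] * weights[w] for w in counts)   # f(x) = w . x
print(score)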
Logistic Regression -> Neural Network
- This is a visualization of logistic regression.
- Each arrow is one learned weight for Dickens.
- Input is a word embedding, or an entire sentence embedding!
Logistic Regression -> Neural Network
- A full feed-forward neural network!
- A vector of weights for each author.
- W is thus a matrix of weights; each row is one author’s weight vector.
Feed-Forward Neural Network
- The final prediction layer usually has a softmax normalization
- The yellow boxes are just scores from dot products
softmax(c) = e^(score_c) / Σ_i e^(score_i)
softmax(w_c ∙ x) = e^(w_c ∙ x) / Σ_i e^(w_i ∙ x)
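A minimal sketch of that normalization in Python/NumPy; the three author scores are made-up numbers.

import numpy as np

def softmax(scores):
    # Exponentiate each score and divide by the sum, so outputs are positive and sum to 1.
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

author_scores = np.array([3.87, 1.2, -0.5])   # hypothetical scores for three authors
print(softmax(author_scores))                 # probabilities; the first author dominates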
Logistic Regression -> Neural Network
- Logistic regression is just a feed-forward neural network!
- The softmax produces your probabilities.
[Figure: input x, a dot-product score per class, softmax, and then your probabilities]
Multi-layer neural network
- Add another layer.
- The middle layer represents “hidden” attributes of the texts.
- The numbers optimize themselves to identify these attributes
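A minimal sketch of that extra layer, with made-up sizes (a 4-dimensional input, 3 hidden units, 2 output classes):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden ("hidden attributes" of the text)
W2 = rng.normal(size=(2, 3))   # hidden -> output classes

x = rng.normal(size=4)         # a stand-in input vector
h = np.tanh(W1 @ x)            # hidden layer
probs = softmax(W2 @ h)        # class probabilities
print(probs)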
Input to neural networks
Where does X come from?
- These can be standard features that we’ve discussed
- But often it’s not hand-built features; instead, it’s word embeddings
How do we embed more than one word? (a minimal sketch of both options follows below)
- Additive
- Concatenate (if your input will be fixed size)
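A sketch of the two options, using small made-up word vectors in place of real Word2Vec embeddings:

import numpy as np

# Hypothetical 4-dimensional embeddings for three words.
emb = {
    "it":   np.array([0.1, -0.3, 0.2, 0.0]),
    "was":  np.array([0.4,  0.1, -0.1, 0.3]),
    "best": np.array([-0.2, 0.5, 0.3, 0.1]),
}
words = ["it", "was", "best"]

added  = sum(emb[w] for w in words)                 # additive: one 4-dim vector, works for any length
concat = np.concatenate([emb[w] for w in words])    # concatenation: 12-dim vector, fixed-size input only
print(added.shape, concat.shape)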
Problems with additive
- Function words (“the”) can add noise to the final embedding
- All words receive equal treatment in the final embedding
- Long text leads to a less meaningful final embedding
Solution: Recurrent Neural Networks
- RNN: Recurrent Neural Network
- (a family of neural networks)
Solution: Recurrent Neural Networks
- Key idea: reuse the same Wh and We at every time step
ht = σ(Wh ∙ ht−1 + We ∙ xt + c)
[Figure: Word2Vec embeddings xt feed into hidden states ht; the probability prediction comes from softmax(Wc ∙ ht + c2)]
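A minimal sketch of that recurrence in NumPy, with made-up dimensions (3-dim word embeddings, a 5-dim hidden state):

import numpy as np

rng = np.random.default_rng(0)
Wh = rng.normal(size=(5, 5))   # hidden -> hidden weights (reused at every step)
We = rng.normal(size=(5, 3))   # embedding -> hidden weights (reused at every step)
c  = np.zeros(5)

def sigma(z):                  # the slide's σ; tanh is another common choice
    return 1.0 / (1.0 + np.exp(-z))

sentence = [rng.normal(size=3) for _ in range(6)]   # stand-in word embeddings x1..x6
h = np.zeros(5)                                     # h0
for x_t in sentence:
    h = sigma(Wh @ h + We @ x_t + c)                # ht = σ(Wh·h(t-1) + We·xt + c)
# h is now the final hidden state that a softmax layer would classify.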
RNN pros and cons
Benefits
- Handles any length input.
- Uses the same weights for all time steps.
- Repeated words are treated the same regardless of position.
- Model size the same regardless of length.
- Hidden state weights can learn to drop irrelevant history
Drawbacks
- Information farthest back tends to get lost
Training RNNs
- Find a big text dataset with labeled passages!
[Figure: an example passage (“Laith and I Rlly …”) labeled with positive sentiment; the output classes are Positive and Negative]
Training RNNs
- Compute the loss from the predicted probabilities
Predicted, softmax(Wc ∙ ht + c2): Positive 0.8, Negative 0.2
Gold: Positive 1.0, Negative 0.0
Approximate loss = Gold − softmax(Wc ∙ ht + c2) = 1.0 − 0.8 = 0.2
Update the weights! (back propagation) This is like logistic regression: take derivatives. The loss is passed backward through each layer of computation. Beyond the scope of this class.
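A sketch of the loss numbers above. The “approximate loss” follows the slide’s simplified gold-minus-predicted view; real training typically uses cross-entropy, shown here only for comparison.

import numpy as np

predicted = np.array([0.8, 0.2])    # softmax output: [Positive, Negative]
gold      = np.array([1.0, 0.0])    # gold label is Positive

approx_loss   = gold[0] - predicted[0]            # 1.0 - 0.8 = 0.2, as on the slide
cross_entropy = -np.log(predicted[gold == 1])     # -log(0.8) ≈ 0.22, the usual training loss
print(approx_loss, cross_entropy)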
RNN applications
- Standard classifiers from text to label.
[Figure: an RNN reads a passage and predicts the label “Funny”]
- Language generation
My roommate started _______ ?
RNN applications
- Part of speech tagging!
Takeaways
- You can represent any piece of text as a vector.
- Once you have a vector, you can make predictions from it using simple classifiers (logistic regression).
- RNNs encode a passage into a single hidden state vector.
- RNNs are just a smarter way to combine words than simple addition (they’re weighted addition!).
RNNs Improved
- Recurrent Neural Networks refer to any architecture that reuses its weights over a sequence of input.
- There are many improvements to the basic RNN that we discussed last time.
Bidirectional RNNs
- Run an RNN backwards in addition to forwards.
- Use both final hidden states to then make your prediction.
Bidirectional RNNs
- A simplified diagram, although less precise about how it works.
- Arrows between layers typically mean “fully connected”: each cell is connected to each cell in the next layer with a learned weight.
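A minimal Keras sketch of the idea, using the Bidirectional wrapper around a plain RNN layer; all the sizes here are made up.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(10000, 64))             # made-up vocabulary size and embedding width
model.add(Bidirectional(SimpleRNN(32)))     # forward + backward RNN; final states are concatenated
model.add(Dense(2, activation='softmax'))   # e.g., Positive vs. Negative
model.compile(loss='categorical_crossentropy', optimizer='adam')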
LSTM (a type of RNN)
- LSTM: Long Short-Term Memory
https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
The LSTM replaces a base RNN’s hidden state with a “black box” that contains several hidden state vectors, all connected through different types of weights.
LSTM
- Maintains memory so it can represent words farther away.
- Has a “forget” mechanism so it can learn to ignore certain inputs.
- Bi-directional typically still helps, despite the longer memory.
ELMo
- Stacked LSTMs!
- Output representation is a linear combination of all hidden states below.
AllenNLP 2018 https://arxiv.org/pdf/1802.05365.pdf
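A simplified sketch of that linear combination: given one word position’s hidden state from each stacked layer, the output is a weighted sum. The layer states and mixing weights below are made up (real ELMo also learns a scale factor and softmax-normalizes the mixing weights).

import numpy as np

# Hidden states for one word position from 3 stacked LSTM layers (made-up values).
layer_states = [np.array([0.2, -0.1, 0.5]),
                np.array([0.7,  0.3, -0.2]),
                np.array([-0.4, 0.6, 0.1])]

s = np.array([0.2, 0.5, 0.3])    # learned mixing weights, one per layer (sum to 1 here)
elmo_vector = sum(w * h for w, h in zip(s, layer_states))
print(elmo_vector)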
Can I use an LSTM?
- Sure! (TensorFlow and Keras)
- Yes, there is a learning curve to understanding most neural net code.
# The slide's Keras example, reformatted. Variables such as vocab_in_size, embedding_dim,
# len_input_train, units, nb_labels, BATCH_SIZE, and the training arrays are defined elsewhere.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model_lstm = Sequential()
model_lstm.add(Embedding(vocab_in_size, embedding_dim, input_length=len_input_train))  # word IDs -> embeddings
model_lstm.add(LSTM(units))                                     # final hidden state encodes the passage
model_lstm.add(Dense(nb_labels, activation='softmax'))          # one probability per label
model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.summary()
history_lstm = model_lstm.fit(input_data_train, intent_data_label_cat_train,
                              epochs=10, batch_size=BATCH_SIZE)
Text: today’s common input
- RNNs like LSTMs focus on representing language in a high-dimensional space … a vector (embedding)
- If we can convert any text to a vector, then we can more easily focus on whatever our goal is:
- Sentiment prediction
- Language generation (word prediction)
- Political stance detection
- Author detection
- Information extraction
- …
Text: today’s common input
[Figure: a text is encoded to a vector and then classified as FUNNY]
Text: today’s common input
- 1. Word2Vec word representations
- 2. Embedding addition/concatenation
- 3. RNNs
- 4. Transformers
Better text representations, more effective embeddings for NLP tasks.
The latest craze
The Transformer
- Problem with RNN: loses context with text distance
- Solution: combine all words at once at each position
Google Research 2017: https://arxiv.org/pdf/1706.03762.pdf
Transformer
- Instead of sequentially adding the surrounding context, add everything at once but with specialized word weights.
- The word “it” will add information to itself from other relevant words in the sentence, but weighted by importance to itself.
Self-Attention
These 3 are just transformations from X1; think of them as variants of X1.
Add 88% of V1 and 12% of V2!
Transformer
- Self-Attention is the mechanism to convert an input vector X into an output vector Z based on all other input vectors.
Step 1: Transform X1 into Z1 by adding up all the other X inputs.
Step 2: Transform Z1 into R1 with a normal feed-forward layer.
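A minimal single-head self-attention sketch in NumPy (scaled dot-product form). The projection matrices Wq, Wk, Wv and all sizes are hypothetical; they play the role of the three transformed variants of each X.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each row of X is one word's input vector X1, X2, ...
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # query / key / value variants of each X
    scores = (Q @ K.T) / np.sqrt(K.shape[1])    # how relevant each word is to each other word
    weights = softmax(scores)                   # e.g., row 1 might be 88% of V1 and 12% of V2
    return weights @ V                          # Z: each row is a weighted sum of all the V's

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 words, 8-dim inputs (made-up sizes)
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
Z = self_attention(X, Wq, Wk, Wv)               # one output vector Z per input word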
Transformer Network
- Make it deep of course!
[Figure: several Transformer blocks stacked on top of each other]
Details
- Self-Attention first proposed here:
- https://arxiv.org/pdf/1706.03762.pdf
- A great overview of Transformers is here:
- http://jalammar.github.io/illustrated-transformer/
- Lots of details obviously not included in this brief lecture.
Who is using Transformers?
- BERT (Google 2019)
- BART (Facebook 2019)
- GPT, GPT-2, GPT-3 (OpenAI 2019-20)
- XLNet (CMU, Google AI)
- M2M-100 (Facebook)
- T5 (Google AI 2020)
(GPT-2 has 1.5 billion parameters; GPT-3 has 175 billion parameters.)
BERT
- BERT is freely available. Lots of tutorials.
- https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
- Encode your text with BERT, and use the [CLS] token as your text representation.
- Put a classification layer on top, and you’re ready!
- If you’re just classifying, you can do this, although it needs significant computation power.
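A minimal sketch of pulling out the [CLS] vector, assuming a recent version of the HuggingFace transformers library and PyTorch:

from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("it was the best of times", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_vector = outputs.last_hidden_state[:, 0, :]   # the [CLS] position: one vector for the whole text
# cls_vector (shape [1, 768]) can now feed a small classifier, e.g., logistic regression.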
BERT as text classification
http://jalammar.github.io/illustrated-bert/
GPT-2
- Online example
- https://transformer.huggingface.co/doc/gpt2-large
- Training this is beyond your computer’s capabilities. Using a trained GPT-2 is not.
Install and run instructions:
pip install tensorflow==1.14.0
git clone https://github.com/openai/gpt-2.git
cd gpt-2
pip install -r requirements.txt
python download_model.py 345M
python src/interactive_conditional_samples.py --top_k 40 --length 50 --model_name 345M
GPT-2 examples

Model prompt >>> The Naval Academy would like
SAMPLE 1: to thank all of their students, staff and supporters for their support, especially those involved in organizing our campaign to secure this contract. This contract is a crucial piece in supporting a cadet education program and this contract reflects the commitment of the Academy to

Model prompt >>> The Naval Academy would like
SAMPLE 1: to tell you that their newest mascot, the red dot will not be permitted to wear a blue shirt at the football games. There will be no football helmets, helmets would be too conspicuous to wear. We're sorry, but this will not

Model prompt >>> The Naval Academy would like
SAMPLE 1: to thank the many people that contributed their time, energy and expertise to bringing this book to life. For more information, visit www.navy.edu/jailhousefiling/sailhouse_hockey.htm
Moving Forward
- Many algorithms now use BERT (or others) as the first step to encode their text.
- Optionally, if you have the GPUs, you can fine-tune these models to your task.
- You then just use that encoded vector as input to a classifier, like those used in this class!
- Problem: the vector is opaque. Why is the vector what it is?
- Many NLP tasks are now more like engineering: turning knobs and hoping the output gets better. The link to linguistics/science is partially lost.