SLIDE 1

SI425 : NLP

Set 14 Neural NLP

Fall 2020 : Chambers

SLIDE 2

Why are these so different?

  • Last time: Word2Vec learned word embeddings
  • This time: use word embeddings as input to classifiers

SLIDE 3

Recall: Logistic Regression

x = “it was the best of times it was the worst of times”

[Figure: bag-of-words feature counts f(x) over a vocabulary (it, was, the, best, times, he, she, pizza, worst, …), a learned weight vector w for the class “Dickens”, and the resulting score (3.87).]

SLIDE 4

Recall: Logistic Regression

x = “it was the best of times it was the worst of times”

[Figure: the same example, now showing the full weight vector w alongside the feature counts f(x); the dot product again gives the Dickens score (3.87).]
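
As a concrete sketch of what slides 3 and 4 depict, the snippet below scores the example sentence for one class with a dot product between word counts f(x) and a weight vector w. The weight values are made up for illustration; only the word counts come from the sentence itself.

import numpy as np

# bag-of-words counts f(x) for "it was the best of times it was the worst of times"
vocab = ["it", "was", "the", "best", "of", "times", "worst"]
f_x   = np.array([2, 2, 2, 1, 2, 2, 1], dtype=float)

# hypothetical learned weights w for the class "Dickens" (illustrative values only)
w_dickens = np.array([0.10, 0.05, 0.00, 0.42, 0.12, 1.50, 0.30])

score = w_dickens @ f_x                    # w . f(x), one real-valued score
prob  = 1.0 / (1.0 + np.exp(-score))       # the logistic (sigmoid) turns the score into a probability
print(round(score, 2), round(prob, 2))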

SLIDE 5

Logistic Regression -> Neural Network

  • This is a visualization of logistic regression.
  • Each arrow is one learned weight for Dickens.
  • Input is a word embedding, or an entire sentence embedding!

5

SLIDE 6

Logistic Regression -> Neural Network

  • A full feed-forward neural network!
  • A vector of weights for each author.
  • W is thus a matrix of weights; each row is an author’s weight vector.

6
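
A sketch of the idea on this slide: stack one weight vector per author into a matrix W, and a single matrix-vector product gives every author's score at once. The author names and all numbers here are invented for illustration.

import numpy as np

authors = ["Dickens", "Austen", "Twain"]

# hypothetical weight matrix W: one row of learned weights per author
W = np.array([[0.10, 0.05, 0.00, 0.42, 0.12, 1.50, 0.30],
              [0.20, 0.10, 0.05, 0.90, 0.00, 0.40, 0.10],
              [0.05, 0.00, 0.10, 0.30, 0.20, 0.60, 0.90]])

f_x = np.array([2, 2, 2, 1, 2, 2, 1], dtype=float)   # same feature vector as before

scores = W @ f_x              # one dot-product score per author
print(dict(zip(authors, scores.round(2))))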

SLIDE 7

Feed-Forward Neural Network

  • The final prediction layer usually has a softmax normalization
  • The yellow boxes are just scores from dot products

7

softmax(c) = e^(score_c) / Σ_i e^(score_i)        softmax(w_c ∙ x) = e^(w_c ∙ x) / Σ_i e^(w_i ∙ x)
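
The softmax in the formula above takes only a few lines; here is a minimal version applied to hypothetical per-author scores (e.g., the ones from the previous sketch).

import numpy as np

def softmax(scores):
    # e^(score_c) / sum_i e^(score_i); subtract the max for numerical stability
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([4.26, 2.50, 3.10])      # hypothetical per-author scores
print(softmax(scores).round(3))            # probabilities that sum to 1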

SLIDE 8

Logistic Regression -> Neural Network

  • Logistic regression is just a feed-forward neural network!
  • The softmax produces your probabilities.

8

[Figure: input x → scores → softmax → your probabilities!]

SLIDE 9

Multi-layer neural network

  • Add another layer.
  • The middle layer represents “hidden” attributes of the texts.
  • The numbers optimize themselves to identify these attributes

9

[Figure: input x → hidden layer → softmax output]
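
A minimal sketch of this forward pass: the input x goes through a first weight matrix and a nonlinearity to produce the hidden "attributes", and a second matrix plus softmax produces the class probabilities. All shapes and values below are arbitrary stand-ins.

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=7)            # input features (or an embedding)
W1 = rng.normal(size=(5, 7))       # input -> hidden layer of 5 "attributes"
W2 = rng.normal(size=(3, 5))       # hidden -> 3 class scores

h = sigmoid(W1 @ x)                # hidden layer
p = softmax(W2 @ h)                # class probabilities
print(p.round(3), p.sum())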

SLIDE 10

Input to neural networks

Where does X come from?

10

  • These can be standard features that we’ve discussed
  • But often it’s not features; instead, it’s word embeddings

SLIDE 11

How do we embed more than one word?

  • Additive
  • Concatenate (if your input will be fixed size)

11
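
A rough sketch of the two options above, using made-up word vectors: additive (sum or average the word embeddings) versus concatenation (glue them end to end, which only works when the input length is fixed).

import numpy as np

# hypothetical 4-dimensional embeddings for a 3-word input
emb = {"it":   np.array([0.1, 0.3, -0.2, 0.5]),
       "was":  np.array([0.0, 0.2,  0.1, 0.4]),
       "best": np.array([0.9, -0.1, 0.7, 0.2])}
words = ["it", "was", "best"]

additive = np.sum([emb[w] for w in words], axis=0)    # still 4-dim, works for any input length
concat   = np.concatenate([emb[w] for w in words])    # 12-dim, fixed-length inputs only
print(additive.shape, concat.shape)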

SLIDE 12

Problems with additive

  • Function words (“the”) can add noise to the final embedding
  • All words receive equal treatment in the final embedding
  • Long text leads to a less meaningful final embedding

12

SLIDE 13

Solution: Recurrent Neural Networks

13

  • RNN: Recurrent Neural Network
  • (a family of neural networks)

SLIDE 14

Solution: Recurrent Neural Networks

14

  • Key idea: reuse the same W_h and W_e at every time step

h_t = σ(W_h h_(t−1) + W_e x_t + c)

[Figure: Word2Vec embeddings x_t feed the recurrence; the hidden states h_t feed softmax(W_c ∙ h_t + c_2) for the probability prediction.]
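
A bare-bones sketch of the recurrence h_t = σ(W_h h_(t−1) + W_e x_t + c): the same W_h and W_e are reused at every time step, and the final hidden state can feed a softmax prediction. The matrices below are random stand-ins for learned weights.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
embed_dim, hidden_dim = 4, 3
W_e = rng.normal(size=(hidden_dim, embed_dim))   # input weights, reused every step
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights, reused every step
c   = np.zeros(hidden_dim)

xs = rng.normal(size=(5, embed_dim))             # 5 word embeddings (e.g., Word2Vec vectors)
h  = np.zeros(hidden_dim)                        # initial hidden state
for x_t in xs:
    h = sigmoid(W_h @ h + W_e @ x_t + c)         # h_t = σ(W_h h_(t-1) + W_e x_t + c)
print(h)                                         # final hidden state summarizes the sequence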

SLIDE 15

RNN pros and cons

Benefits

  • Handles any length input.
  • Uses the same weights for all time steps.
  • Repeated words treated the same regardless of position.
  • Model size the same regardless of length.
  • Hidden state weights can learn to drop irrelevant history

Drawbacks

  • Information farthest back tends to get lost

15

SLIDE 16

Training RNNs

  • Find a big text dataset with labeled passages!

16

[Figure: an example passage (“Laith and I Rlly …”) labeled with positive sentiment; the output classes are Positive and Negative.]

SLIDE 17

Training RNNs

  • Compute the loss from the predicted probabilities

17

Predicted: softmax(W_c ∙ h_t + c_2) = (Positive 0.8, Negative 0.2)
Gold: (Positive 1.0, Negative 0.0)
Approximate loss = Gold − softmax(W_c ∙ h_t + c_2) = 0.2

SLIDE 18

Training RNNs

  • Compute the loss from the predicted probabilities

18

Predicted: softmax(W_c ∙ h_t + c_2) = (Positive 0.8, Negative 0.2)
Gold: (Positive 1.0, Negative 0.0)
Approximate loss = Gold − softmax(W_c ∙ h_t + c_2) = 0.2

Update the weights (backpropagation)! This is like logistic regression: take derivatives. The loss is passed backward through each layer of computation. Beyond the scope of this class.
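
A minimal sketch of the loss computation on these two slides, using the slide's numbers: the gap between the gold distribution and the predicted probabilities. In practice, libraries minimize cross-entropy and compute the gradients for you.

import numpy as np

gold = np.array([1.0, 0.0])        # gold label: Positive
pred = np.array([0.8, 0.2])        # softmax(W_c . h_t + c_2) from the model

approx_loss = gold[0] - pred[0]                 # the slide's "approximate loss" on the gold class = 0.2
cross_entropy = -np.sum(gold * np.log(pred))    # what most libraries actually minimize
print(approx_loss, round(cross_entropy, 3))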

SLIDE 19

RNN applications

  • Standard classifiers from text to label.

19

[Figure: an example passage classified with the label “Funny”.]

  • Language generation

My roommate started _______ ?

SLIDE 20

RNN applications

  • Part of speech tagging!

20
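
For tagging, the RNN sketch from slide 14 changes only in where the softmax is applied: instead of one prediction from the final hidden state, you make a prediction from the hidden state at every time step, one tag per word. A minimal illustration with random stand-in weights and a made-up tag set:

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
tags = ["NOUN", "VERB", "DET"]
W_e, W_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
W_c = rng.normal(size=(len(tags), 3))

h = np.zeros(3)
for x_t in rng.normal(size=(4, 4)):                  # 4 word embeddings
    h = sigmoid(W_h @ h + W_e @ x_t)                 # same recurrence as before
    print(tags[int(np.argmax(softmax(W_c @ h)))])    # one tag predicted per word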

SLIDE 21

Takeaways

  • You can represent any piece of text as a vector.
  • Once you have a vector, you can make predictions from it using simple classifiers (logistic regression).
  • RNNs encode a passage into a single hidden state vector.
  • RNNs are just a smarter way to combine words than simple addition (they’re weighted addition!).

21

SLIDE 22

22

SLIDE 23

RNNs Improved

  • Recurrent Neural Networks refer to any architecture that reuses its weights over a sequence of inputs.
  • There are many improvements to the basic RNN that we discussed last time.

23

SLIDE 24

Bidirectional RNNs

  • Run an RNN backwards in addition to forwards.
  • Use both final hidden states to then make your prediction.

24

SLIDE 25

Bidirectional RNNs

  • A simplified diagram, although less precise on how it works.
  • Arrows between layers typically mean “fully connected”. Each cell is connected to each cell in the next layer with a learned weight.

25
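
In Keras, a bidirectional RNN is typically just a wrapper around a recurrent layer. A rough sketch, assuming tf.keras is available; the layer sizes and the two-class output are arbitrary choices for illustration:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Bidirectional, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=64))   # vocabulary and embedding sizes are arbitrary
model.add(Bidirectional(SimpleRNN(32)))                # forward and backward RNNs, final states combined
model.add(Dense(2, activation='softmax'))              # e.g., Positive vs. Negative
model.compile(loss='categorical_crossentropy', optimizer='adam')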

SLIDE 26

LSTM (a type of RNN)

  • LSTM: Long Short-Term Memory

26

https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714

The LSTM replaces a base RNN’s hidden state with a “black box” that contains several hidden state vectors, all connected through different types of weights.

SLIDE 27

LSTM

  • Maintains memory so it can represent words farther away.
  • Has a “forget” mechanism so it can learn to ignore certain inputs.
  • Bi-directional typically still helps, despite longer memory.

27

SLIDE 28

ELMo

  • Stacked LSTMs!
  • Output representation is a linear combination of all hidden states below.

28

AllenNLP 2018 https://arxiv.org/pdf/1802.05365.pdf

SLIDE 29

Can I use an LSTM?

  • Sure! (TensorFlow and Keras)
  • Yes, there is a learning curve to understanding most neural net code.

29

# Assumes tf.keras; vocab_in_size, embedding_dim, len_input_train, units, nb_labels,
# BATCH_SIZE, and the training arrays are defined earlier in the tutorial.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model_lstm = Sequential()
model_lstm.add(Embedding(vocab_in_size, embedding_dim, input_length=len_input_train))
model_lstm.add(LSTM(units))                                 # the LSTM layer
model_lstm.add(Dense(nb_labels, activation='softmax'))      # one probability per label
model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.summary()
history_lstm = model_lstm.fit(input_data_train, intent_data_label_cat_train,
                              epochs=10, batch_size=BATCH_SIZE)

SLIDE 30

Text: today’s common input

  • RNNs like LSTMs focus on representing language in a high-dimensional space … a vector (embedding)
  • If we can convert any text to a vector, then we can more easily focus on whatever our goal is:

  • Sentiment prediction
  • Language generation (word prediction)
  • Political stance detection
  • Author detection
  • Information extraction

30

SLIDE 31

Text: today’s common input

31


SLIDE 32

Text: today’s common input

  • 1. Word2Vec word representations
  • 2. Embedding addition/concatenation
  • 3. RNNs
  • 4. Transformers

32

Better text representations, more effective embeddings for NLP tasks.

The latest craze

SLIDE 33

The Transformer

  • Problem with RNN: loses context with text distance
  • Solution: combine all words at once at each position

33

Google Research 2017: https://arxiv.org/pdf/1706.03762.pdf

SLIDE 34

Transformer

  • Instead of sequentially adding the surrounding context, add everything at once but with specialized word weights

34

  • The word “it” will add information to itself from other relevant words in the sentence, but weighted by importance to itself.

SLIDE 35

Self-Attention

35

These 3 (the query, key, and value vectors) are just transformations from X1; think of them as variants of X1.

Add 88% of V1 and 12% of V2!

SLIDE 36

Transformer

  • Self-Attention is the mechanism to convert an input vector X into an output vector Z based on all other input vectors.

36

Step 1: Transform X1 into Z1 by adding up all the other X inputs.
Step 2: Transform Z1 into R1 with a normal feed-forward layer.
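
A stripped-down sketch of this self-attention step (single head): each input vector is mapped to query, key, and value variants, attention weights come from a softmax over query-key dot products, and each output Z_i is a weighted sum of the values. All weight matrices below are random stand-ins for learned ones.

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d = 4
X = rng.normal(size=(3, d))           # 3 input word vectors X1, X2, X3

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv      # query, key, value "variants" of each X_i

attn = softmax(Q @ K.T / np.sqrt(d))  # how much each word attends to every word
Z = attn @ V                          # each Z_i is a weighted sum of the value vectors
print(attn.round(2))
print(Z.shape)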

SLIDE 37

Transformer Network

  • Make it deep of course!

37

[Figure: stacked layers: Transformer → Transformer → Transformer]

SLIDE 38

Details

  • Self-Attention first proposed here:
  • https://arxiv.org/pdf/1706.03762.pdf
  • A great overview of Transformers is here:
  • http://jalammar.github.io/illustrated-transformer/
  • Lots of details obviously not included in this brief lecture.

38

SLIDE 39

Who is using Transformers?

  • BERT (Google 2019)
  • BART (Facebook 2019)
  • GPT, GPT-2, GPT-3 (OpenAI 2019-20)
  • XLNet (CMU, Google AI)
  • M2M-100 (Facebook)
  • T5 (Google AI 2020)

39

GPT-2: 1.5 billion parameters; GPT-3: 175 billion parameters

SLIDE 40

BERT

  • BERT is freely available. Lots of tutorials.
  • https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
  • Encode your text with BERT, and use the [CLS] token as your text representation.
  • Put a classification layer on top, and you’re ready!
  • If just classifying, you can do this, although it needs significant computation power.

40
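
A rough sketch of the recipe on this slide, assuming a recent version of the Hugging Face transformers library and PyTorch; the model name and the classifier on top are one reasonable choice, not the slide's specific setup.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode(text):
    # the [CLS] token's final hidden state serves as the whole text's representation
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[0, 0]      # vector at the [CLS] position

vec = encode("it was the best of times")
print(vec.shape)
# A classification layer (e.g., logistic regression over these vectors) goes on top.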

SLIDE 41

BERT as text classification

41

http://jalammar.github.io/illustrated-bert/

SLIDE 42

GPT-2

  • Online example
  • https://transformer.huggingface.co/doc/gpt2-large
  • Training this is beyond your computer’s capabilities. Using a trained GPT-2 is not.

42

pip install tensorflow==1.14.0
git clone https://github.com/openai/gpt-2.git
cd gpt-2
pip install -r requirements.txt
python download_model.py 345M
python src/interactive_conditional_samples.py --top_k 40 --length 50 --model_name 345M

Install and Run instructions

SLIDE 43

GPT-2 examples

43

Model prompt >>> The Naval Academy would like
======================================== SAMPLE 1 ========================================
to thank all of their students, staff and supporters for their support, especially those involved in organizing our campaign to secure this contract. This contract is a crucial piece in supporting a cadet education program and this contract reflects the commitment of the Academy to
================================================================================

Model prompt >>> The Naval Academy would like
======================================== SAMPLE 1 ========================================
_______________ to tell you that their newest mascot, the red dot will not be permitted to wear a blue shirt at the football games. There will be no football helmets, helmets would be too conspicuous to wear. We're sorry, but this will not
================================================================================

Model prompt >>> The Naval Academy would like
======================================== SAMPLE 1 ========================================
to thank the many people that contributed their time, energy and expertise to bringing this book to life. For more information, visit www.navy.edu/jailhousefiling/sailhouse_hockey.htm
================================================================================

SLIDE 44

Moving Forward

  • Many algorithms now use BERT (or others) as the first step to encode their text.
  • Optionally, if you have the GPUs, you can fine-tune these models to your task.
  • You then just use that encoded vector as input to a classifier, like those used in this class!
  • Problem: the vector is opaque. Why is the vector what it is?
  • Many NLP tasks are now like engineering: turning knobs and hoping the output gets better. The link to linguistics/science is partially lost.

44