SLIDE 1

SI425 : NLP

Set 14 Neural NLP

Fall 2020 : Chambers

SLIDE 2

Why are these so different?

  • Last time: Word2Vec learned word embeddings
  • This time: use word embeddings as input to classifiers

SLIDE 3

Recall: Logistic Regression

x = “it was the best of times it was the worst of times”

[Figure: bag-of-words feature counts f(x) over a vocabulary (it, was, the, best, times, he, she, pizza, worst, …), a learned weight vector w for the class “Dickens”, and the resulting score (3.87).]

SLIDE 4

Recall: Logistic Regression

x = “it was the best of times it was the worst of times”

[Figure: the same example, now showing the full weight vector w alongside the feature counts f(x); the dot product again gives the Dickens score (3.87).]
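
As a concrete sketch of what slides 3 and 4 depict, the snippet below scores the example sentence for one class with a dot product between word counts f(x) and a weight vector w. The weight values are made up for illustration; only the word counts come from the sentence itself.

import numpy as np

# bag-of-words counts f(x) for "it was the best of times it was the worst of times"
vocab = ["it", "was", "the", "best", "of", "times", "worst"]
f_x   = np.array([2, 2, 2, 1, 2, 2, 1], dtype=float)

# hypothetical learned weights w for the class "Dickens" (illustrative values only)
w_dickens = np.array([0.10, 0.05, 0.00, 0.42, 0.12, 1.50, 0.30])

score = w_dickens @ f_x                    # w . f(x), one real-valued score
prob  = 1.0 / (1.0 + np.exp(-score))       # the logistic (sigmoid) turns the score into a probability
print(round(score, 2), round(prob, 2))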

SLIDE 5

Logistic Regression -> Neural Network

  • This is a visualization of logistic regression.
  • Each arrow is one learned weight for Dickens.
  • Input is a word embedding, or an entire sentence embedding!

5

SLIDE 6

Logistic Regression -> Neural Network

  • A full feed-forward neural network!
  • A vector of weights for each author.
  • W is thus a matrix of weights; each row is an author’s weight vector.

6
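
A sketch of the idea on this slide: stack one weight vector per author into a matrix W, and a single matrix-vector product gives every author's score at once. The author names and all numbers here are invented for illustration.

import numpy as np

authors = ["Dickens", "Austen", "Twain"]

# hypothetical weight matrix W: one row of learned weights per author
W = np.array([[0.10, 0.05, 0.00, 0.42, 0.12, 1.50, 0.30],
              [0.20, 0.10, 0.05, 0.90, 0.00, 0.40, 0.10],
              [0.05, 0.00, 0.10, 0.30, 0.20, 0.60, 0.90]])

f_x = np.array([2, 2, 2, 1, 2, 2, 1], dtype=float)   # same feature vector as before

scores = W @ f_x              # one dot-product score per author
print(dict(zip(authors, scores.round(2))))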

SLIDE 7

Feed-Forward Neural Network

  • The final prediction layer usually has a softmax normalization
  • The yellow boxes are just scores from dot products

7

softmax(c) = e^(score_c) / Σ_i e^(score_i)        softmax(w_c ∙ x) = e^(w_c ∙ x) / Σ_i e^(w_i ∙ x)
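
The softmax in the formula above takes only a few lines; here is a minimal version applied to hypothetical per-author scores (e.g., the ones from the previous sketch).

import numpy as np

def softmax(scores):
    # e^(score_c) / sum_i e^(score_i); subtract the max for numerical stability
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([4.26, 2.50, 3.10])      # hypothetical per-author scores
print(softmax(scores).round(3))            # probabilities that sum to 1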

SLIDE 8

Logistic Regression -> Neural Network

  • Logistic regression is just a feed-forward neural network!
  • The softmax produces your probabilities.

8

[Figure: input x → scores → softmax → your probabilities!]

SLIDE 9

Multi-layer neural network

  • Add another layer.
  • The middle layer represents “hidden” attributes of the texts.
  • The numbers optimize themselves to identify these attributes

9

[Figure: input x → hidden layer → softmax output]
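
A minimal sketch of this forward pass: the input x goes through a first weight matrix and a nonlinearity to produce the hidden "attributes", and a second matrix plus softmax produces the class probabilities. All shapes and values below are arbitrary stand-ins.

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=7)            # input features (or an embedding)
W1 = rng.normal(size=(5, 7))       # input -> hidden layer of 5 "attributes"
W2 = rng.normal(size=(3, 5))       # hidden -> 3 class scores

h = sigmoid(W1 @ x)                # hidden layer
p = softmax(W2 @ h)                # class probabilities
print(p.round(3), p.sum())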

SLIDE 10

Input to neural networks

Where does X come from?

10

  • These can be standard features that we’ve discussed
  • But often it’s not features; instead, it’s word embeddings

SLIDE 11

How do we embed more than one word?

  • Additive
  • Concatenate (if your input will be fixed size)

11
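
A rough sketch of the two options above, using made-up word vectors: additive (sum or average the word embeddings) versus concatenation (glue them end to end, which only works when the input length is fixed).

import numpy as np

# hypothetical 4-dimensional embeddings for a 3-word input
emb = {"it":   np.array([0.1, 0.3, -0.2, 0.5]),
       "was":  np.array([0.0, 0.2,  0.1, 0.4]),
       "best": np.array([0.9, -0.1, 0.7, 0.2])}
words = ["it", "was", "best"]

additive = np.sum([emb[w] for w in words], axis=0)    # still 4-dim, works for any input length
concat   = np.concatenate([emb[w] for w in words])    # 12-dim, fixed-length inputs only
print(additive.shape, concat.shape)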

SLIDE 12

Problems with additive

  • Function words (“the”) can add noise to the final embedding
  • All words receive equal treatment in the final embedding
  • Long text leads to a less meaningful final embedding

12

SLIDE 13

Solution: Recurrent Neural Networks

13

  • RNN: Recurrent Neural Network
  • (a family of neural networks)

SLIDE 14

Solution: Recurrent Neural Networks

14

  • Key idea: reuse the same W_h and W_e at every time step

h_t = σ(W_h h_(t−1) + W_e x_t + c)

[Figure: Word2Vec embeddings x_t feed the recurrence; the hidden states h_t feed softmax(W_c ∙ h_t + c_2) for the probability prediction.]
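
A bare-bones sketch of the recurrence h_t = σ(W_h h_(t−1) + W_e x_t + c): the same W_h and W_e are reused at every time step, and the final hidden state can feed a softmax prediction. The matrices below are random stand-ins for learned weights.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
embed_dim, hidden_dim = 4, 3
W_e = rng.normal(size=(hidden_dim, embed_dim))   # input weights, reused every step
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights, reused every step
c   = np.zeros(hidden_dim)

xs = rng.normal(size=(5, embed_dim))             # 5 word embeddings (e.g., Word2Vec vectors)
h  = np.zeros(hidden_dim)                        # initial hidden state
for x_t in xs:
    h = sigmoid(W_h @ h + W_e @ x_t + c)         # h_t = σ(W_h h_(t-1) + W_e x_t + c)
print(h)                                         # final hidden state summarizes the sequence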

SLIDE 15

RNN pros and cons

Benefits

  • Handles any length input.
  • Uses the same weights for all time steps.
  • Repeated words treated the same regardless of position.
  • Model size the same regardless of length.
  • Hidden state weights can learn to drop irrelevant history

Drawbacks

  • Information farthest back tends to get lost

15

SLIDE 16

Training RNNs

  • Find a big text dataset with labeled passages!

16

[Figure: an example passage (“Laith and I Rlly …”) labeled with positive sentiment; the output classes are Positive and Negative.]

SLIDE 17

Training RNNs

  • Compute the loss from the predicted probabilities

17

Predicted: softmax(W_c ∙ h_t + c_2) = (Positive 0.8, Negative 0.2)
Gold: (Positive 1.0, Negative 0.0)
Approximate loss = Gold − softmax(W_c ∙ h_t + c_2) = 0.2

SLIDE 18

Training RNNs

  • Compute the loss from the predicted probabilities

18

Predicted: softmax(W_c ∙ h_t + c_2) = (Positive 0.8, Negative 0.2)
Gold: (Positive 1.0, Negative 0.0)
Approximate loss = Gold − softmax(W_c ∙ h_t + c_2) = 0.2

Update the weights (backpropagation)! This is like logistic regression: take derivatives. The loss is passed backward through each layer of computation. Beyond the scope of this class.
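
A minimal sketch of the loss computation on these two slides, using the slide's numbers: the gap between the gold distribution and the predicted probabilities. In practice, libraries minimize cross-entropy and compute the gradients for you.

import numpy as np

gold = np.array([1.0, 0.0])        # gold label: Positive
pred = np.array([0.8, 0.2])        # softmax(W_c . h_t + c_2) from the model

approx_loss = gold[0] - pred[0]                 # the slide's "approximate loss" on the gold class = 0.2
cross_entropy = -np.sum(gold * np.log(pred))    # what most libraries actually minimize
print(approx_loss, round(cross_entropy, 3))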

SLIDE 19

RNN applications

  • Standard classifiers from text to label.

19

[Figure: an example passage classified with the label “Funny”.]

  • Language generation

My roommate started _______ ?

SLIDE 20

RNN applications

  • Part of speech tagging!

20
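
For tagging, the RNN sketch from slide 14 changes only in where the softmax is applied: instead of one prediction from the final hidden state, you make a prediction from the hidden state at every time step, one tag per word. A minimal illustration with random stand-in weights and a made-up tag set:

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
tags = ["NOUN", "VERB", "DET"]
W_e, W_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
W_c = rng.normal(size=(len(tags), 3))

h = np.zeros(3)
for x_t in rng.normal(size=(4, 4)):                  # 4 word embeddings
    h = sigmoid(W_h @ h + W_e @ x_t)                 # same recurrence as before
    print(tags[int(np.argmax(softmax(W_c @ h)))])    # one tag predicted per word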

SLIDE 21

Takeaways

  • You can represent any piece of text as a vector.
  • Once you have a vector, you can make predictions from it using simple classifiers (logistic regression).
  • RNNs encode a passage into a single hidden state vector.
  • RNNs are just a smarter way to combine words than simple addition (they’re weighted addition!).

21

SLIDE 22

22

SLIDE 23

RNNs Improved

  • Recurrent Neural Networks refer to any architecture that reuses its weights over a sequence of inputs.
  • There are many improvements to the basic RNN that we discussed last time.

23

SLIDE 24

Bidirectional RNNs

  • Run an RNN backwards in addition to forwards.
  • Use both final hidden states to then make your prediction.

24

SLIDE 25

Bidirectional RNNs

  • A simplified diagram, although less precise on how it works.
  • Arrows between layers typically mean “fully connected”. Each cell is connected to each cell in the next layer with a learned weight.

25
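
In Keras, a bidirectional RNN is typically just a wrapper around a recurrent layer. A rough sketch, assuming tf.keras is available; the layer sizes and the two-class output are arbitrary choices for illustration:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Bidirectional, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=64))   # vocabulary and embedding sizes are arbitrary
model.add(Bidirectional(SimpleRNN(32)))                # forward and backward RNNs, final states combined
model.add(Dense(2, activation='softmax'))              # e.g., Positive vs. Negative
model.compile(loss='categorical_crossentropy', optimizer='adam')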

SLIDE 26

LSTM (a type of RNN)

  • LSTM: Long Short-Term Memory

26

https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714

The LSTM replaces a base RNN’s hidden state with a “black box” that contains several hidden state vectors, all connected through different types of weights.

SLIDE 27

LSTM

  • Maintains memory so it can represent words farther away.
  • Has a “forget” mechanism so it can learn to ignore certain inputs.
  • Bi-directional typically still helps, despite longer memory.

27

SLIDE 28

ELMo

  • Stacked LSTMs!
  • Output representation is a linear combination of all hidden states below.

28

AllenNLP 2018 https://arxiv.org/pdf/1802.05365.pdf

SLIDE 29

Can I use an LSTM?

  • Sure! (TensorFlow and Keras)
  • Yes, there is a learning curve to understanding most neural net code.

29

# Assumes tf.keras; vocab_in_size, embedding_dim, len_input_train, units, nb_labels,
# BATCH_SIZE, and the training arrays are defined earlier in the tutorial.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model_lstm = Sequential()
model_lstm.add(Embedding(vocab_in_size, embedding_dim, input_length=len_input_train))
model_lstm.add(LSTM(units))                                 # the LSTM layer
model_lstm.add(Dense(nb_labels, activation='softmax'))      # one probability per label
model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.summary()
history_lstm = model_lstm.fit(input_data_train, intent_data_label_cat_train,
                              epochs=10, batch_size=BATCH_SIZE)

SLIDE 30

Text: today’s common input

  • RNNs like LSTMs focus on representing language in a high-dimensional space … a vector (embedding)
  • If we can convert any text to a vector, then we can more easily focus on whatever our goal is:

  • Sentiment prediction
  • Language generation (word prediction)
  • Political stance detection
  • Author detection
  • Information extraction

30

SLIDE 31

Text: today’s common input

31


SLIDE 32

Text: today’s common input

  • 1. Word2Vec word representations
  • 2. Embedding addition/concatenation
  • 3. RNNs
  • 4. Transformers

32

Better text representations, more effective embeddings for NLP tasks.

The latest craze

SLIDE 33

The Transformer

  • Problem with RNN: loses context with text distance
  • Solution: combine all words at once at each position

33

Google Research 2017: https://arxiv.org/pdf/1706.03762.pdf

SLIDE 34

Transformer

  • Instead of sequentially adding the surrounding context, add everything at once but with specialized word weights

34

  • The word “it” will add information to itself from other relevant words in the sentence, but weighted by importance to itself.

SLIDE 35

Self-Attention

35

These 3 (the query, key, and value vectors) are just transformations from X1; think of them as variants of X1.

Add 88% of V1 and 12% of V2!

SLIDE 36

Transformer

  • Self-Attention is the mechanism to convert an input vector X into an output vector Z based on all other input vectors.

36

Step 1: Transform X1 into Z1 by adding up all the other X inputs.
Step 2: Transform Z1 into R1 with a normal feed-forward layer.
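
A stripped-down sketch of this self-attention step (single head): each input vector is mapped to query, key, and value variants, attention weights come from a softmax over query-key dot products, and each output Z_i is a weighted sum of the values. All weight matrices below are random stand-ins for learned ones.

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d = 4
X = rng.normal(size=(3, d))           # 3 input word vectors X1, X2, X3

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv      # query, key, value "variants" of each X_i

attn = softmax(Q @ K.T / np.sqrt(d))  # how much each word attends to every word
Z = attn @ V                          # each Z_i is a weighted sum of the value vectors
print(attn.round(2))
print(Z.shape)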

SLIDE 37

Transformer Network

  • Make it deep of course!

37

[Figure: stacked layers: Transformer → Transformer → Transformer]

SLIDE 38

Details

  • Self-Attention first proposed here:
  • https://arxiv.org/pdf/1706.03762.pdf
  • A great overview of Transformers is here:
  • http://jalammar.github.io/illustrated-transformer/
  • Lots of details obviously not included in this brief lecture.

38

SLIDE 39

Who is using Transformers?

  • BERT (Google 2019)
  • BART (Facebook 2019)
  • GPT, GPT-2, GPT-3 (OpenAI 2019-20)
  • XLNet (CMU, Google AI)
  • M2M-100 (Facebook)
  • T5 (Google AI 2020)

39

GPT-2: 1.5 billion parameters; GPT-3: 175 billion parameters

SLIDE 40

BERT

  • BERT is freely available. Lots of tutorials.
  • https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
  • Encode your text with BERT, and use the [CLS] token as your text representation.
  • Put a classification layer on top, and you’re ready!
  • If just classifying, you can do this, although it needs significant computation power.

40
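
A rough sketch of the recipe on this slide, assuming a recent version of the Hugging Face transformers library and PyTorch; the model name and the classifier on top are one reasonable choice, not the slide's specific setup.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode(text):
    # the [CLS] token's final hidden state serves as the whole text's representation
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[0, 0]      # vector at the [CLS] position

vec = encode("it was the best of times")
print(vec.shape)
# A classification layer (e.g., logistic regression over these vectors) goes on top.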

SLIDE 41

BERT as text classification

41

http://jalammar.github.io/illustrated-bert/

SLIDE 42

GPT-2

  • Online example
  • https://transformer.huggingface.co/doc/gpt2-large
  • Training this is beyond your computer’s capabilities. Using a trained GPT-2 is not.

42

pip install tensorflow==1.14.0
git clone https://github.com/openai/gpt-2.git
cd gpt-2
pip install -r requirements.txt
python download_model.py 345M
python src/interactive_conditional_samples.py --top_k 40 --length 50 --model_name 345M

Install and Run instructions

SLIDE 43

GPT-2 examples

43

Model prompt >>> The Naval Academy would like
======================================== SAMPLE 1 ========================================
to thank all of their students, staff and supporters for their support, especially those involved in organizing our campaign to secure this contract. This contract is a crucial piece in supporting a cadet education program and this contract reflects the commitment of the Academy to
================================================================================

Model prompt >>> The Naval Academy would like
======================================== SAMPLE 1 ========================================
_______________ to tell you that their newest mascot, the red dot will not be permitted to wear a blue shirt at the football games. There will be no football helmets, helmets would be too conspicuous to wear. We're sorry, but this will not
================================================================================

Model prompt >>> The Naval Academy would like
======================================== SAMPLE 1 ========================================
to thank the many people that contributed their time, energy and expertise to bringing this book to life. For more information, visit www.navy.edu/jailhousefiling/sailhouse_hockey.htm
================================================================================

SLIDE 44

Moving Forward

  • Many algorithms now use BERT (or others) as the first step to encode their text.
  • Optionally, if you have the GPUs, you can fine-tune these models to your task.
  • You then just use that encoded vector as input to a classifier, like those used in this class!
  • Problem: the vector is opaque. Why is the vector what it is?
  • Many NLP tasks are now like engineering: turning knobs and hoping the output gets better. The link to linguistics/science is partially lost.

44