CS546: Machine Learning in NLP (Spring 2020)
http://courses.engr.illinois.edu/cs546/
Julia Hockenmaier
juliahmr@illinois.edu 3324 Siebel Center Office hours: Monday, 11am—12:30pm
Lecture 2: More Intro
Wrap-up: Syllabus for this class
You will receive an email with a link to a Google form where you can sign up for slots to present.
— Please sign up for at least three slots so that I have some flexibility in assigning you to a presentation
We will give you one week to fill this in. You will have to meet with me the Monday before your presentation to go over your slides.
— Clarity of exposition and presentation
— Analysis (don’t just regurgitate what’s in the paper)
— Quality of slides (and effort that went into making them — just re-using other people’s slides is not enough)
How do you represent (or predict) words?
— Do you treat words in the input as atomic categories, as continuous vectors, or as structured objects?
— How do you handle rare/unseen words, typos, spelling variants, morphological information?
— Lexical semantics: do you capture word meanings/senses?
How do you represent (or predict) word sequences?
Sequences = sentences, paragraphs, documents, dialogs,… As a vector, or as a structured object?
How do you represent (or predict) structures?
Structures = labeled sequences, trees, graphs, formal languages (e.g. DB records/queries, logical representations)
How do you represent “meaning”?
Ambiguity: Natural language is highly ambiguous.
Coverage (compounded by Zipf’s Law): We will always encounter words and constructions that did not occur during training.
Our systems need to generalize from what they have seen during training to unseen events that occur during testing (i.e. when we actually use the system).
[Figure: word frequency distributions (log-log). How many words occur N times (number of words vs. word frequency)? English words, sorted by frequency: w1 = the, w2 = to, …, w5346 = computer, …]
In natural language:
— A few words are very frequent
— Most words are very rare
Zipf’s law: the r-th most common word wr has P(wr) ∝ 1/r
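As a rough illustration, a few lines of Python (toy corpus, purely for the example) that count word frequencies and print the rank-frequency table; on a large corpus, log frequency plotted against log rank comes out roughly linear, as Zipf’s law predicts.

```python
# Illustrative sketch (toy corpus): count word frequencies and list rank vs. frequency.
from collections import Counter

def rank_frequency(tokens):
    counts = Counter(tokens)
    # Sort words from most to least frequent
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        yield rank, word, freq

if __name__ == "__main__":
    # Toy "corpus"; in practice, read a large text file instead.
    text = "the cat sat on the mat and the dog sat on the rug"
    for rank, word, freq in rank_frequency(text.split()):
        print(rank, word, freq)
```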
The good:
Any text will contain a number of words that are very common. We have seen these words often enough that we know (almost) everything about them. These words will help us get at the structure (and possibly meaning) of this text.
The bad:
Any text will contain a number of words that are rare. We know something about these words, but haven’t seen them often enough to know everything about them: they may occur with a meaning or a part of speech we haven’t seen before.
The ugly:
Any text will contain a number of words that are unknown to us. We have never seen them before, but we still need to get at the structure (and meaning) of these texts.
Our systems need to be able to generalize from what they have seen to unseen events. There are two (complementary) approaches to generalization:
— Linguistics provides us with insights about the rules and structures in language that we can exploit in the (symbolic) representations we use
E.g.: a finite set of grammar rules is enough to describe an infinite language
— Machine Learning/Statistics allows us to learn models (and/or representations) from real data that often work well empirically on unseen data
E.g. most statistical or neural NLP
Option 1: Words are atomic symbols
— Each (surface) word form is its own symbol
— Or map different forms of a word to the same symbol:
  - map each form to its lemma (esp. in English, the lemma is still a word in the language, but lemmatized text is no longer grammatical)
  - map each form to its stem (no guarantee that the resulting symbol is an actual word)
  - map each form to the same canonical variant (e.g. lowercase everything, normalize spellings, perhaps spell-check)
Drawback: can’t capture syntactic/semantic relations between words
Option 2: Represent the structure of each word
“books” => “book N pl” (or “book V 3rd sg”)
This requires a morphological analyzer.
The output is often a lemma plus morphological information.
This is particularly useful for highly inflected languages (less so for English or Chinese).
Aims:
— the lemma/stem captures core (semantic) information
— reduce the vocabulary of highly inflected languages
Option 3: Each word is a (high-dimensional) vector
Advantage: Neural nets need vectors as input!
How do we represent words as vectors?
— Naive solution: as one-hot vectors
— Distributional similarity solution: as very high-dimensional sparse vectors
— Static word embedding solution (word2vec etc.): by a dictionary that maps words to fixed lower-dimensional dense vectors
— Dynamic embedding solution (ELMo etc.): compute context-dependent dense embeddings
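A small illustrative sketch of the first and third options (toy vocabulary, random stand-in embeddings rather than learned ones): a one-hot vector is a sparse indicator of vocabulary position, while a dense embedding is just a row of a |V| × d matrix.

```python
# Illustrative sketch: one-hot vectors vs. dense embedding lookup.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Represent a word as a one-hot vector of size |V|."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Dense embeddings: a random |V| x d matrix stands in for learned vectors;
# looking up a word is just selecting a row.
rng = np.random.default_rng(0)
emb_dim = 4
E = rng.normal(size=(len(vocab), emb_dim))

def embed(word):
    return E[word_to_id[word]]

print(one_hot("cat"))   # sparse, high-dimensional, no notion of similarity
print(embed("cat"))     # dense, low-dimensional; similar words can get similar rows
```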
Systems that use machine learning may need to have a unique representation of each word.
Option 1: the UNK token
Replace all rare words (in your training data) with an UNK token (for Unknown word). Replace all unknown words that you come across after training (including rare training words) with the same UNK token
Option 2: substring-based representations
Represent (rare and unknown) words as sequences of characters or substrings that are common in the vocabulary of your language
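A minimal sketch of the UNK strategy (the frequency threshold and the toy data are assumptions made for the example):

```python
# Illustrative sketch: replace rare training words with an UNK token, and map
# any word unseen at test time to the same token.
from collections import Counter

UNK = "<UNK>"
MIN_COUNT = 2  # assumed frequency threshold

def build_vocab(train_tokens, min_count=MIN_COUNT):
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def unk_replace(tokens, vocab):
    return [w if w in vocab else UNK for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train)
print(unk_replace(train, vocab))                  # rare training words -> <UNK>
print(unk_replace("the dog sat".split(), vocab))  # unseen test words   -> <UNK>
```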
What does this sentence mean? “I made her duck”
“duck”: noun or verb? “make”: “cook X” or “cause X to do Y”? “her”: “for her” or “belonging to her”?
Language has different kinds of ambiguity, e.g.:
Structural ambiguity
“I eat sushi with tuna” vs. “I eat sushi with chopsticks” “I saw the man with the telescope on the hill”
Lexical (word sense) ambiguity
“I went to the bank”: financial institution or river bank?
Referential ambiguity
“John saw Jim. He was drinking coffee.”
Open the pod door, Hal.
“open”: verb, adjective, or noun?
Verb: “open the door”
Adjective: “the open door”
Noun: “in the open”
We want to know the most likely tags T for the sentence S:
argmax_T P(T | S) = argmax_T P(T) P(S | T)
We need to define a statistical model of P(T | S), e.g.:
P(T) =def ∏i P(ti | ti−1)
P(S | T) =def ∏i P(wi | ti)
We need to estimate the parameters of P(T | S), e.g.: P( ti = V | ti−1 = N ) = 0.3
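A toy illustration of this model (all probabilities are invented for the example; a real tagger would use the Viterbi algorithm rather than brute-force enumeration):

```python
# Toy HMM-style tagger: score every tag sequence T for a short sentence S with
# P(T)P(S|T) = prod_i P(t_i | t_{i-1}) P(w_i | t_i) and return the argmax.
from itertools import product

tags = ["N", "V", "D"]
# P(t_i | t_{i-1}); "BOS" marks the start of the sentence (invented numbers)
trans = {("BOS", "D"): 0.5, ("BOS", "N"): 0.4, ("BOS", "V"): 0.1,
         ("D", "N"): 0.8, ("D", "V"): 0.1, ("D", "D"): 0.1,
         ("N", "V"): 0.5, ("N", "N"): 0.4, ("N", "D"): 0.1,
         ("V", "D"): 0.5, ("V", "N"): 0.4, ("V", "V"): 0.1}
# P(w_i | t_i) (invented numbers)
emit = {("the", "D"): 0.9, ("dog", "N"): 0.4, ("barks", "V"): 0.3,
        ("dog", "V"): 0.01, ("barks", "N"): 0.02, ("the", "N"): 0.001}

def score(words, tag_seq):
    p, prev = 1.0, "BOS"
    for w, t in zip(words, tag_seq):
        p *= trans.get((prev, t), 1e-6) * emit.get((w, t), 1e-6)
        prev = t
    return p

words = ["the", "dog", "barks"]
best = max(product(tags, repeat=len(words)), key=lambda T: score(words, T))
print(best, score(words, best))  # expected: ('D', 'N', 'V')
```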
The second major problem in NLP is coverage: we will always encounter unfamiliar words and constructions (e.g. “cassoulet”, a French bean casserole). Our models need to be able to deal with this. This means that our models need to be able to generalize from what they have been trained on to what they will be used on.
Starting in the early 1990s, NLP became very empirical and data-driven due to
— success of statistical methods in machine translation (IBM systems)
— availability of large(ish) annotated corpora (Susanne Treebank, Penn Treebank, etc.)
Advantages over rule-based approaches:
— Common benchmarks to compare models against
— Empirical (objective) evaluation is possible
— Better coverage
— Principled way to handle ambiguity
NLP makes heavy use of statistical models as a way to handle both the ambiguity and the coverage issues.
Basic approach:
— Define a statistical model for the task (may depend on available labeled training data)
— Often, decompose the task into a sequence of processing steps, i.e. a pipeline
A language model defines a distribution P(w) over the strings w = w1w2…wi… in a language.
Typically we factor P(w) such that we compute the probability word by word:
P(w) = P(w1) P(w2 | w1) … P(wi | w1…wi−1)
Standard n-gram models make the Markov assumption that wi depends only on the preceding n−1 words:
P(wi | w1…wi−1) := P(wi | wi−n+1…wi−1)
We know that this independence assumption is invalid (there are many long-range dependencies), but it is computationally and statistically necessary (we can’t store or estimate larger models).
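A minimal sketch of an unsmoothed bigram model (n = 2), estimated by maximum likelihood from a toy corpus; the zero probability on the second test sentence is exactly the coverage problem that smoothing (or neural models) must address.

```python
# Illustrative sketch: maximum-likelihood bigram language model, no smoothing.
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram(sentences):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = [BOS] + sent.split() + [EOS]
        unigram.update(toks[:-1])
        bigram.update(zip(toks[:-1], toks[1:]))
    # P(w | prev) = count(prev, w) / count(prev)
    return lambda prev, w: bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

def sentence_prob(p, sentence):
    toks = [BOS] + sentence.split() + [EOS]
    prob = 1.0
    for prev, w in zip(toks[:-1], toks[1:]):
        prob *= p(prev, w)
    return prob

p = train_bigram(["the cat sat", "the cat slept", "the dog sat"])
print(sentence_prob(p, "the cat sat"))    # nonzero
print(sentence_prob(p, "the dog slept"))  # zero without smoothing
```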
Many statistical NLP systems use explicit features (e.g. for semantic role labeling, etc.).
Feature design is usually a big component of building any particular NLP system.
Determining which features are useful for a particular task and model typically requires experimentation, but there are a number of commonly used ones (words, POS tags, syntactic dependencies, NER labels, etc.).
Traditional sequence models (n-gram language models, HMMs, MEMMs, CRFs) make rigid Markov assumptions (bigram/trigram/n-gram). Recurrent neural nets (RNNs, LSTMs) and transformers can capture arbitrary-length histories without requiring more parameters.
Word-based features:
How do we handle unseen/rare words?
Many features are produced by other NLP systems (POS tags, dependencies, NER output, etc.) These systems are often trained on labeled data.
Producing labeled data can be very expensive. We typically don’t have enough labeled data from the domain of interest.
We might not get accurate features for our domain of interest.
Many of the current successful neural approaches to NLP do not use traditional discrete features.
— End-to-end models: no pipeline!
— Words in the input are often represented as dense vectors (aka word embeddings, e.g. word2vec).
Traditional approaches: each word in the vocabulary is a separate feature; no generalization across words that have similar meanings.
Neural approaches: words with similar meanings have similar vectors.
Other kinds of features (POS tags, dependencies, etc.) are often ignored.
Neural networks, typically with several hidden layers (depth = # of hidden layers).
Single-layer neural nets are linear classifiers.
Multi-layer neural nets are more expressive.
Very impressive performance gains in computer vision (ImageNet) and speech recognition over the last several years. Neural nets have been around for decades. Why have they suddenly made a comeback?
Fast computers (GPUs!) and (very) large datasets have made it possible to train these very complex models.
Simplest variant: single-layer feedforward net
For binary classification tasks:
— Input layer: vector x; output unit: scalar y
— Return 1 if y > 0.5, return 0 otherwise
For multiclass classification tasks:
— Input layer: vector x; output layer: vector y with K output units (each output unit yi corresponds to class i)
— Return argmaxi(yi)
Multiclass classification = predict one of K classes.
Return the class i with the highest score: argmaxi(yi).
In neural networks, this is typically done by using the softmax function, which maps a real-valued vector into a probability distribution:
For a vector z = (z0…zK): P(i) = softmax(zi) = exp(zi) ∕ ∑k=0..K exp(zk)
(NB: This is just logistic regression)
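A minimal numpy version of this (the scores are arbitrary example values; subtracting the maximum is a standard trick for numerical stability and does not change the result):

```python
# Illustrative sketch of the softmax function defined above.
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs, probs.sum())     # a proper distribution that sums to 1
print(int(np.argmax(probs)))  # predicted class = index of the highest score
```

Note that argmax over the softmax probabilities is the same as argmax over the raw scores; the softmax matters when we need an actual distribution (e.g. for training with a cross-entropy loss).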
Single-layer (linear) feedforward network
y = wx + b (binary classification)
w is a weight vector, b is a bias term (a scalar)
This is just a linear classifier (aka Perceptron)
(the output y is a linear function of the input x)
Single-layer non-linear feedforward networks: Pass wx + b through a non-linear activation function, e.g. y = tanh(wx + b)
Sigmoid (logistic function): σ(x) = 1/(1 + e−x)
Useful for output units (probabilities) [0,1] range
Hyperbolic tangent: tanh(x) = (e2x −1)/(e2x+1)
Useful for internal units: [-1,1] range
Hard tanh (approximates tanh): htanh(x) = −1 for x < −1, 1 for x > 1, x otherwise
Rectified Linear Unit: ReLU(x) = max(0, x)
Useful for internal units
Softmax: softmax(zi) = exp(zi) ∕ ∑k=0..K exp(zk)
Special case for output units (multiclass classification)
[Plots of sigmoid(x), tanh(x), hardtanh(x), and ReLU(x)]
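Minimal numpy versions of these activation functions, applied in a single-layer forward pass with made-up weights:

```python
# Illustrative sketch: the activation functions above, plus y = g(Wx + b).
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def hardtanh(x): return np.clip(x, -1.0, 1.0)
def relu(x):     return np.maximum(0.0, x)

x = np.array([0.5, -1.0, 2.0])         # toy input vector
W = np.array([[0.1, -0.2, 0.4],
              [0.3,  0.1, 0.0]])       # made-up weights
b = np.array([0.0, -0.1])
print(relu(W @ x + b))                 # output of a single non-linear layer
```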
We can generalize this to multi-layer feedforward nets
Input layer: vector x; hidden layers: vectors h1, …, hn; output layer: vector y
In NLP, the input and output variables are discrete: words, labels, structures. NNs work best with continuous vectors.
We typically want to learn a mapping (embedding) from discrete words (input) to dense vectors. We can do this with (simple) neural nets and related methods.
The input to a NN is (traditionally) a fixed-length vector. How do we represent a variable-length sequence as a vector?
With recurrent neural nets: read in one word at a time to predict a vector, use that vector and the next word to predict a new vector, etc. With convolutional nets: use a sliding (fixed-length) window.
Word embeddings (word2vec, GloVe, etc.)
Train a NN to predict a word from its context (or the context from a word) to get a dense vector representation of each word
Neural language models:
Use recurrent neural networks (RNNs, GRUs, LSTMs) to predict word sequences, or to obtain context-sensitive embeddings (ELMo)
Sequence-to-sequence (seq2seq) models:
From machine translation: use one RNN to encode the source string, and another RNN to decode this into a target string. Also used for automatic image captioning, etc.
Convolutional neural nets
Used e.g. for text classification
Transformers
Probability distribution over the strings in a language, typically factored into distributions P(wi | …) for each word:
P(w) = P(w1…wn) = ∏i P(wi | w1…wi−1)
N-gram models assume each word depends only on the preceding n−1 words:
P(wi | w1…wi−1) =def P(wi | wi−n+1…wi−1)
To handle variable-length strings, we assume each string starts with n−1 start-of-sentence symbols (BOS), or〈S〉, and ends in a special end-of-sentence symbol (EOS), or〈\S〉.
— The vocabulary V contains n types (incl. UNK, BOS, EOS)
— We want to condition each word on k preceding words
— [Naive] Each input word wi ∈ V (that we’re conditioning on) is an n-dimensional one-hot vector v(w) = (0,…,0,1,0,…,0)
— Our input layer x = [v(w1),…,v(wk)] has n×k elements
— To predict the probability over output words, the output layer is a softmax over n elements: P(w | w1…wk) = softmax(hW2 + b2)
With (say) one hidden layer h, we’ll need two sets of parameters: one for the input-to-hidden mapping and one for the hidden-to-output mapping.
Architecture:
Input layer: x = [v(w1)…v(wk)], where v(w) is a one-hot vector of size dim(V) = |V|
Hidden layer: h = g(xW1 + b1)
Output layer: P(w | w1…wk) = softmax(hW2 + b2)
Parameters:
Weight matrices and biases:
first layer: W1 ∈ R^(k·dim(V) × dim(h)), b1 ∈ R^dim(h)
second layer: W2 ∈ R^(dim(h) × |V|), b2 ∈ R^|V|
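A minimal numpy sketch of this architecture with toy sizes and random, untrained parameters (a real model would learn W1, b1, W2, b2 by backpropagation):

```python
# Illustrative forward pass of the feedforward n-gram LM described above.
import numpy as np

rng = np.random.default_rng(0)
V, k, dim_h = 10, 3, 8          # toy vocabulary size, context size, hidden size

def one_hot(i, n=V):
    v = np.zeros(n); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max()); return e / e.sum()

W1 = rng.normal(scale=0.1, size=(k * V, dim_h)); b1 = np.zeros(dim_h)
W2 = rng.normal(scale=0.1, size=(dim_h, V));     b2 = np.zeros(V)

context = [2, 5, 7]                               # word ids of the k context words
x = np.concatenate([one_hot(i) for i in context]) # input layer: k concatenated one-hots
h = np.tanh(x @ W1 + b1)                          # hidden layer: h = g(x W1 + b1)
p = softmax(h @ W2 + b2)                          # output layer: P(w | w1...wk)
print(p.shape, p.sum())                           # (10,) 1.0
```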
How many parameters do we need to learn?
Traditional n-gram model: dim(V)^k parameters
With dim(V) = 10,000 and k=3: 1,000,000,000,000
Naive neural n-gram model (one-hot encoding of vocabulary): #parameters going to hidden layer: k∙dim(V)∙dim(h),
with dim(h) = 300, dim(V) = 10,000 and k=3: 9,000,000
plus #parameters going to output layer: dim(h)∙dim(V)
with dim(h) = 300, dim(V) = 10,000: 3,000,000
The neural model still requires a lot of parameters, but far fewer than the traditional n-gram model.
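The arithmetic above can be checked directly (same assumed sizes):

```python
# Reproducing the parameter counts above.
dim_V, dim_h, k = 10_000, 300, 3

print(dim_V ** k)         # traditional n-gram model:       1,000,000,000,000
print(k * dim_V * dim_h)  # neural model, input-to-hidden:   9,000,000
print(dim_h * dim_V)      # neural model, hidden-to-output:  3,000,000
```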
Advantages over traditional n-gram models:
— The hidden layer captures interactions among context words
— Increasing the order of the n-gram requires only a small linear increase in the number of parameters:
dim(W1) goes from k·dim(emb)×dim(h) to (k+1)·dim(emb)×dim(h), whereas a traditional k-gram model requires dim(V)^k parameters
— Increasing the vocabulary also leads only to a linear increase in the number of parameters
Naive neural models have the same shortcomings as standard n-gram models:
— Models get very large (and sparse) as n increases
— We can’t generalize across similar contexts
— N-gram Markov (independence) assumptions are too strict
Better neural language models overcome these by…
… using word embeddings instead of one-hots as input:
Instead of representing context words as distinct, discrete symbols (i.e. very high-dimensional one-hot vectors), use a dense low-dimensional vector representation where similar words have similar vectors [next]
… using recurrent nets instead of feedforward nets:
Instead of a fixed-length (n-gram) context, use recurrent nets to encode variable-length contexts [later class]
Basic RNN: Modify the standard feedforward architecture (which predicts a string w0…wn one word at a time) such that the output of the current step (wi) is given as additional input to the next time step (when predicting the output for wi+1).
“Output” — typically (the last) hidden layer.
[Diagram: a feedforward net vs. a recurrent net (input and hidden layers); in the recurrent net, the hidden layer is fed back in as additional input at the next time step]
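A minimal numpy sketch of one recurrent step (Elman-style), with random untrained parameters and toy word vectors, to make the recurrence concrete:

```python
# Illustrative basic RNN step: the previous hidden state is fed back in as
# additional input at each time step.
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 5
W_xh = rng.normal(scale=0.1, size=(dim_x, dim_h))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))   # hidden -> hidden (recurrence)
b_h  = np.zeros(dim_h)

def rnn_step(x_t, h_prev):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Process a toy sequence of 3 "word vectors", one time step at a time.
h = np.zeros(dim_h)
for x_t in rng.normal(size=(3, dim_x)):
    h = rnn_step(x_t, h)
print(h)   # final hidden state: a fixed-size encoding of the whole sequence
```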
Main idea: If you use a feedforward network to predict the probability of words that appear in the context of (near) an input word, the hidden layer of that network provides a dense vector representation of the input word. Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations. These models can be trained on large amounts of raw text (and pretrained embeddings can be downloaded).
Task (e.g. machine translation):
Given one variable length sequence as input, return another variable length sequence as output
Main idea:
Use one RNN to encode the input sequence (“encoder”).
Feed the last hidden state as input to a second RNN (“decoder”) that then generates the output sequence.
Use attention mechanisms (e.g. to focus on certain parts of the input when generating output).
Non-recurrent architecture for seq2seq tasks:
— Also has an encoder and a decoder, but these process the entire input at once (the decoder may mask future outputs so it can generate the output sequentially)
— May use positional embeddings to capture sequence information
— May use multiple self-attention (attention within a sequence) mechanisms in parallel
— Can be (pre)trained on very large amounts of data, and then fine-tuned for specific tasks
— Yields state-of-the-art context-dependent encodings (BERT) and language models (GPT-2)
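A minimal numpy sketch of (single-head) scaled dot-product self-attention, the core operation inside a transformer layer; the projection matrices here are random stand-ins for learned parameters:

```python
# Illustrative single-head self-attention over a toy sequence of embeddings.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))            # one embedding per input position

W_q = rng.normal(scale=0.1, size=(d_model, d_k))   # random stand-ins for learned
W_k = rng.normal(scale=0.1, size=(d_model, d_k))   # query/key/value projections
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)                    # how much each position attends to the others
A = softmax(scores)                                # each row is an attention distribution
out = A @ V                                        # new representation for each position
print(A.shape, out.shape)                          # (5, 5) (5, 8)
```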