IN4080 – 2020 FALL
NATURAL LANGUAGE PROCESSING
Jan Tore Lønning
Neural networks, language models, word2vec
Lecture 6, 21 Sept

Today:
Neural networks, Language models, Word embeddings, Word2vec
Artificial neural networks

Inspired by the brain: neurons, synapses.
Does not pretend to be a model of the brain.
The simplest model is the feed-forward network, also called the multi-layer perceptron (MLP).
Each feature, $x_j$, of the input is an input node.
An additional bias node $x_0 = 1$.
A weight $w_j$ at each edge.
Multiply the input values with the weights and sum them:
$z = \sum_{j=0}^{n} w_j x_j = \mathbf{w} \cdot \mathbf{x}$
(Figure: input nodes $x_1, x_2, x_3$, a bias node 1, weights $w_0, w_1, w_2, w_3$, one output node and a target value.)
We start with an initial set of weights.
Consider training examples; adjust the weights to reduce the loss.
How? Gradient descent. Gradient means partial derivatives.
Linear regression of more than two variables
We try to fit the best (hyper-)plane:
$\hat{y} = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
We can use the same mean squared error loss:
$L = \frac{1}{n} \sum_{j=1}^{n} (\hat{y}_j - y_j)^2$
A function of more than one variable, e.g. $f(x, y)$.
The partial derivative, e.g. $\frac{\partial f}{\partial x}$, is the derivative with respect to one variable, treating the others as constants.
E.g. if $f(x, y) = ax + by + c$, then $\frac{\partial f}{\partial x} = a$ and $\frac{\partial f}{\partial y} = b$.
https://www.wikihow.com/Image:OyXsh.png
We move in the opposite direction of the gradient.
Intuitively: take small steps in the direction where the surface slopes most steeply downwards.
The length of the steps is determined by the learning rate.
Some rules for derivatives:
If $f(x) = ax + b$, then $f'(x) = a$; we also write $\frac{df}{dx} = a$, and if $y = f(x)$, we can write $\frac{dy}{dx} = a$.
The chain rule: if $z = f(x) = g(y)$ with $y = h(x)$, this can be written $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$.
In particular, if $f(x) = (ax + b)^2$, then $f'(x) = 2(ax + b) \cdot a$.
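As a small sanity check of the last example (my own illustration, not part of the slides), the analytic derivative can be compared with a finite-difference approximation:

```python
def f(x, a=3.0, b=1.0):
    return (a * x + b) ** 2

def f_prime(x, a=3.0, b=1.0):
    return 2 * (a * x + b) * a       # the chain-rule result from the slide

x, h = 2.0, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
print(f_prime(x), numeric)                  # both approximately 42.0
```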
Loss: mean squared error:
$L(\hat{y}_1, \ldots, \hat{y}_n) = \frac{1}{n} \sum_{j=1}^{n} (\hat{y}_j - y_j)^2$, where $\hat{y}_j = \sum_{i=0}^{m} w_i x_{j,i} = \mathbf{w} \cdot \mathbf{x}_j$
We will update the $w_i$-s. Consider the partial derivatives w.r.t. $w_i$:
$\frac{\partial}{\partial w_i} L = \frac{1}{n} \sum_{j=1}^{n} 2 (\hat{y}_j - y_j) x_{j,i}$
Update $w_i$: $w_i = w_i - \eta \frac{\partial}{\partial w_i} L$
Here $n$ is the number of observations, $1 \le j \le n$, and $m$ is the number of features for each observation, $0 \le i \le m$.
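A minimal NumPy sketch of this update rule, assuming a design matrix X whose first column is the bias feature $x_0 = 1$ (illustrative code, not from the slides):

```python
import numpy as np

def gradient_descent_step(w, X, y, eta=0.1):
    """One batch gradient-descent step for linear regression with MSE loss.

    X: (n, m+1) input matrix whose first column is the bias feature x_0 = 1.
    y: (n,) target vector.  w: (m+1,) weight vector.
    """
    y_hat = X @ w                                # predictions y_hat_j = w . x_j
    grad = (2.0 / len(y)) * (X.T @ (y_hat - y))  # dL/dw_i = (1/n) sum_j 2 (y_hat_j - y_j) x_{j,i}
    return w - eta * grad

# Toy data generated by y = 1 + 2x; the weights approach [1, 2].
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = np.zeros(2)
for _ in range(1000):
    w = gradient_descent_step(w, X, y)
print(w)
```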
A binary classifier (logistic regression) as a single node:
$z = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
Loss (cross-entropy):
$L_{CE} = -\sum_{j=1}^{n} \big( y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \big)$

To simplify, consider only one observation at a time. By the chain rule:
$\frac{\partial}{\partial w_i} L_{CE} = \frac{\partial L_{CE}}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w_i}$
$\frac{\partial L_{CE}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$
$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$
$\frac{\partial z}{\partial w_i} = x_i$
$\frac{\partial}{\partial w_i} L_{CE} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \, \hat{y}(1 - \hat{y}) \, x_i = (\hat{y} - y) \, x_i$
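The derived gradient $(\hat{y} - y)x_i$ gives a one-line stochastic update. A small illustrative sketch (the names and toy data are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, y, eta=0.1):
    """One stochastic gradient-descent step for logistic regression.

    x: feature vector with the bias feature x[0] = 1; y: gold label (0 or 1).
    Uses the gradient derived above: dL_CE/dw_i = (y_hat - y) * x_i.
    """
    y_hat = sigmoid(w @ x)
    return w - eta * (y_hat - y) * x

w = np.zeros(3)
w = sgd_step(w, np.array([1.0, 2.0, 0.5]), y=1)
```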
A feed-forward (multi-layer) network has:
An input layer
An output layer: the predictions
One or more hidden layers
Connections (weights) from one layer to the next
Each hidden node is like a small logistic regression unit:
First a sum of weighted inputs: $z = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
Then the result is run through an activation function, e.g. the logistic (sigmoid) function:
$y = \sigma(z) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$
The output layer depends on the task:
Regression: one node, no activation function.
Binary classifier: one node, logistic activation function.
Multinomial classifier: several nodes, softmax (plus more alternatives).
The choice of loss function also depends on the task.
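A small sketch of a forward pass through such a network, with a sigmoid hidden layer and a softmax output layer (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # shift by the max for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Forward pass of a feed-forward network with one hidden layer.

    W1, b1: weights and bias of the hidden layer (sigmoid activation).
    W2, b2: weights and bias of the output layer (softmax, multinomial classifier).
    """
    h = sigmoid(W1 @ x + b1)         # hidden layer: weighted sums + activation
    return softmax(W2 @ h + b2)      # output layer: probability distribution over classes

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(x, W1, b1, W2, b2))    # two class probabilities summing to 1
```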
Consider two consecutive layers:
Layer M, with $1 \le i \le m$ nodes, and a bias node $M_0$.
Layer N, with $1 \le j \le n$ nodes.
Let $w_{i,j}$ be the weight at the edge going from $M_i$ to $N_j$.
Consider processing one observation:
Let $x_i$ be the value going out of node $M_i$. If M is a hidden layer:
$x_i = \sigma(z_i)$, where $z_i = \sum(\ldots)$
(Figure: layer M with nodes $M_0, M_1, M_2, M_3$ and layer N with nodes $N_1, \ldots, N_4$.)
If N is the output layer, calculate the error terms $\delta_j^N$ as before from the loss and the predictions.
If M is a hidden layer, calculate the error terms $\delta_i^M$ from:
a weighted sum of the error terms at layer N, and
the derivative of the activation function:
$\delta_i^M = \Big( \sum_{j=1}^{n} w_{i,j} \, \delta_j^N \Big) \frac{d x_i}{d z_i}$, where $x_i = \sigma(z_i)$ and $z_i = \sum(\ldots)$
By repeating the process, we get error terms for all the layers.
The update of the weights between the two layers:
$w_{i,j} = w_{i,j} - \eta \, x_i \, \delta_j^N$
where $x_i$ is the value going out of node $M_i$.

There are alternative activation functions:
$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
$\mathrm{ReLU}(x) = \max(x, 0)$
ReLU is the preferred choice in hidden layers.
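The backpropagation step between two layers M and N can be sketched as follows; the sigmoid is assumed as activation, and the function name and array shapes are my own choices (a sketch, not the lecture's implementation):

```python
import numpy as np

def backprop_layer(W, x, delta_N, eta=0.1):
    """Propagate error terms from layer N back to layer M and update the weights.

    W: (m, n) matrix, W[i, j] is the weight on the edge from M_i to N_j.
    x: (m,) values going out of layer M (assumed to be sigmoid activations).
    delta_N: (n,) error terms at layer N.
    Implements delta^M_i = (sum_j W[i, j] * delta^N_j) * dx_i/dz_i, with
    dx_i/dz_i = x_i * (1 - x_i) for the sigmoid, and the update
    W[i, j] = W[i, j] - eta * x_i * delta^N_j.
    """
    delta_M = (W @ delta_N) * x * (1.0 - x)
    W_updated = W - eta * np.outer(x, delta_N)
    return delta_M, W_updated
```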
Neural networks, Language models, Word embeddings, Word2vec
Goal: Ascribe probabilities to word sequences. Motivation:
Translation:
P(she is a tall woman) > P(she is a high woman)
P(she has a high position) > P(she has a tall position)
Spelling correction:
P(She met the prefect.) > P(She met the perfect.)
Speech recognition:
P(I saw a van) > P(eyes awe of an)
Goal: Ascribe probabilities to word sequences:
$P(w_1, w_2, w_3, \ldots, w_n)$
Related: the probability of the next word:
$P(w_n \mid w_1, w_2, w_3, \ldots, w_{n-1})$
A model which does either is called a Language Model (LM).
Comment: The term is somewhat misleading.
(It probably originates from speech recognition.)
The two definitions are related by the chain rule for probability:
$P(w_1, w_2, w_3, \ldots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \cdots = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})$
P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
But this does not work for long sequences (which we may not even have seen before).
A word depends only on the immediately preceding word:
$P(w_1, w_2, w_3, \ldots, w_n) \approx P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \cdots = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
P("its water is so transparent") ≈ P(its) × P(water | its) × P(is | water) × P(so | is) × P(transparent | so)
This is called a bigram model.
The probabilities can be estimated by counting.
This yields maximum likelihood probabilities (= the estimates that make the training data maximally probable):
$P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$
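A minimal sketch of this counting, using Python's collections.Counter; the <s> and </s> boundary symbols are my own addition and are not discussed on these slides:

```python
from collections import Counter

def bigram_mle(corpus):
    """Maximum-likelihood bigram probabilities P(w_i | w_{i-1}) from counts.

    corpus: a list of tokenized sentences (lists of strings).
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                    # count(w_{i-1})
        bigrams.update(zip(tokens[:-1], tokens[1:]))    # count(w_{i-1}, w_i)
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

corpus = [["its", "water", "is", "so", "transparent"]]
probs = bigram_mle(corpus)
print(probs[("water", "is")])   # count(water, is) / count(water) = 1.0
```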
A word depends only on the $k-1$ immediately preceding words:
$P(w_1, w_2, w_3, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k+1}, \ldots, w_{i-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-k+1}^{i-1})$
This is called a
unigram model – no preceding words
trigram model – two preceding words
k-gram model – k-1 preceding words
Goal: Generate a sequence of words.
Unigram:
Choose the first word according to how probable it is; choose the second word according to how probable it is, etc.
(= the generative model for multinomial NB text classification)
Bigram:
Select word $w_i$ according to how probable it is given the preceding word, $P(w_i \mid w_{i-1})$.
k-gram:
Select word $w_i$ according to how probable it is given the $k-1$ preceding words, $P(w_i \mid w_{i-k+1}^{i-1})$.
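A sketch of sampling from such a bigram model, reusing the probability dictionary from the counting sketch above (illustrative only):

```python
import random

def generate_bigram(probs, max_len=20):
    """Sample a word sequence from bigram probabilities P(w_i | w_{i-1}).

    probs: dict mapping (w_prev, w) -> probability, e.g. as built by bigram_mle above.
    """
    word, output = "<s>", []
    for _ in range(max_len):
        candidates = [(w, p) for (prev, w), p in probs.items() if prev == word]
        if not candidates:
            break
        words, weights = zip(*candidates)
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        output.append(word)
    return output
```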
There might be words that are never observed during training.
Use a special symbol, e.g. UNK, for unseen words during application.
Set aside a probability for seeing a new word; this may be estimated from a held-out corpus.
Adjust the probabilities for the other words in a unigram model accordingly, and likewise the conditional probabilities of the k-gram model.
Since we might not have seen all possibilities in the training data, we might add a small count k to every bigram (add-k smoothing):
$P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i) + k}{count(w_{i-1}) + k|V|}$
where $|V|$ is the size of the vocabulary $V$.

Shakespeare produced
N = 884,647 word tokens, V = 29,066 word types.
Bigrams:
Possible bigrams: $|V|^2 \approx 844{,}000{,}000$
Shakespeare's bigram tokens: 884,647; bigram types: ca. 300,000.
Add-k smoothing is not a good solution here: most of the probability mass would go to unseen bigrams.
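A one-function sketch of the add-k estimate, building on the counters from the earlier counting sketch (names are illustrative):

```python
def add_k_prob(w_prev, w, bigrams, unigrams, vocab_size, k=1.0):
    """Add-k smoothed bigram estimate P(w | w_prev).

    bigrams and unigrams are Counter objects as in the earlier counting sketch;
    vocab_size is |V|, the number of word types.
    """
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)
```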
If you have good evidence, use the higher-order (e.g. trigram) model.
If not, use the bigram model, or even the unigram model.
Or combine the models.
Two strategies: backoff and interpolation.
Simple interpolation:
$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$
The $\lambda$-s can be tuned on a held-out corpus.
A more elaborate model will condition the $\lambda$-s on the context.
(This brings in elements of backoff.)
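A sketch of simple interpolation for a trigram model; the functions p_uni, p_bi, p_tri and the lambda values are placeholders of my own:

```python
def interpolated_prob(w, w_prev2, w_prev1, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Simple linear interpolation of unigram, bigram and trigram estimates.

    p_uni, p_bi, p_tri: functions returning the individual MLE estimates;
    the lambdas must sum to 1 and would normally be tuned on held-out data.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w_prev1) + l3 * p_tri(w, w_prev2, w_prev1)
```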
Extrinsic evaluation:
To compare two LMs, see how well they are doing in an application, e.g. machine translation or speech recognition.
Intrinsic evaluation:
Use a held-out corpus and measure
$P(w_1, w_2, w_3, \ldots, w_n)^{1/n}$
The $n$-th root compensates for different lengths.
For a k-gram model this is $\left( \prod_{i=1}^{n} P(w_i \mid w_{i-k+1}^{i-1}) \right)^{1/n}$
It is normal to use the inverse of this, called the perplexity:
$PP(w_1^n) = P(w_1, w_2, \ldots, w_n)^{-1/n} = \sqrt[n]{\frac{1}{P(w_1, w_2, \ldots, w_n)}}$

The best smoothing is achieved with Kneser-Ney smoothing.
Shortcomings of all n-gram models:
The smoothing is not optimal.
The context is restricted to a limited number of preceding words.
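Perplexity is usually computed in log space to avoid underflow; a small illustrative sketch (the probabilities in the example are made up):

```python
import math

def perplexity(word_probs):
    """Perplexity from the per-word probabilities P(w_i | context) on a held-out text.

    Computed in log space: PP = exp(-(1/n) * sum_i log P(w_i | ...)).
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

print(perplexity([0.2, 0.1, 0.25, 0.2]))   # about 5.6; lower perplexity is better
```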
Neural networks, Language models, Word embeddings, Word2vec
Two words are similar in meaning if their context vectors are similar.

             aardvark  computer  data  pinch  result  sugar  ...
apricot          0         0       0     1      0       1
pineapple        0         0       0     1      0       1
digital          0         2       1     0      1       0
information      0         1       6     0      4       0
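For illustration, cosine similarity over these count vectors can be computed as follows (the vectors are copied from the table above; the code is my own sketch):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two context-count vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Count vectors over the contexts (aardvark, computer, data, pinch, result, sugar):
apricot     = np.array([0, 0, 0, 1, 0, 1])
digital     = np.array([0, 2, 1, 0, 1, 0])
information = np.array([0, 1, 6, 0, 4, 0])

print(cosine(digital, information))   # about 0.67: shared contexts
print(cosine(apricot, information))   # 0.0: no shared contexts
```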
A word $w$ can be represented by the vector of counts of the context words occurring with $w$.
Can be used for
studying similarities between words
document similarities
But the vectors are sparse:
Long: 20,000-50,000 dimensions (the size of the vocabulary)
Many entries are 0
Even though car and automobile mean almost the same, they are two different dimensions in the vector space.
Lexical semantics, Vector models of documents, tf-idf weighting, Word-context matrices, Word embeddings with dense vectors

Shorter vectors (length 50-1000): a "low-dimensional" space.
Dense (most elements are not 0).
Intuitions:
Similar words should have similar vectors.
Words that occur in similar contexts tend to be similar in meaning.
Generalize better than sparse vectors.
Input to deep learning: fewer weights (or other weights).
Capture semantic similarities.
Better for sequence modelling: language models, etc.
In current language technology: each word is represented by a dense vector.
Words are more or less similar to each other.
A word can be similar to one word in some dimensions and to another word in other dimensions.
Figure from https://medium.com/@jayeshbahire
http://vectors.nlpl.eu/explore/embeddings/en/
An online service for exploring pretrained embeddings, e.g. finding the most similar words.
Negative words change the direction of the search: their vectors are subtracted from the query.
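A hedged example of such queries with gensim's KeyedVectors (the file name model.txt is a placeholder for a pretrained model in word2vec text format, e.g. downloaded from the repository linked below):

```python
from gensim.models import KeyedVectors

# "model.txt" is a placeholder for a pretrained model in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("model.txt", binary=False)

print(vectors.most_similar("car", topn=5))
# Negative words are subtracted from the query vector:
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```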
"Man is to computer programmer as woman is to homemaker."
Different adjectives associated with:
male and female terms
typical black names and typical white names
Embeddings may be used to study historical bias.
Goal: neutralize the biases.
Some positive results, but also reports that it is not sufficient.
Is debiasing a goal? When should we (not) debias?
https://vagdevik.wordpress.com/2018/07/08/debiasing-word-embeddings/
Extrinsic evaluation:
Evaluate the contribution as part of an application.
Intrinsic evaluation:
Evaluate against a resource, e.g. a manually constructed dataset.
Some datasets:
WordSim-353: broader "semantic relatedness"
SimLex-999: narrower (similarity), manually annotated for similarity
Part of SimLex-999
Embeddings are used as representations for words as input in all kinds of NLP tasks:
Text classification, Language models, Named-entity recognition, Machine translation, etc.
gensim: easy-to-use tool for training your own models
Word2vec: https://code.google.com/archive/p/word2vec/
fastText: https://fasttext.cc/
GloVe: https://nlp.stanford.edu/projects/glove/
Pretrained embeddings, also for Norwegian: http://vectors.nlpl.eu/repository/
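A minimal gensim example for training your own model (parameter names as in gensim 4.x; the toy corpus is just for illustration):

```python
from gensim.models import Word2Vec

# A toy corpus: an iterable of tokenized sentences (lists of strings).
sentences = [["its", "water", "is", "so", "transparent"],
             ["the", "water", "is", "clear"]]

# sg=1 selects skip-gram, sg=0 gives CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("water", topn=3))
```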
Neural networks, Language models, Word embeddings, Word2vec
Instead of counting, use a neural network to learn a LM.
Simplest form: a bigram model:
For a given word $w_{i-1}$, try to predict the next word $w_i$, i.e. try to estimate $P(w_i \mid w_{i-1})$.
From J&M 3.ed. 2018 Ch. 16
Input and output words are represented as one-hot vectors over the vocabulary.
The dimension d of the hidden layer is typically 50-300.
Independent learning for each input word.
Consider all possible next words for the given word; use softmax to get a probability distribution.
Idea: Use the $|V| \times d$ weight matrix as embeddings, i.e.:
represent word $j$ by $(w_{j,1}, w_{j,2}, \ldots, w_{j,d})$, the $j$-th row of the matrix.
Why? Since similar words will tend to be followed by the same words, they will get similar rows.
Since two words that are similar in meaning tend to be followed by the same words, they will get similar rows in the $|V| \times d$ matrix.
This will help the training of the model.
We could alternatively use the second weight matrix (from the hidden layer to the output) as embeddings.
We could generalize to a larger context of several words.
Observe this is order-independent (a bag of words).
Continuous bag of words model (CBOW):
Predict $w_t$ from a window of surrounding words.
https://commons.wikimedia.org/wiki/File:Cbow.png
Skip-gram:
From $w_t$ predict all the words in the surrounding window.
Assume independence of the context words.
Boils down to something similar to a unigram model.
https://commons.wikimedia.org/wiki/File:Skip-gram.png
From J&M 3.ed. 2018 Ch. 16
Training a skip-gram model this way is expensive.
The softmax
$P(C_j \mid \mathbf{x}) = \frac{e^{\mathbf{w}_j \cdot \mathbf{x}}}{\sum_{i=1}^{|V|} e^{\mathbf{w}_i \cdot \mathbf{x}}}$
has one class per word in the vocabulary, i.e. in making an update for one (target, context) pair one has to calculate a dot product for each word in the vocabulary.
Hence we look for cheaper training methods.
Skip-gram with negative sampling, in brief:
1. Treat the target word and a neighbouring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative examples.
3. Train a logistic regression classifier to distinguish the two cases.
4. Use the learned weights as the embeddings.

Training sentence:
... lemon, a tablespoon of apricot jam a ...
Training data: input/output pairs centering on apricot. Assume a +/- 2 word window.
... lemon, a tablespoon of apricot preserves or ...
For each positive example, we'll create k negative examples,
using noise words: any random word that isn't the target word $t$.
One of various ways to train the classifier to distinguish positive and negative examples.
Intuition: Words are likely to appear near similar words.
Model similarity with the dot product: $\mathrm{similarity}(t, c) \approx \mathbf{t} \cdot \mathbf{c}$
Problem: the dot product is not a probability! (Neither is the cosine.)
Given a tuple (target, context), e.g. (apricot, jam) or (apricot, aardvark),
calculate the probabilities
$P(+ \mid t, c)$ and $P(- \mid t, c) = 1 - P(+ \mid t, c)$
Maximize $P(+ \mid t, c)$ for the positive examples and $P(- \mid t, c)$ for the negative examples,
where $P(+ \mid t, c) = \sigma(\mathbf{t} \cdot \mathbf{c}) = \frac{1}{1 + e^{-\mathbf{t} \cdot \mathbf{c}}}$
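A small sketch of these quantities in code: $P(+ \mid t, c)$ and the corresponding negative-sampling loss for one positive pair and its sampled negatives (my own formulation of the objective described above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_positive(t_vec, c_vec):
    """P(+ | t, c) = sigma(t . c): probability that c is a genuine context word of t."""
    return sigmoid(t_vec @ c_vec)

def negative_sampling_loss(t_vec, pos_vec, neg_vecs):
    """Loss for one positive pair (t, c_pos) and its k sampled negative words:
    -log P(+ | t, c_pos) - sum_i log P(- | t, c_neg_i), with P(- | t, c) = sigma(-t . c)."""
    loss = -np.log(p_positive(t_vec, pos_vec))
    for neg in neg_vecs:
        loss -= np.log(sigmoid(-(t_vec @ neg)))
    return loss
```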
We feed a pair of words (as one-hot vectors) to the network.
Compare to the target (positive or negative).
Update the weights.
We learn the set of embeddings.
(Figure: one-hot input vectors for the two words.)
We learn two separate embedding matrices W and C.
We can use W as representations for the words (or combine it with C in some way).
What have we learned:
If two words w1 and w2 occur in similar contexts,
i.e. with the same (or similar) context words, e.g. c,
then both w1 and w2 should have a large cosine with c,
and hence have similar vectors.
Embeddings are used as representations for words as input in all kinds of NLP tasks:
Text classification, Language models, Named-entity recognition, Machine translation, etc.
IN5550 Spring 2020
gensim: easy-to-use tool for training your own models
Word2vec: https://code.google.com/archive/p/word2vec/
fastText: https://fasttext.cc/
GloVe: https://nlp.stanford.edu/projects/glove/
Pretrained embeddings, also for Norwegian: http://vectors.nlpl.eu/explore/embeddings/en/