Modern Neural-Networks approaches to NLP — J.-C. Chappelier


SLIDE 1

Modern Neural-Networks approaches to NLP

J.-C. Chappelier

Laboratoire d’Intelligence Artificielle Faculté I&C

SLIDE 2

Objectives of this lecture


So, is this course a Machine Learning Course?

◮ NLP makes use of Machine Learning (as would Image Processing, for instance)
◮ but good results require:
  ◮ good preprocessing
  ◮ good data (to learn from), relevant annotations
  ◮ good understanding of the pros/cons, features, outputs, results, ...

☞ The goal of this course is to provide you with the core concepts and baseline techniques to achieve the above-mentioned requirements.

(from "Introduction to INLP", J.-C. Chappelier & M. Rajman – slide 22 / 45)

CAVEAT/REMINDER: The goal of this lecture is to give a broad overview of modern Neural Network approaches to NLP. This lecture is worth deepening with a full Deep Learning course, e.g.:
◮ F. Fleuret (Master), Deep Learning (EE-559)
◮ J. Henderson (EDOC), Deep Learning For Natural Language Processing (EE-608)

SLIDE 3

Contents

➀ Introduction
  ◮ What is it all about? What does it change?
  ◮ Why now?
  ◮ Is it worth it?
➁ How does it work?
  ◮ words (word2vec (CBoW, Skip-gram), GloVe, fastText)
  ◮ documents (RNN, CNN, LSTM, GRU)
➂ Conclusion
  ◮ Advantages and drawbacks
  ◮ Future

SLIDE 4

What is it all about?

The modern approach to NLP heavily emphasizes "Neural Networks" and "Deep Learning".
Two key ideas (which are, in fact, quite independent):
◮ make use of a more abstract/algebraic representation of words: use "word embeddings":
  ◮ go from sparse (& high-dimensional) to dense (& less high-dimensional) representations of documents
◮ make use of ("deep") neural networks (= trainable non-linear functions)
Other characteristics:
◮ supervised tasks
◮ better results (at least on usual benchmarks)
◮ less(?) preprocessing/"feature selection"
◮ CPU- and data-hungry

SLIDE 5

How does it work?

◮ Key idea #1: Learning Word Representations
  Typical NLP: Corpus → some algorithm → word/token/n-gram vectors
  Key idea in recent approaches: can we do it in a task-independent way, so as to reduce whatever NL P(rocessing) to some algebraic vector manipulation?
  No longer start "core (NL)P" from words, but from vectors (learned once and for all) that capture general syntactic and semantic information.
◮ Key idea #2: use Neural Networks (NN) to do the "from vectors to output" job
  A NN is simply a R^n → R^m non-linear function with (many) parameters

SLIDE 6

Neural Networks (NN): a 4-slide primer

◮ NN are non-linear, non-parametric (= models with many parameters) functions
◮ The ones we are talking about here are for supervised learning
  ☞ they make use of a loss function to evaluate how well their output fits the desired output
  usual loss: the corpus (negative) log-likelihood: − log P(output | input)
◮ non-linearity: localized on each "neuron" (1-D non-linear function), sigmoid-like (e.g. the logistic function 1/(1+e^{-x})) or ReLU (weird name for a very simple function: max(0, x))

[plots: sigmoid(x) and max(0, x)]

◮ the non-linearity is applied to a linear combination of the inputs: the dot-product of the input (vector) and the parameters ("weight" vector)
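
A minimal sketch (not from the slides; values are arbitrary illustrations) of one artificial neuron in Python/NumPy: a non-linearity applied to the dot-product of input and weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic non-linearity 1/(1+e^-z)

def relu(z):
    return np.maximum(0.0, z)         # max(0, z)

def neuron(x, w, b, activation=sigmoid):
    """One neuron: non-linearity applied to the dot-product of input and weights (plus a bias)."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])        # input vector (arbitrary values)
w = np.array([0.1, 0.4, -0.2])        # weight vector (the parameters to be learned)
print(neuron(x, w, b=0.1, activation=sigmoid))
print(neuron(x, w, b=0.1, activation=relu))
```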

SLIDE 7

Softmax output function

Another famous non-linearity is the "softmax" function: softmax = generalization of the logistic function from 1-D to n-D (see e.g. "Logistic Regression", 2 weeks ago)

Purpose: turns whatever list of values into a probability distribution:
(x1, ..., xm) → (s1, ..., sm) where si = exp(xi) / Σ_{j=1..m} exp(xj)

Examples:
x = (7, 12, −4, 8, 4) → s ≈ (0.0066, 0.9752, 1.1e−7, 0.0179, 0.0003)
x = (0.33, 0.5, 0.1, 0.07) → s ≈ (0.266, 0.316, 0.211, 0.206)
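
A minimal NumPy sketch of the softmax above; it reproduces the two examples. (The shift by max(x) is a standard numerical-stability trick, not part of the slide; it does not change the result.)

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())        # subtracting the max avoids overflow, result unchanged
    return e / e.sum()             # s_i = exp(x_i) / sum_j exp(x_j)

print(softmax([7, 12, -4, 8, 4]))      # ~ [0.0066, 0.9752, 1.1e-07, 0.0179, 0.0003]
print(softmax([0.33, 0.5, 0.1, 0.07])) # ~ [0.266, 0.316, 0.211, 0.206]
```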

SLIDE 8

Multi-Layer Perceptrons (MLP) a.k.a. Feed-Forward NN (FFNN)

MLP (Rumelhart, 1986): neurons are organized in (a few) layers, from input to output.
Parameters: the "weights" of the network = the input weights of each neuron.
MLP are universal approximators: input: x1, ..., xn (n-dimensional real vector), output: ≃ f(x1, ..., xn) ∈ R^m, to whatever precision decided a priori.
In a probabilistic framework: very often used to approximate the posterior probability P(y1, ..., ym | x1, ..., xn).
Convergence to a local minimum of the loss function (often the mean quadratic error).
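
A minimal sketch (illustration only; dimensions and random weights are arbitrary) of an MLP forward pass: each layer applies a non-linearity to an affine transform of the previous layer, with a softmax output approximating a posterior distribution.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, layers):
    """layers: list of (W, b) pairs; hidden layers use ReLU, the output layer uses softmax."""
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)                # one layer: non-linearity of an affine transform
    W, b = layers[-1]
    return softmax(W @ h + b)              # ~ P(y_1..y_m | x_1..x_n)

rng = np.random.default_rng(0)
n, hidden, m = 4, 8, 3                     # input dim, hidden dim, number of classes (toy sizes)
layers = [(rng.normal(size=(hidden, n)), np.zeros(hidden)),
          (rng.normal(size=(m, hidden)), np.zeros(m))]
print(mlp_forward(rng.normal(size=n), layers))   # a probability distribution over 3 classes
```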

SLIDE 9

NN learning procedure

General learning procedure (see e.g. Baum-Welch): ➀ Initialize the parameters ➁ Then loop over training data (supervised):

  • 1. Compute (using NN) output from given input
  • 2. Compute loss by comparing output to reference
  • 3. Update parameters: “backpropagation”:

update proportional to the gradient of the loss function

  • 4. Stop when some criterion is fulfilled

(e.g. loss function is small, validation-set error increases, number of steps is reached)
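
The procedure above, as a schematic sketch for a single-layer logistic model (gradients written out by hand; a real NN would backpropagate through all layers). The data, learning rate and stopping threshold are arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy inputs
y = (X[:, 0] - X[:, 1] > 0).astype(float)      # toy reference outputs (supervised)
w, b, lr = np.zeros(3), 0.0, 0.1               # 1) initialize the parameters

for step in range(200):                        # 2) loop over training data
    out = sigmoid(X @ w + b)                   #    compute output from input
    loss = -np.mean(y * np.log(out + 1e-9) + (1 - y) * np.log(1 - out + 1e-9))
    grad = out - y                             #    compare output to reference
    w -= lr * X.T @ grad / len(y)              #    update proportional to the gradient
    b -= lr * grad.mean()
    if loss < 0.1:                             #    stop when some criterion is fulfilled
        break
```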

SLIDE 10

about Deep Learning (more later)

◮ not all Neural Network (NN) models are deep learners
◮ there is NO need for deep learning to get good "word" embeddings
◮ models: convolutional NN (CNN) or recurrent NN (RNN, incl. LSTM)
◮ they still suffer from the same old problems: overfitting and computational power
A quote from Prof. Michael Jordan (IEEE Spectrum, 2014): "deep learning is largely a rebranding of neural networks, which go back to the 1980s. They actually go back to the 1960s; it seems like every 20 years there is a new wave that involves them. In the current wave, the main success story is the convolutional neural network, but that idea was already present in the previous wave."
Why such a rebirth now? ☞ much more data (user-data pillage), more computational power (GPUs)

SLIDE 11

What is Deep Learning after all?

composition of many functions (neural-net layers) taking advantage of
◮ the chain rule (a.k.a. "back-propagation")
◮ stochastic gradient descent
◮ parameter sharing/localization of computation (a.k.a. "convolutions")
◮ parallel operations on GPUs
This does not differ much from the networks of the 90s: several tricks and algorithmic improvements backed up by

  • 1. large data sets (user-data pillage)
  • 2. large computational resources (GPUs popularized)
  • 3. enthusiasm from academia and industry (hype)

SLIDE 12

Corpus-based linguistics: the evolution

◮ before corpora (< 1970): hand-written rules
◮ first wave (≃ 1980–2015): probabilistic models (HMM, SCFG, CRF, ...)
◮ neural nets and "word" embeddings (1986, 1990, 1997, 2003, 2011, 2013+):
  ◮ MLP: David Rumelhart, 1986
  ◮ RNN: Jeffrey Elman, 1990
  ◮ LSTM: Hochreiter and Schmidhuber, 1997
  ◮ early NN word embeddings: Yoshua Bengio et al., 2003; Collobert & Weston (et al.), 2008 & 2011
  ◮ word2vec (2013), GloVe (2014)
  ◮ ...
◮ transfer learning (2018–): ULMFiT (2018), ELMo (2018), BERT (2018), OpenAI GPT-2 (2019)
  use even more than "word" embeddings: pre-trained early layers feed the later layers of some NN, followed by a (shallow?) task-specific architecture that is trained in a supervised way

SLIDE 13

Is it worth it?

Improved performance on well-known benchmarks

see e.g. https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art), https://nlpoverview.com/#8, http://nlpprogress.com/

Constituency Parsing on the "Wall Street Journal" corpus:

Model                                   Publication             F1 (%)
Probabilistic context-free grammars     Petrov et al. (2006)    91.80
Recursive neural networks               Socher et al. (2011)    90.29
Feature-based transition parsing        Zhu et al. (2013)       91.30
seq2seq learning with LSTM+Attention    Vinyals et al. (2015)   93.50

PoS-tagging on the "Wall Street Journal" corpus:

Name                 Technique            Publication                  Accuracy (%)
TnT                  HMM                  Brants (2000)                96.5
GENiA Tagger         MaxEnt               Tsuruoka et al. (2005)       97.0
—                    Averaged Perceptron  Collins (2002)               97.1
SVMTool              SVM                  Giménez and Márquez (2004)   97.2
Stanford Tagger 2.0  MaxEnt               Manning (2011)               97.3
structReg            CRF                  Sun (2014)                   97.4
Flair                LSTM-CRF             Akbik et al. (2018)          97.8

SLIDE 14

Contents

➀ Introduction
  ◮ What is it all about? What does it change?
  ◮ Why now?
  ◮ Is it worth it?
➁ How does it work?
  ☞ words
    ◮ word2vec (CBoW, skip-gram)
    ◮ GloVe
    ◮ fastText
  ◮ documents
➂ Conclusion
  ◮ Advantages and drawbacks
  ◮ Future

SLIDE 15

Starting point (reminder)

N "row"

  • bjects

(e.g. documents) x(i) characterized by m "features" (e.g. "words") x(i)

j

  • bjects

features i j = "importance" of feature j for object i N m

(i) j

x

(i) j

x

Vector space model:

1

t

2

t

3

t

1

d

2

d

3

d

◮ tokens/words define the axis ◮ documents are point in the vector space

SLIDE 16

From "word" vectors to "word" embeddings

embedding = vectorial representation + dimension reduction:
from a sparse (m ≃ 10^4–10^5) to a dense (= more compact) representation (m ≃ 10^2–10^3)
Why should dense vectors be better?
◮ More efficient (lower dimension: less data to handle, store, estimate, ...)
◮ capture "the essence" (capture statistical invariants): less noisy? (☞ generalize better)

SLIDE 17

Distributional Semantics

Idea (dates back to Harris (1954) and Firth (1957))

There is a high degree of correlation between the observable co-occurrence characteristics of a term and its meaning.

Example
◮ Some X, for instance, naturally attack rats.
◮ The X on the roof was exposing its back to the shine of the sun.
◮ He heard the mewings of X in the forest.
◮ X is a: ...
Typically, word embeddings are trained by "predicting a word based on its context" (or vice-versa) on a large (unlabeled) corpus.

SLIDE 18

Key idea: illustration

[figure: "word 1", "word 2" and "context A" in the embedding space]

SLIDE 19

Word Embeddings

"Word embedding":
◮ numerical representation of "words" (/"tokens")
◮ a.k.a. "Semantic Vectors", "Distributional Semantics"
◮ objective: relative similarities of representations correlate with syntactic/semantic similarity of words/phrases
◮ two key ideas:
  1. representation(composition of words) = vectorial-composition(representations(word))
     for instance: representation(phrase) = Σ_{word ∈ phrase} representation(word)
  2. remove sparseness, compactify the representation: dimension reduction
◮ have been around for a long time

Harris, Z. (1954), "Distributional structure", Word 10(23):146–162. Firth, J.R. (1957), "A synopsis of linguistic theory 1930-1955", Studies in Linguistic Analysis. pp 1–32.

SLIDE 20

Word Embeddings: different techniques

“Many recent publications (and talks) on word embeddings are surprisingly oblivious of the large body of previous work [...]”

(from https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/)

Main techniques:
◮ co-occurrence matrix, often reduced (LSI, Hellinger-PCA (2013), GloVe (2014))
◮ probabilistic/distributional (DSIR, LDA)
◮ shallow (Mikolov et al. 2013) or deep Neural Networks (ELMo)
There are theoretical and empirical correspondences between these different models
[see e.g. Levy, Goldberg and Dagan (2015), Pennington et al. (2014), Österlund et al. (2015)].

Popular word embeddings do not come from Deep Learning, but they can then serve as input to Deep Learners.

SLIDE 21

Word embedding “geometry”

◮ The geometry of embeddings should account for desired properties (e.g. syntax, semantics, synonymy, word classes, ...), e.g. predict a new word representation (embedding) from the sum of the embeddings of the words around it
◮ Word embeddings indeed exhibit some semantic compositionality. Some theoretical justification for this behavior was recently given by Gittens et al. (2017): words need to be uniformly distributed in the embedding space.

  • A. Gittens et al. (2017), "Skip-Gram – Zipf + Uniform = Vector Additivity", proc. ACL.
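
A toy sketch of the compositional use described above: sum the embeddings of surrounding words and compare the result to candidate word vectors with cosine similarity. The 2-D vectors are made up purely to show the mechanics, not to make any semantic claim.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 2-D "embeddings", made up for illustration only
emb = {"black": np.array([0.9, 0.1]), "cat": np.array([0.8, 0.3]),
       "the":   np.array([0.1, 0.1]), "white": np.array([0.85, 0.15])}

context_sum = emb["black"] + emb["the"] + emb["white"]     # sum of surrounding-word embeddings
for w, v in emb.items():
    print(w, round(cosine(v, context_sum), 3))              # similarity of each candidate to the composed context
```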

SLIDE 22

word2vec (Mikolov et al. 2013)

Predict a new word representation (embedding) from the sum of the embeddings of the words around it.
context = (2k+1)-gram around (not including) the word: w_{i−k} ··· w_{i−1} (w_i) w_{i+1} ··· w_{i+k}
Example: "The black cat ate the white mouse". With k = 2 and w = "ate", then c = "black cat the white" (if no other preprocessing).
word2vec comes in 2 flavors:
◮ CBoW (Continuous Bag-of-Words): predicts the current "word" based on its context
◮ Skip-gram: predicts the context from the current "word"

  • T. Mikolov et al. (2013a), "Distributed Representations of Words and Phrases and their Compositionality", proc. NIPS.
  • T. Mikolov et al. (2013b), "Efficient Estimation of Word Representations in Vector Space", proc. ICLR.
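
A small sketch of the (2k+1)-gram context extraction described above (tokenization reduced to a whitespace split for the illustration; a real pipeline would apply its own preprocessing).

```python
def contexts(tokens, k=2):
    """Yield (word, context) pairs; context = up to k tokens on each side, the word itself excluded."""
    for i, w in enumerate(tokens):
        left = tokens[max(0, i - k):i]
        right = tokens[i + 1:i + 1 + k]
        yield w, left + right

sentence = "The black cat ate the white mouse".split()
for word, ctx in contexts(sentence, k=2):
    if word == "ate":
        print(word, ctx)   # ate ['black', 'cat', 'the', 'white']
```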

SLIDE 23

CBoW architecture

[figure: CBoW architecture] © Moucrowap, CC-BY-SA-4.0 2015

SLIDE 24

Skip-gram architecture

[figure: Skip-gram architecture] © Moucrowap, CC-BY-SA-4.0 2015

SLIDE 25

word2vec key ideas

◮ idea #1: unsupervised co-learning of the context representation (c) and the word representation (w), so as to maximize either P(c|w) (skip-gram model) or P(w|c) (CBoW model)
◮ idea #2 ("negative sampling"): also minimize P(w′|c) for words w′ not having c as context
Another key simplification in practice:
◮ turn word prediction (P(w|c)) into binary classification (P(y = 1|w, c))
Example: turn P(X | black cat X the white) (for all words X) into P(Ok | black cat ate the white) (1 number)

SLIDE 26

Illustration

[figure: "word 1", "word 2", "context A", and negative contexts B and C in the embedding space]
SLIDE 27

word2vec method

More formally: the "word embeddings" (i.e. vectors) h_i = h(w(i)) ∈ R^d (for each word w(i) ∈ L) are optimized at the same time as the "reverse projections" m_j ∈ R^d (i.e. the matrix M = (m_j) projects "word embeddings" back to the input space; this corresponds to the weights of the output layer), such that the context log-likelihood L = − Σ_{w ∈ corpus} log Q(c, w) is minimized, where:
◮ in CBoW, using a softmax output layer, Q(c, w) = P(w|c) could be modeled as
  P(w(i)|c) = exp(m_i · h(c)) / Σ_{w(k) ∈ L} exp(m_k · h(c))
  (for a context c of word w(i), h(c) = Σ_{w ∈ c} h(w))
◮ and in skip-gram, Q(c, w) = P(c|w) is modeled similarly.

SLIDE 28

word2vec actual loss

In fact, the softmax is too expensive to compute (and less stable, it seems), so, rather than a softmax, the output is directly σ(m_i · h(c)), with σ() the sigmoid function.
This in fact replaces Q(c, w) = P(w|c) with Q(c, w) = P(y = 1|w, c), the probability of a genuine co-occurrence (i.e. it simplifies a word-prediction task into a binary classification task)...
...which then leads to the idea of learning P(y = 0|w′, c) as well (for some other words w′): negative sampling.
To do this, word2vec draws R negative random samples from the word distribution; the loss function then becomes:
  Σ_{w ∈ corpus} [ log(1 + exp(−m_i · h(c))) + Σ_{r=1..R} log(1 + exp(+m_{j(r)} · h(c))) ]
(where c is the context of w, i is the index of w in the lexicon (i.e. w = w(i)), and j(r) is drawn at random)
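
A NumPy sketch of the loss above for one (word, context) pair with R negative samples. The vectors and the "sampling" are dummy placeholders; a real implementation would loop over the corpus, draw negatives from the word distribution, and update h and m by gradient descent.

```python
import numpy as np

def sgns_loss(h_c, m_pos, m_negs):
    """h_c: summed context embedding h(c); m_pos: output vector m_i of the observed word;
    m_negs: output vectors m_j(r) of the R negative samples."""
    loss = np.log1p(np.exp(-m_pos @ h_c))          # log(1 + exp(-m_i . h(c)))
    for m_neg in m_negs:
        loss += np.log1p(np.exp(m_neg @ h_c))      # log(1 + exp(+m_j(r) . h(c)))
    return loss

rng = np.random.default_rng(0)
d, R = 50, 5                                        # embedding dimension and number of negatives (toy values)
h_c = rng.normal(scale=0.1, size=d)                # stand-in for h(c), the sum of context embeddings
m_pos = rng.normal(scale=0.1, size=d)              # stand-in for m_i
m_negs = rng.normal(scale=0.1, size=(R, d))        # stand-ins for the m_j(r)
print(sgns_loss(h_c, m_pos, m_negs))
```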

SLIDE 29

GloVe (Pennington et al. 2014)

GloVe (Global Vectors) is another famous (non-NN) "word" embedding method which works directly on the word co-occurrence matrix:
◮ normalizing the co-occurrence counts,
◮ log-smoothing them,
◮ then factorizing the matrix to get lower-dimensional representations by minimizing some "reconstruction loss" (difference between the dot-product of word embeddings and the log of the probability of co-occurrence)
GloVe embeddings work better on some data sets, while word2vec embeddings work better on others.
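
The "reconstruction loss" can be sketched as follows (a simplified rendition of the slide's description: dot-products of word vectors vs. log co-occurrences; the actual GloVe objective in Pennington et al. (2014) also adds bias terms and a weighting function f(X_ij), omitted here).

```python
import numpy as np

def glove_like_loss(W, W_tilde, X):
    """Squared difference between dot-products of word/context vectors and log co-occurrence counts.
    W, W_tilde: (V, d) word and context matrices; X: (V, V) co-occurrence counts."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):                       # only observed co-occurrences
        loss += (W[i] @ W_tilde[j] - np.log(X[i, j])) ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 6, 4                                                 # toy vocabulary size and embedding dimension
X = rng.integers(0, 5, size=(V, V)).astype(float)           # toy co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
print(glove_like_loss(W, W_tilde, X))                       # to be minimized w.r.t. W and W_tilde
```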

  • J. Pennington, R. Socher, and C. D. Manning (2014), "GloVe: Global Vectors for Word Representation", proc. EMNLP.

SLIDE 30

In practice

In practice, you can either:
◮ construct your own embeddings
  CBoW: for corpora with short sentences but a high number of samples
  Skip-gram: for corpora with long sentences and a low number of samples (infrequent words)
  or GloVe, fastText, ELMo
◮ use existing word embeddings
  word embeddings generally provide helpful features without the need for lengthy training (for the NN)
Some software/models: word2vec, GloVe, Gensim, fastText, ELMo, ...
Advice: when using already computed "word" embeddings, use the same preprocessing that was used to build them: get your vocabulary (words? tokens?) as close to the embeddings' as possible, e.g. gensim.utils.tokenize(): "maximal contiguous sequences of alphabetic characters (no digits!)" (sic)
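
For instance, constructing and querying embeddings with gensim might look like the sketch below (parameter names assume gensim ≥ 4.0, where the dimension is vector_size and sg=1 selects skip-gram / sg=0 CBoW; check your installed version). Note the same tokenizer is used for training and querying, as advised above.

```python
from gensim.models import Word2Vec
from gensim.utils import tokenize

raw = ["The black cat ate the white mouse.",
       "The white cat ate the black mouse."]
sentences = [list(tokenize(s, lowercase=True)) for s in raw]   # same preprocessing throughout

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv["cat"][:5])            # first 5 components of the learned 50-dim vector
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity
```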

SLIDE 31

fastText (Joulin, Bojanowski, Mikolov et al. 2017)

Aim: address the token/OoV issue by using embeddings of character n-grams (< token).
More useful for less semantic but more lexical tasks (e.g. morphology, POS-tagging or even NER); also useful for out-of-vocabulary words.
fastText = skip-gram on character n-grams.
The method is fast, which allows quick training of new models on large corpora; it looks promising in terms of speed, scalability, and effectiveness.
A better model: Embeddings from Language Models (ELMo): compute a different word embedding for different contexts.
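
A sketch of the character n-gram decomposition fastText relies on (boundary markers < and > as in Bojanowski et al. (2017); the n-gram range is a parameter, 3–6 by default in the paper). Each n-gram gets its own vector, and a word vector is the sum of its n-gram vectors, which is why out-of-vocabulary words still get a representation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', fastText-style; the full marked word is kept as well."""
    marked = f"<{word}>"
    grams = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```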

  • A. Joulin et al. (2017), "Bag of Tricks for Efficient Text Classification", proc. EACL.
  • P. Bojanowski et al. (2017), "Enriching Word Vectors with Subword Information", Trans. ACL, vol. 5.
  • M. E. Peters et al. (2018), "Deep Contextualized Word Representations", proc. NAACL.

SLIDE 32

Contents

➀ Introduction
  ◮ What is it all about? What does it change?
  ◮ Why now?
  ◮ Is it worth it?
➁ How does it work?
  ◮ words
  ☞ documents
    ◮ Convolutional Neural Networks (CNN)
    ◮ Recurrent Neural Networks (RNN): LSTM, GRU
➂ Conclusion
  ◮ Advantages and drawbacks
  ◮ Future

SLIDE 33

From "words" to sentences/documents

word2vec: how to go from tokens to compound words, phrases, sentences, documents?
Compounds/Named Entities/Phrases: idioms like "hot potato" or named entities such as "Boston Globe" do not represent the combination of the meanings of the individual words. One solution to this problem, as explored by Mikolov et al. (2013), is to identify such phrases based on word co-occurrence and train embeddings for them separately. More recent methods have explored directly learning n-gram embeddings from unlabeled data.
How to represent a document: average/sum of its word vectors? ☞ not so good
Solution: an effective feature function that extracts higher-level features from the constituent token n-grams: CNN and RNN

SLIDE 34

Convolutional Neural Nets (CNN; Fukushima (1980), Le Cun (1998))

Original key idea (inspired by the visual cortex): share weights

SLIDE 35

CNN for NLP (example)

[figure: CNN for sentence classification] (source: Zhang and Wallace (2015))
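
A NumPy sketch of the core operation in such a text CNN, in the spirit of the figure: a filter slides over windows of h consecutive word embeddings (weights shared across positions), followed by max-over-time pooling. Dimensions and values are arbitrary illustrations.

```python
import numpy as np

def conv_max_pool(E, F, b=0.0):
    """E: (sentence_len, d) word-embedding matrix; F: (h, d) one convolution filter.
    Returns the max-over-time of the feature map (one number per filter)."""
    h = F.shape[0]
    feats = [np.maximum(0.0, np.sum(E[i:i + h] * F) + b)    # ReLU(filter . window + b)
             for i in range(E.shape[0] - h + 1)]
    return max(feats)                                       # max-over-time pooling

rng = np.random.default_rng(0)
E = rng.normal(size=(7, 5))                                 # 7 tokens, 5-dim embeddings (toy values)
filters = [rng.normal(size=(3, 5)) for _ in range(4)]       # 4 filters over 3-token windows
print([round(conv_max_pool(E, F), 3) for F in filters])     # a 4-dim sentence feature vector
```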

SLIDE 36

Recurrent Neural Networks (Elman 1990)

Designed to deal with sequences (of vectors) by composing former intermediate representations (= outputs):
the output is a function of the current input and the previous output.

[figure: recurrent cell unrolled over time] © François Deloche, CC-BY-SA-4.0
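
That recurrence can be written as a single update rule; a minimal sketch (Elman-style, tanh non-linearity, arbitrary dimensions and random weights):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """New state = non-linear function of the current input and the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
d, n = 5, 8                                   # input (embedding) dim, hidden dim (toy sizes)
W_xh, W_hh, b = rng.normal(size=(n, d)), rng.normal(size=(n, n)), np.zeros(n)
h = np.zeros(n)                               # initial state
for x_t in rng.normal(size=(4, d)):           # a toy sequence of 4 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b)       # the state carries information along the sequence
print(h)
```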

RNN are generalized to:
◮ bidirectional RNN
◮ RNN with gates:
  ◮ Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber (1997))
  ◮ Gated Recurrent Unit (GRU; Cho et al. (2014))

SLIDE 37

Typical simple RNN for NLP

[diagram: x_i → RNN → Classif. → Softmax → y]
◮ x_i: "word" embedding of the i-th word/token
◮ y: output = probability distribution, e.g. y_j ≃ P(Class_j | w_1 ... w_i)
◮ "Classif.": an MLP
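
A possible rendition of this pipeline in PyTorch (a sketch with made-up sizes; the slide does not prescribe a particular library, and the classifier here is a single linear layer rather than the MLP mentioned above).

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, hidden=64, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # token id -> "word" embedding x_i
        self.rnn = nn.RNN(emb_dim, hidden, batch_first=True)
        self.clf = nn.Linear(hidden, n_classes)             # "Classif." (here a single linear layer)

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        _, h_last = self.rnn(self.emb(token_ids))           # h_last: (1, batch, hidden)
        return torch.softmax(self.clf(h_last[0]), dim=-1)   # y_j ~ P(Class_j | w_1 ... w_i)

model = RNNClassifier()
tokens = torch.randint(0, 1000, (2, 7))                     # 2 dummy sentences of 7 token ids
print(model(tokens))                                        # 2 probability distributions over 3 classes
```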

SLIDE 38

RNN with gates

Limitations of classical RNNs:
◮ vanishing gradients: addressed with a gate neuron/vector, learning to forget some parts of the memory
◮ exploding gradients: addressed by gradient clipping
Gate neuron: a 0/1 selection (elementwise product) of the input components — a filter (= gate) on the input/memory information.

[figure: gating mechanism] (source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
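
Gating as an elementwise filter, in a two-line sketch (toy values; in an actual LSTM/GRU the gate is computed by a learned sigmoid layer over the input and the previous state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

memory = np.array([0.7, -1.3, 2.0, 0.4])          # some memory/state vector (toy values)
gate = sigmoid(np.array([6.0, -6.0, 6.0, 0.0]))   # ~ (1, 0, 1, 0.5): what to keep / forget
print(gate * memory)                              # elementwise product: the filtered information
```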

SLIDE 39

LSTM vs. GRU

[figure: LSTM and GRU cells] (source: Chung et al. (2014))

SLIDE 40

Neuron-type summary

[figure: neuron-type summary] from D. Jurafsky & J. H. Martin, Speech and Language Processing, draft 3rd edition

SLIDE 41

Example applications (1/2)

Image caption generator:

[figure: image caption generator] from Vinyals et al. (2015)

SLIDE 42

Example applications (2/2)

Image question answering engine:

[figure: image question answering] from Malinowski et al. (2015)

SLIDE 43

Conclusion

The modern approach to NLP heavily emphasizes "Neural Networks" and "Deep Learning".
Two key ideas (which are, in fact, quite independent):
◮ "word embeddings":
  ◮ go from sparse (& high-dimensional) to dense (& less high-dimensional) representations of documents
◮ make use of ("deep") neural networks (= trainable non-linear functions)
Models:
◮ word embeddings: word2vec (CBoW, Skip-gram), GloVe, fastText, ELMo
◮ neural networks: CNN, LSTM, GRU
(software: spaCy, Keras, Torch/PyTorch, TensorFlow, scikit-learn, DarkNet)

SLIDE 44

Pros and Cons

◮ Best performance, but requires lots of data (unsupervised for word embeddings, supervised for the task-oriented NN) and lots of CPU(/GPU)
◮ word embeddings are dependent on the applications in which they are used. Labutov and Lipson ("word re-embedding", 2013) proposed task-specific embeddings which retrain the word embeddings to align them with the current task space.
◮ Traditional word embedding algorithms assign a distinct vector to each word. This makes them unable to account for polysemy. Several approaches address this issue, e.g. Upadhyay et al. (2017), ELMo (M. E. Peters et al. (2018)).
◮ discussions on the relevance of word embeddings in the long run have cropped up recently, e.g. Lucy and Gauthier (2017) tried to evaluate how well word vectors capture the necessary facets of conceptual meaning. The authors discovered severe limitations in the perceptual understanding of the concepts behind the words, which cannot be inferred from distributional semantics alone.

SLIDE 45

Future(?): Transfer Learning

Transfer learning (2018–): ULMFiT (2018), ELMo (2018), BERT (2018), OpenAI GPT-2 (2019)
use even more than "word" embeddings: early layers pre-trained on some task 'A' feed the later layers of some NN trained in a supervised way on a task 'B'
based on "transformer" models: include "attention" layers (Vaswani et al., NIPS 2017) in FFNN

SLIDE 46

References

◮ D. Jurafsky & J. H. Martin, Speech and Language Processing, draft 3rd edition, chap. 6, 7 & 9, https://web.stanford.edu/~jurafsky/slp3/, 2019.
◮ Y. Goldberg, Neural Network Methods for Natural Language Processing, Morgan & Claypool Publishers, 2017. https://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037

SLIDE 47

Word Embeddings: some references

  • R. Lebret and R. Collobert (2013), "Word Embeddings through Hellinger PCA", proc. EACL.
  • T. Mikolov et al. (2013a), "Distributed Representations of Words and Phrases and their Compositionality", proc. NIPS.
  • T. Mikolov et al. (2013b), "Efficient Estimation of Word Representations in Vector Space", proc. ICLR.
  • J. Pennington, R. Socher, and C. D. Manning (2014), "GloVe: Global Vectors for Word Representation", proc. EMNLP.
  • O. Levy, Y. Goldberg and I. Dagan (2015), "Improving distributional similarity with lessons learned from word embeddings", Trans. ACL, vol. 3, pp. 211–225.
  • Österlund et al. (2015), "Factorization of Latent Variables in Distributional Semantic Models", proc. EMNLP.
  • A. Joulin et al. (2017), "Bag of Tricks for Efficient Text Classification", proc. EACL.
  • P. Bojanowski et al. (2017), "Enriching Word Vectors with Subword Information", Trans. ACL, vol. 5.
  • A. Gittens et al. (2017), "Skip-Gram – Zipf + Uniform = Vector Additivity", proc. ACL.
  • M. E. Peters et al. (2018), "Deep Contextualized Word Representations", proc. NAACL.
