SLIDE 1

Traitement automatique des langues : Fondements et applications

Lecture 10: Neural networks (1)
Tim Van de Cruys & Philippe Muller
2016–2017

SLIDE 2

Introduction

Machine learning for NLP

  • Standard approach: linear model trained over high-dimensional but very sparse feature vectors
  • Recently: non-linear neural networks over dense input vectors

SLIDE 3

Neural Network Architectures

Feed-forward neural networks

  • Best known, standard neural network approach
  • Fully connected layers
  • Can be used as drop-in replacement for typical NLP classifiers
SLIDE 4

Feature representation

Dense vs. one hot

  • One-hot: each feature is its own dimension
      • Dimensionality of the vector equals the number of features
      • Each feature is completely independent of the others
  • Dense: each feature is a d-dimensional vector
      • Dimensionality is d
      • Similar features have similar vectors
SLIDE 5

Feature representation

Feature combinations

  • Traditional NLP: manually specify interactions (combinations) of features
      • E.g. a feature like 'word is jump, tag is V, and previous word is they'
  • Non-linear network: only specify core features
      • The non-linearity of the network takes care of finding indicative feature combinations

SLIDE 6

Feature representation

Why dense?

  • Discrete approach often works surprisingly well for NLP tasks
      • n-gram language models
      • POS-tagging, parsing
      • sentiment analysis
  • Still, a very poor representation of word meaning
      • No notion of similarity
      • Limited inference

SLIDE 7

Feature representation

Why dense?

  • Discrete approach often works surprisingly well for NLP tasks
      • n-gram language models
      • POS-tagging, parsing
      • sentiment analysis
  • Still, a very poor representation of word meaning
      • No notion of similarity
      • Limited inference

[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ]

SLIDE 8

Feature representation

Why dense?

  • Discrete approach often works surprisingly well for NLP tasks
      • n-gram language models
      • POS-tagging, parsing
      • sentiment analysis
  • Still, a very poor representation of word meaning
      • No notion of similarity
      • Limited inference

[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ]
[ 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ]

SLIDE 9

Feature representation

Why dense?

SLIDE 10

Feature representation

Why dense?

SLIDE 11

Feature representation

Why dense?

SLIDE 12

Feed-forward

Architecture

Multi-layer perceptron with 2 hidden layers:

    NN_MLP2(x) = y
    h1 = g(x W1 + b1)
    h2 = g(h1 W2 + b2)
    y  = h2 W3

x: vector of size d_in = 3
y: vector of size d_out = 2
h1, h2: vectors of size d_hidden = 4

SLIDE 13

Feed-forward

Architecture

Multi-layer perceptron with 2 hidden layers:

    NN_MLP2(x) = y
    h1 = g(x W1 + b1)
    h2 = g(h1 W2 + b2)
    y  = h2 W3

W1, W2, W3: matrices of size [3 × 4], [4 × 4], [4 × 2]
b1, b2: 'bias' vectors of size d_hidden = 4
g(·): non-linear activation function (applied elementwise)

SLIDE 14

Feed-forward

Architecture

Multi-layer perceptron with 2 hidden layers:

    NN_MLP2(x) = y
    h1 = g(x W1 + b1)
    h2 = g(h1 W2 + b2)
    y  = h2 W3

W1, W2, W3, b1, b2: the parameters of the network (θ)
The use of multiple hidden layers is what is meant by 'deep learning'
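
The forward pass above can be written directly in numpy; the sizes (3, 4, 4, 2) match the slide, while the activation g and the random parameter values are placeholders:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 3, 4, 2

W1 = rng.normal(size=(d_in, d_hidden));     b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_hidden)); b2 = np.zeros(d_hidden)
W3 = rng.normal(size=(d_hidden, d_out))

g = np.tanh                                    # any non-linear activation

def nn_mlp2(x):
    h1 = g(x @ W1 + b1)                        # first hidden layer
    h2 = g(h1 @ W2 + b2)                       # second hidden layer
    return h2 @ W3                             # output layer (no non-linearity)

x = rng.normal(size=d_in)
print(nn_mlp2(x))                              # vector of size d_out = 2
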

SLIDE 15

Feed-forward

Non-linear activation functions

Sigmoid (logistic) function:

    σ(x) = 1 / (1 + e^(−x))

[plot: sigmoid curve for x from −6 to 6]

SLIDE 16

Feed-forward

Non-linear activation functions

Hyperbolic tangent (tanh) function:

    tanh(x) = (e^(2x) − 1) / (e^(2x) + 1)

SLIDE 17

Feed-forward

Non-linear activation functions

Rectified linear unit (ReLU):

    ReLU(x) = max(0, x)
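
The three activation functions in numpy (the tanh form mirrors the slide; np.tanh would give the same result):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equivalent to np.tanh(x)

def relu(x):
    return np.maximum(0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
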

SLIDE 18

Feed-forward

Output transformation function

Softmax function, for x = x1 ... xk:

    softmax(x_i) = e^(x_i) / Σ_{j=1..k} e^(x_j)
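
In numpy, with the usual max-subtraction trick for numerical stability (not on the slide, but standard practice):

import numpy as np

def softmax(x):
    z = x - np.max(x)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))             # sums to 1: a probability distribution
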

SLIDE 19

Feed-forward

Input vector

  • Embedding lookup from embedding matrix
  • Concatenate or sum the embeddings
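
A sketch of how the input vector is built (the vocabulary size, word indices, and embedding matrix are illustrative):

import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50
E = rng.normal(size=(V, d))                    # embedding matrix

word_ids = [12, 7, 431]                        # e.g. three feature words
embs = E[word_ids]                             # lookup: 3 x d

x_concat = embs.reshape(-1)                    # concatenation: vector of size 3*d
x_sum = embs.sum(axis=0)                       # sum: vector of size d
print(x_concat.shape, x_sum.shape)             # (150,) (50,)
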
SLIDE 20

Feed-forward

Loss functions

  • L(ŷ, y): the loss of predicting ŷ when the true output is y
  • Set the parameters θ so as to minimize the loss across the different training examples
  • Compute the gradient of the parameters with respect to the loss function; to find the minimum, take steps in the right direction

SLIDE 21

Feed-forward

Loss functions

  • Hinge loss (binary and multi-class)
      • classify the correct class over the incorrect class(es) with a margin of at least 1
  • Categorical cross-entropy loss (negative log-likelihood)
      • measures the difference between the true class distribution y and the predicted class distribution ŷ
      • use with a softmax output
  • Ranking loss
      • in an unsupervised setting: rank attested examples over unattested, corrupted ones with a margin of at least 1
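
A sketch of the categorical cross-entropy loss for a softmax output, assuming a one-hot true distribution:

import numpy as np

def cross_entropy(y_hat, y):
    """Negative log-likelihood of the true classes under the predicted distribution y_hat."""
    eps = 1e-12                                   # avoid log(0)
    return -np.sum(y * np.log(y_hat + eps))

y_true = np.array([0.0, 1.0, 0.0])                # one-hot true class
y_pred = np.array([0.1, 0.7, 0.2])                # softmax output
print(cross_entropy(y_pred, y_true))              # -log(0.7), about 0.357
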

SLIDE 22

Training

Stochastic gradient descent

  • Goal: minimize the total loss Σ_{i=1..n} L(f(x_i; θ), y_i)
  • Estimating the gradient over the entire training set before taking a step is computationally heavy
  • Instead, compute the gradient for a small batch of samples from the training set → an estimate of the gradient: 'stochastic'
  • Learning rate λ: size of the step in the right direction
  • Improvements: momentum, adaptive learning rates
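
A minimal mini-batch SGD sketch for a linear model with squared loss; the model, data, and hand-derived gradient are placeholders so the loop stays self-contained (a real network would obtain its gradients via backpropagation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # toy training data
true_w = rng.normal(size=5)
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(5)                                 # parameters θ
lr = 0.1                                        # learning rate λ
batch_size = 16

for epoch in range(20):
    order = rng.permutation(len(X))             # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(idx)   # gradient of the squared loss on the batch
        w -= lr * grad                          # step against the gradient

print(np.allclose(w, true_w, atol=0.05))        # True: close to the true weights
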
SLIDE 23

Training

Stochastic gradient descent

  • Size of the mini-batch: balance between a better gradient estimate and faster convergence
  • Gradients over the different parameters (weight matrices, bias terms, embeddings, ...) are efficiently calculated using the backpropagation algorithm
  • No need to carry out the derivations yourself: automatic tools compute gradients using computational graphs

SLIDE 24

Training

Initialization

  • Parameters of network are initialized randomly
  • Magnitude of random samples has effect on training success
  • effective initialization schemes exist
SLIDE 25

Training

Misc

  • Shuffling: shuffle the training set before each epoch
  • Learning rate: balance between proper convergence and fast convergence
  • Mini-batch size: balance between speed and a proper gradient estimate; efficiently computed on a GPU

SLIDE 26

Training

Regularization

  • Neural networks have many parameters: risk of overfitting
  • Solution: regularization
      • L2: extend the loss function with a squared penalty on the parameters, i.e. (λ/2) ||θ||²
      • Dropout: randomly drop (set to zero) half of the neurons in the network for each training sample
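
A sketch of both regularizers in numpy; the dropout rate of 0.5 follows the slide, and the rescaling ('inverted dropout') is a common extra detail not mentioned above:

import numpy as np

def l2_penalty(params, lam):
    """(lam / 2) * ||theta||^2, summed over all parameter arrays."""
    return 0.5 * lam * sum(np.sum(p ** 2) for p in params)

def dropout(h, p=0.5, rng=np.random.default_rng(0)):
    """Randomly zero out units with probability p, rescale the survivors."""
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones(8)
print(dropout(h))            # roughly half of the activations set to zero
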

SLIDE 27

Word embeddings

  • Each word i is represented by a small, dense vector vi ∈ Rd
  • d is typically in the range 50–1000
  • Matrix of size V (vocabulary size) × d (embedding size)
  • words are ‘embedded’ in a real-valued, low-dimensional space
  • Similar words have similar embeddings
SLIDE 28

Word embeddings

  • Each word i is represented by a small, dense vector vi ∈ Rd
  • d is typically in the range 50–1000
  • Matrix of size V (vocabulary size) × d (embedding size)
  • words are ‘embedded’ in a real-valued, low-dimensional space
  • Similar words have similar embeddings

             d1      d2      d3    ...
    apple   –2.34   –1.01    0.33
    pear    –2.28   –1.20    0.11
    car     –0.20    1.02    2.44
    ...

SLIDE 29

Word embeddings

  • Each word i is represented by a small, dense vector vi ∈ Rd
  • d is typically in the range 50–1000
  • Matrix of size V (vocabulary size) × d (embedding size)
  • words are ‘embedded’ in a real-valued, low-dimensional space
  • Similar words have similar embeddings
SLIDE 30

Neural word embeddings

  • Word embeddings have been around for quite some time
  • The term 'embedding' was coined within the neural network community, along with new methods to learn them
  • Idea: let's allocate a number of parameters for each word and allow the neural network to automatically learn what the useful values should be
  • Prediction-based: learn to predict the next word

SLIDE 31

Embeddings through language modeling

  • Predict the next word in a sequence, based on the previous word
  • One non-linear hidden layer, one softmax layer for classification
  • Choose parameters that optimize the probability of the correct word

SLIDE 32

Embeddings through language modeling

  • Predict the next word in a sequence, based on the previous word
  • One non-linear hidden layer, one softmax layer for classification
  • Choose parameters that optimize the probability of the correct word

SLIDE 33

Embeddings through error detection

  • Take a correct sentence and create a corrupted counterpart
  • Train the network to assign a higher score to the correct version of each sentence

SLIDE 34

Embeddings through error detection

  • Take a correct sentence and create a corrupted counterpart
  • Train the network to assign a higher score to the correct version of each sentence

SLIDE 35

Word2vec

  • Neural network approaches work well, but the large number of parameters makes them computationally heavy
  • Popular, light-weight approach with fewer parameters: word2vec
  • No hidden layer, only a softmax classifier
  • Two different models:
      • Continuous bag of words (CBOW): predict the current word based on the surrounding context words
      • Skip-gram: predict the surrounding context words based on the current word
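
In practice word2vec is rarely implemented from scratch; a hedged sketch with gensim (assuming gensim 4.x, where sg=0 selects CBOW and sg=1 selects skip-gram, and the toy corpus is purely illustrative):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]   # toy corpus

model = Word2Vec(sentences, vector_size=50, window=2,
                 sg=1, negative=5, min_count=1, epochs=50)

print(model.wv["cat"].shape)          # (50,)
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity
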
SLIDE 36

CBOW

  • The current word w_t is predicted from its context words
  • The prediction is made from the sum of the context embeddings

SLIDE 37

Skip-gram

  • Each context word is predicted from the current word
  • The parameters of each softmax classifier are shared

SLIDE 38

Negative sampling

  • Computing the full softmax classifier is still rather expensive
  • Only compute the score for the correct context and for a number of wrong contexts
  • Maximize the score of the correct contexts and minimize the score of the wrong ones
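
A numpy sketch of the negative-sampling objective: one dot product per (word, context) pair instead of a softmax over the whole vocabulary (the embedding matrices and the sampled contexts are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d = 1000, 50
W_in = rng.normal(scale=0.1, size=(V, d))      # word embeddings
W_out = rng.normal(scale=0.1, size=(V, d))     # context embeddings

word, pos_ctx = 3, 17                          # observed (word, context) pair
neg_ctx = rng.integers(0, V, size=5)           # 5 sampled 'wrong' contexts

# maximize the score of the correct context, minimize the scores of the sampled ones
loss = -np.log(sigmoid(W_out[pos_ctx] @ W_in[word])) \
       - np.sum(np.log(sigmoid(-W_out[neg_ctx] @ W_in[word])))
print(loss)
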

SLIDE 39

Word similarity

Nearest neighbours of four query words (French embeddings):

    france            jésus          rouge     peinture
    grande-bretagne   christ         jaune     sculpture
    belgique          jésus-christ   bleu      gravure
    espagne           moïse          vert      dessin
    angleterre        dieu           blanc     photographie
    pologne           prophète       noir      peintures
    bretagne          résurrection   bleue     toile
    suède             mahomet        gris      céramique
    italie            abraham        verte     poésie
    roumanie          saints         blanche   peintre
    tunisie           anges          noire     art

SLIDE 40

Word similarity

SLIDE 41

Word similarity

SLIDE 42

Word similarity

SLIDE 43

Word similarity

SLIDE 44

Word similarity

SLIDE 45

Computing analogies

  • Questions of the form: a is to b as c is to d
  • The system is given a, b, and c; it needs to compute d. For example:
      apple is to apples as car is to ?
      man is to woman as king is to ?

SLIDE 46

Computing analogies

  • Task: a is to b as c is to d
  • Idea: the direction of the relation should remain the same

    a − b ≈ c − d
    man − woman ≈ king − queen
    queen ≈ king − man + woman

    d = argmax_{d′ ∈ V} cos(d′, c − a + b)
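
A sketch of that argmax over cosine similarities with toy vectors (a real setting would use trained embeddings and a full vocabulary, excluding the query words a, b, c from the candidates):

import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# toy embeddings; in practice these come from a trained model
emb = {"man":   np.array([1.0, 0.0, 0.2]),
       "woman": np.array([1.0, 1.0, 0.2]),
       "king":  np.array([1.0, 0.0, 1.0]),
       "queen": np.array([1.0, 1.0, 1.0])}

a, b, c = emb["man"], emb["woman"], emb["king"]
target = c - a + b                                 # king - man + woman

candidates = {w: v for w, v in emb.items() if w not in ("man", "woman", "king")}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best)                                        # 'queen'
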

SLIDE 47

Computing analogies

SLIDE 48

Computing analogies

SLIDE 49

Computing analogies

SLIDE 50

Computing analogies

SLIDE 51

Computing analogies

SLIDE 52

Computing analogies

    Relationship        Example 1            Example 2           Example 3
    France - Paris      Italy: Rome          Japan: Tokyo        Florida: Tallahassee
    big - bigger        small: larger        cold: colder        quick: quicker
    Sarkozy - France    Berlusconi: Italy    Merkel: Germany     Koizumi: Japan
    copper - Cu         zinc: Zn             gold: Au            uranium: plutonium
    Japan - sushi       Germany: bratwurst   France: tapas       USA: pizza

SLIDE 53

Computing analogies

a is to b as c is to d

    homme roi : femme reine
    autriche vienne : allemagne berlin
    écrivain livre : poète poème
    france nicolas sarkozy : états-unis bush
    droite ump : gauche ps
    droite front national : gauche pcf

SLIDE 54

Prediction-based vs. count-based

  • Count-based: extract co-occurrence frequencies from a corpus
  • Prediction-based: induce parameters based on the prediction of the correct word
  • Initially seen as two different methods, where the prediction-based approach works better
  • Later, emerging evidence and proofs that they are actually equivalent or quite similar

SLIDE 55

Word Embeddings

unsupervised pre-training: parameters

  • Training objective
  • Choice of contexts (bag of words, syntax, parallel corpus)
  • Window size (small, sentence, paragraph), dynamic, ...
  • Directed
  • Lemmatization, normalization, stop words, ...
  • Character-based, sub-word models
SLIDE 56

Word embeddings in practice

  • Word embeddings are often used for pretraining
  • Unsupervised: only requires plain text, so they can be trained on a lot of data
  • Fast algorithms are available
  • Pretraining helps a model start from an informed position
  • The model is initialized with pretrained word embeddings, and then fine-tuned depending on the task
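
A hedged sketch of this pretrain-then-finetune setup with Keras (the vocabulary size, dimensionality, classifier layers, and the pretrained matrix are placeholders; trainable=True is what allows the embeddings to be fine-tuned for the task):

import numpy as np
import tensorflow as tf

vocab_size, emb_dim = 10000, 100
pretrained = np.random.normal(size=(vocab_size, emb_dim))   # stand-in for word2vec vectors

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained),
        trainable=True),                       # fine-tune embeddings during task training
    tf.keras.layers.GlobalAveragePooling1D(),  # CBOW-style averaging of word vectors
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
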

SLIDE 57

Word embeddings in practice

SLIDE 58

Beyond words

  • Embeddings are not restricted to words
  • They can equally be computed for sentences, paragraphs, documents
  • Different methods:
      • word2vec-like method (paragraph vector)
      • Convolutional neural networks
      • Recurrent neural networks (sequence-to-sequence learning)
  • A current research trend in NLP, with applications in machine translation, paraphrase detection, etc.

SLIDE 59

Neural Network Architectures

Convolutional neural networks

  • Type of feed-forward neural network
  • Certain layers are not fully connected (convolutional layers, pooling layers)
  • Local cues can appear in different places in the input (cf. vision)

SLIDE 60

Neural Network Architectures

Recurrent (+ recursive) neural networks

  • Handle structured data of arbitrary sizes
  • Recurrent networks for sequences
  • Recursive networks for trees
SLIDE 61

Convolutional layers

Introduction

How to represent a variable number of features, e.g. the words in a sentence or document?

  • Continuous bag of words (CBOW): sum the embedding vectors of the corresponding features
      • no ordering information ("not good quite bad" = "not bad quite good")
  • Convolutional layer:
      • 'sliding window' approach that takes local structure into account
      • combines the individual windows to create a vector of fixed size

SLIDE 62

Feature representation

Variable number of features

  • Feed-forward networks assume a fixed-dimensional input
  • How to represent a variable number of features, e.g. the words in a sentence or document?
  • Continuous bag of words (CBOW): sum the embedding vectors of the corresponding features

SLIDE 63

Convolutional layers

Basic convolution + pooling

  • Goal: identify indicative local features (n-grams) in a large structure, and combine them into a fixed-size vector
  • Convolution: apply a filter to each window (linear transformation + non-linear activation)
  • Pooling: combine the windows by taking the maximum
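
A numpy sketch of convolution plus max-pooling over word windows (window size 2; the sentence embeddings, filter weights, and number of filters are random placeholders):

import numpy as np

rng = np.random.default_rng(0)
d, k, n_filters = 4, 2, 3                      # embedding size, window size, number of filters

sentence = rng.normal(size=(6, d))             # 6 words, each a d-dimensional embedding
W = rng.normal(size=(k * d, n_filters))        # one filter per output dimension
b = np.zeros(n_filters)

# convolution: apply the same filters to every window of k consecutive words
windows = [sentence[i:i + k].reshape(-1) for i in range(len(sentence) - k + 1)]
conv = np.tanh(np.stack(windows) @ W + b)      # (n_windows, n_filters)

pooled = conv.max(axis=0)                      # max pooling: fixed-size vector
print(pooled.shape)                            # (3,)
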
SLIDE 64

Convolutional layers

Narrow vs. wide convolution

  • Whether or not to include padding elements at the beginning and end
  • E.g. "the quick brown fox", convolution of size 2:
      • Narrow: [the quick], [quick brown], [brown fox]
      • Wide: [<PAD> the], [the quick], [quick brown], [brown fox], [fox <PAD>]

SLIDE 65

Convolutional layers

Variations

  • Dynamic: divide the sentence/document into different regions, and pool over each region
  • Hierarchical: a succession of convolution and pooling layers
  • K-max pooling: keep the top-k values when pooling

SLIDE 66

Recurrent neural network

Introduction

  • CBOW: no ordering, no structure
  • CNN: an improvement, but captures mostly local patterns
  • RNN: represent arbitrarily sized structured input as fixed-size vectors, while paying attention to its structural properties

SLIDE 67

Recurrent neural network

Model

  • x1: input layer (current word)
  • a1: hidden layer at the current timestep
  • a0: hidden layer at the previous timestep
  • U, W and V: weight matrices
  • f(·): element-wise activation function (sigmoid)
  • g(·): softmax function to ensure a probability distribution

    a1 = f(U x1 + W a0)
    y1 = g(V a1)
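
The two equations as a numpy step function, unrolled over a short input sequence (all dimensions and parameter values are illustrative):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 5, 8, 10                  # input size, hidden size, output (vocabulary) size

U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
V = rng.normal(scale=0.1, size=(d_out, d_hid))

def rnn_step(x1, a0):
    a1 = 1.0 / (1.0 + np.exp(-(U @ x1 + W @ a0)))   # f: element-wise sigmoid
    y1 = softmax(V @ a1)                            # g: softmax over the output
    return a1, y1

a = np.zeros(d_hid)
for x in rng.normal(size=(3, d_in)):                # unroll over a 3-step input sequence
    a, y = rnn_step(x, a)
print(y.sum())                                      # 1.0: a probability distribution
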

SLIDE 68

Recurrent neural network

Graphical representation

SLIDE 69

Recurrent neural network

Training

  • Consider the recurrent neural network as a very deep neural network with shared parameters across the computation
  • Backpropagation through time
  • What kind of supervision?
      • Acceptor: loss based on the final state
      • Transducer: an output for each input (e.g. language modelling)
      • Encoder-decoder: one RNN encodes the sequence into a vector representation, another RNN decodes it into a sequence (e.g. machine translation)

SLIDE 70

Recurrent neural network

Training: graphical representation

SLIDE 71

Recurrent neural network

Multi-layer RNN

  • Multiple layers of RNNs
  • The input of the next layer is the output of the RNN layer below it
  • Empirically shown to work better

SLIDE 72

Recurrent neural network

Bi-directional RNN

  • The input sequence is fed both forward and backward to two different RNNs
  • The representation is the concatenation of the forward and backward states (A & A')
  • Represents both history and future

SLIDE 73

Concrete RNN architectures

Simple RNN

SLIDE 74

Concrete RNN architectures

LSTM

  • Long short-term memory networks
  • In practice, simple RNNs are only able to remember a narrow context (vanishing gradients)
  • LSTM: a more complex architecture able to capture long-term dependencies

SLIDE 75

Concrete RNN architectures

LSTM

SLIDE 76

Concrete RNN architectures

LSTM

SLIDE 77

Concrete RNN architectures

LSTM

SLIDE 78

Concrete RNN architectures

LSTM

SLIDE 79

Concrete RNN architectures

LSTM

SLIDE 80

Concrete RNN architectures

LSTM

SLIDE 81

Concrete RNN architectures

GRU

  • LSTM: effective, but complex and computationally expensive
  • GRU: a cheaper alternative that works well in practice

SLIDE 82

Concrete RNN architectures

GRU

  • Reset gate (r): how much information from the previous hidden state needs to be included (reset with the current information?)
  • Update gate (z): controls updates to the hidden state (how much does the hidden state need to be updated with the current information?)
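
A numpy sketch of one GRU step with the two gates above (biases omitted, random parameters; the candidate-state formulation follows one common statement of the GRU equations):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_in, d_hid = 4, 6
Wr, Ur = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wz, Uz = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wh, Uh = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))

def gru_step(x, h_prev):
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde          # interpolate old and new state

h = np.zeros(d_hid)
x = rng.normal(size=d_in)
print(gru_step(x, h))
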

SLIDE 83

Recursive neural networks

Introduction

  • A generalization of RNNs from sequences to (binary) trees
  • A linear transformation + non-linear activation function applied recursively throughout a tree
  • Useful for parsing

SLIDE 84

Software

  • Tensorflow
      • Python, C++
      • http://www.tensorflow.org
  • Theano
      • Python
      • http://deeplearning.net/software/theano/
  • Keras
      • Theano/Tensorflow-based modular deep learning library
  • Lasagne
      • Theano-based deep learning library