

SLIDE 1

Deep Learning Applications in Natural Language Processing

Jindřich Libovický

December 5, 2018

B4M36NLP Introduction to Natural Language Processing

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
SLIDE 2

Outline

  • Information Search
  • Unsupervised Dictionary Induction
  • Image Captioning

SLIDE 3

Information Search

SLIDE 4

Answer Span Selection

Task: Given a question and a coherent text, find the span of the text that answers the question.

http://demo.allennlp.org/machine-comprehension

SLIDE 5

Standard Dataset: SQuAD

  • best Wikipedia articles of reasonable size (23k paragraphs, 500 articles)
  • more than 100k crowd-sourced question-answer pairs
  • thorough quality testing (including an estimate of single-human performance on the task)

https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. URL https://aclweb.org/anthology/D16-1264
SLIDE 6

Method Overview

  • 1. Get text and question representations from
    • pre-trained word embeddings
    • a character-level CNN
  …using your favourite architecture.
  • 2. Compute a similarity between all pairs of words in the text and in the question.
  • 3. Collect all the information we have for each token.
  • 4. Classify where the span is.
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016. URL http://arxiv.org/abs/1611.01603
SLIDE 7

Method Overview: Image

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016. URL http://arxiv.org/abs/1611.01603
SLIDE 8

Representing Words

  • pre-trained word embeddings
  • concatenate with trained character-level representations
  • character-level representations allow handling out-of-vocabulary structured information (numbers, addresses)
  • character embeddings of size 16 → 1D convolution to 100 dimensions → max-pooling over the character sequence (a minimal sketch follows below)
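A minimal sketch of such a character-level encoder, assuming PyTorch; the sizes 16 and 100 come from the slide, while the character-vocabulary size, kernel width, and example words are illustrative:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level word representation: embed characters, run a 1D
    convolution over the character sequence, and max-pool over time."""

    def __init__(self, n_chars=256, char_dim=16, out_dim=100, kernel=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=kernel, padding=kernel // 2)

    def forward(self, char_ids):
        # char_ids: (number of words, max word length) integer character ids
        x = self.embed(char_ids)           # (words, chars, 16)
        x = x.transpose(1, 2)              # (words, 16, chars) as expected by Conv1d
        x = torch.relu(self.conv(x))       # (words, 100, chars)
        return x.max(dim=2).values         # max-pool over characters -> (words, 100)

# Example: 3 words, each padded to 12 characters
words = torch.randint(1, 100, (3, 12))
print(CharCNN()(words).shape)  # torch.Size([3, 100])
```

The resulting 100-dimensional vector is then concatenated with the pre-trained embedding of the same word.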

SLIDE 9

Contextual Embeddings Layer

  • process both the question and the context with a bidirectional LSTM layer → one state per word
  • parameters are shared → the representations share the same space (see the sketch below)
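A small sketch of the shared layer, assuming PyTorch; the dimensions are illustrative, the point being that a single set of LSTM parameters encodes both sequences:

```python
import torch
import torch.nn as nn

emb_dim, hidden = 400, 100
shared_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

question = torch.randn(1, 5, emb_dim)   # (batch, question length, embedding dim)
context = torch.randn(1, 120, emb_dim)  # (batch, context length, embedding dim)

# The same parameters process both inputs, so the resulting states
# live in the same representation space.
u, _ = shared_lstm(question)  # (1, 5, 200), one state per question word
h, _ = shared_lstm(context)   # (1, 120, 200), one state per context word
```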
SLIDE 10

Attention Flow

[Figure: query states $u_1, \ldots, u_5$ (matrix $U$) and context states $h_1, \ldots, h_{12}$ (matrix $H$).]

$$S_{jk} = \mathbf{w}^\top \left[\, \mathbf{h}_j;\ \mathbf{u}_k;\ \mathbf{h}_j \odot \mathbf{u}_k \,\right]$$

Captures the affinity/similarity between every pair of a question word and a context word.

SLIDE 11

Context-to-query Attention

[Figure: for each context word $h_j$, a softmax over its similarities $S_{j,:}$ to all query words gives attention weights; the attended query vector $\tilde{u}_j$ is the corresponding weighted sum of the query states.]

SLIDE 12

Query-to-Context Attention

[Figure: take the maximum of $S_{j,:}$ over the query words for each context word, apply a softmax over the context positions, and compute a single weighted sum of the context states, which is then copied to every position.]
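A minimal sketch of the similarity matrix and both attention directions from the last three slides, assuming PyTorch; `h` are context states, `u` query states, and all sizes are illustrative:

```python
import torch

d = 200                 # state size
h = torch.randn(12, d)  # context states h_1 ... h_12
u = torch.randn(5, d)   # query states  u_1 ... u_5
w = torch.randn(3 * d)  # similarity weights

# Similarity S[j, k] = w^T [h_j; u_k; h_j * u_k]
feats = torch.cat(
    [h.unsqueeze(1).expand(-1, u.size(0), -1),
     u.unsqueeze(0).expand(h.size(0), -1, -1),
     h.unsqueeze(1) * u.unsqueeze(0)], dim=-1)     # (12, 5, 3d)
S = feats @ w                                      # (12, 5)

# Context-to-query: softmax over query words for each context word
c2q = torch.softmax(S, dim=1) @ u                  # (12, d)

# Query-to-context: max over query words, softmax over the context,
# one weighted sum of context states, copied to every position
b = torch.softmax(S.max(dim=1).values, dim=0)      # (12,)
q2c = (b.unsqueeze(0) @ h).expand(h.size(0), -1)   # (12, d)
```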

SLIDE 13

Modeling Layer

  • concatenate the LSTM output for each context word with its context-to-query vector
  • copy the query-to-context vector to each of them
  • apply one non-linear layer and a bidirectional LSTM
SLIDE 14

Output Layer

  • 1. Start-token probabilities: project each state to a scalar → apply softmax over the context
  • 2. End-token probabilities:
    • compute a weighted average of the states using the start-token probabilities → a single vector
    • concatenate this vector to each state
    • project the states to scalars, renormalize with softmax
  • 3. Finally, select the most probable span (see the sketch below)
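A sketch of the final step, assuming the two probability vectors are available; the most probable valid span maximizes the product of start and end probabilities with start ≤ end:

```python
import torch

def best_span(p_start, p_end):
    """Return (i, j), i <= j, maximizing p_start[i] * p_end[j]."""
    # Outer product of probabilities; mask out spans that end before they start.
    scores = torch.triu(p_start.unsqueeze(1) * p_end.unsqueeze(0))
    flat = int(scores.argmax())
    return divmod(flat, p_end.size(0))   # (start index, end index)

p_start = torch.softmax(torch.randn(12), dim=0)
p_end = torch.softmax(torch.randn(12), dim=0)
print(best_span(p_start, p_end))
```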
SLIDE 15

Method Overview: Recap

SLIDE 16

Attention Analysis (1)

SLIDE 17

Attention Analysis (2)

SLIDE 18

Make it 100× Faster!

Replace LSTMs by dilated convolutions.
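A sketch of what replaces the recurrence, assuming PyTorch; stacking 1D convolutions with growing dilation covers a wide context in a few layers, and all sizes here are illustrative:

```python
import torch
import torch.nn as nn

d = 128
# Dilations 1, 2, 4, 8 with kernel size 3 give a receptive field of 31 tokens.
layers = []
for dilation in (1, 2, 4, 8):
    layers += [nn.Conv1d(d, d, kernel_size=3, dilation=dilation, padding=dilation),
               nn.ReLU()]
encoder = nn.Sequential(*layers)

tokens = torch.randn(1, d, 50)   # (batch, channels, sequence length)
out = encoder(tokens)            # same length: (1, 128, 50)
```

Unlike an LSTM, every position can be computed in parallel, which is where the speed-up comes from.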

SLIDE 19

Convolutional Blocks

SLIDE 20

Using Pre-Trained Representations

Just replace the contextual embeddings with ELMo or BERT…

SLIDE 21

SQuAD Leaderboard

method                        Exact Match   F1 score
Human performance             82.304        91.221
BiDAF with BERT               87.433        93.160
BiDAF with ELMo               81.003        87.432
BiDAF trained from scratch    73.744        81.525

SLIDE 22

Unsupervised Dictionary Induction

SLIDE 23

Unsupervised Bilingual Dictionary

Task: Get a translation dictionary between two languages using monolingual data only.

  • makes NLP accessible for low-resource languages
  • the basis for unsupervised machine translation
  • hot research topic (at least 10 research papers on this topic this year)

The approach we will follow: Mikel Artetxe, Gorka Labaka, and Eneko Agirre. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-1073
SLIDE 24

How it is done

  • 1. Train word embeddings on large monolingual corpora.
  • 2. Find a mapping between the two languages.

So far it looks simple…

SLIDE 25

Dictionary and Common Projection

$X$, $Y$: embedding matrices of the two languages. Dictionary matrix $E$: $E_{jk} = 1$ if the $j$-th word of $X$ is a translation of the $k$-th word of $Y$.

Supervised projection between embeddings

Given an existing dictionary $E$ (a small seed dictionary), find mappings $W_X$, $W_Y$:

$$\underset{W_X, W_Y}{\arg\max}\ \sum_j \sum_k E_{jk} \cdot \operatorname{similarity}\!\left(X_{j:} W_X,\; Y_{k:} W_Y\right) \;=\; \underset{W_X, W_Y}{\arg\max}\ \sum_j \sum_k E_{jk} \left( (X_{j:} W_X)(Y_{k:} W_Y)^\top \right)$$

…but we need to find all of $E$, $W_X$, and $W_Y$.

SLIDE 26

A Tiny Observation

$$X X^\top$$

Question: How would you interpret this matrix? It is a table of similarities between pairs of words.

SLIDE 27

If the Vocabularies were Isometric…

  • $N_X = X X^\top$ and $N_Y = Y Y^\top$ would only have permuted rows and columns
  • if we sorted the values in each row of $N_X$ and $N_Y$, corresponding words would have the same vectors

Let's assume it is true (at least approximately):

$$E_{j,:} \leftarrow \mathbb{1}\!\left[\, \underset{k}{\arg\max}\ \operatorname{sorted}(N_X)_{j,:}\ \operatorname{sorted}(N_Y)_{k,:}^\top \,\right]$$

Assign the nearest neighbor from the other language (a rough sketch follows below). …in practice tragically bad, but at least a good initialization.
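A rough NumPy sketch of this initialization idea (not the exact published procedure), assuming length-normalized embedding matrices and vocabularies cut to the same size:

```python
import numpy as np

def init_dictionary(X, Y):
    """X, Y: length-normalized embedding matrices of equal vocabulary size."""
    nx = np.sort(X @ X.T, axis=1)   # sorted similarity profile of every word in language 1
    ny = np.sort(Y @ Y.T, axis=1)   # sorted similarity profile of every word in language 2
    sims = nx @ ny.T                # how similar the profiles are across languages
    E = np.zeros_like(sims)
    E[np.arange(sims.shape[0]), sims.argmax(axis=1)] = 1.0  # one-hot nearest neighbors
    return E
```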

SLIDE 28

Self-Learning

Iterate until convergence:

  • 1. Optimize $W_X$ and $W_Y$ w.r.t. the current dictionary:

$$\underset{W_X, W_Y}{\arg\max}\ \sum_j \sum_k E_{jk} \left( (X_{j:} W_X)(Y_{k:} W_Y)^\top \right)$$

  • 2. Update the dictionary matrix $E$:

$$E_{jk} = \begin{cases} 1, & \text{if } j \text{ is the nearest neighbor of } k \text{ or vice versa} \\ 0, & \text{otherwise} \end{cases}$$

A compact sketch of this loop follows below.
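A compact sketch, assuming NumPy, orthogonal mappings obtained in closed form by SVD (orthogonal Procrustes), and a simple forward nearest-neighbor dictionary update; the published method adds several robustness tricks on top of this:

```python
import numpy as np

def self_learning(X, Y, E, n_iter=20):
    """X, Y: embedding matrices; E: initial binary dictionary matrix."""
    for _ in range(n_iter):
        # 1. Mappings maximizing sum_jk E_jk (X_j Wx)(Y_k Wy)^T with Wx, Wy
        #    orthogonal: orthogonal Procrustes, solved by the SVD of X^T E Y.
        u, _, vt = np.linalg.svd(X.T @ E @ Y)
        Wx, Wy = u, vt.T                       # map both languages into a shared space
        # 2. New dictionary: nearest neighbor of every word of language 1
        sims = (X @ Wx) @ (Y @ Wy).T
        E = np.zeros_like(sims)
        E[np.arange(sims.shape[0]), sims.argmax(axis=1)] = 1.0
    return Wx, Wy, E
```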
SLIDE 29

Accuracy on Large Dictionary

SLIDE 30

Try it yourself!

  • Pre-train monolingual word embeddings using FastText / Word2Vec
  • Install VecMap

https://github.com/artetxem/vecmap

python3 map_embeddings.py --unsupervised SRC.EMB TRG.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

SLIDE 31

Image Captioning

SLIDE 32

Image Captioning

Task: Generate a caption in natural language given an image. Example:

A group of people wearing snowshoes, and dressed for winter hiking, is standing in front of a building that looks like it's made of blocks of ice. The people are quietly listening while the story of the ice cabin was explained to them.

A group of people standing in front of an igloo.

Several students waiting outside an igloo.

SLIDE 33

Deep Learning Solution

  • 1. Obtain a pre-trained image representation (a sketch of this step follows below).
  • 2. Use an autoregressive decoder to generate the caption from the image representation.
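For step 1, a sketch assuming torchvision and a ResNet-50 backbone (the slides do not fix a particular network): take an ImageNet-pretrained CNN and drop its classification head, keeping the spatial feature map.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(weights="IMAGENET1K_V1")        # ImageNet-pretrained backbone
backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop average pooling and classifier
backbone.eval()

with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))     # (1, 2048, 7, 7) spatial feature map
```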
SLIDE 34

2D Convolution over an Image

Basic method in deep learning for computer vision.

[Figure: RGB image 9 × 9 × 3 → convolutional map 4 × 4 × 6; stride 2, 6 filters, kernel size 3.]
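The arithmetic from the figure can be verified directly, assuming PyTorch: a 3-channel 9 × 9 input with six 3 × 3 filters and stride 2 gives ⌊(9 − 3)/2⌋ + 1 = 4, i.e. a 6-channel 4 × 4 map.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=2)
image = torch.randn(1, 3, 9, 9)   # batch of one 9x9 RGB image
print(conv(image).shape)          # torch.Size([1, 6, 4, 4])
```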

SLIDE 35

Convolutional Network for Image Classification

[Figure: image-classification CNN: RGB image 3 @ 224 × 224 → stacked convolution + max-pooling layers (24 @ 48 × 48, 48 @ 27 × 27, 60 @ 13 × 13, 60 @ 13 × 13, 50 @ 13 × 13) → flatten + dense layers (1 × 2048, 1 × 2048) → 1 × 1000 class scores.]

  • Trained for 1000-class classification on millions of training examples
  • Architecture: convolutions, max-pooling, residual connections, batch normalization, 50–150 layers

SLIDE 36

Reminder: Autoregressive Decoder

[Figure: autoregressive decoder unrolled in time: starting from <s>, each step receives the previous token (the ground-truth token y_k at training time) and predicts ~y_{k+1}; the predictions are scored against the reference with a cross-entropy loss.]
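A minimal sketch of the decoding loop at inference time, assuming a hypothetical `decoder_step(prev_token, state)` that returns output logits and the updated recurrent state:

```python
import torch

def greedy_decode(decoder_step, state, bos_id, eos_id, max_len=30):
    """Feed the previously generated token back in at every step."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits, state = decoder_step(tokens[-1], state)
        next_id = int(torch.argmax(logits))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]
```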

SLIDE 37

Attention Model in Equations (1)

Inputs: the previous decoder state $t_{j-1}$ and encoder states $h_k = [\overrightarrow{h_k}; \overleftarrow{h_k}]$, $k = 1 \ldots T_x$.

Attention energies:
$$f_{jk} = w_b^\top \tanh\!\left( X_b\, t_{j-1} + V_b\, h_k + c_b \right)$$

Attention distribution:
$$\beta_{jk} = \frac{\exp(f_{jk})}{\sum_{l=1}^{T_x} \exp(f_{jl})}$$

Context vector:
$$d_j = \sum_{k=1}^{T_x} \beta_{jk}\, h_k$$
SLIDE 38

Attention Model in Equations (2)

Output projection:
$$u_j = \operatorname{MLP}\!\left( V_p\, t_{j-1} + W_p\, F z_{j-1} + D_p\, d_j + c_p \right)$$
…the attention context is mixed with the decoder hidden state.

Output distribution:
$$q\!\left( z_j = l \mid t_j, z_{j-1}, d_j \right) \propto \exp\!\left( (X_p\, u_j)_l + c_l \right)$$

SLIDE 39

Example Outputs: Correct

SLIDE 40

Example Outputs: Incorrect

SLIDE 41

Employing Transformer Decoder

[Figure: Transformer decoder block, repeated $O\times$: input embeddings ⊕ position encoding → self-attentive sublayer (multi-head attention over previous positions, with future positions masked out by −∞ energies, ⊕ layer normalization) → cross-attention sublayer (multi-head attention with queries from the decoder and keys & values from the encoder, ⊕ layer normalization) → feed-forward sublayer (non-linear layer + linear layer, ⊕ layer normalization) → linear projection + softmax → output symbol probabilities.]
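In PyTorch, such a decoder stack could be instantiated roughly as follows (a sketch, not the configuration behind the results on the next slide); for captioning, the encoder states are simply the positions of the CNN feature map:

```python
import torch
import torch.nn as nn

d_model = 512
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

image_features = torch.randn(1, 196, d_model)   # e.g. a 14x14 CNN grid as encoder states
caption_embeds = torch.randn(1, 10, d_model)    # embedded caption prefix + position encoding

# The "-inf" mask from the figure: each position may only attend to itself and the past.
causal_mask = torch.triu(torch.full((10, 10), float("-inf")), diagonal=1)
out = decoder(caption_embeds, image_features, tgt_mask=causal_mask)  # (1, 10, 512)
```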

SLIDE 42

Quantitative Results

model                                                  BLEU score
RNN + attention (original)                             24.3
RNN + attention (with better image representation)     32.6
Transformer (with better image representation)         33.3
