
Deep Neural Networks in Natural Language Processing

Rudolf Rosa (rosa@ufal.mff.cuni.cz)
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

Hora Informaticae, ÚI AV ČR, Praha, 14 Jan 2019

Background check: do you know...

- Machine learning? (ML)
- Artificial neural networks? (NN)
- Deep neural networks? (DNN)
- Convolutional neural networks? (CNN)
- Recurrent neural networks? (RNN)
- Long short-term memory units? (LSTM)
- Gated recurrent units? (GRU)
- Attention mechanism? (Bahdanau+, 2014)
- Self-attentive networks? (SAN, Transformer)
- Word embeddings? (Bengio+, 2003)
- Word2vec? (Mikolov+, 2013)

ML in Natural Language Processing

- Before: complex multistep pipelines
  - Preprocessing, low-level processing, high-level processing, classification, post-processing…
  - Massive feature engineering, linguistic knowledge…
- Now: monolithic end-to-end systems (or nearly)
  - text → deep neural network → output
  - Little or no linguistic knowledge required
  - Little or no feature engineering
  - Little or no dependence on external tools
  - → so now is a good time for anyone to get into NLP!

Neural networks & text processing

- Input to a neuron: fixed-dimension real vector
  - Dimension should be reasonable (< 10³)
- Neural net: fixed-sized network of neurons
- Text input: sequence processing
  - Sentence = sequence of words
  - Words: discrete (but interrelated)
    - Massively multi-valued (~10⁶)
    - Very sparse (Zipf distribution)
  - Sentences: variable length (~1 to 100 words)
    - Complex and hidden internal structure

Outline of the talk

- Problem 1: Words
  - There are too many
  - They are discrete
  - Representing massively multi-valued discrete data by continuous low-dimensional vectors
- Problem 2: Sentences
  - They have various lengths
  - They have internal structure
  - Handling variable-length input sequences with complex internal relations by fixed-sized neural units

Warnings

- I am not an ML expert, rather an ML user
  - Please excuse any errors and inaccuracies
- Focus of the talk: input representation (“encoding”)
  - Key problem in NLP, interesting properties
- Leaving out
  - Generating output (“decoding”) – that’s also interesting
    - Sequence generation
    - Sequence elements are discrete, with a large domain (softmax over ~10⁶)
    - Sequence length not known a priori
  - Decision at the encoder/decoder boundary (if any)


Problem 1: Words

Representing massively multi-valued discrete data (words) by continuous low-dimensional vectors (word embeddings)

Simplification

- For now, forget sentences
  - 1 word → some output
    - e.g. whether the word is positive/neutral/negative, a definition of the word, a hyperonym (dog → animal), …
- Situation
  - We have labelled training data for some words (~10³)
  - We want to generalize (ideally) to all words (~10⁶)

The problem with words

- How many words are there? Too many!
  - Many problems with counting words, cannot be done
  - ~10⁶ (but potentially infinite – new words get created every day)
- Long-standing problem of NLP
- Natural representation: 1-hot vector
  [figure: a 1-hot vector 0 0 0 0 1 0 0 0 0 … of dimension ~10⁶, with the single 1 at position i]
  - ML with ~10⁶ binary features on input
  - Pair of words: ~10¹²
  - No generalization, meaning of words not captured
    - dog~puppy, dog~~cat, dog~~~platypus, dog~~~~whiskey
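To make the scale concrete, here is a minimal sketch of the 1-hot representation (illustrative only: the index values are made up, and a vocabulary of ~10⁶ types is assumed):

```python
import numpy as np

VOCAB_SIZE = 1_000_000  # ~10^6 word types

def one_hot(word_index: int, vocab_size: int = VOCAB_SIZE) -> np.ndarray:
    """1-hot vector: all zeros except a single 1 at the word's index."""
    v = np.zeros(vocab_size)
    v[word_index] = 1.0
    return v

dog = one_hot(12345)    # hypothetical index of "dog"
puppy = one_hot(67890)  # hypothetical index of "puppy"
# Every pair of distinct words is equally (dis)similar: the dot product is 0,
# so the representation captures no notion of word meaning.
print(dog @ puppy)      # 0.0
print(dog.nbytes)       # 8,000,000 bytes for a single word vector
```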

Split the words

- Split into characters (e.g. M O C K)
  - Not that many (~10²)
  - Do not capture meaning
    - Starts with “m-”, is it positive or negative?
- Split into subwords/morphemes (e.g. mis class if ied)
  - Word starts with “mis-”: it is probably negative
    - misclassify, mistake, misconception…
  - Helps, used in practice
  - Potentially infinite word set covered by a finite set of subwords
  - Meaning-capturing subwords still too many (~10⁵)
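As a rough illustration of the subword idea, a toy greedy longest-match segmenter over an assumed, hand-picked subword vocabulary (real systems learn the inventory from data, e.g. with byte-pair encoding):

```python
# Toy subword segmentation: greedy longest-match against a fixed subword
# vocabulary. The vocabulary below is invented for illustration; in practice
# it is learned from a corpus and has ~10^4 to 10^5 units.
SUBWORDS = {"mis", "class", "if", "ied", "take", "conception"}

def segment(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        # take the longest subword that matches at the current position
        for end in range(len(word), start, -1):
            if word[start:end] in SUBWORDS:
                pieces.append(word[start:end])
                start = end
                break
        else:  # no subword matches -> fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

print(segment("misclassified"))  # ['mis', 'class', 'if', 'ied']
print(segment("mistake"))        # ['mis', 'take']
```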

Distributional hypothesis

- smelt (assume you don’t know this word)
  - I had a smelt for lunch. → noun, meal/food
  - My father caught a smelt. → animal/illness
  - Smelts are disappearing from oceans. → plant/fish
  - (smelt = koruška, a fish)
- Harris (1954): “Words that occur in the same contexts tend to have similar meanings.”

Distributional hypothesis

- Harris (1954): “Words that occur in the same contexts tend to have similar meanings.”
- Cooccurrence matrix
  - number of sentences containing both WORD and CONTEXT

    WORD \ CONTEXT   lunch   caught   oceans   doctor   green
    smelt               10       10       10        1        1
    salmon             100      100      100        1        1
    flu                  1      100        1      100       10
    seaweed             10        1      100        1      100

  - Cheap plentiful data (web, news, books…): ~10⁹ sentences
  - The matrix is N×N, N ~ 10⁶
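A minimal sketch of building such a sentence-level cooccurrence matrix from a toy tokenized corpus (the sentences and resulting counts are illustrative only):

```python
import numpy as np
from itertools import product

# Toy corpus; in practice ~10^9 sentences from the web, news, books, ...
sentences = [
    "i had a smelt for lunch".split(),
    "my father caught a smelt".split(),
    "smelts are disappearing from oceans".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
N = len(vocab)

# Sentence-level cooccurrence: count sentences containing both word and context.
M_C = np.zeros((N, N))
for s in sentences:
    for w, c in product(set(s), set(s)):
        if w != c:
            M_C[idx[w], idx[c]] += 1

print(M_C[idx["smelt"], idx["lunch"]])  # 1.0
```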

From cooccurrence to PMI

- Cooccurrence matrix
  - M_C[i, j] = count(word_i & context_j)
- Conditional probability matrix
  - M_P[i, j] = P(word_i | context_j) = M_C[i, j] / count(context_j)
- Conditional log-probability matrix
  - M_LogP[i, j] = log P(word_i | context_j) = log M_P[i, j]
- Pointwise mutual information matrix
  - M_PMI[i, j] = log [P(word_i | context_j) / P(word_i)]
  - PMI(A, B) = log [P(A & B) / (P(A) P(B))]
- These are all association measures
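The measures above translate directly into a few array operations; a sketch assuming a cooccurrence count matrix M_C with words as rows and contexts as columns (the marginal used for P(word_i) is one common choice):

```python
import numpy as np

def pmi_matrix(M_C: np.ndarray) -> np.ndarray:
    """M_PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]."""
    total = M_C.sum()
    p_word = M_C.sum(axis=1) / total              # P(word_i), marginal over contexts
    count_ctx = M_C.sum(axis=0)                   # count(context_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        M_P = M_C / count_ctx                     # P(word_i | context_j), column-wise
        M_PMI = np.log(M_P / p_word[:, None])     # zero counts map to -inf
    return M_PMI
```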

From cooccurrence to PMI

- Word representation still impractically huge
  - M_PMI[i] ∈ R^N, N ~ 10⁶
- But better than 1-hot
  - Meaningful continuous vectors (e.g. cosine similarity)
- Just need to compress it!
  - Explicitly: matrix factorization (post-hoc, not used in practice)
  - Implicitly: word2vec (widely used)

Matrix factorization

- Levy & Goldberg (2014)
- Take M_LogP or M_PMI
- Shift the matrix to make it positive (subtract the minimum)
- Truncated Singular Value Decomposition:
  - M = U D Vᵀ
  - M ∈ R^(N×N) → U ∈ R^(N×d), D ∈ R^(d×d), V ∈ R^(N×d)
  - N ~ 10⁶, d ~ 10²
- Word embedding matrix: W = U D ∈ R^(N×d)
  - Embedding vec(word_i) = W[i] ∈ R^d
  - Continuous low-dimensional vector
  - Meaningful (cosine similarity, algebraic operations)
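A sketch of the explicit factorization with numpy; it computes the full SVD and truncates it, which is only feasible for toy matrices – for the real N ~ 10⁶ a sparse or randomized truncated SVD would be used instead:

```python
import numpy as np

def embeddings_by_svd(M: np.ndarray, d: int) -> np.ndarray:
    """Truncated SVD of a (shifted) association matrix, as in Levy & Goldberg (2014)."""
    M = M - M.min()                       # shift so all entries are non-negative
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    U_d, S_d = U[:, :d], S[:d]            # keep the d largest singular values
    W = U_d * S_d                         # word embedding matrix W = U D, shape (N, d)
    return W

# vec(word_i) = W[i] is a continuous d-dimensional vector (d ~ 10^2).
```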

Word embeddings magic

- Word similarity (cosine)
  - vec(dog) ~ vec(puppy), vec(cat) ~ vec(kitten)
- Word meaning algebra
  - Some relations are parallel across words
  - vec(puppy) - vec(dog) ~ vec(kitten) - vec(cat)
    [figure: dog, puppy, cat, kitten as points in the embedding space, with parallel dog→puppy and cat→kitten offsets]
  - => vec(puppy) - vec(dog) + vec(cat) ~ vec(kitten)
  - vodka – Russia + Mexico, teacher – school + hospital, …
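The “magic” amounts to nearest-neighbour search in the embedding space; a minimal sketch assuming an embedding matrix W (N×d), a parallel word list, and a word → row-index mapping idx (all hypothetical names):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(query: np.ndarray, W: np.ndarray, words: list[str], k: int = 5) -> list[str]:
    """k nearest words to a query vector under cosine similarity."""
    sims = W @ query / (np.linalg.norm(W, axis=1) * np.linalg.norm(query))
    return [words[i] for i in np.argsort(-sims)[:k]]

# Similarity: cosine(W[idx["dog"]], W[idx["puppy"]]) should be high.
# Analogy:    vec(puppy) - vec(dog) + vec(cat) ~ vec(kitten)
# query = W[idx["puppy"]] - W[idx["dog"]] + W[idx["cat"]]
# most_similar(query, W, words)  ->  "kitten" expected near the top
```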

word2vec (Mikolov+, 2013)

- Predict word w_i from its context (CBOW)
  - E.g.: “I had _____ for lunch”
  - Sentence: … w_(i-2) w_(i-1) w_i w_(i+1) w_(i+2) …
  [architecture diagram: context words w_(i-2), w_(i-1), w_(i+1), w_(i+2) as 1-hot vectors → shared projection matrix W (N×d) → summed “linear hidden layer” → output matrix V (d×N) → (hierarchical) softmax → distribution over the predicted word w_i]
- Train with SGD
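A sketch of one CBOW forward pass with toy dimensions (the matrices W and V and the index values are made up; word2vec itself replaces the full softmax shown here with a hierarchical softmax or negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 100                       # toy vocabulary size and embedding size
W = rng.normal(scale=0.1, size=(N, d))   # shared input projection matrix (N x d)
V = rng.normal(scale=0.1, size=(d, N))   # output matrix (d x N)

def cbow_forward(context_ids: list[int]) -> np.ndarray:
    """Predict a distribution over the centre word from its context words."""
    h = W[context_ids].sum(axis=0)        # "linear hidden layer": sum of context embeddings
    scores = h @ V                        # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # full softmax (for illustration only)
    return exp / exp.sum()

p = cbow_forward([3, 17, 42, 99])         # e.g. the context of "I had ___ for lunch"
print(p.shape, p.sum())                   # (10000,) 1.0
```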

word2vec (Mikolov+, 2013)

- Predict the context from a word w_i (skip-gram, SGNS)
  - E.g.: “____ _____ smelt _____ _____”
  - Sentence: … w_(i-2) w_(i-1) w_i w_(i+1) w_(i+2) …
  [architecture diagram: input word w_i as a 1-hot vector → projection matrix W (N×d) → shared output matrix V (d×N) → softmax distributions over the context words w_(i-2), w_(i-1), w_(i+1), w_(i+2)]
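A sketch of the skip-gram with negative sampling (SGNS) objective for a single (word, context) pair, assuming input-word embeddings W and context embeddings V both stored as (N×d) matrices (a common implementation layout, slightly different from the (d×N) orientation in the diagram):

```python
import numpy as np

def sgns_loss(W: np.ndarray, V: np.ndarray, word: int, context: int,
              negatives: list[int]) -> float:
    """Negative-sampling loss for one observed (word, context) pair:
    push sigma(w.c) towards 1 for the true context and sigma(w.n) towards 0
    for a few randomly sampled negative contexts."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    w = W[word]
    loss = -np.log(sigmoid(w @ V[context]))        # true pair should score high
    for n in negatives:                            # random contexts should score low
        loss -= np.log(sigmoid(-(w @ V[n])))
    return float(loss)

# Training: stream (word, context) pairs from the corpus, sample a few negative
# contexts per pair, and update W and V by SGD on this loss.
```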

word2vec ~ implicit factorization

- Word embedding matrix W ∈ R^(N×d)
  - embedding(word_i) = W[i] ∈ R^d
- Levy & Goldberg (2014)
  - word2vec SGNS implicitly factorizes M_PMI
  - M_PMI[i, j] = log [P(word_i | context_j) / P(word_i)]
  - SGNS: M_PMI = W V
  - M_PMI ∈ R^(N×N) → W ∈ R^(N×d), V ∈ R^(d×N)

Problem 2: Sentences

Handling variable-length input sequences with long-distance relations between elements (sentences) by fixed-sized neural units (attention mechanisms)

Processing sentences

- Convolutional neural networks
- Recurrent neural networks
- Attention mechanism
- Self-attentive networks

Convolutional neural networks

- Input: sequence of word embeddings
- Filters (size 3–5), normalization, max-pooling
- Training deep CNNs is hard → residual connections
  - Layer input averaged with the output, skips the non-linearity
- Problem: capturing long-range dependencies
  - Receptive field of each filter is limited
  - My computer works, but I have to buy a new mouse.
- Good for word n-gram spotting
  - Sentiment analysis, named entity detection…
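A sketch of what one convolution filter does over a sentence of word embeddings, followed by max-pooling (toy dimensions, a single filter; real models use hundreds of filters of widths 3–5):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, width = 12, 100, 3                 # sentence length, embedding size, filter width
X = rng.normal(size=(T, d))              # sentence: sequence of word embeddings
F = rng.normal(size=(width, d))          # one convolution filter of width 3

# Slide the filter over the word sequence; each position sees a local window only.
features = np.array([np.sum(X[t:t + width] * F) for t in range(T - width + 1)])
features = np.maximum(features, 0)       # ReLU non-linearity
sentence_feature = features.max()        # max-pooling over positions -> fixed size

print(features.shape, sentence_feature)  # (10,) one number per filter
```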

Recurrent neural networks

- Input: sequence of word embeddings
- Output: final state of the RNN
- Problems
  - Vanishing gradient → memory cells (LSTM, GRU)
  - Long-distance dependencies not perfectly captured
  - Final state is biased (“forgetting”)
    - … sentence end better captured than sentence start
    - Bidirectional RNN, output = concatenation of both final states
    - May still not capture the middle parts well…
    - Using all hidden states as output, not just the final one
      - We lose the fixed-sized representation
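A sketch of a plain (Elman-style) recurrent encoder; real systems use LSTM/GRU cells and bidirectional stacks, but the sequential update has the same shape:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 100, 128                            # embedding size, hidden-state size
W_xh = rng.normal(scale=0.1, size=(d, h))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(h, h))  # hidden -> hidden (the recurrence)
b = np.zeros(h)

def rnn_encode(X: np.ndarray):
    """X: (T, d) sequence of word embeddings -> all hidden states and the final one."""
    state = np.zeros(h)
    states = []
    for x_t in X:                                      # one step per word
        state = np.tanh(x_t @ W_xh + state @ W_hh + b)
        states.append(state)
    return np.stack(states), state                     # (T, h) and (h,)

hidden_states, final_state = rnn_encode(rng.normal(size=(9, d)))
```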

Attention (on top of an RNN)

- Classifier/decoder gets a fixed-size context vector
  - Weighted average of the encoder hidden states: context = Σ_i weight_i · state_i
  - Attention weights computed by a feed-forward subnet
    - weight_i ~ NN(state_i, state_decoder)
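A sketch of the attention step, roughly following the additive form of Bahdanau+ (2014): score each encoder hidden state against the decoder/classifier state with a small feed-forward subnet, softmax the scores, and take the weighted average (W_a and v_a are the subnet's assumed parameters):

```python
import numpy as np

def attention(states: np.ndarray, query: np.ndarray,
              W_a: np.ndarray, v_a: np.ndarray) -> np.ndarray:
    """states: (T, h) encoder hidden states, query: (h,) decoder/classifier state.
    Returns a fixed-size context vector (h,) regardless of sentence length T."""
    # score each encoder state against the query with a small feed-forward subnet
    inputs = np.concatenate([states, np.tile(query, (len(states), 1))], axis=1)
    scores = np.tanh(inputs @ W_a) @ v_a          # (T,) one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # attention distribution over positions
    return weights @ states                       # weighted average of hidden states

# W_a: (2h, a) and v_a: (a,) are the parameters of the scoring subnet.
```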

Advanced attention

- Multi-head attention
  - Multiple attention heads (~8), each with its own attention distribution
  - Resulting context vectors are concatenated
- Self-attentive encoder (SAN, Transformer)
  - CNN/attention hybrid
  - CNN: a cell gets a small local context via filters
  - SAN: a cell gets the global context via attention heads
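A sketch of one scaled dot-product self-attention head in the Transformer style (a single head, no positional encodings or layer normalization; W_q, W_k, W_v are assumed projection matrices):

```python
import numpy as np

def self_attention_head(X: np.ndarray, W_q: np.ndarray,
                        W_k: np.ndarray, W_v: np.ndarray) -> np.ndarray:
    """X: (T, d) word representations. Every position attends to every other position,
    so each cell gets global context in one step (unlike a CNN filter's local window)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # queries, keys, values: (T, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (T, T) pairwise attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the key positions
    return weights @ V                             # (T, d_k) new representation per position

# Multi-head attention: run several heads, each with its own W_q, W_k, W_v,
# and concatenate the per-position outputs.
```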

Conclusion

- Words → word embeddings
  - Too many, too sparse
  - Word meaning ~ context in which it appears
  - Cooccurrence matrix, implicit/explicit factorization
- Sentences → attention
  - Variable length, complex internal structure
  - biRNN (LSTM, GRU), CNN + residuals
  - Attention: weighted sum of encoder hidden states
  - Self-attention: à la CNN, filters → attention

References

Word embeddings:

- Distributional hypothesis: Harris: Distributional structure. Word, 1954
- First: Bengio+: A neural probabilistic language model. JMLR, 2003
- Efficient implicit (word2vec): Mikolov+: Linguistic Regularities in Continuous Space Word Representations. NAACL, 2013
- Explicit (TSVD): Levy & Goldberg: Neural Word Embedding as Implicit Matrix Factorization. NIPS, 2014

Recurrent neural networks and attention:

- LSTM: Hochreiter+: Long short-term memory. Neural Computation, 1997
- Attention: Bahdanau+: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, 2014
- Transformer SAN: Vaswani+: Attention is all you need. NIPS, 2017


Thank you for your attention

http://ufal.mff.cuni.cz/rudolf-rosa/

Deep Neural Networks in Natural Language Processing
Rudolf Rosa (rosa@ufal.mff.cuni.cz)
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics