
Deep Neural Networks in Natural Language Processing

Rudolf Rosa (rosa@ufal.mff.cuni.cz)
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Hora Informaticae, ÚI AV ČR, Praha, 14 Jan 2019


Warnings

• I am not an ML expert, rather an ML user – please excuse any errors and inaccuracies
• Focus of the talk: input representation (“encoding”)
  • A key problem in NLP, with interesting properties
• Leaving out: generating output (“decoding”) – that is also interesting
  • Sequence generation: elements are discrete with a large domain (softmax over ~10^6), and sequence length is not known a priori
  • Decisions at the encoder/decoder boundary (if any)

Problem 1: Words

Massively multi-valued discrete data (words) → continuous low-dimensional vectors (word embeddings)

Simplification

• For now, forget sentences
• 1 word → some output
  • e.g. whether the word is positive/neutral/negative, a definition of the word, a hyperonym (dog → animal), …
• Situation
  • We have labelled training data for some words (~10^3)
  • We want to generalize (ideally) to all words (~10^6)

The problem with words

• How many words are there? Too many!
  • Many problems with counting words – it cannot really be done
  • ~10^6 (but potentially infinite – new words get created every day)
  • A long-standing problem of NLP
• Natural representation: 1-hot vector of dimension ~10^6 (all zeros, a single 1 at position i)
  • ML with ~10^6 binary features on input
  • A pair of words: ~10^12
  • No generalization, meaning of words not captured (see the sketch below)
    • dog~puppy, dog~~cat, dog~~~platypus, dog~~~~whiskey
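To make the lack of generalization concrete, here is a minimal sketch in Python/NumPy (toy vocabulary made up for illustration): every pair of distinct 1-hot vectors has cosine similarity 0, so “dog” is exactly as dissimilar to “puppy” as to “whiskey”.

    import numpy as np

    # Toy vocabulary; a real one would have ~10^6 entries.
    vocab = ["dog", "puppy", "cat", "platypus", "whiskey"]
    word2id = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """1-hot vector: all zeros except a single 1 at the word's index."""
        v = np.zeros(len(vocab))
        v[word2id[word]] = 1.0
        return v

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cos(one_hot("dog"), one_hot("puppy")))    # 0.0
    print(cos(one_hot("dog"), one_hot("whiskey")))  # 0.0 – no notion of word similarity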

Split the words

• Split into characters: M O C K
  • Not that many of them (~10^2)
  • But they do not capture meaning – a word starts with “m-”, is it positive or negative?
• Split into subwords/morphemes: mis class if ied (see the segmentation sketch below)
  • A word starts with “mis-”: it is probably negative (misclassify, mistake, misconception…)
  • Helps, used in practice
  • A potentially infinite word set covered by a finite set of subwords
  • But meaning-capturing subwords are still too many (~10^5)
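A minimal sketch of subword splitting (Python; greedy longest-match against a hand-made toy subword list – real systems usually learn the subword inventory with BPE or a similar algorithm, so this is only an illustration of the idea):

    def segment(word, subwords):
        """Greedy longest-match segmentation of a word into known subwords.

        Falls back to single characters, so any string is covered by a finite
        inventory. Real systems (BPE, WordPiece) learn the inventory from data;
        this toy list is just for illustration.
        """
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):          # longest match first
                if word[i:j] in subwords or j == i + 1:
                    pieces.append(word[i:j])
                    i = j
                    break
        return pieces

    toy_subwords = {"mis", "class", "if", "ied", "take", "concept", "ion"}
    print(segment("misclassified", toy_subwords))   # ['mis', 'class', 'if', 'ied']
    print(segment("mistake", toy_subwords))         # ['mis', 'take']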

Distributional hypothesis

• smelt (assume you don’t know this word; Czech: koruška)
  • I had a smelt for lunch. → noun, meal/food
  • My father caught a smelt. → animal/illness
  • Smelts are disappearing from oceans. → plant/fish
• Harris (1954): “Words that occur in the same contexts tend to have similar meanings.”

• Cooccurrence matrix (N×N, N ~ 10^6)
  • number of sentences containing both WORD and CONTEXT (see the sketch below)

    WORD \ CONTEXT   lunch  caught  oceans  doctor  green
    smelt               10      10      10       1      1
    salmon             100     100     100       1      1
    flu                  1     100       1     100     10
    seaweed             10       1     100       1    100

• Cheap, plentiful data (web, news, books…): ~10^9
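A minimal sketch of building such a cooccurrence matrix from raw sentences (Python; the three example sentences stand in for the ~10^9 sentences of a real corpus):

    from collections import Counter
    from itertools import product

    # Toy corpus; in practice ~10^9 sentences from the web, news, books…
    sentences = [
        "i had a smelt for lunch",
        "my father caught a smelt",
        "smelts are disappearing from oceans",
    ]

    cooc = Counter()
    for s in sentences:
        words = set(s.split())              # count each pair once per sentence
        for w, c in product(words, words):
            if w != c:
                cooc[(w, c)] += 1           # number of sentences containing both w and c

    print(cooc[("smelt", "lunch")])         # 1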

From cooccurrence to PMI: association measures

• Cooccurrence matrix
  • M_C[i, j] = count(word_i & context_j)
• Conditional probability matrix
  • M_P[i, j] = P(word_i | context_j) = M_C[i, j] / count(context_j)
• Conditional log-probability matrix
  • M_LogP[i, j] = log P(word_i | context_j) = log M_P[i, j]
• Pointwise mutual information matrix
  • M_PMI[i, j] = log [P(word_i | context_j) / P(word_i)]
  • in general, PMI(A, B) = log [P(A & B) / (P(A) P(B))]
  • (computed on the toy data in the sketch below)
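A minimal sketch of computing these association measures with NumPy, using the toy counts from the table above (no smoothing of rare counts):

    import numpy as np

    # Toy counts: rows = words, columns = contexts (same data as the table above).
    words    = ["smelt", "salmon", "flu", "seaweed"]
    contexts = ["lunch", "caught", "oceans", "doctor", "green"]
    M_C = np.array([[ 10,  10,  10,   1,   1],
                    [100, 100, 100,   1,   1],
                    [  1, 100,   1, 100,  10],
                    [ 10,   1, 100,   1, 100]], dtype=float)

    # P(word_i | context_j) = count(word_i & context_j) / count(context_j)
    M_P = M_C / M_C.sum(axis=0, keepdims=True)
    M_LogP = np.log(M_P)

    # PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]
    P_word = M_C.sum(axis=1) / M_C.sum()
    M_PMI = M_LogP - np.log(P_word)[:, None]

    print(np.round(M_PMI, 2))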

From cooccurrence to PMI

• Word representation still impractically huge: M_PMI[i] ∈ R^N, N ~ 10^6
• But better than 1-hot: meaningful continuous vectors (e.g. cosine similarity)
• Just need to compress it!
  • Explicitly: matrix factorization → post-hoc, not used in practice
  • Implicitly: word2vec → widely used

Matrix factorization

• Levy & Goldberg (2014)
• Take M_LogP or M_PMI
• Shift the matrix to make it positive (subtract the minimum)
• Truncated Singular Value Decomposition (N ~ 10^6, d ~ 10^2):
  • M = U D V^T, with M ∈ R^{N×N} → U ∈ R^{N×d}, D ∈ R^{d×d}, V ∈ R^{N×d}
• Word embedding matrix: W = U D ∈ R^{N×d} (see the sketch below)
  • Embedding vec(word_i) = W[i] ∈ R^d
  • Continuous low-dimensional vector
  • Meaningful (cosine similarity, algebraic operations)
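A minimal sketch of this recipe, continuing from the toy M_PMI and `words` of the PMI sketch above (a full NumPy SVD is fine for a 4×5 toy matrix; at real scale, N ~ 10^6, a truncated/sparse SVD would be needed, and d would be ~10^2 rather than 2):

    import numpy as np

    d = 2                                    # toy dimensionality; ~10^2 in practice
    M = M_PMI - M_PMI.min()                  # shift the matrix to make it positive
    U, S, Vt = np.linalg.svd(M)              # full SVD; keep only the top d components
    W = U[:, :d] * S[:d]                     # word embedding matrix W = U D, one row per word

    vec = {w: W[i] for i, w in enumerate(words)}
    print(vec["smelt"])                      # continuous low-dimensional vector in R^d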

Word embeddings magic

• Word similarity (cosine)
  • vec(dog) ~ vec(puppy), vec(cat) ~ vec(kitten)
• Word meaning algebra: some relations are parallel across words
  • vec(puppy) - vec(dog) ~ vec(kitten) - vec(cat)
  • => vec(puppy) - vec(dog) + vec(cat) ~ vec(kitten)
  • vodka – Russia + Mexico, teacher – school + hospital…
  • (illustrated in the sketch below)
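A minimal sketch of the similarity and analogy operations (Python/NumPy; the 2-D embeddings are hand-made so that the parallelogram holds exactly – real vectors would come from the SVD or word2vec):

    import numpy as np

    # Hand-crafted toy embeddings: first axis ~ species, second axis ~ age.
    vec = {
        "dog":    np.array([2.0, 0.0]),
        "puppy":  np.array([2.0, 1.0]),
        "cat":    np.array([0.0, 2.0]),
        "kitten": np.array([0.0, 3.0]),
    }

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def nearest(target, vec, exclude=()):
        """Word whose embedding is most cosine-similar to `target`."""
        return max((w for w in vec if w not in exclude),
                   key=lambda w: cos(vec[w], target))

    # Word similarity
    print(cos(vec["dog"], vec["puppy"]), cos(vec["dog"], vec["cat"]))   # high vs. low

    # Word meaning algebra: vec(puppy) - vec(dog) + vec(cat) ~ vec(kitten)
    query = vec["puppy"] - vec["dog"] + vec["cat"]
    print(nearest(query, vec, exclude={"puppy", "dog", "cat"}))         # -> kitten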

word2vec (Mikolov et al., 2013)

• CBOW: predict word w_i from its context
  • E.g.: “I had _____ for lunch”
  • Sentence: … w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} …
• Architecture (shown as a diagram on the slide):
  • the context words w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2} enter as 1-hot vectors
  • a shared projection matrix (N×d) maps them to d-dimensional vectors, which are summed
  • another matrix (“linear hidden layer”, d×N) and a (hierarchical) softmax produce the output word distribution
  • trained with SGD
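A minimal sketch of the CBOW forward pass (Python/NumPy; untrained random matrices, a tiny made-up vocabulary, and a plain softmax standing in for the hierarchical softmax of the paper):

    import numpy as np

    # Toy setup; real word2vec uses N ~ 10^6 words and d ~ 10^2 dimensions.
    vocab = ["i", "had", "a", "smelt", "for", "lunch"]
    word2id = {w: i for i, w in enumerate(vocab)}
    N, d = len(vocab), 8

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(N, d))   # shared projection matrix (N x d)
    V = rng.normal(scale=0.1, size=(d, N))   # output ("linear hidden layer") matrix (d x N)

    def cbow_probs(context):
        """P(w_i | context): sum the context embeddings, project, softmax.
        A plain softmax stands in for the hierarchical softmax of the paper."""
        h = W[[word2id[w] for w in context]].sum(axis=0)   # summed context vectors, in R^d
        scores = h @ V                                     # one score per vocabulary word
        e = np.exp(scores - scores.max())
        return e / e.sum()

    # "I had ___ for lunch": distribution over the missing word; after SGD
    # training on real text, "smelt"-like words would score high here.
    print(cbow_probs(["i", "had", "for", "lunch"]))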

word2vec (Mikolov et al., 2013)

• Skip-gram (SGNS): predict the context from a word w_i
  • E.g.: “____ _____ smelt _____ _____”
  • Sentence: … w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} …
• Architecture (shown as a diagram on the slide):
  • the input word w_i enters as a 1-hot vector
  • a projection matrix (N×d) maps it to a d-dimensional vector
  • another shared matrix (d×N) with sigmoid outputs produces the output context vectors (distributions)
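A minimal sketch of one skip-gram negative-sampling update, reusing `W`, `V`, `word2id`, `N` and `rng` from the CBOW sketch above (negatives are drawn uniformly here, whereas real word2vec uses a smoothed unigram distribution):

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(word, context, k=2, lr=0.1):
        """One SGD step on a (word, context) pair with k negative samples."""
        i = word2id[word]
        # 1 positive context plus k random negative contexts
        pairs = [(word2id[context], 1.0)] + [(int(rng.integers(N)), 0.0) for _ in range(k)]
        for j, label in pairs:
            w_i, v_j = W[i].copy(), V[:, j].copy()
            grad = sigmoid(w_i @ v_j) - label   # d(logistic loss) / d(score)
            W[i]    -= lr * grad * v_j          # move the word embedding
            V[:, j] -= lr * grad * w_i          # move the context vector

    sgns_step("smelt", "lunch")                 # pulls vec(smelt) toward "lunch"-like contexts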

word2vec ~ implicit factorization

• Word embedding matrix W ∈ R^{N×d}
  • embedding(word_i) = W[i] ∈ R^d
• Levy & Goldberg (2014): word2vec SGNS implicitly factorizes M_PMI
  • M_PMI[i, j] = log [P(word_i | context_j) / P(word_i)]
  • SGNS: M_PMI = W V, with M_PMI ∈ R^{N×N} → W ∈ R^{N×d}, V ∈ R^{d×N}

Problem 2: Sentences

Variable-length input sequences with long-distance relations between elements (sentences) → fixed-size neural units (attention mechanisms)

Processing sentences

• Convolutional neural networks
• Recurrent neural networks
• Attention mechanism
• Self-attentive networks

Convolutional neural networks

• Input: sequence of word embeddings
• Filters (size 3–5), norm, max-pooling (see the sketch below)
• Training deep CNNs is hard → residual connections
  • Layer input averaged with its output, skipping the non-linearity
• Problem: capturing long-range dependencies
  • The receptive field of each filter is limited
  • “My computer works, but I have to buy a new mouse.”
• Good for word n-gram spotting
  • Sentiment analysis, named entity detection…
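A minimal sketch of one convolutional filter over a sentence (Python/NumPy; random stand-in embeddings and filter weights, a single filter of width 3, max-pooling over positions – a real model stacks many filters, non-linearities and layers):

    import numpy as np

    rng = np.random.default_rng(0)
    d, width = 8, 3                              # embedding size and filter width (toy values)
    sentence = ["my", "computer", "works", "but", "i", "have",
                "to", "buy", "a", "new", "mouse"]
    E = rng.normal(size=(len(sentence), d))      # stand-in word embeddings, one row per word

    filt = rng.normal(size=(width, d))           # one convolutional filter (width x d)

    # Slide the filter over every window of `width` consecutive words…
    activations = np.array([np.sum(E[i:i + width] * filt)
                            for i in range(len(sentence) - width + 1)])
    # …and max-pool over positions: one feature per filter for the whole sentence.
    feature = activations.max()
    print(feature)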

Recurrent neural networks

• Input: sequence of word embeddings
• Output: final state of the RNN (see the sketch below)
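A minimal sketch of a vanilla RNN encoder (Python/NumPy; random stand-in embeddings and weights – real systems typically use LSTM or GRU cells and trained parameters). The final hidden state is a fixed-size vector representing the whole sentence:

    import numpy as np

    rng = np.random.default_rng(0)
    d, h_dim = 8, 8                                    # embedding size and hidden size (toy values)
    sentence = ["my", "computer", "works", "but", "i", "have",
                "to", "buy", "a", "new", "mouse"]
    E = rng.normal(size=(len(sentence), d))            # stand-in word embeddings, one row per word

    W_h = rng.normal(scale=0.1, size=(h_dim, h_dim))   # hidden-to-hidden weights
    W_x = rng.normal(scale=0.1, size=(h_dim, d))       # input-to-hidden weights

    h = np.zeros(h_dim)
    for x in E:                                        # read the sentence one embedding at a time
        h = np.tanh(W_h @ h + W_x @ x)                 # vanilla RNN state update

    sentence_vector = h                                # final state: fixed-size encoding of the sentence
    print(sentence_vector)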
