SLIDE 1

CS 6956: Deep Learning for NLP

Recurrent Neural Networks

SLIDE 2

Overview

1. Modeling sequences
2. Recurrent neural networks: An abstraction
3. Usage patterns for RNNs
4. Bidirectional RNNs
5. A concrete example: The Elman RNN
6. The vanishing gradient problem
7. Long short-term memory units

SLIDE 4

Sequences abound in NLP

S a l t L a k e C i t y

Words are sequences of characters

SLIDE 5

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

Sentences are sequences of words

SLIDE 6

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

Paragraphs are sequences of sentences

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.

SLIDE 7

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

And so on… inputs are naturally sequences at different levels

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.

SLIDE 8

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

Outputs can also be sequences

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.

SLIDE 9

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

Part-of-speech tags form a sequence

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.
John lives in Salt Lake City

SLIDE 10

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

Part-of-speech tags form a sequence

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.
John lives in Salt Lake City
Noun Verb Preposition Noun Noun Noun

SLIDE 11

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

Even things that don’t look like a sequence can be made to look like one

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.
John lives in Salt Lake City
Noun Verb Preposition Noun Noun Noun
Person Location

Example: Named entity tags

SLIDE 12

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

Even things that don’t look like a sequence can be made to look like one

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.
John lives in Salt Lake City
Noun Verb Preposition Noun Noun Noun

Example: Named entity tags

B-PER O O B-LOC I-LOC I-LOC
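
To make the encoding concrete, here is a small sketch (a hypothetical helper, not from the slides) that turns entity spans, which are not obviously sequential, into a per-token BIO tag sequence:

```python
def spans_to_bio(tokens, spans):
    """Convert labeled (start, end, label) spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["John", "lives", "in", "Salt", "Lake", "City"]
print(spans_to_bio(tokens, [(0, 1, "PER"), (3, 6, "LOC")]))
# ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
```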

SLIDE 13

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

And we can get very creative with such encodings

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.
Noun Verb Preposition Noun Noun Noun

Example: We can encode parse trees as a sequence

  • The sequence of decisions needed to construct the tree (see the sketch below)

B-PER O O B-LOC I-LOC I-LOC
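
The slides do not commit to a particular encoding; one common choice (assumed here for illustration) is to linearize the tree's brackets, so the tree becomes a flat sequence of opening, word, and closing decisions. A sequence of shift/reduce transitions is another option.

```python
# A parse tree as nested (label, children) pairs; leaves are plain words.
tree = ("S", [("NP", ["John"]),
              ("VP", ["lives",
                      ("PP", ["in", ("NP", ["Salt", "Lake", "City"])])])])

def linearize(node):
    """Turn a tree into a flat sequence of bracket/word decisions."""
    if isinstance(node, str):
        return [node]
    label, children = node
    out = [f"({label}"]
    for child in children:
        out.extend(linearize(child))
    out.append(")")
    return out

print(" ".join(linearize(tree)))
# (S (NP John ) (VP lives (PP in (NP Salt Lake City ) ) ) )
```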

SLIDE 14

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

And we can get very creative with such encodings

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.
Noun Verb Preposition Noun Noun Noun

Example: We can encode parse trees as a sequence

  • The sequence of decisions needed to construct the tree

B-PER O O B-LOC I-LOC I-LOC

Natural question: How do we model sequential inputs and outputs?

SLIDE 15

Sequences abound in NLP

S a l t L a k e C i t y
John lives in Salt Lake City

And we can get very creative with such encodings

John lives in Salt Lake City. He enjoys hiking with his dog. His cat hates hiking.
Noun Verb Preposition Noun Noun Noun

Example: We can encode parse trees as a sequence

  • The sequence of decisions needed to construct the tree

B-PER O O B-LOC I-LOC I-LOC

Natural question: How do we model sequential inputs and outputs? More concretely, we need a mechanism that allows us to

  • 1. Capture sequential dependencies between inputs
  • 2. Model uncertainty over sequential outputs
SLIDE 16

Modeling sequences: The problem

Suppose we want to build a language model that computes the probability of sentences. We can write the probability as

$$P(y_1, y_2, y_3, \cdots, y_n) = \prod_i P(y_i \mid y_1, y_2, \cdots, y_{i-1})$$
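
A minimal sketch of what this factorization means computationally (not from the slides; `cond_prob` is a hypothetical stand-in for whatever model supplies the conditional probabilities):

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Chain-rule scoring: log P(y_1..y_n) = sum_i log P(y_i | y_1..y_{i-1})."""
    total = 0.0
    for i, token in enumerate(tokens):
        history = tuple(tokens[:i])        # everything before position i
        total += math.log(cond_prob(token, history))
    return total

# Toy usage with a uniform model over a 10,000-word vocabulary:
print(sentence_log_prob("It was a bright cold day in April .".split(),
                        lambda token, history: 1 / 10_000))
```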

SLIDE 17

It was a bright cold day in April.

Example: A Language model

SLIDE 18

It was a bright cold day in April.

Probability of a word starting a sentence

Example: A Language model

SLIDE 19

It was a bright cold day in April.

Probability of a word starting a sentence
Probability of a word following “It”

Example: A Language model

SLIDE 20

It was a bright cold day in April.

Probability of a word starting a sentence
Probability of a word following “It”

Example: A Language model

Probability of a word following “It was”

SLIDE 21

It was a bright cold day in April.

Probability of a word starting a sentence
Probability of a word following “It”

Example: A Language model

Probability of a word following “It was”
Probability of a word following “It was a”

SLIDE 22

It was a bright cold day in April.

Probability of a word starting a sentence
Probability of a word following “It”
Probability of a word following “It was”
Probability of a word following “It was a”

Example: A Language model
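
Written out for this example, these captions amount to the chain-rule factorization:

$$P(\text{It was a bright} \cdots) = P(\text{It}) \cdot P(\text{was} \mid \text{It}) \cdot P(\text{a} \mid \text{It was}) \cdot P(\text{bright} \mid \text{It was a}) \cdots$$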

SLIDE 23

A history-based model

  • Each token is dependent on all the tokens that came before it

– Simple conditioning
– Each P(xi | …) is a multinomial probability distribution over the tokens

  • What is the problem here?

– How many parameters do we have?

  • Grows with the size of the sequence!
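
As a rough illustration of that growth (the numbers are assumptions for illustration, not from the slides), the full-history model needs one multinomial over the vocabulary for every possible history:

```python
# Rough parameter count for the full-history model.
K = 10_000                                 # vocabulary size (illustrative)
n = 5                                      # sentence positions considered
histories = sum(K**t for t in range(n))    # possible histories of length 0..n-1
parameters = histories * (K - 1)           # each multinomial has K-1 free parameters
print(f"{parameters:.2e}")                 # ~1e20 already at length 5
```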

SLIDE 25

The traditional solution: Lose the history

Make a modeling assumption.

Example: The first order Markov model assumes that

$$P(y_i \mid y_1, y_2, \cdots, y_{i-1}) = P(y_i \mid y_{i-1})$$

This allows us to simplify

$$P(y_1, y_2, y_3, \cdots, y_n) = \prod_i P(y_i \mid y_1, y_2, \cdots, y_{i-1})$$

SLIDE 26

The traditional solution: Lose the history

Make a modeling assumption.

Example: The first order Markov model assumes that

$$P(y_i \mid y_1, y_2, \cdots, y_{i-1}) = P(y_i \mid y_{i-1})$$

This allows us to simplify

$$P(y_1, y_2, y_3, \cdots, y_n) = \prod_i P(y_i \mid y_1, y_2, \cdots, y_{i-1})$$

These dependencies are ignored

SLIDE 27

The traditional solution: Lose the history

Make a modeling assumption.

Example: The first order Markov model assumes that

$$P(y_i \mid y_1, y_2, \cdots, y_{i-1}) = P(y_i \mid y_{i-1})$$

This allows us to simplify

$$P(y_1, y_2, y_3, \cdots, y_n) = \prod_i P(y_i \mid y_{i-1})$$
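
A minimal sketch of such a first-order model, estimated by counting bigrams on a tiny made-up corpus (the corpus and the `<s>` start symbol are illustrative assumptions, not from the slides):

```python
from collections import Counter

# Toy corpus (made up for illustration).
corpus = [["It", "was", "a", "bright", "cold", "day", "in", "April"],
          ["It", "was", "a", "dark", "day"]]

bigram, context = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent                    # <s> marks the sentence start
    for prev, cur in zip(padded, padded[1:]):
        bigram[(prev, cur)] += 1
        context[prev] += 1

def p(cur, prev):
    """Maximum-likelihood estimate of P(cur | prev) under the Markov assumption."""
    return bigram[(prev, cur)] / context[prev]

# P(It was a ...) factorizes as P(It | <s>) * P(was | It) * P(a | was) * ...
print(p("It", "<s>") * p("was", "It") * p("a", "was"))   # 1.0 on this tiny corpus
```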

SLIDE 28

Example: Another language model

It was a bright cold day in April

Probability of a word starting a sentence
Probability of a word following “It”
Probability of a word following “was”
Probability of a word following “a”

SLIDE 29

Example: Another language model

It was a bright cold day in April

Probability of a word starting a sentence
Probability of a word following “It”
Probability of a word following “was”
Probability of a word following “a”

If there are K tokens/states, how many parameters do we need?

SLIDE 30

Example: Another language model

It was a bright cold day in April

Probability of a word starting a sentence
Probability of a word following “It”
Probability of a word following “was”
Probability of a word following “a”

If there are K tokens/states, how many parameters do we need? O(K²)
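
For a sense of scale (the vocabulary size is an illustrative assumption, not from the slides): with $K = 50{,}000$ word types, a first-order model needs about

$$K^2 = 50{,}000^2 = 2.5 \times 10^9$$

conditional probabilities. That is large, but unlike the full-history model it does not grow with the length of the sentence.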

SLIDE 31

Can we do better?

  • Can we capture the meaning of the entire history without arbitrarily growing the number of parameters?
  • Or equivalently, can we discard the Markov assumption?
  • Can we represent arbitrarily long sequences as fixed sized vectors?

– Perhaps to provide features for subsequent classification

  • Answer: Recurrent neural networks (RNNs)
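
As a preview of what the later sections develop in detail, here is a minimal NumPy sketch of the recurrence $h_t = \tanh(W x_t + U h_{t-1} + b)$, with made-up dimensions. The parameters $W$, $U$, $b$ have a fixed size, yet the final hidden vector summarizes an arbitrarily long input:

```python
import numpy as np

d_in, d_hid = 4, 3                        # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(d_hid, d_in))        # input-to-hidden weights
U = rng.normal(size=(d_hid, d_hid))       # hidden-to-hidden (recurrent) weights
b = np.zeros(d_hid)

def encode(xs):
    """Fold a sequence of input vectors into one fixed-size hidden vector."""
    h = np.zeros(d_hid)
    for x in xs:                          # same W, U, b reused at every step
        h = np.tanh(W @ x + U @ h + b)
    return h

summary = encode([rng.normal(size=d_in) for _ in range(100)])
print(summary.shape)                      # (3,): fixed size, regardless of length
```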
