Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model – PowerPoint PPT Presentation


SLIDE 1

Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model

AAAI 2019 Technical Track
Sabrina J. Mielke and Jason Eisner
sjmielke@jhu.edu, jason@cs.jhu.edu
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

SLIDES 2–26

Language modeling: a generative story of text

A language model factors the probability of text with the chain rule:

p(the cat chased the) = p(the) · p(cat | the) · p(chased | the cat) · p(the | the cat chased)

Lexicon / vocabulary:

  type w   spelling σ(w)   embedding e(w)
  1        the             [ 0.2, ··· , 0.0]
  2        cat             [ 0.4, ··· , 0.5]
  3        chased          [−0.1, ··· , 0.2]
  4        UNK             [ 0.3, ··· , 0.1]

Text generation with an RNN: at each step, an RNN cell emits the next word and feeds its embedding back in: the, cat, chased, the. But with a closed vocabulary, a novel word such as "c a g e d" can only be emitted as UNK ...but what is the word?

Pure character-level model as the solution? A single character-level RNN generates the text one character at a time: t h e   c a t   c h a s e d   ...   t h e. Ugh, spelling "the" again... ...can't we memorize it?
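To make the chain-rule factorization concrete, here is a minimal, self-contained sketch. A count-based bigram model stands in for the word-level RNN (the slides use an RNN, not counts), and the closed vocabulary and UNK handling are hypothetical toy choices for illustration:

```python
import math
from collections import defaultdict

# Toy closed vocabulary; anything else collapses to UNK.
VOCAB = {"the", "cat", "chased", "UNK"}

def tokenize(text):
    return [w if w in VOCAB else "UNK" for w in text.split()]

class BigramLM:
    """Bigram stand-in for the word-level RNN p(w_t | w_1 .. w_{t-1})."""
    def __init__(self, corpus):
        self.counts = defaultdict(lambda: defaultdict(int))
        for sent in corpus:
            words = ["<s>"] + tokenize(sent)
            for prev, cur in zip(words, words[1:]):
                self.counts[prev][cur] += 1

    def prob(self, prev, cur, alpha=1.0):
        # add-alpha smoothing over the closed vocabulary
        total = sum(self.counts[prev].values()) + alpha * len(VOCAB)
        return (self.counts[prev][cur] + alpha) / total

    def sentence_logprob(self, sent):
        # chain rule: log p(w1..wn) = sum_i log p(w_i | history),
        # with the history truncated to one previous word here
        words = ["<s>"] + tokenize(sent)
        return sum(math.log(self.prob(p, c)) for p, c in zip(words, words[1:]))

lm = BigramLM(["the cat chased the cat"])
assert lm.sentence_logprob("the cat chased the") < 0.0
# "caged" is out of vocabulary, so it can only surface as UNK
assert tokenize("the cat caged")[-1] == "UNK"
```

This reproduces the slide's problem: the token-level model can score any in-vocabulary sequence, but a novel spelling is unrecoverable once it is mapped to UNK.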

SLIDES 27–47

Our model: Spell once, summon anywhere

Known words only have to be spelled out once, and can then be summoned anywhere:

  Spell once: a character-level RNN spells each lexicon type exactly once, pairing the spelling with its embedding:
    e(1) ↔ σ(1): t h e
    e(2) ↔ σ(2): c a t
    e(3) ↔ σ(3): c h a s e d

  Summon anywhere: a word-level RNN over hidden states h1, ..., h4 picks types, and spellings are looked up rather than re-generated:
    h1 → w1 = 1, σ(w1) = the
    h2 → w2 = 2, σ(w2) = cat
    h3 → w3 = 3, σ(w3) = chased
    h4 → w4 = 1, σ(w4) = the

  (look up embeddings on the way in, look up spellings on the way out)
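The two-level generative story above can be sketched in a few lines. All names here (Lexicon, add, token_ids) are illustrative, not from the paper's code, and a fixed token sequence stands in for sampling from the word-level RNN:

```python
class Lexicon:
    """Type level: each word is spelled out once, paired with an embedding."""
    def __init__(self):
        self.spellings = []   # sigma(w), produced once by the speller
        self.embeddings = []  # e(w)

    def add(self, spelling, embedding):
        self.spellings.append(spelling)
        self.embeddings.append(embedding)
        return len(self.spellings) - 1  # type id used at the token level

lex = Lexicon()
THE = lex.add("the", [0.2, 0.0])
CAT = lex.add("cat", [0.4, 0.5])
CHASED = lex.add("chased", [-0.1, 0.2])

# Token level ("summon anywhere"): the word-level model emits type ids;
# realizing the text is pure lookup, never character-by-character respelling.
token_ids = [THE, CAT, CHASED, THE]
text = " ".join(lex.spellings[w] for w in token_ids)

assert text == "the cat chased the"
# "the" was spelled once but summoned twice: 3 types cover 4 tokens
assert token_ids.count(THE) == 2 and len(lex.spellings) == 3
```

The point of the design is visible in the last assertion: spelling cost is paid per type, while usage is paid per token.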

SLIDES 48–55

Our model: Spell once, summon anywhere – the open-vocabulary case

Known words only have to be spelled out once, and can then be summoned anywhere. Unknown words are spelled out “on-demand” using the same character-level model.

  Lexicon (spelled once): e(UNK); e(1) ↔ σ(1) = t h e; e(2) ↔ σ(2) = c a t; e(3) ↔ σ(3) = c h a s e d

  Word-level RNN:
    h1 → w1 = 1, σ(w1) = the
    h2 → w2 = 2, σ(w2) = cat
    h3 → w3 = UNK: the character-level RNN spells a novel word on-demand: c a g e d
    h4 → w4 = 1, σ(w4) = the
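A hedged sketch of this open-vocabulary step: known types are looked up, while an UNK token triggers the speller "on demand". Here `spell_on_demand` is a deterministic stand-in for sampling from pspell(· | h), which in the model is a character-level RNN; the names are hypothetical:

```python
LEXICON = {1: "the", 2: "cat", 3: "chased"}  # sigma(w) for known types
UNK = "UNK"

def spell_on_demand(hidden_state):
    # Stand-in: the real model samples characters from p_spell(. | h),
    # so the spelling can depend on the sentence context via h.
    return "caged"

def realize(token_stream, hidden_states):
    out = []
    for w, h in zip(token_stream, hidden_states):
        if w == UNK:
            out.append(spell_on_demand(h))  # novel spelling, generated here
        else:
            out.append(LEXICON[w])          # known spelling, just looked up
    return " ".join(out)

text = realize([1, 2, UNK, 1], ["h1", "h2", "h3", "h4"])
assert text == "the cat caged the"
```

Note that the same character-level machinery serves both jobs: spelling lexicon types once at the type level, and spelling novel words at the token level.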

SLIDES 56–66

Samples from the model

Sampled text from our model: Following the death of Edward McCartney in 1060 , the new definition was transferred to the WDIC f Fullett .

  “f Fullett”: a novel word with a contextually appropriate spelling.

  known spelling ❀ novel spelling sampled from its embedding:
    grounded ❀ stipped
    differ ❀ coronate
    Clive ❀ Dickey
    Southport ❀ Strigger
    Carl ❀ Wuly
    Chants ❀ Tranquels
    valuables ❀ migrations

So why is this a good way of modeling language?

SLIDES 67–69

Linguistic notions: duality of patterning

“The meaningful elements in any language—"words" in everyday parlance [...]— [...] are represented by [a] small stock of distinguishable sounds which are in themselves wholly meaningless.” – Hockett, 1960 (in writing: characters)

“Meaningless” character composition should be separate from “meaningful” word composition!

We should need a word’s spelling only to define it – not to later use it.

SLIDES 70–76

Duality of patterning → conditional independence!

So? Why does this linguistics blurb matter?

  • Irregular words have uncommon spellings (children) ...yet we use them like regular words!

  • Function words have uncommon spellings (the, of) ...yet we use them all the time without feeling weird!

Recall: We should need a word’s spelling only to define it – not to later use it.

i.e. character-level models do it wrong! ...and they’re slow as hell...

usage ⊥ spelling | embedding
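The conditional-independence claim can be written out as a factorization. The equation below is a hedged reconstruction from the slide notation (σ(w), e(w), wt), not the paper's exact formula:

```latex
% usage  |  spelling | embedding:
% spellings enter the model only when each type is defined ("spell once");
% token-level usage depends on spellings only through the embeddings.
p\bigl(\text{lexicon},\, w_1 \dots w_n\bigr)
  = \underbrace{\prod_{w \in \text{lexicon}}
      p_{\text{spell}}\bigl(\sigma(w) \mid e(w)\bigr)}_{\text{spell once}}
    \cdot
    \underbrace{\prod_{t=1}^{n}
      p\bigl(w_t \mid w_1 \dots w_{t-1}\bigr)}_{\text{summon anywhere}}
```

Because the first product is over types and the second over tokens, a frequent word with an irregular spelling (the, children) pays its spelling cost once, not on every use, which is exactly where pure character-level models lose.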

slide-77
SLIDE 77

The arbitrariness of the sign → allowing for idiosyncracy

How should a word’s embedding and its spelling be connected?

8

slide-78
SLIDE 78

The arbitrariness of the sign → allowing for idiosyncracy

How should a word’s embedding and its spelling be connected? The connection between the signifier and the signified is arbitrary. – de Saussure, 1916, translated

8

slide-79
SLIDE 79

The arbitrariness of the sign → allowing for idiosyncracy

How should a word’s embedding and its spelling be connected? The connection between the signifier and the signified is arbitrary. spelling meaning – de Saussure, 1916, translated

8

slide-85
SLIDE 85

The arbitrariness of the sign → allowing for idiosyncrasy

How should a word’s embedding and its spelling be connected? “The connection between the signifier and the signified is arbitrary.” – de Saussure, 1916 (translated)

Meaning is not fully predictable from spelling.

Example: neither silly nor folly is an adverb, even though both end in -ly! “Construction” models like e(caged) := CNN(c a g e d) ignore this!

⇒ Allow any pairing a priori, but use spellings as a prior / regularizer! Outliers (children, the, ...) may have idiosyncratic embeddings!

regularize embeddings, don’t construct them

8
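The “regularize, don’t construct” idea can be sketched in a few lines of Python. This is a toy illustration, not the paper’s model: `char_vec` and `spelling_vector` are made-up stand-ins for learned networks, and the actual model uses the speller’s log-likelihood of the spelling given the embedding as the prior rather than a squared distance.

```python
# Sketch: embeddings are free parameters, but a spelling-based
# regularizer pulls them toward what their characters suggest.
# All names here (char_vec, spelling_vector) are illustrative.

def char_vec(c, dim=4):
    """Toy deterministic character vector (stand-in for a learned one)."""
    return [((ord(c) * (i + 3)) % 7) / 7.0 for i in range(dim)]

def spelling_vector(word, dim=4):
    """Average of character vectors: a crude 'constructed' embedding."""
    vecs = [char_vec(c, dim) for c in word]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def regularizer(embedding, word, lam=0.1):
    """Penalty lam * ||e(w) - spell(w)||^2: the embedding MAY deviate
    (idiosyncrasy is allowed), it just pays a price for doing so."""
    s = spelling_vector(word, len(embedding))
    return lam * sum((e - si) ** 2 for e, si in zip(embedding, s))

# A word whose free embedding matches its spelling pays nothing extra:
assert regularizer(spelling_vector("silly"), "silly") == 0.0

# An idiosyncratic embedding (e.g. for "the") is allowed, at a cost:
print(regularizer([0.9, 0.1, 0.0, 0.5], "the"))
```

The key design point this illustrates: unlike a pure construction model, any (spelling, embedding) pairing keeps positive probability; the spelling only shapes the prior.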

slide-86
SLIDE 86

Recap: how does our model implement these ideas?

Embeddings and spellings are connected on the type level, ensuring conditional independence of usage and spelling while assigning positive probability to any pairing!

[Diagram: a character-level speller RNN generates each spelling once at the type level (σ(1) = t-h-e, σ(2) = c-a-t, σ(3) = c-h-a-s-e-d), tied to the type embeddings e(1), e(2), e(3), and e(UNK). A sentence-level RNN then emits the token sequence w1 = 1 (the), w2 = 2 (cat), w3 = UNK, w4 = 1 (the); the novel token’s spelling is drawn on the fly, s ∼ pspell(· | h3), here c-a-g-e-d.]

9
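The generative story in the recap can be mimicked with toy numbers: a word-level model emits lexicon IDs or UNK, and an UNK is then spelled character by character by a speller model. The constants below are made up and unconditioned, standing in for the two RNNs.

```python
import math

# Toy two-level generative story: word-level events are lexicon IDs
# or UNK; an UNK's spelling is generated one character at a time.
LEXICON = {1: "the", 2: "cat", 3: "chased"}
p_word = {1: 0.4, 2: 0.2, 3: 0.2, "UNK": 0.2}   # stand-in word-level model
p_char = 1 / 27                                  # uniform speller: a-z + end

def log2_prob(tokens):
    total = 0.0
    for tok in tokens:
        wid = next((i for i, s in LEXICON.items() if s == tok), None)
        if wid is not None:                      # in-vocabulary: one event
            total += math.log2(p_word[wid])
        else:                                    # novel word: UNK + spell it
            total += math.log2(p_word["UNK"])
            total += (len(tok) + 1) * math.log2(p_char)  # chars + end symbol
    return total

print(log2_prob(["the", "cat", "caged", "the"]))
```

Note how the novel word "caged" is still assigned a fully generated, positive probability; it is just more expensive than an in-vocabulary token, since every character must be predicted.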

slide-93
SLIDE 93

How do we evaluate open-vocabulary language models?

  • 1. Report likelihood p(held-out text) as perplexity, measured in bits per character (↓ lower is better)

  • 2. No UNKing allowed!∗ → we must predict every character of the text, regardless of vocabulary size

∗ Yes, we call some words “UNK” temporarily, but we still generate them fully!

⇒ A tunable “vocabulary size” hyperparameter decides what is temporary-UNK.

10
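Bits per character is simply the model’s total negative log2-probability of the held-out text divided by the number of characters it had to predict; a minimal sketch with toy numbers:

```python
def bits_per_character(log2_probs, num_chars):
    """log2_probs: per-event log2 p(event | history) under any model
    (words, subwords, or characters); num_chars: characters predicted.
    bpc = -(total log2 probability) / (number of characters)."""
    return -sum(log2_probs) / num_chars

# Toy numbers: a model assigning a 20-character held-out string total
# probability 2^-30 scores -(-30)/20 = 1.5 bits per character.
print(bits_per_character([-30.0], 20))  # 1.5
```

Because the denominator is a character count rather than a token count, models with different vocabularies (or no fixed vocabulary at all) become directly comparable.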

slide-101
SLIDE 101

Results

WikiText-2 (Merity et al., 2017): 2.5 million tokenized words from the English Wikipedia. Bits per character on dev data (split by word frequency) and on the full test set (↓ lower is better):

model                                                  novel   rare   frequent   test (all)
pure character-level RNN                                3.89   2.08     1.38       1.775
HCLM + cache, previous SOTA (Kawakami et al., 2017)       –      –        –        1.500
BPE baseline (the ca@@ t cha@@ sed)                     4.01   1.70     1.08       1.468
our full model: Spell Once, Summon Anywhere             4.00   1.64     1.10       1.455

...and plenty more baselines, ablations, datasets, and questions answered in the paper!

11
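The BPE baseline above segments each word with a learned merge list; a minimal sketch of that segmentation step, with a hand-picked toy merge list (real merges are learned by repeatedly fusing the most frequent adjacent symbol pair over a corpus):

```python
# Toy merge list, hand-picked so that the running example segments
# as on the slide: "the", "ca@@ t", "cha@@ sed".
MERGES = [("t", "h"), ("th", "e"), ("c", "h"), ("c", "a"),
          ("ch", "a"), ("s", "e"), ("se", "d")]

def bpe_segment(word, merges):
    """Start from characters and apply merges in their learned order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # fuse the adjacent pair
            else:
                i += 1
    return symbols

print([bpe_segment(w, MERGES) for w in ["the", "cat", "chased"]])
# [['the'], ['ca', 't'], ['cha', 'sed']]
```

The word-level LM then runs over these subword units instead of whole words, which is why BPE handles novel words without a special UNK mechanism.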

slide-104
SLIDE 104

Conclusion

  • 1. think about language before you model:

usage ⊥ spelling | embedding
regularize embeddings, don’t construct them

  • 2. simple and criminally underused baselines can beat fancy but bad models

model strings by segments?

  • 3. open-vocabulary language modeling is an exciting task!

12

slide-105
SLIDE 105

Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model

AAAI 2019 Technical Track Sabrina J. Mielke and Jason Eisner

sjmielke@jhu.edu, jason@cs.jhu.edu

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

13