Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model
AAAI 2019 Technical Track
Sabrina J. Mielke and Jason Eisner
sjmielke@jhu.edu, jason@cs.jhu.edu
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
p(the cat chased the) = p(the) · p(cat | the) · p(chased | the cat) · p(the | the cat chased)

Lexicon / vocabulary:

  type w   spelling σ(w)   embedding e(w)
  1        the             [ 0.2, ··· ,  0.0]
  2        cat             [ 0.4, ··· ,  0.5]
  3        chased          [−0.1, ··· ,  0.2]
  4        caged           [ 0.3, ··· ,  0.1]

Text generation with an RNN: the word-level RNN emits one known type per step — the, cat, chased, the — but when the next word is not in the lexicon (c a g e d), all it can emit is UNK ...but what is the word?

Pure character-level model as the solution? A character-level RNN emits one character at a time — t h e   c a t   c h a s e d — so it can produce any word. But then it must also spell out t h e from scratch at every occurrence. Ugh, spelling the again... ...can’t we memorize it?
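The chain-rule factorization above can be sketched in a few lines. The uniform `cond_prob` below is a hypothetical stand-in for a word-level RNN's softmax, and the closed `vocab` is the toy lexicon; any out-of-vocabulary token collapses to UNK, which is exactly the problem the rest of the talk addresses.

```python
import math

# Toy closed vocabulary; cond_prob is a hypothetical stand-in for the
# word-level RNN's softmax (uniform here so the bookkeeping stays visible).
vocab = ["the", "cat", "chased", "UNK"]

def cond_prob(word, history):
    return 1.0 / len(vocab)  # a real model would condition on `history`

def sentence_log2prob(words):
    # Chain rule: log2 p(w1..wn) = sum_i log2 p(w_i | w_1 .. w_{i-1}).
    total = 0.0
    for i, w in enumerate(words):
        w = w if w in vocab else "UNK"  # OOV words collapse to UNK
        total += math.log2(cond_prob(w, words[:i]))
    return total

lp_known = sentence_log2prob("the cat chased the".split())  # 4 · log2(1/4) = -8.0
lp_novel = sentence_log2prob("the cat caged the".split())   # also -8.0: "caged" was UNKed away
```

Note that the second sentence gets the same score as the first only because "caged" has been silently replaced by UNK; the model never accounts for its spelling.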
Known words only have to be spelled out once, and can then be summoned anywhere.

Look up spellings, look up embeddings: a character-level RNN reads each spelling in the lexicon once — σ(1) = t h e, σ(2) = c a t, σ(3) = c h a s e d — yielding the type embeddings e(1), e(2), e(3). The sentence-level RNN then generates a sequence of types, e.g. w1 = 1, w2 = 2, w4 = 1, and the output text is just the looked-up spellings σ(w1), σ(w2), σ(w4): the, cat, the.

Unknown words are spelled out “on-demand” using the same character-level model: when the sentence-level RNN generates w3 = UNK (with embedding e(UNK)), the character-level RNN spells a fresh word — c a g e d — in that slot.
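The "spell once, summon anywhere" idea can be read as a caching pattern. The sketch below is an illustration of the idea, not the paper's implementation; `encode_chars` is a hypothetical stand-in for the speller RNN.

```python
# "Spell once, summon anywhere" as a caching pattern: the character-level
# encoder runs once per word *type*; every later token of that word reuses
# the stored embedding instead of re-spelling it.

encoder_calls = 0

def encode_chars(spelling):
    global encoder_calls
    encoder_calls += 1
    # Stand-in for the speller RNN's final hidden state over the characters.
    return [ord(c) / 128.0 for c in spelling]

lexicon = {}  # word type -> embedding, filled lazily

def embedding(word):
    if word not in lexicon:        # spell once...
        lexicon[word] = encode_chars(word)
    return lexicon[word]           # ...summon anywhere

for token in "the cat chased the".split():
    embedding(token)

# 4 tokens, but only 3 types: the encoder ran 3 times, not 4.
```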
Sampled text from our model:

  Following the death of Edward McCartney in 1060 , the new definition was transferred to the WDIC

WDIC is a novel word with a contextually appropriate spelling.

Known spelling ❀ novel spelling sampled from its embedding:

  grounded  ❀ stipped
  differ    ❀ coronate
  Clive     ❀ Dickey
  Southport ❀ Strigger
  Carl      ❀ Wuly
  Chants    ❀ Tranquels
  valuables ❀ migrations

So why is this a good way of modeling language?
The meaningful elements in any language—"words" in everyday parlance [...]—[...] are represented by [a] small stock of distinguishable sounds which are in themselves wholly meaningless.
– Hockett, 1960

(for written text: characters)
So? Why does this linguistics blurb matter?

...yet we use them like regular words!
...yet we use them all the time without feeling weird!

Recall: We should need a word’s spelling only to define it – not to later use it.
i.e., character-level models do it wrong!
...and they’re slow as hell...
How should a word’s embedding and its spelling be connected?

  The connection between the signifier and the signified is arbitrary.
  – de Saussure, 1916, translated

(figure: spelling ↔ meaning, an arbitrary pairing)

Example: neither silly nor folly is an adverb, even though they both end in -ly!
“Construction” models like e(caged) := CNN(c a g e d) ignore this!

⇒ Allow any pairing a priori, but use spellings as prior / regularization!
  Outliers (children, the, ...) may have idiosyncratic embeddings!
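A minimal sketch of "use spellings as a prior / regularization" — our reading of the slide, not the paper's exact objective. Each word type keeps a free embedding, but an L2 penalty pulls it toward an encoding of its spelling; `spell_encode`, `LAMBDA`, and the toy vectors are all illustrative.

```python
import numpy as np

DIM, LAMBDA = 4, 0.1

def spell_encode(spelling):
    # Stand-in for a character-level encoder; a fixed deterministic
    # projection keeps the example self-contained.
    vec = np.zeros(DIM)
    for i, ch in enumerate(spelling):
        vec[(ord(ch) + i) % DIM] += 1.0
    return vec / len(spelling)

def spelling_prior_penalty(lexicon):
    # Gaussian prior: e(w) may deviate from spell_encode(σ(w)) — think
    # idiosyncratic words like "the" — but deviation costs LAMBDA·||·||².
    return LAMBDA * sum(
        float(np.sum((e - spell_encode(w)) ** 2)) for w, e in lexicon.items()
    )

rng = np.random.default_rng(0)
lexicon = {"the": rng.normal(size=DIM), "cat": rng.normal(size=DIM)}
penalty = spelling_prior_penalty(lexicon)  # added to the training loss
```

Unlike a "construction" model, the penalty vanishes only if the embedding happens to match its spelling encoding; any other pairing stays possible, just at a cost.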
Embeddings and spellings are connected on the type level, ensuring conditional independence of usage and spelling while assigning positive probability to any pairing!

(figure: the full model — the speller RNN turns the lexicon’s spellings σ(1) = t h e, σ(2) = c a t, σ(3) = c h a s e d into embeddings e(1), e(2), e(3); the sentence-level RNN generates types w1 = 1, w2 = 2, w3 = UNK, w4 = 1; known types print their stored spellings σ(w1), σ(w2), σ(w4) — the, cat, the — while the UNK slot is spelled on demand, s ∼ pspell(· | h3), here c a g e d)
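The two-level generative story can be written as runnable pseudocode. The distributions below are invented for illustration: the sentence-level model emits word types, known types print their stored spelling, and UNK triggers the character-level speller to sample a fresh spelling on demand.

```python
import random

random.seed(0)
lexicon = {1: "the", 2: "cat", 3: "chased"}

def sample_type():
    # Stand-in for the sentence-level RNN's softmax over types ∪ {UNK}.
    return random.choices([1, 2, 3, "UNK"], weights=[4, 2, 2, 1])[0]

def spell_on_demand():
    # Stand-in for s ~ pspell(· | h): emit characters until a stop symbol.
    out = []
    while True:
        c = random.choice("abcdefghijklmnopqrstuvwxyz$")
        if c == "$":
            if out:  # require a non-empty spelling
                return "".join(out)
        else:
            out.append(c)

def generate(n_words):
    tokens = []
    for _ in range(n_words):
        t = sample_type()
        tokens.append(lexicon[t] if t != "UNK" else spell_on_demand())
    return " ".join(tokens)

text = generate(5)
```

In the real model the speller is conditioned on the hidden state (and any novel type can be added to the lexicon and reused); here it is unconditional only to keep the sketch short.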
How do we evaluate?

1. bits per character
2. no UNKing allowed!* → we must predict every character of the text, regardless of vocabulary size

⇒ A tunable “vocabulary size” hyperparameter decides what is temporary-UNK.
______________
* Yes, we call some words “UNK” temporarily, but we still generate them fully!
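Bits per character made concrete (the probabilities below are invented for the example): total negative log2-probability of the text divided by its character count, here including spaces — conventions vary by paper. Because every model must account for the same characters, word-, BPE-, and character-level models become comparable.

```python
import math

def bits_per_character(token_probs, text):
    # token_probs: p(token_i | history) for each step under some model.
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / len(text)

bpc = bits_per_character([0.25, 0.125], "the cat")  # (2 + 3) bits over 7 chars
```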
WikiText-2 (Merity et al., 2017): 2.5 million tokenized words from the English Wikipedia

Test bits per character (lower is better):

  model                                      novel words   rare words   frequent words   all
  pure character-level RNN                   3.89          2.08         1.38             1.775
  HCLM + cache — previous SOTA
  (Kawakami et al., 2017)                    –             –            –                1.500
  BPE (the · ca@ @t · cha@ @sed)             4.01          1.70         1.08             1.468
  our model                                  4.00          1.64         1.10             1.455

...and plenty more baselines, ablations, datasets, and questions answered in the paper!
usage ⊥ spelling | embedding
regularize embeddings, don’t construct them
model strings by segments?