An Unsupervised Method for Uncovering Morphological Chains
Karthik Narasimhan Regina Barzilay Tommi Jaakkola CSAIL, Massachusetts Institute of Technology
Morphological Chains
Chains to model the formation of words.
paint → painting → paintings
Richer representation than traditional scenarios (segmentation, paradigms)
paint → painting
Core Idea: Unsupervised discriminative model over pairs of words in the chain.
Morfessor (Goldwater and Johnson, 2004; Creutz and Lagus, 2007), Poon et al., 2009, Dreyer and Eisner, 2009, Sirts and Goldwater, 2013
Schone and Jurafsky, 2000; Baroni et al., 2002
Orthographic: Patterns in the characters forming words.
paint, paints, painted
pain, pains, pained
pain, paint
ran, rant
Semantic: Meaning embedded as vectors.
A      B        cos(A,B)
paint  paints   0.68
paint  painted  0.60
pain   pains    0.60
pain   paint    0.11
ran    rant     0.09
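The semantic signal above is a cosine similarity between word vectors. A minimal sketch of that computation follows; the three-dimensional vectors are invented for illustration and are not the vectors used in the talk, and the zero-vector convention is an assumption of this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only (not learned from Wikipedia).
toy_vectors = {
    "paint": [0.9, 0.1, 0.3],
    "paints": [0.8, 0.2, 0.4],
    "pain": [0.1, 0.9, 0.2],
}

sim_related = cosine(toy_vectors["paint"], toy_vectors["paints"])
sim_unrelated = cosine(toy_vectors["paint"], toy_vectors["pain"])
```

With vectors of this shape, the morphologically related pair scores higher than the pair that only looks similar on the surface, which is the pattern the table shows.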
Training: Unannotated word list with frequencies
a        395134
ability  17793
able     56802
about    524355
Word Vector Learning: Large text corpus (Wikipedia)
Multiple chains possible for a word.
nation → national → international → internationally
nation → national → nationally → internationally
Different chains can share word pairs.
nation → national → international → internationally
nation → national → nationalize
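To make the sharing concrete, a chain can be broken into its (word, parent) pairs. The sketch below does this for the two example chains; the function name `chain_to_pairs` is a hypothetical helper, not part of the released code.

```python
def chain_to_pairs(chain):
    """Split a chain [root, ..., word] into a set of (word, parent) pairs."""
    return {(child, parent) for parent, child in zip(chain, chain[1:])}

chain_a = ["nation", "national", "international", "internationally"]
chain_b = ["nation", "national", "nationalize"]

# Pairs that appear in both chains.
shared = chain_to_pairs(chain_a) & chain_to_pairs(chain_b)
```

Here the two chains share the pair (national, nation), so a model defined over pairs scores that step only once.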
Treat word-parent pairs separately
Word (w): national
Candidate (z):
  Parent (p): nation
  Type (t): Suffix
Types - Prefix, Suffix, Transformations, Stop.
addition of affixes.
alphabet). Ex.
plan → planning P Q R
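A rough sketch of how (parent, type) candidates might be enumerated for a word is given below. The type labels "Suffix", "Prefix" and "Stop" follow the slide; the label "Repeat", the `min_parent_len` parameter, and the choice to represent Stop as the word paired with itself are assumptions of this sketch, with the doubled-letter rule modeled on the plan → planning example.

```python
def candidates(word, min_parent_len=2):
    """Enumerate hypothetical (parent, type) candidates for a word."""
    # Assumption: Stop is represented as the word paired with itself.
    cands = [(word, "Stop")]
    # Suffix: word = parent + suffix.
    for i in range(min_parent_len, len(word)):
        cands.append((word[:i], "Suffix"))
    # Prefix: word = prefix + parent.
    for i in range(1, len(word) - min_parent_len + 1):
        cands.append((word[i:], "Prefix"))
    # Transformation (assumed label "Repeat"): the parent's last character
    # is doubled before the suffix, as in plan -> planning.
    for i in range(min_parent_len + 1, len(word)):
        if word[i - 1] == word[i - 2]:
            cands.append((word[:i - 1], "Repeat"))
    return cands

planning_cands = candidates("planning")
```

For "planning" the list contains both ("plann", "Suffix") and ("plan", "Repeat"); choosing between them is left to the model.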
3 different transformations:
Trade-off between types of transformation and computational tractability.
computationally tractable: max O(|Σ|^2) for alphabet Σ
Orthographic
for top affixes
affixes sharing set of stems (inter-, re-), (under-, over-)
character bigrams
Semantic
between word vectors
Cosine similarity with player
$$\prod_w P(w) = \prod_w \sum_z P(w, z) = \prod_w \sum_z \frac{e^{\theta \cdot \phi(w, z)}}{\sum_{w' \in \Sigma^*,\, z'} e^{\theta \cdot \phi(w', z')}}$$
(with regularization)
in alphabet to calculate normalization constant, Z.
Eisner, 2005):
take probability mass from.
last k chars in word. (Ex. painting → paintnig)
$$P(w, z) = \frac{e^{\theta \cdot \phi(w, z)}}{\sum_{w' \in N(w),\, z'} e^{\theta \cdot \phi(w', z')}}$$
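The neighborhood-normalized probability can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: neighbors are built by swapping adjacent characters within the last k characters (matching the painting → paintnig example), the neighborhood is assumed to include the word itself, and `phi`, `candidates_of`, and the feature name `parent_in_vocab` are hypothetical stand-ins.

```python
import math

def neighbors(word, k=5):
    """The word plus variants with two adjacent characters swapped
    somewhere in its last k characters (assumed neighborhood N(w))."""
    result = {word}
    start = max(0, len(word) - k)
    for i in range(start, len(word) - 1):
        result.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return result

def contrastive_prob(word, cand, theta, phi, candidates_of, k=5):
    """P(w, z) with the denominator summed over N(w) instead of Sigma*."""
    def score(w, z):
        # exp(theta . phi(w, z)) with sparse feature dictionaries.
        return math.exp(sum(theta.get(f, 0.0) * v for f, v in phi(w, z).items()))
    denom = sum(score(w2, z2)
                for w2 in neighbors(word, k)
                for z2 in candidates_of(w2))
    return score(word, cand) / denom

# Toy setup, invented for illustration only.
vocab = {"paint", "painting"}
toy_candidates = lambda w: [(w[:-3], "Suffix"), (w, "Stop")]
toy_phi = lambda w, z: {"parent_in_vocab": 1.0 if z[0] in vocab else 0.0}

p = contrastive_prob("painting", ("paint", "Suffix"),
                     {"parent_in_vocab": 2.0}, toy_phi, toy_candidates, k=3)
```

The denominator now ranges over a handful of strings rather than all of Σ*, which is what makes the normalization computable.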
candidate each time) till stop.
paintings → painting → paint → STOP
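The prediction loop above can be sketched as a simple greedy walk. The `best_candidate` callback stands in for the trained model and is hypothetical, as are the `max_steps` guard and the toy lookup table.

```python
def build_chain(word, best_candidate, max_steps=20):
    """Follow the best (parent, type) candidate repeatedly until Stop."""
    chain = [word]
    for _ in range(max_steps):  # guard against cycles in a bad predictor
        parent, ctype = best_candidate(chain[-1])
        if ctype == "Stop":
            break
        chain.append(parent)
    return chain

# Toy predictor, invented for illustration only.
toy_best = {
    "paintings": ("painting", "Suffix"),
    "painting": ("paint", "Suffix"),
    "paint": ("paint", "Stop"),
}
chain = build_chain("paintings", lambda w: toy_best[w])
```

The resulting list runs from the full word back to its root, mirroring the chain shown on the slide.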
Recall, F1 over individual segmentation points
AGMorph (Sirts and Goldwater, 2013) and Lee et al. (2011)
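Scoring over individual segmentation points can be illustrated with the sketch below. It assumes segmentations are written with "-" between morphemes and that counts are pooled over all words before computing the scores; both are assumptions of this example rather than details given in the talk.

```python
def boundaries(segmented, sep="-"):
    """Character offsets of the segmentation points in e.g. 'paint-ing-s'."""
    points, offset = set(), 0
    for morph in segmented.split(sep)[:-1]:
        offset += len(morph)
        points.add(offset)
    return points

def precision_recall_f1(pairs):
    """Pooled precision, recall and F1 over (predicted, gold) segmentations."""
    tp = n_pred = n_gold = 0
    for pred, gold in pairs:
        p, g = boundaries(pred), boundaries(gold)
        tp += len(p & g)       # predicted points that match gold points
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

scores = precision_recall_f1([
    ("paint-ings", "paint-ing-s"),  # one gold point missed
    ("nation-al", "nation-al"),     # exact match
])
```

In this toy case every predicted point is correct but one gold point is missed, so precision is higher than recall.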
[Figure: bar chart for English, Turkish and Arabic comparing Morfessor-B, Morfessor-C, AGMorph, Lee (M2) and MorphoChain]
Language  Over-segment  Under-segment
English   10%           86%
Turkish   12%           78%
Arabic    60%           40%
not present or having low count.
utterances
[Figure: KWS Results on OOV keywords in Turkish LLP (Narasimhan et al., 2014); ATWV on KWS-Test for Supervised and Unsupervised (Morfessor)]
unsupervised)
utterances
morphological system on KWS
[Figure: ATWV scores on Bengali VLLP, Morfessor vs. MorphoChain, w/o web data and w/ web data]
*in collaboration with Damianos Karakos and Rich Schwartz at BBN
analysis incorporating both orthographic and semantic features.
morphological segmentation.
Code: http://people.csail.mit.edu/karthikn/morphochain/