Sign constraints on feature weights improve a joint model of word segmentation and phonology
Mark Johnson, Macquarie University. Joint work with Joe Pater, Robert Staubs and Emmanuel Dupoux.
▶ Liang et al. and Berg-Kirkpatrick et al. MaxEnt word segmentation models
▶ Smolensky's Harmony theory and Optimality theory of phonology
▶ Goldwater et al. MaxEnt phonology models
▶ because Berg-Kirkpatrick's and Goldwater's models are both MaxEnt models, they can be combined into a single joint MaxEnt model
▶ Harmony theory and sign constraints on MaxEnt feature weights
▶ better results than Börschinger et al. 2014 on a harder task
▶ Harmony theory feature weight constraints improve model performance
j u ▲ w ɑ n t ▲ t u ▲ s i ▲ ð ə ▲ b ʊ k ("you want to see the book")
▶ can we identify the underlying forms of words?
▶ can we learn how pronunciations alternate?
▶ each utterance $w_i = (s_{i,1}, \ldots, s_{i,m_i})$ is a sequence of (surface) phones
▶ the MLE minimises the KL-divergence between the data distribution and the model distribution
▶ a model is deficient if it loses probability mass, i.e., $\sum_w P(w \mid \theta) < 1$
▶ because we use a word length penalty in the same way, our models are deficient
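Spelling out the two notions above (standard definitions, not specific to this model; $\tilde{P}$ denotes the empirical distribution of the data):

% maximising log-likelihood = minimising KL-divergence from the data
\hat{\theta}
  \;=\; \arg\max_\theta \sum_i \log P(w_i \mid \theta)
  \;=\; \arg\min_\theta \mathrm{KL}\bigl(\tilde{P} \,\Vert\, P_\theta\bigr)

% a deficient model assigns total probability less than one,
% reserving mass for outputs it can never produce
\sum_w P(w \mid \theta) \;<\; 1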
[Figure: surface token f-score as a function of the word length penalty, on the Brent and Buckeye corpora]
▶ here we model word-final /d/ and /t/ deletion
▶ e.g., /w ɑ n t t u/ ⇒ [w ɑ n t u]
▶ assuming that each word has an underlying form which may differ from the surface form it is realised as
▶ there is a set of phonological processes mapping underlying forms into surface forms
▶ these phonological processes can be conditioned on the context
▶ these processes can also be nondeterministic
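As a concrete illustration of such a nondeterministic process (a minimal Python sketch, not the paper's implementation; phones are space-separated strings), word-final /t/,/d/ deletion maps an underlying form to its possible surface forms:

def surface_candidates(underlying):
    """Nondeterministic word-final /t/,/d/ deletion: return every
    surface form this underlying form can map to."""
    phones = underlying.split()
    candidates = [underlying]                     # faithful realisation
    if phones and phones[-1] in ("t", "d"):
        candidates.append(" ".join(phones[:-1]))  # final stop deleted
    return candidates

# /w ɑ n t/ can surface as [w ɑ n t] or [w ɑ n]
print(surface_candidates("w ɑ n t"))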
▶ faithfulness constraints, e.g., underlying /t/ should appear on the surface
▶ universal markedness constraints, e.g., ⋆tC (no /t/ before a consonant)
▶ in Harmony theory, violated constraints incur real-valued costs
▶ in Optimality theory, constraints are ranked
▶ it is often not possible to simultaneously satisfy all constraints
▶ in Harmony theory, the optimal analysis minimises the sum of the costs of the violated constraints
▶ in Optimality theory, the optimal analysis violates the lowest-ranked constraints possible
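A toy contrast between the two regimes (constraint names, weights and ranking are invented for illustration): Harmony theory picks the analysis with the lowest weighted sum of violation costs, whereas Optimality theory compares violation profiles lexicographically, highest-ranked constraint first.

# Violation counts for two surface analyses of underlying /w ɑ n t/ + /t u/:
violations = {
    "w ɑ n t": {"*tC": 1, "Max(t)": 0},   # keeps /t/, violates *tC
    "w ɑ n":   {"*tC": 0, "Max(t)": 1},   # deletes /t/, violates Max(t)
}

# Harmony theory: real-valued costs, minimise their weighted sum.
weights = {"*tC": 2.0, "Max(t)": 1.0}
best_ht = min(violations,
              key=lambda a: sum(weights[c] * n
                                for c, n in violations[a].items()))

# Optimality theory: ranked constraints, compare violation profiles
# lexicographically, highest-ranked constraint first.
ranking = ["*tC", "Max(t)"]
best_ot = min(violations,
              key=lambda a: tuple(violations[a][c] for c in ranking))

print(best_ht, best_ot)   # both select "w ɑ n" under these settings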
▶ now widely used in phonology, e.g., Hayes and Wilson 2008
▶ generate a sequence of underlying words from Goldwater's bigram model
▶ map the underlying phoneme sequence into a sequence of surface phones
▶ for computational reasons they only consider simple substitutions (no insertions or deletions)
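A schematic of that generative story (a toy sketch with invented parameters, not Goldwater's actual model): sample an underlying word sequence from a bigram model, then pass each phone through a substitution-only channel.

import random

# Toy bigram transition table over underlying words ("$" = utterance
# boundary); the real model learns such probabilities from data.
bigram = {"$":    {"ju": 0.5, "want": 0.5},
          "ju":   {"want": 0.8, "$": 0.2},
          "want": {"tu": 0.6, "$": 0.4},
          "tu":   {"$": 1.0}}

def channel(phone, sub_prob=0.1):
    """Substitution-only channel: a phone either surfaces faithfully
    or is replaced; no insertions or deletions, which is what keeps
    inference tractable."""
    return random.choice("aeiou") if random.random() < sub_prob else phone

def generate():
    words, prev = [], "$"
    while True:
        nxt = bigram[prev]
        prev = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if prev == "$":
            break
        words.append(prev)
    return ["".join(channel(p) for p in w) for w in words]

print(generate())   # e.g. ['ju', 'want', 'tu'], with occasional substitutions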
▶ s ∈ S, where S is the set of length-bounded substrings of D, and
▶ s = u or s ∈ p(u), where p ∈ P is a phonological alternation
▶ we also require that u ∈ S (i.e., every underlying form must appear somewhere in the data as a surface substring)
▶ e.g., under final /t/,/d/ deletion, surface [l ɪ v] has possible underlying forms /l ɪ v/, /l ɪ v t/ and /l ɪ v d/; these correspond to "live", "lived" and the non-word "livet"
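A sketch of candidate generation under these restrictions (a hypothetical helper, specialised to final /t/,/d/ deletion): for a surface string s, the possible underlying forms are s itself and s extended by a deleted final stop, filtered to forms that actually occur in S.

def underlying_candidates(surface, S):
    """Possible underlying forms u for a surface string s: u = s
    (faithful), or u = s plus a deleted final /t/ or /d/; in every
    case u must itself occur in S."""
    candidates = [surface] + [surface + " " + stop for stop in ("t", "d")]
    return [u for u in candidates if u in S]

# surface [l ɪ v] gets underlying /l ɪ v/, /l ɪ v t/ and /l ɪ v d/
S = {"l ɪ v", "l ɪ v d", "l ɪ v t"}
print(underlying_candidates("l ɪ v", S))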
▶ Underlying form lexical features: a feature for each underlying form u; in our models these function as the lexicon
▶ Surface markedness features: the length of the surface string (⟨#L 3⟩), and other properties of the surface form alone
▶ Faithfulness features: a feature for each divergence between underlying and surface forms (here, deletion of a word-final /t/ or /d/)
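What these three feature families might look like for a single (underlying, surface) pair (a sketch; the feature names are illustrative, not the paper's):

def features(underlying, surface):
    """Sparse feature vector for one (underlying, surface) pair,
    assuming the only alternation is word-final /t/,/d/ deletion."""
    f = {}
    f["lex=" + underlying] = 1                # underlying form lexical feature
    f["surface_len"] = len(surface.split())   # surface markedness feature
    if underlying != surface:                 # faithfulness feature
        f["del_final_" + underlying.split()[-1]] = 1
    return f

print(features("w ɑ n t", "w ɑ n"))
# {'lex=w ɑ n t': 1, 'surface_len': 3, 'del_final_t': 1}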
▶ rather than assigning every possible lexical entry and constraint a non-zero weight, we use regularisation to learn a sparse model
▶ it turns out that L1 and L2 regularisation produce similar results
▶ gradient-based methods like LBFGS can't handle the discontinuity in the L1 penalty's gradient at zero
▶ it is easy to extend this to impose sign constraints on weights
▶ Lexical entry weights must be positive (i.e., you learn what words are in the lexicon)
▶ Harmony theory faithfulness and markedness constraint weights must be negative
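One standard way to handle both issues (a minimal sketch, not necessarily the optimiser used in these experiments): a proximal-gradient update, where soft-thresholding implements the L1 penalty exactly and a clipping step enforces the sign constraints; for separable sign constraints, clipped soft-thresholding is exactly the proximal operator of the L1 penalty plus the orthant restriction.

import numpy as np

def prox_step(w, grad, step, lam, sign):
    """One proximal-gradient update for (loss + lam * ||w||_1) under
    per-weight sign constraints: gradient step, soft-threshold, then
    clip to the allowed orthant (sign=+1: w >= 0, sign=-1: w <= 0)."""
    w = w - step * grad
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
    w = np.where(sign > 0, np.maximum(w, 0.0), w)   # lexical weights
    w = np.where(sign < 0, np.minimum(w, 0.0), w)   # HT constraint weights
    return w

# Toy objective 0.5 * ||w - target||^2, whose gradient is w - target.
target = np.array([1.5, -0.2, 0.8])
sign = np.array([+1, -1, -1])   # [lexical, faithfulness, markedness]
w = np.zeros(3)
for _ in range(200):
    w = prox_step(w, w - target, step=0.5, lam=0.1, sign=sign)
print(w)   # ~[1.4, -0.1, 0.0]: the third weight is pinned at zero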
▶ the Buckeye corpus provides an underlying and surface form for each word
▶ we use the Buckeye underlying form as our underlying form
▶ we use the Buckeye underlying form as our surface form as well . . .
▶ except that if the Buckeye underlying form ends in a /d/ or /t/ and the Buckeye surface form does not, we delete that final segment from our surface form
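In code, the construction just described might look like this (a hypothetical helper; Buckeye entries are simplified to space-separated phone strings):

def make_gold(buckeye_underlying, buckeye_surface):
    """Gold pair for one token: our underlying form is Buckeye's
    underlying form; our surface form is the same string, minus a
    final /d/ or /t/ when the Buckeye surface transcription shows
    that segment was not pronounced."""
    phones = buckeye_underlying.split()
    surface = buckeye_underlying
    if phones and phones[-1] in ("d", "t") \
            and not buckeye_surface.endswith(phones[-1]):
        surface = " ".join(phones[:-1])
    return buckeye_underlying, surface

print(make_gold("w ɑ n t", "w ɑ n"))   # ('w ɑ n t', 'w ɑ n')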
▶ their model only recovers word-final /t/ deletions and was run on data without /d/ deletions
[Figure: surface token f-score as a function of the word length penalty, under each sign-constraint condition (None, OT, Lexical, OT+Lexical)]
[Figure: number of deleted /d/ vs. number of deleted /t/ recovered, under each sign-constraint condition (None, OT, Lexical, OT+Lexical)]
[Figure: regularised negative log-likelihood vs. number of non-zero feature weights, under each sign-constraint condition (None, OT, Lexical, OT+Lexical)]
[Figure: number of underlying types vs. number of non-zero feature weights, under each sign-constraint condition (None, OT, Lexical, OT+Lexical)]
[Figure: deleted segment f-score as a function of the word length penalty, with and without allowing all pairs in all contexts (TRUE/FALSE)]
▶ sensitivity to the word length penalty is a major drawback
▶ can this be set in a principled way?