Learning probabilities over underlying representations


SLIDE 1

Learning probabilities over underlying representations

Presented by Robert Staubs

Joe Pater*, Robert Staubs*, Karen Jesney†, Brian Smith*

*University of Massachusetts Amherst †University of Southern California

SIGMORPHON 2012, June 7, 2012


SLIDE 2

We pursue an approach to the learning of underlying representations (URs) in which URs are chosen by the grammar from a contextually conditioned distribution over forms observed as surface representations (SRs). Learning consists of adjusting probabilities over URs, rather than selecting unique URs. In particular, we show that such an approach allows an account of some cases otherwise dealt with by abstract underlying representations and can still be made to generalize properly.


SLIDE 3

1. Single and multiple underlying representations
2. Formal model in a Maximum Entropy version of Optimality Theory
3. Basic learning results
4. Learning “abstract” URs for alternation
5. Lexical variation


SLIDE 4

Consider a morpheme that alternates in voicing: [bet] ‘cat’ and [beda] ‘cats’. Is this final devoicing?

   UR        SR       Meaning
a. /bed/     [bet]    cat
b. /bed+a/   [beda]   cats


SLIDE 5

Consider a morpheme that alternates in voicing: [bet] ‘cat’ and [beda] ‘cats’. Or intervocalic voicing?

   UR        SR       Meaning
a. /bet/     [bet]    cat
b. /bet+a/   [beda]   cats


SLIDE 6

This kind of structural ambiguity is typical: the observed SRs allow multiple URs. More information can clarify the situation (here, a typical analysis):

   UR        SR       Meaning
a. /bed/     [bet]    cat
b. /bed+a/   [beda]   cats
c. /mot/     [mot]    dog
d. /mot+a/   [mota]   dogs

Intervocalic voiceless consonants in SRs ⇒ it’s final devoicing.


SLIDE 7

This standard single-UR view (e.g. Jakobson 1948) is not the only one available.

Multiple URs can be used for the same kind of learning task.


SLIDE 8

An internally consistent analysis with multiple URs is possible:

   UR        SR       Meaning
a. /bet/     [bet]    cat
b. /bed+a/   [beda]   cats
c. /mot/     [mot]    dog
d. /mot+a/   [mota]   dogs

Here, when the SR varies, the UR varies as well. The UR is just an observed SR.


SLIDE 9

Multiple URs are common in analyses of slightly more complex cases, typically labelled as allomorphy. We extend the machinery of such accounts (Kager 2008) to all URs.


SLIDE 10

English indefinite

English ‘a’ alternates with ‘an’. The alternation is phonologically sensible: the consonant-final form occurs where that consonant gives the following syllable an onset (i.e., before vowels). But English has no general process of [n] epenthesis. This kind of alternation is sometimes formalized as UR choice: pick the “phonologically best” UR.


SLIDE 11

We represent UR choice with constraints:

Definition (UR Constraints)
X→/y/: violated if the meaning is X and the chosen UR is not /y/.

Example (English Indefinite)
Indefinite→/@n/
Indefinite→/@/

The UR constraints supply the options, but here Output constraints choose the UR.

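As an illustration, a UR constraint can be encoded as a violation function over candidate (meaning, UR) pairs. This is a minimal Python sketch of the definition above; the function names and string encodings are ours, not part of the formalism.

```python
def ur_constraint(meaning, target_ur):
    """UR constraint 'meaning -> /target_ur/': one violation whenever
    the candidate realizes `meaning` with a UR other than /target_ur/."""
    def violations(cand_meaning, cand_ur):
        return 1 if cand_meaning == meaning and cand_ur != target_ur else 0
    return violations

# The two English indefinite constraints from the example:
indef_an = ur_constraint("Indefinite", "@n")  # Indefinite -> /@n/
indef_a  = ur_constraint("Indefinite", "@")   # Indefinite -> /@/

assert indef_an("Indefinite", "@") == 1   # chose /@/: violates ->/@n/
assert indef_an("Indefinite", "@n") == 0  # chose /@n/: satisfied
```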

SLIDE 12

Multiple URs, generalized

Proposal: treat all observed SRs of a morpheme as possible URs for that morpheme, and make UR constraints for these mappings (see the sketch below). URs are then chosen through the competition of UR constraints, Output constraints, and Faithfulness constraints. The use of UR constraints follows Apoussidou (2007), who employed such constraints for UR learning, and Eisenstat (2009), who proposed a similar model with a log-linear grammar.

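A minimal sketch of the constraint-generation step in Python (the paradigm encoding is a hypothetical, for illustration only):

```python
def ur_constraints_from_data(paradigm):
    """Step 1 of the proposal: every observed SR of a morpheme is a
    possible UR, with one UR constraint per (morpheme, UR) mapping.
    `paradigm` maps each morpheme to its set of observed surface forms."""
    return [(morpheme, sr)                    # constraint: morpheme -> /sr/
            for morpheme, srs in sorted(paradigm.items())
            for sr in sorted(srs)]

# The ambiguous 'cat' morpheme from the earlier slides, [bet] ~ [bed]:
print(ur_constraints_from_data({"cat": {"bet", "bed"}}))
# -> [('cat', 'bed'), ('cat', 'bet')]
```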

SLIDE 13

Grammar

Our grammatical model is Maximum Entropy Grammar (MaxEnt; Goldwater and Johnson, 2003). MaxEnt is a probabilistic version of Harmonic Grammar (Smolensky and Legendre, 2006), which is in turn related to Optimality Theory (Prince and Smolensky, 1993/2004).


SLIDE 14

The grammar assigns probabilities to input-output mappings based on the mappings’ violations of weighted constraints.

Here, “inputs” are the morphemes or their meanings. “Outputs” are SRs (as usual). URs form a third, “hidden” level.


SLIDE 15

The probability of an input/output pair (x_i, y_{ij}) is determined by its harmony H_{ij}: the sum of the pair’s constraint violations f_c(x_i, y_{ij}), weighted by the constraint weights w_c.

Definition (Harmony)

$$H_{ij} = \sum_c w_c f_c(x_i, y_{ij})$$


SLIDE 16

The actual probabilities are proportional to the exponentials of the harmonies, normalized within each input.

Definition (Input-Output Probabilities)

$$p(y_{ij} \mid x_i) = \frac{1}{Z_i} e^{H_{ij}}, \qquad Z_i = \sum_{j'} e^{H_{ij'}}$$

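The two definitions above translate directly into code. A minimal Python sketch, assuming violation counts are stored as negative numbers so that higher harmony means fewer weighted violations:

```python
import math

def maxent_probs(weights, violations):
    """p(y_j | x) for the candidates of one input x. `violations[j][c]`
    is constraint c's (negative) violation count for candidate j, so
    H_j = sum_c w_c * f_c(x, y_j)."""
    harmonies = [sum(w * f for w, f in zip(weights, row)) for row in violations]
    z = sum(math.exp(h) for h in harmonies)   # Z_i, the per-input normalizer
    return [math.exp(h) / z for h in harmonies]
```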

SLIDE 17

The above MaxEnt definition does not yet account for multiple URs. For each input/output probability we sum over the probabilities of the compatible hidden structures (URs) z_{ijk} corresponding to input x_i and output y_{ij}.

Definition (Probabilities with URs)

$$p(y_{ij} \mid x_i) = \sum_k p(y_{ij}, z_{ijk} \mid x_i)$$

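In code, the hidden level amounts to summing the probabilities of candidates that share an SR. A sketch continuing the one above (the candidate encoding is our own):

```python
from collections import defaultdict

def sr_probs(weights, candidates):
    """p(sr | meaning) with URs as hidden structure. `candidates` is a
    list of (ur, sr, violation_row) triples for one input meaning; each
    SR's probability sums over every (ur, sr) pair that yields it."""
    probs = maxent_probs(weights, [row for _, _, row in candidates])
    by_sr = defaultdict(float)
    for (ur, sr, _), p in zip(candidates, probs):
        by_sr[sr] += p                        # marginalize out the UR
    return dict(by_sr)
```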

SLIDE 18

Learning

We minimize the KL-divergence between predicted probabilities and the observed probabilities to obtain weights. KL allows a uniform treatment of probabilistic and categorical data. We include an L2 regularization with σ² = 10,000 for all tests presented here.

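A sketch of this objective in Python, reusing sr_probs from above. We assume the training data pair each input with a distribution over its observed SRs, and we leave optimization to a standard numerical routine:

```python
import math

def objective(weights, data, sigma2=10_000.0):
    """Sum of KL(observed || predicted) over inputs, plus an L2 penalty.
    `data` is a list of (observed_sr_probs, candidates) pairs, one per
    input; with categorical data (p_obs = 1) the KL term reduces to the
    negative log-likelihood of the observed SR."""
    total = 0.0
    for observed, candidates in data:
        predicted = sr_probs(weights, candidates)
        for sr, p_obs in observed.items():
            if p_obs > 0:
                total += p_obs * math.log(p_obs / predicted[sr])
    total += sum(w * w for w in weights) / (2 * sigma2)  # L2, sigma^2 = 10,000
    return total
```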

SLIDE 19

Generalization

Our learned solutions should generalize. We want the grammar to extend phonological patterns to novel forms. We ensure generalization in a way similar to other work on learning in OT and HG (Smolensky, 1996; Jesney and Tessier, 2011): we bias Output constraints to be weighted higher than Faithfulness constraints.


SLIDE 20

We include the difference between the summed weights of the Faithfulness constraints F and the Output constraints O in the objective:

Definition (Generalization Term)

$$\lambda \left( \sum_{f \in F} w_f - \sum_{o \in O} w_o \right)$$

Since the objective is minimized, this term biases Output constraints to outweigh Faithfulness constraints. It is reminiscent of Prince and Tesar’s (2004) R-measure. Many other priors might have a similar effect; we use this one to demonstrate feasibility.

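Added to the objective sketched earlier, the term looks like this (how constraints are indexed into the Faithfulness and Output groups is our assumption):

```python
def generalization_term(weights, faith_idx, output_idx, lam=1.0):
    """lambda * (sum of Faithfulness weights - sum of Output weights).
    Because the objective is minimized, including this term pushes
    Output weights up and Faithfulness weights down."""
    return lam * (sum(weights[f] for f in faith_idx)
                  - sum(weights[o] for o in output_idx))
```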

SLIDE 21

Consider again the final devoicing language:

   SR       Meaning
a. [bet]    cat
b. [beda]   cats
c. [mot]    dog
d. [mota]   dogs

Learning with multiple URs gives the following results:

   UR                            SR       Meaning
a. /bet/ (0.92), /bed/ (0.08)    [bet]    cat
b. /bed+a/                       [beda]   cats
c. /mot/                         [mot]    dog
d. /mot+a/                       [mota]   dogs


SLIDE 22

In the observed forms a faithful (surface-true) UR is chosen. The generalization is learned, however: the constraint favoring final devoicing (No-Coda-Voice) greatly outweighs the constraint motivating the preservation of voicing (Ident-Voice). Thus novel forms will undergo final devoicing with probability near 1 (see the worked example after the table).

Constraint       Weight
No-Coda-Voice    401.41
Ident-Voice      6.05
cat→/bed/        3.65
Inter-V-Voice    1.94
cat→/bet/        –

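To see how near-categorical the generalization is, plug the two top weights into the earlier maxent_probs sketch for a hypothetical novel voiced-final input (the candidate set and violation vectors are our assumptions):

```python
# Constraints: [No-Coda-Voice, Ident-Voice]; violations are negative.
weights = [401.41, 6.05]
candidates = [[-1, 0],   # faithful output: voiced coda survives
              [0, -1]]   # devoiced output: unfaithful to voicing
print(maxent_probs(weights, candidates))
# -> ~[0.0, 1.0]: the devoiced form gets essentially all the probability.
```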

SLIDE 23

Abstract URs?

Tesar (2006) proposes the following test of UR learning (stated with single URs):

          /-se/      /-sá/      /-só:/
/re-/     [rése]     [resá]     [resó:]
/ra:-/    [rá:se]    [rasá]     [rasó:]
/ró-/     [róse]     [rósa]     [róso]
/rú:-/    [rú:se]    [rú:sa]    [rú:so]

1. Default stress on the root.
2. No unstressed long vowels on the surface.
3. Partially predictable stress.


SLIDE 24

          /-se/      /-sá/      /-só:/
/re-/     [rése]     [resá]     [resó:]
/ra:-/    [rá:se]    [rasá]     [rasó:]
/ró-/     [róse]     [rósa]     [róso]
/rú:-/    [rú:se]    [rú:sa]    [rú:so]

/ra:-/ is an abstract UR – it never appears as such in an SR. It is long to contrast with /re-/, and stressless to contrast with /rú:-/.


SLIDE 25

UR          SR        p
/ré+se/     [rése]    0.98
/re+se/     [rése]    0.02
/rá:+se/    [rá:se]   0.99
/ra+se/     [ráse]    0.01
/ró+se/     [róse]    1
/rú:+se/    [rú:se]   1
/re+sá/     [resá]    1
/ra+sá/     [rasá]    1
/ró+sa/     [rósa]    0.93
/ró+sá/     [rósa]    0.07
/ró+sá/     [rosá]    0.01
/rú:+sa/    [rú:sa]   0.93
/rú:+sá/    [rú:sa]   0.07
/re+só:/    [resó:]   1
/ra+só:/    [rasó:]   0.99
/rá:+so/    [rá:so]   0.01
/ró+so/     [róso]    0.99
/ró+só:/    [rosó:]   0.01
/rú:+so/    [rú:so]   1


SLIDE 26

The learned (UR, SR) mappings are faithful with high probability. Alternations come from choosing different URs for different contexts. The abstract /ra:-/ is not needed at all – only the observed forms /ra-/ and /rá:-/.


SLIDE 27

Replacing /ra:-/

The mapping (/rá:+se/, [rá:se]) is preferred to (/ra+se/, [rasé]): there are no Ident-Stress violations (no underlying stress needs to be changed), and there is a general preference for root stress.

The mapping (/ra+sá/, [rasá]) is preferred to (/rá:+sa/, [rá:sa]): Ident-Stress is not an issue, since a stressed UR for ‘sa’ is available, and /ra/ is preferred as a UR overall.


SLIDE 28

Constraint         Weight
No-Long-Unstress   26.43
Stress-Root        26.05
Stress-Suffix      23.50
Ident-Stress       7.66
Ident-Long         6.50
‘sa’→/sá/          5.04
‘so’→/só:/         4.96
‘re’→/re/          3.85
‘ra’→/ra/          3.15
‘ra’→/rá:/         0.25
‘so’→/so/          0.02
‘sa’→/sa/          –
‘re’→/ré/          –
No-Long            –


SLIDE 29

A generalization is learned in this case as well. No-Long-Unstress (“stress all long vowels”) has high weight. Ident-Long (“keep underlying length”) has intermediate weight. No-Long (“no long vowels”) has low weight. The resulting weights give a language with no unstressed long vowels, even in novel forms. Multiple URs do not degrade the grammar’s generalizability.


SLIDE 30

Lexically-conditioned variation

Definition (Lexically-conditioned variation)
Some generalization holds of SRs. The generalization is not categorical – it applies incompletely. The probability of application varies with the lexical item.

We analyze such cases identically to categorical cases: variation in probability is variation in the weights on UR constraints.


SLIDE 31

French vowel deletion

In French, the vowel [ø] is variably deleted ([ø] is sometimes called “schwa” in these cases). Approximate probabilities based on Dell (1973) and Racine (2007):

Word       SR          p
femelle    [fømɛl]     1
semestre   [sømɛstʁ]   0.8
           [smɛstʁ]    0.2
semelle    [sømɛl]     0.5
           [smɛl]      0.5
Fnac       [fnak]      1
breton     [bʁøtɔ̃]     1


SLIDE 32

Learned probabilities agree well with the target distribution:

UR has [ø]?   SR          p
Y             s’mestre    0.08
N             s’mestre    0.15
Y             semestre    0.77
N             semestre    0.01
Y             s’melle     0.04
N             s’melle     0.45
Y             semelle     0.47
N             semelle     0.03
Y             f’melle     0.09
Y             femelle     0.91
N             F[ø]nac     0.07
N             Fnac        0.93
Y             breton      1


SLIDE 33

Constraint                Weight
*CCC                      467.26
Max                       4.93
‘semestre’→/sømɛstʁ/      4.23
‘semelle’→/sømɛl/         2.71
*[ø]                      2.58
‘semelle’→/smɛl/          0.10
‘semestre’→/smɛstʁ/       0.03
Dep                       0.00


SLIDE 34

‘semestre’→/sømɛstʁ/ is weighted higher than ‘semelle’→/sømɛl/ because the [ø] form has higher probability for the former. *CCC is the highest-weighted constraint: three-consonant sequences will never occur, since deletion will not create them and epenthesis will repair them. We obtain both categorical and gradient generalizations.


SLIDE 35

Future directions

We don’t permit learners to collapse multiple representations into a single UR. Is this ever needed? If so, when? (From allomorphy we see the answer isn’t “always”.) Initial explorations show promise for UR constraints and morpheme segmentation. Similar treatments are possible for other kinds of “hidden structure”: syllable structure, syntactic trees, and derivations (work on the last in Staubs and Pater 2012).


SLIDE 36

Conclusion

Multiple URs can be used in learning and grammar without loss of grammatical generalization. Such methods sidestep issues of abstraction for alternation. The “single UR doctrine” is worth reconsidering.


SLIDE 37

Acknowledgements

People: Diana Apoussidou and David Smith for their collaboration on earlier presentations of this work, and Mark Johnson for extended discussion. Thanks also to Adam Albright, Paul Boersma, Naomi Feldman, Jeff Heinz, John McCarthy, Paul Smolensky, Colin Wilson, and three anonymous reviewers for helpful comments.

Grants: This research was supported by NSF Grant 0813829 to the University of Massachusetts Amherst, by an NSF Graduate Research Fellowship to Robert Staubs, and by a SSHRC doctoral fellowship to Karen Jesney.
