Learning probabilities over underlying representations


SLIDE 1

Learning probabilities over underlying representations

Presented by Robert Staubs

Joe Pater*, Robert Staubs*, Karen Jesney†, Brian Smith*

*University of Massachusetts Amherst †University of Southern California

SIGMORPHON 2012, June 7, 2012


SLIDE 2

We pursue an approach to the learning of underlying representations (URs) in which URs are chosen by the grammar from a contextually conditioned distribution over forms observed as surface representations (SRs). Learning consists of adjusting probabilities over URs, rather than selecting unique URs. In particular, we show that such an approach allows an account of some cases otherwise dealt with by abstract underlying representations and can still be made to generalize properly.


SLIDE 3

1. Single and multiple underlying representations
2. Formal model in a Maximum Entropy version of Optimality Theory
3. Basic learning results
4. Learning “abstract” URs for alternation
5. Lexical variation


SLIDE 4

Consider a morpheme that alternates in voicing: [bet] ‘cat’ and [beda] ‘cats’. Is this final devoicing?

   UR        SR       Meaning
a. /bed/     [bet]    cat
b. /bed+a/   [beda]   cats


SLIDE 5

Consider a morpheme that alternates in voicing: [bet] ‘cat’ and [beda] ‘cats’. Or intervocalic voicing?

   UR        SR       Meaning
a. /bet/     [bet]    cat
b. /bet+a/   [beda]   cats


SLIDE 6

This kind of structural ambiguity is typical: the observed SRs allow multiple URs. More information can clarify the situation (here, a typical analysis):

   UR        SR       Meaning
a. /bed/     [bet]    cat
b. /bed+a/   [beda]   cats
c. /mot/     [mot]    dog
d. /mot+a/   [mota]   dogs

Intervocalic voiceless consonants in SRs ⇒ it’s final devoicing.


SLIDE 7

This standard single-UR view (e.g. Jakobson 1948) is not the only one available.

Multiple URs can be used for the same kind of learning task.


SLIDE 8

An internally consistent analysis with multiple URs is possible:

   UR        SR       Meaning
a. /bet/     [bet]    cat
b. /bed+a/   [beda]   cats
c. /mot/     [mot]    dog
d. /mot+a/   [mota]   dogs

Here, when the SR varies, the UR varies as well. The UR is just an observed SR.


SLIDE 9

Multiple URs are common in analyses of slightly more complex cases, typically labelled as allomorphy. We extend the machinery of such accounts (Kager 2008) to all URs.


SLIDE 10

English indefinite

English ‘a’ alternates with ‘an’. The alternation is phonologically sensible: the consonant-final form occurs where that consonant gives the following syllable an onset (i.e., before vowels). But English has no general process of [n] epenthesis. This kind of alternation is sometimes formalized as UR choice: pick the “phonologically best” UR.


SLIDE 11

We represent UR choice with constraints:

Definition (UR Constraints)
X→/y/: violated if the meaning is X and the chosen UR is not /y/.

Example (English Indefinite)
Indefinite→/@n/
Indefinite→/@/

The UR constraints supply the options, but here Output constraints choose the UR.

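As an illustration, a UR constraint can be encoded as a violation function over candidate (meaning, UR) pairs. This is a minimal Python sketch of the definition above; the function names and string encodings are ours, not part of the formalism.

```python
def ur_constraint(meaning, target_ur):
    """UR constraint 'meaning -> /target_ur/': one violation whenever
    the candidate realizes `meaning` with a UR other than /target_ur/."""
    def violations(cand_meaning, cand_ur):
        return 1 if cand_meaning == meaning and cand_ur != target_ur else 0
    return violations

# The two English indefinite constraints from the example:
indef_an = ur_constraint("Indefinite", "@n")  # Indefinite -> /@n/
indef_a  = ur_constraint("Indefinite", "@")   # Indefinite -> /@/

assert indef_an("Indefinite", "@") == 1   # chose /@/: violates ->/@n/
assert indef_an("Indefinite", "@n") == 0  # chose /@n/: satisfied
```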

SLIDE 12

Multiple URs, generalized

Proposal: treat all observed SRs of a morpheme as possible URs for that morpheme, and make UR constraints for these mappings (see the sketch below). URs are then chosen through the competition of UR constraints, Output constraints, and Faithfulness constraints. The use of UR constraints follows Apoussidou (2007), who employed such constraints for UR learning, and Eisenstat (2009), who proposed a similar model with a log-linear grammar.

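A minimal sketch of the constraint-generation step in Python (the paradigm encoding is a hypothetical, for illustration only):

```python
def ur_constraints_from_data(paradigm):
    """Step 1 of the proposal: every observed SR of a morpheme is a
    possible UR, with one UR constraint per (morpheme, UR) mapping.
    `paradigm` maps each morpheme to its set of observed surface forms."""
    return [(morpheme, sr)                    # constraint: morpheme -> /sr/
            for morpheme, srs in sorted(paradigm.items())
            for sr in sorted(srs)]

# The ambiguous 'cat' morpheme from the earlier slides, [bet] ~ [bed]:
print(ur_constraints_from_data({"cat": {"bet", "bed"}}))
# -> [('cat', 'bed'), ('cat', 'bet')]
```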

SLIDE 13

Grammar

Our grammatical model is Maximum Entropy Grammar (MaxEnt; Goldwater and Johnson, 2003). MaxEnt is a probabilistic version of Harmonic Grammar (Smolensky and Legendre, 2006), which is in turn related to Optimality Theory (Prince and Smolensky, 1993/2004).


SLIDE 14

The grammar assigns probabilities to input-output mappings based on the mappings’ violations of weighted constraints.

Here, “inputs” are the morphemes or their meanings. “Outputs” are SRs (as usual). URs form a third, “hidden” level.


SLIDE 15

The probability of an input/output pair (x_i, y_{ij}) is determined by its harmony H_{ij}: the sum of the pair’s constraint violations f_c(x_i, y_{ij}), weighted by the constraint weights w_c.

Definition (Harmony)

$$H_{ij} = \sum_c w_c f_c(x_i, y_{ij})$$


SLIDE 16

The actual probabilities are proportional to the exponentials of the harmonies, normalized within each input.

Definition (Input-Output Probabilities)

$$p(y_{ij} \mid x_i) = \frac{1}{Z_i} e^{H_{ij}}, \qquad Z_i = \sum_{j'} e^{H_{ij'}}$$

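The two definitions above translate directly into code. A minimal Python sketch, assuming violation counts are stored as negative numbers so that higher harmony means fewer weighted violations:

```python
import math

def maxent_probs(weights, violations):
    """p(y_j | x) for the candidates of one input x. `violations[j][c]`
    is constraint c's (negative) violation count for candidate j, so
    H_j = sum_c w_c * f_c(x, y_j)."""
    harmonies = [sum(w * f for w, f in zip(weights, row)) for row in violations]
    z = sum(math.exp(h) for h in harmonies)   # Z_i, the per-input normalizer
    return [math.exp(h) / z for h in harmonies]
```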

SLIDE 17

The above MaxEnt definition does not yet account for multiple URs. For each input/output probability we sum over the probabilities of the compatible hidden structures (URs) z_{ijk} corresponding to input x_i and output y_{ij}.

Definition (Probabilities with URs)

$$p(y_{ij} \mid x_i) = \sum_k p(y_{ij}, z_{ijk} \mid x_i)$$

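In code, the hidden level amounts to summing the probabilities of candidates that share an SR. A sketch continuing the one above (the candidate encoding is our own):

```python
from collections import defaultdict

def sr_probs(weights, candidates):
    """p(sr | meaning) with URs as hidden structure. `candidates` is a
    list of (ur, sr, violation_row) triples for one input meaning; each
    SR's probability sums over every (ur, sr) pair that yields it."""
    probs = maxent_probs(weights, [row for _, _, row in candidates])
    by_sr = defaultdict(float)
    for (ur, sr, _), p in zip(candidates, probs):
        by_sr[sr] += p                        # marginalize out the UR
    return dict(by_sr)
```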

SLIDE 18

Learning

We minimize the KL-divergence between predicted probabilities and the observed probabilities to obtain weights. KL allows a uniform treatment of probabilistic and categorical data. We include an L2 regularization with σ² = 10,000 for all tests presented here.

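A sketch of this objective in Python, reusing sr_probs from above. We assume the training data pair each input with a distribution over its observed SRs, and we leave optimization to a standard numerical routine:

```python
import math

def objective(weights, data, sigma2=10_000.0):
    """Sum of KL(observed || predicted) over inputs, plus an L2 penalty.
    `data` is a list of (observed_sr_probs, candidates) pairs, one per
    input; with categorical data (p_obs = 1) the KL term reduces to the
    negative log-likelihood of the observed SR."""
    total = 0.0
    for observed, candidates in data:
        predicted = sr_probs(weights, candidates)
        for sr, p_obs in observed.items():
            if p_obs > 0:
                total += p_obs * math.log(p_obs / predicted[sr])
    total += sum(w * w for w in weights) / (2 * sigma2)  # L2, sigma^2 = 10,000
    return total
```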

SLIDE 19

Generalization

Our learned solutions should generalize. We want the grammar to extend phonological patterns to novel forms. We ensure generalization in a way similar to other work on learning in OT and HG (Smolensky, 1996; Jesney and Tessier, 2011): we bias Output constraints to be weighted higher than Faithfulness constraints.


SLIDE 20

We include the difference between the summed weights of the Faithfulness constraints F and the Output constraints O in the objective:

Definition (Generalization Term)

$$\lambda \left( \sum_{f \in F} w_f - \sum_{o \in O} w_o \right)$$

Since the objective is minimized, this term biases Output constraints to outweigh Faithfulness constraints. It is reminiscent of Prince and Tesar’s (2004) R-measure. Many other priors might have a similar effect; we use this one to demonstrate feasibility.

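Added to the objective sketched earlier, the term looks like this (how constraints are indexed into the Faithfulness and Output groups is our assumption):

```python
def generalization_term(weights, faith_idx, output_idx, lam=1.0):
    """lambda * (sum of Faithfulness weights - sum of Output weights).
    Because the objective is minimized, including this term pushes
    Output weights up and Faithfulness weights down."""
    return lam * (sum(weights[f] for f in faith_idx)
                  - sum(weights[o] for o in output_idx))
```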

SLIDE 21

Consider again the final devoicing language:

   SR       Meaning
a. [bet]    cat
b. [beda]   cats
c. [mot]    dog
d. [mota]   dogs

Learning with multiple URs gives the following results:

   UR                            SR       Meaning
a. /bet/ (0.92), /bed/ (0.08)    [bet]    cat
b. /bed+a/                       [beda]   cats
c. /mot/                         [mot]    dog
d. /mot+a/                       [mota]   dogs


SLIDE 22

In the observed forms a faithful (surface-true) UR is chosen. The generalization is learned, however: the constraint favoring final devoicing (No-Coda-Voice) greatly outweighs the constraint motivating the preservation of voicing (Ident-Voice). Thus novel forms will undergo final devoicing with probability near 1 (see the worked example after the table).

Constraint       Weight
No-Coda-Voice    401.41
Ident-Voice      6.05
cat→/bed/        3.65
Inter-V-Voice    1.94
cat→/bet/        –

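To see how near-categorical the generalization is, plug the two top weights into the earlier maxent_probs sketch for a hypothetical novel voiced-final input (the candidate set and violation vectors are our assumptions):

```python
# Constraints: [No-Coda-Voice, Ident-Voice]; violations are negative.
weights = [401.41, 6.05]
candidates = [[-1, 0],   # faithful output: voiced coda survives
              [0, -1]]   # devoiced output: unfaithful to voicing
print(maxent_probs(weights, candidates))
# -> ~[0.0, 1.0]: the devoiced form gets essentially all the probability.
```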

SLIDE 23

Abstract URs?

Tesar (2006) proposes the following test of UR learning (stated with single URs):

          /-se/      /-sá/      /-só:/
/re-/     [rése]     [resá]     [resó:]
/ra:-/    [rá:se]    [rasá]     [rasó:]
/ró-/     [róse]     [rósa]     [róso]
/rú:-/    [rú:se]    [rú:sa]    [rú:so]

1. Default stress on the root.
2. No unstressed long vowels on the surface.
3. Partially predictable stress.


SLIDE 24

          /-se/      /-sá/      /-só:/
/re-/     [rése]     [resá]     [resó:]
/ra:-/    [rá:se]    [rasá]     [rasó:]
/ró-/     [róse]     [rósa]     [róso]
/rú:-/    [rú:se]    [rú:sa]    [rú:so]

/ra:-/ is an abstract UR – it never appears as such in an SR. It is long to contrast with /re-/, and stressless to contrast with /rú:-/.


SLIDE 25

UR          SR        p
/ré+se/     [rése]    0.98
/re+se/     [rése]    0.02
/rá:+se/    [rá:se]   0.99
/ra+se/     [ráse]    0.01
/ró+se/     [róse]    1
/rú:+se/    [rú:se]   1
/re+sá/     [resá]    1
/ra+sá/     [rasá]    1
/ró+sa/     [rósa]    0.93
/ró+sá/     [rósa]    0.07
/ró+sá/     [rosá]    0.01
/rú:+sa/    [rú:sa]   0.93
/rú:+sá/    [rú:sa]   0.07
/re+só:/    [resó:]   1
/ra+só:/    [rasó:]   0.99
/rá:+so/    [rá:so]   0.01
/ró+so/     [róso]    0.99
/ró+só:/    [rosó:]   0.01
/rú:+so/    [rú:so]   1


SLIDE 26

The learned (UR, SR) mappings are faithful with high probability. Alternations come from choosing different URs for different contexts. The abstract /ra:-/ is not needed at all – only the observed forms /ra-/ and /rá:-/.


SLIDE 27

Replacing /ra:-/

The mapping (/rá:+se/, [rá:se]) is preferred to (/ra+se/, [rasé]): there are no Ident-Stress violations (no underlying stress needs to be changed), and there is a general preference for root stress.

The mapping (/ra+sá/, [rasá]) is preferred to (/rá:+sa/, [rá:sa]): Ident-Stress is not an issue, since a stressed UR for ‘sa’ is available, and /ra/ is preferred as a UR overall.


SLIDE 28

Constraint         Weight
No-Long-Unstress   26.43
Stress-Root        26.05
Stress-Suffix      23.50
Ident-Stress       7.66
Ident-Long         6.50
‘sa’→/sá/          5.04
‘so’→/só:/         4.96
‘re’→/re/          3.85
‘ra’→/ra/          3.15
‘ra’→/rá:/         0.25
‘so’→/so/          0.02
‘sa’→/sa/          –
‘re’→/ré/          –
No-Long            –


SLIDE 29

A generalization is learned in this case as well. No-Long-Unstress (“stress all long vowels”) has high weight. Ident-Long (“keep underlying length”) has intermediate weight. No-Long (“no long vowels”) has low weight. The resulting weights give a language with no unstressed long vowels, even in novel forms. Multiple URs do not degrade the grammar’s generalizability.


SLIDE 30

Lexically-conditioned variation

Definition (Lexically-conditioned variation)
Some generalization holds of SRs. The generalization is not categorical – it applies incompletely. The probability of application varies with the lexical item.

We analyze such cases identically to categorical cases: variation in probability is variation in the weights on UR constraints.


SLIDE 31

French vowel deletion

In French, the vowel [ø] is variably deleted ([ø] is sometimes called “schwa” in these cases). Approximate probabilities based on Dell (1973) and Racine (2007):

Word       SR          p
femelle    [fømɛl]     1
semestre   [sømɛstʁ]   0.8
           [smɛstʁ]    0.2
semelle    [sømɛl]     0.5
           [smɛl]      0.5
Fnac       [fnak]      1
breton     [bʁøtɔ̃]     1


SLIDE 32

Learned probabilities agree well with the target distribution:

UR has [ø]?   SR          p
Y             s’mestre    0.08
N             s’mestre    0.15
Y             semestre    0.77
N             semestre    0.01
Y             s’melle     0.04
N             s’melle     0.45
Y             semelle     0.47
N             semelle     0.03
Y             f’melle     0.09
Y             femelle     0.91
N             F[ø]nac     0.07
N             Fnac        0.93
Y             breton      1


SLIDE 33

Constraint                Weight
*CCC                      467.26
Max                       4.93
‘semestre’→/sømɛstʁ/      4.23
‘semelle’→/sømɛl/         2.71
*[ø]                      2.58
‘semelle’→/smɛl/          0.10
‘semestre’→/smɛstʁ/       0.03
Dep                       0.00


SLIDE 34

‘semestre’→/sømɛstʁ/ is weighted higher than ‘semelle’→/sømɛl/ because the [ø] form has higher probability for the former. *CCC is the highest-weighted constraint: three-consonant sequences will never occur, since deletion will not create them and epenthesis will repair them. We obtain both categorical and gradient generalizations.


SLIDE 35

Future directions

We don’t permit learners to collapse multiple representations into a single UR. Is this ever needed? If so, when? (From allomorphy we see the answer isn’t “always”.) Initial explorations show promise for UR constraints and morpheme segmentation. Similar treatments are possible for other kinds of “hidden structure”: syllable structure, syntactic trees, and derivations (work on the last in Staubs and Pater 2012).


SLIDE 36

Conclusion

Multiple URs can be used in learning and grammar without loss of grammatical generalization. Such methods sidestep issues of abstraction for alternation. The “single UR doctrine” is worth reconsidering.


SLIDE 37

Acknowledgements

People: Diana Apoussidou and David Smith for their collaboration on earlier presentations of this work, and Mark Johnson for extended discussion. Thanks also to Adam Albright, Paul Boersma, Naomi Feldman, Jeff Heinz, John McCarthy, Paul Smolensky, Colin Wilson, and three anonymous reviewers for helpful comments.

Grants: This research was supported by NSF Grant 0813829 to the University of Massachusetts Amherst, by an NSF Graduate Research Fellowship to Robert Staubs, and by a SSHRC doctoral fellowship to Karen Jesney.
