

SLIDE 1

Class 7: Learning and learnability

Adam Albright (albright@mit.edu)

LSA 2017 Phonology University of Kentucky

SLIDE 2

Announcements

▶ For those taking this class for credit
  ▶ Option 1: assignment 2 comments are posted, assignment 3 due next Monday 7/31
  ▶ Option 2: short paper/squib due next Monday 7/31
▶ The home stretch
  ▶ Questions?
  ▶ Today: learning constraint rankings
  ▶ Next Monday: phonological typology

SLIDE 3

A question from last time: phonemes

▶ Phones
▶ Allophones
▶ Phonemes

SLIDE 4

Shizuoka Japanese (from assignment 2)

   Adjective    Emphatic     Gloss
a. hade         hande        'showy'
b. ozoi         onzoi        'terrible'
c. jowai        joɴwai       'weak'
d. hajai        haɴjai       'fast'
e. kaɾai        kaɴɾai       'spicy'
f. nagai        naŋgai       'long'
g. kanaʃiː      kanːaʃiː     'sad'
h. amai         amːai        'sweet'
i. katai        katːai       'hard'
j. osoi         osːoi        'slow'
k. takai        takːai       'high'
l. atsui        atːsui       'hot'
m. kitanai      kitːanai     'dirty'
n. kusai        kusːai       'stinky'
o. ikai         ikːai        'big'
p. zonzai       zoːnzai      'impolite'
q. kandaɾui     kaːndaɾui    'languid'
r. onzokutai    oːnzokutai   'ugly'
s. supːai       suːpːai      'sour'
t. okːanai      oːkːanai     'scary'
u. oiʃiː        oːiʃiː       'delicious'
v. kiːɾoi       kiːɴɾoi      'yellow'
w. toːtoi       toːtːoi      'respectable'

SLIDE 5

Shizuoka Japanese (from assignment 2)

▶ The hierarchy of preferences that we observed
  1. Lengthen the V if there's already NC or CC (i.e., if lengthening a C would result in a non-intervocalic Cː)
  2. Else, insert an N if lengthening the C would result in a 'bad geminate' (voiced stop, glide, etc.)
  3. Else, lengthen the consonant
▶ Preferences: lengthen C > insert N > lengthen V
▶ General constraint ranking:
  some constraint that penalizes lengthening V (*Vː, or Ident/V)
    ≫ some constraint that penalizes inserting N (*NC, or Dep(N))
    ≫ some constraint that penalizes lengthening C (*Cː, or Ident/C)
▶ Forcing violations of higher-ranked constraints
  ▶ *non-intervocalic Cː and *bad geminate outrank all of the above
  ▶ That is, don't create bad sequences merely in order to employ a preferred change

SLIDE 6

The architecture of Optimality Theory

▶ Con (universal, but must have particular form)
  ▶ M: evaluate phonological form of the output (incl. hidden structure)
  ▶ F: evaluate the relation between input and output forms
▶ Eval (the ranking is language-particular; the procedure is universal)
▶ Gen (universal, and relatively generic)
  ▶ Augments underlying forms with additional structure (syllabification, prosodic structure, etc.)
  ▶ Modifies structures of the input in various ways to generate competing candidates
▶ Lexicon (language-particular, must be learned)
  ▶ Phonological strings, in the same featural representation as output representations
  ▶ Possibly additional structure, not present in the output?

SLIDE 7

What is data used for in OT?

▶ Infer the grammar and lexicon that were used to generate forms
  ▶ This is an unachievable goal, in the general case!
▶ Less ambitious: infer a grammar that can generate the data
  ▶ Minimally: P(attested forms) > 0
  ▶ Ideally…
    ▶ P(attested forms) > 0
    ▶ P(unattested forms) = 0
▶ Modeling humans
  ▶ P(acceptable forms) > 0
  ▶ P(unacceptable forms) = 0
  ▶ Assuming that attested forms are all acceptable…
    ▶ P(attested forms) > 0
    ▶ P(accidentally unattested, but acceptable forms) > 0
    ▶ P(unattested and unacceptable forms) = 0

SLIDE 8

Decomposing the problem

▶ Really hard part: inferring underlying forms, with hidden structure, on the basis of overt evidence
▶ Somewhat hard part: given those URs, constructing a set of informative competing candidates
▶ The easier part: given URs and candidates, finding a consistent ranking

SLIDE 9

The easy part

Ensure P(attested data) > 0 with Recursive Constraint Demotion

▶ Crucial assumption: the input data is generated by a grammar with a total ranking
  ▶ This guarantees that the grammar generates a consistent output for all possible inputs
  ▶ I.e., no inconsistencies that would lead to ranking paradoxes!
▶ Mark cancellation: reduce the tableau to W's and L's (comparative format)
▶ At each step, find the set of constraints with L's for any active mark–data pairs (mdp's), and demote them
▶ Remove mdp's with W's for undemoted constraints ("in current stratum")
▶ Recursion: rank the remaining constraints in similar fashion, for the remaining mdp's

SLIDE 10

An example

Tesar and Smolensky (1996, p. 14)

/ulod/       Ons   *Coda   Dep(V)   Max   Dep(C)
a. ulod       *      *
b. lo                                **
c. lodə                      *        *
d. ☞ ʔulo                             *      *

▶ Assume we're given inputs, the attested (winning) output, and losing candidates

SLIDE 11

The procedure: an overview

1. Construct mark–data pairs
   ▶ For each winner/loser pair, compare violations for each constraint
   ▶ If both violate a constraint C an equal number of times, these marks 'cancel each other out'
   ▶ Identify C's that assess uncancelled marks (either the winner or the loser has more violations)
   (i.e., make a comparative tableau!)
2. Start with all constraints in a single stratum (no crucial rankings)
3. Look for constraints C that assign uncancelled marks to winners (that is, all constraints with L). Demote any such C, unless it is already dominated by another constraint C′ that has uncancelled loser marks (that is, a higher W)
4. Continue, creating subsequent strata, until there are no uncancelled winner marks without higher-ranked uncancelled loser marks
5. Refine: given the partial ranking, pick a total ranking
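A minimal sketch of steps 1–4 in Python, assuming mark–data pairs are already in comparative format (dicts from constraint names to 'W'/'L', with cancelled marks and ties simply omitted). The constraint names and data are the sa/ʃi practice pattern from a later slide, written in ASCII ('*sh' for *ʃ):

```python
def rcd(constraints, mdps):
    """Return a stratified hierarchy (a list of strata) consistent with the
    mark-data pairs, or raise ValueError if the data are inconsistent."""
    strata = []
    remaining = set(constraints)
    pairs = [dict(p) for p in mdps]
    while remaining:
        # A constraint is rankable now if it prefers no loser, i.e., it
        # assigns no uncancelled 'L' in any active mark-data pair.
        stratum = {c for c in remaining
                   if not any(p.get(c) == 'L' for p in pairs)}
        if not stratum:
            raise ValueError('Inconsistent data: no rankable constraint')
        strata.append(sorted(stratum))
        remaining -= stratum
        # Remove mdp's explained by a W in the newly installed stratum.
        pairs = [p for p in pairs
                 if not any(p.get(c) == 'W' for c in stratum)]
    return strata

constraints = ['*si', '*s', '*sh', 'Ident(ant)']
mdps = [
    {'*s': 'L', '*sh': 'W', 'Ident(ant)': 'W'},              # /sa/: sa ~ *sha
    {'*s': 'L', '*sh': 'W', 'Ident(ant)': 'L'},              # /sha/: sa ~ *sha
    {'*si': 'W', '*s': 'W', '*sh': 'L', 'Ident(ant)': 'L'},  # /si/: shi ~ *si
    {'*si': 'W', '*s': 'W', '*sh': 'L', 'Ident(ant)': 'W'},  # /shi/: shi ~ *si
]
print(rcd(constraints, mdps))  # [['*si'], ['*sh'], ['*s', 'Ident(ant)']]
```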

SLIDE 12

An example

Applying RCD

1. Data:

   /ulod/       Ons   *Coda   Dep(V)   Max   Dep(C)
   a. ulod       *      *
   b. lo                                **
   c. lodə                      *        *
   d. ☞ ʔulo                             *      *

2. mdp's:

   /ulod/ → ʔulo   Ons   *Coda   Dep(V)   Max   Dep(C)
   ʔulo ~ *ulod     W      W               L      L
   ʔulo ~ *lo                              W      L
   ʔulo ~ *lodə                    W              L

3. Demote constraints with L's: Max, Dep(C)
   ▶ All L's are now covered by higher-ranked W's
   ▶ I.e., all mdp's are eliminated
4. Refine

SLIDE 13

Another example

For practice: Nupe

Input   Intended   *si   *s   *ʃ   Ident([±ant])
/sa/    ☞ sa              *
           ʃa                   *        *
/ʃa/       ʃa                   *
        ☞ sa              *              *
/si/       si        *    *
        ☞ ʃi                    *        *
/ʃi/    ☞ ʃi                    *
           si        *    *             *
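The comparative rows on the next slides can be computed mechanically. A small sketch of mark cancellation, under the same assumptions as the rcd() code above (violation profiles as dicts from constraint names to counts; names in ASCII):

```python
def mark_data_pair(winner, loser):
    """Compare two violation profiles; a constraint gets 'W' if the loser
    has more marks, 'L' if the winner does; equal counts cancel."""
    pair = {}
    for c in set(winner) | set(loser):
        delta = winner.get(c, 0) - loser.get(c, 0)
        if delta < 0:
            pair[c] = 'W'
        elif delta > 0:
            pair[c] = 'L'
    return pair

# /si/: intended winner [shi] vs. loser [si]
print(mark_data_pair({'*sh': 1, 'Ident(ant)': 1}, {'*si': 1, '*s': 1}))
# {'*si': 'W', '*s': 'W', '*sh': 'L', 'Ident(ant)': 'L'} (key order may vary)
```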

SLIDE 14

RCD with unfaithful mappings

Step 1: construct mark–data pairs

Input   Winner   Loser   *si   *s   *ʃ   F([±ant])
/sa/      sa      ʃa            L    W       W
/ʃa/      sa      ʃa            L    W       L
/si/      ʃi      si      W     W    L       L
/ʃi/      ʃi      si      W     W    L       W

▶ For each constraint, calculate ∆(winner violations, loser violations)
▶ Constraint preference: W if ∆(w, l) < 0; L if ∆(w, l) > 0; tie otherwise

SLIDE 15

RCD with unfaithful mappings

Step 2: demote

Input   Winner   Loser   *si   *s   *ʃ   F([±ant])
/sa/      sa      ʃa            L    W       W
/ʃa/      sa      ʃa            L    W       L
/si/      ʃi      si      W     W    L       L
/ʃi/      ʃi      si      W     W    L       W

▶ Constraints that prefer only winners are placed in the current stratum
▶ Constraints that prefer losers are demoted

SLIDE 16

RCD with unfaithful mappings

Step 3: remove explained pairs

Input   Winner   Loser   *si   *s   *ʃ   F([±ant])
/sa/      sa      ʃa            L    W       W
/ʃa/      sa      ʃa            L    W       L
/si/      ʃi      si      W     W    L       L
/ʃi/      ʃi      si      W     W    L       W

▶ Rows with a ranked W are removed as explained
▶ Crucial: this relies on strict domination (no lower violations can "de-explain" these mdp's)

SLIDE 17

RCD with unfaithful mappings

Step 3: remove explained pairs

Input   Winner   Loser   *si   *s   *ʃ   F([±ant])
/sa/      sa      ʃa            L    W       W
/ʃa/      sa      ʃa            L    W       L

▶ Rows with a ranked W are removed as explained
▶ Crucial: this relies on strict domination (no lower violations can "de-explain" these mdp's)

SLIDE 18

RCD with unfaithful mappings

Repeat: demote and remove

Input   Winner   Loser   *si   *s   *ʃ   F([±ant])
/sa/      sa      ʃa            L    W       W
/ʃa/      sa      ʃa            L    W       L

SLIDE 19

RCD with unfaithful mappings

Repeat: demote and remove

Input   Winner   Loser   *si   *ʃ   *s   F([±ant])
/sa/      sa      ʃa            W    L       W
/ʃa/      sa      ʃa            W    L       L

SLIDE 20

RCD with unfaithful mappings

Repeat: demote and remove

(no mark–data pairs remain)

*si ≫ *ʃ ≫ *s, F(±ant)

SLIDE 21

Exercise

Find a ranking consistent with the following tableau:

          C1   C2   C3   C4   C5
Datum 1:  W    L    W    W    L
Datum 2:  L    W
Datum 3:  W    L    W
Datum 4:  L    L    W
Datum 5:  W    L    W

SLIDE 22

Virtues of RCD

▶ Converges correctly and efficiently: for N data pairs and K constraints
  ▶ Maximally K strata → maximally K demotion steps
  ▶ At each step, we have maximally K constraints to consider demoting (really, maximally K minus the current stratum)
  ▶ To see if a constraint needs to be demoted, we must check maximally N as-yet-unexplained mdp's
  ▶ Total: K²·N
  ▶ "Efficient" (Achilles heel: how many losing candidates?)
▶ That is, if there's a consistent ranking, it will find it
▶ If there are multiple consistent rankings, it will find one of them (or more than one: a partial hierarchy via strata)

SLIDE 23

Efficiency of error-driven learning

▶ Demotion happens only in response to an error
▶ An error occurs only when an L is not strictly dominated by a W
▶ A constraint that ends up in stratum k can only generate errors/be demoted k − 1 times

SLIDE 24

Consistent ranking, but not necessarily identical

Ways in which RCD fails to arrive at the grammar that generated the input data

▶ Total hierarchy vs. partial hierarchy (strata)
  ▶ RCD learns stratified hierarchies, but cannot (in the general case) learn languages generated by stratified hierarchies (why?)
  ▶ Simply assume that a total ranking is imposed later, by independent means (to guarantee that 'parents' will speak learnable languages)
▶ Ambiguities
  ▶ Constraints with the same violation patterns (have to be placed together by RCD)
  ▶ Constraints with just W's and ties
  ▶ This point is taken up below

SLIDE 25

The RCD and strict domination

▶ RCD is efficient because of strict domination:
  ▶ Once an mdp has a 'W-preferrer' installed, it's guaranteed to be explained, so it can be removed from consideration
▶ Interesting to note that under Tesar's formulation, there are some types of consistent ranking that RCD will never find (why?)

/UR/ C1 C2 C3: cand1 *; cand2 *!; cand3 * *!
/UR/ C1 C2 C3: cand1 *; cand2 *!; cand3 *

▶ Do human learners ever favor such rankings? (How would we know?)

SLIDE 26

Where do input/output pairs come from?

▶ Assumption so far: both are given by an omniscient being
▶ Relaxing this: receive just the SR, infer the UR and the ranking
▶ This can be tricky! Wrong choices could lead to dead ends

SLIDE 27

Unfortuitous choice of URs

Example: given SR [sa]…

▶ Candidate URs: /sa/, /ʃa/, /sap/, etc.
▶ In principle, given just this datum, any of these is possible (given the appropriate ranking)
▶ Danger: an incorrect selection of URs creates ranking paradoxes
  ▶ E.g., [sa] ← /sap/, [mata] ← /mat/
▶ A possible, but inefficient approach:
  ▶ Hypothesize URs in relatively unconstrained fashion (/UR/ → [SR] must not be harmonically bounded, but free to select among possible mappings)
  ▶ Construct mdp's
  ▶ Attempt to learn a consistent ranking
  ▶ If learning terminates with no consistent ranking, randomly modify the hypothesized URs until something works

SLIDE 28

A more conservative approach

Breaking into the system

▶ On hearing [sa], a good first guess is that the grammar must be able to map /sa/ to [sa]¹
▶ Less certain: does the grammar also map /sap/, /ʃa/ → [sa]?
▶ Start modestly: assume /sa/ (IN = OUT)
▶ Robust interpretive parsing: augment with any hidden structure needed for candidate generation/constraint evaluation
  ▶ E.g., parse prosodic structure, etc.
  ▶ We'll leave this aside for the moment

¹Tesar (2013, 2017) proves that this holds true for OT-consistent languages, as long as all interactions of processes are 'transparent' (there is no opacity).

SLIDE 29

What this buys us

How does the sa/ʃi learning scenario change when we are restricted to learning from (IN = OUT) pairs?

Input   Intended   *si   *s   *ʃ   Ident([±ant])
/sa/    ☞ sa              *
           ʃa                   *        *
/ʃa/       ʃa                   *
        ☞ sa              *              *
/si/       si        *    *
        ☞ ʃi                    *        *
/ʃi/    ☞ ʃi                    *
           si        *    *             *

SLIDE 30

What this buys us

How does the sa/ʃi learning scenario change when we are restricted to learning from (IN = OUT) pairs?

Input   Intended   *si   *s   *ʃ   Ident([±ant])
/sa/    ☞ sa              *
           ʃa                   *        *
/ʃi/    ☞ ʃi                    *
           si        *    *             *

▶ Learning converges efficiently
▶ But arrives at a different answer! (why?)

SLIDE 31

The challenge of positive evidence

▶ Function of a grammar is to distinguish between what maps

faithfully (grammatical) and what maps unfaithfully (ungrammatical)

▶ Positive evidence tells us only what maps faithfully ▶ Extremely ambiguous! Faithful mappings obey all F

▶ Tied or winner-preferring for all mdp’s (why?)

▶ Applying RCD based on positive evidence will never have reason

to demote F

▶ Unintended consequence: grammar typically allows quite a bit

more than was seen in input data ▶ The subset problem (e.g., Angluin 1980)

▶ Data is ambiguous: consistent with grammars that produce many

different languages

▶ Claim: we want a grammar that produces the smallest possible

language (allow attested forms, as little else as possible)

SLIDE 32

The subset principle

▶ Assumption 1: human learners do indeed prefer maximally restrictive analyses
  ▶ I.e., they solve the subset problem, somehow
  ▶ At a first pass, this is probably more right than it is wrong, but worth evaluating
  ▶ We'll come back to this, but for now we'll stick to the way the problem is framed in this literature
▶ Assumption 2: the way to solve the subset problem is by having the learner prefer the most restrictive analysis at each stage of learning
  ▶ Premise: a more permissive hypothesis can never be falsified based on positive evidence
  ▶ So, once you adopt a more permissive analysis, you're stuck there (no counterevidence)
  ▶ Here too, one might question the premise, but we'll grant it for now in order to see the kinds of approaches that have been proposed

SLIDE 33

Positive evidence as fixed points

▶ If a form can surface unfaithfully, it should be able to surface faithfully as well

/ba/    *ɓ   Ident(ɓ)   Ident([±voi])   *[+voi]
☞ ba                                       *
  pa                         *!

/ɓa/    *ɓ   Ident(ɓ)   Ident([±voi])   *[+voi]
  ba           *                           *
  pa           *             *
  ɓa    *!

▶ The faithful mapping has a subset of the violations of the unfaithful mapping
▶ Once again, a caveat: counterfeeding opacity
▶ The number of forms that can surface faithfully is a metric of the permissiveness of a grammar
  ▶ However, it's a tough one to calculate…

SLIDE 34

Something to keep in mind at the outset

▶ Even among languages that have the same set of fixed points, there are many possible grammars/mappings
  ▶ E.g., among languages that allow CV, CVC, CVCV, CVCVC:
    ▶ /prat/ → pat
    ▶ /prat/ → pərat
▶ Hayes and Prince & Tesar assume that knowledge of unfaithful mappings will be refined while learning alternations
▶ A plausible further criterion to keep in mind:
  ▶ Favor the correct set of unfaithful mappings, even in the absence of explicit evidence from alternations (seen, for example, in loanword phonology)

SLIDE 35

The logic of markedness and faithfulness

▶ Markedness constraints want to ban structures
  ▶ If the input contains marked structures, prefer candidates that eliminate them
▶ Faithfulness constraints license structures
  ▶ Allow forms to pass through the grammar unmodified
▶ The ranking of M and F determines the restrictiveness of the grammar
  ▶ M ≫ F: neutralization
  ▶ F ≫ M: contrast

SLIDE 36

The M ≫ F bias

▶ Each F ≫ M ranking allows a particular structure to surface faithfully
  ▶ E.g., F([±voi]) ≫ *[+voi]: /ba/ → [ba]
▶ The fewer F ≫ M rankings we have, the fewer structures the grammar will allow
▶ Bias: M ≫ F

SLIDE 37

Initial M ≫ F bias is not enough

Going back to the sa/ʃi allophonic language: try the M ≫ F bias

Input   Winner ~ Loser   *si   *s   *ʃ   F([±ant])
/sa/      sa ~ *ʃa              L    W       W
/ʃi/      ʃi ~ *si        W     W    L       W
Stratum                   1     1    1       2

▶ Stage 1: *si, *s, *ʃ ≫ F([±ant])
  ▶ *si is the only unranked constraint that doesn't prefer a loser
  ▶ F prefers all winners, but starts out ranked below stratum 1

SLIDE 38

Initial M ≫ F bias is not enough

Going back to the sa/ʃi allophonic language: try the M ≫ F bias

Input   Winner ~ Loser   *si   *s   *ʃ   F([±ant])
/sa/      sa ~ *ʃa              L    W       W
/ʃi/      ʃi ~ *si        W     W    L       W
Stratum                   1     2    2       2

▶ Stage 1: *si, *s, *ʃ ≫ F([±ant])
  ▶ *si is the only unranked constraint that doesn't prefer a loser
  ▶ F prefers all winners, but starts out ranked below stratum 1
▶ Stage 2: *si ≫ *s, *ʃ, F([±ant])
  ▶ *ʃ and F([±ant]) both prefer winners
  ▶ *s is demoted to the next stratum

SLIDE 39

Initial M ≫ F bias is not enough

Going back to the sa/ʃi allophonic language: try the M ≫ F bias

Input   Winner ~ Loser   *si   *ʃ   F([±ant])   *s
/sa/      sa ~ *ʃa              W       W        L
/ʃi/      ʃi ~ *si        W     L       W        W
Stratum                   1     2       2        3

SLIDE 40

M ≫ F as a persistent bias

▶ In the previous example, we need to maintain *ʃ ≫ F([±ant])
▶ That is, given the ambiguous datum /sa/ → [sa], prefer the markedness-based explanation
▶ Persistent bias: wherever possible, rank M ≫ F
  ▶ Give M 'first crack' at ambiguous data
  ▶ Deploy F only where truly needed
▶ Similar observations by many authors
  ▶ Ito and Mester (1999); Prince and Tesar (2004); Hayes (2004)

SLIDE 41

Favoring restrictive grammars

▶ Prince and Tesar (2004): augment RCD to help maximize the number of M ≫ F rankings in the grammar
▶ Hayes (2004): different refinements to RCD, with heuristics to minimize the number of additional surface forms that the resulting grammar will allow
▶ Jarosz (2007): a likelihood maximization approach, to directly reward grammars that generate 'smaller' languages
▶ See also Heinz and Riggle (2011) for some discussion
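A minimal sketch of the first idea, in the spirit of Prince and Tesar's Biased Constraint Demotion: relative to the rcd() sketch above, only the stratum-selection step changes, installing rankable markedness constraints alone whenever any exist. (Their full proposal also chooses among faithfulness constraints by how many markedness constraints each frees up; that refinement is omitted here.)

```python
def biased_rcd(constraints, kind, mdps):
    """Like rcd(), but with an M >> F bias: faithfulness constraints are
    installed only when no markedness constraint is rankable.
    `kind` maps each constraint name to 'M' or 'F'."""
    strata, remaining = [], set(constraints)
    pairs = [dict(p) for p in mdps]
    while remaining:
        rankable = {c for c in remaining
                    if not any(p.get(c) == 'L' for p in pairs)}
        if not rankable:
            raise ValueError('Inconsistent data')
        markedness = {c for c in rankable if kind[c] == 'M'}
        stratum = markedness or rankable   # the bias: prefer M constraints
        strata.append(sorted(stratum))
        remaining -= stratum
        pairs = [p for p in pairs
                 if not any(p.get(c) == 'W' for c in stratum)]
    return strata

kind = {'*si': 'M', '*s': 'M', '*sh': 'M', 'Ident(ant)': 'F'}
# Only the (IN = OUT) pairs, as on the preceding slides:
mdps = [
    {'*s': 'L', '*sh': 'W', 'Ident(ant)': 'W'},              # /sa/: sa ~ *sha
    {'*si': 'W', '*s': 'W', '*sh': 'L', 'Ident(ant)': 'W'},  # /shi/: shi ~ *si
]
print(biased_rcd(list(kind), kind, mdps))
# [['*si'], ['*sh'], ['*s'], ['Ident(ant)']] -- keeps *sh >> Ident(ant)
```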

SLIDE 42

Some features of RCD

▶ Learning is 'instantaneous'
  ▶ Constraints that assign L's are demoted as far as necessary to ensure they are 'covered' by W's
▶ Demotion by demerit: no credit for explaining mdp's
  ▶ Does not distinguish 'useful' (W-assigning) from 'harmless' (tie) constraints
▶ Non-robust
  ▶ The algorithm fails to converge if there are inconsistencies due to errors, variation, etc., and the resulting grammar is not guaranteed to 'resemble' the training data

SLIDE 43

The Gradual Learning Algorithm

The Gradual Learning Algorithm (GLA; Boersma 1997, Boersma and Hayes 2001)

▶ Reranking does not proceed stratum by stratum, creating strict rankings at each stage
▶ Rather, constraints are moved in small increments, getting closer and closer together and finally switching places

/pak/        *[CC   *Coda   Max(C)
pak ~ *pa            →L       W←

(*Coda, which prefers the loser, slides down the scale; Max(C), which prefers the winner, slides up)
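A minimal sketch of GLA-style updating with stochastic evaluation, under illustrative assumptions (the starting ranking values, noise, and plasticity are made up; candidates are given as violation-count dicts):

```python
import random

ranking_value = {'*[CC': 100.0, '*Coda': 100.0, 'Max(C)': 90.0}
PLASTICITY = 0.1   # size of each re-ranking step
NOISE = 2.0        # evaluation noise on the ranking scale

def sample_ranking():
    """Perturb each ranking value, then rank constraints by the noisy values."""
    noisy = {c: v + random.gauss(0, NOISE) for c, v in ranking_value.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

def optimal(candidates, ranking):
    """Strict domination: compare violation vectors in ranking order."""
    return min(candidates,
               key=lambda c: [candidates[c].get(k, 0) for k in ranking])

def gla_update(candidates, observed):
    """On an error, nudge winner-preferring constraints up and
    loser-preferring constraints down by the plasticity."""
    predicted = optimal(candidates, sample_ranking())
    if predicted == observed:
        return
    for c in ranking_value:
        delta = candidates[observed].get(c, 0) - candidates[predicted].get(c, 0)
        if delta < 0:    # fewer marks on the observed winner: promote
            ranking_value[c] += PLASTICITY
        elif delta > 0:  # more marks on the observed winner: demote
            ranking_value[c] -= PLASTICITY

# /pak/: observed [pak] (violates *Coda) vs. learner-preferred [pa] (violates Max(C))
candidates = {'pak': {'*Coda': 1}, 'pa': {'Max(C)': 1}}
for _ in range(1000):
    gla_update(candidates, observed='pak')
print(ranking_value)   # Max(C) has gradually climbed past *Coda
```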

SLIDE 44

Gradualism and acquisition

▶ Incremental updates provide one way of modeling the time course of acquisition
  ▶ Initial state unlike target ⇒ systematic errors
  ▶ Gradual reranking to achieve the necessary ranking conditions ⇒ milestones in mastering adult structures
▶ An important issue not addressed here: the comprehension/production asymmetry

SLIDE 45

Implementing small steps

Encoding gradient rankings

▶ Instead of constraint strata (a partial hierarchy), we need something that will let us express degrees of distance
▶ Boersma proposes a ranking scale: constraints are given numerical values corresponding to their importance
▶ Boersma's formulation turns out not to converge in some cases (Pater 2008), but this is easily fixed by being careful about the amount that you promote/demote by (Magri 2012)

(1) [Figure: categorical ranking of constraints C1, C2, C3 along a continuous scale, from strict (high-ranked) to lax (low-ranked)] (Boersma and Hayes 2001, p. 47)

SLIDE 46

Weighted constraint models for phonology

The formulation of the grammar in terms of numerical ranking values mirrors the immediate predecessor of OT: Harmonic Grammar

▶ Constraints assign numerical violations to output forms
  ▶ *Coda([pa]) = 0, *Coda([pak]) = 1, *Coda([pak.pak]) = 2
▶ May be conditioned on inputs
  ▶ Max([pa] | /pa/) = 0, Max([pa] | /pak/) = 1
▶ Constraints are weighted—e.g.,

  No.   Constraint   Weight
   1    *Coda        2.5
   2    Max          3.0

▶ Constraint interaction: a weighted sum of violations

  H(output) = ∑_{c ∈ Con} w_c × violations_c(output)

▶ Compare 'higher-ranked takes all' mark cancellation in OT
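A minimal sketch of evaluation under this scheme, using the slide's illustrative weights (the candidates and violation counts are assumptions for the example):

```python
weights = {'*Coda': 2.5, 'Max': 3.0}

def harmony(violations):
    """Harmony is the negative weighted sum of violations; higher is better."""
    return -sum(weights[c] * n for c, n in violations.items())

# /pak/: keep the coda, or delete the final C?
candidates = {'pak': {'*Coda': 1}, 'pa': {'Max': 1}}
for cand, viols in candidates.items():
    print(cand, harmony(viols))            # pak: -2.5, pa: -3.0
print(max(candidates, key=lambda c: harmony(candidates[c])))  # 'pak'
```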

SLIDE 47

Additive interaction (Pater 2009)

Possible example: Japanese loanwords (Kawahara 2006)

▶ Lyman's Law: at most one voiced obstruent in native morphemes
  ▶ ori 'fold' + kami 'paper' → origami, vs. hitori 'one person' + tabi 'travel' → hitoritabi, *hitoridabi 'alone'
  ▶ Violated in loanwords: bagɯ 'bug', giga 'giga'
▶ Coda stops from English are often borrowed as geminates
  sɯtoppɯ 'stop'      sɯnobbɯ 'snob'
  kitto 'kit'         kiddo 'kid'
  autoretto 'outlet'  reddo 'red'
▶ But not so willingly when the word contains another voiced obstruent
  beddo ~ betto 'bed'    doɡɡɯ ~ dokkɯ 'dog'

SLIDE 48

Isn't this dangerous?

Pater, Bhatt and Potts (2007), Pater (2009):

▶ Consider the potential dangers of ganging up
  ▶ E.g., don't have both a coda and a voiced stop in the same word
  ▶ Illustration of a weighting paradox (p. 11)
▶ Two interesting aspects of OT constraints
  ▶ The trading off of markedness and faithfulness
  ▶ The locality of violations (e.g., McCarthy)

SLIDE 49

The learning task

▶ Given:
  ▶ A set of structural descriptions (constraints)
  ▶ A set of output forms
  ▶ A probability distribution over a set of output forms
▶ Learn:
  ▶ Weights that will generate the observed distribution
▶ Procedure:
  ▶ Each time you encounter an input/output pair that you choose the wrong output for, slightly increment the weight of 'W' constraints, and decrement the weight of 'L' constraints (cf. 'perceptron learning')
▶ See Magri (2012) and Boersma and Pater (2016) for details and discussion
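A minimal sketch of this perceptron-style update for weighted constraints, reusing the illustrative weights and candidates from the Harmonic Grammar sketch above (the learning rate is an assumption):

```python
RATE = 0.1

def hg_update(weights, candidates, observed):
    """On an error, move each weight by RATE in the direction that
    favors the observed winner over the learner's own output."""
    predicted = max(candidates, key=lambda c: -sum(
        weights[k] * n for k, n in candidates[c].items()))
    if predicted == observed:
        return
    for c in weights:
        # 'W' constraints (more marks on the error) are incremented;
        # 'L' constraints (more marks on the winner) are decremented.
        diff = candidates[predicted].get(c, 0) - candidates[observed].get(c, 0)
        weights[c] += RATE * diff

weights = {'*Coda': 2.5, 'Max': 3.0}
candidates = {'pak': {'*Coda': 1}, 'pa': {'Max': 1}}
for _ in range(20):
    hg_update(weights, candidates, observed='pa')  # suppose [pa] is observed
print(weights)   # *Coda has risen and Max fallen, just enough for [pa] to win
```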

SLIDE 50

An advantage of numerical weights or ranking values

▶ Distances on a scale give us a way of thinking about ranking probabilities
  ▶ Very close values: almost a tie, both rankings have some probability
  ▶ Very far apart: C1 consistently outranks C2
▶ This allows us to model variability!
▶ Two main approaches
  ▶ "Noisy evaluation": the ranking values for each constraint vary a little each time the grammar is invoked (GLA = 'Noisy OT', Noisy HG)
  ▶ A probability distribution determined by the weighted violations (Maximum Entropy models):

    P(yᵢ | x) = exp(−∑_c w_c × c(yᵢ)) / ∑_{y ∈ Y} exp(−∑_c w_c × c(y))
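A minimal sketch of the Maximum Entropy computation, again with the illustrative weights and candidates used above:

```python
import math

weights = {'*Coda': 2.5, 'Max': 3.0}

def maxent_probs(candidates):
    """P(y) is proportional to exp(-weighted violations of y)."""
    scores = {c: math.exp(-sum(weights[k] * n for k, n in v.items()))
              for c, v in candidates.items()}
    z = sum(scores.values())           # the normalizing denominator
    return {c: s / z for c, s in scores.items()}

candidates = {'pak': {'*Coda': 1}, 'pa': {'Max': 1}}
print(maxent_probs(candidates))
# {'pak': 0.62..., 'pa': 0.37...}: nearby weights yield variable outputs
```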

SLIDE 51

References

Boersma, P. and J. Pater (2016). Convergence properties of a gradual learning algorithm for Harmonic Grammar. In J. McCarthy and J. Pater (Eds.), Harmonic Serialism and Harmonic Grammar, pp. 389–434. Sheffield: Equinox.

Hayes, B. (2004). Phonological acquisition in Optimality Theory: The early stages. In R. Kager, J. Pater, and W. Zonneveld (Eds.), Constraints in Phonological Acquisition, pp. 158–203. Cambridge: Cambridge University Press.

Heinz, J. and J. Riggle (2011). Learnability. In M. van Oostendorp, C. Ewen, B. Hume, and K. Rice (Eds.), The Blackwell Companion to Phonology, pp. 54–78. Wiley-Blackwell.

Jarosz, G. (2007). Restrictiveness in phonological grammar and lexicon learning. In M. Elliott, J. Kirby, O. Sawada, E. Staraki, and S. Yoon (Eds.), Proceedings of the 43rd Annual Meeting of the Chicago Linguistic Society, pp. 125–139. Chicago Linguistic Society.

Magri, G. (2012). Convergence of error-driven ranking algorithms. Phonology 29(2), 213–269.

Pater, J. (2008). Gradual learning and convergence. Linguistic Inquiry 39, 334–345.

Prince, A. and B. Tesar (2004). Learning phonotactic distributions. In R. Kager, J. Pater, and W. Zonneveld (Eds.), Constraints in Phonological Acquisition, pp. 245–291. Cambridge: Cambridge University Press.

Tesar, B. (2013). Output-Driven Phonology: Theory and Learning. Cambridge: Cambridge University Press.

SLIDE 52

References

Tesar, B. (2017). Phonological learning with output-driven maps. Language Acquisition 24, 148–167.

Tesar, B. and P. Smolensky (1996). Learnability in Optimality Theory (short version). Technical Report JHU-CogSci-96-2, Johns Hopkins University.
