

slide-1
SLIDE 1

Identity Bias & Generalization in a Variable-free Model of Phonotactics

Brandon Prickett
University of Massachusetts Amherst
LSA Annual Meeting, January 3rd, 2020


slide-6
SLIDE 6

Road Map

1. Introduction
   a) Variables in phonological rules
   b) Variables in phonological constraints
   c) Apparent evidence for variables
   d) Variable-free phonotactic models

2. Probabilistic Feature Attention
   a) Feature attention
   b) Probabilistic feature attention
   c) Learning with PFA
   d) Ambiguous constraint violations
   e) An example with PFA

3. Modeling Identity Bias
   a) Simulation Set-Up
   b) Results
   c) Why does PFA create an Identity Bias?

4. Modeling Identity Generalization
   a) Simulation Set-Up
   b) Results
   c) Why does PFA cause Identity Generalization?

5. Discussion
   a) Future work
   b) Conclusions

6

slide-7
SLIDE 7

Introduction

7


slide-11
SLIDE 11

Variables in Phonological Rules

  • Halle (1962) first proposed that assimilation and dissimilation patterns could be described using

explicit, algebraic variables.

  • This provided these kinds of patterns with simpler representations, which is useful since they’re

typologically common.

  • For example, if a language has voicing assimilation, this could be represented with the rule:

[-Syllabic] → [αVoice] / _[αVoice]
…where [α] stands for either [+] or [-] and has the same value in both feature bundles.

  • An analysis of the same pattern that was variable free would require two rules:

[-Syllabic] → [+Voice] / _[+Voice]
[-Syllabic] → [-Voice] / _[-Voice]

11


slide-13
SLIDE 13

Variables in Phonological Constraints

  • Constraint-based theories can also use variables in their representations. A constraint like

*[αVoice][-αVoice] could enforce the same kind of assimilation pattern as Halle’s rules.

  • However, recent proposals for phonotactic learning have lacked variables (e.g. Hayes and

Wilson 2008; Pater and Moreton 2014). These models require two constraints to represent the same assimilatory process:

13

With a variable:
        *[αVoice][-αVoice]
[td]    *
[dt]    *

Without variables:
        *[+Voice][-Voice]    *[-Voice][+Voice]
[td]                         *
[dt]    *


slide-18
SLIDE 18

Apparent Evidence for Variables

  • A considerable amount of work has argued against variable-free approaches to phonology

(e.g. Moreton 2012; Gallagher 2013; Berent 2013).

  • Whether variables are necessary for any domain of cognition is an active debate in the cognitive

science literature (Marcus 2001; Endress et al. 2007; Gervain and Werker 2013; Alhama and Zuidema 2018).

  • Gallagher (2013) presented two phenomena as evidence for variables:
  • Identity Bias: identity-based patterns are easier to learn than arbitrary ones
  • Identity Generalization: people generalize identity-based patterns in a way that would be predicted by

theories that make use of explicit variables (see Berent et al. 2012, Linzen & Gallagher 2017, and Tang & Baer-Henney 2019 for similar results)

  • While humans demonstrated both of these behaviors in artificial language learning experiments, a variable-free phonotactic learner (Hayes and Wilson 2008) was unable to capture either phenomenon.
  • However, in this talk I’ll show results from a model with no explicit variables that manages to capture these two phenomena.
  • To do this I’ll use a novel mechanism, called Probabilistic Feature Attention, that assumes structured ambiguity occurs throughout the learning process.

18


slide-24
SLIDE 24

Variable-Free Phonotactic Models

  • The Hayes and Wilson (2008) phonotactic model learns a probability distribution over all possible

words in a language after being trained on that language’s lexicon.

  • It represents this probability distribution using a set of weighted constraints like

*[+voice] or *[-tense][+word_boundary].

  • The model’s probability estimate for a word is proportional to the exponential of the weighted sum of that word’s constraint violations (sketched in code after this slide):

P(word_j) = exp(H_j) / Σ_k exp(H_k),   where H_j = Σ_{c ∈ C} w_c ∙ v_{c,j}

(Here w_c is the weight of constraint c and v_{c,j} is the number of times word j violates it.)

  • The results in this talk are from a different, but similar phonotactic model: GMECCS (“Gradual

Maximum Entropy with a Conjunctive Constraint Schema”; Pater & Moreton, 2014; Moreton et al. 2017).

  • The main difference between the two models is that while Hayes and Wilson’s (2008) learner induces its

constraint set, GMECCS begins learning with a constraint set that includes every possible conjunction of the relevant phonological features.

  • This means that GMECCS’s learning process only consists of finding the optimal set of weights for those

constraints.

24
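To make the probability formula above concrete, here is a minimal NumPy sketch of the MaxEnt computation. It is illustrative only, not the implementation behind these results (that software is linked at the end of the talk); the names maxent_probabilities, violations, and weights are hypothetical.

```python
import numpy as np

def maxent_probabilities(violations, weights):
    """Probability of every candidate word under a MaxEnt grammar.

    violations: (n_words, n_constraints) array of violation counts v_{c,j}
    weights:    (n_constraints,) array of constraint weights w_c
                (penalty constraints carry negative weights here)
    """
    harmony = violations @ weights      # H_j = sum over constraints of w_c * v_{c,j}
    scores = np.exp(harmony)            # exp(H_j)
    return scores / scores.sum()        # normalize over all candidate words

# Toy example: three candidate words, two constraints
violations = np.array([[1, 0],
                       [0, 1],
                       [0, 0]])
weights = np.array([-2.0, -0.5])
print(maxent_probabilities(violations, weights))  # the violation-free word gets the most probability
```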

slide-25
SLIDE 25

Probabilistic Feature Attention

25


slide-30
SLIDE 30

Feature Attention

  • In cognitive science and machine learning the

word “attention” can mean a lot of different things.

  • I’m going to be using it in the same sense as

Nosofsky (1986), who used “attention” to describe how much a model uses each dimension of a feature space.

  • For example, when learning a visual pattern using

shapes of different colors and sizes…

  • One might attend to the features [shape], [size], and

[color] equally (as in A)…

  • …or with a disproportionate amount of attention

given to [color] (as in B).

30


slide-35
SLIDE 35

Probabilistic Feature Attention

  • Probabilistic Feature Attention (PFA) is a novel mechanism (inspired by Dropout; Srivastava et al.

2014) that randomly distributes attention across different features over the course of learning.

  • PFA has three main assumptions:

35

1) When acquiring phonotactics, a learner doesn’t attend to every phonological feature in every word, at every weight update.
2) This lack of attention creates ambiguity in the learner’s input, since it isn’t attending to every feature that might distinguish different data.
3) In the face of ambiguity, the grammar errs on the side of assigning constraint violations.

  • The probability that any given feature is attended to at a given point in learning is a hyperparameter set by the analyst.
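As a rough illustration of these assumptions, one way the per-update attention could be sampled is shown below, with the attention probability as the hyperparameter just mentioned. This is an illustrative sketch, not the code used for the reported simulations; sample_attention and p_attend are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def sample_attention(features, p_attend=0.25):
    """Independently decide, for each feature, whether it is attended to
    on the current weight update; unattended features are masked out,
    which is what creates the ambiguity described above."""
    attended = rng.random(len(features)) < p_attend
    return {f for f, keep in zip(features, attended) if keep}

print(sample_attention(["Voice", "Labial", "Dorsal"]))  # a random subset of the features
```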


slide-40
SLIDE 40

Learning with PFA

  • While PFA could be combined with various learning algorithms and theoretical frameworks,

here I pair it with GMECCS (Pater and Moreton 2014; Moreton et al. 2017) and Stochastic Gradient Descent with batch sizes of 1 (all results also hold for batch gradient descent).

  • The learning update for weights in gradient descent is proportional to the difference between

the observed violations of a constraint in the training data and the number of violations for that constraint that the model expects to find in the training data, based on its current weights:

  • PFA introduces ambiguity into the calculation of the model’s expected probabilities as well as the observed violation counts Obs(v_i) at each weight update.

  • This added ambiguity means that the weight updates will not always move the model toward a

more optimal solution in learning.

40

Δw_i = lr ∙ (Obs(v_i) − Exp(v_i))
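In code, the update above amounts to a single line; under PFA, both the observed and the expected violation counts would be computed with the sampled attention mask applied. An illustrative sketch (hypothetical names, not the talk’s implementation):

```python
import numpy as np

def sgd_update(weights, observed_violations, expected_violations, lr=0.05):
    """One stochastic gradient step on the constraint weights:
    delta_w = lr * (Obs(v) - Exp(v)), applied to every constraint at once.

    observed_violations: each constraint's violations in the current datum
    expected_violations: each constraint's violations averaged over the
        model's current probability distribution over possible words
    """
    return weights + lr * (np.asarray(observed_violations) - np.asarray(expected_violations))

# Example: one observed datum versus the model's current expectations
w = np.zeros(3)
w = sgd_update(w, observed_violations=[1, 0, 0], expected_violations=[0.4, 0.3, 0.3])
print(w)  # [ 0.03  -0.015 -0.015]
```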


slide-44
SLIDE 44

Ambiguous Constraint Violations

  • For example, let’s consider a simplified scenario where words are only one segment long, and

there are only four possible segments and two phonological features: [Voice] and [Continuant].

  • When all features are attended to, we get unique violation profiles for every possible segment…
  • …but if we only attend to [voice], [t]/[s] and [d]/[z] are ambiguous.
  • And since the grammar errs on the side of assigning constraint violations when faced with ambiguity, each ambiguous datum has a violation profile that’s the union of the profiles of the segments it’s ambiguous between (see the tables and sketch below).

44

Full attention (each segment has a unique violation profile):
[d]: *[+v], *[-c], *[+v, -c]
[z]: *[+v], *[+c], *[+v, +c]
[t]: *[-v], *[-c], *[-v, -c]
[s]: *[-v], *[+c], *[-v, +c]

Attending only to [Voice] (each datum gets the union of the profiles it is ambiguous between):
[d] or [z]?: *[+v], *[+c], *[-c], *[+v, +c], *[+v, -c]
[t] or [s]?: *[-v], *[+c], *[-c], *[-v, +c], *[-v, -c]
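The "err toward violation" rule can be computed directly: count a constraint as violated whenever the datum matches it on every attended feature, which yields exactly the union profiles in the second table. An illustrative sketch on the toy inventory above (hypothetical names, not the original implementation):

```python
from itertools import product

FEATURES = ["voice", "cont"]
SEGMENTS = {                      # toy inventory from the example
    "d": {"voice": "+", "cont": "-"},
    "z": {"voice": "+", "cont": "+"},
    "t": {"voice": "-", "cont": "-"},
    "s": {"voice": "-", "cont": "+"},
}
# The 8 constraints above: 4 single-feature constraints plus 4 conjunctions
CONSTRAINTS = [{f: v} for f in FEATURES for v in "+-"] + \
              [dict(zip(FEATURES, vals)) for vals in product("+-", repeat=len(FEATURES))]

def violation_profile(segment, attended):
    """A constraint is assigned whenever the segment matches it on every
    *attended* feature; unattended features are treated as ambiguous, so
    the result is the union over all segments the input could have been."""
    feats = SEGMENTS[segment]
    return [int(all(feats[f] == v for f, v in c.items() if f in attended))
            for c in CONSTRAINTS]

print(sum(violation_profile("d", attended={"voice", "cont"})))  # 3 violations: unique profile
print(violation_profile("d", attended={"voice"})
      == violation_profile("z", attended={"voice"}))            # True: same 5-violation union profile
```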


slide-50
SLIDE 50

An Example with PFA

  • Let’s say our model is acquiring a language in which only the

words [s] and [z] are grammatical.

  • In a model without PFA, every learning update will add more

weight to constraints like *[-Continuant] and less weight to constraints like *[+Continuant].

  • However, if PFA is being used, any iteration in which

[Continuant] is not attended to will be uninformative for the model…

  • …since attested and unattested sounds will both either be

[+Voice, ?Continuant] or [-Voice, ?Continuant].

  • Iterations that do attend to [Continuant] will be the only ones in

which the model is able to learn anything.

  • In more complex languages, this structured ambiguity can

push the model in unexpected directions, causing it to behave differently than models without PFA.

50

Attested: [s], [z]        Unattested: [t], [d]
   ↓ PFA (when [Continuant] is not attended to)
Attested: [+Voice, ?Cont.], [-Voice, ?Cont.]        Unattested: [+Voice, ?Cont.], [-Voice, ?Cont.]

slide-51
SLIDE 51

Modeling Identity Bias

51


slide-59
SLIDE 59

Simulation Set-Up

  • To see if PFA allows a model to capture Identity Bias, I simulated Gallagher’s (2013) first experiment:
  • The experiment had two language conditions, both of which taught participants a restriction that didn’t allow

most [+Voice][+Voice] sequences to surface.

  • She found that when exceptions to this restriction were identity-based, the pattern was learned more quickly

than when the exceptions were arbitrary.

  • Training data for the simulations was identical to the experiment’s, except all vowels were removed and no information about URs was given to the model (the exceptions to the restriction are the [+Voice][+Voice] words in each list):

  • Ident Language: [bb], [dp], [gp], [dd], [dk], [bt], [gg], [gt], and [bk].
  • Arbitrary Language: [bd], [dg], [gb], [bp], [bk], [dp], [dt], [gk], and [gt].
  • GMECCS was given a constraint set that represented every possible conjunction of the features

[±Voice], [±Labial], and [±Dorsal] for n-grams of length 1 and 2.

  • E.g. *[+Voice], *[+Voice, -Labial][-Voice, +Labial], *[+Voice, +Labial, -Dorsal], etc…
  • A model with Identity Bias should learn the Ident Language more quickly than the arbitrary one.

59
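For concreteness, a conjunctive constraint schema of this kind can be enumerated mechanically, as in the illustrative sketch below. This is written under assumptions of my own about the schema (the exact constraint set used by GMECCS may differ in details such as word-boundary symbols) and is not the original simulation code.

```python
from itertools import combinations, product

FEATURES = ["Voice", "Labial", "Dorsal"]

def feature_bundles():
    """Every conjunction of valued features, e.g. ('+Voice', '-Labial')."""
    bundles = []
    for r in range(1, len(FEATURES) + 1):
        for subset in combinations(FEATURES, r):
            for values in product("+-", repeat=r):
                bundles.append(tuple(v + f for v, f in zip(values, subset)))
    return bundles

def constraint_set():
    """Markedness constraints over single segments and two-segment sequences."""
    bundles = feature_bundles()
    unigrams = ["*[" + ", ".join(b) + "]" for b in bundles]
    bigrams = ["*[" + ", ".join(b1) + "][" + ", ".join(b2) + "]"
               for b1 in bundles for b2 in bundles]
    return unigrams + bigrams

constraints = constraint_set()
print(len(constraints))   # 26 unigram + 676 bigram constraints = 702 in this sketch
print(constraints[:2])    # ['*[+Voice]', '*[-Voice]']
```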

slide-60
SLIDE 60

Results (without PFA)

  • First, a version of GMECCS with no PFA was trained separately on these two languages for 200

epochs each, with a learning rate of 0.05.

60

slide-61
SLIDE 61

Results (with PFA)

  • Then GMECCS was run with PFA, with the same hyperparameters and training data as before,

but with only a .25 probability of attending to each feature in any given iteration.

61


slide-65
SLIDE 65

Why does PFA capture Identity Bias?

  • The Identity Language is easier for the learner because attested words in that language are more likely to become ambiguous with one another in ways that aid the acquisition of the overall pattern.
  • For example, [bb] and [dd] are ambiguous with one another at any point in learning where

[Labial] is not being attended to.

  • This means that any time the model sees either of these data points and isn’t paying attention to their

labiality, it will move twice as many constraint weights in the correct direction…

  • …since it will be correctly updating the weights of constraints that [bb] violates and constraints that [dd]

violates.

  • The Arbitrary Language’s attested words do not have this kind of systematic similarity across data, so

the random ambiguity just creates noise in the learning process, making it more difficult to acquire than its counterpart.

65

slide-66
SLIDE 66

Modeling Identity Generalization

66


slide-72
SLIDE 72

Simulation Set-Up

  • To test whether humans demonstrated Identity Generalization, Gallagher’s (2013) second

experiment again trained participants on a voicing restriction with identity-based exceptions.

  • However, in this experiment a single pair of identical consonants was withheld from training (e.g. [gg]).
  • In testing, participants were more likely to treat this withheld word as an exception than non-identical words that also violated the restriction (e.g. [dg]).

  • The simulation for Gallagher’s (2013) second experiment used the same training data as the

Ident Language from before, except with the relevant items withheld: [bb], [dp], [gp], [dd], [bt], [gt], and [bk].

  • And at the end of training, the model was asked to estimate probabilities for the crucial test

items: [gg] and [dg].

  • A model exhibiting Identity Generalization should assign more probability to [gg] than to [dg]

at the end of training, even though both occur with a probability of 0 in the training data.

72

slide-73
SLIDE 73

Results (without PFA)

  • First, the standard version of GMECCS was trained on the Ident Language for 200 epochs, with a

learning rate of 0.05, and tested on the words [gg] and [dg].

73

slide-74
SLIDE 74

Results (with PFA)

  • Then GMECCS was run with PFA, with the same hyperparameters and training data as before,

but with only a .25 probability of attending to each feature in any given iteration.

74
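The sketch below shows only the attention-sampling step of PFA at the probability used here: on each learning iteration, every feature is independently attended to with probability 0.25. Treating unattended features as simply masked out is a simplification; how the resulting ambiguous constraint violations are handled is the mechanism described earlier in the talk, and the full implementation is in the repository linked on the Future Work slides. The feature names are hypothetical.

```python
import numpy as np

ATTENTION_PROB = 0.25  # probability of attending to each feature on a given iteration

def sample_attention(feature_names, p=ATTENTION_PROB, rng=None):
    """Return the set of features attended to on one learning iteration."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(len(feature_names)) < p
    return {f for f, keep in zip(feature_names, mask) if keep}

# Hypothetical feature set for the stop consonants used in the simulations:
FEATURES = ["Voice", "Labial", "Coronal", "Dorsal"]

attended = sample_attention(FEATURES)
print(attended)  # e.g. {'Voice'}; constraints referring to other features are ambiguous
```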

slide-75
SLIDE 75

Why does PFA capture Identity Generalization?

  • This was a result of the fact that, over the course of learning, [dg] is more likely to become

ambiguous with other unattested words than [gg] is (because there are more [-Dorsal][+Dorsal] unattested words).

All Unattested Words, by [Dorsal] Values

  • Since there are more [-dorsal][+dorsal] words with a probability of 0 than their [+dorsal][+dorsal] counterparts, [dg] is more likely to become ambiguous with other zero-probability words in the training data.

75

slide-76
SLIDE 76

Why does PFA capture Identity Generalization?

  • This was a result of the fact that, over the course of learning, [dg] is more likely to become

ambiguous with other unattested words than [gg] is (because there are more [-Dorsal][+Dorsal] unattested words).

All Unattested Words, by [Dorsal] Values

  • Since there are more [-dorsal][+dorsal] words with a probability of 0 than their [+dorsal][+dorsal] counterparts, [dg] is more likely to become ambiguous with other zero-probability words in the training data.

76

[-Dorsal][-Dorsal]: tt, pd, td, pp, tp, pb, tb, bd, dt, bp, db, pt
[+Dorsal][-Dorsal]: kt, kd, kp, kb, gb, gd
[-Dorsal][+Dorsal]: tk, tg, dk, dg, pk, pg, bg
[+Dorsal][+Dorsal]: kk, kg, gk, gg

slide-77
SLIDE 77

Why does PFA capture Identity Generalization?

  • This was a result of the fact that, over the course of learning, [dg] is more likely to become

ambiguous with other unattested words than [gg] is (because there are more [-Dorsal][+Dorsal] unattested words).

All Unattested Words, by [Dorsal] Values

  • Since there are more [-dorsal][+dorsal] words with a probability of 0 than their [+dorsal][+dorsal]

counterparts, [dg] is more likely to become ambiguous with other zero-probability words in the training data.

77

[-Dorsal][-Dorsal]: tt, pd, td, pp, tp, pb, tb, bd, dt, bp, db, pt
[+Dorsal][-Dorsal]: kt, kd, kp, kb, gb, gd
[-Dorsal][+Dorsal]: tk, tg, dk, dg, pk, pg, bg
[+Dorsal][+Dorsal]: kk, kg, gk, gg
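As a sanity check on the table above, the short script below groups the listed unattested words by their [Dorsal] values and counts each group. The word list is copied from the table; the DORSAL set and the grouping code are illustrative.

```python
DORSAL = {"k", "g"}  # dorsal consonants in this inventory

# Unattested words, copied from the table above:
UNATTESTED = ["tt", "pd", "kt", "tk", "kk", "td", "pp", "kd", "tg", "kg",
              "tp", "pb", "kp", "dk", "gk", "tb", "bd", "kb", "dg", "gg",
              "dt", "bp", "gb", "pk", "db", "gd", "pg", "pt", "bg"]

def dorsal_values(word):
    """[Dorsal] value of each consonant, e.g. 'dg' -> ('-Dorsal', '+Dorsal')."""
    return tuple("+Dorsal" if c in DORSAL else "-Dorsal" for c in word)

groups = {}
for w in UNATTESTED:
    groups.setdefault(dorsal_values(w), []).append(w)

for values, words in sorted(groups.items()):
    print(values, len(words), words)

# [dg] shares its [-Dorsal][+Dorsal] values with 6 other zero-probability words,
# while [gg] shares its [+Dorsal][+Dorsal] values with only 3, which is why [dg]
# is more often ambiguous with other unattested words under PFA.
```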

slide-78
SLIDE 78

Discussion

78

slide-79
SLIDE 79

Future Work

  • One major way to continue the work on PFA is to start applying it to real language data.
  • Berent et al. (2012) showed that variables can capture Identity Generalization demonstrated by native

Hebrew speakers.

Predictions about what kinds of mistakes infants make while acquiring language could also be used to distinguish the predictions made by PFA from those made by variables. For example, Gervain and Werker (2013) showed that newborns and older infants do demonstrate differences in their ability to perform Identity Generalization (also, see Marcus et al. 1999). Are there other acquisition-based studies meant to explore the predictions of variables?

Are there other phenomena that might be captured using PFA?

It’s hard to predict in what ways PFA will affect a MaxEnt model’s learning. If you’re interested in applying PFA to any of these phenomena, you can find the software I used at https://github.com/blprickett/Feature-Attention

79

slide-80
SLIDE 80

Future Work

  • One major way to continue the work on PFA is to start applying it to real language data.
  • Berent et al. (2012) showed that variables can capture Identity Generalization demonstrated by native

Hebrew speakers.

  • To scale up PFA to this kind of data, a more efficient phonotactic learner than GMECCS would likely be

needed (e.g. Hayes and Wilson 2008 or Moreton 2019).

Predictions about what kinds of mistakes infants make while acquiring language could also be used to distinguish the predictions made by PFA from those made by variables.

For example, Gervain and Werker (2013) showed that newborns and older infants do demonstrate differences in their ability to perform Identity Generalization (also, see Marcus et al. 1999). Are there other acquisition-based studies meant to explore the predictions of variables?

Are there other phenomena that might be captured using PFA?

It’s hard to predict in what ways PFA will affect a MaxEnt model’s learning. If you’re interested in applying PFA to any of these phenomena, you can find the software I used at https://github.com/blprickett/Feature-Attention

80

slide-81
SLIDE 81

Future Work

  • One major way to continue the work on PFA is to start applying it to real language data.
  • Berent et al. (2012) showed that variables can capture Identity Generalization demonstrated by native

Hebrew speakers.

  • To scale up PFA to this kind of data, a more efficient phonotactic learner than GMECCS would likely be

needed (e.g. Hayes and Wilson 2008 or Moreton 2019).

  • Are there other phenomena that might be captured using PFA?
  • It’s hard to predict in what ways PFA will affect a MaxEnt model’s learning.

If you’re interested in applying PFA to any of these phenomena, you can find the software I used at https://github.com/blprickett/Feature-Attention

81

slide-82
SLIDE 82

Future Work

  • One major way to continue the work on PFA is to start applying it to real language data.
  • Berent et al. (2012) showed that variables can capture Identity Generalization demonstrated by native

Hebrew speakers.

  • To scale up PFA to this kind of data, a more efficient phonotactic learner than GMECCS would likely be

needed (e.g. Hayes and Wilson 2008 or Moreton 2019).

  • Are there other phenomena that might be captured using PFA?
  • It’s hard to predict in what ways PFA will affect a MaxEnt model’s learning.
  • If you’re interested in applying PFA to any of these phenomena, you can find the software I used at https://github.com/blprickett/Feature-Attention

82

slide-83
SLIDE 83

Conclusions

  • Here I’ve shown that two phenomena typically attributed to variables can be captured by a

variable-free MaxEnt model equipped with PFA. In other work that we don’t have time to go into, I’ve found that PFA can model two other phenomena in phonotactic learning:

  • Intradimensional Bias, which Moreton (2012) observed in phonotactic learning and attributed to variables.
  • Similarity-based Generalization, which Cristia et al. (2013) demonstrated in a phonotactic learning experiment and which variables cannot capture.

Unlike variables, PFA provides a unified account for all of these different phenomena, by assuming structured ambiguity throughout the learning process.

83

slide-84
SLIDE 84

Conclusions

  • Here I’ve shown that two phenomena typically attributed to variables can be captured by a

variable-free MaxEnt model equipped with PFA.

  • In other work that we don’t have time to go into, I’ve found that PFA can model two other

phenomena in phonotactic learning:

  • Intradimensional Bias, which Moreton (2012) observed in phonotactic learning and attributed to

variables.

  • Similarity-based Generalization, which Cristia et al. (2013) demonstrated in a phonotactic learning

experiment and which variables cannot capture.

Unlike variables, PFA provides a unified account for all of these different phenomena, by assuming structured ambiguity throughout the learning process.

84

slide-85
SLIDE 85

Conclusions

  • Here I’ve shown that two phenomena typically attributed to variables can be captured by a

variable-free MaxEnt model equipped with PFA.

  • In other work that we don’t have time to go into, I’ve found that PFA can model two other

phenomena in phonotactic learning:

  • Intradimensional Bias, which Moreton (2012) observed in phonotactic learning and attributed to

variables.

  • Similarity-based Generalization, which Cristia et al. (2013) demonstrated in a phonotactic learning

experiment and which variables cannot capture.

  • Unlike variables, PFA provides a unified account for all of these different phenomena, by

assuming structured ambiguity throughout the learning process.

85

slide-86
SLIDE 86

Conclusions

  • Here I’ve shown that two phenomena typically attributed to variables can be captured by a

variable-free MaxEnt model equipped with PFA.

  • In other work that we don’t have time to go into, I’ve found that PFA can model two other

phenomena in phonotactic learning:

  • Intradimensional Bias, which Moreton (2012) observed in phonotactic learning and attributed to

variables.

  • Similarity-based Generalization, which Cristia et al. (2013) demonstrated in a phonotactic learning

experiment and which variables cannot capture.

  • Unlike variables, PFA provides a unified account for all of these different phenomena, by

assuming structured ambiguity throughout the learning process.

  • This suggests that PFA could be a useful alternative to variables in theories of phonotactic learning.

86

slide-87
SLIDE 87

Acknowledgments

Thanks to the members of UMass’s Sound Workshop, UMass’s Phonology Reading Group, the attendees of the 2018 meeting of NECPHON, and the audience at UNC’s 2019 Spring Colloquium. For helpful conversations about this work, I also thank Elliott Moreton, Joe Pater, and Gaja Jarosz.

87

slide-88
SLIDE 88

References

Berent, I. (2013). The phonological mind. Trends in Cognitive Sciences, 17(7), 319–327.
Berent, I., Wilson, C., Marcus, G., & Bemis, D. K. (2012). On the role of variables in phonology: Remarks on Hayes and Wilson 2008. Linguistic Inquiry, 43(1), 97–119.
Cristia, A., Mielke, J., Daland, R., & Peperkamp, S. (2013). Similarity in the generalization of implicitly learned sound patterns. Laboratory Phonology, 4(2), 259–285.
Endress, A. D., Dehaene-Lambertz, G., & Mehler, J. (2007). Perceptual constraints and the learnability of simple grammars. Cognition, 105(3), 577–614.
Gallagher, G. (2013). Learning the identity effect as an artificial language: Bias and generalisation. Phonology, 30(2), 253–295.
Gervain, J., & Werker, J. F. (2013). Learning non-adjacent regularities at age 0;7. Journal of Child Language, 40(4), 860–872.
Halle, M. (1962). A descriptive convention for treating assimilation and dissimilation. Quarterly Progress Report, 66, 295–296.
Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3), 379–440.
Linzen, T., & Gallagher, G. (2017). Rapid generalization in phonotactic learning. Laboratory Phonology: Journal of the Association for Laboratory Phonology, 8(1).
Marcus, G. (2001). The algebraic mind. Cambridge, MA: MIT Press.
Marcus, G., Vijayan, S., Rao, S. B., & Vishton, P. M. (1999). Rule learning by seven-month-old infants. Science, 283(5398), 77–80.
Moreton, E. (2012). Inter- and intra-dimensional dependencies in implicit phonotactic learning. Journal of Memory and Language, 67(1), 165–183.
Moreton, E. (2019). Constraint breeding during on-line incremental learning. Proceedings of the Society for Computation in Linguistics, 2(1), 69–80.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115(1), 39.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Tang, K., & Baer-Henney, D. (2019). Disentangling L1 and L2 effects in artificial language learning. Presented at the Manchester Phonology Meeting, Manchester, UK. Retrieved from http://www.lel.ed.ac.uk/mfm/27mfm-abbk.pdf

88

slide-89
SLIDE 89

Appendix

89

slide-90
SLIDE 90

Gallagher (2013): Experiment 1 Design

  • This experiment was aimed at discovering whether identity-based patterns were easier to learn

than more arbitrary ones (i.e. Identity Bias).

Examples from Arbitrary Lang / Examples from Identity Lang

  • Participants were trained on these mappings and then tested on novel mappings that involved the same crucial consonantal pairs.

90

slide-91
SLIDE 91

Gallagher (2013): Experiment 1 Design

  • This experiment was aimed at discovering whether identity-based patterns were easier to learn

than more arbitrary ones (i.e. Identity Bias).

  • To test this, Gallagher trained participants on one of two devoicing patterns: a pattern with

unsystematic exceptions to devoicing (“Arbitrary Lang” below) and a pattern in which devoicing did not occur when a word’s consonants were identical (“Identity Lang” below).

Examples from Arbitrary Lang / Examples from Identity Lang

  • Participants were trained on these mappings and then tested on novel mappings that involved the same crucial consonantal pairs.

91

Examples from Arbitrary Lang:
Alternating          Exception
[babu] → [bapu]      [badu] → [badu]
[dabu] → [dapu]      [dagu] → [daku]
[gabu] → [gapu]      [gabu] → [gabu]

slide-92
SLIDE 92

Gallagher (2013): Experiment 1 Design

  • This experiment was aimed at discovering whether identity-based patterns were easier to learn

than more arbitrary ones (i.e. Identity Bias).

  • To test this, Gallagher trained participants on one of two devoicing patterns: a pattern with

unsystematic exceptions to devoicing (“Arbitrary Lang” below) and a pattern in which devoicing did not occur when a word’s consonants were identical (“Identity Lang” below).

Examples from Arbitrary Lang / Examples from Identity Lang

  • Participants were trained on these mappings and then tested on novel mappings that involved the same crucial consonantal pairs.

92

Examples from Arbitrary Lang:
Alternating          Exception
[babu] → [bapu]      [badu] → [badu]
[dabu] → [dapu]      [dagu] → [daku]
[gabu] → [gapu]      [gabu] → [gabu]

slide-93
SLIDE 93

Gallagher (2013): Experiment 1 Design

  • This experiment was aimed at discovering whether identity-based patterns were easier to learn

than more arbitrary ones (i.e. Identity Bias).

  • To test this, Gallagher trained participants on one of two devoicing patterns: a pattern with

unsystematic exceptions to devoicing (“Arbitrary Lang” below) and a pattern in which devoicing did not occur when a word’s consonants were identical (“Identity Lang” below).

Examples from Arbitrary Lang / Examples from Identity Lang

  • Participants were trained on these mappings and then tested on novel mappings that involved the same crucial consonantal pairs.

93

Examples from Arbitrary Lang:
Alternating          Exception
[babu] → [bapu]      [badu] → [badu]
[dabu] → [dapu]      [dagu] → [daku]
[gabu] → [gapu]      [gabu] → [gabu]

Examples from Identity Lang:
Alternating          Exception
[badu] → [batu]      [babu] → [babu]
[dagu] → [daku]      [dadu] → [dadu]
[gadu] → [gatu]      [gagu] → [gagu]

slide-94
SLIDE 94

Gallagher (2013): Experiment 1 Design

  • This experiment was aimed at discovering whether identity-based patterns were easier to learn

than more arbitrary ones (i.e. Identity Bias).

  • To test this, Gallagher trained participants on one of two devoicing patterns: a pattern with

unsystematic exceptions to devoicing (“Arbitrary Lang” below) and a pattern in which devoicing did not occur when a word’s consonants were identical (“Identity Lang” below).

Examples from Arbitrary Lang / Examples from Identity Lang

  • Participants were trained on these mappings and then tested on novel mappings that involved the same crucial consonantal pairs.

94

Examples from Arbitrary Lang:
Alternating          Exception
[babu] → [bapu]      [badu] → [badu]
[dabu] → [dapu]      [dagu] → [daku]
[gabu] → [gapu]      [gabu] → [gabu]

Examples from Identity Lang:
Alternating          Exception
[badu] → [batu]      [babu] → [babu]
[dagu] → [daku]      [dadu] → [dadu]
[gadu] → [gatu]      [gagu] → [gagu]

slide-95
SLIDE 95

Gallagher (2013): Experiment 1 Design

  • This experiment was aimed at discovering whether identity-based patterns were easier to learn

than more arbitrary ones (i.e. Identity Bias).

  • To test this, Gallagher trained participants on one of two devoicing patterns: a pattern with

unsystematic exceptions to devoicing (“Arbitrary Lang” below) and a pattern in which devoicing did not occur when a word’s consonants were identical (“Identity Lang” below).

Examples from Arbitrary Lang / Examples from Identity Lang

  • Participants were trained on these mappings and then tested on novel mappings that involved

the same crucial consonantal pairs.

95

Examples from Arbitrary Lang:
Alternating          Exception
[babu] → [bapu]      [badu] → [badu]
[dabu] → [dapu]      [dagu] → [daku]
[gabu] → [gapu]      [gabu] → [gabu]

Examples from Identity Lang:
Alternating          Exception
[badu] → [batu]      [babu] → [babu]
[dagu] → [daku]      [dadu] → [dadu]
[gadu] → [gatu]      [gagu] → [gagu]
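To make the contrast between the two training languages concrete, here is a small sketch of the Identity Lang mapping: devoice the word's second stop unless its two consonants are identical. The segment inventory and function name are illustrative; the actual stimuli are those reported in Gallagher (2013), and the Arbitrary Lang instead stipulates an unsystematic list of exceptional items rather than a rule like this one.

```python
DEVOICE = {"b": "p", "d": "t", "g": "k"}

def identity_lang(word):
    """Map a CVCV input to its Identity Lang output: devoice the second
    consonant unless the word's two consonants are identical (the exception)."""
    c1, v1, c2, v2 = word
    if c1 == c2:
        return word  # identical consonants: no devoicing
    return c1 + v1 + DEVOICE.get(c2, c2) + v2

print(identity_lang("badu"))  # batu (alternating)
print(identity_lang("babu"))  # babu (exception)
print(identity_lang("gagu"))  # gagu (the identical pair withheld in Experiment 2)
```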

slide-96
SLIDE 96

Gallagher (2013): Experiment 1 Results

  • The results for this experiment

demonstrated that participants learned the Identity Language better than the Arbitrary Language. This suggests that an Identity Bias was affecting their learning. Gallagher (2013) showed that the Hayes and Wilson (2008) learner could not model this bias unless variables were added.

This is because variables cause the identity-based pattern to be structurally simpler than the arbitrary language. Without variables, the two patterns require the same number of constraints to represent.

96

[Figure: results for the Identity and Arbitrary languages, adapted from Gallagher (2013: Figure 3)]

slide-97
SLIDE 97

Gallagher (2013): Experiment 1 Results

  • The results for this experiment

demonstrated that participants learned the Identity Language better than the Arbitrary Language.

  • This suggests that an Identity Bias was

affecting their learning. Gallagher (2013) showed that the Hayes and Wilson (2008) learner could not model this bias unless variables were added.

This is because variables cause the identity-based pattern to be structurally simpler than the arbitrary language. Without variables, the two patterns require the same number of constraints to represent.

97

[Figure: results for the Identity and Arbitrary languages, adapted from Gallagher (2013: Figure 3)]

slide-98
SLIDE 98

Gallagher (2013): Experiment 1 Results

  • The results for this experiment

demonstrated that participants learned the Identity Language better than the Arbitrary Language.

  • This suggests that an Identity Bias was

affecting their learning.

  • Gallagher (2013) showed that the Hayes

and Wilson (2008) learner could not model this bias unless variables were added.

This is because variables cause the identity-based pattern to be structurally simpler than the arbitrary language. Without variables, the two patterns require the same number of constraints to represent.

98

[Figure: results for the Identity and Arbitrary languages, adapted from Gallagher (2013: Figure 3)]

slide-99
SLIDE 99

Gallagher (2013): Experiment 1 Results

  • The results for this experiment

demonstrated that participants learned the Identity Language better than the Arbitrary Language.

  • This suggests that an Identity Bias was

affecting their learning.

  • Gallagher (2013) showed that the Hayes

and Wilson (2008) learner could not model this bias unless variables were added.

  • This is because variables cause the

identity-based pattern to be structurally simpler than the arbitrary language. Without variables, the two patterns require the same number of constraints to represent.

99

[Figure: results for the Identity and Arbitrary languages, adapted from Gallagher (2013: Figure 3)]

slide-100
SLIDE 100

Gallagher (2013): Experiment 1 Results

  • The results for this experiment

demonstrated that participants learned the Identity Language better than the Arbitrary Language.

  • This suggests that an Identity Bias was

affecting their learning.

  • Gallagher (2013) showed that the Hayes

and Wilson (2008) learner could not model this bias unless variables were added.

  • This is because variables cause the

identity-based pattern to be structurally simpler than the arbitrary language.

  • Without variables, the two patterns require

the same number of constraints to represent.

100

[Figure: results for the Identity and Arbitrary languages, adapted from Gallagher (2013: Figure 3)]

slide-101
SLIDE 101

Gallagher (2013): Experiment 2 Design

  • This experiment tested whether participants generalized identity-based patterns to novel

segments (i.e. Identity Generalization). To test this, the Identity Language from Experiment 1 was altered so that a single pair of identical consonants was withheld from training. Participants were then tested on this pair, as well as another pair that was not identical. If participants learned the identity pattern in a generalizable way, the devoicing process should be applied to the non-identical novel pair but not the identical one.

101

slide-102
SLIDE 102

Gallagher (2013): Experiment 2 Design

  • This experiment tested whether participants generalized identity-based patterns to novel

segments (i.e. Identity Generalization).

  • To test this, the Identity Language from Experiment 1 was altered so that a single pair of

identical consonants was withheld from training. Participants were then tested on this pair, as well as another pair that was not identical. If participants learned the identity pattern in a generalizable way, the devoicing process should be applied to the non-identical novel pair but not the identical one.

102

Alternating          Exception            Withheld
[badu] → [batu]      [babu] → [babu]      [gagu] → ?
[dagu] → [daku]      [dadu] → [dadu]      [gadu] → ?

slide-103
SLIDE 103

Gallagher (2013): Experiment 2 Design

  • This experiment tested whether participants generalized identity-based patterns to novel

segments (i.e. Identity Generalization).

  • To test this, the Identity Language from Experiment 1 was altered so that a single pair of

identical consonants was withheld from training.

  • Participants were then tested on this pair, as well as another pair that was not identical.

If participants learned the identity pattern in a generalizable way, the devoicing process should be applied to the non-identical novel pair but not the identical one.

103

Alternating          Exception            Withheld
[badu] → [batu]      [babu] → [babu]      [gagu] → ?
[dagu] → [daku]      [dadu] → [dadu]      [gadu] → ?

slide-104
SLIDE 104

Gallagher (2013): Experiment 2 Design

  • This experiment tested whether participants generalized identity-based patterns to novel

segments (i.e. Identity Generalization).

  • To test this, the Identity Language from Experiment 1 was altered so that a single pair of

identical consonants was withheld from training.

  • Participants were then tested on this pair, as well as another pair that was not identical.
  • If participants learned the identity pattern in a generalizable way, the devoicing process should

be applied to the non-identical novel pair but not the identical one.

104

Alternating          Exception            Withheld
[badu] → [batu]      [babu] → [babu]      [gagu] → ?
[dagu] → [daku]      [dadu] → [dadu]      [gadu] → ?

slide-105
SLIDE 105

Gallagher (2013): Experiment 2 Results

  • The results showed that participants were more

likely to devoice non-identical withheld consonant pairs than their identical counterparts. This suggests that they were properly generalizing the identity-based pattern of exceptionality in the language. The Hayes and Wilson (2008) model cannot capture this kind of generalization because it doesn’t have any way of representing similarity across segments.

i.e. there’s no way for the model to represent that [gagu], [babu], and [dadu] all have something in common with variable-free constraints, so no extra probability will be given to the withheld word.

105

[Figure: adapted from Gallagher (2013: Figure 4)]

slide-106
SLIDE 106

Gallagher (2013): Experiment 2 Results

  • The results showed that participants were more

likely to devoice non-identical withheld consonant pairs than their identical counterparts.

  • This suggests that they were properly

generalizing the identity-based pattern of exceptionality in the language. The Hayes and Wilson (2008) model cannot capture this kind of generalization because it doesn’t have any way of representing similarity across segments.

i.e. there’s no way for the model to represent that [gagu], [babu], and [dadu] all have something in common with variable-free constraints, so no extra probability will be given to the withheld word.

106

[Figure: adapted from Gallagher (2013: Figure 4)]

slide-107
SLIDE 107

Gallagher (2013): Experiment 2 Results

  • The results showed that participants were more

likely to devoice non-identical withheld consonant pairs than their identical counterparts.

  • This suggests that they were properly

generalizing the identity-based pattern of exceptionality in the language.

  • The Hayes and Wilson (2008) model cannot

capture this kind of generalization because it doesn’t have any way of representing similarity across segments.

i.e. there’s no way for the model to represent that [gagu], [babu], and [dadu] all have something in common with variable-free constraints, so no extra probability will be given to the withheld word.

107

[Figure: adapted from Gallagher (2013: Figure 4)]

slide-108
SLIDE 108

Gallagher (2013): Experiment 2 Results

  • The results showed that participants were more

likely to devoice non-identical withheld consonant pairs than their identical counterparts.

  • This suggests that they were properly

generalizing the identity-based pattern of exceptionality in the language.

  • The Hayes and Wilson (2008) model cannot

capture this kind of generalization because it doesn’t have any way of representing similarity across segments.

  • i.e. there’s no way for the model to represent that

[gagu], [babu], and [dadu] all have something in common with variable-free constraints, so no extra probability will be given to the withheld word.

108

[Figure: adapted from Gallagher (2013: Figure 4)]

slide-109
SLIDE 109

Results (with PFA, only exceptions)

  • This effect is exaggerated even more when GMECCS is trained to just differentiate the

exceptions to the devoicing pattern from all other words.

109

slide-110
SLIDE 110

Results (with PFA, only exceptions)

  • Again, the effect is even more apparent when GMECCS is trained to just pick out the exceptions

to the devoicing pattern.

110

slide-111
SLIDE 111

Similarity-Based Generalization (Cristia et al. 2013)

[Figure: results panels (Humans, Without PFA, With PFA)]

111

slide-112
SLIDE 112

Intradimensional Bias (Moreton 2012)

[Figure: results panels (Without PFA, With PFA)]

112