SLIDE 1

Rumelhart, McClelland 1985

On Learning the Past Tenses of Verbs

SLIDE 2

Big Picture

How do we (humans) use and acquire knowledge of language?

Two competing ideas:

  • 1. Explicit, inaccessible rule view: rules of language are stored in explicit form
  • 2. Connectionist models: capture “rule-like” behavior with no explicit form of rules

SLIDE 3

History

  • First connectionist implementation by Rumelhart & McClelland in 1986
  • A number of criticisms:
  • Error rate on “unseen” verbs is high -> Do these models reach adult competence?
  • Pinker and Prince (1988) and Lachter and Bever (1988): extremely poor empirical performance
  • Improved results by MacWhinney & Leinbach in 1991, who replaced the Wickelfeature representation with UNIBET
  • Resurgence of neural networks today
  • Kirov and Cotterell (2018) show that Encoder-Decoder network architectures preclude many of P&P’s arguments

SLIDE 4

Three claims from the R&M connectionist model

  • 1. The model captures the U-learning three-stage pattern of acquisition.
  • 2. The model captures most aspects of differences in performance on different types of regular and irregular verbs.
  • 3. The model is capable of responding to regular and irregular verbs seen in training and to low-frequency “unseen” verbs.

SLIDE 5

R&M argument

1. The model demonstrates that it can acquire the past tense without rules. So, “[t]he child need not figure out what the rules are, nor even that there are rules. The child need not decide whether a verb is regular or irregular.”

2. If there are no explicit rules, why should children generate forms they have never heard? “They do so because the past tenses of similar verbs they are learning show such a consistent pattern that the generalization from these similar verbs outweighs the relatively small amount of learning that has occurred on the irregular verb in question.”

SLIDE 6

Discussion

SLIDE 7

U-learning three-stage pattern of past tense acquisition

[Figure: proportion correct vs. age across three stages. Stage 1: rote learning of regular verbs, initial error-free performance. Stage 2: rule extraction, few new irregular verbs, over-regularization errors. Stage 3: regular and irregular verbs, recovery from errors.]

SLIDE 8

Connectionist model

[Figure adapted from paper: phonological representation of root form -> (encoding) -> Wickelfeature representation of root form -> pattern associator -> Wickelfeature representation of past tense -> (decoding) -> phonological representation of past form]

Train: 10 trials on 10 high-frequency verbs, then 410 medium-frequency verbs added for 190 more trials. Test: 86 low-frequency verbs.

SLIDE 9

Connectionist model

Example: root “eat”, phonetic /eet/

Wickelphones of /eet/: *ee, eet, et* (with * marking word boundaries)

Wickelfeature of *ee (one feature vector per symbol):
*: [(000) (00) (000) (00) 1]   e: [(001) (01) (100) (01) 0]   e: [(001) (01) (100) (01) 0]
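To make the encoding step concrete, here is a minimal Python sketch (illustration only, not R&M's code) of how a phonetic string can be decomposed into Wickelphones, i.e. context-sensitive triples of a phoneme with its left and right neighbors, using * as the word-boundary symbol as on this slide. The further step of turning Wickelphones into binary Wickelfeature vectors is omitted.

```python
# Minimal sketch: decompose a phonetic string into Wickelphones,
# i.e. <left context, phoneme, right context> triples, padded with '*'
# at the word boundaries.
def wickelphones(phonemes: str) -> list[str]:
    padded = ['*'] + list(phonemes) + ['*']
    return [''.join(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

print(wickelphones('eet'))  # ['*ee', 'eet', 'et*']
```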

SLIDE 10

Connectionist model

Pattern associators allow:

  • 1. Exploitation of regularities that exist in mappings (e.g. dependent set of inputs -> patterns)
  • 2. Regular patterns and exceptions to those patterns to coexist
  • 3. For regularization, followed by the gradual tuning of connections to include exceptions
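As a rough illustration of what a pattern associator is, here is a toy sketch in Python/NumPy, assuming binary threshold units trained with the perceptron rule on random data; the actual model used hundreds of Wickelfeature units and a probabilistic activation rule, so this is only a sketch of the mechanism, not the paper's implementation.

```python
# Toy pattern associator sketch (assumed details): one layer of weights maps a
# binary input feature vector to a binary output feature vector; weights are
# nudged with the perceptron rule whenever an output unit is wrong.
import numpy as np

n_in, n_out = 16, 16          # toy sizes
W = np.zeros((n_out, n_in))   # connection strengths
b = np.zeros(n_out)           # unit thresholds (as biases)

def predict(x):
    return (W @ x + b > 0).astype(int)

def train(pairs, epochs=50, lr=1.0):
    for _ in range(epochs):
        for x, t in pairs:
            err = t - predict(x)            # +1 where a unit should fire, -1 where it should not
            W[:] += lr * np.outer(err, x)   # strengthen/weaken connections from active inputs
            b[:] += lr * err

rng = np.random.default_rng(0)
pairs = [(rng.integers(0, 2, n_in), rng.integers(0, 2, n_out)) for _ in range(5)]
train(pairs)
print(all((predict(x) == t).all() for x, t in pairs))  # small toy sets are usually learned exactly
```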

SLIDE 11

Discussion

SLIDE 12

1: Model captures U-learning three stage pattern

[Figures from paper: model performance in Stage 1, Stage 2, and Stage 3]

SLIDE 13

Discussion

SLIDE 14

2: Model captures differences in regular & irregular verbs

not t/d: drink, move, make -> used as no-change verbs

t/d: eat, build, pat -> predominantly regularized

Table from paper

SLIDE 15

2: Model captures differences in regular & irregular verbs

no change vowel change

Table from paper

SLIDE 16

2: Model captures differences in regular & irregular verbs

Tables from paper

SLIDE 17

2: Model captures differences in regular & irregular verbs

Graphs from paper

Examples: spend/spent; bite/bit; sing/sang; come/came; sleep/slept; catch/caught; see/saw

SLIDE 18

Discussion

SLIDE 19

3: Model responds to training and testing sets

  • The testing sample contains 86 “unseen” low-frequency verbs (14 irregular and 72 regular), which were not chosen at random.
  • Six verbs had no response alternatives: jump, pump, soak, warm, trail, and glare
  • 93% error rate for irregular verbs; 33% error rate for regular verbs
  • 43% error rate overall
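As a quick consistency check on the figures above: 0.93 × 14 ≈ 13 irregular errors plus 0.33 × 72 ≈ 24 regular errors gives roughly 37 errors over 86 verbs, i.e. about 43%, matching the overall rate.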
SLIDE 20

SLIDE 21

Discussion

SLIDE 22

Furrer, van Zee, Scales, Schärli (Google Research)

Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures

SLIDE 23

“How can we achieve compositional generalization in natural language?”

1. How to properly measure compositional generalization?
2. Approaches tried
3. Which work? Which don’t? Future directions?

SLIDE 24
  • 1. How to measure compositional generalization?

One way: The SCAN dataset
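To make the dataset concrete, a few SCAN-style (command, action sequence) pairs are sketched below; the action token names are simplified here relative to the released data.

```python
# Illustrative SCAN-style pairs (action tokens simplified; the released dataset
# writes tokens such as I_JUMP, I_TURN_RIGHT).
scan_examples = [
    ("jump", "JUMP"),
    ("jump twice", "JUMP JUMP"),
    ("jump after walk", "WALK JUMP"),  # "X after Y" executes Y first, then X
    ("walk around right",
     "TURN_RIGHT WALK TURN_RIGHT WALK TURN_RIGHT WALK TURN_RIGHT WALK"),
]
```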

SLIDE 25
  • 1. How to measure compositional generalization?
SLIDE 26

Traditional SCAN splits (split name: commands held out)

  • Add jump: any compound containing "jump"
  • Add turn left: any compound containing "turn left"
  • Jump around right: any compound containing "jump around right"
  • Around right: any compound containing "PRIMITIVE around right", e.g. walk around right
  • Opposite right: any compound containing "PRIMITIVE opposite right"
  • Right: any compound containing "PRIMITIVE right"
  • Length: any command whose target sequence length is greater than 22
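For instance, a primitive-holdout split such as Add jump can be constructed roughly as sketched below (hypothetical helper, assuming the dataset is a list of (command, actions) pairs): keep the bare primitive in training and move every compound command containing it to the test set.

```python
# Rough sketch of an "Add jump"-style split: train on "jump" in isolation,
# test on every compound command that contains "jump".
def add_jump_split(dataset):
    train, test = [], []
    for command, actions in dataset:
        if "jump" in command.split() and command != "jump":
            test.append((command, actions))
        else:
            train.append((command, actions))
    return train, test
```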

SLIDE 27

Distribution-Based Compositionality Assessment (DBCA) and Maximum Compound Divergence (MCD)

MCD: the split with maximum compound divergence and low atom divergence (≤ 0.02)
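For reference, here is a sketch of how a DBCA divergence can be computed, assuming the Chernoff-coefficient-based definition from the CFQ paper (Keysers et al., 2020); the α values in the comment are my recollection and the toy distributions are made up.

```python
# DBCA-style divergence between train and test distributions over atoms or
# compounds: D_alpha(P || Q) = 1 - sum_x P(x)^alpha * Q(x)^(1 - alpha).
# The CFQ paper reportedly uses alpha = 0.5 for atoms and alpha = 0.1 for compounds.
def divergence(p, q, alpha):
    keys = set(p) | set(q)
    return 1.0 - sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha) for k in keys)

train_compounds = {"jump twice": 0.6, "walk twice": 0.4}   # toy frequencies
test_compounds  = {"jump twice": 0.1, "walk twice": 0.9}
print(round(divergence(train_compounds, test_compounds, alpha=0.1), 3))
```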

SLIDE 28

Distribution-Based Compositionality Assessment (DBCA) and Maximum Compound Divergence (MCD)

[Figure: frequency of atoms (left) and compounds (right) in the train and test sets of the MCD split for CFQ data]

SLIDE 29

The CFQ Dataset

  • Given a natural language question, generate a SPARQL query which, when executed, returns the correct answer
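To illustrate the task format (simplified: the relation identifiers below are stand-ins rather than exact Freebase names, while entities are replaced by placeholders like M0/M1 as in the dataset):

```python
# Rough illustration of a CFQ-style (question, SPARQL) pair.
cfq_example = (
    "Did M0 direct and produce M1?",
    "SELECT count(*) WHERE { M0 ns:film.director.film M1 . M0 ns:film.producer.film M1 }",
)
```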

SLIDE 30
  • 2. Architectures and Techniques
  • SCAN-inspired → Syn-att, CGPS, Equivariant, CNN, GECA
  • Meta-learning → Meta seq2seq, Synth
  • Symbolic → LANE
  • MLM + Pretraining → T5 transformer family
  • Other → NSEN

SLIDE 31

Results

SLIDE 32
  • Length split accuracy decreases as model size increases! (19.4, 10.9, 14.4, 5.2, 3.3, 2.0)
  • SCAN MCD split accuracy shows no clear relation to model size (0.9, 6.0, 15.4, 10.1, 11.6, 9.1)
  • CFQ accuracy increases with size: 21.5, 28.0, 31.2, 34.8, 40.2, 40.9
  • Intermediate representation gives a +1.2% accuracy boost
  • Hypothesized benefit of pretraining: “improve model’s ability to substitute similar words by ensuring they are close to each other in representation space”
  • Achieves near-perfect performance on the Add jump split, lesser gains on others.

Pretraining success?

SLIDE 33

Discussion

SLIDE 34

Symbolic approach: LANE

  • Two modules, Composer and Solver, plus memory. Trained with curriculum and hierarchical RL.
  • 100% accuracy on SCAN MCD split.
SLIDE 35

Meta-learning: Meta seq2seq

  • Trains over permutations of the SCAN grammar by remapping primitives to different outputs, e.g. jump -> WALK.
  • Highly augmented training data; a fair comparison?
  • Builds invariance to primitive replacement in a similar manner to the Synth, Equivariant, and GECA approaches

SLIDE 36

Meta-learning: Synth

  • seq2seq model takes in i/o examples and generates a single program (interpretation grammar) which is symbolically evaluated to solve all examples.
  • Trained by sampling grammars from a meta-grammar, and learning to output the correct program given examples generated with the sampled grammar.

SLIDE 37

GECA

  • Simple, effective approach: detects templates repeated during training, generates new training examples by filling them with different fragments
  • Augmenting the training set this way helps build invariance to compositional shifts in distribution
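A minimal toy sketch of this idea (single-token fragments only, commands without their paired action sequences; the real GECA operates on multi-token fragments and full examples):

```python
from itertools import combinations

def geca_augment(sentences):
    """Toy GECA-style augmentation: tokens that fill the same slot in an
    otherwise identical sentence are treated as interchangeable, and each is
    substituted into the other's remaining contexts to form new sentences."""
    toks = [s.split() for s in sentences]
    interchangeable = set()
    for a, b in combinations(toks, 2):
        if len(a) == len(b):
            diffs = [i for i in range(len(a)) if a[i] != b[i]]
            if len(diffs) == 1:
                interchangeable.add(frozenset((a[diffs[0]], b[diffs[0]])))
    new = set()
    for pair in interchangeable:
        x, y = tuple(pair)
        for t in toks:
            for old, rep in ((x, y), (y, x)):
                if old in t:
                    new.add(" ".join(rep if tok == old else tok for tok in t))
    return new - set(sentences)

print(geca_augment(["walk twice", "jump twice", "walk and run"]))  # {'jump and run'}
```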
SLIDE 38

CGPS and Syn-att

  • Separates syntax (output action type) from semantics (output action order), each having a separate representation.
  • CGPS chosen as representative of SCAN-inspired approaches
  • Bad performance on SCAN MCD

“It appears rather that the CGPS mechanism, unlike pre-training, is not robust to shifts in compound distribution and even introduces negative effects in such circumstances.”

SLIDE 39

NSEN

  • Learns O(n log n) seq2seq algorithms with a shuffle-exchange architecture. Successor to Neural GPU.

SLIDE 40

Conclusions

1. Pretraining helps for compositional generalization, but does not solve it.
2. Specialized architectures often do not transfer to new compositional generalization benchmarks.
3. Improvements in seq2seq architectures lead to corresponding incremental improvements in compositional settings.
4. MCD likely measures compositional generalization more thoroughly than the traditional SCAN splits.

SLIDE 41

Discussion