SLIDE 1

A Joint Learning Model of Word Segmentation, Lexical Acquisition and Phonetic Variability

Micha Elsner, Sharon Goldwater, Naomi Feldman, Frank Wood

The Ohio State University, University of Edinburgh, University of Maryland, and Oxford University

October 18, 2013

SLIDE 3

Infant word learning

youwanttoseethebook lookthere’saboywithhishat andadoggie youwanttolookatthis lookatthis haveadrink takeitout youwantitin putthaton that yes okay openitup takethedoggieout ithinkitwillcomeout what daddy wherediditgo youwantthatone daddy i’llgogetyourblock what’sthatalice what’sthatablock that that’satelephone that’sthephone say hello youwanttospeaktoalice sayhello what’s youhavetotellme block youwanttheblocks

◮ The infant learner hears a stream of utterances...

◮ And is sensitive to repeated sequences

SLIDE 4

Models have been very successful...

Lexical models (goal: learn lexicon and LM)

◮ We follow (Goldwater, Griffiths, Johnson '09) (GGJ)
◮ Basic idea since (Brent '99)
◮ Many extensions since

Non-lexical models

◮ Word boundaries from phonotactics: (Fleck '08, Rytting '07, Daland+al '10)
◮ Word-like units from acoustics: (Park+al '08, Aimetti '09, Jansen+al '10)
SLIDE 5

But lexical models handle phonetics poorly

◮ "Intended form" /want/ ends up as [wan] or [wãʔ]
◮ Lowers overall performance of GGJ...
◮ And changes qualitative results
◮ Learns syllables or morphemes instead of words (Fleck '08)

Real infants learn collocations

Sequences learned as words (Peters, Tomasello):

◮ "youlike", "wantto"
◮ Production evidence: early words show up in fixed multi-word contexts
◮ Infants don't produce subwords
SLIDE 6

Our work (this paper)

A model that jointly:

◮ segments words
◮ clusters word tokens into lexical entries
◮ infers a model of phonetic variation
◮ ...on a broad-coverage corpus
SLIDE 7

Research context

Previous models integrate lexical/phonetic learning...

◮ (Feldman+al '09, '13): vowel learning (fixed lexicon)
◮ (Driesen+al '09, Räsänen '11): words and sounds (tiny datasets)
◮ (Börschinger+al '13): segmentation and phonetics (only t-deletion)
◮ (Neubig+al '10): LM from phone lattices (eval phone recognition only)
◮ (Elsner+al '12): two-stage pipeline
SLIDE 9

Last year... (Elsner+al '12)

Messy data: jəwãʔwʌn, wanəkʊki
→ GGJ segmentation: jə•wãʔ•wʌn, wanək•ʊki
→ Cluster word types: { /wan/: wãʔ, wanək, wan }
→ Normalized: ju•wan•wʌn, wan•ʊki

◮ Standard problem with pipelines: errors propagate
◮ Not a good cognitive model: doesn't capture interactions between levels
◮ Type-level inference doesn't scale to acoustics
SLIDE 10

In this paper...

Technical details:

◮ GGJ: Bayesian word segmentation
◮ Our noisy-channel model
◮ Joint inference without types: beam sampling

Cognitive modeling results:

◮ Words, collocations and morphemes: infants form collocations ...and have trouble with vowel-initial words
◮ Phonetic learning: infants learn consonants better ...and underestimate variation
◮ Missegmentations and misrecognitions: short, frequent words are hard
SLIDE 11

GGJ: a non-parametric bigram language model

[Plate diagram: a geometric (Geom) generator over characters proposes possible words (a, b, ..., ju, ..., want, ..., juwant, ...); a Dirichlet process with concentration α turns these into sparse word probabilities, e.g. p(ði) = .1, p(a) = .05, p(want) = .01; a second level with concentration α1 draws, for each of ∞ contexts, conditional probabilities for each word after each word, e.g. p(ði | want) = .3, p(a | want) = .1, p(want | want) = .0001; these generate the intended forms x1, x2, ... ("ju want ə kuki", "ju want ɪt", ...) over n utterances.]
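To make the two-level structure concrete, here is a minimal sketch (not the authors' code) of CRP-style predictive probabilities for such a bigram model. The geometric base distribution, alphabet size, and hyperparameter values are illustrative assumptions, and the hierarchy is approximated by backing off directly to the unigram predictive rather than tracking CRP table counts.

```python
# Hedged sketch of a DP-style bigram word model in the spirit of GGJ.
from collections import defaultdict

P_STOP = 0.5     # assumed geometric stopping probability
ALPHABET = 26    # assumed character inventory size

def p_base(word):
    """Base distribution over possible words: geometric length, uniform characters."""
    p = P_STOP
    for _ in word:
        p *= (1 - P_STOP) / ALPHABET
    return p

class DPBigram:
    def __init__(self, alpha=10.0, alpha1=1.0):    # illustrative concentrations
        self.alpha, self.alpha1 = alpha, alpha1
        self.uni = defaultdict(int)                         # word -> count
        self.n_uni = 0
        self.bi = defaultdict(lambda: defaultdict(int))     # prev -> word -> count
        self.n_bi = defaultdict(int)                        # prev -> total count

    def p_unigram(self, w):
        # CRP predictive: observed counts smoothed by the character-level generator
        return (self.uni[w] + self.alpha * p_base(w)) / (self.n_uni + self.alpha)

    def p_bigram(self, w, prev):
        # context-specific CRP backing off to the unigram model
        return ((self.bi[prev][w] + self.alpha1 * self.p_unigram(w))
                / (self.n_bi[prev] + self.alpha1))

    def observe(self, w, prev):
        self.bi[prev][w] += 1; self.n_bi[prev] += 1
        self.uni[w] += 1; self.n_uni += 1

m = DPBigram()
m.observe("want", prev="ju")
print(m.p_bigram("want", prev="ju"))   # higher than before it was observed
```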
SLIDE 12

Noisy channel component

[The same plate diagram, extended with a channel: each intended form xi is passed through a transducer T to produce a surface form si, e.g. "ju want ə kuki" → "jə wan ə kuki" and "ju want ɪt" → "ju wand ɪt".]
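A minimal sketch of the channel's generative direction, under the constraints stated on the next slide (insertions allowed, deletions not). The rewrite table, insertion probability, and inserted symbol are made-up placeholders, not learned values.

```python
# Sketch of the noisy channel: each intended character is independently
# rewritten; extra material may be inserted, but nothing is deleted.
import random

REWRITE = {                      # hypothetical rewrite distributions
    "u": [("u", 0.8), ("ə", 0.1), ("ɪ", 0.1)],
    "t": [("t", 0.7), ("d", 0.2), ("ʔ", 0.1)],
}
P_INSERT = 0.02                  # hypothetical insertion probability

def sample_surface(intended, rng=random.Random(0)):
    out = []
    for c in intended:
        if rng.random() < P_INSERT:
            out.append("h")      # toy epenthesis, as in the slide's (∅ → h)
        dist = REWRITE.get(c, [(c, 1.0)])   # unlisted characters copy through
        r, acc = rng.random(), 0.0
        for sym, p in dist:
            acc += p
            if r <= acc:
                out.append(sym)
                break
    return "".join(out)

print(sample_surface("juwantɪt"))   # prints one corrupted variant
```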
SLIDE 13

The transducer

◮ Independently rewrites each character (a → u)
◮ Log-linear features based on articulation (Hayes+Wilson, Dreyer+Eisner); see the toy sketch below

Constrained by efficiency issues:

◮ Can insert (∅ → h) but not delete (h → ∅)
◮ Similar to (Neubig, Elsner, Börschinger) but simpler

Learning phonetics:

◮ Initialize with a simple model (a → a)
◮ Learn via EM
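A toy illustration of the log-linear parameterization, p(s | c) ∝ exp(w · f(c, s)). The feature names, symbol inventory, and weights here are invented for the example; the real model's articulatory features follow Hayes+Wilson and Dreyer+Eisner.

```python
import math

VOWELS = set("aeiouəɪʊʌɛ")          # assumed symbol classes

def features(c, s):
    """Toy articulatory features for rewriting character c as s."""
    return {
        "identity": float(c == s),
        "vowel_to_vowel": float(c in VOWELS and s in VOWELS),
        "class_change": float((c in VOWELS) != (s in VOWELS)),
    }

def rewrite_dist(c, alphabet, w):
    """p(s | c) proportional to exp(w . features(c, s))."""
    scores = {s: math.exp(sum(w[k] * v for k, v in features(c, s).items()))
              for s in alphabet}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

# A high identity weight mimics the simple (a -> a) initializer.
w = {"identity": 3.0, "vowel_to_vowel": 1.0, "class_change": -2.0}
print(rewrite_dist("u", "aeiouəptk", w))   # identity output gets most of the mass
```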
SLIDE 14

Inference

Intended forms vary from surface forms: large search space!

◮ Character-by-character Gibbs sampling is likely to get stuck

Forward-backward style sampling method (sketched below):

◮ Following previous work
◮ Semi-Markov formulation of GGJ (Mochihashi+al '09)
◮ Composition with the transducer yields a large FSM (Neubig '10)
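To show the flavor of the sampler, here is a hedged unigram-only sketch of forward-filtering, backward-sampling over word boundaries. The paper's version is bigram and has the transducer composed in, so its lattice states are richer than plain substrings; p_word, max_len, and the example probabilities below are placeholders.

```python
import random

def sample_segmentation(s, p_word, max_len=10, rng=random.Random(0)):
    """Forward-filtering, backward-sampling over word boundaries.

    alpha[i] = total probability of generating s[:i]; walking backward,
    each word ending at i is sampled with weight alpha[j] * p_word(s[j:i]).
    """
    n = len(s)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            alpha[i] += alpha[j] * p_word(s[j:i])
    words, i = [], n
    while i > 0:
        starts = list(range(max(0, i - max_len), i))
        weights = [alpha[j] * p_word(s[j:i]) for j in starts]
        j = rng.choices(starts, weights=weights, k=1)[0]
        words.append(s[j:i])
        i = j
    return words[::-1]

# Toy word model: a tiny lexicon, with unknown strings penalized by length.
lexicon = {"ju": 0.05, "want": 0.05, "ɪt": 0.03}
p_word = lambda w: lexicon.get(w, 0.001 * 0.5 ** len(w))
print(sample_segmentation("juwantɪt", p_word))   # usually ['ju', 'want', 'ɪt']
```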
SLIDE 15

Finite-state encoding

[Lattice diagram: the surface string is unrolled as an FSM; word arcs (word jə, word j, word u, word ju) carry language-model weights p(jə|[s]), p(j|[s]), p(u|j), p(ju|[s]), and character arcs such as j/j, d/j, ə/u, u/u encode the transducer's surface/intended rewrites.]
SLIDE 16-18

Sampling from huge transducers (beam sampling)

[Lattice diagram, built up over three slides: the same FSM with competing transducer arcs (j/j, j/d, j/k, ə/u, u/u); beam sampling draws a slice variable uniformly from [0, p(current arc)] at each position (e.g. ~[0, p(j/j)], ~[0, p(u/u)]) and prunes every arc whose probability falls below it. A sketch of the pruning step follows.]

(van Gael+al '08), (Huggins+Wood '13)
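A minimal sketch of the slice ("beam") trick on a single lattice position: the auxiliary threshold is drawn uniformly below the probability of the currently sampled arc, and only arcs above the threshold survive resampling. The arc labels and probabilities are invented for illustration.

```python
import random

def slice_prune(arcs, p_current, rng=random.Random(0)):
    """arcs: list of (label, prob). Keep only arcs whose probability
    exceeds a slice variable u ~ Uniform(0, p_current), so each sweep
    explores a small random subset of the huge composed FSM."""
    u = rng.uniform(0.0, p_current)
    return [(label, p) for label, p in arcs if p > u]

arcs = [("j/j", 0.70), ("j/d", 0.05), ("j/k", 0.01)]
print(slice_prune(arcs, p_current=0.70))   # low-probability arcs usually drop out
```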
SLIDE 19

Overview

(The outline from slide 10 is repeated here; the talk now turns to the cognitive modeling results.)
SLIDE 20

Synthetic dataset from (Elsner+al '12)

Simulate child-directed speech in close phonetic transcription:

◮ Use the Bernstein-Ratner corpus (child-directed) (Bernstein-Ratner '87)
◮ And Buckeye (closely transcribed) (Pitt+al '07)
◮ Sample a pronunciation for each BR word from Buckeye (see the sketch below)
◮ No coarticulation between words

"about": ahbawt:15, bawt:9, ihbawt:4, ahbawd:4, ihbawd:4, ahbaat:2, baw:1, ahbaht:1, erbawd:1, bawd:1, ahbaad:1, ahpaat:1, bah:1, baht:1
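The resampling step can be sketched directly from the slide's "about" table; the corpus-reading code is omitted and the table below is just the slide's own example.

```python
import random

# Buckeye pronunciation counts for one Bernstein-Ratner word (from the slide).
PRONS = {
    "about": {"ahbawt": 15, "bawt": 9, "ihbawt": 4, "ahbawd": 4,
              "ihbawd": 4, "ahbaat": 2, "baw": 1, "ahbaht": 1,
              "erbawd": 1, "bawd": 1, "ahbaad": 1, "ahpaat": 1,
              "bah": 1, "baht": 1},
}

def sample_pron(word, rng=random.Random(0)):
    """Draw a surface pronunciation proportional to its Buckeye count."""
    forms, counts = zip(*PRONS[word].items())
    return rng.choices(forms, weights=counts, k=1)[0]

# Each token gets an independent draw; no cross-word coarticulation.
print([sample_pron("about") for _ in range(3)])
```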
SLIDE 21

Word segmentation results

                  Prec  Rec   F-score
GGJ, clean data   90.1  80.3  84.9
GGJ segmentation  70.4  93.5  80.3

◮ GGJ on clean data has high precision, low recall...
◮ On variable data, the tradeoff flips (as in (Fleck '08))

(Token-level scoring is sketched below.)
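For reference, word-token precision, recall, and F can be computed by comparing character-span sets. This is a sketch assuming the standard token-level scoring used in this literature; the talk does not show its evaluation script.

```python
def token_spans(words):
    """Character spans (start, end) of each word token in a segmentation."""
    spans, i = set(), 0
    for w in words:
        spans.add((i, i + len(w)))
        i += len(w)
    return spans

def token_prf(pred_words, gold_words):
    pred, gold = token_spans(pred_words), token_spans(gold_words)
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# An oversegmented hypothesis gets only "ju" right:
print(token_prf(["ju", "wan", "tɪt"], ["ju", "wantɪt"]))  # ~(0.33, 0.5, 0.4)
```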
SLIDE 22

Word segmentation results

                         Prec  Rec   F-score
GGJ, clean data          90.1  80.3  84.9
GGJ segmentation         70.4  93.5  80.3
GGJ, our beam inference  73.9  91.0  81.6

◮ Our inference scheme works
◮ Confidence intervals overlap
SLIDE 23

Word segmentation results

                         Prec  Rec   F-score
GGJ, clean data          90.1  80.3  84.9
GGJ segmentation         70.4  93.5  80.3
GGJ, our beam inference  73.9  91.0  81.6
EM transducer            80.1  83.0  81.5

◮ Segmentation with the transducer trades recall for precision
◮ Moving closer to the original qualitative results
SLIDE 24

A closer look

Where do gold-standard word tokens end up?

◮ Correct boundaries and lexical item
◮ Correct boundaries, wrong lexical item: ju analyzed as jɛs
◮ Collocation: boundaries are real but too wide: real ju•want as juwant
◮ Split: dɔgiz as dɔ•giz
◮ One boundary: ju•wa...
◮ Just plain wrong

(A rough operationalization of these categories is sketched below.)
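One plausible operationalization of these categories (my reading of the slide, not the authors' exact bookkeeping): classify each gold token by which of its edges appear among the predicted boundaries and whether the segmenter cut inside it. Lexical-identity checks are omitted.

```python
def classify(start, end, pred_bounds):
    """Rough category for a gold token (start, end) given the set of
    predicted boundary positions."""
    inside = any(start < b < end for b in pred_bounds)
    s_ok, e_ok = start in pred_bounds, end in pred_bounds
    if s_ok and e_ok:
        return "split" if inside else "correct boundaries"
    if inside:
        return "one boundary" if (s_ok or e_ok) else "just plain wrong"
    return "collocation"          # swallowed whole by a wider token

# Gold "ju" = (0, 2) inside predicted token "juwant" with bounds {0, 6}:
print(classify(0, 2, {0, 6}))     # "collocation"
```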
SLIDE 25

Analysis

             EM-learned  GGJ
Correct      49.88       47.61
Wrong form   17.96       23.73
Collocation  15.60        7.59
Split         8.69       15.84
One bound     7.11       15.18
Wrong         0.75        0.22

◮ "Wrong form" errors could be repaired in a pipeline
◮ ...but collocation vs. split errors cannot be
SLIDE 26

Vowel-initial words

◮ Infants are slow to segment vowel-initial words (Mattys+Jusczyk, Nazzi+al, Seidl+Johnson)
◮ Initial vowels are often variable or resyllabified (Seidl+Johnson)

EM transducer  Vow. init  Cons. init
Correct        41.5       52.1
Wrong form     20.4       17.3
Collocation    19.2       12.5

◮ The transducer system has trouble with vowels...
◮ More likely to find a collocation, less likely to get the left boundary correct
SLIDE 27

Phonetic learning

◮ Infants learn consonant categories more slowly than vowels
◮ Non-native vowel contrasts are lost by 8 months (Kuhl, Bosch+Sebastián-Gallés)
◮ Consonant contrasts by 10-12 months (Werker+Tees)
◮ Generalization across talkers/dialects comes slowly (Houston+Jusczyk, Singh)

What about the model?
SLIDE 28

Learned phonetic variability

                   x   top 4 outputs s
Actual (oracle)    u   u .68  ə .05  a .04  ʊ .04
variability        i   i .85  ɪ .03  ə .03  ɛ .02
                   ð   ð .69  s .07  [φ] .07  z .04
                   k   k .93  d .02  g .02
EM (full)          u   u .75  ə .08  ɪ .04  ʊ .03
                   i   i .90  ɪ .04  ɛ .02
                   ð   ð .91  s .03  z .01
                   k   k .98

◮ u and ð are about equally variable
◮ But the model learns the variants of u better
◮ In general, the model underestimates true variability
SLIDE 29

Misrecognitions

◮ Little is known about infant misrecognitions
◮ Adults misrecognize things... (Butterfield+Cutler)
◮ Incorrect hypotheses contain frequent words (Connine+al)
◮ The indefinite article is hard (Kim+al, Dilley+Pitt)
SLIDE 30

Model misrecognitions

Correctly segmented:

◮ two/to
◮ can/can't
◮ and/an
◮ his/is

Incorrectly segmented (errors like this are only possible in a joint model!):

◮ it/it's/is
◮ a/is
◮ who/who's/whose
◮ that's/what's
◮ there/there's
SLIDE 31

Conclusion

A cognitive model that allows joint inference of the lexicon and phonetics:

◮ Replicates several experimental results
◮ Runs on a broad-coverage, naturalistic corpus

Future research:

◮ Token-based sampling can extend to acoustics
◮ Cross-linguistic evaluation?

Software available!

◮ ACL archive
◮ bitbucket.org/melsner/beamseg