slide-1
SLIDE 1

Testing the robustness of online word segmentation: Effects of linguistic diversity and phonetic variation

Luc Boruta (1,2), Sharon Peperkamp (2), Benoît Crabbé (1) & Emmanuel Dupoux (2)
luc.boruta@inria.fr

(1) ALPAGE, Univ. Paris 7 & INRIA
(2) LSCP–DEC, EHESS, ENS & CNRS

CMCL — June 23, 2011

slide-2
SLIDE 2

Yet another study on word segmentation...

What this work is not about

  • New models of word segmentation.

What this work is about

  • The acquisition of word segmentation;
  • The acquisition of phonological knowledge;
  • Interactions between the two.

Boruta et al. | Testing the robustness of online word segmentation 1 / 17

slide-5
SLIDE 5

Word segmentation vs. allophonic rules

French devoicing allophonic rule

/r/ →
  • [χ] before a voiceless consonant
  • [ʁ] otherwise

Consequence

/kanar/ →
  • [kanaχ flotɑ̃], canard flottant
  • [kanaʁ ʒon], canard jaune
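
The devoicing rule on this slide can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the authors' implementation: the function name `apply_r_allophony` and the set `VOICELESS` are introduced here, the voiceless-consonant inventory is deliberately partial, and word boundaries (spaces) are skipped when looking at the following segment.

```python
# Hypothetical, partial inventory of voiceless consonants (assumption).
VOICELESS = set("ptkfsʃ")

def apply_r_allophony(phonemes: str) -> str:
    """Rewrite /r/ as [χ] before a voiceless consonant and as [ʁ] otherwise.

    Spaces mark word boundaries and are skipped when looking for the
    following segment, so the rule applies across words.
    """
    out = []
    for i, seg in enumerate(phonemes):
        if seg != "r":
            out.append(seg)
            continue
        # Next segment that is not a word boundary (None at the end).
        following = next((s for s in phonemes[i + 1:] if s != " "), None)
        out.append("χ" if following in VOICELESS else "ʁ")
    return "".join(out)

print(apply_r_allophony("kanar flotɑ̃"))  # canard flottant
print(apply_r_allophony("kanar ʒon"))     # canard jaune
```

The same phonemic word /kanar/ thus surfaces in two phonetic forms depending on what follows it, which is the difficulty a segmentation model faces when it is given phonetic rather than phonemic input.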

slide-7
SLIDE 7

Word segmentation

The task

  • Input: /əwʊdʧʌkwʊdʧʌkwʊd/
  • Output: /ə wʊdʧʌk wʊd ʧʌk wʊd/

Phonemic transcripts = idealized input

  • Models are typically evaluated using phonemic transcripts;
  • Assumption: kids know how to undo allophony/coarticulation.

slide-9
SLIDE 9

Related work

Rytting, Brew & Fosler-Lussier (2010)

  • Input unit: probability vector over a finite set of symbols;
  • Symbols: limited to the phonemic inventory.

Daland & Pierrehumbert (2010)

  • Input: phonemic transcripts, conversational reduction processes;
  • Reduction processes: implemented by hand;
  • Transcripts: adult-directed speech.

slide-11
SLIDE 11

Which segmentation models?

Desirable properties

[Brent, 1999; Gambell & Yang, 2004]

  • Start without any knowledge specific to a particular language;
  • Learn in an unsupervised manner and operate incrementally.

Which segmentation models?

  • MBDP-1: Brent, 1999;
  • NGS-u: Venkataraman, 2001;
  • Two random baselines.

slide-13
SLIDE 13

Evaluation

Now-standard evaluation protocol

[Brent, 1999; Goldwater et al., 2009]

  • Gold standard: orthographic segmentation;
  • Precision, recall and F-score on the word segmentation;
  • Precision, recall and F-score on the induced lexicon.

                          Lexicon   Segmentation
ə wʊdʧʌk wʊd ʧʌk wʊd      ✓         ✓
ə wʊd ʧʌk wʊdʧʌk wʊd      ✓         ✗
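
As a rough sketch of how the two scores differ, the code below computes precision, recall and F-score on word tokens and on the induced lexicon for a single utterance. The helper names (`word_spans`, `prf`, `evaluate`) are introduced here for illustration; the sketch assumes that each character is one segment and that a word token counts as correct only when both of its boundaries match the gold standard. It is not the scoring script used in the cited work.

```python
def word_spans(segmented: str) -> set:
    """Return the (start, end) span of every word token, counted in segments."""
    spans, start = set(), 0
    for word in segmented.split():
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def prf(predicted: set, gold: set) -> tuple:
    """Precision, recall and F-score of a predicted set against a gold set."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(predicted: str, gold: str) -> dict:
    """Score one segmented utterance on word tokens and on word types."""
    return {
        "segmentation": prf(word_spans(predicted), word_spans(gold)),
        "lexicon": prf(set(predicted.split()), set(gold.split())),
    }

gold = "ə wʊdʧʌk wʊd ʧʌk wʊd"
pred = "ə wʊd ʧʌk wʊdʧʌk wʊd"
scores = evaluate(pred, gold)
print(scores)
```

With these two strings, the predicted lexicon contains exactly the gold word types, so the lexicon scores are perfect, while only two of the five predicted word tokens sit at the right positions. For a whole corpus, the spans would also need an utterance index so that tokens from different utterances are not confused.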

slide-14
SLIDE 14

Experimental setup

CHILDES corpora of child-directed speech

[MacWhinney, 2000]

  • Derived from transcribed adult-child verbal interactions;
  • Phonemic transcriptions, orthographic segmentation.

                    English   French   Japanese
Utterance tokens    10k       10k      10k
Word tokens         33k       51k      27k
Phoneme tokens      96k       121k     103k
Phoneme types       50        35       49

slide-16
SLIDE 16

Cross-linguistic evaluation on phonemic corpora

[Figure: two bar charts, segmentation F-score and lexicon F-score (axis ticks 10 to 90), for Japanese (JP), French (FR) and English (EN); models: MBDP-1, NGS-u, Random+, Random]

  • Blame it on the data?
  • Rich morphology (e.g. French clitics)? Hapax rate?
  • Relative importance of different cues?

slide-17
SLIDE 17

Effects of phonetic variation

Phonemic transcripts = idealized input

  • Models are typically evaluated using phonemic transcripts;
  • Assumption: kids know how to undo allophony/coarticulation.

Corpora and allophonic rules

  • No phonetic transcripts of child-directed speech are available;
  • How many allophones do infants have to learn?
  • Where is the limit between allophony and mere coarticulation?

slide-19
SLIDE 19

Experimental setup

Emulating rich phonetic transcriptions

[Boruta, 2011a]

  • Apply artificial allophonic rules to phonemic corpora;
  • Benchmark models using different allophonic complexities;
  • Control the size of the allophonic grammar.

Simplifying assumptions

[LeCalvez, 2007; Boruta, 2011a]

  • We only model monolateral rules: p → a / _ c (phoneme p is realized as allophone a before context c);
  • No two rules introduce the same phone: [ɾ] from /t/ and [ɾ] from /d/ are kept distinct.
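
One possible way to emulate such a grammar is sketched below. It is an illustration under stated assumptions rather than the procedure of [Boruta, 2011a]: the names `apply_rules`, `check_distinct` and `RULES` are hypothetical, the context is taken to be the following phoneme, and the two flapping rules are invented examples. Each allophone label carries its source phoneme, so two rules never introduce the same phone.

```python
def apply_rules(phonemes, rules):
    """Apply monolateral rules p -> a / _ c to a sequence of phonemes.

    `rules` maps (phoneme, following phoneme) to an allophone label; a
    phoneme with no matching rule is left unchanged. The last phoneme has
    no following context (None).
    """
    following = list(phonemes[1:]) + [None]
    return [rules.get((p, c), p) for p, c in zip(phonemes, following)]

def check_distinct(rules):
    """Return True if no allophone is produced from two different phonemes."""
    source = {}
    for (p, _context), a in rules.items():
        if source.setdefault(a, p) != p:
            return False
    return True

# Hypothetical grammar: allophones are tagged with their source phoneme.
RULES = {
    ("t", "ə"): "ɾ/t/",
    ("d", "ə"): "ɾ/d/",
}

print(apply_rules(list("bɛtə"), RULES))
print(apply_rules(list("bɛdə"), RULES))
print(check_distinct(RULES))
```

Adding or removing entries in the rule table is one simple way to control the size of the allophonic grammar when benchmarking a model at different allophonic complexities.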

slide-21
SLIDE 21

Lexical complexity ∝ allophonic complexity

[Figure: lexical complexity (axis ticks 1 to 3) plotted against allophonic complexity (axis ticks 1 to 20) for English, French and Japanese]

slide-22
SLIDE 22

Results: English

[Figure: two panels, segmentation F-score and lexicon F-score (y-axis ticks 10 to 70; x-axis ticks 5 to 25); models: MBDP-1, NGS-u, Random, Random+]

slide-23
SLIDE 23

Results: French

[Figure: two panels, segmentation F-score and lexicon F-score (y-axis ticks 10 to 70; x-axis ticks 5 to 25); models: MBDP-1, NGS-u, Random, Random+]

slide-24
SLIDE 24

Results: Japanese

[Figure: two panels, segmentation F-score and lexicon F-score (y-axis ticks 10 to 70; x-axis ticks 2 to 12); models: MBDP-1, NGS-u, Random, Random+]

slide-25
SLIDE 25

Effects of phonetic variation

[Figure: three lexicon F-score panels (y-axis ticks 10 to 70; x-axis ticks 5 to 25, 5 to 25 and 2 to 12); models: MBDP-1, NGS-u, Random, Random+]

Unsurprising results

  • No mechanism for ‘explaining away’ allophonic variation;
  • Any word form found by the models will be added to the lexicon.

slide-26
SLIDE 26

Conclusion

Take-home message

  • Cross-linguistic evaluation is indispensable;
  • Phonetic inputs impact the performance of word segmentation models;
  • Phonological knowledge < word segmentation?

Where to go from here?

  • Incorporate some mechanism to handle noise and/or variation;
  • Use the imperfect lexical knowledge to help learn a phonology.

Combining indicators of allophony, ACL’11 Student Session.

slide-28
SLIDE 28

/θæŋk juː/