Testing the robustness of online word segmentation: Effects of linguistic diversity and phonetic variation

  1. Testing the robustness of online word segmentation: Effects of linguistic diversity and phonetic variation
     Luc Boruta 1,2, Sharon Peperkamp 2, Benoît Crabbé 1 & Emmanuel Dupoux 2
     luc.boruta@inria.fr
     1 ALPAGE, Univ. Paris 7 & INRIA
     2 LSCP–DEC, EHESS, ENS & CNRS
     CMCL, June 23, 2011

  2. Yet another study on word segmentation...
     What this work is not about
     • New models of word segmentation.
     What this work is about
     • The acquisition of word segmentation;
     • The acquisition of phonological knowledge;
     • Interactions between the two.
     Boruta et al. | Testing the robustness of online word segmentation 1 / 17

  5. Word segmentation vs. allophonic rules
     French devoicing allophonic rule
     /r/ → [χ] before a voiceless consonant
     /r/ → [ʁ] otherwise
     Consequence
     /kanar/ → [kanaχ flotɑ̃], canard flottant
     /kanar/ → [kanaʁ ʒon], canard jaune
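
The devoicing rule on this slide can be illustrated with a minimal Python sketch. The function name `realize_r` and the small `VOICELESS` set are assumptions made for the example, not part of the presentation, and the set is far from a complete inventory of French voiceless consonants. The rule is applied to an unsegmented phoneme string, so the context can sit in the next word.

```python
# Hypothetical, deliberately tiny set of voiceless consonants (IPA symbols).
VOICELESS = set("ptkfsʃ")


def realize_r(phonemes):
    """Rewrite each /r/ as [χ] before a voiceless consonant, [ʁ] otherwise."""
    out = []
    for i, seg in enumerate(phonemes):
        if seg == "r":
            # Look at the following segment, if there is one.
            nxt = phonemes[i + 1] if i + 1 < len(phonemes) else ""
            out.append("χ" if nxt in VOICELESS else "ʁ")
        else:
            out.append(seg)
    return "".join(out)


print(realize_r("kanarflotɑ̃"))  # kanaχflotɑ̃
print(realize_r("kanarʒon"))     # kanaʁʒon
```

The same underlying word /kanar/ surfaces with two different final phones, so a learner working on surface forms sees two candidate word forms where the phonemic transcript shows one.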

  7. Word segmentation
     The task
     • Input: /əwʊdʧʌkwʊdʧʌkwʊd/
     • Output: /ə wʊdʧʌk wʊd ʧʌk wʊd/
     Phonemic transcripts = idealized input
     • Models are typically evaluated using phonemic transcripts;
     • Assumption: kids know how to undo allophony/coarticulation.

  9. Related work
     Rytting, Brew & Fosler-Lussier (2010)
     • Input unit: probability vector over a finite set of symbols;
     • Symbols: limited to the phonemic inventory.
     Daland & Pierrehumbert (2010)
     • Input: phonemic transcripts, conversational reduction processes;
     • Reduction processes: implemented by hand;
     • Transcripts: adult-directed speech.

  11. Which segmentation models?
      Desirable properties [Brent, 1999; Gambell & Yang, 2004]
      • Start without any knowledge specific to a particular language;
      • Learn in an unsupervised manner and operate incrementally.
      Which segmentation models?
      • MBDP-1: Brent, 1999;
      • NGS-u: Venkataraman, 2001;
      • Two random baselines.
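
The slides do not say how the two random baselines are defined. As a hedged illustration only, the sketch below assumes a baseline that inserts a word boundary between adjacent phonemes with a fixed probability. The name `random_segmentation` and the parameter `p_boundary` are hypothetical; a stronger variant could, for instance, set `p_boundary` to the boundary rate observed in the gold standard, which is again an assumption.

```python
import random


def random_segmentation(phonemes, p_boundary, rng):
    """Split a phoneme sequence by flipping a biased coin at every position.

    phonemes:   a string (or list) of phoneme symbols for one utterance
    p_boundary: probability of placing a boundary between two phonemes
    rng:        a random.Random instance, passed in for reproducibility
    """
    if not phonemes:
        return []
    words = [phonemes[0]]
    for seg in phonemes[1:]:
        if rng.random() < p_boundary:
            words.append(seg)   # start a new word
        else:
            words[-1] += seg    # extend the current word
    return words


rng = random.Random(0)
print(random_segmentation("əwʊdʧʌkwʊdʧʌkwʊd", 0.5, rng))
```

Such a baseline uses no linguistic knowledge at all, which makes it a useful floor when reading the F-scores of the actual models.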

  13. Evaluation
      Now-standard evaluation protocol [Brent, 1999; Goldwater et al., 2009]
      • Gold standard: orthographic segmentation;
      • Precision, recall and F-score on the word segmentation;
      • Precision, recall and F-score on the induced lexicon.

                                   Lexicon   Segmentation
      ə wʊdʧʌk wʊd ʧʌk wʊd         ✓         ✓
      ə wʊd ʧʌk wʊdʧʌk wʊd         ✓         ✗
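
The two scores on this slide can be sketched in Python for a single utterance. The sketch assumes the usual convention that a predicted word token counts as correct only when both of its boundaries match the gold standard, and that the lexicon is the set of distinct word forms. The helper names `spans`, `prf` and `evaluate` are made up for the example; a corpus-level evaluation would accumulate counts over all utterances rather than score one utterance.

```python
def spans(words):
    """Map a segmentation (list of words) to its set of (start, end) spans."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def prf(predicted, gold):
    """Precision, recall and F-score between two sets."""
    hits = len(predicted & gold)
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(predicted_words, gold_words):
    """Return (segmentation P/R/F, lexicon P/R/F) for one utterance."""
    seg = prf(spans(predicted_words), spans(gold_words))
    lex = prf(set(predicted_words), set(gold_words))
    return seg, lex


gold = "ə wʊdʧʌk wʊd ʧʌk wʊd".split()
pred = "ə wʊd ʧʌk wʊdʧʌk wʊd".split()
seg, lex = evaluate(pred, gold)
print(seg)  # segmentation: only 2 of 5 tokens have both boundaries right
print(lex)  # lexicon: the same four word types are found
```

On the second row of the table above, the lexicon score is perfect while the segmentation score is not, which is why the protocol reports both.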

  14. Experimental setup
      CHILDES corpora of child-directed speech [MacWhinney, 2000]
      • Derived from transcribed adult-child verbal interactions;
      • Phonemic transcriptions, orthographic segmentation.

                          English   French   Japanese
      Utterance tokens    10k       10k      10k
      Word tokens         33k       51k      27k
      Phoneme tokens      96k       121k     103k
      Phoneme types       50        35       49

  16. Cross-linguistic evaluation on phonemic corpora
      [Figure: segmentation F-score and lexicon F-score (axes from 0 to 90) for French (FR), English (EN) and Japanese (JP); models: MBDP-1, NGS-u, Random+ and Random.]
      • Blame it on the data?
      • Rich morphology (e.g. French clitics)? Hapax rate?
      • Relative importance of different cues?

  17. Effects of phonetic variation
      Phonemic transcripts = idealized input
      • Models are typically evaluated using phonemic transcripts;
      • Assumption: kids know how to undo allophony/coarticulation.
      Corpora and allophonic rules
      • No phonetic transcripts of child-directed speech are available;
      • How many allophones do infants have to learn?
      • Where is the limit between allophony and mere coarticulation?

  19. Experimental setup
      Emulating rich phonetic transcriptions [Boruta, 2011a]
      • Apply artificial allophonic rules to phonemic corpora;
      • Benchmark models using different allophonic complexities;
      • Control the size of the allophonic grammar.
      Simplifying assumptions [LeCalvez, 2007; Boruta, 2011a]
      • We only model monolateral rules: p → a / _ c
      • No two rules introduce the same phone: e.g. [ɾ] cannot be an allophone of both /t/ and /d/.
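
A minimal Python sketch of how such artificial rules might be applied to a phonemic utterance follows. It assumes that the single context c is the following segment, as in the French devoicing example earlier in the deck, and it represents a grammar as a dictionary from (phoneme, context) pairs to allophones. The names `apply_rules` and `check_unique_phones`, the "#" end-of-utterance marker and the toy grammar are all hypothetical, not the authors' implementation.

```python
def apply_rules(phonemes, rules):
    """Rewrite each phoneme p as rules[(p, c)] when it is followed by c.

    Contexts are read from the phonemic input, not from the rewritten
    output, so rules do not feed one another.
    """
    out = []
    for i, p in enumerate(phonemes):
        # "#" stands for the end of the utterance (an assumption).
        c = phonemes[i + 1] if i + 1 < len(phonemes) else "#"
        out.append(rules.get((p, c), p))
    return out


def check_unique_phones(rules):
    """Check that no phone is introduced as an allophone of two phonemes."""
    owner = {}
    for (p, _context), allophone in rules.items():
        if owner.setdefault(allophone, p) != p:
            return False
    return True


# Toy grammar: /r/ has two context-dependent allophones.
rules = {("r", "f"): "χ", ("r", "ʒ"): "ʁ"}
assert check_unique_phones(rules)
print(apply_rules(list("kanarflot"), rules))
# ['k', 'a', 'n', 'a', 'χ', 'f', 'l', 'o', 't']
```

Growing or shrinking the `rules` dictionary is one way to picture what controlling the size of the allophonic grammar means: more entries give more surface variants per phoneme.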

  21. Lexical complexity ∝ allophonic complexity
      [Figure: lexical complexity (y-axis, 1 to 3) as a function of allophonic complexity (x-axis, 1 to 20) for English, French and Japanese.]

  22. Results: English
      [Figure: two panels, segmentation F-score and lexicon F-score (y-axis ticks from 10 to 70; x-axis from 0 to 25), comparing MBDP-1, NGS-u, Random and Random+.]

  23. Results: French
      [Figure: two panels, segmentation F-score and lexicon F-score (y-axis ticks from 10 to 70; x-axis from 0 to 25), comparing MBDP-1, NGS-u, Random and Random+.]

  24. Results: Japanese
      [Figure: two panels, segmentation F-score and lexicon F-score (y-axis ticks from 10 to 70; x-axis from 0 to 12), comparing MBDP-1, NGS-u, Random and Random+.]

  25. Effects of phonetic variation
      [Figure: three lexicon F-score panels (y-axis ticks from 10 to 70), each comparing MBDP-1, NGS-u, Random and Random+; the x-axes run from 0 to 25, 0 to 25 and 0 to 12.]
      Unsurprising results
      • No mechanism for ‘explaining away’ allophonic variation;
      • Any word form found by the models will be added to the lexicon.
