SLIDE 1

A Joint Learning Model of Word Segmentation, Lexical Acquisition and Phonetic Variability

Micha Elsner, Sharon Goldwater, Naomi Feldman, Frank Wood

The Ohio State University, University of Edinburgh, University of Maryland, and Oxford University

October 18, 2013

SLIDE 3

Infant word learning

youwanttoseethebook lookthere’saboywithhishat andadoggie youwanttolookatthis lookatthis haveadrink takeitout youwantitin putthaton that yes okay openitup takethedoggieout ithinkitwillcomeout what daddy wherediditgo youwantthatone daddy i’llgogetyourblock what’sthatalice what’sthatablock that that’satelephone that’sthephone say hello youwanttospeaktoalice sayhello what’s youhavetotellme block youwanttheblocks

◮ The infant learner hears a stream of utterances...

◮ And is sensitive to repeated sequences

SLIDE 4

Models have been very successful...

Lexical models (goal: learn lexicon and LM)

◮ We follow (Goldwater, Griffiths, Johnson '09) (GGJ)
◮ Basic idea since (Brent '99)
◮ Many extensions since

Non-lexical models

◮ Word boundaries from phonotactics: (Fleck '08, Rytting '07, Daland+al '10)
◮ Word-like units from acoustics: (Park+al '08, Aimetti '09, Jansen+al '10)
SLIDE 5

But lexical models handle phonetics poorly

◮ "Intended form" /want/ ends up as [wan] or [wãʔ]
◮ Lowers overall performance of GGJ...
◮ And changes qualitative results
◮ Learns syllables or morphemes instead of words (Fleck '08)

Real infants learn collocations

Sequences learned as words (Peters, Tomasello):

◮ "youlike", "wantto"
◮ Production evidence: early words show up in fixed multi-word contexts
◮ Infants don't produce subwords
SLIDE 6

Our work (this paper)

A model that jointly:

◮ segments words
◮ clusters word tokens into lexical entries
◮ infers a model of phonetic variation
◮ ...on a broad-coverage corpus
SLIDE 7

Research context

Previous models integrate lexical/phonetic learning...

◮ (Feldman+al '09, '13): vowel learning (fixed lexicon)
◮ (Driesen+al '09, Räsänen '11): words and sounds (tiny datasets)
◮ (Börschinger+al '13): segmentation and phonetics (only t-deletion)
◮ (Neubig+al '10): LM from phone lattices (eval phone recognition only)
◮ (Elsner+al '12): two-stage pipeline
SLIDE 9

Last year... (Elsner+al '12)

Messy data: jəwãʔwʌn, wanəkʊki
→ GGJ segmentation: jə•wãʔ•wʌn, wanək•ʊki
→ Cluster word types: { /wan/: wãʔ, wanək, wan }
→ Normalized: ju•wan•wʌn, wan•ʊki

◮ Standard problem with pipelines: errors propagate
◮ Not a good cognitive model: doesn't capture interactions between levels
◮ Type-level inference doesn't scale to acoustics
SLIDE 10

In this paper...

Technical details:

◮ GGJ: Bayesian word segmentation
◮ Our noisy-channel model
◮ Joint inference without types: beam sampling

Cognitive modeling results:

◮ Words, collocations and morphemes: infants form collocations ...and have trouble with vowel-initial words
◮ Phonetic learning: infants learn consonants better ...and underestimate variation
◮ Missegmentations and misrecognitions: short, frequent words are hard
SLIDE 11

GGJ: a non-parametric bigram language model

[Plate diagram: a geometric (Geom) generator over characters proposes possible words (a, b, ..., ju, ..., want, ..., juwant, ...); a Dirichlet process with concentration α turns these into sparse word probabilities, e.g. p(ði) = .1, p(a) = .05, p(want) = .01; a second level with concentration α1 draws, for each of ∞ contexts, conditional probabilities for each word after each word, e.g. p(ði | want) = .3, p(a | want) = .1, p(want | want) = .0001; these generate the intended forms x1, x2, ... ("ju want ə kuki", "ju want ɪt", ...) over n utterances.]
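To make the two-level structure concrete, here is a minimal sketch (not the authors' code) of CRP-style predictive probabilities for such a bigram model. The geometric base distribution, alphabet size, and hyperparameter values are illustrative assumptions, and the hierarchy is approximated by backing off directly to the unigram predictive rather than tracking CRP table counts.

```python
# Hedged sketch of a DP-style bigram word model in the spirit of GGJ.
from collections import defaultdict

P_STOP = 0.5     # assumed geometric stopping probability
ALPHABET = 26    # assumed character inventory size

def p_base(word):
    """Base distribution over possible words: geometric length, uniform characters."""
    p = P_STOP
    for _ in word:
        p *= (1 - P_STOP) / ALPHABET
    return p

class DPBigram:
    def __init__(self, alpha=10.0, alpha1=1.0):    # illustrative concentrations
        self.alpha, self.alpha1 = alpha, alpha1
        self.uni = defaultdict(int)                         # word -> count
        self.n_uni = 0
        self.bi = defaultdict(lambda: defaultdict(int))     # prev -> word -> count
        self.n_bi = defaultdict(int)                        # prev -> total count

    def p_unigram(self, w):
        # CRP predictive: observed counts smoothed by the character-level generator
        return (self.uni[w] + self.alpha * p_base(w)) / (self.n_uni + self.alpha)

    def p_bigram(self, w, prev):
        # context-specific CRP backing off to the unigram model
        return ((self.bi[prev][w] + self.alpha1 * self.p_unigram(w))
                / (self.n_bi[prev] + self.alpha1))

    def observe(self, w, prev):
        self.bi[prev][w] += 1; self.n_bi[prev] += 1
        self.uni[w] += 1; self.n_uni += 1

m = DPBigram()
m.observe("want", prev="ju")
print(m.p_bigram("want", prev="ju"))   # higher than before it was observed
```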
SLIDE 12

Noisy channel component

[The same plate diagram, extended with a channel: each intended form xi is passed through a transducer T to produce a surface form si, e.g. "ju want ə kuki" → "jə wan ə kuki" and "ju want ɪt" → "ju wand ɪt".]
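A minimal sketch of the channel's generative direction, under the constraints stated on the next slide (insertions allowed, deletions not). The rewrite table, insertion probability, and inserted symbol are made-up placeholders, not learned values.

```python
# Sketch of the noisy channel: each intended character is independently
# rewritten; extra material may be inserted, but nothing is deleted.
import random

REWRITE = {                      # hypothetical rewrite distributions
    "u": [("u", 0.8), ("ə", 0.1), ("ɪ", 0.1)],
    "t": [("t", 0.7), ("d", 0.2), ("ʔ", 0.1)],
}
P_INSERT = 0.02                  # hypothetical insertion probability

def sample_surface(intended, rng=random.Random(0)):
    out = []
    for c in intended:
        if rng.random() < P_INSERT:
            out.append("h")      # toy epenthesis, as in the slide's (∅ → h)
        dist = REWRITE.get(c, [(c, 1.0)])   # unlisted characters copy through
        r, acc = rng.random(), 0.0
        for sym, p in dist:
            acc += p
            if r <= acc:
                out.append(sym)
                break
    return "".join(out)

print(sample_surface("juwantɪt"))   # prints one corrupted variant
```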
SLIDE 13

The transducer

◮ Independently rewrites each character (a → u)
◮ Log-linear features based on articulation (Hayes+Wilson, Dreyer+Eisner); see the toy sketch below

Constrained by efficiency issues:

◮ Can insert (∅ → h) but not delete (h → ∅)
◮ Similar to (Neubig, Elsner, Börschinger) but simpler

Learning phonetics:

◮ Initialize with a simple model (a → a)
◮ Learn via EM
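A toy illustration of the log-linear parameterization, p(s | c) ∝ exp(w · f(c, s)). The feature names, symbol inventory, and weights here are invented for the example; the real model's articulatory features follow Hayes+Wilson and Dreyer+Eisner.

```python
import math

VOWELS = set("aeiouəɪʊʌɛ")          # assumed symbol classes

def features(c, s):
    """Toy articulatory features for rewriting character c as s."""
    return {
        "identity": float(c == s),
        "vowel_to_vowel": float(c in VOWELS and s in VOWELS),
        "class_change": float((c in VOWELS) != (s in VOWELS)),
    }

def rewrite_dist(c, alphabet, w):
    """p(s | c) proportional to exp(w . features(c, s))."""
    scores = {s: math.exp(sum(w[k] * v for k, v in features(c, s).items()))
              for s in alphabet}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

# A high identity weight mimics the simple (a -> a) initializer.
w = {"identity": 3.0, "vowel_to_vowel": 1.0, "class_change": -2.0}
print(rewrite_dist("u", "aeiouəptk", w))   # identity output gets most of the mass
```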
SLIDE 14

Inference

Intended forms vary from surface forms: large search space!

◮ Character-by-character Gibbs sampling is likely to get stuck

Forward-backward style sampling method (sketched below):

◮ Following previous work
◮ Semi-Markov formulation of GGJ (Mochihashi+al '09)
◮ Composition with the transducer yields a large FSM (Neubig '10)
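To show the flavor of the sampler, here is a hedged unigram-only sketch of forward-filtering, backward-sampling over word boundaries. The paper's version is bigram and has the transducer composed in, so its lattice states are richer than plain substrings; p_word, max_len, and the example probabilities below are placeholders.

```python
import random

def sample_segmentation(s, p_word, max_len=10, rng=random.Random(0)):
    """Forward-filtering, backward-sampling over word boundaries.

    alpha[i] = total probability of generating s[:i]; walking backward,
    each word ending at i is sampled with weight alpha[j] * p_word(s[j:i]).
    """
    n = len(s)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            alpha[i] += alpha[j] * p_word(s[j:i])
    words, i = [], n
    while i > 0:
        starts = list(range(max(0, i - max_len), i))
        weights = [alpha[j] * p_word(s[j:i]) for j in starts]
        j = rng.choices(starts, weights=weights, k=1)[0]
        words.append(s[j:i])
        i = j
    return words[::-1]

# Toy word model: a tiny lexicon, with unknown strings penalized by length.
lexicon = {"ju": 0.05, "want": 0.05, "ɪt": 0.03}
p_word = lambda w: lexicon.get(w, 0.001 * 0.5 ** len(w))
print(sample_segmentation("juwantɪt", p_word))   # usually ['ju', 'want', 'ɪt']
```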
SLIDE 15

Finite-state encoding

[Lattice diagram: the surface string is unrolled as an FSM; word arcs (word jə, word j, word u, word ju) carry language-model weights p(jə|[s]), p(j|[s]), p(u|j), p(ju|[s]), and character arcs such as j/j, d/j, ə/u, u/u encode the transducer's surface/intended rewrites.]
SLIDE 16-18

Sampling from huge transducers (beam sampling)

[Lattice diagram, built up over three slides: the same FSM with competing transducer arcs (j/j, j/d, j/k, ə/u, u/u); beam sampling draws a slice variable uniformly from [0, p(current arc)] at each position (e.g. ~[0, p(j/j)], ~[0, p(u/u)]) and prunes every arc whose probability falls below it. A sketch of the pruning step follows.]

(van Gael+al '08), (Huggins+Wood '13)
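A minimal sketch of the slice ("beam") trick on a single lattice position: the auxiliary threshold is drawn uniformly below the probability of the currently sampled arc, and only arcs above the threshold survive resampling. The arc labels and probabilities are invented for illustration.

```python
import random

def slice_prune(arcs, p_current, rng=random.Random(0)):
    """arcs: list of (label, prob). Keep only arcs whose probability
    exceeds a slice variable u ~ Uniform(0, p_current), so each sweep
    explores a small random subset of the huge composed FSM."""
    u = rng.uniform(0.0, p_current)
    return [(label, p) for label, p in arcs if p > u]

arcs = [("j/j", 0.70), ("j/d", 0.05), ("j/k", 0.01)]
print(slice_prune(arcs, p_current=0.70))   # low-probability arcs usually drop out
```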
SLIDE 19

Overview

(The outline from slide 10 is repeated here; the talk now turns to the cognitive modeling results.)
SLIDE 20

Synthetic dataset from (Elsner+al '12)

Simulate child-directed speech in close phonetic transcription:

◮ Use the Bernstein-Ratner corpus (child-directed) (Bernstein-Ratner '87)
◮ And Buckeye (closely transcribed) (Pitt+al '07)
◮ Sample a pronunciation for each BR word from Buckeye (see the sketch below)
◮ No coarticulation between words

"about": ahbawt:15, bawt:9, ihbawt:4, ahbawd:4, ihbawd:4, ahbaat:2, baw:1, ahbaht:1, erbawd:1, bawd:1, ahbaad:1, ahpaat:1, bah:1, baht:1
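The resampling step can be sketched directly from the slide's "about" table; the corpus-reading code is omitted and the table below is just the slide's own example.

```python
import random

# Buckeye pronunciation counts for one Bernstein-Ratner word (from the slide).
PRONS = {
    "about": {"ahbawt": 15, "bawt": 9, "ihbawt": 4, "ahbawd": 4,
              "ihbawd": 4, "ahbaat": 2, "baw": 1, "ahbaht": 1,
              "erbawd": 1, "bawd": 1, "ahbaad": 1, "ahpaat": 1,
              "bah": 1, "baht": 1},
}

def sample_pron(word, rng=random.Random(0)):
    """Draw a surface pronunciation proportional to its Buckeye count."""
    forms, counts = zip(*PRONS[word].items())
    return rng.choices(forms, weights=counts, k=1)[0]

# Each token gets an independent draw; no cross-word coarticulation.
print([sample_pron("about") for _ in range(3)])
```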
SLIDE 21

Word segmentation results

                  Prec  Rec   F-score
GGJ, clean data   90.1  80.3  84.9
GGJ segmentation  70.4  93.5  80.3

◮ GGJ on clean data has high precision, low recall...
◮ On variable data, the tradeoff flips (as in (Fleck '08))

(Token-level scoring is sketched below.)
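For reference, word-token precision, recall, and F can be computed by comparing character-span sets. This is a sketch assuming the standard token-level scoring used in this literature; the talk does not show its evaluation script.

```python
def token_spans(words):
    """Character spans (start, end) of each word token in a segmentation."""
    spans, i = set(), 0
    for w in words:
        spans.add((i, i + len(w)))
        i += len(w)
    return spans

def token_prf(pred_words, gold_words):
    pred, gold = token_spans(pred_words), token_spans(gold_words)
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# An oversegmented hypothesis gets only "ju" right:
print(token_prf(["ju", "wan", "tɪt"], ["ju", "wantɪt"]))  # ~(0.33, 0.5, 0.4)
```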
SLIDE 22

Word segmentation results

                         Prec  Rec   F-score
GGJ, clean data          90.1  80.3  84.9
GGJ segmentation         70.4  93.5  80.3
GGJ, our beam inference  73.9  91.0  81.6

◮ Our inference scheme works
◮ Confidence intervals overlap
SLIDE 23

Word segmentation results

                         Prec  Rec   F-score
GGJ, clean data          90.1  80.3  84.9
GGJ segmentation         70.4  93.5  80.3
GGJ, our beam inference  73.9  91.0  81.6
EM transducer            80.1  83.0  81.5

◮ Segmentation with the transducer trades recall for precision
◮ Moving closer to the original qualitative results
SLIDE 24

A closer look

Where do gold-standard word tokens end up?

◮ Correct boundaries and lexical item
◮ Correct boundaries, wrong lexical item: ju analyzed as jɛs
◮ Collocation: boundaries are real but too wide: real ju•want as juwant
◮ Split: dɔgiz as dɔ•giz
◮ One boundary: ju•wa...
◮ Just plain wrong

(A rough operationalization of these categories is sketched below.)
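One plausible operationalization of these categories (my reading of the slide, not the authors' exact bookkeeping): classify each gold token by which of its edges appear among the predicted boundaries and whether the segmenter cut inside it. Lexical-identity checks are omitted.

```python
def classify(start, end, pred_bounds):
    """Rough category for a gold token (start, end) given the set of
    predicted boundary positions."""
    inside = any(start < b < end for b in pred_bounds)
    s_ok, e_ok = start in pred_bounds, end in pred_bounds
    if s_ok and e_ok:
        return "split" if inside else "correct boundaries"
    if inside:
        return "one boundary" if (s_ok or e_ok) else "just plain wrong"
    return "collocation"          # swallowed whole by a wider token

# Gold "ju" = (0, 2) inside predicted token "juwant" with bounds {0, 6}:
print(classify(0, 2, {0, 6}))     # "collocation"
```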
SLIDE 25

Analysis

             EM-learned  GGJ
Correct      49.88       47.61
Wrong form   17.96       23.73
Collocation  15.60        7.59
Split         8.69       15.84
One bound     7.11       15.18
Wrong         0.75        0.22

◮ "Wrong form" errors could be repaired in a pipeline
◮ ...but collocation vs. split errors cannot be
SLIDE 26

Vowel-initial words

◮ Infants are slow to segment vowel-initial words (Mattys+Jusczyk, Nazzi+al, Seidl+Johnson)
◮ Initial vowels are often variable or resyllabified (Seidl+Johnson)

EM transducer  Vow. init  Cons. init
Correct        41.5       52.1
Wrong form     20.4       17.3
Collocation    19.2       12.5

◮ The transducer system has trouble with vowels...
◮ More likely to find a collocation, less likely to get the left boundary correct
SLIDE 27

Phonetic learning

◮ Infants learn consonant categories more slowly than vowels
◮ Non-native vowel contrasts are lost by 8 months (Kuhl, Bosch+Sebastián-Gallés)
◮ Consonant contrasts by 10-12 months (Werker+Tees)
◮ Generalization across talkers/dialects comes slowly (Houston+Jusczyk, Singh)

What about the model?
SLIDE 28

Learned phonetic variability

                   x   top 4 outputs s
Actual (oracle)    u   u .68  ə .05  a .04  ʊ .04
variability        i   i .85  ɪ .03  ə .03  ɛ .02
                   ð   ð .69  s .07  [φ] .07  z .04
                   k   k .93  d .02  g .02
EM (full)          u   u .75  ə .08  ɪ .04  ʊ .03
                   i   i .90  ɪ .04  ɛ .02
                   ð   ð .91  s .03  z .01
                   k   k .98

◮ u and ð are about equally variable
◮ But the model learns the variants of u better
◮ In general, the model underestimates true variability
SLIDE 29

Misrecognitions

◮ Little is known about infant misrecognitions
◮ Adults misrecognize things... (Butterfield+Cutler)
◮ Incorrect hypotheses contain frequent words (Connine+al)
◮ The indefinite article is hard (Kim+al, Dilley+Pitt)
SLIDE 30

Model misrecognitions

Correctly segmented:

◮ two/to
◮ can/can't
◮ and/an
◮ his/is

Incorrectly segmented (errors like this are only possible in a joint model!):

◮ it/it's/is
◮ a/is
◮ who/who's/whose
◮ that's/what's
◮ there/there's
SLIDE 31

Conclusion

A cognitive model that allows joint inference of the lexicon and phonetics:

◮ Replicates several experimental results
◮ Runs on a broad-coverage, naturalistic corpus

Future research:

◮ Token-based sampling can extend to acoustics
◮ Cross-linguistic evaluation?

Software available!

◮ ACL archive
◮ bitbucket.org/melsner/beamseg