discovering morphological paradigms from plain text



  1. Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model. Markus Dreyer (SDL Language Weaver), Jason Eisner (Johns Hopkins University). This work was done at: Center for Language and Speech Processing (CLSP) and Human Lang. Tech. Center of Excellence (HLTCOE), Johns Hopkins University (JHU).

  2. Motivation: Rich morphology. English text: break, break, jump, break, jump, break, break, break. German text: brichst, brecht, springst, brechen, springe, breche, brichst, breche.

  3. Motivation • Analyzing text: • Lack of generalization • Data sparseness • Generating text: • Generate correct forms • Produce correctly inflected text. There is a need for a general morphology model that knows how to inflect words. [Figure: German text: brichst, brecht, springst, brechen, springe, breche, brichst, breche.]

  4. Motivation So how do you inflect a word? You look it up in such a table, for example the inflectional paradigm of treffen (rows: 1st, 2nd, 3rd person; columns: present singular, present plural, past singular, past plural): treffe, treffen, traf, trafen; triffst, trefft, trafst, traft; trifft, treffen, traf, trafen. But creating such supervised data is expensive. Let's use unannotated text to learn these paradigms!

  5. Motivation • This talk is about a comprehensive model for inflectional morphology. • Main goal: • Given some unannotated text, can we learn how to inflect the verbs of a language (incl. irregularities and exceptions)? • Discover the inflectional paradigms (tables) of a language, using minimal supervision.

  6. Motivation 1. Identify the different lexemes in text. [Figure: German text (tokens): brichst, brecht, springst, brechen, springe, breche, brichst, breche; next to it, a paradigm (types).]

  8. Motivation 1. Identify the different lexemes in text. [Figure: the same German text (tokens); the forms brichst, brecht, brechen, breche are marked as belonging to one lexeme, while springst and springe are not.]

  9. Motivation 2. Place each form of a lexeme into its paradigm. [Figure: the observed forms breche, brichst, brecht from the German text (tokens) are placed into cells of the paradigm (types) for brechen.]

  10. Motivation 2. Place each form of a lexeme into its paradigm. [Figure: the paradigm for brechen holds the observed forms breche, brichst, brecht; the unobserved cells show candidate spellings such as bricht?, brichen?, brechen?, brach?, brachen?, brachst?, bracht?, brichte?, brichten?, brichtest?, brichtet?.]

  11. Motivation 2. Place each form of a lexeme into its paradigm. [Figure: the brechen paradigm with observed forms and candidate spellings (bricht?, brach?, brachen?, brichte?, ...); in the German text, the tokens springst and springe are now marked.]

  12. Motivation 2. Place each form of a lexeme into its paradigm. [Figure: the brechen paradigm with observed forms and candidate spellings; the forms springe and springst are placed into a second paradigm.]

  13. Motivation 2. Place each form of a lexeme into its paradigm. [Figure: two paradigms. The first, for brechen, holds breche, brichst, brecht plus candidate spellings such as bricht?, brach?, brachen?, brichte?. The second, with candidate lemmas springen? and sprengen?, holds springe, springst plus candidate spellings such as springt?, sprengt?, sprang?, sprangen?, sprangst?, sprangt?, springte?, springten?, springtest?, springtet?.]

  14. Motivation [Figure: German text (tokens) next to two paradigms (types): brechen, with observed forms breche, brichst, brecht and candidate spellings such as bricht?, brach?, brachen?; and springen?/sprengen?, with observed forms springe, springst and candidate spellings such as springt?, sprang?, sprangen?.]

  15. Motivation [Figure: German text (tokens) next to three paradigms (types): brechen and springen?/sprengen? with observed forms and candidate spellings, and a third paradigm for saufen with the forms saufe, säufst, sauft and candidate spellings such as säuft?, saufen?, säufen?, sauft?.]

  16. Motivation In order to perform this morphological knowledge discovery, we define a probability distribution over a text corpus and its (hidden) inflectional paradigms: p(text corpus, paradigms).

  17. Overview: (1) Paradigms, (2) Lexicon & Corpus. [Figure: two probability terms p( ), labeled 1 and 2.]

  19. Paradigms 1: Why build a probability model over paradigms? • Jointly predict missing string values • Compute marginals • Know what spellings are likely in the different cells. [Figure: a paradigm with observed forms breche, brichst, brecht and candidate spellings brichen?, brechen?, bricht?, brecht?.]

  20. Paradigms 1: How to build a probability model over paradigms? [Figure: a paradigm with observed forms breche, brichst, brecht and candidate spellings brichen?, brechen?, bricht?, brecht?.] Dreyer & Eisner (2009)

  21. Paradigms 1: How to build a probability model over paradigms? Each cell is modeled by a string-valued random variable: X_Lem, X_1sg, X_2sg, X_3sg, X_1pl, X_2pl, X_3pl. Dreyer & Eisner (2009)

  23. Paradigms 1: How to build a probability model over paradigms? [Figure: the variables X_Lem, X_1sg, X_2sg, X_3sg, X_1pl, X_2pl, X_3pl.] Dreyer & Eisner (2009)

  24. Paradigms 1: How to build a probability model over paradigms? [Figure: the same variables, arranged with X_Lem in the middle of X_1sg, X_2sg, X_3sg, X_1pl, X_2pl, X_3pl.] Dreyer & Eisner (2009)

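  26. Paradigms 1: How to build a probability model over paradigms? Markov Random Field over string-valued variables, with X_Lem connected to each cell variable by one factor (F_1 to F_6):

      $$p(X_{\mathrm{Lem}}, X_{\mathrm{1sg}}, X_{\mathrm{2sg}}, X_{\mathrm{3sg}}, X_{\mathrm{1pl}}, X_{\mathrm{2pl}}, X_{\mathrm{3pl}}) = \frac{1}{Z}\, F_1(X_{\mathrm{Lem}}, X_{\mathrm{1sg}}) \times F_2(X_{\mathrm{Lem}}, X_{\mathrm{2sg}}) \times F_3(X_{\mathrm{Lem}}, X_{\mathrm{3sg}}) \times F_4(X_{\mathrm{Lem}}, X_{\mathrm{1pl}}) \times F_5(X_{\mathrm{Lem}}, X_{\mathrm{2pl}}) \times F_6(X_{\mathrm{Lem}}, X_{\mathrm{3pl}})$$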
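The factorization on this slide can be illustrated with a small sketch. It is not the authors' model: the slides use weighted finite-state transducers as factors over all possible strings, whereas this toy version assumes a tiny hand-picked candidate list per variable (`DOMAINS`) and a made-up factor based on edit distance (`factor`). It only shows how the six pairwise factors and the normalizer Z combine into a joint probability.

```python
import itertools
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def factor(lemma: str, form: str) -> float:
    """Hypothetical stand-in for F_i: larger when the two spellings are closer."""
    return math.exp(-edit_distance(lemma, form))


# Assumption: each string-valued variable ranges over two toy candidates only.
DOMAINS = {
    "Lem": ["brechen", "brichen"],
    "1sg": ["breche", "briche"],
    "2sg": ["brichst", "brechst"],
    "3sg": ["bricht", "brecht"],
    "1pl": ["brechen", "brichen"],
    "2pl": ["brecht", "bricht"],
    "3pl": ["brechen", "brichen"],
}
CELLS = [name for name in DOMAINS if name != "Lem"]


def unnormalized(assignment: dict) -> float:
    """Product of the six pairwise factors F_1 ... F_6."""
    score = 1.0
    for cell in CELLS:
        score *= factor(assignment["Lem"], assignment[cell])
    return score


def all_assignments():
    """Enumerate every joint assignment of candidates to the seven variables."""
    names = list(DOMAINS)
    for values in itertools.product(*(DOMAINS[n] for n in names)):
        yield dict(zip(names, values))


# Z sums the unnormalized score over all joint assignments.
Z = sum(unnormalized(a) for a in all_assignments())


def joint(assignment: dict) -> float:
    """p(X_Lem, X_1sg, ..., X_3pl) for one full assignment."""
    return unnormalized(assignment) / Z
```

Brute-force enumeration is only feasible here because the candidate lists are tiny; with string-valued variables ranging over all spellings, Z cannot be computed this way.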

  27. Paradigms 1: Belief Propagation: • Standard inference algorithm • Computes marginals through message passing • We use a finite-state variant of this algorithm, Dreyer & Eisner (2009). [Figure: Markov Random Field over string-valued variables: X_Lem connected to X_1sg, X_2sg, X_3sg, X_1pl, X_2pl, X_3pl through the factors F_1 to F_6.]

  28. Paradigms 1: Belief Propagation: • Standard inference algorithm • Computes marginals through message passing • We use a finite-state variant of this algorithm, Dreyer & Eisner (2009). [Figure: the same Markov Random Field, with messages drawn as finite-state automata over candidate spellings such as brechen, brichen, ... and bricht.]
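A minimal sketch of message passing on this star-shaped field follows. It is a simplification, not the finite-state variant the slides refer to: messages here are plain dictionaries over short, assumed candidate lists (`LEMMAS`, `LEAVES`) rather than weighted automata, and `similarity` is a made-up factor. Because the graph is a tree, the marginal computed from messages should match brute-force enumeration, which `brute_force_marginal` checks.

```python
import math
from itertools import product


def similarity(a: str, b: str) -> float:
    """Hypothetical factor: penalize mismatched aligned characters and length difference."""
    mismatches = sum(ca != cb for ca, cb in zip(a, b)) + abs(len(a) - len(b))
    return math.exp(-mismatches)


# Assumption: toy candidate lists; a single candidate means the cell is observed.
LEMMAS = ["brechen", "brichen"]
LEAVES = {
    "1sg": ["breche"],            # observed
    "2sg": ["brichst"],           # observed
    "3sg": ["bricht", "brecht"],  # unobserved, two candidates
}


def leaf_to_lemma(cell: str) -> dict:
    """Message from a cell variable to X_Lem: sum the factor over the cell's candidates."""
    return {lem: sum(similarity(lem, x) for x in LEAVES[cell]) for lem in LEMMAS}


def marginal(target: str) -> dict:
    """Marginal of one cell, computed from the messages of all other cells."""
    # Message from X_Lem to the target: product of the incoming messages from other cells.
    to_target = {
        lem: math.prod(leaf_to_lemma(c)[lem] for c in LEAVES if c != target)
        for lem in LEMMAS
    }
    belief = {
        x: sum(to_target[lem] * similarity(lem, x) for lem in LEMMAS)
        for x in LEAVES[target]
    }
    total = sum(belief.values())
    return {x: v / total for x, v in belief.items()}


def brute_force_marginal(target: str) -> dict:
    """Same marginal by enumerating every joint assignment, for comparison."""
    cells = list(LEAVES)
    scores = {x: 0.0 for x in LEAVES[target]}
    for lem in LEMMAS:
        for values in product(*(LEAVES[c] for c in cells)):
            assignment = dict(zip(cells, values))
            score = math.prod(similarity(lem, assignment[c]) for c in cells)
            scores[assignment[target]] += score
    total = sum(scores.values())
    return {x: v / total for x, v in scores.items()}
```

Calling `marginal("3sg")` returns a distribution over the two candidate spellings of the unobserved cell.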

  29. Paradigms 1: Summary • Paradigms modeled as Markov Random Fields (MRF) • Weighted finite-state transducers (FST) relate the various spellings to one another • They encode morphological knowledge ("grammar") • Use finite-state-based belief propagation (BP) to compute string marginals

  30. Paradigms 1: Summary • Dreyer & Eisner (2009): • Learn purely from example paradigms (training data) • Then use model to predict unseen forms • Disadvantages: • Training data is expensive • Predicts forms that would never occur in real text (where an alternate form may be preferred). We will now address these problems.

  31. Overview: (1) Paradigms, (2) Lexicon & Corpus. [Figure: two probability terms p( ), labeled 1 and 2.]

  32. Lexicon & Corpus 2 • We use the paradigms to construct a probabilistic lexicon that specifies which inflections of which lexemes are common and how they are spelled. • We define a generative probability model of the lexicon and a text corpus. • This allows for a clean inference procedure that learns morphology from text and discovers its inflectional paradigms.

  33. Lexicon & Corpus 2 [Figure: two panels. Generative story: Model → (generate) → Data. Inference (Sampling): Data → (learn) → Model.]

  34. Lexicon & Corpus: Generative story. To generate from our model: • First, generate the lexicon (types). • Then, use it to generate the corpus (tokens). [Figure: Model → (generate) → Data.]
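The two-step generative story can be sketched as below. This is only an illustration under strong assumptions: the paradigms are a fixed toy table (`TOY_PARADIGMS`) rather than being generated by the paradigm model, the frequency weights are made-up random numbers, and the Dirichlet process component named in the talk title is not modeled. The function names `generate_lexicon` and `generate_corpus` are hypothetical.

```python
import random

# Assumption: spellings are given; in the talk they come from the paradigm model.
TOY_PARADIGMS = {
    "brechen": {"1sg": "breche", "2sg": "brichst", "3sg": "bricht"},
    "springen": {"1sg": "springe", "2sg": "springst", "3sg": "springt"},
}


def generate_lexicon(rng: random.Random) -> list:
    """Step 1: generate the lexicon (types).

    Each entry is (lexeme, cell, spelling, weight); the weight is a made-up
    stand-in for how common that inflection of that lexeme is.
    """
    lexicon = []
    for lexeme, cells in TOY_PARADIGMS.items():
        for cell, spelling in cells.items():
            weight = rng.random() + 0.1  # keep every weight strictly positive
            lexicon.append((lexeme, cell, spelling, weight))
    return lexicon


def generate_corpus(rng: random.Random, lexicon: list, num_tokens: int) -> list:
    """Step 2: generate the corpus (tokens) by drawing entries in proportion to weight."""
    weights = [entry[3] for entry in lexicon]
    drawn = rng.choices(lexicon, weights=weights, k=num_tokens)
    return [spelling for _, _, spelling, _ in drawn]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lexicon = generate_lexicon(rng)
corpus = generate_corpus(rng, lexicon, num_tokens=8)
```

Inference runs in the opposite direction: given only the tokens in `corpus`, the model has to recover which lexeme and cell produced each one.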
