 
Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
Markus Dreyer (SDL Language Weaver)
Jason Eisner (Johns Hopkins University)
This work was done at: Center for Language and Speech Processing (CLSP) and Human Language Technology Center of Excellence (HLTCOE), Johns Hopkins University (JHU)
Motivation
Rich morphology
English text: break, break, jump, break, jump, break, break, break
German text: brichst, brecht, springst, brechen, springe, breche, brichst, breche
Motivation
• Analyzing text:
  • Lack of generalization
  • Data sparseness
• Generating text:
  • Generate correct forms
  • Produce correctly inflected text
There is a need for a general morphology model that knows how to inflect words.
[Figure: the German text tokens brichst, brecht, springst, brechen, springe, breche, brichst, breche]
Motivation
So how do you inflect a word? You look it up in such a table, for example:

Inflectional Paradigm: treffen
          Present            Past
          sg       pl        sg      pl
1st       treffe   treffen   traf    trafen
2nd       triffst  trefft    trafst  traft
3rd       trifft   treffen   traf    trafen

But creating such supervised data is expensive.
Let's use unannotated text to learn these paradigms!
Motivation
• This talk is about a comprehensive model for inflectional morphology.
• Main goal:
  • Given some unannotated text, can we learn how to inflect the verbs of a language (incl. irregularities and exceptions)?
  • Discover the inflectional paradigms (tables) of a language, using minimal supervision
Motivation
1. Identify the different lexemes in text
[Figure: the tokens of the German text (brichst, brecht, springst, brechen, springe, breche, brichst, breche) are shown next to an empty paradigm. The tokens brichst, brecht, brechen and breche are then picked out as types belonging to one lexeme.]
Motivation
2. Place each form of a lexeme into its paradigm
[Figure, built up over several slides: types found in the German text are placed into paradigm tables. Observed forms fill some cells; the remaining cells hold hypothesized spellings, marked with "?".]
• brechen: observed breche, brichst, brecht, brechen; hypothesized brichte?, brichen?, brichten?, brach?, brachen?, brechen?, brichtest?, brichtet?, brachst?, bracht?, bricht?, brecht?
• Second paradigm: observed springe, springst; hypothesized lemma springen? or sprengen?; hypothesized forms springte?, sprang?, springtet?, springt?, springtest?, sprangst?, sprangt?, sprengt?, springten?, sprangen?
• saufen: observed saufe, säufst, sauft; hypothesized säufen?, saufen?, säuft?, sauft?
Motivation
In order to perform this morphological knowledge discovery, we define a probability distribution over a text corpus and its (hidden) inflectional paradigms:
p(text corpus, paradigms)
Overview
1. Paradigms
2. Lexicon & Corpus
Paradigms 1
Why build a probability model over paradigms?
• Jointly predict missing string values
• Compute marginals
• Know what spellings are likely in the different cells
[Figure: a partially filled paradigm with observed forms breche, brichst, brecht and hypothesized forms brichen?, brechen?, bricht?, brecht?]
Paradigms 1
How to build a probability model over paradigms?
[Figure: a partially filled paradigm with observed forms breche, brichst, brecht and hypothesized forms brichen?, brechen?, bricht?, brecht?]
Dreyer & Eisner (2009)
Paradigms 1
How to build a probability model over paradigms?
• Each cell is modeled by a string-valued random variable: X_Lem, X_1sg, X_2sg, X_3sg, X_1pl, X_2pl, X_3pl
[Figure: X_Lem is moved to the center, with X_1sg, X_2sg, X_3sg on one side and X_1pl, X_2pl, X_3pl on the other.]
Dreyer & Eisner (2009)
Paradigms 1
How to build a probability model over paradigms?
\[
p(X_{Lem}, X_{1sg}, X_{2sg}, X_{3sg}, X_{1pl}, X_{2pl}, X_{3pl}) = \frac{1}{Z}\, F_1(X_{Lem}, X_{1sg}) \times F_2(X_{Lem}, X_{2sg}) \times F_3(X_{Lem}, X_{3sg}) \times F_4(X_{Lem}, X_{1pl}) \times F_5(X_{Lem}, X_{2pl}) \times F_6(X_{Lem}, X_{3pl})
\]
[Figure: factor graph with X_Lem at the center, linked through the factors F_1 to F_6 to X_1sg, X_2sg, X_3sg, X_1pl, X_2pl, X_3pl.]
Markov Random Field over string-valued variables
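The factorization above can be tried out on a toy scale. The sketch below is not the authors' model: it assumes each variable ranges over a tiny hand-picked candidate list (the slides use string-valued variables with unbounded support), keeps only two of the six cells, and uses a made-up pairwise factor that rewards a long shared prefix in place of a learned finite-state transducer. The names `factor`, `candidates`, `unnormalized` and `joint_distribution` are all hypothetical.

```python
from itertools import product


def factor(lemma, form):
    """Hypothetical pairwise factor F(X_Lem, X_cell): 2 ** (shared prefix length)."""
    n = 0
    for x, y in zip(lemma, form):
        if x != y:
            break
        n += 1
    return 2.0 ** n


# Assumed candidate spellings per variable (illustration only).
candidates = {
    "Lem": ["brechen", "brichen"],
    "1sg": ["breche", "briche"],
    "2pl": ["brecht", "bricht"],
}


def unnormalized(assignment):
    """Product of the pairwise factors that tie each cell to the lemma."""
    score = 1.0
    for cell, form in assignment.items():
        if cell != "Lem":
            score *= factor(assignment["Lem"], form)
    return score


def joint_distribution(cands):
    """Enumerate every joint assignment and divide by the partition function Z."""
    cells = list(cands)
    assignments = [dict(zip(cells, combo))
                   for combo in product(*(cands[c] for c in cells))]
    z = sum(unnormalized(a) for a in assignments)
    return [(a, unnormalized(a) / z) for a in assignments]


dist = joint_distribution(candidates)
```

Brute-force enumeration is only feasible because the candidate lists are tiny; it is meant to make the role of Z and of the pairwise factors concrete, nothing more.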
Paradigms 1
Belief Propagation:
• Standard inference algorithm
• Computes marginals through message passing
• We use a finite-state variant of this algorithm, Dreyer & Eisner (2009)
[Figure: the factor graph with X_Lem at the center and factors F_1 to F_6. Candidate strings such as brechen, brichen, ... are passed along the edges as letter-labelled finite-state machines, and one variable's resulting distribution includes bricht and brechen.]
Markov Random Field over string-valued variables
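Message passing on this star-shaped graph can be illustrated with ordinary sum-product belief propagation over finite candidate lists. This is a simplification swapped in for readability: the talk uses a finite-state variant in which messages are weighted automata over strings, whereas here every message is a plain dictionary. The factor, the candidate spellings and the function names are assumptions made for the example.

```python
from math import prod


def factor(lemma, form):
    # Hypothetical stand-in for a weighted transducer score: 2 ** (shared prefix length).
    n = 0
    for x, y in zip(lemma, form):
        if x != y:
            break
        n += 1
    return 2.0 ** n


def normalize(dist):
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}


def star_bp(lemma_cands, leaf_cands, observed):
    """Sum-product on a star graph: hub X_Lem, one leaf per paradigm cell.

    leaf_cands: cell -> candidate spellings; observed: cell -> spelling seen in text.
    An observed cell is clamped to its single observed spelling.
    """
    support = {c: [observed[c]] if c in observed else cands
               for c, cands in leaf_cands.items()}
    # Messages from each leaf to the hub, as a function of the lemma value.
    to_hub = {c: {l: sum(factor(l, f) for f in support[c]) for l in lemma_cands}
              for c in leaf_cands}
    # Hub marginal: product of all incoming messages.
    hub = normalize({l: prod(to_hub[c][l] for c in leaf_cands) for l in lemma_cands})
    # Leaf marginals: combine the hub's message (all other leaves) with the factor.
    leaves = {}
    for cell in leaf_cands:
        from_hub = {l: prod(to_hub[c][l] for c in leaf_cands if c != cell)
                    for l in lemma_cands}
        leaves[cell] = normalize(
            {f: sum(from_hub[l] * factor(l, f) for l in lemma_cands)
             for f in support[cell]})
    return hub, leaves


hub, leaves = star_bp(
    lemma_cands=["brechen", "brichen"],
    leaf_cands={"1sg": ["breche", "briche"], "2pl": ["brecht", "bricht"]},
    observed={"1sg": "breche"},
)
```

Because the graph is a tree, these marginals are exact. In the example, observing breche in the 1sg cell shifts belief toward the lemma brechen, which in turn makes brecht more likely than bricht in the unobserved 2pl cell.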
Paradigms 1
Summary
• Paradigms modeled as Markov Random Fields (MRF)
• Weighted finite-state transducers (FST) relate the various spellings to one another
• They encode morphological knowledge ("grammar")
• Use finite-state-based belief propagation (BP) to compute string marginals
Paradigms 1
Summary
• Dreyer & Eisner (2009):
  • Learn purely from example paradigms (training data)
  • Then use model to predict unseen forms
• Disadvantages:
  • Training data is expensive
  • Predicts forms that would never occur in real text (where an alternate form may be preferred)
We will now address these problems.
Overview
1. Paradigms
2. Lexicon & Corpus
Lexicon & Corpus 2
• We use the paradigms to construct a probabilistic lexicon that specifies which inflections of which lexemes are common and how they are spelled.
• We define a generative probability model of the lexicon and a text corpus.
• This allows for a clean inference procedure to learn morphology from text and discover its inflectional paradigms.
Lexicon & Corpus 2
• Generative story: Model → (generate) → Data
• Inference (Sampling): Data → (learn) → Model
Lexicon & Corpus
Generative story: Model → (generate) → Data
To generate from our model:
• First, generate the lexicon (types).
• Then, use it to generate the corpus (tokens).
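The two-step story above (types first, then tokens) can be sketched as follows. This only illustrates the order of generation. The candidate spellings, the uniform choice of a spelling per cell, the random lexeme weights and the uniform choice of inflection are all assumptions made for the example, and the Dirichlet process mixture named in the title is not modeled here.

```python
import random

# Hypothetical candidate spellings per lexeme and paradigm cell.
CANDIDATES = {
    "brechen": {"1sg": ["breche"], "2sg": ["brichst", "brechst"]},
    "springen": {"1sg": ["springe"], "2sg": ["springst"]},
}


def generate_lexicon(rng):
    """Step 1 (types): fix one spelling per cell and a frequency weight per lexeme."""
    paradigms = {
        lexeme: {cell: rng.choice(forms) for cell, forms in cells.items()}
        for lexeme, cells in CANDIDATES.items()
    }
    weights = {lexeme: rng.random() + 0.1 for lexeme in CANDIDATES}
    return paradigms, weights


def generate_corpus(paradigms, weights, n_tokens, rng):
    """Step 2 (tokens): draw a lexeme, then a cell, then emit the stored spelling."""
    lexemes = list(paradigms)
    tokens = []
    for _ in range(n_tokens):
        lexeme = rng.choices(lexemes, weights=[weights[l] for l in lexemes])[0]
        cell = rng.choice(list(paradigms[lexeme]))
        tokens.append(paradigms[lexeme][cell])
    return tokens


rng = random.Random(0)
paradigms, weights = generate_lexicon(rng)
corpus = generate_corpus(paradigms, weights, 8, rng)
```

Every token is looked up in the lexicon rather than spelled afresh, so repeated tokens of the same lexeme and inflection always agree on their spelling.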