
Automatic Learning of a Morphological Model Theory and Unsupervised Approaches

Unsupervised Learning - ToC
- Foundations
- Problem description
- General architecture
- Papers:
  – Goldsmith 2001 and 2006
  – Goldsmith and Hu 2004

Unsupervised Models - Foundations
- Saffran et al. 1996: Adults are capable of rapidly discovering word units in a stream of a nonsense language without any connection to meaning.
- Creutz and Lagus 2007: This suggests that humans do use distributional cues, such as transition probabilities between sounds, in language learning. Statistical patterns of this kind in language data can be successfully exploited by appropriately designed algorithms.

Unsupervised Models - Problem description
- Input:
  – Untagged text in orthographic form, with words separated.
  – No syntactic, semantic or phonological information is given.
- Output:
  – A lexicon of morphemes (stems and affixes). Frequent unsegmented words might also be included (mixed lexicon, see Creutz and Lagus).
  – An analysis of each word in the corpus by segmentation into morphemes.
- Evaluation of results:
  – The analysis should be as close as possible to the gold standard, obtained by manual segmentation.
  – The data should be represented efficiently.

Unsupervised Models – Components
- Bootstrapping heuristic: creates the initial morphological model.
- Incremental heuristics: create an improved morphological model based on the existing one.
- Evaluation model: compares morphological models and tells whether a significant improvement was achieved.

Unsupervised Models – Flow diagram
[Flow diagram] Corpus -> Bootstrapping heuristic -> Model of morphology -> Evaluate (evaluation model: MDL, MAP). Incremental heuristics -> New model of morphology -> Evaluate. If the new model is (much) better: replace the old model and repeat. If there is not much improvement: stop.
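The loop in the flow diagram can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `learn_morphology`, `bootstrap`, `heuristics`, `cost` and `min_gain` are hypothetical and do not come from the papers, and the model is treated as an opaque value scored by a cost function (lower is better, as with a description length).

```python
from typing import Callable, Iterable, TypeVar

Model = TypeVar("Model")


def learn_morphology(
    bootstrap: Callable[[], Model],
    heuristics: Iterable[Callable[[Model], Model]],
    cost: Callable[[Model], float],
    min_gain: float = 0.0,
) -> Model:
    """Greedy search: keep a proposed model only if it lowers the cost enough."""
    heuristics = list(heuristics)
    model = bootstrap()              # initial (rough) model
    best = cost(model)
    improved = True
    while improved:                  # stop when no heuristic helps any more
        improved = False
        for propose in heuristics:
            candidate = propose(model)
            candidate_cost = cost(candidate)
            if best - candidate_cost > min_gain:
                # The new model is (much) better: replace the old one.
                model, best = candidate, candidate_cost
                improved = True
    return model
```

With a toy integer "model", `learn_morphology(lambda: 10, [lambda m: m - 1], lambda m: (m - 3) ** 2)` walks down to 3 and stops there, because the next proposal would raise the cost.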

Unsupervised Models – Preface
- Goldsmith 2001/6
  – Recursive segmentation into stem+suffix / prefix+stem.
  – Evaluation in terms of Minimum Description Length.
- Goldsmith and Hu 2004
  – NFSA holds layered morpheme compositions.
  – Evaluation in terms of Minimum Description Length.
- Creutz and Lagus 2005/7
  – HMM of categories emitting morphemes.
  – Evaluation in terms of Maximum A Posteriori.

Unsupervised Models - Methodologies - 1
- Tradeoff between restrictiveness and flexibility
  – An overly restrictive model may exclude all optimal and near-optimal models, making it impossible to learn a good model, regardless of how much data and computation time is spent.
  – An overly flexible model is very hard to learn, as it requires impractical amounts of data and computation.

Unsupervised Models - Methodologies - 2
- Minimum Description Length (Rissanen 1989)
  – Main ideas:
    - Every regularity in data may be used to compress that data.
    - Learning can be equated with finding regularities in data.
  – Formalization of Occam's Razor: the best hypothesis for a given set of data is the one that leads to the largest compression of the data.
  – Choose the best model by simultaneously considering model accuracy and model complexity.
  – Simpler models are favored over complex ones. This generally improves generalization capacity by inhibiting overlearning.

Unsupervised Models - Methodologies - 3
- The MDL formulation is used in Goldsmith's papers:
  $$\arg\min_{\text{Model } M} \mathrm{DescriptionLength}(\text{Corpus } C, \text{Model } M) = \arg\min_{\text{Model } M} \left[ \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)} \right]$$
- Alternative formulation, used in the papers by Creutz and Lagus: Maximum A Posteriori (MAP):
  $$\arg\max_{\text{Lexicon}} P(\text{Lexicon} \mid \text{Corpus})$$
- Chen 1996: The two approaches are equivalent with respect to the task discussed.
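One way to see why the two criteria can coincide is the short derivation below. It rests on an assumption that is not stated on the slide: the model length is read as the code length of the model under a prior, i.e. length(M) = -log2 P(M). Under that reading, maximizing the posterior and minimizing the description length select the same model.

```latex
\begin{aligned}
\arg\max_{M} P(M \mid C)
  &= \arg\max_{M} \frac{P(M)\, P(C \mid M)}{P(C)}
   = \arg\max_{M} P(M)\, P(C \mid M) \\
  &= \arg\min_{M} \left[ -\log_2 P(M) - \log_2 P(C \mid M) \right] \\
  &= \arg\min_{M} \left[ \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)} \right]
\end{aligned}
```

The denominator P(C) can be dropped because it does not depend on M, and taking the negative logarithm turns the maximization into a minimization.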

Unsupervised Models – Goldsmith 2001/6
- The signature concept
  – A signature is the list of affixes appearing with a stem.
  – E.g. jump: NULL.ed.ing.s
  – Each stem has a unique signature.
- The structure of the morphological model:
  – List of stems, e.g. { cat, jump, laugh, hat, walk, sav }
  – List of affixes, e.g. { NULL, ed, ing, s, e, es }
  – List of signatures and their associated stems, e.g.
      stems:   ptr( jump ), ptr( walk ), ptr( laugh )
      affixes: ptr( NULL ), ptr( ed ), ptr( ing ), ptr( s )
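A minimal sketch of how signatures could be derived from a set of stem+affix analyses is given here. The function name `build_signatures` and the input format (pairs of stem and affix, with the empty affix written "NULL") are assumptions made for illustration. Affixes are sorted alphabetically inside a signature, which is a convenience of this sketch.

```python
from collections import defaultdict


def build_signatures(analyses):
    """Group stems by the set of affixes they occur with.

    `analyses` is an iterable of (stem, affix) pairs; the empty affix is
    written "NULL". Returns {signature string: sorted list of stems}.
    """
    affixes_of = defaultdict(set)
    for stem, affix in analyses:
        affixes_of[stem].add(affix)

    signatures = defaultdict(list)
    for stem in sorted(affixes_of):
        # Each stem gets exactly one signature: all of its affixes, joined by dots.
        signature = ".".join(sorted(affixes_of[stem]))
        signatures[signature].append(stem)
    return dict(signatures)
```

For example, if jump and walk both occur with NULL, ed, ing and s, they end up under the single signature NULL.ed.ing.s.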

Goldsmith 2001/6 - Samples
- Some signatures from The Adventures of Tom Sawyer:

  Signature       Sample of tokens                     # Stems   # Tokens
  NULL.ed.ing     betray betrayed betraying                 69        816
  NULL.ed.ing.s   remain remained remaining remains         14        516
  NULL.s          cow cows                                 253       3414
  e.ed.es.ing     notice noticed notices noticing            4         62

Goldsmith 2001/6 - Bootstrapping heuristic - 1
- Creates the initial (rough) morphological model.
- Cuts words into morphemes (stem + affix) and builds the lists.
- The most effective way is based on an early proposal of Harris (1955, 1967), named successor frequency.
- The idea: a word should be cut where the succeeding character is hardest to predict.

Goldsmith 2001/6 - Bootstrapping heuristic - 2
- An example of successor frequency
  – Consider the word: government
  – Assume that empirically:
    - { n } follows gover
    - { e, i, m, o, s, # } follows govern
    - { e } follows governm
  – We get the frequencies: g o v e r (1) n (6) m (1) e n t, with a peak of 6 after govern.
- Successor frequency tells us to cut the word into govern + ment.
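The counts in the example can be reproduced with a small sketch like the one below. The function name `successor_frequencies` and the six-word lexicon used in the usage note are assumptions chosen to match the slide's example; "#" marks the end of a word, as on the slide.

```python
from collections import defaultdict


def successor_frequencies(word, lexicon):
    """For each prefix of `word`, count the distinct characters that follow
    that prefix anywhere in `lexicon` ("#" stands for end of word).

    Returns a list of (prefix, count) pairs, shortest prefix first.
    """
    successors = defaultdict(set)
    for entry in lexicon:
        marked = entry + "#"
        for i in range(1, len(marked)):
            successors[marked[:i]].add(marked[i])
    return [(word[:i], len(successors[word[:i]])) for i in range(1, len(word) + 1)]
```

With the lexicon govern, governed, governing, government, governor, governs, the prefix gover has one successor, govern has six, and governm has one, so the highest count falls at the boundary govern + ment.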

Goldsmith 2001/6 - Bootstrapping heuristic - 3
- Difficulties with successor frequency:
  – Consider: c (9) o (18) n (11) s (6) e (4) r (1) v (2) a (1) t (1) i (2) v (1) e (1) s, which has three peaks: after co, after conserv and after conservati.
- May lead to under-cutting or over-cutting. How could one decide?
- Set constraints (higher precision, lower recall):
  – Stem length must be at least 5.
  – The number of stems in a signature must be at least 5.
  – Absolute peaks: the successor frequencies around the cut must have the form 1, N, 1.
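As a rough sketch, two of these constraints (minimum stem length and the 1, N, 1 shape) can be applied to a list of successor frequencies as shown below. The function name `absolute_peak_cuts` is hypothetical, N > 1 is assumed for a peak, and the constraint on the number of stems per signature is not modeled here because it needs the whole lexicon.

```python
def absolute_peak_cuts(word, freqs, min_stem_len=5):
    """Return candidate stems where the successor frequencies read 1, N, 1 (N > 1).

    freqs[i] is the successor frequency after the prefix word[:i + 1].
    """
    stems = []
    for i in range(1, len(freqs) - 1):
        is_absolute_peak = freqs[i - 1] == 1 and freqs[i] > 1 and freqs[i + 1] == 1
        if is_absolute_peak:
            stem = word[: i + 1]
            if len(stem) >= min_stem_len:
                stems.append(stem)
    return stems
```

On the frequencies listed for conservatives, the peak after co is rejected (its neighbours are 9 and 11, not 1), while conserv and conservati remain as candidates.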

Goldsmith 2001/6 - Incremental Heuristics - 1
- General idea:
  – Try to reorganize the lists in the model.
  – Evaluate the model length with and without the change and proceed accordingly.
- Create signatures by a loose-fit strategy:
    For every known suffix F,
      for every word that can be split into S+F:
        collect all the suffixes of S and suggest a new signature.
- The Check Signatures function:
  – Move letters from stems to suffixes (slide the boundary left).
    - Examine each signature and suggest moving letters.
    - Example 1: consider the i in -ion or -ive.
    - Example 2: consider the words ending with -able and -ible.
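The loose-fit loop could look roughly like this in code. It is a sketch under assumptions: the function name `loose_fit_signatures` is hypothetical, suffixes are assumed to be non-empty strings, a stem that is itself a word receives the NULL suffix, and affixes are sorted alphabetically within a signature. The sketch only proposes signatures; whether a proposal is accepted would still depend on the description length comparison.

```python
def loose_fit_signatures(words, known_suffixes):
    """Propose one signature per candidate stem, following the loose-fit loop."""
    words = set(words)
    proposals = {}
    for suffix in known_suffixes:
        if not suffix:
            continue  # the empty suffix is handled as NULL below
        for word in words:
            if not word.endswith(suffix) or len(word) == len(suffix):
                continue
            stem = word[: -len(suffix)]
            if stem in proposals:
                continue
            # Collect every known suffix that forms an attested word with this stem.
            found = {s for s in known_suffixes if s and stem + s in words}
            if stem in words:
                found.add("NULL")
            proposals[stem] = ".".join(sorted(found))
    return proposals
```

For the words act, acted, action, acts, acting and the suffixes ed, ion, s, ing, the only proposal is the stem act with all five affixes in one signature.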

Goldsmith 2001/6 - Incremental Heuristics - 2
- Extending to unanalyzed words
  – Recall the conservatism of the bootstrapping successor frequency peak: 1, N, 1.
  – No account was given for words like derivation and derivative (they could not be analyzed as deriv-ation and deriv-ative).
  – Heuristic: for every unanalyzed word, suggest a cut into a known stem+suffix (prefer the most common stem).
- Slide the stem-suffix boundary right
  – Consider signatures whose suffixes all share the same prefix, e.g. te.ting.ts
  – Suggest sliding the boundary right, thus creating: e.ing.s
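Sliding the boundary right amounts to moving the prefix shared by all suffixes onto every stem of the signature. A minimal sketch, with a hypothetical function name `slide_boundary_right` and made-up stems in the usage note, is shown here; it only builds the proposal and leaves the accept/reject decision to the evaluation model.

```python
import os


def slide_boundary_right(stems, suffixes):
    """Move the prefix shared by all suffixes onto the stems.

    A suffix that becomes empty is written "NULL". If the suffixes share
    no prefix, the signature is returned unchanged.
    """
    shared = os.path.commonprefix(list(suffixes))
    if not shared:
        return list(stems), list(suffixes)
    new_stems = [stem + shared for stem in stems]
    new_suffixes = [suffix[len(shared):] or "NULL" for suffix in suffixes]
    return new_stems, new_suffixes
```

Applied to the suffixes te, ting, ts with the illustrative stems ra and ha, the shared t moves to the stems, giving rat and hat with the suffixes e, ing, s.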

Goldsmith 2001/6 – Description Length Evaluation 1
- Evaluating the description length in MDL:
  – Recall the formula:
    $$DL = \mathrm{length}(\text{Model } M) + \log_2 \frac{1}{P(\text{Corpus } C \mid M)}$$
- Evaluating the model length:
    $$\mathrm{length}(\text{Model } M) = \mathrm{length}(\text{stems}) + \mathrm{length}(\text{affixes}) + \mathrm{length}(\text{sig})$$

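Goldsmith 2001/6 – Description Length Evaluation 2
- Stems list structure: the number of stems (N), followed by N pointers (Ptr), each pointing to the data of one stem (Stem 1 data, Stem 2 data, Stem 3 data, ..., Stem N data).
- The stems list (T) length:
    $$\mathrm{length}(T) = \log_2 |T| + \sum_{t \in T} \left( \log_2 26 \cdot \mathrm{length}(t) + \log_2 \frac{[W]}{[t \text{ in } W]} \right)$$
  The three terms are, in order, the bits for the number of items, the bits for the data, and the bits for the pointer. Square brackets denote counts: [W] is the number of words in the corpus and [t in W] the number of occurrences of stem t.
- The affixes list (F) length:
    $$\mathrm{length}(F) = \log_2 |F| + \sum_{f \in F} \left( \log_2 26 \cdot \mathrm{length}(f) + \log_2 \frac{[W]}{[f \text{ in } W]} \right)$$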
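The list-length formula can be evaluated directly. The sketch below assumes a non-empty mapping from each item (stem or affix) to its occurrence count in the corpus; the function name `list_length_bits` and its parameters are illustrative and not taken from the papers.

```python
import math


def list_length_bits(counts, total_words, alphabet_size=26):
    """Length in bits of a stem or affix list.

    counts: non-empty dict mapping each item to its occurrence count.
    total_words: number of words in the corpus ([W] in the formula).
    """
    bits = math.log2(len(counts))                      # bits for the number of items
    for item, count in counts.items():
        bits += math.log2(alphabet_size) * len(item)   # bits for the data (letters)
        bits += math.log2(total_words / count)         # bits for the pointer
    return bits
```

For two items ab and c, each occurring twice in a corpus of four words, this gives 1 bit for the item count, 3 letters at log2 26 bits each, and 1 bit per pointer, i.e. 3 + 3 log2 26 bits in total.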

Goldsmith 2001/6 – Description Length Evaluation 3
- Evaluating the model length (cont'd):
  – The signatures list (S) length is the sum of:
    - bits for the signatures count: $$\log_2 [S]$$
    - bits for the signature pointers: $$\sum_{s \in Sig} \log_2 \frac{[W]}{[s]}$$
    - for each signature, the bits for the stems count and stem pointers plus the bits for the affixes count and affix pointers:
      $$\sum_{s \in Sig} \left( \log_2 [\mathrm{Stems}(s)] + \sum_{t \in \mathrm{Stems}(s)} \log_2 \frac{[W]}{[t]} + \log_2 [\mathrm{Affixes}(s)] + \sum_{f \in \mathrm{Affixes}(s)} \log_2 \frac{[s]}{[f \text{ in } s]} \right)$$

Goldsmith 2001/6 – Description Length Evaluation 4
- Evaluating the probability of the corpus decomposition:
    $$\log_2 \frac{1}{P(\text{Corpus } C \mid M)} = \log_2 \frac{1}{\prod_{w \in Corpus} P(w \mid M)} = -\sum_{w \in Corpus} \log_2 P(w \mid M)$$
- Evaluating the probability of the decomposition of each word, w:
    $$P(w \mid M) = P(w = t + f \mid M) = P(sig \mid M) \cdot P(t \mid sig, M) \cdot P(f \mid sig, M)$$
    $$-\log_2 P(w \mid M) = -\log_2 P(sig \mid M) - \log_2 P(t \mid sig, M) - \log_2 P(f \mid sig, M)$$
    $$= \log_2 \frac{[W]}{[sig(w)]} + \log_2 \frac{[sig(w)]}{[stem(w) \text{ in } sig(w)]} + \log_2 \frac{[sig(w)]}{[affix(w) \text{ in } sig(w)]}$$
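The per-word cost is simple arithmetic on four counts. The sketch below uses hypothetical names (`word_cost_bits`, `corpus_cost_bits`) and assumes the counts have already been collected from the analyzed corpus.

```python
import math


def word_cost_bits(total_words, sig_count, stem_in_sig, affix_in_sig):
    """Bits needed to encode one word as signature + stem + affix.

    total_words:  number of words in the corpus
    sig_count:    occurrences of the word's signature
    stem_in_sig:  occurrences of the word's stem within that signature
    affix_in_sig: occurrences of the word's affix within that signature
    """
    return (
        math.log2(total_words / sig_count)
        + math.log2(sig_count / stem_in_sig)
        + math.log2(sig_count / affix_in_sig)
    )


def corpus_cost_bits(word_counts, total_words):
    """Sum the per-word costs; word_counts holds one
    (sig_count, stem_in_sig, affix_in_sig) triple per word occurrence."""
    return sum(word_cost_bits(total_words, *triple) for triple in word_counts)
```

For example, with 16 words in the corpus, a signature seen 4 times, a stem seen 2 times in it and an affix seen once in it, the word costs 2 + 1 + 2 = 5 bits.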

Goldsmith 2001/6 – An Example of Loose-fit - 1
- Assume that the corpus contains: { act, acted, action, acts, acting }.
- The bootstrapping heuristic adds all words to the stems list and also { NULL, ed, ion, s, ing } to the affixes list.
- Loose fit: consider adding the signature NULL.ed.ion.s.ing instead of 4 instances.
  – Evaluate the cost of each alternative.
  – Increment only if a cheaper alternative is found.
