Simple Morpheme Labelling in Unsupervised Morpheme Analysis

  1. Simple Morpheme Labelling in Unsupervised Morpheme Analysis. Delphine Bernhard, Ubiquitous Knowledge Processing Lab, Darmstadt, Germany. Morpho Challenge 2007, September 19, 2007.

  2. Main features of the method
     ◮ Algorithm already presented at Morpho Challenge 2005
     ◮ Only input: plain list of words ⇒ no use of corpora or token frequency information
     ◮ Output: list of labelled morphemic segments for each word:
       ◮ prefix: dis arm ed
       ◮ suffix: sulk ing
       ◮ stem: grow
       ◮ linking element: oil - painting s

  3. Overview of the method
     Input: list of word forms
     ◮ Step 1: Extraction of prefixes and suffixes ⇒ list of prefixes and suffixes
     ◮ Step 2: Acquisition of stems ⇒ list of stems
     ◮ Step 3: Segmentation of words ⇒ potential segmentations for each word
     ◮ Step 4: Selection of the best segmentation ⇒ morphemic segments
     ◮ Step 5 (optional): Application of the segments to a new data set (additional word forms) ⇒ segmented words

  4. Step 1: Extraction of prefixes and suffixes

  5. Step 1: Extraction of prefixes and suffixes
     ◮ Input: longest words
     ◮ Locate positions with low segment predictability
     ◮ Output: segments
     [Figure: variations of the average maximum transition probabilities across the letters of "hyperventilating"; x-axis: letters (H Y P E R V E N T I L A T I N G), y-axis: 0.0 to 1.0]
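
The slides do not spell out how the curve is computed. The following is a minimal sketch, assuming that substring frequencies are counted over the word list and that each internal position is scored by the average, over all substring pairs meeting there, of the larger of the forward and backward transition probabilities. The function names and the local-minimum rule are illustrative assumptions, not the author's implementation.

```python
from collections import Counter


def substring_counts(words):
    """Count, for every substring, the number of words that contain it."""
    counts = Counter()
    for w in words:
        counts.update({w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)})
    return counts


def boundary_scores(word, counts):
    """Score each internal position k of `word` (assumed to be in the word list).

    For every left substring word[i:k] and right substring word[k:j], take the
    larger of count(both)/count(left) and count(both)/count(right), then
    average these maxima. Low scores mean low predictability across k.
    """
    scores = {}
    for k in range(1, len(word)):
        total, n = 0.0, 0
        for i in range(k):
            for j in range(k + 1, len(word) + 1):
                left, right, both = word[i:k], word[k:j], word[i:j]
                if not counts[left] or not counts[right]:
                    continue  # substring unseen in the word list
                total += max(counts[both] / counts[left], counts[both] / counts[right])
                n += 1
        scores[k] = total / n if n else 0.0
    return scores


def low_predictability_positions(scores):
    """Return positions whose score is a strict local minimum (assumed rule)."""
    ks = sorted(scores)
    return [k for a, k, b in zip(ks, ks[1:], ks[2:])
            if scores[k] < scores[a] and scores[k] < scores[b]]
```

Each ratio is at most 1 because every word containing the joined substring also contains both of its parts, so the scores stay within the 0.0 to 1.0 range shown on the figure's y-axis.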

  8. Step 1: Extraction of prefixes and suffixes
     Identification of a stem among the segments of hyper ventilat ing:

     segment    hyper   ventilat   ing
     frequency  123   > 16       < 13,768
     length     5     < 8        > 3

     Prefixes and suffixes found around the stem ventilat: hyper, un, badly- (prefixes); ing, ion, or, ors, ed (suffixes)
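
The comparison on this slide can be sketched as follows, assuming the stem is the one segment that is both less frequent and longer than every other segment of the word (the slide shows the comparison but does not state the exact rule). The function names are hypothetical.

```python
def find_stem(segments, freq):
    """Return the segment that is rarer and longer than all others, or None."""
    for i, cand in enumerate(segments):
        others = segments[:i] + segments[i + 1:]
        if all(freq[cand] < freq[o] and len(cand) > len(o) for o in others):
            return cand
    return None


def split_affixes(segments, stem):
    """Treat segments before the stem as prefixes and those after it as suffixes."""
    pos = segments.index(stem)
    return segments[:pos], segments[pos + 1:]


# Frequencies taken from the slide.
freq = {"hyper": 123, "ventilat": 16, "ing": 13768}
segments = ["hyper", "ventilat", "ing"]
stem = find_stem(segments, freq)
prefixes, suffixes = split_affixes(segments, stem)
```

With the slide's numbers, `stem` is `"ventilat"`, `prefixes` is `["hyper"]` and `suffixes` is `["ing"]`.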

  9. Step 2: Acquisition of stems

  10. Step 2: Acquisition of stems
      ◮ Subtract prefixes and suffixes from all words
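
A minimal sketch of the subtraction, assuming at most one known prefix and one known suffix are removed per word and that a stem must keep a minimum length. The `min_stem_len` parameter and the function name are assumptions made for illustration.

```python
def candidate_stems(word, prefixes, suffixes, min_stem_len=3):
    """Return the strings left after stripping a known prefix and/or suffix.

    The empty affix is always allowed, so the unstripped word is itself
    among the candidates.
    """
    stems = set()
    for p in [""] + [p for p in prefixes if p and word.startswith(p)]:
        for s in [""] + [s for s in suffixes if s and word.endswith(s)]:
            stem = word[len(p):len(word) - len(s)]
            if len(stem) >= min_stem_len:
                stems.add(stem)
    return stems


stems = candidate_stems("hyperventilating", {"hyper", "un"}, {"ing", "ed"})
```

Here `stems` contains `ventilat` along with the partially stripped forms `ventilating` and `hyperventilat` and the full word.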

  11. Step 3: Segmentation of words

  12. Step 3: Segmentation of words
      ◮ Alignment of words containing the same stem in order to discover similar and dissimilar parts
      [Diagram: alignment around the stem integrat; segments shown: ing, d, dis, e, re, ion, -, ist, well, s, or, fully, with # at both ends]

  13. Step 3: Segmentation of words
      Validation of new prefixes and suffixes. The segments found before a potential stem fall into three sets: known prefixes ($A_1$), potential stems ($A_2$) and new prefixes ($A_3$).

      Word               Segment before the stem
      fully-integrated   fully-
      well-integrated    well-
      reintegrated       re
      disintegrated      dis
      integrated         ε

      Validation condition, with thresholds a and b:
      $\frac{|A_1| + |A_2|}{|A_1| + |A_2| + |A_3|} \geq a$ and $\frac{|A_1|}{|A_1| + |A_2|} \geq b$
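
The two ratio constraints can be checked as in the sketch below. The set sizes and threshold values in the example are hypothetical; the slide gives neither the values of a and b nor the assignment of each segment to a set.

```python
def new_affixes_valid(n_known, n_stems, n_new, a, b):
    """Check both ratio constraints on the sizes of A1, A2 and A3.

    n_known = |A1| (known prefixes), n_stems = |A2| (potential stems),
    n_new = |A3| (new prefixes); a and b are thresholds.
    """
    known_or_stem = n_known + n_stems
    total = known_or_stem + n_new
    if known_or_stem == 0:
        return False  # both ratios are undefined or zero
    return known_or_stem / total >= a and n_known / known_or_stem >= b


# Hypothetical sizes and thresholds, for illustration only.
accepted = new_affixes_valid(n_known=2, n_stems=2, n_new=1, a=0.7, b=0.1)
rejected = new_affixes_valid(n_known=1, n_stems=0, n_new=3, a=0.7, b=0.1)
```

The first constraint limits the share of unknown segments, and the second requires that enough of the recognised segments are already known prefixes.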

  14. Step 4: Selection of the best segmentation

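  15. Step 4: Selection of the best segmentation
      Candidate segmentations of auto-transplantation (segment frequencies in parentheses):
      ◮ auto (41) - (12,194) transplant (40) ation (737)
      ◮ auto (41) - (12,194) transplantation (12)
      ◮ auto (41) - (12,194) transplanta (16) tion (103)
      Selection:
      ◮ The most frequent segment is chosen when given a choice
      ◮ Some frequency and morphotactic constraints are verified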
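
A minimal sketch of the "most frequent segment wins" rule, assuming the candidates are scanned left to right and that, at the first position where they differ, only the candidates starting with the most frequent segment are kept. The additional frequency and morphotactic constraints mentioned on the slide are not modelled, and the function name is hypothetical.

```python
def select_segmentation(candidates, freq):
    """Pick one segmentation from a non-empty list of candidates.

    All candidates must be segmentations of the same word, so two distinct
    candidates always differ before either one ends.
    """
    remaining = list(dict.fromkeys(tuple(c) for c in candidates))  # drop duplicates
    pos = 0
    while len(remaining) > 1:
        best = max({c[pos] for c in remaining}, key=lambda s: freq.get(s, 0))
        remaining = [c for c in remaining if c[pos] == best]
        pos += 1
    return list(remaining[0])


# Frequencies as read from the slide.
freq = {"auto": 41, "-": 12194, "transplant": 40, "ation": 737,
        "transplantation": 12, "transplanta": 16, "tion": 103}
candidates = [
    ["auto", "-", "transplant", "ation"],
    ["auto", "-", "transplantation"],
    ["auto", "-", "transplanta", "tion"],
]
chosen = select_segmentation(candidates, freq)
```

At the third position the options are transplant (40), transplantation (12) and transplanta (16), so `chosen` is `["auto", "-", "transplant", "ation"]`.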

  16. Step 5 (optional): Application of the morphemic segments to a new data set

  17. Step 5 (optional): Application of the morphemic segments to a new data set
      ◮ For each word, select segments so that the total cost is minimal
      ◮ Cost functions used:
        ◮ Method 1: $\mathrm{cost}_1(s_i) = -\log \frac{f(s_i)}{\sum_i f(s_i)}$
        ◮ Method 2: $\mathrm{cost}_2(s_i) = -\log \frac{f(s_i)}{\max_i [f(s_i)]}$
      where:
        ◮ $s_i$ = morphemic segment
        ◮ $f(s_i)$ = frequency of segment $s_i$
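
The slide does not say how the minimal-cost segmentation is searched for. One plausible way is dynamic programming over word positions, sketched below with both cost functions. The segment inventory and its frequencies are hypothetical, and the function names are illustrative.

```python
import math


def make_cost(freq, method=1):
    """Build cost_1 (normalised by the sum) or cost_2 (normalised by the max)."""
    denom = sum(freq.values()) if method == 1 else max(freq.values())
    return lambda seg: -math.log(freq[seg] / denom)


def segment(word, freq, method=1):
    """Return the cheapest split of `word` into known segments, or None."""
    cost = make_cost(freq, method)
    # best[i] holds (total cost, segments) for the prefix word[:i].
    best = [(0.0, [])] + [(math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            seg = word[start:end]
            if seg in freq and best[start][1] is not None:
                total = best[start][0] + cost(seg)
                if total < best[end][0]:
                    best[end] = (total, best[start][1] + [seg])
    return best[len(word)][1]


# Hypothetical segment frequencies, for illustration only.
freq = {"sulk": 5, "ing": 500, "s": 900, "ulk": 1, "sulking": 1}
result_1 = segment("sulking", freq, method=1)
result_2 = segment("sulking", freq, method=2)
```

Because the maximum frequency never exceeds the sum of all frequencies, every segment costs less under Method 2 than under Method 1, so splitting a word into more segments is penalised less under Method 2.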

  18. Results for competition 1: Precision (%)

      Language   Method 1   Method 2
      English    72.0       61.6
      Finnish    76.0       59.6
      German     63.2       49.1
      Turkish    78.2       73.7

      ◮ Method 1 > Method 2

  19. Results for competition 1: Recall (%)

      Language   Method 1   Method 2
      English    52.5       60.0
      Finnish    25.0       40.4
      German     37.7       57.4
      Turkish    10.9       14.8

      ◮ Method 2 > Method 1
      ◮ Low recall in Turkish

  20. Results for competition 1: F-measure (%)

      Language   Method 1   Method 2
      English    60.7       60.8
      Finnish    37.6       48.2
      German     47.2       52.9
      Turkish    19.2       24.6

      ◮ Method 2 > Method 1
      ◮ Low F-measure in Turkish
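
The slides do not state the F-measure formula. Assuming the usual balanced F-measure (the harmonic mean of precision and recall), the reported values can be cross-checked with a short sketch like this one:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A precision of 72.0 and a recall of 52.5, both shown on the result slides.
example = round(f_measure(72.0, 52.5), 1)  # 60.7
```

The result, 60.7, is one of the F-measure values on the slide. Since the harmonic mean is dominated by the smaller of its two inputs, a very low recall keeps the F-measure low even when precision is high.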

  21. Results for competition 2: Tfidf weighting
      [Bar chart: Tfidf average precision (AP × 100) for English, Finnish and German. Legend as extracted: Dummy with new, Method 1 without new, Method 1 with new, Method 2 without new, Method 2 with new. Bar labels in extraction order: 40.2, 39.8, 39.0, 38.1, 37.8, 37.3, 37.2, 37.0, 35.6, 35.0, 27.8, 27.8, 27.8, 26.7, 26.8]

  22. Results for competition 2: Okapi BM25 weighting
      [Bar chart: Okapi average precision (AP × 100) for English, Finnish and German. Legend as extracted: Dummy with new, Method 1 without new, Method 1 with new, Method 2 without new, Method 2 with new. Bar labels in extraction order: 49.1, 47.3, 46.8, 46.8, 46.1, 46.2, 44.2, 41.8, 39.4, 39.0, 39.2, 38.8, 32.7, 32.3, 31.2]

  23. Challenges in unsupervised morpheme analysis
      ◮ Objectives of Morpho Challenge 2007: unsupervised morpheme analysis ⇒ more complex than segmentation of words into sub-units
      ◮ Problems to be solved:
        ◮ allomorphy: different forms for the same morpheme, e.g. oxen = ox +PL and flies = fly_N +PL
        ◮ homography: same form for different morphemes, e.g. fly (noun = insect) vs. fly (verb)
      ◮ What can be solved by the system in its current state?

  28. How well does the system disambiguate cross-category homography?
      Examples in English: ship as a suffix vs. ship as a stem
      ◮ censor ship
      ◮ ship wreck
      ◮ space ship s (flagged with "!!!!" on the slide)
      Analysis of the results
      + Morphotactic constraints prevent a suffix from occurring at the beginning of a word
      - The most frequent segments are privileged when several morpheme categories are morphotactically plausible
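
The two points of the analysis can be illustrated with a small sketch: a suffix reading is ruled out at the beginning of a word, and otherwise the most frequent category wins. The per-category frequencies below are hypothetical, and the function is an illustration rather than the system's actual labelling procedure.

```python
def label_segment(position, category_freq):
    """Choose a category for a segment occurring at `position` in a word.

    category_freq maps a category name to the frequency of the segment in
    that category. A suffix is not allowed at position 0; among the
    remaining categories the most frequent one is chosen.
    """
    allowed = {cat: f for cat, f in category_freq.items()
               if not (cat == "suffix" and position == 0)}
    if not allowed:
        raise ValueError("no morphotactically plausible category")
    return max(allowed, key=allowed.get)


# Hypothetical frequencies for "ship" as a suffix and as a stem.
ship = {"suffix": 300, "stem": 50}
in_shipwreck = label_segment(0, ship)    # word-initial: suffix is ruled out
in_spaceships = label_segment(1, ship)   # both categories are plausible
```

With these numbers `in_shipwreck` is `"stem"`, while `in_spaceships` is `"suffix"` because frequency alone decides when both categories are plausible, which is the kind of error the slide highlights.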
