Simple Morpheme Labelling in Unsupervised Morpheme Analysis

SLIDE 1

Simple Morpheme Labelling in Unsupervised Morpheme Analysis

Delphine Bernhard

Ubiquitous Knowledge Processing Lab, Darmstadt, Germany

Morpho Challenge 2007 – September 19, 2007

SLIDE 2

Main features of the method

◮ Algorithm already presented at Morpho Challenge 2005
◮ Only input: plain list of words
  ⇒ no use of corpora or token frequency information
◮ Output: list of labelled morphemic segments for each word:
  ◮ prefix: dis arm ed
  ◮ suffix: sulk ing
  ◮ stem: grow
  ◮ linking element: oil – painting s

SLIDE 3

Overview of the method

◮ Input: list of word forms
◮ Step 1: Extraction of prefixes and suffixes ⇒ list of prefixes and suffixes
◮ Step 2: Acquisition of stems ⇒ list of stems
◮ Step 3: Segmentation of words ⇒ potential segmentations for each word
◮ Step 4: Selection of the best segmentation ⇒ segmented words and morphemic segments
◮ Step 5 (optional): Application of the segments to a new data set (additional word forms)

SLIDE 7

Step 1: Extraction of prefixes and suffixes

Input: longest words

Locate positions with low segment predictability

[Figure: variations of the average maximum transition probabilities (scale 0 to 1) over the letters of "hyperventilating"]

Output: segments
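
A minimal sketch of how such a predictability profile might be computed. The slide only names the quantity, so the scoring rule below (substring counts over the word list, best transition probability per left substring, averaged per position) and the strict local-minimum criterion are assumptions, as are all function names.

```python
from collections import Counter

def substring_counts(words):
    """Count every occurrence of every substring in the word list."""
    counts = Counter()
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                counts[w[i:j]] += 1
    return counts

def boundary_scores(word, counts):
    """For each position k, average over the substrings ending at k of the
    best transition probability to any substring starting at k (assumed rule)."""
    scores = {}
    for k in range(1, len(word)):
        best_per_left = []
        for i in range(k):
            left = word[i:k]
            best = 0.0
            for j in range(k + 1, len(word) + 1):
                right = word[k:j]
                joint = counts[left + right]
                if joint:  # joint > 0 implies both parts were seen
                    best = max(best, joint / counts[left], joint / counts[right])
            best_per_left.append(best)
        scores[k] = sum(best_per_left) / len(best_per_left)
    return scores

def low_predictability_positions(scores):
    """Positions whose score is a strict local minimum: candidate boundaries."""
    ks = sorted(scores)
    return [k for prev, k, nxt in zip(ks, ks[1:], ks[2:])
            if scores[k] < scores[prev] and scores[k] < scores[nxt]]
```

Cutting a long word at the returned positions would yield candidate segments of the kind shown on the slide.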

SLIDE 8

Step 1: Extraction of prefixes and suffixes

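Identification of a stem among the segments

segment     hyper     ventilat     ing
frequency   123    >  16        <  13 768
length      5      <  8         >  3

⇒ stem: ventilat (the least frequent and longest segment)

Prefixes and suffixes

hyper    ventilat   ing
         ventilat   ion
         ventilat   or
         ventilat   ors
hyper    ventilat   ion
un       ventilat   ed
badly-   ventilat   ed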
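
The comparison above can be sketched in a few lines, assuming the stem is simply the least frequent segment, with ties broken in favour of the longer one. The slide shows only the comparison, so this rule and the function names are assumptions. Segments before the stem are then collected as prefixes, segments after it as suffixes.

```python
def pick_stem(segments, freq):
    """Least frequent segment; among equals, the longest (assumed rule)."""
    return min(segments, key=lambda s: (freq[s], -len(s)))

def split_affixes(segments, freq):
    """Return (prefixes, stem, suffixes) for one segmented word."""
    stem = pick_stem(segments, freq)
    i = segments.index(stem)
    return segments[:i], stem, segments[i + 1:]

# Frequencies taken from the slide.
freq = {"hyper": 123, "ventilat": 16, "ing": 13768}
prefixes, stem, suffixes = split_affixes(["hyper", "ventilat", "ing"], freq)
```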

SLIDE 10

Step 2: Acquisition of stems

Subtract prefixes and suffixes from all words
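
A rough sketch of this subtraction, assuming every combination of one known prefix and one known suffix is tried on each word and the remainder is kept as a candidate stem. The minimum stem length `min_len` and the requirement that at least one affix is stripped are hypothetical filters, not taken from the slides.

```python
def acquire_stems(words, prefixes, suffixes, min_len=3):
    """Strip known prefixes and suffixes from each word and collect the rest."""
    stems = set()
    for w in words:
        matching_prefixes = [""] + [p for p in prefixes if w.startswith(p)]
        matching_suffixes = [""] + [s for s in suffixes if w.endswith(s)]
        for p in matching_prefixes:
            for s in matching_suffixes:
                if not (p or s):
                    continue  # nothing stripped: not a new candidate
                core = w[len(p):len(w) - len(s)]
                if len(core) >= min_len:
                    stems.add(core)
    return stems

stems = acquire_stems(
    ["hyperventilating", "ventilation", "unventilated"],
    prefixes={"hyper", "un"},
    suffixes={"ing", "ion", "ed"},
)
```

The result also contains over-long candidates such as "hyperventilat"; a later stage would have to choose between competing analyses.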

SLIDE 12

Step 3: Segmentation of words

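Alignment of words containing the same stem in order to discover similar and dissimilar parts

[Diagram: words containing the stem integrat, aligned on the stem; # marks a word boundary]

before the stem:   #, fully, well, re, dis
stem:              integrat
after the stem:    e, ing, ion, or
then:              d, s, #, ist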
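
As an illustration of the alignment idea, the sketch below splits every word containing a given stem into the part before the stem, the stem itself and the part after it. It is a simplification: only the first occurrence of the stem is used and the remainder after the stem is not split further. The function name is hypothetical.

```python
def align_on_stem(words, stem):
    """Return (before, stem, after) for each word that contains the stem."""
    rows = []
    for w in words:
        i = w.find(stem)
        if i != -1:
            rows.append((w[:i], stem, w[i + len(stem):]))
    return rows

rows = align_on_stem(
    ["reintegrated", "disintegration", "integrator", "grow"],
    "integrat",
)
```

Comparing the `before` and `after` columns across rows exposes the shared and the differing parts of the aligned words.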

SLIDE 13

Step 3: Segmentation of words

Validation of new prefixes and suffixes

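Words and the segment found before the stem:

fully-integrated   fully-
well-integrated    well-
reintegrated       re
disintegrated      dis
integrated         ε

Each segment is assigned to one of three sets: known prefixes (A1), potential stems (A2), new prefixes (A3).

\[
\frac{|A_1| + |A_2|}{|A_1| + |A_2| + |A_3|} \geq a
\quad \text{and} \quad
\frac{|A_1|}{|A_1| + |A_2|} \geq b
\]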
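
The two inequalities translate directly into a check like the one below. The threshold values and the assignment of the example segments to A1, A2 and A3 are made up for illustration; the slides give neither.

```python
def affixes_validated(a1, a2, a3, a, b):
    """True if both ratio conditions hold for the sets A1, A2 and A3."""
    known_or_stem = len(a1) + len(a2)
    total = known_or_stem + len(a3)
    if known_or_stem == 0 or total == 0:
        return False  # the ratios are undefined, so nothing is validated
    return known_or_stem / total >= a and len(a1) / known_or_stem >= b

# Hypothetical assignment: 2 known prefixes, 1 potential stem, 2 new prefixes.
a1, a2, a3 = {"re", "dis"}, {""}, {"fully-", "well-"}
ok_loose = affixes_validated(a1, a2, a3, a=0.5, b=0.5)   # 3/5 >= 0.5 and 2/3 >= 0.5
ok_strict = affixes_validated(a1, a2, a3, a=0.8, b=0.5)  # 3/5 < 0.8
```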

SLIDE 15

Step 4: Selection of the best segmentation

[Diagram: candidate segments for one word, with their frequencies]

auto (41), o (12,194), transplant (40), transplantation (12), transplanta (16), ation (737), tion (103)

◮ The most frequent segment is chosen when given a choice
◮ Some frequency and morphotactic constraints are verified
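
One way to read "the most frequent segment is chosen when given a choice" is a left-to-right greedy walk over the candidate segmentations, sketched below. The three candidate segmentations are an assumption built from the segments on the slide, and the additional frequency and morphotactic constraints are not modelled.

```python
def select_segmentation(candidates, freq):
    """Greedy choice: at each step keep the most frequent next segment."""
    remaining = [tuple(c) for c in candidates]
    chosen = []
    while any(len(c) > len(chosen) for c in remaining):
        options = {c[len(chosen)] for c in remaining if len(c) > len(chosen)}
        chosen.append(max(options, key=lambda s: freq.get(s, 0)))
        # Keep only the candidates that agree with the choices made so far.
        remaining = [c for c in remaining if c[:len(chosen)] == tuple(chosen)]
    return chosen

# Frequencies from the slide; the candidate list itself is hypothetical.
freq = {"auto": 41, "transplant": 40, "transplantation": 12,
        "transplanta": 16, "ation": 737, "tion": 103}
candidates = [
    ["auto", "transplant", "ation"],
    ["auto", "transplanta", "tion"],
    ["auto", "transplantation"],
]
best = select_segmentation(candidates, freq)
```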

SLIDE 17

Step 5 (optional): Application of the morphemic segments to a new data set

◮ For each word, select segments so that the total cost is minimal
◮ Cost functions used:
  ◮ Method 1:
    \[ \mathrm{cost}_1(s_i) = -\log \frac{f(s_i)}{\sum_i f(s_i)} \]
  ◮ Method 2:
    \[ \mathrm{cost}_2(s_i) = -\log \frac{f(s_i)}{\max_i [f(s_i)]} \]
  where:
  ◮ s_i = morphemic segment
  ◮ f(s_i) = frequency of segment s_i
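
A small sketch of a minimal-cost segment selection under the two cost functions. The slides do not say how the minimum is searched for; the dynamic programme below is one standard option and is an assumption, as are the function names and the toy frequencies.

```python
import math

def segment_costs(freq, method=1):
    """cost1 normalises by the sum of frequencies, cost2 by the maximum."""
    denom = sum(freq.values()) if method == 1 else max(freq.values())
    return {seg: -math.log(f / denom) for seg, f in freq.items()}

def min_cost_segmentation(word, cost):
    """Cheapest way to cover the word with known segments (dynamic programming).
    Returns (total_cost, segments), or (inf, None) if no cover exists."""
    n = len(word)
    best = [(0.0, [])] + [(math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(end):
            prev_cost, prev_segs = best[start]
            seg = word[start:end]
            if prev_segs is None or seg not in cost:
                continue
            total = prev_cost + cost[seg]
            if total < best[end][0]:
                best[end] = (total, prev_segs + [seg])
    return best[n]

# Toy frequencies, for illustration only.
freq = {"sulk": 5, "ing": 100, "s": 50, "ulk": 1}
cost1 = segment_costs(freq, method=1)
cost2 = segment_costs(freq, method=2)
total1, segs1 = min_cost_segmentation("sulking", cost1)
total2, segs2 = min_cost_segmentation("sulking", cost2)
```

Since the maximum frequency never exceeds the sum of all frequencies, cost2 is never larger than cost1 for a given segment, so Method 2 penalises each additional segment less.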

SLIDE 18

Results for competition 1: Precision

Precision (%)   English   Finnish   German   Turkish
Method 1        72.0      76.0      63.2     78.2
Method 2        61.6      59.6      49.1     73.7

◮ Method 1 > Method 2

SLIDE 19

Results for competition 1: Recall

Recall (%)      English   Finnish   German   Turkish
Method 1        52.5      25.0      37.7     10.9
Method 2        60.0      40.4      57.4     14.8

◮ Method 2 > Method 1
◮ Low recall in Turkish

SLIDE 20

Results for competition 1: F-measure

F-measure (%)   English   Finnish   German   Turkish
Method 1        60.7      37.6      47.2     19.2
Method 2        60.8      48.2      52.9     24.6

◮ Method 2 > Method 1
◮ Low F-measure in Turkish
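
As a consistency check, the reported F-measures can be recomputed from the precision and recall figures of the two previous slides, assuming the F-measure is the usual harmonic mean of precision and recall. Small deviations are expected because the inputs are rounded to one decimal.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) in percent, copied from the slides.
reported = {
    ("Method 1", "English"): (72.0, 52.5, 60.7),
    ("Method 1", "Finnish"): (76.0, 25.0, 37.6),
    ("Method 1", "German"): (63.2, 37.7, 47.2),
    ("Method 1", "Turkish"): (78.2, 10.9, 19.2),
    ("Method 2", "English"): (61.6, 60.0, 60.8),
    ("Method 2", "Finnish"): (59.6, 40.4, 48.2),
    ("Method 2", "German"): (49.1, 57.4, 52.9),
    ("Method 2", "Turkish"): (73.7, 14.8, 24.6),
}

# Largest gap between recomputed and reported values, in percentage points.
max_gap = max(abs(f_measure(p, r) - f) for p, r, f in reported.values())
```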

SLIDE 21

Results for competition 2: Tfidf weighting

[Bar chart: Tfidf, AP x 100; five series per language, rows in the order the values appear on the slide]

English   Finnish   German
27.8      35.6      35.0
27.8      40.2      37.8
27.8      39.0      37.2
26.7      39.8      37.3
26.8      38.1      37.0

Series: Dummy, and Method 1 and Method 2, each without new and with new

SLIDE 22

Results for competition 2: Okapi BM 25 weighting

[Bar chart: Okapi BM 25, AP x 100; five series per language, rows in the order the values appear on the slide]

English   Finnish   German
31.2      32.7      32.3
38.8      41.8      46.1
39.0      46.8      47.3
39.2      44.2      46.8
39.4      49.1      46.2

Series: Dummy, and Method 1 and Method 2, each without new and with new

SLIDE 25

Challenges in unsupervised morpheme analysis

◮ Objectives of Morpho Challenge 2007: unsupervised morpheme analysis
  ⇒ more complex than segmentation of words into sub-units
◮ Problems to be solved:
  ◮ allomorphy: different forms for the same morpheme
    oxen = ox +PL and flies = fly_N +PL
  ◮ homography: same form for different morphemes
    fly (noun = insect) vs. fly (verb)
◮ What can be solved by the system in its current state?

SLIDE 28

How well does the system disambiguate cross-category homography?

Examples in English

ship as a suffix vs. ship as a stem

◮ censor ship
◮ ship wreck
◮ !!!! space ship s !!!!

Analysis of the results

+ Morphotactic constraints prevent a suffix from occurring at the beginning of a word
– The most frequent segments are favoured when several morpheme categories are morphotactically plausible
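
The interplay of the two points can be illustrated with a toy labelling rule: the suffix label is ruled out at the start of a word, and otherwise the most frequent category wins. The category frequencies for "ship" are invented for illustration and the function is a hypothetical sketch, not the system's labelling procedure.

```python
def label_segment(segment, position, category_freq):
    """Pick the most frequent category that is allowed at this position."""
    options = dict(category_freq[segment])
    if position == 0:
        # Morphotactic constraint: no suffix at the beginning of a word.
        options.pop("suffix", None)
    return max(options, key=options.get)

# Invented frequencies: "ship" is seen more often as a suffix than as a stem.
category_freq = {"ship": {"stem": 40, "suffix": 120}}

first = label_segment("ship", 0, category_freq)   # as in "ship wreck"
later = label_segment("ship", 1, category_freq)   # as in "censor ship" or "space ship s"
```

Under these assumptions "ship" is labelled as a stem word-initially, but as a suffix everywhere else, including in "space ship s", where both categories are morphotactically plausible and frequency alone decides.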

SLIDE 29

Future work

◮ Variable morphotactic constraints
◮ Take paradigmatic relationships between affixes into account
◮ Need for corpus-derived information to:
  1. Improve the results obtained at several stages of the algorithm
  2. Be able to relax some constraints
  3. Achieve finer-grained morpheme labelling

SLIDE 30

Thank you!