SLIDE 1

Suffix Stripping Sabrina Galasso & Eyal Schejter Introduction

Information Retrieval Suffix Stripping Evaluation

Algorithm

Notations Rules

Further Issues References

An Algorithm for Suffix Stripping

Porter (1980) Sabrina Galasso & Eyal Schejter

Tübingen University

April 30, 2015

SLIDE 2

An Algorithm for Suffix Stripping

Introduction
◮ Information Retrieval
◮ Suffix Stripping
◮ Evaluation
Algorithm
◮ Notations
◮ Rules
Further Issues

SLIDE 3

What is Information Retrieval?

An Information Retrieval (IR) system matches user queries to documents stored in a database.

— (Frakes, 1992)

◮ user queries: formal statements of information needs
◮ documents: data objects, usually textual (may also contain other types of data such as photographs or graphs)
◮ database: documents are not necessarily stored directly in the IR database, but can be represented by surrogates
◮ surrogate: the representation of a document (e.g., including only title, author and abstract of an article)

SLIDE 4

Why is it useful to remove suffixes from words?

◮ goal: improvement of information retrieval (IR) performance
◮ idea: terms with a common stem will usually have similar meanings → conflation of term groups
◮ result: reduction of the total number of terms → reduction of the size and complexity of the data

SLIDE 6

Suffix stripping strategies

◮ Suffix stripping for IR purposes aims at removing suffixes from a word's stem
  stem: part of a word to which derivational or inflectional affixes are attached (cf. class slides)
◮ morphological analysers can be implemented¹
  ◮ using a mapping table (works well for languages such as English or Chinese)
  ◮ algorithmically (more suitable for complex morphologies)

¹ Different implementations require different input, such as a stem dictionary and/or a suffix list.

SLIDE 7

Suffix stripping strategies

Algorithms can be iterative and/or use the principle of longest-match assignment (Lennon et al., 1981):

◮ iteration: suffixes are often joined to a stem one after another → algorithms can remove suffixes in a similar manner
◮ longest-match assignment: Only one iteration is involved. If more than one suffix matches the end of a word, the longest one is selected. (Relativistic → Relativ, if the suffixes -istic and -ic are given)
  ◮ easier to program
  ◮ requires a larger suffix dictionary
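
The longest-match principle can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `strip_longest_suffix` is hypothetical, the suffix list is the two-item list from the slide, and no conditions on the remaining stem are checked.

```python
def strip_longest_suffix(word, suffixes):
    """Remove the longest suffix in `suffixes` that matches the end of `word`.

    Hypothetical helper for illustration: a single pass, no context checks.
    """
    # Keep only suffixes that match and leave a non-empty stem.
    matches = [s for s in suffixes if word.endswith(s) and len(s) < len(word)]
    if not matches:
        return word
    longest = max(matches, key=len)
    return word[: len(word) - len(longest)]


# Both -istic and -ic match; the longer one is the one removed.
print(strip_longest_suffix("relativistic", ["istic", "ic"]))
```

With only -ic in the list, the same call would stop at "relativist", which is why a longest-match stemmer needs every suffix combination in its dictionary.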

SLIDE 8

Suffix stripping strategies

Algorithms can be context-free or context-sensitive (Lennon et al., 1981):

◮ context-sensitive: restrictions are placed on the removal/replacement of each of the suffixes → involves an expensive development of a comprehensive set of context-sensitive rules
◮ context-free: any word ending that matches a suffix is stripped → simpler to develop and may also be more efficient at run time

SLIDE 9

Suffix stripping strategies

Algorithms can include recoding rules, which are applied after the actual stemming steps (Lennon et al., 1981):

Example:
Stemming: forgetting → forgett
Recoding rule: "remove one of any such doublings at the end of a stem"
forgett → forget

An alternative to recoding rules is a partial matching procedure at search time.

SLIDE 10

The presented approach

◮ Input: an explicit list of suffixes and, for each suffix, a criterion under which it can be removed
  e.g., the past tense marking suffix -ed should only be removed if the stem contains at least one vowel → context-sensitive
◮ The algorithm follows the principle of longest-match assignment and contains recoding rules.
◮ It was developed to improve the performance of IR systems.

SLIDE 11

The presented approach

Some technical issues:

◮ 400 lines of BCPL code (Basic Combined Programming Language) → relatively small → simple to understand and to rewrite in another programming language
◮ processes 10,000 terms in about 8.1 seconds on the IBM 370/165
◮ processes 10,000 terms in about 0.13 seconds on the i7-4500 CPU 1.80GHz (just using one core)

SLIDE 12

Points to Keep in Mind - I

Remember the goal!

◮ The goal is to improve IR performance.
◮ Not all linguistically justified decisions improve IR.
◮ The criterion becomes a semantic one:
  ◮ What is the document about?
  ◮ e.g.: connection & connections, relate & relativity

SLIDE 13

Points to Keep in Mind - II

Less than 100% success rate

◮ The simplifications made result in errors:
  ◮ wand & wander are conflated (like sand & sander)
  ◮ probe & probate
◮ Correction of the errors results in complications:
  ◮ additional rules may affect other correct forms
  ◮ additional rules can affect efficiency
  ◮ frequency in real vocabularies
  ◮ e.g., forms that change the stem: deceive & deceptions, resume & resumption, index & indices

SLIDE 14

Evaluation

◮ Cut-Off Recall method:
  ◮ strip suffixes for each document in the collection (Cranfield 200 (Cleverdon, 1967))
  ◮ strip suffixes for the test queries
  ◮ cut-off recall for the documents retrieved
◮ Vocabulary reduction.

SLIDE 15

Evaluation - Cut-Off Recall

$$\text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{number of relevant items in collection}} \tag{1}$$

$$\text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}} \tag{2}$$

(Voorhees and Harman, 2006)

SLIDE 16

Evaluation - Cut-Off Recall

Example
Collection: {D1, D2, D3, D4, D5, D6, D7, D8, D9, D10}
Relevant documents for query Q1: {D1, D2, D3}
The system retrieved documents for query Q1 (in order): D1, D4, D2, D5, D6, D7, D3, D8, D9, D10

SLIDE 18

Evaluation - Cut-Off Recall

10 Docs    7 Docs     5 Docs     3 Docs     1 Doc
D1         D1         D1         D1         D1
D4         D4         D4         D4
D2         D2         D2         D2
D5         D5         D5
D6         D6         D6
D7         D7
D3         D3
D8
D9
D10
p = 0.3    p = 0.43   p = 0.4    p = 0.67   p = 1
r = 1      r = 1      r = 0.67   r = 0.67   r = 0.33
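
The p and r values per cut-off can be recomputed with a short Python sketch. The function name `precision_recall_at` is hypothetical; the ranking and the relevant set are the ones from the example for query Q1.

```python
def precision_recall_at(ranking, relevant, cutoff):
    """Precision and recall over the top `cutoff` documents of a ranking."""
    retrieved = ranking[:cutoff]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved)   # relevant retrieved / total retrieved
    recall = hits / len(relevant)       # relevant retrieved / relevant in collection
    return precision, recall


ranking = ["D1", "D4", "D2", "D5", "D6", "D7", "D3", "D8", "D9", "D10"]
relevant = {"D1", "D2", "D3"}

for cutoff in (10, 7, 5, 3, 1):
    p, r = precision_recall_at(ranking, relevant, cutoff)
    print(f"{cutoff:>2} docs: p = {p:.2f}, r = {r:.2f}")
```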

SLIDE 19

Evaluation - Cut-Off Recall

Figure: System comparison by Prof. C.J. van Rijsbergen (Porter, 1980)

"Clearly the performance is not very different."

SLIDE 20

Evaluation - Vocabulary reduction

◮ A vocabulary of 10,000 unique words is reduced to 6370 distinct forms.
◮ On our own test (Leaves of Grass by Walt Whitman), 10,000 unique words are reduced to 7170 distinct forms.

SLIDE 21

Algorithm Notations & Definitions

Consonants: all letters other than a, e, i, o, u and other than y preceded by a consonant
Vowels: all letters that are not consonants

Examples
◮ In the word toy, the consonants are t and y.
◮ In the word happy, the y is a vowel.
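
The definition translates almost directly into a recursive check. A minimal sketch, assuming lowercase input; treating a word-initial y as a consonant follows from the definition (it has no preceding consonant) and matches Porter's reference implementation. The function name is illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in the sense defined above."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


for w in ("toy", "happy"):
    kinds = "".join("c" if is_consonant(w, i) else "v" for i in range(len(w)))
    print(w, kinds)
```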

SLIDE 22

Algorithm Notations & Definitions

Notation
c   one consonant
C   one or more consonants
v   one vowel
V   one or more vowels

Possible Forms²
◮ CVCV...C (Retrieval)
◮ CVCV...V (Library)
◮ VCVC...C (Information)
◮ VCVC...V (Infinity)

² informal notation

SLIDE 23

Algorithm Notations & Definitions

Abstraction

◮ [C]VCVC...[V]

Regular expression

◮ [C](VC)^m[V]

m ∈ ℕ₀; m is used as a measure for a word's length:

◮ m = 0, e.g.: ε, by, tree
◮ m = 1, e.g.: trouble, trees
◮ m = 2, e.g.: troubles, private
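
One way to compute m is to label each letter as consonant or vowel and count the places where a vowel is directly followed by a consonant, since each such transition closes one VC block. A sketch under the assumption of lowercase input; the names `is_consonant` and `measure` are illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts as a vowel after a consonant)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC blocks in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    # Each vowel -> consonant transition ends one VC block.
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


for w in ("", "by", "tree", "trouble", "trees", "troubles", "private"):
    print(repr(w), measure(w))
```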

SLIDE 25

Algorithm Notations & Definitions

How long is each of the following words?

◮ finite: m = 2
◮ state: m = 1
◮ morphology: m = 3
◮ supercalifragilisticexpialidocious³: m = 13

³ Adjective - used as a nonsense word by children to express approval or to represent the longest word in English (Dictionary.com, n.d.)

SLIDE 26

Algorithm Notations & Definitions

The algorithm contains rules for removing or replacing suffixes:

(condition) S1 → S2

If the stem⁴ satisfies the condition, the rule is applied.

Example: (m > 1) EMENT → ∅
replacement → replac
basement → basement

⁴ NB: the term stem is here used to denote the substring word-S1 (the word without the suffix S1), which does not necessarily correspond to our class terminology
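
A rule of the form (condition) S1 → S2 can be sketched as a function that takes the condition as a predicate on the stem. The names `apply_rule`, `is_consonant` and `measure` are illustrative, and lowercase input is assumed.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts as a vowel after a consonant)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC blocks in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


def apply_rule(word, s1, s2, condition):
    """Apply (condition) S1 -> S2; the condition is tested on word minus S1."""
    if not word.endswith(s1):
        return word
    stem = word[: len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


# (m > 1) EMENT -> (empty)
for w in ("replacement", "basement"):
    print(w, "->", apply_rule(w, "ement", "", lambda stem: measure(stem) > 1))
```

The stem "replac" has m = 2, so the suffix is removed; "bas" has m = 1, so "basement" is left unchanged.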

SLIDE 27

Algorithm Notations & Definitions

Conditions can furthermore contain boolean expressions (and, or, not) and the following notations:

*S    the stem ends with the letter s (and similarly for other letters)
*v*   the stem contains a vowel
*d    the stem ends with a double consonant (e.g., -tt, -ss)
*o    the stem ends with cvc, where the second c is not w, x or y

Examples:
a) (m > 1 and (*S or *T))
b) (*d and not (*L or *S or *Z))
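
The condition notations map onto small predicates. The sketch below assumes lowercase input; all function names are illustrative, and the *o check excludes w, x and y as in Porter (1980).

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts as a vowel after a consonant)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC blocks in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends with cvc, where the second c is not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def condition_a(stem):
    """(m > 1 and (*S or *T))"""
    return measure(stem) > 1 and stem.endswith(("s", "t"))


def condition_b(stem):
    """(*d and not (*L or *S or *Z))"""
    return ends_double_consonant(stem) and not stem.endswith(("l", "s", "z"))


print(condition_a("adopt"), condition_b("hopp"), ends_cvc("fil"))
```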

SLIDE 29

Algorithm Notations & Definitions

1. Which of the following strings satisfy condition a?
   (m > 1 and (*S or *T))
   ◮ manifest, is, adopt, carrot

2. Which of the following strings satisfy condition b?
   (*d and not (*L or *S or *Z))
   ◮ stroll, scruff, hajj, pass

3. Write a condition that holds for the words in list A and excludes all the words in list B.
   A: ixktt, fehss, sqehwtt, off, izz
   B: sff, abke, werel, kll, rtt
   ◮ (*d and *v*)

SLIDE 30

Algorithm – Rules

◮ The algorithm consists of 5 steps:
  ◮ The first step handles inflection
  ◮ Steps 2-4 handle derivation
  ◮ The last step adds corrections

Remember: stem → derivation → inflection

◮ In a set of rules within the same step, only the rule with the longest match is applied

SLIDE 31

Algorithm – Rules

Step 1a
SSES → SS              caresses → caress
IES → I                ponies → poni
                       ties → ti
SS → SS                caress → caress
S → ∅                  cats → cat

Step 1b
(m > 0) EED → EE       feed → feed
                       agreed → agree
(*v*) ED → ∅           plastered → plaster
                       bled → bled
(*v*) ING → ∅          motoring → motor
                       sing → sing
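
Step 1a has no conditions, so it reduces to trying the suffixes from longest to shortest and applying the first one that matches. A minimal sketch, assuming lowercase input; the function name `step_1a` is illustrative.

```python
# Step 1a rules, ordered so that the longest matching suffix is found first.
STEP_1A_RULES = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))


def step_1a(word):
    """Apply the single longest-matching Step 1a rule."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", step_1a(w))
```

The rule SS → SS looks redundant, but because only the longest match is applied, it is what prevents S → ∅ from turning "caress" into "cares".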

SLIDE 32

Algorithm – Rules

If the second or third rule of Step 1b (removal of the verbal -ed and -ing suffixes) was successful, the following is done:

AT → ATE                                 conflat(ed) → conflate
BL → BLE                                 troubl(ing) → trouble
IZ → IZE                                 siz(ed) → size
(*d and not (*L or *S or *Z))            hopp(ing) → hop
  double letter → single letter          tann(ed) → tan
                                         fall(ing) → fall
                                         hiss(ing) → hiss
                                         fizz(ed) → fizz
(m = 1 and *o) ∅ → E                     fail(ing) → fail
                                         fil(ing) → file

(see also Step 5a)
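
Step 1b together with its follow-up rules can be sketched as below. This is an illustration under assumptions, not the reference implementation: input is lowercase, all function names (including the helper `tidy_stem`) are hypothetical, and the *o check excludes w, x and y as in Porter (1980).

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts as a vowel after a consonant)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC blocks in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def tidy_stem(stem):
    """Follow-up rules, applied only after -ed or -ing was removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and not stem.endswith(("l", "s", "z")):
        return stem[:-1]                      # double letter -> single letter
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem


def step_1b(word):
    if word.endswith("eed"):                  # longest match wins over -ed
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return tidy_stem(stem) if contains_vowel(stem) else word
    return word


for w in ("agreed", "plastered", "bled", "hopping", "filing", "failing"):
    print(w, "->", step_1b(w))
```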

SLIDE 33

Algorithm – Rules

◮ Step 1a handles pluralization
◮ Step 1b handles verbal inflection

SLIDE 34

Algorithm – Rules

Step 1c
(*v*) Y → I            happy → happi
                       sky → sky

SLIDE 35

Algorithm – Rules

Step 2
(m > 0) ATIONAL → ATE      relational → relate
(m > 0) TIONAL → TION      conditional → condition
                           rational → rational
(m > 0) ENCI → ENCE        valenci → valence
(m > 0) ANCI → ANCE        hesitanci → hesitance
(m > 0) IZER → IZE         digitizer → digitize
(m > 0) ABLI → ABLE        conformabli → conformable
(m > 0) ALLI → AL          radicalli → radical
(m > 0) ENTLI → ENT        differentli → different
(m > 0) ELI → E            vileli → vile
(m > 0) OUSLI → OUS        analogousli → analogous
(m > 0) IZATION → IZE      vietnamization → vietnamize

SLIDE 36

Algorithm – Rules

(m > 0) ATION → ATE        predication → predicate
(m > 0) ATOR → ATE         operator → operate
(m > 0) ALISM → AL         feudalism → feudal
(m > 0) IVENESS → IVE      decisiveness → decisive
(m > 0) FULNESS → FUL      hopefulness → hopeful
(m > 0) OUSNESS → OUS      callousness → callous
(m > 0) ALITI → AL         formaliti → formal
(m > 0) IVITI → IVE        sensitiviti → sensitive
(m > 0) BILITI → BLE       sensibiliti → sensible
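
Because every Step 2 rule shares the condition m > 0, the step can be written as a table lookup: pick the longest suffix that matches, then test the condition on the remaining stem. A sketch assuming lowercase input; the names `STEP_2_RULES` and `step_2` are illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts as a vowel after a consonant)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC blocks in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


# The Step 2 rules listed above, as suffix -> replacement.
STEP_2_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent", "eli": "e",
    "ousli": "ous", "ization": "ize", "ation": "ate", "ator": "ate",
    "alism": "al", "iveness": "ive", "fulness": "ful", "ousness": "ous",
    "aliti": "al", "iviti": "ive", "biliti": "ble",
}


def step_2(word):
    """Apply the longest-matching Step 2 rule if its stem has m > 0."""
    matches = [s for s in STEP_2_RULES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)            # only the longest match is tried
    stem = word[: len(word) - len(suffix)]
    return stem + STEP_2_RULES[suffix] if measure(stem) > 0 else word


for w in ("relational", "conditional", "rational", "vietnamization"):
    print(w, "->", step_2(w))
```

For "rational" the longest match is ATIONAL, whose stem "r" has m = 0; since no shorter rule is tried afterwards, the word is left unchanged.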

SLIDE 37

Algorithm – Rules

Step 3
(m > 0) ICATE → IC         triplicate → triplic
(m > 0) ATIVE → ∅          formative → form
(m > 0) ALIZE → AL         formalize → formal
(m > 0) ICITI → IC         electriciti → electric
(m > 0) ICAL → IC          electrical → electric
(m > 0) FUL → ∅            hopeful → hope
(m > 0) NESS → ∅           goodness → good

SLIDE 38

Algorithm – Rules

◮ Step 2 (m > 0) FULNESS → FUL
◮ Step 3 (m > 0) FUL → ∅

Why don’t we have the following rule?

◮ Step 2 (m > 0) NESSFUL → NESS

SLIDE 39

Algorithm – Rules

Step 4
(m > 1) AL → ∅                      revival → reviv
(m > 1) ANCE → ∅                    allowance → allow
(m > 1) ENCE → ∅                    inference → infer
(m > 1) ER → ∅                      airliner → airlin
(m > 1) IC → ∅                      gyroscopic → gyroscop
(m > 1) ABLE → ∅                    adjustable → adjust
(m > 1) IBLE → ∅                    defensible → defens
(m > 1) ANT → ∅                     irritant → irrit
(m > 1) EMENT → ∅                   replacement → replac
(m > 1) MENT → ∅                    adjustment → adjust
(m > 1) ENT → ∅                     dependent → depend
(m > 1 and (*S or *T)) ION → ∅      adoption → adopt

SLIDE 40

Algorithm – Rules

(m > 1) OU → ∅                      homologou → homolog
(m > 1) ISM → ∅                     communism → commun
(m > 1) ATE → ∅                     activate → activ
(m > 1) ITI → ∅                     angulariti → angular
(m > 1) OUS → ∅                     homologous → homolog
(m > 1) IVE → ∅                     effective → effect
(m > 1) IZE → ∅                     bowdlerize → bowdler

SLIDE 41

Algorithm – Rules

What is the difference between m > 0 and m > 1?

          valency     hesitancy
Step 1    valenci     hesitanci
Step 2    valence     hesitance
Step 3    "           "
Step 4    valence     hesit

SLIDE 42

Algorithm – Rules

Step 5a
(m > 1) E → ∅                       probate → probat
                                    rate → rate
(m = 1 and not *o) E → ∅            cease → ceas

Step 5b
(m > 1 and *d and *L)               controll → control
  double letter → single letter     roll → roll

(see also Step 1b)
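
The two corrections of Step 5 can be sketched as follows, assuming lowercase input. The function names are illustrative, and the *o check excludes w, x and y as in Porter (1980).

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts as a vowel after a consonant)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC blocks in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


def ends_cvc(stem):
    """*o: the stem ends with cvc, where the second c is not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def step_5a(word):
    """Drop a final e if m > 1, or if m = 1 and the stem does not end in cvc."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


def step_5b(word):
    """(m > 1 and *d and *L): reduce a final double l to a single l."""
    if measure(word) > 1 and word.endswith("ll"):
        return word[:-1]
    return word


for w in ("probate", "rate", "cease", "controll", "roll"):
    print(w, "->", step_5b(step_5a(w)))
```

The "not *o" test in Step 5a mirrors the (m = 1 and *o) ∅ → E rule after Step 1b: a stem such as "rat" keeps its final e, so "rate" is not reduced to "rat".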

SLIDE 43

◮ Can you find an example of what will go wrong if we don't include step 1c?
◮ Follow the algorithm for the following word and write down the resulting string: visualizations
◮ Are there additional difficulties when considering such an algorithm for the German morphology?
◮ Could you think of a parallel system for German? Name some rules it could include.

SLIDE 44

Further Issues

"Under-stemming is a fault, but by itself it will not degrade the performance of an IR system. Because of under-stemming words may fail to conflate that ought to have conflated, but you are, in a sense, no worse off than you were before. Mis-stemming is more serious, but again mis-stemming does not really matter unless it leads to false conflations, and that frequently does not happen."

— Porter

SLIDE 45

Further Issues

◮ Prefixes
◮ Named entities like "Cats", "Dallas", and "Mary Poppins"
◮ Irregular inflections (e.g., copula, functional words)
◮ Exceptions in suffix ordering (derivation → inflection), e.g., lovingly, devotedness
◮ Attached suffixes (e.g., particle words in Italian):
  mandargli = mandare + gli (to send + to him)
  mandarglielo = mandare + gli + lo (to send + it + to him)
  (derivation → inflection → attached suffixes)

SLIDE 46

Porter Algorithm implementations

The suffix stripping algorithm was extended to various languages (Willett, 2006):

◮ Germanic (English, Dutch, German)
◮ Romance (Italian, French, Portuguese, Spanish)
◮ Scandinavian (Danish, Norwegian, Swedish)
◮ Finnish
◮ Russian

These different stemmers are described in a high-level programming language for stemming called Snowball (Porter and Boulton, 2006).

SLIDE 47

References

Andrews, K. (1971). The development of a fast conflation algorithm for English. Computer Laboratory, University of Cambridge.

Cleverdon, C. (1967). The Cranfield tests on index language devices. In Aslib Proceedings, volume 19, pages 173–194. MCB UP Ltd.

Dictionary.com (n.d.). supercalifragilisticexpialidocious. Retrieved April 27, 2015, from dictionary.com.

Frakes, W. (1992). Introduction to information storage and retrieval systems. Space, 14:10.

Görz, G. and Paulus, D. (1988). A finite state approach to German verb morphology. In Proceedings of the 12th Conference on Computational Linguistics, Volume 1, pages 212–215. Association for Computational Linguistics.

Lennon, M., Peirce, D. S., Tarry, B. D., and Willett, P. (1981). An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(4):177–183.

Porter, M. and Boulton, R. (2006). Snowball stemming algorithm. Implementations available at http://snowball.tartarus.org.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Voorhees, E. M. and Harman, D. (2006). Common evaluation measures. In The Twelfth Text Retrieval Conference (TREC 2003), pages 500–255.

Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3):219–223.