 
An Algorithm for Suffix Stripping
Porter (1980)

Sabrina Galasso & Eyal Schejter
Tübingen University
April 30, 2015

An Algorithm for Suffix Stripping
◮ Introduction
  ◮ Information Retrieval
  ◮ Suffix Stripping
  ◮ Evaluation
◮ Algorithm
  ◮ Notations
  ◮ Rules
◮ Further Issues

What is Information Retrieval?
An Information Retrieval (IR) system matches user queries to documents stored in a database. (Frakes, 1992)
◮ user queries: formal statements of information needs
◮ documents: data objects, usually textual (may also contain other types of data such as photographs or graphs)
◮ database: documents are not necessarily stored directly in the IR database, but can be represented by surrogates
◮ surrogate: the representation of a document (e.g., including only the title, author and abstract of an article)

Why is it useful to remove suffixes from words?
◮ goal: improvement of information retrieval (IR) performance
◮ idea: terms with a common stem will usually have similar meanings → conflation of term groups
◮ result: reduction of the total number of terms → reduction of the size and complexity of the data

Suffix stripping strategies
◮ Suffix stripping for IR purposes aims at removing suffixes so that only the word's stem remains
  stem: part of a word to which derivational or inflectional affixes are attached (cf. class slides)
◮ morphological analysers can be implemented [1]
  ◮ using a mapping table (works well for languages such as English or Chinese)
  ◮ algorithmically (more suitable for complex morphologies)
[1] Different implementations require different input, such as a stem dictionary and/or a suffix list.

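The mapping-table option can be pictured as a plain dictionary lookup. The sketch below is only an illustration: the table entries and the name `table_stem` are hypothetical and not taken from any real stem dictionary.

```python
# Hypothetical mapping table from word forms to stems (illustrative entries only).
STEM_TABLE = {
    "connection": "connect",
    "connections": "connect",
    "connected": "connect",
    "connecting": "connect",
}


def table_stem(word: str) -> str:
    """Return the stem listed in the table, or the word unchanged if it is absent."""
    return STEM_TABLE.get(word.lower(), word)


print(table_stem("Connections"))
print(table_stem("wander"))
```

A table like this needs one entry per word form, which is why the algorithmic option scales better to languages with many forms per stem.
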
Suffix stripping strategies
Algorithms can be iterative and/or use the principle of longest-match assignment (Lennon et al., 1981):
◮ iteration: suffixes are often joined to a stem one after another → algorithms can remove suffixes in a similar manner
◮ longest-match assignment: only one iteration is involved. If more than one suffix matches the end of a word, the longest one is selected (Relativistic → Relativ, if the suffixes -istic and -ic are given).
  ◮ easier to program
  ◮ requires a larger suffix dictionary

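The two strategies can be contrasted in a few lines. This is a minimal sketch, not the algorithm of Lennon et al. or Porter: the function names are hypothetical, and the suffix list `["ic", "ist"]` used for the iterative variant is an assumption chosen so that both variants reach the same stem as the slide's example.

```python
def strip_longest(word, suffixes):
    """Longest-match: a single pass that removes the longest matching suffix."""
    matches = [s for s in suffixes if word.endswith(s) and len(word) > len(s)]
    if not matches:
        return word
    longest = max(matches, key=len)
    return word[: -len(longest)]


def strip_iterative(word, suffixes):
    """Iterative: keep removing any matching suffix until none matches."""
    changed = True
    while changed:
        changed = False
        for s in suffixes:
            if word.endswith(s) and len(word) > len(s):
                word = word[: -len(s)]
                changed = True
                break
    return word


# Longest-match needs the combined suffix "istic" in its dictionary.
print(strip_longest("relativistic", ["istic", "ic"]))
# The iterative variant gets by with the shorter building blocks.
print(strip_iterative("relativistic", ["ic", "ist"]))
```

The longest-match dictionary has to list every suffix combination explicitly, which is the "larger suffix dictionary" cost named above.
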
Suffix stripping strategies
Algorithms can be context-free or context-sensitive (Lennon et al., 1981):
◮ context-sensitive: restrictions are placed on the removal/replacement of each of the suffixes
  → involves an expensive development of a comprehensive set of context-sensitive rules
◮ context-free: any word ending that matches a suffix is stripped
  → simpler to develop and may also be more efficient at run time

Suffix stripping strategies
Algorithms can include recoding rules, which are applied after the actual stemming steps (Lennon et al., 1981).
Example:
  Stemming: forgetting → forgett
  Recoding rule: "remove one of any such doublings at the end of a stem"
  forgett → forget
An alternative to recoding rules is a partial matching procedure at search time.

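The forgetting example can be traced as a stemming step followed by a recoding step. The sketch below is a simplification under stated assumptions: it takes "such doublings" to mean a doubled final consonant, strips only -ing, and uses hypothetical function names.

```python
VOWELS = set("aeiou")


def strip_ing(word):
    """Stemming step (simplified): remove a trailing -ing."""
    if word.endswith("ing") and len(word) > 3:
        return word[:-3]
    return word


def undouble(stem):
    """Recoding step: drop one letter of a doubled final consonant (assumed rule)."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]
    return stem


stem = strip_ing("forgetting")
print(stem)
print(undouble(stem))
```
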
The presented approach
◮ Input: an explicit list of suffixes and, for each suffix, a criterion under which it can be removed
  e.g., the past tense marking suffix -ed should only be removed if the stem contains at least one vowel
  → context-sensitive
◮ The algorithm follows the principle of longest-match assignment and contains recoding rules.
◮ It was developed to improve the performance of IR systems.

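The -ed criterion above can be written as a single guarded rule. This is a minimal sketch of that one condition only, not the full algorithm: the name `strip_ed` is hypothetical, and treating only a, e, i, o, u as vowels is a simplifying assumption.

```python
VOWELS = set("aeiou")  # simplifying assumption: y is never counted as a vowel here


def strip_ed(word):
    """Remove -ed only when the remaining stem contains at least one vowel."""
    if word.endswith("ed"):
        stem = word[:-2]
        if any(ch in VOWELS for ch in stem):
            return stem
    return word


print(strip_ed("plastered"))  # stem "plaster" contains vowels, so -ed is removed
print(strip_ed("bled"))       # stem "bl" has no vowel, so the word is left alone
```

A context-free stripper would reduce "bled" to "bl"; the condition is what makes the rule context-sensitive.
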
The presented approach
Some technical issues:
◮ 400 lines of BCPL code (Basic Combined Programming Language)
  → relatively small
  → simple to understand and to rewrite in another programming language
◮ processes 10,000 terms in about 8.1 seconds on the IBM 370/165
◮ processes 10,000 terms in about 0.13 seconds on the i7-4500 CPU 1.80GHz (just using one core)

Points to Keep in Mind - I
Remember the goal!
◮ The goal is to improve IR performance.
◮ Not all linguistically justified decisions improve IR.
◮ The criterion becomes a semantic one:
  ◮ What is the document about?
  ◮ e.g.: connection & connections
          relate & relativity

Points to Keep in Mind - II
Less than 100% success rate
◮ The simplifications made result in errors:
  ◮ wand & wander are conflated (like sand & sander)
  ◮ probe & probate
◮ Correction of the errors results in complications:
  ◮ additional rules may affect other correct forms
  ◮ additional rules can affect efficiency
  ◮ frequency in real vocabularies
  ◮ e.g., cases where the stem changes:
          deceive & deceptions
          resume & resumption
          index & indices

Evaluation
◮ Cut-Off Recall method:
  ◮ strip suffixes for each document in the collection (Cranfield 200 (Cleverdon, 1967))
  ◮ strip suffixes for the test queries
  ◮ cut-off recall for the documents retrieved
◮ Vocabulary reduction.

Evaluation - Cut-Off Recall

\[ \text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{number of relevant items in collection}} \tag{1} \]

\[ \text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}} \tag{2} \]

(Voorhees and Harman, 2006)

Evaluation - Cut-Off Recall
Example
Collection: { D1, D2, D3, D4, D5, D6, D7, D8, D9, D10 }
Relevant documents for query Q1: { D1, D2, D3 }
The system retrieved documents for query Q1 (in order):
D1, D4, D2, D5, D6, D7, D3, D8, D9, D10

Evaluation - Cut-Off Recall

  10 Docs    7 Docs     5 Docs     3 Docs     1 Doc
  D1         D1         D1         D1         D1
  D4         D4         D4         D4
  D2         D2         D2         D2
  D5         D5         D5
  D6         D6         D6
  D7         D7
  D3         D3
  D8
  D9
  D10
  p = 0.3    p = 0.43   p = 0.4    p = 0.67   p = 1
  r = 1      r = 1      r = 0.67   r = 0.67   r = 0.33

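The precision and recall values for query Q1 can be recomputed directly from the ranking and the relevant set given in the example. The helper name `precision_recall_at` is hypothetical; the data are the ones from the example.

```python
# Ranking and relevance judgements for query Q1, as given in the example.
ranking = ["D1", "D4", "D2", "D5", "D6", "D7", "D3", "D8", "D9", "D10"]
relevant = {"D1", "D2", "D3"}


def precision_recall_at(k, ranking, relevant):
    """Precision and recall when only the top-k retrieved documents are kept."""
    retrieved = ranking[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved), hits / len(relevant)


for k in (10, 7, 5, 3, 1):
    p, r = precision_recall_at(k, ranking, relevant)
    print(f"cut-off {k:2d}: p = {p:.2f}, r = {r:.2f}")
```

Lowering the cut-off trades recall for precision: at 1 document precision is 1 but only one of the three relevant documents has been found.
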
Evaluation - Cut-Off Recall
Figure: System comparison by Prof. C.J. van Rijsbergen (Porter, 1980)
"Clearly the performance is not very different."