Putting Suffix-Tree-Stemming to Work Benno Stein Martin Potthast - PowerPoint PPT Presentation

Putting Suffix-Tree-Stemming to Work Benno Stein Martin Potthast Bauhaus University Weimar Paderborn University Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Index terms Text with markups [Reuters] : <TEXT> <TITLE>CHRYSLER> DEAL LEAVES UNCERTAINTY FOR AMC WORKERS</TITLE> <AUTHOR> By Richard Walker, Reuters</AUTHOR> <DATELINE> DETROIT, March 11 - </DATELINE><BODY>Chrysler Corp’s 1.5 billion dlr bid to takeover American Motors Corp; AMO> should help bolster the small automaker’s sales, but it leaves the future of its 19,000 employees in doubt, industry analysts say. It was "business as usual" yesterday at the American ... Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Index terms Raw text: chrysler deal leaves uncertainty for amc workers by richard walker reuters detroit march 11 chrysler corp s 1 5 billion dlr bid to takeover american motors corp should help bolster the small automaker s sales but it leaves the future of its 19 000 employees in doubt industry analysts say it was business as usual yesterday at the american Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Index terms Stop words emphasized: chrysler deal leaves uncertainty for amc workers by richard walker reuters detroit march 11 chrysler corp s 1 5 billion dlr bid to takeover american motors corp should help bolster the small automaker s sales but it leaves the future of its 19 000 employees in doubt industry analysts say it was business as usual yesterday at the american Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Index terms After stemming: chrysler deal leav uncertain amc work richard walk reut detroit takeover american motor help bols automak sal leav futur employ doubt industr analy business usual yesterday Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Index terms After stemming: chrysler deal leav uncertain amc work richard walk reut detroit takeover american motor help bols automak sal leav futur employ doubt industr analy business usual yesterday Stemming algorithms remove inflectional and morphological affixes. connect connects connected connecting Introduction connection Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Index terms After stemming: chrysler deal leav uncertain amc work richard walk reut detroit takeover american motor help bols automak sal leav futur employ doubt industr analy business usual yesterday Stemming algorithms remove inflectional and morphological affixes. connect connects connected connecting Introduction connection Stemming + make text operations less dependent on special word forms Approaches + reduce the dictionary size Evaluation Σ – may merge words that have very different meanings – discard possibly useful information about language use GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Index terms Boolean model� Fuzzy set model� direct usage of� document terms vector space model� probabilistic model� (BIR, NBIR, Poisson, etc.) algebraic model� document-� hidden variables and� inference network model� model concepts generative language model� (statistical language model) suffix model� information on structure text structure model Introduction special linguistic features word class model Stemming Approaches Evaluation linguistic theory [Stein 05]� Σ Retrieval model ∼ document model GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Stemming Approaches 1. Table lookup. To each stem all flections are stored in a hash table. Problem: memory size (consider client-side applications) 2. Successor variety analysis. Morpheme boundaries are found by statistical analyses. Problem: parameter settings, runtime 3. Affix elimination. Rule-based replacement of prefixes and suffixes; the most commonly used approach. Principle: iterative longest match stemming Introduction (a) Removal of the match resulting from the longest precondition. Stemming (b) Exhaustive application of the first step. Approaches (c) Repair of irregularities. Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Stemming Approaches Affix Elimination under Porter Rule type Condition Suffix Replacement Example caresses → caress 1a Null sses ss ponies → poni 1a Null ies i feed → feed 1b (m>0) eed ee agreed → agree ε plastered → plaster 1b (*v*) ed bled → bled motoring → motor ε 1b (*v*) ing sing → sing happy → happi 1c (*v*) y i sky → sky sensibiliti → sensible 2 (m>0) biliti ble Introduction Stemming Approaches number of vocal-consonant-sequences exceeds x (m>x) stem ends with letter S (*S) Evaluation stem contains vocal (*v*) Σ stem ends with cvc where second consonant c �∈ {W, X, Y} (*o) stem ends with two identical consonants (*d) GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Stemming Approaches Affix Elimination under Porter: Weaknesses ❑ difficult to modify: effects of new rules are barely to anticipate ❑ subject to over-generalization: policy/police university/universe organization/organ ❑ several definite generalizations are not covered: European/Europe matrices/matrix machine/machinery ❑ generates stem that are hard to be interpreted: Introduction iteration/iter general/gener Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Stemming Approaches Successor Variety Analysis: Interesting Aspects ❑ The idea of corpus-specific stemming . Corpus dependency is an advantage, if the corpus has a strong topic or application bias. ❑ The idea of language independence . Language independence is essential for multilingual documents or if the language cannot be determined. Stemming Corpus Language approach dependency independence Affix elimination no yes Introduction Variety analysis yes little Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Stemming Approaches Successor Variety Analysis: Realization Suffix tree at letter level: Suffix tree at word level: 1 con 2 nect� tact� 3 1 � d ing� e s � $ 1 1 $ 1 $ $ $ Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Stemming Approaches Successor Variety Analysis: Realization Suffix tree at letter level: Suffix tree at word level: 1 0 boy plays chess too� father plays chess� con plays chess� chess� 2 nect� tact� 3 1 1 1 2 2 too o � o d t ing� e s � $ 1 1 $ 1 $ $ $ 1 1 $ $ $ $ $ $ Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Stemming Approaches Successor Variety Analysis: Realization Suffix tree at letter level: Suffix tree at word level: 1 0 boy plays chess too� father plays chess� con plays chess� chess� 2 nect� tact� 3 1 1 1 2 2 too o � o d t ing� e s � $ 1 1 $ 1 $ $ $ 1 1 $ $ $ $ $ $ How to find good candidates for a stem? Introduction ❑ analysis of degree differences (depending on tree depth) Stemming Approaches ❑ cut-off method, complete word method, entropy method Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Evaluation Caution is advised ; ) ❑ existing reports on the impact of stemming are contradictory ❑ employed analysis tool (among others): clustering But what can be found? 1. improved document model 2. peculiarity of a clustering algorithm 3. . . . Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Evaluation Caution is advised ; ) ❑ existing reports on the impact of stemming are contradictory ❑ employed analysis tool (among others): clustering But what can be found? 1. improved document model 2. peculiarity of a clustering algorithm 3. . . . Introduction A cluster algorithm’s performance depends on various parameters. Stemming Approaches Different cluster algorithms behave differently sensitive to Evaluation document model “improvements”. Σ Baseline? Interpretation? Objectivity? Generalizability? GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Evaluation Caution is advised ; ) An objective way to rank document models is to compare their ability to capture the intrinsic similarity relations of a collection D . Basic idea: 1. construct a similarity graph, G = � V, E, w � 2. measure its conformance to a reference classification 3. analyze improvement/decline under new document model Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Expected Density ¯ ρ Definition Graph G = � V, E, w � | E | = O ( | V | ) [ O ( | V | 2 ) ] ❑ G is called sparse [dense] if ❑ the density θ computes from the equation | E | = | V | θ Introduction Stemming Approaches Evaluation Σ GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Putting Suffix-Tree-Stemming to Work Benno Stein Martin Potthast - PowerPoint PPT Presentation

Putting Suffix-Tree-Stemming to Work Benno Stein Martin Potthast Bauhaus University Weimar Paderborn University Introduction Stemming Approaches Evaluation GFKL 06 Mar. 8th, 2006 Stein/Potthast Index terms Text with markups

Algorithms in Bioinformatics: A Practical Introduction Suffix tree Overview What is suffix

Suffix Trees Construction and Applications Joo Carreira 2008 Outline Why Suffix Trees?

Stemming VSM, session 10 CS6200: Information Retrieval Slides by: Jesse Anderton Stemming A

capitalise Suffix terrorise fertilise ise suffix words are usually just created by simply

Approximate Pattern Matching Using Suffix Tries Hendrik Nigul nigulh@math.ut.ee University of

Sales Tax Affiliate Nexus Stemming From Sales Tax Affiliate Nexus Stemming From Online Business

Suffix tree and Suffix array Karatsuba CS214: Algorithms and Complexity Shanghai Jiao Tong

Lecture 15: Suffix trees, suffix arrays, and their applica8ons

This week, we are going to look at adding words ending in the suffix al. Can you remember what

An Algorithm for Suffix Stripping Evaluation Algorithm Porter (1980) Notations Rules Further

Are Hybrid Physical Designs Important? 1 B+ tree 2 C O L B+ tree 3 ? C O L C O L B+ tree

Suffix tree Build a tree from the text Used if the text is expected to be the same during

61A Lecture 21 Announcements Binary Trees Binary Tree Class 4 Binary Tree Class class

Putting a socially responsible price on carbon Putting a socially responsible price on carbon

Algorithms Theory 15 Text Search (2) Construction of suffix trees Prof. Dr. S. Albers

Tree-sitter @maxbrunsfeld What is Tree-sitter? Why I wrote Tree-sitter What were

Text mining with ngram variables Matthias Schonlau, Ph.D. The most common approach to dealing

Stop w ords SE N TIME N T AN ALYSIS IN P YTH ON Violeta Mishe v a Data Scientist What are stop w

WITH C++ Prof. Amr Goneid AUC Part 13. Abstract Data Types (ADTs) Prof. amr Goneid, AUC 1

Pattern Recognition Part 6: Bandwidth Extension Gerhard Schmidt Christian-Albrechts-Universitt

DCU meets MET: Bengali and Hindi Morpheme Extraction Debasis Ganguly, Johannes Leveling, Gareth

Text Processing CS440 Text processing NLP tasks typically require multiple steps of text

of Geometric Concepts Uri Stemmer Ben-Gurion University joint work with Haim Kaplan, Yishay

Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph Zheng