Slide 1

Introduction Stemming Approaches Evaluation Σ

Putting Suffix-Tree-Stemming to Work

Benno Stein (Bauhaus University Weimar), Martin Potthast (Paderborn University)

GFKL ’06 Mar. 8th, 2006 Stein/Potthast

Slide 2

Index terms

Text with markups

[Reuters]:

<TEXT> <TITLE>CHRYSLER DEAL LEAVES UNCERTAINTY FOR AMC WORKERS</TITLE> <AUTHOR>By Richard Walker, Reuters</AUTHOR> <DATELINE>DETROIT, March 11 - </DATELINE><BODY>Chrysler Corp’s 1.5 billion dlr bid to takeover American Motors Corp should help bolster the small automaker’s sales, but it leaves the future of its 19,000 employees in doubt, industry analysts say. It was "business as usual" yesterday at the American ...

Slide 3

Index terms

Raw text: chrysler deal leaves uncertainty for amc workers by richard walker reuters detroit march 11 chrysler corp s 1 5 billion dlr bid to takeover american motors corp should help bolster the small automaker s sales but it leaves the future of its 19 000 employees in doubt industry analysts say it was business as usual yesterday at the american
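The raw-text step above (strip the markup, lowercase, split off punctuation and digits as separate tokens) can be sketched as follows. The regular expressions are illustrative assumptions, not the authors' actual preprocessing pipeline.

```python
import re

def normalize(sgml_text):
    """Toy version of the raw-text step: drop SGML tags, lowercase,
    and keep only runs of letters/digits as whitespace-separated
    tokens.  (An assumption for illustration, not the authors'
    actual pipeline.)"""
    text = re.sub(r"<[^>]*>", " ", sgml_text)   # strip markup tags
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)     # split off punctuation
    return " ".join(tokens)
```

Note how the apostrophe in "Corp's" and the decimal point in "1.5" produce the isolated tokens "s", "1", "5" seen on the slide.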

Slide 4

Index terms

Stop words emphasized: chrysler deal leaves uncertainty for amc workers by richard walker reuters detroit march 11 chrysler corp s 1 5 billion dlr bid to takeover american motors corp should help bolster the small automaker s sales but it leaves the future of its 19 000 employees in doubt industry analysts say it was business as usual yesterday at the american
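The emphasized stop words are discarded before indexing. A minimal sketch; the tiny stop list here is an assumption for illustration (real systems use lists of a few hundred words).

```python
# Hand-picked toy stop list (an assumption); real lists are far larger.
STOP_WORDS = {"for", "by", "to", "the", "but", "it", "its", "in",
              "s", "was", "as", "at", "should", "of"}

def remove_stop_words(text):
    """Drop tokens that carry little topical content."""
    return " ".join(t for t in text.split() if t not in STOP_WORDS)
```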

Slide 5

Index terms

After stemming: chrysler deal leav uncertain amc work richard walk reut detroit takeover american motor help bols automak sal leav futur employ doubt industr analy business usual yesterday

Slide 6

Index terms

After stemming: chrysler deal leav uncertain amc work richard walk reut detroit takeover american motor help bols automak sal leav futur employ doubt industr analy business usual yesterday

Stemming algorithms remove inflectional and derivational affixes:

connect ← connects, connected, connecting, connection

Slide 7

Index terms

After stemming: chrysler deal leav uncertain amc work richard walk reut detroit takeover american motor help bols automak sal leav futur employ doubt industr analy business usual yesterday

Stemming algorithms remove inflectional and derivational affixes:

connect ← connects, connected, connecting, connection

+ make text operations less dependent on special word forms
+ reduce the dictionary size
– may merge words that have very different meanings
– discard possibly useful information about language use
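The dictionary-size reduction can be made concrete with a toy experiment: map each token to its stem and count the distinct entries. The stem map below is a hand-made, hypothetical stand-in for a real stemmer.

```python
# Hypothetical stem map standing in for a real stemmer (assumption).
STEMS = {"connects": "connect", "connected": "connect",
         "connecting": "connect", "connection": "connect"}

def dictionary(tokens):
    """Distinct index terms after mapping each token to its stem."""
    return {STEMS.get(t, t) for t in tokens}

tokens = ["connect", "connects", "connected", "connecting", "connection"]
# Unstemmed dictionary: 5 entries; stemmed dictionary: 1 ("connect").
```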

Slide 8

Index terms

[Diagram, after [Stein 05]: a taxonomy of retrieval models (Boolean model, fuzzy set model, vector space model, probabilistic model (BIR, NBIR, Poisson, etc.), algebraic model, inference network model, generative language model (statistical language model), suffix model, text structure model, word class model), organized by whether a model uses document terms directly, hidden variables and concepts, information on structure, or special linguistic features from linguistic theory.]

Retrieval model ∼ document model

Slide 9

Stemming Approaches

1. Table lookup.
   For each stem, all inflected forms are stored in a hash table.
   Problem: memory size (consider client-side applications).

2. Successor variety analysis.
   Morpheme boundaries are found by statistical analyses.
   Problem: parameter settings, runtime.

3. Affix elimination.
   Rule-based removal of prefixes and suffixes; the most commonly used approach.

Principle: iterative longest-match stemming

(a) Remove the match resulting from the longest precondition.
(b) Apply the first step exhaustively.
(c) Repair irregularities.

Slide 10

Stemming Approaches

Affix Elimination under Porter

Rule  Condition  Suffix   Replacement  Example
1a    (null)     sses     ss           caresses → caress
1a    (null)     ies      i            ponies → poni
1b    (m>0)      eed      ee           feed → feed, agreed → agree
1b    (*v*)      ed       ε            plastered → plaster, bled → bled
1b    (*v*)      ing      ε            motoring → motor, sing → sing
1c    (*v*)      y        i            happy → happi, sky → sky
2     (m>0)      biliti   ble          sensibiliti → sensible

Conditions:
(m>x)  the number of vowel-consonant sequences exceeds x
(*S)   the stem ends with the letter S
(*v*)  the stem contains a vowel
(*o)   the stem ends with cvc, where the second consonant c ∉ {W, X, Y}
(*d)   the stem ends with two identical consonants
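A partial sketch of the rules in the table above: only the 1a/1b rows are implemented, and 'y' is always treated as a consonant when computing the measure m (both simplifications of Porter's full algorithm).

```python
import re

VOWELS = "aeiou"

def measure(stem):
    """Porter's measure m: a stem has the form [C](VC)^m[V].
    Simplification (assumption of this sketch): 'y' always counts
    as a consonant."""
    pattern = "".join("v" if ch in VOWELS else "c" for ch in stem)
    return len(re.findall(r"v+c+", pattern))

def contains_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def step1(word):
    """Rules 1a and 1b from the table; later rules are omitted."""
    # Rule 1a: plural endings.
    if word.endswith("sses"):
        return word[:-2]                      # sses -> ss
    if word.endswith("ies"):
        return word[:-2]                      # ies  -> i
    # Rule 1b: past tense / progressive endings.
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word  # eed -> ee
    if word.endswith("ed") and contains_vowel(word[:-2]):
        return word[:-2]                      # (*v*) ed  -> eps
    if word.endswith("ing") and contains_vowel(word[:-3]):
        return word[:-3]                      # (*v*) ing -> eps
    return word
```

The condition checks explain the table's examples: "feed" keeps its suffix because the stem "f" has m = 0, and "sing" is untouched because the stem "s" contains no vowel.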

Slide 11

Stemming Approaches

Affix Elimination under Porter: Weaknesses

❑ difficult to modify:
  the effects of new rules are hard to anticipate

❑ subject to over-generalization:
  policy/police, university/universe, organization/organ

❑ several clear generalizations are not covered:
  European/Europe, matrices/matrix, machine/machinery

❑ generates stems that are hard to interpret:
  iteration/iter, general/gener

Slide 12

Stemming Approaches

Successor Variety Analysis: Interesting Aspects

❑ The idea of corpus-specific stemming.
  Corpus dependency is an advantage if the corpus has a strong topic or application bias.

❑ The idea of language independence.
  Language independence is essential for multilingual documents, or if the language cannot be determined.

  Stemming approach   Corpus dependency   Language dependence
  Affix elimination   no                  yes
  Variety analysis    yes                 little

Slide 13

Stemming Approaches

Successor Variety Analysis: Realization

Suffix tree at letter level:

[Figure: letter-level suffix tree over the prefix "con" with branches "nect" and "tact" and leaf edges "s", "ed", "ing"; leaves carry end markers $ and occurrence counts.]
Slide 14

Stemming Approaches

Successor Variety Analysis: Realization

Suffix tree at letter level:

[Figure: as on the previous slide.]

Suffix tree at word level:

[Figure: word-level suffix tree for "father plays chess" and "boy plays chess too"; suffix edges "plays chess", "chess", "too", end markers $, and occurrence counts.]

Slide 15

Stemming Approaches

Successor Variety Analysis: Realization

Suffix tree at letter level: [Figure: as before.]  Suffix tree at word level: [Figure: as before.]

How to find good candidates for a stem?

❑ analysis of degree differences (depending on tree depth)
❑ cut-off method, complete word method, entropy method
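The degree differences in question are successor varieties: how many distinct letters can follow a given prefix in the corpus. A brute-force sketch (a real implementation would walk the suffix tree or a Patricia trie instead of rescanning the word list):

```python
def successor_varieties(words, prefix):
    """Number of distinct letters that follow `prefix` in a word list.

    A jump in successor variety at a tree depth suggests a morpheme
    boundary -- the idea behind successor variety stemming.  This is
    a brute-force sketch; a suffix tree gives the same counts from
    node degrees.
    """
    successors = set()
    for word in words:
        if word.startswith(prefix) and len(word) > len(prefix):
            successors.add(word[len(prefix)])
    return len(successors)

words = ["connects", "connected", "connecting", "connection", "contact"]
# Inside the stem the variety stays low; right after "connect" the
# corpus fans out into s, e, i -- a candidate stem boundary.
```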

Slide 16

Evaluation

Caution is advised ;-)

❑ existing reports on the impact of stemming are contradictory
❑ analysis tool employed (among others): clustering

But what can be found?

1. an improved document model
2. a peculiarity of the clustering algorithm
3. ...

Slide 17

Evaluation

A cluster algorithm’s performance depends on various parameters, and different cluster algorithms are differently sensitive to document model “improvements”. Baseline? Interpretation? Objectivity? Generalizability?

Slide 18

Evaluation

Caution is advised ;-)

An objective way to rank document models is to compare their ability to capture the intrinsic similarity relations of a collection D. Basic idea:

1. construct a similarity graph G = ⟨V, E, w⟩
2. measure its conformance to a reference classification
3. analyze the improvement/decline under a new document model
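Step 1 can be sketched as follows. The choice of cosine similarity over raw term counts is an assumption for illustration; any document-model similarity fits the same graph construction.

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity of two bag-of-words term vectors."""
    dot = sum(d1[t] * d2[t] for t in d1.keys() & d2.keys())
    norm = (math.sqrt(sum(v * v for v in d1.values()))
            * math.sqrt(sum(v * v for v in d2.values())))
    return dot / norm if norm else 0.0

def similarity_graph(docs):
    """Weighted similarity graph G = <V, E, w> over a collection.

    Vertices are document indices; each edge weight is the cosine
    similarity of the two documents' term vectors.
    """
    vectors = [Counter(doc.split()) for doc in docs]
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            edges[(i, j)] = cosine(vectors[i], vectors[j])
    return edges
```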

Slide 19

Expected Density ρ̄

Definition. Let G = ⟨V, E, w⟩ be a graph.

❑ G is called sparse [dense] if |E| = O(|V|) [|E| = O(|V|²)]

❑ the density θ is computed from the equation |E| = |V|^θ

Slide 20

Expected Density ρ̄

Definition. Let G = ⟨V, E, w⟩ be a graph.

❑ G is called sparse [dense] if |E| = O(|V|) [|E| = O(|V|²)]

❑ the density θ is computed from the equation |E| = |V|^θ

❑ with w(G) := Σ_{e∈E} w(e), this extends to weighted graphs:

  w(G) = |V|^θ  ⇔  θ = ln(w(G)) / ln(|V|)

Using θ we assess the density of an induced subgraph Gi of G.
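The weighted density exponent is a two-line computation; a sketch with edges stored as a dict from vertex pairs to weights (a representation assumed for illustration):

```python
import math

def graph_weight(edges):
    """w(G): total edge weight of a weighted graph."""
    return sum(edges.values())

def density_theta(edges, num_vertices):
    """Density exponent theta from w(G) = |V|^theta, i.e.
    theta = ln(w(G)) / ln(|V|)."""
    return math.log(graph_weight(edges)) / math.log(num_vertices)
```

For an unweighted graph (all weights 1) this reduces to the original definition |E| = |V|^θ.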

Slide 21

Expected Density ρ̄

Definition. Let G = ⟨V, E, w⟩ be a graph.

❑ G is called sparse [dense] if |E| = O(|V|) [|E| = O(|V|²)]

❑ the density θ is computed from the equation |E| = |V|^θ

❑ with w(G) := Σ_{e∈E} w(e), this extends to weighted graphs:

  w(G) = |V|^θ  ⇔  θ = ln(w(G)) / ln(|V|)

  Using θ we assess the density of an induced subgraph Gi of G.

❑ a categorization C = {C1, . . . , Ck} induces k subgraphs Gi

➜ expected density  ρ̄(C) = Σ_{i=1}^{k} (|Vi| / |V|) · (w(Gi) / |Vi|^θ)
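The expected density formula translates directly into code. Representing each induced subgraph Gi by a pair (|Vi|, w(Gi)) is an assumption for illustration:

```python
def expected_density(subgraphs, total_vertices, theta):
    """Expected density of a categorization C = {C1, ..., Ck}.

    `subgraphs` is a list of (num_vertices_i, weight_i) pairs, one
    per induced subgraph Gi.  Implements
        rho(C) = sum_i  |Vi|/|V| * w(Gi) / |Vi|**theta
    with theta taken from the whole graph G.
    """
    return sum(
        (n_i / total_vertices) * (w_i / n_i ** theta)
        for n_i, w_i in subgraphs
    )
```

A sanity check: if the categorization puts the whole graph into one cluster, the weighted-average construction yields exactly w(G) / |V|^θ = 1.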

Slide 22

Expected Density ρ̄


Understanding Expected Density

Embedding of a collection under a particular document model.
ρ̄ > 1 [ρ̄ < 1] if the cluster density is larger [smaller] than average.

Slide 24

Expected Density ρ̄


Understanding Expected Density

Consider inter-cluster and intra-cluster similarities.
Effect of a document model that reinforces the structural characteristics within a document collection.

Slide 26

Expected Density ρ̄

Understanding Expected Density

[Plot: expected density ρ̄ (1.0–1.5) versus sample size (200–1000).]

The expected density ρ is a monotonically increasing function of the sample size.


Slide 30

Expected Density ρ̄

Experiments: English Collection

[Plot: expected density ρ̄ (1.0–1.5) versus sample size (200–1000), 5 categories. Stemming: without, Porter.]

Collection: RCV1. Two documents d1, d2 are assigned to the same category if they share the top-level category and the most specific category.

Slide 31

Expected Density ρ̄

Experiments: English Collection

[Plot: expected density ρ̄ (1.0–1.5) versus sample size (200–1000), 5 categories. Stemming: without, Porter, suffix tree.]

A note on reproducibility: meta-information files that describe the compiled test collections are made available upon request.

Slide 32

Expected Density ρ̄

Experiments: German Collection

[Plot: expected density ρ̄ (1.0–1.5) versus sample size (200–1000), 5 categories. Stemming: without, Snowball.]

Collection: compilation of 26,000 documents from 20 German newsgroups.

Slide 33

Expected Density ρ̄

Experiments: German Collection

[Plot: expected density ρ̄ (1.0–1.5) versus sample size (200–1000), 5 categories. Stemming: without, Snowball, suffix tree.]

Slide 34

Expected Density ρ̄

Experiments: German Collection

[Plot: expected density ρ̄ (1.0–1.5) versus sample size (200–1000), 5 categories. Stemming: without, Snowball, suffix tree.]

Stemming can reduce noise.

Slide 35

Expected Density ρ̄

Experiments: German Collection

Where successor variety works:

mechanis-         mus, tisch, che, ch, tischen, men, tisches, ierung, chen
zusammen-         leben, gang, h
zusammenbr-       icht, uch, aut, echen
zusammenfass-     en, ung, t, end
zusammenge-       faßt, baut, zählt, fasst
zusammengesetzt-  en, $
zusammenh-        ängen, ängt, änge
zusammenha-       lten, lt
zusammenhang-     los, es, s, $

Slide 36

Expected Density ρ̄

Experiments: German Collection

Where successor variety works: see the examples on the previous slide,

and where it fails:

schwarz-  arbeit, denker, schild, fahrer, em, en, e, markt, maler, bader, hörer, radler, e, s

Slide 37

Evaluation

A Note on F-Measure Values

(sample size 1000, 10 categories)

Stemming approach   F-min   F-max   F-av.
without             (baseline)
Porter              12%     11%     2%
suffix tree         10%     10%     2%

Slide 38

Evaluation

A Note on F-Measure Values: see the table on the previous slide.

A Note on Runtime

❑ successor variety analysis with suffix trees:
  in O(n) [Ukkonen 1995], or in O(n²) worst case and Θ(n log n) expected, respectively [Giegerich et al.]

❑ successor variety analysis with Pat trees:
  in O(n²); Θ(n log n) may be assumed for short affixes

Slide 39

Summary

❑ Basis: document models with “visible” index terms
❑ Issue: selection, modification, enrichment of index terms
❑ Question: stemming without semantic background

Contribution

❑ efficient implementation of variational stemming with Patricia trees
❑ parameter optimization ⇒ significantly better than [Frakes 1992]
❑ comparison to the Porter stemmer and the Snowball stemmer
❑ algorithm-neutral evaluation method based on ρ̄

Message

❑ the impact of stemming may be over-estimated
❑ generally accepted analysis methods are required

Slide 40

Summary

Related Work

❑ A similar approach can be applied to index construction.

variational n-grams: use words (not letters) as tokens

❑ Issue: collection-specific document model ❑ Motto: “co-occurrence analysis versus Wordnet”
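The "words as tokens" idea is a plain sliding window at the word level; a minimal sketch (no suffix-tree machinery, which the variational variant would add on top):

```python
def word_ngrams(text, n):
    """Word-level n-grams: use words (not letters) as tokens.
    A plain sliding window over the token sequence."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# word_ngrams("father plays chess", 2) -> ["father plays", "plays chess"]
```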

Slide 41

Summary

Related Work

[Plot: expected density ρ̄ (1.0–1.5) versus sample size (200–1000), 5 categories. Additional concepts: without, Wordnet, n-gram.]

Slide 42

References

❑ Bacchin, M., Ferro, N., and Melucci, M. (2002): Experiments to Evaluate a Statistical Stemming Algorithm. CLEF: Cross-Language Evaluation Forum Workshop, Rome, 161–168.

❑ Frakes, W. B. (1984): Term Conflation for Information Retrieval. In: Proceedings of SIGIR ’84, Swinton, UK, 383–389.

❑ Fürnkranz, J. (1998): A Study Using n-gram Features for Text Categorization. Austrian Research Institute for Artificial Intelligence.

❑ Mayfield, J. and McNamee, P. (2003): Single n-gram Stemming. In: Proceedings of SIGIR ’03, Toronto, 415–416.

❑ Porter, M. F. (1980): An Algorithm for Suffix Stripping. Program, 14(3):130–137.
