SLIDE 1
DCU meets MET: Bengali and Hindi Morpheme Extraction
Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
CNGL, School of Computing, Dublin City University, Ireland
Outline
Motivation
Task Description
Bengali Stemming Approach
Hindi Stemming
SLIDE 2
SLIDE 3
Motivation
Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
Example: company, companies → company; hopeful → hope
For information retrieval, indexing surface forms would lead to many mismatches between query terms and index terms extracted from documents
Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms
SLIDE 4
Task Description
Morpheme Extraction Task:
Investigate the effect of morphological analysis/lemmatization/stemming on information retrieval (IR) performance (for Indian languages)
Subtasks:
Subtask 1: manual evaluation of morpheme extraction
Subtask 2: IR evaluation using the proposed morpheme representation as index terms. The evaluation metric is mean average precision (MAP)
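Since Subtask 2 is scored with MAP, a minimal sketch of how that metric is usually computed may help. This is the textbook definition, not the task's official evaluation script; the function names and the toy document IDs are illustrative assumptions.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: mean of precision@k over the
    ranks k at which a relevant document is retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical document IDs, not taken from the task).
toy_runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
print(mean_average_precision(toy_runs))
```

A stemmer raises MAP when it moves relevant documents up the ranking, for instance by letting a query term match an inflected form in the document.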
SLIDE 5
Stemming Approaches
Light vs. aggressive stemming
Rule-based vs. corpus-based stemming
manually created vs. cluster of related words
iteratively remove word suffixes
problem:
overstemming, i.e. removed suffix is too long
e.g. international/intern; news/new
understemming, i.e. removed suffix is too short
e.g. forgetfulness/forgetful
irregular forms
e.g. feet/foot; women/woman
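The trade-off between light and aggressive stemming can be sketched with a toy iterative suffix stripper. The English suffix lists and the minimum stem length below are illustrative assumptions chosen to reproduce the examples above; they are not rules from any real stemmer.

```python
def strip_suffixes(word, suffixes, min_stem_len=3):
    """Repeatedly remove the longest matching suffix until none applies.
    min_stem_len is an assumed guard against stripping a word to nothing."""
    ordered = sorted(suffixes, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


LIGHT_RULES = ["ness", "s"]                        # hypothetical light rule set
AGGRESSIVE_RULES = ["ational", "ness", "ful", "s"]  # hypothetical aggressive rule set

for w in ["forgetfulness", "international", "news", "feet"]:
    print(w, strip_suffixes(w, LIGHT_RULES), strip_suffixes(w, AGGRESSIVE_RULES))
```

With the light rules, "forgetfulness" stops at "forgetful" (understemming), while the aggressive rules cut "international" down to "intern" (overstemming). Neither rule set relates "feet" to "foot", which is why irregular forms need separate handling.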
SLIDE 6
Our Bengali Stemming Approach
Rule-based stemmer created by native speaker
Focus on nouns (most important for IR)
Four categories [Bhattacharya et al. 2005]:
Title markers added as suffixes to proper nouns
e.g. “দেবী” (Mrs.), “বাবু” (sir)
Classifier for plurality and specificity/gender of a noun
e.g. ছবিগুলো (pictures), ছবিটা (the picture), ছাত্রী (female student)
Case marker for possessive or accusative relations
e.g. পরিবারের (family’s)
Emphasizer to emphasize the current word
e.g. ছবিই (only a picture), ছবিটাই (only this picture)
SLIDE 7
Bengali Stemmer
Drop emphasizers (iteratively)
e.g. আবি্যই → আবি্য
Drop classifiers and case markers
e.g. রিাও → র, ািলেি → ািে
Drop title markers
e.g. োলে → ো
Drop plural suffixes
e.g. ািেয়লেি → ািেয়
Drop derivational suffixes
e.g. বিেশ → বিে
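The ordered stages above can be sketched as a small pipeline. The slides do not list the actual rules, so every suffix below is an illustrative placeholder picked from common Bengali markers, and the minimum stem length is an assumed safeguard; the original stemmer was written in C and its rule set is not reproduced here.

```python
MIN_STEM_LEN = 2  # assumption: never shorten a word below two code points

# (stage name, placeholder suffixes, strip repeatedly?) in the order of the slide.
STAGES = [
    ("emphasizers", ["ই", "ও"], True),
    ("classifiers and case markers", ["গুলি", "টা", "টি", "কে", "ের", "র"], False),
    ("title markers", ["বাবু", "দেবী"], False),
    ("plural suffixes", ["রা"], False),
    ("derivational suffixes", [], False),  # left empty: rules not given in the slides
]


def strip_one(word, suffixes):
    """Remove the longest matching suffix once, if enough of the word remains."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Apply each stage in order; only iterative stages loop until stable."""
    for _name, suffixes, iterative in STAGES:
        shorter = strip_one(word, suffixes)
        while iterative and shorter != word:
            word, shorter = shorter, strip_one(shorter, suffixes)
        word = shorter
    return word


for w in ["ছবিটাই", "ছবিটা", "ছবিই", "পরিবারের"]:
    print(w, stem(w))
```

Running emphasizers first matters because they attach outside the other markers: ছবিটাই loses ই before the classifier টা can be matched.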
SLIDE 8
Our Hindi Stemming Approach
Hindi has less complex inflectional morphology
→ fewer stemming rules
Rule-based stemmer
Stemming rules manually created by native Hindi speaker
SLIDE 9
Hindi Stemmer
Iteratively remove Hindi vowels, Matras, Anusvara, and “य” (character ya) from the right of a string until the first consonant is encountered
Drop derivational suffixes
e.g. लड़कों (to boys), लड़का (boy), लड़कियों (to girls), लड़की (girl)
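The right-to-left stripping rule can be sketched as below. The Unicode ranges chosen for "vowels" and "Matras" and the rule of always keeping at least one character are assumptions made for this sketch; the derivational-suffix step is omitted because the slides do not list those suffixes.

```python
# Assumed character classes, taken from the Devanagari Unicode block.
INDEPENDENT_VOWELS = {chr(cp) for cp in range(0x0905, 0x0915)}  # U+0905..U+0914
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}              # U+093E..U+094C
ANUSVARA = "\u0902"
YA = "\u092F"

STRIPPABLE = INDEPENDENT_VOWELS | MATRAS | {ANUSVARA, YA}


def stem(word):
    """Drop strippable characters from the right until a consonant
    (or any other character) is reached; always keep one character."""
    end = len(word)
    while end > 1 and word[end - 1] in STRIPPABLE:
        end -= 1
    return word[:end]


for w in ["लड़कों", "लड़का", "लड़कियों", "लड़की"]:
    print(w, stem(w))
```

Under these assumptions all four example forms stop at the consonant क and so share one index term, which conflates masculine and feminine forms as well as singular and plural.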
SLIDE 10
MET Experiments
Experiments for Bengali and Hindi
Stemmers implemented in C
Submission as source code
Stemmed forms are used for retrieval with Terrier
SLIDE 11
Results
Team        Language   MAP
Baseline    Bengali    0.2740
JU          Bengali    0.3307 (+20.69%)
DCU         Bengali    0.3300 (+20.44%)
IIT-KGP     Bengali    0.3225 (+17.70%)
CVPR-Team   Bengali    0.3159 (+15.29%)
ISM         Bengali    0.3103 (+13.25%)
Baseline    Hindi      0.2821
DCU         Hindi      0.2963 (+5.03%)
ISM         Hindi      0.2793 (-0.99%)
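The percentages in parentheses appear to be relative changes in MAP over the unstemmed baseline for the same language. A short sketch of that arithmetic, using only the MAP values reported above (the function and variable names are illustrative):

```python
def relative_change(score, baseline):
    """Relative change over the baseline, in percent."""
    return (score - baseline) / baseline * 100.0


BASELINES = {"Bengali": 0.2740, "Hindi": 0.2821}
RUNS = [
    ("JU", "Bengali", 0.3307),
    ("DCU", "Bengali", 0.3300),
    ("IIT-KGP", "Bengali", 0.3225),
    ("CVPR-Team", "Bengali", 0.3159),
    ("ISM", "Bengali", 0.3103),
    ("DCU", "Hindi", 0.2963),
    ("ISM", "Hindi", 0.2793),
]

for team, language, score in RUNS:
    change = relative_change(score, BASELINES[language])
    print(f"{team:10s} {language:8s} {score:.4f} ({change:+.2f}%)")
```

For example, the DCU Bengali run gives (0.3300 - 0.2740) / 0.2740 ≈ +20.44%.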
SLIDE 12
Conclusions
Bengali stemmer:
2nd best performance
Hindi stemmer:
Best performance
Both have also been used successfully in previous ad-hoc IR experiments for FIRE
SLIDE 13
Future work
Explore use of exclusion lists for irregular cases
Extend rule set (i.e. handle verbs)
Compare to other stemmers for Bengali/Hindi
e.g. Indian language in version 4 of Lucene; stemmers from Jacques Savoy’s web page on cross-language IR
Investigate morphology of named entities
SLIDE 14