DCU meets MET: Bengali and Hindi Morpheme Extraction

Oct 02, 2023

SLIDE 1

DCU meets MET: Bengali and Hindi Morpheme Extraction

Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones CNGL, School of Computing, Dublin City University, Ireland

SLIDE 2

Outline

Motivation
Task Description
Bengali Stemming Approach
Hindi Stemming Approach
Results
Conclusions and Future Work

SLIDE 3

Motivation

Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms.
Example:

company, companies → company; hopeful → hope

For information retrieval, indexing surface forms would lead to many mismatches between query terms and index terms extracted from documents.
Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms.

SLIDE 4

Task Description

Morpheme Extraction Task:

Investigate the effect of morphological analysis/lemmatization/stemming on information retrieval (IR) performance (for Indian languages)

Subtasks:

Subtask 1: manual evaluation of morpheme extraction
Subtask 2: IR evaluation using the proposed morpheme representation as index terms. The evaluation metric is mean average precision (MAP).
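To make the Subtask 2 metric concrete, here is a minimal sketch of how mean average precision can be computed from ranked result lists. The function names and the toy rankings are illustrative assumptions; the slides do not say which evaluation tool the track used, and the sketch assumes each ranking contains no duplicate documents.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """AP for one query: sum of precision@k at each rank k holding a
    relevant document, divided by the total number of relevant documents."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical document ids, two queries).
toy_runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
print(round(mean_average_precision(toy_runs), 4))
```

A stemmer helps this metric when it lets relevant documents that use a different surface form of a query term enter the ranking, or move up in it.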

SLIDE 5

Stemming Approaches

Light vs. aggressive stemming
Rule-based vs. corpus-based stemming (manually created rules vs. clusters of related words)
Iteratively remove word suffixes
Problems:

overstemming, i.e. removed suffix is too long

e.g. international/intern; news/new

understemming, i.e. removed suffix is too short

e.g. forgetfulness/forgetful

irregular forms

e.g. feet/foot; women/woman
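The three problem types can be reproduced with a toy suffix stripper. The sketch below is only an illustration: the English suffix list and the minimum stem length are made-up assumptions, not rules from the DCU stemmers.

```python
# Toy suffix list, longest first (an assumption for illustration only).
SUFFIXES = ("ational", "ness", "ful", "s")
MIN_STEM_LEN = 3  # never leave fewer than this many characters


def strip_once(word):
    """Remove the first matching suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def strip_iteratively(word):
    """Keep removing suffixes until no rule applies any more."""
    while True:
        shorter = strip_once(word)
        if shorter == word:
            return word
        word = shorter


for w in ("international", "news", "forgetfulness", "feet"):
    print(w, "one pass:", strip_once(w), "iterative:", strip_iteratively(w))
```

With these toy rules, a single pass stops at "forgetful" (understemming), iterative stripping reaches "forget" but also reduces "international" to "intern" and "news" to "new" (overstemming), and no suffix rule maps "feet" to "foot" (irregular form).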

SLIDE 6

Our Bengali Stemming Approach

Rule-based stemmer created by a native speaker
Focus on nouns (most important for IR)
Four categories [Bhattacharya et al. 2005]:

Title markers added as suffixes to proper nouns

e.g. “দে” (Mrs.), “া” (sir)

Classifier for plurality and specificity/gender of a noun

e.g. ছবিগুলো (pictures), ছবিটা (the picture), ছাত্রী (female student)

Case marker for possessive or accusative relations

e.g. পরিবারের (family’s)

Emphasizer to emphasize the current word

e.g. ছবিই (only a picture), ছবিটাই (only this picture)

SLIDE 7

Bengali Stemmer

Drop emphasizers (iteratively)

e.g. আবি্যই → আবি্য

Drop classifiers and case markers

e.g. রিাও → র, ািলেি → ািে

Drop title markers

e.g. োলে → ো

Drop plural suffixes

e.g. ািেয়লেি → ািেয়

Drop derivational suffixes

e.g. বিেশ → বিে
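The staged order on this slide can be sketched as a small pipeline. This is not the authors' C implementation: the suffix lists below are a few common Bengali endings chosen for illustration, the last three stages are left empty because the slides do not list their rules, and `MIN_STEM_LEN` is an assumed safeguard.

```python
# (stage name, suffixes tried longest first, repeat until no match?)
# All suffix lists are illustrative assumptions, not the authors' rule set.
STAGES = [
    ("emphasizers", ("ই", "ও"), True),
    ("classifiers and case markers", ("গুলো", "টা", "টি", "ের", "কে"), False),
    ("title markers", (), False),
    ("plural suffixes", (), False),
    ("derivational suffixes", (), False),
]
MIN_STEM_LEN = 2  # assumed minimum number of code points left in a stem


def strip_suffix(word, suffixes):
    """Remove the first matching suffix if enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem_bengali(word):
    """Apply each stage in slide order; only flagged stages are repeated."""
    for _name, suffixes, repeat in STAGES:
        while True:
            shorter = strip_suffix(word, suffixes)
            if shorter == word:
                break
            word = shorter
            if not repeat:
                break
    return word


for w in ("ছবিটাই", "ছবিই", "পরিবারের"):
    print(w, "->", stem_bengali(w))
```

Running the emphasizer stage first matters in this sketch: in ছবিটাই the emphasizer sits outside the classifier, so the classifier rule cannot fire until the emphasizer has been dropped.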

SLIDE 8

Our Hindi Stemming Approach

Hindi has less complex inflectional morphology

→ fewer stemming rules

Rule-based stemmer
Stemming rules manually created by a native Hindi speaker

SLIDE 9

Hindi Stemmer

Iteratively remove

Hindi vowels, Matras, Anusvara, and “य” (character ya)

from the right of a string until the first consonant is encountered
Drop derivational suffixes, e.g.

लड़कों (to boys) → लड़का (boy)
लड़कियों (to girls) → लड़की (girl)
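The right-to-left stripping step can be sketched as follows. This is a minimal reading of the description above, not the authors' C code: the Unicode ranges chosen for "vowels" and "matras" are assumptions, and returning an all-vowel word unchanged is an added safeguard. The derivational suffix rules are not covered.

```python
# Assumed character classes (Devanagari block of Unicode).
INDEPENDENT_VOWELS = {chr(cp) for cp in range(0x0905, 0x0915)}  # अ .. औ
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}  # ा .. ौ (dependent vowel signs)
ANUSVARA = "\u0902"  # ं
YA = "\u092F"  # य
STRIPPABLE = INDEPENDENT_VOWELS | MATRAS | {ANUSVARA, YA}


def strip_hindi_ending(word):
    """Drop strippable characters from the right until another character
    (in practice a consonant) is reached."""
    end = len(word)
    while end > 0 and word[end - 1] in STRIPPABLE:
        end -= 1
    # Assumption: if nothing would remain, keep the word as it is.
    return word[:end] if end > 0 else word


for w in ("लड़कों", "लड़का", "लड़कियों", "लड़की"):
    print(w, "->", strip_hindi_ending(w))
```

In this sketch all four example forms end at the same consonant, so the stripping step alone already conflates them; it yields a bare stem rather than the dictionary base forms shown on the slide.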

SLIDE 10

MET Experiments

Experiments for Bengali and Hindi
Stemmers implemented in C
Submission as source code
Stemmed forms are used for retrieval with Terrier

SLIDE 11

Results

Team        Language   MAP
Baseline    Bengali    0.2740
JU          Bengali    0.3307 (+20.69%)
DCU         Bengali    0.3300 (+20.44%)
IIT-KGP     Bengali    0.3225 (+17.70%)
CVPR-Team   Bengali    0.3159 (+15.29%)
ISM         Bengali    0.3103 (+13.25%)
Baseline    Hindi      0.2821
DCU         Hindi      0.2963 (+5.03%)
ISM         Hindi      0.2793 (-0.99%)
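The percentages in parentheses are relative changes in MAP over the unstemmed baseline for the same language. A short check, with a helper name that is our own:

```python
def relative_change_percent(map_score, baseline):
    """Relative change of a MAP score over the baseline, in percent."""
    return (map_score - baseline) / baseline * 100.0


BASELINE_BENGALI = 0.2740
BASELINE_HINDI = 0.2821

print(f"{relative_change_percent(0.3300, BASELINE_BENGALI):+.2f}%")  # +20.44%
print(f"{relative_change_percent(0.2963, BASELINE_HINDI):+.2f}%")    # +5.03%
print(f"{relative_change_percent(0.2793, BASELINE_HINDI):+.2f}%")    # -0.99%
```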

SLIDE 12

Conclusions

Bengali stemmer:

2nd best performance

Hindi stemmer:

Best performance

Both have also been used successfully in previous ad-hoc IR experiments for FIRE

SLIDE 13

Future work

Explore use of exclusion lists for irregular cases
Extend rule set (i.e. handle verbs)
Compare to other stemmers for Bengali/Hindi

e.g. Indian language in version 4 of Lucene; stemmers from Jacques Savoy’s web page on cross-language IR

Investigate morphology of named entities
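One possible reading of the exclusion-list idea is a lookup that runs before the suffix rules. The sketch below is hypothetical: it uses the English examples from the stemming-problems slide, a toy rule stemmer, and made-up names, since the slides do not describe how such a list would be built for Bengali or Hindi.

```python
# Hypothetical lists, using English examples for readability.
IRREGULAR_FORMS = {"feet": "foot", "women": "woman"}  # surface form -> base form
DO_NOT_STEM = {"news"}  # words the suffix rules would damage


def toy_rule_stem(word):
    """Stand-in for a rule-based stemmer: strip a final 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stem_with_exclusions(word, rule_stem=toy_rule_stem):
    """Consult the lists first; fall back to the suffix rules otherwise."""
    if word in DO_NOT_STEM:
        return word
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    return rule_stem(word)


for w in ("news", "feet", "pictures"):
    print(w, "->", stem_with_exclusions(w))
```

Keeping the lists separate from the rules means irregular cases can be added without touching the rule set.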

SLIDE 14