Stemming and Search Strategies for East European Languages



SLIDE 1

Stemming and Search Strategies for East European Languages

Ljiljana Dolamic, Jacques Savoy

Computer Science Department, University of Neuchatel, Switzerland
www.unine.ch/info/clef/

SLIDE 2

East European Languages

  • Hungarian
  • Slavic languages: Bulgarian, Czech, Russian

SLIDE 3

Hungarian

  • Ugric language (closest relatives: the Ob-Ugric languages)
  • Large number of cases

SLIDE 4

Hungarian

stem – plural – possession – case

  • gyereke-i-nke-t

child – Pl – PlPoss – Acc

Derivationals

  • jelent – és (meaning)

to mean – der
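The layered suffix structure above can be illustrated with a small suffix-stripping routine. This is only a sketch under stated assumptions: the suffix lists are hypothetical and far from complete, the function names are invented for the example, and the segmentation it uses (gyerek + eink + et) is a simplification rather than the stemmer evaluated in this presentation.

```python
# Hypothetical, deliberately tiny suffix lists (not the authors' rule set).
CASE_SUFFIXES = ("ban", "ben", "nak", "nek", "et", "ot", "at", "t")
POSSESSIVE_SUFFIXES = ("eink", "aink", "unk", "ünk", "nk")
PLURAL_SUFFIXES = ("ek", "ok", "ak", "k")
DERIVATIONAL_SUFFIXES = ("és", "ás")


def strip_one(word, suffixes, min_stem=3):
    """Remove the longest suffix that still leaves at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def light_stem(word, derivational=False):
    """Peel the case ending, then a possessive or a plural ending."""
    stem = strip_one(word.lower(), CASE_SUFFIXES)
    without_possessive = strip_one(stem, POSSESSIVE_SUFFIXES)
    if without_possessive != stem:
        # Possessed plurals are marked inside the possessive ending,
        # so no separate plural ending is removed in this branch.
        stem = without_possessive
    else:
        stem = strip_one(stem, PLURAL_SUFFIXES)
    if derivational:
        stem = strip_one(stem, DERIVATIONAL_SUFFIXES)
    return stem


print(light_stem("gyerekeinket"))                 # gyerek
print(light_stem("jelentés", derivational=True))  # jelent
```

Removing derivational suffixes is kept optional because it conflates a noun with its verb (jelentés with jelent), which is a more aggressive choice than removing inflections only.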

SLIDE 5

Hungarian

Compound constructions

hétvége = hét + vég

weekend = week/seven + end

Savoy, J. Report on CLEF 2003 monolingual tracks: Fusion of probabilistic models for effective monolingual retrieval
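Splitting a compound such as the one on this slide can be sketched with a lexicon lookup. The function below is a minimal illustration, assuming a hypothetical word list and a two-part split only; it is not the decompounding procedure from the cited report.

```python
def decompound(word, lexicon, min_part=3):
    """Split word into two lexicon entries if possible, else return it whole."""
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [word]


# Hypothetical lexicon; "vége" is the possessed form of "vég" (end).
LEXICON = {"hét", "vége", "week", "end"}

print(decompound("hétvége", LEXICON))
print(decompound("weekend", LEXICON))
print(decompound("gyerek", LEXICON))
```

A real decompounder would also need to undo the ending on the second part (vége to vég) and to handle compounds with more than two parts.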

SLIDE 6

Bulgarian

  • Southern Slavic language
  • Cyrillic
  • No cases
  • Definite article

http://www.unine.ch/info/clef

SLIDE 7

Bulgarian

stem – plural – article

  • вечер – и – те

evening – PL – the

  • геро(й) – Ø – ят/я

hero – Ø – the

  • слаб – а – та

weak – f, sg – the

SLIDE 8

Bulgarian

Derivationals

  • Българ – СК – и – те

stem – der – PL – the (the Bulgarian)

SLIDE 9

Problems with Bulgarian

  • Mutation of –Я–
  • бял – белота

(white – whiteness)

  • грях – грехове

(sin – sins)

  • Elision of vowel –Е– or –Ъ–
  • орел – орли

(eagle – eagles)

  • топъл – топла

(warm, m – f)

  • Palatalisation
  • 1. К, Г, Х → Ч, Ж, Ш
  • око – очи

(eye – eyes)

  • Бог – Боже

(God, Nom – Voc)

  • 2. К, Г, Х → Ц, З, С
  • вълк – вълци

(wolf – wolves)
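The interaction between suffix removal and palatalisation can be sketched as follows. The suffix lists and the function names are hypothetical and cover only a few forms from the Bulgarian slides; this is an illustration of the idea of undoing the second palatalisation after removing a plural –и, not the light stemmer evaluated in this presentation.

```python
# Hypothetical, minimal suffix inventories for illustration only.
ARTICLES = ("ът", "ят", "та", "то", "те")
INFLECTIONS = ("ове", "и", "а")
# Second palatalisation (К, Г, Х -> Ц, З, С), read in reverse.
UNDO_PALATALISATION = {"ц": "к", "з": "г", "с": "х"}


def remove_suffix(word, suffixes, min_stem=3):
    """Return (stem, removed_suffix); removed_suffix is '' if nothing matched."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)], suffix
    return word, ""


def light_stem_bg(word):
    """Strip the definite article, then one inflection, then repair the stem."""
    word, _ = remove_suffix(word.lower(), ARTICLES)
    word, removed = remove_suffix(word, INFLECTIONS)
    if removed == "и" and word[-1] in UNDO_PALATALISATION:
        word = word[:-1] + UNDO_PALATALISATION[word[-1]]
    return word


for form in ("вечерите", "слабата", "вълк", "вълци"):
    print(form, "->", light_stem_bg(form))
```

With this rule вълк and вълци share the stem вълк. The rule is blunt: it also rewrites stems that genuinely end in ц, з or с, and it does nothing for the vowel mutation and elision cases listed above.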

SLIDE 10

Czech

  • Western Slavic language
  • Seven-case system

stem – case

  • pán – ovi

sir (D, L, sg)

  • mlad – ou

young (A, sg, f)

SLIDE 11

Czech

stem – case

case / gender    nominative        dative singular    dative plural
masculine        pán (sir)         pán-ovi            pán-ům
feminine         žen-a (woman)     žen-ě              žen-ám
neuter           mlad-é (young)    mlad-ému           mlad-ým

SLIDE 12

Czech

Derivationals

  • klavír – ist – a (pianist)

piano – der – case

  • Žid – ovk – a (Jewish woman)

Jew – der – case

SLIDE 13

Problems with Czech

Fleeting –E–

  • zámek – zámkem

(castle, Nom – Ins)

  • otec – otcův

(father – father's)

ů → o

  • stůl – stoly

(table – tables)

Consonant softening

  • matka – matčin

(mother – mother's)

  • drahý – drazí

(dear, Nom, sg – pl)

  • mokrý – mokří

(wet, Nom, sg – pl)

  • český – čeští

(Czech, adj, Nom, sg – pl)

SLIDE 14

Russian

  • Eastern Slavic language
  • Cyrillic
  • Six cases

stem – case

  • книг – а

book (N, sg)

  • хорош – ая

good (N, sg, f)

SLIDE 15

Evaluation

IR models

  • Okapi
  • DFR
  • LM
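The evaluation tables report the LM runs with λ = 0.35. One common reading of such a parameter is linear interpolation between a document model and a collection model, and the sketch below assumes exactly that. The function name, the toy documents and the choice to skip terms with zero probability are assumptions made for illustration; the slides do not spell out the formula.

```python
import math
from collections import Counter


def lm_score(query, doc, collection_tf, collection_len, lam=0.35):
    """Log-likelihood of the query under lam * P(t|D) + (1 - lam) * P(t|C)."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = tf[term] / len(doc) if doc else 0.0
        p_coll = collection_tf.get(term, 0) / collection_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p > 0.0:  # a term unseen in the whole collection is skipped
            score += math.log(p)
    return score


# Toy collection (hypothetical data).
docs = [["vip", "divorce"], ["water", "health"]]
collection_tf = Counter(term for doc in docs for term in doc)
collection_len = sum(collection_tf.values())

for doc in docs:
    print(doc, round(lm_score(["water"], doc, collection_tf, collection_len), 4))
```

The document that contains the query term receives the higher score; the other document still gets a non-zero probability from the collection model, which is the purpose of the smoothing.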

SLIDE 16

Evaluation Hungarian

Model (Q=TD)       word      dec       4-grams
Okapi              0.3231    0.3629    0.3445
LM (λ=0.35)        0.3118    0.3482    0.3153
DFR IneC2          0.3525    0.3897    0.3527
(label missing)    0.2344    0.2532    0.2345
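The 4-grams column refers to indexing character n-grams instead of words or stems. A minimal sketch, assuming overlapping n-grams taken inside each word and short words kept whole (both are assumptions, since the slides do not define the variant used):

```python
def char_ngrams(word, n=4):
    """Overlapping character n-grams of a word; short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i : i + n] for i in range(len(word) - n + 1)]


def index_terms(text, n=4):
    """Replace every whitespace-separated token by its character n-grams."""
    terms = []
    for token in text.lower().split():
        terms.extend(char_ngrams(token, n))
    return terms


print(char_ngrams("weekend"))       # ['week', 'eeke', 'eken', 'kend']
print(index_terms("Week end"))      # ['week', 'end']
```

Because inflected forms of one word share most of their n-grams, this approach needs no language-specific rules, at the cost of a larger index.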

SLIDE 17

Evaluation Hungarian

Model (Q=TD)       word      dec       4-grams   jmorph*
Okapi              0.3231    0.3629    0.3445    0.3509
LM (λ=0.35)        0.3118    0.3482    0.3153    0.3155
DFR IneC2          0.3525    0.3897    0.3527    0.3480
(label missing)    0.2344    0.2532    0.2345    0.2224

* jmorph – Java port for the hunmorph morphological analyzer (http://mokk.bme.hu/resouces/ir)

SLIDE 18

Evaluation Hungarian

Model (Q=TD)       word      dec       4-grams   jmorph
Okapi              0.3231*   0.3629*   0.3445    0.3509
LM (λ=0.35)        0.3118*   0.3482*   0.3153    0.3155*
DFR IneC2          0.3525    0.3897    0.3527    0.3480
(label missing)    0.2344*   0.2532*   0.2345    0.2224*

SLIDE 19

Evaluation Bulgarian

Model (Q=TD)       light     deriv.    4-grams
Okapi              0.3155    0.3425    0.3022
LM (λ=0.35)        0.3175    0.3368    0.2868
DFR IneC2          0.3423    0.3606    0.3156
(label missing)    0.2103    0.2143    0.2105

SLIDE 20

Evaluation Bulgarian

Model (Q=TD)          word      light      deriv.    4-grams
Okapi                 0.2035    0.3155     0.3425    0.3022
LM (λ=0.35)           0.2083    0.3175     0.3368    0.2868
DFR IneC2             0.2215    0.3423     0.3606    0.3156
(label missing)       0.1636    0.2103     0.2143    0.2105
Change vs. baseline   -32.8%    baseline   +5.8%     -5.9%
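The percentages on this slide can be reproduced from the table itself. The sketch below assumes they are the relative change of the mean over the four model rows, with the light stemmer as the baseline; that reading is an inference from the numbers, not something the slide states. The dictionary keys are labels chosen for the example, including one for the row whose model name did not survive extraction.

```python
from statistics import mean

# Values copied from the Bulgarian table on this slide (Q=TD).
MAP_BULGARIAN = {
    "Okapi":          {"word": 0.2035, "light": 0.3155, "deriv": 0.3425, "4-grams": 0.3022},
    "LM":             {"word": 0.2083, "light": 0.3175, "deriv": 0.3368, "4-grams": 0.2868},
    "DFR IneC2":      {"word": 0.2215, "light": 0.3423, "deriv": 0.3606, "4-grams": 0.3156},
    "unlabelled row": {"word": 0.1636, "light": 0.2103, "deriv": 0.2143, "4-grams": 0.2105},
}


def mean_map(run):
    """Mean of one indexing strategy (column) over all model rows."""
    return mean(row[run] for row in MAP_BULGARIAN.values())


def relative_change(run, baseline="light"):
    """Percentage change of the column mean relative to the baseline column."""
    base = mean_map(baseline)
    return 100.0 * (mean_map(run) - base) / base


for run in ("word", "deriv", "4-grams"):
    print(f"{run}: {relative_change(run):+.1f}%")
# word: -32.8%
# deriv: +5.8%
# 4-grams: -5.9%
```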

SLIDE 21

Evaluation Bulgarian

Model (Q=TD)       word      light     deriv.    4-grams
Okapi              0.2035*   0.3155*   0.3425*   0.3022
LM (λ=0.35)        0.2083*   0.3175*   0.3368*   0.2868*
DFR IneC2          0.2215    0.3423    0.3606    0.3156
(label missing)    0.1636*   0.2103*   0.2143*   0.2105*

SLIDE 22

Evaluation Czech

Model (Q=TD)       light     deriv.    4-grams
Okapi              0.3355    0.3255    0.3401
LM (λ=0.35)        0.3263    0.3109    0.3204
DFR IneC2          0.3539    0.3473    0.3517
(label missing)    0.2050    0.1984    0.2126
DFR GL2            0.3437    0.3342    0.3365

SLIDE 23

Evaluation Czech

Model (Q=TD)       light     light (noAccent)   deriv.    4-grams
Okapi              0.3355    0.3306             0.3255    0.3401
LM (λ=0.35)        0.3263    0.3174             0.3109    0.3204
DFR IneC2          0.3539    0.3473             0.3473    0.3517
(label missing)    0.2050    0.2078             0.1984    0.2126
DFR GL2            0.3437    0.3359             0.3342    0.3365
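The noAccent variant presumably removes diacritics before or after light stemming; the slides do not say how. One standard way to do this, shown here as an assumption rather than the authors' procedure, is Unicode decomposition followed by dropping the combining marks:

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD and drop combining marks (háček, čárka, kroužek)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("český", "stůl", "žena", "nejlepší"):
    print(word, "->", strip_accents(word))
```

This conflates forms such as stůl and stoly less than a stemmer would, but it makes the index robust to documents or queries typed without diacritics.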

SLIDE 24

Evaluation Czech

Model (Q=TD)       light     light (noAccent)   deriv.    4-grams
Okapi              0.3355    0.3306*            0.3255*   0.3401*
LM (λ=0.35)        0.3263*   0.3174*            0.3109*   0.3204*
DFR IneC2          0.3539    0.3473             0.3473    0.3517
(label missing)    0.2050*   0.2078*            0.1984*   0.2126*
DFR GL2            0.3437    0.3359             0.3342    0.3365

SLIDE 25

Evaluation Russian

Model (Q=TD)       light     4-grams
Okapi              0.1630    0.0917
DFR InB2           0.1775    0.1052
(label missing)    0.1188    0.0918
LM (λ=0.35)        0.1511    0.1246
DFR GL2            0.1639    0.1264

SLIDE 26

Evaluation Russian

Model (Q=TD)       light     4-grams   snowball*
Okapi              0.1630    0.0917    0.1617
DFR InB2           0.1775    0.1052    0.1749
(label missing)    0.1188    0.0918    0.1194
LM (λ=0.35)        0.1511    0.1246    0.1524
DFR GL2            0.1639    0.1264    0.1689

* http://snowball.tartarus.org/

SLIDE 27

Evaluation Russian

Model (Q=TD)       light     4-grams   snowball
Okapi              0.1630    0.0917*   0.1617
DFR InB2           0.1775    0.1052    0.1749
(label missing)    0.1188*   0.0918*   0.1194*
LM (λ=0.35)        0.1511*   0.1246    0.1524
DFR GL2            0.1639    0.1264    0.1689

SLIDE 28

Query-by-Query

Hard topics

MAP < 0.1
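The hard-topic criterion can be sketched as a filter over per-topic average precision. The sketch assumes the threshold is applied to the average precision of each topic for a given run; the function names and the second topic in the usage example are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def hard_topics(ap_by_topic, threshold=0.1):
    """Topic ids whose average precision falls below the threshold."""
    return sorted(topic for topic, ap in ap_by_topic.items() if ap < threshold)


# 0.0003 is the value the slides give for Hungarian topic #436;
# topic 999 and its value are made up for the example.
print(hard_topics({436: 0.0003, 999: 0.25}))
```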

SLIDE 29

Query-by-Query: Hungarian

  • #411, #426, #436, #439, #446
  • #436, ‘VIP divorces’
  • 0.0003 (DFR GL2, dec)

<title>VIP válások</title> <desc>Keressünk cikkeket híres emberek válásáról.</desc>

  • VIP – df = 0
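The df = 0 observation means that the title term VIP occurs in no document of the collection, so it cannot contribute to any match. A minimal sketch of building a Q=TD query from the topic fields and checking document frequencies follows; the topic text is the one on this slide with word spacing restored, while the two documents and the helper names are hypothetical.

```python
import re

TOPIC = ("<title>VIP válások</title> "
         "<desc>Keressünk cikkeket híres emberek válásáról.</desc>")


def td_query(topic):
    """Lower-cased tokens of the title and description fields (Q=TD)."""
    fields = re.findall(r"<(title|desc)>(.*?)</\1>", topic, flags=re.S)
    text = " ".join(content for _, content in fields)
    return re.findall(r"\w+", text.lower())


def document_frequency(term, documents):
    """Number of documents (given as token sets) that contain the term."""
    return sum(1 for doc in documents if term in doc)


# Hypothetical two-document collection.
DOCS = [{"híres", "emberek"}, {"válások", "cikkeket"}]

for term in td_query(TOPIC):
    print(term, document_frequency(term, DOCS))
```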

SLIDE 30

Query-by-Query: Bulgarian

light, 4-grams

  • #407, #412, #417, #422, #428, #429, #435

aggressive

  • #412, #417, #422, #428, #435

SLIDE 31

Query-by-Query: Bulgarian

#429, ‘Water Health Risks’

<title>Рискове за здравето, причинени от вода</title> <desc>Намерете документи, които съдържат информация за рисковете за здравето от замърсена или заразена вода.</desc>

Word forms (Q = query, D = document): заразата, заразена, здравното, здравна, здравен, здравето
Stems (light / deriv.): зараг, заразн, здрав, здравн

SLIDE 32

Query-by-Query: Czech

  • #411, #422, #428, #430, #435, #439, #446
  • #430, ‘Cosmetic procedures’
  • 0.0025 (tf.idf, Q=TDN, 4-grams)
  • 0.1553 (DFR GL2, Q=D, light)
  • #411, ‘Best picture Oscar’
  • 0.0053 (DFR GL2, Q=TDN, light)

<title>Oskar za nejlepší film</title> <desc>Jaký titul získal v březnu 2002 Oskara za nejlepší film?</desc>

SLIDE 33

Query-by-Query: Russian

4-grams

  • #176, #180, #185, #186, #189, #192, #194, #196, #198

light

  • #176, #185, #186, #189, #192, #195, #196

SLIDE 34

Query-by-Query: Russian

#192, ‘System change and family planning in East Germany’

  • 0.0034 (DFR InB2, light, Q=TDN)

<title>Трансформация и семейное планирование в Восточной Германии</title> <desc>Найти документы, в которых описываются тенденции в области деторождения и семейное планирование в Восточной Германии после объединения.</desc>

  • 1 relevant item

SLIDE 35

Query-by-Query: Russian

#171, ‘Sibling relations’

  • 0.0089 (DFR InB2, light, Q=TDN)

<title>Отношения между родными братьями и сестрами</title> <desc>Найдите документы, которые подробно описывают развитие отношений между родными сестрами и братьями.</desc>

  • 2 relevant items
  • семейные – family

SLIDE 36

Conclusion

  • Is stemming effective?
  • Best performing retrieval model
  • Hard topics
