Temporal classification for historical Romanian texts




SLIDE 1

Temporal classification for historical Romanian texts

Alina Maria Ciobanu Anca Dinu Liviu P. Dinu Vlad Niculae Octavia-Maria Şulea

Center for Computational Linguistics University of Bucharest http://nlp.unibuc.ro

August 2013

SLIDE 2

Temporal text classification

▶ Classifying texts by the time frame in which they were written
▶ Coarseness level: century
▶ Supervised classification approach
▶ Romanian texts from the 16th to the 20th century

SLIDE 3

Historical texts (16th century)

Beginning of written Romanian, first printed books. Religious texts and translations.

▶ Codicele Todorescu
▶ Codicele Marțian
▶ Coresi, Evanghelia cu învățătură
▶ Coresi, Lucrul apostolesc
▶ Coresi, Psaltirea slavo-română
▶ Coresi, Tâlcul evangheliilor
▶ Coresi, Tetraevanghelul
▶ Manuscrisul de la Ieud
▶ Palia de la Orăștie
▶ Psaltirea Hurmuzaki


SLIDE 5

Historical texts (17th century)

Social, economic, cultural and political chronicles of Moldavia

▶ The Bible
▶ Miron Costin, Letopisețul Țării Moldovei
▶ Miron Costin, De neamul moldovenilor
▶ Grigore Ureche, Letopisețul Țării Moldovei
▶ Dosoftei, Viața și petreacerea sfinților
▶ Varlaam Moțoc, Cazania
▶ Varlaam Moțoc, Răspunsul împotriva Catehismului calvinesc

SLIDE 6

Historical texts (18th century)

More chronicles, beginning of literature

▶ Antim Ivireanul, Opere
▶ Axinte Uricariul, Letopisețul Țării Românești și al Țării Moldovei
▶ Ioan Canta, Letopisețul Țării Moldovei
▶ Dimitrie Cantemir, Istoria ieroglifică
▶ Dimitrie E. Brașoveanul, Gramatica românească
▶ Ion Neculce, O samă de cuvinte

SLIDE 7

Historical texts (19th and 20th century)

19th century:

▶ Mihai Eminescu, Opere (journalism works), vol. IX--XIII

20th century: literature

▶ Eugen Barbu, Groapa
▶ Mircea Cărtărescu, Orbitor
▶ Marin Preda, Cel mai iubit dintre pământeni

SLIDE 8

Preprocessing

▶ All texts had already been digitized and transcribed into the Latin alphabet
▶ Removed: numbers, references and annotations
▶ Tokenized: on whitespace and punctuation
▶ Split: 500-sentence chunks
▶ Train-test split with ratio 1/4
▶ 3-fold cross-validation for model selection
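
The preprocessing steps listed above can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the function names, the bracketed annotation format and the naive sentence splitter are all assumptions introduced here.

```python
import re

def clean(text):
    """Drop annotations and numbers.
    Assumption: annotations appear in square brackets."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\d+", " ", text)

def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an assumption)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    """Split on whitespace and punctuation, keeping only word tokens."""
    return re.findall(r"\w+", sentence)

def chunk(sentences, size=500):
    """Group consecutive sentences into chunks of `size` sentences.
    The last chunk may be shorter."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]
```

A document would go through `clean`, then `split_sentences`, then `chunk`, with `tokenize` applied to each sentence before feature extraction.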

SLIDE 9

Features

▶ lengths (avg. characters per word, avg. words per sentence)
▶ stopwords (50 most common words)
▶ endings (suffixes of length 1 to 3)
▶ dictionary (unambiguous matches in DexOnline)
    ▶ obsolete marker (all dictionaries)
    ▶ dictionaries of archaisms (2 dictionaries)
    ▶ published before 1975 (7 dictionaries)
    ▶ published after 1975 (31 dictionaries)
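
As a rough sketch of the first three feature groups (the dictionary features need DexOnline data and are left out), one could write something like the code below. The function names are hypothetical, and two choices are assumptions made here rather than details from the slides: the stopword list is taken as the most frequent words of the corpus itself, and stopwords are represented by relative frequency. Chunks are assumed to be non-empty lists of tokenized sentences.

```python
from collections import Counter

def length_features(chunk):
    """Return (avg. characters per word, avg. words per sentence)."""
    words = [w for sent in chunk for w in sent]
    avg_chars = sum(len(w) for w in words) / len(words)
    avg_words = len(words) / len(chunk)
    return avg_chars, avg_words

def top_stopwords(chunks, k=50):
    """Assumption: the stopword list is the k most frequent corpus words."""
    counts = Counter(w for chunk in chunks for sent in chunk for w in sent)
    return [w for w, _ in counts.most_common(k)]

def stopword_features(chunk, stopwords):
    """Relative frequency of each stopword within the chunk."""
    counts = Counter(w for sent in chunk for w in sent)
    total = sum(counts.values())
    return [counts[s] / total for s in stopwords]

def ending_features(chunk):
    """Count word-final suffixes of length 1 to 3."""
    counts = Counter()
    for sent in chunk:
        for w in sent:
            for n in (1, 2, 3):
                if len(w) >= n:
                    counts[w[-n:]] += 1
    return counts
```

Each function takes one chunk, so the resulting vectors can be concatenated in any combination of feature groups.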

SLIDE 10

Results

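Feature groups: lengths, stopwords, endings, dictionary
[One row per feature combination; the column position of each ✓ is not recoverable from this transcript. * marks the best score per classifier.]

✓ marks    RF       SVM
(none)     25.38    25.38
✓          86.58    79.87
✓          98.51    95.16
✓ ✓        97.76    97.02
✓          98.51    96.27
✓ ✓        98.51    94.78
✓ ✓        98.88    *98.14
✓ ✓ ✓      98.51    97.77
✓          68.27    22.01
✓ ✓        92.92    23.13
✓ ✓        98.14    23.89
✓ ✓ ✓      98.50    23.14
✓ ✓        98.14    23.53
✓ ✓ ✓      98.51    25.00
✓ ✓ ✓      98.88    23.14
✓ ✓ ✓ ✓    *99.25   22.75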

SLIDE 11

Test results

▶ Linear SVM, C = 10^4: 98.8% accuracy, confusion between the 17th and 20th centuries
▶ Random forest, 50 trees: 97.7% accuracy, confusion between the 16th and 17th centuries
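
To make the accuracy and confusion figures concrete, here is a small sketch of how accuracy and the most-confused pair of centuries could be computed from true and predicted labels. The helper names are hypothetical and any labels passed in would be illustrative; nothing here reproduces the actual test set.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count each (true, predicted) pair."""
    return Counter(zip(y_true, y_pred))

def most_confused(y_true, y_pred):
    """Return the unordered class pair with the most mutual errors,
    or None if every prediction is correct."""
    errors = Counter()
    for (t, p), n in confusion_counts(y_true, y_pred).items():
        if t != p:
            errors[frozenset((t, p))] += n
    if not errors:
        return None
    pair, _ = errors.most_common(1)[0]
    return tuple(sorted(pair))
```

With century labels such as 16 to 20, `most_confused` returns a tuple like `(17, 20)` when errors between those two classes dominate.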

SLIDE 12

χ² feature selection

$$\chi^2(f) = \sum_{f,y} \frac{(N_{f,y} - E_{f,y})^2}{E_{f,y}}$$

[Figure: ten panels, one per word (amu, au, care, cari, de, derept, lu, pe, pre, se); x-axis: century, 16 to 20; y-axis: 0.0 to 1.0]
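
The χ² score can be illustrated with a short sketch. It assumes a discrete feature (for example, whether a word occurs in a chunk) and takes the expected count E for each (feature value, class) pair from the marginal totals under independence. That is the usual textbook definition, but the slide does not spell out how E was computed, so treat it as an assumption; `chi2_score` is a hypothetical name.

```python
from collections import Counter

def chi2_score(feature_values, labels):
    """Chi-squared statistic between a discrete feature and the class label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # N_{f,y}
    f_totals = Counter(feature_values)
    y_totals = Counter(labels)
    score = 0.0
    for f in f_totals:
        for y in y_totals:
            # E_{f,y}: count expected if feature and class were independent
            expected = f_totals[f] * y_totals[y] / n
            score += (observed[(f, y)] - expected) ** 2 / expected
    return score
```

A feature that separates the classes perfectly gets a high score, while one distributed identically across classes scores zero, which is why ranking by this statistic surfaces period-specific words.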

SLIDE 13

Results and sanity check

▶ RF from last slides: 98.8%
▶ NB predicting century: 90.1% accuracy
▶ RF predicting century (20 trees): 100%
▶ RF predicting source document: 72.1%
▶ RF predicting document, evaluated for century: 98.1%
▶ > 95% confidence on 20th-century novels set in the past
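
The "predicting document, evaluated for century" check can be read as follows: a document-level prediction counts as correct whenever the predicted document belongs to the same century as the true one. A minimal sketch of that scoring, with a hypothetical helper name and a hand-written document-to-century mapping, might look like this.

```python
def century_accuracy(true_docs, predicted_docs, doc_century):
    """Score document predictions by whether the century matches.
    `doc_century` maps each document name to its century."""
    hits = sum(doc_century[t] == doc_century[p]
               for t, p in zip(true_docs, predicted_docs))
    return hits / len(true_docs)

# Toy illustration only; these predictions are made up.
doc_century = {"Groapa": 20, "Orbitor": 20, "Istoria ieroglifică": 18}
true_docs = ["Groapa", "Orbitor", "Istoria ieroglifică"]
predicted_docs = ["Orbitor", "Orbitor", "Istoria ieroglifică"]
score = century_accuracy(true_docs, predicted_docs, doc_century)
```

In the toy data the first prediction names the wrong document but the right century, so the century-level score is higher than the document-level one, which mirrors the gap between 72.1% and 98.1% above.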