Corpus Linguistics Tools for Sahidic Coptic
Amir Zeldes, Humboldt-Universität zu Berlin amir.zeldes@rz.hu-berlin.de Caroline T. Schroeder, University of the Pacific cschroeder@pacific.edu
Leipzig eHumanities Seminar, 18.12.2013
Leipzig, 18.12.2013 Zeldes & Schroeder / Corpus Linguistics Tools for Sahidic Coptic 1/46
Plan
Who are these people?
Korpuslinguistik / SFB 632 Information Structure Humboldt-Universität zu Berlin
Religious and Classical Studies / Humanities Center Director University of the Pacific
NEH summer institute on "Text in a Digital Age" (Tufts): http://coptic.pacific.edu/
What is Coptic?
(Longest continuous documentation of any language)
(Egyptian climate + papyrus = happy philologists)
libraries, lost ...
Why study Coptic?
Afroasiatic Cushitic Chadic Omotic Berber Egyptian Coptic Semitic
Why study Coptic?
hagiographies
(e.g. Thomas, Mary, and most recently "Jesus's Wife")
Sahidic Coptic
almost 2000 years
periods
Sahidic (2nd-14th C.)
this project
What we would like to see
vocabulary, identifying reuse; NB much homography!)
paleographic and text-linguistic interests, e.g. line breaks in words, but whole words... (talk in Berlin next month)
A word about the texts in this talk
Abraham our Father:
upstanding in all our works, and so that the prophets, apostles and all the saints might dwell among us, ..."
sixty years living at the top of the river and she never set foot outside to see the river."
Corpus linguistics
(some examples in the next slides)
Some stuff we've been working on
tokenized, segmented and tagged data (this talk)
corpus architecture, metadata (talk at Berlin Digital Classicist Seminar next month)
entities (manual)
ANNIS search interface: https://korpling.german.hu-berlin.de/annis3/scriptorium
Some stuff we've been working on
normalized, tokenized, consistently tagged data
Normalization
f sh h ti ch kj
(but are often omitted)
Normalization
potentially marking 'word' borders, potentially 'meaningless'
substantially, even for foreign words and even in the same manuscript
Can you guess the word? Solution: Collegium
Normalization
and add normalization
for diacritics
abbreviations, growing
views in interface (ANNIS, Zeldes et al. 2009)
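A minimal Python sketch of the normalization step, assuming diacritics are encoded as Unicode combining marks and that the spacing grave accent (`) carries no lexical meaning. The abbreviation table and the function names are illustrative assumptions, not the project's actual resources.

```python
import unicodedata

# Illustrative abbreviation table (an assumption, not the project's list):
# two common nomina sacra, keyed by their form once diacritics are removed.
ABBREVIATIONS = {
    "ⲓⲥ": "ⲓⲏⲥⲟⲩⲥ",
    "ⲡⲛⲁ": "ⲡⲛⲉⲩⲙⲁ",
}


def strip_diacritics(text):
    """Remove combining marks (strokes, dots) and the spacing grave accent."""
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category "Mn" covers non-spacing combining marks.
    letters = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return letters.replace("`", "")


def normalize(text):
    """Strip diacritics, then expand whole-word abbreviations."""
    words = strip_diacritics(text).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


if __name__ == "__main__":
    # \u0304 is the combining macron used here for the supralinear stroke.
    print(normalize("ⲛ\u0304ⲟⲩϣⲏⲣⲉ` ⲛ\u0304ⲁⲃⲣⲁϩⲁⲙ`"))
```

Keeping the diplomatic form and the normalized form side by side, rather than overwriting one with the other, lets both remain searchable.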
Tokenization
'Since I became a monk' since-that-PAST-1sg-do-monk
'he who made us keep the ceremony' REL-PAST-3sgM-CAUS-1pl-do-the-observance
(Layton 2004), some hints from "meaningless diacritics"
Tokenization – Step 1/2
...ⲛ̄ⲟⲩϣⲏⲣⲉ`ⲛ̄ⲁⲃⲣⲁϩⲁⲙ`... → ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ` 'of-a-son of-Abraham'
most texts 'come like this' from researchers, phew! (e.g. in EpiDoc XML, text files, MS Word etc.)
idea of word forms but this is only sometimes so
Tokenization – Step 2/2
ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ` → ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ
CMCL, courtesy of Prof. Tito Orlandi)
(possible for smaller MSS, less so for the whole Bible)
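The second step can be sketched as a lexicon lookup over each normalized bound group. The code below is only an illustration under stated assumptions: the four-entry lexicon is taken from the example on this slide, the function name is hypothetical, and the slides do not confirm that the project's tokenizer works this way.

```python
from functools import lru_cache

# Toy lexicon built from the slide's example; a real lexicon would be far larger.
LEXICON = frozenset({"ⲛ", "ⲟⲩ", "ϣⲏⲣⲉ", "ⲁⲃⲣⲁϩⲁⲙ"})


def segment(bound_group, lexicon=LEXICON):
    """Return every way of splitting a bound group into lexicon entries."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string yields one (empty) analysis.
        if start == len(bound_group):
            return [[]]
        analyses = []
        for end in range(start + 1, len(bound_group) + 1):
            piece = bound_group[start:end]
            if piece in lexicon:
                for rest in split_from(end):
                    analyses.append([piece] + rest)
        return analyses

    return split_from(0)


if __name__ == "__main__":
    for group in ["ⲛⲟⲩϣⲏⲣⲉ", "ⲛⲁⲃⲣⲁϩⲁⲙ"]:
        print(group, segment(group))
```

Returning all analyses instead of only the first makes homography visible: a bound group with more than one result needs disambiguation, and one with no result contains an out-of-lexicon item.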
Examples and challenges
e.g.: Indefinite durative present/future:
errors
Examples and challenges
ⲧ /t/ + ϩ /h/ > ⲑ /th/ (aspirated pronunciation of ⲑ, ⲫ, ⲭ)
the word form
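One way to let a segmenter see through such fused letters is to generate candidate spellings in which each aspirate is resolved into its two parts. This is a sketch, not the project's method: the slide only spells out ⲧ + ϩ > ⲑ, the mappings for ⲫ and ⲭ are added by analogy, and the function name is hypothetical.

```python
# Fused aspirates and their two-letter resolutions (ⲫ and ⲭ added by analogy).
FUSED = {"ⲑ": "ⲧϩ", "ⲫ": "ⲡϩ", "ⲭ": "ⲕϩ"}


def expansions(form):
    """Return the form itself plus every variant with fused letters resolved."""
    variants = [""]
    for ch in form:
        options = [ch]
        if ch in FUSED:
            options.append(FUSED[ch])
        # Extend every partial spelling with every option for this letter.
        variants = [prefix + option for prefix in variants for option in options]
    return variants


if __name__ == "__main__":
    print(expansions("ⲑⲉ"))
```

Each candidate can then be checked against the lexicon; the original spelling always comes first, so forms that really contain ⲑ, ⲫ or ⲭ are unaffected.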
Part-of-speech tagging
methods become more easily applicable
(based on Layton 2004)
proper and common nouns, imperative and stative verbs, different types of pronouns)
Tagset overview
Tag      Description         Examples
A[*]     Auxiliary base      ⲁ[ϥ], ⲙⲉ[ϥ], ⲧⲣⲉ[ϥ]
ADV      Adverb              ⲉⲃⲟⲗ, ⲟⲛ, ⲡⲱⲥ
ART      Article             ⲡ(ⲉ), ⲧ(ⲉ), ⲛ(ⲉ), ϩⲉⲛ
C[*]     Converter           ⲉ, ⲉⲧⲉ, ⲛⲉ, ...
CONJ     Conjunction         ⲁⲩⲱ, ⲏ, ⲙⲏ, ⲕⲁⲓ, ⲉⲓⲧⲉ
COP      Copula              ⲡⲉ/ⲧⲉ/ⲛⲉ
EXIST    Existential         ⲟⲩⲛ/ⲙⲛ
FUT      Future              ⲛⲁ
IMOD     Inflected modifier  ⲧⲏⲣ[ϥ], ϩⲱⲱ[ⲧ], …
N[*]     Noun                ⲁⲑⲏⲧ, ⲣⲱⲙⲉ, ⲁⲣⲭⲏ, ...
NEG      Negation            ⲛ, ⲁⲛ, ⲧⲙ[ⲥⲱⲧⲙ]
NUM      Numeral             ⲟⲩⲁ, ⲥⲛⲁⲩ, …
PDEM     Pronoun, demonst.   ⲡⲉⲓ/ⲡⲁⲓ, ⲧⲉⲓ/ⲧⲁⲓ
PINT     Pronoun, interrog.  ⲟⲩ, ⲛⲓⲙ
PPER[*]  Pronoun, personal   ϥ,ⲥ,ϯ,ⲛ,ⲁⲛⲟⲕ,ⲁⲛⲅ̄
PPOS[*]  Pronoun, possess.   ⲡⲉϥ,ⲧⲉⲧⲛ̄,ⲡⲟⲩ,ⲡⲁ
PREP     Preposition         ⲉⲧⲃⲉ, ϩⲛ̄, ⲛ, ⲙ̄ⲙⲟ[ϥ]
PTC      Particle            ⲇⲉ, ⲛ̄ϭⲓ, ϫⲉ, …
PUNCT    Punctuation         . , · …
UNKNOWN  Unknown, lacuna     ⲃ_ _ _, _ _ⲟⲥ, _ _ _
V[*]     Verb                ⲥⲱⲧⲡ, ⲥⲟⲧⲡ, ⲟ, ⲁⲣⲓ
VBD      Verboid             ⲛⲁⲛⲟⲩ[ϥ], ⲡⲉϫⲁ[ϥ]
Interannotator agreement
ability to tag text correctly
SCRIPTORIUM guidelines (Zeldes & Schroeder 2013)
(considers chance agreement, cf. Artstein & Poesio 2008)
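The surviving slide text does not name the chance-corrected measure. As a sketch, the code below assumes Cohen's kappa for two annotators tagging the same tokens, next to plain percent agreement; the function names and the toy tag sequences are illustrative only.

```python
from collections import Counter


def percent_agreement(tags_a, tags_b):
    """Share of tokens on which both annotators chose the same tag."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag sequences of equal length")
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)


def cohens_kappa(tags_a, tags_b):
    """Agreement corrected for what two annotators would share by chance."""
    observed = percent_agreement(tags_a, tags_b)
    n = len(tags_a)
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement: product of each annotator's own tag proportions.
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["N", "V", "N", "PREP"]
    annotator_2 = ["N", "V", "V", "PREP"]
    print(percent_agreement(annotator_1, annotator_2))
    print(cohens_kappa(annotator_1, annotator_2))
```

With a tagset dominated by a few frequent closed-class tags, raw agreement can look high by chance alone, which is why the corrected figure is the more informative one.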
Where are the problems?
854/906=94.26%
542/576=94.09%
continuing training, etc.
Where are the problems?
is this verb nominalized?
pronoun? problems with converters
Example: Converters
some environments: ⲉ
ⲉϥϯⲙ̄ⲧⲟⲛ ⲇⲉ ⲟⲛ` ⲛ̄ⲧⲙⲁⲁⲩ` ⲛ̄ⲧⲁⲥϫⲡⲟϥ (...? And thus he gives rest to the mother who bore him...)
Training a tagger for Coptic
Different genres in manual training set
Corpus                                   manual morphs  +auto morphs  total tokens
Abraham our Father (Shenoute of Atripe)  1908           7111          7688
Apophthegmata Patrum                     1395           1395          1501
Sahidica NT                              1229           209,633       209,633
Total                                    4532           218,139       218,822
Crafting the tag set
something is a noun or a verb
verbs?
Crafting the tag set
imperatives
Coptic = mnt-rm-n-kēme < Egypt + man + ness
tempting to break down, but this obscures its being a noun
separate annotation layer!
with article etc. or pronoun), predicate, objects, prepositional phrases...
Closed vs. open classes
nouns (hard, but important!)
problem when unambiguous...
ⲉ: CFOC, CREL, CCIRC, AOPT (ⲉϥⲉ), ACOND (ⲉϥϣⲁⲛ), PREP
Evaluation
almost full lexicon coverage, using TreeTagger (Schmid 1994)
[Figure: plot with y-axis from 88 to 98 and x-axis from 1000 to 5000]
Evaluation
(even with 10% missing, Shenoute is Shenoute...)
Evaluation
papyri.info
(thanks to Ines Rehbein for this idea)
Getting more of the "best" data
Extrapolation – making up more good data!
class words like nouns and verbs?
Not so easy
IN TRAINING DATA EXTRAPOLATION VIA LEXICON
Automatic generation: some examples
Disambiguating ⲉ-
auō esnau ansōtm epnoute
'And while she saw, we listened to God.'
šakei epšēre šantuhareh epjoeis
'You always go to the son, until they observe ø the Lord'
Stay tuned for how this turns out!
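Generated sentences of this kind can be approximated by filling a tagged template from a lexicon. The sketch below is hypothetical throughout: the template, the transliterated mini-lexicon and the function name are invented for illustration, and only the tag names (CCIRC, PPER, V, PREP, ART, N) come from the tagset on these slides. It is not the generator used in the project.

```python
from itertools import product

# Hypothetical mini-lexicon in transliteration, keyed by tag.
LEXICON = {
    "PPER": ["s", "f"],
    "V": ["nau", "sōtm"],
    "N": ["noute", "šēre"],
}

# Each slot is (fixed form or None, tag); None means "fill from the lexicon".
# The two e- slots get different tags, which is the point of the exercise.
TEMPLATE = [
    ("e", "CCIRC"),
    (None, "PPER"),
    (None, "V"),
    ("e", "PREP"),
    ("p", "ART"),
    (None, "N"),
]


def generate(template, lexicon):
    """Return every tagged sequence the template licenses."""
    slots = [[form] if form is not None else lexicon[tag] for form, tag in template]
    tags = [tag for _, tag in template]
    return [list(zip(forms, tags)) for forms in product(*slots)]


if __name__ == "__main__":
    for sentence in generate(TEMPLATE, LEXICON):
        print(" ".join(f"{form}/{tag}" for form, tag in sentence))
```

Because the tag of each ⲉ- is fixed by its position in the template, every generated sequence is a correctly labeled training example for the ambiguous form, at the cost of sometimes odd semantics.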
Why do all this corpus linguistics stuff?
considering the effort?
<hi rend="oversized letter in left margin">ⲁ</hi>
ⲩϫⲟⲟ̇ⲥ ⲉ̇ⲧⲃⲉⲧ̇ⲙⲁⲕⲁ<lb/> ⲣⲓ̇ⲁ̇ ⲥⲁⲣⲁ ⲧ̇ⲡⲁⲣⲑⲉⲛⲟⲥ<lb/> ϫⲉⲁ̇ⲥⲉⲣ ⲥⲉ ⲛ̇ⲣⲟⲙⲡⲉ<lb/> ⲉ̇ⲥⲟⲩⲏ̇ϩ ⲙ̇ⲡⲉⲧⲡⲉ <lb/> ⲙ̇ⲡⲓ̇ⲉ̇ⲣⲟ · ⲙ̇ⲡⲉⲥ<lb/> ⲕⲉ ⲣⲁⲧⲥ̇ ⲉ̇ⲃⲟⲗ ⲉ̇ⲛⲉϩ ⲉ̇<lb/> ⲛⲁⲩ ⲉ̇ⲡⲓ̇ⲉ̇ⲣⲟ.ⲻ<lb/>
<pb/>
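A diplomatic transcription like this one records layout rather than words. As a rough sketch, assuming the markup uses only empty <lb/> and <pb/> elements for line and page breaks plus inline elements such as <hi>, the physical manuscript lines can be recovered with regular expressions. The function name is hypothetical, and the code deliberately does not try to rejoin words split across lines, which needs a lexicon.

```python
import re


def manuscript_lines(markup):
    """Split a diplomatic transcription into its physical lines of text."""
    lines = []
    # Line and page breaks delimit the physical lines.
    for chunk in re.split(r"<(?:lb|pb)\s*/>", markup):
        # Drop any remaining tags but keep the text they enclose.
        text = re.sub(r"<[^>]+>", "", chunk)
        text = " ".join(text.split())
        if text:
            lines.append(text)
    return lines


if __name__ == "__main__":
    sample = '<hi rend="oversized">ⲁ</hi>ⲩϫⲟⲟⲥ ⲉⲧⲃⲉⲧⲙⲁⲕⲁ<lb/> ⲣⲓⲁ ⲥⲁⲣⲁ<lb/>\n<pb/>'
    for number, line in enumerate(manuscript_lines(sample), start=1):
        print(number, line)
```

The output shows why line-based text is awkward for search: a word broken at the end of a line appears as two unrelated fragments until the bound groups are reassembled.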
Why do all this corpus linguistics stuff?
informed statistics and gain new insights:
What is a text about?
[Figure: two word clouds, labeled "11 APOPHTHEGMATA PATRUM" and "GOSPEL OF MARK 1"]
Run of the mill word clouds...
Egyptian vocabulary: said, you.SG.M, Abba, eat, wine, I/me
Greek vocabulary: synagogue, impure, baptism, John, Jesus, Holy Ghost, Gospel
What is a text about?
ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ 'since I became a monk'
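The example shows why vocabulary counts need segmented text: in the bound group, 'monk' is invisible to a whole-word count. A small sketch follows; the six-way split of the first bound group is hand-made to mirror the gloss since-that-PAST-1sg-do-monk and is an assumption, not project output, and the second bound group is added only for illustration.

```python
from collections import Counter


def morph_counts(segmented_groups):
    """Count morphs across a list of segmented bound groups."""
    return Counter(morph for group in segmented_groups for morph in group)


# Normalized bound groups as they stand in the text.
bound_groups = ["ϫⲓⲛⲧⲁⲓⲣⲙⲟⲛⲁⲭⲟⲥ", "ⲡⲙⲟⲛⲁⲭⲟⲥ"]

# Hand-made segmentations (assumed for illustration).
segmented = [
    ["ϫⲓⲛ", "ⲧ", "ⲁ", "ⲓ", "ⲣ", "ⲙⲟⲛⲁⲭⲟⲥ"],
    ["ⲡ", "ⲙⲟⲛⲁⲭⲟⲥ"],
]

if __name__ == "__main__":
    print(Counter(bound_groups)["ⲙⲟⲛⲁⲭⲟⲥ"])    # whole-word count
    print(morph_counts(segmented)["ⲙⲟⲛⲁⲭⲟⲥ"])  # morph-level count
```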
Grammatical characteristics
versus AOF:
Excel Plug-in: http://korpling.german.hu-berlin.de/~amir/uoaddin.htm
Grammatical characteristics
enclitics (especially ⲇⲉ, but others too)
which is legitimate, brethren which are superior to them, thoughts of alienation which exist in our hearts...)
Conclusion
standards while not re-inventing the wheel
better handling of out of vocabulary items)
Outlook
eHumanities at HU Berlin
KOrpuslinguistische Methoden für ePhilologie mit TEI (corpus-linguistic methods for e-philology with TEI)
linguistics methods and formats
world textual resources
Ⲙⲓⲱⲧⲛ ⲧⲱⲛⲟⲩ!
well-being+your.PL greatly => Thanks!
References
Computational Linguistics. Computational Linguistics 34(4), 556–596.
stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf.
Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009. Liverpool, UK.
Tagsets for Sahidic Coptic. Version: 1.0.1_2013.7.6a. Available at: http://coptic.pacific.edu/download/tools/scriptorium_tagset_documentation.pdf.
Links
http://coptic.pacific.edu/
http://www.sfb632.uni-potsdam.de/annis/
https://korpling.german.hu-berlin.de/annis3/scriptorium