SLIDE 1

Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data

Caroline T. Schroeder, University of the Pacific, cschroeder@pacific.edu
Amir Zeldes, Humboldt-Universität zu Berlin, amir.zeldes@rz.hu-berlin.de

Berlin Digital Classicist Seminar, 14.1.2014

SLIDE 2

Berlin, 14.1.2014 Schroeder & Zeldes / Towards Digital Coptic 1/37

Plan

  • Introduction
  • Coptic data
  • Annotations so far: normalizing, tokenizing and tagging
  • Search architecture
  • Searching through multiple segmentations: ANNIS
  • Dealing with corpus formats: TEI, SaltNPepper
  • Visualization
  • Dedicated visualizations
  • A reusable generic approach
  • Conclusion and outlook
SLIDE 3

Who are these people?

  • Prof. Caroline T. Schroeder: Religious and Classical Studies / Humanities Center Director, University of the Pacific
  • Dr. Amir Zeldes: Korpuslinguistik / SFB 632 Information Structure (from March: eHumanities group KOMeT), Humboldt-Universität zu Berlin
  • Cooperation: Coptic SCRIPTORIUM, established at the 2012 NEH summer institute on "Text in a Digital Age" (Tufts): http://coptic.pacific.edu/

SLIDE 4

Why Coptic?

  • Last stage of the Ancient Egyptian language (starting in the 2nd century)
  • Mediterranean in the 1st millennium
  • Hellenistic period
  • Unique language
  • Longest continuous documentation
  • Contact language (with Greek)
  • Religious significance
  • Early Christianity
  • Rise of monasticism
  • Gnosticism
  • ...
[Figure: Coptic dialects (BMBF eHumanities - KOMeT / Zeldes)]

SLIDE 5

The data

  • Lots of material (thanks to the Egyptian desert)
  • Relatively little online, nothing like Greek and Latin (Perseus)

  • Lots of things you may want are not available:
  • New Testament (online, not normalized/lemmatized/annotated)
  • Old Testament
  • The Rule of St. Pachomius
  • Works of Shenoute of Atripe
  • Apophthegmata patrum
  • ...
  • But some have been digitized at some point!
SLIDE 6

A word about the texts in this talk

  • So far we've concentrated on Shenoute's sermon Abraham our Father
  • "As for us, brethren, let us live by the truth so that we are upstanding in all our works, and so that the prophets, apostles and all the saints might dwell among us, ..."
  • Apophthegmata Patrum (sayings of the desert fathers)
  • "They said about the blessed Sarah the virgin that she spent sixty years living at the top of the river and she never set foot outside to see the river."
  • New Testament, esp. Gospel of Mark

See http://coptic.pacific.edu/ for corpora and tools

SLIDE 7

Getting from raw text to annotated corpora

  • Making the data searchable starts with:
  • Encoding manuscripts (EpiDoc TEI)
  • Segmentation of "word forms"
  • Normalization
  • Segmentation of morphemes
  • Part-of-speech tagging
  • More annotations...
  • Brief recap: detailed talk in Leipzig last month (slides on my page)

SLIDE 8

Normalization

  • Automatic normalization, manual correction
  • handling of known diacritics, abbreviations
  • closed, growing list of known variants
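
A minimal sketch of the automatic normalization step, assuming that "handling of known diacritics" means stripping combining marks and that the variant list is a simple lookup table. The names `KNOWN_VARIANTS`, `strip_diacritics` and `normalize` are hypothetical, and the single table entry is illustrative only, not taken from the project's list:

```python
import unicodedata

# Hypothetical variant table: maps a known variant or abbreviation (with
# diacritics already removed) to its normalized form. Illustrative entry only.
KNOWN_VARIANTS = {
    "ⲓⲥ": "ⲓⲏⲥⲟⲩⲥ",
}

def strip_diacritics(form: str) -> str:
    """Remove combining marks such as supralinear strokes and dots."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(form: str) -> str:
    """Strip diacritics, then replace the form if it is a known variant."""
    bare = strip_diacritics(form)
    return KNOWN_VARIANTS.get(bare, bare)
```

Forms that are not in the table pass through with only their diacritics removed, which is where manual correction would take over.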
SLIDE 9

Tokenization

  • Identifying morphemes is non-trivial (agglutinative language, different conventions; we follow Layton 2004)
  • ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ 'Since I became a monk': since-that-PAST-1sg-do-monk
  • ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ 'he who made us keep the ceremony': REL-PAST-3sgM-CAUS-1pl-do-the-observance
  • Word level segmentation: manual (no scriptio continua)
  • Morph segmentation: automatic (accuracy: 84% - 94%)

ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ` -> ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ

of-a-son of-Abraham -> of a son of Abraham
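
The slides do not say how the automatic morph segmentation works. As an illustration of the task only, here is a dictionary-based sketch that splits a normalized word form into known morphs, trying longer morphs first and backtracking. `LEXICON` and `segment` are hypothetical names, and the toy lexicon holds just the morphs from the example above:

```python
from functools import lru_cache

# Toy morph lexicon built from the slide's example; a real one would be far larger.
LEXICON = frozenset({"ⲛ", "ⲟⲩ", "ϣⲏⲣⲉ", "ⲁⲃⲣⲁϩⲁⲙ"})

def segment(word, lexicon=LEXICON):
    """Return one segmentation of `word` into lexicon morphs, or None.

    Longer morphs are tried first; if the remainder cannot be segmented,
    the search backtracks to a shorter morph.
    """
    longest = max(map(len, lexicon))

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return ()
        for end in range(min(len(word), start + longest), start, -1):
            if word[start:end] in lexicon:
                rest = solve(end)
                if rest is not None:
                    return (word[start:end],) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None
```

A purely lexical approach like this cannot resolve ambiguous splits, which is one reason accuracy figures below 100% are to be expected for any automatic segmenter.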
SLIDE 10

Part-of-speech tagging

  • POS tagging using TreeTagger (Schmid 1994) and a lexicon from the CMCL project (courtesy of Prof. Tito Orlandi)
  • Two tag sets:
  • fine grained (45 tags) and coarse (22 tags) (see http://coptic.pacific.edu/ for documentation)
  • Interannotator agreement: 94.19% agreement, kappa = 93.67 (considers chance agreement, cf. Artstein & Poesio 2008)
  • Accuracy:
  • In domain, 10-fold cross-validation: 94.04% (fine)
  • Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)
  • Main difficulties: open classes (N/V), disambiguating homonyms (ⲉ can have 6 different tags!)
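
To make "considers chance agreement" concrete, here is a sketch of raw agreement and Cohen's kappa for two annotators' tag sequences. Cohen's kappa is one common chance-corrected coefficient; the slides do not state which variant was actually computed, and the function names are hypothetical:

```python
from collections import Counter

def observed_agreement(a, b):
    """Share of positions where both annotators chose the same tag."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Agreement corrected for the agreement expected by chance.

    Chance agreement is estimated from each annotator's own tag distribution.
    Undefined (division by zero) if chance agreement is already 1.
    """
    n = len(a)
    p_o = observed_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

With tags `N V N N` versus `N V V N`, raw agreement is 0.75 but kappa is only 0.5, because half of the agreement would be expected by chance.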

SLIDE 11

Further annotations

  • Many other layers are done manually:
  • Translation
  • Language of origin
  • Coreference
  • Entity tagging (people, places...)
  • Parallel alignment (with Greek)
  • Syntax trees (very preliminary tests)
SLIDE 12

Representing data – how to look at all this stuff?

  • We now have a lot of data to represent:
  • Diplomatic transcriptions (including character rendering!)
  • Normalization
  • Segmentation into words, morphemes, sometimes letters
  • Annotations
  • How do we encode this data for search and visualization?

SLIDE 13

The first challenge: minimal units

  • Minimal units, or tokens, are critical for searching:
  • Find all words preceding the word "God"
  • Give me any mentions of Saint Paphnutius, ±10 words
  • Search for the glosses father and son within 20 words
  • Two problems:
  • The concept of words is complex in Coptic
  • Annotations overlap parts of words: individual letters, line breaks... -> tokens are smaller than words!

ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ

he sAid "it's been e ight years" – The old man told him

SLIDE 14

Solution: segmentation layers in ANNIS

  • We use the open source ANNIS platform as a search interface (Zeldes et al. 2009)
  • Any annotation layer can be defined as a segmentation, defining alternative views on:
  • Adjacency (in words, morphemes, etc.)
  • Proximity (in words, morphemes, etc.)
  • Context size (in words, morphemes, etc.)
  • But which segmentation layer do you want to see?
  • Remember, diplomatic and normalized layers don't match
  • Any segmentation layer is usable as "base text"
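
The idea of segmentation layers can be sketched as spans over a shared sequence of minimal tokens. This is an illustrative data structure, not the ANNIS implementation; `Span`, `context` and `adjacent` are hypothetical names, and the English stand-in data assumes a word ("bandits") split by a manuscript line break:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # index of the first minimal token covered
    end: int     # index one past the last minimal token covered
    value: str

# Minimal tokens: "bandits" is split in two by a line break.
tokens = ["band", "of", "ban", "dits", "and", "found"]

# Two segmentation layers over the same minimal tokens.
token_layer = [Span(i, i + 1, t) for i, t in enumerate(tokens)]
word_layer = [Span(0, 1, "band"), Span(1, 2, "of"), Span(2, 4, "bandits"),
              Span(4, 5, "and"), Span(5, 6, "found")]

def context(layer, hit, n):
    """Units of `layer` within n units to the left and right of layer[hit]."""
    return layer[max(0, hit - n): hit + n + 1]

def adjacent(a, b):
    """True if span b starts exactly where span a ends."""
    return a.end == b.start
```

The same hit position gives different context depending on the layer: one unit around index 2 is "of bandits and" in words but "of ban dits" in minimal tokens.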
SLIDE 15

Switching segmentations in ANNIS

SLIDE 16

Different contexts

  • Example search: entity="person"
  • Hit: Abba Antonius
  • Some options:
  • ±5 words, diplomatic (fewer than 5 words of left context are shown, since the hit is near the start of the text):

Ⲁⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉⲟⲩⲛⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇ ⲙ̇ⲙⲟⲕ

  • ±10 morphs, normalized:

ⲁ ⲩ ϭⲱⲗⲡ ⲉⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁⲛⲧⲱⲛⲓⲟⲥ ϩⲓ ⲡ ϫⲁⲓⲉ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ ⲉ ϥ ⲉⲓⲛⲉ ⲙⲙⲟ ⲕ

  • ±5 tokens:

Ⲁ ⲩ ϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ ϫⲁⲓ̇ⲉ̇ · ϫⲉ

Ⲁⲩϭⲱⲗⲡ̇

5 ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇

SLIDE 17

Searching with AQL

(see http://www.sfb632.uni-potsdam.de/annis/ )

  • Basic principle of ANNIS Query Language (AQL):
  • search for some annotations (#1, #2, #3...)
  • stipulate relationships between them (operators)
  • Example: verbs of Greek origin

pos="V" & source_lang="Greek" & #1 _=_ #2

(_=_ is the identical coverage operator)

Example hits: "The head bandit repented", "I have faith in God"
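
The semantics of the example query can be sketched in a few lines: pick annotations matching each condition, then keep the pairs that cover exactly the same span. This is an illustration of what `_=_` means, not how ANNIS evaluates queries; `Anno`, `identical_coverage` and the sample data are hypothetical:

```python
from typing import NamedTuple

class Anno(NamedTuple):
    start: int   # first token covered
    end: int     # one past the last token covered
    name: str
    value: str

def identical_coverage(annos, first, second):
    """Pairs (a, b): a matches `first`, b matches `second`, same span."""
    def matches(anno, cond):
        return (anno.name, anno.value) == cond
    return [(a, b)
            for a in annos if matches(a, first)
            for b in annos if matches(b, second)
            and (a.start, a.end) == (b.start, b.end)]

# Hypothetical annotations: a Greek verb, a Greek noun, and a non-Greek verb.
annos = [
    Anno(3, 4, "pos", "V"), Anno(3, 4, "source_lang", "Greek"),
    Anno(5, 6, "pos", "N"), Anno(5, 6, "source_lang", "Greek"),
    Anno(7, 8, "pos", "V"),
]
hits = identical_coverage(annos, ("pos", "V"), ("source_lang", "Greek"))
```

Only the span 3-4 qualifies here, since it is the only one carrying both `pos="V"` and `source_lang="Greek"`.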

SLIDE 18

Referencing segmentations

  • There are many operators
  • . (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...
  • > (dominance), -> (pointing relation), >@l (left child)...
  • ...
  • Possible to use segmentations in queries:
  • #1 . #2 : one followed by two
  • #1 .word #2 : two is the next word after one
  • #1 .norm,1,10 #2 : within 1 to 10 norm units
  • ...
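
A sketch of what some of these operators test, using half-open `(start, end)` token spans. The function names are hypothetical, and `precedes` simplifies matters by assuming both operands are themselves units of the chosen segmentation layer:

```python
def includes(a, b):
    """_i_ : span a fully contains span b."""
    return a[0] <= b[0] and b[1] <= a[1]

def overlaps(a, b):
    """_o_ : spans a and b share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def left_aligned(a, b):
    """_l_ : spans a and b start at the same token."""
    return a[0] == b[0]

def precedes(layer, a, b, min_dist=1, max_dist=1):
    """.seg,min,max : b lies min..max units of `layer` after a.

    With the default distances this is plain adjacency in that layer.
    """
    i, j = layer.index(a), layer.index(b)
    return min_dist <= j - i <= max_dist
```

For a layer `norm = [(0, 1), (1, 2), (2, 4), (4, 5)]`, `precedes(norm, (0, 1), (4, 5))` is false, while `precedes(norm, (0, 1), (4, 5), 1, 10)` is true, mirroring `.norm` versus `.norm,1,10`.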
SLIDE 19

Adding metadata

  • Metadata is like any other constraint, with the meta:: prefix

  • Can use regular expressions and negation

pos!="V" & source_lang="Greek" & #1 _=_ #2 & meta::msName=/MONB.*/

  • For metadata names and values we use TEI/EpiDoc as a guideline
  • More information on AQL: http://www.sfb632.uni-potsdam.de/annis/
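
A metadata constraint such as `meta::msName=/MONB.*/` acts as a document filter. The sketch below assumes the pattern has to match the whole metadata value; `filter_by_meta` and the document records are hypothetical, not part of ANNIS:

```python
import re

def filter_by_meta(documents, key, pattern, negate=False):
    """Keep documents whose metadata value for `key` matches `pattern`.

    The pattern must match the entire value (an assumption of this sketch).
    With negate=True the result is inverted, as with `!=` in a query.
    """
    rx = re.compile(pattern)
    kept = []
    for doc in documents:
        hit = rx.fullmatch(doc.get(key, "")) is not None
        if hit != negate:
            kept.append(doc)
    return kept

# Hypothetical document metadata.
docs = [{"msName": "MONB.AA"}, {"msName": "OTHER.01"}]
monb = filter_by_meta(docs, "msName", r"MONB.*")
```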

SLIDE 20

Architecture and formats

  • Different formats are suitable for different parts of the data
  • TEI ideal for manuscript structure, metadata
  • Linguistic formats for computational corpus linguistics: tagging, parsing, coreference
  • Convert and merge data using SaltNPepper (Zipser & Romary 2010)

SLIDE 21

SaltNPepper (Zipser & Romary 2010)

  • Metamodel Salt for multiformat conversion
  • Work on extending TEI support: 2014-15
  • Salt as internal representation in ANNIS

SLIDE 22

How can we view the data?

  • Even if we can query everything at once:
  • people who are indirect objects of the verb "show" aligned with Greek neuters...

  • Can we also look at everything at once?
  • Excerpt from a Salt graph view of two words:
SLIDE 23

Breaking it down

  • Different annotations require different visualizations
  • Two conflicting requirements:
  • Ideal representation for each layer (syntax -> trees)
  • Stay generic and minimize the number of visualizations
  • How can we avoid programming new visualizations with each new annotation layer?

SLIDE 24

Generic versus dedicated

  • For some purposes, dedicated visualizations cannot be avoided
  • Special interactive functionality
  • Special layout algorithms
  • For other purposes, we can reuse visualizations by making them flexible and configurable
  • Need to take segmentations into account
SLIDE 25

Some dedicated examples

  • Syntax trees
  • Coreference view (interactive)
SLIDE 26

Taking segmentations into account

  • Visualizations must be configurable to be aware of different base texts
  • Syntax tree is based on normalized "word"-internal morphs
  • Sometimes one syntactic unit has multiple tokens

[Figure: example "came upon a band of bandits and found them drinking. [...]", with "bandits" split across a line break (line 15) into the tokens "ban" and "dits"]

SLIDE 27

Reusing dedicated visualizers?

  • In some cases, creative uses can be found for existing visualizations
  • Using the coreference visualizer for parallel alignment:

[Figure: parallel alignment shown in the coreference visualizer, Apophthegmata Patrum]

SLIDE 28

Generic visualizations

Two main generic visualizers:

  • Annotation grid:
  • just mark borders of annotations
  • good for flat information
  • HTML visualizer:
  • generates HTML elements based on annotations
  • defined using two simple stylesheets
  • can look like (almost) anything
SLIDE 29

Multiple grids

  • All annotations in one grid can lead to visual overload
  • Often better to separate groups of annotations:
SLIDE 30

The HTML visualizer

norm.config

    p        p
    word     span; style="word"
    norm     span; style="norm"    value
    trans    t:title; style="trans"    value

norm.css

    div.htmlvis { font-family: Antinoou, sans-serif; width: 500px; white-space: normal !important; }
    .trans:hover{color: red}
    .word:after{content: " ";}

  • Any specific visualization is configured by two style sheets: a config file and a CSS file

SLIDE 31

Result

<p>
  <t class="translation" title="Abraham our father wished to have children with Sarah.">
    <span class="word"> <span class="norm"> ⲁⲃⲣⲁϩⲁⲙ </span> </span>
    <span class="word"> <span class="norm"> ⲡⲉⲛ </span> <span class="norm"> ⲉⲓⲱⲧ </span> </span>
  </t>
  ...
</p>

Abraham our Father
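
How config rules could turn nested annotations into HTML like the result above can be sketched as follows. This is a simplified reading of the slide's example, not the ANNIS implementation: it assumes tab-separated fields, that `style="x"` becomes `class="x"`, that `element:attribute` puts the annotation value into that attribute, and that a bare `value` field makes the value the element's text. `parse_config`, `render` and the node tuples are hypothetical:

```python
import re
from html import escape

# element[:attribute][; style="css-class"]
RULE = re.compile(r'^(?P<elem>\w+)(?::(?P<attr>\w+))?(?:;\s*style="(?P<style>[^"]*)")?$')

def parse_config(text):
    """Map annotation name -> rendering rule, one rule per config line."""
    rules = {}
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split("\t")]
        m = RULE.match(fields[1])
        rules[fields[0]] = {
            "elem": m["elem"], "attr": m["attr"], "style": m["style"],
            "value": len(fields) > 2 and fields[2] == "value",
        }
    return rules

def render(node, rules):
    """Render a (name, value, children) annotation node as an HTML string."""
    name, value, children = node
    r = rules[name]
    attrs = f' class="{r["style"]}"' if r["style"] else ""
    if r["attr"] and value is not None:
        attrs += f' {r["attr"]}="{escape(value, quote=True)}"'
    show_text = r["value"] and not r["attr"] and value is not None
    inner = (escape(value) if show_text else "") + "".join(
        render(child, rules) for child in children)
    return f'<{r["elem"]}{attrs}>{inner}</{r["elem"]}>'

CONFIG = ('p\tp\n'
          'word\tspan; style="word"\n'
          'norm\tspan; style="norm"\tvalue\n'
          'trans\tt:title; style="trans"\tvalue')

tree = ("p", None, [("trans", "Abraham our father", [
    ("word", None, [("norm", "ⲁⲃⲣⲁϩⲁⲙ", [])])])])
html_out = render(tree, parse_config(CONFIG))
```

Because the rules live in data rather than code, a different config (for example one for the diplomatic layer) reuses the same renderer unchanged.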

SLIDE 32

Reusing the HTML visualizer

dipl.config

    tok        span    value
    lb         div; style="line"
    pb         table:title; style="pb"    value
    pb         tr
    cb         td; style="cb"
    hi_rend    hi_rend:rend    value

SLIDE 33

Visualizing TEI @rend attributes

dipl.css

    div.line{display: block; height: 22px; counter-increment: linecount;}
    div.line:nth-of-type(5n):before{ content: counter(linecount)" "}

    ...

    .pb{border-style:solid;}
    .cb{counter-reset: linecount 0; width: 160px; min-width: 160px}
    ...
    hi_rend[rend*=superscript] {vertical-align: super; font-size: 80%}
    hi_rend[rend*=red] {color: red}
    hi_rend[rend*=tall] {font-size: 120%}
    hi_rend[rend*=extralarge] {font-size: 160%}

SLIDE 34

Aggregate visualizations

  • Latest version of ANNIS offers basic frequency analysis
  • Open question: How much more should we build?
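
A basic frequency analysis over query hits amounts to counting annotation values. The sketch below assumes hits have been exported as dictionaries of annotation values; the export format, `frequency_table` and the sample data are hypothetical:

```python
from collections import Counter

def frequency_table(hits, layer):
    """Rows of (value, count, share) for one annotation layer, most frequent first."""
    counts = Counter(hit[layer] for hit in hits)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]

# Hypothetical exported hits.
hits = [
    {"pos": "V", "source_lang": "Greek"},
    {"pos": "N", "source_lang": "Greek"},
    {"pos": "V", "source_lang": "Egyptian"},
    {"pos": "N", "source_lang": "Greek"},
]
table = frequency_table(hits, "source_lang")
```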
SLIDE 35

Aggregate visualizations

  • Other visualizations are currently done e.g. in R:

[Figure: vocabulary plots for the Apophthegmata Patrum and Gospel of Mark 1. Callouts mark Egyptian vocabulary (old man, said, you.SG.M, Abba, eat, wine, I/me) and Greek vocabulary (synagogue, impure, baptism, John, Jesus, Holy Ghost, Gospel)]

SLIDE 36

Conclusion

  • Annotation projects should not be limited by corpus architectures:
  • annotate whatever you want, however often you want
  • link anything to anything
  • Why annotate all of these things in the corpus? (and not just in a separate spreadsheet)
  • Plots of just the verbs? Proper names? -> POS tagging
  • Highlight, search and link place-names? -> Entity tagging
  • Collapse inflected variants? -> Lemmatization
  • Collapse prominent referents? -> Coreference annotation
  • Dispersion of any of the above, alignment ... and much more
SLIDE 37

Conclusion

  • Anything can be made queryable with more layers:
  • typical constructions and objects of verbs?
  • Greek vs. native verbs -> add language of origin layer
  • Translation behavior -> add alignment layer
  • ...
  • Fitting visualization facilities
  • should be easy to re-use
  • optimized to the task, display relevant portions of information
  • for many purposes, they must be sensitive to segmentations
SLIDE 38

Outlook

  • This March: BMBF funded young researcher group on eHumanities at HU Berlin
  • KOMeT: KOrpuslinguistische Methoden für ePhilologie mit TEI (corpus-linguistic methods for e-philology with TEI)
  • Focus on marrying TEI resources with computational linguistics methods and formats
  • Developing NLP tools, search and visualization for ancient world textual resources

  • Pilot phase (2014, approved): Coptic
  • Main phase (2015-2019, pending): Other languages as well
  • Currently looking for a student assistant (60h/month)
  • Stay tuned for more!
SLIDE 39

Ⲙⲓⲱⲧⲛ ⲧⲱⲛⲟⲩ!

well-being+your.PL greatly => Thanks!

SLIDE 40

References

  • Artstein, Ron & Massimo Poesio (2008), Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4), 556–596.
  • Layton, Bentley (2004), A Coptic Grammar. Second Edition, Revised and Expanded. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.
  • Schmid, Helmut (1994), Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of the Conference on New Methods in Language Processing. Manchester, UK, 44–49. Available at: http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf.
  • Zeldes, Amir, Julia Ritz, Anke Lüdeling & Christian Chiarcos (2009), ANNIS: A Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009. Liverpool, UK.
  • Zipser, Florian & Laurent Romary (2010), A Model Oriented Approach to the Mapping of Annotation Formats using Standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC-2010. Valletta, Malta, 7–18.

SLIDE 41

Links

  • Coptic SCRIPTORIUM: http://coptic.pacific.edu/
  • ANNIS: http://www.sfb632.uni-potsdam.de/annis/
  • Search engine for our corpora: https://korpling.german.hu-berlin.de/annis3/scriptorium

  • Papyri.info: http://papyri.info/
  • CMCL: http://cmcl.let.uniroma1.it/