
SLIDE 1

The Glossarium Graeco-Arabicum Linguistic Research and Database Design in Polyalphabetic Environments

Torsten Roeder (BBAW), Yury Arzhanov (Ruhr Universität Bochum)

SLIDE 2

Ms. Paris BnF 5847, f. 5:

Muslim scholars in discussion.

SLIDE 3

Arabic translation of Dioscurides’ Materia medica

(Ibn al- al-- al-adwiya wa-l-aghdhiya, 1–4. 1291 H.)

SLIDE 4

Filecards for the Greek and Arabic Lexicon (GALex)

SLIDE 5

GALex

SLIDE 6

The Database Glossarium Græco-Arabicum

SLIDE 7

The Glossarium Graeco-Arabicum makes available information in the following fields of research:

– the vocabulary and syntax of Classical and Middle Arabic;
– the development of a scientific and technical vocabulary in Arabic;
– the vocabulary of Classical and Middle Greek;
– the chronology and nature of the translation movement into Arabic;
– the establishment of the texts of Greek works and their Arabic translations.

SLIDE 8

The Glossarium Graeco-Arabicum online:

SLIDE 9 November 2013 Glossarium Graeco-Arabicum Telota

BERLIN-BRANDENBURGISCHE AKADEMIE DER WISSENSCHAFTEN

SLIDE 10

I Technical Challenges → polyalphabetic environment
II Scholarly Requirements → linguistic database
III Technical vs. Scholarly → concluding discussion

OUTLINE

SLIDE 11

1 Languages Used in the GlossGA Interface
2 Unicode Character Corpus
3 Areas of Technical Challenges
4 Examples

I. TECHNICAL CHALLENGES

SLIDE 12

Languages used within the project:

Language | Alphabet | Diacritics | Direction
Ancient Greek | Greek alphabet | 3 layers of diacritics | LTR (left to right)
Medieval Arabic | Arabic alphabet | optional vowel signs | RTL (right to left)
Modern English | Latin alphabet | 1 layer of diacritics | LTR (left to right)

I.1. LANGUAGES

SLIDE 13

Unicode Chart | Range | Description
C0 Controls and Basic Latin | 0000-007F | Latin Alphabet
Latin Extended-A | 0100-017F | transliteration symbols
Latin Extended-Additional | 1E00-1EFF | transliteration symbols
Greek and Coptic | 0370-03FF | Greek Alphabet
Greek Extended | 1F00-1FFF | Greek Diacritics
Arabic | 0600-06FF | Arabic Alphabet
Arabic Supplement | 0750-077F | Arabic Alphabet
Spacing Modifier Letters | 02B0-02FF | special Arabic characters

→ in total: about 450 different characters from eight different charts

I.2. UNICODE

SLIDE 14

Requirements:

1. Data input in all three alphabets with all vowels and diacritics

→ How to implement a comfortable interface?

2. Simultaneous display of texts in three alphabets and two directions

→ How to implement concurrent writing directions?

3. Search for terms, insensitive to diacritics or vowels

→ How to implement queries with different collation sets?

I.3. REQUIREMENTS

SLIDE 15

a Data Input
b Writing Directions
c Search
d Search Terms

I.4. EXAMPLES

SLIDE 16

ʾ ˒ ʿ ˓

I.4.a. DATA INPUT

SLIDE 17

U+02BE MODIFIER LETTER RIGHT HALF RING: transliteration of Arabic hamza
U+02D2 MODIFIER LETTER CENTRED RIGHT HALF RING: more rounded articulation
U+02BF MODIFIER LETTER LEFT HALF RING: transliteration of Arabic ain
U+02D3 MODIFIER LETTER CENTRED LEFT HALF RING: less rounded articulation

I.4.a. DATA INPUT

[ʾ] [˒] [ʿ] [˓]

SLIDE 18

Problem: Appearance vs. Encoding
Users will normally choose characters …
→ not because of their Unicode description
→ but because of their appearance
How to bring Unicode to the user?

I.4.a. DATA INPUT
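Since the four half-ring characters look almost identical on screen, one way to bring the encoding to the user is to show the code point and Unicode name next to each glyph. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is an assumption made for illustration and is not part of the GlossGA software.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    # The four look-alike characters from the data input slides.
    for ch in "\u02BE\u02D2\u02BF\u02D3":
        print(ch, describe(ch))
```

A virtual keyboard or input form could display such a label as a tooltip, so that the choice is made by encoding rather than by appearance alone.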

SLIDE 19

Solutions:
– restrict the characters accepted by the database → safe, but requires validation methods
– provide a virtual keyboard (onscreen) → user-friendly
Alternative methods:
– beta code → less recommendable from a Unicode point of view → but widely used

I.4.a. DATA INPUT
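The first solution, restricting the accepted characters, could be validated roughly as follows. This sketch checks input against the eight chart ranges from the Unicode chart table; the real project corpus is a subset of about 450 characters, so a production check would presumably use an explicit character list instead. The names `ALLOWED_RANGES` and `rejected_characters` are assumptions for illustration.

```python
# Code point ranges of the eight Unicode charts named in the chart table.
# Assumption: checking whole ranges, not the exact project corpus.
ALLOWED_RANGES = [
    (0x0000, 0x007F),  # C0 Controls and Basic Latin
    (0x0100, 0x017F),  # Latin Extended-A
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x1F00, 0x1FFF),  # Greek Extended
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x02B0, 0x02FF),  # Spacing Modifier Letters
]


def rejected_characters(text: str) -> list[str]:
    """Return the characters of `text` that lie outside all allowed ranges."""
    return [
        ch for ch in text
        if not any(low <= ord(ch) <= high for low, high in ALLOWED_RANGES)
    ]


if __name__ == "__main__":
    # U+2019 (right single quotation mark) is a common look-alike for U+02BE.
    print(rejected_characters("\u02BFilm"))  # accepted: empty list
    print(rejected_characters("a\u2019b"))   # U+2019 is rejected
```

Note that this range check would also reject combining diacritics from the Combining Diacritical Marks chart, so input would have to arrive in precomposed form.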

SLIDE 20

Phenomenon:

Home (THEN) Arabic Glossary (THEN) ص (THEN) ةحص

becomes

Home > Arabic Glossary > ةحص <ص

I.4.b. WRITING DIRECTIONS

SLIDE 21

Problem: Strong vs. Weak Characters
In Unicode, alphabetic characters are usually STRONG CHARACTERS, which determine the writing direction, while punctuation characters are usually WEAK CHARACTERS, which do not change the writing direction.
→ relevant in: comma-separated lists, bibliographic references, breadcrumb lines, table alignments …

I.4.b. WRITING DIRECTIONS
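The directional class of any character can be inspected with Python's standard `unicodedata.bidirectional`. The sketch below is only an illustration (the helper name `bidi_class` is an assumption): letters report a strong class (`L` for Latin, `AL` for Arabic letters), while the breadcrumb separator and the asterisk report `ON` (other neutral) and the comma reports `CS` (common separator, one of the weak types).

```python
import unicodedata


def bidi_class(char: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(char)


if __name__ == "__main__":
    samples = {
        "A": "Latin letter",
        "\u0635": "Arabic letter sad",
        ">": "breadcrumb separator",
        ",": "comma",
        "*": "asterisk",
    }
    for char, label in samples.items():
        print(f"U+{ord(char):04X} {label}: {bidi_class(char)}")
```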

SLIDE 22

Solutions:
– insert a “strong whitespace”: Unicode characters U+200E (left-to-right mark) or U+200F (right-to-left mark)
– or, if in HTML, set the writing direction directly:

I.4.b. WRITING DIRECTIONS
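Both solutions can be sketched in a few lines. The function names below are assumptions for illustration, and the sketch assumes a page whose base direction is left-to-right. The first function appends U+200E to every breadcrumb segment, so that each separator sits between a left-to-right mark and the next segment and resolves to the page direction. The second wraps a segment in a `span` with the standard HTML `dir` attribute; the original slide's own markup is not preserved in the transcript, so this is one plausible form, not a reconstruction.

```python
from html import escape

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK (listed for completeness)


def breadcrumb(segments: list[str], separator: str = " > ") -> str:
    """Join segments, adding an LRM after each one so that the neutral
    separator is not absorbed into an adjacent right-to-left run."""
    return separator.join(segment + LRM for segment in segments)


def html_segment(text: str, direction: str) -> str:
    """Wrap a text segment in a span with an explicit writing direction."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return f'<span dir="{direction}">{escape(text)}</span>'


if __name__ == "__main__":
    print(breadcrumb(["Home", "Arabic Glossary", "\u0635", "\u0635\u062D\u0629"]))
    print(html_segment("\u0635\u062D\u0629", "rtl"))
```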

SLIDE 23

GREEK: diacritics not distinct; requirement: η finds also ἠ ἦ ἥ
ARABIC: vowel signs not distinct; requirement: ببس finds also 7ب8ب8س
ENGLISH: diacritics distinct; requirement: d does not find ḏ

Problem: Distinction vs. Collation

I.4.c. SEARCH

SLIDE 24

Solution:
Greek: Greek collation
Arabic: Arabic collation
English: Latin collation
Collation Charts: <http://unicode.org/charts/uca/>
Restrictions:
– does not work for mixed texts → data needs to be separated
– some environments do not support Arabic vowel collation → e.g. MySQL <6.0

I.4.c. SEARCH
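Where database collations are not available, the three search requirements can be approximated by computing a separate search key per language. This is a substitute technique, not the Unicode collation approach named above, and all function names are assumptions. Greek keys drop all combining marks after decomposition, Arabic keys drop only the optional signs U+064B to U+0652 and U+0670 (without decomposition, so that hamza carriers stay distinct), and Latin keys keep diacritics.

```python
import unicodedata

# Arabic optional signs: fathatan through sukun, plus superscript alef.
ARABIC_OPTIONAL_SIGNS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def greek_key(text: str) -> str:
    """Strip accents, breathings and iota subscript from Greek text."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def arabic_key(text: str) -> str:
    """Remove optional vowel signs; base letters are left untouched."""
    return "".join(ch for ch in text if ch not in ARABIC_OPTIONAL_SIGNS)


def latin_key(text: str) -> str:
    """Keep diacritics distinct; only unify the normalization form."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print([greek_key(ch) for ch in "\u1F20\u1F26\u1F25"])
    print(latin_key("d") == latin_key("\u1E0F"))
```

As with collations, this only works if the data is separated by language, because each field needs the key function that matches its script.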

SLIDE 25

Phenomenon:
– user searches for Arabic words starting with لم
– truncation symbol (asterisk) appears on the wrong side

لم*

Problem: Neutral Writing Direction
– the standard asterisk is a NEUTRAL CHARACTER
– it adapts to the main writing direction

I.4.d. SEARCH TERMS

SLIDE 26

Solution: Unicode Arabic asterisk (U+066D ARABIC FIVE POINTED STAR), right-to-left

٭لم

I.4.d. SEARCH TERMS

SLIDE 27

Challenges for the Developer:
– Unicode does not provide general truncation or wildcard symbols
– different asterisk and wildcard signs must be processed
– no standard solution available

I.4.d. SEARCH TERMS
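One way to process the different truncation signs is to map all of them to a single internal wildcard before the query is built. The sketch below assumes the query ends up in an SQL `LIKE` clause with a backslash escape character; the function name `to_like_pattern` and the set of accepted signs are assumptions for illustration, not the project's actual implementation.

```python
# Signs accepted as truncation symbols: ASTERISK and ARABIC FIVE POINTED STAR.
TRUNCATION_SIGNS = {"*", "\u066D"}


def to_like_pattern(term: str, escape: str = "\\") -> str:
    """Translate a user search term into an SQL LIKE pattern.

    Truncation signs become '%'; characters that are special in LIKE
    ('%', '_' and the escape character) are escaped.
    """
    parts = []
    for ch in term:
        if ch in TRUNCATION_SIGNS:
            parts.append("%")
        elif ch in ("%", "_", escape):
            parts.append(escape + ch)
        else:
            parts.append(ch)
    return "".join(parts)


if __name__ == "__main__":
    # Both inputs yield the same pattern, whichever sign the user typed.
    print(to_like_pattern("\u0644\u0645*") == to_like_pattern("\u0644\u0645\u066D"))
```

The resulting pattern should still be passed to the database as a bound parameter, never concatenated into the SQL string.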

SLIDE 28

Technical Recommendations for Polyalphabetic Environments
– use software components that support Unicode throughout
– compose a project corpus of Unicode characters
– provide input methods to make the characters easily available
– consider Unicode writing directions and collations
– make sure that all characters not only appear correctly, but are also encoded correctly

SUMMARY OF I.

SLIDE 29

1 Corpus → How to deal with a database of 70,000+ words?
2 Translation movements → How to visualize transformations of language structures?
3 Single Lexemes → How to transform the database into a dictionary?

II. SCHOLARLY REQUIREMENTS

SLIDE 30

How to deal with a database of 70,000+ words?
– search form → user needs to know exactly what he/she is looking for
– browsing (e.g. by sources and words in alphabetical order) → user needs to know roughly what he/she is looking for
– visualization → statistical and/or graphical approach → user can explore the corpus

II.1. CORPUS

SLIDE 31

II.1.a. CORPUS TREEMAP

Distribution of sources in the GlossGA corpus
Area size corresponds to number of words
→ Which sources constitute the major/minor parts of the corpus?
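The principle that area size corresponds to number of words can be illustrated with the simplest treemap layout, a single row of slices whose widths are proportional to the word counts. This is a sketch only: the function name and the source names are hypothetical, and the layout algorithm actually used for the GlossGA treemaps is not stated in the slides.

```python
Rect = tuple[float, float, float, float]  # x, y, width, height


def slice_layout(counts: dict[str, int], width: float, height: float) -> dict[str, Rect]:
    """Lay out one rectangle per source, largest first, side by side.

    Every rectangle has the full height, so its area is proportional
    to its share of the total word count.
    """
    total = sum(counts.values())
    rects: dict[str, Rect] = {}
    x = 0.0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        w = width * count / total
        rects[name] = (x, 0.0, w, height)
        x += w
    return rects


if __name__ == "__main__":
    # Hypothetical word counts, not taken from the GlossGA corpus.
    print(slice_layout({"Source A": 300, "Source B": 100}, 400.0, 100.0))
```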

SLIDE 32

II.1.b. SOURCE TREEMAP

Distribution of words in one source
Area size corresponds to number of words
→ What kind of vocabulary constitutes the source?

SLIDE 33

How to visualize transformation of language structures?
→ compare parts of speech in diagrams (experimental)

II.2. TRANSLATION MOVEMENTS

SLIDE 34

Compared Parts of Speech
Blue: Greek Parts of Speech
Red: Arabic Parts of Speech
Bar Length: number of words of respective part of speech

II.2.a. TRANSLATION MOVEMENTS

SLIDE 35

Compared Parts of Speech
X-Axis: Greek Parts of Speech
Y-Axis: Arabic Parts of Speech
Intersections: Dot size represents number of words transferred from Greek PoS into Arabic PoS

II.2.b. TRANSLATION MOVEMENTS
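The numbers behind both diagrams can be derived from one cross-tabulation of (Greek PoS, Arabic PoS) pairs: the cell counts give the dot sizes, and the row and column totals give the bar lengths. The sketch below uses hypothetical pairs and assumed function names; it does not reflect the actual GlossGA schema.

```python
from collections import Counter


def pos_crosstab(pairs: list[tuple[str, str]]) -> Counter:
    """Count how often each (Greek PoS, Arabic PoS) combination occurs."""
    return Counter(pairs)


def marginals(crosstab: Counter) -> tuple[Counter, Counter]:
    """Sum the cross-tabulation per Greek PoS and per Arabic PoS."""
    greek: Counter = Counter()
    arabic: Counter = Counter()
    for (greek_pos, arabic_pos), n in crosstab.items():
        greek[greek_pos] += n
        arabic[arabic_pos] += n
    return greek, arabic


if __name__ == "__main__":
    # Hypothetical records: one pair per translated word.
    pairs = [("noun", "noun"), ("noun", "noun"), ("participle", "verb")]
    table = pos_crosstab(pairs)
    print(table)
    print(marginals(table))
```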

SLIDE 36

How to transform the database into a dictionary?
Experimental preview:
→ collation of all entries of a Greek lexeme
→ ordered by Arabic lexeme
→ output with source and context

II.3.a. SINGLE LEXEMES
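The three steps of the preview (collect all entries of a Greek lexeme, order them by Arabic lexeme, output source and context) can be sketched as a simple grouping operation. The record fields `greek`, `arabic`, `source` and `context`, the function name and the placeholder data are all assumptions; the sort here is by code point, whereas a real dictionary view would need a proper Arabic collation.

```python
from collections import defaultdict


def dictionary_entry(records: list[dict], greek_lexeme: str) -> list[tuple[str, list[tuple[str, str]]]]:
    """Group all records of one Greek lexeme by their Arabic lexeme.

    Returns (arabic_lexeme, [(source, context), ...]) pairs, ordered by
    Arabic lexeme (plain code point order in this sketch).
    """
    by_arabic: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for record in records:
        if record["greek"] == greek_lexeme:
            by_arabic[record["arabic"]].append((record["source"], record["context"]))
    return sorted(by_arabic.items())


if __name__ == "__main__":
    # Placeholder records, not taken from the database.
    records = [
        {"greek": "lexeme_1", "arabic": "arabic_b", "source": "Source A", "context": "context 1"},
        {"greek": "lexeme_1", "arabic": "arabic_a", "source": "Source B", "context": "context 2"},
        {"greek": "lexeme_2", "arabic": "arabic_a", "source": "Source A", "context": "context 3"},
    ]
    for arabic, occurrences in dictionary_entry(records, "lexeme_1"):
        print(arabic, occurrences)
```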

SLIDE 37

Export function via email:

II.3.b. SINGLE LEXEMES

SLIDE 38

Recommendations
1 provide multiple access methods → support various user scenarios
2 invent statistical and visual evaluation methods → profit from electronic data processing
3 provide conventional scholarly formats → correspond to the community’s needs

SUMMARY OF II.

SLIDE 39

Situation: Technical vs. Scholarly Requirements – which one goes first?
→ technical requirements as necessary basis
→ scholarly requirements as superior objective
– both need attention from scholars
– both need attention from techies
→ mutual understanding
→ team competence

LAST BUT ONE SLIDE

SLIDE 40

Thank you for your attention!
Project Website: http://telota.bbaw.de/glossga
Contact:
Yury Arzhanov | yury.arzhanov@rub.de
Torsten Roeder | roeder@bbaw.de