SLIDE 1

Linking Machine-Readable Dictionaries

Christian Chiarcos Applied Computational Linguistics Lab chiarcos@informatik.uni-frankfurt.de

Digital Humanities Workshop, Sep 9 – 11, 2014, Batumi, Georgia

SLIDE 2

Linking Machine-Readable Dictionaries

Motivation: Aggregating information

– from different dictionaries
– from dictionaries and automatically analyzed text

State of the art on machine-readable dictionaries

– XML (TEI, LMF)
– RDF (lemon)

Example

– Converting, linking and querying multilingual Wiktionaries

SLIDE 3

The future of the dictionary …

„The three things no young person owns or uses and often don‘t realise exist: an alarm clock, an address book and a dictionary … At university I didn‘t meet a single person who owned any of them“

http://guardian.co.uk/books/booksblog/2012/sep/13/dictionaries-democratic-crowdsourcing/

SLIDE 4

The future of the dictionary …

„[D]ictionaries are not dead, they just smell funny“

Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1, paraphrasing Frank Zappa‘s quote on Jazz (1974)

SLIDE 5

The future of the dictionary …

„[D]ictionaries … lose their autonomous identity and disappear in language technology. Machine translation, word processors, … and the like incorporate dictionary content and apply it in new forms“

Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1

„[T]he message is clear and unambiguous: the future of the dictionary is digital.“

Stephen Bullon, Macmillan Education, upon announcing that Macmillan will no longer publish print dictionaries, Nov 2012

SLIDE 6

The future of the dictionary …

… is digital

– no space limitations

adding context information, e.g., from corpora

– dynamic ordering & search

no index optimization for manual lookup

– information aggregation

integrating information from different sources

SLIDE 7

The future of the dictionary …

two use cases:

cross-lingual dictionary lookup
text mining for archaeologists

SLIDE 8

Information Aggregation I Cross-lingual search

Assume you‘re a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian

– Statistical machine translation may give you an idea, but you certainly want to counter-check with a dictionary ...

SLIDE 9

Information Aggregation I Cross-lingual search

... unfortunately, you don‘t have one

SLIDE 10

Information Aggregation I Cross-lingual search

Assume you‘re a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian

We do have a Georgian-English dictionary, though, and (luckily) an English-German one

Given a proper representation, storage and query formalisms, it is possible to perform a transitive query using English as a pivot language
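
The transitive query via a pivot language can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the two dictionaries are plain in-memory mappings, their entries are toy data, and the function name translate_via_pivot is hypothetical.

```python
# Toy dictionaries standing in for real machine-readable resources (illustrative entries only).
GEORGIAN_ENGLISH = {"ფეხი": {"foot", "leg"}}
ENGLISH_GERMAN = {"foot": {"Fuß"}, "leg": {"Bein"}}


def translate_via_pivot(word, source_to_pivot, pivot_to_target):
    """Collect every target-language word reachable from `word` through the pivot language."""
    targets = set()
    for pivot_word in source_to_pivot.get(word, set()):
        # Follow each pivot translation into the target-language dictionary.
        targets |= pivot_to_target.get(pivot_word, set())
    return targets


print(sorted(translate_via_pivot("ფეხი", GEORGIAN_ENGLISH, ENGLISH_GERMAN)))
```

A real pivot dictionary lists many more translations per word, so the result set grows quickly.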

SLIDE 11

Information Aggregation I Cross-lingual search

ფეხი → foot, leg (source: http://www.georgianweb.com/pdf/lexicon.pdf)

foot, leg → Basis, Fußbreit, Fußende, Fußpunkt, Sockel, Sohle, Standfuß, Standvorrichtung, Tritt, Mastfuß, Segelunterliek, Fußlinie, Fußmauer, Bein, Abschnitt, Etappe, Programmzweig, Schaft, Schlägel, Stollen, Strecke, Ader, Hachse, Kathete, Schenkel, Strang, Fuß (source: dict.leo.org)

SLIDE 12

Information Aggregation I Cross-lingual search

Unfortunately, using English introduces a lot of noise

– 2 English translations, 27 (!) German translations

But we can combine multiple paths, e.g., one using English as a pivot, one using Russian

– elements in the intersection should be more reliable

SLIDE 13

Information Aggregation I Cross-lingual search

ფეხი → нога (source: http://meskhi.net/lexicon)

нога → Spielbein (source: dict.leo.org)

SLIDE 14

Information Aggregation I Cross-lingual search

– elements in the intersection should be more reliable

27 English-based translations + 3 Russian-based translations = 2 shared translations

SLIDE 15

Information Aggregation I Cross-lingual search

In a similar way, words missing from the Russian (or the English) path may be taken from the other one

– more noise, but better coverage

27 English-based translations + 3 Russian-based translations = 28 possible translations

– e.g., German Spielbein „free leg“
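
The two ways of combining paths correspond to set intersection and set union. The sketch below uses the 27 English-based translations listed on the slides. Of the three Russian-based translations, only Spielbein is named in the slides, so the other two ("Bein" and "Fuß") are an assumption made purely for illustration.

```python
# The 27 German translations reached via English, as listed on the slides.
english_based = {
    "Basis", "Fußbreit", "Fußende", "Fußpunkt", "Sockel", "Sohle", "Standfuß",
    "Standvorrichtung", "Tritt", "Mastfuß", "Segelunterliek", "Fußlinie",
    "Fußmauer", "Bein", "Abschnitt", "Etappe", "Programmzweig", "Schaft",
    "Schlägel", "Stollen", "Strecke", "Ader", "Hachse", "Kathete", "Schenkel",
    "Strang", "Fuß",
}

# "Spielbein" is from the slides; "Bein" and "Fuß" are ASSUMED here for illustration.
russian_based = {"Spielbein", "Bein", "Fuß"}

reliable = english_based & russian_based   # intersection: less noise
coverage = english_based | russian_based   # union: more noise, better coverage

print(len(english_based), len(russian_based), len(reliable), len(coverage))
```

Under these assumptions the counts match the slides: 2 shared translations and 28 possible translations.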

SLIDE 16

Information Aggregation I Jargon: A Prototype

student project @ GU Frankfurt
enter a word (in any language) and a target language
consult different machine-readable dictionaries to find a path into the target language

visualize results together with their „path“

SLIDE 17

Information Aggregation I Jargon: A Prototype

SLIDE 18

Information Aggregation I Jargon: A Prototype

Jargon uses lexical resources provided by different groups

– using a shared vocabulary

lemon, more in 10 minutes

=> joint queries

still under development

– prototype on restricted data set

SLIDE 19

Information Aggregation II Multilingual Semantic Web

a system for text mining (open information extraction) from archeological reports

extract machine-readable information from plain text

– currently, English only

in the longer perspective, German and Dutch

– http://corpora.acoli.informatik.uni-frankfurt.de/text-mining-webservice

SLIDE 20

Information Aggregation II Multilingual Semantic Web

Given a PDF document

SLIDE 21

Information Aggregation II Multilingual Semantic Web

Upload to server

SLIDE 22

Information Aggregation II Multilingual Semantic Web

Perform NLP analysis

SLIDE 23

Information Aggregation II Multilingual Semantic Web

Visualize data

SLIDE 24

Information Aggregation II Multilingual Semantic Web

e.g. arch. periods

SLIDE 25

Information Aggregation II Multilingual Semantic Web

or query in the results

SLIDE 26

Information Aggregation II Multilingual Semantic Web

or query in the results

Dr Irakli Iashvili spent a month at the Heberden Coin Room at the Ashmolean Museum , also with the support of the British Academy , working on the coinage of the Black Sea in general , and the coins found at Pichvnari in particular .

Result TRIPLES TEXT QUERY

SLIDE 27

Information Aggregation II Multilingual Semantic Web

or query in the results

In this query, the only information-bearing element is „:work“. If we define that „:work“ entails „:bearbeitet“ (the German translation), we can formulate the same query in German, i.e., ?a :bearbeitet ?c
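
A minimal sketch of this idea in Python, assuming the extracted information is held as plain (subject, predicate, object) tuples. The triple, the ENTAILS table and the function match are all hypothetical and only loosely modelled on the example sentence; this is not the web service's actual implementation.

```python
# Hypothetical extracted triple, loosely modelled on the example sentence.
TRIPLES = [(":Iashvili", ":work", ":coinage_of_the_Black_Sea")]

# Declared entailment: the English property entails its German translation.
ENTAILS = {":work": ":bearbeitet"}


def match(predicate, triples, entails):
    """Answer the pattern `?a predicate ?c`, also accepting predicates that entail it."""
    results = []
    for subject, pred, obj in triples:
        if pred == predicate or entails.get(pred) == predicate:
            results.append((subject, obj))
    return results


# The German query returns the same bindings as the English one.
print(match(":bearbeitet", TRIPLES, ENTAILS) == match(":work", TRIPLES, ENTAILS))
```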

SLIDE 28

Linking Machine-Readable Dictionaries

Motivation: Aggregating information

– from different dictionaries
– from dictionaries and automatically analyzed text

State of the art on machine-readable dictionaries

– XML
– RDF

Example

– Converting, linking and querying multilingual Wiktionaries

SLIDE 29

Machine Readable Dictionaries XML

Text Encoding Initiative (TEI)

– specifications for markup of digital-born documents
– originally closely oriented towards digital editions of printed books
– rich metadata (TEI header)
– semantic markup (div, seg, verse, …)
– limited interoperability

many different ways to represent the same information => information aggregation ???

SLIDE 30

Machine Readable Dictionaries XML

Lexical Markup Framework (LMF)

– ISO standard for representing machine-readable dictionaries
– an abstract model with XML specifications (DTD)
– concrete application requires an instantiation

→ extending the DTD
→ violating the original DTD
→ in order to use this standard, you need to break it

SLIDE 31

Machine Readable Dictionaries XML

Lexical Markup Framework (LMF)

→ suggestions for alternative representations of LMF, e.g., RDF (Francopoulo 2006)

SLIDE 32

Resource Description Framework (RDF)

W3C standard (1999)

– generic data model: directed labeled graph

nodes, edges, labels

– originally developed to provide metadata about resources

e.g., journals in a bookstore and eBooks in an online shop

– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)

SLIDE 33

Resource Description Framework (RDF)

– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)

URLs are prototypical URIs

http://www.w3.org/2000/01/rdf-schema#label

– protocol: http://
– namespace (e.g., a document in the web): http://www.w3.org/2000/01/rdf-schema#
– identifier („local name“): label

SLIDE 34

Resource Description Framework (RDF)

compact notation: define a prefix and use it instead of the namespace

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
… rdfs:label …

if you encounter an unknown prefix, you can look up its namespace under http://prefix.cc/
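
Splitting a URI into namespace and local name, and compacting it with a prefix, can be sketched as follows. The function names split_uri and compact are hypothetical, and cutting at the last "#" or "/" is a simplifying assumption rather than a rule of the RDF specification.

```python
# Prefix table; the rdfs namespace is the one declared on the slide.
PREFIXES = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}


def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def compact(uri, prefixes=PREFIXES):
    """Return the prefixed form if the namespace is known, else the full URI in angle brackets."""
    namespace, local_name = split_uri(uri)
    for prefix, known_namespace in prefixes.items():
        if known_namespace == namespace:
            return f"{prefix}:{local_name}"
    return f"<{uri}>"


print(split_uri("http://www.w3.org/2000/01/rdf-schema#label"))
print(compact("http://www.w3.org/2000/01/rdf-schema#label"))
```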

SLIDE 35

Resource Description Framework (RDF)

basic data structure is a triple, consisting of

– source node

„subject“

– relation / property

„predicate“

– target node

„object“

typically, all of these are RDF resources

– identified by a URI

SLIDE 36

Resource Description Framework (RDF)

alternatively, the target may also be a literal, e.g., a string
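
One way to picture the triple structure is as a small Python data type. This is a sketch under assumptions: the class names URI, Literal and Triple are hypothetical and do not come from any particular RDF library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    """An RDF resource, identified by its URI."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A literal value such as a string; only allowed in object position."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI                  # source node
    predicate: URI                # relation / property
    obj: Union[URI, Literal]      # target node, or a literal


LEMON = "http://lemon-model.net/lemon#"
triple = Triple(URI("http://wiktionary.org"), URI(LEMON + "language"), Literal("en"))
print(triple)
```

Keeping URI and Literal as separate types means that a resource and a string with the same characters are never confused.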

SLIDE 37

Resource Description Framework (RDF)

there are different notations, we use Turtle*

– triples written in a sequence
– separated by .
– usually one triple per line

* this description is deliberately simplified, see http://www.w3.org/TR/turtle/
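
The simplified notation described above (one triple per line, terminated by a dot) can be produced with a few lines of Python. This sketch covers only full URIs and plain string literals. The Literal marker class and the function names are hypothetical, and real Turtle has many more features (prefixes, language tags, datatypes).

```python
class Literal(str):
    """Marks a string as an RDF literal rather than a URI."""


def term(value):
    """Render one term: literals in double quotes, everything else as a <URI>."""
    if isinstance(value, Literal):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def to_turtle(triples):
    """Write each (subject, predicate, object) triple on its own line, ending with ' .'."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


print(to_turtle([
    ("http://wiktionary.org", "http://lemon-model.net/lemon#language", Literal("en")),
]))
```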

SLIDE 38

Resource Description Framework (RDF)

there are special vocabularies for classes, properties and instances

– e.g., RDF Schema (RDFS)

„a“: short notation for rdf:type, i.e., „is instance of“

rdfs:label: assigns an RDF resource a human-readable form

SLIDE 39

Machine-Readable Dictionaries RDF: lemon

an RDF Schema (or an ontology) defines a domain vocabulary

lexical resources are interoperable if shared vocabularies are used, e.g., the Lexicon Model for Ontologies (lemon)

– based on LMF
– developed by a W3C Community Group

http://www.w3.org/community/ontolex/

– not a standard (yet), but already widely used

SLIDE 40

Machine-Readable Dictionaries RDF: lemon

http://www.lemon-model.net/lemon

core

SLIDE 41

Machine-Readable Dictionaries RDF: lemon

http://www.lemon-model.net/lemon

core

we ignore this part, this is more relevant for the Semantic Web than for the Humanities

SLIDE 42

Machine-Readable Dictionaries RDF: lemon

http://www.lemon-model.net/lemon

core

today, we ignore this part, this is relevant for phrasal expressions (e.g., idioms)

SLIDE 43

Wiktionary as a Lexicon

SLIDE 44

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon

SLIDE 45

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon

(plus subcategorization information)

SLIDE 46

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon

(plus morphosyntactic information)

SLIDE 47

Wiktionary as a Lexicon

beyond its core, lemon provides additional vocabulary elements

e.g., cross-references to other languages

http://www.lemon-model.net/lemon

SLIDE 48

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon

translation

SLIDE 49

From Wiktionary to lemon

http://www.lemon-model.net/lemon

<http://wiktionary.org> a lemon:Lexicon .
<http://wiktionary.org> lemon:language "en" .

<http://wiktionary.org>

SLIDE 50

From Wiktionary to lemon

http://www.lemon-model.net/lemon

<http://wiktionary.org> lemon:entry :know .
:know a lemon:LexicalEntry .

<http://wiktionary.org> :know

SLIDE 51

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know1 a lemon:LexicalSense .
:know1 rdfs:comment "(transitive) To be certain …" .

SLIDE 52

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know lemon:sense :know1 .
:know lemon:sense :know2 .
…

:know

SLIDE 53

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know_form1

:know_form0 a lemon:LexicalForm .
:know_form0 lemon:writtenRep "to know" .
:know_form0 rdfs:comment "Verb" .
:know_form1 a lemon:LexicalForm .
:know_form1 lemon:writtenRep "knows" .
:know_form1 rdfs:comment "Verb; third-person singular simple present" .
...

:know_form2 … :know_form0

SLIDE 54

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know_form1 :know_form2 …

:know lemon:canonicalForm :know_form0 .
:know lemon:form :know_form1 .
:know lemon:form :know_form2 .
…

:know_form0

SLIDE 55

From Wiktionary to lemon

http://www.lemon-model.net/lemon translation

:know

:know lemon:isTranslationOf <http://de.wiktionary.org/wiki/wissen> . …

for lemon:isTranslationOf and more documentation, see http://lemon-model.net/
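
To see how the pieces of the converted entry hang together, the triples from the preceding slides can be held in a Python list and navigated by hand. This is a sketch only: terms are kept as plain strings in their prefixed form, and the helper names objects and written_forms are hypothetical.

```python
# A subset of the triples from the slides, with terms kept as plain strings.
TRIPLES = [
    ("<http://wiktionary.org>", "a", "lemon:Lexicon"),
    ("<http://wiktionary.org>", "lemon:language", '"en"'),
    ("<http://wiktionary.org>", "lemon:entry", ":know"),
    (":know", "a", "lemon:LexicalEntry"),
    (":know", "lemon:canonicalForm", ":know_form0"),
    (":know", "lemon:form", ":know_form1"),
    (":know_form0", "lemon:writtenRep", '"to know"'),
    (":know_form1", "lemon:writtenRep", '"knows"'),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def written_forms(entry):
    """Follow entry -> form -> writtenRep, canonical form first."""
    forms = objects(entry, "lemon:canonicalForm") + objects(entry, "lemon:form")
    return [rep for form in forms for rep in objects(form, "lemon:writtenRep")]


print(written_forms(":know"))
```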

SLIDE 56

Querying RDF

SPARQL*

– SPARQL Protocol and RDF Query Language
– http://www.w3.org/TR/rdf-sparql-query/

SELECT

– define variables (beginning with ?)

WHERE

– write triples (similar to Turtle, but now with variables)

* simplified, again

SLIDE 57

PREFIX : <http://wiktionary.org>                # namespace prefix declarations
PREFIX lemon: <http://lemon-model.net/lemon#>
SELECT ?deForm ?enForm
WHERE {
  ?deLexicon a lemon:Lexicon .                  # German lexicon ?deLexicon
  ?deLexicon lemon:language "de" .
  ?deLexicon lemon:entry ?de .                  # lexical entry ?de
  ?de lemon:canonicalForm ?deForm .             # German lexicon form (lemma)
  ?de lemon:isTranslationOf ?en .               # German is translation of ?en
  ?enLexicon lemon:entry ?en .
  ?enLexicon lemon:language "en" .              # ?en in English lexicon
  ?en lemon:canonicalForm ?enForm .
}

Querying RDF

isTranslationOf

SPARQL 1.0
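
How a WHERE block of triple patterns is matched against data can be sketched with a tiny pattern matcher in Python. This is not how a SPARQL engine is implemented; it is a naive illustration. The toy triples (":deLex", ":enLex", ":wissen", ...) and the function names are hypothetical, and terms are plain strings.

```python
# Hypothetical toy data: one German and one English lexicon with one entry each.
TRIPLES = [
    (":deLex", "a", "lemon:Lexicon"),
    (":deLex", "lemon:language", '"de"'),
    (":deLex", "lemon:entry", ":wissen"),
    (":wissen", "lemon:canonicalForm", ":wissen_form0"),
    (":wissen", "lemon:isTranslationOf", ":know"),
    (":enLex", "a", "lemon:Lexicon"),
    (":enLex", "lemon:language", '"en"'),
    (":enLex", "lemon:entry", ":know"),
    (":know", "lemon:canonicalForm", ":know_form0"),
]

# The triple patterns of the query above; variables start with "?".
PATTERNS = [
    ("?deLexicon", "a", "lemon:Lexicon"),
    ("?deLexicon", "lemon:language", '"de"'),
    ("?deLexicon", "lemon:entry", "?de"),
    ("?de", "lemon:canonicalForm", "?deForm"),
    ("?de", "lemon:isTranslationOf", "?en"),
    ("?enLexicon", "lemon:entry", "?en"),
    ("?enLexicon", "lemon:language", '"en"'),
    ("?en", "lemon:canonicalForm", "?enForm"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None if impossible."""
    extended = dict(binding)
    for pattern_term, triple_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(pattern_term, triple_term) != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return extended


def query(patterns, triples):
    """Evaluate the patterns one after another, keeping all consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


for row in query(PATTERNS, TRIPLES):
    print(row["?deForm"], row["?enForm"])
```

Each pattern narrows or extends the set of variable bindings, which is why shared variables such as ?de and ?en act as joins between the two lexicons.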

SLIDE 58

So far

RDF motivation and introduction
How to interpret (convert) Wiktionary data to RDF

– lemon vocabulary

How to query lemon data (example)

– for English translations of German words
– queries over series of dictionaries can simply be concatenated

Slightly simplified

SLIDE 59

Digging Deeper

remember that URIs were unique identifiers in the web (of data)

– this means we can look up external URIs (links)
– if they resolve over HTTP, and they provide RDF data, we can query these data sources
– if they use the same vocabulary, we can iterate the query, e.g., over dictionaries in a third language
– „federation“

SLIDE 60

Digging Deeper

Resources fulfilling these conditions constitute „Linked (Open) Data“

http://www.w3.org/DesignIssues/LinkedData.html

SLIDE 61

Linked Open Data (LOD) cloud

Source http://lod-cloud.net

SLIDE 62

Linguistically relevant LOD resources

– DBpedia (Wikipedia)

cf. Markert & Nissim (2003) on anaphor resolution

– WordNet(s)
– language identifiers
– WordNet-derived datasets
– Named Entity Repositories
– Other Semantic Knowledge Bases

Source http://lod-cloud.net

SLIDE 63

Linguistic Linked Open Data cloud

a collection of linguistic resources

– published under open licenses
– as linked data
– developed and maintained in a decentralized way
– metadata at http://datahub.io

=> cloud diagram

– developed as a community effort in the context of the Open Linguistics Working Group of the Open Knowledge Foundation

SLIDE 64

Open Knowledge Foundation (OKFN, http://okfn.org)

non-profit organization
founded in 2004
promote open knowledge in all its forms

– e.g., publication of government data (UK, US)

provide infrastructural support for several working groups

SLIDE 65

OKFN Open Linguistics Working Group (OWLG)

founded in Oct 2010 in Berlin, Germany
open network of individuals interested in

– linguistic resources and/or
– their publication under open licenses

multi-disciplinary

– NLP/CL, DH, typology/language documentation, IT, …

infrastructure

– mailing list, web site/blog, wiki
– http://linguistics.okfn.org

SLIDE 66

Important OWLG goals (http://linguistics.okfn.org)

1. Promote open data in relation to language data
2. Facilitate communication between researchers who use / distribute / maintain open linguistic data
3. Mediate between providers and users of technical infrastructures
4. Build and maintain an index of open linguistic data sources

SLIDE 67

Workshop series

Linked Data in Linguistics (LDL)
Multilingual Linked Open Data for Enterprises (MLODE)
Linked Data in Linguistic Typology (LDLT)

SLIDE 68

Linguistic Linked Open Data

Very different data sets

different data providers, different incentives

All using the same set of technologies, and – increasingly – shared vocabularies

e.g. lemon

SLIDE 69

Linguistic Linked Open Data

e.g. lemon (L!)

SLIDE 70

Linguistic Linked Open Data

Since 2012, our focus has been on converting resources

With what we have accomplished since then, we are now in the position to begin developing applications

SLIDE 71

Final words (for the moment)

Thank you for your attention …
Interested in contributing? Talk to me!

– There are quite a few resources that would be interesting candidates for an LOD conversion

Georgian-Russian dictionary
Georgian-English dictionary
Georgian Wiktionary

– Developing, extending, using* software with such data

* in the longer perspective