SLIDE 1
Linking Machine-Readable Dictionaries
Christian Chiarcos
Applied Computational Linguistics Lab
chiarcos@informatik.uni-frankfurt.de
Digital Humanities Workshop, Sep 9 – 11, 2014, Batumi, Georgia
SLIDE 2 Linking Machine-Readable Dictionaries
- Motivation: Aggregating information
– from different dictionaries
– from dictionaries and automatically analyzed text
- State of the art on machine-readable dictionaries
– XML (TEI, LMF)
– RDF (lemon)
– Converting, linking and querying multilingual Wiktionaries
SLIDE 3
The future of the dictionary …
„The three things no young person owns or uses and often don't realise exist: an alarm clock, an address book and a dictionary … At university I didn't meet a single person who owned any of them“
http://guardian.co.uk/books/booksblog/2012/sep/13/dictionaries-democratic-crowdsourcing/
SLIDE 4
The future of the dictionary …
„[D]ictionaries are not dead, they just smell funny“
Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1, paraphrasing Frank Zappa's quote on jazz (1974)
SLIDE 5 The future of the dictionary …
„[D]ictionaries … lose their autonomous identity and disappear in language technology. Machine translation, word processors, … and the like incorporate dictionary content and apply it in new forms“
Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1
„[T]he message is clear and unambiguous: the future of the dictionary is digital.“
Stephen Bullon, Macmillan Education, upon announcing that Macmillan will no longer publish print dictionaries, Nov 2012
SLIDE 6 The future of the dictionary …
… is digital
– no space limitations
- adding context information, e.g., from corpora
– dynamic ordering & search
- no index optimization for manual lookup
– information aggregation
- integrating information from different sources
SLIDE 7 The future of the dictionary …
two use cases:
- cross-lingual dictionary lookup
- text mining for archaeologists
SLIDE 8 Information Aggregation I Cross-lingual search
- Assume you're a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian
– Statistical machine translation may give you an idea, but you certainly want to counter-check with a dictionary ...
SLIDE 9 Information Aggregation I Cross-lingual search
... unfortunately, you don't have one
SLIDE 10 Information Aggregation I Cross-lingual search
- We do have a Georgian-English dictionary, though, and (luckily) an English-German one
- Given suitable representation, storage and query formalisms, it is possible to perform a transitive query using English as a pivot language
SLIDE 11 Information Aggregation I Cross-lingual search
Georgian ფეხი → English foot, leg (http://www.georgianweb.com/pdf/lexicon.pdf)
English foot, leg → German (dict.leo.org): Basis, Fußbreit, Fußende, Fußpunkt, Sockel, Sohle, Standfuß, Standvorrichtung, Tritt, Mastfuß, Segelunterliek, Fußlinie, Fußmauer, Bein, Abschnitt, Etappe, Programmzweig, Schaft, Schlägel, Stollen, Strecke, Ader, Hachse, Kathete, Schenkel, Strang, Fuß
SLIDE 12 Information Aggregation I Cross-lingual search
- Unfortunately, using English introduces a lot of noise
– 2 English translations, 27 (!) German translations
- But we can combine multiple paths, e.g., one using English as a pivot, one using Russian
– elements in the intersection should be more reliable
SLIDE 13 Information Aggregation I Cross-lingual search
English path (as on slide 11): ფეხი → foot, leg → 27 German translations (dict.leo.org)
Russian path: ფეხი → нога (http://meskhi.net/lexicon) → German translations (dict.leo.org), including Spielbein
SLIDE 14 Information Aggregation I Cross-lingual search
- Intersecting the 27 English-based translations with the 3 Russian-based translations leaves 2 shared translations
SLIDE 15 Information Aggregation I Cross-lingual search
- In a similar way, words missing from the Russian (or the English) path may be taken from the other one
– more noise, but better coverage
– 27 English-based translations + 3 Russian-based translations = 28 possible translations (the 2 shared ones are counted once)
– e.g., German Spielbein „free leg“
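The combination of pivot paths described on the last few slides can be sketched in a few lines of Python. This is only an illustration: the dictionaries below are a small hand-made toy subset (an assumption, not the actual content of the lexicons cited above), and the function name translate_via_pivot is hypothetical.

```python
def translate_via_pivot(word, source_to_pivot, pivot_to_target):
    """Collect all target-language words reachable from `word` via one pivot language."""
    targets = set()
    for pivot_word in source_to_pivot.get(word, ()):
        targets.update(pivot_to_target.get(pivot_word, ()))
    return targets


# Toy data, assumed for illustration only
KA_EN = {"ფეხი": ["foot", "leg"]}
EN_DE = {"foot": ["Fuß", "Basis", "Sockel"], "leg": ["Bein", "Schenkel", "Etappe"]}
KA_RU = {"ფეხი": ["нога"]}
RU_DE = {"нога": ["Bein", "Fuß", "Spielbein"]}

via_english = translate_via_pivot("ფეხი", KA_EN, EN_DE)
via_russian = translate_via_pivot("ფეხი", KA_RU, RU_DE)

reliable = via_english & via_russian  # intersection: less noise
coverage = via_english | via_russian  # union: more noise, better coverage

print(sorted(reliable))
print(sorted(coverage))
```

The intersection keeps only candidates confirmed by both paths, while the union also admits words such as Spielbein that only one path provides.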
SLIDE 16 Information Aggregation I Jargon: A Prototype
- student project @ GU Frankfurt
- enter a word (in any language) and a target language
- consult different machine-readable dictionaries to find a path into the target language
- visualize results together with their „path“
SLIDE 17
Information Aggregation I Jargon: A Prototype
SLIDE 18 Information Aggregation I Jargon: A Prototype
- Jargon uses lexical resources provided by different groups
– using a shared vocabulary
- lemon, more in 10 minutes
=> joint queries
– prototype on restricted data set
SLIDE 19 Information Aggregation II Multilingual Semantic Web
- a system for text mining (open information extraction) from archeological reports
- extract machine-readable information from plain text
– currently, English only
- in the long run, German and Dutch
– http://corpora.acoli.informatik.uni-frankfurt.de/text-mining-webservice
SLIDE 20
Information Aggregation II Multilingual Semantic Web
Given a PDF document
SLIDE 21
Information Aggregation II Multilingual Semantic Web
Upload to server
SLIDE 22
Information Aggregation II Multilingual Semantic Web
Perform NLP analysis
SLIDE 23
Information Aggregation II Multilingual Semantic Web
Visualize data
SLIDE 24
Information Aggregation II Multilingual Semantic Web
e.g. arch. periods
SLIDE 25 Information Aggregation II Multilingual Semantic Web
SLIDE 26 Information Aggregation II Multilingual Semantic Web
TEXT: Dr Irakli Iashvili spent a month at the Heberden Coin Room at the Ashmolean Museum, also with the support of the British Academy, working on the coinage of the Black Sea in general, and the coins found at Pichvnari in particular.
(screenshot panels: TEXT, TRIPLES, QUERY, Result)
SLIDE 27 Information Aggregation II Multilingual Semantic Web
In this query, the only information-bearing element is „:work“. If we define that „:work“ entails „:bearbeitet“ (the German translation), we can formulate the same query in German, i.e., ?a :bearbeitet ?c
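A minimal Python sketch of this idea, under stated assumptions: the single triple below is made up for illustration (the slide does not show the extracted triples), the helper names are hypothetical, and the entailment is applied in one pass only (no transitive closure).

```python
def apply_entailments(triples, entailments):
    """Add (s, q, o) for every triple (s, p, o) whose predicate p entails q."""
    closed = set(triples)
    for s, p, o in triples:
        for entailed in entailments.get(p, ()):
            closed.add((s, entailed, o))
    return closed


def match(triples, pattern):
    """Return variable bindings for one (s, p, o) pattern; variables start with '?'."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.get(term, value) != value:
                    break  # same variable would need two different values
                binding[term] = value
            elif term != value:
                break  # constant does not match
        else:
            results.append(binding)
    return results


# Hypothetical triple, loosely modelled on the example sentence
TRIPLES = {(":Irakli_Iashvili", ":work", ":coinage_of_the_Black_Sea")}
ENTAILS = {":work": [":bearbeitet"]}

closed = apply_entailments(TRIPLES, ENTAILS)
english = match(closed, ("?a", ":work", "?c"))
german = match(closed, ("?a", ":bearbeitet", "?c"))
print(english == german)
```

Once the entailment is declared, the German and the English query return the same bindings.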
SLIDE 28 Linking Machine-Readable Dictionaries
- Motivation: Aggregating information
– from different dictionaries
– from dictionaries and automatically analyzed text
- State of the art on machine-readable dictionaries
– XML
– RDF
– Converting, linking and querying multilingual Wiktionaries
SLIDE 29 Machine Readable Dictionaries XML
- Text Encoding Initiative (TEI)
– specifications for markup of digital-born documents
– originally closely oriented towards digital editions of printed books
– rich metadata (TEI header)
– semantic markup (div, seg, verse, …)
– limited interoperability
- many different ways to represent the same information => information aggregation ???
SLIDE 30 Machine Readable Dictionaries XML
- Lexical Markup Framework (LMF)
– ISO standard for representing machine-readable dictionaries
– an abstract model with XML specifications (DTD)
– concrete application requires an instantiation
- extending the DTD = violating the original DTD
=> in order to use this standard, you need to break it
SLIDE 31 Machine Readable Dictionaries XML
- Lexical Markup Framework (LMF)
– suggestions for alternative representations of LMF, e.g., RDF (Francopoulo 2006)
SLIDE 32 Resource Description Framework (RDF)
– generic data model: directed labeled graph
– originally developed to provide metadata about resources
- e.g., journals in a bookstore and eBooks in an online shop
– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)
SLIDE 33 Resource Description Framework (RDF)
– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)
URLs are prototypical URIs
- http://www.w3.org/2000/01/rdf-schema#label
– protocol: http://
– namespace (e.g., a document in the web): http://www.w3.org/2000/01/rdf-schema#
– identifier („local name“): label
SLIDE 34 Resource Description Framework (RDF)
compact notation: define a prefix and use it instead of the namespace
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
… rdfs:label …
if you encounter an unknown prefix, you can look up its namespace under http://prefix.cc/
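The relation between the compact notation and the full URI can be illustrated with a short Python sketch. The function names expand and shorten are hypothetical helpers, not part of any RDF library.

```python
PREFIXES = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'rdfs:label' into the full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local_name


def shorten(uri, prefixes=PREFIXES):
    """Turn a full URI back into a prefixed name if a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


full_uri = expand("rdfs:label")
print(full_uri)
print(shorten(full_uri))
```

The prefix is plain string substitution: namespace plus local name gives the URI, and nothing else is implied by the abbreviation.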
SLIDE 35 Resource Description Framework (RDF)
- basic data structure is a triple, consisting of
– source node („subject“)
– relation / property („predicate“)
– target node („object“)
- typically, all of these are RDF resources
– identified by a URI
SLIDE 36 Resource Description Framework (RDF)
- alternatively, the target may also be a literal, e.g., a string
SLIDE 37 Resource Description Framework (RDF)
- there are different notations, we use Turtle*
– triples written in a sequence
– separated by .
– usually one triple per line
* this description is deliberately simplified, see http://www.w3.org/TR/turtle/
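As a rough sketch of the triple structure and of this simplified Turtle layout, the following Python snippet represents triples as small tuples and writes one triple per line. The class and function names are hypothetical, the example resources are borrowed from the lemon slides later in this deck, and prefix declarations are left out for brevity.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A literal object value, e.g., a string."""
    value: str


class Triple(NamedTuple):
    subject: str      # URI or prefixed name
    predicate: str    # URI or prefixed name
    obj: object       # URI / prefixed name (str) or Literal


def to_turtle(triple):
    """Serialize one triple in the simplified one-triple-per-line style."""
    if isinstance(triple.obj, Literal):
        escaped = triple.obj.value.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = triple.obj
    return f"{triple.subject} {triple.predicate} {obj} ."


triples = [
    Triple(":know", "a", "lemon:LexicalEntry"),
    Triple(":know_form0", "lemon:writtenRep", Literal("to know")),
]
turtle = "\n".join(to_turtle(t) for t in triples)
print(turtle)
```

Subject and predicate are always resources, while the object is either a resource or a quoted literal.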
SLIDE 38 Resource Description Framework (RDF)
- there are special vocabularies for classes, properties and instances
– e.g., RDF Schema (RDFS)
– a: short notation for rdf:type, i.e., „is instance of“
– rdfs:label: assigns an RDF resource a human-readable form
SLIDE 39 Machine-Readable Dictionaries RDF: lemon
- an RDF Schema (or an ontology) defines a domain vocabulary
- lexical resources are interoperable if shared vocabularies are used, e.g., the Lexicon Model for Ontologies (lemon)
– based on LMF
– developed by a W3C Community Group
- http://www.w3.org/community/ontolex/
– not a standard (yet), but already widely used
SLIDE 40 Machine-Readable Dictionaries RDF: lemon
http://www.lemon-model.net/lemon
core
SLIDE 41 Machine-Readable Dictionaries RDF: lemon
http://www.lemon-model.net/lemon
core
we ignore this part; it is more relevant for the Semantic Web than for the Humanities
SLIDE 42 Machine-Readable Dictionaries RDF: lemon
http://www.lemon-model.net/lemon
core
today, we ignore this part; it is relevant for phrasal expressions (e.g., idioms)
SLIDE 43
Wiktionary as a Lexicon
SLIDE 44 Wiktionary as a Lexicon
http://www.lemon-model.net/lemon
SLIDE 45 Wiktionary as a Lexicon
http://www.lemon-model.net/lemon
(plus subcategorization information)
SLIDE 46 Wiktionary as a Lexicon
http://www.lemon-model.net/lemon
(plus morphosyntactic information)
SLIDE 47 Wiktionary as a Lexicon
lemon provides additional vocabulary elements
references to other languages
http://www.lemon-model.net/lemon
SLIDE 48 Wiktionary as a Lexicon
http://www.lemon-model.net/lemon
translation
SLIDE 49 From Wiktionary to lemon
http://www.lemon-model.net/lemon
<http://wiktionary.org> a lemon:Lexicon .
<http://wiktionary.org> lemon:language "en" .
SLIDE 50 From Wiktionary to lemon
http://www.lemon-model.net/lemon
<http://wiktionary.org> lemon:entry :know .
:know a lemon:LexicalEntry .
SLIDE 51 From Wiktionary to lemon
http://www.lemon-model.net/lemon
:know1 a lemon:LexicalSense .
:know1 rdfs:comment "(transitive) To be certain …" .
SLIDE 52 From Wiktionary to lemon
http://www.lemon-model.net/lemon
:know lemon:sense :know1 .
:know lemon:sense :know2 .
…
SLIDE 53 From Wiktionary to lemon
http://www.lemon-model.net/lemon
:know_form0 a lemon:LexicalForm .
:know_form0 lemon:writtenRep "to know" .
:know_form0 rdfs:comment "Verb" .
:know_form1 a lemon:LexicalForm .
:know_form1 lemon:writtenRep "knows" .
:know_form1 rdfs:comment "Verb; third-person singular simple present" .
...
SLIDE 54 From Wiktionary to lemon
http://www.lemon-model.net/lemon
:know lemon:canonicalForm :know_form0 .
:know lemon:form :know_form1 .
:know lemon:form :know_form2 .
…
SLIDE 55 From Wiktionary to lemon
http://www.lemon-model.net/lemon
:know lemon:isTranslationOf <http://de.wiktionary.org/wiki/wissen> .
…
for lemon:isTranslationOf and more documentation, see http://lemon-model.net/
SLIDE 56 Querying RDF
– SPARQL Protocol and RDF Query Language (SPARQL)
– http://www.w3.org/TR/rdf-sparql-query/
– define variables (beginning with ?)
– write triples (similar to Turtle, but now with variables)
* simplified, again
SLIDE 57 Querying RDF
SPARQL 1.0
PREFIX : <http://wiktionary.org>             # namespace prefix declarations
PREFIX lemon: <http://lemon-model.net/lemon#>
SELECT ?deForm ?enForm
WHERE {
  ?deLexicon a lemon:Lexicon .               # German lexicon ?deLexicon
  ?deLexicon lemon:language "de" .
  ?deLexicon lemon:entry ?de .               # lexical entry ?de
  ?de lemon:canonicalForm ?deForm .          # German lexicon form (lemma)
  ?de lemon:isTranslationOf ?en .            # German is translation of ?en
  ?enLexicon lemon:entry ?en .
  ?enLexicon lemon:language "en" .           # ?en in English lexicon
  ?en lemon:canonicalForm ?enForm .
}
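To make the evaluation of such a query more concrete, here is a small Python sketch that joins the same triple patterns over an in-memory list of triples. It is not a SPARQL engine: the function query is a hypothetical, naive nested-loop join, and the identifiers in DATA (de:lexicon, de:wissen, en:know, and so on) are made up for illustration.

```python
def query(triples, patterns):
    """Join triple patterns left to right; terms starting with '?' are variables."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # bind the variable, or check an earlier binding
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


# Made-up data following the lemon examples on the previous slides
DATA = [
    ("de:lexicon", "a", "lemon:Lexicon"),
    ("de:lexicon", "lemon:language", '"de"'),
    ("de:lexicon", "lemon:entry", "de:wissen"),
    ("de:wissen", "lemon:canonicalForm", "de:wissen_form0"),
    ("de:wissen", "lemon:isTranslationOf", "en:know"),
    ("en:lexicon", "lemon:language", '"en"'),
    ("en:lexicon", "lemon:entry", "en:know"),
    ("en:know", "lemon:canonicalForm", "en:know_form0"),
]

PATTERNS = [
    ("?deLexicon", "a", "lemon:Lexicon"),
    ("?deLexicon", "lemon:language", '"de"'),
    ("?deLexicon", "lemon:entry", "?de"),
    ("?de", "lemon:canonicalForm", "?deForm"),
    ("?de", "lemon:isTranslationOf", "?en"),
    ("?enLexicon", "lemon:entry", "?en"),
    ("?enLexicon", "lemon:language", '"en"'),
    ("?en", "lemon:canonicalForm", "?enForm"),
]

rows = [(s["?deForm"], s["?enForm"]) for s in query(DATA, PATTERNS)]
print(rows)
```

Each pattern either binds new variables or filters the solutions found so far, which is the intuition behind the WHERE block above.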
SLIDE 58 So far
- RDF motivation and introduction
- How to interpret (convert) Wiktionary data to RDF
– lemon vocabulary
- How to query lemon data (example)
– for English translations of German words
– queries over series of dictionaries can simply be concatenated
SLIDE 59 Digging Deeper
- remember that URIs are unique identifiers in the web (of data)
– this means we can look up external URIs (links)
– if they resolve over HTTP, and they provide RDF data, we can query these data sources
– if they use the same vocabulary, we can iterate the query, e.g., over dictionaries in a third language
– „federation“
SLIDE 60 Digging Deeper
Resources fulfilling these conditions constitute „Linked (Open) Data“
http://www.w3.org/DesignIssues/LinkedData.html
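Looking up a URI over HTTP and asking for RDF is usually done with content negotiation. The sketch below only builds such a request with Python's standard library and does not send it; the helper name rdf_request is hypothetical, and whether a given server actually answers with Turtle is an assumption that has to be checked per data source.

```python
import urllib.request


def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET request that asks for an RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


request = rdf_request("http://www.w3.org/2000/01/rdf-schema#label")
print(request.get_method(), request.get_header("Accept"))

# Sending it would look like this (requires network access):
# with urllib.request.urlopen(request) as response:
#     turtle_text = response.read().decode("utf-8")
```

If the response is RDF and uses a shared vocabulary such as lemon, the retrieved triples can be queried in the same way as local data.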
SLIDE 61 Linked Open Data (LOD) cloud
Source http://lod-cloud.net
SLIDE 62 Linguistically relevant LOD resources
Source http://lod-cloud.net
- DBpedia (Wikipedia)
– cf. Markert & Nissim (2003) on anaphor resolution
- WordNet(s)
- language identifiers
- WordNet-derived datasets
- Named Entity Repositories
- Other Semantic Knowledge Bases
SLIDE 63 Linguistic Linked Open Data cloud
- a collection of linguistic resources
– published under open licenses
– as linked data
– developed and maintained in a decentralized fashion
– metadata at http://datahub.io
=> cloud diagram
– developed as a community effort in the context of the Open Linguistics Working Group of the Open Knowledge Foundation
SLIDE 64 Open Knowledge Foundation (OKFN, http://okfn.org)
- non-profit organization
- founded in 2004
- promote open knowledge in all its forms
– e.g., publication of government data (UK, US)
- provide infrastructural support for several working groups
SLIDE 65 OKFN Open Linguistics Working Group (OWLG)
- founded in Oct 2010 in Berlin, Germany
- open network of individuals interested in
– linguistic resources and/or
– their publication under open licenses
– NLP/CL, DH, typology/language documentation, IT, …
– mailing list, web site/blog, wiki
– http://linguistics.okfn.org
SLIDE 66 Important OWLG goals (http://linguistics.okfn.org)
1. Promote open data in relation to language data
2. Facilitate communication between researchers who use / distribute / maintain open linguistic data
3. Mediate between providers and users of technical infrastructures
4. Build and maintain an index of open linguistic data sources
SLIDE 67 Workshop series
- Linked Data in Linguistics (LDL)
- Multilingual Linked Open Data for Enterprises (MLODE)
- Linked Data in Linguistic Typology (LDLT)
SLIDE 68 Linguistic Linked Open Data
different data providers, different incentives
- All using the same set of technologies, and, increasingly, shared vocabularies
– e.g. lemon
SLIDE 69 Linguistic Linked Open Data
e.g. lemon (L!)
SLIDE 70 Linguistic Linked Open Data
Since 2012, our focus has been on converting resources.
With what we have accomplished since then, we are now in a position to begin developing applications.
SLIDE 71 Final words (for the moment)
- Thank you for your attention …
- Interested in contributing? Talk to me!
– There are quite a few resources that would be interesting candidates for an LOD conversion
- Georgian-Russian dictionary
- Georgian-English dictionary
- Georgian Wiktionary
– Developing, extending, using* software with such data
* in the long run