
Linking Machine-Readable Dictionaries. Christian Chiarcos, Applied Computational Linguistics Lab, chiarcos@informatik.uni-frankfurt.de. Digital Humanities Workshop, Sep 9 – 11, 2014, Batumi, Georgia.


slide-1
SLIDE 1

1

Linking Machine-Readable Dictionaries

Christian Chiarcos Applied Computational Linguistics Lab chiarcos@informatik.uni-frankfurt.de

Digital Humanities Workshop, Sep 9 – 11, 2014, Batumi, Georgia

slide-2
SLIDE 2

Linking Machine-Readable Dictionaries

  • Motivation: Aggregating information

– from different dictionaries
– from dictionaries and automatically analyzed text

  • State of the art on machine-readable dictionaries

– XML (TEI, LMF)
– RDF (lemon)

  • Example

– Converting, linking and querying multilingual Wiktionaries

slide-3
SLIDE 3

The future of the dictionary …

„The three things no young person owns or uses and often don‘t realise exist: an alarm clock, an address book and a dictionary … At university I didn‘t meet a single person who owned any of them“

http://guardian.co.uk/books/booksblog/2012/sep/13/dictionaries-democratic-crowdsourcing/

slide-4
SLIDE 4

The future of the dictionary …

„The three things no young person owns or uses and often don‘t realise exist: an alarm clock, an address book and a dictionary … At university I didn‘t meet a single person who owned any of them“

http://guardian.co.uk/books/booksblog/2012/sep/13/dictionaries-democratic-crowdsourcing/

„[D]ictionaries are not dead, they just smell funny“

Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1, paraphrasing Frank Zappa‘s quote on Jazz (1974)

slide-5
SLIDE 5

The future of the dictionary …

„[D]ictionaries … lose their autonomous identity and disappear in language technology. Machine translation, word processors, … and the like incorporate dictionary content and apply it in new forms“

Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1

„[T]he message is clear and unambiguous: the future of the dictionary is digital.“

Stephen Bullon, Macmillan Education, upon announcing that Macmillan will no longer publish print dictionaries, Nov 2012

slide-6
SLIDE 6

The future of the dictionary …

… is digital

– no space limitations

  • adding context information, e.g., from corpora

– dynamic ordering & search

  • no index optimization for manual lookup

– information aggregation

  • integrating information from different sources
slide-7
SLIDE 7

The future of the dictionary …

… is digital

– no space limitations

  • adding context information, e.g., from corpora

– dynamic ordering & search

  • no index optimization for manual lookup

– information aggregation

  • integrating information from different sources

two use cases:

  • cross-lingual dictionary lookup
  • text mining for archaeologists
slide-8
SLIDE 8

Information Aggregation I Cross-lingual search

  • Assume you're a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian

– Statistical machine translation may give you an idea, but you certainly want to counter-check with a dictionary ...

slide-9
SLIDE 9

Information Aggregation I Cross-lingual search

  • Assume you're a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian

– Statistical machine translation may give you an idea, but you certainly want to counter-check with a dictionary ...

... unfortunately, you don't have one

slide-10
SLIDE 10

Information Aggregation I Cross-lingual search

  • Assume you're a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian

  • We do have a Georgian-English dictionary, though, and (luckily) an English-German one

  • Given a proper representation, storage and query formalisms, it is possible to perform a transitive query using English as a pivot language
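The transitive query can be sketched in a few lines of Python. This is a minimal illustration with toy dictionaries, not the representation used later in the talk: the names `KA_EN`, `EN_DE` and `pivot_lookup` are hypothetical, and the assignment of German words to the English headwords is an assumption made for the example.

```python
from typing import Dict, List, Set

# Toy bilingual dictionaries (illustrative subset only).
# Which German word belongs to which English headword is an assumption.
KA_EN: Dict[str, List[str]] = {"ფეხი": ["foot", "leg"]}
EN_DE: Dict[str, List[str]] = {
    "foot": ["Fuß", "Sockel"],
    "leg": ["Bein", "Schenkel"],
}


def pivot_lookup(word: str,
                 source_to_pivot: Dict[str, List[str]],
                 pivot_to_target: Dict[str, List[str]]) -> Set[str]:
    """Collect every target-language word reachable through any pivot word."""
    targets: Set[str] = set()
    for pivot_word in source_to_pivot.get(word, []):
        targets.update(pivot_to_target.get(pivot_word, []))
    return targets


if __name__ == "__main__":
    # Georgian -> English -> German
    print(sorted(pivot_lookup("ფეხი", KA_EN, EN_DE)))
```

Every pivot word contributes all of its own translations, which is why the result set grows quickly.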

slide-11
SLIDE 11

Information Aggregation I Cross-lingual search

ფეხი → foot, leg (source: http://www.georgianweb.com/pdf/lexicon.pdf)

foot, leg → Basis, Fußbreit, Fußende, Fußpunkt, Sockel, Sohle, Standfuß, Standvorrichtung, Tritt, Mastfuß, Segelunterliek, Fußlinie, Fußmauer, Bein, Abschnitt, Etappe, Programmzweig, Schaft, Schlägel, Stollen, Strecke, Ader, Hachse, Kathete, Schenkel, Strang, Fuß (source: dict.leo.org)

slide-12
SLIDE 12

Information Aggregation I Cross-lingual search

  • Unfortunately, using English introduces a lot of noise

– 2 English translations, 27 (!) German translations

  • But we can combine multiple paths, e.g., one using English as a pivot, one using Russian

– elements in the intersection should be more reliable

slide-13
SLIDE 13

Information Aggregation I Cross-lingual search

ფეხი → foot, leg (source: http://www.georgianweb.com/pdf/lexicon.pdf)

foot, leg → Basis, Fußbreit, Fußende, Fußpunkt, Sockel, Sohle, Standfuß, Standvorrichtung, Tritt, Mastfuß, Segelunterliek, Fußlinie, Fußmauer, Bein, Abschnitt, Etappe, Programmzweig, Schaft, Schlägel, Stollen, Strecke, Ader, Hachse, Kathete, Schenkel, Strang, Fuß (source: dict.leo.org)

ფეხი → нога (source: http://meskhi.net/lexicon)

нога → Spielbein (source: dict.leo.org)

slide-14
SLIDE 14

Information Aggregation I Cross-lingual search

  • Unfortunately, using English introduces a lot of noise

– 2 English translations, 27 (!) German translations

  • But we can combine multiple paths, e.g., one using English as a pivot, one using Russian

– elements in the intersection should be more reliable

27 English-based translations + 3 Russian-based translations = 2 shared translations

slide-15
SLIDE 15

Information Aggregation I Cross-lingual search

  • In a similar way, words missing from the Russian (or the English) path may be taken from the other one

– more noise, but better coverage

27 English-based translations + 3 Russian-based translations = 28 possible translations

– e.g., German Spielbein „free leg“
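Combining the two paths amounts to set intersection (more reliable) and set union (better coverage). The sketch below uses the 27 German words reached via English; the Russian-based set is an assumption, since the slides only give its size (3), the overlap (2) and name Spielbein as the extra word.

```python
# German candidates reached via the English pivot (27 words from the example).
ENGLISH_PATH = {
    "Basis", "Fußbreit", "Fußende", "Fußpunkt", "Sockel", "Sohle",
    "Standfuß", "Standvorrichtung", "Tritt", "Mastfuß", "Segelunterliek",
    "Fußlinie", "Fußmauer", "Bein", "Abschnitt", "Etappe", "Programmzweig",
    "Schaft", "Schlägel", "Stollen", "Strecke", "Ader", "Hachse", "Kathete",
    "Schenkel", "Strang", "Fuß",
}

# German candidates reached via the Russian pivot.
# ASSUMPTION: the exact members are not listed in the talk; "Bein" and "Fuß"
# are assumed to be the two shared words, "Spielbein" the Russian-only one.
RUSSIAN_PATH = {"Bein", "Fuß", "Spielbein"}

# Intersection: words confirmed by both paths (less noise).
shared = ENGLISH_PATH & RUSSIAN_PATH

# Union: words found on at least one path (more noise, better coverage).
combined = ENGLISH_PATH | RUSSIAN_PATH

if __name__ == "__main__":
    print(len(shared), "shared translations:", sorted(shared))
    print(len(combined), "possible translations")
```

Under these assumptions the counts reproduce the figures given above: 2 shared and 28 possible translations.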

slide-16
SLIDE 16

Information Aggregation I Jargon: A Prototype

  • student project @ GU Frankfurt
  • enter a word (in any language) and a target language
  • consult different machine-readable dictionaries to find a path into the target language

  • visualize results together with their „path“
slide-17
SLIDE 17

Information Aggregation I Jargon: A Prototype

slide-18
SLIDE 18

Information Aggregation I Jargon: A Prototype

  • Jargon uses lexical resources provided by different groups

– using a shared vocabulary

  • lemon, more in 10 minutes

=> joint queries

  • still under development

– prototype on restricted data set

slide-19
SLIDE 19

Information Aggregation II Multilingual Semantic Web

  • a system for text mining (open information extraction) from archeological reports

  • extract machine-readable information from plain text

– currently, English only

  • in the longer perspective, German and Dutch

– http://corpora.acoli.informatik.uni-frankfurt.de/text-mining-webservice

slide-20
SLIDE 20

Information Aggregation II Multilingual Semantic Web

Given a PDF document

slide-21
SLIDE 21

Information Aggregation II Multilingual Semantic Web

Upload to server

slide-22
SLIDE 22

Information Aggregation II Multilingual Semantic Web

Perform NLP analysis

slide-23
SLIDE 23

Information Aggregation II Multilingual Semantic Web

Visualize data

slide-24
SLIDE 24

Information Aggregation II Multilingual Semantic Web

e.g. arch. periods

slide-25
SLIDE 25

Information Aggregation II Multilingual Semantic Web

  • or query in the results
slide-26
SLIDE 26

Information Aggregation II Multilingual Semantic Web

  • or query in the results

Dr Irakli Iashvili spent a month at the Heberden Coin Room at the Ashmolean Museum , also with the support of the British Academy , working on the coinage of the Black Sea in general , and the coins found at Pichvnari in particular .

Result TRIPLES TEXT QUERY

slide-27
SLIDE 27

Information Aggregation II Multilingual Semantic Web

  • or query in the results

In this query, the only information-bearing element is „:work“.

If we define that „:work“ entails „:bearbeitet“ (the German translation), we can formulate the same query in German, i.e.,

?a :bearbeitet ?c
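How such an entailment lets a German query match English-derived triples can be sketched as follows. The triples and the helper names (`TRIPLES`, `ENTAILS`, `with_entailments`, `match`) are hypothetical; they only illustrate the idea and are not the text-mining service's actual data or API.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical triples, as a text-mining run over the example sentence
# might produce them (identifiers invented for illustration).
TRIPLES: List[Triple] = [
    (":Irakli_Iashvili", ":work", ":coinage_of_the_Black_Sea"),
    (":Irakli_Iashvili", ":spend", ":month"),
]

# Predicate entailment: every ":work" statement also holds as ":bearbeitet".
ENTAILS: Dict[str, List[str]] = {":work": [":bearbeitet"]}


def with_entailments(triples: List[Triple],
                     entails: Dict[str, List[str]]) -> List[Triple]:
    """Return the triples plus one copy per entailed predicate."""
    expanded = list(triples)
    for subject, predicate, obj in triples:
        for entailed in entails.get(predicate, []):
            expanded.append((subject, entailed, obj))
    return expanded


def match(triples: List[Triple], predicate: str) -> List[Tuple[str, str]]:
    """Answer the pattern `?a <predicate> ?c` as (?a, ?c) pairs."""
    return [(s, o) for s, p, o in triples if p == predicate]


if __name__ == "__main__":
    expanded = with_entailments(TRIPLES, ENTAILS)
    print(match(expanded, ":bearbeitet"))  # German query
    print(match(TRIPLES, ":work"))         # English query
```

Materializing the entailed triples is only one option; a query could equally be rewritten at lookup time.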

slide-28
SLIDE 28

Linking Machine-Readable Dictionaries

  • Motivation: Aggregating information

– from different dictionaries
– from dictionaries and automatically analyzed text

  • State of the art on machine-readable dictionaries

– XML
– RDF

  • Example

– Converting, linking and querying multilingual Wiktionaries

slide-29
SLIDE 29

Machine Readable Dictionaries XML

  • Text Encoding Initiative (TEI)

– specifications for markup of digital-born documents
– originally closely oriented towards digital editions of printed books
– rich metadata (TEI header)
– semantic markup (div, seg, verse, …)
– limited interoperability

  • many different ways to represent the same information => information aggregation ???

slide-30
SLIDE 30

Machine Readable Dictionaries XML

  • Lexical Markup Framework (LMF)

– ISO standard for representing machine-readable dictionaries
– an abstract model with XML specifications (DTD)
– concrete application requires an instantiation

→ extending the DTD
→ violating the original DTD
→ in order to use this standard, you need to break it

slide-31
SLIDE 31

Machine Readable Dictionaries XML

  • Lexical Markup Framework (LMF)

– ISO standard for representing machine-readable dictionaries
– an abstract model with XML specifications (DTD)
– concrete application requires an instantiation

→ extending the DTD
→ violating the original DTD
→ in order to use this standard, you need to break it
→ suggestions for alternative representations of LMF, e.g., RDF (Francopoulo 2006)

slide-32
SLIDE 32

Resource Description Framework (RDF)

  • W3C standard (1999)

– generic data model: directed labeled graph

  • nodes, edges, labels

– originally developed to provide metadata about resources

  • e.g., journals in a bookstore and eBooks in an online shop

– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)

slide-33
SLIDE 33

Resource Description Framework (RDF)

– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)

URLs are prototypical URIs

  • http://www.w3.org/2000/01/rdf-schema#label

– protocol: http
– namespace (e.g., a document in the web): http://www.w3.org/2000/01/rdf-schema#
– identifier („local name“): label

slide-34
SLIDE 34

Resource Description Framework (RDF)

compact notation: define a prefix and use it instead of the namespace

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
… rdfs:label …

URLs are prototypical URIs

  • http://www.w3.org/2000/01/rdf-schema#label

– protocol: http
– namespace (e.g., a document in the web): http://www.w3.org/2000/01/rdf-schema#
– identifier („local name“): label

if you encounter an unknown prefix, you can look up its namespace under http://prefix.cc/
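The relation between full URIs and the compact prefix notation can be made concrete with two small helper functions. This is a simplified sketch: `expand` and `split_uri` are hypothetical names, and splitting at the last `#` or `/` is a common convention assumed here, not a rule stated in the talk.

```python
from typing import Dict, Tuple

# Prefix table, as declared with @prefix in Turtle.
PREFIXES: Dict[str, str] = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name: str, prefixes: Dict[str, str]) -> str:
    """Turn a prefixed name such as 'rdfs:label' into a full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    return prefixes[prefix] + local_name


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


if __name__ == "__main__":
    full = expand("rdfs:label", PREFIXES)
    print(full)
    print(split_uri(full))
```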

slide-35
SLIDE 35

Resource Description Framework (RDF)

  • basic data structure is a triple, consisting of

– source node („subject“)
– relation / property („predicate“)
– target node („object“)

  • typically, all of these are RDF resources

– identified by a URI

slide-36
SLIDE 36

Resource Description Framework (RDF)

  • basic data structure is a triple, consisting of

– source node („subject“)
– relation / property („predicate“)
– target node („object“)

  • typically, all of these are RDF resources

– identified by a URI

  • alternatively, the target may also be a literal, e.g., a string

slide-37
SLIDE 37

Resource Description Framework (RDF)

  • there are different notations, we use Turtle*

– triples written in a sequence
– separated by .
– usually one triple per line

* this description is deliberately simplified, see http://www.w3.org/TR/turtle/
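To make the triple structure and its Turtle rendering concrete, here is a minimal Python sketch. The class and function names (`URI`, `Literal`, `Triple`, `to_turtle`) are hypothetical, and the serialization covers only the simplified subset described here (full URIs in angle brackets, plain string literals, one triple per line).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI                 # source node
    predicate: URI               # relation / property
    obj: Union[URI, Literal]     # target node, or a literal such as a string


def term(t: Union[URI, Literal]) -> str:
    """Render a single term in (simplified) Turtle syntax."""
    if isinstance(t, Literal):
        escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{t.value}>"


def to_turtle(triple: Triple) -> str:
    """Render one triple per line, terminated by ' .'."""
    return f"{term(triple.subject)} {term(triple.predicate)} {term(triple.obj)} ."


if __name__ == "__main__":
    t = Triple(URI("http://wiktionary.org"),
               URI("http://lemon-model.net/lemon#language"),
               Literal("en"))
    print(to_turtle(t))
```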

slide-38
SLIDE 38

Resource Description Framework (RDF)

  • there are special vocabularies for classes, properties and instances

– e.g., RDF Schema (RDFS)

– a: short notation for rdf:type, i.e., „is instance of“
– rdfs:label: assigns an RDF resource a human-readable form

slide-39
SLIDE 39

Machine-Readable Dictionaries RDF: lemon

  • an RDF Schema (or an ontology) defines a domain vocabulary

  • lexical resources are interoperable if shared vocabularies are used, e.g., the Lexicon Model for Ontologies (lemon)

– based on LMF
– developed by a W3C Community Group

  • http://www.w3.org/community/ontolex/

– not a standard (yet), but already widely used

slide-40
SLIDE 40

Machine-Readable Dictionaries RDF: lemon

http://www.lemon-model.net/lemon

core

slide-41
SLIDE 41

Machine-Readable Dictionaries RDF: lemon

http://www.lemon-model.net/lemon

core

we ignore this part: it is more relevant for the Semantic Web than for the Humanities

slide-42
SLIDE 42

Machine-Readable Dictionaries RDF: lemon

http://www.lemon-model.net/lemon

core

today, we ignore this part: it is relevant for phrasal expressions (e.g., idioms)

slide-43
SLIDE 43

Wiktionary as a Lexicon

slide-44
SLIDE 44

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon

slide-45
SLIDE 45

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon

(plus subcategorization information)

slide-46
SLIDE 46

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon

(plus morphosyntactic information)

slide-47
SLIDE 47

Wiktionary as a Lexicon

  • beyond its core, lemon provides additional vocabulary elements

  • e.g., cross-references to other languages

http://www.lemon-model.net/lemon
slide-48
SLIDE 48

Wiktionary as a Lexicon

http://www.lemon-model.net/lemon translation

  • beyond its core, lemon provides additional vocabulary elements

  • e.g., cross-references to other languages

slide-49
SLIDE 49

From Wiktionary to lemon

http://www.lemon-model.net/lemon

<http://wiktionary.org> a lemon:Lexicon .
<http://wiktionary.org> lemon:language "en" .

<http://wiktionary.org>

slide-50
SLIDE 50

From Wiktionary to lemon

http://www.lemon-model.net/lemon

<http://wiktionary.org> lemon:entry :know .
:know a lemon:LexicalEntry .

<http://wiktionary.org> :know

slide-51
SLIDE 51

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know1 a lemon:LexicalSense .
:know1 rdfs:comment "(transitive) To be certain …" .

slide-52
SLIDE 52

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know lemon:sense :know1 .
:know lemon:sense :know2 .
…

:know

slide-53
SLIDE 53

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know_form1

:know_form0 a lemon:LexicalForm .
:know_form0 lemon:writtenRep "to know" .
:know_form0 rdfs:comment "Verb" .
:know_form1 a lemon:LexicalForm .
:know_form1 lemon:writtenRep "knows" .
:know_form1 rdfs:comment "Verb; third-person singular simple present" .
...

:know_form2 … :know_form0

slide-54
SLIDE 54

From Wiktionary to lemon

http://www.lemon-model.net/lemon

:know_form1 :know_form2 …

:know lemon:canonicalForm :know_form0 .
:know lemon:form :know_form1 .
:know lemon:form :know_form2 .
…

:know_form0

slide-55
SLIDE 55

From Wiktionary to lemon

http://www.lemon-model.net/lemon translation

:know

:know lemon:isTranslationOf <http://de.wiktionary.org/wiki/wissen> .
…

for lemon:isTranslationOf and more documentation, see http://lemon-model.net/
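The conversion steps for a single entry can be summarized in one function that emits Turtle lines of the kind shown above. This is a sketch under stated assumptions: `entry_to_lemon` and its parameters are hypothetical, the input is assumed to be already extracted from Wiktionary, and the class and property names are copied from the examples in this talk rather than checked against the lemon specification.

```python
from typing import List


def quote(text: str) -> str:
    """Render a plain string literal in Turtle syntax."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def entry_to_lemon(lexicon_uri: str, entry: str, glosses: List[str],
                   forms: List[str], translations: List[str]) -> List[str]:
    """Emit Turtle lines for one lexical entry (names as used in the talk)."""
    lines = [
        f"<{lexicon_uri}> lemon:entry :{entry} .",
        f":{entry} a lemon:LexicalEntry .",
    ]
    # One sense per gloss, numbered from 1 (:know1, :know2, ...).
    for i, gloss in enumerate(glosses, start=1):
        sense = f":{entry}{i}"
        lines += [
            f"{sense} a lemon:LexicalSense .",
            f"{sense} rdfs:comment {quote(gloss)} .",
            f":{entry} lemon:sense {sense} .",
        ]
    # Form 0 is treated as the canonical form (lemma), the rest as other forms.
    for i, written in enumerate(forms):
        form = f":{entry}_form{i}"
        prop = "lemon:canonicalForm" if i == 0 else "lemon:form"
        lines += [
            f"{form} a lemon:LexicalForm .",
            f"{form} lemon:writtenRep {quote(written)} .",
            f":{entry} {prop} {form} .",
        ]
    # Cross-references to entries in other languages.
    for target in translations:
        lines.append(f":{entry} lemon:isTranslationOf <{target}> .")
    return lines


if __name__ == "__main__":
    for line in entry_to_lemon(
            "http://wiktionary.org", "know",
            ["(transitive) To be certain …"],
            ["to know", "knows"],
            ["http://de.wiktionary.org/wiki/wissen"]):
        print(line)
```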

slide-56
SLIDE 56

Querying RDF

  • SPARQL*

– SPARQL Protocol and RDF Query Language
– http://www.w3.org/TR/rdf-sparql-query/

  • SELECT

– define variables (beginning with ?)

  • WHERE

– write triples (similar to Turtle, but now with variables)

* simplified, again

slide-57
SLIDE 57

PREFIX : <http://wiktionary.org>                # namespace prefix declarations
PREFIX lemon: <http://lemon-model.net/lemon#>
SELECT ?deForm ?enForm
WHERE {
  ?deLexicon a lemon:Lexicon .                  # German lexicon ?deLexicon
  ?deLexicon lemon:language "de" .
  ?deLexicon lemon:entry ?de .                  # lexical entry ?de
  ?de lemon:canonicalForm ?deForm .             # German lexical form (lemma)
  ?de lemon:isTranslationOf ?en .               # German is translation of ?en
  ?enLexicon lemon:entry ?en .
  ?enLexicon lemon:language "en" .              # ?en in English lexicon
  ?en lemon:canonicalForm ?enForm .
}

Querying RDF

isTranslationOf

SPARQL 1.0
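What a SPARQL engine does with the WHERE clause can be illustrated by a tiny matcher for triple patterns, written in plain Python. This is a didactic sketch, not a SPARQL implementation: the toy graph, its identifiers (`:deLex`, `:wissen`, ...) and the functions `unify` and `solve` are all hypothetical, and terms are kept as plain strings.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Toy graph with invented identifiers; literals keep their quotes.
GRAPH: List[Triple] = [
    (":deLex", "a", "lemon:Lexicon"),
    (":deLex", "lemon:language", '"de"'),
    (":deLex", "lemon:entry", ":wissen"),
    (":wissen", "lemon:canonicalForm", ":wissen_form0"),
    (":wissen", "lemon:isTranslationOf", ":know"),
    (":enLex", "a", "lemon:Lexicon"),
    (":enLex", "lemon:language", '"en"'),
    (":enLex", "lemon:entry", ":know"),
    (":know", "lemon:canonicalForm", ":know_form0"),
]

# The WHERE clause as a list of triple patterns (variables start with "?").
QUERY: List[Triple] = [
    ("?deLexicon", "a", "lemon:Lexicon"),
    ("?deLexicon", "lemon:language", '"de"'),
    ("?deLexicon", "lemon:entry", "?de"),
    ("?de", "lemon:canonicalForm", "?deForm"),
    ("?de", "lemon:isTranslationOf", "?en"),
    ("?enLexicon", "lemon:entry", "?en"),
    ("?enLexicon", "lemon:language", '"en"'),
    ("?en", "lemon:canonicalForm", "?enForm"),
]


def unify(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend the binding so that pattern equals triple, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.setdefault(p, t) != t:   # variable already bound differently
                return None
        elif p != t:                        # constant does not match
            return None
    return new


def solve(patterns: List[Triple], graph: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Yield every variable binding that satisfies all patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in graph:
        extended = unify(patterns[0], triple, binding)
        if extended is not None:
            yield from solve(patterns[1:], graph, extended)


if __name__ == "__main__":
    for b in solve(QUERY, GRAPH):
        print(b["?deForm"], b["?enForm"])   # the SELECT variables
```

Because every pattern shares variables with its neighbours, the matcher effectively joins the German lexicon, the translation link and the English lexicon in one pass.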

slide-58
SLIDE 58

So far

  • RDF motivation and introduction
  • How to interpret (convert) Wiktionary data to RDF

– lemon vocabulary

  • How to query lemon data (example)

– for English translations of German words
– queries over series of dictionaries can simply be concatenated

  • Slightly simplified
slide-59
SLIDE 59

Digging Deeper

  • remember that URIs were unique identifiers in the web (of data)

– this means we can look up external URIs (links)
– if they resolve over HTTP, and they provide RDF data, we can query these data sources
– if they use the same vocabulary, we can iterate the query, e.g., over dictionaries in a third language
– „federation“

slide-60
SLIDE 60

Digging Deeper

  • remember that URIs were unique identifiers in the web (of data)

– this means we can look up external URIs (links)
– if they resolve over HTTP, and they provide RDF data, we can query these data sources
– if they use the same vocabulary, we can iterate the query, e.g., over dictionaries in a third language
– „federation“

Resources fulfilling these conditions constitute „Linked (Open) Data“

http://www.w3.org/DesignIssues/LinkedData.html
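Looking up a URI over HTTP and asking for RDF can be sketched with Python's standard library. The function names are hypothetical, and whether a given server actually honours the `Accept` header and returns Turtle is an assumption that has to be checked per data source.

```python
import urllib.request


def build_rdf_request(uri: str, accept: str = "text/turtle") -> urllib.request.Request:
    """Prepare an HTTP GET request that asks for an RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": accept})


def fetch_rdf(uri: str, timeout: float = 10.0) -> str:
    """Dereference the URI and return the response body as text.

    Requires network access; the server decides which format it sends back.
    """
    with urllib.request.urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds the request; call fetch_rdf(...) to actually go online.
    request = build_rdf_request("http://example.org/resource")
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```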

slide-61
SLIDE 61

Linked Open Data (LOD) cloud

Source http://lod-cloud.net

slide-62
SLIDE 62

Linguistically relevant LOD resources

Source http://lod-cloud.net

  • DBpedia (Wikipedia)

– cf. Markert & Nissim (2003) on anaphor resolution

  • WordNet(s)
  • language identifiers
  • WordNet-derived datasets
  • Named Entity Repositories
  • Other Semantic Knowledge Bases

slide-63
SLIDE 63

Linguistic Linked Open Data cloud

  • a collection of linguistic resources

– published under open licenses
– as linked data
– developed and maintained in a decentralized way
– metadata at http://datahub.io

=> cloud diagram

– developed as a community effort in the context of the Open Linguistics Working Group of the Open Knowledge Foundation

slide-64
SLIDE 64

Open Knowledge Foundation (OKFN, http://okfn.org)

  • non-profit organization
  • founded in 2004
  • promote open knowledge in all its forms

– e.g., publication of government data (UK, US)

  • provide infrastructural support for several working groups

slide-65
SLIDE 65

OKFN Open Linguistics Working Group (OWLG)

  • founded in Oct 2010 in Berlin, Germany
  • open network of individuals interested in

– linguistic resources and/or
– their publication under open licenses

  • multi-disciplinary

– NLP/CL, DH, typology/language documentation, IT, …

  • infrastructure

– mailing list, web site/blog, wiki
– http://linguistics.okfn.org

slide-66
SLIDE 66

Important OWLG goals (http://linguistics.okfn.org)

  • 1. Promote open data in relation to language data
  • 2. Facilitate communication between researchers who use / distribute / maintain open linguistic data
  • 3. Mediate between providers and users of technical infrastructures
  • 4. Build and maintain an index of open linguistic data sources

slide-67
SLIDE 67

Workshop series

  • Linked Data in Linguistics (LDL)
  • Multilingual Linked Open Data for Enterprises (MLODE)
  • Linked Data in Linguistic Typology (LDLT)

slide-68
SLIDE 68

Linguistic Linked Open Data

  • Very different data sets

different data providers, different incentives

  • All using the same set of technologies, and – increasingly – shared vocabularies

e.g. lemon

slide-69
SLIDE 69

Linguistic Linked Open Data

  • Very different data sets

different data providers, different incentives

  • All using the same set of technologies, and – increasingly – shared vocabularies

e.g. lemon (marked L! in the cloud diagram)

slide-70
SLIDE 70

Linguistic Linked Open Data

  • Very different data sets

different data providers, different incentives

  • All using the same set of technologies, and – increasingly – shared vocabularies

e.g. lemon (marked L! in the cloud diagram)

Since 2012, our focus has been on converting resources. With what we have accomplished since then, we are now in a position to begin developing applications.

slide-71
SLIDE 71

Final words (for the moment)

  • Thank you for your attention …
  • Interested in contributing? Talk to me!

– There are quite a few resources that would be interesting candidates for an LOD conversion

  • Georgian-Russian dictionary
  • Georgian-English dictionary
  • Georgian Wiktionary

– Developing, extending, using* software with such data

* in the longer perspective