Curation Technologies for Multilingual Europe (Georg Rehm, DFKI)

SLIDE 1

Curation Technologies for Multilingual Europe

Georg Rehm

DFKI, Germany
META-FORUM 2016, Lisbon, Portugal, 04/05 July 2016

SLIDE 2

[Figure: information flow through Input, Processes, Software and Output, with question marks around the information items]

  • Author
  • Scholar
  • TV editor
  • Researcher
  • Knowledge worker
  • Investigative journalist
  • Designer of an exhibition
  • Curator of digital information

SLIDE 3

Sectors

Input: tweet, newspaper article, wire copy, facebook status update, search result, email, text message, concept, text file, video, map, stockphoto, in-house database, calendar entry, spreadsheet, archive, etc.

Processes: analyse, select, focus, revise, read up on, write, create, research, assess, evaluate, arrange, sort, structure, summarise, shorten, translate, catch up on, combine, abstract, integrate, visualise, generate, annotate, reference, etc.

Software: text processor, presentation, spreadsheet, email, browser, groupware, sector-specific application, CMS, ECMS, CRM, enterprise software, graphics/layouting software, IP telephony, etc.

Output: newspaper article, multimedia website, tv report, exhibition catalogue, mobile application, mashup (e.g., map), text piece, concept, timeline, study, presentation, fact collection, description of an exhibit, analysis, etc.

SLIDE 6

[Figure: technology layers (language and knowledge technologies, curation technologies, sector-specific technologies, platform technologies) and sector-specific solutions]

Digital Curation Technologies

  • Make curation processes in four SMEs (and sectors) more efficient through language and knowledge technologies.

  • Technology transfer project to arrive at proofs of concept.
  • Curation services for real companies and real use cases.
  • The human expert/curator is always at the centre and in the loop.
  • Platform for digital curation technologies: innovation boost.

SLIDE 7

Curation Dashboard

  • Structure visualisation
  • Multilingual multimedia sources
  • Crossmedia recommendations
  • Multilingual summarisation
  • Event timelining
  • Semantification of content
  • Multilingual sentiment analysis
  • Semantic storytelling
  • Ontology-based knowledge structures
  • Automatic hyperlinking of document collections

Curation Processes

Processing, exploration and re-aggregation of domain- and task-specific document collections.

SLIDE 8

Key Characteristics

  • Technology transfer and integration project
  • Broad set of tools and technologies
  • Focus on building proofs of concept
  • Our technologies don’t have to be perfect
  • Human expert, i.e., the curator, always in the loop
  • Important for all SME partners: domain-adaptability.
  • WPs: Semantic Analysis, Semantic Generation, Multilingual Technologies, Integration into Curation Tech

SLIDE 9

Platform for digital curation technologies

[Figure: platform architecture with the elements: clients using the API, broker, REST API, curation services 1 and 2 (each wrapping a language or knowledge technology), external services 1 and 2, pipelined curation workflow]

  • Curation process: e-service available through REST API.
  • Services can be combined to form pipelines or workflows.
  • Domain-adaptability: every curation process has a training API to create and use domain-specific models.
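
Combining services into a pipeline can be pictured as plain function composition. The snippet below is only a minimal sketch under stated assumptions: `recognise_entities` and `analyse_time` are hypothetical local stand-ins for calls to the REST e-services, and the dictionary-based document format is a simplification that is not taken from the slides.

```python
from functools import reduce
from typing import Callable, Dict

# Assumption: a document is a plain dict; the platform itself exchanges richer formats.
Document = Dict[str, object]
Service = Callable[[Document], Document]


def recognise_entities(doc: Document) -> Document:
    """Hypothetical stand-in for an entity recognition e-service."""
    known = ["Roskilde", "Copenhagen", "Iceland"]
    found = [name for name in known if name in str(doc["text"])]
    return {**doc, "entities": found}


def analyse_time(doc: Document) -> Document:
    """Hypothetical stand-in for a temporal analysis e-service."""
    found = [word for word in str(doc["text"]).split() if word.endswith("th")]
    return {**doc, "temporal": found}


def pipeline(*services: Service) -> Service:
    """Combine services so that each one receives the output of the previous one."""
    return lambda doc: reduce(lambda acc, service: service(acc), services, doc)


workflow = pipeline(recognise_entities, analyse_time)
result = workflow({"text": "The ships protected Roskilde in the 11th century"})
print(result["entities"], result["temporal"])
```

In a real client each stand-in would issue an HTTP request instead of computing locally; the composition logic would stay the same.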

SLIDE 10

Current Results

  • Implemented the following baseline services:

    – NER: e-entityrecognition e-service
    – Geolocation: e-entityrecognition and visualisation
    – Temporal Analyser: e-entityrecognition and visualisation
    – Classification: e-classification e-service
    – Clustering: e-clustering e-service
    – Machine Translation: e-translation e-service

  • Curation Dashboard (first prototype)
  • Semantic Storytelling (work in progress)

SLIDE 11

NER, Entity Linking, Geolocation

Example input:

... In the Viking colony of Iceland, an extraordinary vernacular literature blossomed in the 12th through 14th centuries ...

... The ships were scuttled there in the 11th century, to block a navigation channel and thus protect Roskilde, then Copenhagen from seaborne assault ...

... Viking Age inscriptions have also been discovered on the Manx runestones on the Isle of Man. ...

Processing steps: plain text, NIF enrichment, visualisation

http://api.digitale-kuratierung.de/api/e-nlp/namedEntityRecognition?analysis=ner
http://dev.digitale-kuratierung.de/admini/pages/geolocalization.php

  • Currently based on OpenNLP (with NIF integration)
  • Mode 1: model-based (for domains where annotated data is available)
  • Mode 2: dictionary-based (for domains where only a list of names is available)
  • Entity Linking through SPARQL queries to DBpedia
  • For locations, GPS coordinates are retrieved; the document-level average and standard deviation (over all locations) are calculated to visualise the positioning of documents on a map.
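
The document-level positioning described in the last bullet can be sketched as follows. This is an illustration, not the project's implementation: the function name `document_position` is hypothetical, the population standard deviation is an assumption (the slides do not say which variant is used), and plain arithmetic averaging of latitude and longitude is a simplification that only behaves well for locations that are not spread across the antimeridian.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees


def document_position(locations: List[Coordinate]) -> Tuple[Coordinate, Coordinate]:
    """Return ((mean_lat, mean_lon), (std_lat, std_lon)) over all locations of a document."""
    lats = [lat for lat, _ in locations]
    lons = [lon for _, lon in locations]
    # Assumption: population standard deviation over the document's locations.
    return (mean(lats), mean(lons)), (pstdev(lats), pstdev(lons))


# Rough, illustrative coordinates (assumption) for Roskilde and Copenhagen.
centre, spread = document_position([(55.64, 12.08), (55.68, 12.57)])
print(centre, spread)
```

The mean gives the point at which a document is drawn on the map, and the standard deviation indicates how widely its locations are scattered around that point.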

SLIDE 12

NER Training

http://api.digitale-kuratierung.de/api/e-nlp/trainModel?analysis=dict
(in the suboptimal case that only a list of terms and their URIs in an ontology is available)

http://api.digitale-kuratierung.de/api/e-nlp/trainModel?analysis=ner
(if annotated training data is available)

Result: an NER model that is directly usable on new input.
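
The two training endpoints differ only in the `analysis` query parameter. The sketch below merely builds the URL for each case; the helper name `training_url` and the boolean switch are hypothetical, and the request method, payload format and any further parameters are not given on the slides and are therefore left out.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide; everything else in this sketch is illustrative.
TRAIN_ENDPOINT = "http://api.digitale-kuratierung.de/api/e-nlp/trainModel"


def training_url(has_annotated_data: bool) -> str:
    """Pick the training mode: 'ner' for annotated data, 'dict' for a plain term list."""
    analysis = "ner" if has_annotated_data else "dict"
    return f"{TRAIN_ENDPOINT}?{urlencode({'analysis': analysis})}"


print(training_url(True))
print(training_url(False))
```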

SLIDE 13

Temporal Analysis

Example input:

... The ships were scuttled there in the 11th century, to block a navigation channel and thus protect Roskilde, then Copenhagen from seaborne assault ...

... Viking Age inscriptions have also been discovered on the Manx runestones on the Isle of Man. ...

... In the Viking colony of Iceland, an extraordinary vernacular literature blossomed in the 12th through 14th centuries ...

[Figure: timeline ranging from 900 to 1600]

http://api.digitale-kuratierung.de/api/e-nlp/namedEntityRecognition?analysis=temp
http://dev.digitale-kuratierung.de/admini/pages/timelining.php

Processing steps: plain text, NIF enrichment, visualisation

  • Sort and rank documents from a collection on a chronological scale.
  • Developed rule-based system due to our focus in terms of languages (EN, DE), domain adaptability, normalisation requirements.
  • Analysis of temporal expressions in a document (or, later, paragraphs or even sentences).
  • Compute mean value for date and time, allowing positioning on a timeline.
  • Future plans: adaptability through user-specific rules.
  • Related work: SUTime, HeidelTime, Tango, TARSQI; many papers at LREC 2016
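
A rule-based analysis followed by a mean value per document can be sketched in a few lines. The example below is a toy illustration and not the DKT rule set: it knows a single hypothetical rule for English century expressions, represents each century by its approximate midpoint (a simplification), and all function names are invented for this sketch.

```python
import re
from statistics import mean
from typing import Dict, List, Optional

# Matches "11th century" and "12th through 14th centuries" (illustrative rule, English only).
CENTURY = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)"
    r"(?:\s+through\s+(\d{1,2})(?:st|nd|rd|th))?"
    r"\s+centur(?:y|ies)\b"
)


def century_years(text: str) -> List[int]:
    """Normalise century expressions to one representative year per century."""
    years: List[int] = []
    for match in CENTURY.finditer(text):
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        # Simplification: represent the n-th century by its approximate midpoint.
        years.extend((n - 1) * 100 + 50 for n in range(first, last + 1))
    return years


def timeline_position(text: str) -> Optional[float]:
    """Mean of all normalised years, or None if no expression was found."""
    years = century_years(text)
    return mean(years) if years else None


def rank_documents(docs: Dict[str, str]) -> List[str]:
    """Sort document ids chronologically; documents without dates are left out."""
    positions = {doc_id: timeline_position(text) for doc_id, text in docs.items()}
    dated = {doc_id: pos for doc_id, pos in positions.items() if pos is not None}
    return sorted(dated, key=lambda doc_id: dated[doc_id])
```

With the example passages from this slide, the Roskilde text would be placed before the Iceland text on the timeline, while the Isle of Man text contains no expression this single rule can normalise.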

SLIDE 14

Classification

  • Mallet – Maximum Entropy Algorithm
  • Algorithm for text classification, easy integration.
  • Goal: text classification, i.e., assign a topic (class) to a document (or parts of a document) to apply domain- or topic-specific NLP processing techniques.
  • Future plans: improvement of classification schema by means of new training data and additional algorithms.

Example output (NIF, Turtle):

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    <http://dkt.dfki.de/documents/#char=0,1257>
        a nif:RFC5147String , nif:String , nif:Context ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "1257"^^xsd:nonNegativeInteger ;
        nif:documentClassificationLabel "Frühjahrsoffensive_1918"^^xsd:string ;
        nif:isString "Ceylon-Teestube B. Walther München Maximilian-Strasse 44 Gegenüber dem Königl. Hoftheater Telephon 428 München, den 26.XI.13. Von hier nach Dresden ab München 8.25 9.00 10.20 an Dresden 7.28 10.47 9.48 Sie müssen unbedingt Donnerstag hier bleiben. So können Sie doch nicht vorbeifahren. Donnerstag Abend eine interessante Uraufführung in den Kammerspielen \"unseligen Gedenkens \" Ich werde Billets dafür besorgen. […]"^^xsd:string .
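
Reading the assigned class out of such a NIF response can be sketched with a regular expression. This is a shortcut for illustration only: the helper `classification_label` is hypothetical, it assumes the `nif:` prefix is used exactly as in the example, and a proper RDF parser would be the more robust choice.

```python
import re
from typing import Optional

# Matches: nif:documentClassificationLabel "<label>"^^xsd:string
LABEL = re.compile(r'nif:documentClassificationLabel\s+"([^"]*)"')


def classification_label(turtle: str) -> Optional[str]:
    """Return the first document classification label in a NIF/Turtle string, if any."""
    match = LABEL.search(turtle)
    return match.group(1) if match else None


# Shortened sample based on the example output of the classification service.
sample = (
    '<http://dkt.dfki.de/documents/#char=0,1257> a nif:Context ; '
    'nif:documentClassificationLabel "Frühjahrsoffensive_1918"^^xsd:string .'
)
print(classification_label(sample))
```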

SLIDE 15

Clustering

  • WEKA (Expectation Maximisation algorithm)
  • Easy integration, availability, additional algorithms.
  • Goal: identification of distinct features of document collections.
  • Example use case: a user has to prepare a museum exhibit on “Birds”. Knowing which documents can be grouped can be useful to split the documents into exhibition rooms.
  • Future plans: allow users to easily recognize groups of documents in new domains and collections; faceted search.

ARFF Input

    @RELATION iris
    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE sepalwidth NUMERIC
    @ATTRIBUTE petallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @DATA
    5.1,3.5,1.4,0.2
    4.9,3.0,1.4,0.2
    4.7,3.2,1.3,0.2
    4.6,3.1,1.5,0.2
    5.0,3.6,1.4,0.2
    5.4,3.9,1.7,0.4
    4.6,3.4,1.4,0.3
    5.0,3.4,1.5,0.2
    4.4,2.9,1.4,0.2
    4.9,3.1,1.5,0.1

JSON Output

    {
      "results": {
        "numberClusters": -1,
        "clusters": {
          "cluster1": {
            "clusterId": 1,
            "entitites": {
              "entity1": { "meanValue": 3.3099999999999996, "label": "sepalwidth" },
              "entity2": { "meanValue": 1.45, "label": "petallength" },
              "entity3": { "meanValue": 0.22000000000000003, "label": "petalwidth" }
            }
          }
        }
      }
    }
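
The `meanValue` fields in the JSON output appear to be per-attribute means over the rows of a cluster. The sketch below checks this reading for the single cluster shown; it does not implement Expectation Maximisation, the parser only handles the numeric-only ARFF subset used in the example, and both function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

ARFF = """@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@DATA
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4
4.6,3.4,1.4,0.3
5.0,3.4,1.5,0.2
4.4,2.9,1.4,0.2
4.9,3.1,1.5,0.1
"""


def parse_arff(text: str) -> Tuple[List[str], List[List[float]]]:
    """Parse attribute names and numeric rows from a minimal, numeric-only ARFF string."""
    names: List[str] = []
    rows: List[List[float]] = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.upper().startswith("@ATTRIBUTE"):
            names.append(line.split()[1])
        elif line.upper().startswith("@DATA"):
            in_data = True
        elif in_data:
            rows.append([float(value) for value in line.split(",")])
    return names, rows


def attribute_means(names: List[str], rows: List[List[float]]) -> Dict[str, float]:
    """Mean value of every attribute over the given rows (treated as one cluster)."""
    return {name: mean(column) for name, column in zip(names, zip(*rows))}


names, rows = parse_arff(ARFF)
means = attribute_means(names, rows)
print(means)
```

Up to floating-point rounding, the means for `sepalwidth`, `petallength` and `petalwidth` agree with the values reported for `cluster1`.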

SLIDE 16

Machine Translation

Workflow: Named Entity Recognition, Entity Linking, Temporal Expressions, Metadata Processing, Post-Edit, Retraining

Language & Translation Models trained on DGT, News, Europarl, TED

Example

Source (DE): Herr Modi befindet sich auf einer fünftägigen Reise nach Japan, um die wirtschaftlichen Beziehungen mit der drittgrößten Wirtschaftsnation der Welt zu festigen.
MT output (EN): Mr Modi is located on a five-day trip to Japan to strengthen the economic ties with the third largest economy in the world.

  • Robust, adaptable and customised models of MT as e-services (Moses-based SMT)
  • Scenarios: museums, showrooms; news, media; publishers; cultural institutions, archives
  • Integration in curation workflows with other DKT services (NER, Temporal Analyser)
  • Plug in multiple knowledge sources (Linked Data)

SLIDE 17

Semantic Storytelling

  • Important objective for all partner use cases: Automatic hyper-linking of task-specific, self-contained collections.
  • Input: coherent, self-contained document collection
  • Output: processed collection with added analysis information, easily accessible as a hypertext, for efficient browsing
  • Semantic Storytelling – operates on the hypertext graph that we construct on top of the original collection
  • Enables multiple different paths through the collection
  • Semantic storytelling is the identification, ranking and recommendation of meaningful hypertext paths.
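
One way to picture the identification and ranking of hypertext paths is a small graph search. The sketch below rests entirely on assumptions that do not come from the slides: documents are linked when they share recognised entities, a path is scored by the number of shared entities along its edges, and both function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Dict[str, int]]


def build_graph(doc_entities: Dict[str, Set[str]]) -> Graph:
    """Link two documents when they share entities; edge weight = number of shared entities."""
    graph: Graph = {doc: {} for doc in doc_entities}
    for a, b in combinations(doc_entities, 2):
        shared = len(doc_entities[a] & doc_entities[b])
        if shared:
            graph[a][b] = shared
            graph[b][a] = shared
    return graph


def ranked_paths(graph: Graph, start: str, length: int) -> List[Tuple[int, List[str]]]:
    """Enumerate simple paths with `length` documents from `start`, best score first."""
    results: List[Tuple[int, List[str]]] = []

    def extend(path: List[str], score: int) -> None:
        if len(path) == length:
            results.append((score, path))
            return
        for neighbour, weight in graph[path[-1]].items():
            if neighbour not in path:
                extend(path + [neighbour], score + weight)

    extend([start], 0)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(results, key=lambda item: (-item[0], item[1]))


docs = {
    "a": {"Roskilde", "Copenhagen"},
    "b": {"Roskilde", "Copenhagen", "Iceland"},
    "c": {"Iceland"},
    "d": {"Copenhagen"},
}
for score, path in ranked_paths(build_graph(docs), "a", 3):
    print(score, " -> ".join(path))
```

The top-ranked paths would then be the candidates recommended to the curator, who stays in the loop and decides which path tells a meaningful story.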

SLIDE 18

http://dev.digitale-kuratierung.de/2ds3/index.php

<http://d-nb.info/gnd/11858071X, met, http://d-nb.info/gnd/129094722>
<http://d-nb.info/gnd/118589768, wrote, http://d-nb.info/gnd/118623230>
<http://d-nb.info/gnd/123242231, visited, http://d-nb.info/gnd/188402519>
<http://d-nb.info/gnd/118569015, said, http://d-nb.info/gnd/11947509X>
<http://d-nb.info/gnd/119173425, was, http://d-nb.info/gnd/118629867>
<http://d-nb.info/gnd/119178893, designed, http://d-nb.info/gnd/118629867>
<http://d-nb.info/gnd/118876759, love, http://d-nb.info/gnd/118629867>
<http://d-nb.info/gnd/118545892, depart, http://d-nb.info/gnd/107363569>
<http://d-nb.info/gnd/128830751, write, http://d-nb.info/gnd/118606026>
<http://d-nb.info/gnd/11858071X, protect, http://d-nb.info/gnd/39650438>
<http://d-nb.info/gnd/116713704, married, http://d-nb.info/gnd/52754181>
…

SLIDE 19

Curation Dashboard

SLIDE 20

Conclusions

  • Curation technologies are smart technologies to support knowledge workers handling content and knowledge.
  • The multilingual Digital Single Market will create a massive need for multilingual Curation Technologies due to an ever-increasing need for multilingual content.
  • DKT is mostly centred around German and English.
  • We cater for a small set of curation processes.
  • To be extended in a larger follow-up project.
  • Extended set of curation processes, more complex approaches, many more languages.

SLIDE 21

Thank you!
