Curation Technologies for Multilingual Europe
Georg Rehm
DFKI, Germany
META-FORUM 2016 – Lisbon, Portugal – 04/05 July 2016
[Figure: information and knowledge workers (author, scholar, TV editor, researcher, investigative ...) and the information they handle]
Sectors
Input: tweet, newspaper article, wire copy, facebook status update, search result, email, text message, concept, text file, video, map, stockphoto, in-house database, calendar entry, spreadsheet, archive, etc.
Processes: analyse, select, focus, revise, read up on, write, create, research, assess, evaluate, arrange, sort, structure, summarise, shorten, translate, catch up on, combine, abstract, integrate, visualise, generate, annotate, reference, etc.
Software: text processor, presentation, spreadsheet, email, browser, groupware, sector-specific application, CMS, ECMS, CRM, enterprise software, graphics/layouting software, IP telephony, etc.
Output: newspaper article, multimedia website, tv report, exhibition catalogue, mobile application, mashup (e.g., map), text piece, concept, timeline, study, presentation, fact collection, description of an exhibit, analysis, etc.
– language and knowledge technologies
– curation technologies
– sector-specific technologies
– platform technologies
– sector-specific solutions
efficient through language and knowledge technologies.
Curation Dashboard
– Structure visualisation
– Multilingual multimedia sources
– Crossmedia recommendations
– Multilingual summarisation
– Event timelining
– Semantification of content
– Multilingual sentiment analysis
– Semantic storytelling
– Ontology-based knowledge structures
– Automatic hyperlinking of document collections
Processing, exploration and re-aggregation of domain- and task-specific document collections.
Multilingual Technologies, Integration into Curation Tech
Platform for digital curation technologies
[Figure: platform architecture. Components shown: a broker with a REST API; curation service 1 and curation service 2, each wrapping a language or knowledge technology; external service 1 and external service 2; several clients using the API; a pipelined curation workflow.]
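The pipelined curation workflow named in the architecture can be pictured as a chain of services, each of which takes a document and returns an enriched copy. The following is a minimal sketch of that idea only. The `Document` type, `run_pipeline` and both toy services are hypothetical names introduced for illustration; they are not the platform's API, and the real services would be called over the REST API rather than as local functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container: raw text plus annotations added by services."""
    text: str
    annotations: Dict[str, List[str]] = field(default_factory=dict)


# A curation service is modelled as a function from Document to Document.
Service = Callable[[Document], Document]


def toy_entity_service(doc: Document) -> Document:
    # Stand-in for a real NER service: capitalised tokens count as entities.
    tokens = [tok.strip(".,") for tok in doc.text.split()]
    entities = [tok for tok in tokens if tok[:1].isupper()]
    return Document(doc.text, {**doc.annotations, "entities": entities})


def toy_year_service(doc: Document) -> Document:
    # Stand-in for a real temporal analyser: purely numeric tokens count as years.
    tokens = [tok.strip(".,") for tok in doc.text.split()]
    years = [tok for tok in tokens if tok.isdigit()]
    return Document(doc.text, {**doc.annotations, "years": years})


def run_pipeline(doc: Document, services: List[Service]) -> Document:
    # Apply the services in order; each one sees the output of the previous one.
    for service in services:
        doc = service(doc)
    return doc


result = run_pipeline(
    Document("Vikings reached Iceland in 874."),
    [toy_entity_service, toy_year_service],
)
print(result.annotations)
```

Modelling every service with the same signature is what makes the workflow composable: a broker can reorder or extend the chain without the individual services knowing about each other.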
and use domain-specific models.
– NER – e-entityrecognition e-service
– Geolocation – e-entityrecognition and visualisation
– Temporal Analyser – e-entityrecognition and visualisation
– Classification – e-classification e-service
– Clustering – e-clustering e-service
– Machine Translation – e-translation e-service
... In the Viking colony of Iceland, an extraordinary vernacular literature blossomed in the 12th through 14th centuries ...
... The ships were scuttled there in the 11th century, to block a navigation channel and thus protect Roskilde, then Copenhagen from seaborne assault ...
... Viking Age inscriptions have also been discovered on the Manx runestones on the Isle of Man. ...
Plain Text NIF enrichment visualisation
http://api.digitale-kuratierung.de/api/e-nlp/namedEntityRecognition?analysis=ner
http://dev.digitale-kuratierung.de/admini/pages/geolocalization.php
data is available)
list of names is available)
document level average and standard deviation (over all locations) are calculated to visualise positioning of documents on a map.
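The document-level statistic described above can be sketched as follows, assuming each recognised location has already been resolved to a (latitude, longitude) pair. The function name, the return shape and the choice of the population standard deviation are assumptions, and the coordinates in the usage line are made-up illustrative values, not data from the project.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def document_position(locations: List[Coordinate]) -> Dict[str, Coordinate]:
    """Average and spread of all locations mentioned in one document."""
    if not locations:
        raise ValueError("document contains no resolved locations")
    lats = [lat for lat, _ in locations]
    lons = [lon for _, lon in locations]
    return {
        # Centre point at which the document could be drawn on a map.
        "mean": (mean(lats), mean(lons)),
        # Spread per axis, e.g. usable as the size of the drawn marker.
        "std": (pstdev(lats), pstdev(lons)),
    }


# Illustrative values only.
print(document_position([(10.0, 20.0), (12.0, 24.0)]))
```

Averaging raw latitude and longitude is a simplification: it ignores the wrap-around at ±180° longitude and treats degrees as if they were planar distances, which is usually acceptable for collections confined to one region.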
http://api.digitale-kuratierung.de/api/e-nlp/trainModel?analysis=dict (in the suboptimal case that only a list of terms and their URIs in an
http://api.digitale-kuratierung.de/api/e-nlp/trainModel?analysis=ner (if annotated training data is available)
directly usable on new input NER model
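The endpoints listed on these slides share one base path and differ only in the operation name and the `analysis` query parameter. A small sketch of how a client might assemble them is shown below. The helper names are hypothetical; only the base URL, the operation names and the `analysis` values `ner`, `dict` and `temp` come from the slides. The HTTP method, request body and any further parameters are not given there, so the sketch deliberately stops at building the URL and sends no request.

```python
from urllib.parse import urlencode

# Base path as it appears on the slides.
BASE_URL = "http://api.digitale-kuratierung.de/api/e-nlp"


def endpoint_url(operation: str, analysis: str) -> str:
    """Build e.g. .../namedEntityRecognition?analysis=ner."""
    return f"{BASE_URL}/{operation}?{urlencode({'analysis': analysis})}"


def training_analysis(has_annotated_data: bool) -> str:
    # 'ner' if annotated training data is available, otherwise fall back to
    # a dictionary of terms and URIs ('dict'), as described on the slide.
    return "ner" if has_annotated_data else "dict"


print(endpoint_url("trainModel", training_analysis(has_annotated_data=False)))
print(endpoint_url("namedEntityRecognition", "temp"))
```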
... The ships were scuttled there in the 11th century, to block a navigation channel and thus protect Roskilde, then Copenhagen from seaborne assault ... ... Viking Age inscriptions have also been discovered on the Manx runestones on the Isle of Man. ... ... In the Viking colony of Iceland, an extraordinary vernacular literature blossomed in the 12th through 14th centuries …
[Figure: timeline visualisation, axis from 900 to 1600]
http://api.digitale-kuratierung.de/api/e-nlp/namedEntityRecognition?analysis=temp
http://dev.digitale-kuratierung.de/admini/pages/timelining.php
Plain Text NIF enrichment visualisation
collection on chronological scale.
to our focus in terms of languages (EN, DE), domain adaptability, normalisation requirements.
in a document (or, later, paragraphs or even sentences).
time, allowing positioning on a timeline.
user-specific rules.
HeidelTime, Tango, Tarsqi; many papers at LREC 2016
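To make the idea of normalising temporal expressions and placing a document on a timeline concrete, here is a deliberately tiny sketch. It is not HeidelTime and not the platform's Temporal Analyser: it only recognises English century expressions such as those in the Viking example text, and the rule that a document is positioned at the mean of the interval midpoints is an assumption about how such a position could be derived.

```python
import re
from statistics import mean
from typing import List, Tuple

# Matches "11th century" and "12th through 14th centuries".
CENTURY = re.compile(
    r"(\d{1,2})(?:st|nd|rd|th)"
    r"(?:\s+through\s+(\d{1,2})(?:st|nd|rd|th))?"
    r"\s+centur(?:y|ies)",
    re.IGNORECASE,
)


def century_ranges(text: str) -> List[Tuple[int, int]]:
    """Normalise century expressions to inclusive (first_year, last_year)."""
    ranges = []
    for match in CENTURY.finditer(text):
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        # The n-th century runs from year (n-1)*100 + 1 to year n*100.
        ranges.append(((first - 1) * 100 + 1, last * 100))
    return ranges


def timeline_position(text: str) -> float:
    """Mean of the interval midpoints; raises if no expression is found."""
    ranges = century_ranges(text)
    if not ranges:
        raise ValueError("no temporal expression found")
    return mean((start + end) / 2 for start, end in ranges)


sample = ("The ships were scuttled there in the 11th century. Literature "
          "blossomed in the 12th through 14th centuries.")
print(century_ranges(sample))
print(timeline_position(sample))
```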
document (or parts of a document) to apply domain- or topic-specific NLP processing techniques.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://dkt.dfki.de/documents/#char=0,1257>
    a nif:RFC5147String , nif:String , nif:Context ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "1257"^^xsd:nonNegativeInteger ;
    nif:documentClassificationLabel "Frühjahrsoffensive_1918"^^xsd:string ;
    nif:isString "Ceylon-Teestube B. Walther München Maximilian-Strasse 44 Gegenüber dem Königl. Hoftheater Telephon 428 München, den 26.XI.13. Von hier nach Dresden ab München 8.25 9.00 10.20 an Dresden 7.28 10.47 9.48 Sie müssen unbedingt Donnerstag hier bleiben. So können Sie doch nicht vorbeifahren. Donnerstag Abend eine interessante Uraufführung in den Kammerspielen \"unseligen Gedenkens \" Ich werde Billets dafür besorgen. […]"^^xsd:string .
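The NIF context resource in this example follows a simple pattern: the URI carries the character offsets of the whole text (`#char=0,<length>`), and the same offsets are repeated as `nif:beginIndex` and `nif:endIndex`. The sketch below generates that pattern for an arbitrary string. The function names are hypothetical, the `@prefix` declarations are assumed to be emitted elsewhere, and the classification label is left out because it is added by the classification service rather than derived from the text.

```python
def escape_turtle_literal(text: str) -> str:
    """Escape characters that may not appear raw in a "..." Turtle literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def nif_context(base_uri: str, text: str) -> str:
    """Serialise `text` as a nif:Context in Turtle (prefixes not included)."""
    # NIF offsets count Unicode code points, which is what len() returns
    # for a Python str.
    end = len(text)
    return "\n".join([
        f"<{base_uri}#char=0,{end}>",
        "    a nif:RFC5147String , nif:String , nif:Context ;",
        '    nif:beginIndex "0"^^xsd:nonNegativeInteger ;',
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;',
        f'    nif:isString "{escape_turtle_literal(text)}"^^xsd:string .',
    ])


print(nif_context("http://dkt.dfki.de/documents/", 'He said "hi"'))
```

Escaping matters in practice: the letter above contains quotation marks inside the text, and an unescaped quote would end the string literal early and make the Turtle unparseable.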
“Birds”. Knowing which documents can be grouped can be useful to split the documents into exhibition rooms.
new domains and collections; faceted search.
ARFF Input

@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@DATA
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4
4.6,3.4,1.4,0.3
5.0,3.4,1.5,0.2
4.4,2.9,1.4,0.2
4.9,3.1,1.5,0.1

JSON Output

{
  "results": {
    "numberClusters": -1,
    "clusters": {
      "cluster1": {
        "clusterId": 1,
        "entitites": {
          "entity1": { "meanValue": 3.3099999999999996, "label": "sepalwidth" },
          "entity2": { "meanValue": 1.45, "label": "petallength" },
          "entity3": { "meanValue": 0.22000000000000003, "label": "petalwidth" }
        }
      }
    }
  }
}
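The `meanValue` fields in the JSON output coincide with the plain column means of the ten ARFF rows, which suggests that all rows ended up in the single reported cluster. The sketch below checks this by parsing the ARFF input and averaging each numeric column. It performs no clustering itself, the parser handles only the simple numeric subset of ARFF used here, and `column_means` is a hypothetical helper, not part of the e-clustering service.

```python
from statistics import mean
from typing import Dict

ARFF = """@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@DATA
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4
4.6,3.4,1.4,0.3
5.0,3.4,1.5,0.2
4.4,2.9,1.4,0.2
4.9,3.1,1.5,0.1
"""


def column_means(arff_text: str) -> Dict[str, float]:
    """Mean of every attribute in a numeric-only ARFF document."""
    names, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            rows.append([float(value) for value in line.split(",")])
        elif line.upper().startswith("@ATTRIBUTE"):
            names.append(line.split()[1])
        elif line.upper().startswith("@DATA"):
            in_data = True
    # zip(*rows) turns the list of rows into a sequence of columns.
    return {name: mean(col) for name, col in zip(names, zip(*rows))}


for label, value in column_means(ARFF).items():
    print(f"{label}: {value:.2f}")
```

Rounded to two decimals, the results for `sepalwidth`, `petallength` and `petalwidth` are 3.31, 1.45 and 0.22, matching the JSON output up to floating-point noise.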
Workflow
Language & Translation Models trained
Europarl, TED
Herr Modi befindet sich auf einer fünftägigen Reise nach Japan, um die wirtschaftlichen Beziehungen mit der drittgrößten Wirtschaftsnation der Welt zu festigen.
Mr Modi is located on a five-day trip to Japan to strengthen the economic ties with the third largest economy in the world.
Named Entity Recognition Entity Linking Temporal Expressions Metadata Processing Post-Edit Retraining Example
hyper-linking of task-specific, self-contained collections.
easily accessible as a hypertext, for efficient browsing
we construct on top of the original collection
recommendation of meaningful hypertext paths.
http://dev.digitale-kuratierung.de/2ds3/index.php
<http://d-nb.info/gnd/11858071X, met, http://d-nb.info/gnd/129094722>
<http://d-nb.info/gnd/118589768, wrote, http://d-nb.info/gnd/118623230>
<http://d-nb.info/gnd/123242231, visited, http://d-nb.info/gnd/188402519>
<http://d-nb.info/gnd/118569015, said, http://d-nb.info/gnd/11947509X>
<http://d-nb.info/gnd/119173425, was, http://d-nb.info/gnd/118629867>
<http://d-nb.info/gnd/119178893, designed, http://d-nb.info/gnd/118629867>
<http://d-nb.info/gnd/118876759, love, http://d-nb.info/gnd/118629867>
<http://d-nb.info/gnd/118545892, depart, http://d-nb.info/gnd/107363569>
<http://d-nb.info/gnd/128830751, write, http://d-nb.info/gnd/118606026>
<http://d-nb.info/gnd/11858071X, protect, http://d-nb.info/gnd/39650438>
<http://d-nb.info/gnd/116713704, married, http://d-nb.info/gnd/52754181>
…
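Triples in this `<subject, relation, object>` notation can be turned into a small index that shows which entities are connected, a possible starting point for linking documents that mention the same entity. The sketch below parses the notation and groups triples by object. The regular expression and both helper names are assumptions made for illustration; the slides do not describe how the triples are stored or queried.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# One triple: "<subject, relation, object>", fields separated by commas.
TRIPLE = re.compile(r"<\s*([^,<>]+?)\s*,\s*([^,<>]+?)\s*,\s*([^,<>]+?)\s*>")


def parse_triples(text: str) -> List[Triple]:
    """Extract all (subject, relation, object) triples from the text."""
    return [match.groups() for match in TRIPLE.finditer(text)]


def index_by_object(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each object URI to the (subject, relation) pairs pointing at it."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[obj].append((subject, relation))
    return dict(index)


sample = (
    "<http://d-nb.info/gnd/119173425, was, http://d-nb.info/gnd/118629867> "
    "<http://d-nb.info/gnd/118876759, love, http://d-nb.info/gnd/118629867> "
    "<http://d-nb.info/gnd/118545892, depart, http://d-nb.info/gnd/107363569>"
)
for obj, incoming in index_by_object(parse_triples(sample)).items():
    print(obj, len(incoming))
```

In the sample, the entity `http://d-nb.info/gnd/118629867` is the object of two triples, so it would act as a hub between the passages those triples were extracted from.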
Curation Dashboard
knowledge workers handling content and knowledge.
massive need for multilingual Curation Technologies due to an ever-increasing need for multilingual content.
approaches, many more languages.
supported by