Bringing Europeana and CLARIN together: Dissemination and - - PowerPoint PPT Presentation

SLIDE 1

Bringing Europeana and CLARIN together: Dissemination and exploitation of cultural heritage data in a research infrastructure

Twan Goosen1 (CLARIN ERIC), Nuno Freire2, Clemens Neudecker3, Maria Eskevich1

1 CLARIN ERIC; 2 Europeana / INESC-ID; 3 Berlin State Library/Europeana Newspapers

Digital Infrastructures for Research (DI4R) 2017, Brussels, BE, 30 November 2017

SLIDE 2

Europeana in six bullets

  • Europeana is the European digital platform for cultural heritage that
  • seeks to enable users to search and access knowledge in all the languages of Europe, either directly via its web portals, or indirectly via third-party applications leveraging its data service
  • Europeana enables people to explore the digital resources of Europe's galleries, museums, libraries, archives and audiovisual collections
  • working with partners and allies to develop frameworks, standards, strategy and policy relevant to digital cultural heritage, and to raise funds
  • providing digital expertise and platforms for bringing cultural heritage to wider audiences
  • championing the use of digitised cultural heritage in education, research and the creative industries through partnerships and international engagement campaigns

SLIDE 3

CLARIN in seven bullets

  • CLARIN is the Common Language Resources and Technology Infrastructure
  • ESFRI ERIC status since 2012, Landmark since 2016
  • that provides easy and sustainable access for scholars in the humanities and social sciences and beyond
  • to digital language data (in written, spoken, video or multimodal form)
  • and advanced tools to discover, explore, exploit, annotate, analyse or combine them, wherever they are located
  • through a single sign-on online environment
  • and that serves as an ecosystem for knowledge sharing

SLIDE 4

CLARIN ERIC in members and centres

A consortium of:

  • 19 members: AT, BG, CZ, DE, DK, DLU, EE, FI, GR, HU, IT, LT, LV, NL, NO, PL, PT, SE, SI
  • 2 observers: FR, UK
  • >40 centres

SLIDE 5

CLARIN & Europeana partnership in context of DSI

Digital Service Infrastructure (DSI): creation of a complete, cohesive and integrated Digital Service Infrastructure

  • DSI (01.2015 – 06.2016):
    – European Research Distribution Plan
    – Assessment of relevant data sets available from The European Library (TEL)
  • DSI-2 (07.2016 – 08.2017):
    – Improvement of data quality and implementation of quality frameworks to improve metadata quality
    – Integration of Europeana data into CLARIN infrastructure
  • DSI-3 (09.2017 – 08.2018):
    – Fostering content supply by optimising Europeana data and aggregation infrastructure
    – Improving (meta-)data and content quality
    – Fostering reuse of digital cultural heritage resources by improving content distribution mechanisms
    – Maintaining an international interoperable licensing framework

SLIDE 6

Steps towards CLARIN & Europeana interoperability

1) Incorporate Europeana metadata in the VLO
2) Open up full-text resources, such as those from Europeana Newspapers, through CLARIN’s Federated Content Search (FCS) mechanism
3) Exploit CLARIN’s communication channels to increase awareness of Europeana within the community
4) Measure the impact of the dissemination of Europeana data

SLIDE 7

Metadata: access to cultural heritage

Europeana:

  • Aggregation and exploitation of (meta)data about digitised objects from very different contexts.
  • Europeana Data Model (EDM) as its model for interoperability of metadata, in line with the vision of linked open vocabularies

CLARIN:

  • Aggregation of metadata from resource providers (CLARIN centres and selected “external” parties)
  • Virtual Language Observatory (VLO) provides a uniform experience and consistent workflow.
  • Language Resource Switchboard (LRS) allows researchers to invoke tools with the selected resources directly from its user interface.

Challenge: CLARIN and Europeana do not share a common metadata model

SLIDE 8

The CLARIN data architecture: repositories

Repository at a CLARIN centre: language data, language tools, and the metadata that describes them.

  • Language data: single text or recording, corpus, lexicon, wordnet, grammar, …
  • Language tools: web application, web service, web service pipeline, stand-alone application

SLIDE 9

The CLARIN data architecture: harvesting

[Diagram: the metadata held in each centre's repository (alongside its language data and language tools) is copied into a central set of harvested metadata.]

SLIDE 10

The CLARIN data architecture: processing

SLIDE 11

The CLARIN data architecture: content search

(Federated) Content Search across the centres' repositories (language data, metadata, language tools):

(1) enter query
(2) perform local search
(3) retrieve results
(4) show aggregated results
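
The four numbered steps can be sketched in Python roughly as follows. This is a toy, in-memory illustration only: the `Endpoint` class, its `search` method and the sample texts are hypothetical stand-ins, and the sketch does not implement CLARIN's actual FCS protocol.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical stand-in for one centre's local search endpoint."""
    name: str
    texts: list

    def search(self, query):
        # (2) perform local search: naive case-insensitive substring match
        q = query.lower()
        return [(self.name, t) for t in self.texts if q in t.lower()]


def federated_search(query, endpoints):
    # (1) the entered query is sent to every endpoint in parallel
    with ThreadPoolExecutor() as pool:
        per_endpoint = list(pool.map(lambda e: e.search(query), endpoints))
    # (3) retrieve the per-endpoint results and (4) aggregate them
    return [hit for hits in per_endpoint for hit in hits]


endpoints = [
    Endpoint("centre-a", ["The Hague newspaper, 1914", "A lexicon entry"]),
    Endpoint("centre-b", ["Berlin newspaper, 1920"]),
]
results = federated_search("newspaper", endpoints)
```

Because the aggregator only depends on each endpoint exposing a `search` call, new sources can in principle be added without changing the query side, which is the property that makes a federated design attractive for third-party full-text collections.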

SLIDE 12

The CLARIN data architecture: workflows

Web Service Pipelines over the centres' language data and language tools:

(1) select input data
(2) construct pipeline
(3) execute
(4) use/analyse output data
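
As a rough illustration of the pipeline idea, the sketch below chains three local Python functions in place of remote web services. The function names and the sample sentence are invented for this example; real CLARIN pipelines invoke services over the network.

```python
from functools import reduce


def tokenize(text):
    # toy tokenizer: split on whitespace
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def count_tokens(tokens):
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts


def run_pipeline(data, steps):
    # feed each step's output into the next step
    return reduce(lambda value, step: step(value), steps, data)


input_data = "The cat saw the other cat and the dog"  # (1) select input data
pipeline = [tokenize, lowercase, count_tokens]        # (2) construct pipeline
output = run_pipeline(input_data, pipeline)           # (3) execute
most_frequent = max(output, key=output.get)           # (4) use/analyse output
```

The chain only works when each step accepts what the previous one produces, which is why shared exchange formats matter for tool output.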

SLIDE 13

Interoperability is key

  • to the exchange of metadata
  • to the exchange formats for the output of analytic tools
  • to the options for supporting comparative research

SLIDE 14

CLARIN & Europeana interoperability highlights

  • CLARIN’s ingestion pipeline (based on the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH) was extended to retrieve a set of selected collections from Europeana and to apply the conversion in the process.
  • Several infrastructure components had to be adapted to accommodate the significant increase in the amount of data to be handled and stored.

– Current status:

  • 775 Europeana data sets (e.g. Newspapers) now found in the VLO
  • 10 K are technically suitable for processing with the LRS

– Goal:

  • More records in the foreseeable future
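
For readers unfamiliar with OAI-PMH, the sketch below shows the general shape of a paged `ListRecords` harvest. It is not the CLARIN harvester: the base URL, the `edm` metadata prefix, the `newspapers` set name and the sample response are assumptions made for illustration, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, set_spec=None, token=None):
    """Build a ListRecords request URL."""
    if token:
        # a resumptionToken is sent on its own, without the other arguments
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iter(f"{OAI_NS}identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical response page, hard-coded so the sketch runs offline.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>rec-1</identifier></header></record>
    <record><header><identifier>rec-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

BASE = "https://example.org/oai"  # placeholder, not a real endpoint
first_url = list_records_url(BASE, metadata_prefix="edm", set_spec="newspapers")
ids, token = parse_page(SAMPLE)
next_url = list_records_url(BASE, token=token)
```

A harvester repeats the last two steps until a page arrives without a resumption token, converting each record as it goes.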

SLIDE 15

Metadata retrieval and conversion: OAI-PMH protocol

  • Europeana:
    – EDM-structured Europeana metadata as RDF/XML documents
  • CLARIN:
    – The harvester converts the RDF/XML metadata documents to Component Metadata (CMD) by applying an XSLT stylesheet
    – Creation of a CMD profile for EDM in the CMDI Component Registry
    – Implementation of an XSLT stylesheet that produces instances of the corresponding schema on the basis of the EDM records
    – Properties are defined as CMD elements in the order in which they appear in the EDM specification, while object order is based on relevance
    – Concept links are assigned to most components and elements
    – Implemented conversion stylesheet: the header information and resource proxies (entities representing external documents) in the resulting record are produced on the basis of a list of static XPaths in the original document
    – The record’s payload is produced mostly by means of a straightforward crosswalk in which the properties in the document are mapped to CMD components or elements of an equivalent name
  • Test harvest of 11 selected metadata sets:
    – A total of 3.2 million schema-valid records successfully retrieved and converted
    – A full harvest and import of a sample of this size takes roughly 48 hours
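
The "equivalent name" crosswalk can be illustrated with a few lines of Python. Note the swap: the conversion described above is an XSLT stylesheet, whereas this sketch uses Python's standard XML library, and both the sample record and the bare `CMD`/`Components`/`EDM` output structure are simplified assumptions (no namespaces, header or resource proxies).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EDM record.
EDM_RECORD = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:edm="http://www.europeana.eu/schemas/edm/">
  <edm:ProvidedCHO rdf:about="http://example.org/item/1">
    <dc:title>Sample newspaper issue</dc:title>
    <dc:language>de</dc:language>
  </edm:ProvidedCHO>
</rdf:RDF>"""


def local_name(tag):
    # "{namespace}title" -> "title"
    return tag.rsplit("}", 1)[-1]


def crosswalk(edm_xml):
    """Map each property of the first child of rdf:RDF to an element of
    the same local name inside a simplified CMD-like payload."""
    source = ET.fromstring(edm_xml)[0]
    cmd = ET.Element("CMD")
    payload = ET.SubElement(ET.SubElement(cmd, "Components"), "EDM")
    for prop in source:
        ET.SubElement(payload, local_name(prop.tag)).text = prop.text
    return cmd


cmd_record = crosswalk(EDM_RECORD)
print(ET.tostring(cmd_record, encoding="unicode"))
```

A name-for-name mapping like this keeps the conversion easy to audit, at the cost of carrying over any gaps in the source record unchanged.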

SLIDE 16

Processing pipeline issues

  • General lack of technical information in the provided EDM (e.g. the media type of linked resources)
  • Direct links to machine-processable resources are commonly missing
  • Limited functionality provided by the tools that are connected to the LRS (e.g. language variability, resource types, accessibility)
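
To make the first two issues concrete, the sketch below filters records down to those a tool could actually process. The record layout (plain dictionaries with `url` and `media_type` keys) and the list of accepted media types are invented for illustration and do not reflect the real EDM or LRS data structures.

```python
# Media types that the (hypothetical) downstream tools can read.
PROCESSABLE_TYPES = {"text/plain", "text/xml", "application/pdf"}


def is_processable(record):
    """A record qualifies only if at least one linked resource has both
    a direct URL and a declared media type that a tool can handle."""
    return any(
        res.get("url") and res.get("media_type") in PROCESSABLE_TYPES
        for res in record.get("resources", [])
    )


records = [
    {"id": "a", "resources": [{"url": "http://example.org/a.txt",
                               "media_type": "text/plain"}]},
    {"id": "b", "resources": [{"url": "http://example.org/b"}]},  # no media type
    {"id": "c", "resources": []},                                 # no direct link
]
suitable = [r["id"] for r in records if is_processable(r)]
```

Records "b" and "c" are excluded for exactly the reasons listed above: a missing media type and a missing direct link.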

SLIDE 17

Get in touch

www.clarin.eu
clarin@clarin.eu
https://www.europeana.eu
https://pro.europeana.eu
https://pro.europeana.eu/project/europeana-dsi-3
