SLIDE 1

Migrating The Language Archive to a new repository solution

PAUL TRILSBEEK MAX PLANCK INSTITUTE FOR PSYCHOLINGUISTICS

Photo: Gunter Senft

SLIDE 2

The Language Archive

Digital archive of language materials based at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands (one of 84 research institutes of the German Max Planck Society)

Archive has existed since the late 1990s, initially archiving language materials from our own field researchers and language acquisition researchers

Became the central archive for the DOBES endangered languages documentation programme, funded by the Volkswagen Foundation in 2000

Archive was named “The Language Archive” (TLA) in 2011, as a collaboration between 3 research funding organisations

MIGRATING THE LANGUAGE ARCHIVE TO A NEW REPOSITORY SOLUTION

SLIDE 3

Collections in The Language Archive

Holds more than 350 collections covering more than 250 different languages:

Languages from around the world studied by Max Planck Institute field linguists
First and second language acquisition corpora
Endangered languages documented for the VolkswagenStiftung DOBES programme

Spoken Dutch corpus
Sign language corpora
More than 15,000 hours of audio and video recordings
More than 1 million files
About 110 TB of data

SLIDE 4

UNESCO Memory of the World

October 2015: Selected collections of TLA added to the UNESCO Memory of the World register.

64 collections, containing materials from 102 different languages
3000 hours of video, 5000 hours of audio, 43,000 images, 17,000 written documents

Great recognition of the value of these collections for the world, as well as of the work that TLA has done in preserving them

SLIDE 5

TLA Repository

Starting in the late ‘90s, a repository solution was developed in-house, since no existing solution was around that suited our needs

Over the years this grew into a rather complex system using a variety of frameworks and paradigms, developed by many different developers → difficult and costly to maintain, not optimal in terms of user experience, partly using outdated web technology

Meanwhile, various open source repository systems had been developed that became widely used

2014: decision was made to build up a new repository using an existing open source platform as a basis, to reduce maintenance costs and to enhance the user experience

SLIDE 6

CLARIN Centre

TLA is a centre of the CLARIN European Research Infrastructure for Language Resources and Technology

Being a CLARIN “B Type” centre comes with certain technical requirements for the repository such that it is interoperable with the overall infrastructure

The Meertens Institute in Amsterdam is also a CLARIN centre and was a partner in TLA. It had similar needs for a repository, so development of the new repository solution was undertaken jointly by the Max Planck Institute and the Meertens Institute

SLIDE 7

CLARIN “B Type” Centre requirements

Support CLARIN CMDI metadata
Offer metadata via the OAI-PMH protocol
Support for Shibboleth/SAML2 authentication in order to be part of the CLARIN Service Provider Federation

Support for persistent identifiers (e.g. using Handle system)
Repository must be able to meet CoreTrustSeal requirements
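
To make the OAI-PMH requirement concrete: a harvester asks a repository for records by sending a `verb` and a `metadataPrefix` as query parameters. The sketch below only builds such a request URL. The base URL and the `cmdi` prefix value are assumptions for illustration and are not taken from the slides.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting of one set (collection).
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and prefix, for illustration only.
BASE_URL = "https://repository.example.org/oai"
print(oai_list_records_url(BASE_URL, "cmdi"))
```

A real harvester would also have to follow the `resumptionToken` that the protocol uses for paging through large result lists.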

SLIDE 8

Further requirements at TLA

Support for faceted search
Versioning support
Support for data types present in TLA
Checksum support
Support for Persistent Identifiers using the Handle system
File format verification
Elaborate access control
LDAP support for authentication
Preferably programming languages for which we had expertise
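
As an illustration of what checksum support involves, the following minimal sketch computes a file checksum in chunks and compares it with a stored value. The slides do not say which algorithm TLA uses; SHA-256 and the function names here are assumptions.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Compare a freshly computed checksum with a previously stored one."""
    return file_checksum(path, algorithm) == expected.lower()
```

Reading in chunks matters for an archive like this one, where individual video files can be far larger than available memory.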

SLIDE 9

Existing repository system comparison

Basic criteria:
Open Source
“Mature”
Widely used
Actively maintained
A number of solutions were evaluated to see whether they met our further technical requirements:

DSpace
EPrints
Fedora Commons 3.8.1/Islandora
Fedora Commons 3.8.1/Hydra (now Samvera)
Greenstone

SLIDE 10

Feature comparison

Feature                     Fedora Commons     DSpace     EPrints    Greenstone
Main progr. language        Java               Java       Perl       Java
Nested collections          Yes                Somehow    No         No
Accommodate CMDI            Yes                Yes        Yes        Yes
Support Data Types          Yes                Yes        Yes        Yes
File format verification    Islandora/Hydra    Plug-in    No         No
Checksums                   Yes                Yes        Yes        Yes
Versioning                  Yes                Yes        Yes        No
Handle PID                  Plug-in            Yes        No         No
OAI-PMH                     Yes                Yes        Yes        Yes
Access Control              Yes                Yes        No         Yes
LDAP                        Yes                Yes        Yes        No
Shibboleth                  Plug-in            Yes        Yes        No
Facet Search                Islandora/Hydra    Yes        Plug-in    Yes

SLIDE 11

Our choice: Fedora/Islandora

Both DSpace and the two Fedora-based systems met most of the technical requirements

Two main reasons for choosing a Fedora-based system over DSpace:
Deeply nested collection hierarchies could not be easily reflected in the DSpace content model (at least in 2014)

DSpace is a turnkey-style solution with integrated front-ends, which meant that modifications would likely have to be made in the DSpace core

Reasons for choosing Islandora over Hydra:
Programming language expertise present (PHP vs. Ruby)
Highly modular approach of Islandora (Drupal modules)
(even though Hydra/Sufia deposit UI was better suited)

SLIDE 12

Islandora performance testing

Tools were developed to transform existing collections with CMDI metadata into FOXML that could be ingested into Fedora

All 1 million objects were ingested into a Fedora/Islandora instance using Fedora batch ingest scripts

Most performance bottlenecks could be solved by making use of the (optional) Solr index, rather than the Mulgara triple store

All data was ingested as “Externally managed” datastreams in Fedora. This made ingest faster and gave us more control over file system locations

Conclusion was that the Fedora/Islandora combination was fast enough for the size of our repository
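
To illustrate the idea of ingesting files by reference, here is a minimal sketch that writes a FOXML object whose datastream points at a file location instead of embedding the content. It is not the TLA conversion tool. It assumes that the “Externally managed” datastreams on the slide correspond to FOXML control group `E`, and the datastream ID `OBJ`, the PID, and the file URL are hypothetical.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODEL_NS = "info:fedora/fedora-system:def/model#"
ET.register_namespace("foxml", FOXML_NS)


def q(tag):
    """Qualify a tag name with the FOXML namespace."""
    return "{%s}%s" % (FOXML_NS, tag)


def build_foxml(pid, label, file_url, mime_type):
    """Return FOXML 1.1 text with one datastream that references a file by URL."""
    obj = ET.Element(q("digitalObject"), {"VERSION": "1.1", "PID": pid})
    props = ET.SubElement(obj, q("objectProperties"))
    ET.SubElement(props, q("property"),
                  {"NAME": MODEL_NS + "state", "VALUE": "Active"})
    ET.SubElement(props, q("property"),
                  {"NAME": MODEL_NS + "label", "VALUE": label})
    # Assumption: control group "E" means the content stays where it is
    # and Fedora only stores the reference.
    stream = ET.SubElement(obj, q("datastream"),
                           {"ID": "OBJ", "STATE": "A",
                            "CONTROL_GROUP": "E", "VERSIONABLE": "true"})
    version = ET.SubElement(stream, q("datastreamVersion"),
                            {"ID": "OBJ.0", "LABEL": label,
                             "MIMETYPE": mime_type})
    ET.SubElement(version, q("contentLocation"),
                  {"TYPE": "URL", "REF": file_url})
    return ET.tostring(obj, encoding="unicode")


# Hypothetical PID and file location.
print(build_foxml("demo:1", "Recording", "file:///data/rec.wav", "audio/x-wav"))
```

Because only a reference is stored, ingest does not have to copy the file content, which is consistent with the faster ingest reported above.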

SLIDE 13

Deposit front- and back-end

A custom deposit front- and back-end was needed, since we needed a more controlled ingest workflow as well as a user interface that was easy to use for self-deposit by our researchers

“Doorkeeper” ingest workflow engine was developed to perform a customizable set of actions before data is eventually ingested into Fedora, e.g.:

Check SIP completeness
Check whether ingested files are of accepted types and conform to defined criteria (XPath rules on FITS output)
Check CMDI metadata validity and transform to DC + OLAC
Issue Handle PIDs and add them to metadata
Update parent of ingested object (CMDI has top-down hierarchy)
Move files to persistent storage
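
The general shape of such a workflow, a configurable list of actions that each either pass the SIP on or reject it, can be sketched as follows. This is not the Doorkeeper code; the SIP dictionary layout, the action names, and the placeholder identifier are all hypothetical.

```python
class IngestError(Exception):
    """Raised when a workflow action rejects the SIP."""


def check_completeness(sip):
    """Reject the SIP if any expected file is missing."""
    missing = [name for name in sip["expected_files"] if name not in sip["files"]]
    if missing:
        raise IngestError("missing files: " + ", ".join(missing))
    return sip


def check_file_types(sip):
    """Reject the SIP if any file has a type outside the accepted set."""
    for name, mime_type in sip["files"].items():
        if mime_type not in sip["accepted_types"]:
            raise IngestError(name + " has unaccepted type " + mime_type)
    return sip


def assign_identifier(sip):
    """Attach a placeholder identifier; a real system would mint a Handle."""
    updated = dict(sip)
    updated["pid"] = "hdl:EXAMPLE/" + sip["name"]
    return updated


def run_workflow(sip, actions):
    """Apply each action in order; the first IngestError stops the run."""
    for action in actions:
        sip = action(sip)
    return sip


# The set and order of actions is the customizable part.
ACTIONS = [check_completeness, check_file_types, assign_identifier]
```

Keeping each check as a separate function makes it straightforward to add, remove, or reorder steps per deployment, which is the property the slide describes as customizable.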

SLIDE 14

Deposit front-end

Connects to local network shares where internal users store their research data

Connects to a Nextcloud self-hosted cloud instance running on the same server for external depositors to upload data

Metadata can be uploaded as XML files or entered using web forms
“Validation” step checks whether SIP is complete and conforms to defined archival standards

Existing objects in the repository can be modified or amended (versions will be created)

SLIDE 15

Deposit front-end

SLIDE 16

Deposit front-end

SLIDE 17

Migration: January/February 2018

Migration of over 1 million objects comprising over 100 TB of data took a bit more than a week

Production setup using Blazegraph triple store rather than Mulgara
All Object datastreams “Externally managed”
Performance on 6-core VM with 48 GB of RAM overall very good
Spring 2018: soft launch of deposit UI with selected researchers
October 2018: deposit UI made available to all depositors
CoreTrustSeal certified in January 2019
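
For a sense of scale, the migration figures above can be turned into a rough average ingest rate. The slide gives only “over 1 million objects” and “a bit more than a week”, so the exact object count and the 7 and 8 day durations used here are assumptions.

```python
def average_rate(total, days):
    """Average number of items per second over a period of whole days."""
    return total / (days * 24 * 60 * 60)


objects = 1_000_000  # assumption: the slide says "over 1 million"
for days in (7, 8):  # assumption: "a bit more than a week"
    print(days, "days:", round(average_rate(objects, days), 2), "objects/s")
```

Under these assumptions the average works out to between roughly 1.4 and 1.7 objects per second, sustained around the clock.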

SLIDE 18

Repository Browse/Search/Explore

SLIDE 19

Repository Browse/Search/Explore

SLIDE 20

Repository Browse/Search/Explore

SLIDE 21

Remaining performance issues

Changing access permissions on large collections takes a very long time

Occasional performance issues when requests come in at a fast pace; not sure yet what causes this

Slow loading of “compound objects” with many children (> 100), since SPARQL is still used there instead of Solr
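
The difference between the two lookup routes for the children of a compound object can be sketched by building both queries side by side. The `isConstituentOf` relationship and the Solr field name below are assumptions about a typical Islandora configuration, not details confirmed by the slides, and the PID is hypothetical.

```python
from urllib.parse import urlencode

RELS_EXT = "info:fedora/fedora-system:def/relations-external#"


def children_sparql(parent_pid):
    """SPARQL query against the triple store for children of a compound object."""
    return (
        "SELECT ?child WHERE { "
        "?child <" + RELS_EXT + "isConstituentOf> "
        "<info:fedora/" + parent_pid + "> }"
    )


def children_solr_params(parent_pid,
                         field="RELS_EXT_isConstituentOf_uri_ms",
                         rows=1000):
    """Query string for the equivalent lookup in the Solr index.

    The field name is an assumption; it depends on the local index configuration.
    """
    query = field + ':"info:fedora/' + parent_pid + '"'
    return urlencode({"q": query, "fl": "PID", "rows": rows})


# Hypothetical parent PID.
print(children_sparql("demo:1"))
print(children_solr_params("demo:1"))
```

Both queries ask the same question; the point of moving to Solr would be that the answer comes from the search index rather than the triple store.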

SLIDE 22

Conclusions/Outlook

Overall very pleased with our Islandora-based setup, performance-wise and in terms of flexibility of the modular setup

Depositors find the new deposit interface much easier to use than the previous one

Further improvements planned:
Solve the remaining performance bottlenecks where possible
Develop solution pack for transcribed/annotated media
Tool for batch updates of metadata
Migration to Islandora 8/9, Fedora 5/6? Will require some additional development to adapt our ingest tools and custom modules, but we believe that this is a manageable task. Will probably wait at least until Fedora 6 is out, with the Oxford Common File Layout storage layer.

SLIDE 23

Acknowledgements

Developers: Menzo Windhouwer, Daniel von Rhein, Ibrahim Abdullah, Pavithra Srinivasa

SLIDE 24

Migrating The Language Archive to a new repository solution

PAUL TRILSBEEK MAX PLANCK INSTITUTE FOR PSYCHOLINGUISTICS

archive.mpi.nl
github.com/TheLanguageArchive
github.com/TLA-FLAT
Paul.Trilsbeek@mpi.nl

Questions?