Migrating The Language Archive to a new repository solution

SLIDE 1

Migrating The Language Archive to a new repository solution

PAUL TRILSBEEK
MAX PLANCK INSTITUTE FOR PSYCHOLINGUISTICS

Photo: Gunter Senft

SLIDE 2

The Language Archive

  • Digital archive of language materials based at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands (one of 84 research institutes of the German Max Planck Society)
  • The archive has existed since the late 1990s, initially archiving language materials from our own field researchers and language acquisition researchers
  • Became the central archive for the DOBES endangered languages documentation programme, funded by the Volkswagen Foundation, in 2000
  • Was named “The Language Archive” (TLA) in 2011, as a collaboration between three research funding organisations

MIGRATING THE LANGUAGE ARCHIVE TO A NEW REPOSITORY SOLUTION

SLIDE 3

Collections in The Language Archive

  • Holds more than 350 collections covering more than 250 different languages:
      • Languages from around the world studied by Max Planck Institute field linguists
      • First and second language acquisition corpora
      • Endangered languages documented for the VolkswagenStiftung DOBES programme
      • Spoken Dutch corpus
      • Sign language corpora
  • More than 15,000 hours of audio and video recordings
  • More than 1 million files
  • About 110 TB of data

SLIDE 4

UNESCO Memory of the World

  • October 2015: Selected collections of TLA added to the UNESCO Memory of the World register.
  • 64 collections, containing materials from 102 different languages
  • 3,000 hours of video, 5,000 hours of audio, 43,000 images, 17,000 written documents
  • Great recognition of the value of these collections for the world, as well as of the work that TLA has done in preserving them

SLIDE 5

TLA Repository

  • Starting in the late 1990s, a repository solution was developed in-house, since no existing solution suited our needs
  • Over the years this grew into a rather complex system using a variety of frameworks and paradigms, developed by many different developers → difficult and costly to maintain, not optimal in terms of user experience, partly using outdated web technology
  • Meanwhile, various open source repository systems had been developed and had become widely used
  • 2014: decision was made to build a new repository using an existing open source platform as a basis, to reduce maintenance costs and to enhance the user experience

SLIDE 6

CLARIN Centre

  • TLA is a centre of the CLARIN European Research Infrastructure for Language Resources and Technology
  • Being a CLARIN “B Type” centre comes with certain technical requirements for the repository, so that it is interoperable with the overall infrastructure
  • The Meertens Institute in Amsterdam is also a CLARIN centre and was a partner in TLA. It had similar needs for a repository, so development of the new repository solution was undertaken jointly by the Max Planck Institute and the Meertens Institute

SLIDE 7

CLARIN “B Type” Centre requirements

  • Support CLARIN CMDI metadata
  • Offer metadata via the OAI-PMH protocol
  • Support for Shibboleth/SAML2 authentication in order to be part of the CLARIN Service Provider Federation

  • Support for persistent identifiers (e.g. using Handle system)
  • Repository must be able to meet CoreTrustSeal requirements
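
To make the OAI-PMH requirement concrete, here is a minimal Python sketch of how a harvester might page through a repository's records. The endpoint URL, the `cmdi` metadata prefix and the sample response are assumptions for illustration only; they are not taken from the TLA setup.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hypothetical endpoint and hand-written sample response (assumptions).
BASE_URL = "https://repository.example.org/oai"
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url(BASE_URL, metadata_prefix="cmdi")
follow_up_url = list_records_url(BASE_URL, resumption_token=next_token(SAMPLE))
print(first_url)
print(follow_up_url)
```

A real harvester would fetch each URL and repeat until `next_token` returns `None`; the network call is left out here so the sketch runs offline.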

SLIDE 8

Further requirements at TLA

  • Support for faceted search
  • Versioning support
  • Support for data types present in TLA
  • Checksum support
  • Support for Persistent Identifiers using the Handle system
  • File format verification
  • Elaborate access control
  • LDAP support for authentication
  • Preferably programming languages for which we had expertise
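
As an illustration of what checksum support involves, the sketch below computes a file digest in chunks and compares it with a stored value. The choice of SHA-256 and the function names are assumptions; the slides do not say which algorithm or tooling TLA uses.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a hex digest chunk by chunk, so large media files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return True if the file still matches the checksum recorded earlier."""
    return file_checksum(path, algorithm) == expected.lower()


# Demo on a temporary file: record a checksum, then detect a later change.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example recording data")
    demo_path = tmp.name

stored = file_checksum(demo_path)
unchanged = verify_checksum(demo_path, stored)

with open(demo_path, "ab") as handle:
    handle.write(b"!")  # simulate corruption or modification
changed = verify_checksum(demo_path, stored)

os.remove(demo_path)
print(unchanged, changed)
```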

SLIDE 9

Existing repository system comparison

  • Basic criteria:
      • Open Source
      • “Mature”
      • Widely used
      • Actively maintained
  • A number of solutions were evaluated to see whether they met our further technical requirements:
      • DSpace
      • EPrints
      • Fedora Commons 3.8.1/Islandora
      • Fedora Commons 3.8.1/Hydra (now Samvera)
      • Greenstone

SLIDE 10

Feature comparison

Feature                   Fedora Commons    DSpace    EPrints   Greenstone
Main progr. language      Java              Java      Perl      Java
Nested collections        Yes               Somehow   No        No
Accommodate CMDI          Yes               Yes       Yes       Yes
Support Data Types        Yes               Yes       Yes       Yes
File format verification  Islandora/Hydra   Plug-in   No        No
Checksums                 Yes               Yes       Yes       Yes
Versioning                Yes               Yes       Yes       No
Handle PID                Plug-in           Yes       No        No
OAI-PMH                   Yes               Yes       Yes       Yes
Access Control            Yes               Yes       No        Yes
LDAP                      Yes               Yes       Yes       No
Shibboleth                Plug-in           Yes       Yes       No
Facet Search              Islandora/Hydra   Yes       Plug-in   Yes

SLIDE 11

Our choice: Fedora/Islandora

  • Both DSpace and the two Fedora-based systems met most of the technical requirements
  • Two main reasons for choosing a Fedora-based system over DSpace:
      • Deeply nested collection hierarchies could not easily be reflected in the DSpace content model (at least in 2014)
      • DSpace is a turnkey-style solution with integrated front-ends, which meant that modifications would likely have to be made in the DSpace core
  • Reasons for choosing Islandora over Hydra:
      • Programming language expertise present (PHP vs. Ruby)
      • Highly modular approach of Islandora (Drupal modules)
      • (even though the Hydra/Sufia deposit UI was better suited)

SLIDE 12

Islandora performance testing

  • Tools were developed to transform existing collections with CMDI metadata into FOXML that could be ingested into Fedora
  • All 1 million objects were ingested into a Fedora/Islandora instance using Fedora batch ingest scripts
  • Most performance bottlenecks could be solved by making use of the (optional) Solr index rather than the Mulgara triple store
  • All data was ingested as “Externally managed” datastreams in Fedora. This made ingest faster and gave us more control over file system locations
  • Conclusion: the Fedora/Islandora combination was fast enough for the size of our repository
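
The following Python sketch shows roughly what a FOXML object with an external datastream could look like. It is not the TLA conversion tool: the PID, label, datastream ID and file URL are hypothetical, and it assumes that the “Externally managed” datastreams mentioned above correspond to control group `E` (externally referenced content) in FOXML 1.1.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODEL_NS = "info:fedora/fedora-system:def/model#"
ET.register_namespace("foxml", FOXML_NS)


def q(tag):
    """Qualify a tag name with the FOXML namespace."""
    return "{%s}%s" % (FOXML_NS, tag)


def build_foxml(pid, label, ds_id, mime_type, file_url):
    """Build a minimal FOXML document whose content stays outside Fedora's own storage."""
    obj = ET.Element(q("digitalObject"), {"VERSION": "1.1", "PID": pid})
    props = ET.SubElement(obj, q("objectProperties"))
    ET.SubElement(props, q("property"), {"NAME": MODEL_NS + "state", "VALUE": "Active"})
    ET.SubElement(props, q("property"), {"NAME": MODEL_NS + "label", "VALUE": label})
    # CONTROL_GROUP "E": Fedora stores only a reference to the file (assumption, see lead-in).
    ds = ET.SubElement(
        obj,
        q("datastream"),
        {"ID": ds_id, "STATE": "A", "CONTROL_GROUP": "E", "VERSIONABLE": "true"},
    )
    version = ET.SubElement(
        ds,
        q("datastreamVersion"),
        {"ID": ds_id + ".0", "LABEL": label, "MIMETYPE": mime_type},
    )
    ET.SubElement(version, q("contentLocation"), {"TYPE": "URL", "REF": file_url})
    return ET.tostring(obj, encoding="unicode")


# Hypothetical object: all values below are made up for the example.
foxml = build_foxml(
    pid="example:1",
    label="Example recording",
    ds_id="OBJ",
    mime_type="audio/x-wav",
    file_url="file:///archive/example/recording.wav",
)
print(foxml)
```

Because the datastream only points at the file, ingest does not have to copy the content, which is consistent with the faster ingest and the control over file system locations noted on this slide.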

SLIDE 13

Deposit front- and back-end

  • A custom deposit front- and back-end was needed, since we needed a more controlled ingest workflow as well as a user interface that was easy to use for self-deposit by our researchers
  • “Doorkeeper” ingest workflow engine was developed to perform a customizable set of actions before data is eventually ingested into Fedora, e.g.:
      • Check SIP completeness
      • Check whether ingested files are of accepted types and conform to defined criteria (XPath rules on FITS output)
      • Check CMDI metadata validity and transform to DC + OLAC
      • Issue Handle PIDs and add them to metadata
      • Update parent of ingested object (CMDI has top-down hierarchy)
      • Move files to persistent storage
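
The idea of a customizable sequence of pre-ingest actions can be sketched in Python as follows. This is not the Doorkeeper code: the function names, the SIP dictionary layout, the accepted-type rules and the trimmed FITS reports are all hypothetical, and only two of the listed actions are shown.

```python
import xml.etree.ElementTree as ET

FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Hypothetical acceptance rules: a file passes if any XPath matches its FITS report.
ACCEPTED_TYPE_RULES = [
    ".//fits:identity[@mimetype='audio/x-wav']",
    ".//fits:identity[@mimetype='application/pdf']",
]


def fits_report(mimetype):
    """Hand-written, heavily trimmed stand-in for a real FITS report."""
    return (
        '<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">'
        '<identification><identity mimetype="%s"/></identification></fits>' % mimetype
    )


def check_sip_completeness(sip):
    """Every file declared in the metadata must actually be present in the SIP."""
    missing = sorted(set(sip["declared_files"]) - set(sip["fits_reports"]))
    if missing:
        return False, "missing files: " + ", ".join(missing)
    return True, "all declared files present"


def check_file_types(sip):
    """Apply the XPath rules to each file's FITS report."""
    for name, report in sorted(sip["fits_reports"].items()):
        root = ET.fromstring(report)
        if not any(root.findall(rule, FITS_NS) for rule in ACCEPTED_TYPE_RULES):
            return False, name + " is not of an accepted type"
    return True, "all files are of accepted types"


def run_actions(sip, actions):
    """Run the actions in order and stop at the first failure, before any ingest."""
    log = []
    for action in actions:
        ok, message = action(sip)
        log.append((action.__name__, ok, message))
        if not ok:
            return False, log
    return True, log


ACTIONS = [check_sip_completeness, check_file_types]

good_sip = {
    "declared_files": ["story.wav", "notes.pdf"],
    "fits_reports": {
        "story.wav": fits_report("audio/x-wav"),
        "notes.pdf": fits_report("application/pdf"),
    },
}
bad_sip = {
    "declared_files": ["tool.exe"],
    "fits_reports": {"tool.exe": fits_report("application/x-msdownload")},
}

good_ok, good_log = run_actions(good_sip, ACTIONS)
bad_ok, bad_log = run_actions(bad_sip, ACTIONS)
print(good_ok, bad_ok)
```

Keeping the actions in a plain list is one simple way to make the set customizable: further steps such as PID assignment or moving files to persistent storage would be appended as additional functions with the same signature.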

SLIDE 14

Deposit front-end

  • Connects to local network shares where internal users store their research data
  • Connects to a self-hosted Nextcloud instance running on the same server, where external depositors can upload data
  • Metadata can be uploaded as XML files or entered using web forms
  • “Validation” step checks whether the SIP is complete and conforms to defined archival standards
  • Existing objects in the repository can be modified or amended (versions will be created)

SLIDE 15

Deposit front-end

SLIDE 16

Deposit front-end

SLIDE 17

Migration: January/February 2018

  • Migration of over 1 million objects comprising over 100 TB of data took a bit more than a week

  • Production setup using Blazegraph triple store rather than Mulgara
  • All Object datastreams “Externally managed”
  • Performance on 6-core VM with 48 GB of RAM overall very good
  • Spring 2018: soft launch of deposit UI with selected researchers
  • October 2018: deposit UI made available to all depositors
  • CoreTrustSeal certified in January 2019

SLIDE 18

Repository Browse/Search/Explore

SLIDE 19

Repository Browse/Search/Explore

SLIDE 20

Repository Browse/Search/Explore

SLIDE 21

Remaining performance issues

  • Changing access permissions on large collections takes a very long time
  • Occasional performance issues when requests come in at a fast pace; not sure yet what causes this
  • Slow loading of “compound objects” with many children (> 100), since SPARQL is still used there instead of Solr
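
To illustrate the difference in the last point, the sketch below builds both kinds of lookup for the children of a compound object: a SPARQL query for the triple store and the equivalent Solr request. The Solr base URL, the PID and the Solr field name `RELS_EXT_isConstituentOf_uri_ms` are assumptions (the field name depends on how the index is configured), and the code only constructs the queries without sending them.

```python
from urllib.parse import urlencode

IS_CONSTITUENT_OF = "info:fedora/fedora-system:def/relations-external#isConstituentOf"


def sparql_children_query(parent_pid):
    """SPARQL query asking the triple store for all children of a compound object."""
    return "SELECT ?child WHERE { ?child <%s> <info:fedora/%s> }" % (
        IS_CONSTITUENT_OF,
        parent_pid,
    )


def solr_children_url(solr_base, parent_pid, rows=1000):
    """Solr select URL for the same lookup, using an assumed index field name."""
    params = {
        "q": 'RELS_EXT_isConstituentOf_uri_ms:"info:fedora/%s"' % parent_pid,
        "fl": "PID",
        "rows": rows,
        "wt": "json",
    }
    return solr_base + "/select?" + urlencode(params)


# Hypothetical Solr location and object PID.
sparql = sparql_children_query("example:42")
solr_url = solr_children_url("http://localhost:8080/solr", "example:42")
print(sparql)
print(solr_url)
```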

SLIDE 22

Conclusions/Outlook

  • Overall very pleased with our Islandora-based setup, both performance-wise and in terms of the flexibility of the modular setup
  • Depositors find the new deposit interface much easier to use than the previous one
  • Further improvements planned:
      • Solve the remaining performance bottlenecks where possible
      • Develop solution pack for transcribed/annotated media
      • Tool for batch updates of metadata
  • Migration to Islandora 8/9, Fedora 5/6? This will require some additional development to adapt our ingest tools and custom modules, but we believe that this is a manageable task. We will probably wait at least until Fedora 6 is out, with the Oxford Common File Layout storage layer.

SLIDE 23

Acknowledgements

Developers: Menzo Windhouwer, Daniel von Rhein, Ibrahim Abdullah, Pavithra Srinivasa

SLIDE 24

archive.mpi.nl
github.com/TheLanguageArchive
github.com/TLA-FLAT
Paul.Trilsbeek@mpi.nl

Questions?