The role of RDF in records management and archiving, Graham Moore



SLIDE 1

The role of RDF in records management and archiving

Graham Moore, Head of Product Development, SESAM @gra_moore, graham.moore@sesam.io

SLIDE 2

Why are we here?

I had been working on records management and archive-related projects and, separately, on Semantic Web technologies, and started looking at MoReq2010 and RDF. I told Gunnar about some of this work. He was interested and initiated a small project so that I could focus more deeply on the issues, widening the scope to include Noark5 and continuous archiving. This is a summary of the report produced by that project.

SLIDE 3

Research Goals

Look at the role of RDF and related standards as the basis for:

– Standardised descriptions of RM and Archive systems data structures
– The definition of semantics
– As a tool for interchange
– The provision of continuous archiving
– Unifying Noark with MoReq2010
– Demonstrating data driven standards definitions

SLIDE 4

Agenda

Introduction to me

What is an I.T. standard?

Research Goals

Introduction to RDF, RDFCL, SOF

Modelling Noark5 and MoReq2010 in RDF

Noark5 as MoReq2010

Using RDF and SDShare for continuous archiving

Conclusion and Future work

SLIDE 5

Introduction to me

[Timeline. Years marked on the time axis: 1996, 1999, 2007, 2010, 2011, 2012. Entries, in slide order:]

Southampton University: using SGML for interchange in an OO system.

Met the authors of SGML and HyTime and got brainwashed to work on Topic Maps.

More than 10 years of ISO meetings producing XTM 1.0, 1.1, TMCL, TMDM, TMSyntax.

Created SDShare with Makx Dekkers and Marc Kuster for CEN.

With LMG, updated SDShare for better RDF. Created W3C group.

Work on OData to SPARQL.

TMAPI, OData working group in OASIS, RDF Net API.

Implementing software based on standards.

SLIDE 6

Work Themes

Standards around generalised data

– Structures, Semantics, Interchange, APIs

The mentality of standardisation

– What makes a good standard?
– What should be standardised?
– What is in and what is out?
– When to standardise?

SLIDE 7

What do I think I have learnt?

Less is more.

Ambiguity is very bad.

Extension through data is better than making the spec bigger.

Conformance is tricky.

Building on top of others' standards really helps.

SLIDE 8

How do I see Noark and MoReq2010?

As I.T. standards upon which there is a great weight of expectation and responsibility.

We want these standards to capture structures and operations in ways that support the data and processes of records and archiving.

These structures and operations should be constrained enough for conformity and reliability, but should not place unnecessary restrictions on implementations or domains of application.

SLIDE 9

What makes an I.T. Standard?

SLIDE 10

Conformance Touch Points

Data Models

– Central, critical piece of any standard.
– But you cannot test conformance to a data model.
– Provides guidance to implementers and conveys intent; if it's wrong, the next two things are wrong.

Interchange Syntax

– The way in which different systems can communicate. Standardising this is about unambiguously defining the way syntactic elements map to data model constructs.
– Conformance is about detecting invalid syntactic and semantic structures.

API

– If it walks like a duck, quacks like a duck, and looks like a duck, it's a duck.
– If you declare an API and make clear the semantics of each operation, any system that adheres to those expectations is compliant.
– Used either just for conformance testing of a solution or to allow interchange of implementations.

UI, Search, Reporting

– These should never be in a standard.

SLIDE 11

Evolution of software

[Diagram. Labels: Windows Desktop Application (three times), database (twice), API, Phone, Tablet, PC]

SLIDE 12

Records Management and Archive Systems Meta-Architecture

[Diagram. Labels: Archive Aggregator / RMS; Records Management System; Phone, Tablet, PC; API; X-Format; Protocol of submission; Conformance Suite; Model Validation (twice)]

SLIDE 13

Research Goals

Look at the role of RDF and related standards as the basis for:

– The descriptions of RM and Archive systems data structures
– The definition of semantics
– As a tool for interchange
– The provision of continuous archiving
– Unifying Noark and MoReq2010

  • Seriously, why have two standards?

– Demonstrating data driven standards definitions

SLIDE 14

Scope

Investigative work

Explore a wide and extreme scope

Help define what work items are interesting in the future

SLIDE 15

BIG SCOPE

80 hours for Research, Report & Presentation

SLIDE 16

Introduction to RDF

SLIDE 17

Semantic technologies

A family of standards from the W3C

– same organization that does HTML, CSS, ...

Goal: to enable the semantic web

– interchange of structured data for machines
– not just documents for humans

Grounded on open world assumptions

– Anyone can 'express' anything and anyone should be able to process and understand it.

SLIDE 18

RDF

The core standard

– defines the data model
– all the other parts build on this

Implemented in database products

– known as triple stores
– these often replace RDBMS databases when working with semantic technologies

An unusual data model

– schemaless
– graph database
– everything based on triples
– all objects identified by URIs

SLIDE 19

RDF

Very powerful data model with extensive use of URIs

<subject> <predicate> <object>

Or

<thing> <property-type> <value>

SLIDE 20

URIs

http://sesam.io/standards/moreq2010/aggregation

Persistent, globally unique identifiers for Things, Concepts

Particularly useful for types, property types and controlled vocabularies.

Helps ensure that everyone uses the same identifier for the same thing.

Minted by authorities

Address as well as identity

SLIDE 21

SLIDE 22

How RDF works

'PERSON' table

ID  NAME                 EMAIL
1   Graham Moore         graham.moore@
2   Lars Marius Garshol  larsga@bouvet.no
3   Axel Borge           axel.borge@bouvet

RDF-ized data

SUBJECT                      PROPERTY  OBJECT
http://example.com/person/1  rdf:type  ex:Person
http://example.com/person/1  ex:name   Graham Moore
http://example.com/person/1  ex:email  graham.moore@
http://example.com/person/2  rdf:type  ex:Person
http://example.com/person/2  ex:name   Lars Marius Garshol
...                          ...       ...
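
The table-to-triples mapping on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the namespace constants `EX` and `BASE` and the function `row_to_triples` are hypothetical names chosen for the example, and the first row's email is left empty because it is truncated on the slide.

```python
# Sketch only: EX and BASE are example namespaces, not part of any standard.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.com/schema/"
BASE = "http://example.com/person/"


def row_to_triples(row):
    """Turn one PERSON row (a dict) into (subject, property, object) triples."""
    subject = BASE + str(row["id"])                 # the primary key becomes the URI
    triples = [(subject, RDF_TYPE, EX + "Person")]  # the table becomes the type
    for column, value in row.items():
        if column == "id" or value is None:         # key is in the URI; skip empty cells
            continue
        triples.append((subject, EX + column, value))  # each column becomes a property
    return triples


rows = [
    {"id": 1, "name": "Graham Moore", "email": None},
    {"id": 2, "name": "Lars Marius Garshol", "email": "larsga@bouvet.no"},
]
triples = [t for row in rows for t in row_to_triples(row)]
for s, p, o in triples:
    print(s, p, o)
```

Each cell becomes its own statement, which is why a missing value simply produces no triple rather than a NULL.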

SLIDE 23

RDF is a graph

[Diagram: a graph centred on the node ../1, with these labelled edges]

../1         foaf:name        Graham Moore
../1         foaf:nick        gra
../1         ex:works-for     Bouvet
../1         rdf:type         foaf:Person
foaf:Person  rdfs:subClassOf  foaf:Agent

SLIDE 24

Two more things

Datatypes

– values can be typed with XML Schema datatypes
– means values can be stored more efficiently
– numbers sort as numbers
– user-defined data types also possible

Graphs

– the database is divided into graphs
– each graph is a set of triples identified by a URI
– very, very useful for subdividing the database
– we use one graph per data source

SLIDE 25

RDF Merging

[Diagram: graphs with nodes labelled 1, 5, 3 and 1, 6, shown twice]

SLIDE 26

RDF Tracking Data Origins

[Diagram: an RDF Store holding Graph 1 and Graph 2, with nodes labelled 1, 5, 3 and 1, 6; query shown: "Show me thing 1 =>"]
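
A minimal sketch of the two ideas on the merging and data-origin slides, assuming a toy in-memory store. The class `GraphStore`, the `urn:` identifiers and the `linksTo` property are all hypothetical; the slides only show numbered nodes, so the edges here are invented for illustration.

```python
from collections import defaultdict


class GraphStore:
    """Toy quad store: each named graph is a set of (s, p, o) triples."""

    def __init__(self):
        self.graphs = defaultdict(set)

    def add(self, graph_uri, triple):
        self.graphs[graph_uri].add(triple)

    def merged(self):
        """Merging is set union: identical triples collapse, shared URIs join up."""
        return set().union(*self.graphs.values())

    def describe(self, subject):
        """'Show me thing X': every statement about X, with the graph it came from."""
        return sorted(
            (p, o, g)
            for g, triples in self.graphs.items()
            for (s, p, o) in triples
            if s == subject
        )


store = GraphStore()
store.add("urn:graph:1", ("urn:thing:1", "urn:p:linksTo", "urn:thing:5"))
store.add("urn:graph:1", ("urn:thing:1", "urn:p:linksTo", "urn:thing:3"))
store.add("urn:graph:2", ("urn:thing:1", "urn:p:linksTo", "urn:thing:6"))
store.add("urn:graph:2", ("urn:thing:1", "urn:p:linksTo", "urn:thing:3"))

print(len(store.merged()))
for row in store.describe("urn:thing:1"):
    print(row)
```

Because both sources use the same URI for thing 1, no mapping step is needed to merge them, and keeping one graph per source preserves where each statement came from.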

SLIDE 27

Many Serialisation formats

N-Triples

– Simple line based format

RDF/XML

– RDF in XML
– the original format
– Truly horrific

  • Slowed RDF adoption

Turtle

– nice, human-readable text format
– a bit more work to parse

JSON-LD

– RDF in JSON

SLIDE 28

Turtle example

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

_:lmg a foaf:Person ;
    foaf:name "Graham Moore" ;
    foaf:nick "gra" ;
    ex:works-for _:bouvet .

foaf:Person a rdfs:Class ;
    rdfs:label "Person" ;
    rdfs:subClassOf foaf:Agent .

SLIDE 29

Benefit of RDF Serialisation

In XML serialisation systems, the XML is nearly never the internal operational model.

So a lot of effort is spent describing how the model maps to the syntactic representation.

With RDF, the serialisation is a direct serialisation of the data being used operationally.

SLIDE 30

SPARQL

The RDF query language

– lots of implementations

Standardized protocol

– HTTP-based and very simple
– XML, JSON, RDF format for results

Very like SQL in some ways

– totally unlike in others

Also has an update language

– update, insert, delete, clear, ...

SLIDE 31

All persons sorted by name

prefix foaf: <http://xmlns.com/foaf/0.1/>

select ?person ?name
where {
  ?person a foaf:Person .
  ?person foaf:name ?name .
}
order by ?name

SLIDE 32

Why RDF for RMS / Archive standards and systems?

The data model is powerful and flexible.

Need to be able to combine core models with data from any domain.

URIs provide a strong basis for commonly agreed identifiers in core and domain areas, e.g. what is the identifier for a vehicle registration date?

All of this stuff is in data, not in prose.

Interchange is for free.

Merging is for free.

Types and Property Types are in DATA: extensibility is built in.

SLIDE 33

Research Activities and Enablers

Noark5 Model and Constraints as RDF; MoReq2010 Model and Constraints as RDF (enabler: RDF Constraint Language)

Demonstrate how operations can be defined in terms of the RDF models (enabler: Semantic Operations Framework)

Noark5 as MoReq2010 (enabler: RDF Constraint Language)

Noark5 Domain Extension as RDF (enabler: RDF Constraint Language)

SLIDE 34

RDF Constraint Language

The RDF family of standards includes RDFS and OWL.

These are designed to support an open world model that is primarily about inference.

SLIDE 35

Open World Inference Assumptions

Rule: the value of dc:creator must be a person. Under open world inference it follows that anything given as a dc:creator is inferred to be a person, rather than being rejected as invalid.

[Diagram. Labels: dc:creator, person, sau, rdf:type (several times), rdfs:range, owl:disjointWith]
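
The difference between inferring and validating can be sketched as two small functions over the same data. This is an illustration only: the function names and the `ex:` identifiers are hypothetical, and typing `ex:sau` as an organisation is an assumption made so the contrast shows up.

```python
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def infer_range_types(triples):
    """Open world (RDFS): if p has range C, every value of p is inferred to be a C."""
    ranges = {(s, o) for (s, p, o) in triples if p == RDFS_RANGE}
    inferred = set(triples)
    for (s, p, o) in triples:
        for (prop, cls) in ranges:
            if p == prop:
                inferred.add((o, RDF_TYPE, cls))
    return inferred


def check_range_types(triples):
    """Closed world validation: report values of p not already typed as C."""
    ranges = {(s, o) for (s, p, o) in triples if p == RDFS_RANGE}
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        for (prop, cls) in ranges
        if p == prop and (o, RDF_TYPE, cls) not in triples
    )


data = {
    ("dc:creator", RDFS_RANGE, "ex:Person"),
    ("ex:report", "dc:creator", "ex:sau"),
    ("ex:sau", RDF_TYPE, "ex:Organisation"),  # assumed typing, for illustration
}
print(("ex:sau", RDF_TYPE, "ex:Person") in infer_range_types(data))
print(check_range_types(data))
```

Inference quietly adds a new fact about `ex:sau`; only the validating function flags the statement as a problem, which is the behaviour a records standard needs for conformance checking.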

SLIDE 36

RDF Constraint Language

We need something for validation and constraints.

There are SPIN and the W3C validation workshops, but nothing is quite right for this just now.

Enter RDFCL.

SLIDE 37

RDFCL – General Requirements

Generally, more type centric constraints.

– Things of type person must have exactly one birthdate property.

And then 'ranging' constraints.

– Constraints that go across many entities.
– A record cannot be the child of an aggregation if that aggregation already has children of type aggregation.
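
A type centric constraint such as "exactly one birthdate" reduces to counting property values per instance. The sketch below assumes triples held as Python tuples; `check_cardinality` and the `ex:` names are hypothetical and are not taken from the RDFCL draft, which expresses such checks as SPARQL.

```python
RDF_TYPE = "rdf:type"


def check_cardinality(triples, applies_to_type, property_type, min_card, max_card):
    """Return (subject, count) for each instance whose property count is out of bounds.

    max_card=None means unbounded.
    """
    violations = []
    instances = sorted(
        s for (s, p, o) in triples if p == RDF_TYPE and o == applies_to_type
    )
    for subject in instances:
        count = sum(1 for (s, p, o) in triples if s == subject and p == property_type)
        if count < min_card or (max_card is not None and count > max_card):
            violations.append((subject, count))
    return violations


people = {
    ("ex:anna", RDF_TYPE, "ex:Person"),
    ("ex:anna", "ex:birthdate", "1980-01-01"),
    ("ex:bo", RDF_TYPE, "ex:Person"),
    ("ex:cy", RDF_TYPE, "ex:Person"),
    ("ex:cy", "ex:birthdate", "1975-05-05"),
    ("ex:cy", "ex:birthdate", "1976-06-06"),
}
print(check_cardinality(people, "ex:Person", "ex:birthdate", 1, 1))
```

Here `ex:bo` fails for having no birthdate and `ex:cy` for having two, while `ex:anna` passes.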

SLIDE 38

RDFCL and RMIL

RDFCL

– Based on TMCL but for RDF
– Still early draft
– Very small specification as it is based on RDF data model and SPARQL
– Aims for 80 / 20 rule.

RMIL is a simple syntax optimised for describing RDFCL constraints.

SLIDE 39

RDFCL Data Model Example

[Diagram. Nodes: PropertyType Constraint #1, PropertyType Constraint Type, Person. Edge labels: is-a, applies-to-type, applies-to-prop-type, min-card "1", min-card "1", validation-query "complicated but unambiguous SPARQL expression"]

SLIDE 40

RDF for this is verbose

<http://arkivverket.no/noark5/Record> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.bouvet.no/rmil/Class> .
<http://arkivverket.no/noark5/Record> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.bouvet.no/rmil/ClassPropertyConstraint> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/applies-to-type> <http://arkivverket.no/noark5/Record> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/applies-to-propertytype> <http://arkivverket.no/noark5/recordType> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/valuetype> <http://www.w3.org/2001/XMLSchema#string> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/mincard> "0" .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/maxcard> "1" .
<http://arkivverket.no/noark5/recordType> <http://www.w3.org/2000/01/rdf-schema#domain> <http://arkivverket.no/noark5/Record> .
<http://arkivverket.no/noark5/recordType> <http://www.w3.org/2000/01/rdf-schema#range> <http://www.w3.org/2001/XMLSchema#string> .

SLIDE 41

RDF Model and Instance Language (RMIL)

Optimised for schema definition and controlled vocabularies

# the prefix without a name defines the default prefix
prefix http://sesam.io/schema/

# the named prefix definition
prefix foaf http://xmlns.com/foaf/0.1/

# define a class as a subclass of another Class
class Person foaf:Person
  # must have exactly one birth date
  xsd:dateTime birthdate 1 1
  # must have exactly one name
  xsd:string foaf:name 1 1
  # can have many emails or none
  xsd:string email 0 *
end

SLIDE 42

RDFCL Evaluation Semantics

[Diagram: a Schema containing several constraints (each labelled "Constraint #1"), evaluated against an RDF model M1]

If all the constraints in the schema validate with respect to the data provided, the data is considered valid.
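
The evaluation rule can be sketched as "a schema is a list of constraints, and the data is valid when none of them reports a violation". The example constraint implements the 'ranging' rule mentioned earlier in the deck (no records beside child aggregations). All names, including the `ex:parent` property, are hypothetical; the RDFCL draft defines constraints through SPARQL validation queries rather than Python functions.

```python
RDF_TYPE = "rdf:type"


def record_beside_aggregation(triples):
    """A record may not be a child of an aggregation that already has
    child aggregations. Returns the offending records."""
    types = {(s, o) for (s, p, o) in triples if p == RDF_TYPE}
    parent_of = [(s, o) for (s, p, o) in triples if p == "ex:parent"]  # (child, parent)
    mixed = {
        parent for (child, parent) in parent_of if (child, "ex:Aggregation") in types
    }
    return sorted(
        child
        for (child, parent) in parent_of
        if (child, "ex:Record") in types and parent in mixed
    )


def is_valid(triples, constraints):
    """The data is valid only if every constraint reports no violations."""
    return all(not constraint(triples) for constraint in constraints)


schema = [record_beside_aggregation]
good = {
    ("ex:a1", RDF_TYPE, "ex:Aggregation"),
    ("ex:r1", RDF_TYPE, "ex:Record"),
    ("ex:r1", "ex:parent", "ex:a1"),
}
bad = good | {
    ("ex:a2", RDF_TYPE, "ex:Aggregation"),
    ("ex:a2", "ex:parent", "ex:a1"),
}
print(is_valid(good, schema), is_valid(bad, schema))
```

Adding a constraint to the schema is just appending to the list, which mirrors the idea that extensions live in data rather than in a larger specification.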

SLIDE 43

Noark5 as RDF

Based on the English version of the standard and metadata catalogue

Define RDFCL for the model

Define identifiers that also connect to the specification

SLIDE 44

Example – File (Mappe)

# Definition for Class File
class File
  xsd:string fileID 0 1
  xsd:string filetype 0 1
  xsd:string title 0 1
  xsd:string officialTitle 0 1
  xsd:string description 0 1
  xsd:string documentMedium 0 1
  xsd:string storageLocation 0 *
  xsd:dateTime createdDate 1 1
  FondCreator createdBy 0 1
  xsd:dateTime finalisedDate 1 1
  OrganisationalUnit finalisedBy 0 1
  File parentFile 0 1
end

SLIDE 45

Example – Record (Registrering)

# Definition for Class Record
# A single record of evidence
class Record
  xsd:string recordType 0 1
  xsd:dateTime createdDate 1 1
  FondCreator createdBy 0 1
  xsd:dateTime archivedDate 1 1
  OrganisationUnit archivedBy 0 1
  File parentFile 0 1
  Series parentSeries 0 1
end

SLIDE 46

Document Object

class DocumentObject
  xsd:integer versionNumber 0 1
  xsd:string variantFormat 0 1
  xsd:string format 0 1
  xsd:string formatDetails 0 1
  xsd:dateTime createdDate 1 1
  FondCreator createdBy 0 1
  DocumentDescription 0 1
  Record record 0 1
  xsd:string checksum 1 1
  xsd:string checksumAlgorithm 1 1
  xsd:integer filesize 1 1
end

SLIDE 47

What does this provide?

A set of URI identifiers for the types and property types of the Noark5 Model

A set of constraints that can be used to check the structural integrity of any RDF model claiming to be a Noark5 Model

SLIDE 48

Where can we use it?

In an RDF version / section of the standard

Implementations of an RMS

Enabling RM systems that want to expose data as RDF

As a way to validate submissions to the archive

SLIDE 49

Noark5 Semantics

Lots of soft requirements

Some harder semantics, and some implied through the soft semantics

If NoarkX is going to be a hard standard then it needs to define not just data structures, constraints and interchange format but also operations.

In essence, an API

SLIDE 50

How can we define operations on the RDF model we have created?

RDF already provides us with SPARQL.

We have seen how we can use that and the RDF data model to create a constraint language.

Use the same fundamental building blocks to create a Semantic Operations Framework.

SLIDE 51

Semantic Operations Framework

SLIDE 52

Example

# The core class definition for all semantic operations
class SemanticOperation
  xsd:string validation_query 1 1
  xsd:string update_query 1 1
end

# The class for a given class of operation
class set_name
  rdf:resource person 1 1
  xsd:string name 1 1
end

# Also indicate that the class is an instance of
# SemanticOperation and class level properties
# for validation and update.
instance set_name : SemanticOperation
  validation_query "select ?x where ..."
  update_query "delete ..., update ..."
end

# an example of an operation
instance a_set_name : set_name
  name "bob"
  person http://example.org/gra
end

SLIDE 53

Noark5 Example

class AddSeriesRecord
  Series parentSeries 1 1
  Record record-to-add 1 1
end

instance AddSeriesRecord SemanticOperation
  validation-expression "select ?y where { ?x noark5:finalisedDate ?y FILTER(?x = [[parentSeries]]) }"
  update-expression "insert { [[record-to-add]] ?p ?v . }"
end
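
The `[[parameter]]` placeholders in the example suggest a simple binding step before the queries are run. The sketch below shows one way that step might look; it assumes that every parameter is a URI and is substituted as `<uri>`, which the slides do not state. The functions `bind` and `prepare` and the example URIs are hypothetical, and how the validation result is interpreted is left out because the slides do not specify it.

```python
import re

PLACEHOLDER = re.compile(r"\[\[([A-Za-z0-9_-]+)\]\]")


def bind(template, params):
    """Replace each [[name]] placeholder with the URI bound to that parameter."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"operation is missing parameter {name!r}")
        return f"<{params[name]}>"

    return PLACEHOLDER.sub(substitute, template)


def prepare(operation, params):
    """Return the (validation, update) query pair for one operation instance."""
    return bind(operation["validation"], params), bind(operation["update"], params)


add_series_record = {
    "validation": "select ?y where { ?x noark5:finalisedDate ?y FILTER(?x = [[parentSeries]]) }",
    "update": "insert { [[record-to-add]] ?p ?v . }",
}
validation, update = prepare(
    add_series_record,
    {
        "parentSeries": "http://example.org/series/1",
        "record-to-add": "http://example.org/record/7",
    },
)
print(validation)
print(update)
```

A real implementation would then send the validation query to a SPARQL endpoint and apply the update only if the validation outcome allows it.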

SLIDE 54

MoReq2010 as RDF

With Noark we just modelled things in RDF.

In MoReq2010 we created the models we felt were necessary (Record, Aggregation, etc.) and then suggested parts of the standard that could be implemented in RDF.

Primarily, the Classification Service and the Model Metadata Service.

SLIDE 55

Core Model in MoReq2010

# default prefix for moreq2010
prefix http://data.sesam.io/moreq2010/
prefix owl http://w3c.org/owl

# Aggregation Data Model
class Aggregation
  xsd:dateTime created 1 1
  xsd:dateTime originated 1 1
  xsd:dateTime first-used 0 1
  xsd:dateTime last-addition 0 1
  Class class 1 1
  xsd:string title 1 1
  xsd:string description 0 1
  xsd:string scope-notes 0 1
  xsd:dateTime closed 0 1
  xsd:dateTime destroyed 0 1
  xsd:int max-levels-of-aggregation 0 1
  Aggregation parent 0 1
  xsd:dateTime aggregated 0 1
end

# Record Class Definition
class Record
  xsd:dateTime created
  xsd:dateTime originated
  xsd:string title
  xsd:string description
  Record duplicate 0 1
  Aggregation parent 1 1
  xsd:dateTime aggregated
  Class class 0 1
  DisposalSchedule disposalSchedule 0 1
  xsd:dateTime retentionStart 0 1
  DisposalAction disposalAction 1 1
  xsd:dateTime disposalActionDue
  xsd:dateTime disposalConfirmationDue
  xsd:dateTime disposalOverdueAlert
  xsd:string lastReviewComment
  xsd:dateTime lastReviewed
  xsd:dateTime transferred
  xsd:dateTime destroyed
end

SLIDE 56

MoReq2010 Component

class DisposalAction
  xsd:int codeValue 1 1
end

instance retain_on_hold : DisposalAction
  codeValue 0
end

instance retain_permanently : DisposalAction
  codeValue 1
end

instance review : DisposalAction
  codeValue 2
end

instance transfer : DisposalAction
  codeValue 3
end

instance destroy : DisposalAction
  codeValue 4
end

# Component Class Definition
class Component
  xsd:dateTime created 1 1
  xsd:dateTime originated 1 1
  Record parent 1 1
  xsd:string title 1 1
  xsd:string description 1 1
  Component duplicate
  xsd:boolean automaticDeletion
  xsd:dateTime Destroyed
end

SLIDE 57

MoReq2010 Model Metadata Service

Complicated and underpowered, but spot on with its aims and objectives.

[Diagram. Label: Class]

SLIDE 58

Unable to constrain the core model

Use case: all aggregations of class job application can only contain records of the following types: application, meeting invitation, contract, signed contract. They must also have the metadata applicant ID.

The first is not (obviously) possible with MMS; the second is doable but painful.

Let's fix this with RDFCL.

SLIDE 59

Model Extensions and Constraints in MoReq2010

# Job Application Use Case
class JobApplication Aggregation
end

class ApplicationLetter Record
end

class InterviewLetter Record
end

class OfferLetter Record
end

class Contract Record
end

class SignedContract Record
end

class JobApplication Aggregation
  Person applicant 1 1
  xsd:string applicantName 1 1
  xsd:string applicantAddress 0 1
  xsd:string applicationEmail 0 1
end

# Add structure constraints
class JobApplication
  ApplicationLetter applicationLetter 1 1
  InterviewLetter interviewLetter 0 *
  OfferLetter offerLetter 0 1
  Contract contract 0 1
  SignedContract signedContract 0 1
end

# additional constraint
instance jobApplicationStructureConstraint1 rdfcl:Constraint
  query "select ?x where { ?x is a child of a_jobapplication but is not of one of the allowed types }"
end

SLIDE 60

What does this give us?

Use the same approach to extension as for the main model.

As RDF we get constraints, model refinements, interchange of constraints and interchange of the instance data, most of it for free.

Open up the market for RMS, as they would all be based on RDF systems and able to interchange data and data models.

SLIDE 61

We're not finished yet! Let's rethink the Noark standard and domain models.

MoReq2010 as RDF, RDFCL, SOF, some text.

Noark as RDF, RDFCL, SOF, in terms of MoReq2010. No text, only data.

Domain models as RDF, RDFCL, SOF, in terms of Noark. No text, only data.

Core MoReq2010 is informative text plus formal models and semantics based on RDF and its family of standards. Noark is defined as extensions and refinements to MoReq2010. Noark domains are defined as extensions and refinements to Noark.

SLIDE 62

In practice, Noark as MoReq2010

## --------------------------------
# NOARK5
# Subclass the appropriate structures in the moreq2010 model
## --------------------------------
prefix http://data.sesam.io/noark5/
prefix moreq http://data.sesam.io/moreq2010/

class Fond moreq:Aggregation
  Fond parent 0 1
end

class Series moreq:Aggregation
  Fond parent 1 1
end

class File moreq:Aggregation
  Series parent 1 1
end

class Record moreq:Record
  File fileParent 0 1
  Series seriesParent 0 1
end

class DocumentDescription moreq:Component
end

class DocumentObject moreq:Component
end

SLIDE 63

Domain model

# Vehicle Registration Model
class VehicleRegistration Aggregation
  Citizen registator 1 1
  OrganisationUnit caseHandler 1 1
end

class RegistrationForm Record
  VehicleRegistration registration 1 1
end

class Citizen
end

SLIDE 64

Records Management and Archive Systems Meta-Architecture

[Diagram. Labels: Archive Aggregator / RMS; Records Management System; Phone, Tablet, PC; API; X-Format; Protocol of submission; Conformance Suite; Model Validation (twice). Overlaid labels: SOF; RDF standards for serialisation; RDFCL (twice); ??]

SLIDE 65

Continuous Archive Submission

Use another RDF centric protocol called SDShare (sdshare.org, W3C community group).

Syndication and synchronisation of RDF repositories

Used to expose data from all kinds of systems as RDF

Pull based approach provides high quality system level properties.

Pushing to web service endpoints is very broken.

SLIDE 66

SLIDE 67

Make use of models and constraints

The aggregator can make use of the constraints to validate arriving data.

Publishers 'just' publish the data as RDF.

No special interchange format needed, no model syntactic mappings required.

SLIDE 68

Benefits

Multiple subscribers

Continuous

Robust in the face of downtime, due to the pull based model

RDF deals with partial submission: e.g. if the record is closed but the aggregation isn't, send the record and reference the aggregation, then send the aggregation later. This works thanks to merging and URIs for identifiers.
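
The pull based behaviour and partial submission can be illustrated with a toy publisher and subscriber. This is not the SDShare protocol itself: the classes `Publisher` and `Archive`, the sequence-number cursor and the `ex:` identifiers are assumptions made purely to show why pulling survives downtime and why a record can arrive before its aggregation.

```python
class Publisher:
    """Toy source system: keeps an ordered log of (sequence, triple) changes."""

    def __init__(self):
        self.log = []

    def publish(self, triple):
        self.log.append((len(self.log) + 1, triple))

    def changes_since(self, sequence):
        return [entry for entry in self.log if entry[0] > sequence]


class Archive:
    """Toy subscriber: pulls when it is able to, and remembers how far it got."""

    def __init__(self):
        self.triples = set()
        self.last_seen = 0

    def pull(self, publisher):
        for sequence, triple in publisher.changes_since(self.last_seen):
            self.triples.add(triple)  # merging is just adding to the set
            self.last_seen = sequence


rms = Publisher()
archive = Archive()

# The record is closed first; it references its aggregation only by URI.
rms.publish(("ex:record/7", "ex:parent", "ex:aggregation/2"))
rms.publish(("ex:record/7", "ex:title", "Offer letter"))
archive.pull(rms)

# The archive is "down" while the aggregation is published, then catches up.
rms.publish(("ex:aggregation/2", "ex:title", "Job application 2"))
archive.pull(rms)
print(len(archive.triples), archive.last_seen)
```

Because the subscriber owns the cursor, nothing is lost while it is offline, and a second subscriber can follow the same log independently.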

SLIDE 69

Records Management and Archive Systems Meta-Architecture

[Diagram. Labels: Archive Aggregator / RMS; Records Management System; Phone, Tablet, PC; API; X-Format; Protocol of submission; Conformance Suite; Model Validation (twice). Overlaid labels: SOF; RDF standards for serialisation; RDFCL (twice); SDShare]

SLIDE 70

Resources

Produced several artefacts:

– Research report
– These slides
– RDFCL draft
– RMIL draft
– Semantic Operations Framework draft
– RMIL schemas for Noark5 and MoReq2010 core data models
– RDF N-Triples files for RDFCL constraints and RDFS declarations for Noark5 and MoReq2010

SLIDE 71

Summary

RDF looks like a very promising technology for this space.

It looks possible to define records management and archive submission processes using RDF and the related family of standards.

In this process everything becomes data and queries, and is completely formal.

SLIDE 72

Some next steps?

Create and publish RDF models for MoReq2010 and Noark5

– Allow these as valid interchange formats for Noark and MoReq2010

Define an RDF centric standard for records management and archive submissions based on MoReq2010 and Noark?

Open source implementations of RDF RMS based on these concepts

Evaluate SDShare as submission process

W3C to adopt RDFCL and SOF

SLIDE 73

Questions?