The role of RDF in records management and archiving
Graham Moore, Head of Product Development, SESAM @gra_moore, graham.moore@sesam.io
Why are we here?
I had been working on records management and archive-related projects and, separately, on Semantic Web technologies.
I started looking at MoReq2010 and RDF, and told Gunnar about some of this work.
Gunnar was interested, so he initiated a small project that let me focus more deeply on the issues, and widened the scope to include Noark5 and continuous archiving.
This is a summary of the report produced by that project.
Research Goals
Look at the role of RDF and related standards as the basis for:
– Standardised descriptions of RM and Archive systems data structures
– The definition of semantics
– As a tool for interchange
– The provision of continuous archiving
– Unifying Noark with MoReq2010
– Demonstrating data driven standards definitions
Agenda
Introduction to me
What is an I.T. standard?
Research Goals
Introduction to RDF, RDFCL, SOF
Modelling Noark5 and MoReq2010 in RDF
Noark5 as MoReq2010
Using RDF and SDShare for continuous archiving
Conclusion and Future work
Introduction to me
Time
Southampton University Using SGML for interchange in an OO system.
1996
Met authors of SGML and HyTime and got brainwashed to work on Topic Maps
1999
> 10 years of ISO meetings producing XTM 1.0, 1.1, TMCL, TMDM, TMSyntax
2010
Created SDShare with Makx Dekkers, Marc Kuster for CEN
2007
With LMG Updated SDShare for better RDF, Created W3C group.
2011
Work on OData to SPARQL
2012
TMAPI
OData working group in OASIS
RDF Net API
Implementing software based on standards
Work Themes
Standards around generalised data
– Structures, Semantics, Interchange, APIs
The mentality of standardisation
– What makes a good standard?
– What should be standardised?
– What is in and what is out?
– When to standardise?
What do I think I have learnt?
Less is more
Ambiguity is very bad
Extension through data is better than making the spec bigger
Conformance is tricky
Building on top of other standards really helps
How do I see Noark and Moreq2010
I.T. standards upon which there is a great weight of expectation and responsibility.
We want these standards to capture structures and operations in ways that support the data and processes of records and archiving.
These structures and operations should be constrained enough for conformity and reliability, but should not place unnecessary restrictions on implementations or domains of application.
What makes an I.T. Standard?
Conformance Touch Points
Data Models
– Central, critical piece of any standard.
– But you cannot test conformance to a data model.
– Provides guidance to implementers and conveys intent; if it's wrong, the next two things are wrong.
Interchange Syntax
– The way in which different systems can communicate. Standardising this is about unambiguously defining the way syntactic elements map to data model constructs.
– Conformance is about detecting invalid syntactic and semantic structures.
API
– If it walks like a duck, quacks like a duck, and looks like a duck, it's a duck.
– If you declare an API and make clear the semantics of each operation, any system that adheres to those expectations is compliant.
– Used either just for conformance testing of a solution or to allow interchange of implementations.
UI, Search, Reporting
– These should never be in a standard.
Evolution of software
[Figure: three stages of a Windows desktop application. Labels: Windows Desktop Application (three times); database; API; database; Phone; Tablet; PC.]
Records Management and Archive Systems Meta-Architecture
[Figure: meta-architecture diagram. Labels: Phone, Tablet, PC; API; Records Management System; X-Format; Protocol of submission; Archive Aggregator / RMS; Conformance Suite; Model Validation (twice).]
Research Goals
Look at the role of RDF and related standards as the basis for:
– The descriptions of RM and Archive systems data structures
– The definition of semantics
– As a tool for interchange
– The provision of continuous archiving
– Unifying Noark and MoReq2010
– Demonstrating data driven standards definitions
Scope
Investigative work
Explore a wide and extreme scope
Help define what work items are interesting in the future
Research, Report & Presentation
Semantic technologies
A family of standards from the W3C
– same organization that does HTML, CSS, ...
Goal: to enable the Semantic Web
– interchange of structured data for machines
– not just documents for humans
Grounded on open world assumptions
– Anyone can 'express' anything, and anyone should be able to process and understand it.
RDF
The core standard
– defines the data model – all the other parts build on this
Implemented in database products
– known as triple stores
– these often replace RDBMS databases when working with semantic technologies
An unusual data model
– schemaless
– graph database
– everything based on triples
– all objects identified by URIs
RDF
Very powerful data model with extensive use of URIs
<subject> <predicate> <object>
Or
<thing> <property-type> <value>
URIs
http://sesam.io/standards/moreq2010/aggregation
Persistent, globally unique identifiers for Things, Concepts
Particularly useful for types, property types and controlled vocabularies.
Helps ensure that everyone uses the same identifier for the same thing.
Minted by authorities
Address as well as identity
How RDF works
'PERSON' table

ID  NAME                 EMAIL
1   Graham Moore         graham.moore@
2   Lars Marius Garshol  larsga@bouvet.no
3   Axel Borge           axel.borge@bouvet

RDF-ized data

SUBJECT                      PROPERTY  OBJECT
http://example.com/person/1  rdf:type  ex:Person
http://example.com/person/1  ex:name   Graham Moore
http://example.com/person/1  ex:email  graham.moore@
http://example.com/person/2  rdf:type  ex:Person
http://example.com/person/2  ex:name   Lars Marius Garshol
...                          ...       ...
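The table-to-triples step on this slide can be sketched in a few lines of Python. This is only an illustration: the base URI and the ex: prefix follow the slide, while the function name rows_to_triples and the rule "column name becomes ex:<lowercased name>" are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Base URI taken from the slide; treat it as illustrative.
BASE = "http://example.com/person/"


def rows_to_triples(rows: List[Dict[str, str]]) -> List[Triple]:
    """Turn each row into one rdf:type triple plus one triple per column."""
    triples: List[Triple] = []
    for row in rows:
        subject = BASE + row["ID"]  # the primary key becomes the URI
        triples.append((subject, "rdf:type", "ex:Person"))
        for column, value in row.items():
            if column == "ID":
                continue  # already encoded in the subject URI
            # Assumed naming rule: NAME -> ex:name, EMAIL -> ex:email
            triples.append((subject, "ex:" + column.lower(), value))
    return triples


person_table = [
    # The first e-mail address is truncated on the slide and kept that way here.
    {"ID": "1", "NAME": "Graham Moore", "EMAIL": "graham.moore@"},
    {"ID": "2", "NAME": "Lars Marius Garshol", "EMAIL": "larsga@bouvet.no"},
]

triples = rows_to_triples(person_table)
for triple in triples:
    print(triple)
```

Every row yields the same three-column shape regardless of how many columns the table has, which is why no schema change is needed when a new property appears.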
RDF is a graph
[Figure: graph diagram. Edges, read as subject, property, object:]
../1  foaf:name  "Graham Moore"
../1  foaf:nick  "gra"
../1  ex:works-for  Bouvet
../1  rdf:type  foaf:Person
foaf:Person  rdfs:subClassOf  foaf:Agent
Two more things
Datatypes
– values can be typed with XML Schema datatypes
– means values can be stored more efficiently
– numbers sort as numbers
– user-defined data types also possible
Graphs
– the database is divided into graphs
– each graph is a set of triples identified by a URI
– very, very useful for subdividing the database
– we use one graph per data source
RDF Merging
[Figure: node diagrams before and after merging. Node labels: 1, 5, 3 and 1, 6.]
RDF Tracking Data Origins
[Figure: an RDF store containing Graph 1 and Graph 2, with node labels 1, 5, 3 and 1, 6, and the request "Show me thing 1".]
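Merging and origin tracking can be sketched with plain Python sets. This is a toy model, not a triple store: the graph URIs, the ex: identifiers and the function names merge and describe are all invented for the example. It assumes, as the slides say, one graph per data source and each graph being a set of triples identified by a URI.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# One graph per data source, each identified by a URI (hypothetical URIs).
store: Dict[str, Set[Triple]] = {
    "http://example.org/graph/hr": {
        ("ex:1", "foaf:name", "Graham Moore"),
        ("ex:1", "ex:works-for", "ex:5"),
    },
    "http://example.org/graph/crm": {
        ("ex:1", "foaf:nick", "gra"),
        ("ex:1", "foaf:name", "Graham Moore"),  # same fact from a second source
    },
}


def merge(graphs: Dict[str, Set[Triple]]) -> Set[Triple]:
    """Merging is a set union: identical triples collapse into one."""
    merged: Set[Triple] = set()
    for triples in graphs.values():
        merged |= triples
    return merged


def describe(graphs: Dict[str, Set[Triple]], subject: str) -> Dict[Triple, List[str]]:
    """'Show me thing 1': every statement about a subject, with the graphs it came from."""
    found: Dict[Triple, List[str]] = {}
    for graph_uri, triples in graphs.items():
        for triple in triples:
            if triple[0] == subject:
                found.setdefault(triple, []).append(graph_uri)
    return found


print(len(merge(store)))
for triple, origins in describe(store, "ex:1").items():
    print(triple, "from", origins)
```

Because both sources use the same URI for the same thing, no matching or reconciliation step is needed before the union.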
Many Serialisation formats
N-Triples
– Simple line based format
RDF/XML
– RDF in XML
– the original format
– Truly horrific
Turtle
– nice, human-readable text format
– a bit more work to parse
JSON-LD
– RDF in JSON
Turtle example
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

_:lmg a foaf:Person;
    foaf:name "Graham Moore";
    foaf:nick "gra";
    ex:works-for _:bouvet.

foaf:Person a rdfs:Class;
    rdfs:label "Person";
    rdfs:subClassOf foaf:Agent.
Benefit of RDF Serialisation
In XML serialisation systems, the XML is almost never the internal operational model.
So a lot of effort is spent describing how the model maps to the syntactic representation.
With RDF, the serialisation is a direct serialisation of the data being used.
SPARQL
The RDF query language
– lots of implementations
Standardized protocol
– HTTP-based and very simple
– XML, JSON, RDF format for results
Very like SQL in some ways
– totally unlike in others
Also has an update language
– update, insert, delete, clear, ...
All persons sorted by name
prefix foaf: <http://xmlns.com/foaf/0.1/>
select ?person ?name
where {
  ?person a foaf:Person .
  ?person foaf:name ?name .
}
order by ?name
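For readers new to SPARQL, the query can be read as two triple patterns joined on ?person, followed by a sort. The Python below mimics that reading over an in-memory set of triples; it is not how a SPARQL engine is implemented, and the sample data and the function name persons_sorted_by_name are made up for the illustration.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

data: Set[Triple] = {
    ("ex:1", "rdf:type", "foaf:Person"),
    ("ex:1", "foaf:name", "Graham Moore"),
    ("ex:2", "rdf:type", "foaf:Person"),
    ("ex:2", "foaf:name", "Axel Borge"),
    ("ex:bouvet", "foaf:name", "Bouvet"),  # has a name, but is not a foaf:Person
}


def persons_sorted_by_name(triples: Set[Triple]) -> List[Tuple[str, str]]:
    # Pattern 1: ?person a foaf:Person
    persons = {s for (s, p, o) in triples
               if p == "rdf:type" and o == "foaf:Person"}
    # Pattern 2: ?person foaf:name ?name, joined on ?person
    rows = [(s, o) for (s, p, o) in triples
            if p == "foaf:name" and s in persons]
    # Sort the result rows by ?name
    return sorted(rows, key=lambda row: row[1])


result = persons_sorted_by_name(data)
for person, name in result:
    print(person, name)
```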
Why RDF for RMS / Archive standards and systems?
Data model is powerful and flexible
Need to be able to combine core models with data from any domain
URIs provide a strong basis for common agreed identifiers in core and domain areas, i.e. what is the identifier for a vehicle registration date?
All of this stuff is in data, not in prose
Interchange is for free
Merging is for free
Types and Property Types in DATA – extensibility is built in.
Research Activities  |  Enabler
Noark5 Model and Constraints as RDF  |  RDF Constraint Language
Moreq2010 Model and Constraints as RDF  |  RDF Constraint Language
Demonstrate how operations can be defined in terms of the RDF models  |  Semantic Operations Framework
Noark5 as Moreq2010  |  RDF Constraint Language
Noark5 Domain Extension as RDF  |  RDF Constraint Language
RDF Constraint Language
The RDF family of standards includes RDFS and OWL.
These are designed to support an open world model that is primarily about inference.
Open World Inference Assumptions
Rule: the value of dc:creator must be a person. Under open world inference, it follows that anything used as a dc:creator is a person.
[Figure: graph diagram. Labels: rdf:type, dc:creator, sau, rdf:type, rdf:type, dc:creator, person, rdfs:Range.]
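The difference between inference and validation can be shown with the same range rule applied two ways. The sketch below is an assumption-laden toy: the ex: identifiers and the function names infer_types and validate_types are invented, and only the single rdfs:range rule is modelled.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Stands for the schema statement: dc:creator rdfs:range ex:Person
RANGES = {"dc:creator": "ex:Person"}

data: Set[Triple] = {
    ("ex:report", "dc:creator", "ex:sesam"),
    ("ex:sesam", "rdf:type", "ex:Organisation"),
}


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Open world (RDFS): a range statement adds a type; it never rejects data."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in RANGES:
            inferred.add((o, "rdf:type", RANGES[p]))
    return inferred


def validate_types(triples: Set[Triple]) -> List[str]:
    """Closed world (validation): the value must already be declared a person."""
    errors = []
    for s, p, o in triples:
        if p in RANGES and (o, "rdf:type", RANGES[p]) not in triples:
            errors.append(f"{o} is the {p} of {s} but is not typed {RANGES[p]}")
    return errors


print(sorted(infer_types(data) - data))  # what inference adds
print(validate_types(data))              # what validation complains about
```

Inference quietly concludes that the organisation is also a person, whereas validation reports a problem; records management needs the second behaviour.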
RDF Constraint Language
We need something for validation and constraints
SPIN, W3C Validation Workshops, but nothing just perfect for this right now.
Enter RDFCL
RDFCL – General Requirements
Generally, more type centric constraints.
– Things of type person must have exactly one birthdate property.
And then 'ranging' constraints.
– Constraints that go across many entities.
– A record cannot be the child of an aggregation if the aggregation already has children of type aggregation.
RDFCL and RMIL
RDFCL
– Based on TMCL but for RDF
– Still early draft
– Very small specification as it is based on RDF data model and SPARQL
– Aims for 80 / 20 rule.
RMIL is a simple syntax optimised for describing RDFCL constraints.
RDFCL Data Model Example
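[Figure: an RDFCL constraint drawn as a graph. Node labels: PropertyType Constraint #1, Person, PropertyType Constraint Type, Person. Edge labels: applies-to-type, is-a, applies-to-prop-type, min-card "1", min-card "1" (as labelled on the slide), validation-query "complicated but unambiguous SPARQL expression".]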
RDF for this is verbose
<http://arkivverket.no/noark5/Record> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.bouvet.no/rmil/Class> .
<http://arkivverket.no/noark5/Record> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.bouvet.no/rmil/ClassPropertyConstraint> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/applies-to-type> <http://arkivverket.no/noark5/Record> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/applies-to-propertytype> <http://arkivverket.no/noark5/recordType> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/valuetype> <http://www.w3.org/2001/XMLSchema#string> .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/mincard> "0" .
<http://data.bouvet.no/rmil/1b6ecaa997d84181ae39d3d399ab3825> <http://data.bouvet.no/rmil/maxcard> "1" .
<http://arkivverket.no/noark5/recordType> <http://www.w3.org/2000/01/rdf-schema#domain> <http://arkivverket.no/noark5/Record> .
<http://arkivverket.no/noark5/recordType> <http://www.w3.org/2000/01/rdf-schema#range> <http://www.w3.org/2001/XMLSchema#string> .
RDF Model and Instance Language (RMIL)
Optimised for schema definition and controlled vocabularies
# the prefix without a name defines the default prefix
prefix http://sesam.io/schema/
# the named prefix definition
prefix foaf http://xmlns.com/foaf/0.1/
# define a class as a subclass of another Class
class Person foaf:Person
  # must have exactly one birth date
  xsd:dateTime birthdate 1 1
  # must have exactly one name
  xsd:string foaf:name 1 1
  # can have many emails or none
  xsd:string email 0 *
end
RDFCL Evaluation Semantics
[Figure: a schema containing several constraints (each labelled "Constraint #1" on the slide) evaluated against an RDF model, M1.]
If all the constraints in the schema validate with respect to the data provided, the data is considered valid.
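The evaluation rule can be sketched for the simplest kind of constraint, a cardinality constraint on a property type. This is not the RDFCL draft itself, which expresses validation as SPARQL; the class CardinalityConstraint, the helper names and the ex: identifiers are assumptions made for the example, and the birth date value is invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class CardinalityConstraint:
    applies_to_type: str
    property_type: str
    min_card: int
    max_card: Optional[int]  # None stands for "*"


def violations(constraint: CardinalityConstraint, data: Set[Triple]) -> List[str]:
    """List every instance of the type whose property count is out of bounds."""
    instances = sorted(s for (s, p, o) in data
                       if p == "rdf:type" and o == constraint.applies_to_type)
    problems = []
    for subject in instances:
        count = sum(1 for (s, p, _) in data
                    if s == subject and p == constraint.property_type)
        too_few = count < constraint.min_card
        too_many = constraint.max_card is not None and count > constraint.max_card
        if too_few or too_many:
            problems.append(f"{subject}: {count} x {constraint.property_type}")
    return problems


def is_valid(schema: List[CardinalityConstraint], data: Set[Triple]) -> bool:
    """The data is valid only if every constraint in the schema validates."""
    return all(not violations(c, data) for c in schema)


schema = [
    CardinalityConstraint("ex:Person", "ex:birthdate", 1, 1),  # exactly one
    CardinalityConstraint("ex:Person", "ex:email", 0, None),   # none or many
]
good = {("ex:gra", "rdf:type", "ex:Person"),
        ("ex:gra", "ex:birthdate", "1970-01-01T00:00:00")}
bad = {("ex:bob", "rdf:type", "ex:Person")}

print(is_valid(schema, good), is_valid(schema, bad))
```

Because the schema is itself a list of data items, extending the model means adding entries to it, not changing the checker.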
Noark5 as RDF
Based on the English version of the standard and metadata catalogue
Define RDFCL for the model
Define identifiers that also connect to the specification
Example – File (Mappe)
# Definition for Class File
class File
  xsd:string fileID 0 1
  xsd:string filetype 0 1
  xsd:string title 0 1
  xsd:string officialTitle 0 1
  xsd:string description 0 1
  xsd:string documentMedium 0 1
  xsd:string storageLocation 0 *
  xsd:dateTime createdDate 1 1
  FondCreator createdBy 0 1
  xsd:dateTime finalisedDate 1 1
  OrganisationalUnit finalisedBy 0 1
  File parentFile 0 1
end
Example – Record (Registrering)
# Definition for Class Record
# A single record of evidence
class Record
  xsd:string recordType 0 1
  xsd:dateTime createdDate 1 1
  FondCreator createdBy 0 1
  xsd:dateTime archivedDate 1 1
  OrganisationUnit archivedBy 0 1
  File parentFile 0 1
  Series parentSeries 0 1
end
Document Object
class DocumentObject
  xsd:integer versionNumber 0 1
  xsd:string variantFormat 0 1
  xsd:string format 0 1
  xsd:string formatDetails 0 1
  xsd:dateTime createdDate 1 1
  FondCreator createdBy 0 1
  DocumentDescription 0 1
  Record record 0 1
  xsd:string checksum 1 1
  xsd:string checksumAlgorithm 1 1
  xsd:integer filesize 1 1
end
What does this provide?
Set of URI identifiers for types and property types of the Noark5 Model
Set of constraints that can be used to check the structural integrity of any RDF model claiming to be a Noark5 Model.
Where can we use it?
In an RDF version / section of the standard.
Implementations of an RMS
Enabling RM systems that want to expose data as RDF
As a way to validate submissions to the archive
Noark5 Semantics
Lots of soft requirements
Some harder semantics, and some implied through the soft semantics
If NoarkX is going to be a hard standard then it needs to define not just data structures, constraints and interchange format but also operations.
In essence, an API
How can we define operations on the RDF model we have created?
RDF already provides us with SPARQL
We have seen how we can use that and the RDF data model to create a constraint language
Use the same fundamental building blocks to create a Semantic Operations Framework
Semantic Operations Framework
Example
# The core class definition for all semantic operations
class SemanticOperation
  xsd:string validation_query 1 1
  xsd:string update_query 1 1
end

# The class for a given class of operation
class set_name
  rdf:resource person 1 1
  xsd:string name 1 1
end

# Also indicate that the class is an instance of
# SemanticOperation and class level properties
# for validation and update.
instance set_name : SemanticOperation
  validation_query "select ?x where ..."
  update_query "delete ..., update ..."
end

# an example of an operation
instance a_set_name : set_name
  name "bob"
  person http://example.org/gra
end
Noark5 Example
class AddSeriesRecord
  Series parentSeries 1 1
  Record record-to-add 1 1
end

instance AddSeriesRecord SemanticOperation
  validation-expression "select ?y where { ?x noark5:finalisedDate ?y FILTER(?x = [[parentSeries]]) }"
  update-expression "insert { [[record-to-add]] ?p ?v . }"
end
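The [[name]] markers in the two expressions suggest that an operation instance supplies values that are bound into the queries before they run. How the Semantic Operations Framework draft actually performs that binding is not shown on the slide, so the following is only one plausible reading: the function bind, the rule of wrapping values in angle brackets, and the example URIs are all assumptions.

```python
import re
from typing import Dict

# Matches [[name]] where name may contain letters, digits, "_" and "-".
PLACEHOLDER = re.compile(r"\[\[([A-Za-z0-9_-]+)\]\]")


def bind(template: str, arguments: Dict[str, str]) -> str:
    """Replace each [[name]] with the URI supplied for that operation property."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in arguments:
            raise KeyError(f"operation instance is missing '{name}'")
        return "<" + arguments[name] + ">"  # assumed: values are URIs
    return PLACEHOLDER.sub(substitute, template)


validation_expression = (
    "select ?y where { ?x noark5:finalisedDate ?y "
    "FILTER(?x = [[parentSeries]]) }"
)
update_expression = "insert { [[record-to-add]] ?p ?v . }"

# A hypothetical AddSeriesRecord operation instance.
operation = {
    "parentSeries": "http://example.org/series/1",
    "record-to-add": "http://example.org/record/42",
}

print(bind(validation_expression, operation))
print(bind(update_expression, operation))
```

A missing property raises an error before any query is sent, which is the kind of check an API conformance suite could rely on.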
Moreq2010 as RDF
With Noark we just modelled things in RDF.
In MoReq2010 we created the models we felt were necessary (Record, Aggregation etc.) and then suggested parts of the standard that could be implemented in RDF: primarily, the Classification Service and the Model Metadata Service.
Core Model in Moreq2010
# default prefix for moreq2010
prefix http://data.sesam.io/moreq2010/
prefix owl http://w3c.org/owl

# Aggregation Data Model
class Aggregation
  xsd:dateTime created 1 1
  xsd:dateTime originated 1 1
  xsd:dateTime first-used 0 1
  xsd:dateTime last-addition 0 1
  Class class 1 1
  xsd:string title 1 1
  xsd:string description 0 1
  xsd:string scope-notes 0 1
  xsd:dateTime closed 0 1
  xsd:dateTime destroyed 0 1
  xsd:int max-levels-of-aggregation 0 1
  Aggregation parent 0 1
  xsd:dateTime aggregated 0 1
end

# Record Class Definition
class Record
  xsd:dateTime created
  xsd:dateTime originated
  xsd:string title
  xsd:string description
  Record duplicate 0 1
  Aggregation parent 1 1
  xsd:dateTime aggregated
  Class class 0 1
  DisposalSchedule disposalSchedule 0 1
  xsd:dateTime retentionStart 0 1
  DisposalAction disposalAction 1 1
  xsd:dateTime disposalActionDue
  xsd:dateTime disposalConfirmationDue
  xsd:dateTime disposalOverdueAlert
  xsd:string lastReviewComment
  xsd:dateTime lastReviewed
  xsd:dateTime transferred
  xsd:dateTime destroyed
end
MoReq2010 Component
class DisposalAction
  xsd:int codeValue 1 1
end

instance retain_on_hold : DisposalAction
  codeValue 0
end
instance retain_permanently : DisposalAction
  codeValue 1
end
instance review : DisposalAction
  codeValue 2
end
instance transfer : DisposalAction
  codeValue 3
end
instance destroy : DisposalAction
  codeValue 4
end

# Component Class Definition
class Component
  xsd:dateTime created 1 1
  xsd:dateTime originated 1 1
  Record parent 1 1
  xsd:string title 1 1
  xsd:string description 1 1
  Component duplicate
  xsd:boolean automaticDeletion
  xsd:dateTime Destroyed
end
MoReq2010 Model Metadata Service
Complicated and under powered, but spot
Class
Unable to constrain the core model
Use case: all aggregations of class job application can only contain records of the following types: application, meeting invitation, contract, signed contract. They must also have the metadata applicant ID.
The first is not (obviously) possible with MMS; the second is doable but painful.
Let's fix this with RDFCL.
Model Extensions and Constraints in Moreq2010
# Job Application Use Case
class JobApplication Aggregation
end
class ApplicationLetter Record
end
class InterviewLetter Record
end
class OfferLetter Record
end
class Contract Record
end
class SignedContract Record
end

class JobApplication Aggregation
  Person applicant 1 1
  xsd:string applicantName 1 1
  xsd:string applicantAddress 0 1
  xsd:string applicationEmail 0 1
end

# Add structure constraints
class JobApplication
  ApplicationLetter applicationLetter 1 1
  InterviewLetter interviewLetter 0 *
  OfferLetter offerLetter 0 1
  Contract contract 0 1
  SignedContract signedContract 0 1
end

# additional constraint (the query body is pseudocode, not valid SPARQL)
instance jobApplicationStructureConstraint1 rdfcl:Constraint
  query "select ?x where { ?x is a child of a_jobapplication but is not of one of the allowed types }"
end
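The additional constraint is a 'ranging' constraint: it looks across a parent and its children. Since the slide gives the query only in pseudocode, the sketch below spells out the intended check in Python over a set of triples. The property name ex:parent, the ex: type names and the function disallowed_children are assumptions for the example, not identifiers from the report.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

ALLOWED_CHILD_TYPES = {
    "ex:ApplicationLetter", "ex:InterviewLetter", "ex:OfferLetter",
    "ex:Contract", "ex:SignedContract",
}


def disallowed_children(data: Set[Triple]) -> List[str]:
    """Children of a JobApplication whose type is not one of the allowed types."""
    types: Dict[str, Set[str]] = {}
    for s, p, o in data:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    offenders = []
    for s, p, o in data:
        parent_is_job_application = "ex:JobApplication" in types.get(o, set())
        if p == "ex:parent" and parent_is_job_application:
            if not (types.get(s, set()) & ALLOWED_CHILD_TYPES):
                offenders.append(s)
    return sorted(offenders)


data = {
    ("ex:app1", "rdf:type", "ex:JobApplication"),
    ("ex:letter", "rdf:type", "ex:ApplicationLetter"),
    ("ex:letter", "ex:parent", "ex:app1"),
    ("ex:invoice", "rdf:type", "ex:Invoice"),  # not an allowed record type
    ("ex:invoice", "ex:parent", "ex:app1"),
}

offenders = disallowed_children(data)
print(offenders)
```

An empty result means the constraint validates, which matches the "select the offenders" style of the pseudocode query.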
What does this give us?
Use the same approach for extensions as for the main model.
As RDF we get constraints, model refinements, interchange of constraints and interchange of the instance data, most of it for free.
Open up the market for RMS, as they would all be based on RDF systems and able to interchange data and data models.
We're not finished yet! Let's rethink the Noark standard and domain models.
[Figure: three layers, each defined in terms of the one before it: MoReq2010 as RDF, RDFCL, SOF and some text; Noark as RDF, RDFCL, SOF; domain models as RDF, RDFCL, SOF.]
Core MoReq2010 is informative text plus formal models and semantics based on RDF and its family of standards.
Noark is defined as extensions and refinements to MoReq2010.
Noark Domains are defined as extensions and refinements to Noark.
In practice, Noark as MoReq2010
## --------------------------------
# NOARK5
# Subclass the appropriate structures in the moreq2010 model
## --------------------------------
prefix http://data.sesam.io/noark5/
prefix moreq http://data.sesam.io/moreq2010

class Fond moreq:Aggregation
  Fond parent 0 1
end

class Series moreq:Aggregation
  Fond parent 1 1
end

class File moreq:Aggregation
  Series parent 1 1
end

class Record moreq:Record
  File fileParent 0 1
  Series seriesParent 0 1
end

class DocumentDescription moreq:Component
end

class DocumentObject moreq:Component
end
Domain model
# Vehicle Registration Model
class VehicleRegistration Aggregation
  Citizen registator 1 1
  OrganisationUnit caseHandler 1 1
end

class RegistrationForm Record
  VehicleRegistration registration 1 1
end

class Citizen
end
Records Management and Archive Systems Meta-Architecture
[Figure: meta-architecture diagram. Labels: Phone, Tablet, PC; API; Records Management System; X-Format; Protocol of submission; Archive Aggregator / RMS; Conformance Suite; Model Validation (twice). RDF annotations added: SOF; RDF standards for serialisation; RDFCL (twice); ??.]
Continuous Archive Submission
Use another RDF-centric protocol called SDShare (W3C group)
Syndication and synchronisation of RDF repositories
Used to expose data from all kinds of systems as RDF
Pull-based approach provides high-quality system-level properties.
Pushing to web service endpoints is very broken.
Make use of models and constraints
Aggregator can make use of the constraints to validate arriving data
Publishers 'just' publish the data as RDF
No special interchange format needed, no model-to-syntax mappings required.
Benefits
Multiple subscribers
Continuous
Robust in the face of downtime, due to the pull-based model
RDF deals with partial submission: e.g. if the record is closed but the aggregation isn't, send the record with a reference to the aggregation, and send the aggregation later. This works thanks to merging and URIs for identifiers.
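The pull model and partial submission can be illustrated with a toy publisher and subscriber. This is deliberately not the SDShare wire format, which is not described on these slides; the classes Publisher and Archive, the ex:parent property and the integer feed position are all assumptions chosen to keep the sketch small.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


class Publisher:
    """A records system that exposes an ordered list of changed fragments."""

    def __init__(self) -> None:
        self.feed: List[Set[Triple]] = []

    def publish(self, fragment: Set[Triple]) -> None:
        self.feed.append(fragment)

    def changes_since(self, position: int) -> List[Set[Triple]]:
        return self.feed[position:]


class Archive:
    """A subscriber that pulls when it is able to and remembers how far it got."""

    def __init__(self) -> None:
        self.store: Set[Triple] = set()
        self.position = 0

    def pull(self, publisher: Publisher) -> None:
        for fragment in publisher.changes_since(self.position):
            self.store |= fragment  # RDF merge: set union
            self.position += 1

    def dangling_references(self) -> Set[str]:
        """Parents that are referenced but not yet described in the archive."""
        subjects = {s for (s, _, _) in self.store}
        return {o for (_, p, o) in self.store
                if p == "ex:parent" and o not in subjects}


rms = Publisher()
archive = Archive()

# The record is closed first and references an aggregation not yet submitted.
rms.publish({("ex:record/1", "rdf:type", "ex:Record"),
             ("ex:record/1", "ex:parent", "ex:agg/1")})
archive.pull(rms)
pending = archive.dangling_references()

# The aggregation arrives later; the shared URI joins the two fragments.
rms.publish({("ex:agg/1", "rdf:type", "ex:Aggregation")})
archive.pull(rms)
print(pending, archive.dangling_references())
```

If the archive is down, nothing is lost: the next pull simply resumes from its stored position, and several archives can pull from the same publisher independently.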
Records Management and Archive Systems Meta-Architecture
[Figure: meta-architecture diagram. Labels: Phone, Tablet, PC; API; Records Management System; X-Format; Protocol of submission; Archive Aggregator / RMS; Conformance Suite; Model Validation (twice). RDF annotations added: SOF; RDF standards for serialisation; RDFCL (twice); SDShare.]
Resources
Produced several artefacts:
– Research report
– These slides
– RDFCL draft
– RMIL draft
– Semantic Operations Framework draft
– RMIL schemas for Noark5 and MoReq2010 core data models
– RDF N-Triples files for RDFCL constraints and RDFS declarations for Noark5 and MoReq2010.
Summary
RDF looks like a very promising technology for this space.
It looks possible to define records management and archive submission processes using RDF and its related family of standards.
In this process everything becomes data and queries, and is completely formal.
Some next steps?
Create and publish RDF models for MoReq2010 and Noark5
– Allow these as valid interchange formats for Noark and MoReq2010
Define an RDF-centric standard for records management and archive submissions based on MoReq2010 and Noark?
Open source implementations of RDF RMS based on these concepts
Evaluate SDShare as submission process
W3C to adopt RDFCL and SOF