SLIDE 1

LINKED DATA EXPERIENCE AT MACMILLAN

Building discovery services for scientific and scholarly content on top of a semantic data model

22 October 2014

Tony Hammond Michele Pasin

SLIDE 2

Linked Data at Macmillan | 22 October 2014

1 Background

About Macmillan and what we are doing

SLIDE 3

Macmillan Science and Education

Group brands and businesses

SLIDE 4

MS&E Current trends

Change Drivers

Digital first workflow

– print becomes secondary
– support for multiple workflows

User-centric design

– things, not data
– focus on user experience

Deeply integrated datasets

– standard naming convention
– common metadata model
– flexible schema management
– rich dataset descriptions

Developing a richer graph of objects

SLIDE 5

NPG Linked Data Platform (2012)

Deliverables (2012–2014)

Prototype for external use
Two RDF dataset releases in 2012

– April 2012 (22m triples)
– July 2012 (270m triples)

Live updates to query endpoint
SPARQL query service (decommissioned)

Current Work (2014–)

Focus on internal use-cases
Publish ontology pages
Periodic data snapshots

data.nature.com

SLIDE 6

NPG Core Ontology (2014)

Features

Classes: ~65
Properties: ~200
Named graphs (per class)

Namespaces

npg: => http://ns.nature.com/terms/
npgg: => http://ns.nature.com/graphs/
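As a minimal sketch of how the two namespace prefixes above resolve to full IRIs, the following Python expands a prefixed name against the prefix map from the slide. The local names `Article` and `articles` are hypothetical, chosen only for illustration; the slide does not list the actual class or graph names.

```python
# Prefix map copied from the slide. The local names used in the
# usage example (Article, articles) are hypothetical.
PREFIXES = {
    "npg": "http://ns.nature.com/terms/",
    "npgg": "http://ns.nature.com/graphs/",
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'npg:Article' into a full IRI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


# Usage: a class term and the named graph assumed to hold its instances.
class_iri = expand("npg:Article")
graph_iri = expand("npgg:articles")
```

Keeping terms and graphs in separate namespaces means a graph name can never collide with a class or property name, which fits the "named graphs (per class)" feature listed above.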

Approach

Incremental formalization (RDF, RDFS, OWL-DL)
Shared metamodel vs. automatic inference
Minimal commitment to external vocabs

Things: assets, documents, events, types

SLIDE 7

NPG Subject Pages (2014)

Features

Based on SKOS taxonomy

– >2500 scientific terms
– content inherited via SKOS tree

Dynamically generated

– one webpage per subject term
– secondary pages for article types

Various formats, e.g. e-alerts, feeds

– allows people to ‘follow’ a subject

Customized related content

– ads, jobs, events, etc.
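The idea of content being inherited via the SKOS tree can be sketched as follows: an article tagged with a narrow term also appears on the page of every broader term. The taxonomy, term names, and article identifiers here are all hypothetical, and the sketch assumes each term has a single broader term (a strict tree), which SKOS itself does not require.

```python
from collections import defaultdict

# Hypothetical taxonomy: term -> its broader (parent) term.
BROADER = {
    "genetics": "biological-sciences",
    "genomics": "genetics",
    "epigenomics": "genomics",
}

# Hypothetical tagging: article id -> subject terms assigned to it.
ARTICLE_SUBJECTS = {
    "article-1": ["epigenomics"],
    "article-2": ["genetics"],
    "article-3": ["biological-sciences"],
}


def ancestors(term):
    """Yield the term itself, then each broader term up to the root."""
    seen = set()
    while term is not None and term not in seen:
        seen.add(term)
        yield term
        term = BROADER.get(term)


def build_subject_pages(article_subjects):
    """Map every subject term to the articles its page should list."""
    pages = defaultdict(set)
    for article, terms in article_subjects.items():
        for term in terms:
            for inherited in ancestors(term):
                pages[inherited].add(article)
    return {term: sorted(articles) for term, articles in pages.items()}


PAGES = build_subject_pages(ARTICLE_SUBJECTS)
```

With this data, the page for `genetics` lists `article-1` and `article-2`, while the root term `biological-sciences` lists all three articles.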

Topical access to content

SLIDE 8

2 Data Storage and Query

Achieving speed by means of a hybrid architecture

SLIDE 9

Content Hub

Capabilities

Discovery – Graph
Storage – Content Repos

Features

Hybrid RDF + XML architecture

– MarkLogic for XML, RDF/XML
– Triplestore (TDB) for RDF validation

Repos for binary assets

Datasets

Documents (large; >1m)
Ontologies (small; <10k)

Managed content warehouse for data discovery

SLIDE 10

System Architecture

Hub content

SLIDE 11

Content Discovery – Principles

Generations

1st – Generic linked data API (RDF/*)
2nd – Specific page model API (JSON)

Concerns

Speed (20 ms single object; 200 ms filtered object)
Simplicity (data construction)
Stability (backup, clustering, security, transactions)

Principles

Chunky not chatty, all data in a single response
Data as consumed, rather than as stored
Support common use cases in simple, obvious ways
Ensure a guaranteed, consistent speed of response for more complex queries
Build on foundation of standard, pragmatic REST (collections, items)
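To illustrate "chunky not chatty" and "data as consumed", here is a minimal sketch of a page-model response that returns everything a subject page needs in one JSON document, so the client makes a single request. The in-memory store, field names, and the `subject_page` function are assumptions for illustration, not the API described in the talk.

```python
import json

# Hypothetical in-memory store standing in for the content hub.
ARTICLES = {
    "a1": {"title": "First article", "subject": "genetics", "published": "2014-10-01"},
    "a2": {"title": "Second article", "subject": "genetics", "published": "2014-10-15"},
    "a3": {"title": "Third article", "subject": "physics", "published": "2014-09-20"},
}


def subject_page(subject: str, limit: int = 10) -> str:
    """Return the whole page model for one subject as a JSON string."""
    items = [
        {"id": article_id, **fields}
        for article_id, fields in ARTICLES.items()
        if fields["subject"] == subject
    ]
    # ISO 8601 dates sort correctly as plain strings; newest first.
    items.sort(key=lambda item: item["published"], reverse=True)
    page = {
        "subject": subject,      # the item being viewed
        "total": len(items),     # count for the whole collection
        "articles": items[:limit],  # the slice the page actually renders
    }
    return json.dumps(page)
```

The response is shaped for the page that consumes it (subject, count, ordered article list) rather than mirroring how the data is stored.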

Readying the API for applications

SLIDE 12

Content Discovery – Optimization

Approaches

TDB + Fuseki – SPARQL
MarkLogic Semantics – SPARQL
MarkLogic – XQuery
MarkLogic (Optimized) – XQuery

Techniques

Partitioning – RDF/XML objects
Streaming – serialization
Hashing – dictionary lookup
Caching – Varnish
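The slide does not say how hashing and caching were implemented. As one possible reading, the sketch below keeps responses in a dictionary keyed by a hash of the normalized request, which is roughly what an HTTP cache such as Varnish does in front of an API. The `ResponseCache` class, `cache_key` function, and the backend callable are hypothetical, in-process stand-ins.

```python
import hashlib
from urllib.parse import urlencode


def cache_key(path, params):
    """Hash the path plus sorted query parameters into a stable key."""
    query = urlencode(sorted(params.items()))
    return hashlib.sha256(f"{path}?{query}".encode("utf-8")).hexdigest()


class ResponseCache:
    """Dictionary-backed cache in front of a slower backend callable."""

    def __init__(self, backend):
        self._backend = backend
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, path, params):
        key = cache_key(path, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: ask the backend once and remember the answer.
        self.misses += 1
        response = self._backend(path, params)
        self._store[key] = response
        return response
```

Sorting the parameters before hashing means two requests that differ only in parameter order share one cache entry.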

Tuning the API for performance

SLIDE 13

Content Storage – Layout and Indexing

Challenges

Sort orders
RDF Lists
Faceting, counting

Layout

Semantic RDF/XML includes in XML
RDF objects serialized in list order
Application XML for subject hierarchy

Indexes

Indexes over all elements
Range indexes for datatypes (e.g. datetimes)
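A range index over a typed value such as a datetime can be sketched as a sorted list searched by bisection, so a date-range query never scans every document. This is only a conceptual illustration in Python; the `RangeIndex` class and document ids are hypothetical and do not reflect how MarkLogic implements its range indexes.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


class RangeIndex:
    """Sorted (datetime, doc_id) entries supporting inclusive range queries."""

    def __init__(self, docs):
        # docs is an iterable of (doc_id, datetime) pairs.
        self._entries = sorted((when, doc_id) for doc_id, when in docs)
        self._keys = [when for when, _ in self._entries]

    def between(self, start, end):
        """Return doc ids with start <= datetime <= end, in date order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return [doc_id for _, doc_id in self._entries[lo:hi]]


# Usage with hypothetical document ids.
index = RangeIndex([
    ("doc-b", datetime(2014, 10, 22)),
    ("doc-a", datetime(2014, 10, 1)),
    ("doc-c", datetime(2014, 11, 5)),
])
october = index.between(datetime(2014, 10, 1), datetime(2014, 10, 31))
```

Because the entries are kept in date order, the same structure also answers the sort-order challenge listed above: results come back already sorted by datetime.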

Readying the data for page delivery

SLIDE 14

In Conclusion

Summary

An RDF metamodel allows for scalable enterprise-level data organization
It is crucial to adequately distinguish between external and internal use cases
A hybrid architecture proved to be an efficient internal solution for content delivery

Future Work

Grow the ontology so that it matches product requirements more closely
Support automated reasoning and richer query options – both RDF and XML based
Maintain and expand the vision of a shared semantic model as a core enterprise asset

A few lessons learned

SLIDE 15

For more information please contact

TONY HAMMOND
Data Architect, Content Data Services
tony.hammond@macmillan.com

MICHELE PASIN
Information Architect, Product Office
michele.pasin@macmillan.com

Thank you