From RDA Data Citation Recommendations to new paradigms for citing data

SLIDE 1

From RDA Data Citation Recommendations to new paradigms for citing data from VAMDC

C.M. Zwölf and VAMDC consortium

Heidelberg, June 2016

SLIDE 2

Importance of citation in building new knowledge

Citation is a key element in the production of new knowledge:

Gives credit to the author of the intellectual product cited

SLIDE 3

Importance of citation in building new knowledge

Citation is a key element in the production of new knowledge:

Gives credit to the author of the intellectual product cited
Makes the processes described in the citing article reproducible

SLIDE 4

Importance of citation in building new knowledge

Citation is a key element in the production of new knowledge:

Gives credit to the author of the intellectual product cited
Makes the processes described in the citing article reproducible
Enhances trust: new results rest on proven, solid bases, so an author does not need to prove again a result already used

SLIDE 5

Importance of citation in building new knowledge

Citation is a key element in the production of new knowledge:

Gives credit to the author of the intellectual product cited
Makes the processes described in the citing article reproducible
Enhances trust: new results rest on proven, solid bases, so an author does not need to prove again a result already used

The citation model adopted today works well for papers. It cannot be easily transposed to the citation of digital data…

SLIDE 6

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

SLIDE 7

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A database may evolve over time:

VALD: Piskunov et al. (1995), Ryabchikova et al. (2015)
Basecol: Dubernet et al. (2006), Dubernet et al. (2013)

SLIDE 8

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A database may evolve over time:

VALD: Piskunov et al. (1995), Ryabchikova et al. (2015)
Basecol: Dubernet et al. (2006), Dubernet et al. (2013)

Some evolutions of these databases (usually minor ones) may not be systematically reported through new publications.

SLIDE 9

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A database may evolve over time:

VALD: Piskunov et al. (1995), Ryabchikova et al. (2015)
Basecol: Dubernet et al. (2006), Dubernet et al. (2013)

Some evolutions of these databases (usually minor ones) may not be systematically reported through new publications.

A huge amount of digital data is used in papers nowadays.

SLIDE 10

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A database may evolve over time:

VALD: Piskunov et al. (1995), Ryabchikova et al. (2015)
Basecol: Dubernet et al. (2006), Dubernet et al. (2013)

Some evolutions of these databases (usually minor ones) may not be systematically reported through new publications.

A huge amount of digital data is used in papers nowadays.

The volume of digital data is large and constantly growing.

A given survey may use thousands of spectroscopic data coming from many experimental and theoretical authors.

SLIDE 11

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A database may evolve over time:

VALD: Piskunov et al. (1995), Ryabchikova et al. (2015)
Basecol: Dubernet et al. (2006), Dubernet et al. (2013)

Some evolutions of these databases (usually minor ones) may not be systematically reported through new publications.

A huge amount of digital data is used in papers nowadays.

The volume of digital data is large and constantly growing.

A given survey may use thousands of spectroscopic data coming from many experimental and theoretical authors.

It is impossible to effectively cite the origin of thousands of data with the required fine-grained granularity.

SLIDE 12

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A huge amount of digital data is used in papers nowadays.

Citation of data is incompatible with the classic hand-made citation mechanisms.

SLIDE 13

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A huge amount of digital data is used in papers nowadays.

Citation of data is incompatible with the classic hand-made citation mechanisms.

The survey by [Ginard et al. (2012)] covers frequencies from 83302 MHz to 262404 MHz, detecting emission from about 36 species:

They used catalogues from two public databases, [Pickett et al. (1998)] and [Müller et al. (2005)], and a private communication from J. Cernicharo.
There is no knowledge of the exact dataset used → their analysis is not reproducible.
There is no citation of the authors who produced the spectroscopic data used in their analysis.
The collisional data are properly cited.
Dozens of papers for collisional data vs. hundreds of papers for spectroscopic data.

SLIDE 14

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A huge amount of digital data is used in papers nowadays.

Citation of data is incompatible with the classic hand-made citation mechanisms.

Track the versioning of data
Have a mechanism to speed up the citation process

SLIDE 15

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

Is very rapid
Is not systematically reported

A huge amount of digital data is used in papers nowadays.

Citation of data is incompatible with the classic hand-made citation mechanisms.

Track the versioning of data
Have a mechanism to speed up the citation process

Address these issues at the VAMDC federated level (not database by database)

Discuss these issues at the data-community level: we joined the RDA Data Citation Working Group in spring 2014. VAMDC has become one of the RDA use cases.

SLIDE 16

The Research Data Alliance and the Data Citation WG

SLIDE 17

The Research Data Alliance and the Data Citation WG

The RDA recommendations come from standalone databases or warehouses.
VAMDC is a distributed infrastructure, with no central management system.

SLIDE 18

Let us implement the recommendation!

The problem is more anthropological than technical…

Tagging and versioning data

What does data citation really mean?

SLIDE 19

Let us implement the recommendation!

The problem is more anthropological than technical…

Tagging and versioning data

We see technically how to do that
Ok, but what is the data granularity for tagging?
Naturally it is the dataset (A+M data have no meaning outside this given context)
But each data provider defines differently what a dataset is

What does data citation really mean?

SLIDE 20

Let us implement the recommendation!

The problem is more anthropological than technical…

Tagging and versioning data

We see technically how to do that
Ok, but what is the data granularity for tagging?
Naturally it is the dataset (A+M data have no meaning outside this given context)
But each data provider defines differently what a dataset is

What does data citation really mean?

Everyone knows what it is!
Yes, but everyone has their own definition
RDA → cite database records or output files (an extracted data file may have an H-factor)
VAMDC → cite all the papers used for compiling the content of a given output file
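
The VAMDC reading of data citation (cite every paper behind the content of an output file) could be sketched roughly as follows. This is only an illustration under stated assumptions: the Record class, its source_ids field and the paper identifiers are hypothetical and are not taken from XSAMS or from any VAMDC library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One entry of an extracted file (hypothetical, heavily simplified)."""
    kind: str          # e.g. "radiative", "collisional", "state"
    source_ids: tuple  # identifiers of the papers the value comes from


def sources_to_cite(records):
    """Return the distinct source identifiers, in first-seen order."""
    seen = {}  # a dict keeps insertion order and drops duplicates
    for record in records:
        for source_id in record.source_ids:
            seen.setdefault(source_id, None)
    return list(seen)


# Made-up extraction: three records that share some sources.
extracted = [
    Record("radiative", ("paper-A",)),
    Record("radiative", ("paper-A", "paper-B")),
    Record("collisional", ("paper-C",)),
]
print(sources_to_cite(extracted))
```

The point of the sketch is that the citation list is derived from the extracted records themselves, so it does not depend on how a given provider defines a dataset.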

SLIDE 21

Let us focus on the data tagging/versioning issue:

We adopted a change of paradigm (weak structuration):

Output XSAMS file

Radiative process 1, Radiative process 2, …, Radiative process n
Collisional process 1, Collisional process 2, …, Collisional process m
Energy State 1, Energy State 2, …, Energy State p
Element 1, Element 2, …, Element q

SLIDE 22

Let us focus on the data tagging/versioning issue:

We adopted a change of paradigm:

Output XSAMS file

Radiative process 1, Radiative process 2, …, Radiative process n
Collisional process 1, Collisional process 2, …, Collisional process m
Energy State 1, Energy State 2, …, Energy State p
Element 1, Element 2, …, Element q

Version 1 (tagged according to infrastructure state & updates)
Version 2 (tagged according to infrastructure state & updates)
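
One way to picture version tagging driven by the infrastructure state is the minimal sketch below. It assumes that tags are attached to individual records and that an update appends a new version instead of overwriting the old one; the slide does not specify either point. The class name, method names and the sample value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Keeps every version of every record (hypothetical sketch)."""
    infrastructure_version: int = 1
    # record id -> list of (version tag, value), oldest first
    _history: dict = field(default_factory=dict)

    def put(self, record_id, value):
        """Add or update a record, tagging it with the current state."""
        self._history.setdefault(record_id, []).append(
            (self.infrastructure_version, value)
        )

    def new_release(self):
        """Advance the infrastructure state; later updates get the new tag."""
        self.infrastructure_version += 1

    def get(self, record_id, as_of=None):
        """Return (version, value) as visible at infrastructure state as_of."""
        if as_of is None:
            as_of = self.infrastructure_version
        visible = [
            (version, value)
            for version, value in self._history.get(record_id, [])
            if version <= as_of
        ]
        if not visible:
            raise KeyError(record_id)
        return visible[-1]  # the latest version not newer than as_of


store = VersionedStore()
store.put("radiative-process-1", {"value": 500.0})  # illustrative number
store.new_release()
store.put("radiative-process-1", {"value": 500.1})  # a later correction
print(store.get("radiative-process-1", as_of=1))
print(store.get("radiative-process-1"))
```

Because old versions stay addressable, an extraction made under an earlier state can still be reproduced after the record has been corrected.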

SLIDE 24

Let us focus on the data tagging/versioning issue:

We adopted a change of paradigm:

This approach has several advantages:

It solves the data tagging granularity problem
It is independent of what is considered a dataset
The new files are compatible with old libraries and processing programs
We add a new feature, an overlay on the existing structure
We induce a structuration without changing the structure (weak structuration)

SLIDE 25

Let us focus on the data tagging/versioning issue:

We adopted a change of paradigm:

This approach has several advantages:

It solves the data tagging granularity problem
It is independent of what is considered a dataset
The new files are compatible with old libraries and processing programs
We add a new feature, an overlay on the existing structure
We induce a structuration without changing the structure (weak structuration)

Technical details are described in "New model for datasets citation and extraction reproducibility in VAMDC", C.M. Zwölf, N. Moreau, M.-L. Dubernet, in press, J. Mol. Spectrosc. (2016), http://dx.doi.org/10.1016/j.jms.2016.04.009
arXiv version: https://arxiv.org/abs/1606.00405

SLIDE 26

Let us focus on the query store:

The difficulties we have to cope with:

Handle a query store in a distributed environment (RDA did not design it for these configurations).
Integrate the query store with the existing VAMDC infrastructure.

SLIDE 27

Let us focus on the query store:

The difficulties we have to cope with:

Handle a query store in a distributed environment (RDA did not design it for these configurations).
Integrate the query store with the existing VAMDC infrastructure.

The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe.

Development will start during spring 2016.
Final product released during 2017.

SLIDE 28

Let us focus on the query store:

The difficulties we have to cope with:

Handle a query store in a distributed environment (RDA did not design it for these configurations).
Integrate the query store with the existing VAMDC infrastructure.

The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe.

Development will start during spring 2016.
Final product released during 2017.

Collaboration with Elsevier for embedding the VAMDC query store into the pages displaying the digital version of papers. Designing technical solutions for:

Paper/data linking at paper submission (for authors)
Paper/data linking at paper display (for readers)

SLIDE 29

Let us focus on the query store:

Sketching how it works:

Data extraction procedure

VAMDC portal (query interface) → Query → VAMDC infrastructure
VAMDC infrastructure → Computed response → VAMDC portal (result part)

Access to the output data file
Digital Unique Identifier associated with the current extraction

SLIDE 30

Let us focus on the query store:

Sketching how it works:

Data extraction procedure

VAMDC portal (query interface) → Query → VAMDC infrastructure
VAMDC infrastructure → Computed response → VAMDC portal (result part)

Access to the output data file
Digital Unique Identifier associated with the current extraction

Digital Unique Identifier → Resolves → Landing Page (query metadata held in the Query Store):

The original query
Date & time when the query was processed
Version of the infrastructure when the query was processed
List of publications needed for answering the query
When supported (by the VAMDC federated DB): retrieve the output data file as it was computed (query re-execution)

SLIDE 31

Let us focus on the query store:

Sketching how it works:

Data extraction procedure

VAMDC portal (query interface) → Query → VAMDC infrastructure
VAMDC infrastructure → Computed response → VAMDC portal (result part)

Access to the output data file
Digital Unique Identifier associated with the current extraction

Digital Unique Identifier → Resolves → Landing Page (query metadata held in the Query Store):

The original query
Date & time when the query was processed
Version of the infrastructure when the query was processed
List of publications needed for answering the query
When supported (by the VAMDC federated DB): retrieve the output data file as it was computed (query re-execution)

Manage queries (with authorisation/authentication)
Group arbitrary sets of queries (with their related DUIs) and assign them a DOI to use in publications
Use the DOI in papers

SLIDE 32

Let us focus on the query store:

Sketching how it works:

Data extraction procedure

VAMDC portal (query interface) → Query → VAMDC infrastructure
VAMDC infrastructure → Computed response → VAMDC portal (result part)

Access to the output data file
Digital Unique Identifier associated with the current extraction

Digital Unique Identifier → Resolves → Landing Page (query metadata held in the Query Store):

The original query
Date & time when the query was processed
Version of the infrastructure when the query was processed
List of publications needed for answering the query
When supported (by the VAMDC federated DB): retrieve the output data file as it was computed (query re-execution)

Manage queries (with authorisation/authentication)
Group arbitrary sets of queries (with their related DUIs) and assign them a DOI to use in publications
Use the DOI in papers

Editors may follow the citation pipeline: credit delegation applies
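
The query store behaviour sketched on this slide can be illustrated with a small in-memory model. This is a hedged sketch, not the VAMDC implementation: the class and method names are hypothetical, the identifier is a random UUID standing in for the Digital Unique Identifier, and the group identifier is only a placeholder for a DOI, which in practice would have to be minted through a registration agency. Authorisation/authentication and query re-execution are left out.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QueryMetadata:
    """What the landing page shows for one extraction."""
    query: str
    processed_at: datetime
    infrastructure_version: str
    publications: tuple


class QueryStore:
    """In-memory stand-in for a query store (hypothetical)."""

    def __init__(self):
        self._queries = {}  # identifier -> QueryMetadata
        self._groups = {}   # group id -> tuple of identifiers

    def register(self, query, infrastructure_version, publications):
        """Record a processed query and return its unique identifier."""
        dui = str(uuid.uuid4())
        self._queries[dui] = QueryMetadata(
            query=query,
            processed_at=datetime.now(timezone.utc),
            infrastructure_version=infrastructure_version,
            publications=tuple(publications),
        )
        return dui

    def resolve(self, dui):
        """Return the metadata behind an identifier (the landing page)."""
        return self._queries[dui]

    def group(self, duis):
        """Bundle several queries under one citable placeholder id."""
        missing = [dui for dui in duis if dui not in self._queries]
        if missing:
            raise KeyError(missing)
        group_id = "group-" + uuid.uuid4().hex  # placeholder, not a real DOI
        self._groups[group_id] = tuple(duis)
        return group_id

    def publications_for_group(self, group_id):
        """Distinct publications behind all queries of a group."""
        seen = {}
        for dui in self._groups[group_id]:
            for publication in self._queries[dui].publications:
                seen.setdefault(publication, None)
        return list(seen)


query_store = QueryStore()
# The query strings and paper identifiers below are illustrative only.
first = query_store.register("query-1", "v1", ["paper-A"])
second = query_store.register("query-2", "v1", ["paper-A", "paper-B"])
bundle = query_store.group([first, second])
print(query_store.resolve(first).query)
print(query_store.publications_for_group(bundle))
```

A paper would then cite the single group identifier, and a reader or editor could follow it back to each query and to the publications behind it.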

SLIDE 33

Final remarks:

Our aims:
Provide the VAMDC infrastructure with an operational query store
Share our experience with other data providers
Provide data providers with a set of libraries/tools/methods for an easy implementation of a query store.
We will try to build a generic query store (i.e. using generic software blocks)

SLIDE 34