From RDA Data Citation Recommendations to new paradigms for citing data

slide-1
SLIDE 1

From RDA Data Citation Recommendations to new paradigms for citing data from VAMDC

C.M. Zwölf and VAMDC consortium

Heidelberg, June 2016

slide-5
SLIDE 5

Importance of citation in building new knowledge

Citation is a key element in the production of new knowledge:

  • It gives credit to the author of the intellectual product cited.
  • It makes the processes described in the citing article reproducible.
  • It enhances trust: new results rest on proven, solid bases, so an author does not need to prove again a result that is reused.

The citation model adopted today works well for papers. It cannot be easily transposed to the citation of digital data…

slide-11
SLIDE 11

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

  • Is very rapid
  • Is not systematically reported

A database may evolve over time, for example:

  • VALD: Piskunov et al. (1995), Ryabchikova et al. (2015)
  • Basecol: Dubernet et al. (2006), Dubernet et al. (2013)

Some evolutions (usually minor) of these databases may not be systematically reported through new publications.

A huge number of digital data are used nowadays in papers:

  • The volume of digital data is large and constantly growing.
  • A given survey may use thousands of spectroscopic data coming from many experimental/theoretical authors.
  • It is impossible to effectively cite the origin of thousands of data with the required fine-grained granularity.

slide-13
SLIDE 13

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

  • Is very rapid
  • Is not systematically reported

A huge number of digital data are used nowadays in papers.

Citation of data is incompatible with the classic hand-made citation mechanisms.

The survey by [Ginard et al. (2012)] covers frequencies from 83302 MHz to 262404 MHz, detecting emission from about 36 species:

  • They used catalogues from two public databases, [Pickett et al. (1998)] and [Müller et al. (2005)], and a private communication from J. Cernicharo.
  • There is no knowledge of the exact dataset used → their analysis is not reproducible.
  • There is no citation of the authors who produced the spectroscopic data used in their analysis.
  • The collisional data are properly cited.
  • Dozens of papers for collisional data vs. hundreds of papers for spectroscopic data.

slide-15
SLIDE 15

Issues in data citation: the case of atomic and molecular data

The evolution of digital data:

  • Is very rapid
  • Is not systematically reported

A huge number of digital data are used nowadays in papers.

Citation of data is incompatible with the classic hand-made citation mechanisms.

Two needs follow:

  • Tracking the versioning of data
  • Having a mechanism to speed up the citation process

Our approach:

  • Address these issues at the VAMDC federated level (not database by database).
  • Discuss these issues at the data-community level: we joined the RDA Data Citation Working Group in spring 2014. VAMDC has become one of the RDA use cases.

slide-17
SLIDE 17
The Research Data Alliance and the Data Citation WG

  • The RDA recommendations come from standalone databases or warehouses.
  • VAMDC is a distributed infrastructure, with no central management system.

slide-20
SLIDE 20

Let us implement the recommendation!

The problem is more anthropological than technical…

Tagging and versioning data:

  • We see technically how to do that.
  • OK, but what is the data granularity for tagging?
  • Naturally it is the dataset (A+M data have no meaning outside this given context).
  • But each data provider defines differently what a dataset is.

What does data citation really mean?

  • Everyone knows what it is!
  • Yes, but everyone has their own definition.
  • RDA → cite database records or output files (an extracted data file may have an H-factor).
  • VAMDC → cite all the papers used for compiling the content of a given output file.
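
The VAMDC reading of citation (cite every paper behind the content of an output file) can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout, the `sources` key and the paper labels are hypothetical and are not taken from the VAMDC standards.

```python
from collections import Counter


def sources_for_output(records):
    """Return the papers behind an output file and how many records each supports.

    `records` is a list of dicts with a hypothetical "sources" key listing
    the papers a given record was compiled from.
    """
    counts = Counter()
    for record in records:
        # dict.fromkeys removes duplicates inside one record, keeping order
        for source in dict.fromkeys(record["sources"]):
            counts[source] += 1
    # Counter keeps keys in first-insertion order, i.e. order of first use
    return list(counts), counts


# Made-up records standing in for the content of one extracted file
records = [
    {"id": "R1", "sources": ["Paper A"]},
    {"id": "R2", "sources": ["Paper B", "Paper A"]},
    {"id": "S1", "sources": ["Paper B", "Paper B"]},
]
bibliography, usage = sources_for_output(records)
```

Working from the records actually present in the file, rather than from the database as a whole, is what gives the fine-grained credit that hand-made citation cannot provide for thousands of data.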

slide-23
SLIDE 23

Let us focus on the data tagging/versioning issue:

We adopted a change of paradigm:

[Diagram: an output XSAMS file containing radiative processes 1 to n, collisional processes 1 to m, energy states 1 to p and elements 1 to q. Items in the file are labelled Version 1 or Version 2, each version being tagged according to infrastructure state & updates.]

slide-25
SLIDE 25

Let us focus on the data tagging/versioning issue:

We adopted a change of paradigm. This approach has several advantages:

  • It solves the data tagging granularity problem.
  • It is independent of what is considered a dataset.
  • The new files are compliant with old libraries and processing programs.
  • We add a new feature, an overlay to the existing structure.
  • We induce a structuration without changing the structure (weak structuration).

Technical details are described in: C.M. Zwölf, N. Moreau, M.-L. Dubernet, "New model for datasets citation and extraction reproducibility in VAMDC", in press, J. Mol. Spectrosc. (2016), http://dx.doi.org/10.1016/j.jms.2016.04.009. arXiv version: https://arxiv.org/abs/1606.00405
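
The "overlay" idea (add version tags without changing the structure, so that old programs keep working) can be illustrated with a small Python sketch. The element names, the `id` and `version` attributes and the sample document are all assumptions made for the example; they are not the actual XSAMS schema or the mechanism described in the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an output file
SAMPLE = """<Output>
  <RadiativeProcess id="R1"/>
  <RadiativeProcess id="R2"/>
  <EnergyState id="S1"/>
</Output>"""


def tag_versions(root, versions, default):
    """Add a version attribute to every identified element, in place.

    No element is added, removed or moved: the tag is an overlay.
    """
    for element in root.iter():
        element_id = element.get("id")
        if element_id is not None:
            element.set("version", versions.get(element_id, default))
    return root


def legacy_reader(root):
    """A reader written before versioning existed: it only looks at ids."""
    return [e.get("id") for e in root.iter() if e.get("id") is not None]


root = ET.fromstring(SAMPLE)
before = legacy_reader(root)
tag_versions(root, {"R2": "2"}, default="1")
after = legacy_reader(root)  # unchanged: the old reader ignores the new attribute
```

Because the tag sits on each item rather than on a file or a dataset, the question of what a dataset is never has to be answered.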

slide-28
SLIDE 28

Let us focus on the query store:

The difficulties we have to cope with:

  • Handling a query store in a distributed environment (RDA did not design it for these configurations).
  • Integrating the query store with the existing VAMDC infrastructure.

The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe:

  • Development will start during spring 2016.
  • The final product will be released during 2017.

Collaboration with Elsevier for embedding the VAMDC query store into the pages displaying the digital version of papers. We are designing technical solutions for:

  • Paper/data linking at paper submission (for authors)
  • Paper/data linking at paper display (for readers)
slide-32
SLIDE 32

Let us focus on the query store:

Sketching how it works (data extraction procedure):

  • A query is submitted through the VAMDC portal (query interface) to the VAMDC infrastructure.
  • The VAMDC portal (result part) shows the computed response: access to the output data file, and a Digital Unique Identifier (DUI) associated with the current extraction.
  • The DUI resolves to a landing page, which exposes the query metadata kept in the query store:
      • The original query
      • Date and time at which the query was processed
      • Version of the infrastructure when the query was processed
      • List of publications needed for answering the query
      • When supported (by the VAMDC federated DB): retrieval of the output data file as it was computed (query re-execution)
  • Queries can be managed (with authorisation/authentication).
  • An arbitrary set of queries (with their related DUIs) can be grouped and assigned a DOI to use in publications.
  • Editors may follow the citation pipeline: credit delegation applies.
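
The pipeline above can be sketched as a toy, in-memory query store in Python. It is a minimal illustration, not the VAMDC implementation: the class and field names, the whitespace-only query normalisation, the checksum and the caller-supplied DOI string are all assumptions, and real DOI minting goes through a registration agency.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class QueryRecord:
    """Metadata shown on the landing page (field names are hypothetical)."""
    query: str                   # the original query
    processed_at: datetime       # date and time of processing
    infrastructure_version: str  # infrastructure version at that time
    publications: tuple          # publications needed to answer the query
    output_checksum: str         # fingerprint of the output data file
    dui: str = field(default_factory=lambda: uuid.uuid4().hex)


class QueryStore:
    def __init__(self):
        self._records = {}  # DUI -> QueryRecord
        self._groups = {}   # DOI -> tuple of DUIs

    def register(self, query, infrastructure_version, publications, output_bytes):
        """Store one extraction and return its DUI."""
        record = QueryRecord(
            query=" ".join(query.split()),  # naive whitespace normalisation
            processed_at=datetime.now(timezone.utc),
            infrastructure_version=infrastructure_version,
            publications=tuple(publications),
            output_checksum=hashlib.sha256(output_bytes).hexdigest(),
        )
        self._records[record.dui] = record
        return record.dui

    def landing_page(self, dui):
        """Resolve a DUI to its stored metadata."""
        return self._records[dui]

    def group(self, doi, duis):
        """Attach a set of registered DUIs to a DOI supplied by the caller."""
        unknown = [dui for dui in duis if dui not in self._records]
        if unknown:
            raise KeyError(f"unknown DUI(s): {unknown}")
        self._groups[doi] = tuple(duis)

    def publications_for(self, doi):
        """Follow DOI -> DUIs -> publications, without duplicates."""
        seen = {}
        for dui in self._groups[doi]:
            for publication in self._records[dui].publications:
                seen.setdefault(publication, None)
        return list(seen)


store = QueryStore()
first = store.register("SELECT  ALL WHERE example = 1", "v1", ["Paper A"], b"<output 1/>")
second = store.register("SELECT ALL WHERE example = 2", "v1", ["Paper A", "Paper B"], b"<output 2/>")
store.group("doi:placeholder", [first, second])
cited = store.publications_for("doi:placeholder")
```

A paper then cites the single DOI, and anyone following it reaches the individual extractions and, through them, the publications that deserve the credit.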

slide-33
SLIDE 33

Final remarks:

  • Our aims:
      • Provide the VAMDC infrastructure with an operational query store
      • Share our experience with other data providers
      • Provide data providers with a set of libraries/tools/methods for an easy implementation of a query store
  • We will try to build a generic query store (i.e. using generic software blocks)
