From RDA Data Citation Recommendations to new paradigms for citing data from VAMDC
C.M. Zwölf and VAMDC consortium
Heidelberg, June 2016
Importance of citation in building new knowledge
Citation is a key element in the production of new knowledge:
- It gives credit to the author of the intellectual product cited.
- It makes the processes described in the citing article reproducible.
- It enhances trust: new results rest on proven, solid bases, so an author does not need to prove again a result already in use.
The citation model adopted today works well for papers. It cannot easily be transposed to the citation of digital data…
Issues in data citation: the case of atomic and molecular data
A database may evolve over time:
- VALD: Piskunov et al. (1995), then Ryabchikova et al. (2015)
- BASECOL: Dubernet et al. (2006), then Dubernet et al. (2013)
Some (usually minor) evolutions of these databases are not systematically reported in new publications.
The evolution of digital data:
- Is very rapid
- Is not systematically reported
A huge amount of digital data is used in papers nowadays:
- The volume of digital data is large and constantly growing.
- A given survey may use thousands of spectroscopic data coming from many experimental/theoretical authors.
- It is impossible to effectively cite the origin of thousands of data with the required fine-grained granularity.
Issues in data citation: the case of atomic and molecular data
Citation of data is incompatible with the classic, hand-made citation mechanisms.
The survey by [Ginard et al. (2012)] covers frequencies from 83302 MHz to 262404 MHz, detecting emission from about 36 species:
- They used catalogues from two public databases, [Pickett et al. (1998)] and [Müller et al. (2005)], and a private communication from J. Cernicharo.
- The exact dataset used is not known, so their analysis is not reproducible.
- The authors who produced the spectroscopic data used in their analysis are not cited.
- The collisional data are properly cited.
- Dozens of papers for collisional data vs. hundreds of papers for spectroscopic data.
Issues in data citation: the case of atomic and molecular data
We need to:
- Track the versioning of data.
- Have a mechanism to speed up the citation process.
How we proceed:
- Address these issues at the VAMDC federated level (not database by database).
- Discuss these issues at the data-community level: in spring 2014 we joined the RDA Data Citation Working Group. VAMDC has become one of the RDA use cases.
The Research Data Alliance and the Data Citation WG
- The RDA recommendations come from standalone databases or warehouses.
- VAMDC is a distributed infrastructure, with no central management system.
Let us implement the recommendation!
The problem is more anthropological than technical…
Tagging and versioning data:
- We see technically how to do that.
- OK, but what is the data granularity for tagging? Naturally it is the dataset (A+M data have no meaning outside this given context).
- But each data provider defines differently what a dataset is.
What does data citation really mean?
- Everyone knows what it is! Yes, but everyone has their own definition.
- RDA: cite database records or output files (an extracted data file may have an H-factor).
- VAMDC: cite all the papers used for compiling the content of a given output file.
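The VAMDC reading of data citation can be sketched in a few lines: walk through the entries of an output file and collect, without duplicates, the papers they were compiled from. This is only an illustration. The `entries` layout, the `sources` field and the `paper-A` style labels are hypothetical and do not reflect the actual XSAMS schema.

```python
def collect_sources(entries):
    """Return the distinct source papers behind a list of entries, in first-seen order."""
    seen = set()
    ordered = []
    for entry in entries:
        for source in entry["sources"]:
            if source not in seen:
                seen.add(source)
                ordered.append(source)
    return ordered


# Hypothetical extracted file: each entry lists the papers it was compiled from.
entries = [
    {"id": "radiative-1", "sources": ["paper-A", "paper-B"]},
    {"id": "radiative-2", "sources": ["paper-B"]},
    {"id": "state-1", "sources": ["paper-C", "paper-A"]},
]

citation_list = collect_sources(entries)
print(citation_list)
```

With thousands of entries the list is still built automatically, which is the point: nobody has to assemble it by hand.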
Let us focus on the data tagging/versioning issue:
We adopted a change of paradigm (weak structuration).
[Diagram: an output XSAMS file containing radiative processes 1 to n, collisional processes 1 to m, energy states 1 to p and elements 1 to q. Its content carries version labels (Version 1, Version 2), each tagged according to infrastructure state and updates.]
Let us focus on the data tagging/versioning issue:
We adopted a change of paradigm. This approach has several advantages:
- It solves the data tagging granularity problem.
- It is independent of what is considered a dataset.
- The new files are compliant with old libraries and processing programs.
- We add a new feature, an overlay on the existing structure.
- We induce a structuration without changing the structure (weak structuration).
Technical details are described in "New model for datasets citation and extraction reproducibility in VAMDC", C.M. Zwölf, N. Moreau, M.-L. Dubernet, in press, J. Mol. Spectrosc. (2016), http://dx.doi.org/10.1016/j.jms.2016.04.009
arXiv version: https://arxiv.org/abs/1606.00405
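To illustrate why an overlay keeps old programs working, here is a minimal sketch. It assumes, purely for illustration, that the version tag is carried as an extra XML attribute; the element names (`File`, `State`, `Energy`) and the `version` attribute are hypothetical and are not the real XSAMS vocabulary. The actual mechanism is the one described in the paper cited above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified file layout; real XSAMS uses different element names.
OLD_FILE = """<File>
  <State id="S1"><Energy>0.0</Energy></State>
  <State id="S2"><Energy>12.5</Energy></State>
</File>"""

# Same structure, plus a version tag added as an extra attribute (the overlay).
NEW_FILE = """<File>
  <State id="S1" version="1"><Energy>0.0</Energy></State>
  <State id="S2" version="2"><Energy>12.5</Energy></State>
</File>"""


def legacy_read_energies(xml_text):
    """An 'old' reader: knows nothing about versions, only reads energies."""
    root = ET.fromstring(xml_text)
    return {s.get("id"): float(s.findtext("Energy")) for s in root.iter("State")}


def read_versions(xml_text):
    """A 'new' reader: also picks up the version tag when it is present."""
    root = ET.fromstring(xml_text)
    return {s.get("id"): s.get("version") for s in root.iter("State")}


old_energies = legacy_read_energies(OLD_FILE)
new_energies = legacy_read_energies(NEW_FILE)
versions = read_versions(NEW_FILE)
```

The legacy reader returns the same energies for both files because it never looks at the added attribute, while the new reader finds a version for every state. The structure is unchanged; only an extra layer of information has been added.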
Let us focus on the query store:
The difficulties we have to cope with:
- Handling a query store in a distributed environment (RDA did not design it for these configurations).
- Integrating the query store with the existing VAMDC infrastructure.
The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe:
- Development will start during spring 2016.
- Final product released during 2017.
Collaboration with Elsevier for embedding the VAMDC query store into the pages displaying the digital version of papers. Designing a technical solution for:
- Paper/data linking at paper submission (for authors)
- Paper/data linking at paper display (for readers)
Let us focus on the query store:
Sketching the functioning of the data extraction procedure:
- A query is sent from the VAMDC portal (query interface) to the VAMDC infrastructure.
- The VAMDC portal (result part) displays the computed response: access to the output data file, and a Digital Unique Identifier (DUI) associated with the current extraction.
- The DUI resolves to a landing page, fed by the query metadata held in the query store.
The landing page gives:
- The original query.
- The date and time at which the query was processed.
- The version of the infrastructure when the query was processed.
- The list of publications needed for answering the query.
- When supported by the VAMDC federated DB: the output data file as it was computed (query re-execution).
The query store also makes it possible to:
- Manage queries (with authorisation/authentication).
- Group an arbitrary set of queries (with their DUIs) and assign the group a DOI to use in publications.
Editors may follow the citation pipeline: credit delegation applies.
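The functioning sketched above can be mimicked by a small in-memory model. This is not the VAMDC implementation: the class and method names are hypothetical, the DUI is assumed to be a random UUID, the query texts are placeholders, and the DOI is assumed to be minted elsewhere and simply attached to a group of queries.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class QueryRecord:
    """Metadata kept for one extraction (the fields shown on the landing page)."""
    dui: str
    query: str
    processed_at: datetime
    infrastructure_version: str
    publications: tuple


@dataclass
class QueryStore:
    records: dict = field(default_factory=dict)
    groups: dict = field(default_factory=dict)

    def register(self, query, infrastructure_version, publications):
        """Store a processed query and return its identifier (a UUID, by assumption)."""
        dui = str(uuid.uuid4())
        self.records[dui] = QueryRecord(
            dui=dui,
            query=query,
            processed_at=datetime.now(timezone.utc),
            infrastructure_version=infrastructure_version,
            publications=tuple(publications),
        )
        return dui

    def resolve(self, dui):
        """Return the metadata a landing page would display for this identifier."""
        return self.records[dui]

    def group(self, doi, duis):
        """Attach an externally minted DOI to an arbitrary set of stored queries."""
        missing = [d for d in duis if d not in self.records]
        if missing:
            raise KeyError(f"unknown identifiers: {missing}")
        self.groups[doi] = list(duis)

    def publications_for(self, doi):
        """All distinct publications behind the queries grouped under a DOI."""
        found = []
        for dui in self.groups[doi]:
            for pub in self.records[dui].publications:
                if pub not in found:
                    found.append(pub)
        return found


store = QueryStore()
first = store.register("query text 1", "infra-v1", ["paper-A", "paper-B"])
second = store.register("query text 2", "infra-v1", ["paper-B", "paper-C"])
store.group("10.0000/placeholder", [first, second])  # placeholder DOI, not a real one
cited = store.publications_for("10.0000/placeholder")
```

A paper would then cite the single DOI, and a reader following it could reach every underlying query together with the publications that produced the data. A distributed deployment raises questions this sketch ignores, such as which node records the query and how infrastructure versions are defined across databases.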
Final remarks:
Our aims:
- Provide the VAMDC infrastructure with an operational query store.
- Share our experience with other data providers.
- Provide data providers with a set of libraries/tools/methods for an easy implementation of a query store.
- We will try to build a generic query store (i.e. using generic software