Data Management: report & news. PaNDaaS WG 2nd meeting @ESRF


SLIDE 1

INSTITUT MAX VON LAUE - PAUL LANGEVIN

Data Management: report & news.

PaNDaaS WG 2nd meeting @ESRF

12th of Dec 2016 Jean-François Perrin (ILL)

SLIDE 2

Experimental data management

Some Results: Dec 2012 – Dec 2016

Co-funded by the European Union: PaNData-Europe (Grant Agreement No 261537), PaNData-ODI (Grant Agreement No RI-283556)

SLIDE 3

What has been done so far?

  • 2008: first discussion on a Data Policy (PaNData)
  • 2011: “Open” Data Policy published, with a 3 (max 5) year embargo
  • 2012: first experiment under the Data Policy
  • 2013: complete set of Data Management Services available for users: search, access, annotate, archive, identify, publish, …
  • Since then: communication with our users …

SLIDE 4

Data Policy revisited

Based on the PaNData framework

Open data & how to protect and credit our users?

  • The facility shall act as a custodian for the data.
  • All raw data will be curated in a well-defined format with a unique ID (DOI).
  • Metadata is captured automatically and resides within the raw data files and/or in an associated on-line catalogue.
  • Users can release or give access to their data at any time. By default, access to raw data, the associated metadata and the analysis data is restricted to the experimental team for a period of 3 years. During the next 2 years, data are available on request. Thereafter, they become publicly accessible.
  • The embargo period can be extended on request to ILL management.
  • Publications based on the data must acknowledge the source of the data and cite its unique identifier (CC-BY licence).
  • The policy also applies to CRG beam time when the ILL data infrastructure is used.

https://www.ill.eu/DataPolicy
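
The default access rules above can be sketched as a small function. This is only an illustration, not the ILL implementation: the function and parameter names are hypothetical, the policy excerpt does not say which date starts the clock (here it is passed in as `embargo_start`), and embargo extensions are not modelled.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def access_level(embargo_start: date, today: date,
                 released_by_team: bool = False,
                 restricted_years: int = 3,
                 on_request_years: int = 2) -> str:
    """Classify a dataset as 'restricted', 'on_request' or 'public'.

    Assumption: the embargo is counted from `embargo_start`; the slides do
    not state the exact reference date.
    """
    if released_by_team:
        # Users can release their data at any time.
        return "public"
    restricted_until = add_years(embargo_start, restricted_years)
    on_request_until = add_years(restricted_until, on_request_years)
    if today < restricted_until:
        return "restricted"
    if today < on_request_until:
        return "on_request"
    return "public"
```

Under these assumptions, a dataset whose embargo started on 1 Dec 2012 would be in the "on request" phase on 12 Dec 2016.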

SLIDE 5

Data portal

  • Tailored to ILL needs
    – User management of data access authorization.
    – Users could decide to publish (open access) their data, before the end of the embargo period.
    – Linked to DOIs.
    – Linked to experimental logs.
    – Linked to user annotation tool.
    – Linked with proposal system.
    – Download of data.
    – Full text search
  • Provide access to data, meta-data, logs, DOIs landing page, …
  • Scientists can contact the experimental team
  • Tools for managing data authZ
  • Grant individual access
  • Release data at any time (non-reversible)

Index all available information: Proposal, experimental report, data file annotation, publications, …
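
The authorization features listed for the portal (individual grants, non-reversible release) can be pictured with a minimal sketch. The class and method names are hypothetical and do not describe the portal's actual code; the only point illustrated is that a release, once made, cannot be undone.

```python
class DatasetAccess:
    """Hypothetical access record for one dataset."""

    def __init__(self, team):
        self._team = frozenset(team)   # experimental team, always allowed
        self._granted = set()          # individually granted users
        self._released = False         # becomes True once, never back

    @property
    def released(self) -> bool:
        # Read-only: there is deliberately no setter and no "unrelease".
        return self._released

    def grant(self, user: str) -> None:
        """Grant individual access to one user."""
        self._granted.add(user)

    def release(self) -> None:
        """Make the dataset public (non-reversible)."""
        self._released = True

    def can_read(self, user: str) -> bool:
        return self._released or user in self._team or user in self._granted
```

For instance, after `acc = DatasetAccess({"alice"})` and `acc.grant("bob")`, only alice and bob can read the data until `acc.release()` is called.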

SLIDE 6

Data Portal results

  • 3 data sets publicly released before the end of the embargo
  • 26 accesses granted to external scientists (peer-review)
  • 0 requests to get access to datasets under embargo (at least through the portal)
  • 760651 data files downloaded (90% by external users), concerning 376 unique datasets

SLIDE 7

DOIs

Collaboration with DataCite/INIST

  • Linking data and people through ORCID/ResearcherID
  • Linking data with publications

SLIDE 8

DOIs communications

We ask our users to cite data sets in the reference section of their articles.

SLIDE 9

Issue #1: Awareness of the scientists

  • This is still new for most scientists: “What are DOIs? What are you talking about?”
  • We (ILL/ISIS) currently feel a bit alone and need to reach critical mass. (ESRF, PSI, ESS … are joining)
  • We need more communication, mentoring, cultural change and education. We need to fill the gap between what we hear in RDA-like meetings and the daily reality of the scientists. We still need to convince scientists that a change is happening regarding experimental data.

SLIDE 10

Issue #2: Difficulty collecting the articles that exploit the experimental data

  • Technical reasons: DOIs in figures instead of references, partial citations …
  • No tools yet available to easily collect references
    – CrossRef cited-by linking (currently only for article publishers, not data publishers?), OpenAire.
    – This is a business for the publishers.
  • Difficulty getting metrics: how successful are we?
    – As of Dec 2016, we have collected fewer than 50 peer-reviewed articles referencing the data DOI.
    – How many are we missing?

We need free access to this information in order to build metrics.
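
As an illustration of what collecting references could look like when the full text of an article is available, the sketch below scans text for ILL data DOIs wherever they appear (reference section or not). The pattern is an assumption inferred from the single data DOI quoted in this deck (doi:10.5291/ILL-DATA.1-02-145); it is not an official specification of ILL identifiers, and it cannot find DOIs embedded in images.

```python
import re

# Assumed shape: prefix 10.5291/ILL-DATA. followed by hyphen-separated tokens.
ILL_DATA_DOI = re.compile(
    r"10\.5291/ILL-DATA\.[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*",
    re.IGNORECASE,
)


def find_ill_data_dois(text: str) -> list:
    """Return the unique ILL data DOIs found in `text`, upper-cased and sorted.

    DOIs are case-insensitive, so matches are normalised to upper case
    before de-duplication.
    """
    return sorted({m.group(0).upper() for m in ILL_DATA_DOI.finditer(text)})
```

Because the suffix only continues across hyphens, a sentence-ending full stop directly after the DOI is not captured.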

SLIDE 11

The citation is in the text, not in the reference section.

It is not easily findable through most search services (WoS, Scopus, …). It is only findable through Google Scholar.

SLIDE 12

Cited in an image instead of …

Not findable at all

SLIDE 13

Data DOI vs article DOI

Should be the DOI of the article, instead of the one of the data.

SLIDE 14

Issue #3: time

  • Time for understanding data & analyses
  • Time for writing articles
  • Time for publishing
  • On our side, time for explaining & convincing

This is by nature a long process. Given the level of investment needed, we need to convince, and we urgently need evidence of success.

SLIDE 15

Results as of Dec 2016

  • The referencing of data sets in scientific articles, through DOIs, has recently been improving.
  • Real interest from the publishers: http://www.elsevier.com/?a=57755
  • More user feedback: “Why don’t I get a DOI for experiment XYZ?”

[Figure: % of ILL users' publications citing the data sets through DOIs, per year, 2012 to 2016; y-axis from 1 to 7 %.]

Scientist name disambiguation:

  • 378 scientist “publication names”
  • 184 ORCID
  • 141 ResearcherID

SLIDE 16

One more issue: other repositories in the middle.

Cite as

  • M. J. Roy. (2016). Contour method and neutron diffraction dataset to determine the weld fusion zone shape on residual stress in submerged arc welding [Data set]. Zenodo. http://doi.org/10.5281/zenodo.165765

Instead of

WITHERS Philip J.; ISHIGAMI Atsushi; PIRLING Thilo; ROY Matthew and WALSH Joanna. (2014). The effect of weld bead shape on residual stress in novel low heat input welding of steel. Institut Laue-Langevin (ILL) doi:10.5291/ILL-DATA.1-02-145

Licence ?
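
To make the expected citation form concrete, here is a minimal sketch that assembles a citation string in the layout of the ILL example on this slide (creators, year, title, publisher, DOI). The function name and its parameters are hypothetical; the layout is copied from the example, not from a formal ILL or DataCite rule.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a citation string laid out like the ILL example on this slide."""
    if len(creators) > 1:
        # Creators separated by ';', with 'and' before the last one.
        names = "; ".join(creators[:-1]) + " and " + creators[-1]
    else:
        names = creators[0]
    return f"{names}. ({year}). {title}. {publisher} doi:{doi}"


citation = format_data_citation(
    creators=["WITHERS Philip J.", "ISHIGAMI Atsushi", "PIRLING Thilo",
              "ROY Matthew", "WALSH Joanna"],
    year=2014,
    title=("The effect of weld bead shape on residual stress in novel "
           "low heat input welding of steel"),
    publisher="Institut Laue-Langevin (ILL)",
    doi="10.5291/ILL-DATA.1-02-145",
)
```

Generating the string from the catalogue metadata, rather than asking users to type it, would be one way to reduce partial citations.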

SLIDE 17

Data Analysis As a Service.

SLIDE 18

Data volume evolution

[Figure: Volume of experimental data / cycle, in TB (axis from 10 to 60), for 2000 to 2017, with a label on 2016-2017; series: Raw (TB), Processed (TB), Forecast (TB).]

Evaluation of new detectors leading to permanent instruments starting from Dec 2016. Moving to list mode (vs. histogram mode).

SLIDE 19

Impacts of the data volume evolution

Example of the EXILL campaign

  • Storage (2 experiments = 70TB)
    – ILL archive capacity & performance
    – Users’ storage becoming almost impossible
  • Moving data
    – Today, how to carry 40TB to 10 different labs?
    – Why carry them?
  • Analysis
    – Almost impossible in most users’ labs with such data sets.
  • But
    – 32 direct (h-index 4) peer reviewed articles published
    – 2 PhD theses
    – 10+ international conferences
    – …
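
A back-of-the-envelope estimate shows why moving 40TB is a problem. The sketch below assumes decimal terabytes (10^12 bytes) and a fully dedicated network link; the link speeds are illustrative assumptions, not measurements from ILL or from users' labs.

```python
def transfer_days(volume_tb: float, link_gbit_s: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move `volume_tb` terabytes over a `link_gbit_s` link.

    `efficiency` is the assumed fraction of the nominal bandwidth that is
    actually achieved (1.0 = ideal, which real transfers rarely reach).
    """
    bits = volume_tb * 1e12 * 8                      # TB -> bits
    seconds = bits / (link_gbit_s * 1e9 * efficiency)
    return seconds / 86400                           # seconds -> days


one_lab = transfer_days(40, 1)        # 40TB over an ideal 1 Gbit/s link
ten_labs = 10 * one_lab               # same transfer repeated for 10 labs
```

With these assumptions, a single copy takes roughly 3.7 days and ten sequential copies roughly 37 days, before any real-world overhead.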

SLIDE 20

Data analysis as a Service

  • The aim is to offer users remote access to analysis services (data, software, IT capacity and expertise) using standard tools (ideally only a web browser).
  • Typical workflow:
    1) The user connects remotely using a web browser and their credentials (Federated IM).
    2) The user then selects, from a list, one of the experiments they have performed.
    3) The user is then connected to a service where the necessary analysis applications have been installed and configured for direct access to the experimental data.
    4) If necessary, the user can receive help and support from a facility expert during the analysis.
    5) Analysis data are published.

As of Dec 2016

  • OpenStack testbeds
  • Evaluation of the management APIs
  • More resources to come … soon
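
Steps 2 and 3 of the typical workflow can be pictured with a toy model. Everything here is hypothetical (the in-memory experiment list, the application tuple, the `/data/...` path); it only illustrates the idea that a session is opened per experiment, with applications preconfigured and the data already reachable, and only for users who performed that experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisSession:
    user: str
    experiment: str
    applications: tuple   # analysis software preinstalled in the session
    data_path: str        # where the experimental data is mounted


# Hypothetical stand-ins for the proposal system and application catalogue.
EXPERIMENTS_BY_USER = {"alice": ["EXP-A", "EXP-B"]}
APPLICATIONS = ("LAMP", "Mantid")


def list_experiments(user: str) -> list:
    """Step 2: experiments the authenticated user has performed."""
    return EXPERIMENTS_BY_USER.get(user, [])


def open_session(user: str, experiment: str) -> AnalysisSession:
    """Step 3: connect the user to a preconfigured analysis service."""
    if experiment not in list_experiments(user):
        raise PermissionError(f"{user} has no access to {experiment}")
    return AnalysisSession(user, experiment, APPLICATIONS, f"/data/{experiment}")
```

In a real deployment the session would be a virtual machine or remote desktop started through the cloud management APIs rather than a Python object.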

SLIDE 21

Homework by Andy

  • Top 3 data analysis applications …
    – LAMP, Mantid, Matlab through a private cloud + remote desktop
  • What services could the e-infrastructures provide?
    – OpenAire/DataCite: help us to communicate, collect metrics of data usage
    – GEANT: Global AAI? Hybrid-Cloud?
    – EGI/EUDAT: ???
  • If we submit a new PaNDaaS proposal … what to solve?
    – DaaS (volume and ease), analysis preservation, metrics
  • NX as an immediate, temporary (scalability?) solution

SLIDE 22

Contact: data@ill.eu
Portal: https://data.ill.eu
Policy: https://www.ill.eu/DataPolicy
PaNData Collaboration: http://pan-data.eu