
slide-1
SLIDE 1

Reproducibility and Open Science

Follow along at: https://gordonwatts.github.io/ros-roadshow

1 / 39

slide-2
SLIDE 2

$37.8M over 5 years: "Moore-Sloan Data Science Environments". Additional funding from:

  • Washington Research Foundation
  • National Science Foundation

Reproducibility and Open Science Working Group:

  • https://reproduciblescience.org/
  • Mailing list: reproducible@uw.edu

2 / 39

slide-3
SLIDE 3
  • Goal: Stimulate discussion and share ideas
  • Types of reproducibility
  • Tools for reproducibility
  • Data: archiving, curation, sharing
  • Code: scripting, versioning, collaborating, sharing, publishing
  • Publication: open access

3 / 39

slide-5
SLIDE 5

Private reproducibility...

Use scripts, not GUIs, for data analysis and visualization.
Use version control / provenance tracking tools.
Archive code and data used for published results.

Why?

  • Ability to check results in prior publications.
  • Ability to build on your own past research (or that of students / collaborators).
  • Easily modify tables/figures to satisfy referees, etc.

Auditable Research: Even if code and data are not shared, there should be a permanent record that can be checked. Analogous to lab notebooks.

5 / 39
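A minimal sketch of what such a permanent record could look like when the analysis is a script. Everything here is illustrative: the toy analysis, the function names `run_analysis` and `make_record`, and the record fields are assumptions, not part of the talk. The point is only that a seeded script plus a logged set of parameters can be re-run and checked later.

```python
import json
import platform
import random
import sys
from datetime import datetime, timezone


def run_analysis(seed, n_samples):
    """Toy 'analysis' (hypothetical): the mean of seeded pseudo-random draws."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the result
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(draws) / n_samples


def make_record(seed, n_samples):
    """Bundle the result with the information needed to check it later."""
    return {
        "parameters": {"seed": seed, "n_samples": n_samples},
        "result": run_analysis(seed, n_samples),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


# One JSON document per run; appending these to a log file gives a
# lab-notebook-like trail that can be audited even if nothing is shared.
record = make_record(seed=12345, n_samples=1000)
print(json.dumps(record, indent=2))
```

Re-running with the same seed and sample count on the same Python version should reproduce the same `result`, which is what makes the record checkable.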

slide-7
SLIDE 7

Public Reproducibility...

Allowing others to reproduce your results. (Readers, referees, researchers down the hall...)

Why?

  • Verifying scientific integrity of results.
  • Aids in understanding ideas, implementing methods.
  • Increases impact of work.

"An article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result."
Buckheit and Donoho (1995)

7 / 39

slide-9
SLIDE 9

Compare to Mathematics

Traditional research in Mathematics is reproducible...

  • A paper containing a new theorem cannot be published without the proof.

It wasn't always so...

"There is no . . . mathematician so expert in his science, as to place entire confidence in any truth immediately upon his discovery of it. . . . Every time he runs over his proofs, his confidence encreases; but still more by the approbation of his friends; and is raised to its utmost perfection by the universal assent and applauses of the learned world."

David Hume, 1739

9 / 39

slide-10
SLIDE 10

Compare to Mathematics

Many arguments against publishing code might be applied to proofs in an alternate universe...

"Top Ten Reasons To Not Share Your Code (and why you should anyway)", SIAM News, April 2013

  • The proof is too ugly to show anyone else.
  • I didn't work out all the details.
  • I didn't actually prove the theorem - my student did.
  • Giving the proof to my competitors would be unfair to me.
  • The proof is valuable intellectual property.
  • Etc.

10 / 39

slide-11
SLIDE 11

Gorgolewski and Poldrack (2016) 11 / 39

slide-13
SLIDE 13

The broader open source software community has worked out a lot of the issues around making code available and broadly useful.

  • Version control

12 / 39

slide-14
SLIDE 14

http://www.phdcomics.com/comics/archive.php?comicid=1531 13 / 39

slide-17
SLIDE 17

The broader open source software community has worked out a lot of the issues around making code available and broadly useful.

  • Version control
  • Automated software testing

15 / 39

slide-23
SLIDE 23

Write code that checks that our code does what we expect it to do.

We all do this anyway...

Formalize this and keep running the tests every time you make changes to the software.

Continuous integration

Why not design your analysis to run in this environment as well?

  • No hand art
  • Parameters and configurations tracked
  • Results tracked as artifacts and log files
  • Results computer accessible
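A small sketch of the idea, under stated assumptions: `fit_line` stands in for a piece of analysis code, the test checks it against a case whose answer is known in advance, and `run` returns parameters and results together as JSON so that they are tracked and computer accessible. All names are hypothetical. A test runner such as pytest would pick up the `test_` function automatically; a continuous integration service would simply run it on every change.

```python
import json


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def test_fit_line_recovers_known_line():
    # Points generated from y = 2x + 1, so the expected answer is known.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    slope, intercept = fit_line(xs, ys)
    assert abs(slope - 2.0) < 1e-12
    assert abs(intercept - 1.0) < 1e-12


def run(params):
    """Run the analysis; return inputs and outputs as one JSON string."""
    slope, intercept = fit_line(params["xs"], params["ys"])
    summary = {"params": params, "slope": slope, "intercept": intercept}
    return json.dumps(summary, sort_keys=True)
```

Comparing with a tolerance rather than exact equality is a deliberate choice for numerical code, where results can differ in the last digits between machines.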

16 / 39

slide-25
SLIDE 25

The broader open source software community has worked out a lot of the issues around making code available and broadly useful.

  • Version control
  • Automated software testing
  • Software licensing

17 / 39

slide-30
SLIDE 30

http://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/

  • Code without a license is closed code
  • Use a license that is broadly compatible (do not make up your own license!)
  • Consider using a permissive (e.g., BSD) license, rather than a "copyleft" license

Licensing makes your software useful to others, while maintaining your rights as the creator of the software.

18 / 39

slide-31
SLIDE 31

To advance on the academic career ladder, we need signals that our work is meaningful and useful.

This is especially pertinent if some aspects of your software work are not captured by traditional peer-reviewed publications.

Software papers give you a line in your CV, and allow others to cite their dependence on your software (independently of their inspiration by your findings).

19 / 39

slide-34
SLIDE 34

Software journals

https://www.software.ac.uk/which-journals-should-i-publish-my-software

Journal of Open Source Software

How to cite software

https://github.com/uwescience/citing_software

We did something like this at the recent Advanced Computing and Analysis Techniques in Physics Research (ACAT) conference. Daniel Katz's talk contains further examples. All submissions for the ACAT proceedings will be asked to cite the software directly using these guidelines.

20 / 39

slide-35
SLIDE 35

Making your data available

Data Curation

Ten Simple Rules for the Care and Feeding of Scientific Data

by Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic. PLOS Computational Biology 10 (2014), e1003542. http://dx.doi.org/10.1371/journal.pcbi.1003542

21 / 39

slide-36
SLIDE 36

Ten Simple Rules for the Care and Feeding of Scientific Data

  • Rule 2. Share Your Data Online, with a Permanent Identifier (e.g. DOI)
  • Rule 4. Publish Workflow as Context
  • Rule 5. Link Your Data to Your Publications as Often as Possible
  • Rule 6. Publish Your Code (Even the Small Bits)
  • Rule 7. State How You Want to Get Credit
  • Rule 8. Foster and Use Data Repositories

22 / 39

slide-37
SLIDE 37

Data Repositories

  • Open Science Framework: https://osf.io/
    (Slides by Kara Woo in eScience Reproducibility and Open Science Seminar)
  • UW ResearchWorks: https://researchworks.lib.washington.edu/
  • Ex: Human neuroimaging data, https://digital.lib.washington.edu/researchworks/handle/1773/33311
  • figshare: https://figshare.com/
  • Zenodo: https://zenodo.org/
  • Ex: Clawpack Version 5.3.1 at http://dx.doi.org/10.5281/zenodo.50982
  • Ex: Code, data, and Jupyter notebooks for a paper: http://faculty.washington.edu/rjl/pubs/KLslip/index.html

23 / 39

slide-38
SLIDE 38

Domain-specific repositories

Geosciences:

  • DesignSafe: https://www.designsafe-ci.org/
  • Community Surface Dynamics Modeling System (CSDMS): http://csdms.colorado.edu
    Data and model repositories, Web interface to some models

Neuroscience:

  • Collaboration in Computational Neuroscience: https://crcns.org/
  • Open fMRI: https://openfmri.org/

24 / 39

slide-40
SLIDE 40

Data availability confers a citation advantage

Sharing Detailed Research Data Is Associated with Increased Citation Rate

Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. http://dx.doi.org/10.1371/journal.pone.0000308

A collection of links on the topic: http://opcit.eprints.org/oacitation-biblio.html

25 / 39

slide-44
SLIDE 44
  • Organize your data in a manner that will make sharing easy.
  • Develop your software using git/GitHub. Use private repos during development, if you must (https://education.github.com/).
  • Use tools that facilitate open communication around code, data and results.

26 / 39
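One possible way to make a data directory easier to share, sketched under assumptions: keep a manifest of every file with its size and checksum, so that a recipient (or a repository) can verify that nothing was lost or altered. The function names and manifest layout below are hypothetical, not a format prescribed by any repository mentioned in this talk.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file under data_dir (relative POSIX path) to size and checksum."""
    root = Path(data_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = {
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            }
    return manifest
```

Calling `build_manifest("data")` on a hypothetical `data` directory returns a dictionary that can be written out as JSON and committed next to the code.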

slide-45
SLIDE 45

Literate programming

Jupyter

A notebook format that supports reproducibility by interweaving code, data and figures.

40 different languages are supported, including Julia, Python, R, and many others (Matlab too!).
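To make "interweaving" concrete: a notebook file is plain JSON holding a list of cells, some text and some code, with outputs stored alongside the code that produced them. The sketch below builds a two-cell notebook by hand. The field layout follows the version 4 notebook format as best recalled here and should be treated as an assumption; for real use, the `nbformat` package can create and validate notebooks.

```python
import json


def minimal_notebook(markdown_text, code_text):
    """Build a two-cell notebook as a plain dictionary (assumed v4 layout)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,  # the cell has not been run yet
                "outputs": [],  # results and figures are stored here after a run
                "source": code_text,
            },
        ],
    }


def cell_types(notebook):
    """List the kind of each cell, in document order."""
    return [cell["cell_type"] for cell in notebook["cells"]]


nb = minimal_notebook(
    "# Analysis\nMean of three numbers.",
    "print(sum([1, 2, 3]) / 3)",
)
text = json.dumps(nb, indent=1)  # roughly what an .ipynb file contains
```

Because the file is text, it can be versioned and shared like any other source file.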

27 / 39

slide-47
SLIDE 47

Example

Evaluating the Accuracy of Diffusion MRI Models in White Matter http://dx.doi.org/10.1371/journal.pone.0123272

Code: https://github.com/vistalab/osmosis
Notebooks: https://github.com/vistalab/osmosis/tree/master/doc/paper_figures
Data: https://purl.stanford.edu/ng782rw8378

28 / 39

slide-52
SLIDE 52

Dependency hell

To run these notebooks, you have to install all my dependencies.

To reproduce my results, you have to download my code, and my data, to your machine.

If my code has compiled components, you'll need to compile it.

If you happen to have a different operating system, different compiler, different libraries, etc... we might be out of luck!

29 / 39
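A first, modest step against this problem is to record exactly which environment produced a result. The sketch below uses only the Python standard library (`importlib.metadata`, available from Python 3.8) to list the interpreter, operating system and installed package versions, similar in spirit to `pip freeze`. The function names and report layout are assumptions for illustration.

```python
import platform
import sys
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def environment_report():
    """Describe the interpreter, operating system and package versions."""
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "packages": installed_packages(),
    }


if __name__ == "__main__":
    report = environment_report()
    print(f"# Python {report['python']} on {report['os']}")
    print("\n".join(report["packages"]))
```

Such a listing does not capture compilers or system libraries, so it documents the environment rather than recreating it.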

slide-56
SLIDE 56

Tools to mitigate dependency hell

Virtualization

  • Package code along with complete environment
  • E.g., VirtualBox, VMware, etc.
  • Docker

Cloud computing

  • E.g., Amazon EC2, Windows Azure, etc. + VM

Web platforms for running code

  • E.g., RunMyCode.org, wakari.io
  • SageMathCloud: https://cloud.sagemath.com

30 / 39

slide-58
SLIDE 58

Binder

http://mybinder.org

Developed by Jeremy Freeman's lab at Janelia Farm.

Provisions a GitHub repository as a cloud-computing environment.

For example, here is a binder that will run the LIGO analysis that confirmed the existence of gravitational waves (the GitHub repository is here).

31 / 39

slide-59
SLIDE 59

What if you don't like notebooks?

I'll address more complex workflows later 32 / 39

slide-63
SLIDE 63

Making your publications available

Publish in open access journals

Use preprint servers:

Make your work available before it is published
https://arxiv.org/
http://biorxiv.org/
Provides access to your work
Establishes precedence

33 / 39

slide-67
SLIDE 67

Summary and conclusions

  • Reproducibility is a cornerstone of science.
  • Think about reproducibility when you start your project and bake it in.
  • Make your data, code and papers open and available, so that others can build on your work.

  • Come and talk to us!

34 / 39

slide-68
SLIDE 68

Reproducibility and Open Science Working Group:

  • https://reproduciblescience.org/
  • Mailing list: reproducible@uw.edu, https://mailman11.u.washington.edu/mailman/listinfo/reproducible

Come to our office hours!

http://escience.washington.edu/office-hours/

35 / 39

slide-69
SLIDE 69

We're eager to hear! And you can post issues/questions here: https://github.com/rjleveque/2016-ros-amath/issues 36 / 39

slide-70
SLIDE 70

More materials

37 / 39

slide-71
SLIDE 71

https://medium.com/@lorenaabarba/barba-group-reproducibility-syllabus-e3757ee635cf#.x1w245xvg

38 / 39

slide-72
SLIDE 72
  • List of 10 recommended tutorials
  • https://help.github.com/categories/bootcamp/
  • http://git-scm.com/book/en/Getting-Started-Git-Basics
  • GitHub online tutorial

More general resources, including Git:

  • Software Carpentry
  • Codecademy

39 / 39