Reproducibility in Data Science
Juliana Freire
Visualization, Imaging and Data Analysis Center (VIDA)
Computer Science & Engineering
Center for Data Science (CDS)
VISUALIZATION IMAGING AND DATA ANALYSIS CENTER

Data-Driven Exploration
• Every scientific domain is moving toward data-driven exploration; this has led to great advances and discoveries
• Companies are capitalizing on data
• Government agencies use data to operate efficiently, make policies, and reach informed decisions
Computing is free
Storage is free
Data are abundant
The bottlenecks lie with people

Data-Driven Exploration: Challenges
• Data are vast and produced at unprecedented rates
• Sources are broad, varied, and unreliable
• Computational processes are required to extract insight
• But they are hard to assemble
[Figure: the pieces that must be assembled: machine learning, algorithms, statistics, math, data curation, data discovery, data management, data integration, provenance, visualization]

Data-Driven Exploration: Challenges
• Exploratory tasks are inherently iterative as one tests and formulates hypotheses
[Figure: exploration loop connecting Data, Computation, Data Products, Perception & Cognition, Knowledge, Exploration, and Specification; modified from Van Wijk, Vis 2005]

Many Trials and Errors…
[Figure: an analysis that branches through repeated steps (clean data, cluster data, create model, refine model, update scikitlearn, re-run model, visualize), with F-measures of .61, .75, and .92 reported along the way]

Data-Driven Exploration: Challenges
• After many steps… "An analysis has 30 different steps. It is tempting to just do this then that and then this. You have no idea in which ways you are wrong and what data is wrong" [Kandel et al., VAST 2012]
• It is easy to get lost and not remember how a result was derived
• Processes can break or misbehave in unforeseen ways
• Results can be hard to understand, interpret and trust
Need provenance!
[Figure: data, knowledge, decisions]
Incorrect conclusions can lead to bad decisions!

Computational Provenance
“Provenance is the source or origin of an object; its history and pedigree; a record of the ultimate derivation and passage of an item through its various owners.” The Oxford English Dictionary
• Provenance is a key ingredient for transparency and reproducibility
• Computational provenance is a causality graph that models process and data dependencies

Computational Provenance = Graph
[Figure: provenance graph showing data dependencies and process dependencies among processing steps (Clean data, Cluster data, Create Model, Run Model, Visualize), data items (Data i, Data 1, Data 2, Data 3, Data c), and models (Model 1, Model 2), with F-measures of .61, .75, and .92]
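To make the "provenance = graph" idea concrete, here is a minimal sketch in Python. It is not taken from the talk: the class `ProvenanceGraph`, its methods, and the node names are hypothetical, and it assumes every process and data item has a unique name. It records which inputs each process consumed and which outputs it produced, then walks the graph backwards to recover everything a result depends on.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy causality graph: edges point from a node to what it depends on."""

    def __init__(self):
        # node name -> set of nodes it was directly derived from
        self.parents = defaultdict(set)

    def record(self, process, inputs, outputs):
        """Record one process run (names are assumed to be unique)."""
        for out in outputs:
            self.parents[out].add(process)   # data depends on its process
        for inp in inputs:
            self.parents[process].add(inp)   # process depends on its inputs

    def lineage(self, node):
        """Return every upstream process and data item of `node`."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical three-step analysis.
graph = ProvenanceGraph()
graph.record("clean data", inputs=["raw data"], outputs=["data 1"])
graph.record("create model", inputs=["data 1"], outputs=["model 1"])
graph.record("visualize", inputs=["model 1"], outputs=["plot"])

print(sorted(graph.lineage("plot")))
```

Asking for the lineage of the plot returns the three processes and the three data items upstream of it, which is the information needed to explain or re-run the result.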

Computational Provenance: Benefits
• Interpret and reproduce results
• Understand the experiment and chain of reasoning that was used in the production of a result
• Verify that an experiment was performed according to acceptable procedures
• Identify what the inputs to an experiment were and where they came from
• Re-run steps, possibly with different settings
• Debug
• Share, re-use and extend results

Different Flavors of Provenance
• Computations are carried out in a controlled environment
• It is possible to systematically capture detailed provenance
• What to capture? Depends on what you will use provenance for:
  • Document computational process
  • Re-execute
  • Enable others to re-execute
  • Extend/modify process

Capture the Code
What do you get? Is this enough?
http://tinyurl.com/y3eohbo4

Notebooks and Reproducibility
• Recent study of 1,435,373 notebooks collected from 265,143 GitHub repositories
• 1,029,279 attempted executions of valid notebooks (i.e., notebooks with defined Python version and execution order)
  • Only 25.28% executed without errors, and
  • 4.57% produced the same results
• Problems:
  • No specification of library versions
  • Hard-coded paths
  • Out-of-order cells
  • Hidden states
[Pimentel et al., MSR 2019]

Notebooks: Best Practices
• Use relative paths (or external data repositories)
• Re-run notebook top to bottom before committing
• Declare dependencies and library versions
• Use clean environment to test dependencies
Or use https://www.reprozip.org/
[Pimentel et al., MSR 2019]
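Two of these practices can be checked mechanically before committing. The sketch below is an illustration only, not a tool from the talk: `check_notebook`, `load_notebook`, and the regular expression are hypothetical, and the path check is a rough heuristic. It assumes the notebook is stored as nbformat 4 JSON, where each code cell carries a `source` and an `execution_count`.

```python
import json
import re

# Heuristic: a quoted string that starts with "/" or a Windows drive letter.
ABSOLUTE_PATH = re.compile(r"""["'](?:/|[A-Za-z]:[\\/])[^"']*["']""")


def load_notebook(path):
    """Read an .ipynb file (plain JSON) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_notebook(notebook):
    """Return warnings about hard-coded paths and out-of-order execution."""
    warnings = []
    counts = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):          # nbformat stores lines as a list
            source = "".join(source)
        if ABSOLUTE_PATH.search(source):
            warnings.append(f"cell {index}: hard-coded absolute path")
        count = cell.get("execution_count")
        if count is not None:
            counts.append(count)
    # A fresh top-to-bottom run numbers the executed cells 1, 2, 3, ...
    if counts != list(range(1, len(counts) + 1)):
        warnings.append("cells were not executed top to bottom in a fresh kernel")
    return warnings


# Hypothetical notebook with both problems.
messy = {"cells": [
    {"cell_type": "code", "execution_count": 2,
     "source": "df = load('/home/alice/data.csv')"},
    {"cell_type": "code", "execution_count": 1, "source": ["import os\n"]},
]}
print(check_notebook(messy))
```

A check like this does not replace re-running the notebook in a clean environment; it only catches two of the problems listed in the study cheaply.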

ReproZip: Reproducibility in 2 Steps
• Packing (Linux): ReproZip creates a package with the data files, libraries, environment variables, etc. required to reproduce the research
• Unpacking (Windows, Linux, Mac OS X): open, unpack, and reproduce anywhere, anytime!
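As a rough illustration of the two steps, the sketch below builds the command lines one would typically run, based on my reading of the ReproZip documentation (`reprozip trace`, `reprozip pack`, and `reprounzip <unpacker> setup` / `run`). The helper functions, the package name `experiment.rpz`, and the script name `analysis.py` are hypothetical; the code only prints the commands and does not execute them, so check the documentation for the options that apply to your installed version.

```python
import shlex


def packing_commands(experiment_command, package="experiment.rpz"):
    """Commands for step 1 (on Linux): trace the run, then pack it."""
    return [
        ["reprozip", "trace", *shlex.split(experiment_command)],
        ["reprozip", "pack", package],
    ]


def unpacking_commands(package="experiment.rpz", target="experiment",
                       unpacker="docker"):
    """Commands for step 2: set up the package with an unpacker, then run."""
    return [
        ["reprounzip", unpacker, "setup", package, target],
        ["reprounzip", unpacker, "run", target],
    ]


# Print the commands instead of running them (assumed file names).
for command in packing_commands("python analysis.py") + unpacking_commands():
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run` as-is once ReproZip and the chosen unpacker are installed.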

ReproZip: Advantages
• Automatically tracks dependencies in an environment and sets them up in a different environment (portability)
• Deals with variability in computational environments
• Reproducibility in hindsight
• Very easy (I will show!)

ReproZip: How does it work?
https://www.youtube.com/watch?v=-zLPuwCHXo0
[Chirigati et al., ACM SIGMOD 2013]

Packing a Notebook
[Figure: packing]

Reproducing the Notebook
[Figure: unpacking]

ReproZip Jupyter Extension
https://docs.reprozip.org/en/1.0.x/jupyter.html

ReproZip can pack…
• Data analysis scripts / software (any language, you name it!)
• Graphical tools
• Interactive tools
• Client-server applications (including databases)
• Jupyter notebooks (very soon!)
• MPI experiments (setting up the experiment is involved though…)
• ... and many more!
https://examples.reprozip.org

ReproServer: Unpacking in a Browser
• Packing (Linux): ReproZip creates a package with the data files, libraries, environment variables, etc. required to reproduce the research
• Unpacking (ReproServer): upload it to ReproServer or give it a link, and reproduce!

ReproServer
● Runs ReproZip packages in the browser, no local software needed
● Allows changing input data, configuration, command-lines
● Gives you a URL to include in papers/reports to reproduce your experiment
● No lock-in: build on your laptop, pack automatically, reproduce anywhere
https://www.youtube.com/watch?v=Ffb-PaVPC58
[Rampin et al., 2018, https://arxiv.org/abs/1808.01406]

Capture the Exploratory Process
[Figure: the provenance graph of the exploratory analysis, with processing steps (Clean data, Cluster data, Create Model, Run Model, Visualize), data items (Data i, Data 1, Data 2, Data 3, Data c), models (Model 1, Model 2), and F-measures of .61, .75, and .92]

Capture the Exploratory Process Automatically
http://www.vistrails.org

Provenance Beyond Reproducibility
• Support for reflective reasoning
• Ability to compare data products
$vt_1 = x_i \circ x_{i-1} \circ \dots \circ x_1 \circ \emptyset$
$vt_2 = x_j \circ x_{j-1} \circ \dots \circ x_1 \circ \emptyset$
$vt_1 - vt_2 = \{x_i, x_{i-1}, \dots, x_1, \emptyset\} - \{x_j, x_{j-1}, \dots, x_1, \emptyset\}$
[Freire et al., IPAW 2006]
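The set difference above can be sketched in a few lines of Python. This is an illustrative toy, not the VisTrails implementation: the `parent` and `action` dictionaries and the version names are hypothetical. Each version is stored as the single action that created it plus a pointer to its parent, so a version is the composition of the actions on its path back to the empty workflow, and comparing two versions reduces to comparing those action sets.

```python
# Hypothetical version tree: each version records its parent and the one
# action that created it. A parent of None stands for the empty workflow.
parent = {"v1": None, "v2": "v1", "v3": "v2", "v4": "v2"}
action = {
    "v1": "addModule(reader)",
    "v2": "addModule(cleaner)",
    "v3": "setParameter(k, 5)",
    "v4": "setParameter(k, 8)",
}


def action_set(version):
    """Collect the actions on the path from `version` back to the root."""
    collected = set()
    while version is not None:
        # Pair the action with its version id so that identical action
        # texts created at different points stay distinct.
        collected.add((version, action[version]))
        version = parent[version]
    return collected


def difference(a, b):
    """Actions that are part of version `a` but not of version `b`."""
    return action_set(a) - action_set(b)


print(difference("v3", "v4"))
print(difference("v4", "v3"))
```

Because `v3` and `v4` share the prefix `v1`, `v2`, the shared actions cancel and only the differing parameter settings remain, mirroring how the shared $x_1$ and $\emptyset$ terms cancel in the formula.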

Provenance Beyond Reproducibility
• Support for reflective reasoning
• Ability to compare data products
• Explore parameter spaces and compare results
• Also explore alternative computations
$\mathrm{setParameter}(id_n, value_n) \circ \dots \circ (\mathrm{setParameter}(id_1, value_1) \circ v_t)$
$\mathrm{addModule}(id_i, \dots) \circ (\mathrm{deleteModule}(id_i) \circ v_1)$
$\dots$
$\mathrm{addModule}(id_i, \dots) \circ (\mathrm{deleteModule}(id_i) \circ v_n)$
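A parameter exploration of this kind can be sketched as repeated composition of `setParameter` actions onto a base version. The code below is a hypothetical illustration under simplifying assumptions: a workflow is reduced to a plain dictionary of parameter values, and `set_parameter`, `explore`, and the parameter names are invented for the example.

```python
from itertools import product


def set_parameter(name, value):
    """Return an action that maps a workflow to a copy with one change."""
    def apply(workflow):
        updated = dict(workflow)   # never mutate the base version
        updated[name] = value
        return updated
    return apply


def explore(base, grid):
    """Yield one workflow variant per point of the parameter grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        workflow = base
        for name, value in zip(names, values):
            # Compose: setParameter(name, value) applied to the version so far.
            workflow = set_parameter(name, value)(workflow)
        yield workflow


# Hypothetical base version and grid.
base = {"n_clusters": 2, "scaler": "none"}
grid = {"n_clusters": [2, 3], "scaler": ["none", "standard"]}
variants = list(explore(base, grid))
for variant in variants:
    print(variant)
```

Since every variant is derived from the same base by recorded actions, the results can be compared side by side and each one traced back to the exact settings that produced it.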

Provenance Beyond Reproducibility
• Support for reflective reasoning
• Ability to compare data products
• Explore parameter spaces and compare results
• Support for collaboration
[Ellkvist et al., IPAW 2008]

Change-Based Provenance: Extensibility
• Autodesk Maya
• ParaView
• VisIt
• ImageVis3d
[Callahan et al., IPAW 2008]

Provenance Plugin for Autodesk Maya

Vizier: Provenance + Notebooks
https://vizierdb.info/
[Glavic et al., ACM SIGMOD 2019]

