THE 3-R'S OF DATA-SCIENCE: REPEATABILITY, REPRODUCIBILITY & REPLICABILITY

By Suneeta Mall

06/05/2019, reveal.js (The HTML Presentation Framework)


SLIDE 1

06/05/2019 reveal.js – The HTML Presentation Framework localhost:8000/3Rs.html?print-pdf#/intro 1/93

THE 3-R'S OF DATA-SCIENCE:

REPEATABILITY, REPRODUCIBILITY & REPLICABILITY

By Suneeta Mall

SLIDE 2

OVERVIEW

  • Industry adaptation
  • Importance of 3-Rs
  • Peek into the reproducibility crisis
  • Define 3-Rs: repeatability, reproducibility & replicability
  • Down the memory lane of confused terminology
  • Techniques to ensure repeatability & reproducibility
  • In-depth review of a few of the promising tools
  • Techniques to ensure replicability
  • A few examples
  • One last point!

SLIDE 3

INDUSTRY ADAPTATION

“58% of respondents indicated that they were seriously building data science based solutions, with only 14% indicating no involvement just yet.”

Evolving Data Infrastructure - Ben Lorica and Paco Nathan (O’Reilly, Oct 2018)

https://www.kdnuggets.com/2017/05/machine-learning-overtaking-big-data.html

SLIDE 4

INDUSTRY ADAPTATION

“As per a survey in the UK, 84% of startups focus on data-science, with 52% of companies preferring to build/use their own models.”

David Kelnar, MMC Ventures, 2016

https://www.kdnuggets.com/2017/05/machine-learning-overtaking-big-data.html

SLIDE 5

DATA-SCIENCE

https://xkcd.com/1838/

SLIDE 8

IMPORTANCE OF 3-RS IN DATA-SCIENCE

Know what... why... & how... recreate it... & solve it.

SLIDE 11

IMPORTANCE OF 3RS .. CTD..

“We are continually faced by great opportunities brilliantly disguised as insoluble problems.” Lee Iacocca

The opportunities here are building:

  • Reliable & robust predictive solutions
  • That can be trusted

SLIDE 15

IMPORTANCE OF 3RS: IN RESEARCH

“Non-reproducible single occurrences are of no significance to science.”

Popper (The Logic of Scientific Discovery)

Yet 70% of researchers have failed to reproduce another scientist's experiments, and > 50% have failed to reproduce their own experiments.

  • Nature's Survey (2016)

SLIDE 16

SLIDE 17

SLIDE 18

A reproducibility crisis

SLIDE 19

International Conference on Learning Representations: annual reproducibility challenge (since 2018)

Led by Dr. Joelle Pineau, an Associate Professor at McGill University and lead for Facebook’s Artificial Intelligence Research lab (FAIR)

SLIDE 23

That's research, and the reproducibility crisis is very real! But why are we talking about it?

  • Industry adaptation is making data-science accessible to people
  • Thus changing our community and society
  • We have a moral and social obligation to provide confident and reliable answers!

SLIDE 24

1 REPEATABILITY, REPRODUCIBILITY & REPLICABILITY

SLIDE 26

1.1 REPEATABILITY

Repeatability is the closeness of the agreement between the results of successive attempts of the same experiment/process carried out under the same conditions.

e.g. replay, repeat

SLIDE 27

1.1 REPEATABILITY

A simple linear regression example on the scikit-learn diabetes dataset

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    diabetes = datasets.load_diabetes()
    # Use a single feature (column 9), kept 2-D for scikit-learn
    diabetes_X = diabetes.data[:, np.newaxis, 9]
    # random_state=None: a different train/test split on every run
    xtrain, xtest, ytrain, ytest = train_test_split(
        diabetes_X, diabetes.target,
        test_size=0.33, random_state=None)

    regr = linear_model.LinearRegression()
    regr.fit(xtrain, ytrain)
    diabetes_y_pred = regr.predict(xtest)

SLIDE 28

[Figure panels: Run 1, Run 2]

SLIDE 29

1.1 REPEATABILITY

Linear regression example on the scikit-learn diabetes dataset with fixed seed

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    diabetes = datasets.load_diabetes()
    diabetes_X = diabetes.data[:, np.newaxis, 9]
    # random_state=32: the same train/test split on every run
    xtrain, xtest, ytrain, ytest = train_test_split(
        diabetes_X, diabetes.target,
        test_size=0.33, random_state=32)

    regr = linear_model.LinearRegression()
    regr.fit(xtrain, ytrain)
    diabetes_y_pred = regr.predict(xtest)

SLIDE 30

[Figure panels: Repeat 1, Repeat 2]
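
The effect of fixing the seed can also be seen without scikit-learn. What follows is a minimal sketch using only the Python standard library; the synthetic data and the helpers `make_data`, `split` and `fit_line` are illustrative stand-ins (they are not from the slides) for the diabetes dataset, `train_test_split` and `LinearRegression`.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic y = 3x + 1 + noise; seeded so the data itself never changes."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys

def split(xs, ys, test_size=0.33, random_state=None):
    """Shuffle the indices and return the training part.

    random_state=None seeds the generator from the OS or the clock,
    so every call may pick a different training set.
    """
    rng = random.Random(random_state)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_size))
    train = idx[:cut]
    return [xs[i] for i in train], [ys[i] for i in train]

def fit_line(xs, ys):
    """Closed-form least squares: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

xs, ys = make_data()
seeded_a = fit_line(*split(xs, ys, random_state=32))
seeded_b = fit_line(*split(xs, ys, random_state=32))
unseeded_a = fit_line(*split(xs, ys))
unseeded_b = fit_line(*split(xs, ys))

print("seeded runs identical:  ", seeded_a == seeded_b)
print("unseeded runs identical:", unseeded_a == unseeded_b)
```

With the seed fixed, both runs see the same training rows and produce the same coefficients; without it, the coefficients will usually drift slightly from run to run.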

SLIDE 32

1.2 REPRODUCIBILITY

Reproducibility is the closeness of the agreement between the results of successive attempts of the same experiment/process carried out under different conditions.

e.g. repeat*

SLIDE 33

1.2 REPRODUCIBILITY

Linear regression example on the scikit-learn diabetes dataset with different configuration

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    diabetes = datasets.load_diabetes()
    diabetes_X = diabetes.data[:, np.newaxis, 9]
    xtrain, xtest, ytrain, ytest = train_test_split(
        diabetes_X, diabetes.target,
        test_size=0.33, random_state=32)

    # Same seed, different model configuration.
    # `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2.
    regr = linear_model.LinearRegression(normalize=False)
    regr.fit(xtrain, ytrain)
    diabetes_y_pred = regr.predict(xtest)

SLIDE 34

[Figure panels: Model fixed seed; Model fixed seed with different configuration]
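
One way to check agreement under a changed configuration is to compare predictions with a tolerance rather than by eye. The sketch below assumes a one-feature least-squares fit and uses feature standardisation as the "different condition"; the data, the helper `fit_line` and the tolerance are illustrative choices, not the scikit-learn implementation.

```python
import math
import random

def fit_line(xs, ys):
    """Closed-form least squares: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

rng = random.Random(32)
xs = [rng.uniform(-0.1, 0.1) for _ in range(100)]
ys = [150 + 900 * x + rng.gauss(0, 50) for x in xs]
x_new = 0.05

# Configuration A: fit on the raw feature.
slope_a, icpt_a = fit_line(xs, ys)
pred_a = slope_a * x_new + icpt_a

# Configuration B: standardise the feature first, then fit.
mean = sum(xs) / len(xs)
std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
zs = [(x - mean) / std for x in xs]
slope_b, icpt_b = fit_line(zs, ys)
pred_b = slope_b * ((x_new - mean) / std) + icpt_b

reproduced = math.isclose(pred_a, pred_b, rel_tol=1e-9)
print(f"raw: {pred_a:.6f}  standardised: {pred_b:.6f}  reproduced: {reproduced}")
```

For ordinary least squares with an intercept, rescaling the feature changes the coefficients but not the predictions, so the two configurations agree up to floating-point error. Other models (for example regularised ones) would not necessarily pass this check.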

SLIDE 36

1.3 REPLICABILITY

Replicability is the closeness of the agreement between the results of the original experiment/process and those of an independent experiment/process conducted to simulate/replicate the original experiment with at least similar data-set, algorithms, and conditions.

e.g. duplicate, copy
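
A toy illustration of this definition: an "original" study and an independent "replica" that draws its own data and uses a different fitting algorithm, followed by a tolerance check on the estimates. Everything here (the data-generating process, the two estimators, the 0.1 tolerance) is an assumption made for the sketch.

```python
import random

def sample(seed, n=300):
    """Independent data set from the same assumed process y = 2x + 5 + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + 5 + rng.gauss(0, 0.05) for x in xs]
    return xs, ys

def closed_form(xs, ys):
    """Original study: least squares solved analytically."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def gradient_descent(xs, ys, lr=0.5, steps=500):
    """Replica: same model, fitted by gradient descent on mean squared error."""
    slope, icpt, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errs = [slope * x + icpt - y for x, y in zip(xs, ys)]
        slope -= lr * 2 / n * sum(e * x for e, x in zip(errs, xs))
        icpt -= lr * 2 / n * sum(errs)
    return slope, icpt

original = closed_form(*sample(seed=1))
replica = gradient_descent(*sample(seed=2))
replicated = all(abs(a - b) < 0.1 for a, b in zip(original, replica))
print("original:", original)
print("replica: ", replica)
print("replicated within tolerance:", replicated)
```

The two studies share neither data nor code path, yet their slope and intercept agree closely, which is the sense of "closeness of agreement" used in the definition.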

SLIDE 39

1.3 REPLICABILITY

[Figure panels: Stephen Hawking; Stephen Hawking in space]

SLIDE 41

HISTORY OF CONFUSED TERMINOLOGY

Plesser HE. Reproducibility vs. Replicability: A Brief History of a Confused Terminology. Front Neuroinform. 2018;11:76. Published 2018 Jan 18. doi:10.3389/fninf.2017.00076

SLIDE 42

HISTORY OF CONFUSED TERMINOLOGY

Claerbout, J. F., and Karrenbach, M. (1992). Electronic documents give reproducible research a new meaning. SEG Expanded Abstracts 11, 601–604. doi: 10.1190/1.1822162

SLIDE 43

HISTORY OF CONFUSED TERMINOLOGY

Drummond, C. (2009). “Replicability is not reproducibility: nor is it good science,” in Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML (Montreal, QC). Available online at: http://www.site.uottawa.ca/~cdrummon/pubs/ICMLws09.pdf

SLIDE 45

Association for Computing Machinery, 2016, defined:

  • Repeatability: same results under same conditions
  • Replicability: same results under slightly different conditions
  • Reproducibility: same results under very different conditions (independent simulation)

SLIDE 46

Association for Computing Machinery, 2016, defined:

Patil, P., Peng, R. D., and Leek, J. T. (2016). A statistical definition for reproducibility and replicability. bioRxiv. doi: 10.1101/066803.

SLIDE 47

Association for Computing Machinery, 2016, defined:

Goodman, S. N., Fanelli, D., and Ioannidis, J. P. A. (2016). What does research reproducibility mean? Sci. Transl. Med. 8:341ps12. doi: 10.1126/scitranslmed.aaf5027

SLIDE 48

UNIVERSAL REPRODUCIBILITY

  • Methods reproducibility: provide enough detail so the procedures could be exactly repeated
  • Results reproducibility: obtain similar results from an independent study with same procedures as original experiment
  • Inferential reproducibility: draw the same conclusions from either an independent replication or a reanalysis of the original experiment

SLIDE 49

https://xkcd.com/1673

SLIDE 51

Repeat Reproduce Replicate Robust Reliable

SLIDE 53

1.1 ENSURING REPEATABILITY

  1. To err is human: automate everything
  2. Avoid ad hoc changes
  3. Reproducible randomness

RFC 1149.5: https://xkcd.com/221
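
"Reproducible randomness" usually comes down to recording a seed and passing a seeded generator around explicitly. A minimal sketch with the standard library follows; `experiment` and `SEED` are illustrative names, and libraries with their own generators (NumPy, deep learning frameworks) would need to be seeded separately.

```python
import random

def experiment(rng):
    """Toy 'experiment': shuffle ten items and draw three noisy measurements."""
    items = list(range(10))
    rng.shuffle(items)
    return items, [rng.gauss(0, 1) for _ in range(3)]

SEED = 32  # stored with the results so the run can be replayed

# An explicit, locally seeded generator keeps the randomness contained.
run_1 = experiment(random.Random(SEED))
run_2 = experiment(random.Random(SEED))

# Without a seed the generator is initialised from OS entropy or the clock.
run_3 = experiment(random.Random())

print("seeded runs match:   ", run_1 == run_2)
print("unseeded run matches:", run_1 == run_3)
```

Passing the generator as an argument, instead of relying on the module-level state, also avoids one component silently consuming random numbers that another component depends on.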

SLIDE 54

1.2 ENSURING REPRODUCIBILITY

  1. Ensuring repeatability
  2. Maintaining full provenance
  3. Model management, tracing

SLIDE 55

P2 OF R2

PRINCIPLE

  • Avoid ad hoc changes
  • Reproducible randomness

PROCESS

  • Automate everything
  • Provenance
  • Model management: auditing, tracing, serving

SLIDE 58

P1 AUTOMATE EVERYTHING

Introduce DAG pipelines that are simple, modular, pluggable, scalable, & reliable.

https://xkcd.com/242
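
To make the idea of a DAG pipeline concrete, here is a minimal sketch built on `graphlib.TopologicalSorter` from the Python standard library (3.9+). The step names and the `run_pipeline` helper are hypothetical; the tools listed in this deck add scheduling, retries, caching and distribution on top of the same ordering idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
DEPENDS_ON = {
    "ingest": set(),
    "features": {"ingest"},
    "train": {"features"},
    "evaluate": {"train", "features"},
}

def run_pipeline(depends_on, steps):
    """Run every step once, only after all of its dependencies have run."""
    order = []
    for name in TopologicalSorter(depends_on).static_order():
        steps[name]()
        order.append(name)
    return order

log = []
# Each step is a plain callable, so steps can be swapped without touching the runner.
STEPS = {name: (lambda n=name: log.append(f"ran {n}")) for name in DEPENDS_ON}

order = run_pipeline(DEPENDS_ON, STEPS)
print(order)
```

Because the dependencies are declared as data, the runner stays the same when steps are added or replaced, which is what keeps such a pipeline modular and pluggable.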

SLIDE 59

Standard ML DAG flow

SLIDE 60

DAG TOOLS FOR ML RELATED WORKLOAD

Mlflow Airflow Kubeflow Azkaban Luigi Spark Co Data Science Pipeline AI Nextflow Snakemake DVC Delta Lake Pachyderm Argo Digdag Bpipe

SLIDE 63

P2 FULL PROVENANCE

Maintain version control over everything - aka lineage, time travel, audit.

https://xkcd.com/1597/
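
A small sketch of what a lineage record could look like: fingerprint the data, the code and the parameters of a run, and derive a run id from all three. The record layout and the `lineage_record` helper are assumptions made for illustration, not the format of any particular tool.

```python
import hashlib
import json

def digest(payload: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()

def lineage_record(data: bytes, code: str, params: dict) -> dict:
    """Fingerprint everything that influenced a run (hypothetical layout)."""
    record = {
        "data_sha256": digest(data),
        "code_sha256": digest(code.encode("utf-8")),
        "params": params,
    }
    # Canonical JSON (sorted keys) so identical inputs always give the same id.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["run_id"] = digest(canonical)[:12]
    return record

data = b"age,bmi,target\n59,32.1,151\n"
code = "model = fit(data, alpha=0.1)"

first = lineage_record(data, code, {"alpha": 0.1, "seed": 32})
again = lineage_record(data, code, {"seed": 32, "alpha": 0.1})
changed = lineage_record(data + b"48,21.6,75\n", code, {"alpha": 0.1, "seed": 32})

print("same inputs, same id:   ", first["run_id"] == again["run_id"])
print("changed data, new id:   ", first["run_id"] != changed["run_id"])
```

Storing such a record next to every model artefact is enough to answer the audit question "which data, code and settings produced this result?".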

SLIDE 64

TOOLS USEFUL FOR FULL PROVENANCE

Pachyderm, Delta Lake, Data Version Control, Airflow*

SLIDE 66

P3 MODEL MANAGEMENT - AUDITING, TRACING, SERVING

Manage gamut of models with: auditing, tracing & serving

Delta Lake* Seldon Polyaxon ModelDB Da Version Control Pachyderm* Mlflow Kubeflow C Pipeline AI

SLIDE 67

DATA VERSION CONTROL

Code in Git, data and meta in object store

  • Version control: model, data and config
  • Git-like: pull, push, etc.
  • Maintains metadata in .dvc
  • Analogous to GitOps principles
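
The "code in Git, data in an object store" split can be sketched as a content-addressed cache plus a small pointer file that is committed in place of the data. The code below is an illustration of that idea only: the `.ptr.json` pointer format, the `add` and `checkout` helpers and the local cache directory are made up and are not DVC's actual file format or API.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def add(path: Path, cache: Path) -> Path:
    """Copy `path` into a content-addressed cache and write a pointer file.

    The pointer is what would be committed to Git; the cache directory
    stands in for an object store.
    """
    content = path.read_bytes()
    sha = hashlib.sha256(content).hexdigest()
    cache.mkdir(parents=True, exist_ok=True)
    (cache / sha).write_bytes(content)
    pointer = path.with_name(path.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": path.name, "sha256": sha}))
    return pointer

def checkout(pointer: Path, cache: Path) -> bytes:
    """Resolve a pointer file back to the exact bytes it was created from."""
    meta = json.loads(pointer.read_text())
    return (cache / meta["sha256"]).read_bytes()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    data = root / "src.tsv"

    data.write_bytes(b"x\ty\n1\t2\n")
    pointer = add(data, cache)
    v1_pointer_text = pointer.read_text()

    data.write_bytes(b"x\ty\n1\t2\n3\t4\n")   # the data set changes
    add(data, cache)                           # pointer now names version 2

    pointer.write_text(v1_pointer_text)        # stands in for checking out the old commit
    restored = checkout(pointer, cache)

print(restored.decode())
```

Because every version is stored under its own hash, going back to an old commit of the pointer file is enough to recover the matching data.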

SLIDE 68

DATA VERSION CONTROL

DVC examples (the script path in step 2 is clipped at the slide edge)

    # Step 1
    dvc run -d code/features.py \
        -d data/src.tsv \
        -o data/feature.p \
        python code/features.py

    # Step 2
    dvc run -d data/feature.p \
        -d code/train_model.p… \
        -o data/model.h5 \
        python code/train_model.p…

SLIDE 69

DVC

PROS

  • Not very prescriptive
  • Independent tool, integrates easily
  • Single source of truth with GitOps

CONS

  • DAG declaration not intuitive
  • Too much onus on user to integrate with system
  • Relatively new - still evolving

SLIDE 70

PACHYDERM

Git repository for data

  • Data is first class citizen
  • Like DVC, automatically manages versions
  • Containerized and platform agnostic

SLIDE 71

PACHYDERM

Data repos, DAG, example of workflow

Pipeline Specification (listing clipped at the slide edge)

    {
      "input": {
        "union": [
          {
            "pfs": {
              "glob": "/",
              "repo": "training_dat…
            }
          },
          {
            "pfs": {
              "glob": "/*.json",
              "repo": "parameters",
            }
          }
        ]
      },
      "pipeline": {
        "name": "model_train"
      },
      "transform": {
        "image": "ubuntu",
        "cmd": ["/bin/bash"],
        …

SLIDE 72

SLIDE 73

PACHYDERM

PROS

  • Treats data as first class entity
  • Manages versioning automatically
  • Simple pipeline constructs
  • Scalable & distributed
  • Capability of long running notebooks

CONS

  • Is not capable of model serving
  • Does not simplify model tracing
  • Still rapidly evolving

SLIDE 74

KUBEFLOW

  • Provides infrastructure agnostic ML toolkit
  • Integrates with relevant software for seamless, scalable processing
  • Sources data from object stores
  • Kubernetes specific

SLIDE 75

ML Toolkit for Kubernetes

SLIDE 76

KUBEFLOW

Kubeflow pipeline e.g.

Pipeline Specification (listing clipped at the slide edge)

    import kfp.dsl as dsl

    def PreprocessOp(...):
        return dsl.ContainerOp(..)

    def TrainOp(...):
        return dsl.ContainerOp(...)

    @dsl.pipeline(
        name='resnet_cifar10_pipeline…
    )
    def resnet_pipeline(....):
        persistent_volume_name = '/'
        persistent_volume_path = '/pa…
        op_dict = {
        …

SLIDE 77

KUBEFLOW

PROS

  • Leverages Kubernetes: scalable, distributed ML compute
  • Nicely integrates with the likes of Pachyderm, Katib, ModelDB, Seldon
  • GitOps can be realized with Argo

CONS

  • Does not provide unified provenance
  • Requires DSL based pipeline spec
  • Installation may feel cumbersome
  • Still evolving and integrating with tools

SLIDE 78

DELTA-LAKE

  • Promising offering for Hadoop based workloads
  • Leverages Spark for data-pipelining
  • Manages provenance under time-travel feature
  • Model management/tracing via MLflow
  • Model persistence via MLeap

SLIDE 79

DELTA LAKE

MLflow traces

Spark Pipeline Spec (listing clipped at the slide edge)

    from pyspark.ml import Pipeline

    training = spark.createDataFrame(…
    tokenizer = Tokenizer(inputCol="tex…
                          outputCol="words")
    hashingTF = HashingTF(...)
    lr = LogisticRegression(maxIter=10…
                            regParam=0.001)
    pipeline = Pipeline(stages=
        [tokenizer, hashingTF, lr]…
    model = pipeline.fit(training)
    prediction = model.transform(test)

SLIDE 80

DELTA LAKE

  • Only recently launched under the Apache 2 license: version 0.1.0 on 24th April 2019!
  • Proprietary version was released a year ago and is in use by many organizations
  • Very little documentation, so not enough is known
  • Claims to have solved the provenance part right!
  • Limited to Hadoop/Spark stack

SLIDE 81

POLYAXON

  • Platform agnostic solution to manage the end-to-end lifecycle of deep learning workloads
  • Partly open source with enterprise option
  • Provides reproducibility and model tracing
  • Integrates with Seldon
  • Data provenance is not managed

SLIDE 82

POLYAXON

Polyaxon experiment

Polyaxon Spec

    version: 1

    kind: experiment

    build:
      image: tensorflow/tensorflow:1.4.1
      build_steps:
        - pip3 install polyaxon-client

    run:
      cmd: python model.py

Run it with:

    polyaxon run -f polyaxonfile.yaml

SLIDE 83

1.3 ENSURING REPLICABILITY

Throwing in 2 more Rs: Reliability and Robustness

  • Dataset represents right demographics
  • Robustness through data augmentation
  • Knowing decision boundaries of model
  • Model testing: against adversarial attacks

http://www.cleverhans.io - Blog by Goodfellow & Papernot

SLIDE 84

1.3 ENSURING REPLICABILITY

“Not make confident mistakes” (Unrestricted Adversarial Examples - Goodfellow et al.)

“Model generalization across architectures and training sets” (Explaining and harnessing adversarial examples - Szegedy et al.)

SLIDE 85

EXAMPLE 1

SLIDE 86

EXAMPLE 2

SLIDE 87

Examples generated by foolbox (listing clipped at the slide edge).

    import foolbox
    import numpy as np
    from foolbox.criteria import Misclassification
    from foolbox.models import KerasModel
    from foolbox.attacks import FGSM

    # model = <your model>
    fmodel = KerasModel(kmodel, bounds=(0, 255),
                        preprocessing=(np.array([104, 116, 123]…
    attack = FGSM(fmodel, criterion=Misclassification())

    # img, label = <your image>, <true label>
    adversarial = attack(img, label)
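
The fast gradient sign method (FGSM) itself is short enough to write out by hand. The sketch below applies it to a tiny logistic-regression model in pure Python; the weights, input and step size `eps` are hand-picked for illustration, and this is not how foolbox implements the attack internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, label, eps):
    """Fast gradient sign method: x_adv = x + eps * sign(d loss / d x).

    For logistic regression with cross-entropy loss the gradient of the
    loss with respect to the input is (p - label) * w, so no autodiff
    library is needed.
    """
    p = predict(w, b, x)
    grad = [(p - label) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical, hand-picked model and input.
w, b = [2.0, -3.0, 1.0], 0.5
x, label = [0.6, 0.1, 0.2], 1

x_adv = fgsm(w, b, x, label, eps=0.3)
p_clean, p_adv = predict(w, b, x), predict(w, b, x_adv)
print(f"clean: p(class 1) = {p_clean:.3f}")
print(f"adversarial: p(class 1) = {p_adv:.3f}")
```

Each input coordinate moves by at most `eps`, yet the predicted class changes, which is the kind of brittleness that adversarial model testing is meant to surface.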

SLIDE 88

EXAMPLE 3

Persian Cat

SLIDE 90

LASTLY

“The price of greatness is responsibility.” Winston Churchill

Do data-science responsibly!

SLIDE 91

REFERENCES & LINKS

Software     Link                                    Capability
DVC          https://iterative.ai/                   DAG, Provenance
Pachyderm    http://pachyderm.io                     DAG, Provenance
Kubeflow     https://github.com/kubeflow/kubeflow    DAG, Model Management
Delta Lake   https://delta.io/                       DAG, Provenance, Model Management
Pipeline AI  https://github.com/PipelineAI/          DAG, Model Management
Polyaxon     https://github.com/polyaxon/polyaxon    DAG, Model Management
Airflow      http://airflow.apache.org/              DAG, Limited provenance via Atlas DB

SLIDE 92

REFERENCES & LINKS: PART 2

Software   Link                                        Capability
Luigi      https://github.com/spotify/luigi            DAG
Nextflow   https://www.nextflow.io/                    DAG
Snakemake  https://snakemake.readthedocs.io            DAG
Bpipe      http://docs.bpipe.org/                      DAG
digdag     https://github.com/treasure-data/digdag/    DAG
Mleap      https://github.com/combust/mleap            Model Persistence
Mlflow     https://github.com/mlflow/mlflow/           Model Management
Seldon     https://www.seldon.io/                      Model Management
Modeldb    https://github.com/mitdbg/modeldb           Model Management
Katib      https://github.com/kubeflow/katib           Hyperparameter tuning

SLIDE 93

THANK YOU!

QUESTIONS??