Reducing Technical Debt with Reproducible Containers, Tanu Malik



SLIDE 1

Reducing Technical Debt with Reproducible Containers

Tanu Malik

2019 BSSw Fellow
Assistant Professor, School of Computing
DePaul University, Chicago, IL
IDEAS-ECP Webinar, November 4th, 2020

SLIDE 2

Who am I

My expertise is:
  • Databases and distributed computing
  • Data provenance: history and lineage of data and software
  • Computational reproducibility: repeating and recreating someone else’s work

Systems built: http://sciunit.run

I want to know more about:
  • Reproducibility case studies in HPC and how containers are used

Problems I’m currently working on:
  • Provenance alignment: using provenance to highlight sources of irreproducibility
  • State maintenance in lineage graphs: making Jupyter Notebooks reproducible

Tanu Malik
Assistant Professor, School of Computing
Director, Data Systems and Opt. Lab
DePaul University, Chicago, IL
https://facsrv.cs.depaul.edu/~tmalik1
Tanu.Malik@depaul.edu

SLIDE 3

Outline

PART 1: How does technical debt affect reproducibility?
PART 2: Do reproducible containers provide a start?
PART 3: Summary and guidance

SLIDE 4

PART 1: How does technical debt affect reproducibility?

SLIDE 5

Monetary debt

SLIDE 6

Monetary debt meets the objective “sooner”

SLIDE 7

Technical debt [1] is no different

[1] A metaphor introduced by Ward Cunningham in 1992.

SLIDE 9

Technical debt is no different.

[Figure: Productivity versus Time for good and poor scientific software. Labels: Journal deadline, Technical debt]

SLIDE 10

Dimensions of Technical Debt

  • Poor quality code
  • Poor design
  • Environment debt
  • Documentation debt
  • Testing debt

SLIDE 11

Consequence of Mismanaged Debt

REPOSSESSED

SLIDE 12

Consequence of Mismanaged Debt

REPOSSESSED IRREPRODUCIBLE

SLIDE 13

Dimensions of Scientific Technical Debt

  • Poor quality code
  • Poor design
  • Environment debt
  • Documentation debt
  • Testing debt

[1] E. Tom, A. Aurum, R. Vidgen. An exploration of technical debt. Journal of Systems and Software, Volume 86, Issue 6, 2013, Pages 1498-1516, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2012.12.052.

SLIDE 14

Dimensions of Scientific Technical Debt

  • Poor quality code
  • Poor design
  ✓ Environment debt
  ✓ Documentation debt
  • Testing debt

SLIDE 15

https://www.newscientist.com/gallery/software-bugs

SLIDE 16

SLIDE 17

https://www.nature.com/articles/d41586-020-01685-y

SLIDE 18

Cost of Scientific Technical Debt

SLIDE 19

Supercomputing Artifact Description and Evaluation Initiative

https://sc20.supercomputing.org/planning-committee/

SLIDE 20

Lack of artifacts will reject a paper

[Figure: bar chart of total number per category. Categories: Submissions (Phase 1); Per reviewer; with VG/E AD/AE (Phase 1); Submissions (Phase 2); with VG/E AD/AE (Phase 2); Unacceptable AD/AE. Bar values as extracted: 380 80 5 43 24 1]

SLIDE 21

Technical debt incurs burden

  • Reproducibility is an afterthought.
  • Identifying files for an application is a challenge
  • Missing workflows
  • Really, that data/algorithm should be part of the bundle?

  • “Sticks” from reviewers work
  • Authors who have not taken the AD/AE process seriously do submit additional work
  • Time consuming task
  • No tools to check if everything relevant for the publication is submitted
  • No mapping of experiments to content in the paper.
  • No infrastructure for efficiently verifying claimed results

SLIDE 22

PART 2: Do reproducible containers provide a start?

SLIDE 23

Reproducibility ecosystem

[Figure labels: GitHub; Figshare; OpenData.gov; Zenodo.org; Docker.com; Sharing images via the cloud; Package managers]

C. Boettiger. An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 2015.

SLIDE 24

Docker: Using containers from build to run

https://www.exascaleproject.org/event/conthpc

SLIDE 25

Containers provide constrained resource isolation

  • CPU
  • Memory
  • Filesystem
  • Network

SLIDE 26

Authors must program a Dockerfile

SLIDE 27

Containers do not reduce technical debt

  • Declarative encapsulation of dependencies for isolated execution
  • E.g. various shell utilities and library versions unknown to user

SLIDE 28

Automatic Encapsulation of Dependencies: The Sciunit

SLIDE 29

Key Idea: Identify dependencies during program execution

  • Captures application dependencies during executions
  • Repeats executions (with guarantees) within isolated environments

SLIDE 30

Sciunit: Audit

  • Audit uses ptrace to observe dependencies and environment variables
  • Identifies binaries, libraries, scripts, and environment variables that the application depends on.
  • Dependencies are copied into a directory in the filesystem
  • Inclusion of data files is optional: the user may or may not want to package them, depending on the size of the dataset.

D.H. Ton That, G. Fils, Z. Yuan, T. Malik. Sciunits: Reusable Research Objects. In IEEE eScience Conference (eScience), 374-383, 2017

SLIDE 31

Audits provenance during execution time

  • Utilizing Provenance in Reusable Research Objects. In Special Issue on Using Computational Provenance, MDPI Informatics, Vol 5(1), 2018.
  • Light-weight Database Virtualization. In IEEE International Conference on Data Engineering (ICDE), 2015.
  • Auditing and Maintaining Provenance in Software Packages. In International Provenance and Annotation Workshop (IPAW), 97-109, 2014.

SLIDE 32

Sciunit: Share as a Zip file or Docker container

[Figure labels: Computational Artifacts (from websites); Sciunit Containment; Identification of Inputs, Outputs, Processes, Dependencies; Log; Provenance Graph; Documentation; Docker File]

Documenting Computing Environments for Reproducible Experiments, In Parallel Computing: Technology Trends, 756-765, 2020

SLIDE 33

Sciunit: Repeat

  • Sciunit uses namespace isolation during repeat
  • Redirection of each call into the package
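
The redirection of calls into the package can be pictured with a small C sketch. This is only an illustration under stated assumptions: `redirect_into_package` is a hypothetical helper, not a Sciunit function, and it only shows the path-rewriting idea (an absolute path requested by the application is resolved against the package directory). How Sciunit intercepts the calls themselves is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Sciunit): map an absolute path requested
 * by the application to the copy stored under the package root.
 * Returns 0 on success, -1 if the arguments are invalid or the result
 * does not fit into `out`. */
int redirect_into_package(const char *package_root, const char *path,
                          char *out, size_t out_size)
{
    if (package_root == NULL || path == NULL || out == NULL || out_size == 0)
        return -1;
    if (path[0] != '/')   /* this sketch only redirects absolute paths */
        return -1;

    int n = snprintf(out, out_size, "%s%s", package_root, path);
    if (n < 0 || (size_t)n >= out_size)
        return -1;        /* formatting error or truncated result */

    assert(strlen(out) == (size_t)n);   /* result is complete and terminated */
    return 0;
}
```

For example, with a package root of `/tmp/pkg`, a request for `/usr/lib/libm.so` would be rewritten to `/tmp/pkg/usr/lib/libm.so`, so the repeated execution sees the audited copy rather than the host's file.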

  • Efficient Provenance Alignment in Reproduced Executions. In Theory and Practice of Provenance, 2020.
  • ScIInc: A Container Runtime for Incremental Recomputation. In IEEE 15th International Conference on eScience (eScience), 291-300, 2019, doi: 10.1109/eScience.2019.00040.

SLIDE 34

Sciunit steps and external requirements

  1. Create
  2. Share
  3. Repeat

SLIDE 35

Network-enabled Sciunit: Audit

Note:
  1. Identify remote host & copy Sciunit to it
  2&3. Configure & run task with Sciunit
  4. Retrieve & manually merge

[Figure labels: Network-enabled Sciunit; Spawn task 1; Spawn task 2; 1. Network-enabled Sciunit (on each remote host); 2&3. Run task 1; 2&3. Run task 2; 4. Merge; Possible with Network-enabled Sciunit]

SLIDE 36

Network-enabled Sciunit: Repeat on single node

Note:
  1. Repeat all computations at root node.
  2. Network system calls are supplied through the content data captured during the original audit.

[Figure labels: Network-enabled Sciunit; Run application; No connection]

SLIDE 37

Network-enabled Sciunit: Repeat on multiple nodes

Requirements:
  1. Identical number of nodes
  2. Descriptions of new hostnames or IP addresses

[Figure labels: Network-enabled Sciunit; Run application; Network-enabled Sciunit & sub-container; Run task 1; Run task 2]

SLIDE 38

Use cases

TABLE I: Use case descriptions.

                        FIE [16]     VIC [17]                                   IQE [18]
Source code languages   R, Bash      C, C++, Python, C shell script, Fortran    Python
Source code files       29           97                                         5
Data files              14           11,481                                     5
Dependency files        659          357                                        112
Size of all files       306.6 MB     1.2 GB                                     22 MB
Normal run time         286.756 s    40.259 s                                   5.226 s

[16] City of Chicago, “Food Inspection Evaluation,” https://chicago.github.io/food-inspections-evaluation/, 2017, [Online; accessed 7-May-2017].
[17] M. M. Billah, J. L. Goodall et al., “Using a data grid to automate data preparation pipelines required for regional-scale hydrologic modeling,” Environmental Modelling & Software, vol. 78, 2016.
[18] D. DBGroup, “Incremental Query Execution,” 2019, [Online; accessed 3-April-2019]. [Online]. Available: https://TonHai@bitbucket.org/TonHai/iqe.git

SLIDE 39

Sciunit (native) versus Docker sizes

SLIDE 40

Sciunit audit and repeat times

SLIDE 41

Experiments

  • NASA Parallel Benchmark:
  • Data transferred (~524 KB (class A) & ~268 KB (class B))
  • VIC

SLIDE 42

Sample Interaction of Sciunit

[Figure labels: Alice’s Computer; Bob’s Computer]

1. > sciunit open mSLLTj#
   Opened Sciunit FIE
2. > sciunit list
   e1 Dec 4 12:44 ./FIE.sh ./DATA/weather_201710.Rds
3. > sciunit repeat e1
   0. Download…
   1. Calculate violation matrix…
   2. Calculate heat map…
   3. Generate model data with ./DATA/weather_201710.Rds
   4. Apply random forest model…
   5. Evaluation…
4. > sciunit given ‘/tmp/weather_201801.Rds’ e1 %
   1. Generate model data with ‘/tmp/weather_201801.Rds’
   2. Apply random forest model…
   3. Evaluation…
5. > sciunit list
   e1 Dec 4 12:44 ./FIE.sh ./DATA/weather_201710.Rds
   e2 Dec 14 2:44 ./FIE.sh ./tmp/weather_201801.Rds

SLIDE 43

Container Limitations

  • Containers either include the data or exclude the data
  • The decision is binary and does not consider which data is necessary and sufficient

SLIDE 44

Container Debloating: MiDAS

SLIDE 46

Example

    #include <fcntl.h>    /* open, O_RDWR */
    #include <stdlib.h>   /* calloc */
    #include <unistd.h>   /* lseek, read */

    void file_read(int bytes) {
        int fd, sz;
        char *c = (char *) calloc(bytes, sizeof(char));
        fd = open("test.txt", O_RDWR);
        lseek(fd, 100, SEEK_SET);   /* skip the first 100 bytes */
        sz = read(fd, c, bytes);    /* read `bytes` bytes starting at offset 100 */
    }

Execution trace (the I/O calls of file_read):

    fd = open("test.txt", O_RDWR);
    lseek(fd,100,SEEK_SET);
    sz = read(fd, c, bytes);
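
To make the link between such a trace and file offsets concrete, here is a minimal sketch, assuming a wrapper is placed around `read`. The names `read_record`, `traced_read`, and `print_record` are hypothetical and are not taken from MiDAS. The wrapper simply records where in the file each read started and how many bytes it returned.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* One entry of a hypothetical execution trace: the byte range a read touched. */
struct read_record {
    off_t   offset;   /* file offset at which the read started */
    ssize_t length;   /* bytes actually read, or -1 on error */
};

/* Wrapper around read(2) that records the accessed byte range.
 * The current offset is queried with lseek(fd, 0, SEEK_CUR) before reading. */
ssize_t traced_read(int fd, void *buf, size_t count, struct read_record *rec)
{
    assert(rec != NULL);
    rec->offset = lseek(fd, 0, SEEK_CUR);
    rec->length = read(fd, buf, count);
    return rec->length;
}

/* Print one trace entry in a simple "offset/length" form. */
void print_record(const struct read_record *rec)
{
    printf("read offset=%lld length=%lld\n",
           (long long)rec->offset, (long long)rec->length);
}
```

If `file_read` above called `traced_read(fd, c, bytes, &rec)` in place of `read`, the record would hold offset 100 and the number of bytes returned, which is the kind of information needed to decide which chunk of `test.txt` has to travel with the application.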

SLIDE 47

MiDAS: Minimizing DAtasetS

WHAT: Automatically identify & include ONLY relevant data chunks with application
HOW: Map high level user inputs to file offsets

SLIDE 48

Partial Evaluation & LLVM

  • Partial Evaluation → optimization technique to prune codebase
  • Uses static inputs to generate a specialized program to accept remaining dynamic inputs

(a) Original code

    #include <math.h>

    static const float pi = 3.14159265f;

    /* Forward declaration so the call below compiles. */
    float compute_opposite(float adjacent, float angle);

    float compute_building_height(float building_distance){
        float viewing_angle = pi/4;   /* static input: known before execution */
        float building_height =
            compute_opposite(building_distance,
                             viewing_angle);
        return building_height;
    }

    float compute_opposite(float adjacent, float angle){
        float opposite = adjacent * tan(angle);
        return opposite;
    }

(b) Specialized code

    /* Forward declaration so the call below compiles. */
    float compute_opposite_specialized(float adjacent);

    float compute_building_height(float building_distance){
        float building_height =
            compute_opposite_specialized(building_distance);
        return building_height;
    }

    float compute_opposite_specialized(float adjacent){
        float opposite = adjacent * 1;   /* tan(pi/4) folded to the constant 1 */
        return opposite;
    }

SLIDE 49

MiDAS

Inputs: original bitcode, specialization inputs

  1. Instrumentation of Code → instrumented bitcode
  2. Code Execution → execution traces
  3. Data Chunk Extraction → extracted data chunks
  4. Specialization → specialized bitcode with data chunks

SLIDE 50

I/O Specialization

  • Replace I/O call & preserve functionality
  • Extracted file data in global variable → fileData
  • Copy from global variable to read buffer → memcpy
  • Update all I/O call variables → return value of read
  • I/O call instruction removed → read

Original bitcode:

    %94 = load i32, i32* %9, align 4
    %95 = sext i32 %94 to i64
    %96 = call i64 @read(i32 %92, i8* %93, i64 %95)
    %97 = trunc i64 %96 to i32
    store i32 %97, i32* %13, align 4
    %98 = load i32, i32* %12, align 4

Specialized bitcode:

    %94 = load i32, i32* %9, align 4
    %95 = sext i32 %94 to i64
    %96 = bitcast [17 x i8]* @fileData to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %93, i8* %96, i64 %95, i32 1, i1 false)
    %97 = alloca i64
    store i64 %95, i64* %97
    %loadRetVal = load i64, i64* %97
    %98 = trunc i64 %loadRetVal to i32
    store i32 %98, i32* %13, align 4
    %99 = load i32, i32* %12, align 4
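
At the source level, the same rewrite can be sketched roughly as follows. This is a hand-written C analogue of the bitcode transformation, not output produced by MiDAS: `specialized_read` is a hypothetical name, and the contents of `fileData` are a made-up 17-byte placeholder chosen only to match the `[17 x i8]` global shown above.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for the data chunk extracted from the input file;
 * 16 characters plus the terminating NUL give the 17 bytes of [17 x i8]. */
static const char fileData[17] = "extracted-chunk!";

/* Specialized replacement for `sz = read(fd, c, bytes)`:
 * copy from the embedded chunk instead of issuing the system call, and
 * return the requested byte count, as the specialized bitcode does when it
 * stores %95 as the return value. */
ssize_t specialized_read(void *buf, size_t count)
{
    assert(count <= sizeof fileData);   /* the sketch embeds exactly one chunk */
    memcpy(buf, fileData, count);
    return (ssize_t)count;
}
```

A caller that previously did `sz = read(fd, c, bytes)` would instead do `sz = specialized_read(c, bytes)`, so the variables that depended on the return value of `read` keep working while the file itself no longer needs to be shipped.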

SLIDE 51

Specializing I/O Calls in Scientific Libraries

[Figure: software stack, top to bottom: Python apps; NetCDF4-python module; NetCDF C lib; HDF5 C lib. Annotations: data access interface; fast I/O processing & storage; build from source, instrument, specialize I/O calls]

SLIDE 52

Results | Percentage of File Accessed

  • Larger files generated from 30 MB NetCDF data file
  • Rewriting data for multiple timesteps
  • Data accessed corresponding to temperature attribute

Total Size    Accessed Size
30 MB         6.6 MB
700 MB        154 MB
1.4 GB        0.3 GB
9 GB          1.98 GB
12.8 GB       2.82 GB

APPLICATIONS OFTEN ACCESS ONLY A SUBSET OF A LARGE DATASET
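
The percentage in the slide title follows directly from the two columns. As a small worked example (the function name `percent_accessed` is hypothetical and both sizes must be given in the same unit), the first row gives 6.6 / 30 = 22%, and the remaining rows work out to roughly 21% to 22% as well.

```c
#include <assert.h>

/* Share of a file that was accessed, in percent.
 * Both arguments must use the same unit (for example MB). */
double percent_accessed(double total_size, double accessed_size)
{
    assert(total_size > 0.0);
    return 100.0 * accessed_size / total_size;
}
```

For instance, `percent_accessed(30.0, 6.6)` and `percent_accessed(700.0, 154.0)` both evaluate to about 22, which suggests that about four fifths of each file would not need to be packaged for this access pattern.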

SLIDE 53

PART 3: Summary and Guidance

SLIDE 54

Summary

  • Technical debt affects reproducibility of scientific claims.
  • Process for evaluation of scientific claims is being rethought.
  • Artifact description and evaluation are becoming part of conferences
  • Better reliability is needed.
  • Containers will be a prominent choice but their reliability is poor
  • Dependencies must be specified
  • Inefficient to use
  • No guarantees for execution verification
  • Not meant for interactive programs
  • New light-weight methods: Sciunit, MiDAS

SLIDE 55

Use Sciunit for your next paper submission!

  1. Tools downloaded ~850 times (tracked using pip)
  2. 8 active contributors to the project
  3. Actively used in geoscience disciplines that develop computational models and data-analytic pipelines

Website: http://sciunit.run
Issues and contribution: pr@sciunit.run

SLIDE 56

Guidance for Improving Reproducibility

SLIDE 57

Guidance for Improving Reproducibility

  • Portability
  • Repeatability
  • Runnability
  • Replicability
  • Publishability

https://bssw.io/items?topic=reproducibility

[1] J. Freire, N. Fuhr, and A. Rauber. Reproducibility of data-oriented experiments in e-science (Dagstuhl seminar 16041). Dagstuhl Reports, 6(1):108–159, 2016. [Online; accessed 10-Sep-2017].

SLIDE 58

Guidance for Improving Reproducibility

  • Portability
  • Repeatability
  • Runnability
  • Replicability
  • Publishability

Artifact Review and Badging

[1] J. Freire, N. Fuhr, and A. Rauber. Reproducibility of data-oriented experiments in e-science (Dagstuhl seminar 16041). Dagstuhl Reports, 6(1):108–159, 2016. [Online; accessed 10-Sep-2017].

SLIDE 59

Guidance for Improving Reproducibility

Identify sources of irreproducibility:
  • Hardware
  • Concurrency
  • Algorithmic randomness
  • Application complexity
  • Execution State
  • Bugs outside the application

SLIDE 60

Guidance for Improving Reproducibility

  • Hardware
  • Concurrency
  • Algorithmic randomness
  • Application complexity
  • Execution State
  • Bugs outside the application

Metadata: Provenance, Annotations, Snapshots

https://bssw.io/items?topic=reproducibility

SLIDE 61

Guidance for Improving Reproducibility

  • Hardware
  • Concurrency
  • Algorithmic randomness
  • Application complexity
  • Execution State
  • Bugs outside the application

Methods for analyzing the metadata

https://bssw.io/items?topic=reproducibility

SLIDE 62

Acknowledgements

  • Yuta Nakamura, Ph.D. student
  • Jason Chuah, M.S. student; Ph.D. student @UVA
  • Raza Ahmad, Research Engineer
  • Nithin Manne, M.S. student
  • Zhihao Yuan, Research Engineer
  • Ton That Dai Hai, Postdoctoral Associate

SLIDE 63

Acknowledgements

  • Ian Foster, UChicago & ANL
  • Dave Tarboton, Utah State
  • Jon Goodall, Univ. of Virginia
  • Scott Peckham, Univ. of Colorado Boulder
  • Eunseo Choi, Univ. of Memphis
  • Ashish Gehani, SRI International

SLIDE 64

Acknowledgements | Funding

  • NSF CNS-1846418, ICER-1639759, ICER-1661918
  • BSSw Fellowship
  • Bloomberg Foundation
  • DePaul Seed Grants

SLIDE 65

Questions

  • tanu.malik@depaul.edu

SLIDE 66

Example

SLIDE 67

Result

SLIDE 68

Current and Future Work

  • Developing Sciunit audit and repeat with checkpoint-restart
  • Compute- and data-analytic models that vary several parameters and are reexecuted multiple times to test their reproducibility.
  • Useful for Jupyter Notebooks
  • Sciunit for reproducibility will provide provenance-based guarantees
  • Several cyberinfrastructures exist for Artifact Evaluation (OCCAM, CKFoundation)
  • Provenance-based guarantees are missing
  • Developing MiDAS for different inputs