Tackling the Reproducibility Problem in Systems Research with Declarative Experiment Specifications - PowerPoint PPT Presentation



SLIDE 1

Tackling the Reproducibility Problem in Systems Research with Declarative Experiment Specifications

Ivo Jimenez, Carlos Maltzahn (UCSC); Adam Moody, Kathryn Mohror (LLNL); Jay Lofstead (Sandia); Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau (UW-Madison)

SLIDE 2

The Reproducibility Problem

[Figure: Throughput (MB/s) vs. cluster size for the original experiment, with the question "Reproduced?"]

2

  • Network
  • Disks
  • BIOS
  • OS conf.
  • Magic numbers
  • Workload
  • Jitter
  • etc...

Goal: define a methodology so that we don’t end up in this situation

SLIDE 3

Outline

  • Re-execution vs. validation
  • Declarative Experiment Specification (ESF)
  • Case Study
  • Benefits & Challenges

3

SLIDE 4

Outline

  • Re-execution vs. validation
  • Declarative Experiment Specification (ESF)
  • Case Study
  • Benefits & Challenges

4

SLIDE 5

Reproducibility Workflow

  • 1. Re-execute experiment
    – Recreate original setup, re-execute experiments
    – A technical task
  • 2. Validate results
    – Compare against original
    – A subjective task

  • How do we express objective validation criteria?
  • What contextual information to include with results?

5

SLIDE 6

Experiment Goal: Show that my algorithm/system/etc. is better than the state-of-the-art.

[Diagram: the means of the experiment (hardware; code and workload data; OS; libs) produce raw data, from which observations and a figure are derived in support of the experiment goal.]

6

SLIDE 7

Outline

  • Re-execution vs. validation
  • Declarative Experiment Specification (ESF)
  • Case Study
  • Benefits & Challenges

7

SLIDE 8

Experiment Goal: Show that my algorithm/system/etc. is better than the state-of-the-art.

Means of Experiment

8

SLIDE 9

Validation Language Syntax

validation
  : 'for' condition ('and' condition)* 'expect' result ('and' result)*
  ;
condition
  : vars ('in' range | ('=' | '<' | '>' | '!=') value)
  ;
result
  : condition
  ;
vars
  : var (',' var)*
  ;
range
  : '[' range_num (',' range_num)* ']'
  ;
range_num
  : NUMBER '-' NUMBER | '*'
  ;
value
  : '*' | NUMBER (',' NUMBER)*
  ;

9
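As a rough illustration (ours, not the paper's tooling), a single-condition statement in this language can be evaluated mechanically against rows of experiment output; the sketch below also accepts `<=`/`>=`, which the examples later in the deck use:

```python
import operator
import re

# Rough sketch (ours, not the paper's tooling): evaluate a validation
# statement of the form "for VAR OP NUM expect VAR OP NUM" against
# rows of experiment output data.
OPS = {'=': operator.eq, '!=': operator.ne,
       '<': operator.lt, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}

CONDITION = re.compile(r'(\w+)\s*(<=|>=|!=|=|<|>)\s*([\d.]+)')

def parse(statement):
    """Split 'for ... expect ...' into a scope condition and a result."""
    scope, _, result = statement.partition('expect')
    return CONDITION.search(scope).groups(), CONDITION.search(result).groups()

def holds(statement, rows):
    """True iff every row selected by the 'for' clause satisfies 'expect'."""
    (cvar, cop, cval), (rvar, rop, rval) = parse(statement)
    selected = [r for r in rows if OPS[cop](r[cvar], float(cval))]
    return all(OPS[rop](r[rvar], float(rval)) for r in selected)

rows = [{'cluster_size': n, 'ceph': 58.0} for n in range(2, 29, 2)]
print(holds('for cluster_size <= 24 expect ceph >= 55', rows))  # True
```

A full implementation would parse the complete grammar (ranges, variable lists, multiple conditions); this only shows the evaluation idea.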

SLIDE 10

Outline

  • Re-execution vs. validation
  • Declarative Experiment Specification (ESF)
  • Case Study
  • Benefits & Challenges

10

SLIDE 11

Ceph OSDI ‘06

  • Select scalability experiment
    – Distributed; makes use of all resources
    – Main bottlenecks: I/O and network
  • Why this experiment?
    – Top conference
    – 10-year-old experiment
    – Ideal reproducibility conditions: access to authors, topic familiarity, same hardware
    – Even in an ideal scenario, we still struggle
  • Demonstrates which missing info is captured by an ESF!

11

SLIDE 12

Ceph OSDI ’06 Scalability Experiment: Schema of Experiment Output Data and Validation Statement

Validation statements (progressively refined):

for cluster_size <= 24 expect ceph >= 55 mb/s
for cluster_size <= 24 expect ceph >= (raw * .90)
for cluster_size = * and not net_saturated expect ceph >= (raw * .90)

Corresponding schemas:

"independent_variables": [{
  "type": "cluster_size",
  "values": "2-28"
}],
"dependent_variable": {
  "type": "throughput",
  "scale": "mb/s"
}

"independent_variables": [{
  "type": "cluster_size",
  "values": "2-28"
}, {
  "type": "method",
  "values": ["raw", "ceph"]
}],
"dependent_variable": {
  "type": "throughput",
  "scale": "mb/s"
}

"independent_variables": [{
  "type": "cluster_size",
  "values": "2-28"
}, {
  "type": "method",
  "values": ["raw", "ceph"]
}, {
  "type": "net_saturated",
  "values": ["true", "false"]
}],
"dependent_variable": {
  "type": "throughput",
  "scale": "mb/s"
}

[Figure: Per-OSD average throughput (MB/s) vs. cluster size (2-26).]

12
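To make the relative clause "for cluster_size <= 24 expect ceph >= (raw * .90)" concrete, here is a small Python check; the throughput numbers are made up for illustration only:

```python
# Made-up per-OSD throughput numbers (MB/s); only rows with
# cluster_size <= 24 are in scope for the validation.
rows = [
    {'cluster_size': 2,  'raw': 58.0, 'ceph': 57.0},
    {'cluster_size': 14, 'raw': 58.0, 'ceph': 56.0},
    {'cluster_size': 24, 'raw': 58.0, 'ceph': 54.0},
    {'cluster_size': 26, 'raw': 58.0, 'ceph': 40.0},  # out of scope
]

def ceph_within_ratio(rows, limit=24, ratio=0.90):
    """Ceph throughput stays within `ratio` of the raw-disk baseline
    for every cluster size up to `limit`."""
    return all(r['ceph'] >= r['raw'] * ratio
               for r in rows if r['cluster_size'] <= limit)

print(ceph_within_ratio(rows))  # True (54.0 >= 52.2)
```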

SLIDE 13

13

[Figure: Normalized per-OSD throughput vs. OSD cluster size, original vs. reproduced results.]

SLIDE 14

Benefits & Challenges

14

SLIDE 15

Why care about Reproducibility?

  • Good enough is not an excuse
    – We can always improve the state of our practice
    – How do we compare hardware/software in a scientific way?
  • Experimental Cloud Infrastructure
    – PRObE / CloudLab / Chameleon
    – Having reproducible/validated experiments would represent a significant step toward embodying the scientific method as a core component of these infrastructures

15

SLIDE 16

Benefits of ESF-based methodology

  • Brings falsifiability to our field

– Statements can be proven false

  • Automate validation

– Validation becomes an objective task

16

SLIDE 17

Validation Workflow

[Flowchart:
1. Obtain/recreate means of experiment.
2. Any significant differences between original and recreated means? If yes, update the means of the experiment; if the differences cannot be resolved, we cannot validate the original claims.
3. Re-run and check validation clauses against output.
4. Any validation failed? If no, the original work's findings are corroborated; if yes, we cannot validate them.]
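A minimal sketch of this loop in Python (names and structure are ours; the paper presents the workflow as a flowchart, not code):

```python
# Minimal sketch of the validation workflow: recreate the means,
# re-run the experiment, and check every validation clause on the output.
def validation_workflow(means_recreated_ok, run_experiment, clauses):
    if not means_recreated_ok:
        # Significant, unresolvable differences between original and
        # recreated means: the original claims cannot be validated.
        return 'cannot validate'
    output = run_experiment()
    if all(clause(output) for clause in clauses):
        return 'corroborated'  # original work's findings hold
    return 'cannot validate'   # at least one validation failed

verdict = validation_workflow(
    True,
    lambda: {'cluster_size': 20, 'raw': 58.0, 'ceph': 56.0},
    [lambda out: out['ceph'] >= out['raw'] * 0.90],
)
print(verdict)  # corroborated
```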

17

SLIDE 18

Benefits of ESF-based methodology

  • Brings falsifiability to our field

– Statements can be proven false

  • Automate validation

– Validation becomes an objective task

  • Usability

– We all do this anyway, albeit in an ad-hoc way

  • Integrate into existing infrastructure

18

SLIDE 19

Integration with Existing Infrastructure

19

[Diagram: today, developers push code and a CI system pulls it to run tests (unit, integration); with this methodology, developers push code and ESF, and the CI system additionally runs the ESF's validations.]

SLIDE 20

Challenges

  • Reproduce every time
    – Include sanity checks as part of experiment
    – Alternative: corroborate that network/disk observes expected behavior at runtime
  • Reproduce everywhere
    – Example: GCC’s flags, 10806 combinations
    – Alternative: provide image of complete software stack (e.g., Linux containers)

20

SLIDE 21

Conclusion

ESFs:

  • Embody all components of an experiment
  • Enable automation of result validation
  • Bring us closer to the scientific method
  • Our ideal future:
    – Researchers use ESFs to express a hypothesis
    – Toolkits for ESFs produce metadata-rich figures
    – Machine-readable evaluation sections

21

https://github.com/systemslab/esf

SLIDE 22

Thanks!

22

SLIDE 23

SILT SOSP ‘11 Experiment Goal Schema Validations

The high random read speed of flash drives means that the CPU budget available for each index operation is relatively limited.

This microbenchmark demonstrates that SILT’s indexes meet their design goal of computation-efficient indexing.

23

"independent_variables": [
  {
    "type": "method",
    "values": ["raw", "cuckoo", "trie"]
  },
  {
    "type": "workload",
    "values": ["individual", "bulk", "lookup"]
  }
],
"dependent_variable": {
  "type": "throughput",
  "scale": "bytes/second"
}

for workload=* expect cuckoo > raw and trie > raw
for lookup expect cuckoo > trie and for individual and bulk expect cuckoo > trie
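For instance, the clause "for workload=* expect cuckoo > raw and trie > raw" could be checked in Python like this; the throughput numbers are made up for illustration:

```python
# Made-up throughputs (bytes/second) per workload and method,
# purely to illustrate checking the first validation clause.
data = {
    'individual': {'raw': 1.0e6, 'cuckoo': 1.8e6, 'trie': 1.5e6},
    'bulk':       {'raw': 1.0e6, 'cuckoo': 2.0e6, 'trie': 1.6e6},
    'lookup':     {'raw': 1.0e6, 'cuckoo': 1.9e6, 'trie': 1.7e6},
}

# "for workload=*": the expectation must hold for every workload.
indexes_beat_raw = all(
    m['cuckoo'] > m['raw'] and m['trie'] > m['raw']
    for m in data.values()
)
print(indexes_beat_raw)  # True
```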

SLIDE 24

Geneiatakis et. al. CCS ‘12

24

SLIDE 25

In this section, our goal is to evaluate the performance benefits that can be reaped, by utilizing virtual partitioning to apply otherwise expensive protection mechanisms on the most exposed part of applications. This allows us to strike a balance between the overhead imposed on the application and its exposure to attacks.

Experiment Goal

25

SLIDE 26

In this section, our goal is to evaluate the performance benefits that can be reaped, by utilizing virtual partitioning to apply otherwise expensive protection mechanisms on the most exposed part of applications. This allows us to strike a balance between the overhead imposed on the application and its exposure to attacks.

Experiment Goal

26

SLIDE 27

Schema

27

SLIDE 28

Schema

"independent_variables": [
  {
    "type": "method",
    "alias": ["technique"],
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

28

SLIDE 29

Validations

"independent_variables": [
  {
    "type": "method",
    "alias": ["technique"],
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

29

SLIDE 30

Validations

"independent_variables": [
  {
    "type": "method",
    "alias": ["technique"],
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

expect native < any

30

SLIDE 31

Validations

"independent_variables": [
  {
    "type": "method",
    "alias": ["technique"],
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

expect native < any and

31

SLIDE 32

Validations

"independent_variables": [
  {
    "type": "method",
    "alias": ["technique"],
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

expect native < any and dta_pin between pin and isr

32

SLIDE 33

Validations

"independent_variables": [
  {
    "type": "method",
    "alias": ["technique"],
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

expect native < any and dta_pin between pin and isr and

33

SLIDE 34

Validations

"independent_variables": [
  {
    "type": "method",
    "alias": ["technique"],
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

expect native < any and dta_pin between pin and isr and dta_isr between isr and dta
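Reading "between a and b" as lying within the closed interval spanned by the two bounds (our interpretation, not spelled out in the deck), the full statement can be checked against hypothetical runtimes:

```python
# Hypothetical runtimes in seconds; the dependent variable is runtime,
# so "native < any" means native is the fastest technique.
runtimes = {'native': 1.0, 'pin': 3.0, 'isr': 1.5,
            'dta': 9.0, 'dta_pin': 2.0, 'dta_isr': 4.0}

def between(x, a, b):
    """x lies in the closed interval spanned by a and b."""
    return min(a, b) <= x <= max(a, b)

others = [v for k, v in runtimes.items() if k != 'native']
valid = (
    runtimes['native'] < min(others)  # native < any other runtime
    and between(runtimes['dta_pin'], runtimes['pin'], runtimes['isr'])
    and between(runtimes['dta_isr'], runtimes['isr'], runtimes['dta'])
)
print(valid)  # True
```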

34

SLIDE 35

Example 2

35

SLIDE 36

Example 2

36

SLIDE 37

Schema

"independent_variables": [
  {
    "type": "method",
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

37

SLIDE 38

Schema

"independent_variables": [
  {
    "type": "method",
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  },
  {
    "type": "workload",
    "values": ["ftp", "samba", "ssh"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

38

SLIDE 39

Validations

"independent_variables": [
  {
    "type": "method",
    "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
  },
  {
    "type": "workload",
    "values": ["ftp", "samba", "ssh"]
  }
],
"dependent_variable": {
  "type": "runtime",
  "scale": "s"
}

for workload=* expect native < any and dta_pin between pin and isr and dta_isr between isr and dta

39

SLIDE 40

Falsifiability in Science

Falsifiability of a statement, hypothesis, or theory is the inherent possibility of proving it false.

  • In other words, the ability to specify the conditions under which a statement is false
  • Synonymous with testability
  • Example:
    – Statement: All swans are white
    – Falsifiable: observe one black swan

source: en.wikipedia.org/wiki/Falsifiability

40

SLIDE 41

41

SLIDE 42

42

Experiment Goal: Show that my algorithm/system/etc. is better than the state-of-the-art.

[Diagram: the means of the experiment (hardware; code and workload data; OS; libs) produce raw data, from which observations and a figure are derived in support of the experiment goal.]

Falsifiability in Systems

SLIDE 43

Falsifiability in Systems

  • To falsify a claim:
    – Describe the means of the experiment
    – Provide validation statements over the output data
  • Conditional statement:
    – if the means are properly recreated,
    – then the validation statements should hold
  • Go from inert observations to falsifiable statements

From: "We observe that our system outperforms the alternatives."
To: "Expect 25-30% performance improvement on hardware platform X, on workload Y, when configured like Z."
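In the ESF validation language sketched earlier, such a falsifiable statement might be written as follows; X, Y, and Z are the placeholders from the text above, and the variable names (`hardware`, `workload`, `config`, `improvement`) are hypothetical, not from the paper:

```
for hardware=X and workload=Y and config=Z
expect improvement in [25-30]
```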

43

SLIDE 44

Early Feedback

Creating an ESF helps authors to:

  • Find meaningful/reproducible baselines
  • Create a feedback loop in the author’s mind
  • Specify exactly what the author means
  • Make temporal context explicit

44