Tackling the Reproducibility Problem in Systems Research with Declarative Experiment Specifications
Ivo Jimenez, Carlos Maltzahn (UCSC); Adam Moody, Kathryn Mohror (LLNL); Jay Lofstead (Sandia); Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
The Reproducibility Problem
[Figure: throughput (MB/s) vs. cluster size for the original experiment, next to the question "Reproduced?"]
- Network
- Disks
- BIOS
- OS conf.
- Magic numbers
- Workload
- Jitter
- etc...
Goal: define a methodology so that we don't end up in this situation
Outline
- Re-execution vs. validation
- Declarative Experiment Specification (ESF)
- Case Study
- Benefits & Challenges
Reproducibility Workflow
- 1. Re-execute the experiment
– Recreate the original setup and re-execute the experiments
– A technical task
- 2. Validate the results
– Compare against the original
– A subjective task
- How do we express objective validation criteria?
- What contextual information should be included with results?
[Diagram: the means of the experiment (hardware, OS, libs, code, workload data) produce raw data, observations, and a figure. Experiment goal: show that my algorithm/system/etc. is better than the state of the art.]
Validation Language Syntax

validation
  : 'for' condition ('and' condition)* 'expect' result ('and' result)*
  ;
condition
  : vars ('in' range | ('=' | '<' | '>' | '!=') value)
  ;
result
  : condition
  ;
vars
  : var (',' var)*
  ;
range
  : '[' range_num (',' range_num)* ']'
  ;
range_num
  : NUMBER '-' NUMBER | '*'
  ;
value
  : '*' | NUMBER (',' NUMBER)*
  ;
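The grammar above can be exercised with a small evaluator. Below is a minimal sketch for a single-condition subset of the language; the `parse`/`holds` helpers and the row format are illustrative assumptions, not part of the ESF toolkit.

```python
# Minimal evaluator for a single-condition subset of the validation
# language: "for <var> <op> <num> expect <var> <op> <num>".
import operator
import re

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def parse(stmt):
    """Split a statement into (condition, result) triples of (var, op, value)."""
    m = re.match(r"for (\w+) (<=|>=|!=|[=<>]) (\S+) "
                 r"expect (\w+) (<=|>=|!=|[=<>]) (\S+)", stmt)
    g = m.groups()
    return (g[0], g[1], float(g[2])), (g[3], g[4], float(g[5]))

def holds(stmt, rows):
    """True iff the expectation holds on every row matching the condition."""
    (cv, cop, cval), (rv, rop, rval) = parse(stmt)
    matching = [r for r in rows if OPS[cop](r[cv], cval)]
    return all(OPS[rop](r[rv], rval) for r in matching)

rows = [{"cluster_size": n, "ceph": 58.0} for n in range(2, 29, 2)]
print(holds("for cluster_size <= 24 expect ceph >= 55", rows))  # True
```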
Ceph OSDI ‘06
- Selected the scalability experiment
– Distributed; makes use of all resources
– Main bottlenecks: I/O and network
- Why this experiment?
– Top conference
– 10-year-old experiment
– Ideal reproducibility conditions: access to authors, topic familiarity, same hardware
- Even in an ideal scenario, we still struggle
- Demonstrates which missing info is captured by an ESF!
Ceph OSDI '06 Scalability Experiment

Validation statements:
  for cluster_size <= 24 expect ceph >= 55 mb/s
  for cluster_size <= 24 expect ceph >= (raw * .90)

Schema of experiment output data (final form, after incrementally adding the method and net_saturated variables):

  "independent_variables": [
    { "type": "cluster_size", "values": "2-28" },
    { "type": "method", "values": ["raw", "ceph"] },
    { "type": "net_saturated", "values": ["true", "false"] }
  ],
  "dependent_variable": {
    "type": "throughput",
    "scale": "mb/s"
  }
for cluster_size <= 24 expect ceph >= (raw * .90)
for cluster_size = * and not net_saturated expect ceph >= (raw * .90)
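A minimal sketch of checking the second clause against tabular output; the data layout and function name are illustrative assumptions, not part of the ESF tooling.

```python
# Sketch: check "for cluster_size = * and not net_saturated expect
# ceph >= (raw * .90)" against experiment output.

def validate(samples, saturated):
    """samples: {cluster_size: {"raw": MB/s, "ceph": MB/s}};
    saturated: cluster sizes where the network was saturated (excluded)."""
    return all(s["ceph"] >= 0.90 * s["raw"]
               for n, s in samples.items() if n not in saturated)

samples = {n: {"raw": 60.0, "ceph": 57.0} for n in (2, 6, 10, 14, 18, 22)}
samples[26] = {"raw": 60.0, "ceph": 40.0}  # throughput collapses once the network saturates
print(validate(samples, saturated={26}))   # True: the saturated point is excluded
print(validate(samples, saturated=set()))  # False: 40 < 0.9 * 60
```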
[Figures: per-OSD average throughput (MB/s) vs. cluster size (original), and normalized per-OSD throughput vs. OSD cluster size (reproduced vs. original)]
Benefits & Challenges
Why care about Reproducibility?
- "Good enough" is not an excuse
– We can always improve the state of our practice
– How do we compare hardware/software in a scientific way?
- Experimental cloud infrastructure
– PRObE / CloudLab / Chameleon
– Reproducible, validated experiments would be a significant step toward embodying the scientific method as a core component of these infrastructures
Benefits of ESF-based methodology
- Brings falsifiability to our field
– Statements can be proven false
- Automates validation
– Validation becomes an objective task
Validation Workflow
[Flowchart] Obtain or recreate the means of the experiment. Are there any significant differences between the original and recreated means? If yes, update the means of the experiment; if that is not possible, the original claims cannot be validated. Otherwise, re-run and check the validation clauses against the output. If any validation fails, the claims cannot be validated; if none fail, the original work's findings are corroborated.
Benefits of ESF-based methodology
- Brings falsifiability to our field
– Statements can be proven false
- Automates validation
– Validation becomes an objective task
- Usability
– We all do this anyway, albeit in an ad hoc way
- Integrates into existing infrastructure
Integration with Existing Infrastructure
[Diagram: continuous-integration pipeline. Today: push code; CI pulls and runs unit and integration tests. With ESFs: push code and ESF; CI runs unit tests, integration tests, and validations.]
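Under this integration, the validations stage could run as an ordinary test. A hypothetical sketch follows; the `run_experiment` stand-in and test name are assumptions, and a real pipeline would re-execute the packaged experiment instead of returning canned rows.

```python
# Hypothetical CI "Validations" stage: re-run the experiment, then assert
# the ESF validation clauses over its output, alongside unit/integration tests.

def run_experiment():
    # Stand-in for re-executing the packaged experiment; yields output rows.
    return [{"cluster_size": n, "raw": 60.0, "ceph": 58.0} for n in range(2, 25, 2)]

def test_esf_validations():
    # Clause from the Ceph case study:
    #   for cluster_size <= 24 expect ceph >= (raw * .90)
    for row in run_experiment():
        if row["cluster_size"] <= 24:
            assert row["ceph"] >= 0.90 * row["raw"]

test_esf_validations()  # a CI runner (e.g., pytest) would discover this automatically
```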
Challenges
- Reproduce every time
– Include sanity checks as part of the experiment
– Alternative: corroborate at runtime that the network/disk observes expected behavior
- Reproduce everywhere
– Example: GCC's flags, 10806 combinations
– Alternative: provide an image of the complete software stack (e.g., Linux containers)
Conclusion
ESFs:
- Embody all components of an experiment
- Enable automation of result validation
- Bring us closer to the scientific method

Our ideal future:
– Researchers use ESFs to express a hypothesis
– Toolkits for ESFs produce metadata-rich figures
– Evaluation sections become machine-readable
https://github.com/systemslab/esf
Thanks!
SILT SOSP '11

Experiment goal: "The high random read speed of flash drives means that the CPU budget available for each index operation is relatively limited. This microbenchmark demonstrates that SILT's indexes meet their design goal of computation-efficient indexing."

Schema:
  "independent_variables": [
    { "type": "method", "values": ["raw", "cuckoo", "trie"] },
    { "type": "workload", "values": ["individual", "bulk", "lookup"] }
  ],
  "dependent_variable": {
    "type": "throughput",
    "scale": "bytes/second"
  }

Validations:
  for workload=* expect cuckoo > raw and trie > raw
  for lookup expect cuckoo > trie
  for individual and bulk expect cuckoo > trie
Geneiatakis et al. CCS '12

Experiment goal: "In this section, our goal is to evaluate the performance benefits that can be reaped, by utilizing virtual partitioning to apply otherwise expensive protection mechanisms on the most exposed part of applications. This allows us to strike a balance between the overhead imposed on the application and its exposure to attacks."
Schema:
  "independent_variables": [
    {
      "type": "method",
      "alias": ["technique"],
      "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
    }
  ],
  "dependent_variable": {
    "type": "runtime",
    "scale": "s"
  }
Validations:
  expect native < any and
  dta_pin between pin and isr and
  dta_isr between isr and dta
Example 2

Schema (adding a workload variable):
  "independent_variables": [
    { "type": "method", "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"] },
    { "type": "workload", "values": ["ftp", "samba", "ssh"] }
  ],
  "dependent_variable": {
    "type": "runtime",
    "scale": "s"
  }
Validations:
  for workload=* expect native < any and
  dta_pin between pin and isr and
  dta_isr between isr and dta
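The `between` clause reads as a per-workload range check. A small sketch follows; the runtimes and the `validate` helper are illustrative assumptions (all numbers are made up).

```python
# Sketch of evaluating "native < any and dta_pin between pin and isr and
# dta_isr between isr and dta" for every workload.

def between(x, lo, hi):
    return min(lo, hi) <= x <= max(lo, hi)

runs = {
    "ftp":   {"native": 1.0, "pin": 2.0, "isr": 3.0, "dta": 6.0,
              "dta_pin": 2.5, "dta_isr": 4.0},
    "samba": {"native": 1.0, "pin": 2.2, "isr": 3.1, "dta": 6.5,
              "dta_pin": 2.8, "dta_isr": 4.2},
}

def validate(runs):
    return all(
        r["native"] < min(v for k, v in r.items() if k != "native")  # native < any
        and between(r["dta_pin"], r["pin"], r["isr"])
        and between(r["dta_isr"], r["isr"], r["dta"])
        for r in runs.values()
    )

print(validate(runs))  # True for this sample data
```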
Falsifiability in Science
Falsifiability of a statement, hypothesis, or theory is the inherent possibility of proving it false.
- In other words, the ability to specify the conditions under which a statement is false
- Synonymous with testability
- Example:
– Statement: all swans are white
– Falsification: observe one black swan
source: en.wikipedia.org/wiki/Falsifiability
Falsifiability in Systems
- To falsify a claim:
– Describe the means of the experiment
– Provide validation statements over the output data
- Conditional statement:
– If the means are properly recreated,
– then the validation statements should hold
- Go from inert observations to falsifiable statements:
From: "We observe that our system outperforms the alternatives"
To: "Expect 25-30% performance improvement on hardware platform X, on workload Y, when configured like Z"
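The falsifiable form of the claim can be checked mechanically. In the sketch below, the 25-30% window and the platform/workload/configuration tags come from the slide; everything else (names, data shape) is an illustrative assumption.

```python
# A falsifiable claim as an executable check: "expect 25-30% performance
# improvement on platform X, workload Y, configuration Z".

def improvement(baseline, ours):
    return (ours - baseline) / baseline

def claim_holds(run):
    # If the means of the experiment were not recreated, the claim is
    # neither validated nor falsified.
    if (run["platform"], run["workload"], run["config"]) != ("X", "Y", "Z"):
        return None
    return 0.25 <= improvement(run["baseline"], run["ours"]) <= 0.30

print(claim_holds({"platform": "X", "workload": "Y", "config": "Z",
                   "baseline": 100.0, "ours": 128.0}))  # True (28% improvement)
```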
Early Feedback
Creating an ESF helps authors:
- Find meaningful, reproducible baselines
- Create a feedback loop in the author's mind
- Specify exactly what the author means
- Make temporal context explicit