Supporting Incremental Re-Computation with Whole System Provenance: Issues and Approaches
Ashish Gehani, SRI
Incremental Re-computation
- Supported via memoization
– Database query engines
– Workflow planners
– Software build systems
- Optimized with provenance
– Identify affected subgraph
- Forward slice from new inputs
– Identify dependencies
- Backward slice from affectees
- Whole system provenance
– Applies to range of applications
– Introduces variety of challenges
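The forward and backward slices named above can be sketched on a small provenance graph. This is a minimal illustration, not the implementation used in the cited work; the vertex names and the extra `file4`/`other`/`file5` chain are hypothetical, chosen only to show that unrelated vertices stay out of both slices.

```python
from collections import defaultdict

def build_graph(edges):
    """edges: (source, target) pairs meaning data flowed from source to target."""
    children, parents = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    return children, parents

def reachable(start, neighbors):
    """Vertices reachable from any vertex in `start` by following `neighbors`."""
    seen, stack = set(), list(start)
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical provenance: a process reads two files and writes a third.
edges = [
    ("file1", "proc"), ("file2", "proc"), ("proc", "file3"),
    ("file4", "other"), ("other", "file5"),  # unrelated activity
]
children, parents = build_graph(edges)

# Forward slice from the new input: everything that must be recomputed.
affected = reachable({"file1"}, children)

# Backward slice from the affectees: everything they depend on.
needed = reachable(affected, parents)

# Dependencies that are not themselves recomputed must be available as inputs.
dependencies = needed - affected
```

Here only `proc` and `file3` are recomputed, and `file2` is identified as an unchanged input that the re-computation still needs.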
[Figure: timeline of a process execution with open() and close() calls, reads of File 1 and File 2, and a write of File 3]
[Figure: an Operation vertex with Input 1 through Input n and an Output]
Ashish Gehani, Ulf Lindqvist, Bonsai: Balanced Lineage Authentication, ACSAC, 2007
Whole System Provenance
- Multiple possible approaches
– Dynamic binary instrumentation (e.g. Pin)
– Compiler-based transformation (e.g. LLVM)
– Library call interposition (e.g. LD_PRELOAD)
– Kernel hooks (e.g. LSM)
- Global view of monitored system
- Provenance inferred is sound
- … but often incomplete
– may suffice for
- staging
- diagnostics
- profiling
- authorization
– challenge for reproducibility
- partial coverage
- semantic gap
[Figure: provenance sources at multiple layers of abstraction]
Manual Curation: Ortholog Editor, Taxonomy Elements, PubMed Articles, Enzyme Annotations, Compound Structure Database
Application: function initialize(), function processData(), function errorHandler(), function writeBack(), function terminate(), int recordSize, var inputDatabase, var errorMsg, var outputDatabase
Workflow Manager: method beginWorkflow(), method feedback(), method processState(), method verifyState(), method updateLog(), file errorLog, file workflowStatus, file inputData
Operating System: System Startup, Network Adapter, Event Dispatcher, File System, System Registry, System Log, send(), recv(), read(), write(), getKey(), writeKey()
Distributed System: Authenticate, Grid Registry, Event Handler, Cloud Processor, Network File System, Resource Manager
Ashish Gehani, Dawood Tariq, SPADE: Support for Provenance Auditing in Distributed Environments, ACM Middleware, 2012
Issue: Ephemeral Intermediates
- Example:
– Software builds link objects into final binary
– Object files are then removed
– Memoization benefit is lost
– Provenance still useful, but intermediates must be regenerated
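The effect of removed intermediates on re-computation can be sketched as follows. This is an illustration under stated assumptions, not the build system's logic: the step names and the `steps_to_rerun` helper are hypothetical, and the file names only loosely follow the build example in this deck.

```python
# Hypothetical build: step name -> (input files, output file).
steps = {
    "compile_network": (["network.c"], "network.o"),
    "compile_protocol": (["protocol.c"], "protocol.o"),
    "link": (["network.o", "protocol.o"], "server"),
}

def steps_to_rerun(steps, changed, present):
    """Steps needed after `changed` files are modified, given files still `present`."""
    producer = {out: name for name, (_, out) in steps.items()}
    stale, rerun = set(changed), set()
    # Forward slice: a step is stale if any of its inputs is stale.
    grew = True
    while grew:
        grew = False
        for name, (inputs, output) in steps.items():
            if name not in rerun and stale.intersection(inputs):
                rerun.add(name)
                stale.add(output)
                grew = True
    # Backward pass: regenerate inputs that are needed but were deleted.
    pending = list(rerun)
    while pending:
        inputs, _ = steps[pending.pop()]
        for item in inputs:
            maker = producer.get(item)
            if maker and maker not in rerun and item not in present:
                rerun.add(maker)
                pending.append(maker)
    return rerun

sources = {"network.c", "protocol.c"}
objects_kept = steps_to_rerun(steps, {"network.c"}, sources | {"network.o", "protocol.o", "server"})
objects_removed = steps_to_rerun(steps, {"network.c"}, sources | {"server"})
```

With object files kept, only `compile_network` and `link` are re-run; once they are removed, `compile_protocol` must also be repeated even though `protocol.c` did not change.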
[Figure: provenance graph of a software build]
Processes: gcc (Pid 2160, Ppid 2159), cc1 (Pid 2161, Ppid 2160), as (Pid 2162, Ppid 2160), gcc (Pid 2163, Ppid 2159), cc1 (Pid 2164, Ppid 2163), as (Pid 2165, Ppid 2163), gcc (Pid 2169, Ppid 2159), collect2 (Pid 2170, Ppid 2169), ld (Pid 2171, Ppid 2170)
Files: network.c, protocol.c, ccC1p2KN.s, ccfELrl1.s, network.o, protocol.o
Legend: Modified artifact; Subgraph that needs to be recomputed
Approach: Maintain Data History
TABLE I: PERFORMANCE ANALYSIS

Workload  Strategy                                                 Operations  Improvement
Apache    Complete re-execution                                    63564
Apache    Provenance-based re-execution                            15701       75.3%
Apache    Snapshotting filesystem + Provenance-based re-execution  13595       78.61%
BLAST     Complete re-execution                                    48602
BLAST     Provenance-based re-execution                            9811        79.8%
BLAST     Snapshotting filesystem + Provenance-based re-execution  8391        82.73%
PostMark  Complete re-execution                                    57344
PostMark  Provenance-based re-execution                            14305       75.05%
PostMark  Snapshotting filesystem + Provenance-based re-execution  10031       82.5%

TABLE II: STORAGE OVERHEAD FOR PROVENANCE METADATA

Strategy                       Apache  BLAST   PostMark
Provenance-based re-execution  13 MB   8.7 MB  8.9 MB
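The Improvement column is not defined on the slide. The sketch below assumes it is the relative reduction in operations compared with complete re-execution, an interpretation that matches the reported percentages to within rounding. The dictionary layout and key names are illustrative only.

```python
# Operation counts copied from Table I.
operations = {
    "Apache": {"complete": 63564, "provenance": 15701, "snapshot+provenance": 13595},
    "BLAST": {"complete": 48602, "provenance": 9811, "snapshot+provenance": 8391},
    "PostMark": {"complete": 57344, "provenance": 14305, "snapshot+provenance": 10031},
}

def improvement(baseline, optimized):
    """Assumed definition: percentage of baseline operations that were avoided."""
    return 100.0 * (baseline - optimized) / baseline

for workload, counts in operations.items():
    for strategy in ("provenance", "snapshot+provenance"):
        pct = improvement(counts["complete"], counts[strategy])
        print(f"{workload:9s} {strategy:20s} {pct:.2f}%")
```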
Hasnain Lakhani, Rashid Tahir, Azeem Aqil, Fareed Zaffar, Dawood Tariq, Ashish Gehani, Optimized Rollback and Re-computation, IEEE HCSS, 2013
Issue: Dependency Conflation
- Often arises when:
– Instrumentation is at coarser level of abstraction
– Causality manifests at finer granularity
- Examples, using system calls:
– Web server sends a different file to each client
– Individual element of data archive utilized
- Implicated dependency subgraph explodes
- Much re-computation is unnecessarily performed
[Figure: system-call-level provenance graph of a web server]
Processes: terminal (pid 2043, ppid 1), bash (pid 2045, ppid 2043), bash (pid 5226, ppid 2045)
Files: httpd (path /var/httpd, size 14350), file1.html (path /var/htdocs/file1.html, size 1205), file2.html (path /var/htdocs/file2.html, size 8136), file3.html (path /var/htdocs/file3.html, size 7160)
Network flows: local_ip 192.168.1.3 to remote_ip 192.168.1.18, 192.168.1.25, and 192.168.1.7
Approach: Execution Partitioning
- Utilize finer-grained instrumentation
- Example, using function calls:
– Tracks web server’s input file ← output network flow dependency
[Figure: function-call-level provenance of a web server]
FunctionID:main.0.2000 FunctionName:main ThreadID:2000
FunctionID:accept_request.2.2000 FunctionName:accept_request ThreadID:2000
  ID:accept_request.2-0 ArgType:i32 ArgName:client ArgVal:5
FunctionID:serve_file.4.2000 FunctionName:serve_file ThreadID:2000
  ID:serve_file.4-0 ArgType:i32 ArgName:client ArgVal:5
  ID:serve_file.4-1 ArgType:i8* ArgName:filename ArgVal:0xbfa72afa
FunctionID:cat.13.2000 FunctionName:cat ThreadID:2000
  ID:cat.13-0 ArgType:i32 ArgName:client ArgVal:5
  ID:cat.13-1 ArgType:%struct._IO_FILE* ArgName:resource ArgVal:0x9789578
FunctionID:accept_request.12.2000 FunctionName:accept_request ThreadID:2000
  ID:accept_request.12-0 ArgType:i32 ArgName:client ArgVal:5
FunctionID:serve_file.8.2000 FunctionName:serve_file ThreadID:2000
  ID:serve_file.8-0 ArgType:i32 ArgName:client ArgVal:5
  ID:serve_file.8-1 ArgType:i8* ArgName:filename ArgVal:0x243cda99
FunctionID:cat.18.2000 FunctionName:cat ThreadID:2000
  ID:cat.18-0 ArgType:i32 ArgName:client ArgVal:5
  ID:cat.18-1 ArgType:%struct._IO_FILE* ArgName:resource ArgVal:0x6754193
FunctionID:accept_request.9.2000 FunctionName:accept_request ThreadID:2000
  ID:accept_request.9-0 ArgType:i32 ArgName:client ArgVal:5
FunctionID:serve_file.7.2000 FunctionName:serve_file ThreadID:2000
  ID:serve_file.7-0 ArgType:i32 ArgName:client ArgVal:5
  ID:serve_file.7-1 ArgType:i8* ArgName:filename ArgVal:0x9287c18d
FunctionID:cat.16.2000 FunctionName:cat ThreadID:2000
  ID:cat.16-0 ArgType:i32 ArgName:client ArgVal:5
  ID:cat.16-1 ArgType:%struct._IO_FILE* ArgName:resource ArgVal:0xaa12997d
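The difference between process-level and function-level dependencies can be sketched as below. This is an illustration, not the instrumentation described in the cited paper: the pairing of each `serve_file` invocation with a particular file and remote address is an assumption made for the example, and the helper names are hypothetical.

```python
from itertools import product

# Assumed trace: each serve_file invocation reads one file and writes to one client.
units = [
    {"unit": "serve_file.4", "reads": ["file1.html"], "writes": ["192.168.1.18"]},
    {"unit": "serve_file.8", "reads": ["file2.html"], "writes": ["192.168.1.25"]},
    {"unit": "serve_file.7", "reads": ["file3.html"], "writes": ["192.168.1.7"]},
]

def process_level(units):
    """Coarse view: the process read all files and wrote all flows, so all pairs are linked."""
    reads = {f for u in units for f in u["reads"]}
    writes = {c for u in units for c in u["writes"]}
    return set(product(reads, writes))

def unit_level(units):
    """Partitioned view: inputs are linked only to outputs of the same unit."""
    return {(f, c) for u in units for f, c in product(u["reads"], u["writes"])}

def affected_outputs(deps, changed_file):
    """Outputs that would be recomputed if `changed_file` is modified."""
    return {c for f, c in deps if f == changed_file}

coarse = affected_outputs(process_level(units), "file2.html")
fine = affected_outputs(unit_level(units), "file2.html")
```

At process granularity a change to `file2.html` implicates all three network flows; with partitioning only the flow served by the corresponding unit is implicated.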
Dawood Tariq, Maisem Ali, Ashish Gehani, Towards Automated Collection of Application-Level Data Provenance, TaPP, 2012
Issue: Changing Runtime Environment
- Application context is complex
- Code dependencies
– Linked libraries
– System services
– Utility programs
- Environmental dependencies
– Shell variables
– Shared memory contents
- Changes in any can affect output
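One way to make this concrete is to fold the runtime context into the memoization key, so that a cached result is reused only when neither the code dependencies nor the environment have changed. The sketch below is an assumption-laden illustration: `context_key`, `run`, and the example library contents are hypothetical, and real contexts (system services, shared memory) are harder to capture than byte strings and shell variables.

```python
import hashlib

def context_key(code_deps, environment):
    """Digest over code dependencies (name -> bytes) and environment (name -> str)."""
    digest = hashlib.sha256()
    for name in sorted(code_deps):
        digest.update(b"code\0" + name.encode() + b"\0")
        digest.update(hashlib.sha256(code_deps[name]).digest())
    for name in sorted(environment):
        digest.update(b"env\0" + name.encode() + b"\0" + environment[name].encode() + b"\0")
    return digest.hexdigest()

memo = {}

def run(task_input, code_deps, environment, compute):
    """Reuse a memoized result only if input and context are both unchanged."""
    key = (context_key(code_deps, environment), task_input)
    if key not in memo:
        memo[key] = compute(task_input)
    return memo[key]

# Hypothetical context: stand-in bytes for a linked library and a utility program.
libs = {"libc": b"libc build A", "util": b"utility v1"}
env = {"LANG": "C"}
calls = []

def compute(text):
    calls.append(text)  # record each real execution
    return text.upper()

run("data", libs, env, compute)                      # executed
run("data", libs, env, compute)                      # reused from memo
run("data", libs, {"LANG": "en_US.UTF-8"}, compute)  # environment changed: executed again
```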
Approach: Code and Context Closures
- Partition code into composable units (e.g. Docker layers)
– Dependencies minimized, reducing need for re-computation
- Virtualize application execution (e.g. Linux containers)
– Replicated runtime environment
Loic Gelle, Hassen Saidi, Ashish Gehani, Wholly!: A Build System For The Modern Software Stack, Lecture Notes in Computer Science, Vol. , Springer, 2018.
[Figure: building sqlite-3.18 with Wholly!]
sqlite-3.18 Wholly! recipe → sqlite-3.18 Dockerfile
Wholly! generates a Dockerfile. Docker builds the package into a container according to the Dockerfile:
1. Copy base building environment (including Wholly!-built Clang compiler)
2. Copy Wholly! subpackages that are required as dependencies
3. Download package source code
4. Execute build invocations: ./configure && make && make install
Container contents: Base build tools, Build dependencies, Source code, Generated build products
Issue: External Dependencies
- Assume code / context closures
- May still face challenges:
– User input
– Randomized choices
– Distributed computation
– Asynchronous interrupts
- Baseline strategy constructs models
– Significant engineering
– Error-prone
– May still be incomplete
[Figure: sequence of system calls interrupted by an exception handling routine]
Approach: Archive Non-deterministic Input
- Minimal set needed for re-computation
- Can regenerate all other artifacts
- Memoization policy can guide tradeoff between:
– Storage
– Incremental re-computation
- Example: In virtual machine device drivers, log only external inputs
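The idea of archiving only non-deterministic inputs can be sketched as a small record and replay log. This is a minimal illustration under stated assumptions, not the mechanism from the cited work: the `NondetLog` class and the `computation` function are hypothetical, and the user input is simulated with a constant.

```python
import random

class NondetLog:
    """Records external inputs on the first run and replays them afterwards."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.entries = list(recorded) if self.replaying else []
        self.position = 0

    def external(self, produce):
        """Call `produce` when recording; return the logged value when replaying."""
        if self.replaying:
            value = self.entries[self.position]
            self.position += 1
            return value
        value = produce()
        self.entries.append(value)
        return value

def computation(log):
    choice = log.external(lambda: random.randint(1, 100))  # randomized choice
    name = log.external(lambda: "alice")                    # stand-in for user input
    return f"{name}:{choice * 2}"                           # deterministic given the inputs

first = NondetLog()
original = computation(first)

# Only the two logged inputs are archived; the output is regenerated from them.
replayed = computation(NondetLog(first.entries))
```

Everything downstream of the logged values is deterministic, so it can be regenerated on demand rather than stored; a memoization policy would decide which such artifacts are worth keeping anyway.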
[Figure: provenance graph for the questionnaire process. Labels: Nondet, questionnaire, libc, Process, File. File paths: /bin/questionnaire, /usr/lib/libc, /home/user/medical_record, /home/user/filled_answers, /dev/tcp]
Ashish Gehani, Gabriela Ciocarlie, Natarajan Shankar, Accountable Clouds, IEEE HST, 2013
Conclusion
- Questions?
- Internships available!
- Provenance
- Reproducibility
- Virtualization
- Security
– Location: Silicon Valley, California
– Email: first.last@sri.com
- Acknowledgement