Supporting Incremental Re-Computation with Whole System Provenance


SLIDE 1

Supporting Incremental Re-Computation with Whole System Provenance: Issues and Approaches

Ashish Gehani, SRI

SLIDE 2

Incremental Re-computation

  • Supported via memoization

    – Database query engines
    – Workflow planners
    – Software build systems

  • Optimized with provenance

    – Identify affected subgraph

        • Forward slice from new inputs

    – Identify dependencies

        • Backward slice from affectees

  • Whole system provenance

    – Applies to range of applications
    – Introduces variety of challenges

[Figure: timeline of a process execution. File 1 and File 2 are each opened with open(), read, and closed with close(); File 3 is opened, written, and closed. Schematic: an Operation maps Input 1 ... Input n to an Output.]

Ashish Gehani, Ulf Lindqvist, Bonsai: Balanced Lineage Authentication, ACSAC, 2007
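
The forward and backward slices named on this slide can be sketched as graph reachability. The sketch below is illustrative only: the vertex names, the edge list, and the `slice_from` helper are assumptions made for this example, not part of any tool cited in the slides.

```python
from collections import defaultdict, deque

# Hypothetical provenance edges (source -> dependent). The first three mirror
# the timeline on this slide; the rest are assumed for illustration.
EDGES = [
    ("file1", "process"),
    ("file2", "process"),
    ("process", "file3"),
    ("file3", "report_process"),
    ("config", "report_process"),
    ("report_process", "report"),
    ("other_input", "other_process"),   # unrelated computation
    ("other_process", "other_output"),
]

def build_adjacency(edges):
    """Index the edges in both directions."""
    forward, backward = defaultdict(set), defaultdict(set)
    for source, dependent in edges:
        forward[source].add(dependent)
        backward[dependent].add(source)
    return forward, backward

def slice_from(start_nodes, adjacency):
    """Return the start nodes plus everything reachable from them."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

forward, backward = build_adjacency(EDGES)

# Forward slice from the new input: the affected subgraph to recompute.
affected = slice_from({"file1"}, forward)

# Backward slice from the affected vertices: everything they depend on,
# including unchanged inputs that must still be available.
needed = slice_from(affected, backward)
```

In this toy graph the unrelated `other_*` vertices fall outside both slices, which is the saving that provenance offers over complete re-execution.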

SLIDE 3

Whole System Provenance

  • Multiple possible approaches

    – Dynamic binary instrumentation (e.g. Pin)
    – Compiler-based transformation (e.g. LLVM)
    – Library call interposition (e.g. LD_PRELOAD)
    – Kernel hooks (e.g. LSM)

  • Global view of monitored system
  • Provenance inferred is sound
  • … but often incomplete

    – may suffice for

        • staging
        • diagnostics
        • profiling
        • authorization

    – challenge for reproducibility

        • partial coverage
        • semantic gap

[Figure: provenance elements at successive layers of abstraction; labels recovered from the slide]

Manual Curation: Ortholog Editor, Taxonomy Elements, PubMed Articles, Enzyme Annotations, Compound Structure Database

Application: function initialize(), function processData(), function errorHandler(), function writeBack(), function terminate(), int recordSize, var inputDatabase, var errorMsg, var outputDatabase

Workflow Manager: method beginWorkflow(), method feedback(), method processState(), method verifyState(), method updateLog(), file errorLog, file workflowStatus, file inputData

Operating System: System Startup, Network Adapter, Event Dispatcher, File System, System Registry, System Log, send(), recv(), read(), write(), getKey(), writeKey()

Distributed System: Authenticate, Grid Registry, Event Handler, Cloud Processor, Network File System, Resource Manager

Ashish Gehani, Dawood Tariq, SPADE: Support for Provenance Auditing in Distributed Environments, ACM Middleware, 2012

SLIDE 4

Issue: Ephemeral Intermediates

  • Example:

    – Software builds link objects into final binary
    – Object files are then removed
    – Memoization benefit is lost
    – Provenance still useful, but intermediates must be regenerated

[Figure: provenance graph of a software build. Legend: "Subgraph that needs to be recomputed", "Modified Artifact"]

Process vertices:

Pidname: gcc       Pid: 2160  Ppid: 2159
Pidname: cc1       Pid: 2161  Ppid: 2160
Pidname: as        Pid: 2162  Ppid: 2160
Pidname: gcc       Pid: 2163  Ppid: 2159
Pidname: cc1       Pid: 2164  Ppid: 2163
Pidname: as        Pid: 2165  Ppid: 2163
Pidname: gcc       Pid: 2169  Ppid: 2159
Pidname: collect2  Pid: 2170  Ppid: 2169
Pidname: ld        Pid: 2171  Ppid: 2170

File vertices:

Filename: network.c
Filename: protocol.c
Filename: ccC1p2KN.s
Filename: ccfELrl1.s
Filename: network.o
Filename: protocol.o

SLIDE 5

Approach: Maintain Data History

TABLE I: PERFORMANCE ANALYSIS

Workload   Strategy                                                   Operations   Improvement
Apache     Complete re-execution                                      63564
Apache     Provenance-based re-execution                              15701        75.3%
Apache     Snapshotting filesystem + Provenance-based re-execution    13595        78.61%
BLAST      Complete re-execution                                      48602
BLAST      Provenance-based re-execution                              9811         79.8%
BLAST      Snapshotting filesystem + Provenance-based re-execution    8391         82.73%
PostMark   Complete re-execution                                      57344
PostMark   Provenance-based re-execution                              14305        75.05%
PostMark   Snapshotting filesystem + Provenance-based re-execution    10031        82.5%

TABLE II: STORAGE OVERHEAD FOR PROVENANCE METADATA

                                Apache   BLAST    PostMark
Provenance-based re-execution   13 MB    8.7 MB   8.9 MB

Hasnain Lakhani, Rashid Tahir, Azeem Aqil, Fareed Zaffar, Dawood Tariq, Ashish Gehani, Optimized Rollback and Re-computation, IEEE HCSS, 2013
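
The Improvement column of Table I appears to be the fraction of operations saved relative to complete re-execution, that is 1 − (operations / complete operations). That reading is an assumption, since the slide gives no definition. The sketch below recomputes the column from the operation counts so the assumption can be checked; the dictionary layout and function name are chosen for this example.

```python
# Operation counts copied from Table I.
OPERATIONS = {
    "Apache":   {"complete": 63564, "provenance": 15701, "snapshot+provenance": 13595},
    "BLAST":    {"complete": 48602, "provenance": 9811,  "snapshot+provenance": 8391},
    "PostMark": {"complete": 57344, "provenance": 14305, "snapshot+provenance": 10031},
}

def improvement(complete_ops, optimized_ops):
    """Percentage of operations avoided relative to complete re-execution."""
    return 100.0 * (1.0 - optimized_ops / complete_ops)

# Recompute the Improvement column for every workload and strategy.
recomputed = {
    workload: {
        strategy: improvement(counts["complete"], ops)
        for strategy, ops in counts.items()
        if strategy != "complete"
    }
    for workload, counts in OPERATIONS.items()
}

for workload, strategies in recomputed.items():
    for strategy, percent in strategies.items():
        print(f"{workload:9s} {strategy:20s} {percent:6.2f}%")
```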

SLIDE 6

Issue: Dependency Conflation

  • Often arises when:

    – Instrumentation is at coarser level of abstraction
    – Causality manifests at finer granularity

  • Examples, using system calls:

    – Web server sends a different file to each client
    – Individual element of data archive utilized

  • Implicated dependency subgraph explodes
  • Much re-computation is unnecessarily performed

[Figure: provenance graph vertices recovered from the slide]

process:terminal pid:2043 ppid:1
process:bash pid:2045 ppid:2043
process:bash pid:5226 ppid:2045
filename:httpd path:/var/httpd size:14350
filename:file1.html path:/var/htdocs/file1.html size:1205
filename:file2.html path:/var/htdocs/file2.html size:8136
filename:file3.html path:/var/htdocs/file3.html size:7160
local_ip:192.168.1.3 remote_ip:192.168.1.18
local_ip:192.168.1.3 remote_ip:192.168.1.25
local_ip:192.168.1.3 remote_ip:192.168.1.7

SLIDE 7

Approach: Execution Partitioning

  • Utilize finer-grained instrumentation
  • Example, using function calls:

– Tracks web server’s input file ← output network flow dependency

FunctionID:main.0.2000 FunctionName:main ThreadID:2000

FunctionID:accept_request.2.2000 FunctionName:accept_request ThreadID:2000
    ID:accept_request.2-0 ArgType:i32 ArgName:client ArgVal:5
FunctionID:serve_file.4.2000 FunctionName:serve_file ThreadID:2000
    ID:serve_file.4-0 ArgType:i32 ArgName:client ArgVal:5
    ID:serve_file.4-1 ArgType:i8* ArgName:filename ArgVal:0xbfa72afa
FunctionID:cat.13.2000 FunctionName:cat ThreadID:2000
    ID:cat.13-0 ArgType:i32 ArgName:client ArgVal:5
    ID:cat.13-1 ArgType:%struct._IO_FILE* ArgName:resource ArgVal:0x9789578

FunctionID:accept_request.12.2000 FunctionName:accept_request ThreadID:2000
    ID:accept_request.12-0 ArgType:i32 ArgName:client ArgVal:5
FunctionID:serve_file.8.2000 FunctionName:serve_file ThreadID:2000
    ID:serve_file.8-0 ArgType:i32 ArgName:client ArgVal:5
    ID:serve_file.8-1 ArgType:i8* ArgName:filename ArgVal:0x243cda99
FunctionID:cat.18.2000 FunctionName:cat ThreadID:2000
    ID:cat.18-0 ArgType:i32 ArgName:client ArgVal:5
    ID:cat.18-1 ArgType:%struct._IO_FILE* ArgName:resource ArgVal:0x6754193

FunctionID:accept_request.9.2000 FunctionName:accept_request ThreadID:2000
    ID:accept_request.9-0 ArgType:i32 ArgName:client ArgVal:5
FunctionID:serve_file.7.2000 FunctionName:serve_file ThreadID:2000
    ID:serve_file.7-0 ArgType:i32 ArgName:client ArgVal:5
    ID:serve_file.7-1 ArgType:i8* ArgName:filename ArgVal:0x9287c18d
FunctionID:cat.16.2000 FunctionName:cat ThreadID:2000
    ID:cat.16-0 ArgType:i32 ArgName:client ArgVal:5
    ID:cat.16-1 ArgType:%struct._IO_FILE* ArgName:resource ArgVal:0xaa12997d

Dawood Tariq, Maisem Ali, Ashish Gehani, Towards Automated Collection of Application-Level Data Provenance, TaPP, 2012
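
The contrast between process-level conflation and partitioned execution can be sketched with a toy event list. The file paths and client addresses come from the web server example in these slides, but the pairing of files to clients, the `request-N` unit labels, and the `dependencies` helper are assumptions for illustration. The sketch also ignores event ordering, which a real analysis would respect.

```python
from collections import defaultdict

# (unit, operation, artifact). A unit is one request-handling invocation.
# Which file goes to which client is assumed, not taken from the slides.
EVENTS = [
    ("request-1", "read", "/var/htdocs/file1.html"),
    ("request-1", "send", "192.168.1.18"),
    ("request-2", "read", "/var/htdocs/file2.html"),
    ("request-2", "send", "192.168.1.25"),
    ("request-3", "read", "/var/htdocs/file3.html"),
    ("request-3", "send", "192.168.1.7"),
]

def dependencies(events, partitioned):
    """Map each output to the inputs it is inferred to depend on.

    Without partitioning, all events share one scope (the whole process),
    so every output is linked to every input the process ever read.
    """
    inputs_by_scope = defaultdict(set)
    outputs_by_scope = defaultdict(set)
    for unit, operation, artifact in events:
        scope = unit if partitioned else "httpd"
        if operation == "read":
            inputs_by_scope[scope].add(artifact)
        else:
            outputs_by_scope[scope].add(artifact)
    return {
        output: set(inputs_by_scope[scope])
        for scope, outputs in outputs_by_scope.items()
        for output in outputs
    }

coarse = dependencies(EVENTS, partitioned=False)  # system-call view
fine = dependencies(EVENTS, partitioned=True)     # function-call view
```

With the coarse view, a change to any one file implicates all three network flows; with the partitioned view, only the flow served by that request is affected.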

SLIDE 8

Issue: Changing Runtime Environment

  • Application context is complex
  • Code dependencies

    – Linked libraries
    – System services
    – Utility programs

  • Environmental dependencies

    – Shell variables
    – Shared memory contents

  • Changes in any can affect output
SLIDE 9

Approach: Code and Context Closures

  • Partition code into composable units (e.g. Docker layers)

– Dependencies minimized, reducing need for re-computation

  • Virtualize application execution (e.g. Linux containers)

– Replicated runtime environment

Loic Gelle, Hassen Saidi, Ashish Gehani, Wholly!: A Build System For The Modern Software Stack, Lecture Notes in Computer Science, Vol. , Springer, 2018.

[Figure: building sqlite-3.18 with Wholly!. Inputs and outputs: sqlite-3.18 Wholly! recipe, sqlite-3.18 Dockerfile]

Wholly! generates a Dockerfile. Docker builds the package into a container according to the Dockerfile:

1. Copy base building environment (including Wholly!-built Clang compiler)
2. Copy Wholly! subpackages that are required as dependencies
3. Download package source code
4. Execute build invocations: ./configure && make && make install

Container contents: base build tools, build dependencies, source code, generated build products
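
The four steps above can be sketched as a small generator that emits Dockerfile text. This is not Wholly!'s actual recipe format or output. The function name, image names, install prefix, and source URL below are all hypothetical; only the Dockerfile instructions themselves (FROM, COPY --from, ADD, WORKDIR, RUN) are standard.

```python
def render_dockerfile(package, base_image, dependency_images, source_url, build_commands):
    """Emit Dockerfile text following the four steps on this slide."""
    archive = f"/src/{package}.tar.gz"
    # Step 1: start from the base building environment.
    lines = [f"FROM {base_image}"]
    # Step 2: copy each dependency's installed files out of its own image.
    # The /usr/local prefix is an assumption of this sketch.
    for image in dependency_images:
        lines.append(f"COPY --from={image} /usr/local /usr/local")
    # Step 3: download the package source code and unpack it.
    lines.append(f"ADD {source_url} {archive}")
    lines.append("WORKDIR /src")
    lines.append(f"RUN tar -xzf {archive} --strip-components=1")
    # Step 4: execute the build invocations as a single layer.
    lines.append("RUN " + " && ".join(build_commands))
    return "\n".join(lines) + "\n"

dockerfile_text = render_dockerfile(
    package="sqlite-3.18",
    base_image="example/base-build-env",            # hypothetical image name
    dependency_images=["example/zlib"],             # hypothetical dependency
    source_url="https://example.org/sqlite-3.18.tar.gz",  # placeholder URL
    build_commands=["./configure", "make", "make install"],
)
print(dockerfile_text)
```

Because each dependency arrives from its own image, a change to one package only invalidates the layers of packages that copy from it, which is the reduction in re-computation the slide describes.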

SLIDE 10

Issue: External Dependencies

  • Assume code / context closures
  • May still face challenges:

    – User input
    – Randomized choices
    – Distributed computation
    – Asynchronous interrupts

  • Baseline strategy constructs models

    – Significant engineering
    – Error-prone
    – May still be incomplete

[Figure: a sequence of system calls interrupted by an exception handling routine]

SLIDE 11

Approach: Archive Non-deterministic Input

  • Minimal set needed for re-computation
  • Can regenerate all other artifacts
  • Memoization policy can guide tradeoff between:

    – Storage
    – Incremental re-computation

  • Example: In virtual machine device drivers, log only external inputs

[Figure: provenance graph with a "Nondet" label. Process vertex: questionnaire. File vertices: /home/user/filled_answers, /dev/tcp, /home/user/medical_record, /bin/questionnaire, /usr/lib/libc]

Ashish Gehani, Gabriela Ciocarlie, Natarajan Shankar, Accountable Clouds, IEEE HST, 2013
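
Archiving only the non-deterministic inputs can be sketched as a record-and-replay log. The `NondetLog` class and the toy `questionnaire` function are hypothetical, loosely named after the figure on this slide, and a random choice stands in for real user input.

```python
import json
import random

class NondetLog:
    """Record non-deterministic values on the first run, replay them later."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.entries = list(recorded) if self.replaying else []
        self._cursor = 0

    def capture(self, label, produce):
        """Return a recorded value when replaying, else produce and log one."""
        if self.replaying:
            recorded_label, value = self.entries[self._cursor]
            if recorded_label != label:
                raise ValueError(f"expected {recorded_label!r}, got {label!r}")
            self._cursor += 1
            return value
        value = produce()
        self.entries.append((label, value))
        return value

def questionnaire(log):
    """Toy computation with two external inputs; the rest is deterministic."""
    # A random choice stands in for user input in this sketch.
    answer = log.capture("user_answer", lambda: random.choice(["yes", "no"]))
    seed = log.capture("random_seed", lambda: random.randrange(1_000_000))
    rng = random.Random(seed)  # deterministic once the seed is fixed
    return {"answer": answer, "sample": rng.randint(0, 99)}

# First run: record. Only the two logged values need to be stored.
recording = NondetLog()
original_output = questionnaire(recording)
archive = json.dumps(recording.entries)

# Later run: replay from the archive and regenerate the same output.
replayed_output = questionnaire(NondetLog(json.loads(archive)))
```

Everything derived from the two logged values is regenerated on replay, so the storage cost is the log itself rather than every intermediate artifact.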

SLIDE 12

Conclusion

  • Questions?
  • Internships available!

    – Provenance
    – Reproducibility
    – Virtualization
    – Security
    – Location: Silicon Valley, California
    – Email: first.last@sri.com

  • Acknowledgement

    – National Science Foundation Grant ACI-1440800