Knockoff: Cheap versions in the cloud
Xianzheng Dou, Peter M. Chen, Jason Flinn
Cloud-based storage
Xianzheng Dou 1
Google Drive, Microsoft OneDrive, Dropbox
Pros: ease of management, reliability
Challenges: storage costs, communication costs
Versioning increases costs
Xianzheng Dou 3
Versioning
Pros: recovery of lost data, auditing, troubleshooting
Google Drive, Microsoft OneDrive, Dropbox
Reducing costs: a new direction
- Established methods exploit similarities in data
– Chunk-based deduplication
– Delta compression
– Greater work for incremental gains
- Our goal: explore an orthogonal new dimension
– Deterministically recompute data in lieu of communication, storage
Xianzheng Dou 4
File: data or computation?
Xianzheng Dou 5
Computation File data
File: data or computation?
Xianzheng Dou 6
How can we address non-determinism?
[Figure: Computation → File data; nondeterminism can yield different output data]
File: data or computation?
Xianzheng Dou 7
- Deterministic record and replay
[Figure: Record captures logs of nondeterminism during the original execution; Replay re-executes using those logs]
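As a rough illustration of deterministic record and replay, the sketch below logs two nondeterministic inputs (the clock and a random number) on the first run and feeds them back on the second. The class `NondeterminismLog` and the function `build_file` are hypothetical names invented for this example; a real system records at the level of syscalls, thread synchronization, and signals rather than Python calls.

```python
import random
import time


class NondeterminismLog:
    """Records nondeterministic values on the first run and replays them later."""

    def __init__(self, entries=None):
        # entries is None in record mode; a list of logged values in replay mode.
        self.replaying = entries is not None
        self.entries = list(entries) if self.replaying else []
        self.cursor = 0

    def capture(self, source):
        """Return source() while recording, or the next logged value while replaying."""
        if self.replaying:
            value = self.entries[self.cursor]
            self.cursor += 1
            return value
        value = source()
        self.entries.append(value)
        return value


def build_file(log):
    """A toy computation whose output depends on the clock and a random number."""
    stamp = log.capture(time.time)
    nonce = log.capture(lambda: random.randint(0, 2**32))
    return f"built at {stamp} with nonce {nonce}\n"


# Record: run once, keeping the log of nondeterminism.
recording = NondeterminismLog()
original = build_file(recording)

# Replay: re-run with the log; the logged values replace the live sources.
replayed = build_file(NondeterminismLog(recording.entries))
```

Because every nondeterministic input comes from the log during replay, `replayed` matches `original`, so the log can stand in for the file data.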
Knockoff
- Selectively substitutes computation for data
- Benefits
– Reduction compared to chunk-based deduplication
- Communication costs: 21%
- Storage costs: 19%
– Benefits increase as we retain versions more frequently
– A new fine-grained versioning policy
Xianzheng Dou 8
Outline
- Introduction
- Writing files
- Storing files
- Evaluation
Xianzheng Dou 9
Knockoff
Xianzheng Dou 10
- Knockoff selectively represents a file as:
– Normal file data (by value), OR
– Logs of the nondeterminism needed to recompute the file (by operation)
An example log for compilation
Xianzheng Dou 11
#  Log entry      Values
1  open           rc=3
2  mmap           rc=<addr>,file=<id,version>
3  gettimeofday   rc=0,time=<time>
…  …
4  pthread_lock   rc=0
5  SIGCHLD
Logged nondeterminism: return values from syscalls, ordering of thread synchronization, signals
Writing files
Xianzheng Dou 13
[Figure: by value vs. by operation]
Writing files
Xianzheng Dou 14
[Figure: by value vs. by operation; example: photo editing]
Writing files
Xianzheng Dou 15
[Figure: by value vs. by operation; example: cryptographic key generation]
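The slides do not spell out how a write is assigned to one representation or the other. As a minimal sketch, assume the only cost is the number of bytes sent to the server, so the smaller of the two wins. The names `WriteCandidate` and `choose_representation`, and all sizes below, are hypothetical; a real policy would presumably also weigh the cost of recomputation.

```python
from dataclasses import dataclass


@dataclass
class WriteCandidate:
    name: str
    data_bytes: int  # size of the file contents
    log_bytes: int   # size of the log of nondeterminism that regenerates them


def choose_representation(candidate):
    """Pick whichever representation sends fewer bytes (assumed rule, not from the slides)."""
    if candidate.log_bytes < candidate.data_bytes:
        return "by operation"
    return "by value"


# Made-up sizes, for illustration only.
large_output = WriteCandidate("large_output", data_bytes=5_000_000, log_bytes=20_000)
small_output = WriteCandidate("small_output", data_bytes=256, log_bytes=4_000)

choices = {c.name: choose_representation(c) for c in (large_output, small_output)}
```

Under this assumed rule, a computation that produces a lot of data from a small log is sent by operation, and one that produces little data from a long log is sent by value.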
Outline
- Introduction
- Writing files
- Storing files
- Evaluation
Xianzheng Dou 17
Storing files
- Store files by value or by operation?
- A tradeoff between latency and costs
– Current versions: by value
– Past versions: by value or by operation
Xianzheng Dou 18
Storing past versions
Xianzheng Dou 19
- Maximum materialization delay
– Time bound for reconstructing any version
Regeneration time = 20s < Materialization delay = 60s
Storing past versions
Xianzheng Dou 20
- Maximum materialization delay
– Time bound for reconstructing any version
Regeneration time = 100s > Materialization delay = 60s
Storing past versions
Xianzheng Dou 21
- Maximum materialization delay
– Time bound for reconstructing any version
- Longest path > materialization delay
– Path 20s: total regeneration time = 20s < materialization delay = 60s
– Path 20s + 30s: total regeneration time = 50s < materialization delay = 60s
– Path 20s + 30s + 30s: total regeneration time = 80s > materialization delay = 60s
Storing past versions
Xianzheng Dou 25
- Maximum materialization delay
– Time bound for reconstructing any version
- Longest path > materialization delay
– A greedy algorithm
Materialization delay = 60s
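The slides name a greedy algorithm without describing it. The sketch below is one possible greedy strategy, not necessarily the authors' algorithm, and it makes three assumptions: the versions form a linear chain, each version is regenerated from the previous one, and the input to the first version is already available by value. The function `plan_chain` is a hypothetical name. It walks the chain and stores a version by value whenever the accumulated replay time would otherwise exceed the maximum materialization delay.

```python
def plan_chain(regen_times, max_delay):
    """Return 'value' or 'operation' for each version in a linear chain.

    regen_times[i] is the replay time (seconds) to regenerate version i from
    version i-1. Every version must be reconstructible within max_delay.
    """
    plan = []
    elapsed = 0  # replay time from the nearest by-value ancestor to the latest version
    for t in regen_times:
        if t > max_delay:
            # A single replay already exceeds the bound: keep this version by value.
            plan.append("value")
            elapsed = 0
        elif elapsed + t > max_delay:
            # The path would be too long: store the previous version by value
            # so that the path restarts there.
            plan[-1] = "value"
            elapsed = t
            plan.append("operation")
        else:
            elapsed += t
            plan.append("operation")
    return plan


print(plan_chain([20, 30, 30], 60))  # ['operation', 'value', 'operation']
print(plan_chain([100], 60))         # ['value']
```

With the 20s, 30s, 30s path and a 60s bound, storing the second version by value leaves every version reachable within 30s of replay. A real dependency structure is a graph rather than a chain, which this sketch does not handle.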
Storing past versions: versioning policies
Xianzheng Dou 26
- Frequency of versioning
[Figure: no versioning, version on close, version on write, eidetic versioning; annotations: memory-mapped files; any past transient state in memory?]
Optimization: log compression
- Chunk-based deduplication is effective for file data
– Executions of the same application have similar patterns
– Can it also be applied to computation (logs of nondeterminism)?
- Delta compression
Xianzheng Dou 28
Optimization: log compression
Xianzheng Dou 29
- Problem: a smattering of values differ in each log
- Delta compression: 42% reduction
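To illustrate why delta compression suits logs in which only a smattering of values differ, the sketch below encodes a log as the positions where it departs from a reference log of an earlier execution. This is a simple positional delta chosen for clarity, not necessarily the scheme the authors use. The names `delta_encode` and `delta_decode` and the sample entries are hypothetical.

```python
def delta_encode(reference, log):
    """Encode log as (length, changed entries, extra tail) relative to reference."""
    changes = [
        (i, new)
        for i, (old, new) in enumerate(zip(reference, log))
        if old != new
    ]
    tail = log[len(reference):]  # entries beyond the end of the reference log
    return len(log), changes, tail


def delta_decode(reference, delta):
    """Rebuild the original log from the reference log and its delta."""
    length, changes, tail = delta
    log = list(reference[:length]) + list(tail)
    for i, new in changes:
        log[i] = new
    return log


# Two executions of the same program: only the logged time differs.
reference = ["open rc=3", "gettimeofday rc=0,time=1000", "pthread_lock rc=0"]
log = ["open rc=3", "gettimeofday rc=0,time=1042", "pthread_lock rc=0"]

delta = delta_encode(reference, log)
restored = delta_decode(reference, delta)
```

Here the delta carries a single changed entry instead of the whole log, and `restored` equals `log`.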
Outline
- Introduction
- Writing files
- Storing files
- Evaluation
Xianzheng Dou 30
Evaluation
- How much does Knockoff reduce bandwidth usage?
- How much does Knockoff reduce storage costs?
- What is Knockoff’s performance overhead?
- For more experimental results, please refer to our paper
Xianzheng Dou 31
Experimental setup
- User study
– 8 participants performed several simple tasks in one hour
- 20-day study
– A single-user longitudinal study
- A variety of programs used
– Various Linux utilities, text editors and programming languages
Xianzheng Dou 32
Bandwidth usage: user study
Xianzheng Dou 33
[Figure: data sent to the server (MB, axis 100–500) under no versioning, version on close, version on write, and eidetic versioning; series: chunk-based deduplication and Knockoff]
Annotations: already achieve 80%-85% reduction; 24%, 25%, 47%
Variances across applications
Xianzheng Dou 35
[Figure: bandwidth savings (%, axis 10–60) under version on close, by application: Linux utility, graph editing, LibreOffice, programming, text editing, web browsing, total]
Variances across users
Xianzheng Dou 36
[Figure: bandwidth savings (%, axis 20–100) for users A to G; Knockoff, version on close]
Relative storage costs for past versions
Xianzheng Dou 37
[Figure: relative storage cost (axis 0.5–2.5) under version on close, version on write, and eidetic versioning; series: chunk-based deduplication and Knockoff]
Annotations: 19%, 23%
Performance overheads
Xianzheng Dou 38
[Figure: bar chart (axis 0.2–1.2, with a baseline marker) for no versioning, version on close, version on write, and eidetic versioning]
- 7-8% performance overheads
Conclusion
- A new dimension for reducing costs
- Selectively substitute computation for data
- A general-purpose system for deterministic recomputation
– Reduces storage and communication costs for existing versioning policies
– Enables eidetic versioning
Xianzheng Dou 39
Thank you!
Xianzheng Dou 40
Varying the materialization delay
Xianzheng Dou 41
Monetary costs
Xianzheng Dou 42
Workload characteristics
Xianzheng Dou 43