Knockoff: Cheap versions in the cloud (Xianzheng Dou, Peter M. Chen, Jason Flinn)


slide-1
SLIDE 1

Knockoff: Cheap versions in the cloud

Xianzheng Dou, Peter M. Chen, Jason Flinn

slide-2
SLIDE 2

Cloud-based storage

Xianzheng Dou 1

Examples: Google Drive, Microsoft OneDrive, Dropbox

Pros: ease of management, reliability

slide-3
SLIDE 3

Cloud-based storage

Challenges: storage costs, communication costs

slide-4
SLIDE 4

Versioning increases costs

Versioning

Pros: recovery of lost data, auditing, troubleshooting

slide-5
SLIDE 5

Reducing costs: a new direction

  • Established methods exploit similarities in data

– Chunk-based deduplication
– Delta compression
– Greater work for incremental gains

  • Our goal: explore an orthogonal new dimension

– Deterministically recompute data in lieu of communication, storage
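Chunk-based deduplication, the established method named on this slide, can be sketched as follows. This is a minimal illustration, not the scheme used by any particular provider or by Knockoff: the fixed 4 KiB chunk size, the in-memory `chunk_store` dictionary, and the function names are all assumptions made for the example (real systems often pick chunk boundaries from the content itself).

```python
import hashlib
from typing import Dict, List


def store_version(data: bytes, chunk_store: Dict[str, bytes],
                  chunk_size: int = 4096) -> List[str]:
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Returns a "recipe": the ordered list of chunk digests for this version.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        # Only chunks that have never been seen consume new storage.
        chunk_store.setdefault(digest, chunk)
        recipe.append(digest)
    return recipe


def load_version(recipe: List[str], chunk_store: Dict[str, bytes]) -> bytes:
    """Rebuild a version by concatenating its chunks in recipe order."""
    return b"".join(chunk_store[digest] for digest in recipe)
```

Two versions that share a chunk store it only once, which is the "similarities in data" this slide refers to.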

slide-6
SLIDE 6

File: data or computation?

[Figure labels: Computation, File data]

slide-12
SLIDE 12

File: data or computation?

How can we address non-determinism?

[Figure labels: Computation, File data, Different output data]

slide-13
SLIDE 13

File: data or computation?

  • Deterministic record and replay

[Figure: Record phase, producing logs of nondeterminism]

slide-16
SLIDE 16

File: data or computation?

  • Deterministic record and replay

[Figure: Record phase produces logs of nondeterminism; Replay phase re-executes using those logs]
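The idea of deterministic record and replay can be illustrated with a small sketch. It is only an analogy at the level of a Python function, not how Knockoff records a process: the `Recorder` and `Replayer` classes, the `call` method, and `build_file` are hypothetical names invented for the example.

```python
import random
import time


class Recorder:
    """Runs nondeterministic calls for real and logs each result."""

    def __init__(self):
        self.log = []

    def call(self, name, fn, *args):
        value = fn(*args)
        self.log.append((name, value))
        return value


class Replayer:
    """Returns logged results instead of re-running nondeterministic calls."""

    def __init__(self, log):
        self._entries = iter(log)

    def call(self, name, fn, *args):
        logged_name, value = next(self._entries)
        if logged_name != name:
            raise RuntimeError(f"replay diverged: expected {logged_name}, got {name}")
        return value


def build_file(env) -> str:
    """A computation whose output depends on the time and a random number."""
    stamp = env.call("time", time.time)
    nonce = env.call("randint", random.randint, 0, 10**6)
    return f"built at {stamp} with nonce {nonce}\n"


recorder = Recorder()
original = build_file(recorder)                # record
replayed = build_file(Replayer(recorder.log))  # replay
```

Because every nondeterministic input comes from the log during replay, `replayed` is identical to `original`, so the log can stand in for the file data.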

slide-17
SLIDE 17

Knockoff

  • Selectively substitutes computation for data
  • Benefits

– Reduction compared to chunk-based deduplication

    • Communication costs: 21%
    • Storage costs: 19%

– Benefits increase as versions are retained more frequently
– A new fine-grained versioning policy

slide-18
SLIDE 18

Outline

  • Introduction
  • Writing files
  • Storing files
  • Evaluation

slide-19
SLIDE 19

Knockoff

  • Knockoff selectively represents a file as:

– Normal file data (by value), OR
– Logs of the nondeterminism needed to recompute the file (by operation)
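The two representations can be pictured as a small tagged union. This is a sketch under assumptions, not Knockoff's data structures: the class names, the `replay` callable (standing in for deterministic re-execution), and `materialize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ByValue:
    data: bytes  # the file contents themselves


@dataclass
class ByOperation:
    log: List[str]                        # logged nondeterminism
    replay: Callable[[List[str]], bytes]  # hypothetical deterministic re-execution


FileVersion = Union[ByValue, ByOperation]


def materialize(version: FileVersion) -> bytes:
    """Return the file contents, recomputing them if stored by operation."""
    if isinstance(version, ByValue):
        return version.data
    return version.replay(version.log)
```

Either form yields the same bytes; they differ in how much must be stored or sent and in how long reading the file takes.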

slide-23
SLIDE 23

An example log for compilation

Log entry         Values
1  open           rc=3
2  mmap           rc=<addr>,file=<id,version>
3  gettimeofday   rc=0,time=<time>
…  …
4  pthread_lock   rc=0
5  SIGCHLD

slide-24
SLIDE 24

An example log for compilation

The log captures three kinds of nondeterminism:

– Return values from syscalls (entries 1-3)
– Ordering of thread synchronization (entry 4)
– Signals (entry 5)

slide-27
SLIDE 27

Writing files

[Figure: writing a file by value vs. by operation]

slide-30
SLIDE 30

Writing files

[Figure: by value vs. by operation; example: photo editing]

slide-31
SLIDE 31

Writing files

[Figure: by value vs. by operation; example: cryptographic key generation]

slide-33
SLIDE 33

Storing files

  • Store files by value or by operation?
  • A tradeoff between latency and costs

– Current versions: by value
– Past versions: by value or by operation

slide-34
SLIDE 34

Storing past versions

  • Maximum materialization delay

– Time bound for reconstructing any version

Example: regeneration time = 20s < materialization delay = 60s

slide-36
SLIDE 36

Storing past versions

  • Maximum materialization delay

– Time bound for reconstructing any version

Example: regeneration time = 100s > materialization delay = 60s

slide-38
SLIDE 38

Storing past versions

  • Maximum materialization delay

– Time bound for reconstructing any version

  • Longest path > materialization delay

Example: path of 20s; total regeneration time = 20s < materialization delay = 60s

slide-39
SLIDE 39

Storing past versions

  • Maximum materialization delay

– Time bound for reconstructing any version

  • Longest path > materialization delay

Example: path of 20s + 30s; total regeneration time = 50s < materialization delay = 60s

slide-40
SLIDE 40

Storing past versions

  • Maximum materialization delay

– Time bound for reconstructing any version

  • Longest path > materialization delay

Example: path of 20s + 30s + 30s; total regeneration time = 80s > materialization delay = 60s

slide-43
SLIDE 43

Storing past versions

  • Maximum materialization delay

– Time bound for reconstructing any version

  • Longest path > materialization delay

– A greedy algorithm

Materialization delay = 60s
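The slides do not spell out the greedy algorithm, so the following is only one plausible sketch of the idea, not the authors' method. It assumes a single chain of past versions, each regenerated from its predecessor, and stores a version by value whenever replaying up to it would exceed the bound. The function name and the chain model are assumptions.

```python
from typing import List


def choose_by_value(regen_times: List[float], max_delay: float) -> List[int]:
    """Pick which versions in a dependency chain to store by value.

    regen_times[i] is the time to regenerate version i from version i-1.
    A version is stored by value when the accumulated replay time along the
    chain would otherwise exceed max_delay; the path then restarts from it.
    """
    by_value = []
    path_time = 0.0
    for index, regen in enumerate(regen_times):
        if path_time + regen > max_delay:
            by_value.append(index)  # materialize this version
            path_time = 0.0         # later versions replay from here
        else:
            path_time += regen
    return by_value


# The chain from the slides: 20s + 30s + 30s = 80s exceeds the 60s bound.
print(choose_by_value([20, 30, 30], 60))  # → [2]
```

With a single 100s version and the same 60s bound, the sketch returns `[0]`, matching the earlier case where regeneration time alone exceeds the materialization delay.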

slide-44
SLIDE 44

Storing past versions: versioning policies

  • Frequency of versioning

slide-45
SLIDE 45

Storing past versions: versioning policies

  • Frequency of versioning

Policies, from least to most frequent: No versioning, Version on close, Version on write, Eidetic versioning

slide-46
SLIDE 46

Storing past versions: versioning policies

  • Frequency of versioning

[Figure: spectrum of versioning frequency: No versioning, Version on close, Version on write, Eidetic versioning; annotations: "Memory-mapped files", "Any past transient state in memory?"]

slide-47
SLIDE 47

Optimization: log compression

  • Chunk-based deduplication is effective for file data

– Executions of the same application have similar patterns
– Can it also be applied to computation (logs of nondeterminism)?

  • Delta compression

slide-48
SLIDE 48

Optimization: log compression

  • Problem: a smattering of values differ in each log

slide-49
SLIDE 49

Optimization: log compression

  • Problem: a smattering of values differ in each log
  • Delta compression: 42% reduction
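Delta compression of a log against a similar reference log can be sketched with Python's standard `difflib`. This is an illustration of the general technique, not Knockoff's encoder: the copy/insert operation format and the function names are assumptions, and the sample entries only mimic the example log shown earlier.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def make_delta(reference: List[str], target: List[str]) -> List[Tuple]:
    """Encode target as copy/insert operations against reference."""
    ops = []
    matcher = SequenceMatcher(a=reference, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # reuse reference[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # entries unique to target
        # "delete": reference entries absent from target need no operation
    return ops


def apply_delta(reference: List[str], ops: List[Tuple]) -> List[str]:
    """Rebuild the target log from the reference log and the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(reference[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


reference_log = ["open rc=3", "mmap rc=0x7f00",
                 "gettimeofday rc=0,time=100", "pthread_lock rc=0"]
target_log = ["open rc=3", "mmap rc=0x7f00",
              "gettimeofday rc=0,time=105", "pthread_lock rc=0"]
delta = make_delta(reference_log, target_log)
```

Here only the `gettimeofday` entry differs, so the delta carries one new entry plus two copy ranges rather than the whole log.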

slide-51
SLIDE 51

Evaluation

  • How much does Knockoff reduce bandwidth usage?
  • How much does Knockoff reduce storage costs?
  • What is Knockoff’s performance overhead?
  • For more experimental results, please refer to our paper

slide-52
SLIDE 52

Experimental setup

  • User study

– 8 participants performed several simple tasks in one hour

  • 20-day study

– A single-user longitudinal study

  • A variety of programs used

– Various Linux utilities, text editors and programming languages

slide-53
SLIDE 53

Bandwidth usage: user study

slide-54
SLIDE 54

Bandwidth usage: user study

[Figure: bar chart of data sent to the server (MB), comparing Chunk-based deduplication and Knockoff under No versioning, Version on close, Version on write, and Eidetic]

Chart annotation: already achieves an 80%-85% reduction

slide-56
SLIDE 56

Bandwidth usage: user study

[Figure: bar chart of data sent to the server (MB), comparing Chunk-based deduplication and Knockoff under No versioning, Version on close, Version on write, and Eidetic]

Chart annotation: 24%

slide-57
SLIDE 57

Bandwidth usage: user study

[Figure: bar chart of data sent to the server (MB), comparing Chunk-based deduplication and Knockoff under No versioning, Version on close, Version on write, and Eidetic]

Chart annotations: 24%, 25%, 47%

slide-59
SLIDE 59

Variances across applications

[Figure: bandwidth savings (%) under Version on close, by application: Linux utility, Graph editing, LibreOffice, Programming, Text editing, Web browsing, Total]

slide-61
SLIDE 61

Variances across users

[Figure: Knockoff bandwidth savings (%) under Version on close, for users A, B, C, D, E, F, G]

slide-63
SLIDE 63

Relative storage costs for past versions

[Figure: bar chart of relative storage cost, comparing Chunk-based deduplication and Knockoff under Version on close, Version on write, and Eidetic]

slide-64
SLIDE 64

Relative storage costs for past versions

[Figure: bar chart of relative storage cost, comparing Chunk-based deduplication and Knockoff under Version on close, Version on write, and Eidetic]

Chart annotations: 19%, 23%

slide-66
SLIDE 66

Performance overheads

[Figure: performance under No versioning, Version on close, Version on write, and Eidetic, compared against a baseline]

  • 7-8% performance overheads

slide-67
SLIDE 67

Conclusion

  • A new dimension for reducing costs
  • Selectively substitute computation for data
  • A general-purpose system for deterministic recomputation

– Reduces storage and communication costs for existing versioning policies
– Enables eidetic versioning

slide-68
SLIDE 68

Thank you!

slide-69
SLIDE 69

Varying the materialization delay

slide-70
SLIDE 70

Monetary costs

slide-71
SLIDE 71

Workload characteristics
