Storage Deduplication in Cloud Computing - João Paulo and José Pereira

SLIDE 1

Storage Deduplication in Cloud Computing

João Paulo and José Pereira

University of Minho

July 2010

João Paulo and José Pereira Storage Deduplication in Cloud Computing

SLIDE 2

Cloud Computing Overview

Cloud Computing
Cloud services allow clients to shift their data and applications into the “cloud”.
These services run on a scalable and dependable infrastructure with a large server pool spread over several data centres.

Virtualization
Virtualization is key to the elasticity provided by cloud computing.
Virtual machines (VMs) can be deployed or migrated in a few minutes.
VM isolation allows better management of resources and failures.

SLIDE 4

Deduplication

Cloud services store clients' data, applications and VM images. Deduplication makes it possible to:

Decrease the size of the storage.
Optimize the management of the stored data.

Deduplication introduces overhead to the service.

SLIDE 5

Outline

1. Shared Storage Deduplication
2. Experimental Evaluation - Preliminary Results
3. Conclusions
4. Future Work and Challenges

SLIDE 6

Shared Storage Deduplication

Scenario

[Figure: several groups of VMs, each group hosted on a different physical machine.]

Groups of VMs run on different physical machines.
Each VM has its own virtual disk.
Virtual disks are kept in shared storage.

SLIDE 7

Shared Storage Deduplication

Xen Blktap mechanism

Blktap:
Is implemented within Xen.
Allows virtual block devices to be implemented for virtual machines.
Provides a user-level disk I/O interface (Tapdisk).
Allows independent per-disk handler processes.
Makes copy-on-write easy to implement.
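
To illustrate why a user-level, per-disk handler makes copy-on-write simple, here is a minimal Python sketch. It is not Blktap or Tapdisk code: the class `CowDisk`, the 4 KiB block size and the in-memory dictionaries are assumptions made purely for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size, not taken from the slides


class CowDisk:
    """Toy copy-on-write virtual disk: a shared base image plus a private overlay."""

    def __init__(self, base_blocks):
        self.base = base_blocks   # shared and never modified: {block_no: bytes}
        self.overlay = {}         # private copies, created on first write

    def read(self, block_no):
        # Prefer the private copy; fall back to the shared base image,
        # and to a zero-filled block if the base has no such block.
        if block_no in self.overlay:
            return self.overlay[block_no]
        return self.base.get(block_no, bytes(BLOCK_SIZE))

    def write(self, block_no, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        # The base image is left untouched; only the overlay changes.
        self.overlay[block_no] = data


# Two hypothetical VMs sharing one base image.
base = {0: b"A" * BLOCK_SIZE}
vm1, vm2 = CowDisk(base), CowDisk(base)
vm1.write(0, b"B" * BLOCK_SIZE)
```

After the write, `vm1` sees its private copy of block 0 while `vm2` still reads the unmodified base block.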

SLIDE 8

Shared Storage Deduplication

XEN Blktap mechanism

[Figure: VM1, VM2 and VM3 on Physical Machine 1 and Physical Machine 2, with one Tap aio handler and one virtual disk (VM1 Disk, VM2 Disk, VM3 Disk) per VM.]

SLIDE 9

Shared Storage Deduplication

XEN Blktap mechanism

[Figure: the same diagram, with read/write requests from each VM passing through its Tap aio handler to its virtual disk.]

SLIDE 12

Shared Storage Deduplication

Deduplication Challenges

Deduplication is usually applied to backup scenarios, where data is practically immutable. In a virtualized scenario, where stored data changes constantly, we must take into account:

The overhead introduced by the deduplication algorithm.
The best approach to finding duplicated data, which must be transparent to the VMs.
The metadata needed to share identical data.

SLIDE 13

Shared Storage Deduplication

Deduplication Algorithm

[Figure: system components: VM1, VM2 and VM3 on Physical Machine 1 and Physical Machine 2, two Tap disk processes, the Shared Storage, a DHT and an Extend Server.]

SLIDE 14

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components; the VMs issue read/write requests through the Tap disk processes.]

SLIDE 15

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components; read/write requests are resolved through virtual-to-physical (V->P) address translation tables, one per VM.]

SLIDE 17

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components; each write adds the written address to a set of dirty addresses.]

SLIDE 18

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components plus a Share module next to each Tap disk; sharing involves the V->P translation table and copy-on-write (COW) marking.]

SLIDE 20

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components with the Share modules, V->P table and COW marking; the DHT maps Hash -> (Padd, Cont).]

SLIDE 21

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components; the V->P table is updated after sharing, and the Extend Server holds a free blocks queue.]

SLIDE 22

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components; a write to a COW address is tracked in lists of COW addresses and dirty addresses, and each Tap disk has a free blocks buffer.]

SLIDE 23

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components plus a garbage collector (GC) next to each Tap disk and its free blocks buffer.]

SLIDE 24

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components with GC modules and free blocks buffers; the GC works against the DHT entries, Hash -> (Padd, Cont).]

SLIDE 25

Shared Storage Deduplication

Deduplication Algorithm

[Figure: same components; the GC modules and free blocks buffers are connected to the free blocks queue of the Extend Server.]

SLIDE 26

Shared Storage Deduplication

Deduplication Algorithm

[Figure: complete design: VM1, VM2 and VM3 on Physical Machine 1 and Physical Machine 2; Tap disk processes, each with a combined GC/Share module and a free blocks buffer; the Shared Storage, the DHT, and the Extend Server with its free blocks queue.]
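
The labels in the figures above (V->P tables, dirty addresses, Share, COW, Hash -> (Padd, Cont), free blocks queue, GC) suggest a flow that can be sketched roughly as follows. This is a single-process toy model, not the authors' implementation: the class `DedupStore`, the use of SHA-256, the reading of `Padd` as a physical address and `Cont` as a reference counter, and a plain dictionary standing in for the DHT are all assumptions.

```python
import hashlib
from collections import deque


class DedupStore:
    """Toy model of the components named in the figures (all names hypothetical)."""

    def __init__(self, n_blocks):
        self.storage = {}                         # physical address -> block bytes
        self.free_queue = deque(range(n_blocks))  # "free blocks queue"
        self.dht = {}                             # hash -> [padd, cont] (assumed meaning)
        self.v2p = {}                             # (vm, vaddr) -> physical address
        self.cow = set()                          # physical addresses marked copy-on-write
        self.dirty = set()                        # (vm, vaddr) written since last share pass
        self.cow_addresses = []                   # shared blocks that lost a reference

    def write(self, vm, vaddr, data):
        key = (vm, vaddr)
        padd = self.v2p.get(key)
        if padd is not None and padd in self.cow:
            self.cow_addresses.append(padd)       # leave the shared block untouched
            padd = None
        if padd is None:
            padd = self.free_queue.popleft()      # redirect the write to a free block
            self.v2p[key] = padd
        self.storage[padd] = data
        self.dirty.add(key)

    def read(self, vm, vaddr):
        return self.storage[self.v2p[(vm, vaddr)]]

    def share_pass(self):
        """Asynchronous step: deduplicate blocks written since the last pass."""
        dirty, self.dirty = self.dirty, set()
        for key in dirty:
            padd = self.v2p[key]
            digest = hashlib.sha256(self.storage[padd]).hexdigest()
            entry = self.dht.get(digest)
            if entry is None:
                self.dht[digest] = [padd, 1]      # first copy of this content
            elif entry[0] != padd:
                self.v2p[key] = entry[0]          # point at the existing copy
                entry[1] += 1
                del self.storage[padd]
                self.free_queue.append(padd)      # the duplicate block is reclaimed
            self.cow.add(self.v2p[key])           # shared blocks become copy-on-write

    def gc_pass(self):
        """Reclaim shared blocks whose reference count dropped to zero."""
        pending, self.cow_addresses = self.cow_addresses, []
        for padd in pending:
            digest = hashlib.sha256(self.storage[padd]).hexdigest()
            entry = self.dht[digest]
            entry[1] -= 1
            if entry[1] == 0:
                del self.dht[digest]
                del self.storage[padd]
                self.cow.discard(padd)
                self.free_queue.append(padd)


# Hypothetical usage: two VMs write identical content, a third writes something else.
store = DedupStore(8)
store.write("vm1", 0, b"x" * 16)
store.write("vm2", 0, b"x" * 16)
store.write("vm3", 5, b"y" * 16)
store.share_pass()                 # vm1 and vm2 now map to one physical block
store.write("vm1", 0, b"z" * 16)   # COW: vm1 gets a fresh block, vm2 is unaffected
store.gc_pass()                    # the shared block keeps one reference (vm2)
```

Writes only touch the translation table and the dirty set, so hashing and DHT lookups stay off the I/O path, which is the point of running the share and GC steps asynchronously.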

SLIDE 30

Experimental Evaluation - Preliminary Results

Evaluated Prototype

[Figure: evaluated prototype: VM1 and VM2 on Physical Machine 1, one Tap disk process with a GC/Share module and a free blocks buffer, the Shared Storage, and the free blocks queue.]

The evaluated prototype does not include the distributed and fault-tolerant design. Two optimizations:

A set of mutexes for each VM's translation table.
The refilling granularity of each VM's free blocks buffer.
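
As a rough illustration of these two optimizations, the sketch below guards a translation table with several mutexes instead of one, and refills a local free blocks buffer from a global queue in batches. The class names, the stripe count and the refill size are hypothetical; the slides do not say how the prototype implements either idea.

```python
import threading
from collections import deque


class TranslationTable:
    """V->P table split into stripes, each guarded by its own mutex."""

    def __init__(self, n_locks=16):
        self._locks = [threading.Lock() for _ in range(n_locks)]
        self._stripes = [{} for _ in range(n_locks)]

    def _index(self, vaddr):
        # Addresses in different stripes can be accessed concurrently.
        return vaddr % len(self._locks)

    def get(self, vaddr):
        i = self._index(vaddr)
        with self._locks[i]:
            return self._stripes[i].get(vaddr)

    def set(self, vaddr, padd):
        i = self._index(vaddr)
        with self._locks[i]:
            self._stripes[i][vaddr] = padd


class FreeBlockBuffer:
    """Local buffer refilled from a global free blocks queue in batches."""

    def __init__(self, global_queue, global_lock, refill_size=64):
        self._global_queue = global_queue
        self._global_lock = global_lock
        self._refill_size = refill_size   # the "refilling granularity"
        self._local = deque()
        self.refills = 0                  # how often the global queue was contacted

    def take(self):
        if not self._local:
            with self._global_lock:
                n = min(self._refill_size, len(self._global_queue))
                for _ in range(n):
                    self._local.append(self._global_queue.popleft())
            self.refills += 1
        if not self._local:
            raise RuntimeError("no free blocks left")
        return self._local.popleft()


# Hypothetical usage: 25 allocations cost only 3 trips to the global queue.
table = TranslationTable(n_locks=4)
table.set(5, 42)
queue = deque(range(100))
buf = FreeBlockBuffer(queue, threading.Lock(), refill_size=10)
blocks = [buf.take() for _ in range(25)]
```

A larger refill size means fewer trips to the shared queue but more free blocks held idle by one VM, which is presumably the trade-off behind tuning the granularity.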

SLIDE 31

Experimental Evaluation - Preliminary Results

Benchmarks

Main Goals
Measure the I/O and CPU overhead introduced by our prototype compared with the default approach (Tap aio).
Measure the sharing rates achieved by our prototype.

Write and Read Benchmarks
The TPC-C NURand function is used to generate hotspots for write and read operations.
A realistic distribution is used to generate the content of the written blocks.
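
For reference, NURand as defined in the TPC-C specification can be written in a few lines of Python. The formula is the standard one; how the benchmark maps its output to block addresses, and the values of `a`, `x`, `y` and `c` used below, are assumptions rather than details taken from the slides.

```python
import random


def nurand(a, x, y, c, rng=random):
    """TPC-C non-uniform random integer in [x, y].

    a is a constant chosen according to the size of the range,
    c is a run-time constant in [0, a].
    """
    return (((rng.randint(0, a) | rng.randint(x, y)) + c) % (y - x + 1)) + x


# Hypothetical use: pick block numbers in a 10,000-block disk region.
rng = random.Random(1)
samples = [nurand(1023, 0, 9999, 123, rng) for _ in range(1000)]
```

The bitwise OR of two uniform values skews the result towards certain values, so some block numbers are drawn far more often than others and act as hotspots.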

SLIDE 33

Experimental Evaluation - Preliminary Results

Write Benchmark

The benchmark ran for 30 minutes on three VMs with 10 GB images.
The fully optimized version used 250 MB of additional RAM.
Approximately 28% of the written data (20 GB) was shared.

SLIDE 34

Experimental Evaluation - Preliminary Results

Read Benchmark

The benchmark ran for 40 minutes on three VMs with 10 GB images.
The fully optimized version used 200 MB of additional RAM.
Approximately 55% of the written data (4.5 GB) was shared.
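
To put the sharing rates of the two benchmarks in absolute terms, the small calculation below assumes that the figure in parentheses is the total amount of written data (20 GB for the write benchmark, 4.5 GB for the read benchmark). That reading is an interpretation of the slides, and the helper `shared_gb` is hypothetical.

```python
def shared_gb(written_gb, shared_fraction):
    """Amount of written data that ended up shared, in GB."""
    return written_gb * shared_fraction


write_shared = shared_gb(20.0, 0.28)   # write benchmark: about 5.6 GB shared
read_shared = shared_gb(4.5, 0.55)     # read benchmark: about 2.5 GB shared
```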

SLIDE 36

Conclusions

The evaluated prototype shares identical data without introducing significant overhead in CPU usage or I/O requests.
The asynchronous approach to sharing identical data, the dynamic detection of modified data and the prototype optimizations are key to achieving these results.
Project available at http://www.holeycow.org/

SLIDE 38

Future Work and Challenges

Replicated Databases
HoleyCOW - Shared Storage Cluster.
Can deduplication be used to improve this specific scenario?

Fault Tolerance
Byzantine faults are not considered.
Data corruption. What level of redundancy should we keep?
Malicious attacks.

Resilient Databases (Red) - http://red.lsd.di.uminho.pt/

SLIDE 41

Future Work and Challenges

Distributed Storage
Each server has its own disk.
A new approach to finding and sharing duplicated data is necessary. Epidemic protocols?
The garbage collector also needs to be redesigned.
