09' Linux Plumbers Conference
Data de-duplication
Mingming Cao
IBM Linux Technology Center
cmm@us.ibm.com
2009-09-25
Current storage challenges
- Our world is facing a data explosion; data is growing at an amazing rate
- It is a severe challenge just to store the data we have today; imagine how much more difficult and expensive it will be to store six times more data tomorrow
- Eliminating redundant data is very important!
Existing technology...
- Hard links / COW / file clone
- Compression
- All are done at the file level; there is more room to save space
- Imagine this …
lpc-dedup.ods (5M)
→ copy to another file: lpc-dedup-1.ods (5M)
→ make a slight modification and save as another file: lpc-dedup-2.ods (5M)
→ backup: lpc-dedup-3.ods (5M)
Definition
- Data de-duplication: a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy.
- Data de-duplication is extended compression, more efficient at removing redundant data. It can be done at the file level, sub-file (block) level, or even bit level.
de-duplication before
- dedup.ods: A B A C D C A E
- file2.ods: A B A C D C A E
de-duplication after
- dedup.ods and file2.ods share one stored copy of the unique blocks: A B C D E
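The effect of these diagrams can be sketched in a few lines of Python. This is an illustrative toy under the assumption that whole blocks are compared by content; it is not btrfs code:

    # Toy illustration of the diagrams above: two files each made of the
    # blocks A B A C D C A E.  Only unique blocks are kept in the store;
    # each file holds references (pointers) into it instead of the data.
    files = {
        "dedup.ods": ["A", "B", "A", "C", "D", "C", "A", "E"],
        "file2.ods": ["A", "B", "A", "C", "D", "C", "A", "E"],
    }

    store = {}        # unique block content -> storage slot
    references = {}   # file name -> list of slots instead of block data

    for name, blocks in files.items():
        refs = []
        for block in blocks:
            slot = store.setdefault(block, len(store))  # keep only the first copy
            refs.append(slot)
        references[name] = refs

    print("blocks stored before dedup:", sum(len(b) for b in files.values()))  # 16
    print("blocks stored after dedup: ", len(store))                           # 5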
de-duplication for VM
- Four VMs (VM1–VM4), each storing its own copy of the OS, applications and data, plus free space
de-duplication for VM (after)
- The VMs share a single copy of the OS and the common applications; only the unique data blocks (data1–data6) are kept separately, and the freed blocks become free space
de-duplication benefits
- Two major savings
– Storage footprint (6x healthcare, 3x VM, 20x backup)
– Network bandwidth to transfer data across the WAN
- Disks are cheap, but there is more than just space
– Save more energy (power) and cooling
– Disaster recovery becomes manageable
– Save resources to manage the same amount of data
- Typical workloads: backup, archives, healthcare, virtualization, NAS, remote office etc.
de-duplication concerns
- Large CPU and memory resources required for de-duplication processing
- Potentially more fragmented files/filesystem
- Potentially increased risk of data loss
- Might not work with encryption
- Hash collisions are still possible
de-duplication ratios
- Indicates how much reduction de-duplication achieved
– Before 50TB, after 10TB: the ratio is 5:1
- The ratio can vary from 2:1 to 10:1, depending on
– Type of data
– Change rate of the data
– Amount of redundant data
– Type of backup performed (full, incremental or differential)
– De-duplication method used
de-duplication process
- Each chunk of data generates a unique number (key)
- Has this key been seen in the index before?
– No: insert the key into the index tree and store the new data to disk
– Yes: the data is a duplicate; reference the original data instead
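The flow above can be written as a small user-space sketch. The SHA-256 key, the dictionary index and the list standing in for the disk are illustrative assumptions, not the btrfs design:

    import hashlib

    index = {}   # key (hash of a chunk) -> location of the original data
    disk = []    # stands in for the storage medium

    def write_chunk(data: bytes) -> int:
        """De-duplicating write following the flow above."""
        key = hashlib.sha256(data).digest()   # data generates a unique number (key)
        if key in index:                      # seen this key in the index before?
            return index[key]                 # yes: duplicate, reference the original data
        disk.append(data)                     # no: store the new data to disk
        index[key] = len(disk) - 1            # and insert the key into the index
        return index[key]

    # Two identical chunks resolve to the same on-disk location.
    first = write_chunk(b"hello world")
    second = write_chunk(b"hello world")
    assert first == second and len(disk) == 1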
Where: source vs target
- Source: performed at the data source, before transfer to the target location
– Advantages: reduces network bandwidth; awareness of data usage and format may allow more effective data dedup
– Disadvantages: deduplication consumes CPU cycles on the file/application server; may not dedup files across various sources
- Target: performed at the target (e.g. by backup software or a storage appliance)
– Advantages: applies to any type of filesystem; no impact on data ingestion; allows parallel data deduplication
– Disadvantages: deduplication consumes CPU cycles on the target server or storage device
When: in-line vs post-process
- In-line: deduplication occurs in the primary data path; no data is written to disk until the deduplication process is complete
– Advantages: immediate data reduction; uses the least disk space; no post-processing
– Disadvantages: performance concerns; high CPU and memory cost
- Post-process: deduplication occurs on secondary storage; data is first stored on disk and then deduplicated
– Advantages: easy to implement; no impact on data ingestion; allows parallel data deduplication
– Disadvantages: data is processed twice; needs extra space for dedup; races with concurrent writes
In-line de-dup in btrfs: How to detect the redundancy?
- Makes sense... btrfs already creates a checksum for every fs block and stores it on disk; re-use the hash value as the duplication key
- To speed up lookups, a separate checksum index tree is needed, indexed by checksum rather than by logical offset
- Duplicate screening could happen at data writeout time, after the data gets compressed but before delayed allocation allocates space and flushes out the data
- If a hash collision occurs, do a byte-to-byte compare to ensure no data is lost
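A minimal sketch of the last point, assuming the index maps a cheap per-block checksum (standing in for the one btrfs already keeps) to the first block stored with it. Because such a checksum can collide, a matching key alone is not trusted; the candidate block is compared byte for byte before it is reused:

    import zlib

    index = {}    # per-block checksum -> number of the first block stored with it
    blocks = []   # stand-in for on-disk data blocks

    def dedup_write(data: bytes) -> int:
        key = zlib.crc32(data)            # cheap checksum standing in for the btrfs one
        candidate = index.get(key)
        if candidate is not None and blocks[candidate] == data:
            return candidate              # byte-to-byte compare confirms a true duplicate
        blocks.append(data)               # collision or new data: store it as a new block
        index.setdefault(key, len(blocks) - 1)
        return len(blocks) - 1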
In-line de-dup in btrfs: How to lower the cost?
- Memory usage is the key to dedup performance
– The dedup hash tree needs to be in memory; a 1TB fs needs 8G RAM for SHA-256 or 4G RAM for MD5 (see the sketch after this list)
– Make dedup optional: a filesystem mount option, or enable/disable dedup per file/subvolume etc.
- Fragmentation
– Apply defrag policies to group shared files close to each other
– Reduce seek time: frequently and recently shared blocks are likely already pinned in memory
– Might be less of an issue with SSD
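The 8G/4G figures follow from the digest sizes if one assumes 4KB filesystem blocks and counts only the raw hash values, ignoring index tree overhead:

    # Back-of-the-envelope check of the 8G/4G figures, assuming 4KB blocks
    # and counting only the raw digests (no index tree overhead).
    fs_size = 1 << 40                   # 1TB filesystem
    block_size = 4 * 1024               # 4KB filesystem blocks
    nr_blocks = fs_size // block_size   # 268,435,456 blocks

    print(nr_blocks * 32 / (1 << 30), "GB of SHA-256 digests")  # 8.0
    print(nr_blocks * 16 / (1 << 30), "GB of MD5 digests")      # 4.0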
In-line dedup in btrfs: Keep the impact low
- Could have an impact on running applications
- Gather some latency stats and enable/disable dedup depending on whether ingestion is high
- Could have a background scrub thread to dedup files that did not get deduped in-line before writeout to disk (sketched below)
- A flag on each btrfs extent indicating whether it has been deduped or not, to avoid double dedup
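A hedged sketch of the background scrub idea, with a toy Extent class and the proposed per-extent flag; the names and data structures are assumptions for illustration only:

    import hashlib

    class Extent:
        """Toy stand-in for a btrfs extent carrying the proposed dedup flag."""
        def __init__(self, data: bytes):
            self.data = data
            self.deduped = False   # set once the extent has been through dedup
            self.shares = None     # points at the original extent when deduplicated

    index = {}                     # digest -> extent holding the original copy

    def scrub_pass(extents):
        """One background pass over extents that missed in-line dedup."""
        for ext in extents:
            if ext.deduped:        # the flag avoids double dedup
                continue
            key = hashlib.sha256(ext.data).digest()
            original = index.get(key)
            if original is not None and original.data == ext.data:
                ext.shares = original   # reference the original data instead
            else:
                index[key] = ext        # first copy seen: it stays as-is
            ext.deduped = True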
User space de-duplication?
- User applications do the job instead of the kernel; could avoid impacting ingestion
- Could apply to any filesystem (ext4, btrfs, xfs etc.)
- The checksum index is maintained in userspace
- Introduce a VFS API to allow apps to poll whether a chunk of data has been modified before merging
– Could use inode ctime/mtime or the inode version
– Better: a new system call to tell whether a range of a file (offset, size, transaction ID) has changed since then
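A minimal user-space sketch of the polling idea: since the proposed system call does not exist, it falls back to mtime from stat(); all names here are illustrative:

    import hashlib
    import os

    checksums = {}   # path -> (digest, mtime at the moment the digest was taken)

    def remember(path: str) -> None:
        """Hash a file and remember when we looked at it."""
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        checksums[path] = (digest, os.stat(path).st_mtime)

    def unchanged(path: str) -> bool:
        """Poll whether the file was modified since it was hashed (mtime check)."""
        _, mtime = checksums[path]
        return os.stat(path).st_mtime == mtime

    def safe_to_merge(a: str, b: str) -> bool:
        """Merge two candidate duplicates only if neither changed and digests agree."""
        return (unchanged(a) and unchanged(b)
                and checksums[a][0] == checksums[b][0])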
Summary
- Linux needs data de-duplication technology to be able to control the data explosion ...
- One size won't fit all … perhaps both?