DupHunter: Flexible High-Performance Deduplication for Docker Registries
Nannan Zhao, Hadeel Albahar, Subil Abraham, Keren Chen, Vasily Tarasov, Dimitrios Skourtis, Lukas Rupprecht, Ali Anwar, and Ali R. Butt

Containers are ubiquitous
OS, database, web server, cache, serverless, big data, languages, deep learning
Nannan Zhao znannan1@vt.edu

Application containerization is becoming a significant market player

Docker image dataset is growing fast!
[Figure callout: 3,657,773]
How to efficiently manage the ever-growing image dataset for Docker registries?

Our contribution: DupHunter, a framework to deduplicate images in Docker registries
❑ We make two key observations:
1. Container images exhibit a lot of redundancy.
2. The user access pattern is predictable.
❑ We design DupHunter to work with compressed images, provide layer deduplication, and reduce layer restore overhead.
❑ We evaluate DupHunter with representative real-world workloads. Compared to the state of the art, DupHunter:
▪ reduces storage space by up to 6.9x.
▪ reduces the GET layer latency by up to 2.8x.

Overview of Docker
❑ A Docker container is a self-contained executable package:
▪ Lightweight
▪ Portable
▪ Provides isolation
❑ Docker registry:
▪ Stores Docker images
▪ Supports fast distribution
▪ Facilitates easy deployment
[Figure: Docker Hub exchanges images with a Docker host via docker push and docker pull. An image consists of read-only image layers (base image: Ubuntu, MySQL, PHP) with a R/W container layer. The Docker host runs the Docker client (docker build, docker pull, docker push, docker run), the Docker daemon, and containers on top of the host OS and hardware.]

Key observation I: The image dataset has a large amount of redundant files
❑ Container images have a lot of redundancy.
▪ 97% of files across layers are duplicates!
❑ Existing technologies such as Jdupes, VDO, Btrfs, ZFS, and Ceph are unable to harness this redundancy.
[Figure: Applying deduplication directly to the compressed layer dataset does not help. Decompressing and unpacking the layers into an uncompressed layer dataset and then deduplicating reduces space by up to 4X.]
❑ Layer restore incurs considerable overhead: layer pulling latency increases by up to 98x!
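The decompress, unpack, and deduplicate pipeline from the figure can be illustrated with a minimal Python sketch. It is not DupHunter's implementation: the helper names (`make_layer`, `dedup_layers`), the in-memory layers, and the use of SHA-256 as the file fingerprint are assumptions made for illustration only.

```python
import hashlib
import io
import tarfile


def make_layer(files):
    """Build a gzip-compressed tar archive in memory from {path: bytes} (toy layer)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def dedup_layers(layers):
    """Decompress and unpack each layer, keeping one copy per unique file content.

    Returns (store, recipes): store maps content digest -> bytes, and recipes
    maps layer name -> list of (path, digest) needed to restore that layer.
    """
    store, recipes = {}, {}
    for name, blob in layers.items():
        recipe = []
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue  # skip directories, links, and other non-regular entries
                data = tar.extractfile(member).read()
                digest = hashlib.sha256(data).hexdigest()
                store.setdefault(digest, data)  # store each unique content once
                recipe.append((member.name, digest))
        recipes[name] = recipe
    return store, recipes


# Two toy layers that share one file (bin/sh).
layers = {
    "layer_a": make_layer({"bin/sh": b"shell", "etc/os-release": b"ubuntu"}),
    "layer_b": make_layer({"bin/sh": b"shell", "app/main.py": b"print('hi')"}),
}
store, recipes = dedup_layers(layers)
total_files = sum(len(r) for r in recipes.values())
print(f"{total_files} files in layers, {len(store)} unique contents stored")
```

The recipes are what make a later layer restore possible, and rebuilding a layer from them is the restore overhead that the slide warns about.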

Key observation II: Predictable user access pattern
❑ We observe a consistent user pulling pattern: clients pull the manifest first, then the layers, but not all of the layers will be pulled.
❑ We performed a quantitative study using a 75-day IBM Cloud Registry workload with 7 availability zones.
[Figure: layers ratio (0.6 to 1) versus GET layer count (1 to 50,000) for the availability zones Dal, Dev, Fra, Lon, Pre, Sta, and Syd]
❑ The majority of layers are fetched only once by the same client.

Key observation II-b: User repulling pattern can also be predicted
[Figure: clients ratio (0 to 1) versus repulling probability (0 to 1) for the availability zones Dal, Dev, Fra, Pre, Sta, Syd, and Lon]
❑ Half of the clients have a repull probability less than 0.2 → many clients pull a layer only once.
❑ User repulling pattern is either pull-once or always-pull → we can predict which layers to pull.
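A rough sketch of how a per-client repull probability could be computed from a GET-layer trace follows. The slides do not define the metric, so this sketch assumes it is the fraction of a client's distinct layers that the client fetched more than once. The trace format, the function names, and the 0.8 cutoff for "always-pull" are hypothetical; only the 0.2 value comes from the slide.

```python
from collections import Counter, defaultdict


def repull_probability(trace):
    """Map each client to the fraction of its distinct layers pulled more than once.

    `trace` is an iterable of (client_id, layer_digest) GET-layer requests.
    """
    pulls = defaultdict(Counter)
    for client, layer in trace:
        pulls[client][layer] += 1
    return {
        client: sum(1 for n in layer_counts.values() if n > 1) / len(layer_counts)
        for client, layer_counts in pulls.items()
    }


def classify(prob, low=0.2, high=0.8):
    """Label a client by its repull probability (the thresholds are assumptions)."""
    if prob < low:
        return "pull-once"
    if prob > high:
        return "always-pull"
    return "mixed"


# Toy trace: c1 never repulls a layer, c2 repulls every layer.
trace = [
    ("c1", "L1"), ("c1", "L2"), ("c1", "L3"),
    ("c2", "L1"), ("c2", "L1"), ("c2", "L2"), ("c2", "L2"),
]
probs = repull_probability(trace)
labels = {client: classify(p) for client, p in probs.items()}
print(probs, labels)
```

A registry could use such labels to decide for which clients it is worth preparing layers ahead of a request.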

Key observation II-c: Layer preconstruction is possible
❑ Layer preconstruction can significantly reduce layer restore overhead.

DupHunter architecture
[Figure labels: clients, Registry REST API, Servers A to D, distributed metadata database, local storage system, storage cluster]

Reducing overhead in DupHunter
1. Support multiple replica deduplication modes.
2. Facilitate parallel layer reconstruction.
3. Enable proactive layer prefetching/preconstruction.
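As one illustration of parallel layer reconstruction, the sketch below splits a layer's files into slices and rebuilds each slice as its own compressed archive in a thread pool. This is an assumption about how such parallelism might look, not a description of DupHunter's mechanism. The function names and the round-robin partitioning are hypothetical.

```python
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor


def build_slice(files):
    """Pack a list of (path, bytes) pairs into one gzip-compressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, data in files:
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def reconstruct_in_parallel(files, num_slices):
    """Partition files round-robin into slices and rebuild the slices concurrently."""
    items = sorted(files.items())
    slices = [items[i::num_slices] for i in range(num_slices)]
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        return list(pool.map(build_slice, slices))


def list_paths(slice_blobs):
    """Return the sorted file paths contained in all slice archives."""
    paths = []
    for blob in slice_blobs:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
            paths.extend(tar.getnames())
    return sorted(paths)


files = {"a.txt": b"a", "b.txt": b"bb", "c.txt": b"ccc"}
blobs = reconstruct_in_parallel(files, num_slices=2)
print(len(blobs), list_paths(blobs))
```

Each slice is an independent archive, so slices can be rebuilt on different workers without coordination. The cost is that the receiver has to unpack several archives instead of one.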

DupHunter supports multiple replica deduplication modes
❑ B-mode n: Basic deduplication mode n
▪ Keep n layer replicas intact.
▪ Deduplicate the remaining R-n layer replicas (R = layer replication level).
❑ S-mode: Selective deduplication mode
▪ The number of intact layer replicas is proportional to the layer's popularity.
▪ Hot layers have more intact replicas.
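The replica accounting for the two modes can be written down as a small sketch. The B-mode rule follows the slide directly. For S-mode the slide only says the intact count is proportional to popularity, so the concrete rule used here (scale by the layer's pull count relative to the hottest layer, rounded up) is an assumption, as are the function names.

```python
import math


def b_mode(n, R):
    """B-mode n: keep n replicas intact and deduplicate the remaining R - n."""
    if not 0 <= n <= R:
        raise ValueError("n must be between 0 and the replication level R")
    return n, R - n


def s_mode(pull_count, max_pull_count, R):
    """S-mode: intact replicas proportional to popularity (assumed scaling rule)."""
    if max_pull_count <= 0:
        return 0, R  # no popularity signal: deduplicate every replica
    intact = math.ceil(R * pull_count / max_pull_count)
    intact = max(0, min(R, intact))
    return intact, R - intact


# With R = 3 replicas: (intact, deduplicated) per layer.
print(b_mode(1, 3))        # one intact replica, two deduplicated
print(s_mode(100, 100, 3)) # hottest layer keeps all replicas intact
print(s_mode(10, 100, 3))  # cooler layer keeps fewer intact replicas
print(s_mode(0, 100, 3))   # never-pulled layer is fully deduplicated
```

Intact replicas can serve pulls without a restore step, so keeping more of them trades storage savings for lower GET latency.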

