 
DupHunter: Flexible High-Performance Deduplication for Docker Registries
Nannan Zhao, Hadeel Albahar, Subil Abraham, Keren Chen, Vasily Tarasov, Dimitrios Skourtis, Lukas Rupprecht, Ali Anwar, and Ali R. Butt

Containers are ubiquitous: OS, database, web server, cache, serverless, big data, languages, deep learning
Nannan Zhao znannan1@vt.edu

Application containerization is becoming a significant market player

3,657,773
Docker image dataset is growing fast!
How to efficiently manage the ever-growing image dataset for Docker registries?

Our contribution: DupHunter, a framework to deduplicate images in Docker registries
❑ We make two key observations:
  1. Container images exhibit a lot of redundancy.
  2. User access patterns are predictable.
❑ We design DupHunter to work with compressed images, provide layer deduplication, and reduce layer restore overhead.
❑ We evaluate DupHunter with representative real-world workloads. Compared to the state of the art, DupHunter:
  ▪ reduces storage space by up to 6.9x.
  ▪ reduces GET layer latency by up to 2.8x.

Overview of Docker
❑ A Docker container is a self-contained executable package that is:
  ▪ Lightweight
  ▪ Portable
  ▪ Isolated
❑ Docker registry:
  ▪ Stores Docker images
  ▪ Supports fast distribution
  ▪ Facilitates easy deployment
[Figure: Docker workflow. Labels: Docker client (docker build, docker run, docker push, docker pull), Docker daemon, Docker host (containers, host OS, hardware), Docker Hub (image push and pull); an image consists of read-only image layers (base image: Ubuntu, MySQL, PHP) with an R/W container layer on top.]

Key observation I: Image dataset has a large amount of redundant files
❑ Container images have a lot of redundancy.
  ▪ 97% of files across layers are duplicates!
❑ Existing technologies such as Jdupes, VDO, Btrfs, ZFS, and Ceph are unable to harness this redundancy.
[Figure: deduplication pipeline. Deduplicating the compressed layer dataset directly does not help. Decompressing and unpacking the layers into an uncompressed layer dataset and then deduplicating reduces space by up to 4x.]
Layer restore incurs considerable overhead: layer pulling latency increases by up to 98x!
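
The decompress, unpack, and deduplicate steps above can be sketched at file granularity as follows. This is a minimal illustration, not DupHunter's implementation: it assumes layers are gzip-compressed tarballs held in memory, uses SHA-256 as the file fingerprint, and the helper names (`file_digests`, `duplicate_file_ratio`, `make_layer`) are hypothetical.

```python
import hashlib
import io
import tarfile


def file_digests(layer_bytes: bytes) -> dict:
    """Decompress and unpack a gzip-compressed layer tarball and
    map each regular file path to the SHA-256 of its content."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(layer_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, symlinks, devices
            data = tar.extractfile(member).read()
            digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def duplicate_file_ratio(layers) -> float:
    """Fraction of files across all layers whose content was already
    seen in some file (assumed definition of 'duplicate')."""
    seen = set()
    total = 0
    for layer in layers:
        for digest in file_digests(layer).values():
            total += 1
            seen.add(digest)
    return 0.0 if total == 0 else 1 - len(seen) / total


def make_layer(files: dict) -> bytes:
    """Build a tiny in-memory gzip tarball, for demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


if __name__ == "__main__":
    layer_a = make_layer({"bin/sh": b"shell", "etc/os": b"ubuntu"})
    layer_b = make_layer({"bin/sh": b"shell", "app/main": b"code"})
    print(duplicate_file_ratio([layer_a, layer_b]))
```

Hashing the compressed tarballs themselves would treat `layer_a` and `layer_b` as entirely different objects, which is why the sketch fingerprints the unpacked files instead.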

Key observation II: Predictable user access pattern
❑ We observe a consistent user pulling pattern: clients pull the manifest first, then the layers, but not all of the layers are pulled.
❑ We performed a quantitative study using a 75-day IBM Cloud Registry workload with 7 availability zones.
[Figure: layers ratio (y-axis, 0.6 to 1) vs. GET layer count (x-axis, 1 to 50,000) for the availability zones Dal, Dev, Fra, Lon, Pre, Sta, and Syd.]
The majority of layers are fetched only once by the same client.

Key observation II-b: User repulling pattern can also be predicted
[Figure: clients ratio (y-axis, 0 to 1) vs. repulling probability (x-axis, 0 to 1) for the availability zones Dal, Dev, Fra, Lon, Pre, Sta, and Syd.]
Half of the clients have a repull probability less than 0.2 → many clients pull a layer only once.
User repulling pattern is either pull-once or always-pull → we can predict which layers to pull.
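
One way to turn this observation into a prediction is sketched below. The slides do not define repull probability precisely, so the sketch assumes it is the fraction of a client's distinct layers that the client pulled more than once; the 0.5 threshold and the function names are likewise hypothetical.

```python
from collections import Counter, defaultdict


def repull_probabilities(trace):
    """trace: iterable of (client, layer) GET layer requests.
    Returns {client: fraction of its distinct layers pulled more than once}
    (an assumed definition of repull probability)."""
    counts = defaultdict(Counter)
    for client, layer in trace:
        counts[client][layer] += 1
    return {
        client: sum(1 for c in layers.values() if c > 1) / len(layers)
        for client, layers in counts.items()
    }


def predict_layers(manifest_layers, already_pulled, repull_prob, threshold=0.5):
    """Predict which layers a client will GET after pulling a manifest."""
    if repull_prob >= threshold:
        # "always-pull" client: expect every layer in the manifest.
        return list(manifest_layers)
    # "pull-once" client: expect only layers it has not fetched before.
    return [layer for layer in manifest_layers if layer not in already_pulled]


if __name__ == "__main__":
    trace = [("c1", "a"), ("c1", "b"), ("c1", "a"), ("c2", "a"), ("c2", "b")]
    probs = repull_probabilities(trace)
    print(probs)
    print(predict_layers(["a", "b", "c"], {"a", "b"}, probs["c2"]))
```

A registry could use such a prediction when it sees a GET manifest request, so that the layers expected next are ready before the GET layer requests arrive.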

Key observation II-c: Layer preconstruction is possible
Layer preconstruction can significantly reduce layer restore overhead.

DupHunter architecture
[Figure: DupHunter architecture. Labels: clients, registry REST API, servers A to D, local storage system, storage cluster, distributed metadata database.]

Reducing overhead in DupHunter
1. Support multiple replica deduplication modes.
2. Facilitate parallel layer reconstruction.
3. Enable proactive layer prefetching/preconstruction.

DupHunter supports multiple replica deduplication modes
❑ B-mode n: Basic deduplication mode n
  ▪ Keep n layer replicas intact.
  ▪ Deduplicate the remaining R-n layer replicas (R = layer replication level).
❑ S-mode: Selective deduplication mode
  ▪ The number of intact layer replicas is proportional to the layer's popularity.
  ▪ Hot layers have more intact replicas.
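
The two modes can be illustrated with a small replica-planning sketch. B-mode follows the slide directly (n intact, R-n deduplicated). For S-mode the slide only says the intact count is proportional to popularity, so the linear scaling against the hottest layer and the rounding-up rule below are assumptions, and all function names are hypothetical.

```python
import math


def intact_replicas_b_mode(n: int, replication_level: int) -> int:
    """B-mode n: keep exactly n of the R layer replicas intact."""
    if not 0 <= n <= replication_level:
        raise ValueError("n must be between 0 and the replication level R")
    return n


def intact_replicas_s_mode(popularity: float, max_popularity: float,
                           replication_level: int) -> int:
    """S-mode (assumed rule): scale the intact replica count linearly with
    the layer's popularity relative to the hottest layer, rounding up."""
    if max_popularity <= 0:
        return 0
    share = min(popularity / max_popularity, 1.0)
    return math.ceil(share * replication_level)


def replica_plan(intact: int, replication_level: int) -> dict:
    """Split the R replicas into intact and deduplicated (R - intact)."""
    return {"intact": intact, "deduplicated": replication_level - intact}


if __name__ == "__main__":
    R = 3
    print(replica_plan(intact_replicas_b_mode(1, R), R))            # B-mode 1
    print(replica_plan(intact_replicas_s_mode(100, 100, R), R))     # hot layer
    print(replica_plan(intact_replicas_s_mode(10, 100, R), R))      # cold layer
```

With R = 3, B-mode 1 leaves one replica intact and deduplicates two, while under the assumed S-mode rule the hottest layer keeps all three replicas intact and a layer with a tenth of that popularity keeps one.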