DADI Block-Level Image Service for Agile and Elastic Application Deployment - PowerPoint PPT Presentation

SLIDE 1

DADI Block-Level Image Service for Agile and Elastic Application Deployment

Huiba Li, Yifan Yuan, Rui Du, Kai Ma, Lanzheng Liu and Windsor Hsu Alibaba Group

SLIDE 2

The Problem

  • Container deployment (cold startup) is slow
  • Long-tail latency reaches 10s of minutes
  • The essential reasons are image downloading and unpacking
  • Only 6.4% [Slacker] of the image data is needed for startup
  • A regression to a decade ago, when VM images were also downloaded to hosts
  • P2P downloading [Dragonfly, Kraken, Borg, Tupperware, FID] is not enough
  • it addresses only the downloading half of the problem
  • it has little effect in small clusters
  • Slimming the images [DockerSlim, Cntr] is not universal
  • hard to automatically find all dependencies for all applications
  • hard to support ad-hoc operations
SLIDE 3

Remote Image

  • It is the trend [CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS, Cider]
  • optionally with P2P transfer for large clusters
  • The container image (tarball), however, is NOT viable as a remote image
  • designed for unpacking, not seekable
  • hard to support advanced features, such as xattr, cross-layer reference, etc.
  • We had better design a new format
  • Type of image: file-system-based or block-device-based?
SLIDE 4

Type of Image: Block!

Comparison (features / existing systems / complexity / universality / security / overall):

Block-Device-Based
  • Features: works together with a regular file system, such as ext4; viable for containers, secure containers and virtual machines
  • Existing systems: Cider (based on Ceph; no layering format)
  • Complexity: low (stability↑, optimization↑, advanced features↑)
  • Universality: the app can choose a best-match file system, e.g. NTFS, and pack it into the image as a dependency
  • Security: small attack surface
  • Overall: needs the courage to walk (almost) alone; TODO: layering

File-System-Based
  • Features: provides a file-system interface directly; a "natural" extension of the container image; less mental friction (due to inertia and following the crowd)
  • Existing systems: CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS
  • Complexity: high (stability↓, optimization↓, advanced features↓)
  • Universality: fixed features that may not match all applications (e.g. a Windows container on a Linux host)
  • Security: large attack surface
  • Overall: technical advantage is insignificant

SLIDE 5

Background: Layered Image of Container

  • Each layer is a change set compared to the previous state (files added, modified, deleted); layers are read-only and shared
  • The container layer is a change set compared to the image (files added, modified, deleted); it is read-write and private
  • Usually the layers are stored in separate directories, and a merged view is created with the kernel module overlayfs
  • Images are distributed through a docker registry (download, then untar)
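As a toy illustration (hypothetical code, not part of DADI or overlayfs), the change-set semantics above can be modeled with one dict per layer mapping path to content, with None standing in for a deletion ("whiteout"):

```python
# Toy model of layered change sets: each layer maps path -> content,
# with None marking a deletion. The merged view applies layers
# bottom-up, the way overlayfs stacks directories.

def merged_view(layers):
    """Merge read-only layers (bottom first) into one path -> content view."""
    view = {}
    for layer in layers:
        for path, content in layer.items():
            if content is None:        # file deleted in this layer
                view.pop(path, None)
            else:                      # file added or modified
                view[path] = content
    return view

base   = {"/bin/sh": "busybox", "/etc/os-release": "v1"}
update = {"/etc/os-release": "v2", "/bin/sh": None, "/app/run.py": "print('hi')"}

print(merged_view([base, update]))
# {'/etc/os-release': 'v2', '/app/run.py': "print('hi')"}
```

A running container adds one more (read-write, private) dict on top of this stack.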

SLIDE 9

Background: I/O Path

(Diagram: app processes in a container access a merged view of layer directories via overlayfs in kernel space; the layers are directories filled by download, ungzip & untar from the Docker Registry.)

SLIDE 10

DADI Remote Image

  • A layered image format
  • based on a virtual block device
  • works together with a regular file system, e.g. ext4
  • a general solution for the container ecosystem
  • Compression
  • with seekable decompression (online)
  • Scalability
  • peer-to-peer (P2P) transfer

Components: Overlay Block Device, ZFile, P2P on-demand read in a tree-structured topology

SLIDE 14

DADI I/O Path

(Diagram: app processes → regular file system (ext4, etc.) → virtual block device, crossing from kernel space into the user-space lsmd daemon, which runs OverlayBD; reads are served from ZFile layer blobs for downloaded layers, via P2P RPC otherwise, and from a writable file for new layers.)

SLIDE 15

Overlay Block Device

(Diagram: a pread(offset, length) request is mapped through index segments onto ranges of raw data to read and holes.)

  • Each layer is a change set of overwritten blocks
  • no concept of files or a file system
  • 512-byte block size (granularity)
  • An index for fast reading
  • variable-length entries to save memory, formed by combining non-overlapping entries
  • entries sorted by logical offsets
  • range query by binary search
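A minimal sketch of such an index, assuming a simple in-memory entry format (the real on-disk layout differs): non-overlapping segments sorted by logical offset, with range queries answered by binary search.

```python
import bisect
from collections import namedtuple

# Hypothetical entry format: offsets and lengths are in 512-byte blocks,
# layer_id says which layer blob holds the data.
Segment = namedtuple("Segment", "offset length layer_id")

class LayerIndex:
    def __init__(self, segments):
        self.segments = sorted(segments)              # sorted by logical offset
        self.offsets = [s.offset for s in self.segments]

    def lookup(self, offset, length):
        """Yield the index entries overlapping [offset, offset + length)."""
        end = offset + length
        # rightmost entry starting at or before `offset`
        i = max(bisect.bisect_right(self.offsets, offset) - 1, 0)
        while i < len(self.segments) and self.segments[i].offset < end:
            s = self.segments[i]
            if s.offset + s.length > offset:          # overlaps the query
                yield s
            i += 1

idx = LayerIndex([Segment(0, 4, 1), Segment(10, 6, 2), Segment(30, 2, 1)])
print([s.offset for s in idx.lookup(12, 20)])   # [10, 30]
```

Ranges of the query that fall between entries are holes: in the bottom layer they read as zeros, in upper layers they fall through to the layer below.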
SLIDE 16

Index Merge

(Diagram: a layer index merged with a base index yields a single sorted index of non-overlapping segments. Chart: the number of segments in the merged index of production images stays under ~4.5K as layer depth grows to 45.)

  • Merged index size of production images: 4.5K segments × 16 bytes = 72KB
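A sketch of the merge step under assumed (offset, length, layer_id) tuples: entries of the upper layer shadow overlapping ranges of the base, and the survivors form one sorted index.

```python
def merge_index(base, upper):
    """Merge two sorted lists of (offset, length, layer_id) segments.
    Entries from `upper` shadow overlapping ranges of `base`."""
    covered = sorted((o, o + l) for o, l, _ in upper)   # ranges owned by upper
    merged = list(upper)
    for off, length, layer in base:
        lo, hi = off, off + length
        for c_lo, c_hi in covered:
            if c_lo >= hi:
                break                       # covered ranges beyond this entry
            if c_hi <= lo:
                continue                    # covered range before this entry
            if lo < c_lo:                   # keep the uncovered prefix
                merged.append((lo, c_lo - lo, layer))
            lo = max(lo, c_hi)              # skip the shadowed part
        if lo < hi:
            merged.append((lo, hi - lo, layer))
    return sorted(merged)

base  = [(0, 10, 1), (20, 10, 1)]
upper = [(5, 10, 2)]
print(merge_index(base, upper))
# [(0, 5, 1), (5, 10, 2), (20, 10, 1)]
```

Merging all layers ahead of time is what keeps the per-image index tiny (4.5K entries × 16 bytes = 72KB) and every read a single binary search.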

SLIDE 17

Index Performance

(Charts: index queries per second vs. index size (1K-10K segments), and IOPS (bs=8KB, non-cached) vs. I/O queue depth (1-256) for Thin LVM, DADI without compression, and DADI-ZFile.)

  • > 6M QPS for production images

SLIDE 18

Writable Layer

  • Log-structured design
  • appends index entries and raw data to separate logs
  • Maintains an in-memory index
  • red-black tree
  • Commits only useful data blocks (in offset order)
  • combines index entries

(Diagram: the writable layer consists of a raw-data log and an index log, each with its own header, both append-only; commit produces a read-only layer blob laid out as header, raw data, index, trailer.)
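A minimal sketch of this log-structured scheme (assumed structure, not DADI's on-disk format; a plain dict stands in for the red-black tree, and overwrites are assumed to be block-aligned):

```python
# Writable layer: raw data and index entries go to separate append-only
# logs; an in-memory index maps logical offset -> position in the data log,
# so the latest write to an offset always wins.

class WritableLayer:
    def __init__(self):
        self.data_log = bytearray()    # appended raw data
        self.index_log = []            # appended (logical_off, length, phys_off)
        self.mem_index = {}            # logical_off -> (length, phys_off)

    def write(self, logical_off, data):
        phys_off = len(self.data_log)
        self.data_log += data                              # append, never rewrite
        self.index_log.append((logical_off, len(data), phys_off))
        self.mem_index[logical_off] = (len(data), phys_off)

    def commit(self):
        """Emit only the useful blocks, in logical-offset order;
        stale overwritten data left in the log is dropped."""
        out = []
        for logical_off in sorted(self.mem_index):
            length, phys_off = self.mem_index[logical_off]
            out.append((logical_off,
                        bytes(self.data_log[phys_off:phys_off + length])))
        return out

w = WritableLayer()
w.write(1024, b"new")
w.write(0, b"head")
w.write(1024, b"NEW")            # shadows the earlier write at 1024
print(w.commit())                # [(0, b'head'), (1024, b'NEW')]
```

Committing in offset order is also what lets adjacent index entries be combined into the variable-length segments of the read-only layer.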

SLIDE 19

ZFile

(Diagram: a ZFile wraps an underlay file, i.e. a DADI layer blob laid out as header, raw data, index, trailer, into header, compressed chunks, optional dictionary, index, trailer.)

  • A seekable compression format
  • random reads with online decompression
  • data compressed in fixed-size chunks
  • only the needed chunks are decompressed
  • not tied to DADI
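The chunking idea can be sketched as follows (an illustration, not the ZFile format: zlib is used here purely for convenience, and the chunk-offset index would live in the file's header/trailer in practice):

```python
import zlib

CHUNK = 4096  # fixed chunk size of the uncompressed data

def zfile_compress(data):
    """Compress `data` in fixed-size chunks; return the blob plus an index
    of compressed chunk offsets, so any chunk can be located alone."""
    chunks, index, off = [], [], 0
    for i in range(0, len(data), CHUNK):
        c = zlib.compress(data[i:i + CHUNK])
        chunks.append(c)
        index.append(off)
        off += len(c)
    index.append(off)             # end sentinel
    return b"".join(chunks), index

def zfile_pread(blob, index, offset, length):
    """Decompress only the chunks overlapping [offset, offset + length)."""
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    out = b"".join(zlib.decompress(blob[index[i]:index[i + 1]])
                   for i in range(first, last + 1))
    return out[offset - first * CHUNK:][:length]

data = bytes(range(256)) * 64            # 16 KB of sample data
blob, index = zfile_compress(data)
assert zfile_pread(blob, index, 5000, 300) == data[5000:5300]
```

Because gzip'ed tarballs are not seekable, a chunked format like this is what makes on-demand reads of a compressed remote image possible at all.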
SLIDE 20

On-Demand P2P Transfer

  • In a tree-structured topology
  • Each P2P node caches recently used data blocks.
  • A request is likely to hit parent’s cache,
  • or the parent will forward the request upward, recursively.

(Diagram: in each datacenter, a DADI-Root fetches layer blobs from the Registry over HTTP(S) and serves DADI requests to a tree of DADI-Agents below it.)
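The cache-or-forward rule above can be sketched as a toy model (class and function names are assumed; a real agent would use an LRU cache and RPC):

```python
# Tree-structured on-demand transfer: every node caches recently used
# blocks; a miss is forwarded to the parent, recursively, until the
# root fetches the block from the registry.

class Node:
    def __init__(self, parent=None, fetch_from_registry=None):
        self.parent = parent
        self.cache = {}                      # (blob_id, block) -> data
        self.fetch = fetch_from_registry     # only set on the root

    def read(self, blob_id, block):
        key = (blob_id, block)
        if key not in self.cache:            # miss: ask parent (or registry)
            if self.parent is not None:
                self.cache[key] = self.parent.read(blob_id, block)
            else:
                self.cache[key] = self.fetch(blob_id, block)
        return self.cache[key]

registry_hits = []
def fetch(blob_id, block):
    registry_hits.append((blob_id, block))
    return f"data:{blob_id}:{block}"

root = Node(fetch_from_registry=fetch)
agents = [Node(parent=root) for _ in range(4)]
for a in agents:
    assert a.read("layer0", 7) == "data:layer0:7"
print(len(registry_hits))                    # 1: four readers, one registry fetch
```

Since hot blocks of a popular image are read by many hosts at nearly the same time, most requests terminate at the first or second level of the tree and the registry sees roughly one fetch per block.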

SLIDE 21

Evaluations

SLIDE 22

Startup Latency with DADI

(Charts: cold startup latency (image pull + app launch) for .tgz + overlay2, CRFS, pseudo-Slacker, DADI from Registry, and DADI from a P2P root; and warm startup latency for overlay2, Thin LVM (device mapper), and DADI, on NVMe SSD and cloud disk.)

SLIDE 23

Startup Latency with DADI

(Charts: startup latency under warm and cold cache, for app launch with and without prefetch; and cold startup latency of pseudo-Slacker vs. DADI as the number of hosts (and containers) grows from 10 to 40.)

SLIDE 24

Scalability with DADI

(Charts: number of container instances started over time for three cold startups and one warm startup; and estimated startup latencies for 10K-100K containers with 2-ary through 5-ary P2P trees.)

Large-scale startup of Agility on 1,000 hosts; projected hyper-scale startup of Agility, obtained by evaluating a single branch of the P2P tree. (Agility is a small application written in Python specifically to assist the test.)

SLIDE 25

I/O Performance

(Charts: image scanning time, measured as time to du all files and time to tar all files, for overlay2, Thin LVM, and DADI, on NVMe SSD and cloud disk.)

SLIDE 26

Thanks!