SLIDE 1

PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE


Qiumin Xu1, Manu Awasthi2, Krishna T. Malladi3, Janki Bhimani4, Jingpei Yang3, Murali Annavaram1

1USC, 2IIT Gandhinagar, 3Samsung, 4Northeastern

SLIDE 2

Docker Has Become Very Popular


Software container platform with many desirable features

Ease of deployment, developer friendliness, and lightweight virtualization

Mainstay in cloud platforms

Google Cloud Platform, Amazon EC2, Microsoft Azure

The storage hierarchy is a key component

High-performance SSDs: NVMe, NVMe over Fabrics

SLIDE 3

Agenda

  • Docker, NVMe and NVMe over Fabrics (NVMf)
  • How to best utilize NVMe SSDs for a single container?
    The best configuration performs close to raw performance
    Where do the performance anomalies come from?
  • Do Docker containers scale well on NVMe SSDs?
    Exemplified using Cassandra
    Best strategy for dividing the resources
  • Scaling Docker containers on NVMe over Fabrics


SLIDE 4


What Is a Docker Container?

Each virtualized application includes an entire guest OS (~10s of GB)
A Docker container comprises just the application and its bins/libs
Containers share the kernel with other containers
Much more portable and efficient

figure from https://docs.docker.com
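For illustration only (not from the slides; image and container names are arbitrary), deploying a containerized application is a couple of commands:

    # Pull an image containing just the app and its bins/libs, then start it.
    # The container shares the host kernel, so it starts in well under a second.
    docker pull redis
    docker run -d --name mycache redis
    docker ps    # the container runs as an ordinary process group on the host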

SLIDE 5

Non-Volatile Memory Express (NVMe)

A storage protocol standard on top of PCIe
NVMe SSDs connect through PCIe and support the standard

Shipping since 2014 (Intel, Samsung)
Enterprise and consumer variants

NVMe SSDs leverage the interface to deliver superior performance

5x to 10x over SATA SSDs [1]


[1] Qiumin Xu et al. “Performance analysis of NVMe SSDs and their implication on real world databases.” SYSTOR’15
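As a quick illustration (assumes the nvme-cli tool; not part of the slides), an NVMe SSD attached over PCIe is visible through the standard kernel NVMe driver:

    # List NVMe controllers and namespaces visible to the kernel NVMe driver
    nvme list
    # Dump identify-controller data (model, firmware, queue counts) for one device
    nvme id-ctrl /dev/nvme0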

SLIDE 6

Why NVMe over Fabrics (NVMf)?


Retains NVMe performance over network fabrics
Eliminates unnecessary protocol translations
Enables low-latency, high-IOPS remote storage

  • J. Metz and D. Minturn, “Under the Hood with NVMe over Fabrics,” SNIA Ethernet Storage Forum
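On the host side, a hedged sketch (addresses and NQN are illustrative; assumes nvme-cli with fabrics support and an RDMA-capable NIC) of attaching a remote NVMf namespace:

    # Discover subsystems exported by an NVMf target over RDMA
    nvme discover -t rdma -a 192.168.1.100 -s 4420
    # Connect: the remote namespace then shows up as a local /dev/nvmeXnY block device
    nvme connect -t rdma -n nqn.2016-06.io.example:remote-nvme -a 192.168.1.100 -s 4420
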
SLIDE 7


Storage Architecture in Docker

[Diagram: container read/write operations reach the NVMe SSD either (1) through the host backing filesystem (EXT4, XFS, etc.) via a storage driver (Aufs, Btrfs, Overlayfs), (2) through devicemapper thin pools (2.a loop-lvm on sparse files, 2.b direct-lvm on a base device), or (3) through a data volume; configured via the -g and -v options]

Storage Options:

  • 1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
  • 2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
  • 3. Through Docker Data Volume (-v)

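A hedged sketch of where each option is configured (Docker 1.11-era CLI; device, paths and image name are illustrative):

    # Options 1 and 2: the storage driver is a daemon-level setting;
    # -g/--graph points the Docker root at a filesystem on the NVMe SSD
    docker daemon --storage-driver=overlay --graph=/mnt/nvme/docker

    # Option 3: a data volume (-v) bypasses the storage driver for the container's data path
    docker run -d -v /mnt/nvme/data:/data myapp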

SLIDE 8

Experimental Environment

Optimize the storage configuration for a single container

Dual-socket Xeon E5-2670 v3, 12 HT cores
Enterprise-class NVMe SSD: Samsung XS1715
Kernel v4.6.0, Docker v1.11.2
fio used for traffic generation
Asynchronous IO engine (libaio), 32 concurrent jobs, iodepth of 32
Steady-state performance measured
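A minimal fio sketch consistent with this setup (the block size and runtime below are illustrative, not the authors' exact job file):

    # 4KB random reads against the raw NVMe device: libaio, 32 jobs, iodepth 32,
    # time-based run long enough to reach steady state
    fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k \
        --ioengine=libaio --direct=1 --numjobs=32 --iodepth=32 \
        --group_reporting --time_based --runtime=600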

SLIDE 9

Performance Comparison


[Charts: average bandwidth (MB/s) for SR/SW and average IOPS (K) for RR/RW, comparing RAW, EXT4 and XFS]

— Host Backing Filesystems

EXT4 performs 25% worse for RR
XFS closely resembles RAW for everything but RW

SLIDE 10

Tuning the Performance Gap


[Chart: EXT4 random-read IOPS (K), default mount options vs. dioread_nolock]

— Random Reads

XFS reaches ~700K IOPS: it allows multiple processes to read a file at once

Uses allocation groups which can be accessed independently

EXT4 requires mutex locks even for read operations
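The tuning knob in the chart is an ext4 mount option; a hedged example (device and mount point are illustrative):

    # dioread_nolock lets ext4 serve direct-IO reads without taking the per-inode
    # exclusive lock, closing much of the random-read gap against XFS
    mount -t ext4 -o dioread_nolock /dev/nvme0n1 /mnt/ext4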

SLIDE 11


[1] https://www.percona.com/blog/2012/03/15/ext4-vs-xfs-on-ssd/

[Chart: random-write IOPS (K) vs. number of jobs (1 to 64) for RAW, EXT4 and XFS]

Tuning the Performance Gap

— Random Writes

XFS performs poorly at high thread counts
Contention on an exclusive lock kills write performance
The lock is used by extent lookups and write checks
A patch is available, but not for Linux 4.6 [1]

SLIDE 12

Storage Architecture in Docker

(Recap of the storage architecture from Slide 7, before evaluating Option 1: the Docker filesystem storage drivers.)

SLIDE 13

Docker Storage Options


Aufs (Advanced multi-layered Unification FileSystem):

A fast, reliable unification file system

Btrfs (B-tree file system):

A modern CoW file system that implements many advanced features for fault tolerance, repair, and easy administration

Overlayfs:

Another modern unification file system with a simpler design; potentially faster than Aufs

Option 1: Through Docker File System
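A hedged sketch of selecting one of these drivers at daemon start and verifying it (Docker 1.11-era syntax):

    # Start the daemon with the overlay storage driver (aufs or btrfs are chosen the same way)
    docker daemon --storage-driver=overlay

    # Confirm which driver and backing filesystem are in use
    docker info | grep -i 'storage driver'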

SLIDE 14

Performance Comparison


[Charts: average bandwidth (MB/s) for SR/SW and average IOPS (K) for RR/RW, comparing Raw, Aufs, Btrfs and Overlay]

Option 1: Through Docker File System

Aufs and Overlayfs perform close to the raw block device in most cases
Btrfs has the worst performance for random workloads

SLIDE 15

Tuning the Performance Gap of Btrfs


—Random Reads

[Chart: random-read bandwidth (MB/s) vs. block size for RAW, EXT4 and Btrfs]

Btrfs does not yet work well at small block sizes: it must read the file extent before reading the file data

Large block size reduces the frequency of reading metadata

SLIDE 16

Tuning the Performance Gap of Btrfs


— Random Writes

Btrfs does not work well for random writes due to CoW overhead

[Chart: Btrfs random-write IOPS (K), default mount options vs. nodatacow]
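The nodatacow bar corresponds to a Btrfs mount option; a hedged example (device and mount point are illustrative):

    # nodatacow disables copy-on-write (and data checksumming) for newly written data,
    # trading those features for better random-write performance
    mount -t btrfs -o nodatacow /dev/nvme0n1 /mnt/btrfs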

SLIDE 17

Storage Architecture in Docker

(Recap of the storage architecture from Slide 7, before evaluating Option 2: virtual block devices, and Option 3: the Docker data volume.)

SLIDE 18


Docker Storage Configurations

Option 2: Through Virtual Block Devices

The devicemapper storage driver leverages the thin provisioning and snapshotting capabilities of the kernel's Device Mapper framework
Loop-lvm uses sparse files to build the thin-provisioned pools
Direct-lvm uses a block device to create the thin pools directly (recommended by Docker)
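A hedged direct-lvm sketch (LVM names, sizes and the device are illustrative; dm.thinpooldev is the devicemapper option that points Docker at an existing thin pool):

    # Build an LVM thin pool directly on the NVMe SSD ...
    pvcreate /dev/nvme0n1
    vgcreate docker /dev/nvme0n1
    lvcreate --wipesignatures y -n thinpool docker -l 95%VG
    lvcreate --wipesignatures y -n thinpoolmeta docker -l 1%VG
    lvconvert -y --zero n --thinpool docker/thinpool --poolmetadata docker/thinpoolmeta

    # ... and hand it to the devicemapper storage driver (direct-lvm mode, Docker 1.11-era CLI)
    docker daemon --storage-driver=devicemapper \
        --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool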

SLIDE 19


Docker Storage Configurations

Option 3: Through Docker Data Volume (-v)

Data persists beyond the lifetime of the container and can be shared with and accessed from other containers

* figure from https://github.com/libopenstorage/openstorage
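A hedged example of Option 3 (paths and image name are illustrative): the volume lives on the host filesystem, outside the container's copy-on-write layers:

    # Bind-mount a directory on the XFS-formatted NVMe SSD into the container;
    # the data outlives the container and can also be mounted by other containers
    docker run -d --name db1 -v /mnt/nvme/db1:/var/lib/data myapp
    docker run -d --name db2 --volumes-from db1 myapp   # share the same volume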

SLIDE 20

Performance Comparison


Option 2 & Option 3

[Charts: average IOPS (K) for RR/RW and average bandwidth (MB/s) for SR/SW, comparing RAW, Direct-lvm, Loop-lvm, -v Aufs and -v Overlay]

Direct-lvm has worse performance for RR/RW
LVM, device mapper, and the dm-thinp kernel module introduce additional code paths and overhead, and may not suit IO-intensive workloads

SLIDE 21

Application Performance


Cassandra Database

NoSQL database
Scales linearly with the number of nodes in the cluster (theoretically) [1]
Requires data persistence
Uses a Docker data volume to store its data

[1] Rabl, Tilmann et al. "Solving Big Data Challenges for Enterprise Application Performance Management”, VLDB’13

SLIDE 22


Scaling Docker Containers on NVMe

Experiment Setup: multiple containerized Cassandra databases

Dual-socket Xeon E5 server, 10Gb Ethernet
N = 1, 2, 3, … 8 containers
Each container is driven by a YCSB client
Record count: 100M records, 100GB in each DB
Client thread count: 16

Workloads

Workload A: 50% read, 50% update, Zipfian distribution
Workload D: 95% read, 5% insert, normal distribution
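A hedged YCSB sketch matching these parameters (host address, operation count and paths are illustrative):

    # Load 100M records into one containerized Cassandra instance, then drive it
    # with workload A (50/50 read/update) using 16 client threads
    bin/ycsb load cassandra-cql -P workloads/workloada \
        -p hosts=10.0.0.11 -p recordcount=100000000 -threads 16
    bin/ycsb run cassandra-cql -P workloads/workloada \
        -p hosts=10.0.0.11 -p operationcount=10000000 -threads 16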

SLIDE 23

Results: Throughput


[Chart: aggregate throughput (ops/sec) vs. number of Cassandra containers (1 to 8), broken down per container C1 to C8, with cgroup limits applied]

Aggregate throughput peaks at 4 containers
Cgroup limits: 6 CPU cores, 6GB memory, 400MB/s bandwidth
Workload D, directly attached SSD
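A hedged sketch of those per-container limits expressed as docker run flags (image, paths and device are illustrative):

    # 6 CPU cores, 6 GB of memory, and a 400 MB/s cap on device bandwidth per container
    docker run -d --name cass1 \
        --cpuset-cpus=0-5 \
        --memory=6g \
        --device-read-bps=/dev/nvme0n1:400mb \
        --device-write-bps=/dev/nvme0n1:400mb \
        -v /mnt/nvme/cass1:/var/lib/cassandra \
        cassandra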

SLIDE 24

Strategies for Dividing Resources


[Chart: throughput (ops/sec) vs. number of Cassandra containers, comparing cgroup strategies: CPU, MEM, CPU+MEM, BW, All, Uncontrolled]

MEM has the most significant impact on throughput

Best strategy for dividing resources using cgroups:
Assign 6 CPU cores to each container and leave the other resources uncontrolled

SLIDE 25

Scaling Containerized Cassandra using NVMf


Experiment Setup

[Diagram: YCSB clients → application server running Cassandra in Docker → NVMf target storage server, connected over 10GbE and 40GbE links]
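A hedged sketch of the storage-server side of this setup, exporting the SSD through the Linux kernel NVMf target (NQN, addresses and device are illustrative):

    # Storage server: export /dev/nvme0n1 over RDMA via the nvmet configfs interface
    modprobe nvmet
    modprobe nvmet-rdma
    mkdir /sys/kernel/config/nvmet/subsystems/cassdata
    echo 1 > /sys/kernel/config/nvmet/subsystems/cassdata/attr_allow_any_host
    mkdir /sys/kernel/config/nvmet/subsystems/cassdata/namespaces/1
    echo /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/cassdata/namespaces/1/device_path
    echo 1 > /sys/kernel/config/nvmet/subsystems/cassdata/namespaces/1/enable
    mkdir /sys/kernel/config/nvmet/ports/1
    echo rdma     > /sys/kernel/config/nvmet/ports/1/addr_trtype
    echo ipv4     > /sys/kernel/config/nvmet/ports/1/addr_adrfam
    echo 10.0.0.2 > /sys/kernel/config/nvmet/ports/1/addr_traddr
    echo 4420     > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
    ln -s /sys/kernel/config/nvmet/subsystems/cassdata \
          /sys/kernel/config/nvmet/ports/1/subsystems/cassdata

    # Application server: connect, then point the containers' data volumes at the remote namespace
    nvme connect -t rdma -n cassdata -a 10.0.0.2 -s 4420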

SLIDE 26

Results: Throughput


[Chart: relative TPS vs. number of Cassandra instances (1 to 8) for DAS_A, NVMf_A, DAS_D, NVMf_D]

The throughput with NVMf is within 6% to 12% of directly attached SSDs

SLIDE 27

Results: Latency


[Chart: relative latency vs. number of Cassandra instances (1 to 8) for DAS_A, NVMf_A, DAS_D, NVMf_D]

NVMf incurs only 2% to 15% longer latency than directly attached SSDs.

SLIDE 28


Results: CPU Utilization

NVMf incurs less than 1.8% CPU utilization on the target machine

SLIDE 29

SUMMARY


Best Option in Docker for NVMe Drive Performance

Overlay FS + XFS + Data Volume

Best Strategy for Dividing Resources using Cgroups

Control only the CPU resources

Scaling Docker Containers on NVMf

Throughput: within 6% to 12% of DAS

Latency: 2% to 15% longer than DAS

THANK YOU! QIUMIN@USC.EDU