SLIDE 1

PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE


Qiumin Xu1, Manu Awasthi2, Krishna T. Malladi3, Janki Bhimani4, Jingpei Yang3, Murali Annavaram1

1USC, 2IIT Gandhinagar, 3Samsung, 4Northeastern

SLIDE 2

Docker Has Become Very Popular


Software container platform with many desirable features

Ease of deployment, developer friendliness, and lightweight virtualization

Mainstay in cloud platforms

Google Cloud Platform, Amazon EC2, Microsoft Azure

The storage hierarchy is a key component

High-performance SSDs: NVMe, NVMe over Fabrics

SLIDE 3

Agenda

  • Docker, NVMe and NVMe over Fabrics (NVMf)
  • How to best utilize NVMe SSDs for a single container?
    The best configuration performs close to raw performance
    Where do the performance anomalies come from?
  • Do Docker containers scale well on NVMe SSDs?
    Exemplified using Cassandra
    Best strategy for dividing the resources
  • Scaling Docker containers on NVMe over Fabrics


SLIDE 4


What Is a Docker Container?

Each virtualized application includes an entire guest OS (~10s of GB)
A Docker container comprises just the application and its bins/libs
Containers share the kernel with other containers
Much more portable and efficient

figure from https://docs.docker.com
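For illustration only (not from the slides; image and container names are arbitrary), deploying a containerized application is a couple of commands:

    # Pull an image containing just the app and its bins/libs, then start it.
    # The container shares the host kernel, so it starts in well under a second.
    docker pull redis
    docker run -d --name mycache redis
    docker ps    # the container runs as an ordinary process group on the host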

SLIDE 5

Non-Volatile Memory Express (NVMe)

A storage protocol standard on top of PCIe
NVMe SSDs connect through PCIe and support the standard

Shipping since 2014 (Intel, Samsung)
Enterprise and consumer variants

NVMe SSDs leverage the interface to deliver superior performance

5x to 10x over SATA SSDs [1]


[1] Qiumin Xu et al. “Performance analysis of NVMe SSDs and their implication on real world databases.” SYSTOR’15
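As a quick illustration (assumes the nvme-cli tool; not part of the slides), an NVMe SSD attached over PCIe is visible through the standard kernel NVMe driver:

    # List NVMe controllers and namespaces visible to the kernel NVMe driver
    nvme list
    # Dump identify-controller data (model, firmware, queue counts) for one device
    nvme id-ctrl /dev/nvme0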

SLIDE 6

Why NVMe over Fabrics (NVMf)?


Retains NVMe performance over network fabrics
Eliminates unnecessary protocol translations
Enables low-latency, high-IOPS remote storage

  • J. Metz and D. Minturn, “Under the Hood with NVMe over Fabrics,” SNIA Ethernet Storage Forum
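On the host side, a hedged sketch (addresses and NQN are illustrative; assumes nvme-cli with fabrics support and an RDMA-capable NIC) of attaching a remote NVMf namespace:

    # Discover subsystems exported by an NVMf target over RDMA
    nvme discover -t rdma -a 192.168.1.100 -s 4420
    # Connect: the remote namespace then shows up as a local /dev/nvmeXnY block device
    nvme connect -t rdma -n nqn.2016-06.io.example:remote-nvme -a 192.168.1.100 -s 4420
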
SLIDE 7


Storage Architecture in Docker

[Diagram: container read/write operations reach the NVMe SSD either (1) through the host backing filesystem (EXT4, XFS, etc.) via a storage driver (Aufs, Btrfs, Overlayfs), (2) through devicemapper thin pools (2.a loop-lvm on sparse files, 2.b direct-lvm on a base device), or (3) through a data volume; configured via the -g and -v options]

Storage Options:

  • 1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
  • 2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
  • 3. Through Docker Data Volume (-v)

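A hedged sketch of where each option is configured (Docker 1.11-era CLI; device, paths and image name are illustrative):

    # Options 1 and 2: the storage driver is a daemon-level setting;
    # -g/--graph points the Docker root at a filesystem on the NVMe SSD
    docker daemon --storage-driver=overlay --graph=/mnt/nvme/docker

    # Option 3: a data volume (-v) bypasses the storage driver for the container's data path
    docker run -d -v /mnt/nvme/data:/data myapp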

SLIDE 8

Experimental Environment

Optimize the storage configuration for a single container

Dual-socket Xeon E5-2670 v3, 12 HT cores
Enterprise-class NVMe SSD: Samsung XS1715
Kernel v4.6.0, Docker v1.11.2
fio used for traffic generation
Asynchronous IO engine (libaio), 32 concurrent jobs, iodepth of 32
Steady-state performance measured
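A minimal fio sketch consistent with this setup (the block size and runtime below are illustrative, not the authors' exact job file):

    # 4KB random reads against the raw NVMe device: libaio, 32 jobs, iodepth 32,
    # time-based run long enough to reach steady state
    fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k \
        --ioengine=libaio --direct=1 --numjobs=32 --iodepth=32 \
        --group_reporting --time_based --runtime=600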

SLIDE 9

Performance Comparison


[Charts: average bandwidth (MB/s) for SR/SW and average IOPS (K) for RR/RW, comparing RAW, EXT4 and XFS]

— Host Backing Filesystems

EXT4 performs 25% worse for RR
XFS closely resembles RAW for everything but RW

SLIDE 10

Tuning the Performance Gap


[Chart: EXT4 random-read IOPS (K), default mount options vs. dioread_nolock]

— Random Reads

XFS reaches ~700K IOPS: it allows multiple processes to read a file at once

Uses allocation groups which can be accessed independently

EXT4 requires mutex locks even for read operations
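The tuning knob in the chart is an ext4 mount option; a hedged example (device and mount point are illustrative):

    # dioread_nolock lets ext4 serve direct-IO reads without taking the per-inode
    # exclusive lock, closing much of the random-read gap against XFS
    mount -t ext4 -o dioread_nolock /dev/nvme0n1 /mnt/ext4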

SLIDE 11


[1] https://www.percona.com/blog/2012/03/15/ext4-vs-xfs-on-ssd/

[Chart: random-write IOPS (K) vs. number of jobs (1 to 64) for RAW, EXT4 and XFS]

Tuning the Performance Gap

— Random Writes

XFS performs poorly at high thread counts
Contention on an exclusive lock kills write performance
The lock is used by extent lookups and write checks
A patch is available, but not for Linux 4.6 [1]

SLIDE 12

Storage Architecture in Docker

(Recap of the storage architecture from Slide 7, before evaluating Option 1: the Docker filesystem storage drivers.)

SLIDE 13

Docker Storage Options


Aufs (Advanced multi-layered Unification FileSystem):

A fast, reliable unification file system

Btrfs (B-tree file system):

A modern CoW file system that implements many advanced features for fault tolerance, repair, and easy administration

Overlayfs:

Another modern unification file system with a simpler design; potentially faster than Aufs

Option 1: Through Docker File System
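A hedged sketch of selecting one of these drivers at daemon start and verifying it (Docker 1.11-era syntax):

    # Start the daemon with the overlay storage driver (aufs or btrfs are chosen the same way)
    docker daemon --storage-driver=overlay

    # Confirm which driver and backing filesystem are in use
    docker info | grep -i 'storage driver'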

SLIDE 14

Performance Comparison


[Charts: average bandwidth (MB/s) for SR/SW and average IOPS (K) for RR/RW, comparing Raw, Aufs, Btrfs and Overlay]

Option 1: Through Docker File System

Aufs and Overlayfs perform close to the raw block device in most cases
Btrfs has the worst performance for random workloads

SLIDE 15

Tuning the Performance Gap of Btrfs


—Random Reads

[Chart: random-read bandwidth (MB/s) vs. block size for RAW, EXT4 and Btrfs]

Btrfs does not yet work well at small block sizes: it must read the file extent before reading the file data

Large block size reduces the frequency of reading metadata

SLIDE 16

Tuning the Performance Gap of Btrfs


— Random Writes

Btrfs does not work well for random writes due to CoW overhead

[Chart: Btrfs random-write IOPS (K), default mount options vs. nodatacow]
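The nodatacow bar corresponds to a Btrfs mount option; a hedged example (device and mount point are illustrative):

    # nodatacow disables copy-on-write (and data checksumming) for newly written data,
    # trading those features for better random-write performance
    mount -t btrfs -o nodatacow /dev/nvme0n1 /mnt/btrfs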

SLIDE 17

Storage Architecture in Docker

(Recap of the storage architecture from Slide 7, before evaluating Option 2: virtual block devices, and Option 3: the Docker data volume.)

SLIDE 18


Docker Storage Configurations

Option 2: Through Virtual Block Devices

The devicemapper storage driver leverages the thin provisioning and snapshotting capabilities of the kernel's Device Mapper framework
Loop-lvm uses sparse files to build the thin-provisioned pools
Direct-lvm uses a block device to create the thin pools directly (recommended by Docker)
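A hedged direct-lvm sketch (LVM names, sizes and the device are illustrative; dm.thinpooldev is the devicemapper option that points Docker at an existing thin pool):

    # Build an LVM thin pool directly on the NVMe SSD ...
    pvcreate /dev/nvme0n1
    vgcreate docker /dev/nvme0n1
    lvcreate --wipesignatures y -n thinpool docker -l 95%VG
    lvcreate --wipesignatures y -n thinpoolmeta docker -l 1%VG
    lvconvert -y --zero n --thinpool docker/thinpool --poolmetadata docker/thinpoolmeta

    # ... and hand it to the devicemapper storage driver (direct-lvm mode, Docker 1.11-era CLI)
    docker daemon --storage-driver=devicemapper \
        --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool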

SLIDE 19


Docker Storage Configurations

Option 3: Through Docker Data Volume (-v)

Data persists beyond the lifetime of the container and can be shared with and accessed from other containers

* figure from https://github.com/libopenstorage/openstorage
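A hedged example of Option 3 (paths and image name are illustrative): the volume lives on the host filesystem, outside the container's copy-on-write layers:

    # Bind-mount a directory on the XFS-formatted NVMe SSD into the container;
    # the data outlives the container and can also be mounted by other containers
    docker run -d --name db1 -v /mnt/nvme/db1:/var/lib/data myapp
    docker run -d --name db2 --volumes-from db1 myapp   # share the same volume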

SLIDE 20

Performance Comparison


Option 2 & Option 3

[Charts: average IOPS (K) for RR/RW and average bandwidth (MB/s) for SR/SW, comparing RAW, Direct-lvm, Loop-lvm, -v Aufs and -v Overlay]

Direct-lvm has worse performance for RR/RW
LVM, device mapper, and the dm-thinp kernel module introduce additional code paths and overhead, and may not suit IO-intensive workloads

SLIDE 21

Application Performance


Cassandra Database

NoSQL database
Scales linearly with the number of nodes in the cluster (theoretically) [1]
Requires data persistence
Uses a Docker data volume to store its data

[1] Rabl, Tilmann et al. "Solving Big Data Challenges for Enterprise Application Performance Management”, VLDB’13

SLIDE 22


Scaling Docker Containers on NVMe

Experiment Setup: multiple containerized Cassandra databases

Dual-socket Xeon E5 server, 10Gb Ethernet
N = 1, 2, 3, … 8 containers
Each container is driven by a YCSB client
Record count: 100M records, 100GB in each DB
Client thread count: 16

Workloads

Workload A: 50% read, 50% update, Zipfian distribution
Workload D: 95% read, 5% insert, normal distribution
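A hedged YCSB sketch matching these parameters (host address, operation count and paths are illustrative):

    # Load 100M records into one containerized Cassandra instance, then drive it
    # with workload A (50/50 read/update) using 16 client threads
    bin/ycsb load cassandra-cql -P workloads/workloada \
        -p hosts=10.0.0.11 -p recordcount=100000000 -threads 16
    bin/ycsb run cassandra-cql -P workloads/workloada \
        -p hosts=10.0.0.11 -p operationcount=10000000 -threads 16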

SLIDE 23

Results: Throughput


[Chart: aggregate throughput (ops/sec) vs. number of Cassandra containers (1 to 8), broken down per container C1 to C8, with cgroup limits applied]

Aggregate throughput peaks at 4 containers
Cgroup limits: 6 CPU cores, 6GB memory, 400MB/s bandwidth
Workload D, directly attached SSD
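A hedged sketch of those per-container limits expressed as docker run flags (image, paths and device are illustrative):

    # 6 CPU cores, 6 GB of memory, and a 400 MB/s cap on device bandwidth per container
    docker run -d --name cass1 \
        --cpuset-cpus=0-5 \
        --memory=6g \
        --device-read-bps=/dev/nvme0n1:400mb \
        --device-write-bps=/dev/nvme0n1:400mb \
        -v /mnt/nvme/cass1:/var/lib/cassandra \
        cassandra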

SLIDE 24

Strategies for Dividing Resources


[Chart: throughput (ops/sec) vs. number of Cassandra containers, comparing cgroup strategies: CPU, MEM, CPU+MEM, BW, All, Uncontrolled]

MEM has the most significant impact on throughput

Best strategy for dividing resources using cgroups:
Assign 6 CPU cores to each container and leave the other resources uncontrolled

SLIDE 25

Scaling Containerized Cassandra using NVMf


Experiment Setup

[Diagram: YCSB clients → application server running Cassandra in Docker → NVMf target storage server, connected over 10GbE and 40GbE links]
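A hedged sketch of the storage-server side of this setup, exporting the SSD through the Linux kernel NVMf target (NQN, addresses and device are illustrative):

    # Storage server: export /dev/nvme0n1 over RDMA via the nvmet configfs interface
    modprobe nvmet
    modprobe nvmet-rdma
    mkdir /sys/kernel/config/nvmet/subsystems/cassdata
    echo 1 > /sys/kernel/config/nvmet/subsystems/cassdata/attr_allow_any_host
    mkdir /sys/kernel/config/nvmet/subsystems/cassdata/namespaces/1
    echo /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/cassdata/namespaces/1/device_path
    echo 1 > /sys/kernel/config/nvmet/subsystems/cassdata/namespaces/1/enable
    mkdir /sys/kernel/config/nvmet/ports/1
    echo rdma     > /sys/kernel/config/nvmet/ports/1/addr_trtype
    echo ipv4     > /sys/kernel/config/nvmet/ports/1/addr_adrfam
    echo 10.0.0.2 > /sys/kernel/config/nvmet/ports/1/addr_traddr
    echo 4420     > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
    ln -s /sys/kernel/config/nvmet/subsystems/cassdata \
          /sys/kernel/config/nvmet/ports/1/subsystems/cassdata

    # Application server: connect, then point the containers' data volumes at the remote namespace
    nvme connect -t rdma -n cassdata -a 10.0.0.2 -s 4420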

SLIDE 26

Results: Throughput


[Chart: relative TPS vs. number of Cassandra instances (1 to 8) for DAS_A, NVMf_A, DAS_D, NVMf_D]

The throughput with NVMf is within 6% to 12% of directly attached SSDs

SLIDE 27

Results: Latency


[Chart: relative latency vs. number of Cassandra instances (1 to 8) for DAS_A, NVMf_A, DAS_D, NVMf_D]

NVMf incurs only 2% to 15% longer latency than directly attached SSDs.

SLIDE 28


Results: CPU Utilization

NVMf incurs less than 1.8% CPU utilization on the target machine

SLIDE 29

SUMMARY


Best Option in Docker for NVMe Drive Performance

Overlay FS + XFS + Data Volume

Best Strategy for Dividing Resources using Cgroups

Control only the CPU resources

Scaling Docker Containers on NVMf

Throughput: within 6% to 12% of DAS

Latency: 2% to 15% longer than DAS

THANK YOU! QIUMIN@USC.EDU