 
NUMA Implication for Storage I/O Throughput in Modern Servers
Shoaib Akram, Manolis Marazakis, and Angelos Bilas
Computer Architecture and VLSI Laboratory, FORTH-ICS, Greece
Outline
 Introduction
 Motivation
 Previous Work
 The “Affinity” Space Exploration
 Evaluation Methodology
 Results
 Conclusions
Introduction
 The number of processors on a single motherboard is increasing.
 Each processor has faster access to local memory than to remote memory, leading to the well-known NUMA problem.
 Much effort in the past has been devoted to NUMA-aware memory management and scheduling.
 There is a similar “non-uniform latency of access” relation between processors and I/O devices.
 In this work, we quantify the combined impact of non-uniform latency in accessing memory and I/O devices.
Motivation-Trends in I/O Subsystems
 Storage-related features:
   Fast point-to-point interconnects.
   Multiple storage controllers.
   Arrays of storage devices with high bandwidth.
 Result:
   Storage bandwidth is catching up with memory bandwidth.
   Today, NUMA affinity is a problem for the entire system.
[Figure: The anatomy of a modern server machine with 2 NUMA domains.]
Motivation-Trends in Processor Technology and Applications
[Figure: Throughput of typical workloads in today’s data-centres measured for 16 cores and projected to many cores. Legend labels: Backend, Data Stores, OLTP, Growth of Cores. Y-axis: GB/s (0 to 250). X-axis: Number of Cores (16 to 4096).]
 Today:
   Low device throughput.
   High per I/O cycle overhead.
   Low server utilization in data-centres.
 In Future:
   Many GB/s of device throughput.
   Low per I/O cycle overhead as system stacks improve.
   High server utilization in data-centres through server consolidation etc.
Previous Work  Multiprocessors  Multicore Processors
The “Affinity” Space Exploration
 The movement of data on any I/O access:
   Copy: to/from the application buffer from/to the kernel buffer.
   DMA transfer: to/from the kernel buffer from/to the I/O device.
 Application and kernel buffers could be located in different NUMA domains.
 Kernel buffers and I/O devices could be located in different NUMA domains.
The “Affinity” Space Exploration
 The axes of the space are:
   Transfer (TR) between I/O devices and kernel buffers.
   Copy (CP) from the kernel buffer to the application buffer.

Transfer (TR)   Copy (CP)    Configuration
Local (L)       Local (L)    TRLCPL
Local (L)       Remote (R)   TRLCPR
Remote (R)      Local (L)    TRRCPL
Remote (R)      Remote (R)   TRRCPR

[Figure: Example Scenarios for TRLCPR, TRRCPR, and TRLCPL.]
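The four configuration labels follow mechanically from the two axes. A minimal sketch, assuming only the naming scheme shown in the table (the function name is hypothetical and not part of the original slides):

```python
from itertools import product

# Axis values from the table: L = local, R = remote.
PLACEMENTS = ("L", "R")


def affinity_configurations():
    """Enumerate the labels TR<x>CP<y> for every transfer/copy placement."""
    return [f"TR{tr}CP{cp}" for tr, cp in product(PLACEMENTS, repeat=2)]


print(affinity_configurations())
```

Adding a third axis or a third placement value would only require extending `PLACEMENTS` or the label format, which is one reason to generate the labels rather than hard-code them.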
NUMA Memory Allocation
 How are application buffers allocated?
   In the domain in which the process is executing.
   The numactl utility allows pinning application processes and their memory buffers to a particular domain.
 How are kernel buffers allocated?
   Kernel buffer allocation cannot be controlled using numactl.
   I/O buffers in the kernel are shared by all contexts performing I/O in the kernel.
 How to maximize the possibility of keeping kernel buffers in a particular socket?
   Start each experiment with a clean buffer cache.
   Use large DRAM per socket compared to application datasets.
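As an illustration of the application-side pinning described above, the sketch below builds a numactl command line that binds both the CPUs and the memory of a process to one NUMA node, using numactl's `--cpunodebind` and `--membind` options. The workload binary and file path are hypothetical placeholders, and the slides do not state which options the authors actually used.

```python
import shlex


def numactl_command(domain, app_argv):
    """Build an argv list that pins a process and its memory to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={domain}",  # run only on CPUs of this node
        f"--membind={domain}",      # allocate application memory only from this node
        *app_argv,
    ]


# Hypothetical workload and path, for illustration only.
cmd = numactl_command(0, ["./workload", "--file", "/mnt/raid0/data"])
print(shlex.join(cmd))
```

This only controls where the application and its buffers live; as noted above, kernel buffer placement is not controlled this way.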
Evaluation Methodology
 Bandwidth characterization of evaluation testbed.
[Figure: testbed with two NUMA domains.]
 Memory bandwidth: 12 GB/s per domain; Total Memory Bandwidth = 24 GB/s.
 Link between the two domains: 24 GT/s.
 Per domain: 2 Storage Controllers = 3 GB/s; SSD Read Throughput = 250 MB/s; 12 SSDs = 3 GB/s.
 Total Storage Throughput = 6 GB/s.
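The storage figures above can be cross-checked with simple arithmetic. A minimal sketch, assuming decimal units (1 GB/s = 1000 MB/s) and that per-domain storage throughput is bounded by the smaller of the SSD aggregate and the controller aggregate; the variable names are illustrative only:

```python
SSD_READ_MBPS = 250               # per-SSD read throughput
SSDS_PER_DOMAIN = 12
DOMAINS = 2
CONTROLLERS_GBPS_PER_DOMAIN = 3.0  # two controllers combined
TOTAL_MEMORY_GBPS = 24.0

# Aggregate SSD throughput per domain, converted from MB/s to GB/s.
ssd_gbps_per_domain = SSD_READ_MBPS * SSDS_PER_DOMAIN / 1000

# Assumed bound: storage per domain is limited by the slower of SSDs and controllers.
storage_gbps_per_domain = min(ssd_gbps_per_domain, CONTROLLERS_GBPS_PER_DOMAIN)
total_storage_gbps = storage_gbps_per_domain * DOMAINS

storage_to_memory_ratio = total_storage_gbps / TOTAL_MEMORY_GBPS

print(ssd_gbps_per_domain, total_storage_gbps, storage_to_memory_ratio)
```

On this testbed the SSDs and the controllers are matched at 3 GB/s per domain, and total storage throughput comes to one quarter of total memory bandwidth.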
Benchmarks, Applications and Datasets
 Benchmarks and Applications
   zmIO: in-house microbenchmark.
   fsmark: a filesystem stress benchmark.
   stream: a streaming workload.
   psearchy: a file indexing application (part of MOSBENCH).
   IOR: application checkpointing.
 Workload Configuration and Datasets
   Four software RAID Level 0 devices, each on top of 6 SSDs and 1 storage controller.
   One workload instance per RAID device.
   Datasets consist of large files, with parameters that result in high concurrency, high read/write throughput, and stressing of the system stack.
Evaluation Metrics
 Problem with high-level application metrics:
   Not possible to map actual volume of data transmitted.
   Not possible to look at individual components of complex software stacks.
 It is important to look at individual components of execution time (user, system, idle, and iowait).
 Cycles per I/O
   Physical cycles consumed by the application during the execution time divided by the number of I/O operations.
   Can be used as an efficiency metric.
   Can be converted to energy per I/O.
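The cycles-per-I/O metric can be sketched as follows. The slides define it as cycles divided by I/O operations; deriving cycles from measured time and a fixed clock frequency is an assumption made here for illustration, and all numbers in the example are made up, not measurements from the study.

```python
def cycles_per_io(cpu_seconds, clock_hz, io_operations):
    """Cycles consumed during cpu_seconds, divided by the number of I/O operations."""
    if io_operations <= 0:
        raise ValueError("io_operations must be positive")
    return cpu_seconds * clock_hz / io_operations


def cycles_per_io_breakdown(times, clock_hz, io_operations):
    """Apply the metric to each execution-time component (user, system, idle, iowait)."""
    return {
        name: cycles_per_io(seconds, clock_hz, io_operations)
        for name, seconds in times.items()
    }


# Made-up example values: seconds per component, a 2 GHz clock, 4 million I/Os.
example_times = {"user": 2.0, "system": 6.0, "iowait": 4.0, "idle": 8.0}
breakdown = cycles_per_io_breakdown(example_times, 2e9, 4_000_000)
for name, value in breakdown.items():
    print(name, value)
```

Keeping the components separate mirrors the point above that system and iowait time need to be examined individually rather than folded into one application-level number.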
Results – mem_stress and zmIO
 Remote memory accesses:
   Memory throughput drops by one half.
   The degradation starts from one instance of mem_stress.
 Remote transfers:
   Device throughput drops by one half.
   The throughput is the same for one instance.
   Contention is a possible culprit for two and more instances.
[Figure: mem_stress, throughput (GB/s) vs. Number of Instances (1 to 6), Local vs. Remote.]
[Figure: zmIO, throughput (GB/s) vs. Number of Instances (1 to 8), TRLCPL vs. TRRCPR. Round-robin assignment of instances to RAID devices.]
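The round-robin assignment of zmIO instances to RAID devices can be sketched as below. The device names are hypothetical, the count of four devices is taken from the workload configuration, and instances are numbered from 0 here for simplicity; the slides do not describe the actual assignment mechanism.

```python
# Hypothetical names for the four software RAID 0 devices.
RAID_DEVICES = ["md0", "md1", "md2", "md3"]


def assign_round_robin(num_instances, devices=RAID_DEVICES):
    """Map instance index i to device i mod len(devices)."""
    return {i: devices[i % len(devices)] for i in range(num_instances)}


for instance, device in assign_round_robin(8).items():
    print(instance, device)
```

With eight instances and four devices, every device ends up serving two instances, which is consistent with contention appearing once more than one instance is running.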
Results – fsmark and psearchy
 fsmark is filesystem intensive:
   Remote transfers result in 40% higher system time.
   130% increase in iowait time.
 psearchy is both filesystem and I/O intensive:
   57% increase in system time.
   70% increase in iowait time.
[Figure: fsmark, Cycles per I/O Sector (system and iowait) by Configuration: TRLCPL, TRRCPR, TRRCPL, TRLCPR.]
[Figure: psearchy, Cycles per I/O Sector (system and iowait) by Configuration: TRLCPL, TRRCPR, TRRCPL, TRLCPR.]
Results - IOR
 IOR is both a read-intensive and a write-intensive benchmark.
   15% decrease in read throughput due to remote transfers and memory copies.
   20% decrease in write throughput due to remote transfers and copies.
[Figure: IOR, Read and Write throughput (MB/s) by Configuration: TRLCPL, TRRCPR, TRRCPL, TRLCPR.]
Results - stream
 24 SSDs are divided into two domains.
 Each set of 12 SSDs is connected to two controllers.
 Ideally, symmetric throughput is expected.
 Remote transfers result in a 27% drop in throughput of one of the sets.
[Figure: stream, throughput (MB/s) of Set A and Set B by Configuration: TRLCPL, TRRCPR, TRRCPL, TRLCPR.]
Conclusions
 A mix of synthetic benchmarks and applications shows the potential of NUMA affinity to hurt I/O throughput.
 Future systems will have increased heterogeneity, more domains, and higher bandwidths.
 Today, NUMA affinity is not a problem for cores within a single processor socket. Future processors with 100s of cores will have domains within a processor chip.
 The range of performance degradation is important. Different server configurations and runtime libraries result in throughput within a range.
 Partitioning of the system stacks based on sockets will become necessary.