NUMA Implication for Storage I/O Throughput in Modern Servers
Shoaib Akram, Manolis Marazakis, and Angelos Bilas
Computer Architecture and VLSI Laboratory, FORTH-ICS, Greece
Outline
Introduction
Motivation
Previous Work
The “Affinity” Space Exploration
Evaluation Methodology
Results
Conclusions
Introduction
The number of processors on a single motherboard is increasing.
Each processor has faster access to its local memory than to remote memory, leading to the well-known NUMA problem.
Much effort in the past has been devoted to NUMA-aware memory management and scheduling.
There is a similar “non-uniform latency of access” relationship between processors and I/O devices.
In this work, we quantify the combined impact of non-uniform latency in accessing memory and I/O devices.
Motivation – Trends in I/O Subsystems
Storage-related features:
Fast point-to-point interconnects.
Multiple storage controllers.
Arrays of storage devices with high bandwidth.
Result:
Storage bandwidth is catching up with memory bandwidth.
Today, NUMA affinity is a problem for the entire system.
Figure: the anatomy of a modern server machine with two NUMA domains.
Motivation – Trends in Processor Technology and Applications
Figure (Growth of Cores): throughput (GB/s) of typical workloads in today’s data centres (Backend, Data Stores, OLTP), measured for 16 cores and projected to many cores.
Today:
Low device throughput.
High per-I/O cycle overhead.
Low server utilization in data centres.
In the future:
Many GB/s of device throughput.
Low per-I/O cycle overhead as system stacks improve.
High server utilization in data centres through server consolidation, etc.
Previous Work
Multiprocessors
Multicore Processors
The “Affinity” Space Exploration
The movement of data on any I/O access:
To/from the application buffer from/to the kernel buffer (memory copy).
To/from the kernel buffer from/to the I/O device (DMA transfer).
Application and kernel buffers could be located in different NUMA domains.
Kernel buffers and I/O devices could be located in different NUMA domains.
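To make the two movements concrete, the following minimal C sketch (illustration only; the file path is hypothetical) shows a plain buffered read() from the application’s point of view. The DMA transfer fills a kernel page-cache buffer whose NUMA placement the application does not control; the memory copy then moves the data into the application buffer passed to read(), whose placement the application does control.

/* Minimal sketch: one buffered read() involves both data movements. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/raid0/datafile", O_RDONLY);  /* hypothetical dataset file */
    if (fd < 0)
        return 1;

    char *app_buf = malloc(1 << 20);                 /* application buffer, 1 MB */
    if (!app_buf)
        return 1;

    /* Movement 1 (DMA transfer): the storage controller writes the blocks into
     * a kernel (page-cache) buffer; its NUMA domain is chosen by the kernel.
     * Movement 2 (memory copy): the kernel copies from that buffer into
     * app_buf.  If the two buffers sit in different NUMA domains, the copy
     * crosses the processor interconnect. */
    ssize_t n = read(fd, app_buf, 1 << 20);
    (void)n;

    free(app_buf);
    close(fd);
    return 0;
}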
The “Affinity” Space Exploration
The axes of the space are:
Transfer (TR) between I/O devices and kernel buffers.
Copy (CP) from the kernel buffer to the application buffer.

Transfer (TR)   Copy (CP)    Configuration
Local (L)       Local (L)    TRLCPL
Local (L)       Remote (R)   TRLCPR
Remote (R)      Local (L)    TRRCPL
Remote (R)      Remote (R)   TRRCPR
Example Scenarios
Figure: example data paths for the TRLCPR, TRRCPR, and TRLCPL configurations across the two NUMA domains.
NUMA Memory Allocation
How are application buffers allocated?
In the domain in which the process is executing.
The numactl utility allows pinning application processes and their memory buffers to a particular domain.
How are kernel buffers allocated?
Kernel buffer allocation cannot be controlled using numactl.
I/O buffers in the kernel are shared by all contexts performing I/O in the kernel.
How do we maximize the possibility of keeping kernel buffers in a particular socket?
Start each experiment with a clean buffer cache.
DRAM per socket is large compared to the application datasets.
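For illustration, the sketch below (assuming libnuma is installed; compile with -lnuma; node 0 and the buffer size are chosen arbitrarily) does in code what the experiments do externally with numactl, e.g. numactl --cpunodebind=0 --membind=0 <benchmark>: it pins the calling process to one NUMA domain and allocates its application I/O buffer there. Kernel buffers cannot be placed this way, which is why each experiment starts with a clean buffer cache.

/* Minimal sketch, assuming libnuma: pin the process and its application
 * buffer to NUMA node 0.  Node number and buffer size are arbitrary. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Run the process on the CPUs of node 0 ... */
    numa_run_on_node(0);

    /* ... and allocate the application I/O buffer from node 0's memory. */
    size_t len = 1 << 20;
    void *app_buf = numa_alloc_onnode(len, 0);
    if (app_buf == NULL)
        return 1;

    /* I/O issued through app_buf now keeps the copy side of each request
     * local to node 0; the DMA side still depends on where the kernel
     * buffer and the storage controller are attached. */

    numa_free(app_buf, len);
    return 0;
}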
Evaluation Methodology
Figure: bandwidth characterization of the evaluation testbed.
Total memory bandwidth: 24 GB/s (12 GB/s per domain); inter-processor interconnect: 24 GT/s.
Total storage throughput: 6 GB/s (2 storage controllers per domain = 3 GB/s).
SSD read throughput: 250 MB/s each (12 SSDs per domain = 3 GB/s).
Benchmarks, Applications and Datasets
Benchmarks and Applications
zmIO: an in-house microbenchmark.
fsmark: a filesystem stress benchmark.
stream: a streaming workload.
psearchy: a file indexing application (part of MOSBENCH).
IOR: application checkpointing.
Workload Configuration and Datasets
Four software RAID Level 0 devices, each on top of 6 SSDs and 1 storage controller.
One workload instance per RAID device.
Datasets consist of large files, with parameters that result in high concurrency, high read/write throughput, and stress on the system stack.
Evaluation Metrics
Problems with high-level application metrics:
Not possible to map to the actual volume of data transmitted.
Not possible to look at individual components of complex software stacks.
It is important to look at the individual components of execution time (user, system, idle, and iowait).
Cycles per I/O:
Physical cycles consumed by the application during the execution time, divided by the number of I/O operations.
Can be used as an efficiency metric.
Can be converted to energy per I/O.
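As a rough illustration of how this metric can be computed (hypothetical helper, not taken from the slides), cycles per I/O is the non-idle CPU time of the run converted to cycles, divided by the number of completed I/O operations (or sectors):

/* Minimal sketch (hypothetical helper): cycles per I/O from aggregate
 * non-idle CPU time and the number of I/O operations completed. */
#include <stdio.h>

/* busy_seconds: user + system CPU time summed over all cores
 * cpu_hz:       nominal core clock frequency
 * io_ops:       I/O operations (or sectors) completed during the run */
static double cycles_per_io(double busy_seconds, double cpu_hz, double io_ops)
{
    return (busy_seconds * cpu_hz) / io_ops;
}

int main(void)
{
    /* The numbers below are made up, purely for illustration. */
    printf("%.0f cycles per I/O\n", cycles_per_io(120.0, 2.0e9, 5.0e7));
    return 0;
}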
Results – mem_stress and zmIO
Remote memory accesses:
Memory throughput drops by one half.
The degradation starts from one instance of mem_stress.
Remote transfers:
Device throughput drops by one half.
The throughput is the same for one instance.
Contention is a possible culprit for two or more instances.
Figures: zmIO device throughput (GB/s; TRLCPL vs. TRRCPR) and mem_stress memory throughput (GB/s; local vs. remote) versus the number of instances.
Round-robin assignment of instances to RAID devices.
Results – fsmark and psearchy
fsmark is filesystem intensive:
Remote transfers result in 40% higher system time.
130% increase in iowait time.
psearchy is both filesystem and I/O intensive:
57% increase in system time.
70% increase in iowait time.
Figures: cycles per I/O sector (iowait and system components) for fsmark and psearchy across the TRLCPL, TRRCPR, TRRCPL, and TRLCPR configurations.
Results - IOR
IOR is both a read- and write-intensive benchmark.
15% decrease in read throughput due to remote transfers and memory copies.
20% decrease in write throughput due to remote transfers and copies.
Figure: IOR read and write throughput (MB/s) across the TRLCPL, TRRCPR, TRRCPL, and TRLCPR configurations.
Results - stream
The 24 SSDs are divided into two domains.
Each set of 12 SSDs is connected to two controllers.
Ideally, symmetric throughput is expected.
Remote transfers result in a 27% drop in the throughput of one of the sets.
Figure: stream throughput (MB/s) for SSD sets A and B across the TRLCPL, TRRCPR, TRRCPL, and TRLCPR configurations.