NUMA-Aware Thread Migration for High Performance NVMM File Systems



SLIDE 1

NUMA-Aware Thread Migration for High Performance NVMM File Systems

Ying Wang, Dejun Jiang, Jin Xiong Institute of Computing Technology, CAS University of Chinese Academy of Sciences

SLIDE 2

Outline

  • Background & Motivation
  • NThread design
    – Reduce remote access
    – Reduce resource contention
    – Increase CPU cache sharing
  • Evaluation
  • Summary


SLIDE 4

Background

  • Non-Volatile Main Memories (NVMMs) provide low-latency, high-bandwidth, byte-addressable, and persistent storage
    – PCM, MRAM, RRAM, 3D XPoint [1]
  • Intel has released Optane DC Persistent Memory (Optane PMM)
  • File systems can be built directly on memory
    – Improves file system I/O performance

[Diagram: NVMM attached to the CPU via the memory bus, with the file system issuing I/O directly to NVMM]

[1] What is Intel Optane DC Persistent Memory. Intel.
[2] Data from our evaluation and from "Basic Performance Measurements of the Intel Optane DC Persistent Memory Module"

  [2]          R lat.   W lat.   R BW       W BW
  DRAM         60 ns    69 ns    20 GB/s    ~15 GB/s
  Optane PMM   305 ns   81 ns    ~6 GB/s    ~2 GB/s
  NVMe SSD     120 us   30 us    2 GB/s     500 MB/s
  HDD          10 ms    10 ms    0.1 GB/s   0.1 GB/s

SLIDE 8

Background

  • The Non-Uniform Memory Access (NUMA) architecture is widely used in data centers [1,2,3,4,5,6,7]
    – Multiple NUMA (memory) nodes
      • Each memory node contains its own CPU and memory
      • Each node can run in parallel without interference

[1] Lepers, ATC'2015 [2] Dashti, ASPLOS'2013 [3] Blagodurov, ATC'2011 [4] Tam, EuroSys'2007 [5] Yu, CS'2017 [6] Calciu, ASPLOS'2017 [7] Blagodurov, ACM Trans'2010

[Diagram: Node 0 (CPU0 + NVMM on a memory bus) and Node 1 (CPU1 + NVMM) connected by a QPI link]

SLIDE 12

Background

  • NUMA nodes share hardware resources: memory (DRAM, NVMM) and CPU
  • The I/O performance of an NVMM file system is affected by these factors

[Diagram: an NVMM file system spanning Node 0 and Node 1 over a QPI link, showing local memory access, remote memory access, and NVMM access contention]

SLIDE 16

Motivation

  • Existing NVMM file systems are not aware of NUMA
    – Remote memory access
      • File location is transparent to threads
      • Threads are randomly scheduled by the OS
      • Remote NVMM accesses increase the read latency of an NVMM file system by 65.6%

[Diagram: a thread on Node 0 performing remote memory access to file data on Node 1 through the NVMM file system and QPI link]

SLIDE 19

Motivation

  • Existing NVMM file systems are not aware of NUMA
    – Resource contention
      • Random placement of data leads to unbalanced data access among NUMA nodes
      • NVMM access contention can increase file access latency by 120.5%

[Diagram: threads on both nodes contending for the NVMM of one node under the NVMM file system]

SLIDE 22

Existing works

  • For memory applications
    – Allocating memory on the memory node where the thread runs
      • Cannot solve the problem of NVMM contention
    – Migrating threads and thread data (such as stack and heap) [1,2,3,4]
      • Reduces remote access
      • Reduces resource contention caused by unbalanced use of resources
      • Incurs a lot of data migration overhead

[1] Matthias, SBAC-PAD'14 [2] Lachaize, ATC'12 [3] Wu, Cluster'19 [4] Xu, ASPLOS'19

SLIDE 27

High data migration overhead on NVMM FS

  • NVMM has longer latency and lower bandwidth than DRAM
    – Migrating 16 KB of data takes 2.8x as long in NVMM as in DRAM
  • File systems must maintain consistency
    – Additional overhead, such as logging or journaling
  • File data is shared between threads
    – Difficult to decide which node to migrate data to
  • NVMM has low write endurance
    – Data migration reduces the lifetime of NVMM

SLIDE 32

Contribution

  • NThread: NUMA-aware thread migration for NVMM FS
    – Reduces remote access
    – Reduces resource contention
      • CPU
      • NVMM
    – Increases CPU cache sharing between threads
    – Transparent to applications

[Diagram: NThread sits between the application and the NVMM file system, which spans Node 0 and Node 1 over a QPI link]


SLIDE 37

Reduce remote access

  • How to reduce remote access
    – Write
      • Allocate new space to perform write operations
      • Write data on the node where the thread is running
    – Read
      • Count the amount each thread reads from each node
      • Migrate threads to the node holding the most data they read
  • How to avoid ping-pong migration
    – Migrate only when a thread's reads from one node exceed its reads from every other node by a threshold per period (such as 200 MB per second)

[Diagram: thread T1 reads 100 MB from Node 0 and 300 MB from Node 1, so T1 is migrated to Node 1]
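The read-counting rule above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the `ReadTracker` class and `MIGRATE_THRESHOLD` name are assumptions, with the 200 MB/period threshold taken from the slide's example.

```python
from collections import defaultdict

MIGRATE_THRESHOLD = 200  # MB per period, as in the slide's example


class ReadTracker:
    """Per-thread read counters per NUMA node (hypothetical bookkeeping)."""

    def __init__(self, home_node):
        self.home_node = home_node
        self.read_mb = defaultdict(int)  # node id -> MB read this period

    def record_read(self, node, mb):
        self.read_mb[node] += mb

    def decide_target(self):
        """Return the node to migrate to, or None to stay put."""
        if not self.read_mb:
            return None
        best = max(self.read_mb, key=self.read_mb.get)
        if best == self.home_node:
            return None
        # Hysteresis: migrate only if the best node leads every other
        # node by the threshold, which prevents ping-pong migration.
        others = [mb for n, mb in self.read_mb.items() if n != best]
        gap = self.read_mb[best] - max(others, default=0)
        return best if gap >= MIGRATE_THRESHOLD else None

    def end_period(self):
        self.read_mb.clear()  # counters are accumulated per period
```

With the slide's numbers (100 MB from Node 0, 300 MB from Node 1), the gap is exactly 200 MB and the thread migrates to Node 1; a smaller gap keeps it in place.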


SLIDE 39

Reduce resource contention

  • Problems
    – How to find contention
    – How to reduce contention
    – How to avoid new contention

[Diagram: NVMM access contention across Node 0 and Node 1 in the NVMM file system]

SLIDE 44

Reduce NVMM contention

  • How to find contention
    – The NVMM access amount of a node exceeds a threshold while every other node's usage is less than 1/2 of that node's
    – How to define access amount?
      • Bandwidth
        – Compare the theoretical bandwidth of NVMM with its running bandwidth
        – Bandwidth = read bandwidth + write bandwidth
      • However
        – The write bandwidth of NVMM is about 1/3 of the read bandwidth
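The detection rule above (a node over a threshold while every other node sits below half of its load) can be sketched like this. A minimal sketch under assumed names; the function and its signature are illustrative, not from the paper.

```python
def find_contended_node(load_by_node, threshold):
    """Return the id of a contended node, or None.

    load_by_node: node id -> NVMM access load (e.g., bandwidth in GB/s).
    A node is contended when its load exceeds `threshold` AND every
    other node carries less than half of that node's load.
    """
    for node, load in load_by_node.items():
        if load <= threshold:
            continue  # not loaded enough to matter
        others = [l for n, l in load_by_node.items() if n != node]
        if all(l < load / 2 for l in others):
            return node
    return None
```

For example, with loads `{0: 4, 1: 1}` and a threshold of 3, Node 0 is flagged; with `{0: 4, 1: 3}` the load is balanced enough that nothing is flagged.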

SLIDE 49

Reduce NVMM contention

  • How to find contention
    – Bandwidth
      • It is inaccurate to measure NVMM access as the plain sum of read and write bandwidth
        – Read 1 GB/s + write 1 GB/s = 2 GB/s → low contention
        – Read 0 GB/s + write 2 GB/s = 2 GB/s → high contention
      • Solution
        – Weight read and write bandwidth differently
          » BW_N = BW_rN * 1/3 + BW_wN (refer to paper)

[Diagram: both mixes total 2 GB/s, but the write-only mix causes far higher contention]
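The weighting can be written out directly: since NVMM write bandwidth is roughly 1/3 of read bandwidth, a written byte costs about three times a read byte, so reads are down-weighted by 1/3. A one-line sketch of the slide's formula (the function name is ours):

```python
def weighted_bw(read_gbps, write_gbps):
    """BW_N = BW_rN * 1/3 + BW_wN from the slide."""
    return read_gbps / 3 + write_gbps

# The two 2 GB/s mixes from the slide no longer score the same:
#   read 1 + write 1 -> ~1.33 (low contention)
#   read 0 + write 2 ->  2.0  (high contention)
```

The metric ranks the write-only mix as more contended even though both total 2 GB/s, matching the slide's observation.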

SLIDE 54

Reduce NVMM contention

  • How to reduce contention
    – Access contention comes from both reads and writes
      • Read
        – Data location is fixed
      • Write
        – The node where data is written can be specified
        – But long remote write latency reduces performance by 65.5%

[Diagram: thread T1 on Node 1 performing a remote write to Node 0]

SLIDE 58

Reduce NVMM contention

  • How to reduce contention
    – Migrate threads with high write ratios to nodes with low access pressure
      • Reduces remote writes
      • Reduces NVMM contention

[Diagram: before migration, Node 0 serves all accesses (access 4, Node 1 at 0) from T1 (W:90%), T2 (W:70%), T3 (W:20%), T4 (W:10%); after migrating the write-heavy T1 and T2 to Node 1, access is balanced at 2.4 vs. 1.6, leaving 0.4 of remote read]
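Choosing which threads to move follows from the rule above: write-heavy threads benefit most from migration, because new writes follow the thread while already-placed read data stays where it is. A small illustrative sketch (the function is ours, not the paper's code):

```python
def pick_migration_candidates(threads, k):
    """Pick the k most write-heavy threads on a contended node.

    threads: list of (thread_id, write_ratio) pairs, e.g. ("T1", 0.9).
    Write-heavy threads are preferred because their future writes are
    redirected to the target node, relieving the contended node.
    """
    ranked = sorted(threads, key=lambda t: t[1], reverse=True)
    return [tid for tid, _ in ranked[:k]]
```

With the slide's threads, `pick_migration_candidates([("T1", 0.9), ("T2", 0.7), ("T3", 0.2), ("T4", 0.1)], 2)` selects T1 and T2, matching the figure.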

SLIDE 65

Reduce NVMM contention

  • How to avoid new contention
    – Migrating too many threads to a low-contention node simply moves the contention there
    – Determine the number of threads to migrate according to the current bandwidth of each node

[Diagram: Node 0 with access 4 and Node 1 with access 3, running threads T1-T7 with write ratios from 10% to 90%; migration is limited so each node approaches the average access of 3.5]
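One way to cap the number of migrated threads, consistent with the slide's "average access" figure, is to stop as soon as moving another thread would push the source node below the cross-node average. This is our guess at the policy, not the paper's algorithm; all names are illustrative.

```python
def plan_migrations(source_load, target_load, thread_loads):
    """Return how many threads to migrate from source to target.

    thread_loads: per-thread access load on the source node, already
    ordered by migration preference (e.g., descending write ratio).
    Threads are moved only while the source stays at or above the
    average load, so the target is not pushed into new contention.
    """
    average = (source_load + target_load) / 2
    moved = 0
    for load in thread_loads:
        if source_load - load < average:
            break  # moving one more would overshoot below the average
        source_load -= load
        target_load += load
        moved += 1
    return moved
```

With a fully unbalanced pair (source 4, target 0, four threads of load 1 each), two threads move and both nodes end at load 2; with the slide's 4-vs-3 case (average 3.5), no whole thread can move without overshooting.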

SLIDE 69

Reduce CPU contention

  • How to find contention
    – The CPU utilization of a node exceeds 90% and is 2x that of other nodes
  • How to reduce contention
    – Migrate threads from the NUMA node with high CPU utilization to other nodes with low CPU utilization
  • How to avoid new contention
    – Migrate a thread only if the thread's CPU utilization plus the target node's does not exceed 90%
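The three CPU rules above can be sketched directly; the thresholds (90%, 2x) come from the slide, while the function names and signatures are our own illustration.

```python
CPU_LIMIT = 0.9  # 90% utilization threshold from the slide


def cpu_contended(util_by_node, node):
    """A node is CPU-contended when its utilization exceeds 90%
    and is at least 2x that of every other node."""
    util = util_by_node[node]
    others = [u for n, u in util_by_node.items() if n != node]
    return util > CPU_LIMIT and all(util >= 2 * u for u in others)


def can_accept(target_util, thread_util):
    """Avoid creating new contention: migrate only if the target node's
    utilization plus the thread's stays at or below 90%."""
    return target_util + thread_util <= CPU_LIMIT
```

For example, a node at 95% next to one at 40% is contended, but the same node next to one at 60% is not (95% is under 2x of 60%).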


SLIDE 75

Increase CPU cache sharing

  • How to find threads that share data
    – Once a file is accessed by multiple threads, all threads accessing the file are treated as sharing data
  • How to increase CPU cache sharing
    – Reducing remote memory access: threads that access the same data end up on the same node and share its CPU cache
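The sharing rule above amounts to simple bookkeeping: any threads that touch the same file form a sharing group. A hypothetical sketch of that bookkeeping (the `SharingMap` class is our illustration, not the paper's data structure):

```python
from collections import defaultdict


class SharingMap:
    """Track which threads access which files; threads touching the
    same file are treated as data-sharing and should not be split
    across nodes, so they can keep sharing the CPU cache."""

    def __init__(self):
        self.readers = defaultdict(set)  # file id -> set of thread ids

    def record_access(self, file_id, thread_id):
        self.readers[file_id].add(thread_id)

    def sharers(self, thread_id):
        """All threads that share at least one file with thread_id."""
        out = set()
        for threads in self.readers.values():
            if thread_id in threads:
                out |= threads
        out.discard(thread_id)
        return out
```

A migration policy can then consult `sharers(tid)` and skip (or co-migrate) threads whose group would otherwise be split.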

SLIDE 79

Composing Optimizations together

  • Remote access, resource contention, and CPU cache sharing interact
    – Reducing remote access can increase CPU cache sharing
      • Threads accessing the same data run on the same node and share its CPU cache
    – Reducing resource contention may increase remote memory access and destroy CPU cache sharing
    – Reducing NVMM contention may increase CPU contention

[Diagram: the NVMM file system spanning Node 0 and Node 1 over a QPI link]

SLIDE 86

Composing Optimizations together

  • What-if analysis
    – Get information each second
      • Data access size, NVMM bandwidth, CPU utilization, and data sharing
    – Decide the initial target node
      • Reduce remote memory access
    – Decide the final target node
      • Reduce NVMM and CPU contention
      • NVMM takes priority over CPU (refer to paper)
      • Avoid migrating data-sharing threads
    – Migrate threads

[Flowchart: (1) get information → (2) decide initial target node to reduce remote access → (3) decide final target node, checking NVMM contention before CPU contention and avoiding the migration of data-sharing threads → (4) migrate threads]
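The per-second pipeline above can be sketched as one decision function per thread. This is a simplified illustration under assumed inputs and thresholds (the 0.8/0.9 cutoffs and the flat dictionaries are ours); the paper's actual what-if analysis is more involved.

```python
def what_if_step(read_mb_by_node, nvmm_load, cpu_util, shares_data,
                 nvmm_threshold=0.8, cpu_threshold=0.9):
    """Return the target node for one thread, or None to leave it alone.

    read_mb_by_node: node -> MB this thread read there last period.
    nvmm_load, cpu_util: node -> load fraction (illustrative inputs).
    shares_data: whether the thread belongs to a data-sharing group.
    """
    if shares_data:
        return None  # avoid splitting cache-sharing thread groups
    # Step 2: initial target minimizes remote access.
    target = max(read_mb_by_node, key=read_mb_by_node.get)
    # Step 3: revise for contention, NVMM before CPU.
    if nvmm_load[target] > nvmm_threshold:
        target = min(nvmm_load, key=nvmm_load.get)
    elif cpu_util[target] > cpu_threshold:
        target = min(cpu_util, key=cpu_util.get)
    return target  # step 4: caller performs the migration
```

For a thread reading mostly from Node 1, the initial pick is Node 1; if Node 1's NVMM is overloaded, the pick falls back to the least-loaded node instead.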


SLIDE 88

Evaluation

  • Platform
    – Two NUMA nodes
      • Intel Xeon 5214 CPU, 10 CPU cores per node
      • 64 GB DRAM, 128 GB Optane PMM
    – Four NUMA nodes
      • Intel Xeon 5214 CPU, 10 CPU cores per node
      • 4 GB DRAM, 12 GB emulated PMM
  • Compared systems
    – Existing FS: Ext4-dax, PMFS, NOVA
    – Modified FS: NOVA_n (a NOVA-based FS with multi-node support)

SLIDE 91

Micro-benchmark: fio

  • NThread_rl: reduce remote access only
    – Bandwidth is increased by 26.9% when the read ratio is 40%
  • NThread: reduce remote access, avoid contention, and increase CPU cache sharing
    – Bandwidth is increased by an average of 43.8%

[Chart: bandwidth (GB/s, 0-2.0) vs. read ratio (20%-80%) for ext4-dax, PMFS, NOVA, NOVA_N, NThread_rl, and NThread]

SLIDE 92

Application: RocksDB

  • NThread increases throughput by 88.6% on average when RocksDB runs on the NVMM file system

[Charts: RocksDB throughput (K ops/s) for PUT, GET, and MIX workloads on two NUMA nodes and four NUMA nodes, comparing ext4-dax, PMFS, NOVA, NOVA_n, and NThread]


SLIDE 102

Summary

  • The features of NVMM enable file systems to be built on the memory bus, improving file system performance
  • NUMA brings remote access and resource contention to NVMM FS
  • NThread is a NUMA-aware thread migration scheme
    – Migrates threads according to data access amount to reduce remote access
    – Reduces resource contention and avoids introducing new contention
    – Avoids migrating data-sharing threads to increase CPU cache sharing
    – Applies what-if analysis to decide the execution order of these optimizations
    – Increases application throughput by 88.6% on average

SLIDE 103

Thanks

Author email: wangying01@ict.ac.cn