NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement


Taeuk Kim, Awais Khan, Youngjae Kim, Sungyong Park, Scott Atchley. PDSW-DISCS 17 WIP session, November 13, 2017, Denver, USA. Laboratory for Advanced System Software, Sogang University.

SLIDE 1

NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement

Taeuk Kim, Awais Khan, Youngjae Kim, Sungyong Park, Scott Atchley
PDSW-DISCS 17 WIP session, November 13, 2017, Denver, USA

LABORATORY FOR ADVANCED SYSTEM SOFTWARE SOGANG UNIVERSITY

SLIDE 2

Need for Data Coupling over ESnet

Data Coupling across HPC facilities
Nuclear interaction datasets generated at NERSC needed at the OLCF for Peta-scale simulation
Climate simulations run at ALCF and OLCF validated with BER datasets at ORNL data centers

Coupling Data: Example: Moving Large Data Sets


SLIDE 3

Terabits Network Environment

Figure: two Data Movers connected over a WAN via IB or Ethernet. Transport labels: "Verbs UDP WAN or LinuxEthDirect UDP LAN or Myricom". Data shown: Streaming Neutron Experiment Data, Simulation Data.

But data sets are stored on slow storage systems!
Terabit network improvements contributed only to the network transfer rate.


SLIDE 4

LADS: Layout-Aware Data Scheduling [FAST’15]

LADS solved the impedance mismatch problem between the faster network and slower storage system!

Figure: the Data Mover diagram from slide 3 (WAN, IB, Ethernet, two Data Movers, Streaming Neutron Experiment Data, Simulation Data).

LADS offers end-to-end data transfer optimization.


SLIDE 5

What Memory Bottleneck Occurs in LADS?

Figure: DTNs connected over a WAN (IB, Ethernet), with a DTN detail showing CPU socket 0 and CPU socket 1 joined by QPI; NUMA nodes 0 to 3 with DRAM and cores (labeled "Cores in CPU0" to "Cores in CPU3"); and the Scheduler, Communicator, I/O thread, RMA buffer, and Simulation Data.

SLIDE 6

Architectural Overview for LADS

NUMA-based DTN Architecture in Source and Sink

Figure: CPU socket 0 and CPU socket 1 connected by QPI, NUMA nodes 0 to 3 with DRAM, and a single RMA buffer.

SLIDE 7

Memory Bottleneck with Single RMA Buffer

Figure: CPU socket 0 and CPU socket 1 connected by QPI, NUMA nodes 0 to 3 with DRAM, and a single RMA buffer.

NUMA-based DTN Architecture in Source and Sink

SLIDE 8

Memory Bottleneck with Single RMA Buffer

Figure: CPU socket 0 and CPU socket 1 connected by QPI, NUMA nodes 0 to 3 with DRAM, and a single RMA buffer.

Remote Memory Accesses!!!

CPU Socket 1 accessing RMA Buffer hosted by CPU Socket 0

NUMA-based DTN Architecture in Source and Sink

SLIDE 9

Multiple RMA Buffers

Distribute the RMA buffer across all CPU sockets
to reduce accesses to the remote socket’s memory
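
The effect of distributing the buffer can be illustrated with a small model. This is a minimal sketch, not the LADS implementation: the node-to-socket mapping (NUMA nodes 0 and 1 on socket 0, nodes 2 and 3 on socket 1), the thread placement, and the names `NODE_TO_SOCKET` and `threads_without_local_buffer` are assumptions made for illustration. It counts how many threads have no RMA buffer on their own socket and therefore must always cross QPI.

```python
# Assumed topology (not stated on the slides):
# NUMA nodes 0-1 belong to socket 0, NUMA nodes 2-3 belong to socket 1.
NODE_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}


def threads_without_local_buffer(thread_nodes, buffer_sockets):
    """Count threads whose socket hosts no RMA buffer (every access is remote)."""
    return sum(
        1 for node in thread_nodes if NODE_TO_SOCKET[node] not in buffer_sockets
    )


# Hypothetical placement: eight I/O threads spread evenly over the four nodes.
thread_nodes = [0, 1, 2, 3, 0, 1, 2, 3]

# One buffer hosted by socket 0 only.
single_buffer = threads_without_local_buffer(thread_nodes, {0})
# One buffer hosted by each socket.
per_socket = threads_without_local_buffer(thread_nodes, {0, 1})

print(single_buffer, per_socket)  # prints "4 0"
```

Having a local buffer available does not by itself guarantee that a thread uses it; that depends on how threads are bound to buffers.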

Figure: CPU socket 0 and CPU socket 1 connected by QPI, NUMA nodes 0 to 3 with DRAM, and two RMA buffers.

SLIDE 10

Multiple RMA Buffers

Figure: CPU socket 0 and CPU socket 1 connected by QPI, NUMA nodes 0 to 3 with DRAM, and two RMA buffers.

The chance of accessing the remote socket’s memory is reduced!

SLIDE 11

Memory-aware Thread Scheduling (MTS)

Binding all threads to in-socket RMA buffer
Load balancing among in-socket NUMA nodes
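
The two rules above can be sketched as a placement function. This is an illustrative sketch under stated assumptions, not the authors' scheduler: the socket-to-node mapping and the names `SOCKET_TO_NODES` and `place_threads` are hypothetical, and load is measured simply as the number of threads already placed on a node. Each thread is described by the socket of the RMA buffer it works on, and it is placed on the least-loaded NUMA node of that same socket.

```python
from collections import Counter

# Assumed topology (not stated on the slides):
# socket 0 holds NUMA nodes 0-1, socket 1 holds NUMA nodes 2-3.
SOCKET_TO_NODES = {0: [0, 1], 1: [2, 3]}


def place_threads(buffer_sockets):
    """Return one NUMA node per thread.

    buffer_sockets[i] is the socket hosting the RMA buffer used by thread i.
    Rule 1: the thread stays on that socket (local memory access).
    Rule 2: among that socket's nodes, pick the one with the fewest threads.
    """
    load = Counter()  # threads placed so far on each NUMA node
    placement = []
    for socket in buffer_sockets:
        # Ties are broken by the lower node number.
        node = min(SOCKET_TO_NODES[socket], key=lambda n: (load[n], n))
        load[node] += 1
        placement.append(node)
    return placement


# Hypothetical input: four threads use socket 0's buffer, two use socket 1's.
placement = place_threads([0, 0, 0, 1, 1, 0])
print(placement)  # prints "[0, 1, 0, 2, 3, 1]"
```

On Linux, a thread could then be pinned to the CPUs of its chosen node with `os.sched_setaffinity`; the CPU list for each node depends on the machine and is left out here.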

Figure: CPU socket 0 and CPU socket 1 connected by QPI, NUMA nodes 0 to 3 with DRAM, and two RMA buffers.

SLIDE 12

Memory-aware Thread Scheduling (MTS)

Figure: CPU socket 0 and CPU socket 1 connected by QPI, NUMA nodes 0 to 3 with DRAM, and two RMA buffers.

Local Memory Accesses & Load Balancing

SLIDE 13

Test-bed Configuration

Data Transfer Nodes (DTNs)
2 CPU sockets, 4 NUMA nodes, 24 cores
128 GB memory
InfiniBand EDR (100 Gb/s)

Figure: DTN source and DTN sink connected by IB QDR (40Gb/s), each labeled Memory File System (tmpfs).

Workloads
8 x 3 GB files (big file workload)
24,000 x 1 MB files (small file workload)
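
For reference, both workloads move the same total volume, assuming decimal units (1 GB = 1000 MB); the slide does not say which convention was used:

```latex
\begin{align*}
8 \times 3\,\text{GB} &= 24\,\text{GB} \\
24{,}000 \times 1\,\text{MB} &= 24{,}000\,\text{MB} = 24\,\text{GB}
\end{align*}
```

Under that assumption the two workloads differ in file count and file size, not in the amount of data transferred.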

Storage Systems
We used the memory file system (tmpfs) to eliminate storage bottlenecks.

SLIDE 14

Evaluation

Figure: throughput (MB/s, axis ticks 500 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64). Series: Baseline. Label: Single RMA Buffer.

SLIDE 15

Evaluation

Figure: throughput (MB/s, axis ticks 500 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64). Series: Baseline, MTS. Labels: Single RMA Buffer, NUMA-aware Scheduling.

SLIDE 16

Evaluation

Figure: throughput (MB/s, axis ticks 500 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64). Series: Baseline, MTS. Labels: Single RMA Buffer, NUMA-aware Scheduling.

Throughput increased by an average of 24.3%!
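
The slide does not state how the average was computed. One plausible reading, given here only as an assumption, is the mean relative gain over the six thread counts on the x-axis, where $T_{\text{base}}(n)$ and $T_{\text{MTS}}(n)$ (symbols introduced here) denote the measured throughput with $n$ I/O threads:

```latex
\[
\text{average gain}
  = \frac{1}{|N|} \sum_{n \in N}
    \frac{T_{\text{MTS}}(n) - T_{\text{base}}(n)}{T_{\text{base}}(n)},
\qquad N = \{2, 4, 8, 16, 32, 64\}
\]
```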

SLIDE 17

Q&A

Contact:

Taeuk Kim (taugi323@sogang.ac.kr)
Department of Computer Science and Engineering
Sogang University, Seoul, Republic of Korea