NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement


SLIDE 1

NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement

Taeuk Kim, Awais Khan, Youngjae Kim, Sungyong Park, Scott Atchley

PDSW-DISCS '17 WIP session, November 13, 2017, Denver, USA

LABORATORY FOR ADVANCED SYSTEM SOFTWARE SOGANG UNIVERSITY

SLIDE 2

Need for Data Coupling over ESnet

  • Data coupling across HPC facilities
  • Nuclear interaction datasets generated at NERSC are needed at the OLCF for petascale simulation
  • Climate simulations run at ALCF and OLCF are validated with BER datasets at ORNL data centers

Coupling Data: Example: Moving Large Data Sets

SLIDE 3

Terabits Network Environment

[Figure: two data movers connected across a WAN over IB or Ethernet. Transport labels: Verbs, UDP, WAN or LinuxEthDirect, UDP, LAN or Myricom. Data sources: streaming neutron experiment data, simulation data.]

But data sets are stored on slow storage systems!

  • Terabit network improvements alone only increase the network transfer rate.

SLIDE 4

LADS: Layout-Aware Data Scheduling [FAST’15]

LADS solved the impedance mismatch problem between the faster network and slower storage system!

[Figure: the same data mover topology as on Slide 3: two data movers connected across a WAN over IB or Ethernet, moving streaming neutron experiment data and simulation data.]

  • LADS offers an end-to-end data transfer optimization.

SLIDE 5

What Memory Bottleneck Occurs in LADS?

[Figure: two DTNs connected across a WAN over IB or Ethernet, with simulation data as the payload. Inside a DTN: CPU socket 0 and CPU socket 1 linked by QPI, and NUMA nodes 0 to 3, each with its own DRAM and cores. Legend: scheduler, communicator, I/O thread, RMA buffer.]

SLIDE 6

Architectural Overview for LADS

  • NUMA-based DTN Architecture in Source and Sink

[Figure: a DTN with two CPU sockets linked by QPI, four NUMA nodes (0 to 3) with local DRAM, and a single RMA buffer.]

SLIDE 7

Memory Bottleneck with Single RMA Buffer

[Figure: a DTN with two CPU sockets linked by QPI, four NUMA nodes (0 to 3) with local DRAM, and a single RMA buffer.]

  • NUMA-based DTN Architecture in Source and Sink
SLIDE 8

Memory Bottleneck with Single RMA Buffer

[Figure: a DTN with two CPU sockets linked by QPI, four NUMA nodes (0 to 3) with local DRAM, and a single RMA buffer.]

Remote memory accesses!

CPU socket 1 accesses the RMA buffer hosted by CPU socket 0.

  • NUMA-based DTN Architecture in Source and Sink
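
The remote-access problem with a single RMA buffer can be illustrated with a toy model. This is a minimal sketch, not the LADS implementation: the node-to-socket mapping and the helper name `remote_fraction` are assumptions made for illustration, and the model only counts which threads sit on the other socket.

```python
# Toy model of a DTN with 2 CPU sockets and 4 NUMA nodes.
# The node-to-socket mapping below is an assumption made for illustration.
NODE_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}


def remote_fraction(thread_nodes, buffer_socket):
    """Return the fraction of threads whose socket differs from the buffer's socket."""
    if not thread_nodes:
        return 0.0
    remote = sum(1 for node in thread_nodes if NODE_TO_SOCKET[node] != buffer_socket)
    return remote / len(thread_nodes)


# Eight I/O threads spread evenly over the four NUMA nodes,
# with the single RMA buffer hosted by socket 0.
threads = [i % 4 for i in range(8)]
print(remote_fraction(threads, buffer_socket=0))  # half of the threads must cross QPI
```

Under this assumed even spread, every thread running on socket 1 reaches the buffer through QPI, which is the situation the slide highlights.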
SLIDE 9

Multiple RMA Buffers

  • Distributing the RMA buffer across all CPU sockets
  • To reduce accesses to the remote socket's memory
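
A minimal sketch of the per-socket buffer idea, assuming two NUMA nodes per socket. The names `make_rma_buffers` and `local_buffer` are hypothetical, and a Python `bytearray` only stands in for registered RMA memory: it does not control physical placement, which in a real system would need a NUMA-aware allocator (for example libnuma's `numa_alloc_onnode` in C).

```python
NODE_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}  # assumed layout: two NUMA nodes per socket


def make_rma_buffers(sockets, size_bytes):
    """Create one buffer per socket (a bytearray stands in for registered RMA memory)."""
    return {socket: bytearray(size_bytes) for socket in sockets}


def local_buffer(buffers, node):
    """Pick the buffer that belongs to the same socket as the given NUMA node."""
    return buffers[NODE_TO_SOCKET[node]]


buffers = make_rma_buffers(sockets=(0, 1), size_bytes=1 << 20)
# A thread on NUMA node 3 (socket 1) is handed socket 1's buffer, not socket 0's.
assert local_buffer(buffers, 3) is buffers[1]
```

Keeping the lookup keyed by socket rather than by thread means a thread can be moved between NUMA nodes of the same socket without changing which buffer it uses.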

[Figure: a DTN with two CPU sockets linked by QPI and four NUMA nodes (0 to 3); each socket hosts its own RMA buffer in local DRAM.]

SLIDE 10

Multiple RMA Buffers

[Figure: a DTN with two CPU sockets linked by QPI and four NUMA nodes (0 to 3); each socket hosts its own RMA buffer in local DRAM.]

The likelihood of accessing the remote socket's memory is reduced!

SLIDE 11

Memory-aware Thread Scheduling (MTS)

  • Binding every thread to the RMA buffer in its own socket
  • Load balancing among the NUMA nodes within each socket
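
The two MTS rules can be sketched as a small placement function. This is an illustration under stated assumptions, not the authors' scheduler: the socket-to-node layout, the alternating socket choice, and the least-loaded tie-break are all assumptions, and `schedule_threads` is a hypothetical name.

```python
from collections import Counter

SOCKET_TO_NODES = {0: [0, 1], 1: [2, 3]}  # assumed layout: two NUMA nodes per socket


def schedule_threads(num_threads):
    """Assign each thread a (socket, node) pair.

    Threads alternate between sockets, and within a socket each new thread
    goes to the NUMA node that currently holds the fewest threads.
    The socket in the pair is also the socket whose RMA buffer the thread uses.
    """
    load = Counter()
    sockets = sorted(SOCKET_TO_NODES)
    placement = []
    for tid in range(num_threads):
        socket = sockets[tid % len(sockets)]
        node = min(SOCKET_TO_NODES[socket], key=lambda n: (load[n], n))
        load[node] += 1
        placement.append((socket, node))
    return placement


for tid, (socket, node) in enumerate(schedule_threads(8)):
    print(f"thread {tid}: NUMA node {node}, RMA buffer on socket {socket}")
```

Because a thread's node is always chosen from the node list of its buffer's socket, every buffer access in this model stays within one socket, while the least-loaded rule keeps the per-node thread counts even.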

[Figure: a DTN with two CPU sockets linked by QPI and four NUMA nodes (0 to 3); each socket hosts its own RMA buffer in local DRAM.]

SLIDE 12

Memory-aware Thread Scheduling (MTS)

[Figure: a DTN with two CPU sockets linked by QPI and four NUMA nodes (0 to 3); each socket hosts its own RMA buffer in local DRAM.]

Local Memory Accesses & Load Balancing
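
Keeping accesses local requires that a thread actually runs on the CPUs of its assigned NUMA node. The sketch below shows one way this could be done on Linux; it is an assumption about mechanism, not something the slides describe. It reads the node's CPU list from sysfs and uses `os.sched_setaffinity`; the helper names are hypothetical, and the parser handles only plain ids and ranges such as `0-5,12-17`.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-5,12-17" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_of_node(node):
    """Read the CPUs that belong to a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def pin_current_thread(node):
    """Restrict the caller to the CPUs of the given NUMA node (Linux only).

    On Linux, pid 0 in sched_setaffinity refers to the calling thread.
    """
    os.sched_setaffinity(0, cpus_of_node(node))
```

A worker would call `pin_current_thread(node)` once at start-up, using the node chosen by the scheduler, before touching its socket-local RMA buffer.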

SLIDE 13

Test-bed Configuration

  • Data Transfer Nodes (DTNs)
    • 2 CPU sockets, 4 NUMA nodes, 24 cores
    • 128 GB memory
    • InfiniBand EDR (100 Gb/s)
  • Workloads
    • 8 x 3 GB files (big file workload)
    • 24,000 x 1 MB files (small file workload)
  • Storage Systems
    • We used the memory file system (tmpfs) to eliminate storage bottlenecks.

[Figure: a source DTN and a sink DTN, each backed by a memory file system (tmpfs), connected by IB QDR (40 Gb/s).]
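
As a quick sanity check on the workload sizes and link rates listed above, the arithmetic can be written out as follows. Decimal units (1 GB = 1000 MB) are an assumption, since the slides do not say whether GB or GiB is meant, and the conversion gives nominal signaling rates only, ignoring encoding and protocol overhead.

```python
GB = 10**9  # decimal units are an assumption; the slides do not say GB vs GiB
MB = 10**6

big_total = 8 * 3 * GB          # 8 files of 3 GB each
small_total = 24_000 * 1 * MB   # 24,000 files of 1 MB each


def link_rate_mb_per_s(gbits_per_s):
    """Convert a nominal link rate in Gb/s to MB/s (8 bits per byte, decimal units)."""
    return gbits_per_s * 1000 / 8


print(big_total / GB, small_total / GB)  # total size of each workload in GB
print(link_rate_mb_per_s(40))            # nominal IB QDR rate in MB/s
print(link_rate_mb_per_s(100))           # nominal IB EDR rate in MB/s
```

Under these assumptions both workloads move the same total volume, so they differ only in file count and file size.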

SLIDE 14

Evaluation

[Figure: throughput (MB/s, axis from 500 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64) for the Baseline (single RMA buffer).]

SLIDE 15

Evaluation

[Figure: throughput (MB/s, axis from 500 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64), comparing the Baseline (single RMA buffer) with MTS (NUMA-aware scheduling).]

SLIDE 16

Evaluation

[Figure: throughput (MB/s, axis from 500 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64), comparing the Baseline (single RMA buffer) with MTS (NUMA-aware scheduling).]

Throughput increased by an average of 24.3%!
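
The slide does not say how the average improvement was computed. One plausible reading, sketched below as an assumption, is the mean of the per-thread-count relative gains of MTS over the baseline. The function name is hypothetical and the sample numbers are made up purely to exercise it; they are not the measured results.

```python
def mean_gain_percent(baseline, improved):
    """Average the per-configuration throughput gains, in percent."""
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("need two non-empty series of equal length")
    gains = [(new - old) / old * 100.0 for old, new in zip(baseline, improved)]
    return sum(gains) / len(gains)


# Made-up numbers purely to exercise the function; they are NOT the measured results.
toy_baseline = [100.0, 200.0]
toy_mts = [110.0, 260.0]
print(mean_gain_percent(toy_baseline, toy_mts))  # mean of a 10% gain and a 30% gain
```

Averaging the per-point gains weights every thread count equally, whereas comparing the two mean throughputs would favor the configurations with the highest absolute rates, so the two readings can give different figures.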

SLIDE 17

Q&A

Contact:

Taeuk Kim (taugi323@sogang.ac.kr)
Department of Computer Science and Engineering
Sogang University, Seoul, Republic of Korea
