SLIDE 1

Parallel Programming and Heterogeneous Computing

Non-Uniform Memory Access

Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group

SLIDE 2

Non-Uniform Memory Access Context: Uniform Memory Access Machines

Felix Eberhardt, ParProg 2019, Non-Uniform Memory Access

[Figure: four sockets (Socket0..Socket3), each with four cores, accessing main memory through a shared interconnect and a single memory controller]

Multiple sockets access main memory through a shared interconnect. Latency and bandwidth characteristics are equal for any pair of socket and memory location.


SLIDE 7

Non-Uniform Memory Access Context: Uniform Memory Access Machines

[Figure: the same four-socket UMA system, now with contention at the shared memory controller]

Multiple sockets access main memory through a shared interconnect.

Problem: sockets contend for memory bandwidth. Full utilization of the memory controller link means only 1/4 utilization of each socket link (or 1/n utilization for n sockets).

SLIDE 8

Non-Uniform Memory Access Context: Scalability

Parallelism for…

Speedup: compute faster
Throughput: compute more in the same time
Scalability: compute faster / more with additional resources
Price/performance: be as fast as possible for given money
Scavenging: compute faster / more with idle resources

SLIDE 9

Non-Uniform Memory Access Context: Scalability

Two basic approaches to scaling computing hardware:

Scale-Up: combine more resources (memory or cores) in a tightly coupled system
⇒ The user perceives a single large shared-memory system

SLIDE 10

Non-Uniform Memory Access Context: Scalability

Two basic approaches to scaling computing hardware:

Scale-Out: connect more machines in a loosely coupled network
⇒ The user perceives multiple communicating machines in a shared-nothing system

SLIDE 11

Non-Uniform Memory Access Context: Scalability

Recent coherent interconnect technologies enable hybrid systems with both scale-up and scale-out characteristics:

Example: Gen-Z strives to connect an entire datacenter of machines coherently
⇒ The user perceives a shared-memory system, but with the performance characteristics (communication latency and bandwidth) of a shared-nothing system

SLIDE 12

Non-Uniform Memory Access Concept

[Figure: four sockets, each with four cores, a memory controller, and directly attached memory, connected by inter-socket links]

Part of the main memory is directly attached to each socket (local memory).
Memory attached to a different socket can be accessed indirectly via the other socket's memory controller and the interconnect (remote memory).
A socket plus its local memory forms a NUMA node.
SLIDE 13

Non-Uniform Memory Access Concept

Multiple point-to-point links between sockets scale better than a shared interconnect.

Multiple memory controllers partition the address space and provide a higher total memory bandwidth (though the bandwidth to a single local region remains the same).

Access to local memory behaves exactly like in a UMA system.

Access to remote memory traverses more hops (local interconnect -> inter-socket link -> remote interconnect -> remote memory controller):
⇒ Certainly higher access latency
⇒ Probably lower bandwidth, as an inter-socket link is likely not as wide as on-chip connections

⇒ The predominant architecture for current multi-socket machines

SLIDE 14

Non-Uniform Memory Access Terminology

Physical perspective:
1. Hardware Thread
2. Core
3. Chip, Die
4. Multichip Module
5. Socket, Package, Processor, CPU
6. Mainboard
7. Machine, System

Logical perspective:
Core, CPU, Processing Unit, Processing Element
NUMA Node/Region

SLIDE 15

Non-Uniform Memory Access Example: SGI UV 300H

240 cores, 12 TB RAM, 16 sockets

What is the killer application for such a machine?
⇒ In-memory databases!

[Figure: workload taxonomy by Pfister, classifying workloads by synchronization traffic frequency and data traffic volume, from LSLD ("parallel nirvana") to HSHD ("parallel hell"), and indicating which classes suit cluster, NUMA, and UMA systems]

SLIDE 16

Non-Uniform Memory Access Example: SGI UV 300H

Experiment: deploy a database workload on a NUMA machine (15 cores / 30 threads per socket)
⇒ Performance degrades when using more than two sockets!

SLIDE 17

Non-Uniform Memory Access Characteristics

[Figure: four sockets (S0..S3) with their cores and local memories, connected by inter-socket links that do not form a complete graph]

Local memory access does not involve inter-socket links.
Remote memory access involves one or more inter-socket links.
Inter-socket links might not form a complete graph:
⇒ Performance of remote memory access is non-uniform as well (e.g. S0 can access memory on S3 and S1 with fewer hops than on S2)
SLIDE 18

Non-Uniform Memory Access Characteristics

[Figure: spectrum between high local bandwidth utilization and high interconnect utilization]

Unsuitable access patterns can severely degrade performance:
Inter-socket link contention on excessive remote memory accesses
Local memory controller contention on excessive combined local and remote memory accesses
Local interconnect contention also on excessive multi-hop forwarding traffic

SLIDE 19

Non-Uniform Memory Access Data Access Patterns

A single task accesses a private buffer on a different node. Two remedies:
1. Relocate the remote buffer to local memory
2. Relocate the task to the remote node
⇒ Reduce inter-socket contention


SLIDE 22

Non-Uniform Memory Access Data Access Patterns

Multiple tasks on multiple nodes access private buffers on a single node.
Remedy: relocate the remote buffers to local memory
⇒ Reduce memory controller contention


SLIDE 24

Non-Uniform Memory Access Data Access Patterns

Multiple tasks on a single node access private buffers on the same node.
Remedy: distribute tasks and buffers to different nodes
⇒ Balance memory controller utilization


SLIDE 27

Non-Uniform Memory Access Data Access Patterns

Tasks on multiple nodes access a shared buffer on a single node.
Remedy: distribute the shared buffer among all nodes
⇒ Reduce memory controller contention
⇒ Balance inter-node traffic


SLIDE 29

Non-Uniform Memory Access Data Access Patterns

Tasks on multiple nodes only read a shared buffer on a single node.
Remedy (read-only data): duplicate the buffer on every node
⇒ Avoid inter-node traffic entirely


SLIDE 31

Bandwidth Measurements: Maximal Local Bandwidth

[Figure: bandwidth (0 to 80000 MB/s) over thread counts 1 to 15, for all-reads, 3:1, 2:1, and 1:1 read-write mixes and a stream-triad-like pattern, against the ideal scaling line]

Significant flattening: reasonable bandwidth increase per thread up to 8 threads; almost no benefit from using more than 8 threads.

SLIDE 32

Bandwidth Measurements: Maximal System Bandwidth

4x4 sockets:
Almost ideal scale-up: 16 processors x 51080 MB/s = 817280 MB/s
With random data distribution (not the worst case, which would be all data on one socket): only 22.7% of the local-only performance

One socket:
With hyperthreading (2 threads per core) and data residing on the first socket only: 110.6% of the local-only performance

Huge performance potential for adequate thread and memory placement!

slide-33
SLIDE 33

Avoid data movement

Remote memory accesses across long distances take time -> high latency -> wasted cycles

High volume will cause contention -> high latency for accessing threads -> wasted cycles

Avoid contention

Balance utilization of resources (memory controller, interconnect, ...)

Analzye data access patterns

Decompose loosely coupled tasks -> increase flexibility of placement

Agglomerate tightly coupled tasks -> reduce communication overhead

Identify shared and private data chunks and place accordingly

Identify read-only, read-write, write-only access patterns

Consider benefits of dynamic adaption during runtime Data locality should be maximized

Felix Eberhardt ParProg 2019 Non-Uniform Memory Access Chart 40

Best practices: Decisions for placement

SLIDE 34

Best Practices: Thread Placement

Tradeoff: computational load balancing vs. data locality

Granularity: process, thread, task

Affinity mask: a bitmask that specifies on which cores the process, or the threads in a process, can be scheduled
Pinning: only one bit set
Can be adjusted at runtime

"Computation follows data"

SLIDE 35

Best Practices: Data Placement

Granularity is a page (4k, 64k, ... 64GB)

Static (at allocation time): placement policies can be defined for the entire process or overridden at every allocation:
First-touch: the de facto standard policy
Allocate on fixed node(s)
Interleaving
(Replicate pages on multiple nodes)

Dynamic (at runtime):
Migration: move pages
Copy: keep in mind that you now have to ensure consistency if values are updated

"Data follows computation"

SLIDE 36

Linux: Controlling a Process from the Outside

Run your application with: numactl <options> <command>

Thread placement (set the default affinity mask for a given process):
numactl --physcpubind=<cpus> <command>
numactl --cpunodebind=<nodes> <command>
<cpus> is a comma-delimited list of CPU numbers, A-B ranges, or all

taskset is another tool to control the affinity mask; it can be used to start a process with a given mask or to modify that of a running process.

Data placement:
numactl --interleave=<nodes> <command>
numactl --membind=<nodes> <command>
<nodes> is a comma-delimited list of node numbers, A-B ranges, or all

slide-37
SLIDE 37

Thread placement Systemcall

sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask) Pthread

pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset) Libnuma

numa_run_on_node(int node) Data placement with libnuma

void *numa_alloc_onnode(size_t size, int node)

void *numa_alloc_interleaved(size_t size)

int numa_move_pages(int pid, unsigned long count, void **pages, const int *nodes, int *status, int flags);

Felix Eberhardt ParProg 2019 Non-Uniform Memory Access Chart 44

Linux Thread and data placement

SLIDE 38

Experiment: First Touch

SLIDE 39

First Step for Portable Applications: Discovering and Assessing the NUMA Topology

Tools for topology discovery:
ACPI distance values
Linux sysfs
libnuma: numactl
hwloc: lstopo
MLC (Memory Latency Checker)
…

Tools for analyzing the runtime behaviour:
Intel Performance Counter Monitor
numatop (top focused on NUMA-related information)
…

SLIDE 40

Linux sysfs

Information provided:
NUMA nodes
ACPI distance values of nodes and cores
Mapping of cores to nodes
Cache sizes, levels, associativity, cache line size
Cache sharing of CPUs

Restrictions:
Linux only

SLIDE 41

Linux: libnuma Topology Discovery

numa_max_node(): get the number of the highest node in the system
numa_num_configured_nodes(): get the total number of NUMA nodes in the system
numa_num_configured_cpus(): get the total number of cores in the system
numa_distance(int node1, int node2): get the distance between two nodes as reported by ACPI
numa_node_to_cpus(int node, struct bitmask *mask): get a bitmask of all cores associated with the given NUMA node
numa_node_of_cpu(int cpu): get the node associated with the given core id

SLIDE 42

libnuma: numactl --hardware

Information provided:
NUMA nodes
ACPI distance values of nodes and cores
Mapping of cores to nodes

Restrictions:
Linux only

Also available as a library, to be used in applications to query system devices.

SLIDE 43

hwloc: lstopo

Information provided:
NUMA nodes
ACPI distance values of nodes and cores
Mapping of cores to nodes
Grouping of nodes according to distance values
Memory hierarchy (caches)

Restrictions:
Few; available on several platforms: Windows, Linux, BSD, ...

Also available as a library, to be used in applications to query system devices.

SLIDE 44

mlc (Intel Memory Latency Checker)

Empirical information provided:
Latencies to the local memory hierarchy
Bandwidth to the local memory hierarchy
Latencies between NUMA nodes
Bandwidth between NUMA nodes
Latencies of cache-to-cache transfers
Latencies under load

Restrictions:
Intel processors only
No source code available

SLIDE 45

Advanced Topology: SGI UV-300H

SLIDE 46

ACPI Distance Values of the SGI UV 300H

Can be acquired with numactl --hardware.
The clusters of distance values relate to blades in the system and seem to be related to latency and bandwidth characteristics (see next slide).
In contrast to the distance values, actual measurements show that a direct neighbor does not always have better results.

SLIDE 47

Latency (Left) and Bandwidth (Right) Measurements: Related to the ACPI Distance Values?

[Figure: node-to-node latency and bandwidth matrices]

The measurements seem to be related to the ACPI distances. In contrast to the distance values, however, a direct neighbor does not always have better results (compare the pairs (1,2) vs. (1,4)).

SLIDE 48

NUMA on Chip (Socket)

[Figure: AMD EPYC Infinity Fabric topology mapping]
https://www.servethehome.com/wp-content/uploads/2017/08/AMD-EPYC-Infinity-Fabric-Topology-Mapping.jpg

SLIDE 49

numatop

Information provided:
Similar to the top tool, showing NUMA-specific metrics
Uses instruction sampling
Memory view to find out which memory addresses are accessed frequently by remote nodes
Ability to collect stack traces

Restrictions:
Linux only, kernel 3.9 or later

SLIDE 50

Intel Performance Counter Monitor

Information provided:
API for Intel-specific performance counters
Core and uncore events
QPI link and memory controller utilization

Restrictions:
Intel processors only
Available on Windows and Linux