Parallel Programming and Heterogeneous Computing: Non-Uniform Memory Access
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group
Non-Uniform Memory Access Context: Uniform Memory Access Machines
Felix Eberhardt ParProg 2019 Non-Uniform Memory Access Chart 3
[Figure: four sockets (Socket0–Socket3) with cores C00–C33 sharing main memory through a common interconnect and a single memory controller]
Multiple sockets access main memory through a shared interconnect. Latency and bandwidth characteristics are identical for every pair of socket and memory location.
Non-Uniform Memory Access Context: Uniform Memory Access Machines
[Figure: four sockets contending for the shared interconnect and memory controller]
Multiple sockets access main memory through a shared interconnect.
Problem:
■ Sockets contend for memory bandwidth
■ Full utilization of the memory controller link means only 1/4 utilization of each socket link (or 1/n utilization for n sockets)
Parallelism for…
■ Speedup – compute faster
■ Throughput – compute more in the same time
■ Scalability – compute faster / more with additional resources
■ Price / performance – be as fast as possible for given money
■ Scavenging – compute faster / more with idle resources
Non-Uniform Memory Access Context: Scalability
■ Two basic approaches to scaling computing hardware:
□ Scale-Up: combine more resources (memory or cores) in a tightly coupled system
Ø User perceives a single large shared-memory system
□ Scale-Out: connect more machines in a loosely coupled network
Ø User perceives multiple communicating machines in a shared-nothing system
■ Recent coherent interconnect technologies enable hybrid systems with both scale-up and scale-out characteristics:
□ Example: Gen-Z strives to connect an entire datacenter of machines coherently
Ø User perceives a shared-memory system, but with the performance characteristics (communication latency and bandwidth) of a shared-nothing system
Non-Uniform Memory Access Concept
[Figure: four sockets, each with its own cores, memory controller, and locally attached memory, connected by point-to-point inter-socket links]
■ Part of the main memory is directly attached to a socket (local memory)
■ Memory attached to a different socket can be accessed indirectly via the other socket's memory controller and interconnect (remote memory)
■ Socket + local memory form a NUMA node
■ Multiple point-to-point links between sockets scale better than a shared interconnect
■ Multiple memory controllers partition the address space and provide a higher total memory bandwidth (though the bandwidth to a single local region remains the same)
■ Access to local memory behaves exactly like in a UMA system
■ Access to remote memory traverses more hops (local interconnect -> inter-socket link -> remote interconnect -> remote memory controller)
Ø Certainly higher access latency
Ø Probably lower bandwidth, as an inter-socket link is likely not as wide as on-chip connections
Ø Predominant architecture for current multi-socket machines
Non-Uniform Memory Access Terminology
Physical Perspective
1. Hardware Thread
2. Core
3. Chip, Die
4. Multichip Module
5. Socket, Package, Processor, CPU
6. Mainboard
7. Machine, System
Logical Perspective
■ Core, CPU, Processing Unit, Processing Element
■ NUMA Node/Region
Non-Uniform Memory Access Example: SGI UV 300H
■ 240 Cores
■ 12 TB RAM
■ 16 Sockets
What is the killer application for such a machine?
Ø In-Memory Databases!
[Workload Taxonomy by Pfister: quadrants spanned by synchronization traffic frequency (high/low sync, "HS"/"LS") and data traffic volume (high/low data, "HD"/"LD"); LS/LD is "Parallel Nirvana", HS/HD is "Parallel Hell"; UMA, NUMA, and clusters suit different quadrants]
Non-Uniform Memory Access Example: SGI UV 300H
Experiment: Deploy a Database Workload on a NUMA Machine
■ 15 Cores / 30 Threads per Socket
Ø Performance degrades when using more than two sockets!

Non-Uniform Memory Access Characteristics
[Figure: four sockets with local memories; remote accesses traverse inter-socket links]
■ Local memory access does not involve inter-socket links
■ Remote memory access involves one or more inter-socket links
■ Inter-socket links might not form a complete graph
Ø Performance of remote memory access is non-uniform as well (e.g. S0 can access memory on S3 and S1 with fewer hops than on S2)
[Figure: access patterns ranked from high local-bandwidth utilization / low interconnect utilization to low local-bandwidth utilization / high interconnect utilization]
■ Unsuitable access patterns can severely degrade performance:
□ Inter-socket link contention on excessive remote memory accesses
□ Local memory controller contention on excessive combined local and remote memory accesses
□ Local interconnect contention also on excessive multi-hop forward traffic
Non-Uniform Memory Access Data Access Patterns
[Figure: four NUMA nodes (Node0–Node3), each with cores and local memory]
A single task accesses a private buffer on a different node. Either:
1. Relocate the remote buffer to local memory, or
2. Relocate the task to the remote node
Ø Reduces inter-socket contention
Multiple tasks on multiple nodes access private buffers on a single node:
■ Relocate the remote buffers to local memory
Ø Reduces memory controller contention
Multiple tasks on a single node access private buffers on the same node:
■ Distribute tasks and buffers to different nodes
Ø Balances memory controller utilization
Tasks on multiple nodes access a shared buffer on a single node:
■ Distribute the shared buffer among all nodes
Ø Reduces memory controller contention
Ø Balances inter-node traffic
Tasks on multiple nodes read a shared buffer on a single node:
■ Read-only: duplicate the buffer on every node
Ø Avoids inter-node traffic entirely
Bandwidth Measurements: Maximal Local Bandwidth
[Figure: bandwidth (0–80000 MB/s) vs. thread count (1–15) for All Reads, 3:1 / 2:1 / 1:1 Reads-Writes, Stream-triad-like, and Ideal curves]
Significant flattening: reasonable bandwidth increase per thread up to 8 threads; almost no benefit from using more than 8 threads.
Bandwidth Measurements: Maximal System Bandwidth
[Figure: measured bandwidth, one socket vs. 4x4 sockets]
4x4 sockets:
- Almost ideal scale-up: 16 processors x 51080 MB/s = 817280 MB/s
- Random data distribution – not the worst case! (Which would be all data on one socket.)
- Only 22.7% of the local-only performance
One socket:
- Hyperthreading: 2 threads per core
- Data resides on the first socket only
- 110.6% of the local-only performance
Huge performance potential for adequate thread and memory placement!
Best practices: Decisions for placement
■ Avoid data movement
□ Remote memory accesses across long distances take time -> high latency -> wasted cycles
□ High volume will cause contention -> high latency for accessing threads -> wasted cycles
■ Avoid contention
□ Balance utilization of resources (memory controller, interconnect, ...)
■ Analyze data access patterns
□ Decompose loosely coupled tasks -> increase flexibility of placement
□ Agglomerate tightly coupled tasks -> reduce communication overhead
□ Identify shared and private data chunks and place accordingly
□ Identify read-only, read-write, write-only access patterns
□ Consider benefits of dynamic adaptation during runtime
Ø Data locality should be maximized
Best practices: Thread placement
■ Tradeoff: computational load balancing vs. data locality
■ Granularity
□ Process
□ Thread
□ Task
■ Affinity Mask: a bitmask that specifies on which cores the process or the threads in a process can be scheduled
□ Pinning (only one bit set)
□ Can be adjusted at runtime
Ø „Computation follows data“
Best practices: Data placement
■ Granularity is a page (4k, 64k, ... 64GB)
■ Static (allocation time): you are able to define placement policies for the entire process or override them at every allocation
□ First-touch – de-facto standard policy
□ Allocate on fixed node(s)
□ Interleaving
□ (Replicate pages on multiple nodes)
■ Dynamic (runtime)
□ Migration: move pages
□ Copy – keep in mind that you now have to ensure consistency if values are updated
Ø „Data follows computation“
Linux: Control a process from outside
Run your application with numactl <options> <command>
■ Thread placement: set the default affinity mask for a given process
□ numactl --physcpubind=<cpus>
□ numactl --cpunodebind=<nodes>
(<cpus> is a comma-delimited list of cpu numbers, A-B ranges, or all)
□ taskset is another tool to control the affinity mask; it can be used to start a process with a given mask or to modify that of a running process
■ Data placement
□ numactl --interleave=<nodes>
□ numactl --membind=<nodes>
(<nodes> is a comma-delimited list of node numbers, A-B ranges, or all)
Linux: Thread and data placement
Thread placement
□ System call: sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask)
□ Pthreads: pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset)
□ Libnuma: numa_run_on_node(int node)
Data placement with libnuma
□ void *numa_alloc_onnode(size_t size, int node)
□ void *numa_alloc_interleaved(size_t size)
□ int numa_move_pages(int pid, unsigned long count, void **pages, const int *nodes, int *status, int flags)
Experiment: First touch
First step for portable applications: discovering and assessing the NUMA topology
Tools for topology discovery:
■ ACPI distance values
■ Linux sysfs
■ Libnuma: numactl
■ Hwloc: lstopo
■ MLC (Memory Latency Checker)
■ …
Tools for analyzing the runtime behaviour:
■ Intel Performance Counter Monitor
■ numatop: top focused on NUMA-related information
■ …
Linux sysfs
Information provided:
■ NUMA nodes
■ ACPI distance values of nodes and cores
■ Mapping of cores to nodes
■ Cache sizes, levels, associativity, cache line size
■ Cache sharing of CPUs
Restrictions:
■ Linux only
Linux Libnuma topology discovery
■ numa_max_node() – get the number of the highest node in the system
■ numa_num_configured_nodes() – get the total number of NUMA nodes in the system
■ numa_num_configured_cpus() – get the total number of cores in the system
■ numa_distance(int node1, int node2) – get the distance between two nodes as reported by ACPI
■ numa_node_to_cpus(int node, struct bitmask *mask) – get a bitmask of all cores associated with the given NUMA node
■ numa_node_of_cpu(int cpu) – get the node associated with the given core id
Libnuma: numactl --hardware
Information provided:
■ NUMA nodes
■ ACPI distance values of nodes and cores
■ Mapping of cores to nodes
Restrictions:
■ Linux only
■ Available as a library to be used in applications to query system devices
Hwloc: lstopo
Information provided:
■ NUMA nodes
■ ACPI distance values of nodes and cores
■ Mapping of cores to nodes
■ Grouping of nodes according to distance values
■ Memory hierarchy (caches)
Restrictions:
■ Several platforms: Windows, Linux, BSD, ...
■ Available as a library to be used in applications to query system devices
MLC (Memory Latency Checker)
Empirical information provided:
■ Latencies to local memory hierarchy
■ Bandwidth to local memory hierarchy
■ Latencies between NUMA nodes
■ Bandwidth between NUMA nodes
■ Latencies of cache-to-cache transfers
■ Latencies under load
Restrictions:
■ Only on Intel processors
■ No source code available
Advanced Topology: SGI UV-300H
ACPI Distance Values of SGI UV300H
■ Can be acquired with numactl --hardware
■ Clusters relate to blades in the system
■ Seem to be related to latency and bandwidth characteristics (see next slide)
■ In contrast to distance values, actual measurements show that one direct neighbor always has better results
Related to ACPI Distance Values? Latency (Left) and Bandwidth (Right) Measurements
The measurements seem to be related to the ACPI distances. In contrast to distance values, however, they show that one direct neighbor always has better results ((1,2) vs. (1,4)).
NUMA on Chip (Socket)
https://www.servethehome.com/wp-content/uploads/2017/08/AMD-EPYC-Infinity-Fabric-Topology-Mapping.jpg
numatop
Information provided:
■ Similar to the top tool
■ Shows NUMA-specific metrics
■ Uses instruction sampling
■ Memory view to find out which memory addresses are accessed frequently by remote nodes
■ Ability to collect stack traces
Restrictions:
■ Linux only, Kernel 3.9 or later
Intel Performance Counter Monitor
Information provided:
■ API for Intel-specific performance counters
■ Core and Uncore events
■ QPI links and memory controller utilization
Restrictions:
■ Available on Windows and Linux
■ Intel processors only