Parallel Programming and Heterogeneous Computing Non-Uniform Memory - PowerPoint PPT Presentation

Parallel Programming and Heterogeneous Computing Non-Uniform Memory Access Max Plauth, Sven Köhler, Felix Eberhardt , Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group

Non-Uniform Memory Access Context: Uniform Memory Access Machines Multiple sockets access main memory through a shared interconnect. C00 C01 C10 C11 C20 C21 C30 C31 Latency and bandwidth Socket0 Socket1 Socket2 Socket3 characteristic is equal for any pair of socket and memory C03 C02 C13 C12 C23 C22 C33 C32 location. Interconnect Memory Controller ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Memory Memory Chart 3 Memory

Non-Uniform Memory Access Context: Uniform Memory Access Machines Multiple sockets access main memory through a shared interconnect. C00 C01 C10 C11 C20 C21 C30 C31 Problem: Socket0 Socket1 Socket2 Socket3 Sockets contend for memory ■ bandwidth C03 C02 C13 C12 C23 C22 C33 C32 Full utilization of the memory ■ controller link means only 1/4 Interconnect utilization of each socket link Contention (or 1/n utilization for n sockets) Memory Controller ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Memory Memory Chart 8 Memory

Non-Uniform Memory Access Context: Scalability Parallelism for… Speedup – compute faster ■ Throughput – compute more in the same time ■ Scalability – compute faster / more with additional resources ■ Price / performance – be as fast as possible for given money ■ Scavenging – compute faster / more with idle resources ■ ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Chart 9

Non-Uniform Memory Access Context: Scalability Two basic approaches to scaling computing hardware: ■ Scale-Up : combine more resources (memory or cores) in a tightly □ coupled system User perceives a single large shared-memory system Ø ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Machine Chart 10

Non-Uniform Memory Access Context: Scalability Two basic approaches to scaling computing hardware: ■ Scale-Out : connect more machines in a loosely coupled network □ User perceives multiple communicating machines in a shared- Ø nothing system ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Machine Chart 11

Non-Uniform Memory Access Context: Scalability Recent coherent interconnect technologies enable hybrid systems with ■ both scale-up and scale-out characteristics: Example: Gen-Z strives to connect an entire datacenter of machines □ coherently User perceives a shared-memory system, but with the performance Ø characteristics (communication latency and bandwidth) of a shared- nothing system ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Machine Chart 12

Non-Uniform Memory Access Concept Part of the main memory is directly attached to a socket (local memory) ■ Memory attached to a different socket can be accessed indirectly via the ■ other socket‘s memory controller and interconnect (remote memory) Socket + local memory form a NUMA node ■ Core Core Core Core Memory Memory Memory Memory Socket Socket Memory Memory Interconnect Memory Controller ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Memory Memory Memory Memory Socket Socket Memory Memory Chart 13

Non-Uniform Memory Access Concept Multiple point to point links between sockets scale better than a shared ■ interconnect Multiple memory controllers partition address space and provide a higher ■ total memory bandwidth (though the bandwidth to a single local region remains the same) Access to local memory behaves exactly like UMA system ■ Access to remote memory traverses more hops (local interconnect -> ■ inter-socket link -> remote interconnect -> remote memory controller) ParProg 2019 Certainly higher access latency Ø Non-Uniform Memory Access Probably lower bandwidth, as inter-socket link is likely not as wide as Ø Felix Eberhardt on chip connections Chart 14 Predominant architecture for current multi-socket machines Ø

Non-Uniform Memory Access Terminology Physical Perspective Logical Perspective Hardware Thread Core, CPU, Processing Unit, 1. ■ Processing Element Core 2. Chip, Die 3. Multichip Module 4. NUMA Node/Region ■ Socket, Package, Processor, CPU 5. Mainboard 6. Machine, System 7. ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Chart 15

Non-Uniform Memory Access Example: SGI UV 300H 240 Cores ■ 12 TB RAM ■ 16 Sockets ■ What is the Killer Application for such a machine? Ø In-Memory Databases! HSLD HSHD Synchroni- ParProg 2019 “Parallel Hell” UMA Non-Uniform zation Memory Access Traffic LSLD LSHD Felix Eberhardt Frequency “Parallel Cluster NUMA Nirvana” Chart 16 Data Traffic Volume [Workload Taxonomy by Pfister]

Non-Uniform Memory Access Example: SGI UV 300H Experiment: Deploy a Database Workload on a NUMA Machine 15 Cores / 30 Threads per Socket ■ Performance degrades when using more than two sockets! Ø ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Chart 18

Non-Uniform Memory Access Characteristics Local memory access does not ■ involve inter-socket links C00 C10 C11 C01 Memory Memory Remote memory access involves one Memory Memory ■ Socket0 Socket1 Memory Memory or more inter-socket links C03 C02 C13 C12 Inter-socket links might not form a ■ complete graph Performance of remote memory Ø access is non-uniform as well C30 C31 C20 C21 Memory Memory Memory Memory Socket3 Socket2 (e.g. S0 can access memory on Memory Memory S3 and S1 with fewer hops than C33 C32 C23 C22 ParProg 2019 on S2) Non-Uniform Memory Access Felix Eberhardt Chart 19

Non-Uniform Memory Access Characteristics Unsuitable access patterns can severely ■ degrade performance: Inter-socket link contention on excessive □ remote memory accesses local bandwidth high Local memory controller contention on utilization □ low excessive combined local and remote high memory accesses interconnect utilization low Local interconnect contention also on □ excessive multi-hop forward traffic ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Chart 20

Non-Uniform Memory Access Data Access Patterns Single task accesses private buffer on a different node C00 C01 C10 C11 Memory Memory Relocate remote buffer to local Memory Memory 1. Node0 Node1 Memory Memory memory C03 C02 C13 C12 Relocate task to remote node 2. Reduce inter-socket contention Ø C30 C31 C20 C21 Memory Memory Memory Memory Node3 Node2 Memory Memory C33 C32 C23 C22 ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Chart 22

Non-Uniform Memory Access Data Access Patterns Single task accesses private buffer on a different node C00 C01 C10 C11 Memory Memory Relocate remote buffer to local Memory Memory 1. Node0 Node1 Memory Memory memory C03 C02 C13 C12 1. Relocate task to remote node 2. Reduce inter-socket contention Ø C30 C31 C20 C21 Memory Memory Memory Memory Node3 Node2 Memory Memory C33 C32 C23 C22 ParProg 2019 Non-Uniform Memory Access Felix Eberhardt Chart 23

Parallel Programming and Heterogeneous Computing Non-Uniform Memory - PowerPoint PPT Presentation

Parallel Programming and Heterogeneous Computing Non-Uniform Memory Access Max Plauth, Sven Khler, Felix Eberhardt , Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group Non-Uniform Memory Access Context: Uniform Memory Access

Parallel Computing: Opportunities and Challenges Victor Lee Parallel Computing Lab (PCL), Intel

Parallel Programming and Heterogeneous Computing A2 - Parallel Hardware Max Plauth, Sven Khler,

Parallel Computing the Why and the How Albert-Jan Yzelman February, 2010 Albert-Jan Yzelman

Cluster Basics Hana Sevcikova University of Washington DataCamp Parallel Programming in R

Coverage in Heterogeneous Coverage in Heterogeneous Networks Xiaoli Chu King s College

Outline Overview Theoretical background Parallel computing systems Parallel

Overview Parallel computing platforms Approaches to building parallel computers

PARALLEL Joachim Nitschke PROGRAMMING Project Seminar Parallel Programming, Summer

Parallel Programming and Heterogeneous Computing B2 - Shared-Memory: Programming Models Max

Parallel and Distributed Programming Introduction Kenjiro Taura 1 / 21 Contents 1 Why Parallel

Parallel Numerical Algorithms Chapter 2 Parallel Thinking Section 2.2 Parallel

Parallel Programming and Heterogeneous Computing Shared-Nothing Parallelism Models Max Plauth,

Adventures in HPC and R: Going Parallel What is Parallel Computing? Justin Harrington &

Introduction to OpenMP ! Introduction to parallel computing ! Classification of parallel

Introduction to Parallel Computing George Karypis Parallel Programming Platforms Elements of a

COMP 633 - Parallel Computing Lecture 15 October 1, 2020 Programming Accelerators using

D ISTRIBUTED S YSTEMS [COMP9243] Migration: a file can transparently move to another server

Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani

CMP722 ADVANCED COMPUTER VISION Lecture #9 Graph Networks Aykut Erdem // Hacettepe

Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular

Lecture 4 : Clustering wyang@ntu.edu.tw Introduction to

Morphology 11-711 Algorithms for NLP 1 November 2018 Part I (Some slides from Lori Levin,

Class n-gram models for very large vocabulary speech recognition of Finnish and Estonian Matti

A Note on Aggregate MAC Schemes Shoichi Hirose 1 Junji Shikata 2 1 University of Fukui, Japan 2

Parallel Programming and Heterogeneous Computing Non-Uniform Memory - PowerPoint PPT Presentation

Parallel Programming and Heterogeneous Computing Non-Uniform Memory Access Max Plauth, Sven Khler, Felix Eberhardt , Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group Non-Uniform Memory Access Context: Uniform Memory Access

Parallel Computing: Opportunities and Challenges Victor Lee Parallel Computing Lab (PCL), Intel

Parallel Programming and Heterogeneous Computing A2 - Parallel Hardware Max Plauth, Sven Khler,

Parallel Computing the Why and the How Albert-Jan Yzelman February, 2010 Albert-Jan Yzelman

Cluster Basics Hana Sevcikova University of Washington DataCamp Parallel Programming in R

Coverage in Heterogeneous Coverage in Heterogeneous Networks Xiaoli Chu King s College

Outline Overview Theoretical background Parallel computing systems Parallel

Overview Parallel computing platforms Approaches to building parallel computers

PARALLEL Joachim Nitschke PROGRAMMING Project Seminar Parallel Programming, Summer

Parallel Programming and Heterogeneous Computing B2 - Shared-Memory: Programming Models Max

Parallel and Distributed Programming Introduction Kenjiro Taura 1 / 21 Contents 1 Why Parallel

Parallel Numerical Algorithms Chapter 2 Parallel Thinking Section 2.2 Parallel

Parallel Programming and Heterogeneous Computing Shared-Nothing Parallelism Models Max Plauth,

Adventures in HPC and R: Going Parallel What is Parallel Computing? Justin Harrington &amp;

Introduction to OpenMP ! Introduction to parallel computing ! Classification of parallel

Introduction to Parallel Computing George Karypis Parallel Programming Platforms Elements of a

COMP 633 - Parallel Computing Lecture 15 October 1, 2020 Programming Accelerators using

D ISTRIBUTED S YSTEMS [COMP9243] Migration: a file can transparently move to another server

Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani

CMP722 ADVANCED COMPUTER VISION Lecture #9 Graph Networks Aykut Erdem // Hacettepe

Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular

Lecture 4 : Clustering wyang@ntu.edu.tw Introduction to

Morphology 11-711 Algorithms for NLP 1 November 2018 Part I (Some slides from Lori Levin,

Class n-gram models for very large vocabulary speech recognition of Finnish and Estonian Matti

A Note on Aggregate MAC Schemes Shoichi Hirose 1 Junji Shikata 2 1 University of Fukui, Japan 2

Adventures in HPC and R: Going Parallel What is Parallel Computing? Justin Harrington &