SLIDE 1

Parallel Programming and Heterogeneous Computing

Non-Uniform Memory Access

Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group

SLIDE 2

Recap: Optimization Goals

Decrease Latency – process a single workload faster (= speedup)

Increase Throughput – process more workloads in the same time

⇒ Both are performance metrics

Scalability: make the best use of additional resources
- Scale Up: utilize additional resources within a machine
- Scale Out: utilize resources on additional machines

Cost/Energy Efficiency: minimize cost/energy requirements for given performance objectives; alternatively, maximize performance for a given cost/energy budget

Utilization: minimize idle time (= waste) of available resources

Precision Tradeoffs: trade precision of results for performance

Felix Eberhardt Chart 2
ParProg 2020 B4 Non-Uniform Memory Access

SLIDE 3

Non-Uniform Memory Access Context: Scalability

Two basic approaches to scaling computing hardware:

Scale-Up: combine more resources (memory or cores) in a tightly coupled system
⇒ The user perceives a single large shared-memory system

SLIDE 4

Non-Uniform Memory Access Context: Scalability

Two basic approaches to scaling computing hardware:

Scale-Out: connect more machines in a loosely coupled network
⇒ The user perceives multiple communicating machines in a shared-nothing system

SLIDE 5

Non-Uniform Memory Access Context: Scalability

Recent coherent interconnect technologies enable hybrid systems with both scale-up and scale-out characteristics:

Example: Gen-Z strives to connect an entire datacenter of machines coherently
⇒ The user perceives a shared-memory system, but with the performance characteristics (communication latency and bandwidth) of a shared-nothing system

SLIDE 6

Non-Uniform Memory Access Context: Uniform Memory Access Machines

[Diagram: four sockets (Socket0–Socket3), each with four cores (C00–C33), reaching the memory modules through a shared interconnect and memory controller]

Multiple sockets access main memory through a shared interconnect. Latency and bandwidth characteristics are identical for every pair of socket and memory location.

SLIDES 7–10

(Animation frames of the previous diagram: as more sockets issue memory requests concurrently, the shared interconnect and memory controller become a point of contention.)

SLIDE 11

Non-Uniform Memory Access Concept

[Diagram: four sockets connected by an interconnect; each socket has its own cores, memory controller, and directly attached memory modules]

Part of the main memory is directly attached to a socket (local memory).

Memory attached to a different socket can be accessed indirectly via the other socket's memory controller and the interconnect (remote memory).

Socket + local memory form a NUMA node.

SLIDE 12

Non-Uniform Memory Access Characteristics

[Diagram: four NUMA nodes (Socket0–Socket3), each with four cores and local memory, connected by inter-socket links]

Local memory accesses do not involve inter-socket links, but those links are shared for remote requests
⇒ Local performance can suffer from remote activity

Remote memory accesses involve one or more inter-socket links, as the links need not form a complete graph
⇒ Access to different remote memory regions is non-uniform as well

SLIDE 13

Non-Uniform Memory Access Concept

Multiple point-to-point links between sockets scale better than a shared interconnect.

Multiple memory controllers partition the address space and provide a higher total memory bandwidth (though the bandwidth to a single local region remains the same).

Access to local memory behaves exactly like in a UMA system.

Access to remote memory traverses more hops (local interconnect → inter-socket link → remote interconnect → remote memory controller)
⇒ Certainly higher access latency
⇒ Probably lower bandwidth, as an inter-socket link is likely not as wide as on-chip connections

⇒ NUMA is the predominant architecture for current multi-socket machines

SLIDE 14

Non-Uniform Memory Access Terminology

Physical perspective:
1. Hardware thread
2. Core
3. Chip, die
4. Multichip module
5. Socket, package, processor, CPU
6. Mainboard
7. Machine, system

Logical perspective:
- Core, CPU, processing unit, processing element
- NUMA node/region

SLIDE 15

Non-Uniform Memory Access Example: SGI UV 300H

240 cores, 12 TB RAM, 16 sockets

What is a killer application for such a machine?
⇒ In-memory databases!

[Diagram: workload taxonomy by Pfister, classifying workloads by data traffic volume and synchronization traffic frequency. LS/LD is "Parallel Nirvana", HS/HD is "Parallel Hell"; UMA, NUMA, and cluster systems suit different regions of this space.]

SLIDE 16

Non-Uniform Memory Access Example: SGI UV 300H

Experiment: NUMA behavior when scaling a workload

The machine has 16 sockets × 15 cores × 2-way SMT (hardware threads allocated in locality order)
⇒ Performance degrades when using more than two sockets!

SLIDE 17

Non-Uniform Memory Access Characteristics

[Chart: local bandwidth utilization (high to low) against interconnect utilization (low to high)]

Unsuitable access patterns can severely degrade performance:
- Inter-socket link contention under excessive remote memory accesses
- Local memory controller contention under excessive combined local and remote memory accesses
- Local interconnect contention, also under excessive multi-hop forwarding traffic

SLIDE 18

Non-Uniform Memory Access Data Access Patterns

[Diagram: four NUMA nodes (Node0–Node3), each with four cores and local memory]

A single task accesses a private buffer on a different node:
A. Relocate the remote buffer to local memory
B. Relocate the task to the remote node
⇒ Reduce inter-socket contention

SLIDE 19

Non-Uniform Memory Access Data Access Patterns

[Diagram: four NUMA nodes, each with four cores and local memory]

Multiple tasks on multiple nodes access private buffers located on a single node:
A. Relocate the remote buffers to local memory
⇒ Reduce memory controller contention

SLIDE 20

Non-Uniform Memory Access Data Access Patterns

[Diagram: four NUMA nodes, each with four cores and local memory]

Multiple tasks on a single node access private buffers on the same node:
A. Distribute tasks and buffers across different nodes
⇒ Balance memory controller utilization

SLIDE 21

Non-Uniform Memory Access Data Access Patterns

[Diagram: four NUMA nodes, each with four cores and local memory]

Tasks on multiple nodes access a shared buffer on a single node:
A. Distribute the shared buffer among all nodes
⇒ Reduce memory controller contention
⇒ Balance inter-node traffic

SLIDE 22

Non-Uniform Memory Access Data Access Patterns

[Diagram: four NUMA nodes, each with four cores and local memory]

Tasks on multiple nodes read a shared buffer on a single node:
A. Read-only data: duplicate the buffer on every node
⇒ Avoid inter-node traffic entirely

SLIDE 23

(Animation frame of the previous slide: after duplication, every node serves reads from its local copy.)

SLIDE 24

Non-Uniform Memory Access Local Bandwidth Characteristics

[Chart: bandwidth (0–80 GB/s) over 1–15 threads for all-reads, 3:1, 2:1, and 1:1 read-write mixes, a stream-triad-like pattern, and the ideal scaling line]

Experiment on the SGI UV 300H: threads on a single socket generate independent memory traffic.

The curve flattens significantly after 6–8 active threads
⇒ Local memory bandwidth is exhausted; scaling beyond 8 threads brings no benefit

SLIDE 25

Non-Uniform Memory Access System Bandwidth Characteristics

Experiments on the SGI UV 300H:

- Memory on a single node, accessed by threads on the local node: 51.1 GB/s (baseline)
- Memory on a single node, accessed by threads on the local and one remote node: 56.5 GB/s (110.6%)
- Memory on all 16 nodes, accessed by threads on their local nodes: 816.0 GB/s (1597.5%, ~×16)
- Memory on all 16 nodes, accessed by threads on local and remote nodes (random pattern): 185.0 GB/s (22.7% of the all-local case)

⇒ Huge performance potential, provided thread and memory placement are chosen adequately

slide-26
SLIDE 26

Avoid data movement

Remote memory accesses across long distances take time → high latency → wasted cycles

High volume will cause contention → high latency for accessing threads → wasted cycles Avoid contention

Balance utilization of resources (memory controllers, interconnect, ...) Analzye data access patterns

Decompose loosely coupled tasks → increase flexibility of placement

Agglomerate tightly coupled tasks → reduce communication overhead

Identify shared and private data chunks and place accordingly

Identify read-only, read-write, write-only access patterns

Consider benefits of dynamic adaption during runtime

Ø

Maximize data locality

Felix Eberhardt Chart 21

Non-Uniform Memory Access Placement Decisions

ParProg 2020 B4 Non-Uniform Memory Access

SLIDE 27

Non-Uniform Memory Access Thread Placement

Tradeoff: computational load balancing vs. data locality

Possible at different granularities (process, thread, task)

Realized in the OS through an affinity mask: a bitmask specifying on which logical CPUs the process, or the threads within a process, may be scheduled

Pinning (= only a single bit set)

The affinity mask can be adjusted at runtime:
⇒ Computation follows data

SLIDE 28

Non-Uniform Memory Access Data Placement

Placement granularity is a page (4 KB, 64 KB, ... 64 GB)

Static, at allocation time: placement policies or specific requests govern the page location for every allocation
- First-touch – the de facto standard policy
- Allocation on fixed node(s)
- Interleaving
- (Page replication on multiple nodes – consistency!)

Dynamic, at runtime: pages can migrate between nodes after allocation
⇒ Data follows computation

SLIDE 29

Non-Uniform Memory Access - Toolbox: External Placement Control

numactl wraps an application and enforces specific placement policies.

Thread placement – set the default affinity mask for a given process:

    numactl --physcpubind=<cpus>
    numactl --cpunodebind=<nodes>

<cpus> is a comma-delimited list of CPU numbers, A-B ranges, or all.

taskset is another tool for controlling the affinity mask; it can also modify the affinity masks of already running processes.

Data placement:

    numactl --interleave=<nodes>
    numactl --membind=<nodes>

<nodes> is a comma-delimited list of node numbers, A-B ranges, or all.

SLIDE 30

Non-Uniform Memory Access - Toolbox: Internal Placement Control

Thread placement:
- System call: sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask)
- Pthreads: pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset)
- libnuma: numa_run_on_node(int node)

Data placement (libnuma):
- void *numa_alloc_onnode(size_t size, int node)
- void *numa_alloc_interleaved(size_t size)
- int numa_move_pages(int pid, unsigned long count, void **pages, const int *nodes, int *status, int flags)

libnuma: man 3 numa

SLIDE 31

Non-Uniform Memory Access - Toolbox Experiment: First-Touch Placement Policy

A thread visits all NUMA nodes in the system: it allocates memory on the current node and touches that memory from the next node.

To determine the location of a memory page:

    move_pages(pid, count, pages, nodes, status, flags);

    int main(void) {
        ...
        int n = numa_max_node();
        for (int i = 1; i <= n; i++) {
            ...
            /* wait until the scheduler has moved this thread to node i */
            while (numa_node_of_cpu(sched_getcpu()) != i) {
                sleep(1);
            }
            ...
            check_address(array[0]);
        }
    }

    void check_address(void *addr) {
        int status[1] = { -1 };
        /* nodes == NULL: move_pages does not move anything, it only
         * reports the node of each page in status */
        int ret = move_pages(0, 1, &addr, NULL, status, 0);
        ...
    }

SLIDE 32

Non-Uniform Memory Access - Toolbox Experiment: First-Touch Placement Policy

SLIDE 33

Non-Uniform Memory Access - Toolbox Topology Discovery

Tools for topology discovery:
- ACPI distance values
- Linux sysfs
- libnuma: numactl
- hwloc: lstopo
- MLC (Intel Memory Latency Checker)
- ...

Tools for analyzing runtime behaviour:
- Intel Performance Counter Monitor
- numatop: top focused on NUMA-related information
- ...

SLIDE 34

Non-Uniform Memory Access - Toolbox Topology Discovery: Linux sysfs

Information provided:
- NUMA nodes
- ACPI distance values
- Number of nodes and cores
- Mapping of cores to nodes
- Cache sizes, levels, associativity, cache line size
- Cache sharing between CPUs

Restrictions:
- Linux only

SLIDE 35

Non-Uniform Memory Access - Toolbox Topology Discovery: libnuma

- numa_max_node() – number of the highest node in the system
- numa_num_configured_nodes() – total number of NUMA nodes in the system
- numa_num_configured_cpus() – total number of cores in the system
- numa_distance(int node1, int node2) – distance between two nodes as reported by ACPI
- numa_node_to_cpus(int node, struct bitmask *mask) – bitmask of all cores associated with the given NUMA node
- numa_node_of_cpu(int cpu) – node associated with the given core id

SLIDE 36

Non-Uniform Memory Access - Toolbox Topology Discovery: numactl

Information provided:
- NUMA nodes
- ACPI distance values
- Number of nodes and cores
- Mapping of cores to nodes

Restrictions:
- Linux only

Also available as a library (libnuma), to be used in applications to query system devices.

SLIDE 37

Non-Uniform Memory Access - Toolbox Topology Discovery: hwloc / lstopo

Information provided:
- NUMA nodes
- ACPI distance values
- Number of nodes and cores
- Mapping of cores to nodes
- Grouping of nodes according to distance values
- Memory hierarchy (caches)

Restrictions:
- Few: runs on several platforms (Windows, Linux, BSD, ...)

Also available as a library, to be used in applications to query system devices.

SLIDE 38

Non-Uniform Memory Access - Toolbox Topology Discovery: Memory Latency Checker

Empirical information provided:
- Latencies to the local memory hierarchy
- Bandwidth to the local memory hierarchy
- Latencies between NUMA nodes
- Bandwidth between NUMA nodes
- Latencies of cache-to-cache transfers
- Latencies under load

Restrictions:
- Intel processors only
- No source code available

SLIDE 39

Non-Uniform Memory Access Topology Examples: SGI UV 300H

SLIDE 40

Non-Uniform Memory Access Topology Examples: SGI UV 300H

ACPI distance values:
- Can be acquired with numactl --hardware
- Clusters correspond to blades in the system
- They seem to be related to latency and bandwidth characteristics (see next slide)

SLIDE 41

Non-Uniform Memory Access Topology Examples: SGI UV 300H

Measured latency (using Intel MLC):
- Clusters correspond to blades in the system
- Four classes of latencies:
  - Local: ~110 ns
  - Neighbor: ~200 ns
  - Blade: ~230 ns
  - Far remote: ~480 ns
⇒ A factor of ~4× between local and far remote!

SLIDE 42

Non-Uniform Memory Access Topology Examples: SGI UV 300H

Measured bandwidth (using Intel MLC):
- Clusters correspond to blades in the system
- Four classes of distances:
  - Local: ~51 GB/s
  - Neighbour: ~12.5 GB/s
  - Blade: ~11.5 GB/s
  - Far remote: ~11.3 GB/s
⇒ The difference between remote and far-remote nodes is small; local and remote, however, differ by a factor of ~4×!

SLIDE 43

Non-Uniform Memory Access Topology Examples: NUMA on Chip (Single Socket)

[Image: AMD EPYC Infinity Fabric topology mapping]
https://www.servethehome.com/wp-content/uploads/2017/08/AMD-EPYC-Infinity-Fabric-Topology-Mapping.jpg

SLIDE 44

Non-Uniform Memory Access - Toolbox System Performance: numatop

Information provided:
- Similar to the top tool, but shows NUMA-specific metrics
- Uses instruction sampling
- Memory view to find out which memory addresses are frequently accessed by remote nodes
- Ability to collect stack traces

Restrictions:
- Linux only, kernel 3.9 or later

SLIDE 45

Non-Uniform Memory Access - Toolbox System Performance: Intel Processor Counter Monitor

Information provided:
- API for Intel-specific performance counters
- Core and uncore events
- QPI link and memory controller utilization
- Many other tools available (PCIe, cache allocation, ...)
- https://github.com/opcm/pcm

Restrictions:
- Available on Windows and Linux
- Intel processors only