COMP 633 - Parallel Computing, Lecture 6 (September 1, 2020): SMM (1) - Memory Hierarchies and Shared Memory



Slide 1

COMP 633 - Parallel Computing

Lecture 6 September 1, 2020

SMM (1) Memory Hierarchies and Shared Memory

Slide 2

PRAM example 1

  • Evaluate a polynomial for a given value of y and coefficients c_1 … c_n

Q(y) = c_1 y^(n-1) + ⋯ + c_(n-1) y + c_n
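A minimal shared-memory sketch of this example (not from the slides), in C with OpenMP; the name poly_eval and the use of pow() are illustrative. Every term c_i · y^(n-i) can be formed independently and the sum is a reduction, which an n-processor PRAM could finish in O(lg n) steps (given the powers of y, e.g. by parallel prefix).

    #include <math.h>
    #include <stdio.h>

    /* Evaluate Q(y) = c[0]*y^(n-1) + ... + c[n-2]*y + c[n-1]
       (c[i] plays the role of c_{i+1} above). */
    double poly_eval(const double *c, int n, double y)
    {
        double q = 0.0;
        #pragma omp parallel for reduction(+:q)   /* independent terms, summed by a reduction tree */
        for (int i = 0; i < n; i++)
            q += c[i] * pow(y, n - 1 - i);
        return q;
    }

    int main(void)
    {
        double c[] = {1.0, -2.0, 3.0};            /* Q(y) = y^2 - 2y + 3 */
        printf("%g\n", poly_eval(c, 3, 2.0));     /* prints 3 at y = 2   */
        return 0;
    }

Compile with, e.g., cc -fopenmp -O2 poly.c -lm (flags assumed).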


Slide 3

PRAM example 2 – bitonic merge
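The figure for this slide is not reproduced here. As a stand-in, a small C/OpenMP sketch of one way to realize an ascending bitonic merge (assumptions: the input a[0..n-1] is a bitonic sequence, n is a power of two; names are illustrative):

    /* Merge a bitonic sequence a[0..n-1] (n a power of two) into ascending order.
       lg(n) rounds; within each round all compare-exchanges touch disjoint pairs,
       so they can run fully in parallel -- the classic O(lg n) PRAM merge. */
    void bitonic_merge(double *a, int n)
    {
        for (int k = n / 2; k >= 1; k /= 2) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                if ((i & k) == 0 && a[i] > a[i + k]) {
                    double t = a[i];
                    a[i]     = a[i + k];
                    a[i + k] = t;
                }
            }
        }
    }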


Slide 4

Topics

  • PRAM algorithm examples
  • Memory systems

– organization
– caches and the memory hierarchy
– influence of the memory hierarchy on algorithms

  • Shared memory systems

– Taxonomy of actual shared memory systems

  • UMA, NUMA, cc-NUMA

Slide 5

Recall PRAM shared memory system

  • PRAM model

– assumes access latency is constant, regardless of the value of p or the size of memory
– simultaneous reads permitted under the CR model; simultaneous writes permitted under the CW model

  • Physically impossible to realize

– processors and memory occupy physical space

  • speed of light limitations

– CR / CW must be reduced to ER / EW

  • requires (lg p) time in general case

[Figure: processors 1, 2, …, p connected to a shared memory]

L = Θ( (m + p)^(1/3) )   (latency implied by the physical volume occupied by the memory and the p processors)

Slide 6

Anatomy of a processor ↔ memory system

  • Performance parameters of Random Access Memory (RAM)

– latency L

  • elapsed time from presentation of memory address to arrival of data

– address transit time
– memory access time tmem
– data transit time

– bandwidth W

  • number of values (e.g. 64 bit words) delivered to processor per unit time

– simple implementation W ~ 1/L
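A small worked example with assumed numbers (not from the slide): with address transit 10 ns, tmem = 50 ns, and data transit 10 ns,

    L = 10 + 50 + 10 = 70 ns
    W ≈ 1 / L ≈ 0.014 words per ns   (one outstanding request at a time)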

[Figure: processor connected to memory]

Slide 7

Processor vs. memory performance

  • The memory "wall"

– Processors compute faster than memory delivers data

  • increasing imbalance: tarith ≪ tmem (time per arithmetic operation vs. time per memory access)

Slide 8

Improving memory system performance (1)

  • Decrease latency L to memory

– speed of light is a limiting factor

  • bring memory closer to processor

– decrease memory access time by decreasing memory size s

  • access time ∝ s^(1/2) (VLSI)

– use faster memory technology

  • DRAM (Dynamic RAM) 1 transistor per stored bit

– high density, low power, long access time, low cost

  • SRAM (Static RAM) 6 transistors per stored bit

– low density, high power, short access time, high cost

Slide 9

Improving memory system performance (1)

  • Decrease latency using cache memory

– low latency access to frequently used values, high latency for the remaining values
– Example

  • 90% of references are to cache with latency L1
  • 10% of references are to memory with latency L2
  • average latency is 0.9L1 + 0.1L2

[Figure: processor, cache, and main memory]
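Plugging in the RS6000 figures that appear later in these slides (L1 = 2 cycles for the cache, L2 = 60 cycles for main memory):

    average latency = 0.9 × 2 + 0.1 × 60 = 7.8 cycles

roughly 7-8× better than paying the main-memory latency on every reference.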

Slide 10

Improving memory system performance (2)

  • Increase bandwidth W

– multiport (parallel access) memory

  • multiple reads, multiple exclusive writes per memory cycle

– High cost, very limited scalability

– "blocked" memory

  • memory supplies block of size b containing requested word

– supports spatial locality in cache access

[Figure: processor with register file and cache; the memory delivers a block of b words containing the requested word]

Slide 11

Improving memory system performance (2)

  • Increase bandwidth W (contd)

– pipeline memory requests

  • requires independent memory references

– interleave memory

  • problem: memory access is limited by tmem
  • use m separate memories (or memory banks)
  • W ~ m / L if references distribute over memory banks

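A worked example with assumed numbers (not from the slide): if tmem limits a single bank to one access every 60 cycles,

    one bank:      W ≈ 1 / 60  ≈ 0.017 words per cycle
    m = 16 banks:  W ≈ m / L = 16 / 60 ≈ 0.27 words per cycle

provided the stream of references spreads evenly over the banks.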

Slide 12

Latency hiding

  • Amortize latency using a pipelined interleaved memory system

– k independent references complete in (L + k · tproc) time (see the worked example at the end of this slide)

  • O(L/k) amortized (expected) latency per reference
  • Where do we get independent references?

– out-of-order execution of independent load/store operations

  • found in most modern performance-oriented processors
  • partial latency hiding: k ~ 2 - 10 references outstanding

– vector load/store operations

  • small vector units (AVX512)

– vector length 2-8 words (Intel Xeon)
– partial latency hiding

  • high-performance vector units (NEC SX-9, SX-Aurora)

– vector length k = L / tproc (128 - 256 words)
– crossbar network to highly interleaved memory (~ 16,000 banks)
– full latency hiding: amortized memory access at processor speed

– multithreaded operation

  • independent execution threads with individual hardware contexts

– partial latency hiding: 2-way hyperthreading (Intel)
– full latency hiding: 128-way threading with high-performance memory (Cray MTA)
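Worked example referenced above (assumed numbers): with L = 250 cycles and tproc = 1 cycle, k = 10 outstanding references complete in about 250 + 10·1 = 260 cycles, i.e. 26 cycles amortized per reference; pushing k to 250 brings the amortized cost down to about 2 cycles, close to the processor rate.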

Slide 13

Implementing the PRAM

  • How close can we come to O(1) latency PRAM memory in practice?

– requires processor to memory network

  • latency L = sum of

– twice the network latency
– memory cycle time
– serialization time for CR, CW

  • L increases with m, p

– L too large with current technology

– examples

  • NYU Ultracomputer (1987), IBM RP3 (1991), SBPRAM (1999)

– logarithmic-depth combining network eliminates memory contention time for CR, CW
  » Θ(lg p) latency in the network is prohibitive

[Figure: p processors P1 … Pp connected through a network to m memory modules M1 … Mm]

Slide 14

Implementing PRAM – a compromise

  • Using latency hiding with a high-performance memory system

– implements a p·k processor EREW PRAM slowed down by a factor of k

  • use m ≥ p (tmem / tproc) memory banks to match the memory reference rate of the p processors

  • total latency 2L for k = L / tproc independent random references at each processor
  • O(tproc) amortized latency per reference at each processor

– unit latency degrades in the presence of concurrent reads/writes
– Bottom line: doable, but very expensive and with only limited scaling in p
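A worked example with assumed numbers: for tmem / tproc = 60 and p = 16 processors, the rule above calls for m ≥ 16 × 60 = 960 memory banks, and each processor has to keep k = L / tproc (at least 60 here, more once network latency is added) independent references in flight.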

[Figure: p processors P1 … Pp connected through a network to m memory banks M1 … Mm]

Slide 15

Memory systems summary

  • Memory performance

– Latency is limited by physics
– Bandwidth is limited by cost

  • Cache memory: low latency access to some values

– caching frequently used values

  • rewards temporal locality of reference

– caching consecutive values

  • rewards spatial locality of reference

– decrease average latency

  • 90% fast references, 10% slow references: effective latency = 0.9·L1 + 0.1·L2
  • Parallel memories

– 100 independent references ≈ 100 fast references
– relatively expensive
– requires parallel processing


Slide 16

Simple uniprocessor memory hierarchy

  • Each component is characterized by

– capacity
– block size
– (associativity)

  • Traffic between components is characterized by

– access latency
– transfer rate (bandwidth)

  • Example:

– IBM RS6000/320H (ca. 1991)

Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
Disk              | 1,000,000        | 0.001
Main memory       | 60               | 0.1
Cache             | 2                | 1
Registers         | –                | 3

[Figure: memory hierarchy – ALU and registers, cache, main memory, disk]

Slide 17

Cache operation

  • ABC cache parameters

– associativity – block size – capacity

  • CCC performance model

– cache misses can be

  • compulsory
  • capacity
  • conflict

[Figure: a cache characterized by its block size, capacity, and associativity]
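A worked example with assumed parameters (not from the slide): a cache with capacity C = 32 KB, block size B = 64 bytes, and associativity A = 4 holds C / B = 512 blocks organized as C / (A·B) = 128 sets. A reference then misses because the block was never loaded (compulsory), because more than 512 distinct blocks are in active use (capacity), or because more than 4 active blocks map to the same set (conflict).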

Slide 18

Cache operation: read


[Figure: cache read access. The 40-bit address is split into tag <26 bits>, index <8 bits>, and block offset <6 bits>; each cache entry holds a valid bit <1>, a tag <26>, and a 512-bit (64-byte) data block. The stored tag is compared (=) with the address tag, and a MUX selects the requested 1, 2, 4, or 8 bytes from the block. Labeled parameters: associativity = 256-way, block size = 64 bytes (512 bits).]

Slide 19

The changing memory hierarchy

  • IBM RS6000 320H - 25 MHz (1991)
  • Intel Xeon 61xx [per core @3GHz] (2017)

[Figure: memory hierarchy – ALU and registers, cache, main memory, disk]

IBM RS6000/320H (1991)
Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
Disk              | 1,000,000        | 0.001
Main memory       | 60               | 0.1
Cache             | 2                | 1
Registers         | 1                | 3

Intel Xeon 61xx, per core @ 3 GHz (2017)
Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
HDD               | 18,000,000       | 0.00007
SSD               | 300,000          | 0.02
Main memory       | 250              | 0.2
L3 cache          | 48               | 0.5
L2 cache          | 12               | 1
L1 cache          | 4                | 2
Registers         | 1                | 6

Slide 20

Computational Intensity: a key metric limiting performance

  • Computational intensity of a problem

I = (total # of arithmetic operations required, in flops) / (size of input + size of result, in 64-bit words)

  • BLAS - Basic Linear Algebra Subroutines

– Asymptotic performance limited by computational intensity

  • A, B, C ∈ ℝ^(n×n);  x, y ∈ ℝ^n;  a ∈ ℝ

name           | defn        | flops  | refs   | I    | class
scale          | y = ax      | n      | 2n     | 0.5  | BLAS 1
triad          | y = ax + y  | 2n     | 3n     | 0.67 | BLAS 1
dot product    | x·y         | 2n     | 2n     | 1    | BLAS 1
matrix-vector  | y = y + Ax  | 2n²+n  | n²+3n  | ~2   | BLAS 2
rank-1 update  | A = A + xyᵀ | 2n²    | 2n²+2n | ~1   | BLAS 2
matrix product | C = C + AB  | 2n³    | 4n²    | n/2  | BLAS 3
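The matrix-product entry, for example, follows directly from the definition of I: C = C + AB performs 2n³ flops while touching only 4n² words (read A, B, and C, write C), so I = 2n³ / 4n² = n/2, which grows with n; the BLAS 1 and BLAS 2 kernels have constant I and are therefore limited by memory bandwidth rather than arithmetic.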

Slide 21

Effect of the memory hierarchy on execution time

  • C_{N×N} = A_{N×N} · B_{N×N}, naïve implementation
  • Machine

– simple L1 cache

  • block size = 16 words
  • capacity = 512 blocks
  • fully associative

– main memory

  • 4K pages
  • Layout of A,B,C in memory

– Fortran: column-major order

  • RAM model suggests O(N³) run time

– actual time follows O(N⁵) growth!

Performance of naive N×N matrix multiply on an IBM RS6000/320 uniprocessor. Time in clock cycles per multiply-add (note log10 scales). Source: Alpern et al., "The Uniform Memory Hierarchy Model of Computation", Algorithmica, 1994

do i = 1,N
  do j = 1,N
    do k = 1,N
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end do
  end do
end do
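A cache-blocked (tiled) variant, sketched here in C rather than the slides' Fortran (the tile size NB and the row-major layout are assumptions for illustration), is the standard way the memory hierarchy reshapes this algorithm: each tile brought into cache is reused NB times.

    #define NB 32   /* tile size: 3 * NB*NB * 8 bytes = 24 KB, assumed to fit in cache */

    /* C = C + A*B for N x N row-major matrices; N a multiple of NB for brevity. */
    void matmul_blocked(int N, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < N; ii += NB)
            for (int kk = 0; kk < N; kk += NB)
                for (int jj = 0; jj < N; jj += NB)
                    /* multiply one NB x NB tile of A into one NB x NB tile of C */
                    for (int i = ii; i < ii + NB; i++)
                        for (int k = kk; k < kk + NB; k++) {
                            double a = A[i * N + k];
                            for (int j = jj; j < jj + NB; j++)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }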

Slide 22

Shared memory taxonomy

  • Uniform Memory Access (UMA)

– Processors and memory are separated by a network
– All memory references cross the network
– Only practical for machines with full latency hiding

  • Parallel vector processors, multi-threaded processors
  • Expensive, rarely available in practice

[Figure: UMA organization – processors P1 … Pp connected through a network to memory modules M1 … Mm]

Slide 23

Shared memory taxonomy

  • Non-Uniform Memory Access (NUMA)

– Memory is partitioned across the processors
– References are either local or non-local

  • Local references – low latency
  • Non-local references – high latency
  • Ratio of non-local to local latency – large

– Examples

  • BBN TC2000 (1989)

– Poor performance unless extreme care is taken in data placement (see the first-touch sketch below)

[Figure: NUMA organization – each processor Pi has a local memory Mi; processors are connected by a network]
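One common way to exercise that care is first-touch placement: on many operating systems a page is allocated on the NUMA node of the thread that first writes it, so initializing data with the same parallel loop and schedule that later computes on it keeps most references local. A minimal C/OpenMP sketch under that assumption (names illustrative):

    #include <stdlib.h>

    void axpy(long n, double a, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    int main(void)
    {
        long n = 1L << 26;
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);

        #pragma omp parallel for schedule(static)   /* first touch decides page placement */
        for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        axpy(n, 3.0, x, y);                         /* same distribution, so mostly local */
        free(x); free(y);
        return 0;
    }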

Slide 24

Combining (N)UMA with cache memories

  • Processor-local caches

– Cache all memory references
– Must reflect changes in value made by other processors in the system
– Cache misses

  • Usual: compulsory, capacity, and conflict misses
  • New: coherence misses
  • Cache-coherent UMA examples

– Conventional PC-based SMP systems

  • Network is a shared bus
  • Limited scaling (p ≤ 4)
  • mostly extinct

– Server-class machines

  • Dual or Quad socket (single card)
  • Intel Xeon or AMD EPYC (20 ≀ p ≀ 128)
  • prevalent
  • Cache-coherent NUMA examples

– scales to larger processor count

  • SGI UltraViolet (p ~ 1024)
  • rare
[Figure: cc-NUMA organization – each processor Pi has a cache Ci and a local memory Mi]

Slide 25

Incorporating shared memory in the hierarchy

  • Non-local shared memory

– can be viewed as additional level in processor-memory hierarchy

  • Shared-memory parallel programming

– extension of memory hierarchy techniques
– goal:

  • concurrent transfer through parallel levels

Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
Disk              | 1,000,000        | 0.001
Non-local memory  | 180 - 500        | 0.1 - 0.01
Local memory      | 60               | 0.1
Cache             | 2                | 1
Registers         | –                | 3

[Figure: two nodes, each with a cache and local memory; each node can also reach the other node's (non-local) memory]

Slide 26

Modern shared-memory server: Intel Xeon series


Slide 27

AMD Infinity

  • Speed of light is inconveniently slow!

– miniaturize the size of memory and processors

  • Single card server

– 7 nm process technology
– 64 – 256 cores total
– 4 TB memory
