  1. COMP 633 – Parallel Computing
     Lecture 6, September 1, 2020
     SMM (1): Memory Hierarchies and Shared Memory

  2. PRAM example 1
     β€’ Evaluate a polynomial for a given value x and coefficients c_1 … c_n:
       p(x) = c_1 x^(n-1) + c_2 x^(n-2) + β‹― + c_(n-1) x + c_n
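     The slide leaves the algorithm to its figure; below is a minimal sequential C sketch of Horner's rule for this polynomial, with a comment noting how a PRAM version would parallelize it. The function name and test values are illustrative, not from the slides.

     ```c
     #include <stdio.h>

     /* Sequential baseline: Horner's rule evaluates
      *   p(x) = c[0]*x^(n-1) + ... + c[n-2]*x + c[n-1]
      * in n-1 multiply-adds. A PRAM version would instead compute all
      * powers x^i with a parallel prefix (scan) in O(lg n) steps,
      * multiply by the coefficients, and sum with an O(lg n) reduction. */
     double poly_eval(const double c[], int n, double x) {
         double p = c[0];
         for (int i = 1; i < n; i++)
             p = p * x + c[i];
         return p;
     }

     int main(void) {
         double c[] = {2.0, -1.0, 3.0};        /* 2x^2 - x + 3 */
         printf("%f\n", poly_eval(c, 3, 2.0)); /* prints 9.0 */
         return 0;
     }
     ```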

  3. PRAM example 2 – bitonic merge
     [figure: bitonic merge network]
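     The slide is figure-only; the following is a hedged C sketch of the compare-exchange pattern usually drawn in that figure, assuming the input is already a bitonic sequence of power-of-two length. Each of the lg n stages performs n/2 independent compare-exchanges, so on a PRAM with n/2 processors the merge runs in O(lg n) time. Names and the test vector are illustrative.

     ```c
     #include <stdio.h>

     static void compare_exchange(double *a, double *b) {
         if (*a > *b) { double t = *a; *a = *b; *b = t; }
     }

     /* Merge a bitonic sequence of length n = 2^k into ascending order.
      * The inner loop's n/2 compare-exchanges are independent and could
      * run in parallel, one PRAM processor per pair. */
     void bitonic_merge(double a[], int n) {
         for (int half = n / 2; half >= 1; half /= 2)
             for (int i = 0; i < n; i++)
                 if ((i & half) == 0)              /* pair (i, i + half) */
                     compare_exchange(&a[i], &a[i + half]);
     }

     int main(void) {
         double a[] = {1, 4, 6, 8, 7, 5, 3, 2};    /* bitonic sequence */
         bitonic_merge(a, 8);
         for (int i = 0; i < 8; i++) printf("%g ", a[i]);
         printf("\n");                              /* 1 2 3 4 5 6 7 8 */
         return 0;
     }
     ```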

  4. Topics
     β€’ PRAM algorithm examples
     β€’ Memory systems
       – organization
       – caches and the memory hierarchy
       – influence of the memory hierarchy on algorithms
     β€’ Shared memory systems
       – taxonomy of actual shared memory systems: UMA, NUMA, cc-NUMA

  5. Recall PRAM shared memory system
     β€’ PRAM model
       – assumes access latency is constant, regardless of the value of p or the size of the memory m
       – simultaneous reads permitted under the CR model, simultaneous writes permitted under the CW model
     β€’ Physically impossible to realize
       [diagram: processors 1 … p connected to a shared memory]
       – processors and memory occupy physical space, so speed-of-light limitations imply
         L(p, m) = Ω((p + m)^(1/3))
       – CR / CW must be reduced to ER / EW
         β€’ requires Θ(lg p) time in the general case

  6. Anatomy of a processor–memory system
     β€’ Performance parameters of Random Access Memory (RAM)
       – latency L
         β€’ elapsed time from presentation of the memory address to arrival of the data:
           address transit time + memory access time t_mem + data transit time
       – bandwidth W
         β€’ number of values (e.g. 64-bit words) delivered to the processor per unit time
         β€’ simple implementation: W ~ 1/L
     [diagram: processor connected to memory]
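     To make W ~ 1/L concrete, a worked instance with assumed numbers (not from the slide): if L = 100 processor cycles and each request returns one 64-bit word, a simple implementation that waits for each reply before issuing the next request delivers W = 1/100 = 0.01 words per cycle.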

  7. Processor vs. memory performance
     β€’ The memory β€œwall”
       – processors compute faster than memory delivers data
       – an increasing imbalance: t_arith β‰ͺ t_mem

  8. Improving memory system performance (1)
     β€’ Decrease latency L to memory
       – speed of light is a limiting factor
         β€’ bring memory closer to the processor
       – decrease memory access time by decreasing memory size s
         β€’ access time ∝ s^(1/2) (VLSI)
       – use faster memory technology
         β€’ DRAM (Dynamic RAM): 1 transistor per stored bit
           – high density, low power, long access time, low cost
         β€’ SRAM (Static RAM): 6 transistors per stored bit
           – low density, high power, short access time, high cost

  9. Improving memory system performance (1)
     β€’ Decrease latency using cache memory
       – low-latency access to frequently used values, high latency for the remaining values
       [diagram: processor – cache – memory]
       – Example
         β€’ 90% of references are to cache, with latency L_1
         β€’ 10% of references are to memory, with latency L_2
         β€’ average latency is 0.9 L_1 + 0.1 L_2
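     A worked instance, plugging in the RS6000 numbers from the hierarchy slide later in the deck (L_1 = 2 cycles, L_2 = 60 cycles): average latency = 0.9 Β· 2 + 0.1 Β· 60 = 7.8 cycles, versus 60 cycles with no cache at all.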

  10. Improving memory system performance (2)
      β€’ Increase bandwidth W
        – multiport (parallel access) memory
          β€’ multiple reads and multiple exclusive writes per memory cycle
          β€’ high cost, very limited scalability (e.g. a processor register file)
        – β€œblocked” memory
          β€’ memory supplies a block of size b containing the requested word
          β€’ supports spatial locality in cache access
          [diagram: processor – cache – memory, with transfers in blocks of size b]
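      As a concrete illustration of the spatial locality that blocked transfers reward (an added sketch, not from the slides): with b-word cache blocks, the row-major loop below touches each block once per b elements, while the column-major loop misses on nearly every access once the array exceeds the cache size.

      ```c
      #define N 1024
      double a[N][N];            /* C stores this array row by row */

      /* unit stride through memory: ~1 miss per b-element block */
      double sum_row_major(void) {
          double s = 0.0;
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  s += a[i][j];
          return s;
      }

      /* stride of N doubles: ~1 miss per access when N*N exceeds cache */
      double sum_col_major(void) {
          double s = 0.0;
          for (int j = 0; j < N; j++)
              for (int i = 0; i < N; i++)
                  s += a[i][j];
          return s;
      }
      ```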

  11. Improving memory system performance (2)
      β€’ Increase bandwidth W (contd)
        – pipeline memory requests
          β€’ requires independent memory references
        – interleave memory
          β€’ problem: a single memory's access rate is limited by t_mem
          β€’ use m separate memories (memory banks)
          β€’ W ~ m / L if references distribute evenly over the memory banks
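      A minimal sketch of low-order interleaving, with an assumed bank count M = 16 (a hypothetical parameter, not from the slide): consecutive words land in consecutive banks, so a unit-stride stream of references keeps all M banks busy and W approaches M/L.

      ```c
      #include <stdint.h>

      #define M 16   /* number of banks; must be a power of two here */

      /* which bank holds this word, and where within that bank */
      static inline unsigned bank(uint64_t word_addr)   { return word_addr & (M - 1); }
      static inline uint64_t  offset(uint64_t word_addr){ return word_addr >> 4; } /* lg M = 4 */
      ```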

  12. Latency hiding
      β€’ Amortize latency using a pipelined, interleaved memory system
        – k independent references complete in Θ(L + k Β· t_proc) time
          β€’ amortized (expected) latency per reference: (L + k Β· t_proc) / k = L/k + t_proc
      β€’ Where do we get independent references?
        – out-of-order execution of independent load/store operations
          β€’ found in most modern performance-oriented processors
          β€’ partial latency hiding: k ~ 2–10 references outstanding
        – vector load/store operations
          β€’ small vector units (AVX-512): vector length 2–8 words (Intel Xeon)
            – partial latency hiding
          β€’ high-performance vector units (NEC SX-9, SX-Aurora)
            – vector length k = L / t_proc (128–256 words)
            – crossbar network to highly interleaved memory (~16,000 banks)
            – full latency hiding: amortized memory access at processor speed
        – multithreaded operation
          β€’ independent execution threads with individual hardware contexts
            – partial latency hiding: 2-way hyperthreading (Intel)
            – full latency hiding: 128-way threading with high-performance memory (Cray MTA)
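      On the software side, a common way to expose independent references to an out-of-order core is to break serial dependences by unrolling with multiple accumulators. A hedged C sketch (illustrative, not from the slides; assumes n is a multiple of 4):

      ```c
      /* The four loads in each iteration are independent: no value from
       * one load is needed to issue the next, so the core can keep
       * k = 4 memory references in flight instead of waiting L cycles
       * for each one in turn. */
      double sum4(const double *x, int n) {
          double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          for (int i = 0; i < n; i += 4) {
              s0 += x[i];
              s1 += x[i + 1];
              s2 += x[i + 2];
              s3 += x[i + 3];
          }
          return (s0 + s1) + (s2 + s3);
      }
      ```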

  13. Implementing the PRAM
      β€’ How close can we come to O(1)-latency PRAM memory in practice?
        [diagram: memory banks M_1 … M_m connected through a network to processors P_1 … P_p]
        – requires a processor-to-memory network
          β€’ latency L = sum of
            – twice the network latency
            – the memory cycle time
            – serialization time for CR, CW
          β€’ L increases with m and p
        – L too large with current technology
        – examples: NYU Ultracomputer (1987), IBM RP3 (1991), SB-PRAM (1999)
          β€’ a logarithmic-depth combining network eliminates the memory contention time for CR, CW
          β€’ but the Θ(lg p) latency in the network is prohibitive

  14. Implementing PRAM – a compromise
      β€’ Use latency hiding with a high-performance memory system
        – implements a (p Β· k)-processor EREW PRAM slowed down by a factor of k
          β€’ use m β‰₯ p Β· (t_mem / t_proc) memory banks to match the memory reference rate of p processors
          β€’ total latency 2L for k = L / t_proc independent random references at each processor
          β€’ O(t_proc) amortized latency per reference at each processor
        – unit latency degrades in the presence of concurrent reads/writes
        [diagram: memory banks M_1 … M_m connected through a network to processors P_1 … P_p]
      β€’ Bottom line: doable, but very expensive and with only limited scaling in p
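      A worked instance with assumed numbers (not from the slide): if t_mem / t_proc = 100 and p = 64, matching the reference rate requires m β‰₯ 64 Β· 100 = 6,400 banks, on the order of the ~16,000-bank organization of the SX-class machines on the latency hiding slide. This is where the β€œvery expensive” verdict comes from.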

  15. Memory systems summary
      β€’ Memory performance
        – latency is limited by physics
        – bandwidth is limited by cost
      β€’ Cache memory: low-latency access to some values
        – caching frequently used values rewards temporal locality of reference
        – caching consecutive values rewards spatial locality of reference
        – decreases average latency
          β€’ 90 fast references + 10 slow references: effective latency = 0.9 L_1 + 0.1 L_2
      β€’ Parallel memories
        – 100 independent references β‰ˆ 100 fast references
        – relatively expensive
        – require parallel processing

  16. Simple uniprocessor memory hierarchy
      β€’ Each component is characterized by
        – capacity
        – block size
        – (associativity)
      β€’ Traffic between components is characterized by
        – access latency
        – transfer rate (bandwidth)
      β€’ Example: IBM RS6000/320H (ca. 1991)
        [diagram: hierarchy Disk – Main memory – Cache – Registers – ALU]

        Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
        ------------------|------------------|-----------------------------------
        Disk              | 1,000,000        | 0.001
        Main memory       | 60               | 0.1
        Cache             | 2                | 1
        Registers         | 0                | 3

  17. Cache operation
      β€’ ABC cache parameters
        – Associativity
        – Block size
        – Capacity
      β€’ CCC performance model – cache misses can be
        – compulsory
        – capacity
        – conflict

  18. Cache operation: read
      β€’ associativity = 256-way, block size = 64 bytes (512 bits), 40-bit address
      β€’ the address divides into tag <26>, index <8>, and block offset <6> fields
      [diagram: the index selects a cache set; the 26-bit address tag is compared against
       the tags of the valid entries in that set; on a match, a multiplexer selects the
       requested 1, 2, 4, or 8 bytes from the data block; each entry holds a <1>-bit valid
       flag, a <26>-bit tag, and <512> bits of data]
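      The field widths above translate directly into shift-and-mask address decomposition; a small C sketch with illustrative helper names (taken at face value, these parameters imply a 4 MB cache: 256 sets Γ— 256 ways Γ— 64 B):

      ```c
      #include <stdint.h>

      /* 40-bit address = tag <26> | index <8> | block offset <6> */
      static inline uint64_t blk_offset(uint64_t addr) { return addr & 0x3F; }             /* bits 0-5   */
      static inline uint64_t set_index (uint64_t addr) { return (addr >> 6) & 0xFF; }      /* bits 6-13  */
      static inline uint64_t tag       (uint64_t addr) { return (addr >> 14) & 0x3FFFFFF; }/* bits 14-39 */
      ```

      For a 4-byte load at address a, the cache probes set set_index(a) and, on a tag match, returns bytes blk_offset(a) through blk_offset(a)+3 of the block.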

  19. The changing memory hierarchy
      β€’ IBM RS6000/320H – 25 MHz (1991)

        Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
        ------------------|------------------|-----------------------------------
        Disk              | 1,000,000        | 0.001
        Main memory       | 60               | 0.1
        Cache             | 2                | 1
        Registers         | 1                | 3

      β€’ Intel Xeon 61xx [per core @ 3 GHz] (2017)

        Storage component | Latency (cycles) | Transfer rate (words [8B] / cycle)
        ------------------|------------------|-----------------------------------
        HDD               | 18,000,000       | 0.00007
        SSD               | 300,000          | 0.02
        Main memory       | 250              | 0.2
        L3 cache          | 48               | 0.5
        L2 cache          | 12               | 1
        L1 cache          | 4                | 2
        Registers         | 1                | 6

  20. Computational intensity: a key metric limiting performance
      β€’ Computational intensity of a problem:
        I = (total # of arithmetic operations required, in flops) /
            (size of input + size of result, in 64-bit words)
      β€’ BLAS – Basic Linear Algebra Subprograms
        – asymptotic performance is limited by computational intensity
        – with A, B, C ∈ ℝ^(nΓ—n), x, y ∈ ℝ^n, a ∈ ℝ:

          Level  | Name           | Definition  | Flops    | Refs      | I
          -------|----------------|-------------|----------|-----------|-----
          BLAS 1 | scale          | y = ax      | n        | 2n        | 0.5
          BLAS 1 | triad          | y = ax + y  | 2n       | 3n        | 0.67
          BLAS 1 | dot product    | x Β· y       | 2n       | 2n        | 1
          BLAS 2 | matrix-vector  | y = y + Ax  | 2nΒ² + n  | nΒ² + 3n   | ~2
          BLAS 2 | rank-1 update  | A = A + xyα΅€ | 2nΒ²      | 2nΒ² + 2n  | ~1
          BLAS 3 | matrix product | C = C + AB  | 2nΒ³      | 4nΒ²       | n/2
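      A hedged worked example combining this table with the Xeon figures from the previous slide, plus an assumed peak of 16 flops/cycle per core (e.g. with fused multiply-adds; this number is not stated in the deck): a core sustaining 0.2 words/cycle from main memory is compute-bound only when I β‰₯ 16 / 0.2 = 80. BLAS 1 and BLAS 2 kernels, with I ≀ 2, are therefore memory-bound regardless of n, while matrix product reaches I = n/2 β‰₯ 80 once n β‰₯ 160, provided the operands are staged through cache.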
