1. Distributed Memory Programming

Wolfgang Schreiner
Research Institute for Symbolic Computation (RISC-Linz)
Johannes Kepler University, A-4040 Linz, Austria
Wolfgang.Schreiner@risc.uni-linz.ac.at
http://www.risc.uni-linz.ac.at/people/schreine

2. SIMD Mesh Matrix Multiplication

Single Instruction, Multiple Data (SIMD):
• n² processors,
• 3n time.

Algorithm: see slide.

3. SIMD Mesh Matrix Multiplication

1. Precondition (stagger) the array:
   • Shift row i by i − 1 elements west,
   • Shift column j by j − 1 elements north.
2. Multiply and add on processor ⟨i, j⟩:
   c = Σ_k a_ik ∗ b_kj
• Inverted dimensions:
  – Matrix: ↓ i, → j.
  – Processor array: ↓ iyproc, → ixproc.
• n shift and n arithmetic operations.
• n² processors.

Maspar program: see slide.
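
To make the data movement concrete, here is a minimal sequential C sketch that simulates the n × n mesh with ordinary arrays (0-based indices, so row i shifts by i rather than i − 1). The function names, the wrap-around shifts, and the fixed size N are illustrative assumptions, not from the slides:

    #define N 4

    /* Stagger: shift row i of A west by i, column j of B north by j. */
    static void stagger(int A[N][N], int B[N][N])
    {
        int T[N][N];
        for (int i = 0; i < N; i++)          /* rows of A move west */
            for (int j = 0; j < N; j++)
                T[i][j] = A[i][(j + i) % N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = T[i][j];
        for (int i = 0; i < N; i++)          /* columns of B move north */
            for (int j = 0; j < N; j++)
                T[i][j] = B[(i + j) % N][j];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                B[i][j] = T[i][j];
    }

    void mesh_multiply(int A[N][N], int B[N][N], int C[N][N])
    {
        stagger(A, B);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0;
        for (int step = 0; step < N; step++) {
            for (int i = 0; i < N; i++)      /* every "processor" does one */
                for (int j = 0; j < N; j++)  /* multiply-add per step      */
                    C[i][j] += A[i][j] * B[i][j];
            for (int i = 0; i < N; i++) {    /* A shifts one step west */
                int a0 = A[i][0];
                for (int j = 0; j < N - 1; j++) A[i][j] = A[i][j + 1];
                A[i][N - 1] = a0;
            }
            for (int j = 0; j < N; j++) {    /* B shifts one step north */
                int b0 = B[0][j];
                for (int i = 0; i < N - 1; i++) B[i][j] = B[i + 1][j];
                B[N - 1][j] = b0;
            }
        }
    }

After the staggering, processor ⟨i, j⟩ sees a_ik and b_kj for the same k in every step, so the n multiply-adds accumulate exactly C_ij.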

4. SIMD Cube Matrix Multiplication

Cube of n³ processors, indexed by ixproc, iyproc, izproc.
[Figure: processor cube with axes nxproc, nyproc, nzproc, compass directions N, S, W plus U, D, and faces labeled A, B, C.]

Idea:
• Map A(i, j) to all P(j, i, k),
• Map B(i, j) to all P(i, k, j).

5. SIMD Cube Matrix Multiplication

Multiplication and addition:
• Each processor P_ijk computes a single product: c_ijk = a_ik ∗ b_kj.
• Bars along the x-direction are added up in P_0ij: C_ij = Σ_k c_ijk.
[Figure: cube with A(i,k), B(k,j), and the result face C(i,j).]
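
The idea can be written down directly; below is a sequential C sketch in which a three-dimensional array stands in for the processor cube (the names and the size N are illustrative assumptions):

    #define N 4

    void cube_multiply(int A[N][N], int B[N][N], int C[N][N])
    {
        static int c[N][N][N];           /* c[i][j][k] ~ processor P_ijk */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j][k] = A[i][k] * B[k][j];   /* one product each */
        for (int i = 0; i < N; i++)      /* add the bar along k */
            for (int j = 0; j < N; j++) {
                C[i][j] = 0;
                for (int k = 0; k < N; k++)
                    C[i][j] += c[i][j][k];
            }
    }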

6. SIMD Cube Matrix Multiplication

Maspar program:

    int A[N,N], B[N,N], C[N,N];
    plural int a, b, c;
    a = A[iyproc, ixproc];
    b = B[ixproc, izproc];
    c = a*b;
    for (i = 0; i < N-1; i++)
      if (ixproc > 0) c = xnetE[1].c;   /* pass partial values one step west */
      else c += xnetE[1].c;             /* accumulate at ixproc == 0 */
    if (ixproc == 0) C[iyproc, izproc] = c;

• O(n³) processors,
• O(n) time.

7. SIMD Cube Matrix Multiplication

Tree-like summation:

    plural int x, d;
    ...
    x = ixproc; d = 1;
    while (d < N) {
      if (x % 2 != 0) break;   /* this processor has sent; it drops out */
      c += xnetE[d].c;         /* fetch the partial sum from distance d east */
      x /= 2; d *= 2;
    }
    if (ixproc == 0) C[iyproc, izproc] = c;

• O(log n) time,
• O(n³) processors.

Long-distance communication required!
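
The same pattern on an ordinary array makes the O(log n) behavior visible; a minimal sequential C sketch, assuming n is a power of two (names are illustrative):

    #include <stdio.h>

    int tree_sum(int c[], int n)            /* n must be a power of two */
    {
        for (int d = 1; d < n; d *= 2)      /* d plays the xnetE[d] distance */
            for (int i = 0; i < n; i += 2 * d)
                c[i] += c[i + d];           /* "processor" i pulls from i+d */
        return c[0];                        /* the sum ends up at index 0 */
    }

    int main(void)
    {
        int c[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%d\n", tree_sum(c, 8));     /* prints 36 */
        return 0;
    }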

8. SIMD Hypercube Matrix Multiplication

[Figure: hypercubes of dimension d = 0, 1, 2, 3, 4 with binary processor labels, e.g. 000–111 for d = 3.]

• In a d-dimensional hypercube, the processors are indexed with d bits.
• If p1 and p2 differ in i bits, the shortest path between p1 and p2 has length i.
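
Since the shortest path length equals the Hamming distance of the two labels, it can be computed with a few bit operations; a small illustrative C helper:

    /* Shortest-path length between two hypercube nodes = Hamming distance. */
    int hypercube_distance(unsigned p1, unsigned p2)
    {
        unsigned x = p1 ^ p2;   /* bits in which the two labels differ */
        int dist = 0;
        while (x) {             /* count the set bits */
            dist += x & 1;
            x >>= 1;
        }
        return dist;
    }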

9. SIMD Hypercube Matrix Multiplication

Mapping of a cube of dimension n to a hypercube of dimension d:
• Hypercube of n³ = 2^d processors ⇒ d = 3s (for some s).
• 64 processors ⇒ n = 4, d = 6, s = 2.
[Figure: hypercube bits d5 d4 d3 d2 d1 d0 split into the cube coordinates x, y, z.]

• Embedding algorithm:
  – Write the cube indices in binary form (s bits each),
  – Concatenate the indices (3s = d bits).
• Neighbor processors in the cube remain neighbors in the hypercube.
• Any cube algorithm can be executed with the same efficiency on the hypercube.
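
A sketch of the embedding in C, assuming the bit layout of the figure (x in the high bits, z in the low bits; the function name is illustrative):

    /* Concatenate the s-bit cube coordinates into one d = 3s bit index. */
    unsigned cube_to_hypercube(unsigned x, unsigned y, unsigned z, int s)
    {
        return (x << (2 * s)) | (y << s) | z;
    }
    /* Example: s = 2, (x,y,z) = (1,2,3) -> 01 10 11 = 27. */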

10. SIMD Hypercube Matrix Multiplication

Tree summation in the hypercube (r = receive, s = send):

    Processor  000  001  010  011  100  101  110  111
    Step 1      r    s    r    s    r    s    r    s
    Step 2      r         s         r         s
    Step 3      r                   s

• Each processor receives values from neighboring processors only.
• Only short-distance communication is required.

The cube algorithm can be more efficient on the hypercube!
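
The schedule above, expressed as a sequential C sketch over 2^d values: in round r the partner differs in bit r, senders drop out, and node 0 ends up with the total (names are illustrative):

    int hypercube_sum(int v[], int d)           /* v has 2^d elements */
    {
        for (int bit = 1; bit < (1 << d); bit <<= 1)
            for (int p = 0; p < (1 << d); p++)
                if ((p & (2 * bit - 1)) == 0)   /* p still active: receive */
                    v[p] += v[p | bit];         /* from the 1-bit neighbor */
        return v[0];
    }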

11. Row/Column-Oriented Matrix Multiplication

[Figure: matrices A, B, and C.]

1. Load A_i on every processor P_i.
2. For all P_i do:
     for j = 0 to N-1
       Receive B_j from root
       C_ij = A_i * B_j
3. Collect the C_i.

Broadcasting of each B_j ⇒ step 2 takes O(N log N) time.
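
In message-passing terms, step 2 could look roughly like the following MPI sketch for p = N processes; the flat row-major layout, the block width n/p, and the function name are assumptions for illustration (on rank 0, B_j must hold column block j before each broadcast):

    #include <mpi.h>

    void row_col_multiply(const double *A_i,  /* my (n/p) x n row block   */
                          double *B_j,        /* n x (n/p) column buffer  */
                          double *C_i,        /* my (n/p) x n result rows */
                          int n, int p)
    {
        int nb = n / p;                        /* block width */
        for (int j = 0; j < p; j++) {
            /* root owns B and ships column block j to everyone */
            MPI_Bcast(B_j, n * nb, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            for (int r = 0; r < nb; r++)       /* C_ij = A_i * B_j */
                for (int c = 0; c < nb; c++) {
                    double s = 0.0;
                    for (int k = 0; k < n; k++)
                        s += A_i[r * n + k] * B_j[k * nb + c];
                    C_i[r * n + j * nb + c] = s;
                }
        }
    }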

12. Ring Algorithm

See Quinn, Figure 7-15.

Change the order of the multiplications by using a ring of processors.

1. Load A_i and B_i on every processor P_i.
2. For all P_i do:
     p = (i+1) mod N
     j = i
     for k = 0 to N-1 do
       C_ij = A_i * B_j
       j = (j+1) mod N
       Receive B_j from P_p
3. Collect the C_i.

Point-to-point communication ⇒ step 2 takes O(N) time.
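
A hedged MPI sketch of step 2: each process keeps its row block A_i while the column blocks rotate around the ring via point-to-point messages (layout and names as in the previous sketch, again assumptions; B_j initially holds column block number rank):

    #include <mpi.h>

    void ring_multiply(const double *A_i, double *B_j, double *C_i,
                       int n, int p, int rank)
    {
        int nb = n / p;
        int next = (rank + 1) % p;         /* we receive from the successor */
        int prev = (rank + p - 1) % p;     /* and forward to the predecessor */
        int j = rank;                      /* index of the B block we hold */
        for (int k = 0; k < p; k++) {
            for (int r = 0; r < nb; r++)   /* C_ij = A_i * B_j */
                for (int c = 0; c < nb; c++) {
                    double s = 0.0;
                    for (int kk = 0; kk < n; kk++)
                        s += A_i[r * n + kk] * B_j[kk * nb + c];
                    C_i[r * n + j * nb + c] = s;
                }
            j = (j + 1) % p;               /* rotate: receive B_j from next */
            MPI_Sendrecv_replace(B_j, n * nb, MPI_DOUBLE,
                                 prev, 0, next, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }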

13. Hypercube Algorithm

Problem: How can the ring be embedded into the hypercube?

• Simple solution H(i) = i:
  – Ring processor i is mapped to hypercube processor H(i).
  – Massive non-neighbor communication!
• How to preserve neighbor-to-neighbor communication? (See Quinn, Figure 5-13.)
• Requirements for H(i):
  – H must be a 1-to-1 mapping.
  – H(i) and H(i+1) must differ in 1 bit.
  – H(0) and H(N−1) must differ in 1 bit.

Can we construct such a function H?

14. Ring Successor

Assume H is given.

• Given: hypercube processor number i.
• Wanted: the "ring successor" S(i):

    S(i) = 0                  if i = H(N−1),
    S(i) = H(H⁻¹(i) + 1)      otherwise.

(The wrap-around case maps the last ring position back to H(0), which is 0 for the Gray codes constructed below.)

The same technique embeds a 2-D mesh into a hypercube (see Quinn, Figure 5-14).

15. Gray Codes

Recursive construction.

• 1-bit Gray code G_1:

    i   G_1(i)
    0   0
    1   1

• n-bit Gray code G_n (reflect G_{n−1} and prefix the two halves with 0 and 1):

    i              G_n(i)
    0              0 G_{n−1}(0)
    1              0 G_{n−1}(1)
    ...            ...
    2^{n−1} − 1    0 G_{n−1}(2^{n−1} − 1)
    2^{n−1}        1 G_{n−1}(2^{n−1} − 1)
    ...            ...
    2^n − 2        1 G_{n−1}(1)
    2^n − 1        1 G_{n−1}(0)

• The required properties are preserved by the construction!

H(i) = G(i) = i xor ⌊i/2⌋.

16. Gray Code Computation

C functions.

• Gray code:

    int G(int i)
    {
      return(i ^ (i/2));
    }

• Inverse Gray code:

    int G_inv(int i)
    {
      int answer, mask;
      answer = i;
      mask = answer/2;
      while (mask > 0) {
        answer = answer ^ mask;
        mask = mask / 2;
      }
      return(answer);
    }
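
Combining these functions with the ring successor of slide 14 gives, for example (the combination is my own illustration; N must be a power of two):

    /* Ring successor of hypercube node i under H = G. */
    int ring_successor(int i, int N)
    {
        if (i == G(N - 1))        /* last ring position wraps to G(0) = 0 */
            return 0;
        return G(G_inv(i) + 1);   /* next ring position, mapped back */
    }
    /* For N = 8 the ring visits 0,1,3,2,6,7,5,4; each hop flips one bit. */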

17. Block-Oriented Algorithm

    A = ( A11  A12 )    B = ( B11  B12 )
        ( A21  A22 )        ( B21  B22 )

    C = ( C11  C12 ) = ( A11 B11 + A12 B21    A11 B12 + A12 B22 )
        ( C21  C22 )   ( A21 B11 + A22 B21    A21 B12 + A22 B22 )

• Use the block-oriented distribution introduced for shared-memory multiprocessors: block-matrix multiplication is analogous to scalar matrix multiplication.
• Use the staggering technique introduced for the 2-D SIMD mesh: rotation along rows and columns.
• Perform the SIMD matrix multiplication algorithm on whole submatrices: submatrices are multiplied and shifted.
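
The step that replaces the scalar multiply-add of the SIMD mesh algorithm is a whole submatrix multiply-accumulate; a minimal C sketch for nb × nb blocks in row-major layout (illustrative):

    /* C += A * B on one nb x nb block pair, accumulated across the shifts. */
    void block_multiply_add(int nb, const double *A, const double *B,
                            double *C)
    {
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++) {
                double s = 0.0;
                for (int k = 0; k < nb; k++)
                    s += A[i * nb + k] * B[k * nb + j];
                C[i * nb + j] += s;
            }
    }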

18. Analysis of Algorithm

n × n matrix, p processors.

• Row/column-oriented:
  – Computation per iteration: n²/p ∗ n/p = n³/p².
  – Communication per iteration: 2(λ + βn²/p).
  – p iterations.
• Block-oriented (staggering ignored):
  – Computation per iteration: n²/p ∗ n/√p = n³/p^(3/2).
  – Communication per iteration: 4(λ + βn²/p).
  – √p − 1 iterations.
• Comparison of the total communication costs:

    2p(λ + βn²/p) > 4(√p − 1)(λ + βn²/p)
    2λp + 2βn² > 4λ(√p − 1) + 4β(√p − 1)n²/p

  This holds because
  1. p > 2(√p − 1) (the λ terms), and
  2. 1 > 2(√p − 1)/p (the β terms),

  which are true for all p ≥ 1.

Also including staggering, for larger p the block-oriented algorithm performs better!
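
For concreteness, the two communication totals can be tabulated for sample machine parameters; λ (latency), β (per-element transfer time), and n below are made-up values, only the comparison matters:

    #include <stdio.h>
    #include <math.h>   /* link with -lm */

    int main(void)
    {
        double lambda = 100.0, beta = 1.0, n = 1024.0;
        for (int p = 4; p <= 256; p *= 4) {
            double per_msg = lambda + beta * n * n / p;
            double row_col = 2.0 * p * per_msg;            /* p iterations */
            double block   = 4.0 * (sqrt((double)p) - 1.0) * per_msg;
            printf("p=%3d  row/col=%12.0f  block=%12.0f\n",
                   p, row_col, block);
        }
        return 0;
    }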
