MPI Datatype Processing Using Runtime Compilation


  1. MPI Datatype Processing Using Runtime Compilation. Timo Schneider, Fredrik Kjolstad, Torsten Hoefler.

  2. What Your Vendor Sold

  3. What Your Applications Get: 10% of ping-pong performance.

  4. What Your Applications Get: 10% of ping-pong performance. Why?

  5. What Your Applications Get: 10% of ping-pong performance. Why? And how do we measure it?

  6. What MPI Offers

Manual packing:

    sbuf = malloc(N * sizeof(double));
    rbuf = malloc(N * sizeof(double));
    for (i = 1; i < N-1; ++i)
        sbuf[i] = data[i*N + N-1];
    MPI_Isend(sbuf, ...);
    MPI_Irecv(rbuf, ...);
    MPI_Waitall(...);
    for (i = 1; i < N-1; ++i)
        data[i*N] = rbuf[i];
    free(sbuf);
    free(rbuf);

MPI datatypes:

    MPI_Datatype nt;
    MPI_Type_vector(N-2, 1, N, MPI_DOUBLE, &nt);
    MPI_Type_commit(&nt);
    MPI_Isend(&data[N + N-1], 1, nt, ...);
    MPI_Irecv(&data[N], 1, nt, ...);
    MPI_Waitall(...);
    MPI_Type_free(&nt);

• No explicit copying
• Less code
• Often slower than manual packing (see [1])

[1] Schneider, Gerstenberger, Hoefler: Micro-Applications for Communication Data Access Patterns and MPI Datatypes.
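For context, a minimal compilable version of the datatype variant might look like the sketch below. The peer rank, tag, and communicator are illustrative assumptions; they are not specified on the slide.

    /* Sketch: exchange the last and first interior column of an N x N
     * row-major matrix of doubles using an MPI vector datatype. */
    #include <mpi.h>

    void exchange_columns(double *data, int N, int peer)
    {
        MPI_Datatype nt;
        MPI_Request req[2];

        /* N-2 blocks of 1 double, N doubles apart: one matrix column
         * without the corner elements */
        MPI_Type_vector(N-2, 1, N, MPI_DOUBLE, &nt);
        MPI_Type_commit(&nt);

        MPI_Isend(&data[N + N-1], 1, nt, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&data[N],       1, nt, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        MPI_Type_free(&nt);
    }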

  7. Interpretation vs. Compilation

MPI DDTs are interpreted at runtime, while manual pack loops are compiled. Example datatype:

    bt = Vector(2, 1, 2, MPI_BYTE)
    nt = Vector(N, 1, 4, bt)

Internal representation (a tree; the concrete size/extent values shown correspond to N = 5):

    Vector:    count: N, blklen: 1, stride: 4, size: 10, extent: 51
      Vector:    count: 2, blklen: 1, stride: 2, size: 2, extent: 3
        Primitive: size: 1, extent: 1

Interpreter:

    if (dt.type == VECTOR) {
        tin = inbuf; tout = outbuf;
        for (int i = 0; i < dt.count; i++) {
            for (b = 0; b < dt.blklen; b++) {
                interpret(dt.basetype, tin, tout);
            }
            tin  += dt.stride * dt.base.extent;
            tout += dt.blklen * dt.base.size;
        }
        inbuf  += dt.extent;
        outbuf += dt.size;
    }

  8. Interpretation vs. Compilation

MPI DDTs are interpreted at runtime, while manual pack loops are compiled.

    if (dt.type == VECTOR) {
        tin = inbuf; tout = outbuf;
        for (int i = 0; i < dt.count; i++) {
            for (b = 0; b < dt.blklen; b++) {
                interpret(dt.basetype, tin, tout);
            }
            tin  += dt.stride * dt.base.extent;
            tout += dt.blklen * dt.base.size;
        }
        inbuf  += dt.extent;
        outbuf += dt.size;
    }

• None of these variables are known when this code is compiled.
• Many nested loops.
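To make the interpretation overhead concrete, here is a minimal sketch of such a tree-walking pack interpreter in C. The type and field names are illustrative, not an actual MPI implementation's internals:

    #include <string.h>

    /* Illustrative datatype tree node. */
    typedef struct ddt {
        enum { PRIMITIVE, VECTOR } type;
        int count, blklen, stride;  /* stride in units of the base extent */
        int size, extent;           /* packed size / memory footprint, bytes */
        struct ddt *base;           /* NULL for primitives */
    } ddt;

    /* Pack one instance of dt from in to out. Every primitive element
     * costs a recursive call plus branches on runtime-only values. */
    static void interpret(const ddt *dt, const char *in, char *out)
    {
        if (dt->type == PRIMITIVE) {
            memcpy(out, in, dt->size);
            return;
        }
        for (int i = 0; i < dt->count; i++) {
            const char *tin = in + i * dt->stride * dt->base->extent;
            char *tout = out + i * dt->blklen * dt->base->size;
            for (int b = 0; b < dt->blklen; b++) {
                interpret(dt->base, tin + b * dt->base->extent,
                          tout + b * dt->base->size);
            }
        }
    }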

  9. Interpretation vs. Compilation

MPI DDTs are interpreted at runtime, while manual pack loops are compiled. The equivalent compiled pack loop for the example datatype:

    for (int i = 0; i < N; ++i) {
        for (j = 0; j < 2; ++j) {
            outbuf[j] = inbuf[j*2];
        }
        inbuf  += 3*4;
        outbuf += 2;
    }

  10. Interpretation vs. Compilation

• Loop unrolling

    for (int i = 0; i < N; ++i) {
        for (j = 0; j < 2; ++j) {
            outbuf[j] = inbuf[j*2];
        }
        inbuf  += 3*4;
        outbuf += 2;
    }

  11. Interpretation vs. Compilation

• Loop unrolling
• Constant propagation

    for (int i = 0; i < N; ++i) {
        int j = 0;
        outbuf[j]   = inbuf[j*2];
        outbuf[j+1] = inbuf[(j+1)*2];
        inbuf  += 3*4;
        outbuf += 2;
    }

  12. Interpretation vs. Compilation

• Loop unrolling
• Constant propagation
• Strength reduction

    for (int i = 0; i < N; ++i) {
        outbuf[0] = inbuf[0];
        outbuf[1] = inbuf[2];
        inbuf  += 12;
        outbuf += 2;
    }

  13. Interpretation vs. Compilation

• Loop unrolling
• Constant propagation
• Strength reduction

Eliminating the induction variable turns the counted loop into a pointer-bound loop:

    bound = outbuf + 2*N;
    while (outbuf < bound) {
        outbuf[0] = inbuf[0];
        outbuf[1] = inbuf[2];
        inbuf  += 12;
        outbuf += 2;
    }

  14. Interpretation vs. Compilation

• Loop unrolling
• Constant propagation
• Strength reduction
• Unrolling of the outer loop

    bound = outbuf + 2*N;
    while (outbuf < bound) {
        outbuf[0] = inbuf[0];
        outbuf[1] = inbuf[2];
        inbuf  += 12;
        outbuf += 2;
    }

  15. Interpretation vs. Compilation

• Loop unrolling
• Constant propagation
• Strength reduction
• Unrolling of the outer loop
• SIMDization

    bound = outbuf + 2*N;
    while (outbuf < bound) {
        outbuf[0] = inbuf[0];
        outbuf[1] = inbuf[2];
        inbuf  += 12;
        outbuf += 2;
    }
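As an illustration (ours, not shown on the slides), unrolling the outer loop by two would yield something like the following, assuming N is even; SIMDization would then replace the scalar copies with vector loads and stores:

    bound = outbuf + 2*N;
    while (outbuf < bound) {     /* assumes N is even */
        outbuf[0] = inbuf[0];
        outbuf[1] = inbuf[2];
        outbuf[2] = inbuf[12];
        outbuf[3] = inbuf[14];
        inbuf  += 24;
        outbuf += 4;
    }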

  16. Runtime-Compiled Pack Functions

    MPI_Type_vector(cnt, blklen, ...)  ->  record the arguments in an internal
                                           representation (a tree of C++ objects)
    MPI_Type_commit(new_ddt)           ->  generate a pack(*in, cnt, *out)
                                           function in LLVM IR, compile it to
                                           machine code, store the function pointer
    MPI_Send(buf, cnt, new_ddt, ...)   ->  new_ddt.pack(buf, cnt, tmpbuf);
                                           PMPI_Send(..., tmpbuf, MPI_BYTE, ...)
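A minimal sketch of the interception step, assuming a hypothetical lookup_pack_fn that maps a committed datatype to its compiled pack routine (this name is our invention, not the library's actual API):

    #include <mpi.h>
    #include <stdlib.h>

    typedef void (*pack_fn)(const void *in, int count, void *out);
    extern pack_fn lookup_pack_fn(MPI_Datatype ddt);  /* hypothetical */

    int MPI_Send(const void *buf, int count, MPI_Datatype ddt,
                 int dest, int tag, MPI_Comm comm)
    {
        pack_fn pack = lookup_pack_fn(ddt);
        if (!pack)  /* fall back to the MPI library's own path */
            return PMPI_Send(buf, count, ddt, dest, tag, comm);

        int size;
        MPI_Type_size(ddt, &size);       /* packed bytes per element */
        void *tmpbuf = malloc((size_t)size * count);
        pack(buf, count, tmpbuf);        /* runtime-compiled pack function */
        int err = PMPI_Send(tmpbuf, size * count, MPI_BYTE, dest, tag, comm);
        free(tmpbuf);
        return err;
    }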

  17. Copying Blocks

• Even for non-contiguous transfers, the “leaves” of the DDT are consecutive blocks.
• It is important that we copy those blocks as efficiently as possible.
• If the contiguous block is smaller than 256 B, we completely unroll the loop around it.
• Use the fastest available instruction (SSE2 on our test system).
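For illustration, a fully unrolled copy of one 64 B block with SSE2 might look like this; it is a sketch of the idea, not the library's generated code, and assumes the block size is a multiple of 16 B:

    #include <emmintrin.h>  /* SSE2 */

    /* Copy a 64 B contiguous block with four unaligned 16 B SSE2 moves.
     * The runtime compiler would emit such a straight-line sequence for
     * any block below the 256 B unrolling threshold. */
    static void copy_block64(const char *in, char *out)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)(in +  0));
        __m128i b = _mm_loadu_si128((const __m128i *)(in + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(in + 32));
        __m128i d = _mm_loadu_si128((const __m128i *)(in + 48));
        _mm_storeu_si128((__m128i *)(out +  0), a);
        _mm_storeu_si128((__m128i *)(out + 16), b);
        _mm_storeu_si128((__m128i *)(out + 32), c);
        _mm_storeu_si128((__m128i *)(out + 48), d);
    }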

  18. Block Copy Performance

Block copies are up to 35% faster. In-cache measurement on an AMD Interlagos CPU (Blue Waters test system).

  19. Packing Vectors

• The vector count and the size and extent of the subtype are always known at commit time.
• Use this to eliminate induction variables and reduce loop overhead.
• Unroll the innermost loop 16 times.

  20. Vector Packing Performance

    HVector(2, 1, 6144) of Vector(8, 8, 32) of Contig(6) of MPI_FLOAT

This datatype is used by the quantum chromodynamics code MILC [2]; the runtime-compiled pack function is 14x faster. In-cache measurement on an AMD Interlagos CPU (Blue Waters test system).

[2] Bernard et al.: Studying Quarks and Gluons on MIMD Parallel Computers.
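Reading each triple as (count, blocklength, stride) - an assumption on our part - this type could be built with standard MPI calls as follows; note that the hvector stride is in bytes:

    #include <mpi.h>

    /* Sketch of the MILC pack type from the slide. */
    MPI_Datatype contig, vec, hvec;
    MPI_Type_contiguous(6, MPI_FLOAT, &contig);
    MPI_Type_vector(8, 8, 32, contig, &vec);          /* stride in elements */
    MPI_Type_create_hvector(2, 1, 6144, vec, &hvec);  /* stride in bytes   */
    MPI_Type_commit(&hvec);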

  21. Irregular Datatypes

Depending on the length of the index list, either inline the indices into the code:

    copy(inb + off[0], outb + ..., len[0])
    copy(inb + off[1], outb + ..., len[1])
    copy(inb + off[2], outb + ..., len[2])

or minimize loop overhead by unrolling the loop over the index list:

    for (i = 0; i < idx.len; i += 3) {
        inb0 = load(idx[i+0]) + inb;
        inb1 = load(idx[i+1]) + inb;
        inb2 = load(idx[i+2]) + inb;
        // load outb and len the same way
        copy(inb0, outb0, len0);
        copy(inb1, outb1, len1);
        copy(inb2, outb2, len2);
    }
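A plain-C sketch of the second strategy for an hindexed-style type; the struct and field names are illustrative, and the block count is assumed to be a multiple of 3 for brevity:

    #include <string.h>

    /* Illustrative hindexed description: n blocks with byte
     * displacements and block lengths. */
    typedef struct {
        int   n;
        long *disp;   /* byte offset of each block in the input buffer */
        int  *len;    /* length of each block in bytes */
    } hindexed;

    static void pack_hindexed(const hindexed *t, const char *inb, char *outb)
    {
        for (int i = 0; i < t->n; i += 3) {  /* unrolled by 3 */
            memcpy(outb, inb + t->disp[i+0], t->len[i+0]); outb += t->len[i+0];
            memcpy(outb, inb + t->disp[i+1], t->len[i+1]); outb += t->len[i+1];
            memcpy(outb, inb + t->disp[i+2], t->len[i+2]); outb += t->len[i+2];
        }
    }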

  22. Irregular Packing Performance

Hindexed DDT with random displacements: 33% faster.

  23. What's the Catch?

• Emitting and compiling IR is expensive!
• Commit should tune the DDT, but we do not know how often it will be used: how much tuning is acceptable?
• Let's see how often the datatypes are reused in a real application!
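As a rough break-even model (our illustration, not from the slides), compilation pays off once the datatype is used at least

    n = (T_commit_compiled - T_commit_interpreted) / (T_pack_interpreted - T_pack_compiled)

times: the extra commit cost divided by the time saved per pack operation.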

  24. Performance Study: MILC

The 0-1x speedup bin is empty: we make nothing slower than Cray MPI.

  25. Performance Study: MILC

Most datatypes become seven times faster.

  26. Performance Study: MILC

Some are even 38 times faster. Packing is faster, but commit is now slower: how often do we need to use a DDT to break even?

  27. Performance Study: MILC

Most datatypes have to be reused 180-5000 times to break even.

  28. Performance Study: MILC

But some need 30000 uses to amortize their commit-time costs.

  29. Performance Hints for DDTs

• How often will the DDT be reused?
• How will it be used (Send/Recv/Pack/Unpack)?
• Will the buffer argument always be the same?
• Will the data to pack be in cache or not?
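One conceivable way to pass such hints - our sketch, not a mechanism the slides propose - is to cache them on the datatype through MPI's standard attribute interface before committing; an implementation would then look the attribute up at commit time:

    #include <mpi.h>

    /* The keyval and the hint semantics are invented for illustration. */
    static int reuse_keyval = MPI_KEYVAL_INVALID;
    static int expected_reuses = 10000;

    void hint_reuse(MPI_Datatype ddt)
    {
        if (reuse_keyval == MPI_KEYVAL_INVALID)
            MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN,
                                   MPI_TYPE_NULL_DELETE_FN,
                                   &reuse_keyval, NULL);
        MPI_Type_set_attr(ddt, reuse_keyval, &expected_reuses);
    }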

  30. Can We Beat Manual Packing?

  31. Future Work

• We do not yet support pipelining of packing and communication.
• Our packing library is not yet integrated into an MPI implementation; we use the MPI profiling interface (PMPI) to intercept calls.

http://spcl.inf.ethz.ch/Research/Parallel_Programming/MPI_Datatypes/libpack

  32. Thank You! Questions?
