SLIDE 1

MPI DATATYPE PROCESSING USING RUNTIME COMPILATION

TIMO SCHNEIDER, FREDRIK KJOLSTAD, TORSTEN HOEFLER

SLIDE 2

WHAT YOUR VENDOR SOLD

SLIDE 3

WHAT YOUR APPLICATIONS GET

10% of Ping-Pong performance

SLIDE 4

WHAT YOUR APPLICATIONS GET

10% of Ping-Pong performance

Why?

SLIDE 5

WHAT YOUR APPLICATIONS GET

10% of Ping-Pong performance

Why? How to measure?

SLIDE 6

WHAT MPI OFFERS

Manual packing:

  sbuf = malloc(N*sizeof(double));
  rbuf = malloc(N*sizeof(double));
  for (i=1; i<N-1; ++i) sbuf[i] = data[i*N+N-1];
  MPI_Isend(sbuf, …);
  MPI_Irecv(rbuf, …);
  MPI_Waitall(…);
  for (i=1; i<N-1; ++i) data[i*N] = rbuf[i];
  free(sbuf); free(rbuf);

MPI Datatypes:

  MPI_Datatype nt;
  MPI_Type_vector(N-2, 1, N, MPI_DOUBLE, &nt);
  MPI_Type_commit(&nt);
  MPI_Isend(&data[N+N-1], 1, nt, …);
  MPI_Irecv(&data[N], 1, nt, …);
  MPI_Waitall(…);
  MPI_Type_free(&nt);

  • No explicit copying
  • Less code
  • Often slower than manual packing (see [1])

[1] Schneider, Gerstenberger, Hoefler: Micro-Applications for Communication Data Access Patterns and MPI Datatypes
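To make the comparison concrete, here is an MPI-free sketch of the manual-packing side: it packs one interior column of an N x N row-major matrix, the access pattern that `MPI_Type_vector(N-2, 1, N, MPI_DOUBLE)` describes declaratively. The function name and the fixed N are illustrative, not from the talk.

```c
#include <assert.h>

enum { N = 6 };  /* small example size, an assumption for illustration */

/* Pack one interior column (rows 1 .. N-2) of a row-major N x N matrix:
 * one element per row, consecutive elements a full row (stride N) apart.
 * This is exactly what the vector datatype above describes. */
static void pack_column(const double data[N * N], double sbuf[N - 2], int col) {
    for (int i = 1; i < N - 1; ++i)
        sbuf[i - 1] = data[i * N + col];
}
```

With a datatype, the same gather is expressed once at type-creation time and the MPI library performs it inside the send.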

SLIDE 7

INTERPRETATION VS. COMPILATION

  • MPI DDTs are interpreted at runtime, while manual pack loops are compiled

  bt = Vector(2, 1, 2, MPI_BYTE)
  nt = Vector(N, 1, 4, bt)

Internal representation:

  Vector:    count: N  blklen: 1  stride: 4  size: 10  extent: 51
  Vector:    count: 2  blklen: 1  stride: 2  size: 2   extent: 3
  Primitive: size: 1   extent: 1

Interpreter:

  if (dt.type == VECTOR) {
    for (int i=0; i<dt.count; i++) {
      tin = inbuf; tout = outbuf;
      for (b=0; b<dt.blklen; b++) {
        interpret(dt.basetype, tin, tout);
      }
      tin  += dt.stride * dt.base.extent;
      tout += dt.blklen * dt.base.size;
    }
    inbuf  += dt.extent;
    outbuf += dt.size;
  }

SLIDE 8

INTERPRETATION VS. COMPILATION

  • None of these variables are known when this code is compiled
  • Many nested loops

SLIDE 9

INTERPRETATION VS. COMPILATION

The compiled pack loop for the example datatype:

  for (int i=0; i<N; ++i) {
    for (j=0; j<2; ++j) {
      outbuf[j] = inbuf[j*2];
    }
    inbuf  += 3*4;
    outbuf += 2;
  }

SLIDE 10

INTERPRETATION VS. COMPILATION

  • Loop unrolling

SLIDE 11

INTERPRETATION VS. COMPILATION

  for (int i=0; i<N; ++i) {
    int j = 0;
    outbuf[j]   = inbuf[j*2];
    outbuf[j+1] = inbuf[(j+1)*2];
    inbuf  += 3*4;
    outbuf += 2;
  }

  • Loop unrolling
  • Constant propagation

SLIDE 12

INTERPRETATION VS. COMPILATION

  for (int i=0; i<N; ++i) {
    outbuf[0] = inbuf[0];
    outbuf[1] = inbuf[2];
    inbuf  += 12;
    outbuf += 2;
  }

  • Loop unrolling
  • Constant propagation
  • Strength reduction

SLIDE 13

INTERPRETATION VS. COMPILATION

  bound = outbuf + 2*N;
  while (outbuf < bound) {
    outbuf[0] = inbuf[0];
    outbuf[1] = inbuf[2];
    inbuf  += 12;
    outbuf += 2;
  }

  • Loop unrolling
  • Constant propagation
  • Strength reduction

SLIDE 14

INTERPRETATION VS. COMPILATION

  • Loop unrolling
  • Constant propagation
  • Strength reduction
  • Unrolling of outer loop

SLIDE 15

INTERPRETATION VS. COMPILATION

  • Loop unrolling
  • Constant propagation
  • Strength reduction
  • Unrolling of outer loop
  • SIMDization
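The transformation chain above can be sanity-checked with a small sketch: a generic pack routine that reads every parameter at runtime (the interpreter's situation) versus the strength-reduced, unrolled loop with the example's constants baked in. Both produce the same bytes; function names and test constants are illustrative, not from the talk.

```c
#include <assert.h>

/* Interpreter-style strided pack: all datatype parameters are runtime
 * values, so the compiler can neither unroll nor fold the arithmetic. */
static void pack_interpreted(const char *in, char *out, int count,
                             int blklen, int elem_stride, int extent) {
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < blklen; ++j)
            out[j] = in[j * elem_stride];
        in += extent;
        out += blklen;
    }
}

/* The strength-reduced form from the slides: constants 2 (output elements
 * per step) and 12 (outer extent in bytes) baked in, loop bound turned
 * into a pointer comparison. */
static void pack_compiled(const char *in, char *out, int n) {
    char *bound = out + 2 * n;
    while (out < bound) {
        out[0] = in[0];
        out[1] = in[2];
        in += 12;
        out += 2;
    }
}
```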

SLIDE 16

RUNTIME-COMPILED PACK FUNCTIONS

  MPI_Type_vector(cnt, blklen, …)  →  record arguments in internal
                                      representation (tree of C++ objects)
  MPI_Type_commit(new_ddt)         →  generate pack(*in, cnt, *out) function
                                      using LLVM IR; compile to machine code;
                                      store function pointer
  MPI_Send(buf, cnt, new_ddt, …)   →  new_ddt.pack(buf, cnt, tmpbuf);
                                      PMPI_Send(… tmpbuf, MPI_BYTE)
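A heavily simplified stand-in for this flow: the real library emits LLVM IR at commit time and JIT-compiles it to machine code, whereas the sketch below only mimics the interface contract, in which commit installs a specialized pack-function pointer in the datatype handle and every later send calls it through that pointer. All names (`ddt_t`, `ddt_commit`, `pack_blklen1`) are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Invented handle for a vector datatype: parameters plus the pack
 * function pointer that "commit" fills in. */
typedef struct ddt {
    int count, blklen, stride;   /* vector description, in elements */
    void (*pack)(const struct ddt *, const double *, double *);
} ddt_t;

/* Specialization for blklen == 1: one copy per stride, no inner loop. */
static void pack_blklen1(const ddt_t *dt, const double *in, double *out) {
    for (int i = 0; i < dt->count; ++i)
        out[i] = in[(size_t)i * dt->stride];
}

/* Fallback for arbitrary block lengths. */
static void pack_general(const ddt_t *dt, const double *in, double *out) {
    for (int i = 0; i < dt->count; ++i)
        for (int b = 0; b < dt->blklen; ++b)
            *out++ = in[(size_t)i * dt->stride + b];
}

/* "Commit": pick the specialization once; every send reuses it. The real
 * system generates this function instead of selecting it. */
static void ddt_commit(ddt_t *dt) {
    dt->pack = (dt->blklen == 1) ? pack_blklen1 : pack_general;
}
```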

SLIDE 17

COPYING BLOCKS

  • Even for non-contiguous transfers, the “leaves” of the DDT are
    consecutive blocks
  • It is important that we copy those blocks as efficiently as possible
  • If the size of the contiguous block is less than 256 B, we completely
    unroll the loop around it
  • Use the fastest available instruction (SSE2 on our test system)
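A sketch of the small-block rule above: when the contiguous block size is known at commit time and under the 256 B threshold, emit a fully unrolled copy in wide units instead of a byte loop. The generated code would use SIMD moves (SSE2 on the test system); plain 8-byte moves and the 24 B block size stand in here as assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fully unrolled copy of one 24-byte leaf block, as three 8-byte moves.
 * memcpy with a compile-time size of 8 compiles to a single load/store
 * pair while staying alignment-safe. */
static void copy_block_24(const char *in, char *out) {
    uint64_t w;
    memcpy(&w, in +  0, 8); memcpy(out +  0, &w, 8);
    memcpy(&w, in +  8, 8); memcpy(out +  8, &w, 8);
    memcpy(&w, in + 16, 8); memcpy(out + 16, &w, 8);
}
```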

SLIDE 18

BLOCK COPY PERFORMANCE

[Chart: in-cache measurement on AMD Interlagos CPU (Blue Waters test system); callout: 35%]

SLIDE 19

PACKING VECTORS

  • The vector count and the size and extent of the subtype are always known
  • Use this to eliminate induction variables and reduce loop overhead
  • Unroll the innermost loop 16 times

SLIDE 20

VECTOR PACKING PERFORMANCE

HVector(2,1,6144) of Vector(8,8,32) of Contig(6) of MPI_FLOAT
This datatype is used by the quantum chromodynamics code MILC [2]

[Chart: in-cache measurement on AMD Interlagos CPU (Blue Waters test system); callout: 14x faster]

[2] Bernard et al.: Studying Quarks and Gluons on MIMD Parallel Computers

SLIDE 21

IRREGULAR DATATYPES

Depending on index list length, either inline the indexes into the code:

  copy(inb+off[0], outb+…, len[0])
  copy(inb+off[1], outb+…, len[1])
  copy(inb+off[2], outb+…, len[2])

or minimize loop overhead by unrolling the loop over the index list:

  for (i=0; i<idx.len; i+=3) {
    inb0 = load(idx[i+0]) + inb
    inb1 = load(idx[i+1]) + inb
    inb2 = load(idx[i+2]) + inb
    // load outb and len
    copy(inb0, outb0, len0)
    copy(inb1, outb1, len1)
    copy(inb2, outb2, len2)
  }
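The second strategy can be sketched concretely: offsets and lengths are in bytes, the unroll factor is three as on the slide, and for brevity the sketch assumes the block count is a multiple of three. The function name is invented.

```c
#include <assert.h>
#include <string.h>

/* Pack n blocks described by an index list of byte offsets and lengths,
 * with the loop over the list unrolled by three to amortize its overhead.
 * Assumes n is a multiple of 3. */
static void pack_hindexed3(const char *in, char *out,
                           const long *off, const int *len, int n) {
    for (int i = 0; i < n; i += 3) {
        memcpy(out, in + off[i],     len[i]);     out += len[i];
        memcpy(out, in + off[i + 1], len[i + 1]); out += len[i + 1];
        memcpy(out, in + off[i + 2], len[i + 2]); out += len[i + 2];
    }
}
```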

SLIDE 22

IRREGULAR PACKING PERFORMANCE

Hindexed DDT with random displacements: 33% faster

SLIDE 23

WHAT’S THE CATCH?

  • Emitting and compiling IR is expensive!
  • Commit should tune the DDT, but we do not know how often it will be
    used – how much tuning is ok?
  • Let’s see how often we need to reuse the datatypes in a real application!
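The break-even question is plain arithmetic: the extra commit-time cost divided by the per-pack saving, rounded up. A sketch with ceiling division; the nanosecond figures in the test are invented, while the MILC study on the next slides reports measured break-even points between roughly 180 and 30000 reuses.

```c
#include <assert.h>

/* Number of pack operations needed before the extra commit cost is
 * amortized by the per-pack saving (ceiling division, integer nanoseconds). */
static long breakeven_uses(long extra_commit_ns, long saving_per_pack_ns) {
    return (extra_commit_ns + saving_per_pack_ns - 1) / saving_per_pack_ns;
}
```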

SLIDE 24

PERFORMANCE STUDY: MILC

The 0–1 column is empty: we don’t make anything slower than Cray MPI.

SLIDE 25

PERFORMANCE STUDY: MILC

Most datatypes become seven times faster!

SLIDE 26

PERFORMANCE STUDY: MILC

  • Some datatypes are even 38 times faster
  • Packing is faster, but commit is now slower
  • How often do we need to use a DDT to break even?

SLIDE 27

PERFORMANCE STUDY: MILC

Most datatypes have to be reused 180–5000 times

SLIDE 28

PERFORMANCE STUDY: MILC

But some need 30,000 uses to amortize their commit-time cost

SLIDE 29

PERFORMANCE HINTS FOR DDTS

  • How often will the DDT be reused?
  • How will it be used (Send/Recv/Pack/Unpack)?
  • Will the buffer argument always be the same?
  • Will the data to pack be in cache or not?

SLIDE 30

CAN WE BEAT MANUAL PACKING?

SLIDE 31

FUTURE WORK

  • Currently we do not support pipelining of packing and communicating
  • Our packing library is not yet integrated with an MPI implementation –
    we use the MPI profiling interface to hijack calls

http://spcl.inf.ethz.ch/Research/Parallel_Programming/MPI_Datatypes/libpack

SLIDE 32

THANK YOU!

Questions?