Celling SHIM: Compiling Deterministic Concurrency to a Heterogeneous Multicore


slide-1
SLIDE 1

Celling SHIM: Compiling Deterministic Concurrency to a Heterogeneous Multicore

Nalini Vasudevan and Stephen A. Edwards

Columbia University in the City of New York, USA

March 2009

slide-2
SLIDE 2

Main Points

  • Scheduling-independent message passing works for parallel programming
  • We use the SHIM language
  • This paradigm helps to safely explore schedules
  • Compiler catches race-related bugs
  • Our compiler generates code that runs on the IBM CELL
  • Synthesizing communication is the trick

slide-3
SLIDE 3

A SHIM example

void h(chan int &A) {
  A = 4; send A;
  A = 2; send A;
}
void j(chan int A) throws Done {
  recv A;
  throw Done;
}
void f(chan int &A) throws Done {
  h(A); par j(A);
}
void g(chan int A) {
  recv A;
  recv A;
}
void main() {
  try {
    chan int A;
    f(A); par g(A);
  } catch (Done) {}
}

Five functions that call each other and communicate through channel A

slide-4
SLIDE 4

A SHIM example

Parents call children

slide-5
SLIDE 5

A SHIM example

h sends 4 on A, g and j rendezvous

slide-6
SLIDE 6

A SHIM example

j throws an exception. g and h are poisoned when they attempt to communicate

slide-7
SLIDE 7

A SHIM example

Concurrent processes terminate; control passes to the exception handler

slide-8
SLIDE 8

Task and Channel Structures

void foo(int a, int b) { chan int c; }

slide-9
SLIDE 9

Task and Channel Structures

void foo(int a, int b) { chan int c; }

struct {
  pthread_t …;
  pthread_mutex_t …;
  pthread_cond_t …;
  enum { …, …, … } state;
  int children; /* xxx */
  int a; /* formal */
  int b; /* formal */
} thread_foo;

slide-10
SLIDE 10

Task and Channel Structures

void foo(int a, int b) { chan int c; }

struct {
  pthread_mutex_t …;
  pthread_cond_t …;
  uint connected; /* … */
  uint blocked; /* … */
  uint poisoned; /* … */
  int *…;
} channel_c;

struct {
  pthread_t …;
  pthread_mutex_t …;
  pthread_cond_t …;
  enum { …, …, … } state;
  int children; /* xxx */
  int a; /* formal */
  int b; /* formal */
} thread_foo;

slide-11
SLIDE 11

Task and Channel Structures

void foo(int a, int b) { chan int c; }

struct {
  pthread_mutex_t …;
  pthread_cond_t …;
  uint connected; /* … */
  uint blocked; /* … */
  uint poisoned; /* … */
  int *…;
} channel_c;

struct {
  pthread_t …;
  pthread_mutex_t …;
  pthread_cond_t …;
  enum { …, …, … } state;
  int children; /* xxx */
  int a; /* formal */
  int b; /* formal */
} thread_foo;

void event_c() {
  if (c.connected == c.blocked) {
    // Communicate
  } else if (c.poisoned) {
    // Propagate exceptions
  }
}
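The two-way test in event_c can be sketched in plain C. This is only an illustration, not the compiler's generated code: it assumes that connected, blocked, and poisoned are bitmasks with one bit per task, and the names channel_t, channel_action, and channel_event are hypothetical. Clearing the blocked mask on a successful rendezvous is also an assumption.

```c
#include <stdint.h>

/* Hypothetical channel record; one bit per task in each mask (assumed). */
typedef struct {
    uint32_t connected; /* tasks that take part in communication on the channel */
    uint32_t blocked;   /* tasks currently waiting to send or receive */
    uint32_t poisoned;  /* tasks that have been poisoned by an exception */
} channel_t;

typedef enum { EV_NONE, EV_COMMUNICATE, EV_PROPAGATE } channel_action;

/* Mirrors the test in event_c: rendezvous once every connected task is
   blocked; otherwise propagate an exception if any poison is recorded. */
channel_action channel_event(channel_t *c)
{
    if (c->connected == c->blocked) {
        c->blocked = 0;          /* assumed: release all participants */
        return EV_COMMUNICATE;
    } else if (c->poisoned) {
        return EV_PROPAGATE;
    }
    return EV_NONE;              /* some task has not reached the channel yet */
}
```

A caller would set a task's bit in blocked before calling channel_event, much as the proxy code on a later slide does before calling _event_A.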

slide-12
SLIDE 12

Pthreads Implementation

void main() {
  try {
    chan int A;
    f(A); par g(A);
  } catch (Done) {}
}
void f(chan int &A) throws Done {
  h(A); par j(A);
}
void g(chan int A) {
  recv A;
  recv A;
}
void h(chan int &A) {
  A = 4; send A;
  A = 2; send A;
}
void j(chan int A) throws Done {
  recv A;
  throw Done;
}

struct { ... } _task_main;
void _func_main() { ... } // Code for task main

struct { ... } _chan_A;
void _event_A() { ... } // Synchronize on A

struct { ... } _task_f;
void _func_f() {
  // Code for task f
}

struct { ... } _task_g;
void _func_g() {
  // Code for task g
}

struct { ... } _task_h;
void _func_h() {
  // Code for task h
}

struct { ... } _task_j;
void _func_j() {
  // Code for task j
}
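The pattern of one record and one function per task, plus a record per channel guarded by a mutex and a condition variable, can be sketched with standard pthreads calls. This is a simplified sketch under stated assumptions, not the generated code: the channel here is one-sender, one-receiver rather than SHIM's multiway rendezvous, exceptions are left out, and chan_t, chan_send, chan_recv, task_h_t, task_g_t, and run_h_par_g are hypothetical names.

```c
#include <pthread.h>

/* Hypothetical one-sender, one-receiver channel (a simplification). */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int value;   /* value being communicated */
    int full;    /* 1 while a sent value waits for its receiver */
} chan_t;

/* Offers a value, then blocks until the receiver has taken it. */
static void chan_send(chan_t *c, int v)
{
    pthread_mutex_lock(&c->mutex);
    c->value = v;
    c->full = 1;
    pthread_cond_broadcast(&c->cond);
    while (c->full)
        pthread_cond_wait(&c->cond, &c->mutex);
    pthread_mutex_unlock(&c->mutex);
}

/* Blocks until the sender has offered a value, then takes it. */
static int chan_recv(chan_t *c)
{
    pthread_mutex_lock(&c->mutex);
    while (!c->full)
        pthread_cond_wait(&c->cond, &c->mutex);
    int v = c->value;
    c->full = 0;
    pthread_cond_broadcast(&c->cond);
    pthread_mutex_unlock(&c->mutex);
    return v;
}

/* One record and one function per task, echoing _task_h/_func_h and
   _task_g/_func_g. */
typedef struct { chan_t *A; } task_h_t;
typedef struct { chan_t *A; int got[2]; } task_g_t;

static void *func_h(void *arg)
{
    task_h_t *t = arg;
    chan_send(t->A, 4);
    chan_send(t->A, 2);
    return NULL;
}

static void *func_g(void *arg)
{
    task_g_t *t = arg;
    t->got[0] = chan_recv(t->A);
    t->got[1] = chan_recv(t->A);
    return NULL;
}

/* Runs a sender like h and a receiver like g in parallel and copies the
   two received values into got[]. Returns 0 on success, -1 on failure. */
int run_h_par_g(int got[2])
{
    chan_t A;
    pthread_t th;
    int rc = -1;

    pthread_mutex_init(&A.mutex, NULL);
    pthread_cond_init(&A.cond, NULL);
    A.value = 0;
    A.full = 0;

    task_h_t h = { &A };
    task_g_t g = { &A, { 0, 0 } };

    if (pthread_create(&th, NULL, func_h, &h) == 0) {
        func_g(&g);              /* the receiver runs on the calling thread */
        pthread_join(th, NULL);
        got[0] = g.got[0];
        got[1] = g.got[1];
        rc = 0;
    }
    pthread_cond_destroy(&A.cond);
    pthread_mutex_destroy(&A.mutex);
    return rc;
}
```

Whatever order the scheduler picks, the receiver sees 4 and then 2, because each send blocks until its value has been taken.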

slide-13
SLIDE 13
slide-14
SLIDE 14

IBM’s Cell Broadband Engine

slide-15
SLIDE 15

IBM’s Cell Broadband Engine

[Diagram: two PPEs, each with a 512K L2 cache, and sixteen SPEs, each with a 256K local store]
slide-16
SLIDE 16

IBM’s Cell Broadband Engine

[Diagram: two PPEs, each with a 512K L2 cache, and sixteen SPEs, each with a 256K local store, connected by two Element Interconnect Buses with 128-bit links in each direction]
slide-17
SLIDE 17

Adapting Pthreads Code to the Cell

struct { ... } _task_main;
void _func_main() { ... } // Code for main

struct { ... } _chan_A;
void _event_A() { ... } // Synchronize on A

struct { ... } _task_f;
void _func_f() {
  // Code for task f
}

struct { ... } _task_g;
void _func_g() {
  // Code for task g
}

struct { ... } _task_h;
void _func_h() {
  // Code for task h
}

struct { ... } _task_j;
void _func_j() {
  // Code for task j
}

slide-18
SLIDE 18

Adapting Pthreads Code to the Cell

PPE Code

struct { ... } _task_main;
void _func_main() { ... } // Code for main

struct { ... } _chan_A;
void _event_A() { ... } // Synchronize on A

struct { ... } _task_f;
void _func_f() {
  // Code for task f
}

struct { ... } _task_g;
void _func_g() {
  // Code for task g
}

struct { ... } _task_h;
void _func_h() {
  // Proxy for task h
}

struct { ... } _task_j;
void _func_j() {
  // Proxy for task j
}

On SPE 1

struct { ... } _task_h;
void main() {
  // Code for task h
}

On SPE 2

struct { ... } _task_j;
void main() {
  // Code for task j
}

slide-19
SLIDE 19

Communication Details

void j(chan int A) throws Done {
  recv A;
  throw Done;
}

struct { ... int A; } _task_j;

void _func_j() { // j’s proxy
  mailbox_send(START);
  for (;;) {
    switch (mailbox()) {
    case BLOCK_A:
      _chan_A._blocked |= h;
      _event_A();
      while (_chan_A.blocked & h)
        wait(_chan_A._cond);
      mailbox_send(ACK);
      break;
    case TERM: ...
    case POISON: ...
    }
  }
}

struct { int A; } _task_j;

void main() { // Code for task j
  for (;;) {
    if (mailbox() == EXIT) return;
    DMA_receive(_task_j.A);
    mailbox_send(BLOCK_A);
    if (mailbox() == POISON) break;
    DMA_receive(_task_j.A);
    mailbox_send(POISON);
  }
}

slide-20
SLIDE 20

Communication Details

1. Proxy wakes SPE

slide-21
SLIDE 21

Communication Details

1. Proxy wakes SPE
2. SPE DMAs arguments

slide-22
SLIDE 22

Communication Details

1. Proxy wakes SPE
2. SPE DMAs arguments
3. SPE blocks on A, notifies proxy

slide-23
SLIDE 23

Communication Details

1. Proxy wakes SPE
2. SPE DMAs arguments
3. SPE blocks on A, notifies proxy
4. Proxy communicates, notifies SPE

slide-24
SLIDE 24

Communication Details

1. Proxy wakes SPE
2. SPE DMAs arguments
3. SPE blocks on A, notifies proxy
4. Proxy communicates, notifies SPE
5. SPE DMAs new value

slide-25
SLIDE 25

Communication Details

1. Proxy wakes SPE
2. SPE DMAs arguments
3. SPE blocks on A, notifies proxy
4. Proxy communicates, notifies SPE
5. SPE DMAs new value
6. SPE poisons A, notifies proxy
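The SPE side of this exchange can be read as a small state machine driven by mailbox messages. The sketch below simulates it on an ordinary host with no Cell SDK calls. It is an illustration under assumptions: the message names follow the slide with an added M_ prefix, while spe_t, spe_state_t, spe_step, and the DMA counter are hypothetical, and unexpected messages are simply ignored, which the slide code does not spell out.

```c
/* Mailbox messages named after those on the slide (M_ prefix added here);
   the numeric values are arbitrary. */
typedef enum { M_NONE, M_START, M_EXIT, M_ACK, M_BLOCK_A, M_POISON } msg_t;

typedef enum { SPE_IDLE, SPE_BLOCKED_ON_A, SPE_DONE } spe_state_t;

typedef struct {
    spe_state_t state;
    int dma_transfers;   /* counts simulated DMA_receive calls */
} spe_t;

/* Feeds one message from the proxy to the SPE side of task j and returns
   the message the SPE sends back (M_NONE if it sends nothing). */
msg_t spe_step(spe_t *s, msg_t in)
{
    switch (s->state) {
    case SPE_IDLE:
        if (in == M_EXIT) {          /* proxy asks the SPE to stop */
            s->state = SPE_DONE;
            return M_NONE;
        }
        if (in != M_START)
            return M_NONE;           /* assumed: ignore anything unexpected */
        s->dma_transfers++;          /* step 2: fetch arguments by DMA */
        s->state = SPE_BLOCKED_ON_A;
        return M_BLOCK_A;            /* step 3: block on A, notify proxy */
    case SPE_BLOCKED_ON_A:
        if (in == M_POISON) {        /* another task raised an exception */
            s->state = SPE_DONE;
            return M_NONE;
        }
        if (in != M_ACK)
            return M_NONE;
        s->dma_transfers++;          /* step 5: fetch the new value of A */
        s->state = SPE_IDLE;
        return M_POISON;             /* step 6: "throw Done" poisons A */
    default:
        return M_NONE;               /* SPE_DONE: nothing more to do */
    }
}
```

Feeding M_START and then M_ACK reproduces steps 1 through 6: the SPE answers M_BLOCK_A, then M_POISON, with one simulated DMA transfer before each reply.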

slide-26
SLIDE 26

Running Times for the FFT on Varying SPEs

[Plot: execution time (s) against number of SPE tasks (PPU only, 1 to 6), with observed and ideal curves]

Run on a 20 MB audio file, 1024-point FFTs
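The slide does not say how the ideal curve was computed. One plausible reading, assumed here, is perfect linear scaling from the one-SPE time. The helper names ideal_time and parallel_efficiency are hypothetical, and no measured values from the talk are used.

```c
/* Time predicted on n SPEs if the one-SPE time t1 scaled perfectly
   (an assumed definition of "ideal"). */
double ideal_time(double t1, int n)
{
    return t1 / n;
}

/* Parallel efficiency: 1.0 means the observed time tn on n SPEs matches
   the ideal; smaller values mean time lost to communication or waiting. */
double parallel_efficiency(double t1, double tn, int n)
{
    return t1 / (n * tn);
}
```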

slide-27
SLIDE 27

Temporal Behavior of the FFT

[Timeline: 400 to 418 ms, one row each for 1 to 6 SPEs. Legend: blocked, comm. started, comm. completed]
slide-28
SLIDE 28

Running Times for the JPEG on Varying SPEs

[Plot: execution time (s) against number of SPE tasks (PPU only, 1 to 6), with observed and ideal curves]

Run on a 1.7 MB image that expands to a 29 MB raster file

slide-29
SLIDE 29

Temporal Behavior of the JPEG Decoder

[Timeline: 400 to 418 ms, one row each for 1 to 6 SPEs]

slide-30
SLIDE 30

Conclusions

  • SHIM code can be compiled to run on the Cell
  • Compiler takes care of synthesizing fussy communication code
  • Performance can be excellent for good communication/computation balance
  • Near-ideal speedup for embarrassingly parallel FFT
  • Performance not-so-great when communication outweighs computation
  • Amdahl’s revenge: sequential part of JPEG dominates
  • Need good temporal monitoring tools (not just gprof) to get effective speedups
  • SPE performance counters critical; had to be synchronized
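The "Amdahl’s revenge" point can be made concrete with Amdahl's law. The sketch below is the textbook formula, not a model fitted to the JPEG measurements; p is the fraction of run time that parallelizes and n is the number of SPEs, and both function names are hypothetical.

```c
/* Amdahl's law: speedup on n workers when a fraction p (0 <= p <= 1) of
   the run time parallelizes perfectly and the rest stays sequential. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Upper bound on speedup as n grows without limit; only meaningful for
   p < 1, since p == 1 would divide by zero. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

With half of the run time sequential (p = 0.5), no number of SPEs can give more than a twofold speedup, which is the effect the JPEG bullet describes.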