Filip Blagojević, Costin Iancu, Katherine Yelick, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, Benjamin Rose
(presented by Rajesh Nishtala; for questions, email upc@lbl.gov)

- Heterogeneous architectures are high-performance, cost-effective, and power-effective
- Cell BE, FPGA, GPGPU, Larrabee, Rapport Kilocore
- Execution model considered: off-loading
- Enables relatively easy and efficient porting of existing applications
- Achieves high performance and utilization of the architectures
- Off-loading requires efficient PPE-SPE communication and synchronization
- Idle time on accelerators
- Co-scheduling policies required for higher utilization
- Efficient chip utilization requires multigrain parallelism:
- PPE oversubscription required for performance (RAxML)
- Parallelization balance depends on application characteristics (1-N PPE-SPE)
- Efficient PPE-SPE synchronization is required
- The number of off-loads in an application is large (>100,000)
- Linux is unaware of the offloading execution
- Cell SDKs
- Callbacks - register event handler to respond to SPE requests
- Interrupt Mailboxes - PPE process performs blocking call to read a mailbox
- Busy-wait (S/P)PE polls on a shared memory variable
[Timeline figure: with busy-waiting, each PPE process (P1-P3) spins through a full OS quantum while the SPE tasks (S1-S3) complete, leaving SPEs idle between off-loads]

Busy waiting:
Wait_for_SPE(sync_t *flag) { while (!*flag); }
- Busy-wait with yielding: Yield-If-Not-Ready (YNR)
- Cooperative scheduling: Slack-Minimizing Event-Driven Scheduler (SLED)
- Work-Stealing
- YNR and SLED use a 1-1 PPE-SPE mapping; WS uses an any-any mapping
[Timeline figure: with YNR, each PPE process (P1-P3) yields whenever its SPE result is not ready, interleaving the processes while S1-S3 run]

Yield if Not Ready:
Wait_for_SPE(sync_t *flag) { while (!*flag) { sched_yield(); } }

[Timeline figure: with SLED, the PPE yields directly to the process whose SPE has finished, approaching the ideal schedule]

SLED (ideal):
SLED_Wait_for_SPE() { while (not_done) { determine which SPE is ready; yield_to(ready); } }
- Task scheduling: yield_to(pid) system call
- Signaling and task selection
- Shared memory data structure
- Evaluated both user and kernel level interfaces and implementations
[Diagram: SPE1-SPE8 post completions to a kernel-level Ready-To-Run list; the PPE, running multiple processes, calls yield_to(pid) to schedule a process that is ready to run]
- Evaluated list and array based data structures
- Trade-off between ordering and fast access
- List – FIFO maximizes utilization but requires mutual exclusion
- Array – no ordering but avoids synchronization
- Design:
- Split array data structure
- Processes pinned to PPE h/w contexts
- Static partitioning of SPEs to PPE h/w contexts
- Evaluated kernel and user level placement (SPE idle time per offload)
- User-Level = 7us
- Kernel-Level = 9us
- No kernel support and no system calls
- User-Level Work Stealing using BUPC
- Work pool resides in memory shared between processes (any-any PPE-SPE)
- Signaling mechanism identical to SLED, no pinning requirements
[Diagram: with work stealing, the work pool is shared among all eight PPE processes, so an off-loaded task can be served by any process and any SPE]
- Microbenchmarks
- Multiple PPE processes off-load SPE tasks of various length
- Evaluated kernel/user level implementation
- Evaluated multiple signaling data structure designs
- Evaluated SDK synchronization primitives
- Bioinformatics applications that generate phylogenetic trees (PBPI, RAxML)
  - Evolutionary history among a set of species; computationally expensive, NP-hard algorithms
- RAxML (Maximum Likelihood) – Master-Worker, embarrassingly parallel
  - Work unit is multiple loops (three); communication between master and workers only after a unit is completed
  - Multiple code entry points into an SPE code module (work-stealing requires significant re-engineering)
- PBPI (Parallel Bayesian Phylogenetic Inference) – data parallel
  - Three main loops off-loaded separately; ALL_TO_ALL communication after each loop body
- Whole loop body off-loaded in both applications; over 95% of execution time spent on SPEs in both applications
PPE-SPE Synchronization Overhead
SPE Utilization
- “MBOX” – synchronization via interrupt mailboxes: 19us
- “PPE” – yielding overhead with oversubscription (6 processes on PPE): 14us
- “YNR” < “PPE” due to congestion
- Average latency: Work-Steal 3us, SLED 7us, YNR 10us
Faster synchronization leads to improved SPE utilization
- Work-stealing for RAxML produces a 23% slowdown for the workload (the work-migration implementation is not full Continuation Passing Style)
- SLED improves performance by up to 5% (3% avg)
- Work-Stealing improves performance by up to 21% (12% avg)
- SLED improves performance by up to 10% (5% avg)
- For very short tasks (length < context switch), YNR always performs best
[Chart: PBPI speedup over YNR (SLED-Kernel, SLED-User, Work-Steal), y-axis -10% to 25%, vs DNA sequence length 204-4764]
[Chart: RAxML speedup over YNR (SLED User, SLED Kernel), y-axis -4% to 6%, vs DNA sequence length 100-2000]
- Left-hand y-axis: reduction in idle time compared to the YNR idle time
- IDLE (right-hand y-axis): time SPEs spend waiting for work, as a percentage of total execution time with YNR
- SLED reduces the SPE idle time by up to 20% for both applications
- UPC Work-Steal reduces SPE idle time by up to 30% for PBPI
- With optimizations, SPEs are still idle 20% of the time (PBPI) and 10% (RAxML)
[Chart: task distribution – number of tasks (up to 400,000) per task-length bucket (0-10 up to 1000-10000) for RAxML and PBPI at DNA sequence lengths 105, 1176, and 3063]
- Cell execution models:
- Offloading:
- P. Bellens et al “CellSs: A Programming Model for the Cell BE Architecture”
- M. de Kruijf and K. Sankaralingam “MapReduce for the Cell B.E. Architecture”
- K. Fatahalian et al. “Sequoia: Programming the Memory Hierarchy”
- M. Monteyne “RapidMind Multi-core Development Platform”
- Streaming
- Kudlur & Mahlke
- Shared memory model
- Eichenberger et al “Optimizing Compiler for the Cell Processor”
- SPE micro-kernels
- Mohamed F. Ahmed et al “SPENK: Adding Another Level of Parallelism on the Cell Broadband Engine”
- Many Application Studies
- F. Petrini et al. “Challenges in Mapping Graph Exploration Algorithms on Advanced Multicore Processors”
- D. Bader et al. “On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking”
- Jayram M. N. “Brain Circuit Bottom-Up Engine Simulation and Acceleration on Cell BE, for Vision Applications”
- Efficient execution on accelerators requires careful scheduling of disjoint parallelism
- The current support for synchronization among the heterogeneous cores is not sufficient (callbacks, mailboxes, busy-wait)
- The cooperative scheduling strategies explored improve performance
- The impact of co-scheduling will increase as contention on the general-purpose core increases (Cell blade with 8 SPEs instead of PS3 with 6 SPEs)
- The ratio of task length to scheduling overhead is likely to remain constant in the future