Filip Blagojević, Costin Iancu, Katherine Yelick, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, Benjamin Rose
(presented by Rajesh Nishtala; for questions, email upc@lbl.gov)

- Heterogeneous architectures are high-performance, cost-effective, and power-effective
- Cell BE, FPGA, GPGPU, Larrabee, Rapport Kilocore
- Execution model considered: off-loading
- Enables relatively easy and efficient porting of existing applications
- Achieves high performance and utilization of the architectures
- Off-loading requires efficient PPE-SPE communication and synchronization
- Idle time on accelerators
- Co-scheduling policies required for higher utilization
- Efficient chip utilization requires multigrain parallelism:
- PPE oversubscription required for performance (RAxML)
- Parallelization balance depends on application characteristics (1-N PPE-SPE)
- Efficient PPE-SPE synchronization is required
- The number of off-loads in an application is large (>100,000)
- Linux is unaware of the offloading execution
- Cell SDKs
- Callbacks - register event handler to respond to SPE requests
- Interrupt Mailboxes - PPE process performs blocking call to read a mailbox
- Busy-wait (S/P)PE polls on a shared memory variable
[Timeline figure: with busy-waiting, each PPE process (P1-P3) spins through a full OS quantum while the SPE tasks (S1-S3) complete, leaving SPEs idle between off-loads]

Busy waiting:
Wait_for_SPE(sync_t *flag) { while (!*flag); }
- Busy-wait with yielding: Yield-If-Not-Ready (YNR)
- Cooperative scheduling: Slack-Minimizing Event-Driven Scheduler (SLED)
- Work-Stealing
- YNR and SLED use a 1-1 PPE-SPE mapping; WS uses an any-any mapping
[Timeline figure: with YNR, each PPE process (P1-P3) yields whenever its SPE result is not ready, interleaving the processes while S1-S3 run]

Yield if Not Ready:
Wait_for_SPE(sync_t *flag) { while (!*flag) { sched_yield(); } }

[Timeline figure: with SLED, the PPE yields directly to the process whose SPE has finished, approaching the ideal schedule]

SLED (ideal):
SLED_Wait_for_SPE() { while (not_done) { determine which SPE is ready; yield_to(ready); } }
- Task scheduling: yield_to(pid) system call
- Signaling and task selection
- Shared memory data structure
- Evaluated both user and kernel level interfaces and implementations
[Diagram: SPE1-SPE8 post completions to a kernel-level Ready-To-Run list; the PPE, running multiple processes, calls yield_to(pid) to schedule a process that is ready to run]
- Evaluated list and array based data structures
- Trade-off between ordering and fast access
- List – FIFO maximizes utilization but requires mutual exclusion
- Array – no ordering but avoids synchronization
- Design:
- Split array data structure
- Processes pinned to PPE h/w contexts
- Static partitioning of SPEs to PPE h/w contexts
- Evaluated kernel and user level placement (SPE idle time per offload)
- User-Level = 7us
- Kernel-Level = 9us
- No kernel support and no system calls
- User-Level Work Stealing using BUPC
- Work pool resides in memory shared between processes (any-any PPE-SPE)
- Signaling mechanism identical to SLED, no pinning requirements
[Diagram: with work stealing, the work pool is shared among all eight PPE processes, so an off-loaded task can be served by any process and any SPE]
- Microbenchmarks
- Multiple PPE processes off-load SPE tasks of various length
- Evaluated kernel/user level implementation
- Evaluated multiple signaling data structure designs
- Evaluated SDK synchronization primitives
- Bioinformatics applications that generate phylogenetic trees (PBPI, RAxML)
  - Evolutionary history among a set of species; computationally expensive, NP-hard algorithms
- RAxML (Maximum Likelihood) – Master-Worker, embarrassingly parallel
  - Work unit is multiple loops (three); communication between master and workers only after a unit is completed
  - Multiple code entry points into an SPE code module (work-stealing requires significant re-engineering)
- PBPI (Parallel Bayesian Phylogenetic Inference) – data parallel
  - Three main loops off-loaded separately; ALL_TO_ALL communication after each loop body
- Whole loop body off-loaded in both applications; over 95% of execution time spent on SPEs in both applications
PPE-SPE Synchronization Overhead
SPE Utilization
- “MBOX” – synchronization via interrupt mailboxes: 19us
- “PPE” – yielding overhead with oversubscription (6 processes on PPE): 14us
- “YNR” < “PPE” due to congestion
- Average latency: Work-Steal 3us, SLED 7us, YNR 10us
Faster synchronization leads to improved SPE utilization
- Work-stealing for RAxML produces a 23% slowdown for the workload (the work-migration implementation is not full Continuation Passing Style)
- SLED improves performance by up to 5% (3% avg)
- Work-Stealing improves performance by up to 21% (12% avg)
- SLED improves performance by up to 10% (5% avg)
- For very short tasks (length < context switch), YNR always performs best
[Chart: PBPI speedup over YNR (SLED-Kernel, SLED-User, Work-Steal), y-axis -10% to 25%, vs DNA sequence length 204-4764]
[Chart: RAxML speedup over YNR (SLED User, SLED Kernel), y-axis -4% to 6%, vs DNA sequence length 100-2000]
- Left-hand y-axis: reduction in idle time compared to the YNR idle time
- IDLE (right-hand y-axis): time SPEs spend waiting for work, as a percentage of total execution time with YNR
- SLED reduces the SPE idle time by up to 20% for both applications
- UPC Work-Steal reduces SPE idle time by up to 30% for PBPI
- With optimizations, SPEs are still idle 20% of the time (PBPI) and 10% (RAxML)
[Chart: task distribution – number of tasks (up to 400,000) per task-length bucket (0-10 up to 1000-10000) for RAxML and PBPI at DNA sequence lengths 105, 1176, and 3063]
- Cell execution models:
- Offloading:
- P. Bellens et al “CellSs: A Programming Model for the Cell BE Architecture”
- M. de Kruijf and K. Sankaralingam “MapReduce for the Cell B.E. Architecture”
- K. Fatahalian et al. “Sequoia: Programming the Memory Hierarchy”
- M. Monteyne “RapidMind Multi-core Development Platform”
- Streaming
- Kudlur & Mahlke
- Shared memory model
- Eichenberger et al “Optimizing Compiler for the Cell Processor”
- SPE micro-kernels
- Mohamed F. Ahmed et al “SPENK: Adding Another Level of Parallelism on the Cell Broadband Engine”
- Many Application Studies
- F. Petrini et al. “Challenges in Mapping Graph Exploration Algorithms on Advanced Multicore Processors”
- D. Bader et al. “On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking”
- Jayram M. N. “Brain Circuit Bottom-Up Engine Simulation and Acceleration on Cell BE, for Vision Applications”
- Efficient execution on accelerators requires careful scheduling of disjoint parallelism
- The current support for synchronization among the heterogeneous cores is not sufficient (callbacks, mailboxes, busy-wait)
- The cooperative scheduling strategies explored improve performance
- The impact of co-scheduling will increase as contention on the general-purpose core increases (Cell blade with 8 SPEs instead of PS3 with 6 SPEs)
- The ratio of task length to scheduling overhead is likely to remain constant in the future