SLIDE 1

Filip Blagojević, Costin Iancu, Katherine Yelick, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, Benjamin Rose (presented by Rajesh Nishtala). For questions please email upc@lbl.gov

SLIDE 2
  • Heterogeneous architectures are high-performance, cost-effective, and power-efficient
  • Cell BE, FPGA, GPGPU, Larrabee, Rapport Kilocore
  • Execution model considered: off-loading
  • Enables relatively easy and efficient porting of existing applications
  • Achieves high performance and utilization of the architecture
  • Off-loading requires efficient PPE-SPE communication and synchronization
  • Idle time on accelerators
  • Co-scheduling policies required for higher utilization
SLIDE 3
  • Efficient chip utilization requires multigrain parallelism:
  • PPE oversubscription required for performance (RAxML)
  • Parallelization balance depends on application characteristics (1-N PPE-SPE)
  • Efficient PPE-SPE synchronization is required
  • Number of off-loads in an application is large (>100,000)
  • Linux is unaware of the offloading execution
SLIDE 4
  • Cell SDKs
  • Callbacks - register event handler to respond to SPE requests
  • Interrupt Mailboxes - PPE process performs blocking call to read a mailbox
  • Busy-wait - (S/P)PE polls on a shared memory variable

[Figure: PPE/SPE execution timeline under busy waiting - SPEs sit idle while the waiting PPE process holds the core for a full OS quantum]

Busy Waiting:
 Wait_for_SPE(sync_t *flag){ while(!*flag); }
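The busy-wait primitive can be written out as a runnable C sketch (minimal and illustrative; `sync_t` is assumed here to be a plain int flag in memory shared between the PPE process and the SPE - note the dereference, since testing the pointer itself rather than `*flag` would never wait):

```c
#include <assert.h>

/* Flag in shared memory: the SPE writes a non-zero value on completion. */
typedef volatile int sync_t;

/* The PPE spins, burning cycles until the SPE signals completion. */
void wait_for_spe(sync_t *flag) {
    while (!*flag)
        ;  /* must test *flag, not the (always non-NULL) pointer */
}
```

Because the waiting process never yields, it occupies the PPE for its whole OS quantum even though it has no useful work, which is the idle time the following schemes attack.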

SLIDE 5
  • Busy-wait with yielding - Yield-If-Not-Ready (YNR)
  • Cooperative scheduling - Slack-Minimizing Event-Driven Scheduler (SLED)
  • Work-Stealing (WS)
  • YNR and SLED use a 1-1 PPE-SPE mapping; WS allows an any-any mapping

[Figure: PPE/SPE execution timelines comparing Yield-if-Not-Ready with the ideal event-driven schedule]

Yield if Not Ready:
 Wait_for_SPE(sync_t *flag){ while(!*flag){ sched_yield(); }}

Ideal:
 SLED_Wait_for_SPE(){ while(not_done){ Determine ready SPE; yield_to(ready); }}
SLIDE 6
  • Task scheduling: yield_to(pid) system call
  • Signaling and task selection
  • Shared memory data structure
  • Evaluated both user and kernel level interfaces and implementations

[Figure: eight SPEs signal completion through a shared-memory Ready-To-Run list; the kernel on the PPE, which runs multiple processes, schedules a process that is ready to run via yield_to(pid)]
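A minimal sketch of the shared-memory signaling structure, assuming the Ready-To-Run list is an array of per-SPE flags: the SPE completion path marks its slot ready, and the PPE-side scheduler scans for the next ready entry. The `yield_to(pid)` system call and the kernel-side context switch are not shown; all names here are illustrative, not the paper's implementation.

```c
#include <assert.h>
#include <string.h>

#define NUM_SPES 8

/* One flag per SPE in shared memory: 1 = off-loaded task finished,
 * so the host process bound to that SPE is runnable again. */
typedef struct {
    volatile int ready[NUM_SPES];
} ready_list_t;

void ready_list_init(ready_list_t *rl) {
    memset((void *)rl->ready, 0, sizeof(rl->ready));
}

/* Conceptually invoked from the SPE completion path. */
void mark_ready(ready_list_t *rl, int spe) {
    rl->ready[spe] = 1;
}

/* Scheduler scan: return the first ready SPE index and consume the
 * signal, or -1 if none is ready.  A real SLED scheduler would then
 * yield_to() the process bound to that SPE instead of spinning. */
int pick_ready(ready_list_t *rl) {
    for (int i = 0; i < NUM_SPES; i++) {
        if (rl->ready[i]) {
            rl->ready[i] = 0;
            return i;
        }
    }
    return -1;
}
```

The scan gives array semantics (no FIFO ordering) in exchange for lock-free access, which is the trade-off the next slide evaluates.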

SLIDE 7
  • Evaluated list and array based data structures
  • Trade-off between ordering and fast access
  • List – FIFO maximizes utilization but requires mutual exclusion
  • Array – no ordering but avoids synchronization
  • Design:
  • Split array data structure
  • Processes pinned to PPE h/w contexts
  • Static partitioning of SPEs to PPE h/w contexts
  • Evaluated kernel and user level placement (SPE idle time per offload)
  • User-Level = 7us
  • Kernel-Level = 9us
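The split-array design can be illustrated with a small sketch (hypothetical layout, for illustration only): because processes are pinned to PPE hardware contexts and SPEs are statically partitioned across them, each context scans only its own disjoint slice of the signaling array, so no mutual exclusion is needed.

```c
#include <assert.h>

#define NUM_SPES 8
#define NUM_PPE_CONTEXTS 2  /* the Cell PPE is 2-way SMT */

/* Static partitioning: SPE i is served by PPE context i % NUM_PPE_CONTEXTS,
 * giving each context a disjoint, lock-free slice of the array. */
int context_of_spe(int spe) {
    return spe % NUM_PPE_CONTEXTS;
}

/* Each context scans only the SPEs it owns; returns the first ready
 * SPE in its partition, or -1 if none. */
int scan_partition(const volatile int ready[NUM_SPES], int ctx) {
    for (int spe = ctx; spe < NUM_SPES; spe += NUM_PPE_CONTEXTS)
        if (ready[spe])
            return spe;
    return -1;
}
```

Giving up the FIFO ordering of a list costs some utilization, but avoiding synchronization on every off-load keeps the per-offload idle time in the 7-9 µs range reported above.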
SLIDE 8
  • No kernel support and no system calls
  • User-Level Work Stealing using BUPC
  • Work pool resides in memory shared between processes (any-any PPE-SPE)
  • Signaling mechanism identical to SLED, no pinning requirements

[Figure: with work stealing, an off-loaded task can be served by any of the PPE processes; all eight SPEs are served from a single shared work pool]
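A minimal sketch of the shared work pool, assuming task descriptors live in memory shared by all PPE processes and a steal is a single atomic fetch-and-add; the names are illustrative (the actual implementation is built on BUPC, not shown here):

```c
#include <assert.h>
#include <stdatomic.h>

#define POOL_SIZE 64

/* Shared pool: any idle PPE process can claim the next task with one
 * atomic increment, so no locks and no ordering guarantees. */
typedef struct {
    int tasks[POOL_SIZE];  /* task ids; a real pool holds descriptors */
    atomic_int head;       /* index of the next unclaimed task */
    int tail;              /* one past the last enqueued task */
} work_pool_t;

void pool_init(work_pool_t *p) {
    atomic_init(&p->head, 0);
    p->tail = 0;
}

void pool_push(work_pool_t *p, int task) {
    p->tasks[p->tail++] = task;
}

/* Any process may call this (any-any PPE-SPE mapping);
 * returns -1 when the pool is empty. */
int pool_steal(work_pool_t *p) {
    int i = atomic_fetch_add(&p->head, 1);
    if (i >= p->tail)
        return -1;
    return p->tasks[i];
}
```

Because any process can serve any task, no pinning of processes to hardware contexts is required, unlike the SLED design.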

SLIDE 9
  • Microbenchmarks
  • Multiple PPE processes off-load SPE tasks of various length
  • Evaluated kernel/user level implementation
  • Evaluated multiple signaling data structure designs
  • Evaluated SDK synchronization primitives

  • Bioinformatics applications to generate phylogenetic trees (PBPI, RAxML)
  • Evolutionary history among a set of species
  • Computationally expensive
  • NP-hard algorithms

SLIDE 10

  • RAxML (Maximum Likelihood) - Master-Worker, Embarrassingly Parallel
  • Work unit is multiple loops (three)
  • Communication between master and worker only after a unit is completed
  • Multiple code entry points into an SPE code module (work-stealing requires significant re-engineering)
  • PBPI (Parallel Bayesian Phylogenetic Inference) - Data Parallel
  • Three main loops offloaded separately
  • ALL_TO_ALL communication after each loop body
  • Whole loop body offloaded in both applications
  • Over 95% of execution time spent on SPEs in both applications

SLIDE 11

[Figures: PPE-SPE synchronization overhead (lower is better) and SPE utilization (higher is better)]

  • “MBOX” - synchronization via interrupt mailboxes: 19us
  • “PPE” - yielding overhead with oversubscription (6 processes on PPE): 14us
  • “YNR” < “PPE” due to congestion
  • Average latency: Work-Steal 3us, SLED 7us, YNR 10us

Faster synchronization leads to improved SPE utilization

SLIDE 12
  • Work-stealing for RAxML produces a 23% slowdown for the workload (the work migration implementation is not a full Continuation Passing Style)
  • SLED improves the performance by up to 5% (3% avg)
  • Work-Stealing improves the performance by up to 21% (12% avg)
  • SLED improves the performance by up to 10% (5% avg)
  • For very short tasks (length < context switch) YNR always performs best

[Figure: PBPI speedup vs. DNA sequence length (204-4764) for SLED-Kernel, SLED-User, and Work-Steal]
[Figure: RAxML speedup vs. DNA sequence length (100-2000) for SLED User and SLED Kernel]

SLIDE 13
  • Left-hand y-axis: reduction of SPE idle time compared to the YNR idle time
  • IDLE (right-hand y-axis): time when SPEs are waiting for work, as a percentage of total execution time with YNR

  • SLED reduces the SPE idle time by up to 20% for both applications
  • UPC Work-Steal reduces SPE idle time by up to 30% for PBPI
  • With optimizations, SPEs are still idle 20% of the time (PBPI) and 10% (RAxML)

[Figure: task length distribution - number of tasks per task-length bin (0-10 through 1000-10000) for RAxML and PBPI at DNA sequence lengths 105, 1176, and 3063]

SLIDE 14
  • Cell execution models:
  • Offloading:
  • P. Bellens et al “CellSs: A Programming Model for the Cell BE Architecture”
  • M. de Kruijf and K. Sankaralingam “MapReduce for the Cell B.E. Architecture”
  • K. Fatahalian et al. “Sequoia: Programming the Memory Hierarchy”
  • M. Monteyne “RapidMind Multi-core Development Platform”
  • Streaming
  • Kudlur & Mahlke
  • Shared memory model
  • Eichenberger et al “Optimizing Compiler for the Cell Processor”
  • SPE micro-kernels
  • Mohamed F. Ahmed et al “SPENK: Adding Another Level of Parallelism on the Cell Broadband Engine”
  • Many Application Studies
  • F. Petrini et al. “Challenges in Mapping Graph Exploration Algorithms on Advanced Multicore Processors”
  • D. Bader et al. “On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking”
  • Jayram M. N. “Brain Circuit Bottom-Up Engine Simulation and Acceleration on Cell BE, for Vision Applications”
SLIDE 15
  • Efficient execution on accelerators requires careful scheduling of disjoint parallelism
  • The current support for synchronization among the heterogeneous cores is not sufficient (callbacks, mailboxes, busy wait)
  • The cooperative scheduling strategies explored improve performance
  • Impact of co-scheduling will increase as contention on the general purpose core increases (Cell blade with 8 SPEs instead of PS3 with 6 SPEs)
  • The ratio of task length to scheduling overhead is likely to remain constant in future architecture revisions