
Enabling Technologies for a Programmable Many-core

Ben Juurlink, TU Berlin
Partner and work package leader

January 22, 2011, PEPPHER workshop, Crete

Disclaimer

§ Presentation (partially) personal view on ENCORE

§ Minor focus on TU Berlin activities

§ Contains some grammar mistakes

§ No time for sanity check (FP7 deadline)
§ Some grammar mistakes on purpose
§ To save space
§ ENCORE view matters most


Outline

§ Consortium § Objectives § Programming Model § Runtime System § Preliminary Evaluation of Programming Model § Hardware Support for Runtime System § Conclusions & Future Work


ENCORE consortium § Funded under FP7 Objective ICT 2009.3.6 - Computing Systems § 3-year STREP project (March 2010 - February 2012)

ISRAEL INSTITUTE OF TECHNOLOGY


Project Objectives

§ To achieve breakthrough on usability, code portability, and performance scalability of multicore systems
§ Define easy to use parallel programming model
§ Develop intelligent runtime management system
  § Hide complexity of parallel programming
    § Detect + manage parallelism
    § Detect + manage data locality
  § Hide complexity of underlying architecture
    § Heterogeneous processors
    § Physically distributed memory (NUMA)
    § Software managed memory hierarchy
§ Design scalable parallel architecture
  § Providing support to the runtime system

ENCORE Programming Model

§ Start from mainstream programming language (C)
§ Extend sequential code with #pragma annotations
  § Programmer identifies pieces of code to be executed as tasks
  § Also identifies task inputs and outputs, and specifies requirements
§ Tasks need not be parallel
  § Runtime system will detect and exploit parallelism
  § Programmer is not directly concerned with parallelism

Imperative code:

    for (i = 0; i < height; i += 16)
        for (j = 0; j < width; j += 16)
            mb_decode(&frame[i][j]);

OmpSs programmer:

    for (i = 0; i < height; i += 16)
        for (j = 0; j < width; j += 16)
            #pragma omp task \
                input([16][16] frame[i-16][j]) \
                input([16][16] frame[i][j-16]) \
                inout([16][16] frame[i][j])
            mb_decode(&frame[i][j]);
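The OmpSs `input`/`inout` clauses are not part of standard OpenMP. As a rough analogue only, the same "wait for the upper and left block" pattern can be sketched with the standard OpenMP `depend` clause. Everything in this sketch is an assumption for illustration: the grid size `MB`, the padded `done` array, the stand-in `mb_decode`, and the wrapper `decode_frame` do not come from the slides.

```c
#include <assert.h>
#include <string.h>

#define MB 4   /* hypothetical: macroblocks per row and per column */

/* Padded by one row and one column so that the "up" and "left" neighbours
   of border blocks are valid elements that are never written. */
static int done[MB + 1][MB + 1];
static int order_ok;

/* Stand-in for a real decoder: checks that the upper and left blocks were
   decoded first, then marks block (i, j) as decoded. */
static void mb_decode(int i, int j)
{
    int ok = (i == 1 || done[i - 1][j]) && (j == 1 || done[i][j - 1]);
    if (!ok) {
        #pragma omp atomic write
        order_ok = 0;
    }
    done[i][j] = 1;
}

/* Returns 1 if every block was decoded after its predecessors. */
int decode_frame(void)
{
    memset(done, 0, sizeof done);
    order_ok = 1;

    #pragma omp parallel
    #pragma omp single
    for (int i = 1; i <= MB; i++)
        for (int j = 1; j <= MB; j++) {
            #pragma omp task firstprivate(i, j) \
                depend(in: done[i - 1][j], done[i][j - 1]) \
                depend(inout: done[i][j])
            mb_decode(i, j);
        }
    /* All tasks have completed at the barrier that ends the single construct. */

    for (int i = 1; i <= MB; i++)
        for (int j = 1; j <= MB; j++)
            assert(done[i][j] == 1);
    return order_ok;
}
```

Calling `decode_frame()` is expected to return 1. When the file is compiled without OpenMP support the pragmas are ignored and the loop simply runs in program order, which mirrors the point that the annotated program keeps its serial behavior.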


Task Dependency Graph

§ Input/output clauses allow the runtime to build the task dependency graph

§ Expressions evaluated at runtime

    for (i = 0; i < height; i += 16)
        for (j = 0; j < width; j += 16)
            #pragma omp task \
                input([16][16] frame[i-16][j]) \
                input([16][16] frame[i][j-16]) \
                inout([16][16] frame[i][j])
            mb_decode(&frame[i][j]);

[Figure: resulting task dependency graph for a 3 x 3 grid of blocks, nodes 1,1 through 3,3]
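To make the shape of this graph concrete, the following sketch computes the earliest step at which each block can run when it waits for its upper and left neighbour, and counts how many blocks become runnable per step. It is an illustration under assumptions: the function name `wavefront_profile`, the bound `MAX_BLOCKS`, and the 0-based indexing are not from the slides.

```c
#include <assert.h>

#define MAX_BLOCKS 64   /* hypothetical upper bound on rows and columns */

/* For a rows x cols grid, stores in tasks_per_step[s] the number of blocks
   whose earliest possible step is s. The caller provides room for
   rows + cols - 1 entries. Returns the number of steps on the critical path. */
int wavefront_profile(int rows, int cols, int tasks_per_step[])
{
    static int step[MAX_BLOCKS][MAX_BLOCKS];
    int steps = 0;

    assert(rows >= 1 && rows <= MAX_BLOCKS);
    assert(cols >= 1 && cols <= MAX_BLOCKS);

    for (int s = 0; s < rows + cols - 1; s++)
        tasks_per_step[s] = 0;

    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++) {
            /* A block can start one step after its latest predecessor. */
            int up   = (i > 0) ? step[i - 1][j] + 1 : 0;
            int left = (j > 0) ? step[i][j - 1] + 1 : 0;
            step[i][j] = (up > left) ? up : left;
            tasks_per_step[step[i][j]]++;
            if (step[i][j] + 1 > steps)
                steps = step[i][j] + 1;
        }
    return steps;
}
```

For the 3 x 3 grid in the figure this gives five steps with 1, 2, 3, 2, and 1 runnable blocks, so at most three blocks can run in parallel regardless of the number of cores.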


Task Dependency Graph

§ Dependency graph used by runtime system to

§ ensure correctness of execution

§ task cannot start before its predecessors have finished

§ optimize performance, e.g.,

§ reduce overhead of submitting tasks by task bundling
§ improve data locality by exploiting in/out usage information

[Figure: task dependency graph with tasks mapped to Core 0, Core 1, Core 2, and Core 3]


Runtime System

§ Compiler transforms pragmas to calls to runtime system (RTS)
§ Runtime system responsible for:
  § Building dependency graph
  § Extracting parallel tasks from dependency graph
  § Offloading tasks to accelerators (if applicable)
  § Managing data transfers
  § Maintaining data coherence
  § Performing optimizations while maintaining correctness
    § Task bundling
    § Memory renaming to resolve WAW and WAR hazards
    § Double buffering
    § Scheduling for locality
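As an illustration of memory renaming, the sketch below shows one way a runtime could avoid a WAR hazard: if a task wants to overwrite a buffer that earlier tasks still have to read, the writer is given fresh storage instead of being made to wait. This is a minimal sketch under assumptions and not the ENCORE runtime; the type `version_t` and the functions `begin_read`, `end_read`, and `begin_write` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for one version of a logical buffer. */
typedef struct {
    int *data;      /* storage that holds this version */
    int  readers;   /* tasks that still have to read this version */
} version_t;

/* A reader task registers on a version and receives its storage. */
int *begin_read(version_t *v)
{
    v->readers++;
    return v->data;
}

/* Called when a reader task has finished. */
void end_read(version_t *v)
{
    assert(v->readers > 0);
    v->readers--;
}

/* A task that only writes the buffer (pure output of n elements).
   If readers of the current version are outstanding, the buffer is renamed:
   the writer gets fresh storage in *next and 1 is returned.
   Otherwise the storage is reused in place and 0 is returned. */
int begin_write(const version_t *cur, version_t *next, size_t n)
{
    next->readers = 0;
    if (cur->readers == 0) {
        next->data = cur->data;
        return 0;
    }
    next->data = malloc(n * sizeof *next->data);
    assert(next->data != NULL);
    return 1;
}
```

Renaming in this form only applies to parameters that are written without being read first; an `inout` parameter carries a true dependency and still has to wait. A real runtime would also have to release the old storage once its last reader finishes, which is left out here.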


Execution Model § Single master thread that submits tasks to runtime system

§ Tasks can also generate new tasks if dependency graphs disjoint

§ RTS builds dependency graph and submits tasks to worker cores § Worker cores execute tasks and request RTS new tasks when done

    for (i = 0; i < n; i += 16)
        for (j = 0; j < n; j += 16) {
            wd = nanos_create_wd(.., input-output_info);
            nanos_submit(wd);
        }

[Figure: the master core runs the loop above and submits tasks to the RTS (task MGT core / master core thread), which hands mb_decode() tasks to worker 1 through worker n]
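The flow in this execution model can be imitated with a small single-threaded simulation: the master submits all tasks in program order, the RTS tracks how many unfinished predecessors each task has, and workers take tasks from a ready queue. This is a sketch under assumptions, not the Nanos API: `rts_submit`, `rts_finish`, `simulate`, the grid size `N`, and the round-based timing are all hypothetical.

```c
#include <assert.h>

#define N 3                 /* hypothetical: N x N grid of tasks */
#define NTASKS (N * N)

static int pending[NTASKS]; /* unfinished predecessors per task */
static int ready[NTASKS];   /* FIFO ready queue of task ids */
static int head, tail;

/* Master side: register task (i, j), which waits for its upper and left
   neighbour. Tasks without predecessors are ready immediately. */
static void rts_submit(int i, int j)
{
    int id = i * N + j;
    pending[id] = (i > 0) + (j > 0);
    if (pending[id] == 0)
        ready[tail++] = id;
}

/* Worker side: task id has finished, so release its successors. */
static void rts_finish(int id)
{
    int i = id / N, j = id % N;
    if (i + 1 < N && --pending[id + N] == 0)
        ready[tail++] = id + N;
    if (j + 1 < N && --pending[id + 1] == 0)
        ready[tail++] = id + 1;
}

/* Runs all tasks with the given number of simulated workers. Each round,
   every worker executes at most one ready task. Writes the execution order
   to order[] (NTASKS entries) and returns the number of rounds. */
int simulate(int workers, int order[])
{
    int rounds = 0, executed = 0;

    assert(workers >= 1);
    head = tail = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            rts_submit(i, j);

    while (head < tail) {
        int first = head;
        int last = (tail - head > workers) ? head + workers : tail;
        for (head = first; head < last; head++)
            order[executed++] = ready[head];
        for (int k = first; k < last; k++)
            rts_finish(ready[k]);
        rounds++;
    }
    assert(executed == NTASKS);
    return rounds;
}
```

With one worker the nine tasks take nine rounds; with nine workers they take five rounds, which is the length of the critical path through the graph.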


Runtime Library Structure

§ slide 16 Alex Duran


Supported Platforms

§ SMP § SMP-NUMA § Makes copies of input/output data in local memory

§ SMP-Cluster

§ Makes copies across the network

§ CUDA

§ Manages copies to/from GPUs with overlapping

§ ENCORE


Preliminary Performance Evaluation

§ How well does OmpSs perform on non-HPC applications?
§ Next performance evaluation uses SMPSs
  § SMP-instance of StarSs
  § StarSs subset of OmpSs features
§ Performance evaluation preliminary
  § SMPSs startup cost not included (= large, negligible for large applications)
  § Still need to analyze results in detail
§ “Non-biased” comparison
  § TU Berlin not involved in SMPSs development


Experimental Setup

§ Platform:
  § 64-core cc-NUMA
  § HP DL980 G7
  § 8x Xeon X7560 (Nehalem EX)
§ Benchmarks:
  § Kernels: mainly from EEMBC MultiBench
  § Applications: H.264 decoding
  § Workloads: set of several kernels/applications
§ Methodology:
  § Started with EEMBC MultiBench
  § Stripped away MITH framework
  § Ported to Pthreads
  § Ported to SMPSs
  § Compare SMPSs to Pthreads


C-ray Kernel § Brute force raytracer § 500 (SMPSs) / 700 (Pthreads) LoC § Unoptimized, simple, clean § Distributes (blocks of) scanlines to workers

5 10 15 20 25 30 35

Speedup

1 2 4 8 16 32 64

Thread count

Apples-to-apples: c-ray [small]

Pthreads SMPSs-2.2

10 20 30 40 50 60

Speedup

1 2 4 8 16 32 64

Thread count

Apples-to-apples: c-ray [large]

Pthreads SMPSs-2.2 January 22, 2011 PEPPHER workshop, Crete 16

Ray-Rot Workload

§ C-ray feeds binary output to rotate kernel
§ Pipelining parallelism (easier to exploit in SMPSs)
§ Introduces additional dependencies
§ Rotation angle is 90°

[Figure: two plots, "Apples-to-apples: ray-rot [small]" and "Apples-to-apples: ray-rot [large]"; speedup vs. thread count (1 to 64) for Pthreads and SMPSs-2.2]


Rot-cc Workload

§ Rotate feeds binary output to rgbcmy kernel
§ Pipelined, dependent, requires regions
§ Cache performance deteriorates
§ Rotation angle is 90°

[Figure: two plots, "Programming Models - Speedup" and "Programming Models - Execution time"; speedup and execution time [s] vs. thread count (1 to 64) for SMPSs[barrier], SMPSs[regions], and Pthreads]

Preliminary Conclusions from Preliminary Performance Evaluation

§ OmpSs / SMPSs is good
  § For several benchmarks SMPSs performs better than Pthreads
  § Serial program behavior maintained
  § (Often) programs just ‘work’ after adding pragmas
  § Very easy to exploit DLP using task-level parallelism
§ Task-based parallel programming model in development
  § Documentation can be improved
  § Compiler does not support all constructs
  § Parameter list ‘explosion’
  § Programming style restrictions (syntax / structure) (bad?)


Architecture Support for Runtime System

§ In OmpSs / StarSs, runtime takes care of
  § Task dependency determination
    § Task B depends on task A if output of A overlaps input of B
  § Scheduling while
    § Reducing task issuing overhead
    § Optimizing data locality
§ This can take a lot of time
  § Reduces scalability when threads are fine grain
  § Coarse grain threads reduce scalability also
  § Lose-lose situation
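The dependency rule above (task B depends on task A if the output of A overlaps the input of B) amounts to an overlap test on memory regions. The following sketch shows that test for rectangular blocks inside one frame. The type `region_t` and both function names are hypothetical, and the sketch assumes that all regions refer to the same underlying array.

```c
#include <assert.h>

/* Hypothetical description of a 2-D block inside a frame: top-left
   coordinate plus extent, all in elements. */
typedef struct {
    int row, col;    /* top-left corner */
    int rows, cols;  /* extent */
} region_t;

/* Returns 1 if the two rectangles share at least one element. */
int regions_overlap(region_t a, region_t b)
{
    assert(a.rows > 0 && a.cols > 0 && b.rows > 0 && b.cols > 0);
    return a.row < b.row + b.rows && b.row < a.row + a.rows &&
           a.col < b.col + b.cols && b.col < a.col + a.cols;
}

/* Task B depends on task A if A's output overlaps B's input. */
int depends_on(region_t output_of_a, region_t input_of_b)
{
    return regions_overlap(output_of_a, input_of_b);
}
```

For 16 x 16 blocks, a task that reads the block at (0, 0) depends on the task that writes it, while a task that only reads the block at (16, 16) does not. The runtime has to run a check of this kind against earlier tasks for every submitted task, which is part of the cost discussed in the bullets above.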

§ Next evaluation performed using CellSs
  § Cell instance of StarSs
§ “Complex dependencies (CD)” pattern
  § H.264-like dependencies

Scalability of CellSs Runtime System

[Figure: "Scalability of StarSs with the CD benchmark"; scalability (0 to 16) vs. task size (1 to 10000 µs, log scale) for 1, 2, 4, 8, and 16 SPEs. Annotations: max = 14.5; scalability = 4.9 max; H.264 MB decoding: average = 20 µs]

§ “Optimal” CellSs configuration
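One simple way to reason about why small tasks limit scalability is a throughput bound: if a single runtime thread spends a fixed time on every task, it cannot feed more workers than the ratio of task time to that per-task cost. The sketch below encodes this bound. It is a simplified model under assumptions, not a result from the slides: the function names and the per-task cost used in the note after the code are hypothetical.

```c
#include <assert.h>

/* Upper bound on speedup when one runtime thread spends rts_us
   microseconds per task and each task runs for task_us microseconds
   on a worker. */
double speedup_bound(int workers, double task_us, double rts_us)
{
    double rts_limit;

    assert(workers >= 1 && task_us > 0.0 && rts_us > 0.0);
    rts_limit = task_us / rts_us;   /* workers the runtime can keep busy */
    return (rts_limit < workers) ? rts_limit : (double)workers;
}

/* Smallest task size (in microseconds) at which the runtime can keep
   all workers busy under the same model. */
double min_task_us(int workers, double rts_us)
{
    assert(workers >= 1 && rts_us > 0.0);
    return workers * rts_us;
}
```

With a hypothetical cost of 4 µs per task, 16 workers and 20 µs tasks are bounded at a speedup of 5, and tasks would need to last at least 64 µs to keep all 16 workers busy. These numbers only illustrate the model and are not measurements.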

Scalability of CellSs

[Figure: Paraver trace of CD (task size 19 µs), with regions labeled idle]

Nexus: HW Support for TPU

Task “life cycle”:

[Figure: PPE, eight SPEs, eight TCs, and the TPU]

Task Descriptor: task_func, no_params, p1_io_type, p1_pointer, p1_x_length, p1_y_length, p1_y_stride, p2_io_type, …

1. Create task descriptor and send its address to TPU.
2. Load task descriptor.
3. Process task descriptor; update task pool.
4. Add ready tasks to ready queue.
5. Read ready queue; process; inform TPU.
6. Update task pool.

Pipelined for throughput

Nexus TPU Design

[Figure: block diagram of the TPU with in buffer (ptr, size), status register, descriptor loader, descriptor handler, finish handler, task storage (descriptor 1, descriptor 2), task table (id, *descriptor, status, #deps), producers table (address, kick-off list), consumer table (address, #deps, kick-off list), ready queue (id, *descriptor), and finish buffer (id)]

Preliminary Evaluation Results for Nexus

[Figure: ISO-efficiency 80%; task size (µs, log scale, 1 to 1000) vs. number of SPUs (2, 4, 8, 16) for StarSs, Manual, and StarSs + Nexus. Annotations: 47 µs vs. 5.1 µs (9x); 134 µs vs. 10 µs (13x)]


Preliminary Conclusions on Nexus

§ Runtime System of CellSs / OmpSs can become bottleneck
  § Mainly for fine-grain tasks
§ HW support (Nexus) can remove bottleneck
  § Up to 100+ (?) cores
§ Detailed VHDL model will be designed, implemented, and evaluated in ENCORE


Conclusions § ENCORE targets § Programmability § Performance portability § Right kind of hardware support

§ Preliminary SMPSs vs. Pthreads comparison shows

§ Satisfactory performance achieved with little programming effort

§ Preliminary Nexus task manager

§ Runtime system not bottleneck until 100+ cores


Future Work in ENCORE

§ Programming model
  § Region dependency checking
    § Allows capturing more complex dependency patterns
§ Improve runtime scheduling
  § Based on locality
  § Based on QoS
§ Applications and performance evaluation
  § Can we effectively and efficiently implement H.264 decoding in OmpSs?
§ Hardware support for runtime system
  § VHDL model of Nexus++ in FPGA multicore prototype
§ . . .
§ Stay tuned at http://www.encore-project.eu

Backup Slides

Heterogeneity

    #pragma omp task input([BS][BS] A, [BS][BS] B) inout([BS][BS] C)
    void matmul(float *A, float *B, float *C) {
        // original sequential matmul
    }

    #pragma omp target device(cuda) implements(matmul) copy_deps
    void matmul_cuda(float *A, float *B, float *C) {
        // optimized kernel for cuda
    }

    // library function
    #pragma omp target device(cell) implements(matmul) copy_deps
    void matmul_spe(float *A, float *B, float *C);