Scheduling complex streaming applications on the Cell processor




slide-1
SLIDE 1

1/ 28

Scheduling complex streaming applications on the Cell processor

Mathias Jacquelin, joint work with Matthieu Gallet and Loris Marchal

INRIA ROMA project-team, LIP (ENS-Lyon, CNRS, INRIA), École Normale Supérieure de Lyon, France

Workshop on Multithreaded Architectures and Applications, Atlanta, April 23, 2010.

slide-2
SLIDE 2

2/ 28

Outline

Introduction
  Steady-state scheduling
  CELL
Platform and Application Modeling
Mapping the Application
Practical Steady-State on CELL
  Preprocessing of the schedule
  State machine of the framework
Experimental results
Conclusion and Future works

slide-3
SLIDE 3

3/ 28

Motivation

◮ Multicore architectures: new opportunity to test the scheduling strategies designed in the ROMA team.
◮ Our trademark: efficient scheduling on heterogeneous platforms
◮ Most multicore architectures are homogeneous, regular
◮ Need for tailored algorithms (linear algebra, ...)
◮ Emerging heterogeneous multicore:
  ◮ Dedicated processing units on GPUs
  ◮ Mixed system: processor + accelerator
◮ This study: steady-state scheduling on CELL (bounded heterogeneity) to demonstrate the usefulness of complex (static) scheduling techniques



slide-10
SLIDE 10

4/ 28

Introduction: Steady-state Scheduling

Rationale:

◮ A pipelined application:
  ◮ Simple chain
  ◮ More complex application (Directed Acyclic Graph)
◮ Objective: optimize the throughput of the application (number of input files treated per second)
◮ Today: simple case where each task has to be mapped on one single resource

[Figure: tasks T1 to T9 of the application graph mapped onto processors P1 to P4]

slide-11
SLIDE 11

5/ 28

CELL brief introduction

◮ Multicore heterogeneous processor
◮ Accelerator extension to Power architecture

[Figure: PPE0, SPE0 to SPE7, and main memory connected by the EIB]

◮ 1 PPE core
  ◮ VMX unit
  ◮ L1, L2 cache
  ◮ 2-way SMT
◮ 8 SPEs
  ◮ 128-bit SIMD instruction set
  ◮ Local store 256 KB
  ◮ Dedicated asynchronous DMA engine
◮ Element Interconnect Bus (EIB)
  ◮ 200 GB/s bandwidth

slide-17
SLIDE 17

5/ 28

CELL brief introduction

◮ Multicore heterogeneous processor
◮ Accelerator extension to Power architecture

[Figure: PPE0, SPE0 to SPE7, and main memory connected by the EIB]

◮ 25 GB/s bandwidth


slide-19
SLIDE 19

7/ 28

Platform modeling

Simple CELL modeling:

◮ 1 PPE and 8 SPEs: 9 processing elements $P_1, \dots, P_9$, with unrelated speeds,
◮ Each processing element accesses the communication bus with a (bidirectional) bandwidth $b = 25$ GB/s,
◮ The bus is able to route all concurrent communications without contention (as a first step),
◮ Due to the limited size of the DMA stack on each SPE:
  ◮ Each SPE can perform at most 16 simultaneous DMA operations,
  ◮ The PPE can perform at most 8 simultaneous DMA operations to/from a given SPE.
◮ Linear cost communication model: a data item of size $S$ is sent/received in time $S/b$
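
The linear cost model can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical, and the bandwidth value assumes that the 25 GB/s figure above means decimal gigabytes.

```python
# Illustrative sketch of the linear cost model: a transfer of size S takes S / b.
# Assumption: b = 25 GB/s is read as 25e9 bytes per second.
B_BYTES_PER_S = 25e9


def transfer_time(size_bytes: float, bandwidth: float = B_BYTES_PER_S) -> float:
    """Time in seconds to send or receive `size_bytes` at rate `bandwidth`."""
    return size_bytes / bandwidth


def interface_time(sizes, bandwidth: float = B_BYTES_PER_S) -> float:
    """Time an interface is busy with several transfers in one direction.

    Under the linear model the costs simply add up: sum(S) / b.
    """
    return sum(sizes) / bandwidth
```

For instance, `transfer_time(256 * 1024)` gives the time needed to move a block as large as one SPE local store under this model.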

slide-20
SLIDE 20

8/ 28

Application modeling

Application is described by a directed acyclic graph:

◮ Tasks $T_1, \dots, T_n$
◮ Processing time of task $T_k$ on $P_i$ is $t_i(k)$,
◮ If there is a dependency $T_k \to T_l$, $data_{k,l}$ is the size of the file produced by $T_k$ and needed by $T_l$,
◮ If $T_k$ is an input task, it reads $read_k$ bytes from main memory,
◮ If $T_k$ is an output task, it writes $write_k$ bytes to main memory.

[Figure: example task graph with tasks T1 to T9]


slide-26
SLIDE 26

9/ 28

Target application: any DAG

◮ Today, we will focus on three random task graphs:

[Figure: the random task graphs; each node is labeled stateful or stateless, with its PPE cost, its SPE cost, and its peek value]

And a simple chain graph (50 tasks)


slide-28
SLIDE 28

11/ 28

How to compute an optimal mapping

◮ Objective: maximize throughput $\rho$
◮ Method: write a linear program gathering constraints on the mapping
◮ Binary variables:

$$\alpha_i^k = \begin{cases} 1 & \text{if } T_k \text{ is mapped on } P_i \\ 0 & \text{otherwise} \end{cases}$$

◮ Other useful binary variables: $\beta_{i,j}^{k,l} = 1$ iff file $T_k \to T_l$ is transferred from $P_i$ to $P_j$

slide-29
SLIDE 29

12/ 28

Constraints 1/2

On the application structure:

◮ Each task is mapped on a processor:

$$\forall T_k, \quad \sum_i \alpha_i^k = 1$$

◮ Given a dependency $T_k \to T_l$, the processor computing $T_l$ must receive the corresponding file:

$$\forall (k,l) \in E,\ \forall P_j, \quad \sum_i \beta_{i,j}^{k,l} \ge \alpha_j^l$$

◮ Given a dependency $T_k \to T_l$, only the processor computing $T_k$ can send the corresponding file:

$$\forall (k,l) \in E,\ \forall P_i, \quad \sum_j \beta_{i,j}^{k,l} \le \alpha_i^k$$
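
The structural constraints can be checked mechanically on a candidate mapping. The sketch below is illustrative only: the helper names and the dictionary layout are assumptions, not part of the framework presented here. It derives the binary variables from a task-to-processor mapping and tests the three constraints.

```python
from itertools import product


def build_variables(mapping, edges, n_procs):
    """Derive the binary variables from a mapping (task -> processor index).

    alpha[i, k] is 1 iff task k is mapped on processor i.
    beta[i, j, k, l] is 1 iff the file of edge (k, l) goes from i to j.
    """
    alpha = {(i, k): int(mapping[k] == i)
             for k in mapping for i in range(n_procs)}
    beta = {(i, j, k, l): int(mapping[k] == i and mapping[l] == j)
            for (k, l) in edges
            for i, j in product(range(n_procs), repeat=2)}
    return alpha, beta


def satisfies_structure(alpha, beta, tasks, edges, n_procs):
    """Check the three constraints on the application structure."""
    procs = range(n_procs)
    # Each task is mapped on exactly one processor.
    one_proc = all(sum(alpha[i, k] for i in procs) == 1 for k in tasks)
    # The processor computing T_l receives the file of each edge (k, l).
    receive = all(sum(beta[i, j, k, l] for i in procs) >= alpha[j, l]
                  for (k, l) in edges for j in procs)
    # Only the processor computing T_k sends the file of each edge (k, l).
    send = all(sum(beta[i, j, k, l] for j in procs) <= alpha[i, k]
               for (k, l) in edges for i in procs)
    return one_proc and receive and send
```

Variables built by `build_variables` satisfy the constraints by construction; the checker is mainly useful for validating values returned by a solver.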

slide-30
SLIDE 30

13/ 28

Constraints 2/2

On the achievable throughput ρ = 1/T:

◮ On a given processor, all tasks must be completed within $T$:

$$\forall P_i, \quad \sum_k \alpha_i^k \times t_i(k) \le T$$

◮ All incoming communications must be completed within $T$:

$$\forall P_j, \quad \frac{1}{b} \Big( \sum_k \alpha_j^k \times read_k + \sum_{k,l} \sum_i \beta_{i,j}^{k,l} \times data_{k,l} \Big) \le T$$

◮ All outgoing communications must be completed within $T$:

$$\forall P_i, \quad \frac{1}{b} \Big( \sum_k \alpha_i^k \times write_k + \sum_{k,l} \sum_j \beta_{i,j}^{k,l} \times data_{k,l} \Big) \le T$$

+ constraints on the number of incoming/outgoing communications to respect the DMA requirements
+ constraints on the available memory on SPE
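
For a fixed mapping, the smallest period $T$ allowed by the three throughput constraints is the largest of the per-processor loads. The sketch below is a minimal illustration under stated assumptions: the function names and data layout are hypothetical, the DMA and memory constraints are ignored, and a file exchanged between two tasks on the same processor is counted in both sums, as in a literal reading of the constraints. The exhaustive search stands in for the solver and is only usable on tiny instances.

```python
from itertools import product


def period(mapping, t, edges, data, read, write, n_procs, b):
    """Smallest T satisfying the three throughput constraints.

    mapping: task -> processor index; t[i][k]: time of task k on processor i;
    data[(k, l)]: file size of edge (k, l); read / write: bytes per task.
    """
    compute = [0.0] * n_procs
    incoming = [0.0] * n_procs
    outgoing = [0.0] * n_procs
    for k, i in mapping.items():
        compute[i] += t[i][k]
        incoming[i] += read.get(k, 0.0)
        outgoing[i] += write.get(k, 0.0)
    for (k, l) in edges:
        # beta is 1 exactly for the pair (mapping[k], mapping[l]).
        outgoing[mapping[k]] += data[k, l]
        incoming[mapping[l]] += data[k, l]
    return max(max(compute), max(incoming) / b, max(outgoing) / b)


def best_mapping(tasks, t, edges, data, read, write, n_procs, b):
    """Try all n_procs ** len(tasks) mappings and keep the smallest period."""
    best = None
    for assignment in product(range(n_procs), repeat=len(tasks)):
        mapping = dict(zip(tasks, assignment))
        T = period(mapping, t, edges, data, read, write, n_procs, b)
        if best is None or T < best[0]:
            best = (T, mapping)
    return best
```

The throughput of a mapping is then $\rho = 1/T$, so minimizing the value returned by `period` maximizes the throughput.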

slide-31
SLIDE 31

14/ 28

Optimal mapping computation

◮ Linear program with the objective of minimizing $T$
◮ Integer (binary) variables: Mixed Integer Programming
◮ NP-complete problem
◮ Efficient solvers exist with short running time
  ◮ for small-size problems
  ◮ or when an approximate solution is searched
◮ We use CPLEX, and look for an approximate solution (within 5% of the optimal throughput is good enough)


slide-33
SLIDE 33

16/ 28

Preprocessing of the schedule

Main Objective: Compute minimal starting period and buffer sizes.

◮ $\mathrm{min\_period}_l = \max_{m \in \mathrm{prec}_l}(\mathrm{min\_period}_m) + \mathrm{peek}_l + 2$
◮ $\mathrm{min\_buff}_{i,l} = \mathrm{min\_period}_l - \mathrm{min\_period}_i$

[Figure: task graph with edges $T_i \to T_j$, $T_i \to T_k$, $T_i \to T_l$, $T_j \to T_l$, $T_k \to T_l$, annotated with the peek, min_period, and min_buff values]

slide-34
SLIDE 34

16/ 28

Preprocessing of the schedule

Main Objective: Compute minimal starting period and buffer sizes.

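Example on the graph $T_i \to T_j$, $T_i \to T_k$, $T_i \to T_l$, $T_j \to T_l$, $T_k \to T_l$:

◮ $\mathrm{peek}_i = 0$, $\mathrm{peek}_j = 1$, $\mathrm{peek}_k = 3$, $\mathrm{peek}_l = 2$
◮ $\mathrm{min\_period}_i = 0$, $\mathrm{min\_period}_j = 3$, $\mathrm{min\_period}_k = 5$, $\mathrm{min\_period}_l = 9$
◮ $\mathrm{min\_buff}_{i,j} = 3$, $\mathrm{min\_buff}_{i,k} = 5$, $\mathrm{min\_buff}_{i,l} = 9$, $\mathrm{min\_buff}_{j,l} = 6$, $\mathrm{min\_buff}_{k,l} = 4$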
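
The two preprocessing formulas can be sketched as a short recursive computation over the task graph. The function name and the data layout are hypothetical, and the sketch assumes that a task without predecessors starts at period 0, which matches the value shown for $T_i$ in the example.

```python
def preprocess(preds, peek):
    """Compute minimal starting periods and buffer sizes.

    preds: task -> list of predecessor tasks (the graph must be acyclic).
    peek:  task -> peek value.
    """
    min_period = {}

    def visit(task):
        if task not in min_period:
            if preds[task]:
                latest = max(visit(m) for m in preds[task])
                min_period[task] = latest + peek[task] + 2
            else:
                # Assumption: entry tasks start at period 0.
                min_period[task] = 0
        return min_period[task]

    for task in preds:
        visit(task)

    # One buffer per edge (m, l), sized by the gap between starting periods.
    min_buff = {(m, l): min_period[l] - min_period[m]
                for l in preds for m in preds[l]}
    return min_period, min_buff


# Four-task example: T_i feeds T_j, T_k and T_l; T_j and T_k feed T_l.
preds = {"i": [], "j": ["i"], "k": ["i"], "l": ["i", "j", "k"]}
peek = {"i": 0, "j": 1, "k": 3, "l": 2}
min_period, min_buff = preprocess(preds, peek)
```

On this example the computed periods are 0, 3, 5 and 9 for $T_i$, $T_j$, $T_k$ and $T_l$, in agreement with the values given above.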


slide-45
SLIDE 45

17/ 28

State machine of the framework

Two main phases: Computation and Communication

Computation Phase

[Figure: state machine with states "Select a Task", "Wait Resources", "Process Task", and "Signal new Data", and an exit to "Communicate"]


slide-50
SLIDE 50

17/ 28

State machine of the framework

Two main phases: Computation and Communication

Communication Phase (state diagram). States: For each inbound comm., Check input data, Check input buffers, Get Data, Watch DMA, Compute. Edge labels: No, No, No more comm.
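
One possible reading of the communication phase, as a Python sketch. The control flow is an assumption reconstructed from the state and edge labels: each inbound communication is skipped when its data is not ready or when no input buffer is free, and the loop hands over to the computation phase when no communication is left. The dictionary keys are hypothetical, and the DMA polling step is left out.

```python
def communication_phase(inbound):
    """Make one pass over the inbound communications.

    inbound: list of dicts with hypothetical keys
        name        -- identifier of the communication
        data_ready  -- True if the producer has signaled new data
        buffer_free -- True if an input buffer can store the data
    Returns (names of the transfers started, name of the next phase).
    """
    started = []
    for comm in inbound:               # "For each inbound comm."
        if not comm["data_ready"]:     # "Check input data" answered "No"
            continue
        if not comm["buffer_free"]:    # "Check input buffers" answered "No"
            continue
        started.append(comm["name"])   # "Get Data": a DMA transfer would be issued here
    # "Watch DMA" (polling for completed transfers) is omitted from this sketch.
    return started, "Compute"          # "No more comm." leads back to computation


started, next_phase = communication_phase([
    {"name": "a", "data_ready": False, "buffer_free": True},
    {"name": "b", "data_ready": True, "buffer_free": True},
    {"name": "c", "data_ready": True, "buffer_free": False},
])
```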

slide-56
SLIDE 56

18/ 28

Communication between processors

Figure: timelines of processors PK and PL, showing task instances T_1^(i), T_1^(i+1), T_2^(i-1) and T_2^(i).

slide-58
SLIDE 58

18/ 28

Communication between processors

Figure: timelines of processors PK and PL (task instances T_1^(i), T_1^(i+1), T_2^(i-1), T_2^(i)), with the message Signal Data(i).

Output buffer containing i cannot be overwritten.
Input buffers are available to store data.

mfc_putb for SPEs' outbound communications.
spe_mfcio_getb for PPEs' outbound communications to SPEs.
memcpy for PPEs' outbound communications to main memory.

slide-59
SLIDE 59

18/ 28

Communication between processors

Figure: timelines of processors PK and PL (task instances T_1^(i), T_1^(i+1), T_2^(i-1), T_2^(i)), with the messages Signal Data(i) and Get Data(i).

Output buffer containing i cannot be overwritten.
Input buffers are available to store data.

mfc_get for SPEs' inbound communications.
spe_mfcio_put for PPEs' inbound communications from SPEs.
memcpy for PPEs' inbound communications from main memory.

slide-63
SLIDE 63

18/ 28

Communication between processors

Figure: timelines of processors PK and PL (task instances T_1^(i), T_1^(i+1), T_2^(i-1), T_2^(i)), with the messages Signal Data(i), Get Data(i), Transfer Done(i) and Signal Data(i + 1).

Output buffer containing i cannot be overwritten.
Input buffers are available to store data.
Output buffer containing i can now be overwritten.

mfc_putb for SPEs' acknowledgements.
spe_mfcio_getb for PPEs' acknowledgements to SPEs.
Self-acknowledgement of PPEs' transfers from main memory.
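
The three-message exchange on these slides (Signal Data, Get Data, Transfer Done) can be sketched as a small Python simulation. This is a simplified model under stated assumptions, not the framework's code: buffers are counted rather than stored, the transfer itself is reduced to a list append, and freeing an input buffer after the task has consumed it is omitted. The class and method names are illustrative.

```python
class Producer:
    """Sender side: owns a fixed number of output buffers."""

    def __init__(self, n_out_buffers):
        self.free = n_out_buffers
        self.unacked = set()       # instances signaled but not yet acknowledged

    def signal_data(self, i):
        """Signal Data(i): possible only if an output buffer can be written."""
        if self.free == 0:
            return False
        self.free -= 1
        self.unacked.add(i)        # this buffer cannot be overwritten for now
        return True

    def transfer_done(self, i):
        """Transfer Done(i): the buffer holding instance i can now be overwritten."""
        self.unacked.remove(i)
        self.free += 1


class Consumer:
    """Receiver side: owns a fixed number of input buffers."""

    def __init__(self, n_in_buffers):
        self.free = n_in_buffers
        self.stored = []

    def get_data(self, producer, i):
        """Get Data(i): fetch instance i if it was signaled and an input buffer is free."""
        if self.free == 0 or i not in producer.unacked:
            return False
        self.free -= 1
        self.stored.append(i)      # stands in for the actual DMA transfer or memcpy
        producer.transfer_done(i)  # acknowledgement sent back to the sender
        return True
```

With a single output buffer, a second `signal_data` call fails until the consumer has fetched the first instance, which is the behaviour the "cannot be overwritten" and "can now be overwritten" annotations describe.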

slide-64
SLIDE 64

19/ 28

Experimental setup

◮ Linear-Programming: solution within 5% of the optimal, to reduce compute time.
◮ GreedyMem: simple greedy heuristic balancing memory footprint across PEs.
  ◮ Tasks are processed in topological order.
  ◮ Valid SPE with the least loaded memory is selected.
◮ GreedyCpu: simple greedy heuristic balancing compute load across PEs.
  ◮ Tasks are processed in topological order.
  ◮ Least loaded SPE is selected, provided that it has enough free memory.
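
The two greedy heuristics can be sketched in Python as follows. This is a minimal sketch under assumptions that the slide does not state: an SPE is considered "valid" when the task still fits in its memory, ties go to the lowest SPE index, and a task that fits on no SPE is mapped to the PPE. The function name, the tuple layout of `tasks` and the `balance` parameter are illustrative.

```python
def greedy_mapping(tasks, n_spes, mem_capacity, balance):
    """Map tasks to SPEs greedily.

    tasks:   list of (name, compute_cost, memory) tuples in topological order
    balance: "mem" (GreedyMem-like) or "cpu" (GreedyCpu-like)
    Returns a dict name -> SPE index, or "PPE" when no SPE has enough
    free memory (this fallback is an assumption of the sketch).
    """
    if balance not in ("mem", "cpu"):
        raise ValueError("balance must be 'mem' or 'cpu'")
    cpu_load = [0.0] * n_spes
    mem_used = [0] * n_spes
    mapping = {}
    for name, cost, mem in tasks:
        # An SPE is valid here if the task still fits in its memory.
        valid = [s for s in range(n_spes) if mem_used[s] + mem <= mem_capacity]
        if not valid:
            mapping[name] = "PPE"
            continue
        metric = mem_used if balance == "mem" else cpu_load
        best = min(valid, key=lambda s: metric[s])  # ties go to the lowest index
        cpu_load[best] += cost
        mem_used[best] += mem
        mapping[name] = best
    return mapping


tasks = [("T1", 4, 10), ("T2", 1, 30), ("T3", 1, 10), ("T4", 1, 10)]
by_cpu = greedy_mapping(tasks, n_spes=2, mem_capacity=40, balance="cpu")
by_mem = greedy_mapping(tasks, n_spes=2, mem_capacity=40, balance="mem")
```

On this hypothetical four-task input the two variants diverge on T3: balancing compute load sends it to SPE 1, while balancing memory sends it to SPE 0.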

slide-71
SLIDE 71

20/ 28

Reaching steady state

Graph 1 (figure): task graph with nodes labeled T1 to T50; each node is marked stateful or stateless and carries a PPE cost, an SPE cost (cost values not recoverable) and a peek value (0 or 1).

Plot: throughput (instances / second) as a function of the number of instances (1000 to 10000), comparing theoretical throughput and experimental throughput.

95% of the theoretical throughput is achieved after 1000 periods
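
One way to see why the measured throughput approaches the theoretical one only after many instances is a simple amortization model, sketched below in Python. The model is an assumption, not taken from the slides: processing n instances takes a fixed start-up time plus n steady-state periods. The numeric values are hypothetical and are not the measurements behind the plot.

```python
import math


def throughput(n, period, startup):
    """Instances per second when n instances take startup + n * period seconds."""
    return n / (startup + n * period)


def instances_needed(fraction, period, startup):
    """Smallest n (up to rounding) reaching `fraction` of the steady-state throughput.

    n / (startup + n * period) >= fraction / period
    is equivalent to n >= fraction * startup / ((1 - fraction) * period).
    """
    return math.ceil(fraction * startup / ((1 - fraction) * period))


period = 0.03    # hypothetical steady-state period, in seconds
startup = 1.5    # hypothetical initialization time, in seconds
n_95 = instances_needed(0.95, period, startup)
```

Under this model the required number of instances grows linearly with the ratio of start-up time to period, so a long initialization phase or a short period both delay the point where 95% is reached.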

slide-72
SLIDE 72

21/ 28

Experimental results

Graph 1 (figure): task graph with nodes labeled T1 to T50; each node is marked stateful or stateless and carries a PPE cost, an SPE cost (cost values not recoverable) and a peek value (0 or 1).

Plot: speed-up for 5000 instances as a function of the number of SPEs (1 to 8), for GreedyCpu, GreedyMem and Linear Programming.

Results are obtained over 5000 periods, 2x speedup using 8 SPEs.

slide-73
SLIDE 73

22/ 28

Experimental results

Graph 2 (figure): task graph with nodes labeled T1 to T94; each node is marked stateful or stateless and carries a PPE cost, an SPE cost (cost values not recoverable) and a peek value (0, 1 or 2).

Plot: speed-up for 5000 instances as a function of the number of SPEs (1 to 8), for GreedyCpu, GreedyMem and Linear Programming.

Results are obtained over 5000 periods, 2x speedup using 8 SPEs.

slide-74
SLIDE 74

23/ 28

Experimental results

Graph 3: 50-task deep chain graph

Plot: speed-up for 5000 instances as a function of the number of SPEs (1 to 8), for GreedyCpu, GreedyMem and Linear Programming.

Results are obtained over 5000 periods, 3x speedup using 8 SPEs.

slide-75
SLIDE 75

24/ 28

Experimental results

We vary the communication-to-computation ratio of each graph.

Plot: speed-up for 10000 instances as a function of the communication-to-computation ratio (0.5 to 5), for Random graph 1, Random graph 2 and Random graph 3.

Results are obtained over 10000 periods. The heavier the communications are, the harder it is to achieve the theoretical throughput, but increasing the number of periods helps a lot.

slide-77
SLIDE 77

26/ 28

Feedback on our approach

◮ We designed a realistic and yet tractable model of the Cell processor.
◮ Our framework allowed us to test our scheduling strategy and to compare it to simpler heuristic strategies.
◮ We have shown that our strategy:
  ◮ reaches 95% of the throughput predicted by the linear program,
  ◮ gives good and scalable speedup when using up to 8 SPEs,
  ◮ clearly outperforms simple heuristics.

Scheduling a complex application on a heterogeneous multicore processor is a challenging task. Scheduling tools can help to achieve good performance.

slide-78
SLIDE 78

27/ 28

Feedback on Cell programming

◮ Multilevel heterogeneity:
  ◮ 32-bit SPE vs 64-bit PPE architectures
  ◮ Different communication mechanisms and constraints
◮ Non-trivial initialization phase:
  ◮ Varying data structure sizes (32/64 bits)
  ◮ Runtime memory allocation
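
The point about varying data structure sizes can be illustrated with byte layouts. The Python sketch below uses the standard `struct` module only to make the sizes explicit; real Cell code would be written in C, and the descriptor (a 32-bit task id followed by one buffer address) is a hypothetical example, not a structure from the framework.

```python
import struct

# Hypothetical descriptor: a 32-bit task id followed by one buffer address.
AS_32BIT = ">II"    # big-endian, address kept as a 32-bit value
AS_64BIT = ">IQ"    # big-endian, address kept as a 64-bit value
SHARED = ">I4xQ"    # explicit layout for both sides: id, 4 pad bytes, 64-bit address


def pack_descriptor(task_id, address):
    """Serialize a descriptor using the explicit shared layout."""
    return struct.pack(SHARED, task_id, address)


def unpack_descriptor(blob):
    """Return (task_id, address) from a blob produced by pack_descriptor."""
    return struct.unpack(SHARED, blob)


sizes = (struct.calcsize(AS_32BIT), struct.calcsize(AS_64BIT), struct.calcsize(SHARED))
```

The same two fields occupy 8 bytes with a 32-bit address and 12 bytes with a 64-bit one, so a structure exchanged between the two kinds of cores needs fixed-width fields and explicit padding, as in the 16-byte `SHARED` layout.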

slide-79
SLIDE 79

28/ 28

On-going and Future work

◮ Better communication modeling
  ◮ Is a linear cost model relevant?
  ◮ Contention on concurrent DMA operations?
◮ Larger platforms
  ◮ Using multiple CELL processors
  ◮ CELL + other types of processing units?
  ◮ Work on communication modeling
◮ Design scheduling heuristics
  ◮ MIP is costly