
Steady-state scheduling on CELL

Mathias Jacquelin, joint work with Matthieu Gallet, Loris Marchal and Yves Robert

INRIA GRAAL project-team, LIP (ENS-Lyon, CNRS, INRIA), École Normale Supérieure de Lyon, France

“Scheduling for large-scale systems” workshop, Knoxville, May 14, 2009.


Outline

◮ Introduction: Steady-state scheduling, CELL
◮ Platform and Application Modeling
◮ Mapping the Application
◮ Practical Steady-State on CELL: Preprocessing of the schedule, State machine of the application
◮ Preliminary results
◮ Conclusion and Future works


Motivation

◮ Multicore architectures: a new opportunity to test the scheduling strategies designed in the GRAAL team.
◮ Our trademark: efficient scheduling on heterogeneous platforms.
◮ Most multicore architectures are homogeneous and regular, and call for tailored algorithms (linear algebra, ...).
◮ Emerging heterogeneous multicore architectures:
  ◮ dedicated processing units on GPUs,
  ◮ mixed systems: processor + accelerator.
◮ This study: steady-state scheduling on CELL (bounded heterogeneity), to demonstrate the usefulness of complex (static) scheduling techniques.
◮ Ongoing work: only preliminary results.


Introduction: Steady-state Scheduling

Rationale:

◮ A pipelined application: a simple chain (e.g. T1 → T2 → T3) or a more complex application (a Directed Acyclic Graph, e.g. tasks T1, ..., T9).
◮ Objective: optimize the throughput ρ of the application (number of input files processed per second).
◮ Today: the simple case where each task has to be mapped on one single resource (e.g. tasks T1, ..., T9 distributed over processors P1, ..., P4).

CELL brief introduction

◮ Multicore heterogeneous processor, an accelerator extension to the Power architecture: 1 PPE, 8 SPEs and the memory interface, all connected by the Element Interconnect Bus (EIB).
◮ 1 PPE core: VMX unit, L1 and L2 caches, 2-way SMT.
◮ 8 SPEs: 128-bit SIMD instruction set, 256 KB local store, dedicated asynchronous DMA engine.
◮ Element Interconnect Bus (EIB): 200 GB/s aggregate bandwidth, 25 GB/s bandwidth for each element's connection.


Platform modeling

Simple CELL modeling:

◮ 1 PPE and 8 SPEs: 9 processing elements P1, ..., P9, with unrelated speeds,
◮ Each processing element accesses the communication bus with a (bidirectional) bandwidth b = 25 GB/s,
◮ The bus is able to route all concurrent communications without contention (in a first step),
◮ Due to the limited size of the DMA stack on each SPE:
  ◮ each SPE can perform at most 16 simultaneous DMA operations,
  ◮ the PPE can perform at most 8 simultaneous DMA operations to/from a given SPE,
◮ Linear-cost communication model: a data file of size S is sent/received in time S/b.


Application modeling

The application is described by a directed acyclic graph (e.g. tasks T1, ..., T9):

◮ Tasks T1, ..., Tn,
◮ The processing time of task Tk on Pi is ti(k),
◮ If there is a dependency Tk → Tl, datak,l is the size of the file produced by Tk and needed by Tl,
◮ If Tk is an input task, it reads readk bytes from main memory,
◮ If Tk is an output task, it writes writek bytes to main memory.


Target application: vocoder

[Figure: task graph of the vocoder application (a StreamIt benchmark), made of stateful filters such as StepSource, Delay, DFTFilter (×15, each peeking 28 items ahead), RectangularToPolar, PhaseUnwrapper (×15), FirstDifference/ConstMultiplier/Accumulator chains (×15), LinearInterpolator, Decimator, Multiplier, PolarToRectangular, FIRSmoothingFilter, Deconvolve, Adder, Subtractor, FloatToShort and FileWriter, connected by duplicate and weighted round-robin splitters/joiners.]

How to compute an optimal mapping

◮ Objective: maximize the throughput ρ,
◮ Method: write a linear program gathering constraints on the mapping,
◮ Binary variables: α_i^k = 1 if Tk is mapped on Pi, 0 otherwise,
◮ Other useful binary variables: β_{i,j}^{k,l} = 1 iff the file of dependency Tk → Tl is transferred from Pi to Pj.


Constraints 1/2

On the application structure:

◮ Each task is mapped on a processor: ∀Tk, Σ_i α_i^k = 1,
◮ Given a dependency Tk → Tl, the processor computing Tl must receive the corresponding file: ∀(k,l) ∈ E, ∀Pj, Σ_i β_{i,j}^{k,l} ≥ α_j^l,
◮ Given a dependency Tk → Tl, only the processor computing Tk can send the corresponding file: ∀(k,l) ∈ E, ∀Pi, Σ_j β_{i,j}^{k,l} ≤ α_i^k.


Constraints 2/2

On the achievable throughput ρ = 1/T:

◮ On a given processor, all tasks must be completed within T: ∀Pi, Σ_k α_i^k · ti(k) ≤ T,
◮ All incoming communications must be completed within T: ∀Pj, (1/b) ( Σ_k α_j^k · readk + Σ_{(k,l)∈E} Σ_i β_{i,j}^{k,l} · datak,l ) ≤ T,
◮ All outgoing communications must be completed within T: ∀Pi, (1/b) ( Σ_k α_i^k · writek + Σ_{(k,l)∈E} Σ_j β_{i,j}^{k,l} · datak,l ) ≤ T,

plus constraints on the number of incoming/outgoing communications to respect the DMA requirements, and constraints on the available memory on each SPE.
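To illustrate what these constraints compute, here is a minimal brute-force sketch (not the MIP solved with CPLEX in the talk): it evaluates every mapping α of a toy 3-task chain on 2 unrelated processors and keeps the one minimizing the period T. All task and platform numbers below are invented for the example.

```python
from itertools import product

# Toy instance (hypothetical numbers): chain T0 -> T1 -> T2, 2 processors.
t = [[4.0, 2.0, 6.0],   # t_0(k): time of task k on P0 (unrelated speeds)
     [8.0, 1.0, 3.0]]   # t_1(k): time of task k on P1
edges = {(0, 1): 5.0, (1, 2): 5.0}   # data_{k,l}: file sizes on dependencies
read = [10.0, 0.0, 0.0]              # read_k for input tasks
write = [0.0, 0.0, 10.0]             # write_k for output tasks
b = 25.0                             # bus bandwidth

def period(mapping):
    """T for one mapping: max over processors of compute time,
    incoming-communication time and outgoing-communication time."""
    nproc = len(t)
    comp = [0.0] * nproc
    inc = [0.0] * nproc   # bytes entering each processor
    out = [0.0] * nproc   # bytes leaving each processor
    for k, p in enumerate(mapping):
        comp[p] += t[p][k]
        inc[p] += read[k]
        out[p] += write[k]
    for (k, l), size in edges.items():
        if mapping[k] != mapping[l]:   # file crosses processors
            out[mapping[k]] += size
            inc[mapping[l]] += size
    return max(max(comp), max(x / b for x in inc), max(x / b for x in out))

# Exhaustive search over all assignments (alpha_i^k in {0, 1}).
best = min(product(range(2), repeat=3), key=period)
print(best, period(best))   # (0, 1, 1) 4.0  -> throughput rho = 1/4
```

Here the balanced split (T0 on P0, T1 and T2 on P1) wins because communication is cheap relative to computation; the real MIP adds the DMA-count and SPE-memory constraints on top of this.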


Optimal mapping computation

◮ Linear program with the objective of minimizing T,
◮ Integer (binary) variables: Mixed Integer Programming,
◮ NP-complete problem,
◮ Efficient solvers exist with short running times for small-size problems, or when an approximate solution is searched for,
◮ We use CPLEX and look for an approximate solution (within 5% of the optimal throughput is good enough).


Preprocessing of the schedule

Main objective: compute the minimal starting periods and buffer sizes.

◮ min_period_l = max_{m ∈ prec(l)} (min_period_m) + peek_l + 2,
◮ min_buff_{i,l} = min_period_l − min_period_i.

Example, on four tasks Ti, Tj, Tk, Tl with peek_i = 0, peek_j = 1, peek_k = 3, peek_l = 2:
min_period_i = 0, min_period_j = 3, min_period_k = 5, min_period_l = 9, and
min_buff_{i,j} = 3, min_buff_{i,k} = 5, min_buff_{i,l} = 9, min_buff_{j,l} = 6, min_buff_{k,l} = 4.
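The recurrence can be sketched in a few lines; the task graph and peek values below reproduce the slide example, and the convention that a source task starts at period 0 is taken from the example (min_period_i = 0).

```python
# Preprocessing recurrence from the slides:
#   min_period_l = max over predecessors m of min_period_m + peek_l + 2
#   min_buff_{i,l} = min_period_l - min_period_i
peek = {"i": 0, "j": 1, "k": 3, "l": 2}
preds = {"i": [], "j": ["i"], "k": ["i"], "l": ["i", "j", "k"]}

def min_periods(peek, preds):
    """Traverse tasks in topological order and apply the recurrence."""
    period = {}
    for task in ("i", "j", "k", "l"):   # already topologically sorted
        base = max((period[m] for m in preds[task]), default=None)
        # Assumption: a source task starts at period 0 (slide: min_period_i = 0).
        period[task] = 0 if base is None else base + peek[task] + 2
    return period

period = min_periods(peek, preds)
buff = {(m, t): period[t] - period[m] for t in preds for m in preds[t]}
print(period)   # {'i': 0, 'j': 3, 'k': 5, 'l': 9}
print(buff)     # includes e.g. buff[('k', 'l')] == 4, as on the slide
```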


State machine of the application

Two main phases: Computation and Communication.

◮ Computation phase: Select a Task → Wait for Resources → Process the Task → Signal the new Data, then switch to the communication phase.
◮ Communication phase: for each inbound communication, Check the input data → Check the input buffers → Get the Data; Watch the DMA operations; when no more communications are pending, go back to Compute.
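One way the communication-phase logic could be sketched is below. The state names come from the diagram; the queue contents, the `dma` list and the buffer predicate are invented for illustration (on a real SPE, "Watch DMA" would poll MFC tag groups).

```python
# Hedged sketch of the per-period communication phase of the state machine.
def communication_phase(inbound, input_buffer_free, dma):
    """Start a 'Get Data' for every inbound file whose data is ready and
    whose input buffer is free; return the files to retry next period."""
    pending = []
    for comm in inbound:                                    # "For each inbound comm."
        if comm["data_ready"] and input_buffer_free(comm):  # the two checks
            dma.append(comm["file"])                        # "Get Data"
        else:
            pending.append(comm)                            # retry later
    # "Watch DMA": here we simply drain the list of started transfers.
    completed, dma[:] = list(dma), []
    return pending, completed

inbound = [{"file": "f1", "data_ready": True},
           {"file": "f2", "data_ready": False}]
pending, done = communication_phase(inbound, lambda c: True, [])
# f1 was fetched; f2 stays pending because its producer has not signalled yet.
```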


Communication between processors

Processor PK runs task T1 on successive instances (i, i+1, ...) while PL runs task T2 on the instances produced by PK (i−1, i, ...). A file transfer for instance i proceeds in three steps:

◮ Signal Data(i): PK signals that the file is ready; the output buffer containing i cannot be overwritten, while PL's input buffers are available to store data. (mfc_putb for SPEs' outbound communications; spe_mfcio_getb for the PPE's outbound communications to SPEs; memcpy for the PPE's outbound communications to main memory.)
◮ Get Data(i): PL fetches the file into an input buffer. (mfc_get for SPEs' inbound communications; spe_mfcio_put for the PPE's inbound communications from SPEs; memcpy for the PPE's inbound communications from main memory.)
◮ Transfer Done(i): PL acknowledges the transfer; the output buffer containing i can now be overwritten, and PK can issue Signal Data(i+1). (mfc_putb for SPEs' acknowledgements; spe_mfcio_getb for the PPE's acknowledgements to SPEs; self-acknowledgement of the PPE's transfers from main memory.)
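The invariant behind the three steps is that PK never overwrites a buffer before its Transfer Done. A toy re-enactment (illustration only; names are assumptions, and the real DMA calls are replaced by list operations):

```python
# Three-step handshake from the slides: Signal Data(i) -> Get Data(i) ->
# Transfer Done(i). A single shared slot plays the role of PK's output buffer.
def run_handshake(n_instances):
    out_buf = None           # None means "acknowledged, safe to reuse"
    trace, received = [], []
    for i in range(n_instances):
        assert out_buf is None, "previous buffer not acknowledged yet"
        out_buf = i                           # PK computes T1 on instance i
        trace.append(f"signal_data({i})")     # buffer locked until Transfer Done
        received.append(out_buf)              # PL: Get Data(i)
        trace.append(f"get_data({i})")
        trace.append(f"transfer_done({i})")   # PL acknowledges
        out_buf = None                        # PK may now overwrite the buffer
    return trace, received

trace, received = run_handshake(3)
```

With several buffers per dependency (the min_buff sizes computed during preprocessing), Signal Data(i+1) can overlap Get Data(i), which is what the steady-state schedule exploits.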


Preliminary results

We outperform both the greedy heuristic and the sequential version.

[Figure: bar chart of the throughput achieved by the sequential version, the greedy heuristic, and the linear-program mapping; y-axis from 0 to 400. Results are obtained over 70,000 periods.]


Feedback on Cell programming

◮ Multilevel heterogeneity:
  ◮ 32-bit SPE vs. 64-bit PPE architectures,
  ◮ different communication mechanisms and constraints.
◮ Non-trivial initialization phase:
  ◮ varying data-structure sizes (32/64 bits),
  ◮ runtime memory allocation.


On-going and Future work

◮ Various code optimizations:
  ◮ SIMD code for SPEs,
  ◮ reduce control overhead.
◮ Better communication modeling:
  ◮ is the linear cost model relevant?
  ◮ contention on concurrent DMA operations?
◮ Larger platforms:
  ◮ using multiple CELL processors,
  ◮ CELL + other types of processing units?
  ◮ work on communication modeling.
◮ Design scheduling heuristics:
  ◮ MIP is costly.