Steady-state scheduling on CELL

Mathias Jacquelin, joint work with Matthieu Gallet, Loris Marchal and Yves Robert
INRIA GRAAL project-team, LIP (ENS Lyon, CNRS, INRIA), École Normale Supérieure de Lyon, France
Outline
◮ Introduction
  ◮ Steady-state scheduling
  ◮ CELL
◮ Platform and Application Modeling
◮ Mapping the Application
◮ Practical Steady-State on CELL
  ◮ Preprocessing of the schedule
  ◮ State machine of the application
◮ Preliminary results
◮ Conclusion and Future works
Motivation

◮ Multicore architectures: a new opportunity to test the scheduling strategies designed in the GRAAL team.
◮ Our trademark: efficient scheduling on heterogeneous platforms
◮ Most multicore architectures are homogeneous and regular
  ◮ Need for tailored algorithms (linear algebra, ...)
◮ Emerging heterogeneous multicore:
  ◮ Dedicated processing units on GPUs
  ◮ Mixed systems: processor + accelerator
◮ This study: steady-state scheduling on CELL (bounded heterogeneity) to demonstrate the usefulness of complex (static) scheduling techniques
◮ Ongoing work: only preliminary results
Introduction: Steady-state Scheduling

Rationale:
◮ A pipelined application:
  ◮ Simple chain: T1 → T2 → T3
  ◮ More complex application (Directed Acyclic Graph): T1, ..., T9
◮ Objective: optimize the throughput ρ of the application (number of input files processed per second)
◮ Today: simple case where each task has to be mapped on one single resource
◮ Example: tasks T1, ..., T9 mapped onto processors P1, ..., P4
CELL brief introduction

◮ Multicore heterogeneous processor
◮ Accelerator extension to the Power architecture

[Diagram: 1 PPE and 8 SPEs connected to main memory through the Element Interconnect Bus (EIB)]

◮ 1 PPE core
  ◮ VMX unit
  ◮ L1, L2 cache
  ◮ 2-way SMT
◮ 8 SPEs
  ◮ 128-bit SIMD instruction set
  ◮ 256 KB local store
  ◮ Dedicated asynchronous DMA engine
◮ Element Interconnect Bus (EIB)
  ◮ 200 GB/s bandwidth
◮ Memory interface: 25 GB/s bandwidth
Platform modeling

Simple CELL modeling:
◮ 1 PPE and 8 SPEs: 9 processing elements P1, ..., P9, with unrelated speeds,
◮ Each processing element accesses the communication bus with a (bidirectional) bandwidth b = 25 GB/s,
◮ The bus is able to route all concurrent communications without contention (in a first step),
◮ Due to the limited size of the DMA stack on each SPE:
  ◮ Each SPE can perform at most 16 simultaneous DMA operations,
  ◮ The PPE can perform at most 8 simultaneous DMA operations to/from a given SPE.
◮ Linear-cost communication model: a data item of size S is sent/received in time S/b
Application modeling

The application is described by a directed acyclic graph:
◮ Tasks T1, ..., Tn
◮ The processing time of task Tk on Pi is t_i(k),
◮ If there is a dependency Tk → Tl, data_{k,l} is the size of the file produced by Tk and needed by Tl,
◮ If Tk is an input task, it reads read_k bytes from main memory,
◮ If Tk is an output task, it writes write_k bytes to main memory.

[Example DAG with tasks T1, ..., T9]
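The application model above can be sketched as a small data structure. This is an illustrative sketch only (the class and field names are not from the original work): tasks, per-processor processing times t_i(k), dependency file sizes data_{k,l}, and read/write volumes for input/output tasks.

```python
from dataclasses import dataclass, field

@dataclass
class App:
    tasks: list                                # task names, e.g. ["T1", ..., "Tn"]
    proc_time: dict                            # (processor, task) -> t_i(k)
    data: dict                                 # (Tk, Tl) -> size of the file Tk produces for Tl
    read: dict = field(default_factory=dict)   # input task -> bytes read from main memory
    write: dict = field(default_factory=dict)  # output task -> bytes written to main memory

    def predecessors(self, task):
        # tasks whose output files this task needs
        return [k for (k, l) in self.data if l == task]

# Tiny 3-task chain T1 -> T2 -> T3 on two processing elements
app = App(
    tasks=["T1", "T2", "T3"],
    proc_time={("P1", "T1"): 2.0, ("P1", "T2"): 5.0, ("P1", "T3"): 1.0,
               ("P2", "T1"): 1.0, ("P2", "T2"): 2.0, ("P2", "T3"): 1.0},
    data={("T1", "T2"): 1024, ("T2", "T3"): 2048},
    read={"T1": 4096}, write={"T3": 512},
)
print(app.predecessors("T3"))   # ['T2']
```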
Target application: vocoder

[StreamIt task graph of the vocoder application: a stateful StepSource (work = 21), IntToFloat (work = 6), a stateful Delay (work = 215), a 15-way duplicate feeding 15 stateful DFTFilter tasks (work = 66 each, each peeking 28 elements ahead), weighted round-robin joins, RectangularToPolar (work = 9105), PolarToRectangular (work = 5060), and an FIR smoothing filter]
How to compute an optimal mapping

◮ Objective: maximize the throughput ρ
◮ Method: write a linear program gathering the constraints on the mapping
◮ Binary variables: α_i^k = 1 if Tk is mapped on Pi, 0 otherwise
◮ Other useful binary variables: β_{i,j}^{k,l} = 1 iff the file of dependency Tk → Tl is transferred from Pi to Pj
Constraints 1/2

On the application structure:
◮ Each task is mapped on a processor:
  ∀Tk, Σ_i α_i^k = 1
◮ Given a dependency Tk → Tl, the processor computing Tl must receive the corresponding file:
  ∀(k, l) ∈ E, ∀Pj, Σ_i β_{i,j}^{k,l} ≥ α_j^l
◮ Given a dependency Tk → Tl, only the processor computing Tk can send the corresponding file:
  ∀(k, l) ∈ E, ∀Pi, Σ_j β_{i,j}^{k,l} ≤ α_i^k
Constraints 2/2

On the achievable throughput ρ = 1/T:
◮ On a given processor, all tasks must be completed within T:
  ∀Pi, Σ_k α_i^k · t_i(k) ≤ T
◮ All incoming communications must be completed within T:
  ∀Pj, (1/b) · ( Σ_k α_j^k · read_k + Σ_{(k,l) ∈ E} Σ_i β_{i,j}^{k,l} · data_{k,l} ) ≤ T
◮ All outgoing communications must be completed within T:
  ∀Pi, (1/b) · ( Σ_k α_i^k · write_k + Σ_{(k,l) ∈ E} Σ_j β_{i,j}^{k,l} · data_{k,l} ) ≤ T

Plus constraints on the number of incoming/outgoing communications to respect the DMA requirements, and constraints on the available memory on each SPE.
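For a fixed mapping, the throughput constraints boil down to taking a maximum per processor. The sketch below is illustrative only (it is not the authors' CPLEX program, and it simplifies β by sending each file directly from producer to consumer): the period T is the largest of the compute, incoming, and outgoing loads, and the throughput is ρ = 1/T.

```python
def period(procs, tasks, edges, alpha, t, data, read, write, b):
    """Evaluate the period T of a fixed mapping alpha: task -> processor.

    Simplification: the file of each dependency (k, l) is assumed to go
    directly from alpha[k] to alpha[l] (only the producer may send it).
    """
    T = 0.0
    for p in procs:
        # compute constraint: all tasks mapped on p finish within T
        compute = sum(t[(p, k)] for k in tasks if alpha[k] == p)
        # incoming: reads from main memory + files received from other processors
        incoming = (sum(read.get(k, 0) for k in tasks if alpha[k] == p)
                    + sum(data[(k, l)] for (k, l) in edges
                          if alpha[l] == p and alpha[k] != p)) / b
        # outgoing: writes to main memory + files sent to other processors
        outgoing = (sum(write.get(k, 0) for k in tasks if alpha[k] == p)
                    + sum(data[(k, l)] for (k, l) in edges
                          if alpha[k] == p and alpha[l] != p)) / b
        T = max(T, compute, incoming, outgoing)
    return T

T = period(procs=["P1", "P2"], tasks=["T1", "T2"],
           edges=[("T1", "T2")],
           alpha={"T1": "P1", "T2": "P2"},
           t={("P1", "T1"): 3.0, ("P2", "T2"): 2.0},
           data={("T1", "T2"): 50.0},
           read={"T1": 25.0}, write={"T2": 25.0}, b=25.0)
print(T)   # 3.0: P1's compute time dominates, so rho = 1/3
```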
Optimal mapping computation

◮ Linear program with the objective of minimizing T
◮ Integer (binary) variables: Mixed Integer Programming
◮ NP-complete problem
◮ Efficient solvers exist with short running times
  ◮ for small-size problems
  ◮ or when an approximate solution is sought
◮ We use CPLEX and look for an approximate solution (within 5% of the optimal throughput is good enough)
Preprocessing of the schedule

Main objective: compute the minimal starting period of each task and the buffer size of each dependency:
◮ min_period_l = max_{m ∈ prec(l)} (min_period_m) + peek_l + 2
◮ min_buff_{i,l} = min_period_l − min_period_i

Example on a 4-task graph with edges Ti → Tj, Ti → Tk, Ti → Tl, Tj → Tl, Tk → Tl:
◮ peek_i = 0, peek_j = 1, peek_k = 3, peek_l = 2
◮ min_period_i = 0, min_period_j = 3, min_period_k = 5, min_period_l = 9
◮ min_buff_{i,j} = 3, min_buff_{i,k} = 5, min_buff_{i,l} = 9, min_buff_{j,l} = 6, min_buff_{k,l} = 4

[Animation: periods 0 through 9 of the schedule, showing when each task starts and how the buffers fill]
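The preprocessing step above can be sketched directly from the two formulas (a traversal of the DAG in topological order; source tasks start at period 0, as in the example). The code reproduces the numbers of the 4-task example.

```python
# peek values and edges of the 4-task example graph
peek = {"Ti": 0, "Tj": 1, "Tk": 3, "Tl": 2}
edges = [("Ti", "Tj"), ("Ti", "Tk"), ("Ti", "Tl"),
         ("Tj", "Tl"), ("Tk", "Tl")]

def preprocess(peek, edges, topo_order):
    preds = {t: [k for (k, l) in edges if l == t] for t in peek}
    min_period = {}
    for t in topo_order:
        if not preds[t]:
            min_period[t] = 0   # source tasks start at period 0
        else:
            # min_period_l = max over predecessors + peek_l + 2
            min_period[t] = (max(min_period[m] for m in preds[t])
                             + peek[t] + 2)
    # min_buff_{i,l} = min_period_l - min_period_i
    min_buff = {(k, l): min_period[l] - min_period[k] for (k, l) in edges}
    return min_period, min_buff

min_period, min_buff = preprocess(peek, edges, ["Ti", "Tj", "Tk", "Tl"])
print(min_period)               # {'Ti': 0, 'Tj': 3, 'Tk': 5, 'Tl': 9}
print(min_buff[("Tj", "Tl")])   # 6
```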
State machine of the application

Two main phases: computation and communication.

Computation phase:
◮ Select a task
◮ Wait for resources
◮ Process the task
◮ Signal the new data
◮ Then switch to the communication phase

Communication phase, for each inbound communication:
◮ Check the input data (if none is available, skip this communication)
◮ Check the input buffers (if full, skip this communication)
◮ Get the data
◮ Watch the DMA
◮ When no more communications remain, go back to compute
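The two phases above can be sketched as a single per-processor loop. This is an illustrative stub only: FakeProcessor, FakeComm, and every method name are hypothetical stand-ins that just record the transitions, not the actual CELL implementation.

```python
class FakeComm:
    """One inbound communication; `ready` says whether data is available."""
    def __init__(self, trace, ready):
        self.trace, self.ready = trace, ready
    def data_ready(self):  return self.ready
    def buffer_free(self): return True
    def get_data(self):    self.trace.append("get_data")   # start the DMA

class FakeProcessor:
    def __init__(self):
        self.trace = []
        # one communication with data ready, one without
        self.comms = [FakeComm(self.trace, True), FakeComm(self.trace, False)]
    def select_task(self):        self.trace.append("select");  return "T1"
    def wait_resources(self, t):  self.trace.append("wait")
    def process(self, t):         self.trace.append("process")
    def signal_new_data(self, t): self.trace.append("signal")
    def inbound_comms(self):      return self.comms
    def watch_dma(self):          self.trace.append("watch_dma")

def run(proc, periods):
    for _ in range(periods):
        # --- computation phase ---
        task = proc.select_task()
        proc.wait_resources(task)
        proc.process(task)
        proc.signal_new_data(task)
        # --- communication phase: fetch only when data and a buffer exist ---
        for comm in proc.inbound_comms():
            if comm.data_ready() and comm.buffer_free():
                comm.get_data()
        proc.watch_dma()          # poll DMA completion, then compute again

p = FakeProcessor()
run(p, 1)
print(p.trace)  # ['select', 'wait', 'process', 'signal', 'get_data', 'watch_dma']
```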
Communication between processors

Steady-state protocol between PK (producing instances of T1) and PL (consuming them as T2), for instance i:
◮ Signal Data(i): PK signals that instance i is available; the output buffer containing i cannot be overwritten yet.
  ◮ mfc_putb for SPEs' outbound communications.
  ◮ spe_mfcio_getb for the PPE's outbound communications to SPEs.
  ◮ memcpy for the PPE's outbound communications to main memory.
◮ Get Data(i): PL fetches the data once its input buffers are available to store it.
  ◮ mfc_get for SPEs' inbound communications.
  ◮ spe_mfcio_put for the PPE's inbound communications from SPEs.
  ◮ memcpy for the PPE's inbound communications from main memory.
◮ Transfer Done(i): PL acknowledges the transfer; the output buffer containing i can now be overwritten.
  ◮ mfc_putb for SPEs' acknowledgements.
  ◮ spe_mfcio_getb for the PPE's acknowledgements to SPEs.
  ◮ Self-acknowledgement of the PPE's transfers from main memory.
◮ Signal Data(i+1): PK moves on to the next instance.
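The handshake can be modeled as a toy simulation. This is an illustrative sketch with hypothetical names, not the DMA code: the key invariant is that the producer may not overwrite the output buffer holding instance i until the consumer's Transfer Done(i) acknowledgement has arrived (on the real CELL, the signals map to the mfc_putb / spe_mfcio_getb calls of the slide and the data moves by DMA).

```python
class Channel:
    """One producer->consumer dependency in steady state."""
    def __init__(self):
        self.signaled = None   # instance announced by Signal Data(i)
        self.acked = -1        # last instance acknowledged by Transfer Done

    # producer (PK) side
    def signal_data(self, i):
        # the buffer of instance i-1 must have been acknowledged,
        # otherwise PK would overwrite data still in flight
        assert self.acked >= i - 1, "buffer for previous instance still in flight"
        self.signaled = i

    # consumer (PL) side
    def get_data(self):
        i = self.signaled
        # ... DMA transfer of instance i would happen here ...
        self.transfer_done(i)
        return i

    def transfer_done(self, i):
        self.acked = i         # PK may now reuse the output buffer

ch = Channel()
for i in range(3):
    ch.signal_data(i)          # PK: Signal Data(i)
    got = ch.get_data()        # PL: Get Data(i) then Transfer Done(i)
    assert got == i
print(ch.acked)   # 2
```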
Preliminary results

We outperform both the greedy heuristic and the sequential version.

[Bar chart: throughput of the Sequential, Greedy, and Linear Program versions; y-axis from 0 to 400]

Results are obtained over 70,000 periods.
Feedback on CELL programming

◮ Multilevel heterogeneity:
  ◮ 32-bit SPE vs 64-bit PPE architectures
  ◮ Different communication mechanisms and constraints
◮ Non-trivial initialization phase:
  ◮ Varying data-structure sizes (32/64 bit)
  ◮ Runtime memory allocation
On-going and future work

◮ Various code optimizations:
  ◮ SIMD code for the SPEs
  ◮ Reduce the control overhead
◮ Better communication modeling:
  ◮ Is the linear cost model relevant?
  ◮ Contention on concurrent DMA operations?
◮ Larger platforms:
  ◮ Using multiple CELL processors
  ◮ CELL + other types of processing units?
  ◮ Work on communication modeling
◮ Design scheduling heuristics:
  ◮ MIP is costly