Scheduling complex streaming applications on the Cell processor

  1. Scheduling complex streaming applications on the Cell processor
     Mathias Jacquelin, joint work with Matthieu Gallet and Loris Marchal
     INRIA ROMA project-team, LIP (ENS-Lyon, CNRS, INRIA), École Normale Supérieure de Lyon, France
     Workshop on Multithreaded Architectures and Applications, Atlanta, April 23, 2010

  2. Outline: Introduction; Steady-state scheduling; CELL; Platform and Application Modeling; Mapping the Application; Practical Steady-State on CELL; Preprocessing of the schedule; State machine of the framework; Experimental results; Conclusion and Future work

  3. Motivation
     ◮ Multicore architectures: a new opportunity to test the scheduling strategies designed in the ROMA team.
     ◮ Our trademark: efficient scheduling on heterogeneous platforms
     ◮ Most multicore architectures are homogeneous and regular
     ◮ Need for tailored algorithms (linear algebra, ...)
     ◮ Emerging heterogeneous multicore:
       ◮ Dedicated processing units on GPUs
       ◮ Mixed system: processor + accelerator
     ◮ This study: steady-state scheduling on CELL (bounded heterogeneity) to demonstrate the usefulness of complex (static) scheduling techniques

  6. Introduction: Steady-state Scheduling
     Rationale:
     ◮ A pipelined application:
       ◮ Simple chain
       ◮ More complex application (Directed Acyclic Graph)
     ◮ Objective: optimize the throughput of the application (number of input files processed per second)
     ◮ Today: simple case where each task has to be mapped on one single resource
     [Figure: a chain of tasks T1, T2, T3; a task graph T1 to T9 whose tasks are mapped onto processors P1 to P4]
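To make the throughput objective concrete, here is a minimal Python sketch. It rests on assumptions that are not stated on the slide: communications are ignored, and in steady state each processing element repeats the same work for every input file, so the period is the largest per-element compute load and the throughput is its inverse. The function names `period` and `throughput` and all numeric values are hypothetical.

```python
from collections import defaultdict

def period(mapping, times):
    """Steady-state period: largest total compute time on any processing element.

    mapping: task name -> processing element name (each task on one single resource)
    times:   (task, processing element) -> processing time in seconds (hypothetical values)
    Communication costs are ignored in this sketch (assumption).
    """
    load = defaultdict(float)
    for task, pe in mapping.items():
        load[pe] += times[(task, pe)]
    return max(load.values())

def throughput(mapping, times):
    """Input files processed per second in steady state."""
    return 1.0 / period(mapping, times)

# A simple chain T1 -> T2 -> T3 mapped on two processing elements.
mapping = {"T1": "P1", "T2": "P2", "T3": "P2"}
times = {("T1", "P1"): 0.004, ("T2", "P2"): 0.001, ("T3", "P2"): 0.001}
print(period(mapping, times))      # load of the bottleneck element, in seconds
print(throughput(mapping, times))  # input files per second
```

In this example P1 is the bottleneck, so moving T2 or T3 to another element would not improve the throughput; only a faster placement of T1 would.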

  11. CELL brief introduction
      ◮ Multicore heterogeneous processor
      ◮ Accelerator extension to Power architecture
      [Figure: PPE 0, SPE 0 to SPE 7 and MEMORY connected through the EIB]
      ◮ 1 PPE core
        ◮ VMX unit
        ◮ L1, L2 cache
        ◮ 2-way SMT
      ◮ 8 SPEs
        ◮ 128-bit SIMD instruction set
        ◮ Local store 256 KB
        ◮ Dedicated asynchronous DMA engine
      ◮ Element Interconnect Bus (EIB)
        ◮ 200 GB/s bandwidth
      ◮ 25 GB/s bandwidth (the same value as the per-element bandwidth b of the platform model)

  19. Platform modeling
      Simple CELL modeling:
      ◮ 1 PPE and 8 SPEs: 9 processing elements $P_1, \dots, P_9$, with unrelated speeds,
      ◮ Each processing element accesses the communication bus with a (bidirectional) bandwidth $b = 25$ GB/s,
      ◮ The bus is able to route all concurrent communications without contention (in a first step),
      ◮ Due to the limited size of the DMA stack on each SPE:
        ◮ Each SPE can perform at most 16 simultaneous DMA operations,
        ◮ The PPE can perform at most 8 simultaneous DMA operations to/from a given SPE.
      ◮ Linear cost communication model: data of size $S$ is sent or received in time $S/b$
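The linear cost model and the DMA limits above can be illustrated with a small Python sketch. Several points are assumptions of the sketch rather than statements from the slides: 1 GB is taken as 10^9 bytes, processing elements are named `"PPE"` and `"SPE0"` to `"SPE7"`, and every transfer is conservatively charged to each SPE it touches. The helper names `transfer_time` and `dma_limits_ok` are hypothetical.

```python
BANDWIDTH = 25e9         # b = 25 GB/s, taking 1 GB = 10**9 bytes (assumption)
MAX_DMA_PER_SPE = 16     # simultaneous DMA operations per SPE
MAX_PPE_DMA_PER_SPE = 8  # simultaneous PPE DMA operations to/from a given SPE

def transfer_time(size_bytes, bandwidth=BANDWIDTH):
    """Linear cost model: data of size S is sent or received in time S / b."""
    return size_bytes / bandwidth

def dma_limits_ok(transfers):
    """Check the DMA limits for a set of simultaneous transfers.

    transfers: list of (source, destination) element names.
    Charging a transfer to both of its SPE endpoints is a conservative
    assumption of this sketch, not something the slides specify.
    """
    per_spe = {}      # DMA operations touching each SPE
    ppe_per_spe = {}  # PPE transfers to/from each SPE
    for src, dst in transfers:
        for pe in (src, dst):
            if pe.startswith("SPE"):
                per_spe[pe] = per_spe.get(pe, 0) + 1
        if "PPE" in (src, dst):
            spe = dst if src == "PPE" else src
            ppe_per_spe[spe] = ppe_per_spe.get(spe, 0) + 1
    return (all(n <= MAX_DMA_PER_SPE for n in per_spe.values())
            and all(n <= MAX_PPE_DMA_PER_SPE for n in ppe_per_spe.values()))

print(transfer_time(1_000_000))                # time in seconds to move 10**6 bytes
print(dma_limits_ok([("PPE", "SPE0")] * 9))    # nine PPE transfers with one SPE
```

Under these assumptions, nine simultaneous PPE transfers with the same SPE exceed the limit of 8, while sixteen SPE-to-SPE transfers between the same pair are still allowed.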

  20. Application modeling
      Application is described by a directed acyclic graph:
      ◮ Tasks $T_1, \dots, T_n$
      ◮ Processing time of task $T_k$ on $P_i$ is $t_i(k)$,
      ◮ If there is a dependency $T_k \to T_l$, $\mathit{data}_{k,l}$ is the size of the file produced by $T_k$ and needed by $T_l$,
      ◮ If $T_k$ is an input task, it reads $\mathit{read}_k$ bytes from main memory,
      ◮ If $T_k$ is an output task, it writes $\mathit{write}_k$ bytes to main memory.
      [Figure: example task graph with tasks T1 to T9]
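One possible way to hold this application model in code is sketched below in Python. The `TaskGraph` class, the `traffic` helper, and the three-task example are hypothetical and only mirror the quantities defined above. The sketch also assumes that a dependency costs no communication when both tasks are mapped on the same processing element, which the slide does not state.

```python
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    """Application model: a DAG with per-element processing times and data sizes."""
    times: dict                                # (task, element) -> t_i(k), in seconds
    data: dict = field(default_factory=dict)   # (T_k, T_l) -> data_{k,l}, in bytes
    read: dict = field(default_factory=dict)   # input task -> read_k, in bytes
    write: dict = field(default_factory=dict)  # output task -> write_k, in bytes

def traffic(graph, mapping):
    """Bytes received and sent by each processing element per input file.

    An edge is free when both of its tasks share an element (assumption).
    """
    incoming = {pe: 0 for pe in mapping.values()}
    outgoing = {pe: 0 for pe in mapping.values()}
    for (src, dst), size in graph.data.items():
        if mapping[src] != mapping[dst]:
            outgoing[mapping[src]] += size
            incoming[mapping[dst]] += size
    for task, size in graph.read.items():      # input tasks read from main memory
        incoming[mapping[task]] += size
    for task, size in graph.write.items():     # output tasks write to main memory
        outgoing[mapping[task]] += size
    return incoming, outgoing

# Hypothetical three-task example: T1 feeds T2 and T3, T2 feeds T3.
g = TaskGraph(
    times={("T1", "P1"): 0.002, ("T2", "P2"): 0.001, ("T3", "P2"): 0.003},
    data={("T1", "T2"): 4096, ("T1", "T3"): 8192, ("T2", "T3"): 1024},
    read={"T1": 16384},
    write={"T3": 2048},
)
incoming, outgoing = traffic(g, {"T1": "P1", "T2": "P2", "T3": "P2"})
print(incoming, outgoing)
```

Dividing these byte counts by the bandwidth b gives the time each element spends communicating per input file under the linear cost model.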

  23. Target application: any DAG
      ◮ Today, we will focus on three random task graphs
      [Figure: the three random task graphs]
