Optimizing Explicit Data Transfers for Data Parallel Applications on Heterogeneous Multi-core Platforms
S. Saidi (1,2), P. Tendulkar (1), T. Lepley (2), O. Maler (1)
(1) Verimag, (2) STMicroelectronics
HiPEAC 2012


  1. Optimizing Explicit Data Transfers for Data Parallel Applications on Heterogeneous Multi-core Platforms. S. Saidi (1,2), P. Tendulkar (1), T. Lepley (2), O. Maler (1); (1) Verimag, (2) STMicroelectronics. HiPEAC 2012. Slide footer: Saidi, Tendulkar, Lepley, Maler, "Data Transfers in Arch."

  2. Outline: (1) Introduction; (2) Optimal Granularity: One Processor, Multiple Processors; (3) Shared Data Transfers; (4) Experiments on the CELL Architecture.

  4. Introduction: Motivation. How can the off-chip memory latency be reduced or hidden? [Figure: a host CPU and a multi-core fabric (PE 0 ... PE n, with on-chip memories) connected through an interconnect to the off-chip memory.]

  5. Introduction: Heterogeneous Multi-core Architectures. A powerful host processor and a multi-core fabric to accelerate computationally heavy kernels. [Figure: a host CPU and a multi-core fabric (PE 0 ... PE n, with on-chip memories) connected through an interconnect to the off-chip memory; the final build of the slide annotates the platform with tasks T0, T1 and T2.]

  7. Introduction: Data Transfers. Offloadable kernels work on large data sets, initially stored in the off-chip memory. Algorithm: for i = 0 to n − 1 do Y[i] = f(X[i]) od. [Figure: task T0 runs on PE 0; the arrays X and Y reside in the off-chip memory.]
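The algorithm on this slide is an element-wise map, which can be sketched in C as follows. The function `f` below is a hypothetical placeholder (the slides do not say what `f` computes), and `kernel` is an illustrative name, not one from the presentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element-wise function standing in for f on the slide. */
static float f(float x) { return 2.0f * x + 1.0f; }

/* Data-parallel kernel: each Y[i] depends only on X[i], so the
   iterations are independent and can be split across processing elements. */
static void kernel(const float *X, float *Y, size_t n) {
    assert(X != NULL && Y != NULL);
    for (size_t i = 0; i < n; i++) {
        Y[i] = f(X[i]);
    }
}
```

Because no iteration reads a value written by another, the index range can be cut into blocks of any size, which is what the block-transfer slides that follow rely on.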

  8. Introduction: Data Transfers. High off-chip memory latency: accessing off-chip data is very costly. Algorithm: for i = 0 to n − 1 do Y[i] = f(X[i]) od. [Figure: task T0 on PE 0 reads X from, and writes Y to, the off-chip memory.]

  9. Introduction: Data Transfers. Data is transferred to a closer but smaller on-chip memory using DMA (Direct Memory Access). Algorithm: for i = 0 to n − 1 do Y[i] = f(X[i]) od. [Figure: X and Y move between the off-chip memory and the on-chip memory as data block transfers (block 0, block 1, ...).]

  10. Introduction: DMA Data Transfers. s is the number of array elements in one block. The array X[0], X[1], ..., X[n − 1] is split into m = n/s blocks, block 0 to block m − 1, of s elements each. Computations and data transfers execute sequentially:

        i = 0
        while (i < n/s):
            Fetch(block i)         issued as dma_get(local-buffer, block i, s)
            Compute(block i)
            Write back(block i)    issued as dma_put(block i, local-buffer, s)
            i++
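The fetch, compute, write-back loop on the DMA Data Transfers slides can be sketched in C as below. This is a simulation under stated assumptions: `dma_get` and `dma_put` here are hypothetical stand-ins implemented with blocking `memcpy`, not the API of any real DMA engine; `f`, `MAX_BLOCK` and `process_sequential` are illustrative names. The sketch also allows a shorter last block, whereas the slides assume that s divides n.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BLOCK 64  /* assumed capacity of the on-chip buffer, in elements */

/* Hypothetical stand-ins for the dma_get / dma_put calls on the slides.
   They are blocking memcpy calls here; a real DMA engine has its own API. */
static void dma_get(float *local, const float *remote, size_t count) {
    memcpy(local, remote, count * sizeof(float));
}
static void dma_put(float *remote, const float *local, size_t count) {
    memcpy(remote, local, count * sizeof(float));
}

static float f(float x) { return 2.0f * x + 1.0f; }  /* placeholder for f */

/* Process X block by block: fetch, compute, write back, strictly in sequence. */
static void process_sequential(const float *X, float *Y, size_t n, size_t s) {
    float local_buffer[MAX_BLOCK];
    assert(s > 0 && s <= MAX_BLOCK);
    for (size_t start = 0; start < n; start += s) {
        size_t count = (n - start < s) ? (n - start) : s;  /* last block may be short */
        dma_get(local_buffer, X + start, count);            /* Fetch(block i) */
        for (size_t j = 0; j < count; j++) {                /* Compute(block i) */
            local_buffer[j] = f(local_buffer[j]);
        }
        dma_put(Y + start, local_buffer, count);            /* Write back(block i) */
    }
}
```

In this scheme the processor is idle during every transfer and the DMA engine is idle during every computation, which is the inefficiency that double buffering targets.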

  17. Introduction: Double Buffering. Asynchronous DMA calls and double buffering give an overlap of computations and data transfers:

        Fetch(block 0)                 issued as dma_get(local-buffer[1], block 0, s)
        i = 0
        while (i < (n/s) − 1):
            in parallel:
                Compute(block i)
                Fetch(block i+1)       issued as dma_get(local-buffer[2], block i+1, s)
            Write back(block i)
            i++
        Compute(block (n/s) − 1)
        Write back(block (n/s) − 1)
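The control structure of double buffering can be sketched in C as below. Everything here is an assumption made for illustration: `dma_get_async`, `dma_put_async` and `dma_wait` are hypothetical names, simulated with blocking `memcpy` and a no-op, so the code shows only the ordering of operations and does not produce real overlap. On hardware with an asynchronous DMA engine, `dma_wait` would block until all outstanding transfers have completed. The sketch folds the slide's final Compute and Write back steps into the loop with a guard, computes in place in the input buffer, and allows a shorter last block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BLOCK 64  /* assumed capacity of each on-chip buffer, in elements */

/* Hypothetical asynchronous DMA calls, simulated synchronously with memcpy. */
static void dma_get_async(float *local, const float *remote, size_t count) {
    memcpy(local, remote, count * sizeof(float));
}
static void dma_put_async(float *remote, const float *local, size_t count) {
    memcpy(remote, local, count * sizeof(float));
}
/* Would wait for all outstanding transfers; nothing to wait for in this simulation. */
static void dma_wait(void) {}

static float f(float x) { return 2.0f * x + 1.0f; }  /* placeholder for f */

static void process_double_buffered(const float *X, float *Y, size_t n, size_t s) {
    float buf[2][MAX_BLOCK];  /* two local buffers used alternately */
    assert(s > 0 && s <= MAX_BLOCK);
    if (n == 0) return;
    size_t m = (n + s - 1) / s;  /* number of blocks */
    size_t cur = 0;              /* index of the buffer holding block i */
    dma_get_async(buf[cur], X, (n < s) ? n : s);  /* prologue: Fetch(block 0) */
    for (size_t i = 0; i < m; i++) {
        size_t start = i * s;
        size_t count = (n - start < s) ? (n - start) : s;
        dma_wait();  /* block i has arrived; earlier write-backs are done */
        if (i + 1 < m) {  /* start Fetch(block i+1) into the other buffer */
            size_t next_start = start + s;
            size_t next_count = (n - next_start < s) ? (n - next_start) : s;
            dma_get_async(buf[1 - cur], X + next_start, next_count);
        }
        for (size_t j = 0; j < count; j++) {  /* Compute(block i) */
            buf[cur][j] = f(buf[cur][j]);
        }
        dma_put_async(Y + start, buf[cur], count);  /* Write back(block i) */
        cur = 1 - cur;  /* swap buffer roles */
    }
    dma_wait();  /* drain the final write-back */
}
```

Waiting at the top of each iteration keeps the buffers safe: the buffer that receives block i+2 is the one whose write-back of block i was issued one iteration earlier, and that write-back has completed by the time the next fetch is started.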
