leap shared memories
play

LEAP Shared Memories: Automating the Construction of FPGA Coherent - PowerPoint PPT Presentation

LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories Hsin-Jung Yang , Kermin E. Fleming , Michael Adler , and Joel Emer Massachusetts Institute of Technology Intel Corporation May 12th, FCCM 2014


  1. LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories Hsin-Jung Yang † , Kermin E. Fleming ‡ , Michael Adler ‡ , and Joel Emer †‡ † Massachusetts Institute of Technology ‡ Intel Corporation May 12th, FCCM 2014

  2. Motivation • Goal: simplifying parallel programming on FPGAs

  3. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation N M

  4. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } }

  5. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array }

  6. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array } implicit barrier synchronization

  7. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation N M

  8. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation N How to implement on FPGAs? M

  9. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M

  10. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M

  11. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M

  12. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M Difficulty: Problem size cannot fit in RAM block •

  13. Programming on FPGA • 2D Heat Transfer Equation (using LEAP Scratchpad) Engine Interface N M unlimited address space M. Adler et al. , “LEAP Scratchpads,” in FPGA, 2011.

  14. Programming on FPGA • 2D Heat Transfer Equation (using LEAP Scratchpad) Engine Interface N M unlimited address space M. Adler et al. , “LEAP Scratchpads,” in FPGA, 2011.

  15. Programming on FPGA • 2D Heat Transfer Equation (using LEAP Scratchpad) Engine Interface N M unlimited address space Difficulty: Single engine • is too slow M. Adler et al. , “LEAP Scratchpads,” in FPGA, 2011.

  16. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M

  17. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: Performance is limited

  18. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: Performance is limited Serialized requests

  19. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: Performance is limited Serialized requests Long latency if across FPGAs

  20. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M

  21. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: • Edge pixels are shared

  22. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: • Edge pixels are shared Need cache coherence!

  23. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request ordered request response (1) Ordering point

  24. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point

  25. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point (1) (2) Store data

  26. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point (1) (1) (2) Store data (2) Shared data ca (3) Store owner-bit information for every address

  27. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Coherent Scratchpad Controller Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point (2) Store data (3) Store owner-bit information owner bit data

  28. Shared Memory Services: Coherent Scratchpad

  29. Parallel Programming on FPGA • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } } implicit barrier synchronization

  30. Parallel Programming on FPGA • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array } implicit barrier synchronization

  31. Parallel Programming on FPGA • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array } implicit barrier synchronization Finish the inner loop operations •  Computations complete  Memory operations complete Wait until all threads are finished •

  32. Shared Memory Services: Memory Consistency • Block RAM/Private Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); endinterface • Coherent Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); // t_REQ r := {READ, WRITE, FULL} method Bool requestPending (t_REQ r); endinterface

  33. Shared Memory Services: Memory Consistency • Block RAM/Private Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); endinterface • Coherent Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); // t_REQ r := {READ, WRITE, FULL} method Bool requestPending (t_REQ r); Fence support endinterface (memory consistency)

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend