LEAP Shared Memories:
Automating the Construction of FPGA Coherent Memories
Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, and Joel Emer†‡
† Massachusetts Institute of Technology ‡ Intel Corporation
May 12th, FCCM 2014
[Figure: 2-D stencil computation over an N x M grid, partitioned into four blocks, one per thread]

for (int t = 0; t < T; t++) {
    #pragma omp parallel num_threads(4)
    {
        int thread_id = omp_get_thread_num();
        int bid_x = thread_id % 2;   /* block column: 0 or 1 */
        int bid_y = thread_id / 2;   /* block row:    0 or 1 */
        for (int y = bid_y*(N/2); y < (1+bid_y)*(N/2); y++)
            for (int x = bid_x*(M/2); x < (1+bid_x)*(M/2); x++)
                U[t+1][x][y] = C0*U[t][x][y]
                             + Cx*(U[t][x-1][y] + U[t][x+1][y])
                             + Cy*(U[t][x][y-1] + U[t][x][y+1]);
    }
}

Implicit barrier synchronization at the end of each parallel region
[Diagram: a compute Engine accessing an on-chip RAM Block through a memory Interface, operating on the N x M grid]
Difficulty: the engine needs an (effectively) unlimited address space
[Diagram: the RAM Block replaced by a memory Interface offering an unlimited address space]
Difficulty: … is too slow
[Diagram: four Engines (Engine 1 - Engine 4) sharing a single memory Interface to the N x M grid]
Difficulty: performance is limited
Serialized requests; long latency if requests cross FPGAs
[Diagram: Engines 1-4 accessing shared data through the Interface]
Difficulty: the data is shared
Need cache coherence!
Ring-based snoopy protocol, modified MOSI protocol
[Diagram: the Engines and a Shared Cache connected in a ring behind the Interface; requests are pre-ordered on the ring and responses travel back along it]
Coherent Scratchpad Controller:
(1) Ordering point
(2) Store data (shared data cache)
(3) Store owner-bit information for every address
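To make role (3) concrete, here is a minimal, hypothetical C sketch (not the LEAP implementation) of how an ordering point might consult a per-address owner bit to decide whether it can answer a read itself or must wait for the owning cache to respond; the type ctrl_entry_t and the function controller_services_read are illustrative names only.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-address state kept at the controller (hypothetical). */
typedef struct {
    bool     owner_bit;   /* true: some private cache holds the line in an owned state */
    uint32_t data;        /* controller's (shared data cache) copy of the line         */
} ctrl_entry_t;

/* A read request arrives at the ordering point.  If no private cache owns the
 * line, the controller's copy is up to date and it responds directly;
 * otherwise the owning cache must supply the data on the response path. */
bool controller_services_read(const ctrl_entry_t *entry)
{
    return !entry->owner_bit;
}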
Back in the stencil example, the implicit barrier must guarantee that both the computations and all outstanding memory operations are complete
Base memory interface:

interface MEM_IFC#(type t_ADDR, type t_DATA);
    method void readReq(t_ADDR addr);
    method void write(t_ADDR addr, t_DATA data);
    method t_DATA readResp();
endinterface

Extended interface with fence support (memory consistency):

interface MEM_IFC#(type t_ADDR, type t_DATA);
    method void readReq(t_ADDR addr);
    method void write(t_ADDR addr, t_DATA data);
    method t_DATA readResp();
    // t_REQ r := {READ, WRITE, FULL}
    // reports whether any request of type r is still outstanding
    method Bool requestPending(t_REQ r);
endinterface
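As a usage illustration only (a hedged sketch, not LEAP's code), a fence can be built by stalling until requestPending() reports no outstanding reads or writes; the C wrapper request_pending() and the function memory_fence below are assumptions made for the example.

/* Hypothetical C-level view of a fence built on requestPending().
 * READ, WRITE, and FULL mirror the t_REQ values from the interface above;
 * request_pending() stands in for the hardware method. */
enum { READ, WRITE, FULL };
extern int request_pending(int req_type);   /* assumed wrapper, illustrative */

void memory_fence(void)
{
    /* Block until every outstanding read and write has completed, so that
     * later memory operations cannot overtake earlier ones. */
    while (request_pending(READ) || request_pending(WRITE))
        ;
}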
One way to build the barrier itself on top of the coherent shared memory, using a lock, two flags, and two counters:
void barrier(int num_threads_const, int *lock_addr,
             volatile int *eflag_addr, volatile int *lflag_addr,
             volatile int *ecounter_addr, volatile int *lcounter_addr) {
    /* Entry phase: wait for the barrier to open
       (assumes *eflag_addr starts at 1 and *lflag_addr at 0). */
    while (!(*eflag_addr));
    lock(lock_addr);
    (*ecounter_addr)++;
    if ((*ecounter_addr) == num_threads_const) {
        (*eflag_addr) = 0;     /* last thread in: close entry ...     */
        (*lflag_addr) = 1;     /* ... and open the leave phase        */
    }
    unlock(lock_addr);

    /* Leave phase: wait until every thread has arrived. */
    while (!(*lflag_addr));
    lock(lock_addr);
    (*lcounter_addr)++;
    if ((*lcounter_addr) == num_threads_const) {
        (*lcounter_addr) = 0;  /* last thread out: reset the counters */
        (*ecounter_addr) = 0;
        (*eflag_addr) = 1;     /* re-open entry for the next barrier  */
        (*lflag_addr) = 0;
    }
    unlock(lock_addr);
}
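To show where this barrier sits in the stencil, here is a hedged per-thread sketch; compute_block() is a hypothetical helper standing in for this thread's share of the update, and the *_addr locations are assumed to live in coherent shared memory, initialized as noted above.

/* Hypothetical per-thread time loop: synchronize after every step so that no
 * thread reads step-t+1 values from a neighbor that is still computing step t. */
for (int t = 0; t < T; t++) {
    compute_block(t, bid_x, bid_y);   /* illustrative: update this thread's (M/2) x (N/2) block */
    barrier(4, lock_addr, eflag_addr, lflag_addr, ecounter_addr, lcounter_addr);
}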
LEAP synchronization service:
[Diagram: Clients A, B, and C, each with a Synchronization Client module, attached to a central Synchronization Controller; a Mask register with one bit per client selects who participates]
Suppose clients A & B need to synchronize
A sends isDone, then B sends isDone; once every client selected by the mask has reported, the controller broadcasts allDone

void barrier() {
    send(isDone);                 /* announce this client's arrival            */
    while (!receive(allDone));    /* wait for the controller's allDone message */
}
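A minimal, hypothetical sketch of the controller side (not LEAP's implementation): arriving isDone bits are ORed into a done vector, and allDone is broadcast when the vector matches the mask; the function sync_controller_step and the bit layout below are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative controller state: one bit per client. */
static uint8_t mask = 0x3;   /* example: clients A (bit 0) and B (bit 1) participate */
static uint8_t done = 0x0;   /* which masked clients have reported isDone            */

/* Called whenever an isDone message arrives from client `id` (0 = A, 1 = B, ...).
 * Returns true when allDone should be broadcast to the participating clients. */
bool sync_controller_step(unsigned id)
{
    done |= (uint8_t)(1u << id) & mask;   /* ignore clients outside the mask */
    if (done == mask) {
        done = 0;                         /* reset for the next barrier      */
        return true;                      /* broadcast allDone now           */
    }
    return false;
}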
System                                          Barriers per second   Normalized throughput
LEAP Barrier Service                                      7,352,076                     342
Hardware Lock Barrier via Coherent Scratchpad                85,088                       4
Spin-Lock Mutex-Enabled Cache*                               21,510                       1
* V. Mirian and P. Chow, “Managing mutex variables in a cache-coherent shared-memory system for FPGAs,” in FPT, 2012.
FPGA: Xilinx VC707; frame size: 512x512; coherent cache size: 8 KB; pixel size: 8 bits
[Plot: Centralized Scratchpad vs. Coherent Scratchpad]
FPGA: Xilinx VC707; frame size: 512x512; coherent cache size: 64 KB; pixel size: 8 bits
[Plot: Single FPGA vs. Dual FPGA]