SLIDE 1

LEAP Shared Memories:

Automating the Construction of FPGA Coherent Memories

Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, and Joel Emer†‡

† Massachusetts Institute of Technology ‡ Intel Corporation

May 12th, FCCM 2014

SLIDES 2-6

Motivation

  • Goal: simplifying parallel programming on FPGAs
  • Running example: the 2D heat transfer equation, computed over an N x M grid

On a CPU, a straightforward OpenMP version splits the grid into four blocks, one per thread:

    for (int t = 0; t < T; t++) {
        #pragma omp parallel num_threads(4)
        {
            int thread_id = omp_get_thread_num();
            int bid_x = thread_id % 2;
            int bid_y = thread_id / 2;
            for (int y = bid_y*(N/2); y < (1+bid_y)*(N/2); y++)
                for (int x = bid_x*(M/2); x < (1+bid_x)*(M/2); x++)
                    U[t+1][x][y] = C0*U[t][x][y]
                                 + Cx*(U[t][x-1][y] + U[t][x+1][y])
                                 + Cy*(U[t][x][y-1] + U[t][x][y+1]);
        }
    }

  • All four threads operate on the shared array U
  • The end of each parallel region is an implicit barrier synchronization between time steps
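For reference, the update the code implements is the standard explicit finite-difference form of the 2D heat equation. The slides do not define C0, Cx, and Cy, so the usual definitions below are stated as an assumption:

    U^{t+1}_{x,y} = C_0\,U^t_{x,y} + C_x\left(U^t_{x-1,y} + U^t_{x+1,y}\right) + C_y\left(U^t_{x,y-1} + U^t_{x,y+1}\right),
    \qquad C_x = \frac{\alpha\,\Delta t}{\Delta x^2},\quad C_y = \frac{\alpha\,\Delta t}{\Delta y^2},\quad C_0 = 1 - 2C_x - 2C_y.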

SLIDES 7-8

Motivation

  • Goal: simplifying parallel programming on FPGAs
  • 2D Heat Transfer Equation

How to implement on FPGAs?

SLIDES 9-12

Programming on FPGA

  • 2D Heat Transfer Equation (using FPGA Block RAM)
  • A single compute engine accesses the N x M frame through a block RAM interface

Difficulty:
  • The problem size cannot fit in block RAM

SLIDES 13-15

Programming on FPGA

  • 2D Heat Transfer Equation (using a LEAP scratchpad)
  • M. Adler et al., "LEAP Scratchpads," in FPGA, 2011.
  • Replacing block RAM with a LEAP scratchpad gives the engine an effectively unlimited address space

Difficulty:
  • A single engine is too slow

SLIDES 16-19

Parallel Programming on FPGA

  • 2D Heat Transfer Equation
  • Four engines (Engine 1-4) compute in parallel, sharing one interface to the N x M frame

Difficulty: performance is limited
  • Requests from the four engines are serialized at the shared interface
  • Latency is long if a request must cross FPGAs

SLIDES 20-22

Parallel Programming on FPGA

  • 2D Heat Transfer Equation
  • Four engines (Engine 1-4) computing in parallel

Difficulty:
  • Edge pixels are shared between neighboring engines

Need cache coherence!
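To see why the edge pixels force coherence, note that the stencil for a pixel on a quadrant boundary reads a pixel owned by the neighboring engine. A small illustrative check (the helper name is hypothetical, not from the deck):

    #include <stdio.h>

    #define M 512  /* frame width, split into two engine columns */

    /* Which engine column owns pixel column x under the M/2 split. */
    static int owner_x(int x) { return x / (M / 2); }

    int main(void) {
        int x = M / 2 - 1;  /* rightmost column owned by engine column 0 */
        /* The stencil term U[t][x+1][y] reaches into engine column 1's data: */
        printf("column %d owned by engine column %d\n", x, owner_x(x));
        printf("column %d owned by engine column %d\n", x + 1, owner_x(x + 1));
        return 0;
    }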

SLIDES 23-27

Shared Memory Services: Coherent Scratchpad (CS)

  • Ring-based snoopy protocol, using a modified MOSI protocol
  • The engines' coherent caches and the shared cache are connected in a ring: a request travels the ring as a pre-order request, is serialized at the ordering point, continues as an ordered request, and finally yields a response

Coherent Scratchpad Controller:
  (1) Serves as the ordering point
  (2) Stores data (the shared cache)
  (3) Stores owner-bit information for every address
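To make the coherence states concrete, here is a minimal software sketch of a MOSI-style state machine for one cache line. The deck says LEAP uses a *modified* MOSI protocol but does not spell out the modifications, so these transitions follow textbook MOSI and are illustrative only:

    #include <stdio.h>

    /* Standard MOSI cache-line states. */
    typedef enum { INVALID, SHARED, OWNED, MODIFIED } line_state_t;

    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

    /* Next-state function for one cache line. */
    static line_state_t next_state(line_state_t s, event_t e) {
        switch (e) {
        case LOCAL_READ:   /* a miss in INVALID fetches the line as SHARED */
            return (s == INVALID) ? SHARED : s;
        case LOCAL_WRITE:  /* writing gains exclusive ownership */
            return MODIFIED;
        case REMOTE_READ:  /* a dirty owner supplies data and becomes OWNED */
            return (s == MODIFIED) ? OWNED : s;
        case REMOTE_WRITE: /* another writer invalidates our copy */
            return INVALID;
        }
        return s;
    }

    int main(void) {
        line_state_t s = INVALID;
        s = next_state(s, LOCAL_READ);   /* INVALID  -> SHARED   */
        s = next_state(s, LOCAL_WRITE);  /* SHARED   -> MODIFIED */
        s = next_state(s, REMOTE_READ);  /* MODIFIED -> OWNED    */
        s = next_state(s, REMOTE_WRITE); /* OWNED    -> INVALID  */
        printf("final state: %d\n", s);
        return 0;
    }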
SLIDE 28

Shared Memory Services: Coherent Scratchpad

SLIDES 29-31

Parallel Programming on FPGA

  • 2D Heat Transfer Equation: the OpenMP code shown earlier ends each time step with an implicit barrier synchronization on the shared array
  • What must that barrier guarantee?
    – The inner loop operations are finished: computations complete AND memory operations complete
    – All threads wait until every thread is finished
SLIDES 32-33

Shared Memory Services: Memory Consistency

  • Block RAM / Private Scratchpad Interface:

    interface MEM_IFC#(type t_ADDR, type t_DATA);
        method void readReq(t_ADDR addr);
        method void write(t_ADDR addr, t_DATA data);
        method t_DATA readResp();
    endinterface

  • Coherent Scratchpad Interface, with fence support (memory consistency):

    interface MEM_IFC#(type t_ADDR, type t_DATA);
        method void readReq(t_ADDR addr);
        method void write(t_ADDR addr, t_DATA data);
        method t_DATA readResp();
        // t_REQ r := {READ, WRITE, FULL}
        method Bool requestPending(t_REQ r);
    endinterface
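As a software analogy of how a client can build a fence from requestPending (a sketch only; request_pending here is a hypothetical stand-in for the Bluespec method above, stubbed so the sketch compiles):

    #include <stdbool.h>

    typedef enum { REQ_READ, REQ_WRITE, REQ_FULL } req_t;

    /* Hypothetical stand-in for the interface's requestPending() method;
     * a real client would query the hardware. */
    static bool request_pending(req_t r) { (void)r; return false; }

    /* Write fence: stall until every outstanding write has completed, so
     * that operations issued after the fence observe all earlier writes. */
    void write_fence(void) {
        while (request_pending(REQ_WRITE))
            ;  /* spin while writes are still in flight */
    }

    /* Full fence: stall until no requests of any kind are outstanding. */
    void full_fence(void) {
        while (request_pending(REQ_FULL))
            ;
    }

    int main(void) { write_fence(); full_fence(); return 0; }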

SLIDE 34

Parallel Programming on FPGA

  • 2D Heat Transfer Equation, revisited: the fence makes "memory operations complete" expressible; what remains is "wait until all threads are finished", a barrier
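Putting the two pieces together, a minimal sketch of one engine's per-time-step loop, assuming hypothetical helpers mem_fence() and barrier() that would map onto the scratchpad's fence support and the synchronization service introduced on the following slides:

    #include <stdio.h>

    #define N 512
    #define M 512

    /* Hypothetical helpers standing in for the hardware interfaces;
     * stubbed here so the sketch compiles. */
    static void mem_fence(void) { /* wait until requestPending(FULL) clears */ }
    static void barrier(void)   { /* send isDone; wait for allDone */ }
    static void compute_pixel(int t, int x, int y) { (void)t; (void)x; (void)y; }

    /* One engine's work for time step t: compute its quadrant, make all
     * memory operations visible, then synchronize with the other engines. */
    void engine_step(int t, int bid_x, int bid_y) {
        for (int y = bid_y * (N / 2); y < (1 + bid_y) * (N / 2); y++)
            for (int x = bid_x * (M / 2); x < (1 + bid_x) * (M / 2); x++)
                compute_pixel(t, x, y);  /* stencil update on the shared frame */
        mem_fence();   /* memory operations complete */
        barrier();     /* wait until all engines are finished */
    }

    int main(void) {
        engine_step(0, 0, 0);  /* engine 0's quadrant for time step 0 */
        printf("time step done\n");
        return 0;
    }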
SLIDES 35-36

Synchronization Services: Memory Barrier

  • In Processor: software through-memory barriers
    – via shared memory & locks

    /* Two-phase counting barrier; lock()/unlock() are assumed spinlock
       primitives. The entry flag starts at 1 (open), all other words at 0. */
    void barrier(int num_threads_const,
                 volatile int *lock_addr,
                 volatile int *eflag_addr, volatile int *lflag_addr,
                 volatile int *ecounter_addr, volatile int *lcounter_addr) {
        while (!(*eflag_addr));              /* wait until entry is open */
        lock(lock_addr);
        (*ecounter_addr)++;
        if ((*ecounter_addr) == num_threads_const) {
            (*eflag_addr) = 0;               /* close entry */
            (*lflag_addr) = 1;               /* release waiting threads */
        }
        unlock(lock_addr);
        while (!(*lflag_addr));              /* wait for the release */
        lock(lock_addr);
        (*lcounter_addr)++;
        if ((*lcounter_addr) == num_threads_const) {
            (*lcounter_addr) = 0;            /* last thread resets state */
            (*ecounter_addr) = 0;
            (*eflag_addr) = 1;               /* reopen entry */
            (*lflag_addr) = 0;
        }
        unlock(lock_addr);
    }

  • In FPGA: use a dedicated synchronization service instead (next slides)
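A hedged usage sketch of the barrier above with four pthreads; the spinlock implementation, initial flag values, and test harness are assumptions, not from the deck:

    #include <pthread.h>
    #include <stdio.h>

    /* barrier() as reconstructed on the slide above. */
    void barrier(int num_threads_const, volatile int *lock_addr,
                 volatile int *eflag_addr, volatile int *lflag_addr,
                 volatile int *ecounter_addr, volatile int *lcounter_addr);

    /* Assumed spinlock primitives behind the slide's lock()/unlock(). */
    void lock(volatile int *l)   { while (__sync_lock_test_and_set((int *)l, 1)); }
    void unlock(volatile int *l) { __sync_lock_release((int *)l); }

    /* Shared barrier words: entry flag starts open (1), the rest at 0. */
    static volatile int lock_w, eflag = 1, lflag, ecnt, lcnt;

    static void *worker(void *arg) {
        for (int t = 0; t < 3; t++)  /* three "time steps" per thread */
            barrier(4, &lock_w, &eflag, &lflag, &ecnt, &lcnt);
        return arg;
    }

    int main(void) {
        pthread_t th[4];
        for (int i = 0; i < 4; i++) pthread_create(&th[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(th[i], NULL);
        puts("all threads passed 3 barriers");
        return 0;
    }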

SLIDES 37-42

Synchronization Services: Memory Barrier

  • In Processor: software through-memory barriers
    – via shared memory & locks
  • In FPGA: a hardware synchronization service; clients A, B, and C each attach a synchronization client node, all connected to a synchronization controller with a participation mask

Suppose clients A & B need to synchronize:
  1. The controller's mask selects A and B (C does not participate)
  2. A signals isDone
  3. B signals isDone
  4. Once all masked clients are done, the controller broadcasts allDone

On the client side, the barrier reduces to:

    void barrier() {
        send(isDone);
        while (!receive(allDone));  /* spin until allDone arrives */
    }
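A minimal software model of the controller side: the mask selects participating clients, and once every masked client has raised isDone, the controller broadcasts allDone and resets. The bit layout and names are illustrative, not from the deck:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        unsigned mask;  /* bit i set => client i participates */
        unsigned done;  /* bit i set => client i raised isDone */
    } sync_ctrl_t;

    /* Record one client's isDone; returns true when allDone fires. */
    bool ctrl_is_done(sync_ctrl_t *c, int client) {
        c->done |= (1u << client) & c->mask;
        if (c->done == c->mask) {   /* every masked client has arrived */
            c->done = 0;            /* reset for the next barrier */
            return true;            /* broadcast allDone */
        }
        return false;
    }

    int main(void) {
        sync_ctrl_t c = { .mask = 0x3, .done = 0 };  /* A (bit 0), B (bit 1) */
        printf("%d\n", ctrl_is_done(&c, 0));         /* A isDone -> 0 (waiting) */
        printf("%d\n", ctrl_is_done(&c, 1));         /* B isDone -> 1 (allDone) */
        return 0;
    }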

SLIDE 43

Synchronization Services: Memory Barrier

  • In Processor: software through-memory barriers
    – via shared memory & locks
  • In FPGA:
    – outside of shared memory
  • Performance Comparison (throughput normalized to the spin-lock mutex-enabled cache):

    System                                          Barriers per Second    Normalized Throughput
    LEAP Barrier Service                                      7,352,076                      342
    Hardware Lock Barrier via Coherent Scratchpad                85,088                        4
    Spin-Lock Mutex-Enabled Cache*                               21,510                        1

  * V. Mirian and P. Chow, "Managing mutex variables in a cache-coherent shared-memory system for FPGAs," in FPT, 2012.

SLIDE 44

Performance on 2D Heat Transfer

  • Setup: Xilinx VC707 FPGA; frame size 512x512; coherent cache size 8 KB; pixel size 8 bits

[Figure: throughput of the centralized scratchpad vs. the coherent scratchpad]

SLIDE 45

Coherent Scratchpads on Multiple FPGAs

  • K. Fleming et al., “Leveraging latency-insensitivity to ease multiple FPGA design,” in FPGA, 2012.
SLIDE 46

Performance of Dual FPGA

  • 2D Heat Transfer Equation
  • Setup: Xilinx VC707 FPGAs; frame size 512x512; coherent cache size 64 KB; pixel size 8 bits

[Figure: single-FPGA vs. dual-FPGA throughput]

SLIDE 47

Conclusion

  • Programming on FPGAs is difficult due to the lack of useful abstractions
  • We provide a set of FPGA-based shared memory primitives:
    – Coherent scratchpads: manage multiple coherent caches
    – Synchronization primitives
  • We improve programming efficiency
    – Common interface: from block RAM up to multi-FPGA coherent memory
    – It took only a few hours to write the 2D heat transfer equation

SLIDE 48

Thank You