slide-1
SLIDE 1

LMC: Automatic Resource-Aware Program-Optimized Memory Partitioning

Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, Felix Winterstein§, and Joel Emer†

† Massachusetts Institute of Technology, ‡ Intel Corporation, § Imperial College London

February 22nd, FPGA 2016

slide-2
SLIDE 2

Motivation

  • Moore’s Law continues

– More transistors & memory controllers on modern FPGAs

  • Example: Xilinx VC709: two 4GB DDR3 memories

Nallatech 510T: eight 4GB DDR4 memories + 2GB HMC
Xeon + FPGA: three memory channels

  • It is difficult to fully utilize DRAM bandwidth

– Co-optimizing application cores and memory systems
– Porting an existing design to a new platform

  • Smaller FPGA -> Larger FPGA
  • Single FPGA -> Multiple FPGAs
slide-3
SLIDE 3

Motivation

  • Moore’s Law continues

– More transistors & memory controllers on modern FPGAs

  • Example: Xilinx VC709: two 4GB DDR3 memories

Nallatech 510T: eight 4GB DDR4 memories + 2GB HMC
Xeon + FPGA: three memory channels

  • It is difficult to fully utilize DRAM bandwidth

– Co-optimizing application cores and memory systems
– Porting an existing design to a new platform

  • Smaller FPGA -> Larger FPGA
  • Single FPGA -> Multiple FPGAs

Goal: automatically optimize the memory system to efficiently utilize the increased DRAM bandwidth

slide-4
SLIDE 4
  • How to connect computational engines to DRAMs in order to maximize program performance?

– Network topology: latency, bandwidth
– On-chip caching
– Area constraints

Utilizing Multiple DRAMs

?

slide-5
SLIDE 5
  • How to connect computational engines to DRAMs in order to maximize program performance?

– Network topology: latency, bandwidth
– On-chip caching
– Area constraints

Utilizing Multiple DRAMs

?

slide-6
SLIDE 6

Utilizing Multiple DRAMs

?

  • How to connect computational engines to DRAMs in order to maximize program performance?

– High design complexity: network, caching…

slide-7
SLIDE 7
  • How to connect computational engines to DRAMs in order to maximize program performance?

– High design complexity: network, caching…

  • Applications have different memory behavior

Utilizing Multiple DRAMs

slide-8
SLIDE 8
  • How to connect computational engines to DRAMs in order to maximize program performance?

– High design complexity: network, caching…

  • Applications have different memory behavior

Utilizing Multiple DRAMs

Need more bandwidth!

slide-9
SLIDE 9
  • How to connect computational engines to DRAMs in order to maximize program performance?

– High design complexity: network, caching…

  • Applications have different memory behavior

Utilizing Multiple DRAMs

Need more bandwidth!

slide-10
SLIDE 10
  • How to connect computational engines to DRAMs in order to maximize program performance?

– High design complexity: network, caching…

  • Applications have different memory behavior

Utilizing Multiple DRAMs

Need a memory compiler!

Need more bandwidth!

slide-11
SLIDE 11
  • A clearly-defined, generic memory abstraction

– Separate the user program from the memory system implementation

  • Program introspection

– To understand the program’s memory behavior

  • A resource-aware, feedback-driven memory compiler

– Use introspection results as feedback to automatically construct the “best” memory system for the target program and platform

Automatic Construction of Program-Optimized Memories

slide-12
SLIDE 12

Abstraction

  • Abstraction hides implementation details and provides good programmability

(Figure: abstraction stacks. FPGA side: User Program, Memory & Communication Abstraction, FPGA. Processor side: C/Python Application, Instruction Set Architecture, CPU / I/O / Operating System. The abstraction layer divides software from hardware.)

slide-13
SLIDE 13

Abstraction

  • Abstraction hides implementation details and provides good programmability

(Figure: the same abstraction stacks as on the previous slide.)

  • Hardware can be optimized for the target application and platform (by compilers & system developers)

slide-14
SLIDE 14

LEAP Memory Abstraction

interface MEM_IFC#(type t_ADDR, type t_DATA);
    method void readReq(t_ADDR addr);
    method void write(t_ADDR addr, t_DATA din);
    method t_DATA readResp();
endinterface

(Figure: a user engine connects through this interface to a LEAP memory block.)

  • Simple memory interface
  • Arbitrary data size
  • Private address space
  • “Unlimited” storage
  • Automatic caching
slide-15
SLIDE 15

LEAP Memory Abstraction

interface MEM_IFC#(type t_ADDR, type t_DATA);
    method void readReq(t_ADDR addr);
    method void write(t_ADDR addr, t_DATA din);
    method t_DATA readResp();
endinterface

(Figure: a user engine connects through this interface to a LEAP memory block.)

  • Simple memory interface
  • Arbitrary data size
  • Private address space
  • “Unlimited” storage
  • Automatic caching

Same as block RAMs
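
To make these semantics concrete, below is a minimal Python model of the split-transaction interface above; it is an illustration only (the LEAP implementation is Bluespec hardware), and the class and method names are invented for this sketch.

from collections import deque

class LeapMemoryModel:
    """Toy model: split-transaction reads, private sparse storage."""
    def __init__(self, default=0):
        self._store = {}                # sparse store looks "unlimited"
        self._pending = deque()         # outstanding read requests
        self._default = default

    def read_req(self, addr):
        self._pending.append(addr)      # issue a read; no data returned yet

    def read_resp(self):
        addr = self._pending.popleft()  # responses come back in request order
        return self._store.get(addr, self._default)

    def write(self, addr, data):
        self._store[addr] = data        # arbitrary data size in this model

mem = LeapMemoryModel()
mem.write(42, "hello")
mem.read_req(42)
mem.read_req(7)
print(mem.read_resp(), mem.read_resp())   # -> hello 0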

slide-16
SLIDE 16

LEAP Private Memory

(Figure: multiple clients in the user program connect through the LEAP memory interface to the platform memory hierarchy on the FPGA.)

M. Adler et al., "LEAP Scratchpads," in FPGA, 2011.

slide-17
SLIDE 17

LEAP Private Memory

(Figure: clients in the user program connect through the LEAP memory interface to the platform memory hierarchy on the FPGA.)

  • On-chip SRAM
  • On-board DRAM

M. Adler et al., "LEAP Scratchpads," in FPGA, 2011.

slide-18
SLIDE 18

LEAP Private Memory

(Figure: the LEAP private memory hierarchy on the FPGA, with per-client caches backed by on-chip SRAM and on-board DRAM, mirrors a processor's Application / L1 Cache / L2 Cache / Memory hierarchy.)

  • On-chip SRAM
  • On-board DRAM

M. Adler et al., "LEAP Scratchpads," in FPGA, 2011.

slide-19
SLIDE 19
  • Naïve solution: unified memory with multiple DRAM banks

LEAP Memory with Multiple DRAMs


slide-20
SLIDE 20
  • Naïve solution: unified memory with multiple DRAM banks

LEAP Memory with Multiple DRAMs


slide-21
SLIDE 21
  • Naïve solution: unified memory with multiple DRAM banks

LEAP Memory with Multiple DRAMs

Advantages: simplicity, more capacity, higher bandwidth

slide-22
SLIDE 22
  • Naïve solution: unified memory with multiple DRAM banks

LEAP Memory with Multiple DRAMs

Difficulty: Performance is limited

– Serialized requests
– Long latency for large rings

Advantages: simplicity, more capacity, higher bandwidth

slide-23
SLIDE 23
  • Naïve solution: unified memory with multiple DRAM banks

LEAP Memory with Multiple DRAMs

Difficulty: Performance is limited

– Serialized requests
– Long latency for large rings

Advantages: simplicity, more capacity, higher bandwidth

Can we do better?

slide-24
SLIDE 24
  • Distributed central caches and memory controllers

LEAP Memory with Multiple DRAMs

slide-25
SLIDE 25
  • Distributed central caches and memory controllers

?

LEAP Memory with Multiple DRAMs

slide-26
SLIDE 26
  • Program introspection

– To understand programs’ memory behavior

Private Cache Network Partitioning

Statistics counters produce a statistics file, e.g.:
Client A: 100, Client B: 10, Client C: 50, Client D: 20

Examples of collected statistics: number of cache misses, number of outstanding requests, queueing delays
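
As a rough illustration of consuming this introspection output, the Python sketch below aggregates per-client traffic counts from a statistics file. The one-entry-per-line format and the function name are assumptions made for this sketch, not the actual LEAP statistics format.

from collections import Counter

def read_traffic_stats(path):
    """Return a Counter mapping client name -> total observed traffic
    (e.g. cache misses), assuming "client count" lines."""
    traffic = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                       # skip blanks and comments
            client, count = line.rsplit(None, 1)
            traffic[client] += int(count)
    return traffic

# A file containing "Client_A 100", "Client_B 10", "Client_C 50",
# "Client_D 20" yields the per-client traffic that drives partitioning.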

slide-27
SLIDE 27
  • Case 1: Memory clients with homogeneous behavior

Private Cache Network Partitioning

slide-28
SLIDE 28
  • Case 1: Memory clients with homogeneous behavior

Private Cache Network Partitioning

Homogeneous

slide-29
SLIDE 29
  • Case 1: Memory clients with homogeneous behavior

Private Cache Network Partitioning

Homogeneous
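
With homogeneous clients the partitioning step is simple: dealing clients out evenly across the memory controllers already balances the load. The short Python sketch below illustrates the round-robin assignment; it is an illustration only, not LMC's code.

def round_robin_partition(clients, n_controllers):
    """Assign equally-behaved clients to controllers in round-robin order."""
    return {c: i % n_controllers for i, c in enumerate(clients)}

print(round_robin_partition(["A", "B", "C", "D"], 2))
# -> {'A': 0, 'B': 1, 'C': 0, 'D': 1}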

slide-30
SLIDE 30
  • Case 2: Memory clients with heterogeneous behavior

Private Cache Network Partitioning

Traffic: 100, 10, 50, 20

slide-31
SLIDE 31
  • Case 2: Memory clients with heterogeneous behavior

Private Cache Network Partitioning

Traffic: 100, 10, 50, 20

Need more bandwidth!

slide-32
SLIDE 32
  • Case 2: Memory clients with heterogeneous behavior

Private Cache Network Partitioning

Traffic: 100, 10, 50, 20

Need more bandwidth!

slide-33
SLIDE 33
  • Case 2: Memory clients with heterogeneous behavior

– Load-balanced partitioning

  • Classical minimum makespan scheduling problem

Private Cache Network Partitioning

ILP formulation (n memory controllers, m memory clients, client k with traffic u_k; y_jk = 1 if client k is mapped to controller j, and y_jk = 0 otherwise):

minimize t
subject to   Σ_{k=1..m} y_jk · u_k ≤ t,   for j = 1, …, n
             Σ_{j=1..n} y_jk = 1,   for k = 1, …, m
             y_jk ∈ {0, 1},   for j = 1, …, n and k = 1, …, m
slide-34
SLIDE 34
  • Case 2: Memory clients with heterogeneous behavior

– Load-balanced partitioning

  • Classical minimum makespan scheduling problem

Private Cache Network Partitioning

ILP formulation (n memory controllers, m memory clients, client k with traffic u_k; y_jk = 1 if client k is mapped to controller j, and y_jk = 0 otherwise):

minimize t
subject to   Σ_{k=1..m} y_jk · u_k ≤ t,   for j = 1, …, n
             Σ_{j=1..n} y_jk = 1,   for k = 1, …, m
             y_jk ∈ {0, 1},   for j = 1, …, n and k = 1, …, m

Approximation: Longest processing time (LPT) algorithm
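
The sketch below illustrates the LPT heuristic in Python: clients are visited in decreasing order of measured traffic, and each is placed on the currently least-loaded controller. The function and data format are illustrative assumptions, not the compiler's actual implementation.

import heapq

def lpt_partition(traffic, n_controllers):
    """traffic: dict of client name -> measured traffic (e.g. cache misses).
    Returns a client -> controller assignment and per-controller loads."""
    heap = [(0, j) for j in range(n_controllers)]   # (current load, index)
    heapq.heapify(heap)
    assignment, loads = {}, [0] * n_controllers

    for client, u in sorted(traffic.items(), key=lambda kv: -kv[1]):
        load, j = heapq.heappop(heap)               # least-loaded controller
        assignment[client] = j
        loads[j] = load + u
        heapq.heappush(heap, (loads[j], j))
    return assignment, loads

print(lpt_partition({"A": 100, "B": 10, "C": 50, "D": 20}, 2))
# -> client A alone on one controller (load 100); C, D, B share the other (load 80)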

slide-35
SLIDE 35
  • Case 3: Fractional load-balancing

Private Cache Network Partitioning

slide-36
SLIDE 36
  • Case 3: Fractional load-balancing

Private Cache Network Partitioning

slide-37
SLIDE 37
  • Case 3: Fractional load-balancing

Private Cache Network Partitioning

LP relaxation (ILP -> LP): each client's traffic may be split fractionally across controllers, so the integrality constraint is relaxed:

minimize t
subject to   Σ_{k=1..m} y_jk · u_k ≤ t,   for j = 1, …, n
             Σ_{j=1..n} y_jk = 1,   for k = 1, …, m
             0 ≤ y_jk ≤ 1
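
For illustration, the relaxed problem can be handed to an off-the-shelf LP solver. The Python sketch below is an assumption about tooling, not the LMC implementation: it builds the constraint matrices for SciPy's linprog, with the variables ordered as all y_jk followed by the makespan t.

import numpy as np
from scipy.optimize import linprog

def fractional_partition(traffic, n_controllers):
    """traffic: per-client traffic u_k. Returns an n x m matrix of
    fractions y[j, k] and the optimal makespan t."""
    u = np.asarray(traffic, dtype=float)
    m, n = len(u), n_controllers
    n_vars = n * m + 1                     # all y_jk plus t

    c = np.zeros(n_vars)
    c[-1] = 1.0                            # objective: minimize t

    A_ub = np.zeros((n, n_vars))           # per controller: sum_k u_k*y_jk - t <= 0
    for j in range(n):
        A_ub[j, j * m:(j + 1) * m] = u
        A_ub[j, -1] = -1.0
    b_ub = np.zeros(n)

    A_eq = np.zeros((m, n_vars))           # per client: sum_j y_jk = 1
    for k in range(m):
        for j in range(n):
            A_eq[k, j * m + k] = 1.0
    b_eq = np.ones(m)

    bounds = [(0.0, 1.0)] * (n * m) + [(0.0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:-1].reshape(n, m), res.x[-1]

fractions, makespan = fractional_partition([100, 10, 50, 20], 2)
print(makespan)   # -> 90.0, a perfectly balanced load of 90 per controller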

slide-38
SLIDE 38
  • Three-phase feedback-driven compilation

LEAP Memory Compiler

– Instrumentation (optional): to collect runtime information about the way the program uses memory
– Analysis: to analyze the program's properties and decide on an optimized memory hierarchy
– Synthesis: to implement the program-optimized memory

slide-39
SLIDE 39

LEAP Memory Performance

  • Baseline
slide-40
SLIDE 40

LEAP Memory Performance

  • Baseline

(Plot legend: private cache, central cache)

slide-41
SLIDE 41

LEAP Memory Performance

  • Memory interleaving

(Plot legend: private cache, central cache)
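
The sketch below shows one plausible block-interleaved address mapping in Python: consecutive memory blocks are spread round-robin across the DRAM controllers so that even a single client's streaming traffic exercises every channel. The 64-byte granularity and the mapping itself are assumptions for illustration, not necessarily the compiler's exact scheme.

BLOCK_BITS = 6                       # assumed 64-byte interleave granularity

def route(addr, n_controllers):
    """Map a byte address to (controller index, local address)."""
    block = addr >> BLOCK_BITS
    controller = block % n_controllers
    local_block = block // n_controllers
    offset = addr & ((1 << BLOCK_BITS) - 1)
    return controller, (local_block << BLOCK_BITS) | offset

# Consecutive 64-byte blocks alternate between two controllers:
print([route(a, 2)[0] for a in range(0, 256, 64)])   # -> [0, 1, 0, 1]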

slide-42
SLIDE 42

Case Study: Cryptosorter

  • Cryptosorter: each sorter uses a LEAP private memory
slide-43
SLIDE 43

Case Study: Filtering Algorithm

  • Filtering algorithm for K-means clustering (HLS kernel)

– 8 partitions: each uses 3 LEAP private memories

slide-44
SLIDE 44
  • Baseline coherent memory

Coherent Cache Network Partitioning

slide-45
SLIDE 45
  • Coherent memory interleaving

Coherent Cache Network Partitioning

slide-46
SLIDE 46
  • Coherent memory interleaving

Coherent Cache Network Partitioning

Private cache network optimizations can be directly composed

slide-47
SLIDE 47

Case Study: Heat Transfer

  • Heat transfer: 16 engines, 1024x1024 frame

(Plot legend: private memory optimizations only; private + coherent memory optimizations)

slide-48
SLIDE 48

Case Study: Heat Transfer

  • Heat transfer: 16 engines, 1024x1024 frame

(Plot legend: private memory optimizations only; private + coherent memory optimizations)

57% (96%) performance gain

slide-49
SLIDE 49

Moving to Multi-FPGA Platforms

slide-50
SLIDE 50

Moving to Multi-FPGA Platforms

slide-51
SLIDE 51

Performance on Dual FPGAs

slide-52
SLIDE 52

Conclusion

  • We introduce the LEAP memory compiler that can transparently optimize the memory system for a given application.

  • The compiler automatically partitions both private and coherent memory networks to efficiently utilize the increased DRAM bandwidth on modern FPGAs.

  • Future work:

– More case studies on asymmetric memory clients
– More complex memory network topologies
– Dynamic cache partitioning

slide-53
SLIDE 53

Thank You