Implementing Logic in FPGA Embedded Memory Arrays: Heterogeneous Memory Architectures
Steve Wilton
University of British Columbia, Vancouver, B.C., Canada
stevew@ece.ubc.ca
As FPGAs Get Bigger...
Embedded memory is becoming critical
Implementing storage on-chip is important:
- Integration
- Relax I/O Constraints
- Speed
- Flexibility
Today, most FPGAs have large embedded memory arrays
Problem: If a circuit doesn't need all memory blocks, valuable chip area is wasted
Solution: Configure memory blocks as ROMs and use them to implement logic
Implementing Logic in Memory:
Two published algorithms: SMAP, EMB_Pack
[Figure: example circuit with nodes labeled A through Q, shown before and after logic is packed into a memory array]
The ability of memory arrays to implement logic depends on the memory array architecture
Previous work: 2-Kbit arrays with 8 outputs are good
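A small sketch of the arithmetic behind an array's shape: a memory of a given capacity configured with a given number of outputs has capacity / outputs words, so as a ROM it can implement that many functions of log2(words) shared inputs. The function name `rom_shape` is made up for illustration and is not taken from the talk.

```python
def rom_shape(total_bits: int, outputs: int) -> tuple[int, int]:
    """Return (address_inputs, words) for a memory array used as a ROM.

    Hypothetical helper for illustration only: a ROM with `words` entries,
    each `outputs` bits wide, realizes `outputs` arbitrary functions of
    log2(words) shared input signals.
    """
    if outputs <= 0 or total_bits % outputs != 0:
        raise ValueError("total_bits must be a positive multiple of outputs")
    words = total_bits // outputs
    if words <= 0 or words & (words - 1) != 0:
        raise ValueError("number of words must be a power of two")
    address_inputs = words.bit_length() - 1  # exact log2 for a power of two
    return address_inputs, words


# A 2-Kbit array with 8 outputs: 256 words, so 8 address inputs.
print(rom_shape(2048, 8))
```

Under this arithmetic, a 2-Kbit array with 8 outputs behaves as an 8-input, 8-output block of logic.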
Heterogeneous Memory Architectures
Altera Stratix: Three types of memories
- M512 Blocks
- M4K Blocks
- MegaRAM
This Talk:
A given:
For storage: Several types of memories on a single chip is a good idea
In this paper:
For logic:
- 1. Heterogeneous memory architectures: a good idea?
- 2. How much does it help?
- 3. What memory sizes are best?
Methodology:
Inputs: Benchmark Circuits, Architecture
SMAP: Pack as much logic as possible into memory arrays (gives Amount of Logic Packed)
Area Model: gives Area
Packing Ratio = Amount of Logic Packed / Area
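The packing ratio above can be written as a one-line computation. This is only a sketch: the function name and the example numbers are invented for illustration and are not results from the talk.

```python
def packing_ratio(logic_blocks_packed: float, area_equiv_logic_blocks: float) -> float:
    """Packing Ratio = amount of logic packed / area of the memory arrays.

    Both quantities are measured in logic blocks (area in equivalent
    logic blocks), so the ratio is dimensionless.
    """
    if area_equiv_logic_blocks <= 0:
        raise ValueError("area must be positive")
    return logic_blocks_packed / area_equiv_logic_blocks


# Hypothetical numbers, not taken from the talk's results.
print(packing_ratio(300, 150))
```

A ratio above 1.0 means the arrays absorb more logic than the equivalent area of logic blocks would hold.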
SMAP Algorithm:
Overall approach:
- 1. Map to 4-LUTs using Flowmap
- 2. Pack as many 4-LUTs as possible into arrays
Goal: Maximize number of LUTs that can be packed
Four Steps:
- 1. Choose a “seed node”
- 2. Choose signals that will become array inputs
- 3. Choose signals that will become array outputs
- 4. Insert memory into circuit, and remove 4-LUTs no longer needed
Choosing Inputs of Memory Array:
Find maximum-volume d-feasible cut (Flowpack)
Cut edges become memory array inputs
[Figure labels: Seed Node; 8-input memory]
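To make "d-feasible cut" concrete, the sketch below checks a candidate cone of nodes: the signals that cross into the cone are the cut, and the cut is d-feasible if there are at most d of them. It does not search for the maximum-volume cut, which the slide attributes to Flowpack. The netlist representation (a dict from node to its fan-in list) and the example circuit are assumptions made for illustration.

```python
def cut_inputs(fanins: dict, cone: set) -> set:
    """Signals outside `cone` that drive at least one node inside it.

    `fanins` maps each node to the list of signals it reads (assumed
    representation); these crossing signals would become array inputs.
    """
    return {src for node in cone for src in fanins.get(node, ()) if src not in cone}


def is_d_feasible(fanins: dict, cone: set, d: int) -> bool:
    """A cone fits a d-input memory if its cut has at most d signals."""
    return len(cut_inputs(fanins, cone)) <= d


# Hypothetical circuit: x = f(a, b), y = f(c, d), z = f(x, y).
fanins = {"x": ["a", "b"], "y": ["c", "d"], "z": ["x", "y"]}

print(sorted(cut_inputs(fanins, {"z"})))            # signals feeding z alone
print(sorted(cut_inputs(fanins, {"x", "y", "z"})))  # signals feeding the whole cone
print(is_d_feasible(fanins, {"x", "y", "z"}, 3))    # four inputs do not fit in three
```

Growing the cone covers more nodes (more volume) but can raise the input count, which is the trade-off the cut search has to manage.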
Choosing Outputs of Memory Array:
A bad way to choose the output signal:
Since D and F fan out outside the fan-in cone, we still need D and F (and their predecessors)
[Figure: example circuit, nodes A through Q, before and after the memory array is inserted]
Suppose there are two memory outputs:
[Figure: two ways of choosing the two outputs for the example circuit, labeled "Bad Solution" and "Better Solution"]
Choosing Outputs of Memory Array:
Goal: We want to select the w output nodes such that the largest number of nodes can be deleted
Problem: For w > 1, it is computationally expensive to check all combinations of w potential outputs
Heuristic:
- 1. For each potential output individually, find that node's maximum fanout-free cone (MFFC)
- 2. Choose the w nodes with the largest MFFCs
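The heuristic above can be sketched as follows. A node's MFFC is taken here to be the node plus every cone node whose fan-outs all stay inside the MFFC, so those nodes become removable once the root is a memory output. The netlist representation, the function names, and the example circuit are assumptions for illustration, not SMAP's actual code.

```python
def mffc(fanins, fanouts, root, cone, primary_outputs):
    """Maximum fanout-free cone of `root`, restricted to nodes in `cone`."""
    members = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for src in fanins.get(node, ()):
            if src in members or src not in cone or src in primary_outputs:
                continue
            # src is absorbed only if every node it drives is already absorbed.
            if all(dst in members for dst in fanouts.get(src, ())):
                members.add(src)
                stack.append(src)
    return members


def choose_outputs(fanins, cone, w, primary_outputs=frozenset()):
    """Pick the w cone nodes with the largest MFFCs (ties broken by name)."""
    fanouts = {}
    for node, srcs in fanins.items():
        for src in srcs:
            fanouts.setdefault(src, set()).add(node)
    sizes = {n: len(mffc(fanins, fanouts, n, cone, primary_outputs)) for n in cone}
    return sorted(cone, key=lambda n: (-sizes[n], n))[:w]


# Hypothetical circuit: i1..i4 are primary inputs; e lies outside the cone.
fanins = {
    "a": ["i1", "i2"], "b": ["i2", "i3"], "g": ["i3", "i4"],
    "c": ["a", "b"], "d": ["b", "g"], "e": ["d", "i1"],
}
cone = {"a", "b", "c", "d", "g"}
print(choose_outputs(fanins, cone, w=2, primary_outputs={"c", "e"}))
```

In this example `b` feeds both `c` and `d`, so it belongs to neither node's MFFC when each is considered individually; that is the kind of interaction the per-node heuristic ignores in exchange for speed.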
Choosing a Seed Node:
It turns out that the choice of seed node is very important
- Try all nodes as potential seeds, and choose whichever gives the best results
- There are ways to speed this up, especially if there are many arrays
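The exhaustive seed search described above amounts to a simple loop. In this sketch the per-seed evaluation is passed in as a callable, `luts_removed_for_seed`, which is a hypothetical stand-in for running the input and output selection steps from that seed; none of the speed-ups mentioned on the slide are shown.

```python
def best_seed(nodes, luts_removed_for_seed):
    """Try every node as the seed and keep the one that removes the most LUTs.

    `luts_removed_for_seed` is a hypothetical callable: given a seed node,
    it returns how many 4-LUTs a trial packing from that seed would remove.
    """
    best, best_gain = None, -1
    for seed in nodes:
        gain = luts_removed_for_seed(seed)
        if gain > best_gain:  # first seed wins on ties
            best, best_gain = seed, gain
    return best, best_gain


# Hypothetical trial results, standing in for real packing runs.
trial_results = {"A": 3, "B": 7, "C": 5}
print(best_seed(["A", "B", "C"], trial_results.get))
```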
Results: Homogeneous Architectures
[Plot: Packed Logic Blocks vs. Bits Per Array (128 to 8192 bits)]
[Plot: Area (equiv. logic blocks) and Packed Logic Blocks vs. Bits Per Array (128 to 8192 bits)]
[Plot: Packing Ratio vs. Bits Per Array (128 to 8192 bits), where Packing Ratio = Logic Blocks Packed / Area (Equiv. Logic Blocks)]
Modifying SMAP for Heterogeneous Archs:
SMAP fills arrays sequentially
We have looked at two strategies:
- 1. Fill all large arrays first
- 2. Fill all small arrays first
Strategy 1 gives better results
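The two fill orders differ only in how the list of arrays is sorted before sequential packing. A minimal sketch, assuming each array is described by its size in bits and that `pack_one` is a hypothetical callable that packs logic into one array of that size and returns the number of LUTs removed:

```python
def pack_sequentially(array_bits, pack_one, large_first=True):
    """Fill the arrays one at a time, in the chosen size order.

    large_first=True corresponds to Strategy 1 (fill all large arrays
    first); large_first=False corresponds to Strategy 2. `pack_one` is
    a hypothetical stand-in for one SMAP packing pass on one array.
    """
    order = sorted(array_bits, reverse=large_first)
    total_removed = 0
    for bits in order:
        total_removed += pack_one(bits)
    return order, total_removed


# Hypothetical per-array results, used only to exercise the loop.
fake_gain = {2048: 20, 128: 3}
print(pack_sequentially([128, 2048, 128, 2048], fake_gain.get))
```

In a real flow the result of each packing pass depends on what earlier passes already removed, which is why the order matters; the fixed lookup table here does not model that.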
Two Sizes: Four Arrays of Each
[Plot: Packing Density as a function of Array 1 Size and Array 2 Size, each from 128 to 8192 bits]
Plot annotations: "Homogeneous Results"; "Best: 2048 bits / 128 bits"; "23% Improvement"
Observations from our Results:
Trend 1: A combination of 2048 / 128 bit arrays is always the best choice
Trend 2: The more arrays, the higher the gain seen by using a heterogeneous architecture
One Type-1 array and Two Type-2 Arrays:
[Plot: Packing Density as a function of Array 1 Size (one of these) and Array 2 Size (two of these), each from 128 to 8192 bits]
Four Type-1 arrays and Eight Type-2 Arrays:
[Plot: Packing Density as a function of Array 1 Size (four of these) and Array 2 Size (eight of these), each from 128 to 8192 bits]
One Type-1 array and Three Type-2 Arrays:
[Plot: Packing Density as a function of Array 1 Size (one of these) and Array 2 Size (three of these), each from 128 to 8192 bits]
Three Type-1 arrays and Nine Type-2 Arrays:
[Plot: Packing Density as a function of Array 1 Size (three of these) and Array 2 Size (nine of these), each from 128 to 8192 bits]
Observations from our Results:
Trend 3: From above, we should have 2048-bit arrays and 128-bit arrays. As the number of arrays increases, more of the arrays should be small.
One Type-1 array and Three Type-2 Arrays:
[Plot: Packing Density as a function of Array 1 Size (one of these) and Array 2 Size (three of these), each from 128 to 8192 bits, with two regions compared]
- Three large arrays and one small array
- One large array and 3 small arrays (better)
Three Type-1 arrays and Nine Type-2 Arrays:
[Plot: Packing Density as a function of Array 1 Size (three of these) and Array 2 Size (nine of these), each from 128 to 8192 bits, with two regions compared]
- Nine large arrays and 3 small arrays
- 3 large arrays and 9 small arrays (better)
Things we haven't taken into account:
Speed:
- Heterogeneous architectures are likely to give gains in speed (compared to homogeneous) since an array of "just the right size" can be used
- Right now, SMAP doesn't optimize for speed, but for homogeneous architectures, there is little impact on speed
Routing:
- With heterogeneous architectures, there may be longer routes to get to the right memory
- But not too bad, if only a few memory types
Summary
Heterogeneous Memory Architectures are efficient when implementing logic
- Compared to homogeneous architectures, a 23% improvement is typical
- The more arrays, the higher the gain
- A combination of 2048 / 128 bit arrays is always the best choice
- As the number of arrays increases, more of the