Resource Allocation for Hardware Implementations of Map
Richard Townsend, Martha A. Kim, Stephen A. Edwards
Columbia University
ASBD Workshop, June 15, 2014
Functional Programs to Functional Hardware
Functional program (Haskell) → Compiler → Circuit (Verilog)
Map f [x1,x2,...,xn]
This Talk
?
Ordered List
fold scan
Order Dependent
Functional Map vs. MapReduce
(0,0) (0,1) (0,2) (0,3)
(1,0) (1,1) (1,2) (1,3)
(2,0) (2,1) (2,2) (2,3)
(3,0) (3,1) (3,2) (3,3)
Ordered vs. Unordered
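The ordered/unordered contrast can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function `f`, the helper names, and the per-item latencies are all hypothetical. It assumes "ordered" means result i is always f applied to input i, while "unordered" means results are collected in whatever order they finish.

```python
def f(x):
    # Hypothetical stand-in for the mapped function.
    return x * x

def ordered_map(f, xs):
    """Functional map: the i-th result is always f(xs[i])."""
    return [f(x) for x in xs]

def completion_order_map(f, xs, latency):
    """Unordered collection: results appear in the order they finish.

    latency[i] is an assumed processing time for xs[i]; ties are broken
    by input index so the example is deterministic.
    """
    finish_order = sorted(range(len(xs)), key=lambda i: (latency[i], i))
    return [f(xs[i]) for i in finish_order]

xs = [1, 2, 3, 4]
latency = [3, 1, 2, 1]  # made-up latencies, for illustration only

ordered = ordered_map(f, xs)
unordered = completion_order_map(f, xs, latency)
print(ordered)    # positions match the input list
print(unordered)  # same values, arranged by completion time
```

Both lists hold the same values; only the ordered version lets a consumer rely on position, which is what order-dependent operations such as fold and scan would need.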
Structure of a Hardware Implementation
[Figure: five functional units (f) consume inputs x1 ... x10 from an ordered list. Results f(x1), f(x3), and f(x7) appear while other inputs are unfinished. Annotations: "still processing", "stuck in buffer", "holding and stalling".]
More Functional Units (Parallelism)
More Buffers (Utilization)
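The stalling behaviour in the figure can be explored with a toy cycle-level model. The sketch below is not the authors' simulator; it rests on assumptions that the slides do not state: inputs are dispatched in order to idle units, each unit has its own private buffer of `slots` entries, results leave strictly in input order, and any number of ready results may leave in one cycle. The name `simulate` and the latency lists are hypothetical.

```python
def simulate(latencies, units, slots):
    """Return the cycle count needed to emit all results in input order."""
    n = len(latencies)
    next_in = 0                  # next input index to dispatch
    next_out = 0                 # next result index the ordered output expects
    current = [None] * units     # index each unit is processing or holding
    remaining = [0] * units      # cycles of work left in each unit
    buffers = [set() for _ in range(units)]  # finished results parked per unit
    cycle = 0
    while next_out < n:
        cycle += 1
        # Step 1: idle units take the next input; busy units do one cycle of work.
        for u in range(units):
            if current[u] is None and next_in < n:
                current[u] = next_in
                remaining[u] = latencies[next_in]
                next_in += 1
            if current[u] is not None and remaining[u] > 0:
                remaining[u] -= 1
        # Step 2: move finished results out, repeating until nothing changes.
        progress = True
        while progress:
            progress = False
            for u in range(units):
                if current[u] is not None and remaining[u] == 0:
                    if current[u] == next_out:
                        # Next in order: leaves directly.
                        next_out += 1
                        current[u] = None
                        progress = True
                    elif len(buffers[u]) < slots:
                        # Out of order: park it, freeing the unit.
                        buffers[u].add(current[u])
                        current[u] = None
                        progress = True
                    # Otherwise the unit holds its result and stalls.
                if next_out in buffers[u]:
                    buffers[u].remove(next_out)
                    next_out += 1
                    progress = True
    return cycle

workload = [4, 1, 1, 1, 1]       # one slow item followed by fast ones (made up)
no_buffers = simulate(workload, units=2, slots=0)
one_slot = simulate(workload, units=2, slots=1)
three_slots = simulate(workload, units=2, slots=3)
print(no_buffers, one_slot, three_slots)
```

With this made-up workload, the second unit stalls behind the slow first item unless it has somewhere to park its results, so added buffer slots shorten the run under these assumptions. Other workloads may favour more functional units instead.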
Multiple Possible Configurations...Which to Choose?
Area = 15
Buffers 50% size of func. unit
15 Functional Units, 0 Buffers: area 15
1 Functional Unit, 28 Buffers: 1 + 28 × 1/2 = 1 + 14 = 15
5 Functional Units, 20 Buffers: 5 + 20 × 1/2 = 5 + 10 = 15
3 Functional Units, 24 Buffers: 3 + 24 × 1/2 = 3 + 12 = 15
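The area arithmetic above can be enumerated mechanically. The following sketch assumes a functional unit occupies one unit of area (implied by the sums on the slide, but not stated outright) and that a buffer occupies half of that; the names `AREA`, `UNIT_AREA`, `BUFFER_AREA`, and `configurations` are introduced here for illustration.

```python
from fractions import Fraction

AREA = 15
UNIT_AREA = 1                   # assumption: one functional unit = one area unit
BUFFER_AREA = Fraction(1, 2)    # from the slide: a buffer is 50% of a unit

def configurations(area=AREA):
    """Return (functional_units, total_buffers) pairs that fill the area exactly."""
    configs = []
    for units in range(1, area + 1):
        leftover = area - units * UNIT_AREA
        buffers = leftover / BUFFER_AREA   # exact, thanks to Fraction
        configs.append((units, int(buffers)))
    return configs

configs = configurations()
for units, buffers in configs:
    print(f"{units:2d} functional units, {buffers:2d} buffers")
```

Every extra functional unit costs two buffers, so a budget of 15 admits fifteen configurations, from 1 unit with 28 buffers to 15 units with none. Which one finishes soonest depends on the workload.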
Workload Structure
[Figure: timelines (Time axis) for three functional units (f) under different workload structures.]
Best-case
Average-case: ?
Worst-case
Optimal Resource Allocation
[Plot: Completion Time (0% to 100%) against Buffer Slots per Functional Unit.]
Simulator Results
2x speedup: Maximizing Functional Units
3x speedup: Maximizing Buffers
Fewer Buffers Slower
Why Are There Multiple Optima?
[Plot: Completion Time (0% to 100%), with points labeled 12 5, 7 10, 11 6, and 6 11.]
Performance Scales with Area
[Plot: Completion Time (0% to 100%) against Total Area (10 to 60), with the Ideal Minimum marked. Companion panels plot Buffer Slots / Functional Unit and Functional Units against Total Area.]
Conclusions
[Plot: Completion Time (0% to 100%) against Buffer Slots per Functional Unit.]
Area trade-off is important...
...and non-obvious
Model helps explore design space
Synthesize Efficient Hardware Implementation of Map
Map, Fold, Scan
[Plot: Completion Time (0% to 100%) against Total Area (10 to 60).]