COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators
Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
Columbia University, New York, USA
ACM/IEEE CODES+ISSS 2017, Seoul, South Korea
[Figure: accelerator diagram. Labels: On-chip Interconnect; Loop #1 … Loop #N; Private Local Memory (PLM) with multiple banks]
[Figure: Private Local Memory (PLM) drawn with four banks, then with eight banks]
[Figure: Area (mm²) vs. Effective Latency (ms); series: 1 port, 2 ports, 4 ports, 8 ports. A second view at finer scale labels the points 2u, 3u, 4u, 5u, 6u, 7u, 8u, 9u, 10u, 14u]
[Figure: three design-space plots. Gradient: Area (mm²) vs. Effective Latency (ms). Grayscale: Area (mm²) vs. Effective Latency (ms). Composition: Area (mm²) vs. Effective Throughput (1/ms). Legend: Pareto, Dominated]
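The plot legend above separates Pareto points from dominated ones. As a minimal sketch of that distinction, assuming both metrics (area and effective latency) are to be minimized, the filter below keeps only the points that no other point beats. The function names and the sample values are hypothetical and are not taken from the slides.

```python
from typing import List, Tuple

# (area_mm2, latency_ms); lower is better for both metrics (assumption).
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b in both metrics and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by area."""
    return sorted(p for p in points if not any(dominates(q, p) for q in points))


# Hypothetical design points for illustration only.
points = [(1.00, 0.46), (1.04, 0.40), (1.08, 0.42), (1.12, 0.36), (1.20, 0.34)]
front = pareto_front(points)
print(front)
```

For a plot that uses throughput instead of latency, the second comparison would flip, since higher throughput is better.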
[Figure: Area (mm²) vs. Effective Latency (ms); series: 1 port]
[Figure: Area (mm²) vs. Effective Latency (ms); series: 1 port; annotated with two markers: a quantity whose symbol is illegible in the extraction, and that quantity + η]
G HGI HJ, G HGI HL)
[Figure: Number of Unrolls vs. Effective Latency (ms). Annotated points: latency = 40, unrolls = 1; latency = 30, unrolls = 4; latency = 20, unrolls = 11; latency = 10, unrolls = 30]
LUCAS-KANADE components: DEBAYER, GRAYSCALE, GRADIENT, WARP-DX, WARP-DY, STEEP.-DESCENT, HESSIAN, MATRIX-INV, WARP-GRAY, MATRIX-SUB, SD-UPDATE, MATRIX-MUL, MATRIX-RESH, MATRIX-ADD, WARP-IWXP, CHANGE-DET.
[Figure: Area (mm²) vs. Effective Latency (ms); series: 2 ports, 4 ports, 8 ports, 16 ports]
[Figure: two charts over the components DEBAYER, GRAYSCALE, GRADIENT, MATRIX-SUB, WARP, MATRIX-ADD, MATRIX-MUL, MATRIX-RESH, HESSIAN, STEEP-DESCENT, SD-UPDATE, CHANGE-DET; scale 1 to 10]
[Figure: chart over the same components; scale 10 to 120]
[Figure: Area (mm²) vs. Throughput (frames/s). Legend: planned design point (theoretical), mapped design point (algorithm). Percentage of area mismatch per point: 2.5%, 11.9%, 13.0%, 1.5%, 2.5%, 0.1%, 2.1%, 1.8%, 1.6%, 1.8%]
large mismatch in area because:
Images from: https://www.flaticon.com/