Stitch: Fusible Heterogeneous Accelerators Enmeshed with Many-Core Architecture for Wearables
Cheng Tan, Manupa Karunaratne, Tulika Mitra, Li-Shiuan Peh
Emerging Wearables
Software programmable to support diverse applications
- Pokemon Go on Apple Watch
- Here Maps on Samsung Gear S2
- Health care apps
- Bus stop detection app (user defined)
- Navigation on smart glass
Performance requirement (10000 MIPS)
Power constraint (500 mW)
[Figure: chronology of smartwatch processors, Jan-2013 to Sep-2017, log-scale axis from 1 to 100000. Series with trend lines: core count, DMIPS/watt, power (mW), DMIPS.]
Devices plotted: Sony Smartwatch 1 (ARM Cortex-M3), Sony Smartwatch 2 (ARM Cortex-M4), Qualcomm Toq (ARM Cortex-M3), Motorola Moto 360 1st gen (ARM Cortex-A8), LG G Watch R (ARM Cortex-A7), Motorola Moto 360 2nd gen (ARM Cortex-A7), Samsung Gear S2 (ARM Cortex-A7), Samsung Gear S3 (ARM Cortex-A7), Samsung Gear S (ARM Cortex-A7), Asus Zenwatch 3 (ARM Cortex-A7), Huawei Watch 2 (ARM Cortex-A7)
Target: 500 mW, 10000 MIPS
- Odroid board emulating the state-of-the-art smartwatch
- Time per gesture: 13 ms > 10 ms
- Cannot meet the target throughput

Platform              Meeting throughput  Time per gesture (ms)  Power (mW)  Frequency (MHz)  Technology
4-core ARM Cortex-A7  no                  13                     469         1200             28nm
Finger gesture application (6-stage pipeline, 16 kernels)
Input: Acc/Gyro (X, Y, Z)
Kernels: Window moving; FFT1-FFT6; Filter; IFFT1-IFFT6; Update feature; Classify
[Figure: 16 tiles, each with a router (R), sharing a memory controller]
Abundant parallelism -> many-core architecture
Power budget -> simple in-order core
Each tile: 8.75 mW
In-order CPU
Improve performance/power -> accelerators
Accelerators (e.g., ASIC, FPGA, CGRA, and Reconfigurable Functional Unit)
[Figure: tiles paired with kernel-specific accelerators: FFT acc, Filter acc, IFFT acc, Updt acc, Classify acc]
Different kernels -> heterogeneous accelerators
[Figure: 16-tile mesh (routers R, memory controller) with heterogeneous accelerators: 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Tile detail: in-order CPU, heterogeneous accelerator (patch8), Xbar switch.]
Imbalanced workload -> fusible accelerators
Actual fusion happens at runtime
Stitch compiler tool chain
Compiler decides the fusion of accelerators offline
[Figure: tile with in-order CPU, patch (AT-AS), router, NIC, L1-I and L1-D; the Stitch compiler tool chain; example data flow graph fragments from the DTW and AES kernels, with operations such as +, x, >, |, &, ld and st]
‘Hot’ patterns
- {AT}: arithmetic + memory access, 95.7%
- {MA}: multiply + arithmetic, 47.8%
- {AA}: arithmetic + arithmetic, 34.8%
- {AS}: arithmetic + shift, 21.7%
- {SA}: shift + arithmetic, 21.7%
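The pattern percentages can be illustrated with a small counting sketch. It assumes each percentage is the share of kernels whose data flow graph contains the two-operation pattern at least once; the slide does not state the metric, so this reading is an assumption. The function name `pattern_share` and the toy kernels are hypothetical.

```python
from collections import Counter


def pattern_share(kernel_edges):
    """Return, for each two-operation pattern, the fraction of kernels containing it.

    kernel_edges maps a kernel name to a list of (producer_class, consumer_class)
    pairs, one per data flow graph edge. Classes: A = arithmetic, T = memory
    access, M = multiply, S = shift.
    """
    counts = Counter()
    for edges in kernel_edges.values():
        # Count each pattern once per kernel, however often it occurs inside it.
        for pattern in set(edges):
            counts[pattern] += 1
    total = len(kernel_edges)
    return {pattern: n / total for pattern, n in counts.items()}


# Hypothetical toy input, not data from the slides.
toy_kernels = {
    "k1": [("A", "T"), ("M", "A")],
    "k2": [("A", "T"), ("A", "S")],
    "k3": [("A", "T")],
    "k4": [("S", "A"), ("A", "T"), ("A", "T")],
}
shares = pattern_share(toy_kernels)
for pattern, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print("{%s%s}: %.1f%%" % (pattern[0], pattern[1], 100 * share))
```

In this toy input {AT} appears in every kernel, which mirrors why an arithmetic plus memory-access stage is a natural first half for every patch type.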
Simple computation fragment
Multiple rounds of Longest Common Substring (LCS) identification
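One round of Longest Common Substring identification can be sketched with the textbook dynamic program. The slides only name the technique, so treating each simple computation fragment as a sequence of operation classes is an assumption made for illustration, and `longest_common_substring` is a hypothetical helper name.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of items shared by sequences a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


# Hypothetical operation-class sequences from two computation fragments.
frag1 = ["A", "T", "M", "A", "S"]
frag2 = ["S", "A", "T", "M", "T"]
common = longest_common_substring(frag1, frag2)
print(common)
```

Repeating the search after removing or masking the pattern just found would give the "multiple rounds" mentioned on the slide; that loop is left out here.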
- 8 x Acc1 -> {AT-MA}
- 4 x Acc2 -> {AT-AS}
- 4 x Acc3 -> {AT-SA}
Patch types
- {AT-MA}: ALU, SPM access; Multiplier, ALU
- {AT-SA}: ALU, SPM access; Shifter, ALU
- {AT-AS}: ALU, SPM access; ALU, Shifter
[Figure: datapaths of (a) patch {AT-MA}, (b) patch {AT-SA}, (c) patch {AT-AS}. Each datapath has an ALU, an LMAU with local memory, control signals and two outputs, plus a multiplier (M) and ALU or a shifter and ALU.]
AT-MA, AT-SA, AT-AS
- T indicates the memory access operation
- A scratchpad memory (SPM) is attached beside the CPU
- Both CPU and accelerator can access the SPM
[Figure: tile with in-order CPU, patch (AT-AS), router, NIC, L1-I, L1-D and SPM; accelerator types Acc 1, Acc 2, Acc 3]
Stitch compiler tool chain
Address space allocation
[Figure (a): data flow graph with operations +, +, >, >, ld, st; inputs r1, r2, r3, r4; outputs r2, r5]
- By {AT-MA}: 4 cycles
[Figure (b): the data flow graph accelerated by patch {AT-MA}; the > operations are labeled "CPU EXE stage"; 4 cycles]
- By {AT-AS}: 2 cycles
[Figure (c): the data flow graph accelerated by patch {AT-AS}; 2 cycles, compared with 4 cycles on {AT-MA}]
- {AT-AS} ∪ {AT-AS}: 1 cycle
[Figure (e): the data flow graph accelerated by a fused patch {AT-AS, AT-AS}; 1 cycle]
- Crossbar-based; Bufferless; Compiler-scheduled
[Figure: crossbar switch placed between the patch and the in-order CPU, with North, East, West, South and REG ports, an SPM port, a crossbar config register (C) and asynchronous repeaters. Inputs: 4 word-size register values and a 38-bit patch configuration. Example: patch2, patch10 and patch6 linked through their N/S/W/E ports.]
North_in = South_out
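A pass-through setting such as North_in = South_out can be modeled as a per-tile lookup table that is fixed offline. The sketch below is a toy model under stated assumptions: a 4x4 mesh with 1-based row-major patch numbering (so patch2, patch6 and patch10 share a column), one table entry per input port, and a hypothetical `trace` helper. None of these details are confirmed by the slides.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
STEP = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}  # (row, col) deltas


def trace(configs, start_tile, out_port, max_hops=16):
    """Follow a value that leaves start_tile through out_port.

    configs[tile] maps an input port to the output port chosen offline;
    the special value "PATCH" delivers the value to that tile's patch.
    No buffering or arbitration is modeled: the path is fully determined
    by the static configuration.
    """
    tile, port = start_tile, out_port
    path = [tile]
    for _ in range(max_hops):
        d_row, d_col = STEP[port]
        tile = (tile[0] + d_row, tile[1] + d_col)
        path.append(tile)
        in_port = OPPOSITE[port]  # leaving southward means arriving from the north
        port = configs[tile][in_port]
        if port == "PATCH":
            return path
    raise RuntimeError("value never reached a patch")


# Assumed placement: patch2 at (0, 1), patch6 at (1, 1), patch10 at (2, 1).
configs = {
    (1, 1): {"N": "S"},      # tile 6 only forwards: North_in = South_out
    (2, 1): {"N": "PATCH"},  # tile 10 consumes the value in its patch
}
route = trace(configs, (0, 1), "S")
print(route)
```

Under these assumptions, fusing patch2 with patch10 leaves the patch in tile 6 free for its own core, since tile 6 contributes only a switch setting.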
Stitch compiler tool chain: maps the finger gesture application (6-stage pipeline, 16 kernels) onto the cores (core1 ... core16) and patches of the Stitch architecture

Kernel          Core    Patches
Update feature  core2   patch2, patch10
IFFT1           core1   patch1, patch5
IFFT2           core3   patch3, patch11
IFFT3           core9   patch9, patch13
IFFT4           core11  patch11, patch15
IFFT5           core6   patch6, patch8
IFFT6           core14  patch14, patch16
Platform              Meeting throughput  Time per gesture (ms)  Power (mW)  Frequency (MHz)  Technology
4-core ARM Cortex-A7  no                  13                     469         1200             28nm
Stitch w/o fusion     no                  11.49                  108         200              40nm
Stitch                yes                 7.62                   139.5       200              40nm
[Figure: placement of the 16 gesture kernels on the 16-tile mesh, without and with patch fusion; tile2, tile6 and tile10 are highlighted]
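The table can be turned into a few derived figures. The sketch below uses only the numbers in the table; the 10 ms per-gesture target is taken from the earlier "13 ms > 10 ms" bullet, and energy per gesture (time x power) is a derived quantity that the slides do not report. The names `ROWS` and `summarize` are hypothetical.

```python
TARGET_MS = 10.0  # per-gesture deadline implied by "13 ms > 10 ms"

# (time per gesture in ms, power in mW), copied from the table above.
ROWS = {
    "4-core ARM Cortex-A7": (13.0, 469.0),
    "Stitch w/o fusion": (11.49, 108.0),
    "Stitch": (7.62, 139.5),
}


def summarize(time_ms, power_mw):
    """Derive throughput and energy figures from one table row."""
    return {
        "meets_target": time_ms <= TARGET_MS,
        "gestures_per_s": 1000.0 / time_ms,
        # mW * ms = microjoules; divide by 1000 for millijoules.
        "energy_mj": time_ms * power_mw / 1000.0,
    }


summary = {name: summarize(*row) for name, row in ROWS.items()}
for name, s in summary.items():
    print("%-22s meets=%-5s %.1f gestures/s %.3f mJ/gesture"
          % (name, s["meets_target"], s["gestures_per_s"], s["energy_mj"]))
```

Only the fused configuration falls under the 10 ms target, which matches the "Meeting throughput" column.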
APP2 -- Image recognition
Kernels: CONV1,1-CONV1,4; CONV2,1-CONV2,6; CONV3,1-CONV3,3; POOL1; POOL2; POOL&FC (input: image; example output: car)
APP3 -- SVM-based machine learning
Kernels: AES 1-AES 6; scatter; SVM1-SVM8; gather
APP4 -- Transportation context-detection
Kernels: AES 1-AES 6; DAES 1-DAES 6; DTW1-DTW4
[Figure: placement of each application's 16 kernels on the 16-tile mesh, one panel each for APP1 (finger gesture), APP2, APP3 and APP4]
- 2.3X improvement in terms of throughput
[Figure: normalized throughput for APP1-APP4 and average, y-axis 0.00 to 3.00; series: LOCUS, Stitch w/o fusion, Stitch; labeled values: 1.14, 1.53, 2.30]
[Figure: normalized power- and area-efficiency, APP1-APP4 and average]
                             APP1  APP2  APP3  APP4  average
Normalized power-efficiency  1.55  2.06  1.57  2.15  1.77
Normalized area-efficiency   1.90  2.66  1.81  2.77  2.28
- 0.5% of the entire chip
(quad-core ARM Cortex-A7)
- Average 1.65X improvement in terms of throughput
- With only average 27% power consumption
- Average 6.04X improvement in terms of performance/watt

                             APP1  APP2  APP3  APP4  average
Normalized throughput        1.71  1.69  1.23  1.97  1.65
Normalized power             0.28  0.27  0.28  0.27  0.27
Normalized performance/watt  6.20  6.19  4.39  7.39  6.04
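The three rows are related: normalized performance/watt is normalized throughput divided by normalized power. The sketch below recomputes that ratio from the rounded values in the table, so the derived numbers only approximate the reported ones; reading the "average" column as an arithmetic mean is an assumption that happens to reproduce 6.04.

```python
APPS = ["APP1", "APP2", "APP3", "APP4"]
throughput = dict(zip(APPS, [1.71, 1.69, 1.23, 1.97]))  # vs. quad-core ARM Cortex-A7
power = dict(zip(APPS, [0.28, 0.27, 0.28, 0.27]))
reported = dict(zip(APPS, [6.20, 6.19, 4.39, 7.39]))

# Performance/watt relative to the baseline = throughput ratio / power ratio.
derived = {app: throughput[app] / power[app] for app in APPS}
mean_reported = sum(reported.values()) / len(reported)

for app in APPS:
    print("%s derived %.2f, reported %.2f" % (app, derived[app], reported[app]))
print("mean of reported values: %.2f" % mean_reported)
```

The small gaps between derived and reported values are consistent with the inputs having been rounded to two decimals before division.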
- 6.04x vs. quad-core ARM Cortex-A7 in terms of performance/watt
- 1.77x vs. 16-core baseline architecture in terms of performance/watt
- 2.28x vs. 16-core baseline architecture in terms of area-efficiency

Configurable and fusible accelerators (polymorphic patches) are effectively enmeshed with the cores
- Each patch can handle very simple custom instructions
- Multiple polymorphic patches can be fused together across the chip to create large, virtual accelerators for complex custom instructions
- Fusion is achieved with an ultra-lightweight, compiler-scheduled network-on-chip without any buffers or control logic
Different Architectures Incorporated with Reconfigurable Fabrics
(a) Loosely coupled accelerator
- Requires local register files, control and data memories, and high data transfer bandwidth
- High design complexity and area overhead
(b) Tightly coupled accelerator
- Shares processor resources (instruction fetch, decode, register file, and even on-chip memory)
- Takes the stringent area and power budget into consideration
[Figure: (a) CPU pipeline (Fetch, Dec, Exe, Mem, Write; I$, D$, Reg File) next to a separate accelerator with accelerator interface, Status_Reg, Mode_Reg, DMA Ctrl, Ctrl Mem, accelerator logic and private data memory; (b) the same CPU pipeline with the accelerator logic inside the CPU; an interconnect block is also shown]
explicit message passing.
Conventional: shared memory
Source Core 1:
  mutex.lock();
  value = (value + 1) * 3;
  mutex.unlock();
Destination Core 11:
  mutex.lock();
  value = (value - 2) / 4;
  mutex.unlock();
Data path: str r0, [value] on the source; ldr r1, [value] on the destination
Overheads: cache coherence, lock contention
[Figure: two tiles, each with in-order core, I-cache, D-cache, NIC and router; the value travels from R0 through the D-caches and shared memory to R1]
Our method: register-to-register data transfer
Source Core 1:
  value = (value + 1) * 3;
  Send(value, core 11);
Destination Core 11:
  Recv(value, core 1);
  value = (value - 2) / 4;
Shared memory: bypassed
[Figure: two tiles, each with in-order core, SFU, LMP, I-cache, D-cache, NIC and SMART router; the value travels from R0 to R1 through the LMPs and SMART routers; the mutex.lock()/mutex.unlock() calls of the conventional version also appear on the slide]
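The contrast between the two slides can be imitated in software with two threads standing in for core 1 and core 11. This is only an analogy: `queue.Queue` plays the role of the Send/Recv link, the function names are hypothetical, and the `threading.Event` used to order the shared-memory version is an added assumption, since the slide shows only the mutex calls.

```python
import queue
import threading


def run_pair(source, destination):
    """Run the two 'cores' as threads and wait for both to finish."""
    threads = [threading.Thread(target=destination), threading.Thread(target=source)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def shared_memory_version(initial):
    shared = {"value": initial}
    lock = threading.Lock()
    written = threading.Event()  # assumed ordering signal, not on the slide

    def source():
        with lock:  # mutex.lock() ... mutex.unlock()
            shared["value"] = (shared["value"] + 1) * 3
        written.set()

    def destination():
        written.wait()
        with lock:
            shared["value"] = (shared["value"] - 2) / 4

    run_pair(source, destination)
    return shared["value"]


def message_passing_version(initial):
    link = queue.Queue(maxsize=1)
    result = []

    def source():
        link.put((initial + 1) * 3)  # Send(value, core 11)

    def destination():
        value = link.get()  # Recv(value, core 1); blocks until data arrives
        result.append((value - 2) / 4)

    run_pair(source, destination)
    return result[0]


print(shared_memory_version(1), message_passing_version(1))
```

In the message-passing version the blocking receive provides both the data and the ordering, so no lock or shared variable is needed.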
Our method: cache-to-cache data transfer
Source Core 1:
  for (i = 1 to 100) value[i] = (value[i] + 1) * 3;
  Send(&value[0], 100, core 11);
Destination Core 11:
  Recv(value, 100, core 1);
  for (i = 1 to 100) value[i] = (value[i] - 2) / 4;
[Figure: two tiles, each with in-order core, SFU, LMP, I-cache, D-cache, NIC and SMART router; the array travels from D-cache to D-cache through the SMART routers; the mutex.lock()/mutex.unlock() calls of the conventional version also appear on the slide]
-> Less/faster on-chip communication.
Compiler tool chain
1) Multi-kernel application -> assembly by the GNU GCC front-end.
2) Profiling each kernel -> bottlenecked kernels and ‘hot’ basic blocks.
3) ‘Hot’ computational patterns (4/2 in/out) -> DFGs.
4) All custom instruction candidates -> mapped onto each patch -> potential speedup accelerated by any patch.
5) Modified GNU Assembler -> new assembly/executable with the patch control signals.
6) Stitching algorithm that maximizes overall throughput -> appropriate kernel mapping, version selection, patch stitching, and inter-patch NoC configuration.
[Figure: tool flow. Blocks: GNU GCC front-end, ‘hot’ basic block identifier, DFG generator, ISE identifier (with constraints), graph-based mapper (with the different patch types), ISE selector, stitching algorithm (with the Stitch architecture), modified GNU Assembler. Artifacts: multi-kernel application, *.asm, hot basic blocks, DFGs, custom instruction candidates, potential speedup for each (kernel, patches), multiple versions with different speedup for each kernel, address mapping, ISE control signals, new *.asm, new *.exe.]
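Step 6 can be illustrated with a simple greedy heuristic. This is not the authors' stitching algorithm, which the slides do not describe. It assumes pipeline throughput is limited by the slowest kernel, that each kernel comes in several versions trading patches for cycles, and that patches are interchangeable; it ignores patch types, placement and NoC configuration. All names and numbers are hypothetical.

```python
def stitch_greedy(versions, patch_budget):
    """Pick one version per kernel so that the slowest kernel is as fast as possible.

    versions maps kernel -> list of (patches_used, cycles), sorted so that
    later entries use more patches and take fewer cycles.
    Returns (chosen version index per kernel, cycles of the slowest kernel).
    """
    choice = {kernel: 0 for kernel in versions}
    used = sum(v[0][0] for v in versions.values())
    while True:
        # The pipeline rate is set by the slowest kernel, so upgrade that one.
        bottleneck = max(choice, key=lambda k: versions[k][choice[k]][1])
        idx = choice[bottleneck]
        if idx + 1 >= len(versions[bottleneck]):
            break  # the bottleneck has no faster version
        extra = versions[bottleneck][idx + 1][0] - versions[bottleneck][idx][0]
        if used + extra > patch_budget:
            break  # not enough free patches to speed up the bottleneck
        choice[bottleneck] = idx + 1
        used += extra
    period = max(versions[k][choice[k]][1] for k in versions)
    return choice, period


# Hypothetical versions: (patches used, cycles per invocation).
toy_versions = {
    "fft": [(0, 100), (1, 60), (2, 40)],
    "filter": [(0, 50), (1, 30)],
    "classify": [(0, 80), (1, 45)],
}
choice, period = stitch_greedy(toy_versions, patch_budget=3)
print(choice, period)
```

With three patches the sketch gives two to the FFT kernel and one to the classifier, after which the filter becomes the bottleneck and no patch is left to speed it up.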