SLIDE 1

Stitch: Fusible Heterogeneous Accelerators Enmeshed with Many-Core Architecture for Wearables

Cheng Tan, Manupa Karunaratne, Tulika Mitra, Li-Shiuan Peh

SLIDE 4

Emerging Wearables

Software programmable to support diverse applications

Pokemon Go on Apple Watch
Here Maps on Samsung Gear S2
Health care apps on smart watches
Bus stop detection app (user defined) on LG Watch Urban
Navigation on smart glass

Performance requirement (10000 MIPS)
Power constraint (500 mW)

SLIDE 5

Wearable SoC Architecture Trend

[Figure: log-scale plot (1 to 100000) of core count, DMIPS/watt, power (mW), and DMIPS, each with a trend line, in chronological order from Jan-2013 to Sep-2017]

Devices plotted:
Sony Smartwatch 1: ARM Cortex-M3
Sony Smartwatch 2: ARM Cortex-M4
Qualcomm Toq: ARM Cortex-M3
Motorola Moto 360 1st: ARM Cortex-A8
LG G Watch R: ARM Cortex-A7
Motorola Moto 360 2nd: ARM Cortex-A7
Samsung Gear S2: ARM Cortex-A7
Samsung Gear S3: ARM Cortex-A7
Samsung Gear S: ARM Cortex-A7
Asus Zenwatch 3: ARM Cortex-A7
Huawei Watch 2: ARM Cortex-A7

SLIDE 6

Wearable SoC Architecture Trend

[Figure: the trend plot from the previous slide (same devices and axes), annotated with the targets 500 mW and 10000 MIPS]

SLIDE 7

Finger gesture recognition application

Motivating Case Study


SLIDE 8

Finger gesture recognition application
State-of-the-art smartwatch

Ø Odroid board emulating the state-of-the-art smartwatch
Ø Time per gesture: 13 ms > 10 ms
Ø Cannot meet the target throughput

Motivating Case Study

                         4-core ARM Cortex-A7
Meeting throughput       no
Time per gesture (ms)    13
Power (mW)               469
Frequency (MHz)          1200
Technology               28nm

SLIDE 9

Wearable Application Characteristics

Finger gesture application (6-stage pipeline, 16 kernels)
Input: Acc/Gyro (X, Y, Z)
Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify

Abundant parallelism -> many-core architecture

Accelerators (e.g., ASIC, FPGA, CGRA, and Reconfigurable Functional Unit)

[Figure: 16 routers (R) with a memory controller; accelerators labeled FFT acc, Filter acc, IFFT acc, Updt acc, and Classify acc]

SLIDE 10

Wearable Application Characteristics

Finger gesture application (6-stage pipeline, 16 kernels)

Power budget -> simple in-order core

Each tile: 8.75 mW

[Figure: 16 routers (R) with a memory controller; tile label: in-order CPU]

SLIDE 11

Wearable Application Characteristics

Finger gesture application (6-stage pipeline, 16 kernels)

Improve performance/power -> accelerators

Accelerators (e.g., ASIC, FPGA, CGRA, and Reconfigurable Functional Unit)

Each tile: 8.75 mW

SLIDE 12

Wearable Application Characteristics

Finger gesture application (6-stage pipeline, 16 kernels)

Different kernels -> heterogeneous accelerators

[Figure: 16 routers (R) with a memory controller; accelerators labeled FFT acc, Filter acc, IFFT acc, Updt acc, and Classify acc]

SLIDE 13

Wearable Application Characteristics

Finger gesture application (6-stage pipeline, 16 kernels)

Different kernels -> heterogeneous accelerators

[Figure: 16 routers (R) with a memory controller; 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Callout of one tile (patch8): in-order CPU, heterogeneous accelerator (units A, T, M, A), Xbar switch]

SLIDE 14

Wearable Application Characteristics

Finger gesture application (6-stage pipeline, 16 kernels)

Imbalanced workload -> fusible accelerators

Stitch compiler tool chain
Ø Compiler decides the fusion of accelerators offline
Ø Actual fusion happens at runtime

[Figure: 16 routers (R) with a memory controller; 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Callout of one tile (patch8): in-order CPU, heterogeneous accelerator (units A, T, M), Xbar switch]

SLIDE 15

Stitch Architecture - Overview

Many-core architecture with simple in-order CPU and accelerator
Heterogeneous customizable accelerators – polymorphic patches
Patches are able to fuse together to alleviate the bottleneck kernels
The fusion of patches is directed offline by our compiler tool chain

[Figure: 16 routers (R) with a memory controller; 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Tile labels: in-order CPU, patch (AT-AS), router, NIC, L1-I, L1-D. The Stitch compiler tool chain is shown alongside the architecture]

SLIDE 16

Heterogeneous customizable accelerators – polymorphic patches
Patch architecture motivated by representative wearable kernels

Patch Architecture

[Figure: data flow graphs of representative wearable kernels (DTW, AES, FFT), drawn with operations such as +, x, >, |, &, ^, ld, and st]

SLIDE 17

Heterogeneous customizable accelerators – polymorphic patches
Patch architecture motivated by representative wearable kernels

Patch Architecture

‘Hot’ patterns:

{AT}: arithmetic + memory access   95.7%
{MA}: multiply + arithmetic        47.8%
{AA}: arithmetic + arithmetic      34.8%
{AS}: arithmetic + shift           21.7%
{SA}: shift + arithmetic           21.7%

[Figure: example fragment with operations + ld + ^ + > > + X +]

Simple computation fragment

Multiple rounds of Longest Common Substring (LCS) identification
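
The slide names Longest Common Substring (LCS) identification as the way recurring patterns are found. A minimal sketch of one such round follows, assuming each kernel's computation is flattened into a string of operation classes (A = arithmetic, T = memory access, M = multiply, S = shift). The two sequences and the function name are hypothetical and are not taken from the Stitch tool chain.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of symbols shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


# Hypothetical operation-class sequences for two kernels.
kernel_x = "ATMAAS"
kernel_y = "SATMAT"
print(longest_common_substring(kernel_x, kernel_y))  # prints "ATMA"
```

Repeating this over more kernel pairs, and removing each match before the next round, is one plausible reading of "multiple rounds"; the slide does not spell out the procedure.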

SLIDE 18

Patch Architecture

Heterogeneous customizable accelerators – polymorphic patches
Patch architecture motivated by representative wearable kernels

Ø 8 x Acc1 -> {AT-MA}
Ø 4 x Acc2 -> {AT-AS}
Ø 4 x Acc3 -> {AT-SA}

[Figure: 16 routers (R) with a memory controller; 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Tile labels: in-order CPU, patch (AT-AS), router, NIC, L1-I, L1-D]

SLIDE 19

Patch Architecture

AT-MA

Ø ALU, SPM access; Multiplier, ALU

AT-SA

Ø ALU, SPM access; Shifter, ALU

AT-AS

Ø ALU, SPM access; ALU, Shifter

[Figure: block diagrams of (a) patch {AT-MA}, (b) patch {AT-SA}, and (c) patch {AT-AS}. Each patch shows an ALU, an LMAU connected to local memory, control signals, and two outputs (Output 1, Output 2); {AT-MA} adds a multiplier (M) and an ALU, {AT-SA} a shifter and an ALU, {AT-AS} an ALU and a shifter]

SLIDE 20

Patch Architecture

AT-MA

Ø ALU, SPM access; Multiplier, ALU

AT-SA

Ø ALU, SPM access; Shifter, ALU

AT-AS

Ø ALU, SPM access; ALU, Shifter

Ø T indicates the memory access operation
Ø A scratchpad memory (SPM) is attached beside the CPU
Ø Both CPU and accelerator can access the SPM

[Figure: 16 routers (R) with a memory controller; 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Tile labels: in-order CPU, patch (AT-AS), router, NIC, L1-I, L1-D, SPM]

SLIDE 21

Patch Architecture

AT-MA

Ø ALU, SPM access; Multiplier, ALU

AT-SA

Ø ALU, SPM access; Shifter, ALU

AT-AS

Ø ALU, SPM access; ALU, Shifter

Stitch compiler tool chain

Address space allocation

[Figure: 16 routers (R) with a memory controller; 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Tile labels: in-order CPU, patch (AT-AS), router, NIC, L1-I, L1-D, SPM]

SLIDE 22

Mapping Computation to Patches

Single patch accelerates DFG

[Figure: (a) data flow graph with operations + + > > ld st; inputs r1, r2, r3, r4; outputs r2, r5]

SLIDE 23

Mapping Computation to Patches

Single patch accelerates DFG

Ø By {AT-MA}: 4 cycles

[Figure: (a) data flow graph with operations + + > > ld st; inputs r1, r2, r3, r4; outputs r2, r5. (b) The same graph accelerated by patch {AT-MA}, annotated with "CPU EXE stage" and "4 cycles"]

SLIDE 24

Mapping Computation to Patches

Single patch accelerates DFG

Ø By {AT-MA}: 4 cycles
Ø By {AT-AS}: 2 cycles

[Figure: (a) data flow graph with operations + + > > ld st; inputs r1, r2, r3, r4; outputs r2, r5. (b) Accelerated by patch {AT-MA}, annotated with "CPU EXE stage": 4 cycles. (c) Accelerated by patch {AT-AS}: 2 cycles]

SLIDE 25

Mapping Computation to Patches

Single patch accelerates DFG

Ø By {AT-MA}: 4 cycles
Ø By {AT-AS}: 2 cycles

Fused patch accelerates DFG

Ø {AT-AS} ∪ {AT-AS}: 1 cycle

[Figure: (a) data flow graph with operations + + > > ld st; inputs r1, r2, r3, r4; outputs r2, r5. (e) Accelerated by a fused patch {AT-AS, AT-AS}: 1 cycle]

SLIDE 26

Inter-patch NoC

Single patch accelerates DFG

Ø By {AT-MA}: 4 cycles
Ø By {AT-AS}: 2 cycles

Fused patch accelerates DFG

Ø {AT-AS} ∪ {AT-AS}: 1 cycle

Fusion is achieved by a lightweight NoC

Ø Crossbar-based
Ø Bufferless
Ø Compiler-scheduled

[Figure: (a) data flow graph with operations + + > > ld st; inputs r1, r2, r3, r4; outputs r2, r5. 16 routers (R) with a memory controller; 8 x Acc 1, 4 x Acc 2, 4 x Acc 3. Tile labels: in-order CPU, patch (AT-AS), Xbar switch, router, NIC, L1-I, L1-D, SPM]

SLIDE 27

Inter-patch NoC

Multiple-hop stitching per cycle
Bufferless
Configure before running the application

[Figure: crossbar switch with North, East, West, South, and REG ports, a crossbar config register (C), an SPM port, and asynchronous repeaters between patches. Labels: 4 word-size register values; 38-bit size patch configuration. Example: patch2, patch6, and patch10 stitched together]

SLIDE 28

Inter-patch NoC

Multiple-hop stitching per cycle
Bufferless
Configure before running the application

[Figure: crossbar switch with North, East, West, South, and REG ports, a crossbar config register (C), an SPM port, and asynchronous repeaters between patches. Labels: 4 word-size register values; 38-bit size patch configuration. Example: patch2, patch6, and patch10 stitched together, with the crossbar configured as North_in = South_out]

SLIDE 29

Finger gesture application (6-stage pipeline, 16 kernels)
Input: Acc/Gyro (X, Y, Z)
Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify

Stitch compiler tool chain -> Stitch architecture
cores: core1, core2, ..., core16
patches: patch 1, patch 2, patch 5, ...

Kernel           Core     Patches
Update feature   core2    Patch2,10
IFFT1            core1    Patch1,5
IFFT2            core3    Patch3,11
IFFT3            core9    Patch9,13
IFFT4            core11   Patch11,15
IFFT5            core6    Patch6,8
IFFT6            core14   Patch14,16

1. Bottlenecks identified
2. Kernel mapping
3. Patch fusion
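
The three steps can be read as a greedy loop: find the slowest kernel in the pipeline, then give it an extra patch while one is free. The sketch below only illustrates that idea and skips step 2 (kernel-to-core mapping). The kernel times, the per-patch speedup factor, and the function names are hypothetical; the slide does not show the actual Stitch stitching algorithm.

```python
def pipeline_time(times):
    """Time per item of a pipeline is set by its slowest stage."""
    return max(times.values())


def greedy_fusion(times, free_patches, fused_speedup=1.5):
    """Fuse one free patch at a time onto the current bottleneck kernel.

    times: kernel name -> time per item (hypothetical units)
    free_patches: number of patches available for fusion (assumption)
    fused_speedup: assumed speedup from one extra fused patch
    """
    times = dict(times)  # work on a copy; the caller's dict is untouched
    fused = []
    while free_patches > 0:
        bottleneck = max(times, key=times.get)   # step 1: bottleneck identified
        if bottleneck in fused:                  # assume at most one extra patch per kernel
            break
        times[bottleneck] /= fused_speedup       # step 3: patch fusion
        fused.append(bottleneck)
        free_patches -= 1
    return times, fused


# Hypothetical per-kernel times; not measurements from the slides.
kernel_times = {"FFT1": 6.0, "Filter": 5.0, "IFFT1": 12.0, "Update feature": 10.0}
new_times, fused = greedy_fusion(kernel_times, free_patches=2)
print(fused, pipeline_time(kernel_times), pipeline_time(new_times))
```

With these made-up numbers the two slowest kernels each receive one extra patch, and the pipeline's time per item drops from 12.0 to 8.0.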

Compiler Support Illustrated


SLIDE 30

Compiler Support Illustrated

State-of-the-art smartwatch: 13 ms/gesture > 10 ms/gesture

                         4-core ARM Cortex-A7
Meeting throughput       no
Time per gesture (ms)    13
Power (mW)               469
Frequency (MHz)          1200
Technology               28nm

SLIDE 31

Compiler Support Illustrated

State-of-the-art smartwatch: 13 ms/gesture > 10 ms/gesture
Stitch without fusion: 11.49 ms/gesture

                         4-core ARM Cortex-A7   Stitch w/o fusion
Meeting throughput       no                     no
Time per gesture (ms)    13                     11.49
Power (mW)               469                    108
Frequency (MHz)          1200                   200
Technology               28nm                   40nm

[Figure: the 16 kernels of the finger gesture application placed across the 16 tiles (routers R, patches built from A, T, M, and S units); tile2, tile6, and tile10 are labeled]

SLIDE 32

Compiler Support Illustrated

                         4-core ARM Cortex-A7   Stitch w/o fusion   Stitch
Meeting throughput       no                     no                  yes
Time per gesture (ms)    13                     11.49               7.62
Power (mW)               469                    108                 139.5
Frequency (MHz)          1200                   200                 200
Technology               28nm                   40nm                40nm

[Figure: the 16 kernels of the finger gesture application placed across the 16 tiles (routers R, patches built from A, T, M, and S units); tile2, tile6, and tile10 are labeled]

State-of-the-art smartwatch: 13 ms/gesture > 10 ms/gesture
Stitch without fusion: 11.49 ms/gesture
Stitch: 7.62 ms/gesture
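
The three configurations can be checked against the 10 ms-per-gesture target directly. The sketch below uses only the times reported on these slides (13, 11.49, and 7.62 ms per gesture); the helper names are illustrative, and treating "at most 10 ms" as meeting the target is an assumption.

```python
TARGET_MS = 10.0  # target time per gesture from the case study

# Time per gesture (ms) as reported on the slides.
time_per_gesture_ms = {
    "4-core ARM Cortex-A7": 13.0,
    "Stitch w/o fusion": 11.49,
    "Stitch": 7.62,
}


def meets_target(ms, target_ms=TARGET_MS):
    """A configuration meets the target if one gesture takes at most target_ms."""
    return ms <= target_ms


def gestures_per_second(ms):
    return 1000.0 / ms


baseline_ms = time_per_gesture_ms["4-core ARM Cortex-A7"]
for name, ms in time_per_gesture_ms.items():
    speedup = baseline_ms / ms
    print(f"{name}: {gestures_per_second(ms):.1f} gestures/s, "
          f"{speedup:.2f}x vs. Cortex-A7, meets target: {meets_target(ms)}")
```

Only the fused configuration falls under the target; its speedup over the Cortex-A7 baseline works out to 13 / 7.62, about 1.71x.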

SLIDE 33

Evaluation – Wearable Applications

APP2 -- Image recognition
Kernels: CONV1,1 to CONV1,4; POOL1; CONV2,1 to CONV2,6; POOL2; CONV3,1 to CONV3,3; POOL&FC
Input: image; example output: car

APP3 -- SVM-based machine learning
Kernels: AES 1 to AES 6; scatter; SVM1 to SVM8; gather

APP4 -- Transportation context-detection
Kernels: AES 1 to AES 6; DAES 1 to DAES 6; DTW1 to DTW4

SLIDE 34

Wearable Applications on Stitch

[Figure: kernel placement on the 16 tiles (one router R per tile) for each application; kernels are listed below in the order they appear on the slide]

APP1: IFFT1; Update feature; IFFT2; IFFT3; IFFT4; IFFT5; IFFT6; Classify; Filter; FFT3; Window moving; FFT2; FFT1; FFT5; FFT4; FFT6
APP2: Conv3,1; Conv3,2; Conv3,3; Conv2,1; Conv2,2; Conv2,3; Conv2,4; Conv2,5; Conv2,6; Conv1,1; Conv1,2; Conv1,3; Conv1,4; pool1; pool2; pool&fc
APP3: AES2; AES3; AES3; AES4; Scatter; SVM; SVM; SVM; SVM; SVM; SVM; Gather; SVM; SVM; AES1; AES6
APP4: DAES3; DAES6; AES2; AES6; DTW; DTW; DTW; DTW; DAES2; DAES4; DAES5; DAES1; AES1; AES3; AES5; AES4

SLIDE 35

Performance

gem5 simulation for performance evaluation
Message passing-based many-core architecture
Comparison with a baseline architecture without acceleration

Ø 2.3X improvement in terms of throughput

[Figure: normalized throughput (axis 0.00 to 3.00) for APP1 to APP4 and the average, comparing LOCUS, Stitch w/o fusion, and Stitch; labeled values: 1.14, 1.53, 2.30]

SLIDE 36

RTL Synthesis of Stitch – Power

16-core chip at 40nm technology (Synopsys tools)
140 mW at 200 MHz

Normalized power- and area-efficiency:

                              APP1   APP2   APP3   APP4   average
Normalized power-efficiency   1.55   2.06   1.57   2.15   1.77
Normalized area-efficiency    1.90   2.66   1.81   2.77   2.28

SLIDE 37

Area-Efficiency

16-core chip at 40nm technology (Synopsys tools)
140 mW at 200 MHz
Accelerator (patch) area overhead: 0.169 mm²

Ø 0.5% of the entire chip

SLIDE 38

Speedup w.r.t. Commercial SoC

Comparison with the state-of-the-art wearable SoC Snapdragon W2100 (quad-core ARM Cortex-A7)

Ø Average 1.65X improvement in throughput
Ø At only 27% of the power consumption on average
Ø Average 6.04X improvement in performance/watt

                              APP1   APP2   APP3   APP4   average
Normalized throughput         1.71   1.69   1.23   1.97   1.65
Normalized power              0.28   0.27   0.28   0.27   0.27
Normalized performance/watt   6.20   6.19   4.39   7.39   6.04
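
Performance per watt is throughput divided by power, so the third series on this slide can be cross-checked from the first two. The sketch below uses the rounded values shown on the slide; because those inputs are rounded to two decimals, the recomputed ratios differ slightly from the reported ones.

```python
apps = ["APP1", "APP2", "APP3", "APP4", "average"]

# Values as shown on the slide, normalized to the quad-core Cortex-A7 SoC.
throughput = [1.71, 1.69, 1.23, 1.97, 1.65]
power = [0.28, 0.27, 0.28, 0.27, 0.27]
reported = [6.20, 6.19, 4.39, 7.39, 6.04]


def perf_per_watt(norm_throughput, norm_power):
    """Normalized performance/watt from normalized throughput and power."""
    return norm_throughput / norm_power


recomputed = [perf_per_watt(t, p) for t, p in zip(throughput, power)]
for app, value, ref in zip(apps, recomputed, reported):
    print(f"{app}: recomputed {value:.2f}, reported {ref:.2f}")
```

The recomputed figures land within about 0.1 of the reported ones, which is consistent with rounding in the plotted inputs.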

SLIDE 39

Stitch Conclusion

We propose Stitch, a many-core architecture where tiny heterogeneous, configurable and fusible accelerators (polymorphic patches) are effectively enmeshed with the cores

Ø Each patch can handle very simple custom instructions
Ø Multiple polymorphic patches can be fused together across the chip to create large, virtual accelerators for complex custom instructions
Ø Fusion is achieved by an ultra-lightweight compiler-scheduled network-on-chip without any buffers or control logic

Improvement:

Ø 6.04x vs. quad-core ARM Cortex-A7 in terms of performance/watt
Ø 1.77x vs. 16-core baseline architecture in terms of performance/watt
Ø 2.28x vs. 16-core baseline architecture in terms of area-efficiency

SLIDE 40

Thanks

SLIDE 41

Backup Slides

SLIDE 42

Accelerators with different couplings

Loosely coupled accelerators

Ø Require local register files, control and data memories, and high data transfer bandwidth
Ø High design complexity and area overhead

Tightly coupled accelerators

Ø Sharing processor resources (instruction fetch, decode, register file, and even on-chip memory)
Ø Consideration of stringent area and power budget

[Figure (a): loosely coupled accelerator. Labels: CPU pipeline (Fetch, Dec, Exe, Mem, Write), I$, D$, Reg File, accelerator interface (Status_Reg, Mode_Reg, …), DMA Ctrl, Ctrl Mem, accelerator logic, private data mem]

[Figure (b): tightly coupled accelerator. Labels: CPU pipeline (Fetch, Dec, Exe, Mem, Write), I$, D$, Reg File, Acc logic, interconnect]

SLIDE 43

Related Works

Different Architectures Incorporated with Reconfigurable Fabrics

SLIDE 44

Lightweight Message Passing

Replace complex power-hungry shared memory cache coherence with explicit message passing.

Conventional (shared memory):

Source Core 1                  Destination Core 11
mutex.lock();                  mutex.lock();
value = (value + 1) * 3;       value = (value - 2) / 4;
mutex.unlock();                mutex.unlock();
str r0, [value]                ldr r1, [value]

Costs: cache coherence, lock contention

[Figure: two tiles, each with an in-order core (registers R0 and R1), I-cache, D-cache, NIC, and router, communicating through shared memory]

SLIDE 45

Lightweight Message Passing

Replace complex power-hungry shared memory cache coherence with explicit message passing.

Our method: register-to-register data transfer (shared memory bypassed)

Source Core 1                  Destination Core 11
value = (value + 1) * 3;       Recv(value, core 1);
Send(value, core 11);          value = (value - 2) / 4;

[Figure: two tiles, each with an in-order core (registers R0 and R1), SFU, LMP, I-cache, D-cache, NIC, and SMART router; the value travels from R0 on core 1 to R1 on core 11 without going through shared memory. The mutex.lock() and mutex.unlock() calls of the conventional version also appear on the slide]
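
The register-to-register transfer can be mimicked in software as a blocking send/receive pair with no shared variable and no mutex. The sketch below is a minimal illustration, assuming a one-slot queue stands in for the path between core 1 and core 11; `send`, `recv`, and the thread-per-core setup are illustrative names and are not the Stitch hardware interface.

```python
import queue
import threading

link_1_to_11 = queue.Queue(maxsize=1)  # stand-in for the core 1 -> core 11 path
result = queue.Queue()                 # lets the caller read core 11's final value


def send(value, link):
    link.put(value)       # blocks while the link is full


def recv(link):
    return link.get()     # blocks until a value arrives


def core_1(value):
    value = (value + 1) * 3
    send(value, link_1_to_11)


def core_11():
    value = recv(link_1_to_11)
    value = (value - 2) / 4
    result.put(value)


def run(initial):
    """Run both 'cores' as threads and return core 11's result."""
    src = threading.Thread(target=core_1, args=(initial,))
    dst = threading.Thread(target=core_11)
    dst.start()
    src.start()
    src.join()
    dst.join()
    return result.get()


print(run(1))  # (1 + 1) * 3 = 6, then (6 - 2) / 4 = 1.0
```

The receiver simply waits for the value, so the ordering that the mutex enforced in the conventional version comes from the message itself.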

SLIDE 46

Lightweight Message Passing

Larger block size cache-to-cache message passing is supported.

Our method: cache-to-cache data transfer
-> Less/faster on-chip communication.

Source Core 1                          Destination Core 11
for (i = 1 to 100)                     Recv(value, 100, core 1);
    value[i] = (value[i] + 1) * 3;     for (i = 1 to 100)
Send(&value[0], 100, core 11);             value[i] = (value[i] - 2) / 4;

[Figure: two tiles, each with an in-order core, SFU, LMP, I-cache, D-cache, NIC, and SMART router; data moves from the D-cache of core 1 to the D-cache of core 11 without going through shared memory. The mutex.lock() and mutex.unlock() calls of the conventional version also appear on the slide]

SLIDE 47

Compiler Support Illustrated

Compiler tool chain

1) Multi-kernel application -> assembly by the GNU GCC front-end.
2) Profiling each kernel -> bottlenecked kernels and ‘hot’ basic blocks.
3) ‘Hot’ computational patterns (4/2 in/out) -> DFGs.
4) All custom instruction candidates -> mapped onto each patch -> potential speedup accelerated by any patch.
5) Modified GNU Assembler -> new assembly/executable with the patch control signals.
6) Stitching algorithm targeting maximizing overall throughput -> appropriate kernel mapping, version selection, patch stitching, and inter-patch NoC configuration.

[Figure: tool-chain flow diagram. Blocks: GNU GCC front-end, ‘hot’ basic block identifier, DFG generator, ISE identifier (with constraints), graph-based mapper (with different patch types), ISE selector, modified GNU Assembler, stitching algorithm (with the Stitch architecture as input). Artifacts: multi-kernel application, *.asm, hot basic blocks, DFGs, custom instruction candidates, potential speedup for each (kernel, patches), multiple versions with different speedup for each kernel, ISE control signals, address mapping, new *.asm, new *.exe. Stitching algorithm outputs: 1. kernel mapping, 2. NoC configuration]