Stitch: Fusible Heterogeneous Accelerators Enmeshed with Many-Core - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Stitch: Fusible Heterogeneous Accelerators Enmeshed with Many-Core Architecture for Wearables

Cheng Tan, Manupa Karunaratne, Tulika Mitra, Li-Shiuan Peh

slide-4
SLIDE 4

Emerging Wearables

  • Software programmable to support diverse applications

Pokemon Go on Apple Watch
Here Maps on Samsung Gear S2
Health care apps on smart watches
Bus stop detection app (user defined) on LG Watch Urbane
Navigation on smart glass

Performance requirement: 10000 MIPS
Power constraint: 500 mW

slide-5
SLIDE 5

Wearable SoC Architecture Trend

[Figure: core count, DMIPS/watt, power (mW), and DMIPS of wearable SoCs, each with a trend line, on a log axis from 1 to 100000, in chronological order from Jan-2013 to Sep-2017]

Devices plotted:
Sony Smartwatch 1 (ARM Cortex-M3)
Sony Smartwatch 2 (ARM Cortex-M4)
Qualcomm Toq (ARM Cortex-M3)
Motorola Moto 360 1st gen (ARM Cortex-A8)
LG G Watch R (ARM Cortex-A7)
Motorola Moto 360 2nd gen (ARM Cortex-A7)
Samsung Gear S2 (ARM Cortex-A7)
Samsung Gear S3 (ARM Cortex-A7)
Samsung Gear S (ARM Cortex-A7)
Asus Zenwatch 3 (ARM Cortex-A7)
Huawei Watch 2 (ARM Cortex-A7)

slide-6
SLIDE 6

Wearable SoC Architecture Trend

[Figure: the same trend chart and device list as on the previous slide, annotated with the targets of 500 mW power and 10000 MIPS performance]

slide-7
SLIDE 7
Motivating Case Study

  • Finger gesture recognition application

slide-8
SLIDE 8
Motivating Case Study

  • Finger gesture recognition application
  • State-of-the-art smartwatch

Ø Odroid board emulating the state-of-the-art smartwatch
Ø Time per gesture: 13 ms > 10 ms
Ø Cannot meet the target throughput

                        4-core ARM Cortex-A7
Meeting throughput      no
Time per gesture (ms)   13
Power (mW)              469
Frequency (MHz)         1200
Technology              28nm

slide-9
SLIDE 9

Wearable Application Characteristics

Abundant parallelism -> many-core architecture

[Figure: finger gesture application (6-stage pipeline, 16 kernels). Input: Acc/Gyro (X, Y, Z). Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify. Shown with a 16-tile mesh of routers (R) and a memory controller.]

Accelerators (e.g., ASIC, FPGA, CGRA, and Reconfigurable Functional Unit): FFT acc, Filter acc, IFFT acc, Updt acc, Classify acc

slide-10
SLIDE 10

Wearable Application Characteristics

Power budget -> simple in-order core

Each tile: 8.75 mW (in-order CPU)

[Figure: finger gesture application (6-stage pipeline, 16 kernels). Input: Acc/Gyro (X, Y, Z). Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify. Shown with a 16-tile mesh of routers (R) and a memory controller.]

slide-11
SLIDE 11

Wearable Application Characteristics

Improve performance/power -> accelerators

Accelerators (e.g., ASIC, FPGA, CGRA, and Reconfigurable Functional Unit)

Each tile: 8.75 mW

[Figure: finger gesture application (6-stage pipeline, 16 kernels). Input: Acc/Gyro (X, Y, Z). Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify. Shown with a 16-tile mesh of routers (R) and a memory controller.]

slide-12
SLIDE 12

Wearable Application Characteristics

Different kernels -> heterogeneous accelerators

[Figure: finger gesture application (6-stage pipeline, 16 kernels). Input: Acc/Gyro (X, Y, Z). Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify. Shown with a 16-tile mesh of routers (R) and a memory controller.]

Per-kernel accelerators shown: FFT acc, Filter acc, IFFT acc, Updt acc, Classify acc

slide-13
SLIDE 13

Wearable Application Characteristics

Different kernels -> heterogeneous accelerators

[Figure: finger gesture application (6-stage pipeline, 16 kernels). Input: Acc/Gyro (X, Y, Z). Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify.]

[Figure: 16-tile mesh of routers (R) with a memory controller, holding 8 x Acc 1, 4 x Acc 2, and 4 x Acc 3. Tile detail: in-order CPU, heterogeneous accelerator (patch8, with units labeled A, T, M, A), and Xbar switch.]

slide-14
SLIDE 14

Wearable Application Characteristics

Imbalanced workload -> fusible accelerators

Compiler decides the fusion of accelerators offline (Stitch compiler tool chain)
Actual fusion happens at runtime

[Figure: finger gesture application (6-stage pipeline, 16 kernels). Input: Acc/Gyro (X, Y, Z). Kernels: Window moving, FFT1-FFT6, Filter, IFFT1-IFFT6, Update feature, Classify.]

[Figure: 16-tile mesh of routers (R) with a memory controller, holding 8 x Acc 1, 4 x Acc 2, and 4 x Acc 3. Tile detail: in-order CPU (CPU), heterogeneous accelerator (Acc; patch8, with units labeled A, T, M), and Xbar switch.]

slide-15
SLIDE 15

Stitch Architecture - Overview

  • Many-core architecture with simple in-order CPU and accelerator
  • Heterogeneous customizable accelerators – polymorphic patches
  • Patches are able to fuse together to alleviate the bottleneck kernels
  • The fusion of patches is directed offline by our compiler tool chain

[Figure: 16-tile mesh of routers (R) with a memory controller, holding 8 x Acc 1, 4 x Acc 2, and 4 x Acc 3. Tile detail: in-order CPU, patch (AT-AS shown), L1-I, L1-D, NIC, and router. The Stitch compiler tool chain is shown alongside the architecture.]

slide-16
SLIDE 16
Patch Architecture

  • Heterogeneous customizable accelerators – polymorphic patches
  • Patch architecture motivated by representative wearable kernels

[Figure: data flow graphs of representative wearable kernels (DTW, AES, FFT), built from operations such as +, x, ^, |, &, >, ld, and st]

slide-17
SLIDE 17
Patch Architecture

  • Heterogeneous customizable accelerators – polymorphic patches
  • Patch architecture motivated by representative wearable kernels

‘Hot’ patterns:
{AT}: arithmetic + memory access   95.7%
{MA}: multiply + arithmetic        47.8%
{AA}: arithmetic + arithmetic      34.8%
{AS}: arithmetic + shift           21.7%
{SA}: shift + arithmetic           21.7%

[Figure: operation sequence + ld + ^ + > > + x +]

Simple computation fragment

Multiple rounds of Longest Common Substring (LCS) identification
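One round of Longest Common Substring identification can be sketched as follows. This is only an illustration, not the authors' tool: it assumes each computation fragment is encoded as a string with one letter per operation class (A = arithmetic, T = memory access, M = multiply, S = shift), and both the encoding and the example fragments are hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of operations shared by a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match inside a
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical fragments, one letter per operation class.
fragment_1 = "ATASA"
fragment_2 = "MATAS"
print(longest_common_substring(fragment_1, fragment_2))
```

Repeating such a round over many fragment pairs and counting how often each shared run appears would give pattern frequencies of the kind listed on the slide.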

slide-18
SLIDE 18

Patch Architecture

  • Heterogeneous customizable accelerators – polymorphic patches
  • Patch architecture motivated by representative wearable kernels

Ø 8 x Acc1 -> {AT-MA}
Ø 4 x Acc2 -> {AT-AS}
Ø 4 x Acc3 -> {AT-SA}

[Figure: 16-tile mesh of routers (R) with a memory controller, holding 8 x Acc 1, 4 x Acc 2, and 4 x Acc 3. Tile detail: in-order CPU, patch (AT-AS shown), L1-I, L1-D, NIC, and router.]

slide-19
SLIDE 19

Patch Architecture

  • AT-MA

Ø ALU, SPM access; Multiplier, ALU

  • AT-SA

Ø ALU, SPM access; Shifter, ALU

  • AT-AS

Ø ALU, SPM access; ALU, Shifter

[Figure: (a) patch {AT-MA}, (b) patch {AT-SA}, (c) patch {AT-AS}. Each patch shows an ALU and an LMAU attached to local memory, followed by a multiplier and ALU (AT-MA), a shifter and ALU (AT-SA), or an ALU and shifter (AT-AS), with control signals and two outputs (Output 1, Output 2).]

slide-20
SLIDE 20
Patch Architecture

  • AT-MA

Ø ALU, SPM access; Multiplier, ALU

  • AT-SA

Ø ALU, SPM access; Shifter, ALU

  • AT-AS

Ø ALU, SPM access; ALU, Shifter

Ø T indicates the memory access operation
Ø A scratchpad memory (SPM) is attached beside the CPU
Ø Both CPU and accelerator can access the SPM

[Figure: 16-tile mesh of routers (R) with a memory controller, holding 8 x Acc 1 (AT-MA), 4 x Acc 2, and 4 x Acc 3. Tile detail: in-order CPU, patch (AT-AS shown), L1-I, L1-D, SPM, NIC, and router.]

slide-21
SLIDE 21
Patch Architecture

  • AT-MA

Ø ALU, SPM access; Multiplier, ALU

  • AT-SA

Ø ALU, SPM access; Shifter, ALU

  • AT-AS

Ø ALU, SPM access; ALU, Shifter

Stitch compiler tool chain: address space allocation

[Figure: 16-tile mesh of routers (R) with a memory controller, holding 8 x Acc 1, 4 x Acc 2, and 4 x Acc 3. Tile detail: in-order CPU, patch (AT-AS shown), L1-I, L1-D, SPM, NIC, and router.]

slide-24
SLIDE 24
Mapping Computation to Patches

  • Single patch accelerates DFG

Ø By {AT-MA}: 4 cycles
Ø By {AT-AS}: 2 cycles

[Figure: (a) Data flow graph with operations +, +, >, >, ld, st; inputs r1, r2, r3, r4; outputs r2, r5. (b) Accelerated by patch {AT-MA}: the patch handles + with ld and + with st, and each > runs in the CPU EXE stage, for 4 cycles in total. (c) Accelerated by patch {AT-AS}: the patch handles +, ld, > and then +, st, >, for 2 cycles in total.]

slide-25
SLIDE 25
Mapping Computation to Patches

  • Single patch accelerates DFG

Ø By {AT-MA}: 4 cycles
Ø By {AT-AS}: 2 cycles

  • Fused patch accelerates DFG

Ø {AT-AS} ∪ {AT-AS}: 1 cycle

[Figure: (a) Data flow graph with operations +, +, >, >, ld, st; inputs r1, r2, r3, r4; outputs r2, r5. (e) Accelerated by a fused patch {AT-AS, AT-AS}: all six operations complete in 1 cycle.]
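The cycle counts above can be reproduced with a toy packing model. This is a rough sketch, not the Stitch compiler's graph-based mapper. It assumes the DFG can be linearized as the chain `+, ld, >, +, st, >`, that `>` denotes a shift, that one pass through a patch takes one cycle, and that an operation with no matching unit in the patch costs one cycle in the CPU EXE stage. The function and table names are hypothetical.

```python
from typing import List

# Functional-unit order inside each patch type, as listed on the slides:
# A = ALU, T = SPM (memory) access, M = multiplier, S = shifter.
PATCHES = {
    "AT-MA": ["A", "T", "M", "A"],
    "AT-AS": ["A", "T", "A", "S"],
    "AT-SA": ["A", "T", "S", "A"],
}

# Unit class needed by each operation (assumption: ">" is a shift).
UNIT_FOR_OP = {"+": "A", "ld": "T", "st": "T", ">": "S", "x": "M"}


def count_cycles(ops: List[str], units: List[str]) -> int:
    """Greedily pack a linear chain of ops into passes over `units`."""
    cycles = 0
    i = 0
    while i < len(ops):
        if UNIT_FOR_OP[ops[i]] not in units:
            cycles += 1  # unsupported op: one cycle in the CPU EXE stage
            i += 1
            continue
        slot = 0
        # One pass: walk the units left to right, consuming ops in order.
        while i < len(ops) and slot < len(units):
            need = UNIT_FOR_OP[ops[i]]
            if need not in units[slot:]:
                break  # no later unit in this pass can run the op
            slot = units.index(need, slot) + 1
            i += 1
        cycles += 1
    return cycles


dfg_chain = ["+", "ld", ">", "+", "st", ">"]  # assumed linearization
fused = PATCHES["AT-AS"] + PATCHES["AT-AS"]   # two stitched AT-AS patches
for name, units in [("AT-MA", PATCHES["AT-MA"]),
                    ("AT-AS", PATCHES["AT-AS"]),
                    ("AT-AS + AT-AS", fused)]:
    print(name, count_cycles(dfg_chain, units))
```

Under these assumptions the model gives 4, 2, and 1 cycles, matching the slide; a real mapper works on the graph itself rather than on a fixed linear order.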

slide-26
SLIDE 26

Inter-patch NoC

  • Single patch accelerates DFG

Ø By {AT-MA}: 4 cycles
Ø By {AT-AS}: 2 cycles

  • Fused patch accelerates DFG

Ø {AT-AS} ∪ {AT-AS}: 1 cycle

  • Fusion is achieved by a lightweight NoC

Ø Crossbar-based; Bufferless; Compiler-scheduled

[Figure: (a) Data flow graph with operations +, +, >, >, ld, st; inputs r1, r2, r3, r4; outputs r2, r5.]

[Figure: 16-tile mesh of routers (R) with a memory controller, holding 8 x Acc 1, 4 x Acc 2, and 4 x Acc 3. Tile detail: in-order CPU, patch (AT-AS shown), Xbar switch, L1-I, L1-D, SPM, NIC, and router.]

slide-28
SLIDE 28
Inter-patch NoC

  • Multiple-hop stitching per cycle
  • Bufferless
  • Configure before running the application

[Figure: crossbar switch with North, East, West, South, and REG ports, an SPM port, a crossbar configuration register (C), and asynchronous repeaters. Labels: 4 word-size register values; 38-bit patch configuration. Example: patch2, patch6, and patch10 stitched through their North and South ports.]

North_in = South_out
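Because the crossbars are configured before the application runs, a stitch between two patches amounts to a fixed list of (tile, output port) settings. The sketch below shows one way to derive such a list. It assumes tiles are numbered 1 to 16 in row-major order on a 4 x 4 mesh with row 0 at the top, and that columns are corrected before rows; the slides confirm neither the numbering nor the routing order, and the function names are hypothetical.

```python
from typing import List, Tuple

MESH_WIDTH = 4  # 16 tiles arranged as a 4 x 4 mesh (assumed layout)


def tile_coords(tile: int) -> Tuple[int, int]:
    """Map a 1-based tile number to (row, col), assuming row-major order."""
    return divmod(tile - 1, MESH_WIDTH)


def stitch_path(src: int, dst: int) -> List[Tuple[int, str]]:
    """Return (tile, output_port) for every crossbar from src towards dst."""
    row, col = tile_coords(src)
    dst_row, dst_col = tile_coords(dst)
    hops = []
    while (row, col) != (dst_row, dst_col):
        tile = row * MESH_WIDTH + col + 1
        if col != dst_col:  # fix the column first
            step = 1 if dst_col > col else -1
            hops.append((tile, "EAST" if step == 1 else "WEST"))
            col += step
        else:  # then fix the row
            step = 1 if dst_row > row else -1
            hops.append((tile, "SOUTH" if step == 1 else "NORTH"))
            row += step
    return hops


# patch2 -> patch10 passes through patch6, as in the slide's example.
print(stitch_path(2, 10))
```

With this numbering, the path from patch2 to patch10 leaves patch2 and patch6 through their South ports, which is consistent with the North_in = South_out relation on the slide.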

slide-29
SLIDE 29

Compiler Support Illustrated

1. Bottlenecks identified
2. Kernel mapping
3. Patch fusion

[Figure: the Stitch compiler tool chain takes the finger gesture application (6-stage pipeline, 16 kernels; input Acc/Gyro (X, Y, Z); kernels Window moving, FFT1-FFT6, Update feature, Filter, IFFT1-IFFT6, Classify) and the Stitch architecture (cores core1 to core16, patches patch 1, patch 2, patch 5, ...).]

Kernel           Core     Patches
Update feature   core2    Patch2,10
IFFT1            core1    Patch1,5
IFFT2            core3    Patch3,11
IFFT3            core9    Patch9,13
IFFT4            core11   Patch11,15
IFFT5            core6    Patch6,8
IFFT6            core14   Patch14,16

slide-32
SLIDE 32

Compiler Support Illustrated

                        4-core ARM Cortex-A7   Stitch w/o fusion   Stitch
Meeting throughput      no                     no                  yes
Time per gesture (ms)   13                     11.49               7.62
Power (mW)              469                    108                 139.5
Frequency (MHz)         1200                   200                 200
Technology              28nm                   40nm                40nm

[Figure: the 16 kernels of the finger gesture application placed on the 16-tile mesh of routers (R), with tile2, tile6, and tile10 highlighted and the patch units of each tile labeled (M, A, T, S)]

  • State-of-the-art smartwatch: 13 ms/gesture > 10 ms/gesture
  • Stitch without fusion: 11.49 ms/gesture
  • Stitch: 7.62 ms/gesture
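The time and power figures can be combined into throughput and energy per gesture. The short calculation below is derived from the reported numbers and is not stated on the slides; it assumes the reported power is the average power while a gesture is processed. The three designs use different technology nodes and frequencies, so the comparison is only indicative.

```python
TARGET_MS = 10.0  # target time per gesture from the slides

# (time per gesture in ms, power in mW), copied from the reported results
designs = {
    "4-core ARM Cortex-A7": (13.0, 469.0),
    "Stitch w/o fusion": (11.49, 108.0),
    "Stitch": (7.62, 139.5),
}


def summarize(time_ms: float, power_mw: float):
    """Return (gestures per second, energy per gesture in mJ, meets target)."""
    gestures_per_s = 1000.0 / time_ms
    energy_mj = power_mw * time_ms / 1000.0  # mW * ms = uJ, then uJ -> mJ
    return gestures_per_s, energy_mj, time_ms <= TARGET_MS


for name, (time_ms, power_mw) in designs.items():
    rate, energy, ok = summarize(time_ms, power_mw)
    print(f"{name}: {rate:.1f} gestures/s, {energy:.3f} mJ/gesture, meets target: {ok}")
```

Only the fused Stitch configuration falls under the 10 ms target, which is the point the slide makes.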
slide-33
SLIDE 33

Evaluation – Wearable Applications

APP2 -- Image recognition
[Figure: kernels CONV1,1-CONV1,4, POOL1, CONV2,1-CONV2,6, POOL2, CONV3,1-CONV3,3, POOL&FC; input: image; example output: car]

APP3 -- SVM-based machine learning
[Figure: kernels AES 1-AES 6, scatter, SVM1-SVM8, gather]

APP4 -- Transportation context-detection
[Figure: kernels AES 1-AES 6, DAES 1-DAES 6, DTW1-DTW4]

slide-34
SLIDE 34

Wearable Applications on Stitch

[Figure: kernel placement of each application on the 16-tile mesh of routers (R). Kernel labels are listed in the order they appear in the figure.]

APP1: IFFT1, Update feature, IFFT2, IFFT3, IFFT4, IFFT5, IFFT6, Classify, Filter, FFT3, Window moving, FFT2, FFT1, FFT5, FFT4, FFT6
APP2: Conv3,1, Conv3,2, Conv3,3, Conv2,1, Conv2,2, Conv2,3, Conv2,4, Conv2,5, Conv2,6, Conv1,1, Conv1,2, Conv1,4, pool1, pool2, pool&fc, Conv1,3
APP3: AES2, AES3, AES3, AES4, Scatter, SVM, SVM, SVM, SVM, SVM, SVM, Gather, SVM, SVM, AES1, AES6
APP4: DAES3, DAES6, AES2, AES6, DTW, DTW, DTW, DTW, DAES2, DAES4, DAES5, DAES1, AES1, AES3, AES5, AES4

slide-35
SLIDE 35
Performance

  • Gem5 simulation for performance evaluation
  • Message passing-based many-core architecture
  • Comparing with baseline architecture without acceleration

Ø 2.3X improvement in terms of throughput

[Figure: normalized throughput (axis 0.00 to 3.00) of LOCUS, Stitch w/o fusion, and Stitch for APP1, APP2, APP3, APP4, and the average; labeled values: 1.14, 1.53, 2.30]

slide-36
SLIDE 36
RTL Synthesis of Stitch – Power

  • 16-core chip at 40nm technology (Synopsys tools)
  • 140 mW at 200 MHz

[Figure: normalized power- and area-efficiency (axis 0.5 to 3) for APP1, APP2, APP3, APP4, and the average. Normalized power-efficiency: 1.55, 2.06, 1.57, 2.15, 1.77. Normalized area-efficiency: 1.90, 2.66, 1.81, 2.77, 2.28.]

slide-37
SLIDE 37
Area-Efficiency

  • 16-core chip at 40nm technology (Synopsys tools)
  • 140 mW at 200 MHz
  • Accelerator (patch) area overhead: 0.169 mm²

Ø 0.5% of the entire chip

slide-38
SLIDE 38
Speedup w.r.t. Commercial SoC

  • Comparing with the state-of-the-art wearable SoC Snapdragon W2100 (quad-core ARM Cortex-A7)

Ø Average 1.65X improvement in terms of throughput
Ø With only average 27% power consumption
Ø Average 6.04X improvement in terms of performance/watt

                              APP1   APP2   APP3   APP4   average
Normalized throughput         1.71   1.69   1.23   1.97   1.65
Normalized power              0.28   0.27   0.28   0.27   0.27
Normalized performance/watt   6.20   6.19   4.39   7.39   6.04
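The averages quoted above are consistent with plain arithmetic means of the per-application values, as the following check illustrates. The per-application numbers are copied from the chart data; reading the averages as arithmetic means is an inference, not something the slide states.

```python
from statistics import mean

apps = ["APP1", "APP2", "APP3", "APP4"]
throughput = [1.71, 1.69, 1.23, 1.97]     # normalized throughput
power = [0.28, 0.27, 0.28, 0.27]          # normalized power
perf_per_watt = [6.20, 6.19, 4.39, 7.39]  # normalized performance/watt

avg_throughput = mean(throughput)
avg_perf_per_watt = mean(perf_per_watt)
print(f"average throughput:       {avg_throughput:.2f}")
print(f"average performance/watt: {avg_perf_per_watt:.2f}")

# Recomputing performance/watt from the rounded columns lands close to,
# but not exactly on, the reported values, because power is rounded.
recomputed = [t / p for t, p in zip(throughput, power)]
for app, value, reported in zip(apps, recomputed, perf_per_watt):
    print(f"{app}: recomputed {value:.2f}, reported {reported:.2f}")
```

Note that the average performance/watt is the mean of the per-application ratios, not the ratio of the averages (1.65 / 0.27 is about 6.11).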

slide-39
SLIDE 39

Stitch Conclusion

  • We propose Stitch, a many-core architecture where tiny heterogeneous, configurable and fusible accelerators (polymorphic patches) are effectively enmeshed with the cores

Ø Each patch can handle very simple custom instructions
Ø Multiple polymorphic patches can be fused together across the chip to create large, virtual accelerators for complex custom instructions
Ø Fusion is achieved by an ultra-lightweight compiler-scheduled network-on-chip without any buffers or control logic

  • Improvement:

Ø 6.04x vs. quad-core ARM Cortex-A7 in terms of performance/watt
Ø 1.77x vs. 16-core baseline architecture in terms of performance/watt
Ø 2.28x vs. 16-core baseline architecture in terms of area-efficiency

slide-40
SLIDE 40

Thanks

slide-41
SLIDE 41

Backup Slides

slide-42
SLIDE 42
Accelerators with different couplings

  • Loosely coupled accelerators

Ø Require local register files, control and data memories, and high data transfer bandwidth
Ø High design complexity and area overhead

  • Tightly coupled accelerators

Ø Sharing processor resources (instruction fetch, decode, register file, and even on-chip memory)
Ø Consideration of stringent area and power budget

[Figure: (a) Loosely coupled accelerator. Labels: CPU (Fetch, Dec, Exe, Mem, Write, I$, D$, Reg File), Accelerator Interface, DMA, Ctrl, Mode_Reg, Status_Reg, Ctrl Mem, Accelerator logic, Private Data Mem. (b) Tightly coupled accelerator. Labels: CPU (Fetch, Dec, Exe, Mem, Write, I$, D$, Reg File, Acc logic), Interconnect.]

slide-43
SLIDE 43

Related Works

Different Architectures Incorporated with Reconfigurable Fabrics

slide-44
SLIDE 44

Lightweight Message Passing

  • Replace complex power-hungry shared memory cache coherence with explicit message passing.

Conventional (shared memory): cache coherence, lock contention

Source Core 1                 Destination Core 11
mutex.lock();                 mutex.lock();
value = (value + 1) * 3;      value = (value - 2) / 4;
mutex.unlock();               mutex.unlock();
str r0, [value]               ldr r1, [value]

[Figure: two tiles, each with an in-order core, I-cache, D-cache, NIC, and router; the value moves from R0 on core 1 to R1 on core 11 through the D-caches and shared memory]

slide-45
SLIDE 45
Lightweight Message Passing

  • Replace complex power-hungry shared memory cache coherence with explicit message passing.

Our method: register-to-register data transfer, shared memory bypassed

Source Core 1                 Destination Core 11
value = (value + 1) * 3;      value = (value - 2) / 4;
Send(value, core 11);         Recv(value, core 1);

The mutex.lock(); and mutex.unlock(); calls of the conventional version also appear on this slide.

[Figure: two tiles, each with an in-order core, SFU, LMP unit, I-cache, D-cache, NIC, and SMART router; the value moves from R0 on core 1 to R1 on core 11 through the LMP units and the SMART routers, bypassing the D-caches and shared memory]
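The Send/Recv programming model can be mimicked in software. The sketch below is only an analogy: two Python threads stand in for core 1 and core 11, and a `queue.Queue` stands in for the on-chip link. `queue.Queue` uses locks internally, so this illustrates the explicit message-passing style of the code, not the hardware's register-to-register transfer. All function names are hypothetical.

```python
import queue
import threading


def source_core(value: float, link: "queue.Queue[float]") -> None:
    """Core 1: compute, then hand the value over explicitly."""
    value = (value + 1) * 3
    link.put(value)  # plays the role of Send(value, core 11)


def destination_core(link: "queue.Queue[float]", result: list) -> None:
    """Core 11: wait for the value, then continue the computation."""
    value = link.get()  # plays the role of Recv(value, core 1)
    result.append((value - 2) / 4)


def run(initial: float) -> float:
    link: "queue.Queue[float]" = queue.Queue(maxsize=1)
    result: list = []
    consumer = threading.Thread(target=destination_core, args=(link, result))
    producer = threading.Thread(target=source_core, args=(initial, link))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return result[0]


print(run(1))
```

No shared `value` variable or user-level mutex appears in either function; the ordering between the two computations comes from the message itself.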

slide-46
SLIDE 46
Lightweight Message Passing

  • Larger block size cache-to-cache message passing is supported.

Our method: cache-to-cache data transfer

Source Core 1:
for (i = 1 to 100) value[i] = (value[i] + 1) * 3;
Send(&value[0], 100, core 11);

Destination Core 11:
for (i = 1 to 100) value[i] = (value[i] - 2) / 4;
Recv(value, 100, core 1);

The mutex.lock(); and mutex.unlock(); calls of the conventional version also appear on this slide.

-> Less/faster on-chip communication.

[Figure: two tiles, each with an in-order core, SFU, LMP unit, I-cache, D-cache, NIC, and SMART router; the block moves from the D-cache of core 1 to the D-cache of core 11 through the SMART routers, without going through shared memory]

slide-47
SLIDE 47

Compiler tool chain

Compiler Support Illustrated

1) Multi-kernel application -> assembly by the GNU GCC front-end.
2) Profiling each kernel -> bottlenecked kernels and ‘hot’ basic blocks.
3) ‘Hot’ computational patterns (4/2 in/out) -> DFGs.
4) All custom instruction candidates -> mapped onto each patch -> potential speedup accelerated by any patch.
5) Modified GNU Assembler -> new assembly/executable with the patch control signals.
6) Stitching algorithm targeting maximizing overall throughput -> appropriate kernel mapping, version selection, patch stitching, and inter-patch NoC configuration.

[Figure: tool chain flow. Components: GNU GCC front-end, ‘hot’ basic block identifier, DFG generator, ISE identifier (with constraints), graph-based mapper (with the different patch types), ISE selector, modified GNU Assembler, stitching algorithm (with the Stitch architecture). Artifacts: multi-kernel application, *.asm, hot basic blocks, DFGs, custom instruction candidates, potential speedup for each (kernel, patches), ISE control signals, address mapping, new *.asm, new *.exe, multiple versions with different speedup for each kernel. Outputs of the stitching algorithm: 1. Kernel mapping, 2. NoC configuration.]
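Step 6 can be pictured with a simple greedy loop. This sketch is not the authors' stitching algorithm. It assumes a pipelined application whose time per item equals the time of its slowest kernel, and a table of hypothetical kernel versions in which `versions[k][n]` is the time of kernel `k` when it is granted `n` extra fused patches. All names and numbers are made up for illustration, and placement on the mesh is ignored.

```python
from typing import Dict, List


def greedy_stitch(versions: Dict[str, List[float]],
                  spare_patches: int) -> Dict[str, int]:
    """Grant spare patches, one at a time, to the current bottleneck kernel."""
    extra = {kernel: 0 for kernel in versions}
    while spare_patches > 0:
        # The pipeline is only as fast as its slowest kernel.
        bottleneck = max(versions, key=lambda k: versions[k][extra[k]])
        n = extra[bottleneck]
        times = versions[bottleneck]
        if n + 1 >= len(times) or times[n + 1] >= times[n]:
            break  # the bottleneck has no faster version left
        extra[bottleneck] = n + 1
        spare_patches -= 1
    return extra


def time_per_item(versions: Dict[str, List[float]],
                  extra: Dict[str, int]) -> float:
    """Time per pipeline item (ms) for the chosen kernel versions."""
    return max(versions[k][extra[k]] for k in versions)


# Hypothetical kernel versions (times in ms).
example = {
    "IFFT": [12.0, 7.0],
    "Filter": [9.0, 6.0, 5.5],
    "Classify": [4.0],
}
choice = greedy_stitch(example, spare_patches=2)
print(choice, time_per_item(example, choice))
```

A real stitching pass would also have to respect which patch types sit next to each other on the mesh and how the inter-patch NoC can be configured.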