A Preliminary Evaluation of Building Block Computing Systems

SLIDE 1

A Preliminary Evaluation of Building Block Computing Systems

Sayaka Terashima*, Takuya Kojima*, Hayate Okuhara*, Kazusa Musha*, Hideharu Amano*, Ryuichi Sakamoto†, Masaaki Kondo†, Mitaro Namiki§

*Keio University, †The University of Tokyo, §Tokyo University of Agriculture and Technology

2019 IEEE 13th International Symposium on embedded Multicore/Manycore Systems-on-Chip (IEEE MCSoC-2019)

SLIDE 2

Limitation of a Monolithic SoC

• Many demands on recent embedded systems
  − High performance, high functionality
  − Low power consumption, low cost
• Increasing NRE cost of LSI chips
  − Due to complicated design, test, and masks
• Problems
  − Hard to meet such demands with a single SoC
  − High cost to develop an LSI for each application (ASIC)

• Building Block Computing System
  − A SiP (System in Package) technique

SLIDE 3

Building Block Computing Systems

• For flexible and varied systems
  − Combining several basic chips depending on the target applications
  − Using the ThruChip Interface (TCI) for inter-chip communication

[Figure: example stacks built from CPU, Memory, Accelerator1, and Accelerator2 chips]

SLIDE 4

TCI: ThruChip Interface[1]

• A wireless data transfer technique
  − Uses electromagnetic (inductive) coupling between coils
  − No special fabrication process needed
  − Up to 8 Gbps with a bit error ratio of 10^-12
• The TCI IP includes
  − Two SERDESes, for Rx and Tx
  − An oscillator for the transmission clock

[1] Y. Take, et al, “3D NoC with Inductive-Coupling Links for Building-Block SiPs,” IEEE Transactions on Computers, vol. 63, no. 3, pp. 748–763, 2014.

SLIDE 5

Escalator Network by TCI Link

• Stacked chips form a ring network
  − A packet-based network
• A packet consists of 1 to 17 flits of 35 bits each

35-bit flit structure:

  bits 34-32   bits 31-0
  Header       Payload
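The flit layout above (a 3-bit header in bits 34-32 and a 32-bit payload in bits 31-0, with 1 to 17 flits per packet) can be illustrated with a minimal sketch. The function names are hypothetical, and the slides do not say what the header values mean, so the header is treated here as an opaque 3-bit field.

```python
HEADER_BITS = 3     # bits 34-32 of the flit
PAYLOAD_BITS = 32   # bits 31-0 of the flit
FLIT_BITS = HEADER_BITS + PAYLOAD_BITS  # 35
MAX_FLITS_PER_PACKET = 17


def pack_flit(header: int, payload: int) -> int:
    """Combine a 3-bit header and a 32-bit payload into one 35-bit flit."""
    if not 0 <= header < (1 << HEADER_BITS):
        raise ValueError("header must fit in 3 bits")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 32 bits")
    return (header << PAYLOAD_BITS) | payload


def unpack_flit(flit: int) -> tuple:
    """Split a 35-bit flit back into (header, payload)."""
    if not 0 <= flit < (1 << FLIT_BITS):
        raise ValueError("flit must fit in 35 bits")
    return flit >> PAYLOAD_BITS, flit & ((1 << PAYLOAD_BITS) - 1)


def make_packet(fields: list) -> list:
    """Build a packet from (header, payload) pairs, enforcing the 1-17 flit limit.

    The header values are left to the caller because their encoding is not
    given in the slides (assumption: any 3-bit value is allowed).
    """
    if not 1 <= len(fields) <= MAX_FLITS_PER_PACKET:
        raise ValueError("a packet holds 1 to 17 flits")
    return [pack_flit(h, p) for h, p in fields]
```

For example, `unpack_flit(pack_flit(0b101, 0xDEADBEEF))` returns `(5, 0xDEADBEEF)`.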

SLIDE 6

Cube-2: A Prototype of Building Block Computing Systems

• Geyser [2]
  − MIPS R3000 compatible CPU
• Accelerators
  − CC-SOTB2 [3]
    • Highly energy-efficient CGRA
  − SNACC [4]
    • CNN accelerator
  − KVS [5]
    • Non-SQL DB (key-value store) accelerator

[2] L. Zhao, et al., “Geyser-2: The second prototype CPU with fine-grained run-time power gating,” in Proc. of the 16th ASP-DAC, 2011.
[3] T. Kojima, et al., “Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping,” in Proc. of the 9th HEART, 2018.
[4] R. Sakamoto, et al., “The design and implementation of scalable deep neural network accelerator cores,” in Proc. of IEEE MCSoC, 2017.
[5] Y. Tokuyoshi, et al., “Key-value Store Chip Design for Low Power Consumption,” in Proc. of IEEE CoolChips 22, 2019.

SLIDE 7

Shared Memory for Twin-Tower (SMTT)

• A bridge SRAM chip
  − Has two TCI IPs
  − Shares 256 KB between the twin towers
  − Provides an atomic Fetch&Dec operation for synchronization among stacked chips
  − Supports DMA transfer

[Figure: SMTT bridging two towers of CPU and accelerator chips]
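How an atomic Fetch&Dec can be used for synchronization may be easier to see in a small software model. The sketch below is not the SMTT hardware interface: the class and function names are hypothetical, a lock stands in for hardware atomicity, and it assumes Fetch&Dec returns the value held before the decrement. Each worker decrements a shared counter once, and the worker that reads 1 knows it is the last to arrive.

```python
import threading


class SharedWord:
    """Software stand-in for one shared-memory word (hypothetical model)."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._lock = threading.Lock()  # models the atomicity of the operation

    def fetch_and_dec(self) -> int:
        """Atomically decrement the word and return its previous value."""
        with self._lock:
            old = self._value
            self._value = old - 1
            return old

    def read(self) -> int:
        with self._lock:
            return self._value


def run_workers(n: int) -> tuple:
    """Start n workers; return (final counter value, ids that arrived last)."""
    counter = SharedWord(n)
    last_arrivals = []

    def worker(worker_id: int) -> None:
        # The worker that observes 1 took the final slot.
        if counter.fetch_and_dec() == 1:
            last_arrivals.append(worker_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read(), last_arrivals
```

Because the previous values handed out are n, n-1, ..., 1, exactly one worker sees 1 regardless of scheduling, which is what makes the operation usable as a completion counter.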

SLIDE 8

Overview of GeyserTT

• Geyser architecture
  − MIPS R3000 compatible CPU
    • General compilers are available
  − Serves as the host controller of the Cube-2 system
  − Includes a 2-way d-cache, a 2-way i-cache, and a TLB
• GeyserTT
  − A real-chip implementation of Geyser for Twin-Tower
  − Three TCI IPs for various stacking structures

SLIDE 9

Overview of SNACC

• SNACC architecture
  − A CNN accelerator composed of 4 SIMD cores
• Each core consists of
  − A custom SIMD unit
  − A general-purpose ALU and register file
  − 5 distributed memories
    1. Instruction
    2. Input data
    3. Weight data
    4. Look-up table
    5. Write buffer

[Figure: SNACC chip, with the TCI and the cores labeled]

SLIDE 10

Memory-Mapped Chips

• Both towers have independent address spaces
• Each chip in a tower is mapped into Geyser's address space
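Memory mapping of stacked chips can be pictured as an address decode step on the host CPU. The sketch below is only an illustration: the slides give no address map, so every region name, base address, and size here is hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Region(NamedTuple):
    name: str   # which chip in the tower the region belongs to
    base: int   # first address of the region
    size: int   # region length in bytes


# Hypothetical address map for one tower; the real Cube-2 layout is not
# given in the slides.
TOWER_MAP = [
    Region("host_local", 0x0000_0000, 0x1000_0000),
    Region("stacked_chip_1", 0x1000_0000, 0x0100_0000),
    Region("stacked_chip_2", 0x1100_0000, 0x0100_0000),
]


def decode(addr: int, regions=TOWER_MAP) -> Optional[Tuple[str, int]]:
    """Return (chip name, offset inside that chip), or None if unmapped."""
    for region in regions:
        if region.base <= addr < region.base + region.size:
            return region.name, addr - region.base
    return None
```

Since each tower has its own address space, a second tower would simply carry its own, separate table of regions.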

SLIDE 11

Contributions of This Work

• Fabricating and evaluating Cube-2 family chips
  − Focusing on GeyserTT, SNACC, and SMTT
  − Power consumption and performance
  − Based on real-chip measurements
• Evaluating the TCI IP itself
  − Feasibility of the technology
  − Power consumption and performance
  − Based on real-chip measurements
• Demonstrating the potential for practical applications
  − With a CNN application as a case study

SLIDE 12

Real Chip Implementation

Process            Renesas SOTB 65 nm
Supply voltage     0.75 V
Design             Verilog HDL
Synthesis          Synopsys Design Compiler 2016.03-SP4
Place & Route      Synopsys IC Compiler 2016.03-SP4
Chip size          SNACC & GeyserTT: 3 mm x 6 mm; SMTT: 6 mm x 6 mm
Target frequency   SNACC & GeyserTT: 50 MHz; SMTT: 100 MHz; TCI IP: 50 MHz

[Photo: stacked chips]

SLIDE 13

Evaluation: Power Consumption

• Dynamic power is dominant
  − Leakage power is only 40 to 80 μW

[Figure: power consumption of the three chips; bars labeled "Design Target"]

SLIDE 14

Evaluation: TCI performance

• GeyserTT x SNACC case
  − Bidirectional links work
• TCI consumes up to 2.0x the power and achieves 0.12x the performance
  − Compared to the design value (50 MHz)
• GeyserTT x SMTT case
  − The upward link does not work
• But the latest chip shows
  − 10 to 15 MHz transfer
  − 1.5x the power of the design value

[Figure labels: SNACC/GeyserTT and SMTT/GeyserTT stacks; Tx: 38.9 mW, Rx: 20.1 mW; 6 MHz, 8 MHz; × on one link]
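The 0.12x performance figure can be cross-checked against the 50 MHz design value with simple arithmetic. The sketch below assumes that "performance" here means the ratio of the working link clock to the design clock; under that assumption, 0.12x corresponds to the 6 MHz label in the figure. The function name is hypothetical.

```python
DESIGN_CLOCK_MHZ = 50.0  # TCI IP target frequency from the slides


def relative_performance(measured_mhz: float,
                         design_mhz: float = DESIGN_CLOCK_MHZ) -> float:
    """Ratio of a measured link clock to the design clock (assumed metric)."""
    if design_mhz <= 0:
        raise ValueError("design clock must be positive")
    return measured_mhz / design_mhz


# Clock values that appear on the slide, in MHz.
for mhz in (6, 8, 10, 15):
    print(f"{mhz:>2} MHz -> {relative_performance(mhz):.2f}x of design")
```

By the same measure, the 10 to 15 MHz reported for the latest chip would be 0.20x to 0.30x of the design value.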

SLIDE 15

Evaluation: TCI power consumption

• The TCI IP consumes a large part (85%) of the power
• Putting the link to sleep while no data transfer is needed may reduce the power

[Figure: power breakdown of the whole system (SMTT, SNACC, GeyserTT)]
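A rough feel for how much link sleeping could help follows from the 85% share alone. The sketch below is a back-of-the-envelope model, not a measurement: it assumes the non-TCI 15% is unaffected, that TCI power scales linearly with the fraction of time the link is awake, and that a sleeping link draws a configurable fraction of its active power (zero by default). All names are hypothetical.

```python
TCI_SHARE = 0.85  # fraction of system power consumed by the TCI IP (from the slides)


def relative_system_power(link_active_fraction: float,
                          sleep_power_ratio: float = 0.0) -> float:
    """System power relative to the always-on case.

    link_active_fraction: share of time the link must be awake (0.0 to 1.0).
    sleep_power_ratio: sleeping-link power as a fraction of active power
                       (assumption; not given in the slides).
    """
    if not 0.0 <= link_active_fraction <= 1.0:
        raise ValueError("link_active_fraction must be between 0 and 1")
    if not 0.0 <= sleep_power_ratio <= 1.0:
        raise ValueError("sleep_power_ratio must be between 0 and 1")
    asleep = 1.0 - link_active_fraction
    tci_power = TCI_SHARE * (link_active_fraction + asleep * sleep_power_ratio)
    return (1.0 - TCI_SHARE) + tci_power
```

Under these assumptions, a link that is awake half of the time gives `relative_system_power(0.5)` of about 0.575, that is, roughly 42% less system power than an always-on link.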

SLIDE 16

Case study: Processing FC layers of a CNN

• Last two FC layers of AlexNet [6]

layer   # of inputs   # of outputs   Kernel size     Bias
FC7     4096          4096           (4096, 4096)    4096
FC8     4096          1000           (1000, 4096)    1000

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’12), Curran Associates Inc., pp. 1097–1105, 2012.
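To make the workload concrete, the sketch below computes a fully connected layer as a matrix-vector product plus bias, using the (outputs, inputs) kernel shape from the table, and counts the weights and multiply-accumulate operations for FC7 and FC8. It is a plain reference model in pure Python, not the code run on GeyserTT or SNACC; the function names are hypothetical and no activation function is applied because the slides do not mention one.

```python
def fc_forward(weights, bias, x):
    """Compute y = W x + b, with W given as one row per output."""
    if len(weights) != len(bias):
        raise ValueError("one bias value is needed per output")
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def fc_cost(n_inputs: int, n_outputs: int) -> dict:
    """Weights, biases, and multiply-accumulates for one FC layer."""
    return {
        "weights": n_inputs * n_outputs,
        "biases": n_outputs,
        "macs": n_inputs * n_outputs,
    }


# Layer sizes taken from the table above.
FC7 = fc_cost(4096, 4096)   # 16,777,216 weights
FC8 = fc_cost(4096, 1000)   # 4,096,000 weights

# Tiny example with 2 inputs and 3 outputs.
print(fc_forward([[1, 2], [3, 4], [5, 6]], [1, 1, 1], [1, 1]))  # [4, 8, 12]
```

Together the two layers need about 20.9 million multiply-accumulates per input vector, and FC7 alone accounts for roughly 80% of them.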

SLIDE 17

Evaluation: Simulated Configurations

• Evaluated system configurations
  1. GeyserTT
  2. GeyserTT x2 + SMTT
  3. GeyserTT + SNACC
  4. GeyserTT x2 + SNACC x2 + SMTT

[Figure: stacks built from GeyserTT and SMTT chips]

SLIDE 18

Evaluation: Simulated Configurations

• Evaluated system configurations
  1. GeyserTT
  2. GeyserTT x2 + SMTT
  3. GeyserTT + SNACC
  4. GeyserTT x2 + SNACC x2 + SMTT

[Figure: stacks built from GeyserTT, SNACC, and SMTT chips]

SLIDE 19

Evaluation: Execution time @50MHz

• The execution time for each configuration includes the data transfer time through TCI

[Figure annotation: ×6.0 faster]

SLIDE 20

Conclusion

• Evaluating real chips fabricated with Renesas SOTB 65 nm technology
  − MIPS R3000 processor: ~35 mW @ 50 MHz
  − CNN accelerator and memory chip: ~4 mW @ 50 MHz
• Demonstrating chip stacking with TCI
  − Communication partially works
  − Much more power is consumed than the design value
  − A twin-tower system achieves 6.0x higher performance
• Future work
  − Optimization of TCI power using a sleep mode
  − Refinement of the power grid for the TCI IP
• Partially completed
  − Use of other family chips such as CC-SOTB2 and KVS