A Preliminary Evaluation of Building Block Computing Systems
Sayaka Terashima*, Takuya Kojima*, Hayate Okuhara*, Kazusa Musha*, Hideharu Amano*, Ryuichi Sakamoto†, Masaaki Kondo†, Mitaro Namiki§
* Keio University, † The University of Tokyo, § Tokyo University of Agriculture and Technology
2019 IEEE 13th International Symposium on Embedded Multicore/Manycore Systems-on-Chip (IEEE MCSoC-2019)

Limitation of a Monolithic SoC
• Many demands on recent embedded systems
  − High performance, high functionality
  − Low power consumption, low cost
• Increasing NRE cost of LSI chips
  − Due to complicated design, test, and masks
• Problems
  − Hard to meet such demands with a single SoC
  − High cost to develop an LSI for each application (ASIC)
Building Block Computing System
  ・ A technique of SiP (System in Package)

Building Block Computing Systems
• For flexible & various systems
  − Combining several basic chips depending on target apps.
  − Using ThruChip Interface (TCI) for inter-chip communication
[Figure: chip stacks built from different combinations of CPU, Memory, Accelerator1, and Accelerator2 chips]

TCI: ThruChip Interface [1]
• A wireless data transfer technique
  − Uses inductive coupling between coils
  − No need for a special fabrication process
  − Up to 8 Gbps with a bit error ratio of 10^[exponent lost in extraction]
• TCI IP includes
  − Two SERDESes for Rx & Tx
  − An oscillator for the transfer clock
[1] Y. Take, et al., “3D NoC with Inductive-Coupling Links for Building-Block SiPs,” IEEE Transactions on Computers, vol. 63, no. 3, pp. 748–763, 2014.

Escalator Network by TCI Link
• Stacked chips form a ring network
  − A packet-based network
• A packet is composed of 1~17 flits of 35 bits each
35-bit Flit Structure
  bits 34-32: Header
  bits 31-0:  Payload
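The flit layout above (a 3-bit header in bits 34-32 and a 32-bit payload in bits 31-0) can be sketched as a pair of pack/unpack helpers. This is only a minimal illustration: the slide does not say what the header values mean, so `header` is treated as an opaque 3-bit field, and the function names are hypothetical rather than part of any Cube-2 software.

```python
from typing import Tuple

HEADER_BITS = 3      # bits 34..32 of the flit
PAYLOAD_BITS = 32    # bits 31..0 of the flit
FLIT_BITS = HEADER_BITS + PAYLOAD_BITS   # 35 bits in total
MAX_FLITS_PER_PACKET = 17                # a packet holds 1 to 17 flits


def pack_flit(header: int, payload: int) -> int:
    """Combine an opaque 3-bit header and a 32-bit payload into one 35-bit flit."""
    if not 0 <= header < (1 << HEADER_BITS):
        raise ValueError("header must fit in 3 bits")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 32 bits")
    return (header << PAYLOAD_BITS) | payload


def unpack_flit(flit: int) -> Tuple[int, int]:
    """Split a 35-bit flit back into (header, payload)."""
    if not 0 <= flit < (1 << FLIT_BITS):
        raise ValueError("flit must fit in 35 bits")
    return flit >> PAYLOAD_BITS, flit & ((1 << PAYLOAD_BITS) - 1)
```

For example, `unpack_flit(pack_flit(0b101, 0xDEADBEEF))` gives back the same header and payload pair.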

Cube-2: A Prototype of Building Block Computing Systems
• Geyser [2]
  − MIPS R3000 compatible CPU
• Accelerators
  − CC-SOTB2 [3]
    ◦ Highly energy-efficient CGRA
  − SNACC [4]
    ◦ CNN accelerator
  − KVS [5]
    ◦ Non-SQL DB accelerator
[2] L. Zhao, et al., “Geyser-2: The second prototype CPU with fine-grained run-time power gating,” Proc. of the 16th ASP-DAC, 2011.
[3] T. Kojima, et al., “Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping,” Proc. of the 9th HEART, 2018.
[4] R. Sakamoto, et al., “The design and implementation of scalable deep neural network accelerator cores,” Proc. of IEEE MCSoC, 2017.
[5] Y. Tokuyoshi, et al., “Key-value Store Chip Design for Low Power Consumption,” Proc. of IEEE CoolChips 22, 2019.

Shared Memory for Twin-Tower (SMTT)
• A bridge SRAM chip
  − Has two TCI IPs
  − Shares 256KB between the twin towers
  − Provides the atomic operation Fetch&Dec for synchronization among stacked chips
  − Supports DMA transfer
[Figure: twin towers of CPU and accelerator chips, both stacked on one SMTT chip]
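To illustrate how an atomic Fetch&Dec can synchronize several chips, the sketch below models the shared counter in software. It assumes that Fetch&Dec returns the old value and then decrements it, which the slide does not spell out; the `SharedCounter` class and the use of Python threads as stand-ins for stacked chips are purely illustrative.

```python
import threading


class SharedCounter:
    """Software model of one shared word that supports an atomic Fetch&Dec."""

    def __init__(self, initial: int) -> None:
        self._value = initial
        self._lock = threading.Lock()   # stands in for the hardware atomicity

    def fetch_and_dec(self) -> int:
        """Atomically return the current value, then decrement it (assumed semantics)."""
        with self._lock:
            old = self._value
            self._value -= 1
            return old

    def read(self) -> int:
        with self._lock:
            return self._value


def run_workers(n: int):
    """Each worker signals completion once; exactly one of them sees the old value 1."""
    counter = SharedCounter(n)
    last_arrivals = []

    def worker(idx: int) -> None:
        # ... the worker's share of the computation would go here ...
        if counter.fetch_and_dec() == 1:
            last_arrivals.append(idx)   # this worker was the last one to finish

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read(), last_arrivals
```

Because the read and the decrement happen as one step, only a single worker can observe the value 1, so that worker can safely take on follow-up work such as collecting results.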

Overview of GeyserTT
• Geyser architecture
  − MIPS R3000 compatible CPU
    ◦ General compilers are available
  − Serves as the host controller of the Cube-2 system
  − Includes a 2-way d-cache, a 2-way i-cache, and a TLB
• GeyserTT
  − A real chip implementation of Geyser for Twin-Tower
  − Three TCI IPs to support various stacking structures

Overview of SNACC
• SNACC architecture
  − A CNN accelerator composed of 4 SIMD cores
• Each core consists of
  − Custom SIMD unit
  − General-purpose ALU & Regfile
  − 5 distributed memories
    1. Instruction
    2. Input data
    3. Weight data
    4. Look-up-table
    5. Write buffer
[Figure: SNACC chip with the TCI and the cores labeled]

Memory-Mapped Chips
• Each chip in a tower is mapped to Geyser’s address space
• The two towers have independent address spaces

Contributions of This Work
• Fabricating & evaluating Cube2-family chips
  − Focusing on GeyserTT, SNACC, SMTT
  − About power consumption & performance
  − Based on real chip measurement
• Evaluating TCI IP itself
  − About feasibility of this technology
  − About power consumption & performance
  − Based on real chip measurement
• Demonstrating possibility for practical apps.
  − With CNN application as a case study

Real Chip Implementation
Process            Renesas SOTB 65nm
Supply voltage     0.75 V
Design             Verilog HDL
Synthesis          Synopsys Design Compiler 2016.03-SP4
Place & Route      Synopsys IC Compiler 2016.03-SP4
Chip size          SNACC & GeyserTT: 3mm x 6mm
                   SMTT: 6mm x 6mm
Target Frequency   SNACC & GeyserTT: 50MHz
                   SMTT: 100MHz
                   TCI IP: 50MHz
[Figure: Stacked Chips]

Evaluation: Power Consumption
[Figure: power consumption charts, with bars labeled Design and Target]
• Dynamic power is dominant
  − Leakage power is only 40~80 μW

Evaluation: TCI performance
• GeyserTT x SNACC case
  − Bidirectional links can work
  − Compared to design value (50MHz)
• TCI consumes at most 2.0x the power & achieves 0.12x the performance
• GeyserTT x SMTT case
  − Upward link does not work
• But the latest chip shows
  − 10~15MHz transfer
  − 1.5x the power of the design value
[Figure: GeyserTT stacked with SNACC, link at 6MHz, labels Tx: 38.9 mW and Rx: 20.1 mW]
[Figure: GeyserTT stacked with SMTT, 8MHz, upward link marked ×]
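The 0.12x figure on this slide is consistent with the ratio of the 6MHz working link clock to the 50MHz design value. A small sketch of that arithmetic follows, under the assumption (not stated on the slide) that link performance scales linearly with the link clock; the function name is illustrative only.

```python
DESIGN_FREQ_MHZ = 50.0   # TCI IP design target from the implementation table


def relative_performance(measured_mhz: float, design_mhz: float = DESIGN_FREQ_MHZ) -> float:
    """Ratio of achieved to designed link clock (assumes throughput scales with clock)."""
    return measured_mhz / design_mhz


# GeyserTT x SNACC case: links work at 6MHz
snacc_ratio = relative_performance(6.0)

# Latest chip: 10~15MHz transfer
latest_ratios = [relative_performance(f) for f in (10.0, 15.0)]
```

Under the same assumption, the 10~15MHz range reported for the latest chip would correspond to roughly 0.2x to 0.3x of the design value.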

Evaluation: TCI power consumption
[Figure: Power Breakdown of whole system, for stacks of GeyserTT, SNACC, and SMTT]
• TCI IP consumes a large part (85%) of the power
• Sleeping the link while data transfer is not needed may reduce the power

Case study: Processing FC layers of a CNN
• Last two FC layers of AlexNet [6]
layer   # of input   # of output   Kernel size     Bias
FC7     4096         4096          (4096, 4096)    4096
FC8     4096         1000          (1000, 4096)    1000
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’12), Curran Associates Inc., pp. 1097–1105, 2012.
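To give a sense of the workload size, the parameter and multiply-accumulate counts of the two layers can be derived directly from the table. The sketch below assumes one multiply-accumulate per weight for a single input vector; the dictionary layout and function name are illustrative, not taken from the authors' code.

```python
# (number of inputs, number of outputs) for each layer, from the table above
FC_LAYERS = {
    "FC7": (4096, 4096),
    "FC8": (4096, 1000),
}


def fc_cost(n_in: int, n_out: int) -> dict:
    """Weight, bias, and MAC counts of one fully connected layer for one input vector."""
    weights = n_in * n_out          # kernel of shape (n_out, n_in)
    biases = n_out                  # one bias per output
    return {
        "weights": weights,
        "biases": biases,
        "params": weights + biases,
        "macs": weights,            # one multiply-accumulate per weight
    }


costs = {name: fc_cost(n_in, n_out) for name, (n_in, n_out) in FC_LAYERS.items()}
total_params = sum(c["params"] for c in costs.values())
```

With these shapes, FC7 has 16,777,216 weights and FC8 has 4,096,000, so the two layers together hold about 20.9 million parameters including biases.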

Evaluation: Simulated Configurations
• Evaluated system configurations
  1. GeyserTT
  2. GeyserTT x2 + SMTT
  3. GeyserTT + SNACC
  4. GeyserTT x2 + SNACC x2 + SMTT
[Figures: stack diagrams of the four configurations]

Evaluation: Execution time @50MHz
[Figure: execution time of each configuration, annotated "×6.0 faster"]
• The execution time for each configuration includes data transfer time through TCI

Conclusion
• Evaluating real chips fabricated with Renesas SOTB 65nm technology
  − MIPS R3000 processor: ~35mW @ 50MHz
  − CNN accelerator & memory chip: ~4mW @ 50MHz
• Demonstrating chip stacking with TCI
  − Communications partially work
  − Much more power is consumed than the design value
  − A twin-tower system achieves 6.0x higher performance
• Future work
  − Optimization of TCI power using sleep mode
  − Refinement of power grid for TCI IP
• Partially completed
  − Use of other family chips such as CC-SOTB2 & KVS
