LOW-POWER, HIGH-PERFORMANCE, RECONFIGURABLE PROCESSOR USING SINGLE-FLUX-QUANTUM CIRCUITS



slide-1
SLIDE 1

LOW-POWER, HIGH-PERFORMANCE, RECONFIGURABLE PROCESSOR USING SINGLE-FLUX-QUANTUM CIRCUITS

Naofumi Takagi

Graduate School of Informatics Kyoto University Kyoto 606-8501, Japan takagi@i.kyoto-u.ac.jp

slide-2
SLIDE 2

Our Team

Takagi Group: Kyoto University

  • Prof. N. Takagi, Prof. K. Takagi

Murakami Group: Kyushu University

  • Prof. K. Murakami, Prof. K. Inoue, Prof. H. Honda

Yoshikawa Group: Yokohama National University

  • Prof. N. Yoshikawa, Prof. Y. Yamanashi

Akaike Group: Nagoya University

  • Prof. H. Akaike, Prof. A. Fujimaki, Prof. M. Tanaka

Nagasawa Group: ISTEC-SRL

(Superconductivity Research Laboratory, International Superconductivity Technology Center)

  • Mr. S. Nagasawa, Dr. M. Hidaka


slide-3
SLIDE 3

Aim of the research

Developing basic technologies for energy-efficient, high-performance computers, e.g., a 10 Tflops desk-side computer, using superconducting single-flux-quantum (SFQ) circuits and adopting the processor architecture called ‘large-scale reconfigurable data-paths (RDPs).’

slide-4
SLIDE 4

[Figure: two routes to a 10TFlops computer. By technologies at 2006: a parallel computer using 90nm CMOS. By developing basic technologies: an LSRDP using SFQ (0.5~0.35µm).]

slide-5
SLIDE 5

Our approach

Reduction of power consumption of conventional circuit technology

Development of technologies for realizing a high-performance computer using a new low-power circuit technology, SFQ

slide-6
SLIDE 6

Backgrounds

Superconducting Single-Flux-Quantum Circuits

Ultra Low-Power, Ultra High-Speed

Φ0 = h/2e = 2.07 mV·ps

[Figure: an SFQ in a superconductive loop containing a Josephson junction, and the resulting SFQ pulse, about 1 mV high and 2~3 ps wide.]
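The flux quantum value quoted on this slide can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the rectangular pulse shape is a simplifying assumption (a real SFQ pulse is not rectangular), used to show why a pulse a few picoseconds wide is roughly 1 mV high.

```python
# Exact SI values of the constants (2019 SI definition).
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def flux_quantum_wb():
    """Flux quantum h/2e in webers (volt-seconds)."""
    return PLANCK_H / (2 * ELEMENTARY_CHARGE)


def rectangular_pulse_height_mv(width_ps):
    """Height of an idealized rectangular pulse whose area is one flux quantum.

    Assumption: rectangular shape, chosen only to make the arithmetic simple.
    """
    phi0_mv_ps = flux_quantum_wb() * 1e15  # 1 V*s = 1e15 mV*ps
    return phi0_mv_ps / width_ps


if __name__ == "__main__":
    print(f"Phi0 = {flux_quantum_wb() * 1e15:.3f} mV*ps")
    for width in (2.0, 3.0):
        print(f"{width} ps wide -> {rectangular_pulse_height_mv(width):.2f} mV high")
```

The time integral of the voltage pulse is fixed at Φ0, so a shorter pulse is necessarily a taller one.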

slide-7
SLIDE 7

SFQ technologies at 2006

  • Conventional Nb 4-layer 2µm fabrication process

–Cell-based design, logic cell library
–Automatic routing by Josephson transmission line
–SFQ-LSIs with more than 10,000 JJs

  • Development of a new 1µm fabrication process

–Nb 6 layers
–No design environment

  • Development of passive transmission line (PTL) technology

–High-speed inner-chip data transfer

[Figure: an SFQ pulse travelling along a superconducting micro-strip line.]

slide-8
SLIDE 8

Reconfigurable Data-Path (RDP) processor

  • Reconfigurable data-path

–A lot of floating-point units (FPUs)
–Reconfigurable operand routing networks (ORNs)
–Dynamic reconfiguration

  • Features

–Reconfiguring the data-path by routing ORNs to fit the processing of a loop in large-scale numerical computation
–Parallel and pipelined processing
–Input/output data is transferred from/to memory in bursts

[Figure: block diagram of the RDP processor. Rows of FPUs (processing elements, PEs) alternate with operand routing networks (ORNs); a streaming memory access controller (SMAC) with I/O ports links the array to the general-purpose processor (GPP) and main memory.]
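The idea of configuring a data-path once and then streaming loop data through it can be illustrated with a toy Python model. Everything here is hypothetical: the configuration format, the `OPS` table, and the function names are invented for illustration. The model also ignores pipelining, timing, and any limit on connection length in the real ORNs.

```python
import operator

# Hypothetical operation set for one FPU; "pass" forwards operand A unchanged.
OPS = {
    "add": operator.add,
    "mul": operator.mul,
    "pass": lambda a, b: a,
}


def run_datapath(config, inputs):
    """Push one set of operands through a configured data-path.

    config: list of rows; each row is a list of (op, index_a, index_b), where
    the indices select operands from the previous row's outputs. That
    selection plays the role of the operand routing network in this toy model.
    """
    values = list(inputs)
    for row in config:
        values = [OPS[op](values[a], values[b]) for op, a, b in row]
    return values


def stream(config, records):
    """Reuse one configuration for every iteration of a loop (burst data)."""
    return [run_datapath(config, record) for record in records]


# Data-path for the loop body a*x + y: one multiply row, then one add row.
AXPY = [
    [("mul", 0, 1), ("pass", 2, 2)],
    [("add", 0, 1)],
]

if __name__ == "__main__":
    print(stream(AXPY, [(2.0, 3.0, 4.0), (1.0, 5.0, 0.5)]))
```

Switching to a different loop body means supplying a different `config`; the "hardware" function `run_datapath` stays the same, which is the sense in which the data-path is reconfigurable.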

slide-9
SLIDE 9

Research subjects

  • 1. SFQ fabrication process and circuit design environments

(1) Nb multi-layer 1µm fabrication process (Nagasawa G.)
(2) Logic cell library for the 1µm process (Yoshikawa G. and Akaike G.)
(3) CAD for SFQ digital circuit design (Takagi G.)

  • 2. SFQ-FPUs and SFQ-RDP prototypes

(1) SFQ-FPUs (Yoshikawa G. and Takagi G.)
–Half-precision FPA and FPM operating at 25GHz (2µm process)
–FPA and FPM operating at 50GHz (1µm process)
(2) SFQ-RDP prototypes (ALU+ORN) (Akaike G.)
–2x2 SFQ-RDP operating at 25GHz (2µm process)
–4x4 SFQ-RDP operating at 50GHz (1µm process)

  • 3. RDP architecture (Murakami G.)

RDP architecture, RDP compiler, RDP-oriented algorithms

slide-10
SLIDE 10

Results of the research

slide-11
SLIDE 11
  • 1. Fabrication process and design environment

Development of a Nb 9-layer 1µm fabrication process

[Figure: device structure on a Si substrate, with Nb layers M1 (DCP), M2 (GND1), M3 (PTL1), M4 (GND2), M5 (PTL2), M6 (GND3), M7 (GP), M8 (BAS), and M9 (COU); contacts C1 to C6, BC, GC, JC, and RC; the AlOx junction (JJ); resistor RES1; and SiO2 insulation. Annotated Nb and SiO2 layer thicknesses range from 150 nm to 400 nm. Layer roles as labelled: active layers including junctions and resistors, main ground plane, 1st PTL layer, DC power layer, 2nd PTL layer.]

[Figure: cross-sectional SEM photograph, with the complemented planarization layer and SiO2 marked.]

Nb layers for M1-M7 are planarized. Excellent flatness was obtained even though the step edges of several underlying patterns overlap.

slide-12
SLIDE 12

Shift registers for evaluation of the Nb 9-layer 1µm process

Design

  • 16 circuits

–Two 16-bit shift registers
–Two 64-bit shift registers
–Two 160-bit shift registers
–Two 640-bit shift registers
–Four 1280-bit shift registers
–Four 2560-bit shift registers

  • 68,990 JJs in total

Measurement results (Best chip)

  • Only three defects; correct operation of 13 of 16 circuits
  • Correct operation of all 2560-bit shift registers with 10,281 JJs

[Figure: chip layout showing the placement of the 16-bit to 2560-bit shift registers (SRs). Chip size: 8.5mm x 7.0mm]

slide-13
SLIDE 13

Development of a logic cell library for the 9-layer process

[Figure: a microphotograph of a dffc2 cell (30 µm x 30 µm).]

[Figure: basic structure of a logic cell (30 µm on a side), showing the PTL1 and PTL2 layers, a bias pillar, and a ground contact.]

slide-14
SLIDE 14

A 4x4 switch by the 9-layer 1µm process and the cell library

  • Operation up to 112 GHz (World’s highest)
  • Total power consumption: 660 µW
  • Total number of JJs: 3362
  • Number of vias: 434

[Figure: switch layout with the upper PTL, lower PTL, and via holes marked.]

slide-15
SLIDE 15

Circuit area ratio 1 : 0.19 (81% reduction)

Conventional Nb 4-layer 2µm process → New Nb 9-layer 1µm process

(Cell size: 40µm x 40µm → 30µm x 30µm)

[Figure: the same functional block laid out in each process.]

The circuit area is reduced by 81% compared with the conventional Nb 4-layer 2µm fabrication process.
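A short calculation separates the two figures on this slide. It is a sketch under one assumption: the cell edge shrinks from 40 µm to 30 µm. The variable and function names are illustrative. Shrinking the cell alone accounts for a ratio of about 0.56, so the quoted block-level ratio of 0.19 includes further savings that the slide does not break down.

```python
def area_ratio(old_edge_um, new_edge_um):
    """Area ratio of two square cells given their edge lengths."""
    return (new_edge_um / old_edge_um) ** 2


# Assumption: both cell sizes are in micrometres (40 -> 30).
cell_ratio = area_ratio(40.0, 30.0)

# Functional-block ratio quoted on the slide.
block_ratio = 0.19
reduction_percent = round((1 - block_ratio) * 100)

# Share of the block-level shrink that is not explained by the cell size.
residual_ratio = block_ratio / cell_ratio

if __name__ == "__main__":
    print(f"cell area ratio:      {cell_ratio:.4f}")
    print(f"block area reduction: {reduction_percent}%")
    print(f"residual ratio:       {residual_ratio:.2f}")
```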

slide-16
SLIDE 16

Device density and operating frequency in LSIs

[Figure: device density (Trs/cm2, 1 to 10^12) versus frequency (GHz, 0.01 to 1000) for Si Bip, Si MOSFET, GaAs MESFET, GaAs HEMT, GaAs HBT, InP HEMT, InP HBT, and SiGe HBT LSIs, with the limit from long interconnect delay and the limits from power density for CMOS and for compound semiconductors. SFQ points: previous SFQ in Japan, present SFQ in USA, and the 4x4 SW and 2x2 RDP demonstrated in CREST.]

SFQ LSIs developed in this project have reached a region that semiconductors cannot reach.

slide-17
SLIDE 17

Energy consumption for a device used in LSIs

[Figure: energy consumption (J, 10^-23 to 10^-13) versus clock period (ps, 1 to 10^4), with the thermal energy at 4K and at 350K marked. Plotted: present CMOS, SFQ in Japan before 2005, present SFQ in USA, the 2x2 RDP, FPM/FPA, and 4x4 SW demonstrated in CREST, and primitives that will be demonstrated in CREST.]
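The two thermal-energy reference levels on this plot can be reproduced directly, assuming "thermal energy" means k_B·T (the slide does not spell out the definition). The function name is illustrative.

```python
BOLTZMANN_K = 1.380649e-23  # J/K, exact SI value


def thermal_energy_j(temperature_k):
    """Thermal energy k_B * T in joules (assumed definition)."""
    return BOLTZMANN_K * temperature_k


if __name__ == "__main__":
    for label, temperature in (("4K", 4.0), ("350K", 350.0)):
        print(f"thermal energy @{label}: {thermal_energy_j(temperature):.2e} J")
```

Under this assumption the 4K level is about 5.5 × 10^-23 J and the 350K level about 4.8 × 10^-21 J, roughly two orders of magnitude apart, which is consistent with the range of the energy axis.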

slide-18
SLIDE 18

Design flow of SFQ LSIs

[Figure: design flow of SFQ LSIs. Stages and data: Specification & Constraints, Design Entry, Logic Synthesis, Logic Netlist, Placement, Placed Cells & Connections, Routing, Mask Layout, Layout Viewer. Libraries and checks: Technology Library, P&R Library, Constraints & Violations, Cell & Wire Geometry, Timing Verification (Logic Simulator, Static Timing Analyzer).]

Specific Synthesis Subsystems for SFQ Logic Circuits

  • Sequential circuit synthesis
  • Clock scheduling and distribution
  • Asynchronous logic synthesis

–Layout-driven design
–Precise timing analysis
–Process unique to SFQ circuits
–Verification of pipeline operations

slide-19
SLIDE 19

Development of design tools

  • Designed a sample circuit: 8-bit carry lookahead adder (CLA)
  • Verified correct operations

Tools used: clock tree synthesis, semi-automatic placement, automatic routing

8-bit CLA: 158 gates, 9 levels, concurrent-flow clocking, 7092 JJs, 598 PTLs
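For readers unfamiliar with the sample circuit, the following Python sketch shows the textbook carry-lookahead principle at the bit level. It is a functional model only, not the gate-level SFQ design on the slide; `cla_add` is an illustrative name.

```python
def cla_add(a, b, width=8, carry_in=0):
    """Add two unsigned integers using carry-lookahead equations.

    Returns (sum modulo 2**width, carry_out).
    """
    # Per-bit generate (both inputs 1) and propagate (exactly one input 1).
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    carries = [carry_in]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ... | p[i]..p[0]c[0]
        # Each carry depends only on g, p, and carry_in, not on earlier carries.
        carry = g[i]
        prefix = p[i]
        for j in range(i - 1, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & carry_in
        carries.append(carry)

    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]


if __name__ == "__main__":
    print(cla_add(200, 100))  # 300 does not fit in 8 bits, so carry_out is set
```

Because every carry is written as a flat expression of the generate and propagate signals, all carries can in principle be evaluated in parallel, which is what distinguishes a lookahead adder from a ripple-carry adder.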

slide-20
SLIDE 20

  • 2. SFQ-FPUs and SFQ-RDP prototypes

Half-precision FPA and FPM using the 2µm process

FPA

–Operating frequency: 20GHz
–Performance: 1.67 GFLOPS
–Number of junctions: 10244 JJs
–Power consumption: 3.5 mW
–Circuit area: 5.86 x 5.72 mm2

[Figure: FPA micrograph (scale bar 1mm) with blocks marked: Shifter of A, Shifter of B, Controller, Adder & Subtractor, Normalizers, Clock Generator, Shift Register of Significands, Shift Register of Exponent and Sign, Shift Registers for Confirmation.]

FPM

–Operating frequency: 32GHz
–Performance: 2.6 GFLOPS
–Number of junctions: 11044 JJs
–Power consumption: 3.5 mW
–Circuit area: 6.22 x 3.78 mm2

[Figure: FPM micrograph (scale bar 1mm) with blocks marked: Multiplier, Normalizers, Shift Registers, Clock Generator.]

slide-21
SLIDE 21

FPA and FPM using the 1µm process

[Figure: block diagram of the bit-serial FPM: Shift Register for Input, Significand Processing Circuit (operation circuit for the significand part, a systolic array multiplier), Exponent Processing Circuit (operation circuit for the exponent part), Normalizer, Clock Generator, Shift Register for Output.]

[Figure: micrograph of the 10-bit bit-serial FPM, 3.510 mm x 2.16 mm. Circuit area: 7.58 mm2. Junction count: 6157 JJs.]

[Figure: 50-GHz test results of the 4b multiplier, comparing measurement and simulation (9%).]

slide-22
SLIDE 22

2x3 SFQ-RDP prototype using the 2µm process

–6 ALUs
–Clock frequency: 23 GHz
–Junction count: 14040 (World’s largest integration scale)
–Circuit area: 6.84 x 6.72 mm2

CONNECT (in cooperation with SRL, NiCT, NU & YNU)

*SRL Nb 2.5 kA/cm2 standard process

slide-23
SLIDE 23

2x2 SFQ-RDP prototype using the 1µm process

–Clock frequency: 45GHz
–Power dissipation: 3.4mW
–Junction count: 11458
–Circuit area: 5.61 x 2.82 mm2

[Figure: block diagram with inputs IN1 to IN4 and outputs OUT1 to OUT4, four ALUs (ALU1: SUB, ALU2: XOR, ALU3: ADD, ALU4: AND), and four transfer units (TUs).]

[Figure: chip micrograph (scale bar 1mm) with ORN1, ORN2, ALU1a, ALU1b, ALU2a, and ALU2b marked.]

TU: Transfer Unit

slide-24
SLIDE 24

ORN architecture and a crossbar switch for 2-bit-wide data streams

–Number of rows = 1.5×M
–Number of columns = 4×MCL + 1

M: Number of FPUs in an array, MCL: Maximum Connection Length

[Figure: ORN with MCL = 2.]

–Junction count: 547
–Clock frequency: 65GHz
–Power consumption: 0.14mW

[Figure: operating region.]

                       M    MCL   JJ count   Power dissipation (mW)
RDP-M (Middle-scale)   24   5     307692     7.7
RDP-L (Large-scale)    48   5     615384     15
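The two sizing formulas on this slide are easy to evaluate for the tabulated configurations. The sketch below assumes M is even so that 1.5×M is an integer; `orn_dimensions` is an illustrative name.

```python
def orn_dimensions(m, mcl):
    """Rows and columns of an ORN from the slide's formulas.

    rows = 1.5 * M, columns = 4 * MCL + 1.
    Assumption: M is even, so the row count is a whole number.
    """
    if m % 2:
        raise ValueError("M must be even for 1.5 * M to be an integer")
    return 3 * m // 2, 4 * mcl + 1


if __name__ == "__main__":
    for name, m, mcl in (("MCL=2 example", 24, 2), ("RDP-M", 24, 5), ("RDP-L", 48, 5)):
        rows, cols = orn_dimensions(m, mcl)
        print(f"{name}: {rows} rows x {cols} columns")
```

With MCL fixed at 5, doubling M from 24 to 48 doubles the row count and leaves the column count unchanged, which matches the tabulated JJ count doubling from 307692 to 615384.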

slide-25
SLIDE 25

  • 3. Development of RDP Architecture

Micro-architecture:

–Two types of FPUs: FPA and FPM
–FPU: Three Inputs (A, B, C) → Three Outputs (A(*B), B, C)
–Arrangement of FPUs: alternate
–Three types (scales) of RDP (Small, Medium and Large-Scales)

[Figure: a PE built from an FPU and transfer units (TUs), and ORN connections from PE (i, j) to PEs (i, j+1), (i+1, j+1), (i+2, j+1), ..., (i+L, j+1), where MCL = L.]

TU: Data Through

RDP parameters (optimized by total number of JJs)

         # Input   # Output   Width   Height   MCL   Total JJs (∝RDP size)
RDP-S    19        12         22      14       4     19387K
RDP-M    19        12         24      17       5     27027K
RDP-L    38        24         41      34       6     96374K

slide-26
SLIDE 26

Development of RDP Compiler

Application C code → Modifying application code (manual: inserting LSRDP instructions in the code) → Modified code

1: flow of generation of assembly codes for GPP

Modified code → ISAcc or COINS compiler → .asm code for MIPS-based GPP

2: flow of generation of configuration bit-stream for RDP

Modified code → DFG Extraction (semi-manual) → Data flow graphs → Placement and Routing Tool → Configuration file + various text and schematic reports

Supporting inputs: RDP library file (functions definition & declaration), RDP architecture description

Simulator: performance evaluation

slide-27
SLIDE 27

Development of RDP-Oriented Algorithms

  • One-dimensional heat and vibrational equations
  • Two-dimensional heat and FDTD equations
  • Two-Electron Repulsion Integral calculation in quantum chemistry
  • Runge-Kutta calculation for ordinary differential equations

Performance Estimation

  • Two-dimensional heat equation (1024x1024 mesh)

SFQ-RDP: 50.6GFlops vs. GPU 1): 63.0GFlops

Estimation method:

–RDP: Execution time model (DFG has 21 inputs, 9 outputs, and 63 operations; BW: 159.0GB/s)
–GPP: Cycle-accurate processor simulator

1) T. Aoki and A. Nukada, “CUDA programming: beginners,” Kougakusya, ISBN-10:4777514773, 2009 (in Japanese).
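The kind of loop body targeted by the two-dimensional heat equation estimate can be sketched with the standard explicit five-point (FTCS) update. This is a textbook scheme and an assumption; the slides do not state which discretization was used, and the names below are illustrative. The helper `block_inputs` counts how many distinct mesh values a square block of such updates reads.

```python
def heat_step(u, alpha):
    """One explicit time step of the 2D heat equation on a rectangular mesh.

    u: list of rows of floats; boundary values are held fixed.
    alpha: kappa * dt / h**2 (the explicit scheme is stable for alpha <= 0.25).
    """
    n_rows, n_cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            new[i][j] = u[i][j] + alpha * (neighbours - 4.0 * u[i][j])
    return new


def block_inputs(size=3):
    """Number of distinct mesh points read when updating a size x size block."""
    cells = {(i, j) for i in range(size) for j in range(size)}
    needed = set(cells)
    for i, j in cells:
        needed |= {(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}
    return len(needed)


if __name__ == "__main__":
    mesh = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    print(heat_step(mesh, 0.25))
    print(block_inputs(3))
```

One reading consistent with the DFG figures quoted above, offered as an observation rather than something the slides state, is a 3x3 block of such updates: it produces 9 outputs, reads 21 distinct inputs, and costs 63 operations if each point update is counted as 7 operations (three additions for the neighbours, two multiplications, one subtraction, one final addition).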

slide-28
SLIDE 28

Summary

  • 1. SFQ fabrication process and circuit design environments

(1) We have established a Nb 9-layer 1µm fabrication process.
(2) We have developed a logic cell library for the 1µm process.
(3) We have developed several CAD tools for the 1µm process.

It is possible to design and fabricate large scale SFQ circuits.

  • 2. SFQ-FPUs and SFQ-RDP prototypes

(1) We have fabricated a half-precision FPA and FPM by the 2µm process, which have operated at 20GHz and 35GHz, respectively. We are designing an FPA and FPM by the 1µm process.
(2) We have fabricated a 2x3 SFQ-RDP prototype by the 2µm process, which has operated at 23GHz. We have fabricated a 2x2 SFQ-RDP prototype by the 1µm process, which has operated at 45GHz. We are developing a 4x4 SFQ-RDP prototype by the 1µm process.

It is possible to realize SFQ-RDPs.

slide-29
SLIDE 29

  • 3. RDP architecture

(1) We have determined architectural specifications of SFQ-RDP.
(2) We have developed a compiler for RDP.
(3) We have developed RDP-oriented algorithms for several applications and have estimated the performance of SFQ-RDP.

SFQ-RDP is effective.

We will be able to develop energy-efficient, high-performance computers using SFQ circuits in the near future.