LOW-POWER, HIGH-PERFORMANCE, RECONFIGURABLE PROCESSOR USING SINGLE-FLUX-QUANTUM CIRCUITS
Naofumi Takagi
Graduate School of Informatics, Kyoto University
Kyoto 606-8501, Japan
takagi@i.kyoto-u.ac.jp
Our Team
Takagi Group: Kyoto University
- Prof. N. Takagi, Prof. K. Takagi
Murakami Group: Kyushu University
- Prof. K. Murakami, Prof. K. Inoue, Prof. H. Honda
Yoshikawa Group: Yokohama National University
- Prof. N. Yoshikawa, Prof. Y. Yamanashi
Akaike Group: Nagoya University
- Prof. H. Akaike, Prof. A. Fujimaki, Prof. M. Tanaka
Nagasawa Group: ISTEC-SRL
(Superconductivity Research Laboratory, International Superconductivity Technology Center)
- Mr. S. Nagasawa, Dr. M. Hidaka
Aim of the research
Developing basic technologies for energy-efficient, high-performance computers, e.g., a 10 Tflops desk-side computer, using superconducting single-flux-quantum (SFQ) circuits and adopting the processor architecture called ‘large-scale reconfigurable data-paths (RDPs).’
[Figure: a 10 TFlops computer. By technologies at 2006: a parallel computer in 90nm CMOS. By developing basic technologies: an LSRDP in SFQ (0.5~0.35µm).]
Our approach
Reduction of power consumption of conventional circuit technology
Development of technologies for realizing a high-performance computer using a new low-power circuit technology SFQ
Backgrounds
Superconducting Single-Flux-Quantum Circuits
Ultra Low-Power, Ultra High-Speed
Φ0 = h/2e = 2.07 mV·ps
[Figure: an SFQ in a superconductive loop with a Josephson junction, and the SFQ pulse it produces (~1 mV, 2~3 ps).]
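The value of Φ0 quoted above can be checked directly from the physical constants. The following is a minimal sketch, assuming the exact SI (2019) values of h and e; the function names are illustrative and the rectangular-pulse estimate is a simplification, since a real SFQ pulse is not rectangular.

```python
# Exact SI (2019) values of the Planck constant and the elementary charge.
PLANCK_H = 6.62607015e-34       # J*s
ELEMENTARY_E = 1.602176634e-19  # C


def flux_quantum_wb():
    """Return the magnetic flux quantum h/2e in webers (V*s)."""
    return PLANCK_H / (2 * ELEMENTARY_E)


def wb_to_mv_ps(flux_wb):
    """Convert V*s to mV*ps (1 V*s = 1e3 mV * 1e12 ps)."""
    return flux_wb * 1e15


def rectangular_pulse_width_ps(height_mv):
    """Width of an idealized rectangular pulse whose area equals one flux quantum."""
    return wb_to_mv_ps(flux_quantum_wb()) / height_mv


phi0 = wb_to_mv_ps(flux_quantum_wb())
print(f"Phi0 = {phi0:.2f} mV*ps")
print(f"Width at 1 mV: {rectangular_pulse_width_ps(1.0):.2f} ps")
```

Because the time integral of the voltage across the junction equals Φ0 for one SFQ pulse, a pulse of about 1 mV has to last about 2 ps, which is in line with the 2~3 ps shown on the slide.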
SFQ technologies at 2006
- Conventional Nb 4-layer 2µm fabrication process
–Cell-based design, logic cell library
–Automatic routing by Josephson transmission line
–SFQ-LSIs with more than 10,000 JJs
- Development of a new 1µm fabrication process
–Nb 6 layers
–No design environment
- Development of passive transmission line (PTL) technology
–High-speed inner-chip data transfer
[Figure: an SFQ pulse on a superconducting micro-strip line.]
Reconfigurable Data-Path (RDP) processor
- Reconfigurable data-path
–Many floating-point units (FPUs)
–Reconfigurable operand routing networks (ORNs)
–Dynamic reconfiguration
- Features
–Reconfiguring the data-path by routing ORNs to fit the processing of a loop in large-scale numerical computation
–Parallel and pipelined processing
–Input/output data is transferred from/to memory in bursts
[Figure: RDP processor block diagram. A General Purpose Processor (GPP) and main memory connect through an I/O port and a Streaming Memory Access Controller (SMAC) to an array of FPUs (processing elements, PEs) interleaved with Operand Routing Networks (ORNs).]
Research subjects
- 1. SFQ fabrication process and circuit design environments
(1) Nb multi-layer 1µm fabrication process (Nagasawa G.)
(2) Logic cell library for the 1µm process (Yoshikawa G. and Akaike G.)
(3) CAD for SFQ digital circuit design (Takagi G.)
- 2. SFQ-FPUs and SFQ-RDP prototypes
(1) SFQ-FPUs (Yoshikawa G. and Takagi G.)
–Half-precision FPA and FPM operating at 25GHz (2µm process)
–FPA and FPM operating at 50GHz (1µm process)
(2) SFQ-RDP prototypes (ALU+ORN) (Akaike G.)
–2x2 SFQ-RDP operating at 25GHz (2µm process)
–4x4 SFQ-RDP operating at 50GHz (1µm process)
- 3. RDP architecture (Murakami G.)
RDP architecture, RDP compiler, RDP-oriented algorithms
Results of the research
- 1. Fabrication process and design environment
Development of a Nb 9-layer 1µm fabrication process
[Figure: device structure of the Nb 9-layer process on a Si substrate. Nb layers: M1 (DCP), M2 (GND1), M3 (PTL1), M4 (GND2), M5 (PTL2), M6 (GND3), M7 (GP), M8 (BAS), M9 (COU), separated by SiO2 and connected by contacts C1-C6, BC, GC, RC, and JC. The junction (JJ, AlOx barrier) and the resistor (RES1) sit in the upper layers. Labeled Nb and SiO2 layer thicknesses range from 150 nm to 400 nm. A complemented planarization layer is indicated.]
Cross-sectional SEM photograph
Excellent flatness was obtained even though the step edges of several underlying patterns are overlapped.
Layer functions: active layers including junctions and resistors; main ground plane; 1st PTL layer; 2nd PTL layer; DC power layer
Nb layers for M1-M7 are planarized.
Shift registers for evaluation of the Nb 9-layer 1µm process
Measurement results (best chip)
- Only three defects
- Correct operation of 13 of 16 circuits
- Correct operation of all 2560-bit shift registers with 10,281 JJs
[Figure: chip layout showing the positions of the 2560-bit, 1280-bit, 640-bit, 160-bit, 64-bit, and 16-bit shift registers (SRs).]
Chip size: 8.5mm x 7.0mm
Design
16 circuits
- Two 16-bit shift registers
- Two 64-bit shift registers
- Two 160-bit shift registers
- Two 640-bit shift registers
- Four 1280-bit shift registers
- Four 2560-bit shift registers
68,990 JJs in total
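The design list can be cross-checked with a few lines of arithmetic. This is a sketch only: the register counts and JJ totals are copied from the slide, while reading the 10,281 JJs as the count for a single 2560-bit register is an assumption, and the variable names are illustrative.

```python
# (number of registers, bits per register), copied from the design list.
DESIGNS = [(2, 16), (2, 64), (2, 160), (2, 640), (4, 1280), (4, 2560)]
TOTAL_JJS = 68_990         # total junction count stated on the slide
JJS_PER_2560_BIT = 10_281  # assumption: JJ count of one 2560-bit register


def summarize(designs):
    """Return (number of circuits, total number of shift-register bits)."""
    circuits = sum(count for count, _ in designs)
    bits = sum(count * width for count, width in designs)
    return circuits, bits


circuits, total_bits = summarize(DESIGNS)
print(f"circuits: {circuits}, total bits: {total_bits}")
print(f"JJs per bit over the whole chip: {TOTAL_JJS / total_bits:.2f}")
print(f"JJs per bit in a 2560-bit register: {JJS_PER_2560_BIT / 2560:.2f}")
```

Both ratios come out close to 4 JJs per bit, which suggests the two junction counts on the slide are mutually consistent under that assumption.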
Development of a logic cell library for the 9-layer process
[Figures: a microphotograph of a dffc2 cell and the basic structure of a logic cell (30 µm x 30 µm), showing the PTL1 and PTL2 layers, the bias pillar, and the ground contact.]
A 4x4 switch by the 9-layer 1µm process and the cell library
- Operation up to 112 GHz (world's highest)
- Total power consumption: 660 µW
- Total number of JJs: 3362
- The number of vias: 434
[Figure: layouts of a functional block in the two processes, with the upper PTL, lower PTL, and via holes labeled.]
Circuit area ratio, conventional Nb 4-layer 2µm process : new Nb 9-layer 1µm process = 1 : 0.19
(Cell size: 40µm x 40µm → 30µm x 30µm)
Area reduction by 81% compared to the conventional Nb 4-layer 2µm fabrication process
Device density and operating frequency in LSIs
[Plot: axes are Frequency (GHz), 0.01 to 1000, and Device Density (Trs/cm2), 1 to 10^12. Plotted technologies: Si Bip, Si MOSFET, GaAs MESFET, GaAs HEMT, GaAs HBT, SiGe HBT, InP HEMT, InP HBT. Limits shown: long interconnect delay, power density for CMOS, power density for compound semiconductors. SFQ points: previous SFQ in Japan, present SFQ in USA, and results demonstrated in CREST (4x4 SW, 2x2 RDP).]
SFQ LSIs developed in this project have reached a region that semiconductors cannot reach.
Energy consumption for a device used in LSIs
[Plot: axes are Clock Period (ps), 1 to 10^4, and Energy Consumption (J), 10^-23 to 10^-13. Reference levels: thermal energy @4K and thermal energy @350K. Plotted: present CMOS, SFQ in Japan before 2005, present SFQ in USA, results demonstrated in CREST (2x2 RDP, FPM/FPA, 4x4 SW), and primitives that will be demonstrated in CREST.]
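The two thermal-energy reference levels in the plot are not given numerically on the slide. A minimal sketch, assuming they denote k_B·T at the stated temperatures (the function name is illustrative):

```python
BOLTZMANN_K = 1.380649e-23  # J/K, exact SI (2019) value


def thermal_energy_j(temperature_k):
    """Return k_B * T in joules for a temperature in kelvin."""
    return BOLTZMANN_K * temperature_k


e_4k = thermal_energy_j(4.0)
e_350k = thermal_energy_j(350.0)
print(f"k_B*T at 4 K:   {e_4k:.2e} J")
print(f"k_B*T at 350 K: {e_350k:.2e} J")
print(f"ratio: {e_350k / e_4k:.1f}")
```

Under this reading the 4 K level lies just above the bottom of the energy axis (10^-23 J) and the 350 K level is 87.5 times higher.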
Design flow of SFQ LSIs
[Figure: design flow]
Flow elements: Design Entry, Logic Synthesis, Logic Netlist, Placement, Placed Cells & Connections, Routing, Mask Layout
Supporting elements: Specification & Constraints, Technology Library, P&R Library, Cell & Wire Geometry, Constraints & Violations, Layout Viewer, Timing Verification, Logic Simulator, Static Timing Analyzer
Specific synthesis subsystems for SFQ logic circuits
- Sequential circuit synthesis
- Clock scheduling and distribution
- Asynchronous logic synthesis
Layout-driven design
Precise timing analysis
Process unique to SFQ circuits
Verification of pipeline operations
Development of design tools
- Designed a sample circuit: 8-bit carry lookahead adder
- Verified correct operations
Tools: clock tree synthesis, semi-automatic placement, automatic routing
8-bit CLA: 158 gates, 9 levels, concurrent-flow clocking, 7092 JJs, 598 PTLs
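For readers unfamiliar with the sample circuit, the logic of a carry lookahead adder can be sketched in software. This is a bit-level model of the generic textbook technique, assuming fully expanded carry equations; it says nothing about the gate-level structure, the 9 logic levels, or the clocking of the SFQ design, and the function name `cla_add` is illustrative.

```python
def cla_add(a, b, cin=0, width=8):
    """Add two width-bit integers using expanded lookahead carry equations.

    Returns (sum modulo 2**width, carry out).
    """
    # Per-bit generate (both inputs 1) and propagate (exactly one input 1).
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    carries = [cin]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ... | p[i]...p[0]cin
        # Each carry depends only on g, p, and cin, not on a lower carry.
        carry = g[i]
        chain = p[i]
        for j in range(i - 1, -1, -1):
            carry |= chain & g[j]
            chain &= p[j]
        carry |= chain & cin
        carries.append(carry)

    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]


print(cla_add(200, 100))  # 300 = 256 + 44, so this prints (44, 1)
```

In hardware the point of the expanded form is that all carries can be computed in parallel from the generate and propagate signals rather than rippling from bit to bit.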
2.SFQ-FPUs and SFQ-RDP prototypes
Half-precision FPA and FPM using the 2µm process
FPA
- Operating frequency: 20GHz
- Performance: 1.67 GFLOPS
- The number of junctions: 10244 JJs
- Power consumption: 3.5 mW
- Circuit area: 5.86 x 5.72 mm2
FPM
- Operating frequency: 32GHz
- Performance: 2.6 GFLOPS
- The number of junctions: 11044 JJs
- Power consumption: 3.5 mW
- Circuit area: 6.22 x 3.78 mm2
[Figures: chip micrographs of the FPA and FPM (scale bar: 1mm). Labeled blocks include Shifter of A, Shifter of B, Controller, Adder & Subtractor, Multiplier, Normalizer, Clock Generator, Shift Register of Significands, Shift Register of Exponent and Sign, and Shift Registers for Confirmation.]
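A few derived figures help compare the two units. The sketch below uses only the numbers listed for the 2µm FPA and FPM; the derived quantities (clock cycles per operation, GFLOPS per watt, power per junction) are not stated on the slide, and the power figures are taken at face value as listed.

```python
# Measured figures copied from the slide for the 2 um process chips.
FPUS = {
    "FPA": {"freq_ghz": 20.0, "gflops": 1.67, "power_mw": 3.5, "jjs": 10244},
    "FPM": {"freq_ghz": 32.0, "gflops": 2.6, "power_mw": 3.5, "jjs": 11044},
}


def derived_figures(spec):
    """Return clock cycles per operation, GFLOPS per watt, and microwatts per JJ."""
    cycles_per_op = spec["freq_ghz"] / spec["gflops"]
    gflops_per_watt = spec["gflops"] / (spec["power_mw"] * 1e-3)
    uw_per_jj = spec["power_mw"] * 1e3 / spec["jjs"]
    return cycles_per_op, gflops_per_watt, uw_per_jj


for name, spec in FPUS.items():
    cycles, efficiency, per_jj = derived_figures(spec)
    print(f"{name}: {cycles:.1f} cycles/op, {efficiency:.0f} GFLOPS/W, "
          f"{per_jj:.2f} uW/JJ")
```

Both units come out at roughly 12 clock cycles per floating-point operation.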
FPA and FPM using the 1µm process
[Figure: block diagram of bit-serial FPM. Blocks: operation circuit for the significand part (systolic array multiplier), operation circuit for the exponent part.]
[Figure: micrograph of 10-bit bit-serial FPM. Blocks: Clock Generator, Significand Processing Circuit, Exponent Processing Circuit, Normalizer, Shift Register for Input, Shift Register for Output.]
Circuit size: 3.510 mm x 2.16 mm
Circuit area: 7.58 mm2
Junction count: 6157 JJs
[Figure: 50-GHz test results of 4b multiplier, measurement and simulation (9% annotated).]
2x3 SFQ-RDP prototype using the 2µm process
- 6 ALUs
- Clock frequency: 23 GHz
- Junction count: 14040 (world's largest integration scale)
- Circuit area: 6.84 x 6.72 mm2
CONNECT
cooperated with SRL, NiCT, NU & YNU
*SRL Nb 2.5 kA/cm2 standard process
2x2 SFQ-RDP prototype using the 1µm process
- Clock frequency: 45GHz
- Power dissipation: 3.4mW
- Junction count: 11458
- Circuit area: 5.61 x 2.82 mm2
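[Figure: block diagram and micrograph (scale bar: 1mm) of the 2x2 SFQ-RDP. Inputs IN1-IN4 and outputs OUT1-OUT4; ALU1 (SUB), ALU2 (XOR), ALU3 (ADD), ALU4 (AND); transfer units (TU); ORN1 and ORN2. Micrograph blocks: ORN1, ORN2, ALU1a, ALU1b, ALU2a, ALU2b.]
TU: Transfer Unit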
ORN architecture and a crossbar switch for 2-bit-wide data streams
Number of rows = 1.5×M
Number of columns = 4×MCL + 1
ORN with MCL = 2
- Junction count: 547
- Clock frequency: 65GHz
- Power consumption: 0.14mW
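The two sizing formulas can be evaluated for concrete parameters. A minimal sketch, assuming M is the number of FPUs in an array and MCL the maximum connection length, as the deck defines them; the function name `orn_dimensions` is illustrative.

```python
def orn_dimensions(m, mcl):
    """Return (rows, columns) from the slide's formulas.

    rows    = 1.5 * M
    columns = 4 * MCL + 1
    """
    rows = 1.5 * m
    columns = 4 * mcl + 1
    return rows, columns


# The fabricated ORN uses MCL = 2; M = 24 and M = 48 with MCL = 5 are the
# RDP-M and RDP-L parameters listed elsewhere in the deck.
for m, mcl in [(24, 2), (24, 5), (48, 5)]:
    rows, columns = orn_dimensions(m, mcl)
    print(f"M={m}, MCL={mcl}: {rows:g} rows x {columns} columns")
```

The row count scales with the array width M, while the column count depends only on how far an operand may travel sideways.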
                        M    MCL   JJ count   Power dissipation (mW)
RDP-M (Middle-scale)    24   5     307692     7.7
RDP-L (Large-scale)     48   5     615384     15
Operating region
M: Number of FPUs in an array, MCL: Maximum Connection Length
Micro-architecture:
- Two types of FPUs: FPA and FPM
- FPU: three inputs (A, B, C) → three outputs (A(*B), B, C)
- Arrangement of FPUs: alternate
- Three types (scales) of RDP (Small, Medium and Large-Scales)
[Figure: ORN connections. PE (i, j) connects through the ORN to PEs (i, j+1), (i+1, j+1), (i+2, j+1), ..., (i+L, j+1), where MCL = L. PEs are drawn as FPUs with transfer units (TUs).]
RDP parameters (optimized by total number of JJs)
         # Input   # Output   Width   Height   MCL   Total JJs (∝ RDP size)
RDP-S    19        12         22      14       4     19387K
RDP-M    19        12         24      17       5     27027K
RDP-L    38        24         41      34       6     96374K
- 3. Development of RDP Architecture
TU:Data Through
Development of RDP Compiler
[Figure: compiler tool flow with two paths starting from the application C code]
Modifying application code (manual): inserting LSRDP instructions in the code
1: flow of generation of assembly codes for GPP
- Application C code → Modified code → ISAcc or COINS compiler → .asm code for MIPS-based GPP
2: flow of generation of configuration bit-stream for RDP
- Modified code → DFG extraction (semi-manual) → Data flow graphs → Placement and Routing Tool → Configuration file + various text and schematic reports
Other inputs shown: RDP library file (functions definition & declaration), RDP architecture description
Final stage: Simulator, performance evaluation
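To make the data flow graph step more concrete, the sketch below levelizes a small DFG into rows, which is one simple way to picture how operations could be laid out on successive rows of FPUs. It is a hypothetical illustration and not the project's placement and routing tool: the dictionary format, the example graph, and the assumption that an operand crossing several rows needs one pass-through per skipped row are all assumptions made here.

```python
def levelize(dfg, inputs):
    """Assign each node the earliest row whose operands are all available.

    dfg maps an operation name to the list of its operand names;
    inputs are the external operands, placed at level 0.
    """
    level = {name: 0 for name in inputs}

    def visit(node):
        if node not in level:
            level[node] = 1 + max(visit(src) for src in dfg[node])
        return level[node]

    for node in dfg:
        visit(node)
    return level


def count_pass_throughs(dfg, level):
    """Count rows each operand must be forwarded through before it is used."""
    return sum(level[dst] - level[src] - 1
               for dst, srcs in dfg.items() for src in srcs)


# Hypothetical example: out = u_c + r * ((u_l + u_r) - two * u_c)
example_inputs = ["u_l", "u_c", "u_r", "r", "two"]
example_dfg = {
    "s": ["u_l", "u_r"],
    "d": ["two", "u_c"],
    "e": ["s", "d"],
    "f": ["r", "e"],
    "out": ["u_c", "f"],
}

levels = levelize(example_dfg, example_inputs)
print(levels)
print("pass-throughs:", count_pass_throughs(example_dfg, levels))
```

In this example the graph needs four rows, and the operands `r` and `u_c` have to be carried across rows where they are not consumed, which is the kind of role the deck's transfer units (TUs) appear to play.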
Development of RDP-Oriented Algorithms
- One-dimensional heat and vibrational equations
- Two-dimensional heat and FDTD equations
- Two-electron repulsion integral calculation in quantum chemistry
- Runge-Kutta calculation for ordinary differential equations
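As an illustration of the kind of loop body such algorithms contain, here is a minimal sketch of one time step of the two-dimensional heat equation using the explicit five-point finite-difference update. This is a common textbook discretization chosen for illustration; the deck does not say which scheme the authors used, and the coefficient name `r` is an assumption.

```python
def heat_step(u, r):
    """Advance a 2-D temperature grid by one explicit time step.

    u is a list of rows; boundary values are held fixed.
    r is the diffusion number (alpha * dt / dx**2).
    Each interior point costs 7 floating-point operations:
    3 additions for the neighbor sum, 1 multiplication (4 * u),
    1 subtraction, 1 multiplication by r, and 1 final addition.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbors = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            new[i][j] = u[i][j] + r * (neighbors - 4.0 * u[i][j])
    return new


grid = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(heat_step(grid, 0.25))
```

Every interior point applies the same short arithmetic pattern to its neighbors, so the inner loop body is a fixed data flow graph that can be evaluated in a parallel, pipelined fashion.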
Performance Estimation
Two-dimensional heat equation (1024x1024 mesh)
SFQ-RDP: 50.6GFlops vs. GPU1): 63.0GFlops
Estimation method:
- RDP: execution time model
–DFG has 21 inputs, 9 outputs, and 63 operations
–BW: 159.0GB/s
- GPP: cycle-accurate processor simulator
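One reading that reproduces the DFG figures is a k x k tile of mesh points updated with a five-point stencil at 7 operations per point. This is an assumption offered as a plausibility check, not something the slide states; the function name and the tiling interpretation are hypothetical.

```python
def tile_counts(k, ops_per_point=7):
    """Return (inputs, outputs, operations) for a k x k five-point stencil tile.

    Assumptions: each output point needs itself and its four neighbors,
    so the tile reads its own k*k points plus k points along each of its
    four sides, and each point costs ops_per_point operations.
    """
    outputs = k * k
    inputs = k * k + 4 * k
    operations = ops_per_point * outputs
    return inputs, outputs, operations


print(tile_counts(3))  # (21, 9, 63)
```

With k = 3 the counts match the 21 inputs, 9 outputs, and 63 operations listed above, provided any stencil coefficient is treated as a constant rather than as a DFG input.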
1) T. Aoki and A. Nukada, "CUDA programming: beginners," Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).
Summary
- 1. SFQ fabrication process and circuit design environments
(1) We have established a Nb 9-layer 1µm fabrication process.
(2) We have developed a logic cell library for the 1µm process.
(3) We have developed several CAD tools for the 1µm process.
It is possible to design and fabricate large scale SFQ circuits.
- 2. SFQ-FPUs and SFQ-RDP prototypes
(1) We have fabricated a half-precision FPA and FPM by the 2µm process, which have operated at 20GHz and 35GHz, respectively. We are designing an FPA and FPM by the 1µm process.
(2) We have fabricated a 2x3 SFQ-RDP prototype by the 2µm process, which has operated at 23GHz. We have fabricated a 2x2 SFQ-RDP prototype by the 1µm process, which has operated at 45GHz. We are developing a 4x4 SFQ-RDP prototype by the 1µm process.
It is possible to realize SFQ-RDPs.
- 3. RDP architecture
(1) We have determined architectural specifications of SFQ-RDP.
(2) We have developed a compiler for RDP.
(3) We have developed RDP-oriented algorithms for several applications and have estimated the performance of SFQ-RDP.