3D-MAPS: 3D Massively Parallel Processor With Stacked Memory



slide-1
SLIDE 1

Dae Hyun Kim¹, Krit Athikulwongse¹, Michael B. Healy¹, Mohammad M. Hossain¹, Moongon Jung¹, Ilya Khorosh¹, Gokul Kumar¹, Young-Joon Lee¹, Dean L. Lewis¹, Tzu-Wei Lin¹, Chang Liu¹, Shreepad Panth¹, Mohit Pathak¹, Minzhen Ren¹, Guanhao Shen¹, Taigon Song¹, Dong Hyuk Woo¹, Xin Zhao¹, Joungho Kim², Ho Choi³, Gabriel H. Loh¹, Hsien-Hsin S. Lee¹, and Sung Kyu Lim¹

¹ Georgia Institute of Technology, Atlanta, USA
² Korea Advanced Institute of Science and Technology, Daejon, Korea
³ Amkor Technology, Seoul, Korea

3D-MAPS: 3D Massively Parallel Processor With Stacked Memory

IEEE ISSCC 2012 Presentation

slide-2
SLIDE 2

2/31

Agenda

  • Objective and Overview
  • TSV and Stacking Technology
  • Design
      – Architecture, layouts, and design analysis
  • Testing
      – Die photos, package, board, and testing infrastructure
  • Measurement Results
  • Ongoing Works
  • Conclusions
slide-3
SLIDE 3

3/31

Objective

  • Papers on TSV modeling and manufacturing: many
  • Papers on CAD tools: some
  • Papers on architecture and application: few
  • Papers on test chips: few
      – Neuromorphic vision chip, Tohoku Univ [ISSCC’01]
      – Inductive coupling, Keio Univ [ISSCC’08]
      – DDR3 DRAM, Samsung [ISSCC’09]
      – Design-for-Reliability, IMEC [ISSCC’10]
      – Wide-I/O DRAM, Samsung [ISSCC’11]
  • Objective: build the first general-purpose many-core 3D processor
slide-4
SLIDE 4

4/31

3D-MAPS: An Overview

  • 3D MAssively Parallel processor with Stacked memory
  • 130nm GLOBALFOUNDRIES + Tezzaron F2F bonding
  • 64 cores, 5-stage/2-way VLIW architecture
  • 256KB SRAM, 1-cycle access
  • 5mm X 5mm, 230 IO cells
  • 277MHz Fmax, 1.5V Vdd
  • 64GB/s memory BW @ 4W

Figure legend: P/G F2F (gray), TSV (navy), signal F2F (red)

  • TSV: 50K used for IO & dummy
  • TSV: 1.2um diameter, 5um pitch
  • F2F: 50K used for memory access
  • F2F: 3.4um diameter, 5um pitch
slide-5
SLIDE 5

5/31

Tezzaron 3D Stack-up

  • 2 logic tiers, face-to-face bonded
      – Top die thinned to 12um, bottom die is 765um
      – GLOBALFOUNDRIES 130nm technology + Artisan library/IP

slide-6
SLIDE 6

6/31

3D MAPS Core Architecture

  • 2-issue (memory/ALU), 5-stage VLIW
      – Single-cycle memory access at every cycle

Figure labels: single-cycle 3D memory, memory pipeline, ALU pipeline

slide-7
SLIDE 7

7/31

V1 Full-die Layouts

Figure labels:

  • 64 cores + 235 IO cells (on periphery)
  • 64 SRAM memory tiles (64 x 4KB)
  • core-to-core wires

slide-8
SLIDE 8

8/31

Face-to-face Via Usage

  • Spec: 3.4um diameter, 5um pitch, negligible RC
      – Usage: 64 for signal, 684 for P/G per core

Figure labels: single core, single SRAM tile (4KB), signal F2F, P/G F2F
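
The per-core usage can be cross-checked against the roughly 50K F2F vias quoted on the overview slide. The sketch below is only a sanity check: it assumes all 64 cores use identical via counts and that no F2F vias exist outside the core/SRAM tile pairs, neither of which the slide states.

```python
# Per-core face-to-face (F2F) via usage quoted on the slide.
SIGNAL_F2F_PER_CORE = 64
PG_F2F_PER_CORE = 684
NUM_CORES = 64  # core count from the overview slide


def f2f_totals(cores=NUM_CORES, signal=SIGNAL_F2F_PER_CORE, pg=PG_F2F_PER_CORE):
    """Return (signal, power/ground, total) F2F via counts for the whole die.

    Assumption: every core/SRAM-tile pair uses the same number of vias.
    """
    total_signal = cores * signal
    total_pg = cores * pg
    return total_signal, total_pg, total_signal + total_pg


signal_total, pg_total, all_total = f2f_totals()
print(f"signal F2F: {signal_total}")
print(f"P/G F2F:    {pg_total}")
print(f"total F2F:  {all_total}")
```

Under these assumptions the total comes to 47,872 vias, in line with the "~50K" figure, and most of them carry power and ground rather than signals.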

slide-9
SLIDE 9

9/31

Through-Silicon-Via Usage

  • Spec: 1.2um diameter, 5um pitch, R = 0.6ohm, C = 3fF
      – Usage: mainly in IO cells
      – 204 redundant TSVs in each IO cell
      – 53 dummy TSVs per core

Figure labels: IO cells along the periphery, IO cell (zoom-in), 12x17=204 P/G TSV array
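
The TSV numbers can be tallied in the same way. This is a rough sketch, not the authors' accounting: it assumes every IO cell carries a full 12x17 array and every core has exactly 53 dummy TSVs. Because the deck quotes both 230 and 235 IO cells on different slides, the IO cell count is left as a parameter and both values are tried. The RC product is the simple lumped value from the quoted R and C.

```python
TSVS_PER_IO_CELL = 12 * 17   # 204-TSV array per IO cell (from the slide)
DUMMY_TSVS_PER_CORE = 53     # from the slide
NUM_CORES = 64               # from the overview slide


def tsv_total(io_cells, cores=NUM_CORES):
    """Total TSV count, assuming identical IO cells and identical cores."""
    return io_cells * TSVS_PER_IO_CELL + cores * DUMMY_TSVS_PER_CORE


def tsv_rc_seconds(r_ohm=0.6, c_farad=3e-15):
    """Lumped RC time constant of one TSV from the quoted R and C."""
    return r_ohm * c_farad


# The deck mentions 230 IO cells on one slide and 235 on another.
for io_cells in (230, 235):
    print(f"{io_cells} IO cells -> {tsv_total(io_cells)} TSVs")

print(f"TSV RC = {tsv_rc_seconds() * 1e15:.1f} fs")
```

Either IO cell count gives a total close to the "~50K" TSVs quoted in the overview, and the femtosecond-scale RC product is tiny compared with the 3.6ns critical path reported in the timing analysis.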

slide-10
SLIDE 10

10/31

Timing Closure and Power Delivery

Figure labels:

  • P/G rings for the cores
  • Buffers and gates in between cores
  • P/G rings for SRAM tiles
  • Decap cells attached to P/G rings

slide-11
SLIDE 11

11/31

3D CAD Tools and Methodologies

  • Commercial 3D tools are NOT available
  • We started with 2D tools and added scripts & plug-ins
      – 3D layout construction: Encounter
      – 3D timing optimization: Encounter + PrimeTime
      – 3D timing and SI analysis: CeltIC + PrimeTime
      – 3D power analysis: ModelSim + Encounter
      – 3D clock analysis: Encounter + SPICE
      – 3D IR-drop analysis: VoltageStorm
      – 3D thermal analysis: ANSYS + Fluent
      – 3D DRC/LVS: Calibre
  • Used to design V1, V2, and more
slide-12
SLIDE 12

12/31

3D Static Timing Analysis with SI

Flow diagram (legend: Cadence, Synopsys, in-house). Boxes:

  • Layout: Die0 (Encounter)
  • Layout: Die1 (Encounter)
  • Die0/1 Verilog netlists (updated)
  • Die0/1 SPEF (QRC Extractor)
  • Top-level Verilog netlist
  • Top-level SPEF (for F2F)
  • Stitched SPEF
  • 3D STA (PrimeTime)
  • 3D SI noise analysis (CeltIC)

slide-13
SLIDE 13

13/31

3D Timing Analysis

  • Our worst-case path has 3.6ns delay, so Fmax = 277MHz
      – RF-to-memory write path: stage 2/3 FF – MUX – ADD – MUX – DMEM_ADDR
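
The step from path delay to Fmax is a plain reciprocal. The sketch below assumes the 3.6ns figure is the full clock period bound, with setup time and skew already folded in, which the slide does not spell out.

```python
def fmax_mhz(critical_path_ns):
    """Maximum clock frequency in MHz for a given worst-case path delay in ns.

    Assumption: the delay already includes setup time and clock skew.
    """
    if critical_path_ns <= 0:
        raise ValueError("critical path delay must be positive")
    return 1e3 / critical_path_ns


print(f"Fmax = {fmax_mhz(3.6):.1f} MHz")
```

This gives about 277.8 MHz, which the slide rounds down to 277MHz.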

slide-14
SLIDE 14

14/31

3D Signal Integrity Analysis

  • We analyze both 2D and 3D nets
      – Noise on all nets is below 500mV: the 5um F2F pitch was sufficient

slide-15
SLIDE 15

15/31

3D IR-drop Analysis

  • Can handle di/dt noise as well

Flow diagram (legend: Cadence, Mentor, in-house). Boxes:

  • Layout: 2D dies (Encounter)
  • 3D LEF/DEF/GDS: merge 2D files
  • 3D ICT file: define all layers
  • 3D tech file: cap (Techgen)
  • VCD file: switching activity (ModelSim)
  • Power Analysis (Encounter)
  • Rail Analysis (VoltageStorm)

3D IR-drop analysis steps:

  • 1. RC network generation for PDN
  • 2. Inserting current sources
  • 3. P/G grid analysis
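
The three steps can be illustrated on a toy problem. The sketch below is not the VoltageStorm flow: it assumes a hypothetical one-dimensional, purely resistive power rail fed by ideal pads at both ends, with made-up segment resistance and load current. It only shows how a network, current sources, and a grid solve fit together, and why loads far from the pads see the largest drop.

```python
def rail_ir_drop(num_nodes, r_segment_ohm, i_node_amp):
    """IR drop (volts) at each tap of a 1-D rail fed by ideal pads at both ends.

    Step 1 (network): num_nodes taps joined by equal resistors, plus one
    resistor from each end tap to an ideal supply pad.
    Step 2 (current sources): every tap sinks the same current.
    Step 3 (grid analysis): solve the tridiagonal KCL system
        -d[k-1] + 2*d[k] - d[k+1] = i_node_amp * r_segment_ohm
    with zero drop at the pads, using the Thomas algorithm.
    """
    if num_nodes < 1:
        raise ValueError("need at least one node")
    rhs = i_node_amp * r_segment_ohm

    # Forward elimination (sub- and super-diagonal are -1, diagonal is 2).
    c_prime = [0.0] * num_nodes
    d_prime = [0.0] * num_nodes
    c_prime[0] = -1.0 / 2.0
    d_prime[0] = rhs / 2.0
    for k in range(1, num_nodes):
        denom = 2.0 + c_prime[k - 1]
        c_prime[k] = -1.0 / denom
        d_prime[k] = (rhs + d_prime[k - 1]) / denom

    # Back substitution.
    drop = [0.0] * num_nodes
    drop[-1] = d_prime[-1]
    for k in range(num_nodes - 2, -1, -1):
        drop[k] = d_prime[k] - c_prime[k] * drop[k + 1]
    return drop


# Hypothetical values chosen only for illustration.
drops = rail_ir_drop(num_nodes=8, r_segment_ohm=0.01, i_node_amp=0.5)
for k, d in enumerate(drops):
    print(f"node {k}: {d * 1e3:.1f} mV")
```

In this toy rail the drop peaks at the middle taps and is smallest next to the pads. A real 3D power delivery network is a two-die mesh with TSVs and F2F vias, so actual numbers come from the rail analysis tool.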
slide-16
SLIDE 16

16/31

3D IR-drop Analysis

  • Single-core: clock buffers are power hungry (60mV)
  • 64-core: cores in the middle experience more IR-drop (78mV)
slide-17
SLIDE 17

17/31

DFT Infrastructure

  • 64 cores split into 4 sectors, tested independently
      – Scan IO pins located on one side
      – Testing circuitry sitting in between the cores

slide-18
SLIDE 18

18/31

3D-MAPS Die Photos

slide-19
SLIDE 19

19/31

SEM Images

slide-20
SLIDE 20

20/31

IR Images

slide-21
SLIDE 21

21/31

Amkor Packaging

slide-22
SLIDE 22

22/31

Testing Infrastructure

Figure labels: 3D-MAPS V1, Xilinx ML605, Agilent 16804A

slide-23
SLIDE 23

23/31

Sample Bit Stream: 3D Interface Test

  • Data memory R/W works: TSVs and F2Fs work

Figure labels: test vectors, test responses, expected results

slide-24
SLIDE 24

24/31

Programming Environment

  • No OS/compiler yet

C code:

    // histogram 64-core version
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        if ((argc != 2) && (argc != 3)) {
            printf("Usage: %s <input> [<output>]\n", argv[0]);
            return 0;
        }

        int histogram[256], i;
        for (i = 0; i < 256; i++)
            histogram[i] = 0;

        FILE *input;
        if ((input = fopen(argv[1], "r")) == NULL) {
            printf("%s does not exist\n", argv[1]);
            return 0;
        }

        /* fgetc returns each byte as 0..255, so the index stays in range */
        int ch;
        while ((ch = fgetc(input)) != EOF)
            histogram[ch]++;
        fclose(input);
        .......
    }

Assembly code:

    .......
            movi $r21, WEST
            movi $r1, 0
            movi $r2, 512
    FORWARD_COUNTER_LEFT:
            beq $r1, $r2, DONE
            BARRIER
            LW_I $r7, $r1, 0
            movi $r18, 0
    CASCADE_LEFT:
            beq $r18, $r29, DONE_CASCADE_LEFT
            SW_BUF $r7, $r21
            LW_BUF $r6, $r20
            LW_I $r5, $r1, 0
            add $r7, $r5, $r6
            addi $r18, $r18, 1
            jmp CASCADE_LEFT
    DONE_CASCADE_LEFT:
            bne $r31, $r0, AVOID_MEM_UP
            SW_I $r7, $r1, 0
    AVOID_MEM_UP:
            addi $r1, $r1, 4
            jmp FORWARD_COUNTER_LEFT
    .......

Bit stream:

    .......
    0001001111010000000000000000000100101111110000000000001000000000
    0001001111011110111111111111111110101111111000000000001000000100
    0001001000010000000000000000000000101111100000000000001000001000
    0001001000100000000000010000000000000000000000000000000000000000
    0000000000000000000000000000000000111000000000010000000000000100
    0000000000000000000000000000000000111000000000010000000000000100
    0000000000000000000000000000000000111000000000010000000000000100
    0000000000000000000000000000000000111000000000010000000000000100
    0000000000000000000000000000000000111000000000010000000000000100
    0000000000000000000000000000000000111000000000010000000000000100
    0000000000000000000000000000000000111000000000010000000000000100
    0000000000000000000000000000000000111000000000010000000000000100
    1011011000010001011111111111101110000000000000000000000000000000
    0001001000010000000000010000011000000000000000000000000000000000
    0001001111001110011111111111111110110000010000010000000000000001
    0000000000000000000000000000000000000000000000000000000000000000
    0001110000100001001000000000000100000000000000000000000000000000
    0000000000000000000000000000000000101100011000100000000000000000
    0011011111000000011111111111110110000000000000000000000000000000
    0001001100100000000000000000000000101100111000010000000000000000
    1011010100101110100000000000001010000000000000000000000000000000
    0000000000000000000000000000000001110000111101010000000000000000
    0000000000000000000000000000000001100000110101000000000000000000
    0000000000000000000000000000000000101100101000010000000000000000
    0011111000000000000000000001010010000000000000000000000000000000
    0011111000000000000000000001001100000000000000000000000000000000
    0001001000010000100000000000001000000000000000000000000000000000
    .......

slide-25
SLIDE 25

25/31

BW and Power Measurement

  • 64-core version of apps written in assembly
      – 3D-MAPS V1 supports 42 integer instructions
      – Max achievable BW is 277 MHz X 64 ch X 4 Bytes = 70.9 GB/s
      – Modern CPU + DDR3 BW: 25 to 30GB/s

Benchmark            Memory bandwidth   Measured power
AES encryption       49.5 GB/s          4.032 W
Edge detection       15.6 GB/s          3.768 W
Histogram            30.3 GB/s          3.588 W
K-means clustering   40.6 GB/s          4.014 W
Matrix multiply      13.8 GB/s          3.789 W
Median filter        63.8 GB/s          4.007 W
Motion estimation    24.1 GB/s          3.830 W
String search        8.9 GB/s           3.876 W
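
The peak bandwidth figure and the measured values can be combined into a utilization estimate. The sketch below only reuses the numbers on this slide. The utilization and GB/s-per-watt columns are derived here for illustration and are not reported in the deck.

```python
FREQ_HZ = 277e6        # Fmax from the slide
CHANNELS = 64          # one memory channel per core
BYTES_PER_ACCESS = 4   # 4 bytes per channel per cycle

# (memory bandwidth in GB/s, measured power in W), copied from the slide.
MEASURED = {
    "AES encryption": (49.5, 4.032),
    "Edge detection": (15.6, 3.768),
    "Histogram": (30.3, 3.588),
    "K-means clustering": (40.6, 4.014),
    "Matrix multiply": (13.8, 3.789),
    "Median filter": (63.8, 4.007),
    "Motion estimation": (24.1, 3.830),
    "String search": (8.9, 3.876),
}


def peak_bandwidth_gbs(freq_hz=FREQ_HZ, channels=CHANNELS, width=BYTES_PER_ACCESS):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes), one access per cycle."""
    return freq_hz * channels * width / 1e9


def utilization(measured_gbs, peak_gbs):
    """Fraction of the peak bandwidth that a benchmark achieved."""
    return measured_gbs / peak_gbs


peak = peak_bandwidth_gbs()
print(f"peak: {peak:.1f} GB/s")
for name, (bw, power) in MEASURED.items():
    print(f"{name:20s} {utilization(bw, peak):6.1%} of peak, {bw / power:5.2f} GB/s per W")
```

The median filter, at 63.8 GB/s, reaches roughly 90% of the 70.9 GB/s peak, which is the source of the "64GB/s @ 4W" headline number.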

slide-26
SLIDE 26

26/31

Frequency and Voltage Sweep

  • Frequency vs power (voltage = 1.5V)
  • Voltage vs power (frequency = 250MHz)
slide-27
SLIDE 27

27/31

Ongoing Work: 3D-MAPS V2

  • Through MOSIS/Tezzaron 3D IC MPW (taped out: Oct 2011)

                    3D-MAPS V1            3D-MAPS V2
# of tiers          2 (1 logic, 1 SRAM)   5 (2 logic, 3 DRAM)
# of cores          64                    128
Memory capacity     256KB SRAM            256MB DRAM & 512KB SRAM
Logic footprint     5mm X 5mm             10mm X 10mm
DRAM footprint      -                     20mm X 12mm
Bonding style       F2F                   F2F and F2B
TSV/F2F usage       ~50K / ~50K           ~150K / ~185K
Memory access*      2048 bit/cycle SRAM   4096 bit/cycle SRAM, 2048 bit/cycle DRAM (DDR)
freq / power        277MHz / 4.0W         175MHz / 10.4W

* Current wide-I/O allows 512 bit/cycle DRAM access
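
The "bit/cycle" entries can be related to the core counts and converted to bandwidth. The sketch below is a back-of-the-envelope estimate: it assumes one access per cycle at the listed frequency, and it does not model whatever the "(DDR)" note implies for the DRAM interface. The V2 figures it prints are derived here and do not appear in the deck.

```python
def bits_per_core(bits_per_cycle, cores):
    """Memory interface width available to each core, in bits per cycle."""
    return bits_per_cycle // cores


def bandwidth_gbs(bits_per_cycle, freq_mhz):
    """Bandwidth in GB/s (1 GB = 1e9 bytes), one access per cycle.

    Assumption: the memory interface runs at the listed core frequency.
    """
    return bits_per_cycle / 8 * freq_mhz * 1e6 / 1e9


print("V1 SRAM:", bits_per_core(2048, 64), "bit/core,",
      f"{bandwidth_gbs(2048, 277):.1f} GB/s")
print("V2 SRAM:", bits_per_core(4096, 128), "bit/core,",
      f"{bandwidth_gbs(4096, 175):.1f} GB/s")
print("V2 DRAM:", f"{bandwidth_gbs(2048, 175):.1f} GB/s")
```

For V1 this reproduces the 70.9 GB/s peak quoted on the measurement slide, and in both versions each core sees a 32-bit SRAM port.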

slide-28
SLIDE 28

28/31

Stack Up Comparison

  • TSV usage
      – 3D-MAPS V1: For I/O (204 redundancy)
      – 3D-MAPS V2: For I/O (204 redundancy) and DRAM access (9 redundancy)

slide-29
SLIDE 29

29/31

V2 Layouts

Figure labels: top logic die, bottom logic die, wide-I/O DRAM ports, single core and its scratchpad SRAM (4KB)

slide-30
SLIDE 30

30/31

Wide-I/O DRAM Interface

  • Each M1 landing pad has 9 redundant TSVs
  • Single port: 320 landing pads, 320x9 TSVs
  • Figure dimensions: 2000um, 530um

slide-31
SLIDE 31

31/31

Conclusions

  • 3D-MAPS V1
      – 64 general-purpose cores + stacked SRAM
      – Ran 8 parallel applications successfully
      – Achieved 64GB/s memory bandwidth @ 4W power
      – Developed CAD tools and methodologies
      – TSV used for I/O
  • 3D-MAPS V2 (ongoing work)
      – 128 cores + stacked SRAM & DRAM cube
      – TSV used for I/O and DRAM access