Implementing Low-Diameter OCN for Manycore Processors Using A Tiled Physical Design Methodology

SLIDE 1

IMPLEMENTING LOW-DIAMETER ON-CHIP NETWORKS FOR MANYCORE PROCESSORS USING A TILED PHYSICAL DESIGN METHODOLOGY

Yanghui Ou, Shady Agwa, Christopher Batten

Computer Systems Laboratory Cornell University

SLIDE 2

REAL MANYCORE IMPLEMENTATIONS USE SIMPLE MESH OCNS

§ KiloCore: 1000 cores, 32x32 mesh (UC Davis)
§ Epiphany-V: 1024 cores, 32x32 mesh (Adapteva, Inc.)
§ Celerity: 496 cores, 16x31 mesh (University of Washington, University of Michigan, Cornell University, UC San Diego)

Motivation • Topologies • Analytical Modeling • Physical Design • PyOCN Framework

SLIDE 3

MANY NOVEL OCN TOPOLOGIES HAVE BEEN PROPOSED IN ACADEMIA

§ Flattened Butterfly (Kim+, MICRO’07)
§ Concentrated Mesh, Fat-Tree (Balfour+, ICS’06)
§ Multi-drop Express Channels (Grot+, HPCA’06/ISCA’11)
§ Clos Network (Kao+, TCAS’11)
§ Slim NoC (Besta+, ASPLOS’18)
§ Asymmetric High-Radix (Abeyratne+, HPCA’13)
§ SMART NoC (Chen+, HPCA’13)

SLIDE 4

GAP BETWEEN PRINCIPLE AND PRACTICE

§ Why do manycore processor implementations with 500-1000 cores continue to use simple high-diameter on-chip networks?
§ Manycores require simple, low-area routers
§ Manycores use standard-cell-based design
§ Manycores use a tiled physical design methodology with three key constraints:
  1. Design is based on tiling a homogeneous hard macro across the chip
  2. All chip top-level routing between hard macros must use short wires to neighboring macros
  3. Timing closure for the hard macro must imply timing closure at the chip level

Hard Macros in Celerity

SLIDE 5

Implementing Low-Diameter OCN for Manycore Processors Using A Tiled Physical Design Methodology

§ Motivation
§ Manycore OCN Topologies
§ Manycore OCN Analytical Modeling
§ Manycore OCN Physical Design
§ PyOCN Framework

SLIDE 6

TARGET CHIP: 16 X 16 MANYCORE

Manycore
Per-core Area      24,250µm²     103,500µm²
Process            16nm          16nm
Frequency          ~1GHz         500MHz
ISA                RV32IM        RV64G
Issue Width        Single        Dual
L1 Memory          8KB           64KB

§ 16x16 manycore at 1GHz using 14nm technology
§ 3mm x 3mm, 185µm x 185µm per core
§ Per-core area roughly corresponds to an in-order RV32IMAF processor with 4KB data cache and 4KB instruction cache

Component          Area (µm²)
RV32IMAF-IO        15983
4KB data cache     9407
4KB inst. cache    9347
Total              34737
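The sizing on this slide can be checked with a few lines of arithmetic. The sketch below uses only the component areas and the 185µm tile edge listed above; the variable names are illustrative.

```python
# Component areas from the slide, in square micrometres.
component_area_um2 = {
    "RV32IMAF-IO": 15983,
    "4KB data cache": 9407,
    "4KB inst. cache": 9347,
}

cores_per_side = 16
tile_side_um = 185  # per-core tile edge from the slide

total_component_area = sum(component_area_um2.values())
tile_area_um2 = tile_side_um ** 2
chip_side_mm = cores_per_side * tile_side_um / 1000

print(total_component_area)  # 34737
print(tile_area_um2)         # 34225
print(chip_side_mm)          # 2.96
```

The component total (34737µm²) is within about 2% of the 185µm x 185µm tile (34225µm²), and 16 tiles per side give 2.96mm, consistent with the "roughly corresponds" and "3mm x 3mm" wording.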

SLIDE 7

RUCHE CHANNELS TO REDUCE THE OCN DIAMETER

§ Directly skips one or more routers
§ Reduces network diameter
§ Increases the number of bisection channels
§ Increases router radix

Figure panels (16x16 manycore): no ruche channels, ruche factor of 2, ruche factor of 3

Concurrently proposed with T. Jung et al., Ruche Networks: Wire-Maximal No-Fuss NoCs
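The three effects listed on this slide can be reproduced on a small graph model. The sketch below is not the authors' model. It assumes that a ruche factor of r adds far channels between routers r positions apart in both dimensions, that routing is minimal within each dimension, and that the router has one local port. All function names are hypothetical.

```python
from collections import deque

def diameter_1d(n, ruche):
    """Worst-case hop count between any two of n routers in one row.

    Every router has near channels to its +/-1 neighbours; with ruche >= 2
    it also has far channels to the routers +/-ruche positions away.
    """
    steps = [1, -1] + ([ruche, -ruche] if ruche >= 2 else [])
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            node = queue.popleft()
            for step in steps:
                nxt = node + step
                if 0 <= nxt < n and nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

def cut_channels_1d(n, ruche):
    """Channels in one row, one direction, that cross the middle cut."""
    spans = [1] + ([ruche] if ruche >= 2 else [])
    mid = n // 2
    return sum(1 for s in spans for i in range(n - s) if i < mid <= i + s)

def summarize(n, ruche):
    return {
        "diameter": 2 * diameter_1d(n, ruche),  # X hops plus Y hops
        "cut_channels_per_row": cut_channels_1d(n, ruche),
        # Assumed port count: 1 local + 4 near (+ 4 far when ruche is used).
        "radix": 5 + (4 if ruche >= 2 else 0),
    }

for r in (0, 2, 3):
    print(r, summarize(16, r))
```

Under these assumptions the 16x16 diameter falls from 30 to 16 to 12 hops as the ruche factor goes from none to 2 to 3, while the channels crossing the middle of each row grow from 1 to 3 to 4 and the radix rises from 5 to 9.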

SLIDE 8

CONCENTRATION TO REDUCE THE OCN DIAMETER

§ Groups multiple cores together to share one router
§ Reduces network diameter
§ Reduces the number of routers
§ Reduces the number of bisection channels
§ Increases router radix

Figure panels (16x16 manycore): no concentration, concentration factor of four, concentration factor of eight
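The effect of concentration on router count, diameter, bisection channels, and radix can be tabulated in a few lines. This is a sketch, not the authors' model. It assumes a plain mesh of routers, a radix of one port per attached core plus four mesh ports, and, for a concentration factor of eight, a 4x2 group of cores per router (the slide does not state the group shape).

```python
def concentrated_mesh(cores_x, cores_y, group_x, group_y):
    """Summarize a mesh in which each router serves group_x * group_y cores."""
    if cores_x % group_x or cores_y % group_y:
        raise ValueError("core grid must divide evenly into groups")
    routers_x = cores_x // group_x
    routers_y = cores_y // group_y
    concentration = group_x * group_y
    return {
        "routers": routers_x * routers_y,
        "diameter": (routers_x - 1) + (routers_y - 1),
        # Channels in one direction across the cut that halves the routers.
        "bisection_channels": min(routers_x, routers_y),
        # Assumed: one port per attached core plus four mesh ports.
        "radix": concentration + 4,
    }

print(concentrated_mesh(16, 16, 1, 1))  # no concentration
print(concentrated_mesh(16, 16, 2, 2))  # concentration factor of four
print(concentrated_mesh(16, 16, 4, 2))  # concentration factor of eight (assumed 4x2)
```

With these assumptions the router count goes 256, 64, 32, the diameter 30, 14, 10, the bisection channels 16, 8, 4, and the radix 5, 8, 12, matching the direction of every bullet above.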

SLIDE 9

TILED PHYSICAL DESIGN – NO RUCHE CHANNELS

mesh-c1r0 tiled physical design

mesh-c1r0 hard macro in 1D

Near Channel

§ Only near channels in both dimensions
§ Pins are aligned to ensure short global routing

SLIDE 10

TILED PHYSICAL DESIGN – RUCHE FACTOR OF TWO

mesh-c1r2 tiled physical design

§ Near channel, far channel, and one feedthrough channel in one dimension
§ Short cross-over routing between feedthrough channel and far channel

Feedthrough Channel Far Channel Near Channel

mesh-c1r2 hard macro in 1D

SLIDE 11

TILED PHYSICAL DESIGN – RUCHE FACTOR OF THREE

Feedthrough Channel Far Channel Near Channel

mesh-c1r3 hard macro in 1D

§ Near channel, far channel, and two feedthrough channels in one dimension
§ Short cross-over routing between feedthrough channels and far channel

mesh-c1r3 tiled physical design
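One way to see why identical macros with feedthrough channels yield a far channel is to trace a signal through a row of tiles. The lane-shifting model below is an assumption made for illustration, not the exact layout on the slides. The router's far output drives lane 0 on the macro edge, every top-level wire reaches only the next macro, each macro shifts an incoming lane over by one (the cross-over), and the last lane terminates at the router's far input. `trace_far_channel` is a hypothetical helper and expects a ruche factor of at least 2.

```python
def trace_far_channel(src_tile, ruche, n_tiles):
    """Follow one far channel eastward through a row of identical macros.

    Returns (destination tile, feedthrough macros crossed), or None if the
    channel would run off the edge of the row.
    """
    lane = 0           # the router's far output drives lane 0
    tile = src_tile
    feedthroughs = 0
    while True:
        tile += 1      # top-level wire: only ever reaches the next macro
        if tile >= n_tiles:
            return None
        if lane == ruche - 1:
            return tile, feedthroughs  # last lane ends at the far input
        lane += 1      # feed through this macro, crossing over one lane
        feedthroughs += 1

print(trace_far_channel(0, 2, 16))   # (2, 1)
print(trace_far_channel(0, 3, 16))   # (3, 2)
print(trace_far_channel(15, 2, 16))  # None
```

A ruche factor of two lands two tiles away after one feedthrough, and a ruche factor of three lands three tiles away after two, which agrees with the feedthrough channel counts on the ruche-factor slides.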

SLIDE 12

TILED PHYSICAL DESIGN – FOLDED TORUS

torus-c1r0 tiled physical design

Feedthrough Channel Far Channel

torus-c1r0 hard macro in 1D

§ Only far channel and feedthrough channel in one dimension
§ Short cross-over routing between feedthrough channels and far channel
§ Short wrap-around routing at the edge
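The folding can be illustrated by assigning each logical ring node a physical slot and measuring how far every ring link has to reach. The interleaved ordering below is the textbook folded-torus layout, used here as an assumption. The slides do not spell out the exact node ordering, and the helper names are hypothetical.

```python
def folded_positions(n):
    """Physical slot of each logical ring node in a folded 1D torus (even n)."""
    return [2 * k if k < n // 2 else 2 * (n - 1 - k) + 1 for k in range(n)]

def max_link_length(positions):
    """Longest physical distance covered by any ring link k -> k+1 (mod n)."""
    n = len(positions)
    return max(abs(positions[k] - positions[(k + 1) % n]) for k in range(n))

n = 16
unfolded = list(range(n))
folded = folded_positions(n)

print(sorted(folded) == unfolded)  # True: every slot is used exactly once
print(max_link_length(unfolded))   # 15: the wrap-around spans the whole row
print(max_link_length(folded))     # 2: every link skips at most one tile
```

In the folded layout each link either passes over exactly one tile, which a single feedthrough channel can cover, or joins two adjacent tiles at the ends of the row, which corresponds to the short wrap-around routing at the edge.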

SLIDE 14

ANALYTICAL MODELING METHODOLOGY

§ Model the latency, area, and bandwidth analytically before doing physical design to narrow down our focus
§ Router area model and channel latency model are constructed based on physical results and floorplans
§ Zero-load latency is calculated analytically:

$T_0 = H_r t_r + H_c t_c + \frac{L}{b}$

where $H_r$ and $H_c$ are the router and channel hop counts, $t_r$ and $t_c$ the per-router and per-channel latencies, $L$ the message length, and $b$ the channel bandwidth

§ Observation
  • Router area does not scale quadratically as radix increases
  • A packet can travel a very long distance on the channel in one cycle

Figure panels: no concentration, concentration factor of four, concentration factor of eight
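As a worked instance, take the zero-load latency as router delay plus channel delay plus serialization delay, and consider a corner-to-corner path on the 16x16 mesh without ruche channels or concentration. The values are illustrative assumptions, not figures from the slides: 31 routers and 30 channels on the path, one cycle per router and per channel, a 256b message, and 128b channels.

```latex
T_0 = H_r t_r + H_c t_c + \frac{L}{b}
    = 31 \cdot 1 + 30 \cdot 1 + \frac{256}{128}
    = 63 \text{ cycles}
```

Here the hop terms dominate, which is why reducing the diameter is attractive as long as the channel width is not cut too far.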

SLIDE 15

ANALYTICAL MODELING RESULTS

Plots: Latency vs Area (256b message, under 4Kb/cycle BW constraint); Latency vs Bandwidth (256b message, under 10% area constraint)

§ Moderate ruche factor improves bandwidth and/or reduces area
§ Moderate concentration reduces latency at similar bandwidth and area
§ Increasing ruche factor does not necessarily improve latency, as it may lead to narrower channels, which increases serialization latency
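The serialization effect in the last bullet can be made concrete with a small model. Everything beyond the 256b message and the 4Kb/cycle budget is an assumption: the budget is read as bisection bandwidth counted in both directions (4096 bits per cycle), channel widths are idealized and may be fractional, each router and channel costs one cycle, and routing is minimal per dimension. The function names are hypothetical.

```python
from collections import deque

def hops_1d(n, ruche, src, dst):
    """Minimum hops from src to dst in a row with near and far channels."""
    steps = [1, -1] + ([ruche, -ruche] if ruche >= 2 else [])
    dist = {src: 0}
    queue = deque([src])
    while queue:  # breadth-first search
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for step in steps:
            nxt = node + step
            if 0 <= nxt < n and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination outside the row")

def zero_load_latency(src, dst, ruche, n=16, msg_bits=256,
                      bisection_bits=4096, t_r=1, t_c=1):
    """Router, channel, and serialization cycles under a fixed bandwidth budget."""
    hops = sum(hops_1d(n, ruche, s, d) for s, d in zip(src, dst))
    # Channels crossing the chip's middle cut, both directions (assumed).
    cut_channels = 2 * n * (1 + (ruche if ruche >= 2 else 0))
    # ceil(msg_bits / (bisection_bits / cut_channels)) in integer arithmetic.
    serialization = -(-msg_bits * cut_channels // bisection_bits)
    return (hops + 1) * t_r + hops * t_c + serialization

for ruche in (0, 2, 3):
    far = zero_load_latency((0, 0), (15, 15), ruche)
    near = zero_load_latency((0, 0), (2, 2), ruche)
    print(ruche, far, near)
```

Under these assumptions the corner-to-corner latency falls from 63 to 39 to 29 cycles, but for the short path from (0, 0) to (2, 2) it goes 11, 11, 17: the serialization term grows from 2 to 6 to 8 cycles and outweighs the saved hops.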

SLIDE 17

HARD MACRO DESIGN METHODOLOGY

§ Map global timing constraints to local timing constraints
§ Use three metal layers for local horizontal routing (M2, M4, M6) and three layers for vertical routing (M3, M5, M7)
§ Connect “dummy cores” to the injection and ejection ports of the router to prevent the ASIC toolflow from optimizing away any logic
§ Use routing and placement blockages to prevent the ASIC toolflow from using the routing resources reserved for the real cores

Figure panels: mesh-c1r2 global constraints, mesh-c1r2 local constraints (legend: feedthrough channel, far channel, near channel)
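The idea of mapping a global timing constraint onto local ones can be sketched as a budget split. This is an illustration under stated assumptions, not the authors' constraint scripts. A far-channel path in mesh-c1r2 is taken to cross three macros (source, feedthrough, destination), the 1GHz clock gives a 1000ps period, and the segment weights are invented for the example.

```python
def local_budgets(clock_period_ps, segment_weights):
    """Split one global register-to-register budget across macro segments."""
    total = sum(segment_weights.values())
    return {name: clock_period_ps * weight / total
            for name, weight in segment_weights.items()}

def implies_global_closure(budgets_ps, clock_period_ps):
    """If every macro meets its local budget, the full path fits in one cycle."""
    return sum(budgets_ps.values()) <= clock_period_ps

# Hypothetical weights: the feedthrough macro carries the longest wire.
weights = {"source macro": 1, "feedthrough macro": 2, "destination macro": 1}
budgets = local_budgets(1000, weights)

print(budgets)
print(implies_global_closure(budgets, 1000))
```

Because the local budgets sum to the clock period, closing timing on the hard macro alone is enough to close the chip-level path, which is the property the third tiling constraint asks for.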

SLIDE 18

EXAMPLE HARD MACROS

Figure: example hard macro layouts
§ 185µm macros: mesh-c1r0-b32, mesh-c1r0-b64, mesh-c1r0-b128, mesh-c1r0q0-b32, torus-c1r0-b32
§ 375µm macros: mesh-c4r0-b128, mesh-c4r2-b64
§ Panel groups: No Concentration & Ruche Channels, No Ruche Channels, Ruche Factor of Two, Concentration Factor of Four
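The macro names appear to encode the configuration: topology, then c for concentration factor, r for ruche factor, and b for channel width in bits. That reading is inferred from the slide titles rather than stated on the slides, and the parser below is a hypothetical helper built on it.

```python
import re
from typing import NamedTuple, Optional

class MacroConfig(NamedTuple):
    topology: str
    concentration: int
    ruche: int
    channel_bits: int

# Assumed naming pattern: <topology>-c<concentration>r<ruche>-b<bits>
_NAME = re.compile(r"(mesh|torus)-c(\d+)r(\d+)-b(\d+)")

def parse_macro_name(name: str) -> Optional[MacroConfig]:
    """Decode a macro name, or return None if it does not fit the pattern."""
    match = _NAME.fullmatch(name)
    if match is None:
        return None
    topology, conc, ruche, bits = match.groups()
    return MacroConfig(topology, int(conc), int(ruche), int(bits))

print(parse_macro_name("mesh-c4r2-b64"))
print(parse_macro_name("torus-c1r0-b32"))
print(parse_macro_name("mesh-c4"))
```

Read this way, mesh-c4r2-b64 is a mesh with four cores per router, a ruche factor of two, and 64b channels.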

SLIDE 19

COMPOSING HARD MACROS AT CHIP TOP-LEVEL

1. Design is based on tiling a homogeneous hard macro across the chip
2. All chip top-level routing between hard macros must use short wires to neighboring macros
3. Timing closure for the hard macro must imply timing closure at the chip level

Figure panels: torus-c1r0-b32 Full Chip (3140µm), torus-c1r0-b32 Close Up (230µm), mesh-c4r2-b64 Full Chip (3100µm), mesh-c4r2-b64 Close Up (275µm)
Figure annotations: Wrap-Around Routing, Cross-Over Routing, Global Clock & Reset Routing, Straight Across Routing

SLIDE 20

MACRO-LEVEL RESULTS FOR PROMISING TOPOLOGIES

Plots: Bandwidth vs Area; Latency vs Area (64b message); Latency vs Area (256b message)

§ Increasing bandwidth leads to an increase in area for all topologies
§ Increasing concentration and ruche factor leads to lower latency and lower area

SLIDE 22

PYOCN: A UNIFIED FRAMEWORK FOR MODELING, TESTING, AND EVALUATING OCN

SLIDE 23

PYOCN IS OPEN-SOURCED, PACKAGED, AND PUBLISHED

SLIDE 24

Implementing Low-Diameter OCN for Manycore Processors Using A Tiled Physical Design Methodology

§ We present a tiled physical design methodology to implement low-diameter OCNs for manycore processors
§ We analyze the latency, area, and bandwidth tradeoffs of 12 topologies with different concentration and ruche factors
§ Long channels are the key to fully exploiting the VLSI wiring capability but must leverage a tiled physical design methodology
§ Moderate concentration and ruching can reduce latency at similar area and bisection bandwidth

This work was supported in part by NSF CRI Award #1512937, DARPA POSH Award #FA8650-18-2-7852, and equipment, tool, and/or physical IP donations from Intel, Synopsys, and Cadence.