Optimized Core-links for Low-latency NoCs Ryuta Kawano , Seiichi - - PowerPoint PPT Presentation

slide-1
SLIDE 1

Optimized Core-links for Low-latency NoCs

Ryuta Kawano†, Seiichi Tade†, Ikki Fujiwara††, Hiroki Matsutani†, Hideharu Amano†, Michihiro Koibuchi††

† Keio University
†† National Institute of Informatics

blackbus@am.ics.keio.ac.jp

slide-2
SLIDE 2

Contents

  • Conventional NoCs
  • Small-world Networks
      • Difficulty in applying on Chips
  • How do we reduce path hops of NoCs?
      • Adding multiple links between a core and routers
      • Optimization method for picking core-links
  • Evaluations
      • Zero-load latencies
      • Costs
      • Full-system simulation
  • Conclusions

slide-3
SLIDE 3

Increasing # of Cores on NoCs

[Figure: number of PEs per chip (caches are not included; axis from 2 to 256) versus year (2002 to 2012). Legend: Chip Multi-Processors, Accelerator, Graphic processing units. Chips plotted: MIT RAW, STI Cell BE, Sparc T1, Sparc T2, Sparc T3, TILERA TILE64, TILE Gx100, Intel Xeon, AMD Opteron, IBM Power7, Fujitsu Sparc64, Intel 80-core, Intel SCC, Xeon Phi, ClearSpeed CSX600, picoChip PC102, Geforce 8800, Geforce GTX280, Geforce GTX480, UT TRIPS (OPN).]

Many simple PEs are integrated.

slide-4
SLIDE 4

Intel Teraflops Chip: 80-tile 2D Mesh

[Figure: plot of number of PEs (caches are not included) versus year, 2002 to 2012, shown with an image of the Teraflops chip (HPC) [3].]

[3] http://techresearch.intel.com/spaw2/uploads/images/Teraflop-Chip.jpg

slide-5
SLIDE 5

Intel Teraflops Chip: 80-tile 2D Mesh

[Figure: plot of number of PEs (caches are not included) versus year, 2002 to 2012, shown with an image of the Teraflops chip (HPC) [3].]

[3] http://techresearch.intel.com/spaw2/uploads/images/Teraflop-Chip.jpg

Conventional topologies (e.g. 2D Mesh) have a large # of hops on many cores.
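
To make the growth in hop count concrete, the following sketch computes the maximum and mean number of inter-router hops in a k×k 2D mesh, assuming minimal routing so that the hop count equals the Manhattan distance. The function name `mesh_hops` and the choice to average over distinct router pairs are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import product

def mesh_hops(k):
    """Return (max, mean) hop counts between distinct routers of a k x k 2D mesh."""
    nodes = list(product(range(k), repeat=2))
    # Minimal routing on a mesh: hops = Manhattan distance between routers.
    dists = [abs(ax - bx) + abs(ay - by)
             for (ax, ay) in nodes for (bx, by) in nodes
             if (ax, ay) != (bx, by)]
    return max(dists), sum(dists) / len(dists)

for k in (4, 8, 16):
    worst, mean = mesh_hops(k)
    print(f"{k}x{k} mesh: max hops = {worst}, mean hops = {mean:.2f}")
```

Under these assumptions an 8×8 mesh has a maximum of 14 hops and a mean of about 5.33 hops, and both figures grow linearly with k.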

slide-7
SLIDE 7

Small-world Topology for off-Chip Network

Reduction of # of hops using small-world effects [Koibuchi et al. 2012]

[Figure: two router topologies, "Ring + Non-Random Links" and "Ring + Random Links".]

slide-8
SLIDE 8

Small-world Topology on Chips

[Figure: 2D mesh with nodes numbered 1 to 9 and additional inter-router links; legend: Router, Core.]

2D Mesh + inter-router additional links [Ogras et al. 2006]

  • Need to use custom routing
      • to reduce path hops
      • to avoid deadlocks

slide-10
SLIDE 10

Multiple Core-links to Reduce Path Hops

Idea: Router topology + multiple links between a core and multiple routers

[Figure: "Conventional NoC" versus "Multiple Core-link NoC" (mesh router topology); legend: Router, Core.]

slide-11
SLIDE 11

Our Idea: Reduction of First and Last 1-hop Latencies with Shortcut Core-links

Multiple core-links:

  • Using shortest path between source and destination cores
  • Achieving lower hops by small-world effects
  • Maintaining regularities of router topologies

[Figure: router topology with multiple core-links; legend: Router, Core.]
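
The shortest-path selection between cores can be sketched as follows. This is a minimal illustration, assuming a mesh router topology in which the inter-router hop count equals the Manhattan distance; the helper names `router_hops` and `best_route` and the example coordinates are hypothetical and do not come from the slides.

```python
def router_hops(a, b):
    """Inter-router links traversed on a mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def best_route(src_links, dst_links):
    """Pick the (entry router, exit router) pair that minimises inter-router hops.

    src_links / dst_links: routers (x, y) that the source / destination core
    is attached to through its core-links.
    """
    return min((router_hops(s, d), s, d) for s in src_links for d in dst_links)

# Hypothetical 8x8 mesh example: cores sitting in opposite corners.
single = best_route([(0, 0)], [(7, 7)])                  # one core-link each
multi = best_route([(0, 0), (2, 2)], [(7, 7), (5, 5)])   # two core-links each
print(single)  # (14, (0, 0), (7, 7))
print(multi)   # (6, (2, 2), (5, 5))
```

In this example each extra core-link spans 4 tiles when measured as Manhattan distance (an assumed metric), and the inter-router path shrinks from 14 hops to 6 because the first and last hops act as shortcuts.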


slide-13
SLIDE 13

Optimization Method for Picking Core-links

  • Problem: Longer core-links lower the operating frequency
  • Solution: Optimization using GA (Genetic Algorithm)

Providing best tradeoff between link length and # of hops

Definition of Individual

[Figure: example of an individual and its corresponding topology, with nodes labelled 1 to 3. Source cores: 1 2 3. Individual (destination routers): 1 2 1 3 3 2.]
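
One plausible reading of the example individual is a flat list of destination routers with a fixed number of consecutive genes per source core. The sketch below decodes it under that assumption, taking two links per core; the function `decode` and this gene layout are illustrative guesses, not something the slide states.

```python
def decode(individual, links_per_core):
    """Map each source core (1-based) to the routers its core-links reach."""
    x = links_per_core
    if len(individual) % x:
        raise ValueError("individual length must be a multiple of links_per_core")
    return {core + 1: individual[core * x:(core + 1) * x]
            for core in range(len(individual) // x)}

# The example individual from the slide, read with two links per core.
topology = decode([1, 2, 1, 3, 3, 2], links_per_core=2)
print(topology)  # {1: [1, 2], 2: [1, 3], 3: [3, 2]}
```

With this layout, changing a single gene re-targets exactly one core-link, which keeps crossover and mutation simple.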

slide-14
SLIDE 14

Contents

  • Conventional NoCs
  • Small-world Networks
  • Difficulty in applying on Chips
  • How do we reduce path hops of NoCs?
  • Adding multiple links between a core and routers
  • Optimization method for picking core-links
  • Evaluations
  • Zero-load latencies
  • Costs
  • Full-system simulation
  • Conclusions

14

slide-15
SLIDE 15

Zero-load Latency

[Figure: maximum and average zero-load latency [cycles] (axis from 10 to 70) versus # of links per core (1 to 4), with the conventional 8×8 Mesh marked.]

Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %

  • 8×8 Mesh router topology + optimized core-links
  • Max. core-link length: 4 tiles
slide-16
SLIDE 16

Costs (8×8 Mesh)

  • Wire Density Overhead
      • On each tile
      • Two links per core
      • Max. 5.40 links, Ave. 3.06 links, SD 1.58 links
  • Router area (Fujitsu 65 nm process)
      • 1 link per core: 7.71 mm²
      • 2 links per core: 9.86 mm²
      • Increase by 27.8 %
  • Energy Consumption
      • 1.2 V supply voltage
      • 65 nm CMOS process
      • Wire capacitance load: 0.20 [pJ / mm] (from ITRS 2007)
      • Increase by 3.0 % at minimum

[Figure: energy consumption [pJ / flit] (axis from 400 to 700) versus # of links per core (1 to 4).]
slide-17
SLIDE 17

Parameters of Full System Simulation

Chip Configuration
  • # of CPUs / L1 caches: 8
  • # of L2 caches: 48
  • # of Directory controllers: 8

Parameters of Simulation
  • Processor: x86 (64-bit)
  • L1 cache size: 32 KB (line: 64 B)
  • L1 cache latency: 1 cycle
  • L2 cache size: 256 KB (assoc: 8)
  • L2 cache latency: 6 cycles
  • Memory size: 2 GB
  • Memory latency: 160 cycles

Parameters of Router Network
  • Switching: Wormhole
  • Packet length: 1- or 5-flit
  • Flit length: 128-bit
  • # of VCs: 3
  • Size of VC: 4 flits
  • Router latency: 3 [cycles]
  • Link latency: 1 or 2 [cycles]
  • Router topology: 8×8 Mesh
  • Max. link length: 4 [tiles]

Routing
  • Inter-router: XY routing
  • Between router and core: Selecting shortest path

Notes
  • GEM5 [Binkert et al. 2011] is used as the full-system simulator
  • Using 7 applications from the OpenMP implementation of NAS Parallel Benchmarks
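
The router-network parameters can be turned into a rough zero-load latency estimate. The sketch below assumes a common textbook-style model: a packet crossing h inter-router hops passes through h + 1 routers, h inter-router links, and two core-links, and then pays one extra cycle per additional flit. The slides do not state which model they use, so the function `zero_load_latency`, its counting of routers and links, and the default of 1 cycle per link are all assumptions.

```python
def zero_load_latency(router_hops, packet_flits=1, router_latency=3,
                      inter_router_link_latency=1, core_link_latency=1):
    """Estimate zero-load latency in cycles under the assumed model."""
    routers = router_hops + 1                      # routers traversed
    head = (routers * router_latency
            + router_hops * inter_router_link_latency
            + 2 * core_link_latency)               # injection + ejection core-links
    return head + (packet_flits - 1)               # serialization of the body flits

# Corner-to-corner path on an 8x8 mesh (14 inter-router hops), 1-flit packet.
print(zero_load_latency(14))
# Same path for a 5-flit packet with 2-cycle core-links.
print(zero_load_latency(14, packet_flits=5, core_link_latency=2))
```

Under these assumptions the first call yields 61 cycles and the second 67 cycles; shortening the inter-router path is what reduces the dominant per-router term.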

slide-18
SLIDE 18

Full-system Simulation Results

[Figure: normalized execution time (axis from 0.85 to 1.05) for the benchmarks MG, SP, BT, UA, CG, LU, IS, comparing 8×8 Mesh with 1 link per core, 2 links per core, and 4 links per core.]

Reduction of application execution time by up to 10.1 %

  • 8×8 Mesh router topology
  • Max. core-link length: 4 tiles

slide-19
SLIDE 19

Conclusions

  • Idea: Multiple links between a core and routers on NoCs
  • Design: GA optimization for picking core-links
  • Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %
  • Reduction of application execution time by up to 10.1 %

[Figure: "Conventional NoC" versus "Our Optimized Core-link NoC"; legend: Router, Core.]


slide-21
SLIDE 21

Backup Slides


slide-22
SLIDE 22

Definition of Fitness Function $f_k$

  • Length of the j-th core-link for the i-th core (0 ≤ i < N, 0 ≤ j < x): $l_{i,j}$
  • # of hops between the p-th and q-th core (0 ≤ p < N, 0 ≤ q < N): $h_{p,q}$
  • # of links longer than the given maximum link length: $I$
  • Supplemental parameters: α = β = 1000

$$
f_k =
\begin{cases}
\alpha \left( \beta \cdot I + \sum_{i=0}^{N-1} \sum_{j=0}^{x-1} l_{i,j} \right) & (I > 0) \quad (1) \\
\max(h) + \operatorname{mean}(h) & (\text{otherwise}) \quad (2)
\end{cases}
$$

  • Reducing link length with function (1)
  • Reducing # of hops with function (2)
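
The two-case fitness can be written out as a short sketch. It assumes that lower fitness is better (consistent with "reducing" in both cases) and leaves the choice of core pairs included in h to the caller, since the slide does not say whether pairs with p = q are counted. The function name `fitness` and its argument layout are hypothetical.

```python
from statistics import mean

ALPHA = BETA = 1000  # supplemental parameters given on the slide

def fitness(link_lengths, hops, max_link_length):
    """Evaluate f_k; lower is assumed to be better.

    link_lengths[i][j]: length of the j-th core-link of the i-th core.
    hops: hop counts h_{p,q} over the core pairs being considered.
    """
    lengths = [l for per_core in link_lengths for l in per_core]
    too_long = sum(1 for l in lengths if l > max_link_length)  # this is I
    if too_long > 0:
        # Case (1): penalise over-long links and the total link length.
        return ALPHA * (BETA * too_long + sum(lengths))
    hops = list(hops)
    # Case (2): all links are legal, so minimise max + mean hop count.
    return max(hops) + mean(hops)

print(fitness([[1, 5], [2, 2]], [3, 4], max_link_length=4))     # 1010000
print(fitness([[1, 4], [2, 2]], [2, 4, 6], max_link_length=4))  # 10
```

Because α and β are large, any individual with an over-long link scores far worse than any legal one, so the search first removes constraint violations and only then trades off hop counts.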

GA Parameters

  • # of individuals: 100, # of generations: 20000
  • Probability of crossover, mutation: 1 %, 20 %
  • Tournament size: 3
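
A GA loop using these parameters might look like the sketch below. Only the population size, generation count, crossover and mutation probabilities, and tournament size come from the slide; the one-point crossover, the single-gene mutation, the best-so-far tracking, and the names `tournament` and `evolve` are assumptions chosen to keep the example short.

```python
import random

def tournament(pop, fit, size, rng):
    """Return the fittest (lowest fitness) of `size` randomly drawn individuals."""
    return min(rng.sample(pop, size), key=fit)

def evolve(fit, n_genes, n_routers, n_inds=100, n_gens=20000,
           p_cross=0.01, p_mut=0.20, t_size=3, seed=0):
    """Minimise `fit` over lists of n_genes router indices in [0, n_routers)."""
    rng = random.Random(seed)
    pop = [[rng.randrange(n_routers) for _ in range(n_genes)]
           for _ in range(n_inds)]
    best = min(pop, key=fit)
    for _ in range(n_gens):
        nxt = []
        while len(nxt) < n_inds:
            child = list(tournament(pop, fit, t_size, rng))
            if rng.random() < p_cross:            # one-point crossover (assumed)
                mate = tournament(pop, fit, t_size, rng)
                cut = rng.randrange(1, n_genes)
                child = child[:cut] + mate[cut:]
            if rng.random() < p_mut:              # re-target one core-link (assumed)
                child[rng.randrange(n_genes)] = rng.randrange(n_routers)
            nxt.append(child)
        pop = nxt
        best = min(pop + [best], key=fit)         # remember the best seen so far
    return best

# Toy run: a stand-in fitness (sum of genes) and far fewer generations.
result = evolve(sum, n_genes=6, n_routers=9, n_gens=50)
print(result, sum(result))
```

The toy fitness only demonstrates that the loop runs; a real run would plug in a fitness such as the f_k defined above and decode each individual into core-links before evaluating it.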