Optimized Core-links for Low-latency NoCs
Ryuta Kawano†, Seiichi Tade†, Ikki Fujiwara††, Hiroki Matsutani†, Hideharu Amano†, Michihiro Koibuchi††
† Keio University †† National Institute of Informatics
blackbus@am.ics.keio.ac.jp
[Figure: Number of PEs (caches are not included) per chip, 4 to 256, by year, 2002 to 2012. Categories: chip multi-processors, accelerators, graphic processing units (many simple PEs are integrated). Chips plotted: MIT RAW, STI Cell BE, Sparc T1, Sparc T2, Sparc T3, TILERA TILE64, TILE Gx100, Intel Xeon, AMD Opteron, IBM Power7, Fujitsu Sparc64, Intel 80-core, Intel SCC, Xeon Phi, ClearSpeed CSX600, picoChip PC102, UT TRIPS (OPN), Geforce 8800, Geforce GTX280, Geforce GTX480]
Teraflops chip (HPC) [3]
[3] http://techresearch.intel.com/spaw2/uploads/images/Teraflop-Chip.jpg
Conventional topologies (e.g. 2D Mesh) have large # of hops
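To make the hop-count claim concrete, the following minimal sketch enumerates every router pair in a k×k 2D mesh and reports the maximum and mean hop count. The function name `mesh_hop_stats` is hypothetical, and the sketch assumes minimal (Manhattan-distance) routing between routers.

```python
from itertools import product


def mesh_hop_stats(k):
    """Return (max, mean) router-to-router hop count in a k x k 2D mesh.

    Assumes minimal routing, so the hop count is the Manhattan distance.
    """
    nodes = list(product(range(k), repeat=2))
    hops = [
        abs(ax - bx) + abs(ay - by)
        for (ax, ay) in nodes
        for (bx, by) in nodes
        if (ax, ay) != (bx, by)
    ]
    return max(hops), sum(hops) / len(hops)


max_hops, mean_hops = mesh_hop_stats(8)
print(max_hops, round(mean_hops, 2))  # 14 5.33
```

For the 8×8 mesh used later in the evaluation, a corner-to-corner transfer crosses 14 inter-router links.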
Reduction of # of hops using small-world effects [Koibuchi et al. 2012]
[Figure: Ring + non-random links vs. ring + random links; legend: router]
2D Mesh + inter-router additional links [Ogras et al. 2006]
[Figure legend: router, core]
Idea: Router topology + multiple links between a core and multiple routers
[Figure: Conventional NoC vs. multiple core-link NoC (mesh router topology); legend: router, core]
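The effect of the idea on hop count can be sketched as follows. The sketch assumes that a packet may enter the mesh at any router attached to the source core and leave at any router attached to the destination core, and that hops are counted between routers only. The helper names and the example coordinates are hypothetical.

```python
def mesh_hops(a, b):
    """Manhattan distance between two routers given as (x, y) tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def core_to_core_hops(src_routers, dst_routers):
    """Fewest router-to-router hops over every choice of attached routers."""
    return min(mesh_hops(s, d) for s in src_routers for d in dst_routers)


# Conventional NoC: each core is attached only to its own router.
single = core_to_core_hops([(0, 0)], [(7, 7)])

# Multiple core-link NoC: each core also reaches a router 4 tiles away.
multi = core_to_core_hops([(0, 0), (2, 2)], [(7, 7), (5, 5)])

print(single, multi)  # 14 6
```

In this example the extra core-links act as shortcuts: the corner-to-corner transfer drops from 14 to 6 inter-router hops.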
source and destination cores
by small-world effects
router topologies
Definition of Individual
Providing best tradeoff between link length and # of hops
[Figure: Example of an individual and the corresponding topology. Src. core: 1 2 3. Individual (Dst. router): 1 2 1 3 3 2]
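One way to read the example individual is as a flat list of destination routers with a fixed number of entries per source core. That layout is an assumption: the slide only labels the two rows as "Src. core" and "Dst. router". Under that assumption, a hypothetical decoder looks like this:

```python
def decode_individual(individual, links_per_core):
    """Map each source core (1-based) to its list of destination routers.

    Assumes the individual is a flat list that holds links_per_core
    destination routers for core 1, then for core 2, and so on.
    """
    if len(individual) % links_per_core != 0:
        raise ValueError("individual length must be a multiple of links_per_core")
    num_cores = len(individual) // links_per_core
    return {
        core + 1: individual[core * links_per_core:(core + 1) * links_per_core]
        for core in range(num_cores)
    }


topology = decode_individual([1, 2, 1, 3, 3, 2], links_per_core=2)
print(topology)  # {1: [1, 2], 2: [1, 3], 3: [3, 2]}
```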
[Figure: Zero-load latency [cycles] (max. and ave.) vs. # of links per core (1 to 4), compared with a conventional 8×8 Mesh]
Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %
Max. 5.40 links, Ave. 3.06 links, SD 1.58 links
(Fujitsu 65nm Process)
0.20 [pJ / mm] (from ITRS 2007)
1 link per core: 7.71 mm²
2 links per core: 9.86 mm²
[Figure: Energy consumption [pJ / flit] vs. # of links per core (1 to 4)]
Chip Configuration
# of CPUs / L1 caches: 8
# of L2 caches: 48
# of Directory controllers: 8

Parameters of Simulation
Processor: x86 (64-bit)
L1 cache size: 32 KB (line: 64 B)
L1 cache latency: 1 cycle
L2 cache size: 256 KB (assoc: 8)
L2 cache latency: 6 cycles
Memory size: 2 GB
Memory latency: 160 cycles

Parameters of Router Network
Switching: Wormhole
Packet length: 1- or 5-flit
Flit length: 128-bit
# of VCs: 3
Size of VC: 4 flits
Router latency: 3 [cycles]
Link latency: 1 or 2 [cycles]
Router topology: 8×8 Mesh
Max. core-link length: 4 [tiles]

Routing
Inter-router: XY routing
Between router and core: Selecting shortest path

・GEM5 [Binkert et al. 2011] is used as full-system simulator
・Using 7 applications from the OpenMP implementation of NAS Parallel Benchmarks
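As an illustration of the inter-router XY routing and the router parameters listed above, the sketch below traces a dimension-order route and estimates a zero-load latency for it. The latency model (one router traversal per router on the path, one link traversal per hop, plus serialization of the remaining flits) is a common textbook approximation and an assumption here; the slides do not state the exact formula, and core-link traversal is ignored. Function names are hypothetical.

```python
def xy_route(src, dst):
    """Routers visited by XY routing: move along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def zero_load_latency(hops, router_latency=3, link_latency=1, packet_flits=5):
    """Assumed model: (hops + 1) routers, hops links, plus serialization."""
    return (hops + 1) * router_latency + hops * link_latency + (packet_flits - 1)


path = xy_route((0, 0), (2, 1))
hops = len(path) - 1
print(path)                     # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(zero_load_latency(hops))  # 19
```

The defaults use the 3-cycle router latency, the 1-cycle link latency, and the 5-flit packet length from the parameter list.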
[Figure: Normalized execution time for benchmarks MG, SP, BT, UA, CG, LU, IS; 8×8 Mesh with 1, 2, and 4 links per core]
Reduction of application execution time by up to 10.1 %
8×8 Mesh router topology Max. core-link length: 4 tiles
[Figure: Conventional NoC vs. our optimized core-link NoC; legend: router, core]
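$$
f_k =
\begin{cases}
\alpha\left(\beta \cdot I + \sum_{i=0}^{N-1} \sum_{j=0}^{x-1} l_{i,j}\right) & (I > 0) \quad (1) \\
\max(h) + \operatorname{mean}(h) & (\text{otherwise}) \quad (2)
\end{cases}
$$

Reducing link length with function (1)
Reducing # of hops with function (2)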
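The two-branch evaluation function can be sketched in code as follows. The slide does not define its symbols, so the interpretation here is an assumption: `penalty_count` stands for I, `link_lengths[i][j]` for the length of the j-th link of core i, `hop_counts` for the hop counts h, and `alpha` and `beta` for the weights. All names and the example values are hypothetical.

```python
from statistics import mean


def fitness(link_lengths, hop_counts, penalty_count, alpha=1.0, beta=1.0):
    """Evaluate one individual with the two-branch function f_k.

    Branch (1), used while penalty_count (I) is positive, scores the
    weighted penalty plus the total core-link length.
    Branch (2), used otherwise, scores max(h) + mean(h).
    """
    if penalty_count > 0:
        total_length = sum(sum(row) for row in link_lengths)
        return alpha * (beta * penalty_count + total_length)
    return max(hop_counts) + mean(hop_counts)


lengths = [[1, 2], [3, 4]]   # two cores, two links each (example values)
hops = [2, 4, 6]             # example hop counts

print(fitness(lengths, hops, penalty_count=2, alpha=10, beta=5))  # 200
print(fitness(lengths, hops, penalty_count=0))                    # 10
```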
GA Parameters