SLIDE 1
Speed And Accuracy Dilemma In NoC Simulation: What About Memory Impact?
Manuel Selva, Abdoulaye Gamatié, David Novo, Gilles Sassatelli
LIRMM (CNRS and University of Montpellier)
18 January 2016
SLIDE 2 Context
Manycore processors integrating a NoC already exist
◮ Intel Xeon Phi
◮ Kalray MPPA2-256
◮ TILE-Gx72
NoC simulation tools are needed (and already exist)
◮ BookSim, NoCTweak, Garnet, Noxim, McSim
◮ The perfect simulator is both fast and accurate
◮ Speed/accuracy dilemma
SLIDE 3 Context
What about memory footprint?
SLIDE 4 Why Care About Memory Footprint?
[Figure: simulation time versus number of cores in the simulated manycore, with a region marked "Swapping required"]
SLIDE 5 Why Care About Memory Footprint?
Evaluate memory footprint of different simulators
SLIDE 6 Outline
◮ Considered Simulators
◮ Impact Of Accuracy On Memory Footprint
◮ Impact Of Programming Abstraction On Memory Footprint
◮ Conclusions and Perspectives
SLIDE 7 Considered Simulators - 2 Criteria
Accuracy
◮ Bit-accurate
◮ Cycle-accurate
◮ Transaction Level Modeling (TLM)
SLIDE 8 Considered Simulators - 2 Criteria
Programming abstraction
◮ Low-level programming languages (C)
◮ High-level programming languages (C++, Java)
◮ Simulation frameworks (SystemC, Ptolemy II)
SLIDE 9 Considered Simulators
Simulator    Accuracy         Programming abstraction    Injector
McSim-TLM    TLM              SystemC                    Application Model
McSim-CA     Cycle-accurate   SystemC                    Application Model
BookSim      Cycle-accurate   C++                        Random uniform

McSim-CA is based on NoCTweak
SLIDE 10 Simulated Hardware - Distributed Memory System
[Figure: 3x3 mesh of numbered nodes, each made of a router, a core, and a local memory]
SLIDE 11 Simulated Hardware - Priority Based Routers
[Figure: router block diagram with routing, packet switching, and arbitration stages, five "data in" and five "data out" ports, one of which is the local port]
SLIDE 12 McSim-TLM vs McSim-CA - Accuracy
[Figure: execution time (ms) by NoC size, McSim-TLM vs McSim-CA]

NoC size    McSim-TLM    McSim-CA
2x2         37           38
3x3         33           31
4x4         25           24
5x5         25           21
8x8         24           22
10x10       25           21
15x15       26           22
20x20       26           22
SLIDE 13 McSim-TLM vs McSim-CA - Memory Footprint
[Figure: average memory footprint (MB, log scale) by NoC size, McSim-TLM vs McSim-CA; host memory = 4,000 MB]

NoC size    McSim-TLM    McSim-CA
4x4         9            136
8x8         11           608
16x16       25           2,420
20x20       31           3,777
32x32       75           -
64x64       209          -
128x128     718          -
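The McSim-CA figures on this slide (136, 608, 2,420 and 3,777 for 4x4 to 20x20) can be read as a per-node cost. The following is a minimal sketch of that arithmetic, not part of any simulator: the names `Sample`, `mb_per_node` and `largest_side_that_fits` are hypothetical, and the linear growth with node count is an assumption suggested by the data rather than a claim from the slides.

```cpp
#include <cassert>
#include <vector>

// One measured point: mesh side length and average footprint in MB.
struct Sample {
    int side;
    double footprint_mb;
};

// Average footprint per simulated node (router + core + memory), in MB.
double mb_per_node(const Sample& s) {
    assert(s.side > 0);
    return s.footprint_mb / (s.side * s.side);
}

// Largest square mesh whose estimated footprint fits in host memory,
// ASSUMING the footprint grows linearly with the number of nodes.
int largest_side_that_fits(double mb_per_node_estimate, double host_mb) {
    assert(mb_per_node_estimate > 0.0);
    int side = 0;
    while ((side + 1) * (side + 1) * mb_per_node_estimate <= host_mb) {
        ++side;
    }
    return side;
}

// McSim-CA measurements as reported on the slide.
std::vector<Sample> mcsim_ca_samples() {
    return {{4, 136.0}, {8, 608.0}, {16, 2420.0}, {20, 3777.0}};
}
```

With the 20x20 point (about 9.4 MB per node) and the 4,000 MB host memory shown on the slide, `largest_side_that_fits` returns 20 under this linear assumption, which is consistent with 20x20 being the last McSim-CA point reported.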
SLIDE 14 McSim-CA vs Booksim - Memory Footprint
[Figure: average memory footprint (MB, log scale) by NoC size, BookSim vs McSim-CA; host memory = 4,000 MB]

NoC size    BookSim    McSim-CA
4x4         5          136
8x8         8          608
16x16       15         2,420
20x20       25         3,777
32x32       72         -
64x64       276        -
128x128     1,069      -
SLIDE 15 Deep Memory Footprint Analysis
A lot of objects
◮ A few big objects accounting for 1% of the footprint
◮ A lot of small SystemC objects (3,500,000 for 20x20)
Accellera implementation
◮ Each SystemC object has a unique name
◮ For debugging purposes
◮ Required by the standard
SLIDE 16 Optimized Accellera - Memory Footprint
[Figure: average memory footprint (MB, log scale) by NoC size, BookSim vs McSim-CA vs McSim-CA-Opt; 738 MB saved at 20x20]

NoC size    BookSim    McSim-CA    McSim-CA-Opt
4x4         5          136         105
8x8         8          608         491
16x16       15         2,420       1,951
20x20       25         3,777       3,039
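The saving quoted for 20x20 (3,777 down to 3,039, i.e. 738 saved) can be checked for every mesh size with a few lines of arithmetic. This is only an illustrative sketch: the struct and function names are hypothetical, 1 MB is assumed to be 10^6 bytes, and the per-object figure assumes the saving is spread evenly over the 3,500,000 SystemC objects mentioned for 20x20.

```cpp
#include <cassert>

// Footprint of one mesh size before and after the optimization, in MB.
struct Footprint {
    int side;
    double baseline_mb;   // McSim-CA
    double optimized_mb;  // McSim-CA-Opt
};

// Absolute saving in MB.
double saved_mb(const Footprint& f) {
    return f.baseline_mb - f.optimized_mb;
}

// Saving relative to the baseline, in percent.
double saved_percent(const Footprint& f) {
    assert(f.baseline_mb > 0.0);
    return 100.0 * saved_mb(f) / f.baseline_mb;
}

// Average saving per object in bytes, ASSUMING 1 MB = 1e6 bytes and an
// even spread of the saving over all objects.
double saved_bytes_per_object(const Footprint& f, double object_count) {
    assert(object_count > 0.0);
    return saved_mb(f) * 1.0e6 / object_count;
}
```

For `Footprint{20, 3777.0, 3039.0}` this gives 738 MB, roughly 19.5% of the baseline, and about 211 bytes per object for 3,500,000 objects under the stated assumptions. The other three sizes give relative savings in a similar range (about 19% to 23%).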
SLIDE 17 Conclusion
From TLM to cycle-accurate
◮ Costs memory in addition to CPU
Cycle-accurate concerns
◮ Programming abstraction costs memory in addition to CPU
◮ SystemC object names can consume a lot of memory
Perspectives
◮ Evaluate memory footprint of other simulators
◮ Perform lazy allocation in SystemC?
SLIDE 21 C++ String Implementation
◮ g++ 5.2.1
◮ For a 2-character string:
  ◮ stack space = 32 bytes, heap space = 0 bytes, capacity = 15
◮ For a 16-character string:
  ◮ stack space = 32 bytes, heap space = 17 bytes, capacity = 16
◮ 15-character buffer inside the string object avoids dynamic memory allocation for short strings
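The numbers above can be probed on any toolchain with a small helper. This is a sketch under assumptions: `stored_inline`, `heap_bytes` and `total_bytes` are hypothetical names, the heap estimate (capacity plus one terminator byte) ignores allocator overhead, and the actual values depend on the standard library implementation, so they may differ from the g++ 5.2.1 figures quoted on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// True when the character buffer lives inside the std::string object
// itself, so no separate heap allocation holds the characters.
bool stored_inline(const std::string& s) {
    const auto object_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto object_end = object_begin + sizeof(std::string);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= object_begin && data < object_end;
}

// Rough heap usage of a string: capacity plus the terminating '\0'
// when the buffer is not inline. Allocator overhead is NOT counted.
std::size_t heap_bytes(const std::string& s) {
    assert(s.capacity() >= s.size());  // guaranteed by the standard
    return stored_inline(s) ? 0 : s.capacity() + 1;
}

// Object size plus estimated heap usage.
std::size_t total_bytes(const std::string& s) {
    return sizeof(std::string) + heap_bytes(s);
}
```

Calling `heap_bytes` on a 2-character and on a 16-character string shows whether a given library keeps short strings inline and where the threshold lies. Multiplying `total_bytes` of a typical object name by the number of named objects gives a first-order estimate of what names alone cost.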