
The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing
www.multi2sim.org
Rafael Ubal, David R. Kaeli
Northeastern University, Boston, MA

Outline
1. Introduction
First Block – The x86 CPU Simulation
2. The x86 CPU Emulation
3. The x86 CPU Architectural Simulation
4. The Memory Hierarchy
5. Benchmarks and Simulations
Second Block – The AMD Evergreen GPU Simulation
6. The OpenCL Programming Model
7. The AMD Evergreen GPU Emulation
8. The AMD Evergreen GPU Architectural Simulation
9. Benchmarks and Simulations
10. Conclusions and Future Work
The Multi2Sim Simulation Framework, PACT 2011 Tutorial

1. Introduction: Motivation
• Limitations of existing CPU simulators
  – Such as SimpleScalar, Simics, SSMT, M-Sim, SMTSim, M5, ...
  – Full-system vs. application-only simulation.
  – Free, open-source.
  – Architectural simulation accuracy.
  – Alpha/PISA architectures, which require cross-compilers.
  – Integrated system.
• Current simulation needs
  – Based on the current processor market.
  – Heterogeneous CPU-GPU environments.
  – A tool for evaluating new architectural proposals.
  – Simulation of a GPU ISA.
• Existing GPU simulation approaches
  – Barra: NVIDIA Tesla ISA.
  – Ocelot: PTX intermediate language simulator.
  – No architectural simulation.
  – No emulation of AMD ISAs.
  – Not capable of heterogeneous simulation.

1. Introduction: Multi2Sim Background
• Multi2Sim 1.x version series, 2007 (MIPS-based)
  – Superscalar pipeline: out-of-order execution, branch prediction, trace cache, etc.
  – Multithreading: fine-grain, coarse-grain, and simultaneous (SMT).
• Multi2Sim 2.x version series, 2008 (x86-based)
  – Multicore architecture: configurable memory hierarchy, cache coherence, interconnection networks.
  – State-of-the-art benchmarks: tested support for common research benchmarks, available for download.
• Multi2Sim 3.x version series, 2011 (x86+Evergreen)
  – GPU model: model for the Evergreen ISA.
  – Support for OpenCL benchmarks.

1. Introduction: Getting Started
• User-friendly installation and test
    $ tar -xzf multi2sim-3.1.tar.gz
    $ cd multi2sim-3.1
    $ ./configure
    $ make
    $ sudo make install
• Application-only simulator
  – Original execution:
    $ ./test-args hola que tal
    arg[0] = 'hola'
    arg[1] = 'que'
    arg[2] = 'tal'
  – Simulated execution:
    $ m2s ./test-args hola que tal
    <... Simulator output ...>
    arg[0] = 'hola'
    arg[1] = 'que'
    arg[2] = 'tal'
    <... Simulator statistics ...>

1. Introduction: The IniFile Format
• Example of IniFile
    ; This is a comment.
    [ Section 0 ]
    Color = Red
    Height = 40
    [ OtherSection ]
    Variable = Value
• Multi2Sim uses IniFile for
  – Configuration files.
  – Output statistics files.
  – Standard error output.
Demo 1
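To make the format concrete, here is a minimal Python sketch of a reader for files like the example above. It is not Multi2Sim's own parser: the comment syntax, whitespace trimming, case sensitivity, and duplicate-variable behavior are assumptions, and `parse_ini` is a hypothetical name.

```python
def parse_ini(text):
    """Parse IniFile-style text into {section: {variable: value}}.

    Assumptions (not taken from Multi2Sim's source): lines starting with
    ';' are comments, whitespace around section names, variables and
    values is trimmed, names are case-sensitive, and a repeated variable
    keeps the last value seen.
    """
    sections = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()  # "[ Section 0 ]" -> "Section 0"
            sections.setdefault(current, {})
            continue
        if current is None or "=" not in line:
            raise ValueError(f"line {lineno}: unexpected content {raw!r}")
        name, value = line.split("=", 1)
        sections[current][name.strip()] = value.strip()
    return sections


EXAMPLE = """\
; This is a comment.
[ Section 0 ]
Color = Red
Height = 40
[ OtherSection ]
Variable = Value
"""

config = parse_ini(EXAMPLE)
print(config["Section 0"]["Height"])  # 40 (kept as the string "40")
```

Values stay strings here; converting `Height` to an integer is left to whoever reads the option, which keeps the reader independent of any particular configuration schema.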

Block 1: The x86 CPU Simulation

2. The CPU Emulation: Definition
• Emulation (a.k.a. functional simulation)
  – Only mimics the original behavior of a program.
  – … as opposed to timing/detailed/architectural simulation.
• Steps
  1) Program loading.
  2) Simulation loop.

2. The CPU Emulation: Program Loading
• Initialization of a process state
  – Virtual memory map.
  – Values of the x86 registers.
1) Parse the ELF executable
  – ELF sections.
  – Initialized code and data.
2) Initialize the stack
  – Program headers.
  – Arguments.
  – Environment variables.
3) Initialize registers
  – Program entry point → eip
  – Stack pointer → esp
[Figure: virtual memory map of the emulated process, from the stack at 0xc0000000 (program arguments and environment variables, top of stack in esp), through the uninitialized mmap region at 0x40000000 and the heap, down to the initialized data and text between 0x08000000 and 0x08xxxxxx (instruction pointer in eip); registers eax, ebx, ecx, esp, and eip are shown alongside.]
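Steps 2 and 3 can be sketched in Python (step 1, parsing the ELF file, is left out). This is a simplified illustration assuming the usual Linux i386 layout, with argc, the argv pointers, and the envp pointers starting at esp; it omits the auxiliary vector that carries program-header information and the extra alignment a real loader adds. `load_process` and the entry address are hypothetical.

```python
import struct

STACK_TOP = 0xC0000000  # stack top address shown in the slide's memory map


def write_bytes(mem, addr, data):
    """Store bytes into a sparse byte-addressed memory (dict addr -> byte)."""
    for i, byte in enumerate(data):
        mem[addr + i] = byte


def read_word(mem, addr):
    """Read a 32-bit little-endian word, as on x86."""
    return struct.unpack("<I", bytes(mem[addr + i] for i in range(4)))[0]


def read_cstring(mem, addr):
    """Read a NUL-terminated string starting at addr."""
    out = bytearray()
    while mem[addr] != 0:
        out.append(mem[addr])
        addr += 1
    return out.decode()


def load_process(entry, args, env):
    """Initialize stack and registers for a new context (simplified sketch)."""
    mem = {}
    sp = STACK_TOP

    def push_string(s):
        nonlocal sp
        data = s.encode() + b"\0"
        sp -= len(data)
        write_bytes(mem, sp, data)
        return sp

    # Step 2: copy environment and argument strings to the top of the stack.
    env_ptrs = [push_string(e) for e in env]
    arg_ptrs = [push_string(a) for a in args]

    # Below the strings: argc, argv[0..n-1], NULL, envp[0..m-1], NULL.
    words = [len(args)] + arg_ptrs + [0] + env_ptrs + [0]
    sp = (sp - 4 * len(words)) & ~0x3  # keep the pointer block word-aligned
    write_bytes(mem, sp, struct.pack(f"<{len(words)}I", *words))

    # Step 3: program entry point -> eip, stack pointer -> esp.
    regs = {"eip": entry, "esp": sp}
    return mem, regs


ENTRY = 0x08048000  # hypothetical entry point inside the 0x08xxxxxx text region
mem, regs = load_process(ENTRY, ["./test-args", "hola", "que", "tal"], ["HOME=/root"])
```

A sparse dictionary stands in for the virtual memory map so that only the bytes actually written exist, which mirrors the idea that most of the 4 GB address space is never touched.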

2. The CPU Emulation: Simulation Loop
• Emulation of x86 instructions
  – Update the memory map (if needed).
  – Update x86 registers.
  – Example: add [bp+16], 0x5
• Emulation of Linux system calls
  – Analyze the system call code and arguments.
  – Update the memory map.
  – Update eax with the return value.
  – Example: read(fd, buf, count);
[Figure: simulation loop flowchart. Read the instruction bytes at eip and decode them into instruction fields; if the instruction is int 0x80, emulate a system call, otherwise emulate the x86 instruction; then move eip to the next instruction.]
Demo 2
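A toy Python version of the loop, assuming instructions arrive already decoded as `(size, op, args)` tuples so that no x86 decoder is needed. Only `mov r32, imm32`, `add dword [reg+disp], imm`, and `int 0x80` with the Linux i386 `exit` (1) and `write` (4) calls are handled; `write` is used instead of the slide's `read` because it needs no input, and the slide's `add [bp+16], 0x5` becomes the 32-bit `add dword [ebp+16], 0x5`. `Context`, `run`, and `assemble` are hypothetical names.

```python
from dataclasses import dataclass, field

SYS_EXIT, SYS_WRITE = 1, 4  # Linux i386 system call numbers


@dataclass
class Context:
    regs: dict = field(default_factory=lambda: dict.fromkeys(
        ("eax", "ebx", "ecx", "edx", "esp", "ebp", "eip"), 0))
    mem: dict = field(default_factory=dict)  # byte address -> byte value
    running: bool = True
    exit_code: int = 0
    stdout: bytearray = field(default_factory=bytearray)

    def read_word(self, addr):
        return int.from_bytes(bytes(self.mem.get(addr + i, 0) for i in range(4)), "little")

    def write_word(self, addr, value):
        for i, byte in enumerate((value & 0xFFFFFFFF).to_bytes(4, "little")):
            self.mem[addr + i] = byte


def emulate_syscall(ctx):
    """int 0x80: code in eax, arguments in ebx, ecx, edx, result back in eax."""
    code = ctx.regs["eax"]
    if code == SYS_EXIT:
        ctx.running = False
        ctx.exit_code = ctx.regs["ebx"]
    elif code == SYS_WRITE:
        fd, buf, count = ctx.regs["ebx"], ctx.regs["ecx"], ctx.regs["edx"]
        data = bytes(ctx.mem.get(buf + i, 0) for i in range(count))
        if fd == 1:  # other descriptors are ignored in this sketch
            ctx.stdout += data
        ctx.regs["eax"] = count
    else:
        raise NotImplementedError(f"system call {code} not emulated")


def emulate_instr(ctx, op, args):
    """Update registers and/or memory for one pre-decoded instruction."""
    if op == "mov":  # mov r32, imm32
        reg, imm = args
        ctx.regs[reg] = imm & 0xFFFFFFFF
    elif op == "add":  # add dword [reg+disp], imm
        (base, disp), imm = args
        addr = ctx.regs[base] + disp
        ctx.write_word(addr, ctx.read_word(addr) + imm)
    else:
        raise NotImplementedError(op)


def run(ctx, program):
    while ctx.running:
        size, op, args = program[ctx.regs["eip"]]  # read + decode at eip
        if op == "int" and args == (0x80,):
            emulate_syscall(ctx)
        else:
            emulate_instr(ctx, op, args)
        ctx.regs["eip"] += size  # move eip to the next instruction


def assemble(base, instrs):
    """Place (size, op, args) tuples at consecutive addresses from base."""
    program, addr = {}, base
    for size, op, args in instrs:
        program[addr] = (size, op, args)
        addr += size
    return program


ctx = Context()
ctx.regs["ebp"] = 0x1000
for i, byte in enumerate(b"hi\n"):
    ctx.mem[0x2000 + i] = byte
program = assemble(0x08048000, [
    (4, "add", (("ebp", 16), 0x5)),  # add dword [ebp+16], 0x5 (83 45 10 05)
    (5, "mov", ("eax", SYS_WRITE)),
    (5, "mov", ("ebx", 1)),          # fd = standard output
    (5, "mov", ("ecx", 0x2000)),     # buf
    (5, "mov", ("edx", 3)),          # count
    (2, "int", (0x80,)),             # write(1, buf, 3)
    (5, "mov", ("eax", SYS_EXIT)),
    (5, "mov", ("ebx", 0)),
    (2, "int", (0x80,)),             # exit(0)
])
ctx.regs["eip"] = 0x08048000
run(ctx, program)
```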

3. The CPU Architectural Simulation: Definition
• Architectural simulation (a.k.a. detailed/timing simulation)
  – Provides performance results from executing a program on a configurable CPU model.
  – Main performance metric: execution time. Other results include structure occupancy, cache hit rates, and contention points.
[Figure: the architectural simulator (cycle counter, CPU cores model, memory hierarchy model) asks the CPU functional simulator to run a new x86 instruction, and the functional simulator reports the instruction that was run.]
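The split between the two simulators can be illustrated with a small Python sketch: a functional model produces the stream of executed instructions, and a timing model consumes it, advancing a cycle counter and collecting a cache hit rate. The latencies, miss penalty, and direct-mapped cache are hypothetical, and unlike a real pipeline model this one adds latencies serially with no overlap.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

# Hypothetical latencies and miss penalty, not Multi2Sim defaults.
LATENCY = {"alu": 1, "load": 1}
MISS_PENALTY = 20


@dataclass
class Instr:
    op: str                     # "alu" or "load"
    addr: Optional[int] = None  # memory address for loads


def functional_sim(base: int, n: int) -> Iterator[Instr]:
    """Emulate 'sum += a[i]' over n 32-bit ints, reporting each executed instruction."""
    for i in range(n):
        yield Instr("load", base + 4 * i)
        yield Instr("alu")


class TimingModel:
    """Cycle counter plus a direct-mapped data cache; no pipelining or overlap."""

    def __init__(self, num_sets=64, block_size=64):
        self.num_sets, self.block_size = num_sets, block_size
        self.tags = [None] * num_sets
        self.cycle = self.hits = self.accesses = 0

    def cache_access(self, addr: int) -> int:
        block = addr // self.block_size
        index, tag = block % self.num_sets, block // self.num_sets
        self.accesses += 1
        if self.tags[index] == tag:
            self.hits += 1
            return 0
        self.tags[index] = tag  # miss: fill the block
        return MISS_PENALTY

    def execute(self, ins: Instr) -> None:
        self.cycle += LATENCY[ins.op]
        if ins.addr is not None:
            self.cycle += self.cache_access(ins.addr)


model = TimingModel()
for ins in functional_sim(base=0x1000, n=32):  # "run a new x86 instruction"
    model.execute(ins)
print(model.cycle, model.hits / model.accesses)  # 104 0.9375
```

The 32 loads touch 128 contiguous bytes, so with 64-byte blocks only the first access to each of the two blocks misses: 30 hits out of 32 accesses and 32 + 32 + 2 × 20 = 104 cycles.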

3. The CPU Architectural Simulation: The Superscalar Pipeline
[Figure: superscalar pipeline with Fetch, Decode, Dispatch, Issue, Writeback, and Commit stages; structures shown are the fetch queue, μop queue, trace queue, reorder buffer, instruction queue, load/store queue, instruction cache, trace cache, functional units (FU), data cache, and register file.]
• Characteristics
  – Speculative execution.
  – Branch prediction.
  – Out-of-order execution.
  – Trace cache.
Demo 3
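One structure that lets out-of-order execution still produce in-order results is the reorder buffer. This Python sketch, with a hypothetical size, commit width, and completion cycles, shows instructions completing out of order but retiring only from the head of the buffer.

```python
from collections import deque


class ReorderBuffer:
    """In-order commit of instructions that may complete out of order (sketch)."""

    def __init__(self, size, commit_width):
        self.entries = deque()  # (name, cycle at which the result is ready)
        self.size = size
        self.commit_width = commit_width
        self.committed = []     # (cycle, name) in commit order

    def dispatch(self, name, done_cycle):
        if len(self.entries) >= self.size:
            return False  # ROB full: dispatch stalls this cycle
        self.entries.append((name, done_cycle))
        return True

    def commit(self, cycle):
        # Retire only from the head, and only if that instruction has completed.
        for _ in range(self.commit_width):
            if not self.entries or self.entries[0][1] > cycle:
                break
            name, _done = self.entries.popleft()
            self.committed.append((cycle, name))


rob = ReorderBuffer(size=3, commit_width=2)
rob.dispatch("i0", done_cycle=5)  # e.g. a load that misses in the data cache
rob.dispatch("i1", done_cycle=2)
rob.dispatch("i2", done_cycle=3)
accepted = rob.dispatch("i3", done_cycle=4)  # False: no free entry
for cycle in range(7):
    rob.commit(cycle)
print(rob.committed)  # [(5, 'i0'), (5, 'i1'), (6, 'i2')]
```

Here i1 and i2 finish before i0 but cannot commit until i0 does at cycle 5; with a commit width of 2, i2 waits one more cycle.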

3. The CPU Architectural Simulation: Multithreaded Processor Model
[Figure: three hardware threads, each drawn with its own Fetch, Decode, Dispatch, Issue, Writeback, and Commit stages, caches, and register file, all sharing one functional unit pool.]
• Multithreading paradigms
  – Coarse-grain multithreading: thread switch upon long-latency events.
  – Fine-grain multithreading: thread switch at cycle granularity.
  – Simultaneous multithreading: instructions from multiple threads issued in the same cycle.
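The three paradigms differ in which threads may use the pipeline in each cycle. A Python sketch of that choice, where each function returns, per cycle, the list of thread ids that issue; the stall events, ready counts, and round-robin tie-breaking are illustrative assumptions rather than Multi2Sim's policies.

```python
def fine_grain(num_threads, cycles):
    """Switch to the next thread every cycle (round robin)."""
    return [[c % num_threads] for c in range(cycles)]


def coarse_grain(num_threads, cycles, long_latency):
    """Keep one thread until it hits a long-latency event, then switch.

    long_latency: set of (cycle, thread) pairs where that thread stalls,
    e.g. on a cache miss (hypothetical input for illustration).
    """
    current, schedule = 0, []
    for c in range(cycles):
        if (c, current) in long_latency:
            current = (current + 1) % num_threads
        schedule.append([current])
    return schedule


def simultaneous(ready_per_thread, cycles, issue_width):
    """Fill up to issue_width slots per cycle with instructions from any thread."""
    n = len(ready_per_thread)
    schedule = []
    for c in range(cycles):
        remaining = list(ready_per_thread)  # ready instructions per thread this cycle
        slots, t = [], c % n                # rotate the starting thread for fairness
        while len(slots) < issue_width and any(remaining):
            if remaining[t] > 0:
                slots.append(t)
                remaining[t] -= 1
            t = (t + 1) % n
        schedule.append(slots)
    return schedule


print(fine_grain(3, 4))                  # [[0], [1], [2], [0]]
print(coarse_grain(2, 5, {(2, 0)}))      # [[0], [0], [1], [1], [1]]
print(simultaneous([2, 1], 2, 4))        # [[0, 1, 0], [1, 0, 0]]
```

Only the simultaneous variant ever places more than one thread id in a cycle, which is the distinguishing feature of SMT in the list above.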

3. The CPU Architectural Simulation: Multicore Processor Model
[Figure: Core 0 and Core 1, each a complete superscalar pipeline with its own instruction, trace, and data caches and register file, both connected to the memory hierarchy.]
• Multicore processor
  – Multiple independent superscalar pipelines.
  – Communication only through the memory hierarchy.
• What can we run on it?
  – Multiple single-threaded programs.
  – One (or more) programs spawning child threads.
Demo 4

3. The CPU Architectural Simulation: Definitions
• Core (c-0, c-1, ...)
  – Hardware component with an independent set of superscalar pipelines.
  – Each core may contain several threads.
• Thread (t-0, t-1, ...)
  – Hardware component with a partially independent set of pipeline stages.
• Context (ctx-0, ctx-1, ...)
  – Software thread with independent values for registers (incl. eip).
  – Can be a sequential program or a spawned child context.
• Node
  – Hardware component running a context.
  – Multicore processor: c0, c1, ...
  – Multithreaded processor: t0, t1, ...
  – Multicore-multithreaded processor: c0-t0, c0-t1, ...
Demo 4
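The node naming convention can be written as a small Python function, together with a hypothetical context-to-node assignment that fills free nodes in order and leaves extra contexts waiting. The allocation policy is an assumption for illustration, not necessarily what Multi2Sim does.

```python
def node_names(num_cores, num_threads):
    """Node names following the slide's convention (c0, t0, c0-t0, ...)."""
    if num_cores > 1 and num_threads > 1:
        return [f"c{c}-t{t}" for c in range(num_cores) for t in range(num_threads)]
    if num_cores > 1:
        return [f"c{c}" for c in range(num_cores)]
    return [f"t{t}" for t in range(num_threads)]


def map_contexts(contexts, nodes):
    """Assign contexts to nodes in order; contexts beyond the node count wait.

    Hypothetical policy: a real simulator may also migrate or time-share contexts.
    """
    mapping = dict(zip(contexts, nodes))
    waiting = contexts[len(nodes):]
    return mapping, waiting


print(node_names(2, 2))  # ['c0-t0', 'c0-t1', 'c1-t0', 'c1-t1']
mapping, waiting = map_contexts(["ctx-0", "ctx-1", "ctx-2"], node_names(2, 1))
print(mapping, waiting)  # {'ctx-0': 'c0', 'ctx-1': 'c1'} ['ctx-2']
```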

4. Memory Hierarchy: Configuration
• Configuring the memory hierarchy
  – Any number of caches organized in any number of levels.
  – Connected through any number of interconnects.
  – A set of one or more caches must connect to an interconnect from "above"; only one cache (or main memory) is connected "below".
[Figure: several caches connect from above to an interconnect, which connects below to one cache or main memory.]
• Memory hierarchy entries
  – Each node has two entries to the memory hierarchy: an instruction entry and a data entry.
  – Several node entries can converge to the same cache (or main memory).
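The connection rules can be checked mechanically. This Python sketch describes a hierarchy as plain dictionaries (a hypothetical representation, not Multi2Sim's configuration file syntax) and reports violations: an interconnect without a cache above, an interconnect without exactly one module below, or a node missing its instruction or data entry.

```python
def validate_hierarchy(modules, interconnects, entries):
    """Check the connection rules listed on the slide.

    modules: set of cache / main-memory names.
    interconnects: {name: {"above": [modules], "below": [modules]}}.
    entries: {node: {"instr": module, "data": module}}.
    Returns a list of error strings (empty if the hierarchy is valid).
    """
    errors = []
    for name, ends in interconnects.items():
        if len(ends["above"]) < 1:
            errors.append(f"{name}: needs one or more caches above")
        if len(ends["below"]) != 1:
            errors.append(f"{name}: needs exactly one cache or main memory below")
        for m in ends["above"] + ends["below"]:
            if m not in modules:
                errors.append(f"{name}: unknown module {m}")
    for node, e in entries.items():
        for kind in ("instr", "data"):
            if e.get(kind) not in modules:
                errors.append(f"{node}: missing or unknown {kind} entry")
    return errors


modules = {"l1", "mm"}
nets = {"net0": {"above": ["l1"], "below": ["mm"]}}
entries = {"t0": {"instr": "l1", "data": "l1"}, "t1": {"instr": "l1", "data": "l1"}}
ok = validate_hierarchy(modules, nets, entries)
bad = validate_hierarchy(modules | {"l2"},
                         {"net0": {"above": ["l1"], "below": ["l2", "mm"]}},
                         entries)
print(ok)   # []
print(bad)  # ['net0: needs exactly one cache or main memory below']
```

In the valid case, both threads' instruction and data entries converge on the same cache `l1`, which the rules allow.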

4. Memory Hierarchy: Configuration
• Example
  – 2-core, 2-threaded processor (4 nodes).
  – Each thread has its own private data and instruction L1 caches.
  – L2 caches: shared among threads, private per core, unified for data/instructions.
[Figure: Core 0 (nodes c0-t0, c0-t1) and Core 1 (nodes c1-t0, c1-t1); each node has private data and instruction L1 caches, the L1 caches of each core connect to that core's L2 cache, and both L2 caches connect to main memory.]
Demo 5
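The example hierarchy can be modeled as a "next level down" map to show where each node entry leads. Module names such as `dl1-c0-t1` and `l2-c0` are hypothetical, and interconnects are left out of this simplified Python sketch.

```python
def build_example(num_cores=2, num_threads=2):
    """Private L1 I/D per thread, one unified L2 per core, shared main memory."""
    lower = {}    # module -> module directly below it
    entries = {}  # node -> {"instr": module, "data": module}
    for c in range(num_cores):
        l2 = f"l2-c{c}"
        lower[l2] = "main-memory"
        for t in range(num_threads):
            node = f"c{c}-t{t}"
            il1, dl1 = f"il1-{node}", f"dl1-{node}"
            lower[il1] = lower[dl1] = l2  # both L1s of a thread feed the core's L2
            entries[node] = {"instr": il1, "data": dl1}
    return lower, entries


def path(lower, entries, node, kind):
    """Modules visited from a node's entry ("instr" or "data") down to memory."""
    module = entries[node][kind]
    chain = [module]
    while module in lower:
        module = lower[module]
        chain.append(module)
    return chain


lower, entries = build_example()
print(path(lower, entries, "c0-t1", "data"))  # ['dl1-c0-t1', 'l2-c0', 'main-memory']
```

Two threads of the same core reach the same L2, while threads on different cores only meet at main memory, matching "shared among threads, private per core".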
