Simulation of OpenCL and APUs on Multi2Sim 4.1
Rafael Ubal, David Kaeli
ISCA 2013, Tel-Aviv 2
Outline

Introduction
─ Simulation methodology

Part 1 – Simulation of an x86 CPU
─ Disassembler
─ Emulation
─ Timing simulation
─ Memory hierarchy
─ Visualization tool

Part 2 – Simulation of a Southern Islands GPU
─ OpenCL from host to device
─ Disassembler
─ Emulation
─ Timing simulation
─ Visualization tool
─ Case study: ND-Range virtualization

Part 3 – Concluding Remarks
─ Additional Projects
─ The Multi2Sim Community
Getting Started
─ Machine: fusion1.ece.neu.edu
─ User: isca1, isca2, isca3, ...
─ Password: isca2013
Getting Started
$ ssh isca<N>@fusion1.ece.neu.edu -X

(Notice the X forwarding for later demos using graphics.)

$ wget http://www.multi2sim.org/files/multi2sim-4.1.tar.gz
$ tar -xzf multi2sim-4.1.tar.gz
$ cd multi2sim-4.1
$ ./configure && make
$ ls demo1 demo2 demo3 demo4 demo5 demo6 demo7 README
All files needed for each demo are present in its corresponding directory. README files describe commands to run and interpretation of outputs.
First Execution
#include <stdio.h>

int main(int argc, char **argv)
{
	int i;

	printf("Number of arguments: %d\n", argc);
	for (i = 0; i < argc; i++)
		printf("\targv[%d] = %s\n", i, argv[i]);
	return 0;
}
Demo 1
$ test-args hello there
Number of arguments: 3
	argv[0] = test-args
	argv[1] = hello
	argv[2] = there

$ m2s test-args hello there
< Simulator message in stderr >
Number of arguments: 3
	argv[0] = test-args
	argv[1] = hello
	argv[2] = there
< Simulator statistics >
Simulator Input/Output Files

─ Configuration files.
─ Output statistics.
─ Statistic summary in standard error output.

These files follow an INI-style format:

; This is a comment.
[ Section 0 ]
Color = Red
Height = 40

[ OtherSection ]
Variable = Value
Application-Only vs. Full-System
Full-system simulator core: virtualization of the complete processor ISA and I/O hardware. An entire OS runs on top of the simulator, with guest programs running on top of the OS. The simulator models the entire ISA and virtualizes native hardware devices, similar to a virtual machine. Very accurate simulations, but extremely slow.

Application-only simulator core: virtualization of a user-space subset of the ISA and the system call interface. Only the guest programs run on top of the simulator. The simulator models a user-space subset of the ISA, and needs to virtualize the system call interface (ABI). Multi2Sim falls in this category.
Four-Stage Simulation Process
─ Four clearly different software modules per architecture (x86, MIPS, ...) ─ Each module has a standard interface for stand-alone execution, or interaction with other modules.
─ Disassembler: takes instruction bytes from an executable ELF file and produces instruction fields; as a stand-alone tool, it prints an instruction dump.
─ Emulator (or functional simulator): takes an executable file and program arguments, runs the program one instruction at a time, and provides instruction information.
─ Timing simulator (or detailed/architectural simulator): takes an executable file, program arguments, and a processor configuration, and produces performance statistics and a pipeline trace.
─ Visual tool: takes a pipeline trace and user interaction, and provides cycle navigation and timing diagrams.
Current Architecture Support
─ 4 GPU + 3 CPU architectures supported or in progress. ─ This tutorial will focus on x86 and AMD Southern Islands.
Architecture           Disasm.       Emulation     Timing simulation   Graphic pipelines
ARM                    X             In progress   –                   –
MIPS                   X             –             –                   –
x86                    X             X             X                   X
AMD Evergreen          X             X             X                   X
AMD Southern Islands   X             X             X                   X
NVIDIA Fermi           X             In progress   –                   –
NVIDIA Kepler          In progress   –             –                   –
Disassembler
08048900 <_start>:
 8048900: 31 ed                 xor    ebp,ebp
 8048902: 5e                    pop    esi
 8048903: 89 e1                 mov    ecx,esp
 8048905: 83 e4 f0              and    esp,0xfffffff0
 8048908: 50                    push   eax
 8048909: 54                    push   esp
 804890a: 52                    push   edx
 804890b: 68 70 91 04 08        push   0x8049170
 ...

$ objdump -S -M intel test-args
$ m2s --x86-disasm test-args
Methodology
─ Implementation of an efficient instruction decoder based on lookup tables. ─ When used as a stand-alone tool, the output is provided with exactly the same format as the GNU x86 disassembler for automatic verification.
Emulator (or functional simulator)
Program Loading

1) Parse ELF executable
─ Read ELF sections and symbols. ─ Initialize code and data.
2) Initialize stack
─ Program headers. ─ Arguments. ─ Environment variables.
3) Initialize registers
─ Program entry → eip ─ Stack pointer → esp
[Figure: initial virtual memory image — text and initialized data at 0x08000000, heap growing from 0x08xxxxxx, an uninitialized mmap region at 0x40000000, and the stack with program arguments below 0xc0000000 — together with the initial values for the x86 registers esp (stack pointer) and eip (instruction pointer).]
Emulation Loop
Emulating an x86 instruction:
─ Update x86 registers.
─ Update memory map if needed.
─ Example: add [bp+16], 0x5

Emulating a system call:
─ Analyze system call code and arguments.
─ Update memory map.
─ Update register eax with return value.
─ Example: read(fd, buf, count)
Demo 2
Flowchart (per iteration):
1) Read the instruction at eip → instruction bytes.
2) Decode the instruction → instruction fields.
3) If the instruction is int 0x80, emulate a system call; otherwise, emulate the x86 instruction.
4) Move eip to the next instruction.
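A minimal sketch of this loop, as illustrative Python rather than Multi2Sim's C implementation. The two opcodes used are real x86 encodings (0x40 is inc eax; CD 80 is int 0x80), but the register file and syscall handling are reduced to toys.

```python
# Toy fetch-decode-execute loop in the spirit of the flowchart above.

def emulate(mem, regs):
    while True:
        opcode = mem[regs["eip"]]                  # read instr. at eip
        if opcode == 0xCD and mem[regs["eip"] + 1] == 0x80:
            # int 0x80: analyze syscall code in eax
            if regs["eax"] == 1:                   # Linux exit(), status in ebx
                return regs["ebx"]
            regs["eax"] = 0                        # other syscalls: return value in eax
            regs["eip"] += 2
        elif opcode == 0x40:                       # inc eax
            regs["eax"] += 1
            regs["eip"] += 1
        else:
            raise NotImplementedError(hex(opcode))

mem = {0: 0x40, 1: 0xCD, 2: 0x80}                  # inc eax; int 0x80
regs = {"eip": 0, "eax": 0, "ebx": 42}
print(emulate(mem, regs))  # 42
```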
Timing simulator (or detailed/architectural)
Superscalar Processor
─ 6-stage pipeline with configurable latencies. ─ Supported features include speculative execution, branch prediction, micro- instruction generation, trace caches, out-of-order execution, … ─ Modeled structures include fetch queues, reorder buffer, load-store queues, register files, register mapping tables, ...
[Pipeline diagram: instruction cache and trace cache feed the fetch stage; the fetch queue and trace queue feed decode; the uop queue feeds dispatch into the reorder buffer, instruction queue, and load/store queue; issue sends uops to the ALUs and data cache; writeback updates the register file; commit retires instructions.]
Multithreaded and Multicore Processors
─ Fully replicated superscalar pipelines, connected through caches.
─ Running multiple programs concurrently, or spawning child threads (using OpenMP, pthread, etc.)
─ Replicated superscalar pipelines with partially shared resources. ─ Fine-grain, coarse-grain, and simultaneous multithreading.
Demo 3
[Figure: a multicore multithreaded processor with n cores of m hardware threads each (nodes 0 to nm – 1), connected to the memory hierarchy. Per superscalar core, shared resources: reorder buffer, instruction queue, load/store queue, register file, functional units. Per hardware thread, private resources: program counter, register aliasing table, TLB.]
Benchmark Support
─ SPLASH-2 benchmark suite with pre-compiled x86 executables and data files available on the website.
─ PARSEC-2.1 with pre-compiled x86 executables and data files.
─ SPEC 2000 and SPEC 2006 benchmarks are fully supported. Pre-compiled x86 binaries are available on the website.
─ The Mediabench suite includes program binaries and data files, with everything needed to run them.
Configuration
─ Any number of caches organized in any number of levels. ─ Cache levels connected through default cross-bar interconnects, or complex custom interconnect configurations. ─ Each architecture undergoing a timing simulation specifies its own entry point (cache memory) in the memory hierarchy, for data or instructions. ─ Cache coherence is guaranteed with an implementation of the 5-state MOESI protocol.
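As an illustration, a two-level hierarchy for one x86 core could be described roughly as below. This is a sketch of the INI-style memory configuration from memory; section and key names are assumptions and should be verified against the Multi2Sim 4.1 documentation before use.

```ini
; Illustrative sketch only -- check key names against the Multi2Sim manual.

[CacheGeometry geo-l1]
Sets = 128
Assoc = 2
BlockSize = 64
Latency = 2

[CacheGeometry geo-l2]
Sets = 512
Assoc = 8
BlockSize = 64
Latency = 10

[Module mod-l1]
Type = Cache
Geometry = geo-l1
LowNetwork = net-l1-l2
LowModules = mod-l2

[Module mod-l2]
Type = Cache
Geometry = geo-l2
HighNetwork = net-l1-l2
LowNetwork = net-l2-mm
LowModules = mod-mm

[Module mod-mm]
Type = MainMemory
HighNetwork = net-l2-mm
BlockSize = 64
Latency = 100

[Entry core-0]
Arch = x86
Core = 0
Thread = 0
DataModule = mod-l1
InstModule = mod-l1
```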
Configuration Examples
Three CPU cores with private L1 caches, two L2 caches, and a default cross-bar based interconnect. Cache L2-0 serves physical address range [0, 7ff...ff], and cache L2-1 serves [80...00, ff...ff].

[Figure: cores 0–2 with caches L1-0 to L1-2, connected through a switch to L2-0 and L2-1, which connect through a second switch to main memory.]
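The address-range split between the two L2 caches can be sketched as follows (illustrative Python; a 32-bit physical address space is assumed):

```python
# Serving-module selection by physical address range, as in the example:
# L2-0 serves [0, 0x7fffffff], L2-1 serves [0x80000000, 0xffffffff].

def l2_module(paddr):
    assert 0 <= paddr <= 0xffffffff, "32-bit physical address expected"
    return "L2-0" if paddr < 0x80000000 else "L2-1"

print(l2_module(0x00001000))  # L2-0
print(l2_module(0x80000000))  # L2-1
```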
Demo 4
Four CPU cores with private L1 data caches, L1 instruction caches and L2 caches shared every 2 cores (serving the whole address space), and four main memory modules, connected with a custom network on a ring topology.
[Figure: cores 0–3 with private data L1 caches (Data L1-0 to Data L1-3), instruction L1 caches shared every two cores (Inst. L1-0, Inst. L1-1), and shared L2 caches L2-0 and L2-1; switches sw0–sw3 on a ring connect nodes n0–n5 to main memory modules MM-0 to MM-3.]
Configuration Examples
[Figure: four switches s0–s3 on a ring, each associated with an end-node n0–n3.]

Ring connection between four switches associated with end-nodes, with routing tables calculated automatically based on shortest paths. The resulting routes can contain cycles, potentially leading to routing deadlocks at runtime.
Configuration Examples
[Figure: the same ring of switches s0–s3 and end-nodes n0–n3, now with two virtual channels (Virtual Channel 0 and Virtual Channel 1).]

Ring connection between four switches associated with end-nodes, where a routing cycle has been removed by adding an additional virtual channel.
Configuration Examples
Visual tool
Pipeline Diagrams
─ Cycle bar on main window for navigation. ─ Panel on main window shows software contexts mapped to hardware cores. ─ Clicking on the Detail button opens a secondary window with a pipeline diagram.
Memory Hierarchy
─ Panel on main window shows how memory accesses traverse the memory hierarchy.
─ Clicking on a Detail button opens a secondary window with the cache memory representation.
─ Each row is a set, each column is a way.
─ Each cell shows the tag and block state (indicated by color).
─ Additional columns show the number of sharers and in-flight accesses.
Demo 5
Execution Framework
─ Multi2Sim 4.1 includes a new execution framework for OpenCL, developed in collaboration with the University of Toronto.
─ The new framework more accurately mirrors native execution, and is fully AMD-compliant.
─ When working with x86 kernel binaries, the OpenCL runtime can perform both native and simulated execution correctly.
─ When run natively, an OpenCL call to clGetDeviceIDs returns only the x86 device.
─ When run on Multi2Sim, clGetDeviceIDs returns one device per supported architecture: x86, Evergreen, and Southern Islands devices (more to be added).
Execution Framework
─ The following slides show the modular organization of the OpenCL execution framework, based on 4 software/hardware entities. ─ In each case, we compare native execution with simulated execution.
[Diagram: User application → (API call) → Runtime library → (ABI call) → Device driver → (internal interface) → Hardware. The application and runtime library are user-level code; the driver and below are OS-level code.]
The OpenCL CPU Host Program

Native
An x86 OpenCL host program performs an OpenCL API call.
Multi2Sim
Exact same scenario.
The OpenCL Runtime Library

Native
AMD's OpenCL runtime library handles the call, and communicates with the driver through system calls ioctl, read, write, etc. These are referred to as ABI calls.
Multi2Sim
Multi2Sim's OpenCL runtime library, running with guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.
The OpenCL Device Driver

Native
The AMD Catalyst driver (kernel module) handles the ABI call and communicates with the GPU through the PCIe bus.
Multi2Sim
An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator.
The GPU Emulator

Native
The command processor in the GPU handles the messages received from the driver.
Multi2Sim
The GPU emulator updates its internal state based on the message received from the driver.
Transferring Control
─ The key OpenCL call that effectively triggers GPU execution is clEnqueueNDRangeKernel.
─ The host program performs API call clEnqueueNDRangeKernel. ─ The runtime intercepts the call, and enqueues a new task in an OpenCL command queue object. A user-level thread associated with the command queue eventually processes the command, performing a LaunchKernel ABI call. ─ The driver intercepts the ABI call, reads ND-Range parameters, and launches the GPU emulator. ─ The GPU emulator enters a simulation loop until the ND-Range completes.
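The runtime-side mechanism — a command enqueued into a queue object, drained by a user-level thread that eventually performs the LaunchKernel ABI call — can be sketched as follows. This is an illustrative Python model, not the actual runtime code; the ABI call is stood in for by appending to a list.

```python
import queue
import threading

# Toy model of an OpenCL command queue: the host enqueues commands, and a
# user-level thread processes them, eventually performing the LaunchKernel
# ABI call (modeled here as appending a record to abi_calls).

abi_calls = []

def command_queue_thread(q):
    while True:
        cmd = q.get()
        if cmd is None:                       # shutdown marker
            return
        if cmd[0] == "ndrange_kernel":
            abi_calls.append(("LaunchKernel", cmd[1]))

q = queue.Queue()
worker = threading.Thread(target=command_queue_thread, args=(q,))
worker.start()

# clEnqueueNDRangeKernel: enqueue a task; the worker performs the ABI call
q.put(("ndrange_kernel", {"global_size": 1024, "local_size": 64}))
q.put(None)
worker.join()
print(abi_calls)  # [('LaunchKernel', {'global_size': 1024, 'local_size': 64})]
```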
Execution Model
─ Work-items execute multiple instances of the same kernel code.
─ Work-groups are sets of work-items that can synchronize and communicate efficiently.
─ The ND-Range is composed of all work-groups, which do not communicate with each other and can execute in any order.
[Figure: the ND-Range is a grid of work-groups sharing global memory; each work-group is a set of work-items sharing local memory; each work-item runs __kernel func() with its own private memory.]
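The work-item indexing implied by this hierarchy can be sketched as follows (illustrative Python, one dimension only):

```python
# For a 1-D ND-Range, a work-item's global ID is derived from its
# work-group ID and its local ID within the group.

def get_global_id(group_id, local_id, local_size):
    return group_id * local_size + local_id

# ND-Range of 256 work-items split into work-groups of 64:
ids = [get_global_id(g, l, 64) for g in range(4) for l in range(64)]
print(ids[:3], ids[-1])  # [0, 1, 2] 255
```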
Execution Model
─ When the kernel is launched by the Southern Islands driver, the OpenCL ND-Range is mapped to the compute device (Fig. a). ─ The work-groups are mapped to the compute units (Fig. b). ─ The work-items are executed by the SIMD lanes (Fig. c). ─ This is a simplified view of the architecture, explored later in more detail.
[Figure: a) compute device — an ultra-threaded dispatcher feeds Compute Unit 0 to Compute Unit 31, all connected to global memory; b) compute unit — wavefront scheduler, register file, local memory, and SIMD units 0–3; c) SIMD lane (stream core) — lanes 0–15 with integer and floating-point functional units.]
Disassembler
SIMD Execution
─ OpenCL follows a Single-Program Multiple-Data (SPMD) programming model, which maps to a Single-Instruction Multiple-Data (SIMD) execution model.
─ A wavefront is a fixed set of work-items forming the SIMD execution unit: one single instruction is fetched, and multiple instances of it are executed. Trade-off: 1 wavefront = 64 work-items.
─ Amortizes ports in the instruction cache and fetch hardware in general.
─ Makes it easier to create work-groups whose size is an exact multiple of the wavefront size, reducing waste in the wavefront tail.
─ Reduces the effects of thread divergence in conditional execution or loops.
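The work-group-to-wavefront split and the tail waste mentioned above can be sketched as follows (illustrative Python):

```python
import math

WAVEFRONT_SIZE = 64

def wavefronts(work_group_size):
    """Number of wavefronts for a work-group, and idle lanes in the tail."""
    n = math.ceil(work_group_size / WAVEFRONT_SIZE)
    waste = n * WAVEFRONT_SIZE - work_group_size
    return n, waste

print(wavefronts(256))  # (4, 0)  exact multiple: no tail waste
print(wavefronts(100))  # (2, 28) 28 idle lanes in the last wavefront
```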
Scalar Instructions
─ Sometimes work-items within a wavefront not only execute the same instruction, but also operate on the same data.
─ The compiler can detect some of these cases and emit scalar instructions, issued to a special execution unit of much lower cost. Examples:
─ Loading the base address of a buffer.
─ Loading values from constant memory.
─ Loop initialization, comparison, and increments.
Vector Addition Kernel
__kernel void vector_add(
	__read_only __global int *src1,
	__read_only __global int *src2,
	__write_only __global int *dst)
{
	int id = get_global_id(0);
	dst[id] = src1[id] + src2[id];
}

Southern Islands assembly. Scalar instructions (using scalar registers s0–s23) set up the kernel arguments and the work-item's global ID; vector instructions (using vector registers v0–v2) compute the addresses, perform the loads, the addition, and the store:

s_buffer_load_dword s0, s[4:7], 0x04
s_buffer_load_dword s1, s[4:7], 0x18
s_buffer_load_dword s4, s[8:11], 0x00
s_buffer_load_dword s5, s[8:11], 0x04
s_buffer_load_dword s6, s[8:11], 0x08
s_load_dwordx4 s[8:11], s[2:3], 0x58
s_load_dwordx4 s[16:19], s[2:3], 0x60
s_load_dwordx4 s[20:23], s[2:3], 0x50
s_waitcnt lgkmcnt(0)
s_min_u32 s0, s0, 0x0000ffff
v_mov_b32 v1, s0
v_mul_i32_i24 v1, s12, v1
v_add_i32 v0, vcc, v0, v1
v_add_i32 v0, vcc, s1, v0
v_lshlrev_b32 v0, 2, v0
v_add_i32 v1, vcc, s4, v0
v_add_i32 v2, vcc, s5, v0
v_add_i32 v0, vcc, s6, v0
tbuffer_load_format_x v1, v1, s[8:11], 0
tbuffer_load_format_x v2, v2, s[16:19], 0
s_waitcnt vmcnt(0)
v_add_i32 v1, vcc, v1, v2
tbuffer_store_format_x v1, v0, s[20:23], 0
s_endpgm
Conditional Statements
__kernel void if_kernel(__global int *v)
{
	uint id = get_global_id(0);
	if (id < 5)
		v[id] = 10;
}
s_buffer_load_dword s0, s[8:11], 0x04
s_buffer_load_dword s1, s[8:11], 0x18
s_waitcnt lgkmcnt(0)
s_min_u32 s0, s0, 0x0000ffff
v_mov_b32 v1, s0
v_mul_i32_i24 v1, s16, v1
v_add_i32 v0, vcc, v0, v1
v_add_i32 v0, vcc, s1, v0
s_buffer_load_dword s0, s[12:15], 0x00
v_cmp_lt_u32 s[2:3], v0, 5
s_and_saveexec_b64 s[2:3], s[2:3]
v_lshlrev_b32 v0, 2, v0
s_waitcnt lgkmcnt(0)
v_add_i32 v0, vcc, s0, v0
v_mov_b32 v1, 10
tbuffer_store_format_x v1, v0, s[4:7], 0
s_mov_b64 exec, s[2:3]
s_endpgm

Callouts: v_cmp_lt_u32 performs the comparison, s_and_saveexec_b64 saves the active mask, tbuffer_store_format_x stores the value 10 for the active work-items, and the final s_mov_b64 restores the active mask.
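The active-mask mechanics can be sketched as follows. This is an illustrative Python model of the concept for one wavefront (8 lanes instead of 64, for brevity), not of the simulator's implementation.

```python
# Emulate predicated execution of `if (id < 5) v[id] = 10;`
# for one wavefront of 8 work-items.

LANES = 8
v = [0] * LANES
exec_mask = [True] * LANES                    # all lanes initially active

saved = list(exec_mask)                       # s_and_saveexec_b64: save mask...
exec_mask = [id < 5 for id in range(LANES)]   # ...and AND in the comparison

for lane in range(LANES):                     # tbuffer_store: active lanes write
    if exec_mask[lane]:
        v[lane] = 10

exec_mask = saved                             # s_mov_b64 exec: restore mask
print(v)  # [10, 10, 10, 10, 10, 0, 0, 0]
```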
Emulator (or functional simulator)
Program Loading
[Diagram: User application → (API call) → Runtime library → (ABI call) → Device driver → (internal interface) → Hardware.]
─ The device driver is the responsible module for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction. ─ Natively, it writes on hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.
The driver initializes:
─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary.
─ The initial global memory image, copying global buffers from CPU to GPU memory.
─ Kernel arguments.
─ The ND-Range topology, including number of dimensions and sizes.
Flowchart:
1) Split the ND-Range into work-groups, which enter the work-group pool.
2) While work-groups are left, grab one and split it into wavefronts, which enter the wavefront pool.
3) While wavefronts are left, execute each running wavefront not stalled in a barrier.
4) When no work-groups or wavefronts remain, the emulation ends.
Emulation Loop
Demo 6
─ Work-groups can execute in any order. This order is irrelevant for emulation purposes. ─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.
─ Wavefronts within a work-group can also execute in any order, as long as synchronizations are considered. ─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.
Timing simulator (or detailed/architectural)
The GPU Architecture
[Figure: a command processor and ultra-threaded dispatcher feed Compute Unit 0 to Compute Unit 31, each with an L1 cache; a crossbar connects the L1 caches to the main memory hierarchy (L2 caches, memory controllers, video memory).]
─ A command processor receives and processes commands from the host.
─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.
─ The following slides present the architecture of each of the compute units.
The Compute Unit
─ The instruction memory of each compute unit contains the OpenCL kernel.
─ A front-end fetches instructions and sends them to the appropriate execution unit.
─ There is one scalar unit, one vector-memory unit, one branch unit, and one LDS (local data store) unit.
─ There are multiple instances of SIMD units.
The Front-End
─ Work-groups are split into wavefronts and allocated to wavefront pools.
─ Fetch and issue stages operate in a round-robin fashion.
─ There is one SIMD unit associated with each wavefront pool.
[Figure: the front-end fetches from four wavefront pools into per-pool fetch buffers, and issues into the SIMD issue buffers (one per matching wavefront pool) and the scalar, branch, vector memory, and LDS unit issue buffers.]
The SIMD Unit

─ Runs arithmetic-logic vector instructions.
─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.
─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.
─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles.
─ The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively.
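The subwavefront pipelining in the execute stage can be sketched as follows (illustrative Python; cycle numbers are relative to the first issue):

```python
# A 64-work-item wavefront is split into 4 subwavefronts of 16 work-items,
# issued to the 16 SIMD lanes in 4 consecutive cycles.

WAVEFRONT_SIZE, LANES = 64, 16

def issue_schedule():
    """Map each work-item to its (lane, issue cycle) in the execute stage."""
    schedule = {}
    for wi in range(WAVEFRONT_SIZE):
        subwavefront, lane = divmod(wi, LANES)
        schedule[wi] = (lane, subwavefront)   # subwavefront i issues in cycle i
    return schedule

s = issue_schedule()
print(s[0], s[16], s[63])  # (0, 0) (0, 1) (15, 3)
```

This reproduces the figure's mapping: lane 0 handles work-items 0, 16, 32, 48 over four cycles, and lane 15 handles work-items 15, 31, 47, 63.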
The SIMD Unit
[Figure: SIMD unit pipeline — decode, issue, read, execute, and write buffers connect the decode, read, execute, and complete stages; the execute stage pipelines work-items 0/16/32/48 (lane 0) through 15/31/47/63 (lane 15) over the functional units; the vector register file is read before and written after execution.]
The Scalar Unit
─ Runs both arithmetic-logic and memory scalar instructions. ─ Modeled with 5 stages – decode, read, execute/memory, write, complete.
[Figure: scalar unit pipeline — decode, read (scalar register file), execute/memory, write, and complete stages, connected through decode, issue, read, execute, and write buffers.]
The Vector-Memory Unit
─ Runs vector memory instructions. ─ Modeled with 5 stages – decode, read, memory, write, complete. ─ Accesses to the global memory hierarchy happen mainly in this unit.
[Figure: vector-memory unit pipeline — decode, read (vector register file), memory (global memory), write, and complete stages, connected through decode, issue, read, memory, and write buffers.]
The Branch Unit
─ Runs branch instructions. These instructions decide whether to make an entire wavefront jump to a target address depending on the scalar condition code. ─ Modeled with 5 stages – decode, read, execute/memory, write, complete.
[Figure: branch unit pipeline — decode, read (scalar register file, condition codes), execute, write (program counter), and complete stages, connected through decode, issue, read, execute, and write buffers.]
The LDS (Local Data Store) Unit
─ Runs local memory access instructions.
─ Modeled with 5 stages – decode, read, execute/memory, write, complete.
─ The memory stage accesses the compute unit's local memory for reads and writes.
[Figure: LDS unit pipeline — decode, read (vector register file), memory (local memory), write, and complete stages, connected through decode, issue, read, memory, and write buffers.]
Benchmark Support
─ Support for other applications from the Rodinia and Parboil benchmark suites has been added by other users. Packages are under development.
─ Host programs are statically linked x86 binaries.
─ The SDK's device kernels are available for three different architectures: Evergreen, Southern Islands, and x86.
─ Packages include data files and execution commands.
Organization
─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU. ─ One 16KB data L1 per compute unit. ─ One scalar L1 cache shared by every 4 compute units. ─ Six L2 banks with a total size of 128KB, each connected to a DRAM module.
[Figure: compute units 0–31 with private L1 caches; one scalar cache shared per group of 4 compute units; an interconnect links them to the L2 banks, each attached to a DRAM module.]
Policies
─ Inclusive, write-back, non-coherent caches.
─ Non-coherence is implemented as a 3-state “coherence” protocol: NSI.
─ N (non-coherent): the block can be accessed for read/write, while shared with other caches.
─ S (shared): the block can be accessed for read only, possibly shared with other caches.
─ I (invalid): the block does not contain valid data.
─ Blocks in N state are merged on write-back using write bit masks.
[Figure: two L1 caches write back the same block to the L2 cache with complementary write bit masks (1111 0000 and 0000 1111); the cache block is merged on write-back.]
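The write-back merge with bit masks can be sketched as follows (illustrative Python, with 8-byte blocks):

```python
# Two L1 caches hold the same block in N state; each tracks which bytes it
# wrote with a write mask. On write-back, L2 merges only the dirty bytes.

def merge_writeback(l2_block, l1_block, write_mask):
    """Merge an N-state block into L2, byte by byte, using its write mask."""
    return [l1 if dirty else l2
            for l2, l1, dirty in zip(l2_block, l1_block, write_mask)]

l2   = [0, 0, 0, 0, 0, 0, 0, 0]
l1_a = [7, 7, 7, 7, 9, 9, 9, 9]   # wrote first half  (mask 1111 0000)
l1_b = [3, 3, 3, 3, 5, 5, 5, 5]   # wrote second half (mask 0000 1111)

l2 = merge_writeback(l2, l1_a, [1, 1, 1, 1, 0, 0, 0, 0])
l2 = merge_writeback(l2, l1_b, [0, 0, 0, 0, 1, 1, 1, 1])
print(l2)  # [7, 7, 7, 7, 5, 5, 5, 5]
```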
Visual tool
Demo 7
Main Window and Timing Diagram
─ Main window provides cycle-by-cycle navigation throughout simulation. ─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups. ─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.
Motivation
─ Cooperative work-group execution. ─ Next-level system heterogeneity.
─ The programmer writes one single OpenCL kernel.
─ The OpenCL runtime creates one kernel binary targeting each architecture present in the system.
─ The ND-Range is dynamically and automatically partitioned, and work-groups are spread across CPU and GPU cores.
─ The data-parallel programming model's complexity resides in the design of the OpenCL kernel, not in cross-device scheduling.
─ System resources (cores) are better utilized and transparently managed.
Current Heterogeneous Execution
[Figure: the complete ND-Range (WG-0 to WG-N) is sent to the Southern Islands hardware; a timeline shows the host program on one x86 core, the device kernel on the S.I. compute units, and idle regions on the remaining x86 cores.]
─ Complete ND-Range sent to one device, selected by the user with a call to clGetDeviceIDs.
─ Host program runs on one x86 core, device kernel runs on Southern Islands compute units. ─ Idle execution regions on x86 cores while Southern Islands compute units run the ND-Range.
Heterogeneous Execution with ND-Range Virtualization
[Figure: the ND-Range (WG-0 to WG-N) is now partitioned across both S.I. compute units and x86 cores; the timeline shows x86 cores running the host program plus kernel portions, with no idle regions.]
─ Portions of the ND-Range are executed by CPU/GPU cores with different ISAs.
─ Work-groups are mapped to CPU and GPU cores as they become available.
─ x86 cores run both the host program and a portion of the ND-Range.
─ Idle regions are removed during the execution of the ND-Range.
The Hardware
─ Protocols MOESI and NSI merged into a single 6-state protocol: NMOESI.
─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy.
─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store.
─ GPUs and CPUs running OpenCL kernels issue n-store write accesses; the rest issue regular store accesses.
[Figure: ARM, x86, Evergreen, Southern Islands, and Fermi cores, each with a private L1 cache, connected through an interconnect to the L2 banks by means of an NMOESI interface.]
The Software
The OpenCL host program is simplified. Only one command queue needed to exploit system heterogeneity.
[Diagram: User application → (API call) → Runtime library → (ABI call) → Device driver → (internal interface) → Hardware.]
The runtime provides an additional virtual fused device, returned as a result of a call to clGetDeviceIDs. The driver extends its interface to allow each physical device to run a portion of the ND- Range, at the work-group granularity.
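Handing out the ND-Range at work-group granularity can be sketched as follows. This is an illustrative Python model, not the driver interface itself: each device repeatedly grabs the next pending work-group, so faster devices naturally execute more of them; device names and relative speeds are made up for the example.

```python
from collections import defaultdict

# Toy ND-Range virtualization: greedy first-come first-served assignment
# of work-groups to the device that would finish each one earliest.

def partition(num_work_groups, device_speeds):
    """Assign work-group IDs to devices proportionally to their speed."""
    finish_time = {dev: 0.0 for dev in device_speeds}
    assignment = defaultdict(list)
    for wg in range(num_work_groups):
        dev = min(device_speeds,
                  key=lambda d: finish_time[d] + 1.0 / device_speeds[d])
        finish_time[dev] += 1.0 / device_speeds[dev]
        assignment[dev].append(wg)
    return dict(assignment)

# A GPU 4x faster than each of two x86 cores gets 2/3 of the work-groups:
out = partition(12, {"si-gpu": 4.0, "x86-core0": 1.0, "x86-core1": 1.0})
print({d: len(wgs) for d, wgs in out.items()})
# {'si-gpu': 8, 'x86-core0': 2, 'x86-core1': 2}
```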
Additional Material
─ Complete documentation of the simulator's user interface, simulation models, and additional tools.
─ New version releases and other important information are posted on the Multi2Sim mailing list (no spam, 1 email per month).
─ Users share questions and knowledge on the website forum.
─ Automatic verification framework for Multi2Sim.
─ Based on a cluster of computers running Condor.
Multi2C – The Multi2Sim Compiler
─ LLVM-based compiler for OpenCL and CUDA kernels.
─ The future release Multi2Sim 4.2 will include a working version.
─ Diagrams show progress as per SVN Rev. 1838.
Compilation flow: vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm → LLVM to Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin. Similarly, vec-add.cu → CUDA to LLVM front-end → vec-add.llvm → LLVM to Fermi or Kepler back-end → vec-add.s → Fermi/Kepler assembler → vec-add.cubin.
Academic Efforts at Northeastern
─ We started an unofficial seminar that students can voluntarily attend. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
─ Average attendance of ~25 students per semester.
─ An official alternative equivalent to a 4-credit course that an undergraduate student can optionally enroll in, collaborating in Multi2Sim development.
─ Many research projects at the graduate level are based on Multi2Sim, and are selectively included in the development trunk for public access.
─ Simulation of OpenGL pipelines, and support for new CPU/GPU architectures, among others.
Collaborating Research Groups
─ Pedro López, Salvador Petit, Julio Sahuquillo, José Duato.
─ Chris Barton, Shu Chen, Zhongliang Chen, Tahir Diop, Xiang Gong, David Kaeli, Nicholas Materise, Perhaad Mistry, Dana Schaa, Rafael Ubal, Mark Wilkening, Ang Shen, Tushar Swamy, Amir Ziabari.
─ Byunghyun Jang
─ Norm Rubin
─ Jason Anderson, Natalie Enright, Steven Gurfinkel, Tahir Diop.
─ Rustam Miftakhutdinov
Multi2Sim Academic Publications
─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007.
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012.
─ Programming and Simulating Fused Devices — OpenCL and Multi2Sim, ICPE, 2012.
─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013.
─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013.
Published Academic Works Using Multi2Sim
─ R. Miftakhutdinov, E. Ebrahimi, Y. Patt, Predicting Performance Impact of DVFS for Realistic Memory Systems, MICRO, 2012. ─ D. Lustig, M. Martonosi, Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization, HPCA, 2013.
─ H. Calborean, R. Jahr, T. Ungerer, L. Vintan, A Comparison of Multi-objective Algorithms for the Automatic Design Space Exploration of a Superscalar System, Advances in Intelligent Systems and Computing, vol. 187. ─ X. Li, C. Wang, X. Zhou, Z. Zhu, Cache Promotion Policy Using Re-reference Interval Prediction, CLUSTER, 2012. … and 62 more citations, as per Google Scholar.
Sponsors