Simulation of OpenCL and APUs on Multi2Sim 4.1
Rafael Ubal, David Kaeli
ISCA 2013, Tel-Aviv 2
Outline

Introduction
─ Simulation methodology

Part 1 – Simulation of an x86 CPU
─ Disassembler
─ Emulation
─ Timing simulation
─ Memory hierarchy
─ Visualization tool

Part 2 – Simulation of a Southern Islands GPU
─ OpenCL from host to device
─ Disassembler
─ Emulation
─ Timing simulation
─ Visualization tool
─ Case study: ND-Range virtualization

Part 3 – Concluding Remarks
─ Additional Projects
─ The Multi2Sim Community
Getting Started
─ Machine: fusion1.ece.neu.edu
─ User: isca1, isca2, isca3, ...
─ Password: isca2013
Getting Started
$ ssh isca<N>@fusion1.ece.neu.edu -X

(Notice the X forwarding for later demos using graphics.)

$ wget http://www.multi2sim.org/files/multi2sim-4.1.tar.gz
$ tar -xzf multi2sim-4.1.tar.gz
$ cd multi2sim-4.1
$ ./configure && make
$ ls demo1 demo2 demo3 demo4 demo5 demo6 demo7 README
All files needed for each demo are present in its corresponding directory. README files describe commands to run and interpretation of outputs.
First Execution
#include <stdio.h>

int main(int argc, char **argv)
{
	int i;

	printf("Number of arguments: %d\n", argc);
	for (i = 0; i < argc; i++)
		printf("\targv[%d] = %s\n", i, argv[i]);
	return 0;
}
Demo 1
$ test-args hello there
Number of arguments: 3
	argv[0] = test-args
	argv[1] = hello
	argv[2] = there

$ m2s test-args hello there
< Simulator message in stderr >
Number of arguments: 3
	argv[0] = test-args
	argv[1] = hello
	argv[2] = there
< Simulator statistics >
Simulator Input/Output Files

─ Configuration files.
─ Output statistics.
─ Statistic summary in standard error output.

These files follow an INI-style format:

; This is a comment.
[ Section 0 ]
Color = Red
Height = 40

[ OtherSection ]
Variable = Value
Application-Only vs. Full-System
Full-system simulator core: virtualization of the complete processor ISA and I/O hardware. An entire OS runs on top of the simulator, with guest programs running on top of the OS. The simulator models the entire ISA and virtualizes native hardware devices, similar to a virtual machine. Very accurate simulations, but extremely slow.

Application-only simulator core: virtualization of a user-space subset of the ISA and the system call interface. Only the guest programs run on top of the simulator. The simulator models a user-space subset of the ISA, and needs to virtualize the system call interface (ABI). Multi2Sim falls in this category.
Four-Stage Simulation Process
─ Four clearly different software modules per architecture (x86, MIPS, ...) ─ Each module has a standard interface for stand-alone execution, or interaction with other modules.
─ Disassembler: takes instruction bytes from an executable ELF file and produces instruction fields; as a stand-alone tool, it prints an instruction dump.
─ Emulator (or functional simulator): takes an executable file and program arguments, runs the program one instruction at a time, and provides instruction information.
─ Timing simulator (or detailed/architectural simulator): takes an executable file, program arguments, and a processor configuration, and produces performance statistics and a pipeline trace.
─ Visual tool: takes a pipeline trace and user interaction, and provides cycle navigation and timing diagrams.
Current Architecture Support
─ 4 GPU + 3 CPU architectures supported or in progress. ─ This tutorial will focus on x86 and AMD Southern Islands.
Architecture           Disasm.       Emulation     Timing simulation   Graphic pipelines
ARM                    X             In progress   –                   –
MIPS                   X             –             –                   –
x86                    X             X             X                   X
AMD Evergreen          X             X             X                   X
AMD Southern Islands   X             X             X                   X
NVIDIA Fermi           X             In progress   –                   –
NVIDIA Kepler          In progress   –             –                   –
Disassembler
08048900 <_start>:
 8048900: 31 ed                 xor    ebp,ebp
 8048902: 5e                    pop    esi
 8048903: 89 e1                 mov    ecx,esp
 8048905: 83 e4 f0              and    esp,0xfffffff0
 8048908: 50                    push   eax
 8048909: 54                    push   esp
 804890a: 52                    push   edx
 804890b: 68 70 91 04 08        push   0x8049170
 ...

$ objdump -S -M intel test-args
$ m2s --x86-disasm test-args
Methodology
─ Implementation of an efficient instruction decoder based on lookup tables. ─ When used as a stand-alone tool, the output is provided with exactly the same format as the GNU x86 disassembler for automatic verification.
Emulator (or functional simulator)
Program Loading

1) Parse ELF executable
─ Read ELF sections and symbols. ─ Initialize code and data.
2) Initialize stack
─ Program headers. ─ Arguments. ─ Environment variables.
3) Initialize registers
─ Program entry → eip ─ Stack pointer → esp
[Figure: initial virtual memory image — text and initialized data at 0x08000000, heap growing from 0x08xxxxxx, an uninitialized mmap region at 0x40000000, and the stack with program arguments below 0xc0000000 — together with the initial values for the x86 registers esp (stack pointer) and eip (instruction pointer).]
Emulation Loop
Emulating an x86 instruction:
─ Update x86 registers.
─ Update memory map if needed.
─ Example: add [bp+16], 0x5

Emulating a system call:
─ Analyze system call code and arguments.
─ Update memory map.
─ Update register eax with return value.
─ Example: read(fd, buf, count)
Demo 2
Flowchart (per iteration):
1) Read the instruction at eip → instruction bytes.
2) Decode the instruction → instruction fields.
3) If the instruction is int 0x80, emulate a system call; otherwise, emulate the x86 instruction.
4) Move eip to the next instruction.
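A minimal sketch of this loop, as illustrative Python rather than Multi2Sim's C implementation. The two opcodes used are real x86 encodings (0x40 is inc eax; CD 80 is int 0x80), but the register file and syscall handling are reduced to toys.

```python
# Toy fetch-decode-execute loop in the spirit of the flowchart above.

def emulate(mem, regs):
    while True:
        opcode = mem[regs["eip"]]                  # read instr. at eip
        if opcode == 0xCD and mem[regs["eip"] + 1] == 0x80:
            # int 0x80: analyze syscall code in eax
            if regs["eax"] == 1:                   # Linux exit(), status in ebx
                return regs["ebx"]
            regs["eax"] = 0                        # other syscalls: return value in eax
            regs["eip"] += 2
        elif opcode == 0x40:                       # inc eax
            regs["eax"] += 1
            regs["eip"] += 1
        else:
            raise NotImplementedError(hex(opcode))

mem = {0: 0x40, 1: 0xCD, 2: 0x80}                  # inc eax; int 0x80
regs = {"eip": 0, "eax": 0, "ebx": 42}
print(emulate(mem, regs))  # 42
```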
Timing simulator (or detailed/architectural)
Superscalar Processor
─ 6-stage pipeline with configurable latencies. ─ Supported features include speculative execution, branch prediction, micro- instruction generation, trace caches, out-of-order execution, … ─ Modeled structures include fetch queues, reorder buffer, load-store queues, register files, register mapping tables, ...
[Pipeline diagram: instruction cache and trace cache feed the fetch stage; the fetch queue and trace queue feed decode; the uop queue feeds dispatch into the reorder buffer, instruction queue, and load/store queue; issue sends uops to the ALUs and data cache; writeback updates the register file; commit retires instructions.]
Multithreaded and Multicore Processors
─ Fully replicated superscalar pipelines, connected through caches.
─ Running multiple programs concurrently, or spawning child threads (using OpenMP, pthread, etc.)
─ Replicated superscalar pipelines with partially shared resources. ─ Fine-grain, coarse-grain, and simultaneous multithreading.
Demo 3
[Figure: a multicore multithreaded processor with n cores of m hardware threads each (nodes 0 to nm – 1), connected to the memory hierarchy. Per superscalar core, shared resources: reorder buffer, instruction queue, load/store queue, register file, functional units. Per hardware thread, private resources: program counter, register aliasing table, TLB.]
Benchmark Support
─ SPLASH-2 benchmark suite with pre-compiled x86 executables and data files available on the website.
─ PARSEC-2.1 with pre-compiled x86 executables and data files.
─ SPEC 2000 and SPEC 2006 benchmarks are fully supported. Pre-compiled x86 binaries are available on the website.
─ The Mediabench suite includes program binaries and data files, with everything needed to run them.
Configuration
─ Any number of caches organized in any number of levels. ─ Cache levels connected through default cross-bar interconnects, or complex custom interconnect configurations. ─ Each architecture undergoing a timing simulation specifies its own entry point (cache memory) in the memory hierarchy, for data or instructions. ─ Cache coherence is guaranteed with an implementation of the 5-state MOESI protocol.
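As an illustration, a two-level hierarchy for one x86 core could be described roughly as below. This is a sketch of the INI-style memory configuration from memory; section and key names are assumptions and should be verified against the Multi2Sim 4.1 documentation before use.

```ini
; Illustrative sketch only -- check key names against the Multi2Sim manual.

[CacheGeometry geo-l1]
Sets = 128
Assoc = 2
BlockSize = 64
Latency = 2

[CacheGeometry geo-l2]
Sets = 512
Assoc = 8
BlockSize = 64
Latency = 10

[Module mod-l1]
Type = Cache
Geometry = geo-l1
LowNetwork = net-l1-l2
LowModules = mod-l2

[Module mod-l2]
Type = Cache
Geometry = geo-l2
HighNetwork = net-l1-l2
LowNetwork = net-l2-mm
LowModules = mod-mm

[Module mod-mm]
Type = MainMemory
HighNetwork = net-l2-mm
BlockSize = 64
Latency = 100

[Entry core-0]
Arch = x86
Core = 0
Thread = 0
DataModule = mod-l1
InstModule = mod-l1
```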
Configuration Examples
Three CPU cores with private L1 caches, two L2 caches, and a default cross-bar based interconnect. Cache L2-0 serves physical address range [0, 7ff...ff], and cache L2-1 serves [80...00, ff...ff].

[Figure: cores 0–2 with caches L1-0 to L1-2, connected through a switch to L2-0 and L2-1, which connect through a second switch to main memory.]
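The address-range split between the two L2 caches can be sketched as follows (illustrative Python; a 32-bit physical address space is assumed):

```python
# Serving-module selection by physical address range, as in the example:
# L2-0 serves [0, 0x7fffffff], L2-1 serves [0x80000000, 0xffffffff].

def l2_module(paddr):
    assert 0 <= paddr <= 0xffffffff, "32-bit physical address expected"
    return "L2-0" if paddr < 0x80000000 else "L2-1"

print(l2_module(0x00001000))  # L2-0
print(l2_module(0x80000000))  # L2-1
```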
Demo 4
Four CPU cores with private L1 data caches, L1 instruction caches and L2 caches shared every 2 cores (serving the whole address space), and four main memory modules, connected with a custom network on a ring topology.
[Figure: cores 0–3 with private data L1 caches (Data L1-0 to Data L1-3), instruction L1 caches shared every two cores (Inst. L1-0, Inst. L1-1), and shared L2 caches L2-0 and L2-1; switches sw0–sw3 on a ring connect nodes n0–n5 to main memory modules MM-0 to MM-3.]
Configuration Examples
[Figure: four switches s0–s3 on a ring, each associated with an end-node n0–n3.]

Ring connection between four switches associated with end-nodes, with routing tables calculated automatically based on shortest paths. The resulting routes can contain cycles, potentially leading to routing deadlocks at runtime.
Configuration Examples
[Figure: the same ring of switches s0–s3 and end-nodes n0–n3, now with two virtual channels (Virtual Channel 0 and Virtual Channel 1).]

Ring connection between four switches associated with end-nodes, where a routing cycle has been removed by adding an additional virtual channel.
Configuration Examples
Visual tool
Pipeline Diagrams
─ Cycle bar on main window for navigation. ─ Panel on main window shows software contexts mapped to hardware cores. ─ Clicking on the Detail button opens a secondary window with a pipeline diagram.
Memory Hierarchy
─ Panel on main window shows how memory accesses traverse the memory hierarchy.
─ Clicking on a Detail button opens a secondary window with the cache memory representation.
─ Each row is a set, each column is a way.
─ Each cell shows the tag and block state (indicated by color).
─ Additional columns show the number of sharers and in-flight accesses.
Demo 5
Execution Framework
─ Multi2Sim 4.1 includes a new execution framework for OpenCL, developed in collaboration with the University of Toronto.
─ The new framework more accurately mirrors native execution, and is fully AMD-compliant.
─ When working with x86 kernel binaries, the OpenCL runtime can perform both native and simulated execution correctly.
─ When run natively, an OpenCL call to clGetDeviceIDs returns only the x86 device.
─ When run on Multi2Sim, clGetDeviceIDs returns one device per supported architecture: x86, Evergreen, and Southern Islands devices (more to be added).
Execution Framework
─ The following slides show the modular organization of the OpenCL execution framework, based on 4 software/hardware entities. ─ In each case, we compare native execution with simulated execution.
[Diagram: User application → (API call) → Runtime library → (ABI call) → Device driver → (internal interface) → Hardware. The application and runtime library are user-level code; the driver and below are OS-level code.]
The OpenCL CPU Host Program

Native
An x86 OpenCL host program performs an OpenCL API call.
Multi2Sim
Exact same scenario.
The OpenCL Runtime Library

Native
AMD's OpenCL runtime library handles the call, and communicates with the driver through system calls ioctl, read, write, etc. These are referred to as ABI calls.
Multi2Sim
Multi2Sim's OpenCL runtime library, running with guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.
The OpenCL Device Driver

Native
The AMD Catalyst driver (kernel module) handles the ABI call and communicates with the GPU through the PCIe bus.
Multi2Sim
An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator.
The GPU Emulator

Native
The command processor in the GPU handles the messages received from the driver.
Multi2Sim
The GPU emulator updates its internal state based on the message received from the driver.
Transferring Control
─ The key OpenCL call that effectively triggers GPU execution is clEnqueueNDRangeKernel.
─ The host program performs API call clEnqueueNDRangeKernel. ─ The runtime intercepts the call, and enqueues a new task in an OpenCL command queue object. A user-level thread associated with the command queue eventually processes the command, performing a LaunchKernel ABI call. ─ The driver intercepts the ABI call, reads ND-Range parameters, and launches the GPU emulator. ─ The GPU emulator enters a simulation loop until the ND-Range completes.
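The runtime-side mechanism — a command enqueued into a queue object, drained by a user-level thread that eventually performs the LaunchKernel ABI call — can be sketched as follows. This is an illustrative Python model, not the actual runtime code; the ABI call is stood in for by appending to a list.

```python
import queue
import threading

# Toy model of an OpenCL command queue: the host enqueues commands, and a
# user-level thread processes them, eventually performing the LaunchKernel
# ABI call (modeled here as appending a record to abi_calls).

abi_calls = []

def command_queue_thread(q):
    while True:
        cmd = q.get()
        if cmd is None:                       # shutdown marker
            return
        if cmd[0] == "ndrange_kernel":
            abi_calls.append(("LaunchKernel", cmd[1]))

q = queue.Queue()
worker = threading.Thread(target=command_queue_thread, args=(q,))
worker.start()

# clEnqueueNDRangeKernel: enqueue a task; the worker performs the ABI call
q.put(("ndrange_kernel", {"global_size": 1024, "local_size": 64}))
q.put(None)
worker.join()
print(abi_calls)  # [('LaunchKernel', {'global_size': 1024, 'local_size': 64})]
```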
Execution Model
─ Work-items execute multiple instances of the same kernel code.
─ Work-groups are sets of work-items that can synchronize and communicate efficiently.
─ The ND-Range is composed of all work-groups, which do not communicate with each other and can execute in any order.
[Figure: the ND-Range is a grid of work-groups sharing global memory; each work-group is a set of work-items sharing local memory; each work-item runs __kernel func() with its own private memory.]
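The work-item indexing implied by this hierarchy can be sketched as follows (illustrative Python, one dimension only):

```python
# For a 1-D ND-Range, a work-item's global ID is derived from its
# work-group ID and its local ID within the group.

def get_global_id(group_id, local_id, local_size):
    return group_id * local_size + local_id

# ND-Range of 256 work-items split into work-groups of 64:
ids = [get_global_id(g, l, 64) for g in range(4) for l in range(64)]
print(ids[:3], ids[-1])  # [0, 1, 2] 255
```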
Execution Model
─ When the kernel is launched by the Southern Islands driver, the OpenCL ND-Range is mapped to the compute device (Fig. a). ─ The work-groups are mapped to the compute units (Fig. b). ─ The work-items are executed by the SIMD lanes (Fig. c). ─ This is a simplified view of the architecture, explored later in more detail.
[Figure: a) compute device — an ultra-threaded dispatcher feeds Compute Unit 0 to Compute Unit 31, all connected to global memory; b) compute unit — wavefront scheduler, register file, local memory, and SIMD units 0–3; c) SIMD lane (stream core) — lanes 0–15 with integer and floating-point functional units.]
Disassembler
SIMD Execution
─ OpenCL follows a Single-Program Multiple-Data (SPMD) programming model, which maps to a Single-Instruction Multiple-Data (SIMD) execution model.
─ A wavefront is a fixed set of work-items forming the SIMD execution unit: one single instruction is fetched, and multiple instances of it are executed. Trade-off: 1 wavefront = 64 work-items.
─ Amortizes ports in the instruction cache and fetch hardware in general.
─ Makes it easier to create work-groups whose size is an exact multiple of the wavefront size, reducing waste in the wavefront tail.
─ Reduces the effects of thread divergence in conditional execution or loops.
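The work-group-to-wavefront split and the tail waste mentioned above can be sketched as follows (illustrative Python):

```python
import math

WAVEFRONT_SIZE = 64

def wavefronts(work_group_size):
    """Number of wavefronts for a work-group, and idle lanes in the tail."""
    n = math.ceil(work_group_size / WAVEFRONT_SIZE)
    waste = n * WAVEFRONT_SIZE - work_group_size
    return n, waste

print(wavefronts(256))  # (4, 0)  exact multiple: no tail waste
print(wavefronts(100))  # (2, 28) 28 idle lanes in the last wavefront
```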
Scalar Instructions
─ Sometimes work-items within a wavefront not only execute the same instruction, but also operate on the same data.
─ The compiler can detect some of these cases and emit scalar instructions, issued to a special execution unit of much lower cost. Examples:
─ Loading the base address of a buffer.
─ Loading values from constant memory.
─ Loop initialization, comparison, and increments.
Vector Addition Kernel
__kernel void vector_add(
	__read_only __global int *src1,
	__read_only __global int *src2,
	__write_only __global int *dst)
{
	int id = get_global_id(0);
	dst[id] = src1[id] + src2[id];
}

Southern Islands assembly. Scalar instructions (using scalar registers s0–s23) set up the kernel arguments and the work-item's global ID; vector instructions (using vector registers v0–v2) compute the addresses, perform the loads, the addition, and the store:

s_buffer_load_dword s0, s[4:7], 0x04
s_buffer_load_dword s1, s[4:7], 0x18
s_buffer_load_dword s4, s[8:11], 0x00
s_buffer_load_dword s5, s[8:11], 0x04
s_buffer_load_dword s6, s[8:11], 0x08
s_load_dwordx4 s[8:11], s[2:3], 0x58
s_load_dwordx4 s[16:19], s[2:3], 0x60
s_load_dwordx4 s[20:23], s[2:3], 0x50
s_waitcnt lgkmcnt(0)
s_min_u32 s0, s0, 0x0000ffff
v_mov_b32 v1, s0
v_mul_i32_i24 v1, s12, v1
v_add_i32 v0, vcc, v0, v1
v_add_i32 v0, vcc, s1, v0
v_lshlrev_b32 v0, 2, v0
v_add_i32 v1, vcc, s4, v0
v_add_i32 v2, vcc, s5, v0
v_add_i32 v0, vcc, s6, v0
tbuffer_load_format_x v1, v1, s[8:11], 0
tbuffer_load_format_x v2, v2, s[16:19], 0
s_waitcnt vmcnt(0)
v_add_i32 v1, vcc, v1, v2
tbuffer_store_format_x v1, v0, s[20:23], 0
s_endpgm
Conditional Statements
__kernel void if_kernel(__global int *v)
{
	uint id = get_global_id(0);
	if (id < 5)
		v[id] = 10;
}
s_buffer_load_dword s0, s[8:11], 0x04
s_buffer_load_dword s1, s[8:11], 0x18
s_waitcnt lgkmcnt(0)
s_min_u32 s0, s0, 0x0000ffff
v_mov_b32 v1, s0
v_mul_i32_i24 v1, s16, v1
v_add_i32 v0, vcc, v0, v1
v_add_i32 v0, vcc, s1, v0
s_buffer_load_dword s0, s[12:15], 0x00
v_cmp_lt_u32 s[2:3], v0, 5
s_and_saveexec_b64 s[2:3], s[2:3]
v_lshlrev_b32 v0, 2, v0
s_waitcnt lgkmcnt(0)
v_add_i32 v0, vcc, s0, v0
v_mov_b32 v1, 10
tbuffer_store_format_x v1, v0, s[4:7], 0
s_mov_b64 exec, s[2:3]
s_endpgm

Callouts: v_cmp_lt_u32 performs the comparison, s_and_saveexec_b64 saves the active mask, tbuffer_store_format_x stores the value 10 for the active work-items, and the final s_mov_b64 restores the active mask.
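The active-mask mechanics can be sketched as follows. This is an illustrative Python model of the concept for one wavefront (8 lanes instead of 64, for brevity), not of the simulator's implementation.

```python
# Emulate predicated execution of `if (id < 5) v[id] = 10;`
# for one wavefront of 8 work-items.

LANES = 8
v = [0] * LANES
exec_mask = [True] * LANES                    # all lanes initially active

saved = list(exec_mask)                       # s_and_saveexec_b64: save mask...
exec_mask = [id < 5 for id in range(LANES)]   # ...and AND in the comparison

for lane in range(LANES):                     # tbuffer_store: active lanes write
    if exec_mask[lane]:
        v[lane] = 10

exec_mask = saved                             # s_mov_b64 exec: restore mask
print(v)  # [10, 10, 10, 10, 10, 0, 0, 0]
```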
Emulator (or functional simulator)
Program Loading
[Diagram: User application → (API call) → Runtime library → (ABI call) → Device driver → (internal interface) → Hardware.]
─ The device driver is the responsible module for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction. ─ Natively, it writes on hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.
The driver initializes:
─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary.
─ The initial global memory image, copying global buffers from CPU to GPU memory.
─ Kernel arguments.
─ The ND-Range topology, including number of dimensions and sizes.
Flowchart:
1) Split the ND-Range into work-groups, which enter the work-group pool.
2) While work-groups are left, grab one and split it into wavefronts, which enter the wavefront pool.
3) While wavefronts are left, execute each running wavefront not stalled in a barrier.
4) When no work-groups or wavefronts remain, the emulation ends.
Emulation Loop
Demo 6
─ Work-groups can execute in any order. This order is irrelevant for emulation purposes. ─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.
─ Wavefronts within a work-group can also execute in any order, as long as synchronizations are considered. ─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.
Timing simulator (or detailed/architectural)
The GPU Architecture
[Figure: a command processor and ultra-threaded dispatcher feed Compute Unit 0 to Compute Unit 31, each with an L1 cache; a crossbar connects the L1 caches to the main memory hierarchy (L2 caches, memory controllers, video memory).]
─ A command processor receives and processes commands from the host.
─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.
─ The following slides present the architecture of each of the compute units.
The Compute Unit
─ The instruction memory of each compute unit contains the OpenCL kernel.
─ A front-end fetches instructions and sends them to the appropriate execution unit.
─ There is one scalar unit, one vector-memory unit, one branch unit, and one LDS (local data store) unit.
─ There are multiple instances of SIMD units.
The Front-End
─ Work-groups are split into wavefronts and allocated to wavefront pools.
─ Fetch and issue stages operate in a round-robin fashion.
─ There is one SIMD unit associated with each wavefront pool.
[Figure: the front-end fetches from four wavefront pools into per-pool fetch buffers, and issues into the SIMD issue buffers (one per matching wavefront pool) and the scalar, branch, vector memory, and LDS unit issue buffers.]
The SIMD Unit

─ Runs arithmetic-logic vector instructions.
─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.
─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.
─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles.
─ The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively.
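The subwavefront pipelining in the execute stage can be sketched as follows (illustrative Python; cycle numbers are relative to the first issue):

```python
# A 64-work-item wavefront is split into 4 subwavefronts of 16 work-items,
# issued to the 16 SIMD lanes in 4 consecutive cycles.

WAVEFRONT_SIZE, LANES = 64, 16

def issue_schedule():
    """Map each work-item to its (lane, issue cycle) in the execute stage."""
    schedule = {}
    for wi in range(WAVEFRONT_SIZE):
        subwavefront, lane = divmod(wi, LANES)
        schedule[wi] = (lane, subwavefront)   # subwavefront i issues in cycle i
    return schedule

s = issue_schedule()
print(s[0], s[16], s[63])  # (0, 0) (0, 1) (15, 3)
```

This reproduces the figure's mapping: lane 0 handles work-items 0, 16, 32, 48 over four cycles, and lane 15 handles work-items 15, 31, 47, 63.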
The SIMD Unit
[Figure: SIMD unit pipeline — decode, issue, read, execute, and write buffers connect the decode, read, execute, and complete stages; the execute stage pipelines work-items 0/16/32/48 (lane 0) through 15/31/47/63 (lane 15) over the functional units; the vector register file is read before and written after execution.]
The Scalar Unit
─ Runs both arithmetic-logic and memory scalar instructions. ─ Modeled with 5 stages – decode, read, execute/memory, write, complete.
[Figure: scalar unit pipeline — decode, read (scalar register file), execute/memory, write, and complete stages, connected through decode, issue, read, execute, and write buffers.]
The Vector-Memory Unit
─ Runs vector memory instructions. ─ Modeled with 5 stages – decode, read, memory, write, complete. ─ Accesses to the global memory hierarchy happen mainly in this unit.
[Figure: vector-memory unit pipeline — decode, read (vector register file), memory (global memory), write, and complete stages, connected through decode, issue, read, memory, and write buffers.]
The Branch Unit
─ Runs branch instructions. These instructions decide whether to make an entire wavefront jump to a target address depending on the scalar condition code. ─ Modeled with 5 stages – decode, read, execute/memory, write, complete.
[Figure: branch unit pipeline — decode, read (scalar register file, condition codes), execute, write (program counter), and complete stages, connected through decode, issue, read, execute, and write buffers.]
The LDS (Local Data Store) Unit
─ Runs local memory access instructions.
─ Modeled with 5 stages – decode, read, execute/memory, write, complete.
─ The memory stage accesses the compute unit's local memory for reads and writes.
[Figure: LDS unit pipeline — decode, read (vector register file), memory (local memory), write, and complete stages, connected through decode, issue, read, memory, and write buffers.]
Benchmark Support
─ Support for other applications from the Rodinia and Parboil benchmark suites has been added by other users. Packages are under development.
─ Host programs are statically linked x86 binaries.
─ The SDK's device kernels are available for three different architectures: Evergreen, Southern Islands, and x86.
─ Packages include data files and execution commands.
Organization
─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU. ─ One 16KB data L1 per compute unit. ─ One scalar L1 cache shared by every 4 compute units. ─ Six L2 banks with a total size of 128KB, each connected to a DRAM module.
[Figure: compute units 0–31 with private L1 caches; one scalar cache shared per group of 4 compute units; an interconnect links them to the L2 banks, each attached to a DRAM module.]
Policies
─ Inclusive, write-back, non-coherent caches.
─ Non-coherence is implemented as a 3-state “coherence” protocol: NSI.
─ N (non-coherent): the block can be accessed for read/write, while shared with other caches.
─ S (shared): the block can be accessed for read only, possibly shared with other caches.
─ I (invalid): the block does not contain valid data.
─ Blocks in N state are merged on write-back using write bit masks.
[Figure: two L1 caches write back the same block to the L2 cache with complementary write bit masks (1111 0000 and 0000 1111); the cache block is merged on write-back.]
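The write-back merge with bit masks can be sketched as follows (illustrative Python, with 8-byte blocks):

```python
# Two L1 caches hold the same block in N state; each tracks which bytes it
# wrote with a write mask. On write-back, L2 merges only the dirty bytes.

def merge_writeback(l2_block, l1_block, write_mask):
    """Merge an N-state block into L2, byte by byte, using its write mask."""
    return [l1 if dirty else l2
            for l2, l1, dirty in zip(l2_block, l1_block, write_mask)]

l2   = [0, 0, 0, 0, 0, 0, 0, 0]
l1_a = [7, 7, 7, 7, 9, 9, 9, 9]   # wrote first half  (mask 1111 0000)
l1_b = [3, 3, 3, 3, 5, 5, 5, 5]   # wrote second half (mask 0000 1111)

l2 = merge_writeback(l2, l1_a, [1, 1, 1, 1, 0, 0, 0, 0])
l2 = merge_writeback(l2, l1_b, [0, 0, 0, 0, 1, 1, 1, 1])
print(l2)  # [7, 7, 7, 7, 5, 5, 5, 5]
```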
Visual tool
Demo 7
Main Window and Timing Diagram
─ Main window provides cycle-by-cycle navigation throughout simulation. ─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups. ─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.
Motivation
─ Cooperative work-group execution. ─ Next-level system heterogeneity.
─ The programmer writes one single OpenCL kernel.
─ The OpenCL runtime creates one kernel binary targeting each architecture present in the system.
─ The ND-Range is dynamically and automatically partitioned, and work-groups are spread across CPU and GPU cores.
─ The data-parallel programming model's complexity resides in the design of the OpenCL kernel, not in cross-device scheduling.
─ System resources (cores) are better utilized and transparently managed.
Current Heterogeneous Execution
[Figure: the complete ND-Range (WG-0 to WG-N) is sent to the Southern Islands hardware; a timeline shows the host program on one x86 core, the device kernel on the S.I. compute units, and idle regions on the remaining x86 cores.]
─ Complete ND-Range sent to one device, selected by the user with a call to clGetDeviceIDs.
─ Host program runs on one x86 core, device kernel runs on Southern Islands compute units. ─ Idle execution regions on x86 cores while Southern Islands compute units run the ND-Range.
Heterogeneous Execution with ND-Range Virtualization
[Figure: the ND-Range (WG-0 to WG-N) is now partitioned across both S.I. compute units and x86 cores; the timeline shows x86 cores running the host program plus kernel portions, with no idle regions.]
─ Portions of the ND-Range are executed by CPU/GPU cores with different ISAs.
─ Work-groups are mapped to CPU and GPU cores as they become available.
─ x86 cores run both the host program and a portion of the ND-Range.
─ Idle regions are removed during the execution of the ND-Range.
The Hardware
─ Protocols MOESI and NSI merged into a single 6-state protocol: NMOESI.
─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy.
─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store.
─ GPUs and CPUs running OpenCL kernels issue n-store write accesses; the rest issue regular store accesses.
[Figure: ARM, x86, Evergreen, Southern Islands, and Fermi cores, each with a private L1 cache, connected through an interconnect to the L2 banks by means of an NMOESI interface.]
The Software
The OpenCL host program is simplified. Only one command queue needed to exploit system heterogeneity.
[Diagram: User application → (API call) → Runtime library → (ABI call) → Device driver → (internal interface) → Hardware.]
The runtime provides an additional virtual fused device, returned as a result of a call to clGetDeviceIDs. The driver extends its interface to allow each physical device to run a portion of the ND- Range, at the work-group granularity.
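Handing out the ND-Range at work-group granularity can be sketched as follows. This is an illustrative Python model, not the driver interface itself: each device repeatedly grabs the next pending work-group, so faster devices naturally execute more of them; device names and relative speeds are made up for the example.

```python
from collections import defaultdict

# Toy ND-Range virtualization: greedy first-come first-served assignment
# of work-groups to the device that would finish each one earliest.

def partition(num_work_groups, device_speeds):
    """Assign work-group IDs to devices proportionally to their speed."""
    finish_time = {dev: 0.0 for dev in device_speeds}
    assignment = defaultdict(list)
    for wg in range(num_work_groups):
        dev = min(device_speeds,
                  key=lambda d: finish_time[d] + 1.0 / device_speeds[d])
        finish_time[dev] += 1.0 / device_speeds[dev]
        assignment[dev].append(wg)
    return dict(assignment)

# A GPU 4x faster than each of two x86 cores gets 2/3 of the work-groups:
out = partition(12, {"si-gpu": 4.0, "x86-core0": 1.0, "x86-core1": 1.0})
print({d: len(wgs) for d, wgs in out.items()})
# {'si-gpu': 8, 'x86-core0': 2, 'x86-core1': 2}
```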
Additional Material
─ Complete documentation of the simulator's user interface, simulation models, and additional tools.
─ New version releases and other important information are posted on the Multi2Sim mailing list (no spam, 1 email per month).
─ Users share questions and knowledge on the website forum.
─ Automatic verification framework for Multi2Sim.
─ Based on a cluster of computers running Condor.
Multi2C – The Multi2Sim Compiler
─ LLVM-based compiler for OpenCL and CUDA kernels.
─ The future release Multi2Sim 4.2 will include a working version.
─ Diagrams show progress as per SVN Rev. 1838.
Compilation flow: vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm → LLVM to Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin. Similarly, vec-add.cu → CUDA to LLVM front-end → vec-add.llvm → LLVM to Fermi or Kepler back-end → vec-add.s → Fermi/Kepler assembler → vec-add.cubin.
Academic Efforts at Northeastern
─ We started an unofficial seminar that students can voluntarily attend. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
─ Average attendance of ~25 students per semester.
─ An official alternative equivalent to a 4-credit course that an undergraduate student can optionally enroll in, collaborating in Multi2Sim development.
─ Many research projects at the graduate level are based on Multi2Sim, and are selectively included in the development trunk for public access.
─ Simulation of OpenGL pipelines, and support for new CPU/GPU architectures, among others.
Collaborating Research Groups
─ Pedro López, Salvador Petit, Julio Sahuquillo, José Duato.
─ Chris Barton, Shu Chen, Zhongliang Chen, Tahir Diop, Xiang Gong, David Kaeli, Nicholas Materise, Perhaad Mistry, Dana Schaa, Rafael Ubal, Mark Wilkening, Ang Shen, Tushar Swamy, Amir Ziabari.
─ Byunghyun Jang
─ Norm Rubin
─ Jason Anderson, Natalie Enright, Steven Gurfinkel, Tahir Diop.
─ Rustam Miftakhutdinov
Multi2Sim Academic Publications
─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007.
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012.
─ Programming and Simulating Fused Devices — OpenCL and Multi2Sim, ICPE, 2012.
─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013.
─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013.
Published Academic Works Using Multi2Sim
─ R. Miftakhutdinov, E. Ebrahimi, Y. Patt, Predicting Performance Impact of DVFS for Realistic Memory Systems, MICRO, 2012. ─ D. Lustig, M. Martonosi, Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization, HPCA, 2013.
─ H. Calborean, R. Jahr, T. Ungerer, L. Vintan, A Comparison of Multi-objective Algorithms for the Automatic Design Space Exploration of a Superscalar System, Advances in Intelligent Systems and Computing, vol. 187. ─ X. Li, C. Wang, X. Zhou, Z. Zhu, Cache Promotion Policy Using Re-reference Interval Prediction, CLUSTER, 2012. … and 62 more citations, as per Google Scholar.
Sponsors