Conference title 1
The Multi2Sim Simulation Framework
A CPU-GPU Model for Heterogeneous Computing
www.multi2sim.org Rafael Ubal David R. Kaeli
Northeastern University Boston, MA
The Multi2Sim Simulation Framework A CPU-GPU Model for - - PowerPoint PPT Presentation
The Multi2Sim Simulation Framework A CPU-GPU Model for Heterogeneous Computing www.multi2sim.org Rafael Ubal David R. Kaeli Northeastern University Boston, MA Conference title 1 Outline 1. Introduction First Block The x86 CPU
Conference title 1
www.multi2sim.org Rafael Ubal David R. Kaeli
Northeastern University Boston, MA
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 2
First Block – The x86 CPU Simulation
5. Benchmarks and Simulations Second Block – The AMD Evergreen GPU Simulation 6. The OpenCL Programming Model 7. The AMD Evergreen GPU Emulation 8. The AMD Evergreen GPU Architectural Simulation 9. Benchmarks and Simulations
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 3
– Such as SimpleScalar, Simics, SSMT, M-Sim, SMTSim, M5, ... – Full-system vs. application-only simulation. – Free, open-source. – Architectural simulation accuracy. – Alpha/PISA architectures → cross-compilers. – Integrated system.
– Based on current processor market. – Heterogeneous CPU-GPU environments. – Tool for evaluation of new architectural proposals. – Simulation of a GPU ISA.
– Barra: NVIDIA Telsa ISA. – Ocelot: PTX intermediate language simulator. – No architectural simulation. – No emulation of AMD ISAs. – Not capable of heterogeneous simulation.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 4
Superscalar pipeline
Out-of-order execution, branch prediction, trace cache, etc.
Multithreading
Fine-grain, coarse-grain and simultaneous (SMT).
Multicore architecture.
Configurable memory hierarchy, cache coherence, interconnection networks.
State-of-the-art benchmarks.
Tested support for common research benchmarks, available for download.
GPU model
Support for OpenCL benchmarks. Model for Evergreen ISA.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 5
test
$ tar -xzf multi2sim-3.1.tar.gz $ cd multi2sim-3.1 $ ./configure $ make $ sudo make install
Original execution Simulated execution
$ ./test-args hola que tal arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal' $ m2s ./test-args hola que tal <... Simulator output ...> arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal' <... Simulator statistics ...>
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 6
; This is a comment. [ Section 0 ] Color = Red Height = 40 [ OtherSection ] Variable = Value
Demo 1
– Configuration files. – Output statistic files. – Standard error output.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 7
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 8
– Just mimic original behavior of a program. – … as opposed to timing/detailed/architectural simulation.
1) Program loading. 2) Simulation loop.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 9
– Virtual memory map. – Value of x86 registers.
Stack
Program arguments Environment variables 0x08000000
mmap region
(not initialized)
Heap
Initialized data
Text
Initialized data 0x08xxxxxx 0x40000000 0xc0000000 eax ebx eax ecx esp eip I n i t i a l i z e d i n s t r u c t i
p
n t e r T
s t a c k
1) Parse ELF executable
– ELF sections. – Initialized code and data.
2) Initialize stack
– Program headers. – Arguments. – Environment variables.
3) Initialize registers
– Program entry point → eip – Stack pointer → esp
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 10
Demo 2 Read instr. at eip Instr . bytes Decode instruction Instr . fields
int 0x80 No Yes Emulate system call Emulate x86 instr. Move eip to next instr.
– Update memory map (if needed). – Update x86 registers. – Example: add [bp+16], 0x5
calls
– Analyze system call code and args. – Update memory map. – Update eax with return value. – Example: read(fd, buf, count);
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 11
simulation)
– Provides performance results from executing a program
– Main performance metric: execution time. But also structures occupancy, cache hit rates, contention points...
Architectural Simulator
cycle counter
CPU functional simulator
CPU cores model Memory hierarchy model Run a new x86 instruction This is the isntr. that was run
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 12
Demo 3
Fetch Instr . Cache
Fetch queue
Dispatch
···
Reorder Buffer
··· ···
Instruction Queue
···
Load/Store Queue
Issue Commit Data Cache Register File FU
Trace queue
···
Trace Cache Decode
μop queue
···
Writeback
– Speculative execution. – Branch prediction. – Out-of-order execution. – Trace cache.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 13
Fetch Instr . Cache Dispatch
··· ··· ··· ···
Issue Commit Data Cache Register File FU
···
Trace Cache Decode
···
Writeback Fetch Instr . Cache Dispatch
··· ··· ··· ···
Issue Commit Data Cache Register File FU
···
Trace Cache Decode
···
Writeback Fetch Instr . Cache Dispatch
··· ··· ··· ···
Issue Commit Data Cache Register File FU
···
Trace Cache Decode
···
Writeback
Shared Functional Unit Pool
– Coarse grain multithreading Thread switch upon long-latency events. – Fine grain multithreading Thread switch at a cycle granularity. – Simultaneous multithreading Multiple-thread issuing of instructions.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 14
Core 0 Core 1
Memory Hierarchy
Fetch Instr . Cache Dispatch··· ··· ··· ···
Issue Commit Data Cache Register File FU···
Trace Cache Decode···
Writeback Fetch Instr . Cache Dispatch··· ··· ··· ···
Issue Commit Data Cache Register File FU···
Trace Cache Decode···
Writeback– Multiple independent superscalar pipelines. – Communication only through memory hierarchy.
Demo 4
– Multiple single-threaded programs. – One (or more) programs spawning child threads.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 15
– Hardware component with an independent set of superscalar pipelines. – Each core may contain several threads.
Demo 4
– Hardware component with a partially independent set of pipeline stages.
– Software thread with independent value for registers (incl. eip). – Can be a sequential program or a spawned child context.
– Hardware component running a context. – Multicore proc.: c0, c1, … Multithreaded proc.: t0, t1, … Multicore-multithreaded proc.: c0-t0, c0-t1, ...
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 16
– Any number of caches organized in any number of levels. – Connected through any number of interconnects. – A set of 1 or more caches must connect to an interconnect from “above”. Only one cache –or main memory– connected “below”.
– Each node has two entries to the memory hierarchy: Instruction entry + Data entry – Several node entries can converge to the same cache (or main memory).
Interconnect Cache Cache Cache Cache or Main Memory
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 17
c0-t0 Data L1 Instr. L1 c0-t1 Data L1 Instr. L1 Core 0 c1-t0 Data L1 Instr. L1 c1-t1 Data L1 Instr. L1 Core 1 L2 Cache L2 Cache Main Memory
– 2-core, 2-threaded processor (4 nodes). – Each thread has its own private data and instruction L1 caches. – L2 caches: shared among threads, private per core, unified for data/instr.
Demo 5
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 18
– SPEC CPU 2000 – SPEC CPU 2006 – MediaBench-I
Demo 6
– SPLASH-2 – PARSEC 2.1
– x86 binaries tested on Multi2Sim. – List of execution commands. – Data files for free-distribution benchmarks.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 19
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 20
– Massively parallel device. – Originally devoted to graphics computations. – Now getting popular for general purpose computations (GPGPU). – Single-Program Multiple-Data (SIMP) model.
– NVIDIA → CUDA programming language. – AMD → OpenCL programming language.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 21
int main() { [ ... ] clCreateProgramWithSource(..., "vector_add.cl", ...); clCreateKernel(..., "vector_add", ...); buf1 = clCreateBuffer(..., CL_MEM_READ, size, ...); buf2 = clCreateBuffer(..., CL_MEM_READ, size, ...); buf3 = clCreateBuffer(..., CL_MEM_WRITE, size, ...); clSetKernelArg(..., 0, buf1, ...); clSetKernelArg(..., 1, buf2, ...); clSetKernelArg(..., 2, buf3, ...); clEnqueueNDRangeKernel(...); [ ... ] }
OpenCL Host Program vector_add.c OpenCL Device Kernel vector_add.cl
__kernel void vector_add( __read_only __global int *buf1, __read_only __global int *buf2, __write_only __global int *buf3) { int id = get_global_id(0); buf3[id] = buf1[id] + buf2[id]; }
x86 executable binary vector_add AMD Evergreen kernel binary vector_add.bin
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 22
Common OpenCL Kernel:
__kernel func() { }
Work- group ... Work- item Work- group Work- group Work- group
ND-Range
Work-group
Work- item Work- item Work- item
Work-item
Global memory Local memory Private memory (Synchronization allowed at this level)
– Host program configures ND-Range and Work-group sizes. – Only Work-items in the same Work-group can synchronize and share data. – Work-groups in ND-Range can execute in any order.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 23
Operating system code User-space code
OpenCL function call (e.g., clEnqueueNDRangeKernel)
OpenCL host program AMD OpenCL library (libOpenCL.so)
System calls (mainly ioctl)
GPU Driver Multi2Sim Emulated program
OpenCL function call
OpenCL host program Multi2Sim OpenCL library (m2s-libpencl.so)
Special system call (code 325)
GPU Emulator
Native Execution Simulated Execution
– OpenCL function calls are forwarded to m2s-libopencl.so. – Each function is implemented as a system call 325. – Multi2Sim emulates GPU after clEnqueueNDRangeKernel.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 24
– Global memory map (whole ND-Range). – Local memories (each work-group). – Register files (each work-item).
Work-item Work-item ··· Work-group Work-item Work-item ··· Work-group
ND-Range Global Memory Local Memories Register Files
OpenCL kernel binary (vector_add.bin)
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 25
– Main Control Flow (CF) clause. – Secondary Arithmetic-Logic (ALU) and Texture (TEX) clauses. – ALU instructions are VLIW.
00 ALU: ADDR(32) CNT(8) KCACHE0(CB1:0-15) 0 x: LSHL R3.x, R0.x, 1 w: LSHL ____, R0.x, (0x3).x t: MOV R8.x, 1 1 x: LSHL R5.x, PV1.x, (0x2).x y: LSHR R1.y, PV1.z, (0x2).x z: ADD_INT ____, KC0[1].x, PV2.x t: LSHR R7.x, KC0[3].x, 1 2 y: LSHR R2.y, PV3.z, (0x2).x 01 TEX: ADDR(144) CNT(2) 3 VFETCH R1.x___, R1.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 4 VFETCH R2.x___, R2.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU_PUSH_BEFORE: ADDR(47) CNT(3) 5 x: LDS_WRITE ____, R1.w, R1.x 6 x: LDS_WRITE ____, R6.x, R2.x 7 x: PREDNE_INT ____, R7.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 03 JUMP POP_CNT(1) ADDR(13) 04 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM
CF Instruction Counter Secondary Clause Instruction Counter Secondary ALU Clause Secondary TEX Clause
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 26
Emulate (all work-items) Emulate (all work-items) Read CF instruction Instr . bytes Decode instruction Instr . fields
CF? Yes No Start ALU/ TEX clause Read ALU/ TEX instr. Instr . bytes Decode instruction Instr . fields End of clause? No Go Up Yes
– Instructions affecting control flow. – Synchronization operations. – Writes to global memory.
– Arithmetic-logic operations. – Accesses to local memory.
– Reads from global memory.
Demo 7
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 27
– Pool of pending work-groups (Wgs). – Set of compute units (Cus). – Dispatcher – maps WGs to CUs. – Global memory hierarchy.
Compute Unit 0 Compute Unit 1 Compute Unit N-1
···
Work-group dispatcher Pending Work-group pool Global Memory Hierarchy ALU Engine CF Engine TEX Engine Register File Ready Wavefront Pool Global Memory (reads) Local Memory ALU Clause TEX Clause Global memory (writes)
– Pool of pending wavefronts (Wfs) – Three execution engines. – Local memory. – Register file.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 28
Fetch (one WF) Instr . bytes buffers Decode (round- robin) Instruction Memory (CF Clause) From Ready Wavefront Pool Extract WF To Ready Wavefront Pool Insert WF WF0 WF1 WFN-1
· · ·
CF Instr . buffers (1 entry per WF) WF0 WF1 WFN-1
· · ·
Execute (round-robin) Launch secondary ALU clause Launch secondary TEX clause Execute CF instruction Complete
– 4 stages. – Extracts one WF from pool at fetch stage. – Places a WF back into pool at complete stage. – Secondary clauses can be launched at execute stage.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 29
···
Instruction bytes Decode Read (each SubWF) SubWF 0 SubWF 1 SubWF 2 x y z w t
... ...
Stream Core 0
Processing Elements Pipeline Stages
Stream Core 1 Stream Core N–1
. . .
Work-Item 0 SubWF 0, 1, ... Write Execute Instruction Memory (ALU clauses) Local Memory Local Memory From Register File To Register File
x y z w t VLIW bundle buffer (1 entry)
Work-Item 0 SubWF 0, 1, ... Work-Item N-1 SubWF 0, 1, ... Fetch (one WF)
– 5 stages. – WF is split into SubWFs at the read stage. – SubWF size is equal to number of available Stream Cores (Scs). – Each SC has 5 pipelined processing elements (x, y, z, w, t).
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 30
...
Fetch (one WF) Instruction bytes Decode Instruction Memory (TEX Clauses) Read Request to L1 cache (Global Mem.) Write Data from L1 cache To Register File TEX instr . buffer (1 entry) From Register File addr . data
– 4 stages. – Global memory reads are issued at read stage. – They complete at write stage.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 31
– Group of all work-items for one kernel launch.
– Work-items can perform synchronizations. – Work-items share a fast-access local memory.
– SIMD execution unit.
– Work-items that can be issued to Stream Cores at a time.
O p e n C L P r
. M
e l G P U A r c h i t e c t u r e
Demo 8
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 32
– Matrix computations. – Financial benchmarks. – Sorting algorithms. – etc.
– Provided in Multi2Sim site as x86 + Evergreen binaries. – Command-line can be tuned for different input sizes. – Provide both CPU and GPU implementations, with self-check.
Demo 9
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 33
– ISA-level. – No need for full-system simulation. – Superscalar/multithreaded/multicore. – Memory hierarchies and interconnects. – State-of-the-art benchmarks.
– ISA-level. – First full architectural simulation framework. – Realistic GPU pipeline (based on AMD Radeon 5870). – Memory hierarchies and interconnects.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 34
– Complete documentation. – “Getting started” sections, with execution examples. – Description of CPU and GPU architectural models.
– Discussion forum for Multi2Sim users.
– Announcements of new versions, updated documentation, etc.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 35
– Support for the entire OpenCL specification. – Support for the entire Evergreen ISA. – Support for the complete AMD SDK suite, and other upcoming benchmarks.
– Model for AMD Fusion. – CPU and GPU working concurrently. – Supporting/designing benchmarks with heterogeneous processing.
– Issues reported by Multi2Sim users. – Stability and support increases day by day.
Conference title 36
www.multi2sim.org Rafael Ubal David R. Kaeli
Northeastern University Boston, MA