The Multi2Sim Simulation Framework A CPU-GPU Model for - - PowerPoint PPT Presentation

the multi2sim simulation framework
SMART_READER_LITE
LIVE PREVIEW

The Multi2Sim Simulation Framework A CPU-GPU Model for - - PowerPoint PPT Presentation

The Multi2Sim Simulation Framework A CPU-GPU Model for Heterogeneous Computing www.multi2sim.org Rafael Ubal David R. Kaeli Northeastern University Boston, MA Conference title 1 Outline 1. Introduction First Block The x86 CPU


slide-1
SLIDE 1

Conference title 1

The Multi2Sim Simulation Framework

A CPU-GPU Model for Heterogeneous Computing

www.multi2sim.org Rafael Ubal David R. Kaeli

Northeastern University Boston, MA

slide-2
SLIDE 2

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 2

Outline

  • 1. Introduction

First Block – The x86 CPU Simulation

  • 2. The x86 CPU Emulation
  • 3. The x86 CPU Architectural Simulation
  • 4. The Memory Hierarchy

5. Benchmarks and Simulations Second Block – The AMD Evergreen GPU Simulation 6. The OpenCL Programming Model 7. The AMD Evergreen GPU Emulation 8. The AMD Evergreen GPU Architectural Simulation 9. Benchmarks and Simulations

  • 10. Conclusions and Future work
slide-3
SLIDE 3

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 3

  • 1. Introduction

Motivation

  • Limitations of existing CPU simulators

– Such as SimpleScalar, Simics, SSMT, M-Sim, SMTSim, M5, ... – Full-system vs. application-only simulation. – Free, open-source. – Architectural simulation accuracy. – Alpha/PISA architectures → cross-compilers. – Integrated system.

  • Current simulation needs

– Based on current processor market. – Heterogeneous CPU-GPU environments. – Tool for evaluation of new architectural proposals. – Simulation of a GPU ISA.

  • Existing GPU simulation approaches

– Barra: NVIDIA Telsa ISA. – Ocelot: PTX intermediate language simulator. – No architectural simulation. – No emulation of AMD ISAs. – Not capable of heterogeneous simulation.

slide-4
SLIDE 4

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 4

  • 1. Introduction

Multi2Sim Background

  • Multi2Sim 3.x version series, 2011 (x86+Evergreen)

Superscalar pipeline

Out-of-order execution, branch prediction, trace cache, etc.

Multithreading

Fine-grain, coarse-grain and simultaneous (SMT).

Multicore architecture.

Configurable memory hierarchy, cache coherence, interconnection networks.

State-of-the-art benchmarks.

Tested support for common research benchmarks, available for download.

GPU model

Support for OpenCL benchmarks. Model for Evergreen ISA.

  • Multi2Sim 1.x version series, 2007 (MIPS-based)
  • Multi2Sim 2.x version series, 2008 (x86-based)
slide-5
SLIDE 5

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 5

  • 1. Introduction

Getting Started

  • User-friendly installation and

test

$ tar -xzf multi2sim-3.1.tar.gz $ cd multi2sim-3.1 $ ./configure $ make $ sudo make install

  • Application-only simulator

Original execution Simulated execution

$ ./test-args hola que tal arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal' $ m2s ./test-args hola que tal <... Simulator output ...> arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal' <... Simulator statistics ...>

slide-6
SLIDE 6

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 6

  • 1. Introduction

The IniFile Format

  • Example of IniFile

; This is a comment. [ Section 0 ] Color = Red Height = 40 [ OtherSection ] Variable = Value

Demo 1

  • Multi2Sim uses IniFile for

– Configuration files. – Output statistic files. – Standard error output.

slide-7
SLIDE 7

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 7

Block 1

The x86 CPU Simulation

slide-8
SLIDE 8

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 8

  • 2. The CPU Emulation

Definition

  • Emulation (a.k.a. functional simulation)

– Just mimic original behavior of a program. – … as opposed to timing/detailed/architectural simulation.

  • Steps

1) Program loading. 2) Simulation loop.

slide-9
SLIDE 9

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 9

  • 2. The CPU Emulation

Program Loading

  • Initialization of a process state

– Virtual memory map. – Value of x86 registers.

Stack

Program arguments Environment variables 0x08000000

mmap region

(not initialized)

Heap

Initialized data

Text

Initialized data 0x08xxxxxx 0x40000000 0xc0000000 eax ebx eax ecx esp eip I n i t i a l i z e d i n s t r u c t i

  • n

p

  • i

n t e r T

  • p
  • f

s t a c k

1) Parse ELF executable

– ELF sections. – Initialized code and data.

2) Initialize stack

– Program headers. – Arguments. – Environment variables.

3) Initialize registers

– Program entry point → eip – Stack pointer → esp

slide-10
SLIDE 10

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 10

  • 2. The CPU Emulation

Simulation Loop

Demo 2 Read instr. at eip Instr . bytes Decode instruction Instr . fields

  • Instr. is

int 0x80 No Yes Emulate system call Emulate x86 instr. Move eip to next instr.

  • Emulation of x86 instructions

– Update memory map (if needed). – Update x86 registers. – Example: add [bp+16], 0x5

  • Emulation of Linux system

calls

– Analyze system call code and args. – Update memory map. – Update eax with return value. – Example: read(fd, buf, count);

slide-11
SLIDE 11

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 11

  • 3. The CPU Architectural Simulation

Definition

  • Architectural simulation (a.k.a. detailed/timing

simulation)

– Provides performance results from executing a program

  • n a configurable CPU model.

– Main performance metric: execution time. But also structures occupancy, cache hit rates, contention points...

Architectural Simulator

cycle counter

CPU functional simulator

CPU cores model Memory hierarchy model Run a new x86 instruction This is the isntr. that was run

slide-12
SLIDE 12

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 12

  • 3. The CPU Architectural Simulation

The Superscalar Pipeline

Demo 3

Fetch Instr . Cache

Fetch queue

Dispatch

···

Reorder Buffer

··· ···

Instruction Queue

···

Load/Store Queue

Issue Commit Data Cache Register File FU

Trace queue

···

Trace Cache Decode

μop queue

···

Writeback

  • Characteristics

– Speculative execution. – Branch prediction. – Out-of-order execution. – Trace cache.

slide-13
SLIDE 13

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 13

  • 3. The CPU Architectural Simulation

Multithreaded Processor Model

Fetch Instr . Cache Dispatch

··· ··· ··· ···

Issue Commit Data Cache Register File FU

···

Trace Cache Decode

···

Writeback Fetch Instr . Cache Dispatch

··· ··· ··· ···

Issue Commit Data Cache Register File FU

···

Trace Cache Decode

···

Writeback Fetch Instr . Cache Dispatch

··· ··· ··· ···

Issue Commit Data Cache Register File FU

···

Trace Cache Decode

···

Writeback

Shared Functional Unit Pool

  • Multithreading Paradigms

– Coarse grain multithreading Thread switch upon long-latency events. – Fine grain multithreading Thread switch at a cycle granularity. – Simultaneous multithreading Multiple-thread issuing of instructions.

slide-14
SLIDE 14

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 14

  • 3. The CPU Architectural Simulation

Multicore Processor Model

Core 0 Core 1

···

Memory Hierarchy

Fetch Instr . Cache Dispatch

··· ··· ··· ···

Issue Commit Data Cache Register File FU

···

Trace Cache Decode

···

Writeback Fetch Instr . Cache Dispatch

··· ··· ··· ···

Issue Commit Data Cache Register File FU

···

Trace Cache Decode

···

Writeback
  • Multicore Processor

– Multiple independent superscalar pipelines. – Communication only through memory hierarchy.

Demo 4

  • What can we run on it?

– Multiple single-threaded programs. – One (or more) programs spawning child threads.

slide-15
SLIDE 15

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 15

  • 3. The CPU Architectural Simulation

Definitions

  • Core (c-0, c-1, ...)

– Hardware component with an independent set of superscalar pipelines. – Each core may contain several threads.

Demo 4

  • Thread (t-0, t-1, ...)

– Hardware component with a partially independent set of pipeline stages.

  • Context (ctx-0, ctx-1, ...)

– Software thread with independent value for registers (incl. eip). – Can be a sequential program or a spawned child context.

  • Node

– Hardware component running a context. – Multicore proc.: c0, c1, … Multithreaded proc.: t0, t1, … Multicore-multithreaded proc.: c0-t0, c0-t1, ...

slide-16
SLIDE 16

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 16

  • Configuring memory hierarchy

– Any number of caches organized in any number of levels. – Connected through any number of interconnects. – A set of 1 or more caches must connect to an interconnect from “above”. Only one cache –or main memory– connected “below”.

  • 4. Memory Hierarchy

Configuration

  • Memory hierarchy entries

– Each node has two entries to the memory hierarchy: Instruction entry + Data entry – Several node entries can converge to the same cache (or main memory).

···

Interconnect Cache Cache Cache Cache or Main Memory

slide-17
SLIDE 17

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 17

  • 4. Memory Hierarchy

Configuration

c0-t0 Data L1 Instr. L1 c0-t1 Data L1 Instr. L1 Core 0 c1-t0 Data L1 Instr. L1 c1-t1 Data L1 Instr. L1 Core 1 L2 Cache L2 Cache Main Memory

  • Example

– 2-core, 2-threaded processor (4 nodes). – Each thread has its own private data and instruction L1 caches. – L2 caches: shared among threads, private per core, unified for data/instr.

Demo 5

slide-18
SLIDE 18

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 18

  • 5. Benchmarks and Simulations

Supported CPU Benchmarks

  • Sequential benchmarks

– SPEC CPU 2000 – SPEC CPU 2006 – MediaBench-I

Demo 6

  • Parallel benchmarks

– SPLASH-2 – PARSEC 2.1

  • Availability on website

– x86 binaries tested on Multi2Sim. – List of execution commands. – Data files for free-distribution benchmarks.

slide-19
SLIDE 19

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 19

Block 2

The AMD Evergreen GPU Simulation

slide-20
SLIDE 20

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 20

  • 6. The OpenCL Programming Model

Introduction

  • GPU

– Massively parallel device. – Originally devoted to graphics computations. – Now getting popular for general purpose computations (GPGPU). – Single-Program Multiple-Data (SIMP) model.

  • Major GPU vendors

– NVIDIA → CUDA programming language. – AMD → OpenCL programming language.

slide-21
SLIDE 21

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 21

  • 6. The OpenCL Programming Model

Vector Addition Example

int main() { [ ... ] clCreateProgramWithSource(..., "vector_add.cl", ...); clCreateKernel(..., "vector_add", ...); buf1 = clCreateBuffer(..., CL_MEM_READ, size, ...); buf2 = clCreateBuffer(..., CL_MEM_READ, size, ...); buf3 = clCreateBuffer(..., CL_MEM_WRITE, size, ...); clSetKernelArg(..., 0, buf1, ...); clSetKernelArg(..., 1, buf2, ...); clSetKernelArg(..., 2, buf3, ...); clEnqueueNDRangeKernel(...); [ ... ] }

OpenCL Host Program vector_add.c OpenCL Device Kernel vector_add.cl

__kernel void vector_add( __read_only __global int *buf1, __read_only __global int *buf2, __write_only __global int *buf3) { int id = get_global_id(0); buf3[id] = buf1[id] + buf2[id]; }

x86 executable binary vector_add AMD Evergreen kernel binary vector_add.bin

slide-22
SLIDE 22

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 22

  • 6. The OpenCL Programming Model

OpenCL Software Entities

Common OpenCL Kernel:

__kernel func() { }

Work- group ... Work- item Work- group Work- group Work- group

... ... ...

ND-Range

... ... ... ...

Work-group

Work- item Work- item Work- item

Work-item

Global memory Local memory Private memory (Synchronization allowed at this level)

  • Properties

– Host program configures ND-Range and Work-group sizes. – Only Work-items in the same Work-group can synchronize and share data. – Work-groups in ND-Range can execute in any order.

slide-23
SLIDE 23

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 23

  • 7. The Evergreen GPU Emulation

The OpenCL Call Stack

Operating system code User-space code

OpenCL function call (e.g., clEnqueueNDRangeKernel)

OpenCL host program AMD OpenCL library (libOpenCL.so)

System calls (mainly ioctl)

GPU Driver Multi2Sim Emulated program

OpenCL function call

OpenCL host program Multi2Sim OpenCL library (m2s-libpencl.so)

Special system call (code 325)

GPU Emulator

Native Execution Simulated Execution

  • Comparison

– OpenCL function calls are forwarded to m2s-libopencl.so. – Each function is implemented as a system call 325. – Multi2Sim emulates GPU after clEnqueueNDRangeKernel.

slide-24
SLIDE 24

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 24

  • 7. The Evergreen GPU Emulation

Program Loading

  • Initialization of device kernel

– Global memory map (whole ND-Range). – Local memories (each work-group). – Register files (each work-item).

Work-item Work-item ··· Work-group Work-item Work-item ··· Work-group

···

ND-Range Global Memory Local Memories Register Files

OpenCL kernel binary (vector_add.bin)

slide-25
SLIDE 25

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 25

  • 7. The Evergreen GPU Emulation

Evergreen Assembly Code

  • Structure

– Main Control Flow (CF) clause. – Secondary Arithmetic-Logic (ALU) and Texture (TEX) clauses. – ALU instructions are VLIW.

00 ALU: ADDR(32) CNT(8) KCACHE0(CB1:0-15) 0 x: LSHL R3.x, R0.x, 1 w: LSHL ____, R0.x, (0x3).x t: MOV R8.x, 1 1 x: LSHL R5.x, PV1.x, (0x2).x y: LSHR R1.y, PV1.z, (0x2).x z: ADD_INT ____, KC0[1].x, PV2.x t: LSHR R7.x, KC0[3].x, 1 2 y: LSHR R2.y, PV3.z, (0x2).x 01 TEX: ADDR(144) CNT(2) 3 VFETCH R1.x___, R1.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 4 VFETCH R2.x___, R2.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU_PUSH_BEFORE: ADDR(47) CNT(3) 5 x: LDS_WRITE ____, R1.w, R1.x 6 x: LDS_WRITE ____, R6.x, R2.x 7 x: PREDNE_INT ____, R7.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 03 JUMP POP_CNT(1) ADDR(13) 04 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM

CF Instruction Counter Secondary Clause Instruction Counter Secondary ALU Clause Secondary TEX Clause

slide-26
SLIDE 26

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 26

  • 7. The Evergreen GPU Emulation

Simulation Loop

Emulate (all work-items) Emulate (all work-items) Read CF instruction Instr . bytes Decode instruction Instr . fields

  • Instr. is

CF? Yes No Start ALU/ TEX clause Read ALU/ TEX instr. Instr . bytes Decode instruction Instr . fields End of clause? No Go Up Yes

  • Execution of CF clause

– Instructions affecting control flow. – Synchronization operations. – Writes to global memory.

  • Secondary ALU clause

– Arithmetic-logic operations. – Accesses to local memory.

  • Secondary TEX clause

– Reads from global memory.

Demo 7

slide-27
SLIDE 27

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 27

  • 8. The GPU Architectural Simulation

AMD Evergreen GPU Architecture

  • The GPU Compute Device

– Pool of pending work-groups (Wgs). – Set of compute units (Cus). – Dispatcher – maps WGs to CUs. – Global memory hierarchy.

Compute Unit 0 Compute Unit 1 Compute Unit N-1

···

Work-group dispatcher Pending Work-group pool Global Memory Hierarchy ALU Engine CF Engine TEX Engine Register File Ready Wavefront Pool Global Memory (reads) Local Memory ALU Clause TEX Clause Global memory (writes)

  • Compute Unit

– Pool of pending wavefronts (Wfs) – Three execution engines. – Local memory. – Register file.

slide-28
SLIDE 28

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 28

  • 8. The GPU Architectural Simulation

Execution Engines

Fetch (one WF) Instr . bytes buffers Decode (round- robin) Instruction Memory (CF Clause) From Ready Wavefront Pool Extract WF To Ready Wavefront Pool Insert WF WF0 WF1 WFN-1

· · ·

CF Instr . buffers (1 entry per WF) WF0 WF1 WFN-1

· · ·

Execute (round-robin) Launch secondary ALU clause Launch secondary TEX clause Execute CF instruction Complete

  • Control Flow (CF) Engine

– 4 stages. – Extracts one WF from pool at fetch stage. – Places a WF back into pool at complete stage. – Secondary clauses can be launched at execute stage.

slide-29
SLIDE 29

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 29

  • 8. The GPU Architectural Simulation

Execution Engines

···

Instruction bytes Decode Read (each SubWF) SubWF 0 SubWF 1 SubWF 2 x y z w t

... ...

Stream Core 0

Processing Elements Pipeline Stages

Stream Core 1 Stream Core N–1

. . .

Work-Item 0 SubWF 0, 1, ... Write Execute Instruction Memory (ALU clauses) Local Memory Local Memory From Register File To Register File

x y z w t VLIW bundle buffer (1 entry)

Work-Item 0 SubWF 0, 1, ... Work-Item N-1 SubWF 0, 1, ... Fetch (one WF)

  • Arithmetic-Logic (ALU) Engine

– 5 stages. – WF is split into SubWFs at the read stage. – SubWF size is equal to number of available Stream Cores (Scs). – Each SC has 5 pipelined processing elements (x, y, z, w, t).

slide-30
SLIDE 30

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 30

  • 8. The GPU Architectural Simulation

Execution Engines

...

Fetch (one WF) Instruction bytes Decode Instruction Memory (TEX Clauses) Read Request to L1 cache (Global Mem.) Write Data from L1 cache To Register File TEX instr . buffer (1 entry) From Register File addr . data

  • Control Flow (CF) Engine

– 4 stages. – Global memory reads are issued at read stage. – They complete at write stage.

slide-31
SLIDE 31

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 31

  • 8. The GPU Architectural Simulation

Summary of Work-items Grouping

  • ND-Range

– Group of all work-items for one kernel launch.

  • Work-group

– Work-items can perform synchronizations. – Work-items share a fast-access local memory.

  • Wavefront

– SIMD execution unit.

  • Subwavefront

– Work-items that can be issued to Stream Cores at a time.

O p e n C L P r

  • g

. M

  • d

e l G P U A r c h i t e c t u r e

Demo 8

slide-32
SLIDE 32

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 32

  • 9. Benchmarks and Simulations

Supported GPU Benchmarks

  • AMD SDK's OpenCL Benchmarks

– Matrix computations. – Financial benchmarks. – Sorting algorithms. – etc.

  • Features

– Provided in Multi2Sim site as x86 + Evergreen binaries. – Command-line can be tuned for different input sizes. – Provide both CPU and GPU implementations, with self-check.

Demo 9

slide-33
SLIDE 33

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 33

  • 10. Conclusions

Simulation Capabilities

  • x86 CPU Simulation

– ISA-level. – No need for full-system simulation. – Superscalar/multithreaded/multicore. – Memory hierarchies and interconnects. – State-of-the-art benchmarks.

  • AMD Evergreen GPU Simulation

– ISA-level. – First full architectural simulation framework. – Realistic GPU pipeline (based on AMD Radeon 5870). – Memory hierarchies and interconnects.

slide-34
SLIDE 34

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 34

  • 10. Conclusions

Additional Material

  • The Multi2Sim Guide

– Complete documentation. – “Getting started” sections, with execution examples. – Description of CPU and GPU architectural models.

  • The Multi2Sim Forum

– Discussion forum for Multi2Sim users.

  • The Multi2Sim Mailing List

– Announcements of new versions, updated documentation, etc.

slide-35
SLIDE 35

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 35

  • 10. Conclusions

Future Work

  • Extending support for benchmarks

– Support for the entire OpenCL specification. – Support for the entire Evergreen ISA. – Support for the complete AMD SDK suite, and other upcoming benchmarks.

  • Focus on heterogeneous architectures

– Model for AMD Fusion. – CPU and GPU working concurrently. – Supporting/designing benchmarks with heterogeneous processing.

  • Maintenance of CPU simulation

– Issues reported by Multi2Sim users. – Stability and support increases day by day.

slide-36
SLIDE 36

Conference title 36

The Multi2Sim Simulation Framework

A CPU-GPU Model for Heterogeneous Computing

www.multi2sim.org Rafael Ubal David R. Kaeli

Northeastern University Boston, MA