
Trace-driven Simulation of Multithreaded Applications
Alejandro Rico, Alejandro Duran, Felipe Cabarcas, Yoav Etsion, Alex Ramirez and Mateo Valero

Multithreaded applications and trace-driven simulation
● Most computer architecture research employs execution-driven simulation tools.
● Trace-driven simulation cannot capture the dynamic behavior of multithreaded applications.
[Figure: Scenario 1 and Scenario 2, each showing Core 0 and Core 1 contending for a lock. One core runs acquire_lock, check, acquired, critical section, release lock; the other runs acquire_lock, check, wait! and acquires the lock only after it is released.]
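
The lock scenarios can be illustrated with a minimal sketch. The arrival times below are hypothetical and only stand for two target configurations with different timing; the point is that the acquisition order, and so which core waits, is decided at run time and cannot be fixed in a trace recorded once.

```python
def lock_order(arrival_times):
    """Return core names in the order they acquire the lock (earliest arrival first)."""
    ranked = sorted(arrival_times.items(), key=lambda item: (item[1], item[0]))
    return [core for core, _ in ranked]


# Hypothetical arrival times (in cycles) at acquire_lock under two configurations.
scenario_1 = {"core0": 100, "core1": 140}
scenario_2 = {"core0": 180, "core1": 140}

order_1 = lock_order(scenario_1)  # core0 enters the critical section first
order_2 = lock_order(scenario_2)  # core1 enters the critical section first
print(order_1, order_2)
```

A trace that recorded the waiting loop of Scenario 1 would replay the same wait under Scenario 2, where the other core is the one that should wait.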

Trace-driven simulation has advantages
● Avoid computational requirements of simulated applications.
  ● Memory footprint.
  ● Disk space for input sets.
● Simulate applications with non-accessible sources, but accessible traces.
  ● Confidential/restricted applications.
● Lower modeling complexity.
  ● Different host¹ and target² ISAs / endianness.
● Problem: How to appropriately simulate multithreaded applications using traces?
¹ Host: system where the simulator executes.
² Target: system modeled in the simulator.

Targeting applications with decoupled execution
● Distinguish the user code (sequential code sections) from parallelism-management operations (parops).
[Figure: per-core timelines for a task-based parallel application (Core 0 and Core 1: create task 1, exec task 1, sync on completion of task 1) and a loop-based parallel application (Core 0 to Core 3: parallel loop, sync on every core). Legend: sequential code section, parop call, parop execution, switch, idle.]

How traces are collected (I)
[Figure: Core 0 to Core 3 timelines for a parallel loop, with a sync on each core.]

How traces are collected (II)
● Capture traces for sequential code sections.
  ● Execution is independent of the environment.
[Figure: Core 0 to Core 3 timelines for a parallel loop with syncs; each sequential code section is labeled "trace". Sample trace contents:]
    20: sub r15, r12, r13
    24: store r35, r15 (0x7e6a0)
    28: sub r3, r31, r4
    2c: load r21, r7 (0x80a88)
    30: addi r3, r3
    34: beq r3 (next_i: 7C)
    7c: mul r32, r8, r9
    80: mul r33, r10, r11
    84: mul r34, r12, r13
    88: store r32, r17 (0x7f280)
    8c: store r33, r18 (0x7f284)

How traces are collected (III)
● Capture traces for sequential code sections.
  ● Execution is independent of the environment.
● Capture calls to parops.
  ● Specific parop call events are included in the trace.
[Figure: Core 0 to Core 3 timelines; sequential code sections are labeled "trace", and the call to the parallel loop and the calls to sync are marked as parop call events.]

How traces are collected (IV)
● Capture traces for sequential code sections.
  ● Execution is independent of the environment.
● Capture calls to parops.
  ● Specific parop call events are included in the trace.
● Do not capture the execution of parops.
  ● Execution depends on the environment.
[Figure: Core 0 to Core 3 timelines; only the "trace" sections, the call to the parallel loop and the calls to sync remain. The execution of the parops is left out.]
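
The three collection rules can be sketched as a filter over a stream of recorded events. This is an illustration only: the `Event` type, the event kinds and the "spin on barrier counter" payload are assumptions made for the example, not the format of any real tracing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str     # "instr", "parop_call" or "parop_exec" (hypothetical kinds)
    payload: str


def collect_trace(events):
    """Keep user-code instructions and parop call events; drop parop execution."""
    return [ev for ev in events if ev.kind != "parop_exec"]


# Hypothetical raw event stream observed on one core.
raw = [
    Event("instr", "20: sub r15, r12, r13"),
    Event("instr", "24: store r35, r15 (0x7e6a0)"),
    Event("parop_call", "sync"),
    Event("parop_exec", "spin on barrier counter"),
    Event("parop_exec", "spin on barrier counter"),
    Event("instr", "7c: mul r32, r8, r9"),
]

trace = collect_trace(raw)
for ev in trace:
    print(ev.kind, ev.payload)
```

The number of spin iterations depends on when the other cores arrive, which is why those events are not stored; the sync call itself is kept so the simulator knows where to invoke the parop.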

Simulation framework
● Trace-driven simulator simulates sequential code sections.
● The dynamic component executes parops at simulation time.
  ● Includes the implementation of parops.
● Parops are exposed to the simulator through the parop interface.
● The architecture state is exposed to the dynamic component through the target architecture interface.
[Figure: the trace-driven simulator and the dynamic component, connected by the parop interface and the target architecture interface.]

Sample implementation: TaskSim – NANOS++
● Parops are exposed to the simulator through the parop interface.
  ● It includes operations for task management and synchronization.
● The architecture state and associated actions are exposed to NANOS++ through the architecture-dependent module.
  ● NANOS++ can alter the simulator state and manage the simulated threads according to decisions based on the target architecture.
[Figure: TaskSim (cores C with private L1 caches and a shared L2) and NANOS++. Parop interface operations: create task, wait for tasks, wait on data, execute task. Target architecture interface operations: start/join, bind, yield.]
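
A minimal sketch of the two interfaces, assuming a Python rendering purely for illustration. The operation names `create_task`, `wait_for_tasks` and `bind` come from the figure labels; the signatures, the meaning given to `bind` (placing a task on a core), the `ToySimulator` and `ToyRuntime` classes and the round-robin policy are all assumptions and are not the TaskSim or NANOS++ API.

```python
from abc import ABC, abstractmethod


class ParopInterface(ABC):
    """Operations the simulator calls when it finds a parop event in a trace."""

    @abstractmethod
    def create_task(self, task_id, inputs, outputs): ...

    @abstractmethod
    def wait_for_tasks(self): ...


class TargetArchitectureInterface(ABC):
    """Operations the runtime calls to act on the simulated machine."""

    @abstractmethod
    def bind(self, task_id, core): ...


class ToySimulator(TargetArchitectureInterface):
    def __init__(self):
        self.bindings = {}

    def bind(self, task_id, core):
        self.bindings[task_id] = core


class ToyRuntime(ParopInterface):
    def __init__(self, arch, n_cores):
        self.arch = arch
        self.n_cores = n_cores
        self.created = []

    def create_task(self, task_id, inputs, outputs):
        self.created.append(task_id)
        # Hypothetical policy: round-robin over cores 1..n_cores-1 (core 0 runs main).
        core = 1 + (len(self.created) - 1) % (self.n_cores - 1)
        self.arch.bind(task_id, core)

    def wait_for_tasks(self):
        return len(self.created)


sim = ToySimulator()
runtime = ToyRuntime(sim, n_cores=3)
for task_id in (1, 2, 3):
    runtime.create_task(task_id, inputs=[], outputs=[])
print(sim.bindings)
```

The split mirrors the slide: the simulator only knows the parop interface, and the runtime only touches the simulated machine through the target architecture interface.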

OmpSs application example
● Cholesky factorization.
● Tasks are spawned on pragma task annotations.
● Inputs and outputs are specified for automatic dependence resolution.

    float A[N][N][M][M]; // NxN blocked matrix,
                         // with MxM blocks
    for (int j = 0; j<N; j++) {
      for (int k = 0; k<j; k++)
        for (int i = j+1; i<N; i++)
    #pragma task input(a, b) inout(c)
          sgemm_t(A[i][k], A[j][k], A[i][j]);
      for (int i = 0; i<j; i++)
    #pragma task input(a) inout(b)
        ssyrk_t(A[j][i], A[j][j]);
    #pragma task inout(a)
      spotrf_t(A[j][j]);
      for (int i = j+1; i<N; i++)
    #pragma task input(a) inout(b)
        strsm_t(A[j][j], A[i][j]);
    }
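
To make the task structure concrete, the sketch below walks the same loop nest in Python and records one tuple per task instead of spawning it. The function name `cholesky_tasks` and the tuple layout (kernel name followed by block indices in argument order) are choices made for this illustration; no dependence resolution is modeled.

```python
from collections import Counter


def cholesky_tasks(n):
    """List the tasks the loop nest would spawn for an n x n blocked matrix."""
    tasks = []
    for j in range(n):
        for k in range(j):
            for i in range(j + 1, n):
                tasks.append(("sgemm_t", (i, k), (j, k), (i, j)))
        for i in range(j):
            tasks.append(("ssyrk_t", (j, i), (j, j)))
        tasks.append(("spotrf_t", (j, j)))
        for i in range(j + 1, n):
            tasks.append(("strsm_t", (j, j), (i, j)))
    return tasks


tasks = cholesky_tasks(4)
counts = Counter(task[0] for task in tasks)
print(len(tasks), dict(counts))
```

For N = 4 blocks per dimension this yields 20 tasks. In each tuple the last block is the one the task updates (the inout argument), which is the information the runtime uses to order dependent tasks.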

Traces for OmpSs applications
● Sequential code sections correspond to tasks.
● One trace for the main task.
  ● The thread starting the program execution at the main function.
● One trace for each task.
● Information for each function call.
  ● E.g., for task creation it needs the task id and the input and output data addresses and sizes.
[Figure: the application trace consists of the main task trace (with parop calls + info) and the traces of task 1, task 2, task 3, …, task N.]

Simulation example (I)
1. Simulation starts the main task.
[Figure: TaskSim and NANOS++, linked by the parop interface and the architecture-dependent operations, above a timeline for Core 0 and Core 1. Timeline so far: initialization.]

Simulation example (II)
2. On a create task event, TaskSim calls the corresponding operation in the parop interface.
[Timeline so far: initialization, create task 1.]

Simulation example (III)
3. That triggers the creation of the task in NANOS++.
[Timeline so far: initialization, create task 1.]

Simulation example (IV)
4. NANOS++ returns control to TaskSim. Core 1 takes task 1 for simulation.
[Timeline so far: initialization, create task 1.]

Simulation example (V)
5. TaskSim resumes simulation, and Core 1 starts simulating task 1.
[Timeline so far: initialization, create task 1, exec task 1.]

Simulation example (VI)
6. On the create task 2 event, TaskSim calls the runtime again.
[Timeline so far: initialization, create task 1, exec task 1, create task 2.]

Simulation example (VII)
7. NANOS++ creates task 2 and returns control to TaskSim.
[Timeline so far: initialization, create task 1, exec task 1, create task 2.]

Simulation example (VIII)
8. When Core 1 finishes the execution of task 1, it starts task 2.
[Timeline so far: initialization, create task 1, exec task 1, create task 2, exec task 2, …]

Simulation example (IX)
9. TaskSim reaches a synchronization parop. NANOS++ checks for pending tasks.
[Timeline so far: initialization, create task 1, exec task 1, create task 2, exec task 2, …, task wait.]

Simulation example (X)
10. All tasks are finished, and TaskSim continues the main task simulation.
[Timeline so far: initialization, create task 1, exec task 1, create task 2, exec task 2, …, task wait.]
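
The ten steps can be approximated by a very small replay loop. This is a sketch under strong assumptions, not TaskSim or NANOS++: the main task trace is a list of hypothetical events, each task trace is reduced to a cycle count, Core 0 only runs the main task, and the runtime policy is "give the next task to the core that becomes idle first".

```python
def simulate(main_trace, task_cycles, n_workers):
    """Replay the main task trace; return (end time, per-task timeline)."""
    now = 0                        # clock of Core 0, which runs the main task
    core_free = [0] * n_workers    # time at which each worker core becomes idle
    timeline = []                  # (task id, core number, start, end)
    for event in main_trace:
        kind = event[0]
        if kind == "compute":      # sequential code section taken from the trace
            now += event[1]
        elif kind == "create_task":    # parop: decided at simulation time
            task_id = event[1]
            core = min(range(n_workers), key=lambda c: core_free[c])
            start = max(now, core_free[core])
            core_free[core] = start + task_cycles[task_id]
            timeline.append((task_id, core + 1, start, core_free[core]))
        elif kind == "task_wait":      # parop: block until all tasks finish
            now = max([now] + core_free)
    return now, timeline


# Hypothetical trace: initialization, create task 1, create task 2, task wait.
main_trace = [("compute", 10), ("create_task", 1), ("compute", 5),
              ("create_task", 2), ("task_wait",), ("compute", 10)]
task_cycles = {1: 30, 2: 20}

end_1, timeline_1 = simulate(main_trace, task_cycles, n_workers=1)
end_2, timeline_2 = simulate(main_trace, task_cycles, n_workers=2)
print(end_1, timeline_1)
print(end_2, timeline_2)
```

With one worker core, task 2 starts only when Core 1 finishes task 1, as in step 8. With two worker cores the same trace gives a different schedule, because scheduling is decided during simulation and is not stored in the trace.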

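Task generation scheme scalability
● Task generation (green) on the main task limits scalability (on the left).
● Parallelization of task generation (on the right) is crucial to avoid this bottleneck.
[Figure: execution timelines for the 16p, 32p and 64p configurations, with serial task generation on the left and parallel task generation on the right.]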
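
A back-of-the-envelope model of this bottleneck, with made-up numbers: one main task creates tasks one after another, each creation costs a fixed number of cycles, and each task runs for a fixed number of cycles on whichever worker core is idle first. The function `makespan` and all parameter values are assumptions for illustration and are not results from the presentation.

```python
def makespan(n_tasks, create_cycles, task_cycles, n_workers):
    """Finish time of the last task when a single core generates all tasks serially."""
    core_free = [0] * n_workers
    now = 0
    for _ in range(n_tasks):
        now += create_cycles       # serial task generation on the main task
        core = min(range(n_workers), key=lambda c: core_free[c])
        core_free[core] = max(now, core_free[core]) + task_cycles
    return max(core_free)


# Hypothetical workload: 64 tasks, 10 cycles to create each, 160 cycles to run each.
results = {p: makespan(64, 10, 160, p) for p in (8, 16, 32, 64)}
print(results)
```

In this model a new task is ready every 10 cycles and keeps a core busy for 160, so about 160 / 10 = 16 worker cores are enough to absorb the generation rate. Adding cores beyond that leaves the finish time unchanged unless task generation itself is parallelized.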

Coverage and opportunities
● Appropriate for high-level programming models.
  ● OpenMP, OmpSs, Cilk, …
  ● Mixing scheduling/synchronization and application code is limited.
  ● Runtime system can be used as the dynamic component.
● Not suitable for:
  ● Scheduling dependent on user code (user-guided scheduling).
  ● Computation based on random values (e.g., Monte Carlo algorithms).
● Runtime system development:
  ● Scheduling policies.
  ● Overall efficiency optimizations.
  ● For future machines before the actual hardware is available.
● Runtime software/hardware co-design.
  ● Hardware support for runtime system.

Conclusions
● We propose a novel trace-driven simulation methodology for multithreaded applications.
● The methodology is based on distinguishing:
  ● Application intrinsic behavior (user code).
  ● Parallelism-management operations (parops).
● It allows different architecture configurations to be simulated properly:
  ● With different numbers of cores.
  ● Using a single trace per application.
● It provides a framework not only for architecture exploration but also for runtime system development.
