Trace-driven Simulation of Multithreaded Applications

SLIDE 1

Trace-driven Simulation of Multithreaded Applications

Alejandro Rico, Alejandro Duran, Felipe Cabarcas, Yoav Etsion, Alex Ramirez and Mateo Valero

SLIDE 2

Multithreaded applications and trace-driven simulation

  • Most computer architecture research employs execution-driven simulation tools.
  • Trace-driven simulation cannot capture the dynamic behavior of multithreaded applications.

[Figure: Scenario 1 and Scenario 2, each showing Core 0 and Core 1 contending for a lock (acquire_lock, check acquired, wait, critical section, release lock).]
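To illustrate why a fixed trace cannot capture this behavior, here is a minimal sketch under stated assumptions. The function `lock_schedule` and its arrival times are hypothetical and do not come from the presentation. It models two cores reaching `acquire_lock` at cycles that would, in a real simulator, depend on the modeled architecture. Which core waits depends on those arrival times, so a trace recorded under one scenario fixes an order that may be wrong for another configuration.

```python
def lock_schedule(arrival, critical_len):
    """Return {core: (acquire_cycle, release_cycle)} for one contended lock.

    arrival: dict core -> cycle at which the core reaches acquire_lock
             (hypothetical values; a real simulator derives them from the
             modeled caches, pipeline, and so on).
    critical_len: cycles spent inside the critical section.
    """
    schedule = {}
    lock_free_at = 0
    # The lock is granted in arrival order; ties are broken by core id.
    for core in sorted(arrival, key=lambda c: (arrival[c], c)):
        start = max(arrival[core], lock_free_at)  # wait while the lock is held
        end = start + critical_len
        schedule[core] = (start, end)
        lock_free_at = end
    return schedule


# Same program, two timings: the waiting core changes.
scenario_1 = lock_schedule({0: 10, 1: 15}, critical_len=20)
scenario_2 = lock_schedule({0: 15, 1: 10}, critical_len=20)
print(scenario_1)  # core 1 waits for core 0
print(scenario_2)  # core 0 waits for core 1
```

A trace captured during the first run would contain core 1's waiting time, even when replayed on a machine where core 1 arrives first.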

SLIDE 3

Trace-driven simulation has advantages

  • Avoid computational requirements of simulated applications.
  • Memory footprint.
  • Disk space for input sets.
  • Simulate applications with non-accessible sources, but accessible traces.
  • Confidential/restricted applications.
  • Lower modeling complexity.
  • Different host (1) and target (2) ISAs / endianness.
  • Problem: How to appropriately simulate multithreaded applications using traces?

(1) Host: system where the simulator executes.
(2) Target: system modeled in the simulator.

SLIDE 4

Targeting applications with decoupled execution

  • Distinguish the user code (sequential code sections) from parallelism-management operations (parops).

[Figure: execution timelines for task-based parallel applications (create task 1, exec task 1, completion of task 1, sync, on Core 0 and Core 1) and loop-based parallel applications (parallel loop and sync on Cores 0 to 3). Legend: sequential code section, parop execution, parop call, idle, switch.]

SLIDE 5

How traces are collected (I)

[Figure: a parallel loop executing on Cores 0 to 3, with sync operations.]

SLIDE 6

How traces are collected (II)

  • Capture traces for sequential code sections.
  • Execution is independent of the environment.

trace

20: sub r15, r12, r13
24: store r35, r15 (0x7e6a0)
28: sub r3, r31, r4
2c: load r21, r7 (0x80a88)
30: addi r3, r3
34: beq r3 (next_i: 7C)
7c: mul r32, r8, r9
80: mul r33, r10, r11
84: mul r34, r12, r13
88: store r32, r17 (0x7f280)
8c: store r33, r18 (0x7f284)

[Figure: the parallel loop timeline on Cores 0 to 3, with a trace captured for each sequential code section.]

SLIDE 7

How traces are collected (III)

  • Capture traces for sequential code sections.
  • Execution is independent of the environment.
  • Capture calls to parops.
  • Specific parop call events are included in the trace.

[Figure: the timeline on Cores 0 to 3 with traces for the sequential code sections, annotated with the call to parallel loop and the calls to sync.]

SLIDE 8

How traces are collected (IV)

  • Capture traces for sequential code sections.
  • Execution is independent of the environment.
  • Capture calls to parops.
  • Specific parop call events are included in the trace.
  • Do not capture the execution of parops.
  • Execution depends on the environment.

[Figure: Cores 0 to 3 showing only the traces of the sequential code sections, with the call to parallel loop and the calls to sync marked.]
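The collection rules above can be sketched as a trace format with two kinds of records. This is a minimal illustration only: the class names, fields, and the `iterations` argument are hypothetical and are not the actual trace format used by the authors. Instruction records describe sequential code sections. A parop appears only as a call event, because its execution is left to simulation time.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Instruction:
    """One record of a sequential code section (illustrative fields)."""
    pc: int
    opcode: str
    mem_addr: Optional[int] = None  # set for loads and stores only


@dataclass
class ParopCall:
    """A call to a parop: only the call is recorded, not its execution."""
    name: str
    args: dict = field(default_factory=dict)


TraceRecord = Union[Instruction, ParopCall]


def split_sections(trace: List[TraceRecord]):
    """Group a trace into sequential sections separated by parop calls."""
    sections, current = [], []
    for rec in trace:
        if isinstance(rec, ParopCall):
            sections.append(current)  # close the section before the call
            sections.append(rec)
            current = []
        else:
            current.append(rec)
    sections.append(current)  # trailing section (possibly empty)
    return sections


trace = [
    Instruction(0x20, "sub"),
    Instruction(0x24, "store", 0x7E6A0),
    ParopCall("parallel_loop", {"iterations": 4}),  # hypothetical argument
    Instruction(0x7C, "mul"),
    ParopCall("sync"),
]
parts = split_sections(trace)
print([p.name if isinstance(p, ParopCall) else len(p) for p in parts])
```

The simulator would replay each instruction list as recorded and hand every `ParopCall` to the component that implements parops.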

SLIDE 9

Simulation framework

  • The trace-driven simulator simulates sequential code sections.
  • The dynamic component executes parops at simulation time.
  • It includes the implementation of parops.
  • Parops are exposed to the simulator through the parop interface.
  • The architecture state is exposed to the dynamic component through the target architecture interface.

[Figure: the trace-driven simulator and the dynamic component, connected by the parop interface and the target architecture interface.]
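The two interfaces can be sketched as a pair of abstract classes. All class and method names here (`TargetArchInterface`, `ParopInterface`, `idle_cores`, `call`, and the toy implementations) are assumptions made for illustration, not the framework's real API. The point is the direction of each dependency: the simulator calls into the dynamic component for parops, and the dynamic component queries the simulated machine rather than the host.

```python
from abc import ABC, abstractmethod


class TargetArchInterface(ABC):
    """Architecture state the simulator exposes to the dynamic component."""

    @abstractmethod
    def idle_cores(self):
        ...


class ParopInterface(ABC):
    """Parops the dynamic component exposes to the simulator."""

    @abstractmethod
    def call(self, core, parop, args):
        ...


class ToySimulator(TargetArchInterface):
    def __init__(self, n_cores):
        self.busy = [False] * n_cores

    def idle_cores(self):
        return [c for c, b in enumerate(self.busy) if not b]


class ToyRuntime(ParopInterface):
    def __init__(self, arch):
        self.arch = arch
        self.log = []

    def call(self, core, parop, args):
        # Decisions are based on the *simulated* machine state.
        self.log.append((parop, core, self.arch.idle_cores()))


sim = ToySimulator(4)
sim.busy[0] = True            # core 0 is replaying a trace
runtime = ToyRuntime(sim)
runtime.call(0, "sync", {})   # the simulator hits a parop call event
print(runtime.log)
```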

SLIDE 10

Sample implementation: TaskSim – NANOS++

  • Parops are exposed to the simulator through the parop interface.
  • It includes operations for task management and synchronization.
  • The architecture state and associated actions are exposed to NANOS++ through the architecture-dependent module.
  • NANOS++ can alter the simulator state and manage the simulated thread according to the decisions based on the target architecture.

[Figure: TaskSim (cores C, L1 caches and an L2) connected to NANOS++ through the target architecture interface and the parop interface. Operations shown: create task, wait for tasks, wait on data, execute task, start/join, bind, yield.]

SLIDE 11

OmpSs application example

  • Cholesky factorization.
  • Tasks are spawned on pragma task annotations.
  • Inputs and outputs are specified for automatic dependence resolution.

float A[N][N][M][M]; // NxN blocked matrix,
                     // with MxM blocks
for (int j = 0; j<N; j++) {
  for (int k = 0; k<j; k++)
    for (int i = j+1; i<N; i++)
      #pragma task input(a, b) inout(c)
      sgemm_t(A[i][k], A[j][k], A[i][j]);
  for (int i = 0; i<j; i++)
    #pragma task input(a) inout(b)
    ssyrk_t(A[j][i], A[j][j]);
  #pragma task inout(a)
  spotrf_t(A[j][j]);
  for (int i = j+1; i<N; i++)
    #pragma task input(a) inout(b)
    strsm_t(A[j][j], A[i][j]);
}
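To get a feel for how many tasks this loop nest spawns, the sketch below mirrors the same loops and counts one task per annotated call. The helper `cholesky_task_counts` is a hypothetical name introduced for illustration. It only counts tasks and does not model dependences or perform the factorization.

```python
from collections import Counter


def cholesky_task_counts(n):
    """Count the tasks spawned per kernel for an n x n blocked matrix."""
    counts = Counter()
    for j in range(n):
        for k in range(j):
            for i in range(j + 1, n):
                counts["sgemm_t"] += 1
        for i in range(j):
            counts["ssyrk_t"] += 1
        counts["spotrf_t"] += 1
        for i in range(j + 1, n):
            counts["strsm_t"] += 1
    return counts


counts = cholesky_task_counts(4)
print(dict(counts), sum(counts.values()))
```

For a 4x4 blocked matrix this gives 4 sgemm_t, 6 ssyrk_t, 4 spotrf_t and 6 strsm_t tasks, 20 in total.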

SLIDE 12

Traces for OmpSs applications

  • Sequential code sections correspond to tasks.
  • One trace for the main task.
  • The thread starting the program execution at the main function.
  • One trace for each task.
  • Information for each function call.
  • E.g., for task creation it needs the task id and the input and output data addresses and sizes.

[Figure: application trace layout: the main task with parop calls + info, followed by task 1, task 2, task 3, …, task N.]

SLIDE 13

Simulation example (I)

1. Simulation starts the main task.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization.]

SLIDE 14

Simulation example (II)

2. On a create task event, TaskSim calls the corresponding operation in the parop interface.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1.]

SLIDE 15

Simulation example (III)

3. That triggers the creation of the task in NANOS++.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1.]

SLIDE 16

Simulation example (IV)

4. NANOS++ returns control to TaskSim. Core 1 takes task 1 for simulation.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1.]

SLIDE 17

Simulation example (V)

5. TaskSim resumes simulation, and Core 1 starts simulating task 1.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1, exec task 1.]

SLIDE 18

Simulation example (VI)

6. On the create task 2 event, TaskSim calls the runtime again.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1, exec task 1, create task 2.]

SLIDE 19

Simulation example (VII)

7. NANOS++ creates task 2, and returns control to TaskSim.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1, exec task 1, create task 2.]

SLIDE 20

Simulation example (VIII)

8. When Core 1 finishes the execution of task 1, it starts task 2.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1, exec task 1, create task 2, exec task 2.]

SLIDE 21

Simulation example (IX)

9. TaskSim reaches a synchronization parop. NANOS++ checks for pending tasks.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1, exec task 1, create task 2, exec task 2, task wait.]

SLIDE 22

Simulation example (X)

10. All tasks are finished, and TaskSim continues the main task simulation.

[Figure: TaskSim and NANOS++ linked by the parop interface and the architecture-dependent operations. Timeline for Core 0 and Core 1: initialization, create task 1, exec task 1, create task 2, exec task 2, task wait.]
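The ten steps can be condensed into a toy two-core simulation. This is a sketch under loud assumptions: `ToyRuntime`, `simulate`, the event names, and all cycle counts are hypothetical stand-ins, not TaskSim or NANOS++ code. Core 0 replays the main-task trace, every `create_task` and `task_wait` event is handed to the runtime, and Core 1 runs whatever task the runtime returns.

```python
from collections import deque


class ToyRuntime:
    """Much simplified stand-in for the dynamic component (not the NANOS++ API)."""

    def __init__(self):
        self.ready = deque()
        self.pending = 0

    def create_task(self, task_id):
        self.ready.append(task_id)
        self.pending += 1

    def next_task(self):
        return self.ready.popleft() if self.ready else None

    def task_finished(self):
        self.pending -= 1

    def all_done(self):
        return self.pending == 0


def simulate(main_trace, task_cycles):
    """Return a log of (cycle, core, event) for a 2-core toy machine."""
    rt, log = ToyRuntime(), []
    main = deque(main_trace)
    busy_until = [0, 0]   # cycle at which each core becomes free
    running = None        # task currently simulated on core 1
    cycle = 0
    while main or running is not None or not rt.all_done():
        # Core 1: retire a finished task, then ask the runtime for work.
        if running is not None and cycle >= busy_until[1]:
            rt.task_finished()
            running = None
        if running is None:
            running = rt.next_task()
            if running is not None:
                busy_until[1] = cycle + task_cycles[running]
                log.append((cycle, 1, f"exec_task {running}"))
        # Core 0: replay the main-task trace.
        if main and cycle >= busy_until[0]:
            kind, arg = main[0]
            blocked = kind == "task_wait" and not rt.all_done()
            if not blocked:
                main.popleft()
                if kind == "compute":
                    busy_until[0] = cycle + arg
                elif kind == "create_task":
                    rt.create_task(arg)
                log.append((cycle, 0, kind if arg is None else f"{kind} {arg}"))
        cycle += 1
    return log


main_trace = [("compute", 3), ("create_task", 1), ("compute", 2),
              ("create_task", 2), ("task_wait", None), ("compute", 1)]
log = simulate(main_trace, task_cycles={1: 4, 2: 3})
for entry in log:
    print(entry)
```

In this toy model a parop call costs one cycle and the main task never executes tasks itself. Both are simplifications chosen to keep the sketch short.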

SLIDE 23

Task generation scheme scalability

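  • Task generation (green) on the main task limits scalability (on the left).
  • Parallelization of task generation (on the right) is crucial to avoid this bottleneck.

[Figure: execution timelines labeled 16p, 32p and 64p, with task generation on the main task on the left and parallelized task generation on the right.]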
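A back-of-the-envelope model shows why serial task generation caps scalability. Everything in it is an assumption for illustration: the function `makespan`, the costs, and the simplification that generator threads are separate from worker cores. It does not reproduce the measurements in the figure. If one thread needs `create_cost` cycles per task and each task runs for `task_len` cycles, adding cores beyond roughly `task_len / create_cost` cannot shorten the run, because tasks are not produced fast enough.

```python
import heapq


def makespan(n_tasks, create_cost, task_len, cores, generators=1):
    """Finish time when `generators` threads create tasks one after another
    and `cores` workers execute them greedily in creation order."""
    # Each generator releases its k-th task at cycle (k + 1) * create_cost.
    releases = sorted(((i // generators) + 1) * create_cost
                      for i in range(n_tasks))
    free = [0] * cores          # min-heap of cycles at which workers are free
    heapq.heapify(free)
    end = 0
    for release in releases:
        start = max(release, heapq.heappop(free))
        finish = start + task_len
        heapq.heappush(free, finish)
        end = max(end, finish)
    return end


# Hypothetical numbers: 1024 tasks, 1 cycle to create, 16 cycles to run.
serial_16 = makespan(1024, 1, 16, cores=16)
serial_64 = makespan(1024, 1, 16, cores=64)
parallel_64 = makespan(1024, 1, 16, cores=64, generators=4)
print(serial_16, serial_64, parallel_64)
```

With a single generator, 64 cores finish no earlier than 16 cores. Spreading generation over four threads lets the 64 cores be kept busy.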

SLIDE 24

Coverage and opportunities

  • Appropriate for high-level programming models.
  • OpenMP, OmpSs, Cilk,…
  • Mixing scheduling/synchronization and application code is limited.
  • Runtime system can be used as the dynamic component.
  • Not suitable for:
  • Scheduling dependent on user code (user-guided scheduling).
  • Computation based on random values (e.g., Monte Carlo algorithms).
  • Runtime system development:
  • Scheduling policies.
  • Overall efficiency optimizations.
  • For future machines before the actual hardware is available.
  • Runtime software/hardware co-design.
  • Hardware support for runtime system.

SLIDE 25

Conclusions

  • We propose a novel trace-driven simulation methodology for multithreaded applications.
  • The methodology is based on distinguishing:
  • Application intrinsic behavior (user code).
  • Parallelism-management operations (parops).
  • It allows different architecture configurations to be simulated properly:
  • With different numbers of cores.
  • Using a single trace per application.
  • It provides a framework not only for architecture exploration but also for runtime system development.