SLIDE 1

Analysis of Data Reuse in Task Parallel Runtimes

Miquel Pericàs⋆, Abdelhalim Amer⋆, Kenjiro Taura† and Satoshi Matsuoka⋆

⋆Tokyo Institute of Technology   †The University of Tokyo
PMBS’13, Denver, November 18th 2013

SLIDE 2

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 3

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 4

Task Parallel Programming Models

  • Task-parallel programming models are popular tools for multicore programming
  • They are general, simple, and can be implemented efficiently

[Figure: tasks expressed as a DAG are mapped by the runtime layer (Cilk, TBB, OpenMP, ...) onto the cores]

  • Task-parallel runtimes manage the assignment of tasks to cores, allowing programmers to write cleaner code
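
A minimal sketch of this programming style, here using TBB's task_group (any of the runtimes above exposes a similar spawn/sync interface; the example and its cutoff-free base case are for illustration only):

```cpp
#include <cstdio>
#include <tbb/task_group.h>

// Recursive Fibonacci expressed as tasks: each call spawns a child task that
// the runtime may run locally or let another core steal.
long fib(int n) {
    if (n < 2) return n;              // base case; real code would add a serial cutoff
    long a = 0, b = 0;
    tbb::task_group g;
    g.run([&] { a = fib(n - 1); });   // child task, eligible for stealing
    b = fib(n - 2);                   // parent continues on the current core
    g.wait();                         // join: wait for the spawned child
    return a + b;
}

int main() {
    std::printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```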

SLIDE 5

Performance of Runtime Systems

  • Runtime schedulers implement heuristics to maximize parallelism and optimize resource sharing
  • Performance can depend considerably on such heuristics; degradation often occurs without any obvious reason

[Figure: speed-up vs. number of cores (1 to 24) for four anonymized runtimes A to D against the linear ideal; the curves diverge markedly, raising the question of why]

SLIDE 6

Scalability of task parallel applications

Why do task parallel codes not scale linearly?

  • Runtime Overheads: execution cycles spent inside API calls
  • Parallel Idleness: cycles lost due to load imbalance and lack of parallelism
  • Resource Sharing: additional cycles due to contention or destructive sharing → work time inflation (WTI)

SLIDE 7

Quantifying Parallelization Stretch

  • OVR_N = non-work overheads at N cores (API + IDLE time)
  • WTI_N = work time inflation at N cores

[Figure: serial execution of work units W1 to W6 on one core vs. parallel execution on 4 cores, where per-core API, IDLE, and inflated work segments add up to OVR_4 and WTI_4]

Parallel Stretch: T_par = (T_ser / N) × WTI_N × OVR_N   →   Speed-Up_N = N / (OVR_N × WTI_N)
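
As an illustrative check of the model (numbers invented for exposition, not taken from the measurements later in the talk): with OVR_24 = 1.2 and WTI_24 = 1.1, the predicted Speed-Up_24 = 24 / (1.2 × 1.1) ≈ 18.2×.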

SLIDE 8

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 9

Case Study: Matmul and FMM

Matrix Multiplication (C = A × B)

  • Input Size: 4096×4096 elements
  • Task Inputs/Outputs: 2D submatrices
  • Average task size [1]: 17 µs

Fast Multipole Method: Tree Traversal [2]

  • Input Size: 1 million particles (Plummer distribution)
  • Task Inputs/Outputs: octree cells (multipoles and vectors of bodies)
  • Average task size: 3.25 µs

[1] Measured on an Intel Xeon E7-4807 at 1.86 GHz
[2] https://bitbucket.org/rioyokota/exafmm-dev
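
To make the Matmul decomposition concrete, here is a minimal sketch of one way to express it with per-block tasks; the block size, layout, and use of TBB are assumptions for illustration, not the exact code that was benchmarked:

```cpp
#include <tbb/task_group.h>

constexpr int N  = 4096;   // matrix dimension
constexpr int BS = 128;    // block size (illustrative choice)

// Compute one BS x BS block of C: the task's inputs/outputs are 2D submatrices.
void block_kernel(const float* A, const float* Bm, float* C, int bi, int bj) {
    for (int i = bi; i < bi + BS; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = bj; j < bj + BS; ++j)
                C[i * N + j] += A[i * N + k] * Bm[k * N + j];
}

// Spawn one task per block of C; blocks are disjoint, so tasks do not race.
void matmul(const float* A, const float* Bm, float* C) {
    tbb::task_group g;
    for (int bi = 0; bi < N; bi += BS)
        for (int bj = 0; bj < N; bj += BS)
            g.run([=] { block_kernel(A, Bm, C, bi, bj); });
    g.wait();
}
```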

SLIDE 10

Case Study: three runtimes

  • MassiveThreads: Cilk-like runtime with a random work stealer and a work-first policy.
  • Threading Building Blocks: C++ template-based runtime with a random work stealer and a help-first policy.
  • Qthread: locality-aware runtime with a shared task queue. A set of workers is grouped into a shepherd; work is stolen in bulk across shepherds (50% of the victim's tasks). Help-first policy.

[Figure: MassiveThreads and TBB use per-core task queues with LIFO local scheduling and FIFO work stealing (work-first vs. help-first task creation, respectively); Qthread groups the cores of a NUMA node into a shepherd with a shared LIFO queue and bulk FIFO work stealing across shepherds]
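
A minimal sketch of the difference between the two spawn policies; the function names and the single std::deque are illustrative simplifications, not any runtime's real internals:

```cpp
#include <deque>
#include <functional>

using Task = std::function<void()>;
static thread_local std::deque<Task> ready_queue;  // per-worker deque; thieves take from the other end

// Work-first (MassiveThreads, Cilk-like): the worker dives into the child task
// immediately; the parent's continuation is left behind for thieves to steal.
void spawn_work_first(Task child, Task continuation) {
    ready_queue.push_back(std::move(continuation));
    child();
}

// Help-first (TBB, Qthread): the worker keeps executing the parent; the newly
// created child is what becomes available for stealing.
void spawn_help_first(Task child, Task continuation) {
    ready_queue.push_back(std::move(child));
    continuation();
}
```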

SLIDE 11

Experimental Setup

  • Experimental platform: a 4-socket Intel Xeon E7-4807 (Westmere) machine with 6 cores per die (1.87 GHz) and 18 MB of LLC.

  • We specify the same subset of cores for every experiment
  • The following runtime configurations are used:

Runtime      Task Creation   Work Stealing     Task Queue
MTH          Work-First      Random / 1 task   Core / LIFO
TBB          Help-First      Random / 1 task   Core / LIFO
QTH/Core     Help-First      Random / Bulk     Core / LIFO
QTH/Socket   Help-First      Random / Bulk     Socket / LIFO

SLIDE 12

Speed-Ups for Matmul and FMM

[Figure: speed-up vs. number of cores (1 to 24) for MassiveThreads, Threading Building Blocks, QThread/Core and QThread/Socket against the linear ideal; Matmul (left) and FMM (right)]

Performance Variation at 24 Cores:

  • Matmul: 16×–21× (MTH best, QTH/Socket worst)
  • FMM: 9×–18× (MTH best, TBB worst)

SLIDE 13

Overheads (OVRN) for Matmul and FMM

[Figure: non-work overheads (OVR_N) vs. number of cores for the four runtime configurations; Matmul (left) and FMM (right)]

Overheads are obtained by measuring the time cores spend outside of work kernels. At 24 cores:

  • Matmul: 1.1×–1.4× (MTH best; QTH/Socket worst)
  • FMM: 1.3×–2.2× (MTH best; TBB and QTH/Socket worst)

SLIDE 14

Do overheads alone explain performance?

Normalized speed-up overhead product: Speed-Up_N = N / (OVR_N × WTI_N)   →   (Speed-Up_N × OVR_N) / N = 1 / WTI_N

  • The normalized speed-up overhead product is a measure of the performance loss due to resource sharing
  • A value of 1.0 means no work time inflation is occurring
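
Continuing the illustrative numbers from the parallel-stretch example (again invented for exposition): a speed-up of 18.2× at 24 cores with OVR_24 = 1.2 gives a normalized product of 18.2 × 1.2 / 24 ≈ 0.91, i.e. 1 / WTI_24 with WTI_24 = 1.1.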

SLIDE 15

Normalized speed-up overhead product

[Figure: normalized speed-up × overhead product vs. number of cores for the four runtime configurations; Matmul (left) and FMM (right)]

Speed-up degradation due to resource contention

  • Matmul: 2%–10% (MTH best; TBB worst)
  • FMM: 2%–18% (MTH best; TBB worst)
  • Reason? Cache effects due to the different orders in which tasks are executed

SLIDE 16

Performance bottlenecks analysis

  • Overheads can be studied with a variety of tools:
      • Sampling-based: perf [1], HPCToolkit [2], Extrae [3], etc.
      • Tracing-based: VampirTrace [4], TAU [5], Extrae, etc.
      • Runtime library support
  • How can we analyze the impact of different runtime schedulers on data locality?
    → Proposal: use the reuse distance to evaluate cache performance

[1] https://perf.wiki.kernel.org
[2] http://hpctoolkit.org/
[3] http://www.bsc.es/computer-sciences/performance-tools/paraver
[4] http://www.vampir.eu
[5] http://tau.uoregon.edu

SLIDE 17

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 18

Multicore-aware Reuse Distance

[Figure: single-threaded reuse distance, computed on one core's private-cache trace (a b c d e a f g f, giving reuse distances 4 and 1), vs. multi-threaded reuse distance, computed on the interleaved traces of two cores as seen by the shared L3]

  • Generation of full address traces is too intrusive → it changes the task schedules
  • Computing the reuse distance is expensive
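
For reference, a minimal (and deliberately naive, quadratic-time) sketch of the classic single-threaded reuse-distance computation that the figure illustrates; a production tool would use a tree-based algorithm, and the KRD tool described next avoids per-access tracking altogether:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <unordered_map>
#include <vector>

// distance -> number of accesses with that reuse distance (-1 encodes a cold miss)
std::map<long, size_t> reuse_histogram(const std::vector<uint64_t>& trace) {
    std::unordered_map<uint64_t, size_t> last_use;   // address -> index of its previous access
    std::map<long, size_t> hist;
    for (size_t i = 0; i < trace.size(); ++i) {
        auto it = last_use.find(trace[i]);
        if (it == last_use.end()) {
            ++hist[-1];                              // first touch: infinite distance
        } else {
            std::set<uint64_t> distinct;             // distinct addresses since the last use
            for (size_t j = it->second + 1; j < i; ++j) distinct.insert(trace[j]);
            ++hist[(long)distinct.size()];
        }
        last_use[trace[i]] = i;
    }
    return hist;
}

int main() {
    // The trace from the slide: a b c d e a f g f  (reuse distances 4 and 1)
    std::vector<uint64_t> t = {0, 1, 2, 3, 4, 0, 5, 6, 5};
    for (const auto& [dist, count] : reuse_histogram(t))
        std::printf("distance %ld: %zu accesses\n", dist, count);
    return 0;
}
```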

SLIDE 19

Lightweight data tracing

We make several assumptions to reduce the cost of the metric:

  • Cache performance is dominated by global (shared) data
    → short-lived stack variables are not tracked; only the kernel inputs/outputs are recorded
  • Performance is dominated by last-level cache misses
    → we interleave the address streams of all threads and compute a single reuse distance histogram
  • For large reuse distances, individual load/store tracking is not needed
    → we record kernel inputs in bulk as (timestamp, address, size) tuples
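
A minimal sketch of such a bulk record; the field and function names are assumptions for illustration, not the paper's actual instrumentation API:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per kernel input/output region instead of one entry per load/store.
struct KernelAccess {
    uint64_t  timestamp_ns;  // when the kernel touched the region
    uintptr_t base;          // start address of the region
    size_t    size;          // region size in bytes
};

thread_local std::vector<KernelAccess> worker_trace;  // one trace per worker thread

inline void record_region(const void* ptr, size_t bytes) {
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    worker_trace.push_back({
        (uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>(now).count(),
        (uintptr_t)ptr, bytes});
}
```

A task would call record_region once per input and output region before running its kernel, keeping the tracing overhead far below per-access instrumentation.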

SLIDE 20

Kernel Reuse Distance (KRD)

[Figure: per-core kernel access traces are merged into one stream and replayed against a model of the shared LLC; first-time accesses count as cold misses]

1) Kernel Data Trace Generation
2) Merging/Synchronization
3) Histogram Generation
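
A minimal sketch of step 2, assuming the per-worker traces use the KernelAccess record sketched on the previous slide (re-declared here to keep the example self-contained):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KernelAccess { uint64_t timestamp_ns; uintptr_t base; size_t size; };

// Interleave the per-worker traces into one global stream ordered by timestamp;
// this approximates the access order seen by the shared last-level cache.
std::vector<KernelAccess> merge_traces(const std::vector<std::vector<KernelAccess>>& per_worker) {
    std::vector<KernelAccess> merged;
    for (const auto& t : per_worker)
        merged.insert(merged.end(), t.begin(), t.end());
    std::sort(merged.begin(), merged.end(),
              [](const KernelAccess& a, const KernelAccess& b) {
                  return a.timestamp_ns < b.timestamp_ns;
              });
    return merged;
}
```

Step 3 then runs a reuse-distance computation like the one sketched earlier, but over whole (address, size) regions rather than individual loads and stores.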

SLIDE 21

Kernel Reuse Distance: Application

Kernel Reuse Distance (KRD)

KRD provides an intuitive measure of data reuse quality. We want to make quick assessments of reuse and compare the behavior of different schedulers.

SLIDE 22

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
     • KRD histograms and runtime schedulers
     • KRD histograms and performance
5. Current Weaknesses
6. Conclusions

SLIDE 23

Instrumentation

  • We record submatrices for Matmul, and multipoles and body arrays for FMM
  • Total overhead is below 5% for FMM and below 1% for Matmul
  • As the memory traces record data regions, histogram generation is much faster

SLIDE 24

KRD histograms and runtime schedulers

  • We first analyze the correlation between different schedulers and the KRD metric
  • Four schedulers:
      • MassiveThreads, TBB, Qthread/Core and Qthread/Socket
  • Three system configurations:
      • 1 core
      • 1 socket (6 cores)
      • 4 sockets (24 cores)

SLIDE 25

Single Core Kernel Reuse Distance (KRD-1)

[Figure: single-core KRD histograms (reuse ratio vs. reuse distance in bytes; 32 KB to 256 MB for Matmul, 256 B to 64 MB for FMM) for the four runtime configurations; Matmul (left) and FMM (right)]

Almost no variation between the histograms:

  • In the absence of work steals, the execution order is determined only by the Work-First or Help-First policy
  • The Matmul kernel order is independent of the spawn policy; FMM is sensitive, but the differences are still minimal

SLIDE 26

Single Socket / 6 Core Kernel Reuse Distance (KRD-6)

[Figure: KRD histograms on 6 cores (one socket) for Matmul (left) and FMM (right) across the four runtime configurations]

  • QTH/Socket's shared queue improves temporal locality
  • The other schedulers show almost no difference; TBB is slightly better
  • The differences for FMM are much smaller

SLIDE 27

Four Sockets / 24 Core Kernel Reuse Distance (KRD-24)

[Figure: KRD histograms on 24 cores (four sockets) for Matmul (left) and FMM (right) across the four runtime configurations]

  • The differences in distant reuses grow
  • QTH/Socket's shared queue also improves temporal locality with multiple sockets
  • TBB suffers in the multiple-socket configuration

SLIDE 28

Impact of Multiple Sockets on Cold Accesses

[Figure: large-distance tail of the KRD histograms for Matmul (left) and FMM (right) on 1 socket (top) and 4 sockets (bottom). The annotated reuse ratios drop from 99.1% (Matmul) and 97.5% (FMM) on one socket to 96.2%–97.4% and 93.5%–95.6% on four sockets, i.e. more accesses become cold as sockets are added]

SLIDE 29

KRD histograms and performance

  • We want to understand how the KRD metric correlates with hardware performance metrics
  • We choose a multi-socket, low-overhead scenario: Matmul on 2 sockets / 12 cores
      • Low overhead: MTH, TBB and QTH/Core overheads are around 1.1–1.2×
      • Moderate overhead: QTH/Socket overhead is 1.35×

SLIDE 30

Hardware Metrics and KRD for Matmul on 2 sockets

Runtime      Exec. Time   OVR_12   LLC Misses     Kernel Time (Inflation)
MTH          1.642 s      1.094×   1.829×10^6     17441 ns (1.0250×)
TBB          1.742 s      1.11×    2.807×10^6     17898 ns (1.0519×)
QTH/Core     1.859 s      1.21×    2.339×10^6     17767 ns (1.0441×)
QTH/Socket   2.111 s      1.34×    1.987×10^6     18401 ns (1.0814×)

[Figure: zoom of the KRD histograms (reuse ratio 93–98%, distances 16 MB to 256 MB) for the four runtime configurations]

  • LLC misses correlate with the data reuse histograms
  • QTH/Socket's LLC miss rate and WTI are too large
  • Its runtime activity is causing additional misses and contention in the memory subsystem

SLIDE 31

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 32

Discussion

Tracks only global data accesses, in bulk

  • The user needs to ensure that stack accesses are negligible
  • For Matmul we found that stack accesses account for less than 1%

No modeling of the effects of cache coherence

  • This affects the multi-socket scenario; our results are optimistic

No measurement of spatial locality

  • A spatial locality metric could be added to model prefetchers

SLIDE 33

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 34

Summary of findings

  • We developed a tool based on the reuse distance to study data reuse in task parallel applications. The tool is designed to be lightweight and provides fast comparisons of different implementations.
  • Although the tool was developed to compare task parallel runtimes, it can be applied to any shared-memory programming model.
  • Our experiments indicate that the reuse distance histograms correlate with scheduler policies and with hardware metrics.

SLIDE 35

Thank you!

Questions?
