SLIDE 1

Analysis of Data Reuse in Task Parallel Runtimes

Miquel Pericàs⋆, Abdelhalim Amer⋆, Kenjiro Taura† and Satoshi Matsuoka⋆

⋆Tokyo Institute of Technology   †The University of Tokyo
PMBS’13, Denver, November 18th 2013

SLIDE 2

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 3

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 4

Task Parallel Programming Models

  • Task-parallel programming models are popular tools for multicore programming
  • They are general, simple, and can be implemented efficiently

[Figure: tasks expressed as a DAG are mapped by the runtime layer (Cilk, TBB, OpenMP, ...) onto the cores]

  • Task-parallel runtimes manage the assignment of tasks to cores, allowing programmers to write cleaner code
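
A minimal sketch of this programming style, here using TBB's task_group (any of the runtimes above exposes a similar spawn/sync interface; the example and its cutoff-free base case are for illustration only):

```cpp
#include <cstdio>
#include <tbb/task_group.h>

// Recursive Fibonacci expressed as tasks: each call spawns a child task that
// the runtime may run locally or let another core steal.
long fib(int n) {
    if (n < 2) return n;              // base case; real code would add a serial cutoff
    long a = 0, b = 0;
    tbb::task_group g;
    g.run([&] { a = fib(n - 1); });   // child task, eligible for stealing
    b = fib(n - 2);                   // parent continues on the current core
    g.wait();                         // join: wait for the spawned child
    return a + b;
}

int main() {
    std::printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```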

SLIDE 5

Performance of Runtime Systems

  • Runtime schedulers implement heuristics to maximize parallelism and optimize resource sharing
  • Performance can depend considerably on such heuristics; degradation often occurs without any obvious reason

[Figure: speed-up vs. number of cores (1 to 24) for four anonymized runtimes A to D against the linear ideal; the curves diverge markedly, raising the question of why]

SLIDE 6

Scalability of task parallel applications

Why do task parallel codes not scale linearly?

  • Runtime Overheads: execution cycles spent inside API calls
  • Parallel Idleness: cycles lost due to load imbalance and lack of parallelism
  • Resource Sharing: additional cycles due to contention or destructive sharing → work time inflation (WTI)

SLIDE 7

Quantifying Parallelization Stretch

  • OVR_N = non-work overheads at N cores (API + IDLE time)
  • WTI_N = work time inflation at N cores

[Figure: serial execution of work units W1 to W6 on one core vs. parallel execution on 4 cores, where per-core API, IDLE, and inflated work segments add up to OVR_4 and WTI_4]

Parallel Stretch: T_par = (T_ser / N) × WTI_N × OVR_N   →   Speed-Up_N = N / (OVR_N × WTI_N)
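
As an illustrative check of the model (numbers invented for exposition, not taken from the measurements later in the talk): with OVR_24 = 1.2 and WTI_24 = 1.1, the predicted Speed-Up_24 = 24 / (1.2 × 1.1) ≈ 18.2×.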

SLIDE 8

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 9

Case Study: Matmul and FMM

Matrix Multiplication (C = A × B)

  • Input Size: 4096×4096 elements
  • Task Inputs/Outputs: 2D submatrices
  • Average task size [1]: 17 µs

Fast Multipole Method: Tree Traversal [2]

  • Input Size: 1 million particles (Plummer distribution)
  • Task Inputs/Outputs: octree cells (multipoles and vectors of bodies)
  • Average task size: 3.25 µs

[1] Measured on an Intel Xeon E7-4807 at 1.86 GHz
[2] https://bitbucket.org/rioyokota/exafmm-dev
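
To make the Matmul decomposition concrete, here is a minimal sketch of one way to express it with per-block tasks; the block size, layout, and use of TBB are assumptions for illustration, not the exact code that was benchmarked:

```cpp
#include <tbb/task_group.h>

constexpr int N  = 4096;   // matrix dimension
constexpr int BS = 128;    // block size (illustrative choice)

// Compute one BS x BS block of C: the task's inputs/outputs are 2D submatrices.
void block_kernel(const float* A, const float* Bm, float* C, int bi, int bj) {
    for (int i = bi; i < bi + BS; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = bj; j < bj + BS; ++j)
                C[i * N + j] += A[i * N + k] * Bm[k * N + j];
}

// Spawn one task per block of C; blocks are disjoint, so tasks do not race.
void matmul(const float* A, const float* Bm, float* C) {
    tbb::task_group g;
    for (int bi = 0; bi < N; bi += BS)
        for (int bj = 0; bj < N; bj += BS)
            g.run([=] { block_kernel(A, Bm, C, bi, bj); });
    g.wait();
}
```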

SLIDE 10

Case Study: three runtimes

  • MassiveThreads: Cilk-like runtime with a random work stealer and a work-first policy.
  • Threading Building Blocks: C++ template-based runtime with a random work stealer and a help-first policy.
  • Qthread: locality-aware runtime with a shared task queue. A set of workers is grouped into a shepherd; work is stolen in bulk across shepherds (50% of the victim's tasks). Help-first policy.

[Figure: MassiveThreads and TBB use per-core task queues with LIFO local scheduling and FIFO work stealing (work-first vs. help-first task creation, respectively); Qthread groups the cores of a NUMA node into a shepherd with a shared LIFO queue and bulk FIFO work stealing across shepherds]
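
A minimal sketch of the difference between the two spawn policies; the function names and the single std::deque are illustrative simplifications, not any runtime's real internals:

```cpp
#include <deque>
#include <functional>

using Task = std::function<void()>;
static thread_local std::deque<Task> ready_queue;  // per-worker deque; thieves take from the other end

// Work-first (MassiveThreads, Cilk-like): the worker dives into the child task
// immediately; the parent's continuation is left behind for thieves to steal.
void spawn_work_first(Task child, Task continuation) {
    ready_queue.push_back(std::move(continuation));
    child();
}

// Help-first (TBB, Qthread): the worker keeps executing the parent; the newly
// created child is what becomes available for stealing.
void spawn_help_first(Task child, Task continuation) {
    ready_queue.push_back(std::move(child));
    continuation();
}
```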

SLIDE 11

Experimental Setup

  • Experimental platform: a 4-socket Intel Xeon E7-4807 (Westmere) machine with 6 cores per die (1.87 GHz) and 18 MB of LLC.

  • We specify the same subset of cores for every experiment
  • The following runtime configurations are used:

Runtime      Task Creation   Work Stealing     Task Queue
MTH          Work-First      Random / 1 task   Core / LIFO
TBB          Help-First      Random / 1 task   Core / LIFO
QTH/Core     Help-First      Random / Bulk     Core / LIFO
QTH/Socket   Help-First      Random / Bulk     Socket / LIFO

SLIDE 12

Speed-Ups for Matmul and FMM

[Figure: speed-up vs. number of cores (1 to 24) for MassiveThreads, Threading Building Blocks, QThread/Core and QThread/Socket against the linear ideal; Matmul (left) and FMM (right)]

Performance Variation at 24 Cores:

  • Matmul: 16×–21× (MTH best, QTH/Socket worst)
  • FMM: 9×–18× (MTH best, TBB worst)

SLIDE 13

Overheads (OVRN) for Matmul and FMM

[Figure: non-work overheads (OVR_N) vs. number of cores for the four runtime configurations; Matmul (left) and FMM (right)]

Overheads are obtained by measuring the time cores spend outside of work kernels. At 24 cores:

  • Matmul: 1.1×–1.4× (MTH best; QTH/Socket worst)
  • FMM: 1.3×–2.2× (MTH best; TBB and QTH/Socket worst)

SLIDE 14

Do overheads alone explain performance?

Normalized speed-up overhead product: Speed-Up_N = N / (OVR_N × WTI_N)   →   (Speed-Up_N × OVR_N) / N = 1 / WTI_N

  • The normalized speed-up overhead product is a measure of the performance loss due to resource sharing
  • A value of 1.0 means no work time inflation is occurring
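
Continuing the illustrative numbers from the parallel-stretch example (again invented for exposition): a speed-up of 18.2× at 24 cores with OVR_24 = 1.2 gives a normalized product of 18.2 × 1.2 / 24 ≈ 0.91, i.e. 1 / WTI_24 with WTI_24 = 1.1.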

SLIDE 15

Normalized speed-up overhead product

[Figure: normalized speed-up × overhead product vs. number of cores for the four runtime configurations; Matmul (left) and FMM (right)]

Speed-up degradation due to resource contention

  • Matmul: 2%–10% (MTH best; TBB worst)
  • FMM: 2%–18% (MTH best; TBB worst)
  • Reason? Cache effects due to the different orders in which tasks are executed

SLIDE 16

Performance bottlenecks analysis

  • Overheads can be studied with a variety of tools:
      • Sampling-based: perf [1], HPCToolkit [2], Extrae [3], etc.
      • Tracing-based: VampirTrace [4], TAU [5], Extrae, etc.
      • Runtime library support
  • How can we analyze the impact of different runtime schedulers on data locality?
    → Proposal: use the reuse distance to evaluate cache performance

[1] https://perf.wiki.kernel.org
[2] http://hpctoolkit.org/
[3] http://www.bsc.es/computer-sciences/performance-tools/paraver
[4] http://www.vampir.eu
[5] http://tau.uoregon.edu

SLIDE 17

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 18

Multicore-aware Reuse Distance

[Figure: single-threaded reuse distance, computed on one core's private-cache trace (a b c d e a f g f, giving reuse distances 4 and 1), vs. multi-threaded reuse distance, computed on the interleaved traces of two cores as seen by the shared L3]

  • Generation of full address traces is too intrusive → it changes the task schedules
  • Computing the reuse distance is expensive
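
For reference, a minimal (and deliberately naive, quadratic-time) sketch of the classic single-threaded reuse-distance computation that the figure illustrates; a production tool would use a tree-based algorithm, and the KRD tool described next avoids per-access tracking altogether:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <unordered_map>
#include <vector>

// distance -> number of accesses with that reuse distance (-1 encodes a cold miss)
std::map<long, size_t> reuse_histogram(const std::vector<uint64_t>& trace) {
    std::unordered_map<uint64_t, size_t> last_use;   // address -> index of its previous access
    std::map<long, size_t> hist;
    for (size_t i = 0; i < trace.size(); ++i) {
        auto it = last_use.find(trace[i]);
        if (it == last_use.end()) {
            ++hist[-1];                              // first touch: infinite distance
        } else {
            std::set<uint64_t> distinct;             // distinct addresses since the last use
            for (size_t j = it->second + 1; j < i; ++j) distinct.insert(trace[j]);
            ++hist[(long)distinct.size()];
        }
        last_use[trace[i]] = i;
    }
    return hist;
}

int main() {
    // The trace from the slide: a b c d e a f g f  (reuse distances 4 and 1)
    std::vector<uint64_t> t = {0, 1, 2, 3, 4, 0, 5, 6, 5};
    for (const auto& [dist, count] : reuse_histogram(t))
        std::printf("distance %ld: %zu accesses\n", dist, count);
    return 0;
}
```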

SLIDE 19

Lightweight data tracing

We make several assumptions to reduce the cost of the metric:

  • Cache performance is dominated by global (shared) data
    → short-lived stack variables are not tracked; only the kernel inputs/outputs are recorded
  • Performance is dominated by last-level cache misses
    → we interleave the address streams of all threads and compute a single reuse distance histogram
  • For large reuse distances, individual load/store tracking is not needed
    → we record kernel inputs in bulk as (timestamp, address, size) tuples
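
A minimal sketch of such a bulk record; the field and function names are assumptions for illustration, not the paper's actual instrumentation API:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per kernel input/output region instead of one entry per load/store.
struct KernelAccess {
    uint64_t  timestamp_ns;  // when the kernel touched the region
    uintptr_t base;          // start address of the region
    size_t    size;          // region size in bytes
};

thread_local std::vector<KernelAccess> worker_trace;  // one trace per worker thread

inline void record_region(const void* ptr, size_t bytes) {
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    worker_trace.push_back({
        (uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>(now).count(),
        (uintptr_t)ptr, bytes});
}
```

A task would call record_region once per input and output region before running its kernel, keeping the tracing overhead far below per-access instrumentation.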

SLIDE 20

Kernel Reuse Distance (KRD)

[Figure: per-core kernel access traces are merged into one stream and replayed against a model of the shared LLC; first-time accesses count as cold misses]

1) Kernel Data Trace Generation
2) Merging/Synchronization
3) Histogram Generation
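
A minimal sketch of step 2, assuming the per-worker traces use the KernelAccess record sketched on the previous slide (re-declared here to keep the example self-contained):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KernelAccess { uint64_t timestamp_ns; uintptr_t base; size_t size; };

// Interleave the per-worker traces into one global stream ordered by timestamp;
// this approximates the access order seen by the shared last-level cache.
std::vector<KernelAccess> merge_traces(const std::vector<std::vector<KernelAccess>>& per_worker) {
    std::vector<KernelAccess> merged;
    for (const auto& t : per_worker)
        merged.insert(merged.end(), t.begin(), t.end());
    std::sort(merged.begin(), merged.end(),
              [](const KernelAccess& a, const KernelAccess& b) {
                  return a.timestamp_ns < b.timestamp_ns;
              });
    return merged;
}
```

Step 3 then runs a reuse-distance computation like the one sketched earlier, but over whole (address, size) regions rather than individual loads and stores.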

SLIDE 21

Kernel Reuse Distance: Application

Kernel Reuse Distance (KRD)

KRD provides an intuitive measure of data reuse quality. We want to make quick assessments of reuse and compare the behavior of different schedulers.

SLIDE 22

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
     • KRD histograms and runtime schedulers
     • KRD histograms and performance
5. Current Weaknesses
6. Conclusions

SLIDE 23

Instrumentation

  • We record submatrices for Matmul, and multipoles and body arrays for FMM
  • Total overhead is below 5% for FMM and below 1% for Matmul
  • As the memory traces record data regions, histogram generation is much faster

SLIDE 24

KRD histograms and runtime schedulers

  • We first analyze the correlation between different schedulers and the KRD metric
  • Four schedulers:
      • MassiveThreads, TBB, Qthread/Core and Qthread/Socket
  • Three system configurations:
      • 1 core
      • 1 socket (6 cores)
      • 4 sockets (24 cores)

SLIDE 25

Single Core Kernel Reuse Distance (KRD-1)

[Figure: single-core KRD histograms (reuse ratio vs. reuse distance in bytes; 32 KB to 256 MB for Matmul, 256 B to 64 MB for FMM) for the four runtime configurations; Matmul (left) and FMM (right)]

Almost no variation between the histograms:

  • In the absence of work steals, the execution order is determined only by the Work-First or Help-First policy
  • The Matmul kernel order is independent of the spawn policy; FMM is sensitive, but the differences are still minimal

SLIDE 26

Single Socket / 6 Core Kernel Reuse Distance (KRD-6)

[Figure: KRD histograms on 6 cores (one socket) for Matmul (left) and FMM (right) across the four runtime configurations]

  • QTH/Socket's shared queue improves temporal locality
  • The other schedulers show almost no difference; TBB is slightly better
  • The differences for FMM are much smaller

SLIDE 27

Four Sockets / 24 Core Kernel Reuse Distance (KRD-24)

[Figure: KRD histograms on 24 cores (four sockets) for Matmul (left) and FMM (right) across the four runtime configurations]

  • The differences in distant reuses grow
  • QTH/Socket's shared queue also improves temporal locality with multiple sockets
  • TBB suffers in the multiple-socket configuration

SLIDE 28

Impact of Multiple Sockets on Cold Accesses

[Figure: large-distance tail of the KRD histograms for Matmul (left) and FMM (right) on 1 socket (top) and 4 sockets (bottom). The annotated reuse ratios drop from 99.1% (Matmul) and 97.5% (FMM) on one socket to 96.2%–97.4% and 93.5%–95.6% on four sockets, i.e. more accesses become cold as sockets are added]

SLIDE 29

KRD histograms and performance

  • We want to understand how the KRD metric correlates with hardware performance metrics
  • We choose a multi-socket, low-overhead scenario: Matmul on 2 sockets / 12 cores
      • Low overhead: MTH, TBB and QTH/Core overheads are around 1.1–1.2×
      • Moderate overhead: QTH/Socket overhead is 1.35×

SLIDE 30

Hardware Metrics and KRD for Matmul on 2 sockets

Runtime      Exec. Time   OVR_12   LLC Misses     Kernel Time (Inflation)
MTH          1.642 s      1.094×   1.829×10^6     17441 ns (1.0250×)
TBB          1.742 s      1.11×    2.807×10^6     17898 ns (1.0519×)
QTH/Core     1.859 s      1.21×    2.339×10^6     17767 ns (1.0441×)
QTH/Socket   2.111 s      1.34×    1.987×10^6     18401 ns (1.0814×)

[Figure: zoom of the KRD histograms (reuse ratio 93–98%, distances 16 MB to 256 MB) for the four runtime configurations]

  • LLC misses correlate with the data reuse histograms
  • QTH/Socket's LLC miss rate and WTI are too large
  • Its runtime activity is causing additional misses and contention in the memory subsystem

SLIDE 31

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 32

Discussion

Tracks only global data accesses, in bulk

  • The user needs to ensure that stack accesses are negligible
  • For Matmul we found that stack accesses account for less than 1%

No modeling of the effects of cache coherence

  • This affects the multi-socket scenario; our results are optimistic

No measurement of spatial locality

  • A spatial locality metric could be added to model prefetchers

SLIDE 33

Table of Contents

1. Task Parallel Runtimes
2. Case Study of Matmul and FMM
3. Kernel Reuse Distance
4. Experimental Evaluation
5. Current Weaknesses
6. Conclusions

SLIDE 34

Summary of findings

  • We developed a tool based on the reuse distance to study data reuse in task parallel applications. The tool is designed to be lightweight and provides fast comparisons of different implementations.
  • Although the tool was developed to compare task parallel runtimes, it can be applied to any shared-memory programming model.
  • Our experiments indicate that the reuse distance histograms correlate with scheduler policies and with hardware metrics.

SLIDE 35

Thank you!

Questions?
