Lithe: Enabling Efficient Composition of Parallel Libraries (PowerPoint Presentation)


SLIDE 1

BERKELEY PAR LAB

Lithe: Enabling Efficient Composition of Parallel Libraries

Heidi Pan, Benjamin Hindman, Krste Asanović

HotPar  Berkeley, CA  March 31, 2009

xoxo@mit.edu  {benh, krste}@eecs.berkeley.edu Massachusetts Institute of Technology  UC Berkeley

SLIDE 2

How to Build Parallel Apps?

Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7

[Diagram: the App runs on the OS and Hardware; building it means handling both Resource Management and Functionality.]

Need both programmer productivity and performance!
SLIDE 3

Composability is Key to Productivity

Functional Composability

  • Code reuse: the same library implementation (sort) serves different apps (App 1, App 2).
  • Modularity: the same app works with different library implementations (bubble sort, quick sort).

SLIDE 4

Composability is Key to Productivity

Performance Composability

fast + fast + fast =? fast(er)

Composing individually fast libraries should yield a fast(er) whole.

SLIDE 5

Talk Roadmap

  • Problem: Efficient parallel composability is hard!
  • Solution:
      • Harts
      • Lithe
  • Evaluation

SLIDE 6

Motivational Example

Sparse QR Factorization (Tim Davis, Univ of Florida)

[Diagram: System stack: SPQR calls TBB and MKL (which uses OpenMP), on top of the OS and Hardware. Software architecture: a Column Elimination Tree of Frontal Matrix Factorizations.]

SLIDE 7

Out-of-the-Box Performance

[Chart: Performance of SPQR on a 16-core machine. Time (sec) per input matrix, Out-of-the-Box vs sequential.]

SLIDE 8

Out-of-the-Box Libraries Oversubscribe the Resources

[Diagram: TBB and OpenMP each create virtualized kernel threads, which the OS multiplexes across Cores 0-3, oversubscribing the machine.]

SLIDE 9

MKL Quick Fix

Using Intel MKL with Threaded Applications

http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm

 If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.
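Intel's quick fix can be applied from the shell before launching the application; a minimal sketch (the launch line is hypothetical):

```shell
# Force MKL's internal OpenMP threading off so MKL calls made from the
# app's own worker threads (TBB threads here) don't oversubscribe cores.
export OMP_NUM_THREADS=1
# MKL also honors its own variable, which overrides OMP_NUM_THREADS:
export MKL_NUM_THREADS=1
# ./spqr_demo input.mtx   # hypothetical application launch
```

This trades oversubscription for sequential MKL, which is exactly the problem the next slides explore.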

SLIDE 10

Sequential MKL in SPQR

[Diagram: with sequential MKL, only TBB's threads remain, one per core across Cores 0-3.]

SLIDE 11

Sequential MKL Performance

[Chart: Performance of SPQR on a 16-core machine. Time (sec) per input matrix, Out-of-the-Box vs Sequential MKL.]

SLIDE 12

SPQR Wants to Use Parallel MKL

No task-level parallelism! Want to exploit matrix-level parallelism.

SLIDE 13

Share Resources Cooperatively

Tim Davis manually tunes the libraries to partition the resources effectively:

TBB_NUM_THREADS = 2
OMP_NUM_THREADS = 2

[Diagram: the four cores split evenly between TBB's and OpenMP's threads.]

SLIDE 14

Manually Tuned Performance

[Chart: Performance of SPQR on a 16-core machine. Time (sec) per input matrix: Out-of-the-Box, Sequential MKL, Manually Tuned.]

SLIDE 15

Manual Tuning Cannot Share Resources Effectively

[Diagram: some phases of the computation want the resources given to OpenMP, others want them given to TBB; a static manual split cannot adapt over time.]

SLIDE 16

Manual Tuning Destroys Functional Composability

[Diagram: Tim Davis's other code, LAPACK solving Ax=b on MKL/OpenMP, wants OMP_NUM_THREADS = 4; the setting tuned for SPQR doesn't carry over.]

SLIDE 17

Manual Tuning Destroys Performance Composability

[Diagram: SPQR must be retuned for each MKL version (v1, v2, v3) and for each app (1, 2, 3) it is composed with.]

SLIDE 18

Talk Roadmap

  • Problem: Efficient parallel composability is hard!
  • Solution:
      • Harts: better resource abstraction
      • Lithe: framework for sharing resources
  • Evaluation

SLIDE 19

Virtualized Threads are Bad

[Diagram: App 1 (TBB), App 1 (OpenMP), and App 2 all create virtualized threads that the OS multiplexes across Cores 0-7.]

Different codes compete unproductively for resources.

SLIDE 20

Space-Time Partitions aren’t Enough

[Diagram: the OS divides Cores 0-7 into Partition 1 (SPQR with TBB and MKL/OpenMP) and Partition 2 (App 2).]

Space-time partitions isolate different apps. But what should happen within an app?

SLIDE 21

Harts: Hardware Thread Contexts

  • Represent real hardware resources.
  • Requested, not created.
  • The OS doesn't manage harts for the app.

[Diagram: SPQR, TBB, and MKL/OpenMP run directly on harts corresponding to Cores 0-7.]

SLIDE 22

Sharing Harts

[Diagram: within a partition, Harts 0-3 are shared over time between TBB and OpenMP.]

SLIDE 23

Cooperative Hierarchical Schedulers

[Diagram: the application call graph maps onto a library (scheduler) hierarchy: Cilk at the root, with TBB, Ct, and OpenMP beneath it.]

  • Modular: each piece of the app is scheduled independently.
  • Hierarchical: a caller gives resources to its callee to execute on its behalf.
  • Cooperative: the callee gives resources back to the caller when done.

SLIDE 24

A Day in the Life of a Hart

[Timeline: a hart's day, moving through the scheduler hierarchy (Cilk at the root; TBB, Ct, OpenMP below):]

TBB Sched: next? → execute TBB task
TBB Sched: next? → execute TBB task
TBB Sched: next? → nothing left to do, give hart back to parent
Cilk Sched: next? → don't start a new task, finish an existing one first
Ct Sched: next? → ...

SLIDE 25

Standard Lithe ABI

[Diagram: a caller and callee share the function call ABI (call/return), an interface for exchanging values; a parent scheduler (e.g. TBBLithe) and a child scheduler (e.g. OpenMPLithe) share the Lithe ABI (register, unregister, request, enter, yield), an interface for sharing harts.]

  • Analogous to the function call ABI for enabling interoperable codes.
  • A mechanism for sharing harts, not policy.

SLIDE 26

Lithe Runtime

[Diagram: Lithe sits between the OS/Hardware and the scheduler hierarchy. It tracks the partition's harts and the current scheduler; TBBLithe and its child OpenMPLithe interact through register, unregister, request, enter, and yield.]

SLIDE 27

Register / Unregister

Register dynamically adds the new scheduler to the hierarchy; unregister removes it when its work is done:

matmult() {
  register(OpenMPLithe);
  ...
  unregister(OpenMPLithe);
}

[Diagram: OpenMPLithe registering with, then unregistering from, its parent TBBLithe over time.]

SLIDE 28

Request

Request asks for more harts from the parent scheduler:

matmult() {
  register(OpenMPLithe);
  request(n);
  ...
  unregister(OpenMPLithe);
}

[Diagram: the request flows from OpenMPLithe up to its parent TBBLithe.]

SLIDE 29

Enter / Yield

Enter and yield transfer additional harts between the parent and child: the parent directs a hart into the child scheduler with enter(OpenMPLithe), and the child hands a hart back with yield().

[Diagram: harts flow from TBBLithe into OpenMPLithe via enter, and back via yield, over time.]

SLIDE 30

SPQR with Lithe

[Timeline: during a matmult call, MKL's OpenMPLithe scheduler registers with TBBLithe, requests harts, receives them via enter, yields them back, and unregisters.]

SLIDE 31

SPQR with Lithe

[Timeline: the register / request / enter / yield / unregister cycle repeats for each of SPQR's matmult calls into MKL.]

SLIDE 32

Talk Roadmap

  • Problem: Efficient parallel composability is hard!
  • Solution:
      • Harts
      • Lithe
  • Evaluation

SLIDE 33

Implementation

  • Harts: simulated using pinned Pthreads on x86-Linux (~600 lines of C & assembly)
  • Lithe: user-level library providing register, unregister, request, enter, yield, ... (~2000 lines of C, C++, assembly)
  • TBBLithe: ~1500 of ~8000 relevant lines added/removed/modified
  • OpenMPLithe (GCC 4.4): ~1000 of ~6000 relevant lines added/removed/modified

SLIDE 34

No Lithe Overhead w/o Composing

  • TBBLithe performance (µbenchmarks included with the release):

                 tree sum    preorder     fibonacci
    TBBLithe     54.80 ms    228.20 ms    8.42 ms
    TBB          54.80 ms    242.51 ms    8.72 ms

  • OpenMPLithe performance (NAS parallel benchmarks):

                 conjugate gradient (cg)   LU solver (lu)   multigrid (mg)
    OpenMPLithe  57.06 s                   122.15 s         9.23 s
    OpenMP       57.00 s                   123.68 s         9.54 s

All results on Linux 2.6.18, 8-core Intel Clovertown.

SLIDE 35

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS for SPQR, input = ESOC.]

SLIDE 36

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS. Highlighted point: Sequential (TBB=1, OMP=1), 172.1 sec.]

SLIDE 37

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS. Highlighted point: Out-of-the-Box (TBB=8, OMP=8), 111.8 sec.]

SLIDE 38

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS. Highlighted point: Manually Tuned, 70.8 sec.]

SLIDE 39

Performance of SPQR with Lithe

[Chart: Time (sec) per input matrix: Out-of-the-Box vs Manually Tuned vs Lithe.]

SLIDE 40

Future Work

[Diagram: SPQR running on a family of Lithe-compliant schedulers: TBBLithe, OpenMPLithe, CtLithe, and CilkLithe, each exposing enter, yield, request, register, and unregister.]

SLIDE 41

Conclusion

  • Composability is essential for parallel programming to become widely adopted.
  • Parallel libraries need to share resources cooperatively.
  • Lithe project contributions:
      • Harts: a better resource model for parallel programming.
      • Lithe: enables parallel codes to interoperate by standardizing the sharing of harts.

[Diagram: SPQR with TBB and MKL/OpenMP, with resource management separated from functionality.]

SLIDE 42

Acknowledgements

We would like to thank George Necula and the rest of the Berkeley Par Lab for their feedback on this work. Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding from U.C. Discovery (Award #DIG07-10227). This work has also been supported in part by a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.