Lithe: Enabling Efficient Composition of Parallel Libraries (PowerPoint Presentation)


SLIDE 1

BERKELEY PAR LAB

Lithe: Enabling Efficient Composition of Parallel Libraries

Heidi Pan, Benjamin Hindman, Krste Asanović

HotPar  Berkeley, CA  March 31, 2009

xoxo@mit.edu  {benh, krste}@eecs.berkeley.edu Massachusetts Institute of Technology  UC Berkeley

SLIDE 2

How to Build Parallel Apps?

Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7

[Diagram: the App runs on the OS and Hardware; building it means handling both Resource Management and Functionality.]

Need both programmer productivity and performance!
SLIDE 3

Composability is Key to Productivity

Functional Composability

  • Code reuse: the same library implementation (sort) serves different apps (App 1, App 2).
  • Modularity: the same app works with different library implementations (bubble sort, quick sort).

SLIDE 4

Composability is Key to Productivity

Performance Composability

fast + fast + fast =? fast(er)

Composing individually fast libraries should yield a fast(er) whole.

SLIDE 5

Talk Roadmap

  • Problem: Efficient parallel composability is hard!
  • Solution:
      • Harts
      • Lithe
  • Evaluation

SLIDE 6

Motivational Example

Sparse QR Factorization (Tim Davis, Univ of Florida)

[Diagram: System stack: SPQR calls TBB and MKL (which uses OpenMP), on top of the OS and Hardware. Software architecture: a Column Elimination Tree of Frontal Matrix Factorizations.]

SLIDE 7

Out-of-the-Box Performance

[Chart: Performance of SPQR on a 16-core machine. Time (sec) per input matrix, Out-of-the-Box vs sequential.]

SLIDE 8

Out-of-the-Box Libraries Oversubscribe the Resources

[Diagram: TBB and OpenMP each create virtualized kernel threads, which the OS multiplexes across Cores 0-3, oversubscribing the machine.]

SLIDE 9

MKL Quick Fix

Using Intel MKL with Threaded Applications

http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm

 If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.
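Intel's quick fix can be applied from the shell before launching the application; a minimal sketch (the launch line is hypothetical):

```shell
# Force MKL's internal OpenMP threading off so MKL calls made from the
# app's own worker threads (TBB threads here) don't oversubscribe cores.
export OMP_NUM_THREADS=1
# MKL also honors its own variable, which overrides OMP_NUM_THREADS:
export MKL_NUM_THREADS=1
# ./spqr_demo input.mtx   # hypothetical application launch
```

This trades oversubscription for sequential MKL, which is exactly the problem the next slides explore.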

SLIDE 10

Sequential MKL in SPQR

[Diagram: with sequential MKL, only TBB's threads remain, one per core across Cores 0-3.]

SLIDE 11

Sequential MKL Performance

[Chart: Performance of SPQR on a 16-core machine. Time (sec) per input matrix, Out-of-the-Box vs Sequential MKL.]

SLIDE 12

SPQR Wants to Use Parallel MKL

No task-level parallelism! Want to exploit matrix-level parallelism.

SLIDE 13

Share Resources Cooperatively

Tim Davis manually tunes the libraries to partition the resources effectively:

TBB_NUM_THREADS = 2
OMP_NUM_THREADS = 2

[Diagram: the four cores split evenly between TBB's and OpenMP's threads.]

SLIDE 14

Manually Tuned Performance

[Chart: Performance of SPQR on a 16-core machine. Time (sec) per input matrix: Out-of-the-Box, Sequential MKL, Manually Tuned.]

SLIDE 15

Manual Tuning Cannot Share Resources Effectively

[Diagram: some phases of the computation want the resources given to OpenMP, others want them given to TBB; a static manual split cannot adapt over time.]

SLIDE 16

Manual Tuning Destroys Functional Composability

[Diagram: Tim Davis's other code, LAPACK solving Ax=b on MKL/OpenMP, wants OMP_NUM_THREADS = 4; the setting tuned for SPQR doesn't carry over.]

SLIDE 17

Manual Tuning Destroys Performance Composability

[Diagram: SPQR must be retuned for each MKL version (v1, v2, v3) and for each app (1, 2, 3) it is composed with.]

SLIDE 18

Talk Roadmap

  • Problem: Efficient parallel composability is hard!
  • Solution:
      • Harts: better resource abstraction
      • Lithe: framework for sharing resources
  • Evaluation

SLIDE 19

Virtualized Threads are Bad

[Diagram: App 1 (TBB), App 1 (OpenMP), and App 2 all create virtualized threads that the OS multiplexes across Cores 0-7.]

Different codes compete unproductively for resources.

SLIDE 20

Space-Time Partitions aren’t Enough

[Diagram: the OS divides Cores 0-7 into Partition 1 (SPQR with TBB and MKL/OpenMP) and Partition 2 (App 2).]

Space-time partitions isolate different apps. But what should happen within an app?

SLIDE 21

Harts: Hardware Thread Contexts

  • Represent real hardware resources.
  • Requested, not created.
  • The OS doesn't manage harts for the app.

[Diagram: SPQR, TBB, and MKL/OpenMP run directly on harts corresponding to Cores 0-7.]

SLIDE 22

Sharing Harts

[Diagram: within a partition, Harts 0-3 are shared over time between TBB and OpenMP.]

SLIDE 23

Cooperative Hierarchical Schedulers

[Diagram: the application call graph maps onto a library (scheduler) hierarchy: Cilk at the root, with TBB, Ct, and OpenMP beneath it.]

  • Modular: each piece of the app is scheduled independently.
  • Hierarchical: a caller gives resources to its callee to execute on its behalf.
  • Cooperative: the callee gives resources back to the caller when done.

SLIDE 24

A Day in the Life of a Hart

[Timeline: a hart's day, moving through the scheduler hierarchy (Cilk at the root; TBB, Ct, OpenMP below):]

TBB Sched: next? → execute TBB task
TBB Sched: next? → execute TBB task
TBB Sched: next? → nothing left to do, give hart back to parent
Cilk Sched: next? → don't start a new task, finish an existing one first
Ct Sched: next? → ...

SLIDE 25

Standard Lithe ABI

[Diagram: a caller and callee share the function call ABI (call/return), an interface for exchanging values; a parent scheduler (e.g. TBBLithe) and a child scheduler (e.g. OpenMPLithe) share the Lithe ABI (register, unregister, request, enter, yield), an interface for sharing harts.]

  • Analogous to the function call ABI for enabling interoperable codes.
  • A mechanism for sharing harts, not policy.

SLIDE 26

Lithe Runtime

[Diagram: Lithe sits between the OS/Hardware and the scheduler hierarchy. It tracks the partition's harts and the current scheduler; TBBLithe and its child OpenMPLithe interact through register, unregister, request, enter, and yield.]

SLIDE 27

Register / Unregister

Register dynamically adds the new scheduler to the hierarchy; unregister removes it when its work is done:

matmult() {
  register(OpenMPLithe);
  ...
  unregister(OpenMPLithe);
}

[Diagram: OpenMPLithe registering with, then unregistering from, its parent TBBLithe over time.]

SLIDE 28

Request

Request asks for more harts from the parent scheduler:

matmult() {
  register(OpenMPLithe);
  request(n);
  ...
  unregister(OpenMPLithe);
}

[Diagram: the request flows from OpenMPLithe up to its parent TBBLithe.]

SLIDE 29

Enter / Yield

Enter and yield transfer additional harts between the parent and child: the parent directs a hart into the child scheduler with enter(OpenMPLithe), and the child hands a hart back with yield().

[Diagram: harts flow from TBBLithe into OpenMPLithe via enter, and back via yield, over time.]

SLIDE 30

SPQR with Lithe

[Timeline: during a matmult call, MKL's OpenMPLithe scheduler registers with TBBLithe, requests harts, receives them via enter, yields them back, and unregisters.]

SLIDE 31

SPQR with Lithe

[Timeline: the register / request / enter / yield / unregister cycle repeats for each of SPQR's matmult calls into MKL.]

SLIDE 32

Talk Roadmap

  • Problem: Efficient parallel composability is hard!
  • Solution:
      • Harts
      • Lithe
  • Evaluation

SLIDE 33

Implementation

  • Harts: simulated using pinned Pthreads on x86-Linux (~600 lines of C & assembly)
  • Lithe: user-level library providing register, unregister, request, enter, yield, ... (~2000 lines of C, C++, assembly)
  • TBBLithe: ~1500 of ~8000 relevant lines added/removed/modified
  • OpenMPLithe (GCC 4.4): ~1000 of ~6000 relevant lines added/removed/modified

SLIDE 34

No Lithe Overhead w/o Composing

  • TBBLithe performance (µbenchmarks included with the release):

                 tree sum    preorder     fibonacci
    TBBLithe     54.80 ms    228.20 ms    8.42 ms
    TBB          54.80 ms    242.51 ms    8.72 ms

  • OpenMPLithe performance (NAS parallel benchmarks):

                 conjugate gradient (cg)   LU solver (lu)   multigrid (mg)
    OpenMPLithe  57.06 s                   122.15 s         9.23 s
    OpenMP       57.00 s                   123.68 s         9.54 s

All results on Linux 2.6.18, 8-core Intel Clovertown.

SLIDE 35

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS for SPQR, input = ESOC.]

SLIDE 36

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS. Highlighted point: Sequential (TBB=1, OMP=1), 172.1 sec.]

SLIDE 37

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS. Highlighted point: Out-of-the-Box (TBB=8, OMP=8), 111.8 sec.]

SLIDE 38

Performance Characteristics of SPQR (Input = ESOC)

[Chart: Time (sec) vs NUM_OMP_THREADS. Highlighted point: Manually Tuned, 70.8 sec.]

SLIDE 39

Performance of SPQR with Lithe

[Chart: Time (sec) per input matrix: Out-of-the-Box vs Manually Tuned vs Lithe.]

SLIDE 40

Future Work

[Diagram: SPQR running on a family of Lithe-compliant schedulers: TBBLithe, OpenMPLithe, CtLithe, and CilkLithe, each exposing enter, yield, request, register, and unregister.]

SLIDE 41

Conclusion

  • Composability is essential for parallel programming to become widely adopted.
  • Parallel libraries need to share resources cooperatively.
  • Lithe project contributions:
      • Harts: a better resource model for parallel programming.
      • Lithe: enables parallel codes to interoperate by standardizing the sharing of harts.

[Diagram: SPQR with TBB and MKL/OpenMP, with resource management separated from functionality.]

SLIDE 42

Acknowledgements

We would like to thank George Necula and the rest of the Berkeley Par Lab for their feedback on this work. Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding from U.C. Discovery (Award #DIG07-10227). This work has also been supported in part by a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.