The Rocky Road To Tasking, March 21, 2019, Ivo Kabadshow, Laura Morgenstern


slide-1
SLIDE 1

The Rocky Road To Tasking

March 21, 2019 Ivo Kabadshow, Laura Morgenstern Jülich Supercomputing Centre

Member of the Helmholtz Association

slide-2
SLIDE 2

HPC ≠ HPC

[Figure: critical walltime scale running from ns, µs, ms, s, min to h, with CPU Cycle, Network Latency, High Frequency Trading, MD, Game Dev, Deep Learning, and Astrophysics placed along it.]

Requirements for MD

Strong scalability
Performance portability

Member of the Helmholtz Association March 21, 2019 Slide 1

slide-7
SLIDE 7

Our Motivation

Solving Coulomb problem for Molecular Dynamics

Task: Compute all pairwise interactions of N particles

N-body problem: O(N^2) → O(N) with FMM
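To make the O(N^2) cost of the pairwise task concrete, here is a minimal host-side C++ sketch of direct Coulomb summation. The `Particle` struct, the function name, and the unit choice (prefactor 1) are assumptions made for illustration; none of them come from the slides or from the authors' code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type for illustration: position and charge.
struct Particle {
    double x, y, z;
    double q;
};

// Direct summation of the Coulomb energy E = sum_{i<j} q_i * q_j / r_ij
// (units chosen so that the prefactor is 1). The double loop visits
// N(N-1)/2 pairs, which is the O(N^2) cost that the FMM avoids.
double coulomb_energy_direct(const std::vector<Particle>& p) {
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            assert(r > 0.0 && "particles must not coincide");
            energy += p[i].q * p[j].q / r;
        }
    }
    return energy;
}
```

Every pair is independent of every other pair, so this loop nest parallelizes easily; the price is the quadratic amount of work.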

Why is that an issue?

MD targets < 1 ms runtime per time step
MD runs millions or billions of time steps
not compute-bound, but synchronization bound
no libraries (like BLAS) to do the heavy lifting

We might have to look under the hood ... and get our hands dirty.


slide-8
SLIDE 8

Parallelization Potential

[Figure: algorithmic complexity (high to low) versus parallelization (easy to hard), showing the classical approach, O(N^2).]

Classical Approach

Lots of independent parallelism


slide-9
SLIDE 9

Parallelization Potential

[Figure: algorithmic complexity (high to low) versus parallelization (easy to hard), showing FMM, O(N), and the classical approach, O(N^2).]

Fast Multipole Method (FMM)

Many dependent phases
Varying amount of parallelism


slide-10
SLIDE 10

Coarse-Grained Parallelization

[Figure: phase sequence Input, P2M, M2M, M2L, L2L, L2P, P2P, Output, with synchronization points marked.]

Different amount of available loop-level parallelism within each phase
Some phases contain sub-dependencies
Synchronizations might be problematic
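The coarse-grained scheme can be sketched as a sequence of parallel loops separated by joins. The following host-side C++ sketch is only an illustration: the `Phase` struct, `run_phases`, and any item counts passed to it are hypothetical and not taken from the authors' code. It returns how many threads had work in each phase, which makes visible that a phase with few items leaves threads idle until the next synchronization point.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// One coarse-grained phase: a name and the number of independent loop
// iterations it offers. Any counts used with it are illustrative only.
struct Phase {
    const char* name;
    std::size_t items;
};

// Runs the phases one after another. Inside a phase the iterations are split
// across at most num_threads threads; joining them is the synchronization
// point. Returns the number of threads that had work in each phase.
std::vector<std::size_t> run_phases(
    const std::vector<Phase>& phases, std::size_t num_threads,
    const std::function<void(std::size_t, std::size_t)>& work) {
    assert(num_threads > 0);
    std::vector<std::size_t> busy;
    for (std::size_t p = 0; p < phases.size(); ++p) {
        const std::size_t items = phases[p].items;
        const std::size_t active = items < num_threads ? items : num_threads;
        std::vector<std::thread> pool;
        for (std::size_t t = 0; t < active; ++t) {
            pool.emplace_back([&work, p, items, active, t] {
                // Static round-robin distribution of the loop iterations.
                for (std::size_t i = t; i < items; i += active) work(p, i);
            });
        }
        for (std::thread& th : pool) th.join();  // barrier before the next phase
        busy.push_back(active);
    }
    return busy;
}
```

With four threads and a phase that offers only two items, two threads wait at the join. Avoiding that idle time is the motivation for tracking dependencies per task instead of per phase.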

slide-13
SLIDE 13

FMM Algorithmic Flow

Multipole to multipole (M2M), shifting multipoles upwards

πœ•

1 2 3 4 d =

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Dataflow – Fine-grained Dependencies

p2m m2m m2l l2l l2p

πœ• πœ• πœ•

slide-15
SLIDE 15

FMM Algorithmic Flow

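Multipole to local (M2L), translating remote multipoles into local Taylor moments

[Figure: tree with depths d = 1 to 4; remote multipole moments are translated into local Taylor moments μ and summed.]

Dataflow – Fine-grained Dependencies

[Figure: dataflow chain p2m, m2m, m2l, l2l, l2p, annotated with multipole moments ω and local moments μ.]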

slide-17
SLIDE 17

FMM Algorithmic Flow

Local to local (L2L), shifting Taylor moments downwards

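[Figure: tree with depths d = 1 to 4; local Taylor moments μ are shifted downwards from parent boxes and summed into their child boxes.]

Dataflow – Fine-grained Dependencies

[Figure: dataflow chain p2m, m2m, m2l, l2l, l2p, annotated with local moments μ.]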


slide-18
SLIDE 18

CPU Tasking Framework

[Figure: framework components: Core, ThreadingWrapper, Thread, Scheduler, Queue, Dispatcher, TaskFactory, LoadBalancer.]

slide-22
SLIDE 22

CPU Tasking Framework

Task life-cycle per thread

[Figure: task life-cycle linking Dispatcher, TaskFactory, LoadBalancer, Queues, and Task execution, with an edge labelled "new task".]

Tasks can be prioritized by task type
Only ready-to-execute tasks are stored in queue
Workstealing from other threads is possible
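The three properties listed above can be sketched in host-side C++ as one priority queue of ready tasks per thread plus a fallback that steals from other threads. All names here (`TaskType`, `ReadyQueue`, `next_task`) and the priority order of the task types are assumptions for illustration; the slides do not show the framework's actual interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

// Task types named after the FMM phases; the priority order (smaller value
// runs first) is an assumption made for this sketch.
enum class TaskType { P2M = 0, M2M = 1, M2L = 2, L2L = 3, L2P = 4, P2P = 5 };

struct Task {
    TaskType type;
    int id;
};

// Comparator that puts the smallest TaskType value on top of the queue.
struct LowerTypeFirst {
    bool operator()(const Task& a, const Task& b) const {
        return static_cast<int>(a.type) > static_cast<int>(b.type);
    }
};

// Per-thread queue that holds only ready-to-execute tasks.
class ReadyQueue {
public:
    void push(const Task& t) {
        std::lock_guard<std::mutex> guard(lock_);
        tasks_.push(t);
    }
    // Removes the highest-priority task; returns false if the queue is empty.
    bool try_pop(Task& out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (tasks_.empty()) return false;
        out = tasks_.top();
        tasks_.pop();
        return true;
    }

private:
    std::mutex lock_;
    std::priority_queue<Task, std::vector<Task>, LowerTypeFirst> tasks_;
};

// Fetches the next task for thread `self`: own queue first, then the queues
// of the other threads in round-robin order (work stealing).
bool next_task(std::vector<ReadyQueue>& queues, std::size_t self, Task& out) {
    if (queues[self].try_pop(out)) return true;
    for (std::size_t k = 1; k < queues.size(); ++k) {
        if (queues[(self + k) % queues.size()].try_pop(out)) return true;
    }
    return false;
}
```

In this sketch a thief takes the victim's highest-priority task. Whether stealing should take high- or low-priority work is a design choice the slides leave open.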

slide-27
SLIDE 27

Tasking Without Workstealing

[Figure: runtime in s (axis from 0.1 to 0.8) versus number of active threads (4 to 24) for the phases P2P, P2M, M2M, M2L, L2L, L2P; 103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores).]


slide-28
SLIDE 28

Tasking With Workstealing

[Figure: runtime in s (axis from 0.1 to 0.8) versus number of active threads (4 to 24) for the phases P2P, P2M, M2M, M2L, L2L, L2P; 103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores).]

slide-30
SLIDE 30

GPU Tasking

Goal

Provide same features as CPU tasking:

Static and dynamic load balancing
Priority queues
Ready-to-execute tasks


slide-31
SLIDE 31

GPU Tasking

Uniform Programming Model for CPUs and GPUs

slide-36
SLIDE 36

Pitfalls

Performance Portability

Diverse GPU programming approaches:
OpenCL
CUDA
SYCL

Our requirements:
Strong subset of C++11
Portability between GPU vendors
Tasking features
Maturity

(Intermediate) Solution

Use CUDA for reasons of performance, specific tasking features and maturity. Take the loss of not being portable out of the box.


slide-37
SLIDE 37

Pitfalls

Performance Portability

For performance portability we consider diverse GPU programming approaches: OpenCL, CUDA, SYCL

Unsatisfying (Intermediate) Solution

Use CUDA for reasons of performance and specific features. Take the loss of not being portable out of the box.


slide-38
SLIDE 38

Pitfalls

Architectural Differences

Pitfalls for Load Balancing

No thread pinning
No cache coherency

Pitfalls for Mutual Exclusion

Weak memory consistency
Missing forward progress guarantees


slide-39
SLIDE 39

Pitfalls

Load Balancing

No possibility to pin threads to streaming multiprocessors
No direct access to shared memory of other streaming multiprocessors
Work stealing requires multi-producer multi-consumer queues
→ Mechanism for mutual exclusion?
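The simplest way to obtain a multi-producer multi-consumer queue is to guard every access with one lock. The following host-side C++ sketch shows that idea with a fixed-capacity ring buffer; the class name and interface are hypothetical, and it uses `std::mutex`, which is not available in GPU device code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

// Fixed-capacity multi-producer multi-consumer queue. Every access takes the
// same lock, which is the simplest (not the fastest) way to make the queue
// safe for concurrent producers and consumers.
template <typename T, std::size_t Capacity>
class LockedMpmcQueue {
public:
    // Appends a value; returns false if the queue is full.
    bool try_enqueue(const T& value) {
        std::lock_guard<std::mutex> guard(lock_);
        if (size_ == Capacity) return false;
        slots_[(head_ + size_) % Capacity] = value;
        ++size_;
        return true;
    }
    // Removes the oldest value; returns false if the queue is empty.
    bool try_dequeue(T& out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (size_ == 0) return false;
        out = slots_[head_];
        head_ = (head_ + 1) % Capacity;
        --size_;
        return true;
    }

private:
    std::mutex lock_;
    std::array<T, Capacity> slots_{};
    std::size_t head_ = 0;  // index of the oldest element
    std::size_t size_ = 0;  // number of stored elements
};
```

The fixed capacity avoids dynamic allocation inside the queue. On a GPU the lock itself has to be built from atomic operations, which is where the pitfalls for mutual exclusion come in.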


slide-40
SLIDE 40

Pitfalls

Mutual Exclusion

Weak memory consistency
Warp-synchronous deadlocks due to lock step
How to prove thread safety?


slide-41
SLIDE 41

Pitfalls

Mutex Implementation

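class Mutex {
public:
    // Spin until this thread swaps the flag from 0 (free) to 1 (taken).
    __inline__ __device__ void lock() {
        while (atomicCAS(&mutex, 0, 1) != 0) { }
        __threadfence();  // order critical-section accesses after the acquisition
    }

    // Make writes from the critical section visible, then release the flag.
    __inline__ __device__ void unlock() {
        __threadfence();
        atomicExch(&mutex, 0);
    }

private:
    int mutex = 0;
};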
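For comparison, the same compare-and-swap spin lock can be written on the host with C++11 atomics, where acquire and release orderings take the role of the explicit fences. This is a sketch for illustration only: `SpinMutex` and `count_with_lock` are hypothetical names, and the code says nothing about forward progress on a GPU.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side counterpart of a compare-and-swap spin lock.
class SpinMutex {
public:
    void lock() {
        int expected = 0;
        // Try to swap 0 -> 1. Acquire ordering keeps the critical section
        // from being reordered before the lock is taken.
        while (!state_.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = 0;  // a failed exchange overwrote it with the observed value
        }
    }
    void unlock() {
        // Release ordering publishes the writes made inside the critical section.
        state_.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> state_{0};
};

// Increments a shared counter from several threads under the lock and
// returns the final value.
int count_with_lock(int num_threads, int increments_per_thread) {
    SpinMutex mutex;
    int counter = 0;  // deliberately non-atomic: protected only by the mutex
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&mutex, &counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                mutex.lock();
                ++counter;
                mutex.unlock();
            }
        });
    }
    for (std::thread& th : threads) th.join();
    return counter;
}
```

If the lock is correct, `count_with_lock(4, 10000)` returns 40000. A lost update would show up as a smaller value, which gives a cheap smoke test but not a proof of thread safety.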


slide-42
SLIDE 42

Very First Evaluation

Conditions

Tasking with global queue only
Measurements without work load to determine enqueue and dequeue overhead
Measurements on P100 with 56 thread blocks with 1024 threads each
Measurements on V100 with 80 thread blocks with 1024 threads each


slide-43
SLIDE 43

First Evaluation

Tasking Overhead on P100 and V100

[Figure: log-log plot of runtime in ms (10^-1 to 10^5) versus number of tasks (10^0 to 10^6) for P100 and V100.]


slide-44
SLIDE 44

GPU Tasking

Conclusion

Fine-grained task parallelism pays off on CPUs
Developed mapping between CPU and GPU concepts
(Partly) overcome pitfalls:
  Lock-based mutual exclusion
  Reusability of CPU tasking code
  Architectural differences between CPU and GPU
Successfully transferred parts of CPU tasking to GPUs


slide-45
SLIDE 45

Next Steps

Analyze and solve performance issues in dependency resolution
Use memory pool for dynamic allocations
Implement hierarchical queues
Transfer priority queue to GPU
Exploit data-parallelism through warps
Consider the use of lock-free data structures
Implement FMM based on GPU tasking


slide-46
SLIDE 46

Thank You to Our Sponsor!

NVIDIA Tesla V100 and NVIDIA Tesla P100 were provided by [sponsor logo]
