

SLIDE 1

Load Imbalance Mitigation Optimizations for GPU-Accelerated Similarity Joins

Benoit Gallet, Michael Gowanlock

benoit.gallet@nau.edu, michael.gowanlock@nau.edu

Northern Arizona University, School of Informatics, Computing and Cyber Systems

5th HPBDC Workshop, Rio de Janeiro, Brazil, May 20th, 2019

SLIDE 2

Introduction

SLIDE 3

Introduction

Given a dataset D in n dimensions

  • Similarity self-join → Find pairs of objects in D whose similarity is within a threshold (D ⋈ D)

  • Similarity defined by a predicate or a metric
SLIDE 4

Given a dataset D in n dimensions

  • Similarity self-join → Find pairs of objects in D whose similarity is within a threshold (D ⋈ D)

  • Similarity defined by a predicate or a metric

Introduction

  • Distance similarity self-join → Find pairs of objects within a distance ε

○ e.g.: Euclidean distance (D ⋈ε D)

  • Range Query: Compute distances between a query point q and its candidate points c

○ Distance similarity self-join = |D| range queries

[Figure: a range query of radius ε around a query point q]

SLIDE 5

Introduction

  • Brute force method: nested for loops (see the sketch below)

○ Complexity ≈ O(|D|²)

  • Use an indexing method to prune the search space

○ Complexity ≈ between O(|D| · log |D|) and O(|D|²)
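As an illustration (our own minimal C++ sketch, not the authors' code), the brute-force join is just two nested loops over D; the index-based methods below exist to avoid exactly this quadratic scan:

    // Brute-force distance similarity self-join: O(|D|^2) pairwise tests.
    // 2-D points and a tiny hard-coded dataset, for illustration only.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Point { double x, y; };

    int main() {
        std::vector<Point> D = {{0.0, 0.0}, {0.1, 0.1}, {5.0, 5.0}};
        const double eps = 0.2;
        for (size_t i = 0; i < D.size(); ++i)            // query points
            for (size_t j = i + 1; j < D.size(); ++j) {  // candidate points
                double dx = D[i].x - D[j].x, dy = D[i].y - D[j].y;
                if (std::sqrt(dx * dx + dy * dy) <= eps) // Euclidean distance
                    std::printf("pair (%zu, %zu)\n", i, j);
            }
        return 0;
    }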

SLIDE 6

Introduction

  • Brute force method: nested for loops

○ Complexity ≈ O(|D|²)

  • Use an indexing method to prune the search space

○ Complexity ≈ between O(|D| · log |D|) and O(|D|²)

  • Hierarchical structures

○ R-Tree, X-Tree, k-D Tree, B-Tree, etc.

  • Non-hierarchical structures

○ Grids, space filling curves, etc.

SLIDE 7

Introduction

  • Brute force method: nested for loops

○ Complexity ≈ O(|D|²)

  • Use an indexing method to prune the search space

○ Complexity ≈ between O(|D| · log |D|) and O(|D|²)

  • Hierarchical structures

○ R-Tree, X-Tree, k-D Tree, B-Tree, etc.

  • Non-hierarchical structures

○ Grids, space filling curves, etc.

  • Some better for high dimensions, some better for low dimensions
  • Some better for the CPU, some better for the GPU

○ Recursion, branching, size, etc.

SLIDE 8

Background

SLIDE 9

Background

Reasons to use a GPU

  • Range queries are independent

○ Can be performed in parallel

  • Many memory operations

○ Benefits from high-bandwidth memory on the GPU

SLIDE 10

Background

Reasons to use a GPU

  • Range queries are independent

○ Can be performed in parallel

  • Many memory operations

○ Benefits from high-bandwidth memory

  • Many cores, high memory bandwidth

○ Intel Xeon E7-8894 v4 → 24 physical cores, up to 85 GB/s memory bandwidth
○ Nvidia Tesla V100 → 5,120 CUDA cores, up to 900 GB/s memory bandwidth

→ The GPU is well suited for this type of application

SLIDE 11

Background

However,

  • Limited global memory size*

○ 512 GB of RAM per node (256 GB per CPU)
○ 96 GB of GPU memory per node (16 GB per GPU)

  • Slow Host / Device communication bandwidth

○ 16 GB/s for PCIe 3.0
○ Known as a major bottleneck

  • High chance of uneven workload between points

○ Uneven computation time between threads

→ Necessary to consider these potential issues

* Specs of the Summit supercomputer (ranked #1 in the TOP500), https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/

SLIDE 12

Background

Leverage work from previous contribution [1]

  • Batching scheme (see the sketch below)

○ Splits the computation into smaller executions
○ Avoids memory overflow
○ Overlaps computation with memory transfers

[1] M. Gowanlock and B. Karsin, “GPU Accelerated Self-join for the Distance Similarity Metric,” IEEE High-Performance Big Data, Deep Learning, and Cloud Computing, in Proc. of the 2018 IEEE Intl. Parallel and Distributed Processing Symposium Workshops, pp. 477–486, 2018.
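A hedged sketch of how such a batching scheme can be laid out with CUDA streams (processBatch, numBatches, and the double-buffered d_result are our assumptions, not the paper's code): each batch's device-to-host copy overlaps the next batch's kernel, and the result buffers keep a fixed size so global memory never overflows.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Stand-in for the real join kernel: writes one value per result slot.
    __global__ void processBatch(int* result, int batch, int batchSize) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < batchSize)
            result[tid] = batch * batchSize + tid;
    }

    int main() {
        const int numBatches = 8, batchSize = 1 << 20;
        int* d_result[2];   // double buffer on the device
        int* h_result;      // pinned host memory, required for async copies
        cudaMalloc(&d_result[0], batchSize * sizeof(int));
        cudaMalloc(&d_result[1], batchSize * sizeof(int));
        cudaMallocHost(&h_result, (size_t)numBatches * batchSize * sizeof(int));

        cudaStream_t streams[2];
        cudaStreamCreate(&streams[0]);
        cudaStreamCreate(&streams[1]);

        for (int b = 0; b < numBatches; ++b) {
            cudaStream_t st = streams[b % 2];
            // batch b's kernel can run while batch b-1's copy is in flight
            processBatch<<<(batchSize + 255) / 256, 256, 0, st>>>(
                d_result[b % 2], b, batchSize);
            cudaMemcpyAsync(h_result + (size_t)b * batchSize, d_result[b % 2],
                            batchSize * sizeof(int), cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();
        std::printf("last result: %d\n",
                    h_result[(size_t)numBatches * batchSize - 1]);
        return 0;
    }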

SLIDE 13

Background

Leverage work from previous contribution [1]

  • Batching scheme

○ Splits the computation into smaller executions
○ Avoids memory overflow
○ Overlaps computation with memory transfers

  • Grid indexing (see the sketch below)

○ Cells of size εⁿ
○ Only indexes non-empty cells
○ Bounds the search to the 3ⁿ adjacent cells
○ Threads check the same cell in lockstep
  ■ Reduces divergence


[1] M. Gowanlock and B. Karsin, “GPU Accelerated Self-join for the Distance Similarity Metric,” IEEE High-Performance Big Data, Deep Learning, and Cloud Computing, in Proc. of the 2018 IEEE Intl. Parallel and Distributed Processing Symposium Workshops, pp. 477–486, 2018.

[Figure: grid of ε-sized cells around query point q]
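A minimal 2-D sketch of the grid lookup (cellOf, linearId and cellsPerRow are our illustrative names, not the paper's): a point falls into the ε-sided cell containing it, and a range query only visits the 3² = 9 adjacent cells.

    #include <cmath>
    #include <cstdio>

    struct Cell { int cx, cy; };

    // Cell of side eps containing point (x, y).
    Cell cellOf(double x, double y, double eps) {
        return { (int)std::floor(x / eps), (int)std::floor(y / eps) };
    }

    // Row-major linear id, used to index only the non-empty cells.
    long linearId(Cell c, long cellsPerRow) {
        return (long)c.cy * cellsPerRow + c.cx;
    }

    int main() {
        const double eps = 0.2;
        const long cellsPerRow = 1024;         // assumed grid width
        Cell q = cellOf(0.45, 0.33, eps);
        for (int dy = -1; dy <= 1; ++dy)       // the 3^2 neighboring cells
            for (int dx = -1; dx <= 1; ++dx) {
                Cell c = { q.cx + dx, q.cy + dy };
                std::printf("visit cell (%d, %d), linear id %ld\n",
                            c.cx, c.cy, linearId(c, cellsPerRow));
            }
        return 0;
    }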

SLIDE 14

Background

Leverage work from previous contribution [1]

  • Batching scheme

○ Splits the computation into smaller executions
○ Avoids memory overflow
○ Overlaps computation with memory transfers

  • Grid indexing

○ Cells of size εⁿ
○ Only indexes non-empty cells
○ Bounds the search to the 3ⁿ adjacent cells
○ Threads check the same cell in lockstep
  ■ Reduces divergence


[1] M. Gowanlock and B. Karsin, “GPU Accelerated Self-join for the Distance Similarity Metric,” IEEE High-Performance Big Data, Deep Learning, and Cloud Computing, in Proc. of the 2018 IEEE Intl. Parallel and Distributed Processing Symposium Workshops, pp. 477–486, 2018.

[Figure: grid index around q, with the pruned search space highlighted]

SLIDE 15

Background

Leverage work from previous contribution [1]

  • Unidirectional Comparison: Unicomp

○ Euclidean distance is a symmetric function
○ For p, q ∈ D, distance(p, q) = distance(q, p)
○ Only look at some of the neighboring cells
→ Only computes each distance once

[1] M. Gowanlock and B. Karsin, “GPU Accelerated Self-join for the Distance Similarity Metric,” IEEE High-Performance Big Data, Deep Learning, and Cloud Computing, in Proc. of the 2018 IEEE Intl. Parallel and Distributed Processing Symposium Workshops, pp. 477–486, 2018.

SLIDE 16

Background

Leverage work from previous contribution [1]

  • Unidirectional Comparison: Unicomp

○ Euclidean distance is a symmetric function
○ For p, q ∈ D, distance(p, q) = distance(q, p)
○ Only look at some of the neighboring cells
→ Only computes each distance once

  • GPU Kernel (see the sketch below)

○ Computes the ε-neighborhood of each query point
○ A thread is assigned a single query point
○ |D| threads in total

[1] M. Gowanlock and B. Karsin, “GPU Accelerated Self-join for the Distance Similarity Metric,” IEEE High-Performance Big Data, Deep Learning, and Cloud Computing, in Proc. of the 2018 IEEE Intl. Parallel and Distributed Processing Symposium Workshops, pp. 477–486, 2018.
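A hedged CUDA sketch of that kernel shape: one thread per query point, |D| threads in total. The candidate arrays (candStart, candEnd, candIdx) stand in for the grid lookup and are our assumption; the real kernel stores result pairs rather than counting them.

    // Thread i owns query point i and scans its candidate set; distances are
    // compared squared to avoid the sqrt. eps2 = epsilon * epsilon.
    __global__ void rangeQueryKernel(const double2* points, int n, double eps2,
                                     const int* candStart, const int* candEnd,
                                     const int* candIdx,
                                     unsigned long long* pairCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;                    // one thread per query point
        double2 q = points[i];
        for (int c = candStart[i]; c < candEnd[i]; ++c) {
            double2 p = points[candIdx[c]];
            double dx = q.x - p.x, dy = q.y - p.y;
            if (dx * dx + dy * dy <= eps2)     // within the ε-neighborhood
                atomicAdd(pairCount, 1ULL);    // stand-in for storing the pair
        }
    }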

SLIDE 17

Issue

  • Depending on data characteristics → different workload between threads

○ SIMT architecture of the GPU → threads executed in groups of 32 (warps)
○ Different workloads → idle time for some of the threads within a warp

SLIDE 18

Optimizations

  • Range Query Granularity Increase
  • Cell Access Pattern
  • Local and Global Load Balancing
  • Warp Execution Scheduling
SLIDE 19

Range Query Granularity Increase: k > 1

  • Original kernel → 1 thread per query point

[Figure: thread tid0 scans all candidates c0–c7 of query point q0]

SLIDE 20

Range Query Granularity Increase: k > 1

  • Original kernel → 1 thread per query point
  • Use multiple threads per query point (see the sketch below)

○ Each thread assigned to the query point q computes a fraction of the candidate points c
○ k = number of threads assigned to each query point

[Figure: with k = 1, thread tid0 scans candidates c0–c7 of q0; with k = 2, threads tid0 and tid1 split them]
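A hedged variant of the baseline kernel with k threads per query point (same assumed candidate arrays as before): the k threads stride through the candidate list, so a heavy query point no longer serializes on a single thread. It would be launched with n × k threads instead of n.

    __global__ void rangeQueryKernelK(const double2* points, int n, int k,
                                      double eps2, const int* candStart,
                                      const int* candEnd, const int* candIdx,
                                      unsigned long long* pairCount) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int i    = tid / k;   // query point shared by k consecutive threads
        int lane = tid % k;   // this thread's offset into the candidate list
        if (i >= n) return;
        double2 q = points[i];
        for (int c = candStart[i] + lane; c < candEnd[i]; c += k) {
            double2 p = points[candIdx[c]];
            double dx = q.x - p.x, dy = q.y - p.y;
            if (dx * dx + dy * dy <= eps2)
                atomicAdd(pairCount, 1ULL);
        }
    }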

SLIDE 21

Cell Access Pattern: Lid-Unicomp

  • Unidirectional comparison (Unicomp)

○ Potential load imbalance between cells

SLIDE 22

Cell Access Pattern: Lid-Unicomp

  • Unidirectional comparison (Unicomp)

○ Potential load imbalance between cells

  • Linear ID unidirectional comparison (Lid-Unicomp), see the sketch below

○ Based on cells’ linear id
○ Compare to cells with a greater linear id
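The cell-level test then reduces to a comparison of linear ids; a minimal sketch (our formulation) of the rule, which relies on distance(p, q) = distance(q, p):

    // A home cell only processes a neighboring cell whose linear id is not
    // smaller than its own; that neighbor owns the opposite direction, so
    // each cross-cell pair of points is evaluated exactly once.
    bool shouldProcess(long homeId, long neighborId) {
        return neighborId >= homeId;   // equality keeps the home cell itself
    }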

SLIDE 23

Local and Global Load Balancing: SortByWL

  • Sort the points from most to least workload (see the sketch below)

○ Reduces intra-warp load imbalance
○ Reduces block-level load imbalance
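A hedged Thrust sketch of SortByWL (our names; the workload key could be, e.g., each point's candidate count): sorting descending by workload means neighboring threads, and therefore warps, receive similar-sized range queries.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/functional.h>

    // Sorts the permutation pointIdx of D from most to least workload,
    // using each point's workload estimate as the key.
    void sortByWorkload(thrust::device_vector<int>& workload,
                        thrust::device_vector<int>& pointIdx) {
        thrust::sort_by_key(workload.begin(), workload.end(),
                            pointIdx.begin(), thrust::greater<int>());
    }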

SLIDE 24

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue (see the sketch below)

○ Each thread atomically takes the available point with the most work

[Figure: D, the original dataset (points 1, 2, 3, 4, 5, …, 1663, 1664)]
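A hedged CUDA sketch of the queue (sortedIdx is D′ as an index array; names are ours): a single global counter is bumped atomically, so the i-th thread to reach it takes the i-th heaviest remaining point, which is exactly the walkthrough on the next slides.

    __device__ unsigned int queueCounter = 0;   // reset from the host per batch

    __global__ void workQueueKernel(const int* sortedIdx, int n) {
        // atomically claim the next unprocessed (heaviest) point in D'
        unsigned int slot = atomicAdd(&queueCounter, 1u);
        if (slot >= (unsigned int)n) return;
        int queryPoint = sortedIdx[slot];
        // ... run the range query for queryPoint as in the baseline kernel ...
        (void)queryPoint;
    }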

SLIDE 25

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: D′, the original dataset sorted by workload (37, 8, 128, …, 27)]

SLIDE 26

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = 1; thread 1 ← D′[counter]; counter ← counter + 1]

Thread i → ith thread to be executed

SLIDE 27

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = 2; thread 2 ← D′[counter]; counter ← counter + 1]

Thread i → ith thread to be executed

SLIDE 28

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = 3; thread 3 ← D′[counter]; counter ← counter + 1]

Thread i → ith thread to be executed

SLIDE 29

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = 32; thread 32 ← D′[counter]; counter ← counter + 1]

Thread i → ith thread to be executed

SLIDE 30

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = 33; thread 33 ← D′[counter]; counter ← counter + 1; warp a done]

Thread i → ith thread to be executed

SLIDE 31

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = 64; thread 64 ← D′[counter]; counter ← counter + 1; warp a done]

Thread i → ith thread to be executed

SLIDE 32

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: threads continue draining D′; warps a and b done]

Thread i → ith thread to be executed

SLIDE 33

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = |D′| − 32; thread |D′| − 32 ← D′[counter]; counter ← counter + 1; warps a and b done]

Thread i → ith thread to be executed

SLIDE 34

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = |D′|; thread |D′| ← D′[counter]; counter ← counter + 1]

Thread i → ith thread to be executed

SLIDE 35

Warp Execution Scheduling: WorkQueue

  • Sorting points does not guarantee their execution order

○ GPU’s physical scheduler

  • Force warp execution order with a work queue

○ Each thread atomically takes the available point with the most work

[Figure: counter = |D′|; the queue is empty]

Computation done

SLIDE 36

Warp Execution Scheduling: WorkQueue

  • Use a work queue to assign query points to threads

○ Ensures a similar workload within a warp
○ Ensures a similar workload within a batch

Further explained during the poster session of the IPDPS PhD Forum, Wednesday

SLIDE 37

Experimental Evaluation

SLIDE 38

Experimental Evaluation

  • Uniformly and exponentially distributed synthetic datasets

○ 2 to 6 dimensions
○ 2M points
○ Represent uniform and very different workloads

  • Real world datasets

○ Space Weather: 2 and 3 dimensions, 1.86M and 5M points
○ 50M points from the Gaia catalog

  • Platform used

○ 2 × Intel Xeon E5-2620 v4 @ 2.10 GHz (16 physical cores) + 128 GB of RAM
○ Nvidia Quadro GP100 (16 GB of global memory)

SLIDE 39

Experimental Evaluation

  • Uniformly and exponentially distributed synthetic datasets

○ 2 to 6 dimensions
○ 2M points
○ Represent uniform and very different workloads

  • Real world datasets

○ Space Weather: 2 and 3 dimensions, 1.86M and 5M points
○ 50M points from the Gaia catalog

  • Platform used

○ 2 × Intel Xeon E5-2620 v4 @ 2.10 GHz (16 physical cores) + 128 GB of RAM
○ Nvidia Quadro GP100 (16 GB of global memory)

  • Code in C/C++ and CUDA, compiled with the -O3 flag
  • GPU implementations: 256 threads per block, 64-bit floating point values

○ GPUCalcGlobal: original GPU kernel from previous work

  • CPU implementation: 16 threads, 32-bit floating point values

○ Super-EGO: state-of-the-art parallel CPU algorithm

SLIDE 40

Experimental Evaluation

  • Metrics used: time and warp execution efficiency (WEE)

○ Time: execution time of the application
  ■ Includes memory allocations, transfers, computation, etc.
  ■ Does not include index construction time (not the focus of this work)
○ Warp execution efficiency
  ■ Average percentage of active threads in each executed warp
  ■ Increasing it increases utilization of the GPU’s compute resources
  ■ Lowered by divergent branches
  ■ Good indicator of workload balancing: a higher percentage means more similar workloads (formalized below)
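As a rough formalization (our notation, not a formula from the slides), over W executed warps:

    \mathrm{WEE} \;=\; \frac{100\%}{W} \sum_{w=1}^{W} \frac{\text{active threads in warp } w}{32}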
SLIDE 41

Experimental Evaluation: k = 8

  • Comparison between k = 1 and k = 8 using GPUCalcGlobal

[Figure: warp execution efficiency on Expo2D2M, ε = 0.2; k = 1 gives WEE = 26.5% (color scale 0%–100%)]

SLIDE 42

Experimental Evaluation: k = 8

  • Comparison between k = 1 and k = 8 using GPUCalcGlobal

[Figure: warp execution efficiency on Expo2D2M, ε = 0.2; k = 1 gives WEE = 26.5%, k = 8 gives WEE = 40.8% (color scale 0%–100%)]

SLIDE 43

Experimental Evaluation: Lid-Unicomp

  • Comparison between GPUCalcGlobal, Unicomp and Lid-Unicomp

○ Note: Unicomp and Lid-Unicomp perform half the distance calculations of GPUCalcGlobal

[Figure: Expo6D2M, ε = 1.2; GPUCalcGlobal WEE = 15.2%, Unicomp WEE = 7.8% (color scale 0%–100%)]

SLIDE 44

Experimental Evaluation: Lid-Unicomp

  • Comparison between GPUCalcGlobal, Unicomp and Lid-Unicomp

○ Note: Unicomp and Lid-Unicomp perform half the distance calculations of GPUCalcGlobal

[Figure: Expo6D2M, ε = 1.2; GPUCalcGlobal WEE = 15.2%, Unicomp WEE = 7.8%, Lid-Unicomp WEE = 10% (color scale 0%–100%)]

SLIDE 45

Experimental Evaluation: SortByWL and WorkQueue

  • Comparison between GPUCalcGlobal, SortByWL and WorkQueue

[Figure: Expo2D2M, ε = 0.2; GPUCalcGlobal WEE = 26.5% (color scale 0%–100%)]

SLIDE 46

Experimental Evaluation: SortByWL and WorkQueue

  • Comparison between GPUCalcGlobal, SortByWL and WorkQueue

[Figure: Expo2D2M, ε = 0.2; GPUCalcGlobal WEE = 26.5%, SortByWL WEE = 74.6% (color scale 0%–100%)]

SLIDE 47

Experimental Evaluation: SortByWL and WorkQueue

  • Comparison between GPUCalcGlobal, SortByWL and WorkQueue

[Figure: Expo2D2M, ε = 0.2; GPUCalcGlobal WEE = 26.5%, SortByWL WEE = 74.6%, WorkQueue WEE = 83.2% (color scale 0%–100%)]

SLIDE 48

Experimental Evaluation: Combination

  • Comparison between GPUCalcGlobal, Super-EGO, WorkQueue, WorkQueue + Lid-Unicomp, WorkQueue + k = 8 and WorkQueue + Lid-Unicomp + k = 8

[Figure: SW3DA, ε = 2.4; GPUCalcGlobal WEE = 26.5% (color scale 0%–100%)]

SLIDE 49

Experimental Evaluation: Combination

  • Comparison between GPUCalcGlobal, Super-EGO, WorkQueue, WorkQueue + Lid-Unicomp, WorkQueue + k = 8 and WorkQueue + Lid-Unicomp + k = 8

[Figure: SW3DA, ε = 2.4; GPUCalcGlobal WEE = 26.5%, WorkQueue + Lid-Unicomp WEE = 93.4% (color scale 0%–100%)]

SLIDE 50

Experimental Evaluation: Combination

  • Comparison between GPUCalcGlobal, Super-EGO, WorkQueue, WorkQueue + Lid-Unicomp, WorkQueue + k = 8 and WorkQueue + Lid-Unicomp + k = 8

[Figure: SW3DA, ε = 2.4; GPUCalcGlobal WEE = 26.5%, WorkQueue + Lid-Unicomp WEE = 93.4%, WorkQueue + Lid-Unicomp + k = 8 WEE = 83.2% (color scale 0%–100%)]

SLIDE 51

Experimental Evaluation: Speedup

  • Speedup of all our optimizations combined versus (a) Super-EGO and (b) GPUCalcGlobal

○ (a) vs. Super-EGO: avg. = 2.5×, max = 10.7×
○ (b) vs. GPUCalcGlobal: avg. = 1.6×, max = 9.7×

[Figure: speedup plots against Super-EGO and GPUCalcGlobal]

SLIDE 52

Conclusion and Future Work

SLIDE 53

Conclusion

  • Intra-warp and inter-warp load balancing improves performance

○ Similar workloads in a warp
  ■ Fewer idling threads
○ Similar workloads between warps
  ■ Less waiting for the last executing warp

  • May be used for other algorithms with data-dependent performance characteristics

  • High warp execution efficiency improves the GPU’s utilization

○ May indicate one of the potential boundaries for further performance optimizations (cannot go beyond 32 active threads out of 32)
SLIDE 54

Future Work

  • Improve Lid-Unicomp execution

○ Currently iterates over every neighboring cell, then checks the linear id
→ Remove unnecessary loop iterations

  • Improve the work queue

○ Memory allocation is sized for the first, large batches
○ Many small batches towards the end of the computation
→ Group the last batches together

  • Complete this work with a parallel CPU implementation

→ Split the work between the CPU and the GPU