GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC - PowerPoint PPT Presentation



SLIDE 1

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

MIKE GOWANLOCK

NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS

BEN KARSIN

UNIVERSITY OF HAWAII AT MANOA DEPARTMENT OF INFORMATION AND COMPUTER SCIENCES

SLIDE 2

THE SELF-JOIN

  • The self-join is a fundamental operation in databases
  • Find all objects within a threshold distance of each other
  • Range queries around each point
  • A table joined onto itself with a distance similarity predicate
SLIDE 3

APPLICATIONS

  • Building blocks of many clustering algorithms (e.g., DBSCAN)
  • Can be used for kNN
  • Time series data analysis
  • Mining spatial association rules
  • Many other applications/algorithms
SLIDE 4

NESTED LOOP JOIN (BRUTE FORCE)

  • Two nested for loops
  • Each point performs a distance calculation between itself and the other points
  • O(n^2)
  • Performs well in high dimensionality

Example: 3 points, 9 total distance calculations

  • 3 distance calculations between a point and itself
  • 6 distance calculations between distinct points
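As a concrete illustration of the nested-loop approach, the brute-force self-join can be sketched in Python (the paper's actual implementation is C++/CUDA, so all names here are illustrative only):

```python
import math

def brute_force_self_join(points, eps):
    """Naive O(n^2) self-join: every point performs a distance
    calculation against every point, including itself."""
    pairs = []
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if math.dist(p, q) <= eps:  # Euclidean distance, any dimension
                pairs.append((i, j))
    return pairs

# 3 points -> 9 distance calculations in total:
# 3 between a point and itself, 6 between distinct points
pts = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
result = brute_force_self_join(pts, eps=1.0)
```

Note that the loop count depends only on the dataset size, not the dimensionality, which is why brute force holds up comparatively well in high dimensions.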

SLIDE 5

LITERATURE ON THE (SELF-) JOIN

Low-dimensionality:

  • Denser data: many points within a search radius
  • Major focus on indexing techniques: reduce the number of distance calculations needed to find points within a search radius
  • Challenge: large result sets and number of distance comparisons

High-dimensionality:

  • Sparser data: fewer points within a search radius
  • Indexing techniques do not perform well; the curse of dimensionality means the index cannot discriminate between points in the high-dimensional space
  • Challenge: large fraction of time spent searching for candidate points that may be within the distance

SLIDE 6

LITERATURE ON THE (SELF-) JOIN


The literature is (mostly) split between low- and high-dimensional contributions

  • We focus on the low-dimensional case
  • Dimensions 2-6
SLIDE 7
PERFORMANCE EXAMPLE

  • Using an R-tree index, with a fixed distance epsilon = 1
  • 2 million points
  • The response times are greatest at 2-D and 6-D
  • The number of neighbors decreases to 0 with dimension
  • At 2-D the higher response time is due to many distance calculations
  • At 6-D the higher response time is due to more exhaustive index searches

SLIDE 8

UTILIZING THE GPU

  • GPUs have thousands of cores
  • High memory bandwidth: 700 GB/s on Pascal, 900 GB/s on Volta
  • CPU main memory bandwidth: ~100 GB/s
  • Overall: the GPU's high memory bandwidth makes it an attractive alternative to the CPU

SLIDE 9

UTILIZING THE GPU

  • CPU-based self-joins are often characterized by an irregular instruction flow:
  • Spatial index searches use tree traversals
  • Insights into spatial indexes for the GPU:
  • J. Kim, W.-K. Jeong, and B. Nam, "Exploiting massive parallelism for indexing multi-dimensional datasets on the GPU," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2258–2271, 2015.
  • J. Kim and B. Nam, "Co-processing heterogeneous parallel index for multi-dimensional datasets," JPDC, vol. 113, pp. 195–203, 2018.

SLIDE 10

UTILIZING THE GPU


The 2015 paper implemented an R-tree on the GPU that avoided some of the drawbacks of the SIMD architecture

SLIDE 11

UTILIZING THE GPU


The 2018 paper showed it is best to implement the traversal of the internal nodes (branching) on the CPU and the scan of the data elements in the leaf nodes on the GPU
SLIDE 12

UTILIZING THE GPU: TAKEAWAY

  • Due to the GPU's SIMT architecture, branching can significantly reduce performance when performing tree traversals
  • It is better to have a bounded search, where all threads take the same or very similar execution pathways

SLIDE 13
GRID INDEX

  • A grid is constructed with cells of length epsilon
  • Point search: for each point in the dataset, the points within epsilon can be found by checking adjacent grid cells and performing distance calculations to the points in these cells
  • Adjacent cells: 3^n, where n is the number of dimensions

[Figure: searched point, data points, and the ε-bounded search space (9 cells in 2-D)]

SLIDE 14
GRID INDEX

  • The search for the nearby points is bounded to the adjacent cells
  • No branching like spatial tree-based indexes
  • Points within the same grid cell will return the same cells
  • Reduces thread divergence
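A minimal sketch of the ε-grid search described above (illustrative Python, not the authors' CUDA kernel; names are assumptions): each point's neighbors are found by scanning only the 3^n adjacent cells.

```python
import itertools
import math
from collections import defaultdict

def grid_self_join(points, eps):
    """Find all pairs within eps by binning points into cells of
    side length eps and scanning only the 3^n adjacent cells."""
    dim = len(points[0])
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(c / eps) for c in p)].append(idx)
    pairs = []
    for i, p in enumerate(points):
        cell = tuple(math.floor(c / eps) for c in p)
        # Bounded search: 3^n cells, the same pattern for every point
        for offset in itertools.product((-1, 0, 1), repeat=dim):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for j in grid.get(neighbor, ()):
                if i != j and math.dist(p, points[j]) <= eps:
                    pairs.append((i, j))
    return pairs
```

Because every thread (here, loop iteration) walks the same fixed 3^n-cell pattern, the execution pathways are uniform, which is what makes this structure a good fit for the GPU's SIMT model.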

SLIDE 15

SPACE EFFICIENT GRID INDEX

  • Store non-empty grid cells
  • Use a series of lookup arrays
  • Space complexity: O(|D|), where |D| is the number of data points in the dataset
  • In practice, in our experiments: a few MiB

[Figure: lookup arrays for the space-efficient grid. B stores the ids of the non-empty cells; G stores, for each non-empty cell h, the cell id C_h and the index range A_h^min to A_h^max into the point-index array A; A maps into the point array D. In this example |G| = 11 non-empty cells.]
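A sketch of this layout (illustrative Python; the array names B, G, and A follow the figure, everything else is an assumption): only non-empty cells are stored, and each maps to a contiguous range of a flat point-index array, so space is O(|D|).

```python
import math
from collections import defaultdict

def build_compact_grid(points, eps):
    """Space-efficient grid: B holds the ids of non-empty cells,
    G holds each cell's [start, end) range into A, and A holds
    point indices into the dataset (each point appears exactly once)."""
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[tuple(math.floor(c / eps) for c in p)].append(idx)
    B = sorted(buckets)            # non-empty cell ids (binary-searchable)
    G, A = [], []
    for cell in B:
        start = len(A)
        A.extend(buckets[cell])
        G.append((start, len(A)))  # A_min, A_max range for this cell
    return B, G, A
```

A lookup for a neighboring cell then binary-searches B and reads the corresponding range of A; empty cells cost no storage at all.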

SLIDE 16
RESULT SET SIZES

  • The self-join will generate large amounts of data that increase with:
  • epsilon
  • Size of the dataset
  • Point density distribution of the dataset
  • Large overdensities increase the total number of neighbors
  • Need an efficient batching scheme to overlap computation and communication between the host and GPU

SLIDE 17

EXAMPLE BATCHING SCHEME ILLUSTRATION

[Figure: the host issues kernel arguments for batch 3 while the GPU processes batch 2 and the results of batch 1 are returned]

  • We use a minimum of 3 batches/kernel executions
  • Hide data transfers: overlap the kernel execution on the GPU with data transfer of the result sets

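One plausible way to derive the batch count, sketched in Python (hypothetical helper; the slides only specify a minimum of 3 batches so kernel execution and transfers can overlap): size each batch so its estimated result set fits a fixed result buffer.

```python
def make_batches(num_points, estimated_total_results, buffer_size):
    """Split query points into >= 3 batches whose estimated result
    sets each fit in a fixed-size result buffer, so the GPU can
    process batch k while batch k-1's results transfer to the host."""
    avg_per_point = max(1, estimated_total_results // num_points)
    points_per_batch = max(1, buffer_size // avg_per_point)
    num_batches = max(3, -(-num_points // points_per_batch))  # ceil, >= 3
    # Strided assignment spreads dense regions across batches
    return [list(range(b, num_points, num_batches)) for b in range(num_batches)]
```

The strided assignment is an assumption made here to balance overdense regions across batches; the key property is simply that at least 3 batches exist so one can be computed while another's results are in flight.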
SLIDE 18

BASELINE GLOBAL MEMORY GPU KERNEL

  • Global memory kernel: one thread per point
  • Number of GPU threads is the same as the dataset size
  • A thread/point searches adjacent non-empty cells
  • If a cell is non-empty, the thread computes the distance between its point and all points in the cell

SLIDE 19

AVOIDING DUPLICATE DISTANCE CALCULATIONS: UNIDIRECTIONAL COMPARISON

  • We do not need to compute the distances between all pairs of points
  • One nice property of the grid is that, as the space is divided evenly, we can reduce the number of distance comparisons between points

Example: these two points will find each other within epsilon. However, we can perform one distance calculation and record both results.

SLIDE 20

UNIDIRECTIONAL COMPARISON: TWO DIMENSIONS

  • We begin by comparing cells based on the first dimension (x-coordinate)
  • Look at cells that share the same y-coordinate

[Figure: 6×6 grid of cells]

SLIDE 21

UNIDIRECTIONAL COMPARISON: TWO DIMENSIONS

  • We begin by comparing cells based on the first dimension (x-coordinate)
  • Look at cells that share the same y-coordinate
  • If the x-coordinate is odd, then we compare all points within the odd cell to the adjacent even-coordinate cells
  • If the x-coordinate is even, we do nothing

[Figure: 6×6 grid of cells]

SLIDE 22

UNIDIRECTIONAL COMPARISON: TWO DIMENSIONS

  • We then compare neighbors with a different y-coordinate

[Figure: 6×6 grid of cells]

SLIDE 23

UNIDIRECTIONAL COMPARISON: TWO DIMENSIONS

  • We then compare neighbors with a different y-coordinate
  • If the y-coordinate is odd, then we compare against the adjacent cells
  • If the y-coordinate is even, then we do nothing

[Figure: 6×6 grid of cells]

SLIDE 24

UNIDIRECTIONAL COMPARISON: TWO DIMENSIONS

  • In two dimensions, this is the final pattern of comparisons
  • Note that in some cells, there are no searches originating from the cells

[Figure: 6×6 grid of cells]

SLIDE 25

UNIDIRECTIONAL COMPARISON: TWO DIMENSIONS

  • In two dimensions, this is the final pattern of comparisons
  • Note that in some cells, there are no searches originating from the cells
  • In other cells, fewer than 9 adjacent cells are searched

[Figure: 6×6 grid of cells]

SLIDE 26

UNIDIRECTIONAL COMPARISON: TWO DIMENSIONS

  • In two dimensions, this is the final pattern of comparisons
  • Note that in some cells, there are no searches originating from the cells
  • In other cells, fewer than 8 adjacent cells are searched
  • Result: only half of the cells need to be searched on average
  • Reduction in searches
  • Reduction in point comparisons
  • Half of the work (with additional overhead for executing the pattern)

[Figure: 6×6 grid of cells]
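The odd/even pattern above can be sketched for 2-D as follows (illustrative Python; the originating-cell rules follow the slides, the helper name is ours):

```python
import itertools

def unicomp_targets(cell):
    """Cells that a 2-D cell compares against under the unidirectional
    pattern: odd x-coordinates compare along x (same y), and odd
    y-coordinates compare against the full rows above and below.
    Every unordered pair of adjacent cells is covered exactly once."""
    x, y = cell
    targets = [cell]  # points within a cell are always compared
    if x % 2 == 1:    # dimension 1: odd x vs adjacent even-x cells, same y
        targets += [(x - 1, y), (x + 1, y)]
    if y % 2 == 1:    # dimension 2: odd y vs the rows y-1 and y+1
        targets += [(nx, ny) for ny in (y - 1, y + 1)
                             for nx in (x - 1, x, x + 1)]
    return targets
```

On average each cell originates searches into half of its 9-cell neighborhood, matching the claimed halving of searches and point comparisons.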
SLIDE 27

UNIDIRECTIONAL COMPARISON: 3 DIMENSIONS

  • Left: same pattern as 2-D; cells in the odd z-coordinate are compared to adjacent cells
  • Right: the 9 cells shaded in blue are compared with the cell with z-coordinate 1 (red outline)

[Figure: 4×4 grids at z = 1 and z = 2 illustrating the 3-D comparison pattern]

SLIDE 28

UNIDIRECTIONAL COMPARISON

  • The pattern can be applied to n dimensions
  • Reduces the number of cells searched and point comparisons by half

SLIDE 29

EXPERIMENTAL EVALUATION

  • Experiments performed on up to 32 cores of Intel E5-2683 v4 2.1 GHz CPUs
  • Comparison implementations:
  • CPU-only sequential algorithm that uses an R-tree index
  • GPU brute force
  • State-of-the-art: parallel SuperEGO algorithm (32 threads/cores)
  • Host code in C++ with the -O3 compiler optimization flag
  • GPU: NVIDIA Titan X, CUDA
  • Our GPU implementation uses 64-bit floats
  • SuperEGO executed with 32-bit floats
SLIDE 30

RESPONSE TIME VS. EPSILON

  • 2-D Space Weather dataset
  • Roughly 5 million points
  • Unicomp yields a minor performance gain over executing GPU-SJ without the optimization

[Figure: response time (s) vs. ε for GPU brute force, R-tree, SuperEGO, GPU, and GPU with unicomp]

SLIDE 31

RESPONSE TIME VS. EPSILON

  • 6-D synthetic dataset (uniform distribution)
  • 2 million points
  • Unicomp yields a larger performance gain over executing GPU-SJ without the optimization

[Figure: response time (s) vs. ε for GPU brute force, R-tree, SuperEGO, GPU, and GPU with unicomp]

SLIDE 32

RESPONSE TIME VS. EPSILON: R-TREE

  • Summary plot of speedup over the CPU sequential R-tree implementation on 16 datasets
  • Speedup greatest for higher dimensions
  • Average speedup: 26.9x

[Figure: speedup over the R-tree for each of the 16 datasets (SW, SDSS, and synthetic 2-D to 6-D datasets), with the average over all datasets]

SLIDE 33

RESPONSE TIME VS. EPSILON: SUPEREGO

  • Summary plot of speedup over the parallel SuperEGO algorithm
  • Average speedup over the state-of-the-art: 2.38x

[Figure: speedup over SuperEGO for each dataset, with averages over all datasets and over the real-world datasets]

SLIDE 34

PERFORMANCE CHARACTERIZATION OF UNICOMP

  • Our Unicomp optimization reduces the total number of cells searched
  • Half the number of cells are searched
  • Under what scenarios is Unicomp effective?
SLIDE 35

RATIO TIME WITHOUT/WITH UNICOMP

  • Real-world datasets (2-3-D)

[Figure: ratio of response time without/with unicomp vs. ε for the real-world datasets]

  • Synthetic datasets (2-6-D)

[Figure: ratio of response time without/with unicomp vs. ε for the synthetic datasets]

  • Performance of UNICOMP is dependent on dimensionality
  • Achieves >2x speedup in some scenarios
  • Surprising result
SLIDE 36

CONCLUSIONS

  • The self-join is a widely used operation in the database community
  • We propose a GPU accelerated algorithm to perform the self-join on low-dimensional data
  • We utilize an index tailored for the GPU with a small memory footprint
  • Our approach outperforms the CPU parallel state-of-the-art algorithm
SLIDE 37

FUTURE WORK

  • Examine the high-dimensional case
  • Use the self-join as a building block for other algorithms, such as kNN