GPU Sample Sort - Nikolaj Leischner, Vitaly Osipov, Peter Sanders (PowerPoint PPT presentation)



SLIDE 1

Vitaly Osipov: GPU Sample Sort Fakultät für Informatik Institut für Theoretische Informatik

Institut für Theoretische Informatik - Algorithmik II

GPU Sample Sort

Nikolaj Leischner, Vitaly Osipov, Peter Sanders

KIT – Universität des Landes Baden-Württemberg und nationales Grossforschungszentrum in der Helmholtz-Gemeinschaft

www.kit.edu

SLIDE 2

Overview


- Introduction
- Tesla architecture
- Computing Unified Device Architecture Model
- Performance Guidelines
- Sample Sort: Algorithm Overview
- High Level GPU Algorithm Design
- Flavor of Implementation Details
- Experimental Evaluation
- Future Trends

SLIDE 3

Introduction

multi-way sorting algorithms


Sorting is important. Divide-and-conquer approaches:

- recursively split the input into tiles until the tile size is M (e.g. the cache size)
- sort each tile independently
- combine intermediate results

Two-way approaches:

- two-way distribution (quicksort): log2(n/M) scans to partition the input
- two-way merge sort: log2(n/M) scans to combine intermediate results

Multi-way approaches:

- k-way distribution (sample sort): only logk(n/M) scans to partition
- k-way merge sort: only logk(n/M) scans to combine

Multi-way approaches are beneficial when memory bandwidth is the bottleneck!
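The pass counts above are easy to check with a small script (a sketch; the sizes n and M below are illustrative, not values from the talk):

```python
import math

def passes(n, M, k):
    """Number of partition/merge passes needed to shrink tiles of
    size n down to the base-case size M with k-way splitting."""
    p, tile = 0, n
    while tile > M:
        tile = math.ceil(tile / k)
        p += 1
    return p

# Illustrative sizes: 128M elements, a tile of 4K elements
n, M = 2**27, 2**12
print(passes(n, M, 2))    # two-way: 15 passes over the data
print(passes(n, M, 128))  # 128-way: 3 passes
```

Each pass streams the whole input through memory, so reducing log2(n/M) scans to logk(n/M) directly cuts memory traffic.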


SLIDE 7

NVidia Tesla Architecture


30 Streaming Multiprocessors (SMs) with 8 Scalar Processors (SPs) each

- 240 physical cores overall
- 16 KB shared memory per SM, similar to a CPU L1 cache
- 4 GB global device memory

SLIDE 8

Computing Unified Device Architecture Model


[Figure: CUDA virtualizes a grid of thread blocks (TBlock(0,0), TBlock(0,1), ...) onto the SMs, each with 8 SPs and shared memory, above global memory]

C code structure from the slide:

    int main() {
        // serial code
        Kernel<<<grid, block>>>(/* args */);  // parallel
        // serial code
    }

Similar to the SPMD (single-program multiple-data) model:

- a block of concurrent threads executes a scalar sequential program, a kernel
- thread blocks constitute a grid

SLIDE 9

Performance Guidelines


General pattern in GPU algorithm design:

- decompose the problem into many data-independent sub-problems
- solve the sub-problems by blocks of cooperative parallel threads

Performance guidelines:

- conditional branching: threads should follow the same execution path
- shared memory: exploit the fast on-chip memory
- coalesced global memory operations: group load/store requests to the same memory block
- fewer memory accesses overall

SLIDE 10

Algorithm Overview


SampleSort(e = <e_1, ..., e_n>, k)
begin
    if n < M then return SmallSort(e)
    choose a random sample S = S_1, ..., S_{a·k−1} of e
    Sort(S)
    <s_0, s_1, ..., s_k> := <−∞, S_a, ..., S_{a·(k−1)}, ∞>
    for 1 ≤ i ≤ n do
        find j ∈ {1, ..., k} such that s_{j−1} ≤ e_i ≤ s_j
        place e_i in bucket b_j
    return Concatenate(SampleSort(b_1, k), ..., SampleSort(b_k, k))
end

Algorithm 1: Serial Sample Sort
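Algorithm 1 translates almost line for line into Python (a sketch; the parameter values k = 4, M = 8, a = 2 and the guard against degenerate samples full of duplicates are illustrative additions, not part of the original pseudocode):

```python
import random

def sample_sort(e, k=4, M=8, a=2):
    """Serial sample sort: oversample by a, sort the sample, keep
    every a-th element as a splitter, distribute into k buckets,
    recurse on each bucket, and concatenate the results."""
    if len(e) < M:                        # base case: SmallSort
        return sorted(e)
    sample = sorted(random.choices(e, k=a * k - 1))
    splitters = sample[a - 1::a]          # S_a, S_2a, ..., S_a(k-1)
    buckets = [[] for _ in range(k)]
    for x in e:
        # index of the bucket b_j with s_(j-1) <= x <= s_j
        j = sum(x > s for s in splitters)
        buckets[j].append(x)
    out = []
    for b in buckets:
        if len(b) == len(e):              # degenerate sample (all duplicates)
            out.extend(sorted(b))
        else:
            out.extend(sample_sort(b, k, M, a))
    return out
```

Ties go to the leftmost feasible bucket (`x > s` rather than `>=`), so equal keys never straddle more bucket boundaries than necessary.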

SLIDE 11

High Level GPU Algorithm Design


[Figure: p thread blocks compute bucket indices over the input, a prefix sum over the k × p counter table produces global bucket offsets, and a second pass scatters the elements to the output]

Parameters:

- distribution degree k = 128
- threads per block t = 256
- elements per thread l = 8
- number of blocks p = n/(t · l)

Phase 1. Choose splitters.

Phase 2. Each of the p thread blocks:

- computes the bucket indices id of its elements, 0 ≤ id ≤ k − 1
- stores the bucket sizes in DRAM

Phase 3. A prefix sum over the k × p table yields the global offsets.

Phase 4. As in Phase 2, but compute local offsets; local + global offsets give the final positions.
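Phases 2-4 can be simulated serially to make the offset arithmetic concrete (a sketch: the tiles here are strided slices rather than the contiguous chunks a real thread block would read, and all names are illustrative):

```python
from itertools import accumulate

def distribute(e, splitters, p):
    """One k-way distribution step, phases 2-4 of the GPU design,
    simulated serially: p 'thread blocks' count their bucket sizes,
    an exclusive prefix sum over the k x p counter table yields
    global offsets, and a second pass scatters the elements."""
    k = len(splitters) + 1
    tiles = [e[i::p] for i in range(p)]          # p input tiles
    # Phase 2: each block counts how many of its elements hit each bucket
    counts = [[0] * k for _ in range(p)]
    for b, tile in enumerate(tiles):
        for x in tile:
            counts[b][sum(x > s for s in splitters)] += 1
    # Phase 3: exclusive prefix sum in bucket-major order -> global offsets
    flat = [counts[b][j] for j in range(k) for b in range(p)]
    offsets = [0] + list(accumulate(flat))[:-1]
    # Phase 4: recompute bucket indices and scatter to final positions
    out = [None] * len(e)
    cursor = {(j, b): offsets[j * p + b] for j in range(k) for b in range(p)}
    for b, tile in enumerate(tiles):
        for x in tile:
            j = sum(x > s for s in splitters)
            out[cursor[(j, b)]] = x
            cursor[(j, b)] += 1
    return out
```

After one call, the output holds bucket b_1 first, then b_2, and so on; within a bucket, the elements of block 0 precede those of block 1, exactly the order the bucket-major prefix sum encodes.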


SLIDE 15

Flavor of Implementation Details

computing element bucket indices


bt = <s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, ...>

TraverseTree(e_i)
begin
    j := 1
    repeat log k times          // go left or right?
        j := 2j + (e_i > bt[j])
    j := j − k + 1              // bucket index
end

[Figure: the splitters arranged as an implicit binary search tree with s_{k/2} at the root, s_{k/4} and s_{3k/4} below it, then s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}; < and > branches lead down to the leaves, buckets b_1, ..., b_8]

- Store the tree in fast shared memory
- Use predicated instructions: no path divergence
- Unroll the loop
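A Python sketch of the same idea (assumes k is a power of two, as in the k = 128 configuration; build_tree and the heap-layout details are an illustrative reconstruction):

```python
def build_tree(splitters):
    """Lay out the k-1 sorted splitters as an implicit binary search
    tree bt in heap order: bt[1] is the root, node j has children
    2j and 2j+1, and bt[0] is unused."""
    k = len(splitters) + 1
    bt = [None] * k
    def fill(j, lo, hi):                  # put the median of the range at node j
        if lo > hi:
            return
        mid = (lo + hi) // 2
        bt[j] = splitters[mid]
        fill(2 * j, lo, mid - 1)
        fill(2 * j + 1, mid + 1, hi)
    fill(1, 0, k - 2)
    return bt

def traverse_tree(x, bt, k):
    """log k branch-free steps: the comparison result is added to
    the node index instead of branching, which is what predicated
    instructions achieve on the GPU."""
    j = 1
    for _ in range(k.bit_length() - 1):   # repeat log2(k) times
        j = 2 * j + (x > bt[j])           # go left or right
    return j - k + 1                      # 1-based bucket index
```

Because every thread executes the same log k index updates regardless of its key, a warp never diverges while classifying elements.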


SLIDE 19

Experimental Evaluation


NVidia Tesla C1060:

- 30 SMs × 8 SPs = 240 cores
- 4 GB RAM

Data types:

- 32- and 64-bit integers
- key-value pairs

Distributions:

- Uniform, Gaussian, Bucket Sorted, Staggered, Deterministic Duplicates

GPU sorting algorithms compared:

- CUDPP and Thrust radix sort
- Thrust merge sort
- quicksort
- hybrid sort
- bbsort

SLIDE 20

Experimental Evaluation

Uniform 32-bit integers


[Plot: sorting rate (sorted elements / time [µs], 20-180) vs. number of elements (2^17 to 2^28) for cudpp radix, thrust radix, quick, bbsort, hybrid (float), and sample]

SLIDE 21

Experimental Evaluation

Uniform key-value pairs


[Plot: sorting rate (sorted elements / time [µs], 20-130) vs. number of elements (2^19 to 2^27) for thrust radix, cudpp radix, thrust merge, and sample]

SLIDE 22

Experimental Evaluation

Uniform 64-bit integers


[Plot: sorting rate (sorted elements / time [µs], 20-80) vs. number of elements (2^17 to 2^27) for thrust radix and sample]

SLIDE 23

Future Trends

Fermi architecture


From the NVIDIA Fermi whitepaper: for existing applications that use shared memory as a software-managed cache, code can be streamlined to take advantage of the hardware caching system, while still having access to at least 16 KB of shared memory for explicit thread cooperation. Applications that do not use shared memory automatically benefit from the L1 cache, allowing high-performance CUDA programs to be built with minimum time and effort.

Summary table:

GPU                                  G80                GT200              Fermi
Transistors                          681 million        1.4 billion        3.0 billion
CUDA Cores                           128                240                512
Double Precision FP Capability       None               30 FMA ops/clock   256 FMA ops/clock
Single Precision FP Capability       128 MAD ops/clock  240 MAD ops/clock  512 FMA ops/clock
Special Function Units (SFUs) / SM   2                  2                  4
Warp schedulers (per SM)             1                  1                  2
Shared Memory (per SM)               16 KB              16 KB              48 KB or 16 KB (configurable)
L1 Cache (per SM)                    None               None               16 KB or 48 KB (configurable)
L2 Cache                             None               None               768 KB
ECC Memory Support                   No                 No                 Yes
Concurrent Kernels                   No                 No                 Up to 16
Load/Store Address Width             32-bit             32-bit             64-bit

Second Generation Parallel Thread Execution ISA

Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set. PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, PTX instructions are translated to machine instructions by the GPU driver. The primary goals of PTX are:

- Provide a stable ISA that spans multiple GPU generations
- Achieve full GPU performance in compiled applications
- Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
- Provide a code distribution ISA for application and middleware developers
- Provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines
- Facilitate hand-coding of libraries and performance kernels
- Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores

What about memory bandwidth? With no significant improvements in sight there, multi-way approaches are likely to be even more beneficial. Multi-way merge sort?

SLIDE 24


Thank you!