Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
32nd ACM International Conference on Supercomputing · June 17, 2018


SLIDE 1

Ben Karsin – A Performance Model for GPU Architectures

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

1 Department of ICS, University of Hawaii at Manoa · 2 Goethe University Frankfurt · 3 Département d'Informatique, Université Libre de Bruxelles

www.algoparc.ics.hawaii.edu

32nd ACM International Conference on Supercomputing · June 17, 2018

AlgoPARC

Ben Karsin1 · karsin@hawaii.edu · Volker Weichert2 · weichert@cs.uni-frankfurt.de · Henri Casanova1 · henric@hawaii.edu · John Iacono3 · john.iacono@ulb.ac.be · Nodari Sitchinava1 · nodari@hawaii.edu
Work supported by the National Science Foundation under grants 1533823 and 1745331.

SLIDE 2

Sorting: A fundamental problem

Sorting is a building block, used by countless algorithms...


SLIDE 6

Sorting: A fundamental problem

Sorting is a building block, used by countless algorithms... Many solutions exist.

SLIDE 7

Graphics Processing Units

Designed for high throughput: extremely parallel, with thousands of cores. Huge performance potential and lots of application research, but no standard performance model.

SLIDE 8

NVIDIA GPU

[Diagram: NVIDIA GPU with multiple SMs sharing Global Memory; each SM has control logic, shared memory, and processor cores]

Streaming Multiprocessors (SMs): < 20 per GPU, < 200 cores each

SLIDE 9

NVIDIA GPU

Memory hierarchy: user-controlled, with different scopes per level.

SLIDE 10

NVIDIA GPU

Memory hierarchy: user-controlled, different scopes. Thread organization: cores share control logic. Need lots of parallelism!

SLIDE 11

Thread Organization

[Diagram: GPU with SMs and Global Memory; threads assigned to SMs]

SLIDE 12

Thread Organization

Threads are grouped into thread-blocks: b threads per block, all running on the same SM.

SLIDE 13

Thread Organization

Groups of w = 32 threads form a warp and execute in 'SIMT' lockstep.
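The thread-to-warp grouping can be sketched numerically (plain Python, not CUDA; w = 32 as on the slide):

```python
W = 32  # warp size (w on the slide)

def warp_of(thread_id):
    """Warp index of a thread within its thread-block: threads are
    grouped into warps of W consecutive thread ids."""
    return thread_id // W

# threads 0..31 form warp 0; threads 32..63 form warp 1
print(warp_of(0), warp_of(31), warp_of(32), warp_of(63))  # → 0 0 1 1
```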

SLIDE 14

Memory Hierarchy

3 levels, each with different: access scope, capacity, access pattern, latency, and peak bandwidth.

SLIDE 15

Global Memory

Large (up to 32 GB), shared by all threads, slow. "Blocked" accesses, as in the I/O model.

SLIDE 16

Global Memory Access Pattern

A warp's 32 threads execute in lockstep and access global memory together. The warp acts as a single unit: 1 operation accesses 32 elements, just like disk accesses in the I/O model (B = 32).
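This blocked-access behavior can be modeled by counting how many B-element blocks a warp's 32 addresses touch (a sketch of the I/O-model view, not actual GPU code):

```python
def transactions(addresses, B=32):
    """Number of distinct B-element memory blocks touched by one
    warp's accesses; each block costs one memory transaction."""
    return len({a // B for a in addresses})

coalesced = list(range(32))            # 32 consecutive addresses
strided = [i * 32 for i in range(32)]  # stride-32 addresses

print(transactions(coalesced))  # → 1  (one transaction serves the warp)
print(transactions(strided))    # → 32 (one transaction per thread)
```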

SLIDE 17

Shared Memory

Small (48–64 KB per SM), private to the SM; the user defines sharing. 5–10× faster than global memory. Unique access pattern: organized into banks.
SLIDE 18

Shared Memory Access Pattern

Array A is stored across w memory banks (Bank 1, Bank 2, ..., Bank w) in shared memory.

SLIDE 19

Shared Memory Access Pattern

Threads T1–T4 access different banks: separate banks can be accessed concurrently.

SLIDE 20

Shared Memory Access Pattern

Threads accessing the same bank cause a bank conflict: the accesses are serialized.
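The serialization cost can be sketched as the maximum number of threads mapping to the same bank, assuming the common bank = address mod w layout (illustrative Python, not GPU code):

```python
from collections import Counter

def conflict_degree(addresses, W=32):
    """Max number of threads hitting one bank; shared-memory access
    time scales with this factor (1 = conflict-free)."""
    return max(Counter(a % W for a in addresses).values())

print(conflict_degree(range(32)))                    # → 1  (conflict-free)
print(conflict_degree([i * 32 for i in range(32)]))  # → 32 (fully serialized)
```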

SLIDE 21

Registers

Small (255 registers per thread), private to the thread, and fastest. Random access, but usage must be "static", i.e., known at compile time.

SLIDE 22

Talk Outline

Motivation/background · GPU overview · Memory hierarchy · State-of-the-art GPU sorting · Our multiway mergesort (GPU-MMS) · Optimizations · Performance results · Conclusions & future work

SLIDE 23

State-of-the-art GPU sorting

CUB: radix sort (limited application). Modern GPU (MGPU): pairwise mergesort. Thrust: changes algorithm based on input type; comes with the CUDA compiler. All are highly engineered and optimized for hardware, changing parameters based on the hardware detected.

SLIDE 24

MGPU mergesort

Pairwise mergesort: E elements per thread; threads t1, t2, ..., t(N/E).

SLIDE 25

MGPU mergesort

Pairwise mergesort: E elements per thread, b threads per thread-block (bE elements per block).

SLIDE 26

MGPU mergesort

Pairwise mergesort: E elements per thread, b threads per thread-block. Lots of parallelism: N/E threads!

SLIDE 27

MGPU mergesort

Each thread-block sorts bE elements.

SLIDE 28

MGPU mergesort

Each thread-block sorts bE elements; then merge pairs of lists.

SLIDE 29

MGPU mergesort

Each thread-block sorts bE elements; then merge pairs of lists. ⌈log2(N/(bE))⌉ merge rounds; since b and E are small constants, this is O(log N) merge rounds.
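For a sense of scale, the round counts of pairwise merging versus the multiway merging introduced later can be compared directly; the N, b, E, and K values below are illustrative, not the paper's tuned parameters:

```python
import math

def pairwise_rounds(N, b, E):
    """Merge rounds for pairwise mergesort after the bE-element base case."""
    return math.ceil(math.log2(N / (b * E)))

def multiway_rounds(N, base, K):
    """Merge rounds when K lists are merged per round."""
    return math.ceil(math.log(N / base, K))

N = 1 << 28  # ~268M elements
print(pairwise_rounds(N, b=128, E=15))         # → 18
print(multiway_rounds(N, base=128 * 15, K=8))  # → 6
```

Merging K lists per round divides the round count by a factor of log2 K, which is exactly the global-memory saving the multiway design targets.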

SLIDE 30

MGPU bottlenecks

Global memory is the main bottleneck. Unavoidable: O(log2 N) merge rounds.

SLIDE 31

Multiway mergesort

Reduce the global memory bottleneck: merge K lists at a time! Merging is done in internal memory, using a priority queue. ⌈logK(N/B)⌉ merge rounds, i.e., O(logK N).

SLIDE 32

Merging K lists

Use a heap: load a block from each of the K lists and build a min-heap on the smallest items. [Diagram: K sorted lists feeding a min-heap]

SLIDE 33

Merging K lists

Use a heap: buffer the smallest item, then heapify to find the next smallest.

SLIDE 34

Merging K lists

Use a heap: output the buffer when it is full; read a new block from a list when needed.
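The sequential version of this K-way merge is easy to sketch with a binary min-heap (Python's heapq standing in for the GPU heap; block loading and output buffering are omitted):

```python
import heapq

def multiway_merge(lists):
    """Merge K sorted lists: build a min-heap on each list's head,
    repeatedly pop the smallest and push that list's next element."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)  # smallest remaining item
        out.append(val)
        if j + 1 < len(lists[i]):        # refill from the same list
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 6, 7], [3, 4, 8], [5, 9, 10]]))
# → [1, 3, 4, 5, 6, 7, 8, 9, 10]
```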

SLIDE 35

Parallel 'Block Heap'

A warp shares one heap, but all 32 threads need work... [Diagram: 32 threads, K lists]

SLIDE 36

Parallel 'Block Heap'

Each node holds a sorted list. [Diagram: heap with a sorted list per node]

SLIDE 37

Parallel 'Block Heap'

Each node holds a sorted list; the root node's list is output. [Diagram]

SLIDE 38

Parallel 'Block Heap'

Merge child nodes: all 32 threads work together. [Diagram]

SLIDE 39

Parallel 'Block Heap'

Merge child nodes: the smallest items move up, the largest stay below. [Diagram]

SLIDE 40

Parallel 'Block Heap'

Merge child nodes, then repeat on the now-empty child. [Diagram]

SLIDE 41

Multiway mergesort (GPU-MMS) analysis

Base case sorts w^2 elements; merging groups of K lists per round gives ⌈logK(N/w^2)⌉ rounds. Merging of nodes is performed in registers: no bank conflicts.

SLIDE 42

Multiway mergesort (GPU-MMS) analysis

Base case sorts w^2 elements; merging groups of K lists per round gives ⌈logK(N/w^2)⌉ rounds. Merging of nodes is performed in registers: no bank conflicts. But it is not work-efficient: a factor of log w more register accesses.

SLIDE 43

Multiway mergesort (GPU-MMS) analysis

Base case sorts w^2 elements; merging groups of K lists per round gives ⌈logK(N/w^2)⌉ rounds. Merging of nodes is performed in registers: no bank conflicts. But it is not work-efficient (a factor of log w more register accesses) and has low parallelism: lots of shared memory used, and dependent operations.

SLIDE 44

Pipelining merge steps

Pre-search the path to a leaf to identify all nodes that will be merged. [Diagram]

SLIDE 45

Pipelining merge steps

Pre-search the path to a leaf to identify all nodes that will be merged; then output and merge in a pipeline. [Diagram]

SLIDE 46

Tuning K

Small K: too many global memory accesses. Large K: not enough parallelism.

SLIDE 47

Sorting Performance

Sorting integers on Maxwell GPU

SLIDE 48

Impact of Bank Conflicts

Generate input that causes bank conflicts; GPU-MMS is unaffected.

SLIDE 49

Different datatypes

Increasing comparison work degrades performance

SLIDE 50

Conclusions

Analysis helps us develop better GPU algorithms: I/O-efficient techniques work well; minimize global memory accesses; don't forget parallelism.

SLIDE 51

Conclusions

Analysis helps us develop better GPU algorithms: I/O-efficient techniques work well; minimize global memory accesses; don't forget parallelism. Future work: apply analysis methods to other algorithms; optimize GPU-MMS; make it work-efficient (open problem); how will future architectures change things?

SLIDE 52

Conclusions

Thank You!

GPU-MMS available: https://github.com/algoparc/GPU-MMS

SLIDE 53

Backup Slides

SLIDE 54

MGPU Merge phase

Merge pairs of lists; repeat until sorted. [Diagram: thread-blocks TB1–TB4 assigned to list pairs]

SLIDE 55

MGPU Merge phase

Merge pairs of lists; repeat until sorted. First, find each thread-block's partition.

SLIDE 56

MGPU Merge phase

Each thread-block loads its partition into shared memory.

SLIDE 57

MGPU Merge phase

Each thread-block loads its partition into shared memory and merges...

SLIDE 58

GPU-MMS Bottlenecks

[Chart: runtime breakdown across Sync, GMEM, SMEM, Basecase, and Registers] Mostly compute-bound.

SLIDE 59

Searching in global memory

SLIDE 60

Model results: MGPU mergesort

Model is quite accurate! Shows that E = 31 is ideal for this GPU! (E = 15 is hard-coded)

SLIDE 61

Hiding Latency

tx: average time per operation of type x; minimizing tx maximizes throughput. But operations have latency...

SLIDE 62

Hiding Latency

Multiplicity X: run multiple threads per core. [Diagram: threads, core, memory]

SLIDE 63

Hiding Latency

A thread sends a request to slow memory.

SLIDE 64

Hiding Latency

Switch out the thread while it waits.

SLIDE 65

Hiding Latency

Schedule a new thread to use the core.

SLIDE 66

Hiding Latency

Issue more requests to saturate bandwidth.

SLIDE 67

Hiding Latency

Instruction-level parallelism (ILP): I consecutive independent instructions.

SLIDE 68

Hiding Latency

With ILP, a thread requests memory element X.

SLIDE 69

Hiding Latency

The next instruction requests element Y.

SLIDE 70

Hiding Latency

Issue the request for Y without waiting for X to return.

SLIDE 71

Hiding Latency

Issue more requests to saturate bandwidth.

SLIDE 72

Impact of X and I (global memory)

Copy 2^16 elements in global memory per thread. When X · I ≥ 8, performance is limited by bandwidth. [Plot: time vs. X and I]

SLIDE 73–75

Impact of X and I (global memory)

[Plot builds: curves for X · I = 8 · 1, 4 · 2, and 2 · 4]

SLIDE 76

Time per memory access

Increasing X · I reduces the effect of latency until peak bandwidth is reached. Parameters for each type of memory x:

Lx: memory access latency (clock cycles)
Bx: peak bandwidth (operations per clock cycle, per core)

tx = max( 1/Bx , Lx/(X · I) )

SLIDE 77

GPU Hardware Parameters

Run benchmarks on 3 architectures. ALGOPARC: server in our lab; GIBSON: desktop with GPU; UHHPC: GPU node of the UH cluster.

Parameter         ALGOPARC       GIBSON    UHHPC
GPU               Quadro M4000   GTX 770   K40
P (total cores)   1664           1536      2880
Lg                269.5          267.6     291.2
Bg                0.0301         0.0279    0.0275
Ls                85.84          123.1     111.9
Bs                0.233          0.13      0.131
Lr                6              10        10
Br                ~1             ~1        ~1
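Plugging the ALGOPARC global-memory values from this table into the tx formula on the previous slide shows where the bandwidth term takes over (a small numeric check, assuming only the formula and the table above):

```python
def t_x(L, B, X, I):
    """Time per access: max of the bandwidth bound 1/B and the
    latency bound L/(X*I), as in the slide's model."""
    return max(1.0 / B, L / (X * I))

L_g, B_g = 269.5, 0.0301  # ALGOPARC global-memory latency and bandwidth

for xi in (1, 2, 4, 8, 16):
    print(xi, round(t_x(L_g, B_g, xi, 1), 1))

# The bandwidth bound 1/B_g ≈ 33.2 cycles dominates once
# X*I > L_g*B_g ≈ 8.1, matching the earlier observation that
# X*I >= 8 is (roughly) the bandwidth-limited regime.
print(round(L_g * B_g, 1))  # → 8.1
```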