Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
32nd ACM International Conference on Supercomputing · June 17, 2018


SLIDE 1

Ben Karsin – A Performance Model for GPU Architectures

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

1 Department of ICS, University of Hawaii at Manoa · 2 Goethe University Frankfurt · 3 Département d'Informatique, Université Libre de Bruxelles

www.algoparc.ics.hawaii.edu

32nd ACM International Conference on Supercomputing · June 17, 2018

AlgoPARC

Ben Karsin1 · karsin@hawaii.edu · Volker Weichert2 · weichert@cs.uni-frankfurt.de · Henri Casanova1 · henric@hawaii.edu · John Iacono3 · john.iacono@ulb.ac.be · Nodari Sitchinava1 · nodari@hawaii.edu
Work supported by the National Science Foundation under grants 1533823 and 1745331.

SLIDE 2

Sorting: A fundamental problem

Sorting is a building block, used by countless algorithms...


SLIDE 6

Sorting: A fundamental problem

Sorting is a building block, used by countless algorithms... Many solutions exist.

SLIDE 7

Graphics Processing Units

Designed for high throughput: extremely parallel, with thousands of cores. Huge performance potential and lots of application research, but no standard performance model.

SLIDE 8

NVIDIA GPU

[Diagram: NVIDIA GPU with multiple SMs sharing Global Memory; each SM has control logic, shared memory, and processor cores]

Streaming Multiprocessors (SMs): < 20 per GPU, < 200 cores each

SLIDE 9

NVIDIA GPU

Memory hierarchy: user-controlled, with different scopes per level.

SLIDE 10

NVIDIA GPU

Memory hierarchy: user-controlled, different scopes. Thread organization: cores share control logic. Need lots of parallelism!

SLIDE 11

Thread Organization

[Diagram: GPU with SMs and Global Memory; threads assigned to SMs]

SLIDE 12

Thread Organization

Threads are grouped into thread-blocks: b threads per block, all running on the same SM.

SLIDE 13

Thread Organization

Groups of w = 32 threads form a warp and execute in 'SIMT' lockstep.
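The thread-to-warp grouping can be sketched numerically (plain Python, not CUDA; w = 32 as on the slide):

```python
W = 32  # warp size (w on the slide)

def warp_of(thread_id):
    """Warp index of a thread within its thread-block: threads are
    grouped into warps of W consecutive thread ids."""
    return thread_id // W

# threads 0..31 form warp 0; threads 32..63 form warp 1
print(warp_of(0), warp_of(31), warp_of(32), warp_of(63))  # → 0 0 1 1
```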

SLIDE 14

Memory Hierarchy

3 levels, each with different: access scope, capacity, access pattern, latency, and peak bandwidth.

SLIDE 15

Global Memory

Large (up to 32 GB), shared by all threads, slow. "Blocked" accesses, as in the I/O model.

SLIDE 16

Global Memory Access Pattern

A warp's 32 threads execute in lockstep and access global memory together. The warp acts as a single unit: 1 operation accesses 32 elements, just like disk accesses in the I/O model (B = 32).
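This blocked-access behavior can be modeled by counting how many B-element blocks a warp's 32 addresses touch (a sketch of the I/O-model view, not actual GPU code):

```python
def transactions(addresses, B=32):
    """Number of distinct B-element memory blocks touched by one
    warp's accesses; each block costs one memory transaction."""
    return len({a // B for a in addresses})

coalesced = list(range(32))            # 32 consecutive addresses
strided = [i * 32 for i in range(32)]  # stride-32 addresses

print(transactions(coalesced))  # → 1  (one transaction serves the warp)
print(transactions(strided))    # → 32 (one transaction per thread)
```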

SLIDE 17

Shared Memory

Small (48–64 KB per SM), private to the SM; the user defines sharing. 5–10× faster than global memory. Unique access pattern: organized into banks.
SLIDE 18

Shared Memory Access Pattern

Array A is stored across w memory banks (Bank 1, Bank 2, ..., Bank w) in shared memory.

SLIDE 19

Shared Memory Access Pattern

Threads T1–T4 access different banks: separate banks can be accessed concurrently.

SLIDE 20

Shared Memory Access Pattern

Threads accessing the same bank cause a bank conflict: the accesses are serialized.
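The serialization cost can be sketched as the maximum number of threads mapping to the same bank, assuming the common bank = address mod w layout (illustrative Python, not GPU code):

```python
from collections import Counter

def conflict_degree(addresses, W=32):
    """Max number of threads hitting one bank; shared-memory access
    time scales with this factor (1 = conflict-free)."""
    return max(Counter(a % W for a in addresses).values())

print(conflict_degree(range(32)))                    # → 1  (conflict-free)
print(conflict_degree([i * 32 for i in range(32)]))  # → 32 (fully serialized)
```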

SLIDE 21

Registers

Small (255 registers per thread), private to the thread, and fastest. Random access, but usage must be "static", i.e., known at compile time.

SLIDE 22

Talk Outline

Motivation/background · GPU overview · Memory hierarchy · State-of-the-art GPU sorting · Our multiway mergesort (GPU-MMS) · Optimizations · Performance results · Conclusions & future work

SLIDE 23

State-of-the-art GPU sorting

CUB: radix sort (limited application). Modern GPU (MGPU): pairwise mergesort. Thrust: changes algorithm based on input type; comes with the CUDA compiler. All are highly engineered and optimized for hardware, changing parameters based on the hardware detected.

SLIDE 24

MGPU mergesort

Pairwise mergesort: E elements per thread; threads t1, t2, ..., t(N/E).

SLIDE 25

MGPU mergesort

Pairwise mergesort: E elements per thread, b threads per thread-block (bE elements per block).

SLIDE 26

MGPU mergesort

Pairwise mergesort: E elements per thread, b threads per thread-block. Lots of parallelism: N/E threads!

SLIDE 27

MGPU mergesort

Each thread-block sorts bE elements.

SLIDE 28

MGPU mergesort

Each thread-block sorts bE elements; then merge pairs of lists.

SLIDE 29

MGPU mergesort

Each thread-block sorts bE elements; then merge pairs of lists. ⌈log2(N/(bE))⌉ merge rounds; since b and E are small constants, this is O(log N) merge rounds.
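For a sense of scale, the round counts of pairwise merging versus the multiway merging introduced later can be compared directly; the N, b, E, and K values below are illustrative, not the paper's tuned parameters:

```python
import math

def pairwise_rounds(N, b, E):
    """Merge rounds for pairwise mergesort after the bE-element base case."""
    return math.ceil(math.log2(N / (b * E)))

def multiway_rounds(N, base, K):
    """Merge rounds when K lists are merged per round."""
    return math.ceil(math.log(N / base, K))

N = 1 << 28  # ~268M elements
print(pairwise_rounds(N, b=128, E=15))         # → 18
print(multiway_rounds(N, base=128 * 15, K=8))  # → 6
```

Merging K lists per round divides the round count by a factor of log2 K, which is exactly the global-memory saving the multiway design targets.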

SLIDE 30

MGPU bottlenecks

Global memory is the main bottleneck. Unavoidable: O(log2 N) merge rounds.

SLIDE 31

Multiway mergesort

Reduce the global memory bottleneck: merge K lists at a time! Merging is done in internal memory, using a priority queue. ⌈logK(N/B)⌉ merge rounds, i.e., O(logK N).

SLIDE 32

Merging K lists

Use a heap: load a block from each of the K lists and build a min-heap on the smallest items. [Diagram: K sorted lists feeding a min-heap]

SLIDE 33

Merging K lists

Use a heap: buffer the smallest item, then heapify to find the next smallest.

SLIDE 34

Merging K lists

Use a heap: output the buffer when it is full; read a new block from a list when needed.
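The sequential version of this K-way merge is easy to sketch with a binary min-heap (Python's heapq standing in for the GPU heap; block loading and output buffering are omitted):

```python
import heapq

def multiway_merge(lists):
    """Merge K sorted lists: build a min-heap on each list's head,
    repeatedly pop the smallest and push that list's next element."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)  # smallest remaining item
        out.append(val)
        if j + 1 < len(lists[i]):        # refill from the same list
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 6, 7], [3, 4, 8], [5, 9, 10]]))
# → [1, 3, 4, 5, 6, 7, 8, 9, 10]
```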

SLIDE 35

Parallel 'Block Heap'

A warp shares one heap, but all 32 threads need work... [Diagram: 32 threads, K lists]

SLIDE 36

Parallel 'Block Heap'

Each node holds a sorted list. [Diagram: heap with a sorted list per node]

SLIDE 37

Parallel 'Block Heap'

Each node holds a sorted list; the root node's list is output. [Diagram]

SLIDE 38

Parallel 'Block Heap'

Merge child nodes: all 32 threads work together. [Diagram]

SLIDE 39

Parallel 'Block Heap'

Merge child nodes: the smallest items move up, the largest stay below. [Diagram]

SLIDE 40

Parallel 'Block Heap'

Merge child nodes, then repeat on the now-empty child. [Diagram]

SLIDE 41

Multiway mergesort (GPU-MMS) analysis

Base case sorts w^2 elements; merging groups of K lists per round gives ⌈logK(N/w^2)⌉ rounds. Merging of nodes is performed in registers: no bank conflicts.

SLIDE 42

Multiway mergesort (GPU-MMS) analysis

Base case sorts w^2 elements; merging groups of K lists per round gives ⌈logK(N/w^2)⌉ rounds. Merging of nodes is performed in registers: no bank conflicts. But it is not work-efficient: a factor of log w more register accesses.

SLIDE 43

Multiway mergesort (GPU-MMS) analysis

Base case sorts w^2 elements; merging groups of K lists per round gives ⌈logK(N/w^2)⌉ rounds. Merging of nodes is performed in registers: no bank conflicts. But it is not work-efficient (a factor of log w more register accesses) and has low parallelism: lots of shared memory used, and dependent operations.

SLIDE 44

Pipelining merge steps

Pre-search the path to a leaf to identify all nodes that will be merged. [Diagram]

SLIDE 45

Pipelining merge steps

Pre-search the path to a leaf to identify all nodes that will be merged; then output and merge in a pipeline. [Diagram]

SLIDE 46

Tuning K

Small K: too many global memory accesses. Large K: not enough parallelism.

SLIDE 47

Sorting Performance

Sorting integers on Maxwell GPU

SLIDE 48

Impact of Bank Conflicts

Generate input that causes bank conflicts; GPU-MMS is unaffected.

SLIDE 49

Different datatypes

Increasing comparison work degrades performance

SLIDE 50

Conclusions

Analysis helps us develop better GPU algorithms: I/O-efficient techniques work well; minimize global memory accesses; don't forget parallelism.

SLIDE 51

Conclusions

Analysis helps us develop better GPU algorithms: I/O-efficient techniques work well; minimize global memory accesses; don't forget parallelism. Future work: apply analysis methods to other algorithms; optimize GPU-MMS; make it work-efficient (open problem); how will future architectures change things?

SLIDE 52

Conclusions

Thank You!

GPU-MMS available: https://github.com/algoparc/GPU-MMS

SLIDE 53

Backup Slides

SLIDE 54

MGPU Merge phase

Merge pairs of lists; repeat until sorted. [Diagram: thread-blocks TB1–TB4 assigned to list pairs]

SLIDE 55

MGPU Merge phase

Merge pairs of lists; repeat until sorted. First, find each thread-block's partition.

SLIDE 56

MGPU Merge phase

Each thread-block loads its partition into shared memory.

SLIDE 57

MGPU Merge phase

Each thread-block loads its partition into shared memory and merges...

SLIDE 58

GPU-MMS Bottlenecks

[Chart: runtime breakdown across Sync, GMEM, SMEM, Basecase, and Registers] Mostly compute-bound.

SLIDE 59

Searching in global memory

SLIDE 60

Model results: MGPU mergesort

Model is quite accurate! Shows that E = 31 is ideal for this GPU! (E = 15 is hard-coded)

SLIDE 61

Hiding Latency

tx: average time per operation of type x; minimizing tx maximizes throughput. But operations have latency...

SLIDE 62

Hiding Latency

Multiplicity X: run multiple threads per core. [Diagram: threads, core, memory]

SLIDE 63

Hiding Latency

A thread sends a request to slow memory.

SLIDE 64

Hiding Latency

Switch out the thread while it waits.

SLIDE 65

Hiding Latency

Schedule a new thread to use the core.

SLIDE 66

Hiding Latency

Issue more requests to saturate bandwidth.

SLIDE 67

Hiding Latency

Instruction-level parallelism (ILP): I consecutive independent instructions.

SLIDE 68

Hiding Latency

With ILP, a thread requests memory element X.

SLIDE 69

Hiding Latency

The next instruction requests element Y.

SLIDE 70

Hiding Latency

Issue the request for Y without waiting for X to return.

SLIDE 71

Hiding Latency

Issue more requests to saturate bandwidth.

SLIDE 72

Impact of X and I (global memory)

Copy 2^16 elements in global memory per thread. When X · I ≥ 8, performance is limited by bandwidth. [Plot: time vs. X and I]

SLIDE 73–75

Impact of X and I (global memory)

[Plot builds: curves for X · I = 8 · 1, 4 · 2, and 2 · 4]

SLIDE 76

Time per memory access

Increasing X · I reduces the effect of latency until peak bandwidth is reached. Parameters for each type of memory x:

Lx: memory access latency (clock cycles)
Bx: peak bandwidth (operations per clock cycle, per core)

tx = max( 1/Bx , Lx/(X · I) )

SLIDE 77

GPU Hardware Parameters

Run benchmarks on 3 architectures. ALGOPARC: server in our lab; GIBSON: desktop with GPU; UHHPC: GPU node of the UH cluster.

Parameter         ALGOPARC       GIBSON    UHHPC
GPU               Quadro M4000   GTX 770   K40
P (total cores)   1664           1536      2880
Lg                269.5          267.6     291.2
Bg                0.0301         0.0279    0.0275
Ls                85.84          123.1     111.9
Bs                0.233          0.13      0.131
Lr                6              10        10
Br                ~1             ~1        ~1
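Plugging the ALGOPARC global-memory values from this table into the tx formula on the previous slide shows where the bandwidth term takes over (a small numeric check, assuming only the formula and the table above):

```python
def t_x(L, B, X, I):
    """Time per access: max of the bandwidth bound 1/B and the
    latency bound L/(X*I), as in the slide's model."""
    return max(1.0 / B, L / (X * I))

L_g, B_g = 269.5, 0.0301  # ALGOPARC global-memory latency and bandwidth

for xi in (1, 2, 4, 8, 16):
    print(xi, round(t_x(L_g, B_g, xi, 1), 1))

# The bandwidth bound 1/B_g ≈ 33.2 cycles dominates once
# X*I > L_g*B_g ≈ 8.1, matching the earlier observation that
# X*I >= 8 is (roughly) the bandwidth-limited regime.
print(round(L_g * B_g, 1))  # → 8.1
```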