GPU Sample Sort - Nikolaj Leischner, Vitaly Osipov, Peter Sanders (PowerPoint PPT presentation)



SLIDE 1

Vitaly Osipov: GPU Sample Sort Fakultät für Informatik Institut für Theoretische Informatik

Institut für Theoretische Informatik - Algorithmik II

GPU Sample Sort

Nikolaj Leischner, Vitaly Osipov, Peter Sanders

KIT – Universität des Landes Baden-Württemberg und nationales Grossforschungszentrum in der Helmholtz-Gemeinschaft

www.kit.edu

SLIDE 2

Overview


- Introduction
- Tesla architecture
- Computing Unified Device Architecture Model
- Performance Guidelines
- Sample Sort: Algorithm Overview
- High Level GPU Algorithm Design
- Flavor of Implementation Details
- Experimental Evaluation
- Future Trends

SLIDE 3

Introduction

multi-way sorting algorithms


Sorting is important. Divide-and-conquer approaches:

- recursively split the input into tiles until the tile size is M (e.g. the cache size)
- sort each tile independently
- combine intermediate results

Two-way approaches:

- two-way distribution (quicksort): log2(n/M) scans to partition the input
- two-way merge sort: log2(n/M) scans to combine intermediate results

Multi-way approaches:

- k-way distribution (sample sort): only logk(n/M) scans to partition
- k-way merge sort: only logk(n/M) scans to combine

Multi-way approaches are beneficial when memory bandwidth is the bottleneck!
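The pass counts above are easy to check with a small script (a sketch; the sizes n and M below are illustrative, not values from the talk):

```python
import math

def passes(n, M, k):
    """Number of partition/merge passes needed to shrink tiles of
    size n down to the base-case size M with k-way splitting."""
    p, tile = 0, n
    while tile > M:
        tile = math.ceil(tile / k)
        p += 1
    return p

# Illustrative sizes: 128M elements, a tile of 4K elements
n, M = 2**27, 2**12
print(passes(n, M, 2))    # two-way: 15 passes over the data
print(passes(n, M, 128))  # 128-way: 3 passes
```

Each pass streams the whole input through memory, so reducing log2(n/M) scans to logk(n/M) directly cuts memory traffic.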


SLIDE 7

NVidia Tesla Architecture


30 Streaming Multiprocessors (SMs) with 8 Scalar Processors (SPs) each

- 240 physical cores overall
- 16 KB shared memory per SM, similar to a CPU L1 cache
- 4 GB global device memory

SLIDE 8

Computing Unified Device Architecture Model


[Figure: CUDA virtualizes a grid of thread blocks (TBlock(0,0), TBlock(0,1), ...) onto the SMs, each with 8 SPs and shared memory, above global memory]

C code structure from the slide:

    int main() {
        // serial code
        Kernel<<<grid, block>>>(/* args */);  // parallel
        // serial code
    }

Similar to the SPMD (single-program multiple-data) model:

- a block of concurrent threads executes a scalar sequential program, a kernel
- thread blocks constitute a grid

SLIDE 9

Performance Guidelines


General pattern in GPU algorithm design:

- decompose the problem into many data-independent sub-problems
- solve the sub-problems by blocks of cooperative parallel threads

Performance guidelines:

- conditional branching: threads should follow the same execution path
- shared memory: exploit the fast on-chip memory
- coalesced global memory operations: group load/store requests to the same memory block
- fewer memory accesses overall

SLIDE 10

Algorithm Overview


SampleSort(e = <e_1, ..., e_n>, k)
begin
    if n < M then return SmallSort(e)
    choose a random sample S = S_1, ..., S_{a·k−1} of e
    Sort(S)
    <s_0, s_1, ..., s_k> := <−∞, S_a, ..., S_{a·(k−1)}, ∞>
    for 1 ≤ i ≤ n do
        find j ∈ {1, ..., k} such that s_{j−1} ≤ e_i ≤ s_j
        place e_i in bucket b_j
    return Concatenate(SampleSort(b_1, k), ..., SampleSort(b_k, k))
end

Algorithm 1: Serial Sample Sort
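Algorithm 1 translates almost line for line into Python (a sketch; the parameter values k = 4, M = 8, a = 2 and the guard against degenerate samples full of duplicates are illustrative additions, not part of the original pseudocode):

```python
import random

def sample_sort(e, k=4, M=8, a=2):
    """Serial sample sort: oversample by a, sort the sample, keep
    every a-th element as a splitter, distribute into k buckets,
    recurse on each bucket, and concatenate the results."""
    if len(e) < M:                        # base case: SmallSort
        return sorted(e)
    sample = sorted(random.choices(e, k=a * k - 1))
    splitters = sample[a - 1::a]          # S_a, S_2a, ..., S_a(k-1)
    buckets = [[] for _ in range(k)]
    for x in e:
        # index of the bucket b_j with s_(j-1) <= x <= s_j
        j = sum(x > s for s in splitters)
        buckets[j].append(x)
    out = []
    for b in buckets:
        if len(b) == len(e):              # degenerate sample (all duplicates)
            out.extend(sorted(b))
        else:
            out.extend(sample_sort(b, k, M, a))
    return out
```

Ties go to the leftmost feasible bucket (`x > s` rather than `>=`), so equal keys never straddle more bucket boundaries than necessary.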

SLIDE 11

High Level GPU Algorithm Design


[Figure: p thread blocks compute bucket indices over the input, a prefix sum over the k × p counter table produces global bucket offsets, and a second pass scatters the elements to the output]

Parameters:

- distribution degree k = 128
- threads per block t = 256
- elements per thread l = 8
- number of blocks p = n/(t · l)

Phase 1. Choose splitters.

Phase 2. Each of the p thread blocks:

- computes the bucket indices id of its elements, 0 ≤ id ≤ k − 1
- stores the bucket sizes in DRAM

Phase 3. A prefix sum over the k × p table yields the global offsets.

Phase 4. As in Phase 2, but compute local offsets; local + global offsets give the final positions.
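Phases 2-4 can be simulated serially to make the offset arithmetic concrete (a sketch: the tiles here are strided slices rather than the contiguous chunks a real thread block would read, and all names are illustrative):

```python
from itertools import accumulate

def distribute(e, splitters, p):
    """One k-way distribution step, phases 2-4 of the GPU design,
    simulated serially: p 'thread blocks' count their bucket sizes,
    an exclusive prefix sum over the k x p counter table yields
    global offsets, and a second pass scatters the elements."""
    k = len(splitters) + 1
    tiles = [e[i::p] for i in range(p)]          # p input tiles
    # Phase 2: each block counts how many of its elements hit each bucket
    counts = [[0] * k for _ in range(p)]
    for b, tile in enumerate(tiles):
        for x in tile:
            counts[b][sum(x > s for s in splitters)] += 1
    # Phase 3: exclusive prefix sum in bucket-major order -> global offsets
    flat = [counts[b][j] for j in range(k) for b in range(p)]
    offsets = [0] + list(accumulate(flat))[:-1]
    # Phase 4: recompute bucket indices and scatter to final positions
    out = [None] * len(e)
    cursor = {(j, b): offsets[j * p + b] for j in range(k) for b in range(p)}
    for b, tile in enumerate(tiles):
        for x in tile:
            j = sum(x > s for s in splitters)
            out[cursor[(j, b)]] = x
            cursor[(j, b)] += 1
    return out
```

After one call, the output holds bucket b_1 first, then b_2, and so on; within a bucket, the elements of block 0 precede those of block 1, exactly the order the bucket-major prefix sum encodes.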


SLIDE 15

Flavor of Implementation Details

computing element bucket indices


bt = <s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, ...>

TraverseTree(e_i)
begin
    j := 1
    repeat log k times          // go left or right?
        j := 2j + (e_i > bt[j])
    j := j − k + 1              // bucket index
end

[Figure: the splitters arranged as an implicit binary search tree with s_{k/2} at the root, s_{k/4} and s_{3k/4} below it, then s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}; < and > branches lead down to the leaves, buckets b_1, ..., b_8]

- Store the tree in fast shared memory
- Use predicated instructions: no path divergence
- Unroll the loop
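A Python sketch of the same idea (assumes k is a power of two, as in the k = 128 configuration; build_tree and the heap-layout details are an illustrative reconstruction):

```python
def build_tree(splitters):
    """Lay out the k-1 sorted splitters as an implicit binary search
    tree bt in heap order: bt[1] is the root, node j has children
    2j and 2j+1, and bt[0] is unused."""
    k = len(splitters) + 1
    bt = [None] * k
    def fill(j, lo, hi):                  # put the median of the range at node j
        if lo > hi:
            return
        mid = (lo + hi) // 2
        bt[j] = splitters[mid]
        fill(2 * j, lo, mid - 1)
        fill(2 * j + 1, mid + 1, hi)
    fill(1, 0, k - 2)
    return bt

def traverse_tree(x, bt, k):
    """log k branch-free steps: the comparison result is added to
    the node index instead of branching, which is what predicated
    instructions achieve on the GPU."""
    j = 1
    for _ in range(k.bit_length() - 1):   # repeat log2(k) times
        j = 2 * j + (x > bt[j])           # go left or right
    return j - k + 1                      # 1-based bucket index
```

Because every thread executes the same log k index updates regardless of its key, a warp never diverges while classifying elements.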


SLIDE 19

Experimental Evaluation


NVidia Tesla C1060:

- 30 SMs × 8 SPs = 240 cores
- 4 GB RAM

Data types:

- 32- and 64-bit integers
- key-value pairs

Distributions:

- Uniform, Gaussian, Bucket Sorted, Staggered, Deterministic Duplicates

GPU sorting algorithms compared:

- CUDPP and Thrust radix sort
- Thrust merge sort
- quicksort
- hybrid sort
- bbsort

SLIDE 20

Experimental Evaluation

Uniform 32-bit integers


[Plot: sorting rate (sorted elements / time [µs], 20-180) vs. number of elements (2^17 to 2^28) for cudpp radix, thrust radix, quick, bbsort, hybrid (float), and sample]

SLIDE 21

Experimental Evaluation

Uniform key-value pairs


[Plot: sorting rate (sorted elements / time [µs], 20-130) vs. number of elements (2^19 to 2^27) for thrust radix, cudpp radix, thrust merge, and sample]

SLIDE 22

Experimental Evaluation

Uniform 64-bit integers


[Plot: sorting rate (sorted elements / time [µs], 20-80) vs. number of elements (2^17 to 2^27) for thrust radix and sample]

SLIDE 23

Future Trends

Fermi architecture


From the NVIDIA Fermi whitepaper: for existing applications that use shared memory as a software-managed cache, code can be streamlined to take advantage of the hardware caching system, while still having access to at least 16 KB of shared memory for explicit thread cooperation. Applications that do not use shared memory automatically benefit from the L1 cache, allowing high-performance CUDA programs to be built with minimum time and effort.

Summary table:

GPU                                  G80                GT200              Fermi
Transistors                          681 million        1.4 billion        3.0 billion
CUDA Cores                           128                240                512
Double Precision FP Capability       None               30 FMA ops/clock   256 FMA ops/clock
Single Precision FP Capability       128 MAD ops/clock  240 MAD ops/clock  512 FMA ops/clock
Special Function Units (SFUs) / SM   2                  2                  4
Warp schedulers (per SM)             1                  1                  2
Shared Memory (per SM)               16 KB              16 KB              48 KB or 16 KB (configurable)
L1 Cache (per SM)                    None               None               16 KB or 48 KB (configurable)
L2 Cache                             None               None               768 KB
ECC Memory Support                   No                 No                 Yes
Concurrent Kernels                   No                 No                 Up to 16
Load/Store Address Width             32-bit             32-bit             64-bit

Second Generation Parallel Thread Execution ISA

Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set. PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, PTX instructions are translated to machine instructions by the GPU driver. The primary goals of PTX are:

- Provide a stable ISA that spans multiple GPU generations
- Achieve full GPU performance in compiled applications
- Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
- Provide a code distribution ISA for application and middleware developers
- Provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines
- Facilitate hand-coding of libraries and performance kernels
- Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores

What about memory bandwidth? With no significant improvements in sight there, multi-way approaches are likely to be even more beneficial. Multi-way merge sort?

SLIDE 24


Thank you!