

SLIDE 1

/INFOMOV/ Optimization & Vectorization

  • J. Bikker - Sep-Nov 2018 - Lecture 10: “GPGPU (3)”

Welcome!

SLIDE 2

Today’s Agenda:

▪ Don’t Trust the Template
▪ The Prefix Sum
▪ Parallel Sorting
▪ Stream Filtering
▪ Optimizing GPU code

SLIDE 3

Introduction

Beyond “Let OpenCL Sort Them Out”

void Kernel::Run( const size_t count )
{
    cl_int error;
    CHECKCL( error = clEnqueueNDRangeKernel( queue, kernel, 1, 0, &count, 0, 0, 0, 0 ) );
    clFinish( queue );
}

Here:
▪ A queue is a command queue: we can have more than one*.
▪ ‘1’ is the dimensionality of the task (can be 1, 2 or 3).
▪ ‘count’ is the number of threads we are spawning (a multiple of the local work size, if specified).
▪ ‘0’ is the local work size (0 means: not specified, let OpenCL decide).

*: http://sa09.idav.ucdavis.edu/docs/SA09-opencl-dg-events-stream.pdf


SLIDE 4

Introduction

Beyond “Let OpenCL Sort Them Out”

void Kernel::Run( const size_t count )
{
    cl_int error;
    CHECKCL( error = clEnqueueNDRangeKernel( queue, kernel, 1, 0, &count, 0, 0, 0, 0 ) );
    clFinish( queue );
}

2D task:
▪ Improves data locality
▪ Improves flow coherence
▪ Also available in CUDA
▪ Fractal data stalls: few
▪ Fractal flow coherence: apparently not a big deal.

SLIDE 5

Introduction

Beyond “Let OpenCL Sort Them Out”

void Kernel::Run( const size_t count )
{
    cl_int error;
    CHECKCL( error = clEnqueueNDRangeKernel( queue, kernel, 1, 0, &count, 0, 0, 0, 0 ) );
    clFinish( queue );
}

Task size:
▪ Use to balance thread count / registers per thread.
▪ Tune per device class.
▪ Auto-tuning possible?
▪ Do not trust OpenCL.

SLIDE 6

Introduction

Beyond “Let OpenCL Sort Them Out”

void Kernel::Run( const size_t count )
{
    cl_int error;
    CHECKCL( error = clEnqueueNDRangeKernel( queue, kernel, 1, 0, &count, 0, 0, 0, 0 ) );
    clFinish( queue );
}

clFinish:
▪ Not a good idea: it leaves the CPU idling.
▪ The queue already enforces order.
▪ Multiple queues are useful.
▪ How about running part of the fractal on the CPU?
▪ How do we balance CPU / GPU work?

SLIDE 7

Introduction

Beyond “Let OpenCL Sort Them Out”

void Kernel::Run( const size_t count )
{
    cl_int error;
    CHECKCL( error = clEnqueueNDRangeKernel( queue, kernel, 1, 0, &count, &localCount, 0, 0, 0 ) );
    clFinish( queue );
}

A thread knows its place in the global task set, but also in the local group:

__kernel void DoWork()
{
    // get the index of the thread in the global pool
    int idx = get_global_id( 0 );
    // get the index of the thread in the local set
    int localIdx = get_local_id( 0 );
    // determine in which warp the current thread is
    int warpIdx = localIdx >> 5;
    // determine in which lane we are
    int lane = localIdx & 31;
}
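The shift and mask are just integer division and modulo by the warp size of 32. A quick CPU-side sanity check of the bit arithmetic (function names are illustrative, not OpenCL API):

```cpp
#include <cassert>

// The bit tricks from the kernel above, assuming 32-wide warps:
// dividing by 32 is a right-shift by 5, modulo 32 is a mask with 31.
int warp_of( int localIdx ) { return localIdx >> 5; } // == localIdx / 32
int lane_of( int localIdx ) { return localIdx & 31; } // == localIdx % 32
```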

SLIDE 8

Introduction

Beyond “Many Independent Threads”

Many algorithms do not lend themselves to GPGPU, at least not at first sight:

▪ Divide and conquer algorithms
  ▪ Sorting
▪ Anything with an unpredictable number of iterations
  ▪ Walking a linked list or a tree
  ▪ Ray tracing
▪ Anything that needs to emit data in a compacted array
  ▪ Run-length encoding
  ▪ Duplicate removal
▪ Anything that requires inter-thread synchronization
  ▪ Hash table
  ▪ Linked list

In fact, lock-free implementations of linked lists and hash tables exist and can be used in CUDA, see e.g.: Misra & Chaudhuri, 2012, Performance Evaluation of Concurrent Lock-free Data Structures on GPUs. Note that the possibility of using linked lists on the GPU does not automatically justify their use.

SLIDE 9

Introduction

Beyond “Many Independent Threads”

Many algorithms do not lend themselves to GPGPU. In many cases, we have to design entirely new algorithms. In some cases, we can use two important building blocks:

▪ Sort
▪ Prefix sum

SLIDE 10

Today’s Agenda:

▪ Don’t Trust the Template
▪ The Prefix Sum
▪ Parallel Sorting
▪ Stream Filtering
▪ Optimizing GPU code

SLIDE 11

Prefix Sum

The prefix sum (or cumulative sum) of a sequence of numbers is a second sequence of numbers consisting of the running totals of the input sequence:

Input: y0, y1, y2
Output: y0, y0 + y1, y0 + y1 + y2 (inclusive) or 0, y0, y0 + y1 (exclusive).

Example (here, addition is used; more generally we can use an arbitrary binary associative operator):

input:     1 2 2 1  4  3
inclusive: 1 3 5 6 10 13
exclusive: 0 1 3 5  6 10

SLIDE 12

Prefix Sum

In C++:

// exclusive scan
out[0] = 0;
for ( i = 1; i < n; i++ ) out[i] = in[i-1] + out[i-1];

(Note the obvious loop dependency.)


SLIDE 13

Prefix Sum

The prefix sum is used for compaction. Given: kernel K which may or may not produce output for further processing.



SLIDE 14

Prefix Sum - Compaction

Given: kernel K which may or may not produce output for further processing.


boolean array:        0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0
exclusive prefix sum: 0 0 0 1 1 1 2 3 4 4 4 4 4 5 5 5 5 5 5 5

Each flagged element’s scan value is its index in the output array; the last scan value plus the last flag gives the output array size (here: 5).


SLIDE 15

Prefix Sum

out[0] = 0;
for ( i = 1; i < n; i++ ) out[i] = in[i-1] + out[i-1];

In parallel:

for ( d = 1; d <= log2 n; d++ )
    for all k in parallel do
        if k >= 2^(d-1)
            x[k] += x[k - 2^(d-1)]

Example (n = 16); the first row is the input, each following row is the array after one pass:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
1 2 3 4 4 4 4 4 4 4 4 4 4 4 4 4
1 2 3 4 5 6 7 8 8 8 8 8 8 8 8 8
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

For each pass:
▪ Each thread in the warp reads data
▪ Each thread in the warp sums 2 input elements
▪ Each thread in the warp writes data.

SLIDE 16

Prefix Sum

out[0] = 0;
for ( i = 1; i < n; i++ ) out[i] = in[i-1] + out[i-1];

In parallel:

for ( d = 1; d <= log2 n; d++ )
    for all k in parallel do
        if k >= 2^(d-1)
            x[k] += x[k - 2^(d-1)]

Notes:
▪ The scan happens in-place. This is only correct if we have 32 input elements and the scan is done in a single warp. Otherwise we need to double buffer for correct results.
▪ The span of the algorithm is log n, but the work is n log n; it is not work-efficient. Efficient algorithms for large inputs can be found in: Merrill & Garland, 2016, Single-pass Parallel Prefix Scan with Decoupled Look-back.


SLIDE 17

Prefix Sum

out[0] = 0;
for ( i = 1; i < n; i++ ) out[i] = in[i-1] + out[i-1];

In OpenCL:

int warp_scan_exclusive( int* input, int lane )
{
    if (lane > 0 ) input[lane] += input[lane - 1];
    if (lane > 1 ) input[lane] += input[lane - 2];
    if (lane > 3 ) input[lane] += input[lane - 4];
    if (lane > 7 ) input[lane] += input[lane - 8];
    if (lane > 15) input[lane] += input[lane - 16];
    return (lane > 0) ? input[lane - 1] : 0;
}
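This routine is only correct because all 32 lanes execute each statement in lockstep: all reads of a step happen before its writes. A CPU model has to make that explicit with a copy per step (names are illustrative):

```cpp
#include <cassert>
#include <vector>

// CPU model of warp_scan_exclusive for one 32-element warp.
std::vector<int> warp_scan_exclusive_sim( std::vector<int> input )
{
    for (int ofs = 1; ofs < 32; ofs <<= 1) // offsets 1, 2, 4, 8, 16
    {
        std::vector<int> prev = input; // lockstep: everyone reads, then writes
        for (int lane = 0; lane < 32; lane++)
            if (lane >= ofs) input[lane] = prev[lane] + prev[lane - ofs];
    }
    // input now holds the inclusive scan; shift right for the exclusive one
    std::vector<int> result( 32 );
    for (int lane = 0; lane < 32; lane++)
        result[lane] = (lane > 0) ? input[lane - 1] : 0;
    return result;
}
```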



SLIDE 19


Take-away:

▪ A “scan” is useful for compacting arrays.
▪ The naïve scan has an obvious loop dependency.
▪ It is nevertheless possible to run the scan in parallel.
▪ Especially at the warp level, this heavily leans on a core GPU mechanism: lockstep SIMT processing.


SLIDE 20

Today’s Agenda:

▪ Don’t Trust the Template
▪ The Prefix Sum
▪ Parallel Sorting
▪ Stream Filtering
▪ Optimizing GPU code

SLIDE 21

Sorting

GPU Sorting

Observation:
▪ We frequently need sorting in our algorithms.

But:
▪ Most sorting algorithms are divide and conquer algorithms.


SLIDE 22


GPU Sorting: Selection Sort

__kernel void Sort( __global int* in, __global int* out )
{
    int i = get_global_id( 0 );
    int n = get_global_size( 0 );
    int iKey = in[i];
    // compute position of in[i] in output
    int pos = 0;
    for( int j = 0; j < n; j++ )
    {
        int jKey = in[j]; // broadcasted
        bool smaller = (jKey < iKey) || (jKey == iKey && j < i);
        pos += (smaller) ? 1 : 0;
    }
    out[pos] = iKey;
}
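The same rank-sort logic as a sequential CPU sketch, with the outer loop standing in for the GPU threads; the tie-break on index guarantees every element a unique output position:

```cpp
#include <cassert>
#include <vector>

// CPU version of the rank-sort kernel above: each "thread" i counts how
// many keys precede in[i]; ties are broken by index.
std::vector<int> rank_sort( const std::vector<int>& in )
{
    int n = (int)in.size();
    std::vector<int> out( n );
    for (int i = 0; i < n; i++) // one iteration per GPU thread
    {
        int iKey = in[i], pos = 0;
        for (int j = 0; j < n; j++)
        {
            int jKey = in[j];
            bool smaller = (jKey < iKey) || (jKey == iKey && j < i);
            pos += smaller ? 1 : 0;
        }
        out[pos] = iKey;
    }
    return out;
}
```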


SLIDE 25

GPU Sorting

Bubblesort:

Sorting network:
▪ Size: number of comparisons (in this case: 5 + 4 + 3 + 2 + 1 = 15)
▪ Depth: number of sequential steps (in this case: 9)



SLIDE 26

GPU Sorting

On a parallel device, the optimal sorting network is the one with the smallest depth / smallest span / shortest critical path.


SLIDE 27

GPU Sorting



Bitonic sort*,**:

▪ Work: O(n log² n)
▪ Span: O(log² n)

*: Batcher, ’68, Sorting Networks and their Applications.
**: Bitonic Sorting Network for n Not a Power of 2;
http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/oddn.htm

In the figure:
▪ Compare element in top half with element in bottom half
▪ Subdivide red box and recurse until a single comparison is left
▪ Ensure that the largest number is at the arrow point
▪ All boxes can execute in parallel.
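A compact CPU sketch of the bitonic network for power-of-two n, in the standard index-arithmetic formulation (an illustrative sequential version; on a GPU the inner loop over i is one parallel pass):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Bitonic sorting network. Every compare-exchange of an inner pass is
// independent, which is what makes the network GPU-friendly; there are
// log2(n) * (log2(n) + 1) / 2 passes in total (the log^2 n span).
void bitonic_sort( std::vector<int>& a )
{
    int n = (int)a.size(); // assumed to be a power of two
    for (int k = 2; k <= n; k <<= 1)         // bitonic sequence size
        for (int j = k >> 1; j > 0; j >>= 1) // compare distance
            for (int i = 0; i < n; i++)      // parallel on the GPU
            {
                int l = i ^ j;               // partner of element i
                if (l > i)
                {
                    bool up = (i & k) == 0;  // sort direction of this box
                    if (up ? a[i] > a[l] : a[i] < a[l]) std::swap( a[i], a[l] );
                }
            }
}
```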

SLIDE 28

GPU Sorting

Full implementations of Bitonic sort for OpenCL:

▪ https://github.com/Juanjdurillo/bitonicsortopencl
▪ http://www.bealto.com/gpu-sorting_parallel-bitonic-1.html

Also efficient on GPU: Radix sort.

Side note:
▪ https://github.com/komrad36/SortingNetworks (SSE sorting networks, 2-6 elements)
▪ http://pages.ripco.net/~jgamble/nw.html (optimal sorting networks for N<=32)
(that last one tells us that data for a warp can be sorted in 31 parallel steps)


SLIDE 29

Today’s Agenda:

▪ Don’t Trust the Template
▪ The Prefix Sum
▪ Parallel Sorting
▪ Stream Filtering
▪ Optimizing GPU code

SLIDE 30


Stream Filtering

__kernel void UpdateTanks( int taskID, __global Tank* tank )
{
    int idx = get_global_id( 0 );
    UpdatePosition();
    ConsiderFiring();
    Render();
    if (tank[idx].IsOffscreen())
    {
        RemoveFromGrid();
        Respawn();
        AddToGrid();
        ConsiderFiring();
    }
}

Compaction

SLIDE 31


Stream Filtering

int offscreen[…];
int offscreenCount = 0;

__kernel void UpdateTanks( int taskID, __global Tank* tank )
{
    int idx = get_global_id( 0 );
    UpdatePosition();
    ConsiderFiring();
    Render();
    if (tank[idx].IsOffscreen())
        offscreen[atomic_inc( &offscreenCount )] = idx;
}

__kernel void HandleOffscreenTanks( __global Tank* tank )
{
    …
}

Compaction

Reducing the number of atomics:

▪ Store ‘1’ or ‘0’ in an array depending on condition;
▪ Do a prefix sum over this array;
▪ Do a single atomic_add, which yields the base index;
▪ Use the values in the array as offsets to this base index.
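A sequential sketch of these four steps, with 32-element chunks standing in for warps (the `compact_grouped` helper is illustrative, not from the slides):

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Each chunk scans its flags locally and issues exactly one atomic_add
// to claim a base index in the shared output array.
std::vector<int> compact_grouped( const std::vector<int>& flags )
{
    std::atomic<int> counter{ 0 };
    std::vector<int> out( flags.size() );
    for (size_t base = 0; base < flags.size(); base += 32) // one "warp" per chunk
    {
        size_t end = std::min( base + 32, flags.size() );
        // exclusive prefix sum of this chunk's flags
        std::vector<int> scan( end - base, 0 );
        for (size_t i = 1; i < scan.size(); i++)
            scan[i] = scan[i - 1] + flags[base + i - 1];
        int count = scan.back() + flags[end - 1];
        int groupBase = counter.fetch_add( count ); // the one atomic per chunk
        // scatter: scan values are offsets to the base index
        for (size_t i = base; i < end; i++)
            if (flags[i]) out[groupBase + scan[i - base]] = (int)i;
    }
    out.resize( counter.load() );
    return out;
}
```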
SLIDE 32


Stream Filtering

__local int array[256], baseIdx[16];
int offscreen[…];
int offscreenCount = 0;

__kernel void UpdateTanks( int taskID, __global Tank* tank )
{
    int idx = get_global_id( 0 );
    UpdatePosition();
    ConsiderFiring();
    Render();
    int isOffscreen = tank[idx].IsOffscreen() ? 1 : 0;
    // get index of thread in local group
    int lidx = get_local_id( 0 );
    // store in array
    array[lidx] = isOffscreen;
    // perform warp scan
    int count = WarpScan( &array[(lidx >> 5) << 5] );
    if ((lidx & 31) == 0) baseIdx[lidx >> 5] = atomic_add( &offscreenCount, count );
    // store in ‘offscreen’ array
    if (isOffscreen) offscreen[baseIdx[lidx >> 5] + array[lidx]] = idx;
}

SLIDE 33


Stream Filtering

int offscreen[…];
int offscreenCount = 0;

__kernel void UpdateTanks( int taskID, __global Tank* tank )
{
    int idx = get_global_id( 0 );
    UpdatePosition();
    ConsiderFiring();
    Render();
    if (tank[idx].IsOffscreen())
        offscreen[atomic_inc( &offscreenCount )] = idx;
}

__kernel void HandleOffscreenTanks( __global Tank* tank )
{
    …
}

Compaction

How many threads execute this kernel?

(CopyFromDevice() for just a single variable?)

SLIDE 34


Stream Filtering

int offscreen[…];
int offscreenCount = 0;

__kernel void UpdateTanks( int taskID, __global Tank* tank )
{
    int idx = get_global_id( 0 );
    UpdatePosition();
    ConsiderFiring();
    Render();
    if (tank[idx].IsOffscreen())
        offscreen[atomic_inc( &offscreenCount )] = idx;
}

__kernel void HandleOffscreenTanks( __global Tank* tank )
{
    if (get_global_id( 0 ) >= offscreenCount) return;
    …
}

Compaction

We start the kernel for all tanks.

This is fast, because all relevant tanks are handled by the first N threads; the remaining threads return immediately.

SLIDE 35

Today’s Agenda:

▪ Don’t Trust the Template
▪ The Prefix Sum
▪ Parallel Sorting
▪ Stream Filtering
▪ Optimizing GPU code

SLIDE 36

Optimizing GPGPU


1. Optimize memory usage
   ▪ Read data from global memory once
   ▪ Use local memory when possible
   ▪ Careful: reading the same global address in 256 threads is not a good idea!

2. Make sure there is enough work to hide latency
   ▪ On AMD: use multiples of 64 threads (called a ‘wavefront’)
   ▪ Tweak manually for performance, ideally per vendor / device

3. Minimize the number of host-to-device transfers, then their size

4. Minimize the number of kernel invocations

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide

Instead of

temp = input[3]; // input is in global mem

use:

if (get_local_id( 0 ) == 0) local = input[3];
barrier( CLK_LOCAL_MEM_FENCE );
temp = local;

Faster OpenCL

SLIDE 37

Optimizing GPGPU


Faster OpenCL

Smaller things:
▪ Use float4 whenever possible
▪ Use predication rather than control flow
▪ Bypass short-circuiting
▪ Remove conditional code
▪ AOS vs SOA performance
▪ Reducing atomics
▪ Reduced precision math

if (A > B) C += D; else C -= D;

Replace this with:

int factor = (A > B) ? 1 : -1;
C += factor * D;

if (x == 1) r = 0.5;
if (x == 2) r = 1.0;

becomes

r = select( r, 0.5, x == 1 );
r = select( r, 1.0, x == 2 );

if (a && b && c && d) { … }

becomes

bool cond = a && b && c && d;
if (cond) { … }

native_log, native_exp, native_sqrt, native_sin, native_pow, …

Cache line: 128B
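The first two rewrites, as plain C++ sketches (`select()` is OpenCL; on the CPU a ternary expresses the same thing, and the function names are illustrative):

```cpp
#include <cassert>

// Branch-free version of: if (A > B) C += D; else C -= D;
int update_c( int A, int B, int C, int D )
{
    int factor = (A > B) ? 1 : -1; // predication instead of control flow
    return C + factor * D;
}

// Branch-free version of the two conditional assignments to r.
// OpenCL's select( a, b, cond ) evaluates to cond ? b : a.
float select_r( float r, int x )
{
    r = (x == 1) ? 0.5f : r;
    r = (x == 2) ? 1.0f : r;
    return r;
}
```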

SLIDE 38

Today’s Agenda:

▪ Don’t Trust the Template
▪ The Prefix Sum
▪ Parallel Sorting
▪ Stream Filtering
▪ Optimizing GPU code

SLIDE 39

/INFOMOV/ END of “GPGPU (3)”

next lecture: “fixed point”