GPGPU: General-Purpose Computation on GPUs
Prekshu Ajmera (03d05006)
Overview
- 1. Motivation: Why GPGPU ?
- 2. CPU-GPU Analogies
- 3. GPU Resources
  - The Graphics Pipeline
  - Textures
  - Programmable Vertex Processor
  - Fixed-Function Rasterizer
  - Programmable Fragment Processor
  - Feedback
- 4. GPU Program Flow Control
- 5. GPGPU Techniques
  - Reduction: Max
  - Sort
  - Search
  - Matrix Multiplication
Why GPGPU ?
- The GPU has evolved into an extremely flexible and powerful processor.

Programmability
- Programmable pixel and vertex engines
- High-level language support

Precision
- 32-bit floating point throughout the pipeline

Performance
- 3 GHz Pentium 4, theoretical: 12 GFLOPS
- GeForce 6800 Ultra, observed: 53 GFLOPS
CPU-GPU Analogies
GPU Textures = CPU Arrays
Textures are the equivalent of arrays.
Native data layout: Rectangular (2D) textures.
Size limitation: 4096 texels in each dimension.
Data formats: One channel (LUMINANCE) to four channels (RGBA).
They provide a natural data structure for vector data types with 2 to 4 components.
Supported floating-point formats: 16-bit, 24-bit, and 32-bit.

Most basic operation:
- array (memory) read == texture lookup
- array offset == texture coordinates
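The array-to-texture analogy can be sketched in plain Python. The width `W` and the row-major layout below are assumptions for illustration, not part of any GPU API:

```python
# Hypothetical sketch: mapping a 1D CPU array offset to 2D texture
# coordinates, assuming a texture of width W with row-major layout.
W = 4  # assumed texture width

def offset_to_texcoord(i, width=W):
    """1D array offset -> (x, y) texel coordinate."""
    return (i % width, i // width)

def texcoord_to_offset(x, y, width=W):
    """(x, y) texel coordinate -> 1D array offset."""
    return y * width + x

# Offset 6 in a width-4 texture lands at column 2, row 1.
assert offset_to_texcoord(6) == (2, 1)
assert texcoord_to_offset(2, 1) == 6
```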
Feedback = Texture Update
Feedback: results of an intermediate computation used as an input to the next pass.

Trivially implemented on a CPU using variables and arrays that can be both read and written.

Not trivial on GPUs:
- Output of the fragment processor is always written to the frame buffer.
- Think of the frame buffer as a 2D array that cannot be read directly.

Solution?

Use a texture as the frame buffer so that the GPU can write intermediate results to it. This is called Render-to-Texture.
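A minimal sketch of the feedback loop, with plain Python lists standing in for textures; the `run_passes` helper and its kernel are hypothetical stand-ins, not GPU code:

```python
# Sketch of the render-to-texture feedback idea: two buffers, one
# read-only "texture" and one write-only "render target" per pass,
# swapped between passes so each pass reads the previous result.
def run_passes(data, kernel, n_passes):
    src = list(data)
    dst = [0] * len(data)
    for _ in range(n_passes):
        for i in range(len(src)):
            dst[i] = kernel(src, i)   # read src only, write dst only
        src, dst = dst, src           # swap: output becomes next input
    return src

# Three doubling passes over [1, 2, 3].
result = run_passes([1, 2, 3], lambda s, i: s[i] * 2, 3)
assert result == [8, 16, 24]
```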
GPU Fragment Programs = CPU Loop Bodies
Consider a 2D grid. A CPU implementation uses a pair of nested loops to iterate over each cell in the grid and perform the same computation at each cell.

GPUs cannot perform this inner loop over each texel in a texture.

Solution?

The fragment pipeline is designed to perform identical computations at each fragment simultaneously; it is as if there were a processor for each fragment. Thus the GPU analog of computation inside nested loops over an array is a fragment program applied in data-parallel fashion to each fragment.
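The loop-body analogy can be sketched as follows; `kernel` is a hypothetical stand-in for a fragment program, and the grid sizes are arbitrary:

```python
# Sketch (not GPU code): the CPU nested-loop form of a grid computation
# next to its data-parallel "fragment program" form.
def kernel(x, y, grid):
    # the same computation at every cell; here, simply doubling
    return grid[y][x] * 2

W, H = 4, 3
grid = [[y * W + x for x in range(W)] for y in range(H)]

# CPU: explicit nested loops over the grid
out_cpu = [[kernel(x, y, grid) for x in range(W)] for y in range(H)]

# GPU analogy: the same kernel mapped over all (x, y) "fragments"
coords = [(x, y) for y in range(H) for x in range(W)]
out_gpu = [kernel(x, y, grid) for (x, y) in coords]

assert out_gpu == [v for row in out_cpu for v in row]
```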
The Modern Graphics Pipeline
Each stage in the graphics pipeline can be independently configured through graphics APIs like OpenGL or DirectX.

Programmable Graphics Pipeline
- Fixed-function operations on vertices, like transformations and lighting calculations, are replaced by a user-defined vertex program.
- Fixed-function operations on fragments that determine a fragment's color are replaced by a user-defined fragment program.
Textures
Textures are the equivalent of arrays.
Size limitation: 4096 texels in each dimension.
Native data layout: Rectangular (2D) textures.
Data formats: One channel (LUMINANCE) to four channels (RGBA).
Supported floating-point formats: 16-bit, 24-bit, and 32-bit.
Programmable Vertex Processor
Input: Stream of geometry.
Transforms each vertex in homogeneous coordinates (XYZW) independently of the other vertices; works on 4-tuples simultaneously.

Fully programmable (SIMD/MIMD).

Processes 4-component vectors (RGBA/XYZW).

Capable of scatter but not gather:
- Can change the location of the current vertex
- Cannot read information from other vertices

Limited gather capability: can fetch from a texture, but cannot fetch from the current vertex stream.
Output: Stream of transformed vertices and triangles.
Fixed-Function Rasterizer
Input: Stream of transformed vertices and triangles.
Generates a fragment for each pixel covered by the transformed geometry.
Interpolates vertex attributes linearly.
Output: Stream of fragments.
Fixed-function part of the pipeline.
Programmable Fragment Processor
Input: Stream of fragments with interpolated attributes.
Applies a fragment program to each fragment independently.

Capable of gather but not scatter:
- Indirect memory read (texture fetch), but no indirect memory write
- Output address fixed to a specific pixel

Fully programmable (SIMD).

Processes 4-component vectors (RGBA/XYZW).
Output: Pixels to be displayed.
Feedback: Render-To-Texture
Textures can be used as render targets!
Textures are either read-only or write-only.
Feedback loop: render intermediate results into a texture, then use it as input in a subsequent pass.
Visualization: Render single quad into frame buffer textured with last intermediate result.
Further processing on CPU: Read back texture data.
GPGPU Terminology
Arithmetic Intensity

Arithmetic intensity = math operations per word transferred (computation / bandwidth).
Ideal applications to target GPGPU have:
- Large data sets
- High parallelism
- Minimal dependencies between data elements
- High arithmetic intensity
- Lots of work to do without CPU intervention
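As a rough illustration of the definition above, arithmetic intensity can be estimated by counting math operations against words moved; the kernel and operation counts below are illustrative assumptions, not measurements:

```python
# Sketch: estimating arithmetic intensity (math ops per word transferred)
# for a hypothetical saxpy-style kernel y[i] = a * x[i] + y[i].
def arithmetic_intensity(math_ops, words_transferred):
    return math_ops / words_transferred

# Per element: 1 multiply + 1 add = 2 ops; read x[i], read y[i],
# write y[i] = 3 words moved.
ai = arithmetic_intensity(2, 3)
assert abs(ai - 2 / 3) < 1e-9  # low intensity: bandwidth-bound
```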
Data Streams & Kernels
Streams
- Collection of records requiring similar computation
- Thus they provide data parallelism
Kernels
- Functions applied to each element in the stream (transforms, PDEs, …)
- No dependencies between stream elements
- Encourage high arithmetic intensity
Scatter vs. Gather
Gather
- Indirect read from memory ( x = a[i] )
- Naturally maps to a texture fetch
- Used to access data structures and data streams
Scatter
- Indirect write to memory ( a[i] = x )
- Difficult to emulate: requires changing a fragment's frame-buffer write location, or a dependent texture write
- Neither operation is available on GPUs
Solution ?
- Rewrite the problem in terms of gather
- Use the vertex processor
- Scatter is needed for building many data structures; it is usually done on the CPU
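The gather rewrite can be sketched in plain Python; this assumes the index mapping is invertible (a permutation), which is the simple case:

```python
# Sketch: a scatter (a[index[i]] = x[i]) rewritten as a gather, assuming
# the index map is a permutation so it can be inverted up front.
index = [2, 0, 3, 1]            # scatter destinations
x = [10.0, 20.0, 30.0, 40.0]

# Scatter form (not directly available in a fragment program):
a_scatter = [0.0] * 4
for i, dst in enumerate(index):
    a_scatter[dst] = x[i]

# Gather form: precompute the inverse map (often done on the CPU),
# then each output location reads its own source, like a texture fetch.
inverse = [0] * 4
for i, dst in enumerate(index):
    inverse[dst] = i
a_gather = [x[inverse[j]] for j in range(4)]

assert a_gather == a_scatter
```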
GPU Program Flow Control
Highly parallel nature of GPUs !
Limitations of branching on GPUs ?
Techniques for iteration and decision making ?
Hardware mechanisms for Flow Control
- Predication
- Single Instruction Multiple Data (SIMD)
- Multiple Instruction Multiple Data (MIMD)

These are the three basic implementations of data-parallel branching on GPUs.
Hardware mechanisms for Flow Control
Predication
- No true data-dependent branch instructions
- The GPU evaluates both sides of a branch and discards one of the results based on the value of the boolean branch condition
- Disadvantage: evaluating both branches can be costly
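The predication idea can be sketched as a select: both sides always run, then the condition picks one result. The helper name is hypothetical:

```python
# Sketch of predication: both sides of the branch are evaluated and
# one result is selected by the boolean condition, as the GPU does.
def predicated_select(cond, then_fn, else_fn, v):
    a = then_fn(v)              # both sides always evaluated...
    b = else_fn(v)
    return a if cond else b     # ...then one result is kept

assert predicated_select(True,  lambda v: v + 1, lambda v: v - 1, 5) == 6
assert predicated_select(False, lambda v: v + 1, lambda v: v - 1, 5) == 4
```

This is why evaluating both branches is costly: the work of the discarded side is wasted on every fragment.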
Hardware mechanisms for Flow Control
SIMD branching
- All active processors execute the same instructions at the same time
- When the evaluation of a branch condition is identical on all active processors, only the taken side of the branch is evaluated
- When it differs, both sides are evaluated and the results predicated
- Thus, divergence in the branching of simultaneously processed fragments can lead to reduced performance
MIMD branching
Different processors can follow different paths through the program
Other Techniques for Flow Control
Static Branch Resolution
- Avoid branching inside inner loops
- Results in loops that contain efficient code without branches
Pre-Computation
- Result of a branch is constant over a large domain of input values or a number of iterations
- Evaluate branches only when results are known to change
- Store the results for use over many subsequent iterations
Z-Cull
- Feature to avoid shading pixels that will not be seen
- Discard fragments that fail the depth test before their pixel colors are calculated in the fragment processor
- A lot of work saved!
GPGPU : 4 Problems
- Reduction: Max
- Sorting
- Searching
- Matrix Multiplication
Configure OpenGL for 1:1 Rendering
Simple Fragment Application Flow
1. Write data to texture
2. Load fragment program
3. Bind fragment program
4. Bind textures
5. Draw large quad
6. Write results to texture
Reduction (max)
Goal
Find maximum element in an array of n elements.
Approach
- Each fragment processor will find the max of 4 adjacent array elements (each pass processes 16 elements)
- Input: array of n elements stored as a 2D texture
- Output: array of n/4 elements written to the frame buffer (each pass overwrites the array)
Reduction on GPU
- Store array as a 2D texture
- The max() comparison runs as a fragment program
- Each fragment compares 4 texels and returns the max
- Frame buffer stores the max from each fragment (the buffer quarters the original array size)
- Frame buffer overwrites the previous texture
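The multi-pass reduction can be sketched in plain Python; this assumes a square texture with power-of-two side, and a list of lists stands in for the texture:

```python
# Sketch of the multi-pass max reduction: in each pass, every output
# "fragment" takes the max of a 2x2 block of the input texture, so the
# array shrinks by a factor of 4 per pass until one value remains.
def reduce_max(tex):
    """tex: square 2D list whose side is a power of two (assumed)."""
    n = len(tex)
    while n > 1:
        n //= 2
        tex = [[max(tex[2 * y][2 * x],     tex[2 * y][2 * x + 1],
                    tex[2 * y + 1][2 * x], tex[2 * y + 1][2 * x + 1])
                for x in range(n)] for y in range(n)]
    return tex[0][0]

tex = [[3, 7, 4, 8],
       [6, 2, 1, 5],
       [9, 0, 2, 3],
       [4, 1, 8, 6]]
assert reduce_max(tex) == 9   # 16 elements -> 4 -> 1 in two passes
```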
Another look at Reduction Loop
Sorting on GPU
- Sort an array of n floats
- CPU implementation: standard merge sort in O(n log n)
- GPU implementation: bitonic merge sort in O(log² n) passes
The Bitonic Merge Sort
- A classic (parallel) algorithm
- Repeatedly build bitonic lists and then sort them
- A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing

List A: (3, 4, 7, 8), monotonically increasing
List B: (6, 5, 2, 1), monotonically decreasing
List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic

- Similar to parallelizing classic merge sort
The Bitonic Sort

[Slide animation: a sorting network over positions 1-8, shown stage by stage]
- 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
- 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
- Sort the bitonic lists
- 2x monotonic lists: (3,4,7,8) (6,5,2,1)
- 1x bitonic list: (3,4,7,8, 6,5,2,1)
- Sort the bitonic list to obtain the fully sorted array
Complexity Analysis
- Separate rendering pass for each set of swaps
- O(log² n) passes
- Each pass performs n compare/swaps
- Total compare/swaps: O(n log² n)
- Limitations of the GPU cost us a factor of O(log n) over the best CPU-based sorting algorithms
Computational time complexity using n processors:

Parallel merge sort
- O(n), but unbalanced processor load and communication

Parallel quicksort
- O(n), but unbalanced processor load, and communication can degenerate to O(n²)

Odd-even merge sort & bitonic merge sort
- O(log² n)

Bitonic merge sort has been a popular choice for parallel sorting.
GPU Bitonic Sort
- Store array as a 2D texture
- Bitonic sort runs as a fragment program
- Each array position sorts against a partner position determined by the current stage and step of the bitonic sort
- The sorted array position is updated in the frame buffer
- Frame buffer overwrites the texture
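The stage/step structure can be sketched with the classic bitonic sorting network; the outer loops play the role of rendering passes, and the partner position is computed from the element index, as described above. This assumes the array length is a power of two:

```python
# Sketch of bitonic merge sort as a network of stages and steps.
# On the GPU each inner loop body would run as one fragment per
# array position; here it is serialized in plain Python.
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two (assumed)."""
    n = len(a)
    k = 2                       # stage: size of bitonic lists being built
    while k <= n:
        j = k // 2              # step: distance to the compared partner
        while j >= 1:
            for i in range(n):
                partner = i ^ j           # partner position for index i
                if partner > i:
                    # direction alternates so each k-block becomes bitonic
                    if ((i & k) == 0 and a[i] > a[partner]) or \
                       ((i & k) != 0 and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

assert bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]) == [1, 2, 3, 4, 5, 6, 7, 8]
```

Each (stage, step) pair corresponds to one rendering pass, giving the O(log² n) pass count from the complexity analysis.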
GPU Matrix Multiplication
- Store each matrix as a 2D texture
- Matrix multiplication runs as a fragment program
- Each fragment loads a 4x4 block from each texture and calls a hardware matrix multiplication operation
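The per-fragment work can be sketched as a plain 4x4 matrix product; this Python function stands in for the hardware matrix multiply, and the test matrices are illustrative:

```python
# Sketch: the 4x4 product each fragment would compute on its loaded
# blocks; a plain triple loop stands in for the hardware operation.
def mat4_mul(A, B):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
M = [[i * 4 + j for j in range(4)] for i in range(4)]
assert mat4_mul(M, I) == M   # multiplying by the identity is a no-op
```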
Conclusions
GPU easily outperforms CPU when:
- Problem is suited to parallelism
- Data set is large (but not larger than video memory)
- GPU instruction set can accommodate the needs of the problem in an efficient manner