 
GPGPU: General-Purpose Computation on GPUs
Prekshu Ajmera
03d05006
Overview
1. Motivation: Why GPGPU?
2. CPU-GPU Analogies
3. GPU Resources
   - The Graphics Pipeline
   - Textures
   - Programmable Vertex Processor
   - Fixed-Function Rasterizer
   - Programmable Fragment Processor
   - Feedback
4. GPU Program Flow Control
5. GPGPU Techniques
   - Reduction: Max
   - Sort
   - Search
   - Matrix Multiplication
Why GPGPU?
- The GPU has evolved into an extremely flexible and powerful processor:
  - Programmability: programmable pixel and vertex engines; high-level language support
  - Precision: 32-bit floating point throughout the pipeline
  - Performance: 3 GHz Pentium 4 theoretical: 12 GFLOPS; GeForce 6800 Ultra observed: 53 GFLOPS
CPU-GPU Analogies
GPU Textures = CPU Arrays
- Textures are the equivalent of arrays.
- Native data layout: rectangular (2D) textures.
- Size limitation: 4096 texels in each dimension.
- Data formats: one channel (LUMINANCE) to four channels (RGBA); a natural data structure for vector data with 2 to 4 components.
- Supported floating-point formats: 16-bit, 24-bit, 32-bit.
- Most basic operations (sketched below):
  - array (memory) read == texture lookup
  - array offset (index) == texture coordinate
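As a rough sketch of the analogy (the helper names here are illustrative, not from any particular API): a 1D array of n elements can be packed row-major into a width x height texture, and the usual index arithmetic becomes the texture-coordinate mapping:

```cpp
#include <cstddef>

// Illustrative helpers: map a 1D array index to 2D texel coordinates
// for an array packed row-major into a (width x height) texture.
struct TexelCoord { std::size_t x, y; };

TexelCoord indexToTexel(std::size_t i, std::size_t width) {
    return { i % width, i / width };      // column, row
}

// The inverse mapping: texel coordinates back to the 1D index,
// i.e. the GPU analogue of "array offset == texture coordinate".
std::size_t texelToIndex(TexelCoord t, std::size_t width) {
    return t.y * width + t.x;
}
```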
Feedback = Texture Update
- Feedback: results of an intermediate computation are used as input to the next pass.
- Trivially implemented on the CPU using variables and arrays that can be both read and written.
- Not trivial on GPUs:
  - The output of the fragment processor is always written to the frame buffer.
  - Think of the frame buffer as a 2D array that cannot be read directly.
- Solution? Use a texture as the frame buffer, so the GPU can write intermediate results into it. This is called Render-to-Texture.
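A minimal render-to-texture sketch using the OpenGL framebuffer-object API (the slides predate this API; at the time, extensions such as EXT_framebuffer_object or pbuffers played the same role). The names width, height, numPasses, and drawFullScreenQuad are assumptions for illustration; a GL context and extension loader are presumed to be set up:

```cpp
#include <utility>   // std::swap; GL headers via your loader (GLEW/GLAD)

// Sketch: ping-pong render-to-texture with two float textures and one FBO.
GLuint tex[2], fbo;
glGenTextures(2, tex);
for (int i = 0; i < 2; ++i) {
    glBindTexture(GL_TEXTURE_2D, tex[i]);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, width, height, 0,
                 GL_RGBA, GL_FLOAT, nullptr);   // float texture = float array
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
}
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);

int src = 0, dst = 1;
for (int pass = 0; pass < numPasses; ++pass) {
    // Write this pass's results into tex[dst]...
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex[dst], 0);
    // ...while reading the previous pass's results from tex[src].
    glBindTexture(GL_TEXTURE_2D, tex[src]);
    drawFullScreenQuad();   // hypothetical helper: runs the fragment program
    std::swap(src, dst);    // feedback: this pass's output is the next input
}
```

Note that within any one pass each texture is either read or written, never both, which is why two textures are ping-ponged.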
GPU Fragment Programs = CPU Loop Bodies
- Consider a 2D grid. A CPU implementation uses a pair of nested loops to iterate over each cell in the grid and perform the same computation at each cell.
- GPUs have no capability to run this inner loop over each texel in a texture.
- Solution? The fragment pipeline is designed to perform identical computations at each fragment simultaneously; it is similar to having a processor for each fragment.
- Thus, the GPU analogue of the computation inside nested loops over an array is a fragment program applied in data-parallel fashion to each fragment. (See the sketch below.)
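A hedged illustration of the correspondence (plain C++, not actual shader code; the function names are ours):

```cpp
#include <vector>

// CPU version: nested loops visit every cell and apply the same computation.
void stepCPU(std::vector<float>& grid, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            grid[y * w + x] = 0.5f * grid[y * w + x];   // the "loop body"
}

// GPU analogue: only the loop BODY is written. The rasterizer supplies one
// (x, y) position per fragment, and the hardware runs this body for all
// fragments in parallel -- the nested loops disappear.
float fragmentKernel(float texel) {
    return 0.5f * texel;   // same computation, no explicit loop
}
```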
The Modern Graphics Pipeline
- Each stage in the graphics pipeline can be independently configured through graphics APIs like OpenGL or DirectX.
- Programmable graphics pipeline:
  - Fixed-function operations on vertices, such as transformations and lighting calculations, are replaced by a user-defined vertex program.
  - Fixed-function operations on fragments that determine a fragment's color are replaced by a user-defined fragment program.
Textures
- Textures are the equivalent of arrays.
- Size limitation: 4096 texels in each dimension.
- Native data layout: rectangular (2D) textures.
- Data formats: one channel (LUMINANCE) to four channels (RGBA).
- Supported floating-point formats: 16-bit, 24-bit, 32-bit.
Programmable Vertex Processor
- Input: stream of geometry.
- Transforms each vertex in homogeneous coordinates (XYZW) independent of the other vertices; works on 4-tuples simultaneously.
- Fully programmable (SIMD/MIMD); processes 4-component vectors (RGBA/XYZW).
- Capable of scatter but not gather:
  - Can change the location of the current vertex.
  - Cannot read information from other vertices.
  - Limited gather: can fetch from a texture, but cannot fetch from the current vertex stream.
- Output: stream of transformed vertices and triangles.
Fixed-Function Rasterizer
- Input: stream of transformed vertices and triangles.
- Generates a fragment for each pixel covered by the transformed geometry.
- Interpolates vertex attributes linearly.
- Output: stream of fragments.
- Fixed-function part of the pipeline.
Programmable Fragment Processor
- Input: stream of fragments with interpolated attributes.
- Applies the fragment program to each fragment independently.
- Capable of gather but not scatter:
  - Indirect memory read (texture fetch), but no indirect memory write.
  - Output address is fixed to a specific pixel.
- Fully programmable (SIMD); processes 4-component vectors (RGBA/XYZW).
- Output: pixels to be displayed.
Feedback: Render-to-Texture
- Textures can be used as render targets!
- Within a pass, a texture is either read-only or write-only.
- Feedback loop: render intermediate results into a texture, then use it as input in the subsequent pass.
- Visualization: render a single quad into the frame buffer, textured with the last intermediate result.
- Further processing on the CPU: read back the texture data.
GPGPU Terminology
Arithmetic Intensity
- Arithmetic intensity = math operations per word transferred (computation / bandwidth).
- Ideal applications to target for GPGPU have:
  - Large data sets
  - High parallelism
  - Minimal dependencies between data elements
  - High arithmetic intensity
  - Lots of work to do without CPU intervention
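As a worked example (ours, not from the slides): a SAXPY-style update y_i = a*x_i + y_i performs 2 math operations per element but moves 3 words of memory, so

```latex
\[
  \text{arithmetic intensity}
  = \frac{\text{math ops}}{\text{words transferred}}
  = \frac{2 \ (\text{multiply} + \text{add})}{3 \ (\text{read } x_i, y_i;\ \text{write } y_i)}
  \approx 0.67 .
\]
```

Kernels that reuse each fetched word many times, such as dense matrix multiplication, have far higher intensity and map much better to the GPU.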
Data Streams & Kernels
- Streams
  - Collections of records requiring similar computation
  - Thus they provide data parallelism
- Kernels
  - Functions applied to each element in a stream: transforms, PDEs, ...
  - No dependencies between stream elements
  - Encourage high arithmetic intensity
Scatter vs. Gather
- Gather
  - Indirect read from memory (x = a[i])
  - Maps naturally to a texture fetch
  - Used to access data structures and data streams
- Scatter
  - Indirect write to memory (a[i] = x)
  - Difficult to emulate: it would require either changing the frame-buffer write location of a fragment or a dependent texture write, and neither operation is available on GPUs.
- Solution?
  - Rewrite the problem in terms of gather (contrasted in the sketch below)
  - Use the vertex processor
  - Needed for building many data structures
  - Usually done on the CPU
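A small CPU-side sketch contrasting the two access patterns (illustrative C++; the arrays are made up):

```cpp
#include <cstdio>

int main() {
    float a[4]   = {10.f, 20.f, 30.f, 40.f};
    int   idx[4] = {3, 1, 0, 2};
    float out[4];

    // Gather: indirect READ (x = a[i]). The GPU analogue is a texture fetch
    // at a computed coordinate, which fragment programs support directly.
    for (int k = 0; k < 4; ++k)
        out[k] = a[idx[k]];

    // Scatter: indirect WRITE (a[i] = x). A fragment cannot choose its
    // output pixel, so this pattern has no direct fragment-program analogue.
    for (int k = 0; k < 4; ++k)
        a[idx[k]] = 2.f * out[k];

    printf("gathered: %g, scattered: %g\n", out[0], a[3]);
    return 0;
}
```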
GPU Program Flow Control
- The highly parallel nature of GPUs!
- Limitations of branching on GPUs?
- Techniques for iteration and decision making?
Hardware Mechanisms for Flow Control
Three basic implementations of data-parallel branching on GPUs:
- Predication
- Single Instruction, Multiple Data (SIMD)
- Multiple Instruction, Multiple Data (MIMD)
Hardware Mechanisms for Flow Control
- Predication
  - No true data-dependent branch instructions
  - The GPU evaluates both sides of a branch and discards one of the results based on the value of the Boolean branch condition.
  - Disadvantage: evaluating both branches can be costly.
Hardware Mechanisms for Flow Control
- SIMD branching
  - All active processors execute the same instructions at the same time.
  - When the branch condition evaluates identically on all active processors, only the taken side of the branch is evaluated.
  - When it differs, both sides are evaluated and the results predicated (see the sketch below).
  - Thus divergence in the branching of simultaneously processed fragments can lead to reduced performance.
- MIMD branching
  - Different processors can follow different paths through the program.
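A minimal sketch of what predication amounts to (illustrative C++, not actual shader code): both sides of the branch are computed, and the condition merely selects which result survives:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    float x = 2.0f;

    // Source-level branch:  result = (x > 0) ? sqrt(x) : -x;
    // What predicated hardware effectively executes: BOTH sides run,
    // then the condition selects one result -- the other work is wasted.
    float thenSide = std::sqrt(x);   // evaluated regardless of the condition
    float elseSide = -x;             // evaluated regardless of the condition
    float result = (x > 0.0f) ? thenSide : elseSide;

    printf("%g\n", result);
    return 0;
}
```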
Other Techniques for Flow Control
- Static branch resolution
  - Avoid branching inside inner loops
  - Results in loops that contain efficient, branch-free code (see the sketch after this list)
- Pre-computation
  - The result of a branch may be constant over a large domain of input values or over a number of iterations.
  - Evaluate branches only when their results are known to change, and store the results for use over many subsequent iterations.
- Z-cull
  - A feature that avoids shading pixels that will not be seen
  - Fragments that fail the depth test are discarded before their pixel colors are calculated in the fragment processor
  - A lot of work saved!
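An illustration of static branch resolution (a hedged C++ sketch under our own assumptions; the slide gives no code): rather than testing for the boundary inside the loop, split the computation so the interior loop is branch-free and boundary cells are handled separately:

```cpp
#include <vector>

// Instead of one loop with a per-cell "am I on the boundary?" branch,
// the decision is hoisted out: a branch-free interior pass plus a
// separate boundary pass.
void stepInterior(std::vector<float>& g, int w, int h) {
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x)        // branch-free inner loop
            g[y * w + x] = 0.25f * (g[y * w + x - 1] + g[y * w + x + 1] +
                                    g[(y - 1) * w + x] + g[(y + 1) * w + x]);
}

void stepBoundary(std::vector<float>& g, int w, int h) {
    for (int x = 0; x < w; ++x)                // top and bottom rows
        g[x] = g[(h - 1) * w + x] = 0.0f;
    for (int y = 0; y < h; ++y)                // left and right columns
        g[y * w] = g[y * w + w - 1] = 0.0f;
}
```

On the GPU, the corresponding trick is to draw the interior as one quad and the boundary as thin quads, each with its own branch-free fragment program.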
GPGPU: 4 Problems
- Reduction: Max
- Sorting
- Searching
- Matrix Multiplication
Simple Fragment Application Flow
1. Write data to texture
2. Load fragment program
3. Bind fragment program
4. Bind textures
5. Configure OpenGL for 1:1 rendering
6. Draw large quad
7. Write results to texture
Reduction (Max)
- Goal: find the maximum element in an array of n elements.
- Approach: each fragment processor finds the max of 4 adjacent array elements (each pass processes 16 elements).
- Input: array of n elements stored as a 2D texture.
- Output: array of n/4 elements to the frame buffer (each pass overwrites the array).
Reduction on GPU
- Store the array as a 2D texture.
- The max() comparison runs as a fragment program.
- Each fragment compares 4 texels and returns the max.
- The frame buffer stores the max from each fragment (the buffer quarters the original array size).
- The frame buffer overwrites the previous texture. (A CPU-side sketch of the pass loop follows.)
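A CPU-side sketch of the same multi-pass reduction (illustrative C++; on the GPU the inner double loop is the fragment program and the output array is the render target). The power-of-two square texture and sample data are our assumptions:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // 16 elements stored as a 4x4 "texture" (power-of-two square assumed).
    int w = 4, h = 4;
    std::vector<float> tex = {3, 1, 4, 1, 5, 9, 2, 6,
                              5, 3, 5, 8, 9, 7, 9, 3};

    // Each pass halves both dimensions: every output "fragment" takes the
    // max of a 2x2 block of input texels, until a single texel remains.
    while (w > 1 && h > 1) {
        int nw = w / 2, nh = h / 2;
        std::vector<float> out(nw * nh);
        for (int y = 0; y < nh; ++y)
            for (int x = 0; x < nw; ++x) {   // this body = the fragment program
                float a = tex[(2 * y) * w + (2 * x)];
                float b = tex[(2 * y) * w + (2 * x + 1)];
                float c = tex[(2 * y + 1) * w + (2 * x)];
                float d = tex[(2 * y + 1) * w + (2 * x + 1)];
                out[y * nw + x] = std::max({a, b, c, d});
            }
        tex.swap(out);                       // output overwrites the input
        w = nw; h = nh;
    }
    printf("max = %g\n", tex[0]);            // prints: max = 9
    return 0;
}
```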
Another look at Reduction Loop
Sorting on GPU
- Sort an array of n floats.
- CPU implementation: standard merge sort in O(n lg n).
- GPU implementation: bitonic merge sort in O(n lg² n).
The Bitonic Merge Sort: A Classic (Parallel) Algorithm
- Repeatedly build bitonic lists and then sort them.
- A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing:
  - List A: (3, 4, 7, 8), monotonically increasing
  - List B: (6, 5, 2, 1), monotonically decreasing
  - List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic
Similar to parallelizing Classic Merge Sort
The Bitonic Sort
- Input: 3 7 4 8 6 2 1 5
- 8 monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
- 4 bitonic lists: (3,7) (4,8) (6,2) (1,5)
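A complete bitonic merge sort as a hedged CPU sketch (illustrative C++, using the slide's input; on the GPU each (k, j) stage is one rendering pass, with the stage parameters handed to the fragment program):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    float a[8] = {3, 7, 4, 8, 6, 2, 1, 5};   // n must be a power of two
    const int n = 8;

    // k = size of the bitonic lists being merged; j = compare distance.
    // Each (k, j) pair is one data-parallel stage: every element is
    // compared with its partner at distance j, independently of the rest.
    for (int k = 2; k <= n; k *= 2)
        for (int j = k / 2; j > 0; j /= 2)
            for (int i = 0; i < n; ++i) {     // on a GPU: one pass, all i at once
                int partner = i ^ j;
                if (partner > i) {
                    bool ascending = (i & k) == 0;   // direction of this sublist
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }

    for (int i = 0; i < n; ++i) printf("%g ", a[i]);   // 1 2 3 4 5 6 7 8
    printf("\n");
    return 0;
}
```

The independence of the compare-and-swap operations within a stage is exactly what makes the algorithm map onto the fragment pipeline.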