

  1. GPGPU: General-Purpose Computation on GPUs
     Prekshu Ajmera, 03d05006

  2. Overview
     1. Motivation: Why GPGPU?
     2. CPU-GPU Analogies
     3. GPU Resources: The Graphics Pipeline; Textures; Programmable Vertex Processor; Fixed-Function Rasterizer; Programmable Fragment Processor; Feedback
     4. GPU Program Flow Control
     5. GPGPU Techniques: Reduction (Max); Sort; Search; Matrix Multiplication

  3. Why GPGPU? The GPU has evolved into an extremely flexible and powerful processor:
     - Programmability: programmable pixel and vertex engines; high-level language support
     - Precision: 32-bit floating point throughout the pipeline
     - Performance: 3 GHz Pentium 4, theoretical: 12 GFLOPS; GeForce 6800 Ultra, observed: 53 GFLOPS

  4. CPU-GPU Analogies

  5. GPU Textures = CPU Arrays. Textures are the GPU equivalent of arrays.
     - Native data layout: rectangular (2D) textures
     - Size limitation: 4096 texels in each dimension
     - Data formats: one channel (LUMINANCE) to four channels (RGBA); a natural data structure for vector data types with 2 to 4 components
     - Supported floating-point formats: 16-bit, 24-bit, and 32-bit
     - Most basic operations: array (memory) read == texture lookup; array index == texture coordinates (see the sketch below)
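To make the arrays == textures analogy concrete, here is a minimal C++/OpenGL sketch (an existing GL context and the ARB_texture_float extension are assumed; uploadArrayAsTexture, W, H, and data are illustrative names) that stores a CPU float array as a 2D float texture, after which every texture lookup corresponds to one array read:

```cpp
// Sketch: store a CPU float array as a 2D RGBA float texture.
// Assumes a current OpenGL context and ARB_texture_float support.
#include <GL/gl.h>
#include <GL/glext.h>   // for GL_RGBA32F_ARB

GLuint uploadArrayAsTexture(const float* data, int W, int H) {
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // Nearest filtering: a texel is an exact array element, not a blend.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    // Four floats per texel (RGBA) == four packed array elements.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, W, H, 0,
                 GL_RGBA, GL_FLOAT, data);
    return tex;   // array read == texture lookup at (i, j)
}
```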

  6. Feedback = Texture Update
     - Feedback: results of an intermediate computation are used as an input to the next pass.
     - Trivially implemented on the CPU with variables and arrays that can be both read and written.
     - Not trivial on GPUs: the output of the fragment processor is always written to the frame buffer. Think of the frame buffer as a 2D array that can't be read directly.
     - Solution? Use a texture as the frame buffer so the GPU can write intermediate results into it. This is called render-to-texture.

  7. GPU Fragment Programs = CPU Loop Bodies
     - Consider a 2D grid. A CPU implementation uses a pair of nested loops to iterate over every cell in the grid, performing the same computation at each cell.
     - GPUs do not have the capability to run such an inner loop over each texel in a texture. Solution?
     - The fragment pipeline is designed to perform identical computations at each fragment simultaneously, as if there were a processor for each fragment.
     - Thus the GPU analog of computation inside nested loops over an array is a fragment program applied in data-parallel fashion to each fragment (sketched below).
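A sketch of the analogy in C++ (f, cpuGrid, and the flat grid layout are illustrative): the CPU writes the loops explicitly, while on the GPU only the loop body is written and the rasterizer supplies the iteration:

```cpp
#include <vector>

// f is a stand-in for whatever per-cell computation the kernel does.
float f(float v) { return 2.0f * v + 1.0f; }

// CPU version: explicit nested loops visit every cell of the grid.
void cpuGrid(const std::vector<float>& in, std::vector<float>& out,
             int W, int H) {
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            out[y * W + x] = f(in[y * W + x]);   // same work at each cell
}

// GPU analog: only the loop *body* exists, written as a fragment
// program; drawing a W x H quad makes the hardware invoke it once
// per fragment, in parallel. In pseudo-shader form:
//   out(x, y) = f( textureLookup(in, x, y) );
```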

  8. The Modern Graphics Pipeline
     - Each stage in the graphics pipeline can be independently configured through graphics APIs like OpenGL or DirectX.
     - Programmable graphics pipeline:
       - Fixed-function operations on vertices, like transformations and lighting calculations, are replaced by a user-defined vertex program.
       - Fixed-function operations on fragments that determine a fragment's color are replaced by a user-defined fragment program.

  9. Textures. Textures are the equivalent of arrays.
     - Size limitation: 4096 texels in each dimension
     - Native data layout: rectangular (2D) textures
     - Data formats: one channel (LUMINANCE) to four channels (RGBA)
     - Supported floating-point formats: 16-bit, 24-bit, and 32-bit

  10. Programmable Vertex Processor
     - Input: stream of geometry.
     - Transforms each vertex in homogeneous coordinates (XYZW) independently of the other vertices; works on 4-tuples simultaneously.
     - Fully programmable (SIMD/MIMD); processes 4-component vectors (RGBA/XYZW).
     - Capable of scatter but only limited gather:
       - Can change the location of the current vertex (scatter).
       - Cannot read information from other vertices.
       - Limited gather: can fetch from a texture, but can't fetch from the current vertex stream.
     - Output: stream of transformed vertices and triangles.

  11. Fixed-Function Rasterizer
     - Input: stream of transformed vertices and triangles.
     - Generates a fragment for each pixel covered by the transformed geometry.
     - Interpolates vertex attributes linearly.
     - Output: stream of fragments.
     - Fixed-function part of the pipeline.

  12. Programmable Fragment Processor
     - Input: stream of fragments with interpolated attributes.
     - Applies the fragment program to each fragment independently.
     - Capable of gather but not scatter: indirect memory read (texture fetch), but no indirect memory write; the output address is fixed to a specific pixel.
     - Fully programmable (SIMD); processes 4-component vectors (RGBA/XYZW).
     - Output: pixels to be displayed.

  13. Feedback: Render-to-Texture
     - Textures can be used as render targets! Within a pass, a texture is either read-only or write-only.
     - Feedback loop: render intermediate results into a texture, then use it as input in the subsequent pass (see the ping-pong sketch below).
     - Visualization: render a single quad into the frame buffer, textured with the last intermediate result.
     - Further processing on the CPU: read back the texture data.
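One way to build the feedback loop, sketched in C++ against the EXT_framebuffer_object extension (pbuffers were the older route; drawFullScreenQuad, texA, texB, and numPasses are placeholders). Because a texture is read-only or write-only within a pass, two textures are alternated ("ping-ponged"):

```cpp
#include <algorithm>           // std::swap
#include <GL/gl.h>
#include <GL/glext.h>          // EXT_framebuffer_object entry points

void drawFullScreenQuad();     // hypothetical helper: draws one big quad

// Ping-pong between two float textures so each pass reads the previous
// result and writes the next one. Reading the texture that is currently
// the render target is undefined, hence the two textures.
void runPasses(GLuint texA, GLuint texB, int numPasses) {
    GLuint fbo;
    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);

    GLuint src = texA, dst = texB;
    for (int pass = 0; pass < numPasses; ++pass) {
        // Redirect fragment output into dst instead of the frame buffer.
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
                                  GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, dst, 0);
        glBindTexture(GL_TEXTURE_2D, src);  // last result is the input
        drawFullScreenQuad();               // fragment program does the work
        std::swap(src, dst);                // feedback: output -> next input
    }
}
```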

  14. GPGPU Terminology

  15. Arithmetic Intensity
     - Arithmetic intensity = math operations per word transferred (computation / bandwidth).
     - Ideal applications to target for GPGPU have:
       - Large data sets
       - High parallelism
       - Minimal dependencies between data elements
       - High arithmetic intensity
       - Lots of work to do without CPU intervention
     (a worked example follows)
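A rough worked example of the metric (the code and the counts are illustrative):

```cpp
// Arithmetic intensity = math ops / words transferred.
// SAXPY moves 3 words per element (read x[i], read y[i], write y[i])
// for 2 ops (one multiply, one add):
void saxpy(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // intensity ~ 2/3: bandwidth-bound
}
// By contrast, an n x n matrix multiply does ~2n^3 ops on ~3n^2 words,
// so its intensity grows with n -- the profile GPGPU wants.
```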

  16. Data Streams & Kernels
     - Streams: collections of records requiring similar computation; they provide data parallelism.
     - Kernels: functions applied to each element in a stream (transforms, PDEs, …).
       - No dependencies between stream elements.
       - Encourage high arithmetic intensity.
     (sketched below)
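A CPU-side sketch of the stream/kernel model using std::transform (the kernel shown is arbitrary):

```cpp
#include <algorithm>
#include <vector>

int main() {
    // A stream: records that all need the same computation.
    std::vector<float> stream = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> out(stream.size());

    // A kernel: applied independently to every element -- no
    // inter-element dependencies, so the mapping is data-parallel.
    auto kernel = [](float v) { return v * v + 1.0f; };
    std::transform(stream.begin(), stream.end(), out.begin(), kernel);
}
```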

  17. Scatter vs. Gather
     - Gather: indirect read from memory (x = a[i]).
       - Naturally maps to a texture fetch.
       - Used to access data structures and data streams.
     - Scatter: indirect write to memory (a[i] = x).
       - Difficult to emulate: it would require either changing the frame buffer write location of a fragment or a dependent texture write, and neither operation is available on GPUs.
     - Solution? Rewrite the problem in terms of gather (see below), or use the vertex processor.
       - Needed for building many data structures.
       - Usually done on the CPU.
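The two access patterns, and the rewrite-as-gather trick, sketched in C++ (idx and inv are hypothetical index arrays; idx is assumed to be a permutation of 0..n-1 with inverse inv):

```cpp
#include <vector>

void patterns(const std::vector<float>& a, std::vector<float>& out,
              const std::vector<int>& idx, const std::vector<int>& inv) {
    const int n = static_cast<int>(a.size());

    // Gather: indirect READ (x = a[i]) -- maps directly to a texture fetch.
    for (int i = 0; i < n; ++i)
        out[i] = a[idx[i]];

    // Scatter: indirect WRITE (a[i] = x) -- a fragment cannot choose
    // its output pixel, so this pattern has no direct GPU equivalent.
    for (int i = 0; i < n; ++i)
        out[idx[i]] = a[i];

    // Rewritten as gather: invert the index map (often on the CPU),
    // then every output element pulls its own value instead of being
    // pushed to.
    for (int j = 0; j < n; ++j)
        out[j] = a[inv[j]];
}
```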

  18. GPU Program Flow Control
     - The highly parallel nature of GPUs!
     - Limitations of branching on GPUs?
     - Techniques for iteration and decision making?

  19. Hardware Mechanisms for Flow Control. Three basic implementations of data-parallel branching on GPUs:
     - Predication
     - Single Instruction Multiple Data (SIMD)
     - Multiple Instruction Multiple Data (MIMD)

  20. Hardware Mechanisms for Flow Control: Predication
     - No true data-dependent branch instructions.
     - The GPU evaluates both sides of a branch and discards one of the results based on the value of the boolean branch condition (emulated below).
     - Disadvantage: evaluating both branches can be costly.
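A CPU emulation of predication (expensiveThen and expensiveElse are stand-ins for the two sides of the branch):

```cpp
// Hypothetical per-fragment computations for the two branch sides.
float expensiveThen(float x) { return x * x; }
float expensiveElse(float x) { return x + 1.0f; }

// Predication: both sides are computed on every fragment, then the
// boolean condition selects which result survives -- no jump is taken.
float predicated(bool cond, float x) {
    float a = expensiveThen(x);   // always evaluated
    float b = expensiveElse(x);   // always evaluated
    return cond ? a : b;          // select; both costs were paid
}
```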

  21. Hardware Mechanisms for Flow Control
     - SIMD branching:
       - All active processors execute the same instructions at the same time.
       - When the branch condition evaluates identically on all active processors, only the taken side of the branch is evaluated.
       - When it differs, both sides are evaluated and the results predicated. Thus divergence in the branching of simultaneously processed fragments can reduce performance.
     - MIMD branching: different processors can follow different paths through the program.

  22. Other Techniques for Flow Control
     - Static branch resolution: avoid branching inside inner loops; results in loops that contain efficient, branch-free code (sketched after this list).
     - Pre-computation: when the result of a branch is constant over a large domain of input values or a number of iterations, evaluate the branch only when its result is known to change, and store the result for use over many subsequent iterations.
     - Z-cull: a feature that avoids shading pixels that will not be seen; fragments that fail the depth test are discarded before their pixel colors are calculated in the fragment processor. A lot of work saved!
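A sketch of static branch resolution on a 2D grid (interior, boundary, and the flat layout are illustrative):

```cpp
#include <vector>

float interior(int x, int y) { return 1.0f; }  // stand-in computations
float boundary(int x, int y) { return 0.0f; }

// Instead of asking "am I on the boundary?" inside the inner loop,
// resolve the branch statically: one branch-free pass over the
// interior plus small passes over the edges. On the GPU these become
// two fragment programs drawn over two regions of the grid.
void staticResolve(std::vector<float>& out, int W, int H) {
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x)
            out[y * W + x] = interior(x, y);   // no branch here
    for (int x = 0; x < W; ++x) {              // top and bottom edges
        out[x] = boundary(x, 0);
        out[(H - 1) * W + x] = boundary(x, H - 1);
    }
    for (int y = 1; y < H - 1; ++y) {          // left and right edges
        out[y * W] = boundary(0, y);
        out[y * W + W - 1] = boundary(W - 1, y);
    }
}
```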

  23. GPGPU: 4 Problems
     - Reduction: Max
     - Sorting
     - Searching
     - Matrix Multiplication

  24. Simple Fragment Application Flow: write data to texture -> load fragment program -> bind fragment program -> bind textures -> configure OpenGL for 1:1 rendering -> draw a large quad -> write results to texture. (The 1:1 setup is sketched below.)
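A sketch of the 1:1 configuration and the "large quad" draw in C++/OpenGL (a current GL context is assumed; function names are illustrative):

```cpp
#include <GL/gl.h>
#include <GL/glu.h>

// Configure OpenGL so one output pixel maps to exactly one texel of a
// W x H texture; the fragment program then runs once per data element.
void configureOneToOne(int W, int H) {
    glViewport(0, 0, W, H);                    // one pixel per texel
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluOrtho2D(0.0, W, 0.0, H);                // orthographic projection
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
}

// Drawing one large quad then invokes the kernel over the whole array.
void drawLargeQuad(int W, int H) {
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(0, 0);
    glTexCoord2f(1, 0); glVertex2f((float)W, 0);
    glTexCoord2f(1, 1); glVertex2f((float)W, (float)H);
    glTexCoord2f(0, 1); glVertex2f(0, (float)H);
    glEnd();
}
```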

  25. Reduction (Max)
     - Goal: find the maximum element in an array of n elements.
     - Approach: each fragment processor finds the max of 4 adjacent array elements (each pass processes 16 elements).
     - Input: array of n elements stored as a 2D texture.
     - Output: array of n/4 elements to the frame buffer (each pass overwrites the array).

  26. Reduction on GPU
     - Store the array as a 2D texture.
     - The max() comparison runs as a fragment program: each fragment compares 4 texels and returns their max.
     - The frame buffer stores the max from each fragment (the buffer is one quarter the original array size).
     - The frame buffer overwrites the previous texture (one pass is emulated below).
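A CPU emulation of one reduction pass (function and parameter names are illustrative):

```cpp
#include <algorithm>
#include <vector>

// One reduction pass: each output cell is the max of a 2x2 block of
// the input, shrinking the array by a factor of 4. On the GPU this
// loop body is the fragment program, drawn into a half-width,
// half-height render target; repeated passes leave a single texel.
void reducePassMax(const std::vector<float>& in, std::vector<float>& out,
                   int W, int H) {             // W, H: input dimensions
    for (int y = 0; y < H / 2; ++y)
        for (int x = 0; x < W / 2; ++x) {
            float a = in[(2 * y)     * W + (2 * x)];
            float b = in[(2 * y)     * W + (2 * x + 1)];
            float c = in[(2 * y + 1) * W + (2 * x)];
            float d = in[(2 * y + 1) * W + (2 * x + 1)];
            out[y * (W / 2) + x] = std::max(std::max(a, b),
                                            std::max(c, d));
        }
}
```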

  27. Another look at Reduction Loop

  28. Sorting on GPU
     - Sort an array of n floats.
     - CPU implementation: standard merge sort, O(n lg n).
     - GPU implementation: bitonic merge sort, O(lg² n) data-parallel passes (O(n lg² n) comparisons in total).

  29. The Bitonic Merge Sort: a classic (parallel) algorithm
     - Repeatedly build bitonic lists and then sort them.
     - A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing:
       - List A: (3, 4, 7, 8), monotonically increasing
       - List B: (6, 5, 2, 1), monotonically decreasing
       - List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic

  30. Similar to parallelizing Classic Merge Sort

  31. The Bitonic Sort
     - Input: 3 7 4 8 6 2 1 5
     - 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
     - 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
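For reference, a compact C++ version of the full bitonic network (a sketch, assuming n is a power of two): each innermost sweep over i corresponds to one data-parallel compare-and-swap pass on the GPU, and there are O(lg² n) such sweeps.

```cpp
#include <algorithm>
#include <vector>

// Bitonic merge sort over n = 2^k elements. On the GPU, each pass over
// i is a fragment program drawn over the whole array; here the passes
// run sequentially on the CPU.
void bitonicSort(std::vector<float>& a) {
    const size_t n = a.size();                    // must be a power of two
    for (size_t k = 2; k <= n; k <<= 1)           // bitonic list size
        for (size_t j = k >> 1; j > 0; j >>= 1)   // compare distance
            for (size_t i = 0; i < n; ++i) {      // one data-parallel pass
                size_t partner = i ^ j;
                if (partner > i) {
                    bool ascending = (i & k) == 0;
                    if (ascending ? a[i] > a[partner]
                                  : a[i] < a[partner])
                        std::swap(a[i], a[partner]);
                }
            }
}
```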
