GPGPU: General-Purpose Computation on GPUs
Prekshu Ajmera (03d05006)
Overview
- 1. Motivation: Why GPGPU ?
- 2. CPU-GPU Analogies
- 3. GPU Resources
  - The Graphics Pipeline
  - Textures
  - Programmable Vertex Processor
  - Fixed-Function Rasterizer
  - Programmable Fragment Processor
  - Feedback
- 4. GPU Program Flow Control
- 5. GPGPU Techniques
  - Reduction: Max
  - Sort
  - Search
  - Matrix Multiplication
Why GPGPU ?
- The GPU has evolved into an extremely flexible and powerful processor.

Programmability
- Programmable pixel and vertex engines
- High-level language support

Precision
- 32-bit floating point throughout the pipeline

Performance
- 3 GHz Pentium 4, theoretical: 12 GFLOPS
- GeForce 6800 Ultra, observed: 53 GFLOPS
CPU-GPU Analogies
GPU Textures = CPU Arrays
Textures are the equivalent of arrays.
Native data layout: Rectangular (2D) textures.
Size limitation: 4096 texels in each dimension.
Data formats: One channel (LUMINANCE) to four channels (RGBA).
They provide a natural data structure for vector data types with 2 to 4 components.
Supported floating-point formats: 16-bit, 24-bit, and 32-bit.

Most basic operation:
- array (memory) read == texture lookup
- array offset == texture coordinates
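The array-to-texture analogy can be sketched in plain Python. The width `W` and the row-major layout below are assumptions for illustration, not part of any GPU API:

```python
# Hypothetical sketch: mapping a 1D CPU array offset to 2D texture
# coordinates, assuming a texture of width W with row-major layout.
W = 4  # assumed texture width

def offset_to_texcoord(i, width=W):
    """1D array offset -> (x, y) texel coordinate."""
    return (i % width, i // width)

def texcoord_to_offset(x, y, width=W):
    """(x, y) texel coordinate -> 1D array offset."""
    return y * width + x

# Offset 6 in a width-4 texture lands at column 2, row 1.
assert offset_to_texcoord(6) == (2, 1)
assert texcoord_to_offset(2, 1) == 6
```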
Feedback = Texture Update
Feedback: results of an intermediate computation used as an input to the next pass.

Trivially implemented on a CPU using variables and arrays that can be both read and written.

Not trivial on GPUs:
- Output of the fragment processor is always written to the frame buffer.
- Think of the frame buffer as a 2D array that cannot be read directly.

Solution?

Use a texture as the frame buffer so that the GPU can write intermediate results to it. This is called Render-to-Texture.
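A minimal sketch of the feedback loop, with plain Python lists standing in for textures; the `run_passes` helper and its kernel are hypothetical stand-ins, not GPU code:

```python
# Sketch of the render-to-texture feedback idea: two buffers, one
# read-only "texture" and one write-only "render target" per pass,
# swapped between passes so each pass reads the previous result.
def run_passes(data, kernel, n_passes):
    src = list(data)
    dst = [0] * len(data)
    for _ in range(n_passes):
        for i in range(len(src)):
            dst[i] = kernel(src, i)   # read src only, write dst only
        src, dst = dst, src           # swap: output becomes next input
    return src

# Three doubling passes over [1, 2, 3].
result = run_passes([1, 2, 3], lambda s, i: s[i] * 2, 3)
assert result == [8, 16, 24]
```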
GPU Fragment Programs = CPU Loop Bodies
Consider a 2D grid. A CPU implementation uses a pair of nested loops to iterate over each cell in the grid and perform the same computation at each cell.

GPUs cannot perform this inner loop over each texel in a texture.

Solution?

The fragment pipeline is designed to perform identical computations at each fragment simultaneously; it is as if there were a processor for each fragment. Thus the GPU analog of computation inside nested loops over an array is a fragment program applied in data-parallel fashion to each fragment.
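The loop-body analogy can be sketched as follows; `kernel` is a hypothetical stand-in for a fragment program, and the grid sizes are arbitrary:

```python
# Sketch (not GPU code): the CPU nested-loop form of a grid computation
# next to its data-parallel "fragment program" form.
def kernel(x, y, grid):
    # the same computation at every cell; here, simply doubling
    return grid[y][x] * 2

W, H = 4, 3
grid = [[y * W + x for x in range(W)] for y in range(H)]

# CPU: explicit nested loops over the grid
out_cpu = [[kernel(x, y, grid) for x in range(W)] for y in range(H)]

# GPU analogy: the same kernel mapped over all (x, y) "fragments"
coords = [(x, y) for y in range(H) for x in range(W)]
out_gpu = [kernel(x, y, grid) for (x, y) in coords]

assert out_gpu == [v for row in out_cpu for v in row]
```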
The Modern Graphics Pipeline
Each stage in the graphics pipeline can be independently configured through graphics APIs like OpenGL or DirectX.

Programmable Graphics Pipeline
- Fixed-function operations on vertices, like transformations and lighting calculations, are replaced by a user-defined vertex program.
- Fixed-function operations on fragments that determine a fragment's color are replaced by a user-defined fragment program.
Textures
Textures are the equivalent of arrays.
Size limitation: 4096 texels in each dimension.
Native data layout: Rectangular (2D) textures.
Data formats: One channel (LUMINANCE) to four channels (RGBA).
Supported floating-point formats: 16-bit, 24-bit, and 32-bit.
Programmable Vertex Processor
Input: Stream of geometry.
Transforms each vertex in homogeneous coordinates (XYZW) independently of the other vertices; works on 4-tuples simultaneously.

Fully programmable (SIMD/MIMD).

Processes 4-component vectors (RGBA/XYZW).

Capable of scatter but not gather:
- Can change the location of the current vertex
- Cannot read information from other vertices

Limited gather capability: can fetch from a texture, but cannot fetch from the current vertex stream.
Output: Stream of transformed vertices and triangles.
Fixed-Function Rasterizer
Input: Stream of transformed vertices and triangles.
Generates a fragment for each pixel covered by the transformed geometry.
Interpolates vertex attributes linearly.
Output: Stream of fragments.
Fixed-function part of the pipeline.
Programmable Fragment Processor
Input: Stream of fragments with interpolated attributes.
Applies a fragment program to each fragment independently.

Capable of gather but not scatter:
- Indirect memory read (texture fetch), but no indirect memory write
- Output address fixed to a specific pixel

Fully programmable (SIMD).

Processes 4-component vectors (RGBA/XYZW).
Output: Pixels to be displayed.
Feedback: Render-To-Texture
Textures can be used as render targets!
Textures are either read-only or write-only.
Feedback loop: render intermediate results into a texture, then use it as input in a subsequent pass.
Visualization: Render single quad into frame buffer textured with last intermediate result.
Further processing on CPU: Read back texture data.
GPGPU Terminology
Arithmetic Intensity

Arithmetic intensity = math operations per word transferred (computation / bandwidth).
Ideal applications to target GPGPU have:
- Large data sets
- High parallelism
- Minimal dependencies between data elements
- High arithmetic intensity
- Lots of work to do without CPU intervention
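As a rough illustration of the definition above, arithmetic intensity can be estimated by counting math operations against words moved; the kernel and operation counts below are illustrative assumptions, not measurements:

```python
# Sketch: estimating arithmetic intensity (math ops per word transferred)
# for a hypothetical saxpy-style kernel y[i] = a * x[i] + y[i].
def arithmetic_intensity(math_ops, words_transferred):
    return math_ops / words_transferred

# Per element: 1 multiply + 1 add = 2 ops; read x[i], read y[i],
# write y[i] = 3 words moved.
ai = arithmetic_intensity(2, 3)
assert abs(ai - 2 / 3) < 1e-9  # low intensity: bandwidth-bound
```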
Data Streams & Kernels
Streams
- Collection of records requiring similar computation
- Thus they provide data parallelism
Kernels
- Functions applied to each element in the stream (transforms, PDEs, …)
- No dependencies between stream elements
- Encourage high arithmetic intensity
Scatter vs. Gather
Gather
- Indirect read from memory ( x = a[i] )
- Naturally maps to a texture fetch
- Used to access data structures and data streams
Scatter
- Indirect write to memory ( a[i] = x )
- Difficult to emulate: requires changing a fragment's frame-buffer write location, or a dependent texture write
- Neither operation is available on GPUs
Solution ?
- Rewrite the problem in terms of gather
- Use the vertex processor
- Scatter is needed for building many data structures; it is usually done on the CPU
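The gather rewrite can be sketched in plain Python; this assumes the index mapping is invertible (a permutation), which is the simple case:

```python
# Sketch: a scatter (a[index[i]] = x[i]) rewritten as a gather, assuming
# the index map is a permutation so it can be inverted up front.
index = [2, 0, 3, 1]            # scatter destinations
x = [10.0, 20.0, 30.0, 40.0]

# Scatter form (not directly available in a fragment program):
a_scatter = [0.0] * 4
for i, dst in enumerate(index):
    a_scatter[dst] = x[i]

# Gather form: precompute the inverse map (often done on the CPU),
# then each output location reads its own source, like a texture fetch.
inverse = [0] * 4
for i, dst in enumerate(index):
    inverse[dst] = i
a_gather = [x[inverse[j]] for j in range(4)]

assert a_gather == a_scatter
```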
GPU Program Flow Control
Highly parallel nature of GPUs !
Limitations of branching on GPUs ?
Techniques for iteration and decision making ?
Hardware mechanisms for Flow Control
- Predication
- Single Instruction Multiple Data (SIMD)
- Multiple Instruction Multiple Data (MIMD)

These are the three basic implementations of data-parallel branching on GPUs.
Hardware mechanisms for Flow Control
Predication
- No true data-dependent branch instructions
- The GPU evaluates both sides of a branch and discards one of the results based on the value of the boolean branch condition
- Disadvantage: evaluating both branches can be costly
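The predication idea can be sketched as a select: both sides always run, then the condition picks one result. The helper name is hypothetical:

```python
# Sketch of predication: both sides of the branch are evaluated and
# one result is selected by the boolean condition, as the GPU does.
def predicated_select(cond, then_fn, else_fn, v):
    a = then_fn(v)              # both sides always evaluated...
    b = else_fn(v)
    return a if cond else b     # ...then one result is kept

assert predicated_select(True,  lambda v: v + 1, lambda v: v - 1, 5) == 6
assert predicated_select(False, lambda v: v + 1, lambda v: v - 1, 5) == 4
```

This is why evaluating both branches is costly: the work of the discarded side is wasted on every fragment.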
Hardware mechanisms for Flow Control
SIMD branching
- All active processors execute the same instructions at the same time
- When the evaluation of a branch condition is identical on all active processors, only the taken side of the branch is evaluated
- When it differs, both sides are evaluated and the results predicated
- Thus, divergence in the branching of simultaneously processed fragments can lead to reduced performance
MIMD branching
Different processors can follow different paths through the program
Other Techniques for Flow Control
Static Branch Resolution
- Avoid branching inside inner loops
- Results in loops that contain efficient code without branches
Pre-Computation
- Result of a branch is constant over a large domain of input values or a number of iterations
- Evaluate branches only when results are known to change
- Store the results for use over many subsequent iterations
Z-Cull
- Feature to avoid shading pixels that will not be seen
- Discard fragments that fail the depth test before their pixel colors are calculated in the fragment processor
- A lot of work saved!
GPGPU : 4 Problems
- Reduction: Max
- Sorting
- Searching
- Matrix Multiplication
Configure OpenGL for 1:1 Rendering
Simple Fragment Application Flow
1. Write data to texture
2. Load fragment program
3. Bind fragment program
4. Bind textures
5. Draw large quad
6. Write results to texture
Reduction (max)
Goal
Find maximum element in an array of n elements.
Approach
- Each fragment processor will find the max of 4 adjacent array elements (each pass processes 16 elements)
- Input: array of n elements stored as a 2D texture
- Output: array of n/4 elements written to the frame buffer (each pass overwrites the array)
Reduction on GPU
- Store array as a 2D texture
- The max() comparison runs as a fragment program
- Each fragment compares 4 texels and returns the max
- Frame buffer stores the max from each fragment (the buffer quarters the original array size)
- Frame buffer overwrites the previous texture
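The multi-pass reduction can be sketched in plain Python; this assumes a square texture with power-of-two side, and a list of lists stands in for the texture:

```python
# Sketch of the multi-pass max reduction: in each pass, every output
# "fragment" takes the max of a 2x2 block of the input texture, so the
# array shrinks by a factor of 4 per pass until one value remains.
def reduce_max(tex):
    """tex: square 2D list whose side is a power of two (assumed)."""
    n = len(tex)
    while n > 1:
        n //= 2
        tex = [[max(tex[2 * y][2 * x],     tex[2 * y][2 * x + 1],
                    tex[2 * y + 1][2 * x], tex[2 * y + 1][2 * x + 1])
                for x in range(n)] for y in range(n)]
    return tex[0][0]

tex = [[3, 7, 4, 8],
       [6, 2, 1, 5],
       [9, 0, 2, 3],
       [4, 1, 8, 6]]
assert reduce_max(tex) == 9   # 16 elements -> 4 -> 1 in two passes
```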
Another look at Reduction Loop
Sorting on GPU
- Sort an array of n floats
- CPU implementation: standard merge sort in O(n log n)
- GPU implementation: bitonic merge sort in O(log² n) passes
The Bitonic Merge Sort
- A classic (parallel) algorithm
- Repeatedly build bitonic lists and then sort them
- A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing

List A: (3, 4, 7, 8), monotonically increasing
List B: (6, 5, 2, 1), monotonically decreasing
List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic

- Similar to parallelizing classic merge sort
The Bitonic Sort

[Slide animation: a sorting network over positions 1-8, shown stage by stage]
- 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
- 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
- Sort the bitonic lists
- 2x monotonic lists: (3,4,7,8) (6,5,2,1)
- 1x bitonic list: (3,4,7,8, 6,5,2,1)
- Sort the bitonic list to obtain the fully sorted array
Complexity Analysis
- Separate rendering pass for each set of swaps
- O(log² n) passes
- Each pass performs n compare/swaps
- Total compare/swaps: O(n log² n)
- Limitations of the GPU cost us a factor of O(log n) over the best CPU-based sorting algorithms
Computational time complexity using n processors:

Parallel merge sort
- O(n), but unbalanced processor load and communication

Parallel quicksort
- O(n), but unbalanced processor load, and communication can degenerate to O(n²)

Odd-even merge sort & bitonic merge sort
- O(log² n)

Bitonic merge sort has been a popular choice for parallel sorting.
GPU Bitonic Sort
- Store array as a 2D texture
- Bitonic sort runs as a fragment program
- Each array position sorts against a partner position determined by the current stage and step of the bitonic sort
- The sorted array position is updated in the frame buffer
- Frame buffer overwrites the texture
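The stage/step structure can be sketched with the classic bitonic sorting network; the outer loops play the role of rendering passes, and the partner position is computed from the element index, as described above. This assumes the array length is a power of two:

```python
# Sketch of bitonic merge sort as a network of stages and steps.
# On the GPU each inner loop body would run as one fragment per
# array position; here it is serialized in plain Python.
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two (assumed)."""
    n = len(a)
    k = 2                       # stage: size of bitonic lists being built
    while k <= n:
        j = k // 2              # step: distance to the compared partner
        while j >= 1:
            for i in range(n):
                partner = i ^ j           # partner position for index i
                if partner > i:
                    # direction alternates so each k-block becomes bitonic
                    if ((i & k) == 0 and a[i] > a[partner]) or \
                       ((i & k) != 0 and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

assert bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]) == [1, 2, 3, 4, 5, 6, 7, 8]
```

Each (stage, step) pair corresponds to one rendering pass, giving the O(log² n) pass count from the complexity analysis.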
GPU Matrix Multiplication
- Store each matrix as a 2D texture
- Matrix multiplication runs as a fragment program
- Each fragment loads a 4x4 block from each texture and calls a hardware matrix multiplication operation
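The per-fragment work can be sketched as a plain 4x4 matrix product; this Python function stands in for the hardware matrix multiply, and the test matrices are illustrative:

```python
# Sketch: the 4x4 product each fragment would compute on its loaded
# blocks; a plain triple loop stands in for the hardware operation.
def mat4_mul(A, B):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
M = [[i * 4 + j for j in range(4)] for i in range(4)]
assert mat4_mul(M, I) == M   # multiplying by the identity is a no-op
```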
Conclusions
GPU easily outperforms CPU when:
- Problem is suited to parallelism
- Data set is large (but not larger than video memory)
- GPU instruction set can accommodate the needs of the problem in an efficient manner