GPGPU: General-Purpose Computation on GPUs. Prekshu Ajmera (03d05006). PowerPoint PPT presentation.


SLIDE 1

GPGPU: General-Purpose Computation on GPUs

Prekshu Ajmera 03d05006

SLIDE 2

Overview

  • 1. Motivation: Why GPGPU?
  • 2. CPU-GPU Analogies
  • 3. GPU Resources
      – The Graphics Pipeline
      – Textures
      – Programmable Vertex Processor
      – Fixed-Function Rasterizer
      – Programmable Fragment Processor
      – Feedback
  • 4. GPU Program Flow Control
  • 5. GPGPU Techniques
      – Reduction: Max
      – Sort
      – Search
      – Matrix Multiplication

SLIDE 3

Why GPGPU?

  • The GPU has evolved into an extremely flexible and powerful processor.
  • Programmability
      – Programmable pixel and vertex engines
      – High-level language support
  • Precision
      – 32-bit floating point throughout the pipeline
  • Performance
      – 3 GHz Pentium 4 theoretical: 12 GFLOPS
      – GeForce 6800 Ultra observed: 53 GFLOPS

SLIDE 4

CPU-GPU Analogies

SLIDE 5

GPU Textures = CPU Arrays

Textures are the equivalent of arrays.

Native data layout: Rectangular (2D) textures.

Size limitation: 4096 texels in each dimension.

Data formats: One channel (LUMINANCE) to four channels (RGBA).

They provide a natural data structure for vector data types with 2 to 4 components.

Supported floating-point formats: 16-bit, 24-bit, 32-bit.

Most basic operations:

  • array (memory) read == texture lookup
  • array index == texture coordinate

SLIDE 6

Feedback = Texture Update

  • Feedback: results of an intermediate computation used as an input to the next pass.
  • Trivially implemented on a CPU using variables and arrays that can both be read and written.
  • Not trivial on GPUs:
      – Output of the fragment processor is always written to the frame buffer.
      – Think of the frame buffer as a 2-D array that cannot be read directly.

Solution?

Use a texture as the frame buffer so that the GPU can write to it for storing intermediate results. This is called Render-to-Texture.
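Conceptually, render-to-texture turns the feedback loop into a "ping-pong" between two buffers. A minimal CPU sketch in plain Python (all names here are illustrative, not a real graphics API):

```python
# CPU sketch of the render-to-texture feedback loop ("ping-pong"):
# two buffers alternate between the roles of input texture and render
# target on each pass.

def run_passes(initial, kernel, n_passes):
    """Apply `kernel` to every element for `n_passes` passes."""
    src = list(initial)      # texture read during the current pass
    dst = [0] * len(src)     # texture currently bound as render target
    for _ in range(n_passes):
        for i in range(len(src)):    # the GPU does this in parallel
            dst[i] = kernel(src[i])
        src, dst = dst, src          # swap: output becomes next input
    return src

doubled = run_passes([1, 2, 3], lambda x: 2 * x, 3)  # three doubling passes
```

The swap is the whole trick: no buffer is ever read and written in the same pass, matching the GPU restriction that a texture is bound either for reading or as a render target.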

SLIDE 7

GPU Fragment Programs = CPU Loop Bodies

  • Consider a 2-D grid. A CPU implementation uses a pair of nested loops to iterate over each cell in the grid and performs the same computation at each cell.
  • GPUs do not have the capability to perform this inner loop over each texel in a texture.

Solution?

  • The fragment pipeline is designed to perform identical computations at each fragment simultaneously.
  • It is similar to having a processor for each fragment.
  • Thus, the GPU analog of computation inside nested loops over an array is a fragment program applied in data-parallel fashion to each fragment.
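The analogy can be made concrete with a small sketch (plain Python, with hypothetical names): the CPU version iterates explicitly, while the GPU version conceptually launches the same kernel once per fragment.

```python
# Illustrative sketch: a CPU's nested loops over a 2D grid versus the
# GPU model, where one kernel runs independently per fragment (cell).

def cpu_version(grid):
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):               # explicit nested loops
        for x in range(w):
            out[y][x] = grid[y][x] + 1
    return out

def fragment_kernel(grid, x, y):
    # The "fragment program": sees only its own coordinates and inputs.
    return grid[y][x] + 1

def gpu_version(grid):
    h, w = len(grid), len(grid[0])
    # The rasterizer conceptually launches the kernel for every fragment;
    # here we emulate that launch sequentially.
    return [[fragment_kernel(grid, x, y) for x in range(w)] for y in range(h)]
```

Both produce identical results; only the control structure differs, which is exactly the CPU loop body / fragment program correspondence the slide describes.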

SLIDE 8

The Modern Graphics Pipeline

  • Each stage in the graphics pipeline can be independently configured through graphics APIs like OpenGL or DirectX.
  • Programmable graphics pipeline:
      – Fixed-function operations on vertices, like transformations and lighting calculations, are replaced by a user-defined vertex program.
      – Fixed-function operations on fragments that determine a fragment's color are replaced by a user-defined fragment program.

SLIDE 9

Textures

Textures are the equivalent of arrays.

Size limitation: 4096 texels in each dimension.

Native data layout: Rectangular (2D) textures.

Data formats: One channel (LUMINANCE) to four channels (RGBA).

Supported floating-point formats: 16-bit, 24-bit, 32-bit.

SLIDE 10

Programmable Vertex Processor

Input: Stream of geometry.

Transforms each vertex in homogeneous coordinates (xyzw) independently of the other vertices; works on 4-tuples simultaneously.

Fully programmable (SIMD/MIMD).

Processes 4-component vectors (RGBA/XYZW).

Capable of scatter but not gather:

  • Can change the location of the current vertex.
  • Cannot read info from other vertices.
  • Limited gather capability: can fetch from a texture but cannot fetch from the current vertex stream.

Output: Stream of transformed vertices and triangles.

SLIDE 11

Fixed-Function Rasterizer

Input: Stream of transformed vertices and triangles.

Generates a fragment for each pixel covered by the transformed geometry.

Interpolates vertex attributes linearly.

Output: Stream of fragments.

Fixed-function part of the pipeline.

SLIDE 12

Programmable Fragment Processor

Input: Stream of fragments with interpolated attributes.

Applies a fragment program to each fragment independently.

Capable of gather but not scatter:

  • Indirect memory read (texture fetch), but no indirect memory write.
  • Output address is fixed to a specific pixel.

Fully programmable (SIMD).

Processes 4-component vectors (RGBA/XYZW).

Output: Pixels to be displayed.

SLIDE 13

Feedback: Render-To-Texture

Textures can be used as render targets!

Textures are either read-only or write-only.

Feedback loop: Render intermediate results into a texture, then use it as input in a subsequent pass.

Visualization: Render a single quad into the frame buffer, textured with the last intermediate result.

Further processing on the CPU: Read back the texture data.

SLIDE 14

GPGPU Terminology

SLIDE 15

Arithmetic Intensity

  • Arithmetic intensity
      – Math operations per word transferred
      – Computation / bandwidth
  • Ideal applications to target for GPGPU have:
      – Large data sets
      – High parallelism
      – Minimal dependencies between data elements
      – High arithmetic intensity
      – Lots of work to do without CPU intervention
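As a toy illustration of the "math operations per word transferred" definition, consider a SAXPY-style update. The counts below are illustrative assumptions; real values depend on precision, caching, and the memory system.

```python
# Toy arithmetic-intensity estimate for y[i] = a * x[i] + y[i] (SAXPY).
# Counts are illustrative; real figures depend on caches and precision.
ops_per_element = 2             # one multiply + one add
words_per_element = 3           # read x[i], read y[i], write y[i]
intensity = ops_per_element / words_per_element  # fewer than 1 op per word
```

An intensity below 1 marks the computation as bandwidth-bound, so it is a poor GPGPU candidate by the criteria above; kernels that reuse each fetched word many times score much higher.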

SLIDE 16

Data Streams & Kernels

  • Streams
      – Collections of records requiring similar computation
      – Thus they provide data parallelism
  • Kernels
      – Functions applied to each element in the stream (transforms, PDEs, …)
      – No dependencies between stream elements
      – Encourage high arithmetic intensity

SLIDE 17

Scatter vs. Gather

  • Gather
      – Indirect read from memory ( x = a[i] )
      – Naturally maps to a texture fetch
      – Used to access data structures and data streams
  • Scatter
      – Indirect write to memory ( a[i] = x )
      – Difficult to emulate: would require changing the frame-buffer write location of a fragment, or a dependent texture write operation
      – Neither operation is available on GPUs
  • Solution?
      – Rewrite the problem in terms of gather
      – Use the vertex processor (it can scatter)
      – Needed for building many data structures; usually done on the CPU
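A small sketch of the "rewrite scatter as gather" idea, in plain Python. It assumes the index array is a permutation (a simplifying assumption for illustration); the inverse map would be built once on the CPU.

```python
# Rewriting a scatter as a gather. A scatter a[idx[i]] = x[i] cannot run
# in a fragment program, but each output position can instead ask
# "which input element writes to me?" and perform an indirect READ.
# Assumes idx is a permutation of 0..n-1.

def scatter(x, idx):
    a = [0] * len(x)
    for i in range(len(x)):
        a[idx[i]] = x[i]          # indirect write: unavailable on the GPU
    return a

def gather_equivalent(x, idx):
    # Invert the index map once (on the CPU) ...
    inverse = [0] * len(idx)
    for i, j in enumerate(idx):
        inverse[j] = i
    # ... then every output element does only an indirect read,
    # which maps naturally to a texture fetch.
    return [x[inverse[k]] for k in range(len(x))]
```

Both functions produce the same array; only the second has the gather-only memory access pattern a fragment program requires.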

SLIDE 18

GPU Program Flow Control

Highly parallel nature of GPUs!

Limitations of branching on GPUs?

Techniques for iteration and decision making?

SLIDE 19

Hardware mechanisms for Flow Control

Three basic implementations of data-parallel branching on GPUs:

  • Predication
  • Single Instruction, Multiple Data (SIMD) branching
  • Multiple Instruction, Multiple Data (MIMD) branching

SLIDE 20

Hardware mechanisms for Flow Control

  • Predication
      – No true data-dependent branch instructions.
      – The GPU evaluates both sides of a branch and discards one of the results based on the value of the boolean branch condition.
      – Disadvantage: evaluating both branches can be costly.
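The select-and-discard behavior of predication can be sketched on the CPU like this (plain Python; the function names are illustrative):

```python
# Sketch of predication: both candidate results are computed up front,
# and the boolean condition merely selects which one is kept. No true
# data-dependent branch is ever taken.

def predicated(keep, a, b):
    # keep is 0 or 1; an arithmetic select avoids any branch at all.
    return keep * a + (1 - keep) * b

def clamp_negative_to_zero(x):
    # Both "sides of the branch" exist before the condition resolves:
    then_result = x      # value if x >= 0
    else_result = 0      # value if x < 0
    keep = 1 if x >= 0 else 0
    return predicated(keep, then_result, else_result)
```

Note that both `then_result` and `else_result` cost real work even though one is always discarded, which is exactly the disadvantage listed above.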

SLIDE 21

Hardware mechanisms for Flow Control

  • SIMD branching
      – All active processors execute the same instructions at the same time.
      – When the evaluation of a branch condition is identical on all active processors, only the taken side of the branch is evaluated.
      – When it differs, both sides are evaluated and the results predicated.
      – Thus divergence in the branching of simultaneously processed fragments can lead to reduced performance.
  • MIMD branching
      – Different processors can follow different paths through the program.

SLIDE 22

Other Techniques for Flow Control

  • Static branch resolution
      – Avoid branching inside inner loops.
      – Results in loops that contain efficient code without branches.
  • Pre-computation
      – The result of a branch is constant over a large domain of input values or a number of iterations.
      – Evaluate branches only when the results are known to change.
      – Store the results for use over many subsequent iterations.
  • Z-cull
      – A feature to avoid shading pixels that will not be seen.
      – Discard fragments that fail the depth test before their pixel colors are calculated in the fragment processor.
      – A lot of work saved!

SLIDE 23

GPGPU: 4 Problems

  • Reduction: Max
  • Sorting
  • Searching
  • Matrix Multiplication

SLIDE 24

Simple Fragment Application Flow

  1. Configure OpenGL for 1:1 rendering.
  2. Write data to a texture.
  3. Load the fragment program.
  4. Bind the fragment program.
  5. Bind textures.
  6. Draw a large quad.
  7. Write results to a texture.

SLIDE 25

Reduction (max)

  • Goal
      – Find the maximum element in an array of n elements.
  • Approach
      – Each fragment processor finds the max of 4 adjacent array elements (each pass processes 16 elements).
  • Input: Array of n elements stored as a 2D texture.
  • Output: Array of n/4 elements written to the frame buffer (each pass overwrites the array).

SLIDE 26

Reduction on GPU

Store the array as a 2D texture.

The max() comparison runs as a fragment program.

Each fragment compares 4 texels and returns the max.

The frame buffer stores the max from each fragment (the buffer is a quarter of the original array size).

The frame buffer overwrites the previous texture.
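The multi-pass reduction can be sketched on the CPU as follows (plain Python; assumes a square texture whose side is a power of two, purely for illustration):

```python
# CPU sketch of the multi-pass GPU max-reduction: in each pass every
# "fragment" reads a 2x2 block of texels and writes their max, so the
# array is quartered per pass until one value remains.
# Assumes a square texture with power-of-two side length.

def reduce_max(tex):
    while len(tex) > 1:
        h, w = len(tex) // 2, len(tex[0]) // 2
        # One render pass: each fragment gathers 4 texels, outputs 1.
        tex = [[max(tex[2 * y][2 * x],     tex[2 * y][2 * x + 1],
                    tex[2 * y + 1][2 * x], tex[2 * y + 1][2 * x + 1])
                for x in range(w)] for y in range(h)]
    return tex[0][0]

peak = reduce_max([[3, 1], [4, 1]])  # single pass over a 2x2 texture
```

A 4x4 texture needs two passes, an n-element texture about log₄(n) passes, matching the "each pass overwrites the array" loop described above.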

SLIDE 27

Another look at Reduction Loop

SLIDE 28

Sorting on GPU

  • Sort an array of n floats.
  • CPU implementation: standard merge sort in O(n log n).
  • GPU implementation: bitonic merge sort in O(n log²n).

SLIDE 29

The Bitonic Merge Sort

  • A classic (parallel) algorithm.
  • Repeatedly build bitonic lists and then sort them.
  • A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing.
      – List A: (3, 4, 7, 8) is monotonically increasing.
      – List B: (6, 5, 2, 1) is monotonically decreasing.
      – List AB: (3, 4, 7, 8, 6, 5, 2, 1) is bitonic.

SLIDE 31

Similar to parallelizing Classic Merge Sort

SLIDE 32–42

The Bitonic Sort

[Figure series: the bitonic sorting network applied step by step across array positions 1–8, building and sorting progressively larger bitonic lists.]

  • 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
  • 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
  • Sort the bitonic lists.
  • 2x monotonic lists: (3,4,7,8) (6,5,2,1)
  • 1x bitonic list: (3,4,7,8, 6,5,2,1)
  • Sort the final bitonic list; after the last stage the array is fully sorted.

SLIDE 43

Complexity Analysis

  • A separate rendering pass is needed for each set of swaps.
      – O(log²n) passes
      – Each pass performs n compare/swaps
      – Total compare/swaps: O(n log²n)
  • Limitations of the GPU cost us a factor of O(log n) over the best CPU-based sorting algorithms.

SLIDE 44

Computational time complexity using n processors

  • Parallel merge sort
      – O(n), but unbalanced processor load and communication.
  • Parallel quicksort
      – O(n), but unbalanced processor load and communication can degenerate it to O(n²).
  • Odd-even merge sort & bitonic merge sort
      – O(log²n)

Bitonic merge sort has been a popular choice for parallel sorting.

SLIDE 45

GPU Bitonic Sort

Store the array as a 2D texture.

The bitonic sort runs as a fragment program.

Each array position sorts against a partner position determined by the current stage and step of the bitonic sort.

The sorted array position is updated in the frame buffer.

The frame buffer overwrites the texture.
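The stage/step structure can be sketched sequentially on the CPU (plain Python). Each iteration of the inner `step` loop corresponds to one GPU render pass, and the partner position is found by XOR-ing the index with the step, which is the same pairing a fragment program would compute. Assumes the array length is a power of two.

```python
# Sequential sketch of the bitonic merge sort network. One inner-loop
# iteration over `step` models one GPU render pass; within a pass every
# compare/swap is independent and could run in parallel.

def bitonic_sort(values):
    a = list(values)
    n = len(a)                    # must be a power of two
    stage = 2
    while stage <= n:             # size of the bitonic lists being merged
        step = stage // 2
        while step >= 1:          # one render pass per step
            for i in range(n):
                partner = i ^ step          # partner position for index i
                if partner > i:             # handle each pair once
                    ascending = (i & stage) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            step //= 2
        stage *= 2
    return a
```

Counting passes: the loops execute 1 + 2 + … + log₂n steps, i.e. O(log²n) passes of n compare/swaps each, matching the complexity analysis above.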

SLIDE 46

GPU Matrix Multiplication

Store each matrix as a 2D texture.

Matrix multiplication runs as a fragment program.

Each fragment loads a 4x4 matrix from each texture and calls a hardware matrix multiplication operation.
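Why matrix multiplication fits the fragment model: each output element depends only on reads (gathers) of one row of A and one column of B, and writes exactly one value. A minimal CPU sketch of that access pattern (plain Python, element-wise rather than the 4x4-block variant on the slide):

```python
# Matrix multiply as a gather-only kernel: each output element
# (fragment) performs indirect READS of a row of A and a column of B
# (texture fetches) and writes exactly one value, so no scatter needed.

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m))   # pure gathers
             for j in range(p)] for i in range(n)]
```

The 4x4-block formulation on the slide is the same idea with better arithmetic intensity: each fetched 4-component texel is reused across several multiply-adds instead of once.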

SLIDE 47

Conclusions

  • The GPU easily outperforms the CPU when:
      – The problem is suited to parallelism.
      – The data set is large (but not larger than video memory).
      – The GPU instruction set can accommodate the needs of the problem in an efficient manner.

SLIDE 48

Thank You!