Comparing FPGAs, GPUs and the PS2 using a unified source description

SLIDE 1

FPL 2006

  • L. W. Howes
  • P. Price
  • O. Mencer
  • O. Beckmann
  • O. Pell

  • Motivation: The Future?, Accelerators, Benefits, Technology
  • Related Work
  • Implementation: ASC targets, Targeting the architectures, Example, Limitations
  • Results
  • Conclusions

Comparing FPGAs, GPUs and the PS2 using a unified source description

L. W. Howes, P. Price, O. Mencer, O. Beckmann, O. Pell

Department of Computing, Imperial College London

August 28, 2006

SLIDE 2

Motivation: Graphics Processing Units - the future?

Thanks to Mark Harris of NVIDIA for this graph

SLIDE 3

Motivation: Comparing Accelerators

  • Different characteristics: applications and accelerators
  • As a result, accelerators match some applications better than others
  • Wish to learn which accelerator is best
  • Experiment fairly
  • A single representation

SLIDE 4

Motivation: Development

  • Heterogeneous architectures
  • A variety of programming methodologies
  • Even high level languages require low level knowledge
  • Development becomes slow and expensive
  • Use a single source description

[Figure: A Stream Compiler (ASC) targeting GPU, PS2 and FPGA]

SLIDE 5

Motivation: Single Source Benefits

  • Fair comparison of performance on different architectures
  • May need architecture specific optimisations
  • Easier development for multiple architectures
  • Could use architecture specific optimisations
  • Allow integration of multiple accelerators into a project, sharing the performance gain

[Figure: A Stream Compiler (ASC) targeting GPU, PS2 and FPGA]

SLIDE 6

Target: FPGAs

  • Flexible
  • Highly parallel
  • Generally considered to be very difficult to program

SLIDE 7

Target: GPUs

  • Highly parallel
  • Widespread and used to accelerate graphics processing, largely for games
  • Relatively low cost
  • Recently being investigated for general purpose computation

[Figure: GPU block diagram. Labels: Host, Vertex Processors (VP), Rasterisation, Fragment Processors (FP), Texture Cache, DRAM]

SLIDE 8

Target: PS2

  • Core MIPS processor
  • Programmable vector units with local memory
  • Large install base
  • The real benefit: A step towards Cell

[Figure: PS2 block diagram. MIPS CPU (EE Core) with FPU, 16KB I cache, 8KB D cache and 32KB Scratch Pad; Vector Unit 0 (VU0) with 4KB Data and 4KB Code; Vector Unit 1 (VU1) with 16KB Data and 16KB Code; Vector Unit Interfaces 0 and 1 (VIF0, VIF1); 10 Channel DMA Controller (DMAC); Memory Interface; I/O Interface; Graphics Interface (GIF); Graphics Synthesiser (GS); 2.4 Gb/s bus]

SLIDE 9

Related Work

  • McCool et al.; SIGGRAPH 2002: Shader Metaprogramming
  • Cope et al.; FPT 2005: Have GPUs made FPGAs redundant in the field of Video Processing?
  • Cornwall et al.; IPDPS 2006: Automatically Translating a General Purpose C++ Image Processing Library for GPUs
  • Trancoso et al.; DSD 2005: Exploring Graphics Processor Performance for General Purpose Applications
  • Pavan Tumati; Undergraduate Thesis, Univ. Illinois: Sony Playstation-2 VPU: A Study on the Feasibility of Utilizing Gaming Vector Hardware for Scientific Computing

SLIDE 10

A Stream Compiler - ASC

  • Generates stream architectures for FPGAs
  • C++ object oriented approach to development
  • Combines algorithm, architecture and arithmetic levels into a single tool

SLIDE 11

ASC Compilation

  • Map a data-flow graph directly to hardware
  • High throughput, low clock frequency

[Figure: data-flow graph built from *, + and XOR operators, with labels key[x] to key[x+5] and word1 to word4]

SLIDE 12

ASC for other architectures

  • ASC code represents the data flow of a program
  • The ASC data flow can be implemented for various architectures

[Figure: ASC tool flow. Labels: ASC Code, ASC, ASC GPU, ASC PS2, Runtime API, Accelerated Application, FPGA, GPU, PS2]

SLIDE 13

Targeting the PS2

  • PS2 vector units take entire data flow
  • Input data is split into blocks
  • Data is fed to vector units to process each block in turn
  • Makes use of operations on vector registers
  • Can use both vector units to improve parallelism

[Figure: ASC to PS2 compilation flow. Labels: Dataflow Graph, AST, PS2 ASM, Vector Unit Programs, Calling Program, Data Management, Final Combined Executable, PS2 Emotion Engine]
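
The block-wise scheme on this slide can be sketched in plain C++ on the host. This is only an illustration under stated assumptions, not ASC output or PS2 code: `process_in_blocks`, `BlockKernel` and the block size are hypothetical names and values, and the two "units" run one after another here, whereas the real vector units would work concurrently on data moved into their local memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a vector unit program: it processes one block
// of floats in place. On real hardware the block would first be copied
// into the unit's local data memory.
using BlockKernel = std::function<void(float* data, std::size_t count)>;

// Split the input stream into blocks of at most block_size elements and
// hand each block in turn to one of the available units.
void process_in_blocks(std::vector<float>& stream,
                       std::size_t block_size,
                       const std::vector<BlockKernel>& units) {
    if (block_size == 0 || units.empty()) return;
    std::size_t block_index = 0;
    for (std::size_t offset = 0; offset < stream.size(); offset += block_size) {
        // The last block may be shorter than block_size.
        const std::size_t count = std::min(block_size, stream.size() - offset);
        // Alternate blocks between the units (for example VU0 and VU1).
        const BlockKernel& unit = units[block_index % units.size()];
        unit(stream.data() + offset, count);
        ++block_index;
    }
}
```

Passing two kernels in `units` models the use of both vector units; passing one models a single unit handling every block.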

SLIDE 14

Targeting the GPU

  • Split data flow at various points and divide into computation kernels separated by intermediate arrays
  • Split at points of data reuse
  • Split where kernel complexity would be high
  • Uses the OpenGL Shading Language (GLSL) to program the GPU

[Figure: ASC to GPU compilation flow. Labels: Dataflow Graph, AST, GLSL Code, Calling Program, C++ and GLSL, OpenGL Libraries, Final Executable, GPU Hardware]
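
The idea of kernels separated by intermediate arrays can be modelled on the CPU as a chain of passes. The sketch below is a host-side model only, not the GLSL that ASC emits; `Kernel`, `run_pass` and `run_pipeline` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A "kernel" computes one output element from a read-only input array,
// much as a fragment program may read any texel of its input but writes
// only its own output location.
using Kernel = std::function<float(const std::vector<float>& in, std::size_t i)>;

// Run one kernel over every element. The returned vector plays the role
// of the intermediate array between two passes.
std::vector<float> run_pass(const Kernel& k, const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = k(in, i);
    return out;
}

// Chain the passes: the output of each pass is the input of the next.
std::vector<float> run_pipeline(const std::vector<Kernel>& kernels,
                                std::vector<float> data) {
    for (const Kernel& k : kernels) data = run_pass(k, data);
    return data;
}
```

A value that is reused at several offsets is a natural place to end one kernel and start the next, because the second kernel can then read the stored array at any index instead of recomputing it.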

SLIDE 15

Example: ASC code targeting the GPU

Example

    STREAM_START;
    HWfloat input(IN);
    HWfloat temporary(TMP);
    HWfloat intermediate(TMP);
    HWfloat output(OUT);
    STREAM_LOOP(40);
    temporary = input + prev(input,2);
    intermediate = temporary + prev(temporary,2);
    output = input + prev(intermediate,3)
                   + prev(temporary,4);
    STREAM_END_GLSL;
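
For readers unfamiliar with stream notation, the same computation can be written as an ordinary C++ loop. This is a reference sketch and not part of ASC. It assumes that `prev(x, n)` means the value `x` held `n` iterations earlier and that reads before the start of the stream yield 0; the slides do not state either point. The function name `reference_stream` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Assumed semantics for this sketch: the value of x at iteration i - n,
// or 0 when that iteration lies before the start of the stream.
static float prev(const std::vector<float>& x, std::size_t i, std::size_t n) {
    return i >= n ? x[i - n] : 0.0f;
}

// One loop iteration corresponds to one element of the stream.
std::vector<float> reference_stream(const std::vector<float>& input) {
    const std::size_t n = input.size();
    std::vector<float> temporary(n), intermediate(n), output(n);
    for (std::size_t i = 0; i < n; ++i) {
        temporary[i] = input[i] + prev(input, i, 2);
        intermediate[i] = temporary[i] + prev(temporary, i, 2);
        output[i] = input[i] + prev(intermediate, i, 3)
                             + prev(temporary, i, 4);
    }
    return output;
}
```

Calling `reference_stream` with a 40-element vector mirrors the `STREAM_LOOP(40)` in the ASC source.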

SLIDE 16

Example: GLSL output for previous example

Example

    string ks__temporary10 =
        "void main(uniform samplerRect in__inputvar1) \n"
        "{\n"
        " vec4 _temporary10;\n"
        " vec4 in__inputvar1_var_P0;\n"
        " in__inputvar1_var_P0 = textureRect( in__inputvar1, vec2(gl_TexCoord[0].s, 0)).rgba;\n"
        " vec4 in__inputvar1_var_P2;\n"
        " in__inputvar1_var_P2.r = in__inputvar1_var_P0.b;\n"
        " in__inputvar1_var_P2.g = in__inputvar1_var_P0.a;\n"
        " in__inputvar1_var_P2.b = textureRect( in__inputvar1, vec2(gl_TexCoord[0].s + 1, 0)).r;\n"
        " in__inputvar1_var_P2.a = textureRect( in__inputvar1, vec2(gl_TexCoord[0].s + 1, 0)).g;\n"
        " _temporary10.rgba = ( in__inputvar1_var_P2.rgba + in__inputvar1_var_P0.rgba );\n"
        " gl_FragColor.rgba = _temporary10;\n"
        "} \n"; /* End kernel k__temporary10*/
    ...
    glslProgram.setProgram(ks_out__C5);
    glslProgram.setInputArray("in__A1", in__A1, TEXTURESIZEX, TEXTURESIZEY, 4);
    glslProgram.setInputArray("in__B3", in__B3, TEXTURESIZEX, TEXTURESIZEY, 4);
    glslProgram.setIteratorDimensions(TEXTURESIZEX, TEXTURESIZEY);
    float *outputsout__C5[1] = {out__C5};
    glslProgram.setOutputs(1, outputsout__C5, TEXTURESIZEX, TEXTURESIZEY, 4);
    glslProgram.run();
    ...

SLIDE 17

Limitations of the GPU

  • Each output value requires a separate kernel execution
  • Fragment executions cannot communicate
  • Feedback loops limited by the lack of communication
  • Executions occur automatically in hardware
  • The order of execution is left undefined

[Figure: GPU Fragment Processing. Labels: Input Buffer 1, Input Buffer 2, Fragment 1, Fragment 2, Output Buffer]
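
To make the feedback restriction concrete, consider a running sum y[i] = y[i-1] + x[i]. A single pass cannot compute it, because no execution can read another execution's output. One well-known workaround, which is not taken from these slides, is a multi-pass scan (Hillis-Steele style) in which every pass reads only the buffer written by the previous pass. The sketch below models that on the CPU; `multipass_running_sum` is a hypothetical name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Running sum y[i] = x[0] + ... + x[i], computed so that each element of a
// pass reads only the buffer produced by the previous pass.
std::vector<float> multipass_running_sum(std::vector<float> read) {
    std::vector<float> write(read.size());
    for (std::size_t stride = 1; stride < read.size(); stride *= 2) {
        // One pass: every element is independent of the others, so the
        // order in which they are computed does not matter.
        for (std::size_t i = 0; i < read.size(); ++i)
            write[i] = read[i] + (i >= stride ? read[i - stride] : 0.0f);
        // Swap the two buffers so the next pass reads this pass's output.
        std::swap(read, write);
    }
    return read;
}
```

The cost is roughly log2(n) passes over the whole array instead of one, which is one way the lack of communication shows up as extra work.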

SLIDE 18

Results: Monte Carlo Simulation

[Figure: Monte Carlo simulation computation time. x axis: Number of data points [100k pts]; y axis: Execution time [100 seconds]. Series: P4, P4 with SSE2, PS2 Vector Units, PS2 CPU, NVIDIA 6800 Ultra, Naive FPGA, Optimised FPGA]

SLIDE 19

Results: FFT

[Figure: FFT computation time. x axis: Number of FFT points; y axis: Execution time [100 seconds]. Series: fftw3 on Opteron 275, P4 with SSE2, PS2 Vector Units, PS2 CPU, NVIDIA 7800 GTX, Naive FPGA, Optimised FPGA]

SLIDE 20

Summary and conclusions

  • Multiple heterogeneous acceleration architectures
  • Experimenting can be difficult
  • Use a single representation and try multiple targets
  • Compare the performance characteristics of the target architectures
  • Utilise multiple target architectures in achieving acceleration goals
  • Make best use of individual characteristics
  • Early results, more optimisation effort needed

SLIDE 21

Questions?

Any questions?