[PPT] - A GPU-Inspired Soft Processor for High-Throughput Acceleration PowerPoint Presentation

SLIDE 1

A GPU-Inspired Soft Processor for High-Throughput Acceleration

Jeffrey Kingyens and J. Gregory Steffan

Electrical and Computer Engineering

University of Toronto

SLIDE 2

FPGA-Based Acceleration

In-socket acceleration platforms

FPGA and CPU on same motherboard

Xtremedata, Nallatech, SGI RASC

A GPU-Inspired Soft Processor

How to program them?

HDL is for experts

Behavioural synthesis is limited

Can we provide a more familiar programming model?

[Figure: XD1000]

SLIDE 3

Potential Solution: Soft Processor

Advantages of soft processors:

Familiar, portable, customizable

Our Goal: Develop a new soft processor (S.P.) architecture that:

Excels at high-throughput workloads

Is naturally capable of high utilization of datapath

Challenges:

Memory latency

Pipeline latency and hazards

Exploiting parallelism

Scaling

SLIDE 4

Inspiration: GPU Architecture

Multithreading

Tolerate memory and pipeline latencies

Vector instructions

Data-level parallelism, scaling

Multiple processors

Scaling

Long-term goal: FPGA-specific design using the above

This work: FPGA implementation of a GPU

SLIDE 5

Overview

A GPU-based system

NVIDIA’s Cg

AMD’s CTM r5xx ISA

A GPU-inspired architecture

Overcoming port limitations

Avoiding stalls

Preliminary results

Simulation based on Xtremedata XD1000

SLIDE 6

A GPU-Based System

SLIDE 7

GPU Shader Processors

[Figure: shader processor block diagram. Labels: Shader Program; output coordinate (Xo, Yo); Fetch(n, x, y); Data; Register File; Constant Registers; Input Buffers; Registers; Output Buffer.]

Separate in/out buffers simplify memory coherence

SLIDE 8

NVIDIA’s Cg Language (C-like)

Cg Shader Program

struct data_out {
    float4 sum : COLOR;
};

data_out multadd(float2 coord : TEXCOORD0,
                 uniform sampler2D A : TEXUNIT0,
                 uniform sampler2D B : TEXUNIT1)
{
    data_out r;
    float4 offset = {1.0f, 1.0f, 1.0f, 1.0f};
    r.sum = tex2D(A, coord) * tex2D(B, coord) + offset;
    return r;
}

Matrix-matrix element-wise multiplication + offset

SLIDE 9

AMD’s CTM r5xx ISA (simplified)

multadd:
    TEX r1, r0 s1
    TEX r0, r0 s0
    MAD o0, r1 r0 c0
    END

Loads: the two TEX instructions

ALU op: the MAD instruction

Dest regs: the operand before the comma

Source regs: the operands after the comma (labelled A, B, C)

Each register is a 4-element vector
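
As a rough illustration of the multadd listing, the Python sketch below interprets the program for a single thread. The semantics are assumptions inferred from the slides, not the actual CTM r5xx definition: TEX is taken to load one 4-element texel from the texture bound to the named sampler at the coordinate held in the source register, and MAD is taken to compute a*b+c element-wise. The function run_thread, the register dictionary, and the texture layout are hypothetical.

```python
# Toy interpreter for the simplified multadd listing (assumed semantics).

PROGRAM = [
    ("TEX", "r1", "r0", "s1"),
    ("TEX", "r0", "r0", "s0"),
    ("MAD", "o0", "r1", "r0", "c0"),
    ("END",),
]


def run_thread(program, coord, samplers, constants):
    """Execute the program for one thread and return output register o0."""
    # Assumption: r0 starts out holding the thread's (x, y) coordinate,
    # padded to a 4-element vector.
    regs = {"r0": (coord[0], coord[1], 0.0, 0.0)}
    regs.update(constants)
    for inst in program:
        op = inst[0]
        if op == "TEX":
            _, dst, src, sampler = inst
            x, y = int(regs[src][0]), int(regs[src][1])
            regs[dst] = samplers[sampler][y][x]  # load one 4-element texel
        elif op == "MAD":
            _, dst, a, b, c = inst
            regs[dst] = tuple(
                ea * eb + ec for ea, eb, ec in zip(regs[a], regs[b], regs[c])
            )
        elif op == "END":
            break
    return regs["o0"]


# 1x2 textures of 4-element texels; s0 is A and s1 is B, as in the Cg program.
A = [[(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0)]]
B = [[(2.0, 2.0, 2.0, 2.0), (0.5, 0.5, 0.5, 0.5)]]
result = run_thread(
    PROGRAM, (1, 0), {"s0": A, "s1": B}, {"c0": (1.0, 1.0, 1.0, 1.0)}
)
print(result)
```

The constant register c0 plays the role of the offset vector in the Cg source, so each output element is A*B+1 for the texel at the thread's coordinate.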

SLIDE 10

A GPU-Inspired Architecture

SLIDE 11

Soft Processor Architecture

[Figure: soft processor block diagram. Blocks: Coordinate Generator, Register File, Config, HT Slave, HT Master, Output Register, ALU, TEX, and two FIFOs. Operand labels: A, B, C, A. Callouts: "FPGA block RAMs have only 2 ports!", "64 cycles!", "305 cycles!"]

Must tolerate port limitations and latencies

SLIDE 12

Overcoming Port Limitations

Problem: central register file:

Needs four reads and two writes per cycle

FPGA block RAMs have only two ports

Solution: exploit symmetry of threads

Symmetry: every thread executes the same instruction sequence

Group threads into batches of four

Fetch operands across batches in lock-step

Only read one operand per thread per cycle

SLIDE 13

Reading Operands Across Batch

Batch (of 4 threads), each thread executing the same program:

multadd:
    TEX r1 r0 s1
    TEX r0 r0 s0
    MAD o0 r1 r0 c0
    END

Three cycles to read operands:

1) Read A’s

2) Read B’s

3) Read C’s

Only read one operand per thread per cycle
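
The lock-step operand read can be sketched in Python as follows. This is an illustrative model only: it assumes one register-file bank per thread in the batch and one read per bank per cycle, and the name read_operands_for_batch and the register-file dictionaries are hypothetical rather than taken from the design.

```python
# Sketch: fetch the MAD operands (A, B, C) for a batch of four threads,
# reading one operand per thread per cycle, then transpose for the ALU.

BATCH_SIZE = 4


def read_operands_for_batch(reg_files, source_regs):
    """reg_files: one dict per thread; source_regs: e.g. ("r1", "r0", "c0").

    Returns (per_thread_operands, cycles_used).
    """
    assert len(reg_files) == BATCH_SIZE
    rows = []  # rows[k] holds operand k for every thread (read in one cycle)
    cycles = 0
    for reg in source_regs:
        # One cycle: every thread's register file supplies the same register,
        # so each bank performs exactly one read.
        rows.append([rf[reg] for rf in reg_files])
        cycles += 1
    # Transpose: the ALU wants (A, B, C) together for each thread.
    per_thread = [tuple(row[t] for row in rows) for t in range(BATCH_SIZE)]
    return per_thread, cycles


# Hypothetical register contents, distinct per thread so the transpose shows.
reg_files = [{"r1": 10 + t, "r0": 20 + t, "c0": 1} for t in range(BATCH_SIZE)]
operands, cycles = read_operands_for_batch(reg_files, ("r1", "r0", "c0"))
print(cycles)
print(operands)
```

Because all four threads run the same instruction, the same register name is read from every bank in a given cycle; the transpose step then regroups the values so each thread's three operands arrive together.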

SLIDE 14

Transposed RegFile Access

[Figure: transposed register file access. Four per-thread register files (T0 RF, T1 RF, T2 RF, T3 RF); operands A, B and C are each read for threads 0 to 3.]

SLIDE 15

Avoiding ALU Pipeline Bubbles

Problem: long pipeline and memory latency

Frequent stalls lead to underutilized ALU datapath

Solution: exploit abundance of threads

Store contexts for multiple batches of threads
Issue instructions from different batches to hide latencies

Requires logic to issue from and manage batches

How many batches do we need to avoid bubbles?

SLIDE 16

Issuing from Multiple Batches

[Figure: ALU pipeline timeline with slots labelled by batch (1, 2, 3).]

Ideally ALU is fully utilized
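
To make the idea of hiding latency with multiple batches concrete, here is a toy Python model. It is a sketch under stated assumptions, not the authors' SystemC simulator: each batch issues one instruction that occupies the ALU for issue_cycles cycles (one per thread), the batch's next instruction is assumed to depend on the previous one and so must wait latency cycles, and batches are picked round-robin. The function name and the parameter values are hypothetical and are not meant to reproduce the reported results.

```python
# Toy model of round-robin issue from multiple batch contexts.

def alu_utilization(num_batches, latency, issue_cycles=4, total_cycles=10_000):
    """Fraction of cycles in which the ALU issue slot is busy."""
    ready_at = [0] * num_batches  # cycle at which each batch may issue again
    busy = 0
    cycle = 0
    nxt = 0  # round-robin pointer
    while cycle < total_cycles:
        # Find a ready batch, starting from the round-robin pointer.
        chosen = None
        for i in range(num_batches):
            b = (nxt + i) % num_batches
            if ready_at[b] <= cycle:
                chosen = b
                break
        if chosen is None:
            cycle += 1  # bubble: no batch can issue this cycle
            continue
        # Issuing occupies the ALU for issue_cycles (one per thread in batch).
        busy += issue_cycles
        cycle += issue_cycles
        # The batch's next (dependent) instruction must wait for the result.
        ready_at[chosen] = cycle + latency
        nxt = (chosen + 1) % num_batches
    return busy / cycle


# Hypothetical latency of 60 cycles; sweep the number of batch contexts.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(alu_utilization(n, latency=60), 3))
```

In this model utilization grows roughly as n * issue_cycles / (issue_cycles + latency) until it saturates at 1, which is the qualitative behaviour the slide describes: enough batch contexts keep the ALU busy while other batches wait.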

SLIDE 17

Methodology and Results

SLIDE 18

Simulation Methodology

SystemC-based simulation

Parameterized to model XD1000

Assume conservative 100 MHz soft processor clock

Cycle accurate at the block interfaces

Models HyperTransport (bandwidth and latency)

currently 8-bit HT, capable of 16-bit HT

Benchmarks

photon: Monte Carlo heat-transfer simulation (ALU-intensive)

matmatmult: dense matrix multiplication (memory-intensive)

SLIDE 19

ALU Utilization (8-bit HT)

[Figure: stacked bar chart for Photon. Y axis: ALU Utilization (%), 0% to 100%. X axis: Number of Hardware Batch Contexts (1, 2, 4, 8, 16, 32, 64). Legend: Utilized, Memory, Not ALU, Data Hazard.]

SLIDE 20

ALU Utilization (8-bit HT)

[Figure: ALU utilization chart. Legend: Utilized, Memory, Not ALU, Data Hazard.]

Matmatmult is bottlenecked on memory bandwidth
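
A back-of-the-envelope way to see why a bandwidth-bound benchmark cannot reach full utilization is sketched below in Python. Every number here is hypothetical and chosen only for illustration; the function utilization_bound is not from the original work, and the model simply assumes each ALU operation needs a fixed amount of fresh data from the link.

```python
# Back-of-the-envelope bound; all numbers below are hypothetical.

def utilization_bound(link_bytes_per_cycle, bytes_per_alu_op):
    """Upper bound on ALU utilization when each ALU op needs fresh data."""
    if bytes_per_alu_op <= 0:
        return 1.0  # no memory traffic, so bandwidth never limits the ALU
    return min(1.0, link_bytes_per_cycle / bytes_per_alu_op)


BYTES_PER_OP = 32  # e.g. two 4-element float32 operands (assumed)
narrow = utilization_bound(8, BYTES_PER_OP)   # hypothetical narrow link
wide = utilization_bound(16, BYTES_PER_OP)    # same link, twice as wide
print(narrow, wide)
```

Under this simple model, adding batch contexts hides latency but cannot lift utilization above the bound; only a wider or faster link, or fewer bytes per operation, raises it.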

SLIDE 21

ALU Utilization (16-bit HT)

[Figure: ALU utilization chart. Legend: Utilized, Memory, Not ALU, Data Hazard.]

32 batches is sufficient

SLIDE 22

Conclusions

GPU-inspired soft processor architecture

exploits multithreading, vector operations

Thread symmetry and batching allow:

tolerating limited block RAM ports

tolerating long memory and pipeline latencies

32 batches sufficient to achieve 100% ALU utilization

Future work:

customize programming model and arch. to FPGAs

exploit longer vectors, multiple CPUs, custom ops
