A GPU Register File using Static Data Compression

SLIDE 1

A GPU Register File using Static Data Compression

Alexandra Angerd, Erik Sintorn, Per Stenström Department of Computer Science and Engineering Chalmers University of Technology Göteborg, Sweden

SLIDE 2

Motivation

Limiting factors for TLP:

  • Register file size
  • Register footprint

[Figure: threads and the register file]

SLIDE 3

Motivation

13.5x

Sizes keep increasing!
Already huge and power hungry!
Instead: decrease footprint!

SLIDE 4

Observation #1: Float precision can be tuned offline

[Figure: register file holding values of tuned precision (high, medium, low)]

SLIDE 5

Observation #2: Static analysis of narrow integers

[Figure: narrow values in the register file]

(a) Source program:

    k = 0
    while k < 50 {
      i = 0
      j = k
      while i < j {
        print k
        i = i + 1
        k = k + 1
      }
    }
    print k

(b) Control-flow graph in SSA form (node labels; branch edges are marked t/f):

    k1 = φ(k0, k2)
    k1 < 50?
    k0 = 0
    kt = k1 ∩ [−∞, 49]
    i0 = 0
    j0 = kt
    print kf
    i1 = φ(i0, i2)
    i1 < j0?
    k2 = kt + 1
    print kt
    i2 = i1 + 1

(c) Intervals of the SSA names:

    I[k0] = [0, 0]
    I[k1] = [0, 50]
    I[k2] = [1, 50]
    I[kt] = [0, 49]
    I[kf] = [50, 50]
    I[i0] = [0, 0]
    I[i1] = [0, 49]
    I[i2] = [1, 50]
    I[j0] = [0, 49]

(d) Intervals and bit widths of the source variables:

    I[k] = I[kx] = [0, 50]    k : 6 bits
    I[i] = I[ix] = [0, 50]    i : 6 bits
    I[j] = I[jx] = [0, 49]    j : 6 bits
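The step from intervals to bit widths on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used by the authors: the function names `bits_needed`, `union`, and `variable_range` are hypothetical, and merging SSA names by their first letter is an assumption that only suits this small example. The interval values are copied from the slide.

```python
from functools import reduce


def bits_needed(lo: int, hi: int) -> int:
    """Smallest bit width that can hold every integer in [lo, hi]."""
    if lo > hi:
        raise ValueError("empty interval")
    if lo >= 0:
        # Unsigned case: enough bits for the largest value, at least one bit.
        return max(1, hi.bit_length())
    # Signed two's complement: n bits cover [-2**(n-1), 2**(n-1) - 1].
    return max((-lo - 1).bit_length(), hi.bit_length()) + 1


def union(a, b):
    """Smallest interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


# Intervals of the SSA names, copied from the slide.
SSA_INTERVALS = {
    "k0": (0, 0), "k1": (0, 50), "k2": (1, 50), "kt": (0, 49), "kf": (50, 50),
    "i0": (0, 0), "i1": (0, 49), "i2": (1, 50),
    "j0": (0, 49),
}


def variable_range(var: str):
    """Merge the intervals of all SSA names derived from one source variable.

    Assumption: SSA names of a variable share its first letter.
    """
    parts = [iv for name, iv in SSA_INTERVALS.items() if name.startswith(var)]
    return reduce(union, parts)


ranges = {v: variable_range(v) for v in ("k", "i", "j")}
widths = {v: bits_needed(*ranges[v]) for v in ranges}
print(ranges)
print(widths)
```

The merged ranges and widths agree with the values shown on the slide: 6 bits each for k, i, and j.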

SLIDE 6

Problem Statement

  • Existing techniques for GPUs either:
      • Rely on run-time detection of narrow integer values
      • Support only statically detected narrow integers or narrow (precision-reduced) floats

How to design a register file which utilizes both narrow floats AND narrow integers?

SLIDE 7

Contributions

  • A new GPU register file organization which supports both narrow integer and float data
  • A new concept for efficient packing of narrow operands
      • Based on static bitwidth analysis co-designed with the new register file organization
  • Evaluation of benefits
      • Up to 79% performance improvement (avg: 18.6%) when allowing for a slight quality loss

SLIDE 8

Outline

  • Approach
  • Challenges
  • Proposed Register File Organization
  • Evaluation Methodology
  • Results
      • Impact on Register Pressure and Performance
      • Overhead Estimation
  • Conclusion

SLIDE 9

Approach

Angerd, Sintorn, Stenström. “A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs”, ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 4, December 2017.

Pereira, Rodrigues, Campos. “A Fast and Low-overhead Technique to Secure Programs Against Integer Overflows”, in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

SLIDE 10

Approach

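[Figure: baseline vs. our approach]

Baseline: the virtual registers V1 and V2 each occupy a full 32-bit physical register (R0, R1).

Our approach: V1 (8 bits) and V2 (24 bits) share the 32-bit physical register R0. An indirection table (columns: Register, p0, p1, m0, m1) maps each virtual register to its slices:

    V1    R0    11000000
    V2    R0    00111111

Changes to baseline:

  • Sliced physical registers
  • Access by indirection table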
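One way to read the masks on this slide is as slice selectors: an 8-bit mask over a 32-bit register implies 4-bit slices, so `11000000` selects 8 bits and `00111111` selects 24 bits, matching the widths of V1 and V2. The Python sketch below illustrates that reading only. The slice width, the mask bit order (leftmost bit = most significant slice), and the helper names `write_operand` and `read_operand` are assumptions, not details confirmed by the slides.

```python
SLICE_BITS = 4                      # assumption: 32-bit register = 8 slices of 4 bits
NUM_SLICES = 8
SLICE_MASK = (1 << SLICE_BITS) - 1


def selected_slices(mask: str):
    """Slice indices chosen by a mask string; index 0 is the leftmost
    (most significant) slice, which is an assumption of this sketch."""
    if len(mask) != NUM_SLICES or set(mask) - {"0", "1"}:
        raise ValueError("mask must be 8 characters of 0/1")
    return [i for i, bit in enumerate(mask) if bit == "1"]


def write_operand(reg: int, mask: str, value: int) -> int:
    """Store a narrow value into the slices of `reg` selected by `mask`."""
    slices = selected_slices(mask)
    width = SLICE_BITS * len(slices)
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the selected slices")
    for n, s in enumerate(slices):
        # n-th nibble of the value, counted from its most significant end.
        nibble = (value >> (width - SLICE_BITS * (n + 1))) & SLICE_MASK
        shift = SLICE_BITS * (NUM_SLICES - 1 - s)
        reg = (reg & ~(SLICE_MASK << shift)) | (nibble << shift)
    return reg & 0xFFFFFFFF


def read_operand(reg: int, mask: str) -> int:
    """Gather the selected slices of `reg` back into one narrow value."""
    value = 0
    for s in selected_slices(mask):
        shift = SLICE_BITS * (NUM_SLICES - 1 - s)
        value = (value << SLICE_BITS) | ((reg >> shift) & SLICE_MASK)
    return value


# Two virtual registers sharing physical register R0, as on the slide.
r0 = 0
r0 = write_operand(r0, "11000000", 0xAB)        # V1: 8 bits
r0 = write_operand(r0, "00111111", 0x123456)    # V2: 24 bits
v1 = read_operand(r0, "11000000")
v2 = read_operand(r0, "00111111")
print(hex(r0), hex(v1), hex(v2))
```

Because each virtual register touches only its own slices, writing V2 leaves V1 intact, which is what lets two narrow operands live in one physical register.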

SLIDE 11

Approach

Formats are given as [exponent bits, mantissa bits].

Bit-width    IEEE754-compliant    IEEE754-style
32 bits      [8, 23]              [8, 23]
28 bits      -                    [7, 20]
24 bits      -                    [6, 17]
20 bits      -                    [5, 14]
16 bits      [5, 10]              [5, 10]
12 bits      -                    [4, 7]
8 bits       -                    [3, 4]

  • Supported floating-point format: “IEEE-style” [Angerd et al. TACO 2017]
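The [exponent bits, mantissa bits] pairs on this slide can be checked with a short Python sketch. It assumes that the "IEEE-style" formats follow the usual IEEE 754 conventions (one sign bit, exponent bias of 2^(e-1) - 1, and the largest exponent code reserved for infinities and NaNs). That assumption is not stated on the slide, and the helper names are hypothetical.

```python
# [exponent bits, mantissa bits] per total bit width, copied from the slide.
IEEE_STYLE = {
    32: (8, 23), 28: (7, 20), 24: (6, 17), 20: (5, 14),
    16: (5, 10), 12: (4, 7), 8: (3, 4),
}


def total_bits(e: int, m: int) -> int:
    """Sign bit + exponent bits + mantissa bits."""
    return 1 + e + m


def bias(e: int) -> int:
    """Exponent bias under IEEE 754 conventions (assumed here)."""
    return 2 ** (e - 1) - 1


def max_normal(e: int, m: int) -> float:
    """Largest finite value, assuming the top exponent code is reserved."""
    return (2 - 2.0 ** -m) * 2.0 ** bias(e)


def epsilon(m: int) -> float:
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -m


for width, (e, m) in IEEE_STYLE.items():
    assert total_bits(e, m) == width
    print(f"{width:2d} bits  [{e}, {m:2d}]  max={max_normal(e, m):.6g}  eps={epsilon(m):.3g}")
```

Under these assumptions, every format fills its bit width exactly, and each step down the table removes 4 bits, trading range and precision for a smaller register footprint.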
SLIDE 12

Challenges

  • Indirection table on the critical path
  • Multiple indirection table accesses per cycle
  • Conversion between floating-point formats
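The cost of the last challenge, conversion between floating-point formats, can be illustrated in software for the two IEEE754-compliant widths, 32 and 16 bits. The Python sketch below uses the standard `struct` module to round a value to each width and back. It only shows the rounding effect of narrowing. It says nothing about the hardware conversion logic in the proposed design, and the other IEEE-style widths have no standard-library equivalent.

```python
import struct


def round_trip(x: float, fmt: str) -> float:
    """Round x to the given struct format and read it back as a Python float.

    "<f" is IEEE 754 binary32 ([8, 23]); "<e" is binary16 ([5, 10]).
    Packing raises OverflowError if x is too large for the format.
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def relative_error(exact: float, approx: float) -> float:
    return abs(approx - exact) / abs(exact)


x = 0.1
as_single = round_trip(x, "<f")
as_half = round_trip(x, "<e")

print(as_single, relative_error(x, as_single))
print(as_half, relative_error(x, as_half))
```

For 0.1, the 16-bit result is 0.0999755859375, which illustrates the kind of quality loss the precision tuning has to keep within bounds.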

SLIDE 13

Baseline Architecture

SLIDE 14

Proposed Register File Organization

SLIDE 15

Pipeline Extension

SLIDE 16

Evaluation Methodology

  • Implemented in GPGPU-Sim
  • Benchmarks:
      • Graphics: Deferred, SSAO, Elevated, Pathtracer
      • 7 kernels from the Rodinia benchmark suite
  • Quality metric:
      • Graphics benchmarks: Structural Similarity Index (SSIM)
      • Rodinia benchmarks: average relative error, or binary (output correct or not)
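The slides do not define the average relative error metric, so the Python sketch below assumes the common definition: the mean of |approx - reference| / |reference| over all output elements, with nonzero reference values. The function name and the sample data are hypothetical.

```python
def avg_relative_error(reference, approx):
    """Mean of |approx - reference| / |reference| over paired outputs.

    Assumption: every reference value is nonzero.
    """
    if len(reference) != len(approx) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - r) / abs(r) for r, a in zip(reference, approx))
    return total / len(reference)


# Hypothetical outputs: full-precision run vs. reduced-precision run.
reference = [1.0, 2.0, 4.0]
approx = [1.1, 2.0, 3.6]

error = avg_relative_error(reference, approx)
print(f"average relative error: {error:.4f}")
```

A metric like this gives a single number that can be compared against a quality threshold when deciding how far precision can be reduced.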

SLIDE 17

Results: Impact on Register Pressure

  • Register pressure lowered in all cases
  • Both integer and float reduction are important

SLIDE 18

Results: Impact on Performance

Average: 18.6% increase in IPC

Quality: Very high

  • SSIM ≥ 0.9
  • Avg. relative error: ≤ 10%
  • Binary: All outputs correct

SLIDE 19

Results: Area Overhead Estimation

  • Transistor count as proxy
      • Estimated through, e.g., logic synthesis
  • Less than 1% of total chip transistor budget

SLIDE 20

Results: Power Overhead Estimation

  • Estimated analytically
  • Static power:
      • Increases linearly with circuit area (area overhead ≈ static power overhead)
  • Dynamic power:
      • Conclusion: less than for a 2x larger register file
      • Why?
          • Largest difference: occasionally 2x fetches per operand
          • Controlled by the compiler
          • Worst case: 2x more fetches
          • However, 2x more entries in the register file means 2x longer bitlines to charge

SLIDE 21

Conclusion

  • Contributions:
      • A new GPU register file organization which supports both narrow integer and float data
      • A new concept for efficient packing of narrow operands
      • Evaluation of benefits
  • Evaluation:
      • Performance increased by up to 79% (18.6% on average) when allowing a slight quality loss
      • Uses less than 1% of the chip transistor budget

SLIDE 22