
A GPU Register File using Static Data Compression
Alexandra Angerd, Erik Sintorn, Per Stenström
Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden

Motivation
Limiting factors for thread-level parallelism (TLP):
• Register file size
• Register footprint
(Figure: a register file shared among many threads)

Motivation
• Register file sizes keep increasing (the slide's chart is annotated "13.5x")
• Register files are already huge and power hungry
• Instead: decrease the register footprint!

Observation #1: Float precision can be tuned offline
(Figure: register file holding values at high, medium, and low tuned precision)

Observation #2: Static analysis of narrow integers
(Figure: narrow values in the register file)

(a) Example program:
    k = 0
    while k < 50 {
        i = 0
        j = k
        while i < j {
            print k
            i = i + 1
        }
        k = k + 1
    }
    print k

(b) Control-flow graph in SSA form:
    k_0 = 0
    k_1 = φ(k_0, k_2)
    k_1 < 50?
        true:  k_t = k_1 ∩ [−∞, 49]
               i_0 = 0
               j_0 = k_t
               i_1 = φ(i_0, i_2)
               i_1 < j_0?
                   true:  print k_t
                          i_2 = i_1 + 1
                   false: k_2 = k_t + 1
        false: print k_f

(c) Intervals of the SSA variables:
    I[k_0] = [0,0]    I[k_1] = [0,50]   I[k_2] = [1,50]
    I[k_t] = [0,49]   I[k_f] = [50,50]
    I[i_0] = [0,0]    I[i_1] = [0,49]   I[i_2] = [1,50]
    I[j_0] = [0,49]

(d) Intervals and bit widths of the original variables:
    I[k] = ∪ I[k_x] = [0,50]    k: 6 bits
    I[i] = ∪ I[i_x] = [0,50]    i: 6 bits
    I[j] = ∪ I[j_x] = [0,49]    j: 6 bits
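The last step of the example, going from value intervals to bit widths, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the analysis used by the authors; the function names `bits_for_interval` and `hull` are made up here, and treating the union of SSA intervals as their convex hull is an assumption.

```python
def bits_for_interval(lo: int, hi: int) -> int:
    """Smallest number of bits that can hold every integer in [lo, hi]."""
    if lo > hi:
        raise ValueError("empty interval")
    if lo >= 0:
        # Unsigned encoding: enough bits for the largest value, at least one.
        return max(1, hi.bit_length())
    # Two's complement: n bits cover [-2**(n-1), 2**(n-1) - 1].
    return max((-lo - 1).bit_length(), max(hi, 0).bit_length()) + 1


def hull(*intervals):
    """Smallest single interval that contains all given (lo, hi) intervals."""
    lows, highs = zip(*intervals)
    return (min(lows), max(highs))


# Intervals of the SSA versions of k, i and j, copied from the slide.
k = hull((0, 0), (0, 50), (1, 50), (0, 49), (50, 50))
i = hull((0, 0), (0, 49), (1, 50))
j = hull((0, 49))

for name, interval in (("k", k), ("i", i), ("j", j)):
    print(name, interval, bits_for_interval(*interval), "bits")
```

All three variables come out at 6 bits, since 2^6 = 64 is the smallest power of two above 50.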

Problem Statement
• Existing techniques for GPUs either:
  • Rely on run-time detection of narrow integer values, or
  • Support only statically detected narrow integers or only narrow (precision-reduced) floats, not both
• How can a register file be designed to exploit both narrow floats and narrow integers?

Contributions
• A new GPU register file organization which supports both narrow integer and float data
• A new concept for efficient packing of narrow operands
  • Based on static bitwidth analysis co-designed with the new register file organization
• Evaluation of benefits
  • Up to 79% performance improvement (avg: 18.6%) when allowing for a slight quality loss

Outline
• Approach
• Challenges
• Proposed Register File Organization
• Evaluation Methodology
• Results
  • Impact on Register Pressure and Performance
  • Overhead Estimation
• Conclusion

Approach
The slide cites two prior works:
• Pereira, Rodrigues, Campos. “A Fast and Low-overhead Technique to Secure Programs Against Integer Overflows”, In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization.
• Angerd, Sintorn, Stenström. “A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs”, ACM Transactions on Architecture and Code Optimization (TACO), Volume 14 Issue 4, December 2017.

Approach
Baseline: two narrow values V1 and V2 (one 8 bits, one 24 bits wide) each occupy a full 32-bit register (R0 and R1).
Our approach: V1 and V2 are packed together into the single 32-bit physical register R0.
Changes to baseline:
• Sliced physical registers
• Access by indirection table
Indirection table:
    Register  p0  p1  m0        m1
    V1        R0  -   11000000  …
    V2        R0  -   00111111  …
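To make the indirection-table idea concrete, here is a minimal Python sketch of an operand read. Several details are assumptions that the slide does not state: each mask bit is taken to select one 4-bit slice of a 32-bit physical register (8 mask bits per register), the leftmost mask bit is taken to be the most significant slice, and p1/m1 are taken to name an optional second physical register for operands that span two registers. The helper `read_operand` and the register contents are hypothetical.

```python
SLICE_BITS = 4        # assumed slice width: 32-bit register / 8 mask bits
SLICES_PER_REG = 8


def read_operand(phys_regs, entry):
    """Gather the slices selected by an indirection-table entry.

    entry is (p0, m0, p1, m1): up to two physical register names, each with
    a slice mask given as a string of 8 characters '0'/'1' (MSB slice first).
    """
    p0, m0, p1, m1 = entry
    value = 0
    for reg, mask in ((p0, m0), (p1, m1)):
        if reg is None:
            continue  # operand does not use this physical register
        for idx, bit in enumerate(mask):
            if bit == "1":
                shift = (SLICES_PER_REG - 1 - idx) * SLICE_BITS
                nibble = (phys_regs[reg] >> shift) & 0xF
                value = (value << SLICE_BITS) | nibble
    return value


# Table rows from the slide; the content of R0 is made up for the example.
phys_regs = {"R0": 0xAB123456}
table = {
    "V1": ("R0", "11000000", None, None),
    "V2": ("R0", "00111111", None, None),
}

print(hex(read_operand(phys_regs, table["V1"])))  # the two upper slices
print(hex(read_operand(phys_regs, table["V2"])))  # the six lower slices
```

Under these assumptions V1 is the 8-bit value and V2 the 24-bit value, and both are served from the same physical register R0.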

Approach
• Supported floating-point format: “IEEE-style” [Angerd et al. TACO 2017]
Format as [exponent bits, mantissa bits] per bit width:
    Bit width          32 bits  28 bits  24 bits  20 bits  16 bits  12 bits  8 bits
    IEEE754-compliant  [8,23]   -        -        -        [5,10]   -        -
    IEEE754-style      [8,23]   [7,20]   [6,17]   [5,14]   [5,10]   [4,7]    [3,4]
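As a rough illustration of what a conversion between these formats involves, the Python sketch below encodes a 32-bit float into one of the narrower IEEE754-style formats from the table and decodes it again. The slides do not give the conversion rules, so everything beyond the [exponent, mantissa] widths is an assumption: one sign bit, an exponent bias of 2^(e-1) - 1, mantissa truncation instead of rounding, flush-to-zero for values that are too small, and saturation to the largest finite value (also for infinities and NaNs). The names `FORMATS`, `encode` and `decode` are hypothetical.

```python
import struct

# Bit width -> (exponent bits, mantissa bits), copied from the table.
FORMATS = {32: (8, 23), 28: (7, 20), 24: (6, 17), 20: (5, 14),
           16: (5, 10), 12: (4, 7), 8: (3, 4)}


def encode(x: float, width: int) -> int:
    """Pack x into a `width`-bit code (sign | exponent | mantissa)."""
    e_bits, m_bits = FORMATS[width]
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    sign, exp, man = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    bias = (1 << (e_bits - 1)) - 1          # assumed IEEE-like bias
    new_exp = exp - 127 + bias
    if exp == 0 or new_exp <= 0:
        return sign << (width - 1)          # assumed: flush to signed zero
    new_man = man >> (23 - m_bits)          # assumed: truncate, no rounding
    max_exp = (1 << e_bits) - 2             # top exponent code left unused
    if new_exp > max_exp:                   # assumed: saturate on overflow
        new_exp, new_man = max_exp, (1 << m_bits) - 1
    return (sign << (width - 1)) | (new_exp << m_bits) | new_man


def decode(code: int, width: int) -> float:
    """Inverse of encode for codes produced by it."""
    e_bits, m_bits = FORMATS[width]
    sign = code >> (width - 1)
    exp = (code >> m_bits) & ((1 << e_bits) - 1)
    man = code & ((1 << m_bits) - 1)
    if exp == 0:
        return -0.0 if sign else 0.0
    bias = (1 << (e_bits - 1)) - 1
    value = (1 + man / (1 << m_bits)) * 2.0 ** (exp - bias)
    return -value if sign else value


for width in sorted(FORMATS, reverse=True):
    print(width, decode(encode(3.14159, width), width))
```

Each row of the table adds up to its bit width once the sign bit is included (for example 1 + 3 + 4 = 8), which is why one sign bit is assumed here.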

Challenges
• Indirection table on the critical path
• Multiple indirection table accesses per cycle
• Conversion between floating-point formats

Baseline Architecture
(Figure)

Proposed Register File Organization
(Figure)

Pipeline Extension
(Figure)

Evaluation Methodology
• Implemented in GPGPU-Sim
• Benchmarks:
  • Graphics: Deferred, SSAO, Elevated, Pathtracer
  • 7 kernels from the Rodinia benchmark suite
• Quality metric:
  • Graphics benchmarks: Structural Similarity Index (SSIM)
  • Rodinia benchmarks: average relative error, or binary (output correct or not)
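The slides do not define "average relative error" precisely. One common reading, assumed here, is the mean of |approximate - reference| / |reference| over all output elements; the function name and the sample values below are illustrative only.

```python
def average_relative_error(reference, approximate):
    """Mean of |a - r| / |r| over paired outputs (assumed definition)."""
    if len(reference) != len(approximate) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for r, a in zip(reference, approximate):
        if r == 0:
            raise ValueError("relative error is undefined for a zero reference")
        total += abs(a - r) / abs(r)
    return total / len(reference)


# Made-up outputs: full-precision run vs. reduced-precision run.
reference = [1.0, 2.0, 4.0]
approximate = [1.1, 2.0, 3.6]
print(average_relative_error(reference, approximate))
```

A result would then be compared against a chosen threshold, such as the 10% bound quoted on the performance slide.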

Results: Impact on Register Pressure
• Register pressure is lowered in all cases
• Both integer and float reduction are important

Results: Impact on Performance
Quality: Very high
• SSIM ≥ 0.9
• Avg. relative error ≤ 10%
• Binary: all outputs correct
Average: 18.6% increase in IPC

Results: Area Overhead Estimation
• Transistor count used as a proxy
• Estimated through, e.g., logic synthesis
• Less than 1% of the total chip transistor budget

Results: Power Overhead Estimation
• Estimated analytically
• Static power:
  • Increases linearly with circuit area (area overhead ≈ static power overhead)
• Dynamic power:
  • Conclusion: less than for a register file that is 2x larger
  • Why?
    • Largest difference: occasionally 2x fetches per operand
      • Controlled by the compiler
      • Worst case: 2x more fetches
    • However, 2x more entries in a register file means 2x longer bitlines to charge
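The dynamic-power argument can be turned into a back-of-the-envelope model. The sketch below is not from the slides: it assumes that the energy of one fetch scales linearly with bitline length, that a fraction `split_fraction` of operand reads needs two fetches in the proposed design, and that a register file with twice as many entries pays the doubled bitline cost on every fetch. All names are hypothetical.

```python
def proposed_relative_energy(split_fraction: float) -> float:
    """Fetch energy per operand, relative to one baseline fetch.

    split_fraction: share of operands that need two fetches (0.0 to 1.0).
    """
    if not 0.0 <= split_fraction <= 1.0:
        raise ValueError("split_fraction must be between 0 and 1")
    return 1.0 + split_fraction


# Assumed cost of one fetch from a register file with 2x the entries
# (2x longer bitlines to charge), relative to one baseline fetch.
DOUBLED_REGISTER_FILE_ENERGY = 2.0

for fraction in (0.0, 0.25, 1.0):
    print(fraction, proposed_relative_energy(fraction),
          "vs", DOUBLED_REGISTER_FILE_ENERGY)
```

In this model the proposed design only reaches the cost of the doubled register file in the worst case, when every operand is split across two fetches.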

Conclusion
• Contributions:
  • A new GPU register file organization which supports both narrow integer and float data
  • A new concept for efficient packing of narrow operands
  • Evaluation of benefits
• Evaluation:
  • Performance increased by up to 79%, 18.6% on average, when allowing a slight quality loss
  • Uses less than 1% of the chip transistor budget
