A GPU Register File using Static Data Compression
Alexandra Angerd, Erik Sintorn, Per Stenström
Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden
Motivation
[Figure: threads and the register file]
Limiting factors for thread-level parallelism (TLP):
• Register file size
• Register footprint
Motivation
[Figure: register file sizes, annotated "13.5x"]
• Sizes keep increasing!
• Already huge and power hungry!
• Instead: decrease footprint!
Observation #1: Float precision can be tuned offline
[Figure: register file holding values at tuned precision: low, medium, high]
Observation #2: Static analysis of narrow integers
[Figure: narrow values in the register file]

(a) Source program:
    k = 0
    while k < 50 {
      i = 0
      j = k
      while i < j {
        print k
        i = i + 1
      }
      k = k + 1
    }
    print k

(b) Control-flow graph in SSA form (t/f mark the true/false branch edges):
    k_0 = 0
    k_1 = φ(k_0, k_2)
    k_1 < 50?
      t: k_t = k_1 ∩ [−∞, 49]
         i_0 = 0
         j_0 = k_t
         i_1 = φ(i_0, i_2)
         i_1 < j_0?
           t: print k_t
              i_2 = i_1 + 1
           f: k_2 = k_t + 1
      f: print k_f

(c) Intervals per SSA name:
    I[k_0] = [0,0]     I[i_0] = [0,0]     I[j_0] = [0,49]
    I[k_1] = [0,50]    I[i_1] = [0,49]
    I[k_2] = [1,50]    I[i_2] = [1,50]
    I[k_t] = [0,49]
    I[k_f] = [50,50]

(d) Intervals and bit-widths per source variable:
    I[k] = ∪ I[k_x] = [0,50]    k: 6 bits
    I[i] = ∪ I[i_x] = [0,50]    i: 6 bits
    I[j] = ∪ I[j_x] = [0,49]    j: 6 bits
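The last step of the example, going from per-name intervals to a bit-width per variable, can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide, not the analysis itself: the helper names `hull` and `bits_needed` are hypothetical, and the use of two's complement for intervals that include negative values is an assumption (the slide only shows non-negative ranges).

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # closed interval [lo, hi]


def hull(intervals: Iterable[Interval]) -> Interval:
    """Smallest single interval covering all given intervals (panel d)."""
    items = list(intervals)
    if not items:
        raise ValueError("need at least one interval")
    return min(lo for lo, _ in items), max(hi for _, hi in items)


def bits_needed(lo: int, hi: int) -> int:
    """Bits required to store every integer in [lo, hi].

    Non-negative ranges are stored unsigned. Ranges containing negative
    values are assumed to use two's complement (an assumption, not taken
    from the slide).
    """
    if lo > hi:
        raise ValueError("empty interval")
    if lo >= 0:
        return max(1, hi.bit_length())
    # n-bit two's complement covers [-2**(n-1), 2**(n-1) - 1]
    return max((-lo - 1).bit_length(), hi.bit_length()) + 1


# Intervals of the SSA names from panel (c)
k_names = [(0, 0), (0, 50), (1, 50), (0, 49), (50, 50)]
i_names = [(0, 0), (0, 49), (1, 50)]
j_names = [(0, 49)]

for name, ivs in (("k", k_names), ("i", i_names), ("j", j_names)):
    lo, hi = hull(ivs)
    print(name, (lo, hi), bits_needed(lo, hi), "bits")
```

With the intervals from the slide, all three variables come out at 6 bits, matching panel (d).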
Problem Statement
• Existing techniques for GPUs either:
  • Rely on run-time detection of narrow integer values
  • Support only statically detected narrow integers or narrow (precision-reduced) floats
How can we design a register file that utilizes both narrow floats AND narrow integers?
Contributions
• A new GPU register file organization which supports both narrow integer and float data
• A new concept for efficient packing of narrow operands
  • Based on static bitwidth analysis co-designed with the new register file organization
• Evaluation of benefits
  • Up to 79% performance improvement (avg: 18.6%) when allowing for a slight quality loss
Outline
• Approach
• Challenges
• Proposed Register File Organization
• Evaluation Methodology
• Results
  • Impact on Register Pressure and Performance
  • Overhead Estimation
• Conclusion
Approach
[Figure: approach overview; diagram labels not recoverable]
References shown on the slide:
• Pereira, Rodrigues, Campos. “A Fast and Low-overhead Technique to Secure Programs Against Integer Overflows”, In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization.
• Angerd, Sintorn, Stenström. “A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs”, ACM Transactions on Architecture and Code Optimization (TACO), Volume 14 Issue 4, December 2017.
Approach
Example: two narrow values, V1 and V2, one 24 bits wide and one 8 bits wide.
Baseline: V1 and V2 each occupy their own 32-bit physical register (R0, R1).
Our approach: V1 and V2 share the 32-bit physical register R0.
Changes to baseline:
• Sliced physical registers
• Access by indirection table
Indirection table:
    Register  p0  p1  m0        m1
    V1        R0  -   11000000  …
    V2        R0  -   00111111  …
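The indirection-table example can be mimicked in software to show how slice masks select part of a physical register. The following Python sketch rests on several assumptions that are not stated on the slide: each of the 8 mask bits is taken to cover one 4-bit slice of a 32-bit register (inferred from 8 mask bits per 32-bit register), mask bit s is mapped to register bits 4s to 4s+3, slices of one operand are contiguous, and an operand split over p0 and p1 keeps its low-order part in p0. All function names are hypothetical.

```python
from typing import List, Tuple

SLICE_BITS = 4       # assumed slice width: 8 mask bits x 4 bits = 32 bits
SLICES_PER_REG = 8

# One table entry part: (physical register index, 8-bit slice mask).
# An operand has one part (p0, m0) or two parts (p0, m0), (p1, m1).
Part = Tuple[int, int]


def expand_mask(slice_mask: int) -> int:
    """Turn an 8-bit slice mask into a 32-bit mask over register bits."""
    bit_mask = 0
    for s in range(SLICES_PER_REG):
        if slice_mask & (1 << s):
            bit_mask |= 0xF << (s * SLICE_BITS)
    return bit_mask


def _layout(slice_mask: int) -> Tuple[int, int, int]:
    """Return (bit mask, shift of lowest slice, width in bits)."""
    if slice_mask == 0:
        raise ValueError("slice mask selects no slices")
    bit_mask = expand_mask(slice_mask)
    shift = (bit_mask & -bit_mask).bit_length() - 1
    width = bin(bit_mask).count("1")
    return bit_mask, shift, width


def write_operand(regs: List[int], parts: List[Part], value: int) -> None:
    """Store value into the slices named by parts, low-order part first."""
    for reg, slice_mask in parts:
        bit_mask, shift, width = _layout(slice_mask)
        chunk = value & ((1 << width) - 1)
        regs[reg] = (regs[reg] & ~bit_mask & 0xFFFFFFFF) | (chunk << shift)
        value >>= width


def read_operand(regs: List[int], parts: List[Part]) -> int:
    """Gather an operand; each part corresponds to one register fetch."""
    value, offset = 0, 0
    for reg, slice_mask in parts:
        bit_mask, shift, width = _layout(slice_mask)
        value |= ((regs[reg] & bit_mask) >> shift) << offset
        offset += width
    return value


# The two table rows from the slide: both operands live in R0.
table = {
    "V1": [(0, 0b11000000)],   # 2 slices = 8 bits under the 4-bit assumption
    "V2": [(0, 0b00111111)],   # 6 slices = 24 bits under the 4-bit assumption
}
regs = [0, 0]
write_operand(regs, table["V1"], 0xAB)
write_operand(regs, table["V2"], 0x123456)
```

After the two writes, `regs[0]` holds `0xAB123456` and both operands read back unchanged, so one 32-bit register serves two architectural registers. An operand with two parts needs two fetches, which is the case the power discussion later calls "2x fetches per operand".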
Approach
• Supported floating-point format: “IEEE-style” [Angerd et al. TACO 2017]
Each entry is [exponent bits, mantissa bits]:
    Bit-width           32      28      24      20      16      12     8
    IEEE754-compliant   [8,23]  -       -       -       [5,10]  -      -
    IEEE754-style       [8,23]  [7,20]  [6,17]  [5,14]  [5,10]  [4,7]  [3,4]
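To get a feel for what the [exponent bits, mantissa bits] pairs above mean for a stored value, the following Python sketch quantizes a 32-bit float to one of the listed formats and returns the value that would be read back. It is a rough model under stated assumptions, none of which come from the slide: the mantissa is truncated rather than rounded, values below the smallest normal number are flushed to zero, values above the largest finite number saturate, and the exponent bias is 2^(e-1) - 1 as in IEEE 754. The names `IEEE_STYLE` and `quantize` are hypothetical.

```python
import struct

# [exponent bits, mantissa bits] per total bit-width, copied from the slide.
IEEE_STYLE = {32: (8, 23), 28: (7, 20), 24: (6, 17), 20: (5, 14),
              16: (5, 10), 12: (4, 7), 8: (3, 4)}


def quantize(value: float, exp_bits: int, man_bits: int) -> float:
    """Value read back after storing `value` with the given field widths.

    Assumptions (not from the slide): truncation of the mantissa,
    flush-to-zero below the smallest normal, saturation on overflow.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = bits & 0x80000000
    exp = (bits >> 23) & 0xFF
    man = bits & 0x7FFFFF
    if exp == 0xFF:                        # infinity or NaN: left unchanged
        return value
    bias = (1 << (exp_bits - 1)) - 1       # IEEE 754 style exponent bias
    unbiased = exp - 127
    if exp == 0 or unbiased < 1 - bias:    # too small: flush to signed zero
        out = sign
    elif unbiased > bias:                  # too large: largest finite value
        top = ((1 << man_bits) - 1) << (23 - man_bits)
        out = sign | ((bias + 127) << 23) | top
    else:                                  # in range: drop low mantissa bits
        keep = ~((1 << (23 - man_bits)) - 1) & 0x7FFFFF
        out = sign | (exp << 23) | (man & keep)
    return struct.unpack("<f", struct.pack("<I", out))[0]


# Every listed format uses 1 sign bit plus its exponent and mantissa bits.
assert all(1 + e + m == width for width, (e, m) in IEEE_STYLE.items())

print(quantize(3.14159, *IEEE_STYLE[8]))    # 8-bit format
print(quantize(100000.0, *IEEE_STYLE[16]))  # 16-bit format, saturates
```

Under these assumptions the 8-bit format stores 3.14159 as 3.125, and the 16-bit format saturates 100000.0 to 65504.0, the largest finite value of a [5, 10] layout.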
Challenges
• Indirection table on the critical path
• Multiple indirection table accesses per cycle
• Conversion between floating-point formats
Baseline Architecture
[Figure: block diagram of the baseline architecture; labels not recoverable]
Proposed Register File Organization
[Figure: block diagram of the proposed register file organization; labels not recoverable]
Pipeline Extension
[Figure: extended pipeline; labels not recoverable]
Evaluation Methodology
• Implemented in GPGPU-Sim
• Benchmarks:
  • Graphics: Deferred, SSAO, Elevated, Pathtracer
  • 7 kernels from the Rodinia benchmark suite
• Quality metric:
  • Graphics benchmarks: Structural Similarity Index (SSIM)
  • Rodinia benchmarks: average relative error, or binary (output correct or not)
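The slide does not define the average relative error used for the Rodinia kernels. One common definition, given here only as a plausible reading, is the mean over all outputs of |approx - reference| / |reference|. The function name `avg_relative_error` and the small `eps` guard against division by zero are assumptions for illustration.

```python
from typing import Sequence


def avg_relative_error(reference: Sequence[float],
                       approx: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Mean of |approx - reference| / |reference| over all outputs.

    eps (an assumption, not from the slide) avoids division by zero
    when a reference value is exactly 0.
    """
    if len(reference) != len(approx) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ref, app in zip(reference, approx):
        total += abs(app - ref) / max(abs(ref), eps)
    return total / len(reference)


# Full-precision output vs. output computed with reduced precision
err = avg_relative_error([1.0, 2.0, 4.0], [1.0, 2.5, 4.0])
```

In this made-up example one of three outputs is off by 25%, so the average relative error is about 8.3%, which would still fall under a 10% threshold.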
Results: Impact on Register Pressure
• Register pressure is lowered in all cases
• Both integer and float reduction are important
Results: Impact on Performance
Quality: very high
• SSIM ≥ 0.9
• Avg. relative error ≤ 10%
• Binary: all outputs correct
Average: 18.6% increase in IPC
Results: Area Overhead Estimation
• Transistor count as proxy
  • Estimated through, e.g., logic synthesis
• Less than 1% of total chip transistor budget
Results: Power Overhead Estimation
• Estimated analytically
• Static power:
  • Increases linearly with circuit area (area overhead ≈ static power overhead)
• Dynamic power:
  • Conclusion: lower than for a 2x larger register file
  • Why?
    • Largest difference: occasionally 2x fetches per operand
      • Controlled by the compiler
      • Worst case: twice as many fetches
    • However, 2x more entries in a register file means 2x longer bitlines to charge
Conclusion
• Contributions:
  • A new GPU register file organization which supports both narrow integer and float data
  • A new concept for efficient packing of narrow operands
  • Evaluation of benefits
• Evaluation:
  • Performance increased by up to 79%, 18.6% on average, when allowing a slight quality loss
  • Uses less than 1% of the chip transistor budget