Power-Aware Performance of Mixed-Precision Linear Solvers for FPGAs and GPGPUs


SLIDE 1

Power-Aware Performance of Mixed-Precision Linear Solvers for FPGAs and GPGPUs

Tennessee Advanced Computing Laboratory University of Tennessee

July 14th 2010

JunKyu Lee, Junqing Sun, Gregory D. Peterson, Robert J. Harrison, Robert J. Hinde

This work was partially supported by the National Science Foundation, grant NSF CHE-0625598.

SLIDE 2

Overview of the Presentation

  • High-performance computational science applications
  • Mixed-precision linear system solvers
  • Accelerators (GPGPUs / FPGAs)

Power-aware performance of mixed-precision solvers on GPGPUs and FPGAs as a function of system characteristics (matrix size and condition number)

[Diagram: power vs. precision trade-off]

SLIDE 3

Impact of precision

Lower ALU precision → smaller ALUs → fewer transistors (TRs) per ALU → more ALUs in a fixed area, shorter wires, and shorter pipelines → higher clock rate.

SPEED UP!!

SLIDE 4

Mixed precision solvers

Computational science applications solve Ax = b on digital computers, which use static, finite-precision computation, so the solution x is error-prone.

  • Iterative refinement (better numeric results): James Wilkinson (1948)
  • Mixed-precision solver (better numeric results and high performance): J. Langou et al. (2006), J. Sun et al. (2008)

Mixed precision solver:

  • 1. Employ multiple precisions.
  • 2. Use lower precision (faster) for computationally intensive tasks and higher precision (slower) for refinement.

Goals:
  • High performance (lower-precision computation)
  • Numeric accuracy (higher-precision refinement)

SLIDE 5

Solving a Linear System Equation

To solve Ax = b via LU decomposition (A = LU):

  • 1. Decompose A = LU (2/3 n³ ops)
  • 2. Forward substitution: solve Ly = b (n² ops)
  • 3. Back substitution: solve Ux = y (n² ops)
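The three-step solve above can be sketched in NumPy; `lu_pp`, `forward_sub`, and `back_sub` are illustrative names (this version adds the partial pivoting used later in the deck), and the dense Python loops are for clarity, not performance:

```python
import numpy as np

def lu_pp(A):
    """LU with partial pivoting: returns L, U and row permutation p with A[p] = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    p = np.arange(n)
    for k in range(n - 1):
        m = k + np.argmax(np.abs(U[k:, k]))       # choose pivot row
        U[[k, m], k:] = U[[m, k], k:]             # swap active parts of rows k and m
        L[[k, m], :k] = L[[m, k], :k]
        p[[k, m]] = p[[m, k]]
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]     # multipliers
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])   # eliminate column k
    return L, U, p

def forward_sub(L, b):
    """Solve Ly = b for lower-triangular L (n^2 ops)."""
    y = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve Ux = y for upper-triangular U (n^2 ops)."""
    x = np.zeros_like(y, dtype=float)
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)
L, U, p = lu_pp(A)                       # step 1: ~2/3 n^3 ops
x = back_sub(U, forward_sub(L, b[p]))    # steps 2-3: two n^2 triangular solves
```

The factorization dominates the cost, which is why the deck pushes it to lower precision while the cheap triangular solves can be repeated during refinement.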

SLIDE 6

Mixed precision algorithm

To solve Ax = b:

Approximation (O(n³): computationally expensive, so employ lower precision for faster computation):
  • Step 1: LUPP(A); O(n³); precision PI. Then solve LUx(1) = P×b; O(n²); precision PI.

Refinement (O(n²) and below: computationally less expensive, so employ higher precision for accuracy):
for (i = 1 until x(i) is accurate enough)
  • Step 2: r(i) = b − A×x(i); O(n²); precision PH
  • Step 3: Solve LUz(i) = P×r(i); O(n²); precision PI
  • Step 4: x(i+1) = x(i) + z(i); O(n); precision PH
end

P is a permutation matrix and r(i) is the residual vector.
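A minimal sketch of this refinement loop, assuming NumPy, with float32 standing in for the lower precision PI and float64 for the higher precision PH. The deck targets FPGA/GPU kernels; here `np.linalg.solve` stands in for the LUPP factorization and, unlike a real implementation, re-factors on every call instead of reusing the LU:

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iter=50):
    """Steps 1-4 above: float32 plays the lower precision PI, float64 the higher PH."""
    A32 = A.astype(np.float32)
    # Step 1: factor/solve in lower precision (the O(n^3) cost).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                    # Step 2: residual in PH, O(n^2)
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            break
        z = np.linalg.solve(A32, r.astype(np.float32))   # Step 3: correction in PI, O(n^2)
        x = x + z.astype(np.float64)                     # Step 4: update in PH, O(n)
    return x

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # keep the matrix well conditioned
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
```

The key point the slide makes is visible here: only the cheap O(n²) residual and O(n) update run in high precision, yet the final x reaches double-precision accuracy.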

SLIDE 7

How Does the Mixed-Precision Algorithm Work?

Ax = b has an exact solution x. Starting from the low-precision solution x at iteration 1, compute the residual r = b − Ax, solve A×z = r for the correction z (i.e., z = A⁻¹×r), and update x for iteration 2; repeat until x converges to the exact solution.

Successful convergence depends on the condition number of the matrix.
SLIDE 8

Mixed precision linear solvers for GPGPUs and FPGAs

GPGPUs:
  Single precision for LUPP → double-precision refinement → Converge?
  Yes → DONE; No → double precision for LUPP → DONE

FPGAs:
  Arbitrary precision for LUPP → arbitrary-precision refinement → Converge?
  Yes → DONE; No → (higher) arbitrary precision for LUPP → DONE
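The GPGPU decision flow above can be sketched as follows, again with NumPy float32/float64 standing in for single-/double-precision kernels; `solve_with_fallback` and the synthetic ill-conditioned test matrix are illustrative, not from the deck:

```python
import numpy as np

def solve_with_fallback(A, b, tol=1e-12, max_iter=30):
    """Single-precision LUPP + double refinement; redo in double if it fails to converge."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                    # double-precision residual
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            return x, "mixed"                            # Converge? Yes -> DONE
        z = np.linalg.solve(A32, r.astype(np.float32))
        x = x + z.astype(np.float64)
    return np.linalg.solve(A, b), "double"               # Converge? No -> double LUPP

rng = np.random.default_rng(2)
n = 40
b = rng.standard_normal(n)

# Well-conditioned matrix: the mixed-precision path should succeed.
A_good = rng.standard_normal((n, n)) + n * np.eye(n)
x_good, path_good = solve_with_fallback(A_good, b)

# Matrix with condition number ~1e12 (built via an SVD): far beyond what a
# float32 factorization can support, so refinement diverges and we fall back.
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A_bad = (Q1 * np.logspace(0, -12, n)) @ Q2
x_bad, path_bad = solve_with_fallback(A_bad, b)
```

This mirrors the slide's point: the GPU flow has only one fallback level (double), whereas the FPGA flow can retry at any intermediate precision.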

SLIDE 9

Benefits of Mixed-Precision Linear Solvers for FPGAs

  • 1. FPGAs can employ arbitrary-precision computation (selecting a precision based on the condition number).

  • 2. Lower precision → smaller, faster ALUs → more ALUs (quadratic growth), yielding a significant performance difference between lower- and higher-precision multiplication on FPGAs (Table I).

Table I. Number of DSP48Es for a multiplier on Xilinx XC5VLX330T

  Exponent bits | Mantissa bits | DSP48Es per multiplier
  8             | 16            | 1
  8 (single)    | 17–23         | 2  (5x speedup)
  11            | 24–33         | 4
  11            | 34–40         | 6
  11            | 41–50         | 9
  11            | 51            | 12
  11 (double)   | 52            | 10 (1x)

SLIDE 10

Power-Aware Performance

Measure against dynamic power consumption: the incremental performance benefit of one additional watt.

Total power U = static S + dynamic D = S + C × Volt² × freq = S + α × freq, so α_MAX = (U_MAX − S)/freq = D_MAX/freq.

Three kinds of performance metric:
  • F := # of Flops per second (time-based performance)
  • F_CLK := # of Flops per clock cycle (clock-based performance)
  • F_WATT-D := # of Flops per watt of dynamic power (power-based performance)

Relation between clock-based and power-based performance:
  F_CLK = F/freq, and max(F_WATT-D) = F/D_MAX = F/(α_MAX × freq) = F_CLK/α_MAX

Design the logic to obtain maximum Flops/cycle to save power!!
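A small worked example of these relations, with assumed illustrative numbers (not measurements from the presentation):

```python
# Assumed illustrative values: a 200 W budget, 80 W static power, 1 GHz clock,
# and 50 GFlops/sec achieved performance.
U_MAX = 200.0      # maximum total power, watts
S = 80.0           # static power, watts
FREQ = 1.0e9       # clock rate, Hz
F = 50.0e9         # achieved performance, flops/sec

D_MAX = U_MAX - S              # maximum dynamic power: D = U - S
ALPHA_MAX = D_MAX / FREQ       # alpha_MAX = (U_MAX - S)/freq = D_MAX/freq
F_CLK = F / FREQ               # clock-based performance, flops/clock-cycle
F_WATT_D = F / D_MAX           # power-based performance, flops/dynamic watt

# The slide's identity: max(F_WATT-D) = F/(alpha_MAX * freq) = F_CLK/alpha_MAX
assert abs(F_WATT_D - F_CLK / ALPHA_MAX) / F_WATT_D < 1e-12
```

Because α_MAX is fixed by the power budget and clock, maximizing Flops/cycle directly maximizes Flops per dynamic watt, which is the slide's design takeaway.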

SLIDE 11

Methodology

Performance estimation for GPGPUs :

MAGMA v0.2 with Tesla C1060 and Intel Xeon 2.93GHz Multi Core.

Performance estimation for FPGAs :

Performance modeling for the Xilinx XC5VLX330T, based on our previous work.

Table II. Number of PEs on Xilinx XC5VLX330T

  Condition number | Mantissa bits | # of PEs
  1 – 2^17         | 1–16          | 192
  2^17 – 2^24      | 17–23         | 96
  2^24 – 2^34      | 24–33         | 48
  2^34 – 2^41      | 34–40         | 32
  2^41 – 2^51      | 41–50         | 21
  2^51 – 2^52      | 51            | 16
  2^52 – 2^53      | 52            | 19

Precision choice for FPGA performance estimation:
  Mantissa bit width M = log2(condition number) − 1
  Exponent bit width E = 8 (if M ≤ 23) or 11 (if 24 ≤ M ≤ 52)

FPGA performance: 2 Flops (per multiply-add) × number of PEs × clock rate
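The precision choice and performance model can be sketched as follows; the PE counts come from the table above, while the 200 MHz clock and the function names are assumed placeholders:

```python
import math

# From the PE table above: (upper bound on log2(condition number), # of PEs)
PE_TABLE = [(17, 192), (24, 96), (34, 48), (41, 32), (51, 21), (52, 16), (53, 19)]

def choose_precision(cond):
    """M = log2(cond) - 1 (rounded up), E = 8 if M <= 23 else 11, per the slide."""
    M = max(1, math.ceil(math.log2(cond)) - 1)
    E = 8 if M <= 23 else 11
    return M, E

def estimate_gflops(cond, clock_hz=200e6):
    """FPGA performance model: 2 flops x #PEs x clock rate (200 MHz is assumed)."""
    log2cond = math.log2(cond)
    for max_log2, pes in PE_TABLE:
        if log2cond <= max_log2:
            return 2 * pes * clock_hz / 1e9
    raise ValueError("condition number exceeds double-precision range")
```

For example, a condition number of 2^20 selects a 19-bit mantissa with an 8-bit exponent, which maps to 96 PEs in the table.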

SLIDE 12

Tesla C1060 – MAGMA v0.2

[Figure: Mixed-precision solver performance on the hybrid system (Tesla C1060 + Intel Xeon 2.93 GHz). Axes: matrix size (log2 base) and infinite-norm condition number (log2 base); surfaces show {F_WATT-D, F_CLK, F} in GFlops/Watt, Flops/clock-cycle, and GFlops/sec.]

SLIDE 13

FPGA (XC5VLX330T)

[Figure: Mixed-precision solver performance on the FPGA (XC5VLX330T). Axes: matrix size (log2 base) and infinite-norm condition number (log2 base); surfaces show {F_WATT-D, F_CLK, F} in GFlops/Watt, Flops/clock-cycle, and GFlops/sec.]

SLIDE 14

[Figure: Mixed-precision solver performance for 8192×8192 matrices: GFlops vs. infinite-norm condition number (log2 base). Blue (o): FPGA; green (x): GPU.]

SLIDE 15

[Figure: Mixed-precision solver performance for 8192×8192 matrices: Flops/clock-cycle vs. infinite-norm condition number (log2 base). Blue (o): FPGA; green (x): GPU.]

SLIDE 16

[Figure: Mixed-precision solver performance for 8192×8192 matrices: GFlops/Watt vs. infinite-norm condition number (log2 base). Blue (o): FPGA; green (x): GPU.]

SLIDE 17

Discussions and Conclusions

  • FPGAs can employ arbitrary precisions, while GPUs can employ only single or double precision.

  • In order to save power, it is important to design the logic to obtain good clock-based performance.

  • In this mixed-precision linear solver case study, the FPGA shows better power-based performance than the GPGPU, since the flexibility of FPGA design choices yields higher clock-based performance.

SLIDE 18

Thank you. Any questions?