Power-Aware Performance of Mixed-Precision Linear Solvers for FPGAs and GPGPUs


SLIDE 1

Power-Aware Performance of Mixed-Precision Linear Solvers for FPGAs and GPGPUs

Tennessee Advanced Computing Laboratory University of Tennessee

July 14th 2010

JunKyu Lee, Junqing Sun, Gregory D. Peterson, Robert J. Harrison, Robert J. Hinde

This work was partially supported by the National Science Foundation, grant NSF CHE-0625598.

SLIDE 2

Overview of the Presentation

  • High-performance computational science applications
  • Mixed-precision linear system solvers
  • Accelerators (GPGPUs / FPGAs)

Power-aware performance of mixed-precision solvers on GPGPUs and FPGAs as a function of system characteristics (matrix size and condition number)

[Diagram: power vs. precision trade-off]

SLIDE 3

Impact of precision

Lower ALU precision → smaller ALUs → fewer transistors (TRs) per ALU → more ALUs in a fixed area, shorter wires, and shorter pipelines → higher clock rate.

SPEED UP!!

SLIDE 4

Mixed precision solvers

Computational science applications solve Ax = b on digital computers, which use static, finite-precision computation, so the solution x is error-prone.

  • Iterative refinement (better numeric results): James Wilkinson (1948)
  • Mixed-precision solver (better numeric results and high performance): J. Langou et al. (2006), J. Sun et al. (2008)

Mixed precision solver:

  • 1. Employ multiple precisions.
  • 2. Use lower precision (faster) for computationally intensive tasks and higher precision (slower) for refinement.

Goals:
  • High performance (lower-precision computation)
  • Numeric accuracy (higher-precision refinement)

SLIDE 5

Solving a Linear System Equation

To solve Ax = b via LU decomposition (A = LU):

  • 1. Decompose A = LU (2/3 n³ ops)
  • 2. Forward substitution: solve Ly = b (n² ops)
  • 3. Back substitution: solve Ux = y (n² ops)
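The three-step solve above can be sketched in NumPy; `lu_pp`, `forward_sub`, and `back_sub` are illustrative names (this version adds the partial pivoting used later in the deck), and the dense Python loops are for clarity, not performance:

```python
import numpy as np

def lu_pp(A):
    """LU with partial pivoting: returns L, U and row permutation p with A[p] = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    p = np.arange(n)
    for k in range(n - 1):
        m = k + np.argmax(np.abs(U[k:, k]))       # choose pivot row
        U[[k, m], k:] = U[[m, k], k:]             # swap active parts of rows k and m
        L[[k, m], :k] = L[[m, k], :k]
        p[[k, m]] = p[[m, k]]
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]     # multipliers
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])   # eliminate column k
    return L, U, p

def forward_sub(L, b):
    """Solve Ly = b for lower-triangular L (n^2 ops)."""
    y = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve Ux = y for upper-triangular U (n^2 ops)."""
    x = np.zeros_like(y, dtype=float)
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)
L, U, p = lu_pp(A)                       # step 1: ~2/3 n^3 ops
x = back_sub(U, forward_sub(L, b[p]))    # steps 2-3: two n^2 triangular solves
```

The factorization dominates the cost, which is why the deck pushes it to lower precision while the cheap triangular solves can be repeated during refinement.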

SLIDE 6

Mixed precision algorithm

To solve Ax = b:

Approximation (O(n³): computationally expensive, so employ lower precision for faster computation):
  • Step 1: LUPP(A); O(n³); precision PI. Then solve LUx(1) = P×b; O(n²); precision PI.

Refinement (O(n²) and below: computationally less expensive, so employ higher precision for accuracy):
for (i = 1 until x(i) is accurate enough)
  • Step 2: r(i) = b − A×x(i); O(n²); precision PH
  • Step 3: Solve LUz(i) = P×r(i); O(n²); precision PI
  • Step 4: x(i+1) = x(i) + z(i); O(n); precision PH
end

P is a permutation matrix and r(i) is the residual vector.
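A minimal sketch of this refinement loop, assuming NumPy, with float32 standing in for the lower precision PI and float64 for the higher precision PH. The deck targets FPGA/GPU kernels; here `np.linalg.solve` stands in for the LUPP factorization and, unlike a real implementation, re-factors on every call instead of reusing the LU:

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iter=50):
    """Steps 1-4 above: float32 plays the lower precision PI, float64 the higher PH."""
    A32 = A.astype(np.float32)
    # Step 1: factor/solve in lower precision (the O(n^3) cost).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                    # Step 2: residual in PH, O(n^2)
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            break
        z = np.linalg.solve(A32, r.astype(np.float32))   # Step 3: correction in PI, O(n^2)
        x = x + z.astype(np.float64)                     # Step 4: update in PH, O(n)
    return x

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # keep the matrix well conditioned
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
```

The key point the slide makes is visible here: only the cheap O(n²) residual and O(n) update run in high precision, yet the final x reaches double-precision accuracy.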

SLIDE 7

How Does the Mixed-Precision Algorithm Work?

Ax = b has an exact solution x. Starting from the low-precision solution x at iteration 1, compute the residual r = b − Ax, solve A×z = r for the correction z (i.e., z = A⁻¹×r), and update x for iteration 2; repeat until x converges to the exact solution.

Successful convergence depends on the condition number of the matrix.
SLIDE 8

Mixed precision linear solvers for GPGPUs and FPGAs

GPGPUs:
  Single precision for LUPP → double-precision refinement → Converge?
  Yes → DONE; No → double precision for LUPP → DONE

FPGAs:
  Arbitrary precision for LUPP → arbitrary-precision refinement → Converge?
  Yes → DONE; No → (higher) arbitrary precision for LUPP → DONE
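The GPGPU decision flow above can be sketched as follows, again with NumPy float32/float64 standing in for single-/double-precision kernels; `solve_with_fallback` and the synthetic ill-conditioned test matrix are illustrative, not from the deck:

```python
import numpy as np

def solve_with_fallback(A, b, tol=1e-12, max_iter=30):
    """Single-precision LUPP + double refinement; redo in double if it fails to converge."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                    # double-precision residual
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            return x, "mixed"                            # Converge? Yes -> DONE
        z = np.linalg.solve(A32, r.astype(np.float32))
        x = x + z.astype(np.float64)
    return np.linalg.solve(A, b), "double"               # Converge? No -> double LUPP

rng = np.random.default_rng(2)
n = 40
b = rng.standard_normal(n)

# Well-conditioned matrix: the mixed-precision path should succeed.
A_good = rng.standard_normal((n, n)) + n * np.eye(n)
x_good, path_good = solve_with_fallback(A_good, b)

# Matrix with condition number ~1e12 (built via an SVD): far beyond what a
# float32 factorization can support, so refinement diverges and we fall back.
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A_bad = (Q1 * np.logspace(0, -12, n)) @ Q2
x_bad, path_bad = solve_with_fallback(A_bad, b)
```

This mirrors the slide's point: the GPU flow has only one fallback level (double), whereas the FPGA flow can retry at any intermediate precision.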

SLIDE 9

Benefits of Mixed-Precision Linear Solvers for FPGAs

  • 1. FPGAs can employ arbitrary-precision computation (selecting a precision based on the condition number).

  • 2. Lower precision → smaller, faster ALUs → more ALUs (quadratic growth), yielding a significant performance difference between lower- and higher-precision multiplication on FPGAs (Table I).

Table I. Number of DSP48Es for a multiplier on Xilinx XC5VLX330T

  Exponent bits | Mantissa bits | DSP48Es per multiplier
  8             | 16            | 1
  8 (single)    | 17–23         | 2  (5x speedup)
  11            | 24–33         | 4
  11            | 34–40         | 6
  11            | 41–50         | 9
  11            | 51            | 12
  11 (double)   | 52            | 10 (1x)

SLIDE 10

Power-Aware Performance

Measure against dynamic power consumption: the incremental performance benefit of one additional watt.

Total power U = static S + dynamic D = S + C × Volt² × freq = S + α × freq, so α_MAX = (U_MAX − S)/freq = D_MAX/freq.

Three kinds of performance metric:
  • F := # of Flops per second (time-based performance)
  • F_CLK := # of Flops per clock cycle (clock-based performance)
  • F_WATT-D := # of Flops per watt of dynamic power (power-based performance)

Relation between clock-based and power-based performance:
  F_CLK = F/freq, and max(F_WATT-D) = F/D_MAX = F/(α_MAX × freq) = F_CLK/α_MAX

Design the logic to obtain maximum Flops/cycle to save power!!
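A small worked example of these relations, with assumed illustrative numbers (not measurements from the presentation):

```python
# Assumed illustrative values: a 200 W budget, 80 W static power, 1 GHz clock,
# and 50 GFlops/sec achieved performance.
U_MAX = 200.0      # maximum total power, watts
S = 80.0           # static power, watts
FREQ = 1.0e9       # clock rate, Hz
F = 50.0e9         # achieved performance, flops/sec

D_MAX = U_MAX - S              # maximum dynamic power: D = U - S
ALPHA_MAX = D_MAX / FREQ       # alpha_MAX = (U_MAX - S)/freq = D_MAX/freq
F_CLK = F / FREQ               # clock-based performance, flops/clock-cycle
F_WATT_D = F / D_MAX           # power-based performance, flops/dynamic watt

# The slide's identity: max(F_WATT-D) = F/(alpha_MAX * freq) = F_CLK/alpha_MAX
assert abs(F_WATT_D - F_CLK / ALPHA_MAX) / F_WATT_D < 1e-12
```

Because α_MAX is fixed by the power budget and clock, maximizing Flops/cycle directly maximizes Flops per dynamic watt, which is the slide's design takeaway.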

SLIDE 11

Methodology

Performance estimation for GPGPUs :

MAGMA v0.2 with Tesla C1060 and Intel Xeon 2.93GHz Multi Core.

Performance estimation for FPGAs :

Performance modeling for the Xilinx XC5VLX330T, based on our previous work.

Table II. Number of PEs on Xilinx XC5VLX330T

  Condition number | Mantissa bits | # of PEs
  1 – 2^17         | 1–16          | 192
  2^17 – 2^24      | 17–23         | 96
  2^24 – 2^34      | 24–33         | 48
  2^34 – 2^41      | 34–40         | 32
  2^41 – 2^51      | 41–50         | 21
  2^51 – 2^52      | 51            | 16
  2^52 – 2^53      | 52            | 19

Precision choice for FPGA performance estimation:
  Mantissa bit width M = log2(condition number) − 1
  Exponent bit width E = 8 (if M ≤ 23) or 11 (if 24 ≤ M ≤ 52)

FPGA performance: 2 Flops (per multiply-add) × number of PEs × clock rate
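The precision choice and performance model can be sketched as follows; the PE counts come from the table above, while the 200 MHz clock and the function names are assumed placeholders:

```python
import math

# From the PE table above: (upper bound on log2(condition number), # of PEs)
PE_TABLE = [(17, 192), (24, 96), (34, 48), (41, 32), (51, 21), (52, 16), (53, 19)]

def choose_precision(cond):
    """M = log2(cond) - 1 (rounded up), E = 8 if M <= 23 else 11, per the slide."""
    M = max(1, math.ceil(math.log2(cond)) - 1)
    E = 8 if M <= 23 else 11
    return M, E

def estimate_gflops(cond, clock_hz=200e6):
    """FPGA performance model: 2 flops x #PEs x clock rate (200 MHz is assumed)."""
    log2cond = math.log2(cond)
    for max_log2, pes in PE_TABLE:
        if log2cond <= max_log2:
            return 2 * pes * clock_hz / 1e9
    raise ValueError("condition number exceeds double-precision range")
```

For example, a condition number of 2^20 selects a 19-bit mantissa with an 8-bit exponent, which maps to 96 PEs in the table.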

SLIDE 12

Tesla C1060 – MAGMA v0.2

[Figure: Mixed-precision solver performance on the hybrid system (Tesla C1060 + Intel Xeon 2.93 GHz). Axes: matrix size (log2 base) and infinite-norm condition number (log2 base); surfaces show {F_WATT-D, F_CLK, F} in GFlops/Watt, Flops/clock-cycle, and GFlops/sec.]

SLIDE 13

FPGA (XC5VLX330T)

[Figure: Mixed-precision solver performance on the FPGA (XC5VLX330T). Axes: matrix size (log2 base) and infinite-norm condition number (log2 base); surfaces show {F_WATT-D, F_CLK, F} in GFlops/Watt, Flops/clock-cycle, and GFlops/sec.]

SLIDE 14

[Figure: Mixed-precision solver performance for 8192×8192 matrices: GFlops vs. infinite-norm condition number (log2 base). Blue (o): FPGA; green (x): GPU.]

SLIDE 15

[Figure: Mixed-precision solver performance for 8192×8192 matrices: Flops/clock-cycle vs. infinite-norm condition number (log2 base). Blue (o): FPGA; green (x): GPU.]

SLIDE 16

[Figure: Mixed-precision solver performance for 8192×8192 matrices: GFlops/Watt vs. infinite-norm condition number (log2 base). Blue (o): FPGA; green (x): GPU.]

SLIDE 17

Discussions and Conclusions

  • FPGAs can employ arbitrary precisions, while GPUs can employ only single or double precision.

  • In order to save power, it is important to design the logic to obtain good clock-based performance.

  • In this mixed-precision linear solver case study, the FPGA shows better power-based performance than the GPGPU, since the flexibility of FPGA design choices yields higher clock-based performance.

SLIDE 18

Thank you. Any questions?