SLIDE 1

Half Precision Benchmarking for HPC

S7676

Piotr Luszczek

May 11, 2017, GPU Technology Conference, San Jose, CA, USA

SLIDE 2

Major Floating Point Formats from IEEE 754 (2008)

Precision   Width   Exponent bits   Mantissa bits   Epsilon     Max
Quadruple   128     15              112             O(10^-34)   1.2 x 10^4932
Extended    80      15              64              O(10^-19)   -
Double      64      11              52              O(10^-16)   1.8 x 10^308
Single      32      8               23              O(10^-7)    3.4 x 10^38
Half*       16      5               10              O(10^-3)    65504

*Only storage format is specified
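
For reference (standard IEEE 754 arithmetic, not stated on the slide): with 10 stored mantissa bits the half-precision machine epsilon is 2^-10 ≈ 9.8 x 10^-4, i.e. O(10^-3), and with 5 exponent bits (bias 15) the largest finite value is (2 - 2^-10) * 2^15 = 65504, matching the table.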

SLIDE 3

Programming Data Types: C/C++ and Fortran

  • long double: 80 or 128 bits

  • double: 64 bits

  • float: 32 bits

  • __half (short float): 16 bits

  • __half2 (cuda_fp16.h): 2x16 bits (see the CUDA sketch after this list)

  • real*16: 128 bits

  • real*8: 64 bits

  • real*4: 32 bits

  • real*2: 16 bits
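
A minimal CUDA sketch (not from the slides), just to exercise the __half and __half2 types listed above; __float2half, __half2half2, __low2half, and __half2float are the conversion intrinsics from cuda_fp16.h, and each element makes a 32 -> 16 -> 32 bit round trip.

// Build with nvcc; prints the rounding introduced by FP16 storage.
#include <cuda_fp16.h>
#include <cstdio>

__global__ void roundtrip(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __half  h  = __float2half(in[i]);        // 32-bit -> 16-bit storage
        __half2 h2 = __half2half2(h);            // packed 2x16 bits
        out[i] = __half2float(__low2half(h2));   // back to 32 bits
    }
}

int main() {
    const int n = 4;
    float host_in[n] = {1.0f, 0.1f, 65504.0f, 1e-4f}, host_out[n];
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    roundtrip<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(host_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("%g -> %g\n", host_in[i], host_out[i]);  // e.g. 0.1 -> 0.0999756
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}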

SLIDE 4

FP16 Hardware (Current and Future)

  • AMD
    MI5, MI8, MI25

  • ARM
    NEON VFP FP16 in ARMv8.2-A

  • Intel
    Xeon CPUs (vectorized conversions)

  • NVIDIA
    Pascal: P100, TX1, TX2, ...
    Volta: Tensor Core

  • Supercomputers
    TSUBAME 3.0 (Tokyo Tech)

  • Cloud
    Google with P100 (coming soon)
    Azure: Pascal debut in 2017

SLIDE 5

Applications Using FP16

  • Machine Learning
    Deep Neural Networks
    Visualization and image processing (OpenVX)

  • Linear Algebra
    Eigen
    University of Tennessee libraries and projects

  • Molecular dynamics
    Gromacs

SLIDE 6

Iterative Refinement

  • In exact arithmetic
    x_1 ← x_0 + A^-1 (b - A x_0)

  • In finite precision A^-1 is not available due to
    Round-off error
    Lower-precision LU factors

  • In practice, Richardson iteration is often used (see the sketch after this list)
    x_{k+1} ← x_k + A^-1 (b - A x_k)
    Convergence depends on the spectrum

  • Textbook result w.r.t. I - A^-1 A
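
A small, self-contained C++ sketch (not from the talk) of the Richardson iteration above; a Jacobi (diagonal) scaling stands in for the approximate A^-1 purely for illustration, whereas the talk uses lower-precision LU factors.

#include <cstdio>
#include <vector>
using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static Vec matvec(const Mat& A, const Vec& x) {   // y = A x
    Vec y(x.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

int main() {
    // Small, diagonally dominant system A x = b with solution (1, 2, 3).
    Mat A = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
    Vec b = {6, 12, 14};
    Vec x(3, 0.0);                                // x_0 = 0

    for (int k = 0; k < 30; ++k) {                // x_{k+1} = x_k + M^-1 (b - A x_k)
        Vec Ax = matvec(A, x);
        for (size_t i = 0; i < x.size(); ++i)
            x[i] += (b[i] - Ax[i]) / A[i][i];     // diagonal "approximate inverse"
    }
    printf("x = %.6f %.6f %.6f\n", x[0], x[1], x[2]);
    return 0;
}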

SLIDE 7

Classic Iterative Refinement Implementation

  • Linear system Ax = b may be solved through LU factorization
    L, U, P ← lu_factor(A)
    y ← L \ (P b)
    x ← U \ y
    r ← b - A x   (use higher precision to accumulate; see the sketch after this list)
    z ← U \ (L \ (P r))
    x_final ← x + z   (use higher precision)

  • All operations performed in the same floating-point precision
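
A small C++ sketch (not from the talk) of the "use higher precision to accumulate" step above: the residual of a single-precision system is accumulated in double and rounded back only once at the end.

#include <cstdio>
#include <vector>

// r = b - A x for a float system, with the dot products accumulated in double.
std::vector<float> residual(const std::vector<float>& A,   // n x n, row-major
                            const std::vector<float>& x,
                            const std::vector<float>& b, int n) {
    std::vector<float> r(n);
    for (int i = 0; i < n; ++i) {
        double acc = b[i];                                  // 64-bit accumulator
        for (int j = 0; j < n; ++j)
            acc -= double(A[i * n + j]) * double(x[j]);
        r[i] = float(acc);                                  // round once at the end
    }
    return r;
}

int main() {
    std::vector<float> A = {2, 1, 1, 3}, x = {1, 1}, b = {3, 4};
    std::vector<float> r = residual(A, x, b, 2);
    printf("r = %g %g\n", r[0], r[1]);                      // exactly 0 0 here
    return 0;
}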
SLIDE 8

Mixed-Precision Iterative Refinement

  • Linear system in 64-bit precision Ax = b may be solved through LU factorization in 16-bit precision (see the C++ sketch after this list):
    L, U, P ← lu(A)         O(n^3)   (16 bits)
    y ← L \ (P b)           O(n^2)   (16 bits)
    x ← U \ y               O(n^2)   (16 bits)
    r ← b - A x             O(n^2)   (64 bits)
    z ← U \ (L \ (P r))     O(n^2)   (16 bits)
    x_final ← x + z         O(n)     (64 bits)

  • Requirement: matrix A must be well conditioned in 16 bits, κ(A) < 10^5
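
A self-contained C++ sketch of the structure above, not the talk's GPU implementation: float stands in for the low-precision factorization (host C++ has no native FP16 type) and double plays the 64-bit residual and update, so only the precision split and the factor-once-then-iterate pattern are illustrated.

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// LU factorization with partial pivoting, entirely in low (float) precision.
static void lu_factor(std::vector<float>& A, std::vector<int>& piv, int n) {
    for (int k = 0; k < n; ++k) {
        int p = k;                                           // pivot search
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i * n + k]) > std::fabs(A[p * n + k])) p = i;
        piv[k] = p;
        for (int j = 0; j < n; ++j) std::swap(A[k * n + j], A[p * n + j]);
        for (int i = k + 1; i < n; ++i) {                    // eliminate column k
            A[i * n + k] /= A[k * n + k];
            for (int j = k + 1; j < n; ++j)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}

// Overwrite r with z = U \ (L \ (P r)), solved in low precision.
static void lu_solve(const std::vector<float>& LU, const std::vector<int>& piv,
                     std::vector<double>& r, int n) {
    std::vector<float> z(r.begin(), r.end());                // down-cast RHS
    for (int k = 0; k < n; ++k) std::swap(z[k], z[piv[k]]);  // apply P
    for (int i = 0; i < n; ++i)                              // forward solve with L
        for (int j = 0; j < i; ++j) z[i] -= LU[i * n + j] * z[j];
    for (int i = n - 1; i >= 0; --i) {                       // backward solve with U
        for (int j = i + 1; j < n; ++j) z[i] -= LU[i * n + j] * z[j];
        z[i] /= LU[i * n + i];
    }
    for (int i = 0; i < n; ++i) r[i] = z[i];                 // up-cast correction
}

int main() {
    const int n = 3;
    std::vector<double> A = {4, 1, 0, 1, 4, 1, 0, 1, 4}, b = {6, 12, 14};
    std::vector<float> A_lo(A.begin(), A.end());             // down-cast A once
    std::vector<int> piv(n);
    lu_factor(A_lo, piv, n);                                 // O(n^3), low precision

    std::vector<double> x(n, 0.0);
    for (int it = 0; it < 5; ++it) {                         // refinement loop
        std::vector<double> r(b);                            // r = b - A x, 64-bit
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) r[i] -= A[i * n + j] * x[j];
        lu_solve(A_lo, piv, r, n);                           // z from low-precision LU
        for (int i = 0; i < n; ++i) x[i] += r[i];            // x += z, 64-bit
    }
    printf("x = %.6f %.6f %.6f\n", x[0], x[1], x[2]);        // converges to (1, 2, 3)
    return 0;
}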
SLIDE 9

Early Error Analysis Results

  • Standard backward stability
  • Need to generalize to two machine precisions: 16-bit and 64-bit
  • Details: see the paper and tech report
  • Summary: it works if the matrix cooperates

(A + E) x = b, where ||E|| ≤ φ(n) ε ||A||

lim_{k→∞} ||x - x_k|| = ForwardError(ε_16, ε_64)

lim_{k→∞} ||b - A x_k|| / (||A|| ||x_k||) = BackwardError(ε_16, ε_64)

SLIDE 10

Hardware/Software Support: Assembly and Intrinsics

  • x86
    CVTSH_SS, CVTSS_SH
    emmintrin.h: _cvtss_sh(), _cvtsh_ss()
    f16cintrin.h → x86intrin.h: _mm_cvtph_ps(), _mm_cvtps_ph(), _mm256_cvtph_ps(), _mm256_cvtps_ph() (see the sketch after this list)

  • PTX
    cvt.f16.*
    fma.f16x2

  • ARM
    vld1_f16, vst1_f16, vcvt_f16_f32
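
A minimal host-side sketch (not from the slides) using the F16C intrinsics listed above to round-trip four floats through FP16 storage; it assumes an F16C-capable CPU and a compiler flag such as -mf16c.

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(16) float in[4]  = {1.0f, 0.1f, 65504.0f, 1e-4f};
    alignas(16) float out[4];

    __m128  f32  = _mm_load_ps(in);
    __m128i f16  = _mm_cvtps_ph(f32, _MM_FROUND_TO_NEAREST_INT);  // 4x fp32 -> 4x fp16
    __m128  back = _mm_cvtph_ps(f16);                             // 4x fp16 -> 4x fp32
    _mm_store_ps(out, back);

    for (int i = 0; i < 4; ++i)
        printf("%g -> %g\n", in[i], out[i]);   // shows the fp16 rounding error
    return 0;
}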

SLIDE 11

In High-Level Programming Environments

  • Julia

A = zeros(Float16, N, N); b = zeros(Float16, N,N);

A[:,:] = randn(N, N); b[:,:] = randn(N,1);

x = A \ b; # works OK

  • Python

numpy.fmoat16

linalg.solve(randn(N,N,fmoat16), randn(N,1,fmoat16))

T ypeError: array type fmoat16 is unsupported in linalg

  • MATLAB

Must use MEX fjles

SLIDE 12

Autotuning with FP16, FP32, and FP64

[Figure: autotuning results for xGETRF()* on a P100; N = 35000, 12 Tflop/s scale]
SLIDE 13

Autotuning with FP16, FP32, FP64 (color)

[Figure: the same xGETRF()* autotuning results on a P100 in color; N = 35000, 12 Tflop/s scale]
SLIDE 14

Best Performers for FP16, FP32, FP64

[Figure: best-performing xGETRF()* variants for FP64, FP32, and FP16 on a P100; N = 35000, 12 Tflop/s scale]
SLIDE 15

Convergence Results: All Precisions

[Figure: ||b - Ax||∞ versus iteration count (30 iterations); levels near 10^-4 (fp16), 10^-8 (fp32), 10^-16 (fp64)]

SLIDE 16

Example Convergence: FP64 to FP32 / FP16

[Figure: ||b - Ax||∞ versus iteration count (30 iterations)]

SLIDE 17

Example of Slow Convergence: FP64 → FP16

[Figure: ||b - Ax||∞ versus iteration count (100 iterations)]

SLIDE 18

Future Work

  • Test on new hardware
    IBM/NVIDIA Minsky
    ARM/Cavium
    Tegra/Jetson

  • New algorithmic approaches
    New iterative schemes
    New precision tweaks to increase accuracy

  • Verification
    Up-casting
    Down-casting
    Convergence

  • Performance
    Improve 16-bit kernels
    Use 32-bit kernels on hardware without FP16 support