Jacobi-Based Eigenvalue Solver on GPU, Lung-Sheng Chien, NVIDIA

SLIDE 1

Jacobi-Based Eigenvalue Solver on GPU

Lung-Sheng Chien, NVIDIA lchien@nvidia.com

SLIDE 2

Outline

  • Symmetric eigenvalue solver
  • Experiment
  • Applications
  • Conclusions
SLIDE 3

Symmetric eigenvalue solver

  • The standard form is A x = λ x, where A = Aᵀ
  • QR algorithm is state-of-the-art, the most popular choice, and implemented in LAPACK
  • Jacobi method was introduced in 1846, prior to the QR algorithm. The parallelism
    of the Jacobi method makes a GPU implementation appealing
  • Naming convention

    syevd (heevd): QR algorithm
    syevj (heevj): Jacobi method

SLIDE 4

QR algorithm (syevd)

  • sytrd : tridiagonalization, A = Q T Qᵀ with T tridiagonal
  • stedc : spectrum of the tridiagonal matrix, T = W Σ Wᵀ
  • ormtr : form the eigenvectors, V = Q W

SLIDE 5

Example of QR algorithm

  • sytrd:

    A = [[1 2 3 4], [2 5 6 7], [3 6 8 9], [4 7 9 10]]
    T = tridiag( diagonal [1, 22.48, 0.286, 0.232], off-diagonal [−5.39, 2.768, −0.18] )

  • stedc:

    Σ = diag( −0.81, 0.185, 0.558, 24.1 )

    together with the eigenvector matrix W of T; ormtr then forms V = Q W

SLIDE 6

Pros and Cons

  • Step 1: [GPU] sytrd has 60% BLAS2 and 40% BLAS3
  • Step 2: [CPU] stedc performs the QR algorithm sequentially
  • Step 3: [GPU] ormtr is a sequence of Householder products

  • The runtime of stedc is about the same as sytrd
  • The CPU is occupied during the eigenvalue solve
  • The performance depends on the CPU as well

  Find an alternative to replace stedc

    routine   time (sec)
    sytrd         4.59
    stedc        14.55
    ormtr         3.65

    n = 4096, double precision

SLIDE 7

Jacobi method (syevj)

  • A series of rotations; each rotation eliminates one off-diagonal entry

    A_1 = A,   A_{k+1} = J_kᵀ A_k J_k

    where J_k = J(p, q, θ) is a plane rotation chosen to zero the (p, q) entry

  • Monotone property

    off(A_{k+1})² = off(A_k)² − 2 (A_k)_{pq}²,   so   off(A_{k+1}) < off(A_k)

    where off(A)² = Σ_{i≠j} |a_{ij}|²

  • With a proper termination condition

    Σ = lim_{k→∞} A_k,   V = J_1 J_2 ⋯
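A single rotation and the monotone property can be checked with a small NumPy sketch (a serial illustration using the classical textbook formulas, not the cuSOLVER implementation; `jacobi_rotation` and `off` are hypothetical helper names):

```python
import numpy as np

def off(A):
    """Frobenius norm of the off-diagonal part of A."""
    return np.sqrt(np.sum(A**2) - np.sum(np.diag(A)**2))

def jacobi_rotation(A, p, q):
    """Rotation J in the (p, q) plane such that (J.T @ A @ J)[p, q] = 0."""
    J = np.eye(A.shape[0])
    if A[p, q] != 0.0:
        tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
        t = np.sign(tau) / (abs(tau) + np.hypot(1.0, tau)) if tau else 1.0
        c = 1.0 / np.hypot(1.0, t)      # cos(theta)
        s = t * c                       # sin(theta)
        J[p, p] = J[q, q] = c
        J[p, q], J[q, p] = s, -s
    return J

A = np.array([[1., 2, 3, 4],
              [2, 5, 6, 7],
              [3, 6, 8, 9],
              [4, 7, 9, 10]])
J = jacobi_rotation(A, 0, 1)
B = J.T @ A @ J                         # eliminates the (1,2) entry
print(abs(B[0, 1]) < 1e-12)             # True: the entry is zeroed
# monotone property: off(B)^2 = off(A)^2 - 2*A[0,1]^2
print(np.isclose(off(B)**2, off(A)**2 - 2 * A[0, 1]**2))   # True
```

The matrix is the example from slide 5; after this rotation B[0, 0] ≈ 0.17 and B[1, 1] ≈ 5.83, matching slide 8.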

SLIDE 8

Example of Jacobi method

  • Eliminate (1,2):

    A   = [[1 2 3 4], [2 5 6 7], [3 6 8 9], [4 7 9 10]]
    A_1 = [[0.17 0 0.48 1.02], [0 5.83 6.70 8.00], [0.48 6.70 8 9], [1.02 8.00 9 10]]

  • Eliminate (1,3):

    A_2 = [[0.14 −0.4 0 0.47], [−0.4 5.83 6.70 8.00], [0 6.70 8.03 9.05], [0.47 8.00 9.05 10]]

    off(A) = 19.74842,   off(A_1) = 19.54482,   off(A_2) = 19.53325

  • Monotone property holds: off(A) > off(A_1) > off(A_2)
SLIDE 9
  • Eliminate (1,4):

    A_3 = [[0.12 −0.8 −0.4 0], [−0.8 5.83 6.70 7.97], [−0.4 6.70 8.03 9.03], [0 7.97 9.03 10.02]]

  • Eliminate (2,3):

    A_4 = [[0.12 −0.32 −0.84 0], [−0.32 0.16 0 0.23], [−0.84 0 13.7 12], [0 0.23 12 10.02]]

    off(A_3) = 19.52188,   off(A_4) = 17.08458

  • (1,4) and (2,3) operate on non-overlapping rows and columns
SLIDE 10

Cyclic Jacobi

    (1)  V = I
    (2)  while off(A) > tol
    (3)    for p = 1 : n−1
    (4)      for q = p+1 : n
    (5)        compute J(p, q, θ)
    (6)        A = Jᵀ A J      // row and column rotation of A
    (7)        V = V J         // column rotation of V
    (8)      end // for q
    (9)    end // for p
    (10) end // while

  A sweep consists of n(n−1)/2 rotations. Quadratic convergence is measured in the
  number of sweeps, and tol controls the accuracy.
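The pseudocode above translates almost line-for-line into a serial NumPy sketch (a reference illustration, not the GPU implementation; `cyclic_jacobi` is a hypothetical helper name):

```python
import numpy as np

def off(A):
    return np.sqrt(np.sum(A**2) - np.sum(np.diag(A)**2))

def cyclic_jacobi(A, tol=1e-12, max_sweeps=30):
    """Cyclic Jacobi: sweep over all pairs p < q until off(A) <= tol.
    Returns (unsorted eigenvalues, eigenvectors V, sweeps used)."""
    A = A.copy()
    n = A.shape[0]
    V = np.eye(n)                                  # (1)
    sweeps = 0
    while off(A) > tol and sweeps < max_sweeps:    # (2)
        for p in range(n - 1):                     # (3)
            for q in range(p + 1, n):              # (4)
                if A[p, q] == 0.0:
                    continue
                tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])   # (5)
                t = np.sign(tau) / (abs(tau) + np.hypot(1.0, tau)) if tau else 1.0
                c = 1.0 / np.hypot(1.0, t)
                s = t * c
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J                    # (6) row + column rotation of A
                V = V @ J                          # (7) column rotation of V
        sweeps += 1
    return np.diag(A).copy(), V, sweeps

A = np.array([[1., 2, 3, 4],
              [2, 5, 6, 7],
              [3, 6, 8, 9],
              [4, 7, 9, 10]])
w, V, sweeps = cyclic_jacobi(A)
print(np.round(np.sort(w), 2))   # slide 5 reports Σ ≈ diag(−0.81, 0.185, 0.558, 24.1)
print(sweeps)                    # a handful of sweeps suffice for a 4x4 matrix
```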

SLIDE 11

Parallel Jacobi

  • Eliminate (1,2) and (3,4) simultaneously:

    A   = [[1 2 3 4], [2 5 6 7], [3 6 8 9], [4 7 9 10]]
    A_1 = [[0.17 0 −0.32 1.07], [0 5.83 −0.35 10.4], [−0.32 −0.35 −0.05 0], [1.07 10.4 0 18.06]]

    sweep   off(A)
      1     1.195820
      2     2.849376E-1
      3     2.355936E-4
      4     1.279163E-15

  • There are n/2 pairs of non-overlapping (p, q), which can be processed in parallel
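A standard way to enumerate the n/2 non-overlapping pairs is a round-robin (tournament) ordering. This pure-Python sketch is illustrative; the actual ordering used by syevj is not specified in the slides:

```python
def round_robin_pairs(n):
    """Yield n-1 rounds of n/2 disjoint (p, q) pairs (n even).
    Every pair p < q appears exactly once per sweep, and the pairs
    within one round share no rows/columns, so their rotations can
    be applied in parallel."""
    players = list(range(n))
    for _ in range(n - 1):
        yield [(min(players[i], players[n - 1 - i]),
                max(players[i], players[n - 1 - i]))
               for i in range(n // 2)]
        # round-robin rotation: fix players[0], rotate the rest by one
        players = [players[0], players[-1]] + players[1:-1]

rounds = list(round_robin_pairs(4))
print(rounds)   # 3 rounds of 2 disjoint pairs, 6 = n(n-1)/2 pairs in total
```

For n = 4 this yields rounds such as {(1,4), (2,3)}, matching the pair used on slide 9.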
SLIDE 12

Block Jacobi

  • Partition A into blocks

    A = [ A_11  ⋯  A_1N
           ⋮    ⋱   ⋮
          A_N1  ⋯  A_NN ]

  • Eliminate the off-diagonal block (p, q) by a Jacobi rotation
  • The basic building block is batched syevj
  • Column and row rotations are done by efficient GEMM
  • Propagate a proper termination condition
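A serial NumPy sketch of one block-Jacobi elimination step, with `numpy.linalg.eigh` standing in for batched syevj and plain matrix products standing in for the GEMM rotations (an illustration of the scheme, not the cuSOLVER code; `block_jacobi_step` is a hypothetical name):

```python
import numpy as np

def off(A):
    return np.sqrt(np.sum(A**2) - np.sum(np.diag(A)**2))

def block_jacobi_step(A, V, p, q, b):
    """Eliminate the off-diagonal block (p, q) of symmetric A (block size b).
    The 2b x 2b pivot subproblem is diagonalized by a dense eigensolver,
    and the rotation is applied to full rows/columns with GEMMs."""
    idx = np.r_[p * b:(p + 1) * b, q * b:(q + 1) * b]
    _, U = np.linalg.eigh(A[np.ix_(idx, idx)])   # stands in for batched syevj
    A[:, idx] = A[:, idx] @ U                    # column rotation (GEMM)
    A[idx, :] = U.T @ A[idx, :]                  # row rotation (GEMM)
    V[:, idx] = V[:, idx] @ U                    # accumulate eigenvectors
    return A, V

rng = np.random.default_rng(0)
n, b = 12, 4                                     # three 4x4 blocks
M = rng.standard_normal((n, n))
A0 = (M + M.T) / 2                               # symmetric test matrix
A, V = A0.copy(), np.eye(n)
for _ in range(10):                              # sweeps over all block pairs
    for p in range(n // b):
        for q in range(p + 1, n // b):
            A, V = block_jacobi_step(A, V, p, q, b)
print(off(A))                                    # driven toward machine zero
print(np.allclose(V @ np.diag(np.diag(A)) @ V.T, A0))
```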
SLIDE 13

Comparison

                                  QR algorithm (syevd)          Jacobi method (syevj)
  Basic routines                  sytrd, stedc, ormtr           batched syevj, GEMM
  GPU friendly?                   sytrd (yes),                  batched syevj (yes),
                                  stedc (no: CPU with a         GEMM (yes)
                                  single thread), ormtr (yes)
  Scalable for next-gen GPUs      sytrd (yes), stedc (no),      batched syevj (yes),
                                  ormtr (yes)                   GEMM (yes)
  Computational complexity        low                           high
  Good for small matrices         No                            Yes, with batched syevj
  Approximate eigenvalues         No, it computes exact         Yes, accuracy is
                                  eigenvalues                   controlled by 'tol'
  Support for 's','d','c','z'     Yes                           Yes
  Stable algorithm                Yes                           Yes
  Quadratic convergence           Yes                           Yes

SLIDE 14

Complexity Analysis

  • The QR algorithm costs about as much as 2 sweeps of the Jacobi method
  • To reach machine zero, the Jacobi method needs about 7 sweeps in single precision
    and 15 sweeps in double precision
  • Although the complexity of the Jacobi method is higher, its parallelism makes it
    faster on small matrices. Once the matrix gets bigger, the Jacobi method suffers
    from its higher complexity

SLIDE 15

Outline

  • Symmetric eigenvalue solver
  • Experiment
  • Applications
  • Conclusions
SLIDE 16

Experimental Setup

  • CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, dual socket
  • GPU: K40
  • Comparison against MKL 11.0.4
  • Ngflops = 2·n³ / time, i.e. Gflop/s normalized to the GEMM work count

                                 K40           E5-2690 v2
    Number of processor cores    2880          10
    Core clock                   754 MHz       3 GHz
    Bandwidth                    180 GB/s      12 GB/s
    SGEMM                        2768 Gflops   386 Gflops
    DGEMM                        1221 Gflops   185 Gflops

http://ark.intel.com/products/75279/Intel-Xeon-Processor-E5-2690-v2-25M-Cache-3_00-GHz https://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf
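The Ngflops normalization is easy to reproduce. For example, using the slide-6 phase timings of syevd at n = 4096 (assuming the total runtime is the sum of the three phases):

```python
def ngflops(n, seconds):
    """Normalized Gflop/s used throughout the deck: 2*n^3 / time."""
    return 2 * n**3 / seconds / 1e9

# slide-6 timings for the three phases of syevd at n = 4096, double precision
total = 4.59 + 14.55 + 3.65             # sytrd + stedc + ormtr
print(round(ngflops(4096, total), 2))   # 6.03, close to the 5.93 listed for
                                        # cusolver syevd at n = 4096 below
```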

SLIDE 17

Performance of syevj

    n      cusolver syevd   syevj    MKL syevd
    32     0.008            0.04     0.004
    64     0.03             0.15     0.04
    128    0.11             0.46     0.19
    256    0.50             1.05     1.30
    512    1.42             2.88     5.19
    1024   3.00             4.56     15.26
    2048   5.56             5.96     32.66
    4096   5.93             7.35     43.66
    8192   4.85             9.80     48.35

    Ngflops, double precision

  • The Jacobi method is faster than the QR algorithm for small matrices, up to size 256
  • The Jacobi method falls behind for large matrices due to its higher complexity
SLIDE 18

Performance of batched syevj

    Data type   MKL Ngflops   Speedup
    S           0.73          7.3
    D           1.59          1.4
    C           3.13          3.0
    Z           4.55          0.8

    MKL with 16 threads, n = 32

  • Batched syevj relies on shared memory, so the matrix dimension is limited to 32
  • The performance stabilizes once the GPU is fully utilized
  • Batched syevj is faster than MKL with 16 threads for 's', 'd' and 'c'

SLIDE 19

Outline

  • Symmetric eigenvalue solver
  • Experiment
  • Applications
  • Conclusions
SLIDE 20

Application 1: SVD

  • SVD computes the singular values Σ and the left/right singular vectors U/V:

    A = U Σ Vᵀ

  • LAPACK uses the QR algorithm
  • The Jacobi method can be applied to the SVD because the monotone property still holds
  • Naming convention
      gesvd  : QR algorithm
      gesvdj : Jacobi method
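A one-sided Jacobi SVD sketch in NumPy illustrates why the method carries over: orthogonalizing the columns of A by plane rotations is the Jacobi eigenvalue iteration applied implicitly to AᵀA (an illustration only, not gesvdj; `jacobi_svd` is a hypothetical name):

```python
import numpy as np

def jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD: rotate column pairs of A until all columns are
    mutually orthogonal; singular values are the final column norms.
    Returns (U, s, V) with A = U @ diag(s) @ V.T (A needs full column rank)."""
    A = A.copy()
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                apq = A[:, p] @ A[:, q]          # entries of A.T @ A
                app = A[:, p] @ A[:, p]
                aqq = A[:, q] @ A[:, q]
                if abs(apq) <= tol * np.sqrt(app * aqq):
                    continue
                converged = False
                # rotation zeroing the (p, q) entry of A.T @ A
                tau = (aqq - app) / (2.0 * apq)
                t = np.sign(tau) / (abs(tau) + np.hypot(1.0, tau)) if tau else 1.0
                c = 1.0 / np.hypot(1.0, t)
                s = t * c
                G = np.array([[c, s], [-s, c]])
                A[:, [p, q]] = A[:, [p, q]] @ G
                V[:, [p, q]] = V[:, [p, q]] @ G
        if converged:
            break
    s = np.linalg.norm(A, axis=0)                # singular values
    order = np.argsort(s)[::-1]
    s = s[order]
    U = A[:, order] / s
    return U, s, V[:, order]

A = np.array([[1., 2, 3], [4, 5, 6], [7, 8, 10], [2, 1, 0]])
U, s, V = jacobi_svd(A)
print(np.round(s, 4))
print(np.allclose(U @ np.diag(s) @ V.T, A))      # True
```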
SLIDE 21

Performance of gesvdj

  • The Jacobi method is faster than the QR algorithm for small matrices, up to size 512
  • For large matrices, the Jacobi method is still not bad for 's' and 'c' compared to MKL

    n      cusolver gesvd   gesvdj   MKL gesvd
    32     0.004            0.036    0.089
    64     0.014            0.118    0.034
    128    0.064            0.395    0.118
    256    0.203            1.089    0.768
    512    0.628            1.872    1.734
    1024   1.688            2.489    5.487
    2048   3.944            2.749    13.083
    4096   6.178            2.714    18.111
    8192   8.242            2.65     11.061

    Ngflops, double precision

SLIDE 22

Performance of batched gesvdj

  • The matrix size is limited to 32-by-32
  • The performance stabilizes once the GPU is fully utilized
  • Batched gesvdj is faster than MKL with 16 threads for 's' and 'd'

    Data type   MKL Ngflops   Speedup
    S           2.044         1.5
    D           1.527         1.1
    C           6.486         0.8
    Z           5.176         0.3

    MKL with 16 threads

SLIDE 23

Application 2: multiGPU syevj

  • syevj runs on four K40s
  • syevj is competitive against MKL for single precision
  • syevj is about ½ of MKL for double precision
SLIDE 24

Application 3: approximate eigensolver

  • Use case: obtain the full, approximate spectrum quickly when the user cannot
    afford a large cluster for a dense eigensolver
  • Hydrogen atom: H ψ = E ψ, with H = −½∇² − 1/r in atomic units
  • Energy levels: E_n = −13.6 eV / n²
  • Naming convention:

    syevij ('ij' stands for incomplete Jacobi)

SLIDE 25

Accuracy of syevij

  • The resolution is 16 grid points per dimension, so the matrix is 4096-by-4096
    with 27136 nonzeros
  • There are 5 bound states, but syevij reports 0 bound states
  • The error bound is about 0.01 per eigenvalue on average
    (12.43 eV, −20.36 eV)

SLIDE 26

Performance of syevij

  • The complexity is still cubic in the matrix size, but syevij is much faster than
    a dense eigensolver
  • Strong scaling over multiple GPUs is not significant in this case:
    2 GPUs: 1.4x speedup, 4 GPUs: 1.7x speedup
  • Single precision keeps the same accuracy but is 2x faster

    n    Matrix size   nnz         1 K40    2 K40    4 K40
    16   4,096         27,136      0.6      0.6      0.7
    32   32,768        223,232     27.2     18.8     16.0
    64   262,144       1,810,432   1919.6   1320.7   1016.6

    Double precision: runtime (seconds)

SLIDE 27

Conclusions

  • Optimal complexity may not be the best for parallel computing
  • The Jacobi method is faster than MKL for small matrices, as well as for batched
    operations
  • The Jacobi method can be applied to the symmetric eigenvalue solver and the SVD
  • The Jacobi method uses limited CPU resources
  • CUDA 9 will have:

    syevj, batched syevj
    gesvdj, batched gesvdj
    multiGPU syevj
    multiGPU syevij

SLIDE 28

Thank you !

  • [1] Gene H. Golub, Charles F. Van Loan, Matrix Computations, 3rd edition,
    Johns Hopkins
  • [2] LAPACK: Symmetric Eigenproblems, http://www.netlib.org/lapack/lug/node48.html,
    http://www.netlib.org/lapack/lug/node30.html