Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA - - PowerPoint PPT Presentation
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA - - PowerPoint PPT Presentation
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is QR algorithm
Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions
Symmetric eigenvalue solver
- The standard form is
- QR algorithm is state-of-the-art, most popular, and implemented in LAPACK
- Jacobi method was introduced in 1846, prior to QR algorithm. The parallelism
- f Jacobi method makes GPU implementation appealing
- Naming convention
syevd (heevd): QR algorithm syevj (heevj): Jacobi method
QR algorithm (syevd)
- sytrd : tridiagonalization
- =
= Σ
- stedc : spectrum of tridiagonal
- ormtr : form eigenvectors
- =
Example of QR algorithm
- sytrd:
= 1 2 2 5 3 4 6 7 3 6 4 7 8 9 9 10 = 1 −5.39 −5.39 22.48 2.768 2.768 0.286 −0.18 −0.18 0.232
- stedc:
Σ = −0.81 0.185 0.558 24.1 = −0.73 0.33 −0.24 0.05 0.56 −0.2 0.05 0.97 0.635 0.24 0.109 0.91 0.73 0.11 −0.4 = 1 −0.37 −0.74 0.56 −0.56 −0.74 −0.31 −0.77 0.6 0.3
Pros and Cons
- Step 1: [GPU] sytrd has 60% BLAS2 and 40% BLAS3
- Step 3: [GPU] ormtr is a sequence of householder
product
- Step 2: [CPU] stedc performs QR algorithm
sequentially
- the runtime of stedc is about the same as sytrd
- CPU is occupied during eigenvalue solver
- The performance depends on CPU as well
Find an alternative to replace stedc
routine time(sec) sytrd 4.59 stedc 14.55
- rmtr
3.65 n = 4096, double precision
Jacobi method (syevj)
- A series of rotations, each rotation eliminates one off-diagonal
= =
- =
- =
- Monotone property
= − ( =
- ,
- >
where
- With proper termination condition
Σ = lim
⟶
= ⋯
Example of Jacobi method
- Eliminate (1,2):
- =
1 2 2 5 3 4 6 7 3 6 4 7 8 9 9 10 = 0.17 5.83 0.48 1.02 6.70 8.00 0.48 6.70 1.02 8.00 8 9 9 10
- Eliminate (1,3):
- =
0.14 −0.4 −0.4 5.83 0.47 6.70 8.00 6.70 0.47 8.00 8.03 9.05 9.05 10 = 19.74842 = 19.54482 = 19.53325
- Monotone property holds,
- Eliminate (1,4):
- =
0.12 −0.8 −0.8 5.83 −0.4 6.70 7.97 −0.4 6.70 7.97 8.03 9.03 9.03 10.02
- Eliminate (2,3):
- =
0.12 −0.32 −0.32 0.16 −0.84 0.23 −0.84 0.23 13.7 12 12 10.02 = 19.52188 = 17.08458
- (1,4) and (2,3) operate on non-overlapped rows and columns
Cyclic Jacobi
(1)
- (2) while
(3) for p = 1:n-1 (4) for q = p+1 : n (5) compute (6)
- (7)
(8) end // for q (9) end // for p (10) end // while
A sweep consists of n(n-1)/2 rotations Quadratic convergence is based on # of sweeps control accuracy Column rotation of A Row rotation of A Column rotation of V
Parallel Jacobi
- Eliminate (1,2) and (3,4):
- =
0.17 5.83 −0.32 1.07 −0.35 10.4 −0.32 −0.35 1.07 10.4 −0.05 18.06 = 1
- 5
3 4 6 7 3 6 4 7 8
- 10
sweep Off(A) 1 1.195820 2 2.849376E-1 3 2.355936E-4 4 1.279163E-15
- n/2 pairs of non-overlapped (p, q) which can be done in parallel
Block Jacobi
- Partition A into blocks
=
- ⋯
- ⋮
⋱ ⋮
- ⋯
- Eliminate off-diagonal block (p,q) by Jacobi rotation
- =
- Basic block is batched syevj
- Column and row rotations are done by efficient GEMM
- Propagate proper termination condition
Comparison
Basic routines sytrd stdec
- rmtr
batched syevj GEMM GPU friendly? sytrd (yes) stedc (no), cpu with single thread
- rmtr (yes)
batched syevj (yes) GEMM (yes) Scalable for next generation GPU sytrd (yes) stedc (no)
- rmtr (yes)
batched syevj (yes) GEMM (yes) Computational Complexity low high Good for small matrix No Yes, with batched syevj Approximate eigenvalue No, it computes exact eigenvalues Yes, accuracy is controlled by ‘tol’ Support for ‘s’, ‘d’, ‘c’, and ‘z’ Yes Yes Stable algorithm Yes Yes Quadratic convergence Yes Yes
Complexity Analysis
- QR algorithm is about 2 sweeps of Jacobi method
- To reach machine zero, Jacobi method needs 7 sweeps for single precision and
15 sweeps for double precision
- Although complexity of Jacobi method is bigger, its parallelism makes it faster
- n small matrices. Once the matrix gets bigger, Jacobi method suffers from
big complexity
Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions
Experimental Setup
- CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, dual socket
- GPU: K40
- Comparison against MKL 11.0.4
- Ngflops = 2*n^3 / time
normalized gflops w.r.t. GEMM
K40 E5-2690 v2 Number of processor cores 2880 10 Core clock 754 MHz 3 GHz bandwidth 180 GB/s 12 GB/s SGEMM 2768 Gflops 386 Gflops DGEMM 1221 Gflops 185 Gflops
http://ark.intel.com/products/75279/Intel-Xeon-Processor-E5-2690-v2-25M-Cache-3_00-GHz https://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf
Performance of syevj
n cusolver syevd syevj MKL syevd 32 0.008 0.04 0.004 64 0.03 0.15 0.04 128 0.11 0.46 0.19 256 0.50 1.05 1.30 512 1.42 2.88 5.19 1024 3.00 4.56 15.26 2048 5.56 5.96 32.66 4096 5.93 7.35 43.66 8192 4.85 9.80 48.35 Ngflops, Double precision
- Jacobi method is faster than QR algorithm for small matrix, up to size 256
- Jacobi method left behind for large matrix due to high complexity
Performance of batched syevj
Data type MKL Ngflops Speedup S 0.73 7.3 D 1.59 1.4 C 3.13 3.0 Z 4.55 0.8
- Batched syevj relies on shared memory, the dimension is limited by 32
- The performance stabilizes when GPU is fully utilized
- Batched syevj is faster than MKL with 16 threads for ‘s’, ’d’ and ‘c’
MKL, 16 threads
n=32
Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions
Application 1: SVD
- SVD computes singular value
and left/right singular vector U/V
- =
Σ
- LAPACK uses QR algorithm
- Jacobi method can apply to SVD because monotone property still holds
- Naming convention
- gesvd : QR algorithm
- gesvdj : Jacobi method
Performance of gesvdj
- Jacobi method is faster than QR algorithm for small matrix, up to size 512
- For large matrix, Jacobi method is not bad for ‘s’ and ‘c’ compared to MKL
n cusolver gesvd gesvdj MKL gesvd 32 0.004 0.036 0.089 64 0.014 0.118 0.034 128 0.064 0.395 0.118 256 0.203 1.089 0.768 512 0.628 1.872 1.734 1024 1.688 2.489 5.487 2048 3.944 2.749 13.083 4096 6.178 2.714 18.111 8192 8.242 2.65 11.061 Ngflops, Double precision
Performance of batched gesvdj
- The matrix size is limited by 32-by-32
- The performance stablizes when GPU is fully utilized
- Batched gesvdj is faster than MKL with 16 threads for ‘s’ and ‘d’
Data type MKL Ngflops Speedup S 2.044 1.5 D 1.527 1.1 C 6.486 0.8 Z 5.176 0.3
MKL, 16 threads
Application 2: multiGPU syevj
- syevj runs on four K40
- syevj is competitive against MKL for single precision
syevj is ½ to
- f MKL for double precision
Application 3: approximate eigensolver
- Use case: to know full inaccurate spectrum quickly
- r cannot afford a large cluster for dense eigensolver
- Hydrogen Atom
,,
- Energy:
- Naming convention:
syevij (‘ij‘ stands for incomplete Jacobi)
Accuracy of syevij
- The resolution is 16 grid points for each dimension
matrix is 4096-by-4096 with 27136 nonzeros
- There are 5 bound states but syevij reports 0 bound states
- The error bound
- (0.01 per eigenvalue in average)
- 12.43 eV
20.36 eV −
Performance of syevij
- The complexity is still
but much faster than dense eigensolver
- Strong scaling of multiGPU is not significant in this case
2 GPU: 1.4x speedup 4 GPU: 1.7x speedup
- Single precision keeps the same accuracy but 2x faster
n Matrix size nnz 1 K40 2 K40 4 K40 16 4,096 27,136 0.6 0.6 0.7 32 32,768 223,232 27.2 18.8 16.0 64 262,144 1,810,432 1919.6 1320.7 1016.6 Double precision: runtime (second)
Conclusions
- Optimal complexity may not be the best for parallel computing
- Jacobi method is faster than MKL for small matrices, as well as batched
- perations
- Jacobi method can be applied to symmetric eigenvalue solver and SVD
- Jacobi method uses limited CPU resources
- CUDA 9 will have
syevj, batched syevj gesvdj, batched gesvdj multiGPU syevj multiGPU syevij
Thank you !
- [1] Gene H. Golub, Charles F
. Van Loan, MATRIX COMPUTATIONS, 3rd edition, Johns Hopkins
- [2] LAPACK: Symmetric Eigenproblems, http://www.netlib.org/lapack/lug/node48.html ,
http://www.netlib.org/lapack/lug/node30.html