Applications of Berkeley's Dwarfs on Nvidia GPUs
Seminar: Topics in High-Performance and Scientific Computing
Team N2: Yang Zhang, Haiqing Wang
05.02.2015
Overview (2/37)
- CUDA
- Dynamic Programming
- Sparse Linear Algebra
- The Dwarfs
CUDA (3/37)
- Programming model for GPGPU
- Supports languages including C/C++ and Fortran
- Provides GPU-accelerated libraries (e.g. cuSPARSE, cuBLAS, NPP, etc.)
CUDA: Execution Model (4/37)
CUDA: Memory Model (5/37)
Overview (6/37)
Dynamic Programming: Matrix Chain Product (7/37)
Goal: minimize the total number of scalar multiplications needed to compute the product A1 A2 ... A6.
An example (A1: 2×9, A2: 9×3, A3: 3×1, A4: 1×4, A5: 4×11, A6: 11×5):
- (((A1 A2) A3) A4) (A5 A6): 2*9*3 + 2*3*1 + 2*1*4 + 4*11*5 + 2*4*5 = 328 multiplications
- ((A1 (A2 A3)) (A4 A5)) A6: 9*3*1 + 2*9*1 + 1*4*11 + 2*1*11 + 2*11*5 = 221 multiplications
Dynamic Programming: Algorithm (8-9/37)
Recurrence: m[i,j] = min over i <= k < j of ( m[i,k] + m[k+1,j] + p[i-1] * p[k] * p[j] ), with m[i,i] = 0; table s stores the optimal split position k for each entry.
Table m and table s for the example (n = 6).
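The two tables can be filled in with the classic O(n^3) dynamic program. A minimal sequential Python sketch (function and variable names are my own, not from the paper):

```python
def matrix_chain_order(p):
    """Dynamic program for the matrix chain product.

    p: list of dimensions; matrix Ai has dimensions p[i-1] x p[i].
    Returns (m, s) where m[i][j] is the minimal number of scalar
    multiplications for Ai ... Aj and s[i][j] is the optimal split k.
    """
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(2, n + 1):              # l = chain length (diagonal of the table)
        for i in range(1, n - l + 2):
            j = i + l - 1
            m[i][j] = None
            for k in range(i, j):          # try every split position
                cost = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                if m[i][j] is None or cost < m[i][j]:
                    m[i][j], s[i][j] = cost, k
    return m, s
```

With the dimensions of the six example matrices (p = [2, 9, 3, 1, 4, 11, 5]) the optimum m[1][6] is 154 multiplications, i.e. even cheaper than the 221 of the better ordering shown in the example.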
Dynamic Programming: Implementation (10-11/37)
All entries on the same diagonal l of table m are independent of each other, so they can be computed in parallel (illustrated for n = 8).
Dynamic Programming: Implementation (12/37)
Three different kernels are used:
- OneThreadPerOneEntry
- OneBlockPerOneEntry
- BlocksPerOneEntry
Which kernel performs best depends on several factors:
- the number of entries (i, j) on each diagonal l
- the number of split positions k for each entry (i, j) on diagonal l
- the amount of computation for each l
Dynamic Programming: Implementation (13/37)
OneThreadPerOneEntry:
- allocates one thread to compute one entry, e.g. m[1,5], m[2,6], m[3,7], m[4,8]
- the entries are computed concurrently; all of them read previously computed entries
- the memory mapping direction is changed accordingly
Dynamic Programming: Implementation (14/37)
OneThreadPerOneEntry on the CUDA architecture:
- one core computes one entry, e.g. m[1,5], m[2,6], m[3,7], m[4,8], concurrently
- previously computed entries are read from shared memory
- results are stored in global memory after computation
Dynamic Programming: Implementation (15/37)
OneBlockPerOneEntry on the CUDA architecture:
- allocates one block to compute one entry: m[1,5] = min over 1 <= k < 5 of ( m[1,k] + m[k+1,5] + p0 * pk * p5 ) is computed by one streaming multiprocessor
- each term ( m[1,k] + m[k+1,5] + p0 * pk * p5 ) is computed by one core
- another core performs the min selection
BlocksPerOneEntry on the CUDA architecture:
- allocates multiple blocks to compute one entry: m[1,5] = min over 1 <= k < 5 of ( m[1,k] + m[k+1,5] + p0 * pk * p5 ) is computed by several streaming multiprocessors
- each term ( m[1,k] + m[k+1,5] + p0 * pk * p5 ) is computed by one core, possibly on different streaming multiprocessors
- another core on any streaming multiprocessor performs the min selection
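All three kernels exploit the same fact: entries on one diagonal read only entries from shorter diagonals. The one-worker-per-entry idea can be mimicked on the CPU with a thread pool; this is an illustrative sketch (names are mine), not the paper's CUDA code:

```python
from concurrent.futures import ThreadPoolExecutor

def chain_order_wavefront(p, workers=4):
    """Fill table m diagonal by diagonal; the entries on one diagonal are
    independent and are submitted to the pool concurrently (the CPU
    analogue of one CUDA thread per entry)."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]

    def entry(ij):
        i, j = ij
        # every m[i][k], m[k+1][j] lies on a shorter diagonal -> already done
        return min(m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                   for k in range(i, j))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for l in range(2, n + 1):                       # diagonal = chain length
            pairs = [(i, i + l - 1) for i in range(1, n - l + 2)]
            for (i, j), val in zip(pairs, pool.map(entry, pairs)):
                m[i][j] = val
    return m[1][n]
```

Writes within a diagonal never race, because entries on the same diagonal do not read each other; the synchronization point between diagonals is the end of each `pool.map`.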
Dynamic Programming: Evaluation (16/37)
GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 streaming multiprocessors with 32 cores each), 1.4 GHz, 3 GB memory.
Total time of each kernel for different numbers of threads and blocks (n = 16384).
Dynamic Programming: Evaluation (17/37)
Running time over diagonal l for each of the three kernels (OneThreadPerOneEntry, OneBlockPerOneEntry, BlocksPerOneEntry); which kernel is fastest differs from diagonal to diagonal.
Dynamic Programming: Evaluation GPU vs. CPU (18/37)
GPU: Nvidia GeForce GTX 480 as above, using the combination of the three kernels (the fastest kernel for each l).
CPU: Intel Core i7 870, 2.93 GHz, 8 GB memory (sequential C program).
Total computing time for n = 16384. The resulting speedup factor is not a fair comparison, since the parallel GPU code is measured against a single-threaded CPU program.
Overview (19/37)
Sparse Linear Algebra (20/37)
Goal: accelerate the sparse matrix-matrix (SpMM) product on the GPU.
SpMM product: compute D = A * B, where A is a sparse matrix and B is a dense matrix.
Sparse Linear Algebra: FastSpMM (21/37)
Approach: FastSpMM, an SpMM kernel based on ELLR-T and the ELLPACK-R sparse storage format.
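ELLPACK-R keeps the nonzeros of each row left-justified in fixed-width arrays plus an explicit row-length vector, which makes the GPU kernel's memory accesses regular. A small NumPy sketch of the format and of the SpMM D = A * B (my own simplified serial version, not the actual FastSpMM kernel):

```python
import numpy as np

def to_ellpack_r(A):
    """Convert a dense matrix A to ELLPACK-R: value/column arrays padded
    to the longest row, plus rl[i] = number of nonzeros in row i."""
    n_rows = A.shape[0]
    rl = np.array([np.count_nonzero(A[i]) for i in range(n_rows)])
    width = max(1, rl.max())
    vals = np.zeros((n_rows, width))
    cols = np.zeros((n_rows, width), dtype=int)
    for i in range(n_rows):
        nz = np.nonzero(A[i])[0]
        vals[i, :len(nz)] = A[i, nz]
        cols[i, :len(nz)] = nz
    return vals, cols, rl

def spmm_ellpack_r(vals, cols, rl, B):
    """D = A @ B with A stored in ELLPACK-R form. On the GPU, threads
    would process rows of A in parallel; here it is a plain loop."""
    n_rows = vals.shape[0]
    D = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):
        for t in range(rl[i]):               # visit only the stored nonzeros
            D[i] += vals[i, t] * B[cols[i, t]]
    return D
```

The row-length vector rl is what distinguishes ELLPACK-R from plain ELLPACK: each row's loop stops at its real number of nonzeros instead of iterating over the padding.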
Sparse Linear Algebra: Evaluation SpMM (22/37)
Three versions of the SpMM routine evaluated on two Nvidia GPUs (GTX 480 and Tesla C2050) over a set of test sparse matrices: FastSpMM vs. ELLR-T (ELLPACK-R storage format) vs. cuSPARSE (CRS storage format).
Sparse Linear Algebra: Evaluation GPU vs. CPU (23/37)
GTX 480 and Tesla C2050 using FastSpMM vs. an Intel Xeon E5640 with 4 cores using the Intel MKL library.
Runtimes (in seconds) on the test matrices. Speedups over the CPU: GTX 480: 2.8× – 6.2×; Tesla C2050: 1.7× – 3.8×.
Overview (24/37)
Unstructured Grids: Compressible Flows (25/37)
Simulation of compressible flows on 3-D unstructured grids.
Compressible flow: the branch of fluid mechanics that deals with flows having significant changes in fluid density.
Example: subsonic flow past a sphere.
Unstructured Grids: DG Method (26/37)
Discontinuous Galerkin (DG) methods: a class of numerical methods for solving differential equations.
The DG method can be implemented in parallel.
Example: subsonic flow past a sphere.
Unstructured Grids: Evaluation GPU vs. CPU (27/37)
Timing measurements for subsonic flow past a sphere.
GPU: NVIDIA Tesla K20c with 2496 CUDA cores (OpenACC-based program).
CPU: AMD Opteron 6128 with 16 cores (MPI-based parallel program).
Nelem: number of elements; Ntime: number of time steps.
Overview (28/37)
Combinational Logic: Parallel AES (29/37)
Goal: efficient encryption/decryption of data streams in web-server applications.
Approach: design of a parallel AES implementation on the GPU, with two design choices: fine-grained and coarse-grained parallelism.
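The coarse-grained choice maps one plaintext block to one worker, which works for modes whose blocks are independent (ECB, or counter-style modes). A sketch of that idea in Python, with a deliberately trivial XOR stand-in instead of the real AES rounds (everything here is illustrative, not the paper's CUDA code):

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # AES block size in bytes

def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    # stand-in for the AES-128 round sequence: a single XOR with the key
    return bytes(b ^ k for b, k in zip(block, key))

def encrypt_ecb_coarse(data: bytes, key: bytes, workers=4) -> bytes:
    """Coarse-grained parallelism: one task per independent 16-byte block
    (on the GPU: one thread per block of the stream)."""
    assert len(data) % BLOCK == 0 and len(key) == BLOCK
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(lambda b: toy_encrypt_block(b, key), blocks))
```

Fine-grained parallelism would instead split the work inside a single AES state (e.g. one worker per column of the 4×4 state), trading less parallelism across blocks for parallelism within each block.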
Combinational Logic: Evaluation (30/37)
Comparison: fine-grained vs. coarse-grained parallelism on an Nvidia GeForce 8800 GT (112 cores).
Combinational Logic: Evaluation GPU vs. CPU (31/37)
Throughput (in Mbps) compared on two Nvidia GPUs and two high-end CPUs (as of 2009).
Overview (32/37)
Graphical Model: Speech Recognition System (33/37)
ANN: Artificial Neural Network; HMM: Hidden Markov Model.
ANN model: recognizes the acoustics in a time frame (a word or phoneme).
HMM model: warps and adjusts the whole acoustic sequence, combining the words or phonemes recognized by the ANN.
Graphical Model: ANN Training (34/37)
Input: a vector representing the acoustics in a time frame.
Output: a vector representing the most likely word or phoneme.
Hidden vector = input vector × weight matrix 1; output vector = hidden vector × weight matrix 2 (computed as inner products).
Training is the process of adjusting weight matrices 1 and 2.
Graphical Model: Block ANN Training (35/37)
Training can be expressed in linear algebra:
Input: a matrix made up of many input vectors. Output: a matrix made up of many output vectors.
Hidden matrix = input matrix × weight matrix 1; output matrix = hidden matrix × weight matrix 2.
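Stacking many input vectors into a matrix turns the per-frame matrix-vector products into large matrix-matrix products (GEMM), which is exactly the operation cuBLAS accelerates. A NumPy sketch of the batched forward pass (shapes and names are my assumptions; only the two products are stated on the slides, so the usual nonlinearity is omitted here too):

```python
import numpy as np

def forward_block(X, W1, W2):
    """Batched ANN forward pass (linear part only).

    X:  (batch, n_in)      block of input vectors, one acoustic frame per row
    W1: (n_in, n_hidden)   weight matrix 1
    W2: (n_hidden, n_out)  weight matrix 2
    """
    H = X @ W1        # hidden matrix = input matrix x weight matrix 1
    Y = H @ W2        # output matrix = hidden matrix x weight matrix 2
    return H, Y
```

Processing one frame at a time would issue many small matrix-vector products (BLAS level 2); the block formulation issues a few large matrix-matrix products (BLAS level 3), which utilize the GPU far better.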
Graphical Model: Evaluation GPU vs. CPU (36/37)
GPU: NVIDIA GTX 280 (cuBLAS library) in a host with a 1600 MHz FSB and 8 GB RAM.
CPU: a quad-core 3.0 GHz CPU (Intel MKL library).
Training time and relative speed-up for the WSJ0 corpus: a speedup factor of 5, with memory management as a key factor.
Summary (37/37)
Disclaimer: some of the comparisons against CPUs are not really representative, or the CPU configuration is not clearly specified.
References (1/2)
[1] K. Nishida, Y. Ito, K. Nakano. Accelerating the Dynamic Programming for the Matrix Chain Product on the GPU. 2011 Second International Conference on Networking and Computing (ICNC), pp. 320-326, Nov.-Dec. 2011.
[2] F. Vazquez, G. Ortega, J. J. Fernandez, I. Garcia, E. M. Garzon. Fast Sparse Matrix Matrix Product Based on ELLR-T and GPU Computing. 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA).
[3] Y. Xia, H. Luo, L. Luo, J. Edwards, J. Lou, F. Mueller. OpenACC-based GPU Acceleration of a 3-D Unstructured Discontinuous Galerkin Method. 52nd Aerospace Sciences Meeting, January 2014.
References (2/2)
[4] A. di Biagio, A. Barenghi, G. Agosta, G. Pelosi. Design of a Parallel AES for Graphics Hardware Using the CUDA Framework. 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1-8, May 2009.
[5] S. Scanzio, S. Cumani, R. Gemello, F. Mana, P. Laface. Parallel Implementation of Artificial Neural Network Training. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4902-4905, March 2010.
Image sources:
http://rtcmagazine.com/files/images/5985/RTC08-ERTW-Nvidia-FigX_original_large.jpg
https://www.pgroup.com/images/insider/v2n4a1i2.png
http://pic002.cnblogs.com/images/2011/63234/2011030722152125.png
http://3dgep.com/wp-content/uploads/2011/11/CUDA-memory-model.gif
Credits
Yang Zhang: CUDA, Dynamic Programming (in detail), Unstructured Grids, Graphical Model
Haiqing Wang: Sparse Linear Algebra, Combinational Logic, Summary