 
              AMD GPU Jasper Manousek Ying Li 05.02.2015 Seminar | High-Performance and Scientific Computing Prof. Paolo Bientinesi, Ph.D.
Agenda
• Architecture
• Dwarfs
• Sparse Linear Algebra
• Dense Linear Algebra
• Graph Traversal
• MapReduce
• Conclusion
Architecture 3
Comparison
Nvidia GTX640
• 1 controlling unit for every 8 stream processors
• Advantage: easier for developers due to the simple structure
Radeon HD 6850
• Blocks of 6 SPs
• 4 general ones and one overseer
• One SP with FP/Int arithmetic functions
• Advantage: more potential if used correctly
• Disadvantage: requires the developer to program specifically towards it
Comparison
• Less power overall
• Smaller die size due to the structure
• Less expensive
• Other small differences
Dense Linear Algebra
• Classic vector and matrix operations ¹
• Data is typically laid out as a contiguous array, and computations on elements, rows, columns, or matrix blocks are the norm ²
• Examples ³
1, 2, 3: http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra
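The "contiguous array, computed block by block" pattern can be illustrated with a small sketch. The Python below multiplies two square matrices stored as flat row-major lists, one tile at a time. The function name `blocked_matmul`, the tile size, and the sample data are assumptions made for illustration; they are not taken from the slides or from any library discussed here.

```python
def blocked_matmul(a, b, n, tile=2):
    """Multiply two n x n matrices stored as flat row-major lists, tile by tile."""
    c = [0.0] * (n * n)
    # Walk over tiles of the output (i0, j0) and of the shared dimension (k0).
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Multiply one tile of A with one tile of B.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik = a[i * n + k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += a_ik * b[k * n + j]
    return c


# Illustrative data: a 3 x 3 matrix times the identity.
n = 3
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
product = blocked_matmul(a, identity, n)
```

Working on tiles keeps the data touched by the inner loops small and contiguous, which is the property that dense kernels on GPUs and CPUs typically exploit.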
Paper
Title: Pannotia: Understanding Irregular GPGPU Graph Applications
Author: Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron
Publication: Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013
Link: http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf
Overview of the Paper
• Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (clMAGMA library)
• Efficient implementation on AMD's Tahiti GPUs using the OpenCL standard and optimized BLAS routines
• Wide applicability and many-fold performance improvement over the highly tuned codes of state-of-the-art libraries for the current generation of multicore CPUs
Performance Study
• Hardware: AMD Radeon HD7970 card with a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU's multicore host
• Library: MKL 11.1 on the CPU; clMAGMA on the GPU and its CPU host
• Results: for dense linear algebra, clMAGMA on the heterogeneous system (multicore CPU with GPU accelerator) achieves higher performance than MKL on the CPU alone
Results in Detail (1)
1) LU factorization (up to 5.7x speedup vs. the CPU host)
2) Cholesky factorization (up to 5.4x speedup vs. the CPU host)
Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)
Results in Detail (2)
3) QR factorization (up to 5.9x speedup vs. the CPU host)
4) Hessenberg factorization (up to 5.5x speedup vs. the CPU host)
Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)
Results in Detail (3)
5) Matrix inversion (up to 1.2x speedup vs. the CPU host)
Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)
Sparse Linear Algebra
• Used when input matrices have a large number of zero entries ¹
• Compressed data structures, keeping only the non-zero entries and their indices, are the norm here ² ³
1, 2: http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra
3: http://www.lanl.gov/Caesar/node223.html
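One common compressed layout that keeps only the non-zero entries and their indices is compressed sparse row (CSR). The sketch below is a minimal pure-Python illustration of CSR storage and a sparse matrix-vector product on it; the helper names `dense_to_csr` and `csr_matvec` and the sample matrix are assumptions for this example, not code from the referenced paper or libraries.

```python
def dense_to_csr(rows):
    """Convert a list of dense rows to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)   # non-zero entry
                col_idx.append(j)  # its column index
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A given in CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y


# Illustrative 3 x 3 matrix with four non-zero entries.
matrix = [[4, 0, 0],
          [0, 0, 2],
          [1, 0, 3]]
values, col_idx, row_ptr = dense_to_csr(matrix)
y = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
```

Each output row only touches its own slice of `values`, so rows can be processed independently, which is what makes the product a natural fit for many-core hardware.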
Paper
Title: Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries
Author: Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling
Publication: SIAM Journal on Scientific Computing, Vol. 35, No. 5
Link: http://arxiv.org/pdf/1212.6326v2.pdf
Overview of the Paper
• Comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL
• One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, whose implementation is a sparse matrix-vector product
• In general, all experiments, including the nonlinear disordered Hamiltonian lattice, show accelerations of up to 10x to 20x on the GPU path compared with the CPU path
Performance Study
• Hardware
− GPUs: AMD Radeon HD 7970 (Tahiti) and NVIDIA Tesla C2070
− CPU: Intel Core i7 930
• Implementation
− GPUs: OpenCL implementations from AMD and NVIDIA
− CPU: OpenCL implementations from AMD and Intel
• Results
− Distinct acceleration is observed when running the GPU path vs. the CPU path
− Significant acceleration requires problem sizes between 10^3 and 10^5, due to considerable overhead at smaller problem sizes
− The overhead of using high-level libraries is negligible compared with the effort spent getting familiar with the details of CUDA or OpenCL
Results in Detail (1)
VexCL: CPU (Intel) and GPU (AMD)
Source of the table: (2)
Results in Detail (2)
Performance under largest problem size:

Hamiltonian lattice      Time (sec)   Achieved throughput in GB/sec (percentage of theoretical peak)
GPU: NVIDIA
  Thrust                 319.60       120 (81%)
  CMTL4                  370.31       104 (70%)
  VexCL                  401.39       96 (65%)
  ViennaCL               433.50       89 (60%)
GPU: Tahiti
  VexCL                  225.41       170 (65%)
  ViennaCL               214.87       179 (68%)
CPU: Intel Core i7 930
  Thrust                 N/A          N/A
  VexCL (AMD)            2934.99      13 (51%)
  VexCL (Intel)          3171.74      12 (47%)
  ViennaCL (AMD)         2608.80      15 (58%)
  ViennaCL (Intel)       2580.47      15 (58%)

Source of the table: (2)
Graph Traversal
http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png
Divergence
• Branch divergence
− Multiple threads share the same wavefront
− Threads in a wavefront run in lockstep, so divergent branches are serialized
• Memory divergence
− All threads of a wavefront must complete their memory accesses before the next step
− Some threads must go through multiple adjacency lists to find the correct memory
• Load imbalance
− Graphs are by nature unbalanced
− Some threads get much more work than others
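The cost of load imbalance under lockstep execution can be estimated with a deliberately simplified model. In the sketch below, each thread's work is taken to be the length of its vertex's adjacency list, and a wavefront is assumed to stay busy until its slowest thread finishes. The function name `wavefront_utilization`, the small wavefront size used in the example, and the degree lists are illustrative assumptions, not measurements from the slides.

```python
def wavefront_utilization(work_per_thread, wavefront_size=64):
    """Fraction of lane-steps doing useful work when wavefronts run in lockstep.

    Simplified model (an assumption): every lane of a wavefront is occupied
    until the thread with the most work in that wavefront has finished.
    """
    useful = sum(work_per_thread)
    occupied = 0
    for start in range(0, len(work_per_thread), wavefront_size):
        group = work_per_thread[start:start + wavefront_size]
        occupied += max(group) * wavefront_size
    return useful / occupied if occupied else 1.0


# Adjacency-list lengths for four vertices, one vertex per thread.
balanced = wavefront_utilization([4, 4, 4, 4], wavefront_size=4)
skewed = wavefront_utilization([1, 1, 1, 13], wavefront_size=4)
```

Both inputs contain the same total work, but in the skewed case three lanes idle while the fourth walks its long adjacency list, so utilization drops to 16/52.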
Speedup
• All data was gathered using an AMD Radeon HD7000
• AMD A8-5500 accelerated processing unit
• Pannotia was used as the application suite
Dijkstra and Graph Coloring
http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg
http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg
Dijkstra and Graph Coloring
• Speedups ranging from 4x to 8x
• Speedup tends to be better for larger graphs
• Strong parallelization
Dijkstra and Graph Coloring
Source: (4)
Friend Recommendation and Connected Components Labelling
http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png
Friend Recommendation and Connected Components Labelling
• Speedups ranging from 1x to 2x
• Relatively little speedup due to strong imbalance
Summary
• Effectiveness depends on the exact problem
• Deep understanding of the GPU required
• Deep understanding of the problem required
Map Reduce
http://de.wikipedia.org/wiki/Datei:MapReduce2.svg
Map Reduce
• AMD GPUs have two ways of accessing memory: FastPath and CompletePath
• All current GPU MapReduce implementations use global atomic operations
• Using global atomic operations forces AMD GPUs onto the CompletePath
• Tests show memory access over the CompletePath to be 32 times slower
Software-based Atomic Add
A Map Reduce Framework for Heterogeneous Computing Architectures
Map Reduce
• A master thread quickly becomes a bottleneck
• Instead, group threads by wavefront
• Define the first thread of each wavefront as the dominant thread
• Create 4 global arrays with one element per wavefront: WavefrontsAddresses, WavefrontsSum, WavefrontsPrefixSums, Finished
Map Reduce: Step 1
Flowchart labels: Threads load address and sums; Sync
Map Reduce: Step 2
Flowchart labels: Local atomic add to generate prefixSum and increment; Is only wavefront on address (true/false); Update dominant and set local increment to 0; WFprefixSum = address; Wfincrement = localSum; Sync
Map Reduce: Step 3
Flowchart labels: If requesting wavefront (true/false); Set addresses = 0; If dominant (true/false); Reset local data; Update global variable; Sync
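The core idea behind grouping by wavefront can be sketched as a sequential simulation: each wavefront first combines its threads' increments locally, and only one update per wavefront touches the shared counter. The Python below is a hedged illustration of that idea only. It does not model the four global arrays, the requesting and dominant roles in the flowcharts, or any synchronization, and the name `wavefront_atomic_add` and the small wavefront size are assumptions for this example.

```python
def wavefront_atomic_add(counter, increments, wavefront_size=4):
    """Simulate per-wavefront aggregation of atomic adds on one shared counter.

    Returns (new_counter, old_values), where old_values[i] is the value that
    thread i would have received from an ordinary atomic add.
    """
    old_values = []
    for start in range(0, len(increments), wavefront_size):
        group = increments[start:start + wavefront_size]
        # Local step: exclusive prefix sum of the increments in this wavefront.
        prefix, running = [], 0
        for inc in group:
            prefix.append(running)
            running += inc
        # Single shared update per wavefront (done by one thread in this model).
        base = counter
        counter += running
        # Every thread derives its own result from the base and its prefix.
        old_values.extend(base + p for p in prefix)
    return counter, old_values


# Six threads in wavefronts of four, all adding to a counter that starts at 100.
final_value, returned = wavefront_atomic_add(100, [1, 2, 3, 4, 5, 6])
```

In this model the shared counter is written once per wavefront instead of once per thread, which is the property that reduces pressure on global atomic operations.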
Evaluation
• Hardware
− GPU: ATI Radeon HD 5870 (Cypress)
− CPU: 2x Intel Xeon E5405
• Key performance measures
− Total execution time in nanoseconds
− Ratio of FastPath to CompletePath memory transactions
Experiment: Micro Benchmarks
1) Without memory transactions (up to 3x vs. system atomic operation)
2) With memory transactions (up to 1.9x vs. system atomic operation)
Source of the figures: (3)