
AMD GPU | Jasper Manousek, Ying Li | 05.02.2015 | Seminar



  1. AMD GPU. Jasper Manousek, Ying Li. 05.02.2015. Seminar | High-Performance and Scientific Computing, Prof. Paolo Bientinesi, Ph.D.

  2. Agenda  Architecture  Dwarfs  Sparse Linear Algebra  Dense Linear Algebra  Graph Traversal  MapReduce  Conclusion 2

  3. Architecture 3

  4. Comparison: Nvidia GTX 640 vs. Radeon HD 6850
      Nvidia GTX 640: • 1 controlling unit for every 8 stream processors • advantage: easier for developers due to its simple structure
      Radeon HD 6850: • blocks of 6 SPs: 4 general-purpose SPs, one SP with FP/Int arithmetic functions, and one overseer • advantage: more potential if used correctly • disadvantage: requires the developer to specifically program towards it
      Architecture 4

  5. Comparison  Lower power consumption overall  Smaller die size thanks to its structure  Less expensive  Other small differences Architecture 5

  6. Dense Linear Algebra  Classic vector and matrix operations (1)  Data is typically laid out as a contiguous array, and computations on elements, rows, columns, or matrix blocks are the norm (2)  Examples (3)  (1), (2), (3): http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra Dense Linear Algebra 6
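To make the storage convention concrete, here is a minimal C++ sketch (not taken from any of the cited papers): a dense matrix kept in one contiguous row-major array, with a naive matrix-vector product that walks whole rows.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Dense storage: all rows*cols entries live in one contiguous array,
// with element (i, j) at index i*cols + j (row-major order).
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> a;
};

std::vector<double> matvec(const DenseMatrix& m, const std::vector<double>& x) {
    std::vector<double> y(m.rows, 0.0);
    for (std::size_t i = 0; i < m.rows; ++i)        // one contiguous row per iteration
        for (std::size_t j = 0; j < m.cols; ++j)
            y[i] += m.a[i * m.cols + j] * x[j];
    return y;
}

int main() {
    DenseMatrix m{2, 2, {1, 2, 3, 4}};              // [[1, 2], [3, 4]]
    for (double v : matvec(m, {1, 1})) std::cout << v << ' ';   // prints: 3 7
}
```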

  7. Paper  Title: Pannotia: Understanding Irregular GPGPU Graph Applications  Authors: Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron  Publication: Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept. 2013  Link: http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf Dense Linear Algebra 7

  8. Overview of the Paper  Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (clMAGMA library)  Efficient implementation on AMD’s Tahiti GPUs with the use of the OpenCL standard and optimized BLAS routines  Observation of a wide applicability and many-fold performance improvement over highly tuned codes constituting state-of-the-art libraries for the current generation of multicore CPUs Dense Linear Algebra 8

  9. Performance Study  Hardware: AMD Radeon HD 7970 card and a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU's multicore host  Library: MKL 11.1 on the CPU; clMAGMA on the GPU and its CPU host  Results: in dense linear algebra, clMAGMA on the heterogeneous system (multicore CPU plus GPU accelerator) delivers higher performance than MKL on the CPU alone Dense Linear Algebra 9

  10. Results in Detail (1)  1) LU factorization: up to 5.7x speedup vs. the CPU host  2) Cholesky factorization: up to 5.4x speedup vs. the CPU host  (Figures compare CPU+GPU with clMAGMA against CPU with MKL 11.1.) Source of the figures: (1) Dense Linear Algebra 10

  11. Results in Detail (2)  3) QR factorization: up to 5.9x speedup vs. the CPU host  4) Hessenberg factorization: up to 5.5x speedup vs. the CPU host  (Figures compare CPU+GPU with clMAGMA against CPU with MKL 11.1.) Source of the figures: (1) Dense Linear Algebra 11

  12. Results in Detail (3)  5) Matrix inversion: up to 1.2x speedup vs. the CPU host  (Figure compares CPU+GPU with clMAGMA against CPU with MKL 11.1.) Source of the figures: (1) Dense Linear Algebra 12

  13. Sparse Linear Algebra  Used when input matrices have a large number of zero entries (1)  Compressed data structures, keeping only the non-zero entries and their indices, are the norm here (2) (3)  (1), (2): http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra  (3): http://www.lanl.gov/Caesar/node223.html Sparse Linear Algebra 13
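As a concrete illustration of such a compressed structure (again a sketch, not code from the paper), the compressed sparse row (CSR) format stores only the non-zero values, their column indices, and per-row offsets; the sparse matrix-vector product then touches only the stored entries.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// CSR layout: non-zero values, their column indices, and per-row offsets.
struct CsrMatrix {
    std::vector<double> val;     // non-zero values
    std::vector<int>    col;     // column index of each non-zero
    std::vector<int>    row_ptr; // row i occupies [row_ptr[i], row_ptr[i+1])
};

std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col[k]];         // only stored entries are visited
    return y;
}

int main() {
    // 3x3 matrix [[4,0,1],[0,3,0],[2,0,5]] needs only 5 stored entries
    CsrMatrix A{{4, 1, 3, 2, 5}, {0, 2, 1, 0, 2}, {0, 2, 3, 5}};
    for (double v : spmv(A, {1, 1, 1})) std::cout << v << ' ';   // prints: 5 3 7
}
```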

  14. Paper  Title: Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries  Authors: Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling  Publication: SIAM Journal on Scientific Computing, Vol. 35, No. 5  Link: http://arxiv.org/pdf/1212.6326v2.pdf Sparse Linear Algebra 14

  15. Overview of the Paper  Comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL  One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, whose core operation is a sparse matrix-vector product  Overall, the experiments, including the Hamiltonian lattice, show a 10x to 20x acceleration on the GPU compared to the CPU path Sparse Linear Algebra 15
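To give a flavour of what such a high-level interface looks like, below is a minimal VexCL-style sketch based on the library's documented vector-expression interface; the actual Hamiltonian-lattice code in the paper is considerably more involved, and exact API details may differ between VexCL versions.

```cpp
#include <vector>
#include <vexcl/vexcl.hpp>   // requires the VexCL headers and an OpenCL runtime

int main() {
    const size_t n = 1 << 20;

    // Pick an OpenCL device; VexCL hides the platform/context/queue boilerplate.
    vex::Context ctx(vex::Filter::DoublePrecision);
    if (!ctx) return 1;      // no suitable device found

    std::vector<double> host_x(n, 1.0), host_y(n, 2.0), host_z(n);

    vex::vector<double> X(ctx, n), Y(ctx, n), Z(ctx, n);
    vex::copy(host_x, X);    // host -> device
    vex::copy(host_y, Y);

    // One expression: VexCL generates and launches the OpenCL kernel behind the scenes.
    Z = 2 * X + sin(Y);

    vex::copy(Z, host_z);    // device -> host
}
```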

  16. Performance Study  Hardware − GPUs: AMD Radeon HD 7970 (Tahiti) & NVIDIA Tesla C2070 − CPU: Intel Core i7 930  Implementation − GPUs: OpenCL implementations from AMD and NVIDIA − CPU: OpenCL implementations from AMD and Intel  Results − Distinct acceleration is observed when running the GPU path vs. the CPU path − Significant acceleration requires problem sizes between 10^3 and 10^5 because of considerable overhead at smaller problem sizes − The overhead of using high-level libraries is negligible compared to the effort spent getting familiar with the details of CUDA or OpenCL Sparse Linear Algebra 16

  17. Results in Detail (1)  Figure: VexCL performance on the CPU (Intel) and the GPU (AMD)  Source of the table: (2) Sparse Linear Algebra 17

  18. Results in Detail (2)  Performance under the largest problem size (Hamiltonian lattice; time in sec, achieved throughput in GB/sec with percentage of theoretical peak):
      GPU: NVIDIA             Thrust            319.60    120 (81%)
                              CMTL4             370.31    104 (70%)
                              VexCL             401.39     96 (65%)
                              ViennaCL          433.50     89 (60%)
      GPU: Tahiti             VexCL             225.41    170 (65%)
                              ViennaCL          214.87    179 (68%)
      CPU: Intel Core i7 930  Thrust            N/A       N/A
                              VexCL (AMD)       2934.99    13 (51%)
                              VexCL (Intel)     3171.74    12 (47%)
                              ViennaCL (AMD)    2608.80    15 (58%)
                              ViennaCL (Intel)  2580.47    15 (58%)
      Source of the table: (2) Sparse Linear Algebra 18

  19. Graph Traversal  Image source: http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png Graph Traversal 19

  20. Divergence  Branch divergence  Multiple threads run on the same wavefront  Threads in a wavefront execute in lockstep, so divergent branches serialize  Memory divergence  All threads of a wavefront must complete their memory accesses before the next step  Some threads must walk through multiple adjacency lists to find the correct memory  Load imbalance  Graphs are by nature unbalanced  Some threads get much more workload than others Graph Traversal 20
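The load-imbalance point can be made concrete with a small CPU-side sketch (hypothetical degree numbers, not GPU code): if each lane of a 64-wide wavefront processes one vertex of a skewed graph, the slowest lane dictates the runtime of the whole wavefront.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // hypothetical out-degrees of 64 vertices assigned to one 64-lane wavefront
    std::vector<std::size_t> degree(64, 2);   // most vertices have 2 neighbours...
    degree[0] = 500;                          // ...but one hub vertex has 500

    // in lockstep execution the wavefront only finishes once its busiest lane does
    std::size_t steps_taken   = *std::max_element(degree.begin(), degree.end());
    std::size_t useful_work   = std::accumulate(degree.begin(), degree.end(), std::size_t{0});
    std::size_t work_capacity = steps_taken * degree.size();

    std::cout << "lanes busy on average: "
              << 100.0 * useful_work / work_capacity << "% of the time\n";  // ~2%
}
```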

  21. Speedup  All data was gathered using an AMD Radeon HD 7000-series GPU  with an AMD A8-5500 accelerated processing unit  Pannotia was used as the application suite Graph Traversal 21

  22. Dijkstra and Graph Coloring  Image sources: http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg, http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg Graph Traversal 22

  23. Dijkstra and Graph Coloring  Speedups ranging from 4x to 8x  Speedup tends to be better for larger graphs  Strong parallelization Graph Traversal 23

  24. Dijkstra and Graph Coloring Source: (4) Graph Traversal 24

  25. Friend Recommendation and Connected Components Labelling  Image source: http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png Graph Traversal 25

  26. Friend Recommendation and Connected Components Labelling  Speedups ranging from 1x to 2x  Relatively little speedup due to strong imbalance Graph Traversal 26

  27. Summary  Effectiveness depends on the exact problem  A deep understanding of the GPU is required  A deep understanding of the problem is required Graph Traversal 27

  28. Map Reduce  Image source: http://de.wikipedia.org/wiki/Datei:MapReduce2.svg Map Reduce 28

  29. Map Reduce  AMD GPUs have two ways of accessing memory: the FastPath and the CompletePath  All current GPU MapReduce implementations use global atomic operations  The use of global atomic operations causes AMD GPUs to fall back to the CompletePath  Tests show 32x slower memory access over the CompletePath Map Reduce 29
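To illustrate the pattern in question, here is a hedged sketch (not code from the paper) of the kind of emit kernel that triggers the problem; the OpenCL C source is shown as a string constant in a C++ file, with all host-side setup omitted.

```cpp
#include <iostream>

// Hypothetical OpenCL C kernel fragment: every work-item reserves an output
// slot with a global atomic add. Per the slides, the mere presence of such a
// global atomic makes these AMD GPUs route the kernel's memory traffic over
// the slow CompletePath instead of the FastPath.
static const char* emit_kernel_src = R"CLC(
__kernel void emit(__global const int* values,
                   __global int*       out,
                   __global int*       write_pos)   // shared output cursor
{
    int gid = get_global_id(0);
    int v   = values[gid];

    // one global atomic per emitted element: heavy contention, CompletePath
    int slot = atomic_add(write_pos, 1);
    out[slot] = v;
}
)CLC";

int main() {
    // Host-side OpenCL setup (clCreateProgramWithSource, clBuildProgram, ...)
    // is omitted; printing the source only keeps this sketch self-contained.
    std::cout << emit_kernel_src;
}
```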

  30. Software-based Atomic add A Map Reduce Framework for Heterogeneous Computing Architectures Map Reduce 30

  31. Map Reduce  The master thread quickly becomes a bottleneck  Instead, group threads by wavefront  Define the first thread of each wavefront as the dominant thread  Create 4 global arrays with one element per wavefront:  WavefrontsAddresses, WavefrontsSum, WavefrontsPrefixSums, Finished. Map Reduce 31

  32. Map Reduce Step 1 (flow chart): the threads load their addresses and sums, then synchronize. Map Reduce 32

  33. Map Reduce Step 2 (flow chart): a local atomic add generates each thread's prefix sum and the wavefront's increment; if this is the only wavefront on the address, the dominant thread sets WFprefixSum = address and WFincrement = localSum, otherwise the dominant flag is updated and the local increment is set to 0; then sync. Map Reduce 33

  34. Map Reduce Step 3 (flow chart): if a wavefront is requesting, its addresses are set to 0; if the thread is dominant, it updates the global variable and resets the local data; then sync. Map Reduce 34
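As a rough analogy for the wavefront-level scheme of Steps 1 to 3 (a CPU simulation, not the paper's kernel, and simplified: it still spends one atomic per wavefront rather than the paper's fully software-based handshake), the idea is to combine contributions inside a wavefront first and let only the dominant thread publish the combined result.

```cpp
#include <atomic>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Each worker thread stands in for the dominant thread of one wavefront:
// per-element updates are aggregated locally, and only one shared-memory
// update per wavefront reaches the global counter.
int main() {
    constexpr int wavefronts = 4;
    constexpr int lanes      = 64;                       // threads per wavefront

    std::atomic<int> global_counter{0};
    std::vector<int> wavefront_sum(wavefronts, 0);       // one element per wavefront
    std::vector<int> wavefront_prefix(wavefronts, 0);    // base offset per wavefront

    std::vector<std::thread> workers;
    for (int w = 0; w < wavefronts; ++w) {
        workers.emplace_back([&, w] {
            // every lane contributes one element (placeholder workload)
            std::vector<int> contribution(lanes, 1);

            // "local" aggregation: no shared-memory traffic at all
            int local_sum = std::accumulate(contribution.begin(), contribution.end(), 0);

            // only the dominant thread of the wavefront touches shared state
            wavefront_prefix[w] = global_counter.fetch_add(local_sum);
            wavefront_sum[w]    = local_sum;
        });
    }
    for (auto& t : workers) t.join();

    for (int w = 0; w < wavefronts; ++w)
        std::cout << "wavefront " << w << ": base offset " << wavefront_prefix[w]
                  << ", sum " << wavefront_sum[w] << '\n';
    std::cout << "total: " << global_counter.load() << '\n';
}
```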

  35. Evaluation  Hardware − GPU: ATI Radeon HD 5870 (Cypress) − CPU: 2x Intel Xeon E5405  Key performance measures: total execution time in nanoseconds; ratio of FastPath to CompletePath memory transactions MapReduce 35

  36. Experiment: Micro Benchmarks  1) without memory transactions: up to 3x vs. the system atomic operation  2) with memory transactions: up to 1.9x vs. the system atomic operation  Source of the figures: (3) MapReduce 36
