
AMD GPU | Jasper Manousek, Ying Li | 05.02.2015 | Seminar



  1. AMD GPU. Jasper Manousek, Ying Li. 05.02.2015. Seminar | High-Performance and Scientific Computing, Prof. Paolo Bientinesi, Ph.D.

  2. Agenda  Architecture  Dwarfs  Sparse Linear Algebra  Dense Linear Algebra  Graph Traversal  MapReduce  Conclusion 2

  3. Architecture 3

  4. Comparison: Nvidia GTX 640 vs. Radeon HD 6850
      Nvidia GTX 640: • 1 controlling unit for every 8 stream processors • advantage: easier for developers due to its simple structure
      Radeon HD 6850: • blocks of 6 SPs: 4 general-purpose SPs, one SP with FP/Int arithmetic functions, and one overseer • advantage: more potential if used correctly • disadvantage: requires the developer to specifically program towards it
      Architecture 4

  5. Comparison  Lower power consumption overall  Smaller die size thanks to its structure  Less expensive  Other small differences Architecture 5

  6. Dense Linear Algebra  Classic vector and matrix operations (1)  Data is typically laid out as a contiguous array, and computations on elements, rows, columns, or matrix blocks are the norm (2)  Examples (3)  (1), (2), (3): http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra Dense Linear Algebra 6
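To make the storage convention concrete, here is a minimal C++ sketch (not taken from any of the cited papers): a dense matrix kept in one contiguous row-major array, with a naive matrix-vector product that walks whole rows.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Dense storage: all rows*cols entries live in one contiguous array,
// with element (i, j) at index i*cols + j (row-major order).
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> a;
};

std::vector<double> matvec(const DenseMatrix& m, const std::vector<double>& x) {
    std::vector<double> y(m.rows, 0.0);
    for (std::size_t i = 0; i < m.rows; ++i)        // one contiguous row per iteration
        for (std::size_t j = 0; j < m.cols; ++j)
            y[i] += m.a[i * m.cols + j] * x[j];
    return y;
}

int main() {
    DenseMatrix m{2, 2, {1, 2, 3, 4}};              // [[1, 2], [3, 4]]
    for (double v : matvec(m, {1, 1})) std::cout << v << ' ';   // prints: 3 7
}
```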

  7. Paper  Title: Pannotia: Understanding Irregular GPGPU Graph Applications  Authors: Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron  Publication: Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept. 2013  Link: http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf Dense Linear Algebra 7

  8. Overview of the Paper  Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (clMAGMA library)  Efficient implementation on AMD’s Tahiti GPUs with the use of the OpenCL standard and optimized BLAS routines  Observation of a wide applicability and many-fold performance improvement over highly tuned codes constituting state-of-the-art libraries for the current generation of multicore CPUs Dense Linear Algebra 8

  9. Performance Study  Hardware: AMD Radeon HD 7970 card and a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU's multicore host  Library: MKL 11.1 on the CPU; clMAGMA on the GPU and its CPU host  Results: in dense linear algebra, clMAGMA on the heterogeneous system (multicore CPU plus GPU accelerator) delivers higher performance than MKL on the CPU alone Dense Linear Algebra 9

  10. Results in Detail (1)  1) LU factorization: up to 5.7x speedup vs. the CPU host  2) Cholesky factorization: up to 5.4x speedup vs. the CPU host  (Figures compare CPU+GPU with clMAGMA against CPU with MKL 11.1.) Source of the figures: (1) Dense Linear Algebra 10

  11. Results in Detail (2)  3) QR factorization: up to 5.9x speedup vs. the CPU host  4) Hessenberg factorization: up to 5.5x speedup vs. the CPU host  (Figures compare CPU+GPU with clMAGMA against CPU with MKL 11.1.) Source of the figures: (1) Dense Linear Algebra 11

  12. Results in Detail (3)  5) Matrix inversion: up to 1.2x speedup vs. the CPU host  (Figure compares CPU+GPU with clMAGMA against CPU with MKL 11.1.) Source of the figures: (1) Dense Linear Algebra 12

  13. Sparse Linear Algebra  Used when input matrices have a large number of zero entries (1)  Compressed data structures, keeping only the non-zero entries and their indices, are the norm here (2) (3)  (1), (2): http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra  (3): http://www.lanl.gov/Caesar/node223.html Sparse Linear Algebra 13
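As a concrete illustration of such a compressed structure (again a sketch, not code from the paper), the compressed sparse row (CSR) format stores only the non-zero values, their column indices, and per-row offsets; the sparse matrix-vector product then touches only the stored entries.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// CSR layout: non-zero values, their column indices, and per-row offsets.
struct CsrMatrix {
    std::vector<double> val;     // non-zero values
    std::vector<int>    col;     // column index of each non-zero
    std::vector<int>    row_ptr; // row i occupies [row_ptr[i], row_ptr[i+1])
};

std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col[k]];         // only stored entries are visited
    return y;
}

int main() {
    // 3x3 matrix [[4,0,1],[0,3,0],[2,0,5]] needs only 5 stored entries
    CsrMatrix A{{4, 1, 3, 2, 5}, {0, 2, 1, 0, 2}, {0, 2, 3, 5}};
    for (double v : spmv(A, {1, 1, 1})) std::cout << v << ' ';   // prints: 5 3 7
}
```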

  14. Paper  Title: Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries  Authors: Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling  Publication: SIAM Journal on Scientific Computing, Vol. 35, No. 5  Link: http://arxiv.org/pdf/1212.6326v2.pdf Sparse Linear Algebra 14

  15. Overview of the Paper  Comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL  One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, whose core operation is a sparse matrix-vector product  Overall, the experiments, including the Hamiltonian lattice, show a 10x to 20x acceleration on the GPU compared to the CPU path Sparse Linear Algebra 15
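To give a flavour of what such a high-level interface looks like, below is a minimal VexCL-style sketch based on the library's documented vector-expression interface; the actual Hamiltonian-lattice code in the paper is considerably more involved, and exact API details may differ between VexCL versions.

```cpp
#include <vector>
#include <vexcl/vexcl.hpp>   // requires the VexCL headers and an OpenCL runtime

int main() {
    const size_t n = 1 << 20;

    // Pick an OpenCL device; VexCL hides the platform/context/queue boilerplate.
    vex::Context ctx(vex::Filter::DoublePrecision);
    if (!ctx) return 1;      // no suitable device found

    std::vector<double> host_x(n, 1.0), host_y(n, 2.0), host_z(n);

    vex::vector<double> X(ctx, n), Y(ctx, n), Z(ctx, n);
    vex::copy(host_x, X);    // host -> device
    vex::copy(host_y, Y);

    // One expression: VexCL generates and launches the OpenCL kernel behind the scenes.
    Z = 2 * X + sin(Y);

    vex::copy(Z, host_z);    // device -> host
}
```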

  16. Performance Study  Hardware − GPUs: AMD Radeon HD 7970 (Tahiti) & NVIDIA Tesla C2070 − CPU: Intel Core i7 930  Implementation − GPUs: OpenCL implementations from AMD and NVIDIA − CPU: OpenCL implementations from AMD and Intel  Results − Distinct acceleration is observed when running the GPU path vs. the CPU path − Significant acceleration requires problem sizes between 10^3 and 10^5 because of considerable overhead at smaller problem sizes − The overhead of using high-level libraries is negligible compared to the effort spent getting familiar with the details of CUDA or OpenCL Sparse Linear Algebra 16

  17. Results in Detail (1)  Figure: VexCL performance on the CPU (Intel) and the GPU (AMD)  Source of the table: (2) Sparse Linear Algebra 17

  18. Results in Detail (2)  Performance under the largest problem size (Hamiltonian lattice; time in sec, achieved throughput in GB/sec with percentage of theoretical peak):
      GPU: NVIDIA             Thrust            319.60    120 (81%)
                              CMTL4             370.31    104 (70%)
                              VexCL             401.39     96 (65%)
                              ViennaCL          433.50     89 (60%)
      GPU: Tahiti             VexCL             225.41    170 (65%)
                              ViennaCL          214.87    179 (68%)
      CPU: Intel Core i7 930  Thrust            N/A       N/A
                              VexCL (AMD)       2934.99    13 (51%)
                              VexCL (Intel)     3171.74    12 (47%)
                              ViennaCL (AMD)    2608.80    15 (58%)
                              ViennaCL (Intel)  2580.47    15 (58%)
      Source of the table: (2) Sparse Linear Algebra 18

  19. Graph Traversal  Image source: http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png Graph Traversal 19

  20. Divergence  Branch divergence  Multiple threads run on the same wavefront  Threads in a wavefront execute in lockstep, so divergent branches serialize  Memory divergence  All threads of a wavefront must complete their memory accesses before the next step  Some threads must walk through multiple adjacency lists to find the correct memory  Load imbalance  Graphs are by nature unbalanced  Some threads get much more workload than others Graph Traversal 20
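The load-imbalance point can be made concrete with a small CPU-side sketch (hypothetical degree numbers, not GPU code): if each lane of a 64-wide wavefront processes one vertex of a skewed graph, the slowest lane dictates the runtime of the whole wavefront.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // hypothetical out-degrees of 64 vertices assigned to one 64-lane wavefront
    std::vector<std::size_t> degree(64, 2);   // most vertices have 2 neighbours...
    degree[0] = 500;                          // ...but one hub vertex has 500

    // in lockstep execution the wavefront only finishes once its busiest lane does
    std::size_t steps_taken   = *std::max_element(degree.begin(), degree.end());
    std::size_t useful_work   = std::accumulate(degree.begin(), degree.end(), std::size_t{0});
    std::size_t work_capacity = steps_taken * degree.size();

    std::cout << "lanes busy on average: "
              << 100.0 * useful_work / work_capacity << "% of the time\n";  // ~2%
}
```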

  21. Speedup  All data was gathered using an AMD Radeon HD 7000-series GPU  with an AMD A8-5500 accelerated processing unit  Pannotia was used as the application suite Graph Traversal 21

  22. Dijkstra and Graph Coloring  Image sources: http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg, http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg Graph Traversal 22

  23. Dijkstra and Graph Coloring  Speedups ranging from 4x to 8x  Speedup tends to be better for larger graphs  Strong parallelization Graph Traversal 23

  24. Dijkstra and Graph Coloring Source: (4) Graph Traversal 24

  25. Friend Recommendation and Connected Components Labelling  Image source: http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png Graph Traversal 25

  26. Friend Recommendation and Connected Components Labelling  Speedups ranging from 1x to 2x  Relatively little speedup due to strong imbalance Graph Traversal 26

  27. Summary  Effectiveness depends on the exact problem  A deep understanding of the GPU is required  A deep understanding of the problem is required Graph Traversal 27

  28. Map Reduce  Image source: http://de.wikipedia.org/wiki/Datei:MapReduce2.svg Map Reduce 28

  29. Map Reduce  AMD GPUs have two ways of accessing memory: the FastPath and the CompletePath  All current GPU MapReduce implementations use global atomic operations  The use of global atomic operations causes AMD GPUs to fall back to the CompletePath  Tests show 32x slower memory access over the CompletePath Map Reduce 29
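To illustrate the pattern in question, here is a hedged sketch (not code from the paper) of the kind of emit kernel that triggers the problem; the OpenCL C source is shown as a string constant in a C++ file, with all host-side setup omitted.

```cpp
#include <iostream>

// Hypothetical OpenCL C kernel fragment: every work-item reserves an output
// slot with a global atomic add. Per the slides, the mere presence of such a
// global atomic makes these AMD GPUs route the kernel's memory traffic over
// the slow CompletePath instead of the FastPath.
static const char* emit_kernel_src = R"CLC(
__kernel void emit(__global const int* values,
                   __global int*       out,
                   __global int*       write_pos)   // shared output cursor
{
    int gid = get_global_id(0);
    int v   = values[gid];

    // one global atomic per emitted element: heavy contention, CompletePath
    int slot = atomic_add(write_pos, 1);
    out[slot] = v;
}
)CLC";

int main() {
    // Host-side OpenCL setup (clCreateProgramWithSource, clBuildProgram, ...)
    // is omitted; printing the source only keeps this sketch self-contained.
    std::cout << emit_kernel_src;
}
```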

  30. Software-based Atomic add A Map Reduce Framework for Heterogeneous Computing Architectures Map Reduce 30

  31. Map Reduce  The master thread quickly becomes a bottleneck  Instead, group threads by wavefront  Define the first thread of each wavefront as the dominant thread  Create 4 global arrays with one element per wavefront:  WavefrontsAddresses, WavefrontsSum, WavefrontsPrefixSums, Finished. Map Reduce 31

  32. Map Reduce Step 1 (flow chart): the threads load their addresses and sums, then synchronize. Map Reduce 32

  33. Map Reduce Step 2 (flow chart): a local atomic add generates each thread's prefix sum and the wavefront's increment; if this is the only wavefront on the address, the dominant thread sets WFprefixSum = address and WFincrement = localSum, otherwise the dominant flag is updated and the local increment is set to 0; then sync. Map Reduce 33

  34. Map Reduce Step 3 (flow chart): if a wavefront is requesting, its addresses are set to 0; if the thread is dominant, it updates the global variable and resets the local data; then sync. Map Reduce 34
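As a rough analogy for the wavefront-level scheme of Steps 1 to 3 (a CPU simulation, not the paper's kernel, and simplified: it still spends one atomic per wavefront rather than the paper's fully software-based handshake), the idea is to combine contributions inside a wavefront first and let only the dominant thread publish the combined result.

```cpp
#include <atomic>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Each worker thread stands in for the dominant thread of one wavefront:
// per-element updates are aggregated locally, and only one shared-memory
// update per wavefront reaches the global counter.
int main() {
    constexpr int wavefronts = 4;
    constexpr int lanes      = 64;                       // threads per wavefront

    std::atomic<int> global_counter{0};
    std::vector<int> wavefront_sum(wavefronts, 0);       // one element per wavefront
    std::vector<int> wavefront_prefix(wavefronts, 0);    // base offset per wavefront

    std::vector<std::thread> workers;
    for (int w = 0; w < wavefronts; ++w) {
        workers.emplace_back([&, w] {
            // every lane contributes one element (placeholder workload)
            std::vector<int> contribution(lanes, 1);

            // "local" aggregation: no shared-memory traffic at all
            int local_sum = std::accumulate(contribution.begin(), contribution.end(), 0);

            // only the dominant thread of the wavefront touches shared state
            wavefront_prefix[w] = global_counter.fetch_add(local_sum);
            wavefront_sum[w]    = local_sum;
        });
    }
    for (auto& t : workers) t.join();

    for (int w = 0; w < wavefronts; ++w)
        std::cout << "wavefront " << w << ": base offset " << wavefront_prefix[w]
                  << ", sum " << wavefront_sum[w] << '\n';
    std::cout << "total: " << global_counter.load() << '\n';
}
```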

  35. Evaluation  Hardware − GPU: ATI Radeon HD 5870 (Cypress) − CPU: 2x Intel Xeon E5405  Key performance measures: total execution time in nanoseconds; ratio of FastPath to CompletePath memory transactions MapReduce 35

  36. Experiment: Micro Benchmarks  1) without memory transactions: up to 3x vs. the system atomic operation  2) with memory transactions: up to 1.9x vs. the system atomic operation  Source of the figures: (3) MapReduce 36
