Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics - PowerPoint PPT Presentation

Applications of Berkeley ’ s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015

Overview CUDA • Dynamic Programming • Sparse Linear Algebra The Dwarfs • Unstructured Grids • Combinational Logic • Graphical Model Summary 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Overview 2/37

CUDA • Parallel computing platform and programming model for GPGPU • Supports various languages including C/C++ and Fortran • Lots of libraries available (e.g. cuSPARSE, cuBLAS, NPP, etc … ) 05.02.2015 Team N2: Yang Zhang & Haiqing Wang CUDA 3/37

CUDA : Execution Model • Each thread gets an ID • Group of threads build a block • Group of blocks build a grid • Each thread executed by a core • Each block executed by a SM • A block is further split into warps • Blocks are independent of each other 05.02.2015 Team N2: Yang Zhang & Haiqing Wang CUDA: Execution Model 4/37

CUDA : Memory Model • Each thread has a private local memory • Each block has a shared memory • Allows communication between threads • All thread can access the global memory • Constant memory is a read-only memory 05.02.2015 Team N2: Yang Zhang & Haiqing Wang CUDA: Memory Model 5/37

Dynamic Programming [1] : Matrix Chain Product An example: ((A1 A2 A3 A4) (A5 A6)) 2*9*3+2*3*1+2*1*4+4*11*5+2*4*5=328 (A1 (A2 A3) (A4 A5) A6) 9*3*1+2*9*1+1*4*112*1*11+2*11*5=221 Goal: Minimize the total number of multiplications 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Matrix Chain Product 7/37

Dynamic Programming [1] : Algorithm 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Algorithm 8/37

Dynamic Programming [1] : Algorithm (n=6) Table s: Table m: 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Algorithm 9/37

Dynamic Programming [1] : Implementation Table m: Computing is independent (n=8) Can be computed in parallel 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Implementation 10/37

Dynamic Programming [1] : Implementation The number of k for each (i,j) of each l The number of (i,j) for each l the performance depends on various factors Using three different Kernels: OneThreadPerOneEntry OneBlockPerOneEntry BlocksPerOneEntry The amount of the computation for each l 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Implementation 11/37

Dynamic Programming [1] : Implementation OneThreadPerOneEntry Allocates one Thread to compute one entry e.g. 𝑛 1,5 ,𝑛 2,6 ,𝑛 3,7 , 𝑛 4,8 each one is computed concurrently all use previous entries Memory Mapping Direction: Change Memory Mapping 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Implementation 12/37

Dynamic Programming [1] : Implementation OneThreadPerOneEntry Allocates one Thread to compute one entry e.g. 𝑛 1,5 ,𝑛 2,6 ,𝑛 3,7 , 𝑛 4,8 each one is computed concurrently by one core all use previous entries in shared memory stored in Global memory after computing CUDA Architecture: Stored in Global memory: 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Implementation 13/37

Dynamic Programming [1] : Implementation OneBlockPerOneEntry Allocates one Block to compute one entry e.g. 𝑛 1,5 = min 1≤𝑙<5 (𝑛 1,𝑙 + 𝑛 𝑙+1,5 + 𝑞 0 𝑞 𝑙 𝑞 5 ) is computed by one Streaming multiprocessor each ( 𝑛 1,𝑙 + 𝑛 𝑙+1,5 + 𝑞 0 𝑞 𝑙 𝑞 5 ) is computed by one core use another core for selection CUDA Architecture: 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Implementation 14/37

Dynamic Programming [1] : Implementation BlocksPerOneEntry Allocates multiple Blocks to compute for one entry e.g. 𝑛 1,5 = min 1≤𝑙<5 (𝑛 1,𝑙 + 𝑛 𝑙+1,5 + 𝑞 0 𝑞 𝑙 𝑞 5 ) is computed by a few Streaming multiprocessors each ( 𝑛 1,𝑙 + 𝑛 𝑙+1,5 + 𝑞 0 𝑞 𝑙 𝑞 5 ) is computed by one core but maybe from different Streaming multiprocessors use another core in any Streaming multiprocessor for selection CUDA Architecture: 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Implementation 15/37

Dynamic Programming [1] : Evaluation GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 Streaming Multiprocessors which has 32 processing cores) 1.4GHz, 3GB memory. Total time of each kernel for different number of threads and blocks (n = 16384) 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Evaluation 16/37

Dynamic Programming [1] : Evaluation GPU: Fastest Kernel for different l Nvidia GeForce GTX 480 with 480 processing cores (15 Streaming Multiprocessors which has 32 processing cores) 1.4GHz, 3GB memory. Running time with l of each kernel: OneBlockPerOneEntry OneThreadPerOneEntry BlocksPerOneEntry 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Evaluation 17/37

Dynamic Programming [1] : Evaluation GPU vs. CPU GPU: Fastest Kernel for different l Nvidia GeForce GTX 480 with 480 processing cores (15 Streaming Multiprocessors which has 32 processing cores) 1.4GHz, 3GB memory. (combination of three Kernels) Total computing time for n = 16384 CPU: Intel Core i7 870, 2.93GHz, 8GB memory (sequential program in C language) The speedup factor is unfair 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Dynamic Programming: Evaluation GPU vs. CPU 18/37

Sparse Linear Algebra [2] Goal: Accelerate sparse matrix-matrix (SpMM) product on GPU SpMM product: Compute 𝐷 = 𝐵𝐶 where A sparse matrix, B dense matrix X 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Sparse Linear Algebra 20/37

Sparse Linear Algebra [2] : FastSpMM Approach: • Extension of the ELLR-T kernel called FastSpMM • Relies on ELLPACK-R storage format • Outperforms common libraries for SpMM (e.g. cuSPARSE) 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Sparse Linear Algebra: FastSpMM 21/37

Sparse Linear Algebra [2] : Evaluation SpMM Three versions of SpMM routines evaluated on two Nvidia GPUs: FastSpMM vs. ELLR-T (ELLPACK-R storage format) vs. cuSPARSE (CRS storage format) 𝑂𝑦𝑂 test sparse matrices GTX480 Tesla C2050 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Sparse Linear Algebra: Evaluation SpMM 22/37

Sparse Linear Algebra [2] : Evaluation GPU vs. CPU GTX480 and Tesla C2050 using FastSpMM vs. Intel Xeon E5640 with 4 cores using the MKL library Runtimes (in seconds) on test matrices: Speedups compared to CPU: GTX480: 2,8 × − 6,2 × Tesla C2050: 1,7 × − 3,8 × 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Sparse Linear Algebra: Evaluation GPU vs. CPU 23/37

Unstructured Grids [3] : Compressible Flows Compressible flows simulation on 3-D unstructured grids Compressible flows : fluid mechanics that deals with flows having significant changes in fluid density An example : Subsonic Flow past a Sphere 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Unstructured Grids: Compressible Flows 25/37

Unstructured Grids [3] : DG Method Discontinuous Galerkin (DG) method : in mathematics form a class of numerical methods for solving differential equations DG method can be implemented in parallel An example : Subsonic Flow past a Sphere 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Unstructured Grids: DG Method 26/37

Unstructured Grids [3] : Evaluation GPU vs. CPU GPU: Nelem: number of elements Ntime : number of time steps NVIDIA Tesla K20c GPU containing 2496 multiprocessors (OpenACC-based program) Timing measurements for subsonic flow past a sphere CPU: AMD Opteron 6128 CPU containing 16 cores (MPI-based parallel program) 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Unstructured Grids: Evaluation GPU vs. CPU 27/37

Combinational Logic [4] : Parallel AES Goal: Efficient encryption/decryption of data streams on web server applications Approach: Design of a parallel AES on GPU Two design choices: • Fine-grained: Focus on thread-level parallelism • A lot of communication and synchronization • Coarse-grained: Focus on higher-level parallelism i.e. blocks 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Combinational Logic: Parallel AES 29/37

Combinational Logic [4] : Evaluation Comparison: Fine-grained vs coarse-grained on a Nvidia 8880 GT (112 cores) 05.02.2015 Team N2: Yang Zhang & Haiqing Wang Combinational Logic: Evaluation 30/37

Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics - PowerPoint PPT Presentation

Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA Dynamic Programming Sparse Linear Algebra The Dwarfs

NVIDIA GPU - odd dwarfs Julian Na and Marcus V olker 12. Februar 2015 1/37 Dwarfs

White Dwarfs as Absolute Flux Standards David S. Finley 1 Abstract Hot DA white dwarfs can serve

Chapter 18 The Bizarre Stellar Graveyard 18.1 White Dwarfs Our goals for learning What

Chapter 18 The Bizarre Stellar Graveyard 18.1 White Dwarfs Our goals for learning What

Double white dwarfs and AM CVn binaries in the Galactic disc Gijs Nelemans Institute of

White D Dwarfs a and E Electron D Deg egeneracy cy Farley V. Ferrante Southern Methodist

The low mass dwarfs of @michelle_lmc Andromeda - an update Michelle Collins - University of

Magnetospheric Accretion and Out3low Activity in Brown Dwarfs

Benefits of the ARM architecture on the even Berkeley Dwarfs Patric Mai, Pierre Schoonbrood RWTH

Pathogenesis of hydrocephalus in achondroplastic dwarfs: a review and presentation of a case

Brown dwarfs, exoplanets & exo-tic objects Jan Budaj Astronomical Institute 05960 T

The Brown Dwarfs of our Milky Way Benne W. Holwerda (University of Louisville) Isabel van

NIRSPEC RV Measurements of Late-M Dwarfs by Angelle Tanner (GSU) with Russel White (GSU), John

Peering into the physics of brown dwarfs: spectroscopy with JWST/ NIRSpec Catarina Alves de

The theoretical instability strip of V777 Her white dwarfs Valerie Van Grootel (1) G. Fontaine (2)

Effects of Rotation in White Dwarfs Norbert Langer (Utrecht University) with thanks to: Rudy

Argon : tradeoff-resilient password hashing scheme Alex Biryukov Dmitry Khovratovich University

Securing Circuits Against Constant-Rate Tampering Dana Dachman-Soled Yael Tauman Kalai

Attractive routes Sound pleasantness of pedestrian walks in urban environment Catherine Lavandier

Parsing and Speech Research at Brown University Mark Johnson Brown University The University of

A Correctness Result for Synthesizing Plans With Loops in Stochastic Domains Laszlo Treszkai

Markov decision process (MDP) Robert Platt Northeastern University The RL Setting Action Agent

Multi-armed Bandits Prof. Kuan-Ting Lai 2020/3/12 k-armed Bandit Problem Playing k armed

From Qualitative to Quantitative Dominance Pruning for Optimal Planning Alvaro Torralba

Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics - PowerPoint PPT Presentation

Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA Dynamic Programming Sparse Linear Algebra The Dwarfs

NVIDIA GPU - odd dwarfs Julian Na and Marcus V olker 12. Februar 2015 1/37 Dwarfs

White Dwarfs as Absolute Flux Standards David S. Finley 1 Abstract Hot DA white dwarfs can serve

Chapter 18 The Bizarre Stellar Graveyard 18.1 White Dwarfs Our goals for learning What

Chapter 18 The Bizarre Stellar Graveyard 18.1 White Dwarfs Our goals for learning What

Double white dwarfs and AM CVn binaries in the Galactic disc Gijs Nelemans Institute of

White D Dwarfs a and E Electron D Deg egeneracy cy Farley V. Ferrante Southern Methodist

The low mass dwarfs of @michelle_lmc Andromeda - an update Michelle Collins - University of

Magnetospheric Accretion and Out3low Activity in Brown Dwarfs

Benefits of the ARM architecture on the even Berkeley Dwarfs Patric Mai, Pierre Schoonbrood RWTH

Pathogenesis of hydrocephalus in achondroplastic dwarfs: a review and presentation of a case

Brown dwarfs, exoplanets &amp; exo-tic objects Jan Budaj Astronomical Institute 05960 T

The Brown Dwarfs of our Milky Way Benne W. Holwerda (University of Louisville) Isabel van

NIRSPEC RV Measurements of Late-M Dwarfs by Angelle Tanner (GSU) with Russel White (GSU), John

Peering into the physics of brown dwarfs: spectroscopy with JWST/ NIRSpec Catarina Alves de

The theoretical instability strip of V777 Her white dwarfs Valerie Van Grootel (1) G. Fontaine (2)

Effects of Rotation in White Dwarfs Norbert Langer (Utrecht University) with thanks to: Rudy

Argon : tradeoff-resilient password hashing scheme Alex Biryukov Dmitry Khovratovich University

Securing Circuits Against Constant-Rate Tampering Dana Dachman-Soled Yael Tauman Kalai

Attractive routes Sound pleasantness of pedestrian walks in urban environment Catherine Lavandier

Parsing and Speech Research at Brown University Mark Johnson Brown University The University of

A Correctness Result for Synthesizing Plans With Loops in Stochastic Domains Laszlo Treszkai

Markov decision process (MDP) Robert Platt Northeastern University The RL Setting Action Agent

Multi-armed Bandits Prof. Kuan-Ting Lai 2020/3/12 k-armed Bandit Problem Playing k armed

From Qualitative to Quantitative Dominance Pruning for Optimal Planning Alvaro Torralba

Brown dwarfs, exoplanets & exo-tic objects Jan Budaj Astronomical Institute 05960 T