Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes
Vishal Mehta, Engineer, Barcelona Supercomputing Center
vishal.mehta@bsc.es
www.bsc.es

BSC/UPC CUDA Centre of Excellence (CCOE)
Training
• Build an education program on parallel programming using CUDA, OpenCL and OmpSs
• PUMPS summer school 2010-2015, courses at BSC and UPC
Research
• Generation, simulation and rendering of large, varied animated crowds using OmpSs, presented at the current GTC
• HERTA Security: GPU-based machine learning for real-time face recognition and bio-marketing, also presented at this GTC
• Exploring the potential of low-power GPU clusters as high-performance platforms, as part of the Mont-Blanc and PRACE prototypes

Top500 Power Consumption Evolution
[Figure: power in MW versus year, 2008 to 2013, for the TOP10, TOP50 and TOP500 systems]
• TOP10: x5.04 in 5 years
• TOP50: x3.13 in 5 years
• TOP500: x3.25 in 5 years
Higher performance, at the expense of higher power.

Mont-Blanc Project
http://www.montblanc-project.eu
European approach for energy efficient HPC systems.
Objectives:
• To develop a full energy-efficient HPC prototype using low-power, commercially available embedded technology.
• To develop a portfolio of exascale applications to be run on this new generation of HPC systems.
• To explore different alternatives for the compute node (from low-power mobile sockets to special-purpose high-end ARM chips), and their implications for the rest of the system.

Euroserver Project
http://www.euroserver-project.eu
European approach for energy efficient data servers.
Objectives:
• Reduced energy consumption by (i) using ARM (64-bit) cores, (ii) drastically reducing the core-to-memory distance, and (iii) improving on the "energy proportionality".
• Reduced cost to build and operate each microserver through (i) improved manufacturing yield, (ii) reduced physical volume of the packaged interposer module, and (iii) an energy efficient semiconductor process (FDSOI).

Mont-Blanc Prototype Ecosystem

Outline
1. Pedraforca Prototype Architecture
2. Evaluation application
3. Exploiting Dynamic Parallelism
4. Some benchmarks and results
5. Limitations & Conclusions

Pedraforca: Prototype Node Architecture
E4 ARKA single node desktop unit

Pedraforca: Cluster
• 3 x bullx 1200 rack
• 78 compute nodes
• 2 login nodes
• 4 36-port InfiniBand switches (MPI)
• 2 50-port GbE switches (storage)

Comparing Power Budgets

Component    x86_64 based system,    Low power ARM system,
             max power usage (W)     max power usage (W)
Tesla K20    235                     235
Board        80                      25
CPU          90                      5
Total        405                     265

x86_64 based system: quad core Intel i5-3570K @ 3.4 GHz, ASUS P8Z77 V-pro board.
Low power ARM system: Tegra 3 (quad core ARM A9 @ 1.3 GHz), Mini ITX carrier board.

Thick restarted Lanczos Algorithm in Lattice QCD
[Figure: lattice at time 't'; points are SU(3) vectors (complex double), links are SU(3) matrices (3 x 3, complex double)]
• Each point on the lattice is an SU(3) vector, and the links connecting points are SU(3) matrices.
• The thick restarted Lanczos algorithm is used to generate eigenpairs of the lattice.
• 80% cuBLAS routines
• Average number of cuBLAS calls: 60000 to 90000, depending on lattice configuration
• Lattices from multiple time steps are processed in parallel.

Evaluation Example: Lanczos Iteration
Flow, repeated for N iterations after the initial vector (v_0):
1. Apply matrix: V_i = A (V_{i-1})
2. Compute alpha: α_i = dot(V_i, V_{i-1})
3. AXPY kernel: V_i = V_i - α_i V_{i-1} - β_{i-1} V_{i-2}
4. Global orthogonalization
5. Compute beta: β_i = Euclidean norm(V_i)
6. New subspace vector: V_i = V_i / β_i
Notes:
• Large number of BLAS operations
• Dominated by the global orthogonalization module, which includes BLAS
• Implemented using cuBLAS, highly modularized and easy to use
• Iterations are not independent of each other
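
The iteration above can be sketched on the host in plain C++. This is only an illustrative reference, not the deck's cuBLAS implementation: it assumes a small real symmetric matrix instead of complex double lattice data, treats "global orthogonalization" as a full Gram-Schmidt pass against all previous vectors, and leaves out the thick restart. The names `lanczos`, `LanczosResult`, `apply_matrix`, `dot` and `axpy` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense, row-major, assumed symmetric

// BLAS-like helpers (hypothetical stand-ins for the cuBLAS dot/AXPY calls).
double dot(const Vec& x, const Vec& y) {
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) s += x[k] * y[k];
    return s;
}

void axpy(double a, const Vec& x, Vec& y) {  // y += a * x
    assert(x.size() == y.size());
    for (std::size_t k = 0; k < x.size(); ++k) y[k] += a * x[k];
}

Vec apply_matrix(const Mat& A, const Vec& v) {  // returns A * v
    assert(A.size() == v.size());
    Vec w(v.size(), 0.0);
    for (std::size_t r = 0; r < A.size(); ++r) w[r] = dot(A[r], v);
    return w;
}

struct LanczosResult {
    std::vector<double> alpha;  // diagonal coefficients
    std::vector<double> beta;   // off-diagonal coefficients
    std::vector<Vec> basis;     // orthonormal subspace vectors
};

LanczosResult lanczos(const Mat& A, Vec v0, std::size_t steps) {
    LanczosResult r;
    const double n0 = std::sqrt(dot(v0, v0));
    assert(n0 > 0.0);
    for (double& x : v0) x /= n0;  // normalized initial vector
    r.basis.push_back(v0);
    double beta_prev = 0.0;
    for (std::size_t i = 1; i <= steps; ++i) {
        Vec w = apply_matrix(A, r.basis[i - 1]);           // V_i = A V_{i-1}
        const double alpha = dot(w, r.basis[i - 1]);       // alpha_i
        axpy(-alpha, r.basis[i - 1], w);                   // - alpha_i V_{i-1}
        if (i >= 2) axpy(-beta_prev, r.basis[i - 2], w);   // - beta_{i-1} V_{i-2}
        for (const Vec& q : r.basis) axpy(-dot(w, q), q, w);  // Gram-Schmidt pass
        const double beta = std::sqrt(dot(w, w));          // beta_i
        r.alpha.push_back(alpha);
        if (beta < 1e-12) break;  // subspace is invariant, nothing left to add
        r.beta.push_back(beta);
        for (double& x : w) x /= beta;                     // V_i = V_i / beta_i
        r.basis.push_back(w);
        beta_prev = beta;
    }
    return r;
}
```

Each pass depends on the vectors produced by the previous one, which is why the steps cannot be launched independently: every dot, AXPY and norm becomes a separate, serially issued kernel launch in the GPU version.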

Algorithm Implementation for the Prototype
The CPU works as coordinator; the GPU, as slave, executes kernels.
CPU pipeline launches, GPU pipeline executes (serial dependency between steps):
Start -> Apply matrix -> cuBLAS dot kernel -> cuBLAS AXPY kernel -> End
Bottlenecks
• Large number of calls to cuBLAS.
• The overall algorithm is serial.
• Dominated by the CPU's capability of launching cuBLAS kernels.
• The ARM processor is not fast enough to quickly launch kernels on the GPU, so the GPU is underutilized.

Exploiting Dynamic Parallelism
Dynamic parallelism exists to let the GPU adapt to the data.
Can we exploit it further to solve bottlenecks and save power?

Approach for Exploiting Dynamic Parallelism for Low Power Prototype
Original: the CPU works as coordinator; the GPU, as slave, executes kernels.
  CPU pipeline launches each step on the GPU pipeline (serial dependency):
  Start -> Apply matrix -> cuBLAS dot kernel -> cuBLAS AXPY kernel -> End
With wrapper: the CPU starts and ends the wrapper; the GPU wrapper coordinates the tasks.
  Wrapper kernel with 1 control thread, running on the GPU pipeline:
  Start -> Apply matrix -> cuBLAS dot kernel -> cuBLAS AXPY kernel -> End

Example code 1: Simple Wrapper

Original code:

    __global__ void Applymatrix(..,..)

    int main()
    {
        copytoGPU();
        // every kernel is launched from the CPU
        Applymatrix<<<…,…>>>();
        cublasZdot();
        cublasZAXPY();
        copyfromGPU();
    }

Code with wrapper:

    __global__ void Applymatrix(..,..)

    // single control thread on the GPU launches the kernels
    __global__ void wrapper(..,..)
    {
        Applymatrix<<<…,…>>>();
        cublasZdot();
        cublasZAXPY();
    }

    int main()
    {
        copytoGPU();
        wrapper<<<1,1>>>();
        copyfromGPU();
    }

Multiple Threads in Wrapper
GPU wrapper with 2 CUDA threads: Wrapper<<<1,2>>>()
The wrapper is executed with more than one thread in order to process multiple instances. Each thread issues its own sequence: Apply matrix, cuBLAS dot kernel, cuBLAS AXPY kernel.
PROBLEM: Threads in the same block launch kernels one after another, so multiple instances are not executed simultaneously.

Bottleneck caused by multiple threads in wrapper
OUR GOAL: with a GPU wrapper of 2 CUDA threads, both instances of the sequence (Apply matrix, cuBLAS dot kernel, cuBLAS AXPY kernel) run side by side on the GPU pipeline.
SOLUTION: CUDA streams created on the GPU side.

Solution for processing multiple instances by CUDA streams
GPU wrapper with 2 CUDA threads: each thread creates its own CUDA stream, then issues Apply matrix, cuBLAS dot kernel and cuBLAS AXPY kernel into that stream.

Modification to code:

    __global__ void wrapper(..,..)
    {
        // one stream per wrapper thread
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

        // route the cuBLAS calls to this thread's stream
        cublasSetStream(…., stream);

        Applymatrix<<<…,…stream>>>();
        cublasZdot();
        cublasZAXPY();

        cudaStreamDestroy(stream);
    }

cuBLAS kernel launch scaling

No. of kernel calls    cuBLAS calls by CPU (seconds)    cuBLAS calls by GPU thread (seconds)    Speed up
1 x 10^3               1.72                             1.43                                    1.20x
3 x 10^3               2.23                             1.62                                    1.37x
5 x 10^3               4.7                              2.9                                     1.62x
10 x 10^3              7.52                             3.5                                     2.14x
50 x 10^3              11.78                            4.2                                     2.80x

[Figure: speed up versus number of cuBLAS calls]
cuBLAS level 1 routines: 40% reduction kernel, 30% AXPY kernel, 30% dot product.
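
As a worked check of the speed up column, the following C++ sketch divides the CPU launch time by the GPU-thread launch time for one row. The names `LaunchTiming`, `speedup` and `trunc2` are hypothetical; truncating (rather than rounding) to two decimals is an assumption, chosen because it reproduces all five table entries.

```cpp
#include <cassert>
#include <cmath>

// One row of the launch-scaling measurements: total seconds spent issuing
// the same number of cuBLAS calls from the CPU and from a GPU thread.
struct LaunchTiming {
    double calls;
    double cpu_seconds;
    double gpu_seconds;
};

// Speed up of GPU-side launching over CPU-side launching.
double speedup(const LaunchTiming& t) {
    assert(t.gpu_seconds > 0.0);
    return t.cpu_seconds / t.gpu_seconds;
}

// Truncate to two decimals (assumed to be how the table was produced).
double trunc2(double x) {
    return std::floor(x * 100.0) / 100.0;
}
```

For the last row, `trunc2(speedup({50e3, 11.78, 4.2}))` gives 2.80, and the ratio grows with the number of calls in every row of the table.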

Application Performance (High Frequency CPU)
System: quad core Intel i5-3570K @ 3.4 GHz
[Figure: bar chart of execution time (sec) versus lattice size (24, 32, 48) for four series: kernel calls by CPU, kernel calls by CPU (with streams), kernel calls by GPU, kernel calls by GPU (with streams)]
Bar labels as extracted (seconds): 12.8, 11.2, 8.7, 7.6, 7.5, 6.4, 5.2, 5.2, 4.4, 4.1, 2.8, 2.3
Code with wrapper may be slower on a system with a fast CPU.

Application Performance (Pedraforca Prototype)
System: Tegra 3, quad core ARM A9 @ 1.3 GHz
[Figure: bar chart of execution time (sec) versus lattice size (24, 32, 48) for four series: kernel calls by CPU, kernel calls by CPU (with streams), kernel calls by GPU, kernel calls by GPU (with streams)]
Bar labels as extracted (seconds): 40.6, 36.4, 23.5, 20.4, 15.2, 13.6, 13.1, 9, 7.5, 5.3, 5.2, 2.7
Code with wrapper kernel performs better on the ARM based system.

Comparing systems
A: Quad core i5-3570K @ 3.4 GHz, Tesla K20
B: Quad core ARM A9 @ 1.3 GHz, Tesla K20

Comparing power footprint: Without CUDA streams
A: All kernels launched by CPU (quad core Intel i5-3570K @ 3.4 GHz)
B: All kernels launched by GPU (Tegra 3, quad core ARM A9 @ 1.3 GHz)

QCD lattice    Execution time (seconds)    Average power (W)    Energy consumption (J)
size           A         B                 A       B            A          B
24             4.4       5.3               367     245          1614.8     1298.5
32             6.4       7.5               359     246          2297.6     1845
48             11.2      13.1              365     243          4088       3183.3

[Figure: energy savings (%) versus lattice size (24, 32, 48), axis range 16 to 24%]
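
The energy column follows from the other two as energy = execution time x average power, and the saving of B over A is the relative difference of the two energies. A minimal C++ sketch of that arithmetic, with hypothetical function names `energy_joules` and `savings_percent`:

```cpp
#include <cassert>

// Energy in joules from execution time (seconds) and average power (watts).
double energy_joules(double seconds, double watts) {
    assert(seconds >= 0.0 && watts >= 0.0);
    return seconds * watts;
}

// Percentage of energy saved by system B relative to system A.
double savings_percent(double energy_a, double energy_b) {
    assert(energy_a > 0.0);
    return 100.0 * (energy_a - energy_b) / energy_a;
}
```

For lattice size 24 without streams, `energy_joules(4.4, 367)` and `energy_joules(5.3, 245)` reproduce 1614.8 J and 1298.5 J, and `savings_percent` of those two values comes out a little under 20%, even though system B takes longer to finish.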

Comparing power footprint: With CUDA streams
A: All kernels launched by CPU (quad core Intel i5-3570K @ 3.4 GHz)
B: All kernels launched by GPU (Tegra 3, quad core ARM A9 @ 1.3 GHz)

QCD lattice    Execution time (seconds)    Average power (W)    Energy consumption (J)
size           A         B                 A       B            A          B
24             2.3       2.7               420     286          966        772.2
32             4.1       5.2               426     287          1746.6     1392.4
48             7.5       9.0               425     282          3187.5     2538

[Figure: energy savings (%) versus lattice size (24, 32, 48), axis range 16 to 24%]
