SLIDE 1

Performing with CUDA

  • W. B. Langdon

CREST lab, Department of Computer Science

8.7.2011

Pages 423-430

SLIDE 2
  • W. B. Langdon, UCL

2

Introduction

  • Initial steps
  • Concentrate upon what is different about high

performance with GPU:

– Many threads
– Finding and avoiding bottlenecks

  • Conclusions
SLIDE 3

Before you code

  • How much of your new application will be

run in parallel? If less than ~90%, stop.

  • Evolutionary algorithms are called “embarrassingly parallel”
  • If the population is big: one thread per member
  • The fitness function may be hard to parallelise
  • How much of the GPU’s speed and memory do

you need? (Advertised performance is the best possible.)
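The “<90%” rule is Amdahl’s law in disguise: if only a fraction \(p\) of the run time can be parallelised, then however large the GPU speedup \(s\) on that fraction, the overall speedup \(S\) is capped.

```latex
S(s) \;=\; \frac{1}{(1-p) + p/s} \;\le\; \frac{1}{1-p},
\qquad p = 0.9 \;\Rightarrow\; S \le 10 .
```

So an application that is only 90% parallel can never run more than ten times faster, no matter how fast the GPU.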

SLIDE 4

GPU computing needs many threads

For best speed use ≥ 20× as many threads as stream processors
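A minimal sketch of that rule of thumb, assuming a Fermi-class card (32 cores per multiprocessor) and a hypothetical kernel; the block size of 256 is just a common starting point, not from the slides:

```cuda
// Sketch: choose a launch configuration with >= 20 threads per stream
// processor. A Fermi C2050 has 14 multiprocessors x 32 cores = 448
// stream processors, so it wants >= 8960 threads in flight.
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int streamProcessors = prop.multiProcessorCount * 32; /* 32 cores/SM on Fermi */
  int minThreads = 20 * streamProcessors;               /* rule of thumb        */
  int blockSize  = 256;                                 /* a common choice      */
  int gridSize   = (minThreads + blockSize - 1) / blockSize;
  printf("launch >= %d blocks of %d threads\n", gridSize, blockSize);
  /* myKernel<<<gridSize, blockSize>>>(...);  hypothetical kernel */
  return 0;
}
```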

SLIDE 5

GPU: many threads hide latency

SLIDE 6

Bottlenecks

SLIDE 7

SLIDE 8
SLIDE 9

Slowest step dominates

  • In a car you know if

– you are doing well: the road is wide and smooth
– you are in heavy traffic, or the road is narrow and bendy

  • With a GPU it is difficult to tell what is

holding you back

SLIDE 10

Fermi C2050

The PCI host↔GPU link is always a narrower bottleneck than the GPU↔on-board memory bus. Both can be important.

SLIDE 11

Locate Bottleneck in Design: Host PC↔GPU PCI Bus

  • PCI traffic can be estimated in advance
  • Count the bytes into and back from the GPU per

kernel call.

  • How long to transfer the data? (bytes/bandwidth)
  • How long between kernel launches?

– If <1 millisecond, consider fewer, bigger launches

  • bandwidthTest (see its switches) gives the PCI

speed.
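A back-of-envelope version of that estimate. The byte counts and bandwidth below are made-up placeholders; substitute the figure bandwidthTest reports for your own machine.

```cuda
// Sketch: estimate the PCI time budget per kernel call in advance.
#include <stdio.h>

int main() {
  double bytesToGpu   = 4e6;     /* data sent per kernel call    (assumed) */
  double bytesFromGpu = 4e6;     /* data returned per call       (assumed) */
  double pciBandwidth = 5.0e9;   /* bytes/sec, from bandwidthTest (assumed) */
  double t = (bytesToGpu + bytesFromGpu) / pciBandwidth;
  printf("PCI transfer time per call: %.3f ms\n", t * 1e3);
  /* if kernels launch less than ~1 ms apart, transfers of this size
     argue for fewer, bigger launches */
  return 0;
}
```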

SLIDE 12

Other Bottlenecks

  • In theory one can do the same for GPU↔global

memory transfers, but:

– It is hard to do.
– PCI can run at 100% usage (pinned memory).
– It is hard to predict the fraction of usage inside the GPU.
– What effect will caches have?
– There must be enough threads to keep both processors and memory buses busy.
– Atomic and non-coalesced operations may have an unexpectedly large impact.

SLIDE 13

Performance by Hacking

  • Measure performance.
  • Is performance good enough? If yes, stop.
  • Can it be made better? If no, stop.
  • Identify and remove the current bottleneck.
  • Measure the new performance. What is the new

bottleneck?

SLIDE 14

Timing whole kernels on host

Remember to use cudaThreadSynchronize. See examples in CUDA SDK sources.
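A minimal sketch of host-side timing; the kernel launch is a placeholder. Kernel launches are asynchronous, so without cudaThreadSynchronize the timer would stop long before the GPU finished.

```cuda
// Sketch: wall-clock timing of a whole kernel from the host.
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wallclock(void) {
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main() {
  double t0 = wallclock();
  /* myKernel<<<grid, block>>>(...);   hypothetical launch */
  cudaThreadSynchronize();             /* wait until the GPU is really done */
  printf("kernel took %.3f ms\n", (wallclock() - t0) * 1e3);
  return 0;
}
```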

SLIDE 15

Timing Kernel Code

  • Perhaps use GPU’s own clock
  • Alter kernel to do operation N+1 times instead of

just once.

– Time per operation ≈ extra kernel time/N

  • Ensure the new code behaves the same as the old
  • Ensure the nvcc compiler does not optimise away

your modification

  • Results can be disappointing: less compute time

may mean more time waiting for memory.
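A sketch of the N+1 trick; the operation under test (sqrtf here) and all names are purely illustrative. Comparing a run with reps = N against reps = 0 gives time-per-operation ≈ extra time / N.

```cuda
// Sketch: repeat the operation reps+1 times inside the kernel. The
// accumulator feeds the output, and r varies per iteration, so nvcc
// cannot optimise the extra repetitions away.
__global__ void timedKernel(const float *in, float *out, int n, int reps) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float acc = 0.0f;
    for (int r = 0; r <= reps; ++r)    /* N+1 repetitions of the operation */
      acc += sqrtf(in[i] + (float)r);  /* iterations differ, so none is dead */
    out[i] = acc;                      /* result depends on every repeat    */
  }
}
```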

SLIDE 16

CUDA Profiler

  • Two parts

– Counters on the GPU write their data to host files
– A user interface controls which counters are active and displays the results

  • The Linux visual profiler is not stable

– Use a spreadsheet, gnuplot, etc. instead

  • The CUDA Profiler is good for measuring:

– Divergence
– Cache misses (non-coalesced I/O)
– Serialised access to constant memory

SLIDE 17

Multiple GPUs

  • CUDA requires you to use conventional

threads on the host (e.g. pthreads).

  • There is a large overhead in creating GPU data

structures on the host. So:

– Create CUDA data once at the start of the run
– Create pthreads once at the start of the run
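A minimal sketch of this pattern, assuming two GPUs; the worker body and data-structure setup are placeholders. Each host thread binds to its own device before any other CUDA call.

```cuda
// Sketch: one pthread per GPU, created once at the start of the run.
#include <pthread.h>
#include <cuda_runtime.h>

static void *worker(void *arg) {
  int dev = *(int *)arg;
  cudaSetDevice(dev);   /* bind this host thread to one GPU */
  /* create GPU data structures once here, then loop launching kernels */
  return NULL;
}

int main() {
  int ids[2] = {0, 1};
  pthread_t t[2];
  for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, &ids[i]);
  for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
  return 0;
}
```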

SLIDE 18

Other Approaches

  • Can you compress the data?

– e.g. send bytes across the PCI bus rather than ints

  • Can you keep data on GPU to avoid

re-reading it?

  • Would it be better to re-calculate rather

than re-read?
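A sketch of the byte-compression idea; all names are illustrative, and it assumes every value fits in 0..255. Packing ints into bytes on the host quarters the PCI traffic.

```cuda
// Sketch: compress ints to bytes before the host->GPU transfer.
#include <cuda_runtime.h>

void sendCompressed(const int *hostInts, unsigned char *hostBytes,
                    unsigned char *devBytes, int n) {
  for (int i = 0; i < n; ++i)                   /* pack on the host      */
    hostBytes[i] = (unsigned char)hostInts[i];  /* values assumed 0..255 */
  cudaMemcpy(devBytes, hostBytes, n,            /* n bytes, not 4*n      */
             cudaMemcpyHostToDevice);
}
```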

SLIDE 19

Conclusions

  • Design before you start.

– Will the non-parallel part prevent a useful speedup?
– Use lots of threads

  • Locate the slowest step. Concentrate on it.
  • The slowest step is usually moving data.
  • Don’t be afraid to waste computation.
  • Computation is cheap; data is expensive.
SLIDE 20

END

http://www.epsrc.ac.uk/

SLIDE 21

A Field Guide To Genetic Programming http://www.gp-field-guide.org.uk/ Free PDF

SLIDE 22

The Genetic Programming Bibliography

The largest, most complete collection of GP papers: http://www.cs.bham.ac.uk/~wbl/biblio/

With 7,554 references and 5,895 online publications, the GP Bibliography is a vital resource for the computer science, artificial intelligence, machine learning, and evolutionary computing communities.

RSS support is available through the Collection of CS Bibliographies. There is a web form for adding your own entries, a wiki for updating homepages, a co-authorship community, and downloads.

A personalised list of every author’s GP publications. Search the GP Bibliography at http://liinwww.ira.uka.de/bibliography/Ai/genetic.programming.html