GRAPHICS PROCESSOR PROGRAMMING IN CUDA
Tamás Budavári / The Johns Hopkins University
ISSAC at HiPACC, 7/18/2012
How I got into this?
Galaxy correlation function
Histogram of distances (8 bins)
State-of-the-art method: dual-tree traversal
800 × 800 bins
Dedicated service for direct access
Shared memory IPC w/ on-the-fly data transform
Pair counts computed on the GPU
Returns 2D histogram as a table (i, j, cts)
Calculate the correlation function in SQL
Several C# proxies to launch jobs on more cards
Non-blocking SQL routines
600 trillion galaxy pairs
C for CUDA on GPUs
BAO (baryon acoustic oscillations)
Tian, Neyrinck, TB & Szalay (2011)
Parallelism
Hardware
Programming
Multithreading
Coding for GPUs: CUDA, Thrust, …
Data parallel
Same processing on different pieces of data
Task parallel
Simultaneous processing on the same data
Clouds, clusters, machines, cores, threads
Scale up (vertically)
Add resources to a node: bigger memory, faster processor, …
Scale out (horizontally)
Use more of the threads, cores, machines, clusters, clouds, …
Traditional HPC clusters
Launching jobs on a cluster of machines
Use MPI (Message Passing Interface) to communicate among nodes
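A minimal MPI sketch (assumes an MPI installation; the file name and messages are illustrative):

```c
/* hello_mpi.c -- a minimal MPI sketch (illustrative).
   Build:  mpicc hello_mpi.c -o hello_mpi
   Run:    mpirun -np 4 ./hello_mpi                     */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* start the MPI runtime  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes    */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                        /* shut down the runtime  */
    return 0;
}
```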
Used for batch jobs on computer clusters
Fair scheduling of user jobs, group policies
Several systems: Portable Batch System (PBS), Condor, etc…
Flynn’s Taxonomy
Single Instruction Single Data (SISD)
Classical von Neumann machines
Single-threaded codes
(image: arstechnica.com)
Single Instruction Multiple Data (SIMD)
On x86: MMX (Matrix Math eXtension), SSE (Streaming SIMD Extensions), and more
GPU programming!!
(image: arstechnica.com)
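As a sketch of what SIMD means on the CPU (the helper name is made up; requires an SSE-capable x86 compiler), one SSE instruction operates on four packed floats at once:

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* add4: c[0..3] = a[0..3] + b[0..3], four additions in one instruction */
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats           */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* one add, 4 results      */
}
```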
Speed up:
\( T(N) = T(1)\,(S + P/N) \), with \( S + P = 1 \)
\( \mathrm{speedup} = T(1)/T(N) = 1/\big((1-p) + p/N\big) \), where \( p = P \) is the parallel fraction and \( N \) the number of processors
Before looking into parallelism, speed up the serial code
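For example (numbers chosen only for illustration): with \( p = 0.9 \) and \( N = 16 \) the speedup is just \( 1/(0.1 + 0.9/16) = 6.4 \), far short of 16, which is why the serial part is worth attacking first.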
Power to compute the same thing?
A CPU is 10× less efficient than a digital signal processor (DSP)
A DSP is 10× less efficient than a custom chip
New design: multicores with slower clocks
But the interconnect is expensive; need simpler components
Andrew Chien: 10×10 to replace the 90/10 rule
Custom modules on chip, cf. SoC in cellphones
Statistics on a video codec module?
Scientific analysis on such specialized units?
Virtual world: simulation of real physics
C for CUDA and OpenCL
512 cores; 25k threads, running 1 billion per second
Old algorithms were built on the wrong assumption
Today processing is free but memory is slow
7/18/2012 28 ISSAC at HiPACC
Tamás Budavári
In the number of cores
Faster than ever
No one language to rule them all, and many to choose from
Low-level, (almost) machine code
Different for each computer
Higher level but still close to the hardware, i.e., fast
Pointers!
Many things are written in C
Operating systems, other languages, …
Pros
Memory management with garbage collection
Just-In-Time compilation from ‘bytecode’
Cons
Not so great performance
Hard to include legacy codes
New language features were an afterthought
Scripting to glue things together
Easy to wrap legacy codes
Lots of scientific modules and plotting
Good for prototyping
Perl, Matlab, Mathematica, IDL, R, Lisp, Haskell, OCaml, Erlang, your favorite here…
Skeleton of an application
Files
Headers: *.h
Source: *.c
Building an application
Compile the source files
Link the object files
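A minimal sketch of this layout, with made-up file names:

```c
/* vec.h -- header: declarations shared between source files */
#ifndef VEC_H
#define VEC_H
double dot(const double *a, const double *b, int n);
#endif

/* vec.c -- source: the implementation */
#include "vec.h"
double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* main.c -- source: the entry point */
#include <stdio.h>
#include "vec.h"
int main(void)
{
    double a[] = {1, 2, 3}, b[] = {4, 5, 6};
    printf("%g\n", dot(a, b, 3));   /* prints 32 */
    return 0;
}

/* Build: compile the sources, then link the object files:
     gcc -c vec.c main.c
     gcc vec.o main.o -o app                                  */
```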
Dynamic arrays
Memory allocation, freeing memory
Pointer arithmetic
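A minimal sketch (sizes and values are arbitrary):

```c
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int n = 10;
    double *v = (double *)malloc(n * sizeof(double));   /* allocate   */
    if (v == NULL) return 1;                             /* check!     */

    for (int i = 0; i < n; i++)
        v[i] = 1.0 * i;          /* v[i] is the same as *(v + i)       */

    double *p = v + 5;           /* pointer arithmetic: 6th element    */
    printf("%g\n", *p);

    free(v);                     /* give the memory back               */
    return 0;
}
```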
Pointers to pointers
Data allocated in v, pointers in A
For 2D indexing
One can have: matrices, tensors, …, jagged arrays, …
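A sketch following the slide's naming, with the data in v and the row pointers in A (sizes are arbitrary):

```c
#include <stdlib.h>

int main(void)
{
    int rows = 3, cols = 4;

    /* one contiguous block of data */
    double *v = (double *)malloc(rows * cols * sizeof(double));

    /* array of row pointers into v */
    double **A = (double **)malloc(rows * sizeof(double *));
    for (int i = 0; i < rows; i++)
        A[i] = v + i * cols;

    A[2][3] = 42.0;              /* 2D indexing: row 2, column 3 */

    free(A);
    free(v);
    return 0;
}
```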
“Embarrassingly parallel”
Decoupled problems, independent processing
MapReduce: map, then reduce
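A serial sketch of the pattern with made-up data: the map step is independent per element (embarrassingly parallel), so it can be farmed out to any number of workers; the reduce step combines the partial results.

```c
#include <stdio.h>

int main(void)
{
    double x[] = {1, 2, 3, 4}, y[4], sum = 0.0;
    int n = 4;

    for (int i = 0; i < n; i++)   /* map: independent per element */
        y[i] = x[i] * x[i];

    for (int i = 0; i < n; i++)   /* reduce: combine the results  */
        sum += y[i];

    printf("%g\n", sum);          /* 30 */
    return 0;
}
```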
People on multiple levels
Press the button…
Multiple processes or threads
Access shared resources in critical sections
E.g., call the elevator when it’s time to go
Locking
Elevators, etc…
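A minimal pthreads sketch of a critical section protected by a lock (names are illustrative; compile with -pthread):

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter the critical section */
        counter++;                    /* shared resource            */
        pthread_mutex_unlock(&lock);  /* leave it                   */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);         /* always 200000 with the lock */
    return 0;
}
```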
Five silent philosophers sit at a table
They alternate between eating and thinking
Each needs both forks, left & right, to eat
Forks must be picked up one by one!
Infinite food in front of them
How can they all think & eat forever?
Concurrent parallelism in a machine
Data parallel
Same processing on different pieces of data
Task parallel
Simultaneous processing on the same data
Launch threads, let them run, then sync
CUDA
Low-level & high-level
OpenCL
DirectCompute
DirectX, etc…
C++ AMP (Accelerated Massive Parallelism): new!
GPU-optimized libraries: sorting, RNG, BLAS, FFT, Hadamard, …
SDK with examples
Nsight debugger!
Imaging routines
Python with PyCUDA
High-level C++ programming with Thrust
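A sketch of what the high-level route looks like with Thrust: a GPU sort plus a reduction in a few lines of host code (illustrative values; build with nvcc):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>
#include <cstdio>

int main()
{
    thrust::host_vector<float> h(1 << 20);              // 1M values on the host
    for (size_t i = 0; i < h.size(); i++)
        h[i] = rand() / (float)RAND_MAX;

    thrust::device_vector<float> d = h;                  // copy to the GPU
    thrust::sort(d.begin(), d.end());                    // GPU-optimized sort
    float sum = thrust::reduce(d.begin(), d.end(), 0.0f);// GPU reduction

    printf("sum = %g\n", sum);
    return 0;
}
```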
Previous generation (Fermi)
20-series Tesla cards, e.g., C2050
400+ series GeForce cards, e.g., GTX 480
IEEE-754 arithmetic
Standard floating point, same as in the CPUs
Latest generation (Kepler)
More efficient, more cores
GTX 680 has 1536 cores
Tesla
Computation (& games)
Up to 6GB memory
ECC (Error Correcting Code) on/off
More double-precision

GeForce
Games (& computation)
Typically 1.5GB memory
Faster clock & more cores
Heat! Not meant to run for years
Great for development
1536 simultaneous threads per multiprocessor
Up to ~25,000 threads on a Fermi GPU with 16 MPs
L1 cache memory: implicitly speeds codes up
Shared memory: explicit programming
100× access speed compared to global memory
Allows for fast communication between threads
L1 + shared memory is 64KB (on the same chip)
Configurable as 16KB + 48KB or 48KB + 16KB
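A sketch of explicit shared-memory use, not taken from the slides: each block of 256 threads sums its own chunk in fast on-chip memory and writes back a single partial result.

```cuda
// Launch with 256 threads per block. Each block loads its elements into
// shared memory, does a tree reduction there, and writes one partial sum.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[256];                 // fast on-chip memory
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;         // load into shared memory
    __syncthreads();                           // wait for the whole block

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s]; // threads communicate via buf
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];    // one result per block
}
```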
GPU code to run on all threads
Pick your data
Process it
Save the results
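A minimal kernel sketch of this pattern (the kernel name and the scaling operation are made up for illustration):

```cuda
// Every thread handles one element: pick, process, save.
__global__ void scale(const float *in, float *out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // pick your data
    if (i < n)                                      // guard the tail
        out[i] = a * in[i];                         // process & save
}
```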
Threads are grouped into warps
32 threads running together, SIMD style
Block of threads
1D, 2D or 3D to best match the data layout
Can communicate with each other!
Grid of blocks
Launch millions of threads
Regardless of how many cores are available
No communication between blocks
Different blocks can be running sequentially
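A sketch of the host side, reusing the scale() kernel sketched above (names are illustrative): the grid size is derived from the data size, so the same launch covers millions of elements no matter how many cores the card has.

```cuda
#include <cuda_runtime.h>

// Declaration of the scale() kernel sketched above.
__global__ void scale(const float *in, float *out, float a, int n);

void run(const float *h_in, float *h_out, int n)
{
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                          // threads per block
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
    scale<<<blocks, threads>>>(d_in, d_out, 2.0f, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```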