
PROGRAMMING IN CUDA / Tamás Budavári, The Johns Hopkins University - PowerPoint PPT Presentation



  1. GRAPHICS PROCESSOR PROGRAMMING IN CUDA  Tamás Budavári / The Johns Hopkins University  ISSAC at HiPACC, 7/18/2012

  2. How I got into this?  Galaxy correlation function  8 bins  Histogram of distances  State-of-the-art method  Dual-tree traversal

  3. What if? 800 × 800 bins

  4. Extending SQL Server  Dedicated service for direct access  Shared-memory IPC w/ on-the-fly data transform

  5. User-Defined Functions  Pair counts computed on the GPU  Returns 2D histogram as a table (i, j, cts)  Calculate the correlation fn in SQL

  6. User-Defined Functions  Pair counts computed on the GPU  Returns 2D histogram as a table (i, j, cts)  Calculate the correlation fn in SQL

  7. Multiple GPUs in Parallel  Several C# proxies to launch jobs on more cards  Non-blocking SQL routines

  8. Async SQL Interface

  9. Baryon Acoustic Oscillations  600 trillion galaxy pairs  Tian, Neyrinck, Budavári & Szalay (2011)  C for CUDA on GPUs

  10. Outline  Parallelism  Hardware  Programming  Multithreading  Coding for GPUs  CUDA, Thrust, …

  11. Parallelism  Data parallel: same processing on different pieces of data  Task parallel: simultaneous processing on the same data

  12. On all levels of the hierarchy  Clouds  Clusters  Machines  Cores  Threads

  13. Scalability  Scale up (vertically): add resources to a node (bigger memory, faster processor, …)  Scale out (horizontally): use more of the threads, cores, machines, clusters, clouds, …

  14. Cluster

  15. High-Performance Computing  Traditional HPC clusters  Launching jobs on a cluster of machines  Use MPI (Message Passing Interface) to communicate among nodes

  16. Queuing Systems  Used for batch jobs on computer clusters  Fair scheduling of user jobs  Group policies  Several systems: Portable Batch System (PBS), Condor, etc.

  17. Computer

  18. Classification of Parallel Computers  Flynn’s Taxonomy

  19. SISD  Single Instruction, Single Data  Classical von Neumann machines  Single-threaded codes  (image: arstechnica.com)

  20. SIMD  Single Instruction, Multiple Data  On x86:  MMX: MultiMedia eXtensions  SSE: Streaming SIMD Extensions  …and more…  GPU programming!!  (image: arstechnica.com)

  21. Amdahl’s Law of Parallelism  Speedup: S_P = T(1) / T(N)  With parallel fraction p: S_P = 1 / ((1 − p) + p/N)  As N → ∞, S_P → 1 / (1 − p)  Before looking into parallelism, speed up the serial code; the serial fraction (1 − p) sets the maximum possible speedup

  22. Chip

  23. Moore’s Law

  24. New Limitation is Energy!  Power to compute the same thing?  CPU is 10× less efficient than a digital signal processor  DSP is 10× less efficient than a custom chip  New design: multicores with slower clocks  But the interconnect is expensive  Need simpler components

  25. Emerging Architectures  Andrew Chien: 10×10 to replace the 90/10 rule  Custom modules on chip, cf. SoC in cellphones  Statistics on a video codec module?

  26. Emerging Architectures  Andrew Chien: 10×10 to replace the 90/10 rule  Custom modules on chip, cf. SoC in cellphones  Scientific analysis on such specialized units?

  27. GPUs Evolved to be General Purpose  Virtual world: simulation of real physics  C for CUDA and OpenCL  512 cores  25k threads, running 1 billion/sec  Old algorithms built on the wrong assumption  Today processing is free but memory is slow  New programming paradigm!

  28. New Moore’s Law  In the number of cores  Faster than ever

  29. Programming

  30. Programming Languages  No one language to rule them all  And many to choose from

  31. Assembly  Low-level (almost) machine code  Different for each computer

  32. The “C” Language  Higher level but still close to hardware, i.e., fast  Pointers!  Many things written in C  Operating systems  Other languages, …

  33. Java  Pros:  Memory management with garbage collection  Just-in-time compilation from ‘bytecode’  Cons:  Not so great performance  Hard to include legacy codes  New language features were an afterthought

  34. Python  Scripting to glue things together  Easy to wrap legacy codes  Lots of scientific modules and plotting  Good for prototyping

  35. Etc…  Perl  Lisp  Matlab  Haskell  Mathematica  OCaml  IDL  Erlang  R  Your favorite here…

  36. Programming in C  Skeleton of an application

  37. Programming in C  Files:  Headers (*.h)  Source (*.c)  Building an application:  Compile source  Link object files

  38. Using Pointers

  39. Arrays  Dynamic arrays  Memory allocation  Freeing memory  Pointer arithmetic

  40. Matrix, etc…  Pointers to pointers  Data allocated in v  Pointers in A  For 2D indexing  One can have  Matrix, tensor, …  Jagged arrays, …

  41. Concurrency  Parallel actions

  42. Data Parallel Techniques  “Embarrassingly parallel”  Decoupled problems, independent processing  MapReduce:  Map  Reduce

  43. The Elevator Problem  People on multiple levels  Press the button…

  44. Mutual Exclusion  Multiple processes or threads  Access shared resources in critical sections  E.g., call the elevator when it’s time to go  Locking  Elevators, etc…

  45. Dining Philosophers  Five silent philosophers sit at the table  Alternate between eating and thinking  Need both forks, left & right, to eat  Must be picked up one by one!  Infinite food in front of them  How can they all think & eat forever?

  46. Parallel Threads

  47. Threading  Concurrent parallelism in a machine

  48. Parallelism  Data parallel: same processing on different pieces of data  Task parallel: simultaneous processing on the same data

  49. Comparing Chips

  50. Hybrid Architecture  launch → run → sync

  51. Programming GPGPUs  CUDA  Low-level & high-level  OpenCL  DirectCompute  DirectX, etc…  C++ AMP (new!)  Accelerated Massive Parallelism

  52. CUDA
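To make the hybrid launch/run/sync model concrete, a minimal "C for CUDA" SAXPY sketch (needs an NVIDIA GPU and nvcc; standard kernel and launch syntax, with error handling omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

/* Each of many thousands of threads handles one array element:
   the data-parallel model the slides describe. */
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       /* guard: the grid may overshoot n */
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    /* Allocate on the device, copy in, launch, copy back. */
    float *x, *y;
    cudaMalloc(&x, bytes);
    cudaMalloc(&y, bytes);
    cudaMemcpy(x, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);  /* 4096 blocks of 256 threads */

    cudaMemcpy(hy, y, bytes, cudaMemcpyDeviceToHost);  /* implicit sync */
    printf("y[0] = %.1f\n", hy[0]);  /* 3*1 + 2 = 5.0 */

    cudaFree(x); cudaFree(y); free(hx); free(hy);
    return 0;
}
```

The host launches, the device runs, and the copy back synchronizes: exactly the launch/run/sync picture of the hybrid-architecture slide.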

  53. Projects on CUDA Zone

  54. Currently Available  GPU-optimized sorting, RNG, BLAS, FFT, Hadamard…  SDK w/ examples  Nsight debugger!  Imaging routines  Python w/ PyCUDA  High-level C++ programming with Thrust

  55. Fermi  Previous generation  20-series Tesla cards, e.g., C2050  400-series GeForce cards, e.g., GTX 480  IEEE-754 arithmetic  Standard floating point  Same as in the CPUs

  56. Kepler  Latest generation  More efficient, more cores  GTX 680 has 1536 cores
