  1. Lecture 18: CSE 260 – Parallel Computation (Fall 2015), Scott B. Baden. Large-scale computing

  2. Announcements
     • Office hours on Wednesday
       - 3:30 PM until 5:30 PM
       - I'll stay after 5:30 until the last one leaves
     • Test on the last day of class
       - BRING A BLUE BOOK
       - Tests your ability to apply the knowledge you've gained in the course
       - Open book, open notes
       - You may bring a PDF viewer (e.g. Preview, Acrobat) to look at course materials only
         - No web browsing: turn off the internet
         - No cell phones

  3. Today's lecture
     • Supercomputers
     • Architectures
     • Applications

  4. What is the purpose of a supercomputer?
     • Improve our understanding of scientifically and technologically important phenomena
     • Improve the quality of life through technological innovation, simulations, and data processing
       - Data mining
       - Image processing
       - Simulations: financial modeling, weather, biomedical
     • Economic benefits

  5. What is the world's fastest supercomputer?
     • Top500 #1 (>2 years): Tianhe-2 @ NUDT (China), 3.12M cores, 54.8 Pflops peak, 17.8 MW power + ~6 MW cooling, 12-core Ivy Bridge + Intel Phi
     • #2: Titan @ Oak Ridge, USA, 561K cores, 27 PF, 8.2 MW, Cray XK7: AMD Opteron + Nvidia Kepler K20x
     top500.org

  6. What does a supercomputer look like?
     • Hierarchically organized servers
     • Hybrid communication
       - Threads within the server
       - Pass messages between servers (or among groups of cores)
     Edison @ nersc.gov
     conferences.computer.org/sc/2012/papers/1000a079.pdf

  7. State-of-the-art applications
     • Blood simulation on Jaguar (Gatech team)
     • Ab Initio Molecular Dynamics (AIMD) using Plane Waves Density Functional Theory, Eric Bylaska (PNNL)

     Strong scaling (exchange time on HOPPER):
       p            48      384     3072    24576
       Time (sec)   899.8   116.7   16.7    4.9
       Efficiency   1.00    0.96    0.84    0.35

     Weak scaling:
       p            24576   98304   196608
       Time (sec)   228.3   258     304.9
       Efficiency   1.00    0.88    0.75

     Slide courtesy of Tan Nguyen, UCSD

  8. Performance differs across application domains
     • Colella's 7 dwarfs: patterns of communication and computation that persist over time and across implementations
       - Structured grids (e.g. the Panfilov method)
       - Dense linear algebra: matrix multiply (C[i,j] += A[i,:] * B[:,j]), matrix-vector multiply, Gaussian elimination
       - N-body methods
       - Sparse linear algebra: with a sparse matrix, use knowledge about the locations of the non-zeroes to improve some aspect of performance
       - Unstructured grids
       - Spectral methods (FFT)
       - Monte Carlo

  9. Application-specific knowledge is important
     • There currently exists no tool that can convert a serial program into an efficient parallel program ... for all applications ... all of the time ... on all hardware
     • The more we know about the application ... the specific problem ... the math/physics ... the initial data ... the context for analyzing the output ... the more we can improve performance
     • Performance programming issues
       - Data motion and locality
       - Load balancing
       - Serial sections

  10. Sparse Matrices
     • A matrix where we employ knowledge about the location of the non-zeroes
     • Consider Jacobi's method with a 5-point stencil:
       u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h^2 f[i,j]) / 4
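To make the 5-point update above concrete, here is a minimal C sketch of one Jacobi sweep. This is not the course's code: the flattened row-major layout, the interior-only loop bounds, and the names u, unew, f, N, h are all assumptions.

    /* One Jacobi sweep with the 5-point stencil from the slide:
       u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h^2 f[i,j]) / 4
       Layout and names are illustrative assumptions. */
    #include <stddef.h>

    void jacobi_sweep_2d(double *unew, const double *u, const double *f,
                         size_t N, double h)
    {
        const double h2 = h * h;
        /* Update interior points only; the boundary holds Dirichlet values. */
        for (size_t i = 1; i + 1 < N; i++)
            for (size_t j = 1; j + 1 < N; j++)
                unew[i*N + j] = (u[(i-1)*N + j] + u[(i+1)*N + j] +
                                 u[i*N + (j-1)] + u[i*N + (j+1)] -
                                 h2 * f[i*N + j]) * 0.25;
    }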

  11. Web connectivity matrix: 1M x 1M
     • A 1M x 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase
     • 3 non-zeroes/row
     • Dense: 2^20 × 2^20 = 2^40 = 1024 Gwords
     • Sparse: (3/2^20) × 2^40 = 3 Mwords
     • The sparse representation saves roughly a factor of 1 million in storage
     (Jim Demmel)

  12. Circuit Simulation
     • Motorola circuit: 170,998 x 170,998 matrix, 958,936 nonzeroes
     • .003% nonzeroes, 5.6 nonzeroes/row
     www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html

  13. Generating sparse matrices from unstructured grids
     • In some applications of sparse matrices, we generate the matrix from an "unstructured" mesh, e.g. the finite element method
     • In some cases we apply direct mesh updates, using nearest neighbors
     • Irregular partitioning
     [Figure: a 2D airfoil mesh]

  14. Sparse Matrix Vector Multiplication
     • An important kernel used in linear algebra
     • Assume x[] fits in the memory of one processor
       y[i] += A[i,j] × x[j]
     • Many formats exist; a common format for CPUs is Compressed Sparse Row (CSR)
     (Jim Demmel)

  15. Sparse matrix vector multiply kernel

     // y[i] += A[i,j] × x[j]
     #pragma omp parallel for schedule(dynamic, chunk)
     for i = 0 : N-1              // rows
         i0 = ptr[i]
         i1 = ptr[i+1] - 1
         for j = i0 : i1          // nonzeroes in row i (columns)
             y[i] += val[j] * x[ind[j]]
         end j
     end i
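For concreteness, a self-contained C version of the CSR kernel above, applied to a small hard-coded matrix. The 4x4 example matrix and the variable names are illustrative assumptions; the OpenMP pragma is omitted to keep the sketch minimal.

    /* CSR sparse matrix-vector multiply: y[i] += A[i,j] * x[j].
       The 4x4 example matrix below is an illustrative assumption. */
    #include <stdio.h>

    /* CSR arrays for the matrix
       [ 10  0  0  2 ]
       [  0  3  9  0 ]
       [  0  0  7  0 ]
       [  1  0  0  5 ]                                          */
    static const int    ptr[] = {0, 2, 4, 5, 7};        /* row start offsets */
    static const int    ind[] = {0, 3, 1, 2, 2, 0, 3};  /* column indices    */
    static const double val[] = {10, 2, 3, 9, 7, 1, 5}; /* nonzero values    */

    static void spmv_csr(int n, const int *ptr, const int *ind,
                         const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {               /* rows               */
            double sum = 0.0;
            for (int j = ptr[i]; j < ptr[i+1]; j++) /* nonzeroes in row i */
                sum += val[j] * x[ind[j]];
            y[i] += sum;
        }
    }

    int main(void)
    {
        double x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
        spmv_csr(4, ptr, ind, val, x, y);
        for (int i = 0; i < 4; i++)
            printf("y[%d] = %g\n", i, y[i]);  /* expected: 18 33 21 21 */
        return 0;
    }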

  16. Up and beyond to Exascale
     • In 1961, President Kennedy mandated a landing on the Moon by the end of the decade
     • July 20, 1969, at Tranquility Base: "The Eagle has landed"
     • The US government set an ambitious schedule to reach 10^18 flops by ~2023
     • DOE is taking the lead in the US; the EU is also engaged
     • Massive technical challenges

  17. The Challenges to landing "Eagle"
     • High levels of parallelism within and across nodes
       - 10^18 flops using NVIDIA devices @ 10^12 flops each
       - 10^6 devices, 10^9+ threads
     • Power: ≤ 20 MW (today: 18 MW @ 0.05 Exaflops)
       - Power consumption 1-2 nJ/op today → 20 pJ/op at exascale
       - Data storage & access consume most of the energy
     • Ever-lengthening communication delays
       - Complicated memory hierarchies
       - Raise the amount of computation per unit of communication
       - Hide latency, conserve locality
     • Reliability and resilience
       - Blue Gene/L's Mean Time Between Failures (MTBF) is measured in days
     • Application code complexity; domain-specific languages
       - NUMA processors, not fully cache coherent on-chip
       - Mixture of accelerators and conventional cores

  18. Technological trends
     • Growth in cores/socket rather than sockets
     • Heterogeneous processors
     • Memory/core is shrinking
     • Complicated software-managed parallel memory hierarchy
     • Communication costs increasing relative to computation
     Intel Sandy Bridge, anandtech.com

  19. 35 years of processor trends

  20. How do we manage these constraints?
     • Increase the amount of computation performed per unit of communication
       - Conserve locality ("communication avoiding")
     • Hide communication
     • Many threads
     [Figure: improvement vs. year, with processor performance outpacing memory bandwidth and latency through the Giga, Tera, Peta, and Exa(?) eras]

  21. A Crosscutting issue: hiding communication
     • Little's law [1961]: the number of threads must equal the parallelism times the latency
       - T = p × λ, and both p and λ are increasing with time
     • Difficult to implement
       - Split-phase algorithms
       - Partitioning and scheduling
     • The state of the art enables but doesn't support the activity
     • Distracts from the focus on the domain science
     • Implementation policies entangled with correctness issues
       - Non-robust performance
       - High development costs
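To make T = p × λ concrete, a tiny worked calculation; the throughput and latency values are assumptions chosen only for illustration, not figures from the slides.

    /* Little's law: T = p * lambda gives the number of operations that must
       be in flight to hide latency. The numbers are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double p      = 1.0e12;  /* assumed device throughput: 10^12 ops/s  */
        double lambda = 1.0e-6;  /* assumed remote-access latency: 1 us     */
        double T = p * lambda;   /* in-flight operations (threads) required */
        printf("Concurrency needed to hide latency: %.0f operations\n", T);
        return 0;
    }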

  22. Motivating application
     • Solve Laplace's equation in 3 dimensions with Dirichlet boundary conditions:
       Δφ = ρ(x,y,z) in Ω (ρ ≠ 0), with φ = 0 on ∂Ω
     • Building block: an iterative solver using Jacobi's method (7-point stencil)

       for (i,j,k) in 1:N x 1:N x 1:N
           u'[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] +
                          u[i][j-1][k] + u[i][j+1][k] +
                          u[i][j][k+1] + u[i][j][k-1]) / 6.0

  23. Classic message passing implementation
     • Decompose the domain into sub-regions, one per process
       - Transmit halo regions between processes
       - Compute the inner region after communication completes
     • Loop-carried dependences impose a strict ordering on communication and computation
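A minimal sketch of that strict ordering, assuming a 1-D decomposition of the 3-D grid along the k axis with one ghost plane at each end. The decomposition, buffer layout, neighbor ranks, and the choice of MPI_Sendrecv are illustrative assumptions, not the course's implementation.

    /* Blocking halo exchange, then compute: communication and computation
       are strictly ordered, as described on the slide. */
    #include <mpi.h>

    void exchange_then_compute(double *u, double *unew,
                               int plane,      /* points per k-plane              */
                               int nk,         /* local k-planes incl. 2 ghosts   */
                               int lo, int hi, /* neighbor ranks or MPI_PROC_NULL */
                               MPI_Comm comm)
    {
        /* Send my first interior plane down; receive my upper ghost plane. */
        MPI_Sendrecv(u + 1*plane,      plane, MPI_DOUBLE, lo, 0,
                     u + (nk-1)*plane, plane, MPI_DOUBLE, hi, 0,
                     comm, MPI_STATUS_IGNORE);
        /* Send my last interior plane up; receive my lower ghost plane. */
        MPI_Sendrecv(u + (nk-2)*plane, plane, MPI_DOUBLE, hi, 1,
                     u + 0*plane,      plane, MPI_DOUBLE, lo, 1,
                     comm, MPI_STATUS_IGNORE);

        /* Only after both halos have arrived does the local 7-point sweep
           (slide 22) run over unew; the sweep itself is omitted here. */
        (void)unew;
    }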

  24. Communication tolerant variant
     • Only a subset of the domain exhibits loop-carried dependences with respect to the halo region
     • Subdivide the domain to remove some of the dependences
     • We may now sweep the inner region in parallel with communication
     • Sweep the annulus after communication finishes
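A matching sketch of the communication-tolerant variant using non-blocking MPI. The helper routines below are hypothetical placeholders (declared but not defined here); only the MPI calls are real API, and the inner/annulus split follows the slide.

    /* Overlap the halo exchange with the inner sweep; the annulus waits. */
    #include <mpi.h>

    /* Hypothetical helpers: post the Irecv/Isend pairs for the ghost and
       boundary planes, and apply the 7-point stencil to the named sub-regions. */
    void post_halo_recvs(double *u, MPI_Comm comm, MPI_Request *reqs);
    void post_halo_sends(const double *u, MPI_Comm comm, MPI_Request *reqs);
    void sweep_inner(double *unew, const double *u);
    void sweep_annulus(double *unew, const double *u);

    void overlapped_sweep(double *u, double *unew, MPI_Comm comm)
    {
        MPI_Request reqs[4];

        post_halo_recvs(u, comm, &reqs[0]);  /* MPI_Irecv for each ghost plane    */
        post_halo_sends(u, comm, &reqs[2]);  /* MPI_Isend for each boundary plane */

        /* The inner region has no dependence on the halo, so it can be swept
           while the messages are in flight. */
        sweep_inner(unew, u);

        /* Wait for the halo traffic, then sweep the annulus that needed it. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        sweep_annulus(unew, u);
    }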

  25. Processor Virtualization
     • Virtualize the processors by overdecomposing
     • AMPI [Kalé et al.]
       - When an MPI call blocks, the thread yields to another virtual process
     • How do we inform the scheduler about ready tasks?

  26. Observations
     • The exact execution order depends on the data dependence structure: communication & computation
     • But many other correct orderings are possible, and some can enable us to hide communication
     • We can characterize the running program in terms of a task precedence graph
     • There is a deterministic procedure for translating MPI code into the graph
     [Figure: SPMD MPI code (Irecv, Send, Wait, Comp on each of two processes) and the corresponding task graph, nodes 0-4]
