Parallel and Distributed Programming: Introduction
Kenjiro Taura

Contents
1. Why Parallel Programming?
2. What Parallel Machines Look Like, and Where Performance Comes From?
3. How to Program Parallel Machines?
4. How to Program Parallel Machines?

Why parallel?
- frequencies no longer increase (end of Dennard scaling)
- techniques to increase the performance of serial programs (ILP) pay off less and less (Pollack's law)
- multicore, manycore, and GPUs are in part a response to this: have more transistors? ⇒ have more cores
[Figure: frequency trend and Dennard scaling]

There are no serial machines any more
- virtually all CPUs are now multicore
- high-performance accelerators (GPUs and Xeon Phi) run at even lower frequencies and have many more cores (manycore)

Supercomputers look ordinary, perhaps more so
[Figure: supercomputer processors and their clock frequencies: Sunway (≈ 1.45 GHz), Xeon Phi (≈ 1 GHz), NVIDIA GPU (< 1 GHz), CPUs running at ≈ 2.0 GHz. Source: www.top500.org]

Implications for software
- existing serial software does not get (dramatically) faster on new CPUs
- just writing a program in C/C++ gets nowhere close to the machine's potential performance, unless you know how to exploit the parallelism of the machine
- for a given program, you need to understand:
  - does it use multiple cores?
  - if so, how is the work distributed?
  - does it use SIMD instructions (covered later)?

Example: matrix multiply
Q: how much can we improve this on my laptop?

```c
/* C += A * B for n x n single-precision matrices */
void gemm(long n, /* n = 2400 */
          float A[n][n], float B[n][n], float C[n][n]) {
  long i, j, k;
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
}
```

```
$ ./simple_mm
C[1200][1200] = 3011.114014
 in 56.382360 sec
 2.451831 GFLOPS
```

```
$ ./opt_mm
C[1200][1200] = 3011.108154
 in 1.302980 sec
 106.095263 GFLOPS
```

The optimized version is about 43 times faster (106.1 vs. 2.45 GFLOPS).

What a single parallel machine (node) looks like
[Figure: hierarchy of a node: board (×2-8 sockets), socket (×2-16 cores), core (×2-8 virtual cores), SIMD (×8-32)]
SIMD: Single Instruction Multiple Data
- a single SIMD register holds many values
- a single instruction applies the same operation (e.g., add, multiply, etc.) to all data in a SIMD register

What a machine looks like
[Figure labels: chip (socket, node, CPU); (physical) core; hardware thread (virtual core, CPU); L1 cache; L2 cache; L3 cache; memory controller; interconnect; board (×2-8 sockets), socket (×2-16 cores), core (×2-8 virtual cores), SIMD (×8-32)]
- performance comes from multiplying the parallelism of many levels:
    parallelism (per CPU) = SIMD width × instructions/cycle × cores
- in particular:
    peak FLOPS (per CPU) = (2 × SIMD width) × FMA insts/cycle/core × freq × cores
- FMA: Fused Multiply Add (d = a * b + c)
- the first factor of 2: a multiply and an add (each counted as a flop)

What a GPU looks like?
[Figure: Streaming Multiprocessor]
- a GPU consists of many Streaming Multiprocessors (SMs)
- each SM is highly multithreaded and can interleave many warps
- each warp consists of 32 CUDA threads; in a single cycle, the threads in a warp can execute the same single instruction

What a GPU looks like?
- despite very different terminology, there are more commonalities than differences

    GPU                        CPU
    SM                         core
    multithreading in an SM    simultaneous multithreading
    a warp                     a thread executing SIMD instructions
    CUDA thread                each lane of a SIMD instruction

- there are significant differences too, which we'll cover later

How much parallelism?

Intel CPUs
    arch       model        SIMD width  FMAs/cycle  freq   cores  peak GFLOPS  TDP
                            (SP/DP)     /core       (GHz)         (SP/DP)      (W)
    Haswell    E7-8880L v3  8/4         2           2.0    18     1152/576     115
    Broadwell  2699 v4      8/4         2           2.2    22     1548/774     145
    Skylake    6130         16/8        2           2.1    16     2150/1075    125
    KNM        7285         16/8        2           1.4    68     6092/3046    250

NVIDIA GPUs
    arch       model        threads     FMAs/cycle  freq   SMs    peak GFLOPS  TDP
                            /warp       /SM (SP/DP) (GHz)         (SP/DP)      (W)
    Pascal     P100         32          2/1         1.328  56     9519/4760    300
    Volta      V100         32          2/1         1.530  80     15667/7833   300

Peak (SP) FLOPS

Skylake 6130
    = (2 × 16) [flops/FMA insn]
    × 2 [FMA insns/cycle/core]
    × 2.1G [cycles/sec]
    × 16 [cores]
    = 2150 GFLOPS

Volta V100
    = (2 × 32) [flops/FMA insn]
    × 2 [FMA insns/cycle/SM]
    × 1.53G [cycles/sec]
    × 80 [SMs]
    = 15667 GFLOPS

So how to program it?
- no matter how you program it, you want to maximally utilize multiple cores and SIMD instructions
- "how" depends on the programming language

Language constructs for multiple cores
from low level to high level:
- OS-level threads
- SPMD ≈ the entire program runs with N threads
- parallel loops
- dynamically created tasks
- internally parallelized libraries (e.g., matrix operations)
- high-level languages executing pre-determined operations (e.g., matrix operations and map & reduce-like patterns) in parallel (Torch7, Chainer, Spark, etc.)

Language constructs for SIMD
from low level to high level:
- assembly
- intrinsics
- vector types
- vectorized loops
- internally vectorized libraries (e.g., matrix operations)

This lecture is for . . .
those who want to:
- have first-hand experience in parallel and high-performance programming (OpenMP, CUDA, TBB, SIMD, . . . )
- know good tools for solving more complex problems in parallel (divide-and-conquer and task parallelism)
- understand when you can get "close-to-peak" CPU/GPU performance and how to get it (SIMD and instruction-level parallelism)
- learn the many reasons why you don't get good parallel performance
