Parallel Computing Basics, Semantics: Landau's 1st Rule of Education (PowerPoint PPT presentation)

SLIDE 1

Outline: Title | Intro | Semantics | Distributed Mem | Performance | Amdahl | Strategy | Practical | Messages | Conclude

Parallel Computing Basics, Semantics

Landau's 1st Rule of Education

Rubin H. Landau

Sally Haerer, Producer-Director

Based on A Survey of Computational Physics by Landau, Páez, & Bordeianu with Support from the National Science Foundation

Course: Computational Physics II

SLIDE 2

Parallel Problems

Basic and Assigned

- Impressive parallel-computing hardware advances: beyond I/O, memory, and internal-CPU parallelism to multiple processors on a single problem
- Software stuck in the 1960s: message passing is dominant, yet too elementary
- Need sophisticated compilers (OK for multicore)
- Need understanding of hybrid programming models
- Problem: parallelize a simple program's parameter-space search
- Why do it? Faster, bigger, finer resolutions, different problems

SLIDE 3

Computation Example, Matrix Multiplication

Need Communication, Synchronization, Math

[B] = [A][B]   (1)

B_{i,j} = Σ_{k=1}^{N} A_{i,k} B_{k,j}   (2)

- Each LHS element B_{i,j} needs an entire RHS row and column of [B]
- The RHS B_{k,j} must be the old (pre-multiplication) values ⇒ must communicate [B]
- [B] = [A][B]: data dependency, order matters
- [C] = [A][B]: data parallel
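The data-dependency distinction shows up directly in code: writing into a separate [C] is data parallel, while a naive in-place [B] = [A][B] mixes old and new values. A minimal sketch in plain Python (loops kept explicit for clarity; a real code would use a numerical library):

```python
def matmul(A, B):
    """[C] = [A][B]: C[i][j] = sum_k A[i][k]*B[k][j]; B is only read (data parallel)."""
    N = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def matmul_inplace_buggy(A, B):
    """[B] = [A][B] done in place: later entries read already-overwritten B values."""
    N = len(A)
    for i in range(N):
        for j in range(N):
            B[i][j] = sum(A[i][k] * B[k][j] for k in range(N))  # mixes old and new B
    return B

A = [[0, 1], [1, 0]]
B = [[1, 2], [3, 4]]
print(matmul(A, B))                                     # correct: [[3, 4], [1, 2]]
print(matmul_inplace_buggy(A, [row[:] for row in B]))   # wrong:   [[3, 4], [3, 4]]
```

In a parallel run the same hazard appears between processors: each must see the old [B] before anyone overwrites it, hence the communication step.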

SLIDE 4

Parallel Computer Categories

Nodes, Communications, Instructions & Data

[Figure: cluster schematic: I/O node on gigabit Internet; compute nodes on fast Ethernet; FPGA/JTAG control]

- CPU-CPU and memory-memory networks, internal and external
- Node = processor location; a node holds 1 to N CPUs
- SISD: single instruction, single data
- SIMD: single instruction, multiple data
- MIMD: multiple instructions, multiple data
- MIMD: message passing
- MIMD: no-shared-memory cluster
- MIMD: difficult to program, expensive

SLIDE 5

Relation to Multitasking

Locations in Memory

[Figure: independent programs A-D occupying separate locations in memory]

- Much of this happens on PCs and under Unix: multitasking ≈ independent programs simultaneously in RAM
- Round-robin processing
- SISD: one job at a time
- MIMD: multiple jobs at the same time

SLIDE 6

Parallel Categories

Granularity

[Figure: independent programs A-D occupying separate locations in memory, as on the previous slide]

- Grain = measure of computational work = computation / communication
- Coarse grain: separate programs on separate computers, e.g. Monte Carlo on 6 Linux PCs
- Medium grain: several simultaneous processors; bus = communication channel; parallel subroutines across CPUs
- Fine grain: custom compiler, e.g. parallelized for loops

SLIDE 7

Distributed Memory via Commodity PCs

Clusters, Multicomputers, Beowulf, David

[Figure: value of parallel processing across mainframes, vector computers, PCs, workstations, and minis; Beowulf clusters highlighted]

- Dominant coarse-to-medium-grain approach = stand-alone PCs, high-speed switch, messages over a network
- Requirement: data chunks that keep each processor independently busy
- Send data to nodes, collect results, exchange, ...

SLIDE 8

Parallel Performance: Amdahl’s law

Simple Accounting of Time

[Figure: Amdahl's-law speedup vs. percent-parallel fraction, plotted for p = 2, p = 16, and p = ∞]

- Clogged ketchup bottle in a cafeteria line: the slowest step determines the rate
- Serial code and communication = the ketchup
- Need ~90% parallel for useful speedup; ~100% for massive parallelism
- Need new problems

SLIDE 9

Amdahl’s Law Derivation

p = number of CPUs; T_1 = 1-CPU time, T_p = p-CPU time   (1)

S_p = maximum parallel speedup = T_1 / T_p → p   (2)

Not achieved in practice: some code stays serial; data and memory conflicts; communication and synchronization of the processors.

f = parallel fraction of the program ⇒

T_s = (1 − f) T_1   (serial time)   (3)

T_p = f T_1 / p   (parallel time)   (4)

Speedup:  S_p = T_1 / (T_s + T_p) = 1 / (1 − f + f/p)   (Amdahl's law)   (5)
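Equation (5) is easy to tabulate; a minimal Python sketch (the f = 0.9 value is just the slide's ~90% example):

```python
def amdahl_speedup(f, p):
    """Amdahl's law, Eq. (5): S_p = 1 / ((1 - f) + f/p)."""
    return 1.0 / ((1.0 - f) + f / p)

# With f = 0.9, the speedup saturates at 1/(1 - f) = 10 no matter how large p gets:
for p in (1, 2, 16, 10**6):
    print(p, round(amdahl_speedup(0.9, p), 2))
```

This is the quantitative version of the ketchup bottle on the previous slide: the serial 10% caps the speedup at 10.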

SLIDE 10

Amdahl's Law + Communication Overhead

Include Communication Time: Simple & Profound

Latency: T_c = time to move data

S_p ≃ T_1 / (T_1/p + T_c) < p   (1)

For communication time not to matter:

T_1/p ≫ T_c ⇒ p ≪ T_1 / T_c   (2)

- As the number of processors p increases, T_1/p → T_c
- Then more processors ⇒ slower execution
- A faster CPU becomes irrelevant (communication-bound)
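Equation (1) can be explored numerically. A hedged sketch (the T_1 and T_c values are illustrative; in this fixed-latency model the speedup saturates near T_1/T_c, and the outright slowdown the slide warns of appears once T_c itself grows with p):

```python
def speedup_with_comm(T1, Tc, p):
    """Eq. (1): S_p ≈ T1 / (T1/p + Tc), fully parallel work plus a communication cost."""
    return T1 / (T1 / p + Tc)

T1, Tc = 100.0, 1.0          # illustrative times: p should stay well below T1/Tc = 100
for p in (10, 100, 1000):
    print(p, round(speedup_with_comm(T1, Tc, p), 1))
# Past p ~ T1/Tc the curve flattens: extra processors buy almost nothing.
```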

SLIDE 11

How to Actually Parallelize

[Figure: main task program: main routine calls serial subroutine a, parallel subroutines 1-3, and a summation task]

- User creates tasks; each task assigns processor threads
- Main routine = master/controller; subtasks = parallel subroutines (slaves)
- Avoid storage conflicts
- Minimize communication and synchronization
- Don't sacrifice science to speed
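The master / parallel-subroutine / summation-task structure in the figure can be sketched with Python's standard library (the work function and the three-way chunking are hypothetical stand-ins, not from the slides):

```python
from concurrent.futures import ProcessPoolExecutor

def parallel_sub(chunk):
    """A stand-in 'parallel subroutine': each slave handles one independent data chunk."""
    return sum(x * x for x in chunk)

def main_routine():
    data = list(range(1000))
    chunks = [data[i::3] for i in range(3)]           # master divides the work
    with ProcessPoolExecutor(max_workers=3) as pool:  # three parallel subroutines
        partials = pool.map(parallel_sub, chunks)
    return sum(partials)                              # summation task combines results

if __name__ == "__main__":
    print(main_routine())
```

Note the storage-conflict rule in action: each subroutine writes only to its own return value, never to shared state.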

SLIDE 12

Practical Aspects of Message Passing; Don’t Do It

- More processors = more challenge; parallelize only the most numerically intensive codes
- Legacy codes are often Fortran90
- Rewrite (N months) vs. modify the serial code (~70%)?
- Steep learning curve, failures, hard debugging
- Preconditions: code runs often, for days, with little change; needs higher resolution or more bodies
- The problem affects the parallelism: data use, problem structure
- Perfectly (embarrassingly) parallel: Monte Carlo repeats
- Fully synchronous: tightly coupled data, e.g. molecular dynamics
- Loosely synchronous: e.g. groundwater diffusion
- Pipeline parallel: e.g. data → images → animations
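The perfectly (embarrassingly) parallel case deserves a concrete sketch: independent Monte Carlo repeats, here a toy π estimate (the sample counts, worker count, and per-task seeding are illustrative choices, not from the slides):

```python
import random
from multiprocessing import Pool

def mc_pi(args):
    """One independent Monte Carlo repeat: estimate pi from n random dart throws."""
    seed, n = args
    rng = random.Random(seed)            # per-task seed: repeats share no state at all
    hits = sum(rng.random()**2 + rng.random()**2 <= 1.0 for _ in range(n))
    return 4.0 * hits / n

if __name__ == "__main__":
    with Pool(4) as pool:                # 4 workers standing in for, say, 4 Linux PCs
        estimates = pool.map(mc_pi, [(seed, 100_000) for seed in range(4)])
    print(sum(estimates) / len(estimates))   # averaging: the only communication step
```

Because the repeats never talk to each other, Amdahl's f is essentially 1 and the speedup is nearly ideal; that is what makes this class the easiest to parallelize.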

SLIDE 13

High-Level View of Message Passing

4 Simple Communication Commands

[Figure: timeline of a master and two slaves alternating compute, send, and receive steps]

Simple basics: C or Fortran plus 4 communication commands

- send: send a named message
- receive: receive from any sender
- myid: this processor's ID
- numnodes: number of processors
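The four primitives can be mimicked with Python's multiprocessing queues; the names send/receive/myid/numnodes follow the slide, but the queue-based mapping is an illustrative assumption, not a real MPI binding:

```python
from multiprocessing import Process, Queue

NUMNODES = 3   # master (id 0) plus two slaves

def slave(myid, inbox, outbox):
    work = inbox.get()                 # receive: block until the master's message arrives
    outbox.put((myid, work * work))    # send: tag the result with this processor's ID

if __name__ == "__main__":
    to_slaves = [Queue() for _ in range(1, NUMNODES)]
    to_master = Queue()
    procs = [Process(target=slave, args=(i + 1, q, to_master))
             for i, q in enumerate(to_slaves)]
    for p in procs:
        p.start()
    for i, q in enumerate(to_slaves):
        q.put(10 + i)                  # master sends work to each slave
    for _ in procs:
        print(to_master.get())         # master receives, from any sender, in arrival order
    for p in procs:
        p.join()
```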

SLIDE 14

MP: What Can Go Wrong?

Hardware Communication = Problematic

[Figure: master/slave compute-send-receive timeline, repeated from the previous slide]

- Task cooperation and division of labor
- Correct division of the data
- Many low-level details
- Error messages arrive distributed across nodes
- Messages can arrive in the wrong order
- Race conditions: results depend on event order
- Deadlock: processors wait forever for each other
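Deadlock is easy to reproduce: two tasks that each insist on receiving before sending wait on each other forever. A hedged sketch with threads and a timeout standing in for "forever" (the queue-based setup is illustrative):

```python
import queue
import threading

def node(my_inbox, peer_inbox, log):
    """Receive-then-send: if BOTH nodes do this, neither message is ever sent."""
    try:
        msg = my_inbox.get(timeout=0.5)   # receive first (blocks)...
        peer_inbox.put("reply")           # ...send only afterwards
        log.append(msg)
    except queue.Empty:
        log.append("deadlock")            # timeout stands in for waiting forever

a_in, b_in, log = queue.Queue(), queue.Queue(), []
ta = threading.Thread(target=node, args=(a_in, b_in, log))
tb = threading.Thread(target=node, args=(b_in, a_in, log))
ta.start(); tb.start(); ta.join(); tb.join()
print(log)          # both nodes time out: ['deadlock', 'deadlock']
```

The standard cure is to break the symmetry: one side sends before it receives, so the cycle of waits never forms.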

SLIDE 15

Conclude: IBM Blue Gene = by Committee

- Designed for performance per watt
- On-chip and off-chip memory
- 2-core CPU: 1 core computes, 1 communicates
- 65,536 (2^16) nodes
- Peak = 360 teraflops (10^12 flops)
- Medium-speed CPUs, 5.6 Gflops each (runs cool)
- 512 chips/card, 16 cards/board
- Control: MPI
