
Parallel Computing Basics, Semantics: Landau's 1st Rule of Education - PowerPoint PPT Presentation



  1. Parallel Computing Basics, Semantics: Landau's 1st Rule of Education. Rubin H. Landau; Sally Haerer, Producer-Director. Based on A Survey of Computational Physics by Landau, Páez, & Bordeianu, with support from the National Science Foundation. Course: Computational Physics II

  2. Parallel Problems: Basic and Assigned. Impressive parallel (∥) computing hardware advances go beyond ∥ I/O, memory, and internal-CPU parallelism to multiple processors working on a single problem. The software, however, is stuck in the 1960s: message passing is dominant yet too elementary; we need sophisticated compilers (cores are OK) and an understanding of hybrid programming models. Assigned problem: parallelize a simple program's parameter space. Why do it? To run faster and bigger, at finer resolutions, and on different problems.

  3. ∥ Computation Example: Matrix Multiplication. Needs communication, synchronization, and math:

\( [B] = [A][B] \)  (1)

\( B_{i,j} = \sum_{k=1}^{N} A_{i,k}\, B_{k,j} \)  (2)

Each LHS element \( B_{i,j} \) needs an entire row and column of \( [B] \), and the RHS values \( B_{k,j} \) must be the old ones from before the multiplication, so processors must communicate. \( [B] = [A][B] \) has a data dependency: order matters. \( [C] = [A][B] \) is data parallel.
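A minimal C sketch of the distinction (the function name and the 2x2 test values are illustrative, not from the slides): writing into a separate output array [C] makes every element independent and hence data parallel, while overwriting [B] in place would read already-updated entries.

    #include <stdio.h>
    #define N 2

    /* Data-parallel form [C] = [A][B]: each C[i][j] depends only on the
       unchanging inputs A and B, so the (i,j) iterations are independent
       and could run on different processors with no communication. */
    void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    int main(void) {
        double A[N][N] = {{1, 2}, {3, 4}}, B[N][N] = {{5, 6}, {7, 8}}, C[N][N];
        /* In-place [B] = [A][B] would instead require saving the old
           column values of B (a copy or a message) before overwriting:
           that is the data dependency, and order would matter. */
        matmul(A, B, C);
        for (int i = 0; i < N; i++)
            printf("%6.1f %6.1f\n", C[i][0], C[i][1]);
        return 0;
    }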

  4. Parallel Computer Categories: Nodes, Communications, Instructions & Data. Networks run CPU-to-CPU and memory-to-memory, both internal and external. A node is a processor location holding 1-N CPUs. Categories: single instruction, single data (SISD); single instruction, multiple data (SIMD); multiple instructions, multiple data (MIMD). MIMD here means message passing on a cluster with no shared memory: powerful, but difficult and expensive to program. [Figure: a cluster of compute nodes plus an I/O node, linked by gigabit Internet, Fast Ethernet, FPGA, and JTAG connections.]

  5. Relation to Multitasking: Locations in Memory. There is already much ∥ on PCs and Unix: multitasking ≈ ∥. Independent programs A, B, C, D sit simultaneously in RAM and are served by round-robin processing. SISD runs 1 job at a time; MIMD runs multiple jobs at the same time. [Figure: programs A-D occupying separate locations in memory.]

  6. Parallel Categories: Granularity. Grain = a measure of the computational work done between communications; granularity ≈ computation / communication across CPUs. Coarse grain: separate programs on separate computers, e.g. Monte Carlo on 6 Linux PCs. Medium grain: several simultaneous processors sharing a bus (the communication channel), e.g. parallel subroutines on different CPUs. Fine grain: a custom compiler parallelizes, e.g., ∥ for loops. [Figure: programs A-D in memory, as on the previous slide.]

  7. Distributed Memory ∥ via Commodity PCs: Clusters, Multicomputers, Beowulf, David. The dominant coarse-to-medium-grain architecture = stand-alone PCs joined by a high-speed switch and network, communicating by messages. Requirement: chunks of data that keep each processor independently busy. Send data to the nodes, collect results, exchange, and so on. [Figure: "Values of Parallel Processing", comparing mainframe, mini, workstation, PC, vector computer, and Beowulf.]

  8. Parallel Performance: Amdahl's Law. A simple accounting of time: like a clogged ketchup bottle in the cafeteria line, the slowest step determines the overall rate; here the serial and communication parts are the ketchup. Need ∼90% parallel content for worthwhile speedup, ∼100% for massively parallel machines; hence the need for new problems. [Figure: Amdahl's-law speedup S_p vs. percent parallel (0-80%), with curves from p = 2 up to p = ∞.]

  9. Amdahl's Law Derivation. Let \( p \) = number of CPUs, \( T_1 \) = 1-CPU time, \( T_p \) = \( p \)-CPU time (1). The maximum parallel speedup

\( S_p = T_1 / T_p \to p \)  (2)

is not achieved: some of the program is serial, and there are data and memory conflicts plus communication and synchronization of the processors. Let \( f \) = the ∥ fraction of the program. Then

\( T_s = (1 - f)\, T_1 \)  (serial time)  (3)

\( T_p = f\, T_1 / p \)  (parallel time)  (4)

\( S_p = \dfrac{T_1}{T_s + T_p} = \dfrac{1}{1 - f + f/p} \)  (Amdahl's law)  (5)
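A quick numerical check of Eq. (5), as a minimal C sketch (the f and p values are mine, chosen to illustrate the previous slide's ∼90% and ∼100% points):

    #include <stdio.h>

    /* Amdahl's law, Eq. (5): S_p = 1 / (1 - f + f/p), where f is the
       parallel fraction of the program and p the number of processors. */
    double amdahl(double f, int p) {
        return 1.0 / (1.0 - f + f / p);
    }

    int main(void) {
        double fracs[] = {0.50, 0.90, 0.99};
        int procs[] = {2, 16, 1024};
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                printf("f = %.2f  p = %4d  S_p = %7.2f\n",
                       fracs[i], procs[j], amdahl(fracs[i], procs[j]));
        return 0;
    }

Even with f = 0.90 the speedup saturates near 10 as p grows, which is the slides' point: massive parallelism pays only as f → 1.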

  10. Amdahl's Law + Communication Overhead. Including communication time is simple and profound. Let latency \( T_c \) = the time to move data. Then

\( S_p \simeq \dfrac{T_1}{T_1/p + T_c} < p \)  (1)

For the communication time not to matter we need

\( T_1/p \gg T_c \;\Rightarrow\; p \ll T_1/T_c \)  (2)

As the number of processors \( p \) increases, \( T_1/p \to T_c \); beyond that point more processors make the run slower, and a faster CPU is irrelevant.
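A worked instance of Eqs. (1)-(2) with hypothetical numbers (\( T_1 = 100 \) s of one-CPU work and \( T_c = 0.01 \) s of latency are my illustration, not the slides'):

\[
S_p \simeq \frac{T_1}{T_1/p + T_c}
\;\xrightarrow{\;p \to \infty\;}\;
\frac{T_1}{T_c} = \frac{100\ \mathrm{s}}{0.01\ \mathrm{s}} = 10^4 ,
\]

so the speedup is capped at \( T_1/T_c \) no matter how many processors are added, and Eq. (2)'s condition \( p \ll T_1/T_c = 10^4 \) marks where extra processors stop paying.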

  11. How to Actually Parallelize. The user creates the tasks, and each task is assigned to a processor or thread. Main = the master/controller; the subtasks = parallel subroutines, the slaves. Avoid storage conflicts; handle communication and synchronization. And don't sacrifice science to speed. [Figure: a main task program whose main routine calls a serial subroutine a, parallel subroutines 1-3, and a summation task.]

  12. Practical Aspects of Message Passing; Don't Do It (unless you must). More processors = more challenge, so parallelize only the most numerically intensive codes; legacy codes are often Fortran90. Rewrite from scratch (N months) vs. modify the serial code (∼70%)? Expect a steep learning curve, failures, and hard debugging. Preconditions: the code runs often and for days, changes little, and you need higher resolution or more bodies. The problem itself affects the parallelism through its data use and structure: perfectly (embarrassingly) parallel, e.g. Monte Carlo (MC) repeats (see the sketch below); fully synchronous, e.g. data-∥ molecular dynamics (MD), tightly coupled; loosely synchronous, e.g. groundwater diffusion; pipeline parallel, e.g. data → images → animations.
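A minimal sketch of the embarrassingly parallel case (plain C; my illustration, not the slides' code): each Monte Carlo repeat below depends only on its own seed, so each iteration of the outer loop could run on its own processor, with only the final averages communicated.

    #include <stdio.h>
    #include <stdlib.h>

    /* One independent Monte Carlo estimate of pi: throw n random points
       into the unit square and count those inside the quarter circle. */
    double mc_pi(unsigned int seed, int n) {
        srand(seed);
        int hits = 0;
        for (int i = 0; i < n; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;
        }
        return 4.0 * hits / n;
    }

    int main(void) {
        const int repeats = 4, n = 1000000;
        double sum = 0.0;
        /* Each repeat is independent given its seed: on a cluster, repeat
           r would run on node r and only sum would be communicated. */
        for (int r = 0; r < repeats; r++)
            sum += mc_pi(1234u + r, n);
        printf("pi estimate = %f\n", sum / repeats);
        return 0;
    }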

  13. High-Level View of Message Passing: 4 Simple Communication Commands. The basics are simple: C or Fortran plus 4 communication commands. send: send a named message; receive: receive from any sender; myid: the ID of this processor; numnodes: the number of nodes. [Figure: a master creates slave 1 and slave 2; time runs downward through interleaved compute, send, and receive steps.]
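A minimal sketch mapping the slide's four generic commands onto MPI (the mapping to MPI_Send, MPI_Recv, MPI_Comm_rank, and MPI_Comm_size is my assumption; the slide names only send, receive, myid, and numnodes):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int myid, numnodes;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);     /* myid: ID of this processor */
        MPI_Comm_size(MPI_COMM_WORLD, &numnodes); /* numnodes: number of nodes  */

        if (myid == 0) {          /* master: collect one result from each slave */
            double result;
            for (int src = 1; src < numnodes; src++) {
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* receive: any sender */
                printf("master received %f\n", result);
            }
        } else {                  /* slave: compute, then send a tagged message */
            double result = 3.14 * myid;          /* stand-in for real work */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }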

  14. ∥ MP: What Can Go Wrong? Hardware communication is problematic: task cooperation and division, correct division of the data, many low-level details, and error messages distributed across nodes. Messages can arrive in the wrong order; race conditions make results order-dependent; deadlock leaves processes waiting forever. [Figure: the same master/slave timeline as on the previous slide.]
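As one concrete failure from the slide's list, here is a two-process deadlock sketch (my illustration, not the slide's code): both ranks block in a receive before either sends, so each waits forever.

    #include <stdio.h>
    #include <mpi.h>

    /* DEADLOCK sketch: run with exactly 2 processes. Both ranks enter a
       blocking MPI_Recv first, so neither reaches its MPI_Send and both
       wait forever. Breaking the symmetry (one rank sends first) or
       using MPI_Sendrecv avoids this. */
    int main(int argc, char *argv[]) {
        int myid, other, in, out;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        other = 1 - myid;   /* partner rank: 0 <-> 1 */
        out = myid;
        MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                 /* blocks: nothing sent yet */
        MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD); /* never reached */
        printf("rank %d received %d\n", myid, in);
        MPI_Finalize();
        return 0;
    }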

  15. Conclude: IBM Blue Gene = ∥ by Committee. Designed for performance per watt: peak = 360 teraflops (1 teraflop = 10^12 flops) from medium-speed, 2-core CPUs at 5.6 Gflops that run cool, with one core computing and one communicating. On- and off-chip memory; 512 chips per card, 16 cards per board; 65,536 (= 2^16) nodes. Control via MPI.
