Getting the Performance Out of High Performance Computing


Jack Dongarra, Innovative Computing Lab, University of Tennessee, and Computer Science and Math Division, Oak Ridge National Lab


  1. Getting the Performance Out of High Performance Computing
     Jack Dongarra, Innovative Computing Lab, University of Tennessee, and
     Computer Science and Math Division, Oak Ridge National Lab
     http://www.cs.utk.edu/~dongarra/

  2. Getting the Performance into High Performance Computing
     Jack Dongarra, Innovative Computing Lab, University of Tennessee, and
     Computer Science and Math Division, Oak Ridge National Lab
     http://www.cs.utk.edu/~dongarra/

  3. Technology Trends: Microprocessor Capacity
     Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor
     density of semiconductor chips would double roughly every 18 months, a
     prediction now called "Moore's Law": 2X transistors per chip every 1.5
     years. Microprocessors have become smaller, denser, and more powerful,
     and not just processors: storage, internet bandwidth, etc.
     [Chart: Moore's Law in peak performance, 1950 to 2010, from roughly
     1 KFlop/s (EDSAC 1, UNIVAC 1) through scalar machines (IBM 7090,
     CDC 6600, IBM 360/195, CDC 7600, ~1 MFlop/s), vector and super scalar
     machines (Cray 1, Cray X-MP, Cray 2, ~1 GFlop/s), and parallel machines
     (TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth
     Simulator, ~1 TFlop/s and beyond), trending toward 1 PFlop/s.]
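Written as a growth law (a minimal restatement of the slide's "2X transistors per chip every 1.5 years", with $N_0$ the transistor count at a reference year and $t$ in years):

$$N(t) = N_0 \cdot 2^{\,t/1.5}$$

Over 12 years this compounds to $2^{12/1.5} = 2^8 = 256$, the Moore's Law factor quoted two slides later.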

  4. Next Generation: IBM Blue Gene/L and ASCI Purple
     ♦ Announced 11/19/02, one of 2 machines for LLNL: 360 TFlop/s, 130,000
       processors, Linux, FY 2005
     ♦ Plus ASCI Purple: IBM Power 5 based, 12K processors, 100 TFlop/s

     To Be Provocative... Citation in the Press, March 10th, 2008:
     "DOE Supercomputers Sit Idle. WASHINGTON, Mar. 10, 2008. GAO reports
     that after almost 5 years of effort and several hundreds of M$'s spent
     at the DOE labs, the high performance computers recently purchased did
     not meet users' expectations and are sitting idle... Alan Laub, head of
     the DOE efforts, reports that the computer equ..."
     ♦ How could this happen?
       - Complexity of programming these machines was underestimated
       - Users were unprepared for the lack of reliability of the hardware
         and software
       - Little effort was spent to carry out medium and long term research
         activities to solve problems that were foreseen 5 years ago in the
         areas of applications, algorithms, middleware, programming models,
         and computer architectures, ...

  5. Software Technology & Performance
     ♦ Tendency to focus on the hardware
     ♦ Software is required to bridge an ever widening gap
     ♦ The gap between potential and delivered performance is very steep:
       - Performance only if the data and controls are set up just right
       - Otherwise, dramatic performance degradations; a very unstable
         situation
       - Will become more unstable as systems change and become more complex
     ♦ The challenge for applications, libraries, and tools is formidable at
       the Tflop/s level, even greater with Pflop/s; some might say
       insurmountable.

     Linpack (100x100) Analysis: The Machine on My Desk 12 Years Ago and Today
     ♦ Compaq 386/SX20 with FPA: 0.16 Mflop/s
     ♦ Pentium IV at 2.8 GHz: 1317 Mflop/s
     ♦ Over 12 years we see a factor of ~8231: doubling in less than 12
       months, for 12 years. Moore's Law gives us a factor of 256. How do we
       get a factor > 8000?
       - Clock speed increase = 128x
       - External bus width & caching (16 vs. 64 bits) = 4x
       - Floating point (multi-cycle 4/8-bit vs. 64 bits in 1 clock) = 8x
       - Compiler technology = 2x
     ♦ A complex set of interactions between application, algorithms,
       programming language, compiler, machine instructions, and hardware:
       many layers of translation from the application to the hardware,
       changing with each generation
     ♦ However, the potential for that Pentium 4 is 5.6 Gflop/s and here we
       are getting 1.32 Gflop/s, still a factor of 4.25 off of peak.
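As a sanity check, the slide's four contributing factors multiply out to almost exactly the measured ratio; a minimal sketch of the arithmetic in C (all values are the slide's own):

```c
#include <stdio.h>

int main(void) {
    /* Per-slide contributions to the 386/SX20 -> Pentium 4 Linpack speedup */
    double clock_speed   = 128.0;  /* clock rate increase */
    double bus_and_cache =   4.0;  /* external bus width & caching, 16 vs. 64 bits */
    double floating_pt   =   8.0;  /* multi-cycle 4/8-bit FP vs. 64 bits in 1 clock */
    double compiler      =   2.0;  /* compiler technology */

    double predicted = clock_speed * bus_and_cache * floating_pt * compiler;
    double measured  = 1317.0 / 0.16;  /* Mflop/s today / Mflop/s 12 years ago */

    printf("predicted factor:  %.0f\n", predicted);  /* 8192 */
    printf("measured factor:   %.0f\n", measured);   /* ~8231 */
    printf("Moore's Law alone: %.0f\n", 256.0);      /* 2^(12/1.5) = 2^8 */
    return 0;
}
```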

  6. Where Does Much of Our Lost Performance Go? or, Why Should I Care About
     the Memory Hierarchy?
     [Chart: Processor-DRAM memory gap (latency), 1980 to 2004. CPU
     performance grows ~60%/yr (2X/1.5yr, "Moore's Law"); DRAM performance
     grows ~9%/yr (2X/10 yrs); the processor-memory performance gap grows
     ~50% per year.]

     Optimizing Computation and Memory Use
     ♦ Computational optimizations
       - Theoretical peak: (# fpus) * (flops/cycle) * clock rate
       - Pentium 4: (1 fpu) * (2 flops/cycle) * (2.8 GHz) = 5600 MFlop/s
     ♦ Operations like y = αx + y: 3 operands (24 bytes) needed for 2 flops;
       5600 Mflop/s requires 8400 MWord/s bandwidth from memory
     ♦ Memory optimization
       - Theoretical peak: (bus width) * (bus speed)
       - Pentium 4: (32 bits) * (533 MHz) = 2132 MB/s = 266 MWord/s
     ♦ Off by a factor of 30 from what's required to drive the processor to
       peak performance from memory.
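The factor-of-30 shortfall follows directly from the two peaks above; a small C sketch of the arithmetic, using only the slide's Pentium 4 numbers:

```c
#include <stdio.h>

int main(void) {
    /* Compute peak: (# FPUs) * (flops/cycle) * clock rate */
    double peak_mflops = 1 * 2 * 2800.0;            /* 5600 Mflop/s */

    /* AXPY (y = alpha*x + y): 3 operands moved per 2 flops */
    double words_needed = peak_mflops / 2.0 * 3.0;  /* 8400 MWord/s */

    /* Memory peak: (bus width) * (bus speed) */
    double mem_mbytes = (32.0 / 8.0) * 533.0;       /* 2132 MB/s */
    double mem_mwords = mem_mbytes / 8.0;           /* ~266 MWord/s, 8-byte words */

    printf("bandwidth needed:    %.0f MWord/s\n", words_needed);
    printf("bandwidth available: %.0f MWord/s\n", mem_mwords);
    printf("shortfall factor:    %.1f\n", words_needed / mem_mwords);
    /* ~31.5; the slide rounds to 30 */
    return 0;
}
```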

  7. Memory Hierarchy
     ♦ By taking advantage of the principle of locality:
       - Present the user with as much memory as is available in the cheapest
         technology
       - Provide access at the speed offered by the fastest technology
     [Diagram: hierarchy from processor registers (~1 ns, 100s of bytes)
     through on-chip cache (SRAM, ~10s of ns, KBs), level 2 and 3 cache, main
     memory (DRAM, ~100s of ns, MBs), remote cluster and distributed memory
     (~0.1 ms to 10s of ms), secondary storage (disk, ~10s of ms, GBs), and
     tertiary storage (disk/tape, ~10s of sec, TBs).]

     Tool To Help Understand What's Going On In the Processor
     ♦ Complex system with many filters
     ♦ Need to identify bottlenecks
     ♦ Prioritize optimization
     ♦ Focus on important aspects
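The cost of ignoring locality is easy to demonstrate; a minimal, self-contained sketch (the array size and loop structure are illustrative choices, not from the slides) that sums the same matrix in cache-friendly row order and then in cache-hostile column order; on most machines the second pass is several times slower purely because of the memory hierarchy:

```c
#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];   /* C stores this row-major */

int main(void) {
    double sum = 0.0;
    clock_t t;

    /* Row order: consecutive j values hit the same cache line */
    t = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row order:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Column order: each access strides N*8 bytes, missing cache constantly */
    t = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    printf("checksum: %g\n", sum);  /* keeps the loops from being optimized away */
    return 0;
}
```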

  8. Tools for Performance Analysis, Modeling and Optimization
     ♦ PAPI
     ♦ DynInst
     ♦ SvPablo
     ♦ ROSE: compiler framework; recognition of high-level abstractions;
       specification of transformations

     Example: PERC on a Climate Model
     ♦ Interaction with the SciDAC Climate development effort
     ♦ Profiling: identifying performance bottlenecks and prioritizing
       enhancements
     ♦ Evaluation of the code over a 9 month period
     ♦ Produced a 400% improvement via decreased overhead and increased
       scalability
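Of the tools listed, PAPI gives portable access to hardware performance counters; a minimal sketch of its standard C API counting flops and level-1 data cache misses around a kernel (the preset events shown are real PAPI events, but availability varies by processor, and the kernel itself is a placeholder):

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int eventset = PAPI_NULL;
    long long counts[2];

    /* Initialize the library and build an event set */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(EXIT_FAILURE);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_FP_OPS);  /* floating-point operations */
    PAPI_add_event(eventset, PAPI_L1_DCM);  /* level-1 data cache misses */

    PAPI_start(eventset);
    /* ... kernel under study goes here ... */
    PAPI_stop(eventset, counts);

    printf("FP ops:            %lld\n", counts[0]);
    printf("L1 D-cache misses: %lld\n", counts[1]);
    return 0;
}
```

Compile and link against the library (typically `-lpapi`); the ratio of the two counts is exactly the kind of bottleneck indicator the slide is after.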

  9. Signatures: Key Factors in Applications and System that Affect Performance
     ♦ Application signatures
       - Characterization of the operations that need to be performed by the
         application
       - Description of application demands on resources
       - Algorithm signatures: op counts, memory reference patterns, data
         dependencies, I/O characteristics
       - Software signatures: sync points, thread level parallelism,
         instruction level parallelism, ratio of memory references to
         floating point ops
     ♦ Hardware signatures
       - Performance capabilities of the machine
       - Latencies and bandwidth of the memory hierarchy, local to the node
         and to remote nodes
       - Instruction issue rates
       - Cache size
       - TLB size
     ♦ Predict application behavior and performance
       - Execution signature
       - Combine application and machine signatures to provide accurate
         performance models
     [Diagram: a parallel or distributed application feeds a performance
     monitor; live performance data is compared against a model signature
     (degree of similarity), and observation and generation of signatures
     feed back into the model.]

     Algorithms vs. Applications
     [Table: ten algorithm classes mapped against nine applications (lattice
     gauge (QCD), quantum chemistry, weather simulation, computational fluid
     dynamics, adjustment of geodetic networks, inverse problems, structural
     mechanics, electronic device simulation, circuit simulation). Sparse
     linear system solvers appear in 6 applications; nonlinear algebraic
     system solvers in 4; sparse eigenvalue problems, FFT, and rapid elliptic
     problem solvers in 3 each; linear least squares, multigrid schemes,
     stiff ODE solvers, and Monte Carlo schemes in 2 each; integral
     transformations in 1.]
     From: Supercomputing Tradeoffs and the Cedar System, E. Davidson,
     D. Kuck, D. Lawrie, and A. Sameh, in High-Speed Computing: Scientific
     Applications and Algorithm Design, ed. R. Wilhelmson, University of
     Illinois Press, 1986.
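The "ratio of memory references to floating point ops" signature above can be computed analytically for simple kernels; a small sketch in C (the AXPY figure matches slide 6; the dot-product and blocked-GEMM figures are standard approximations, not numbers from the slides):

```c
#include <stdio.h>

int main(void) {
    /* Words moved vs. flops executed per inner-loop element */
    struct { const char *kernel; double words, flops; } sig[] = {
        { "AXPY (y = a*x + y)",     3.0, 2.0 },        /* read x, y; write y */
        { "dot  (s += x[i]*y[i])",  2.0, 2.0 },        /* read x, y */
        { "GEMM (blocked, b = 64)", 1.0 / 64.0, 1.0 }  /* ~1/b words per flop */
    };

    for (int i = 0; i < 3; i++)
        printf("%-26s mem refs per flop = %.4f\n",
               sig[i].kernel, sig[i].words / sig[i].flops);
    return 0;
}
```

The spread (1.5 down to ~0.016) is why a machine signature alone cannot predict performance: the same hardware delivers wildly different fractions of peak depending on the application signature it is paired with.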
