Getting the Performance Out Of High Performance Computing

Jack Dongarra, Innovative Computing Lab, University of Tennessee, and Computer Science and Math Division, Oak Ridge National Lab

http://www.cs.utk.edu/~dongarra/


Getting the Performance into High Performance Computing


Technology Trends: Microprocessor Capacity

2X transistors/Chip Every 1.5 years

Called “Moore’s Law.” Microprocessors have become smaller, denser, and more powerful. Not just processors: storage, internet bandwidth, etc.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

[Chart: peak performance of the fastest computers, 1950–2010, on a log scale from 1 KFlop/s to 1 PFlop/s, tracking “Moore’s Law”: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth Simulator. Eras: scalar, super scalar, vector, parallel, super scalar/vector/parallel.]


Next Generation: IBM Blue Gene/L and ASCI Purple

♦ Blue Gene/L announced 11/19/02
  - One of two machines for LLNL: 360 TFlop/s, 130,000 processors, Linux, FY 2005
♦ Plus ASCI Purple: IBM Power 5 based, 12K processors, 100 TFlop/s


To Be Provocative… Citation in the Press, March 10th, 2008

DOE Supercomputers Sit Idle

WASHINGTON, Mar. 10, 2008. GAO reports that after almost 5 years of effort and several hundreds of M$’s spent at the DOE labs, the high performance computers recently purchased did not meet users’ expectations and are sitting idle… Alan Laub, head of the DOE efforts, reports that the computer equ…

How could this happen?

- The complexity of programming these machines was underestimated.
- Users were unprepared for the lack of reliability of the hardware and software.
- Little effort was spent on medium- and long-term research to solve problems that were foreseen 5 years ago in the areas of applications, algorithms, middleware, programming models, and computer architectures, …


Software Technology & Performance

♦ Tendency to focus on the hardware
♦ Software is required to bridge an ever-widening gap
♦ The gap between potential and delivered performance is very steep
  - Performance only if the data and controls are set up just right
  - Otherwise, dramatic performance degradations; a very unstable situation
  - Will become more unstable as systems change and become more complex
♦ The challenge for applications, libraries, and tools is formidable at the Tflop/s level, even greater with Pflop/s; some might say insurmountable.


Linpack (100x100) Analysis: The Machine on My Desk 12 Years Ago and Today

♦ Compaq 386/SX20 with FPA: 0.16 Mflop/s
♦ Pentium IV, 2.8 GHz: 1317 Mflop/s
♦ Over 12 years, a factor of ~8231
  - Doubling in less than 12 months, for 12 years
♦ Moore’s Law gives us a factor of 256.
♦ How do we get a factor > 8000? (the arithmetic is worked out below)
  - Clock speed increase = 128x
  - External bus width & caching (16 vs. 64 bits) = 4x
  - Floating point (4/8-bit multiply vs. 64 bits in 1 clock) = 8x
  - Compiler technology = 2x
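As a check on that breakdown (a worked sketch; the per-component multipliers are the estimates in the list above):

```latex
\frac{1317\ \mathrm{Mflop/s}}{0.16\ \mathrm{Mflop/s}} \approx 8231,
\qquad
\underbrace{128}_{\text{clock}} \times \underbrace{4}_{\text{bus/cache}}
\times \underbrace{8}_{\text{floating point}} \times \underbrace{2}_{\text{compiler}}
= 8192 \approx 8231.
```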

♦ However, the potential for that Pentium 4 is 5.6 Gflop/s, and here we are getting 1.32 Gflop/s
  - Still a factor of 4.25 off of peak
♦ There is a complex set of interactions between the application, algorithms, programming language, compiler, machine instructions, and hardware
  - Many layers of translation from the application to the hardware
  - Changing with each generation


Where Does Much of Our Lost Performance Go? or, Why Should I Care About the Memory Hierarchy?

[Chart: Processor-DRAM memory gap (latency), 1980–2004. Processor performance (“Moore’s Law”) improves ~60%/yr (2X/1.5 yr); DRAM improves ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50%/year.]


Optimizing Computation and Memory Use

♦ Computational optimizations
  - Theoretical peak: (# fpus) × (flops/cycle) × (clock rate)
  - Pentium 4: (1 fpu) × (2 flops/cycle) × (2.8 GHz) = 5600 Mflop/s
♦ Operations like y = αx + y: 3 operands (24 bytes) needed for 2 flops, so 5600 Mflop/s requires 8400 MWord/s of bandwidth from memory
♦ Memory optimization
  - Theoretical peak: (bus width) × (bus speed)
  - Pentium 4: (32 bits) × (533 MHz) = 2132 MB/s = 266 MWord/s
♦ Off by a factor of 30 from what’s required to drive the processor from memory at peak performance
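The factor of 30 follows directly from the numbers above (a worked check, taking 1 word = 8 bytes):

```latex
\underbrace{5600\ \mathrm{Mflop/s} \times \tfrac{3\ \text{words}}{2\ \text{flops}}}_{\text{bandwidth AXPY needs at peak}} = 8400\ \mathrm{MWord/s},
\qquad
\frac{8400\ \mathrm{MWord/s}}{266\ \mathrm{MWord/s}} \approx 31.6 \approx 30.
```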


Memory Hierarchy

♦ By taking advantage of the principle of locality:

  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.

[Diagram: the memory hierarchy, from fastest/smallest to slowest/largest, with the processor’s control and datapath at the top:

  Level                                Speed (ns)                    Size (bytes)
  Processor registers                  ~1s                           -
  On-chip cache                        ~10s                          Ks
  Level 2 and 3 cache (SRAM)           ~100s                         Ms
  Main memory (DRAM)                   ~100s                         Gs
  Distributed / remote cluster memory  ~100,000s (0.1s ms)           -
  Secondary storage (disk)             ~10,000,000s (10s ms)         Ts
  Tertiary storage (disk/tape)         ~10,000,000,000s (10s sec)    -  ]
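A minimal sketch of how code exploits that principle: blocking a matrix multiply so each tile is loaded from DRAM once and reused many times from cache. The matrix size and block size here are illustrative assumptions, not figures from the slide.

```c
#include <stddef.h>

#define N 1024   /* matrix dimension (assumed, divisible by B) */
#define B 64     /* block size; chosen so three B x B tiles fit in cache */

/* Blocked C += A*B: each tile of a and b is fetched from main memory
   once per block pass and then reused B times from cache, instead of
   being re-fetched from DRAM on every use as in the naive triple loop. */
void matmul_blocked(const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t jj = 0; jj < N; jj += B)
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t k = kk; k < kk + B; k++) {
                        double aik = a[i * N + k];
                        for (size_t j = jj; j < jj + B; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}
```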


Tools To Help Understand What’s Going On In the Processor

♦ Complex system with many filters
♦ Need to identify bottlenecks
♦ Prioritize optimization
♦ Focus on important aspects


Tools for Performance Analysis, Modeling and Optimization

♦ PAPI
♦ Dyninst
♦ SVPablo

♦ ROSE: compiler framework; recognition of high-level abstractions, specification of transformations
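As an example of what these tools expose, a minimal PAPI sketch (using the classic high-level counter calls available in the PAPI releases of this era; the kernel is a placeholder):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[2] = { PAPI_FP_OPS, PAPI_L1_DCM };  /* flops, L1 data misses */
    long long counts[2];
    double x = 0.0;

    if (PAPI_start_counters(events, 2) != PAPI_OK)
        return 1;

    for (int i = 1; i <= 1000000; i++)   /* stand-in for the kernel under study */
        x += 1.0 / i;

    if (PAPI_stop_counters(counts, 2) != PAPI_OK)
        return 1;

    printf("FP ops: %lld, L1 data misses: %lld (x = %f)\n",
           counts[0], counts[1], x);
    return 0;
}
```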


Example: PERC on the Climate Model

♦ Interaction with the SciDAC climate development effort
♦ Profiling: identifying performance bottlenecks and prioritizing enhancements
♦ Evaluation of the code over a 9-month period
♦ Produced a 400% improvement via decreased overhead and increased scalability


Signatures: Key Factors in Applications and Systems that Affect Performance

♦ Application signatures
  - Characterization of the operations that need to be performed by the application
  - Description of the application’s demands on resources
  - Algorithm signatures: op counts, memory reference patterns, data dependencies, I/O characteristics
  - Software signatures: sync points, thread-level parallelism, instruction-level parallelism, ratio of memory references to floating-point ops
  - Used to predict application behavior and performance
♦ Hardware signatures
  - Performance capabilities of the machine
  - Latencies and bandwidths of the memory hierarchy, local to a node and to remote nodes
  - Instruction issue rates
  - Cache size
  - TLB size
♦ Execution signatures combine application and machine signatures to provide accurate performance models.

[Diagram: a parallel or distributed application feeds live performance data to a performance monitor; an observation signature is generated and compared against the model signature derived from the application signature, with feedback on the degree of similarity.]
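One simple way such a combination can work (an illustrative bound, not the project’s actual model): take the flop count F and memory traffic M from the application signature, and the peak rate R and sustainable bandwidth β from the hardware signature; the run time is then limited by whichever resource saturates first:

```latex
T_{\mathrm{exec}} \;\gtrsim\; \max\left( \frac{F}{R},\; \frac{M}{\beta} \right)
```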


Algorithms vs. Applications

[Table: a matrix of algorithms against applications, with X marks showing which algorithm classes each application depends on. Algorithms: sparse linear system solvers, linear least squares, nonlinear algebraic system solvers, sparse eigenvalue problems, FFT, rapid elliptic problem solvers, multigrid schemes, stiff ODE solvers, integral transformations, Monte Carlo schemes. Applications: circuit simulation, electronic device simulation, structural mechanics, inverse problems (adjustment of geodetic networks), computational fluid dynamics, weather simulation, quantum chemistry, lattice gauge (QCD).]

From: Supercomputing Tradeoffs and the Cedar System, E. Davidson, D. Kuck, D. Lawrie, and A. Sameh, in High-Speed Computing, Scientific Applications and Algorithm Design, ed. R. Wilhelmson, U. of Illinois Press, 1986.


Update to Sameh’s Table?

♦ Next step: look at
  - Application signatures
  - Algorithm choices
  - Software profile
  - Architecture (machine)
♦ Data mine to extract information
♦ Need signatures for A3S
♦ Application Performance Matrix: http://www.krellinst.org/matrix/


Performance Tuning

♦ Motivation: the performance of many applications is dominated by a few kernels
♦ Conventional approach: hand-tuning by the user or vendor
  - Very time-consuming and tedious work
  - Even with intimate knowledge of the architecture and compiler, performance is hard to predict
  - Growing list of kernels to tune
  - Must be redone for every architecture and compiler
  - Compiler technology often lags the architecture
  - Not just a compiler problem: the best algorithm may depend on the input, so some tuning must happen at run time, and not all algorithms are semantically or mathematically equivalent


Automatic Performance Tuning to Hide Complexity

♦ Approach: for each kernel
  1. Identify and generate a space of algorithms
  2. Search for the fastest one, by running them (see the sketch after this list)
♦ What is a space of algorithms? Depending on the kernel and input, variants may differ in:
  - instruction mix and order
  - memory access patterns
  - data structures
  - mathematical formulation
♦ When do we search?
  - Once per kernel and architecture
  - At compile time
  - At run time
  - All of the above
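A minimal sketch of the search step, where the “space of algorithms” is reduced to a single tunable block size and the kernel is a stand-in (both are illustrative assumptions, not ATLAS’s actual generator):

```c
#include <stdio.h>
#include <time.h>

#define N 512
static double a[N][N], b[N][N], c[N][N];

/* One point in the algorithm space: the same computation, parameterized
   by block size bs (N is divisible by every candidate below). */
static void kernel(int bs)
{
    for (int ii = 0; ii < N; ii += bs)
        for (int jj = 0; jj < N; jj += bs)
            for (int i = ii; i < ii + bs; i++)
                for (int j = jj; j < jj + bs; j++)
                    c[i][j] += a[i][j] * b[i][j];
}

int main(void)
{
    int candidates[] = { 16, 32, 64, 128, 256 };
    int best = candidates[0];
    double best_time = 1e30;

    /* Empirical search: run every variant, keep the fastest. */
    for (int i = 0; i < 5; i++) {
        clock_t t0 = clock();
        kernel(candidates[i]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_time) { best_time = t; best = candidates[i]; }
    }
    printf("fastest variant: block size %d (%.4f s)\n", best, best_time);
    return 0;
}
```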


Some Automatic Tuning Projects

♦ ATLAS (www.netlib.org/atlas) (Dongarra, Whaley); used in Matlab and many SciDAC and ASCI projects
♦ PHiPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes, Asanovic, Vuduc, Demmel)
♦ Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im)
♦ Self-Adapting Linear Algebra Software (SALAS) (Dongarra, Eijkhout, Gropp, Keyes)
♦ FFTs and signal processing
  - FFTW (www.fftw.org); won the 1999 Wilkinson Prize for Numerical Software
  - SPIRAL (www.ece.cmu.edu/~spiral); extensions to other transforms, DSPs
  - UHFFT; extensions to higher dimensions, parallelism

[Bar chart: Mflop/s (0 to 3500) achieved by vendor BLAS, ATLAS BLAS, and reference F77 BLAS across architectures including AMD Athlon, DEC ev56 and ev6, HP 9000/735/135, IBM PPC 604, Power2, and Power3, Intel PIII 933 MHz, Intel P4 2.53 GHz w/SSE2, SGI R10000 and R12000, and Sun UltraSparc 2.]


Futures for High Performance Scientific Computing

♦ Numerical software will be adaptive, exploratory, and intelligent
♦ Determinism in numerical computing will be gone
  - After all, it’s not reasonable to ask for exactness in numerical computations
  - Reproducibility, but at a cost
♦ The importance of floating point arithmetic will be undiminished
  - 16, 32, 64, 128 bits and beyond
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is key so that applications can effectively use the resources


Citation in the Press, March 10th, 2008

DOE Supercomputers Live Up to Expectations

WASHINGTON, Mar. 10, 2008. GAO reported today that after almost 5 years of effort and several hundreds of M$’s spent at DOE labs, the high performance computers recently purchased have exceeded users’ expectations and are helping to solve some of our most challenging problems. Alan Laub, head of DOE’s HPC efforts, reported this today at the annual meeting of the SciDAC PIs.

How can this happen?

♦ Close interactions of the applications with the CS and Math ISIC groups
♦ Dramatic improvements in the adaptability of software to the execution environment
♦ Improved processor-memory bandwidth
♦ New large-scale system architectures and software
  - Aggressive fault management and reliability
♦ Exploration of some alternative architectures and languages
  - Application teams helping to drive the design of new architectures


With Apologies to Gary Larson…

♦ SciDAC is helping
♦ Teams are developing the scientific computing software and hardware infrastructure needed to use terascale computers and beyond.