
Self-Adapting Numerical Software (SANS Effort)

Jack Dongarra, Innovative Computing Laboratory, University of Tennessee, and Computer Science and Mathematics Division, Oak Ridge National Laboratory


  1. Self-Adapting Numerical Software (SANS Effort). Jack Dongarra, Innovative Computing Laboratory, University of Tennessee, and Computer Science and Mathematics Division, Oak Ridge National Laboratory.

     Moore's Law: Super Scalar / Vector / Parallel. (Chart: peak performance in floating-point operations per second, Flop/s, 1950-2010, with axis marks from 1 KFlop/s to 1 PFlop/s, moving from scalar machines — EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, CDC 7600, IBM 360/195 — through super scalar (Cray 1, Cray X-MP), vector (Cray 2, TMC CM-2), and parallel (TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific) systems to the Earth Simulator.)

     - 1941: 1 Flop/s
     - 1945: 100 Flop/s
     - 1949: 1,000 Flop/s (1 KiloFlop/s, KFlop/s)
     - 1951: 10,000 Flop/s
     - 1961: 100,000 Flop/s
     - 1964: 1,000,000 Flop/s (1 MegaFlop/s, MFlop/s)
     - 1968: 10,000,000 Flop/s
     - 1975: 100,000,000 Flop/s
     - 1987: 1,000,000,000 Flop/s (1 GigaFlop/s, GFlop/s)
     - 1992: 10,000,000,000 Flop/s
     - 1993: 100,000,000,000 Flop/s
     - 1997: 1,000,000,000,000 Flop/s (1 TeraFlop/s, TFlop/s)
     - 2000: 10,000,000,000,000 Flop/s
     - 2003: 35,000,000,000,000 Flop/s (35 TFlop/s)

  2. Linpack (100x100) Analysis
     ♦ Compaq 386/SX20 with FPA: 0.16 Mflop/s
     ♦ Pentium IV, 2.8 GHz: 1.32 Gflop/s
     ♦ Over those 12 years we see a factor of ~8,231.
     ♦ Moore's Law says something about a factor of 2 every 18 months, or a factor of 256 over 12 years.
     ♦ Where is the missing factor of 32?
       - Clock speed increase = 128x
       - External bus width and caching: 16 vs. 64 bits = 4x
       - Floating-point hardware: 4/8-bit multiply vs. 64 bits in one clock = 8x
       - Compiler technology = 2x
     ♦ There is a complex set of interactions between the users' applications, algorithm, programming language, compiler, machine instructions, and hardware: many layers of translation from the application to the hardware, changing with each generation.
     ♦ However, the theoretical peak for that Pentium 4 is 5.6 Gflop/s, and here we are getting 1.32 Gflop/s: still a factor of 4.25 off of peak.

     Where Does the Performance Go? or, Why Should I Care About the Memory Hierarchy?

     (Chart: the processor-DRAM memory gap in latency, 1980-2004. Processor performance improves ~60%/year — 2x every 1.5 years, "Moore's Law" — while DRAM improves ~9%/year, 2x every 10 years; the processor-memory performance gap grows ~50% per year.)
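The speedup accounting in the bullets above can be checked directly. A quick sketch (the 128x/4x/8x/2x factors are the slide's own breakdown):

```python
# Sanity-check the speedup accounting from the slide above.
observed = 1.32e9 / 0.16e6        # Pentium IV Linpack rate vs. 386/SX20
moore = 2 ** (12 / 1.5)           # doubling every 18 months for 12 years
breakdown = 128 * 4 * 8 * 2       # clock * bus/cache * FP hardware * compiler

print(f"observed speedup:  ~{observed:.0f}x")   # ~8250x, the slide's ~8231x
print(f"Moore's Law alone:  {moore:.0f}x")      # 256x
print(f"factor breakdown:   {breakdown}x")      # 8192x, close to observed
```

Note that the "missing factor of 32" is exactly the breakdown divided by the Moore's Law prediction: 8192 / 256 = 32.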

  3. The Memory Hierarchy
     ♦ By taking advantage of the principle of locality, we can:
       - Present the user with as much memory as is available in the cheapest technology.
       - Provide access at the speed offered by the fastest technology.
     ♦ The hierarchy (latency, bandwidth, who manages it):
       - Registers: 1 cycle, 3-10 words/cycle, compiler managed (on the CPU chip)
       - Level 1 cache: 1-3 cycles, 1-2 words/cycle, hardware managed (on the CPU chip)
       - Level 2 cache: 5-10 cycles, 1 word/cycle, hardware managed (separate chips)
       - DRAM: 30-100 cycles, 0.5 words/cycle, OS managed
       - Mechanical disk and tape: 10^6-10^7 cycles, 0.01 words/cycle, OS managed

     Motivation for the Self-Adapting Numerical Software (SANS) Effort
     ♦ Optimizing software to exploit the features of a given system has historically been an exercise in hand customization.
       - Time consuming and tedious
       - Hard to predict performance from source code
       - Must be redone for every architecture and compiler
       - Software technology often lags architecture
       - The best algorithm may depend on the input, so some tuning may be needed at run time.
     ♦ There is a need for quick/dynamic deployment of optimized routines.
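To make the locality principle concrete, here is a small illustrative sketch (a hypothetical example, not from the slides): both functions sum the same N x N array, but the traversal order differs. In a compiled language the row-order walk is typically several times faster because consecutive accesses share cache lines; pure Python mutes the effect, but the access-pattern idea is the same.

```python
def row_order_sum(m, n):
    # Inner loop walks one row: consecutive elements, good cache-line reuse.
    total = 0.0
    for i in range(n):
        row = m[i]
        for j in range(n):
            total += row[j]
    return total

def column_order_sum(m, n):
    # Inner loop jumps between rows: strided access, poor spatial locality.
    total = 0.0
    for j in range(n):
        for i in range(n):
            total += m[i][j]
    return total
```

Both return the same value; only the interaction with the memory hierarchy differs.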

  4. What is Self-Adapting Performance Tuning of Software?
     ♦ Two steps:
       1. Identify and generate a space of algorithm/software versions that vary based on architectural features:
          - instruction mixes and orders
          - memory access patterns
          - data structures
          - mathematical formulations
       2. Generate the different versions and search for the fastest one by running them.
     ♦ When do we search?
       - Once per kernel and architecture
       - At compile time
       - At run time
       - All of the above
     ♦ Many examples: PHiPAC, ATLAS, Sparsity, FFTW, Spiral, ...

     Software Generation Strategy - ATLAS BLAS
     ♦ Parameter study of the hardware.
     ♦ Generate multiple versions of the code with different values of key performance parameters.
     ♦ Run and measure the performance of the various versions.
     ♦ Pick the best, and generate a library where the critical code is machine generated using parameter optimization.
     ♦ The Level 1 cache multiply optimizes for:
       - TLB access
       - L1 cache reuse
       - FP unit usage
       - memory fetch
       - register reuse
       - loop overhead minimization
     ♦ Takes ~20 minutes to run; generates the Level 1, 2, and 3 BLAS.
     ♦ A "new" model of high-performance programming: designed for modern architectures, needing only a reasonable C compiler.
     ♦ Similar in spirit to FFTW and to Johnsson's work at UH.
     ♦ Today ATLAS is used within various ASCI and SciDAC activities and by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...

     (Chart: performance in MFLOP/s, 0-3,500, of vendor BLAS, ATLAS BLAS, and F77 BLAS across a range of architectures.)

     See http://icl.cs.utk.edu/atlas/. Joint with Clint Whaley and Antoine Petitet.
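A toy version of the generate-and-measure loop described above might look like the following. The blocked kernel and the candidate block sizes are illustrative stand-ins for ATLAS's far richer code generator:

```python
import time

def blocked_matmul(A, B, n, nb):
    """n-by-n matrix multiply (lists of lists) with square blocking factor nb."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik, Brow, Crow = A[i][k], B[k], C[i]
                        for j in range(jj, min(jj + nb, n)):
                            Crow[j] += aik * Brow[j]
    return C

def autotune(n, candidates):
    """Empirical search: run each version, keep the fastest block size."""
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    best_nb, best_time = None, float("inf")
    for nb in candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, n, nb)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_nb, best_time = nb, elapsed
    return best_nb
```

A real tuner such as ATLAS or PHiPAC also varies unrolling, instruction scheduling, and data copying, and it caches the winning parameters at install time rather than re-searching on every call.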

  5. ATLAS 3.6 (new release)

     (Charts: ATLAS 3.6 performance in MFlop/s vs. matrix order. On a 1.6 GHz AMD Opteron: DGEMM, DGETRF, and DPOTRF, on an axis up to ~3,000 MFlop/s. On a 900 MHz Intel Itanium-2: DGEMM and DGETRF for orders 100-1000, on an axis up to ~3,500 MFlop/s.)

     http://www.netlib.org/atlas/

     Self-Adapting Numerical Software - SANS Effort
     ♦ Provide software technology to aid in high performance on commodity processors, clusters, and grids.
     ♦ Pre-run-time (library building stage) and run-time optimization.
     ♦ Integrated performance modeling and analysis.
     ♦ Automatic algorithm selection: polyalgorithmic functions.
     ♦ Automated installation process.
     ♦ Can be expanded to areas such as communication software and the selection of numerical algorithms.

     (Diagram: different algorithms and segment sizes feed a TUNING SYSTEM, which selects the best algorithm and segment size.)
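"Polyalgorithmic functions" means one user-visible call backed by several algorithms, with the choice made automatically. A minimal sketch with hypothetical thresholds (the real selection logic is far more involved):

```python
def pick_method(n, nnz, spd=False):
    """Route one logical 'solve' call to an algorithm from problem metadata.
    The 500 and 0.5 cutoffs are illustrative assumptions, not SANS's rules."""
    density = nnz / float(n * n)
    if n < 500 or density > 0.5:
        return "dense LU"           # small or dense: direct dense solve
    if spd:
        return "CG + IC(0)"         # sparse symmetric positive definite
    return "GMRES + ILU"            # general sparse
```

The caller only ever sees one function; the dispatch and its thresholds are what a self-adapting library would tune per platform and per problem class.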

  6. Self-Adapting for Message Passing
     ♦ Communication libraries
       - Optimize for the specifics of one's configuration.
       - A specific MPI collective communication algorithm implementation may not give the best results on all platforms.
       - Choose the collective communication parameters that give the best results for the system when the system is assembled.
       - (Diagram: broadcast topologies rooted at a Root process: sequential, binary tree, binomial tree, ring.)
     ♦ Algorithm layout and implementation
       - Look at the different ways to express the implementation.
       - (Diagram: message sizes and message blockings feed the TUNING SYSTEM, which selects the best algorithm.)

     Self-Adaptive Software
     ♦ Software can adapt its workings to the environment in (at least) three ways:
       - Kernels optimized for the platform (ATLAS, Sparsity): static determination.
       - Scheduling that takes network conditions into account (LFC): dynamic, but data-independent.
       - Algorithm choice (SALSA): dynamic, and strongly dependent on the user's data.
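The tuning idea for collectives can be sketched as a rule table: pick a broadcast topology from the process count and the message size. The crossover points below are purely illustrative; the point of the slide is that a tuning system measures them per platform when the system is assembled, rather than hard-coding them:

```python
def choose_bcast(nprocs, msg_bytes):
    """Select a broadcast algorithm; all thresholds are placeholder values."""
    if nprocs <= 2:
        return "sequential"
    if msg_bytes < 2048:
        return "binomial tree"   # latency-bound: finish in ~log2(p) rounds
    if msg_bytes < 1 << 20:
        return "binary tree"
    return "ring"                # bandwidth-bound: pipeline large messages
```

An MPI library built this way would consult such a (measured) table inside `MPI_Bcast` itself, invisibly to the application.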

  7. Data Layout Critical for Performance
     ♦ Layout parameters: number of processors, aspect ratio of the process grid, and block sizes (m, mb, n, nb).
     ♦ Needs an "expert" to do the tuning.

     LFC Performance Results

     (Chart: time to solution, in seconds, of Ax = b with n = 60,000, using up to 64 of the AMD 1.4 GHz processors at the Ohio Supercomputer Center, plotted for 32-64 processors. The naive layouts show an increasing margin of potential user error compared with LFC.)
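One of the layout parameters above, the aspect ratio of the process grid, is easy to enumerate mechanically. A small sketch of what tuning middleware might search over (the preference for near-square grids is a common heuristic, not a rule stated on the slide):

```python
def process_grids(p):
    """All pr x pc factorizations of p processors, squarest first."""
    grids = [(pr, p // pr) for pr in range(1, p + 1) if p % pr == 0]
    return sorted(grids, key=lambda g: abs(g[0] - g[1]))
```

For example, `process_grids(64)` puts `(8, 8)` first; a tuner would still time a few of the leading candidates, since block sizes and the network can favor non-square grids.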

  8. LFC: LAPACK For Clusters
     ♦ Want to relieve the user of some of the tasks via cluster middleware.
     ♦ Make decisions on the number of processors to use based on the user's problem and the state of the system.
       - Optimize for the best time to solution.
       - Distribute the data to the processors and collect the results.
       - Start the SPMD library routine on all the platforms.
     ♦ (Diagram: the user's problem passes through the library and middleware to the resources — users, hardware, and software, e.g. a fully connected 100 Mbit switch, Myrinet switch, or Gbit switch.)

     Joint with Piotr Łuszczek and Kenny Roche. http://icl.cs.utk.edu/lfc/

     SALSA: Self-Adaptive Linear Solver Architecture

     Run-time adaptation to user data for linear system solving:
     ♦ Choice between direct and iterative solvers
       - space and run-time considerations
       - numerical properties of the system
     ♦ Choice of preconditioner, scaling, ordering, and decomposition.
     ♦ User steering of the decision process.
     ♦ Insertion of performance data into a database.
     ♦ Metadata on both the numerical data and the algorithms.
     ♦ Heuristics-driven automated analysis.
     ♦ Self-adaptivity: tuning of the heuristics over time through experience gained from production runs.

     Joint work with Victor Eijkhout, Bill Gropp, and David Keyes.
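The "space considerations" in the direct-versus-iterative choice can be sketched as a memory estimate. Here `fill_ratio` (the predicted growth of nonzeros during factorization) and the 8-byte reals are illustrative assumptions, and real SALSA heuristics also weigh numerical properties of the system:

```python
def direct_or_iterative(nnz, fill_ratio, mem_bytes):
    """Space-only sketch of the solver choice; thresholds are illustrative."""
    factor_bytes = 8 * nnz * fill_ratio   # estimated storage for the LU factors
    if factor_bytes <= mem_bytes:
        return "direct"     # robust and predictable once the factors fit
    return "iterative"      # needs only matrix-vector products in memory
```

In the SALSA design, the inputs to such a decision (matrix metadata, past performance data) come from the database and heuristics the slide describes, and the heuristics themselves are retuned from production runs.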
