
Silicon Graphics Scientific Library Update

Mimi Celis (celis@sgi.com) and Tom Elken (telken@sgi.com), Supercomputing Applications, Silicon Graphics, Inc. 41st Cray User Group Conference, Minneapolis, Minnesota.


  1. Silicon Graphics Scientific Library Update
     Mimi Celis (celis@sgi.com), Tom Elken (telken@sgi.com)
     Supercomputing Applications, Silicon Graphics, Inc.
     41st Cray User Group Conference, Minneapolis, Minnesota

  2. Contents
     • Scientific Libraries available on SGI hardware
     • SCSL Scientific Library (like “SGI”, “SCSL” doesn’t mean anything ;-) )
     • SCSL Release 1.2
     • Signal Processing in SCSL 1.2
     • Performance
     • Special Solvers in SCSL 1.2
     • Future

  3. Scientific Libraries on SGI
     There are “many” scientific libraries available on SGI platforms today.
     • LibSci on Cray platforms.
     • CHALLENGEcomplib on IRIX platforms (libcomplib.sgimath, libblas).
       – Part of the IDO in IRIX 6.4 and older
       – Part of the IRIX development libraries in IRIX 6.5
       – Version 3.1
     • SCSL on IRIX platforms.
       – Unbundled product
       – Available for IRIX 6.4 and newer
       – Version 1.1

  4. SCSL Scientific Library
     • SCSL is a scientific and math library.
     • SCSL is (initially) available on IRIX 6.4 and 6.5 systems.
     • SCSL will become the standard scientific library on all SGI platforms.
     • SCSL will merge the important functionality of CHALLENGEcomplib and LibSci into one library.
     • SCSL will provide a new library with more functionality and better performance than either library by itself.

  5. SCSL Contents
     • BLAS (Basic Linear Algebra Subprograms)
       – BLAS1: vector-vector operations
       – BLAS2: matrix-vector operations
       – BLAS3: matrix-matrix operations
     • LAPACK
       – Symmetric and nonsymmetric linear systems of equations
       – Symmetric and nonsymmetric eigenvector/eigenvalue problems
       – Singular Value Decomposition
       – Linear Least Squares
     BLAS and LAPACK were developed at the University of Tennessee.
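
The three BLAS levels listed above differ mainly in how much arithmetic they do per data element. The following is a minimal pure-Python sketch of one representative operation per level. It mirrors only the mathematics of the standard DAXPY, DGEMV and DGEMM operations; the function names `axpy`, `gemv` and `gemm` are illustrative and are not the SCSL calling interface.

```python
# Pure-Python sketches of one representative operation per BLAS level.
# These show the arithmetic only; they are not the SCSL or BLAS interfaces
# and make no attempt at performance.

def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y, O(n) work."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Level 2 (matrix-vector): return A*x, O(n^2) work."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Level 3 (matrix-matrix): return A*B, O(n^3) work."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a dot product
    return [[sum(aik * bkj for aik, bkj in zip(row, col)) for col in cols_b]
            for row in a]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
    print(gemv(a, [1.0, 1.0]))
    print(gemm(a, a))
```

Level 3 performs O(n^3) operations on O(n^2) data, which is why tuned libraries can typically run matrix-matrix operations closer to peak speed than matrix-vector ones.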

  6. SCSL Contents (continued)
     • Sparse Linear Equation Solvers
       – Symmetric linear systems of equations
       – Nonsymmetric linear systems of equations (NO pivoting)
     • FFTs
       – Multiple one-dimensional, mixed radix
       – One-, two- and three-dimensional, mixed radix
       – Single and double precision, for both real and complex data types
     Sparse solvers and FFTs were developed at SGI. (There is no de facto standard API.)

  7. How to use SCSL
     • Documentation in the form of man pages:
       – intro_libscsl
       – intro_blas1, _blas2, _blas3
       – intro_fft
       – intro_lapack
       – intro_sparse (soon)
       – These will point you to more detailed man pages.
     • Linking:
       – Serial: -lscs
       – OpenMP or libmp parallel: -lscs_mp -mp

  8. SCSL Release 1.2
     SCSL 1.1 is the current release. Release 1.2 will be the next SCSL release.
     Goals for 1.2:
     • Add the missing complib Signal Processing functionality.
     • Provide C language interfaces for the Signal Processing routines.
     • Enhance the ordering techniques in the sparse linear solvers.
     • Performance tuning for the MIPS R12000 Processor.
     • Roll up bug fixes from SCSL 1.1 and complib 3.1.
     SCSL 1.2 will be released with IRIX 6.5.5 (late July 1999).

  9. SCSL Release 1.2 (continued)
     SCSL 1.2 is the follow-on to CHALLENGEcomplib, with some exceptions:
     • SCSL 1.2 will NOT include o32 versions of the libraries.
     • SCSL 1.2 will NOT support LINPACK and EISPACK.
     • SCSL 1.2 will run on all platforms that have n32 or 64 support.
     CHALLENGEcomplib remains available to run on older and current platforms; however:
     • There will be no further releases of complib.
     • There will be no complib bug fixes (with rare exceptions).

  10. Signal Processing for SCSL 1.2
      Additions to the FFTs:
      • A multiple 1D routine that calculates an FFT in one dimension for each row of a two-dimensional matrix.
      • 1D, 2D and 3D routines that compute the product of the Fourier Transform of a sequence with the Fourier Transform of a filter (*prod routines in complib).
      • Functions will be introduced to release memory allocated within the FFT routines.
      • C language bindings.
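
To make the first two additions concrete, here is a small sketch of what a "multiple 1D" transform and a transform-domain product compute. It uses a naive O(n^2) DFT from the Python standard library rather than a fast algorithm, the forward sign convention exp(-2πijk/n) is an assumption, and the names `dft`, `multi_dft` and `spectrum_product` are illustrative, not SCSL or complib routine names.

```python
import cmath


def dft(seq):
    """Naive O(n^2) forward discrete Fourier transform (unscaled)."""
    n = len(seq)
    return [sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]


def multi_dft(rows):
    """"Multiple 1D": transform each row of a 2-D array independently."""
    return [dft(row) for row in rows]


def spectrum_product(seq, filt):
    """Pointwise product of the transforms of a sequence and a filter."""
    return [s * f for s, f in zip(dft(seq), dft(filt))]


if __name__ == "__main__":
    rows = [[1.0, 1.0, 1.0, 1.0],
            [1.0, 0.0, 0.0, 0.0]]
    for spectrum in multi_dft(rows):
        print([round(abs(c), 6) for c in spectrum])
```

Multiplying the two spectra pointwise and transforming back is the usual way to apply a filter by circular convolution, which is presumably why the library offers the product as a single routine.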

  11. Signal Processing for SCSL 1.2 (continued)
      SCSL 1.2 will include convolution and correlation routines.
      • Convolution for Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, together with correlations.
      • 1D and 2D convolution and correlation.
      • Single and double precision for real and complex arithmetic.
      • 2D routines will run on multiple processors.
      • API similar to the complib API (but not fully compatible).
      • Fortran and C language bindings.
      The two main goals of the Convolution and Correlation library are performance and generality. It provides well-tuned modules usable in most convolution and correlation instances.
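
For readers unfamiliar with the two operations, the sketch below shows direct 1D convolution (the FIR filter case) and 1D correlation in plain Python. It illustrates the definitions only; the function names and the choice of "full" output for convolution and "valid" output for correlation are assumptions made for the example, not the SCSL API.

```python
def convolve(x, h):
    """Full linear convolution: y[n] = sum over k of h[k] * x[n - k].

    With h holding the filter taps, this is an FIR filter applied to x.
    """
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y


def correlate(x, h):
    """Correlation without padding: y[n] = sum over k of h[k] * x[n + k]."""
    m = len(h)
    return [sum(h[k] * x[n + k] for k in range(m))
            for n in range(len(x) - m + 1)]


if __name__ == "__main__":
    print(convolve([1.0, 2.0, 3.0], [1.0, 1.0]))
    print(correlate([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]))
```

The only difference between the two is whether the kernel index runs against the signal index (convolution) or with it (correlation).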

  12. Performance
      • BLAS
      • Fast Fourier Transforms
      • Sparse Solver

  13. BLAS Performance
      [Chart: DGEMM Performance. Mflops (axis 0 to 700) versus matrix size (32 to 2048).]

  14. BLAS Performance
      [Chart: DGEMV Performance. Mflops (axis 0 to 450) versus matrix size (32 to 2048).]

  15. BLAS Performance
      [Chart: DGEMM Parallel Performance. Mflops (axis 0 to 18000) versus number of processors (1 to 32).]

  16. Fast Fourier Transforms (FFT)
      • 1-dimensional FFT applications:
        – Seismic: many short FFTs (1024-4096 data points)
        – Sonar, radar cross-section, speech recognition and astronomical systems: large 1D FFTs
      • Multi-dimensional FFTs:
        – Image processing
        – PDEs from CFD applications
      The following charts show the “effective megaflop rate”, based on 5n*log(n) operations for each complex-to-complex FFT.
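
The effective rate quoted above can be computed from a measured time as in the following sketch. The base-2 logarithm is an assumption (the slide writes only log(n), though base 2 is the usual convention for this operation count), and the function name `effective_mflops` and its parameters are illustrative.

```python
import math


def effective_mflops(n, seconds, count=1):
    """Effective Mflop rate for `count` complex-to-complex FFTs of length n.

    Uses the nominal operation count 5*n*log2(n) per transform. The base-2
    logarithm is an assumption; the slide writes only log(n).
    """
    flops = 5.0 * n * math.log2(n) * count
    return flops / seconds / 1.0e6


if __name__ == "__main__":
    # Hypothetical timing: one 1024-point transform in one millisecond.
    print(effective_mflops(1024, 1.0e-3))
```

The rate is "effective" because it divides a nominal operation count by the elapsed time, regardless of how many operations the implementation actually performs.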

  17. FFT performance
      [Chart: 1D Complex-complex FFT. Mflops (axis 0 to 600) versus FFT size (1 to 1E+08). Series: Single Precision, Double Precision.]

  18. FFT performance
      [Chart: Complex-complex Multiple 1D FFT. Mflops (axis 0 to 600) versus FFT size and # of repetitions (10 to 10000). Series: Single Precision, Double Precision.]

  19. FFT performance
      [Chart: 2D Complex-complex FFT. Mflops (axis 0 to 450) versus FFT size of one dimension (10 to 1000). Series: Single Precision, Double Precision.]

  20. FFT parallel performance
      [Chart: Complex-complex Multiple 1D FFT. Mflops (axis 0 to 6000) versus # of CPUs (1 to 100). Series: 1024-single, 2048-single, 4096-single, 1024-double, 2048-double, 4096-double.]
      “1024-single” means 1024 copies of a size 1024 single precision (32 bits) FFT.

  21. Changes to SGI Sparse Solvers
      • New Matrix Ordering Options
        – Methods 3 and 4 are termed “Extreme2” ordering
      • New default for ordering option
        – Extreme ordering (Method 2) is now the default
      • Out-of-core solver option
        – Was in a recent SCSL version, but is now documented
        – Single-processor only
        – Striped file system useful
        – Simple interface and performs well

  22. New ordering options
      Method 3: Multiple Nested Dissection orders
      • Default is OMP_NUM_THREADS orders
      • Repeatable quality
      Method 4: Multiple ND orders using feedback file information
      • Default is 2 x OMP_NUM_THREADS orders
      • Feedback file is at most 5KB, up to 200 records
      • Binary feedback file
      • A solver that learns
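
The general idea of computing several candidate orderings and keeping the best one can be sketched as follows. This is not the SCSL algorithm: random permutations stand in for nested dissection orders, the fill-in count from symbolic elimination stands in for whatever quality measure the solver actually uses, and the names `fill_in` and `best_of_orders` are hypothetical. The feedback file of Method 4 is not modeled.

```python
import random


def fill_in(adj, order):
    """Count fill edges created by eliminating vertices in `order`.

    `adj` maps each vertex to the set of its neighbours in the graph of a
    symmetric sparse matrix.
    """
    g = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    fill = 0
    for v in order:
        nbrs = list(g.pop(v))
        for u in nbrs:
            g[u].discard(v)
        # Eliminating v makes its remaining neighbours pairwise adjacent.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    fill += 1
    return fill


def best_of_orders(adj, num_orders, seed=0):
    """Try `num_orders` candidate orderings and keep the one with least fill."""
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    vertices = sorted(adj)
    best_order, best_fill = None, None
    for _ in range(num_orders):
        order = vertices[:]
        rng.shuffle(order)  # stand-in for one nested dissection order
        f = fill_in(adj, order)
        if best_fill is None or f < best_fill:
            best_order, best_fill = order, f
    return best_order, best_fill


if __name__ == "__main__":
    # Star graph: vertex 0 is connected to vertices 1..4.
    star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
    print(fill_in(star, [0, 1, 2, 3, 4]))  # centre first: worst case
    print(best_of_orders(star, num_orders=8))
```

Because each candidate ordering is independent, the candidates can be computed concurrently, which is consistent with the default number of orders being tied to OMP_NUM_THREADS.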

  23. Choosing a default method
      • Should the default be best for which size model?
      • Decided to optimize for medium or larger problems (at least 5000 equations).
      • Extreme2 (Method 3) is about 3% faster than Extreme, but is new technology, so we use Method 2 as the new default.
      [Chart: Total Time for Nine models (axis 0 to 3500) versus Ordering Method (1 to 4).]

  24. Out-of-core (OOC) Option
      • Performance is 10-40% slower than Extreme (Method 2) ordering in-core; 15% in this case.
      • But faster than AMF (Method 1).
      • This used 4-way striping on the file system: 140 MB/s on some reads.
      • Allowed 128MB in-core for factor storage.
      [Chart: Total Time for Nine models (1-CPU runs), axis 0 to 1800, versus Ordering Method / Factor Storage (1, 2, 3, 4, OOC).]

  25. Scalability: Factorization Mflops
      • Amdahl’s law is responsible for much of the lack of scaling in the previous chart.
      • Over 11 Gflops achieved on gismondi on 48 CPUs.
      • More can be done to improve memory placement.
      • These results used DSM_ROUND_ROBIN data placement.
      [Chart: Factorization Mflops versus # of CPUs. Series: gismondi, fleet10, th2, 280Kdof.]
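
Amdahl's law, cited above, bounds the speedup of a run in which only part of the work is parallel. A minimal sketch follows; the parallel fractions used are illustrative values, not measurements from the talk.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Speedup predicted by Amdahl's law.

    speedup = 1 / ((1 - p) + p / N), where p is the fraction of the run
    time that parallelizes and N is the number of CPUs.
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cpus)


if __name__ == "__main__":
    # Illustrative only: 90% of the run time parallelizes perfectly.
    for n in (1, 2, 4, 8):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

In a sparse solve, steps such as preprocessing that remain serial fall into the (1 - p) term, so even a factorization that scales well cannot make the total elapsed time scale linearly.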

  26. PSLDLT: Scalability to 8 CPUs
      • Measured: elapsed time for 1 preprocess, 2 factorizations, 2 solves.
      • # floating point ops to factor & preprocess time:

        Model      Gflop   secs.
        fleet10    383     27
        gismondi   133     3
        th2        34      18
        280Kdof    18      15

      [Chart: Speedup (axis 0 to 7) versus # of CPUs (0 to 10). Series: fleet10, gismondi, th2, 280Kdof.]

  27. Summary
      • SCSL 1.2 improvements:
        – FFTs have a new interface.
        – Add the missing complib Signal Processing functionality.
        – Provide C language interfaces for the Signal Processing routines.
        – Enhance the ordering techniques in the sparse linear solvers.
        – Performance tuning for the MIPS R12000 Processor.
        – Roll up bug fixes from SCSL 1.1 and complib 3.1.
      • Comments, questions:
        – Mimi Celis; celis@sgi.com
        – Tom Elken; telken@sgi.com
