Silicon Graphics Scientific Library Update

SLIDE 1

Silicon Graphics Scientific Library Update

Mimi Celis (celis@sgi.com)
Tom Elken (telken@sgi.com)

Supercomputing Applications

Silicon Graphics, Inc.

41st Cray User Group Conference, Minneapolis, Minnesota

SLIDE 2

Contents

• Scientific Libraries available on SGI hardware
• SCSL Scientific Library
  (like "SGI", "SCSL" doesn't mean anything ;-) )
• SCSL Release 1.2
• Signal Processing in SCSL 1.2
• Performance
• Special Solvers in SCSL 1.2
• Future

SLIDE 3

Scientific Libraries on SGI

There are "many" scientific libraries available on SGI platforms today.

• LibSci on Cray platforms.
• CHALLENGEcomplib on IRIX platforms (libcomplib.sgimath, libblas).
  – Part of the IDO in IRIX 6.4 and older
  – Part of the IRIX development libraries in IRIX 6.5
  – Version 3.1
• SCSL on IRIX platforms.
  – Unbundled product
  – Available for IRIX 6.4 and newer
  – Version 1.1

SLIDE 4

SCSL Scientific Library

• SCSL is a scientific and math library.
• SCSL is (initially) available on IRIX 6.4 and 6.5 systems.
• SCSL will become the standard scientific library on all SGI platforms.
• SCSL will merge the important functionality of CHALLENGEcomplib and LibSci into one library.
• SCSL will provide a new library with more functionality and better performance than either library by itself.

SLIDE 5

SCSL Contents

• BLAS (Basic Linear Algebra Subprograms)
  – BLAS1: vector-vector operations
  – BLAS2: matrix-vector operations
  – BLAS3: matrix-matrix operations
• LAPACK
  – Symmetric and nonsymmetric linear systems of equations
  – Symmetric and nonsymmetric eigenvalue/eigenvector problems
  – Singular Value Decomposition
  – Linear Least Squares

BLAS and LAPACK were developed at the University of Tennessee.
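The three BLAS levels listed above differ mainly in how much arithmetic they do per data element. The following is a minimal pure-Python sketch of one representative operation per level. The function names `axpy`, `gemv`, and `gemm` echo the standard BLAS operation names, but the list-based signatures are an assumption made for illustration and are not the SCSL or Fortran BLAS interfaces.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha*A*x + beta*y.

    The matrix A is a list of rows (illustrative layout only).
    """
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]


def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha*A*B + beta*C."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a dot product
    return [[alpha * sum(aik * bkj for aik, bkj in zip(row, col)) + beta * cij
             for col, cij in zip(cols_b, crow)]
            for row, crow in zip(a, c)]
```

For vectors of length n and n-by-n matrices, Level 1 does O(n) work on O(n) data, Level 2 does O(n^2) work on O(n^2) data, and Level 3 does O(n^3) work on O(n^2) data, which is why matrix-matrix routines such as DGEMM generally reach the highest Mflops rates.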

SLIDE 6

SCSL Contents (continued)

• Sparse Linear Equation Solvers
  – Symmetric linear systems of equations
  – Nonsymmetric linear systems of equations (NO pivoting)
• FFTs
  – Multiple one-dimensional mixed radix
  – One-, two-, and three-dimensional mixed radix
  – Single and double precision, for both real and complex data types

Sparse solvers and FFTs were developed at SGI. (There is no de facto standard API.)

SLIDE 7

How to use SCSL

• Documentation in the form of man pages:
  – intro_libscsl
  – intro_blas1, intro_blas2, intro_blas3
  – intro_fft
  – intro_lapack
  – intro_sparse (soon)
  – These will point you to more detailed man pages.
• Linking:
  – Serial: -lscs
  – OpenMP or libmp parallel: -lscs_mp -mp

SLIDE 8

SCSL Release 1.2

SCSL 1.1 is the current release. Release 1.2 will be the next SCSL release. Goals for 1.2:

• Add the missing complib Signal Processing functionality.
• Provide C language interfaces for the Signal Processing routines.
• Enhance the ordering techniques in the sparse linear solvers.
• Performance tuning for the MIPS R12000 Processor.
• Roll up bug fixes from SCSL 1.1 and complib 3.1.

SCSL 1.2 will be released with IRIX 6.5.5 (late July 1999).

SLIDE 9

SCSL Release 1.2 (continued)

SCSL 1.2 is the follow-on to CHALLENGEcomplib, with some exceptions:

• SCSL 1.2 will NOT include o32 versions of the libraries.
• SCSL 1.2 will NOT support LINPACK and EISPACK.
• SCSL 1.2 will run on all platforms that have n32 or 64 support.

CHALLENGEcomplib is available to run on older and current platforms; however:

• There will be no further releases of complib.
• No complib bug fixes (with rare exceptions).

SLIDE 10

Signal Processing for SCSL 1.2

Additions to the FFTs:

• A multiple 1D routine that computes a one-dimensional FFT for each row of a two-dimensional matrix.
• 1D, 2D and 3D routines that compute the product of the Fourier Transform of a sequence with the Fourier Transform of a filter (*prod routines in complib).
• Functions will be introduced to release memory allocated within the FFT routines.
• C language bindings.

SLIDE 11

Signal Processing for SCSL 1.2 (continued)

SCSL 1.2 will include convolution and correlation routines.

• Convolution for Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, together with correlations.
• 1D and 2D convolution and correlation.
• Single and double precision for real and complex arithmetic.
• 2D routines will run on multiple processors.
• API similar to complib API (but not fully compatible).
• Fortran and C language bindings.

The two main goals of the Convolution and Correlation library are performance and generality. It provides well-tuned modules usable in most convolution and correlation instances.
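For readers unfamiliar with the operations named above, here is a minimal pure-Python sketch of 1D FIR convolution, IIR filtering, and correlation in their direct (untuned) forms. The function names and argument conventions are hypothetical and chosen for illustration. They are not the SCSL or complib routines, whose interfaces are not shown in these slides.

```python
def fir_convolve(x, h):
    """FIR filtering as full linear convolution: y[n] = sum_k h[k] * x[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def iir_filter(b, a, x):
    """IIR filtering with feed-forward taps b and feedback taps a (a[0] != 0):

    a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]
    """
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


def correlate(x, h):
    """Correlation at lags with full overlap: c[m] = sum_k x[m + k] * h[k]."""
    lags = len(x) - len(h) + 1
    return [sum(x[m + k] * h[k] for k in range(len(h))) for m in range(lags)]
```

Correlation differs from convolution only in that the second sequence is not reversed. An IIR filter feeds earlier outputs back into the sum, so its impulse response does not end after a fixed number of samples.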

SLIDE 12

Performance

• BLAS
• Fast Fourier Transforms
• Sparse Solver

SLIDE 13

BLAS Performance

[Figure: DGEMM performance. Mflops (axis ticks 100 to 700) versus matrix size (32 to 2048).]

SLIDE 14

BLAS Performance

[Figure: DGEMV performance. Mflops (axis ticks 50 to 450) versus matrix size (32 to 2048).]

SLIDE 15

BLAS Performance

[Figure: DGEMM parallel performance. Mflops (axis ticks 2000 to 18000) versus number of processors (1, 2, 4, 8, 16, 32).]

SLIDE 16

Fast Fourier Transforms (FFT)

• 1-Dimensional FFT applications:
  – Seismic: many short FFTs (1024-4096 data points)
  – Sonar, radar cross-section, speech recognition and astronomical systems: large 1D FFTs
• Multi-dimensional FFTs:
  – Image processing
  – PDEs from CFP applications

The following charts show the "effective megaflop rate", based on 5n*log(n) floating-point operations for each complex-to-complex FFT.
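The effective megaflop rate described above can be computed from a measured run time as in the sketch below. It assumes the logarithm is base 2, which is the usual convention for this operation count but is not stated on the slide. The function name and the `count` parameter (number of transforms timed together) are hypothetical.

```python
import math


def effective_mflops(n, seconds, count=1):
    """Effective Mflops for `count` complex-to-complex FFTs of length n.

    Uses the nominal operation count 5*n*log2(n) per transform (base-2
    logarithm is an assumption), regardless of how many operations the
    implementation actually performs.
    """
    nominal_flops = 5.0 * n * math.log2(n) * count
    return nominal_flops / seconds / 1.0e6
```

For example, a single length-1024 transform has a nominal count of 5 * 1024 * 10 = 51200 operations, so a hypothetical run time of one millisecond would correspond to 51.2 effective Mflops. Because the count is nominal, the rate lets different FFT algorithms be compared on the same scale.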

SLIDE 17

FFT performance

[Figure: 1D complex-complex FFT. Mflops (axis ticks 100 to 600) versus FFT size (ticks at 1, 100, 10000, 1000000, 1E+08). Series: Single Precision, Double Precision.]

SLIDE 18

FFT performance

[Figure: Complex-complex multiple 1D FFT. Mflops (axis ticks 100 to 600) versus FFT size and # of repetitions (ticks at 10, 100, 1000, 10000). Series: Single Precision, Double Precision.]

SLIDE 19

FFT performance

[Figure: 2D complex-complex FFT. Mflops (axis ticks 50 to 450) versus FFT size of one dimension (ticks at 10, 100, 1000). Series: Single Precision, Double Precision.]

SLIDE 20

FFT parallel performance

[Figure: Complex-complex multiple 1D FFT. Mflops (axis ticks 1000 to 6000) versus # of CPUs (ticks at 1, 10, 100). Series: 1024-single, 2048-single, 4096-single, 1024-double, 2048-double, 4096-double.]

"1024-single" means 1024 copies of a size-1024 single-precision (32-bit) FFT.

SLIDE 21

Changes to SGI Sparse Solvers

• New Matrix Ordering Options
  – Methods 3 and 4 are termed "Extreme2" ordering
• New default for ordering option
  – Extreme ordering (Method 2) is now the default
• Out-of-core solver option
  – Was already present in a recent SCSL version, but is now documented
  – Single-processor only
  – A striped file system is useful
  – Simple interface; performs well

SLIDE 22

New ordering options

3. Multiple Nested Dissection (ND) orders
  • Default is OMP_NUM_THREADS orders
  • Repeatable quality

4. Multiple ND orders using feedback file information
  • Default is 2 x OMP_NUM_THREADS orders
  • Feedback file is at most 5 KB, up to 200 records
  • Binary feedback file
  • A solver that learns

SLIDE 23

Choosing a default method

• For which model size should the default be best?
• Decided to optimize for medium or larger problems (at least 5000 equations).
• Extreme2 (Method 3) is about 3% faster than Extreme, but it is new technology, so Method 2 is the new default.

[Figure: Total time for nine models (axis ticks 500 to 3500) versus ordering method (1 to 4).]

SLIDE 24

Out-of-core (OOC) Option

• Performance is 10-40% slower than in-core Extreme (Method 2) ordering; 15% in this case.
• But faster than AMF (Method 1).
• This run used 4-way striping on the file system (140 MB/s on some reads).
• 128 MB allowed in-core for factor storage.

[Figure: Total time for nine models, 1-CPU runs (axis ticks 200 to 1800) versus ordering method / factor storage (1, 2, 3, 4, OOC).]

SLIDE 25

Scalability: Factorization Mflops

• Amdahl's law is responsible for much of the lack of scaling in the previous chart.
• Over 11 Gflops achieved on gismondi on 48 CPUs.
• More can be done to improve memory placement.
• These results used DSM_ROUND_ROBIN data placement.
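Amdahl's law, cited above, bounds the speedup of a run in which only part of the work is parallel. A minimal sketch follows. The function name is hypothetical, and the 0.9 parallel fraction in the usage note is an illustrative value, not a measurement from these runs.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of single-CPU run time that parallelizes
                       perfectly (between 0.0 and 1.0).
    cpus:              number of processors used.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)
```

With a hypothetical parallel fraction of 0.9, eight CPUs give a speedup of about 4.7 rather than 8, and no number of CPUs can exceed 10. Serial stages therefore limit the scaling of the overall run even when the factorization itself scales well.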

[Figure: Factorization Mflops (axis ticks 500 to 3500) versus # of CPUs (ticks at 5 and 10). Series: gismondi, fleet10, th2, 280Kdof.]

SLIDE 26

PSLDLT: Scalability to 8 CPUs

• Measured: elapsed time for 1 preprocess, 2 factorizations, 2 solves.
• Number of floating-point operations to factor, and preprocess time:

  Model      Factor ops (Gflop)   Preprocess time (secs.)
  fleet10    383                  27
  gismondi   133                  3
  th2        34                   18
  280Kdof    18                   15

[Figure: Speedup (axis ticks 1 to 7) versus # of CPUs (ticks 2 to 10). Series: fleet10, gismondi, th2, 280Kdof.]

SLIDE 27

Summary

• SCSL 1.2 improvements:
  – FFTs have a new interface.
  – Add the missing complib Signal Processing functionality.
  – Provide C language interfaces for the Signal Processing routines.
  – Enhance the ordering techniques in the sparse linear solvers.
  – Performance tuning for the MIPS R12000 Processor.
  – Roll up bug fixes from SCSL 1.1 and complib 3.1.
• Comments, questions:
  – Mimi Celis; celis@sgi.com
  – Tom Elken; telken@sgi.com