SLIDE 1

SSI: Overview of Simulation Software Infrastructure for Large Scale Scientific Applications

Akira Nishida

Department of Computer Science, University of Tokyo / JST CREST
98th IPSJ SIGHPC Meeting

SLIDE 2

Motivation

  • Emergence of large scale scientific simulations in various fields
  • Development of numerical libraries in Japan
    • Mainly developed in supercomputing centers (on mainframes and vector supercomputers) in the 1980s
    • Cooperation with vendors (e.g. Fujitsu SSL II)
  • Development in the US
    • ScaLAPACK (with BLAS and LAPACK), PETSc, Aztec, etc.
    • Developed and used in national laboratories
    • Standardized and modularized
    • Run on parallel computing environments
    • Distributed via WWW (netlib etc.) since the 1990s (also mirrored in Japan)
  • Demands for reliable and portable parallel numerical libraries as social infrastructure

SLIDE 3

Brief History of Basic Numerical Libraries

  • Projects in the US and Europe
    • NATS (National Activity to Test Software) project by NSF, started in 1970
    • EISPACK (1972) and LINPACK (1978)
    • Standardization of level 1 BLAS (Basic Linear Algebra Subprograms) in 1979
    • Development of LAPACK, LAPACK2, and ScaLAPACK by NSF and DARPA during 1987-1995
    • PARASOL (An Integrated Programming Environment for Parallel Sparse Matrix Solvers) since 1996
    • SciDAC (Scientific Discovery through Advanced Computing) program started in 2001 by DoE (development of hardware/software infrastructure for terascale computing)

SLIDE 4

Brief History of Basic Numerical Libraries (2)

  • Projects in Japan
    • Basic numerical libraries
      • Internal use in national supercomputing centers: Program System for Statistical Analysis with Least-Squares Fitting (T. Nakagawa, Y. Oyanagi et al., 1976-1982)
      • Offline distribution: a series of books by K. Murata, T. Oguni, and H. Hasegawa, published by Maruzen Co., Ltd. with floppy disks
      • No major national projects for the development of parallel numerical libraries
    • Parallel processing
      • Real World Computing (RWC) Project by MITI (M. Sato, Y. Ishikawa, T. Kudo et al.)
      • OBPLib: object oriented library for scientific computing on distributed memory architectures
      • Omni OpenMP Compiler: free OpenMP compiler for shared memory parallel architectures

SLIDE 5

Features of the Project

  • Started as a $2M, 5-year national project in Nov. 2002
  • Complete survey of domestic and overseas research projects
    • Cooperation with other projects
    • Investigate problems with existing libraries
    • Refinement of software specifications
  • Development
    • Select and evaluate target architectures (need to predict the mainstream in 2007)
    • Fast prototyping of core components (need feedback)
    • Start with replacement of original libraries used in real applications
  • Primary targets:
    • Parallel eigensolvers
      • QR algorithms (general purpose, real/complex, symmetric/nonsymmetric)
      • Lanczos/Arnoldi, Davidson methods (selected eigenpairs for physical applications)
    • Parallel linear solvers
      • Direct solvers (general purpose, real/complex, symmetric/nonsymmetric, dense/band/sparse)
      • Iterative solvers (for FDM and FEM)
    • Parallel fast integral transforms
      • Fast Fourier transforms (general purpose)
      • Fast Legendre transform (climate studies) etc.
    • Portable object-oriented implementation
  • Distribution
    • Distribution via network
    • Publication of manuals from major publishers
SLIDE 6

Core Research Fields

  • Eigensolvers
    • Akira Nishida (Tokyo Univ.): eigensolvers for large sparse eigenproblems and their parallelization
  • Linear solvers
    • Hidehiko Hasegawa (Tsukuba Univ.): development of direct/iterative linear solvers
    • Shao-Liang Zhang (Tokyo Univ.): studies on iterative solvers; proposed GPBiCG (product type iterative solver)
    • Kengo Nakajima (RIST): general purpose solver for finite element problems
    • Kuniyoshi Abe (Gifu Shotoku Gakuen Univ.): joint researcher with S.-L. Zhang on product type iterative solvers
    • Shoji Ito (Tsukuba Univ.): development of direct solvers
    • Koh Hashimoto (Tokyo Univ.): joint research with S.-L. Zhang; studies on mechanical systems
    • Akihiro Fujii (Tokyo Univ., doctoral candidate): parallel and vector implementation of the AMG preconditioned CG method
    • Tomohiro Sogabe (Tokyo Univ., doctoral candidate): studies on iterative solvers; proposed the BiCR type method
SLIDE 7

Core Research Fields (2)

  • Fast integral transforms
    • Reiji Suda (Tokyo Univ.): fast Legendre transform for spherical climate analysis
    • Daisuke Takahashi (Tsukuba Univ.): development of optimized parallel FFT
    • Akira Nukada (Tokyo Univ., doctoral candidate): development of optimized parallel FFT
  • Parallel and distributed portable implementation
    • Akira Nishida
    • Reiji Suda
    • Hidehiko Hasegawa
    • Kengo Nakajima
    • Akira Nukada
    • Akihiro Fujii
    • Yuichiro Hourai (Tokyo Univ., doctoral candidate): parallel distributed computation, optimization of broadcast communications on tree-structured networks

SLIDE 8

Organization

[Organization chart: core research groups (eigensolvers, linear solvers, fast integral transforms, implementation methods) plus the computing and networking environment, in cooperation with Tsukuba Univ., the Japan Meteorological Agency, the Earth Simulator Center, the AIST Grid Research Center, the Institute for Solid State Physics, the Institute of Medical Science, the Institute of Industrial Science, RIST, Advancesoft Corp. (MEXT IT Program), etc.]

SLIDE 9

Schedule

[Schedule chart by fiscal year, from 2002 (5 months) through 2007 (7 months): facilities; surveys of applications, software engineering, and hardware technologies; algorithms; programming model; implementation and verification; tutorials.]

SLIDE 10

Target (1): Architectures and Systems

  • Survey of trends and directions of hardware technologies
  • Trends in computer architectures
    • Higher density and lower power
      • E.g. IBM Blue Gene/L: 130 thousand CPUs, 180 TFLOPS
      • E.g. Fujitsu BioServer
    • Symmetric multithreading
      • IBM Power, Sun UltraSPARC, Intel Pentium & Itanium, etc.
    • Higher parallelism at every level of the architecture
  • It is becoming more important to optimize the performance of the libraries, while their design grows more complex

SLIDE 11

Current Status: Architectures and Systems

  • Predict the computing environment to be available in 5 years
    • Up-to-date facilities, to be updated every year
  • Current facilities of the SSI project
    • Shared memory programming environment: SGI Altix 3700 (Intel Madison 1.3GHz × 32, Linux OS, 32GB main memory)
    • Vector processing environment: NEC SX-6i
    • Cluster computing environment: dual Intel Xeon 2.8GHz servers × 16, GbE interconnect
    • 10GbE enabled networking environment (Cisco C6509)
  • Most major architectures have been covered
  • Portability
    • Portability can be tested easily on the SSI environment by the developers

SLIDE 12

Current Status: Architectures and Systems (2)

[System diagram: SGI Altix 3700, NEC SX-6i, an InfiniBand interconnected Itanium3 cluster, a HyperTransport interconnected Opteron cluster, Sun Fire 3800, and Sun StorEdge T3, connected over a GbE or 10GbE LAN through a Cisco C6509 router to the GbE (→10GbE) WAN and to desktops.]

SLIDE 13

Current Status: Architectures and Systems (3)

  • Shared memory computer SGI Altix 3700
  • Memory bandwidth compared with a Sun Fire 15K (UltraSPARC III 900MHz × 72, Solaris 8), using the STREAM benchmark with 1.8GB of data

[Figures: STREAM memory bandwidth (MB/s) for the Copy, Scale, Add, and Triad kernels vs. number of threads on each machine.]
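The STREAM kernels used in the comparison above are four simple vector operations. A minimal Python sketch (NumPy standing in for the original C benchmark; function name and byte counts follow the standard STREAM conventions, but this is an illustration, not the benchmark itself):

```python
import time
import numpy as np

def stream_kernels(n=10_000_000):
    """Run the four STREAM kernels once and estimate bandwidth in MB/s."""
    a = np.zeros(n)
    b = np.full(n, 2.0)
    c = np.full(n, 0.5)
    scalar = 3.0
    # (name, kernel, bytes moved): Copy/Scale touch 2 arrays, Add/Triad touch 3
    kernels = [
        ("Copy",  lambda: np.copyto(a, b),               2 * 8 * n),
        ("Scale", lambda: np.multiply(c, scalar, out=a), 2 * 8 * n),
        ("Add",   lambda: np.add(b, c, out=a),           3 * 8 * n),
        ("Triad", lambda: np.add(b, scalar * c, out=a),  3 * 8 * n),
    ]
    results = {}
    for name, op, nbytes in kernels:
        t0 = time.perf_counter()
        op()
        dt = time.perf_counter() - t0
        results[name] = nbytes / dt / 1e6  # MB/s
    return results
```

Note that the Triad line allocates a temporary for `scalar * c`, so its measured bandwidth is only indicative; the real benchmark fuses the multiply-add in C.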

SLIDE 14

Target (2): Algorithms

  • Promotion of fundamental studies
    • Promotion of fundamental studies by the members (research meetings)
    • Provide an up-to-date computing environment for joint researchers
    • Support porting of existing libraries written by the members to the new computing environment
    • Planning to develop new libraries based on the book "Numerical Libraries in Fortran 77", published by Maruzen Co., Ltd., by Hasegawa et al.
    • The NEDO APC automatic parallelizer has been implemented on our environment
      • Automatically adds OpenMP directives
  • Fast release; get feedback from beta users
    • A home page has been opened: http://ssi.is.s.u-tokyo.ac.jp/
    • Cooperation with the AIST PHASE project, http://phase.hpcc.jp/, etc.
  • Lightweight libraries with minimum functions for large scale problems
    • Balance OO overheads against performance
    • OO interface + primitive APIs
  • Publish detailed documents
    • Easy to use
SLIDE 15

Current Status: Algorithms

Eigensolvers (CG type)

  • Solve for the minimum eigenvalue of the generalized eigenproblem on real symmetric matrices, Ax = λBx
  • or the maximum eigenvalue of the equivalent eigenproblem Bx = μAx, μ = 1/λ
  • Optimize the Rayleigh quotient μ(x) = xᵀBx / xᵀAx, using the gradient ∇μ(x) ≡ g(x) = 2(Bx − μAx) / xᵀAx, by a conjugate gradient iteration with step length αᵢ:
    xᵢ₊₁ = xᵢ + αᵢpᵢ,  pᵢ = −gᵢ + βᵢ₋₁pᵢ₋₁,  βᵢ₋₁ = gᵢᵀgᵢ / gᵢ₋₁ᵀgᵢ₋₁
  • Theoretically O(n) complexity
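A minimal dense sketch of this Rayleigh-quotient CG iteration for the smallest eigenpair of Ax = λBx. The function name and the 2×2 Rayleigh-Ritz line search are illustrative choices, not the project's actual implementation; the β formula is the Fletcher-Reeves expression from the slide:

```python
import numpy as np

def rqcg_min_eig(A, B, x0, iters=100, tol=1e-10):
    """Rayleigh-quotient CG for the smallest eigenpair of A x = lambda B x."""
    x = np.asarray(x0, float)
    x /= np.linalg.norm(x)
    p = g_prev = None
    lam = (x @ A @ x) / (x @ B @ x)
    for _ in range(iters):
        Ax, Bx = A @ x, B @ x
        lam = (x @ Ax) / (x @ Bx)                 # Rayleigh quotient lambda(x)
        resid = Ax - lam * Bx
        if np.linalg.norm(resid) < tol:
            break
        g = 2.0 * resid / (x @ Bx)                # gradient of lambda(x)
        if p is None:
            p = -g
        else:
            beta = (g @ g) / (g_prev @ g_prev)    # Fletcher-Reeves beta
            p = -g + beta * p
        g_prev = g
        # exact line search along p: 2x2 Rayleigh-Ritz on span{x, p}
        S = np.column_stack([x, p])
        As, Bs = S.T @ A @ S, S.T @ B @ S
        theta, V = np.linalg.eig(np.linalg.solve(Bs, As))
        v = V[:, np.argmin(theta.real)].real
        x = S @ v
        n = np.linalg.norm(x)
        x, p = x / n, (v[1] * p) / n              # keep the step as the next p
    return lam, x
```

For a well-conditioned problem this converges quickly; e.g. with A = diag(1, …, 5) and B = I it recovers the smallest eigenvalue 1.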
SLIDE 16

Current Status: Algorithms (2)

Eigensolvers

  • CG type methods
    • AMG preconditioned CG solvers for eigenproblems by Knyazev and Argentati (2003) (see figures)
    • ILU preconditioned CR solver by Suetomi and Sekimoto (1989)

[Figures: residual history vs. iteration count; with the AMG-PCG preconditioner the solve takes 1,445 sec., versus 11,240 sec. without preconditioning.]

SLIDE 17

Current Status: Algorithms (3)

Linear solvers

  • Iterative solver (Bi-CR type method)
    • S. L. Zhang and T. Sogabe. Bi-CR method for solving large nonsymmetric linear systems. The 2003 International Conference on Numerical Linear Algebra and Optimization, October 7-10, 2003 (invited talk).
  • Both methods search the same Krylov subspace:
    xₙ = x₀ + zₙ,  zₙ ∈ Kₙ(A; r₀)
    so the residual satisfies
    rₙ = b − Axₙ = r₀ − Azₙ ∈ r₀ + A Kₙ(A; r₀)
  • CG: minimizes ||rₙ|| in the A⁻¹-norm
  • CR: minimizes ||rₙ|| in the 2-norm
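For reference, the classical CR iteration for a symmetric system, which is the starting point the Bi-CR work extends to nonsymmetric matrices, can be sketched as follows (names and tolerances are illustrative):

```python
import numpy as np

def conjugate_residual(A, b, tol=1e-10, maxiter=1000):
    """Conjugate Residual: minimizes ||r_n||_2 over x_0 + K_n(A; r_0), A symmetric."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    Ar = A @ r
    Ap = Ar.copy()
    rAr = r @ Ar
    for _ in range(maxiter):
        alpha = rAr / (Ap @ Ap)       # minimizes ||r - alpha*Ap||_2
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        Ar = A @ r
        rAr_new = r @ Ar
        beta = rAr_new / rAr
        rAr = rAr_new
        p = r + beta * p
        Ap = Ar + beta * Ap           # maintain A*p without an extra matvec
    return x
```

Replacing CG's A⁻¹-norm minimization with this 2-norm minimization is exactly the CG → CR swap the slide describes.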

SLIDE 18

Current Status: Algorithms (4)

  • Replace the CG part of Bi-CG with the more stable CR algorithm
  • Tested with Toeplitz matrices and some Matrix Market problems
  • Derived CRS, BiCRSTAB, and GPBiCR, which correspond to CGS, BiCGSTAB, and GPBiCG

SLIDE 19

Current Status: Algorithms (5)

  • Parallel AMG preconditioned CG method
    • A. Fujii, A. Nishida and Y. Oyanagi. Improvement and Evaluation of Smoothed Aggregation MG for Anisotropic Problems. In Proceedings of Symposium on Advanced Computing Systems and Infrastructures, pp. 137-144, 2003.
    • A. Fujii, A. Nishida and Y. Oyanagi. Parallel AMG Algorithm by Domain Decomposition. IPSJ Transactions on Advanced Computing Systems, Vol. 44, No. SIG 6 (ACS 1), pp. 9-17, 2003.
  • Smoothed Aggregation MG
    • Solution of Ax = b by an algebraic multigrid method
    • Generates the restriction matrix from vertex sets, called aggregates, built from the coefficient matrix
    • The iteration count does not depend on the problem size
    • Robust convergence even for anisotropic problems
    • Resolves the convergence problem of MGCG by Tatebe and Oyanagi
  • Parallelization of a direct linear solver
    • H. Hasegawa. Parallelization of Direct Linear Solver for Banded Matrices using OpenMP. IPSJ Transactions on Advanced Computing Systems, to appear.
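To make the aggregation idea concrete, here is a minimal sketch of a smoothed-aggregation prolongation for a 1D Poisson matrix. Fixed-size contiguous aggregates and the damping factor ω = 2/3 are illustrative simplifications; real AMG setup builds aggregates from the matrix graph:

```python
import numpy as np

def poisson1d(n):
    """Tridiagonal 1D Poisson matrix (2 on the diagonal, -1 off it)."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def smoothed_aggregation_prolongation(A, agg_size=3, omega=2.0 / 3.0):
    """Group unknowns into aggregates, then smooth the tentative prolongation
    with one damped-Jacobi step, P = (I - omega * D^-1 A) * P_hat."""
    n = A.shape[0]
    nc = -(-n // agg_size)                     # number of aggregates (coarse unknowns)
    P_hat = np.zeros((n, nc))
    for i in range(n):
        P_hat[i, i // agg_size] = 1.0          # piecewise-constant tentative prolongation
    Dinv = np.diag(1.0 / np.diag(A))
    return P_hat - omega * (Dinv @ A @ P_hat)  # smoothed prolongation

A = poisson1d(9)
P = smoothed_aggregation_prolongation(A)
Ac = P.T @ A @ P                               # Galerkin coarse-grid operator
```

The coarse operator Ac inherits the symmetry of A, which is what lets the multigrid hierarchy serve as a CG preconditioner.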

SLIDE 20

Current Status: Algorithms (6)

Fast integral transforms

  • Joint studies with researchers in the fields of weather forecasting and earth hydrodynamics
  • Main results
    • Efficient implementation of parallel FFT algorithms within a (multiprocessor) node
      • A. Nukada, A. Nishida and Y. Oyanagi. New Radix-8 FFT Kernel for Multiply-add Instructions. In Proceedings of High Performance Computing Symposium 2004, pp. 17-24.
      • A. Nukada, A. Nishida and Y. Oyanagi. Parallel Implementation of FFT Algorithm on Distributed Shared Memory Architecture and its Optimization. IPSJ Transactions on Advanced Computing Systems, Vol. 44, No. SIG 6 (ACS 1), pp. 1-8, 2003.
    • In-place FFT algorithm
      • Less memory usage
      • Needs a bit-reversal pass
      • Implemented on an Itanium server (NEC AzusA): 2.9 Gflops with 8 PEs (12.4% of peak performance)
    • Radix-8 FFT kernel for multiply-add instructions
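The in-place organization mentioned above, with its bit-reversal pass, can be sketched with a radix-2 kernel. The slides' optimized kernel is radix-8; radix-2 is used here only because it is short:

```python
import numpy as np

def fft_inplace(a):
    """In-place iterative radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    # bit-reversal pass: reorder so the butterflies can run in place
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # butterfly passes of widths 2, 4, ..., n
    m = 2
    while m <= n:
        w_m = np.exp(-2j * np.pi / m)
        for k in range(0, n, m):
            w = 1.0
            for t in range(m // 2):
                u = a[k + t]
                v = w * a[k + t + m // 2]
                a[k + t] = u + v
                a[k + t + m // 2] = u - v
                w *= w_m
        m *= 2
    return a
```

Because the permutation and butterflies both overwrite the input array, no second buffer is needed, which is the memory saving the slide refers to.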
SLIDE 21

Target (3): Software and Implementations

  • Provide general-purpose, easy-to-use software infrastructure
  • Surveys of the status and directions of programming technologies
    • Scalability
      • HPF(JA): developed by HPFPC and the Earth Simulator Center
      • Co-Array Fortran: developed by Cray (for the T3E); an Open64-based implementation is available from Rice Univ.; requested for the next version of Fortran
      • MPI: standard for message passing on distributed memory architectures
      • Global Arrays: API based, easy to implement
    • Object oriented programming concepts
      • Access to objects via APIs only
      • Languages supporting OO concepts, such as Fortran 9x/200x or C++
SLIDE 22

Current Status: Software and Implementations

  • Parallel implementation
    • Joint research with the Tokyo Univ. COE project "Information Science and Technology Strategic Core"
  • Optimization of communication in cluster/grid environments
    • Y. Hourai, A. Nishida, and Y. Oyanagi. Optimal Broadcast Scheduling on Tree-structured Networks. IPSJ Transactions on Advanced Computing Systems, to appear.
    • Traditional implementations of broadcast communication (MPICH-Score and LAM) fix or ignore the network topology (most implementations just shift the schedule of process ID 0 for the other processes); e.g. performance changes significantly when the broadcast root is altered in a naive implementation of binary tree based algorithms
    • Optimization considering parameterized bandwidth and latency is NP-hard
    • Reduction of redundancy using isomorphism gives faster broadcasts than MPICH-Score and LAM

[Figure: broadcast schedules on a LAN of dual-CPU nodes, comparing different choices of root.]
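The "shift the schedule of process ID 0" behavior criticized above is easy to see in a topology-oblivious binomial-tree broadcast. This sketch (names illustrative) computes the per-round (sender, receiver) pairs for any root purely by relabeling ranks, so the communication pattern never adapts to the physical network:

```python
def binomial_broadcast_schedule(p, root=0):
    """Per-round (sender, receiver) pairs for a binomial-tree broadcast among
    p processes. Changing the root only rotates every rank by the same offset,
    regardless of which processes share a node or a switch."""
    rounds, d = [], 1
    while d < p:
        step = [((rel + root) % p, (rel + d + root) % p)
                for rel in range(d) if rel + d < p]
        rounds.append(step)
        d *= 2
    return rounds

# binomial_broadcast_schedule(8) ->
#   [[(0, 1)], [(0, 2), (1, 3)], [(0, 4), (1, 5), (2, 6), (3, 7)]]
```

On a cluster of dual-CPU nodes, rotating the root can turn cheap intra-node sends into cross-switch sends, which is why a naive binary-tree broadcast shows the root-dependent performance the slide reports.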

SLIDE 23

Concluding Remarks

  • The performance of computers will keep making rapid progress
  • Parallel simulation technology will be used in wider areas with the popularization of distributed computing
  • Domestic efforts on software infrastructure for massively parallel applications will help to:
    • Produce intellectual property
      • Design for long term use at home and overseas
      • Expected to be used by researchers at supercomputing centers and research laboratories as practical components
      • Publish official manuals on the algorithms and their usage
      • Target a standard high quality library
    • Create new technical infrastructure
      • Distribution of high quality common components for scientific simulation
      • Establishment of reliable design/evaluation methodologies via feedback from users