Institut für Mathematik, Universität Augsburg, Ulrich Rüde



SLIDE 1
Institut für Mathematik, Universität Augsburg
Ulrich Rüde
Efficiency of numerical algorithms on future high performance supercomputers
http://scicomp.math.uni-augsburg.de/ruede/me.html
DFG project: Datenlokale Iterationsverfahren (data-local iterative methods)
March 1998
SLIDE 2
Outline
  • The efficiency paradox
  • What is wrong about our algorithms and programs
  • Cache-oriented iterative methods
  • High performance computer architecture
  • Scientific computing in the future
SLIDE 3
Ideally ...
Mathematical arguments predict that
  • multigrid with nested iteration can solve a scalar elliptic PDE with approximately
    - 100 operations
    - storing 8 reals per unknown
  • on a workstation
    - that can do 1 · 10^9 operations (= 1000 MFlop) per second
    - in 64 · 10^6 words (= 512 MByte) of storage
and so we can solve for 10^7 unknowns in about one second.
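The estimate on this slide can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted above; the function and constant names are illustrative, and the 8-byte word follows from the slide's "64 · 10^6 words = 512 MByte".

```python
# Back-of-envelope estimate using only the figures quoted on the slide.
OPS_PER_UNKNOWN = 100          # multigrid with nested iteration
REALS_PER_UNKNOWN = 8          # storage per unknown
MACHINE_OPS_PER_SECOND = 1e9   # 1000 MFlop
MEMORY_WORDS = 64e6            # 512 MByte of 8-byte words


def solve_time_seconds(unknowns):
    """Ideal run time if every operation ran at peak speed."""
    return unknowns * OPS_PER_UNKNOWN / MACHINE_OPS_PER_SECOND


def unknowns_that_fit():
    """Largest problem that fits in memory at 8 reals per unknown."""
    return MEMORY_WORDS / REALS_PER_UNKNOWN


print(solve_time_seconds(1e7))   # 1.0
print(unknowns_that_fit())       # 8000000.0
```

The memory bound comes out at 8 · 10^6 unknowns, which is the same order of magnitude as the 10^7 quoted for the one-second solve.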
SLIDE 4
In practice ...
  • our programs
  • can sometimes only do 10^5 unknowns
  • on a (massively parallel) supercomputer
  • where it does not run for hours
SLIDE 5
Run time comparison of iterative algorithms on uniform grids
  • Standard Multigrid
  • Adaptive Multigrid
  • SOR
  • SOR with cache optimization
SLIDE 6
Efficiency of Poisson Solvers
Benchmark suggested by Botta et al. in: How fast the Laplace Equation was solved in 1995.

[Figure: Performance of Multigrid Poisson Solver. x-axis: Level L (Gridsize = 2^L); y-axis: Time per unknown (microseconds). Curves: Digital PWS 500 au; SGI O200, 180 MHz; HP 9000/755, 99 MHz; P-II/266 (SDRAM); P-Pro/200.]

  • With 1 GFlop performance, 250 operations per unknown should be executed in 0.25 microseconds.
  • For small data sets we thus have up to 25% of peak performance, for large data sets <7% of peak performance.
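The percentages above follow from dividing the ideal time per unknown by the measured one. A minimal sketch, assuming the slide's 250 operations per unknown and 1 GFlop peak; the two sample timings are illustrative values and are not read from the figure.

```python
def fraction_of_peak(time_per_unknown_us, ops_per_unknown=250, peak_flops=1e9):
    """Achieved share of peak, given a measured time per unknown in microseconds."""
    ideal_us = ops_per_unknown / peak_flops * 1e6   # 0.25 microseconds at 1 GFlop
    return ideal_us / time_per_unknown_us


# Hypothetical timings: 1 microsecond and 4 microseconds per unknown.
for measured_us in (1.0, 4.0):
    print(measured_us, round(100 * fraction_of_peak(measured_us), 1), "% of peak")
```

Under these assumptions, 25% of peak corresponds to 1 microsecond per unknown, and anything slower than roughly 3.6 microseconds per unknown falls below 7%.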
SLIDE 7
Efficiency (cont'd)
  • Benchmark requires error reduction of 10^6 in the residual.
  • This is oversatisfied by 5 V(2,1) cycles costing 250 floating point operations per unknown.
  • Using a V(2,1)-FMG algorithm this could be reduced by a factor 4.
Comparison with the two best performers from the Botta paper

[Figure: Performance of Multigrid Poisson Solver. x-axis: Level L (Gridsize = 2^L); y-axis: Time per unknown (microseconds). Curves: Digital PWS 500 au; SGI O200, 180 MHz; HP 9000/755, 99 MHz; P-II/266 (SDRAM); P-Pro/200; MILU-rrb (on HP755); NGILU (on HP755).]
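To make the notation concrete, here is a minimal sketch of a V(2,1) cycle: two smoothing sweeps before the coarse-grid correction and one after. It is not the benchmark code. It assumes a 1D model problem -u'' = f with zero boundary values, and the red-black Gauss-Seidel smoother, full-weighting restriction, linear interpolation and all function names are choices made for this illustration. In 1D this combination reduces the residual much faster than a 2D solver would, so the sketch shows the structure of the cycle, not the benchmark's convergence rate or operation count. FMG is not shown.

```python
import math


def smooth(u, f, h, sweeps):
    """Red-black Gauss-Seidel for -u'' = f with u fixed at both ends."""
    n = len(u) - 1
    for _ in range(sweeps):
        for start in (1, 2):                      # odd points first, then even points
            for i in range(start, n, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    n = len(u) - 1
    r = [0.0] * (n + 1)
    for i in range(1, n):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def v_cycle(u, f, h, pre=2, post=1):
    """One V(pre, post) cycle; the grid must have 2^k intervals."""
    n = len(u) - 1
    if n == 2:                                    # coarsest grid: one unknown, solve exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return
    smooth(u, f, h, pre)
    r = residual(u, f, h)
    nc = n // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):                        # full-weighting restriction
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    ec = [0.0] * (nc + 1)
    v_cycle(ec, rc, 2.0 * h, pre, post)           # coarse-grid error equation
    for i in range(1, nc):                        # correct points shared with the coarse grid
        u[2 * i] += ec[i]
    for i in range(nc):                           # linear interpolation in between
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1])
    smooth(u, f, h, post)


def residual_reduction(level=6, cycles=5):
    """Factor by which `cycles` V(2,1) cycles reduce the maximum residual."""
    n = 2 ** level
    h = 1.0 / n
    f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n + 1)]
    u = [0.0] * (n + 1)
    r0 = max(abs(v) for v in residual(u, f, h))
    for _ in range(cycles):
        v_cycle(u, f, h)
    return r0 / max(max(abs(v) for v in residual(u, f, h)), 1e-300)


print(residual_reduction())
```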
SLIDE 8
Which algorithms and data structures spoil performance?
Perform red-black relaxation on Digital Alpha PWS 500au using a
  • structured grid with constant coefficients
  • structured grid with variable coefficients
  • unstructured grid, implemented with linked list, but all data ideally cache aligned
  • unstructured grid, data non cache aligned
  • structured grid, constant coefficients, optimized for cache performance
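For reference, the first variant in the list (structured grid, constant coefficients, no cache optimization) might look like the following sketch. It assumes the 5-point stencil for -Δu = f on a square grid stored as a list of rows; the function name and the storage layout are illustrative and not taken from the slides.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep with the 5-point stencil.

    u and f are (n+1) x (n+1) lists of rows; boundary values of u stay fixed.
    """
    n = len(u) - 1
    for color in (0, 1):                      # 0: points with i+j even, 1: i+j odd
        for i in range(1, n):
            start = 1 + (i + 1 + color) % 2   # first interior j with (i+j) % 2 == color
            for j in range(start, n, 2):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  + h * h * f[i][j])


# Usage: boundary value 1 everywhere, f = 0, so the interior converges to 1.
n = 8
u = [[1.0 if i in (0, n) or j in (0, n) else 0.0 for j in range(n + 1)]
     for i in range(n + 1)]
f = [[0.0] * (n + 1) for _ in range(n + 1)]
for _ in range(200):
    red_black_sweep(u, f, 1.0 / n)
print(abs(u[4][4] - 1.0) < 1e-9)
```

Each half-sweep touches every second point of every row, so the whole grid is streamed through the memory hierarchy twice per sweep. That access pattern is what the cache-optimized variant in the list aims to improve.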
SLIDE 9
Performance of RB

[Figure: Performance of Red Black Relaxation, two panels. x-axis: Gridsize (16 to 2048 in the first panel, 16 to 1024 in the second); y-axis: MegaFlop. First panel curves: Structured, const coeff; Structured, const coeff, cache tuned; Structured, variable coefficients; Unstructured, cache aligned access; Unstructured, non-cache aligned. Second panel curves: Structured, variable coefficients; Unstructured, cache aligned access; Unstructured, non-cache aligned.]

Performance depending on vector length
SLIDE 10
Memory Hierarchy (DEC PWS 600)

[Diagram: CPU; 32 Registers; 1000 W. Level 1 Cache; 12000 W. Level 2 Cache; 0.5 MW ext. Level 3 Cache; 64 MW Main Memory; 1 GW Disk Space]

Level         Capacity    Bandwidth (MB/s)   Latency
FP Register   256 B       28,800             1.7 ns
Cache 1       8 KB        19,200             1.7 ns
Cache 2       96 KB       9,600              5.0 ns
Cache 3       2 MB        873                23.3 ns
Main Mem      1,536 MB    1,070              105.0 ns
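The table becomes easier to compare with flop rates once bandwidth is expressed in words per second and latency in clock cycles. A minimal sketch using the table values; the 600 MHz clock is an assumption suggested by the model name and is not stated on the slide, and the 8-byte word matches the "64 MW = 512 MByte" convention used earlier in the talk.

```python
# (level, bandwidth in MB/s, latency in ns), copied from the table on the slide
HIERARCHY = [
    ("FP Register", 28800, 1.7),
    ("Cache 1", 19200, 1.7),
    ("Cache 2", 9600, 5.0),
    ("Cache 3", 873, 23.3),
    ("Main Mem", 1070, 105.0),
]
CLOCK_MHZ = 600       # assumption: not stated on the slide
BYTES_PER_WORD = 8    # one double-precision real per word


def mwords_per_second(bandwidth_mb_s):
    """Millions of 8-byte words the level can deliver per second."""
    return bandwidth_mb_s / BYTES_PER_WORD


def latency_cycles(latency_ns):
    """Latency expressed in clock cycles at the assumed clock rate."""
    return latency_ns * CLOCK_MHZ / 1000.0


for name, bandwidth, latency in HIERARCHY:
    print(f"{name:12s} {mwords_per_second(bandwidth):8.1f} Mwords/s "
          f"{latency_cycles(latency):6.1f} cycles")
```

Under these assumptions main memory delivers about 134 million words per second with a latency of about 63 cycles, so a kernel that needs one fresh operand from main memory per floating point operation cannot run much faster than roughly 134 MFlop.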
SLIDE 11
Backus (1977): "I propose to call this tube the Von-Neumann bottleneck."
What are the consequences? To avoid inefficiency we must:
  • Avoid dynamic structures. No linked lists, binary trees, etc. on too low granularity. How to implement sparse matrices then? We don't.
  • Exploit instruction-level parallelism. Prepare the codes such that automatic restructuring tools and compilers (optimizers) can extract the parallelism. F90 and HPF array syntax are counter-productive, since we also need to
  • Exploit data locality. Do not program in global sweeps! We cannot save in forming Ax, but we can save when Ax, A^2 x, A^3 x, ... are needed. This is awkward programming and in the future we will need tools for this job!
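The remark about Ax, A^2 x, ... can be illustrated with a small sketch that applies a stencil twice, once as two global sweeps and once block by block. Everything here is an assumption made for illustration: the averaging stencil standing in for A, the overlapped-tiling scheme with a halo of two points, the block size and the function names. It is not the author's implementation.

```python
def sweep(x):
    """Apply the stencil once: interior points become the mean of their neighbours."""
    y = list(x)
    for i in range(1, len(x) - 1):
        y[i] = 0.5 * (x[i - 1] + x[i + 1])
    return y


def two_sweeps_global(x):
    """A^2 x as two passes over the whole vector."""
    return sweep(sweep(x))


def two_sweeps_blocked(x, block=64):
    """A^2 x computed block by block.

    Each block is loaded once together with a halo of two points per side,
    and both stencil applications are done on that small piece. The halo
    work is redundant, but the data is touched while it is still local.
    """
    n = len(x)
    out = [0.0] * n
    for s in range(0, n, block):
        e = min(s + block, n)
        lo, hi = max(s - 2, 0), min(e + 2, n)
        local = sweep(sweep(x[lo:hi]))
        out[s:e] = local[s - lo:e - lo]
    return out


x = [float((7 * i) % 13) for i in range(1000)]
print(two_sweeps_blocked(x) == two_sweeps_global(x))
```

Both versions perform the same arithmetic on the points they keep, so the results agree exactly. In pure Python no speed-up should be expected; the sketch only shows the restructuring, which is the kind of awkward programming the slide refers to.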
SLIDE 12
PAM: Patch Adaptive Multigrid
  • nodes are grouped in (non-overlapping) patches of fixed size
  • each level consists of a collection of patches
  • patches may be present (live) or
  • patches may be virtual (ghosts)
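One way to picture the patch structure described above is the following sketch. It is hypothetical throughout: the class names, the patch size of 8 nodes, the 2D indexing and the choice to represent a ghost patch as one without storage are assumptions for illustration, since the slide only states that patches have fixed size, do not overlap, and are either live or virtual.

```python
from dataclasses import dataclass, field

PATCH_SIZE = 8  # hypothetical fixed patch edge length, in nodes


@dataclass
class Patch:
    index: tuple            # (row, column) of the patch within its level
    values: list = None     # node values; None marks a virtual (ghost) patch

    @property
    def live(self):
        return self.values is not None


@dataclass
class Level:
    patches: dict = field(default_factory=dict)   # only live patches are stored

    def patch_index(self, i, j):
        """Non-overlapping patches: every node belongs to exactly one patch."""
        return (i // PATCH_SIZE, j // PATCH_SIZE)

    def get(self, index):
        """Return the stored patch, or a ghost if it was never activated."""
        return self.patches.get(index, Patch(index))

    def activate(self, index):
        """Allocate storage so that the patch becomes live."""
        patch = Patch(index, [[0.0] * PATCH_SIZE for _ in range(PATCH_SIZE)])
        self.patches[index] = patch
        return patch


level = Level()
level.activate(level.patch_index(9, 3))
print(level.get((1, 0)).live, level.get((3, 3)).live)
```

Keeping each patch as one dense block of fixed size avoids per-node dynamic structures, in line with the "avoid dynamic structures on too low granularity" rule stated earlier in the talk.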
SLIDE 13
History and Future of Microprocessors
Technology data
Yr    Type    MHz     µm     Trans       CPI    MFlop
82    80286   12      1.5    0.14 M      30     0.4
85    80386   33      1.0    0.28 M      12     3.0
97    21164   625     0.35   9.30 M      0.5    1.25G
11    Int-X   10000   ?      1000.00 M   ?      ?
Improvement Factors
               82-97               97-2011
MHz            50                  16
Transistors    65                  100
MFlop          3000 (≈ 50 · 65)    ???
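The rounded improvement factors can be checked against the technology data. A minimal sketch using only values from the table above; the 2011 row is the slide's projection, and the dictionary layout and function name are illustrative.

```python
# Clock rate, transistor count (millions) and MFlop, copied from the table.
TECH = {
    1982: {"mhz": 12.0, "transistors_m": 0.14, "mflop": 0.4},
    1997: {"mhz": 625.0, "transistors_m": 9.30, "mflop": 1250.0},
    2011: {"mhz": 10000.0, "transistors_m": 1000.0, "mflop": None},  # projection
}


def factor(key, year_from, year_to):
    """Improvement factor of one quantity between two table rows."""
    return TECH[year_to][key] / TECH[year_from][key]


print(round(factor("mhz", 1982, 1997)))            # 52, quoted as 50
print(round(factor("transistors_m", 1982, 1997)))  # 66, quoted as 65
print(round(factor("mflop", 1982, 1997)))          # 3125, quoted as 3000
print(round(factor("mhz", 1997, 2011)))            # 16
print(round(factor("transistors_m", 1997, 2011)))  # 108, quoted as 100
```

The MFlop gain from 1982 to 1997 is close to the product of the clock and transistor factors, which is the observation behind the "50 · 65" note on the slide.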
SLIDE 14
Dave Patterson (1997 in SIAM News): "Instruction level parallelism is running out of steam."
Interpretation
  • A microprocessor today
    - is faster than the fastest supercomputer 15 years ago.
    - has the internal parallelism equivalent to the largest parallel processor 15 years ago.
  • A microprocessor in 2011
    - could be faster than the fastest supercomputer today (if we find a way to exploit what technology will make possible)
    - could employ as much internal parallelism as a massively parallel computer today.
SLIDE 15
J. Goodman & D. Burger (1997 in an IEEE Computer editorial): "The circumstances in which computer architects will find themselves in the next 15 years are truly daunting."
Memory systems
  • To sustain the peak performance of a 1.25 GFlop processor today, the memory system needs a bandwidth of 30 GByte/sec, but typical (main) memory systems only deliver 1 GByte/sec.
  • A hypothetical 1.25 TFlop processor would need 30 TByte/sec memory bandwidth. If we assume that this processor will have a 4096 bit memory bus, it would still require a bus clock of 60 GHz.
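The bandwidth figures above can be reproduced as follows. This is a minimal sketch using the slide's numbers; it assumes one bus transfer per clock cycle, and the function names are illustrative.

```python
def bytes_per_flop(bandwidth_bytes_per_s, flops_per_s):
    """Memory traffic needed per floating point operation at peak speed."""
    return bandwidth_bytes_per_s / flops_per_s


def bus_clock_hz(bandwidth_bytes_per_s, bus_width_bits):
    """Bus clock needed to deliver the bandwidth with one transfer per cycle."""
    return bandwidth_bytes_per_s / (bus_width_bits / 8)


print(bytes_per_flop(30e9, 1.25e9))       # 24 bytes, i.e. three 8-byte words per flop
print(bus_clock_hz(30e12, 4096) / 1e9)    # about 58.6 GHz, quoted as 60 GHz
```

At 24 bytes per operation, a memory system that delivers 1 GByte/sec can feed only about 1/30 of the 1.25 GFlop peak unless most operands come from registers or cache.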
SLIDE 16
Supercomputing: The example of the US ASC Initiative
  • Standard (but yet to develop) components for processors, memory, and networks
  • Special development of overall system through manufacturer contracts
  • 30 TFlop in 2001 and 100 TFlop in 2003
  • Can such massively parallel systems (with 10,000s of processors) be used for realistic problems?
  • However, besides the quantitative aspect, there is nothing fundamentally new in sight for parallel architectures.
  • Processors will be designed primarily for non-scientific applications
  • SC will be a side product of e.g. the graphics or signal processing capabilities of such chips.
  • All algorithms will have to be optimized for data locality.
  • The architecture of these processors may be radically different from the present ones.
SLIDE 17
Industrial Research and Development
Provided we can exploit the performance potential of future microprocessors, the
  • hardware cost of many useful (and realistic) simulations will be trivial
  • scientific computing and simulation will increasingly be used as a standard method in the computer aided design of technical devices.
  • Software and the training of engineers are the bottleneck for industrial use.
SLIDE 18
Embedded Supercomputing
Present embedded control is mostly based on heuristic ad-hoc algorithms. With improving performance/price ratio, it will increasingly become feasible to solve complex control problems based on physical models in real time in embedded systems. Examples include the
  • dynamic behavior of vehicles (ABS, engine control)
  • traffic control
  • chemical or physical processes
  • computer vision, graphics
  • etc.
SLIDE 19
Burger and Goodman (1997): "This is an exciting time to be an architect"
Summary and Conclusions
  • We need to cultivate the art of performance comparisons of
    - algorithms
    - programming techniques
    - computers
  • we are experiencing a fundamental change in computer architecture
  • these changes do have a significant effect on how we should design our algorithms
  • even more dramatic changes can be expected for the next few years