Institut für Mathematik, Universität Augsburg: Ulrich Rüde



  1. Efficiency of numerical algorithms on future high performance supercomputers
     Ulrich Rüde, Institut für Mathematik, Universität Augsburg
     http://scicomp.math.uni-augsburg.de/ruede/me.html
     DFG project: Datenlokale Iterationsverfahren (data-local iterative methods)
     March 1998

  2. Outline
     - The efficiency paradox
     - What is wrong about our algorithms and programs
     - Cache oriented iterative methods
     - High performance computer architecture
     - Scientific computing in the future

  3. Ideally ...
     Mathematical arguments predict that
     - multigrid with nested iteration can solve scalar elliptic PDE with approximately
       - 100 operations
       - storing 8 reals per unknown
     - on a workstation
       - that can do 1 × 10^9 operations (= 1000 MFlop) per second
       - in 64 × 10^6 words (= 512 MByte) of storage
     and so, we can solve for 10^7 unknowns in about one second.
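The estimate on this slide is plain arithmetic and can be checked with a short sketch. All figures are taken from the slide; the 8 bytes per word is inferred from its "64 × 10^6 words (= 512 MByte)", and the function and constant names are illustrative, not from the talk.

```python
# Back-of-the-envelope check of the "ideal" multigrid estimate.
# Figures come from the slide; names are illustrative assumptions.

OPS_PER_UNKNOWN = 100      # operations per unknown
REALS_PER_UNKNOWN = 8      # reals stored per unknown
FLOP_RATE = 1e9            # operations per second (= 1000 MFlop)
MEMORY_WORDS = 64e6        # words of storage
BYTES_PER_WORD = 8         # assumption: 512 MByte / 64e6 words


def unknowns_per_second(flop_rate=FLOP_RATE, ops=OPS_PER_UNKNOWN):
    """Unknowns that can be solved per second if compute is the only limit."""
    return flop_rate / ops


def unknowns_in_memory(words=MEMORY_WORDS, reals=REALS_PER_UNKNOWN):
    """Unknowns that fit into storage at the stated reals per unknown."""
    return words / reals


if __name__ == "__main__":
    print(f"compute-bound: {unknowns_per_second():.1e} unknowns per second")
    print(f"memory-bound:  {unknowns_in_memory():.1e} unknowns")
    print(f"storage:       {MEMORY_WORDS * BYTES_PER_WORD / 1e6:.0f} MByte")
```

The memory bound (8 × 10^6 unknowns) lands slightly below the compute bound (10^7), which fits the slide's "about one second" wording.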

  4. In practice ...
     our programs
     - can sometimes only do 10^5 unknowns
     - on a (massively parallel) supercomputer
     - where it does not run for hours

  5. Run time comparison of iterative algorithms on uniform grids
     - Standard Multigrid
     - Adaptive Multigrid
     - SOR
     - SOR with cache optimization

  6. Efficiency of Poisson Solvers
     Benchmark suggested by Botta et al. in: How fast the Laplace Equation was solved in 1995.
     [Figure: Performance of Multigrid Poisson Solver. Time per unknown (microseconds) versus level L (grid size = 2^L, L = 4 to 14) for Digital PWS 500 au; SGI O200, 180 MHz; HP 9000/755, 99 MHz; P-II/266 (SDRAM); P-Pro/200.]
     - With 1 GFlop performance, 250 operations per unknown should be executed in 0.25 microseconds.
     - For small data sets we thus reach up to 25% of peak performance, for large data sets < 7% of peak performance.
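The relation between measured time per unknown and fraction of peak performance can be sketched as follows. The 1 GFlop peak and the 250 operations per unknown are from the slide; the two measured times passed in at the end are hypothetical sample inputs, not values read off the figure.

```python
PEAK_FLOPS = 1e9         # 1 GFlop peak, as on the slide
OPS_PER_UNKNOWN = 250    # operations per unknown, as on the slide


def ideal_time_us(ops=OPS_PER_UNKNOWN, peak=PEAK_FLOPS):
    """Time per unknown in microseconds when running at peak rate."""
    return ops / peak * 1e6


def peak_fraction(measured_us, ops=OPS_PER_UNKNOWN, peak=PEAK_FLOPS):
    """Fraction of peak performance implied by a measured time per unknown."""
    return ideal_time_us(ops, peak) / measured_us


if __name__ == "__main__":
    print(f"ideal: {ideal_time_us():.2f} microseconds per unknown")
    for t in (1.0, 4.0):  # hypothetical measurements in microseconds
        print(f"{t:.1f} us per unknown -> {100 * peak_fraction(t):.1f}% of peak")
```

Read the other way round, 25% of peak corresponds to 1 microsecond per unknown, and anything slower than about 3.6 microseconds per unknown is below 7% of peak.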

  7. Efficiency (cont'd)
     - Benchmark requires error reduction of 10^6 in the residual.
     - This is oversatisfied by 5 V(2,1) cycles costing 250 floating point operations per unknown.
     - Using a V(2,1)-FMG algorithm this could be reduced by a factor 4.
     Comparison with the two best performers from the Botta paper:
     [Figure: Performance of Multigrid Poisson Solver. Time per unknown (microseconds) versus level L (grid size = 2^L, L = 4 to 14) for Digital PWS 500 au; SGI O200, 180 MHz; HP 9000/755, 99 MHz; P-II/266 (SDRAM); P-Pro/200; MILU-rrb (on HP755); NGILU (on HP755).]

  8. Which algorithms and data structures spoil performance?
     Perform red-black relaxation on Digital Alpha PWS 500au using a
     - structured grid with constant coefficients
     - structured grid with variable coefficients
     - unstructured grid, implemented with linked list, but all data ideally cache aligned
     - unstructured grid, data non cache aligned
     - structured grid, constant coefficients, optimized for cache performance
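For readers unfamiliar with the kernel being benchmarked, here is a minimal sketch of red-black relaxation for the structured, constant-coefficient case. It assumes the 5-point discretization of the Poisson equation on a square grid with boundary values stored in the outer layer of `u`. The talk does not show its code, so the function name and data layout are illustrative only.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for -Laplace(u) = f (5-point stencil).

    u and f are (n+2) x (n+2) lists of lists; the outer layer of u holds the
    boundary values. All red points (i + j even) are relaxed first, then all
    black points (i + j odd). u is updated in place and returned.
    """
    n = len(u) - 2
    for color in (0, 1):
        for i in range(1, n + 1):
            # first interior column j in row i with (i + j) % 2 == color
            j0 = 1 + (i + 1 + color) % 2
            for j in range(j0, n + 1, 2):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  + h * h * f[i][j])
    return u


if __name__ == "__main__":
    n = 3
    u = [[1.0] * (n + 2) for _ in range(n + 2)]   # boundary value 1
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            u[i][j] = 0.0                          # zero initial guess
    f = [[0.0] * (n + 2) for _ in range(n + 2)]
    for _ in range(100):
        red_black_sweep(u, f, 1.0 / (n + 1))
    print(u[2][2])  # approaches the exact solution u = 1
```

Points of one color depend only on points of the other color, which is why each half-sweep can be done in any order.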

  9. Performance of RB
     [Figure: Performance of Red Black Relaxation, MegaFlop versus grid size (16 to 2048), for: structured, const coeff; structured, const coeff, cache tuned; structured, variable coefficients; unstructured, cache aligned access; unstructured, non-cache aligned.]
     [Figure: Performance of Red Black Relaxation, MegaFlop versus grid size (16 to 1024), enlarged view for: structured, variable coefficients; unstructured, cache aligned access; unstructured, non-cache aligned.]
     Performance depending on vector length.

  10. Memory Hierarchy (DEC PWS 600)
      Hierarchy diagram (capacities in words): CPU with 32 registers; 1000 W level 1 cache; 12000 W level 2 cache; 0.5 MW external level 3 cache; 64 MW main memory; 1 GW disk space.

      Level        Capacity   Bandwidth (MB/s)  Latency
      FP Register  256 B      28,800            1.7 ns
      Cache 1      8 KB       19,200            1.7 ns
      Cache 2      96 KB      9,600             5.0 ns
      Cache 3      2 MB       873               23.3 ns
      Main Mem     1,536 MB   1,070             105.0 ns

  11. Backus (1977): "I propose to call this tube the Von-Neumann bottleneck."
      What are the consequences? To avoid inefficiency we must:
      - Avoid dynamic structures. No linked lists, binary trees, etc. on too low granularity. How to implement sparse matrices then? We don't.
      - Exploit instruction-level parallelism. Prepare the codes such that automatic restructuring tools and compilers (optimizers) can extract the parallelism. F90 and HPF array syntax are counter-productive, since we also need to
      - Exploit data locality. Do not program in global sweeps! We cannot save in forming Ax, but we can save when Ax, A^2 x, A^3 x, ... are needed. This is awkward programming and in the future we will need tools for this job!
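One way to picture "do not program in global sweeps" is to fuse the red and black half-sweeps of a red-black relaxation into a single pass over the rows, so that each row is touched while it is still likely to be in cache. The sketch below shows this row-wise fusion for the 5-point Poisson stencil and checks it against the two-pass version. It illustrates the reordering idea only: that it resembles the cache-tuned code measured in the talk is an assumption, all names are hypothetical, and Python itself will not show the cache benefit.

```python
def update_row(u, f, h, i, color):
    """Relax the points of row i whose (i + j) parity equals color."""
    n = len(u) - 2
    j0 = 1 + (i + 1 + color) % 2
    for j in range(j0, n + 1, 2):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                          + u[i][j - 1] + u[i][j + 1]
                          + h * h * f[i][j])


def rb_two_pass(u, f, h):
    """Reference: two global sweeps, all red points, then all black points."""
    n = len(u) - 2
    for color in (0, 1):
        for i in range(1, n + 1):
            update_row(u, f, h, i, color)
    return u


def rb_fused(u, f, h):
    """Same result in one pass: black row i-1 follows red row i."""
    n = len(u) - 2
    for i in range(1, n + 1):
        update_row(u, f, h, i, 0)          # red points of row i
        if i > 1:
            update_row(u, f, h, i - 1, 1)  # black points of row i-1
    update_row(u, f, h, n, 1)              # black points of the last row
    return u


def sample_grid(n):
    """Deterministic (n+2) x (n+2) test data, purely illustrative."""
    return [[((7 * i + 3 * j) % 5) * 0.1 for j in range(n + 2)]
            for i in range(n + 2)]


if __name__ == "__main__":
    n = 6
    f = [[1.0] * (n + 2) for _ in range(n + 2)]
    h = 1.0 / (n + 1)
    print(rb_two_pass(sample_grid(n), f, h) == rb_fused(sample_grid(n), f, h))
```

The fusion is valid because the black points of row i-1 only need the red points of rows i-2, i-1 and i, all of which are already updated once red row i is done, while red row i never reads row i-2.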

  12. PAM - Patch Adaptive Multigrid
      - nodes are grouped in (non-overlapping) patches of fixed size
      - each level consists of a collection of patches
      - patches may be present (live) or
      - patches may be virtual (ghosts)

  13. History and Future of Microprocessors
      Technology data

      Yr  Type   MHz    µm    Trans      CPI  MFlop
      82  80286  12     1.5   0.14 M     30   0.4
      85  80386  33     1.0   0.28 M     12   3.0
      97  21164  625    0.35  9.30 M     0.5  1.25G
      11  Int-X  10000  ?     1000.00 M  ?    ?

      Improvement factors

                   82-97             97-2011
      MHz          50                16
      Transistors  65                100
      MFlop        3000 (≈ 50 × 65)  ???
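The improvement factors follow directly from the technology data, as this small sketch shows. It reads the 1997 entry "1.25G" as 1.25 GFlop = 1250 MFlop, which is an interpretation of the table; the dictionary layout and function name are illustrative.

```python
# Rows of the technology table needed for the factors.
# trans_m: transistors in millions; mflop: None where the slide has "?".
TECH = {
    1982: {"mhz": 12.0, "trans_m": 0.14, "mflop": 0.4},
    1997: {"mhz": 625.0, "trans_m": 9.30, "mflop": 1250.0},  # 1.25G read as GFlop
    2011: {"mhz": 10000.0, "trans_m": 1000.0, "mflop": None},  # projection
}


def factor(key, start, end):
    """Improvement factor of one table column between two years."""
    return TECH[end][key] / TECH[start][key]


if __name__ == "__main__":
    for key in ("mhz", "trans_m", "mflop"):
        print(f"1982-1997 {key}: {factor(key, 1982, 1997):.1f}")
    for key in ("mhz", "trans_m"):
        print(f"1997-2011 {key}: {factor(key, 1997, 2011):.1f}")
```

The computed values (about 52, 66 and 3125 for 1982 to 1997, and 16 and about 108 for 1997 to 2011) match the rounded figures on the slide.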

  14. Dave Patterson (1997 in SIAM News): "Instruction level parallelism is running out of steam."
      Interpretation
      - A microprocessor today
        - is faster than the fastest supercomputer 15 years ago.
        - has the internal parallelism equivalent to the largest parallel processor 15 years ago.
      - A microprocessor in 2011
        - could be faster than the fastest supercomputer today (if we find a way to exploit what technology will make possible)
        - could employ as much internal parallelism as a massively parallel computer today.

  15. J. Goodman & D. Burger (1997 in an IEEE Computer editorial): "The circumstances in which computer architects will find themselves in the next 15 years are truly daunting."
      Memory systems
      - To sustain the peak performance of a 1.25 GFlop processor today, the memory system needs a bandwidth of 30 GByte/sec, but typical (main) memory systems only deliver 1 GByte/sec.
      - A hypothetical 1.25 TFlop processor would need 30 TByte/sec memory bandwidth. If we assume that this processor will have a 4096 bit memory bus, it would still require a bus clock of 60 GHz.
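The bandwidth figures on this slide can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 GByte = 10^9 bytes) and one bus transfer per clock cycle; neither assumption is stated in the talk, and the function names are illustrative.

```python
def bytes_per_flop(bandwidth_bytes_per_s, flops_per_s):
    """Memory traffic per floating point operation at peak rate."""
    return bandwidth_bytes_per_s / flops_per_s


def bus_clock_hz(bandwidth_bytes_per_s, bus_width_bits):
    """Bus clock needed, assuming one full-width transfer per cycle."""
    return bandwidth_bytes_per_s / (bus_width_bits / 8)


if __name__ == "__main__":
    # 1.25 GFlop processor needing 30 GByte/sec
    print(f"{bytes_per_flop(30e9, 1.25e9):.0f} bytes per flop")
    # 1.25 TFlop processor, 30 TByte/sec over a 4096 bit bus
    print(f"{bus_clock_hz(30e12, 4096) / 1e9:.1f} GHz bus clock")
```

Under these assumptions the requirement is 24 bytes (three 8-byte words) per operation, and the 4096 bit bus needs roughly 58.6 GHz, which the slide rounds to 60 GHz.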
