
ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING - PowerPoint PPT Presentation



  1. ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING. JACK DONGARRA, UNIVERSITY OF TENNESSEE AND OAK RIDGE NATIONAL LAB

  2. Over Last 20 Years - Performance Development [Top500 chart, 1993-2012, log scale from 100 Mflop/s to 100 Pflop/s]: the SUM of all 500 systems grew from 1.17 TFlop/s to 123 PFlop/s, N=1 from 59.7 GFlop/s to 16.3 PFlop/s, and N=500 from 400 MFlop/s to 60.8 TFlop/s; the #500 system trails the #1 system by roughly 6-8 years. For scale, "My Laptop" delivers 70 Gflop/s and "My iPad2 & iPhone 4s" 1.02 Gflop/s.

  3. June 2012: The TOP10 (computer and site; country; cores; Rmax in Pflops; % of peak; power in MW; MFlops/Watt)
     Rank 1: Sequoia, BlueGene/Q (16c) + custom; DOE/NNSA Lawrence Livermore Nat Lab, USA; 1,572,864 cores; 16.3 Pflops; 81%; 8.6 MW; 1895 MFlops/Watt
     Rank 2: K computer, Fujitsu SPARC64 VIIIfx (8c) + custom; RIKEN Advanced Inst for Comp Sci, Japan; 705,024 cores; 10.5 Pflops; 93%; 12.7 MW; 830 MFlops/Watt
     Rank 3: Mira, BlueGene/Q (16c) + custom; DOE/OS Argonne Nat Lab, USA; 786,432 cores; 8.16 Pflops; 81%; 3.95 MW; 2069 MFlops/Watt
     Rank 4: SuperMUC, Intel (8c) + IB; Leibniz Rechenzentrum, Germany; 147,456 cores; 2.90 Pflops; 90%*; 3.52 MW; 823 MFlops/Watt
     Rank 5: Tianhe-1A, NUDT, Intel (6c) + Nvidia GPU (14c) + custom; Nat. SuperComputer Center in Tianjin, China; 186,368 cores; 2.57 Pflops; 55%; 4.04 MW; 636 MFlops/Watt
     Rank 6: Jaguar, Cray, AMD (16c) + custom; DOE/OS Oak Ridge Nat Lab, USA; 298,592 cores; 1.94 Pflops; 74%; 5.14 MW; 377 MFlops/Watt
     Rank 7: Fermi, BlueGene/Q (16c) + custom; CINECA, Italy; 163,840 cores; 1.73 Pflops; 82%; 0.821 MW; 2099 MFlops/Watt
     Rank 8: JuQUEEN, BlueGene/Q (16c) + custom; Forschungszentrum Juelich (FZJ), Germany; 131,072 cores; 1.38 Pflops; 82%; 0.657 MW; 2099 MFlops/Watt
     Rank 9: Curie, Bull, Intel (8c) + IB; Commissariat a l'Energie Atomique (CEA), France; 77,184 cores; 1.36 Pflops; 82%; 2.25 MW; 604 MFlops/Watt
     Rank 10: Nebulae, Dawning, Intel (6c) + Nvidia GPU (14c) + IB; Nat. Supercomputer Center in Shenzhen, China; 120,640 cores; 1.27 Pflops; 43%; 2.58 MW; 493 MFlops/Watt

  4. Accelerators (58 systems) [stacked bar chart, systems per year 2006-2012, by accelerator type]: NVIDIA 2090 (31), NVIDIA 2050 (12), NVIDIA 2070 (10), IBM PowerXCell 8i (2), ATI GPU (2), Intel MIC (1), Clearspeed CSX600 (0)

  5. Countries Share (absolute counts): US 252, China 68, Japan 35, UK 25, France 22, Germany 20, Switzerland 5

  6. Swiss Machines in Top500 (max: 12, min: 1) [chart of the number and ranks of Swiss systems on each list, Jan 1993 - Apr 2012]: high point 12 systems (6/95); low points 1 system (6/02, 11/02, 6/12)

  7. 28 Systems at > 1 Pflop/s (Peak), as of 10/2/12: US 41 Pflop/s (9 systems), Japan 16.2 (4), China 11.1 (5), Germany 6.9 (4), France 2.92 (2), UK 2.73 (2), Italy 2.1 (1), Russia 1.7 (1)

  8. Linpack Efficiency [scatter chart: Linpack efficiency (0-100%) vs. Top500 rank (1-500)]

  9. Linpack Efficiency [same chart, repeated]

  10. Linpack Efficiency [same chart, repeated]

  11. Performance Development in Top500 [projection chart: the N=1 and N=500 trend lines extrapolated from 1994 through 2020, on a log scale running from 100 Mflop/s to 1 Eflop/s]

  12. The High Cost of Data Movement: energy per operation, 2011 vs. projected 2018
     - DP FMADD flop: 100 pJ -> 10 pJ
     - DP DRAM read: 4800 pJ -> 1920 pJ
     - Local interconnect: 7500 pJ -> 2500 pJ
     - Cross system: 9000 pJ -> 3500 pJ
     Moving data already costs one to two orders of magnitude more energy than a floating-point operation, and the projected 2018 numbers widen that gap.
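As a rough illustration of the table above, the short C sketch below estimates the energy of a simple daxpy (y = a*x + y) at the 2011 and 2018 price points. The traffic count of three DRAM word accesses per element and the reading of the slide's DRAM figure as "per 64-bit access" are assumptions made here, not statements from the slide.

```c
/* Rough energy estimate for y = a*x + y of length n, using the per-operation
 * energies from the slide. Assumptions (not stated on the slide): the FMADD
 * figure is per double-precision fused multiply-add, the DRAM figure is per
 * 64-bit access, and each element needs 2 loads + 1 store. */
#include <stdio.h>

int main(void) {
    const double n = 1e6;                               /* vector length */
    const double flop_2011 = 100.0, dram_2011 = 4800.0; /* picojoules    */
    const double flop_2018 = 10.0,  dram_2018 = 1920.0; /* picojoules    */

    double e_flop_2011 = n * 1.0 * flop_2011;  /* one FMADD per element   */
    double e_dram_2011 = n * 3.0 * dram_2011;  /* load x, load y, store y */
    double e_flop_2018 = n * 1.0 * flop_2018;
    double e_dram_2018 = n * 3.0 * dram_2018;

    printf("2011: arithmetic %.2e pJ, DRAM %.2e pJ (ratio %.0fx)\n",
           e_flop_2011, e_dram_2011, e_dram_2011 / e_flop_2011);
    printf("2018: arithmetic %.2e pJ, DRAM %.2e pJ (ratio %.0fx)\n",
           e_flop_2018, e_dram_2018, e_dram_2018 / e_flop_2018);
    return 0;
}
```

Under these assumptions the memory traffic costs about 144x the arithmetic in 2011 and about 576x in 2018: flops get cheaper faster than data movement does.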

  13. Energy Cost Challenge: at ~$1M per MW, energy costs are substantial
     - 10 Pflop/s in 2011 uses ~10 MW
     - 1 Eflop/s in 2018 would use > 100 MW
     - DOE target: 1 Eflop/s in 2018 at 20 MW

  14. Potential System Architecture with a cap of $200M and 20 MW (2012 BG/Q system vs. a 2019 exascale system; difference in parentheses)
     - System peak: 20 Pflop/s -> 1 Eflop/s (O(100))
     - Power: 8.6 MW -> ~20 MW
     - System memory: 1.6 PB (16*96*1024) -> 32-64 PB (O(10))
     - Node performance: 205 GF/s (16*1.6GHz*8) -> 1.2 or 15 TF/s (O(10) - O(100))
     - Node memory BW: 42.6 GB/s -> 2-4 TB/s (O(1000))
     - Node concurrency: 64 threads -> O(1k) or 10k (O(100) - O(1000))
     - Total node interconnect BW: 20 GB/s -> 200-400 GB/s (O(10))
     - System size (nodes): 98,304 (96*1024) -> O(100,000) or O(1M) (O(100) - O(1000))
     - Total concurrency: 5.97 M -> O(billion) (O(1,000))
     - MTTI: 4 days -> O(<1 day) (down by O(10))

  15. Potential System Architecture with a cap of $200M and 20 MW: the same table as the previous slide, with the target year for the 1 Eflop/s system moved from 2019 to 2022.

  16. Critical Issues at Peta & Exascale for Algorithm and Software Design
     - Synchronization-reducing algorithms: break the fork-join model
     - Communication-reducing algorithms: use methods which have a lower bound on communication
     - Mixed precision methods: 2x speed of ops and 2x speed for data movement (see the sketch after this list)
     - Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware
     - Fault resilient algorithms: implement algorithms that can recover from failures/bit flips
     - Reproducibility of results: today we can't guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this
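To make the mixed-precision item concrete, here is a minimal sketch of iterative refinement: do the O(n^3) factorization and solves in single precision (faster ops, half the memory traffic), then recover double-precision accuracy with cheap O(n^2) refinement steps. The unpivoted LU, the tiny diagonally dominant system, and the fixed three refinement sweeps are simplifications for illustration; production solvers (LAPACK's dsgesv is one example of the idea) use pivoted, blocked kernels and a convergence test.

```c
/* Minimal sketch of mixed-precision iterative refinement for A x = b:
 * factor and solve in single precision, compute residuals and apply
 * corrections in double precision. */
#include <stdio.h>

#define N 4

/* In-place unblocked LU (no pivoting), single precision. */
static void lu_factor(float LU[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            LU[i][k] /= LU[k][k];
            for (int j = k + 1; j < N; j++)
                LU[i][j] -= LU[i][k] * LU[k][j];
        }
}

/* Solve L U x = b with the cached single-precision factors. */
static void lu_solve(float LU[N][N], const float b[N], float x[N]) {
    for (int i = 0; i < N; i++) {                 /* forward substitution */
        x[i] = b[i];
        for (int j = 0; j < i; j++) x[i] -= LU[i][j] * x[j];
    }
    for (int i = N - 1; i >= 0; i--) {            /* back substitution */
        for (int j = i + 1; j < N; j++) x[i] -= LU[i][j] * x[j];
        x[i] /= LU[i][i];
    }
}

int main(void) {
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1, 2, 3, 4}, x[N];
    float LU[N][N], bs[N], xs[N];

    for (int i = 0; i < N; i++) {                 /* demote A and b to float */
        bs[i] = (float)b[i];
        for (int j = 0; j < N; j++) LU[i][j] = (float)A[i][j];
    }
    lu_factor(LU);                                /* O(n^3) work in float */
    lu_solve(LU, bs, xs);                         /* initial solution     */
    for (int i = 0; i < N; i++) x[i] = xs[i];

    for (int it = 0; it < 3; it++) {              /* O(n^2) refinement steps */
        double r[N]; float rs[N], ds[N];
        for (int i = 0; i < N; i++) {             /* residual in double   */
            r[i] = b[i];
            for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
            rs[i] = (float)r[i];
        }
        lu_solve(LU, rs, ds);                     /* correction, reusing LU */
        for (int i = 0; i < N; i++) x[i] += ds[i];
    }
    for (int i = 0; i < N; i++) printf("x[%d] = %.15f\n", i, x[i]);
    return 0;
}
```

The expensive factorization runs at single-precision speed and bandwidth; each refinement step only needs a double-precision residual and a cheap re-solve with the cached factors, so the double-precision answer comes almost for free on well-conditioned problems.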

  17. Major Changes to Algorithms/Software
     - We must rethink the design of our algorithms and software
     - Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing
     - Rethink and rewrite the applications, algorithms, and software
     - Data movement is expensive; flops are cheap

  18. Dense Linear Algebra Software Evolution
     - LINPACK (70's): Level 1 BLAS vector operations
     - LAPACK (80's): Level 3 BLAS block operations
     - ScaLAPACK (90's): PBLAS and BLACS, block-cyclic data distribution (message passing)
     - PLASMA (00's): tile layout and tile operations, dataflow scheduling (tile layout sketched below)
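The tile layout mentioned for PLASMA stores each small nb-by-nb block contiguously rather than spreading it across the columns of one large array. A minimal conversion sketch is below; the function name, the column-major ordering inside each tile, and the assumption that n is a multiple of nb are choices made here for brevity, not PLASMA's actual interface.

```c
/* Sketch: convert a column-major (LAPACK-style) matrix into a tile layout in
 * which each nb-by-nb tile is stored contiguously. Illustration only. */
#include <stdio.h>
#include <string.h>

/* A is column-major with leading dimension lda; tile (ti,tj) is written to
 * tiles + (ti + tj*(n/nb)) * nb*nb, itself stored column-major. */
static void colmajor_to_tiles(const double *A, int lda, int n, int nb,
                              double *tiles) {
    int nt = n / nb;                        /* tiles per dimension */
    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < nt; ti++) {
            double *T = tiles + (size_t)(ti + tj * nt) * nb * nb;
            for (int j = 0; j < nb; j++)    /* one tile column is contiguous */
                memcpy(T + j * nb,
                       A + (ti * nb) + (size_t)(tj * nb + j) * lda,
                       nb * sizeof(double));
        }
}

int main(void) {
    enum { n = 8, nb = 4 };
    double A[n * n], tiles[n * n];
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            A[i + j * n] = i + j * n;       /* value = linear index */
    colmajor_to_tiles(A, n, n, nb, tiles);
    /* element (0,0) of tile (1,0) is global element (4,0), i.e. value 4 */
    printf("tile(1,0)[0,0] = %.0f\n", tiles[1 * nb * nb]);
    return 0;
}
```

Because a tile is contiguous, a tile-sized kernel touches one compact block of memory, which is what lets the tile algorithms on the next slide control cache behavior.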

  19. PLASMA Principles
     - Tile algorithms: minimize capacity misses
     - Tile matrix layout: minimize conflict misses
     - Dynamic DAG scheduling: minimizes idle time, more overlap, asynchronous ops (see the task sketch after this list)
     [diagram: LAPACK, one CPU and cache in front of memory, vs. PLASMA, four CPUs each with its own cache sharing memory]
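Dynamic DAG scheduling can be illustrated with a tiled Cholesky factorization in which every tile kernel is a task and the runtime tracks tile dependencies. The sketch below uses OpenMP task depend clauses with generic CBLAS/LAPACKE tile kernels as stand-ins; PLASMA itself originally ran on its own QUARK runtime, and the matrix sizes, the SPD test matrix, and the use of each tile's first element as the dependency token are choices made here for illustration.

```c
/* Sketch: tiled Cholesky factorization expressed as a DAG of tasks with
 * OpenMP depend clauses. Each NB x NB tile is contiguous and treated as
 * column-major. Build with something like:
 *   cc -fopenmp chol.c -llapacke -lcblas
 * (library names vary; OpenBLAS provides both interfaces). */
#include <stdio.h>
#include <cblas.h>
#include <lapacke.h>

#define NT 4                       /* tiles per dimension */
#define NB 64                      /* tile size           */

static double A[NT][NT][NB * NB];  /* A[i][j] is one contiguous tile */

int main(void) {
    /* Symmetric positive definite test matrix: all ones, plus NT*NB added
     * to the diagonal (strictly diagonally dominant). */
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            for (int e = 0; e < NB * NB; e++)
                A[i][j][e] = 1.0;
    for (int i = 0; i < NT; i++)
        for (int d = 0; d < NB; d++)
            A[i][i][d + d * NB] += (double)NT * NB;

    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < NT; k++) {
            /* Factor the diagonal tile. */
            #pragma omp task depend(inout: A[k][k][0])
            { LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, A[k][k], NB); }

            /* Triangular solves against the panel tiles below it. */
            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
                { cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                              CblasNonUnit, NB, NB, 1.0, A[k][k], NB, A[i][k], NB); }
            }
            /* Trailing submatrix updates. */
            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
                { cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                              -1.0, A[i][k], NB, 1.0, A[i][i], NB); }

                for (int j = k + 1; j < i; j++) {
                    #pragma omp task depend(in: A[i][k][0], A[j][k][0]) depend(inout: A[i][j][0])
                    { cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                  NB, NB, NB, -1.0, A[i][k], NB, A[j][k], NB,
                                  1.0, A[i][j], NB); }
                }
            }
        }
    }   /* implicit barrier: all tasks complete here */
    printf("L(0,0) = %f (expected sqrt(257) = 16.0312...)\n", A[0][0][0]);
    return 0;
}
```

Each kernel fires as soon as the tiles it reads are ready, so panel factorizations, triangular solves, and trailing updates from different steps overlap instead of waiting at a global barrier.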

  20. Fork-Join Parallelization of LU and QR: parallelize the update (dgemm)
     - Easy and done in any reasonable software
     - This is the 2/3 n^3 term in the flop count
     - Can be done efficiently with LAPACK + multithreaded BLAS (see the sketch after this list)
     [execution trace diagram: cores vs. time]
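For contrast with the task-based version above, here is a minimal fork-join sketch of a right-looking blocked LU without pivoting: a sequential panel factorization followed by a parallel trailing-matrix update (the dgemm-like 2/3 n^3 term), with the implicit barrier after each parallel loop playing the role of the join. The unpivoted factorization and the hand-written update loop standing in for a multithreaded dgemm are simplifications for illustration.

```c
/* Sketch of the fork-join pattern this slide describes: per block step, a
 * sequential panel factorization, then a parallel trailing update. The
 * matrix is diagonally dominant so skipping pivoting is safe here.
 * Build with -fopenmp. */
#include <stdio.h>

#define N  512
#define NB 64

static double A[N][N];

int main(void) {
    for (int i = 0; i < N; i++)             /* diagonally dominant test matrix */
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 * N : 1.0;

    for (int k = 0; k < N; k += NB) {
        int kb = (k + NB < N) ? NB : N - k;

        /* Panel factorization: sequential; other cores idle while this runs. */
        for (int p = k; p < k + kb; p++)
            for (int i = p + 1; i < N; i++) {
                A[i][p] /= A[p][p];
                for (int j = p + 1; j < k + kb; j++)
                    A[i][j] -= A[i][p] * A[p][j];
            }

        /* U12 block: forward-substitute with the unit-lower panel factor. */
        for (int p = k; p < k + kb; p++)
            for (int i = p + 1; i < k + kb; i++)
                for (int j = k + kb; j < N; j++)
                    A[i][j] -= A[i][p] * A[p][j];

        /* Fork: parallel trailing update A22 -= L21 * U12 (the dgemm term).
         * The implicit barrier at the end of the loop is the join. */
        #pragma omp parallel for schedule(static)
        for (int i = k + kb; i < N; i++)
            for (int j = k + kb; j < N; j++) {
                double s = 0.0;
                for (int p = k; p < k + kb; p++)
                    s += A[i][p] * A[p][j];
                A[i][j] -= s;
            }
    }
    printf("U(0,0) = %.1f, U(N-1,N-1) = %.3f\n", A[0][0], A[N - 1][N - 1]);
    return 0;
}
```

While each panel is being factored, all cores but one sit idle; that repeated synchronization is exactly the cost the tile/DAG approach on the previous slides removes.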
