An Overview of High Performance An Overview of High Performance - PDF document

An Overview of High Performance An Overview of High Performance Computing, Clusters, and the Grid Computing, Clusters, and the Grid Jack Dongarra University of Tennessee and Oak Ridge National Laboratory 1 Technology Trends: Technology Trends: Microprocessor Capacity Microprocessor Capacity Gordon Moore (co-founder of Intel) Microprocessors have become Electronics Magazine, 1965 smaller, denser, and more powerful. Number of devices/chip doubles Not just processors, bandwidth, every 18 months storage, etc. 2X memory and processor speed and 2X transistors/Chip Every ½ size, cost, & power every 18 1.5 years “Moore’s Law” months. 2 1

Moore’ ’s Law s Law Moore Super Scalar/Vector/Parallel 1 PFlop/s Earth (10 15 ) Parallel Simulator ASCI White ASCI Red Pacific 1 TFlop/s (10 12 ) TMC CM-5 Cray T3D Vector TMC CM-2 Cray 2 1 GFlop/s Cray X-MP Super Scalar (10 9 ) Cray 1 1941 1 (Floating Point operations / second, Flop/s) CDC 7600 IBM 360/195 1945 100 Scalar 1 MFlop/s 1949 1,000 (1 KiloFlop/s, KFlop/s) 1951 10,000 CDC 6600 (10 6 ) 1961 100,000 1964 1,000,000 (1 MegaFlop/s, MFlop/s) IBM 7090 1968 10,000,000 1975 100,000,000 1987 1,000,000,000 (1 GigaFlop/s, GFlop/s) 1992 10,000,000,000 1993 100,000,000,000 1 KFlop/s 1997 1,000,000,000,000 (1 TeraFlop/s, TFlop/s) UNIVAC 1 (10 3 ) 2000 10,000,000,000,000 EDSAC 1 2003 35,000,000,000,000 (35 TFlop/s) 3 1950 1960 1970 1980 1990 2000 2010 H. Meuer, H. Simon, E. Strohmaier, & JD H. Meuer, H. Simon, E. Strohmaier, & JD - Listing of the 500 most powerful Computers in the World - Yardstick: Rmax from LINPACK MPP Ax=b, dense problem TPP performance PP performance Rate Rate - Updated twice a year Si Size ze SC‘xy in the States in November Meeting in Mannheim, Germany in June - All data available from www.top500.org 4 2

What is a What is a Supercomputer? Supercomputer? ♦ A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved. ♦ Over the last 10 years the range for the Top500 has increased greater than Moore’s Law ♦ 1993: � #1 = 59.7 GFlop/s Why do we need them? � #500 = 422 MFlop/s Computational fluid dynamics, ♦ 2003: protein folding, climate modeling, � #1 = 35.8 TFlop/s national security, in particular for � #500 = 403 GFlop/s cryptanalysis and for simulating nuclear weapons to name a few. 5 A Tour de Force in Engineering A Tour de Force in Engineering Homogeneous, Centralized, ♦ Proprietary, Expensive! Target Application: CFD-Weather, ♦ Climate, Earthquakes 640 NEC SX/6 Nodes (mod) ♦ � 5120 CPUs which have vector ops � Each CPU 8 Gflop/s Peak 40 TFlop/s (peak) ♦ ~ 1/2 Billion $ for machine, ♦ software, & building Footprint of 4 tennis courts ♦ 7 MWatts ♦ � Say 10 cent/KWhr - $16.8K/day = $6M/year! Expect to be on top of Top500 ♦ until 60-100 TFlop ASCI machine arrives From the Top500 (November 2003) ♦ � Performance of ESC > Σ Next Top 3 Computers 6 3

November 2003 November 2003 Rmax Rpeak Manufacturer Computer Installation Site Year # Proc Tflop/s Tflop/s Earth Simulator Center 1 NEC Earth-Simulator 35.8 2002 5120 40.90 Yokohama Hewlett- ASCI Q - AlphaServer SC Los Alamos National Laboratory 2 13.9 2002 8192 20.48 Packard ES45/1.25 GHz Los Alamos Virginia Tech Apple G5 Power PC 3 Self 10.3 Blacksburg, VA 2003 2200 17.60 w/Infiniband 4X PowerEdge 1750 P4 Xeon 3.6 Ghz University of Illinois U/C 4 Dell 9.82 2003 2500 15.30 w/Myrinet Urbana/Champaign Hewlett- rx2600 Itanium2 1 GHz Cluster – Pacific Northwest National Laboratory 5 8.63 2003 1936 11.62 Packard w/Quadrics Richland Opteron 2 GHz, Lawrence Livermore National Laboratory 6 Linux NetworX 8.05 2003 2816 11.26 w/Myrinet Livermore MCR Linux Cluster Xeon 2.4 GHz – Lawrence Livermore National Laboratory 7 Linux NetworX 7.63 2002 2304 11.06 w/Quadrics Livermore Lawrence Livermore National Laboratory 8 IBM ASCI White, Sp Power3 375 MHz 7.30 2000 8192 12.29 Livermore NERSC/LBNL 9 IBM SP Power3 375 MHz 16 way 7.30 2002 6656 9.984 Berkeley xSeries Cluster Xeon 2.4 GHz – Lawrence Livermore National Laboratory 10 IBM 6.59 2003 1920 9.216 w/Quadrics Livermore 50% of top500 performance in top 9 machines; 131 system > 1 TFlop/s; 210 machines are clusters 7 TOP500 – – Performance Performance - - Nov 2003 Nov 2003 TOP500 (10 15 ) 1 Pflop/s 528 TF/s SUM 100 Tflop/s 35.8 TF/s NEC 10 Tflop/s ES N=1 (10 12 ) 1.17 TF/s IBM ASCI White 1 Tflop/s LLNL Intel ASCI Red 59.7 GF/s Sandia 100 Gflop/s Fujitsu 403 GF/s N=500 'NWT' NAL 10 Gflop/s (10 9 ) 0.4 GF/s 1 Gflop/s My Laptop 100 Mflop/s 3 4 5 6 7 8 9 0 1 2 3 9 9 9 9 9 9 9 0 0 0 0 - - - - - - - - - - - n n n n n n n n n n n u u u u u u u u u u u J J J J J J J J J J J 8 4

Number of Systems on Top500 > 1 Tflop/s Tflop/s Number of Systems on Top500 > 1 Over Time Over Time 131 140 120 100 Since 1998 ~ doubling every 2 years 80 59 60 46 40 23 12 17 20 7 5 3 2 1 1 1 1 0 M ay-97 M ay-98 M ay-99 M ay-00 M ay-01 M ay-02 M ay-03 M ay-04 N ov-96 N ov-97 N ov-98 N ov-99 N ov-00 N ov-01 N ov-02 N ov-03 9 Factoids on Machines > 1 TFlop/s TFlop/s Factoids on Machines > 1 Year of Introduction for 131Systems ♦ 131 Systems > 1 TFlop/s ♦ 80 Clusters (61%) 100 86 80 ♦ Average rate: 2.44 Tflop/s 60 ♦ Median rate: 1.55 Tflop/s 40 31 ♦ Sum of processors in Top131: 20 7 155,161 3 3 1 0 � Sum for Top500: 267,789 1998 1999 2000 2001 2002 2003 ♦ Average processor count: 1184 ♦ Median processor count: 706 Number of processors 10000 ♦ Numbers of processors � Most number of processors: 9632 26 Num ber of processors � ASCI Red 1000 � Fewest number of processors: 124 71 � Cray X1 100 1 21 41 61 81 101 121 10 Rank 5

Percent Of 131 Systems Which Use The Percent Of 131 Systems Which Use The Following Processors > 1 TFlop/s Following Processors > 1 TFlop/s About a half are based on 32 bit architecture 9 (11) Machines have a Vector instruction Sets Cut by Manufacture of System Promicro Fujitsu 2% Fujitsu Atipa Technology Legend Group Sparc HPTi 1% 1% 2% 1% 1% NEC SGI Intel Cray AMD 3% 1% Hitachi Alpha 1% Hitachi 4% 2% 2% NEC 6% 2% IBM Itanium 3% 24% 9% Visual Technology Linux Networx 1% 5% Self-made 5% SGI IBM 5% Dell 52% Pentium 5% 48% Cray Inc. 4% HP 10% Cut of the data distorts manufacture counts, ie HP(14), IBM > 24% 11 What About Efficiency? What About Efficiency? ♦ Talking about Linpack ♦ What should be the efficiency of a machine on the Top131 be? � Percent of peak for Linpack > 90% ? > 80% ? > 70% ? > 60% ? … ♦ Remember this is O(n 3 ) ops and O(n 2 ) data � Mostly matrix multiply 12 6

Efficiency of Systems > 1 Tflop/s ES ASCI Q VT-Apple NCSA PNNL 1 LANL Lighting LLNL MCR ASCI White NERSC 0.9 LLNL (6.6) 0.8 AMD Cray X1 0.7 Alpha 0.6 IBM Efficency Hitachi 0.5 _ NEC SX 0.4 Pentium Sparc 0.3 Itanium SGI 0.2 100000 Performance Rank 0.1 10000 0 1000 1 21 41 61 81 101 121 Rank 1 21 41 61 81 101 121 13 Rank Commodity Interconnects Commodity Interconnects ♦ Gig Ethernet ♦ Myrinet Clos ♦ Infiniband ♦ QsNet F a t t r e e ♦ SCI T o r u Switch topology $ NIC $Sw/node $ Node Lt(us)/BW (MB/s) (MPI) s Gigabit Ethernet Bus $ 50 $ 50 $ 100 30 / 100 SCI Torus $1,600 $ 0 $1,600 5 / 300 QsNetII Fat Tree $1,200 $1,700 $2,900 3 / 880 Myrinet (D card) Clos $ 700 $ 400 $1,100 6.5/ 240 IB 4x Fat Tree $1,000 $ 400 $1,400 6 / 820 14 7

Efficency of Systems > 1 TFlop/s ES ASCI Q VT-Apple NCSA PNNL LANL Lighting 1 LLNL MCR ASCI White NERSC 0.9 LLNL gigE 0.8 44 Infiniband 0.7 3 0.6 Myrinet Efficency 19 0.5 Quadrics 12 0.4 Proprietary 52 0.3 SCI 1 0.2 0.1 0 1 21 41 61 81 101 121 Rank 15 Interconnects Used Interconnects Used SCI Myrinet 1% 15% Proprietary 39% GigE 34% Quadrics Infiniband 9% 2% Efficiency for Linpack Linpack Efficiency for Largest node count min max average GigE 1024 17% 63% 37% SCI 120 64% 64% 64% QsNetII 2000 68% 78% 74% Myrinet 1250 36% 79% 59% Infiniband 4x 1100 58% 69% 64% Proprietary 9632 45% 98% 68% 16 8

Country Percent by Total Performance Country Percent by Total Performance France Finland India 3% China 0% 0% Malaysia 2% Germany Italy Israel 0% Canada 3% 1% 0% 2% Korea, South 1% Australia Japan Mexico 0% 1% 15% New Zealand 1% United States Netherlands 63% 1% United Kingdom Saudia Arabia 5% 0% Sweden Switzerland 0% 17 0% KFlop/s per Capita (Flops/Pop) per Capita (Flops/Pop) KFlop/s 1000 900 WETA Digital (Lord of the Rings) 800 700 600 500 400 300 200 100 0 a s m d a a h y n e d a l d n s o a y a i e n n d n e c n d a e i i b l i t n o d c s a l u a a n a n a a i i a a d a p t n h o d a x y t a m r l I r e a l n s a a C r r l g I e a t S n t A l r e a e s r w I n J S M l r F i a e z u e C F Z , S i a a K M A h G t d i i w d e t w e e d r t u S e o N i e n a N 18 K t U S i n U 9

An Overview of High Performance An Overview of High Performance - PDF document

An Overview of High Performance An Overview of High Performance Computing, Clusters, and the Grid Computing, Clusters, and the Grid Jack Dongarra University of Tennessee and Oak Ridge National Laboratory 1 Technology Trends: Technology

Getting the Performance Out Of Getting the Performance Out Of High Performance Computing High

High Performance Computing in Web Browsers CE Seminar WT14/15 Henning Lohse High Performance

Drupal High Availability High Performance Samstag, 3. November 12 Drupal High Availability

Parallel Programming and High-Performance Computing Part 2: High-Performance Networks Dr.

High Performance Systems EuroMPI 2015 Objectives Yet another performance analysis tool

Corporate Presentation voestalpine High Performance Metals India voestalpine High Performance

New York University High Performance Computing High Performance Computing Information

Performance and Scalability (Chapter 11) Performance and Scalability Performance: How long

March 2019 CONTENTS Page Combined Partner Performance 1 Breckland Performance Reports 2-6

Performance Bas Performance Bas Performance Bas Performance Bas ed ed ed ed Methodology for

Verification Verification, Performance Performance Analysis Performance Performance Analysis

High Performance Computing at High Performance Computing at the University of Utah: A User the

An Overview of High An Overview of High Performance Computing and Performance Computing and

PERFORMANCE MANAGEMENT Presentation Outline Performance Management definition and rationale.

Eagle Scholars: High Eagle Scholars: High Eagle Scholars: High Eagle Scholars: High Eagle

X10: a High-Productivity Approach to X10: a High-Productivity Approach to High Performance

CSE 473: Artificial Intelligence Advanced Applic's: Natural Language Processing Steve Tanimoto

CEE 370 Environmental Engineering Principles Lecture #30 Wastewater Treatment I: WW

Endocrine Disruption as the pilote of mapping the Human Toxome Thomas Hartung Doerenkamp-Zbinden

Transceivers and semiconductor lasers for photonic networks 12/06/2007 Michele Agresti Michele

Moving Magic: Creating Compelling 4K Content on a Budget Jeffrey Weekley Research Associate of

Entry ry-Level Categories Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L.

The Structure of Graphs Without Even Holes or Odd Pans Kathie Cameron Department of Mathematics

when web services go bad steve loughran hp laboratories notes from the field slo@hpl.hp.com