High-Performance Computing Today - Jack Dongarra


  1. High-Performance Computing Today
     Jack Dongarra, Innovative Computing Laboratory, University of Tennessee and Oak Ridge National Laboratory
     http://www.cs.utk.edu/~dongarra

     Outline
     - Look at trends in HPC: Top500 statistics
     - Performance of super-scalar processors: ATLAS
     - Performance monitoring: PAPI
     - NetSolve: an example of grid middleware

     "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." -- Grace Hopper

  2. Technology Trends: Microprocessor Capacity
     Moore's Law: 2X transistors per chip every 1.5 years. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful.

     High Performance Computers & Numerical Libraries
     - 20 years ago: 1x10^6 floating point ops/sec (Mflop/s)
       - Scalar based; loop unrolling (see the sketch after this slide)
     - 10 years ago: 1x10^9 floating point ops/sec (Gflop/s)
       - Vector and shared-memory computing, bandwidth aware
       - Block partitioned, latency tolerant
     - Today: 1x10^12 floating point ops/sec (Tflop/s)
       - Highly parallel, distributed processing, message passing, network based
       - Data decomposition, communication/computation overlap
     - 10 years away: 1x10^15 floating point ops/sec (Pflop/s)
       - Many more levels of memory hierarchy; combination of grids & HPC
       - More adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
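     A minimal sketch (not from the slides) of the "loop unrolling" technique named for the Mflop/s era: a DAXPY kernel written straight and 4-way unrolled. Names and sizes here are illustrative only.

     /* daxpy straight vs. 4-way unrolled; build with any C compiler, e.g. cc -O2 */
     #include <stdio.h>

     #define N 1000

     /* straightforward version: one multiply-add per iteration */
     static void daxpy(int n, double a, const double *x, double *y) {
         for (int i = 0; i < n; i++)
             y[i] += a * x[i];
     }

     /* 4-way unrolled version: fewer branch tests, more work per iteration */
     static void daxpy_unrolled(int n, double a, const double *x, double *y) {
         int i;
         for (i = 0; i + 3 < n; i += 4) {
             y[i]     += a * x[i];
             y[i + 1] += a * x[i + 1];
             y[i + 2] += a * x[i + 2];
             y[i + 3] += a * x[i + 3];
         }
         for (; i < n; i++)            /* clean-up loop for the remainder */
             y[i] += a * x[i];
     }

     int main(void) {
         double x[N], y[N];
         for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
         daxpy(N, 0.5, x, y);
         daxpy_unrolled(N, 0.5, x, y);
         printf("y[0] = %f\n", y[0]);  /* 2.0 + 0.5 + 0.5 = 3.0 */
         return 0;
     }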

  3. TOP500
     - Listing of the 500 most powerful computers in the world
     - Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem; TPP performance rate and problem size) - see the timing sketch after this slide
     - Updated twice a year: at SC'xy in the States in November, and at the meeting in Mannheim, Germany in June
     - All data available from www.top500.org

     Fastest Computer Over Time (chart, 1990-2000, GFlop/s): Cray Y-MP (8 procs), Fujitsu VP-2600, TMC CM-2 (2048 procs).
     In 1980 a computation that took 1 full year to complete can now be done in 1 month!
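     The Rmax yardstick above comes from timing a dense Ax = b solve. Below is a minimal sketch of how such a run can be timed and converted to flop/s, assuming a Fortran LAPACK is installed (link with -llapack -lblas); the 2/3 n^3 flop count is the standard LINPACK convention.

     #include <stdio.h>
     #include <stdlib.h>
     #include <time.h>

     /* Fortran LAPACK driver for a general dense solve (LU factor + solve) */
     extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                        int *ipiv, double *b, const int *ldb, int *info);

     int main(void) {
         const int n = 1000, nrhs = 1;
         double *a = malloc((size_t)n * n * sizeof *a);
         double *b = malloc((size_t)n * sizeof *b);
         int *ipiv = malloc((size_t)n * sizeof *ipiv);
         int info;

         /* fill A and b with arbitrary data for the benchmark run */
         for (int i = 0; i < n * n; i++) a[i] = (double)rand() / RAND_MAX;
         for (int i = 0; i < n; i++)     b[i] = 1.0;

         clock_t t0 = clock();
         dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
         double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

         /* LINPACK flop count convention: 2/3 n^3 + 2 n^2 */
         double flops = 2.0 / 3.0 * n * (double)n * n + 2.0 * n * (double)n;
         printf("info=%d  time=%.3f s  %.2f Gflop/s\n", info, secs, flops / secs / 1e9);

         free(a); free(b); free(ipiv);
         return 0;
     }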

  4. Fastest Computer Over Time (chart, 1990-2000, up to 500 GFlop/s): Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040).
     In 1980 a computation that took 1 full year to complete can now be done in 4 days!

     Fastest Computer Over Time (chart, 1990-2000, up to 5000 GFlop/s): adds Intel ASCI Red (9152), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424).
     In 1980 a computation that took 1 full year to complete can today be done in 1 hour!

  5. Top 10 Machines (June 2000)

     Rank  Company    Machine                          Procs  Gflop/s  Place                                                 Country  Year
      1    Intel      ASCI Red                          9632   2380    Sandia National Labs, Albuquerque                     USA      1999
      2    IBM        ASCI Blue-Pacific SST, SP 604e    5808   2144    Lawrence Livermore National Laboratory, Livermore     USA      1999
      3    SGI        ASCI Blue Mountain                6144   1608    Los Alamos National Laboratory, Los Alamos            USA      1998
      4    Hitachi    SR8000-F1/112                      112   1035    Leibniz Rechenzentrum, Muenchen                       Germany  2000
      5    Hitachi    SR8000-F1/100                      100    917    High Energy Accelerator Research Org. (KEK), Tsukuba  Japan    2000
      6    Cray Inc.  T3E1200                           1084    892    Government                                            USA      1998
      7    Cray Inc.  T3E1200                           1084    892    US Army HPC Research Center at NCS, Minneapolis       USA      2000
      8    Hitachi    SR8000/128                         128    874    University of Tokyo, Tokyo                            Japan    1999
      9    Cray Inc.  T3E900                            1324    815    Government                                            USA      1997
     10    IBM        SP Power3 375 MHz                 1336    723    Naval Oceanographic Office (NAVOCEANO), Poughkeepsie  USA      2000

     Performance Development (chart, Jun-93 to Jun-00, log scale from 100 Mflop/s to 100 Tflop/s): the Sum of all 500 systems has reached 64.3 TF/s; the N=1 system (Intel ASCI Red at Sandia) is at 2.38 TF/s; the N=500 entry level has risen from systems such as a 69-processor IBM 604e at Nabisco and Sun HPC 10000s at Merrill Lynch, past older vector machines (Cray Y-MP, C94, SNI VP200EX) at sites like EPA, Uni Dresden, and KFA Juelich.
     Notes: the list spanned 60 Gflop/s down to 400 Mflop/s in 1993 versus 2.4 Tflop/s down to 40 Gflop/s today; Schwab is at #19; about half of the list is replaced each year; 133 systems exceed 100 Gflop/s; growth is faster than Moore's law.

  6. Performance Development (projection chart, Jun-93 to Nov-09, 100 Mflop/s to 1 PFlop/s): extrapolating the Sum, N=1, and N=500 lines, with the ASCI machines, the Earth Simulator, and "My Laptop" marked for reference. Projection: entry at 1 Tflop/s around 2005 and 1 Pflop/s around 2010.

     Architectures (chart, Jun-93 to Nov-99, count of the 500 systems by class: Single Processor, SMP, MPP, SIMD, Constellation, Cluster).
     Current breakdown: 91 constellations, 14 clusters, 275 MPPs, 120 SMPs.

  7. Chip Technology (chart, Jun-93 to Jun-00, count of the 500 systems by processor family: Alpha, intel, IBM, HP, MIPS, SUN, Inmos Transputer, Other).

     High-Performance Computing Directions: Beowulf-class PC Clusters
     Definition:
     - COTS PC nodes (Pentium, Alpha, PowerPC, SMP)
     - COTS LAN/SAN interconnect (Ethernet, Myrinet, Giganet, ATM)
     - Open-source Unix (Linux, BSD)
     - Message-passing computing (MPI, PVM, HPF); a minimal MPI sketch follows this slide
     Advantages:
     - Best price-performance
     - Low entry-level cost
     - Just-in-place configuration
     - Vendor invulnerable
     - Scalable
     - Rapid technology tracking
     Enabled by PC hardware, networks, and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message-passing libraries.
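     A minimal sketch of the message-passing model used on Beowulf clusters: each process computes a partial result and rank 0 collects the total. Only standard MPI-1 calls are used; build with mpicc and run with something like mpirun -np 4.

     #include <mpi.h>
     #include <stdio.h>

     int main(int argc, char **argv) {
         int rank, size;
         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);

         /* each rank owns one slice of the work */
         double local = rank + 1.0, total = 0.0;

         /* combine the partial results on rank 0 */
         MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

         if (rank == 0)
             printf("%d processes, total = %g\n", size, total);

         MPI_Finalize();
         return 0;
     }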

  8. Where Does the Performance Go? or: Why Should I Care About the Memory Hierarchy?
     Processor-DRAM memory gap (latency), 1980-2000 (chart): processor performance grows about 60%/yr ("Moore's Law", 2X/1.5yr) while DRAM improves about 9%/yr (2X/10 yrs), so the processor-memory performance gap grows roughly 50% per year.

     Optimizing Computation and Memory Use
     - Computational optimizations: theoretical peak = (# FPUs) * (flops/cycle) * MHz
       - PIII:   (1 FPU)  * (1 flop/cycle)  * (650 MHz) =  650 MFLOP/s
       - Athlon: (2 FPUs) * (1 flop/cycle)  * (600 MHz) = 1200 MFLOP/s
       - Power3: (2 FPUs) * (2 flops/cycle) * (375 MHz) = 1500 MFLOP/s
     - Memory optimization: theoretical peak = (bus width) * (bus speed)
       - PIII:   (32 bits)  * (133 MHz) =  532 MB/s =  66.5 MW/s
       - Athlon: (64 bits)  * (200 MHz) = 1600 MB/s =   200 MW/s
       - Power3: (128 bits) * (100 MHz) = 1600 MB/s =   200 MW/s
     - Memory is about an order of magnitude slower than the floating-point units (these peaks are reproduced in the sketch after this slide).
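     A small sketch, using only the numbers quoted on the slide above, that computes both theoretical peaks and the resulting ratio of flops the chip can perform per 8-byte word it can load from memory, which is why the gap matters.

     #include <stdio.h>

     struct chip {
         const char *name;
         int fpus, flops_per_cycle, mhz;   /* computation side */
         int bus_bits, bus_mhz;            /* memory side */
     };

     int main(void) {
         struct chip chips[] = {
             { "PIII",   1, 1, 650,  32, 133 },
             { "Athlon", 2, 1, 600,  64, 200 },
             { "Power3", 2, 2, 375, 128, 100 },
         };
         for (int i = 0; i < 3; i++) {
             struct chip c = chips[i];
             double mflops = (double)c.fpus * c.flops_per_cycle * c.mhz;
             double mbs    = (double)c.bus_bits / 8.0 * c.bus_mhz;   /* MB/s  */
             double mws    = mbs / 8.0;                              /* MW/s, 8-byte words */
             printf("%-7s peak %6.0f MFLOP/s, memory %6.1f MB/s (%5.1f MW/s), %4.1f flops per word\n",
                    c.name, mflops, mbs, mws, mflops / mws);
         }
         return 0;
     }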

  9. Memory Hierarchy
     By taking advantage of the principle of locality:
     - Present the user with as much memory as is available in the cheapest technology.
     - Provide access at the speed offered by the fastest technology.

     (Diagram: registers and on-chip cache in the datapath, level 2 and 3 SRAM cache, main DRAM memory, distributed cluster memory, remote memory, secondary storage on disk, tertiary storage on disk/tape.)
     Speed (ns): 1s, 10s, 100s, ... up to 10,000,000s (10s of ms) for disk and 10,000,000,000s (10s of sec) for tape.
     Size (bytes): 100s for registers, Ks to Ms for cache, Ms to Gs for memory, Gs to Ts for disk and tape.

     How To Get Performance From Commodity Processors?
     - Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
     - Hardware and software have a large design space with many parameters:
       - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules (a blocked matrix-multiply sketch follows this slide).
       - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
     - Until recently, there were no tuned BLAS for the Pentium under Linux.
     - Need for quick/dynamic deployment of optimized routines.
     - ATLAS: Automatically Tuned Linear Algebra Software
       - PhiPac from Berkeley
       - FFTW from MIT (http://www.fftw.org)
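     A minimal sketch (not ATLAS itself) of the cache-blocking idea named above: multiply C += A*B in NB x NB tiles so each tile of A, B, and C is reused while it sits in cache. NB is a tunable parameter, exactly the kind of value ATLAS searches for automatically; the sizes below are illustrative.

     #include <stdio.h>

     #define N  512          /* matrix order (assumed divisible by NB here) */
     #define NB 64           /* blocking factor, chosen to fit the cache */

     static double A[N][N], B[N][N], C[N][N];

     static void matmul_blocked(void) {
         for (int ii = 0; ii < N; ii += NB)
             for (int jj = 0; jj < N; jj += NB)
                 for (int kk = 0; kk < N; kk += NB)
                     /* multiply one NB x NB tile: data reused from cache */
                     for (int i = ii; i < ii + NB; i++)
                         for (int j = jj; j < jj + NB; j++) {
                             double s = C[i][j];
                             for (int k = kk; k < kk + NB; k++)
                                 s += A[i][k] * B[k][j];
                             C[i][j] = s;
                         }
     }

     int main(void) {
         for (int i = 0; i < N; i++)
             for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; C[i][j] = 0.0; }
         matmul_blocked();
         printf("C[0][0] = %.0f (expected %d)\n", C[0][0], N);
         return 0;
     }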

  10. ATLAS
      - An adaptive software architecture
        - High performance
        - Portability
        - Elegance
      - ATLAS is faster than all other portable BLAS implementations, and it is comparable with the machine-specific libraries provided by the vendor.

      ATLAS Across Various Architectures (chart, DGEMM n=500, MFLOPS): vendor BLAS vs. ATLAS BLAS vs. reference Fortran 77 BLAS on DEC ev6-500, AMD Athlon-600, DEC ev56-533, Pentium III-550, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, SGI R10000ip28-200, SGI R12000ip30-270, and Sun UltraSparc2-200.
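      A sketch of calling DGEMM, the operation benchmarked in the chart above, through the C interface to the BLAS that ATLAS provides. It assumes the ATLAS headers and libraries are installed; the link line (something like -lcblas -latlas) depends on the installation.

      #include <stdio.h>
      #include <stdlib.h>
      #include <cblas.h>

      int main(void) {
          const int n = 500;                       /* same size as the chart */
          double *A = malloc((size_t)n * n * sizeof *A);
          double *B = malloc((size_t)n * n * sizeof *B);
          double *C = malloc((size_t)n * n * sizeof *C);
          for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

          /* C = 1.0 * A * B + 0.0 * C, row-major storage */
          cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                      n, n, n, 1.0, A, n, B, n, 0.0, C, n);

          printf("C[0] = %.0f (expected %d)\n", C[0], n);
          free(A); free(B); free(C);
          return 0;
      }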
