An Overview Of High Performance Computing And Challenges For The Future - PowerPoint PPT Presentation




SLIDE 1

2/13/2009

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

An Overview Of High Performance Computing And Challenges For The Future

SLIDE 2

A Growth-Factor of a Billion in Performance in a Career

[Chart, 1950-2010, 1 KFlop/s to 1 PFlop/s; eras: Scalar, Super Scalar, Vector, Parallel, Super Scalar/Special Purpose/Parallel; 2X transistors/chip every 1.5 years. Machines: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM RoadRunner, Cray Jaguar]

1941  1 (floating-point operations / second, Flop/s)
1945  100
1949  1,000 (1 KiloFlop/s, KFlop/s)
1951  10,000
1961  100,000
1964  1,000,000 (1 MegaFlop/s, MFlop/s)
1968  10,000,000
1975  100,000,000
1987  1,000,000,000 (1 GigaFlop/s, GFlop/s)
1992  10,000,000,000
1993  100,000,000,000
1997  1,000,000,000,000 (1 TeraFlop/s, TFlop/s)
2000  10,000,000,000,000
2007  478,000,000,000,000 (478 TFlop/s)
2009  1,100,000,000,000,000 (1.1 PetaFlop/s)

SLIDE 3

  • H. Meuer, H. Simon, E. Strohmaier, & JD
  • Listing of the 500 most powerful computers in the world
  • Yardstick: Rmax from LINPACK MPP (TPP performance): Ax = b, dense problem
  • Updated twice a year:
    - SC'xy in the States in November
    - Meeting in Germany in June
  • All data available from www.top500.org
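The yardstick above can be sketched in a few lines: a minimal Python stand-in for the LINPACK benchmark that times a dense solve of Ax = b and credits it with the roughly (2/3)n³ + 2n² floating-point operations an LU-based solve performs. (The real benchmark is HPL; the numbers here reflect whatever BLAS your NumPy links against, not a supercomputer.)

```python
# Minimal LINPACK-style sketch: time a dense Ax = b solve and convert
# the ~(2/3)n^3 + 2n^2 flops of an LU-based solve into Gflop/s.
import time
import numpy as np

def linpack_gflops(n: int, trials: int = 3) -> float:
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        x = np.linalg.solve(A, b)      # LU factorization + triangular solves
        best = min(best, time.perf_counter() - t0)
    assert np.allclose(A @ x, b)       # sanity check on the solution
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / best / 1e9

print(f"{linpack_gflops(2000):.1f} Gflop/s")
```

This is only the flop-counting convention; HPL additionally requires the solution to pass a scaled-residual check before Rmax counts.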

SLIDE 4

Performance Development

[Chart, 1994-2008, log scale from 100 Mflop/s to 100 Pflop/s:
  SUM:   1.17 TFlop/s to 16.9 PFlop/s
  N=1:   59.7 GFlop/s to 1.1 PFlop/s
  N=500: 400 MFlop/s to 12.6 TFlop/s
  "My Laptop" trails N=500 by 6-8 years]

SLIDE 5

Performance Development and Projections

[Chart, projected on a log scale from 1.E+05 to 1.E+19 Flop/s:
  SUM:   1.17 TFlop/s to 16.9 PFlop/s
  N=1:   59.7 GFlop/s to 1.1 PFlop/s
  N=500: 400 MFlop/s to 12.6 TFlop/s]

System      Rate         Concurrency      Time for the same run
Cray 2      1 Gflop/s    O(1) thread      ~1000 years
ASCI Red    1 Tflop/s    O(10^3) threads  ~1 year
RoadRunner  1.1 Pflop/s  O(10^6) threads  ~8 hours
Exaflop     1 Eflop/s    O(10^9) threads  ~1 min.
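The runtime column is back-of-the-envelope arithmetic: one minute at 1 Eflop/s is 6 x 10^19 operations, and dividing by each machine's rate gives the slide's round figures to within order of magnitude. A sketch using peak rates only, ignoring efficiency, memory, and I/O:

```python
# One minute at 1 Eflop/s is 6e19 floating-point operations; how long
# does the same work take at the peak rates of earlier machines?
EXAFLOP_MINUTE = 1e18 * 60  # operations

machines = {
    "Cray 2 (1 Gflop/s)": 1e9,
    "ASCI Red (1 Tflop/s)": 1e12,
    "RoadRunner (1.1 Pflop/s)": 1.1e15,
}

for name, rate in machines.items():
    seconds = EXAFLOP_MINUTE / rate
    print(f"{name}: {seconds / 3600:,.0f} hours ({seconds / 3.156e7:,.1f} years)")
```

The exact quotients (about 15 hours for RoadRunner, about 1,900 years for the Cray 2) land within a small factor of the slide's rounded ~8 hours and ~1000 years.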

SLIDE 6

Processors / Systems

Xeon E54xx (Harpertown): 37%
Xeon 51xx (Woodcrest): 14%
Xeon 53xx (Clovertown): 13%
Xeon L54xx (Harpertown): 7%
Opteron Quad Core: 6%
Opteron Dual Core: 6%
PowerPC 440: 3%
PowerPC 450: 2%
POWER6: 2%
Others: remainder

Intel 71%, AMD 13%, IBM 7%

SLIDE 7

Cluster Interconnects

[Chart, 1999-2008, 0-300 systems: GigE, Myrinet, Infiniband, Quadrics]

SLIDE 8

Efficiency

[Chart: efficiency (Rmax/Rpeak, 0.00-1.00) vs. TOP500 ranking, 1-500]

SLIDE 9

Cores Per Socket

4 cores: 67%
2 cores: 31%
9 cores: 7 systems
Single core: 4 systems

SLIDE 10

Core Count

[Chart, 1993-2008: systems by total core count, in bins from 1 core up to 128k+ cores]

SLIDE 11

Countries / System Share

United States: 58%
United Kingdom: 9%
France: 5%
Germany: 5%
Japan: 4%
China: 3%
Italy: 2%
Sweden: 2%
India: 2%
Russia: 2%
Spain: 1%
Poland: 1%
Others: 6%

SLIDE 12

Customer Segments

[Chart, 1993-2008: systems by segment: Industry, Research, Academic, Classified, Vendor, Government, Others]

SLIDE 13

Distribution of the Top500

[Chart: Rmax (Tflop/s) vs. rank, from 1.1 Pflop/s at #1 down to 12.6 Tflop/s at #500]

2 systems > 1 Pflop/s
19 systems > 100 Tflop/s
51 systems > 50 Tflop/s
119 systems > 25 Tflop/s

SLIDE 14

Replacement Rate

[Chart, 1993-2008, 0-350: number of systems replaced on each list (annotated value: 267)]

SLIDE 15

32nd List: The TOP10

Rank | Site | Computer | Country | Cores | Rmax [Tflop/s] | Rmax/Rpeak | Power [MW] | MF/W
1 | DOE/NNSA/LANL | IBM / Roadrunner - BladeCenter QS22/LS21 | USA | 129,600 | 1105.0 | 76% | 2.48 | 445
2 | DOE/Oak Ridge National Laboratory | Cray / Jaguar - Cray XT5 QC 2.3 GHz | USA | 150,152 | 1059.0 | 77% | 6.95 | 152
3 | NASA/Ames Research Center/NAS | SGI / Pleiades - SGI Altix ICE 8200EX | USA | 51,200 | 487.0 | 80% | 2.09 | 233
4 | DOE/NNSA/LLNL | IBM / eServer Blue Gene Solution | USA | 212,992 | 478.2 | 80% | 2.32 | 205
5 | DOE/Argonne National Laboratory | IBM / Blue Gene/P Solution | USA | 163,840 | 450.3 | 81% | 1.26 | 357
6 | NSF/Texas Advanced Computing Center/Univ. of Texas | Sun / Ranger - SunBlade x6420 | USA | 62,976 | 433.2 | 75% | 2.0 | 217
7 | DOE/NERSC/LBNL | Cray / Franklin - Cray XT4 | USA | 38,642 | 266.3 | 75% | 1.15 | 232
8 | DOE/Oak Ridge National Laboratory | Cray / Jaguar - Cray XT4 | USA | 30,976 | 205.0 | 79% | 1.58 | 130
9 | DOE/NNSA/Sandia National Laboratories | Cray / Red Storm - XT3/4 | USA | 38,208 | 204.2 | 72% | 2.5 | 81
10 | Shanghai Supercomputer Center | Dawning 5000A, Windows HPC 2008 | China | 30,720 | 180.6 | 77% | |
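The MF/W column is derivable from the other two: Rmax over power draw, with 1 Tflop/s = 10^6 Mflop/s and 1 MW = 10^6 W. A quick sketch recomputing it for three rows; the results match the table's column to within a megaflop of rounding:

```python
# Megaflops-per-watt from the table's Rmax and Power columns:
# Rmax [Tflop/s] * 1e6 (Mflop per Tflop) over Power [MW] * 1e6 (W per MW).
def mflops_per_watt(rmax_tflops: float, power_mw: float) -> float:
    return rmax_tflops * 1e6 / (power_mw * 1e6)

systems = [
    ("Roadrunner", 1105.0, 2.48),    # table lists 445 MF/W
    ("Jaguar XT5", 1059.0, 6.95),    # table lists 152 MF/W
    ("Blue Gene/P", 450.3, 1.26),    # table lists 357 MF/W
]
for name, rmax, power in systems:
    print(f"{name}: {mflops_per_watt(rmax, power):.1f} MF/W")
```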


SLIDE 17

LANL Roadrunner: A Petascale System in 2008

17 “Connected Unit” clusters of 192 Opteron nodes (180 w/ 2 dual-Cell blades connected w/ 4 PCIe x8 links)
  • ≈ 13,000 Cell HPC chips, ≈ 1.33 PetaFlop/s (from Cell)
  • ≈ 7,000 dual-core Opterons, ≈ 122,000 cores
2nd-stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches)
Based on the 100 Gflop/s (DP) Cell chip: one Cell chip for each Opteron core
Hybrid design (2 kinds of chips & 3 kinds of cores); programming required at 3 levels.

SLIDE 18

ORNL’s Newest System: Jaguar XT5

DOE Office of Science. The systems will be combined after acceptance of the new XT5 upgrade. Each system will be linked to the file system through 4x-DDR InfiniBand.

                               Jaguar Total  XT5      XT4
Peak Performance (TF)          1,645         1,382    263
AMD Opteron Cores              181,504       150,176  31,328
System Memory (TB)             362           300      62
Disk Bandwidth (GB/s)          284           240      44
Disk Space (TB)                10,750        10,000   750
Interconnect Bandwidth (TB/s)  532           374      157
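Since the Total column is the XT5 and XT4 systems combined, the rows can be sanity-checked by summation. A small sketch (the interconnect row, 374 + 157 = 531 vs. 532, is off by one TB/s of rounding and is left out):

```python
# Consistency check on the Jaguar table: Total should equal XT5 + XT4.
rows = {
    "Peak performance (TF)": (1645, 1382, 263),
    "AMD Opteron cores":     (181504, 150176, 31328),
    "System memory (TB)":    (362, 300, 62),
    "Disk bandwidth (GB/s)": (284, 240, 44),
    "Disk space (TB)":       (10750, 10000, 750),
}
for name, (total, xt5, xt4) in rows.items():
    assert total == xt5 + xt4, name
print("all rows consistent")
```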

SLIDE 19

UT’s HPC System

  • University of Tennessee’s National Institute for Computational Sciences
  • Housed at ORNL
  • Operated for the NSF
  • Named Kraken

Today:

  • Cray XT5 (608 TF) + Cray XT4 (167 TF)
  • XT5: 16,512 sockets, 66,048 cores
  • XT4: 4,512 sockets, 18,048 cores
  • Number 15 on the Top500


SLIDE 20

Power is an Industry Wide Problem

“Hiding in Plain Sight, Google Seeks More Power”, by John Markoff, June 14, 2006

Google facilities:

  • leveraging hydroelectric power
  • old aluminum plants

Microsoft and Yahoo are building big data centers upstream in Wenatchee and Quincy, Wash., to keep up with Google, which means they need cheap electricity and readily accessible data networking.

Microsoft Quincy, Wash.: 470,000 sq ft, 47 MW!

SLIDE 21

ORNL/UTK Power Cost Projections 2007-2011

Over the next 5 years ORNL/UTK will deploy 2 large petascale systems. Using 4 MW today, going to 15 MW before year end. By 2012 they could be using more than 50 MW! Cost estimates are based on $0.07 per kWh.

Includes both DOE and NSF systems.
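At $0.07 per kWh the power figures above translate directly into operating cost: megawatts to kilowatts, times hours in a year, times the price. A minimal sketch of the arithmetic:

```python
# Rough annual electricity bill for a machine room at a flat $0.07/kWh.
def annual_power_cost(megawatts: float, dollars_per_kwh: float = 0.07) -> float:
    hours_per_year = 8760
    return megawatts * 1000 * hours_per_year * dollars_per_kwh

for mw in (4, 15, 50):
    # -> about $2.5M, $9.2M, and $30.7M per year respectively
    print(f"{mw:>2} MW -> ${annual_power_cost(mw) / 1e6:.1f}M per year")
```

Even at a cheap hydroelectric rate, a 50 MW facility's power bill rivals the hardware budget.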

SLIDE 22

Something’s Happening Here…

  • In the “old days” it was: each year processors would become faster
  • Today the clock speed is fixed or getting slower
  • Things are still doubling every 18-24 months
  • Moore’s Law reinterpreted: the number of cores doubles every 18-24 months

A hardware issue just became a software problem.

From K. Olukotun, L. Hammond, H. Sutter, and B. Smith

SLIDE 23

Power Cost of Frequency

  • Power ∝ Voltage² × Frequency (V²F)
  • Frequency ∝ Voltage
  • Power ∝ Frequency³
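These proportionalities are the whole argument for multicore: with dynamic power scaling as frequency cubed, two cores at half the clock deliver the same aggregate operation rate for a quarter of the power. A sketch of that trade, ignoring static/leakage power:

```python
# Dynamic power ~ V^2 * f, and V ~ f, so power ~ f^3 per core.
# Compare one fast core against two cores at half the clock.
def relative_power(cores: int, freq: float) -> float:
    return cores * freq**3      # total power, in units of one core at f = 1.0

def relative_throughput(cores: int, freq: float) -> float:
    return cores * freq         # ideal ops/s, same units

baseline = relative_power(1, 1.0)
multicore = relative_power(2, 0.5)

assert relative_throughput(2, 0.5) == relative_throughput(1, 1.0)
print(multicore / baseline)     # 0.25: same throughput at 1/4 the power
```

The catch, picked up on the next slides, is that the "same throughput" only materializes if the software can actually use both cores.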


SLIDE 25

What’s Next?

[Chip roadmap sketches: all large core; mixed large and small core; all small core; many small cores; many floating-point cores; SRAM + 3D stacked memory]

Different classes of chips: Home, Games/Graphics, Business, Scientific

SLIDE 26

And then there are the GPGPUs: NVIDIA’s Tesla T10P

T10P chip

  • 240 cores; 1.5 GHz
  • Tpeak 1 Tflop/s - 32 bit floating point
  • Tpeak 100 Gflop/s - 64 bit floating point

S1070 board

  • 4 - T10P devices;
  • 700 Watts

GTX 280

  • 1 – T10P; 1.3 GHz
  • Tpeak 864 Gflop/s - 32 bit floating point
  • Tpeak 86.4 Gflop/s - 64 bit floating point


SLIDE 27


Intel’s Larrabee Chip

  • Many x86 IA cores
  • Scalable to Tflop/s
  • New cache architecture
  • New vector instruction set
    - Vector memory operations
    - Conditionals
    - Integer and floating point arithmetic
  • New vector processing unit / wide SIMD

SLIDE 28

Architecture of Interest

Manycore chip composed of hybrid cores:

  • Some general purpose
  • Some graphics
  • Some floating point


SLIDE 29

Architecture of Interest

Board composed of multiple chips sharing memory


Memory

SLIDE 30

Architecture of Interest

Rack composed of multiple boards

Memory

SLIDE 31

Architecture of Interest

A room full of these racks Think millions of cores


Memory
SLIDE 32

Near Term Situation

  • Million core systems and beyond are on the horizon
  • By 2012 there will be more systems deployed in the 200K - 1M core range
  • By 2020 there will be systems with perhaps 100M cores
  • Personal systems with > 1000 cores within 5 years (I have over 100 cores in my office now)
  • Personal systems with requirements for 1M threads are not too far-fetched (think GPUs)


SLIDE 33

Exascale Computing

  • Exascale systems (10^18 Flop/s) are likely feasible by 2017±2
  • 10-100 million processing elements (cores or mini-cores), with chips perhaps as dense as 1,000 cores per socket; clock rates will grow more slowly
  • 3D packaging likely
  • Large-scale optics-based interconnects
  • 10-100 PB of aggregate memory
  • 10,000s of I/O channels to 10-100 exabytes of secondary storage; disk bandwidth to storage ratios not optimal for HPC use
  • Hardware- and software-based fault management
  • Achievable performance per watt will likely be the primary measure of progress


SLIDE 34

Conclusions

Moore’s Law reinterpreted:

  • Number of cores per chip doubles every two years, while clock speed stays roughly stable
  • Threads of execution double every 2 years
  • 100M cores coming

Need to deal with systems with millions of concurrent threads:

  • Future generations will have billions of threads!
  • MPI and programming languages from the ’60s will not make it

Power limiting clock rate growth:

  • Power becomes the architectural driver for exascale systems.

SLIDE 35


Collaborators

Top500 Team

  • Erich Strohmaier, NERSC
  • Hans Meuer, Mannheim
  • Horst Simon, NERSC

http://www.top500.org/c