- An Overview of High Performance Computing and Trends
- Outline for the Next 3 Days
!"
!#
$#
An Overview of High Performance Computing and Trends - - PDF document
An Overview of High Performance Computing and Trends
!"
$#
%! !" "# & '!
### ! #()!* +++ ,-&*& ./.& ,0#*$#### 1*22 ,34
&"!"5(6.# %72'22
9"!
+++
!8 .!! '8 #+ !.9
6#!
"!
3#/2! !4
"!!
!#! 3!4
Moore's Law
- 2X transistors/chip every 1.5 years
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
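As a rough illustration (not from the original slides), the 1.5-year doubling rule can be written as a one-line growth formula; the 1971 Intel 4004 transistor count used below is a well-known reference point, and the projection shows only the idealized trend.

```python
# Idealized Moore's Law trend: transistor density doubles every 1.5 years.
def transistors_per_chip(year, base_year=1971, base_count=2300):
    """Projected transistors per chip, assuming a clean 2x every 1.5 years
    from the Intel 4004 (~2,300 transistors in 1971) -- illustrative only."""
    doublings = (year - base_year) / 1.5
    return base_count * 2 ** doublings

for y in (1971, 1980, 1990, 2000):
    print(y, f"{transistors_per_chip(y):,.0f}")
```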
[Chart: peak performance of the fastest computer over time, 1950-2010, from 1 KFlop/s to 1 PFlop/s: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific]
[Chart: fastest computer over time, 1990-2000 (GFlop/s scale to ~70): Cray Y-MP (8 processors), TMC CM-2 (2048), Fujitsu VP-2600]
[Chart: fastest computer over time, 1990-2000 (GFlop/s scale to ~7000): Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), ASCI Blue Pacific SST (5808), SGI ASCI Blue Mountain (5040), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424)]
ASCI White:
- 512 Nighthawk 16-way SMP nodes
- 4.0 TB memory, 159 TB disk
- 2x I/O size and delivered bandwidth over SST, 2.5x external network improvement
- Sufficient swap for GANG scheduling
ASCI Blue Pacific SST:
- 3 x 480 4-way SMP nodes, 3.9 TF peak performance
- 2.6 TB memory, 2.5 Tb/s bisectional bandwidth
- 62 TB disk, 6.4 GB/s delivered I/O bandwidth
[Chart: fastest computer over time, extended to 2002 (TFlop/s scale to ~70), including the ASCI systems: Intel ASCI Red (9152), ASCI Blue Mountain (5040), ASCI White Pacific (7424), Intel ASCI Red Xeon (9632)]
Contents of a TOP500 entry:
- Manufacturer: manufacturer or vendor
- Computer: type of system, as indicated by the manufacturer or vendor
- Installation site: customer
- Location: location and country
- Year: year of installation / last major update
- Field of application: Academic, Research, Industry, Vendor, Classified
- #Proc: number of processors
- Rmax: maximal LINPACK performance achieved
- Rpeak: theoretical peak performance
- Nmax: problem size for achieving Rmax
- N1/2: problem size for achieving half of Rmax
Rank | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Year | Area | # Proc
1 | NEC | Earth-Simulator | 35.86 | Earth Simulator Center | Japan | 2002 | Research | 5120
2 | IBM | ASCI White, SP Power3 | 7.23 | Lawrence Livermore National Laboratory | USA | 2000 | Research | 8192
3 | HP | AlphaServer SC ES45 1 GHz | 4.46 | Pittsburgh Supercomputing Center | USA | 2001 | Academic | 3016
4 | HP | AlphaServer SC ES45 1 GHz | 3.98 | Commissariat a l'Energie Atomique (CEA) | France | 2001 | Research | 2560
5 | IBM | SP Power3 375 MHz | 3.05 | NERSC/LBNL | USA | 2001 | Research | 3328
6 | HP | AlphaServer SC ES45 1 GHz | 2.92 | Los Alamos National Laboratory | USA | 2002 | Research | 2048
7 | Intel | ASCI Red | 2.38 | Sandia National Laboratory | USA | 1999 | Research | 9632
8 | IBM | pSeries 690 1.3 GHz | 2.31 | Oak Ridge National Laboratory | USA | 2002 | Research | 864
9 | IBM | ASCI Blue Pacific SST, SP 604e | 2.14 | Lawrence Livermore National Laboratory | USA | 1999 | Research | 5808
10 | IBM | pSeries 690 1.3 GHz | 2.00 | IBM/US Army Research Lab (ARL) | USA | 2002 | Vendor | 768
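A quick way to read the Rmax column is to compare it with the theoretical peak (Rpeak); the sketch below does this for two systems using figures quoted elsewhere in these slides (Earth Simulator: 35.86 TF/s measured vs. the quoted 40 TF/s peak; ASCI White: 7.23 TF/s measured vs. 11.136 TF/s peak from the Linpack history table). The helper name is ours, not part of the TOP500 tooling.

```python
# LINPACK efficiency = Rmax / Rpeak (both in the same units).
def efficiency(rmax_tf, rpeak_tf):
    return rmax_tf / rpeak_tf

systems = {
    "NEC Earth-Simulator": (35.86, 40.0),   # values quoted in these slides
    "IBM ASCI White":      (7.23, 11.136),  # peak from the Linpack history table
}
for name, (rmax, rpeak) in systems.items():
    print(f"{name}: {efficiency(rmax, rpeak):.0%} of peak")
```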
[Chart: TOP500 performance development, June 1993 - June 2002 (100 Mflop/s to 1 Pflop/s, log scale): total performance (Sum) grew from 1.17 TF/s to 220 TF/s, the #1 system (N=1) from 59.7 GF/s to 35.8 TF/s, and the #500 system (N=500) from 0.4 GF/s to 134 GF/s; #1 systems along the way: Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), IBM ASCI White (LLNL), NEC Earth Simulator. Source: Horst Simon, NERSC]
[Chart: TOP500 performance development and extrapolation (N=1, N=500, Sum; 100 MFlop/s to 10 PFlop/s), with the Earth Simulator and ASCI Purple marked]
[Chart: number of TOP500 systems by manufacturer (Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi) over time; June 2002: HP 168, IBM 164]
[Chart: share of TOP500 performance by manufacturer, June 1993 - June 2002; June 2002: IBM 33%, HP 22%, NEC 19%]
[Chart: number of TOP500 systems by region (USA/Canada, Europe, Japan) over time; June 2002: US 238 (previously 242), Europe 171 (162), Japan 53 (56)]
[Chart: share of TOP500 performance by region (USA/Canada, Europe, Japan), June 1993 - June 2002; June 2002: US 45% (59), Europe 24% (22), Japan 25% (13)]
[Chart: number of TOP500 systems in European countries over time; June 2002: Germany 64, UK 37, France 23, Scandinavia 12, Benelux 14, Switzerland 3]
[Chart: TOP500 performance by country (Japan, USA, Germany, Scandinavia, UK, France, Switzerland, Italy, Luxembourg); bar values 450, 358, 245, 207, 203, 158, 141, 67, 643; totals: Japan 57 TF/s, US 99 TF/s]
[Chart: number of TOP500 systems by type of installation (Research, Industry, Academic, Classified, Vendor) over time]
[Chart: industrial TOP500 installations by usage (Engineering, Commercial, Unknown) over time]
Rank | Manufacturer | Computer | Rmax [GF/s] | Installation Site | Country | Area | # Proc
…
40 | IBM | SP Power3 | 795 | Charles Schwab | USA | Finance | 768
66 | IBM | SP Power3 | 594 | Sprint PCS | USA | Telecom | 320
67 | IBM | SP Power4 | 555 | EDS General Motors | USA | Automotive | 224
73 | IBM | SP Power3 | 546 | State Farm | USA | Database | 520
125 | IBM | Netfinity P3 Ethernet Cluster | 366 | WesternGeco | UK | Geophysics | 1280
127 | Hewlett-Packard | SuperDome HyperPlex | 361 | Centrica Plc | UK | Energy | 196
…
[Chart: share of TOP500 performance by type of installation (Research, Industry, Academic, Classified, Vendor), June 1993 - June 2002]
[Chart: number of TOP500 systems, USA vs. Japan vs. Europe, over time]
[Chart: share of TOP500 performance, USA vs. Japan vs. Europe, over time]
[Chart: number of TOP500 systems by processor architecture (Scalar, Vector, SIMD) over time]
[Chart: number of TOP500 systems by chip technology (CMOS, proprietary CMOS, ECL) over time]
[Chart: number of TOP500 systems by processor family (Alpha, Power, HP, Intel, MIPS, Sparc, proprietary) over time]
[Chart: number of TOP500 systems by architecture class (Single Processor, SMP, MPP, SIMD, Constellation, Cluster/NOW) over time, with representative systems: Cray Y-MP, C90, Sun HPC, Paragon, CM-5, T3D, T3E, SP2, cluster of Sun HPC, ASCI Red, CM-2, VP-500, SX-3]
Constellation: a system whose number of processors per node (p/n) is at least as large as its number of nodes (n).
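A minimal sketch (our own illustration, not TOP500 code) of the rule of thumb the slide alludes to: a system counts as a constellation when each node holds at least as many processors as there are nodes, otherwise it is counted on the cluster/MPP side.

```python
def top500_architecture_class(procs_per_node, nodes):
    """Rule of thumb: constellation if processors per node >= number of nodes."""
    return "Constellation" if procs_per_node >= nodes else "Cluster / MPP"

print(top500_architecture_class(16, 512))   # many thin nodes   -> Cluster / MPP
print(top500_architecture_class(256, 4))    # few fat SMP nodes -> Constellation
```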
[Charts: TOP500 performance vs. rank, on linear and logarithmic scales; rank 58 marks half of the cumulative TOP500 performance]
[Chart: rank at which half of the cumulative TOP500 performance is reached, June 1993 - June 2002 (roughly between 10 and 80)]
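The "rank of 1/2 cumulative performance" statistic can be computed directly from the Rmax values sorted by rank; the sketch below (with toy data, not the real list) finds the smallest rank whose running sum reaches half of the total.

```python
def half_performance_rank(rmax_sorted_desc):
    """Smallest rank k such that the top-k systems hold at least half of the
    total TOP500 performance (rmax values sorted from rank 1 downward)."""
    total = sum(rmax_sorted_desc)
    running = 0.0
    for k, rmax in enumerate(rmax_sorted_desc, start=1):
        running += rmax
        if running >= total / 2:
            return k
    return len(rmax_sorted_desc)

# Toy example: a steeply decaying list concentrates performance near the top.
toy_list = [100 / r for r in range(1, 501)]
print(half_performance_rank(toy_list))
```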
7"#*
!! (+1$E21)))#2.,- //364# &#"! #! !#!2&#!/7!8/ F3&7F4!G"#!!/ ##364 !/## !++$& 'H ! !E
'I ;=-!.
!
&0))2)))2;,)))J
#G!
!
" " &#
!
#/
#!
!!+
#!!#
" &# !2&
9#!
"# !!#!# K))!"# !+
!!
"' 2"# # + #"#
;())2))) =>!. #
%# 2 62 !2
26
%. %
$2 6
8 82C 6
C/ /!! //! D#!"! !"! &#!
Clusters are enabled by PC hardware, networks, and operating systems that achieve the capabilities of scientific workstations at a fraction of the cost, together with the availability of industry-standard message-passing libraries. However, they are much more of a contact sport.
Rank | Manufacturer | Computer | Rmax [GF/s] | Installation Site | Country | # Proc
…
30 | Self-made | Cplant/Ross | 707 | Sandia National Lab | USA | 1369
34 | IBM | Titan Cluster Itanium 800 MHz | 594 | NCSA | USA | 320
39 | NEC | Magi Cluster PIII 933 MHz | 654 | CBRC - Tsukuba Advanced Computing Center | Japan | 1024
40 | Self-made | SCoreIII PIII 933 MHz | 618 | Real World Computing, Tsukuba | Japan | 1024
41 | IBM | Netfinity Cluster PIII 1 GHz | 594 | NCSA | USA | 1024
320 | Dell | PowerEdge Cluster Windows2000 | 121 | Cornell Theory Center | USA | 252
…
[Chart: clusters in the TOP500 by processor/system type (AMD, Intel, IBM Netfinity, Alpha, HP AlphaServer, Sparc), June 1997 - June 2002]
Earth Simulator target applications
Atmospheric and oceanographic science:
- High resolution global models: predictions of global warming etc.
- High resolution regional models: predictions of El Niño events and the Asian monsoon etc.
- High resolution local models: predictions of weather disasters such as typhoons, localized torrential downpours, oil spills, downbursts etc.
Solid earth science:
- Global dynamic model: to describe the entire solid earth as a system; simulation of the earthquake generation process
- Regional model: to describe crust/mantle activity in the Japanese Archipelago region
- Seismic wave tomography
[Diagram: Earth Simulator hardware configuration: processor nodes #0-#15 grouped into clusters #0-#39; each node contains 8 vector processors (#0-#7) with shared memory; storage: HDD disks and an MT (cartridge tape library) archive]
Specifications:
- Peak performance / processor: 8 Gflops
- Peak performance / node: 64 Gflops
- Shared memory / node: 16 GB
- Total number of processors: 5,120
- Total number of nodes: 640
- Total peak performance: 40 Tflops
- Total main memory: 10 TB
Interconnection network: 16 GB/s x 2
(Earth Simulator Research and Development Center)
The Earth Simulator consists of computing nodes in which vector-type multi-processors are tightly connected by sharing main memory. Main memory: 10 TB total, with 16 GB of shared memory per node. Peak performance: 40 TFLOPS (the effective performance for an atmospheric circulation model is more than 5 TFLOPS).
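The aggregate figures follow directly from the per-processor numbers; this small check (ours, using only the specifications quoted above) reproduces the 64 Gflop/s per node, the roughly 40 Tflop/s total peak, and the 10 TB of memory.

```python
# Earth Simulator totals derived from the quoted per-processor specs.
GFLOPS_PER_CPU  = 8        # peak per vector processor
CPUS_PER_NODE   = 8
NODES           = 640
MEM_PER_NODE_GB = 16

node_peak_gf  = GFLOPS_PER_CPU * CPUS_PER_NODE      # 64 Gflop/s per node
total_peak_tf = node_peak_gf * NODES / 1000          # ~40.96 Tflop/s, quoted as 40 Tflops
total_cpus    = CPUS_PER_NODE * NODES                # 5,120 processors
total_mem_tb  = MEM_PER_NODE_GB * NODES / 1024       # 10 TB

print(node_peak_gf, total_peak_tf, total_cpus, total_mem_tb)
```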
Processor evolution (Earth Simulator Research and Development Center):
- SX-4: 8 Gflops (2 Gflop/s x 4), clock 125 MHz, 0.35 µm CMOS, 37 x 4 = 148 LSIs (board 457 mm x 386 mm)
- SX-5: 8 Gflop/s, clock 250 MHz, 0.25 µm CMOS, 32 LSIs (board 225 mm x 225 mm)
- Earth Simulator: 8 Gflop/s, clock 500 MHz / 1 GHz, 0.15 µm CMOS, one-chip processor (115 mm x 110 mm)
[Diagram: single-node comparison, both 64 Gflops peak performance and 16 GB main memory, both air cooled: one node of a present distributed-memory supercomputer (SX-4), about 6 m x 7 m, electric power 90 KVA, vs. one Earth Simulator node, about 0.7 m x 1 m, electric power 8 KVA]
R&D Issues on Hardware Technologies
Line width / Spacing : 25
m / 25 m6 core layers + 4 build-up layers on both surfaces
(2) Packaging Technology
(3) Cooling Technology
0.15
m CMOS + Cu interconnection (8 layers)1.50-2.0 million transistors/cm2 10 million transistors/cm2
(1) LSI Technology
(4) Board to Board Interconnection Technology
(5) PN-IN Interconnection Technology
High performance one-chip vector processor: OCVP-ES
(! Earth Simulator Research and Development Center
Connection between processor nodes (crossbar network):
- 640 processor nodes (PN #0 - #639) in 320 cabinets
- 128 crossbar switches (XSW #0 - #127) and XCT #0, #1 in 64 cabinets
- Total number of cables: 640 x 130 = 83,200
- Total length of cables: approximately 2,900 km
- Total weight of cables: 220 t
[Diagram: bird's-eye view of the Earth Simulator system (65 m x 50 m): processor node (PN) cabinets, interconnection network (IN) cabinets, double floor for IN cables, cartridge tape library system, disks, power supply system, air conditioning system]
[Diagram: cross-sectional view of the Earth Simulator building: building for the computer system, building for operation and research, power plant; double floor for IN cables and air-conditioning, air-conditioning system and return duct, power supply system, lightning protection system, seismic isolation system]
Total length of IN cables: 3,000 km
[Photo: panoramic view of the Earth Simulator system, January 2002]
"# "# "# "#
$$%&' $$%&' $$%&'
!*
Linpack performance of the fastest computer by year:
Computer | Year | Measured Gflop/s | Factor over previous year | Theoretical Peak Gflop/s | Factor over previous year | Number of Processors | Size of Problem
Fujitsu NWT | 1993 | 124.5 | - | 236 | - | 140 | 31,920
Intel Paragon XP/S MP | 1994 | 281.1 | 2.3 | 338 | 1.4 | 6,768 | 128,600
Intel Paragon XP/S MP | 1995 | 281.1 | 1.0 | 338 | 1.0 | 6,768 | 128,600
Hitachi CP-PACS | 1996 | 368.2 | 1.3 | 614 | 1.8 | 2,048 | 103,680
Intel ASCI Option Red (200 MHz Pentium Pro) | 1997 | 1,338 | 3.6 | 1,830 | 3.0 | 9,152 | 235,000
ASCI Blue-Pacific SST, IBM SP 604E | 1998 | 2,144 | 1.6 | 3,868 | 2.1 | 5,808 | 431,344
ASCI Red, Intel Pentium II Xeon core | 1999 | 2,379 | 1.1 | 3,207 | 0.8 | 9,632 | 362,880
ASCI White-Pacific, IBM SP Power 3 | 2000 | 4,938 | 2.1 | 11,136 | 3.5 | 7,424 | 430,000
ASCI White-Pacific, IBM SP Power 3 | 2001 | 7,226 | 1.5 | 11,136 | 1.0 | 7,424 | 518,096
Earth Simulator Computer, NEC | 2002 | 35,610 | 4.9 | 40,832 | 3.7 | 5,104 | 1,041,216
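The "factor over previous year" columns are simply ratios of consecutive entries; the snippet below recomputes them from the measured Gflop/s values in the table above, and the rounding matches the quoted factors.

```python
# Measured Linpack Gflop/s of the fastest system, by year (from the table above).
measured = {
    1993: 124.5, 1994: 281.1, 1995: 281.1, 1996: 368.2, 1997: 1338,
    1998: 2144, 1999: 2379, 2000: 4938, 2001: 7226, 2002: 35610,
}
years = sorted(measured)
for prev, cur in zip(years, years[1:]):
    factor = measured[cur] / measured[prev]
    print(f"{cur}: {factor:.1f}x over {prev}")
```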
[Chart: leading TOP500 installation sites, including CEA, LANL, PSC, ESC, LLNL, LBNL, SNL, U Tokyo, Leibniz, US Government, IBM]
Physical processes and schemes in the atmospheric model:
- Cumulus convection: condensation, precipitation, convection (Arakawa and Schubert, 1974; Moorthi & Suarez, 1992)
- Large-scale condensation: other cloud processes and prediction of cloud water (Le Treut & Li, 1990)
- Radiation: 2-stream k-distribution scheme (Nakajima & Tanaka, 1986)
- Vertical diffusion: transport of heat, momentum, and moisture in the PBL; level 2 turbulence scheme (Mellor & Yamada, 1974, 1982)
- Surface flux: fluxes in the surface boundary layer (Louis, 1979; Mellor et al., 1992)
- Ground process: multi-layer heat conduction, hydrology (Manabe, 1979); ground moisture (Manabe et al., 1965); frozen soil process (Clapp & Hornberger, 1978); bucket model (Kondo, 1993)
- Ocean mixing layer: ocean temperature (Wilson et al., 1987); sea ice
- Gravity wave-induced drag: orographic effect (McFarlane, 1987)
- Others: dry convection adjustment
Parallel decomposition
[Diagram: grid points 3840 (I) x 1920 (J) x 96 (K); FFT / inverse FFT between grid space and spectral space, distributed over the processor nodes]
- Grid space and spectral space: high resolution (10 km) resulting in increased cost concentration
- MPI among nodes / microtasking within a node (a rough sketch of the two-level decomposition follows below)
- Domain decomposition that fully exploits the parallel nodes (>99% parallelization ratio) with less communication
- Reduced load imbalance due to improved algorithms (e.g., use of the increasingly popular Kuo cloud physics model)
- Improved vector performance with DO-loop optimization; combined use of assembler coding for part of the matrix operations
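As a rough illustration of the two-level decomposition described above (MPI across the 640 nodes, shared-memory microtasking within a node), the sketch below assigns the 1,920 latitude rows of the 3840 x 1920 x 96 grid to nodes and the longitude columns to the 8 CPUs of each node; the function name and the simple block partitioning are our own, not the production code.

```python
def block_range(n_items, n_parts, part):
    """Contiguous block of indices owned by `part` (0-based) out of `n_parts`."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return range(start, stop)

NJ, NI, NODES, CPUS_PER_NODE = 1920, 3840, 640, 8

node, cpu = 0, 0                                   # one example MPI rank / CPU
lat_rows = block_range(NJ, NODES, node)            # 1920 / 640 = 3 latitude rows per node
lon_cols = block_range(NI, CPUS_PER_NODE, cpu)     # 3840 / 8  = 480 longitude columns per CPU
print(len(lat_rows), len(lon_cols))
```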
CPUs | Nodes | CPUs/Node | Elapsed time (sec) | Peak (TFLOPS) | Sustained (TFLOPS) | Ratio (%)
80 | 80 | 1 | 238.04 | 0.64 | 0.52 | 81.1
160 | 160 | 1 | 119.26 | 1.28 | 1.04 | 81.0
320 | 320 | 1 | 60.52 | 2.56 | 2.04 | 79.8
640 | 80 | 8 | 32.06 | 5.12 | 3.86 | 75.3
1280 | 160 | 8 | 16.24 | 10.24 | 7.61 | 74.3
2560 | 320 | 8 | 8.52 | 20.48 | 14.50 | 70.8
26.6 TFLOP/S sustained performance with the full 640 nodes (5,120 CPUs, 40 TFLOP/S peak).
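A quick consistency check of the table (our own arithmetic, not part of the original benchmark): peak is simply 8 Gflop/s times the number of CPUs, and the sustained-to-peak ratio reproduces the quoted percentages.

```python
# (CPUs, elapsed seconds, sustained TFLOPS) rows from the table above.
rows = [(80, 238.04, 0.52), (160, 119.26, 1.04), (320, 60.52, 2.04),
        (640, 32.06, 3.86), (1280, 16.24, 7.61), (2560, 8.52, 14.50)]
for cpus, elapsed, sustained in rows:
    peak = cpus * 8 / 1000                      # 8 Gflop/s per CPU -> TFLOPS
    print(f"{cpus:5d} CPUs: peak {peak:5.2f} TF, {sustained/peak:.1%} of peak")
```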
[Diagram: the computing continuum, from distributed, heterogeneous systems to massively parallel, homogeneous systems: Grid-based computing (SETI@home, Entropia), networks of workstations, Beowulf clusters, clusters with special interconnect, parallel distributed-memory machines (ASCI Tflops); performance increases toward the massively parallel end]