Myrinet User's Group Conference 12-14 May 2002 Vienna, Austria
Performance Optimization for Cluster Computing
[Figure: CPU and DRAM trends by year, 1980 to 2001]
Theoretical peak performance: (# FPUs) * (flops/cycle) * (clock rate)
  PIII:   (1 FPU) * (1 flop/cycle)  * (850 MHz)  = 850 MFLOP/s
  P4:     (1 FPU) * (2 flops/cycle) * (2.53 GHz) = 5060 MFLOP/s
  Athlon: (2 FPU) * (1 flop/cycle)  * (600 MHz)  = 1200 MFLOP/s
  Power3: (2 FPU) * (2 flops/cycle) * (375 MHz)  = 1500 MFLOP/s

Operations such as:
  α = x^T y:  2 operands (16 bytes) needed for 2 flops; at 850 MFLOP/s this requires 1700 MW/s of bandwidth
  y = αx + y: 3 operands (24 bytes) needed for 2 flops; at 850 MFLOP/s this requires 2550 MW/s of bandwidth

Theoretical peak memory bandwidth: (bus width) * (bus speed)
  PIII:   (32 bits)  * (133 MHz) = 532 MB/s  = 66.5 MW/s
  P4:     (32 bits)  * (533 MHz) = 2132 MB/s = 266 MW/s
  Athlon: (64 bits)  * (133 MHz) = 1064 MB/s = 133 MW/s
  Power3: (128 bits) * (100 MHz) = 1600 MB/s = 200 MW/s
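The peak arithmetic on this slide (peak flop rate as FPUs times flops per cycle times clock rate, and peak memory bandwidth as bus width times bus speed) can be sketched in a few lines. This is only an illustrative sketch: the function names are hypothetical, a "word" is assumed to be 8 bytes (which matches the MB/s to MW/s conversion on the slide), and the words-per-flop ratio is an added convenience, not a figure from the talk.

```python
def peak_mflops(fpus, flops_per_cycle, clock_mhz):
    """Theoretical peak floating-point rate in MFLOP/s."""
    return fpus * flops_per_cycle * clock_mhz


def peak_bandwidth(bus_bits, bus_mhz, word_bytes=8):
    """Theoretical peak bus bandwidth as (MB/s, MW/s).

    word_bytes=8 is an assumption: one word is a 64-bit double.
    """
    mb_per_s = bus_bits / 8 * bus_mhz
    return mb_per_s, mb_per_s / word_bytes


def words_per_flop(fpus, flops_per_cycle, clock_mhz, bus_bits, bus_mhz):
    """Words the bus can deliver per peak flop (hypothetical helper)."""
    _, mw_per_s = peak_bandwidth(bus_bits, bus_mhz)
    return mw_per_s / peak_mflops(fpus, flops_per_cycle, clock_mhz)


if __name__ == "__main__":
    # Inputs taken from the slide: 1 FPU, 2 flops/cycle, 2.53 GHz; 32-bit bus at 533 MHz.
    print(peak_mflops(1, 2, 2530))
    print(peak_bandwidth(32, 533))
    print(words_per_flop(1, 2, 2530, 32, 533))
```

A ratio far below one word per flop suggests that kernels with little data reuse will be limited by memory rather than by the floating-point units.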
[Figure: memory hierarchy]
  Levels: Processor (Control, Datapath, Registers); On-Chip Cache; Level 2 and 3 Cache (SRAM); Main Memory (DRAM); Distributed Memory; Remote Cluster Memory; Secondary Storage (Disk); Tertiary Storage (Disk/Tape)
  Speed (ns): 1s; 10s; 100s; 100,000s (.1s ms); 10,000,000s (10s ms); 10,000,000s (10s ms); 10,000,000,000s (10s sec)
  Size (bytes): 100s; Ks; Ms; Gs; Ts
[Figure: MFLOP/s (axis 0 to 3500) for Vendor BLAS, ATLAS BLAS, and F77 BLAS]
[Figure: MFlop/s (axis 1000 to 8000) versus Size (100 to 1500). Series: Intel P4 2.53 GHz 32-bit w/SSE2; Intel P4 2.53 GHz 64-bit w/SSE2]
(PII 8 Way Cluster with 100 Mb/s switched network)
[Figure: communication topologies from the Root: Sequential, Binary, Binomial, Ring]

Message Size (bytes)   Optimal algorithm   Buffer Size (bytes)
8                      binomial            8
16                     binomial            16
32                     binary              32
64                     binomial            64
128                    binomial            128
256                    binomial            256
512                    binomial            512
1K                     sequential          1K
2K                     binary              2K
4K                     binary              2K
8K                     binary              2K
16K                    binary              4K
32K                    binary              4K
64K                    ring                4K
128K                   ring                4K
256K                   ring                4K
512K                   ring                4K
1M                     binary              4K
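One way to use the measurements in the table above is as a lookup keyed on message size. The sketch below is illustrative only: the function name is hypothetical, K is assumed to mean 1024 bytes, and rounding an unmeasured size up to the next measured size (or clamping to the 1M row) is an assumption, not something the slide specifies.

```python
import bisect

K = 1024  # assumption: "1K" on the slide means 1024 bytes

# (message size in bytes, optimal algorithm, buffer size in bytes), as measured
# on the PII 8-way cluster with a 100 Mb/s switched network.
TUNING_TABLE = [
    (8, "binomial", 8), (16, "binomial", 16), (32, "binary", 32),
    (64, "binomial", 64), (128, "binomial", 128), (256, "binomial", 256),
    (512, "binomial", 512), (1 * K, "sequential", 1 * K),
    (2 * K, "binary", 2 * K), (4 * K, "binary", 2 * K),
    (8 * K, "binary", 2 * K), (16 * K, "binary", 4 * K),
    (32 * K, "binary", 4 * K), (64 * K, "ring", 4 * K),
    (128 * K, "ring", 4 * K), (256 * K, "ring", 4 * K),
    (512 * K, "ring", 4 * K), (1024 * K, "binary", 4 * K),
]
_SIZES = [row[0] for row in TUNING_TABLE]


def choose_algorithm(message_bytes):
    """Return (algorithm, buffer_bytes) for a message size.

    Unmeasured sizes are rounded up to the next measured size; sizes beyond
    the table reuse the last row. Both rules are assumptions of this sketch.
    """
    i = bisect.bisect_left(_SIZES, message_bytes)
    i = min(i, len(TUNING_TABLE) - 1)
    _, algorithm, buffer_bytes = TUNING_TABLE[i]
    return algorithm, buffer_bytes


if __name__ == "__main__":
    for size in (8, 100, 1 * K, 64 * K, 1024 * K):
        print(size, choose_algorithm(size))
```

Because the best choice changes several times across the size range, a table produced by measurement on the target cluster is easier to maintain than a hand-written chain of thresholds.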
[Figure panels: Myrinet (fully connected); Gigabit Ethernet (fully connected)]
[Diagram: Natural Data (A, b) -> Structured Data (A', b') -> Structured Answer (x') -> Natural Answer (x)]
Can use Grid infrastructure (i.e., Globus/NWS), but doesn't have to.
[Table: Bandwidth, Latency, Load, CPU Performance, Memory]
[Figure: Ax=b using HPL, 16 Pentium III 550 MHz, TORC. GFlop/s versus Problem Size (N). Series: IP, Myrinet; GigCable (Cu); Fast Ethernet; GM, Myrinet; (shmem)]