Performance Optimization for Cluster Computing

Myrinet User's Group Conference, 12-14 May 2002, Vienna, Austria



Overview

Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?

[Figure: “Moore’s Law” plot of performance (log scale, 1 to 1000) versus time (1980 to 2001) for CPU and DRAM, showing the Processor-DRAM Memory Gap (latency).]

  • µProc: 60%/yr. (2X/1.5 yr)
  • DRAM: 9%/yr. (2X/10 yrs)
  • Processor-Memory Performance Gap: grows 50% / year
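The growth rates quoted on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the function names are not from the talk, and it assumes simple compound annual growth. It converts a yearly growth rate into a doubling time and computes how fast the processor-to-DRAM performance ratio widens.

```python
import math

CPU_GROWTH = 0.60   # processor performance growth per year (from the slide)
DRAM_GROWTH = 0.09  # DRAM performance growth per year (from the slide)


def doubling_time_years(annual_growth):
    """Years needed to double at a fixed compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def gap_growth(cpu_growth, dram_growth):
    """Annual growth of the CPU/DRAM performance ratio."""
    return (1 + cpu_growth) / (1 + dram_growth) - 1


def gap_after(years, cpu_growth=CPU_GROWTH, dram_growth=DRAM_GROWTH):
    """Factor by which the CPU/DRAM ratio has widened after `years` years."""
    return (1 + gap_growth(cpu_growth, dram_growth)) ** years


if __name__ == "__main__":
    print(f"CPU doubles every {doubling_time_years(CPU_GROWTH):.2f} years")
    print(f"DRAM doubles every {doubling_time_years(DRAM_GROWTH):.2f} years")
    print(f"gap grows {gap_growth(CPU_GROWTH, DRAM_GROWTH):.1%} per year")
    print(f"gap after 20 years: {gap_after(20):.0f}x")
```

Under these assumptions the ratio grows by roughly 47% per year, which the slide rounds to 50%. A 9% yearly rate corresponds to a doubling time of about 8 years, so the quoted "2X/10 yrs" appears to be a round figure as well.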


Optimizing Computation and Memory Use


Memory Hierarchy

[Figure: the memory hierarchy, from fastest and smallest to slowest and largest: processor (control, datapath, registers), on-chip cache, level 2 and 3 cache (SRAM), main memory (DRAM), distributed memory, remote cluster memory, secondary storage (disk), tertiary storage (disk/tape). Speed ranges from 1s of ns at the registers to 10,000,000,000s of ns (10s of seconds) for tertiary storage; size ranges from 100s of bytes to Ts.]

Motivation: Self Adapting Numerical Software (SANS) Effort

What is Self Adapting Performance Tuning of Software?

Self Adapting Numerical Software - SANS Effort

[Figure: a TUNING SYSTEM takes different algorithms and segment sizes as input and returns the best algorithm and segment size.]
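The tuning system in the diagram can be pictured as an empirical search: try every combination of algorithm and segment size, measure each, and keep the best. The following is a minimal sketch under that assumption, not the SANS implementation; `tune`, `wall_clock`, and the two toy summation routines are hypothetical names introduced for illustration.

```python
import itertools
import time


def wall_clock(fn, *args, repeats=3):
    """Best-of-`repeats` wall-clock time of fn(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def tune(algorithms, segment_sizes, measure):
    """Exhaustively search (algorithm, segment size) pairs.

    measure(name, segment_size) returns a cost; lower is better.
    Returns the (name, segment_size) pair with the lowest cost.
    """
    trials = itertools.product(algorithms, segment_sizes)
    return min(trials, key=lambda trial: measure(*trial))


# Two toy "algorithms" that sum a list segment by segment.
def sum_builtin(data, seg):
    return sum(sum(data[i:i + seg]) for i in range(0, len(data), seg))


def sum_loop(data, seg):
    total = 0
    for i in range(0, len(data), seg):
        for x in data[i:i + seg]:
            total += x
    return total


if __name__ == "__main__":
    data = list(range(100_000))
    algorithms = {"builtin": sum_builtin, "loop": sum_loop}
    best = tune(algorithms, [64, 1024, 16384],
                lambda name, seg: wall_clock(algorithms[name], data, seg))
    print("best (algorithm, segment size):", best)
```

Passing the measurement in as a function keeps the search separate from how a trial is timed, so the same loop could be driven by wall-clock time, a hardware counter, or a performance model.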


Software Generation Strategy - ATLAS BLAS


ATLAS (DGEMM n = 500)

[Figure: bar chart of DGEMM performance in MFLOP/S (0 to 3500) across architectures, comparing Vendor BLAS, ATLAS BLAS, and F77 BLAS.]

Pentium 4 - SSE2

Today’s “Sweet Spot” in Price/Performance


ATLAS Matrix Multiply, Intel Pentium 4 at 2.53 GHz, using SSE2

[Figure: MFlop/s (1000 to 8000) versus matrix size (100 to 1500) for two series: Intel P4 2.53 GHz 32-bit w/SSE2 and Intel P4 2.53 GHz 64-bit w/SSE2.]
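The slide does not say how the MFlop/s figures were computed. A common convention for an n-by-n matrix multiply, assumed here, is to count 2n^3 floating-point operations and divide by the elapsed time. The sketch below applies that convention to a plain Python multiply; `matmul` and `gemm_mflops` are illustrative names, and the rates it prints are far below what a tuned BLAS achieves.

```python
import time


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    # Transpose b once so each output entry is a dot product of two sequences.
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gemm_mflops(n, seconds):
    """Rate of an n-by-n matrix multiply, counting 2*n**3 operations."""
    return 2.0 * n ** 3 / seconds / 1e6


if __name__ == "__main__":
    n = 60
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    print(f"n = {n}: {gemm_mflops(n, elapsed):.1f} MFlop/s")
```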


Related Tuning Projects


Machine-Assisted Application Development and Adaptation


Work in Progress: SANS Approach Applied to Broadcast

(PII 8 Way Cluster with 100 Mb/s switched network)

[Figure: broadcast from the root using the sequential, binary, binomial, and ring algorithms.]
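The slide names four broadcast algorithms without giving their communication patterns. The sketch below lists textbook forms of each as (sender, receiver) pairs, assuming the root is rank 0; the exact trees used in the talk may differ, and the function names are illustrative.

```python
def sequential(p):
    """Root sends to every other rank, one after another."""
    return [(0, dst) for dst in range(1, p)]


def ring(p):
    """Each rank forwards to its successor, forming a chain from the root."""
    return [(r, r + 1) for r in range(p - 1)]


def binary(p):
    """Rank r forwards to its binary-tree children 2r+1 and 2r+2."""
    return [(r, c) for r in range(p) for c in (2 * r + 1, 2 * r + 2) if c < p]


def binomial_rounds(p):
    """Binomial tree, grouped by round.

    At the start of each round the first `have` ranks hold the data, and
    each of them sends to the rank `have` positions above it.
    """
    rounds, have = [], 1
    while have < p:
        rounds.append([(r, r + have) for r in range(have) if r + have < p])
        have *= 2
    return rounds


if __name__ == "__main__":
    p = 8
    print("sequential:", sequential(p))
    print("ring:      ", ring(p))
    print("binary:    ", binary(p))
    for k, sends in enumerate(binomial_rounds(p)):
        print(f"binomial round {k}:", sends)
```

All four deliver the message to the same p - 1 ranks; they differ in how many sends can proceed at once and in how much work falls on the root, which is why the best choice can change with message size and network.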

Message Size (bytes)    Optimal algorithm    Buffer Size (bytes)
8                       binomial             8
16                      binomial             16
32                      binary               32
64                      binomial             64
128                     binomial             128
256                     binomial             256
512                     binomial             512
1K                      sequential           1K
2K                      binary               2K
4K                      binary               2K
8K                      binary               2K
16K                     binary               4K
32K                     binary               4K
64K                     ring                 4K
128K                    ring                 4K
256K                    ring                 4K
512K                    ring                 4K
1M                      binary               4K
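Once such measurements exist, choosing a broadcast at run time reduces to a table lookup. The sketch below encodes the measured results for this 8-way cluster and picks an entry for an arbitrary message size. Two points are assumptions rather than part of the talk: 1K is taken to mean 1024 bytes, and sizes that fall between measured points use the next larger measured size.

```python
import bisect

K = 1024
M = 1024 * 1024

# (message size, optimal algorithm, buffer size), sizes in bytes.
TABLE = [
    (8, "binomial", 8), (16, "binomial", 16), (32, "binary", 32),
    (64, "binomial", 64), (128, "binomial", 128), (256, "binomial", 256),
    (512, "binomial", 512), (1 * K, "sequential", 1 * K),
    (2 * K, "binary", 2 * K), (4 * K, "binary", 2 * K),
    (8 * K, "binary", 2 * K), (16 * K, "binary", 4 * K),
    (32 * K, "binary", 4 * K), (64 * K, "ring", 4 * K),
    (128 * K, "ring", 4 * K), (256 * K, "ring", 4 * K),
    (512 * K, "ring", 4 * K), (1 * M, "binary", 4 * K),
]
_SIZES = [row[0] for row in TABLE]


def choose_broadcast(message_bytes):
    """Return (algorithm, buffer size) for a message of the given size."""
    # Index of the smallest measured size >= message_bytes, clamped to the
    # last row for messages larger than anything measured.
    i = min(bisect.bisect_left(_SIZES, message_bytes), len(TABLE) - 1)
    _, algorithm, buffer_bytes = TABLE[i]
    return algorithm, buffer_bytes


if __name__ == "__main__":
    for size in (8, 1000, 100 * K, 4 * M):
        print(size, choose_broadcast(size))
```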


CG Variants by Dynamic Selection at Run Time


LAPACK For Clusters

[Figure: cluster nodes connected by Myrinet (fully connected) and Gigabit Ethernet (fully connected).]

ScaLAPACK

How ScaLAPACK Works


Big Picture…

User has problem to solve (e.g. Ax = b)

  • The user supplies Natural Data (A, b).
  • Middleware turns it into Structured Data (A’, b’).
  • The Application Library (e.g. LAPACK, ScaLAPACK, PETSc, …) computes the Structured Answer (x’).
  • Middleware turns that back into the Natural Answer (x).
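The talk does not spell out what the structured data looks like. For ScaLAPACK it is a two-dimensional block-cyclic distribution over a process grid, so the sketch below uses that layout as one concrete example of the natural-to-structured conversion and back. The function names are hypothetical, the first block is assumed to live on process (0, 0), and everything runs in a single Python process with no message passing.

```python
def owner_and_local(g, nb, nprocs):
    """Map global index g to (process coordinate, local index).

    Blocks of nb consecutive indices are dealt out to processes round-robin.
    """
    block = g // nb
    return block % nprocs, (block // nprocs) * nb + g % nb


def distribute(matrix, nb, prow, pcol):
    """Split a dense matrix (list of rows) over a prow-by-pcol process grid."""
    local = {(r, c): {} for r in range(prow) for c in range(pcol)}
    for i, row in enumerate(matrix):
        pr, li = owner_and_local(i, nb, prow)
        for j, value in enumerate(row):
            pc, lj = owner_and_local(j, nb, pcol)
            local[(pr, pc)][(li, lj)] = value
    return local


def gather(local, n_rows, n_cols, nb, prow, pcol):
    """Inverse of distribute: rebuild the natural matrix from the pieces."""
    out = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        pr, li = owner_and_local(i, nb, prow)
        for j in range(n_cols):
            pc, lj = owner_and_local(j, nb, pcol)
            out[i][j] = local[(pr, pc)][(li, lj)]
    return out


if __name__ == "__main__":
    a = [[i * 5 + j for j in range(5)] for i in range(5)]
    pieces = distribute(a, nb=2, prow=2, pcol=2)
    for coord in sorted(pieces):
        print(coord, "holds", len(pieces[coord]), "entries")
    assert gather(pieces, 5, 5, 2, 2, 2) == a
```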

File System-based

[Figure: the user's data (A, b) is staged to disk. The library middleware performs resource selection by time function minimization, drawing on NWS, Autopilot, and MDS, and lays the problem out on a process grid (0,0; 0,1; …).]

Can use Grid infrastructure, i.e. Globus/NWS, but doesn’t have to.
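Resource selection by time function minimization can be illustrated with a toy model: predict the solve time for each candidate set of machines and keep the set with the smallest prediction. The time function below is a hypothetical stand-in, not the one used in the talk. It charges (2/3)n^3 operations at the pace of the slowest chosen machine, plus a made-up communication term of log2(p) times the cost of n latency-bound messages and one 8n^2-byte transfer. Only prefixes of the machines ranked by available speed are tried.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    mflops: float  # measured compute rate in MFlop/s
    load: float    # fraction of the CPU already busy, 0.0 to 1.0

    @property
    def available_mflops(self):
        return self.mflops * (1.0 - self.load)


def predicted_time(n, machines, latency_s, bandwidth_bytes_per_s):
    """Toy time function for an n-by-n dense solve on `machines` (seconds)."""
    p = len(machines)
    slowest = min(m.available_mflops for m in machines)
    compute = (2.0 / 3.0) * n ** 3 / (p * slowest * 1e6)
    if p == 1:
        return compute
    # Hypothetical communication cost; see the assumptions in the text.
    comm = math.log2(p) * (n * latency_s + 8.0 * n * n / bandwidth_bytes_per_s)
    return compute + comm


def select_resources(n, machines, latency_s, bandwidth_bytes_per_s):
    """Return the prefix of speed-ranked machines with the lowest prediction."""
    usable = [m for m in machines if m.available_mflops > 0]
    if not usable:
        raise ValueError("no machine has spare capacity")
    ranked = sorted(usable, key=lambda m: m.available_mflops, reverse=True)
    candidates = [ranked[:k] for k in range(1, len(ranked) + 1)]
    return min(candidates, key=lambda c: predicted_time(
        n, c, latency_s, bandwidth_bytes_per_s))


if __name__ == "__main__":
    nodes = [Machine(f"node{i}", 900.0, 0.0) for i in range(3)]
    nodes.append(Machine("busy", 900.0, 0.9))
    for n in (500, 8000):
        chosen = select_resources(n, nodes, 1e-4, 12.5e6)
        print(n, [m.name for m in chosen])
```

In this model a small problem stays on one machine because communication dominates, a large one spreads out, and a heavily loaded machine is left out because it would slow every step down.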

Resource Selector

[Figure: candidate machines and the links between them, characterized by bandwidth, latency, load, CPU performance, and memory.]

Ax = b, Cluster of 8 Pentium III 933 MHz

[Figure: time to solution (log scale, 0.1 to 10000) versus problem size.]

LAPACK For Clusters (LFC)

TORC Cluster


MPI bytes broadcast vs. time, TORC

[Figure: time in seconds (20 to 80) versus bytes broadcast (2E+06 to 1E+08) for four series: IP, GigCable(Cu); Fast Ethernet; Myrinet; IP, Myrinet.]

BLACS broadcast results, TORC [p,q]=[1,16]

[Figure: time in seconds (50 to 100) versus bytes broadcast (2E+06 to 1E+08) for four series: IP + GigCable(Cu); Fast Ethernet; Myrinet; IP, Myrinet.]

Ax=b using HPL, 16 Pentium III 550 MHz, TORC

[Figure: GFlop/s (1 to 6) versus problem size (N) for four series: IP,Myrinet; GigCable(Cu); Fast Ethernet; GM,Myrinet (shmem).]

Tools for Performance Evaluation


Performance Counters


Performance Data That May Be Available


PAPI - Supported Processors


Early Users of PAPI

What is DynaProf?

Dynamic Instrumentation:

Perfometer/DynaProf

Next Version of Perfometer Implementation

[Figure: a GUI connected through a server to several applications.]

PAPI’s Parallel Interface


Futures for Numerical Algorithms and Software on Clusters and Grids


Collaborators
