PROGRAMMING FOR PARALLELISM AND LOCALITY WITH HTA’s



slide-1
SLIDE 1

PROGRAMMING FOR PARALLELISM AND LOCALITY WITH HTA’s

PAPER PUBLISHED AT PPOPP, MARCH 2006

Written at UIUC (1), Universidade da Coruña (2) and IBM T.J. Watson Research Center (3) by Ganesh Bikshandi (1), Jia Guo, Daniel Hoeflinger (1), Gheorghe Almasi (3), Basilio B. Fraguela (2), María J. Garzarán (1), David Padua (1) and Christoph von Praun (3)

PRESENTATION BY ROMAN FRIGG

NOV 30

slide-2
SLIDE 2

PROGRAMMING TODAY’S SYSTEMS

  • Scalability, portability, productivity
  • Parallelism
  • Locality
  • Abstractions
  • HTA’s

slide-7
SLIDE 7

CLASSIFICATION

From LIBRARIES to LANGUAGES: MPI/PVM, GAS, POET, POOMA, X10, CAF, ZPL, TITANIUM, UPC, HPF

HTA

  • Library
  • Matlab & C++
  • Single threaded, global view
slide-10
SLIDE 10

TALK OVERVIEW

  1 INTRO
  2 HOW HTA’s WORK
  3 HTA OPERATIONS & APPLICATIONS
  4 EVALUATION
  5 CONCLUSIONS

slide-15
SLIDE 15

RECURSIVE TILING

Section 2: HOW HTA’s WORK

[Figure: a recursively tiled array with tiling levels labelled distributed, local, local. Image source: paper.]
slide-16
SLIDE 16

CONSTRUCT HTA FROM 6x6 MATRIX

T1 = hta(M, {[1 3 5], [1 3 5]})

T2 = hta(T1, {[1 2], [1 3]}, [2 2])

[Figure: the 6x6 matrix M is cut at rows 1, 3, 5 and columns 1, 3, 5 to form T1; T1 is cut at tile rows 1, 2 and tile columns 1, 3 to form T2, whose tiles are assigned to processors P1, P2, P3, P4.]
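The construction above can be mimicked in a few lines of plain Python. This is only a sketch: the helper name `tile` and the list-of-lists representation are assumptions made for illustration and are not part of the HTA toolbox. The partition vectors are treated as 1-based start indices of each tile, as on the slide.

```python
def tile(matrix, row_starts, col_starts):
    """Split a list-of-lists matrix into a grid of tiles.

    row_starts / col_starts hold the 1-based index at which each tile
    begins, like the partition vectors [1 3 5] on the slide.
    (Hypothetical helper, not the real hta() constructor.)
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_bounds = [s - 1 for s in row_starts] + [n_rows]
    col_bounds = [s - 1 for s in col_starts] + [n_cols]
    tiles = []
    for r0, r1 in zip(row_bounds, row_bounds[1:]):
        tile_row = []
        for c0, c1 in zip(col_bounds, col_bounds[1:]):
            tile_row.append([row[c0:c1] for row in matrix[r0:r1]])
        tiles.append(tile_row)
    return tiles


# A 6x6 matrix with distinct entries so tiles are easy to recognise.
M = [[6 * r + c for c in range(6)] for r in range(6)]

# First level: 3x3 grid of 2x2 tiles, like T1 on the slide.
T1 = tile(M, [1, 3, 5], [1, 3, 5])

# Second level: group the tiles of T1 into a 2x2 grid, like T2.
# The processor mesh argument [2 2] is not modelled here.
T2 = tile(T1, [1, 2], [1, 3])
```

The same helper is reused for both levels because a grid of tiles is itself just a matrix whose elements happen to be tiles, which is the recursive idea behind HTAs.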

slide-31
SLIDE 31

HTA ACCESS

C = [Figure: a two-level HTA with the accessed regions highlighted]

C(1:2,3:6)

C{2,1}{1,2}(2,2)

C(6,4) = C{2,1}(2,4) =
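The relation between flat indexing with `( )` and tile indexing with `{ }` can be sketched as an index conversion. The function `locate` and the partition vector `[1, 3, 5]` below are assumptions chosen for illustration; they do not reproduce the tiling of the HTA `C` in the figure.

```python
from bisect import bisect_right


def locate(index, starts):
    """Map a 1-based global index to (tile number, index inside that tile).

    starts is a sorted list of the 1-based positions where tiles begin.
    (Hypothetical helper for illustration only.)
    """
    tile = bisect_right(starts, index)  # tiles that start at or before index
    return tile, index - starts[tile - 1] + 1


row_starts = [1, 3, 5]  # assumed partition: three tiles of two rows each
col_starts = [1, 3, 5]  # assumed partition: three tiles of two columns each

tile_row, local_row = locate(6, row_starts)
tile_col, local_col = locate(4, col_starts)

# With this assumed tiling, the flat access X(6,4) would correspond to
# the tile access X{tile_row, tile_col}(local_row, local_col).
print((tile_row, tile_col), (local_row, local_col))
```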

slide-48
SLIDE 48

ASSIGNMENTS & BINARY OPERATORS

VALID OPERATIONS

  • Scalar
  • Array
  • HTA

slide-49
SLIDE 49

ASSIGNMENTS & BINARY OPERATORS

VALID OPERATION?

  • 4x4 HTA * 2x3 Array: valid
  • 4x4 HTA * 3x2 Array: not valid
  • 4x4 HTA * Scalar: valid
  • 4x4 HTA * 4x4 HTA: valid
  • 4x4 HTA = 4x4 HTA: not valid

slide-61
SLIDE 61

TWO KINDS OF OPERATIONS

Section 3: HTA OPERATIONS & APPLICATIONS

  • GLOBAL COMPUTATIONS: f(x) on the whole HTA, or g(x) on each tile with parHTA(@g(x), H)
  • COMMUNICATION OPERATIONS: Assignments, repmat, circshift, permute

[Figure: tiles of an HTA labelled with their processors.]
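The per-tile pattern behind `parHTA(@g(x), H)` can be sketched in Python as "apply the same function to every tile, one task per tile". The name `par_apply`, the thread pool and the nested-list tiles are assumptions for illustration; the real toolbox runs tiles on the processors that own them.

```python
from concurrent.futures import ThreadPoolExecutor


def par_apply(g, tiles):
    """Apply g to every tile of a 2-D grid of tiles, one task per tile.

    Hypothetical stand-in for the per-tile operation on the slide.
    """
    flat = [t for tile_row in tiles for t in tile_row]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(g, flat))  # map keeps the input order
    n_cols = len(tiles[0])
    return [results[i:i + n_cols] for i in range(0, len(results), n_cols)]


def tile_sum(t):
    """Example g(x): sum of all elements of one tile."""
    return sum(sum(row) for row in t)


# A 2x2 grid of 2x2 tiles.
H = [
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    [[[1, 1], [1, 1]], [[0, 0], [0, 2]]],
]

sums = par_apply(tile_sum, H)
print(sums)
```

Because `g` only ever sees one tile, no communication between tiles is needed, which is what separates this kind of operation from `circshift` or `permute`.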

slide-68
SLIDE 68

CANNON’S ALGORITHM

function C = cannon(A,B,C)
  % Initialization
  for i=2:m
    A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
    B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
  end
  % Iteration
  for k=1:m
    C = C + A * B;
    A = circshift(A, [0, -1]);
    B = circshift(B, [-1, 0]);
  end

slide-75
SLIDE 75

Initialization

for i=2:m
  A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
  B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
end

[Figure: 3x3 grids of tiles A11 to A33 and B11 to B33, animated for i=2 and i=3. Tile row i of A is shifted left by i-1 positions and tile column i of B is shifted up by i-1 positions.]

slide-86
SLIDE 86

Iteration

for k=1:m
  C = C + A * B;
  A = circshift(A, [0, -1]);
  B = circshift(B, [-1, 0]);
end

[Figure: 3x3 grids of tiles C11 to C33, the skewed A (first tile column A11, A22, A33) and the skewed B (first tile row B11, B22, B33), animated for k=1 and k=2.]
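The two phases of the slide code can be checked with a small plain-Python sketch. It assumes that `A * B` on two HTAs means a tile-by-tile matrix product and that `m` is the number of tiles per dimension; the helpers `matmul`, `matadd`, `to_tiles` and `from_tiles` are hypothetical and exist only to make the example self-contained.

```python
def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def matadd(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def to_tiles(M, b):
    """Cut a square matrix into a grid of b x b tiles."""
    n = len(M)
    return [[[row[c:c + b] for row in M[r:r + b]] for c in range(0, n, b)]
            for r in range(0, n, b)]


def from_tiles(T):
    """Reassemble a grid of tiles into one matrix."""
    return [sum((t[k] for t in tile_row), [])
            for tile_row in T for k in range(len(tile_row[0]))]


def cannon(A, B, C):
    """Cannon's algorithm on m x m grids of tiles (sequential sketch)."""
    m = len(A)
    # Initialization: shift tile row i of A left by i and tile column j
    # of B up by j (0-based), the same skew as circshift in the slide code.
    A = [[A[i][(j + i) % m] for j in range(m)] for i in range(m)]
    B = [[B[(i + j) % m][j] for j in range(m)] for i in range(m)]
    for _ in range(m):
        # Tile-by-tile multiply and accumulate.
        C = [[matadd(C[i][j], matmul(A[i][j], B[i][j])) for j in range(m)]
             for i in range(m)]
        A = [[A[i][(j + 1) % m] for j in range(m)] for i in range(m)]  # left by one
        B = [[B[(i + 1) % m][j] for j in range(m)] for i in range(m)]  # up by one
    return C


X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
Z = [[0] * 4 for _ in range(4)]

result = from_tiles(cannon(to_tiles(X, 2), to_tiles(Y, 2), to_tiles(Z, 2)))
print(result == matmul(X, Y))
```

In a distributed setting each grid position would live on its own processor, so the two shifts are the only communication and every multiply is local.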

slide-96
SLIDE 96

NASA ADVANCED SUPERCOMPUTING BENCHMARK

Section 4: EVALUATION

        EP (CLASS C)    FT (CLASS B)    CG (CLASS C)    MG (CLASS B)    LU (CLASS B)
Nprocs  F+MPI   M+HTA   F+MPI   M+HTA   F+MPI   M+HTA   F+MPI   M+HTA   F+MPI   M+HTA
1       901.6   3556.9  136.8   657.4   3606.9  3812.0  26.9    828.0   15.7    245.1
4       273.1   888.8   109.1   274.0   362.0   1750.9  17.0    273.8   6.3     60.5
8       136.3   447.0   65.5    159.3   123.4   823.6   9.6     151.3   2.9     29.9
16      68.6    224.8   37.2    87.2    89.5    375.2   4.8     87.0    1.2     16.0
32      34.7    112.0   20.7    42.9    48.4    250.3   3.3     54.9    1.1     9.8
64      17.1    56.7    10.4    24.0    44.5    148.0   1.6     50.4    1.3     7.1
128     8.5     29.1    5.9     15.6    30.8    123.0   1.4     38.5    1.6     N/A

F+MPI = Fortran+MPI, M+HTA = Matlab+HTA

Table 1. Execution times in seconds for some of the applications in the NAS benchmarks for Fortran+MPI versus MATLAB+HTA. Image source: paper.

Too many numbers!
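Two numbers summarise the table: the sequential speed of Matlab+HTA relative to Fortran+MPI (ratio of the one-processor times) and the speedup of each version over its own one-processor time. The sketch below computes both from times copied out of Table 1; the dictionary layout and function names are assumptions for illustration. The rounded ratios appear to match the percentages shown on the chart slides (25 %, 21 %, 95 %, 3 %, 6 %).

```python
# Execution times in seconds copied from Table 1.
# LU uses 64 processors because the 128-processor HTA entry is N/A.
times = {
    "EP": {"mpi": {1: 901.6, 128: 8.5}, "hta": {1: 3556.9, 128: 29.1}},
    "FT": {"mpi": {1: 136.8, 128: 5.9}, "hta": {1: 657.4, 128: 15.6}},
    "CG": {"mpi": {1: 3606.9, 128: 30.8}, "hta": {1: 3812.0, 128: 123.0}},
    "MG": {"mpi": {1: 26.9, 128: 1.4}, "hta": {1: 828.0, 128: 38.5}},
    "LU": {"mpi": {1: 15.7, 64: 1.3}, "hta": {1: 245.1, 64: 7.1}},
}


def sequential_ratio(bench):
    """Matlab+HTA sequential speed as a percentage of Fortran+MPI."""
    return 100 * bench["mpi"][1] / bench["hta"][1]


def speedup(series, procs):
    """Speedup on procs processors relative to the same code on one."""
    return series[1] / series[procs]


for name, bench in times.items():
    procs = max(bench["mpi"])
    print(
        name,
        f"sequential {sequential_ratio(bench):.0f} %",
        f"MPI x{speedup(bench['mpi'], procs):.1f}",
        f"HTA x{speedup(bench['hta'], procs):.1f}",
        f"on {procs} processors",
    )
```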

slide-98
SLIDE 98

EP (embarrassingly parallel)

[Figure: speedup factor versus number of processors (32, 64, 96, 128) for Matlab+HTA and Fortran+MPI.]

Sequential speed: Matlab+HTA 25 %, Fortran+MPI 100 %

LINEAR SPEEDUP

128 3.2 GHz Intel Xeons, Gigabit Ethernet

slide-100
SLIDE 100

FFT (fast Fourier transform)

[Figure: speedup factor versus number of processors (32, 64, 96, 128) for Matlab+HTA and Fortran+MPI.]

Sequential speed: Matlab+HTA 21 %, Fortran+MPI 100 %

HTA’s SCALE BETTER

128 3.2 GHz Intel Xeons, Gigabit Ethernet

slide-102
SLIDE 102

CG (conjugate gradient)

[Figure: speedup factor versus number of processors (32, 64, 96, 128) for Matlab+HTA and Fortran+MPI.]

Sequential speed: Matlab+HTA 95 %, Fortran+MPI 100 %

MPI: SUPER LINEAR SPEEDUP

128 3.2 GHz Intel Xeons, Gigabit Ethernet

slide-104
SLIDE 104

MG (multi grid)

[Figure: speedup factor versus number of processors (32, 64, 96, 128) for Matlab+HTA and Fortran+MPI.]

Sequential speed: Matlab+HTA 3 %, Fortran+MPI 100 %

HTA’s SLOW

128 3.2 GHz Intel Xeons, Gigabit Ethernet

slide-106
SLIDE 106

LU (LU factorization)

[Figure: speedup factor versus number of processors (32, 64, 96, 128) for Matlab+HTA and Fortran+MPI; no data for 128 processors.]

Sequential speed: Matlab+HTA 6 %, Fortran+MPI 100 %

SLOW AGAIN

128 3.2 GHz Intel Xeons, Gigabit Ethernet

slide-109
SLIDE 109

PERFORMANCE OF C++ HTA’s

MMM (matrix-matrix multiplication)

[Figure: MFLOPS (1000 to 4000) versus matrix size (504, 1008, 2016, 3024, 4032) for Naive 3 loops, HTA naive, Tiled 6 loops, HTA+ATLAS, ATLAS and Intel MKL. Annotation: 8-13.5%.]

Intel Pentium 4, 3.0 GHz, 8KB L1 cache
slide-111
SLIDE 111

LINES OF CODE COMPARISON

[Figure: lines of code (axis from 200 to 1200) of the HTA and MPI versions of ep, cg, mg, ft and lu, split into Computation, Communication and Data Decomposition. Image source: paper.]

slide-113
SLIDE 113

CONCLUSIONS

Section 5: CONCLUSIONS

  • Scalability, portability, productivity
  • HTA’s

slide-114
SLIDE 114

FURTHER INFORMATION

http://polaris.cs.uiuc.edu/hta/

slide-115
SLIDE 115

THANKS FOR YOUR ATTENTION.

slide-116
SLIDE 116

Q&A

PUT YOUR QUESTIONS