Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms (PowerPoint PPT Presentation)



SLIDE 1

Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

Jean-Noël Quintin, Khalid Hasanov, Alexey Lastovetsky

Heterogeneous Computing Laboratory School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland http://hcl.ucd.ie

2013



SLIDE 3

Outline

Problem Outline
  Motivation and Introduction
  Previous Work: SUMMA
  Our Work: HSUMMA
Experiments
  Experiments on Grid5000
  Experiments on BlueGene




SLIDE 6

Motivation

◮ The majority of HPC algorithms for scientific applications were introduced between the 1970s and the 1990s.

◮ They were designed for and tested on up to hundreds (a few thousand at most) of processors.

◮ In June 1995, the number of cores in the top 10 supercomputers ranged from 42 to 3680 (see http://www.top500.org/).

◮ Nowadays, in June 2013, this number ranges from 147,456 to 3,120,000.


SLIDE 7

Motivation

The increasing scale of HPC platforms raises new research questions that need to be addressed:

◮ Scalability
◮ Communication cost
◮ Energy efficiency
◮ etc.


SLIDE 8

Introduction

We focus on the communication cost of scientific applications on large-scale distributed memory platforms.

◮ Example application: parallel matrix multiplication.
◮ Why matrix multiplication?
  ◮ Matrix multiplication is important in its own right as a computational kernel of many scientific applications.
  ◮ It is a popular representative of other scientific applications.
  ◮ If an optimization method works well for matrix multiplication, it will also work well for many other related scientific applications.


SLIDE 9

Introduction

◮ Example algorithm: SUMMA, the Scalable Universal Matrix Multiplication Algorithm.
  ◮ Introduced by Robert A. van de Geijn and Jerrell Watts, University of Texas at Austin, 1995.
  ◮ Implemented in ScaLAPACK.

SLIDE 10

Outline

Problem Outline
  Motivation and Introduction
  Previous Work: SUMMA
  Our Work: HSUMMA
Experiments
  Experiments on Grid5000
  Experiments on BlueGene


SLIDE 11

SUMMA

[Figure: two 6×6 processor grids (P = 36) showing the pivot column of blocks A^b_{•k} of matrix A and the pivot row of blocks B^b_{k•} of matrix B at step k.]

◮ Number of steps: n/b (n×n matrices, b = block size, √P × √P processor grid, P = 36).
◮ The pivot column A^b_{•k} of (n/√P)×b blocks of matrix A is broadcast horizontally.
◮ The pivot row B^b_{k•} of b×(n/√P) blocks of matrix B is broadcast vertically.
◮ Then each (n/√P)×(n/√P) block c_ij of matrix C is updated: c_ij = c_ij + a_ik × b_kj.
◮ Size of data broadcast vertically and horizontally in each step: 2·(n/√P)·b.
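The loop below is a serial Python sketch, added for illustration (it is not part of the original deck), of the blocked structure SUMMA parallelizes: at step k, the pivot panels that SUMMA would broadcast along processor rows and columns update all blocks of C.

```python
import numpy as np

def summa_serial(A, B, b):
    """Serial sketch of SUMMA's blocked structure (n/b steps)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        A_panel = A[:, k:k+b]   # pivot column A^b_{.k}: broadcast along rows
        B_panel = B[k:k+b, :]   # pivot row    B^b_{k.}: broadcast along columns
        C += A_panel @ B_panel  # c_ij = c_ij + a_ik x b_kj, blockwise
    return C

n, b = 512, 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_serial(A, B, b), A @ B)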

SLIDE 12

Outline

Problem Outline
  Motivation and Introduction
  Previous Work: SUMMA
  Our Work: HSUMMA
Experiments
  Experiments on Grid5000
  Experiments on BlueGene


SLIDE 13

Our Contribution

◮ We introduce an application-level hierarchical optimization of SUMMA.
◮ Hierarchical SUMMA (HSUMMA) is a platform-independent optimization of SUMMA.
◮ We theoretically and experimentally show that HSUMMA reduces the communication cost of SUMMA.



SLIDE 15

SUMMA vs HSUMMA. Arrangement of Processors

[Figure: SUMMA uses a flat 6×6 processor grid P00–P55; HSUMMA arranges the same 36 processors into a 3×3 grid of groups, each group being a 2×2 processor grid.]


SLIDE 16

Horizontal Communications Between Groups in HSUMMA

[Figure: the pivot column A^M_{•k} of matrix A is broadcast horizontally between the processor groups.]

◮ P = number of processors (P = 36)
◮ G = number of groups (G = 9)
◮ √P × √P processor grid
◮ √G × √G grid of processor groups
◮ M = block size between groups
◮ n/M = number of steps
◮ Size of data broadcast horizontally in each step: (n×M)/√P

The pivot column A^M_{•k} of (n/√P)×M blocks of matrix A is broadcast horizontally between groups.

SLIDE 17

Horizontal Communications Inside Groups in HSUMMA

[Figure: the local pivot column A^b_{•k} is broadcast horizontally inside each group.]

◮ (√P/√G) × (√P/√G) processor grid inside each group
◮ b = block size inside one group
◮ M/b = number of steps inside one group
◮ n/M = number of steps between groups
◮ Size of data broadcast horizontally in each step: (n×b)/√P

Upon receipt of the pivot column data from the other groups, the local pivot column A^b_{•k} (b ≤ M) of (n/√P)×b blocks of matrix A is broadcast horizontally inside each group.

SLIDE 18

Vertical Communications Between Groups in HSUMMA

[Figure: the pivot row B^M_{k•} of matrix B is broadcast vertically between the processor groups.]

◮ P = number of processors (P = 36)
◮ G = number of groups (G = 9)
◮ √P × √P processor grid
◮ √G × √G grid of processor groups
◮ M = block size between groups
◮ n/M = number of steps
◮ Size of data broadcast vertically in each step: (n×M)/√P

The pivot row B^M_{k•} of M×(n/√P) blocks of matrix B is broadcast vertically between groups.

SLIDE 19

Vertical Communications Inside Groups in HSUMMA

[Figure: the local pivot row B^b_{k•} is broadcast vertically inside each group.]

◮ (√P/√G) × (√P/√G) processor grid inside each group
◮ b = block size inside one group
◮ M/b = number of steps inside one group
◮ n/M = number of steps between groups
◮ Size of data broadcast vertically in each step: (n×b)/√P

Upon receipt of the pivot row data from the other groups, the local pivot row B^b_{k•} (b ≤ M) of b×(n/√P) blocks of matrix B is broadcast vertically inside each group.
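To make the two-level schedule of the last four slides concrete, here is a serial Python sketch (an illustration, not the authors' implementation): the outer loop corresponds to the n/M steps between groups, the inner loop to the M/b steps inside each group.

```python
import numpy as np

def hsumma_serial(A, B, M, b):
    """Serial sketch of HSUMMA's two-level blocking (b <= M, M % b == 0)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for K in range(0, n, M):       # n/M steps: panels broadcast between groups
        A_outer = A[:, K:K+M]      # pivot column A^M_{.K}
        B_outer = B[K:K+M, :]      # pivot row    B^M_{K.}
        for k in range(0, M, b):   # M/b steps: sub-panels broadcast inside groups
            C += A_outer[:, k:k+b] @ B_outer[k:k+b, :]
    return C

n, M, b = 512, 128, 32
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(hsumma_serial(A, B, M, b), A @ B)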

SLIDE 20

Communication Model to Analyse SUMMA and HSUMMA

Time of sending a message of size m between two processors:

    α + m·β    (1)

Here,
◮ α = latency
◮ β = reciprocal bandwidth
◮ m = message size


SLIDE 21

General Broadcast Model to Analyse SUMMA and HSUMMA

We use a general broadcast model that covers all homogeneous broadcast algorithms, such as:

◮ flat
◮ binary
◮ binomial
◮ linear
◮ scatter-allgather

    Tbcast(m, p) = L(p)·α + m·W(p)·β    (2)


SLIDE 22

General Broadcast Model

    Tbcast(m, p) = L(p)·α + m·W(p)·β

Assumptions:
◮ L(1) = 0 and W(1) = 0
◮ L(p) and W(p) are monotonic and differentiable functions in the interval (1, p)
◮ their first derivatives are constants or monotonic in the interval (1, p)

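For concreteness (an addition, using standard textbook cost values rather than numbers from the slides), two of the listed algorithms instantiate L(p) and W(p) as follows: a binomial tree forwards the whole message over about log₂(p) rounds, while scatter-allgather pays roughly log₂(p) + p − 1 latency units but only about 2(1 − 1/p) message volumes.

```python
from math import ceil, log2

def t_bcast(m, p, alpha, beta, L, W):
    """General broadcast model: Tbcast(m, p) = L(p)*alpha + m*W(p)*beta."""
    return L(p) * alpha + m * W(p) * beta

# Assumed textbook instantiations (both satisfy L(1) = W(1) = 0):
L_binomial = lambda p: ceil(log2(p)) if p > 1 else 0
W_binomial = lambda p: ceil(log2(p)) if p > 1 else 0

L_scatter_allgather = lambda p: log2(p) + p - 1 if p > 1 else 0
W_scatter_allgather = lambda p: 2 * (p - 1) / p if p > 1 else 0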

SLIDE 23

SUMMA and HSUMMA with General Broadcast Model

◮ SUMMA:

    T_S(n, p) = 2·( (n/b)·L(√p)·α + (n²/√p)·W(√p)·β )    (3)

◮ HSUMMA:

    T_HS(n, p, G) = T_HSl(n, p, G) + T_HSb(n, p, G)    (4)

  Here G ∈ [1, p], and we take b = M for simplicity.

◮ T_HSl is the latency cost:

    T_HSl(n, p, G) = 2·(n/b)·( L(√G) + L(√p/√G) )·α    (5)

◮ T_HSb is the bandwidth cost:

    T_HSb(n, p, G) = 2·(n²/√p)·( W(√G) + W(√p/√G) )·β    (6)

SUMMA is a special case of HSUMMA when G = 1 or G = p.
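A direct Python transcription of Eqs. (3)–(6), added here for illustration, with the broadcast model passed in as the functions L and W:

```python
from math import sqrt

def t_summa(n, p, b, alpha, beta, L, W):
    # Eq. (3)
    q = sqrt(p)
    return 2 * ((n / b) * L(q) * alpha + (n**2 / q) * W(q) * beta)

def t_hsumma(n, p, G, b, alpha, beta, L, W):
    # Eqs. (4)-(6), with b = M as on the slide
    q, g = sqrt(p), sqrt(G)
    t_lat = 2 * (n / b) * (L(g) + L(q / g)) * alpha    # Eq. (5)
    t_bw = 2 * (n**2 / q) * (W(g) + W(q / g)) * beta   # Eq. (6)
    return t_lat + t_bw                                # Eq. (4)
```

Since L(1) = W(1) = 0, setting G = 1 (or G = p) makes t_hsumma coincide with t_summa, matching the special-case remark above.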

SLIDE 24

Optimal Number of Groups in HSUMMA with General Broadcast Model

Derivative of the communication cost function of HSUMMA with the general broadcast model:

    ∂T_HS/∂G = (n/b)·L₁(p, G)·α + (n²/√p)·W₁(p, G)·β    (7)

Here, L₁(p, G) and W₁(p, G) are defined as follows:

    L₁(p, G) = (∂L(√G)/∂√G)·(1/√G) − (∂L(√p/√G)/∂(√p/√G))·(√p/(G·√G))    (8)

    W₁(p, G) = (∂W(√G)/∂√G)·(1/√G) − (∂W(√p/√G)/∂(√p/√G))·(√p/(G·√G))    (9)

If G = √p then L₁(p, G) = 0 and W₁(p, G) = 0. Thus, ∂T_HS/∂G = 0.
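A quick symbolic check (an illustration; it instantiates the general model with assumed scatter-allgather costs, the same ones tabulated later in the talk) confirms that the derivative vanishes at G = √p:

```python
import sympy as sp

n, p, b, G, alpha, beta = sp.symbols('n p b G alpha beta', positive=True)

# Assumed scatter-allgather instantiation of the general model:
L = lambda q: sp.log(q, 2) + q - 1
W = lambda q: 2 * (q - 1) / q

# T_HS from Eqs. (4)-(6), with b = M
T_HS = (2 * (n / b) * (L(sp.sqrt(G)) + L(sp.sqrt(p) / sp.sqrt(G))) * alpha
        + 2 * (n**2 / sp.sqrt(p)) * (W(sp.sqrt(G)) + W(sp.sqrt(p) / sp.sqrt(G))) * beta)

dT_dG = sp.diff(T_HS, G)
print(sp.simplify(dT_dG.subs(G, sp.sqrt(p))))  # prints 0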

SLIDE 25

Optimal Number of Groups in HSUMMA with General Broadcast Model

◮ T_HS has an extremum at some G ∈ (1, p).
◮ G = √p is the extremum point.
◮ Depending on α and β:
  ◮ this extremum can be a minimum, which means that HSUMMA always outperforms SUMMA;
  ◮ or a maximum, which means that HSUMMA has the same performance as SUMMA.


SLIDE 26

Theoretical Prediction by Using Scatter-Allgather Broadcast

Algorithm: SUMMA
◮ Computation cost: 2n³/p
◮ Latency factor: (log₂(p) + 2(√p − 1)) × n/b
◮ Bandwidth factor: 4(1 − 1/√p) × n²/√p

Algorithm: HSUMMA
◮ Computation cost: 2n³/p
◮ Latency factor inside groups: (log₂(p/G) + 2(√p/√G − 1)) × n/b
◮ Latency factor between groups: (log₂(G) + 2(√G − 1)) × n/M
◮ Bandwidth factor inside groups: 4(1 − √G/√p) × n²/√p
◮ Bandwidth factor between groups: 4(1 − 1/√G) × n²/√p

SLIDE 27

Optimal Number of Groups with Scatter-Allgather Broadcast

    ∂T_HS/∂G = ((G − √p)/(G·√G)) × ( (n/b)·α − 2·(n²/p)·β )    (10)

If G = √p then ∂T_HS/∂G = 0.

◮ If α/β > 2nb/p, then G = √p is the minimum of T_HS.
◮ If α/β < 2nb/p, then G = √p is the maximum of T_HS. In this case the function attains its minimum at either G = 1 or G = p.
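In code, this case analysis reads as follows (an illustrative helper added here; optimal_groups is a hypothetical name, not from the deck):

```python
def optimal_groups(n, p, b, alpha, beta):
    """Choose G for HSUMMA under the scatter-allgather broadcast model:
    if alpha/beta > 2*n*b/p, the stationary point G = sqrt(p) minimizes
    T_HS; otherwise the minimum is at G = 1 or G = p, where HSUMMA
    reduces to plain SUMMA."""
    if alpha / beta > 2 * n * b / p:
        return round(p ** 0.5)
    return 1  # G = p is equivalent: both ends reduce HSUMMA to SUMMA
```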

SLIDE 28

Optimal Number of Groups with Scatter-Allgather Broadcast

Algorithm: SUMMA
◮ Computation cost: 2n³/p
◮ Latency factor: (log₂(p) + 2(√p − 1)) × n/b
◮ Bandwidth factor: 4(1 − 1/√p) × n²/√p

Algorithm: HSUMMA
◮ Computation cost: 2n³/p
◮ Latency factor inside groups: (log₂(p/G) + 2(√p/√G − 1)) × n/b
◮ Latency factor between groups: (log₂(G) + 2(√G − 1)) × n/M
◮ Bandwidth factor inside groups: 4(1 − √G/√p) × n²/√p
◮ Bandwidth factor between groups: 4(1 − 1/√G) × n²/√p

Algorithm: HSUMMA (G = √p, b = M)
◮ Computation cost: 2n³/p
◮ Latency factor: (log₂(p) + 4(⁴√p − 1)) × n/b
◮ Bandwidth factor: 8(1 − 1/⁴√p) × n²/√p

SLIDE 29

Theoretical Prediction on Future Exascale Platforms by Using Scatter-Allgather Broadcast

[Figure: predicted execution time (sec) of SUMMA and HSUMMA vs. number of groups G, for G from 2⁻² to 2²².]

◮ Total flop rate (γ): 10¹⁸ FLOPS
◮ Latency: 500 ns
◮ Bandwidth: 100 GB/s
◮ Problem size: n = 2²²
◮ Number of processors: p = 2²⁰
◮ Block size: b = M = 256

Prediction of SUMMA and HSUMMA on Exascale. (The parameters were taken from: Report on Exascale Architecture, IESP Meeting, April 12, 2012.)
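The communication component of this prediction can be evaluated directly from the model. The sketch below rests on stated assumptions: β is taken per 8-byte element at 100 GB/s, the scatter-allgather costs are the ones tabulated earlier, and computation time is left out, so the absolute numbers are indicative only.

```python
from math import log2, sqrt

alpha = 500e-9      # latency: 500 ns
beta = 8 / 100e9    # assumed: seconds per 8-byte element at 100 GB/s
n, p, b = 2**22, 2**20, 256   # problem size, processors, b = M

L = lambda q: log2(q) + q - 1 if q > 1 else 0
W = lambda q: 2 * (q - 1) / q if q > 1 else 0

def t_comm(G):
    """Communication time of HSUMMA; G = 1 (or G = p) is plain SUMMA."""
    q, g = sqrt(p), sqrt(G)
    return (2 * (n / b) * (L(g) + L(q / g)) * alpha
            + 2 * (n**2 / q) * (W(g) + W(q / g)) * beta)

for G in [1, 2**5, 2**10, 2**15, p]:
    print(f"G = 2^{int(log2(G)):2d}: {t_comm(G):7.2f} s")
```

As the model predicts, the printed times dip at G = 2¹⁰ = √p.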

SLIDE 30

Outline

Problem Outline
  Motivation and Introduction
  Previous Work: SUMMA
  Our Work: HSUMMA
Experiments
  Experiments on Grid5000
  Experiments on BlueGene


SLIDE 31

Experimental Platforms

◮ The experiments were carried out on the Graphene cluster of the Nancy site of the Grid5000 platform,
◮ on 8, 16, 32, 64 and 128 cores, and
◮ on an IBM BlueGene/P on 1024, 2048, 4096, 8192 and 16384 cores.


SLIDE 32

Outline

Problem Outline
  Motivation and Introduction
  Previous Work: SUMMA
  Our Work: HSUMMA
Experiments
  Experiments on Grid5000
  Experiments on BlueGene


SLIDE 33

SUMMA vs HSUMMA on Grid5000 with MPICH

[Figure: execution time (sec) vs. number of groups (1 to 128) for HSUMMA and SUMMA.]

HSUMMA and SUMMA on Grid5000 with MPICH-2. b = M = 64, n = 8192 and p = 128. 7.75 times reduction of the execution time.


SLIDE 34

SUMMA vs HSUMMA on Grid5000 with MPICH

[Figure: execution time (sec) vs. number of groups (1 to 128) for HSUMMA and SUMMA.]

HSUMMA and SUMMA on Grid5000 with MPICH-2. b = M = 256, n = 8192 and p = 128. 2.96 times reduction of the execution time.


SLIDE 35

SUMMA vs HSUMMA on Grid5000 with OpenMPI on Ethernet

[Figure: execution time (sec) vs. number of groups (1 to 128) for HSUMMA and SUMMA.]

HSUMMA and SUMMA on Grid5000 with OpenMPI on Ethernet. b = M = 256, n = 8192 and p = 128. 16.8 percent reduction of the execution time.


SLIDE 36

SUMMA vs HSUMMA on Grid5000 with OpenMPI on Infiniband

[Figure: execution time (sec) vs. number of groups (1 to 128) for HSUMMA and SUMMA.]

HSUMMA and SUMMA on Grid5000 with OpenMPI on Infiniband. b = M = 256, n = 8192 and p = 128. 24 percent reduction of the execution time.


SLIDE 37

Outline

Problem Outline
  Motivation and Introduction
  Previous Work: SUMMA
  Our Work: HSUMMA
Experiments
  Experiments on Grid5000
  Experiments on BlueGene


SLIDE 38

SUMMA vs HSUMMA on BlueGene

[Figure: execution and communication time (sec) vs. number of groups (2⁻¹ to 2¹⁴) for HSUMMA and SUMMA.]

SUMMA and HSUMMA on BG/P. Execution and communication time. b = M = 256, n = 65536 and p = 16384. 2.08 times reduction of the execution time; 5.89 times reduction of the communication time.


SLIDE 39

SUMMA and HSUMMA Communication Time

[Figure: communication time (sec) vs. number of cores (2¹¹ to 2¹⁴) for HSUMMA and SUMMA.]

SUMMA and HSUMMA on BG/P. Communication time. b = M = 256 and n = 65536.


SLIDE 40

Summary

Improvement over SUMMA:

◮ Hierarchical SUMMA has a theoretically lower communication time, and thus a lower execution time, than SUMMA.
◮ 2.08 times less communication time on 2048 cores.
◮ 5.89 times less communication time on 16384 cores.
◮ 1.2 times less overall execution time on 2048 cores.
◮ 2.36 times less overall execution time on 16384 cores.


SLIDE 41

Questions?
