How Two-sided Matrix Transformation Algorithms Can Benefit from Task Parallelism. Mirko Myllykoski, Department of Computing Science, Umeå University. Nordic Numerical Linear Algebra Meeting, KTH, Stockholm, 21-22 October, 2019.


slide-1
SLIDE 1

How Two-sided Matrix Transformation Algorithms Can Benefit from Task Parallelism

Mirko Myllykoski

Department of Computing Science, Umeå University

Nordic Numerical Linear Algebra Meeting KTH, Stockholm, 21-22 October, 2019

1 / 85

slide-2
SLIDE 2

Eigenvalue problems

◮ Given $A, B \in \mathbb{R}^{n \times n}$, find $\lambda_i \in \mathbb{C}$ and $x_i \in \mathbb{C}^n$ such that $A x_i = \lambda_i x_i$ or $A x_i = \lambda_i B x_i$.

◮ Assumption: The matrices A and B are dense and nonsymmetric.

2 / 85
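As a small illustration of the two problem classes stated above, the following NumPy sketch computes eigenpairs of a dense nonsymmetric matrix and checks the residuals. The random test matrices, the fixed seed and the reduction of the generalized problem to $B^{-1}A$ are assumptions made only for this example; the reduction requires a nonsingular, well-conditioned B, which is why robust solvers rely on the QZ algorithm instead.

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed, illustration only
n = 5
A = rng.standard_normal((n, n))                   # dense, nonsymmetric
B = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonal shift keeps B nonsingular here (assumption)

# Standard problem: A x_i = lambda_i x_i
lam, X = np.linalg.eig(A)
res_std = np.linalg.norm(A @ X - X * lam)         # column i of X is scaled by lam[i]

# Generalized problem: A x_i = lambda_i B x_i, solved here through B^{-1} A
mu, Y = np.linalg.eig(np.linalg.solve(B, A))
res_gen = np.linalg.norm(A @ Y - (B @ Y) * mu)

print(res_std, res_gen)                           # both should be near machine precision
```

The eigenvalues of a real nonsymmetric matrix are in general complex, which is why `lam` and `mu` are complex arrays even though A and B are real.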

slide-3
SLIDE 3

Eigenvalue problems (reduction to real Schur form)

◮ The matrix A is

◮ first reduced to upper Hessenberg form $H = Q_1^T A Q_1$ and

◮ then gradually reduced to real Schur form $S = Q_2^T H Q_2$.

Figure: An illustration of the two reduction steps in the standard case (dense, Hessenberg, Schur).

3 / 85
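The first of the two reduction steps can be sketched with Householder reflectors. The code below is a minimal unblocked illustration, not the blocked or task-based variant discussed in the talk; the function name `hessenberg_reduce` and the test matrix are hypothetical.

```python
import numpy as np

def hessenberg_reduce(A):
    """Return (H, Q) with H = Q.T @ A @ Q upper Hessenberg and Q orthogonal."""
    H = np.array(A, dtype=float)
    n = H.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = H[k + 1:, k]
        norm = np.linalg.norm(x)
        if norm == 0.0:
            continue                      # column is already reduced
        v = x.copy()
        v[0] += np.copysign(norm, x[0])   # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        # Two-sided update H <- P H P with the reflector P = I - 2 v v^T
        H[k + 1:, :] -= 2.0 * np.outer(v, v @ H[k + 1:, :])
        H[:, k + 1:] -= 2.0 * np.outer(H[:, k + 1:] @ v, v)
        # Accumulate Q <- Q P so that H = Q^T A Q holds at the end
        Q[:, k + 1:] -= 2.0 * np.outer(Q[:, k + 1:] @ v, v)
    return H, Q

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H, Q = hessenberg_reduce(A)
print(np.abs(np.tril(H, -2)).max())       # entries below the subdiagonal are at rounding level
```

Every reflector is applied from both sides, which is what makes this a two-sided transformation: the similarity preserves the eigenvalues while the zero pattern is introduced column by column.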

slide-4
SLIDE 4

Why? (task-based approach versus ScaLAPACK)

(a) Schur reduction¹: relative runtime versus matrix dimension (20k to 120k), StarNEig compared to PDHSEQR; 1.6 - 2.9 fold speedup.

(b) Eigenvalue reordering: relative runtime versus matrix dimension (20k to 120k), StarNEig compared to PDTRSEN; 2.8 - 5.0 fold speedup.

Figure: Improvement compared to ScaLAPACK. Up to 256 cores.

¹ https://github.com/NLAFET/SEVP-PDHSEQR-Alg953/

4 / 85

slide-5
SLIDE 5

Background (double-shift QR algorithm)

◮ The first column of H is transformed to the first column of $(H - \lambda_1 I)(H - \lambda_2 I)$, where the shifts $\lambda_1, \lambda_2 \in \mathbb{C}$ are the eigenvalues of a small 2 × 2 submatrix.

◮ The resulting 3 × 3 bulge is chased across the diagonal of H.

Figure: An illustration of how the bulge is created and chased. Labels: 2 × 2 eigenvalue problem, bulge.

5 / 85
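The bulge creation and chasing described above can be sketched as a single implicit double-shift step. This is a textbook-style illustration under stated assumptions: the shifts are taken from the trailing 2 × 2 submatrix, the reflectors are applied to full rows and columns for clarity rather than efficiency, and the names `householder` and `francis_double_shift_step` are hypothetical.

```python
import numpy as np

def householder(x):
    """Return a symmetric orthogonal P such that P @ x is a multiple of e_1."""
    v = np.array(x, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.eye(len(v))
    v[0] += np.copysign(norm, v[0])
    v /= np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def francis_double_shift_step(H):
    """One implicit double-shift QR step on an upper Hessenberg matrix (n >= 3)."""
    H = np.array(H, dtype=float)
    n = H.shape[0]
    # The shifts are the eigenvalues of the trailing 2 x 2 block; only their
    # sum s and product t are needed, so the arithmetic stays real.
    s = H[n - 2, n - 2] + H[n - 1, n - 1]
    t = H[n - 2, n - 2] * H[n - 1, n - 1] - H[n - 2, n - 1] * H[n - 1, n - 2]
    # First column of (H - l1 I)(H - l2 I) = H^2 - s H + t I has three nonzeros.
    x = np.array([
        H[0, 0] * H[0, 0] + H[0, 1] * H[1, 0] - s * H[0, 0] + t,
        H[1, 0] * (H[0, 0] + H[1, 1] - s),
        H[1, 0] * H[2, 1],
    ])
    for k in range(n - 1):
        m = min(3, n - k)
        P = householder(x[:m])
        H[k:k + m, :] = P @ H[k:k + m, :]   # k = 0 creates the bulge,
        H[:, k:k + m] = H[:, k:k + m] @ P   # k > 0 pushes it one step down
        if k < n - 2:
            x = H[k + 1:min(k + 4, n), k].copy()  # entries to eliminate next
    return H

rng = np.random.default_rng(2)
H0 = np.triu(rng.standard_normal((6, 6)), -1)     # random upper Hessenberg matrix
H1 = francis_double_shift_step(H0)
print(np.abs(np.tril(H1, -2)).max())              # Hessenberg form is restored
```

Because every update is an orthogonal similarity, the result has the same eigenvalues as the input. Repeating the step drives subdiagonal entries toward zero, which is the iteration that the multi-shift variants later in the talk reorganize.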

slide-6
SLIDE 6

Background (multi-shift QR algorithm and level 3 BLAS)

◮ A modern multi-shift QR algorithm

◮ groups together a set of bulges and

◮ initially applies the transformations only within a small diagonal window.

◮ The transformations are accumulated and propagated with level 3 BLAS operations.

Figure: An illustration of accumulated transformations. Labels: apply locally (in L2 cache), group transformations, propagate with BLAS-3 updates.

6 / 85
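The idea of applying transformations locally and propagating them afterwards can be checked numerically. In the sketch below a few random reflectors stand in for the bulge-chasing transformations (an assumption for illustration); they are accumulated into one small orthogonal matrix Q, applied inside a diagonal window, and then propagated to the off-diagonal panels with matrix-matrix products, which is where level 3 BLAS would be used.

```python
import numpy as np

rng = np.random.default_rng(1)
n, i0, i1 = 8, 2, 6                  # matrix size and diagonal window [i0, i1)
w = i1 - i0
A = rng.standard_normal((n, n))

# Accumulate a few small reflectors into one w-by-w orthogonal matrix Q.
Q = np.eye(w)
for _ in range(3):
    v = rng.standard_normal(w)
    v /= np.linalg.norm(v)
    Q = Q @ (np.eye(w) - 2.0 * np.outer(v, v))

# Reference result: embed Q in an n-by-n identity and form Q_full^T A Q_full.
Q_full = np.eye(n)
Q_full[i0:i1, i0:i1] = Q
reference = Q_full.T @ A @ Q_full

# Windowed variant: local update first, then panel updates (GEMM-like calls).
B = A.copy()
B[i0:i1, i0:i1] = Q.T @ B[i0:i1, i0:i1] @ Q   # inside the window
B[i0:i1, i1:] = Q.T @ B[i0:i1, i1:]           # row panel right of the window
B[i0:i1, :i0] = Q.T @ B[i0:i1, :i0]           # row panel left of the window
B[:i0, i0:i1] = B[:i0, i0:i1] @ Q             # column panel above the window
B[i1:, i0:i1] = B[i1:, i0:i1] @ Q             # column panel below the window

max_diff = np.abs(B - reference).max()
print(max_diff)                               # rounding-level difference
```

The panel updates are independent of each other once Q is known, which is also what makes them natural candidates for separate tasks.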

slide-7
SLIDE 7

Background (bulge chasing in ScaLAPACK)

◮ With p cores, we can have up to √p concurrent windows.

◮ The transformations are broadcast and applied in parallel.

◮ The theoretically possible degree of parallelism is p.

Figure: An illustration of the bulge chasing stage.

Figure: A hypothetical trace for the bulge chasing stage (axes: time, cores / ranks).

7 / 85

slide-8
SLIDE 8

Background (multi-shift QR algorithm with AED)

Figure: An illustration of the multi-shift QR algorithm with AED. Labels: Hessenberg reduction, Schur reduction, Reorder, Deflate, Spike, Shifts, Bulge chasing.

8 / 85

slide-9
SLIDE 9

Background (AED in ScaLAPACK)

Figure: A trace with two AED phases marked (axes: time, cores / ranks).

slide-10
SLIDE 10

Task-based approach (task graphs)

◮ The computational work is cut into self-contained tasks.

◮ The tasks are inserted into a runtime system.

◮ The runtime system derives the task dependencies.

◮ The task dependencies can be visualized as a task graph.

Figure: A task graph with tasks labeled W, R and L and the dependencies between them.
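How a runtime system can derive dependencies from inserted tasks is sketched below with a toy model. The class `TaskGraph`, the data handle names and the reading of the task labels (W as a window task that produces a transformation, R and L as update tasks that consume it) are assumptions for illustration; real runtime systems offer far richer interfaces.

```python
from collections import defaultdict

class TaskGraph:
    """Toy runtime: derives dependencies from declared data accesses."""

    def __init__(self):
        self.deps = {}                     # task name -> set of predecessor tasks
        self.last_writer = {}              # data handle -> task that wrote it last
        self.readers = defaultdict(set)    # data handle -> readers since last write

    def insert(self, name, reads=(), writes=()):
        deps = set()
        for h in reads:
            if h in self.last_writer:
                deps.add(self.last_writer[h])    # read-after-write
        for h in writes:
            if h in self.last_writer:
                deps.add(self.last_writer[h])    # write-after-write
            deps |= self.readers[h]              # write-after-read
        deps.discard(name)
        self.deps[name] = deps
        for h in reads:
            self.readers[h].add(name)
        for h in writes:
            self.last_writer[h] = name
            self.readers[h] = set()
        return deps

g = TaskGraph()
g.insert("W1", writes=["window", "Q1"])          # local work, produces Q1
g.insert("R1", reads=["Q1"], writes=["right"])   # update that needs Q1
g.insert("L1", reads=["Q1"], writes=["above"])   # another update that needs Q1
g.insert("W2", reads=["right"], writes=["window", "Q2"])
print(g.deps)
```

The tasks are inserted in sequential program order, and the dependencies follow purely from which handles each task reads and writes. R1 and L1 depend only on W1, so they may run concurrently.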

slide-11
SLIDE 11

Task-based approach (more opportunities for concurrency)

◮ Real-life task graphs are much more complex,

◮ but they also offer more opportunities for increased concurrency.

◮ The runtime system traverses the task graph.

◮ No global synchronization.

◮ Computational steps are allowed to overlap and merge.

◮ Other benefits of the task-based approach include

◮ better load balancing,

◮ task priorities,

◮ accelerator support (GPUs) and

◮ implicit MPI communications.

11 / 85

slide-12
SLIDE 12

Task-based approach (traversal)

Figure: An illustration of how the task graph is traversed. Labels: critical path, dependences, ready for scheduling, can be scheduled.

12 / 85
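The traversal can be mimicked with Python's standard `graphlib` module (Python 3.9 or later). The task graph and the priorities below are hypothetical: the chain W1, R1, W2, R2 is assumed to be the critical path and is given a higher priority than the L tasks. A single simulated worker always picks the highest-priority task that is ready.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: task -> set of tasks it depends on.
deps = {
    "W1": set(),
    "R1": {"W1"},
    "L1": {"W1"},
    "W2": {"R1"},
    "R2": {"W2"},
    "L2": {"W2", "L1"},
}
# Assumed priorities: tasks on the critical path get 1, the others 0.
priority = {"W1": 1, "R1": 1, "W2": 1, "R2": 1, "L1": 0, "L2": 0}

ts = TopologicalSorter(deps)
ts.prepare()
ready = list(ts.get_ready())       # tasks without unfinished predecessors
order = []
while ready:
    ready.sort(key=lambda task: priority[task], reverse=True)
    task = ready.pop(0)            # run the most urgent ready task
    order.append(task)
    ts.done(task)                  # may release successors immediately
    ready.extend(ts.get_ready())

print(order)                       # ['W1', 'R1', 'W2', 'R2', 'L1', 'L2']
```

W2 becomes ready as soon as R1 finishes, even though L1 has not run yet. No barrier separates the steps, so work on the critical path proceeds while lower-priority updates are postponed.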

slide-13
SLIDE 13

Trace (first AED)

Figure: An illustration of the first AED window.

13 / 85

slide-17
SLIDE 17

Trace (bulges are introduced from the top left corner)

Figure: An illustration of the beginning of the bulge chasing stage.

17 / 85

slide-22
SLIDE 22

Trace (bulges are chased across the diagonal)

Figure: An illustration of the middle of the bulge chasing stage.

22 / 85

slide-28
SLIDE 28

Trace (delayed update wave follows the bulges)

Figure: An illustration of the update wave that follows the bulges.

28 / 85

slide-37
SLIDE 37

Trace (bulge chasing stage ends)

Figure: An illustration of the end of the bulge chasing stage.

37 / 85

slide-41
SLIDE 41

Trace (second AED)

Figure: An illustration of the second AED window.

41 / 85

slide-47
SLIDE 47

Trace (third AED)

Figure: An illustration of the third AED window.

47 / 85

slide-49
SLIDE 49

Trace (fourth AED)

Figure: An illustration of the fourth AED window.

49 / 85

slide-52
SLIDE 52

Trace (second bulge chasing stage begins)

Figure: An illustration of the beginning of the second bulge chasing stage.

52 / 85

slide-53
SLIDE 53

Trace (two merged bulge chasing stages)

Figure: An illustration of two merged bulge chasing stages.

53 / 85

slide-72
SLIDE 72

Trace (5th, 6th, 7th and 8th AED)

Figure: An illustration of the 5th, 6th, 7th and 8th AED windows.

72 / 85

slide-83
SLIDE 83

Computational results (distributed memory performance)

(a) Schur reduction¹: relative runtime versus matrix dimension (20k to 120k), StarNEig compared to PDHSEQR; 1.6 - 2.9 fold speedup.

(b) Eigenvalue reordering: relative runtime versus matrix dimension (20k to 120k), StarNEig compared to PDTRSEN; 2.8 - 5.0 fold speedup.

Figure: Improvement compared to ScaLAPACK. Up to 256 cores.

¹ https://github.com/NLAFET/SEVP-PDHSEQR-Alg953/

83 / 85

slide-84
SLIDE 84

StarNEig library

◮ The StarNEig library¹ aims to provide a complete task-based software stack for solving eigenvalue problems.

◮ Support for shared and distributed memory, and GPUs.

◮ Increased parallelism through expressing algorithms as DAGs.

◮ Better (heterogeneous) scheduling and load balancing.

◮ Overlapping communications and computations.

◮ Current status²: a table with the rows Hessenberg, Schur, Reordering and Eigenvectors and the columns SM, DM and GPU for both the standard case and the generalized case. Legend: Complete, Experimental, (Sca)LAPACK.

¹ In collaboration with Carl Christian Kjelgaard Mikkelsen, Angelika Schwarz, Lars Karlsson and Bo Kågström.

² https://nlafet.github.io/StarNEig

84 / 85

slide-85
SLIDE 85

Extra (distributed memory scalability)

Figure: Standard case, 28 cores / node, max 700 cores. Runtime [s] (500 to 3000) versus matrix dimension (20k to 160k) on 1, 4, 9, 16 and 25 nodes.

85 / 85