

slide-1
SLIDE 1

Memory minimization vs. I/O volume minimization in sparse matrix factorization methods

Abdou Guermouche, LaBRI Bordeaux

May 2010

slide-2
SLIDE 2

Context

Solving sparse linear systems

Ax = b ⇒ Direct methods: A = LU

Typical matrix: BRGM matrix

  • 3.7 × 10^6 variables
  • 156 × 10^6 nonzeros in A
  • 4.5 × 10^9 nonzeros in LU
  • 26.5 × 10^12 flops
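A rough storage estimate shows why these figures are a memory problem. The sketch below is only a back-of-the-envelope calculation: it assumes 8-byte double-precision entries and ignores index arrays, neither of which is stated on the slide.

```python
# Rough storage estimate for the BRGM matrix figures quoted above.
# Assumption (not from the slides): each stored entry is an 8-byte double,
# and integer index arrays are ignored.
BYTES_PER_ENTRY = 8

nnz_A = 156 * 10**6      # nonzeros in A
nnz_LU = 45 * 10**8      # 4.5 x 10^9 nonzeros in the LU factors

bytes_A = nnz_A * BYTES_PER_ENTRY
bytes_LU = nnz_LU * BYTES_PER_ENTRY
fill_ratio = nnz_LU / nnz_A

print(f"A:  {bytes_A / 1e9:.2f} GB")
print(f"LU: {bytes_LU / 1e9:.2f} GB")
print(f"fill ratio: {fill_ratio:.1f}x")
```

Under these assumptions the factors alone need about 36 GB, roughly 29 times the storage of A.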

2/43 Abdou Guermouche, May 2010


slide-4
SLIDE 4

Context

Physical constraint

[Figure: the memory required exceeds the available core memory, which causes a memory crash.]

Software challenge

  • Implementation of an out-of-core execution scheme within MUMPS

slide-5
SLIDE 5

Context

Out-of-core

[Figure: memory required vs. core memory; disks are used for what does not fit in core.]

slide-6
SLIDE 6

Outline

  • Multifrontal method
  • Active memory minimization Algorithm (Liu’s Algorithm)
  • Memory issues
  • Limitation of the approach
  • New multifrontal schedules and algorithms
  • Flexible allocation scheme
  • A new memory minimization algorithm
  • Results
  • Total memory minimization
  • How about Volume of I/O?
  • Computing Volume of I/O
  • Minimizing I/O volume
  • Towards an out-of-core flexible allocation
  • Conclusion and Future work


slide-8
SLIDE 8

The multifrontal method (Duff, Reid’83)

[Figure: nonzero pattern of A and of L+U−I for a small example with variables 1 to 5, distinguishing original nonzeros from fill-in.]

Storage divided into two parts:

  • Factors, systematically written to disk;
  • Active storage, kept in memory.

[Figure: memory layout (factors, then the active storage made of the stack of contribution blocks and the active frontal matrix) next to the elimination tree, where each frontal matrix splits into factors and a contribution block.]


slide-15
SLIDE 15

Memory behaviour (serial postorder traversal)

[Figure: step-by-step evolution of the memory during a serial postorder traversal of a tree with nodes 1, 2, and 3.]


slide-23
SLIDE 23

Sequential case results

[Figure: memory profiles of two traversals of the same tree, a worst case and a best case, each with its memory peak marked.]

→ Algorithms to find the optimal tree traversal have been proposed.


slide-25
SLIDE 25

Sequential case: Memory behavior (2/2)

Consider a parent node in the tree:

  • n is the number of children.
  • j denotes the jth child of the node.
  • $cb_j$ is the size of the contribution block of child j.
  • m is the memory size of the frontal matrix of the parent.
  • A (resp. $A_j$) is the amount of active memory needed to process the parent (resp. child j).

[Figure: parent node with children 1, 2, ..., n and their contribution blocks $cb_1, cb_2, \dots, cb_n$.]

The assembly step requires a storage of:

$$m + \sum_{j=1}^{n} cb_j$$

slide-26
SLIDE 26

Sequential case: Memory behavior (2/2)

The storage required to process child j is:

$$A_j + \sum_{k=1}^{j-1} cb_k$$

slide-27
SLIDE 27

Sequential case: Memory behavior (2/2)

A is thus defined by:

$$A = \max\left( \max_{j=1,\dots,n}\left(A_j + \sum_{k=1}^{j-1} cb_k\right),\; m + \sum_{j=1}^{n} cb_j \right)$$
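The definition of A can be evaluated bottom-up on a tree. The sketch below is illustrative only: the `Node` class, the function name, and the sizes are assumptions, and a leaf is assumed to need exactly its own frontal matrix.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Tree node: m is the frontal matrix size, cb its contribution block size."""
    m: int
    cb: int
    children: List["Node"] = field(default_factory=list)


def active_memory(node: Node) -> int:
    """Peak active memory A for the subtree, children taken in list order."""
    stacked = 0          # sum of contribution blocks already on the stack
    peak = 0
    for child in node.children:
        # Processing child j needs A_j on top of cb_1 .. cb_{j-1}.
        peak = max(peak, active_memory(child) + stacked)
        stacked += child.cb
    # Assembly: the parent's frontal matrix plus all children's blocks.
    return max(peak, node.m + stacked)


# Hypothetical sizes, chosen only to show that the child order matters.
leaves = [Node(m=6, cb=2), Node(m=9, cb=3)]
root = Node(m=4, cb=0, children=leaves)
swapped = Node(m=4, cb=0, children=leaves[::-1])
print(active_memory(root), active_memory(swapped))
```

With these sizes the first order peaks at 11 and the swapped order at 9, which is the effect the traversal-ordering algorithms exploit.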


slide-29
SLIDE 29

Liu’s Algorithm

Liu’s Theorem (Tree pebbling theorem)

The minimum of $\max_j \left(x_j + \sum_{i=1}^{j-1} y_i\right)$ is obtained when the sequence $(x_i, y_i)$ is sorted in decreasing order of $x_i - y_i$.

Consequence: An optimal child sequence is obtained by rearranging the children nodes in decreasing order of $A_i - cb_i$.

Algorithm:

  • Bottom-up greedy process.
  • Apply Liu’s theorem at each level of the tree.
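The ordering rule of the theorem is easy to check on a small case. The following sketch uses made-up (A_i, cb_i) pairs and compares the sorted sequence against an exhaustive search; the function names are illustrative, not taken from MUMPS.

```python
from itertools import permutations


def peak(seq):
    """max_j (x_j + sum of y_i for i < j) for a sequence of (x, y) pairs."""
    stacked, worst = 0, 0
    for x, y in seq:
        worst = max(worst, x + stacked)
        stacked += y
    return worst


def liu_order(pairs):
    """Sort (x, y) pairs in decreasing order of x - y."""
    return sorted(pairs, key=lambda p: p[0] - p[1], reverse=True)


# Hypothetical children, given as (A_i, cb_i).
children = [(6, 2), (9, 3), (7, 6), (5, 1)]
best = liu_order(children)
brute = min(peak(p) for p in permutations(children))
print(best, peak(best), brute)
```

The exhaustive search is only there as a sanity check; it is factorial in the number of children, whereas the sort is all the algorithm needs at each node.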


slide-31
SLIDE 31

Limitation of the Classical scheme

[Figure: memory profiles of the classical approach and of the flexible scheme, each marking the memory peak and the allocation of the father.]

→ Decoupling the allocation and the computations can improve the memory behavior.


slide-34
SLIDE 34

Flexible multifrontal scheme

[Figure: children 1, 2, ..., p form S1 and children p+1, ..., n form S2.]

  • p is the position of the allocation of the parent.
  • S1 is the set of children treated before the allocation of the parent.
  • S2 is the set of children treated after the allocation of the parent.
  • The memory behavior inside S1 is similar to the case of the classical multifrontal scheme.
  • Inside S2, the order of the children has no impact on the memory behavior.

$$A^{flex} = \max\left( \max_{j=1,\dots,p}\left(A^{flex}_j + \sum_{k=1}^{j-1} cb_k\right),\; m + \sum_{k=1}^{p} cb_k,\; m + \max_{j=p+1,\dots,n} A^{flex}_j \right)$$
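To make the three terms of the flexible-scheme peak concrete, here is a minimal sketch that evaluates it for a fixed child order and a given allocation position p. The function name and the numeric values are hypothetical.

```python
def flexible_peak(children, m, p):
    """Active-memory peak of a parent under the flexible scheme.

    children: list of (A_j, cb_j) pairs in processing order.
    m: size of the parent's frontal matrix.
    p: number of children processed before the parent is allocated.
    """
    s1, s2 = children[:p], children[p:]
    stacked, peak = 0, 0
    for a_j, cb_j in s1:
        peak = max(peak, a_j + stacked)   # child j on top of earlier blocks
        stacked += cb_j
    peak = max(peak, m + stacked)          # parent allocated next to the S1 blocks
    for a_j, _ in s2:
        peak = max(peak, m + a_j)          # child of S2 processed with the parent in memory
    return peak


kids = [(9, 3), (6, 2), (5, 4)]            # hypothetical (A_j, cb_j) values
peaks = [flexible_peak(kids, m=4, p=p) for p in (3, 2, 1)]
print(peaks)
```

Here p = 3 corresponds to the classical scheme (parent allocated after all children). With these values an intermediate position gives the smallest peak, which is why the position has to be searched for.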


slide-37
SLIDE 37

A new memory minimization algorithm

Theorem

An optimal sequence can be obtained by:

  • Sorting the children in decreasing order of $A^{flex}_j$.
  • Trying all the possible positions for the allocation of the parent and sorting the children belonging to S1 according to Liu’s Theorem.
  • Selecting the configuration that gives the smallest peak.

Algorithm: Bottom-up greedy process where the theorem is applied at each level of the tree.

slide-38
SLIDE 38

Proof

$$A^{flex} = \max\left( \max_{j=1,\dots,p}\left(A^{flex}_j + \sum_{k=1}^{j-1} cb_k\right),\; m + \sum_{k=1}^{p} cb_k,\; m + \max_{j=p+1,\dots,n} A^{flex}_j \right)$$

  • Inside S2, the order of the children has no impact on the memory behavior.
  • If $\exists j \in S_1$ such that $A^{flex}_j \le \max_{i \in S_2} A^{flex}_i$, then j can be moved from S1 to S2 without increasing the peak.

[Figure: three configurations of the partition (S1, S2) around position p, one of which is marked as the optimal configuration.]

slide-39
SLIDE 39

Active memory minimization Algorithm

Algorithm:

Set S1 = {1, ..., n}, S2 = ∅ and p = n;
Find the schedule providing an optimal $A^{flex}$ value for partition (S1, S2);
repeat
    Find j such that $A^{flex}_j = \min_{k \in S_1} A^{flex}_k$;
    Set S1 = S1 \ {j}, S2 = S2 ∪ {j}, and p = p − 1;
    Find the schedule providing an optimal $A'^{flex}$ value for partition (S1, S2);
    if $A'^{flex} \le A^{flex}$ then
        Keep the value of p, and the schedule of children in S1 and S2 corresponding to $A'^{flex}$;
        Set $A^{flex} = A'^{flex}$;
    end if
until p == 1 or $A'^{flex} > A^{flex}$
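A per-node version of this search can be sketched as follows. Note the deliberate simplification: instead of the early-stopping loop above, the sketch tries every position p from n down to 1, as the theorem statement describes. Names and values are hypothetical, and each child is assumed to satisfy cb_j ≤ A_j.

```python
def flexible_peak(s1, s2, m):
    """Peak for S1 (ordered, before allocation) and S2 (after allocation)."""
    stacked, peak = 0, 0
    for a_j, cb_j in s1:
        peak = max(peak, a_j + stacked)
        stacked += cb_j
    peak = max(peak, m + stacked)
    return max([peak] + [m + a_j for a_j, _ in s2])


def best_flexible_schedule(children, m):
    """Try every allocation position p = n, ..., 1 and keep the smallest peak."""
    by_a = sorted(children, key=lambda c: c[0], reverse=True)
    best = None
    for p in range(len(by_a), 0, -1):
        # S1 = the p children with largest A_j, ordered by Liu's rule.
        s1 = sorted(by_a[:p], key=lambda c: c[0] - c[1], reverse=True)
        s2 = by_a[p:]
        cand = (flexible_peak(s1, s2, m), p, s1, s2)
        if best is None or cand[0] < best[0]:
            best = cand
    return best


kids = [(9, 3), (6, 2), (5, 4), (8, 7)]    # hypothetical (A_j, cb_j) values
peak, p, s1, s2 = best_flexible_schedule(kids, m=4)
print(peak, p, s1, s2)
```

For these values the classical scheme (p = 4) would peak at 20, while allocating the parent right after the first child brings the peak down to 12.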


slide-41
SLIDE 41

Experimental environment

MUMPS: Multifrontal Parallel Solver for both LU and LDL^T.
Reordering techniques: AMD, AMF, METIS, PORD.
Test platform: IBM platform at IDRIS.
Test problems: Large range of matrices extracted from various collections (Rutherford-Boeing, University of Florida, PARASOL, ...).

Schedules tested:

  • Classical multifrontal scheme (parent allocated after all its children).
  • Anticipated parent allocation scheme (parent allocated after its first child).
  • Flexible parent allocation scheme (parent allocated at an arbitrary position).

Simulation of memory variations for all the schedules during the analysis step.


slide-43
SLIDE 43

Experimental results: Active memory gains

[Figure, two panels (AMD and METIS orderings): memory ratio per matrix (matrix index on the x-axis) for the flexible, classical, and early allocation schemes.]

Large gains against the classical allocation scheme for matrices 8, 9 and 10.


slide-46
SLIDE 46

Total memory minimization (1/3)

Memory space $T^{flex}$ needed for the processing of a node in the tree is given by:

$$P_1 = \max\left( \max_{j=1,\dots,p}\left(T^{flex}_j + \sum_{k=1}^{j-1}(cb_k + F_k)\right),\; m + \sum_{k=1}^{p}(cb_k + F_k) \right)$$

$$P_2 = m + \sum_{k=1}^{p} F_k + \max_{j=p+1,\dots,n}\left(T^{flex}_j + \sum_{k=p+1}^{j-1} F_k\right)$$

$$T^{flex} = \max(P_1, P_2)$$

The order in S2 has an impact on the memory occupation.
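A small numerical sketch of P1 and P2 shows why the order inside S2 now matters. It assumes that F_j denotes the factor storage left behind by child j (the slide does not define it explicitly); the function name and values are hypothetical.

```python
def total_memory_peak(s1, s2, m):
    """Return (T, P1, P2) for a parent under the flexible scheme.

    s1, s2: lists of (T_j, cb_j, F_j) triples in processing order, for the
    children handled before / after the allocation of the parent.
    m: size of the parent's frontal matrix.
    """
    held = 0                      # cb_k + F_k of the S1 children done so far
    p1 = 0
    for t_j, cb_j, f_j in s1:
        p1 = max(p1, t_j + held)
        held += cb_j + f_j
    p1 = max(p1, m + held)

    factors = sum(f for _, _, f in s1)   # S1 blocks are assembled, factors stay
    p2 = factors
    for t_j, _, f_j in s2:
        p2 = max(p2, t_j + factors)
        factors += f_j
    p2 += m                              # the parent is resident throughout S2
    return max(p1, p2), p1, p2


s1 = [(10, 2, 3)]                        # hypothetical (T_j, cb_j, F_j) values
s2 = [(8, 3, 2), (6, 1, 1)]
print(total_memory_peak(s1, s2, m=5))
print(total_memory_peak(s1, s2[::-1], m=5))
```

Swapping the two children of S2 changes the result from 16 to 17, because the factors of the first child are still in memory while the second one is processed.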


slide-51
SLIDE 51

Total memory minimization (2/3)

Total memory minimizing sequences inside S1 and S2: the children sequence in S1 is ordered according to $T^{flex}_i - (cb_i + F_i)$, and in S2 according to $T^{flex}_i - F_i$.

[Figure: children sequence split at position p into S1 and S2.]

Property: Let $j_0 \in S_2$ be the child for which the peak is reached inside S2.

→ The total memory peak cannot decrease if $j_0$ remains in S2 for all configurations where $S_1 \subset S'_1$.


slide-55
SLIDE 55

Total memory minimization (3/3)

Algorithm:

Set S1 = ∅, S2 = {1, ..., n} and p = 0;
Sort S2 in decreasing order of $T^{flex}_j - F_j$;
Compute $T^{flex} = P_2$;
repeat
    Find $j_0$ such that the peak in $P_2$ is obtained for $j_0$;
    Set S1 = S1 ∪ {$j_0$}, S2 = S2 \ {$j_0$}, and p = p + 1;
    (Remark: $j_0$ is inserted at the position in S1 that keeps this set in decreasing order of $T^{flex}_j - (cb_j + F_j)$.)
    Compute $P_1$, $P_2$, and $T'^{flex} = \max(P_1, P_2)$;
    if $T'^{flex} \le T^{flex}$ then
        Keep the values of p, S1 and S2 and set $T^{flex} = T'^{flex}$;
    end if
until p = n or $P_1 \ge P_2$

slide-56
SLIDE 56

Experimental results: Total memory gains

[Figure, two panels (AMD and METIS orderings): total memory ratio per matrix (matrix index on the x-axis) for the flexible, classical, and early allocation schemes and for the active memory flexible scheme.]


slide-59
SLIDE 59

Notations and assumptions

[Figure: parent node with children 1, 2, ..., n and their contribution blocks $cb_1, cb_2, \dots, cb_n$.]

  • $M_0$: Core memory available
  • m: Size of the frontal matrix of the parent ($m < M_0$)
  • n: Number of children
  • i: ith child
  • $cb_i$: Size of the contribution block of child i
  • $M_i$: Memory required to process the subtree rooted at child i

Assumptions:

  • Factors are written to disk as soon as computed
  • Each frontal matrix fits in core memory (i.e. $m < M_0$)
  • Oldest CBs written first


slide-61
SLIDE 61

I/O volume (1/3)

[Figure: snapshots of the active memory against the physical memory. Legend: stack memory area written to disk, allocation of a frontal matrix, active frontal matrix. The snapshots alternate between "allocation of a new frontal matrix" and "after I/O".]

slide-62
SLIDE 62

I/O volume (2/3)

Consider a parent and its children.

Case 1: $\forall i \in \{1, \dots, n\}: M_i < M_0$

I/O produced at assembly step:

$$\max\left(0,\; m + \sum_{j=1}^{n} cb_j - M_0\right)$$

I/O produced when processing the subtree rooted at child j:

$$\max\left(0,\; M_j + \sum_{k=1}^{j-1} cb_k - M_0\right)$$

$V^{I/O}$ is thus:

$$V^{I/O} = \max\left( \max\left( \max_{j=1,\dots,n}\left(M_j + \sum_{k=1}^{j-1} cb_k\right),\; m + \sum_{j=1}^{n} cb_j \right) - M_0,\; 0 \right)$$


slide-66
SLIDE 66

I/O volume (3/3)

Case 2: general case ($\exists i \in \{1, \dots, n\}: M_i > M_0$)

  • Same expression for the active memory
  • New expression for the I/O volume. If $M_i > M_0$ then:
      1. $V^{I/O}_i$ is independent of the position of child i
      2. When processing child i, CBs of previously processed children are written to disk

$$V^{I/O} = \max\left( \max_{j=1,\dots,n}\left(\min(M_j, M_0) + \sum_{k=1}^{j-1} cb_k\right),\; m + \sum_{j=1}^{n} cb_j \right) - M_0 + \sum_{i=1}^{n} V^{I/O}_i$$

$$= \max\left( \max\left( \max_{j=1,\dots,n}\left(\min(M_j, M_0) + \sum_{k=1}^{j-1} cb_k\right),\; m + \sum_{j=1}^{n} cb_j \right) - M_0,\; 0 \right) + \sum_{i=1}^{n} V^{I/O}_i$$

⟹ Total I/O volume computed with a greedy depth-first traversal on the whole tree
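The general expression lends itself to a single depth-first pass that returns both the memory requirement and the I/O volume of each subtree. The sketch below is one way to write it, under the slide's assumptions (factors go to disk immediately, every frontal matrix fits in core); the `Node` class and the sizes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    m: int                      # size of the frontal matrix (assumed <= M0)
    cb: int                     # size of the contribution block
    children: List["Node"] = field(default_factory=list)


def memory_and_io(node: Node, m0: int) -> Tuple[int, int]:
    """Return (M, V_IO) for the subtree, children taken in list order."""
    stacked = 0
    peak = 0          # active memory requirement, not capped
    capped = 0        # same expression with min(M_j, M0)
    io_children = 0   # sum of the children's own I/O volumes
    for child in node.children:
        m_j, v_j = memory_and_io(child, m0)
        peak = max(peak, m_j + stacked)
        capped = max(capped, min(m_j, m0) + stacked)
        io_children += v_j
        stacked += child.cb
    peak = max(peak, node.m + stacked)
    capped = max(capped, node.m + stacked)
    return peak, max(capped - m0, 0) + io_children


mid = Node(m=7, cb=2, children=[Node(m=5, cb=3), Node(m=6, cb=4)])
root = Node(m=6, cb=0, children=[mid, Node(m=8, cb=3)])
print(memory_and_io(mid, m0=10), memory_and_io(root, m0=10))
```

In this example the inner subtree needs 14 units with 10 available, so it writes 4 units on its own; the root adds 1 more unit at assembly, for a total of 5.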


slide-68
SLIDE 68

Problem

  • Assumption: factors written to disk as soon as computed
  • Active memory peak: tree traversal-dependent

[Figure: memory peak for a worst-case and a best-case traversal.]

  • LIU’86: Optimum algorithm for minimizing the peak of active memory
  • Problem: How to minimize the I/O volume when the active memory does not fit in a given amount of physical memory $M_0$?


slide-70
SLIDE 70

Minimizing I/O volume

How to process the children to minimize $V^{I/O}$?

  • Minimizing $V^{I/O}$ corresponds to minimizing:

$$\max_{j=1,\dots,n}\left( \min(M_j, M_0) + \sum_{k=1}^{j-1} cb_k \right)$$

  • Theorem (Tree pebbling theorem): The minimum of $\max_j\left(x_j + \sum_{i=1}^{j-1} y_i\right)$ is obtained when the sequence $(x_j, y_j)$ is sorted in decreasing order of $x_j - y_j$.

Consequence: An optimal sequence is obtained by ordering the children nodes in decreasing order of $\min(M_j, M_0) - cb_j$.

  • Algorithm:
      • Greedy depth-first traversal process
      • Applying the theorem at each level of the tree
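The only change with respect to Liu's rule is the sort key, which caps M_j at M0. A minimal sketch with made-up (M_j, cb_j) pairs, checked against an exhaustive search (function names are illustrative):

```python
from itertools import permutations


def io_order(children, m0):
    """Order (M_j, cb_j) pairs by decreasing min(M_j, M0) - cb_j."""
    return sorted(children, key=lambda c: min(c[0], m0) - c[1], reverse=True)


def capped_peak(seq, m0):
    """max_j (min(M_j, M0) + sum of cb_k for k < j)."""
    stacked, worst = 0, 0
    for m_j, cb_j in seq:
        worst = max(worst, min(m_j, m0) + stacked)
        stacked += cb_j
    return worst


children = [(12, 4), (8, 2), (9, 5)]     # hypothetical (M_j, cb_j) values
M0 = 8
order = io_order(children, M0)
brute = min(capped_peak(p, M0) for p in permutations(children))
print(order, capped_peak(order, M0), brute)
```

Because every child here needs at least M0, the capped term is the same for all of them and the rule reduces to processing the children with the smallest contribution blocks first.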


slide-73
SLIDE 73

Toy example

[Figure: tree with children a (M = 12, cb = 4) and b (M = 8, cb = 2) and parent c (F = 5); core memory $M_0 = 8$.]

  • Liu’s Algorithm: sequence a-b-c, $A_{fath} = 12$, $V^{I/O} = 8$
  • I/O minimization Algorithm: sequence b-a-c, $A_{fath} = 14$, $V^{I/O} = 7$

[Figure: memory over time for both sequences, with the core limit marked; the second profile reaches 14.]
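The numbers of the toy example can be reproduced from the formulas for A and for the I/O volume. The sketch below makes two assumptions that the slide does not spell out: F = 5 is taken as the frontal matrix size m of the parent c, and child a is taken to produce 12 − 8 = 4 units of I/O on its own while b produces none.

```python
def active_peak(order, m):
    """Active memory A of the parent for children (M_j, cb_j) in this order."""
    stacked, peak = 0, 0
    for m_j, cb_j in order:
        peak = max(peak, m_j + stacked)
        stacked += cb_j
    return max(peak, m + stacked)


def io_volume(order, m, m0, child_io):
    """I/O volume at the parent plus the volumes of the children subtrees."""
    stacked, capped = 0, 0
    for m_j, cb_j in order:
        capped = max(capped, min(m_j, m0) + stacked)
        stacked += cb_j
    capped = max(capped, m + stacked)
    return max(capped - m0, 0) + sum(child_io)


M0 = 8
a, b = (12, 4), (8, 2)          # (M, cb) as given on the slide
m_c = 5                         # assumption: F = 5 is the frontal size of c
child_io = [4, 0]               # assumption: own I/O volumes of a and b

results = {}
for name, order in (("a-b-c", [a, b]), ("b-a-c", [b, a])):
    results[name] = (active_peak(order, m_c), io_volume(order, m_c, M0, child_io))
print(results)
```

Under these assumptions the sketch gives (12, 8) for a-b-c and (14, 7) for b-a-c, matching the slide: the sequence with the larger memory peak writes less to disk.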

slide-74
SLIDE 74

Experimental Results: MinMEM VS MinIO

[Figure, left panel: I/O volume ratio MinMEM / MinIO as a function of $M_0$ (% of in-core requirements) for BCSSTK (AMD), MSDOOR (AMF), BMWCRA (METIS), and GEO3D-20-20-20 (PORD). Right panel: I/O volume ratio MinMEM / MinIO per matrix index for AMD, AMF, METIS, and PORD; legend entries: Classical, Last-in-place.]

Volume decreased by up to 50% in comparison to Liu’s algorithm.


slide-76
SLIDE 76

Flexible allocation scheme and I/O volume

[Figure: children 1, 2, ..., p form S1 and children p+1, ..., n form S2.]

  • p is the position of the allocation of the parent
  • S1 is the set of children treated before the allocation of the parent
  • S2 is the set of children treated after the allocation of the parent
  • Inside S1: behavior of the classical multifrontal scheme
  • Inside S2: no impact of the order on the I/O volume
  • We may write (partially/entirely) the parent at any time after its allocation provided we read it again as soon as a CB is produced



slide-81
SLIDE 81

Flexible allocation scheme and I/O volume

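[Figure: children 1, 2, ..., p form S1; children p+1, ..., n form S2.]

  • p is the position of the allocation of the parent
  • S1 is the set of children treated before the allocation of the parent
  • S2 is the set of children treated after the allocation of the parent

\[
\begin{aligned}
V^{I/O} ={}& \max\Bigl(0,\ \max\Bigl(\max_{j=1,\dots,p}\Bigl(\min(M_j, M_0) + \sum_{k=1}^{j-1} cb_k\Bigr),\ m + \sum_{j=1}^{p} cb_j\Bigr) - M_0\Bigr) \\
&+ \sum_{j=p+1}^{n} \max\bigl(0,\ m + \min(M_j, M_0) - M_0\bigr) \\
&+ \sum_{i=p+1}^{n} \max\bigl(0,\ cb_i + m - M_0\bigr) \\
&+ \sum_{i=1}^{n} V^{I/O}_i
\end{aligned}
\]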
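As a rough check of the formula above, the following Python sketch evaluates it for a single family. The function name `flexible_io_volume` and the tuple layout `(M_j, cb_j, V_j)` are illustrative assumptions, not names from the talk; for p = 0 the inner maximum over S1 is treated as empty, which is also an assumption.

```python
def flexible_io_volume(children, m, M0, p):
    """Evaluate the family I/O volume formula of the flexible scheme.

    children: list of (M_j, cb_j, V_j) tuples in processing order, where
              M_j is the storage requirement of child j, cb_j the size of
              its contribution block and V_j the I/O volume of its subtree
              (hypothetical layout chosen for this sketch).
    m:        size of the parent's frontal matrix.
    M0:       available core memory.
    p:        number of children processed before the parent is allocated.
    """
    s1, s2 = children[:p], children[p:]

    # First term: classical behaviour inside S1, then parent allocation.
    peak = m + sum(cb for _, cb, _ in s1)
    stacked_cb = 0  # contribution blocks already produced by earlier children
    for M, cb, _ in s1:
        peak = max(peak, min(M, M0) + stacked_cb)
        stacked_cb += cb
    volume = max(0, peak - M0)

    # Second and third terms: each child of S2 runs next to the parent.
    for M, cb, _ in s2:
        volume += max(0, m + min(M, M0) - M0)
        volume += max(0, cb + m - M0)

    # Last term: I/O already performed inside the children's subtrees.
    volume += sum(V for _, _, V in children)
    return volume
```

With the made-up values `children = [(6, 2, 1), (5, 3, 0), (4, 1, 0)]`, `m = 4` and `M0 = 8`, the sketch gives a volume of 3 for p = 3, 2 for p = 2 or p = 1, and 4 for p = 0, so in this toy case allocating the parent neither first nor last is best.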


slide-82
SLIDE 82

Minimizing the I/O volume

Problem

To minimize V_family on a given family:

  • 1. In which order should we process the n children?
  • 2. At which position p should we allocate the frontal matrix of the parent?

Complexity

  • Number of possibilities: n · n!.
  • Reduced to 2^n (2-partition):
    • order before allocation → our optimal algorithm;
    • order after allocation → does not matter.
  • Cannot do "better" in general:
    • Problem NP-hard;
    • Proof: reduction from ... 2-partition.
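To give a feel for the two search-space sizes quoted above, here is a small Python sketch. The function names are hypothetical; the first count follows the slide's n · n! (an order of the children times a position for the parent), the second enumerates the 2^n ways of splitting the children between S1 and S2.

```python
from itertools import product
from math import factorial


def naive_search_space(n):
    """Number of (order, position) pairs as counted on the slide: n * n!."""
    return n * factorial(n)


def partition_search_space(n):
    """Number of ways to assign each of the n children to S1 or S2."""
    return 2 ** n


def enumerate_partitions(n):
    """Yield every (S1, S2) split of the child indices 0..n-1."""
    for in_s2 in product((False, True), repeat=n):
        s1 = [i for i in range(n) if not in_s2[i]]
        s2 = [i for i in range(n) if in_s2[i]]
        yield s1, s2
```

For n = 10 this gives 36,288,000 candidates for the naive count against 1,024 partitions, which is still exponential and consistent with the NP-hardness statement.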


slide-83
SLIDE 83

Heuristic

Algorithm.

All the children are initially in S1
repeat
    move to S2 the child which is responsible for the peak of storage
until we cannot save anything

Results with the TWOTONE matrix ordered with PORD:

[Plot: volume of I/O (number of reals) versus memory available (number of reals) for Term-MinIO, Flex-MinMEM and Flex-MinIO.]
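The heuristic above can be sketched in Python as follows. This is only one possible reading, under stated assumptions: the child "responsible for the peak" is taken to be the child of S1 that maximizes min(M_j, M0) plus the contribution blocks stacked before it, the children of S1 keep their relative order, and subtree I/O volumes are left out because they do not depend on the split. All function names are hypothetical.

```python
def io_volume(s1, s2, m, M0):
    """Family I/O volume for a given split (subtree volumes omitted).

    s1, s2: lists of (M_j, cb_j) pairs; s1 is in processing order.
    m: size of the parent's frontal matrix; M0: available core memory.
    """
    peak = m + sum(cb for _, cb in s1)
    stacked_cb = 0
    for M, cb in s1:
        peak = max(peak, min(M, M0) + stacked_cb)
        stacked_cb += cb
    volume = max(0, peak - M0)
    for M, cb in s2:
        volume += max(0, m + min(M, M0) - M0) + max(0, cb + m - M0)
    return volume


def peak_child(s1, M0):
    """Index of the S1 child with the largest storage while it is processed
    (assumed meaning of 'responsible for the peak')."""
    best_idx, best_val, stacked_cb = 0, -1, 0
    for idx, (M, cb) in enumerate(s1):
        val = min(M, M0) + stacked_cb
        if val > best_val:
            best_idx, best_val = idx, val
        stacked_cb += cb
    return best_idx


def greedy_split(children, m, M0):
    """Start with every child in S1; move the peak child to S2 while the
    I/O volume strictly decreases."""
    s1, s2 = list(children), []
    best = io_volume(s1, s2, m, M0)
    while s1:
        idx = peak_child(s1, M0)
        cand_s1 = s1[:idx] + s1[idx + 1:]
        cand_s2 = s2 + [s1[idx]]
        volume = io_volume(cand_s1, cand_s2, m, M0)
        if volume >= best:  # nothing saved: stop
            break
        s1, s2, best = cand_s1, cand_s2, volume
    return s1, s2, best
```

On the made-up family `[(6, 2), (5, 3), (4, 1)]` with `m = 4` and `M0 = 8`, the sketch moves only the last child to S2 and lowers the volume from 2 to 1. Being greedy, it offers no optimality guarantee.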



slide-85
SLIDE 85

Conclusion and Future work

Memory

  • New memory management schemes and corresponding memory minimization algorithms proposed.
  • Active memory and total memory cases considered.

Volume of I/O

  • Optimality for memory minimization does not imply optimality for I/O volume minimization.
  • Several contexts studied and various algorithms/heuristics proposed.

Future work:

  • Real-life implementation (modification of the factorization).
  • What about parallel executions?


slide-86
SLIDE 86

Research project within ROMA

Growing complexity of current platforms:

  • Need to adapt to the heterogeneity and the large scale of platforms.
  • Taking failures and energy consumption into account.

Project within ROMA: algorithms tied to hardware and emerging architectures

  • Exploiting modern multicore architectures.
  • Using GPUs and other accelerators.
  • Energy consumption.
  • Integrating research results into applications (MUMPS).
