SLIDE 1

The 2011 International Conference on High Performance Computing & Simulation

Workshop on Optimization Issues in Energy Efficient Distributed Systems

Improving Power efficiency of Dense Linear Algebra Algorithms on Multi-Core Processors via Slack Control

Pedro Alonso¹, Manuel F. Dolz², Rafael Mayo², Enrique S. Quintana-Ortí²

July 4–8, 2011, Istanbul (Turkey)

SLIDE 2

Introduction Theoretical approach Slack Reduction Algorithm Experimental results Conclusions and future work

Motivation

High performance computing:

Optimization of algorithms applied to solve complex problems

Technological advances ⇒ improved performance:

Processors work at higher frequencies

Higher number of cores per socket (processor)

Large number of processors and cores ⇒ high energy consumption

Methods, algorithms and techniques are needed to reduce the energy consumption of high performance computing

Reduce the frequency of processors with the DVFS technique

Pedro Alonso et al Improving Power efficiency of DLA Algorithms on Multi-Core Processors

SLIDE 3

Outline

1. Introduction

2. Theoretical approach
   The Critical Path Method
   Application to dense linear algebra algorithms

3. Slack Reduction Algorithm
   Previous steps
   Slack reduction
   Simulator

4. Experimental results
   Description
   Cholesky factorization
   QR factorization

5. Conclusions and future work
   Conclusions
   Future work

SLIDE 4

Introduction

Scheduling tasks of dense linear algebra algorithms

Examples: Cholesky, QR and LU factorizations

Energy saving tools available for multi-core processors

Example: Dynamic Voltage and Frequency Scaling (DVFS)

Scheduling tasks + DVFS

Power-aware scheduling on multi-core processors

Our strategy: reduce the frequency of the cores that will execute non-critical tasks, so that idle times decrease without sacrificing the total performance of the algorithm

Energy saving

SLIDE 6

The Critical Path Method

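[Figure: two event nodes i and j, each labelled with its earliest start (ES) and latest finish (LF), joined by an edge that represents a task with cost C_ij and slack S_ij]

$ES_i = \max_k (ES_k + C_{ki})$, over the predecessors k of node i

$LF_j = \min_k (LF_k - C_{jk})$, over the successors k of node j

$S_{ij} = LF_j - ES_i - C_{ij}$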

Concepts: DAG of dependencies

Nodes ⇒ Temporal events

Edges ⇒ Tasks

Times

Earliest and latest times to start and finish the execution of tasks

Total slack:

Amount of time that a task can be delayed without increasing the total execution time of the algorithm

Critical path:

Formed by a succession of tasks, from initial to final node of the graph, with total slack = 0.

SLIDE 7

Application to dense linear algebra algorithms

Objective ⇒ obtain the dependency graph corresponding to the computation of a dense linear algebra algorithm, apply the Critical Path Method to analyze the slacks, and reduce them with our Slack Reduction Algorithm

Example: Cholesky factorization of a matrix consisting of 3 × 3 blocks

for k = 1, 2, ..., s do
    A_{kk} = L_{kk} L_{kk}^T                   Cholesky factorization     b^3/3 flops   0.33 u.t.
    for i = k+1, k+2, ..., s do
        A_{ik} ← A_{ik} L_{kk}^{-T}            Triangular system solve    b^3 flops     1 u.t.
    end for
    for i = k+1, k+2, ..., s do
        for j = k+1, k+2, ..., i−1 do
            A_{ij} ← A_{ij} − A_{ik} A_{jk}^T  Matrix-matrix product      2b^3 flops    2 u.t.
        end for
        A_{ii} ← A_{ii} − A_{ik} A_{ik}^T      Symmetric rank-b update    b^3 flops     1 u.t.
    end for
end for
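The loop nest above can be turned into a list of tasks with their data dependencies. The following is a minimal Python sketch, not the authors' tool. The function name cholesky_tasks and the last-writer bookkeeping are hypothetical, and the reading of the task labels (P = Cholesky factorization, T = triangular solve, G = matrix-matrix product, S = symmetric rank-b update, followed by block row, block column and iteration k) is an assumption inferred from the costs listed on the slide.

```python
# Relative task costs in units of time (u.t.), as listed on the slide.
COST = {"P": 0.33, "T": 1.0, "G": 2.0, "S": 1.0}


def cholesky_tasks(s):
    """Return (tasks, deps) for a blocked Cholesky factorization on s x s blocks.

    tasks: list of (name, cost) in program order.
    deps:  dict mapping each task name to the set of tasks it must wait for.
    Labels concatenate single digits, so they are only unambiguous for s < 10.
    """
    tasks, deps = [], {}
    last_writer = {}  # block (i, j) -> name of the last task that wrote it

    def add(kind, i, j, k, reads):
        name = f"{kind} {i}{j}{k}"
        # Wait for the last writer of every block that is read or updated.
        deps[name] = {last_writer[b] for b in reads + [(i, j)] if b in last_writer}
        last_writer[(i, j)] = name
        tasks.append((name, COST[kind]))

    for k in range(1, s + 1):
        add("P", k, k, k, [])                        # A_kk = L_kk L_kk^T
        for i in range(k + 1, s + 1):
            add("T", i, k, k, [(k, k)])              # A_ik <- A_ik L_kk^-T
        for i in range(k + 1, s + 1):
            for j in range(k + 1, i):
                add("G", i, j, k, [(i, k), (j, k)])  # A_ij <- A_ij - A_ik A_jk^T
            add("S", i, i, k, [(i, k)])              # A_ii <- A_ii - A_ik A_ik^T
    return tasks, deps


tasks, deps = cholesky_tasks(3)
```

For s = 3 this yields ten tasks. Only read-after-write and write-after-write dependencies are tracked, which is a simplification that suffices here because a block is never overwritten after another task has read its final value.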

SLIDE 9

Application to dense linear algebra algorithms

Task-node DAG capturing the data dependencies in the computation of the Cholesky factorization of a matrix consisting of 3 × 3 blocks

[Figure: task-node DAG; tasks and costs in u.t.: P 111 (0.33), T 211 (1), T 311 (1), S 221 (1), G 321 (2), S 331 (1), P 222 (0.33), T 322 (1), S 332 (1), P 333 (0.33)]

Graph transformation in order to apply CPM:

Conversion from task-node to task-edge graph

[Figure: task-edge DAG over numbered event nodes; each task above labels one edge, and two zero-cost NULL edges complete the dependencies]

SLIDE 12

Application to dense linear algebra algorithms

Application of CPM to the task-edge DAG of the Cholesky factorization of a matrix consisting of 3 × 3 blocks

Task     i-j    C_ij    ES_i    LF_j    S_ij
P 111    0-1    0.33    0       0.33    0
T 211    1-8    1       0.33    1.33    0
T 311    1-2    1       0.33    1.33    0
NULL     2-3    0       1.33    1.33    0
S 221    8-9    1       1.33    3       0.67
G 321    3-4    2       1.33    3.33    0
S 331    2-5    1       1.33    4.33    2
P 222    9-4    0.33    2.33    3.33    0.67
T 322    4-5    1       3.33    4.33    0
S 332    5-6    1       4.33    5.33    0
P 333    6-7    0.33    5.33    5.67    0
NULL     8-3    0       1.33    1.33    0

Critical path:

[Figure: task-edge DAG with the critical path (the tasks with S_ij = 0) highlighted]

Objective: tune the slack of those tasks with S_ij > 0 by reducing their execution frequency, which lowers power usage → Slack Reduction Algorithm
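The CPM values for this example can be recomputed with a forward pass (earliest starts) and a backward pass (latest finishes) over the task-edge DAG. The following is a minimal Python sketch under stated assumptions: the function name cpm is hypothetical, the P tasks are given a cost of 1/3 u.t. (shown rounded as 0.33 on the slides), and the slack is taken as LF_j − ES_i − C_ij, which matches the tabulated values. It needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Task-edge DAG of the 3 x 3 Cholesky example:
# (task, start node i, end node j, cost C_ij in u.t.)
EDGES = [
    ("P 111", 0, 1, 1 / 3), ("T 211", 1, 8, 1.0), ("T 311", 1, 2, 1.0),
    ("NULL", 2, 3, 0.0), ("S 221", 8, 9, 1.0), ("G 321", 3, 4, 2.0),
    ("S 331", 2, 5, 1.0), ("P 222", 9, 4, 1 / 3), ("T 322", 4, 5, 1.0),
    ("S 332", 5, 6, 1.0), ("P 333", 6, 7, 1 / 3), ("NULL", 8, 3, 0.0),
]


def cpm(edges):
    """Return one row (task, i, j, C_ij, ES_i, LF_j, S_ij) per edge."""
    preds = {}
    for _, i, j, _ in edges:
        preds.setdefault(i, set())
        preds.setdefault(j, set()).add(i)
    order = list(TopologicalSorter(preds).static_order())

    # Forward pass: ES of a node is the longest path reaching it.
    es = {n: 0.0 for n in order}
    for n in order:
        for _, i, j, c in edges:
            if i == n:
                es[j] = max(es[j], es[i] + c)

    # Backward pass: LF of a node is the latest time that delays nothing.
    total = max(es.values())
    lf = {n: total for n in order}
    for n in reversed(order):
        for _, i, j, c in edges:
            if j == n:
                lf[i] = min(lf[i], lf[j] - c)

    return [(name, i, j, c, es[i], lf[j], lf[j] - es[i] - c)
            for name, i, j, c in edges]


for name, i, j, c, es_i, lf_j, s_ij in cpm(EDGES):
    print(f"{name:6} {i}-{j}  C={c:.2f}  ES={es_i:.2f}  LF={lf_j:.2f}  S={s_ij:.2f}")
```

Only S 221, S 331 and P 222 come out with a positive slack, so they are the candidates for a lower frequency.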

SLIDE 14

Slack reduction

Slack reduction algorithm

1. Frequency assignment

2. Critical subpath extraction

3. Slack reduction

1. Frequency assignment

Example: Cholesky factorization of 3 × 3 blocks:

[Figure: task-edge DAG in which every task is initially assigned the maximum frequency, f = 2.27 GHz]

Discrete collection of frequencies: {2.27, 2.13, 2.00, 1.87, 1.73, 1.60} GHz

The execution time of a task is inversely proportional to its frequency: lowering the frequency lengthens the task!
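The inverse proportionality between frequency and execution time can be written down directly. This is a minimal Python sketch; the names scaled_time and stretch are hypothetical, and it assumes that task times are given at the maximum frequency and scale exactly as 1/f.

```python
FREQS_GHZ = [2.27, 2.13, 2.00, 1.87, 1.73, 1.60]
F_MAX = max(FREQS_GHZ)


def scaled_time(t_at_fmax, f):
    """Execution time at frequency f for a task that takes t_at_fmax at F_MAX."""
    return t_at_fmax * F_MAX / f


# Stretch factor of a 1 u.t. task at each available frequency.
stretch = {f: scaled_time(1.0, f) for f in FREQS_GHZ}
```

For instance, a 1 u.t. task run at 1.60 GHz takes 2.27 / 1.60 ≈ 1.42 u.t.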

SLIDE 16

Critical subpath extraction

2. Critical subpath extraction

CP0 = {0, 1, 8, 3, 4, 5, 6, 7}, 5.66 u.t.

[Figure: subgraph that remains after removing CP0, with tasks T 311 (1), S 331 (1), S 221 (1), P 222 (0.33) and one NULL edge]

CP1 = {1, 2, 5}, 2 u.t.

[Figure: subgraph that remains after removing CP1, with tasks S 221 (1), P 222 (0.33) and one NULL edge]

CP2 = {8, 9, 4}, 1.33 u.t.

[Figure: subgraph that remains after removing CP2, with only the NULL edge]

CP3 = {2, 3}, 0 u.t.
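One plausible reading of the sequence CP0 to CP3 is: extract the longest path, delete its edges, and repeat on whatever remains. The following Python sketch implements that reading and is not necessarily the authors' exact procedure. The function names are hypothetical, and the rule that ties go to the edge listed first is an assumption; here it selects the path through node 8 for CP0.

```python
# Task-edge DAG of the 3 x 3 Cholesky example: (task, from node, to node, cost in u.t.)
EDGES = [
    ("P 111", 0, 1, 0.33), ("T 211", 1, 8, 1.0), ("T 311", 1, 2, 1.0),
    ("NULL", 2, 3, 0.0), ("S 221", 8, 9, 1.0), ("G 321", 3, 4, 2.0),
    ("S 331", 2, 5, 1.0), ("P 222", 9, 4, 0.33), ("T 322", 4, 5, 1.0),
    ("S 332", 5, 6, 1.0), ("P 333", 6, 7, 0.33), ("NULL", 8, 3, 0.0),
]


def longest_path(edges):
    """Return (length, edge list) of a longest path in the DAG `edges`."""
    best = {}  # node -> (length, edges) of the longest path starting there

    def from_node(n):
        if n not in best:
            best[n] = (0.0, [])
            for e in edges:
                _, i, j, cost = e
                if i != n:
                    continue
                length, tail = from_node(j)
                # The first outgoing edge always counts; later ones must be strictly longer.
                if not best[n][1] or cost + length > best[n][0]:
                    best[n] = (cost + length, [e] + tail)
        return best[n]

    starts = sorted({i for _, i, _, _ in edges})
    return max((from_node(n) for n in starts), key=lambda r: r[0])


def critical_subpaths(edges):
    """Repeatedly peel off the longest remaining path; return [(nodes, length), ...]."""
    remaining, subpaths = list(edges), []
    while remaining:
        length, path = longest_path(remaining)
        nodes = [path[0][1]] + [e[2] for e in path]
        subpaths.append((nodes, length))
        remaining = [e for e in remaining if e not in path]
    return subpaths


subpaths = critical_subpaths(EDGES)
```

With this edge order the four extracted node sequences and their lengths agree with CP0 to CP3 above; a different tie-breaking rule could pick the equally long path through node 2 first and produce different subpaths.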

SLIDE 20

Slack Reduction Algorithm (I)

Iteration 1: process critical subpath CP1 = {1, 2, 5}:

1. Check for tasks of CP1 with a nonzero slack: only task S 331

2. Reduce the slack by lowering the execution frequency of the task:

S 331: 2.27 GHz ⇒ 1.60 GHz; 1 u.t. ⇒ 1.42 u.t.; slack 2 u.t. ⇒ 0.58 u.t.

[Figure: task-edge DAG before the update, with all tasks at f = 2.27 GHz]

SLIDE 22

Slack Reduction Algorithm (II)

[Figure: task-edge DAG after iteration 1: S 331 (1.42 u.t.) at f = 1.60 GHz; all other tasks unchanged at f = 2.27 GHz]

SLIDE 23

Slack Reduction Algorithm (III)

Iteration 2: process critical subpath CP2 = {8, 9, 4}:

1. Check for tasks of CP2 with a nonzero slack: tasks S 221 and P 222

2. Split the slack equally between both tasks:

S 221: 2.27 GHz ⇒ 1.73 GHz; 1 u.t. ⇒ 1.31 u.t.; slack 0.67 u.t. ⇒ 0 u.t.

P 222: 2.27 GHz ⇒ 1.73 GHz; 0.33 u.t. ⇒ 0.44 u.t.; slack 0.67 u.t. ⇒ 0 u.t.

[Figure: task-edge DAG before the update, in the state reached after iteration 1]

SLIDE 25

Slack Reduction Algorithm (IV)

[Figure: task-edge DAG after iteration 2: S 221 (1.31 u.t.) and P 222 (0.44 u.t.) at f = 1.73 GHz; S 331 (1.42 u.t.) at f = 1.60 GHz; all other tasks at f = 2.27 GHz]

SLIDE 26

Slack Reduction Algorithm (V)

Iteration 3: process critical subpath CP3 = {2, 3}:

1. Requires no processing: the subpath contains only a NULL task

Final frequency assignment and task costs for the task-edge DAG of the Cholesky factorization of a matrix consisting of 3 × 3 blocks:

Task     Cost (u.t.)   f (GHz)
P 111    0.33          2.27
T 211    1             2.27
T 311    1             2.27
S 221    1.31          1.73
G 321    2             2.27
S 331    1.42          1.60
P 222    0.44          1.73
T 322    1             2.27
S 332    1             2.27
P 333    0.33          2.27

SLIDE 28

Simulator (I)

Simulator to evaluate the performance of our strategy

Input parameters:

DAG capturing the tasks and dependencies of a blocked algorithm, and the frequencies recommended by the Slack Reduction Algorithm

A simple description of the target architecture:

Number of sockets (physical processors)

Number of cores per socket

Discrete range of frequencies and their associated voltages

The cost (overhead) required to perform frequency changes

Static priority list scheduler:

The duration of tasks is known in advance

Tasks that lie on the critical path must be prioritized
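The core of such a simulator is a list scheduler that starts the highest-priority ready task whenever a core is free. The following is a minimal Python sketch and not the authors' simulator: the function name list_schedule and the data layout are hypothetical, the dependencies are those of the 3 × 3 Cholesky example, and sockets, voltages and the frequency-change overhead are deliberately left out.

```python
import heapq

# Task durations (u.t.) and dependencies of the 3 x 3 Cholesky example.
DURATIONS = {"P 111": 0.33, "T 211": 1.0, "T 311": 1.0, "S 221": 1.0, "G 321": 2.0,
             "S 331": 1.0, "P 222": 0.33, "T 322": 1.0, "S 332": 1.0, "P 333": 0.33}
DEPS = {"T 211": {"P 111"}, "T 311": {"P 111"}, "S 221": {"T 211"},
        "G 321": {"T 211", "T 311"}, "S 331": {"T 311"}, "P 222": {"S 221"},
        "T 322": {"P 222", "G 321"}, "S 332": {"T 322", "S 331"}, "P 333": {"S 332"}}
# Static priority list: tasks on the critical path come first.
PRIORITY = ["P 111", "T 211", "T 311", "G 321", "S 221",
            "S 331", "P 222", "T 322", "S 332", "P 333"]


def list_schedule(durations, deps, priority, cores):
    """Simulate a static-priority list scheduler; return (makespan, start times)."""
    rank = {name: r for r, name in enumerate(priority)}
    pending, finished, start = set(durations), set(), {}
    running = []  # min-heap of (finish time, task name)
    now = 0.0
    while pending or running:
        # Ready tasks: all predecessors finished; best priority first.
        ready = sorted((t for t in pending if deps.get(t, set()) <= finished),
                       key=rank.__getitem__)
        for t in ready[: cores - len(running)]:
            start[t] = now
            heapq.heappush(running, (now + durations[t], t))
            pending.remove(t)
        # Advance the clock to the next task completion.
        now, done = heapq.heappop(running)
        finished.add(done)
    return now, start


makespan, start = list_schedule(DURATIONS, DEPS, PRIORITY, cores=3)
```

With three cores the makespan equals the critical path length; with a single core it equals the sum of all task durations.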

SLIDE 29

Benchmark algorithms

Blocked algorithms: Cholesky and QR with incremental pivoting

Block size: b = 192

Matrix size varies from 576 to 2,112

Target architecture

Four quad-core sockets (a total of 16 cores)

Discrete range of frequencies: {2.27, 2.13, 2.00, 1.87, 1.73, 1.60} GHz

Associated voltages vary from 0.75 to 1.35 V (linear relation between voltage and frequency)

Frequency change latency: 0.1 u.t.

Representative values from the Intel Xeon 5520 processor

Metrics:

Execution time (u.t.): $T_{SRAPolicy}$ and $T_{NoPolicy}$

Impact of SRA on time: $\%T_{SRA} = \frac{T_{SRAPolicy}}{T_{NoPolicy}} \cdot 100$

Consumption (u.c.):

$C_{SRAPolicy} = \sum_{i=1}^{n} v_i^2 \, T(f_i) + v^2 \ast T(f_{max})$

$C_{NoPolicy} = v^2 \, T(f_{max})$

Impact of SRA on consumption: $\%C_{SRA} = \frac{C_{SRAPolicy}}{C_{NoPolicy}} \cdot 100$
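The effect of lowering the frequency on a voltage-squared-times-time consumption measure can be illustrated with a short Python sketch. It rests on several assumptions that are not stated on the slide: the linear voltage map sends 1.60 GHz to 0.75 V and 2.27 GHz to 1.35 V, the index i is read as running over tasks, only the summation term is evaluated (idle time is ignored), and the names voltage and consumption are hypothetical.

```python
FREQS_GHZ = [2.27, 2.13, 2.00, 1.87, 1.73, 1.60]
F_MIN, F_MAX = min(FREQS_GHZ), max(FREQS_GHZ)
V_MIN, V_MAX = 0.75, 1.35  # volts, from the target architecture description


def voltage(f):
    """Voltage at frequency f, assuming a linear map through (F_MIN, V_MIN) and (F_MAX, V_MAX)."""
    return V_MIN + (V_MAX - V_MIN) * (f - F_MIN) / (F_MAX - F_MIN)


def consumption(tasks):
    """Sum of v_i^2 * T(f_i) over (time at F_MAX, assigned frequency) pairs, in u.c."""
    return sum(voltage(f) ** 2 * t * F_MAX / f for t, f in tasks)


# 3 x 3 Cholesky example: task times at F_MAX, in the order
# P 111, T 211, T 311, S 221, G 321, S 331, P 222, T 322, S 332, P 333.
TIMES = [0.33, 1.0, 1.0, 1.0, 2.0, 1.0, 0.33, 1.0, 1.0, 0.33]
BASE = [(t, F_MAX) for t in TIMES]
# Frequencies after slack reduction: S 221 and P 222 at 1.73 GHz, S 331 at 1.60 GHz.
SRA_FREQS = [2.27, 2.27, 2.27, 1.73, 2.27, 1.60, 1.73, 2.27, 2.27, 2.27]
SRA = list(zip(TIMES, SRA_FREQS))

impact = 100 * consumption(SRA) / consumption(BASE)
```

Under these assumptions the slowed-down tasks run longer but at a much lower squared voltage, so the ratio stays below 100.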

SLIDE 30

Cholesky factorization

Impact of the SRA on energy and time for the Cholesky factorization:

[Figure: Cholesky factorization. Impact of SRA on consumption and impact of SRA on time (axis ticks from 20 to 140) versus matrix size n (576 to 2112), for excess ratios 1.00 and 1.50]

Excess ratio e = 1: time is not compromised in some cases, and consumption increases with the matrix size

Excess ratio e = 1.5: time is compromised in most cases, but consumption is lower than with e = 1

SLIDE 31

QR Factorization

Impact of the SRA on energy and time for the QR factorization:

[Figure: QR factorization. Impact of SRA on consumption and impact of SRA on time (axis ticks from 20 to 140) versus matrix size n (576 to 1728), for excess ratios 1.00 and 1.50]

Excess ratio e = 1: time is not compromised in some cases, but consumption increases with the matrix size

Excess ratio e = 1.5: time is compromised in most cases, but consumption is lower than with e = 1

SLIDE 32

Conclusions

Idea: exploit task-level parallelism to reduce energy consumption

Objective: reduce idle times by reducing the execution frequency of tasks

Slack Reduction Algorithm:

Tasks with slack are executed at a lower frequency

Theoretical results for dense linear algebra algorithms:

Cholesky and QR (with incremental pivoting) factorizations

Significant reduction in power consumption under realistic conditions

A higher ratio between the number of computational resources and the number of tasks yields a greater reduction in power consumption

The LU factorization shows similar behaviour to that of the QR factorization

SLIDE 33

Future work

Some improvements:

Slack Reduction Algorithm is a static strategy but... it has an implicit cost!

We are working on dynamic strategies that adapt the frequency and reduce slacks at run time

Future work:

We plan to integrate this theoretical study into a real run-time scheduler, e.g., SuperMatrix as part of libflame

SLIDE 34

Thanks for your attention!

Questions?
