SLIDE 1
Parallel Programming and Heterogeneous Computing
A3 - Performance Metrics
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group
SLIDE 2
Performance: Which car is faster?
SLIDE 3
■ Decrease Latency – process a single workload faster (= speedup)
■ Increase Throughput – process more workloads in the same time
Ø Both are performance metrics
■ Scalability: make best use of additional resources
□ Scale Up: utilize additional resources on a machine
□ Scale Out: utilize resources on additional machines
■ Cost/Energy Efficiency:
□ minimize cost/energy requirements for given performance objectives
□ alternatively: maximize performance for given cost/energy budget
■ Utilization: minimize idle time (= waste) of available resources
■ Precision Tradeoffs: trade precision of results for performance
Lukas Wenzel ParProg20 A1 Terminology Chart 3
Recap Optimization Goals
SLIDE 4
Different responses of performance metrics to scaling (additional resources):
■ Speedup: more resources ~ less time executing the same workload › strong scaling
■ Scaled Speedup: more resources ~ same time executing a larger workload › weak scaling
■ Linear speedup: resources and workload execution scale by the same factor
Lukas Wenzel ParProg 2020 A3 Performance Metrics Chart 4
Scaling Behavior
SLIDE 5
Anatomy of a Workload
A workload consists of multiple tasks, each containing a different number of operations.
[Figure: eight tasks T1-T8 scheduled on 1, 3, 5, and 8 execution units:
×1 unit: execution time 44 (idle 0)
×3 units: execution time 15 (idle 2), speedup 2.93
×5 units: execution time 9 (idle 3), speedup 4.89
×8 units: execution time 9 (idle 29), speedup 4.89 (!)]
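The effect pictured above, that adding execution units stops helping once the longest task dominates, can be reproduced with a small scheduling sketch. The greedy list scheduler and the task durations below are illustrative assumptions, not taken from the slides:

```python
import heapq

def makespan(durations, units):
    """Greedy list scheduling: assign each task to the earliest-free unit."""
    free = [0] * units                          # next free time per unit
    heapq.heapify(free)
    for d in sorted(durations, reverse=True):   # longest tasks first
        t = heapq.heappop(free)
        heapq.heappush(free, t + d)
    return max(free)

tasks = [12, 9, 7, 6, 4, 3, 2, 1]  # illustrative durations for T1..T8 (total 44)
for n in (1, 2, 4, 8):
    print(n, makespan(tasks, n))
# The longest task (12) bounds the shortest possible execution time,
# so beyond some unit count, extra units only add idle time.
```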
SLIDE 6
Anatomy of a Workload
The longest task puts a lower bound on the shortest execution time.
[Figure: the parallelizable time T_par is divided across N units, while the sequential time T_seq remains]
T(N) = T_par / N + T_seq
Replace absolute times by the parallelizable fraction P of the single-processor time T_1:
T_par = T_1 · P        T_seq = T_1 · (1 − P)
Modeling discrete tasks is impractical → simplified continuous model.
T(N) = T_1 · (P / N + (1 − P))
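The continuous model can be evaluated numerically; a minimal sketch, where T_1 = 100 time units and P = 0.9 are illustrative values, not from the slides:

```python
def execution_time(t1, p, n):
    """Continuous model: T(N) = T1 * (P/N + (1 - P))."""
    return t1 * (p / n + (1 - p))

# Illustrative values: T1 = 100 time units, 90% parallelizable.
# The sequential 10% (10 time units) is never divided away.
for n in (1, 2, 10, 100):
    print(n, execution_time(100.0, 0.9, n))
```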
SLIDE 7
Even for arbitrarily large N, the speedup converges to a fixed limit. To get a reasonable speedup out of 1000 processors, the sequential part must be substantially below 0.1%.
[Amdahl1967] Amdahl's Law
Amdahl's Law derives the speedup s_Amdahl(N) for a parallelization degree N:
s_Amdahl(N) = T_1 / T(N) = T_1 / (T_1 · (P/N + (1 − P))) = 1 / (P/N + (1 − P))
lim_{N→∞} s_Amdahl(N) = 1 / (1 − P)
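Amdahl's Law is easy to check numerically. A short sketch; the parameter values are illustrative:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: s(N) = 1 / (P/N + (1 - P))."""
    return 1.0 / (p / n + (1.0 - p))

# With P = 90%, the speedup is capped at 1 / (1 - 0.9) = 10,
# no matter how many processors are used.
print(amdahl_speedup(0.9, 10))      # well below 10
print(amdahl_speedup(0.9, 1000))    # approaches, but never reaches, 10
print(amdahl_speedup(0.999, 1000))  # even 0.1% sequential halves the ideal 1000x
```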
SLIDE 8
[Amdahl1967] Amdahl's Law
By Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551
SLIDE 9
Regardless of processor count, 90% parallelizable code allows no more than a tenfold speedup.
Ø Parallelism requires highly parallelizable workloads to achieve a speedup
■ What is the point of large parallel machines, then? Amdahl's Law assumes a simple speedup scenario:
Ø isolated execution of a single workload
Ø fixed workload size
[Amdahl1967] Amdahl's Law
SLIDE 10
Consider a scaled speedup scenario, allowing a variable workload size w.
Amdahl ~ What is the shortest execution time for a given workload?
Gustafson-Barsis ~ What is the largest workload for a given execution time?
[Gustafson1988] Gustafson-Barsis’ Law
[Figure: a fixed execution time T, split into a parallel part T_par and a sequential part T_seq; with N units, the parallel part is replicated N times]
Assumption: The parallelizable part of a workload contributes useful work when replicated.
w_1 ~ T_par + T_seq        w(N) ~ N · T_par + T_seq
SLIDE 11
[Gustafson1988] Gustafson-Barsis’ Law
Determine the scaled speedup s_Gustafson(N) through the increase in workload size w(N) over the fixed execution time T:
s_Gustafson(N) = w(N) / w_1 = (T · (P · N + (1 − P))) / (T · (P + (1 − P))) = P · N + (1 − P)
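The contrast with Amdahl's fixed-size limit can be made concrete in a few lines; the parameter values are illustrative:

```python
def gustafson_speedup(p, n):
    """Gustafson-Barsis' Law: scaled speedup s(N) = P * N + (1 - P)."""
    return p * n + (1.0 - p)

# With P = 90% and 1000 processors, Amdahl caps the fixed-size speedup at 10,
# but at fixed execution time the workload grows by a factor of ~900.
print(gustafson_speedup(0.9, 1000))  # ~900.1
```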
SLIDE 12
[Gustafson1988] Gustafson-Barsis’ Law
By Peahihawaii - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=12630392
[Figure: scaled speedup curves for P = 90% and P = 50%]
SLIDE 13
Parallel fraction P is a hypothetical parameter and not easily deduced from a given workload.
Ø The Karp-Flatt metric determines the sequential fraction e = 1 − P empirically:
1. Measure the baseline execution time T_1 by executing the workload on a single execution unit
2. Measure the parallelized execution time T(N) by executing the workload on N execution units
3. Determine the speedup s(N) = T_1 / T(N)
4. Calculate the Karp-Flatt metric e(N) = (1/s(N) − 1/N) / (1 − 1/N)
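The four measurement steps translate directly into code. A minimal sketch; the measured times below are invented for illustration:

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction e(N) from measured times T1 and T(N)."""
    s = t1 / tn                                  # step 3: speedup s(N)
    return (1.0 / s - 1.0 / n) / (1.0 - 1.0 / n) # step 4: e(N)

# Illustrative measurements: T1 = 100, T(4) = 31 on 4 execution units.
e = karp_flatt(100.0, 31.0, 4)
print(e)  # ~0.08: roughly 8% of the workload behaves sequentially
```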
[Karp1990] Karp-Flatt Metric
SLIDE 14
[Karp1990] Karp-Flatt Metric
The Karp-Flatt metric is derived by rearranging Amdahl's Law:
1/s(N) = T(N) / T_1, with T(N) = ((1 − e)/N + e) · T_1
1/s(N) = ((1 − e)/N + e) · T_1 / T_1
1/s(N) = (1 − e)/N + e = 1/N + (1 − 1/N) · e
1/s(N) − 1/N = (1 − 1/N) · e
e = (1/s(N) − 1/N) / (1 − 1/N)
SLIDE 15
Observing e(N) for different N indicates how the workload reacts to different degrees of parallelism:
■ e(N) close to 0 ~ high parallel fraction, workload benefits from parallelization
■ e(N) close to 1 ~ low parallel fraction, workload cannot use parallel resources
■ e(N) increases with N ~ workload suffers from parallelization overhead
■ e(N) decreases with N ~ workload scales well
Observing e(N) for different implementation variants of the workload can reveal bottlenecks.
[Karp1990] Karp-Flatt Metric
SLIDE 16
Directed Acyclic Graph to model a workload:
■ Nodes represent operations
■ Edges express dependencies between operations
T - total workload execution time:
□ T_1 - execution time with a single processor (the work) ~ number of nodes
□ T_P - execution time with P processors
□ T_∞ - execution time with an arbitrary number of processors (the span) ~ graph diameter
Work Law: T_P ≥ T_1 / P (processors cannot process multiple operations at once)
Span Law: T_P ≥ T_∞ (execution order cannot break dependencies)
[Leiserson2008] A More Detailed View
[Figure: example DAG of 18 operations, so T_1 = 18; its longest dependency chain determines T_∞]
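Work and span of a unit-cost task DAG can be computed directly: T_1 is the node count, T_∞ the longest dependency chain. The small DAG below is illustrative, not the one pictured on the slide:

```python
from functools import lru_cache

# edges: node -> set of predecessors it depends on
deps = {
    "a": set(), "b": set(),
    "c": {"a"}, "d": {"a", "b"},
    "e": {"c", "d"}, "f": {"d"},
    "g": {"e", "f"},
}

def work(dag):
    """T_1: total number of unit-cost operations."""
    return len(dag)

def span(dag):
    """T_inf: length of the longest dependency chain."""
    @lru_cache(maxsize=None)
    def depth(node):
        preds = dag[node]
        return 1 + (max(depth(p) for p in preds) if preds else 0)
    return max(depth(n) for n in dag)

t1, t_inf = work(deps), span(deps)
print(t1, t_inf)  # 7 4
# Work law: T_P >= T_1 / P ; span law: T_P >= T_inf, for any P.
```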
SLIDE 17
[Amdahl1967] Amdahl, Gene M. "Validity of the single processor approach to achieving large scale computing capabilities." Proceedings of the AFIPS Spring Joint Computer Conference. 483-485. 1967.
[Gustafson1988] Gustafson, John L. "Reevaluating Amdahl's law." Communications of the ACM 31.5 (1988): 532-533.
[Karp1990] Karp, Alan H. and Flatt, Horace P. "Measuring parallel processor performance." Communications of the ACM 33.5 (1990): 539-543.
[Leiserson2008] Leiserson, Charles E. and Mirman, Ilya B. "How to survive the multicore software revolution (or at least survive the hype)." Cilk Arts 1 (2008): 11.
Literature
SLIDE 18
And now for a break and a cup of Oolong*.
*or beverage of your choice