Parallel Programming and Heterogeneous Computing A3 - Performance Metrics



SLIDE 1

Parallel Programming and Heterogeneous Computing

A3 - Performance Metrics

Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group

SLIDE 2

Performance

Which car is faster?
… for transporting several large boxes
… for winning a race

Performance depends not only on an execution environment but also on the workload it executes!

Lukas Wenzel ParProg 2020 A3 Performance Metrics Chart 2

SLIDE 3

Recap Optimization Goals

Decrease Latency: process a single workload faster (= speedup)

Increase Throughput: process more workloads in the same time

→ Both are performance metrics

Scalability: make best use of additional resources
  Scale Up: utilize additional resources on a machine
  Scale Out: utilize resources on additional machines

Cost/Energy Efficiency:
  minimize cost/energy requirements for given performance objectives
  alternatively: maximize performance for given cost/energy budget

Utilization: minimize idle time (= waste) of available resources

Precision-Tradeoffs: trade performance for precision of results

Lukas Wenzel ParProg20 A1 Terminology Chart 3

SLIDE 4

Scaling Behavior

Different responses of performance metrics to scaling (additional resources):

Speedup: more resources ~ less time executing the same workload → strong scaling

Scaled Speedup: more resources ~ same time executing a larger workload → weak scaling

Linear speedup = resources and workload execution scale by the same factor

SLIDE 5

Anatomy of a Workload

A workload consists of multiple tasks, each containing a different number of operations.

[Figure: schedules of tasks T1 to T8 on 1, 3, 5 and 8 execution units]

×1: execution time 44 (idle 0)
×3: execution time 15 (idle 2), speedup 2.93
×5: execution time 9 (idle 3), speedup 4.89
×8: execution time 9 (idle 29), speedup 4.89 (!)
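The effect of distributing tasks over execution units can be sketched in a few lines of Python. This is a minimal sketch and not the schedule shown on the slide: the task durations are hypothetical (chosen only so that they sum to 44 with a longest task of 9), the tasks are assumed to be independent, and `makespan` is an illustrative helper that uses a simple greedy longest-task-first heuristic, which need not find the optimal schedule.

```python
import heapq

# Hypothetical task durations (assumption, not taken from the slide):
# eight independent tasks with a total of 44 and a longest task of 9.
TASKS = [9, 8, 7, 6, 5, 4, 3, 2]


def makespan(tasks, units):
    """Greedy longest-task-first schedule; returns the resulting execution time."""
    loads = [0] * units  # accumulated busy time per execution unit
    heapq.heapify(loads)
    for duration in sorted(tasks, reverse=True):
        least_loaded = heapq.heappop(loads)  # unit that becomes free first
        heapq.heappush(loads, least_loaded + duration)
    return max(loads)


def speedup(tasks, units):
    """Speedup relative to running all tasks on a single execution unit."""
    return sum(tasks) / makespan(tasks, units)


if __name__ == "__main__":
    for units in (1, 3, 5, 8):
        t = makespan(TASKS, units)
        idle = units * t - sum(TASKS)  # unused unit time until the last task ends
        print(f"x{units}: time {t}, idle {idle}, speedup {speedup(TASKS, units):.2f}")
```

With these assumed durations, going from 5 to 8 execution units does not shorten the execution time any further: the longest task (9) bounds it from below, so the extra units only add idle time.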

SLIDE 6

Anatomy of a Workload

[Figure: tasks T1 to T8]

The longest task puts a lower bound on the shortest execution time.

Modeling discrete tasks is impractical → simplified continuous model.

[Figure: $T_1$ split into $T_{par}$ and $T_{seq}$; $T(N)$ split into $T_{par}/N$ and $T_{seq}$]

$T(N) = \frac{T_{par}}{N} + T_{seq}$

Replace absolute times by parallelizable fraction $P$:

$T_{par} = T_1 \cdot P$
$T_{seq} = T_1 \cdot (1 - P)$

$T(N) = T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)$

SLIDE 7

[Amdahl1967] Amdahl's Law

Amdahl's Law derives the speedup $s_{Amdahl}(N)$ for a parallelization degree $N$:

$s_{Amdahl}(N) = \frac{T_1}{T(N)} = \frac{T_1}{T_1 \cdot \left( \frac{P}{N} + (1 - P) \right)} = \frac{1}{\frac{P}{N} + (1 - P)}$

$\lim_{N \to \infty} s_{Amdahl}(N) = \frac{1}{1 - P}$

Even for arbitrarily large $N$, the speedup converges to a fixed limit.

For getting reasonable speedup out of 1000 processors, the sequential part must be substantially below 0.1%.
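The claim about 1000 processors can be checked numerically. The following is a minimal sketch of Amdahl's Law as stated on this slide; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and not part of the lecture material.

```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's Law for parallel fraction p on n units."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("parallel fraction p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (p / n + (1.0 - p))


def amdahl_limit(p):
    """Limit of the speedup for n -> infinity; unbounded if nothing is sequential."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Sequential part of 0.1% on 1000 processors.
    print(amdahl_speedup(0.999, 1000))
    # Sequential part of 0.01% on 1000 processors.
    print(amdahl_speedup(0.9999, 1000))
```

A sequential part of exactly 0.1% yields a speedup of only about 500 on 1000 processors, half of the ideal value. Reducing the sequential part to 0.01% raises it to about 909.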

SLIDE 8

[Amdahl1967] Amdahl's Law

[Figure] By Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551

SLIDE 9

[Amdahl1967] Amdahl's Law

Regardless of processor count, 90% parallelizable code allows no more than a speedup by a factor of 10.

→ Parallelism requires highly parallelizable workloads to achieve a speedup

What is the sense in large parallel machines?

Amdahl's law assumes a simple speedup scenario!
→ isolated execution of a single workload
→ fixed workload size
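As a worked example of the factor-10 bound, Amdahl's Law can be evaluated for a parallel fraction of 0.9 at a few processor counts (the values of $N$ are chosen only for illustration). The speedup saturates quickly:

```latex
\begin{align*}
s_{Amdahl}(10)   &= \frac{1}{\frac{0.9}{10} + 0.1}   = \frac{1}{0.19}   \approx 5.26 \\
s_{Amdahl}(100)  &= \frac{1}{\frac{0.9}{100} + 0.1}  = \frac{1}{0.109}  \approx 9.17 \\
s_{Amdahl}(1000) &= \frac{1}{\frac{0.9}{1000} + 0.1} = \frac{1}{0.1009} \approx 9.91 \\
\lim_{N \to \infty} s_{Amdahl}(N) &= \frac{1}{1 - 0.9} = 10
\end{align*}
```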

SLIDE 10

[Gustafson1988] Gustafson-Barsis' Law

Consider a scaled speedup scenario, allowing a variable workload size $w$.

Amdahl ~ What is the shortest execution time for a given workload?

Gustafson-Barsis ~ What is the largest workload for a given execution time?

[Figure: fixed execution time $T$ split into $T_{par}$ and $T_{seq}$, shown for a single execution unit and for $N$ execution units]

Assumption: The parallelizable part of a workload contributes useful work when replicated.

$w_1 \sim T_{par} + T_{seq}$
$w(N) \sim N \cdot T_{par} + T_{seq}$

SLIDE 11

[Gustafson1988] Gustafson-Barsis' Law

Determine the scaled speedup $s_{Gustafson}(N)$ through the increase in workload size $w(N)$ over the fixed execution time $T$:

$s_{Gustafson}(N) = \frac{w(N)}{w_1} = \frac{T \cdot (P \cdot N + (1 - P))}{T \cdot (P + (1 - P))} = P \cdot N + (1 - P)$
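To see how differently the two laws respond to additional execution units, the scaled speedup can be placed next to Amdahl's fixed-workload speedup. This is a minimal sketch; the function names and the sample values (parallel fraction 0.9, 100 units) are assumptions for illustration only.

```python
def gustafson_speedup(p, n):
    """Scaled speedup by Gustafson-Barsis' Law: workload grows with n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("parallel fraction p must lie in [0, 1]")
    return p * n + (1.0 - p)


def amdahl_speedup(p, n):
    """Fixed-workload speedup by Amdahl's Law, for comparison."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("parallel fraction p must lie in [0, 1]")
    return 1.0 / (p / n + (1.0 - p))


if __name__ == "__main__":
    p, n = 0.9, 100  # illustrative values
    print(f"Amdahl:    {amdahl_speedup(p, n):.2f}")
    print(f"Gustafson: {gustafson_speedup(p, n):.2f}")
```

For these values the fixed-workload speedup stays below 10, while the scaled speedup is about 90. The two numbers answer different questions, so they do not contradict each other.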

SLIDE 12

[Gustafson1988] Gustafson-Barsis' Law

[Figure, plot labels: P = 90%, P = 50%] By Peahihawaii - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=12630392

SLIDE 13

[Karp1990] Karp-Flatt-Metric

Parallel fraction $P$ is a hypothetical parameter and not easily deduced from a given workload.

→ Karp-Flatt-Metric determines sequential fraction $Q = 1 - P$ empirically

1. Measure baseline execution time $T_1$ by executing workload on a single execution unit

2. Measure parallelized execution time $T(N)$ by executing workload on $N$ execution units

3. Determine speedup $s(N) = \frac{T_1}{T(N)}$

4. Calculate Karp-Flatt-Metric $Q(N) = \frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}}$
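The four steps above can be sketched as a small function. The measured times passed in below are hypothetical (100 time units on one execution unit, 40 on four). In practice they would come from timing real runs of the workload, which this sketch does not do.

```python
def karp_flatt(t1, tn, n):
    """Empirical sequential fraction Q(N) from two measured execution times.

    t1: execution time on a single execution unit (step 1)
    tn: execution time on n execution units (step 2)
    """
    if n < 2:
        raise ValueError("the metric is undefined for n = 1 (division by zero)")
    s = t1 / tn  # step 3: measured speedup
    return (1.0 / s - 1.0 / n) / (1.0 - 1.0 / n)  # step 4


if __name__ == "__main__":
    # Hypothetical measurements: T_1 = 100, T(4) = 40.
    print(karp_flatt(100.0, 40.0, 4))
```

For these numbers the speedup is 2.5 and the metric evaluates to about 0.2. This matches Amdahl's model with a parallel fraction of 0.8, since 100 · (0.8/4 + 0.2) = 40.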

SLIDE 14

[Karp1990] Karp-Flatt-Metric

The Karp-Flatt-Metric is derived by rearranging Amdahl's Law.

$\frac{1}{s(N)} = \frac{T(N)}{T_1}$ ; $T(N) = \left( \frac{1 - Q}{N} + Q \right) \cdot T_1$

$\frac{1}{s(N)} = \frac{\left( \frac{1 - Q}{N} + Q \right) \cdot T_1}{T_1}$

$\frac{1}{s(N)} = \frac{1 - Q}{N} + Q = \frac{1}{N} + \left( 1 - \frac{1}{N} \right) \cdot Q$

$\frac{1}{s(N)} - \frac{1}{N} = \left( 1 - \frac{1}{N} \right) \cdot Q$

$\frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}} = Q$

SLIDE 15

[Karp1990] Karp-Flatt-Metric

Observing $Q(N)$ for different $N$ indicates how the workload reacts to different degrees of parallelism:

$Q(N)$ close to 0 ~ high parallel fraction, workload benefits from parallelization

$Q(N)$ close to 1 ~ low parallel fraction, workload cannot use parallel resources

$Q(N)$ increases with $N$ ~ workload suffers from parallelization overhead

$Q(N)$ decreases with $N$ ~ workload scales well

Observing $Q(N)$ for different implementation variants of the workload can reveal bottlenecks.
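Reading the trend of the metric over several unit counts can be sketched as follows. The timings in `MEASURED` are hypothetical, and both the helper names and the tolerance used to call a change significant are assumptions made for this example.

```python
# Hypothetical measurements (assumption): execution time per number of units.
MEASURED = {1: 100.0, 2: 55.0, 4: 33.0, 8: 23.0}


def sequential_fractions(measured):
    """Karp-Flatt metric Q(N) for every measured unit count N > 1."""
    t1 = measured[1]
    fractions = {}
    for n, tn in sorted(measured.items()):
        if n > 1:
            inverse_speedup = tn / t1  # equals 1 / s(N)
            fractions[n] = (inverse_speedup - 1.0 / n) / (1.0 - 1.0 / n)
    return fractions


def trend(fractions, tolerance=1e-3):
    """Classify how Q(N) develops with growing N (tolerance is an assumption)."""
    values = [fractions[n] for n in sorted(fractions)]
    pairs = list(zip(values, values[1:]))
    if all(b > a + tolerance for a, b in pairs):
        return "increasing"
    if all(b < a - tolerance for a, b in pairs):
        return "decreasing"
    return "roughly constant or mixed"


if __name__ == "__main__":
    q = sequential_fractions(MEASURED)
    for n, value in q.items():
        print(f"N={n}: Q={value:.3f}")
    print(trend(q))
```

For these assumed timings the metric rises from about 0.10 at 2 units to about 0.12 at 8 units. By the rules above, that points to parallelization overhead rather than a fixed sequential part.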

SLIDE 16

[Leiserson2008] A More Detailed View

Directed Acyclic Graph to model a workload:

Nodes represent operations

Edges express dependencies between operations

Work $T$: total workload execution time

$T_1$: execution time with a single processor ~ number of nodes

$T_P$: execution time with $P$ processors

$T_\infty$: execution time with an arbitrary number of processors ~ graph diameter

Work Law: $T_P \geq \frac{T_1}{P}$ (processors cannot process multiple operations at once)

Span Law: $T_P \geq T_\infty$ (execution order cannot break dependencies)

[Figure: example graph with nodes numbered 1 to 18]

$T_1 = 18$, $T_\infty = 9$
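Work and span can be computed directly from a dependency graph. The sketch below uses a small hypothetical DAG, not the 18-node graph from the slide, and assumes that every operation takes one time step, so that the work equals the number of nodes and the span equals the number of nodes on the longest dependency path. All names are illustrative.

```python
from functools import lru_cache
from math import ceil

# Hypothetical DAG (assumption, not the graph from the slide):
# each key is an operation, each value lists the operations that depend on it.
DAG = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "e": [],
}


def work(dag):
    """T_1: one time step per node, so the work equals the number of nodes."""
    return len(dag)


def span(dag):
    """T_inf: number of nodes on the longest dependency path."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        return 1 + max((longest_from(succ) for succ in dag[node]), default=0)

    return max((longest_from(node) for node in dag), default=0)


def lower_bound(dag, processors):
    """Smallest T_P permitted by the Work Law and the Span Law together."""
    return max(ceil(work(dag) / processors), span(dag))


if __name__ == "__main__":
    print("work:", work(DAG), "span:", span(DAG))
    for p in (1, 2, 8):
        print(f"P={p}: T_P >= {lower_bound(DAG, p)}")
```

For this example graph the work is 5 and the span is 3 (path a, b, d). Beyond two processors the Span Law dominates, and the bound stays at 3 no matter how many processors are added.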

SLIDE 17

Literature

[Amdahl1967] Amdahl, Gene M. "Validity of the single processor approach to achieving large scale computing capabilities." Proceedings of the AFIPS Spring Joint Computer Conference (1967): 483-485.

[Gustafson1988] Gustafson, John L. "Reevaluating Amdahl's law." Communications of the ACM 31.5 (1988): 532-533.

[Karp1990] Karp, Alan H. and Flatt, Horace P. "Measuring parallel processor performance." Communications of the ACM 33.5 (1990): 539-543.

[Leiserson2008] Leiserson, Charles E. and Mirman, Ilya B. "How to survive the multicore software revolution (or at least survive the hype)." Cilk Arts 1 (2008): 11.

SLIDE 18

And now for a break and a cup of Oolong*.

*or beverage of your choice