SLIDE 1

B3CC: Concurrency 08: Parallelism from Concurrency

Trevor L. McDonell Utrecht University, B2 2020-2021

SLIDE 2

Recap

  • Concurrency: dealing with lots of things at once
    • Collection of independently executing processes
    • Two or more threads are making progress
  • Parallelism: doing lots of things at once
    • Simultaneous execution of (possibly related) computations
    • At least two threads are executing simultaneously

SLIDE 3

Recap

  • So far we have discussed concurrency as a means to write modular code with multiple interactions
  • Example: a network server that interacts with multiple clients simultaneously
  • Sometimes this can speed up the program by overlapping the I/O or time spent waiting for clients to respond, but this speedup doesn't require multiple processors to achieve
  • In many cases we can use the same method to achieve real parallelism
  • From now on, we will talk about some of the considerations for doing this well

SLIDE 4

Motivations

SLIDE 5

The free lunch is over

  • “The free lunch is over” (Herb Sutter, 2005)
  • Today virtually all processors include multiple cores/processing elements
  • This has become the primary method for increasing performance
  • This has consequences for the programmer

http://www.gotw.ca/publications/concurrency-ddj.htm

SLIDE 6

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistor counts (thousands) per year, 1970-2020, with processors such as the 386, Pentium, Pentium 4, Itanium 2, and Epyc marked. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010-2019 collected by K. Rupp.]

SLIDE 7

Why?

  • Moore's curve (1965)
  • Observation that the number of transistors in an integrated circuit doubles roughly every two years
  • In particular, to minimise the cost per transistor
  • Not a law in any sense of the word (don't call it that)

SLIDE 8

Why?

  • Dennard scaling
  • As transistors get smaller, power density remains constant
  • Combined with shrinking transistors, this implies performance per watt grows at roughly the same rate as transistor density
    • Signal delay decreases (clock frequency increases)
    • Voltage and current decrease (power density remains constant)
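As a back-of-the-envelope sketch of why the power density stays constant (the classical scaling argument, added here for illustration; it is not spelled out on the slide): if every device dimension, the voltage V, and the current I shrink by a factor k, then the capacitance C also shrinks by k while the achievable clock frequency f grows by k, so the power per transistor

    P = C V^2 f \;\propto\; \tfrac{1}{k} \cdot \tfrac{1}{k^2} \cdot k \;=\; \tfrac{1}{k^2}

while the number of transistors per unit area grows as k^2, leaving the power per unit area roughly constant.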

SLIDE 9

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistors (thousands), frequency (MHz), and typical power (Watts) per year, 1970-2020. Same attribution as above.]

SLIDE 10

Why?

  • Since ~2005, Dennard scaling breaks down
  • Static (leakage) power losses increased faster than the overall power draw dropped due to decreasing voltage & current
  • Consequence: we can no longer improve performance through frequency scaling alone

SLIDE 11

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistors (thousands), single-thread performance (SpecINT x 10^3), frequency (MHz), and typical power (Watts) per year, 1970-2020. Same attribution as above.]

SLIDE 12

Why?

  • Traditional approaches to increasing CPU performance:
    • Frequency scaling
    • Caches
    • Micro-architectural improvements
      • Out-of-order execution (increase utilisation of execution hardware)
      • Branch prediction (guess the outcome of control flow)
      • Speculative execution (do work before knowing if it will be needed)

SLIDE 13

Why?

  • Frequency scaling: The Power Wall
  • Power consumption of transistors does not decrease as fast as density increases
  • Performance limited by power consumption (& dissipation)

[Figure: sketch over time of transistor density, per-transistor power, and total power.]

SLIDE 14

Why?

  • Caches: The Memory Wall
  • Memory speed does not increase as fast as computing speed
  • Increasingly difficult to hide memory latency

[Figure: compute vs. memory performance over time, with a widening gap.]

SLIDE 15

Why?

  • Microarchitecture improvements: the Instruction-Level Parallelism Wall
  • Law of diminishing returns
  • Pollack's rule: performance ∝ √complexity

[Figure: serial performance vs. cost.]

SLIDE 16

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistors (thousands), single-thread performance (SpecINT x 10^3), frequency (MHz), typical power (Watts), and number of logical cores per year, 1970-2020. Same attribution as above.]

SLIDE 17

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: same plot as the previous slide.]

SLIDE 18

Aside: more cores ≠ more performance

https://arstechnica.com/gadgets/2020/11/a-history-of-intel-vs-amd-desktop-performance-with-cpu-charts-galore/

SLIDE 19

Considerations

SLIDE 20

Parallelism

  • Improving application performance through parallelisation means:
    • Reducing the total time to compute a single result (latency)
    • Increasing the rate at which a series of results are computed (throughput)
    • Reducing the power consumption of a computation

SLIDE 21

Problem

  • To make the program run faster, we need to gain more from parallelisation than we lose due to the overhead of adding it
  • Granularity: if the tasks are too small, the overhead of managing the tasks outweighs any benefit you might get from running them in parallel
  • Data dependencies: when one task depends on another, they must be performed sequentially

SLIDE 22

Speedup

  • The performance improvement, or speedup, of a parallel application is

        \text{speedup} = S_P = \frac{T_1}{T_P}

    where T_P is the time to execute using P threads/processors
  • The efficiency of the program is

        \text{efficiency} = \frac{S_P}{P} = \frac{T_1}{P \, T_P}

  • Here, T_1 can be:
    • The parallel algorithm executed on one thread: relative speedup
    • An equivalent serial algorithm: absolute speedup
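A made-up example to fix the definitions: if the serial run takes T_1 = 12 s and the run on P = 4 processors takes T_4 = 4 s, then the speedup is S_4 = 12 / 4 = 3 and the efficiency is S_4 / P = 3 / 4 = 75%.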
SLIDE 23

Maximum speedup

  • Several factors appear as overhead in parallel computations and limit the speedup of the program:
    • Periods when not all processors are performing useful work
    • Extra computations in the parallel version not appearing in the sequential version (example: recomputing constants locally)
    • Communication time between processes

SLIDE 24

Amdahl

  • Amdahl's law considers that the execution time of a program is split into:
    • W_ser: time spent doing (non-parallelisable) serial work
    • W_par: time spent doing parallelisable work
  • The execution time on P processors is therefore bounded by

        T_P \;\ge\; W_{\mathrm{ser}} + \frac{W_{\mathrm{par}}}{P}

  • If f is the fraction of serial work to be performed, we get the parallel speedup

        S_P \;\le\; \frac{1}{f + (1 - f)/P}
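As a quick numerical illustration of this bound (a sketch added here, not from the slides; the function name and the serial fraction f = 0.1 are invented for the example):

    -- Evaluate the Amdahl bound S_P <= 1 / (f + (1 - f) / P)
    amdahlBound :: Double -> Double -> Double
    amdahlBound f p = 1 / (f + (1 - f) / p)

    main :: IO ()
    main = mapM_ (\p -> print (p, amdahlBound 0.1 p)) [1, 2, 4, 8, 16, 64]
    -- with f = 0.1 the bound approaches, but never exceeds, 1/f = 10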
SLIDE 25

Amdahl

  • The speedup bound is determined by the degree of sequential execution in the program, not by the number of processors
  • Strong scaling (fixed-size speedup):

        \lim_{P \to \infty} S_P \;\le\; \frac{1}{f}

[Figure: fixed total work split into serial and parallelisable parts, executed with P = 1, 2, 4, 8 processors.]
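For instance (invented numbers): a program whose serial fraction is f = 0.05 can never run more than 1 / 0.05 = 20 times faster, no matter how many processors are added.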
SLIDE 26

Amdahl

SLIDE 27

Gustafson-Barsis

  • Often the problem size can increase as the number of processes increases
  • The proportion of the serial part then decreases
  • Weak scaling (scaled speedup):

        S_P = 1 + (P - 1)\, f_{\mathrm{par}}

[Figure: serial work stays fixed while the parallelisable work grows with P = 1, 2, 4, 8 processors.]
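Again with invented numbers: if 90% of the scaled execution is parallelisable (f_par = 0.9), then on P = 8 processors the scaled speedup is S_8 = 1 + 7 × 0.9 = 7.3, much closer to the ideal 8 than the Amdahl bound for a fixed-size problem.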
SLIDE 28

Parallelism

A quick tour

SLIDE 29

What’s in a game?

  • Three kinds of code:
    • Gameplay simulation
      • Models the state of the game world as interacting entities
      • Sound, networking, user input, etc.
    • Numeric computation
      • Physics, collision detection, path finding, scene graph traversal, etc.
    • Rendering
      • Pixel & vertex attributes; runs on the GPU

Sekiro: Shadows Die Twice, FromSoftware

SLIDE 30

What’s in a game?

  • Parallel application design
  • In practice, large applications consist of a mix of concurrency and parallelism
  • Parts may be run concurrently, but there are also (data) dependencies
  • Usually, individual tasks are not the same size

The Witness, Thekla

[Diagram: tasks in a game frame (user input, physics, AI, game logic, sound, rendering, networking) connected by dependencies.]

SLIDE 31

What’s in a game?

(Same bullets and task diagram as the previous slide.)

SLIDE 32

Task parallelism

  • Task parallelism
  • Problem is broken down into separate tasks
  • Individual tasks are created and communicate/synchronise with each other
  • Task decomposition dictates scalability

SLIDE 33

Fork-Join

  • Splits control flow into multiple forks which later rejoin
  • Can be used to implement many other patterns

[Diagram: task A forks into B and C, which later join into D.]

    data Async a

    do a  <- compute_A
       b' <- async (compute_B a)
       c' <- async (compute_C a)
       b  <- wait b'
       c  <- wait c'
       d  <- compute_D b c
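A runnable version of the sketch above, assuming the async package (Control.Concurrent.Async) and filling in the compute_* placeholders with dummy actions invented for the example:

    import Control.Concurrent.Async (async, wait)

    compute_A :: IO Int
    compute_A = return 1

    compute_B, compute_C :: Int -> IO Int
    compute_B a = return (a + 10)
    compute_C a = return (a * 10)

    compute_D :: Int -> Int -> IO Int
    compute_D b c = return (b + c)

    main :: IO ()
    main = do
      a  <- compute_A
      b' <- async (compute_B a)   -- fork: run compute_B concurrently
      c' <- async (compute_C a)   -- fork: run compute_C concurrently
      b  <- wait b'               -- join
      c  <- wait c'               -- join
      d  <- compute_D b c
      print d                     -- prints 21

Compile with -threaded and run with +RTS -N so the forked computations can actually run on separate cores.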

SLIDE 34

Divide-and-conquer

  • Many divide-and-conquer algorithms lend themselves to fork-join parallelism
  • Sub-problems must be independent so that they can execute in parallel
  • Correct task granularity is vital (see the sketch below):
    • Deep enough to expose enough parallelism
    • Not so fine-grained that scheduling overheads dominate

[Diagram: divide-and-conquer recursion tree: divide, base case, combine.]
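A minimal fork-join divide-and-conquer sketch, again assuming the async package; parSum and the cutoff of 10000 elements are invented for illustration and are not part of the slides:

    import Control.Concurrent.Async (concurrently)

    -- Sum a list with fork-join divide-and-conquer, switching to a
    -- sequential base case once the sub-problem drops below the cutoff.
    parSum :: Int -> [Int] -> IO Int
    parSum cutoff xs
      | length xs <= cutoff = return (sum xs)                           -- base case
      | otherwise = do
          let (ls, rs) = splitAt (length xs `div` 2) xs                 -- divide
          (l, r) <- concurrently (parSum cutoff ls) (parSum cutoff rs)  -- fork + join
          return (l + r)                                                -- combine

    main :: IO ()
    main = parSum 10000 [1 .. 1000000] >>= print

A larger cutoff means fewer, coarser tasks (less scheduling overhead); a smaller cutoff exposes more parallelism.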

SLIDE 35

Load balancing

  • The computation must be distributed evenly across the processors to obtain the fastest possible execution speed
  • It may be that some processors complete their tasks before others and become idle because the work is not evenly distributed:
    • The amount of work is not known prior to execution
    • Differences in processor speeds (e.g. noisy system, frequency boost, …)

SLIDE 36

Load balancing

  • Static load balancing can be viewed as a scheduling or bin-packing problem
  • Estimate the execution time for parts of the program and their interdependencies
  • Generate a fixed number of equally sized tasks and distribute them amongst the processors in some way (e.g. round robin, recursive bisection, random, …); a sketch follows below
  • Limitations:
    • Accurate estimates of execution time are difficult
    • Does not account for variable delays (e.g. memory access time) or a variable number of tasks (e.g. search problems)
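A tiny sketch of the "fixed set of tasks, distributed round-robin" idea (the function names are invented for illustration):

    import Data.List (transpose)

    -- Statically assign a fixed list of tasks to p workers in round-robin
    -- order, before any of them starts executing.
    roundRobin :: Int -> [task] -> [[task]]
    roundRobin p tasks = transpose (chunks tasks)
      where
        chunks [] = []
        chunks xs = let (c, rest) = splitAt p xs in c : chunks rest

    -- roundRobin 3 [1 .. 10] == [[1,4,7,10], [2,5,8], [3,6,9]]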

SLIDE 37

Load balancing

  • In dynamic load balancing, tasks are allocated to processors during execution
  • In a centralised dynamic scheme, one process holds all tasks to be computed
  • Worker processes request new tasks from the work-pool (a sketch follows below)
  • Readily applicable to divide-and-conquer problems
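A minimal sketch of such a centralised work-pool, assuming the async package and standard MVars; takeTask, worker, and the dummy workload are invented for illustration:

    import Control.Concurrent.Async (mapConcurrently_)
    import Control.Concurrent.MVar

    -- One MVar holds all remaining tasks; each worker repeatedly requests
    -- the next task, processes it, and asks again until the pool is empty.
    takeTask :: MVar [Int] -> IO (Maybe Int)
    takeTask pool = modifyMVar pool $ \ts -> case ts of
      []      -> return ([], Nothing)
      (t:ts') -> return (ts', Just t)

    worker :: MVar [Int] -> MVar Int -> IO ()
    worker pool acc = do
      next <- takeTask pool
      case next of
        Nothing -> return ()                             -- pool empty: stop
        Just t  -> do
          modifyMVar_ acc (\s -> return (s + t * t))     -- stand-in for real work
          worker pool acc

    main :: IO ()
    main = do
      pool <- newMVar [1 .. 1000]                        -- central pool of tasks
      acc  <- newMVar 0
      mapConcurrently_ (\_ -> worker pool acc) [1 .. 4 :: Int]   -- four workers
      readMVar acc >>= print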

SLIDE 38

Data parallelism

  • Data parallelism
  • Problem is viewed as operations over parallel data
  • The same operation is applied to subsets of the data
  • Scales to the amount of data & number of processors
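A minimal data-parallel sketch, assuming the parallel package (Control.Parallel.Strategies); the chunk size and the dummy per-element operation are invented for illustration:

    import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

    -- The same operation is applied to every element; the list is split into
    -- chunks of 10000 elements which are evaluated on different cores.
    process :: [Double] -> [Double]
    process = withStrategy (parListChunk 10000 rdeepseq) . map expensive
      where
        expensive x = sin x * cos x + sqrt (abs x)       -- stand-in for real work

    main :: IO ()
    main = print (sum (process [1 .. 1000000]))
    -- compile with -threaded and run with +RTS -N to use multiple cores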

[Diagram: an array of n elements split into chunks, each chunk handled by a different processor P1, P2, P3, ….]

SLIDE 39

Photo by @yukimomon

tot ziens (see you!)