SLIDE 1

B3CC: Concurrency 08: Parallelism from Concurrency

Trevor L. McDonell Utrecht University, B2 2020-2021

SLIDE 2

Recap

  • Concurrency: dealing with lots of things at once
    • Collection of independently executing processes
    • Two or more threads are making progress
  • Parallelism: doing lots of things at once
    • Simultaneous execution of (possibly related) computations
    • At least two threads are executing simultaneously

SLIDE 3

Recap

  • So far we have discussed concurrency as a means to write modular code with multiple interactions
  • Example: a network server that interacts with multiple clients simultaneously
  • Sometimes this can speed up the program by overlapping the I/O or time spent waiting for clients to respond, but this speedup doesn't require multiple processors to achieve
  • In many cases we can use the same method to achieve real parallelism
  • From now on, we will talk about some of the considerations for doing this well

SLIDE 4

Motivations

SLIDE 5

The free lunch is over

  • “The free lunch is over” (Herb Sutter, 2005)
  • Today virtually all processors include multiple cores/processing elements
  • This has become the primary method for increasing performance
  • This has consequences for the programmer

http://www.gotw.ca/publications/concurrency-ddj.htm

SLIDE 6

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistor counts (thousands) per year, 1970-2020, with processors such as the 386, Pentium, Pentium 4, Itanium 2, and Epyc marked. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010-2019 collected by K. Rupp.]

SLIDE 7

Why?

  • Moore's curve (1965)
  • Observation that the number of transistors in an integrated circuit doubles roughly every two years
  • In particular, to minimise the cost per transistor
  • Not a law in any sense of the word (don't call it that)

SLIDE 8

Why?

  • Dennard scaling
  • As transistors get smaller, power density remains constant
  • Combined with shrinking transistors, this implies performance per watt grows at roughly the same rate as transistor density
    • Signal delay decreases (clock frequency increases)
    • Voltage and current decrease (power density remains constant)
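As a back-of-the-envelope sketch of why the power density stays constant (the classical scaling argument, added here for illustration; it is not spelled out on the slide): if every device dimension, the voltage V, and the current I shrink by a factor k, then the capacitance C also shrinks by k while the achievable clock frequency f grows by k, so the power per transistor

    P = C V^2 f \;\propto\; \tfrac{1}{k} \cdot \tfrac{1}{k^2} \cdot k \;=\; \tfrac{1}{k^2}

while the number of transistors per unit area grows as k^2, leaving the power per unit area roughly constant.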

SLIDE 9

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistors (thousands), frequency (MHz), and typical power (Watts) per year, 1970-2020. Same attribution as above.]

SLIDE 10

Why?

  • Since ~2005, Dennard scaling breaks down
  • Static (leakage) power losses increased faster than the overall power draw dropped due to decreasing voltage & current
  • Consequence: we can no longer improve performance through frequency scaling alone

SLIDE 11

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistors (thousands), single-thread performance (SpecINT x 10^3), frequency (MHz), and typical power (Watts) per year, 1970-2020. Same attribution as above.]

SLIDE 12

Why?

  • Traditional approaches to increasing CPU performance:
    • Frequency scaling
    • Caches
    • Micro-architectural improvements
      • Out-of-order execution (increase utilisation of execution hardware)
      • Branch prediction (guess the outcome of control flow)
      • Speculative execution (do work before knowing if it will be needed)

SLIDE 13

Why?

  • Frequency scaling: The Power Wall
  • Power consumption of transistors does not decrease as fast as density increases
  • Performance limited by power consumption (& dissipation)

[Figure: sketch over time of transistor density, per-transistor power, and total power.]

SLIDE 14

Why?

  • Caches: The Memory Wall
  • Memory speed does not increase as fast as computing speed
  • Increasingly difficult to hide memory latency

[Figure: compute vs. memory performance over time, with a widening gap.]

SLIDE 15

Why?

  • Microarchitecture improvements: the Instruction-Level Parallelism Wall
  • Law of diminishing returns
  • Pollack's rule: performance ∝ √complexity

[Figure: serial performance vs. cost.]

SLIDE 16

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: 48 Years of Microprocessor Trend Data: transistors (thousands), single-thread performance (SpecINT x 10^3), frequency (MHz), typical power (Watts), and number of logical cores per year, 1970-2020. Same attribution as above.]

SLIDE 17

Why?

https://github.com/karlrupp/microprocessor-trend-data

[Figure: same plot as the previous slide.]

SLIDE 18

Aside: more cores ≠ more performance

https://arstechnica.com/gadgets/2020/11/a-history-of-intel-vs-amd-desktop-performance-with-cpu-charts-galore/

SLIDE 19

Considerations

SLIDE 20

Parallelism

  • Improving application performance through parallelisation means:
    • Reducing the total time to compute a single result (latency)
    • Increasing the rate at which a series of results are computed (throughput)
    • Reducing the power consumption of a computation

SLIDE 21

Problem

  • To make the program run faster, we need to gain more from parallelisation than we lose due to the overhead of adding it
  • Granularity: if the tasks are too small, the overhead of managing the tasks outweighs any benefit you might get from running them in parallel
  • Data dependencies: when one task depends on another, they must be performed sequentially

SLIDE 22

Speedup

  • The performance improvement, or speedup, of a parallel application is

        \text{speedup} = S_P = \frac{T_1}{T_P}

    where T_P is the time to execute using P threads/processors
  • The efficiency of the program is

        \text{efficiency} = \frac{S_P}{P} = \frac{T_1}{P \, T_P}

  • Here, T_1 can be:
    • The parallel algorithm executed on one thread: relative speedup
    • An equivalent serial algorithm: absolute speedup
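A made-up example to fix the definitions: if the serial run takes T_1 = 12 s and the run on P = 4 processors takes T_4 = 4 s, then the speedup is S_4 = 12 / 4 = 3 and the efficiency is S_4 / P = 3 / 4 = 75%.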
SLIDE 23

Maximum speedup

  • Several factors appear as overhead in parallel computations and limit the speedup of the program:
    • Periods when not all processors are performing useful work
    • Extra computations in the parallel version not appearing in the sequential version (example: recomputing constants locally)
    • Communication time between processes

SLIDE 24

Amdahl

  • Amdahl's law considers that the execution time of a program is split into:
    • W_ser: time spent doing (non-parallelisable) serial work
    • W_par: time spent doing parallelisable work
  • The execution time on P processors is therefore bounded by

        T_P \;\ge\; W_{\mathrm{ser}} + \frac{W_{\mathrm{par}}}{P}

  • If f is the fraction of serial work to be performed, we get the parallel speedup

        S_P \;\le\; \frac{1}{f + (1 - f)/P}
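As a quick numerical illustration of this bound (a sketch added here, not from the slides; the function name and the serial fraction f = 0.1 are invented for the example):

    -- Evaluate the Amdahl bound S_P <= 1 / (f + (1 - f) / P)
    amdahlBound :: Double -> Double -> Double
    amdahlBound f p = 1 / (f + (1 - f) / p)

    main :: IO ()
    main = mapM_ (\p -> print (p, amdahlBound 0.1 p)) [1, 2, 4, 8, 16, 64]
    -- with f = 0.1 the bound approaches, but never exceeds, 1/f = 10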
SLIDE 25

Amdahl

  • The speedup bound is determined by the degree of sequential execution in the program, not by the number of processors
  • Strong scaling (fixed-size speedup):

        \lim_{P \to \infty} S_P \;\le\; \frac{1}{f}

[Figure: fixed total work split into serial and parallelisable parts, executed with P = 1, 2, 4, 8 processors.]
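For instance (invented numbers): a program whose serial fraction is f = 0.05 can never run more than 1 / 0.05 = 20 times faster, no matter how many processors are added.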
SLIDE 26

Amdahl

SLIDE 27

Gustafson-Barsis

  • Often the problem size can increase as the number of processes increases
  • The proportion of the serial part then decreases
  • Weak scaling (scaled speedup):

        S_P = 1 + (P - 1)\, f_{\mathrm{par}}

[Figure: serial work stays fixed while the parallelisable work grows with P = 1, 2, 4, 8 processors.]
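Again with invented numbers: if 90% of the scaled execution is parallelisable (f_par = 0.9), then on P = 8 processors the scaled speedup is S_8 = 1 + 7 × 0.9 = 7.3, much closer to the ideal 8 than the Amdahl bound for a fixed-size problem.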
SLIDE 28

Parallelism

A quick tour

SLIDE 29

What’s in a game?

  • Three kinds of code:
    • Gameplay simulation
      • Models the state of the game world as interacting entities
      • Sound, networking, user input, etc.
    • Numeric computation
      • Physics, collision detection, path finding, scene graph traversal, etc.
    • Rendering
      • Pixel & vertex attributes; runs on the GPU

Sekiro: Shadows Die Twice, FromSoftware

SLIDE 30

What’s in a game?

  • Parallel application design
  • In practice, large applications consist of a mix of concurrency and parallelism
  • Parts may be run concurrently, but there are also (data) dependencies
  • Usually, individual tasks are not the same size

The Witness, Thekla

[Diagram: tasks in a game frame (user input, physics, AI, game logic, sound, rendering, networking) connected by dependencies.]

SLIDE 31

What’s in a game?

(Same bullets and task diagram as the previous slide.)

SLIDE 32

Task parallelism

  • Task parallelism
  • Problem is broken down into separate tasks
  • Individual tasks are created and communicate/synchronise with each other
  • Task decomposition dictates scalability

SLIDE 33

Fork-Join

  • Splits control flow into multiple forks which later rejoin
  • Can be used to implement many other patterns

[Diagram: task A forks into B and C, which later join into D.]

    data Async a

    do a  <- compute_A
       b' <- async (compute_B a)
       c' <- async (compute_C a)
       b  <- wait b'
       c  <- wait c'
       d  <- compute_D b c
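A runnable version of the sketch above, assuming the async package (Control.Concurrent.Async) and filling in the compute_* placeholders with dummy actions invented for the example:

    import Control.Concurrent.Async (async, wait)

    compute_A :: IO Int
    compute_A = return 1

    compute_B, compute_C :: Int -> IO Int
    compute_B a = return (a + 10)
    compute_C a = return (a * 10)

    compute_D :: Int -> Int -> IO Int
    compute_D b c = return (b + c)

    main :: IO ()
    main = do
      a  <- compute_A
      b' <- async (compute_B a)   -- fork: run compute_B concurrently
      c' <- async (compute_C a)   -- fork: run compute_C concurrently
      b  <- wait b'               -- join
      c  <- wait c'               -- join
      d  <- compute_D b c
      print d                     -- prints 21

Compile with -threaded and run with +RTS -N so the forked computations can actually run on separate cores.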

SLIDE 34

Divide-and-conquer

  • Many divide-and-conquer algorithms lend themselves to fork-join parallelism
  • Sub-problems must be independent so that they can execute in parallel
  • Correct task granularity is vital (see the sketch below):
    • Deep enough to expose enough parallelism
    • Not so fine-grained that scheduling overheads dominate

[Diagram: divide-and-conquer recursion tree: divide, base case, combine.]
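A minimal fork-join divide-and-conquer sketch, again assuming the async package; parSum and the cutoff of 10000 elements are invented for illustration and are not part of the slides:

    import Control.Concurrent.Async (concurrently)

    -- Sum a list with fork-join divide-and-conquer, switching to a
    -- sequential base case once the sub-problem drops below the cutoff.
    parSum :: Int -> [Int] -> IO Int
    parSum cutoff xs
      | length xs <= cutoff = return (sum xs)                           -- base case
      | otherwise = do
          let (ls, rs) = splitAt (length xs `div` 2) xs                 -- divide
          (l, r) <- concurrently (parSum cutoff ls) (parSum cutoff rs)  -- fork + join
          return (l + r)                                                -- combine

    main :: IO ()
    main = parSum 10000 [1 .. 1000000] >>= print

A larger cutoff means fewer, coarser tasks (less scheduling overhead); a smaller cutoff exposes more parallelism.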

SLIDE 35

Load balancing

  • The computation must be distributed evenly across the processors to obtain the fastest possible execution speed
  • It may be that some processors complete their tasks before others and become idle because the work is not evenly distributed:
    • The amount of work is not known prior to execution
    • Differences in processor speeds (e.g. noisy system, frequency boost, …)

SLIDE 36

Load balancing

  • Static load balancing can be viewed as a scheduling or bin-packing problem
  • Estimate the execution time for parts of the program and their interdependencies
  • Generate a fixed number of equally sized tasks and distribute them amongst the processors in some way (e.g. round robin, recursive bisection, random, …); a sketch follows below
  • Limitations:
    • Accurate estimates of execution time are difficult
    • Does not account for variable delays (e.g. memory access time) or a variable number of tasks (e.g. search problems)
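A tiny sketch of the "fixed set of tasks, distributed round-robin" idea (the function names are invented for illustration):

    import Data.List (transpose)

    -- Statically assign a fixed list of tasks to p workers in round-robin
    -- order, before any of them starts executing.
    roundRobin :: Int -> [task] -> [[task]]
    roundRobin p tasks = transpose (chunks tasks)
      where
        chunks [] = []
        chunks xs = let (c, rest) = splitAt p xs in c : chunks rest

    -- roundRobin 3 [1 .. 10] == [[1,4,7,10], [2,5,8], [3,6,9]]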

SLIDE 37

Load balancing

  • In dynamic load balancing, tasks are allocated to processors during execution
  • In a centralised dynamic scheme, one process holds all tasks to be computed
  • Worker processes request new tasks from the work-pool (a sketch follows below)
  • Readily applicable to divide-and-conquer problems
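A minimal sketch of such a centralised work-pool, assuming the async package and standard MVars; takeTask, worker, and the dummy workload are invented for illustration:

    import Control.Concurrent.Async (mapConcurrently_)
    import Control.Concurrent.MVar

    -- One MVar holds all remaining tasks; each worker repeatedly requests
    -- the next task, processes it, and asks again until the pool is empty.
    takeTask :: MVar [Int] -> IO (Maybe Int)
    takeTask pool = modifyMVar pool $ \ts -> case ts of
      []      -> return ([], Nothing)
      (t:ts') -> return (ts', Just t)

    worker :: MVar [Int] -> MVar Int -> IO ()
    worker pool acc = do
      next <- takeTask pool
      case next of
        Nothing -> return ()                             -- pool empty: stop
        Just t  -> do
          modifyMVar_ acc (\s -> return (s + t * t))     -- stand-in for real work
          worker pool acc

    main :: IO ()
    main = do
      pool <- newMVar [1 .. 1000]                        -- central pool of tasks
      acc  <- newMVar 0
      mapConcurrently_ (\_ -> worker pool acc) [1 .. 4 :: Int]   -- four workers
      readMVar acc >>= print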

SLIDE 38

Data parallelism

  • Data parallelism
  • Problem is viewed as operations over parallel data
  • The same operation is applied to subsets of the data
  • Scales to the amount of data & number of processors
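A minimal data-parallel sketch, assuming the parallel package (Control.Parallel.Strategies); the chunk size and the dummy per-element operation are invented for illustration:

    import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

    -- The same operation is applied to every element; the list is split into
    -- chunks of 10000 elements which are evaluated on different cores.
    process :: [Double] -> [Double]
    process = withStrategy (parListChunk 10000 rdeepseq) . map expensive
      where
        expensive x = sin x * cos x + sqrt (abs x)       -- stand-in for real work

    main :: IO ()
    main = print (sum (process [1 .. 1000000]))
    -- compile with -threaded and run with +RTS -N to use multiple cores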

[Diagram: an array of n elements split into chunks, each chunk handled by a different processor P1, P2, P3, ….]

SLIDE 39

Photo by @yukimomon

tot ziens (see you!)