SLIDE 1

Parallel Models

Different ways to exploit parallelism

SLIDE 2

Reusing this material

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_US

This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: You must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original. Note that this presentation contains images owned by others. Please seek their permission before reusing these images.

SLIDE 3

www.epcc.ed.ac.uk www.archer.ac.uk

SLIDE 4

Outline

  • Shared-Variables Parallelism
      • threads
      • shared-memory architectures
  • Message-Passing Parallelism
      • processes
      • distributed-memory architectures
  • Practicalities
      • usage on real HPC architectures

SLIDE 5

Shared Variables

Threads-based parallelism

SLIDE 6

Shared-memory concepts

  • Have already covered basic concepts
      • threads can all see data of parent process
      • can run on different cores
      • potential for parallel speedup

SLIDE 7

Analogy

  • One very large whiteboard in a two-person office
      • the shared memory
  • Two people working on the same problem
      • the threads running on different cores attached to the memory
  • How do they collaborate?
      • working together
      • but not interfering
  • Also need private data

[Figure labels: my data, shared data, my data]

SLIDE 8

Threads

[Figure: Thread 1, Thread 2 and Thread 3 each have their own program counter (PC) and private data; all three access the same shared data]

SLIDE 9

Thread Communication

[Figure: Thread 1 sets its private mya=23 and copies it to the shared variable with a=mya, so a holds 23; Thread 2 then computes its private mya=a+1, which gives 24]

SLIDE 10

Synchronisation

  • Synchronisation is crucial for the shared-variables approach
      • thread 2’s code must execute after thread 1’s
  • Most commonly use global barrier synchronisation
      • other mechanisms such as locks are also available
  • Writing parallel codes is relatively straightforward
      • access shared data as and when it is needed
  • Getting correct code can be difficult!
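The thread communication and barrier synchronisation described above can be sketched in Python. This is only an illustration using the standard threading module; the names shared, barrier and run_thread_communication are assumptions made for the example and do not come from the slides.

```python
import threading


def run_thread_communication():
    shared = {"a": None}              # shared data, visible to both threads
    barrier = threading.Barrier(2)    # global barrier across the two threads
    observed = {}                     # lets the caller inspect thread 2's private value

    def thread1():
        mya = 23                      # private variable of thread 1
        shared["a"] = mya             # a = mya: write to shared data
        barrier.wait()                # reach the barrier only after the write

    def thread2():
        barrier.wait()                # cannot pass until thread 1 has written a
        mya = shared["a"] + 1         # mya = a + 1: read shared data
        observed["mya"] = mya

    threads = [threading.Thread(target=thread2), threading.Thread(target=thread1)]
    for t in threads:                 # the reader is started first on purpose
        t.start()
    for t in threads:
        t.join()
    return shared["a"], observed["mya"]


print(run_thread_communication())     # prints (23, 24)
```

Without the barrier, thread 2 could read a before thread 1 has written it, which is the kind of error that makes correct shared-variables code difficult.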

SLIDE 11

Specific example

  • Computing asum = a0 + a1 + … + a7
      • shared:
          • main array: a[8]
          • result: asum
      • private:
          • loop counter: i
          • loop limits: istart, istop
          • local sum: myasum
      • synchronisation:
          • thread0: asum += myasum
          • barrier
          • thread1: asum += myasum

[Figure: asum is initialised with asum=0; each thread then runs the following loop over its own section of the array]

    loop: i = istart,istop
      myasum += a[i]
    end loop
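A hand-written version of this example might look like the following Python sketch. It is an assumption-laden illustration: it uses a lock to protect the update of asum rather than the barrier ordering shown on the slide, istop is exclusive here (Python range convention), and the function name threaded_sum is invented for the example.

```python
import threading


def threaded_sum(a, nthreads=2):
    n = len(a)
    shared = {"asum": 0}                      # shared result, initialised to 0
    lock = threading.Lock()                   # protects updates to shared["asum"]

    def worker(tid):
        # Private data: loop limits, loop counter and local sum.
        istart = tid * n // nthreads
        istop = (tid + 1) * n // nthreads     # exclusive upper limit
        myasum = 0
        for i in range(istart, istop):
            myasum += a[i]
        # Synchronisation: only one thread at a time may update asum.
        with lock:
            shared["asum"] += myasum

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["asum"]


print(threaded_sum([0, 1, 2, 3, 4, 5, 6, 7]))   # prints 28
```

Each thread touches the shared variable only once, after finishing its private loop, so almost all of the additions proceed without any synchronisation.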

SLIDE 12

Reductions

  • A reduction produces a single value from associative operations such as addition, multiplication, max, min, and, or.

    asum = 0;
    for (i = 0; i < n; i++)
        asum += a[i];

  • Allowing only one thread at a time to update asum removes all parallelism
      • instead, each thread accumulates its own private copy; the copies are then reduced to give the final result
      • if the number of operations is much larger than the number of threads, most of the operations can proceed in parallel
  • Want common patterns like this to be automated
      • not programmed by hand as in the previous slide
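As a rough illustration of what "automated" could mean, the sketch below wraps the private-copy pattern in one reusable function. It is not how any particular HPC tool implements reductions; parallel_reduce and its parameters are hypothetical names, and in standard CPython the global interpreter lock typically prevents a real speedup for pure Python code, so this shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def parallel_reduce(op, data, identity, nthreads=4):
    """Reduce data with the associative operation op using nthreads workers."""
    n = len(data)
    # Split the data into one contiguous chunk per thread.
    chunks = [data[t * n // nthreads:(t + 1) * n // nthreads] for t in range(nthreads)]

    def partial(chunk):
        # Each thread accumulates its own private copy.
        return reduce(op, chunk, identity)

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(partial, chunks))
    # The private copies are reduced to give the final result.
    return reduce(op, partials, identity)


print(parallel_reduce(operator.add, list(range(100)), 0))   # prints 4950
```

The caller supplies only the operation and its identity value; the chunking and the final combination step are no longer written by hand.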

SLIDE 13

Hardware

  • Needs support of a shared-memory architecture

[Figure: several processors connected by a shared bus to a single memory, all running under a single operating system]

SLIDE 14

Thread Placement: Shared Memory

[Figure: placement of OS and user threads (T) on the cores of a shared-memory node]

SLIDE 15

Threads in HPC

  • Threads existed before parallel computers
      • Designed for concurrency
      • Many more threads running than physical cores
          • scheduled / descheduled as and when needed
  • For parallel computing
      • Typically run a single thread per core
      • Want them all to run all the time
  • OS optimisations
      • Place threads on selected cores
      • Stop them from migrating
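Placing a thread on a selected core and stopping it from migrating can be sketched with the Linux-only os.sched_setaffinity call. This is a minimal illustration, assuming a Linux system; on real HPC machines the job launcher or threading runtime usually handles placement, and the helper name pin_to_one_core is invented for the example.

```python
import os


def pin_to_one_core():
    """Pin the caller to a single allowed core, report the result, then undo it."""
    if not hasattr(os, "sched_setaffinity"):
        return None                           # not available on this platform
    original = os.sched_getaffinity(0)        # cores we are currently allowed to use
    target = {min(original)}                  # select one of the allowed cores
    os.sched_setaffinity(0, target)           # the scheduler may no longer migrate us
    pinned = os.sched_getaffinity(0)
    os.sched_setaffinity(0, original)         # restore the original placement
    return pinned


print(pin_to_one_core())
```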

SLIDE 16

Practicalities

  • Threading can only operate within a single node
      • Each node is a shared-memory computer (e.g. 24 cores on ARCHER)
      • Controlled by a single operating system
  • Simple parallelisation
      • Speed up a serial program using threads
      • Run an independent program per node (e.g. a simple task farm)
  • More complicated
      • Use multiple processes (e.g. message-passing – next)
      • On ARCHER: could run one process per node, 24 threads per process
          • or 2 processes per node / 12 threads per process or 4 / 6 ...
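The processes-per-node / threads-per-process combinations mentioned above are simply the ways of factorising the core count. A small sketch, assuming every core runs exactly one thread (node_layouts is a hypothetical helper name):

```python
def node_layouts(cores_per_node):
    """Return (processes per node, threads per process) pairs that fill the node."""
    return [
        (procs, cores_per_node // procs)
        for procs in range(1, cores_per_node + 1)
        if cores_per_node % procs == 0        # every core gets exactly one thread
    ]


for procs, threads in node_layouts(24):
    print(f"{procs} processes per node x {threads} threads per process")
```

For a 24-core node this lists 1 x 24, 2 x 12 and 4 x 6 among the options, matching the combinations named on the slide.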

SLIDE 17

Threads: Summary

  • Shared whiteboard a good analogy for thread parallelism
  • Requires a shared-memory architecture
      • in HPC terms, cannot scale beyond a single node
  • Threads operate independently on the shared data
      • need to ensure they don’t interfere; synchronisation is crucial
  • Threading in HPC usually uses OpenMP directives
      • supports common parallel patterns
          • e.g. loop limits computed by the compiler
          • e.g. summing values across threads done automatically

SLIDE 18

Message Passing

Process-based parallelism

SLIDE 19

Analogy

  • Two whiteboards in different single-person offices
      • the distributed memory
  • Two people working on the same problem
      • the processes on different nodes attached to the interconnect
  • How do they collaborate?
      • to work on a single problem
  • Explicit communication
      • e.g. by telephone
      • no shared data

[Figure labels: my data, my data]

SLIDE 20

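Process communication

[Figure: Process 1 sets a=23 and calls Send(2,a); Process 2 calls Recv(1,b), so its b holds 23, and then computes a=b+1, which gives 24. Each process has its own program and data]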
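The send and receive pair above can be imitated with two operating-system processes joined by a pipe. This is a sketch using Python's multiprocessing module rather than a real message-passing library; it assumes a Unix-like system where the "fork" start method is available, and the function names are invented for the example.

```python
import multiprocessing as mp


def process1(conn):
    a = 23                      # data private to process 1
    conn.send(a)                # plays the role of Send(2, a)
    conn.close()


def process2(conn, result_conn):
    b = conn.recv()             # plays the role of Recv(1, b); waits for the message
    a = b + 1                   # process 2 has its own, separate variable a
    result_conn.send(a)         # report the value so the caller can inspect it


def run_point_to_point():
    ctx = mp.get_context("fork")                      # assumption: fork is available
    end1, end2 = ctx.Pipe()                           # channel between the two processes
    result_recv, result_send = ctx.Pipe(duplex=False)
    p1 = ctx.Process(target=process1, args=(end1,))
    p2 = ctx.Process(target=process2, args=(end2, result_send))
    p2.start()
    p1.start()
    value = result_recv.recv()
    p1.join()
    p2.join()
    return value


print(run_point_to_point())     # prints 24
```

Neither process can see the other's variables; the value 23 only reaches process 2 because it is explicitly sent and explicitly received.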

SLIDE 21

Synchronisation

  • Synchronisation is automatic in message-passing
      • the messages do it for you
  • Make a phone call …
      • … wait until the receiver picks up
  • Receive a phone call
      • … wait until the phone rings
  • No danger of corrupting someone else’s data
      • no shared whiteboard

SLIDE 22

Communication modes

  • Sending a message can be either synchronous or asynchronous
  • A synchronous send is not completed until the message has started to be received
  • An asynchronous send completes as soon as the message has gone
  • Receives are usually synchronous: the receiving process must wait until the message arrives

SLIDE 23

Synchronous send

  • Analogy with faxing a letter.
  • Know when letter has started to be received.

SLIDE 24

Asynchronous send

  • Analogy with posting a letter.
  • Only know when letter has been posted, not when it has been received.
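The difference between the two send modes can be made visible by timing a send to a receiver that is busy for a while before it picks up. The sketch below models a synchronous send as "send, then wait for an acknowledgement" and an asynchronous send as "send and carry on"; this is an illustrative model over a multiprocessing pipe, not the mechanism of any particular message-passing library, and it assumes the "fork" start method is available.

```python
import multiprocessing as mp
import time


def receiver(conn, delay):
    time.sleep(delay)                 # receiver is busy before it picks up
    msg = conn.recv()                 # the receive waits until the message arrives
    conn.send(("ack", msg))           # acknowledgement used by the synchronous send


def timed_send(synchronous, delay=0.3):
    """Return how long the sender waits before its send is complete."""
    ctx = mp.get_context("fork")      # assumption: fork is available
    sender_end, receiver_end = ctx.Pipe()
    start = time.perf_counter()
    proc = ctx.Process(target=receiver, args=(receiver_end, delay))
    proc.start()
    sender_end.send("hello")          # the message has gone (buffered by the OS)
    if synchronous:
        sender_end.recv()             # block until the receiver confirms receipt
    elapsed = time.perf_counter() - start
    proc.join()
    return elapsed


print(f"asynchronous send completed after {timed_send(False):.3f} s")
print(f"synchronous send completed after  {timed_send(True):.3f} s")
```

The asynchronous sender is free almost immediately, like posting a letter, while the synchronous sender is held up for at least as long as the receiver stays busy.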

SLIDE 25

Point-to-Point Communications

  • We have considered two processes
      • one sender
      • one receiver
  • This is called point-to-point communication
      • simplest form of message passing
      • relies on matching send and receive
  • Close analogy to sending personal emails

SLIDE 26

Message Passing: Collective communications

Process-based parallelism

SLIDE 27

Collective Communications

  • A simple message communicates between two processes
  • There are many instances where communication between groups of processes is required
  • Can be built from simple messages, but often implemented separately, for efficiency

SLIDE 28

Broadcast: one-to-all communication

SLIDE 29

Broadcast

  • From one process to all others

[Figure: the value 8 on one process is copied to all the other processes]

SLIDE 30

Scatter

  • Information scattered to many processes

[Figure: scatter among processes 0 to 5; the data values shown are 1, 3, 4, 5 and 2]

SLIDE 31

Gather

  • Information gathered onto one process

[Figure: gather among processes 0 to 5; the data values shown are 1, 3, 4, 5 and 2]

SLIDE 32

Reduction Operations

  • Combine data from several processes to form a single result

[Figure caption: Strike?]

SLIDE 33

Reduction

  • Form a global sum, product, max, min, etc.

[Figure: the values 1, 3, 4, 5 and 2 are combined into the global sum 15]
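The collective operations on the preceding slides can all be built from simple point-to-point messages. The sketch below does this naively with one pipe per worker process, with the parent acting as the root; real libraries often implement collectives separately for efficiency. It assumes the "fork" start method is available, and all function names are illustrative.

```python
import multiprocessing as mp


def worker(conn):
    factor = conn.recv()              # value received from the broadcast
    item = conn.recv()                # this process's piece of the scattered data
    conn.send(item * factor)          # local result, collected by the gather
    conn.close()


def broadcast(conns, value):
    for conn in conns:                # one to all: the same value to every process
        conn.send(value)


def scatter(conns, items):
    for conn, item in zip(conns, items):   # a different item to each process
        conn.send(item)


def gather(conns):
    return [conn.recv() for conn in conns]  # one item back from each process


def run_collectives(items, factor):
    ctx = mp.get_context("fork")      # assumption: fork is available
    pipes = [ctx.Pipe() for _ in items]
    procs = [ctx.Process(target=worker, args=(child,)) for _, child in pipes]
    for proc in procs:
        proc.start()
    conns = [parent for parent, _ in pipes]
    broadcast(conns, factor)
    scatter(conns, items)
    gathered = gather(conns)
    total = sum(gathered)             # reduction: combine into a single result
    for proc in procs:
        proc.join()
    return gathered, total


print(run_collectives([1, 3, 4, 5, 2], 1))   # prints ([1, 3, 4, 5, 2], 15)
```

Here the root sends and receives every message itself, so the cost grows with the number of processes; that is one reason dedicated implementations exist.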

SLIDE 34

Hardware

  • Natural map to distributed-memory
      • one process per processor-core
      • messages go over the interconnect, between nodes/OSes

[Figure: eight processors connected by an interconnect]

SLIDE 35

Processes: Summary

  • Processes cannot share memory
      • ring-fenced from each other
      • analogous to whiteboards in separate offices
  • Communication requires explicit messages
      • analogous to making a phone call, sending an email, …
      • synchronisation is done by the messages
  • Almost exclusively use the Message-Passing Interface (MPI)
      • MPI is a library of function calls / subroutines

SLIDE 36

Practicalities

How we use the parallel models

SLIDE 37

Practicalities

  • 8-core machine might only have 2 nodes
      • how do we run MPI on a real HPC machine?
  • Mostly ignore architecture
      • pretend we have single-core nodes
      • one MPI process per processor-core
      • e.g. run 8 processes on the 2 nodes
  • Messages between processor-cores on the same node are fast
      • but remember they also share access to the network
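Which messages stay inside a node depends on how processes are placed. The sketch below assumes a simple block placement, where consecutive process numbers fill one node before moving to the next; actual placement is decided by the job launcher and may differ. The helper names node_of and same_node are invented for the example.

```python
def node_of(rank, cores_per_node):
    """Node hosting a given process, assuming block placement (an assumption)."""
    return rank // cores_per_node


def same_node(rank_a, rank_b, cores_per_node):
    """True if a message between the two processes never leaves the node."""
    return node_of(rank_a, cores_per_node) == node_of(rank_b, cores_per_node)


# 8 processes on 2 nodes with 4 cores each, as in the example above.
for rank in range(8):
    print(f"process {rank} runs on node {node_of(rank, 4)}")
print(same_node(0, 3, 4))   # prints True: same node, fast
print(same_node(3, 4, 4))   # prints False: message crosses the interconnect
```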

[Figure: nodes attached to the interconnect]

SLIDE 38

Message Passing on Shared Memory

  • Run one process per core
      • don’t directly exploit shared memory
      • analogy is phoning your office mate
      • actually works well in practice!
  • Message-passing programs run by a special job launcher
      • user specifies number of copies
      • some control over allocation to nodes

[Figure labels: my data, my data]
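What a job launcher does can be imitated in a few lines: start the requested number of copies of the same program and tell each copy which one it is. This toy version is only a sketch; it runs all copies on one machine, assumes the "fork" start method is available, and the names program and launch are hypothetical.

```python
import multiprocessing as mp


def program(rank, size, out):
    # Every copy runs the same code but knows its own identity.
    out.put((rank, size))


def launch(ncopies):
    """Start ncopies copies of program and collect what each one reports."""
    ctx = mp.get_context("fork")      # assumption: fork is available
    out = ctx.Queue()
    procs = [ctx.Process(target=program, args=(rank, ncopies, out))
             for rank in range(ncopies)]
    for proc in procs:
        proc.start()
    results = [out.get() for _ in procs]   # arrival order is not deterministic
    for proc in procs:
        proc.join()
    return sorted(results)


print(launch(4))   # prints [(0, 4), (1, 4), (2, 4), (3, 4)]
```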

SLIDE 39

Summary

SLIDE 40

Summary

  • Shared-variables parallelism
      • uses threads
      • requires shared-memory machine
      • easy to implement but limited scalability
      • in HPC, done using OpenMP compilers
  • Distributed memory
      • uses processes
      • can run on any machine: messages can go over the interconnect
      • harder to implement but better scalability
      • in HPC, done using the MPI library
