SLIDE 1

CSC2/458 Parallel and Distributed Systems PPMI: Basic Building Blocks

Sreepathi Pai February 13, 2018

URCS

SLIDE 2

Outline

  • Multiprocessor Machines
  • Archetypes of Work Distribution
  • Multiprocessing
  • Multithreading and POSIX Threads
  • Non-blocking I/O or ‘Asynchronous’ Execution

SLIDE 3

Outline

  • Multiprocessor Machines
  • Archetypes of Work Distribution
  • Multiprocessing
  • Multithreading and POSIX Threads
  • Non-blocking I/O or ‘Asynchronous’ Execution

SLIDE 4

Very Simplified Programmer’s View of Multicore

[Diagram: several cores side by side, each with its own program counter (PC)]

  • Multiple program counters (PC)
  • MIMD machine
  • To what do we set these PCs?
  • Can the hardware do this automatically for us?
SLIDE 5

Automatic Parallelization to the Rescue?

for (i = 0; i < N; i++) {
  for (j = 0; j < i; j++) {
    // something(i, j)
  }
}

  • Assume a stream of instructions from a single-threaded

program

  • How do we split this stream into pieces?
SLIDE 6

Thread-Level Parallelism

  • Break the stream into long, continuous streams of instructions
  • Much bigger than the issue window on superscalars
  • 8 instructions vs hundreds
  • Streams are largely independent
  • Best performance on current hardware
  • “Thread-Level Parallelism” (TLP)
  • Contrast with ILP (instruction-level parallelism)
  • DLP (data-level parallelism)
  • MLP (memory-level parallelism)
SLIDE 7

Parallelization Issues

  • Assume we have a parallel algorithm
  • Work Distribution
  • How to split up work to be performed among threads?
  • Communication
  • How to send and receive data between threads?
  • Synchronization
  • How to coordinate different threads?
  • A form of communication
SLIDE 8

Outline

  • Multiprocessor Machines
  • Archetypes of Work Distribution
  • Multiprocessing
  • Multithreading and POSIX Threads
  • Non-blocking I/O or ‘Asynchronous’ Execution

SLIDE 9

Types of Parallel Programs (Simplified)

Let’s assume all parallel programs consist of “atomic” tasks.

  • All tasks identical, all perform same amount of work
  • Count words per page (many pages)
  • Matrix Multiply, 2D-convolution, most “regular” programs
  • All tasks identical, but perform different amounts of work
  • Count words per chapter (many chapters)
  • Graph analytics, most “irregular” programs
  • Different tasks
  • Pipelines
  • Servers (Tasks: Receive Request, Process Request, Respond to Request)

SLIDE 10

Scheme 1: One task per thread, same work

Count words per page of a book.
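A minimal pthreads sketch of this scheme: one thread per page, every task identical. The page strings and the count_words helper are invented for illustration.

/* Scheme 1 sketch: one thread per page, each task doing the same work.
   The "pages" data and count_words() are invented for illustration. */
#include <pthread.h>
#include <stdio.h>

#define NPAGES 4
const char *pages[NPAGES] = {
  "it was the best of times",
  "it was the worst of times",
  "it was the age of wisdom",
  "it was the age of foolishness"
};
int counts[NPAGES];

/* count whitespace-separated words on one page */
void *count_words(void *arg) {
  int p = (int)(long)arg;
  int n = 0, inword = 0;
  for (const char *c = pages[p]; *c; c++) {
    if (*c != ' ' && !inword) { n++; inword = 1; }
    else if (*c == ' ') inword = 0;
  }
  counts[p] = n;
  return NULL;
}

int main(void) {
  pthread_t t[NPAGES];
  for (int p = 0; p < NPAGES; p++)     /* one task per thread */
    pthread_create(&t[p], NULL, count_words, (void *)(long)p);
  for (int p = 0; p < NPAGES; p++)
    pthread_join(t[p], NULL);
  for (int p = 0; p < NPAGES; p++)
    printf("page %d: %d words\n", p, counts[p]);
  return 0;
}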

SLIDE 11

Work assigned once to threads (Static)

SLIDE 12

How many threads?

  • As many as tasks
  • As many as cores
  • Fewer than cores
  • More than cores
SLIDE 13

Hardware and OS Limitations

  • As many as tasks
  • Usually too many; runs into OS scheduler limitations
  • As many as cores
  • A reasonable default
  • Fewer than cores
  • If a hardware bottleneck is already saturated
  • More than cores
  • May help cope with a lack of ILP
SLIDE 14

Scheme 2: Multiple tasks per thread, differing work

Count words per chapter of a book.

SLIDE 15

Static Work Assignment

Assign chapters evenly.
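A sketch of static assignment in C, assuming the chapter count divides evenly among the threads; chapter_words() is a placeholder for the real per-chapter work.

/* Static assignment sketch: NCHAPTERS split evenly among NTHREADS,
   fixed up front. Assumes NCHAPTERS is divisible by NTHREADS. */
#include <pthread.h>
#include <stdio.h>

#define NCHAPTERS 12
#define NTHREADS  4

long totals[NTHREADS];

long chapter_words(int ch) { return 1000 + 100 * ch; } /* placeholder */

void *worker(void *arg) {
  int tid = (int)(long)arg;
  int per = NCHAPTERS / NTHREADS;        /* chapters per thread */
  for (int ch = tid * per; ch < (tid + 1) * per; ch++)
    totals[tid] += chapter_words(ch);    /* work assigned once */
  return NULL;
}

int main(void) {
  pthread_t t[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&t[i], NULL, worker, (void *)(long)i);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(t[i], NULL);
  for (int i = 0; i < NTHREADS; i++)
    printf("thread %d counted %ld words\n", i, totals[i]);
  return 0;
}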

SLIDE 16

Static Work Assignment

Chapters are of different lengths, leading to load imbalance.

SLIDE 17

Assigning chapters by size

  • Not always possible
  • May not know size of all chapters
  • Bin-packing problem
  • NP-hard
SLIDE 18

Dynamic (Greedy) Balancing

  • Create a set of worker threads (thread pool)
  • Place work (i.e. chapters) into a parallel worklist
  • Each worker thread pulls work off the worklist
  • When it finishes a chapter, it pulls more work off the worklist
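A sketch of this thread-pool pattern, using a mutex-protected shared counter as the parallel worklist; chapter_words() again stands in for the real work.

/* Dynamic (greedy) balancing sketch: a shared "next chapter" index
   is the parallel worklist; each worker pulls the next chapter when
   it finishes its current one. */
#include <pthread.h>
#include <stdio.h>

#define NCHAPTERS 12
#define NTHREADS  4

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int next_chapter = 0;            /* worklist: chapters not yet claimed */
long totals[NTHREADS];

long chapter_words(int ch) { return 1000 + 500 * (ch % 3); } /* uneven work */

void *worker(void *arg) {
  int tid = (int)(long)arg;
  for (;;) {
    pthread_mutex_lock(&lock);   /* pull one unit of work */
    int ch = next_chapter++;
    pthread_mutex_unlock(&lock);
    if (ch >= NCHAPTERS) break;  /* worklist is empty */
    totals[tid] += chapter_words(ch);
  }
  return NULL;
}

int main(void) {
  pthread_t t[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&t[i], NULL, worker, (void *)(long)i);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(t[i], NULL);
  for (int i = 0; i < NTHREADS; i++)
    printf("thread %d: %ld words\n", i, totals[i]);
  return 0;
}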
SLIDE 19

Dynamic Balancing

[Diagram: worker threads pulling chapters from a shared parallel worklist]

SLIDE 20

Generalized Parallel Programs

  • Threads can create additional work (“tasks”)
  • Tasks may be dependent on each other
  • Form a dependence graph
  • Same ideas as thread pool
  • Except only “ready” tasks are pulled off worklist
  • As tasks finish, their dependents are marked ready
  • May have thread-specific worklists
  • To prevent contention on main worklist
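A rough sketch of the dependence-counting idea, assuming a tiny hard-coded diamond graph of four tasks; a real runtime would block on a condition variable rather than spin as this sketch does.

/* Dependence-graph sketch: each task counts its unfinished inputs;
   finishing a task decrements its dependents' counts, and a task
   whose count hits zero is pushed onto the shared ready list. */
#include <pthread.h>
#include <stdio.h>

#define NTASKS 4
/* Diamond: task 0 -> tasks 1,2 -> task 3 */
int deps_left[NTASKS] = {0, 1, 1, 2};           /* unfinished inputs */
int dependents[NTASKS][2] = {{1,2},{3,-1},{3,-1},{-1,-1}};

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int ready[NTASKS], nready = 0, ndone = 0;

void *worker(void *arg) {
  (void)arg;
  for (;;) {
    int t = -1;
    pthread_mutex_lock(&lock);
    if (ndone == NTASKS) { pthread_mutex_unlock(&lock); break; }
    if (nready > 0) t = ready[--nready];        /* pull a ready task */
    pthread_mutex_unlock(&lock);
    if (t < 0) continue;                        /* nothing ready: retry */
    printf("running task %d\n", t);             /* "execute" the task */
    pthread_mutex_lock(&lock);
    ndone++;
    for (int i = 0; i < 2; i++) {               /* mark dependents ready */
      int d = dependents[t][i];
      if (d >= 0 && --deps_left[d] == 0) ready[nready++] = d;
    }
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void) {
  ready[nready++] = 0;                          /* task 0 has no inputs */
  pthread_t t[2];
  for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
  for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
  return 0;
}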
SLIDE 21

Outline

  • Multiprocessor Machines
  • Archetypes of Work Distribution
  • Multiprocessing
  • Multithreading and POSIX Threads
  • Non-blocking I/O or ‘Asynchronous’ Execution

SLIDE 22

Multiprocessing

  • Simplest way to take advantage of multiple cores
  • Run multiple processes
  • fork and wait
  • Traditional way in Unix
  • “Processes are cheap”
  • Not cheap in Windows
  • Shared-nothing model
  • Child inherits some parent state
  • Only viable model available in some programming languages
  • Python
  • Shared nothing: Communication between processes?
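A minimal fork/wait sketch of this model; each child gets its own copy of the parent's state, and nothing is shared after the fork.

/* fork/wait sketch: the traditional Unix multiprocessing pattern */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4

int main(void) {
  for (int i = 0; i < NWORKERS; i++) {
    pid_t pid = fork();
    if (pid == 0) {              /* child: do this worker's share */
      printf("worker %d is pid %d\n", i, getpid());
      exit(0);                   /* child must not continue the loop */
    }
  }
  for (int i = 0; i < NWORKERS; i++)
    wait(NULL);                  /* parent reaps all children */
  return 0;
}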
SLIDE 23

Communication between processes

  • Unix Interprocess Communication (IPC)
  • Filesystem
  • Pipes (anonymous and named)
  • Unix sockets
  • Semaphores
  • SysV Shared Memory
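A small sketch of one of these mechanisms, an anonymous pipe, carrying a message from parent to child.

/* Pipe IPC sketch: parent writes a message into an anonymous pipe,
   the child reads it out the other end. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  int fd[2];
  pipe(fd);                       /* fd[0] = read end, fd[1] = write end */
  if (fork() == 0) {              /* child: read from the pipe */
    char buf[64];
    close(fd[1]);
    ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
    buf[n] = '\0';
    printf("child got: %s\n", buf);
    return 0;
  }
  close(fd[0]);                   /* parent: write into the pipe */
  const char *msg = "hello from parent";
  write(fd[1], msg, strlen(msg));
  close(fd[1]);
  wait(NULL);
  return 0;
}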
SLIDE 24

Outline

  • Multiprocessor Machines
  • Archetypes of Work Distribution
  • Multiprocessing
  • Multithreading and POSIX Threads
  • Non-blocking I/O or ‘Asynchronous’ Execution

SLIDE 25

Multithreading

  • One process
  • Process creates threads (“lightweight processes”)
  • How is a thread different from a process? [What minimum state does a thread require?]

  • Everything shared model
  • Communication
  • Read and write to memory
  • Relies on programmers to think carefully about access to shared data

  • Tricky
SLIDE 26

Multithreading Programming Models

Roughly in (decreasing) order of power and complexity:

  • POSIX threads (pthreads)
  • C++11 threads may be simpler than this
  • Thread Building Blocks from Intel
  • Cilk
  • OpenMP
SLIDE 27

POSIX Threads on Linux

  • Processes == Threads for scheduler in Linux
  • 1:1 threading model
  • See OS textbook
  • pthreads provided as a library
  • gcc test.c -lpthread
  • OS scheduler can affect performance significantly
  • Especially with user-level threads
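A minimal program to go with the compile line above (save as test.c, build with gcc test.c -lpthread).

/* Minimal pthreads program: create one thread and wait for it */
#include <pthread.h>
#include <stdio.h>

void *hello(void *arg) {
  printf("hello from thread %ld\n", (long)arg);
  return NULL;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, hello, (void *)1L);   /* create... */
  pthread_join(t, NULL);                         /* ...and wait */
  return 0;
}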
SLIDE 28

Multithreading Components

  • Thread Management
  • Creation, death, waiting, etc.
  • Communication
  • Shared variables (ordinary variables)
  • Condition Variables
  • Synchronization
  • Mutexes (Mutual Exclusion)
  • Barriers
  • (Hardware) Read-Modify-Writes or “Atomics”
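A sketch tying these components together: communication through an ordinary shared variable, a mutex for mutual exclusion, and a condition variable so one thread can wait for another (the value 42 is arbitrary).

/* Mutex + condition variable sketch: the producer publishes a value,
   the main thread waits until it is ready. */
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
int data_ready = 0, shared_value = 0;   /* ordinary shared variables */

void *producer(void *arg) {
  (void)arg;
  pthread_mutex_lock(&m);
  shared_value = 42;                    /* communicate via memory */
  data_ready = 1;
  pthread_cond_signal(&c);              /* wake the waiting consumer */
  pthread_mutex_unlock(&m);
  return NULL;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, producer, NULL);
  pthread_mutex_lock(&m);
  while (!data_ready)                   /* re-check: spurious wakeups */
    pthread_cond_wait(&c, &m);
  printf("consumer saw %d\n", shared_value);
  pthread_mutex_unlock(&m);
  pthread_join(t, NULL);
  return 0;
}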
SLIDE 29

Outline

  • Multiprocessor Machines
  • Archetypes of Work Distribution
  • Multiprocessing
  • Multithreading and POSIX Threads
  • Non-blocking I/O or ‘Asynchronous’ Execution

SLIDE 30

CPU and I/O devices

  • CPUs compute
  • I/O devices perform I/O
  • What should the CPU do when it wants to do I/O?
SLIDE 31

Parallelism

  • I/O devices can usually operate in parallel with CPU
  • Read/write memory with DMA, for example
  • I/O devices can inform CPU when they complete work
  • (Hardware) Interrupts
  • How do we take advantage of this parallelism?
  • Even with a single-core CPU?
  • Hint: OS behaviour on I/O operations?
SLIDE 32

Non-blocking I/O within a program

  • Default I/O programming model: block until the request is satisfied

  • Non-blocking I/O model: don’t block
  • also called “Asynchronous I/O”
  • also called “Overlapped I/O”
  • Multiple I/O requests can be outstanding at the same time
  • How to handle completion?
  • How to handle data lifetimes?
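A sketch of the non-blocking model using O_NONBLOCK on stdin: with the flag set, read() returns immediately with EAGAIN instead of blocking when no data is available, leaving the CPU free for other work.

/* Non-blocking read sketch: stdin set to O_NONBLOCK */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
  fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

  char buf[128];
  ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
  if (n >= 0)
    printf("read %zd bytes\n", n);
  else if (errno == EAGAIN || errno == EWOULDBLOCK)
    printf("no input yet; doing other work instead of blocking\n");
  return 0;
}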
SLIDE 33

General Non-blocking I/O Programming Style

  • Operations don’t block
  • Only succeed when guaranteed not to block
  • Or put request in a (logical) queue to be handled later
  • Operation completion can be detected by:
  • Polling (e.g. select)
  • Notification (e.g. callbacks)
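A sketch of the polling style with select(), here watching stdin with a one-second timeout so the poll itself never blocks for long.

/* Polling sketch: ask select() which descriptors are ready */
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void) {
  fd_set rfds;
  FD_ZERO(&rfds);
  FD_SET(STDIN_FILENO, &rfds);          /* watch stdin for input */

  struct timeval tv = {1, 0};           /* poll for at most 1 second */
  int ready = select(STDIN_FILENO + 1, &rfds, NULL, NULL, &tv);

  if (ready > 0 && FD_ISSET(STDIN_FILENO, &rfds))
    printf("stdin is ready; a read() now would not block\n");
  else
    printf("nothing ready; do other work and poll again later\n");
  return 0;
}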
SLIDE 34

Programming Model Constructs for Asynchronous Programming

  • Coroutines
  • Futures/Promises
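C has no built-in futures, but a rough approximation can be sketched with pthreads: starting a thread plays the role of the promise, and joining it forces the future. The names future_start and future_get are invented for illustration.

/* Future/promise sketch in plain C: the "future" is a thread that
   computes a value; future_get blocks only when the result is
   actually demanded. */
#include <pthread.h>
#include <stdio.h>

typedef struct { pthread_t t; } future_t;

void *compute(void *arg) {
  long x = (long)arg;
  return (void *)(x * x);       /* the promised value */
}

future_t future_start(long x) {
  future_t f;
  pthread_create(&f.t, NULL, compute, (void *)x);
  return f;
}

long future_get(future_t f) {   /* blocks until the value is ready */
  void *result;
  pthread_join(f.t, &result);
  return (long)result;
}

int main(void) {
  future_t f = future_start(7); /* computation runs in the background */
  /* ... caller can do other work here ... */
  printf("7 squared is %ld\n", future_get(f));
  return 0;
}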