Lecture 2: Terminology and Definitions (Abhinav Bhatele, Department of Computer Science)

SLIDE 1

Lecture 2: Terminology and Definitions

Abhinav Bhatele, Department of Computer Science

Introduction to Parallel Computing (CMSC498X / CMSC818X)

SLIDE 2

Abhinav Bhatele (CMSC498X/CMSC818X) LIVE RECORDING

Announcements

  • Piazza space for the course is live. Sign-up link: https://piazza.com/umd/fall2020/cmsc498xcmsc818x
  • Slides from the previous class are posted on the course website
  • Recorded video is available via Panopto or ELMS

SLIDE 3

Summary of last lecture

  • Need for parallel and high performance computing
  • Parallel architecture: nodes, memory, network, storage

SLIDE 4

Cores, sockets, nodes

  • CPU: processor
      • Single-core or multi-core
      • A core is a processing unit; multiple such units on a single chip make it a multi-core processor
  • Socket: same as chip or processor
  • Node: a packaging of one or more sockets

https://www.glennklockwood.com/hpc-howtos/process-affinity.html

SLIDE 8

Job scheduling

  • HPC systems use job (batch) scheduling
  • Each user submits their parallel programs for execution to a “job” scheduler
  • The scheduler decides:
      • which job to schedule next (based on an algorithm: FCFS, priority-based, …)
      • which resources (compute nodes) to allocate to the ready job

Job Queue

  #   #Nodes Requested   Time Requested
  1   128                30 mins
  2   64                 24 hours
  3   56                 6 hours
  4   192                12 hours
  5   …                  …
  6   …                  …

  • Compute nodes: dedicated to each job
  • Network, filesystem: shared by all jobs
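
The FCFS policy named above can be sketched in a few lines of C. This is only a minimal illustration under stated assumptions, not how any production scheduler works: the `job_t` type, the `fcfs_schedule` function, and the machine size passed to it are hypothetical names and values introduced here. The sketch treats the requested time as the actual run time and ignores priorities and backfilling.

```c
#include <stddef.h>

/* Hypothetical job record mirroring the two columns of the job queue. */
typedef struct {
    int nodes;    /* number of nodes requested */
    int minutes;  /* time requested, in minutes */
} job_t;

/* Nodes free at time t, given that jobs 0..count-1 already have start times. */
static int free_nodes_at(const job_t *jobs, const int *start, size_t count,
                         int total_nodes, int t)
{
    int used = 0;
    for (size_t j = 0; j < count; j++) {
        if (start[j] <= t && t < start[j] + jobs[j].minutes)
            used += jobs[j].nodes;
    }
    return total_nodes - used;
}

/* Strict first-come, first-served: jobs start in queue order, each at the
 * earliest time when enough nodes are free. Writes start times (in minutes)
 * to start[]. Returns 0 on success, or -1 if a job requests more nodes
 * than the machine has. */
int fcfs_schedule(const job_t *jobs, size_t n, int total_nodes, int *start)
{
    int t = 0;  /* a job never starts before the job ahead of it */
    for (size_t i = 0; i < n; i++) {
        if (jobs[i].nodes > total_nodes)
            return -1;
        while (free_nodes_at(jobs, start, i, total_nodes, t) < jobs[i].nodes) {
            /* Not enough free nodes: advance t to the next job completion. */
            int next = -1;
            for (size_t j = 0; j < i; j++) {
                int end = start[j] + jobs[j].minutes;
                if (end > t && (next < 0 || end < next))
                    next = end;
            }
            t = next;
        }
        start[i] = t;
    }
    return 0;
}
```

For the four jobs listed in the queue, on an assumed 256-node machine, the first three jobs start immediately (128 + 64 + 56 = 248 nodes), and the 192-node job waits until minute 360, when the 30-minute and 6-hour jobs have both finished.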
SLIDE 9

Compute nodes vs. login nodes

  • Compute nodes: dedicated nodes for running jobs
      • Can only be accessed once the job scheduler has allocated them to a user
  • Login nodes: nodes shared by all users to compile their programs, submit jobs, etc.

SLIDE 10

Supercomputers vs. commodity clusters

  • Supercomputer: a large, expensive installation, typically using custom hardware
      • High-speed interconnect
      • Examples: IBM Blue Gene, Cray XT, Cray XC
  • Cluster: a collection of nodes, typically put together using commodity (off-the-shelf) hardware

SLIDE 11

Serial vs. parallel code

  • Thread: a path of execution managed by the OS
      • Threads share memory
  • Process: heavy-weight; processes do not share resources such as memory, file descriptors, etc.
  • Serial or sequential code: can only run on a single thread or process
  • Parallel code: can be run on one or more threads or processes

SLIDE 13

Scaling and scalable

  • Scaling: running a parallel program on 1 to n processes
      • 1, 2, 3, …, n
      • 1, 2, 4, 8, …, n
  • Scalable: a program is scalable if its performance improves when using more resources

[Figure: execution time (minutes) versus number of cores, with "Actual" and "Extrapolation" series]

SLIDE 14

Weak versus strong scaling

  • Strong scaling: fixed total problem size as we run on more processes
      • Sorting n numbers on 1 process, 2 processes, 4 processes, …
  • Weak scaling: fixed problem size per process, but increasing total problem size as we run on more processes
      • Sorting n numbers on 1 process
      • 2n numbers on 2 processes
      • 4n numbers on 4 processes
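
The difference between the two modes comes down to which quantity is held fixed. A small sketch in C, with hypothetical function names chosen for illustration only:

```c
/* Strong scaling: the total size n_total stays fixed, so each of p
 * processes handles about n_total / p items. This returns the largest
 * share (rounded up when p does not divide n_total evenly). */
long strong_scaling_share(long n_total, long p)
{
    return (n_total + p - 1) / p;
}

/* Weak scaling: each process keeps n_per_process items, so the total
 * problem size grows linearly with the number of processes p. */
long weak_scaling_total(long n_per_process, long p)
{
    return n_per_process * p;
}
```

With n = 1000 and 4 processes, the strong-scaling share is 250 items per process, while the weak-scaling total is 4000 items.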

SLIDE 15

Speedup and efficiency

  • Speedup: Ratio of execution time on one process to that on p processes
  • Efficiency: Speedup per process

\[ \text{Speedup} = \frac{t_1}{t_p} \qquad \text{Efficiency} = \frac{t_1}{t_p \times p} \]
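
Both definitions translate directly into code. The following is a minimal sketch in C; the function names and the timings in the usage note are illustrative assumptions, not measurements from the lecture.

```c
/* Speedup: execution time on one process divided by time on p processes. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency: speedup per process, i.e. t1 / (tp * p). */
double efficiency(double t1, double tp, int p)
{
    return t1 / (tp * p);
}
```

For example, if a program hypothetically takes 100 s on one process and 25 s on 8 processes, the speedup is 4 and the efficiency is 0.5: the run is four times faster, but each process contributes only half of its ideal share.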

SLIDE 16

Amdahl’s law

  • Speedup is limited by the serial portion of the code
      • Often referred to as the serial “bottleneck”
  • Let’s say only a fraction f of the code can be parallelized on p processes

\[ \text{Speedup} = \frac{1}{(1 - f) + f/p} \]
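
The formula follows from splitting the one-process run time t_1 into a serial part and a parallelizable part. A short derivation, assuming the parallelizable part divides perfectly across p processes with no added overhead:

```latex
\[
t_p = (1 - f)\,t_1 + \frac{f\,t_1}{p}
\quad\Longrightarrow\quad
\text{Speedup} = \frac{t_1}{t_p} = \frac{1}{(1 - f) + f/p},
\qquad
\lim_{p \to \infty} \text{Speedup} = \frac{1}{1 - f}
\]
```

The limit shows why the serial portion is the bottleneck: no matter how many processes are used, the speedup cannot exceed 1/(1 − f).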

SLIDE 19

Amdahl’s law

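Here p denotes the parallelizable fraction and n the number of processes (written f and p, respectively, in the earlier form):

\[ \text{Speedup} = \frac{1}{(1 - p) + p/n} \]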

    /* Fragment as shown on the slide; declarations, MPI setup, and f() are not shown. */
    fprintf(stdout, "Process %d of %d is on %s\n", myid, numprocs, processor_name);
    fflush(stdout);

    n = 10000;  /* default # of rectangles */
    if (myid == 0)
        startwtime = MPI_Wtime();

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h = 1.0 / (double) n;
    sum = 0.0;
    /* A slightly better approach starts from large i and works back */
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

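Example: a run takes 100 s on 1 process, of which p = 60 s is parallelizable and 100 − p = 40 s is serial, so the parallelizable fraction is 0.6:

\[ \text{Speedup} = \frac{1}{(1 - 0.6) + 0.6/n} \]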
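
To see how quickly the speedup saturates, the formula can be evaluated for a few process counts. A minimal sketch in C, with hypothetical function names; it models only the formula and does not measure any real program.

```c
/* Amdahl's law: predicted speedup on n processes when parallel_fraction
 * (between 0 and 1) of the one-process run time can be parallelized. */
double amdahl_speedup(double parallel_fraction, int n)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n);
}

/* Upper bound on the speedup as n grows without limit:
 * the reciprocal of the serial fraction. */
double amdahl_limit(double parallel_fraction)
{
    return 1.0 / (1.0 - parallel_fraction);
}
```

With a parallelizable fraction of 0.6, two processes give about 1.43, four give about 1.82, and the speedup can never exceed 1/0.4 = 2.5, however many processes are added.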

SLIDE 20

Communication and synchronization

  • Each process may execute serial code independently for a while
  • When data is needed from other (remote) processes, messaging occurs
      • Referred to as communication or synchronization or MPI messages
  • Intra-node vs. inter-node communication
  • Bulk synchronous programs: all processes compute simultaneously, then synchronize together

SLIDE 21

Different models of parallel computation

  • SIMD: Single Instruction Multiple Data
  • MIMD: Multiple Instruction Multiple Data
  • SPMD: Single Program Multiple Data
      • Typical in HPC

SLIDE 22

Abhinav Bhatele
5218 Brendan Iribe Center (IRB) / College Park, MD 20742
phone: 301.405.4507 / e-mail: bhatele@cs.umd.edu