Lecture 1, CSE 260 Parallel Computation (Fall 2015), Scott B. Baden

SLIDE 1

Lecture 1 CSE 260 – Parallel Computation (Fall 2015) Scott B. Baden Introduction

SLIDE 2


Welcome to CSE 260!

Your instructor is Scott Baden

u baden+260@ucsd.edu u Office hours in EBU3B Room 3244

Today at 3.30, next week TBA (or make appointment)
Your TA is Siddhant Arya

u sarya@eng.ucsd.edu

The class home page is

http://www.cse.ucsd.edu/classes/fa15/cse260-a

Course Resources

u Piazza (Moodle for grades) u Stampede @ TACC

Create an XSEDE portal account if you don’t have one

u Bang @ UCSD

Scott B. Baden / CSE 260, UCSD / Fall '15

SLIDE 3

What you’ll learn in this class

How to solve computationally intensive problems on parallel computers effectively: multicore processors, GPUs, clusters

- Parallel programming: multithreading, message passing, vectorization, accelerator programming (OpenMP, CUDA, SIMD)
- Parallel algorithms: discretization, sorting, linear algebra; communication-avoiding (CA) algorithms; irregular problems
- Performance programming: latency hiding, managing locality within complicated memory hierarchies, load balancing, efficient data motion

SLIDE 4


Background

CSE 260 will build on your existing background, generalizing programming techniques, algorithm design and analysis

Background

- Graduate standing
- Recommended undergrad background: Computer Architecture & Operating Systems, C/C++ programming
- I will level the playing field for non-CSE students; see me if you are unsure about your background

Your background

- CSME?
- Parallel computation?
- Numerical analysis?

SLIDE 5


Background Markers

C/C++

Java
Fortran?

TLB misses
MPI
RPC
Multithreading
CUDA, GPUs
Abstract base class
Navier Stokes Equations
Sparse factorization

\[ \nabla \cdot u = 0 \]
\[ \frac{D\rho}{Dt} + \rho \left( \nabla \cdot v \right) = 0 \]
\[ f(a) + \frac{f'(a)}{1!}(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots \]


SLIDE 6


Course Requirements

5 assignments

u Pre-survey and Registration: due Sunday @

9pm

u 3 Programming labs

Teams of 2, option to switch teams
Find a partner using the

“looking for a partner” Moodle forum

Includes a lab report, greater emphasis (grading)

with each lab

u 1 in class test toward the end of the course


SLIDE 7


Text and readings

Required texts

u Programming Massively Parallel Processors: A Hands-

n Approach, 2nd Ed., by David Kirk and Wen-mei

Hwu, Morgan Kaufmann Publishers (201w)

u An Introduction to Parallel Programming, by Peter

Pacheco, Morgan Kaufmann (2011)

Assigned class readings will also include on-line

materials

Lecture slides

www. www.cse se.uc ucsd sd.edu/classes/fa15/cse260-a/Lectures /classes/fa15/cse260-a/Lectures


SLIDE 8


Policies

Academic Integrity

u Do you own work u Plagiarism and cheating will not be tolerated

By taking this course, you implicitly agree

to abide by the following the course polices: www.cse.ucsd.edu/classes/fa15/cse260-a/Policies.html

.cse.ucsd.edu/classes/fa15/cse260-a/Policies.html


SLIDE 9

Classroom participation

Class participation is important to keep the lecture active
Consider the slides as talking points; class discussions are driven by your interest
Complete the assigned readings before lecture and be prepared to discuss them in class
Different lecture modalities

- The 2 minute pause
- In-class problem solving

SLIDE 10

The 2 minute pause

Opportunity in class to develop your understanding of the lecture

- By trying to explain to someone else
- Getting your brain actively working on it

What will happen

- I pose a question
- You discuss with 1-2 people around you
  - Most important is your understanding of why the answer is correct
- After most people seem to be done
  - I’ll ask for quiet
  - A few will share what their group talked about
    - Good answers are those where you were wrong, then realized…

SLIDE 11

An Introduction to Parallel Computation

Principles
Technological disruption and its impact
Motivation – applications

SLIDE 12


What is parallel processing?

We decompose a workload onto simultaneously executing physical processing resources to improve some aspect of performance

- Speedup: 100 processors run ×100 faster than one
- Capability: tackle a larger problem, more accurately
- Algorithmic, e.g. search
- Locality: more cache memory and bandwidth

Multiple processors co-operate to process a related set of tasks – tightly coupled
Generally requires some form of communication and/or synchronization to manage the workload distribution
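
The speedup bullet above describes the ideal case. As a minimal sketch, assuming the usual definitions (speedup = serial time / parallel time, efficiency = speedup / number of processors), the following shows how a measured run compares against that ideal. The function names and the timings are illustrative assumptions, not measurements from the course.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-processor running time to parallel running time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_procs):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(t_serial, t_parallel) / num_procs


if __name__ == "__main__":
    # Hypothetical timings in seconds on 100 processors.
    ideal = speedup(100.0, 1.0)        # the "x100 faster" case from the slide
    measured = speedup(100.0, 1.25)    # communication overhead slows the run
    print(ideal, efficiency(100.0, 1.0, 100))
    print(measured, efficiency(100.0, 1.25, 100))
```

In practice the communication and synchronization mentioned above are what usually keep efficiency below 1.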


SLIDE 13

Have you written a parallel program?

Threads
MPI
RPC
C++11 Async
CUDA


SLIDE 14

The difference between Parallel Processing, Concurrency & Distributed Computing

Parallel processing

u Performance (and capacity) is the main goal u More tightly coupled than distributed computation

Concurrency

u Concurrency control: serialize certain computations to ensure

correctness, e.g. database transactions

u Performance need not be the main goal

Distributed computation

u Geographically distributed u Multiple resources computing & communicating unreliably u “Cloud” computing, large amounts of storage, different from

clusters in the cloud

u Looser, coarser grained communication and synchronization

May or may not involve separate physical resources, e.g.

multitasking “Virtual Parallelism”
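
To illustrate the concurrency-control point on this slide (serializing certain computations to ensure correctness), here is a minimal sketch using Python's standard `threading.Lock`. The account dictionary and the function names are hypothetical; the point is only that the read-modify-write on the shared balance runs one thread at a time.

```python
import threading


def deposit_many(account, amount, times, lock):
    """Repeatedly add `amount` to the shared balance."""
    for _ in range(times):
        with lock:  # serialize the read-modify-write on the shared balance
            account["balance"] += amount


def run(num_threads=4, times=10_000):
    """Start several depositing threads and return the final balance."""
    account = {"balance": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=deposit_many, args=(account, 1, times, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account["balance"]


if __name__ == "__main__":
    print(run())
```

The lock guarantees correctness, not speed, which matches the slide's remark that performance need not be the main goal of concurrency.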


SLIDE 15

Granularity

A measure of how often a computation communicates, and at what scale

- Distributed computer: a whole program
- Multicomputer: function, a loop nest
- Multiprocessor: + memory reference
- Multicore: a single-socket implementation of a multiprocessor
- GPU: kernel thread
- Instruction level parallelism: instruction, register

SLIDE 16

An Introduction to Parallel Computation

Principles
Technological disruption and its impact
Motivation – applications

SLIDE 17

Why is parallelism inevitable?

Physical limitations on heat dissipation impede processor clock speed increases
To make the processor faster, we replicate the computational elements

SLIDE 18

Technological trends of scalable HPC systems

Hybrid processors
Complicated software-managed parallel memory hierarchy
Memory/core is shrinking
Communication costs increasing relative to computational rate

[Figure: peak performance in PFLOP/s, 2008 to 2013 (Top500, 2013); trend annotations: 2x/year and 2x per 3-4 years]

SLIDE 19

9/25/15

The age of the multi-core processor

On-chip parallel computer
IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
First dual core laptops (2005-6)
GPUs (nVidia, ATI): desktop supercomputer
In smart phones, behind the dashboard
Everyone has a parallel computer at their fingertips
If we don’t use parallelism, we lose it!

[Image credits: blog.laptopmag.com/nvidia-tegrak1-unveiled, realworldtech.com]

SLIDE 20


The GPU

Specialized many-core processor
Massively multithreaded, long vectors
Reduced on-chip memory per core
Explicitly manage the memory hierarchy

[Figure: GFLOPS by year, 2001 to 2009, for AMD (GPU), NVIDIA (GPU) and Intel (CPU); many-core GPU vs. multicore CPU (dual-core, quad-core). Courtesy: John Owens. Image credit: Christopher Dyken, SINTEF]

SLIDE 21

Performance and Implementation Issues


To cope with growing data motion costs (relative to computation)

- Conserve locality
- Hide latency

Little’s Law [1961]

# threads = performance × latency
\[ T = p \times \lambda \]

p and λ are increasing with time: p = 1-8 flops/cycle, λ = 500 cycles/word

[Figure: processor vs. memory (DRAM) performance by year; diagram labels λ and p-1; image credit: fotosearch.com]
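
As a worked example of Little's Law as stated on this slide, the sketch below plugs in the slide's own figures (p between 1 and 8 flops/cycle, λ = 500 cycles/word). The function name is an illustrative assumption.

```python
def threads_needed(p_flops_per_cycle, latency_cycles):
    """Little's Law as written on the slide: T = p x lambda."""
    return p_flops_per_cycle * latency_cycles


if __name__ == "__main__":
    latency = 500  # cycles/word, from the slide
    for p in (1, 2, 4, 8):  # flops/cycle, the slide's range
        print(f"p = {p}: T = {threads_needed(p, latency)}")
```

With these figures, between 500 and 4000 independent operations must be in flight to hide memory latency, which is why massive multithreading is a natural fit for latency hiding.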


SLIDE 22

Consequences of evolutionary disruption

Transformational: new capabilities for predictive modelling, healthcare… benefits to society
Changes the common wisdom for solving a problem including the implementation
Simplified processor design, but more user control over the hardware resources

SLIDE 23

Today’s mobile computer would have been yesterday’s supercomputer

Cray-1 Supercomputer

80 MHz processor
240 Mflops/sec peak
3.4 Mflops Linpack
8 Megabytes memory
Water cooled
1.8m H x 2.2m W
4 tons
Over $10M in 1976

www.anandtech.com/show/8716/apple-a8xs-gpu-gxa6850-even-better-than-i-thought


SLIDE 24

Exascale computing

What is an exaflop? 1 billion gigaflops! 10^18 flops
The first exascale computer ~ 2020

- 20+ MWatts: 50+ GFlops/Watt

Top 500 #1, Tianhe-2: 1.9 GF/W (17.6 MW)

- Exascale extrapolation: 521 MW
- #49 on the Green500 [does not include 6.2 MW extra]
- #1 on the green list delivers 4.4 GF/W: GSIC Center, Tokyo Inst Tech, TSUBAME-KFC, Xeon E5 and NVIDIA K20x

Exascale adds a significant power barrier
Per capita power consumption in the EU (IEA) in 2009: 0.77 kW [Tianhe-2 ~ 20K]
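
The power figures on this slide follow from simple arithmetic, sketched below with the slide's own inputs. The function names are illustrative assumptions. With the rounded 1.9 GF/W the extrapolation comes out near 526 MW; the slide's 521 MW presumably uses a less rounded efficiency figure. Reading "~ 20K" as a number of per capita equivalents is also an interpretation: 17.6 MW divided by 0.77 kW gives roughly 23K.

```python
EXAFLOP = 1e18  # flop/s


def efficiency_needed(flops, megawatts):
    """GFlops/Watt required to sustain `flops` within a power budget."""
    return flops / (megawatts * 1e6) / 1e9


def power_needed(flops, gflops_per_watt):
    """Megawatts drawn when sustaining `flops` at a given efficiency."""
    return flops / (gflops_per_watt * 1e9) / 1e6


def per_capita_equivalents(megawatts, kw_per_person):
    """How many people's average consumption a machine's power draw equals."""
    return megawatts * 1e3 / kw_per_person


if __name__ == "__main__":
    print(efficiency_needed(EXAFLOP, 20))       # GF/W for a 20 MW exascale machine
    print(power_needed(EXAFLOP, 1.9))           # MW at Tianhe-2's efficiency
    print(power_needed(EXAFLOP, 4.4))           # MW at the Green500 #1 efficiency
    print(per_capita_equivalents(17.6, 0.77))   # Tianhe-2 vs. EU per capita use
```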


SLIDE 25

An Introduction to Parallel Computation

Principles
Technological disruption and its impact
Motivation – applications

SLIDE 26

A Motivating Application - TeraShake

Simulates a magnitude 7.7 earthquake along the southern San Andreas fault near LA using seismic, geophysical, and other data from the Southern California Earthquake Center

epicenter.usc.edu/cmeportal/TeraShake.html

SLIDE 27

How TeraShake Works

Divide up Southern California into blocks
For each block, get all the data about geological structures, fault information, …
Map the blocks onto processors of the supercomputer
Run the simulation using current information on fault activity and on the physics of earthquakes

[Images: SDSC Machine Room; DataCentral@SDSC]
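
The "divide into blocks, map blocks onto processors" steps can be sketched as a regular block decomposition. This is only an illustration under stated assumptions: a 3D grid of cells, a 3D process grid, and row-major rank numbering. The function names and grid sizes are hypothetical and are not taken from the actual TeraShake code.

```python
def block_range(n, parts, i):
    """Half-open [start, stop) range of the i-th of `parts` near-equal chunks of n cells."""
    base, extra = divmod(n, parts)
    start = i * base + min(i, extra)
    stop = start + base + (1 if i < extra else 0)
    return start, stop


def decompose(shape, proc_grid):
    """Map each processor rank to the block of the domain it owns."""
    nx, ny, nz = shape
    px, py, pz = proc_grid
    blocks = {}
    for ix in range(px):
        for iy in range(py):
            for iz in range(pz):
                rank = (ix * py + iy) * pz + iz  # row-major rank numbering
                blocks[rank] = (
                    block_range(nx, px, ix),
                    block_range(ny, py, iy),
                    block_range(nz, pz, iz),
                )
    return blocks


if __name__ == "__main__":
    # Hypothetical 10 x 8 x 4 grid of cells on a 3 x 2 x 1 process grid.
    for rank, block in sorted(decompose((10, 8, 4), (3, 2, 1)).items()):
        print(rank, block)
```

Each processor then loads the data for its own block and exchanges boundary values with neighbouring blocks during the simulation.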

SLIDE 28


Animation

SLIDE 29

Face detection with Viola-Jones algorithm

Searches images for features of a human face
GPU performance competitive with FPGAs, but far lower development cost
Jason Oberg, Daniel Hefenbrock, Tan Nguyen [CSE 260, fa’09 → fccm’10]

[Figure labels: Window, Feature, Image; photo subject: Sonia Sotomayor]

SLIDE 30

The Payoff

Capability

- We solved a problem that we couldn’t solve before, or under conditions that were not possible previously

Performance

- Solve the same problem in less time than before
- This can provide a capability if we are solving many problem instances

The result achieved must justify the effort

- Enable new scientific discovery
- Software costs must be reasonable

SLIDE 31

I increased performance – so what’s the catch?

A well-behaved single-processor algorithm may behave poorly on a parallel computer, and may need to be reformulated numerically
There currently exists no tool that can convert a serial program into an efficient parallel program … for all applications … all of the time … on all hardware
Many performance programming issues
The new code may look dramatically different from the original

SLIDE 32

Application-specific knowledge is important

The more we know about the application…

… specific problem … math/physics … initial data …
… context for analyzing the output …
… the more we can improve performance

Particularly challenging for irregular problems
Parallelism introduces many new tradeoffs

- Redesign the software
- Rethink the problem solving technique

SLIDE 33

Next time

Parallel processors
Optimizing the memory hierarchy for high performance

SLIDE 34

FIN
