SLIDE 1

Lecture 1 CSE 260 – Parallel Computation (Fall 2015) Scott B. Baden Introduction

SLIDE 2

Welcome to CSE 260!

  • Your instructor is Scott Baden
    • baden+260@ucsd.edu
    • Office hours in EBU3B Room 3244: today at 3.30, next week TBA (or make appointment)
  • Your TA is Siddhant Arya
    • sarya@eng.ucsd.edu
  • The class home page is http://www.cse.ucsd.edu/classes/fa15/cse260-a
  • Course resources
    • Piazza (Moodle for grades)
    • Stampede @ TACC: create an XSEDE portal account if you don’t have one
    • Bang @ UCSD


SLIDE 3
What you’ll learn in this class

  • How to solve computationally intensive problems on parallel computers effectively: multicore processors, GPUs, clusters
    • Parallel programming: multithreading, message passing, vectorization, accelerator programming (OpenMP, CUDA, SIMD)
    • Parallel algorithms: discretization, sorting, linear algebra; communication avoiding algorithms (CA); irregular problems
    • Performance programming: latency hiding, managing locality within complicated memory hierarchies, load balancing, efficient data motion


SLIDE 4

Background

  • CSE 260 will build on your existing background, generalizing programming techniques, algorithm design and analysis
  • Background
    • Graduate standing
    • Recommended undergrad background: Computer Architecture & Operating Systems, C/C++ programming
    • I will level the playing field for non-CSE students; see me if you are unsure about your background
  • Your background
    • CSME?
    • Parallel computation?
    • Numerical analysis?


SLIDE 5

Background Markers

  • C/C++, Java, Fortran?
  • TLB misses
  • MPI
  • RPC
  • Multithreading
  • CUDA, GPUs
  • Abstract base class
  • Navier-Stokes equations
  • Sparse factorization
  • $\nabla \cdot \mathbf{u} = 0$, $\quad \dfrac{D\rho}{Dt} + \rho\,\nabla \cdot \mathbf{v} = 0$
  • $f(a) + \dfrac{f'(a)}{1!}(x-a) + \dfrac{f''(a)}{2!}(x-a)^2 + \cdots$


SLIDE 6

Course Requirements

  • 5 assignments
    • Pre-survey and registration: due Sunday @ 9pm
    • 3 programming labs
      • Teams of 2, option to switch teams
      • Find a partner using the “looking for a partner” Moodle forum
      • Includes a lab report, with greater emphasis (in grading) with each lab
    • 1 in-class test toward the end of the course


SLIDE 7

Text and readings

  • Required texts
    • Programming Massively Parallel Processors: A Hands-On Approach, 2nd Ed., by David Kirk and Wen-mei Hwu, Morgan Kaufmann Publishers (2012)
    • An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann (2011)
  • Assigned class readings will also include on-line materials
  • Lecture slides: www.cse.ucsd.edu/classes/fa15/cse260-a/Lectures


SLIDE 8

Policies

  • Academic integrity
    • Do your own work
    • Plagiarism and cheating will not be tolerated
  • By taking this course, you implicitly agree to abide by the course policies: www.cse.ucsd.edu/classes/fa15/cse260-a/Policies.html


SLIDE 9
Classroom participation

  • Class participation is important to keep the lecture active
  • Consider the slides as talking points, class discussions driven by your interest
  • Complete the assigned readings before lecture and be prepared to discuss in class
  • Different lecture modalities
    • The 2 minute pause
    • In class problem solving


SLIDE 10
The 2 minute pause

  • An opportunity in class to develop your understanding of the lecture
    • By trying to explain it to someone else
    • By getting your brain actively working on it
  • What will happen
    • I pose a question
    • You discuss it with 1-2 people around you
      • Most important is your understanding of why the answer is correct
    • After most people seem to be done
      • I’ll ask for quiet
      • A few will share what their group talked about
        • Good answers are those where you were wrong, then realized…


SLIDE 11
An Introduction to Parallel Computation

  • Principles
  • Technological disruption and its impact
  • Motivation – applications


SLIDE 12

What is parallel processing?

  • We decompose a workload onto simultaneously executing physical processing resources to improve some aspect of performance
    • Speedup: 100 processors run ×100 faster than one
    • Capability: tackle a larger problem, more accurately
    • Algorithmic, e.g. search
    • Locality: more cache memory and bandwidth
  • Multiple processors co-operate to process a related set of tasks – tightly coupled
  • Generally requires some form of communication and/or synchronization to manage the workload distribution (a minimal sketch in standard C++ follows below)
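A minimal sketch of this kind of workload decomposition using standard C++ threads (my own illustration under simple assumptions, not course-supplied code): each thread sums one contiguous block of an array, a join synchronizes the workers, and the partial results are then combined.

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 20;
        std::vector<double> data(n, 1.0);

        // Decompose the workload: one contiguous block per hardware thread.
        const unsigned p = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(p, 0.0);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < p; ++t) {
            workers.emplace_back([&, t] {
                const std::size_t lo = t * n / p, hi = (t + 1) * n / p;
                partial[t] = std::accumulate(data.begin() + lo, data.begin() + hi, 0.0);
            });
        }
        for (auto& w : workers) w.join();   // synchronization: wait for all workers

        // Combine the per-thread partial sums serially.
        std::cout << std::accumulate(partial.begin(), partial.end(), 0.0) << "\n";
    }

Compile with, e.g., g++ -std=c++11 -pthread. The speedup intuition from the slide applies directly: with p workers the summation runs roughly p times faster, up to memory-bandwidth limits.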


SLIDE 13

Have you written a parallel program?

  • Threads
  • MPI
  • RPC
  • C++11 Async (a small example follows below)
  • CUDA
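As a reminder of what the C++11 Async item refers to, here is a minimal sketch (my own illustration, not course code): std::async launches work that may run on another thread, and the future collects the result.

    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v(1000);
        std::iota(v.begin(), v.end(), 1);      // 1, 2, ..., 1000

        // Launch the summation asynchronously; the future holds the result.
        std::future<long long> total = std::async(std::launch::async, [&v] {
            return std::accumulate(v.begin(), v.end(), 0LL);
        });

        // ... other work could overlap here ...
        std::cout << total.get() << "\n";      // blocks if needed; prints 500500
    }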


SLIDE 14

The difference between Parallel Processing, Concurrency & Distributed Computing

  • Parallel processing
    • Performance (and capacity) is the main goal
    • More tightly coupled than distributed computation
  • Concurrency
    • Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions (see the mutex sketch below)
    • Performance need not be the main goal
  • Distributed computation
    • Geographically distributed
    • Multiple resources computing & communicating unreliably
    • “Cloud” computing, large amounts of storage, different from clusters in the cloud
    • Looser, coarser grained communication and synchronization
  • May or may not involve separate physical resources, e.g. multitasking “Virtual Parallelism”
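A minimal sketch of concurrency control in standard C++ (illustrative only, not from the slides): a mutex serializes updates to a shared balance so that concurrent “transactions” remain correct, even though performance is not the goal here.

    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        long long balance = 0;
        std::mutex m;                          // serializes access to the shared balance

        auto deposit = [&](int times) {
            for (int i = 0; i < times; ++i) {
                std::lock_guard<std::mutex> lock(m);   // critical section: one update at a time
                balance += 1;
            }
        };

        std::vector<std::thread> clients;
        for (int c = 0; c < 4; ++c) clients.emplace_back(deposit, 10000);
        for (auto& t : clients) t.join();

        std::cout << balance << "\n";          // always 40000: the updates were serialized
    }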


SLIDE 15

Granularity

  • A measure of how often a computation communicates, and at what scale (a common quantitative form appears below)
    • Distributed computer: a whole program
    • Multicomputer: function, a loop nest
    • Multiprocessor: + memory reference
    • Multicore: a single socket implementation of a multiprocessor
    • GPU: kernel thread
    • Instruction level parallelism: instruction, register
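One common way to make this quantitative (my addition; the slide itself only gives the qualitative scale) is the ratio of time spent computing to time spent communicating:

$$ G = \frac{T_{\mathrm{compute}}}{T_{\mathrm{communicate}}} $$

Coarse-grained settings near the top of the list (a distributed computer exchanging whole-program results) correspond to large G; fine-grained settings near the bottom (instruction-level parallelism) correspond to small G.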


SLIDE 16
An Introduction to Parallel Computation

  • Principles
  • Technological disruption and its impact
  • Motivation – applications


SLIDE 17

Why is parallelism inevitable?

  • Physical limitations on heat dissipation impede processor clock speed increases
  • To make the processor faster, we replicate the computational elements


SLIDE 18

Technological trends of scalable HPC systems

  • Hybrid processors
  • Complicated software-managed parallel memory hierarchy
  • Memory/core is shrinking
  • Communication costs increasing relative to computational rate

[Figure: peak performance of Top500 systems (PFLOP/s), 2008–2013; growth of roughly 2x/year versus 2x every 3-4 years. Source: Top500, 2013]


SLIDE 19


The age of the multi-core processor

  • On-chip parallel computer
  • IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
  • First dual core laptops (2005-6)
  • GPUs (nVidia, ATI): desktop supercomputer
  • In smart phones, behind the dashboard (blog.laptopmag.com/nvidia-tegrak1-unveiled)
  • Everyone has a parallel computer at their fingertips
  • If we don’t use parallelism, we lose it!

[Image credit: realworldtech.com]

SLIDE 20


The GPU

  • Specialized many-core processor
  • Massively multithreaded, long vectors
  • Reduced on-chip memory per core
  • Explicitly manage the memory hierarchy

[Image credit: Christopher Dyken, SINTEF]
[Figure: peak GFLOPS vs. year, 2001–2009, for AMD and NVIDIA many-core GPUs versus Intel dual-core and quad-core multicore CPUs. Courtesy: John Owens]

SLIDE 21

Performance and Implementation Issues


  • To cope with growing data motion costs (relative to computation): conserve locality, hide latency
  • Little’s Law [1961]
    • # threads = performance × latency: $T = p \times \lambda$
    • p and λ are increasing with time
    • e.g. p = 1 to 8 flops/cycle, λ = 500 cycles/word (a worked example follows below)
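A worked instance of the slide’s formula (the arithmetic is mine): with p = 8 flops/cycle and λ = 500 cycles, the concurrency needed to hide memory latency is

$$ T = p \times \lambda = 8 \times 500 = 4000 $$

independent operations in flight; even at p = 1 flop/cycle the requirement is still about 500.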

[Figure: processor versus DRAM memory performance over time, illustrating the widening gap; the curves are annotated with latency λ and rate 1/p. Images: fotosearch.com]


SLIDE 22

Consequences of evolutionary disruption

  • Transformational: new capabilities for predictive modelling, healthcare… benefits to society
  • Changes the common wisdom for solving a problem, including the implementation
  • Simplified processor design, but more user control over the hardware resources


SLIDE 23

Today’s mobile computer would have been yesterday’s supercomputer

  • Cray-1 supercomputer
    • 80 MHz processor
    • 240 Mflops/sec peak
    • 3.4 Mflops Linpack
    • 8 Megabytes memory
    • Water cooled
    • 1.8m H x 2.2m W
    • 4 tons
    • Over $10M in 1976

www.anandtech.com/show/8716/apple-a8xs-gpu-gxa6850-even-better-than-i-thought


SLIDE 24

Exascale computing

  • What is an exaflop? 10^18 flops (a million teraflops)
  • The first exascale computer ~ 2020
    • 20+ MWatts : 50+ GFlops/Watt
  • Top 500 #1, Tianhe-2: 1.9 GF/W (17.6 MW)
    • Exascale extrapolation: 521 MW (the arithmetic is spelled out below)
    • #49 on the Green500 [does not include 6.2 MW extra]
    • #1 on the green list delivers 4.4 GF/W: GSIC Center, Tokyo Inst. Tech., TSUBAME-KFC (Xeon E5 and NVIDIA K20x)
  • Exascale adds a significant power barrier
  • Per capita power consumption in the EU (IEA) in 2009: 0.77 kW [Tianhe-2 ~ 20,000 people]
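The extrapolation quoted above works out as follows (my arithmetic, using only the figures on the slide): an exaflop machine built at Tianhe-2’s efficiency would draw

$$ \frac{10^{18}\ \text{flop/s}}{1.9 \times 10^{9}\ \text{flop/s/W}} \approx 5.3 \times 10^{8}\ \text{W} \approx 530\ \text{MW}, $$

close to the 521 MW on the slide, while staying within a 20 MW budget requires $10^{18} / (2 \times 10^{7}) = 5 \times 10^{10}$ flop/s per watt, i.e. the 50 GFlops/Watt target.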


SLIDE 25
An Introduction to Parallel Computation

  • Principles
  • Technological disruption and its impact
  • Motivation – applications


SLIDE 26

A Motivating Application - TeraShake

Simulates a magnitude 7.7 earthquake along the southern San Andreas fault near LA using seismic, geophysical, and other data from the Southern California Earthquake Center.

epicenter.usc.edu/cmeportal/TeraShake.html

SLIDE 27
How TeraShake Works

  • Divide up Southern California into blocks
  • For each block, get all the data about geological structures, fault information, …
  • Map the blocks onto processors of the supercomputer (a toy mapping sketch follows below)
  • Run the simulation using current information on fault activity and on the physics of earthquakes
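A toy illustration of the “map the blocks onto processors” step (entirely my own sketch; the names Nx, Ny, Px, Py are made up, and the real TeraShake decomposition is far more elaborate): each processor in a 2D processor grid owns a contiguous tile of blocks.

    #include <iostream>

    // Toy 2D block decomposition: an Nx-by-Ny grid of simulation blocks is
    // assigned to a Px-by-Py grid of processors, each owning a contiguous tile.
    int owner(int bi, int bj, int Nx, int Ny, int Px, int Py) {
        const int pi = bi * Px / Nx;   // processor row that owns block row bi
        const int pj = bj * Py / Ny;   // processor column that owns block column bj
        return pi * Py + pj;           // linear processor rank
    }

    int main() {
        const int Nx = 8, Ny = 8, Px = 2, Py = 2;
        for (int i = 0; i < Nx; ++i) {
            for (int j = 0; j < Ny; ++j)
                std::cout << owner(i, j, Nx, Ny, Px, Py) << ' ';
            std::cout << '\n';
        }
    }

Printed as an 8×8 map, the output shows four contiguous quadrants of blocks, one per processor rank 0–3.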


[Images: SDSC machine room; DataCentral@SDSC]

SLIDE 28


Animation

SLIDE 29

Face detection with Viola-Jones algorithm

  • Searches images for features of a human face
  • GPU performance competitive with FPGAs, but far lower development cost
  • Jason Oberg, Daniel Hefenbrock, Tan Nguyen [CSE 260, fa’09 → fccm’10]

[Figure: detection window, feature, and input image (photo of Sonia Sotomayor)]

SLIDE 30
The Payoff

  • Capability
    • We solved a problem that we couldn’t solve before, or under conditions that were not possible previously
  • Performance
    • Solve the same problem in less time than before
    • This can provide a capability if we are solving many problem instances
  • The result achieved must justify the effort
    • Enable new scientific discovery
    • Software costs must be reasonable


SLIDE 31

I increased performance – so what’s the catch?

  • A well behaved single processor algorithm may behave poorly on a parallel computer, and may need to be reformulated numerically
  • There currently exists no tool that can convert a serial program into an efficient parallel program… for all applications… all of the time… on all hardware
  • Many performance programming issues
    • The new code may look dramatically different from the original


SLIDE 32

Application-specific knowledge is important

  • The more we know about the application… the specific problem… the math/physics… the initial data… the context for analyzing the output… the more we can improve performance
  • Particularly challenging for irregular problems
  • Parallelism introduces many new tradeoffs
    • Redesign the software
    • Rethink the problem solving technique

SLIDE 33

Next time

  • Parallel processors
  • Optimizing the memory hierarchy for high performance


SLIDE 34

FIN
