
Lecture 1: Introduction

Mikael Rännar

mr@cs.umu.se

Jerry Eriksson

jerry@cs.umu.se

Course Material

Course material:

www.cs.umu.se/kurser/5DV011/VT12

assignments, schedule, hand-outs, etc

Content

  • Motivate and define parallel computations
  • Design of parallel algorithms
  • Overview of different classes of parallel systems
  • Overview of different programming concepts
  • Historic and current parallel systems
  • Applications demanding HPC
– Research within this area at the department

Goal

The goal of the course is to give basic knowledge about

  • parallel computer hardware architectures
  • design of parallel algorithms
  • parallel programming paradigms and languages
  • compiler techniques for automatic parallelization and vectorization
  • areas of application in parallel computing

This includes knowledge about central ideas and classification systems, machines with shared and distributed memory, data and functional parallelism, parallel programming languages, scheduling algorithms, dependence analysis, and different tools supporting the development of parallel programs.

slide-2
SLIDE 2

Course evaluation VT-11

  • Assignment 2 too difficult
  • Look for a new book

Scientific Computing: 1987 vs. the 2000s

  • 1987
– Minisupercomputers (1-20 Mflop/s): Alliant, Convex, DEC
– Parallel vector processors (PVP) (20-2000 Mflop/s)

  • 2002: PCs (lots of them)
– RISC workstations (500-4000 Mflop/s): DEC, HP, IBM, SGI, Sun
– RISC-based symmetric multiprocessors (10-400 Gflop/s): IBM, Sun, SGI
– Parallel vector processors (10-36000! Gflop/s): Fujitsu, Hitachi, NEC
– Highly parallel processors (1-10000 Gflop/s): HP, IBM, NEC, Fujitsu, Hitachi
– Earth Simulator: 5120 vector CPUs, 36 teraflop

  • 2004: IBM's Blue Gene Project (65k CPUs), 136 teraflop
  • 2005-2007: IBM's Blue Gene Project (128k CPUs; 208k in 2007), 480 teraflop
  • 2008: IBM's Roadrunner, Cell, 1.1 petaflop
  • 2009: Cray XT5 (224162 cores), 1.75 petaflop
  • 2010: Tianhe-1A, 2.57 petaflop, NVIDIA GPUs
  • 2011: Fujitsu K computer, SPARC64 (705024 cores), 10.5 petaflop

[Images: Blue Gene (LLNL), Roadrunner (LANL), Jaguar (Oak Ridge NL), K computer]

History at the department/HPC2N

  • 1986: IBM 3090VF600

– Shared memory, 6 processors with vector unit

  • 1987: Intel iPSC/2: 32-128 nodes

– Distributed-memory MIMD, hypercube with 64 nodes (i386 + 4 MB per node)
– 16 nodes with a vector board

  • 199X: Alliant FX2800

– Shared-memory MIMD machine, 17 i860 processors

  • 1996: IBM SP

– 64 Thin nodes, 2 High nodes with 4 processors each

  • 1997: SGI Onyx2

– 10 MIPS R10000

  • 1998: 2-way POWER3
  • 1999: Small Linux cluster
  • 2001: Better POWER3
  • 2002: Large Linux cluster, Seth (120 dual Athlon processors), Wolfkit SCI
  • 2003: SweGrid Linux cluster, Ingrid, 100 nodes with Pentium4
  • 2004: 384-CPU cluster (Opteron) Sarek, 1.7 Tflops peak, 79% HP-Linpack
  • 2008: Linux cluster Akka, 5376 cores, 10.7 TB RAM, 46 Teraflop HP-Linpack, ranked 39 on the Top 500 (June 2008)

  • 2012: Linux cluster Abisko, 15264 cores (318 nodes, each with 4 AMD 12-core Interlagos)


Scientific applications

(Research at the department)

  • BLAS/LAPACK

– BLAS-2: matrix-vector operations
– BLAS-3: matrix-matrix operations
– LAPACK

  • Linear algebra + eigenvalue problems

– ScaLAPACK

  • Nonlinear optimization

– Neural networks

  • Development environments

– CONLAB/CONLAB-compiler

  • Functional languages

The Demand for Speed!

  • Grand Challenge Problems
  • Simulations of different kind
  • Deep Blue
  • Data analyses
  • Cryptography

Example of applications

  • Global atmospheric circulation
  • Weather prediction

– Differential equations (over time)
– Discretization on a lattice (see the stencil sketch after this list)

  • Earthquakes
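As a minimal illustration of discretizing a differential equation on a lattice (a generic 1D heat-equation sketch in C, not the course's weather code), each lattice point is updated from its neighbours at every time step, so once boundary values are shared the points can be updated in parallel:

    #include <stdio.h>

    #define N 100                           /* lattice points */

    int main(void)
    {
        double u[N] = {0}, unew[N] = {0};
        u[N / 2] = 1.0;                     /* initial heat spike */
        double r = 0.25;                    /* dt/dx^2 diffusion factor, <= 0.5 for stability */

        for (int step = 0; step < 1000; step++) {
            /* Explicit finite-difference update: each point depends only on
             * its neighbours' old values, so the lattice can be split among
             * processors that exchange boundary points each step. */
            for (int i = 1; i < N - 1; i++)
                unew[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
            for (int i = 1; i < N - 1; i++)
                u[i] = unew[i];
        }
        printf("u[N/2] = %f\n", u[N / 2]);
        return 0;
    }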

Technical applications

  • VLSI-design

– Simulation: different gates on one level can be tested in parallel, as they act independently
– Placement: move blocks randomly to minimize an objective function, e.g. cable length
– Cable routing

  • Design

– Simulate flows around objects like cars, aeroplanes, boats
– Structural strength (hållfasthet) computations
– Heat distribution

More Applications

  • Simulate atom bombs (ASCI)
  • Scientific visualization

– Show large data sets graphically

  • Signal and Image Analysis
  • Reservoir modeling

– Oil in Norway for example

  • Remote analysis of, e.g., the Earth

– Satellite data: adaptation, analysis, cataloguing

  • Movies and commercials

– Star Wars etc

  • Searching on the Internet
  • etc, etc, etc, etc ....

Parallel computations!

A collection of processors that communicate and cooperate to solve a large problem fast.

[Figure: processors connected through a communication medium]

Motive & Goal

  • Manufacturing
– Physical laws limit the speed of processors
– Moore's law
– Price/performance: cheaper to take many cheap and relatively fast processors than to develop one super-fast processor
– Possible to use fewer kinds of circuits, but more of them
  • Use
– Decrease wall-clock time
– Solve bigger problems

Why we’re building parallel systems

Up to now, performance increases have been attributable to the increasing density of transistors. But there are inherent problems.

A little physics lesson

Smaller transistors = faster processors. Faster processors = increased power consumption. Increased power consumption = increased heat. Increased heat = unreliable processors.


Solution

Move away from single-core systems to multicore processors.

"core" = central processing unit (CPU)

Introducing parallelism!!!

Why we need to write parallel programs

Running multiple instances of a serial program often isn't very useful. Think of running multiple instances of your favorite game. What you really want is for it to run faster.

Approaches to the serial problem

Rewrite serial programs so that they're parallel.

Write translation programs that automatically convert serial programs into parallel programs. This is very difficult to do; success has been limited.

More problems

Some coding constructs can be recognized by an automatic program generator and converted to a parallel construct. However, it's likely that the result will be a very inefficient program.

Sometimes the best parallel solution is to step back and devise an entirely new algorithm.


Can all problems be solved in parallel?

  • Data dependency: can you put a brick anywhere, anytime? Yes/No
  • Dig a ditch: can it be parallelized? Yes/No
  • Dig a hole in the ground: can it be parallelized? Yes/No

Design of parallel programs

  • Data Partitioning

– distribute data on the different processors

  • Granularity

– size of the parallel parts

  • Load Balancing

– Make all processors have the same load

  • Synchronization

– Cooperate to produce the result

Parallel program design, example

Game-of-Life on a 2D grid (see W-A page 190)

  • Coarse-grained partitioning: max 4 processors, small amount of communication
  • Fine-grained partitioning: max 16 processors, a lot of communication

Communication time = α + βk, where α is the start-up latency and β the per-word cost for a k-word message (see the sketch below).
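As a rough illustration of granularity, here is a minimal serial sketch (in C) of one Game-of-Life step over one processor's block of the grid; the block bounds and the halo border are illustrative assumptions, not the book's code. In a parallel version each processor updates only its own block and exchanges the one-cell-wide border with its neighbours, so each border message costs roughly α + βk:

    /* One Game-of-Life step over the block [r0,r1) x [c0,c1) of an
     * (n+2) x (n+2) grid whose outer ring is a halo border filled by
     * communication with neighbouring processors. */
    void life_step_block(int n, unsigned char old[n + 2][n + 2],
                         unsigned char new_[n + 2][n + 2],
                         int r0, int r1, int c0, int c1)
    {
        for (int i = r0; i < r1; i++) {
            for (int j = c0; j < c1; j++) {
                int live = 0;               /* count the 8 neighbours */
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di || dj)
                            live += old[i + di][j + dj];
                /* standard rules: survive with 2-3 neighbours, birth with 3 */
                new_[i][j] = (old[i][j] && (live == 2 || live == 3)) ||
                             (!old[i][j] && live == 3);
            }
        }
    }

With 4 large blocks each processor does much work per border exchanged (coarse-grained); with 16 small blocks the work per block shrinks while the number of border messages grows (fine-grained).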

Load Balancing

Goal: All processors should do the same amount of work

Look at the following example:

slide-8
SLIDE 8

Load Balancing

[Figure: an irregular region of grid points mapped onto four processors in three ways; the counts give points per processor.]

  • Row block mapping: 13, 22, 10, 3 points per processor
  • Column block mapping: 4, 13, 19, 12 points per processor
  • Block-cyclic mapping: 11, 12, 12, 14 points per processor (the most even; see the mapping sketch below)
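A minimal sketch (in C) of how such mappings are typically computed; the function names are illustrative, not from the course code. Block mapping gives each processor a contiguous chunk of rows, while cyclic mapping deals rows out round-robin, which tends to even out irregular work:

    #include <stdio.h>

    /* Owner of row i under a block mapping (assumes p <= n):
     * the first n%p processors get ceil(n/p) rows, the rest floor(n/p). */
    int block_owner(int i, int n, int p)
    {
        int big = n / p + 1, r = n % p;     /* r processors own 'big' rows */
        return (i < r * big) ? i / big : r + (i - r * big) / (n / p);
    }

    /* Owner of row i under a cyclic mapping: rows dealt round-robin. */
    int cyclic_owner(int i, int p) { return i % p; }

    int main(void)
    {
        int n = 8, p = 3;
        for (int i = 0; i < n; i++)
            printf("row %d: block -> P%d, cyclic -> P%d\n",
                   i, block_owner(i, n, p), cyclic_owner(i, p));
        return 0;
    }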

Flynn's Taxonomy

  • Flynn does not describe modernities like
– Pipelining (MISD?)
– Memory model
– Interconnection network

Number of instruction streams vs. number of data streams:

                          Single data           Multiple data
  Single instruction      SISD (von Neumann)    SIMD (vector, array)
  Multiple instructions   MISD (?)              MIMD (multiple micros)

Paradigms

A model of the world that is used to formulate a computer solution to a problem

Synchronous paradigms

Vector/Array

  • Each processor is allotted a very small operation
  • Pipeline parallelism
  • Good when operations can be broken down into fine-grained steps


Synchronous paradigms SIMD

  • Data parallel!
  • All processors do the same thing at the same time, or are idle

  • Phase 1:

– Data partitioning and distribution

  • Phase 2:

– Data parallel work

  • Good for large regular data structures

Asynchronous paradigms MIMD

  • The processors work independently of each other
  • Must be synchronized

– Message passing
– Mutual exclusion (locks); see the lock sketch after this list

  • Best for coarse-grained problems
  • Shared memory

– Virtually and physically shared
– UMA, NUMA, COMA, CC-NUMA

  • Distributed memory

– Highly parallel systems, NOWs (networks of workstations), COWs (clusters of workstations)
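A minimal sketch of mutual exclusion with locks in shared memory, using POSIX threads (an assumption; the course may use other primitives). Without the mutex, the two threads' increments of the shared counter could interleave and lose updates:

    /* compile with: cc -pthread lock.c */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                        /* shared state */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);              /* enter critical section */
            counter++;                              /* safe: one thread at a time */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);         /* 2000000 with the lock */
        return 0;
    }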

Shared Memory Architectures

  • All processors have access to a global address space
– UMA, NUMA
  • Access to the shared memory can be by a bus or a switched network
  • The hardware does not scale well to massively parallel levels

[Figure: processors P ... P connected to a shared Memory through a memory bus/switching network]

Distributed Memory Architectures

  • Each node has its own local memory (no shared address space)
  • The processors communicate with each other over a network by using messages (see the message-passing sketch after this slide)
  • The network topology can be static or dynamic
  • The hardware scales well; programming is more difficult than with shared memory
  • Computations are much faster than communication

[Figure: nodes, each a processor p with local memory m, connected by a network.]

Topologies: mesh, ring, linear array, 2D-torus, 3D-mesh, 3D-torus, tree, fat tree, hypercube, star, Vulcan switch, cube-connected cycles, omega, crossbar, etc.
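A minimal sketch of message passing between two nodes, using MPI (an assumption here; MPI appears later among the programming approaches). Rank 0 sends an array to rank 1, which matches the α + βk communication-time model for a k-word message:

    /* run with at least 2 processes: mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[1000];                           /* k = 1000 words */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 1000; i++) buf[i] = i;
            MPI_Send(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received 1000 words, last = %.0f\n", buf[999]);
        }

        MPI_Finalize();
        return 0;
    }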


SPMD, Single Program Multiple Data

– Asynchronous data parallel processing
– Software equivalent to SIMD
– Execute the same program, but on different data, asynchronously (see the sketch below)
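A minimal SPMD sketch (MPI assumed, as above): every process runs this same program, and the rank decides which slice of the data each one works on:

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Same program everywhere; the rank selects this process's slice. */
        int lo = (int)((long long)N * rank / size);
        int hi = (int)((long long)N * (rank + 1) / size);

        double local = 0.0;
        for (int i = lo; i < hi; i++)
            local += (double)i;                     /* work on own slice only */

        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }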

Control vs Data parallelism

  • Control parallelism (instruction parallelism)

– use parallelism in the control structures of a program
– independent parts of a program execute in parallel

  • Data parallelism

– one processor per data element (block of data)
– each processor needs separate data memory
– millions of processors can be applied to large problems

A sketch contrasting the two kinds of parallelism follows below.
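A minimal sketch using OpenMP (an assumption; OpenMP appears below among the language extensions). The parallel for divides one operation across the data elements, while the sections run independent program parts concurrently:

    /* compile with: cc -fopenmp parallelism.c */
    #include <stdio.h>

    #define N 1000

    void task_a(void) { printf("task A\n"); }       /* independent parts */
    void task_b(void) { printf("task B\n"); }

    int main(void)
    {
        double a[N];

        /* Data parallelism: the same operation on different elements. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        /* Control parallelism: independent parts run concurrently. */
        #pragma omp parallel sections
        {
            #pragma omp section
            task_a();
            #pragma omp section
            task_b();
        }

        printf("a[N-1] = %.1f\n", a[N - 1]);
        return 0;
    }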

Parallel Programming – Implicitly

  • Old Fortran, C, ...

– Lots of dependencies between different parts of the program
– The compiler must find all dependencies (see the example below)
– The compiler restructures the program to expose more parallelism
– Advantage: backwards compatible with existing programs

  • New languages and extensions give more parallelism

– Fortran 90
– HPF
– OpenMP
– MPI
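A minimal sketch (with hypothetical loops) of what dependence analysis must check. The first loop has no loop-carried dependence, so an auto-parallelizer can run its iterations concurrently; the second carries a dependence from iteration i-1 to i and cannot be parallelized as written (a prefix sum needs a different algorithm, echoing the earlier point about devising new algorithms):

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], s[N];
        for (int i = 0; i < N; i++) a[i] = i;
        s[0] = a[0];

        /* Independent iterations: safe to parallelize automatically. */
        for (int i = 0; i < N; i++)
            a[i] = a[i] * 2.0 + 1.0;

        /* Loop-carried dependence: s[i] needs s[i-1], so these
         * iterations must run in order. */
        for (int i = 1; i < N; i++)
            s[i] = s[i - 1] + a[i];

        printf("a[N-1] = %.1f, s[N-1] = %.1f\n", a[N - 1], s[N - 1]);
        return 0;
    }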
