 
Lecture 1: Introduction
Mikael Rännar mr@cs.umu.se
Jerry Eriksson jerry@cs.umu.se

Course Material
Course material: www.cs.umu.se/kurser/5DV011/VT12
assignments, schedule, hand-outs, etc.

Content
• Motivate and define parallel computations
• Design of parallel algorithms
• Overview of different classes of parallel systems
• Overview of different programming concepts
• Historic and current parallel systems
• Applications demanding HPC
  – Research within this area at the department

Goal
The goal of the course is to give basic knowledge about
- parallel computer hardware architectures
- design of parallel algorithms
- parallel programming paradigms and languages
- compiler techniques for automatic parallelization and vectorization
- areas of application in parallel computing
This includes knowledge about central ideas and classification systems, machines with shared and distributed memory, data and functional parallelism, parallel programming languages, scheduling algorithms, analyses of dependencies, and different tools supporting development of parallel programs.

Course evaluation, spring term 2011 (vt-11)
• Assignment 2 too difficult
• Look for a new book

Scientific Computing 87 vs 2K9
• 1987
  – Minisupercomputers (1-20 Mflop/s): Alliant, Convex, DEC
  – Parallel vector processors (PVP) (20-2000 Mflop/s)
• 2002
  – PCs (lots of 'em)
  – RISC workstations (500-4000 Mflop/s): DEC, HP, IBM, SGI, Sun
  – RISC-based symmetric multiprocessors (10-400 Gflop/s): IBM, Sun, SGI
  – Parallel vector processors (10-36000! Gflop/s): Fujitsu, Hitachi, NEC
  – Highly parallel processors (1-10000 Gflop/s): HP, IBM, NEC, Fujitsu, Hitachi
  – Earth Simulator, 5120 vector CPUs, 36 teraflop
• 2004: IBM's Blue Gene Project (65k CPU), 136 teraflop
• 2005/6/7: IBM's Blue Gene Project (128k CPU; 208k in 2007), 480 teraflop
• 2008: IBM's Roadrunner, Cell, 1.1 petaflop
• 2009: Cray XT5 (224162 cores), 1.75 petaflop
• 2010: Tianhe-1A, 2.57 petaflop, NVIDIA GPU
• 2011: Fujitsu K computer, SPARC64 (705024 cores), 10.5 petaflop

[Pictures: Blue Gene (LLNL), Roadrunner (LANL)]
[Pictures: Jaguar (Oak Ridge NL), K computer]

History at the department/HPC2N
• 1986: IBM 3090VF600
  – Shared memory, 6 processors with vector unit
• 1987: Intel iPSC/2: 32-128 nodes
  – Distributed memory MIMD, hypercube with 64 nodes (i386 + 4M per node)
  – 16 nodes with a vector board
• 199X: Alliant FX2800
  – Shared memory machine MIMD, 17 i860 processors
• 1996: IBM SP
  – 64 Thin nodes, 2 High nodes with 4 processors each
• 1997: SGI Onyx2
  – 10 MIPS R10000
• 1998: 2-way POWER3
• 1999: Small Linux cluster
• 2001: Better POWER3
• 2002: Large Linux cluster, Seth (120 dual Athlon processors), Wolfkit SCI
• 2003: SweGrid Linux cluster, Ingrid, 100 nodes with Pentium4
• 2004: 384 CPU cluster (Opteron), Sarek, 1.7 Tflops peak, 79% HP-Linpack
• 2008: Linux cluster Akka, 5376 cores, 10.7 TB RAM, 46 teraflop HP-Linpack, ranked 39 on Top 500 (June 2008)
• 2012: Linux cluster Abisko, 15264 cores (318 nodes with 4 AMD 12-core Interlagos)

Scientific applications (Research at the department)
• BLAS/LAPACK
  – BLAS-2, matrix-vector operations
  – BLAS-3, matrix-matrix operations
  – LAPACK
    • Linear algebra + eigenvalue problems
  – ScaLAPACK
• Nonlinear optimization
  – Neural networks
• Development environments
  – CONLAB/CONLAB-compiler
• Functional languages

The Demand for Speed!
• Grand Challenge Problems
• Weather prediction
• Simulations of different kinds
• Deep Blue
• Data analyses
• Cryptography

Example of applications
• Global atmospheric circulation
  – Differential equations (over time)
  – Discretization on a lattice
• Earthquakes

Technical applications
• VLSI design
  – Simulation: different gates on one level can be tested in parallel, as they act independently
  – Placement: move blocks randomly to minimize an objective function, e.g. cable length
  – Cable drawing
• Design
  – Simulate flows around objects like cars, aeroplanes, boats
  – Structural strength (hållfasthet) computations
  – Heat distribution
• etc.

More Applications
• Simulate atom bombs (ASCI)
• Scientific visualization
  – Show large data sets graphically
• Signal and Image Analysis
• Reservoir modeling
  – Oil in Norway, for example
• Remote analysis of e.g. the Earth
  – Satellite data: adaptation, analysis, cataloguing
• Movies and commercials
  – Star Wars etc.
• Searching on the Internet

Parallel computations!
A collection of processors that communicate and cooperate to solve a large problem fast.
[Figure: processors connected by a communication medium]

Motive & Goal
• Manufacturing
  – Physical laws limit the speed of the processors
  – Moore's law
  – Price/Performance
    • Cheaper to take many cheap and relatively fast processors than to develop one super fast processor
    • Possible to use fewer kinds of circuits but use more of them
• Use
  – Decrease wall clock time
  – Solve bigger problems

Why we're building parallel systems
• Up to now, performance increases have been attributable to increasing density of transistors.
• But there are inherent problems.

A little physics lesson
• Smaller transistors = faster processors.
• Faster processors = increased power consumption.
• Increased power consumption = increased heat.
• Increased heat = unreliable processors.

Solution
• Move away from single-core systems to multicore processors.
• "core" = central processing unit (CPU)
• Introducing parallelism!!!

Why we need to write parallel programs
• Running multiple instances of a serial program often isn't very useful.
• Think of running multiple instances of your favorite game.
• What you really want is for it to run faster.

Approaches to the serial problem
• Rewrite serial programs so that they're parallel.
• Write translation programs that automatically convert serial programs into parallel programs.
  – This is very difficult to do.
  – Success has been limited.

More problems
• Some coding constructs can be recognized by an automatic program generator, and converted to a parallel construct.
• However, it's likely that the result will be a very inefficient program.
• Sometimes the best parallel solution is to step back and devise an entirely new algorithm.

Can all problems be solved in parallel?
• Dig a hole in the ground: can be parallelized? Yes / No
• Dig a ditch: can be parallelized? Yes / No
• Data dependency: can you put a brick anywhere, anytime? Yes / No
[On the slide, one box per question is marked with an x.]

Design of parallel programs
• Data Partitioning
  – Distribute data on the different processors
• Granularity
  – Size of the parallel parts
• Load Balancing
  – Make all processors have the same load
• Synchronization
  – Cooperate to produce the result

Load Balancing
Goal: All processors should do the same amount of work.
Look at the mapping example under Load Balancing below.

Parallel program design, example
Game-of-life on a 2D net (see W-A page 190)
• max 4 processors: coarse-grained, small amount of communication
• max 16 processors: fine-grained, a lot of communication
Communication time = α + β k
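
The granularity trade-off in the Game-of-life example can be illustrated with the linear communication model above. This is only a sketch under stated assumptions: k is taken to be the number of cell values in one message, α a fixed start-up cost and β a per-value cost (the slide does not define them), the grid is n x n with periodic boundaries so every block has four edge neighbours, and corner exchanges are ignored. The function names and the parameter values are hypothetical.

```python
import math

def comm_time(k, alpha, beta):
    """Linear cost model: time to send one message of k values."""
    return alpha + beta * k

def halo_cost(n, p, alpha, beta):
    """Communication time per step for one block when an n x n grid is
    split into sqrt(p) x sqrt(p) square blocks (assumption: p is a
    perfect square and sqrt(p) divides n). Each block sends one edge
    of n / sqrt(p) cells to each of its four neighbours."""
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("p must be a perfect square whose root divides n")
    edge = n // q
    return 4 * comm_time(edge, alpha, beta)

def cells_per_block(n, p):
    """Cells updated by each processor per step (the computation part)."""
    return (n * n) // p

if __name__ == "__main__":
    n, alpha, beta = 16, 10.0, 1.0  # hypothetical values
    for p in (4, 16):
        comm = halo_cost(n, p, alpha, beta)
        work = cells_per_block(n, p)
        print(p, "processors:", comm, "comm units per block,",
              work, "cells per block, ratio", comm / work)
```

With these made-up numbers, going from 4 to 16 processors shrinks the work per block faster than the communication per block, so the communication-to-computation ratio grows. That is one way to read "fine-grained, a lot of communication".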

Load Balancing
[Figure: a grid mapped onto processors 0-3 in three ways, with the resulting count (Nr) per processor]

Mapping         Proc. 0   Proc. 1   Proc. 2   Proc. 3
Row block         13        22        10         3
Column block       4        13        19        12
Block-cyclic      11        12        12        14

Flynn's Taxonomy

                                    Number of Data Streams
                                    Single                Multiple
Number of Instruction   Single      SISD (von Neumann)    SIMD (vector, array)
Streams                 Multiple    MISD (?)              MIMD (multiple micros)

• Flynn's taxonomy does not describe more recent features such as
  – Pipelining (MISD?)
  – Memory model
  – Interconnection network

Paradigms
A model of the world that is used to formulate a computer solution to a problem.

Synchronous paradigms
• Vector/Array
  – Each processor is allotted a very small operation
• Pipeline parallelism
  – Good when operations can be broken down into fine-grained steps
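
Returning to the three Load Balancing mappings above, the sketch below shows how each mapping assigns grid cells to processors and how the per-processor count can differ. It assumes that Nr counts work items per processor (for instance live Game-of-life cells), that the grid is n x n, that p divides n, and that p is a perfect square for the block-cyclic case. The function names, the grid size and the sample cells are hypothetical and do not reproduce the slide's numbers.

```python
import math
from collections import Counter

def row_block(i, j, n, p):
    """Owner of cell (i, j) when each processor gets n/p contiguous rows."""
    return i // (n // p)

def column_block(i, j, n, p):
    """Owner of cell (i, j) when each processor gets n/p contiguous columns."""
    return j // (n // p)

def block_cyclic(i, j, n, p, b=1):
    """Owner of cell (i, j) when b x b blocks are dealt out cyclically
    over a sqrt(p) x sqrt(p) processor grid."""
    q = math.isqrt(p)
    return ((i // b) % q) * q + ((j // b) % q)

def loads(cells, mapping, n, p):
    """Number of work items each processor receives under a mapping."""
    counts = Counter(mapping(i, j, n, p) for i, j in cells)
    return [counts.get(rank, 0) for rank in range(p)]

if __name__ == "__main__":
    n, p = 8, 4
    # Hypothetical workload: all activity clustered in the upper-left 4 x 4 corner.
    cells = [(i, j) for i in range(4) for j in range(4)]
    for mapping in (row_block, column_block, block_cyclic):
        print(mapping.__name__, loads(cells, mapping, n, p))
```

For this clustered workload the two block mappings leave two processors idle, while the block-cyclic mapping spreads the cells evenly. This matches the general pattern in the table, where the block-cyclic counts are the most even.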