CS 140: Models of parallel programming: Distributed memory and MPI
SLIDE 1
SLIDE 2
Technology Trends: Microprocessor Capacity
Moore’s Law: # transistors / chip doubles every 1.5 years Microprocessors keep getting smaller, denser, and more powerful. Gordon Moore (Intel co-founder) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
SLIDE 3
Trends in processor clock speed
Triton's clock speed is still only 2600 MHz in 2015!
SLIDE 4
4-core Intel Sandy Bridge
(Triton uses an 8-core version)
2600 MHz clock speed
SLIDE 5
Generic Parallel Machine Architecture
- Key architecture question: Where and how fast are the interconnects?
- Key algorithm question: Where is the data?
[Figure: several processors, each with its own storage hierarchy (Proc, cache, L2 cache, L3 cache, memory), joined by potential interconnects at several levels]
SLIDE 6
Triton memory hierarchy: I (Chip level)
[Figure: eight processor cores, each with a private L1 and L2 cache, sharing one 8 MB L3 cache on the chip]
(AMD Opteron 8-core Magny-Cours, similar to Triton's Intel Sandy Bridge)
Chip sits in socket, connected to the rest of the node ...
SLIDE 7
Triton memory hierarchy II (Node level)
[Figure: one node with 64 GB of shared memory and two chips; each chip has eight processor cores with private L1/L2 caches sharing a 20 MB L3 cache]
<- Infiniband interconnect to other nodes ->
SLIDE 8
Triton memory hierarchy III (System level)
[Figure: rows of nodes, each with 64 GB of memory]
324 nodes, message-passing communication, no shared memory
SLIDE 9
Some models of parallel computation
Computational model              Languages
- Shared memory                  Cilk, OpenMP, Pthreads, ...
- SPMD / Message passing         MPI
- SIMD / Data parallel           Cuda, Matlab, OpenCL, ...
- PGAS / Partitioned global      UPC, CAF, Titanium
- Loosely coupled                Map/Reduce, Hadoop, ...
- Hybrids ...                    ???
SLIDE 10
Parallel programming languages
- Many have been invented; there is *much* less consensus on what the best languages are than in the sequential world.
- We could have a whole course on them; we'll look at just a few.
Languages you’ll use in homework:
- C with MPI (very widely used, very old-fashioned)
- Cilk Plus (a newer upstart)
- You will choose a language for the final project
SLIDE 11
Generic Parallel Machine Architecture
- Key architecture question: Where and how fast are the interconnects?
- Key algorithm question: Where is the data?
[Figure: several processors, each with its own storage hierarchy (Proc, cache, L2 cache, L3 cache, memory), joined by potential interconnects at several levels]
SLIDE 12
Message-passing programming model
- Architecture: Each processor has its own memory and cache but cannot directly access another processor's memory.
- Language: MPI (“Message-Passing Interface”)
- A least common denominator based on 1980s technology
- Links to documentation on course home page
- SPMD = “Single Program, Multiple Data”
[Figure: processors P0, P1, ..., Pn, each with its own memory and network interface (NI), connected by an interconnect]
SLIDE 13
Hello, world in MPI
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Hello world from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }
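To try it yourself (not on the slide; the wrapper and launcher names vary by MPI installation, and Triton jobs normally go through the batch system), the usual pattern is to compile with the MPI compiler wrapper and launch with mpirun or mpiexec:

    mpicc hello.c -o hello
    mpirun -np 4 ./hello      # or: mpiexec -n 4 ./hello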
SLIDE 14
MPI in nine routines (all you really need)
MPI_Init         Initialize
MPI_Finalize     Finalize
MPI_Comm_size    How many processes?
MPI_Comm_rank    Which process am I?
MPI_Wtime        Timer
MPI_Send         Send data to one proc
MPI_Recv         Receive data from one proc
MPI_Bcast        Broadcast data to all procs
MPI_Reduce       Combine data from all procs
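As a quick sketch (my own example, not from the slides) that touches most of these routines: rank 0 broadcasts a value to everyone, all ranks are combined with a sum reduction, and MPI_Wtime brackets the work.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, size, n = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double t0 = MPI_Wtime();                         /* start timer */
        if (rank == 0) n = 100;                          /* only rank 0 knows the value */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* now every rank does */

        int sum;                                         /* sum of all ranks, at rank 0 */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("n = %d, sum of ranks = %d, time = %g s\n", n, sum, t1 - t0);
        MPI_Finalize();
        return 0;
    }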
SLIDE 15
Ten more MPI routines (sometimes useful)
More collective ops (like Bcast and Reduce):
  MPI_Alltoall, MPI_Alltoallv
  MPI_Scatter, MPI_Gather
Non-blocking send and receive:
  MPI_Isend, MPI_Irecv
  MPI_Wait, MPI_Test, MPI_Probe, MPI_Iprobe
Synchronization:
  MPI_Barrier
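For example (a sketch with my own naming, not from the slides), a non-blocking exchange posts the receive and the send, can do useful work while the messages are in flight, and only then waits on both requests:

    #include "mpi.h"

    /* Hypothetical helper: exchange one int with a partner rank without
       blocking until the final waits. */
    void exchange(int partner, int *sendval, int *recvval)
    {
        MPI_Request reqs[2];
        MPI_Irecv(recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);
        /* ... computation could overlap with communication here ... */
        MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
        MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);
    }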
SLIDE 16
Example: Send an integer x from proc 0 to proc 1
    int myrank;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* get rank */
    int msgtag = 1;
    if (myrank == 0) {
        int x = 17;
        MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        int x;
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
    }
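Note (not on the slide): both calls here are blocking. MPI_Recv does not return until a matching message has arrived, and MPI_Send may not return until the data is safely buffered or delivered, so a mismatched tag, rank, or communicator can hang the program.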
SLIDE 17
Some MPI Concepts
- Communicator
- A set of processes that are allowed to communicate among themselves.
- Kind of like a “radio channel”.
- Default communicator: MPI_COMM_WORLD
- A library can use its own communicator, separated from that of a user program.
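As a sketch of that last point (my own example; lib_init, lib_finalize, and lib_comm are hypothetical names), a library can duplicate the communicator it is handed, so its internal messages can never be confused with the user program's:

    #include "mpi.h"

    static MPI_Comm lib_comm;                 /* the library's private "radio channel" */

    void lib_init(MPI_Comm user_comm)
    {
        /* Same group of processes, but a separate communication context */
        MPI_Comm_dup(user_comm, &lib_comm);
    }

    void lib_finalize(void)
    {
        MPI_Comm_free(&lib_comm);
    }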
SLIDE 18
Some MPI Concepts
- Data Type
- What kind of data is being sent/recvd?
- Mostly just names for C data types
- MPI_INT, MPI_CHAR, MPI_DOUBLE, etc.
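For instance (a one-line sketch, not from the slides), the count and datatype together describe the whole buffer, so sending an array of 100 doubles to rank 1 looks like:

    double a[100];
    /* ... fill a ... */
    MPI_Send(a, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);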
SLIDE 19
Some MPI Concepts
- Message Tag
- Arbitrary (integer) label for a message
- Tag of Send must match tag of Recv
- Useful for error checking & debugging
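A small sketch (my own example; the tag values 100 and 200 are arbitrary): the same pair of processes can keep two kinds of messages apart by tag, because a receive only matches a send with the same tag. A receive may also pass MPI_ANY_TAG to match any tag.

    int a = 1, b = 2;
    MPI_Status status;
    if (myrank == 0) {
        MPI_Send(&a, 1, MPI_INT, 1, 100, MPI_COMM_WORLD);            /* "data" message    */
        MPI_Send(&b, 1, MPI_INT, 1, 200, MPI_COMM_WORLD);            /* "control" message */
    } else if (myrank == 1) {
        MPI_Recv(&a, 1, MPI_INT, 0, 100, MPI_COMM_WORLD, &status);   /* matches tag 100 only */
        MPI_Recv(&b, 1, MPI_INT, 0, 200, MPI_COMM_WORLD, &status);   /* matches tag 200 only */
    }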
SLIDE 20
Parameters of blocking send
MPI_Send(buf, count, datatype, dest, tag, comm)
buf        address of send buffer
count      number of items to send
datatype   datatype of each item
dest       rank of destination process
tag        message tag
comm       communicator
SLIDE 21
Parameters of blocking receive

MPI_Recv(buf, count, datatype, src, tag, comm, status)

buf        address of receive buffer
count      maximum number of items to receive
datatype   datatype of each item
src        rank of source process
tag        message tag
comm       communicator
status     status after operation
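A sketch (my own addition) of why the status argument matters: with wildcards such as MPI_ANY_SOURCE and MPI_ANY_TAG, the status is how you learn who actually sent the message, with what tag, and how many items arrived (the count parameter above is only an upper bound):

    MPI_Status status;
    int buf[100], nreceived;
    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    printf("message from rank %d, tag %d\n", status.MPI_SOURCE, status.MPI_TAG);
    MPI_Get_count(&status, MPI_INT, &nreceived);   /* actual number of ints received */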
SLIDE 22