
PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures
Hahn Kim, Nadya Bliss, Jim Daly, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge
HPEC 2008


  1. PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures
     Hahn Kim, Nadya Bliss, Jim Daly, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge
     HPEC 2008, 25 September 2008, MIT Lincoln Laboratory
     This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

  2. Outline
     • Background
       – Motivation
       – Multicore Processors
       – Programming Challenges
     • Tasks & Conduits
     • Maps & Arrays
     • Results
     • Summary

  3. SWaP* for Real-Time Embedded Systems
     • Modern DoD sensors continue to increase in fidelity and sampling rates
     • Real-time processing will always be a requirement
     [Figure: sensor platforms with decreasing SWaP, from the U-2 to the Global Hawk]
     Modern sensor platforms impose tight SWaP requirements on real-time embedded systems
     * SWaP = Size, Weight and Power

  4. Embedded Processor Evolution
     [Figure: MFLOPS / Watt vs. year (1990–2010) for high performance embedded processors: i860 XR, 603e, SHARC, 750, MPC7400, MPC7410, MPC7447A (PowerPC with AltiVec), Cell (estimated), PowerXCell 8i, GPU]
     • 20 years of exponential growth in FLOPS / W
     • Must switch architectures every ~5 years
     • Current high performance architectures are multicore
       – e.g. the PowerXCell 8i is a multicore processor with 1 PowerPC core and 8 SIMD cores
     Multicore processors help achieve performance requirements within tight SWaP constraints

  5. Parallel Vector Tile Optimizing Library
     • PVTOL is a portable and scalable middleware library for multicore processors
     • Enables a unique software development process for real-time signal processing applications:
       1. Develop serial code on a desktop computer
       2. Parallelize code
       3. Deploy code to a cluster or embedded computer
       4. Automatically parallelize code
     Make parallel programming as easy as serial programming

  6. PVTOL Architecture
     • Tasks & Conduits – concurrency and data movement
     • Maps & Arrays – distribute data across processor and memory hierarchies
     • Functors – abstract computational kernels into objects
     Portability: runs on a range of architectures
     Performance: achieves high performance
     Productivity: minimizes effort at the user level
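As a rough illustration of the functor idea named above, the fragment below wraps a computational kernel in a C++ function object. It is a generic sketch of the pattern, not PVTOL's actual functor interface; ScaleKernel and its members are invented names for this example.

```cpp
// Generic sketch of "abstract computational kernels into objects".
// ScaleKernel is an invented name, not a PVTOL class.
#include <vector>

struct ScaleKernel {
    double gain;
    // The kernel itself: apply a per-element operation to a vector.
    void operator()(std::vector<double>& x) const {
        for (double& v : x) v *= gain;
    }
};

int main() {
    std::vector<double> data(1024, 1.0);
    ScaleKernel scale{2.0};   // the kernel is now an object that a library
    scale(data);              // can store, pass around, and schedule
    return 0;
}
```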

  7. Outline
     • Background
     • Tasks & Conduits
     • Maps & Arrays
     • Results
     • Summary

  8. Multicore Programming Challenges
     Inside the Box (Desktop, Embedded Board):
     • Threads
       – Pthreads
       – OpenMP
     • Shared memory
       – Pointer passing
       – Mutexes, condition variables
     Outside the Box (Cluster, Embedded Multicomputer):
     • Processes
       – MPI (MPICH, Open MPI, etc.)
       – Mercury PAS
     • Distributed memory
       – Message passing
     PVTOL provides consistent semantics for both multicore and cluster computing
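The slide's point is that the two worlds use very different mechanisms today. The sketch below is a minimal illustration of that contrast, not anything PVTOL-specific: the same "hand a buffer to a worker" step is written once with Pthreads (pointer passing in shared memory) and once with MPI (explicit message passing). Names such as worker and buf are invented for the example, and compiling the MPI half assumes an MPI installation.

```cpp
// Minimal illustration of the two programming models the slide contrasts.
#include <pthread.h>
#include <mpi.h>

const int N = 16;
double buf[N];

// Inside the box: shared memory. The worker receives a pointer to the same
// buffer the main thread owns; no copy is made.
void* worker(void* arg) {
    double* data = static_cast<double*>(arg);
    data[0] += 1.0;                              // directly mutate shared data
    return nullptr;
}

void sharedMemoryVersion() {
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, buf);  // pointer passing
    pthread_join(tid, nullptr);
}

// Outside the box: distributed memory. Rank 0 must explicitly send a copy of
// the buffer to rank 1; each process has its own address space.
void messagePassingVersion(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
}

int main(int argc, char** argv) {
    sharedMemoryVersion();
    // messagePassingVersion(argc, argv);  // requires mpirun with >= 2 ranks
    return 0;
}
```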

  9. Tasks & Conduits
     • Tasks provide concurrency
       – Collection of 1+ threads in 1+ processes
       – Tasks are SPMD, i.e. each thread runs task code
     • Task Maps specify locations of Tasks
     • Conduits move data
       – Safely move data
       – Multibuffering
       – Synchronization
     [Figure: pipeline from disk through three tasks – DIT: load(B), cdt1.write(B); DAT: cdt1.read(B), A = B, cdt2.write(A); DOT: cdt2.read(A), save(A)]
     DIT – read data from source (1 thread); DAT – process data (4 threads); DOT – output results (1 thread); Conduits connect DIT to DAT and DAT to DOT
     * DIT – Data Input Task, DAT – Data Analysis Task, DOT – Data Output Task

  10. Pipeline Example: DIT-DAT-DOT
      The main function creates the tasks, connects the tasks with conduits and launches the task computation:

      int main(int argc, char** argv)
      {
        // Create maps (omitted for brevity)
        ...

        // Create the tasks
        Task<Dit> dit("Data Input Task", ditMap);
        Task<Dat> dat("Data Analysis Task", datMap);
        Task<Dot> dot("Data Output Task", dotMap);

        // Create the conduits
        Conduit<Matrix <double> > ab("A to B Conduit");
        Conduit<Matrix <double> > bc("B to C Conduit");

        // Make the connections
        dit.init(ab.getWriter());
        dat.init(ab.getReader(), bc.getWriter());
        dot.init(bc.getReader());

        // Complete the connections
        ab.setupComplete();
        bc.setupComplete();

        // Launch the tasks
        dit.run();
        dat.run();
        dot.run();

        // Wait for tasks to complete
        dit.waitTillDone();
        dat.waitTillDone();
        dot.waitTillDone();
      }

  11. Pipeline Example: Data Analysis Task (DAT)
      Tasks read and write data using Reader and Writer interfaces to Conduits; Readers and Writers provide handles to data buffers:

      class Dat
      {
      private:
        Conduit<Matrix <double> >::Reader m_Reader;
        Conduit<Matrix <double> >::Writer m_Writer;

      public:
        void init(Conduit<Matrix <double> >::Reader& reader,
                  Conduit<Matrix <double> >::Writer& writer)
        {
          // Get data reader for the conduit
          reader.setup(tr1::Array<int, 2>(ROWS, COLS));
          m_Reader = reader;

          // Get data writer for the conduit
          writer.setup(tr1::Array<int, 2>(ROWS, COLS));
          m_Writer = writer;
        }

        void run()
        {
          Matrix <double>& B = m_Reader.getData();
          Matrix <double>& A = m_Writer.getData();
          A = B;
          m_Reader.releaseData();
          m_Writer.releaseData();
        }
      };

  12. Outline
      • Background
      • Tasks & Conduits
      • Maps & Arrays
        – Hierarchy
        – Functors
      • Results
      • Summary

  13. Map-Based Programming
      • A map is an assignment of blocks of data to processing elements
      • Maps have been demonstrated in several technologies:

        Technology                Organization  Language  Year
        Parallel Vector Library   MIT-LL*       C++       2000
        pMatlab                   MIT-LL        MATLAB    2003
        VSIPL++                   HPEC-SI†      C++       2006

      [Figure: three maps on a two-processor cluster (Proc 0, Proc 1), each with grid: 1x2 and procs: 0:1, but with dist: block, cyclic and block-cyclic respectively]
      The grid specification together with the processor list describes where data are distributed; the distribution specification describes how data are distributed
      * MIT Lincoln Laboratory   † High Performance Embedded Computing Software Initiative
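As a concrete illustration of the "where" versus "how" distinction above, the short sketch below assigns the 8 columns of a matrix to the two processors in the figure under block and cyclic distributions. This is illustrative arithmetic only, not PVTOL code; the two helper functions are assumptions made for the example.

```cpp
// Illustrative only: column ownership for a 1x2 grid over procs 0:1.
#include <cstdio>

int blockOwner(int col, int nCols, int nProcs)  { return col / (nCols / nProcs); }
int cyclicOwner(int col, int nProcs)            { return col % nProcs; }

int main() {
    const int nCols = 8, nProcs = 2;
    for (int c = 0; c < nCols; ++c)
        std::printf("col %d: block -> proc %d, cyclic -> proc %d\n",
                    c, blockOwner(c, nCols, nProcs), cyclicOwner(c, nProcs));
    // block:  cols 0-3 on proc 0, cols 4-7 on proc 1
    // cyclic: even cols on proc 0, odd cols on proc 1
    return 0;
}
```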

  14. PVTOL Machine Model
      • Processor Hierarchy
        – Processor: scheduled by OS
        – Co-processor: dependent on processor for program control
      • Memory Hierarchy
        – Each level in the processor hierarchy can have its own memory
      [Figure: CELL cluster – CELL 0 and CELL 1 processors, each with co-processors SPE 0, SPE 1, …; memory levels run from disk and remote processor memory, through local processor memory and cache/local co-processor memory, down to registers, with read/write data and read/write register operations between levels]
      PVTOL extends maps to support hierarchy

  15. PVTOL Machine Model
      • Processor Hierarchy
        – Processor: scheduled by OS
        – Co-processor: dependent on processor for program control
      • Memory Hierarchy
        – Each level in the processor hierarchy can have its own memory
      [Figure: x86 cluster – x86/PPC 0 and x86/PPC 1 processors, each with co-processors GPU / FPGA 0, GPU / FPGA 1, …; same memory levels as the CELL example, from disk down to registers]
      Semantics are the same across different architectures

  16. Hierarchical Maps and Arrays
      • PVTOL provides hierarchical maps and arrays
      • Hierarchical maps concisely describe data distribution at each level
      • Hierarchical arrays hide details of the processor and memory hierarchy
      [Figure: Serial (single PPC); Parallel (PPC cluster – grid: 1x2, dist: block, procs: 0:1 across PPC 0 and PPC 1); Hierarchical (CELL cluster – grid: 1x2, dist: block, procs: 0:1 across CELL 0 and CELL 1, then grid: 1x2, dist: block, procs: 0:1 across SPE 0 and SPE 1, with block: 1x2 in each SPE's local store (LS))]
      Program Flow (a sketch of these steps follows below):
      1. Define a Block
         • Data type, index layout (e.g. row-major)
      2. Define a Map for each level in the hierarchy
         • Grid, data distribution, processor list
      3. Define an Array for the Block
      4. Parallelize the Array with the Hierarchical Map (optional)
      5. Process the Array
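Written out, the five-step program flow above might look roughly like the following. This is a minimal, self-contained sketch: Block, Map and Array here are toy stand-ins defined for illustration, not PVTOL's real classes, and their members and constructor arguments are assumptions. Only the step structure and the grid/dist/procs values come from the slide.

```cpp
// Toy sketch of the slide's program flow; not PVTOL's actual API.
#include <memory>
#include <vector>

enum class Layout { RowMajor, ColMajor };
enum class Dist   { Block, Cyclic, BlockCyclic };

// Step 1: a Block describes the data type (via template) and index layout.
template <typename T>
struct Block { Layout layout; };

// Step 2: a Map describes grid, distribution and processor list; a map may
// own a child map describing the next level of the hierarchy (e.g. SPEs).
struct Map {
    int gridRows, gridCols;
    Dist dist;
    std::vector<int> procs;
    std::shared_ptr<Map> child;   // next level down, if any
};

// Steps 3-4: an Array ties a Block to a (possibly hierarchical) Map.
template <typename T>
struct Array {
    int rows, cols;
    Block<T> block;
    std::shared_ptr<Map> map;     // empty => serial array
};

int main() {
    const int ROWS = 4, COLS = 8;
    Block<double> block{Layout::RowMajor};                                     // Step 1

    auto speMap  = std::make_shared<Map>(Map{1, 2, Dist::Block, {0, 1}, nullptr});
    auto cellMap = std::make_shared<Map>(Map{1, 2, Dist::Block, {0, 1}, speMap}); // Step 2

    Array<double> A{ROWS, COLS, block, cellMap};                               // Steps 3-4
    (void)A;

    // Step 5: process the array (computation omitted in this sketch).
    return 0;
}
```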
