Programming Hybrid CPU-GPU Clusters with Unicorn
Subodh Kumar, IIT Delhi
Part of Tarun Beri's PhD thesis
Parallel Programming is Hard
- Problem decomposition
- Processor mapping
- Scheduling
- Communication and synchronization
- Tuning to hardware
- Maintainable and portable code
- Programmer productivity
- Scalability
- Managing multiple types of parallelism
○ accelerator, shared memory, cluster, message passing
- Thread model is non-deterministic
- Low level locking prone to deadlocks and livelocks
- Large numbers of programmers are still trained on sequential models of computation
○ No effective bridging model
About Unicorn
- Multi-node, multi-CPU, multi-GPU programming framework
○ Shared memory style
○ Bulk synchronous
- For coarse-grained, compute-intensive workloads
- Designed to adapt to the totality of effective network, memory and compute throughputs of devices

Current implementation
- Assumes flat network topology
- No elaborate matching of device capability to workload
Unicorn’s Goals
- To design a cluster programming model which is:
○ Simple: complexity largely in sequential native code
○ Heterogeneous: works on hybrid CPU-GPU clusters
○ Scalable: performance increases with cluster nodes
○ Unified: common API for CPU/GPU
○ Abstract: agnostic to network topology/data organization
Programming with Unicorn
- Plug-in existing sequential, parallel CPU and CUDA code
○ Debugging complexity “near” sequential/native code
- Shared memory style but deterministic execution
○ No data races
○ Check-in/check-out memory semantics with conflict resolution
○ No explicit data transfers in user code
! Internally, latency hiding via compute-communication overlap is a first-class citizen
○ Automatic load balancing and scheduling
- Code agnostic to data placement, organization and generation
○ No requirement of user binding data or computations to nodes
Many Competing Approaches
- Task graph partitioning
- Express units of work
- Directives
- Loop parallelization and scheduling
- Distributed address space (PGAS)
Examples: Legion, Global Arrays, X10, Split-C, Cilk, POP-C++, Titanium, Sequoia, MPI, Globus, MPI-ACC, StarPU-MPI, Phalanx, Charm++/G-Charm, Mekong
Bulk Synchronous?
[Figure: a bulk synchronous step on a hybrid node. Input phase: DMA transfers from CPU to GPU; local computation on each device; output phase: DMA transfers from GPU to CPU; then synchronization to global shared memory.]
Unicorn Programming Model
- A parallel program is a graph of tasks
- Tasks are divided into multiple concurrently executable subtasks
[Figure: a task graph of Tasks 0-3 connected by data dependencies.]
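To make the graph concrete, consider a hypothetical instance in which Task 3 consumes data produced by Tasks 0-2. The sketch below uses the same calls as the FFT example later in this deck; it is a hedged illustration, not confirmed Unicorn usage: the idea that Task 3's dependency follows from binding the producers' output address spaces as its inputs is an assumption about how the runtime tracks dependencies, and conf0..conf3 are placeholder task configurations.

    // Sketch: Tasks 0-2 each produce one address space; Task 3 consumes all
    // three. Dependency tracking via shared address spaces is assumed here.
    a = malloc_shared(size); b = malloc_shared(size);
    c = malloc_shared(size); out = malloc_shared(size);

    t0 = create_task("gen_a", n, conf0); bind_address_space(t0, a, WRITE_ONLY);
    t1 = create_task("gen_b", n, conf1); bind_address_space(t1, b, WRITE_ONLY);
    t2 = create_task("gen_c", n, conf2); bind_address_space(t2, c, WRITE_ONLY);

    t3 = create_task("combine", n, conf3);
    bind_address_space(t3, a, READ_ONLY);    // depends on Task 0
    bind_address_space(t3, b, READ_ONLY);    // depends on Task 1
    bind_address_space(t3, c, READ_ONLY);    // depends on Task 2
    bind_address_space(t3, out, WRITE_ONLY);

    submit_task(t0); submit_task(t1); submit_task(t2); submit_task(t3);
    wait_for_task_completion(t3);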
Unicorn – Data Partitioning
Stage 1: Data Partitioning [partition memory among subtasks]

User:
- Subscribes to input memory sections (input address space, read-only)
- Subscribes to output memory sections (output address space, read-write/write-only)

Data transfer is library managed; copies to and from devices are internally optimized.
[Figure: on each node (CPU cores 1-2 and GPU 1 on Nodes 1 and 2), the subscribed sections of the input and output address spaces are copied to the executing devices.]
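For a row-wise task such as the FFT example later in this deck, the subscription callback might look like the sketch below. The callback signature, get_task_conf, and subscribe_block are illustrative assumptions, not confirmed Unicorn API; only the idea that each subtask checks out the sections it reads and writes comes from the slides.

    // Hypothetical subscription callback: subtask i subscribes to row i of
    // the input (read) and output (write) address spaces.
    fft_subscription(task, subtask_id)
    {
        conf = get_task_conf(task);               // assumed accessor: fft_conf
        row_bytes = conf.cols * sizeof(complex);
        offset = subtask_id * row_bytes;

        // subscribe_block(task, subtask, space, offset, length) is assumed
        subscribe_block(task, subtask_id, INPUT_SPACE, offset, row_bytes);
        subscribe_block(task, subtask_id, OUTPUT_SPACE, offset, row_bytes);
    }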
Unicorn – Subtask Execution
Stage 2: Computation [Synchronization-free subtask execution]
User provided:
- CPU subtasks execute CPU functions
- GPU subtasks execute GPU kernels
[Figure: each device (CPU cores 1-2 and GPU 1 on Nodes 1 and 2) executes its subtasks on its private copies of the subscribed sections.]
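A subtask callback then computes only on the sections it subscribed to, with no explicit transfers or locks in user code. The sketch below is illustrative: get_input_section/get_output_section are assumed accessors, and sequential_fft_1d stands in for existing native code. The GPU path would instead launch the registered fft_cuda kernel on the same sections.

    // Hypothetical CPU subtask: run an existing sequential 1D FFT on the
    // row this subtask checked out. No data transfer appears in user code.
    fft_cpu(task, subtask_id)
    {
        conf = get_task_conf(task);                  // assumed accessor
        in = get_input_section(task, subtask_id);    // assumed accessor
        out = get_output_section(task, subtask_id);  // assumed accessor
        sequential_fft_1d(in, out, conf.cols);       // plugged-in native code
    }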
Unicorn – Data Synchronization
Stage 3: Data Synchronization [Copy Synchronization]
- Copy synchronization is system managed
- No user intervention required
[Figure: each device's output-section copy is synchronized back to the output address space.]
Unicorn – Data Reduction
Stage 3: Data Synchronization [Reduce Synchronization]
- Reduce operation is user controlled
- Hierarchical log(n) reduction
[Figure: devices' partial outputs are combined pairwise across devices and nodes.]
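The user supplies only the combiner; the runtime applies it pairwise across devices and nodes in log(n) rounds. A sketch with an assumed callback signature and accessors (a sum reduction is chosen here purely for illustration):

    // Hypothetical reduce callback: merge subtask_b's partial output into
    // subtask_a's, element-wise. The runtime forms the log(n) reduction tree.
    sum_reduce(task, subtask_a, subtask_b)
    {
        a = get_output_section(task, subtask_a);     // assumed accessor
        b = get_output_section(task, subtask_b);
        n = get_output_length(task, subtask_a) / sizeof(float);  // assumed
        for(i = 0; i < n; ++i)
            a[i] += b[i];                            // subtask_a keeps result
    }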
What does a program look like?
struct complex { float real, imag; };
struct fft_conf { size_t rows, cols; };

fft_1d(matrix_rows, matrix_cols)
{
    key = "FFT";
    register_callback(key, SUBSCRIPTION, fft_subscription);
    register_callback(key, CUDA, "fft_cuda", "fft.cu");

    if(get_host() == 0)   // submit task from a single host
    {
        size = matrix_rows * matrix_cols * sizeof(complex);
        input = malloc_shared(size);
        output = malloc_shared(size);
        initialize_input(input);   // application data

        // create task with one subtask per row
        nsubtasks = matrix_rows;
        task = create_task(key, nsubtasks, fft_conf(matrix_rows, matrix_cols));
        bind_address_space(task, input, READ_ONLY);
        bind_address_space(task, output, WRITE_ONLY);
        submit_task(task);
        wait_for_task_completion(task);
    }
}
Steps: define task callbacks, allocate address spaces, create the task, bind address spaces to the task, and submit the task for asynchronous execution.
Pillars of Unicorn
- Understand the flow of data and balance load
○ Pipeline data delivery and computation
- Parallelize at multiple levels
○ Inner loops often data parallel
! e.g., scientific computation
○ Coarse-grained outer level
- Sandbox computation
○ No data races
○ Transactional semantics
○ Data reduction, assimilation, re-organization
Runtime Optimizations
- Distributed directory maintenance
○ Non-coherent, opportunistic lock-free updates
○ MPI-style views
- Schedule data pre-fetch (among nodes and to/from GPU)
○ Software GPU cache
○ Hierarchical steal, pro-active steal, granularity adjustment
○ Locality-aware scheduling
! Local estimate of partial subtask affinity
! Register top affinities centrally
! Affinity measured by time to fetch non-resident data, rather than size of resident data (see the sketch after this list)
○ Locality-aware work stealing
- Automatic device conglomeration
- Network message merging, compression, etc.
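The locality heuristic above can be pictured as scoring a (device, subtask) pair by the estimated time to fetch the subtask's non-resident data, not by how much of its data is already resident. The sketch below is a back-of-the-envelope illustration; none of these names are Unicorn internals.

    // Illustrative affinity score: estimated time for a device to fetch the
    // bytes of a subtask's subscription that are not yet resident on it.
    estimate_fetch_time(device, subtask)
    {
        bytes = non_resident_bytes(device, subtask);    // assumed directory query
        msgs = non_resident_regions(device, subtask);   // one message per region
        return msgs * device.link_latency + bytes / device.link_bandwidth;
    }
    // A work-stealing device prefers the subtask with the lowest estimate.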
Experimental Setup
Node configuration (14-node cluster):
- 2 Intel Xeon X5650 2.67 GHz CPUs, six cores each
- 64 GB main memory
- 2 Tesla M2070 GPUs
- 32 Gbps InfiniBand (QDR) network
Total number of devices in the cluster = 196 (14 nodes × (12 CPU cores + 2 GPUs))
Experiments
Image Convolution: 24-bit RGB image, 2^16 × 2^16 pixels, 31 × 31 filter, 1024 subtasks
Matrix Multiplication: 2^16 × 2^16 matrices, 1024 subtasks
Block LU Decomposition: 2^16 × 2^16 matrix, 3 tasks per iteration, 32 iterations, 1 sequential task per iteration, min/max/avg subtasks in a task = 1/961/121.7
2D C2C FFT: 61440 × 61440 matrix, 1 row-FFT task, 1 column-FFT task, 120 subtasks per task
Page Rank: 500 million web pages, up to 20 outlinks per page, web dump stored on NFS, 25 iterations, 250 subtasks per iteration
Performance Results
Scaling with increasing problem size
GPU versus CPU+GPU
Lower execution time is better
Image Convolution
Unicorn Time versus Application Time
Unicorn versus StarPU (one node)
Matrix Multiplication
Unicorn versus SUMMA
Matrix Multiplication
Concluding Remarks
- Unicorn is suitable for:
▪ Coarse-grained tasks decomposable into concurrently executable subtasks
▪ Deferred synchronization with lazy conflict resolution
- The Unicorn model does not work efficiently with tasks:
▪ having non-deterministic memory access patterns
▪ requiring fine-grained/frequent communication
- Unicorn could benefit from:
▪ Directives and language-based constructs
▪ Inter-job and IO scheduling
▪ Support for asynchronous updates