

  1. Programming Hybrid CPU-GPU Clusters with Unicorn Subodh Kumar IIT Delhi Part of Tarun Beri’s PhD thesis

  2. Parallel Programming is Hard ● Problem decomposition ● Processor mapping ● Scheduling ● Communication and synchronization ● Tuning to hardware ● Maintainable and portable code ● Programmer productivity ● Scalability ● Managing multiple types of parallelism ○ accelerator, shared memory, cluster, message passing ● Thread model is non-deterministic ● Low-level locking is prone to deadlocks and livelocks ● Many programmers are still trained on sequential models of computation ○ No effective bridging model

  3. About Unicorn ● Multi-node, multi-CPU, multi-GPU programming framework ○ Shared-memory style ○ Bulk synchronous ● For coarse-grained, compute-intensive workloads ● Designed to adapt to the totality of effective network, memory and compute throughputs of devices ● Current implementation: assumes a flat network topology; no elaborate matching of device capability to workload

  4. Unicorn’s Goals ● To design a cluster programming model that is: ○ Simple: complexity stays largely in sequential native code ○ Scalable: performance increases with cluster nodes ○ Unified: common API for CPU/GPU ○ Abstract: agnostic to network topology and data organization ○ Heterogeneous: works on hybrid CPU-GPU clusters

  5. Programming with Unicorn ● Plug in existing sequential/parallel CPU and CUDA code ○ Debugging complexity stays “near” that of sequential/native code ● Shared-memory style but deterministic execution ○ No data races ○ Check-in/check-out memory semantics with conflict resolution ○ No explicit data transfers in user code ▪ Internally, latency hiding via compute-communication overlap is a first-class citizen ○ Automatic load balancing and scheduling ● Code agnostic to data placement, organization and generation ○ No requirement that the user bind data or computations to nodes

  6. Many Competing Approaches ● Task-graph partitioning ● Expressing units of work ● Directives ● Loop parallelization and scheduling ● Distributed address spaces (PGAS) ● Data scheduling ● Representative systems: Legion, Global Arrays, X10, Split-C, Cilk, POP-C++, Titanium, Sequoia, MPI, Globus, MPI-ACC, StarPU-MPI, Phalanx, Charm++/G-Charm, Mekong

  7. Bulk Synchronous? [Diagram: in the input phase, data moves from global shared memory to each device via DMA transfers (CPU to GPU); each device then performs local computation; in the output phase, results move back via DMA transfers (GPU to CPU) and are synchronized to global shared memory]

  8. Unicorn Programming Model ● A parallel program is a graph of tasks connected by data dependencies ● Tasks are divided into multiple concurrently executable subtasks [Diagram: tasks 0-3 linked by data dependencies; each task goes through an input phase from global shared memory, local computation, an output phase, and synchronization back to global shared memory]
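As an illustration of the task graph above, the sketch below submits a two-task pipeline (a row-FFT task feeding a column-FFT task, matching the FFT experiment later in the deck). It borrows the API names from the code on slide 13; the task keys, the shared temp address space, and the assumption that the runtime orders the two tasks through the address space they share are illustrative guesses, not taken from the slides.

    // Hypothetical two-task graph: task1 consumes the address space task0 produces
    temp = malloc_shared(size);                     // intermediate address space (assumed)

    task0 = create_task("FFT_ROWS", nrows, conf);   // task keys invented for this sketch
    bind_address_space(task0, input, READ_ONLY);
    bind_address_space(task0, temp, WRITE_ONLY);    // task0 writes temp
    submit_task(task0);

    task1 = create_task("FFT_COLS", ncols, conf);
    bind_address_space(task1, temp, READ_ONLY);     // task1 reads temp: the data dependency
    bind_address_space(task1, output, WRITE_ONLY);
    submit_task(task1);

    wait_for_task_completion(task1);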

  9. Unicorn – Data Partitioning ● Stage 1: data partitioning [partition memory among subtasks] ● User-provided subscription callback: ○ subscribes each subtask to its input memory sections ○ subscribes each subtask to its output memory sections ● The resulting copies and data transfers are library managed and internally optimized ● Input address space is read-only (RO); output address space is read-write or write-only (RW/WO)
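To make the subscription step concrete, here is a minimal sketch of what the fft_subscription callback registered on slide 13 might do, assuming one subtask per matrix row. The callback name comes from that slide; the signature and the subscribe/INPUT_MEM/OUTPUT_MEM helpers are assumptions made for illustration, not Unicorn’s documented API.

    // Assumed to be called once per subtask, before it executes
    fft_subscription(task, subtask_id)
    {
        // One subtask per row, as in the example task on slide 13
        row_bytes = task.conf.cols * sizeof(complex);
        offset = subtask_id * row_bytes;

        // Check out only the sections this subtask needs
        subscribe(task, subtask_id, INPUT_MEM, offset, row_bytes);    // read-only input row
        subscribe(task, subtask_id, OUTPUT_MEM, offset, row_bytes);   // write-only output row
    }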

  10. Unicorn – Subtask Execution ● Stage 2: computation [synchronization-free subtask execution] ● User provided: ○ CPU subtasks execute CPU functions ○ GPU subtasks execute GPU kernels ● Each subtask reads its subscribed sections of the input address space (RO) and writes its sections of the output address space (RW/WO)
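For the computation stage, the user supplies a CPU function and/or a CUDA kernel per task. The fft_cuda name and the fft.cu file appear on slide 13; everything else below (the signatures, the fft_row helper, the kernel body) is an illustrative assumption about what such callbacks might look like.

    // CPU subtask: plain sequential/native code over the subscribed sections only
    fft_cpu(task, subtask_id, input_section, output_section)
    {
        fft_row(input_section, output_section, task.conf.cols);   // hypothetical helper
    }

    // GPU subtask: an ordinary CUDA kernel in fft.cu; the runtime stages the subscribed
    // sections to device memory (the DMA transfers of slide 7) and launches the kernel
    __global__ void fft_cuda(const complex* in_row, complex* out_row, size_t cols)
    {
        // per-thread FFT work on one row (omitted)
    }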

  11. Unicorn – Data Synchronization ● Stage 3: data synchronization [copy synchronization] ● Copy synchronization is system managed ● No user intervention required

  12. Unicorn – Data Reduction ● Stage 3: data synchronization [reduce synchronization] ● The reduce operation is user controlled ● Hierarchical log(n) reduction
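As an example of a user-controlled reduce, the sketch below folds one subtask’s output section into another’s element-wise; the runtime would apply such a callback pairwise, merging n partial outputs in roughly log(n) hierarchical rounds as the slide states. The callback name, its signature and the element type are assumptions made for illustration.

    // Hypothetical reduce callback: combine subtask_b's output into subtask_a's
    reduce_add(task, subtask_a, out_a, subtask_b, out_b)
    {
        n = task.conf.cols;
        for(i = 0; i < n; i++)
            out_a[i] += out_b[i];   // element-wise combine; the runtime builds the log(n) tree
    }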

  13. What does a program look like?

    struct complex { float real, imag; };
    struct fft_conf { size_t rows, cols; };

    fft_1d(matrix_rows, matrix_cols)
    {
        key = "FFT";

        // Define task callbacks
        register_callback(key, SUBSCRIPTION, fft_subscription);
        register_callback(key, CUDA, "fft_cuda", "fft.cu");

        if(get_host() == 0)   // Submit task from a single host
        {
            // Allocate address spaces
            size = matrix_rows * matrix_cols * sizeof(complex);
            input = malloc_shared(size);
            output = malloc_shared(size);
            initialize_input(input);   // application data

            // Create task with one subtask per row
            nsubtasks = matrix_rows;
            task = create_task(key, nsubtasks, fft_conf(matrix_rows, matrix_cols));

            // Bind address spaces to the task
            bind_address_space(task, input, READ_ONLY);
            bind_address_space(task, output, WRITE_ONLY);

            // Submit task for asynchronous execution
            submit_task(task);
            wait_for_task_completion(task);
        }
    }

  14. Pillars of Unicorn ● Understand the flow of data and balance load ○ Pipeline data delivery and computation ● Parallelize at multiple levels ○ Inner loops are often data parallel ▪ e.g., scientific computation ○ Coarse-grained outer level ● Sandbox computation ○ No data races ○ Transactional semantics ○ Data reduction, assimilation, re-organization

  15. Runtime Optimizations ● Distributed directory maintenance ○ Non-coherent, opportunistic lock-free updates ○ MPI-style views ● Scheduled data pre-fetch (among nodes and to/from GPUs) ○ Software GPU cache ○ Hierarchical steal, pro-active steal, granularity adjustment ○ Locality-aware scheduling ▪ Local estimate of partial subtask affinity ▪ Register top affinities centrally ▪ Based on time to fetch non-resident data, rather than size of resident data ○ Locality-aware work stealing ● Automatic device conglomeration ● Network message merging, compression, etc.
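A minimal sketch of the locality-aware stealing estimate described above: the cost of stealing a pending subtask is judged by the time to fetch its non-resident data, not by how much of its data is already resident. All names below are illustrative pseudocode in the style of the earlier slides; they are not Unicorn’s actual internals.

    // Idle (thief) device picks the victim's pending subtask that is cheapest to fetch
    best = NONE;
    best_cost = INFINITY;
    for(i = 0; i < victim_queue.count; i++)
    {
        s = victim_queue[i];
        missing_bytes = subscribed_bytes(s) - resident_bytes(s, thief_node);
        cost = missing_bytes / bandwidth(victim_node, thief_node);   // time to fetch, not resident size
        if(cost < best_cost) { best_cost = cost; best = s; }
    }
    steal(best);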

  16. Experimental Setup ● Node configuration (14-node cluster): ○ 2 Intel Xeon X5650 2.67 GHz CPUs with six cores each ○ 64 GB main memory ○ 2 Tesla M2070 GPUs ○ 32 Gbps InfiniBand (QDR) network ● Total number of devices in the cluster = 196

  17. Experiments ● Image Convolution: 24-bit RGB image, 2^16 x 2^16 pixels, 31 x 31 filter, 1024 subtasks ● 2D C2C FFT: 61440 x 61440 matrix, 1 row-FFT task and 1 column-FFT task, 120 subtasks/task ● Matrix Multiplication: 2^16 x 2^16 matrices, 1024 subtasks ● Block LU Decomposition: 2^16 x 2^16 matrix, 3 tasks per iteration, 32 iterations, 1 sequential task/iteration, min/max/avg subtasks in a task = 1/961/121.7 ● Page Rank: 500 million web pages, 20 outlinks per page (max), web dump stored on NFS, 25 iterations, 250 subtasks/iteration

  18. Performance Results

  19. Scaling with increasing problem size

  20. GPU versus CPU+GPU (Image Convolution; lower execution time is better)

  21. Unicorn Time versus Application Time

  22. Unicorn versus StarPU (one node): Matrix Multiplication

  23. Unicorn versus SUMMA: Matrix Multiplication

  24. Concluding Remarks ● Unicorn is suitable for: ○ coarse-grained tasks decomposable into concurrently executable subtasks ○ deferred synchronization with lazy conflict resolution ● The Unicorn model does not work efficiently with tasks: ○ having non-deterministic memory access patterns ○ requiring fine-grained/frequent communication ● Unicorn could use: ○ directives and language-based constructs ○ inter-job and IO scheduling ○ support for asynchronous updates ○ adaptation to newer architectures, GPU-aware MPI, etc.

  25. For more details, please visit: http://www.cse.iitd.ac.in/~subodh/unicorn.html Thank you
