SLIDE 1

Chai: Collaborative Heterogeneous Applications for Integrated-architectures

Juan Gómez-Luna1, Izzat El Hajj2, Li-Wen Chang2, Víctor García-Flores3,4, Simon Garcia de Gonzalo2, Thomas B. Jablin2,5, Antonio J. Peña4, and Wen-mei Hwu2

1Universidad de Córdoba, 2University of Illinois at Urbana-Champaign, 3Universitat Politècnica de Catalunya, 4Barcelona Supercomputing Center, 5MulticoreWare, Inc.

SLIDE 2

Motivation

  • Heterogeneous systems are moving towards tighter integration
  • Shared virtual memory, coherence, system-wide atomics
  • OpenCL 2.0, CUDA 8.0
  • A benchmark suite is needed for:
  • Analyzing collaborative workloads
  • Evaluating new architecture features
SLIDE 3

Application Structure

[Diagram: application structure decomposed into coarse-grain tasks, coarse-grain sub-tasks, fine-grain sub-tasks, and fine-grain tasks]

SLIDE 4

Data Partitioning

[Diagram: execution flow with the data of tasks A and B partitioned across devices]

SLIDE 5

Data Partitioning: Bézier Surfaces

  • Output surface points are distributed across devices

[Figure: grid of 3D surface points; legend distinguishes tiles of surface points processed on the CPU vs. the GPU, and individual 3D surface points processed on the CPU vs. the GPU]

SLIDE 6

Data Partitioning: Image Histogram

  • Input pixels distributed across devices (sketched below)
  • Output bins distributed across devices
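As a rough sketch of the input-partitioned variant (not the Chai code), the GPU and a CPU thread below each process one contiguous half of the pixels and update a single set of bins held in unified memory through system-wide atomics. The 50/50 split, buffer names, and kernel are illustrative; the example assumes a platform where the CPU and GPU can access unified memory concurrently and where system-wide atomics are supported (e.g., an integrated APU, or a Pascal-class GPU with CUDA 8.0).

```cuda
// Sketch: input-partitioned image histogram (not the Chai code).
// The GPU and a CPU thread each process half of the pixels and update one
// shared set of bins in unified memory with system-wide atomics.
#include <cuda_runtime.h>
#include <thread>
#include <cstdio>

const int NUM_BINS = 256;

__global__ void hist_gpu(const unsigned char *pixels, int begin, int end,
                         unsigned int *bins) {
    int i = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end)
        atomicAdd_system(&bins[pixels[i]], 1u);        // system-wide atomic on shared bins
}

void hist_cpu(const unsigned char *pixels, int begin, int end,
              unsigned int *bins) {
    for (int i = begin; i < end; ++i)
        __atomic_fetch_add(&bins[pixels[i]], 1u, __ATOMIC_RELAXED);  // host-side atomic
}

int main() {
    const int n = 1 << 22;                             // number of pixels
    unsigned char *pixels; unsigned int *bins;
    cudaMallocManaged(&pixels, n);
    cudaMallocManaged(&bins, NUM_BINS * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) pixels[i] = i % NUM_BINS;
    for (int b = 0; b < NUM_BINS; ++b) bins[b] = 0;

    int split = n / 2;                                 // GPU takes the first half of the input
    hist_gpu<<<(split + 255) / 256, 256>>>(pixels, 0, split, bins);
    std::thread t(hist_cpu, pixels, split, n, bins);   // CPU thread takes the second half
    t.join();
    cudaDeviceSynchronize();

    printf("bins[0] = %u\n", bins[0]);
    cudaFree(pixels);
    cudaFree(bins);
    return 0;
}
```

The output-partitioned variant would instead give each device its own range of bins, so no shared atomics on the bins are needed, at the cost of each device scanning more of the input.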

SLIDE 7

Data Partitioning: Padding

  • Rows are distributed across devices
  • Challenge: the operation is in-place, which requires inter-worker synchronization


SLIDE 8

Data Partitioning: Stream Compaction

  • Rows are distributed across devices
  • Like padding, but irregular and involves predicate computations


SLIDE 9

Data Partitioning: Other Benchmarks

  • Canny Edge Detection
  • Different devices process different images
  • Random Sample Consensus
  • Workers on different devices process different models
  • In-place Transposition
  • Workers on different devices follow different cycles
SLIDE 10

Types of Data Partitioning

  • Partitioning strategy:
  • Static (fixed work for each device)
  • Dynamic (contend on a shared worklist; see the sketch after this list)
  • Flexible interface for defining partitioning schemes
  • Partitioned data:
  • Input (e.g., Image Histogram)
  • Output (e.g., Bézier Surfaces)
  • Both (e.g., Padding)
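As a concrete illustration of the dynamic strategy, here is a minimal CUDA sketch of a shared worklist that GPU blocks and a CPU thread drain concurrently via system-wide atomics on a counter in unified memory. This is not the Chai partitioning interface: TILE, gpu_worker, and cpu_worker are illustrative names, and the example assumes hardware where the CPU and GPU can access unified memory concurrently and system-wide atomics are available (e.g., an integrated APU, or a Pascal-class GPU with CUDA 8.0).

```cuda
// Sketch: dynamic partitioning over a shared worklist (not the Chai code).
// CPU and GPU workers contend on the same counter with system-wide atomics.
#include <cuda_runtime.h>
#include <thread>

const int TILE = 1024;  // each work item is one tile of elements

__global__ void gpu_worker(int *next, float *data, int n) {
    __shared__ int tile;
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd_system(next, 1);       // grab the next tile (system scope)
        __syncthreads();
        int begin = tile * TILE;
        if (begin >= n) return;                      // worklist exhausted
        for (int i = begin + threadIdx.x; i < begin + TILE && i < n; i += blockDim.x)
            data[i] *= 2.0f;                         // stand-in for the real per-tile work
        __syncthreads();                             // keep 'tile' stable until all threads finish
    }
}

void cpu_worker(int *next, float *data, int n) {
    for (;;) {
        int tile = __atomic_fetch_add(next, 1, __ATOMIC_RELAXED);  // host atomic on the same counter
        int begin = tile * TILE;
        if (begin >= n) return;
        for (int i = begin; i < begin + TILE && i < n; ++i)
            data[i] *= 2.0f;
    }
}

int main() {
    const int n = 1 << 20;
    int *next; float *data;
    cudaMallocManaged(&next, sizeof(int));           // worklist counter visible to CPU and GPU
    cudaMallocManaged(&data, n * sizeof(float));
    *next = 0;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    gpu_worker<<<8, 256>>>(next, data, n);           // GPU blocks and one CPU thread pull tiles
    std::thread t(cpu_worker, next, data, n);        // from the same counter, concurrently
    t.join();
    cudaDeviceSynchronize();

    cudaFree(next);
    cudaFree(data);
    return 0;
}
```

Static partitioning is the degenerate case of this sketch: each device is handed a fixed range of tiles up front instead of contending on the shared counter.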
SLIDE 11

Fine-grain Task Partitioning

[Diagram: execution flow alternating fine-grain sub-tasks A and B across devices]

SLIDE 12

Fine-grain Task Partitioning: Random Sample Consensus

  • Data partitioning: models distributed across devices
  • Task partitioning: model fitting on the CPU and model evaluation on the GPU

[Diagram: per-iteration pipeline alternating Fitting (CPU) and Evaluation (GPU) stages]

SLIDE 13

Fine-grain Task Partitioning: Task Queue System

[Figure: task queue system. Synthetic tasks: the CPU enqueues short (Tshort) and long (Tlong) tasks, and the GPU dequeues and executes them until the queue is empty. Histogram tasks: the CPU reads image frames and enqueues them, and the GPU computes their histograms.]

SLIDE 14

Coarse-grain Task Partitioning

[Diagram: execution flow with coarse-grain tasks A and B assigned to different devices]

SLIDE 15

Coarse-grain Task Partitioning: Breadth-First Search & Single-Source Shortest Path

Small frontiers are processed on the CPU; large frontiers are processed on the GPU (see the sketch below).

SSSP performs more computation than BFS, which helps hide communication and memory latency.
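The dispatch idea can be sketched as a toy BFS (not the Chai kernels): each level's frontier is processed on the CPU when it is small and on the GPU when it is large. The threshold, graph, and helper names are illustrative, and unified memory is assumed so both devices see the same frontier and distance arrays.

```cuda
// Sketch: coarse-grain task partitioning for BFS (not the Chai code).
// Small frontiers run on the CPU; large frontiers launch a GPU kernel.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

__global__ void bfs_level_gpu(const int *row, const int *col, int *dist,
                              const int *frontier, int fsize,
                              int *next_frontier, int *next_size, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= fsize) return;
    int u = frontier[i];
    for (int e = row[u]; e < row[u + 1]; ++e) {
        int v = col[e];
        if (atomicCAS(&dist[v], -1, level) == -1)             // first visitor claims v
            next_frontier[atomicAdd(next_size, 1)] = v;
    }
}

void bfs_level_cpu(const int *row, const int *col, int *dist,
                   const int *frontier, int fsize,
                   int *next_frontier, int *next_size, int level) {
    for (int i = 0; i < fsize; ++i) {
        int u = frontier[i];
        for (int e = row[u]; e < row[u + 1]; ++e) {
            int v = col[e];
            if (dist[v] == -1) { dist[v] = level; next_frontier[(*next_size)++] = v; }
        }
    }
}

int main() {
    // Tiny CSR graph with directed edges 0->1, 0->2, 1->3, 2->3.
    const int n = 4;
    const int h_row[] = {0, 2, 3, 4, 4}, h_col[] = {1, 2, 3, 3};
    const int THRESHOLD = 1024;    // illustrative CPU/GPU cutover point

    int *row, *col, *dist, *frontier, *next, *next_size;
    cudaMallocManaged(&row, sizeof(h_row));
    cudaMallocManaged(&col, sizeof(h_col));
    cudaMallocManaged(&dist, n * sizeof(int));
    cudaMallocManaged(&frontier, n * sizeof(int));
    cudaMallocManaged(&next, n * sizeof(int));
    cudaMallocManaged(&next_size, sizeof(int));
    memcpy(row, h_row, sizeof(h_row));
    memcpy(col, h_col, sizeof(h_col));
    for (int i = 0; i < n; ++i) dist[i] = -1;

    dist[0] = 0;                   // source vertex
    frontier[0] = 0;
    int fsize = 1, level = 1;

    while (fsize > 0) {
        *next_size = 0;
        if (fsize < THRESHOLD) {   // small frontier: process on the CPU
            bfs_level_cpu(row, col, dist, frontier, fsize, next, next_size, level);
        } else {                   // large frontier: process on the GPU
            int blocks = (fsize + 255) / 256;
            bfs_level_gpu<<<blocks, 256>>>(row, col, dist, frontier, fsize,
                                           next, next_size, level);
            cudaDeviceSynchronize();
        }
        int *tmp = frontier; frontier = next; next = tmp;     // advance to the next level
        fsize = *next_size;
        ++level;
    }

    for (int i = 0; i < n; ++i)
        printf("dist[%d] = %d\n", i, dist[i]);
    return 0;
}
```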

SLIDE 16

Coarse-grain Task Partitioning: Canny Edge Detection

  • Data partitioning: images distributed across devices
  • Task partitioning: stages distributed across devices and pipelined

[Diagram: Gaussian Filter → Sobel Filter → Non-max Suppression → Hysteresis pipeline, with stages split between the CPU and the GPU]

SLIDE 17

Benchmarks and Implementations

Implementations (U = unified memory, D = discrete memory):

  • OpenCL-U
  • OpenCL-D
  • CUDA-U
  • CUDA-D
  • CUDA-U-Sim
  • CUDA-D-Sim
  • C++AMP
SLIDE 18

Benchmark Diversity

DATA PARTITIONING

Benchmark   Partitioning Granularity   Partitioned Data   System-wide Atomics   Load Balance
BS          Fine                       Output             None                  Yes
CEDD        Coarse                     Input, Output      None                  Yes
HSTI        Fine                       Input              Compute               No
HSTO        Fine                       Output             None                  No
PAD         Fine                       Input, Output      Sync                  Yes
RSCD        Medium                     Output             Compute               Yes
SC          Fine                       Input, Output      Sync                  No
TRNS        Medium                     Input, Output      Sync                  No

FINE-GRAIN TASK PARTITIONING

Benchmark   System-wide Atomics   Load Balance
RSCT        Sync, Compute         Yes
TQ          Sync                  No
TQH         Sync                  No

COARSE-GRAIN TASK PARTITIONING

Benchmark   System-wide Atomics   Partitioning    Concurrency
BFS         Sync, Compute         Iterative       No
CEDT        Sync                  Non-iterative   Yes
SSSP        Sync, Compute         Iterative       No

SLIDE 19

Evaluation Platform

  • AMD Kaveri A10-7850K APU
  • 4 CPU cores
  • 8 GPU compute units
  • AMD APP SDK 3.0
  • Profiling:
  • CodeXL
  • gem5-gpu
SLIDE 20

Benefits of Collaboration

  • Collaborative execution improves performance

Bézier Surfaces: up to 47% improvement over GPU-only execution.

Stream Compaction: up to 82% improvement over GPU-only execution.

[Charts: execution time (ms) on 1, 2, and 4 CPU cores, GPU only, and GPU + 1/2/4 CPU cores; Bézier Surfaces for 4x4, 8x8, and 12x12 (300x300) inputs and Stream Compaction for two datasets; the best configuration is marked]

SLIDE 21

Benefits of Collaboration

  • The optimal number of devices is not always the maximum, and it varies across datasets

Padding: up to 16% improvement over GPU-only execution.

Single-Source Shortest Path: up to 22% improvement over GPU-only execution.

[Charts: execution time (ms) on 1, 2, and 4 CPU cores, GPU only, and GPU + 1/2/4 CPU cores; SSSP for the NE, NY, and UT inputs and Padding for the 1000x999, 6000x5999, and 12000x11999 inputs; the best configuration is marked]

SLIDE 22

Benefits of Collaboration

SLIDE 23

Benefits of Unified Memory

[Chart: normalized execution time (kernel only) of the D (discrete) and U (unified) versions of BS, CEDD, HSTI, HSTO, PAD, RSCD, SC, TRNS, RSCT, TQ, TQH, BFS, CEDT, and SSSP]

Kernel times are comparable when the kernels are the same, though system-wide atomics sometimes make the Unified versions slower. Unified kernels can exploit more parallelism and avoid kernel launch overhead.

SLIDE 24

Benefits of Unified Memory

[Chart: normalized execution time of the D and U versions of each benchmark, broken into Kernel, Copy Back & Merge, and Copy To Device]

Unified versions avoid copy overhead (sketched below).
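To make the D vs. U distinction concrete, here is a minimal sketch (not the Chai code) of the same trivial kernel driven in the discrete style, with explicit copies, and in the unified style, where cudaMallocManaged removes the copies at the cost of the allocation itself. The kernel and buffer names are illustrative.

```cuda
// Sketch: "D" (discrete, explicit copies) vs. "U" (unified memory) versions
// of the same trivial kernel (not the Chai code).
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run_discrete(float *host, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);  // "Copy To Device"
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // "Copy Back & Merge"
    cudaFree(dev);
}

void run_unified(int n) {
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));  // "Allocation": managed/SVM allocation up front
    for (int i = 0; i < n; ++i) buf[i] = 1.0f;   // CPU writes the shared buffer directly
    scale<<<(n + 255) / 256, 256>>>(buf, n);     // no explicit copies in either direction
    cudaDeviceSynchronize();                     // after this, the CPU reads results in place
    cudaFree(buf);
}

int main() {
    const int n = 1 << 20;
    float *host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;
    run_discrete(host, n);
    run_unified(n);
    delete[] host;
    return 0;
}
```

The "Allocation" component measured on the next slide corresponds to the managed/SVM allocation call, which tends to cost more than a plain device allocation.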

SLIDE 25

Benefits of Unified Memory

[Chart: normalized execution time of the D and U versions of each benchmark, broken into Kernel, Copy Back & Merge, Copy To Device, and Allocation]

SVM allocation seems to take longer.

SLIDE 26

C++ AMP Performance Results

[Chart: speedup of the OpenCL-U and C++AMP versions of each benchmark, normalized to the faster of the two; three bars (4.37, 11.93, and 8.08) extend beyond the 0.5-2.5 axis range]

SLIDE 27

Benchmark Diversity

[Chart: per-kernel execution profiles (Occupancy, MemUnitBusy, CacheHit, VALUUtilization, VALUBusy; 0-100%) for BS, CEDD (gaussian, sobel, non-max, hysteresis), HSTI, HSTO, PAD, RSCD, SC, TRNS, TQ, TQH, BFS, CEDT (gaussian, sobel), RSCT, and SSSP]

[Chart: system-wide atomic operations per thousand cycles on the CPU and GPU for each benchmark; the highest values reach 49.5 and 64.8]

The benchmarks show varying intensity in their use of system-wide atomics and diverse execution profiles.

SLIDE 28

Benefits of Collaboration on FPGA

[Chart: execution time (s) broken into Idle, Copy, and Compute for single-device (CPU only, FPGA only) and collaborative (data- and task-partitioned) execution on Stratix V and Arria 10]

Case Study: Canny Edge Detection

Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.

Similar improvement from data and task partitioning

SLIDE 29

Benefits of Collaboration on FPGA

Case Study: Random Sample Consensus

[Chart: execution time (ms) of data partitioning and task partitioning on Stratix V and Arria 10, swept over a parameter ranging from 0.0 to 1.0]

Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.

Task partitioning exploits the disparity in the nature of the tasks.

SLIDE 30

Released

  • Website: chai-benchmarks.github.io
  • Code: github.com/chai-benchmarks/chai
  • Online Forum: groups.google.com/d/forum/chai-dev
  • Papers:
  • Chai: Collaborative Heterogeneous Applications for Integrated-architectures. ISPASS’17.
  • Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.

SLIDE 31

Chai: Collaborative Heterogeneous Applications for Integrated-architectures

Juan Gómez-Luna1, Izzat El Hajj2, Li-Wen Chang2, Víctor García-Flores3,4, Simon Garcia de Gonzalo2, Thomas B. Jablin2,5, Antonio J. Peña4, and Wen-mei Hwu2

1Universidad de Córdoba, 2University of Illinois at Urbana-Champaign, 3Universitat Politècnica de Catalunya, 4Barcelona Supercomputing Center, 5MulticoreWare, Inc.

URL: chai-benchmarks.github.io

Thank You!