
  1. Chai: Collaborative Heterogeneous Applications for Integrated-architectures Juan Gómez-Luna 1 , Izzat El Hajj 2 , Li-Wen Chang 2 , Víctor García-Flores 3,4 , Simon Garcia de Gonzalo 2 , Thomas B. Jablin 2,5 , Antonio J. Peña 4 , and Wen-mei Hwu 2 1 Universidad de Córdoba, 2 University of Illinois at Urbana-Champaign, 3 Universitat Politècnica de Catalunya, 4 Barcelona Supercomputing Center, 5 MulticoreWare, Inc.

  2. Motivation • Heterogeneous systems are moving towards tighter integration • Shared virtual memory, coherence, system-wide atomics • OpenCL 2.0, CUDA 8.0 • A benchmark suite is needed for • Analyzing collaborative workloads • Evaluating new architecture features
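
The integration features above (shared virtual memory and system-wide atomics, as exposed by CUDA 8.0) are what the benchmarks exercise. As a minimal sketch, not taken from the Chai sources: a host thread and a GPU kernel increment the same counter in unified memory. It assumes a device that reports concurrentManagedAccess and compilation with -arch=sm_60 or newer; the kernel name and counts are illustrative.

```cuda
// Minimal sketch (not Chai source code): CPU and GPU incrementing the same
// counter with system-wide atomics over unified memory (CUDA 8.0+).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_increments(int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd_system(counter, 1);   // atomic with respect to CPU accesses too
}

int main() {
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    if (!concurrent) { printf("concurrent managed access not supported\n"); return 0; }

    int *counter = nullptr;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    const int n = 1 << 20;
    gpu_increments<<<(n + 255) / 256, 256>>>(counter, n);   // asynchronous launch

    // The CPU collaborates on the same counter while the kernel is in flight.
    for (int i = 0; i < n; ++i)
        __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);   // GCC/Clang builtin

    cudaDeviceSynchronize();
    printf("counter = %d (expected %d)\n", *counter, 2 * n);
    cudaFree(counter);
    return 0;
}
```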

  3. Application Structure [Figure: applications are decomposed into fine-grain tasks with fine-grain sub-tasks and coarse-grain tasks with coarse-grain sub-tasks]

  4. Data Partitioning [Figure: execution flow; the data of tasks A and B is partitioned across devices]
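
A minimal sketch of this pattern (not Chai code): one output array is split at a fixed ratio, the GPU kernel runs asynchronously on its portion while the CPU loop processes the rest. The ratio alpha and the trivial "scale by 2" sub-task are placeholders, and concurrent CPU/GPU access to managed memory again assumes a device with concurrentManagedAccess.

```cuda
// Minimal sketch (not Chai code) of static data partitioning between CPU and GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_gpu(float *y, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];
}

int main() {
    const int n = 1 << 22;
    const float alpha = 0.75f;                     // fraction of work given to the GPU
    const int n_gpu = static_cast<int>(alpha * n);

    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = static_cast<float>(i);

    // GPU sub-task on its partition...
    scale_gpu<<<(n_gpu + 255) / 256, 256>>>(y, x, n_gpu);
    // ...while the CPU handles the remaining elements concurrently.
    for (int i = n_gpu; i < n; ++i) y[i] = 2.0f * x[i];

    cudaDeviceSynchronize();
    printf("y[0]=%f y[n-1]=%f\n", y[0], y[n - 1]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```

In practice the split ratio would be tuned per benchmark and per platform.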

  5. Data Partitioning: Bézier Surfaces • Output surface points are distributed across devices [Figure: 3D surface; tiles of surface points are processed on either the CPU or the GPU]

  6. Data Partitioning: Image Histogram • Input pixels distributed across devices • Output bins distributed across devices [Figure: CPU/GPU split of input pixels and of output bins]
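
A minimal sketch of the input-partitioned variant (not the Chai HSTI kernel): each device histograms half of the pixels into one shared bin array using system-wide atomics. The synthetic pixel data, bin count, and half/half split are assumptions; the same concurrentManagedAccess and -arch=sm_60 requirements apply.

```cuda
// Minimal sketch (not Chai code): input pixels partitioned, output bins shared.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NBINS 256

__global__ void hist_gpu(const unsigned char *pixels, int begin, int end,
                         unsigned int *bins) {
    int i = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end)
        atomicAdd_system(&bins[pixels[i]], 1u);    // bins are shared with the CPU
}

int main() {
    const int n = 1 << 22;
    const int n_gpu = n / 2;                       // static input partition

    unsigned char *pixels;
    unsigned int *bins;
    cudaMallocManaged(&pixels, n);
    cudaMallocManaged(&bins, NBINS * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) pixels[i] = (unsigned char)(rand() & 0xFF);
    for (int b = 0; b < NBINS; ++b) bins[b] = 0;

    // GPU processes the first half of the pixels...
    hist_gpu<<<(n_gpu + 255) / 256, 256>>>(pixels, 0, n_gpu, bins);

    // ...while the CPU processes the second half into the same bins.
    for (int i = n_gpu; i < n; ++i)
        __atomic_fetch_add(&bins[pixels[i]], 1u, __ATOMIC_RELAXED);

    cudaDeviceSynchronize();
    unsigned int total = 0;
    for (int b = 0; b < NBINS; ++b) total += bins[b];
    printf("total counted = %u (expected %d)\n", total, n);
    cudaFree(pixels); cudaFree(bins);
    return 0;
}
```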

  7. Data Partitioning: Padding • Rows are distributed across devices • Challenge: the operation is in-place, which requires inter-worker synchronization [Figure: CPU and GPU each pad a subset of the rows]

  8. Data Partitioning: Stream Compaction • Rows are distributed across devices • Like padding, but irregular and involving predicate computation [Figure: CPU and GPU each compact a subset of the rows]

  9. Data Partitioning: Other Benchmarks • Canny Edge Detection • Different devices process different images • Random Sample Consensus • Workers on different devices process different models • In-place Transposition • Workers on different devices follow different cycles

  10. Types of Data Partitioning • Partitioning strategy: static (fixed work for each device) or dynamic (devices contend on a shared worklist) • Partitioned data: input (e.g., Image Histogram), output (e.g., Bézier Surfaces), or both (e.g., Padding) • Flexible interface for defining partitioning schemes
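
A minimal sketch of the dynamic strategy (not the Chai partitioning interface itself): both devices pull tile indices from one shared counter with system-wide atomics, so the faster device simply ends up with more tiles. Tile size, tile count, and the placeholder per-tile work are assumptions, and concurrent CPU/GPU execution again requires concurrentManagedAccess.

```cuda
// Minimal sketch (not Chai code) of dynamic partitioning via a shared worklist.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 1024
#define NTILES 4096

// Each GPU block keeps grabbing tiles from the shared worklist until it is empty.
__global__ void worker_gpu(float *data, int *next_tile) {
    __shared__ int t;
    while (true) {
        if (threadIdx.x == 0) t = atomicAdd_system(next_tile, 1);
        __syncthreads();
        if (t >= NTILES) return;                   // uniform exit: all threads see t
        for (int i = t * TILE + threadIdx.x; i < (t + 1) * TILE; i += blockDim.x)
            data[i] += 1.0f;                       // placeholder per-tile sub-task
        __syncthreads();
    }
}

int main() {
    float *data; int *next_tile;
    cudaMallocManaged(&data, NTILES * TILE * sizeof(float));
    cudaMallocManaged(&next_tile, sizeof(int));
    for (int i = 0; i < NTILES * TILE; ++i) data[i] = 0.0f;
    *next_tile = 0;

    worker_gpu<<<64, 256>>>(data, next_tile);      // GPU workers

    // The CPU worker contends on the same worklist.
    while (true) {
        int t = __atomic_fetch_add(next_tile, 1, __ATOMIC_RELAXED);
        if (t >= NTILES) break;
        for (int i = t * TILE; i < (t + 1) * TILE; ++i)
            data[i] += 1.0f;
    }

    cudaDeviceSynchronize();
    printf("data[0]=%f data[last]=%f\n", data[0], data[NTILES * TILE - 1]);
    cudaFree(data); cudaFree(next_tile);
    return 0;
}
```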

  11. Fine-grain Task Partitioning [Figure: execution flow; sub-tasks A and B of each task are assigned to different devices]

  12. Fine-grain Task Partitioning: Random Sample Consensus • Data partitioning: models are distributed across devices • Task partitioning: model fitting on the CPU and model evaluation on the GPU [Figure: fitting and evaluation sub-tasks mapped to CPU and GPU under both schemes]
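
A minimal sketch of the fit-on-CPU / evaluate-on-GPU split (not the Chai RSCT implementation): a toy line model stands in for the real model. The GPU evaluates model m while the CPU is already fitting model m+1 into a different slot, so the two sub-tasks of consecutive iterations overlap; the sizes, threshold, and synthetic data are assumptions, and the overlap again relies on concurrentManagedAccess.

```cuda
// Minimal sketch (not Chai code): CPU fits candidate models, GPU evaluates them.
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define NPOINTS (1 << 18)
#define NMODELS 64
#define THRESH 0.1f

// GPU sub-task: count points within THRESH of the line y = a*x + b.
__global__ void evaluate(const float *px, const float *py, int n,
                         const float *model, int *inliers) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && fabsf(py[i] - (model[0] * px[i] + model[1])) < THRESH)
        atomicAdd(inliers, 1);
}

// CPU sub-task: fit a candidate line through two randomly chosen points.
static void fit(const float *px, const float *py, float *model) {
    int i = rand() % NPOINTS, j = rand() % NPOINTS;
    float dx = px[j] - px[i];
    if (fabsf(dx) < 1e-6f) dx = 1e-6f;             // avoid a degenerate pair
    model[0] = (py[j] - py[i]) / dx;               // slope
    model[1] = py[i] - model[0] * px[i];           // intercept
}

int main() {
    float *px, *py, *models; int *inliers;
    cudaMallocManaged(&px, NPOINTS * sizeof(float));
    cudaMallocManaged(&py, NPOINTS * sizeof(float));
    cudaMallocManaged(&models, 2 * NMODELS * sizeof(float));
    cudaMallocManaged(&inliers, NMODELS * sizeof(int));

    // Synthetic data: points on y = 0.5x + 1, with every 10th point an outlier.
    for (int i = 0; i < NPOINTS; ++i) {
        px[i] = (float)(rand() % 1000) / 100.0f;
        py[i] = 0.5f * px[i] + 1.0f + ((i % 10 == 0) ? (float)(rand() % 100) : 0.0f);
    }
    for (int m = 0; m < NMODELS; ++m) inliers[m] = 0;

    fit(px, py, &models[0]);                       // CPU fits the first model
    for (int m = 0; m < NMODELS; ++m) {
        // GPU evaluates model m while the CPU already fits model m + 1.
        evaluate<<<(NPOINTS + 255) / 256, 256>>>(px, py, NPOINTS,
                                                 &models[2 * m], &inliers[m]);
        if (m + 1 < NMODELS) fit(px, py, &models[2 * (m + 1)]);
    }
    cudaDeviceSynchronize();

    int best = 0;
    for (int m = 1; m < NMODELS; ++m)
        if (inliers[m] > inliers[best]) best = m;
    printf("best: y = %.3f*x + %.3f, %d inliers\n",
           models[2 * best], models[2 * best + 1], inliers[best]);
    cudaFree(px); cudaFree(py); cudaFree(models); cudaFree(inliers);
    return 0;
}
```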

  13. Fine-grain Task Partitioning: Task Queue System • Two variants: synthetic tasks (short and long) and histogram tasks [Figure: one device enqueues tasks (T short, T long, Histo.) into queues while the other reads and processes them until the queues are empty; CPU and GPU shown for both variants]

  14. Coarse-grain Task Partitioning [Figure: execution flow; tasks A and B are assigned to different devices]

  15. Coarse-grain Task Partitioning: Breadth-First Search & Single-Source Shortest Path • Small frontiers are processed on the CPU; large frontiers are processed on the GPU • SSSP performs more computation than BFS, which hides communication/memory latency
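
A minimal sketch of the frontier-size switch (not the Chai BFS code): each level is expanded either by a host loop or by a kernel, depending on the frontier size. Because the devices alternate rather than run concurrently, plain device-scope atomics suffice here; the threshold, the tiny 4-vertex CSR graph, and the launch shape are all illustrative.

```cuda
// Minimal sketch (not Chai code): level-synchronous BFS that switches devices
// per level based on frontier size.
#include <cstdio>
#include <cuda_runtime.h>

#define THRESHOLD 2            // frontier size at which work moves to the GPU (tunable)

// GPU sub-task: expand a large frontier level.
__global__ void expand_gpu(const int *row, const int *col, const int *frontier,
                           int fsize, int *next, int *nsize, int *level, int depth) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= fsize) return;
    int v = frontier[i];
    for (int e = row[v]; e < row[v + 1]; ++e) {
        int u = col[e];
        if (atomicCAS(&level[u], -1, depth) == -1)   // first visitor claims u
            next[atomicAdd(nsize, 1)] = u;
    }
}

// CPU sub-task: expand a small frontier level.
static void expand_cpu(const int *row, const int *col, const int *frontier,
                       int fsize, int *next, int *nsize, int *level, int depth) {
    for (int i = 0; i < fsize; ++i) {
        int v = frontier[i];
        for (int e = row[v]; e < row[v + 1]; ++e) {
            int u = col[e];
            if (level[u] == -1) { level[u] = depth; next[(*nsize)++] = u; }
        }
    }
}

int main() {
    // Tiny hypothetical graph in CSR form: edges (0,1), (1,2), (1,3), (2,3).
    const int n = 4, m = 8;
    const int h_row[] = {0, 1, 4, 6, 8};
    const int h_col[] = {1, 0, 2, 3, 1, 3, 1, 2};

    int *row, *col, *level, *frontier, *next, *nsize;
    cudaMallocManaged(&row, (n + 1) * sizeof(int));
    cudaMallocManaged(&col, m * sizeof(int));
    cudaMallocManaged(&level, n * sizeof(int));
    cudaMallocManaged(&frontier, n * sizeof(int));
    cudaMallocManaged(&next, n * sizeof(int));
    cudaMallocManaged(&nsize, sizeof(int));
    for (int i = 0; i <= n; ++i) row[i] = h_row[i];
    for (int i = 0; i < m; ++i) col[i] = h_col[i];
    for (int i = 0; i < n; ++i) level[i] = -1;

    level[0] = 0; frontier[0] = 0;                 // source vertex 0
    int fsize = 1, depth = 1;
    while (fsize > 0) {
        *nsize = 0;
        if (fsize < THRESHOLD) {                   // small frontier: CPU
            expand_cpu(row, col, frontier, fsize, next, nsize, level, depth);
        } else {                                   // large frontier: GPU
            expand_gpu<<<(fsize + 255) / 256, 256>>>(row, col, frontier, fsize,
                                                     next, nsize, level, depth);
            cudaDeviceSynchronize();
        }
        int *tmp = frontier; frontier = next; next = tmp;   // swap frontiers
        fsize = *nsize;
        ++depth;
    }
    for (int v = 0; v < n; ++v) printf("level[%d] = %d\n", v, level[v]);
    return 0;
}
```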

  16. Coarse-grain Task Partitioning: Canny Edge Detection • Data partitioning: images are distributed across devices • Task partitioning: pipeline stages (Gaussian filter, Sobel filter, non-max suppression, hysteresis) are distributed across devices and pipelined [Figure: whole-image processing per device vs. a CPU/GPU stage pipeline]
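
A minimal sketch of stage pipelining across devices (not the Chai CEDT code): a CPU stage and a GPU stage stand in for the Gaussian/Sobel and non-max-suppression/hysteresis halves of the pipeline, with a double-buffered intermediate frame so the GPU processes frame f-1 while the CPU prepares frame f. The frame sizes, placeholder stage bodies, and per-frame synchronization are assumptions; overlapping CPU and GPU work on managed memory again assumes concurrentManagedAccess.

```cuda
// Minimal sketch (not Chai code): a two-stage CPU/GPU pipeline over frames.
#include <cstdio>
#include <cuda_runtime.h>

#define W 1024
#define H 1024
#define NFRAMES 8

// GPU stage (stand-in for non-max suppression + hysteresis): threshold the frame.
__global__ void stage_gpu(const float *in, unsigned char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.5f ? 255 : 0;
}

// CPU stage (stand-in for Gaussian + Sobel): normalize the raw frame.
static void stage_cpu(const unsigned char *in, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i] / 255.0f;
}

int main() {
    const int n = W * H;
    unsigned char *raw, *edges;
    float *mid[2];                                 // double-buffered intermediate frame
    cudaMallocManaged(&raw, NFRAMES * n);
    cudaMallocManaged(&edges, NFRAMES * n);
    cudaMallocManaged(&mid[0], n * sizeof(float));
    cudaMallocManaged(&mid[1], n * sizeof(float));
    for (int i = 0; i < NFRAMES * n; ++i) raw[i] = (unsigned char)(i % 256);

    // Software pipeline: GPU consumes frame f-1 while the CPU produces frame f.
    stage_cpu(raw, mid[0], n);
    for (int f = 1; f <= NFRAMES; ++f) {
        stage_gpu<<<(n + 255) / 256, 256>>>(mid[(f - 1) & 1], &edges[(f - 1) * n], n);
        if (f < NFRAMES) stage_cpu(&raw[f * n], mid[f & 1], n);
        cudaDeviceSynchronize();                   // frame barrier before buffer reuse
    }

    printf("edges[0] = %d\n", edges[0]);
    cudaFree(raw); cudaFree(edges); cudaFree(mid[0]); cudaFree(mid[1]);
    return 0;
}
```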

  17. Benchmarks and Implementations • Implementations: OpenCL-U, OpenCL-D, CUDA-U, CUDA-D, CUDA-U-Sim, CUDA-D-Sim (for gem5-gpu simulation), and C++AMP • U = unified memory, D = discrete (copy-based)

  18. Benchmark Diversity

  Data Partitioning
  Benchmark  Partitioning Granularity  Partitioned Data  System-wide Atomics  Load Balance
  BS         Fine                      Output            None                 Yes
  CEDD       Coarse                    Input, Output     None                 Yes
  HSTI       Fine                      Input             Compute              No
  HSTO       Fine                      Output            None                 No
  PAD        Fine                      Input, Output     Sync                 Yes
  RSCD       Medium                    Output            Compute              Yes
  SC         Fine                      Input, Output     Sync                 No
  TRNS       Medium                    Input, Output     Sync                 No

  Fine-grain Task Partitioning
  Benchmark  System-wide Atomics  Load Balance
  RSCT       Sync, Compute        Yes
  TQ         Sync                 No
  TQH        Sync                 No

  Coarse-grain Task Partitioning
  Benchmark  System-wide Atomics  Partitioning   Concurrency
  BFS        Sync, Compute        Iterative      No
  CEDT       Sync                 Non-iterative  Yes
  SSSP       Sync, Compute        Iterative      No

  19. Evaluation Platform • AMD Kaveri A10-7850K APU • 4 CPU cores • 8 GPU compute units • AMD APP SDK 3.0 • Profiling: • CodeXL • gem5-gpu

  20. Benefits of Collaboration • Collaborative execution improves performance • Bézier Surfaces: up to 47% improvement over GPU-only • Stream Compaction: up to 82% improvement over GPU-only [Charts: execution time (ms) for 1, 2, and 4 CPU cores, GPU-only, and GPU + 1, 2, or 4 CPU cores; Bézier Surfaces shown for 4x4, 8x8, and 12x12 (300x300) configurations]

  21. Benefits of Collaboration • The optimal number of devices is not always the maximum and varies across datasets • Padding: up to 16% improvement over GPU-only • Single-Source Shortest Path: up to 22% improvement over GPU-only [Charts: execution time (ms) for 1, 2, and 4 CPU cores, GPU-only, and GPU + 1, 2, or 4 CPU cores; Padding shown for 12000x11999, 6000x5999, and 1000x999 matrices, SSSP for the NE, NY, and UT graphs]

  22. Benefits of Collaboration

  23. Benefits of Unified Memory • Kernel execution times of D and U versions are comparable (same kernels; system-wide atomics sometimes make Unified slower) • Unified kernels can exploit more parallelism • Unified kernels avoid kernel launch overhead [Chart: normalized kernel execution time for D and U versions of BS, CEDD, HSTI, HSTO, PAD, RSCD, SC, TRNS (data partitioning), RSCT, TQ, TQH (fine-grain task partitioning), BFS, CEDT, SSSP (coarse-grain task partitioning)]

  24. Benefits of Unified Memory • Unified versions avoid copy overhead [Chart: normalized execution time broken down into kernel, copy to device, and copy back & merge for D and U versions of all benchmarks]

  25. Benefits of Unified Memory • SVM allocation seems to take longer than regular allocation [Chart: normalized execution time broken down into allocation, kernel, copy to device, and copy back & merge for D and U versions of all benchmarks]
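
To make the D/U distinction in these charts concrete, here is a minimal sketch (not one of the Chai benchmarks) of the two coding styles: the discrete version allocates device buffers and pays for copy-to-device and copy-back steps, while the unified version allocates once with cudaMallocManaged and launches directly; the kernel and problem size are placeholders.

```cuda
// Minimal sketch (not Chai code): discrete (D) vs. unified (U) coding styles.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *y, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];
}

// D-style: explicit device allocation, copy to device, kernel, copy back.
static void run_discrete(const float *h_x, float *h_y, int n) {
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_y, d_x, n);
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
}

// U-style: one unified allocation, no explicit copies or merge step.
static void run_unified(int n) {
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (float)i;
    scale<<<(n + 255) / 256, 256>>>(y, x, n);
    cudaDeviceSynchronize();
    printf("unified:  y[1] = %f\n", y[1]);
    cudaFree(x); cudaFree(y);
}

int main() {
    const int n = 1 << 20;
    float *h_x = new float[n], *h_y = new float[n];
    for (int i = 0; i < n; ++i) h_x[i] = (float)i;
    run_discrete(h_x, h_y, n);
    printf("discrete: y[1] = %f\n", h_y[1]);
    run_unified(n);
    delete[] h_x; delete[] h_y;
    return 0;
}
```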

  26. C++AMP Performance Results [Chart: speedup of the OpenCL-U and C++AMP versions of each benchmark, normalized to the faster of the two; off-scale bars labeled 4.37, 11.93, and 8.08]

  27. Benchmark Diversity • Varying intensity in the use of system-wide atomics across benchmarks • Diverse execution profiles [Charts: system-wide atomic operations per thousand cycles for BS, CEDD, HSTI, HSTO, PAD, RSCD, SC, TRNS, RSCT, TQ, TQH, BFS, CEDT, and SSSP (per stage for CEDD and CEDT); GPU execution profiles with legend Occupancy, VALUBusy, MemUnitBusy, VALUUtilization, CacheHit]

  28. Benefits of Collaboration on FPGA • Case study: Canny Edge Detection on Stratix V and Arria 10 • Similar improvement from data partitioning and task partitioning [Chart: execution time (s) broken down into compute, copy, and idle for single-device (CPU, FPGA) and collaborative (data, task) configurations on each FPGA] • Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE'17 Vision Track.

  29. Benefits of Collaboration on FPGA • Case study: Random Sample Consensus on Stratix V and Arria 10 • Task partitioning exploits the disparity in the nature of the tasks [Chart: execution time (ms) of data partitioning vs. task partitioning on both FPGAs as the partitioning fraction varies from 0.0 to 1.0] • Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE'17 Vision Track.

  30. Released • Website: chai-benchmarks.github.io • Code: github.com/chai-benchmarks/chai • Online Forum: groups.google.com/d/forum/chai-dev • Papers: • Chai: Collaborative Heterogeneous Applications for Integrated-architectures. ISPASS’17 . • Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track .

  31. Chai: Collaborative Heterogeneous Applications for Integrated-architectures Juan Gómez-Luna 1 , Izzat El Hajj 2 , Li-Wen Chang 2 , Víctor García-Flores 3,4 , Simon Garcia de Gonzalo 2 , Thomas B. Jablin 2,5 , Antonio J. Peña 4 , and Wen-mei Hwu 2 1 Universidad de Córdoba, 2 University of Illinois at Urbana-Champaign, 3 Universitat Politècnica de Catalunya, 4 Barcelona Supercomputing Center, 5 MulticoreWare, Inc. URL: chai-benchmarks.github.io Thank You!
