SLIDE 1

Chai: Collaborative Heterogeneous Applications for Integrated-architectures

Juan Gómez-Luna1, Izzat El Hajj2, Li-Wen Chang2, Víctor García-Flores3,4, Simon Garcia de Gonzalo2, Thomas B. Jablin2,5, Antonio J. Peña4, and Wen-mei Hwu2

1Universidad de Córdoba, 2University of Illinois at Urbana-Champaign, 3Universitat Politècnica de Catalunya, 4Barcelona Supercomputing Center, 5MulticoreWare, Inc.

SLIDE 2

Motivation

  • Heterogeneous systems are moving towards tighter integration
  • Shared virtual memory, coherence, system-wide atomics
  • OpenCL 2.0, CUDA 8.0
  • A benchmark suite is needed for:
  • Analyzing collaborative workloads
  • Evaluating new architecture features
SLIDE 3

Application Structure

[Diagram: application structure decomposed into coarse-grain tasks, coarse-grain sub-tasks, fine-grain sub-tasks, and fine-grain tasks]

SLIDE 4

Data Partitioning

[Diagram: execution flow with the data of tasks A and B partitioned across devices]

SLIDE 5

Data Partitioning: Bézier Surfaces

  • Output surface points are distributed across devices

[Figure: grid of 3D surface points; legend distinguishes tiles of surface points processed on the CPU vs. the GPU, and individual 3D surface points processed on the CPU vs. the GPU]

SLIDE 6

Data Partitioning: Image Histogram

  • Input pixels distributed across devices (sketched below)
  • Output bins distributed across devices
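As a rough sketch of the input-partitioned variant (not the Chai code), the GPU and a CPU thread below each process one contiguous half of the pixels and update a single set of bins held in unified memory through system-wide atomics. The 50/50 split, buffer names, and kernel are illustrative; the example assumes a platform where the CPU and GPU can access unified memory concurrently and where system-wide atomics are supported (e.g., an integrated APU, or a Pascal-class GPU with CUDA 8.0).

```cuda
// Sketch: input-partitioned image histogram (not the Chai code).
// The GPU and a CPU thread each process half of the pixels and update one
// shared set of bins in unified memory with system-wide atomics.
#include <cuda_runtime.h>
#include <thread>
#include <cstdio>

const int NUM_BINS = 256;

__global__ void hist_gpu(const unsigned char *pixels, int begin, int end,
                         unsigned int *bins) {
    int i = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end)
        atomicAdd_system(&bins[pixels[i]], 1u);        // system-wide atomic on shared bins
}

void hist_cpu(const unsigned char *pixels, int begin, int end,
              unsigned int *bins) {
    for (int i = begin; i < end; ++i)
        __atomic_fetch_add(&bins[pixels[i]], 1u, __ATOMIC_RELAXED);  // host-side atomic
}

int main() {
    const int n = 1 << 22;                             // number of pixels
    unsigned char *pixels; unsigned int *bins;
    cudaMallocManaged(&pixels, n);
    cudaMallocManaged(&bins, NUM_BINS * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) pixels[i] = i % NUM_BINS;
    for (int b = 0; b < NUM_BINS; ++b) bins[b] = 0;

    int split = n / 2;                                 // GPU takes the first half of the input
    hist_gpu<<<(split + 255) / 256, 256>>>(pixels, 0, split, bins);
    std::thread t(hist_cpu, pixels, split, n, bins);   // CPU thread takes the second half
    t.join();
    cudaDeviceSynchronize();

    printf("bins[0] = %u\n", bins[0]);
    cudaFree(pixels);
    cudaFree(bins);
    return 0;
}
```

The output-partitioned variant would instead give each device its own range of bins, so no shared atomics on the bins are needed, at the cost of each device scanning more of the input.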

SLIDE 7

Data Partitioning: Padding

  • Rows are distributed across devices
  • Challenge: the operation is in-place, which requires inter-worker synchronization


SLIDE 8

Data Partitioning: Stream Compaction

  • Rows are distributed across devices
  • Like padding, but irregular and involves predicate computations


SLIDE 9

Data Partitioning: Other Benchmarks

  • Canny Edge Detection
  • Different devices process different images
  • Random Sample Consensus
  • Workers on different devices process different models
  • In-place Transposition
  • Workers on different devices follow different cycles
SLIDE 10

Types of Data Partitioning

  • Partitioning strategy:
  • Static (fixed work for each device)
  • Dynamic (contend on a shared worklist; see the sketch after this list)
  • Flexible interface for defining partitioning schemes
  • Partitioned data:
  • Input (e.g., Image Histogram)
  • Output (e.g., Bézier Surfaces)
  • Both (e.g., Padding)
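As a concrete illustration of the dynamic strategy, here is a minimal CUDA sketch of a shared worklist that GPU blocks and a CPU thread drain concurrently via system-wide atomics on a counter in unified memory. This is not the Chai partitioning interface: TILE, gpu_worker, and cpu_worker are illustrative names, and the example assumes hardware where the CPU and GPU can access unified memory concurrently and system-wide atomics are available (e.g., an integrated APU, or a Pascal-class GPU with CUDA 8.0).

```cuda
// Sketch: dynamic partitioning over a shared worklist (not the Chai code).
// CPU and GPU workers contend on the same counter with system-wide atomics.
#include <cuda_runtime.h>
#include <thread>

const int TILE = 1024;  // each work item is one tile of elements

__global__ void gpu_worker(int *next, float *data, int n) {
    __shared__ int tile;
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd_system(next, 1);       // grab the next tile (system scope)
        __syncthreads();
        int begin = tile * TILE;
        if (begin >= n) return;                      // worklist exhausted
        for (int i = begin + threadIdx.x; i < begin + TILE && i < n; i += blockDim.x)
            data[i] *= 2.0f;                         // stand-in for the real per-tile work
        __syncthreads();                             // keep 'tile' stable until all threads finish
    }
}

void cpu_worker(int *next, float *data, int n) {
    for (;;) {
        int tile = __atomic_fetch_add(next, 1, __ATOMIC_RELAXED);  // host atomic on the same counter
        int begin = tile * TILE;
        if (begin >= n) return;
        for (int i = begin; i < begin + TILE && i < n; ++i)
            data[i] *= 2.0f;
    }
}

int main() {
    const int n = 1 << 20;
    int *next; float *data;
    cudaMallocManaged(&next, sizeof(int));           // worklist counter visible to CPU and GPU
    cudaMallocManaged(&data, n * sizeof(float));
    *next = 0;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    gpu_worker<<<8, 256>>>(next, data, n);           // GPU blocks and one CPU thread pull tiles
    std::thread t(cpu_worker, next, data, n);        // from the same counter, concurrently
    t.join();
    cudaDeviceSynchronize();

    cudaFree(next);
    cudaFree(data);
    return 0;
}
```

Static partitioning is the degenerate case of this sketch: each device is handed a fixed range of tiles up front instead of contending on the shared counter.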
SLIDE 11

Fine-grain Task Partitioning

[Diagram: execution flow alternating fine-grain sub-tasks A and B across devices]

SLIDE 12

Fine-grain Task Partitioning: Random Sample Consensus

  • Data partitioning: models distributed across devices
  • Task partitioning: model fitting on the CPU and model evaluation on the GPU

[Diagram: per-iteration pipeline alternating Fitting (CPU) and Evaluation (GPU) stages]

SLIDE 13

Fine-grain Task Partitioning: Task Queue System

[Figure: task queue system. Synthetic tasks: the CPU enqueues short (Tshort) and long (Tlong) tasks, and the GPU dequeues and executes them until the queue is empty. Histogram tasks: the CPU reads image frames and enqueues them, and the GPU computes their histograms.]

SLIDE 14

Coarse-grain Task Partitioning

[Diagram: execution flow with coarse-grain tasks A and B assigned to different devices]

SLIDE 15

Coarse-grain Task Partitioning: Breadth-First Search & Single-Source Shortest Path

Small frontiers are processed on the CPU; large frontiers are processed on the GPU (see the sketch below).

SSSP performs more computation than BFS, which helps hide communication and memory latency.
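The dispatch idea can be sketched as a toy BFS (not the Chai kernels): each level's frontier is processed on the CPU when it is small and on the GPU when it is large. The threshold, graph, and helper names are illustrative, and unified memory is assumed so both devices see the same frontier and distance arrays.

```cuda
// Sketch: coarse-grain task partitioning for BFS (not the Chai code).
// Small frontiers run on the CPU; large frontiers launch a GPU kernel.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

__global__ void bfs_level_gpu(const int *row, const int *col, int *dist,
                              const int *frontier, int fsize,
                              int *next_frontier, int *next_size, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= fsize) return;
    int u = frontier[i];
    for (int e = row[u]; e < row[u + 1]; ++e) {
        int v = col[e];
        if (atomicCAS(&dist[v], -1, level) == -1)             // first visitor claims v
            next_frontier[atomicAdd(next_size, 1)] = v;
    }
}

void bfs_level_cpu(const int *row, const int *col, int *dist,
                   const int *frontier, int fsize,
                   int *next_frontier, int *next_size, int level) {
    for (int i = 0; i < fsize; ++i) {
        int u = frontier[i];
        for (int e = row[u]; e < row[u + 1]; ++e) {
            int v = col[e];
            if (dist[v] == -1) { dist[v] = level; next_frontier[(*next_size)++] = v; }
        }
    }
}

int main() {
    // Tiny CSR graph with directed edges 0->1, 0->2, 1->3, 2->3.
    const int n = 4;
    const int h_row[] = {0, 2, 3, 4, 4}, h_col[] = {1, 2, 3, 3};
    const int THRESHOLD = 1024;    // illustrative CPU/GPU cutover point

    int *row, *col, *dist, *frontier, *next, *next_size;
    cudaMallocManaged(&row, sizeof(h_row));
    cudaMallocManaged(&col, sizeof(h_col));
    cudaMallocManaged(&dist, n * sizeof(int));
    cudaMallocManaged(&frontier, n * sizeof(int));
    cudaMallocManaged(&next, n * sizeof(int));
    cudaMallocManaged(&next_size, sizeof(int));
    memcpy(row, h_row, sizeof(h_row));
    memcpy(col, h_col, sizeof(h_col));
    for (int i = 0; i < n; ++i) dist[i] = -1;

    dist[0] = 0;                   // source vertex
    frontier[0] = 0;
    int fsize = 1, level = 1;

    while (fsize > 0) {
        *next_size = 0;
        if (fsize < THRESHOLD) {   // small frontier: process on the CPU
            bfs_level_cpu(row, col, dist, frontier, fsize, next, next_size, level);
        } else {                   // large frontier: process on the GPU
            int blocks = (fsize + 255) / 256;
            bfs_level_gpu<<<blocks, 256>>>(row, col, dist, frontier, fsize,
                                           next, next_size, level);
            cudaDeviceSynchronize();
        }
        int *tmp = frontier; frontier = next; next = tmp;     // advance to the next level
        fsize = *next_size;
        ++level;
    }

    for (int i = 0; i < n; ++i)
        printf("dist[%d] = %d\n", i, dist[i]);
    return 0;
}
```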

SLIDE 16

Coarse-grain Task Partitioning: Canny Edge Detection

  • Data partitioning: images distributed across devices
  • Task partitioning: stages distributed across devices and pipelined

[Diagram: Gaussian Filter → Sobel Filter → Non-max Suppression → Hysteresis pipeline, with stages split between the CPU and the GPU]

SLIDE 17

Benchmarks and Implementations

Implementations (U = unified memory, D = discrete memory):

  • OpenCL-U
  • OpenCL-D
  • CUDA-U
  • CUDA-D
  • CUDA-U-Sim
  • CUDA-D-Sim
  • C++AMP
SLIDE 18

Benchmark Diversity

DATA PARTITIONING

Benchmark   Partitioning Granularity   Partitioned Data   System-wide Atomics   Load Balance
BS          Fine                       Output             None                  Yes
CEDD        Coarse                     Input, Output      None                  Yes
HSTI        Fine                       Input              Compute               No
HSTO        Fine                       Output             None                  No
PAD         Fine                       Input, Output      Sync                  Yes
RSCD        Medium                     Output             Compute               Yes
SC          Fine                       Input, Output      Sync                  No
TRNS        Medium                     Input, Output      Sync                  No

FINE-GRAIN TASK PARTITIONING

Benchmark   System-wide Atomics   Load Balance
RSCT        Sync, Compute         Yes
TQ          Sync                  No
TQH         Sync                  No

COARSE-GRAIN TASK PARTITIONING

Benchmark   System-wide Atomics   Partitioning    Concurrency
BFS         Sync, Compute         Iterative       No
CEDT        Sync                  Non-iterative   Yes
SSSP        Sync, Compute         Iterative       No

SLIDE 19

Evaluation Platform

  • AMD Kaveri A10-7850K APU
  • 4 CPU cores
  • 8 GPU compute units
  • AMD APP SDK 3.0
  • Profiling:
  • CodeXL
  • gem5-gpu
SLIDE 20

Benefits of Collaboration

  • Collaborative execution improves performance

Bézier Surfaces: up to 47% improvement over GPU-only execution.

Stream Compaction: up to 82% improvement over GPU-only execution.

[Charts: execution time (ms) on 1, 2, and 4 CPU cores, GPU only, and GPU + 1/2/4 CPU cores; Bézier Surfaces for 4x4, 8x8, and 12x12 (300x300) inputs and Stream Compaction for two datasets; the best configuration is marked]

SLIDE 21

Benefits of Collaboration

  • The optimal number of devices is not always the maximum, and it varies across datasets

Padding: up to 16% improvement over GPU-only execution.

Single-Source Shortest Path: up to 22% improvement over GPU-only execution.

[Charts: execution time (ms) on 1, 2, and 4 CPU cores, GPU only, and GPU + 1/2/4 CPU cores; SSSP for the NE, NY, and UT inputs and Padding for the 1000x999, 6000x5999, and 12000x11999 inputs; the best configuration is marked]

SLIDE 22

Benefits of Collaboration

SLIDE 23

Benefits of Unified Memory

[Chart: normalized execution time (kernel only) of the D (discrete) and U (unified) versions of BS, CEDD, HSTI, HSTO, PAD, RSCD, SC, TRNS, RSCT, TQ, TQH, BFS, CEDT, and SSSP]

Kernel times are comparable when the kernels are the same, though system-wide atomics sometimes make the Unified versions slower. Unified kernels can exploit more parallelism and avoid kernel launch overhead.

SLIDE 24

Benefits of Unified Memory

[Chart: normalized execution time of the D and U versions of each benchmark, broken into Kernel, Copy Back & Merge, and Copy To Device]

Unified versions avoid copy overhead (sketched below).
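To make the D vs. U distinction concrete, here is a minimal sketch (not the Chai code) of the same trivial kernel driven in the discrete style, with explicit copies, and in the unified style, where cudaMallocManaged removes the copies at the cost of the allocation itself. The kernel and buffer names are illustrative.

```cuda
// Sketch: "D" (discrete, explicit copies) vs. "U" (unified memory) versions
// of the same trivial kernel (not the Chai code).
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run_discrete(float *host, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);  // "Copy To Device"
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // "Copy Back & Merge"
    cudaFree(dev);
}

void run_unified(int n) {
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));  // "Allocation": managed/SVM allocation up front
    for (int i = 0; i < n; ++i) buf[i] = 1.0f;   // CPU writes the shared buffer directly
    scale<<<(n + 255) / 256, 256>>>(buf, n);     // no explicit copies in either direction
    cudaDeviceSynchronize();                     // after this, the CPU reads results in place
    cudaFree(buf);
}

int main() {
    const int n = 1 << 20;
    float *host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;
    run_discrete(host, n);
    run_unified(n);
    delete[] host;
    return 0;
}
```

The "Allocation" component measured on the next slide corresponds to the managed/SVM allocation call, which tends to cost more than a plain device allocation.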

SLIDE 25

Benefits of Unified Memory

[Chart: normalized execution time of the D and U versions of each benchmark, broken into Kernel, Copy Back & Merge, Copy To Device, and Allocation]

SVM allocation seems to take longer.

SLIDE 26

C++ AMP Performance Results

[Chart: speedup of the OpenCL-U and C++AMP versions of each benchmark, normalized to the faster of the two; three bars (4.37, 11.93, and 8.08) extend beyond the 0.5-2.5 axis range]

SLIDE 27

Benchmark Diversity

[Chart: per-kernel execution profiles (Occupancy, MemUnitBusy, CacheHit, VALUUtilization, VALUBusy; 0-100%) for BS, CEDD (gaussian, sobel, non-max, hysteresis), HSTI, HSTO, PAD, RSCD, SC, TRNS, TQ, TQH, BFS, CEDT (gaussian, sobel), RSCT, and SSSP]

[Chart: system-wide atomic operations per thousand cycles on the CPU and GPU for each benchmark; the highest values reach 49.5 and 64.8]

The benchmarks show varying intensity in their use of system-wide atomics and diverse execution profiles.

SLIDE 28

Benefits of Collaboration on FPGA

[Chart: execution time (s) broken into Idle, Copy, and Compute for single-device (CPU only, FPGA only) and collaborative (data- and task-partitioned) execution on Stratix V and Arria 10]

Case Study: Canny Edge Detection

Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.

Similar improvement from data and task partitioning

SLIDE 29

Benefits of Collaboration on FPGA

Case Study: Random Sample Consensus

[Chart: execution time (ms) of data partitioning and task partitioning on Stratix V and Arria 10, swept over a parameter ranging from 0.0 to 1.0]

Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.

Task partitioning exploits the disparity in the nature of the tasks.

SLIDE 30

Released

  • Website: chai-benchmarks.github.io
  • Code: github.com/chai-benchmarks/chai
  • Online Forum: groups.google.com/d/forum/chai-dev
  • Papers:
  • Chai: Collaborative Heterogeneous Applications for Integrated-architectures. ISPASS’17.
  • Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.

SLIDE 31

Chai: Collaborative Heterogeneous Applications for Integrated-architectures

Juan Gómez-Luna1, Izzat El Hajj2, Li-Wen Chang2, Víctor García-Flores3,4, Simon Garcia de Gonzalo2, Thomas B. Jablin2,5, Antonio J. Peña4, and Wen-mei Hwu2

1Universidad de Córdoba, 2University of Illinois at Urbana-Champaign, 3Universitat Politècnica de Catalunya, 4Barcelona Supercomputing Center, 5MulticoreWare, Inc.

URL: chai-benchmarks.github.io

Thank You!