Polly-ACC: Transparent Compilation to Heterogeneous Hardware - PowerPoint PPT Presentation


SLIDE 1

spcl.inf.ethz.ch @spcl_eth

Polly-ACC: Transparent Compilation to Heterogeneous Hardware Torsten Hoefler (with Tobias Grosser)


SLIDE 2

Evading various “ends” – the hardware view

SLIDE 3

row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
  output_image_offset = output_image_ptr;
  output_image_offset += dead_cols;
  col = 0;
  for (c = 0; c < NN - KK + 1; c++) {
    input_image_ptr = input_image;
    input_image_ptr += (NN * row);
    kernel_ptr = kernel;
S0: *output_image_offset = 0;
    for (i = 0; i < KK; i++) {
      input_image_offset = input_image_ptr;
      input_image_offset += col;
      kernel_offset = kernel_ptr;
      for (j = 0; j < KK; j++) {
S1:     temp1 = *input_image_offset++;
S1:     temp2 = *kernel_offset++;
S1:     *output_image_offset += temp1 * temp2;
      }
      kernel_ptr += KK;
      input_image_ptr += NN;
    }
S2: *output_image_offset = ((*output_image_offset) / normal_factor);
    output_image_offset++;
    col++;
  }
  output_image_ptr += NN;
  row++;
}

Fortran

C/C++

[Figure: a Multi-Core CPU (eight CPU cores) beside an Accelerator (a grid of many GPU cores)]

Sequential Software Parallel Hardware

SLIDE 4

Non-Goal: Algorithmic Changes


Design Goals

  • Automatic
  • “Regression Free”
  • High Performance

Automatic accelerator mapping

  • How close can we get?
SLIDE 5

Tool: Polyhedral Modeling

Iteration Space

[Figure: iteration space plotted over i (horizontal) and j (vertical), showing the points (0,0), (1,0), (1,1), (2,0), (2,1), (2,2), (3,0), (3,1), (3,2), (3,3), (4,0), (4,1), (4,2), (4,3), (4,4), bounded by 0 ≤ i, 0 ≤ j, j ≤ i, and i ≤ N = 4]

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

Program Code

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);


Polly: Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser et al., Parallel Processing Letters, 2012

SLIDE 6

Mapping Computation to Device

Device Blocks & Threads / Iteration Space

[Figure: the iteration space over indices j and k tiled onto device blocks and threads]

Block = { (j,k) → (⌊j/4⌋ mod 2, ⌊k/3⌋ mod 2) }
Thread = { (j,k) → (j mod 4, k mod 3) }

SLIDE 7

Memory Hierarchy of a Heterogeneous System

SLIDE 8

Host-device data transfers

SLIDE 9

Host-device data transfers

SLIDE 10

Mapping onto fast memory

SLIDE 11

Mapping onto fast memory

Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013

SLIDE 12

Profitability Heuristic

[Figure: a filter over all loop nests. Trivial and unsuitable nests are rejected statically; nests with insufficient compute are rejected by static and dynamic modeling; the remainder is executed on the GPU]

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
SLIDE 13

From kernels to program – data transfers

void heat(int n, float A[n], float hot, float cold) {
  float B[n] = {0};
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}

SLIDE 14

Data Transfer – Per Kernel

[Figure: per-kernel transfer timeline between host memory and device memory. Each of initialize(), setCenter(), and the three average() calls is bracketed by its own host-to-device copy before launch and device-to-host copy afterwards]

void heat(int n, float A[n], ...) {
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}

SLIDE 15

Data Transfer – Inter Kernel Caching

[Figure: inter-kernel caching timeline. A single host-to-device copy precedes initialize(), the cached device copy is reused across setCenter() and all average() calls, and a single device-to-host copy follows the last kernel]

void heat(int n, float A[n], ...) {
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}

SLIDE 16

Evaluation

Workstation: 10-core SandyBridge, NVIDIA Titan Black (Kepler)
Mobile: 4-core Haswell, NVIDIA GT730M (Kepler)

SLIDE 17

[Chart: number of compute regions (SCoPs) by loop-nest dimensionality (0-dim to 3-dim), with and without profitability heuristics, log scale from 1 to 10000]

LLVM Nightly Test Suite

# Compute Regions / Kernels

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
SLIDE 18

Some results: Polybench 3.2

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

arithmetic mean: ~30x; geometric mean: ~6x
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)

Speedup over icc -O3

SLIDE 19

[Chart: LBM runtime (m:s, 0:00 to 8:24) on Mobile and Workstation for icc, icc -openmp, clang, and Polly-ACC]

Compiles all of SPEC CPU 2006 – Example: LBM

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Runtime (m:s)

Workstation: Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Mobile: essentially my 4-core x86 laptop with the (free) GPU that’s in there

[Chart annotations: speedups of ~20% and ~4x]

SLIDE 20

Cactus ADM (SPEC 2006)

Workstation Mobile

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
SLIDE 21

Cactus ADM (SPEC 2006) - Data Transfer

Workstation Mobile

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
SLIDE 22

Polly-ACC


  • Automatic
  • “Regression Free”
  • High Performance

http://spcl.inf.ethz.ch/Polly-ACC

  • T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
SLIDE 23

  • Unfortunately not …
  • Limited to affine code regions
  • Maybe generalizes to control-restricted programs
  • No distributed anything!!
  • Good news:
  • Much of traditional HPC fits that model
  • Infrastructure is coming along
  • Bad news:
  • Modern data-driven HPC and Big Data fits less well
  • Need a programming model for distributed heterogeneous machines!


Brave new compiler world!?

SLIDE 24

How do we program GPUs today?

[Figure: a device compute core interleaving load (ld) and store (st) instructions from many active threads to hide instruction latency]

CUDA

  • over-subscribe hardware
  • use spare parallel slack for latency hiding

MPI

  • host controlled
  • full device synchronization

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 25

Latency hiding at the cluster level?

[Figure: a device compute core with many active threads issuing load (ld) instructions, hiding instruction latency by switching threads]

dCUDA (distributed CUDA)

  • unified programming model for GPU clusters
  • avoid unnecessary device synchronization to enable system wide latency hiding

[Figure: threads interleave loads (ld), stores (st), and remote puts (put), hiding latency across the cluster]

  • T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
SLIDE 26

Tobias Gysi, Jeremiah Baer, TH: “dCUDA: Hardware Supported Overlap of Computation and Communication”

Wednesday, Nov. 16th 4:00-4:30pm Room 355-D


Talk on Wednesday