spcl.inf.ethz.ch @spcl_eth
Polly-ACC: Transparent Compilation to Heterogeneous Hardware Torsten Hoefler (with Tobias Grosser)
Evading various “ends” – the hardware view
[Figure: sequential software (Fortran, C/C++ loop nests) on the left, parallel hardware on the right — a multi-core CPU next to a many-core GPU accelerator]
Design Goals
- Automatic
- "Regression free"
- High performance
Approach: automatic accelerator mapping
Non-Goal: algorithmic changes
Tool: Polyhedral Modeling
Iteration Space

[Figure: triangular iteration space for N = 4 — the points (0,0), (1,0), (1,1), (2,0), ..., (4,4), bounded by 0 ≤ i, 0 ≤ j, j ≤ i, and i ≤ N]

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

Program Code

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);
Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser et al., Parallel Processing Letters, 2012
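The correspondence between the loop nest and its polyhedral domain can be cross-checked with a short C sketch (`count_domain_points` is an illustrative name; the statement S(i,j) is replaced by a counter):

```c
#include <stdio.h>

/* Count the points of D = { (i,j) | 0 <= i <= N  and  0 <= j <= i }
   by walking the loop nest from the slide; S(i,j) becomes a counter.
   For N = 4 the triangular domain contains (N+1)(N+2)/2 = 15 points. */
int count_domain_points(int N) {
  int count = 0;
  for (int i = 0; i <= N; i++)
    for (int j = 0; j <= i; j++)
      count++; /* here the statement S(i,j) would execute */
  return count;
}
```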
Mapping Computation to Device
[Figure: the 2D iteration space is tiled onto device blocks and threads]

Block mapping:  { (j,k) → ( ⌊j/4⌋ mod 2, ⌊k/3⌋ mod 2 ) }
Thread mapping: { (j,k) → ( j mod 4, k mod 3 ) }
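Reading the two relations as a cyclic mapping of iterations onto a 2×2 grid of 4×3-thread blocks, they can be written as plain C functions (`block_of` and `thread_of` are invented names for illustration):

```c
/* Cyclic mapping of iteration (j,k) onto a 2x2 grid of 4x3-thread blocks:
   block  = ( floor(j/4) mod 2, floor(k/3) mod 2 )
   thread = ( j mod 4,          k mod 3 )
   Helper names are invented; tile sizes match the relations above. */
typedef struct { int x, y; } Pair;

Pair block_of(int j, int k)  { return (Pair){ (j / 4) % 2, (k / 3) % 2 }; }
Pair thread_of(int j, int k) { return (Pair){ j % 4, k % 3 }; }
```

For example, iteration (5,4) runs as thread (1,1) of block (1,1), while iteration (8,6) wraps around to block (0,0) again — the cyclic distribution lets a small grid cover an arbitrarily large iteration space.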
Memory Hierarchy of a Heterogeneous System
Host-device data transfers
Mapping onto fast memory
Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013
Profitability Heuristic
[Figure: profitability filter — of all loop nests, those that are trivial, unsuitable, or have insufficient compute are rejected via static modeling and dynamic execution checks; the remainder are mapped to the GPU]
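The static side of such a filter can be illustrated with a hypothetical check (field names and thresholds are invented; Polly-ACC's actual heuristic combines static modeling with dynamic execution checks):

```c
#include <stdbool.h>

/* Hypothetical profitability filter over loop nests, mirroring the slide's
   categories: trivial, unsuitable, and insufficient-compute nests stay on
   the CPU; the rest become GPU candidates. Thresholds are invented. */
typedef struct {
  long trip_count;       /* statically known trip count, -1 if unknown */
  double flops_per_byte; /* estimated arithmetic intensity */
} LoopNest;

bool map_to_gpu(LoopNest n) {
  if (n.trip_count < 0)       return false; /* unsuitable: dynamic bounds */
  if (n.trip_count < 1024)    return false; /* trivial: too little work   */
  if (n.flops_per_byte < 0.5) return false; /* insufficient compute       */
  return true;
}
```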
From kernels to program – data transfers
void heat(int n, float A[n], float hot, float cold) {
  float B[n];
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done\n", t);
  }
}
Data Transfer – Per Kernel
[Figure: timeline of per-kernel transfers — every kernel launch (initialize, setCenter, average, ...) is preceded by a host-to-device copy and followed by a device-to-host copy, so the data bounces between host memory and device memory for each kernel]
void heat(int n, float A[n], ...) {
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done\n", t);
  }
}
Data Transfer – Inter Kernel Caching
[Figure: timeline with inter-kernel caching — a single host-to-device copy before the first kernel and a single device-to-host copy at the end; intermediate results stay resident in device memory across kernel launches]
void heat(int n, float A[n], ...) {
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done\n", t);
  }
}
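The caching idea can be sketched in C with validity flags instead of real transfers (all names are invented; the counters stand in for actual host-device copies):

```c
#include <stdbool.h>

/* Sketch of inter-kernel caching: the runtime remembers whether the host
   and device copies of an array are current and only issues a transfer
   when a stale copy is about to be used. Counters model real copies. */
typedef struct {
  bool device_valid, host_valid;
  int h2d, d2h; /* number of transfers issued */
} ArrayState;

void ensure_on_device(ArrayState *a) {
  if (!a->device_valid) { a->h2d++; a->device_valid = true; }
}
void kernel_writes(ArrayState *a) { /* a GPU kernel updates the array */
  ensure_on_device(a);
  a->host_valid = false;
}
void host_reads(ArrayState *a) { /* e.g. the host inspects the result */
  if (!a->host_valid) { a->d2h++; a->host_valid = true; }
}
```

Three consecutive kernel launches on the same array then trigger a single host-to-device copy, and a final host read a single copy back — instead of one transfer pair per kernel as on the previous slide.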
Workstation: 10-core Sandy Bridge, NVIDIA Titan Black (Kepler)
Mobile: 4-core Haswell, NVIDIA GT730M (Kepler)
LLVM Nightly Test Suite

[Figure: number of compute regions/kernels (log scale, 1 to 10,000) detected as SCoPs and by dimensionality (0-dim to 3-dim), with and without the profitability heuristics]
Some results: Polybench 3.2
Speedup over icc -O3 — arithmetic mean: ~30x, geometric mean: ~6x
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Compiles all of SPEC CPU 2006 – Example: LBM

[Figure: LBM runtime (m:s, 0:00 to 8:24) on Mobile and Workstation for icc, icc -openmp, clang, and Polly-ACC — improvements of ~20% and ~4x]

Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop); the mobile system is essentially a 4-core x86 laptop with the (free) GPU that's in there.
Cactus ADM (SPEC 2006)
[Figure: Cactus ADM runtime on Workstation and Mobile]
Cactus ADM (SPEC 2006) - Data Transfer
[Figure: Cactus ADM data-transfer volume on Workstation and Mobile]
Polly-ACC
Automatic “Regression Free” High Performance
http://spcl.inf.ethz.ch/Polly-ACC
Brave new compiler world!?
How do we program GPUs today?
[Figure: latency hiding on a GPU — each compute core of the device interleaves loads (ld) and stores (st) from many active threads to cover instruction latency]

CUDA for on-device latency hiding
MPI for communication
synchronization between the two
…
Latency hiding at the cluster level?
dCUDA (distributed CUDA)

[Figure: latency hiding at the cluster level — compute cores interleave loads (ld), stores (st), and remote puts (put) from many active threads to cover both instruction and network latency]
Tobias Gysi, Jeremiah Baer, TH: “dCUDA: Hardware Supported Overlap of Computation and Communication”
Talk on Wednesday: Nov. 16th, 4:00–4:30pm, Room 355-D