Should I port my code to a DSL?
Bahareh Davani · Ferran Marti · Laleh Beni · Saikiran Ramanan · Feng Liu Aparna Chandramowlishwaran
October 27, 2017 — Scholas Dagstuhl
Should I port my code to a DSL? Bahareh Davani Ferran Marti Laleh - - PowerPoint PPT Presentation
Should I port my code to a DSL? Bahareh Davani Ferran Marti Laleh Beni Saikiran Ramanan Feng Liu Aparna Chandramowlishwaran October 27, 2017 Scholas Dagstuhl actory PC https://en.wikipedia.org/wiki/Newport_Beach,_California C
Bahareh Davani · Ferran Marti · Laleh Beni · Saikiran Ramanan · Feng Liu Aparna Chandramowlishwaran
October 27, 2017 — Scholas Dagstuhl
https://en.wikipedia.org/wiki/Newport_Beach,_California
(“HIGH PERFORMANCE TURBULENT FLOW SIMULATIONS”)
CONTEXT: MOBO
(“MOVING BOUNDARIES”)
Citation: “Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures.” In SC’10. Winner, Gordon Bell Prize. http://dx.doi.org/10.1109/SC.2010.42
Prior work with same physical fidelity
1,200 cells: Sequential + integral equations Zinchenko et al. (2003) 14,000 cells: IBM BG/P + Lattice Boltzmann O(10k) unknowns/cell Clausen et al. (2010)
MoBo: 260 million cells (90 billion unknowns) on 200k cores (Jaguar @ ORNL)
CPU, GPU + integral equations + implicit AMR O(100) unknowns / cell Key to scaling: Optimal n-body methods based on the fast multipole method (FMM) on highly non-uniform domains
influence in 20th century
data mining
“Everyone is doing stencils.”
Anonymous Wolverine.
“Stencils are easy, they are structured”
Anonymous Chipmunk.
“We need separation of concerns” (drink!)
Anonymous Chupacabras.
“We need better performance models”
Anonymous Axolotl.
Do current frameworks capture stencil patterns in “real applications”? What is the gap between stencil DSLs and hand-
applications”? What is the right separation of concerns? Story time!
“Everyone is doing stencils.”
Anonymous Wolverine.
“Stencils are easy, they are structured”
Anonymous Chipmunk.
“We need separation of concerns” (drink!)
Anonymous Chupacabras.
“We need better performance models”
Anonymous Axolotl.
Do current frameworks capture stencil patterns in “real applications”? What is the gap between stencil DSLs and hand-
applications”? What is the right separation of concerns? Story time!
๏ 3D Unsteady Reynolds Averaged Navier-Stokes (URANS)
equations
๏ Dual time-stepping scheme
๏ Pseudo-time marching — multi-stage Runge-Kutta
scheme
๏ Marched to a steady state in pseudo time ๏ Spatial discretization of the residual ๏ 2nd order accurate
๏ Cell-centered stencils
๏ Most well-studied in literature
๏ Vertex-centered stencils
๏ More complex memory access pattern ๏ More memory-bound than cell-centered stencils
๏ Cell-centered stencils
๏ Most well-studied in literature
๏ Vertex-centered stencils
๏ More complex memory access pattern ๏ More memory-bound than cell-centered stencils
NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region
1 2 4 8 16 32 64 1 2 4 8 16 22 44 88 Number of threads Speedup Broadwell
~105x ~159x
NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region
1 2 4 8 16 32 64 1 2 4 8 16 32 64 Number of threads Speedup Abu Dhabi
NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region NUMA Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region SMT Region
1 2 4 8 16 32 1 2 4 8 16 32 Number of threads Speedup Haswell
~160x
+Strength Reduction +Fusion +Parallelism +NUMA +Blocking +SIMD Transformations +SIMD
Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h
0.5 1.0 2.0 4.0 8.0 16.0 32.0 64.0 128.0 256.0 512.0 1024.0 1/16 1/8 1/4 1/2 1 2 4 8 16 32 64 flop:DRAM byte ratio GFlops/s
Abu Dhabi
Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA NUMA Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth Peak stream bandwidth
0.5 1.0 2.0 4.0 8.0 16.0 32.0 64.0 128.0 256.0 512.0 1024.0 2048.0 1/16 1/8 1/4 1/2 1 2 4 8 16 32 64 128 flop:DRAM byte ratio GFlops/s
Broadwell
Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP Peak DP FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA FMA SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP ILP N U M A N U M A N U M A N U M A N U M A N U M A N U M A N U M A N U M A N U M A N U M A N U M A N U M A P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h P e a k s t r e a m b a n d w i d t h
0.5 1.0 2.0 4.0 8.0 16.0 32.0 64.0 128.0 256.0 512.0 1024.0 1/16 1/8 1/4 1/2 1 2 4 8 16 32 flop:DRAM byte ratio GFlops/s
Haswell
→ K.J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F . Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. PLDI ’13
Hand- tuned Halide Hand- tuned Halide Hand- tuned Halide
This gap is due to strength reduction and inter-stencil fusion in the hand-tuned code.
Hand- tuned Halide Hand- tuned Halide Hand- tuned Halide Optimization 3.5x 1.5x 3x 1.3x 3.1x 1.4x
Hand- tuned Halide Hand- tuned Halide Hand- tuned Halide Optimization 3.5x 1.5x 3x 1.3x 3.1x 1.4x +Vectorize 3.6x 1.1x 2.3x 1x 2.8x 1.2x
This gap is partly due to NUMA-aware parallelization in the hand-tuned code. (Halide is currently not NUMA-aware)
http://math.stanford.edu/~lexing/software/kifmm3d.tar.gz
qi = X
j
K(rij)φj
rij = |xi − yj|
K(r) = C r
Large, complex tuning spaces
36x faster (dual-socket quad-core x86)
Single-core, manually coded & tuned Data: Structure reorg. (transpose or “SOA”), NUMA-aware memory allocation Traffic: Matrix-free via interprocedural loop fusion, blocking/tiling Numerical: rsqrtps + Newton-Raphson (x86) Low-level: SIMD vectorization (x86) OpenMP parallelization/scheduling Algorithmic tuning
K(r) = C r
Force computation Nearest neighbors Kernel density estimation Range count
∀q ∈ Q : F(q) =
C r − q ||r − q||3 ∀q ∈ Q : AllNN(q) = argminr∈R d(q, r) ∀q ∈ Q : KDE(q) = 1 |R|
K(q, r) ∀q ∈ Q : Range(q) =
I(dist(q, r)) ≤ h)
Consider pairs of points – naïvely O(N2) What do these have in common?
Force computation
∀q ∈ Q : F(q) =
C r − q ||r − q||3
Evaluate interactions → Tree traversals Store aggregate data at nodes, e.g., bounding box, mass
force computations, e.g., Barnes-Hut or FMM
Problem Operators Kernel Function
All Nearest Neighbors All Range Search All Range Count Naive Bayes Classifier Mixture Model E-step K-means E-step Mixture Model Log-likelihood Kernel Density Estimation Kernel Density Bayes Classifier 2-point (cross-)correlation Nadaraya-Watson Regression Thermodynamic Average Largest-span set Closest Pair Minimum Spanning Tree Coulombic Interaction Average Density Wave function Hausdorff Distance Intrinsic (fractal) Dimension
∀, arg min ∀, ∪ arg
||xq − xr||
I(hmin < ||xq − xr|| < hmax) I(hmin < ||xq − xr|| < hmax)
∀, Σ
(1/ p 2π|Σk|)e− 1
2 (xi−µk)T Σ−1 k (xi−µk)P(Ck)∀, arg max
(1/ p 2π|Σk|)e− 1
2 (xi−µk)T Σ−1 k (xi−µk)∀, ∀ ∀, arg min
||xq − xr||
(1/ p 2π|Σk|)e− 1
2 (xi−µk)T Σ−1 k (xi−µk)X , log X
∀, Σ
φ(||xq − xr|| h ) φ(||xq − xr|| h )P(Ck)
∀, arg max Σ max, min ∀, Π
||xq − xr|| φ(||xq − xr||)
Σ, Σ
I(||xq − xr|| < h)
Σ, Σ
I(||xq − xr|| < h)
∀, Σ
yr φ(||xq − xr|| h )
Σ, Σ
φ(||xq − xr||)
max, ..., max Σ(||xq − xr||)
arg min, arg min
||xq − xr||
∀, arg min
||xq − xr||
∀, Σ
αqαr ||xq − xr||
Σ, Σ
I(||xq − xr|| < h)
Each problem has a set of operators and a kernel function
∀q, arg mink
r||xq − xr||
Storage query(filePathString1); Storage reference(filePathString2); PortalExpr expr; expr.addLayer(PortalOp(PortalOp::OP::FORALL), query); expr.addLayer(PortalOp(PortalOp::OP::KARGMIN, k), reference, PortalFunc(PortalFunc::TYPE::EUCLIDEAN)); Storage knnoutput = expr.execute();
v3 processor (Haswell-EP)
614.4 GFlops
I (||xq − xr|| ≤ h)
63 5.3 6.3 Base 6.2 143 3.5 8.9 Base 7.5 231 2.1 23.1 Base 14.5 98 2 12.3 Base 4.7 160 Base 13.3 2 4.5
50 100 150 200 250 Yahoo! HIGGS Census KDD IHEPC Speedup
MATLAB WEKA MLPACK Scikit PASCAL
201 5.2 22.3 Base 18.4 142 Base 7.9 1.6 3.9 104 1.4 6.1 Base 3.4 123 1.3 15.4 Base 7.7 98 1.5 6.1 Base 4.1
50 100 150 200 250 Yahoo! HIGGS Census KDD IHEPC Speedup
EM kNN
1.6×, 3.2×, and 53.7× respectively for the same dataset.
KNN EM KDE HD RS EMST Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Alg +Opt +Par Yahoo! 3.1 12.1 173.1 1.6 3.2 53.7 2.1 9.1 92.1 2.5 11.5 161.1 2.2 9.1 126.8 2.9 11.9 166.7 HIGGS 2.1 7.3 108.1 1.5 6.8 117.6 1.7 4.7 50.1 1.9 6.1 89.6 1.9 6.3 86.5 2.0 6.9 102.8 Census 1.4 6.5 90.8 1.3 11.2 190.0 1.4 8.1 75.6 1.3 10.2 141.8 1.3 10.4 144.9 1.4 10.9 151.6 KDD 1.6 6.8 100.7 1.4 4.1 70.9 1.5 3.1 33.5 1.4 3.8 54.4 1.4 5.1 70.5 1.5 3.8 55.5 IHEPC 3.0 4.3 61.5 1.5 7.6 127.6 2.0 5.4 53.6 2.5 6.8 101.3 2.1 6.3 94.1 2.9 7.1 107.1
CFD solvers can be expressed in stencil DSL’s with minimal effort. Portal can generate out-of-the-box new optimal N-body algorithms —O(N log N) EM and O(N) Hausdorff distance. Limitations
๏ Finding the optimal schedule for performance is non-
trivial.
๏ Most stencil DSL’s are only optimized for cell-
centered stencils.
๏ Does not support sufficient combination of