SLIDE 1 Exascale N-body algorithms
for data analysis and simulation
- 1. Introduction and significance
- 2. N-body problems in high-dimensions
- 3. Scaling challenges
- 4. Computational kernels requirements
- 5. Comments on the productivity challenge
GEORGE BIROS
padas.ices.utexas.edu
SLIDE 2 With
- Bill March
- Bo Xiao
- Chenhan Yu
- Sameer Tharakan
- Dhairya Malhotra
SLIDE 3
N-body methods: O(N²) → O(N)
Gravity & Coulomb · Waves & Scattering · Fluids & Transport · Machine learning · Geostatistics · Image analysis
SLIDE 4
MATVEC or Kernel sum
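The kernel sum u_i = Σ_j K(x_i, x_j) w_j is exactly a matvec with the kernel matrix. A minimal dense-evaluation sketch, assuming a Gaussian kernel with bandwidth h (names and sizes are illustrative, not from the talk):

```python
import numpy as np

def kernel_sum(X_src, X_trg, w, h=1.0):
    """Direct O(N*M) kernel summation u_i = sum_j K(x_i, x_j) w_j
    with a Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    # Pairwise squared distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    d2 = (np.sum(X_trg**2, axis=1)[:, None]
          + np.sum(X_src**2, axis=1)[None, :]
          - 2.0 * X_trg @ X_src.T)
    K = np.exp(-np.maximum(d2, 0.0) / (2.0 * h**2))  # clamp tiny negatives
    return K @ w                                      # the matvec

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))   # N points in 3D
w = rng.standard_normal(1000)        # source weights
u = kernel_sum(X, X, w)              # self-interaction matvec
```

Direct evaluation like this is the O(N²) baseline that the tree-based methods below accelerate.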
SLIDE 5
Relation to PDEs
SLIDE 6
Relation to H-matrices
SLIDE 7 Nystrom method (and its variants)
- Most popular and effective method
- Low-rank decomposition for the kernel matrix G
(not an N-body algorithm)
- Uses matrix sampling theory; requires random
sampling of O(s) points, s is the target rank
Talwalkar et al. '10
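The plain Nystrom idea above can be sketched in a few lines: sample O(s) points uniformly, keep the corresponding columns C and intersection block W of the kernel matrix G, and use C W⁺ Cᵀ as a rank-s surrogate. The kernel, bandwidth, and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
# Wide-bandwidth Gaussian kernel => G is numerically low rank globally,
# the regime in which Nystrom works well
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
G = np.exp(-d2 / (2.0 * 10.0**2))

s = 50                                              # target rank
idx = rng.choice(len(X), size=s, replace=False)     # uniform sampling of s points
C = G[:, idx]                                       # N x s sampled columns
W = G[np.ix_(idx, idx)]                             # s x s intersection block
G_nys = C @ np.linalg.pinv(W) @ C.T                 # rank-s Nystrom approximation

rel_err = np.linalg.norm(G - G_nys) / np.linalg.norm(G)
```

When G has no global low-rank structure (the high-dimensional regime discussed later), this error stops improving with s, which is the limitation ASKIT addresses.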
SLIDE 8
N-body 101
SLIDE 9 Idea I: Far-field interactions
[Figure: two well-separated clusters of source points x_j and target points x_i]
LOW-RANK APPROXIMATION: O(N³) → O(N)
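The separation idea can be checked numerically: the interaction block between two well-separated clusters is numerically low rank, so its singular values decay rapidly. A sketch, assuming a 1/r (gravity/Coulomb) kernel with arbitrary sizes and separation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated clusters in 3D: sources near the origin, targets shifted away
sources = rng.uniform(-0.5, 0.5, size=(300, 3))
targets = rng.uniform(-0.5, 0.5, size=(300, 3)) + np.array([10.0, 0.0, 0.0])

# 1/r interaction block between the clusters (no singularity: clusters are disjoint)
diff = targets[:, None, :] - sources[None, :, :]
K = 1.0 / np.linalg.norm(diff, axis=-1)

# Separation makes the 300x300 block compressible to far fewer terms
s = np.linalg.svd(K, compute_uv=False)
numerical_rank = int(np.sum(s / s[0] > 1e-8))
```

The numerical rank depends on the separation ratio, not on the number of points, which is what makes the O(N) complexity possible.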
SLIDE 10
Idea II: Near-/Far-field split
SLIDE 11 Idea III: recursion
Rd
Matrix partitioning
[Figure: boxes 1, 2, 3, 4 of the recursive partition and the corresponding kernel-matrix blocks K11, K12, K32]
SLIDE 12 Questions
- Accurate far-field representation
- Optimal complexity
– N² → N log N → N
– Parallelization
– Good per-core performance
– Heterogeneous architectures
- For dim<4, all these are answered
SLIDE 13 Low dimensions
- Barnes & Hut’86 – treecodes
- Greengard & Rokhlin’87 – FMM
- Rokhlin’90 – high-frequency FMM
- Hackbusch & Novak’89 – Panel clustering
- Hackbusch’99 – H-Matrices
- Greengard & Gropp’91 – parallel
- Warren & Salmon’93 - parallel
SLIDE 14 Stokes flow, 18B dof, jump=1E9
Malhotra, Gholami, B., SC’14
[Plot: scaling over Stampede nodes (p); 1.42 TFLOP per node, 16-core Xeon, ~22% of peak]
SLIDE 15 High-dimensions
- Griebel et al’12: 200K points, synthetic 20D, real 6D, sequential
- Duraiswami et al’06 (IFGT): 100K points, 9D, low-order
accuracy, sequential
- Tharakan, March, & B: 4.5M points, 28D, Nystrom
- Lee, Vuduc & Gray: 20M points, 10D, low-order accuracy, parallel
SLIDE 16 Issues with high D
- Approximation of far-field
- (polynomial in ambient-D)
- Pruning (near-far field)
- (polynomial in ambient-D)
- Practically, no scalable algorithms
- no parallelism, no control of complexity/accuracy
- Nystrom method assumes global low-rank
- doesn’t scale with problem size
SLIDE 17 ASKIT SISC’15, SISC’16, ACHA’15, KDD’15, SC’15, IPDPS’15
- Far-field approximation—intrinsic D
- Nearest Neighbors: Pruning/Sampling
- Provable/predictable complexity estimate
- Provable/predictable error estimate
- Parallelism and performance
- Inspired by:
– Kapur & Long’98; Ying, B., & Zorin’03
– Halko, Martinsson, & Tropp’11
– Other related work ACA, CUR, Approximate QR
SLIDE 18 Far-field approximation
- G: n x m matrix ; compute factor(G)
- Subsample Gs, ns x m matrix
- factor(Gs) ~ factor(G)
- uniform sampling (large errors)
- norm sampling (better errors)
- leverage sampling (close to optimal, requires SVD(G))
- range space sampling (requires fast matvec with G)
- ASKIT: random sampling + nearest neighbors
– approximates norm sampling, behaves like leverage sampling
– SKELETONIZATION
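The sampling-plus-skeletonization idea can be sketched as follows: subsample rows of G, pick "skeleton" columns with a greedy column-pivoted scheme, and fit an interpolation matrix on the sample only. This is a simplified stand-in for ASKIT's interpolative decomposition, not its actual implementation:

```python
import numpy as np

def pivoted_columns(Gs, rank):
    """Greedy column-pivoted Gram-Schmidt: repeatedly pick the column with
    the largest residual norm (a simple ID-style pivoting rule)."""
    R = Gs.astype(float).copy()
    piv = []
    for _ in range(rank):
        j = int(np.argmax(np.sum(R**2, axis=0)))  # largest residual column
        piv.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)                   # deflate the chosen direction
    return piv

def skeletonize(G, rank, n_samples, rng):
    rows = rng.choice(G.shape[0], size=n_samples, replace=False)
    Gs = G[rows, :]                               # ns x m row-subsample
    skel = pivoted_columns(Gs, rank)              # skeleton columns
    # Interpolation matrix: G ~ G[:, skel] @ P, fit on the sampled rows only
    P = np.linalg.lstsq(Gs[:, skel], Gs, rcond=None)[0]
    return skel, P

rng = np.random.default_rng(0)
# Synthetic numerically low-rank matrix standing in for a far-field block
A = rng.standard_normal((400, 40)) @ rng.standard_normal((40, 200))
skel, P = skeletonize(A, rank=40, n_samples=120, rng=rng)
err = np.linalg.norm(A - A[:, skel] @ P) / np.linalg.norm(A)
```

The key point: the factorization is computed from the ns × m sample Gs, never from the full n × m block, which is what keeps the cost independent of n.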
SLIDE 19
Skeletonization : far field representation
SLIDE 20
Evaluation
SLIDE 21
Evaluation
SLIDE 22 Summary of ASKIT features
- Metric tree to create binary partition
- Approximate nearest-neighbors using RPTs
- Nearest neighbors for skeletonization
- Far-field sampling to construct approximate ID
- Pruning using tree Morton IDs
- Bottom-up pass to recursively create skeletons
- Top-down pass, treecode evaluation
- ID rank of the far field is adaptively chosen given an error tolerance (similar to the FMM tolerance)
SLIDE 23 Theoretical results
[Table: work and error estimates, compared with Nystrom]
SLIDE 24 Gaussian kernel
[Plots: 3D, 1M points (left); 64D ambient / 8D intrinsic, 1M points (right)]
SLIDE 25 8M points / 784 dimensions
multiclass kernel logistic regression
March, B. et al KDD’15, SC’15
SLIDE 26 8M points, 784 dimensions
[Plots: Nystrom vs. ASKIT]
SLIDE 27 Scaling
CLASSIFICATION: 1B points / 128D (~1TB)
March W., Yu C., B. et al., IPDPS’15
TACC Stampede; largest run: 144 s, 200 TFLOPS, 30% of peak
SLIDE 28 H-matrices and factorizations
inverse
SLIDE 29 Scaling for INV-ASKIT IPDPS’16
N=0.5M, D=54 N=4.5M, D=8 N=1.6M, D=784 N=16M, D=64
For run #11 we get 30 TFLOPS, ~35% of peak
SLIDE 30
Toward exascale
SLIDE 31 Scalability requirements
- Computational physics/chemistry/biology
– large scale – 1E13 unknowns (3D/4D)
– real time – 1E6 unknowns (1D/4D)
- Data analysis/Uncertainty quantification
– large scale – 1E15 unknowns (1D–1000D)
– real time / streaming / online
SLIDE 32 Computational Kernels
- Geometric – distance calculations / projections
- Analytic – special functions / fast transforms / differentiation
- Algebraic – QR factorization, Least squares
- Combinatorial – tree traversals / pruning / sorting / merging
- Statistical – sampling
- Memory – reshuffling, packing, permutation, communication
- Application – linear / nonlinear / optimization / time stepping / multiphysics
SLIDE 33 Kernel examples [problem size / node]
- Special function evaluations
- Sorting/merging [1E5-1E9]
- Median/selection [1-1E9]
- Tree-traversals / neighborhood construction
- Euclidean distance calculations [GEMM-like, 100–16,000]
- Nearest-neighbors [ GEMM-like + heapsort ]
- Kernel evaluations [ GEMM-like 100—16,000 ]
- FFTs [3D, size/dim O(10)–O(20)]
- High-order polynomial evaluations
- Rectangular GEMM [8–500 × 100,000]
- Pivoted QR factorizations [500 – 16,000]
SLIDE 34
Nearest neighbor kernel
[Figure: a query point x_i and its neighbors x_j]
SLIDE 35 GEMM for nearest-neighbors
- Input: O(Nd)
- Output: O(Nk)
- Calculations: O(N²d + Nk log k)
- Intermediate storage: O(N² + Nk)
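The counts above come from forming all pairwise squared distances with one GEMM (the O(N²d) term) and then selecting the k nearest per query. A sketch, using `argpartition` for the selection step in place of a heap:

```python
import numpy as np

def knn_gemm(X_query, X_ref, k):
    """k-nearest neighbors driven by a GEMM: squared distances via
    ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, then top-k selection."""
    d2 = (np.sum(X_query**2, axis=1)[:, None]
          + np.sum(X_ref**2, axis=1)[None, :]
          - 2.0 * X_query @ X_ref.T)              # the O(N^2 d) GEMM term
    idx = np.argpartition(d2, k, axis=1)[:, :k]   # k smallest, unordered
    # sort each row's k candidates by distance
    order = np.take_along_axis(d2, idx, axis=1).argsort(axis=1)
    return np.take_along_axis(idx, order, axis=1)

rng = np.random.default_rng(0)
Q = rng.standard_normal((100, 8))      # queries
R = rng.standard_normal((500, 8))      # reference points
nn = knn_gemm(Q, R, k=5)               # indices of the 5 nearest refs per query
```

Casting the distance computation as a GEMM is what lets this kernel reach a large fraction of peak, as in the custom GEMM + heap-selection kernel on the next slides.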
SLIDE 36
Custom GEMM + heap selection + arbitrary norms
SLIDE 37
Comparison with GEMM
SLIDE 38 The perspective of high- level application and library designers
Actually just my perspective; focus on a single node only
SLIDE 39 Workflow
Model selection → Discretization (correctness and speed) → Verification → Robustness / usability / reproducibility → Validation → PREDICT
High performance and efficiency
Our responsibility: develop algorithms that “respect” memory hierarchies, promote/expose SIMD, are work-optimal, concurrent, and fault tolerant
SLIDE 40 Our understanding of CPU
– Peak FLOPS = cores × FLOPS/instruction × instructions/cycle × cycles/sec (vectorize!)
– Memory hierarchy: registers – cache – RAM – disk
SLIDE 41 Code for correctness
C = A*B: the textbook triple loop over i, j, k
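The "code for correctness" version is the textbook triple loop: correct, but oblivious to registers, cache blocking, and SIMD. A sketch in Python, with NumPy's `@` standing in for a tuned BLAS GEMM:

```python
import numpy as np

def matmul_naive(A, B):
    """Textbook triple loop for C = A*B: correct, but ignores the
    memory hierarchy and vector units, hence far from peak."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

A = np.random.default_rng(0).standard_normal((60, 60))
B = np.random.default_rng(1).standard_normal((60, 60))
C = matmul_naive(A, B)
# Same answer as A @ B, but a tuned GEMM is orders of magnitude faster
```

The gap between this loop and a BLIS/MKL GEMM is exactly the 4X to 100X slowdown quantified on the next slide.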
SLIDE 42 Example
- C=A*B, all A,B,C 4000-by-4000
- 1 CORE
- Expected run time: 4000³ × 2 / 21E9 ≈ 6 s
- ICC -g : 80 s, 12X slower
- ICC -O: 48 s 8X slower
- ICC -O3: 25 s; 4X slower
- 8 CORES: 3.4 s vs 0.75 s expected (~4.5X to ~7X slower)
- 16 CORES, gcc -O3: 60secs (~100X slower)
2 × 8-core Xeon E5-2680, 2.7 GHz
SLIDE 43
Reality (BLIS)
FLAME/BLIS (van de Geijn group, UT Austin)
SLIDE 44 Challenges for exascale
- Large and constantly evolving design space
- Expanding complexity of
– simulation and data analysis – programming and profiling tools – underlying hardware
- No common abstract parallel machine model
- Static/synchronous scheduling increasingly hard
- Fault tolerance / resiliency
- End-to-end scalability
SLIDE 45 Gaps in
- programmability
- productivity
- maintainability
- performance
- scalability
- portability
- accessibility
- energy efficiency
- reproducibility
A large percentage of production scientific codes run at ~0.1–1% efficiency.
We would be happy to have productivity tools that can raise performance to 10%.
That requires 10X to 100X improvements.
SLIDE 46
Thank you