Exascale N-body algorithms for data analysis and simulation


SLIDE 1

Exascale N-body algorithms

for data analysis and simulation

  • 1. Introduction and significance
  • 2. N-body problems in high-dimensions
  • 3. Scaling challenges

  • Computational kernels requirements
  • Comments on the productivity challenge

GEORGE BIROS

padas.ices.utexas.edu

SLIDE 2

With

  • Bill March
  • Bo Xiao
  • Chenhan Yu
  • Sameer Tharakan
  • Dhairya Malhotra
SLIDE 3

N-body methods: O(N²) → O(N)

  • Gravity & Coulomb
  • Waves & Scattering
  • Fluids & Transport
  • Machine learning
  • Geostatistics
  • Image analysis

SLIDE 4

MATVEC or Kernel sum
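The kernel-sum MATVEC on this slide, u_i = Σ_j K(x_i, x_j) w_j, can be sketched directly. A minimal pure-Python version with a Gaussian kernel (the bandwidth h and the toy data are illustrative assumptions) gives the O(N²) baseline that the fast methods in the rest of the deck accelerate:

```python
# Direct kernel summation ("MATVEC"): u_i = sum_j K(x_i, x_j) w_j.
# O(N^2) work; this is the dense baseline for the fast methods below.
import math

def gaussian_kernel(xi, xj, h=1.0):
    # Gaussian kernel in 1D with bandwidth h (illustrative choice)
    return math.exp(-((xi - xj) ** 2) / (2.0 * h * h))

def kernel_matvec(points, weights, h=1.0):
    """Dense O(N^2) evaluation of u = K w."""
    u = []
    for xi in points:
        u.append(sum(gaussian_kernel(xi, xj, h) * wj
                     for xj, wj in zip(points, weights)))
    return u

points = [0.0, 0.5, 1.0]
weights = [1.0, 1.0, 1.0]
u = kernel_matvec(points, weights)
```

Every entry of u touches every source point, which is exactly the N² cost that the far-field ideas on the following slides remove.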

SLIDE 5

Relation to PDEs

SLIDE 6

Relation to H-matrices

SLIDE 7

Nystrom method (and its variants)

  • Most popular and effective method
  • Low-rank decomposition for the kernel matrix G

(not an N-body algorithm)

  • Uses matrix sampling theory; requires random sampling of O(s) points, where s is the target rank

  • Work and error estimates [equations on slide]

Talwalkar et al, 10
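The Nystrom idea above, G ≈ C W⁺ Cᵀ with C the sampled columns and W the sampled block, can be sketched in a few lines. Here with a single landmark (s = 1), and a rank-1 test kernel G_ij = a_i·a_j chosen (as an assumption) so that the approximation is exact:

```python
# Nystrom sketch: approximate G by C W^+ C^T from s sampled columns.
# With s = 1 the sampled block W is the scalar G[p][p].
def nystrom_rank1(G, p):
    """Nystrom approximation of G from the single landmark column p."""
    n = len(G)
    Wpp = G[p][p]  # 1x1 sampled block W; assumed nonzero
    return [[G[i][p] * G[p][j] / Wpp for j in range(n)] for i in range(n)]

a = [1.0, 2.0, 3.0]
G = [[ai * aj for aj in a] for ai in a]   # rank-1 kernel matrix (test case)
G_approx = nystrom_rank1(G, p=0)
```

Because the test matrix has exact rank 1, one landmark reproduces G; for a general kernel matrix the quality depends on how well s columns capture its range, which is the "global low rank" assumption criticized on a later slide.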

SLIDE 8

N-body 101

SLIDE 9

Idea I: Far-field interactions

[Figure: well-separated clusters of source points x_j and target points, with separation between the clusters]

LOW-RANK APPROXIMATION: O(N³) → O(N)

SLIDE 10

Idea II: Near-/Far-field split

SLIDE 11

Idea III: recursion

[Figure: recursive partitioning of points in R^d into cells 1–4 and the induced block partitioning K11, K12, K32, … of the kernel matrix]
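Ideas I–III combine into a toy 1D treecode: a cell that is well separated from the target (width/distance < θ) is compressed to a single interaction with its weighted center (a crude one-term far field), near cells are summed directly, and the tree supplies the recursion. The kernel, θ, and the data are illustrative assumptions; setting θ = 0 recovers the exact O(N²) sum:

```python
# Toy 1D treecode (Barnes-Hut flavor). Assumes distinct points in [lo, hi).
def kernel(x, y):
    return 1.0 / (1.0 + abs(x - y))   # illustrative smooth kernel

def build(points, weights, lo, hi, leaf=2):
    idx = [i for i, x in enumerate(points) if lo <= x < hi]
    wsum = sum(weights[i] for i in idx)
    node = {"lo": lo, "hi": hi, "idx": idx, "w": wsum,
            "center": (sum(points[i] * weights[i] for i in idx) / wsum)
                      if idx else 0.5 * (lo + hi),
            "children": []}
    if len(idx) > leaf:                      # Idea III: recursion
        mid = 0.5 * (lo + hi)
        node["children"] = [build(points, weights, lo, mid, leaf),
                            build(points, weights, mid, hi, leaf)]
    return node

def evaluate(node, points, weights, x, theta=0.5):
    if not node["idx"]:
        return 0.0
    width = node["hi"] - node["lo"]
    dist = abs(x - node["center"])
    if dist > 0 and width / dist < theta:    # Idea I: far field, one term
        return kernel(x, node["center"]) * node["w"]
    if not node["children"]:                 # Idea II: near field, direct
        return sum(kernel(x, points[i]) * weights[i] for i in node["idx"])
    return sum(evaluate(c, points, weights, x, theta)
               for c in node["children"])

pts = [0.05, 0.1, 0.15, 0.8, 0.85, 0.9]
wts = [1.0] * 6
tree = build(pts, wts, 0.0, 1.0)
exact = sum(kernel(2.0, p) * w for p, w in zip(pts, wts))
approx = evaluate(tree, pts, wts, 2.0)
```

The acceptance test width/dist < θ is the 1D analogue of the separation criterion in the figure; in high dimensions both the far-field representation and this pruning test are exactly what breaks down (slide 16).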

SLIDE 12

Questions

  • Accurate far-field representation
  • Optimal complexity

– N² → N log N → N

  • Error bounds
  • HPC

– Parallelization
– Good per-core performance
– Heterogeneous architectures

  • For dim < 4, all of these questions are answered
SLIDE 13

Low dimensions

  • Barnes & Hut’86 – Treecodes
  • Greengard & Rokhlin’87 – FMM
  • Rokhlin’90 – high-frequency FMM
  • Hackbusch & Nowak’89 – Panel clustering
  • Hackbusch’99 – H-Matrices
  • Greengard & Gropp’91 – parallel
  • Warren & Salmon’93 - parallel
SLIDE 14

Stokes flow, 18B dof, jump=1E9

Malhotra, Gholami, B., SC’14
p: Stampede nodes; per node: 1.42 TFLOP, Xeon, 16 cores, ~22% of peak

SLIDE 15

High-dimensions

  • Griebel et al’12: 200K points, synthetic 20D, real 6D, sequential
  • Duraiswami et al’06 (IFGT): 100K points, 9D, low-order accuracy, sequential

  • Tharakan, March, & B: 4.5M points, 28D, Nystrom
  • Lee, Vuduc & Gray: 20M points, 10D, low-order accuracy, parallel
SLIDE 16

Issues with high D

  • Approximation of far-field
    – polynomial in ambient D
  • Pruning (near-/far-field split)
    – polynomial in ambient D
  • Practically, no scalable algorithms
    – no parallelism, no control of complexity/accuracy
  • Nystrom method assumes global low rank
    – doesn’t scale with problem size
SLIDE 17

ASKIT SISC’15, SISC’16, ACHA’15, KDD’15, SC’15, IPDPS’15

  • Far-field approximation—intrinsic D
  • Nearest Neighbors: Pruning/Sampling
  • Provable/predictable complexity estimate
  • Provable/predictable error estimate
  • Parallelism and performance
  • Inspired by:

– Kapur & Long’98; Ying, B. & Zorin’03
– Halko, Martinsson & Tropp’11
– Other related work: ACA, CUR, approximate QR

SLIDE 18

Far-field approximation

  • G: n × m matrix; compute factor(G)
  • Subsample Gs, an ns × m matrix
  • factor(Gs) ≈ factor(G)
  • uniform sampling (large errors)
  • norm sampling (better errors)
  • leverage sampling (close to optimal, requires SVD(G))
  • range-space sampling (requires fast matvec(G))
  • ASKIT: random sampling + nearest neighbors

– approximates norm sampling, behaves like leverage sampling
– SKELETONIZATION
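The gap between uniform and norm sampling above is just the choice of sampling probabilities: uniform uses p_j = 1/m, norm sampling weights column j by its squared norm, p_j = ‖G_j‖² / ‖G‖_F². A sketch of the norm-sampling probabilities (the tiny matrix is an illustrative assumption):

```python
# Column-norm sampling probabilities: p_j = ||G_j||^2 / ||G||_F^2.
# Columns with large norm (large influence on the factorization) are
# sampled more often; uniform sampling would use p_j = 1/m for all j.
def column_norm_probs(G):
    m = len(G[0])
    sq = [sum(G[i][j] ** 2 for i in range(len(G))) for j in range(m)]
    total = sum(sq)
    return [s / total for s in sq]

G = [[3.0, 0.0, 1.0],
     [4.0, 0.0, 1.0]]
probs = column_norm_probs(G)   # column 0 carries most of the mass
```

A uniform sampler would pick the all-zero column a third of the time, which is exactly the "large errors" failure mode listed above.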

SLIDE 19

Skeletonization : far field representation

SLIDE 20

Evaluation

SLIDE 21

Evaluation

SLIDE 22

Summary of ASKIT features

  • Metric tree to create binary partition
  • Approximate nearest-neighbors using RPTs
  • Nearest neighbors for skeletonization
  • Far-field sampling to construct approximate ID
  • Pruning using tree Morton IDs
  • Bottom-up pass to recursively create skeletons
  • Top-down pass, treecode evaluation
  • ID rank of far field is adaptively chosen given an error tolerance (similar to the FMM tolerance)
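The "pruning using tree Morton IDs" bullet can be illustrated abstractly: if a node's ID encodes its root-to-node path, ancestor/descendant queries reduce to prefix tests, with no pointer chasing. A minimal sketch for a binary tree (the tuple encoding is an illustrative assumption, not ASKIT's actual bit packing):

```python
# Path-based node IDs for a binary tree: the ID records the left(0)/right(1)
# choices from the root, so "is a an ancestor of b" is a prefix test.
def morton_id(path):
    """path: sequence of 0 (left) / 1 (right) choices from the root."""
    return tuple(path)

def is_ancestor(a, b):
    """True iff node a lies on the root-to-b path (prefix comparison)."""
    return len(a) <= len(b) and b[:len(a)] == a
```

In practice the path bits are packed into an integer, and pruning a whole subtree against a query amounts to one such comparison instead of a traversal.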

SLIDE 23

Theoretical results

[Table: work and error estimates, compared against Nystrom]

SLIDE 24

Gaussian kernel

[Plots: 3D, 1M points; 64D (8D intrinsic), 1M points]

SLIDE 25

8 M points / 784 dimensions

  • OCR using multiclass kernel logistic regression

March, B. et al KDD’15, SC’15

SLIDE 26

8M points, 784 dimensions

[Plots: NYSTROM vs. ASKIT]

SLIDE 27

Scaling

CLASSIFICATION: 1B points / 128D (~1 TB)

March W, Yu C, B. et al, IPDPS’15

TACC Stampede; largest run: 144 s, 200 TFLOPS, 30% of peak

SLIDE 28

H-matrices and factorizations

inverse

SLIDE 29

Scaling for INV-ASKIT IPDPS’16

N=0.5M, D=54 N=4.5M, D=8 N=1.6M, D=784 N=16M, D=64

For run #11 we get 30 TFLOPS (~35% of peak)

SLIDE 30

Toward exascale

SLIDE 31

Scalability requirements

  • Computational physics/chemistry/biology

– large scale
– 1E13 unknowns (3D/4D)

  • Graphics/signal analysis

– real time
– 1E6 unknowns (1D/4D)

  • Data analysis/Uncertainty quantification

– large scale
– 1E15 unknowns (1D – 1000D)
– real time / streaming / online

SLIDE 32

Computational Kernels

  • Geometric – distance calculations / projections
  • Analytic – special functions / fast transforms / differentiation
  • Algebraic – QR factorization, Least squares
  • Combinatorial – tree traversals / pruning / sorting / merging
  • Statistical – sampling
  • Memory – reshuffling, packing, permutation, communication
  • Application – linear / nonlinear / optimization / time stepping / multiphysics

SLIDE 33

Kernel examples [problem size / node]

  • Special function evaluations
  • Sorting/merging [1E5-1E9]
  • Median/selection [1-1E9]
  • Tree-traversals / neighborhood construction
  • Euclidean distance calculations [GEMM-like, 100–16,000]
  • Nearest-neighbors [ GEMM-like + heapsort ]
  • Kernel evaluations [ GEMM-like 100—16,000 ]
  • FFTs [3D, size/dim O(10)–O(20)]
  • High-order polynomial evaluations
  • Rectangular GEMM [ 8-500 X 100,000]
  • Pivoted QR factorizations [500 – 16,000]
SLIDE 34

Nearest neighbor kernel

[Figure: query point x_i and neighbor point x_j]

SLIDE 35

GEMM for nearest-neighbors

  • Input: O(Nd)
  • Output: O(Nk)
  • Calculations: O(N²d + Nk log k)
  • Intermediate storage: O(N² + Nk)
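A sketch of how this distance kernel becomes GEMM-like: expanding ||x - y||² = ||x||² + ||y||² - 2 x·y, the O(N²d) term is an inner-product matrix (a GEMM in a tuned implementation), followed by an O(Nk log k)-style selection, here via heapq. Pure Python; the data and k are illustrative assumptions:

```python
# k-nearest neighbors via the GEMM trick:
#   d(x, y)^2 = ||x||^2 + ||y||^2 - 2 x.y
# The dominant O(N^2 d) cost is the inner-product matrix; selection of the
# k smallest distances per row is done with a heap.
import heapq

def knn(points, k):
    sq = [sum(c * c for c in p) for p in points]   # ||x||^2, O(Nd)
    result = []
    for i, p in enumerate(points):
        # one row of the squared-distance matrix (GEMM-like part)
        d2 = [sq[i] + sq[j] - 2.0 * sum(a * b for a, b in zip(p, q))
              for j, q in enumerate(points)]
        d2[i] = float("inf")                       # exclude self-match
        result.append(heapq.nsmallest(k, range(len(points)),
                                      key=d2.__getitem__))
    return result

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
nbrs = knn(pts, k=1)
```

The next slide's custom GEMM fuses the inner-product computation with the heap selection, avoiding the O(N²) intermediate distance matrix listed above.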
SLIDE 36

Custom GEMM + heap selection + arbitrary norms

SLIDE 37

Comparison with GEMM

SLIDE 38

The perspective of high-level application and library designers

actually just my perspective - focus on single node only

SLIDE 39

Workflow

[Workflow: Model selection → Discretization (correctness and speed) → Verification (robustness / usability / reproducibility) → Validation → PREDICT (high performance and efficiency)]

Our responsibility: develop algorithms that “respect” memory hierarchies, promote/expose SIMD, are work-optimal, concurrent, and fault tolerant

SLIDE 40

Our understanding of CPU

  • FLOPS =
    – Cores ×
    – FLOPS/instruction ×
    – Instructions/cycle ×
    – Cycles/sec
  • Try to block and vectorize
  • MOPS (memory hierarchy)
    – Registers
    – Cache
    – RAM
    – DISK
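Plugging numbers into the FLOPS model above for the 2 × 8-core E5-2680 (2.7 GHz) used in the matmul experiment two slides later; 4 flops/instruction (AVX doubles) and 2 instructions/cycle (separate add and multiply issue) are assumptions about that Sandy Bridge microarchitecture:

```python
# Peak FLOPS = cores x flops/instruction x instructions/cycle x cycles/sec
cores = 16            # 2 sockets x 8 cores
flops_per_instr = 4   # AVX: 4 doubles per vector instruction (assumption)
instr_per_cycle = 2   # one add + one multiply issued per cycle (assumption)
cycles_per_sec = 2.7e9

peak = cores * flops_per_instr * instr_per_cycle * cycles_per_sec
per_core = peak / cores   # this is the ~21E9 figure in the timing example
```

Per core this gives 21.6 GFLOP/s, which is where the "21E9" denominator in the expected-run-time estimate on the next slide comes from.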

SLIDE 41

Code for correctness

[Figure: C = A*B as the naive i, j, k triple loop over C, A, B]
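The "code for correctness" version of C = A*B is the textbook i, j, k triple loop, sketched here in Python for clarity; it is correct but blind to caches and SIMD, which is what the compiler-flag and core-count timings on the next slide quantify:

```python
# Naive matrix multiply: correct, but cache-oblivious and unvectorized.
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):          # innermost: strided access to B
                C[i][j] += A[i][k] * B[k][j]
    return C

C = matmul([[1.0, 2.0], [3.0, 4.0]],
           [[5.0, 6.0], [7.0, 8.0]])
```

Tuned GEMMs (e.g. the BLIS results on slide 43) keep the same arithmetic but block the loops for registers and cache and vectorize the inner kernel.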

SLIDE 42

Example

  • C=A*B, all A,B,C 4000-by-4000
  • 1 CORE
  • Expected run time: 4000³ × 2 / 21E9 ≈ 6 s
  • ICC -g : 80 s, 12X slower
  • ICC -O: 48 s 8X slower
  • ICC -O3: 25 s; 4X slower
  • 8 CORES: 3.4 s vs 0.75 s (~4.5X slower)
  • 16 CORES: 2.6 s vs 0.38 s (~7X slower)

  • 16 CORES, gcc -O3: 60 s (~100X slower)

2 x 8core-E5-2680 2.7GHz

SLIDE 43

Reality (BLIS)

FLAME/BLIS van de Geijn group UT AUSTIN

SLIDE 44

Challenges for exascale

  • Large and constantly evolving design space
  • Expanding complexity of

– simulation and data analysis – programming and profiling tools – underlying hardware

  • No common abstract parallel machine model
  • Static/synchronous scheduling increasingly hard
  • Fault tolerance / resiliency
  • End-to-end scalability
SLIDE 45

Gaps in

  • programmability
  • productivity
  • maintainability
  • performance
  • scalability
  • portability
  • accessibility
  • energy efficiency
  • reproducibility

Large percentage of production scientific codes: ~0.1–1% efficiency
Would be happy to have productivity tools that can raise performance to 10%
Requires 10X to 100X improvements

SLIDE 46

Thank you