PASCAL: A Parallel Algorithmic SCALable Framework for N-body Problems

Laleh Aghababaie Beni, Aparna Chandramowlishwaran (Euro-Par 2017)


SLIDE 1

HPC Factory

PASCAL

A Parallel Algorithmic SCALable Framework for N-body Problems

Laleh Aghababaie Beni, Aparna Chandramowlishwaran
Euro-Par 2017

SLIDE 2

Outline

  • Introduction
  • PASCAL Framework
  • Space Partitioning Trees
  • Tree Traversal
  • Prune/Approximate Generators
  • Optimizations & Parallelization
  • Experiments & Results
  • Conclusions & Future Work

SLIDES 4-6

N-body calculations

  • Force computation: $\forall q \in Q:\ F(q) = \sum_{r \in Q \setminus \{q\}} C\,\frac{r - q}{\lVert r - q \rVert^{3}}$
  • Nearest neighbors: $\forall q \in Q:\ \mathrm{AllNN}(q) = \arg\min_{r \in R} d(q, r)$
  • Kernel density estimation: $\forall q \in Q:\ \mathrm{KDE}(q) = \frac{1}{|R|} \sum_{r \in R} K(q, r)$
  • Range count: $\forall q \in Q:\ \mathrm{Range}(q) = \sum_{r \in R} I(\mathrm{dist}(q, r) \le h)$

What do these have in common?

All consider pairs of points: naïvely $O(N^2)$.
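
To make the naïve O(N^2) cost concrete, here is a minimal brute-force sketch of two of the calculations above (nearest neighbors and range count). The function names and the sample points are illustrative assumptions, not part of PASCAL; every query point is compared against every reference point.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def all_nearest_neighbors(Q, R):
    """For each q in Q, index of the closest r in R: |Q| * |R| distance evaluations."""
    return [min(range(len(R)), key=lambda j: dist(q, R[j])) for q in Q]

def range_count(Q, R, h):
    """For each q in Q, number of r in R with dist(q, r) <= h: also |Q| * |R| evaluations."""
    return [sum(1 for r in R if dist(q, r) <= h) for q in Q]

# Hypothetical sample data
Q = [(0.0, 0.0), (1.0, 1.0)]
R = [(0.0, 1.0), (2.0, 2.0), (0.1, 0.0)]
nn = all_nearest_neighbors(Q, R)
rc = range_count(Q, R, 1.0)
```

Both functions share the same double loop over pairs; only the operator applied to the pairwise values (arg min versus sum) and the kernel differ.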

SLIDE 7

Commonality: Optimal approximation algorithms

Force computation: $\forall q \in Q:\ F(q) = \sum_{r \in Q \setminus \{q\}} C\,\frac{r - q}{\lVert r - q \rVert^{3}}$

  • Hierarchical tree-based approximation algorithms for force computations, e.g., Barnes-Hut or FMM
  • Evaluate interactions → tree traversals
  • Store aggregate data at nodes, e.g., bounding box, mass

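SLIDE 8

N-body problems in other domains

Problem | Operators | Kernel Function
All Nearest Neighbors | $\forall$, $\arg\min$ | $\lVert x_q - x_r \rVert$
All Range Search | $\forall$, $\cup\arg$ | $I(h_{\min} < \lVert x_q - x_r \rVert < h_{\max})$
All Range Count | $\forall$, $\Sigma$ | $I(h_{\min} < \lVert x_q - x_r \rVert < h_{\max})$
Naive Bayes Classifier | $\forall$, $\arg\max$ | $\frac{1}{\sqrt{2\pi|\Sigma_k|}}\, e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)}\, P(C_k)$
Mixture Model E-step | $\forall$, $\forall$ | $\frac{1}{\sqrt{2\pi|\Sigma_k|}}\, e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)}$
K-means E-step | $\forall$, $\arg\min$ | $\lVert x_q - x_r \rVert$
Mixture Model Log-likelihood | $\Sigma$, $\log\Sigma$ | $\frac{1}{\sqrt{2\pi|\Sigma_k|}}\, e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)}$
Kernel Density Estimation | $\forall$, $\Sigma$ | $\phi\!\left(\frac{\lVert x_q - x_r \rVert}{h}\right)$
Kernel Density Bayes Classifier | $\forall$, $\arg\max\Sigma$ | $\phi\!\left(\frac{\lVert x_q - x_r \rVert}{h}\right) P(C_k)$
2-point (cross-)correlation | $\Sigma$, $\Sigma$ | $I(\lVert x_q - x_r \rVert < h)$
Nadaraya-Watson Regression | $\forall$, $\Sigma$ | $y_r\, \phi\!\left(\frac{\lVert x_q - x_r \rVert}{h}\right)$
Thermodynamic Average | $\Sigma$, $\Sigma$ | $\phi(\lVert x_q - x_r \rVert)$
Largest-span set | $\max, \dots, \max$ | $\Sigma(\lVert x_q - x_r \rVert)$
Closest Pair | $\arg\min$, $\arg\min$ | $\lVert x_q - x_r \rVert$
Minimum Spanning Tree | $\forall$, $\arg\min$ | $\lVert x_q - x_r \rVert$
Coulombic Interaction | $\forall$, $\Sigma$ | $\frac{\alpha_q \alpha_r}{\lVert x_q - x_r \rVert}$
Average Density | $\Sigma$, $\Sigma$ | $I(\lVert x_q - x_r \rVert < h)$
Wave function | $\forall$, $\Pi$ | $\phi(\lVert x_q - x_r \rVert)$
Hausdorff Distance | $\max$, $\min$ | $\lVert x_q - x_r \rVert$
Intrinsic (fractal) Dimension | $\Sigma$, $\Sigma$ | $I(\lVert x_q - x_r \rVert < h)$

Each problem has a set of operators and a kernel function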

SLIDE 9

Why N-body methods?

  • One of the original seven dwarfs (motifs)
  • FMM is listed among the top 10 algorithms with the greatest influence in the 20th century
  • EM is one of the top 10 algorithms with the highest impact in data mining
  • Applications
      • Machine learning
      • Computer vision
      • Computational geometry
      • Scientific computing …

SLIDE 10

Key Ideas and Findings

  • An algorithmic framework for N-body problems
  • Automatically generates prune & approximation conditions
  • Results in O(N log N) and O(N) algorithms
  • Domain-specific optimizations and parallelization
  • Shows 10-230x speedup on 6 different algorithms compared to state-of-the-art libraries and software
  • Out-of-the-box new optimal algorithms
      • O(N log N) EM algorithm for GMMs
      • O(N) algorithm for Hausdorff distance

SLIDE 12

PASCAL Framework

[Diagram of the framework components]

  • Inputs: datasets; N-body specification (operators & kernel function)
  • Space-partitioning trees: Kd-tree
  • Prune/Approximate condition generator
  • Tree traversal schemes: multi-tree traversals with BaseCase, Prune/Approximate, ComputeApproximate
  • Domain-specific optimizations
  • Parallelization
  • Output: optimized code

SLIDES 13-15

Tree Construction

[Animated figure, source: http://www.cs.cmu.edu/~dpelleg/kmeans.html]

Recursively divide space until each box has at most q points.
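
The recursive rule above can be sketched as a small kd-tree builder. This is a minimal illustration, not the PASCAL implementation: the `Node` class, the choice of splitting the widest dimension at the median, and the name `build_kdtree` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    points: List[Point]              # points contained in this box
    lo: Point                        # lower corner of the bounding box
    hi: Point                        # upper corner of the bounding box
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None

def build_kdtree(points: List[Point], q: int) -> Node:
    """Recursively split until each box holds at most q points (q >= 1)."""
    if q < 1:
        raise ValueError("leaf size q must be at least 1")
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    node = Node(points, lo, hi)
    if len(points) <= q:
        return node                  # small enough: this box is a leaf
    # Assumed split rule: widest dimension, cut at the median point.
    d = max(dims, key=lambda k: hi[k] - lo[k])
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2
    node.left = build_kdtree(pts[:mid], q)
    node.right = build_kdtree(pts[mid:], q)
    return node

def leaves(node: Node):
    """Yield the leaf boxes of the tree."""
    if node.is_leaf:
        yield node
    else:
        yield from leaves(node.left)
        yield from leaves(node.right)

# Hypothetical sample data
pts = [(float(i), float(i % 3)) for i in range(10)]
tree = build_kdtree(pts, 2)
```

Each node stores its bounding box, which is the aggregate data the traversal later uses to decide whether a pair of nodes can be pruned or approximated.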

SLIDES 16-34

Tree Traversal

[Animated figure: traversal of the trees over Q and R]

At each step of the traversal:

  • Prune/Approx()? NO: continue the traversal into the subtrees
  • At the leaves, BaseCase(): direct $Q_L \otimes R_L$ evaluation → $O(q^2)$
  • Prune/Approx()? YES:
      • For pruning problems, discard the entire subtree
      • For approximation problems, replace the subtree with the centroid and call ApproxCompute()
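
The traversal animated above can be sketched for one pruning problem, range count. This is a hedged illustration under stated assumptions: the `Node` class, the box-to-box minimum distance used as the prune test, and the function names are hypothetical and are not PASCAL's generated code. Squared distances are used throughout so that no square root is needed.

```python
def sqdist(p, r):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, r))

class Node:
    """kd-tree node over (index, point) pairs, with a bounding box."""
    def __init__(self, items, leaf_size):
        self.items = items
        dims = range(len(items[0][1]))
        self.lo = [min(p[d] for _, p in items) for d in dims]
        self.hi = [max(p[d] for _, p in items) for d in dims]
        self.kids = []
        if len(items) > leaf_size:
            d = max(dims, key=lambda k: self.hi[k] - self.lo[k])
            s = sorted(items, key=lambda ip: ip[1][d])
            m = len(s) // 2
            self.kids = [Node(s[:m], leaf_size), Node(s[m:], leaf_size)]

def min_sqdist(a, b):
    """Smallest possible squared distance between any point of box a and any point of box b."""
    return sum(max(a.lo[d] - b.hi[d], b.lo[d] - a.hi[d], 0.0) ** 2
               for d in range(len(a.lo)))

def range_count(qn, rn, h, counts):
    """Dual-tree traversal: counts[i] += number of reference points within h of query i."""
    if min_sqdist(qn, rn) > h * h:
        return                                   # Prune: no pair can be within h
    if not qn.kids and not rn.kids:              # BaseCase: direct leaf-leaf evaluation
        for i, q in qn.items:
            counts[i] += sum(1 for _, r in rn.items if sqdist(q, r) <= h * h)
        return
    for qc in qn.kids or [qn]:                   # otherwise descend into the children
        for rc in rn.kids or [rn]:
            range_count(qc, rc, h, counts)
```

A typical call builds one tree per point set, for example `range_count(Node(list(enumerate(Q)), 4), Node(list(enumerate(R)), 4), h, counts)` with `counts = [0] * len(Q)`. An approximation problem would follow the same skeleton but replace the early `return` with a centroid-based contribution.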

SLIDE 35

Prune/Approximate Condition Generator

  • Prune, e.g., Hausdorff Distance

SLIDES 36-43

Hausdorff Distance

[Animated figure: point sets Q and R]
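
As a point of reference for the Hausdorff distance example, the problem table lists its operators as max, min over the kernel ||xq − xr||. A minimal brute-force sketch of that directed form follows. The early exit is a simple point-level pruning trick added here for illustration; it is an assumption of this example and not the node-level prune condition that PASCAL generates.

```python
import math

def sqdist(p, r):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, r))

def directed_hausdorff(Q, R):
    """max over q in Q of min over r in R of ||q - r||."""
    best = 0.0                        # largest nearest-neighbor (squared) distance so far
    for q in Q:
        nearest = float("inf")
        for r in R:
            d = sqdist(q, r)
            if d < nearest:
                nearest = d
            if nearest <= best:       # q can no longer raise the max: stop scanning R
                break
        if nearest > best:
            best = nearest
    return math.sqrt(best)

# Hypothetical sample data
hd = directed_hausdorff([(0.0, 0.0), (5.0, 0.0)], [(0.0, 1.0)])
```

The same idea, applied to whole tree nodes instead of single points, is what lets a traversal discard subtrees that cannot change the outer max.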

SLIDE 44

Prune/Approximate Condition Generator

  • Prune, e.g., Hausdorff Distance
  • Approximation, e.g., Expectation Maximization (EM): Log-likelihood, E-step, M-step
SLIDES 45-56

Approximate Condition for EM

[Animated figure: nodes Q and R annotated with the kernel values $K_{\min}$, $K_{\max}$, and $K_{\mathrm{center}}$ (center to center)]

$\frac{K_{\max} - K_{\min}}{K_{\mathrm{center}}} < \beta$, where $\beta$ is the user-controlled accuracy

Log-likelihood:

E-step: $(r_{i,\max} - r_{i,\min}) < \beta\, r_{i,\mathrm{mean}} \quad (i = 1, \dots, K)$
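
A minimal sketch of how such an approximation test might be evaluated for a node pair, assuming the condition reads (Kmax − Kmin) / Kcenter < β. The Gaussian kernel, the bandwidth `h`, and the function names are assumptions made for illustration; the distances would come from the two nodes' bounding boxes and centers.

```python
import math

def gaussian_kernel(d, h):
    """Assumed kernel: unnormalized Gaussian of distance d with bandwidth h."""
    return math.exp(-(d * d) / (2.0 * h * h))

def can_approximate(d_min, d_max, d_center, h, beta):
    """True if the kernel varies little enough across the node pair.

    d_min / d_max: smallest / largest possible distance between the two nodes.
    d_center: distance between the node centers.
    beta: user-controlled accuracy.
    """
    k_max = gaussian_kernel(d_min, h)      # a decreasing kernel peaks at the closest approach
    k_min = gaussian_kernel(d_max, h)
    k_center = gaussian_kernel(d_center, h)
    return (k_max - k_min) < beta * k_center

# Hypothetical node pairs
tight = can_approximate(0.0, 0.1, 0.05, 1.0, 0.1)   # kernel nearly constant over the pair
loose = can_approximate(0.0, 3.0, 1.5, 1.0, 0.1)    # kernel varies a lot over the pair
```

A smaller β makes the test stricter, so fewer node pairs are approximated and the traversal descends further.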

SLIDES 57-61

Prune Condition for Hausdorff distance:

[Animated figure: nodes R and Q, with a border point highlighted]

  • $\mathrm{op}_{\oplus 1}\big(\tau_1,\ K(x_q, x_r) \mid \mathrm{op}_{\oplus 2}(\tau_2,\ K(x_q, x_r))\big)$ s.t. $\forall x_q \in N_q^{\mathrm{border}},\ \forall x_r \in N_r^{\mathrm{border}}$


SLIDES 63-67

Optimizations

  • Incremental bounding box calculation
      • Only update the dimension that is split at each node
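
One way to read the incremental bounding box idea: when a node is split along one dimension, each child's box can be derived from the parent's box by changing a single coordinate, rather than rescanning all the child's points. The sketch below illustrates that reading; `split_box` and its arguments are hypothetical names, and the resulting boxes bound the child's region of space rather than tightly wrapping its points.

```python
def split_box(lo, hi, dim, split):
    """Derive the two child boxes from the parent box (lo, hi).

    Only coordinate `dim` changes: the left child's upper bound and the
    right child's lower bound both become `split`.
    """
    left_hi = list(hi)
    left_hi[dim] = split
    right_lo = list(lo)
    right_lo[dim] = split
    return (lo, tuple(left_hi)), (tuple(right_lo), hi)

# Hypothetical parent box [0, 4] x [0, 2], split along dimension 0 at 1.5
left, right = split_box((0.0, 0.0), (4.0, 2.0), 0, 1.5)
```

The update touches one coordinate per child, independent of how many points the child contains.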
SLIDES 68-69

Optimizations

  • Incremental bounding box calculation
  • Optimal metric calculation
      • Reduced distance, e.g., squared Euclidean distance
          • Eliminates the expensive, long-latency sqrt instruction
      • Partial distance
          • Big payoff for high-dimensional datasets
  • Incremental distance calculation
      • Node-to-node distance computed incrementally from the parent's distance in constant time
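
The two metric optimizations can be sketched together in a nearest-neighbor base case. This is an illustrative assumption of how they are commonly combined, not PASCAL's code: the reduced distance compares squared values so no square root is taken, and the partial distance abandons a candidate as soon as its running sum already exceeds the best found so far.

```python
def nearest_partial(q, R):
    """Return (index, squared distance) of the point in R closest to q."""
    best, best_i = float("inf"), -1
    for i, r in enumerate(R):
        acc = 0.0
        for a, b in zip(q, r):
            acc += (a - b) ** 2
            if acc >= best:          # partial distance: this candidate cannot win
                break
        else:                        # loop finished without break: acc < best
            best, best_i = acc, i
    return best_i, best              # reduced distance: still squared, no sqrt taken

# Hypothetical sample data
result = nearest_partial((0.0, 0.0, 0.0),
                         [(1.0, 1.0, 1.0), (3.0, 0.0, 0.0), (0.0, 0.5, 0.0)])
```

In the example, the second candidate is rejected after its first coordinate alone. The savings grow with the number of dimensions, which fits the slide's remark about high-dimensional datasets.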

SLIDES 70-75

Parallelization

[Animated figure: the tree over Q, with subtrees spawned via cilk_spawn and assigned to threads t0, t1, t2, t3]

  • cilk_spawn: stop spawning when #cilk threads = #physical threads
  • Task parallelism
  • Data parallelism
  • Pruning/Approximation causes load imbalance
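
The spawning strategy on these slides uses cilk_spawn in C/C++. As a structural analogue only, the sketch below expands the top of a tree until there is one subtree per worker and then hands each subtree to a thread. Everything here is an assumption for illustration: the nested-tuple tree, the helper names, and the use of Python's `ThreadPoolExecutor` (which shows the task structure but, for pure-Python work, will not deliver real parallel speedup).

```python
from concurrent.futures import ThreadPoolExecutor

def frontier(root, n_workers, children):
    """Expand level by level until there is at least one subtree per worker,
    or until only leaves remain."""
    nodes = [root]
    while len(nodes) < n_workers:
        nxt = []
        for n in nodes:
            kids = children(n)
            nxt.extend(kids if kids else [n])
        if len(nxt) == len(nodes):   # nothing left to expand
            break
        nodes = nxt
    return nodes

def parallel_traverse(root, n_workers, children, work):
    """Run `work` on each frontier subtree, one task per subtree."""
    subtrees = frontier(root, n_workers, children)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, subtrees))

# Hypothetical tree: internal nodes are tuples, leaves are lists of numbers.
def children(n):
    return list(n) if isinstance(n, tuple) else []

def subtree_sum(n):
    return sum(n) if isinstance(n, list) else sum(subtree_sum(c) for c in n)

tree = (([1, 2], [3]), ([4], ([5], [6, 7])))
partial_sums = parallel_traverse(tree, 4, children, subtree_sum)
```

With a static split like this, subtrees that are pruned or approximated early finish sooner than others, which is the load imbalance the slide points out.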


SLIDE 77

Experimental Setup

  • Architecture
      • Dual-socket Intel Xeon E5-2630 v3 processor (Haswell-EP)
      • Each socket has 8 cores
      • Theoretical peak performance of 614.4 GFlops
  • Compiler
      • Intel C++ compiler (icpc v15.0.2)
      • Python v2.7.6 (Scikit-learn)
      • Java v1.8.0 (Weka)
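
The quoted peak can be reproduced with a short calculation. The 2.4 GHz base clock and the 16 double-precision flops per cycle per core (two AVX2 FMA units, each doing 4 multiply-adds per cycle) are assumptions about this Haswell-EP part, not figures stated on the slide:

```latex
\[
2\ \text{sockets} \times 8\ \tfrac{\text{cores}}{\text{socket}}
\times 2.4\ \text{GHz}
\times 16\ \tfrac{\text{flops}}{\text{cycle}\cdot\text{core}}
= 614.4\ \text{GFlops}
\]
```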

SLIDE 78

Case Studies (Direct)

  • Kernel Density Estimation
  • Nearest Neighbors
  • Range-Search: $I(\lVert x_q - x_r \rVert \le h)$
  • Hausdorff Distance

SLIDE 79

Case Studies (Iterative)

  • Euclidean Minimum Spanning Tree
  • Expectation Maximization (EM): Log-likelihood, E-step, M-step

SLIDE 80

Library Comparison

  • Weka: 6,677,053 downloads, written in Java
  • Scikit-learn: 121,841 downloads, written in Python
  • MATLAB: over 1,000,000 licensed users, uses C in the backend
  • MLPACK: exploits C++ language features to provide maximum performance

[Figure: two bar charts, labeled EM and kNN, showing speedup (axis 50 to 250) relative to the library marked "Base" on the Yahoo!, HIGGS, Census, KDD, and IHEPC datasets, for MATLAB, WEKA, MLPACK, Scikit, and PASCAL. Bar labels as extracted, five per dataset group:
  First chart: 63, 5.3, 6.3, Base, 6.2; 143, 3.5, 8.9, Base, 7.5; 231, 2.1, 23.1, Base, 14.5; 98, 2, 12.3, Base, 4.7; 160, Base, 13.3, 2, 4.5
  Second chart: 201, 5.2, 22.3, Base, 18.4; 142, Base, 7.9, 1.6, 3.9; 104, 1.4, 6.1, Base, 3.4; 123, 1.3, 15.4, Base, 7.7; 98, 1.5, 6.1, Base, 4.1]

SLIDE 81

Speedup Breakdown

1.6×, 3.2×, and 53.7× respectively for the same dataset.

Each cell lists Alg / +Opt / +Par.

Dataset | KNN | EM | KDE | HD | RS | EMST
Yahoo! | 3.1 / 12.1 / 173.1 | 1.6 / 3.2 / 53.7 | 2.1 / 9.1 / 92.1 | 2.5 / 11.5 / 161.1 | 2.2 / 9.1 / 126.8 | 2.9 / 11.9 / 166.7
HIGGS | 2.1 / 7.3 / 108.1 | 1.5 / 6.8 / 117.6 | 1.7 / 4.7 / 50.1 | 1.9 / 6.1 / 89.6 | 1.9 / 6.3 / 86.5 | 2.0 / 6.9 / 102.8
Census | 1.4 / 6.5 / 90.8 | 1.3 / 11.2 / 190.0 | 1.4 / 8.1 / 75.6 | 1.3 / 10.2 / 141.8 | 1.3 / 10.4 / 144.9 | 1.4 / 10.9 / 151.6
KDD | 1.6 / 6.8 / 100.7 | 1.4 / 4.1 / 70.9 | 1.5 / 3.1 / 33.5 | 1.4 / 3.8 / 54.4 | 1.4 / 5.1 / 70.5 | 1.5 / 3.8 / 55.5
IHEPC | 3.0 / 4.3 / 61.5 | 1.5 / 7.6 / 127.6 | 2.0 / 5.4 / 53.6 | 2.5 / 6.8 / 101.3 | 2.1 / 6.3 / 94.1 | 2.9 / 7.1 / 107.1

SLIDE 82

Scalability

SLIDE 83

Summary and Status

  • First generalized algorithmic framework for N-body problems
  • Out-of-the-box new optimal algorithms
      • O(N log N) EM algorithm
      • O(N) Hausdorff distance algorithm
  • Generalizes to more than two operators
  • 10-230x speedup from optimal tree algorithm, domain-specific optimizations and parallelization
  • Short-term: DSL + code generator for base-case, optimizations and parallelization
  • Long-term: Extend to GPUs and distributed memory systems