Declarative Infrastructure for Automated Scientific Computing


SLIDE 1

Declarative Infrastructure for Automated Scientific Computing

Matthew Rocklin March 5th, 2013

SLIDE 2

Expertise

Domain knowledge is important for efficient computation

SLIDE 3

Expertise

Most scientific programmers aren’t experts in all requisite domains

SLIDE 4

Expertise

    // A: Naive
    int fact(int n){
        if (n == 0)
            return 1;
        return n*fact(n-1);
    }

    // B: Programmer
    int fact(int n){
        int prod = 1;
        while (n)
            prod *= n--;
        return prod;
    }

    // C: Mathematician
    // n! = Gamma(n+1)
    int fact(int n){
        return lround(exp(lgamma(n+1)));
    }

◮ Modern compilers automate B
◮ Humans do C by hand
◮ The people who know C are rarely trained to build compilers
◮ Both B and C are commonly necessary within one project
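The contrast between B and C can be sketched in Python's standard library, which exposes the same `lgamma` route as the C snippet above (the function names `fact_loop` and `fact_gamma` are mine, for illustration):

```python
import math

def fact_loop(n):
    """Approach B: the straightforward loop a compiler optimizes well."""
    prod = 1
    for k in range(2, n + 1):
        prod *= k
    return prod

def fact_gamma(n):
    """Approach C: n! = Gamma(n + 1), via the log-Gamma function.

    Exact for small n; for large n the float round-trip loses precision,
    which is the usual trade-off of the mathematician's shortcut.
    """
    return round(math.exp(math.lgamma(n + 1)))
```

Both agree on small inputs; the point of the slide is that knowing the Gamma-function identity is domain expertise a compiler cannot supply.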

SLIDE 5

Composable Software - Monolithic

Figure: a monolithic Numerical Weather Prediction code entangles meteorological science, PDEs, finite elements, uncertainty quantification, distributed-memory scheduling, GPGPU, and HPC C/Fortran in a single codebase.

SLIDE 6

Composable Software - Composable

Figure: the same domains - meteorological science, PDEs, finite elements, uncertainty quantification, distributed-memory scheduling, GPGPU, HPC C/Fortran - composed as separate interoperating pieces that together build Numerical Weather Prediction.

SLIDE 7

Outline

◮ Probability Modeling
  ◮ Bayesian inference simulations (MCMC)
◮ Matrix Algebra
  ◮ BLAS/LAPACK computations
◮ Static Scheduling
  ◮ Blocked Matrix Algorithms

SLIDE 8

Section 1 Probability Modeling

SLIDE 9

Computer Algebra - SymPy

    >>> expr = log(3*exp(x + 2))
    >>> print simplify(expr)
    x + 2 + log(3)

Figure: expression trees for log(3*exp(x + 2)) before and after simplification.

SLIDE 10

Random Variables

    >>> x = Normal('x', 0, 1)
    >>> expr = log(3*exp(x + 2))
    >>> print simplify(expr)
    x + log(3) + 2

Figure: the expression tree now contains a RandomSymbol node (carrying a NormalDistribution and ProbabilitySpace) in place of the plain symbol x.

SLIDE 11

Random Variables

    >>> x = Normal('x', 0, 1)
    >>> expr = log(3*exp(x + 2))
    >>> print simplify(expr)
    x + log(3) + 2
    >>> P(expr > 4)
    ∞
    ∫ √2·exp(−(z − log(3) − 2)²/2) / (2√π) dz
    4

  1. Uncertainty doesn't interfere
  2. Probability query → integral expression is a simple transformation
  3. Integral problems have mature solutions
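Point 3 is the payoff: once the query becomes a Gaussian tail integral, mature numeric routines evaluate it directly. A sketch using only the standard library (the variable names are mine; SymPy itself would produce the symbolic integral):

```python
import math

# P(x + log(3) + 2 > 4) for x ~ Normal(0, 1) reduces to
# P(x > 2 - log(3)), a standard normal tail probability,
# which the complementary error function gives in closed form.
threshold = 2 - math.log(3)
p = 0.5 * math.erfc(threshold / math.sqrt(2))
print(p)  # roughly 0.18
```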
SLIDE 12

Random Variables

Figure: SymPy random-variable integrals sit between modeling domains (scientific modeling, engineering error analysis, finance) and integral solvers (analytic solvers, Monte Carlo, sparse grids, numeric quadrature).

SLIDE 13

Bayesian Inference

    >>> rate = Beta('lambda', a, b)
    >>> count = Poisson('count', rate)

p(x|λ) = λ^x e^(−λ) / x!

p(λ) = λ^(a−1) (1 − λ)^(b−1) Γ(a + b) / (Γ(a) Γ(b))

Infer the rate given many observations of count. Maximize

    p(λ|xᵢ) ∝ ∏ᵢ p(xᵢ|λ) · p(λ)

    0 = d/dλ log ∏ᵢ p(xᵢ|λ) · p(λ) = d/dλ ∑ᵢ log(p(xᵢ|λ) · p(λ))

Need to find the roots of

    ∑ᵢ₌₁ⁿ [a(λ − 1) + bλ − λ(λ − 1) − 2λ + (λ − 1)·data[i] + 1] / (λ(λ − 1)) = 0

SLIDE 14

Bayesian Inference

PyMC: SymPy + RV → Theano (array computations) → C + CUDA

SLIDE 15

How do we solve math problems?

    >>> A = Normal('a', 0, 1)
    >>> density(A)
    √2·exp(−z²/2) / (2√π)
    >>> density(A**2)
    √2·exp(−z/2)·|1/√z| / (2√π)

Use generic transformations taught in Statistics 101, e.g.

    f_Y(y) = f_X(g⁻¹(y)) · |d g⁻¹(y)/dy|

SLIDE 16

How do we solve math problems?

    >>> A = Normal('a', 0, 1)
    >>> B = Normal('b', 0, 1)
    >>> density(A**2 + B**2)
     ∞
     ∫ exp(−(b − a)²/2)·exp(−z/2 + b²/2) / (2π·|√(z − b²)|) db
    −∞

SLIDE 17

How do we solve math problems?

    >>> A = Normal('a', 0, 1)
    >>> B = Normal('b', 0, 1)
    >>> density(A**2 + B**2)
    (1/2)·exp(−z/2)

This is a Chi squared distribution with two degrees of freedom. Phenomenological relations:

    N(0, 1)² ∼ χ²(1)
    χ²(n) + χ²(m) ∼ χ²(n + m)

SLIDE 18

Term Rewrite System

Rewrite rule:
    Source pattern: tan(x)
    Target pattern: sin(x)/cos(x)
    Variables: x

Example:
    From: 3 + tan(a + b)**2
    To:   3 + (sin(a + b)/cos(a + b))**2
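The rewrite step above can be sketched with expressions as nested tuples (a toy matcher for illustration, not the talk's actual implementation; here any single lowercase letter in a pattern acts as a variable):

```python
# Expressions are nested tuples such as ('tan', ('add', 'a', 'b')).

def match(pattern, expr, bindings):
    """Try to extend bindings so that pattern equals expr."""
    if isinstance(pattern, str) and len(pattern) == 1:  # a pattern variable
        if pattern in bindings:
            return bindings if bindings[pattern] == expr else None
        return {**bindings, pattern: expr}
    if isinstance(pattern, tuple) and isinstance(expr, tuple) \
            and len(pattern) == len(expr) and pattern[0] == expr[0]:
        for p, e in zip(pattern[1:], expr[1:]):
            bindings = match(p, e, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == expr else None

def substitute(template, bindings):
    """Fill a target pattern with the matched bindings."""
    if isinstance(template, str) and template in bindings:
        return bindings[template]
    if isinstance(template, tuple):
        return (template[0],) + tuple(substitute(t, bindings)
                                      for t in template[1:])
    return template

def rewrite(rule, expr):
    """Apply rule = (source, target) once, bottom-up."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(rule, e) for e in expr[1:])
    src, tgt = rule
    b = match(src, expr, {})
    return substitute(tgt, b) if b is not None else expr

tan_rule = (('tan', 'x'), ('div', ('sin', 'x'), ('cos', 'x')))
before = ('add', 3, ('pow', ('tan', ('add', 'a', 'b')), 2))
after = rewrite(tan_rule, before)  # tan rewritten inside the larger expression
```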

SLIDE 19

Term Rewrite System

Rules:
    tan(a) → sin(a)/cos(a)
    sin²(a) → (1 − cos(2a))/2
    cos²(a) → (1 + cos(2a))/2
    sin(a) + sin(b) → 2 sin((a + b)/2) cos((a − b)/2)
    sin²(a) + cos²(a) → 1
    sin(a)/cos(a) → tan(a)
    ...

Simplify: sin²(y) + sin(z)cos(z) + cos²(y)

SLIDE 20

Encode Statistical Rewrite Rules

Express patterns:

    patterns = [
        (Normal(0, 1),                  StandardNormal(),  []),
        (StandardNormal()**2,           ChiSquared(1),     []),
        (ChiSquared(m) + ChiSquared(n), ChiSquared(n + m), [n, m]),
        ...
    ]

Define control flow:

    canonicalize = chain(repeat, bottom_up, choose)

Combine:

    stat_simplify = canonicalize(patterns)
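The control-flow combinators can be sketched as functions from expression to expression (a toy version; `repeat` and `bottom_up` follow the slide's names, `chain` and `choose` are omitted, and the χ² pattern is hard-coded as a function over tuple expressions like `('chi2', n)`):

```python
def repeat(rule):
    """Apply rule until the expression stops changing."""
    def strategy(expr):
        while True:
            new = rule(expr)
            if new == expr:
                return expr
            expr = new
    return strategy

def bottom_up(rule):
    """Apply rule to children first, then to the node itself."""
    def strategy(expr):
        if isinstance(expr, tuple) and expr[0] == 'add':
            expr = ('add', strategy(expr[1]), strategy(expr[2]))
        return rule(expr)
    return strategy

def chi2_merge(expr):
    # ChiSquared(n) + ChiSquared(m) -> ChiSquared(n + m)
    if (isinstance(expr, tuple) and expr[0] == 'add'
            and expr[1][0] == 'chi2' and expr[2][0] == 'chi2'):
        return ('chi2', expr[1][1] + expr[2][1])
    return expr

stat_simplify = repeat(bottom_up(chi2_merge))
result = stat_simplify(('add', ('add', ('chi2', 1), ('chi2', 1)), ('chi2', 3)))
# result == ('chi2', 5)
```

Separating the rules (data) from the traversal strategy (combinators) is what lets statisticians contribute patterns without writing compiler machinery.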

SLIDE 21

Status and Evaluation

◮ Software:
  ◮ Fully functional
  ◮ Lacks efficient matching for many patterns
  ◮ Maybe integrate into PyMC
◮ Evaluation:
  ◮ Compare numeric runtimes
  ◮ Compare complexity of problem description

SLIDE 22

Related Work

◮ Symbolic Statistics
  ◮ L. Leemis, G. Evans. APPL: A Probability Programming Language
  ◮ M. Erwig, S. Kollmansberger. Probabilistic Functional Programming in Haskell. 2006
◮ Markov chain Monte Carlo
  ◮ WinBUGS
  ◮ JAGS
  ◮ PyMC
◮ Term Rewrite Systems - Elan, Maude, Mathematica, Stratego/XT, Coq
◮ Term Rewrite Systems (CAS)
  ◮ RUBI - A. Rich, D. Jeffrey. A Knowledge Repository for Indefinite Integration Based on Transformation Rules
  ◮ H. Fu, X. Zhong, Z. Zeng. Automated and Readable Simplification of Trigonometric Expressions

SLIDE 23

Section 2 Numerical Linear Algebra

SLIDE 24

The need for a high level array compiler

    x = ones(10000, 1);
    (x*x')*x
    Elapsed time is 0.337711 seconds.
    x*(x'*x)
    Elapsed time is 0.000956 seconds.

With x of shape n×1, (x*x') is an n×n matrix while (x'*x) is a 1×1 scalar, so the two orderings do vastly different amounts of work even though x*x'*x is mathematically unambiguous.
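A flop count explains the thousandfold gap. Multiplying an (m, k) matrix by a (k, n) matrix costs about 2·m·k·n flops; a sketch (names are mine):

```python
def chain_cost(shapes):
    """Flops to multiply left-to-right through a list of (rows, cols)."""
    flops = 0
    m, k = shapes[0]
    for k2, n in shapes[1:]:
        assert k == k2, "inner dimensions must agree"
        flops += 2 * m * k * n  # (m, k) times (k, n)
        k = n                   # the running product is now (m, n)
    return flops

n = 10_000
left = chain_cost([(n, 1), (1, n), (n, 1)])    # (x*x')*x: build n x n, then n x 1
right = chain_cost([(1, n), (n, 1)]) + 2 * n   # x*(x'*x): a dot product, then scale x
print(left // right)  # the ratio grows like n
```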

SLIDE 25

The need for a high level array compiler

β = (XᵀX)⁻¹Xᵀy

    beta = (X.T*X).I*X.T*y

SLIDE 26

The need for a high level array compiler

β = (XᵀX)⁻¹Xᵀy

    beta = solve(X.T*X, X.T*y)

SLIDE 27

The need for a high level array compiler

β = (XᵀX)⁻¹Xᵀy

    beta = spd_solve(X.T*X, X.T*y)

SLIDE 28

BLAS/LAPACK

Numeric libraries for dense linear algebra

◮ DGEMM - Double precision GEneral Matrix Multiply – αAB + βC
  SUBROUTINE DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
◮ DSYMM - Double precision SYmmetric Matrix Multiply – αAB + βC
  SUBROUTINE DSYMM(SIDE,UPLO,M,N,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
◮ . . .
◮ DPOSV - Double symmetric POsitive definite matrix SolVe – A⁻¹y
  SUBROUTINE DPOSV( UPLO, N, NRHS, A, LDA, B, LDB, INFO )

SLIDE 35

Necessary Definitions

Language: multiply, addition, inverse, transpose, trace, determinant, blocks, etc.

    beta = (X.T*X).I*X.T*y
    X.I*X -> Identity

Predicates: symmetric, positive definite, full rank, orthogonal, lower triangular, etc.

    fullrank(X)
    fullrank(X) -> positive_definite(X.T*X)

Computations:

    class SYMM(BLAS):
        inputs = [alpha, A, B, beta, C]
        outputs = [alpha*A*B + beta*C]
        condition = symmetric(A) or symmetric(B)
        inplace = {0: 4}  # 0th output stored in 4th input
        fortran = ....
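The essential pattern, a routine paired with a validity predicate over known facts, can be sketched independently of BLAS (toy functions with illustrative names, not the talk's actual SymPy code):

```python
# Each routine carries a predicate over known facts; compilation picks
# the most specific routine whose condition the facts satisfy.

def symm_ok(facts):
    # DSYMM requires one symmetric operand
    return 'symmetric(A)' in facts or 'symmetric(B)' in facts

def gemm_ok(facts):
    # DGEMM applies to any dense product
    return True

# most specific routine first
matmul_routines = [('DSYMM', symm_ok), ('DGEMM', gemm_ok)]

def choose_routine(routines, facts):
    return next(name for name, ok in routines if ok(facts))

print(choose_routine(matmul_routines, {'symmetric(A)'}))  # DSYMM
print(choose_routine(matmul_routines, set()))             # DGEMM
```

Inference rules like `fullrank(X) -> positive_definite(X.T*X)` enlarge the fact set before selection, which is how `spd_solve` earlier resolves to DPOSV.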

SLIDE 36

Mathematical code

Original language definition in Maude:

    ...
    eq A (B + C) = (A B) + (A C) .
    eq (alpha A)' = alpha A' .
    eq A'' = A .
    eq inverse(A) A = Identity .
    ...
    eq X X' is symmetric = True .
    ceq X Y is positive-definite = True
        if X is positive-definite and Y is positive-definite .
    ceq X X' is positive-definite = True if X is full-rank .

Eventually moved to SymPy for distribution reasons.

SLIDE 37

Computation

Given:

    (X.T*X).I*X.T*y
    full_rank(X)

Produce:

Figure: a computation graph that forms X'*X and X'*y with two GEMM calls, then applies POSV to produce (X'*X)^-1*X'*y.

SLIDE 38

Computation

Figure: the search refines the graph step by step: (1) start from the identity computation of (X'*X)^-1*X'*y; (2) POSV consumes X'*X and X'*y to produce the result (plus INFO); (3) a GEMM produces X'*X from X; (4) a second GEMM produces X'*y from X and y.

SLIDE 42

User Experience

    X = MatrixSymbol('X', n, m)
    y = MatrixSymbol('y', n, 1)
    inputs = [X, y]
    outputs = [(X.T*X).I*X.T*y]
    facts = Q.fullrank(X)
    f = f2py(next(compile(inputs, outputs, facts)))

    subroutine f(X, y, var_7, m, n)
    implicit none

    integer, intent(in) :: m
    integer, intent(in) :: n
    real*8, intent(in) :: y(n)       ! y
    real*8, intent(in) :: X(n, m)    ! X
    real*8, intent(out) :: var_7(m)  ! 0, X'*y, (X'*X)^-1*X'*y
    real*8 :: var_8(m, m)            ! 0, X'*X
    integer :: INFO                  ! INFO

    call dgemm('N', 'N', m, 1, n, 1, X, n, y, n, 0, var_7, m)
    call dgemm('N', 'N', m, m, n, 1, X, n, X, n, 0, var_8, m)
    call dposv('U', m, 1, var_8, m, var_7, m, INFO)

    RETURN
    END

SLIDE 43

Multiple Results

Figure: the intermediate POSV/GEMM computation graphs produced by the search for (X'*X)^-1*X'*y.

SLIDE 44

Multiple Results

Figure: the same search also yields GESV/LASWP-based candidates (LU factorization with row pivoting) for (A'*A)^-1*A'*y, alongside the POSV/GEMM candidates.

SLIDE 45

Section 3 Static Scheduling

SLIDE 46

Static Scheduling

    newmu    = mu + Sigma*H.T * (R + H*Sigma*H.T).I * (H*mu - data)
    newSigma = Sigma - Sigma*H.T * (R + H*Sigma*H.T).I * H * Sigma

Figure: the full computation graph for this Kalman-filter update, built from GEMM, SYMM, and POSV calls over Sigma, H, R, mu, and data.

SLIDE 47

Static Scheduling

Figure: the Kalman-filter graph split across two workers; H*Sigma*H' + R is computed on one worker and transferred at the cut (Send --> ('cpu', 0) on one side, Recv <-- ('cpu', 1) on the other).

SLIDE 48

Static Scheduling

Given:

◮ Computation graph (e.g. the POSV/GEMM graph for (X'*X)^-1*X'*y)
◮ Worker network (CPUs and GPUs)
◮ Computation times: task, worker → time
◮ Communication times: variable, source, target → time

Produce: a set of computation subgraphs, one per worker, that minimizes total runtime.
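The shape of the problem can be sketched with a greedy list scheduler: walk tasks in topological order and place each on the worker that finishes it earliest, adding communication time when a dependency lives elsewhere. (The talk's schedulers are IP- and heuristic-based; this toy only illustrates the inputs and outputs.)

```python
def schedule(tasks, deps, workers, comp, comm):
    """tasks: topologically ordered list; deps: task -> prerequisites;
    comp[task, worker] -> compute time; comm[src, dst] -> transfer time."""
    finish = {}                     # task -> (worker, finish time)
    busy = {w: 0 for w in workers}  # time each worker becomes free
    for t in tasks:
        best = None
        for w in workers:
            ready = max([finish[d][1] + comm[finish[d][0], w] for d in deps[t]],
                        default=0)
            done = max(ready, busy[w]) + comp[t, w]
            if best is None or done < best[1]:
                best = (w, done)
        w, done = best
        busy[w] = done
        finish[t] = (w, done)
    return finish

workers = ['cpu', 'gpu']
comm = {(a, b): 0 if a == b else 2 for a in workers for b in workers}
comp = {('a', 'cpu'): 1, ('a', 'gpu'): 10,
        ('b', 'cpu'): 1, ('b', 'gpu'): 10,
        ('c', 'cpu'): 5, ('c', 'gpu'): 1}
deps = {'a': [], 'b': [], 'c': ['a', 'b']}
placement = schedule(['a', 'b', 'c'], deps, workers, comp, comm)
# 'c' moves to the gpu even though its inputs must be communicated
```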

SLIDE 49

Related work

◮ Performance Modeling
  ◮ E. Peise, P. Bientinesi. Performance Modeling for Dense Linear Algebra. 2012
  ◮ R. Iakymchuk. Performance Modeling and Prediction for Linear Algebra Algorithms. 2012
  ◮ J.J. Dongarra, R.A. van de Geijn, D.W. Walker. Scalability Issues Affecting the Design of a Dense Linear Algebra Library. 1994
◮ Heterogeneous Static Scheduling
  ◮ M. Tompkins. Optimization Techniques for Task Allocation and Scheduling in Distributed Multi-Agent Operations. 2003
  ◮ H. Topcuoglu, S. Hariri, M. Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. 2002
  ◮ Suggestions?
◮ Automated Dense Linear Algebra
  ◮ ScaLAPACK, PLAPACK, BLACS
  ◮ FLAME - Language for blocked matrix algorithms
  ◮ SuperMatrix - Dynamic shared-memory variant
  ◮ Elemental - Distributed-memory variant
  ◮ Magma - Hybrid LAPACK
  ◮ Spiral - Hardware-specific numeric code generation with internal computation language

SLIDE 50

Application - Blocked Cholesky Decomposition

Math problem: Ax = y → L Lᵀ x = y → x = L⁻ᵀ L⁻¹ y, with A symmetric positive definite and L lower triangular.

    [ A11  A21ᵀ ]   [ L11       ] [ L11ᵀ  L21ᵀ ]
    [ A21  A22  ] = [ L21  L22  ] [       L22ᵀ ]

    L11 := cholesky(A11)
    L21 := A21 L11⁻ᵀ
    L22 := cholesky(A22 − L21 L21ᵀ)
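The three block updates can be sketched in plain Python with list-of-lists matrices (illustrative names; a real implementation would call LAPACK's POTRF/TRSM/SYRK on each block):

```python
import math

def cholesky(A):
    """Unblocked Cholesky: return lower-triangular L with A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_LT(L11, B):
    """Return X = B L11^{-T}, i.e. solve X L11^T = B row by row."""
    X = [[0.0] * len(L11) for _ in B]
    for r in range(len(B)):
        for j in range(len(L11)):
            s = sum(X[r][k] * L11[j][k] for k in range(j))
            X[r][j] = (B[r][j] - s) / L11[j][j]
    return X

def blocked_cholesky(A, b):
    """One level of 2x2 blocking with leading block size b."""
    n = len(A)
    A11 = [row[:b] for row in A[:b]]
    A21 = [row[:b] for row in A[b:]]
    A22 = [row[b:] for row in A[b:]]
    L11 = cholesky(A11)                                  # L11 := cholesky(A11)
    L21 = solve_LT(L11, A21)                             # L21 := A21 L11^{-T}
    S = [[A22[i][j] - sum(L21[i][k] * L21[j][k] for k in range(b))
          for j in range(n - b)] for i in range(n - b)]  # A22 - L21 L21^T
    L22 = cholesky(S)                                    # L22 := cholesky(S)
    L = [[0.0] * n for _ in range(n)]                    # assemble L
    for i in range(b):
        L[i][:b] = L11[i]
    for i in range(n - b):
        L[b + i][:b] = L21[i]
        L[b + i][b:] = L22[i]
    return L

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]   # a small SPD example
L = blocked_cholesky(A, 1)
# L L^T reproduces A
```

Each of the three updates is an independent task over blocks, which is exactly what makes the algorithm a natural input for the static scheduler above.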

SLIDE 51

Status and Evaluation

◮ Have done - Software
  ◮ Implemented schedulers
  ◮ Implemented rudimentary cost model
  ◮ Everything hooks up well
◮ Will do - Compare existing schedulers with
  ◮ Schedulers: IP, heuristic, naive dynamic scheduling
  ◮ Variety of block / problem sizes
  ◮ Variety of desired algorithms
  ◮ Variety of timing models
  ◮ Under uncertain conditions
  ◮ Compare scheduling times vs execution times
◮ Could do
  ◮ Implement direct cost model (run and profile each task)
  ◮ Comparison against FLAME/Magma

SLIDE 52

DAG Ordering

Recv and Send return control immediately. RecvWait and SendWait block until communication terminates. A, B, C, D are computation tasks.

Figure: a DAG with computation tasks A, B, C, D and communication nodes Recv1, Recv2, RecvWait1, RecvWait2, Send1, SendWait1.

    Order 1: R1 RW1 R2 RW2 A B C D S1 SW1
    Order 2: R2 R1 RW2 B S1 RW1 A C D SW1

SLIDE 53

DAG Ordering

Figure: the same DAG with one chosen linearization: communication is started early (Recv1, Recv2, Send) and waited on as late as possible (RecvWait1, RecvWait2, SendWait), overlapping transfers with the computations A, B, C, D.

SLIDE 54

DAG Ordering

    def dependence(a, b):
        if depends((a, b)): return 1
        if depends((b, a)): return -1
        return 0

    def mpi_send_wait_key(a):
        """ Wait as long as possible on Waits, start Send/Recvs early """
        if isinstance(a.op, (MPIRecvWait, MPISendWait)): return 1
        if isinstance(a.op, (MPIRecv, MPISend)): return -1
        return 0

    def mpi_tag_key(a):
        """ Break MPI ties by using the variable tag - prefer lower tags first """
        if isinstance(a.op, (MPISend, MPIRecv, MPIRecvWait, MPISendWait)):
            return a.op.tag
        return 0

SLIDE 55

Thank You


Slides: http://github.com/mrocklin/candidacy-exam/

SLIDE 56

Section 4 Extras

SLIDE 57

Inplace

Figure: the Kalman-filter computation graph before in-place analysis (as in the Static Scheduling section).

SLIDE 58

Inplace

Figure: the same graph after in-place analysis. Each result is annotated with the buffer that stores it (e.g. the GEMM producing H*Sigma*H' + R overwrites R, and the POSV solve overwrites data), and COPY nodes are inserted for inputs (H, Sigma, mu) that must be preserved.