How to Write Fast Numerical Code Spring 2011 Lecture 15 Instructor: - - PowerPoint PPT Presentation

how to write fast numerical code
SMART_READER_LITE
LIVE PREVIEW

How to Write Fast Numerical Code Spring 2011 Lecture 15 Instructor: - - PowerPoint PPT Presentation

How to Write Fast Numerical Code Spring 2011 Lecture 15 Instructor: Markus Pschel TA: Georg Ofenbeck Reuse Again Reuse of an algorithm: Number of operations Minimal number of Size of input + size of output data Memory accesses


slide-1
SLIDE 1

How to Write Fast Numerical Code

Spring 2011 Lecture 15 Instructor: Markus Püschel TA: Georg Ofenbeck

slide-2
SLIDE 2

Reuse Again

Reuse of an algorithm:

Examples:

  • Matrix multiplication C = AB + C
  • Discrete Fourier transform
  • Adding two vectors x = x+y

Number of operations Size of input + size of output data

2n3 3n2 = 2 3n = O(n)

¼ 5n log2(n)

2n

= 5

2 log2(n) = O(log(n))

n 2n = 1 2 = O(1)

Minimal number of Memory accesses

slide-3
SLIDE 3

Effects

5 10 15 20 25 30 35 40 45 50 1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000

matrix size

MMM on 2 x Core 2 Duo 3 GHz (double precision) Gflop/s

MMM: O(n) reuse

Performance maintained even when data does not fit into caches Drop will happen once data does not fit into main memory

5 10 15 20 25 30 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 131,072 262,144

input size

..

Discrete Fourier Transform (DFT) on 2 x Core 2 Duo 3 GHz (single precision)

Gflop/s

FFT: O(log(n)) reuse

Performance drop when data does not fit into largest cache Outside cache: Runtime only determined by memory accesses (memory bound)

slide-4
SLIDE 4

Memory Bound Computation

Typically: Computations with O(1) reuse

Performance bound based on data traffic may be tighter than performance bound obtained by op count

slide-5
SLIDE 5

Example

Vector addition: z = x + y on Core 2

Core 2: Resulting bounds

  • Peak performance (no SSE):

1 add/cycle n cycles

  • Throughput L1 cache:

2 doubles/cycle 2/3 n cycles

  • Throughput L2 cache:

1 doubles/cycle 1/3 n cycles

  • Throughput Main memory:

¼ doubles/cycle 1/12 n cycles

void vectorsum(double *x, double *y, double *z, int n) { int i; for (i = 0; i < n; i++) z[i] = x[i] + y[i]; }

Reuse: 1/3

slide-6
SLIDE 6

Example

Vector addition: z = x + y on Core 2

Core 2: Resulting bounds

  • Peak performance (no SSE):

1 add/cycle n cycles

  • Throughput L1 cache:

2 doubles/cycle 3/2 n cycles

  • Throughput L2 cache:

1 doubles/cycle 3n cycles

  • Throughput Main memory:

¼ doubles/cycle 12 n cycles

void vectorsum(double *x, double *y, double *z, int n) { int i; for (i = 0; i < n; i++) z[i] = x[i] + y[i]; }

Reuse: 1/3

slide-7
SLIDE 7

Memory-Bound Computation

10 20 30 40 50 60 70 80 90 100 1 KB 4 KB 16 KB 64 KB 256 KB 1 MB 4 MB 16 MB

z = x + y on Core i7 (one core, no SSE), icc 12.0 /O2 /fp:fast /Qipo

L1 cache L2 cache L3 cache

2 doubles/cycle 1 double/cycle 1/2 double/cycle vector length Percentage peak performance (peak = 1 add/cycle) Bounds based

  • n bandwidth
slide-8
SLIDE 8

Today

Sparse matrix-vector multiplication (MVM)

Sparsity/Bebop

References:

  • Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization

Framework for Sparse Matrix Kernels, Int’l Journal of High Performance

  • Comp. App., 18(1), pp. 135-158, 2004
  • Vuduc, R.; Demmel, J.W.; Yelick, K.A.; Kamil, S.; Nishtala, R.; Lee, B.;

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,

  • pp. 26, Supercomputing, 2002
  • Sparsity/Bebop website
slide-9
SLIDE 9

Sparse Linear Algebra

Very different characteristics from dense linear algebra (LAPACK etc.)

Applications:

  • finite element methods
  • PDE solving
  • physical/chemical simulation

(e.g., fluid dynamics)

  • linear programming
  • scheduling
  • signal processing (e.g., filters)

Core building block: Sparse MVM

Graphics: http://aam.mathematik.uni-freiburg.de/IAM/homepages/claus/ projects/unfitted-meshes_en.html

slide-10
SLIDE 10

Sparse MVM (SMVM)

y = y + Ax, A sparse but known

Typically executed many times for fixed A

What is reused?

Reuse dense versus sparse MVM?

  • =

+

y y x A

slide-11
SLIDE 11

Storage of Sparse Matrices

Standard storage is obviously inefficient

  • Many zeros are stored
  • As a consequence, reuse is decreased

Several sparse storage formats are available

Most popular: Compressed sparse row (CSR) format

  • blackboard
slide-12
SLIDE 12

CSR

Assumptions:

  • A is m x n
  • K nonzero entries

Storage: Θ(max(K, m)), typically Θ(K)

b c c a b b c A as matrix b c c a b b c 1 3 1 2 3 2 3 4 6 7

values col_idx row_start

A in CSR: length K length K length m+1

slide-13
SLIDE 13

Sparse MVM Using CSR

void smvm(int m, const double* value, const int* col_idx, const int* row_start, const double* x, double* y) { int i, j; double d; /* loop over rows */ for (i = 0; i < m; i++) { d = y[i]; /* scalar replacement since reused */ /* loop over non-zero elements in row i */ for (j = row_start[i]; j < row_start[i+1]; j++, col_idx++, value++) { d += value[j] * x[col_idx[j]]; } y[i] = d; } }

y = y + Ax CSR + sparse MVM: Advantages?

slide-14
SLIDE 14

CSR

Advantages:

  • Only nonzero values are stored
  • All arrays are accessed consecutively in MVM (spatial locality)

Disadvantages:

  • x is not reused
  • Insertion costly
slide-15
SLIDE 15

Impact of Matrix Sparsity on Performance

Adressing overhead (dense MVM vs. dense MVM in CSR):

  • ~ 2x slower (Mflop/s, example only)

Irregular structure

  • ~ 5x slower (Mflop/s, example only) for “random” sparse matrices

Fundamental difference between MVM and sparse MVM (SMVM):

  • Sparse MVM is input dependent (sparsity pattern of A)
  • Changing the order of computation (blocking) requires changing the data

structure (CSR)

slide-16
SLIDE 16

Bebop/Sparsity: SMVM Optimizations

Idea: Register blocking

Reason: Reuse x to reduce memory traffic

Execution: Block SMVM y = y + Ax into micro MVMs

  • Block size r x c becomes a parameter
  • Consequence: Change A from CSR to r x c block-CSR (BCSR)

BCSR: Blackboard

slide-17
SLIDE 17

BCSR (Blocks of Size r x c)

Assumptions:

  • A is m x n
  • Block size r x c
  • Kr,c nonzero blocks

Storage: Θ(rcKr,c), rcKr,c ≥ K

b c c a b b c A as matrix (r = c = 2) b c 0 c 0 0 c 0 b b c 0 2 2 2 4

b_values b_col_idx b_row_start

A in BCSR (r = c = 2): length rcKr,c length Kr,c length m/r+1

slide-18
SLIDE 18

Sparse MVM Using 2 x 2 BCSR

void smvm_2x2(int bm, const int *b_row_start, const int *b_col_idx, const double *b_value, const double *x, double *y) { int i, j; double d0, d1, c0, c1; /* loop over block rows */ for (i = 0; i < bm; i++, y += 2) { d0 = y[i]; /* scalar replacement */ d1 = y[i+1]; /* dense micro MVM */ for (j = b_row_start[i]; j < b_row_start[i+1]; j++, b_col_idx++, b_value += 2*2) { c0 = x[b_col_idx[j]+0]; /* scalar replacement */ c1 = x[b_col_idx[j]+1]; d0 += b_value[0] * c0; d1 += b_value[2] * c0; d0 += b_value[1] * c1; d1 += b_value[3] * c1; } y[i] = d0; y[i+1] = d1; } }

slide-19
SLIDE 19

BCSR

Advantages:

  • Reuse of x and y (same as for dense MVM)
  • Reduces storage for indexes

Disadvantages:

  • Storage for values of A increased (zeros added)
  • Computational overhead (also due to zeros)

Main factors (since memory bound):

  • Plus: increased reuse on x + reduced index storage

= reduced memory traffic

  • Minus: more zeros = increased memory traffic

* =

slide-20
SLIDE 20

Which Block Size (r x c) is Optimal?

Example: about 20,000 x 20,000 matrix with perfect 8 x 8 block structure, 0.33% non-zero entries

In this case: No overhead when blocked r x c, with r,c divides 8

source: R. Vuduc, LLNL

slide-21
SLIDE 21

Speed-up through r x c Blocking

Source: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels, Int’l Journal of High Performance Comp. App., 18(1), pp. 135-158, 2004

  • machine dependent
  • hard to predict
slide-22
SLIDE 22

How to Find the Best Blocking for given A?

Best block size is hard to predict (see previous slide)

Solution 1: Searching over all r x c within a range, e.g., 1 ≤ r,c ≤ 12

  • Conversion of A in CSR to BCSR roughly as expensive as 10 SMVMs
  • Total cost: 1440 SMVMs
  • Too expensive

Solution 2: Model

  • Estimate the gain through blocking
  • Estimate the loss through blocking
  • Pick best ratio
slide-23
SLIDE 23

Model: Example

Gain by blocking (dense MVM) Overhead (average) by blocking

16/9 = 1.77 1.4 1.4/1.77 = 0.79 (no gain)

* =

Model: Doing that for all r and c and picking best

slide-24
SLIDE 24

Model

Goal: find best r x c for y = y + Ax

Gain through r x c blocking (estimation):

  • dependent on machine, independent of sparse matrix

Overhead through r x c blocking (estimation)

  • scan part of matrix A
  • independent of machine, dependent on sparse matrix

Expected gain: Gr,c/Or,c dense MVM performance in r x c BCSR dense MVM performance in CSR Gr,c = number of matrix values in r x c BCSR number of matrix values in CSR Or,c =

slide-25
SLIDE 25

Gain from Blocking (Dense Matrix in BCSR)

  • machine dependent
  • hard to predict

Source: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels, Int’l Journal of High Performance Comp. App., 18(1), pp. 135-158, 2004

row block size r row block size r column block size c column block size c

Pentium III Itanium 2

slide-26
SLIDE 26

Typical Result

BCRS model BCSR exhaustive search Analytical upper bound how obtained? (blackboard) CRS

Figure: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels, Int’l Journal of High Performance Comp. App., 18(1), pp. 135-158, 2004