Slide 1

How to Write Fast Numerical Code

Spring 2011, Lecture 7
Instructor: Markus Püschel
TA: Georg Ofenbeck

Slide 2

Last Time: Locality

Temporal and spatial locality

[Diagram: memory access patterns illustrating temporal and spatial locality]

Slide 3

Last Time: Reuse

[Plot: Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz (double precision); Gflop/s vs. matrix size]

[Plot: Discrete Fourier Transform (DFT) on 2 x Core 2 Duo 3 GHz (single precision); Gflop/s vs. input size]

MMM: O(n) reuse
FFT: O(log(n)) reuse

Slide 4

Today

Caches

Slide 5

Cache

Definition: Computer memory with short access time used for the storage of frequently or recently used instructions or data

Naturally supports temporal locality

Spatial locality is supported by transferring data in blocks

  • Core 2: one block = 64 B = 8 doubles

[Diagram: CPU <-> Cache <-> Main Memory]

Slide 6

General Cache Mechanics

[Diagram: main memory partitioned into blocks 0-15; cache holding blocks 8, 9, 14, 3]

Memory: larger, slower, cheaper; viewed as partitioned into "blocks"
Cache: smaller, faster, more expensive; holds a subset of the blocks
Data is copied between them in block-sized transfer units

Slide 7

General Cache Concepts: Hit

[Diagram: memory blocks 0-15; cache holds blocks 8, 9, 14, 3]

Data in block b is needed (Request: 14)

Block b is in the cache: hit!

Slide 8

General Cache Concepts: Miss

[Diagram: memory blocks 0-15; cache holds blocks 8, 9, 14, 3; block 12 is fetched]

Data in block b is needed (Request: 12)

Block b is not in the cache: miss!
Block b is fetched from memory and stored in the cache

  • Placement policy: determines where b goes
  • Replacement policy: determines which block gets evicted (the victim); see the sketch below
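A rough C sketch of this lookup-and-fill logic (my illustration, not from the slides; for a direct-mapped cache the placement policy is trivial and the victim is whatever line currently occupies the set):

    #include <stdint.h>
    #include <stdbool.h>

    enum { NSETS = 64, BLOCK = 64 };  /* hypothetical S and B */

    typedef struct {
        bool     valid;
        uint64_t tag;
        uint8_t  data[BLOCK];
    } line_t;

    static line_t cache[NSETS];  /* direct mapped: one line per set */

    /* Returns true on a hit. On a miss, block b can only go into set
       b mod NSETS (placement); the line already there is the victim
       (replacement). */
    bool lookup(uint64_t addr) {
        uint64_t block = addr / BLOCK;
        uint64_t set   = block % NSETS;
        uint64_t tag   = block / NSETS;
        line_t *l = &cache[set];
        if (l->valid && l->tag == tag)
            return true;               /* hit */
        l->valid = true;               /* miss: fetch block from memory ... */
        l->tag   = tag;                /* ... (copy into l->data omitted) */
        return false;
    }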

Slide 9

Types of Cache Misses (The 3 C’s)

Compulsory (cold) miss

  • Occurs on first access to a block

Capacity miss

  • Occurs when working set is larger than the cache

Conflict miss

  • Occurs when the cache is large enough, but multiple data objects all map to the same slot (see the sketch below)
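To make the conflict case concrete, here is a small C sketch (my illustration, not from the slides). Placing two arrays exactly one cache size apart makes a[i] and b[i] map to the same set for every i, so in a direct-mapped cache they evict each other on every iteration; the 32 KB size is an assumption.

    #include <stdlib.h>
    #include <string.h>

    enum { CACHE_SIZE = 32 * 1024 };          /* assumed cache size in bytes */
    enum { N = CACHE_SIZE / sizeof(double) }; /* 4096 doubles */

    double dot_conflict(void) {
        /* b starts exactly CACHE_SIZE bytes after a, so a[i] and b[i]
           map to the same cache set for every i */
        double *buf = aligned_alloc(CACHE_SIZE, 2 * CACHE_SIZE);
        double *a = buf, *b = buf + N;
        memset(buf, 0, 2 * CACHE_SIZE);
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            s += a[i] * b[i];  /* in a direct-mapped cache, a[i] and b[i]
                                  contend for the same set: conflict misses,
                                  even though only two blocks are live at a time */
        free(buf);
        return s;
    }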
Slide 10

Cache Performance Metrics

Miss Rate

  • Fraction of memory references not found in cache: misses / accesses = 1 – hit rate

Hit Time

  • Time to deliver a block in the cache to the processor
  • Core 2: 3 clock cycles for L1, 14 clock cycles for L2

Miss Penalty

  • Additional time required because of a miss
  • Core 2: about 100 cycles for L2 miss
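Together, these metrics give the standard average memory access time formula (a textbook addition, not on the slide):

    AMAT = hit time + miss rate x miss penalty

For example, with the Core 2 numbers above and an assumed L1 miss rate of 5%, taking the L2 latency as the L1 miss penalty: AMAT ≈ 3 + 0.05 x 14 ≈ 3.7 cycles, ignoring L2 misses.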
Slide 11

General Cache Organization (S, E, B)

[Diagram: cache organized as S = 2^s sets; each set holds E = 2^e lines; each line has a valid bit v, a tag, and a block of B = 2^b data bytes (byte offsets 0, 1, 2, ..., B-1)]

E = associativity; E = 1: direct mapped

Cache size: S x E x B data bytes
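Worked example (the configuration is an assumption, matching a typical 32 KB, 8-way L1 data cache with 64-byte blocks): S = 32768 / (8 x 64) = 64 sets, so s = 6 set-index bits and b = 6 block-offset bits.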

Slide 12

Cache Read

[Diagram: same cache organization (S = 2^s sets, E = 2^e lines per set, B = 2^b bytes per block); the address is split into t tag bits, s set-index bits, and b block-offset bits]

Address of word: | tag (t bits) | set index (s bits) | block offset (b bits) |

The data begins at the block offset.

  • Locate the set (set index bits)
  • Check if any line in the set has a matching tag
  • Yes + line valid: hit
  • Locate the data starting at the block offset (see the C sketch below)
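In C, the bit-slicing for a cache read looks like this (my sketch; s = b = 6 are hypothetical values):

    #include <stdint.h>

    enum { s_bits = 6, b_bits = 6 };  /* hypothetical: S = 64 sets, B = 64 bytes */

    typedef struct { uint64_t tag, set, offset; } addr_parts;

    addr_parts split_address(uint64_t addr) {
        addr_parts p;
        p.offset = addr & ((1ull << b_bits) - 1);             /* low b bits */
        p.set    = (addr >> b_bits) & ((1ull << s_bits) - 1); /* next s bits */
        p.tag    = addr >> (b_bits + s_bits);                 /* remaining t bits */
        return p;
    }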

Slide 13

Example (S=8, E=1)

double sum_array_rows(double a[16][16]) {
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)      /* row-wise traversal */
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16]) {
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)      /* column-wise traversal */
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

B = 32 byte = 4 doubles
Assume: cold (empty) cache; a[0][0] maps to the first set

Analysis on the blackboard; ignore the variables sum, i, j
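One possible version of the blackboard analysis (my reconstruction): with S = 8, E = 1, B = 32, the cache holds 8 x 32 = 256 bytes, i.e., two rows of a. Row-wise, every block of 4 doubles costs exactly one cold miss: 256/4 = 64 misses (25% miss rate). Column-wise, a[i][j] and a[i+1][j] are 128 bytes = 4 sets apart, so a fixed column alternates between only two sets; each block is evicted before its remaining 3 doubles are used, and all 256 accesses miss.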

Slide 14

Example (S=4, E=2)

Same two functions, sum_array_rows and sum_array_cols, as on the previous slide, now with a 2-way set associative cache.

B = 32 byte = 4 doubles
Assume: cold (empty) cache; a[0][0] maps to the first set

Analysis on the blackboard; ignore the variables sum, i, j
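Here (my reconstruction, to be checked on the blackboard): the capacity is again 4 x 2 x 32 = 256 bytes, but now all 16 blocks of a fixed column map to the same set, because the 128-byte row stride covers the 4 sets exactly. The 16 blocks thrash in the 2 available ways, so column-wise traversal still misses on every access, while row-wise traversal stays at 64 misses.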

Slide 15

What about writes?

What to do on a write-hit?

  • Write-through: write immediately to memory
  • Write-back: defer write to memory until replacement of line (needs a dirty bit)

What to do on a write-miss?

  • Write-allocate: load into cache, update line in cache
  • No-write-allocate: write immediately to memory

Core 2:

  • Write-back + Write-allocate
Slide 16

Small Example, Part 1

Cache:

E = 1 (direct mapped) S = 2 B = 16 (2 doubles)

Array (accessed twice in example)

x = x[0], …, x[7]

% Matlab-style pseudocode
for j = 0:1
    for i = 0:7
        access(x[i])
    end
end

Access pattern: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Hit/Miss:       M H M H M H M H M H M H M H M H

Result: 8 misses, 8 hits
Spatial locality: yes
Temporal locality: no

Slide 17

Small Example, Part 2

Cache:

E = 1 (direct mapped) S = 2 B = 16 (2 doubles)

Array (accessed twice in example)

x = x[0], …, x[7]

% Matlab-style pseudocode
for j = 0:1
    for i = 0:2:7
        access(x[i])
    end
    for i = 1:2:7
        access(x[i])
    end
end

Access pattern: 0 2 4 6 1 3 5 7 0 2 4 6 1 3 5 7
Hit/Miss:       M M M M M M M M M M M M M M M M

Result: 16 misses
Spatial locality: no
Temporal locality: no

Slide 18

Small Example, Part 3

Cache:

E = 1 (direct mapped) S = 2 B = 16 (2 doubles)

Array (accessed twice in example)

x = x[0], …, x[7]

% Matlab-style pseudocode (blocked)
for j = 0:1
    for k = 0:1
        for i = 0:3
            access(x[i + 4*j])
        end
    end
end

Access pattern: 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7
Hit/Miss:       M H M H H H H H M H M H H H H H

Result: 4 misses, 12 hits (this is optimal; why?)
Spatial locality: yes
Temporal locality: yes

Slide 19

The Killer: Two-Power Strided Access

x = x[0], …, x[n-1], n >> cache size

Stride 1 (0 1 2 3 ...): spatial locality; full cache used
Stride 2 (0 2 4 6 ...): some spatial locality; 1/2 of the cache used
Stride 4 (0 4 8 12 ...): no spatial locality; 1/4 of the cache used
Stride 8 (0 8 16 24 ...): no spatial locality; 1/8 of the cache used
Stride 4S (0 4S 8S 12S ...): no spatial locality; 1/(4S) of the cache used, since all accesses map to a single set (S sets, E-way associative; here E = 2)

The same holds for larger strides (see the sketch below).
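A minimal C sketch of such a strided traversal (my illustration); timing it for stride = 1, 2, 4, 8, ... on most machines shows the performance drop described above:

    #include <stddef.h>

    /* Sum every stride-th element of x. Two-power strides at or above the
       number of doubles per block get no spatial locality, and large
       two-power strides concentrate all accesses in a few cache sets. */
    double strided_sum(const double *x, size_t n, size_t stride) {
        double s = 0.0;
        for (size_t i = 0; i < n; i += stride)
            s += x[i];
        return s;
    }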

Slide 20

The Killer: Where Does It Occur?

Accessing 2D arrays of two-power size (e.g., images) column-wise

  • 2D transforms
  • Stencil computations
  • Correlations

Various transform algorithms

  • Fast Fourier transform
  • Wavelet transforms
  • Filter banks
Slide 21

Today

Linear algebra software: history, LAPACK and BLAS

Blocking: key to performance

MMM

ATLAS: MMM program generator

Slide 22

Linear Algebra Algorithms: Examples

Solving systems of linear equations

Eigenvalue problems

Singular value decomposition

LU/Cholesky/QR/… decompositions

… and many others

Make up most of the numerical computation across disciplines (sciences, computer science, engineering)

Efficient software is extremely relevant

Slide 23

The Path to LAPACK

EISPACK and LINPACK

  • Libraries for linear algebra algorithms
  • Developed in the early 70s
  • Jack Dongarra, Jim Bunch, Cleve Moler, Pete Stewart, …
  • LINPACK is still used as the benchmark for the TOP500 list of the most powerful supercomputers

Problem:

  • Implementation “vector-based,” i.e., little locality in data access
  • Low performance on computers with deep memory hierarchy
  • Became apparent in the 80s

Solution: LAPACK

  • Reimplement the algorithms “block-based,” i.e., with locality
  • Developed late 1980s, early 1990s
  • Jim Demmel, Jack Dongarra et al.
Slide 24

LAPACK and BLAS

Basic Idea:

Basic Linear Algebra Subroutines (BLAS)

  • BLAS 1: vector-vector operations (e.g., vector sum)
  • BLAS 2: matrix-vector operations (e.g., matrix-vector product)
  • BLAS 3: matrix-matrix operations (e.g., MMM)

LAPACK implemented on top of BLAS

  • Using BLAS 3 as much as possible

[Diagram: LAPACK (static) implemented on top of BLAS (reimplemented for each platform); BLAS 1 and BLAS 2: O(1) reuse, BLAS 3: O(n) reuse]
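To make the three levels concrete, here is a hedged C sketch using the CBLAS interface (assumes a BLAS implementation such as OpenBLAS is linked; matrices are n x n, row-major):

    #include <cblas.h>

    void blas_levels_demo(int n, double *x, double *y,
                          double *A, double *B, double *C) {
        /* BLAS 1 (vector-vector): y = 2.0*x + y; O(n) work on O(n) data */
        cblas_daxpy(n, 2.0, x, 1, y, 1);

        /* BLAS 2 (matrix-vector): y = A*x + y; O(n^2) work on O(n^2) data */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, A, n, x, 1, 1.0, y, 1);

        /* BLAS 3 (matrix-matrix): C = A*B + C; O(n^3) work on O(n^2) data,
           hence O(n) reuse per element */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A, n, B, n, 1.0, C, n);
    }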

Slide 25

Why is BLAS3 so important?

Using BLAS3 = blocking = enabling reuse

Cache analysis for blocking MMM (blackboard)

Blocking (for the memory hierarchy) is the single most important optimization for dense linear algebra algorithms (see the sketch below)
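A minimal sketch of blocked MMM in C (my illustration; the block size NB is a tunable assumption, chosen so three NB x NB blocks fit in cache):

    #define NB 32  /* assumed block size */

    /* C = C + A*B for n x n row-major matrices, n divisible by NB.
       Each NB x NB block of A and B is reused NB times once loaded,
       instead of being re-fetched from memory on every use. */
    void mmm_blocked(int n, const double *A, const double *B, double *C) {
        for (int i0 = 0; i0 < n; i0 += NB)
            for (int j0 = 0; j0 < n; j0 += NB)
                for (int k0 = 0; k0 < n; k0 += NB)
                    /* mini-MMM on one block triple */
                    for (int i = i0; i < i0 + NB; i++)
                        for (int k = k0; k < k0 + NB; k++) {
                            double a = A[i*n + k];
                            for (int j = j0; j < j0 + NB; j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
    }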

Unfortunately, the introduction of multicore processors requires a reimplementation of LAPACK: just multithreading the BLAS is not good enough.