SLIDE 1

LAWRENCE BERKELEY NATIONAL LABORATORY

Future Technologies Group

A Generalized Framework for Auto-tuning Stencil Computations

Shoaib Kamil (1,3), Cy Chan (4), Samuel Williams (1), Leonid Oliker (1), John Shalf (1,2), Mark Howison (3), E. Wes Bethel (1), Prabhat (1)

(1) Lawrence Berkeley National Laboratory (LBNL)
(2) National Energy Research Scientific Computing Center (NERSC)
(3) EECS Department, University of California, Berkeley (UCB)
(4) CSAIL, Massachusetts Institute of Technology (MIT)

SAKamil@lbl.gov

SLIDE 2

The Challenge: Productive Implementation of an Auto-tuner

SLIDE 3

Conventional Optimization

  • Take one kernel/application:
      • Perform some analysis of it
      • Research the literature for appropriate optimizations
      • Implement a couple of them by hand, optimizing for one target machine
      • Iterate a couple of times
  • Result: improved performance for one kernel on one computer.

SLIDE 4

Conventional Auto-tuning

  • Automate the code generation and tuning process:
      • Perform some analysis of the kernel
      • Research the literature for appropriate optimizations
      • Implement a code generator and search benchmark
      • Explore the optimization space
      • Report the best implementation/parameters
  • Result: significantly improved performance for one kernel on any computer, i.e. performance portability.
  • Downside:
      • Auto-tuner creation time is substantial
      • Must reinvent the wheel for every kernel

SLIDE 5

Generalized Frameworks for Auto-tuning

  • Integrate some of the code transformation features of a compiler with the domain-specific optimization knowledge of an auto-tuner:
      • Parse high-level source
      • Apply transformations allowed by the domain, but not necessarily safe based on language semantics alone
      • Generate code + auto-tuning benchmark
      • Explore the optimization space
      • Report the best implementation/parameters
  • Result: significantly improved performance for any kernel on any computer within a domain or motif, i.e. performance portability without sacrificing productivity.

SLIDE 6

Outline

1. Stencils
2. Machines
3. Framework
4. Results
5. Conclusions

SLIDE 7

Benchmark Stencils

  • Laplacian
  • Divergence
  • Gradient
  • Bilateral Filtering

SLIDE 8

What’s a stencil?

  • Nearest-neighbor computations on structured grids (1D…ND arrays)
  • Stencils from PDEs are often a weighted linear combination of neighboring values
  • There are cases where the weights vary in space/time
  • A stencil can also result in a table lookup
  • Stencils can be nonlinear operators
  • Caveat: we only examine implementations like Jacobi’s method (i.e. separate read and write arrays)

[Figure: 7-point stencil showing the point (i,j,k) and its six neighbors (i±1,j,k), (i,j±1,k), (i,j,k±1)]

SLIDE 9

Laplacian Differential Operator

  • 7-point stencil on a scalar grid, produces a scalar grid
  • Substantial reuse (+ high working set size)
  • Memory-intensive kernel
  • Elimination of capacity misses may improve performance by 66%

[Figure: 7-point stencil reading u at (i,j,k) and its six neighbors from read_array[ ] and writing u’ to write_array[ ]; the memory offsets between neighbors are labeled "x dimension" and "xy product"]
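A minimal sketch of such a sweep in C, assuming a flat array layout with x as the unit-stride dimension. The function name `laplacian_sweep` and the weights `alpha` and `beta` are illustrative and do not come from the slides. It follows the Jacobi-style convention of separate read and write arrays and updates interior points only.

```c
#include <assert.h>
#include <stddef.h>

/* One Jacobi-style sweep of a 7-point stencil over the interior of an
 * nx*ny*nz grid. u is the read array, u_next the write array.
 * alpha weights the center point and beta the six neighbors
 * (both are hypothetical parameters chosen for illustration). */
void laplacian_sweep(const double *u, double *u_next,
                     size_t nx, size_t ny, size_t nz,
                     double alpha, double beta)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    const size_t sy = nx;       /* offset to the j+1 neighbor: x dimension */
    const size_t sz = nx * ny;  /* offset to the k+1 neighbor: xy product  */

    for (size_t k = 1; k < nz - 1; k++)
        for (size_t j = 1; j < ny - 1; j++)
            for (size_t i = 1; i < nx - 1; i++) {
                const size_t c = i + sy * j + sz * k;
                u_next[c] = alpha * u[c]
                          + beta * (u[c - 1]  + u[c + 1]
                                  + u[c - sy] + u[c + sy]
                                  + u[c - sz] + u[c + sz]);
            }
}
```

With `alpha = -6` and `beta = 1` this is the standard unscaled discrete Laplacian, so a constant input grid yields zero at every interior point. The k±1 neighbors are a full xy plane apart in memory, which is why the working set grows with the plane size.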

SLIDE 10

Divergence Differential Operator

  • 6-point stencil on a vector grid, produces a scalar grid
  • Low reuse per component
  • Only the z-component demands a large working set
  • Memory-intensive kernel
  • Elimination of capacity misses may improve performance by 40%

[Figure: 6-point stencil reading the x, y and z components from read_array[ ][ ] at the six neighbors of (i,j,k) and writing u to write_array[ ]; the memory offsets between neighbors are labeled "x dimension" and "xy product"]
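A hedged sketch of a divergence sweep in C, assuming the vector grid is stored as three separate component arrays and that central differences are used. The name `divergence_sweep` and the scale factors `hx`, `hy`, `hz` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Divergence of a vector field (x, y, z components in separate arrays)
 * written to the scalar grid u. hx, hy and hz are hypothetical scale
 * factors, e.g. 1/(2*dx) for a central difference. */
void divergence_sweep(const double *x, const double *y, const double *z,
                      double *u, size_t nx, size_t ny, size_t nz,
                      double hx, double hy, double hz)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    const size_t sy = nx;       /* x dimension */
    const size_t sz = nx * ny;  /* xy product  */

    for (size_t k = 1; k < nz - 1; k++)
        for (size_t j = 1; j < ny - 1; j++)
            for (size_t i = 1; i < nx - 1; i++) {
                const size_t c = i + sy * j + sz * k;
                u[c] = hx * (x[c + 1]  - x[c - 1])
                     + hy * (y[c + sy] - y[c - sy])
                     + hz * (z[c + sz] - z[c - sz]);
            }
}
```

Each component array is read at only two points per output, which matches the low reuse per component noted above. Only the z component is read a full xy plane on either side of the center, so it alone needs a large working set.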

SLIDE 11

Gradient Differential Operator

  • 6-point stencil on a scalar grid, produces a vector grid
  • High reuse (like Laplacian)
  • High working set size
  • One read stream plus three write streams (+ three write-allocate streams) = 7 total streams
  • Memory-intensive kernel
  • Elimination of capacity misses may improve performance by 30%

[Figure: 6-point stencil reading u from read_array[ ] at the six neighbors of (i,j,k) and writing the x, y and z components to write_array[ ][ ]; the memory offsets between neighbors are labeled "x dimension" and "xy product"]
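A comparable sketch for the gradient, again assuming separate component arrays and central differences. The name `gradient_sweep` and the scale factors are illustrative. The three stores per point are the three write streams counted above.

```c
#include <assert.h>
#include <stddef.h>

/* Gradient of the scalar grid u written to three component arrays.
 * hx, hy and hz are hypothetical scale factors, e.g. 1/(2*dx). */
void gradient_sweep(const double *u,
                    double *gx, double *gy, double *gz,
                    size_t nx, size_t ny, size_t nz,
                    double hx, double hy, double hz)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    const size_t sy = nx;       /* x dimension */
    const size_t sz = nx * ny;  /* xy product  */

    for (size_t k = 1; k < nz - 1; k++)
        for (size_t j = 1; j < ny - 1; j++)
            for (size_t i = 1; i < nx - 1; i++) {
                const size_t c = i + sy * j + sz * k;
                gx[c] = hx * (u[c + 1]  - u[c - 1]);   /* write stream 1 */
                gy[c] = hy * (u[c + sy] - u[c - sy]);  /* write stream 2 */
                gz[c] = hz * (u[c + sz] - u[c - sz]);  /* write stream 3 */
            }
}
```

On a write-allocate cache, each of the three output arrays is typically read into cache before it is overwritten, which is where the extra three streams come from.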

SLIDE 12

3D Bilateral Filtering

  • Extracted from a medical imaging application (MRI processing)
  • Normal Gaussian stencils smooth images, but destroy sharp edges.
  • This kernel performs anisotropic filtering, thus preserving edges.
  • We may scale the size of the stencil (radius=3,5):
      • 7^3-point (343-point) or 11^3-point (1331-point) stencils
      • applied to a dataset of 192 slices of 256 x 256 voxels
      • originally 8-bit grayscale voxels, but processed as 32-bit floats

SLIDE 13

3D Bilateral Filtering (pseudo code)

  • Each point in the stencil mandates a voxel-dependent indirection, and each stencil also requires one divide.

    for all points (xyz) in x,y,z {
        voxelSum = 0
        weightSum = 0
        srcVoxel = src[xyz]
        for all neighbors (ijk) within radius of xyz {
            neighborVoxel = src[ijk]
            neighborWeight = table2[ijk] * table1[neighborVoxel - srcVoxel]
            voxelSum += neighborWeight * neighborVoxel
            weightSum += neighborWeight
        }
        dstVoxel = voxelSum / weightSum
    }

  • Large radii result in extremely compute-intensive kernels with large working sets.
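The pseudo code above could be written in C roughly as follows. This is a sketch under stated assumptions, not the original implementation: the neighborhood is taken to be the full (2*radius+1)^3 cube, voxels are floats holding 8-bit values (0 to 255), `spatial_w` plays the role of `table2`, and `range_w` plays the role of `table1` with 511 entries indexed by intensity difference plus 255. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Bilateral filter over the interior of an nx*ny*nz volume.
 * spatial_w: (2*radius+1)^3 weights indexed by neighbor offset (table2).
 * range_w:   511 weights indexed by (neighbor - center) + 255 (table1). */
void bilateral_filter(const float *src, float *dst,
                      int nx, int ny, int nz, int radius,
                      const float *spatial_w, const float *range_w)
{
    assert(radius >= 1);
    assert(nx > 2 * radius && ny > 2 * radius && nz > 2 * radius);
    const int width = 2 * radius + 1;

    for (int z = radius; z < nz - radius; z++)
    for (int y = radius; y < ny - radius; y++)
    for (int x = radius; x < nx - radius; x++) {
        const size_t center =
            (size_t)x + (size_t)nx * ((size_t)y + (size_t)ny * (size_t)z);
        const float src_voxel = src[center];
        float voxel_sum = 0.0f;
        float weight_sum = 0.0f;

        for (int dz = -radius; dz <= radius; dz++)
        for (int dy = -radius; dy <= radius; dy++)
        for (int dx = -radius; dx <= radius; dx++) {
            const size_t nb = (size_t)(x + dx)
                + (size_t)nx * ((size_t)(y + dy) + (size_t)ny * (size_t)(z + dz));
            const int t = (dx + radius)
                + width * ((dy + radius) + width * (dz + radius));
            const float neighbor = src[nb];
            /* Voxel-dependent indirection: the table index depends on data. */
            const int diff = (int)neighbor - (int)src_voxel;
            const float w = spatial_w[t] * range_w[diff + 255];
            voxel_sum += w * neighbor;
            weight_sum += w;
        }
        dst[center] = voxel_sum / weight_sum;  /* one divide per stencil */
    }
}
```

With all weights equal to one, the filter reduces to a plain box average, which is a convenient sanity check.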

SLIDE 14

Benchmark Machines

SLIDE 15

Multicore SMPs

  • Experiments only explored parallelism within an SMP
  • We use a Sun X2200 M2 as a proxy for the XT5 (e.g. Jaguar)
  • We use a Nehalem machine as a proxy for possible future Cray machines.
  • Barcelona/Nehalem are NUMA

[Figure: block diagrams of the three machines]

  • AMD Budapest (XT4): one socket with four Opteron cores, 512K cache per core, 2MB victim cache, SRI / xbar, 2x64b memory controllers, 800MHz DDR2 DIMMs, 12.8 GB/s
  • AMD Barcelona (X2200 M2): two sockets, each with four Opteron cores, 512K cache per core, 2MB victim cache, SRI / xbar, 2x64b memory controllers, 667MHz DDR2 DIMMs, 10.66 GB/s; HyperTransport between sockets at 4 GB/s (each direction)
  • Intel Nehalem (X5550): two sockets, each with four multithreaded (MT) cores, 256K cache per core, 8MB shared L3, 3x64b memory controllers, 6 x 1066MHz DDR3 DIMMs, 25.6 GB/s; QuickPath between sockets at 16 GB/s (each direction)

SLIDE 16

Generalized Framework for Auto-tuning Stencils

Copy and Paste auto-tuning

SLIDE 17

Overview

Given an F95 implementation of an application:

1. Programmer annotates target stencil loop nests
2. Auto-tuning system:
  • converts the Fortran implementation into an internal representation (AST)
  • builds a test harness
  • Strategy Engine iterates on:
      • apply optimization to internal representation
      • backend generation of optimized C code
      • compile C code
      • benchmark C code
  • using the best implementation, automatically produces a library for that kernel/machine combination
3. Programmer then updates the application to call the optimized library routine
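The benchmark-and-select step of this loop can be sketched in C as below. This is a simplification: the framework described here generates and compiles a separate C variant per candidate, whereas the sketch only varies a run-time block size of one kernel. The names `kernel_fn`, `demo_ctx`, `demo_kernel` and `tune` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: run the variant selected by `block`. */
typedef void (*kernel_fn)(size_t block, void *ctx);

/* Hypothetical context for the demo kernel below. */
typedef struct {
    const double *data;
    size_t n;
    double sum;
} demo_ctx;

/* Demo kernel: sums an array in chunks of `block` elements. */
void demo_kernel(size_t block, void *ctx)
{
    demo_ctx *d = ctx;
    double total = 0.0;
    for (size_t start = 0; start < d->n; start += block) {
        const size_t end = start + block < d->n ? start + block : d->n;
        for (size_t i = start; i < end; i++)
            total += d->data[i];
    }
    d->sum = total;
}

/* Exhaustive search: time every candidate and return the fastest. */
size_t tune(kernel_fn run, void *ctx,
            const size_t *candidates, size_t ncand, int trials)
{
    assert(ncand > 0 && trials > 0);
    size_t best = candidates[0];
    double best_time = -1.0;

    for (size_t c = 0; c < ncand; c++) {
        assert(candidates[c] > 0);
        const clock_t start = clock();
        for (int t = 0; t < trials; t++)
            run(candidates[c], ctx);
        const double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = candidates[c];
        }
    }
    return best;
}
```

`clock()` measures processor time, which is adequate for a single-threaded sketch; a multithreaded benchmark would need a wall-clock timer instead.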

SLIDE 18

Strategy Engine: Auto-parallelization

  • The strategy engines can auto-parallelize cache blocks among hardware thread contexts.
  • We use a single-program, multiple-data (SPMD) model implemented with POSIX Threads (Pthreads).
  • All threads are created at the beginning of the application.
  • We also produce an initialization routine that exploits the first touch policy to ensure proper NUMA-aware allocation.
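A minimal sketch of a first-touch initialization routine with Pthreads, assuming the operating system places each page on the NUMA node of the thread that first writes it. The names `slab_t`, `first_touch`, `numa_aware_alloc` and `MAX_THREADS` are illustrative. Unlike the scheme described above, the sketch creates and joins its threads inside the routine and does not pin them to cores, so it only illustrates the access pattern.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 64  /* hypothetical upper bound for this sketch */

typedef struct {
    double *grid;
    size_t begin, end;  /* half-open range [begin, end) owned by a thread */
} slab_t;

/* Each thread writes its own slab first, so those pages are placed
 * near the thread that touched them under a first-touch policy. */
static void *first_touch(void *arg)
{
    slab_t *s = arg;
    for (size_t n = s->begin; n < s->end; n++)
        s->grid[n] = 0.0;
    return NULL;
}

/* Allocate nelems doubles and zero them in parallel, one contiguous
 * slab per thread. Returns NULL on allocation or thread-creation failure. */
double *numa_aware_alloc(size_t nelems, int nthreads)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    double *grid = malloc(nelems * sizeof *grid);
    if (grid == NULL)
        return NULL;

    pthread_t tid[MAX_THREADS];
    slab_t slab[MAX_THREADS];
    int started = 0;
    for (int t = 0; t < nthreads; t++) {
        slab[t].grid = grid;
        slab[t].begin = nelems * (size_t)t / (size_t)nthreads;
        slab[t].end = nelems * (size_t)(t + 1) / (size_t)nthreads;
        if (pthread_create(&tid[t], NULL, first_touch, &slab[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);

    if (started != nthreads) {
        free(grid);
        return NULL;
    }
    return grid;
}
```

For the placement to pay off, the compute threads must later work on the same slabs they initialized, and they must stay on the same socket.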

SLIDE 19

Strategy Engine: Auto-tuning Optimizations

[Figure: (a) decomposition of a node block (NX x NY x NZ) into a chunk of core blocks (CX x CY x CZ); (b) decomposition into thread blocks (TX x TY x CZ); (c) decomposition into register blocks (RX x RY x RZ). Axes: +X (unit stride), +Y, +Z]

  • Strategy Engine explores a number of auto-tuning optimizations:
      • loop unrolling/register blocking
      • cache blocking
      • constant propagation / common subexpression elimination
  • Future Work:
      • cache bypass (e.g. movntpd)
      • software prefetching
      • SIMD intrinsics
      • data structure transformations
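Cache blocking can be sketched as follows for a 7-point stencil. The block sizes `cx`, `cy` and `cz` are the tunable parameters; the function name, the weights `alpha` and `beta`, and the loop order are assumptions made for illustration and are not the generated code.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Cache-blocked Jacobi sweep of a 7-point stencil. The interior is
 * tiled into blocks of at most cx*cy*cz points, and each block is
 * swept completely before moving on to the next one. */
void laplacian_blocked(const double *u, double *u_next,
                       size_t nx, size_t ny, size_t nz,
                       size_t cx, size_t cy, size_t cz,
                       double alpha, double beta)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    assert(cx > 0 && cy > 0 && cz > 0);
    const size_t sy = nx;       /* x dimension */
    const size_t sz = nx * ny;  /* xy product  */

    for (size_t kk = 1; kk < nz - 1; kk += cz)
    for (size_t jj = 1; jj < ny - 1; jj += cy)
    for (size_t ii = 1; ii < nx - 1; ii += cx) {
        const size_t kend = min_sz(kk + cz, nz - 1);
        const size_t jend = min_sz(jj + cy, ny - 1);
        const size_t iend = min_sz(ii + cx, nx - 1);
        for (size_t k = kk; k < kend; k++)
        for (size_t j = jj; j < jend; j++)
        for (size_t i = ii; i < iend; i++) {
            const size_t c = i + sy * j + sz * k;
            u_next[c] = alpha * u[c]
                      + beta * (u[c - 1]  + u[c + 1]
                              + u[c - sy] + u[c + sy]
                              + u[c - sz] + u[c + sz]);
        }
    }
}
```

Because the read and write arrays are separate, the blocks can be visited in any order and the result is identical to the unblocked sweep. Only the memory access order changes, which is what an auto-tuner searches over.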

SLIDE 20

Experimental Results

NOTE: threads are ordered to exploit multiple threads within a core first (Nehalem only), then multiple cores, then multiple sockets (Barcelona/Nehalem).

SLIDE 21

Laplacian Performance

  • On the memory-bound architecture (Barcelona), auto-parallelization doesn’t make a difference.
  • Auto-tuning enables scalability.
  • Barcelona is faster than the XT4 in proportion to its memory bandwidth.
  • Nehalem is ~2.5x faster than Barcelona, and 4x faster than the XT4.
  • Auto-parallelization plus tuning significantly outperforms OpenMP.

[Figure: GFlop/s versus thread count for the XT4 (1 to 4 threads), Barcelona (1 to 8 threads) and Nehalem (1 to 16 threads); series: serial reference, auto-parallelization, Auto-NUMA, auto-tuning, OpenMP comparison]

SLIDE 22

Divergence Performance

  • No changes to the framework were required (just drop in F95 code)
  • As there was less reuse in Divergence than in Laplacian, there are fewer capacity misses, so auto-tuning has less to improve upon.
  • Nehalem is ~2.5x faster than Barcelona

[Figure: GFlop/s versus thread count for the XT4 (1 to 4 threads), Barcelona (1 to 8 threads) and Nehalem (1 to 16 threads); series: serial reference, auto-parallelization, Auto-NUMA, auto-tuning, OpenMP comparison]

SLIDE 23

Gradient Performance

  • No changes to the framework were required (just drop in F95 code)
  • Gradient has moderate reuse, but a large number of output streams.
  • Performance gains from auto-tuning are moderate (25-35%)
  • Parallelization is only valuable in conjunction with auto-tuning

[Figure: GFlop/s versus thread count for the XT4 (1 to 4 threads), Barcelona (1 to 8 threads) and Nehalem (1 to 16 threads); series: serial reference, auto-parallelization, Auto-NUMA, auto-tuning, OpenMP comparison]

SLIDE 24

3D Bilateral Filter Performance (radius=3)

  • No changes to the framework were required (just drop in F95 code)
  • Essentially a 7x7x7 (343-pt) stencil
  • Performance is tied much more closely to clock rate (GHz) than to memory bandwidth (GB/s).
  • Auto-parallelization yielded near-perfect parallel efficiency with respect to cores on Barcelona/Nehalem (Nehalem has HyperThreading)
  • Auto-tuning significantly outperformed OpenMP (75% on Nehalem)

[Figure: GFlop/s versus thread count for the XT4 (1 to 4 threads), Barcelona (1 to 8 threads) and Nehalem (1 to 16 threads); series: serial reference, auto-parallelization, auto-tuning, OpenMP comparison]

SLIDE 25

3D Bilateral Filter Performance (radius=5)

  • Basically the same story as radius=3
  • XT4/Nehalem delivered approximately the same performance as they did with radius=3
  • Barcelona delivered somewhat better performance.

[Figure: GFlop/s versus thread count for the XT4 (1 to 4 threads), Barcelona (1 to 8 threads) and Nehalem (1 to 16 threads); series: serial reference, auto-parallelization, auto-tuning, OpenMP comparison]

SLIDE 26

Summary

SLIDE 27

Summary: Framework for auto-tuning stencils

  • Dramatic step forward in auto-tuning technology
  • Although the framework required substantial up-front work, it provides performance portability across the breadth of architectures AND stencil kernels.
  • Delivers very good performance, well in excess of OpenMP.
  • Future work will examine further relevant optimizations, e.g. cache bypass would significantly improve gradient performance.

SLIDE 28

Summary: Machine Comparison

[Figure: four bar charts of GFlop/s per machine, one each for Laplacian, Divergence, Gradient and Bilateral Filter]

  • Barcelona delivers better performance in proportion to its memory bandwidth on the memory-intensive differential operators.
  • Surprisingly, Barcelona delivers ~2.5x better performance on the compute-intensive bilateral filter.
  • Nehalem clearly sustains dramatically better performance than either Opteron.
  • Despite having only a 15% faster clock, Nehalem realizes much better bilateral filter performance.

SLIDE 29

Acknowledgements

  • Research supported by DOE Office of Science under contract number DE-AC02-05CH11231
  • Microsoft (Award #024263)
  • Intel (Award #024894)
  • U.C. Discovery Matching Funds (Award #DIG07-10227)
  • All XT4 simulations were performed on the XT4 (Franklin) at the National Energy Research Scientific Computing Center (NERSC)

SLIDE 30

Questions?