
SLIDE 1

Computational Research Division

Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

Samuel Williams (1,2), Jonathan Carter (2), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2)

(1) University of California, Berkeley   (2) Lawrence Berkeley National Laboratory

samw@eecs.berkeley.edu

SLIDE 2

Motivation

 Multicore is the de facto solution for improving peak performance for the next decade

 How do we ensure this applies to sustained performance as well?

 Processor architectures are extremely diverse, and compilers can rarely fully exploit them

 We therefore need a HW/SW solution that delivers performance without completely sacrificing productivity

SLIDE 3

Overview

 Examined the Lattice Boltzmann Magnetohydrodynamics (LBMHD) application

 Present and analyze two threaded & auto-tuned implementations

 Benchmarked performance across 5 diverse multicore microarchitectures

  • Intel Xeon (Clovertown)
  • AMD Opteron (rev.F)
  • Sun Niagara2 (Huron)
  • IBM QS20 Cell Blade (PPEs)
  • IBM QS20 Cell Blade (SPEs)

 We show

  • Auto-tuning can significantly improve application performance
  • Cell consistently delivers good performance and efficiency
  • Niagara2 delivers good performance and productivity
SLIDE 4

Multicore SMPs used

SLIDE 5

Multicore SMP Systems

[Block diagrams of the four SMPs (Intel Xeon Clovertown, AMD Opteron rev.F, Sun Niagara2 Huron, IBM QS20 Cell Blade) showing cores, caches, on-chip interconnect, and DRAM bandwidths.]

SLIDE 6

Multicore SMP Systems

(memory hierarchy)

[Same four block diagrams as the previous slide: IBM QS20 Cell Blade, Sun Niagara2 (Huron), AMD Opteron (rev.F), Intel Xeon (Clovertown).]

Annotation: conventional cache-based memory hierarchy.

SLIDE 7

Multicore SMP Systems

(memory hierarchy)

[Same four block diagrams as before: IBM QS20 Cell Blade, Sun Niagara2 (Huron), AMD Opteron (rev.F), Intel Xeon (Clovertown).]

Annotations: conventional cache-based memory hierarchy vs. disjoint local-store memory hierarchy (Cell SPEs).

SLIDE 8

Multicore SMP Systems

(memory hierarchy)

[Same four block diagrams as before: IBM QS20 Cell Blade, Sun Niagara2 (Huron), AMD Opteron (rev.F), Intel Xeon (Clovertown).]

Annotations: Cache + Pthreads implementations vs. local store + libspe implementations.

SLIDE 9

Multicore SMP Systems

(peak flops)

[Same four block diagrams as before: IBM QS20 Cell Blade, Sun Niagara2 (Huron), AMD Opteron (rev.F), Intel Xeon (Clovertown).]

Peak double-precision flops: Intel Xeon (Clovertown) 75 Gflop/s; AMD Opteron (rev.F) 17 Gflop/s; Sun Niagara2 (Huron) 11 Gflop/s; IBM Cell Blade PPEs 13 Gflop/s, SPEs 29 Gflop/s.

SLIDE 10

Multicore SMP Systems

(peak DRAM bandwidth)

[Same four block diagrams as before: IBM QS20 Cell Blade, Sun Niagara2 (Huron), AMD Opteron (rev.F), Intel Xeon (Clovertown).]

Peak DRAM bandwidth: Intel Xeon (Clovertown) 21 GB/s read + 10 GB/s write; AMD Opteron (rev.F) 21 GB/s; Sun Niagara2 (Huron) 42 GB/s read + 21 GB/s write; IBM Cell Blade 51 GB/s.

SLIDE 11

Auto-tuning

SLIDE 12

Auto-tuning

 Hand-optimizing each architecture/dataset combination is not feasible

 Our auto-tuning approach finds a good solution through a combination of heuristics and exhaustive search

  • A Perl script generates many candidate kernels
  • (including SIMD-optimized kernels)
  • An auto-tuning benchmark times the kernels and reports back the best one for the current architecture/dataset/compiler/…
  • Performance depends on the optimizations generated
  • Heuristics are often desirable when the search space isn't tractable

 Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI)
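As a concrete illustration of the search half of this approach, here is a minimal C sketch of a timing driver that picks the fastest generated kernel variant. The `variants` table, `num_variants`, and the kernel signature are hypothetical stand-ins for whatever the Perl generator actually emits.

```c
/* Minimal auto-tuning driver sketch: time every generated kernel
 * variant on the target machine and keep the fastest one.            */
#include <stdio.h>
#include <time.h>

typedef void (*kernel_fn)(int vl);   /* hypothetical kernel signature       */
extern kernel_fn variants[];         /* table emitted by the code generator */
extern int       num_variants;

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int tune(int vl)
{
    int    best = 0;
    double best_time = 1e30;
    for (int i = 0; i < num_variants; i++) {
        double t0 = seconds();
        variants[i](vl);             /* run one candidate kernel            */
        double t = seconds() - t0;
        if (t < best_time) { best_time = t; best = i; }
    }
    printf("best variant: %d (%.4f s)\n", best, best_time);
    return best;                     /* production runs use this variant    */
}
```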

SLIDE 13

Introduction to LBMHD

SLIDE 14

Introduction to Lattice Methods

 Structured grid code, with a series of time steps

 Popular in CFD (allows for complex boundary conditions)

 Overlay a higher-dimensional phase space

  • Simplified kinetic model that maintains the macroscopic quantities
  • Distribution functions (e.g. 5-27 velocities per point in space) are used to reconstruct macroscopic quantities
  • Significant memory capacity requirements

[Figure: numbering of the 27 lattice velocities on +X/+Y/+Z axes.]

SLIDE 15

LBMHD

(general characteristics)

 Plasma turbulence simulation

 Couples CFD with Maxwell's equations

 Two distributions:

  • momentum distribution (27 scalar velocities)
  • magnetic distribution (15 vector velocities)

 Three macroscopic quantities:

  • Density
  • Momentum (vector)
  • Magnetic Field (vector)

[Figures: the momentum distribution (27 scalar velocities), the magnetic distribution (15 vector velocities), and the macroscopic variables, each drawn on +X/+Y/+Z axes.]

SLIDE 16

LBMHD

(flops and bytes)

 Must read 73 doubles and update 79 doubles per point in space (a minimum of ~1200 bytes)

 Requires about 1300 floating-point operations per point in space

 Flop:byte ratio

  • 0.71 (write-allocate architectures)
  • 1.07 (ideal)

 Rule of thumb for LBMHD: architectures with more flops than bandwidth are likely memory bound (e.g. Clovertown)
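A quick check of those ratios, assuming 8-byte doubles and that a write-allocate architecture first reads every line it is about to write:

```latex
\frac{1300\ \text{flops}}{(73+79)\times 8\ \text{bytes}} \approx 1.07
\qquad\qquad
\frac{1300\ \text{flops}}{(73+79+79)\times 8\ \text{bytes}} \approx 0.70
```

The second figure matches the quoted 0.71 up to rounding of the ~1300-flop count.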

SLIDE 17

LBMHD

(implementation details)

 Data structure choices (see the sketch below):

  • Array of Structures: no spatial locality, strided access
  • Structure of Arrays: a huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well

 Parallelization

  • The original Fortran version used MPI to communicate between tasks: a bad match for multicore
  • The version in this work uses pthreads within a node, and MPI for inter-node communication
  • MPI is not used when auto-tuning

 Two problem sizes:

  • 64^3 (~330MB)
  • 128^3 (~2.5GB)
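To make the layout trade-off concrete, a minimal C sketch of the two choices; the component count and names are illustrative, whereas the real code carries roughly 150 doubles of state per lattice site.

```c
#include <stdlib.h>

#define NVEL 27   /* momentum-distribution components (illustrative) */

/* Array of Structures: all components of one site are adjacent, so a
 * sweep over a single component strides by sizeof(struct site).      */
struct site { double f[NVEL]; };

/* Structure of Arrays: each component is its own unit-stride stream;
 * many streams per thread, but perfect spatial locality per stream.  */
struct lattice { double *f[NVEL]; };

struct lattice alloc_soa(size_t npts)
{
    struct lattice l;
    for (int v = 0; v < NVEL; v++)
        l.f[v] = malloc(npts * sizeof(double));  /* one stream per component */
    return l;
}
```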
SLIDE 18

Stencil for Lattice Methods

 Very different from the canonical heat-equation stencil (see the sketch below)

  • There are multiple read and write arrays
  • There is no reuse

[Figure: data flows from read_lattice[ ][ ] to write_lattice[ ][ ].]
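A minimal 1D sketch of this access pattern, assuming just two components and nearest-neighbor gathers. The point is that the read and write lattices are distinct arrays and each site is touched once per sweep, so there is no temporal reuse.

```c
/* Illustrative streaming stencil: gather neighbors from read-only
 * arrays, collide, and scatter to separate write arrays.             */
void stream_collide(int n,
                    const double *restrict read0, const double *restrict read1,
                    double *restrict write0,      double *restrict write1)
{
    for (int i = 1; i < n - 1; i++) {
        double a = read0[i - 1] + read0[i + 1];   /* neighbors, component 0 */
        double b = read1[i - 1] + read1[i + 1];   /* neighbors, component 1 */
        write0[i] = 0.5 * (a + b);                /* write lattice only written */
        write1[i] = 0.5 * (a - b);
    }
}
```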

SLIDE 19

Side Note on Performance Graphs

 Threads are mapped first to cores, then to sockets (i.e. multithreading, then multicore, then multisocket)

 Niagara2 always used 8 threads/core.

 We show two problem sizes

 We'll step through performance as optimizations/features are enabled within the auto-tuner

 More colors implies more optimizations were necessary

 This allows us to compare architecture performance while keeping programmer effort (productivity) constant

SLIDE 20

Performance and Analysis of Pthreads Implementation

SLIDE 21

Pthread Implementation

Not naïve:

  • fully unrolled loops
  • NUMA-aware
  • 1D parallelization

Always used 8 threads per core on Niagara2

1P Niagara2 is faster than the 2P x86 machines

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

SLIDE 22


Pthread Implementation

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

Chart annotations (fraction of peak flops / fraction of peak bandwidth, one pair per platform): 4.8% / 17%; 14% / 17%; 54% / 14%; 1% / 0.3%.

SLIDE 23


Initial Pthread Implementation

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

Annotation (appears on two of the charts): performance degradation despite an improved surface-to-volume ratio.

SLIDE 24

Cache effects

 Want to maintain a working set of velocities in the L1 cache

 150 arrays, each trying to keep at least 1 cache line

 Impossible with Niagara2's 1KB/thread L1 working set = capacity misses

 On other architectures, the combination of:

  • low-associativity L1 caches (2-way on the Opteron)
  • a large number of arrays
  • near-power-of-2 problem sizes

can result in large numbers of conflict misses

 Solution: apply a lattice-offset-aware padding heuristic to the velocity arrays to avoid/minimize conflict misses (see the sketch below)
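A sketch of the padding idea in C. The per-component pad formula below is purely illustrative; the actual heuristic chooses pads based on the lattice (stencil) offsets so that components spread uniformly over the cache sets.

```c
#include <stdlib.h>

#define NVEL       27
#define CACHELINE  64                      /* bytes */

double *velocity[NVEL];

/* Give each velocity component a different start offset so the ~150
 * streams do not all map to the same L1 sets for power-of-2 grids.
 * (Raw pointers are not kept here; a real version would retain them
 * so the arrays can later be freed.)                                  */
void alloc_padded(size_t npts)
{
    for (int v = 0; v < NVEL; v++) {
        size_t pad = (size_t)v * CACHELINE / sizeof(double);  /* per-component skew */
        double *raw = malloc((npts + pad) * sizeof(double));
        velocity[v] = raw + pad;           /* shifted start: different cache sets */
    }
}
```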

SLIDE 25

Auto-tuned Performance

(+Stencil-aware Padding)

This lattice method is essentially 79 simultaneous 72-point stencils

This can cause conflict misses even with highly associative L1 caches (not to mention the Opteron's 2-way)

Solution: pad each component so that, when accessed with the corresponding stencil (spatial) offset, the components are uniformly distributed in the cache

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

+Padding Naïve+NUMA

SLIDE 26

Blocking for the TLB

 Touching 150 different arrays will thrash TLBs with fewer than 128 entries

 Try to maximize TLB page locality

 Solution: borrow a technique from compilers for vector machines (see the sketch below):

  • Fuse the spatial loops
  • Strip-mine into vectors of size VL (the vector length)
  • Interchange the spatial and velocity loops

 Can be generalized by varying:

  • The number of velocities simultaneously accessed
  • The number of macroscopics / velocities simultaneously updated

 Has the side benefit of expressing more ILP and DLP (SIMDization) and a cleaner loop structure, at the cost of increased L1 cache traffic
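A minimal C sketch of the fused / strip-mined / interchanged loop structure, with two stand-in components instead of ~150 and a trivial body in place of the real collision operator.

```c
#define VL 128   /* strip length; tuned per architecture by the auto-tuner */

/* Each strip of VL sites is swept once per component, so only a few
 * pages are live at a time while every component stream remains
 * unit-stride within the strip.                                        */
void sweep(int n, const double *restrict f0, const double *restrict f1,
                  double *restrict g0,       double *restrict g1)
{
    for (int start = 0; start < n; start += VL) {     /* strip-mined spatial loop */
        int end = start + VL < n ? start + VL : n;
        /* "velocity" loop interchanged outside the short spatial loop */
        for (int i = start; i < end; i++) g0[i] = 0.5 * f0[i];
        for (int i = start; i < end; i++) g1[i] = 0.5 * f1[i];
    }
}
```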

SLIDE 27

Multicore SMP Systems

(TLB organization)

[Same four block diagrams as before: IBM QS20 Cell Blade, Sun Niagara2 (Huron), AMD Opteron (rev.F), Intel Xeon (Clovertown).]

TLB organization: Intel Xeon (Clovertown) 16 entries, 4KB pages; AMD Opteron (rev.F) 32 entries, 4KB pages; Sun Niagara2 (Huron) 128 entries, 4MB pages; IBM Cell PPEs 1024 entries and SPEs 256 entries, 4KB pages.

SLIDE 28

Cache / TLB Tug-of-War

 For cache locality we want a small VL

 For TLB page locality we want a large VL

 Each architecture has a different balance between these two forces

 Solution: auto-tune to find the optimal VL

[Figure: L1 miss penalty and TLB miss penalty as a function of VL, with the characteristic points L1 size / 1200 and page size / 8 marked.]

SLIDE 29

Auto-tuned Performance

(+Vectorization)

Each update requires touching ~150 components, each likely to be on a different page

TLB misses can significantly impact performance

Solution: vectorization

Fuse spatial loops, strip-mine into vectors of size VL, and interchange with the phase-dimension loops

Auto-tune: search for the optimal vector length

Significant benefit on some architectures

Becomes irrelevant when bandwidth dominates performance

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

+Vectorization +Padding Naïve+NUMA

SLIDE 30

Auto-tuned Performance

(+Explicit Unrolling/Reordering)

Give the compilers a helping hand for the complex loops

Code generator: a Perl script generates all power-of-2 possibilities

Auto-tune: search for the best unrolling and expression of data-level parallelism

Essential when using SIMD intrinsics (see the sketch below)
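For illustration, one variant of the kind the generator might emit: the inner strip unrolled by 4 with the operations reordered into independent chains. The depth and the arithmetic are made up; the real generator enumerates all power-of-2 depths.

```c
/* Unrolled/reordered strip body: four independent temporaries expose
 * ILP/DLP that the compiler (or SIMD intrinsics) can exploit.         */
void strip_unroll4(int vl, const double *restrict f, double *restrict g)
{
    int i = 0;
    for (; i + 4 <= vl; i += 4) {
        double a0 = 0.5 * f[i + 0];
        double a1 = 0.5 * f[i + 1];
        double a2 = 0.5 * f[i + 2];
        double a3 = 0.5 * f[i + 3];
        g[i + 0] = a0; g[i + 1] = a1; g[i + 2] = a2; g[i + 3] = a3;
    }
    for (; i < vl; i++) g[i] = 0.5 * f[i];   /* remainder loop */
}
```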

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

+Unrolling +Vectorization +Padding Naïve+NUMA

SLIDE 31

Auto-tuned Performance

(+Software prefetching)

Expanded the code generator to insert software prefetches in case the compiler doesn't.

Auto-tune over:

  • no prefetch
  • prefetch 1 line ahead
  • prefetch 1 vector ahead

Relatively little benefit for relatively little work (see the sketch below)
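A sketch of what the generated prefetching can look like, here using GCC's `__builtin_prefetch`; the distance constant and the kernel body are illustrative only.

```c
/* DIST is one of the tuned choices: 0 (none), one 64B cache line
 * ahead (8 doubles), or one vector (VL elements) ahead.              */
#define DIST 8

void sweep_prefetch(int vl, const double *restrict f, double *restrict g)
{
    for (int i = 0; i < vl; i++) {
        __builtin_prefetch(&f[i + DIST], 0, 0);   /* read, no temporal reuse */
        g[i] = 0.5 * f[i];
    }
}
```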

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

+SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

SLIDE 32

Auto-tuned Performance

(+Software prefetching)


IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

+SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

Chart annotations (fraction of peak flops / fraction of peak bandwidth, one pair per platform): 6% / 22%; 32% / 40%; 59% / 15%; 10% / 3.7%.

SLIDE 33

Auto-tuned Performance

(+SIMDization, including non-temporal stores)

Compilers (gcc & icc) failed at exploiting SIMD.

Expanded the code generator to use SIMD intrinsics.

Explicit unrolling/reordering was extremely valuable here.

Exploited movntpd (non-temporal stores) to minimize memory traffic (the only hope if memory bound)

Significant benefit for significant work (see the sketch below)
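A sketch of the SSE2 flavor of such a kernel, combining packed double-precision arithmetic with `_mm_stream_pd` (which compiles to movntpd) so stores bypass the cache and avoid write-allocate traffic. The arithmetic is a stand-in for the real collision operator.

```c
#include <emmintrin.h>

/* Assumes vl is even and both arrays are 16-byte aligned.            */
void simd_strip(int vl, const double *restrict f, double *restrict g)
{
    const __m128d half = _mm_set1_pd(0.5);
    for (int i = 0; i < vl; i += 2) {
        __m128d v = _mm_load_pd(&f[i]);
        _mm_stream_pd(&g[i], _mm_mul_pd(v, half));  /* movntpd: bypass the cache */
    }
    _mm_sfence();                                   /* make streamed stores visible */
}
```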

+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

SLIDE 34

Auto-tuned Performance

(+SIMDization, including non-temporal stores)


+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

Chart annotations (fraction of peak flops / fraction of peak bandwidth, one pair per platform): 7.5% / 18%; 42% / 35%; 59% / 15%; 10% / 3.7%.

SLIDE 35

Auto-tuned Performance

(+SIMDization, including non-temporal stores)


+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

IBM Cell Blade (PPEs) Sun Niagara2 (Huron) Intel Xeon (Clovertown) AMD Opteron (rev.F)

Chart annotations (one per platform): 1.5x, 10x, 4.3x, 1.6x.

SLIDE 36

Performance and Analysis of Cell Implementation

SLIDE 37

Cell Implementation

 Double-precision implementation

  • DP will severely hamper performance

 Vectorized, double-buffered, but not auto-tuned

  • No NUMA optimizations
  • No unrolling
  • VL is fixed
  • Straight to SIMD intrinsics
  • Prefetching replaced by DMA list commands (see the sketch below)

 Only collision() was implemented.
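For reference, a heavily simplified sketch of SPE-side double buffering. The real implementation uses DMA list commands and SIMD intrinsics; the buffer size, tag usage, and the `process()` routine here are hypothetical.

```c
#include <spu_mfcio.h>

#define CHUNK 2048                      /* bytes per buffer, multiple of 128 */
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *p, int n);   /* hypothetical compute step */

void consume(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);     /* prime the first buffer */
    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                     /* start fetching the next chunk */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);
        mfc_write_tag_mask(1 << cur);            /* wait for the current chunk */
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);                /* overlap compute with DMA */
        cur = next;
    }
}
```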

SLIDE 38

Auto-tuned Performance

(Local Store Implementation)

First attempt at a Cell implementation.

VL, unrolling, and reordering fixed

No NUMA

Exploits DMA and double buffering to load vectors

Straight to SIMD intrinsics.

Despite the relative performance, Cell's DP implementation severely impairs performance

Intel Xeon (Clovertown) AMD Opteron (rev.F) Sun Niagara2 (Huron) IBM Cell Blade (SPEs)*

*collision() only

+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

SLIDE 39


Auto-tuned Performance

(Local Store Implementation)

Intel Xeon (Clovertown) AMD Opteron (rev.F) Sun Niagara2 (Huron) IBM Cell Blade (SPEs)*

*collision() only

+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

Chart annotations (fraction of peak flops / fraction of peak bandwidth, one pair per platform): 7.5% / 18%; 42% / 35%; 59% / 15%; 57% / 33%.

SLIDE 40


Speedup from Heterogeneity

Intel Xeon (Clovertown) AMD Opteron (rev.F) Sun Niagara2 (Huron) IBM Cell Blade (SPEs)*

*collision() only

+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

Chart annotations (one per platform): 1.5x; 13x over the auto-tuned PPEs; 4.3x; 1.6x.

SLIDE 41


Speedup over naive

Intel Xeon (Clovertown) AMD Opteron (rev.F) Sun Niagara2 (Huron) IBM Cell Blade (SPEs)*

*collision() only

+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA

Chart annotations (speedup over the naïve implementation, one per platform): 1.5x, 130x, 4.3x, 1.6x.

SLIDE 42

Summary

SLIDE 43

Aggregate Performance

(Fully optimized)

Cell SPEs deliver the best full-system performance

  • although Niagara2 delivers nearly comparable per-socket performance

The dual-core Opteron delivers far better performance (bandwidth) than Clovertown

Clovertown has far too little effective FSB bandwidth

SLIDE 44

Parallel Efficiency

(average performance per thread, Fully optimized)

 Aggregate Mflop/s / #cores

 Niagara2 & Cell showed very good multicore scaling

 Clovertown showed very poor multicore scaling (FSB limited)

SLIDE 45

Power Efficiency

(Fully Optimized)

 Used a digital power meter to measure sustained power

 Calculate power efficiency as: sustained performance / sustained power

 All cache-based machines delivered similar power efficiency

 FBDIMMs (~12W each) add significantly to sustained power:

  • 8 DIMMs on Clovertown (total of ~330W)
  • 16 DIMMs on the N2 machine (total of ~450W)
SLIDE 46

Productivity

 Niagara2 required significantly less work to deliver good performance (just vectorization for large problems).

 Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance.

 Cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)

SLIDE 47

Summary

 Niagara2 delivered both very good performance and productivity

 Cell delivered very good performance and efficiency (processor and power)

 On the memory-bound Clovertown, parallelism wins out over optimization and auto-tuning

 Our multicore auto-tuned LBMHD implementation significantly outperformed the already optimized serial implementation

 Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio)

 Architectural transparency is invaluable in optimizing code

SLIDE 48

Multi-core arms race

SLIDE 49

New Multicores

2.2GHz Opteron (rev.F) 1.40GHz Niagara2

SLIDE 50

New Multicores

 Barcelona is a quad-core Opteron

 Victoria Falls is a dual-socket (128 threads) Niagara2

 Both have the same total bandwidth

+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA +Smaller pages

2.2GHz Opteron (rev.F) 2.3GHz Opteron (Barcelona) 1.40GHz Niagara2 1.16GHz Victoria Falls

SLIDE 51

Speedup from multicore/socket


+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA +Smaller pages

2.2GHz Opteron (rev.F) 2.3GHz Opteron (Barcelona) 1.40GHz Niagara2 1.16GHz Victoria Falls

Chart annotations: 1.6x (1.9x frequency-normalized); 1.9x (1.8x frequency-normalized).

SLIDE 52

Speedup from Auto-tuning


+SIMDization +SW Prefetching +Unrolling +Vectorization +Padding Naïve+NUMA +Smaller pages

2.2GHz Opteron (rev.F) 2.3GHz Opteron (Barcelona) 1.40GHz Niagara2 1.16GHz Victoria Falls

Chart annotations (speedup from auto-tuning, one per platform): 1.5x, 16x, 3.9x, 4.3x.

SLIDE 53

Questions?

SLIDE 54

Acknowledgements

 UC Berkeley

  • RADLab Cluster (Opterons)
  • PSI cluster (Clovertowns)

 Sun Microsystems

  • Niagara2 donations

 Forschungszentrum Jülich

  • Cell blade cluster access

 George Vahala, et al.

  • Original version of LBMHD

 ASCR Office in the DOE Office of Science

  • contract DE-AC02-05CH11231