
NAS FT Variants Performance Summary

  • Slab is always best for MPI; the cost of many small messages is too high
  • Pencil is always best for UPC; smaller messages give more communication/computation overlap (sketch below)
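A minimal UPC-style sketch of the tradeoff, assuming the UPC 1.3 non-blocking transfer library (<upc_nb.h> with upc_memput_nb/upc_sync); the buffer names, sizes, and the fft_pencil stub are illustrative, not the NAS FT code. The slab path finishes the whole plane and ships it as one large blocking put, while the pencil path ships each row as soon as its 1-D FFTs finish, so the puts overlap the remaining computation.

```c
#include <upc.h>
#include <upc_relaxed.h>
#include <upc_nb.h>            /* UPC 1.3 non-blocking transfers (assumed available) */

#define NP 64                  /* pencils per slab (illustrative) */
#define PENCIL_ELEMS 512       /* elements per pencil (illustrative) */

shared [] double *remote_buf;  /* peer's receive region; set up for real elsewhere */
static double slab[NP][PENCIL_ELEMS];

/* Stand-in for a real 1-D FFT over one pencil. */
static void fft_pencil(double *pencil, int n) {
    for (int i = 0; i < n; i++) pencil[i] *= 0.5;
}

/* Slab: finish all local FFTs, then one large blocking put (no overlap). */
static void exchange_as_slab(void) {
    for (int p = 0; p < NP; p++)
        fft_pencil(slab[p], PENCIL_ELEMS);
    upc_memput(remote_buf, slab, sizeof(slab));
}

/* Pencil: put each row as soon as it is ready; the transfers overlap
   the FFTs of the rows that follow. */
static void exchange_as_pencils(void) {
    upc_handle_t h[NP];
    for (int p = 0; p < NP; p++) {
        fft_pencil(slab[p], PENCIL_ELEMS);
        h[p] = upc_memput_nb(remote_buf + p * PENCIL_ELEMS,
                             slab[p], PENCIL_ELEMS * sizeof(double));
    }
    for (int p = 0; p < NP; p++)
        upc_sync(h[p]);        /* drain all outstanding puts */
}

int main(void) {
    /* Stand-in destination with affinity to this thread, so the sketch runs. */
    remote_buf = (shared [] double *)upc_alloc(sizeof(slab));
    for (int p = 0; p < NP; p++)
        for (int i = 0; i < PENCIL_ELEMS; i++)
            slab[p][i] = p + i;
    exchange_as_slab();
    exchange_as_pencils();
    upc_barrier;
    return 0;
}
```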

Chart: best MFlop rates per thread for all NAS FT benchmark versions (Best NAS Fortran/MPI; Best MPI, always Slabs; Best UPC, always Pencils) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; the best case reaches roughly 0.5 TFlops aggregate.

Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea


FFT Performance on BlueGene/P

HPC Challenge peak as of July 2009 is ~4.5 TFlops on 128K cores

  • PGAS implementations consistently outperform MPI
  • Leveraging communication/computation overlap yields the best performance (sketch below)
  • More collectives in flight and more communication lead to better performance
  • At 32K cores, overlap algorithms yield a 17% improvement in overall application time
  • Numbers are getting close to the HPC Challenge record
  • Future work: try to beat the record
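The overlap bullets above describe software pipelining: while one plane's data is in flight, the thread computes the next one, so several transfers are outstanding at once. The real BlueGene/P code drives this with collectives; the sketch below shows the same idea with point-to-point puts and double buffering, again assuming <upc_nb.h>, with plane counts, sizes, and compute_plane purely illustrative.

```c
#include <upc.h>
#include <upc_relaxed.h>
#include <upc_nb.h>              /* UPC 1.3 non-blocking transfers (assumed available) */

#define NPLANES 8                /* planes per thread, >= 2 (illustrative) */
#define PLANE_ELEMS 1024         /* elements per plane (illustrative) */

shared [] double *dest;          /* peer's receive region; set up for real elsewhere */
static double buf[2][PLANE_ELEMS];  /* two buffers so compute and send proceed together */

/* Stand-in for the row FFTs of one plane. */
static void compute_plane(double *out, int plane) {
    for (int i = 0; i < PLANE_ELEMS; i++)
        out[i] = plane + 0.001 * i;
}

static void pipelined_exchange(void) {
    upc_handle_t h[2];
    for (int p = 0; p < NPLANES; p++) {
        int b = p & 1;
        if (p >= 2)
            upc_sync(h[b]);          /* this buffer's previous put must finish first */
        compute_plane(buf[b], p);    /* overlaps the other buffer's put, still in flight */
        h[b] = upc_memput_nb(dest + (size_t)p * PLANE_ELEMS,
                             buf[b], PLANE_ELEMS * sizeof(double));
    }
    upc_sync(h[0]);                  /* drain the last two transfers */
    upc_sync(h[1]);
}

int main(void) {
    /* Stand-in destination with affinity to this thread, so the sketch runs. */
    dest = (shared [] double *)upc_alloc(NPLANES * PLANE_ELEMS * sizeof(double));
    pipelined_exchange();
    upc_barrier;
    return 0;
}
```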

Chart: FFT performance in GFlops versus number of cores (256 to 32,768) on BlueGene/P for the Slabs, Slabs (Collective), Packed Slabs (Collective), and MPI Packed Slabs variants; higher is better.


Case Study: LU Factorization

  • Direct methods have complicated dependencies
    • Especially with pivoting (unpredictable communication)
    • Especially for sparse matrices (dependence graph with holes)
  • LU factorization in UPC
    • Use overlap ideas and multithreading to mask latency
    • Multithreaded: UPC threads + user threads + threaded BLAS
      • Panel factorization, including pivoting
      • Update to a block of U
      • Trailing submatrix updates (the three phases are sketched below)
  • Status:
    • Dense LU done: HPL-compliant
    • Sparse version underway
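As a reference point for the three phases named above, here is a minimal sequential sketch of blocked right-looking LU in plain C: panel factorization, the triangular solve that updates a block of U, and the trailing submatrix update. It omits pivoting, BLAS, and all parallelism; the matrix size, block size, and fill are illustrative (the matrix is made diagonally dominant so skipping pivoting is safe). The UPC code overlaps and multithreads exactly these phases.

```c
#include <stdio.h>

#define N 8    /* matrix dimension (illustrative, multiple of B) */
#define B 4    /* block size (illustrative) */

/* Panel factorization: unblocked LU of the B-wide panel A[k..N-1][k..k+B-1]. */
static void panel_factor(double A[N][N], int k) {
    for (int j = k; j < k + B; j++)
        for (int i = j + 1; i < N; i++) {
            A[i][j] /= A[j][j];                  /* L multiplier */
            for (int c = j + 1; c < k + B; c++)  /* update remaining panel columns */
                A[i][c] -= A[i][j] * A[j][c];
        }
}

/* Update a block of U: forward substitution with the unit lower triangle of the panel. */
static void update_u_block(double A[N][N], int k) {
    for (int j = k + B; j < N; j++)
        for (int r = k; r < k + B; r++)
            for (int i = r + 1; i < k + B; i++)
                A[i][j] -= A[i][r] * A[r][j];
}

/* Trailing submatrix update: A22 -= L21 * U12 (the GEMM-like phase). */
static void trailing_update(double A[N][N], int k) {
    for (int i = k + B; i < N; i++)
        for (int j = k + B; j < N; j++)
            for (int r = k; r < k + B; r++)
                A[i][j] -= A[i][r] * A[r][j];
}

int main(void) {
    double A[N][N];
    /* Diagonally dominant fill so no pivoting is needed. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? N : 1.0 / (1 + i + j);

    for (int k = 0; k < N; k += B) {   /* one step per block column */
        panel_factor(A, k);
        update_u_block(A, k);
        trailing_update(A, k);
    }
    printf("U[0][0] = %g\n", A[0][0]);  /* print one entry of the factored matrix */
    return 0;
}
```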

Joint work with Parry Husbands


UPC HPL Performance

  • Comparison to ScaLAPACK on an Altix, a 2 x 4 process grid (layout sketched below)
    • ScaLAPACK (block size 64): 25.25 GFlop/s (tried several block sizes)
    • UPC LU (block size 256): 33.60 GFlop/s; (block size 64): 26.47 GFlop/s
  • n = 32000 on a 4 x 4 process grid
    • ScaLAPACK: 43.34 GFlop/s (block size 64)
    • UPC: 70.26 GFlop/s (block size 200)
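For context on the "block size" and "process grid" parameters above: ScaLAPACK and HPL spread the matrix over the process grid in a 2-D block-cyclic layout, so the block size sets both the BLAS granularity and the communication granularity. The tiny helper below is an illustrative sketch of that mapping; the function and variable names are mine, not ScaLAPACK's API.

```c
#include <stdio.h>

/* Owner of global block (bi, bj) under a Pr x Pc block-cyclic layout. */
static void block_owner(int bi, int bj, int Pr, int Pc, int *pr, int *pc) {
    *pr = bi % Pr;    /* process row    */
    *pc = bj % Pc;    /* process column */
}

int main(void) {
    int pr, pc;
    int nb = 64;                 /* block size, as in the ScaLAPACK runs above */
    int gi = 1000, gj = 20000;   /* a global element index (illustrative) */
    block_owner(gi / nb, gj / nb, 2, 4, &pr, &pc);   /* 2 x 4 process grid */
    printf("element (%d,%d) lives on process (%d,%d)\n", gi, gj, pr, pc);
    return 0;
}
```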

Charts: Linpack performance (GFlop/s), MPI/HPL versus UPC, on the X1 (64 and 128 processors), an Opteron cluster (64 processors), and the Altix (32 processors).

  • MPI HPL numbers from the HPCC database
  • Large scaling (on Thunder):
    • 2.2 TFlops on 512p
    • 4.4 TFlops on 1024p

Joint work with Parry Husbands