

SLIDE 1

ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING

Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory

SLIDE 2

Over Last 20 Years - Performance Development

[Chart: TOP500 performance development, 1993-2012, log scale from 100 Mflop/s to 100 Pflop/s, with SUM, N=1, and N=500 trend lines separated by roughly 6-8 years. In 1993: N=1 at 59.7 Gflop/s, N=500 at 400 Mflop/s, SUM at 1.17 Tflop/s. In 2012: N=1 at 16.3 Pflop/s, N=500 at 60.8 Tflop/s, SUM at 123 Pflop/s. For scale: my laptop delivers ~70 Gflop/s; my iPad 2 and iPhone 4s ~1.02 Gflop/s.]

SLIDE 3

June 2012: The TOP10

| Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt |
|---|---|---|---|---|---|---|---|---|
| 1 | DOE / NNSA Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 8.6 | 1895 |
| 2 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 830 |
| 3 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2069 |
| 4 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.52 | 823 |
| 5 | Nat. SuperComputer Center in Tianjin | Tianhe-1A, NUDT Intel (6c) + Nvidia GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636 |
| 6 | DOE / OS Oak Ridge Nat Lab | Jaguar, Cray AMD (16c) + custom | USA | 298,592 | 1.94 | 74 | 5.14 | 377 |
| 7 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | 0.821 | 2099 |
| 8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 131,072 | 1.38 | 82 | 0.657 | 2099 |
| 9 | Commissariat a l'Energie Atomique (CEA) | Curie, Bull Intel (8c) + IB | France | 77,184 | 1.36 | 82 | 2.25 | 604 |
| 10 | Nat. Supercomputer Center in Shenzhen | Nebulae, Dawning Intel (6c) + Nvidia GPU (14c) + IB | China | 120,640 | 1.27 | 43 | 2.58 | 493 |
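The last two columns are derived quantities. As a quick sanity check (mine, not from the slides), a short Python sketch reproduces them from Rmax, percent of peak, and power for the top three entries:

```python
# Hypothetical check: derive implied Rpeak and MFlops/W for a few TOP10
# entries from the table's Rmax, % of peak, and power columns.
entries = [
    # (system, Rmax in Pflop/s, % of peak, power in MW)
    ("Sequoia",    16.3, 81,  8.6),
    ("K computer", 10.5, 93, 12.7),
    ("Mira",        8.16, 81, 3.95),
]

for name, rmax, pct, power_mw in entries:
    rpeak = rmax / (pct / 100.0)                     # implied peak, Pflop/s
    mflops_per_watt = rmax * 1e9 / (power_mw * 1e6)  # 1 Pflop/s = 1e9 Mflop/s
    print(f"{name:10s}: Rpeak ~ {rpeak:5.1f} Pflop/s, {mflops_per_watt:4.0f} MFlops/W")
```

This recovers, e.g., Sequoia's ~20.1 Pflop/s peak and 1895 MFlops/W, matching the table to rounding.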
SLIDE 4

Accelerators (58 systems)

0" 10" 20" 30" 40" 50" 60" 2006" 2007" 2008" 2009" 2010" 2011" 2012" Systems% Intel"MIC"(1)" Clearspeed"CSX600"(0)" ATI"GPU"(2)" IBM"PowerXCell"8i"(2)" NVIDIA"2070"(10)" NVIDIA"2050(12)" NVIDIA"2090"(31)"

SLIDE 5

Countries Share

Absolute counts: US 252, China 68, Japan 35, UK 25, France 22, Germany 20.

[Chart: TOP500 share by country, with Switzerland highlighted.]

SLIDE 6

Swiss Machines in Top500 (max:12 min:1)

[Chart: number of Swiss systems on each TOP500 list, January 1993 to April 2012. High point: 12 systems (6/95); low point: 1 system (6/02, 11/02, 6/12). Counts per list: 4, 5, 7, 9, 12, 9, 8, 9, 6, 6, 5, 6, 6, 5, 8, 8, 6, 2, 1, 1, 3, 3, 2, 3, 3, 4, 4, 5, 5, 7, 6, 4, 4, 5, 5, 4, 4, 3, 1.]

SLIDE 7

28 Systems at > Pflop/s (Peak)

0" 5" 10" 15" 20" 25" 30" 35" 40" 45" US""""""""""""""""""" (9)" "Japan""""""" (4)" China""""""""""""""" (5)" Germany""""""""""" (4)" France"""""""""""" (2)" UK""""""""""""""""" (2)" Italy"""""""""""" (1)" Russia""""""" (1)"

41# 16.2# 11.1# 6.9# 2.92# 2.73# 2.1# 1.7#

Pflop/s"Club"

Pflop/s (Peak)

10/2/12 7

SLIDE 8

Linpack Efficiency

[Chart: Linpack efficiency (Rmax as a fraction of peak, 0-100%) plotted across the 500 systems of the list.]

SLIDE 11

Performance Development in Top500

[Chart: TOP500 performance development extrapolated from 1994 to 2020, log scale from 100 Mflop/s to beyond 1 Eflop/s, with N=1 and N=500 trend lines; the extrapolated N=1 line reaches 1 Eflop/s around 2019-2020.]

SLIDE 12

The High Cost of Data Movement

| Operation | 2011 | 2018 (projected) |
|---|---|---|
| DP FMADD flop | 100 pJ | 10 pJ |
| DP DRAM read | 4800 pJ | 1920 pJ |
| Local interconnect | 7500 pJ | 2500 pJ |
| Cross system | 9000 pJ | 3500 pJ |
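To make the table concrete, a small sketch (values copied from the table above) expresses each operation's energy as a multiple of a double-precision flop in the same year:

```python
# Energy per operation in picojoules, from the table above.
energy_pj = {
    #              2011   2018 (projected)
    "DP FMADD":   (100,   10),
    "DRAM read":  (4800,  1920),
    "Local link": (7500,  2500),
    "Cross sys":  (9000,  3500),
}

for op, (e2011, e2018) in energy_pj.items():
    f2011 = e2011 / energy_pj["DP FMADD"][0]   # cost in "flops", 2011
    f2018 = e2018 / energy_pj["DP FMADD"][1]   # cost in "flops", 2018
    print(f"{op:10s}: {f2011:5.0f}x a flop in 2011 -> {f2018:5.0f}x in 2018")
```

Flops get ~10x cheaper while data movement only gets ~2-3x cheaper, so the relative cost of moving data roughly quadruples: a DRAM read goes from 48 flops' worth of energy to 192.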

SLIDE 13

Energy Cost Challenge

At ~$1M per MW per year, energy costs are substantial:

• 10 Pflop/s in 2011 uses ~10 MW, i.e., roughly $10M per year.
• 1 Eflop/s in 2018: >100 MW.
• DOE target: 1 Eflop/s in 2018 at 20 MW (~$20M per year).

SLIDE 14

Potential System Architecture with a cap of $200M and 20 MW

| | 2012 system (BG/Q) | 2019 | Difference, today vs. 2019 |
|---|---|---|---|
| System peak | 20 Pflop/s | 1 Eflop/s | O(100) |
| Power | 8.6 MW | ~20 MW | |
| System memory | 1.6 PB (16 GB x 96 x 1024 nodes) | 32-64 PB | O(10) |
| Node performance | 205 GF/s (16 cores x 1.6 GHz x 8 flops/cycle) | 1.2 or 15 TF/s | O(10) - O(100) |
| Node memory BW | 42.6 GB/s | 2-4 TB/s | O(1000) |
| Node concurrency | 64 threads | O(1k) or 10k threads | O(100) - O(1000) |
| Total node interconnect BW | 20 GB/s | 200-400 GB/s | O(10) |
| System size (nodes) | 98,304 (96 x 1024) | O(100,000) or O(1M) | O(100) - O(1000) |
| Total concurrency | 5.97 M | O(billion) | O(1,000) |
| MTTI | 4 days | O(<1 day) | O(10) |
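One way to read the node rows is through machine balance, flops per byte of memory traffic. A sketch, assuming (my pairing, not the slide's) that the 15 TF/s node design goes with the 4 TB/s bandwidth and the 1.2 TF/s node with 2 TB/s:

```python
# Machine balance implied by the table: node flop rate / node memory BW.
nodes = {
    "BG/Q 2012":   (205e9,  42.6e9),   # (flop/s, bytes/s)
    "2019 'fat'":  (15e12,  4e12),     # hypothetical pairing of table values
    "2019 'thin'": (1.2e12, 2e12),     # hypothetical pairing of table values
}

for name, (flops, bw) in nodes.items():
    print(f"{name:12s}: {flops / bw:4.1f} flops per byte of memory traffic")
```

Today's BG/Q node needs ~4.8 flops per byte to stay compute-bound; the "fat" 2019 node stays near that, so algorithms get no relief from hardware on arithmetic intensity.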
SLIDE 15

Potential System Architecture with a cap of $200M and 20 MW (revised)

Same projection as the previous slide, with the exascale target shifted from 2019 to 2022; the system parameters and the gaps relative to today's BG/Q are unchanged.
SLIDE 16

Critical Issues at Peta & Exascale for Algorithm and Software Design

• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain the lower bounds on communication.
• Mixed-precision methods: working in single precision gives roughly 2x the speed of the arithmetic and 2x the effective data-movement bandwidth (see the sketch after this list).
• Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
• Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
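As an illustration of the mixed-precision bullet, here is a minimal sketch (not PLASMA's implementation) of iterative refinement: the O(n^3) factorization runs in single precision, and double-precision accuracy is recovered with cheap O(n^2) residual corrections.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

A32 = A.astype(np.float32)
lu = lu_factor(A32)                               # O(n^3) work, done in single
x = lu_solve(lu, b.astype(np.float32)).astype(np.float64)

for _ in range(5):                                # O(n^2) per refinement sweep
    r = b - A @ x                                 # residual in double precision
    x += lu_solve(lu, r.astype(np.float32)).astype(np.float64)

print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

For a reasonably conditioned system the refined solution reaches double-precision accuracy while the expensive factorization enjoyed the single-precision speed advantage.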

SLIDE 17
Major Changes to Algorithms/Software

We must rethink the design of our algorithms and software:

• Manycore and hybrid architectures are disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.

SLIDE 18

Dense Linear Algebra

Software Evolution

| Package | Era | Key idea | Building blocks |
|---|---|---|---|
| LINPACK | 70's | vector operations | Level-1 BLAS |
| LAPACK | 80's | block operations | Level-3 BLAS |
| ScaLAPACK | 90's | block-cyclic data distribution | PBLAS, BLACS (message passing) |
| PLASMA | 00's | tile operations | tile layout, dataflow scheduling |

SLIDE 19

PLASMA

Principles

" Tile Algorithms

" minimize capacity misses

" Tile Matrix Layout

" minimize conflict misses

" Dynamic DAG Scheduling

" minimizes idle time " More overlap " Asynchronous ops

CPU MEM cache CPU MEM cache CPU cache CPU cache CPU cache

LAPACK PLASMA
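A toy illustration of the tile layout (a hypothetical helper, not PLASMA's API): each b-by-b tile is stored contiguously, so a kernel working on one tile streams a single dense block through cache.

```python
import numpy as np

def to_tiles(A, b):
    """Return tiles[i][j] = contiguous copy of the b-by-b block (i, j) of A."""
    n = A.shape[0]
    assert n % b == 0
    return [[np.ascontiguousarray(A[i:i + b, j:j + b])
             for j in range(0, n, b)]
            for i in range(0, n, b)]

A = np.arange(64.0).reshape(8, 8)
tiles = to_tiles(A, 4)
print(tiles[1][0])   # bottom-left 4x4 tile, stored with stride 1
```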

SLIDE 20

Fork-Join Parallelization of LU and QR. Parallelize the update:

  • Easy, and done in any reasonable software.
  • This is the 2/3 n^3 term in the flop count.
  • Can be done efficiently with LAPACK plus a multithreaded BLAS (dgemm); a sketch follows.

[Diagram: cores vs. time, showing parallel update phases separated by sequential panel factorizations.]
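A numpy/scipy sketch of one right-looking step (illustrative only, not LAPACK's code): once the panel is factored, the entire trailing update is a single GEMM, which a multithreaded BLAS parallelizes well.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

n, b = 600, 100
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))

P, Lpan, U11 = lu(A[:, :b])          # O(n b^2) panel factorization with pivoting
Atr = P.T @ A[:, b:]                 # apply the panel's row swaps to the rest
U12 = solve_triangular(Lpan[:b], Atr[:b], lower=True, unit_diagonal=True)
A22 = Atr[b:] - Lpan[b:] @ U12       # trailing update: one big dgemm,
                                     # the 2/3 n^3 bulk of the work;
                                     # recursing on A22 completes the LU
```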

SLIDE 21

Objectives

• High utilization of each core
• Scaling to large numbers of cores
• Synchronization-reducing algorithms

Methodology

• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout

PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures

[Diagram: fork-join parallelism vs. DAG-scheduled parallelism over time; the arbitrary DAG with dynamic scheduling keeps cores busy between what would otherwise be global barriers. A toy scheduler sketch follows.]
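A toy dynamic scheduler in the spirit of QUARK (not its API): tasks carry explicit dependences and fire as soon as their inputs are ready, with no global barrier between them. Task names here are hypothetical labels for tiles of a factorization.

```python
from concurrent.futures import ThreadPoolExecutor

tasks = {                      # task -> (dependences, work)
    "potrf1": ((),            lambda: "factor tile (0,0)"),
    "trsm":   (("potrf1",),   lambda: "solve  tile (1,0)"),
    "syrk":   (("trsm",),     lambda: "update tile (1,1)"),
    "potrf2": (("syrk",),     lambda: "factor tile (1,1)"),
}

def run(dag, workers=4):
    futures = {}
    with ThreadPoolExecutor(workers) as pool:
        def fire(name):
            deps, work = dag[name]
            for d in deps:            # wait only on *this* task's inputs,
                futures[d].result()   # never on a global barrier
            return work()
        for name in dag:              # dict order is a topological order here
            futures[name] = pool.submit(fire, name)
        for name, f in futures.items():
            print(name, "->", f.result())

run(tasks)
```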

SLIDE 22

Communication Avoiding QR Example

A. Pothen and P. Raghavan. Distributed orthogonal factorization. In Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, volume II (Applications), pages 1610-1620, Pasadena, CA, Jan. 1988. ACM.

[Diagram: the matrix is split into domains D0-D3; Domain_Tile_QR factors each domain independently, producing local triangular factors R0-R3, which are then merged pairwise up a binary reduction tree (R0 with R1, R2 with R3, then the final R).]
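A numpy sketch of the idea in the figure (illustrative, not the PLASMA kernel): factor each domain D0-D3 independently, then merge the small R factors pairwise up the tree. The result matches the R of a flat QR up to row signs.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 8))                      # tall and skinny
D = np.split(A, 4)                                      # domains D0..D3

R = [np.linalg.qr(Di, mode="r") for Di in D]            # local QRs -> R0..R3
R01 = np.linalg.qr(np.vstack([R[0], R[1]]), mode="r")   # merge R0, R1
R23 = np.linalg.qr(np.vstack([R[2], R[3]]), mode="r")   # merge R2, R3
Rfinal = np.linalg.qr(np.vstack([R01, R23]), mode="r")  # root of the tree

Rflat = np.linalg.qr(A, mode="r")                       # reference: flat QR
print(np.allclose(np.abs(Rfinal), np.abs(Rflat)))       # equal up to row signs
```

Each merge touches only small b-by-b triangles, which is what makes the scheme communication-avoiding on distributed or deeply hierarchical machines.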
SLIDE 27

PowerPack 2.0

The PowerPack platform consists of software and hardware for fine-grained power measurement. Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/

SLIDE 28

Power for QR Factorization

Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS; the test matrix is very tall and skinny (m x n = 1,152,000 by 288).

[Chart: power traces over time for four QR variants:
• PLASMA's communication-reducing QR factorization (DAG-based)
• PLASMA's conventional QR factorization (DAG-based)
• MKL's QR factorization (fork-join based)
• LAPACK's QR factorization (fork-join based)]

SLIDE 29

The standard Tridiagonal reduction xSYTRD

At step k, LAPACK xSYTRD applies the two-sided transformation $Q A Q^*$ and then updates, before moving to step k+1:

1. Apply the left and right transformations $Q A Q^*$ to the panel.
2. Update the remaining submatrix A33:

$$
\begin{pmatrix} T_{11} & T_{21}^T & \\ T_{21} & A_{22} & A_{32}^T \\ & A_{32} & A_{33} \end{pmatrix}
\;\Longrightarrow\;
\begin{pmatrix} T_{11} & T_{21}^T & \\ T_{21} & T_{22} & T_{32}^T \\ & T_{32} & A_{33} \end{pmatrix},
\qquad A_{33} \leftarrow A_{33} - YW^T - WY^T
$$

For the symmetric eigenvalue problem: First stage takes:

  • 90% of the time if only eigenvalues
  • 50% of the time if eigenvalues and eigenvectors
SLIDE 30

The Standard Tridiagonal Reduction xSYTRD: Characteristics

1. Phase 1 requires:
  • 4 panel-vector multiplications,
  • 1 symmetric matrix-vector multiplication with A33,
  • cost: 2(n-k)^2 b flops.

2. Phase 2 requires:
  • a symmetric update of A33 using SYRK,
  • cost: 2(n-k)^2 b flops.

Observations:

  • Too many Level-2 BLAS operations (a timing sketch follows),
  • relies on panel factorization,
  • total cost: 4n^3/3 flops,
  • bulk-synchronous phases,
  • memory-bound algorithm.
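A rough timing sketch of why Level-2-heavy code is memory bound: n matrix-vector products do the same flops as one matrix-matrix product, but re-read A from memory n times. Numbers vary by machine; this only illustrates the gap.

```python
import time
import numpy as np

n = 2000
A = np.random.standard_normal((n, n))
X = np.random.standard_normal((n, n))

t0 = time.perf_counter()
for j in range(n):                 # Level 2: n GEMV calls, memory bound
    A @ X[:, j]
t1 = time.perf_counter()
A @ X                              # Level 3: one GEMM, compute bound
t2 = time.perf_counter()
print(f"GEMV loop {t1 - t0:.2f}s vs GEMM {t2 - t1:.2f}s, same flop count")
```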

SLIDE 31

Symmetric Eigenvalue Problem

  • Standard reduction algorithms are very slow on multicore.
  • Step 1: reduce the dense matrix to band form (matrix-matrix operations, high degree of parallelism).
  • Step 2: bulge chasing on the band matrix (performed in groups, cache-aware).
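For reference, a scipy sketch of the reduce-then-solve pipeline. This is the classic one-stage version; the two-stage dense-to-band-to-tridiagonal algorithm described above is what PLASMA implements, and scipy does not expose it.

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal, eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))
A = (A + A.T) / 2                            # symmetric test matrix

T = hessenberg(A)                            # symmetric Hessenberg = tridiagonal
w = eigh_tridiagonal(np.diag(T), np.diag(T, -1), eigvals_only=True)

print(np.allclose(np.sort(w), eigh(A, eigvals_only=True)))
```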
SLIDE 32

[Chart: Gflop/s vs. matrix size (2k to 24k) comparing PLASMA DSYTRD+DSTEV, MKL-SBR DSYRDB+DSTEV, SBR-toolkit DSYRDD+DSTEV, MKL DSYTRD+DSTEV, and reference-LAPACK DSYTRD+DSTEV, with panels for symmetric eigenvalues (eigenvalues only) and singular values (singular values only); annotated speedups of 11x and 50x.]

Block DAG-based reduction to banded form, then pipelined group chasing to tridiagonal form. The reduction to condensed form accounts for the factor-of-50 improvement over LAPACK. Execution rates are based on 4n^3/3 ops.

Experiments on eight-socket six-core AMD Opteron 2.4 GHz processors with MKL V10.3.

SLIDE 33

Summary

These are old ideas (today: SMPSs, StarPU, Charm++, ParalleX, Swarm, …).

Major challenges are ahead for extreme computing:

• Power
• Levels of parallelism
• Communication
• Hybrid architectures
• Fault tolerance
• … and many others not discussed here

This is not just a programming assignment; it opens up many new opportunities for applied mathematicians and computer scientists.

SLIDE 34

Collaborators / Software / Support

• PLASMA: http://icl.cs.utk.edu/plasma/
• MAGMA: http://icl.cs.utk.edu/magma/
• QUARK (runtime for shared memory): http://icl.cs.utk.edu/quark/
• PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/

Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France; KAUST, Saudi Arabia.

These tools are being applied to a range of applications beyond dense linear algebra: sparse direct methods, sparse iterative methods, and fast multipole methods.