Future Directions in High Performance Computing


SLIDE 1

Future Directions in High Performance Computing

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
Oak Ridge National Laboratory
University of Manchester

2/20/2008

SLIDE 2

Outline

  • Top500 Results
  • Four Important Concepts that Will Affect Math Software:
    - Effective Use of Many-Core
    - Exploiting Mixed Precision in Our Numerical Computations
    - Self-Adapting / Auto-Tuning of Software
    - Fault-Tolerant Algorithms

SLIDE 3

  • H. Meuer, H. Simon, E. Strohmaier, & JD
  • Listing of the 500 most powerful computers in the world
  • Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem, TPP performance)
  • Updated twice a year:
    - SC'xy in the States in November
    - Meeting in Germany in June
  • All data available from www.top500.org

SLIDE 4

Performance Development

[Chart: Top500 performance growth, 1993-2007, on a log scale from 100 Mflop/s to 1 Pflop/s. The sum over all 500 systems (SUM) has reached 6.96 PF/s; the #1 system (N=1) is at 478 TF/s and the #500 system (N=500) at 5.9 TF/s. Machines marked along the N=1 line include the Fujitsu 'NWT' (59.7 GF/s), Intel ASCI Red (1.17 TF/s), IBM ASCI White, the NEC Earth Simulator, and IBM BlueGene/L. The N=500 curve trails N=1 by roughly 6-8 years; "My Laptop" at 0.4 GF/s is shown for reference.]

SLIDE 5

30th Edition: The TOP10

Rank / Manufacturer and Computer / Rmax [TF/s] / Installation Site / Country / Year / #Cores / Type

 1. IBM BlueGene/L eServer Blue Gene, Dual-Core 0.7 GHz; 478; DOE, Lawrence Livermore National Lab; USA; 2007; 212,992 cores; Custom
 2. IBM Blue Gene/P, Quad-Core 0.85 GHz; 167; Forschungszentrum Jülich; Germany; 2007; 65,536 cores; Custom
 3. SGI Altix ICE 8200, Xeon Quad-Core 3 GHz; 127; SGI / New Mexico Computing Applications Center; USA; 2007; 14,336 cores; Hybrid
 4. HP Cluster Platform, Xeon Dual-Core 3 GHz; 118; Computational Research Laboratories, TATA SONS; India; 2007; 14,240 cores; Commod
 5. HP Cluster Platform, Dual-Core 2.66 GHz; 102.8; Government Agency; Sweden; 2007; 13,728 cores; Commod
 6. Cray, Opteron Dual-Core 2.4 GHz; 102.2; DOE, Sandia National Lab; USA; 2007; 26,569 cores; Hybrid
 7. Cray, Opteron Dual-Core 2.6 GHz; 101.7; DOE, Oak Ridge National Lab; USA; 2006; 23,016 cores; Hybrid
 8. IBM eServer Blue Gene/L, Dual-Core 0.7 GHz; 91.2; IBM Thomas J. Watson Research Center; USA; 2005; 40,960 cores; Custom
 9. Cray, Opteron Dual-Core 2.6 GHz; 85.4; DOE, Lawrence Berkeley National Lab; USA; 2006; 19,320 cores; Hybrid
10. IBM eServer Blue Gene/L, Dual-Core 0.7 GHz; 82.1; Stony Brook/BNL, NY Center for Computational Sciences; USA; 2006; 36,864 cores; Custom

SLIDE 6

IBM BlueGene/L, #1: 212,992 Cores

  • 2.6 MWatts (roughly 2,600 homes); about 70,000 ops/s/person
  • Build-up of the system (performance and memory at each level):
    - Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache
    - Compute Card (2 chips, 2x1x1, 4 processors): 5.6/11.2 GF/s, 1 GB DDR
    - Node Board (32 chips, 4x4x2, i.e. 16 compute cards, 64 processors): 90/180 GF/s, 16 GB DDR
    - Rack (32 node boards, 8x8x16, 2,048 processors): 2.9/5.7 TF/s, 0.5 TB DDR
    - System (104 racks, 104x32x32, 212,992 processors): 298/596 TF/s, 32 TB DDR
  • "Fastest Computer": BG/L, 700 MHz, 213K processors, 104 racks; Peak: 596 Tflop/s; Linpack: 498 Tflop/s (84% of peak; 20.7K sec, about 5.7 hours; n = 2.5M)
  • The compute node ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).

SLIDE 7

Cores per System - November 2007

[Histogram: number of systems in each core-count bin (33-64 up to 64k-128k) for each list release date, with a second axis showing the total number of cores across the TOP500.]

SLIDE 8

Top500 Systems, November 2007

[Chart: Rmax (Tflop/s) versus rank, 1 to 500.]

  • #1 is 478 Tflop/s
  • 7 systems > 100 Tflop/s
  • 21 systems > 50 Tflop/s
  • 149 systems > 10 Tflop/s
  • #500 is 5.9 Tflop/s

SLIDE 9

Chips Used in Each of the 500 Systems

  • Intel: 72% (EM64T 65%, IA-32 3%, IA-64 4%)
  • AMD x86_64: 16%
  • IBM Power: 12%
  • NEC, Sun Sparc, HP PA-RISC, Cray, HP Alpha: ~0% each

SLIDE 10

Interconnects / Systems

[Chart: interconnect families across the TOP500, 1993-2007: Gigabit Ethernet (270 systems), InfiniBand (121), Myrinet (18), plus Quadrics, Crossbar, SP Switch, Cray interconnect, N/A, and others.]

  • GigE + InfiniBand + Myrinet = 82% of systems

SLIDE 11

Top500 by Usage

  • Industry: 287 systems (57%)
  • Research: 101 (20%)
  • Academic: 86 (17%)
  • Government: 15 (3%)
  • Vendor: 8 (2%)
  • Classified: 3 (1%)

SLIDE 12

Countries / Performance (Nov 2007)

[Pie chart of performance share by country; the leading share is about 60%, with the remaining labeled slices at 7.7%, 7.4%, 4.2%, 3.2%, 2.8%, and 2.7%.]

SLIDE 13

Power is an Industry-Wide Problem

  • Google facilities
    - leveraging hydroelectric power
    - old aluminum plants
    - >500,000 servers worldwide
  • "Hiding in Plain Sight, Google Seeks More Power", by John Markoff, NYT, June 14, 2006
  • New Google plant in The Dalles, Oregon (from NYT, June 14, 2006)

SLIDE 14

Gflop/s per kWatt in the Top 20

[Bar chart of Gflop/s per kWatt for each of the top 20 systems, y-axis 0 to about 350.]

SLIDE 15

Green500

SLIDE 16

Performance Projection

[Chart: extrapolation of the Top500 trend lines (SUM, N=1, N=500) from 1993 out to 2015, on a log scale from 100 Mflop/s to 1 Eflop/s, with lags of roughly 6-8 years and 8-10 years annotated between the curves.]

30th List / November 2007, www.top500.org

SLIDE 17

Los Alamos Roadrunner: A Petascale System in 2008

  • "Connected Unit" cluster: 192 Opteron nodes (180 with 2 dual-Cell blades connected with 4 PCIe x8 links)
  • ~18 clusters
  • ≈ 13,000 Cell HPC chips; ≈ 1.33 PetaFlop/s (from Cell)
  • ≈ 7,000 dual-core Opterons
  • 2nd-stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches)
  • Based on the 100 Gflop/s (DP) Cell chip
  • Approval by DOE 12/07; first CU being built today; expect a May Pflop/s run; full system to LANL in December 2008

SLIDE 18

  • Increasing the number of gates packed into a tight knot and decreasing the cycle time of the processor: lower voltage, increase clock rate and transistor density.
  • We have seen an increasing number of gates on a chip and increasing clock speed; heat is becoming an unmanageable problem (Intel processors > 100 Watts).
  • We will not see the dramatic increases in clock speeds in the future. However, the number of gates on a chip will continue to increase.

[Diagram: evolution from a single core with cache, to dual-core (C1, C2 sharing a cache), to quad-core (C1-C4) chips.]

SLIDE 19

Power Cost of Frequency

  • Power ∝ Voltage² × Frequency (V²F)
  • Frequency ∝ Voltage
  • Therefore Power ∝ Frequency³

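A back-of-the-envelope consequence of the cubic relationship above (my own illustration, not from the slides): modest frequency reductions buy large power savings, which is the argument for many slower cores rather than one fast one.

def relative_power(freq_ratio):
    """Relative dynamic power when frequency (and voltage with it) scales by freq_ratio,
    using the slide's model P ~ V^2 * f with V ~ f, i.e. P ~ f^3."""
    return freq_ratio ** 3

print(relative_power(0.9))        # ~0.73: a 10% slower clock needs ~27% less power
print(2 * relative_power(0.5))    # 0.25: two cores at half speed give the same ideal
                                  # throughput at roughly a quarter of the power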


SLIDE 21

What's Next?

[Diagram: possible directions for future chips: All Large Cores; Mixed Large and Small Cores; Many Small Cores; All Small Cores; plus different classes of chips for different markets (Home, Games/Graphics, Business, Scientific), and designs with 3D-stacked memory, SRAM, and many floating-point cores.]

SLIDE 22

80 Core

  • Intel's 80-core chip
    - 1 Tflop/s
    - 62 Watts
    - 1.2 TB/s internal bandwidth

SLIDE 23

Major Changes to Software

  • Must rethink the design of our software
    - Another disruptive technology
    - Similar to what happened with cluster computing and message passing
    - Rethink and rewrite the applications, algorithms, and software
  • Numerical libraries, for example, will change
    - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

SLIDE 24

A New Generation of Software:
Parallel Linear Algebra Software for Multicore Architectures (PLASMA)

Algorithms follow hardware evolution in time:
  • LINPACK (70's): rely on Level-1 BLAS operations (vector operations)
  • LAPACK (80's): rely on Level-3 BLAS operations (blocking, cache friendly)
  • ScaLAPACK (90's): rely on PBLAS and message passing (distributed memory)
  • PLASMA (00's): rely on new, many-core-friendly algorithms:
    - a DAG/scheduler
    - block data layout
    - some extra kernels

These new algorithms:
  • have a very low granularity; they scale very well (multicore, petascale computing, ...)
  • remove a lot of dependencies among the tasks (multicore, distributed computing)
  • avoid latency (distributed computing, out-of-core)
  • rely on fast kernels

These new algorithms need new kernels and rely on efficient scheduling algorithms. A sketch of the block (tile) data layout follows.

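As a concrete, purely illustrative picture of the "block data layout" item above, the NumPy sketch below repacks a matrix into contiguous square tiles. The function name and tile size are my own; PLASMA's real layout and kernels are more involved.

import numpy as np

def to_tiles(A, nb):
    """Repack an n x n matrix into contiguous nb x nb tiles (a block data layout).
    tiles[i, j] holds the (i, j) tile of A as one contiguous block of memory,
    so a task that operates on a single tile touches one compact region."""
    n = A.shape[0]
    assert n % nb == 0, "illustration assumes nb divides n"
    t = n // nb
    return np.ascontiguousarray(A.reshape(t, nb, t, nb).transpose(0, 2, 1, 3))

A = np.arange(36.0).reshape(6, 6)
T = to_tiles(A, 3)
print(T[1, 0])   # the bottom-left 3x3 tile of A, stored contiguously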


SLIDE 26

Steps in the LAPACK LU

  • DGETF2 (LAPACK): factor a panel
  • DLASWP (LAPACK): backward swap
  • DLASWP (LAPACK): forward swap
  • DTRSM (BLAS): triangular solve
  • DGEMM (BLAS): matrix multiply

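To make the sequence of steps above concrete, here is a minimal NumPy sketch of a right-looking blocked LU with partial pivoting, with each phase commented with the routine it corresponds to. The function name, block size, and verification are my own, and the two DLASWP passes are collapsed into a single full-row swap; this illustrates the algorithmic structure, not the LAPACK code itself.

import numpy as np

def blocked_lu(A, nb=64):
    """Blocked right-looking LU with partial pivoting, phase by phase as on the slide.
    Returns (LU, perm) with unit-lower L and U packed together, so that L @ U == A[perm]."""
    A = np.array(A, dtype=np.float64)
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # DGETF2: unblocked LU with partial pivoting on the panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:                                   # DLASWP: swap entire rows
                A[[j, p], :] = A[[p, j], :]
                perm[[j, p]] = perm[[p, j]]
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        if k + kb < n:
            # DTRSM: triangular solve for the block row of U:  L11 * U12 = A12
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            # DGEMM: update the trailing submatrix:  A22 -= L21 @ U12
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A, perm

rng = np.random.default_rng(0)
M = rng.standard_normal((300, 300))
LU, perm = blocked_lu(M, nb=32)
L = np.tril(LU, -1) + np.eye(300)
U = np.triu(LU)
print(np.allclose(L @ U, M[perm]))   # True: the factorization satisfies P*A = L*U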

SLIDE 27

LU Timing Profile (4-processor system)

[Trace: threads with no lookahead, 1D decomposition, on an SGI Origin; time spent in each component (DGETF2, DLASWP left, DLASWP right, DTRSM, DGEMM), with bulk-synchronous phases between steps.]

SLIDE 28

Adaptive Lookahead - Dynamic

  • Event-driven multithreading
  • Reorganizing algorithms to use this approach

SLIDE 29

Fork-Join vs. Dynamic Execution

[Trace: fork-join execution with parallel BLAS; tasks proceed in steps separated by synchronization points along the time axis.]

Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads.

SLIDE 30

Fork-Join vs. Dynamic Execution

[Trace: fork-join with parallel BLAS versus DAG-based dynamic scheduling on the same problem; the dynamic schedule removes most of the synchronization gaps, and the time saved is marked.]

Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads. A toy sketch of dependency-driven scheduling follows.

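The deck contrasts barrier-style fork-join with dynamic, dependency-driven execution. Below is a toy Python sketch of that idea, with made-up task names: each task starts as soon as its own prerequisites finish rather than waiting at a global barrier. It only illustrates the scheduling concept, not PLASMA's actual scheduler.

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

def run_dag(tasks, deps, workers=4):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Submit each task as soon as all of its prerequisites have completed."""
    done, running = set(), {}                 # running maps future -> task name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(tasks):
            for name, fn in tasks.items():
                if name not in done and name not in running.values() \
                        and deps.get(name, set()) <= done:
                    running[pool.submit(fn)] = name
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in finished:
                fut.result()                  # re-raise any task exception
                done.add(running.pop(fut))

def work(name, secs):
    def task():
        time.sleep(secs)
        print(name, "finished")
    return task

# Hypothetical tile tasks: the updates depend on the first panel, the next panel on one update.
tasks = {"panel0": work("panel0", 0.2), "update01": work("update01", 0.3),
         "update02": work("update02", 0.3), "panel1": work("panel1", 0.2)}
deps = {"update01": {"panel0"}, "update02": {"panel0"}, "panel1": {"update01"}}
run_dag(tasks, deps)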

SLIDE 31

With the Hype on Cell & PS3, We Became Interested

  • The PlayStation 3's CPU is based on a "Cell" processor
  • Each Cell contains a PowerPC processor and 8 SPEs (an SPE is the processing unit: SPU + DMA engine)
  • An SPE is a self-contained vector processor which acts independently from the others
    - 4-way SIMD floating-point units capable of a total of 25.6 Gflop/s @ 3.2 GHz per SPE
    - 204.8 Gflop/s peak for the chip!
  • The catch is that this is for 32-bit floating point (single precision, SP)
  • 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    - Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues

SLIDE 32

Performance of Single Precision on Conventional Processors

  • We realized we have a similar situation on our commodity processors: SP is about 2X as fast as DP on many systems
  • The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle DP, 4 flops/cycle SP
  • IBM PowerPC has AltiVec: 8 flops/cycle SP, 4 flops/cycle DP (no DP on AltiVec)
  • Single precision is faster because:
    - Higher parallelism in SSE/vector units
    - Reduced data motion
    - Higher locality in cache

Measured speedups (a quick way to reproduce this appears after the table):

  Processor                SGEMM/DGEMM (n=3000)   SGEMV/DGEMV (n=5000)
  AMD Opteron 246                  2.00                   1.70
  UltraSparc-IIe                   1.64                   1.66
  Intel PIII Coppermine            2.03                   2.09
  PowerPC 970                      2.04                   1.44
  Intel Woodcrest                  1.81                   2.18
  Intel XEON                       2.04                   1.82
  Intel Centrino Duo               2.71                   2.21

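A hedged way to reproduce the SGEMM/DGEMM ratio on whatever machine is at hand (my own snippet, not from the slides; the absolute numbers depend entirely on the local BLAS):

import time
import numpy as np

def gemm_seconds(dtype, n=3000, reps=3):
    """Time an n x n matrix multiply in the given precision using the local BLAS."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)).astype(dtype)
    b = rng.standard_normal((n, n)).astype(dtype)
    a @ b                                      # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    return (time.perf_counter() - t0) / reps

t_dp = gemm_seconds(np.float64)
t_sp = gemm_seconds(np.float32)
print(f"DGEMM {t_dp:.3f} s, SGEMM {t_sp:.3f} s, SGEMM/DGEMM speedup {t_dp / t_sp:.2f}x")
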
SLIDE 33

32- or 64-bit Floating Point Precision?

  • A long time ago, 32-bit floating point was used
    - Still used in scientific apps, but limited
  • Most apps use 64-bit floating point
    - Accumulation of round-off error: a 10 TFlop/s computer running for 4 hours performs more than 10^17 operations
    - Ill-conditioned problems
    - IEEE SP exponent bits too few (8 bits, 10^±38)
    - Critical sections need higher precision
  • Sometimes extended precision (128-bit floating point) is needed
  • However, some can get by with 32-bit floating point in some parts
  • Mixed precision is a possibility:
    - Approximate in lower precision and then refine or improve the solution to high precision

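A small, self-contained illustration of the round-off accumulation point above (my own example, not from the slides): naively accumulating in float32 drifts visibly, while the same sum carried in float64 does not.

import numpy as np

N = 1_000_000
acc32 = np.float32(0.0)
for _ in range(N):
    acc32 += np.float32(0.1)     # each add rounds to ~7 decimal digits; the errors pile up
acc64 = 0.1 * N                  # the same sum carried in float64
print(f"float32 accumulation: {acc32:.1f}")
print(f"float64 accumulation: {acc64:.1f}")   # the float32 result is visibly off
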
SLIDE 34

The Idea Goes Something Like This...

  • Exploit 32-bit floating point as much as possible, especially for the bulk of the computation
  • Correct or update the solution with selective use of 64-bit floating point to provide a refined result
  • Intuitively:
    - Compute a 32-bit result,
    - Calculate a correction to the 32-bit result using selected higher precision, and
    - Perform the update of the 32-bit result with the correction using high precision

SLIDE 35

Mixed-Precision Iterative Refinement

  • Iterative refinement for dense systems, Ax = b, can work this way:

      L U = lu(A)                  SINGLE   O(n^3)
      x = L\(U\b)                  SINGLE   O(n^2)
      r = b - Ax                   DOUBLE   O(n^2)
      WHILE || r || not small enough
          z = L\(U\r)              SINGLE   O(n^2)
          x = x + z                DOUBLE   O(n)
          r = b - Ax               DOUBLE   O(n^2)
      END

  • Wilkinson, Moler, Stewart, & Higham provide error bounds for SP floating-point results when using DP floating point.

SLIDE 36

Mixed-Precision Iterative Refinement

  • Iterative refinement for dense systems, Ax = b, can work this way:

      L U = lu(A)                  SINGLE   O(n^3)
      x = L\(U\b)                  SINGLE   O(n^2)
      r = b - Ax                   DOUBLE   O(n^2)
      WHILE || r || not small enough
          z = L\(U\r)              SINGLE   O(n^2)
          x = x + z                DOUBLE   O(n)
          r = b - Ax               DOUBLE   O(n^2)
      END

  • Wilkinson, Moler, Stewart, & Higham provide error bounds for SP floating-point results when using DP floating point.
  • It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
    - Requires extra storage; the total is 1.5 times normal
    - O(n^3) work is done in lower precision
    - O(n^2) work is done in high precision
    - Problems if the matrix is ill-conditioned in SP (condition number on the order of 10^8)
  • A runnable sketch of the scheme follows.

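Here is a minimal NumPy/SciPy sketch of the scheme above, assuming a well-conditioned dense system: the O(n^3) factorization is done in float32 and the O(n^2) residual/update steps in float64. The function name, stopping test, and iteration cap are my own choices; this illustrates the idea, not the LAPACK DSGESV routine.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, max_iter=30):
    """Solve Ax = b with a float32 LU factorization plus float64 iterative refinement."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    tol = np.finfo(np.float64).eps * np.linalg.norm(A64, np.inf)   # rough stopping scale
    lu, piv = lu_factor(A64.astype(np.float32))                    # SINGLE, O(n^3)
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b64 - A64 @ x                                          # DOUBLE, O(n^2)
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(x, np.inf):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))              # SINGLE, O(n^2)
        x = x + z.astype(np.float64)                               # DOUBLE, O(n)
    return x

rng = np.random.default_rng(1)
n = 1000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
# normalized residual, typically near double-precision round-off for a random matrix
print(np.linalg.norm(b - A @ x, np.inf) / (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)))
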
SLIDE 37

Results for Mixed-Precision Iterative Refinement for Dense Ax = b

Architectures (BLAS):
   1. Intel Pentium III Coppermine (Goto)
   2. Intel Pentium III Katmai (Goto)
   3. Sun UltraSPARC IIe (Sunperf)
   4. Intel Pentium IV Prescott (Goto)
   5. Intel Pentium IV-M Northwood (Goto)
   6. AMD Opteron (Goto)
   7. Cray X1 (libsci)
   8. IBM Power PC G5, 2.7 GHz (VecLib)
   9. Compaq Alpha EV6 (CXML)
  10. IBM SP Power3 (ESSL)
  11. SGI Octane (ATLAS)

  • Single precision is faster than DP because:
    - Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
    - Reduced data motion: 32-bit data instead of 64-bit data
    - Higher locality in cache: more data items in cache

SLIDE 38

Results for Mixed-Precision Iterative Refinement for Dense Ax = b

Architectures (BLAS): as listed on Slide 37.

Cluster results:

  Architecture (BLAS-MPI)            #procs     n      DP Solve / SP Solve   DP Solve / Iter Ref   #iter
  AMD Opteron (Goto - OpenMPI MX)      32      22627          1.85                  1.79              6
  AMD Opteron (Goto - OpenMPI MX)      64      32000          1.90                  1.83              6

  • Single precision is faster than DP because:
    - Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
    - Reduced data motion: 32-bit data instead of 64-bit data
    - Higher locality in cache: more data items in cache

SLIDE 39

What about the Cell?

  • PowerPC at 3.2 GHz
    - DGEMM at 5 Gflop/s
    - AltiVec peak at 25.6 Gflop/s; achieved 10 Gflop/s SGEMM
  • 8 SPEs
    - 204.8 Gflop/s peak!
    - The catch is that this is for 32-bit floating point (single precision, SP)
    - 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    - Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues

SLIDE 40

Moving Data Around on the Cell

  • 256 KB local store per SPE; injection bandwidth is 25.6 GB/s
  • Worst case, memory-bound operations (no reuse of data): 3 data movements (2 in and 1 out) with 2 ops (SAXPY)
  • For the Cell this would be about 4.3 Gflop/s (25.6 GB/s × 2 ops / 12 B)

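A one-liner restating the slide's bandwidth-bound estimate (the bandwidth and byte counts are the slide's; the snippet is mine):

# SAXPY (y = a*x + y) in single precision moves 3 words (2 loads + 1 store) per 2 flops.
bandwidth_bytes_per_s = 25.6e9     # injection bandwidth assumed on the slide
bytes_per_element = 3 * 4          # 12 bytes moved per vector element
flops_per_element = 2
print(bandwidth_bytes_per_s / bytes_per_element * flops_per_element / 1e9, "Gflop/s")
# ~4.3 Gflop/s: far below the 204.8 Gflop/s SP peak, so data movement dominates.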

SLIDE 41

IBM Cell 3.2 GHz, Ax = b

[Chart: Gflop/s versus matrix size (500 to 4500). Curves shown: SP peak (204 Gflop/s), SGEMM on 8 SPEs (embarrassingly parallel), SP Ax=b (about 0.30 s at the largest size), DP peak (15 Gflop/s), and DP Ax=b (about 3.9 s).]

SLIDE 42

IBM Cell 3.2 GHz, Ax = b

[Chart: same axes and curves as Slide 41, with the mixed-precision DSGESV curve added. DSGESV runs close to the SP solve (about 0.47 s vs. 0.30 s at the largest size), an 8.3X speedup over the DP solve (about 3.9 s).]

SLIDE 43

Cholesky - Using 2 Cell Chips

[Performance chart for Cholesky factorization using 2 Cell chips.]

SLIDE 44

Intriguing Potential

  • Exploit lower precision as much as possible
    - Payoff in performance: faster floating point, less data to move
  • Automatically switch between SP and DP to match the desired accuracy
    - Compute the solution in SP and then a correction to the solution in DP
  • Potential for GPUs, FPGAs, and special-purpose processors
    - What about 16-bit floating point?
  • Use as little precision as you can get away with and improve the accuracy
  • Applies to sparse direct and iterative linear systems, and to eigenvalue and optimization problems where Newton's method is used
    - Correction = - A\(b - Ax)

SLIDE 45

Conclusions

  • For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
  • This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
  • Moreover, the return on investment is more favorable to software: hardware has a half-life measured in years, while software has a half-life measured in decades.
  • The high performance ecosystem is out of balance
    - Hardware, OS, compilers, software, algorithms, applications
    - No Moore's Law for software, algorithms, and applications

SLIDE 46

Collaborators / Support

  • Alfredo Buttari, UTK
  • Julien Langou, UColorado
  • Julie Langou, UTK
  • Piotr Luszczek, MathWorks
  • Jakub Kurzak, UTK
  • Stan Tomov, UTK