ManyCore Computing: The Impact on Numerical Software for Linear Algebra Libraries


SLIDE 1

ManyCore Computing: The Impact on Numerical Software for Linear Algebra Libraries

Jack Dongarra

INNOVATIVE COMPUTING LABORATORY

University of Tennessee, Oak Ridge National Laboratory, University of Manchester

11/20/2007

SLIDE 2

Performance Projection, Top500 Data

[Chart: projected Top500 performance, 1993-2009, for SUM, N=1, and N=500, spanning 100 MF/s to 100 PF/s]

SLIDE 3

What Will a Petascale System Look Like?

Possible petascale system:

  • 1. Number of cores per node: 10 - 100 cores
  • 2. Performance per node: 100 - 1,000 GFlop/s
  • 3. Number of nodes: 1,000 - 10,000 nodes
  • 4. Latency inter-nodes: 1 μsec
  • 5. Bandwidth inter-nodes: 10 GB/s
  • 6. Memory per node: 10 GB

  • Part I: First rule in linear algebra: have an efficient DGEMM
    Motivation in: 2. performance per node, 5. bandwidth inter-nodes, 6. memory per node
  • Part II: Algorithms for multicore and latency-avoiding algorithms for LU, QR, …
    Motivation in: 1. number of cores per node, 2. performance per node, 4. latency inter-nodes
  • Part III: Algorithms for fault tolerance
    Motivation in: 1. number of cores per node, 3. number of nodes
SLIDE 4

Major Changes to Software

  • Must rethink the design of our software
    Another disruptive technology
    Similar to what happened with cluster computing and message passing
    Rethink and rewrite the applications, algorithms, and software
  • Numerical libraries, for example, will change
    Both LAPACK and ScaLAPACK will undergo major changes to accommodate this

SLIDE 5

Coding for an Abstract Multicore

Parallel software for multicores should have two characteristics:

  • Fine granularity: a high level of parallelism is needed, and cores will probably be associated with relatively small local memories. This requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality.
  • Asynchronicity: as the degree of thread-level parallelism (TLP) grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.

SLIDE 6

ManyCore: Parallelism for the Masses

  • We are looking at the following concepts in designing the next numerical library implementation:

    Dynamic data-driven execution
    Self-adapting
    Block data layout
    Mixed precision in the algorithm
    Exploit hybrid architectures
    Fault tolerant methods

SLIDE 7

A New Generation of Software

Algorithms follow hardware evolution in time:

  • LINPACK (70's): relies on Level-1 BLAS operations (vector operations)
  • LAPACK (80's): relies on Level-3 BLAS operations (blocking, cache friendly)
  • ScaLAPACK (90's): relies on PBLAS and message passing (distributed memory)
  • PLASMA (00's): relies on new algorithms (many-core friendly): a DAG/scheduler, block data layout, some extra kernels

These new algorithms
  • have a very low granularity and scale very well (multicore, petascale computing, …)
  • remove a lot of dependencies among the tasks (multicore, distributed computing)
  • avoid latency (distributed computing, out-of-core)
  • rely on fast kernels

These new algorithms need new kernels and rely on efficient scheduling algorithms.

SLIDE 8

A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)

Algorithms follow hardware evolution in time:

  • LINPACK (70's): relies on Level-1 BLAS operations (vector operations)
  • LAPACK (80's): relies on Level-3 BLAS operations (blocking, cache friendly)
  • ScaLAPACK (90's): relies on PBLAS and message passing (distributed memory)
  • PLASMA (00's): relies on new algorithms (many-core friendly): a DAG/scheduler, block data layout

These new algorithms
  • have a very low granularity and scale very well (multicore, petascale computing, …)
  • remove a lot of dependencies among the tasks (multicore, distributed computing)
  • avoid latency (distributed computing, out-of-core)
  • rely on fast kernels

These new algorithms need new kernels and rely on efficient scheduling algorithms.

SLIDE 9

Developing Parallel Algorithms

[Diagram: two ways of obtaining parallelism — LAPACK on top of a threaded BLAS (parallelism inside the BLAS), versus parallelism expressed above the library with PThreads/OpenMP calling a sequential BLAS]

SLIDE 10

Steps in the LAPACK LU

  • DGETF2 (LAPACK): factor a panel
  • DLASWP (LAPACK): backward swap
  • DLASWP (LAPACK): forward swap
  • DTRSM (BLAS): triangular solve
  • DGEMM (BLAS): matrix multiply
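For concreteness, here is a sketch of how those kernels combine in a right-looking blocked LU in the spirit of LAPACK's DGETRF. It assumes a column-major matrix and a LAPACKE/CBLAS installation that exposes the dgetf2 and dlaswp wrappers; it is an illustration, not the LAPACK source.

    #include <stddef.h>
    #include <lapacke.h>
    #include <cblas.h>

    /* Right-looking blocked LU with partial pivoting, spelled out with the kernels
       named on the slide. Column-major A (m x n, leading dimension lda), pivots in
       ipiv, block size nb. */
    void blocked_lu(lapack_int m, lapack_int n, double *A, lapack_int lda,
                    lapack_int *ipiv, lapack_int nb)
    {
        lapack_int minmn = (m < n) ? m : n;
        for (lapack_int j = 0; j < minmn; j += nb) {
            lapack_int jb = (nb < minmn - j) ? nb : minmn - j;

            /* DGETF2: unblocked factorization of the panel A(j:m-1, j:j+jb-1) */
            LAPACKE_dgetf2(LAPACK_COL_MAJOR, m - j, jb,
                           &A[j + (size_t)j * lda], lda, &ipiv[j]);
            for (lapack_int i = j; i < j + jb; i++)
                ipiv[i] += j;                   /* make the panel's pivot indices global */

            /* DLASWP: apply the panel's row swaps to the columns left and right of it */
            if (j > 0)
                LAPACKE_dlaswp(LAPACK_COL_MAJOR, j, A, lda, j + 1, j + jb, ipiv, 1);
            if (j + jb < n)
                LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - j - jb, &A[(size_t)(j + jb) * lda],
                               lda, j + 1, j + jb, ipiv, 1);

            if (j + jb < n) {
                /* DTRSM: triangular solve producing the block row of U */
                cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                            jb, n - j - jb, 1.0, &A[j + (size_t)j * lda], lda,
                            &A[j + (size_t)(j + jb) * lda], lda);
                /* DGEMM: rank-jb update of the trailing submatrix, A22 -= A21 * U12 */
                if (j + jb < m)
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                                m - j - jb, n - j - jb, jb, -1.0,
                                &A[(j + jb) + (size_t)j * lda], lda,
                                &A[j + (size_t)(j + jb) * lda], lda, 1.0,
                                &A[(j + jb) + (size_t)(j + jb) * lda], lda);
            }
        }
    }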

SLIDE 11

LU Timing Profile (4-core system)

[Chart: time spent in each component — DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM — for a threaded run with no lookahead; execution proceeds in bulk-synchronous phases]

SLIDE 12

Adaptive Lookahead: Dynamic

Event-driven multithreading: reorganizing algorithms to use this approach.

SLIDE 13

Fork-Join vs. Dynamic Execution

[Diagram: fork-join execution over time with a parallel BLAS; threads synchronize between tasks]

Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads)

SLIDE 14

Fork-Join vs. Dynamic Execution

[Diagram: fork-join execution with a parallel BLAS versus DAG-based dynamic scheduling over time; the dynamic schedule shows the time saved]

Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads)

SLIDE 15

Achieving Asynchronicity

The matrix factorization can be represented as a DAG:

  • nodes: tasks that operate on "tiles"
  • edges: dependencies among tasks

Tasks can be scheduled asynchronously and in any order as long as dependencies are not violated.
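As an illustration of that principle, the sketch below expresses a tile Cholesky factorization as a DAG of tasks, with OpenMP 4 task dependences standing in for the dynamic scheduler described in the talk (this is not PLASMA code). It assumes an SPD matrix whose order is divisible by the tile size and a LAPACKE/CBLAS build; POTRF/TRSM/SYRK/GEMM are the per-tile kernels.

    #include <lapacke.h>
    #include <cblas.h>
    #include <omp.h>

    /* Tile Cholesky (lower) of a column-major n x n matrix with tile size nb.
       Each kernel call is a task; depend clauses encode the DAG edges, and the
       runtime executes tasks as soon as their dependencies are satisfied. */
    void tile_cholesky(int n, double *A, int lda, int nb)
    {
        int nt = n / nb;                       /* number of tile rows/columns */
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < nt; k++) {
            double *Akk = &A[k*nb + (size_t)k*nb*lda];
            #pragma omp task depend(inout: Akk[0])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, Akk, lda);      /* POTRF */

            for (int i = k + 1; i < nt; i++) {
                double *Aik = &A[i*nb + (size_t)k*nb*lda];
                #pragma omp task depend(in: Akk[0]) depend(inout: Aik[0])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, nb, nb, 1.0, Akk, lda, Aik, lda); /* TRSM */
            }
            for (int i = k + 1; i < nt; i++) {
                double *Aik = &A[i*nb + (size_t)k*nb*lda];
                double *Aii = &A[i*nb + (size_t)i*nb*lda];
                #pragma omp task depend(in: Aik[0]) depend(inout: Aii[0])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                            nb, nb, -1.0, Aik, lda, 1.0, Aii, lda);         /* SYRK */
                for (int j = k + 1; j < i; j++) {
                    double *Ajk = &A[j*nb + (size_t)k*nb*lda];
                    double *Aij = &A[i*nb + (size_t)j*nb*lda];
                    #pragma omp task depend(in: Aik[0], Ajk[0]) depend(inout: Aij[0])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                nb, nb, nb, -1.0, Aik, lda, Ajk, lda,
                                1.0, Aij, lda);                              /* GEMM */
                }
            }
        }
    }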

SLIDE 16

Achieving Asynchronicity

A critical path can be defined as the shortest path that connects all the nodes with the highest number of outgoing edges.

Priorities: [DAG figure]

SLIDE 17

Achieving Asynchronicity

  • Very fine granularity
  • Few dependencies, i.e., high flexibility for the scheduling of tasks
  • Asynchronous scheduling
  • No idle times
  • Some degree of adaptivity
  • Better locality thanks to block data layout

SLIDE 18

Cholesky Factorization: DAG-based Dependency Tracking

[Diagram: tiles labeled by index pairs (1:1, 1:2, 2:2, …, 4:4) with the dependencies among them]

Dependencies expressed by the DAG are enforced on a tile basis:
  • fine-grained parallelization
  • flexible scheduling

SLIDE 19

Cholesky on the IBM Cell

  • Pipelining: between loop iterations.
  • Double buffering: within BLAS, between BLAS, between loop iterations.
  • Result: minimum load imbalance, minimum dependency stalls, minimum memory stalls (no waiting for data).

Achieves 174 Gflop/s; 85% of peak in SP.
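The double-buffering pattern mentioned above can be sketched in plain C: while the kernel works on one local buffer, the next tile is fetched into the other. The dma_get_async/dma_wait/compute_tile helpers below are hypothetical stand-ins (on a real SPE they would be MFC DMA intrinsics and an SGEMM tile kernel); only the overlap structure is the point.

    #include <stddef.h>
    #include <string.h>

    #define TILE_ELEMS 1024                         /* tile size; must fit in local store */

    /* Hypothetical stand-ins for the SPE's DMA calls and BLAS kernel. */
    static void dma_get_async(double *local, const double *remote, size_t bytes, int tag)
    { (void)tag; memcpy(local, remote, bytes); }    /* a real MFC get would be asynchronous */
    static void dma_wait(int tag) { (void)tag; }
    static void compute_tile(double *tile) { for (int i = 0; i < TILE_ELEMS; i++) tile[i] *= 2.0; }

    void process_stream(const double *remote_tiles, int ntiles)
    {
        static double buf[2][TILE_ELEMS];           /* two local-store buffers */
        int cur = 0;

        dma_get_async(buf[cur], remote_tiles, sizeof buf[cur], cur);
        for (int t = 0; t < ntiles; t++) {
            int nxt = 1 - cur;
            if (t + 1 < ntiles)                     /* prefetch the next tile into the other buffer */
                dma_get_async(buf[nxt], remote_tiles + (size_t)(t + 1) * TILE_ELEMS,
                              sizeof buf[nxt], nxt);
            dma_wait(cur);                          /* ensure the current tile has arrived */
            compute_tile(buf[cur]);                 /* compute overlaps the in-flight transfer */
            cur = nxt;
        }
    }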

SLIDE 20

Cholesky: Using 2 Cell Chips

[Performance chart]

SLIDE 21

Parallelism in LAPACK: Blocked Storage

[Diagram: column-major storage]

SLIDE 22

Parallelism in LAPACK: Blocked Storage

[Diagram: column-major vs. blocked storage]

SLIDE 23

Parallelism in LAPACK: Blocked Storage

[Diagram: column-major vs. blocked storage]

SLIDE 24

Parallelism in LAPACK: Blocked Storage

The use of blocked storage can significantly improve performance.

[Chart: blocking speedup of DGEMM and DTRSM, roughly 0.8x to 2x, for block sizes 64, 128, and 256]
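One way to picture the blocked (tile) storage behind that speedup: repack the column-major matrix so that each nb x nb tile is contiguous in memory, which is what gives the locality benefit measured above. A minimal sketch, assuming n is divisible by nb (the function name is ours, not a LAPACK routine):

    #include <stdlib.h>
    #include <string.h>

    /* Copy a column-major n x n matrix A (leading dimension lda) into block data
       layout: tile (i,j) occupies nb*nb contiguous doubles, tiles stored column
       by column of tiles. Returns a newly allocated buffer. */
    double *to_block_layout(int n, int nb, const double *A, int lda)
    {
        int nt = n / nb;
        double *T = malloc((size_t)n * n * sizeof(double));
        for (int j = 0; j < nt; j++)
            for (int i = 0; i < nt; i++) {
                double *tile = T + ((size_t)j * nt + i) * nb * nb;
                for (int jj = 0; jj < nb; jj++)       /* copy one tile column at a time */
                    memcpy(tile + (size_t)jj * nb,
                           A + (size_t)(j * nb + jj) * lda + i * nb,
                           nb * sizeof(double));
            }
        return T;
    }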

SLIDE 25

Multicore Friendly Algorithms

[Chart: QR factorization on a 2-socket Clovertown (peak 85.12 Gflop/s); Gflop/s vs. problem size (2000-14000) for the DAG-based tiled algorithm, Intel MKL, and LAPACK with BLAS threading]

SLIDE 26

Intel's Clovertown Quad Core

Three implementations of LU factorization on a quad-core, 2-socket board (8 threads):
  • 1. LAPACK (BLAS fork-join parallelism)
  • 2. ScaLAPACK (message passing using memory copy)
  • 3. DAG-based (dynamic scheduling)

[Chart: Mflop/s vs. problem size (1000-15000) for the three implementations; 8-core experiments]

SLIDE 27

With the Hype on Cell & PS3 We Became Interested

  • The PlayStation 3's CPU is based on a "Cell" processor
  • Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPU + DMA engine)
  • An SPE is a self-contained vector processor that acts independently from the others
    4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz per SPE
    204.8 Gflop/s peak!
  • The catch is that this is for 32-bit floating point (single precision, SP)
  • 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues

SLIDE 28

Moving Data Around on the Cell

  • 256 KB local store per SPE; 25.6 GB/s injection bandwidth
  • Worst case, memory-bound operations (no reuse of data): 3 data movements (2 in and 1 out) with 2 ops (SAXPY)
  • For the Cell this would be roughly 4.3 Gflop/s (25.6 GB/s x 2 ops / 12 B)
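The memory-bound figure follows directly from the bandwidth and the bytes moved per flop; as a quick check for single-precision SAXPY (3 movements of 4 bytes per 2 flops):

    25.6 GB/s / (3 x 4 B) x 2 flops ≈ 4.3 Gflop/s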

SLIDE 29

32 or 64 bit Floating Point Precision?

  • A long time ago 32-bit floating point was used
    Still used in scientific apps, but limited
  • Most apps use 64-bit floating point
    Accumulation of round-off error: a 10 TFlop/s computer running for 4 hours performs > 1 Exaflop (10^18) ops
    Ill-conditioned problems: IEEE SP exponent bits are too few (8 bits, 10^±38)
    Critical sections need higher precision
  • Sometimes need extended precision (128-bit floating point)
  • However, some can get by with 32-bit floating point in some parts
  • Mixed precision is a possibility
    Approximate in lower precision and then refine or improve the solution to high precision
SLIDE 30

Idea Goes Something Like This…

  • Exploit 32-bit floating point as much as possible
    Especially for the bulk of the computation
  • Correct or update the solution with selective use of 64-bit floating point to provide a refined result
  • Intuitively:
    Compute a 32-bit result,
    Calculate a correction to the 32-bit result using selected higher precision, and
    Perform the update of the 32-bit result with the correction using high precision.

SLIDE 31

Mixed-Precision Iterative Refinement

  • Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                  SINGLE   O(n^3)
    x = L\(U\b)                  SINGLE   O(n^2)
    r = b - Ax                   DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)              SINGLE   O(n^2)
        x = x + z                DOUBLE   O(n)
        r = b - Ax               DOUBLE   O(n^2)
    END

  Wilkinson, Moler, Stewart, & Higham provide error bounds for SP floating-point results when using DP floating point.

SLIDE 32

Mixed-Precision Iterative Refinement

  • Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                  SINGLE   O(n^3)
    x = L\(U\b)                  SINGLE   O(n^2)
    r = b - Ax                   DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)              SINGLE   O(n^2)
        x = x + z                DOUBLE   O(n)
        r = b - Ax               DOUBLE   O(n^2)
    END

  Wilkinson, Moler, Stewart, & Higham provide error bounds for SP floating-point results when using DP floating point. It can be shown that using this approach we can compute the solution to 64-bit floating point precision.

  • Requires extra storage; total is 1.5 times normal
  • O(n^3) work is done in lower precision
  • O(n^2) work is done in high precision
  • Problems if the matrix is ill-conditioned in SP, condition number ~ O(10^8)
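LAPACK provides this scheme as the DSGESV driver (the routine plotted on slide 36). A minimal calling sketch, assuming a LAPACKE installation that exposes the dsgesv wrapper; the test matrix is made diagonally dominant so it is well conditioned in single precision:

    #include <stdio.h>
    #include <stdlib.h>
    #include <lapacke.h>

    /* Solve Ax = b with the mixed-precision driver: single-precision LU plus
       double-precision iterative refinement. DSGESV falls back to a full DP
       solve internally if refinement does not converge (iter < 0 on return). */
    int main(void)
    {
        const lapack_int n = 1000, nrhs = 1;
        double *A = malloc((size_t)n * n * sizeof *A);
        double *b = malloc((size_t)n * nrhs * sizeof *b);
        double *x = malloc((size_t)n * nrhs * sizeof *x);
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
        lapack_int iter, info;

        for (lapack_int j = 0; j < n; j++) {
            for (lapack_int i = 0; i < n; i++)
                A[i + (size_t)j * n] = (i == j) ? 2.0 * n : 1.0;
            b[j] = 1.0;
        }

        info = LAPACKE_dsgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv,
                              b, n, x, n, &iter);
        printf("info = %d, refinement iterations = %d\n", (int)info, (int)iter);

        free(A); free(b); free(x); free(ipiv);
        return (int)info;
    }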
SLIDE 33

Results for Mixed Precision Iterative Refinement for Dense Ax = b

Architectures (BLAS):
  1. Intel Pentium III Coppermine (Goto)
  2. Intel Pentium III Katmai (Goto)
  3. Sun UltraSPARC IIe (Sunperf)
  4. Intel Pentium IV Prescott (Goto)
  5. Intel Pentium IV-M Northwood (Goto)
  6. AMD Opteron (Goto)
  7. Cray X1 (libsci)
  8. IBM PowerPC G5 (2.7 GHz) (VecLib)
  9. Compaq Alpha EV6 (CXML)
  10. IBM SP Power3 (ESSL)
  11. SGI Octane (ATLAS)

  • Single precision is faster than DP because:
    • Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
    • Reduced data motion: 32-bit data instead of 64-bit data
    • Higher locality in cache: more data items in cache

SLIDE 34

Results for Mixed Precision Iterative Refinement for Dense Ax = b

Architectures (BLAS):
  1. Intel Pentium III Coppermine (Goto)
  2. Intel Pentium III Katmai (Goto)
  3. Sun UltraSPARC IIe (Sunperf)
  4. Intel Pentium IV Prescott (Goto)
  5. Intel Pentium IV-M Northwood (Goto)
  6. AMD Opteron (Goto)
  7. Cray X1 (libsci)
  8. IBM PowerPC G5 (2.7 GHz) (VecLib)
  9. Compaq Alpha EV6 (CXML)
  10. IBM SP Power3 (ESSL)
  11. SGI Octane (ATLAS)

Architecture (BLAS-MPI)           # procs      n      DP Solve / SP Solve   DP Solve / Iter Ref   # iter
AMD Opteron (Goto, OpenMPI MX)       32      22627          1.85                  1.79               6
AMD Opteron (Goto, OpenMPI MX)       64      32000          1.90                  1.83               6

  • Single precision is faster than DP because:
    • Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
    • Reduced data motion: 32-bit data instead of 64-bit data
    • Higher locality in cache: more data items in cache

SLIDE 35

IBM Cell 3.2 GHz, Ax = b

[Chart: Gflop/s vs. matrix size (500-4500) — SP peak (204 Gflop/s), SGEMM (embarrassingly parallel), SP Ax=b (0.30 s), DP peak (15 Gflop/s), DP Ax=b (3.9 s)]

SLIDE 36

IBM Cell 3.2 GHz, Ax = b

[Chart: as on slide 35, adding DSGESV (mixed precision, 0.47 s) — an 8.3x speedup over the DP solve]

SLIDE 37

Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0

[Chart: single precision performance, mixed precision performance using iterative refinement, and the method achieving 64-bit accuracy]

For the SPEs: standard C code and C language SIMD extensions (intrinsics)

SLIDE 38

Sparse Linear Algebra

  • Computational speed doesn't matter
    Peak 204 Gflop/s
  • Memory bus matters
    25 GB/s = 12 Gflop/s, assuming the matrix is read from memory
  • In practice ~6 Gflop/s in SP using 8 SPEs

[Chart: aggregate memory bandwidth (GB/s) vs. number of Synergistic Processing Elements (1-8) for 64-, 128-, 256-, and 512-byte transfers]
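One way to arrive at the 12 Gflop/s figure, counting only the 4-byte single-precision matrix values and the 2 flops performed per nonzero (index and vector traffic, which roughly halve this in practice, are ignored):

    25 GB/s / 4 B per value x 2 flops ≈ 12 Gflop/s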

SLIDE 39

What About That PS3?

[Diagram: STI Cell processor — PowerPC core at 3.2 GHz plus 8 SPEs at 25.6 Gflop/s each]

  • 25 GB/s injection bandwidth
  • 200 GB/s between SPEs
  • 32-bit peak performance: 8 x 25.6 Gflop/s = 204.8 Gflop/s peak
  • 64-bit peak performance: 8 x 1.8 Gflop/s ≈ 14.6 Gflop/s peak
  • 512 MiB memory

SLIDE 40

PS3 Hardware Overview

[Diagram: the PS3's Cell — PowerPC core at 3.2 GHz plus 6 usable SPEs at 25.6 Gflop/s each; one SPE disabled/broken for yield issues and one reserved by the GameOS hypervisor]

  • 25 GB/s injection bandwidth
  • 200 GB/s between SPEs
  • 32-bit peak performance: 6 x 25.6 Gflop/s = 153.6 Gflop/s peak
  • 64-bit peak performance: 6 x 1.8 Gflop/s = 10.8 Gflop/s peak
  • 1 Gb/s NIC
  • 256 MiB memory

SLIDE 41

HPC in the Living Room

SLIDE 42

Matrix Multiply on a 4-Node PlayStation3 Cluster

What's good:
  • Very cheap: ~$4 per Gflop/s (with 32-bit floating point theoretical peak)
  • Fast local computations between SPEs
  • Perfect overlap between communications and computations is possible (Open MPI running): the PPE does communication via MPI, the SPEs do computation via SGEMMs

What's bad:
  • Gigabit network card: 1 Gb/s is too little for such computational power (150 Gflop/s per node)
  • Linux can only run on top of GameOS (hypervisor)
  • Extremely high network access latencies (120 μsec)
  • Low bandwidth (600 Mb/s)
  • Only 256 MB local memory
  • Only 6 SPEs

[Execution trace — gold: computation, 8 ms; blue: communication, 20 ms]

SLIDE 43

SUMMA on a 2x2 PlayStation3 Cluster

[Chart: SUMMA, model vs. measurements with 1 SPE — Gflop/s (35-100) vs. problem size (2000-8000)]

SLIDE 44

SUMMA on a 2x2 PlayStation3 Cluster

[Chart: SUMMA, model vs. measurements with 1 SPE — Gflop/s (35-100) vs. problem size (2000-8000)]

SLIDE 45

Users Guide for SC on PS3

  • SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3
  • See webpage for details

SLIDE 46

How to Deal with Complexity?

  • Adaptivity is the key for applications to effectively use available resources whose complexity is exponentially increasing
  • Goal: automatically bridge the gap between the application and computers that are rapidly changing and getting more and more complex

SLIDE 47

Self-Adapting Software

  • Variation: many different algorithm implementations are generated automatically and tested for performance
  • Selection: the best performing implementation is sought by optimization

SLIDE 48

Self-Adapting Software

  • Huge search space (algorithms, parameters, ...)
  • Generate + Adapt (once per target) → Use (often)
  • Variation + Selection: automatic performance tuning

SLIDE 49

Self-Adapting Software

Automatically generated hardware-adapted libraries
  • Large sections of straight-line code produced

Examples:
  • Numerical linear algebra: ATLAS, OSKI
  • Discrete Fourier transforms: FFTW
  • Digital signal processing: SPIRAL
  • MPI collectives (UCB, UTK): FT-MPI

SLIDE 50

Generic Code Optimization

  • Can ATLAS-like techniques be applied to arbitrary code?
  • What do we mean by ATLAS-like techniques?
    Blocking, loop unrolling, data prefetch, functional unit scheduling, etc.
  • Referred to as empirical optimization
    Generate many variations and pick the best implementation by measuring the performance
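A toy version of that generate-and-measure loop, with a plain blocked matrix multiply standing in for an ATLAS-generated kernel and the candidate block sizes chosen arbitrarily:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Parameterized kernel: a simple blocked matrix multiply, C += A*B (column-major). */
    static void mm_blocked(int n, int nb, const double *A, const double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += nb)
            for (int kk = 0; kk < n; kk += nb)
                for (int i = 0; i < n; i++)
                    for (int j = jj; j < jj + nb && j < n; j++) {
                        double s = C[i + (size_t)j * n];
                        for (int k = kk; k < kk + nb && k < n; k++)
                            s += A[i + (size_t)k * n] * B[k + (size_t)j * n];
                        C[i + (size_t)j * n] = s;
                    }
    }

    int main(void)
    {
        int n = 512, candidates[] = {16, 32, 64, 128}, best_nb = 0;
        double best_time = 1e30;
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);

        for (int c = 0; c < 4; c++) {               /* generate many variations ... */
            clock_t t0 = clock();
            mm_blocked(n, candidates[c], A, B, C);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("nb = %3d : %.3f s\n", candidates[c], t);
            if (t < best_time) {                    /* ... pick the best by measuring */
                best_time = t;
                best_nb = candidates[c];
            }
        }
        printf("selected block size: %d\n", best_nb);
        free(A); free(B); free(C);
        return 0;
    }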

SLIDE 51

Applying Self-Adapting Software

  • Numerical and non-numerical applications
    BLAS-like ops / message passing collectives
  • Static or dynamic determination of the code to be used
    Performed at make time / every time invoked
  • Independent of or dependent on the data presented
    Same on each data set / depends on properties of the data

SLIDE 52

Multi, Many, …, Many-More

  • Parallelism for the masses
  • Multi, Many, Many-MoreCore are here and coming fast
  • Our approach for numerical libraries:
    Use dynamic DAG-based scheduling
    Minimize sync: non-blocking communication
    Maximize locality: block data layout
  • Autotuners should take on a larger, or at least complementary, role to compilers in translating parallel programs.
  • What's needed is a long-term, balanced investment in hardware, software, algorithms, and applications in the HPC ecosystem.

SLIDE 53

Collaborators / Support

Alfredo Buttari, UTK
Julien Langou, UColorado
Julie Langou, UTK
Piotr Luszczek, MathWorks
Jakub Kurzak, UTK
Stan Tomov, UTK