Getting the Performance Out Of High Performance Computing

Jack Dongarra, Innovative Computing Lab, University of Tennessee, and Computer Science and Math Division, Oak Ridge National Lab

http://www.cs.utk.edu/~dongarra/


Getting the Performance into High Performance Computing


Technology Trends: Microprocessor Capacity

2X transistors/Chip Every 1.5 years

Called “Moore’s Law.” Microprocessors have become smaller, denser, and more powerful. Not just processors: storage, internet bandwidth, etc.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

[Chart: peak performance of the fastest computers, 1950–2010, on a log scale from 1 KFlop/s to 1 PFlop/s, tracking “Moore’s Law”: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth Simulator. Eras: scalar, super scalar, vector, parallel, super scalar/vector/parallel.]


Next Generation: IBM Blue Gene/L and ASCI Purple

♦ Blue Gene/L announced 11/19/02
  - One of two machines for LLNL: 360 TFlop/s, 130,000 processors, Linux, FY 2005
♦ Plus ASCI Purple: IBM Power 5 based, 12K processors, 100 TFlop/s


To Be Provocative… Citation in the Press, March 10th, 2008

DOE Supercomputers Sit Idle

WASHINGTON, Mar. 10, 2008. GAO reports that after almost 5 years of effort and several hundreds of M$’s spent at the DOE labs, the high performance computers recently purchased did not meet users’ expectations and are sitting idle… Alan Laub, head of the DOE efforts, reports that the computer equ…

How could this happen?

- The complexity of programming these machines was underestimated.
- Users were unprepared for the lack of reliability of the hardware and software.
- Little effort was spent on medium- and long-term research to solve problems that were foreseen 5 years ago in the areas of applications, algorithms, middleware, programming models, and computer architectures, …


Software Technology & Performance

♦ Tendency to focus on the hardware
♦ Software is required to bridge an ever-widening gap
♦ The gap between potential and delivered performance is very steep
  - Performance only if the data and controls are set up just right
  - Otherwise, dramatic performance degradations; a very unstable situation
  - Will become more unstable as systems change and become more complex
♦ The challenge for applications, libraries, and tools is formidable at the Tflop/s level, even greater with Pflop/s; some might say insurmountable.


Linpack (100x100) Analysis: The Machine on My Desk 12 Years Ago and Today

♦ Compaq 386/SX20 with FPA: 0.16 Mflop/s
♦ Pentium IV, 2.8 GHz: 1317 Mflop/s
♦ Over 12 years, a factor of ~8231
  - Doubling in less than 12 months, for 12 years
♦ Moore’s Law gives us a factor of 256.
♦ How do we get a factor > 8000? (the arithmetic is worked out below)
  - Clock speed increase = 128x
  - External bus width & caching (16 vs. 64 bits) = 4x
  - Floating point (4/8-bit multiply vs. 64 bits in 1 clock) = 8x
  - Compiler technology = 2x
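As a check on that breakdown (a worked sketch; the per-component multipliers are the estimates in the list above):

```latex
\frac{1317\ \mathrm{Mflop/s}}{0.16\ \mathrm{Mflop/s}} \approx 8231,
\qquad
\underbrace{128}_{\text{clock}} \times \underbrace{4}_{\text{bus/cache}}
\times \underbrace{8}_{\text{floating point}} \times \underbrace{2}_{\text{compiler}}
= 8192 \approx 8231.
```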

♦ However, the potential for that Pentium 4 is 5.6 Gflop/s, and here we are getting 1.32 Gflop/s
  - Still a factor of 4.25 off of peak
♦ There is a complex set of interactions between the application, algorithms, programming language, compiler, machine instructions, and hardware
  - Many layers of translation from the application to the hardware
  - Changing with each generation


Where Does Much of Our Lost Performance Go? or, Why Should I Care About the Memory Hierarchy?

[Chart: Processor-DRAM memory gap (latency), 1980–2004. Processor performance (“Moore’s Law”) improves ~60%/yr (2X/1.5 yr); DRAM improves ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50%/year.]


Optimizing Computation and Memory Use

♦ Computational optimizations
  - Theoretical peak: (# fpus) × (flops/cycle) × (clock rate)
  - Pentium 4: (1 fpu) × (2 flops/cycle) × (2.8 GHz) = 5600 Mflop/s
♦ Operations like y = αx + y: 3 operands (24 bytes) needed for 2 flops, so 5600 Mflop/s requires 8400 MWord/s of bandwidth from memory
♦ Memory optimization
  - Theoretical peak: (bus width) × (bus speed)
  - Pentium 4: (32 bits) × (533 MHz) = 2132 MB/s = 266 MWord/s
♦ Off by a factor of 30 from what’s required to drive the processor from memory at peak performance
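The factor of 30 follows directly from the numbers above (a worked check, taking 1 word = 8 bytes):

```latex
\underbrace{5600\ \mathrm{Mflop/s} \times \tfrac{3\ \text{words}}{2\ \text{flops}}}_{\text{bandwidth AXPY needs at peak}} = 8400\ \mathrm{MWord/s},
\qquad
\frac{8400\ \mathrm{MWord/s}}{266\ \mathrm{MWord/s}} \approx 31.6 \approx 30.
```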


Memory Hierarchy

♦ By taking advantage of the principle of locality:

  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.

[Diagram: the memory hierarchy, from fastest/smallest to slowest/largest, with the processor’s control and datapath at the top:

  Level                                Speed (ns)                    Size (bytes)
  Processor registers                  ~1s                           -
  On-chip cache                        ~10s                          Ks
  Level 2 and 3 cache (SRAM)           ~100s                         Ms
  Main memory (DRAM)                   ~100s                         Gs
  Distributed / remote cluster memory  ~100,000s (0.1s ms)           -
  Secondary storage (disk)             ~10,000,000s (10s ms)         Ts
  Tertiary storage (disk/tape)         ~10,000,000,000s (10s sec)    -  ]
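A minimal sketch of how code exploits that principle: blocking a matrix multiply so each tile is loaded from DRAM once and reused many times from cache. The matrix size and block size here are illustrative assumptions, not figures from the slide.

```c
#include <stddef.h>

#define N 1024   /* matrix dimension (assumed, divisible by B) */
#define B 64     /* block size; chosen so three B x B tiles fit in cache */

/* Blocked C += A*B: each tile of a and b is fetched from main memory
   once per block pass and then reused B times from cache, instead of
   being re-fetched from DRAM on every use as in the naive triple loop. */
void matmul_blocked(const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t jj = 0; jj < N; jj += B)
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t k = kk; k < kk + B; k++) {
                        double aik = a[i * N + k];
                        for (size_t j = jj; j < jj + B; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}
```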


Tools To Help Understand What’s Going On In the Processor

♦ Complex system with many filters
♦ Need to identify bottlenecks
♦ Prioritize optimization
♦ Focus on important aspects


Tools for Performance Analysis, Modeling and Optimization

♦ PAPI
♦ Dyninst
♦ SVPablo

♦ ROSE: compiler framework; recognition of high-level abstractions, specification of transformations
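As an example of what these tools expose, a minimal PAPI sketch (using the classic high-level counter calls available in the PAPI releases of this era; the kernel is a placeholder):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[2] = { PAPI_FP_OPS, PAPI_L1_DCM };  /* flops, L1 data misses */
    long long counts[2];
    double x = 0.0;

    if (PAPI_start_counters(events, 2) != PAPI_OK)
        return 1;

    for (int i = 1; i <= 1000000; i++)   /* stand-in for the kernel under study */
        x += 1.0 / i;

    if (PAPI_stop_counters(counts, 2) != PAPI_OK)
        return 1;

    printf("FP ops: %lld, L1 data misses: %lld (x = %f)\n",
           counts[0], counts[1], x);
    return 0;
}
```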


Example: PERC on the Climate Model

♦ Interaction with the SciDAC climate development effort
♦ Profiling: identifying performance bottlenecks and prioritizing enhancements
♦ Evaluation of the code over a 9-month period
♦ Produced a 400% improvement via decreased overhead and increased scalability


Signatures: Key Factors in Applications and Systems that Affect Performance

♦ Application signatures
  - Characterization of the operations that need to be performed by the application
  - Description of the application’s demands on resources
  - Algorithm signatures: op counts, memory reference patterns, data dependencies, I/O characteristics
  - Software signatures: sync points, thread-level parallelism, instruction-level parallelism, ratio of memory references to floating-point ops
  - Used to predict application behavior and performance
♦ Hardware signatures
  - Performance capabilities of the machine
  - Latencies and bandwidths of the memory hierarchy, local to a node and to remote nodes
  - Instruction issue rates
  - Cache size
  - TLB size
♦ Execution signatures combine application and machine signatures to provide accurate performance models.

[Diagram: a parallel or distributed application feeds live performance data to a performance monitor; an observation signature is generated and compared against the model signature derived from the application signature, with feedback on the degree of similarity.]
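One simple way such a combination can work (an illustrative bound, not the project’s actual model): take the flop count F and memory traffic M from the application signature, and the peak rate R and sustainable bandwidth β from the hardware signature; the run time is then limited by whichever resource saturates first:

```latex
T_{\mathrm{exec}} \;\gtrsim\; \max\left( \frac{F}{R},\; \frac{M}{\beta} \right)
```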


Algorithms vs. Applications

[Table: a matrix of algorithms against applications, with X marks showing which algorithm classes each application depends on. Algorithms: sparse linear system solvers, linear least squares, nonlinear algebraic system solvers, sparse eigenvalue problems, FFT, rapid elliptic problem solvers, multigrid schemes, stiff ODE solvers, integral transformations, Monte Carlo schemes. Applications: circuit simulation, electronic device simulation, structural mechanics, inverse problems (adjustment of geodetic networks), computational fluid dynamics, weather simulation, quantum chemistry, lattice gauge (QCD).]

From: Supercomputing Tradeoffs and the Cedar System, E. Davidson, D. Kuck, D. Lawrie, and A. Sameh, in High-Speed Computing, Scientific Applications and Algorithm Design, ed. R. Wilhelmson, U. of Illinois Press, 1986.


Update to Sameh’s Table?

♦ Next step: look at
  - Application signatures
  - Algorithm choices
  - Software profile
  - Architecture (machine)
♦ Data mine to extract information
♦ Need signatures for A3S
♦ Application Performance Matrix: http://www.krellinst.org/matrix/


Performance Tuning

♦ Motivation: the performance of many applications is dominated by a few kernels
♦ Conventional approach: hand-tuning by the user or vendor
  - Very time-consuming and tedious work
  - Even with intimate knowledge of the architecture and compiler, performance is hard to predict
  - Growing list of kernels to tune
  - Must be redone for every architecture and compiler
  - Compiler technology often lags the architecture
  - Not just a compiler problem: the best algorithm may depend on the input, so some tuning must happen at run time, and not all algorithms are semantically or mathematically equivalent


Automatic Performance Tuning to Hide Complexity

♦ Approach: for each kernel
  1. Identify and generate a space of algorithms
  2. Search for the fastest one, by running them (see the sketch after this list)
♦ What is a space of algorithms? Depending on the kernel and input, variants may differ in:
  - instruction mix and order
  - memory access patterns
  - data structures
  - mathematical formulation
♦ When do we search?
  - Once per kernel and architecture
  - At compile time
  - At run time
  - All of the above
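A minimal sketch of the search step, where the “space of algorithms” is reduced to a single tunable block size and the kernel is a stand-in (both are illustrative assumptions, not ATLAS’s actual generator):

```c
#include <stdio.h>
#include <time.h>

#define N 512
static double a[N][N], b[N][N], c[N][N];

/* One point in the algorithm space: the same computation, parameterized
   by block size bs (N is divisible by every candidate below). */
static void kernel(int bs)
{
    for (int ii = 0; ii < N; ii += bs)
        for (int jj = 0; jj < N; jj += bs)
            for (int i = ii; i < ii + bs; i++)
                for (int j = jj; j < jj + bs; j++)
                    c[i][j] += a[i][j] * b[i][j];
}

int main(void)
{
    int candidates[] = { 16, 32, 64, 128, 256 };
    int best = candidates[0];
    double best_time = 1e30;

    /* Empirical search: run every variant, keep the fastest. */
    for (int i = 0; i < 5; i++) {
        clock_t t0 = clock();
        kernel(candidates[i]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_time) { best_time = t; best = candidates[i]; }
    }
    printf("fastest variant: block size %d (%.4f s)\n", best, best_time);
    return 0;
}
```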


Some Automatic Tuning Projects

♦ ATLAS (www.netlib.org/atlas) (Dongarra, Whaley); used in Matlab and many SciDAC and ASCI projects
♦ PHiPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes, Asanovic, Vuduc, Demmel)
♦ Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im)
♦ Self-Adapting Linear Algebra Software (SALAS) (Dongarra, Eijkhout, Gropp, Keyes)
♦ FFTs and signal processing
  - FFTW (www.fftw.org); won the 1999 Wilkinson Prize for Numerical Software
  - SPIRAL (www.ece.cmu.edu/~spiral); extensions to other transforms, DSPs
  - UHFFT; extensions to higher dimensions, parallelism

[Bar chart: Mflop/s (0 to 3500) achieved by vendor BLAS, ATLAS BLAS, and reference F77 BLAS across architectures including AMD Athlon, DEC ev56 and ev6, HP 9000/735/135, IBM PPC 604, Power2, and Power3, Intel PIII 933 MHz, Intel P4 2.53 GHz w/SSE2, SGI R10000 and R12000, and Sun UltraSparc 2.]


Futures for High Performance Scientific Computing

♦ Numerical software will be adaptive, exploratory, and intelligent
♦ Determinism in numerical computing will be gone
  - After all, it’s not reasonable to ask for exactness in numerical computations
  - Reproducibility, but at a cost
♦ The importance of floating point arithmetic will be undiminished
  - 16, 32, 64, 128 bits and beyond
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is key so that applications can effectively use the resources


Citation in the Press, March 10th, 2008

DOE Supercomputers Live Up to Expectations

WASHINGTON, Mar. 10, 2008. GAO reported today that after almost 5 years of effort and several hundreds of M$’s spent at DOE labs, the high performance computers recently purchased have exceeded users’ expectations and are helping to solve some of our most challenging problems. Alan Laub, head of DOE’s HPC efforts, reported this today at the annual meeting of the SciDAC PIs.

How can this happen?

♦ Close interactions of the applications with the CS and Math ISIC groups
♦ Dramatic improvements in the adaptability of software to the execution environment
♦ Improved processor-memory bandwidth
♦ New large-scale system architectures and software
  - Aggressive fault management and reliability
♦ Exploration of some alternative architectures and languages
  - Application teams helping to drive the design of new architectures


With Apologies to Gary Larson…

♦ SciDAC is helping
♦ Teams are developing the scientific computing software and hardware infrastructure needed to use terascale computers and beyond.