
High-Performance Computing Today

Jack Dongarra, Innovative Computing Laboratory, University of Tennessee and Oak Ridge National Laboratory

http://www.cs.utk.edu/~dongarra/


Outline

• Look at trends in HPC
  » Top500 statistics
• Performance of super-scalar processors
  » ATLAS
• Performance monitoring
  » PAPI
• NetSolve
  » An example of grid middleware

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." -- Grace Hopper


Technology Trends: Microprocessor Capacity

2X transistors/chip every 1.5 years, called "Moore's Law"

Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
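As a quick back-of-the-envelope check of what that doubling rate implies (my own arithmetic, not part of the original slide):

\[
\text{transistors}(t) \approx \text{transistors}(0)\cdot 2^{\,t/1.5\,\text{yr}},
\qquad
2^{10/1.5} \approx 100\times \text{ over a decade.}
\]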


High Performance Computers & Numerical Libraries

• 20 years ago
  1x10^6 Floating Point Ops/sec (Mflop/s)
  » Scalar based
  » Loop unrolling (see the sketch after this list)
• 10 years ago
  1x10^9 Floating Point Ops/sec (Gflop/s)
  » Vector & shared memory computing, bandwidth aware
  » Block partitioned, latency tolerant
• Today
  1x10^12 Floating Point Ops/sec (Tflop/s)
  » Highly parallel, distributed processing, message passing, network based
  » Data decomposition, communication/computation
• 10 years away
  1x10^15 Floating Point Ops/sec (Pflop/s)
  » Many more levels of memory hierarchy, combination of grids & HPC
  » More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
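To make the "loop unrolling" point concrete, here is a minimal C sketch (my own illustration, not taken from the slides) of a dot product unrolled by four; the unrolled form reduces loop overhead and exposes independent operations to a superscalar processor:

#include <stddef.h>

/* Straightforward dot product: one multiply-add and one branch per iteration. */
double dot_simple(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Unrolled by 4: fewer branches and four independent partial sums in flight. */
double dot_unrolled(const double *x, const double *y, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)   /* clean up any remainder */
        s0 += x[i] * y[i];
    return s0 + s1 + s2 + s3;
}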


TOP500

• Listing of the 500 most powerful computers in the world
• Yardstick: Rmax from the LINPACK MPP benchmark
  » Ax = b, dense problem (see the operation-count note below)
• Updated twice a year
  » SC'xy in the States in November
  » Meeting in Mannheim, Germany in June
• All data available from www.top500.org
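A note on the yardstick (the standard LINPACK/TOP500 operation-count convention, stated here for clarity): the benchmark solves a dense n x n system Ax = b by LU factorization and charges

\[
\text{ops}(n) = \tfrac{2}{3}n^{3} + 2n^{2},
\qquad
R_{\max} = \frac{\text{ops}(n)}{t_{\text{solve}}},
\]

so Rmax is the sustained rate achieved on the problem size that gives the best performance.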

Fastest Computer Over Time

[Chart: peak LINPACK (TPP) performance of the fastest computer in the world, 1990-2000, in GFlop/s. Successive builds of the chart add the Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), Intel ASCI Red Xeon (9632), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), and ASCI White Pacific (7424), with the vertical scale growing from 50 GFlop/s to 5000 GFlop/s.]

Successive builds of the slide note that a computation that took 1 full year to complete in 1980 can now be done in 1 month, then in 4 days, and finally, on today's fastest computer, in 1 hour!


Top 10 Machines (June 2000)

Rank  Company    Machine                              Procs  Gflop/s  Site                                                        Country  Year
  1   Intel      ASCI Red                              9632     2380  Sandia National Labs, Albuquerque                           USA      1999
  2   IBM        ASCI Blue-Pacific SST, IBM SP 604e    5808     2144  Lawrence Livermore National Laboratory, Livermore           USA      1999
  3   SGI        ASCI Blue Mountain                    6144     1608  Los Alamos National Laboratory, Los Alamos                  USA      1998
  4   Hitachi    SR8000-F1/112                          112     1035  Leibniz Rechenzentrum, Muenchen                             Germany  2000
  5   Hitachi    SR8000-F1/100                          100      917  High Energy Accelerator Research Organization/KEK, Tsukuba  Japan    2000
  6   Cray Inc.  T3E1200                               1084      892  Government                                                  USA      1998
  7   Cray Inc.  T3E1200                               1084      892  US Army HPC Research Center at NCS, Minneapolis             USA      2000
  8   Hitachi    SR8000/128                             128      874  University of Tokyo, Tokyo                                  Japan    1999
  9   Cray Inc.  T3E900                                1324      815  Government                                                  USA      1997
 10   IBM        SP Power3 375 MHz                     1336      723  Naval Oceanographic Office (NAVOCEANO), Poughkeepsie        USA      2000

Performance Development

[Chart: Top500 performance development, Jun-93 through Jun-00, on a log scale from 100 Mflop/s to 100 Tflop/s. Three curves are plotted: N=1 (the fastest system), N=500 (the last system on the list), and SUM (the aggregate of all 500). By June 2000 the sum had reached 64.3 TF/s, N=1 was 2.38 TF/s, and N=500 was 39.9 GF/s. Systems annotated along the curves include the Intel XP/S140 (Sandia), Fujitsu 'NWT' (NAL), SNI VP200EX (Uni Dresden), Cray Y-MP M94/4 (KFA Juelich), Cray Y-MP C94/364 'EPA' (USA), Hitachi/Tsukuba CP-PACS/2048, SGI POWER CHALLENGE (Goodyear), Intel ASCI Red (Sandia), Sun Ultra HPC 1000 (News International), Sun HPC 10000 (Merrill Lynch), and an IBM 604e 69-processor system (Nabisco).]

Speaker notes: the list spanned roughly 60 Gflop/s down to 400 Mflop/s in 1993 versus 2.4 Tflop/s down to 40 Gflop/s today; Schwab appears at #19; about half the list turns over each year; 133 systems exceed 100 Gflop/s; the growth is faster than Moore's law.


Performance Development

[Chart: extrapolation of Top500 performance development from Jun-93 out to Nov-09, on a log scale from 0.1 to 1,000,000 GFlop/s. The N=1, N=500, and Sum curves are extended, with reference lines at 1 TFlop/s and 1 PFlop/s and markers for the ASCI machines, the Earth Simulator, and "My Laptop". The trend lines suggest a 1 TFlop/s entry level around 2005 and a 1 PFlop/s top system later in the decade.]

Architectures

[Chart: architecture mix of the Top500 over time, Jun-93 to Jun-00, counting systems by class: single processor, SMP, MPP, SIMD, constellation, and cluster. In June 2000 the list held 275 MPPs, 120 SMPs, 91 constellations, and 14 clusters.]


Chip Technology

[Chart: processor families used in Top500 systems, Jun-93 to Jun-00: Alpha, IBM, HP, Intel, MIPS, Sun, the Inmos Transputer, and other.]

High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
Enabled by commodity PC hardware, networks, and operating systems, these clusters achieve the capabilities of scientific workstations at a fraction of the cost, together with the availability of industry-standard message-passing libraries.

• COTS PC nodes: Pentium, Alpha, PowerPC, SMP
• COTS LAN/SAN interconnect: Ethernet, Myrinet, Giganet, ATM
• Open-source Unix: Linux, BSD
• Message-passing computing: MPI, PVM, HPF (a minimal MPI sketch follows)

Advantages:
• Best price/performance
• Low entry-level cost
• Just-in-place configuration
• Vendor invulnerable
• Scalable
• Rapid technology tracking
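As a concrete illustration of the message-passing model these clusters depend on (a minimal sketch of my own, not taken from the slides), here is an MPI program in C in which every worker process sends one value back to rank 0:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

    if (rank == 0) {
        double partial;
        /* Rank 0 collects one number from each of the other ranks. */
        for (int src = 1; src < size; src++) {
            MPI_Recv(&partial, 1, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %g from rank %d\n", partial, src);
        }
    } else {
        double partial = 1.0 / rank;        /* stand-in for real work */
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}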


Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?

[Chart: processor vs. DRAM performance, 1980-2000, log scale. Processor performance ("Moore's Law") improves about 60%/year (2X every 1.5 years), while DRAM latency improves only about 9%/year (2X every 10 years). The resulting processor-DRAM memory gap grows roughly 50% per year.]
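The "grows 50% / year" figure follows directly from the two growth rates (a one-line check of my own):

\[
\frac{1.60}{1.09} \approx 1.47,
\]

i.e. the processor outpaces DRAM by roughly 50% every year, so each new machine generation spends a larger fraction of its time waiting on memory.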


Optimizing Computation and Memory Use

• Computational optimizations
  Theoretical peak: (# fpus) * (flops/cycle) * MHz
  » PIII:   (1 fpu) * (1 flop/cycle)  * (650 MHz) =  650 MFLOP/s
  » Athlon: (2 fpu) * (1 flop/cycle)  * (600 MHz) = 1200 MFLOP/s
  » Power3: (2 fpu) * (2 flops/cycle) * (375 MHz) = 1500 MFLOP/s
• Memory optimization
  Theoretical peak: (bus width) * (bus speed)
  » PIII:   (32 bits)  * (133 MHz) =  532 MB/s = 66.5 MW/s
  » Athlon: (64 bits)  * (200 MHz) = 1600 MB/s =  200 MW/s
  » Power3: (128 bits) * (100 MHz) = 1600 MB/s =  200 MW/s
• Memory is about an order of magnitude slower than the floating-point units (the ratio is worked out below)
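To see where the "order of magnitude" claim comes from, divide the peak computation rate by the peak memory rate; for the Power3 numbers above,

\[
\frac{1500\ \text{MFLOP/s}}{200\ \text{MW/s}} = 7.5\ \text{flops per word},
\]

so a kernel that needs a fresh operand from memory for every flop runs at a small fraction of peak unless operands are reused from registers and cache.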


Memory Hierarchy

• By taking advantage of the principle of locality:
  » Present the user with as much memory as is available in the cheapest technology.
  » Provide access at the speed offered by the fastest technology.

[Figure: the memory hierarchy, fastest/smallest to slowest/largest, with the processor's control and datapath at the top:
  Processor registers                           (~1s of ns, ~100s of bytes)
  On-chip cache                                 (~10s of ns, ~KBs)
  Level 2 and 3 cache (SRAM)                    (~100s of ns, ~MBs)
  Main memory (DRAM)                            (~100s of ns)
  Distributed memory / remote cluster memory    (~100,000s of ns, i.e. ~0.1 ms)
  Secondary storage (disk)                      (~10,000,000s of ns, i.e. ~10s of ms, ~GBs)
  Tertiary storage (disk/tape)                  (~10,000,000,000s of ns, i.e. ~10s of sec, ~TBs)]
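Exploiting this hierarchy is exactly what the block-partitioned algorithms mentioned earlier do. As an illustration (a generic cache-blocking sketch of my own, not ATLAS's generated code), here is matrix multiply restructured so that NB x NB tiles are reused while they sit in cache:

#define NB 64   /* block size; a tuned library would choose this per machine */

/* C = C + A*B for n x n matrices stored row-major, blocked for cache reuse. */
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                /* Multiply one NB x NB tile of A by one tile of B;
                   all three tiles fit in cache and are reused NB times. */
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}

Each element of a tile is used NB times once it has been brought into cache, instead of once per memory load in the naive triple loop.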


How To Get Performance From Commodity Processors?

• Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
• Hardware and software have a large design space with many parameters:
  » Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
  » Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
• Until recently, no tuned BLAS for the Pentium under Linux.
• Need for quick/dynamic deployment of optimized routines.
• ATLAS - Automatically Tuned Linear Algebra Software
  » Related efforts: PhiPAC from Berkeley, FFTW from MIT (http://www.fftw.org)


ATLAS

• An adaptive software architecture
  » High performance
  » Portability
  » Elegance
• ATLAS is faster than all other portable BLAS implementations and is comparable with the machine-specific libraries provided by the vendor.


ATLAS Across Various Architectures (DGEMM n=500)

[Bar chart: DGEMM (n=500) performance in MFLOPS, 0-900, on the AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip28-200, SGI R12000ip30-270, and Sun UltraSparc2-200, comparing the vendor BLAS, the ATLAS BLAS, and the reference F77 BLAS. ATLAS is faster than all other portable BLAS implementations and is comparable with the machine-specific libraries provided by the vendors.]


Code Generation Strategy

• Code is iteratively generated & timed until the optimal case is found. We try:
  » Differing blocking sizes (NBs)
  » Breaking false dependencies
  » M, N, and K loop unrolling
• Designed for RISC / superscalar architectures; needs a reasonable C compiler.
• The on-chip multiply optimizes for:
  » TLB access
  » L1 cache reuse
  » FP unit usage
  » Memory fetch
  » Register reuse
  » Loop overhead minimization
• Takes 30 minutes to an hour to run.
• A new model of high-performance programming where critical code is machine generated using parameter optimization (a sketch of this search loop follows).
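A highly simplified sketch of the empirical-search idea, reusing the blocked multiply shown earlier (my own illustration; the real ATLAS generator emits and compiles specialized C source for each variant and explores unrolling factors, instruction scheduling, and more, not just the block size):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 256   /* fixed problem size used to time each candidate kernel */

/* One candidate kernel: blocked matrix multiply with block size nb. */
static void mm_blocked(int n, int nb, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += nb)
        for (int kk = 0; kk < n; kk += nb)
            for (int jj = 0; jj < n; jj += nb)
                for (int i = ii; i < ii + nb && i < n; i++)
                    for (int k = kk; k < kk + nb && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + nb && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}

int main(void) {
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    int nbs[] = {8, 16, 24, 32, 48, 64, 96, 128};
    int best_nb = nbs[0];
    double best_mflops = 0.0;

    /* Empirical search: time every candidate block size and keep the fastest. */
    for (int t = 0; t < 8; t++) {
        clock_t t0 = clock();
        mm_blocked(N, nbs[t], A, B, C);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (secs <= 0.0) secs = 1e-9;          /* guard against a coarse clock */
        double mflops = 2.0 * N * N * N / (secs * 1e6);
        printf("NB = %3d : %8.1f MFLOP/s\n", nbs[t], mflops);
        if (mflops > best_mflops) { best_mflops = mflops; best_nb = nbs[t]; }
    }
    printf("best block size: NB = %d\n", best_nb);
    free(A); free(B); free(C);
    return 0;
}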


Plans for ATLAS

• Software release, available today:
  » Level 1, 2, and 3 BLAS implementations
  » See: www.netlib.org/atlas/
• Next version:
  » Multi-threading
  » Java generator
• Futures:
  » Optimize message-passing system
  » Runtime adaptation
    - Sparsity analysis
    - Iterative code improvement
  » Specialization for user applications
  » Adaptive libraries


Tools for Performance Evaluation

• Timing and performance evaluation has been an art
  » Resolution of the clock
  » Issues about cache effects
  » Different systems
• Situation about to change
  » Today's processors have internal counters


Performance Counters

• Almost all high-performance processors include hardware performance counters.
• Some are easy to access; others are not available to users.
• On most platforms the APIs, if they exist, are not appropriate for a common user, functional, or well documented.
• Existing performance counter APIs:
  » Cray T3E
  » SGI MIPS R10000
  » IBM Power series
  » DEC Alpha pfm pseudo-device interface
  » Windows 95, NT and Linux


Performance Data That May Be Available

• Cycle count
• Floating point instruction count
• Integer instruction count
• Instruction count
• Load/store count
• Branch taken / not taken count
• Branch mispredictions
• Pipeline stalls due to memory subsystem
• Pipeline stalls due to resource conflicts
• I/D cache misses for different levels
• Cache invalidations
• TLB misses
• TLB invalidations
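Here is a minimal sketch of how a few of these counters can be read through PAPI's low-level C API (my own illustration of the API style; error handling is reduced to a bare minimum):

#include <stdio.h>
#include <papi.h>

int main(void) {
    int event_set = PAPI_NULL;
    long long counts[2];
    double a = 0.0;

    /* Initialize the library and build an event set with two preset events. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_FP_OPS);   /* floating point operations */
    PAPI_add_event(event_set, PAPI_L1_DCM);   /* level-1 data cache misses */

    PAPI_start(event_set);
    for (int i = 0; i < 1000000; i++)         /* the code being measured */
        a += i * 0.5;
    PAPI_stop(event_set, counts);

    printf("FP ops: %lld  L1 D-cache misses: %lld  (a = %g)\n",
           counts[0], counts[1], a);
    return 0;
}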


PAPI's Graphical Tools: Perfometer Usage

• Application is instrumented with PAPI:
  call perfometer()
• Will be layered over the best existing vendor-specific APIs for these platforms
• Sections of code that are of interest are designated with specific colors
  » Using a call to set_perfometer('color')
• When the application is started, at the call to perfometer a task is spawned to collect and send the information to a Java applet containing the graphical view.


Perfometer

[Screenshot: the Perfometer applet for an instrumented run, showing machine info, flops issued, process & real time, the flop/s rate, and the instantaneous flop/s rate; the displayed trace is colored by the call Perfometer('red').]


Go To Demo


Trends in Computational Science and Engineering

• Multi-scale, multi-physics, multi-dimensional simulations of realistic complexity
• Growing use of dynamic adaptive algorithms
• Strong interplay between observation and simulation (e.g., cosmology, weather)
• Impact of the WWW
  » Accelerated pace of research due to electronic publishing
  » Proliferation of digital archives
  » Emergence of workbenches and portals


Grid Computing

• To treat CPU cycles and software like commodities, an application should be:
  » Ubiquitous -- able to interface to the system at any point and leverage whatever is available
  » Resource aware -- capable of managing heterogeneity
  » Adaptive -- able to tailor its behavior dynamically so that it gets maximum performance benefit from the services and resources at hand


The Grid Architecture Picture

[Diagram: a layered grid architecture.
  Application layer: science portals and problem solving environments.
  Service layers: user portals, authentication, co-scheduling, naming & files, events, grid access & info, resource discovery & allocation, fault tolerance.
  Resource layer: high-speed networks and routers, computers, databases, online instruments.]


Motivation for NetSolve

• Client-server design
• Non-hierarchical system
• Load balancing and fault tolerance
• Heterogeneous environments supported
• Multiple and simple client interfaces
• Built on standard components

Basics: design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX and Windows platforms.

NetSolve - The Big Picture

[Diagram: the NetSolve architecture. A client (RPC-like calls from Matlab, Mathematica, C, Fortran, Java, Excel, or a Java GUI; no knowledge of the grid required) sends a request to an agent, which maintains a scheduler and a database of available servers. The agent replies with a choice of computational resources -- hardware (clusters, MPPs, workstations) and software (routines, libraries, applications, plus Globus, Condor, MPI, PVM) -- where the request is then serviced.]


NetSolve

• Three deployment scenarios:
  » Client, servers, and agents anywhere on the Internet (3(10) - 150(80 ws/mpp) - MCell)
  » Client, servers, and agents on an intranet
  » Client, server, and agent on the same machine
• "Blue collar" grid-based computing
  » Users can set things up themselves; no "su" required
  » Does not require deep knowledge of network programming
• Smart libraries
  » "Rent" access to routines
  » Decouple interface

NetSolve Usage Scenarios

• Grid-based library routines
  » Users don't have to have the library routines on their machine
• Task-farming applications
  » "Pleasantly parallel" execution, e.g. parameter studies
• Remote application execution
  » Complete packages, with the user specifying input parameters


NetSolve - MATLAB Interface

>> define sparse matrix A
>> define rhs
>> [x, its] = netsolve('itmeth', 'petsc', A, rhs, 1.e-6, 50);
...
>> [x, its] = petsc(A, rhs);    % for PETSc
>> [x, its] = aztec(A, rhs);    % for AZTEC
>> [x]      = superlu(A, rhs);  % for SuperLU
>> [x]      = ma28(A, rhs);     % for MA28

Synchronous call shown; asynchronous calls are also available.

NetSolve - FORTRAN Interface

      parameter (MAX = 100)
      double precision A(MAX,MAX), B(MAX)
      integer IPIV(MAX), N, INFO, LWORK
      integer NSINFO

c     local LAPACK call
      call DGESV(N, 1, A, MAX, IPIV, B, MAX, INFO)

Easy to 'switch' to NetSolve:

c     same arguments, routed through NetSolve, plus a status code NSINFO
      call NETSL('DGESV()', NSINFO, N, 1, A, MAX, IPIV, B, MAX, INFO)


Hiding the Parallel Processing

• The user may be unaware of the parallel processing
• NetSolve takes care of starting the message-passing system, distributing the data, and returning the results.

MCell: 3-D Monte-Carlo Simulation of Neurotransmitter Release Between Cells

• Developed at: Salk Institute (T. Bartol), Cornell U. (J. Stiles)
• Studies how neurotransmitters diffuse and activate receptors in synapses
• In the visualization: blue = unbound, red = singly bound, green = doubly bound closed, yellow = doubly bound open


IPARS - Integrated Parallel Accurate Reservoir Simulator

• Mary Wheeler's group, UT-Austin
• Reservoir and environmental simulation
  » Models black oil, waterflood, compositions; 3D transient flow of multiple phases
• Integrates existing simulators
• Framework simplifies development
  » Provides solvers, handling for wells, table lookup
  » Provides pre/postprocessor, visualization
• Full IPARS access without installation
• IPARS interfaces now available: C, FORTRAN, Matlab, Mathematica, and Web

[Diagram: a web interface talks to a web server acting as a NetSolve client, which dispatches work to IPARS-enabled servers.]


NetSolve Applications and Interactions

• Tool integration
  » Globus - middleware infrastructure (ANL/SSI)
  » Condor - workstation farm (U Wisconsin)
  » NWS - Network Weather Service (U Tennessee)
  » SCIRun - computational steering (U Utah)
  » Ninf - NetSolve-like system (ETL, Tsukuba)
• Library usage
  » LAPACK/ScaLAPACK - parallel dense linear solvers
  » SuperLU/MA28 - parallel sparse direct linear solvers (UCB/RAL)
  » PETSc/Aztec - parallel iterative solvers (ANL/SNL)
  » Other areas as well (not just linear algebra)
• Applications
  » MCell - microcellular physiology (UCSD/Salk)
  » IPARS - reservoir simulator (UTexas, Austin)
  » Virtual Human - pulmonary system model (ORNL)
  » RSICC - radiation safety software/simulation (ORNL)
  » LUCAS - land usage modeling (U Tennessee)
  » ImageVision - computer graphics and vision (Graz U)


Conclusion

• Exciting time to be in scientific computing
• Network computing is here
• The Grid offers tremendous opportunities for collaboration
• Important to develop algorithms and software that will work effectively in this environment


Contributors to These Ideas

• Top500
  » Erich Strohmaier, UTK
  » Hans Meuer, Mannheim U
• ATLAS
  » Antoine Petitet, UTK
  » Clint Whaley, UTK
• PAPI
  » Shirley Browne, UTK
  » Nathan Garner, UTK
  » Kevin London, UTK
  » Phil Mucci, UTK
• NetSolve
  » Dorian Arnold, UTK
  » Susan Blackford, UTK
  » Henri Casanova, UCSD
  » Michelle Miller, UTK
  » Sathish Vadhiyar, UTK

For additional information see:
  • www.netlib.org/top500/
  • www.netlib.org/atlas/
  • icl.cs.utk.edu/projects/papi/
  • www.netlib.org/netsolve/
  • www.cs.utk.edu/~dongarra/

Many opportunities within the group.
