

SLIDE 1

Application Accelerators: Deus ex machina?

CCGSC, Flat Rock, North Carolina

Jeffrey S. Vetter

Oak Ridge National Laboratory and Georgia Institute of Technology

SLIDE 2

Highlights

  • Background and motivation
    – Current trends in architectures favor two strategies: homogeneous multicore and application accelerators
    – The correct architecture for an application can provide astounding results
  • Challenges to adopting application accelerators
    – Performance prediction
    – Productive software systems
  • Solutions from Siskiyou
    – Modeling assertions
    – Multi-paradigm procedure call

SLIDE 3

The Drama

  • Years of prosperity
    – Increasing large-scale parallelism
    – Increasing number of transistors
    – Increasing clock speed
    – Stable programming models and languages
  • Notable constraints force a new utility function for architectures
    – Signaling
    – Power
    – Heat / thermal envelope
    – Packaging
    – Memory, I/O, interconnect latency and bandwidth
    – Instruction-level parallelism
    – Market trends favor ‘good enough’ computing (The Economist)

SLIDE 4

Current Approaches to Continue Improving Performance

  • Chip multiprocessors
    – Homogeneous multicore: Intel, AMD, IBM
  • Application accelerators to augment general-purpose multicores

SLIDE 5

Results from Initial Multicores Provide Performance Boost

(Charts: DGEMM and POP results on initial multicore processors.)

SLIDE 6

Quad … Kilo-core chips are on the way!

  • 4-core chips coming
  • 8-core chips likely
  • ??
  • Rapport
    – Rapport currently offers a 256-core chip
    – Planning a 1024-core chip (Kilocore™) in 2007
    – Targeted at mobile and other consumer applications

SLIDE 7

Enter Application Accelerators

  • Optional hardware installed to accelerate applications beyond the performance of the general-purpose processor

Comparison (type, clock frequency, power usage, single/double-precision speed, typical size, cooling):

  – Intel Woodcrest Dual Core: CPU, 3.0 GHz, 80 W, ~48 GFLOPS / ~24 GFLOPS, CPU socket, heatsink + fan
  – NVIDIA Quadro FX 4500 GPU: accelerator card, 470 MHz, 110 W, 180 GFLOPS / NA, PCIe / MXM1 card, heatsink + fan
  – NVIDIA GeForce 6600 GPU: accelerator card, 350 MHz, 30 W, 20 GFLOPS / NA, PCIe / MXM1 card, HS-only or HS+fan
  – IBM Cell Processor: CPU, 3.2 GHz, 100 W, 256 GFLOPS / 25 GFLOPS, CPU socket, heatsink + fan
  – ClearSpeed Avalon: accelerator card, 250 MHz, 20 W, 50 GFLOPS / 50 GFLOPS, PCI-X card, HS-only
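One way to read this comparison is performance per watt. A quick sketch using the single-precision speeds and power draws transcribed from the table (the CPU figure uses the ~48 GFLOPS value):

```python
# Single-precision GFLOPS per watt for the devices in the comparison.
# (gflops, watts) pairs are transcribed from the slide's table.
devices = {
    "Intel Woodcrest (CPU)": (48, 80),
    "NVIDIA Quadro FX 4500": (180, 110),
    "NVIDIA GeForce 6600":   (20, 30),
    "IBM Cell":              (256, 100),
    "ClearSpeed Avalon":     (50, 20),
}
# Rank devices by efficiency, best first.
for name, (gflops, watts) in sorted(devices.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1],
                                    reverse=True):
    print(f"{name}: {gflops / watts:.2f} GFLOPS/W")
```

The accelerators deliver several times the CPU's performance per watt, which previews the power and thermal arguments made later in the deck.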

SLIDE 8

For Example … Graphics Cards

SLIDE 9

For Example … STI Cell

SLIDE 10

For Example … ClearSpeed

SLIDE 11

For Example … FPGAs

SLIDE 12

AMD Torrenza Ecosystem

SLIDE 13

Architectures that Match Application Requirements can offer Impressive/Astounding Performance Benefits

  • Geo-registration on GPU
    – 700x speedup over commodity processor
  • Numerous FPGA results on integer, logic, flop applications
    – 40x on Smith-Waterman
    – 10x speedup on MD
  • HPCC RandomAccess on Cray X1E
    – 7 GUPS on 512 MSPs
    – 32 GUPS on 64,000 procs

Molecular dynamics runtimes by system:
  – Cell PPE: 0.425 s
  – MTA-2 w/ 32 procs: ~0.035 s
  – 2.2 GHz Opteron: 0.125 s
  – Cell w/ 8 SPEs: 0.013 s
  – GPU (7900GT): 0.012 s

(Chart: video imagery geo-registration, 2k x 2k output: time (seconds) vs. input image size for CPU P4 2.4 GHz, GPU GeForce 6600, and GPU QuadroFX 4500, with and without readback.)

(Chart: arbitrary kernel, 32-bit, 4-color 64x64 image: time (sec) vs. kernel size for CPU P4 (debug), CPU P4 (opt), Cell SPE, GeForce 6600, and QuadroFX 4500.)

SLIDE 14

Disruptive Technologies and the S-Curve

  • Déjà vu?
    – Floating Point Systems accelerator (1970–80s)
    – Weitek coprocessors (1980s)
  • Some differences …
    – Flops are free
    – Power and thermal envelopes are constraining designs

SLIDE 15

Significant Hurdles to Adoption for Accelerators (and Multicores?)

  • Performance prediction
    – Should my organization purchase an accelerator?
    – What will be the performance improvement on my application workload with the accelerator?
    – Is the accelerator working as we expect?
    – How can I optimize my application for the accelerator?
  • Productive software systems
    – Do I have to rewrite my application for each accelerator?
    – How stable is the performance across systems?

SLIDE 16

Performance Modeling

SLIDE 17

Modeling Assertions Introduction

  • We need new application performance modeling techniques for HPC to tackle scale and architectural diversity
    – Performance modeling is quite useful at many stages in the architecture and application development process
  • Existing approaches
    – Manual: application driven
    – Automated: target architecture driven
    – Black-box schemes: accurate, but applicability to a range of applications and systems is unknown
  • Goals
    – Aim to combine analytical and empirical schemes
    – A framework for systematic model development: performance engineering of applications
    – Modular
    – Hierarchical
    – Separate application and system variables
    – Based on ‘user’ or ‘code developer’ input (no magical solution)
    – Scalable to future application and system configurations

SLIDE 18

Symbolic Performance Models with MA

  • Advantages over traditional modeling techniques
    – Modularity, portability, and extensibility
    – Parameterized, symbolic models are evaluated with Matlab and Octave
  • Construct, validate, and project application requirements as a function of input parameters

Model development cycle:
  – Declare important application variables
  – Declare important application operations
  – Annotate code with the MA API
  – Validate modeling assertions empirically at runtime
  – Incrementally refine the model based on error rates by adding and modifying variable and operation declarations
  – Terminate when the model is representative and the error level is acceptable

Modeling Assertion (MA) = Empirical data + Symbolic modeling
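The validation step pairs each symbolic prediction with an empirical count. A minimal Python sketch of that idea follows; the function name and tolerance are illustrative assumptions, not the actual MA API:

```python
# Sketch of a modeling assertion: compare a symbolic prediction against an
# empirically measured count and report an error rate with a PASS/FAIL
# verdict, mirroring the validation-output format used by the MA toolset.
# assert_model and its tolerance are illustrative assumptions.

def assert_model(name, kind, predicted, measured, tolerance=0.01):
    error = (predicted - measured) / predicted
    verdict = "PASS" if abs(error) <= tolerance else "FAIL"
    return f"{name}: {kind}:{predicted}:{measured}:{error:.3}: {verdict}"

# A hypothetical flop assertion for a loop doing 4 flops per element over
# na/num_proc_cols elements (in the style of the 'flopzeta' annotation):
na, num_proc_cols = 14000, 4
print(assert_model("flopzeta", "ma_flop", 4 * na // num_proc_cols, 14000))
```

When the measured count drifts from the prediction, the error rate drives the refinement loop above: the developer adjusts variable and operation declarations until the assertions pass.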

SLIDE 19

MA Framework

  • MA API in C (for Fortran & C applications with MPI); classes of API calls currently implemented and tested:
    – ma(f)_subroutine_start/end
    – ma(f)_loop_start/end
    – ma(f)_flop_start/stop
    – ma(f)_heap/stack_memory
    – ma(f)_mpi_xxxx
    – ma(f)_set/unset_tracing
  • Toolchain: source code annotation; runtime system generates trace files; post-processing toolset (in Java) produces model validation, a control-flow model, and a symbolic model

Example control-flow model:

    main () {
      .....
      loop (NAME = conj_loop) (COUNT = niter) {
        loop (NAME = norm_loop) (COUNT = l2npcols) {
          mpi_irecv (NAME = nrecv) (SIZE = dp * 2);
          mpi_send (NAME = nsend) (SIZE = dp * 2);

Example symbolic model:

    send = niter*(l2npcols*(dp*2)+l2npcols*(dp)+
           cgitmax*(l2npcols*(dp*na/num_proc_cols)+dp*na/num_proc_cols+
           l2npcols*(dp)+l2npcols*(dp))+
           l2npcols*(dp*na/num_proc_cols)+dp*na/num_proc_cols+l2npcols*(dp))

SLIDE 20

Example with MA Annotation

Input parameters: na, nonzer, niter, and nprocs. Derived parameters: nz, num_proc_cols, l2npcols, and dp (size of REAL).

    call maf_def_variable_int('na',na)
    call maf_def_variable_int('nonzer',nonzer)
    .....
    call maf_def_variable_assign_int('num_proc_cols',
   >     '2^ceil(log(nprocs)/(2*log(2)))',num_proc_cols)
    .....
    ! Start of a loop, with loop count; end markers are used for validation
    call maf_loop_start('conj_loop','niter',niter)
    do it = 1, niter
      .....
      ! Marker for floating-point operation count
      call maf_flop_start('flopzeta','4*na/num_proc_cols',
   >       4*na/num_proc_cols)
      do j=1, lastcol-firstcol+1
        norm_temp1(1) = norm_temp1(1) + x(j)*z(j)
        norm_temp1(2) = norm_temp1(2) + z(j)*z(j)
      enddo
      call maf_flop_stop('flopzeta')
      .....
    call maf_loop_end('conj_loop',it-1)
    .....
    ! Markup for subroutine invocation
    call maf_subroutine_start('conj_grad')
    ......
    call ma_loop_start('cj_matvec','l2npcols',l2npcols)
    do i = l2npcols, 1, -1
      ! MA MPI API call
      call maf_mpi_irecv('l2rcv','dp*na/num_proc_cols',
   >       dp*naa/npcols,l2npcols)
      call mpi_irecv( q(reduce_recv_starts(i)),
   >       reduce_recv_lengths(i),
   >       dp_type, ......
    ......
    call maf_subroutine_end('conj_grad')

Resulting symbolic model for the volume of data sent:

    send = niter*(l2npcols*(dp*2)+l2npcols*(dp)+
           cgitmax*(l2npcols*(dp*na/num_proc_cols)+
           dp*na/num_proc_cols+l2npcols*(dp)+l2npcols*(dp))+
           l2npcols*(dp*na/num_proc_cols)+
           dp*na/num_proc_cols+l2npcols*(dp))
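Parameterized models like this are normally evaluated in Matlab or Octave, but the same projection can be done in plain Python. In the sketch below, dp=8, niter=15, and cgitmax=25 are assumed values for illustration, and the derived-parameter formula for num_proc_cols follows the definition used in the annotation:

```python
import math

# Evaluate the symbolic send-volume model in plain Python (the MA toolchain
# evaluates such models in Matlab/Octave). Parameter defaults are assumptions
# for illustration, not NAS CG class definitions.

def send_volume(na, niter, nprocs, cgitmax=25, dp=8):
    # Derived parameters, per the annotated definition:
    # num_proc_cols = 2^ceil(log2(nprocs)/2), l2npcols = log2(num_proc_cols)
    num_proc_cols = 2 ** math.ceil(math.log2(nprocs) / 2)
    l2npcols = int(math.log2(num_proc_cols))
    return niter * (l2npcols * (dp * 2) + l2npcols * dp
                    + cgitmax * (l2npcols * (dp * na // num_proc_cols)
                                 + dp * na // num_proc_cols
                                 + l2npcols * dp + l2npcols * dp)
                    + l2npcols * (dp * na // num_proc_cols)
                    + dp * na // num_proc_cols + l2npcols * dp)

# Projected per-task send volume (bytes) as the process count grows:
for nprocs in (4, 16, 64, 256):
    print(nprocs, send_volume(na=14000, niter=15, nprocs=nprocs))
```

This is exactly the kind of scaling projection the framework uses: per-task communication volume shrinks as num_proc_cols grows, without ever running the application at those scales.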

SLIDE 21

Example Model Validation

Model validation output:

    pq: ma_flop:7000:7000:0.0: PASS=50: FAIL=0
    cj_sumred: ma_loop:1:1:0.0: PASS=50: FAIL=0
    l4rcv: ma_mpi_irecv:8:8:0.0: PASS=50: FAIL=0
    l4snd: ma_mpi_send:8:8:0.0: PASS=50: FAIL=0
    sumred: ma_flop:1:1:0.0: PASS=50: FAIL=0
    floprhopq: ma_flop:21001:21001:0.0: PASS=50: FAIL=0
    cj_rho: ma_loop:1:1:0.0: PASS=50: FAIL=0
    l5rcv: ma_mpi_irecv:8:8:0.0: PASS=50: FAIL=0
    l5snd: ma_mpi_send:8:8:0.0: PASS=50: FAIL=0
    flopbeta: ma_flop:7002:7001:1.426E-4: PASS=6: FAIL=44
    flopnzx: ma_flop_start:3503:4347:-0.194: PASS=0: FAIL=2

(Chart: measured vs. predicted number of floating-point operations for NAS CG and SP across problem instances S, W, A, B, and C.)

Problem instances:
  – NAS CG: Class S: na=1400, nonzer=7; Class W: na=7000, nonzer=8; Class A: na=14000, nonzer=11; Class B: na=75000, nonzer=13; Class C: na=150000, nonzer=15
  – NAS SP: Class S: problem_size=7; Class W: problem_size=36; Class A: problem_size=64; Class B: problem_size=102; Class C: problem_size=162

SLIDE 22

Computation Distribution

  • Runtime distribution across loop blocks in NAS SP and CG
    – Generated using symbolic models
    – Vary important parameters, such as number of processors and application parameters
  • Unlike CG, there is not a single hotspot in SP

(Charts: percentage of floating-point operations by loop block vs. number of processors: blocks l1–l16 of z_solve for SP, and blocks l1–l11 for CG.)

SLIDE 23

MPI Message Distribution Analysis

  • CG
    – 65% of messages in CG are 8 bytes
    – Remaining are over 37 Kbytes
  • SP
    – 95% of messages in SP are ~28 Kbytes
    – Remaining are 50–64 Kbytes
  • Conclusion: CG requires a low-latency network

(Chart: speedup of NAS CG and SP on the ORNL Cray XT3 system, 1 to 1024 processors.)
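The low-latency conclusion for CG follows from a simple first-order cost model: transfer time is a fixed per-message latency plus size over bandwidth, so tiny messages are dominated by latency. A sketch, where the latency and bandwidth values are assumptions for illustration, not measured XT3 parameters:

```python
# Why small messages imply a latency-bound workload: model transfer time
# as latency + size/bandwidth. Both network parameters below are assumed
# values for illustration only.

LATENCY_S = 5e-6         # 5 microseconds per message (assumed)
BANDWIDTH_BPS = 2e9      # 2 GB/s link bandwidth (assumed)

def transfer_time(size_bytes):
    return LATENCY_S + size_bytes / BANDWIDTH_BPS

def latency_fraction(size_bytes):
    """Share of the transfer time spent in fixed per-message latency."""
    return LATENCY_S / transfer_time(size_bytes)

# CG's dominant 8-byte messages vs. SP's ~28 KB messages:
print(f"8 B:   {latency_fraction(8):.1%} of time in latency")
print(f"28 KB: {latency_fraction(28 * 1024):.1%} of time in latency")
```

Under these assumed parameters the 8-byte messages spend essentially all of their time in latency, while SP's larger messages are mostly bandwidth-bound, which is why the two benchmarks stress different network properties.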
SLIDE 24

Sensitivity of SP Calculations

Sensitivity of workload requirements with respect to the SP input parameter problem_size.

(Chart: FP operations, average memory, and messages sent (bytes) for problem_size from 162 to 1620.)

SLIDE 25

Modeling Assertions with Accelerators

  • The MA framework provides information on computational intensity and data movement that is critical for mapping applications to accelerators
  • MA is providing insight into DOE applications for acceleration
    – Biomolecular application: AMBER
    – Climate modeling: POP

SLIDE 26

Mapping Amber Kernel to FPGAs

jac Amber8 benchmark profile:
  – List time (% of nonbond) = 4.72 (5.19)
  – Direct Ewald time = 70.82
  – Recip Ewald time = 14.76
  – Total Ewald time (% of nonbond) = 86.23 (94.81)
  – FFT time (% of Recip) = 4.76 (32.24)

(Call graph of the sander kernel with per-node invocation counts, covering main, sander, runmd, force, ewald_force, get_nb_energy, short_ene, do_pmesh_kspace, nonbond_list, and the FFT routines across ew_fft.f, ew_recip.f, ew_direct.f, vec_lib.f, ew_force.f, ew_box.f, ew_setup.f, and pub_fft.f.)

Obtained 3x application speedup on FPGA using HLL on SRC 6C MapStation.

SLIDE 27

MPPS: Multi-Paradigm Programming System

SLIDE 28

Multi-Paradigm Computing

  • Several vendors are designing, even now building, multi-paradigm systems
    – Along with general-purpose microprocessors, a multi-paradigm system may include:
      • FPGAs
      • Highly multi-threaded processors (MTA)
      • Graphics processors
      • Physics processors
      • Digital signal processors
    – Vendors include: IBM, SGI, Cray, SRC, ClearSpeed, Linux Networx

(Diagram: nodes of commodity processors augmented with vector processors, GPUs, and FPGAs.)

Legend: P: commodity processor; V: vector processor; GPU: graphics processing unit; FPGA: field programmable gate array

SLIDE 29

Multi-Paradigm Computing Challenges

  • Multi-paradigm systems offer lots of performance potential, but…
  • …it is challenging to realize that potential
    – Different APIs, different tools, different assumptions!
    – Different ISAs, SDKs
    – Explicit data movement
    – Simplistic scheduling
    – Static binding to available resources

SLIDE 30

MPPS Basis: Multi-Paradigm Procedure Call (MPPC)

  • Multi-Paradigm Procedure Calls
    – Adopt the highly successful RPC approach
    – Open protocol for communication within the infrastructure
  • MPPC runtime system
    – Runtime agent to manage access to each device
    – Directory service for dynamic discovery of devices and their status
    – Local service OS on devices (if possible)
  • Support for defining adaptive policies for scheduling application requests onto computing devices
    – Simple policies built in
    – Custom policies can be driven by automated administration and performance tools

(Diagram: applications, libraries, and tools sit atop MPPC, which targets FPGA, MTA, GPU, Cell, …)
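The runtime pieces just listed (a directory service, per-device agents, and pluggable scheduling policies) can be sketched in miniature. Everything below is illustrative, with toy device names and relative speeds; it is not the actual MPPC implementation:

```python
# Illustrative sketch of the MPPC runtime idea: a directory service tracks
# which devices can serve which operations, and a scheduling policy picks
# a target for each call. Device names and speeds are assumed toy values.

class Directory:
    """Dynamic registry of devices and the operations they can serve."""
    def __init__(self):
        self.devices = {}                 # device name -> {op: relative speed}

    def register(self, name, ops):
        self.devices[name] = ops

    def candidates(self, op):
        return [(n, ops[op]) for n, ops in self.devices.items() if op in ops]

def fastest_device(candidates):
    """A simple built-in policy: pick the highest relative speed."""
    return max(candidates, key=lambda c: c[1])[0]

def mppc_call(directory, op, policy=fastest_device):
    # A real runtime would also marshal arguments, start the device, and
    # wait on completion; here we only resolve where the call would run.
    chosen = policy(directory.candidates(op))
    return f"{op} -> {chosen}"

d = Directory()
d.register("cpu0", {"DGEMM": 1.0, "FFT": 1.0})
d.register("gpu0", {"DGEMM": 6.0})
d.register("fpga0", {"FFT": 4.0})
print(mppc_call(d, "DGEMM"))   # routed to the fastest registered device
print(mppc_call(d, "FFT"))
```

Because the policy is a plain function, a custom one driven by administration or performance tools could be swapped in without touching the application, which is the adaptivity the slide describes.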

SLIDE 31

Compiler Support for MPPS

  • Pragmas identify regions of code to accelerate
    – Built on Open64
    – Similar to OpenMP analysis
  • Extracts code for the device service
    – Device code compiled separately with the device-specific SDK
  • Replaces original code with an MPPC call
    – Marshals data; starts, then waits on the device

SLIDE 32

Summary

  • Accelerators will continue to gain market share in one form or another
    – Expansion slots
    – On-chip accelerators that are used as necessary
  • Software systems that can mask the complexity will become much more important
    – Multi-Paradigm Programming System
    – Automated generation of MPPC calls
  • Performance modeling and analysis will become critical for procurements, validation, and optimization
    – Modeling assertions

SLIDE 33

Acknowledgements and More Info

This research was sponsored by the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

  • http://www.csm.ornl.gov/ft
  • vetter@computer.org

SLIDE 34

Bonus Slides

SLIDE 35

Performance Stability

SPE optimizations, runtime (sec, log scale):
  – Original: 0.681 s
  – Fast cosine: 0.248 s
  – Fast exp/sqrt: 0.064 s
  – SIMD: 0.047 s

SLIDE 36

Performance Stability (2)

  • HPC Challenge ratio of Optimized over Baseline

(Chart: ratios for HPL, RA, PTRANS, FFT, and STREAMS; labeled values include 1.0, 1.1, 2.9, and 1.5.)

SLIDE 37

MPI Symbolic Models

Error rate for MPI message sizes and count = 0%

(Chart: message size (bytes) and message count per MPI task for the NAS MPI CG and SP benchmarks, for 4 to 1600 processors.)

SLIDE 38

Sensitivity Analysis: Data Generated by Symbolic Models

  • Application input parameters:
    – na (array size)
    – nonzer (number of nonzero elements)
  • Question: which parameter influences the workload, and how?
  • MA models generated the required information efficiently
  • Observation: the nonzer parameter has a huge impact on computation requirements
  • Also identified that nonzer has no impact on MPI communication

(Charts: FP operations and LS operations vs. na, and vs. nonzer.)

SLIDE 39

MPPS Research Directions

  • Integration with Modeling Assertions
    – MA models can help MPPC make better scheduling decisions
    – MPPC behavior can be fed back to improve models that are multi-paradigm aware
  • Multi-operation scheduling
    – Instead of MPPC_FFT, MPPC_DGEMM granularity, turn over larger sequences of work to the MPPC infrastructure
    – More optimization opportunities
    – More scheduling burden on the MPPC infrastructure

SLIDE 40

MPPC API

    int main( int argc, char* argv[] )
    {
        MPI_Init( &argc, &argv );
        MPPC_Init();
        ...
        MPPC_DGEMM( a, b, s, z );
        ...
        MPPC_ZDFFT( u, v, n );
        ...
        MPPC_Finalize();
        MPI_Finalize();
        return 0;
    }

Mapping, data marshaling, and scheduling of the specific multi-paradigm device are hidden from the user.

Automated static analysis and profile-directed feedback can hide conversion of applications to MPPC and optimize series of MPPC routines.