

SLIDE 1

A High-Level Signal Processing Library for Multicore Processors

Sharon Sacco, Nadya Bliss, Ryan Haney, Jeremy Kepner, Hahn Kim, Sanjeev Mohindra, Glenn Schrader and Edward Rutledge

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory


SLIDE 2

Outline

  • Overview
  • HPEC Challenge Benchmarks
  • Parallel Vector Tile Optimizing Library
  • Summary


SLIDE 3

Embedded Digital Systems

Next-Generation Warfighting Vision

[Diagram labels: Rapid System Prototyping; Open System Technologies; High Performance Embedded Computing Software Initiative; Widebody Airborne Sensor Platform; Greenbank; Triple Canopy Foliage Penetration; Next Gen Radar Open Systems Architecture; Advanced Hardware Implementations; Network & Decision Support Initiatives; Receiver on-Chip; Integrated Sensing and Decision Support; Very Large Scale Integration / Field Programmable Gate Array Hybrids; LLGrid]

SLIDE 4

Embedded Processor Evolution

[Chart: MFLOPS / Watt vs. year (1990-2010) for high performance embedded processors: i860 XR, i860, SHARC, 603e, 750, MPC7400, MPC7410, MPC7447A (PowerPC with AltiVec), and Cell (estimated).]

  • 20 years of exponential growth in FLOPS / Watt
  • Requires switching architectures every ~5 years


  • Cell processor is current high performance architecture

SLIDE 5

Outline

  • Overview
  • HPEC Challenge Benchmarks

  • Time-Domain FIR Filter
  • Results
  • Parallel Vector Tile Optimizing Library
  • Summary


SLIDE 6

HPEC Challenge
Information and Knowledge Processing Kernels

Pattern Match

  • Compute best match for a pattern out of a set of candidate patterns
    – Uses weighted mean-square error

[Figure: magnitude vs. range for the pattern under test and candidate patterns 1 through N]

Genetic Algorithm

  • Evaluation: evaluate each chromosome
  • Selection: select chromosomes for the next generation
  • Crossover: randomly pair up chromosomes and exchange portions
  • Mutation: randomly change each chromosome

Corner Turn

  • Memory rearrangement of matrix contents
    – Switch from row-major to column-major layout (see the sketch at the end of this slide)

[Diagram: memory contents 1 2 3 4 5 6 7 8 9 10 11 … rearranged to 1 5 9 2 6 10 3 7 11 4 8 …; numbers denote memory content]

Database Operations

  • Three generic database operations:
    – search: find all items in a given range
    – insert: add items to the database
    – delete: remove items from the database
  • Uses red-black tree and linked list data structures
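
The corner turn is essentially a transpose of the matrix in memory. A minimal serial sketch (illustrative only; the parameter names are not from the benchmark, and an optimized version would typically tile the copy and overlap it with DMA):

    /* Corner turn sketch: copy a rows x cols row-major matrix into
       column-major order (i.e., transpose the memory layout). */
    void corner_turn(const float *in, float *out, int rows, int cols)
    {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                out[j * rows + i] = in[i * cols + j];
    }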

SLIDE 7

HPEC Challenge
Signal and Image Processing Kernels

QR

  • Computes the factorization of an input matrix, A = QR
  • Implementation uses the Fast Givens algorithm

[Figure: A = Q R, with the matrices labeled (MxN), (MxN), and (MxM)]

FIR

  • Bank of M filters applied to input data (M channels)
  • FIR filters implemented in time and frequency domain

[Figure: filter banks of ~10 and >100 coefficients applied to the input matrix]

SVD

  • Produces the decomposition of an input matrix, X = UΣV^H
  • Classic Golub-Kahan SVD implementation

[Figure: input matrix reduced to a bidiagonal matrix, then to the diagonal matrix Σ]

CFAR

  • Creates a target list given a data cube (beams x range x Dopplers)
  • Calculates normalized power for each cell, thresholds for target detection (a generic sketch follows below)

[Figure: data cube C(i,j,k) is normalized and thresholded to produce target list T(i,j,k)]
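
The deck does not spell out the CFAR computation; as a rough illustration only, a generic one-dimensional cell-averaging CFAR along the range dimension looks like the following (the guard, window, and threshold parameters are illustrative, not the HPEC Challenge definition):

    /* Generic cell-averaging CFAR sketch: estimate noise from cells on
       either side of the cell under test (skipping guard cells), then
       threshold the normalized power.  Returns the number of detections. */
    int cfar_1d(const float *power, int n, int guard, int win,
                float thresh, int *targets)
    {
        int ntargets = 0;
        for (int i = guard + win; i < n - guard - win; i++) {
            float noise = 0.0f;
            for (int j = 1; j <= win; j++)
                noise += power[i - guard - j] + power[i + guard + j];
            noise /= (float)(2 * win);
            if (power[i] > thresh * noise)   /* normalized power test */
                targets[ntargets++] = i;
        }
        return ntargets;
    }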

SLIDE 8

Time Domain FIR Algorithm

[Diagram: a single filter (example size 4) slides along the reference input data; each output point is the dot product of the filter with a window of the input.]

  • Number of operations:
    – k – filter size
    – n – input size
    – nf – number of filters
    – Total FLOPs: ~ 8 x nf x n x k
  • Output size: n + k - 1
  • TDFIR uses complex data
  • TDFIR uses a bank of filters
    – Each filter is used in a tapered convolution
    – A convolution is a series of dot products

HPEC Challenge TDFIR parameters:

  Set    k      n       nf
  1      128    4096    64
  2      12     1024    20

  • FIR is one of the best ways to demonstrate FLOPS
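
As a concrete illustration of the operation count, here is a minimal, unoptimized time-domain FIR for a single complex filter in split-complex form (names are illustrative, not the benchmark code). Each (input point, tap) pair costs 8 floating-point operations, which gives the ~8 x nf x n x k total above when repeated over a bank of nf filters:

    /* Full ("tapered") complex convolution of one filter with one input.
       xr/xi: input of length n; hr/hi: filter of length k;
       yr/yi: output of length n + k - 1. */
    void tdfir_one(const float *xr, const float *xi, int n,
                   const float *hr, const float *hi, int k,
                   float *yr, float *yi)
    {
        for (int i = 0; i < n + k - 1; i++) { yr[i] = 0.0f; yi[i] = 0.0f; }
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                /* complex multiply-accumulate: 4 multiplies + 4 adds = 8 FLOPs */
                yr[i + j] += xr[i] * hr[j] - xi[i] * hi[j];
                yi[i + j] += xr[i] * hi[j] + xi[i] * hr[j];
            }
        }
    }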

SLIDE 9

Performance Challenges

[Diagram: memory hierarchy (registers, cache, local memory, remote memory, disk) with data moving between levels as operands, blocks, messages, pages, and tiles, annotated with the challenges below.]

  • Efficiently partition the application
  • Maximize the use of SIMD; keep the pipelines full; dual issue instructions
  • Don't let the cache slow things down; access memory efficiently
  • Keep the data flowing; cover memory transfers with computations; watch out for race conditions
  • Can the buses be controlled efficiently? Can the exact processor be selected? Can information on disk be preloaded before needed?

  • Price of performance is increased programming complexity

SLIDE 10

Reference C Implementation

  • Computations take 2 lines
  • Mostly loop control, pointers, and initialization
    – Output initialization assumed
  • SPE needs split complex
    – Separate real and imaginary vectors

    for (i = K; i > 0; i--) {
        /* Set accumulators and pointers for dot product for output point */
        r1 = Rin;   r2 = Iin;
        o1 = Rout;  o2 = Iout;
        /* calculate contributions from a single kernel point */
        for (j = 0; j < N; j++) {
            *o1 += *k1 * *r1 - *k2 * *r2;
            *o2 += *k2 * *r1 + *k1 * *r2;
            r1++; r2++;
            o1++; o2++;
        }
        /* update kernel and output pointers */
        k1++; k2++;
        Rout++; Iout++;
    }

Reference C FIR is easy to understand

SLIDE 11

C with SIMD Extensions

  • Inner loop contributes to 4 output points per pass
  • SIMD registers in use
  • Shuffling of values in registers is a requirement
    – Compilers are unlikely to recognize this type of code
  • Can rival assembly code with more effort
  • SIMD C extensions increase code complexity
    – Hardware needs consideration

Contents of inner loop of convolution:

    /* load reference data and shift */
    ir0 = *Rin++;  ii0 = *Iin++;
    ir1 = (vector float) spu_shuffle(irOld, ir0, shift1);
    ii1 = (vector float) spu_shuffle(iiOld, ii0, shift1);
    ir2 = (vector float) spu_shuffle(irOld, ir0, shift2);
    ii2 = (vector float) spu_shuffle(iiOld, ii0, shift2);
    ir3 = (vector float) spu_shuffle(irOld, ir0, shift3);
    ii3 = (vector float) spu_shuffle(iiOld, ii0, shift3);

    Rtemp = kr0 * ir0 + Rtemp;      Itemp = kr0 * ii0 + Itemp;
    Rtemp = -(ki0 * ii0 - Rtemp);   Itemp = ki0 * ir0 + Itemp;
    Rtemp = kr1 * ir1 + Rtemp;      Itemp = kr1 * ii1 + Itemp;
    Rtemp = -(ki1 * ii1 - Rtemp);   Itemp = ki1 * ir1 + Itemp;
    Rtemp = kr2 * ir2 + Rtemp;      Itemp = kr2 * ii2 + Itemp;
    Rtemp = -(ki2 * ii2 - Rtemp);   Itemp = ki2 * ir2 + Itemp;
    Rtemp = kr3 * ir3 + Rtemp;      Itemp = kr3 * ii3 + Itemp;
    Rtemp = -(ki3 * ii3 - Rtemp);   Itemp = ki3 * ir3 + Itemp;

    *Rout++ = Rtemp;  *Iout++ = Itemp;
    irOld = ir0;  iiOld = ii0;      /* update old values */

SLIDE 12

Assembly Version

  • Dual issue instructions
    – Integer and bitwise instructions compete with floating point for dual issue
    – Padding with occasional "no operations" keeps the performance going
    – Easy to end dual issue
  • Pointer updates and loop control affect inner loop cycle count
  • Inner loop operates on 16 output values with 4 kernel values
  • 78% dual issue cycles in inner loop with 4 nops
  • 74 registers used

Code fragment from inner loop:

    /* Fourth kernel point contribution */
    fma    $50,$30,$66,$50   /* reout''' + rein * rek [4i+16j+3]-[4i+16j+6]   */
    lqd    $45,32($15)       /* load imin[16j+24]-[16j+27]                    */
    fma    $51,$31,$66,$51   /* imout''' + rein * imk [4i+16j+3]-[4i+16j+6]   */
    shufb  $66,$40,$42,$16   /* rein[4i+16j+17]-[4i+16j+20] in $66            */
    fma    $52,$30,$68,$52   /* reout''' + rein * rek [4i+16j+7]-[4i+16j+10]  */
    lqd    $47,48($15)       /* load imin[16j+28]-[16j+31]                    */
    fma    $53,$31,$68,$53   /* imout''' + rein * imk [4i+16j+7]-[4i+16j+10]  */
    shufb  $68,$42,$44,$16   /* rein[4i+16j+21]-[4i+16j+24] in $68            */
    and    $43,$43,$19       /* clear $43 for taper if necessary              */
    lnop
    fma    $54,$30,$70,$54   /* reout''' + rein * rek [4i+16j+11]-[4i+16j+14] */
    lqd    $49,64($15)       /* load imin[4i+16j+32]-[4i+16j+31]              */
    fma    $55,$31,$70,$55   /* imout''' + rein * imk [4i+16j+11]-[4i+16j+14] */
    shufb  $70,$44,$46,$16   /* rein[4i+16j+25]-[4i+16j+28] in $70            */
    fma    $56,$30,$72,$56   /* reout''' + rein * rek [4i+16j+15]-[4i+16j+18] */
    ai     $14,$14,64        /* update rein address                           */
    fma    $57,$31,$72,$57   /* imout''' + rein * imk [4i+16j+15]-[4i+16j+18] */
    shufb  $72,$46,$48,$16   /* rein[4i+16j+29]-[4i+16j+32] in $72            */

High performance demands software to leverage hardware

SLIDE 13

Parallel Approach: Time Domain FIR

[Diagram: the stream of independent convolutions is dealt out to SPE 1 through SPE 4.]

  • HPEC Challenge Benchmark TDFIR is a series of independent convolutions
    – "Embarrassingly" parallel problem is a good place to start
    – Independent convolutions are divided among the processors (see the partitioning sketch below)
    – Computation of one convolution can be overlapped with DMAs from others
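
One simple way to express the partitioning described above is a block distribution of the nf independent convolutions over the available SPEs. A minimal sketch with illustrative names (not the actual benchmark code; the real code would hand each range to an SPE worker and double-buffer its DMA transfers):

    /* Compute the [first, last) range of filters owned by one SPE when
       nf independent convolutions are block-distributed over num_spes. */
    void my_filter_range(int nf, int num_spes, int spe_id,
                         int *first, int *last)
    {
        int base = nf / num_spes;            /* filters per SPE           */
        int rem  = nf % num_spes;            /* leftovers go to low ranks */
        *first = spe_id * base + (spe_id < rem ? spe_id : rem);
        *last  = *first + base + (spe_id < rem ? 1 : 0);
    }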

SLIDE 14

Outline

  • Overview
  • HPEC Challenge Benchmarks

  • Time-Domain FIR Filter
  • Results
  • Parallel Vector Tile Optimizing Library
  • Summary


SLIDE 15

Mercury Cell Processor Test System

Mercury Cell Processor System

  • Single Dual Cell Blade
    – Native tool chain
    – Two 2.4 GHz Cells running in SMP mode
    – Terra Soft Yellow Dog Linux 2.6.14
  • Received 03/21/06
    – Booted & running same day
    – Integrated w/ LL network < 1 wk
    – Octave (Matlab clone) running
    – Parallel VSIPL++ compiled
  • Each Cell has 153.6 GFLOPS (single precision)
    – 307.2 for system @ 2.4 GHz (maximum)
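
These peaks follow from standard single-precision SPE arithmetic (a consistency check, not a number from the slide): each SPE can issue a 4-wide fused multiply-add every cycle, i.e. 8 FLOPs/cycle, so 8 SPEs x 8 FLOPs x 2.4 GHz = 153.6 GFLOPS per Cell, and 2 x 153.6 = 307.2 GFLOPS for the dual-Cell blade.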

Software includes:

  • IBM Software Development Kit (SDK)
    – Includes example programs
  • Mercury Software Tools
    – MultiCore Framework (MCF)
    – Scientific Algorithms Library (SAL)
    – Trace Analysis Tool and Library (TATL)

SLIDE 16

Performance: Time Domain FIR

[Charts: GIGAFLOPS vs. number of iterations (L) and time (seconds) vs. L for Cell with 1, 2, 4, 8, and 16 SPEs at 2.4 GHz; constants M = 64, N = 4096, K = 128 (04-Aug-2006).]

  • Octave runs TDFIR in a loop
    – Averages out overhead
    – Applications typically run convolutions many times

Maximum GFLOPS for TDFIR #1 (Cell @ 2.4 GHz):

  # SPEs    1     2     4     8     16
  GFLOPS    16    32    63    126   253

Set 1 has a bank of 64 size-128 filters with size-4096 input vectors.

SLIDE 17

Overhead

[Chart: SPE thread spawning overhead at 2.4 GHz, milliseconds vs. number of SPEs (2-10 SPEs, roughly 10-50 ms).]

  • Thread spawn takes ~ 5.3 ms / SPE
    – Minimize thread spawns
    – Use middleware that avoids thread spawns
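
To put the spawn cost in perspective (simple arithmetic from numbers elsewhere in this deck): one pass over TDFIR data set 1 is about 8 x 64 x 4096 x 128 ≈ 0.27 GFLOP, or roughly 2 ms at the 126 GFLOPS achieved on 8 SPEs, while spawning threads on those 8 SPEs costs about 8 x 5.3 ≈ 42 ms. Respawning per call would therefore dominate the runtime, which is why the slide recommends minimizing spawns or using middleware that avoids them.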

SLIDE 18

SLOCs and Coding Effort

Software Lines of Code (SLOC) and Performance for TDFIR:

                       C        SIMD C    Hand Coding    Parallel (8 SPE)
  Lines of Code        33       110       371            546
  Design Time          Minute   Hour      Hour           –
  Coding Time          Minute   Hour      Day            –
  Debug Time           Minute   Minute    Day            –
  Efficiency           0.014    0.27      0.88           0.82
  GFLOPS @ 2.4 GHz     0.27     5.2       17             126

[Chart: speedup from C vs. SLOC / C SLOC for the C, SIMD C, hand-coded, and parallel (8 SPE) versions.]

  • Clear tradeoff between performance and effort
    – C code: simple, poor performance
    – SIMD C: more complex to code, reasonable performance
    – Hand coding: very complex, excellent performance
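
The efficiency row is consistent with measured rate divided by peak (a consistency check, not an additional measurement): one SPE at 2.4 GHz peaks at 8 FLOPs/cycle x 2.4 GHz = 19.2 GFLOPS, so 0.27/19.2 ≈ 0.014, 5.2/19.2 ≈ 0.27, and 17/19.2 ≈ 0.88; the parallel version is measured against all 8 SPEs (153.6 GFLOPS), giving 126/153.6 ≈ 0.82.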

SLIDE 19

Outline

  • Overview
  • HPEC Challenge Benchmarks

  • Parallel Vector Tile Optimizing Library (PVTOL)

  • PVTOL Architecture
  • PVTOL Development Cycle
  • Summary
  • Contributing Technologies


SLIDE 20

Parallel Vector Tile Optimizing Library (PVTOL)

  • PVTOL is a portable and scalable middleware library for multicore processors
  • Enables incremental development
    1. Develop serial code
    2. Parallelize code
    3. Deploy code
    4. Automatically parallelize code

[Diagram: the same code moves between desktop, cluster, and embedded computer.]

Make parallel programming as easy as serial programming

SLIDE 21

PVTOL Architecture

  • Performance
    – Achieves high performance
  • Portability
    – Built on standards, e.g. VSIPL++
  • Productivity
    – Minimizes effort at user level

PVTOL preserves the simple load-store programming model in software

SLIDE 22

PVTOL Development Process

Serial PVTOL code:

    void main(int argc, char *argv[]) {
      // Initialize PVTOL process
      pvtol(argc, argv);

      // Create input, weights, and output matrices
      typedef Dense<2, float, tuple<0, 1> > dense_block_t;
      typedef Matrix<float, dense_block_t, LocalMap> matrix_t;
      matrix_t input(num_vects, len_vect),
               filts(num_vects, len_vect),
               output(num_vects, len_vect);

      // Initialize arrays
      ...

      // Perform TDFIR filter
      output = tdfir(input, filts);
    }

SLIDE 23

PVTOL Development Process

Parallel PVTOL code:

    void main(int argc, char *argv[]) {
      // Initialize PVTOL process
      pvtol(argc, argv);

      // Add parallel map
      RunTimeMap map1(...);

      // Create input, weights, and output matrices
      typedef Dense<2, float, tuple<0, 1> > dense_block_t;
      typedef Matrix<float, dense_block_t, RunTimeMap> matrix_t;
      matrix_t input(num_vects, len_vect, map1),
               filts(num_vects, len_vect, map1),
               output(num_vects, len_vect, map1);

      // Initialize arrays
      ...

      // Perform TDFIR filter
      output = tdfir(input, filts);
    }

SLIDE 24

PVTOL Development Process

Embedded PVTOL code:

    void main(int argc, char *argv[]) {
      // Initialize PVTOL process
      pvtol(argc, argv);

      // Add hierarchical map
      RunTimeMap map2(...);

      // Add parallel map
      RunTimeMap map1(..., map2);

      // Create input, weights, and output matrices
      typedef Dense<2, float, tuple<0, 1> > dense_block_t;
      typedef Matrix<float, dense_block_t, RunTimeMap> matrix_t;
      matrix_t input(num_vects, len_vect, map1),
               filts(num_vects, len_vect, map1),
               output(num_vects, len_vect, map1);

      // Initialize arrays
      ...

      // Perform TDFIR filter
      output = tdfir(input, filts);
    }

SLIDE 25

PVTOL Development Process

Automapped PVTOL code:

    void main(int argc, char *argv[]) {
      // Initialize PVTOL process
      pvtol(argc, argv);

      // Create input, weights, and output matrices
      typedef Dense<2, float, tuple<0, 1> > dense_block_t;
      typedef Matrix<float, dense_block_t, AutoMap> matrix_t;
      matrix_t input(num_vects, len_vect),
               filts(num_vects, len_vect),
               output(num_vects, len_vect);

      // Initialize arrays
      ...

      // Perform TDFIR filter
      output = tdfir(input, filts);
    }

SLIDE 26

Outline

  • Overview
  • HPEC Challenge Benchmarks

  • Parallel Vector Tile Optimizing Library (PVTOL)

  • PVTOL Architecture
  • PVTOL Development Cycle
  • Summary
  • Contributing Technologies


SLIDE 27

pMapper Automated Mapping System**

Cell Processor Simulation

  • Simulate the Cell processor using pMapper simulator infrastructure
  • Use pMapper to automate mapping and predict performance on the Cell

[Diagram blocks: program spec, signal flow extractor, signal flow graph, expert mapping system, ATLAS, performance model, Cell simulator / timing simulator, program timing, Cell performance metrics.]

** Note: Patent process underway

SLIDE 28

Hierarchical Arrays

  • Hierarchical arrays hide details of the processor and memory hierarchy
  • Hierarchical maps concisely describe data distribution at each level

[Diagram: a Cell cluster of two Cells, each Cell containing SPE 0-3 with their local stores (LS); a map is attached at each level:]

  clusterMap – grid: 1x2, dist: block, nodes: 0:1, map: cellMap
  cellMap    – grid: 1x4, dist: block, policy: default, nodes: 0:3, map: speMap
  speMap     – grid: 4x1, dist: block, policy: default

SLIDE 29

The High Performance Embedded Computing Software Initiative

Program Goals

  • Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
  • Engage acquisition community to promote technology insertion
  • Deliver quantifiable benefits
    – Portability: reduction in lines-of-code changed to port/scale to a new system
    – Productivity: reduction in overall lines-of-code
    – Performance: computation and communication benchmarks

[Diagram: HPEC Software Initiative goals: Demonstrate; Interoperable & Scalable; Performance (1.5x).]

SLIDE 30

VSIPL++
Vector, Signal, and Image Processing Library

[Diagram: vector signal processing, matrix signal processing, and image processing applications built on VSIPL++, with upgrades to future systems.]

  • Portable to workstations, embedded systems, FPGAs with minimal performance cost
  • Applicable to simple and complex applications
  • Parallelism built in
  • Object Oriented
  • Basic Arithmetic, Matrix Algebra, Signal Processing, and Equation Solvers
  • Easier upgrade cycle
  • Reduced development time and cost

VSIPL++ is freely available: www.hpec-si.org

SLIDE 31

Mercury SAL and MCF

  • Scientific Algorithms Library (SAL) is an FPS-based library available on most Mercury products
    – Program portability within Mercury products
    – Common signal processing algorithms
  • SAL has over 100 functions, optimized for a single SPE
    – FFT (1D, multiple)
    – Convolution (real, complex)
    – Matrix multiply
    – Basic arithmetic
    – Trigonometric and transcendental
    – Transpose
  • MultiCore Framework (MCF) manages multi-SPE programming
    – Function offload engine model
    – Stripmining
    – Intraprocessor communications
    – Overlays
    – Profiling
  • Leveraging vendor libraries reduces development time
    – Provides optimization
    – Less debugging of application

SLIDE 32

Summary

  • With basic tools, high performance is achievable from the Cell processor
    – Hard to program: SIMD C extensions or assembly code are required
    – Development and debugging time can be long
  • PVTOL is making programming easier for the Cell
    – Leverages existing technologies
    – Packages common kernels needed by users
    – API simplifies application code

SLIDE 33

Backup