Fast Transforms using the Cell/B.E. Processor (PowerPoint Presentation)


SLIDE 1

Fast Transforms using the Cell/B.E. Processor

David A. Bader joint work with Seunghwa Kang and Virat Agarwal

SLIDE 2

Sony-Toshiba-IBM Center of Competence for the Cell/B.E. at Georgia Tech

Mission: grow the community of Cell Broadband Engine users and developers

  • Fall 2006: Georgia Tech wins competition for hosting the STI Center
  • First publicly-available IBM QS20 Cluster
  • 200 attendees at 2007 STI Workshop
  • Multicore curriculum and training
  • Demonstrated performance on
    – Multimedia and gaming
    – Scientific computing
    – Medical applications
    – Financial services

David A. Bader, Director

http://sti.cc.gatech.edu

David A. Bader

SLIDE 3

Applications

  • CellBuzz: Freely-available, open-source libraries optimized for the Cell/B.E.
    http://sourceforge.net/projects/cellbuzz/
    – ZLIB & GZIP: data compression
    – FFT: fast Fourier transform
    – RC5: encryption
    – MPEG-2: video encoding and decoding
    – JPEG2000: digital content processing

  • Financial Modeling

SLIDE 4

Cell/B.E. Libraries: FFT and JPEG2000

  • FFTC: Fastest Fourier Transform on the Cell/B.E.
    – 1-Dimensional single-precision DIF-FFT optimized for 1K-16K complex input samples
    – Parallelize & optimize the computation of a single FFT
    – Design a high-performance synchronization barrier using inter-SPE communication
    – Demonstrated superior performance of 18.6 GFlop/s for 8K complex input samples

[Figures: butterflies of the ordered DIF FFT; GigaFlop/s vs. input size (1024-16384) for FFTC (our implementation, 8 SPEs) against IBM Power5, AMD Opteron, Intel Pentium 4, Intel Core Duo, and FFTW on Cell]

  • JPEG2000 on the Cell/B.E.
    – Optimize coding/decoding by data decomposition / data alignment / vectorization
    – Demonstrated average speedup of 3.1 over an Intel 3.2 GHz Pentium-4

The source code is freely available from our CellBuzz project in SourceForge
http://sourceforge.net/projects/cellbuzz/

SLIDE 5

Cell/B.E. Libraries: ZLIB and MPEG-2

  • ZLIB: data compression & decompression library
    – Vectorize compute-intensive kernels and parallelize to run on multiple SPEs
    – Extend the gzip header format while maintaining compatibility with legacy gzip decompressors
    – Demonstrated speedup of 2.9 over a high-end Intel Pentium-4 system

  • MPEG-2 Video Decoding
    – First parallelization of a multimedia application on the Cell/B.E.
    – Demonstrated a speedup of 3 over an Intel 3.2 GHz Xeon

The source code is freely available from our CellBuzz project in SourceForge
http://sourceforge.net/projects/cellbuzz/

SLIDE 6

Using the Cell/B.E. in Aircraft Health Monitoring

"Retired Marine Lt. Gen. Bernard Trainor said the issue of aging aircraft is a constant complaint of all branches of service."
  - Atlanta Journal-Constitution, April 27, 2002

  • Fault Diagnosis
    – Estimate the crack length without disassembly, based on vibration data collected from multiple sensors
  • Failure Prognosis
    – Estimate the expected time before crash

SLIDE 7

System-of-Systems Decomposition

[Figure: aircraft system-of-systems diagram labeling monitored subsystems: generator oil level; hydraulic filters, actuator leakage and wear; power and cooling; turbomachine life, oil condition, oil servicing and filter condition; engine battery, pump, and hydraulic fluid level; oxygen generator; nitrogen generator and filter; landing gear strut pressure and fluid level; landing gear and arresting hook structure fatigue life; rotary actuator wear]

SLIDE 8

Overview of the Diagnosis and Prognosis Process

[Figure: diagnosis and prognosis data flow. Online modules: sensor data (DAQ) -> de-noising -> feature extraction -> preprocessed data -> diagnosis (estimated crack length) -> particle filter -> prognosis (RUL). Offline modules: de-noising techniques, feature extraction & mapping techniques, noise models, flight-regime data & model parameter tuning, stress table (crack length vs. K_min/K_max), experimental and simulated data, and system models for diagnosis and prognosis.]

Involves multiple computationally expensive modules!

SLIDE 9

Fast Transforms on the Cell/B.E.

  • Fast Fourier Transform
  • Discrete Wavelet Transform

SLIDE 10

FFTC: Fastest Fourier Transform for Cell/B.E.

  • Focus on medium-size FFT computations
    – Complex single-precision 1-Dimensional FFT
  • Input samples and output results reside in main memory
  • Radix 2, 3, and 5
  • Optimized for 1K-16K input samples
  • Focus on achieving high performance for the computation of a single FFT, rather than increasing throughput

SLIDE 11

Existing FFT Research on Cell/B.E.

  • [Williams et al., 2006] analyzed peak performance.
  • [Cico, Cooper and Greene, 2006] estimated 22.1 GFlop/s for an 8K complex 1-D FFT that resides in the Local Store of one SPE.
    – 8 independent FFTs in the local stores of 8 SPEs gives 176.8 GFlop/s.
  • [Chow, Fossum and Brokenshire, 2005] achieved 46.8 GFlop/s for a 16M complex FFT.
    – Highly specialized for this particular input size.
  • FFTW is a highly portable FFT library supporting various types, precisions, and input sizes.

SLIDE 12

Our FFTC is based on Cooley-Tukey

  • Input is a one-dimensional vector of complex values.
  • Algorithm is iterative, no recursion.
  • Out-of-place approach is used.
  • Requires two arrays A & B for computation, one input and one output, that are swapped at every stage.
  • Out-of-place approach prevents data reordering after the last stage.
  • Algorithm requires log N stages. Each stage requires O(N) computation.
    – Complexity O(N log N)
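The bullets above can be sketched as a small runnable reference. This is an illustrative Stockham-style out-of-place radix-2 DIF FFT in Python, not the FFTC code itself (which is written with SPE vector intrinsics); it shows the two ping-ponged arrays, the log2(N) stages, and why no reordering is needed after the last stage.

```python
import cmath

def stockham_fft(x):
    """Iterative, out-of-place radix-2 DIF FFT (Stockham autosort sketch).
    The length of x must be a power of two. Two arrays a and b are
    swapped at every stage, and no bit-reversal reordering is needed
    after the last stage."""
    n = len(x)
    a, b = list(x), [0j] * n
    size, stride = n, 1              # size halves, stride doubles per stage
    while size > 1:                  # log2(n) stages, O(n) work per stage
        half = size // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / size)   # twiddle factor
            for q in range(stride):
                u = a[q + stride * p]
                v = a[q + stride * (p + half)]
                b[q + stride * 2 * p] = u + v          # DIF butterfly
                b[q + stride * (2 * p + 1)] = (u - v) * w
        a, b = b, a                  # swap input/output arrays
        size //= 2
        stride *= 2
    return a
```

The (u + v, (u - v)·w) butterfly is the decimation-in-frequency form referenced on the earlier slides.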

SLIDE 13

[Figure: pseudocode of one FFT stage, from stage begin to stage end, showing the twiddle-factor computation]

SLIDE 14

Illustration of the Algorithm

Illustration of the algorithm for n=16 complex values.
  • The distance between pairs of output values doubles at every subsequent stage.
  • Shows how the output of one stage serves as the input to another.

SLIDE 15

FFTC design on Cell/B.E.: Challenges

  • A synchronization step after every stage leads to significant overhead.
  • Reduce the number of synchronization stages.
  • Design an efficient barrier synchronization routine.
  • We will later describe an efficient tree-based synchronization algorithm based on inter-SPE communication.

[Figure: stage diagram showing where synchronization barriers are inserted]
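The deck does not detail the barrier, so as a rough illustration only, here is a tree-structured barrier sketched with Python threads standing in for SPEs (the class and all names are hypothetical; on the Cell/B.E. the signalling would use inter-SPE mailbox/DMA writes rather than events). Arrivals are combined up a binary tree and a release wave propagates back down.

```python
import threading

class TreeBarrier:
    """Single-use tree barrier sketch: thread tid has children
    2*tid+1 and 2*tid+2; arrivals combine toward the root (tid 0),
    then releases fan back out to the leaves."""
    def __init__(self, n):
        self.n = n
        self.arrive = [threading.Event() for _ in range(n)]
        self.release = [threading.Event() for _ in range(n)]

    def wait(self, tid):
        left, right = 2 * tid + 1, 2 * tid + 2
        # combine phase: wait for both children to arrive
        if left < self.n:
            self.arrive[left].wait()
        if right < self.n:
            self.arrive[right].wait()
        if tid != 0:
            self.arrive[tid].set()      # report arrival to parent
            self.release[tid].wait()    # wait for the release wave
        # release phase: wake the children
        if left < self.n:
            self.release[left].set()
        if right < self.n:
            self.release[right].set()
```

A tree barrier needs O(log p) signalling rounds instead of the O(p) of a naive centralized counter, which is the usual motivation for this shape.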

SLIDE 16

FFTC design on Cell/B.E.: Challenges (cont'd)

  • Load balancing to achieve better SPU utilization
    – No SPE should wait at the synchronization barrier.
    – Requires an efficient parallelization technique to allocate data to SPEs.
    – Strategy should be scalable across multiple chips (large number of SPEs).
  • Vectorization is difficult for every stage
    – Stages 1 & 2 do not have a regular data access pattern.
    – Require data reorganization to fully utilize the SPE computational power.
    – Optimizing the first 2 stages becomes important for medium-size inputs, as they may constitute 20-25% of the total running time.

SLIDE 17

FFTC design on Cell/B.E.: Challenges (cont'd)

  • Limited local store
    – Requires space for N/2 twiddle factors and the input data.
    – Loop unrolling and duplication increase the size of the code.
    – Must effectively manage code and data within 256KB.
  • Algorithm is branchy
    – Doubly nested for loop within the outer while loop.
    – Lack of a branch predictor compromises performance.
    – Provide branch hints and restructure the algorithm to eliminate branches.

SLIDE 18

Parallelizing FFTC on the Cell/B.E.

  • Input size N (complex samples). Divide the input array into 2*p chunks, where p is the number of available SPEs.
  • The PPE allocates chunks i and i+p to SPE i, spawns threads, and waits for completion.
  • The data allocation technique is the same at every stage.
  • Efficient technique, as it prevents intervention from the PPE during the computation.
  • Achieves load balancing; each SPE receives an equal amount of work.
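The chunk assignment described above is simple index arithmetic; a minimal sketch (function and names are illustrative, assuming N divisible by 2*p):

```python
def assign_chunks(n, p):
    """Split n samples into 2*p equal chunks; SPE i is assigned
    chunks i and i+p, as half-open [start, end) sample ranges."""
    chunk = n // (2 * p)                 # samples per chunk
    plan = {}
    for spe in range(p):
        plan[spe] = [
            (spe * chunk, (spe + 1) * chunk),            # chunk i
            ((spe + p) * chunk, (spe + p + 1) * chunk),  # chunk i + p
        ]
    return plan
```

Because the mapping is a pure function of (N, p), every SPE can recompute its ranges locally at each stage, which is what lets the computation proceed without PPE intervention.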

SLIDE 19

Optimization for SPE

  • The input data at every stage is fetched using DMA in a multi-buffered way.
    – The block size is limited by a global parameter buffer_size.
  • While-loop duplication for Stages 1 & 2
    – For vectorization of these stages we need to use the SPE shuffle intrinsic on the output vector.
    – [Figures: shuffle patterns for Stage 1 (above) and Stage 2 (below)]
  • Loop duplication increases code size in the already limited local store.
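The multi-buffered fetch pattern can be shown with a tiny double-buffering skeleton. This is an illustration only: on the SPE the fetch is an asynchronous DMA `get` overlapped with computation, whereas this Python stand-in is synchronous; all names are made up.

```python
def process_multibuffered(fetch, compute, n_blocks, buffer_size=2048):
    """Double-buffering skeleton: issue the fetch for block i+1
    before computing on block i, so that (on real hardware) the
    transfer latency hides behind the computation."""
    out = []
    pending = fetch(0, buffer_size)              # prefetch first block
    for i in range(n_blocks):
        current = pending                        # "wait" for the transfer
        if i + 1 < n_blocks:
            pending = fetch(i + 1, buffer_size)  # prefetch the next block
        out.append(compute(current))             # compute on current block
    return out
```

The `buffer_size` parameter mirrors the global limit mentioned above: larger blocks amortize transfer setup cost but consume more of the 256KB local store.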

SLIDE 20

Optimization for SPE (cont'd)

  • Duplicate the while loop: one version when the loop counter is < buffer_size, and another otherwise.
    – Need to stall for the DMA get at different places within the inner for loop.
    – The second case allows for efficient loop unrolling in the inner-most for loop.
  • While-loop duplication for these 2 cases further increases code size, which limits the size of FFT that can be computed using this methodology.

SLIDE 21

Experimental Setup

  • Manual loop unrolling, multi-buffering, inter-SPE communication, odd-even pipelining, vectorization.
  • Instruction-level profiling and performance analysis using Cell SDK 3.0; used the xlc compiler at level 3 optimization.
  • FLOP analysis
    – Operation count: 5*N log N floating-point operations
    – For 2 complex value computations we require one complex subtraction (2 FLOP), one complex addition (2 FLOP), and one complex multiplication (6 FLOP).
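The 5*N*log2(N) operation count is the standard FFT flop-counting convention, and turning a measured running time into a GFlop/s figure is one line of arithmetic (function name is illustrative):

```python
import math

def fft_gflops(n, time_seconds):
    """GFlop/s for an N-point complex FFT using the 5*N*log2(N)
    operation-count convention stated above."""
    flops = 5 * n * math.log2(n)
    return flops / time_seconds / 1e9
```

For example, the reported 18.6 GFlop/s at N = 8192 corresponds to 5 * 8192 * 13 = 532,480 operations, i.e. a running time of roughly 29 microseconds per FFT.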

SLIDE 22

Performance analysis: Scaling across SPEs

[Figures: running time (microseconds) and performance improvement vs. number of SPEs (1, 2, 4, 8) for FFT sizes 1K and 4K]

  • Near-linear scaling from 1 to 8 SPEs; thus it should scale well across multiple chips as well.
  • Speedup increases with larger input size.
SLIDE 23

Performance Comparison of FFTs

[Figure: GigaFlop/s vs. input size (1024-16384) for FFTC (our implementation, 8 SPEs) against IBM Power5, AMD Opteron, Intel Pentium 4, Intel Core Duo, and FFTW on Cell]

* Performance numbers from BenchFFT.

SLIDE 24

FFTC Summary

  • Use various techniques such as manual loop unrolling, multi-buffering, inter-SPE communication, odd-even pipelining, and vectorization to achieve performance on an SPE.
  • Loop duplication increases the code size but helps in further optimizations on an SPE.
  • We demonstrate superior performance of 18.6 GigaFlop/s for an FFT of size 8K-16K, and believe we have the fastest FFT implementation on the Cell/B.E.
  • Code available at http://sourceforge.net/projects/cellbuzz/

SLIDE 25

Discrete Wavelet Transform on Cell/B.E.

  • We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity.
  • We introduce multiple Cell/B.E.- and DWT-specific optimization issues and solutions.
  • Our implementation achieves 34 and 56 times speedup over one-PPE performance, and 4.7 and 3.7 times speedup over a cutting-edge multicore processor (AMD Barcelona), for lossless and lossy DWT, respectively.

SLIDE 26

Discrete Wavelet Transform (in JPEG2000)

  • Decompose an image in both vertical and horizontal directions into the sub-bands (LL, HL, LH, HH) representing the coarse and detail parts, while preserving spatial information.
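To make the transform concrete, here is a 1-D sketch of the reversible 5/3 lifting DWT that JPEG2000 uses for lossless coding (this is the standard integer lifting form, shown for illustration; the deck's implementation is 2-D, vectorized C on the SPEs):

```python
def dwt53_1d(x):
    """One level of the reversible 5/3 lifting DWT on a 1-D integer
    signal of even length, with whole-sample symmetric boundary
    extension. Returns the (low, high) sub-bands."""
    n = len(x)
    assert n % 2 == 0
    def xe(i):            # symmetric extension of the input signal
        return x[-i] if i < 0 else x[2 * (n - 1) - i] if i >= n else x[i]
    # predict step: high-pass (detail) coefficients at odd positions
    high = [xe(2 * i + 1) - (xe(2 * i) + xe(2 * i + 2)) // 2
            for i in range(n // 2)]
    def he(i):            # symmetric extension on the detail band
        return high[0] if i < 0 else high[i]
    # update step: low-pass (coarse) coefficients at even positions
    low = [xe(2 * i) + (he(i - 1) + he(i) + 2) // 4
           for i in range(n // 2)]
    return low, high
```

The full 2-D transform on the slides applies this filtering vertically and then horizontally, and recurses on the LL sub-band per resolution level.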

SLIDE 27

Discrete Wavelet Transform (in JPEG2000)

  • Vertical filtering followed by horizontal filtering
  • Highly parallel but bandwidth intensive
  • Distinct memory access pattern becomes a problem
  • Adopt Jasper [Adams2005] as the baseline code

SLIDE 28

Previous work

  • Column grouping [Chaver2002] to enhance cache behavior in vertical filtering
  • Muta et al. [Muta2007] optimized convolution-based DWT (requires up to 2 times more operations than the lifting-based approach) for the Cell/B.E.
    – High single-SPE performance
    – Does not scale above 1 SPE

SLIDE 29

Data Decomposition Scheme

[Figure: 2-D array decomposition. Each row is padded so that the array width is a multiple of the cache line size and cache-line aligned. Units of data transfer and computation tile the array; units of data distribution are assigned to the SPEs; remainder elements are processed by the PPE.]

SLIDE 30

Data Decomposition Scheme

  • Satisfies the alignment and size requirements for efficient DMA data transfer and vectorization.
  • Fixed LS space requirements regardless of the input image size.
  • Constant loop count: a unit of data transfer and computation has constant width.
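The padding arithmetic behind the scheme is straightforward; a sketch (the 128-byte line size matches the Cell/B.E., but the tile height and all names here are illustrative assumptions):

```python
CACHE_LINE = 128  # Cell/B.E. cache-line / preferred DMA alignment, bytes

def decompose(width, height, elem_size=4, tile_rows=32):
    """Pad each row up to a multiple of the cache line so every DMA
    transfer is aligned, and split the rows into fixed-height tiles;
    leftover rows are the remainder elements handled by the PPE."""
    row_bytes = width * elem_size
    padded_row_bytes = -(-row_bytes // CACHE_LINE) * CACHE_LINE  # round up
    full_tiles, remainder_rows = divmod(height, tile_rows)
    return padded_row_bytes, full_tiles, remainder_rows
```

Because the tile width and height are fixed, the local-store footprint and loop trip counts are constants independent of the image size, which is exactly the property the slide claims.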

SLIDE 31

Vectorization – Real number representation

  • Jasper adopts a fixed-point representation
    – Replaces floating-point arithmetic with fixed-point arithmetic
    – Not a good choice for the Cell/B.E.
  • A 32-bit fixed-point multiply on the SPE requires an instruction sequence such as
      mpyh $5, $3, $4
      mpyh $2, $4, $3
      mpyu $4, $3, $4
      a    $3, $5, $2
      a    $3, $3, $4
    whereas a single fm (floating multiply) suffices.
    – SPE instruction latencies: mpyh 7 cycles, mpyu 7 cycles, a 2 cycles, fm 6 cycles

SLIDE 32

Loop Interleaving

  • In a naïve approach, a single vertical filtering involves 3 or 6 times the data transfer.
  • Bandwidth becomes a bottleneck.
  • Interleave the splitting, lifting, and optional scaling steps.
    – The whole working set does not fit into the LS.

SLIDE 33

Loop Interleaving

  • First interleave the multiple lifting steps.
  • Then merge the splitting step with the interleaved lifting step.

[Figure: interleaved splitting and lifting over coefficients low0-low3 and high0-high3; some coefficients would be overwritten before being read]

  • Use a temporary main-memory buffer for the upper half.

SLIDE 34

Fine–grain Data Transfer Control

  • Initially, we copy data from the buffer after the interleaved loop is finished.
  • Yet, we can start it just after low2 and high2 are read.
  • The Cell/B.E.'s software-controlled DMA data transfer enables this.

[Figure: interleaved splitting and lifting schedule showing the earlier start of the buffer copy]

SLIDE 35

Performance Evaluation

* 3800 x 2600 color image, 5 resolution levels
* Execution time and scalability up to 2 Cell/B.E. chips (IBM QS20)

SLIDE 36

Performance Evaluation: Comparison with x86 Architecture

  • One 3.2 GHz Cell/B.E. chip (IBM QS20)
  • One 2.0 GHz AMD Barcelona chip (AMD Quad-core Opteron 8350)

* Optimization for the Barcelona:
  – Parallelization: OpenMP-based parallelization
  – Vectorization: auto-vectorization with compiler directives
  – Real number representation: identical to the Cell/B.E. case
  – Loop interleaving: identical to the Cell/B.E. case
  – Compiled with run-time profile feedback

[Figure: execution time (ms) for lossless and lossy DWT on Cell/B.E. (base and optimized) and Barcelona (base and optimized); relative speedups of 1.0, 1.9, 2.4, 7.3, 15, 34, and 56 are annotated]

SLIDE 37

DWT Summary

  • The Cell/B.E. has great potential to speed up parallel transforms, but requires careful implementation.
  • We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity.
  • Our implementation demonstrates 34 and 56 times speedup over one PPE, and 4.7 and 3.7 times speedup over the AMD Barcelona processor with one Cell/B.E. chip.
  • The Cell/B.E. can also be used as an accelerator in combination with a traditional microprocessor.

SLIDE 38

IBM QS22

  • Delivers 204.8 Gflop/s (double-precision) peak performance with FMA (fused multiply-and-add), in comparison with 29.2 Gflop/s in the QS20 or QS21.
  • Supports 4 to 32 GB (DDR2) main memory, in comparison with 1 GB (XDR) in the QS20 or QS21.

[Figure: QS22 blade with two PowerXCell 8i processors, each with 102.4 Gflop/s (DP) peak, 25.6 GB/s memory bandwidth, and 2-16 GB DDR2 DRAM]

SLIDE 39

R Optimizations for the Cell

  • Optimize the R statistics package for the Cell/B.E. processor
    – BLAS
    – LAPACK
    – random number generator, and
    – variance/covariance/correlation
  • IBM & Georgia Tech collaboration; freely-available, open-source (GPL) code will be released on SourceForge, based on R-2.7.0.
  • Demonstration of native double-precision performance using the IBM QS22 Blade with dual PowerXCell 8i processors.

SLIDE 40

R Performance on the QS22: Covariance with Pearson's method

  • Covariance computation with Pearson's method (without cache blocking)
  • QS20: was compute bound
  • QS22: is now bandwidth bound

[Figure: Gflop/s on the QS20 vs. the QS22 for 1024 items * 8192 samples/item]
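The kernel being benchmarked is ordinary Pearson (sample) covariance. A scalar reference version, for illustration only (the tuned version is vectorized C running on the SPEs):

```python
def pearson_cov(xs, ys):
    """Sample covariance of two equal-length series:
    cov(X, Y) = sum((x - mean_x) * (y - mean_y)) / (n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y)
               for x, y in zip(xs, ys)) / (n - 1)
```

With 1024 items of 8192 samples each, the pairwise covariance matrix streams a large amount of data per flop, which is consistent with the slide's observation that the QS22 run becomes bandwidth bound.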

SLIDE 41

R Performance on the QS22: Covariance with Kendall's method

  • Covariance computation with Kendall's method
  • Compute bound on both systems
  • Invokes the sign() function in the loop body
    – Needed for correctness checking

[Figure: Gflop/s on the QS20 vs. the QS22 for 128 items * 4096 samples/item]

SLIDE 42

R Performance on the QS22: Covariance, further optimizations

  • Test kernel created by removing the sign() function call in Kendall's method
    – Does not affect correctness
  • Compute bound on both systems

[Figure: Gflop/s on the QS20 vs. the QS22 for 128 items * 4096 samples/item]

SLIDE 43

Financial Services using R on Cell

  • CreditMetrics (R extension package) on the QS22 with optimized BLAS, LAPACK, and RNG libraries
  • Additional optimizations (e.g. modifying the script for a smaller memory footprint, not specific to the Cell/B.E.)

[Figure: run time in seconds, baseline vs. optimized]

SLIDE 44

Acknowledgment of Support

SLIDE 45

Cell/B.E. Apps: Financial Modeling

  • Objective: Demonstrate a competitive edge of the Cell/B.E. for Financial Services.
  • European Option Pricing
    – Black-Scholes equation: dS(t) = μ S(t) dt + σ S(t) dW(t)
  • Collateralized Debt Obligation (CDO) pricing
    – Gaussian copula, Monte Carlo simulation

[Figure: CDO structure. The originating bank sells assets to a Special Purpose Vehicle (SPV) in exchange for principal & interest; tranches: senior 30-70%, mezzanine 5-30%, equity 0-5%; losses accrue between the attachment point a and the detachment point d]

  • Optimize various
    – random number generators: Mersenne Twister, Hammersley sequence, LCG
    – normalization techniques: Box-Muller Polar/Cartesian, Low Distortion Map
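A minimal Monte Carlo pricer for a European call under the Black-Scholes dynamics above can tie these pieces together. This sketch uses the risk-neutral drift r and the Box-Muller polar method mentioned on the slide; the function and parameter names are illustrative, and the deck's version runs vectorized on the SPEs with its own RNGs.

```python
import math
import random

def mc_european_call(s0, strike, r, sigma, t, n_paths, seed=42):
    """Monte Carlo price of a European call: simulate the terminal
    price S(T) = S0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z) under
    dS = r S dt + sigma S dW, average the discounted payoff."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        # Box-Muller polar (Marsaglia) method: one standard normal draw
        while True:
            u = 2.0 * rng.random() - 1.0
            v = 2.0 * rng.random() - 1.0
            s = u * u + v * v
            if 0.0 < s < 1.0:
                break
        z = u * math.sqrt(-2.0 * math.log(s) / s)
        st = s0 * math.exp(drift + vol * z)        # terminal price
        payoff_sum += max(st - strike, 0.0)
    return math.exp(-r * t) * payoff_sum / n_paths
```

For S0 = K = 100, r = 5%, sigma = 20%, T = 1 year, the Black-Scholes closed form gives about 10.45, which the Monte Carlo estimate approaches as the path count grows.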

SLIDE 46

Cell/B.E. Apps: Financial Modeling
Performance Analysis: Random Number Generation

  • Over 3 billion random numbers per second from a single Cell/B.E.

[Figure: random-number generation throughput comparison]

* The performance results on the Intel, AMD and IBM PowerPC processors are from:
M. Saito and M. Matsumoto. Simple and Fast MT: A two-times-faster new variant of Mersenne Twister. In Proc. 7th Intl. Conference on Monte Carlo Methods in Scientific Computing, Germany, 2006.

SLIDE 47

Cell/B.E. Apps: Financial Modeling
Performance Analysis: European Option Pricing

  • 1.5x over an optimized CUDA implementation for the NVIDIA G80. [1]
  • 2x over an optimized RapidMind implementation on Cell. [2]
  • Double precision will be essential.

[1] V. Podlozhnyuk. Monte Carlo Option Pricing. (NVIDIA CUDA) White paper, v1.0, June 2007.
[2] IBM Corporation. The Cell project at IBM Research. White paper.