
Application of Fortran Pthreads on Linear Algebra and Scientific Computing

Clay P. Breshears, Henry A. Gabb, Mark Fahey


SLIDE 1

Application of Fortran Pthreads on Linear Algebra and Scientific Computing

Clay P. Breshears, Henry A. Gabb, Mark Fahey

USAE Waterways Experiment Station Major Shared Resource Center

SLIDE 2

Acknowledgements

This work was funded by the DoD High Performance Computing Modernization Program CEWES Major Shared Resource Center through Programming Environment and Training (PET), Contract Number: DAHC 94-96-C0002, Nichols Research Corporation.

SLIDE 3

Introduction

• Last year: Introduced F90 Pthreads API
• This year: Are they really useful?
  – How easy is programming?
  – Can we get decent parallel performance?
  – Are there algorithmic considerations?
  – Are there external considerations?
• Must be within a user's attention span

SLIDE 4

Pthreads

• POSIX standard for thread functions
  – Thread management
  – Mutual exclusion
  – Condition variables
  – Attributes
• Only defined in C

SLIDE 5

F90 Pthreads API

• F90 subroutines interface to C functions
• 'f' prefix to name
  – fpthread_create
  – Similar to PVM, MPI
  – Error code as final argument
• F90 module and C wrapper routines
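A C wrapper in this style might look as follows. This is a sketch, not the authors' actual code: the trailing-underscore name mangling and the wrapped mutex routines are illustrative assumptions; the only conventions taken from the slides are the `f` prefix, by-reference arguments, and the error code as the final argument (as in PVM and MPI).

```c
#include <pthread.h>

/* Hypothetical C wrappers in the style of the F90 Pthreads API described
 * above: Fortran passes arguments by reference, and the error code comes
 * back through the final argument. The trailing underscore follows a
 * common Fortran name-mangling convention (an assumption here). */
void fpthread_mutex_lock_(pthread_mutex_t *mutex, int *ierr)
{
    *ierr = pthread_mutex_lock(mutex);
}

void fpthread_mutex_unlock_(pthread_mutex_t *mutex, int *ierr)
{
    *ierr = pthread_mutex_unlock(mutex);
}
```

On the Fortran side, an F90 module would declare these as subroutines, so a call reads `call fpthread_mutex_lock(mutex, ierr)`.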

SLIDE 6

Problems Considered

• Matrix Multiplication
• Direct Solution of Linear Systems
  – Gaussian Elimination and Back Substitution
• Command, Control, Communication, and Intelligence (C3I) Benchmarks
  – Map-Image Correlation
  – Terrain Masking

SLIDE 7

Project Goals

• Apply threaded programming techniques to scientific computations
• Demonstrate and exploit concurrency in numeric codes
• Achieve execution speed-up with multiple threads/processors

SLIDE 8

NOT a Project Goal...

To produce the fastest executing versions of codes and algorithms examined

Our Emphasis:
• How easy is the method to use?
• How much speed-up might be expected?

SLIDE 9

Matrix Multiplication

• Sparse matrix-matrix multiplication
• Row-major linked list data structure
  – Direct Methods for Sparse Matrices by Duff, Erisman and Reid
• IKJ loop structure
  – C(I, :) = C(I, :) + A(I, K) * B(K, :)
  – Ith row of C; Kth row of B
  – Scalar from A (accessed across Ith row)

SLIDE 10

Thread Algorithm

While more rows to process
  get next row number (lock shared counter)
  for each element in row of A do
    copy appropriate row of B to local vector
    multiply vector by scalar from A
    add results to local "summation" vector
  add "summation" vector to row of C
    (lock data structure to prevent overwriting)
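The "get next row number (lock shared counter)" step is the classic mutex-guarded work counter. A minimal C sketch, with illustrative names (the original code would do this through the fpthread wrappers):

```c
#include <pthread.h>

/* Sketch of the shared-counter work distribution used in the thread
 * algorithm above: each thread locks a mutex, takes the next row
 * number, and unlocks. Names are illustrative, not from the original. */
static int next_row = 0;
static int num_rows = 1000;
static pthread_mutex_t row_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the next unprocessed row index, or -1 when none remain. */
int get_next_row(void)
{
    int row;
    pthread_mutex_lock(&row_lock);
    row = (next_row < num_rows) ? next_row++ : -1;
    pthread_mutex_unlock(&row_lock);
    return row;
}
```

Each worker loops on `get_next_row()` until it returns -1, which gives dynamic load balance when rows have uneven numbers of nonzeros.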

SLIDE 11

1000x1000 Sparse Matrix Multiplication (< 10K NZ)

• SGI/Cray Origin 2000 (IRIX 6.4)

# of Threads   Time (seconds)
1              0.259
2              0.261
4              0.261
8              0.264
16             0.270
32             0.301
64             0.324
128            0.364

SLIDE 12

Dense Matrices

• Not enough work in sparse case
• IKJ loop structure
  – row-major access
• JKI loop structure
  – C(:, J) = C(:, J) + A(:, K) * B(K, J)
  – Jth column of C; Kth column of A
  – Scalar from B (accessed down Jth column)
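The IKJ order from slide 9 can be sketched in a few lines. One caveat when translating: the slides assume Fortran's column-major storage (which is why JKI wins there), whereas C arrays are row-major, so in C it is IKJ that walks memory contiguously. The function and size below are illustrative.

```c
#include <stddef.h>

#define N 64

/* Sketch of the IKJ loop order, C(I,:) += A(I,K) * B(K,:).
 * In row-major C, the inner J loop walks a row of B and a row of C
 * contiguously, with a single scalar pulled from A per K step. */
void matmul_ikj(double A[N][N], double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            double a = A[i][k];          /* scalar from A */
            for (size_t j = 0; j < N; j++)
                C[i][j] += a * B[k][j];  /* row of B into row of C */
        }
}
```

Swapping the loop nest to JKI in C would stride across rows and thrash the cache, the mirror image of the Fortran behavior measured on the next slide.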

SLIDE 13

1000x1000 Dense Matrix Multiplication

• SGI/Cray Origin 2000 (IRIX 6.4)

# of Threads   IKJ Time (seconds)   JKI Time (seconds)
1              75.5                 29.8
2              42.4                 20.2
4              22.4                 11.2
8              11.9                 6.1
16             6.3                  3.4
32             3.6                  2.0
64             3.2                  1.9
128            3.0                  2.3

SLIDE 14

Solution of Linear Systems

• Gaussian Elimination with Back Substitution
  – simple method with row updates
  – diminishing amounts of work

Q: How do we distribute work evenly among threads throughout computation?
A: Cyclic distribution of rows

SLIDE 15

Cyclic Row Partitioning

[Figure: matrix A with rows distributed cyclically among threads; pivot row, solution vector x, right-hand side b, and row index shown]

SLIDE 16

Thread Algorithm

do i = 1, NUMROWS
  if (i mod myid == 0) then
    save pivot, row(i)
  else
    wait for pivot to be saved
  endif
  copy row(i), pivot to local row, pivot
  do j = i+1, NUMROWS
    if (j mod myid == 0) then
      compute row multiplier
      update row(j) with local row
    endif
  end do
end do

indx = NUMROWS
while (indx > 0)
  while (indx mod myid == 0)
    fpthread_cond_wait
  compute x(indx)
  indx = indx - 1
  fpthread_cond_broadcast
end while
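The back-substitution handoff above maps onto the underlying C condition-variable calls that fpthread_cond_wait and fpthread_cond_broadcast wrap. A single-lock sketch follows; the names are illustrative, and the ownership test (`indx % nthreads == myid`) is one plausible reading of the slide's cyclic distribution, not the authors' exact code.

```c
#include <pthread.h>

/* Sketch of the condition-variable handoff in the back-substitution
 * loop: threads sleep until the shared countdown index reaches a row
 * they own, then claim it and wake the others. */
static pthread_mutex_t indx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  indx_cond = PTHREAD_COND_INITIALIZER;
static int indx;  /* shared row index, counts down to 0 */

/* Block until the shared index reaches a row owned by this thread,
 * claim it, decrement, and broadcast so other threads recheck. */
int claim_my_row(int myid, int nthreads)
{
    int row;
    pthread_mutex_lock(&indx_lock);
    while (indx > 0 && indx % nthreads != myid)
        pthread_cond_wait(&indx_cond, &indx_lock);  /* drops lock while waiting */
    row = indx;
    if (indx > 0) {
        indx = indx - 1;                     /* ... compute x(row) here ... */
        pthread_cond_broadcast(&indx_cond);  /* wake threads awaiting other rows */
    }
    pthread_mutex_unlock(&indx_lock);
    return row;
}
```

The broadcast (rather than signal) matters: only one specific thread owns the next row, and a broadcast guarantees that thread rechecks the predicate.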

SLIDE 17

Cyclic Row Partitioning

[Figure: matrix A with rows distributed cyclically among threads; pivot row, solution vector x, right-hand side b, and row index shown]

SLIDE 18

But What About...

• Column-major ordering is Fortran standard
  – Helped with Matrix Multiply code
• Modify threaded algorithm
  – Transpose A matrix after input
    • library routine is threaded
  – Swap A(i,j) indices for A(j,i)

SLIDE 19

Timing Results

[Table: F90 threaded Gaussian Elimination with back substitution, 2000x2000 system of equations on SGI/Cray Origin2000; Row-Major, Transpose, and Column-Major times for 1, 2, 4, 8, 16, and 32 threads]

SLIDE 20

But What About...

• ...Memory contention of threads if matrices stored on a single processor?
  – Used _DSM_ROUND_ROBIN to no significant effect
  – Used A(CYCLIC,*) distribution to no significant effect
• ...Distribution of threads to processors?

SLIDE 21

pthread_setconcurrency

• SGI extension to Pthreads
• Wrote F90 wrapper
• Set to number of threads executing
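Such a wrapper might look like the sketch below, following the error-code-last convention the deck describes; the wrapper name and mangling are assumptions. `pthread_setconcurrency` hints to an M:N thread library how many kernel execution vehicles to use, which is why matching it to the thread count helps on IRIX.

```c
#include <pthread.h>

/* Hypothetical F90-style wrapper around pthread_setconcurrency,
 * in the same by-reference, error-code-last style as the rest of
 * the fpthread API described in these slides. */
void fpthread_setconcurrency_(int *level, int *ierr)
{
    *ierr = pthread_setconcurrency(*level);
}
```

A Fortran caller would then write `call fpthread_setconcurrency(nthreads, ierr)` right after deciding how many threads to create.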

SLIDE 22

Added Timing Results

[Table: same 2000x2000 Gaussian Elimination timings on SGI/Cray Origin2000 with an added Transpose + fp_setconcurrency column for 1, 2, 4, 8, 16, and 32 threads]

SLIDE 23

10K System of Equations

• Transpose algorithm
  – 32 threads
  – 10144 seconds on Origin2000
• Transpose with setconcurrency
  – 32 threads
  – 8608 seconds on Origin2000

SLIDE 24

C3I Benchmarks

• U.S. AFRL Information Directorate
  – Rome Research Site (Griffiss AFB)
• 10 non-real-time C3I functions
  – diverse
  – computationally challenging
  – representative of C3I systems
• Spec, sequential code, associated dataset

SLIDE 25

Map-Image Correlation

• Surveillance data from remote sensors
  – space-based infrared satellites
  – remotely-piloted vehicles
  – intelligence photographs
• Determine the alignment of features in the images with a detailed map of the area
• Potential for comparing "before" and "after" images

SLIDE 26

Example Problem

Satellite photo (0900 hours)
Spy plane photo (1300 hours)

Translation and matching code: find the matching rotation of physical features, and realize a ship has departed

SLIDE 27

Potentials for Concurrency

• Each image is independent
• 2-D Fast Fourier Transform
  – Each column is independent 1-D FFT
  – Transpose
  – Each column (original row) is independent
• FFT of finite number of rotations is independent

SLIDE 28

Thread Algorithm

Create thread for each image (both original and rotations):
  Discretize image
  For each column in image create thread for 1-D FFT
  Transpose image array
  For each column in image create thread for 1-D FFT
  Join threads
Correlate images (using threaded Inverse Fourier Transform)
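The column-parallel 2-D FFT structure above can be sketched in C. The slides call the Cray SciLib 1-D FFT through F90 wrappers; `one_d_fft` below is a runnable stand-in (it merely doubles each element) so only the thread fan-out/join structure should be read as the point of the example.

```c
#include <pthread.h>

/* Skeleton of the 2-D FFT threading: one thread per column for a 1-D
 * transform; the full algorithm then transposes and repeats. */
#define N 8
static double grid[N][N];

static void one_d_fft(double *col, int n)   /* placeholder transform */
{
    for (int i = 0; i < n; i++)
        col[i] *= 2.0;
}

static void *column_worker(void *arg)
{
    int j = (int)(long)arg;
    double col[N];
    for (int i = 0; i < N; i++) col[i] = grid[i][j];  /* gather column j */
    one_d_fft(col, N);
    for (int i = 0; i < N; i++) grid[i][j] = col[i];  /* scatter back */
    return NULL;
}

/* Transform every column concurrently, then join. */
static void fft_all_columns(void)
{
    pthread_t tid[N];
    for (long j = 0; j < N; j++)
        pthread_create(&tid[j], NULL, column_worker, (void *)j);
    for (int j = 0; j < N; j++)
        pthread_join(tid[j], NULL);
}
```

Calling `fft_all_columns()`, transposing `grid`, and calling it again gives the row pass, matching the "transpose, then each column (original row)" decomposition on the previous slide.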

SLIDE 29

Implementation Details

• Two 1024x1024 Grids
• Use 1-D FFT routine from Cray SciLib
• Total of three 2-D FFTs
  – Two images
  – Inverse FFT for correlation
• Use pthread_setconcurrency

SLIDE 30

Map-Image Timing Results

Single Threaded 2-D FFT (Cray Sci Library): < 2 seconds

# of Threads   Time (seconds)
1              155
2              85
4              47
8              28
16             19
32             14
64             16

Threaded Image Correlation of Two 1024x1024 grids on SGI Origin2000

SLIDE 31

Terrain Masking

• Used in aircraft flight mission computer systems to aid in attack, covert, and evasive flight operations
• Compute evasive routes with low observability
  – given a set of threats and their positions

SLIDE 32

Problem Description

• Input:
  – 2-D relief map (grid of surface elevations)
  – Position and range of threats within region
• Output:
  – Original map plus masking altitudes
    • minimum visible altitude at grid points

SLIDE 33

Threat Range and Line of Sight

[Figure: enemy sensor over a terrain profile; points within range and in line-of-sight are visible, terrain beyond is masked]

SLIDE 34

Concurrent Method 1

For each threat
  determine range boundaries of threat
  divide range into N sectors
  create N threads, assign one per sector
  join threads

SLIDE 35

Sector Concurrency Model

[Figure: a threat's range divided into N sectors, one thread per sector]
slide-36
SLIDE 36

Concurrent Method 2 Concurrent Method 2

Determine number of sectors, N Determine number of sectors, N create M (number of threats) threads create M (number of threats) threads do I = 1, N do I = 1, N for each threat compute sector I for each threat compute sector I lock potential overlap areas lock potential overlap areas synchronize M threads synchronize M threads end do end do

SLIDE 37

Threat Concurrency Model

Create and use a mutex for each overlapping column
slide-38
SLIDE 38

Terrain Masking Results Terrain Masking Results

NUMBER OF THREADS

1 2 4 8 16

Time (minutes)

25 14 8 5.5 3.5

Terrain Masking Timing (in minutes) for 6000x6000 element grid with 90 threats on SGI Origin2000

With or without pthread_setconcurrency yielded little difference.

SLIDE 39

Conclusions

• Threaded codes can demonstrate speed-up
• Minor coding changes to add Pthreads
  – task parallelism or functional units
• Able to take advantage of nested parallelism
• Must still be aware of architectural quirks
  – cache access patterns
  – data distribution and memory contention
  – thread to processor mapping