Distributed Computation of Feature-Detectors for Medical Image Processing on GPGPU and Cell Processors



SLIDE 1

Peter Zinterhof Scientific Computing, Salzburg University

Distributed Computation of Feature-Detectors for Medical Image Processing on GPGPU and Cell Processors

Peter Zinterhof

Third Workshop on UnConventional High Performance Computing 2010 (UCHPC 2010), Ischia – Naples, Italy

SLIDE 2

Task:

  • Automated recognition of 'regions of interest' within medical imaging data
  • Here: recognition of kidneys within CT slices

SLIDE 3

General approach: Evolve filter sets that generate feature vectors, which can be classified by means of a Kohonen self-organizing map.

Image filtering:

  • Filter: 4 x 4 pixels, fed into a perceptron
  • The perceptron sums its 16 inputs linearly and fires if the threshold θ is exceeded
  • The filter is applied as a sliding window (convolution)
  • Each filter delivers a single number for one image
  • → aggregate 8 filters into a feature vector of length 8
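The filtering step can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the presented implementation: the function name `perceptron_fire_count`, the constants, and the use of raw pixel intensities as perceptron inputs are all made up for the example.

```c
#define IMAGE_SIZE 64   /* training images are 64 x 64 pixels */
#define FILTER_SIZE 4   /* each filter covers 4 x 4 pixels */

/* Slide one 4 x 4 perceptron over the image and count how often it fires.
 * Assumption: the perceptron input is the raw pixel intensity. */
int perceptron_fire_count(unsigned char image[IMAGE_SIZE][IMAGE_SIZE],
                          float weights[FILTER_SIZE][FILTER_SIZE],
                          float theta)
{
    int positions = IMAGE_SIZE - FILTER_SIZE + 1;  /* window positions per axis */
    int fires = 0;

    for (int y = 0; y < positions; y++) {
        for (int x = 0; x < positions; x++) {
            float sum = 0.0f;
            /* linear sum over the 16 inputs of the window */
            for (int r = 0; r < FILTER_SIZE; r++)
                for (int c = 0; c < FILTER_SIZE; c++)
                    sum += weights[r][c] * (float)image[y + r][x + c];
            if (sum > theta)  /* the perceptron fires */
                fires++;
        }
    }
    return fires;
}
```

Calling this function once per filter with eight different weight sets would yield the eight numbers that make up the feature vector of one image.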

SLIDE 4

Training phase:

  1. Apply eight perceptron-based filters to the training images (64 x 64 pixels)
  2. Sum up fire events to create feature vectors (8-bin histograms)
  3. Train the Kohonen map
  4. Repeat steps 1-3 for many 'individuals' in parallel and compute the fitness of each filter set (recognition rate)
  5. Apply an evolutionary strategy to the perceptrons to evolve perceptrons that eventually generate high recognition rates

Recall phase:

  1. Apply the best filter set to the image to be classified (kidney / non-kidney)
  2. Use the pre-computed Kohonen map to classify the feature vector
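Step 3 of the training phase, and the map lookup used in the recall phase, might look roughly like the following C sketch. The slides do not state the map size, learning rate, or neighborhood function, so the 4 x 4 grid, the `rate` and `radius` parameters, and the simple square ("bubble") neighborhood are assumptions chosen for brevity.

```c
#define SOM_DIM 8     /* feature-vector length (one entry per filter) */
#define SOM_W 4       /* hypothetical map width */
#define SOM_H 4       /* hypothetical map height */

/* Return the index of the map node closest to x (squared Euclidean distance). */
int som_best_match(float map[SOM_W * SOM_H][SOM_DIM], const float x[SOM_DIM])
{
    int best = 0;
    float best_d = -1.0f;  /* negative means "no candidate yet" */
    for (int n = 0; n < SOM_W * SOM_H; n++) {
        float d = 0.0f;
        for (int k = 0; k < SOM_DIM; k++) {
            float diff = map[n][k] - x[k];
            d += diff * diff;
        }
        if (best_d < 0.0f || d < best_d) {
            best_d = d;
            best = n;
        }
    }
    return best;
}

/* One training step: pull the best-matching node and all grid neighbors
 * within 'radius' toward x by the fraction 'rate'. Returns the winner. */
int som_train_step(float map[SOM_W * SOM_H][SOM_DIM], const float x[SOM_DIM],
                   float rate, int radius)
{
    int bmu = som_best_match(map, x);
    int bx = bmu % SOM_W, by = bmu / SOM_W;
    for (int n = 0; n < SOM_W * SOM_H; n++) {
        int dx = n % SOM_W - bx;
        int dy = n / SOM_W - by;
        if (dx < 0) dx = -dx;
        if (dy < 0) dy = -dy;
        if (dx <= radius && dy <= radius)
            for (int k = 0; k < SOM_DIM; k++)
                map[n][k] += rate * (x[k] - map[n][k]);
    }
    return bmu;
}
```

In the recall phase only `som_best_match` would be needed: the class label attached to the winning node decides between kidney and non-kidney.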

SLIDE 5

Solution:

  • Use of clusters of General Purpose Graphics Processing Units (GPGPUs) and Cell processors enables training within days instead of many months.
  • Two levels of parallelism
      – coarse grained: systems level
      – fine grained: image filtering, Kohonen map
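For the coarse-grained level, one plausible scheme is to hand each node a contiguous block of 'individuals' to evaluate. The helper below is a hypothetical sketch of such a block distribution; the slides do not describe how the work is actually divided, and the name `block_range` and its parameters are invented for the example.

```c
/* Compute the contiguous block of individuals owned by one worker.
 * total:   number of individuals in the population
 * workers: number of nodes taking part
 * rank:    index of this worker, 0 <= rank < workers
 * The first (total % workers) workers receive one extra individual. */
void block_range(int total, int workers, int rank, int *first, int *count)
{
    int base = total / workers;
    int extra = total % workers;
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

With an MPI library such as mpich2, `rank` and `workers` would typically come from `MPI_Comm_rank` and `MPI_Comm_size`.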

SLIDE 6

  • GPU: mpich2 + gbic
  • GPU + Cell: mpich2 + pvm3 + gbic

SLIDE 7

Sample code GPU: image filtering

dim3 dimBlock(60, 8);     // 480 threads per block
dim3 dimGrid(images, 1);  // 12000 blocks → more than 5.7 million threads

__global__ void Perceptrons(unsigned char *a1, float *weights,
                            unsigned char *border, int *ReturnVectors_dlocal)
{
    __shared__ float w[Channels][4][4];
    __shared__ unsigned char borders[Channels+Channels][4][4];
    __shared__ unsigned char a[64][64];

    // local declarations are not shown on the slides; types inferred from use
    float *aw, sum;
    unsigned char *ue, *uw;
    int i, j, base, mask, Counter;

    if (threadIdx.y == 0) {  // fetch data from global memory
        aw = &w[0][0][0];
        for (i = threadIdx.x; i < (1*Channels*16); i += 60) {

SLIDE 8

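            aw[i] = weights[i];
        }
        ue = &borders[0][0][0];
        for (i = threadIdx.x; i < (2*Channels*16); i += 60) {
            ue[i] = border[i];
        }
        base = blockIdx.x * 4096;  // each block works on one 64 x 64 image
        uw = &a[0][0];
        for (i = threadIdx.x; i < 4096; i += 60) {
            uw[i] = a1[base + i];
        }
    }
    // Not shown on the slide, but required: all threads of the block must
    // wait until the shared-memory copies above are complete.
    __syncthreads();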

SLIDE 9

    mask = threadIdx.y;  // which of the 8 filters to compute locally
    Counter = 0;
    for (j = 0; j < 60; j++) {
        sum = 0.0;
        if ((a[j+0][threadIdx.x+0] >= borders[mask][0][0]) &&
            (a[j+0][threadIdx.x+0] <  borders[mask+Channels][0][0]))
            sum += w[mask][0][0];
        if ((a[j+1][threadIdx.x+0] >= borders[mask][0][1]) &&
            (a[j+1][threadIdx.x+0] <  borders[mask+Channels][0][1]))
            sum += w[mask][0][1];
        if ((a[j+2][threadIdx.x+0] >= borders[mask][0][2]) &&
            (a[j+2][threadIdx.x+0] <  borders[mask+Channels][0][2]))
            sum += w[mask][0][2];
        if ((a[j+3][threadIdx.x+0] >= borders[mask][0][3]) &&
            (a[j+3][threadIdx.x+0] <  borders[mask+Channels][0][3]))

SLIDE 10

            sum += w[mask][0][3];
        if (sum > THETA) {
            Counter++;
        }
    }  // next j
    atomicAdd(&ReturnVectors_dlocal[(blockIdx.x*Channels) + mask], Counter);
}
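For checking the kernel's output on the host, a sequential reference in plain C can be useful. The sketch below mirrors the band test seen in the kernel (a weight is added only if the pixel lies in the half-open interval between a lower and an upper border) for a single filter and a single image. The slides show only four of the filter taps, so the full 4 x 4 loop, the `img[y + r][x + c]` indexing, and all names are assumptions.

```c
#define REF_IMG 64     /* image edge length in pixels */
#define REF_TAPS 4     /* filter edge length */
#define REF_STEPS 60   /* window positions per axis, as in the kernel */

/* Count the window positions at which one band-test perceptron fires.
 * lo/hi: per-tap lower (inclusive) and upper (exclusive) intensity borders
 * w:     per-tap weight, added when the pixel falls inside the band */
int band_filter_count(unsigned char img[REF_IMG][REF_IMG],
                      unsigned char lo[REF_TAPS][REF_TAPS],
                      unsigned char hi[REF_TAPS][REF_TAPS],
                      float w[REF_TAPS][REF_TAPS],
                      float theta)
{
    int count = 0;
    for (int y = 0; y < REF_STEPS; y++) {
        for (int x = 0; x < REF_STEPS; x++) {
            float sum = 0.0f;
            for (int r = 0; r < REF_TAPS; r++) {
                for (int c = 0; c < REF_TAPS; c++) {
                    unsigned char p = img[y + r][x + c];
                    if (p >= lo[r][c] && p < hi[r][c])
                        sum += w[r][c];
                }
            }
            if (sum > theta)
                count++;
        }
    }
    return count;
}
```

The value returned for filter `mask` and image `blockIdx.x` corresponds to the entry the kernel accumulates with `atomicAdd`, provided the tap layout assumed here matches the full kernel.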

SLIDE 11

Code sample Cell: nearest-neighbor search

// *********************************************************************
// DEMONSTRATION OF spu_cmpgt and spu_sel for a nearest-neighbor search
// mymap: array of Kohonen-map vectors (1/6 of total map)
// testvector: random vector, whose nearest neighbor is computed
// *********************************************************************
for (i = 0; i < RES*DIM; i++) {
    sum = (vector float){0.0, 0.0, 0.0, 0.0};
    current_nr = spu_splats(i);
    diff = mymap[i][0] - testvector[0];
    sum = spu_madd(diff, diff, sum);
    diff = mymap[i][1] - testvector[1];
    sum = spu_madd(diff, diff, sum);
    ...
    diff = mymap[i][7] - testvector[7];
    sum = spu_madd(diff, diff, sum);
    mask = spu_cmpgt(localbest, sum);
    // if localbest > sum -> arg 1, else arg 0 is new minimum
    localbest = spu_sel(localbest, sum, mask);
    localbest_nr = spu_sel(localbest_nr, current_nr, mask);
}
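The compare-and-select pattern above keeps four running minima at once, one per vector lane, so a final reduction across the lanes is still needed. The portable C sketch below emulates this with a four-element struct in place of the SPU `vector float` type. The type `vec4f`, the function `nearest_node`, and the data layout (lane `l` of entry `i` belongs to map node `i * 4 + l`) are assumptions for illustration, not taken from the slides.

```c
#include <float.h>

#define LANES 4     /* a 'vector float' holds four floats */
#define NN_DIM 8    /* feature-vector length */

/* Hypothetical stand-in for the SPU 'vector float' type. */
typedef struct { float lane[LANES]; } vec4f;

/* Return the index of the map node nearest to 'test'.
 * Assumed layout: map[i][k].lane[l] is component k of node i * LANES + l. */
int nearest_node(vec4f map[][NN_DIM], int groups, const float test[NN_DIM])
{
    float best[LANES];
    int best_nr[LANES];
    for (int l = 0; l < LANES; l++) {
        best[l] = FLT_MAX;
        best_nr[l] = 0;
    }

    for (int i = 0; i < groups; i++) {
        float sum[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int k = 0; k < NN_DIM; k++) {
            for (int l = 0; l < LANES; l++) {
                float diff = map[i][k].lane[l] - test[k];
                sum[l] += diff * diff;              /* role of spu_madd */
            }
        }
        for (int l = 0; l < LANES; l++) {
            int take = best[l] > sum[l];            /* role of spu_cmpgt */
            best[l] = take ? sum[l] : best[l];      /* role of spu_sel */
            best_nr[l] = take ? i : best_nr[l];
        }
    }

    /* Reduce the four per-lane minima to a single winner. */
    int win = 0;
    for (int l = 1; l < LANES; l++)
        if (best[l] < best[win])
            win = l;
    return best_nr[win] * LANES + win;
}
```

On the SPU the inner lane loops disappear because each intrinsic processes all four lanes in one instruction, and the select avoids a data-dependent branch.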

SLIDE 12

Code sample Cell: Improved barrier function

  • Transporting nearest-neighbor information 'piggy-back' style yields approx. 14 % speedup

SLIDE 13

Benchmarks:

SLIDE 14

SLIDE 15

Conclusions:

  • Kohonen map training is 12x faster on the Cell processor than on a single i7 core (920, 2.67 GHz)
  • In conjunction with a single GPGPU (used for filtering only), Cell reaches 50 % of a GPGPU's performance; even in a cluster of 8 PS3 consoles, one additional GPGPU suffices
  • GPGPUs reach an overall speedup of 22x over a Xeon
  • Combining 8 GPGPUs with 8 Cells yields a speedup of ~248x over a single i7 core
  • → 5 minutes on a cluster of 'unconventional hardware' is roughly equivalent to 1 day on the CPU (248 x 5 min ≈ 20.7 h)
