SLIDE 1
Peter Zinterhof Scientific Computing, Salzburg University
Distributed Computation of Feature-Detectors for Medical Image Processing on GPGPU and Cell Processors Peter Zinterhof
Third Workshop on UnConventional High Performance Computing 2010 (UCHPC 2010), Ischia – Naples, Italy
SLIDE 2
Task:
- Automated recognition of 'regions of interest' within medical imaging data
- Here: recognition of kidneys within CT slices
SLIDE 3
General approach: Evolve filter sets that generate feature vectors, which can be classified by means of a Kohonen self-organizing map.
Image filtering:
- Filter: 4 x 4 pixels, fed into a perceptron
- The perceptron sums its 16 inputs linearly and fires if a threshold is exceeded
- The filter is applied as a sliding window (convolution)
- Each filter delivers a single number, so for one image the 8 filters aggregate into a feature vector of length 8
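The filtering step can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the flat array layout, and the single shared threshold `theta` are hypothetical, and the window is moved over every valid position of a 64 x 64 image.

```c
#define IMG 64   /* image edge length (training images are 64 x 64) */
#define WIN 4    /* filter edge length (4 x 4 pixels) */

/* Count how often one 4 x 4 perceptron fires while sliding over the image.
   img:     IMG * IMG pixels, row-major
   weights: WIN * WIN weights, row-major (hypothetical layout)
   theta:   firing threshold (hypothetical parameter) */
int count_fire_events(const unsigned char *img, const float *weights,
                      float theta)
{
    int fires = 0;
    for (int y = 0; y + WIN <= IMG; y++) {
        for (int x = 0; x + WIN <= IMG; x++) {
            float sum = 0.0f;
            /* linear sum over the 16 inputs of the window */
            for (int dy = 0; dy < WIN; dy++)
                for (int dx = 0; dx < WIN; dx++)
                    sum += weights[dy * WIN + dx]
                         * (float)img[(y + dy) * IMG + (x + dx)];
            if (sum > theta)   /* the perceptron fires */
                fires++;
        }
    }
    return fires;
}

/* Apply n_filters filters to one image; each filter yields one number. */
void feature_vector(const unsigned char *img, const float *weights,
                    float theta, int n_filters, int *features)
{
    for (int f = 0; f < n_filters; f++)
        features[f] = count_fire_events(img, &weights[f * WIN * WIN], theta);
}
```

Calling `feature_vector` with `n_filters = 8` produces the length-8 feature vector described above.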
SLIDE 4
Training phase:
1. Apply eight perceptron-based filters to the training images (64 x 64 pixels)
2. Sum up fire events to create feature vectors (8-bin histograms)
3. Train the Kohonen map
4. Repeat steps 1-3 for many 'individuals' in parallel and compute the fitness of each filter set (recognition rate)
5. Apply an evolutionary strategy to the perceptrons to evolve perceptrons that eventually generate high recognition rates
Recall phase:
1. Apply the best filter set to the image to be classified (kidney / non-kidney)
2. Use the pre-computed Kohonen map to classify the feature vector
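Both training the Kohonen map and classifying with it rest on finding the map node closest to a feature vector. The sketch below shows one way this could look. It assumes a 1-D chain of nodes, a rectangular neighborhood, and a fixed learning rate; none of these details are given on the slides, and all names are hypothetical.

```c
#include <float.h>

#define DIM 8   /* feature-vector length (8 filters) */

/* Squared Euclidean distance between two DIM-dimensional vectors. */
static float dist2(const float *a, const float *b)
{
    float sum = 0.0f;
    for (int k = 0; k < DIM; k++) {
        float diff = a[k] - b[k];
        sum += diff * diff;
    }
    return sum;
}

/* Index of the map node closest to x (the best-matching unit).
   map holds n_nodes vectors of DIM floats each. */
int best_matching_unit(const float *map, int n_nodes, const float *x)
{
    int best = 0;
    float best_d = FLT_MAX;
    for (int n = 0; n < n_nodes; n++) {
        float d = dist2(&map[n * DIM], x);
        if (d < best_d) {
            best_d = d;
            best = n;
        }
    }
    return best;
}

/* One training step: pull the best-matching unit and its neighbors
   within `radius` (assumed 1-D topology) toward the sample x. */
void train_step(float *map, int n_nodes, const float *x,
                float rate, int radius)
{
    int bmu = best_matching_unit(map, n_nodes, x);
    for (int n = bmu - radius; n <= bmu + radius; n++) {
        if (n < 0 || n >= n_nodes)
            continue;
        for (int k = 0; k < DIM; k++)
            map[n * DIM + k] += rate * (x[k] - map[n * DIM + k]);
    }
}
```

In the recall phase, a class label stored per node (kidney / non-kidney) could then be looked up at the index returned by `best_matching_unit`.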
SLIDE 5
Solution:
- Use of clusters of General Purpose Graphics Processing Units (GPGPUs) and Cell processors enables training within days instead of many months.
- Two levels of parallelism:
  coarse grained: systems level
  fine grained: image filtering, Kohonen map
SLIDE 6
GPU: mpich2 + gbic
GPU + Cell: mpich2 + pvm3 + gbic
SLIDE 7
Sample code (GPU): image filtering

dim3 dimBlock(60, 8);     // 60 x 8 = 480 threads per block
dim3 dimGrid(images, 1);  // 12000 blocks, i.e. more than 5.7 million threads

__global__ void Perceptrons(unsigned char *a1, float *weights,
                            unsigned char *border, int *ReturnVectors_dlocal)
{
    __shared__ float w[Channels][4][4];
    __shared__ unsigned char borders[Channels+Channels][4][4];
    __shared__ unsigned char a[64][64];

    // local variables (their declarations are not shown on the slides)
    float *aw, sum;
    unsigned char *ue, *uw;
    int i, j, base, mask, Counter;

    if (threadIdx.y==0) {   // fetch data from global memory
        aw = &w[0][0][0];
        for (i=threadIdx.x; i < (1*Channels*16); i+=60) {
SLIDE 8
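            aw[i] = weights[i];
        }
        ue = &borders[0][0][0];
        for (i=threadIdx.x; i < (2*Channels*16); i+=60) {
            ue[i] = border[i];
        }
        base = blockIdx.x*(4096);   // each block handles one 64 x 64 image
        uw = &a[0][0];
        for (i=threadIdx.x; i < 4096; i+=60) {
            uw[i] = a1[base+i];
        }
    }
    __syncthreads();   // shared memory must be filled before any thread reads it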
SLIDE 9
    mask=threadIdx.y;   // which of the 8 filters to compute locally
    Counter=0;
    for (j=0; j < 60; j++) {
        sum=0.0;
        if ((a[j+0][threadIdx.x+0]>=borders[mask][0][0])&&
            (a[j+0][threadIdx.x+0]<borders[mask+Channels][0][0]))
            sum+=w[mask][0][0];
        if ((a[j+1][threadIdx.x+0]>=borders[mask][0][1])&&
            (a[j+1][threadIdx.x+0]<borders[mask+Channels][0][1]))
            sum+=w[mask][0][1];
        if ((a[j+2][threadIdx.x+0]>=borders[mask][0][2])&&
            (a[j+2][threadIdx.x+0]<borders[mask+Channels][0][2]))
            sum+=w[mask][0][2];
        if ((a[j+3][threadIdx.x+0]>=borders[mask][0][3])&&
            (a[j+3][threadIdx.x+0]<borders[mask+Channels][0][3]))
SLIDE 10
            sum+=w[mask][0][3];
        if (sum > THETA) {
            Counter++;
        }
    } // next j
    atomicAdd(&ReturnVectors_dlocal[(blockIdx.x*Channels)+mask], Counter);
}
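A host-side reference can be useful for checking the kernel's counters on a single image. The sketch below mirrors only the four taps visible on the slides and assumes that `border` stores the lower bounds of all filters first and the upper bounds second, as the `mask+Channels` indexing suggests. The function name and the `theta` parameter (standing in for `THETA`) are hypothetical.

```c
#define CHANNELS 8    /* number of filters */
#define IMG      64   /* image edge length */
#define POS      60   /* window positions per axis (60 threads in x, 60 steps of j) */
#define TAPS     4    /* only the four taps shown on the slides */

/* Hypothetical CPU reference for one image.
   image:  IMG * IMG pixels, row-major
   w:      CHANNELS * 16 weights
   border: 2 * CHANNELS * 16 bounds (lower bounds first, then upper bounds)
   counts: CHANNELS fire-event counters (output) */
void perceptrons_reference(const unsigned char *image, const float *w,
                           const unsigned char *border, float theta,
                           int *counts)
{
    for (int mask = 0; mask < CHANNELS; mask++) {
        const unsigned char *lo = &border[mask * 16];
        const unsigned char *hi = &border[(mask + CHANNELS) * 16];
        int counter = 0;
        for (int x = 0; x < POS; x++) {          /* threadIdx.x in the kernel */
            for (int j = 0; j < POS; j++) {
                float sum = 0.0f;
                for (int t = 0; t < TAPS; t++) {
                    unsigned char p = image[(j + t) * IMG + x];
                    /* a weight contributes only if the pixel lies in [lo, hi) */
                    if (p >= lo[t] && p < hi[t])
                        sum += w[mask * 16 + t];
                }
                if (sum > theta)
                    counter++;
            }
        }
        counts[mask] = counter;   /* the kernel accumulates this via atomicAdd */
    }
}
```

Comparing `counts` against the corresponding `Channels` entries of `ReturnVectors_dlocal` for the same image gives a quick consistency check.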
SLIDE 11
Code sample (Cell): nearest-neighbor search

// *********************************************************************
// DEMONSTRATION OF spu_cmpgt and spu_sel for a nearest-neighbor search
// mymap: array of Kohonen-map vectors (1/6 of total map)
// testvector: random vector, whose nearest neighbor is computed
// *********************************************************************
for (i=0; i < RES*DIM; i++) {
    sum=(vector float){0.0,0.0,0.0,0.0};
    current_nr=spu_splats(i);
    diff = mymap[i][0] - testvector[0];
    sum = spu_madd(diff,diff, sum);
    diff = mymap[i][1] - testvector[1];
    sum = spu_madd(diff,diff, sum);
    ...
    diff = mymap[i][7] - testvector[7];
    sum = spu_madd(diff,diff, sum);
    mask = spu_cmpgt(localbest, sum);
    // where localbest > sum, spu_sel picks its second argument (sum) as the
    // new minimum; otherwise it keeps its first argument (localbest)
    localbest=spu_sel(localbest,sum,mask);
    localbest_nr = spu_sel(localbest_nr,current_nr,mask);
}
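The compare-and-select pattern can be tried out in portable C by emulating the four vector lanes with plain arrays. This is a sketch under assumptions: the helper names, the memory layout of `map`, and the initialization of the per-lane minimum are hypothetical, and the helpers only imitate the per-lane semantics of `spu_cmpgt` and `spu_sel` rather than calling the SPU intrinsics.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

#define LANES 4   /* a vector float holds 4 values */
#define DIM   8   /* feature-vector length */

/* Per-lane stand-in for spu_cmpgt: all ones if a > b, else all zeros. */
static uint32_t cmpgt_mask(float a, float b)
{
    return (uint32_t)0 - (uint32_t)(a > b);
}

/* Per-lane stand-in for spu_sel: bits of b where mask is 1, bits of a elsewhere. */
static uint32_t sel_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

static float sel_float(float a, float b, uint32_t mask)
{
    uint32_t ua, ub, ur;
    float r;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ur = sel_bits(ua, ub, mask);
    memcpy(&r, &ur, sizeof r);
    return r;
}

/* Each lane tracks its own minimum over n_groups map entries.
   Assumed layout: map[(i * DIM + k) * LANES + lane] is component k
   of the node handled by `lane` in group i. */
void nearest_per_lane(const float *map, int n_groups, const float *test,
                      float best[LANES], uint32_t best_nr[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        best[lane] = FLT_MAX;
        best_nr[lane] = 0;
    }
    for (int i = 0; i < n_groups; i++) {
        for (int lane = 0; lane < LANES; lane++) {
            float sum = 0.0f;
            for (int k = 0; k < DIM; k++) {
                float diff = map[(i * DIM + k) * LANES + lane] - test[k];
                sum += diff * diff;
            }
            uint32_t mask = cmpgt_mask(best[lane], sum);
            best[lane] = sel_float(best[lane], sum, mask);
            best_nr[lane] = sel_bits(best_nr[lane], (uint32_t)i, mask);
        }
    }
}
```

After the loop, each lane holds its own candidate, so a final reduction over the four lanes is still needed to obtain the overall nearest neighbor.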
SLIDE 12
Code sample (Cell): improved barrier function
Transporting nearest-neighbor information 'piggy-back' style, yielding
SLIDE 13
Benchmarks:
SLIDE 14
SLIDE 15
Conclusions:
- Kohonen map training is 12x faster on the Cell processor than on a single i7 core (920, 2.67 GHz)
- In conjunction with a single GPGPU (used for filtering only), Cell reaches 50 % of a GPGPU's performance; even in a cluster of 8 PS3 consoles, one additional GPGPU suffices.
- GPGPUs reach an overall speedup of 22x over a Xeon
- Combining 8 GPGPUs with 8 Cells yields a speedup of ~248x over a single i7 core
  → 5 minutes on a cluster of 'unconventional hardware' is equivalent to 1 day on the CPU
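The "5 minutes versus 1 day" comparison follows directly from the speedup factor. A small helper, with a hypothetical name, makes the arithmetic explicit:

```c
/* Minutes needed on the accelerated cluster for work that takes
   cpu_minutes on a single CPU core, given a measured speedup factor. */
double accelerated_minutes(double cpu_minutes, double speedup)
{
    return cpu_minutes / speedup;
}
```

With one day expressed as 24 * 60 = 1440 minutes and a speedup of 248, `accelerated_minutes(1440.0, 248.0)` comes out a little under 6 minutes, which matches the rounded figure above.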