Distributed Computation of Feature-Detectors for Medical Image Processing on GPGPU and Cell Processors
Peter Zinterhof, Scientific Computing, Salzburg University


  1. Distributed Computation of Feature-Detectors for Medical Image Processing on GPGPU and Cell Processors
     Peter Zinterhof, Scientific Computing, Salzburg University
     Third Workshop on UnConventional High Performance Computing 2010 (UCHPC 2010), Ischia – Naples, Italy

  2. Task: automated recognition of 'regions of interest' within medical imaging data; here, recognition of kidneys within CT slices.

  3. General approach: evolve filter sets that generate feature vectors, which are then classified by means of a Kohonen self-organizing map. Image filtering (see the scalar sketch below):
     - Filter: a 4 x 4 pixel window, fed into a perceptron
     - The perceptron sums its 16 inputs linearly and fires if the threshold Θ is exceeded
     - Applied as a sliding window (convolution); each filter delivers a single number per image
     - Eight filters are aggregated into a feature vector of length 8
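
As a reading aid, here is a minimal scalar sketch of one such filter in plain C. The names (filter_fire_count, lo, hi) and the THETA value are illustrative assumptions; the authoritative implementation is the GPU kernel on slides 7-10, whose window grid (60 x 60 positions) this sketch mirrors.

    #define THETA 0.5f   /* firing threshold; the actual value is an assumption */

    /* Count fire events of one 4 x 4 perceptron filter slid over a 64 x 64 image.
     * A pixel contributes its weight only if it lies inside the per-position
     * intensity band [lo, hi); the perceptron fires when the sum exceeds THETA. */
    int filter_fire_count(const unsigned char img[64][64],
                          const float w[4][4],
                          const unsigned char lo[4][4],
                          const unsigned char hi[4][4])
    {
        int count = 0;
        for (int y = 0; y < 60; y++) {        /* same 60 x 60 window grid as the kernel */
            for (int x = 0; x < 60; x++) {
                float sum = 0.0f;
                for (int r = 0; r < 4; r++)
                    for (int c = 0; c < 4; c++) {
                        unsigned char p = img[y + r][x + c];
                        if (p >= lo[r][c] && p < hi[r][c])   /* inside intensity band */
                            sum += w[r][c];
                    }
                if (sum > THETA)
                    count++;                  /* one fire event */
            }
        }
        return count;
    }

The eight counts produced by eight such filters form the feature vector of length 8 mentioned above.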

  4. Training phase:
     1. Apply eight perceptron-based filters to the training images (64 x 64 pixels)
     2. Sum up fire events to create feature vectors (8-bin histograms)
     3. Train the Kohonen map
     4. Repeat steps 1-3 for many 'individuals' in parallel and compute the fitness of each filter set (recognition rate)
     5. Apply an evolutionary strategy to the perceptrons, so that perceptrons evolve which eventually yield high recognition rates
     Recall phase (sketched below):
     1. Apply the best filter set to the image to be classified (kidney / non-kidney)
     2. Use the pre-computed Kohonen map to classify the feature vector
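
A minimal sketch of the recall phase, reusing filter_fire_count from the sketch above. The names kohonen, labels and map_size are assumptions; the slides' actual nearest-neighbor search is the SPU code on slide 11.

    /* Classify one image: build the 8-bin feature vector with the best
     * filter set, then return the label of the nearest Kohonen-map unit. */
    int classify(const unsigned char img[64][64],
                 const float w[8][4][4],
                 const unsigned char lo[8][4][4],
                 const unsigned char hi[8][4][4],
                 const float kohonen[][8], const int labels[], int map_size)
    {
        float feat[8];
        for (int f = 0; f < 8; f++)            /* 8-bin histogram of fire events */
            feat[f] = (float) filter_fire_count(img, w[f], lo[f], hi[f]);

        int best = 0;
        float best_d = 1e30f;
        for (int u = 0; u < map_size; u++) {   /* nearest neighbor in the map */
            float d = 0.0f;
            for (int k = 0; k < 8; k++) {
                float diff = kohonen[u][k] - feat[k];
                d += diff * diff;              /* squared Euclidean distance */
            }
            if (d < best_d) { best_d = d; best = u; }
        }
        return labels[best];                   /* e.g. kidney / non-kidney */
    }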

  5. Solution: using clusters of General Purpose Graphics Processing Units (GPGPUs) and Cell processors enables training within days instead of many months. Two levels of parallelism (see the MPI sketch below):
     - coarse-grained: systems level
     - fine-grained: image filtering, Kohonen map
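
The coarse-grained level can be pictured as a population-parallel evaluation loop. The following MPI sketch is an illustration only (the slides name MPICH2 but show no host code); POP, GENOME and evaluate_individual are assumptions, with the fine-grained GPU/Cell work hidden inside evaluate_individual.

    #include <mpi.h>

    #define POP    64    /* population size (assumption) */
    #define GENOME 384   /* parameters per individual (assumption) */

    extern float evaluate_individual(const float *genome); /* filtering + SOM */

    int main(int argc, char **argv)
    {
        int rank, size;
        static float pop[POP][GENOME];
        float local_fit[POP], fitness[POP];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int gen = 0; gen < 100; gen++) {
            /* rank 0 broadcasts the current population */
            MPI_Bcast(pop, POP * GENOME, MPI_FLOAT, 0, MPI_COMM_WORLD);

            /* each rank scores its share of individuals */
            for (int i = 0; i < POP; i++)
                local_fit[i] = (i % size == rank)
                             ? evaluate_individual(pop[i]) : 0.0f;

            /* collect all fitness values on rank 0 (zeros elsewhere) */
            MPI_Reduce(local_fit, fitness, POP, MPI_FLOAT, MPI_SUM, 0,
                       MPI_COMM_WORLD);

            /* if (rank == 0): selection + mutation on pop[] ... */
        }
        MPI_Finalize();
        return 0;
    }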

  6. Cluster setups (diagram on the original slide):
     - GPU cluster: mpich2 + gbic
     - GPU + Cell cluster: mpich2 + pvm3 + gbic

  7. Sample code GPU: image filtering

     dim3 dimBlock (60, 8);      // 480 threads per block
     dim3 dimGrid (images, 1);   // 12000 blocks -> about 5.76 million threads

     __global__ void Perceptrons (unsigned char *a1, float *weights,
                                  unsigned char *border, int *ReturnVectors_dlocal)
     {
       __shared__ float w[Channels][4][4];                          // filter weights
       __shared__ unsigned char borders[Channels+Channels][4][4];   // lower/upper intensity bounds
       __shared__ unsigned char a[64][64];                          // one 64 x 64 image
       float *aw;
       unsigned char *ue, *uw;
       int i, base;

       if (threadIdx.y == 0) {   // fetch data from global memory
         aw = &w[0][0][0];
         for (i = threadIdx.x; i < (1*Channels*16); i += 60) {

  8. (GPU kernel, continued)

           aw[i] = weights[i];
         }
         ue = &borders[0][0][0];
         for (i = threadIdx.x; i < (2*Channels*16); i += 60) {
           ue[i] = border[i];
         }
         base = blockIdx.x * 4096;   // 64 x 64 = 4096 pixels per image
         uw = &a[0][0];
         for (i = threadIdx.x; i < 4096; i += 60) {
           uw[i] = a1[base + i];
         }
       }
       __syncthreads();   // all threads wait for the shared-memory loads

  9. (GPU kernel, continued)

       int mask = threadIdx.y;   // which of the 8 filters to compute locally
       int Counter = 0;
       float sum;
       for (int j = 0; j < 60; j++) {   // slide the 4 x 4 window along the column
         sum = 0.0f;
         if ((a[j+0][threadIdx.x] >= borders[mask][0][0]) &&
             (a[j+0][threadIdx.x] <  borders[mask+Channels][0][0])) sum += w[mask][0][0];
         if ((a[j+1][threadIdx.x] >= borders[mask][0][1]) &&
             (a[j+1][threadIdx.x] <  borders[mask+Channels][0][1])) sum += w[mask][0][1];
         if ((a[j+2][threadIdx.x] >= borders[mask][0][2]) &&
             (a[j+2][threadIdx.x] <  borders[mask+Channels][0][2])) sum += w[mask][0][2];
         if ((a[j+3][threadIdx.x] >= borders[mask][0][3]) &&
             (a[j+3][threadIdx.x] <  borders[mask+Channels][0][3]))

  10. (GPU kernel, continued)

           sum += w[mask][0][3];
         // ... remaining pixel tests of the 4 x 4 window elided on the slides
         if (sum > THETA) {   // perceptron fires
           Counter++;
         }
       }   // next j
       // accumulate fire events per image (block) and per filter (mask)
       atomicAdd (&ReturnVectors_dlocal[(blockIdx.x*Channels) + mask], Counter);
     }

  11. Code sample Cell: nearest-neighbor search

     // *********************************************************************
     // DEMONSTRATION OF spu_cmpgt and spu_sel for a nearest-neighbor search
     // mymap: array of Kohonen-map vectors (1/6 of the total map)
     // testvector: random vector whose nearest neighbor is computed
     // *********************************************************************
     for (i = 0; i < RES*DIM; i++) {
       sum = (vector float){0.0, 0.0, 0.0, 0.0};
       current_nr = spu_splats (i);
       diff = mymap[i][0] - testvector[0];
       sum  = spu_madd (diff, diff, sum);   // sum += diff * diff
       diff = mymap[i][1] - testvector[1];
       sum  = spu_madd (diff, diff, sum);
       ...                                  // components 2 to 6 analogous
       diff = mymap[i][7] - testvector[7];
       sum  = spu_madd (diff, diff, sum);
       mask         = spu_cmpgt (localbest, sum);                // all-ones where localbest > sum
       localbest    = spu_sel (localbest, sum, mask);            // keep the smaller distance
       localbest_nr = spu_sel (localbest_nr, current_nr, mask);  // and its index
     }
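
The spu_cmpgt / spu_sel pair implements a branch-free minimum: the comparison yields an all-ones or all-zeros mask per vector slot, and the selects keep either the old or the new candidate. This avoids data-dependent branches, which are costly on the SPU because it has no dynamic branch prediction.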

  12. Code sample Cell: improved barrier function (code shown as an image on the original slide). Nearest-neighbor information is transported 'piggy-back' with the barrier messages, yielding approx. 14 % speedup.
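
The slide's actual Cell barrier code is not in this transcript. As a rough illustration of the piggy-back idea only, the following portable C/pthreads sketch (all names and the thread count are assumptions) folds the min-reduction into the barrier itself, so the separate reduction round that would otherwise follow the barrier is saved.

    #include <pthread.h>
    #include <float.h>

    #define NTHREADS 6   /* e.g. one thread per SPE (assumption) */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
    static int   arrived = 0, generation = 0;
    static float best_dist = FLT_MAX;
    static int   best_unit = -1, winner = -1;

    /* Barrier that piggy-backs a min-reduction: each thread hands in its
     * local nearest-neighbor candidate; the last arriving thread publishes
     * the combined winner, which every thread carries out of the barrier. */
    int barrier_min(float local_dist, int local_unit)
    {
        pthread_mutex_lock(&lock);
        int gen = generation;
        if (local_dist < best_dist) {        /* fold candidate into global min */
            best_dist = local_dist;
            best_unit = local_unit;
        }
        if (++arrived == NTHREADS) {         /* last arrival: publish and release */
            winner = best_unit;
            best_dist = FLT_MAX;             /* reset for the next barrier round */
            arrived = 0;
            generation++;
            pthread_cond_broadcast(&wake);
        } else {
            while (gen == generation)        /* guards against spurious wakeups */
                pthread_cond_wait(&wake, &lock);
        }
        int w = winner;                      /* read result while holding the lock */
        pthread_mutex_unlock(&lock);
        return w;
    }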

  13.-14. Benchmarks: results shown as charts on the original slides.

  15. Conclusions:
     - Kohonen map training is 12x faster on the Cell processor than on a single i7 core (920, 2.67 GHz)
     - In conjunction with a single GPGPU (used for filtering only), a Cell reaches 50 % of the GPGPU's performance; even in a cluster of 8 PS3 consoles, one additional GPGPU suffices
     - GPGPUs reach an overall speedup of 22x over a Xeon
     - Combining 8 GPGPUs with 8 Cells yields a speedup of ~248x over a single i7 core: 5 minutes on the cluster of 'unconventional hardware' are equivalent to about one day on the CPU (1440 min / 248 ≈ 5.8 min)
