The context : bioimage data explosion High-throughput imaging - - PowerPoint PPT Presentation
The context : bioimage data explosion High-throughput imaging - - PowerPoint PPT Presentation
The context : bioimage data explosion High-throughput imaging techniques have led to bioimage data explosion Submicrometer resolution Whole mouse brain mapping TeraByte scale images Heterogeneous images Need for tools capable to
University Campus Bio-Medico of Rome
The context: bioimage data explosion
High-throughput imaging techniques have
led to bioimage data explosion
Submicrometer resolution TeraByte scale images Heterogeneous images Need for tools capable to process these
huge datasets
Which requirements? Need for automation to extract
information
Fully- or semi-automated analysis? Is human assistance needed?
Whole mouse brain mapping
March 26, 2018 2
University Campus Bio-Medico of Rome
TeraStitcher: born for Confocal Light Sheet Microscope Images
Generates a 2D matrix of 3D tiles
(image stacks)
Tiles are regularly spaced with
reasonably reliable nominal positions
- Stack acquisition is performed with a
single movement guaranteeing a reliable constant inter-slices distance along z
- Long range movements (order of cm)
make necessary an alignment step in all directions
Sufficient overlap between adjacent
tiles
March 26, 2018 4
University Campus Bio-Medico of Rome
TeraStitcher: Ultra-terabyte images
Need to process 16 bits images larger than
20,000 x 20,000 x 3,000 voxels (~2.5 TB)
Increasingly high resolutions in X-Y Multiple channels Very large slices (e.g. 40K x 30K pixels) Thousands of slices
Solutions:
Sparse datasets Compressed formats Parallelism Multi-channel support
March 26, 2018 5
?
University Campus Bio-Medico of Rome
What is image stitching?
Image stitching is the process
- f
combining multiple photographic images with
- verlapping
fields
- f
view to produce a segmented panorama
- r
high-resolution image (from Wikipedia)
March 26, 2018 7
Initially developed for confocal light sheet microscopy (CLSM) images
Fast 2D approach to align adjacent stacks
Efficient use of memory resources
Minimum I/O workload (2 readings, 1 writing)
Can generate a multi-resolution representation suited for further processing
University Campus Bio-Medico of Rome
TeraStitcher: overview
March 26, 2018 8
University Campus Bio-Medico of Rome
TeraStitcher: software structure
The software structure provides
means for incorporating new functionalities
Object-oriented software
architecture and design patterns
Functional decomposition visible to
the user through command line
- ptions
Stitcher
StackedVolume MIP-NCC-Displ
- MIP_displacements: int[6]
- NCC_widths: float[3]
- NCC_maxs: float[3]
- reliab_factors: float[3]
+ evalReliability(): float + …
…
MST
+ execute(… )
…
Stack
«interface» Pai r w i seD i spl Al go
+ execut
e(…
) :Displacement
…
MIP-NCC
+ execute(… ): Displacement
«executes» «executes» «uses»
Abstract Factory
VolumeManager
1 N 1 N «interface»
D i spl acem ent
+ eval
Rel i abi l i t y( ) :float + get
D i spl acem ent ( di r ect i- n:
int + t
hr eshol d( t hr es:float) + pr
- j
Displacement) # def
_di spl acem ent s:int[3] # di
spl acem ent s:int[3]
«produces »
Strategy
Stitcher
+ computeDisplacements(algorithm_type: int, … ) + projectDisplacements() + thresholdDisplacements(thres: float) + computeTilesPlacement(algorithm_type: int) + mergeTiles(blending_type: int,… )
«interface» Ti l esPl acem ent Al go
+ execut
e(…
)«uses» «uses»
March 26, 2018 9
Strategy (aims at minimizing memory requirements)
- Divide the volume into layers (substacks) along Z
- For each layer
- Keep in memory a row (or column) of substacks at the time
- For each pair of adjacent substacks, call
Algorithm
- Compute the Maximum Intensity Projections (MIPs)
along X, Y, and Z for the two substacks
- Apply 2D Normalized Cross-Correlation (NCC) to the
three pairs of MIPs to find three 2D displacements
- Output “most-reliable” displacement for each direction
University Campus Bio-Medico of Rome
TeraStitcher: separating the strategy from the algorithm
March 26, 2018 10
University Campus Bio-Medico of Rome
TeraStitcher: global optimization step
Global optimization is
performed using computed (or even nominal) alignments and their reliability
The resulting
alignments are used to generate the xml file
Manual intervention is
possible to correct isolated errors
V:0( ) H:375( ) D:0 ( ) V:0( ) H:375( ) D:0 ( ) V:0( ) H:375( ) D:0 ( ) V:0( ) H:375( ) D:0 ( ) V:375( ) H:0( ) D:0( ) V:0( ) H:375( ) D:0 ( ) V:375( ) H:0( ) D:0( ) V:375( ) H:0( ) D:0( ) V:0( ) H:375( ) D:0 ( ) V:0( ) H:375( ) D:0 ( ) V:0( ) H:375( ) D:0( ) V:1(1,35) H:372(1,40) D:-3(1,09) V:375( ) H:0( ) D:0( ) V:372(1,37) H:1(1,30) D:0(1,40) V:375( ) H:0( ) D:0( ) V:375( ) H:0( ) D:0( ) V:375( ) H:0( ) D:0( ) V:375( ) H:0( ) D:0( ) V:375( ) H:0( ) D:0( ) V:371(1,36) H:2(1,35) D:2(1,33) V:375( ) H:0( ) D:0( ) V:0( ) H:373(1,15) D:2(1,11) V:1(1,40) H:375(1,26) D:-1(1,12) V:0( ) H:375 ( ) D:0( ) V:0(1,32) H:373(1,22) D:-1(1,10) V:375( ) H:0( ) D:0( ) V:375( ) H:0( ) D:0( ) V:372(1,41) H:3(1,22) D:-4(1,35) V:372(1,21) H:4(1,35) D:-1(1,31) V:375( ) H:0( ) D:0( ) V:1(1,37) H:371(1,35) D:-3(1,37) V:3(1,34) H:373(1,23) D:0(1,08) V:0( ) H:375( ) D:0 ( ) V:0( ) H:375( ) D:0 ( ) V:372(1,35) H:3(1,22) D:-1(1,37) V:374(1,35) H:5(1,22) D:-1(1,12) V:375( ) H:0( ) D:0( ) V:374(1,25) H:6(1,15) D:2(1,12) V:375( ) H:0( ) D:0( ) V:375( ) H:0( ) D:0( )
horizontal axis H vertical axis V
MST along V MST along H MST along D
March 26, 2018 12
University Campus Bio-Medico of Rome
TeraStitcher: developments and issues
Parallelization: maintenance issues
different alternatives to exploit parallel processing
different code for sequential and parallel version
Solution: launching multiple instances on disjoint sub-
regions
command line options to specify the sub-region to process
use a scripting language (python) to implement a driver
Alignment computation (computation bound)
each instance processes a group of adjacent sub-stacks
merge the xml files with alignment information
Generation of the stitched image (I/O bound)
the directory structure is first generated
each instance writes disjoint groups of tiles in parallel
metadata to be used in subsequent steps are finally generated
March 26, 2018 13
Why do we need GPUs?
Tile size: 2048 x 2048 x 2950 voxel (16 bits per voxel) Tile matrix: 3 x 3 Total dataset size: Tile size x #tiles (>220 GByte) Stitching on one Power8 (2.061 Ghz) core: ~15000 secs! Stitching on one Xeon ES2640 (2.6 Ghz) core: ~20000 secs! GPUs come to rescue!
March 26, 2018 16
Normalized Cross Correlation
The most time-consuming part of
the stitching process is the evaluation of the Normalized Cross- Correlation (NCC) between the Maximum Intensity Projections (MIP) of the tiles.
NCC is a variant of the classic Cross Correlation in
which data are normalized by subtracting the mean and dividing by the standard deviation of the two datasets.
TeraStitcher implements a fast NCC from J. P. Lewis,
“Fast Template Matching”, Vision Interface (1995).
March 26, 2018 17
CPU Normalized Cross Correlation
f_mean = t_mean = 0.; for ( i=0, pxl1=im1, pxl2=im2; i<dimi; i++, pxl1+=stride, pxl2+=stride ) { for ( j=0; j<dimj; j++, pxl1++, pxl2++ ) { f_mean += *pxl1; t_mean += *pxl2; } } f_mean /= (dimi*dimj); t_mean /= (dimi*dimj); numerator = factor1 = factor2 = 0.; for ( ij=0, pxl1=im1, pxl2=im2; ij<dimi*dimj; ij++) { f_prime = pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] - f_mean; t_prime = pxl2[(ij%dimj)+(stride+dimj)*(ij/dimj)] - t_mean; numerator += pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] * t_prime; factor1 += f_prime * f_prime; factor2 += t_prime * t_prime; }
March 26, 2018 18
The
previous fragment
- f
code is repeated (2*delayu+1) x (2*delayv+1) times at each iteration (there are few hundreds iterations per run).
A typical value of delayu and delayv is ~ 100
We developed a gpu_NCC kernel that does all the work with a single invocation per iteration:
minimize the overhead due to the invocation allow to overlap memory copies from CPU to GPU
(using streams)
March 26, 2018 19
From CPU TO GPU NCC
The gpu_NCC kernel heavily relies on shuffle primitives to
compute the mean values required by the NCC
#define CALCSUM(v) (v)=warpReduceSumD((v)); \ if(lane==0) shared[wid]=(v); \ __syncthreads(); \ (v) = (threadIdx.x<(blockDim.x/warpSize))*shared[lane]; \ if(wid==0) (v)=warpReduceSumD((v)); \ if(tid==0) shared[0]=(v); \ __syncthreads(); \ (v)=shared[0]; \ __syncthreads(); #define CALCAVE(v) CALCSUM(v) \ (v) /= (dimi*dimj);
March 26, 2018 20
GPU NCC
IBM S822LC HPC, courtesy of IBM Italy (G. Richelli)
512 GB of RAM, 16 Power8 cores (SMT 8) 2 960 GB solid state disks 4 P100 GPUs.
March 26, 2018 21
Test Platform
Time comparison (220 GB testcase)
1 Power8 process 15300 seconds (14617 for the NCC) 2 Power8 processes 8191 seconds 4 Power8 processes 4100 seconds 16 Power8 processes 1168 seconds! 64 Power8 processes 2457 seconds (2200 for the NCC)! 1 process using one P100 580 seconds (70 for the NCC) 2 processes each one using a P100 344 seconds 4 processes each one using a P100 174 seconds 8 processes sharing 4 P100 93 seconds 16 processes sharing 2 P100 115 seconds 16 processes sharing 4 P100 56 seconds! (17.5 for the NCC)
March 26, 2018 22
Results on other GPUs
Running on a Titan V(olta) the GPU time reduces from
70 seconds to (about) 42.5 seconds
Running on a K80 the GPU time increases up to 493
seconds!
The nvprof tool helps in understanding why time
increases much more than expected…
March 26, 2018 23
K80 vs. Volta (average values)
March 26, 2018 24
The difference in performance is due to the memory subsystem of the Volta
There are no-changes with respect to the original CPU
implementation, just tiny additions
Can compile on platforms without GPU by turning-off a
macro in the CMAKE configuration file
CUDA version activated by means of an env variable
Simple mechanism that works on any platform
CUDA version tested under Linux, MacOS, Windows We are testing/evaluating an OpenCL version CUDA code will be soon available from github
March 26, 2018 25
Other Features of the GPU version
University Campus Bio-Medico of Rome
TeraStitcher: current status
Open source multi-platform software (https://abria.github.io/TeraStitcher) Available both as standalone application (with GUI) and as plugin for Vaa3D The CPU version is already in use in many organizations including
WYSS Center, Geneve Switzerland Renier Lab, ICM Brain and Spine Institute, Paris, France Chung's Lab, MIT, Boston, USA Adam Glaser, Washington University, Seattle, USA Tomer Lab, Columbia University, New York, USA Deisseroth Lab, Stanford University, CA, USA
Several users already expressed interest for the CUDA version.
TeraStitcher Brain Cell Finder Vaa3D-TeraFly TeraConverter
March 26, 2018 26
University Campus Bio-Medico of Rome
Credits (1/2): image processing, bioinformatics, machine learning
TeraStitcher Vaa3D-TeraFly Brain cell finder
- Prof. Giulio Iannello
Alessandro Bria Leonardo Onofri
- Prof. Paolo Frasconi
Paolo Soda Hanchuan Peng Roberto Cortini
March 26, 2018 27
University Campus Bio-Medico of Rome
Credits (2/2): confocal light sheet microscopy, sample processing
- Prof. Francesco Pavone
Leonardo Sacconi Ludovico Silvestri Irene Costantini
International Center of Computational Neurophotonics
March 26, 2018 28
Credits: Jonathan Liu and Adam Glaser, Univ. of Washington have built the microscope and acquired the data the samples were prepared by: Joshua Vaughan, University of Washington (the expanded brain), Rusty Nicovich, Allen Institute (the entire brain slice), Michael Gerner, University of Washington (mouse lymph node and mouse lung)
March 26, 2018 30
Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder
University Campus Bio-Medico of Rome
Brain cell finder (1/5): overview
Developed for confocal light sheet microscopy (CLSM) images [7] Applied to localize and count the Purkinje cells in the cerebellum of an L7-GFP mouse First complete map of a selected neuronal population in a large area of the mouse brain
Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder
March 26, 2018 32
Brain cell finder (2/5): semantic deconvolution
Boosts weak somata and decreases the
voxel intensities in non-soma regions
Use of a neural network (NN) trained to
map the original image into an ideal one
10 substacks covering different regions were manually annotated to feed data to the NN
Original image Ideal image (reference) Image filtered by the trained neural network
training testing
Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder
March 26, 2018 33
University Campus Bio-Medico of Rome
Brain cell finder (3/5): mean shift clustering
First we get rid of very dark voxels
which are unlikely to be part of a soma
Seeds are chosen as local maxima
exceeding a maximum entropy-based
- ptimal threshold
The mean shift algorithm then:
Places a spherical kernel of radius R on
each seed
Shifts each point towards the mean
value computed as the kernel- weighted average of the data
Repeats
the previous steps until convergence is achieved
Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder
March 26, 2018 34
University Campus Bio-Medico of Rome
Brain cell finder (4/5): manifold filter
Exploiting anatomical knowledge
Cells are not scattered randomly in
the 3D space but are laid out in manifolds
Isolated
- r
- ff-manifold
cells are discarded
Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder
March 26, 2018 35
University Campus Bio-Medico of Rome
Brain cell finder (5/5): results
Performances of the pipeline were estimated
- n 56 substacks (~1.09 GigaVoxels) manually
annotated with Vaa3D
- F1 = 0.96
A second human operator annotated the same
56 substacks and achieved F1 = 0.98
Processing the whole mouse cerebellum (120
GigaVoxels) yielded cells, which agrees with stereology data
Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder
March 26, 2018 36