The context : bioimage data explosion High-throughput imaging - - PowerPoint PPT Presentation

the context bioimage data explosion
SMART_READER_LITE
LIVE PREVIEW

The context : bioimage data explosion High-throughput imaging - - PowerPoint PPT Presentation

The context : bioimage data explosion High-throughput imaging techniques have led to bioimage data explosion Submicrometer resolution Whole mouse brain mapping TeraByte scale images Heterogeneous images Need for tools capable to


slide-1
SLIDE 1
slide-2
SLIDE 2

University Campus Bio-Medico of Rome

The context: bioimage data explosion

 High-throughput imaging techniques have

led to bioimage data explosion

 Submicrometer resolution  TeraByte scale images  Heterogeneous images  Need for tools capable to process these

huge datasets

 Which requirements?  Need for automation to extract

information

 Fully- or semi-automated analysis?  Is human assistance needed?

Whole mouse brain mapping

March 26, 2018 2

slide-3
SLIDE 3

University Campus Bio-Medico of Rome

TeraStitcher: born for Confocal Light Sheet Microscope Images

 Generates a 2D matrix of 3D tiles

(image stacks)

 Tiles are regularly spaced with

reasonably reliable nominal positions

  • Stack acquisition is performed with a

single movement guaranteeing a reliable constant inter-slices distance along z

  • Long range movements (order of cm)

make necessary an alignment step in all directions

 Sufficient overlap between adjacent

tiles

March 26, 2018 4

slide-4
SLIDE 4

University Campus Bio-Medico of Rome

TeraStitcher: Ultra-terabyte images

 Need to process 16 bits images larger than

20,000 x 20,000 x 3,000 voxels (~2.5 TB)

 Increasingly high resolutions in X-Y  Multiple channels  Very large slices (e.g. 40K x 30K pixels)  Thousands of slices

 Solutions:

 Sparse datasets  Compressed formats  Parallelism  Multi-channel support

March 26, 2018 5

slide-5
SLIDE 5

?

slide-6
SLIDE 6

University Campus Bio-Medico of Rome

What is image stitching?

Image stitching is the process

  • f

combining multiple photographic images with

  • verlapping

fields

  • f

view to produce a segmented panorama

  • r

high-resolution image (from Wikipedia)

March 26, 2018 7

slide-7
SLIDE 7

Initially developed for confocal light sheet microscopy (CLSM) images

Fast 2D approach to align adjacent stacks

Efficient use of memory resources

Minimum I/O workload (2 readings, 1 writing)

Can generate a multi-resolution representation suited for further processing

University Campus Bio-Medico of Rome

TeraStitcher: overview

March 26, 2018 8

slide-8
SLIDE 8

University Campus Bio-Medico of Rome

TeraStitcher: software structure

 The software structure provides

means for incorporating new functionalities

 Object-oriented software

architecture and design patterns

 Functional decomposition visible to

the user through command line

  • ptions

Stitcher

StackedVolume MIP-NCC-Displ

  • MIP_displacements: int[6]
  • NCC_widths: float[3]
  • NCC_maxs: float[3]
  • reliab_factors: float[3]

+ evalReliability(): float + …

MST

+ execute(… )

Stack

«interface» Pai r w i seD i spl Al go

+ execut

e(

) :

Displacement

MIP-NCC

+ execute(… ): Displacement

«executes» «executes» «uses»

Abstract Factory

VolumeManager

1 N 1 N «interface»

D i spl acem ent

+ eval

Rel i abi l i t y( ) :

float + get

D i spl acem ent ( di r ect i
  • n:
i nt ) :

int + t

hr eshol d( t hr es:

float) + pr

  • j
ect ( di spl :

Displacement) # def

_di spl acem ent s:

int[3] # di

spl acem ent s:

int[3]

«produces »

Strategy

Stitcher

+ computeDisplacements(algorithm_type: int, … ) + projectDisplacements() + thresholdDisplacements(thres: float) + computeTilesPlacement(algorithm_type: int) + mergeTiles(blending_type: int,… )

«interface» Ti l esPl acem ent Al go

+ execut

e(

)

«uses» «uses»

March 26, 2018 9

slide-9
SLIDE 9

 Strategy (aims at minimizing memory requirements)

  • Divide the volume into layers (substacks) along Z
  • For each layer
  • Keep in memory a row (or column) of substacks at the time
  • For each pair of adjacent substacks, call

 Algorithm

  • Compute the Maximum Intensity Projections (MIPs)

along X, Y, and Z for the two substacks

  • Apply 2D Normalized Cross-Correlation (NCC) to the

three pairs of MIPs to find three 2D displacements

  • Output “most-reliable” displacement for each direction

University Campus Bio-Medico of Rome

TeraStitcher: separating the strategy from the algorithm

March 26, 2018 10

slide-10
SLIDE 10

University Campus Bio-Medico of Rome

TeraStitcher: global optimization step

 Global optimization is

performed using computed (or even nominal) alignments and their reliability

 The resulting

alignments are used to generate the xml file

 Manual intervention is

possible to correct isolated errors

V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:0(฀ ) H:375(฀ ) D:0(฀ ) V:1(1,35) H:372(1,40) D:-3(1,09) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:372(1,37) H:1(1,30) D:0(1,40) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:371(1,36) H:2(1,35) D:2(1,33) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:0(฀ ) H:373(1,15) D:2(1,11) V:1(1,40) H:375(1,26) D:-1(1,12) V:0(฀ ) H:375 (฀ ) D:0(฀ ) V:0(1,32) H:373(1,22) D:-1(1,10) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:372(1,41) H:3(1,22) D:-4(1,35) V:372(1,21) H:4(1,35) D:-1(1,31) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:1(1,37) H:371(1,35) D:-3(1,37) V:3(1,34) H:373(1,23) D:0(1,08) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:0(฀ ) H:375(฀ ) D:0 (฀ ) V:372(1,35) H:3(1,22) D:-1(1,37) V:374(1,35) H:5(1,22) D:-1(1,12) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:374(1,25) H:6(1,15) D:2(1,12) V:375(฀ ) H:0(฀ ) D:0(฀ ) V:375(฀ ) H:0(฀ ) D:0(฀ )

horizontal axis H vertical axis V

MST along V MST along H MST along D

March 26, 2018 12

slide-11
SLIDE 11

University Campus Bio-Medico of Rome

TeraStitcher: developments and issues

 Parallelization: maintenance issues

different alternatives to exploit parallel processing

different code for sequential and parallel version

 Solution: launching multiple instances on disjoint sub-

regions

command line options to specify the sub-region to process

use a scripting language (python) to implement a driver

 Alignment computation (computation bound)

each instance processes a group of adjacent sub-stacks

merge the xml files with alignment information

 Generation of the stitched image (I/O bound)

the directory structure is first generated

each instance writes disjoint groups of tiles in parallel

metadata to be used in subsequent steps are finally generated

March 26, 2018 13

slide-12
SLIDE 12

Why do we need GPUs?

 Tile size: 2048 x 2048 x 2950 voxel (16 bits per voxel)  Tile matrix: 3 x 3  Total dataset size: Tile size x #tiles (>220 GByte)  Stitching on one Power8 (2.061 Ghz) core: ~15000 secs!  Stitching on one Xeon ES2640 (2.6 Ghz) core: ~20000 secs!  GPUs come to rescue!

March 26, 2018 16

slide-13
SLIDE 13

Normalized Cross Correlation

 The most time-consuming part of

the stitching process is the evaluation of the Normalized Cross- Correlation (NCC) between the Maximum Intensity Projections (MIP) of the tiles.

 NCC is a variant of the classic Cross Correlation in

which data are normalized by subtracting the mean and dividing by the standard deviation of the two datasets.

 TeraStitcher implements a fast NCC from J. P. Lewis,

“Fast Template Matching”, Vision Interface (1995).

March 26, 2018 17

slide-14
SLIDE 14

CPU Normalized Cross Correlation

f_mean = t_mean = 0.; for ( i=0, pxl1=im1, pxl2=im2; i<dimi; i++, pxl1+=stride, pxl2+=stride ) { for ( j=0; j<dimj; j++, pxl1++, pxl2++ ) { f_mean += *pxl1; t_mean += *pxl2; } } f_mean /= (dimi*dimj); t_mean /= (dimi*dimj); numerator = factor1 = factor2 = 0.; for ( ij=0, pxl1=im1, pxl2=im2; ij<dimi*dimj; ij++) { f_prime = pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] - f_mean; t_prime = pxl2[(ij%dimj)+(stride+dimj)*(ij/dimj)] - t_mean; numerator += pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] * t_prime; factor1 += f_prime * f_prime; factor2 += t_prime * t_prime; }

March 26, 2018 18

slide-15
SLIDE 15

 The

previous fragment

  • f

code is repeated (2*delayu+1) x (2*delayv+1) times at each iteration (there are few hundreds iterations per run).

 A typical value of delayu and delayv is ~ 100

We developed a gpu_NCC kernel that does all the work with a single invocation per iteration:

 minimize the overhead due to the invocation  allow to overlap memory copies from CPU to GPU

(using streams)

March 26, 2018 19

From CPU TO GPU NCC

slide-16
SLIDE 16

 The gpu_NCC kernel heavily relies on shuffle primitives to

compute the mean values required by the NCC

#define CALCSUM(v) (v)=warpReduceSumD((v)); \ if(lane==0) shared[wid]=(v); \ __syncthreads(); \ (v) = (threadIdx.x<(blockDim.x/warpSize))*shared[lane]; \ if(wid==0) (v)=warpReduceSumD((v)); \ if(tid==0) shared[0]=(v); \ __syncthreads(); \ (v)=shared[0]; \ __syncthreads(); #define CALCAVE(v) CALCSUM(v) \ (v) /= (dimi*dimj);

March 26, 2018 20

GPU NCC

slide-17
SLIDE 17

 IBM S822LC HPC, courtesy of IBM Italy (G. Richelli)

 512 GB of RAM,  16 Power8 cores (SMT 8)  2 960 GB solid state disks  4 P100 GPUs.

March 26, 2018 21

Test Platform

slide-18
SLIDE 18

Time comparison (220 GB testcase)

1 Power8 process 15300 seconds (14617 for the NCC) 2 Power8 processes 8191 seconds 4 Power8 processes 4100 seconds 16 Power8 processes 1168 seconds! 64 Power8 processes 2457 seconds (2200 for the NCC)! 1 process using one P100 580 seconds (70 for the NCC) 2 processes each one using a P100 344 seconds 4 processes each one using a P100 174 seconds 8 processes sharing 4 P100 93 seconds 16 processes sharing 2 P100 115 seconds 16 processes sharing 4 P100 56 seconds! (17.5 for the NCC)

March 26, 2018 22

slide-19
SLIDE 19

Results on other GPUs

 Running on a Titan V(olta) the GPU time reduces from

70 seconds to (about) 42.5 seconds

 Running on a K80 the GPU time increases up to 493

seconds!

 The nvprof tool helps in understanding why time

increases much more than expected…

March 26, 2018 23

slide-20
SLIDE 20

K80 vs. Volta (average values)

March 26, 2018 24

The difference in performance is due to the memory subsystem of the Volta

slide-21
SLIDE 21

 There are no-changes with respect to the original CPU

implementation, just tiny additions

 Can compile on platforms without GPU by turning-off a

macro in the CMAKE configuration file

 CUDA version activated by means of an env variable

 Simple mechanism that works on any platform

 CUDA version tested under Linux, MacOS, Windows  We are testing/evaluating an OpenCL version  CUDA code will be soon available from github

March 26, 2018 25

Other Features of the GPU version

slide-22
SLIDE 22

University Campus Bio-Medico of Rome

TeraStitcher: current status

 Open source multi-platform software (https://abria.github.io/TeraStitcher)  Available both as standalone application (with GUI) and as plugin for Vaa3D  The CPU version is already in use in many organizations including

 WYSS Center, Geneve Switzerland  Renier Lab, ICM Brain and Spine Institute, Paris, France  Chung's Lab, MIT, Boston, USA  Adam Glaser, Washington University, Seattle, USA  Tomer Lab, Columbia University, New York, USA  Deisseroth Lab, Stanford University, CA, USA

 Several users already expressed interest for the CUDA version.

TeraStitcher Brain Cell Finder Vaa3D-TeraFly TeraConverter

March 26, 2018 26

slide-23
SLIDE 23

University Campus Bio-Medico of Rome

Credits (1/2): image processing, bioinformatics, machine learning

TeraStitcher Vaa3D-TeraFly Brain cell finder

  • Prof. Giulio Iannello

Alessandro Bria Leonardo Onofri

  • Prof. Paolo Frasconi

Paolo Soda Hanchuan Peng Roberto Cortini

March 26, 2018 27

slide-24
SLIDE 24

University Campus Bio-Medico of Rome

Credits (2/2): confocal light sheet microscopy, sample processing

  • Prof. Francesco Pavone

Leonardo Sacconi Ludovico Silvestri Irene Costantini

International Center of Computational Neurophotonics

March 26, 2018 28

slide-25
SLIDE 25

Credits: Jonathan Liu and Adam Glaser, Univ. of Washington have built the microscope and acquired the data the samples were prepared by: Joshua Vaughan, University of Washington (the expanded brain), Rusty Nicovich, Allen Institute (the entire brain slice), Michael Gerner, University of Washington (mouse lymph node and mouse lung)

slide-26
SLIDE 26

March 26, 2018 30

slide-27
SLIDE 27

Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder

slide-28
SLIDE 28

University Campus Bio-Medico of Rome

Brain cell finder (1/5): overview

 Developed for confocal light sheet microscopy (CLSM) images [7]  Applied to localize and count the Purkinje cells in the cerebellum of an L7-GFP mouse  First complete map of a selected neuronal population in a large area of the mouse brain

Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder

March 26, 2018 32

slide-29
SLIDE 29

Brain cell finder (2/5): semantic deconvolution

 Boosts weak somata and decreases the

voxel intensities in non-soma regions

 Use of a neural network (NN) trained to

map the original image into an ideal one

10 substacks covering different regions were manually annotated to feed data to the NN

Original image Ideal image (reference) Image filtered by the trained neural network

training testing

Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder

March 26, 2018 33

slide-30
SLIDE 30

University Campus Bio-Medico of Rome

Brain cell finder (3/5): mean shift clustering

 First we get rid of very dark voxels

which are unlikely to be part of a soma

 Seeds are chosen as local maxima

exceeding a maximum entropy-based

  • ptimal threshold

 The mean shift algorithm then:

 Places a spherical kernel of radius R on

each seed

 Shifts each point towards the mean

value computed as the kernel- weighted average of the data

 Repeats

the previous steps until convergence is achieved

Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder

March 26, 2018 34

slide-31
SLIDE 31

University Campus Bio-Medico of Rome

Brain cell finder (4/5): manifold filter

 Exploiting anatomical knowledge

 Cells are not scattered randomly in

the 3D space but are laid out in manifolds

 Isolated

  • r
  • ff-manifold

cells are discarded

Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder

March 26, 2018 35

slide-32
SLIDE 32

University Campus Bio-Medico of Rome

Brain cell finder (5/5): results

 Performances of the pipeline were estimated

  • n 56 substacks (~1.09 GigaVoxels) manually

annotated with Vaa3D

  • F1 = 0.96

 A second human operator annotated the same

56 substacks and achieved F1 = 0.98

 Processing the whole mouse cerebellum (120

GigaVoxels) yielded cells, which agrees with stereology data

Vaa3D-TeraFly TeraStitcher TeraConverter Brain Cell Finder

March 26, 2018 36