STATISTICAL ANALYSIS OF MASS SPECTROMETRY IMAGING EXPERIMENTS



SLIDE 1

A FRAMEWORK FOR STATISTICAL ANALYSIS OF MASS SPECTROMETRY IMAGING EXPERIMENTS

Kylie Bemis

Purdue University, Department of Statistics

SLIDE 2

OUTLINE

  • Statement of the problem
      • Biotechnological problem
      • Statistical and computational problem
  • Statement of contributions
  • Open-source software
      • matter: Rapid prototyping with data on disk
      • Cardinal: Statistical toolbox for mass spectrometry imaging experiments
  • Statistical methods
      • Spatial shrunken centroids
  • Evaluation and case studies
  • Summary
      • Conclusions
      • Future work


SLIDE 4

MASS SPECTROMETRY IMAGING

Investigate spatial distribution of analytes

  • Scan with laser/spray
  • Collect mass spectra
  • Reconstruct ion images
  • Data “cube”

Image credit: R. Graham Cooks and lab

SLIDE 5

BIOTECHNOLOGICAL PROBLEM

  • Rapidly advancing technology
  • Increasing mass resolutions
      • Greater mass accuracy and range
      • More features (larger P)
  • Increasing spatial resolutions
      • Approaching 1 µm resolution
      • More pixels (larger N)
  • More complex experiments
      • 3D experiments
      • Time-course experiments
  • Increasing sample size
      • More biological replicates
      • More pixels (larger N)

SLIDE 6

STATISTICAL & COMPUTATIONAL PROBLEM

  • Complex, high-dimensional data
      • Spatial x, y dimensions
      • Potentially z, t dimensions
      • Mass spectral features (m/z values)
  • Correlation structures
      • Spatial (and possibly temporal)
      • Between mass spectral features
  • Increasing mass and spatial resolutions
      • Larger(-than-memory) datasets
      • Can range from 100 MB to 100 GB
  • Experimental design
      • Variation across samples and slides
      • What counts as a replicate?

SLIDE 7

PROBLEM STATEMENT

  • Biotechnological problem
      • Mass spectrometry (MS) imaging has advanced at a rapid pace
      • Computational tools have not advanced at a comparable pace
      • Lack of free, open-source tools for statistical analysis
      • Need for classification/segmentation with statistical inference:
          • Classification: Classify pixels based on their mass spectral profiles into pre-defined classes (such as healthy/disease status)
          • Segmentation: Assign pixels to newly discovered segments with relatively homogeneous and distinct mass spectral profiles
          • Select a subset of informative mass spectral features
  • Statistical and computational problem
      • MS imaging experiments result in complex, high-dimensional data
      • Spatial structure in datasets with large P and large N
      • Statistical computing on larger-than-memory data is a challenge

SLIDE 9

STATEMENT OF CONTRIBUTIONS

  • Statistical methods: spatial shrunken centroids
      • Classification and segmentation for MS imaging experiments
      • Probabilistic model using spatial information
      • Selection of most informative mass spectral features
  • Open-source software: Cardinal
      • Free, open-source R package for MS imaging experiments
      • Full pipeline including processing, visualization, and statistical analysis
      • For experimentalists, provides accessible statistical methods
      • For statisticians, provides infrastructure for method development
  • Open-source software: matter
      • Free, open-source R package for rapid prototyping with data-on-disk
      • Flexible statistical computing and method development for larger-than-memory datasets
      • Enables Cardinal to scale to high-resolution, high-throughput MS imaging experiments
  • Evaluation and case studies
      • Public datasets and reproducible results in CardinalWorkflows
      • Community impact of this work


SLIDE 11

PROBLEM: LARGER-THAN-MEMORY DATA

challenges statistical method development

[Figure: 3D ion images (x, y, z) at m/z = 715.03 for t = 4, t = 8, and t = 11]

  • MS imaging experiments rapidly advancing
      • Increasing mass and spatial resolutions
      • Larger sample sizes, multiple files
  • Growing data size poses difficulty for statistics
      • Need to test methods on larger-than-memory data
      • Need to work with domain-specific formats
      • Current R solutions are inflexible

Cardinal help Google group

SLIDE 12

CONTRIBUTION: MATTER

open-source statistical computing with data on disk

[Figure: storage (File 1, File 2, File 3) mapped to a matter object made of Atom 1 through Atom 6]

  • Work with larger-than-memory datasets on disk in R
  • Emphasizes flexibility with a minimal memory footprint
  • Adaptable to more datasets than bigmemory and ff
  • Potentially slower computation
  • Designed for statistical method development in R
  • Rapid prototyping with minimal additional effort
  • Works with many existing algorithms
  • Efficient calculation of summary statistics
  • Infrastructure for statistical computing on large data

SLIDE 13

NEED TO WORK WITH MS IMAGING FILES

e.g., “processed” and “continuous” imzML

[Figure: binary file layouts. “Processed” imzML: UUID, then mzArray 1, intensityArray 1, mzArray 2, intensityArray 2, mzArray 3, intensityArray 3. “Continuous” imzML: UUID, one mzArray, then intensityArray 1 through intensityArray 5]

  • Open-source format for MS imaging experiments
  • XML metadata file defines binary data file structure
  • Binary data schema is incompatible with bigmemory and ff
  • Prefer to avoid additional file conversion
  • Need random access into different parts of the file
  • Often one-sample-per-file
  • Need to seamlessly work with multiple files in an experiment
  • Each file can be very large
  • matter solves these problems

SLIDE 14

FLEXIBLE ACCESS TO DATA ON DISK

any binary format, any file structure

[Figure: File 1 (Metadata, Columns A to D) and File 2 (Metadata, Columns E to H) feeding a matter matrix made of Columns A, C, F, and H]

  • User-defined file structure
  • Data can come from anywhere
      • Any part of a file
      • Any combination of files
  • Representation in R can be different from on disk
  • Access as ordinary R vector/matrix
  • No need to worry about data size or memory management
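The idea of assembling one matrix from arbitrary parts of several binary files can be illustrated in base R. This is a minimal sketch and not matter's implementation or API: the helpers `write_file` and `read_column`, the header sizes, and the column layout are all hypothetical, chosen only to mirror the figure on this slide.

```r
# Hypothetical helper: write a fake metadata header followed by
# a numeric matrix stored column by column as 8-byte doubles.
write_file <- function(path, header_bytes, mat) {
  con <- file(path, "wb")
  on.exit(close(con))
  writeBin(raw(header_bytes), con)
  writeBin(as.double(mat), con)
}

# Hypothetical helper: read a single column by seeking to its byte
# offset, so only n_rows values are held in memory at a time.
read_column <- function(path, header_bytes, n_rows, j) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = header_bytes + (j - 1) * n_rows * 8)
  readBin(con, what = "double", n = n_rows, size = 8)
}

set.seed(1)
n_rows <- 100
file1 <- matrix(rnorm(n_rows * 4), n_rows, 4)  # stands in for columns A to D
file2 <- matrix(rnorm(n_rows * 4), n_rows, 4)  # stands in for columns E to H
path1 <- tempfile()
path2 <- tempfile()
write_file(path1, header_bytes = 16, file1)
write_file(path2, header_bytes = 32, file2)

# "Virtual matrix": columns A and C from file 1, F and H from file 2.
layout <- data.frame(
  path = c(path1, path1, path2, path2),
  header_bytes = c(16, 16, 32, 32),
  column = c(1, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Column means computed one column at a time, straight from disk.
col_means <- vapply(seq_len(nrow(layout)), function(k) {
  mean(read_column(layout$path[k], layout$header_bytes[k],
                   n_rows, layout$column[k]))
}, numeric(1))
```

The `layout` table plays the role of the user-defined file structure: the representation seen in R (four columns of one matrix) is decoupled from where the bytes live on disk.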

SLIDE 15

EXAMPLE: LINEAR REGRESSION

with a 1.2 GB simulated dataset and biglm

Method               Memory Used   Memory Overhead   Time
R matrices + lm      7 GB          1.4 GB            33 sec
R matrices + biglm   2.7 GB        1.3 GB            158 sec
bigmemory + biglm    1.7 GB        397 MB            21 sec
matter + biglm       466 MB        319 MB            42 sec

[Figure: bar chart of memory used (MB) and memory overhead (MB) for each method]

  • 1.2 GB dataset
      • N = 15,000,000 observations
      • P = 9 variables
  • Linear regression
      • Using biglm package
      • Specifically for large datasets
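Why a regression can run with far less memory than the dataset itself can be shown with a small base R sketch. It accumulates the cross-products X'X and X'y one chunk at a time and then solves the normal equations. This is a simplified stand-in for illustration and is not claimed to be what biglm implements; the chunk size and the simulated data are assumptions.

```r
set.seed(42)
n <- 10000
p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix with intercept
beta_true <- c(2, -1, 0.5, 3)
y <- drop(X %*% beta_true) + rnorm(n)

chunk_size <- 1000
XtX <- matrix(0, ncol(X), ncol(X))
Xty <- numeric(ncol(X))

for (start in seq(1, n, by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1, n)
  # In a real out-of-memory setting this chunk would be read from disk.
  X_chunk <- X[idx, , drop = FALSE]
  XtX <- XtX + crossprod(X_chunk)
  Xty <- Xty + drop(crossprod(X_chunk, y[idx]))
}

# Only a (p + 1) x (p + 1) matrix and a length (p + 1) vector were kept.
beta_chunked <- drop(solve(XtX, Xty))

# Reference fit with everything in memory.
beta_full <- unname(coef(lm(y ~ X - 1)))
```

The two coefficient vectors should agree to numerical precision, while the chunked version never needs more than one chunk of rows at a time.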

SLIDE 16

EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS

with a 1.2 GB simulated dataset and irlba

Method               Memory Used   Memory Overhead   Time
R matrices + svd     3.6 GB        2.4 GB            62 sec
R matrices + irlba   2.3 GB        961 MB            9 sec
bigmemory + irlba    3.5 GB        962 MB            9 sec
matter + irlba       522 MB        427 MB            171 sec

[Figure: bar chart of memory used (MB) and memory overhead (MB) for each method]

  • 1.2 GB dataset
      • N = 15,000,000 observations
      • P = 10 variables
  • PCA
      • Using irlba package
      • Not specifically for large datasets

SLIDE 17

EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS

with a 2.85 GB microbial time-course experiment

Oetjen et al., GigaScience, 2015

[Figure: 3D ion images (x, y, z) at m/z 262 and PC1 scores for t = 4, t = 8, and t = 11, with PC1 loadings]

  • 3D microbial time-course
      • 2.85 GB on disk
      • 17,672 pixels
      • 40,299 features

  • 234 MB to compute 3 PCs
  • 79 MB memory overhead
  • 418 sec per PC

SLIDE 18

EXAMPLE: VISUALIZATION

of a 26.45 GB mouse pancreas experiment

Oetjen et al., GigaScience, 2015

[Figure: 3D ion images (x, y, z) at m/z 5086, m/z 3121, and m/z 3922, with the mean spectrum]

  • 3D mouse pancreas
      • 26.45 GB on disk
      • 497,225 pixels
      • 13,312 features

  • 1.25 GB used in-memory
  • 223 MB to calculate mean spectrum

cannot load at all without matter


SLIDE 20

PROBLEM: LACK OF SOFTWARE

for statistical analysis of MS imaging experiments

  • Few free, open-source tools exist
      • Most incapable of handling large datasets from multiple files
      • Lack of extensibility by statisticians and computer scientists
  • Little focus on statistical analysis and experimental design
      • Focus on visualization of molecular ion images and mass spectra
      • Some computational algorithms without statistical inference
  • MSiReader
      • Free, open-source
      • Requires MATLAB
  • SCiLS
      • Commercial, proprietary
      • Requires Bruker instruments

SLIDE 21

CONTRIBUTION: CARDINAL

open-source statistical software for MS imaging

  • K. D. Bemis, A. Harry, L. S. Eberlin, C. Ferreira, S. M. van de Ven, P. Mallick, M. Stolowitz, O. Vitek. “Cardinal: an R package for statistical analysis of mass spectrometry-based imaging experiments”. Bioinformatics, 31:2418, 2015
  • Free, open-source
  • R-based
  • Available on Bioconductor
  • Source code on GitHub

www.cardinalmsi.org

  • >1,800 unique downloads since public release on April 17, 2015
  • Winner of the 2015 John M. Chambers Statistical Software Award
  • Last release on May 4, 2016 with Bioconductor 3.3

SLIDE 22

SOFTWARE FOR MSI EXPERIMENTS

focus on experiments, not just datasets

  • Format support
      • imzML (continuous & processed) and Analyze 7.5
  • Visualization
      • Plotting of mass spectra and molecular images
  • Spectral processing
      • Normalization, smoothing, baseline reduction, peak picking
  • Image processing
      • Contrast enhancement, spatial smoothing
  • Statistical analysis
      • PCA, PLS, spatial shrunken centroids (classification & segmentation)

SLIDE 23

EFFICIENT, MODULAR DATA STRUCTURES

  • iSet
      • Virtual class for imaging experiments
  • MSImageSet
      • Mass spectrometry imaging experiments
  • MSImageData
      • Efficient storage of mass spectra and reconstruction of images
  • MSImageProcess
      • Tracks pre-processing applied to mass spectra
  • IAnnotatedDataFrame
      • Tracks pixel-level metadata
  • MIAPE-Imaging
      • Minimum Information About a Proteomics [Imaging] Experiment
  • ResultSet
      • Stores results of statistical analyses on imaging experiments

SLIDE 24

VISUALIZATION TOOLS

Plot top 9 ion images from segmentation (across all segments)

    library(CardinalWorkflows)
    data(cardinal, cardinal_analyses)
    top <- topLabels(cardinal.sscg, model=list(r=1, k=10, s=3), n=9)
    image(cardinal, mz=top$mz, plusminus=0.5,
          normalize.image="linear", contrast.enhance="histogram",
          layout=c(3,3))

SLIDE 25

VISUALIZATION TOOLS

Recreate painting from overlay of ion images

    image(cardinal, mz=c(207.08, 235, 255.25, 265.17, 649.17), plusminus=0.5,
          normalize.image="linear", contrast.enhance="histogram",
          col=c("red", "darkred", "gray", "black", "brown"),
          superpose=TRUE)

SLIDE 26

SPECTRAL AND IMAGE PROCESSING

    smoothSignal(Brain_1, plot=TRUE)
    reduceBaseline(Brain_1, plot=TRUE)
    peakPick(Brain_1, plot=TRUE)

[Figure: three ion images at m/z = 9984.72]

    image(Brain_1, mz=9984.7)
    image(…, contrast.enhance="histogram")
    image(…, smooth.image="gaussian")

SLIDE 27

APPLY FUNCTIONS OVER IMAGES

with pixelApply and featureApply

  • All pre-processing methods in Cardinal
      • Can take user-specified functions for custom processing
      • Are wrappers around pixelApply or featureApply
  • pixelApply and featureApply
      • Allow applying arbitrary functions over imaging experiments
      • Allow conditioning on groups of pixels and/or features

    standardize <- function(x) x / sum(x)

    # TIC normalization
    pixelApply(data, .fun=standardize)

    # Standardize samples
    featureApply(data, .fun=standardize, .pixel.groups=sample)
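What "apply over pixels" and "apply over features, conditioned on pixel groups" mean can be mimicked on a plain matrix in base R. This is only an analogy under stated assumptions, not Cardinal's implementation: the toy matrix `spectra` (features in rows, pixels in columns) and the grouping factor `sample_id` are hypothetical.

```r
# Toy intensity matrix: rows are features (m/z), columns are pixels.
set.seed(7)
spectra <- matrix(rexp(5 * 6), nrow = 5, ncol = 6)
sample_id <- factor(c("a", "a", "a", "b", "b", "b"))  # pixel groups

standardize <- function(x) x / sum(x)

# Over pixels: each column (one mass spectrum) is divided by its
# total ion current, so every spectrum sums to 1.
tic_norm <- apply(spectra, 2, standardize)

# Over features, conditioning on pixel groups: each row (one ion image)
# is standardized separately within each sample.
by_sample <- t(apply(spectra, 1, function(x) {
  ave(x, sample_id, FUN = standardize)
}))
```

After the first call every column of `tic_norm` sums to 1; after the second, every feature sums to 1 within each sample.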

SLIDE 28

ANALYZE BIGGER EXPERIMENTS

using data-on-disk with matter

Oetjen et al., GigaScience, 2015

Work with arbitrarily large datasets from any number of files

    mouse <- readMSIData("3D_Mouse_Pancreas.imzML")
    pData(mouse)$TIC <- pixelApply(mouse, sum)
    image3D(mouse, TIC ~ x * y * z)

[Figure: 3D image (x, y, z) of TIC]

Example: 26.45 GB dataset on a 16 GB laptop


SLIDE 30

PROBLEM: NEED FOR STATISTICAL INFERENCE

incorporating the spatial information in the experiment

  • Few statistical methods being developed for MS imaging
      • Current algorithms do not do statistical inference
      • Feature selection is post-hoc and heuristic
  • Existing statistical methods are inappropriate or inefficient
      • Spatial statistics methods do not yet scale to many features
      • Few other methods can incorporate spatial information

SLIDE 31

CONTRIBUTION: SPATIAL SHRUNKEN CENTROIDS

spatially-aware classification/segmentation with feature selection

  • K. D. Bemis, A. Harry, L. S. Eberlin, C. Ferreira, S. M. van de Ven, P. Mallick, M. Stolowitz, O. Vitek. “Probabilistic segmentation of mass spectrometry images helps select important ions and characterize confidence in the resulting segments”. Molecular & Cellular Proteomics, 2016

[Figure: t-statistics versus m/z (200 to 800) for the brain, liver, and heart segments]

t-statistics show important ions for the brain, heart, and liver segments

  • Combines spatial information & feature selection
      • Spatially-aware distance from spatially-aware clustering (Alexandrov and Kobarg, 2011)
      • Statistical regularization from nearest shrunken centroids (Tibshirani, Hastie, et al., 2003)
  • Improved image classification & segmentation
      • Data-driven selection of appropriate number of segments
      • Selects most important ions for distinguishing class/segment
      • Probability model characterizes uncertainty
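
SLIDE 32

SPATIAL SMOOTHING

from spatially-aware clustering

Considers mass spectra from neighboring pixels (Alexandrov & Kobarg, Bioinformatics, 2011)

Spatially-aware (SA) weights: weights depend on the distance from neighborhood center

$$\alpha_{\delta_i \delta_j} = \exp\left( -\frac{\delta_i^2 + \delta_j^2}{2\sigma^2} \right)$$

Spatially-aware structurally-adaptive (SASA) weights: weights of neighbors also depend on their spectral similarity

$$\alpha_{\delta_i \delta_j}(x_{ijm}, x_{i'j'm'}) = \exp\left( -\frac{\delta_i^2 + \delta_j^2}{2\sigma^2} \right) \cdot \sqrt{\beta_{\delta_i \delta_j}(x_{ijm}) \, \beta_{\delta_i \delta_j}(x_{i'j'm'})}$$

$$\beta_{\delta_i \delta_j}(x_{ijm}) = \exp\left\{ -\frac{1}{2\lambda^2} \left\| x_{(i+\delta_i)(j+\delta_j)m} - x_{ijm} \right\|^2 \right\}$$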
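The SA weights are a Gaussian kernel over pixel offsets, which a few lines of base R can make concrete. This is a sketch under assumptions: the function names `sa_weights` and `beta_weight` are hypothetical, and the default `sigma = (2r + 1) / 4` is a choice made here for illustration, since the slide does not state how σ is set.

```r
# SA weights for a (2r + 1) x (2r + 1) neighborhood.
# The default sigma is an assumption, not given on the slide.
sa_weights <- function(r, sigma = (2 * r + 1) / 4) {
  offsets <- -r:r
  w <- exp(-outer(offsets^2, offsets^2, "+") / (2 * sigma^2))
  dimnames(w) <- list(as.character(offsets), as.character(offsets))
  w
}

# Structural term beta for one neighbor: equal to 1 when the neighbor's
# spectrum matches the center spectrum, and decaying toward 0 as the
# two spectra become less similar.
beta_weight <- function(x_neighbor, x_center, lambda) {
  exp(-sum((x_neighbor - x_center)^2) / (2 * lambda^2))
}

w <- sa_weights(r = 2)
```

The center pixel always gets weight 1 and the weights fall off symmetrically with distance; the SASA weights multiply this kernel by the spectral-similarity term.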

SLIDE 33

FEATURE SELECTION

from nearest shrunken centroids

Tibshirani, Hastie, et al., Statistical Science, 2003

Classification: Start with labeled classes
Segmentation: Initialize segments with spatially-aware clustering

Calculate t-statistics

$$t_{kp} = \frac{\bar{x}_{kp} - \bar{x}_p}{\hat{\tau}_p \cdot \sqrt{\frac{1}{N_k} - \frac{1}{\sum_{k=1}^{K} N_k}}}$$

where $\bar{x}_{kp}$ is the class centroid, $\bar{x}_p$ is the global centroid, and $\hat{\tau}_p$ is the pooled sd

Shrink t-statistics

$$t'_{kp} = \operatorname{sign}(t_{kp}) \left( |t_{kp}| - s \right)_+$$

where $s$ is the shrinkage parameter, and $t_+ = t$ if $t > 0$, and $t_+ = 0$ if $t \le 0$

Calculate shrunken centroids

$$\bar{x}'_{kp} = \bar{x}_p + t'_{kp} \, \hat{\tau}_p \cdot \sqrt{\frac{1}{N_k} - \frac{1}{\sum_{k=1}^{K} N_k}}$$

Uninformative features are removed
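The three steps on this slide (t-statistics, soft thresholding, shrunken centroids) can be traced on simulated data. This is a minimal base R sketch that follows the formulas as written on the slide; the function name `shrunken_centroids`, the pooled within-class sd with denominator N - K, and the toy data are assumptions for illustration, not Cardinal's code.

```r
shrunken_centroids <- function(x, classes, s) {
  classes <- factor(classes)
  N <- nrow(x)
  K <- nlevels(classes)
  Nk <- as.vector(table(classes))
  global <- colMeans(x)                      # global centroid
  centroids <- rowsum(x, classes) / Nk       # class centroids, K x P
  # Pooled within-class standard deviation per feature (assumed form).
  resid <- x - centroids[as.integer(classes), , drop = FALSE]
  tau <- sqrt(colSums(resid^2) / (N - K))
  scale_k <- sqrt(1 / Nk - 1 / N)
  denom <- outer(scale_k, tau)               # K x P
  t_stat <- sweep(centroids, 2, global) / denom
  # Soft thresholding: shrink toward 0 and clip at 0.
  t_shrunken <- sign(t_stat) * pmax(abs(t_stat) - s, 0)
  list(t = t_stat,
       t_shrunken = t_shrunken,
       centroids = sweep(t_shrunken * denom, 2, global, "+"))
}

# Toy data: two classes of 20 pixels, 4 features, and only
# feature 1 differs between the classes.
set.seed(3)
x <- matrix(rnorm(40 * 4), 40, 4)
x[21:40, 1] <- x[21:40, 1] + 5
classes <- rep(c("a", "b"), each = 20)

fit_none <- shrunken_centroids(x, classes, s = 0)
fit_some <- shrunken_centroids(x, classes, s = 4)
fit_all  <- shrunken_centroids(x, classes, s = 1e6)
```

With `s = 0` the shrunken centroids are just the class means; with a very large `s` every shrunken t-statistic is 0 and every class centroid collapses to the global centroid. A feature whose shrunken t-statistic is 0 for all classes no longer helps distinguish them, which is the sense in which it is removed.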

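SLIDE 34

PROPOSAL:

spatial distance to shrunken centroids

Bemis, et al. Molecular & Cellular Proteomics, 2016

Key contribution: Calculate spatially-aware distance to shrunken centroids

$$d(x_{ijm}, \bar{x}'_k) = \sum_{-r \le \delta_i, \delta_j \le r} \alpha_{\delta_i \delta_j}(x_{ijm}) \cdot \left\| x_{(i+\delta_i)(j+\delta_j)m} - \bar{x}'_k \right\|^2$$

where the sum runs over the spatial neighborhood, $\alpha_{\delta_i \delta_j}$ are the SA or SASA weights, $x_{(i+\delta_i)(j+\delta_j)m}$ is the mass spectrum, and $\bar{x}'_k$ is the class centroid

SA weights:

$$\alpha_{\delta_i \delta_j} = \exp\left( -\frac{\delta_i^2 + \delta_j^2}{2\sigma^2} \right)$$

Modified SASA weights:

$$\alpha_{\delta_i \delta_j}(x_{ijm}) = \exp\left( -\frac{\delta_i^2 + \delta_j^2}{2\sigma^2} \right) \cdot \beta_{\delta_i \delta_j}(x_{ijm})$$

$$\beta_{\delta_i \delta_j}(x_{ijm}) = \exp\left\{ -\frac{1}{2\lambda^2} \left\| x_{(i+\delta_i)(j+\delta_j)m} - x_{ijm} \right\|^2 \right\}$$

Allows feature selection + spatial smoothing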
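The spatially-aware distance is a weighted sum of squared distances between the centroid and every spectrum in the pixel's neighborhood. The base R sketch below uses SA weights only and makes several assumptions that the slide does not specify: the function name `spatial_distance`, the image stored as a rows x columns x features array, the default `sigma = (2r + 1) / 4`, and the choice to skip neighbors that fall outside the image.

```r
spatial_distance <- function(img, i, j, centroid, r,
                             sigma = (2 * r + 1) / 4) {
  total <- 0
  for (di in -r:r) {
    for (dj in -r:r) {
      ii <- i + di
      jj <- j + dj
      # Assumption: neighbors outside the image are skipped.
      if (ii < 1 || jj < 1 || ii > dim(img)[1] || jj > dim(img)[2]) next
      alpha <- exp(-(di^2 + dj^2) / (2 * sigma^2))   # SA weight
      total <- total + alpha * sum((img[ii, jj, ] - centroid)^2)
    }
  }
  total
}

# Toy 3 x 3 image with 2 features; only the center pixel is non-zero.
img <- array(0, dim = c(3, 3, 2))
img[2, 2, ] <- c(3, 4)
centroid <- c(0, 0)

d_center <- spatial_distance(img, 2, 2, centroid, r = 1)
d_corner <- spatial_distance(img, 1, 1, centroid, r = 1)
```

The corner pixel matches the centroid exactly, yet its distance is not zero because the bright center pixel lies in its neighborhood. This is how the distance lets neighboring spectra influence a pixel's assignment.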

SLIDE 35

CALCULATING CLASS OR SEGMENT MEMBERSHIP

Tibshirani, Hastie, et al., Statistical Science, 2003

Calculate discriminant scores
Using spatially-aware distance to shrunken centroids

$$D(x_{ijm}, \bar{x}'_k) = \frac{1}{\hat{\tau}_p^2} \, d(x_{ijm}, \bar{x}'_k) - 2 \log \pi_k$$

where $\hat{\tau}_p$ is the pooled sd, $d(x_{ijm}, \bar{x}'_k)$ is the spatial distance, and $\pi_k$ are the prior probabilities

Calculate posterior probabilities of class or segment membership

$$\hat{p}_k(x_{ijm}) = \frac{e^{-\frac{1}{2} D(x_{ijm}, \bar{x}'_k)}}{\sum_{l=1}^{K} e^{-\frac{1}{2} D(x_{ijm}, \bar{x}'_l)}}$$

Assign pixel to class or segment with max posterior probability

Classification: Done
Segmentation: Iterate until no change in segments
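Turning discriminant scores into posterior probabilities is a softmax over -D/2. The base R sketch below assumes the distances have already been scaled by the pooled variance; the function names `discriminant` and `posterior` and the toy numbers are hypothetical. Subtracting the row minimum before exponentiating does not change the result and only guards against underflow.

```r
# d: matrix [pixels, classes] of distances, assumed already scaled
#    by the pooled variance; prior: prior probability of each class.
discriminant <- function(d, prior) {
  sweep(d, 2, 2 * log(prior), "-")
}

# Posterior probabilities from discriminant scores D.
posterior <- function(D) {
  z <- -0.5 * (D - apply(D, 1, min))  # shift each row for stability
  p <- exp(z)
  p / rowSums(p)
}

d <- rbind(c(1.0, 6.0, 9.0),
           c(5.0, 4.5, 5.5))
D <- discriminant(d, prior = c(1, 1, 1) / 3)
prob <- posterior(D)

# Assign each pixel to the class with maximum posterior probability.
segment <- max.col(prob, ties.method = "first")
```

The first pixel is assigned confidently, while the second has three similar probabilities. Being able to see that difference is what the probability model adds over a hard assignment.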

SLIDE 36

IMPROVED SEGMENTATION

from statistical regularization and spatial information

[Figure: segmentations from K-means, PCA + K-means, SA + K-means, SASA + K-means, SA + Shrunken Centroids, and SASA + Shrunken Centroids. Alexandrov & Kobarg, Bioinformatics, 2011: r=2, k=6. Spatial shrunken centroids: r=2, k=20, s=6, 6 segments]

SA = Spatially Aware
SASA = Spatially Aware Structurally Adaptive

SLIDE 37

DATA-DRIVEN MODEL SELECTION

for unsupervised experiments through statistical regularization

[Figure: SA + Shrunken Centroids segmentations for r = 2, k = 20 at s = 0, s = 3, s = 6, and s = 9]

[Figure: predicted # of classes versus shrinkage parameter (s) for r = 1, k = 15; r = 2, k = 15; r = 1, k = 20; r = 2, k = 20]

Empirical relationship exists between sparsity in the # of features and # of segments

r=2, k=20, s=6: 6 segments

SLIDE 38

SELECTION OF MOLECULAR FEATURES

that distinguish each segment for improved interpretability

[Figure: SA + Shrunken Centroids segmentation (r=2, s=6, k=20, 6 segments), t-statistics versus m/z (200 to 800) for the brain and liver segments, and ion images at m/z = 834.5 and m/z = 537.25]

SLIDE 39

VISUALIZE UNCERTAINTY

probabilistic model characterizes uncertainty in segmentation

[Figure: segmentations for three datasets: low noise (MALDI rat), medium noise (DESI mouse), and high noise (MALDI mouse). Panel labels: Reduced to 2 segments, Optimal sparsity, Higher sparsity, Optimal # of segments. Parameter settings shown: r = 2, k = 10, s = 0; r = 2, k = 5, s = 0; r = 3, k = 10, s = 28; r = 3, k = 5, s = 35; r = 2, k = 10, s = 5; r = 2, k = 10, s = 25]

[Figure: three plots of predicted # of classes versus shrinkage parameter (s), for r = 2, k = 5 and r = 2, k = 10; for r = 3, k = 5 and r = 3, k = 10; and for r = 2, k = 5 and r = 2, k = 10]

SLIDE 40

FACILITATES INTERPRETABILITY

for supervised experiments through feature selection and probability

[Figure: ion images at m/z = 885.67 for samples UH0505_12 and UH9812_03 with cancer and normal regions, plus intensity versus m/z and t-statistic versus m/z (200 to 1000) for cancer and normal, all at r = 3, k = 2, s = 20]

r=3, s=20 selected by cross-validation

Graham Cooks and lab, Livia Eberlin


SLIDE 42

EVALUATION AND CASE STUDIES

  • Cardinal and spatial shrunken centroids widely tested
      • Evaluated on both experimental data and controlled standards
      • Public datasets and reproducible results provided in CardinalWorkflows
  • Community support and feedback have been valuable
      • >1,800 users of Cardinal; public feedback is extremely enthusiastic
      • Google help group provides insight into usage and needed improvements

SLIDE 43

CONTROLLED EXAMPLE: CARDINAL PAINTING

Graham Cooks and lab

[Figure: segmentations with SA + Shrunken Centroids and SASA + Shrunken Centroids, shown for r=1, s=3, k=10 and r=2, s=3, k=10. Each method has a plot of predicted # of classes versus shrinkage parameter (s) for r = 1, k = 10; r = 2, k = 10; r = 1, k = 15; r = 2, k = 15]

SLIDE 44

CONTROLLED EXAMPLE: FARMHOUSE PAINTING

Graham Cooks and lab

[Figure: segmentations with SA + Shrunken Centroids and SASA + Shrunken Centroids, shown for r=1, s=3, k=10 and r=2, s=3, k=10. Each method has a plot of predicted # of classes versus shrinkage parameter (s) for r = 1, k = 10; r = 2, k = 10; r = 1, k = 15; r = 2, k = 15]

SLIDE 45

CONTROLLED EXAMPLE: SPOTTED PATTERN TEST

Mark Stolowitz, Stephanie van de Ven

  • S. M. van de Ven, K. D. Bemis, K. Lau, R. Adusumilli, U. Kota, M. Stolowitz, O. Vitek, P. Mallick, S. S. Gambhir. “Protein biomarkers on tissue as imaged via MALDI mass spectrometry: A systematic approach to study the limits of detection”. Proteomics, 2016

Quantify limits of detection in MS imaging experiments

  • Statistical approaches to MS imaging experiments are necessary
  • Cardinal enables systematic study of crucial experimental design questions


SLIDE 47

CONCLUSIONS AND FUTURE WORK

  • Statistical methods: spatial shrunken centroids
      • Regularized classification and segmentation for MS imaging experiments
      • Further investigate relationship between sparsity and number of segments
  • Open-source software: Cardinal
      • Free, open-source statistical software for MS imaging experiments
      • More statistical methods and parallel computation
  • Open-source software: matter
      • Enables statistical method development with larger-than-memory datasets
      • Extension to sparse datasets and “processed” imzML format
  • General conclusions
      • Combined contributions enable scalable statistical methods for MS imaging
      • Development of statistically-focused computational infrastructure alongside new statistical methods is vital in rapidly advancing areas

SLIDE 48

ACKNOWLEDGEMENTS

Purdue EMBL

Theodore Alexandrov

Purdue

Graham Cooks and lab, Livia Eberlin, Kevin Kerian, Christina Ferreira

Stanford

Parag Mallick, Mark Stolowitz, Uma Kota, Stephanie van de Ven, April Harry

  • Advanced BioImaging Systems
  • Canary Center at Stanford for Cancer Early Detection
  • NIH-R21

Olga Vitek

Northeastern

  • NSF-GRFP
  • NSF-SI2-SSE
  • NSF-BIO-DBI
  • Sy and Laurie Sternberg Interdisciplinary Chair

Robert Ness, Meena Choi, Mike Cheng, Ting Huang

SLIDE 49

ADDITIONAL SLIDES

SLIDE 50

SHRUNKEN T-STATISTICS MEASURE IMPORTANCE OF FEATURES IN DISTINGUISHING A SEGMENT

Tibshirani, Hastie, et al., Statistical Science, 2003

Measure difference between segment mean spectrum and overall mean spectrum

[Figure: PIGII_206 mean spectrum (brain), shown as centroid (mean spectrum) intensity versus m/z (750 to 900), with the corresponding t-statistics versus m/z]

  • Proposed: Also use spectra from nearby pixels when comparing to mean spectrum

SLIDE 51

STATISTICAL REGULARIZATION REMOVES UNINFORMATIVE FEATURES

Tibshirani, Hastie, et al., Statistical Science, 2003

Shrink t-statistics toward 0 with a shrinkage penalty (regularization)
Shrink mean spectra accordingly: uninformative features are dropped

[Figure: shrunken t-statistics versus m/z (750 to 900), with the shrunken centroid intensity versus m/z]