

  1. A FRAMEWORK FOR STATISTICAL ANALYSIS OF MASS SPECTROMETRY IMAGING EXPERIMENTS Kylie Bemis Purdue University Department of Statistics

  2. OUTLINE • Statement of the problem • Biotechnological problem • Statistical and computational problem • Statement of contributions • Open-source software • matter: Rapid prototyping with data on disk • Cardinal: Statistical toolbox for mass spectrometry imaging experiments • Statistical methods • Spatial shrunken centroids • Evaluation and case studies • Summary • Conclusions • Future work 2


  4. MASS SPECTROMETRY IMAGING Investigate the spatial distribution of analytes • Scan the sample with a laser/spray • Collect mass spectra • Reconstruct ion images • Data “cube” (x, y, m/z) [Figure: ion images; credit R. Graham Cooks and lab]

  5. BIOTECHNOLOGICAL PROBLEM • Rapidly advancing technology • Increasing mass resolutions • Greater mass accuracy and range • More features (larger P) • Increasing spatial resolutions • Approaching 1 µm resolution • More pixels (larger N) • More complex experiments • 3D experiments • Time-course experiments • Increasing sample size • More biological replicates • More pixels (larger N) 5

  6. STATISTICAL & COMPUTATIONAL PROBLEM • Complex, high-dimensional data • Spatial x, y dimensions • Potentially z, t dimensions • Mass spectral features (m/z values) • Correlation structures • Spatial (and possibly temporal) • Between mass spectral features • Increasing mass+spatial resolutions • Larger(-than-memory) datasets • Can range from 100 MB to 100 GB • Experimental design • Variation across samples+slides • What counts as a replicate? 6

  7. PROBLEM STATEMENT • Biotechnological problem • Mass spectrometry (MS) imaging has advanced at a rapid pace • Computational tools have not advanced at a comparable pace • Lack of free, open-source tools for statistical analysis • Need for classification/segmentation with statistical inference: • Classification : Classify pixels based on their mass spectral profiles into pre-defined classes (such as healthy/disease status) • Segmentation : Assign pixels to newly discovered segments with relatively homogeneous and distinct mass spectral profiles • Select a subset of informative mass spectral features • Statistical and computational problem • MS imaging experiments result in complex, high-dimensional datasets • Spatial structure in datasets with large P and large N • Statistical computing on larger-than-memory data is a challenge


  9. STATEMENT OF CONTRIBUTIONS • Statistical methods: spatial shrunken centroids • Classification and segmentation for MS imaging experiments • Probabilistic model using spatial information • Selection of most informative mass spectral features • Open-source software: Cardinal • Free, open-source R package for MS imaging experiments • Full pipeline including processing, visualization, and statistical analysis • For experimentalists, provides accessible statistical methods • For statisticians, provides infrastructure for method development • Open-source software: matter • Free, open-source R package for rapid prototyping with data on disk • Flexible statistical computing and method development for larger-than-memory datasets • Enables Cardinal to scale to high-resolution, high-throughput MS imaging experiments • Evaluation and case studies • Public datasets and reproducible results in CardinalWorkflows • Community impact of this work 9


  11. PROBLEM: LARGER-THAN-MEMORY DATA challenges statistical method development • MS imaging experiments rapidly advancing • Increasing mass and spatial resolutions • Larger sample sizes, multiple files • Growing data size poses difficulty for statistics • Need to test methods on larger-than-memory data • Need to work with domain-specific formats • Current R solutions are inflexible [Figures: ion images at m/z = 715.03 for t = 4, 8, and 11; screenshot of the Cardinal help Google group] 11

  12. CONTRIBUTION: MATTER open-source statistical computing with data on disk • Work with larger-than-memory datasets on disk in R • Emphasizes flexibility with a minimal memory footprint • Adaptable to more datasets than bigmemory and ff • Potentially slower computation • Designed for statistical method development in R • Rapid prototyping with minimal additional effort • Works with many existing algorithms • Efficient calculation of summary statistics • Infrastructure for statistical computing on large data [Diagram: a matter object maps atoms 1–6 onto regions of files 1–3 in storage] 12
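As a minimal sketch of the idea (the `matter_mat()` constructor is from the matter package, but its argument names have varied across releases), an on-disk matrix can be created and summarized without ever loading it whole:

```r
library(matter)

# Create a small on-disk matrix backed by a temporary file
# (toy dimensions; real MS imaging matrices are far larger)
x <- matter_mat(data = rnorm(1000), nrow = 100, ncol = 10)

# Summary statistics are computed chunk-by-chunk from disk,
# so peak memory use stays small regardless of data size
cm <- colMeans(x)
length(cm)  # one mean per column
```

Because `x` behaves like an ordinary R matrix, existing code that only needs subscripting and arithmetic can often run on it unchanged.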

  13. NEED TO WORK WITH MS IMAGING FILES e.g., “processed” and “continuous” imzML • Open-source format for MS imaging experiments • XML metadata file defines binary data file structure • Binary data schema is incompatible with bigmemory and ff • Prefer to avoid additional file conversion • Need random access into different parts of the file • Often one-sample-per-file • Need to seamlessly work with multiple files in an experiment • Each file can be very large • matter solves these problems [Diagram: a “processed” imzML binary file stores a UUID then one mzArray and intensityArray per spectrum; a “continuous” imzML file stores a UUID, a single shared mzArray, and one intensityArray per spectrum] 13
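In Cardinal, reading an imzML experiment goes through matter under the hood; a sketch (`readMSIData()` is Cardinal's reader, and the file path here is only a placeholder):

```r
library(Cardinal)

# Parse the XML metadata and attach the binary data file via matter;
# only the metadata is held in memory
# ("experiment.imzML" is a placeholder path, not a real file)
mse <- readMSIData("experiment.imzML")

# Spectra are then pulled from disk on demand
spectra(mse)[, 1:5]  # intensities of the first five pixels
```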

  14. FLEXIBLE ACCESS TO DATA ON DISK any binary format, any file structure • User-defined file structure • Data can come from anywhere • Any part of a file • Any combination of files • Representation in R can be different from on disk • Access as ordinary R vector/matrix • No need to worry about data size or memory management [Diagram: a matter matrix assembled from columns A, C, F, and H, drawn from two files that each hold metadata plus four columns] 14

  15. EXAMPLE: LINEAR REGRESSION with a 1.2 GB simulated dataset and biglm • 1.2 GB dataset • N = 15,000,000 observations • P = 9 variables • Linear regression using the biglm package, designed specifically for large datasets

                         Memory Used   Memory Overhead   Time
    R matrices + lm      7 GB          1.4 GB            33 sec
    R matrices + biglm   2.7 GB        1.3 GB            158 sec
    bigmemory + biglm    1.7 GB        397 MB            21 sec
    matter + biglm       466 MB        319 MB            42 sec
  15
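A hedged sketch of the matter + biglm combination benchmarked above (the chunk boundaries, variable names, and toy sizes are illustrative, not from the benchmark; matter also ships its own big-model-fitting methods whose exact interface may differ):

```r
library(matter)
library(biglm)

# Toy on-disk dataset: one response column plus 9 predictors
n <- 1000; p <- 9
x <- matter_mat(data = rnorm(n * (p + 1)), nrow = n, ncol = p + 1)

# Helper: pull a chunk of rows into memory as a named data frame
get_chunk <- function(rows) {
  df <- as.data.frame(x[rows, , drop = FALSE])
  names(df) <- c("y", paste0("x", 1:p))
  df
}

fm <- as.formula(paste("y ~", paste(paste0("x", 1:p), collapse = " + ")))

# biglm keeps only O(p^2) state, so the fit is updated
# chunk-by-chunk while matter reads each chunk from disk
idx <- split(seq_len(n), rep(1:10, each = n / 10))
fit <- biglm(fm, data = get_chunk(idx[[1]]))
for (i in idx[-1]) fit <- update(fit, get_chunk(i))
coef(fit)  # intercept plus 9 slopes
```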

  16. EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS with a 1.2 GB simulated dataset and irlba • 1.2 GB dataset • N = 15,000,000 observations • P = 10 variables • PCA using the irlba package, which is not specifically designed for large datasets

                         Memory Used   Memory Overhead   Time
    R matrices + svd     3.6 GB        2.4 GB            62 sec
    R matrices + irlba   2.3 GB        961 MB            9 sec
    bigmemory + irlba    3.5 GB        962 MB            9 sec
    matter + irlba       522 MB        427 MB            171 sec
  16
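Correspondingly, a sketch of a truncated decomposition of an on-disk matrix with irlba (the `mult=` hook comes from older irlba releases and is deprecated in newer ones, where passing the matrix-like object directly may suffice; centering is skipped here, so this is a truncated SVD rather than a full PCA):

```r
library(matter)
library(irlba)

# Toy on-disk matrix standing in for the 15,000,000 x 10 benchmark data
x <- matter_mat(data = rnorm(1000 * 10), nrow = 1000, ncol = 10)

# irlba only needs matrix-vector products, which matter evaluates
# chunk-wise against the file on disk rather than in memory
sv <- irlba(x, nv = 3, mult = function(a, b) a %*% b)
sv$d  # top three singular values
```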

  17. EXAMPLE: PRINCIPAL COMPONENTS ANALYSIS with a 2.85 GB microbial time-course experiment • 3D microbial time-course (Oetjen et al., Gigascience, 2015) • 2.85 GB on disk • 17,672 pixels • 40,299 features • 418 sec per PC • 234 MB to compute 3 PCs • 79 MB memory overhead [Figures: ion images at m/z 262 and PC1 scores/loadings at t = 4, 8, and 11] 17

  18. EXAMPLE: VISUALIZATION of a 26.45 GB mouse pancreas experiment • 3D mouse pancreas (Oetjen et al., Gigascience, 2015) • 26.45 GB on disk • 497,225 pixels • 13,312 features • Cannot load at all without matter • 1.25 GB used in-memory • 223 MB to calculate the mean spectrum [Figures: ion images at m/z 5086, 3121, and 3922; mean spectrum] 18
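The mean spectrum shown on this slide is just a feature-wise mean over all pixels. A sketch with toy dimensions, assuming spectra are stored as columns of a features-by-pixels matter matrix:

```r
library(matter)

# Toy stand-in for a features-by-pixels intensity matrix
# (the real dataset is 13,312 features by 497,225 pixels, 26.45 GB)
x <- matter_mat(data = runif(200 * 50), nrow = 200, ncol = 50)

# rowMeans() streams over the pixels on disk, touching each chunk
# once, so memory use stays far below the total data size
mean_spectrum <- rowMeans(x)
plot(mean_spectrum, type = "l",
     xlab = "m/z index", ylab = "mean intensity")
```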

