Scaling R and Bioconductor to support methods for single-cell genomic analysis

Peter Hickey • Department of Biostatistics, Johns Hopkins University • @PeteHaitch • www.peterhickey.org

Oh my, what big data you have! [Figure; size annotations: 8 MB, 800 MB, 80 GB] • Not just single-cell data • Svensson V, Vento-Tormo R, Teichmann SA. Moore’s Law in Single Cell Transcriptomics. arXiv, 2017. Available: http://arxiv.org/abs/1704.01379

More data, more software • + 42 preprints • + 15 without publication • Data from Luke Zappia (https://github.com/Oshlack/scRNA-tools)

More data, more software • https://github.com/seandavi/awesome-single-cell • > 80 software packages • https://github.com/Oshlack/scRNA-tools • Spreadsheet with description of > 120 software packages • Even within Bioconductor, lots of data structures

SingleCellExperiment: a Bioconductor class for single-cell data • Extends SummarizedExperiment

SummarizedExperiment [Figure: SummarizedExperiment schematic, including experiment metadata]

SingleCellExperiment: a Bioconductor class for single-cell data • Davide Risso & Aaron Lun • Extends SummarizedExperiment • Adds slots for common single-cell data and operations • Spike-ins • Dimensionality reductions • Available on Bioconductor devel branch • Popular single-cell analysis packages are migrating to add support • scater • scran • MAST • zinbwave

That’s all lovely, but I’ve got BIG DATA • Yeah, sorta • Most single-cell genomics data are sparse data • 10X Genomics 1 million neuron scRNA-seq • HDF5 file • 30,000 rows (genes), 1.3 million columns (cells) • 93% zero • 136 GB as an ordinary matrix • Sparse Matrix • Limited to < 2^31 – 1 non-zero elements • Integer matrix stored as double • Aaron Lun demonstrated analysis on desktop with 8 GB RAM
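The sizes on this slide can be checked with some back-of-the-envelope arithmetic. The sketch below is in Python for illustration only and uses the rounded dimensions quoted above; the 4-byte integer and 8-byte double element sizes are assumptions. With these rounded numbers a dense integer matrix comes to roughly 145 GiB, in the same range as the 136 GB quoted (which presumably reflects the exact dimensions), and the count of non-zero entries exceeds the 2^31 – 1 limit mentioned for the sparse representation.

```python
# Rounded dimensions and sparsity quoted on the slide.
n_genes = 30_000
n_cells = 1_300_000
percent_zero = 93

# Total number of matrix elements, and how many of them are non-zero.
n_elements = n_genes * n_cells
n_nonzero = n_elements * (100 - percent_zero) // 100

# Dense storage cost, assuming 4-byte integers or 8-byte doubles.
GIB = 2**30
dense_int_gib = n_elements * 4 / GIB
dense_double_gib = n_elements * 8 / GIB

# Limit on the number of non-zero elements, as stated on the slide.
max_nonzero = 2**31 - 1
fits_in_sparse = n_nonzero <= max_nonzero

print(f"elements:           {n_elements:,}")
print(f"non-zero elements:  {n_nonzero:,}")
print(f"dense, 4-byte ints: {dense_int_gib:.0f} GiB")
print(f"dense, doubles:     {dense_double_gib:.0f} GiB")
print(f"within sparse limit: {fits_in_sparse}")
```

So neither an ordinary matrix nor a single in-memory sparse matrix is a comfortable fit for data of this size, which motivates the alternative backends on the following slides.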

DelayedArray: For all your array-like needs • Hervé Pagès • DelayedArray is to arrays as tibble is to tables • Familiar matrix API • [ • dim() • t() • log() • … • But operations are delayed until data are explicitly realised • Data can be stored in a variety of backends • Works as an assay in a SummarizedExperiment (and derived classes)
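The idea of delaying operations until the data are realised can be illustrated with a toy class. This is a conceptual Python sketch, not how the DelayedArray package is implemented; the class name `ToyDelayedMatrix` and its methods `map`, `t` and `realize` are hypothetical. Each operation returns a new object that only records what should happen, and the work is done when `realize()` is called.

```python
import math


class ToyDelayedMatrix:
    """Toy delayed matrix: a seed plus a queue of pending operations."""

    def __init__(self, seed, ops=()):
        self._seed = seed        # list of rows; never modified
        self._ops = tuple(ops)   # operations recorded but not yet applied

    def map(self, fn):
        # Record an element-wise operation without touching the data.
        return ToyDelayedMatrix(self._seed, self._ops + (("map", fn),))

    def t(self):
        # Record a transpose without touching the data.
        return ToyDelayedMatrix(self._seed, self._ops + (("t", None),))

    def realize(self):
        # Apply the recorded operations, in order, to a copy of the seed.
        data = [list(row) for row in self._seed]
        for kind, fn in self._ops:
            if kind == "map":
                data = [[fn(x) for x in row] for row in data]
            else:
                data = [list(col) for col in zip(*data)]
        return data


counts = ToyDelayedMatrix([[0, 1, 3], [7, 0, 0]])
delayed = counts.map(math.log1p).t()   # nothing is computed here
result = delayed.realize()             # log1p and transpose happen now
```

Because the seed is never modified, the same stored data can back several different delayed views, and the seed could just as well live on disk as in memory.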

Backends • In-memory • matrix (base) • Matrix (Matrix) • RleArray (DelayedArray) • Rle = run length encoding • On-disk • HDF5 (HDF5Array) • Data are in an HDF5 file, keep them in an HDF5 file • matter (matterArray) • Kylie A. Bemis (Northeastern University)
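Run length encoding stores each run of repeated values as a (value, length) pair, which suits data with long stretches of zeros. A minimal Python sketch of the idea follows; the function names are hypothetical and this is not the RleArray implementation.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


column = [0, 0, 0, 0, 5, 5, 0, 0, 0, 1]
runs = rle_encode(column)
restored = rle_decode(runs)
```

The encoding only saves space when runs are long; a column with few repeats needs more storage encoded than raw, since each run carries both a value and a length.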

Backends

Class/backend                     Package        Size in memory   Size on disk
DelayedArray with matrix          base           800 MB           0 MB
DelayedArray with dgCMatrix       Matrix         951 MB           0 MB
RleMatrix (solid)                 DelayedArray   1001 MB          0 MB
RleMatrix (chunked)               DelayedArray   634 MB           0 MB
HDF5Array (default compression)   HDF5Array      < 10 kB          165 MB
matter                            matterArray    < 10 kB          800 MB

• Fairly straightforward to add new backends

DelayedMatrixStats • Me • Inspired by matrixStats (Henrik Bengtsson, CRAN) • Functions for column and row operations on DelayedMatrix (2D DelayedArray) objects • colSums2(), rowSums2() • colMeans2(), rowMeans2() • colSds(), rowSds() • colLogSumExps(), rowLogSumExps() • … (33 more methods) • Idea: Support matrixStats API for DelayedMatrix and derived classes • Aim 1: General methods to work on arbitrary DelayedMatrix • Aim 2: Optimised methods for specific backends
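As an illustration of the kind of column-wise summary listed above, here is a Python sketch of a log-sum-exp per column, that is, log(sum(exp(x))) computed in a numerically stable way by factoring out the column maximum. It is a conceptual stand-in, not the matrixStats or DelayedMatrixStats code; the function names are hypothetical and each column is assumed to be non-empty.

```python
import math


def log_sum_exp(xs):
    """Stable log(sum(exp(x))) for a non-empty sequence of numbers."""
    m = max(xs)
    if math.isinf(m):
        # All values are -inf, or at least one is +inf.
        return m
    # Subtracting the maximum keeps every exponent at or below zero.
    return m + math.log(sum(math.exp(x - m) for x in xs))


def col_log_sum_exps(matrix):
    """Apply log_sum_exp to each column of a list-of-rows matrix."""
    return [log_sum_exp(col) for col in zip(*matrix)]


log_values = [[1000.0, 0.0],
              [1000.0, 0.0]]
col_results = col_log_sum_exps(log_values)
```

A naive `math.log(sum(math.exp(x) for x in xs))` would overflow on the first column, whereas the shifted version returns 1000 + log(2).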

beachmat • Aaron Lun, Mike Smith, Hervé Pagès • Unified C++ API for (most) DelayedMatrix backends • get_col(), get_row() • set_col(), set_row() • Currently: matrix, Matrix, RleMatrix, HDF5Matrix
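The value of a unified accessor API is that an algorithm written once against `get_col()` and `get_row()` runs unchanged on any backend. The sketch below shows this in Python with two hypothetical toy backends; beachmat itself is a C++ API, and none of the class names or constructor arguments here are taken from it.

```python
class DenseBackend:
    """Toy dense backend: values stored as a list of rows."""

    def __init__(self, rows):
        self.rows = rows
        self.nrow = len(rows)
        self.ncol = len(rows[0]) if rows else 0

    def get_col(self, j):
        return [row[j] for row in self.rows]

    def get_row(self, i):
        return list(self.rows[i])


class ColumnSparseBackend:
    """Toy sparse backend: one {row_index: value} dict per column."""

    def __init__(self, nrow, cols):
        self.cols = cols
        self.nrow = nrow
        self.ncol = len(cols)

    def get_col(self, j):
        return [self.cols[j].get(i, 0) for i in range(self.nrow)]

    def get_row(self, i):
        return [col.get(i, 0) for col in self.cols]


def col_sums(backend):
    """Column sums written only against the shared accessor interface."""
    return [sum(backend.get_col(j)) for j in range(backend.ncol)]


dense = DenseBackend([[0, 2, 0], [4, 0, 0]])
sparse = ColumnSparseBackend(2, [{1: 4}, {0: 2}, {}])
dense_sums = col_sums(dense)
sparse_sums = col_sums(sparse)
```

Both calls return the same sums even though the storage differs, so the analysis code does not need to know which backend it was given.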

restfulSE • Vincent Carey • Proof-of-concept • HDF5 server backed SummarizedExperiment • Data live on remote server, stored in HDF5 file • RESTful API • Data returned as binary (better) or JSON • No server-side computation (yet)

Key points • Starting point for a lot of genomics data analysis is an array of numbers • Bioconductor strength is semantically rich data structures for array-like data • SummarizedExperiment -> SingleCellExperiment • Assay data doesn’t have to be an ordinary array • Supporting general array-like data with DelayedArray and different backends • DelayedMatrixStats, beachmat, HDF5 Server • R/BioC’s strength is supporting interactive exploratory data analysis, rich data structures, interoperability

Links and contact • Packages : • https://bioconductor.org/packages/SingleCellExperiment/ • https://bioconductor.org/packages/DelayedArray/ • https://bioconductor.org/packages/HDF5Array/ • https://bioconductor.org/packages/beachmat/ • https://bioconductor.org/packages/matter/ • https://github.com/PeteHaitch/matterArray • https://github.com/PeteHaitch/DelayedMatrixStats • https://github.com/vjcitn/restfulSE • Slides : http://peterhickey.org/presentations/ • GitHub & Twitter : @PeteHaitch
