SLIDE 1

6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences

Lecture 12: Predicting gene expression and splicing

  • Prof. Manolis Kellis

http://mit6874.github.io

Slides credit: David Gifford, et al

SLIDE 2

Today: Predicting gene expression and splicing

  • 0. Review: Expression, unsupervised learning, clustering
  • 1. Up-sampling: predict 20,000 genes from 1000 genes
  • 2. Compressive sensing: Composite measurements
  • 3. DeepChrome+LSTMs: predict expression from chromatin
  • 4. Predicting splicing from sequence: 1000s of features
  • 5. Unsupervised deep learning: Restricted Boltzmann machines
  • 6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang et al.)
SLIDE 3

RNA-Seq: De novo transcript reconstruction / quantification

RNA-Seq technology:

  • Sequence short reads from mRNA, map to the genome
  • Variations:
    – Count reads mapping to each known gene
    – Reconstruct the transcriptome de novo in each experiment
  • Advantage: digital measurements (read counts), de novo discovery

Microarray technology:

  • Synthesize DNA probe array, complementary hybridization
  • Variations:
    – One long probe per gene
    – Many short probes per gene
    – Tiled k-mers across genome
  • Advantage: can focus on small regions, even if few molecules / cell

SLIDE 4

Expression Analysis Data Matrix

  • Measure 20,000 genes in 100s of conditions
  • Study the resulting matrix: m genes × n experiments (Condition 1, Condition 2, Condition 3, …)
  • Each experiment measures the expression of thousands of 'spots', typically genes
  • Questions: experiment similarity, gene similarity, the expression profile of a gene

SLIDE 5

Clustering vs. Classification

  • Goal of Clustering (unsupervised learning): group similar items that likely come from the same category, and in doing so reveal hidden structure
  • Goal of Classification (supervised learning): extract features from the data that best assign new elements to ≥1 of several well-defined classes
  • Known classes give independent validation of the groups that emerge
  • Examples (genes × conditions heatmaps; Alizadeh, Nature 2000): proliferation genes in transformed cell lines; B-cell genes in blood cell lines; lymph node genes in diffuse large B-cell lymphoma (DLBCL); chronic lymphocytic leukemia

SLIDE 6

PCA, Dimensionality reduction

SLIDE 7

Geometric interpretation of SVD

M(x) = U(Σ(V*(x))): rotation (V*), then scaling (Σ), then rotation (U)

SLIDE 8
Low-rank Approximation

  • Solution via SVD: set the smallest r − k singular values to zero:
    A_k = U diag(σ_1, …, σ_k, 0, …, 0) V^T
  • Column notation: sum of k rank-1 matrices:
    A_k = Σ_{i=1}^{k} σ_i u_i v_i^T
  • Error (Eckart–Young): min_{rank(X)=k} ‖A − X‖_2 = ‖A − A_k‖_2 = σ_{k+1}; in Frobenius norm, ‖A − A_k‖_F² = σ_{k+1}² + … + σ_r² (numpy sketch below)
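
A minimal numpy sketch of this low-rank approximation (the matrix A here is a random stand-in, not course data):

```python
import numpy as np

# Rank-k approximation of a matrix via truncated SVD.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))      # toy stand-in (e.g. genes x experiments)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep only the top-k singular values

# Eckart-Young: the spectral-norm error of the best rank-k approximation
# is the first discarded singular value.
print(np.linalg.norm(A - A_k, ord=2), s[k])   # these two values agree
```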
SLIDE 9

PCA of MNIST digits

SLIDE 10

t-SNE of MNIST digits

(figure: 2-D t-SNE embedding with a separated cluster for each digit class)
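
A minimal sketch of both embeddings (assuming scikit-learn is available; it uses the small 8×8 digits set rather than full MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 1797 digit images, 64 features each

X_pca = PCA(n_components=2).fit_transform(X)   # linear projection (SLIDE 9)
X_tsne = TSNE(n_components=2, perplexity=30,   # nonlinear neighbor embedding (SLIDE 10)
              init="pca", random_state=0).fit_transform(X)

# Plotting X_tsne colored by y typically shows tighter, better-separated
# digit clusters than the 2-component PCA projection.
```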

SLIDE 11

t-SNEs of single-cell Brain data

  • scRNA-seq in 48 individuals, 84k cells (Nature, 2019)
  • Brain hippocampus sub-structures: CA1, CA2-4, Subiculum, Dentate Gyrus (DG)
  • 16 Sz / 16 BP / 16 controls, 300k cells
  • scATAC-seq of 262k cells across 7 brain regions

SLIDE 12

Autoencoder: dimensionality reduction with neural net

  • Tricking a supervised learning algorithm into working in an unsupervised fashion
  • Feed the input as the output function to be learned, but constrain model complexity
  • Pretraining with RBMs to learn representations for future supervised tasks: use RBM output as "data" for training the next layer in the stack
  • After pretraining, "unroll" the RBMs to create a deep autoencoder
  • Fine-tune using backpropagation (sketch below)

[Hinton et al., 2006]
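
A minimal PyTorch sketch of the autoencoder idea (layer sizes are illustrative; the original work pre-trained with stacked RBMs rather than training end-to-end from scratch):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=256, n_code=30):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_code))           # low-dimensional bottleneck
        self.decoder = nn.Sequential(
            nn.Linear(n_code, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)                        # stand-in batch of flattened images
loss = nn.functional.mse_loss(model(x), x)     # the input is its own target
loss.backward()
opt.step()
```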

SLIDE 13

Today: Predicting gene expression and splicing

  • 0. Review: Expression, unsupervised learning, clustering
  • 1. Up-sampling: predict 20,000 genes from 1000 genes
  • 2. Compressive sensing: Composite measurements
  • 3. DeepChrome+LSTMs: predict expression from chromatin
  • 4. Predicting splicing from sequence: 1000s of features
  • 5. Unsupervised deep learning: Restricted Boltzmann machines
  • 6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang et al.)
SLIDE 14
  • 1. Up-sampling gene expression patterns

SLIDE 15

Challenge: Measure few values, infer many values

  • Image up-scaling
    – Inverse of convolution (de-convolution)
    – Transfer learning from a corpus of images
    – Low-dim. re-projection to high-dim. image
    https://arxiv.org/pdf/1902.06068.pdf
  • Digital signal upscaling
    – Interpolating low-pass filter (e.g. FIR, finite impulse response)
    – Low-dim. capture of a higher-dim. signal
    – Nyquist rate (discrete) / frequency (continuous)
  • Gene expression measurements
    – Measure 1000 genes, infer the rest
    – Rapid, cheap, reference assay
    – Apply to millions of conditions
  • Which 1000 genes? Compressed sensing
    – Measure a few combinations of genes
    – Better capture the high-dimensional vector

SLIDE 16

Deep Learning architectures for up-sampling images

Four up-sampling strategies: pre-sampling super-resolution (SR), post-sampling SR, progressive up-sampling, iterative up-and-down sampling

  • Representation / abstraction learning
    – Enables compression, re-upscaling, denoising
    – Example: autoencoder bottleneck, high-low-high
    – Modification: de-compression / up-scaling only, low-high
SLIDE 17

D-GEX - Deep Learning for up-scaling L1000 gene expression

  • Multi-task multi-layer feed-forward neural net (sketch below)
  • Non-linear activation function (hyperbolic tangent)
  • Input: 943 landmark genes; Output: 9520 target genes (partitioned to fit in memory)
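
A hedged PyTorch sketch of the D-GEX idea (hidden-layer width and the 4760-gene output partition are illustrative choices, not necessarily the published configuration):

```python
import torch
import torch.nn as nn

# Multi-task regression: every output unit predicts one target gene.
dgex = nn.Sequential(
    nn.Linear(943, 3000), nn.Tanh(),   # hyperbolic tangent, as on the slide
    nn.Linear(3000, 3000), nn.Tanh(),
    nn.Linear(3000, 4760),             # one half of the 9520 targets per partition
)

landmarks = torch.randn(16, 943)       # stand-in batch of landmark-gene profiles
targets = torch.randn(16, 4760)
loss = nn.functional.mse_loss(dgex(landmarks), targets)
```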
SLIDE 18

D-GEX outperforms Linear Regression or K-nearest-Neighbors

  • Strictly better for nearly all genes
  • Lower error than LR or KNN
  • Training rapidly converges
  • Deeper = better

However: performance is still not great, and computational limitations remain

SLIDE 19

Today: Predicting gene expression and splicing

  • 0. Review: Expression, unsupervised learning, clustering
  • 1. Up-sampling: predict 20,000 genes from 1000 genes
  • 2. Compressive sensing: Composite measurements
  • 3. DeepChrome+LSTMs: predict expression from chromatin
  • 4. Predicting splicing from sequence: 1000s of features
  • 5. Unsupervised deep learning: Restricted Boltzmann machines
  • 6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang et al.)
SLIDE 20
  • 2. Composite measurements for compressed sensing

SLIDE 21

Key insight: Composite measurements better capture modules

  • Sparse Module Activity Factorization (SMAF): expression profiles as sparse combinations of gene modules (sketch below)
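
A hedged numpy/scikit-learn sketch of the underlying compressed-sensing idea (this is not the published SMAF algorithm; the module dictionary, measurement design, and sizes are all toy assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_genes, n_modules, n_meas = 1000, 50, 100

U = np.abs(rng.normal(size=(n_genes, n_modules)))       # gene-module dictionary
w = np.zeros(n_modules)
w[rng.choice(n_modules, 3, replace=False)] = 1.0        # few active modules
x = U @ w                                               # true expression vector

Phi = rng.normal(size=(n_meas, n_genes)) / np.sqrt(n_meas)  # composite measurements
y = Phi @ x                                             # 100 pooled readouts of 1000 genes

# Recover sparse module activities, then decompress to all genes.
w_hat = Lasso(alpha=0.01, max_iter=50_000).fit(Phi @ U, y).coef_
x_hat = U @ w_hat
print(np.corrcoef(x, x_hat)[0, 1])                      # near 1 when activity is sparse
```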

SLIDE 22

Making composite measurements in practice

  • Combinations of probes + barcodes for measurement
  • More consistent signal-to-noise ratios
SLIDE 23

Today: Predicting gene expression and splicing

  • 0. Review: Expression, unsupervised learning, clustering
  • 1. Up-sampling: predict 20,000 genes from 1000 genes
  • 2. Compressive sensing: Composite measurements
  • 3. DeepChrome+LSTMs: predict expression from chromatin
  • 4. Predicting splicing from sequence: 1000s of features
  • 5. Unsupervised deep learning: Restricted Boltzmann machines
  • 6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang et al.)
SLIDE 24
  • 3. Predicting Expression from Chromatin
SLIDE 25

Can we predict gene expression from chromatin information?

  • DNA methylation vs. gene expression
  • Promoters: high. Gene body: low
SLIDE 26

Strong enhancers (+H3K27ac) vs. weak enhancers (H3K4me1 only)

SLIDE 27

DeepChrome: positional histone features predictive of expression

  • Positional information for each histone mark (input: binned signal per mark)
  • Outperforms previous methods
  • Meaningful features selected
  • Architecture: convolution, pooling, drop-out, multi-layer perceptron (MLP); alternating linear / non-linear layers (sketch below)
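
A hedged PyTorch sketch of a DeepChrome-style architecture (the mark count, bin count, and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

n_marks, n_bins = 5, 100                     # binned signal for 5 histone marks

deepchrome = nn.Sequential(
    nn.Conv1d(n_marks, 50, kernel_size=10),  # positional filters across all marks
    nn.ReLU(),
    nn.MaxPool1d(5),
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(50 * 18, 625), nn.ReLU(),      # MLP: alternating linear / non-linear
    nn.Linear(625, 2),                       # high vs. low expression
)

signal = torch.randn(8, n_marks, n_bins)     # batch of 8 genes
logits = deepchrome(signal)
```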

SLIDE 28

AttentiveChrome: Selectively attend to specific marks/positions

  • Attention over LSTM (long short-term memory) modules
  • Hierarchical LSTM modules: interactions across marks
  • Consistent improvement over DeepChrome
  • Attention focuses on specific positions for specific marks

SLIDE 29

Today: Predicting gene expression and splicing

  • 0. Review: Expression, unsupervised learning, clustering
  • 1. Up-sampling: predict 20,000 genes from 1000 genes
  • 2. Compressive sensing: Composite measurements
  • 3. DeepChrome+LSTMs: predict expression from chromatin
  • 4. Predicting splicing from sequence: 1000s of features
  • 5. Unsupervised deep learning: Restricted Boltzmann machines
  • 6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang et al.)
SLIDE 30
  • 4. Predicting splicing from sequence
SLIDE 31

Deciphering tissue-specific splicing code

  • Input: an alternatively spliced exon (exon1–exon2–exon3), with 300 nt of sequence flanking each splice site, plus the tissue type
  • Feature set: known motifs, transcript structure in the target exon and adjacent exons (RNA feature extraction)
  • 3-class softplus prediction model: q_inc, q_exc, q_nc
  • Targets: exon inclusion (t_inc=1, t_exc=0, t_nc=0), exon exclusion (t_inc=0, t_exc=1, t_nc=0)

[Barash et al., 2010]

SLIDE 32

Bayesian neural network splicing code

  • Data: 1014 RNA features × 3665 exons; 4 mouse tissues, each with 3 classes (i.e., 12 output units)
  • Bayesian neural network:
    – the number of hidden units follows Poisson(λ)
    – network weights follow a spike-and-slab prior, Bern(1 − α)
    – the likelihood is cross-entropy
    – network weights are sampled from the posterior

[Xiong et al., 2011]

SLIDE 33

Predicts disease-causing mutations from the splicing code

[Xiong et al., 2011]

SLIDE 34

Predicts disease-causing mutations from the splicing code

Scoring splicing changes due to a SNP, ∆ψ (sketch below):

  • Train the splicing code model on 10,689 exons to predict the 3 splicing classes over 16 human tissues, using 1393 sequence features (motifs & RNA structures)
  • Score both the reference (ψ_ref) and the alternative (ψ_alt) sequence harboring one of the 658,420 common variants
  • Calculate ∆ψ_t = ψ_t^ref − ψ_t^alt over each tissue t
  • Take the largest absolute or aggregate ∆ψ_t to score the effect of each SNP

[Xiong et al., 2011]
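
A minimal numpy sketch of the final scoring step (`psi_ref`/`psi_alt` stand in for the model's predicted inclusion levels; this helper is hypothetical, not the authors' code):

```python
import numpy as np

def delta_psi_score(psi_ref: np.ndarray, psi_alt: np.ndarray) -> float:
    """Largest-magnitude per-tissue change in predicted inclusion."""
    d = psi_ref - psi_alt                   # delta-psi_t for each tissue t
    return d[np.argmax(np.abs(d))]

psi_ref = np.array([0.80, 0.75, 0.90])      # toy predictions over 3 tissues
psi_alt = np.array([0.78, 0.40, 0.88])      # same exon with the variant allele
print(delta_psi_score(psi_ref, psi_alt))    # 0.35 -> candidate splicing-disrupting SNP
```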

SLIDE 35

Predicted scores are indicative of disease-causing mutations

SLIDE 36

Predicted scores are indicative of disease-causing mutations

SLIDE 37

Predicted mutations in MLH1,2 in nonpolyposis colorectal cancer patients are validated via RT-PCR

SLIDE 38

Splice code goes deep

Architecture of the new network to predict alternative splicing between two tissues. It contains three hidden layers, with hidden variables that jointly represent genomic features and tissue types. [Leung et al., 2014]

SLIDE 39

Limitations of the splice code model

  • Requires a threshold to define discrete splicing targets
  • Does not take into account exon expression level in specific tissue types
  • A fully connected neural network potentially imposes a large number of parameters: (1393 inputs + 13 outputs) × 10 hidden units ≈ 14,000 parameters
  • Although the authors showed that the neural network performs best, a softplus/Dirichlet multivariate linear regression may achieve similar performance
  • The features are pre-defined and thus may not completely reflect the underlying splicing mechanism
  • Interpretation of the importance of features is not trivial
SLIDE 40

Today: Predicting gene expression and splicing

  • 0. Review: Expression, unsupervised learning, clustering
  • 1. Up-sampling: predict 20,000 genes from 1000 genes
  • 2. Compressive sensing: Composite measurements
  • 3. DeepChrome+LSTMs: predict expression from chromatin
  • 4. Predicting splicing from sequence: 1000s of features
  • 5. Unsupervised deep learning: Restricted Boltzmann machines
  • 6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang et al.)
SLIDE 41
  • 5. Unsupervised deep learning with Restricted Boltzmann Machines (RBMs)

SLIDE 42

How the brain works inspired artificial "neural" networks

  • Biological neuron: dendritic tree, cell body, axon hillock, axon
  • Artificial perceptron: weighted sum z = b + Σ_i x_i w_i, passed through an activation function (sketch below)
  • Neural network: layers of perceptrons (e.g. 4 layers 'deep')
  • Deep multi-layer neural networks can 'learn' almost any function → deep 'unsupervised' learning?
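
A minimal numpy sketch of the perceptron formula above:

```python
import numpy as np

def perceptron(x, w, b):
    z = b + x @ w                        # weighted sum: z = b + sum_i(x_i * w_i)
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation (one common choice)

x = np.array([0.5, -1.0, 2.0])           # inputs
w = np.array([0.4, 0.6, -0.1])           # weights
print(perceptron(x, w, b=0.1))
```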

SLIDE 43

General Boltzmann Machine: Unsupervised learning

  • Symmetrically connected network of visible units v and hidden units h (no target 'output')
  • Each binary unit makes a stochastic on/off decision
  • Network weights learn relationships between variables
  • A configuration dictates its 'energy' E(v, h); at equilibrium, configurations follow the Boltzmann distribution (probability ∝ exponentiated negative energy):
    E = −Σ_i s_i b_i − Σ_{i<j} s_i s_j w_ij, with s_i ∈ {0,1} the state of unit i, b_i its bias, w_ij the connection weights, and P(v) the energy-dependent probability of a visible configuration
  • Goal: given v, learn weights w_ij that maximize P(v); the Boltzmann machine becomes a universal approximator of probability mass functions over discrete variables
  • Advantage: local learning rules; infer each variable from its neighbors only. No need for example annotations, no output function
  • Problem: difficult to train, dependencies between hidden units

[Ackley et al., 1985; Le Roux & Bengio, 2008]

SLIDE 44

Restricted Boltzmann Machine (RBM)

  • Bipartite graph. No hh and no vv connections
  • 1 layer of hidden units, 1 layer of visible units.
  • Simple unsupervised learning module
  • Much easier to train than GBM: no circularities

input v2 h3 h2 h1 v1 hidden

[Hinton and Osindero, 2006] However: <v^

i, h^ j> model still too large to estimate.

 apply Markov Chain Monte Carlo (MCMC) (i.e., Gibbs sampling) Objective function:
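
A hedged numpy sketch of RBM training with one step of Gibbs sampling (contrastive divergence, CD-1); the sizes and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)            # visible biases
        self.b_h = np.zeros(n_hidden)             # hidden biases
        self.lr = lr

    def cd1_step(self, v0):
        # positive phase: <v_i h_j> under the data distribution
        ph0 = sigmoid(v0 @ self.W + self.b_h)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)   # stochastic on/off
        # negative phase: one Gibbs step stands in for <v_i h_j> under the model
        pv1 = sigmoid(h0 @ self.W.T + self.b_v)
        ph1 = sigmoid(pv1 @ self.W + self.b_h)
        # contrastive-divergence update of weights and biases
        n = len(v0)
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b_v += self.lr * (v0 - pv1).mean(axis=0)
        self.b_h += self.lr * (ph0 - ph1).mean(axis=0)

rbm = RBM(n_visible=784, n_hidden=128)
batch = (rng.random((32, 784)) < 0.5).astype(float)   # stand-in binary data
rbm.cd1_step(batch)
```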

SLIDE 45

Stacking RBMs → Deep belief network

  • 1. First apply an RBM to find a sensible set of weights using unlabelled data
  • 2. Then use the pre-trained weights to initialize backpropagation and classify labelled data, adding a softplus output layer on top of the stacked hidden layers (sketch below)
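
A hedged scikit-learn sketch of the stacking idea: each `BernoulliRBM`'s hidden activities become the "data" for the next RBM, and a supervised layer sits on top. (Note this only trains the top layer on labels; full fine-tuning would backpropagate through the whole stack, and the slide's softplus output is replaced here by logistic regression.)

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import minmax_scale

X, y = load_digits(return_X_y=True)
X = minmax_scale(X)                       # RBM visible units expect [0, 1] values

dbn = Pipeline([
    # greedy layer-wise pretraining on unlabelled inputs
    ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    # supervised layer trained on the labels
    ("clf", LogisticRegression(max_iter=1000)),
])
dbn.fit(X, y)
print(dbn.score(X, y))
```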

SLIDE 46

Look into the mind of the network: generative model

1st column: Sample from generative model with each label clamped on. 2nd column: 20 iterations of alternating Gibbs sampling in associative memory. etc… (Figure 9, Hinton et al., 2006).

Interactive visualization of network learning: http://www.cs.toronto.edu/~hinton/digits.html

SLIDE 47

Today: Predicting gene expression and splicing

  • 0. Review: Expression, unsupervised learning, clustering
  • 1. Up-sampling: predict 20,000 genes from 1000 genes
  • 2. Compressive sensing: Composite measurements
  • 3. DeepChrome+LSTMs: predict expression from chromatin
  • 4. Predicting splicing from sequence: 1000s of features
  • 5. Unsupervised deep learning: Restricted Boltzmann machines
  • 6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang et al.)
SLIDE 48
  • 6. Multimodal unsupervised deep learning for data integration with RBMs

SLIDE 49

RBMs for TCGA cancer integration: Expression, miRNAs, Methylation

Hierarchical model integrates:

  • gene expression (GE)
  • miRNA expression (ME)
  • DNA methylation (DM)

Energy function combines multiple data types

SLIDE 50

Learned patient groups show different survival and drug responses

  • Capture independent variables from molecular data