6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 12: Predicting gene expression and splicing Prof. Manolis Kellis Slides credit: David Gifford, et al http://mit6874.github.io


  2. Today: Predicting gene expression and splicing
     0. Review: Expression, unsupervised learning, clustering
     1. Up-sampling: predict 20,000 genes from 1000 genes
     2. Compressive sensing: Composite measurements
     3. DeepChrome+LSTMs: predict expression from chromatin
     4. Predicting splicing from sequence: 1000s of features
     5. Unsupervised deep learning: Restricted Boltzmann machines
     6. Multi-modal programs: Expr+DNA+miRNA RBMs (Liang)

  3. RNA-Seq: De novo tx reconstruction / quantification
     • Microarray technology:
       – Synthesize DNA probe array, complementary hybridization
       – Variations: one long probe per gene; many short probes per gene; tiled k-mers across genome
       – Advantage: can focus on small regions, even if few molecules / cell
     • RNA-Seq technology:
       – Sequence short reads from mRNA, map to genome
       – Variations: count reads mapping to each known gene; reconstruct transcriptome de novo in each experiment
       – Advantage: digital measurements, de novo

  4. Expression Analysis Data Matrix
     • Measure 20,000 genes in 100s of conditions
     • Each experiment measures expression of thousands of ‘spots’, typically genes
     • Study resulting matrix
     • Figure: matrix of m genes × n experiments (Condition 1, Condition 2, Condition 3, …). Labels: expression profile of a gene; gene similarity questions; experiment similarity questions

  5. Clustering vs. Classification
     • Goal of Clustering: Group similar items that likely come from the same category, and in doing so reveal hidden structure
       – Unsupervised learning
       – Independent validation of groups that emerge
     • Goal of Classification: Extract features from the data that best assign new elements to ≥1 of well-defined classes
       – Supervised learning
       – Known classes
     • Figure (both panels: Conditions × Genes; Alizadeh, Nature 2000): Chronic lymphocytic leukemia; B-cell genes in blood cell lines; Proliferation genes in transformed cell lines; Lymph node genes in diffuse large B-cell lymphoma (DLBCL)

  6. PCA, Dimensionality reduction

  7. Geometric interpretation of SVD
     • $Mx = M(x) = U(S(V^{*}(x)))$: rotation ($V^{*}$), then scaling ($S$), then rotation ($U$)
     • Figure: the example transformation $M$ is a shearing

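  8. Low-rank Approximation
     • Solution via SVD: $A_k = U \,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, V^T$ (set smallest r−k singular values to zero)
     • Column notation: $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ (sum of k rank-1 matrices)
     • Error: $\min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$; in the spectral norm the same $A_k$ gives $\|A - A_k\|_2 = \sigma_{k+1}$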
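
As a rough numerical check of the low-rank approximation on this slide, the sketch below truncates the SVD of a random matrix and compares the approximation error against the discarded singular values. The matrix, its size, and the helper name `low_rank_approx` are made up for illustration; only `numpy.linalg.svd` and `numpy.linalg.norm` are real library calls.

```python
import numpy as np

def low_rank_approx(A, k):
    """Return the rank-k truncated-SVD approximation of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keeping the k largest singular values is the same as zeroing the rest.
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))      # toy "genes x conditions" matrix (made-up data)
k = 5
A_k, s = low_rank_approx(A, k)

frob_err = np.linalg.norm(A - A_k, "fro")   # Frobenius-norm error
spec_err = np.linalg.norm(A - A_k, 2)       # spectral-norm error

print(frob_err, np.sqrt(np.sum(s[k:] ** 2)))  # these two should agree
print(spec_err, s[k])                         # and so should these
```

The Frobenius error equals the root sum of squares of the discarded singular values, while the spectral-norm error equals the first discarded singular value.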

  9. PCA of MNIST digits

  10. t-SNE of MNIST digits (figure: clusters labeled by digit, 0 to 9)

  11. t-SNEs of single-cell Brain data
      • Brain hippocampus sub-structures: CA1, CA2-4, Subiculum, Dentate Gyrus (DG)
      • scRNA-seq in 48 individuals, 84k cells (Nature, 2019)
      • scATAC-seq of 262k cells across 7 brain regions
      • 16 Sz / 16 BP / 16 Controls, 300k cells

  12. Autoencoder: dimensionality reduction with neural net
      • Tricking a supervised learning algorithm to work in unsupervised fashion
      • Feed input as output function to be learned. But! Constrain model complexity
      • Pretraining with RBMs to learn representations for future supervised tasks. Use RBM output as “data” for training the next layer in stack
      • After pretraining, “unroll” RBMs to create deep autoencoder
      • Fine-tune using backpropagation [Hinton et al., 2006]
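
The "feed input as output, but constrain model complexity" idea can be sketched with a single-hidden-layer autoencoder trained by plain backpropagation. This is a minimal illustration only: it skips the RBM pretraining and unrolling described on the slide, and the toy data, layer sizes, learning rate, and step count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples in 10-D that lie near a 2-D subspace (made-up data).
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))

n_in, n_hidden = X.shape[1], 2            # bottleneck narrower than the input
W1 = 0.1 * rng.normal(size=(n_in, n_hidden))
W2 = 0.1 * rng.normal(size=(n_hidden, n_in))

def forward(X):
    H = np.tanh(X @ W1)                   # encoder: compress to the bottleneck
    return H, H @ W2                      # decoder: reconstruct the input

def mse(X):
    return float(np.mean((forward(X)[1] - X) ** 2))

loss_before = mse(X)
lr = 0.01
for _ in range(2000):
    H, X_hat = forward(X)
    G = 2.0 * (X_hat - X) / X.size        # d(loss)/d(X_hat); the target is the input
    grad_W2 = H.T @ G
    grad_H = G @ W2.T
    grad_W1 = X.T @ (grad_H * (1.0 - H ** 2))   # tanh'(a) = 1 - tanh(a)^2
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
loss_after = mse(X)

print(loss_before, loss_after)
```

The only "supervision" is the input itself; the 2-unit bottleneck is what forces a low-dimensional representation.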

  14. 1. Up-sampling gene expression patterns

  15. Challenge: Measure few values, infer many values
      • Digital signal upscaling
        – Interpolating low-pass filter (e.g. FIR finite impulse response)
        – Low-dim. capture of higher-dim. signal
        – Nyquist rate (discrete) / freq. (contin.)
      • Image up-scaling (https://arxiv.org/pdf/1902.06068.pdf)
        – Inverse of convolution (de-convolution)
        – Transfer learning from corpus of images
        – Low-dim. re-projection to high-dim. img
      • Gene expression measurements
        – Measure 1000 genes, infer the rest
        – Rapid, cheap, reference assay
        – Apply to millions of conditions
      • Which 1000 genes? Compressed sensing
        – Measure few combinations of genes
        – Better capture high-dimensional vector

  16. Deep Learning architectures for up-sampling images
      • Architectures (figure panels): Pre-sampling super-resolution (SR); Post-sampling SR; Progressive up-sampling; Iterative up-and-down sampling
      • Representation/abstract learning
        – Enables compression, re-upscaling, denoising
        – Example: autoencoder bottleneck. High-low-high
        – Modification: de-compression, up-scaling, low-high only

  17. D-GEX - Deep Learning for up-scaling L1000 gene expression • Multi-task Multi-Layer Feed-Forward Neural Net • Non-linear activation function (hyperbolic tangent) • Input: 943 genes, Output: 9520 targets (partition to fit in memory)
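
To make the shapes on this slide concrete, here is a hedged sketch of a multi-task feed-forward pass with a tanh hidden layer: 943 landmark genes in, 9520 target genes out, one shared loss over all targets. The hidden width, weight initialisation, and the names `predict` and `multitask_mse` are hypothetical, not taken from D-GEX itself, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LANDMARK, N_TARGET = 943, 9520   # sizes quoted on the slide
N_HIDDEN = 256                     # hypothetical; not given in the slides

params = {
    "W1": rng.normal(scale=0.01, size=(N_LANDMARK, N_HIDDEN)),
    "b1": np.zeros(N_HIDDEN),
    "W2": rng.normal(scale=0.01, size=(N_HIDDEN, N_TARGET)),
    "b2": np.zeros(N_TARGET),
}

def predict(x, p):
    h = np.tanh(x @ p["W1"] + p["b1"])   # shared non-linear hidden layer
    return h @ p["W2"] + p["b2"]         # linear output: one regression per target gene

def multitask_mse(x, y, p):
    # A single loss averaged over every target gene (the "multi-task" part).
    return float(np.mean((predict(x, p) - y) ** 2))

x = rng.normal(size=(8, N_LANDMARK))     # 8 made-up expression profiles
y = rng.normal(size=(8, N_TARGET))       # made-up targets
y_hat = predict(x, params)
loss = multitask_mse(x, y, params)
print(y_hat.shape, loss)
```

All target genes share the hidden representation, which is what distinguishes this from fitting 9520 independent regressions.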

  18. D-GEX outperforms Linear Regression or K-nearest-Neighbors • Lower error than LR or KNN • Training rapidly converges • Strictly better for nearly all genes • Deeper = better However: performance still not great, computational limitations

  20. 2. Composite measurements for compressed sensing

  21. Key insight: Composite measurements better capture modules • Sparse Module Activity Factorization (SMAF)
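
The composite-measurement idea can be illustrated with a deliberately simplified sketch: expression is modelled as a module dictionary times a sparse activity vector, each measurement is a weighted sum of many genes, and the profile is recovered from far fewer measurements than genes. This is not SMAF itself: here the dictionary `U` is assumed known and recovery uses ordinary least squares rather than sparse recovery, and all sizes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_modules, n_measurements = 1000, 20, 40   # toy sizes (assumptions)

U = rng.normal(size=(n_genes, n_modules))           # module dictionary (assumed known)
w = np.zeros(n_modules)
active = rng.choice(n_modules, size=3, replace=False)
w[active] = rng.uniform(1.0, 2.0, size=3)           # only a few modules are active
x = U @ w                                           # full 1000-gene expression profile

A = rng.normal(size=(n_measurements, n_genes))      # each row: one weighted gene combination
y = A @ x                                           # 40 composite measurements

# Recover module activities from the composite measurements, then the profile.
w_hat, *_ = np.linalg.lstsq(A @ U, y, rcond=None)
x_hat = U @ w_hat
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(rel_err)
```

In this noise-free toy setting 40 measurements recover all 1000 genes because the profile lives in a 20-dimensional module space; with an unknown dictionary or noisy measurements a sparse solver would be needed.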

  22. Making composite measurements in practice • Combinations of probes + barcodes for measurement • More consistent signal-to-noise ratios

  24. 3. Predicting Expression from Chromatin

  25. Can we predict gene expression from chromatin information? • DNA methylation vs. gene expression • Promoters: high. Gene body: low

  26. Strong enhancers (+H3K27ac) vs. weak enhancers (H3K4me1 only)

  27. DeepChrome: positional histone features predictive of expression
      • Positional information for each mark
      • Convolution, pooling, drop-out, Multi-Layer-Perceptron (MLP) alternating lin/non-linear
      • Outperforms previous methods
      • Meaningful features selected
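
The layer sequence named on this slide (convolution, pooling, drop-out, MLP) can be sketched as a forward pass over a marks × bins input. Everything numeric here is an assumption for illustration: the input size, filter count and width, pool size, hidden width, the binary high/low output, and the random untrained weights; it is not the published DeepChrome configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MARKS, N_BINS = 5, 100            # hypothetical input size
N_FILTERS, WIDTH, POOL = 8, 10, 5   # hypothetical layer sizes

def conv1d(x, filters):
    """Valid 1-D convolution along bins; each filter spans all marks."""
    n_f, _, width = filters.shape
    n_out = x.shape[1] - width + 1
    out = np.empty((n_f, n_out))
    for j in range(n_out):
        window = x[:, j:j + width]                      # marks x width
        out[:, j] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out

def max_pool(h, size):
    n = (h.shape[1] // size) * size                     # drop the ragged tail
    return h[:, :n].reshape(h.shape[0], -1, size).max(axis=2)

def forward(x, p, drop_rate=0.0):
    h = np.maximum(conv1d(x, p["filters"]), 0.0)        # convolution + ReLU
    h = max_pool(h, POOL).ravel()                       # pooling, then flatten
    if drop_rate > 0.0:                                 # drop-out (training only)
        keep = rng.random(h.shape) >= drop_rate
        h = h * keep / (1.0 - drop_rate)
    h = np.maximum(h @ p["W1"] + p["b1"], 0.0)          # MLP hidden layer
    logit = h @ p["W2"] + p["b2"]
    return 1.0 / (1.0 + np.exp(-logit))                 # assumed output: P(high expression)

n_flat = N_FILTERS * ((N_BINS - WIDTH + 1) // POOL)
p = {
    "filters": rng.normal(scale=0.1, size=(N_FILTERS, N_MARKS, WIDTH)),
    "W1": rng.normal(scale=0.1, size=(n_flat, 16)), "b1": np.zeros(16),
    "W2": rng.normal(scale=0.1, size=16), "b2": 0.0,
}
x = rng.poisson(2.0, size=(N_MARKS, N_BINS)).astype(float)  # toy read counts
prob = forward(x, p)
print(prob)
```

Because each filter slides along the bin axis, the positional information for each mark is preserved until the pooling step.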

  28. AttentiveChrome: Selectively attend to specific marks/positions
      • Attention: LSTM (Long short-term memory) module
      • Hierarchical LSTM modules: interactions across marks
      • Attention focuses on specific positions for specific marks
      • Consistent improvement over DeepChrome
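
How attention "focuses on specific positions" can be shown with a small soft-attention sketch: each position gets a score, the scores are normalised with a softmax, and the weighted sum summarises the mark. The LSTM that would produce the per-position states is omitted; `H` and `query` are random stand-ins, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(H, query):
    """H: positions x dim hidden states; query: dim-sized context vector."""
    scores = H @ query             # one relevance score per position
    alpha = softmax(scores)        # attention weights, non-negative, sum to 1
    return alpha, alpha @ H        # weighted summary of all positions

rng = np.random.default_rng(0)
H = rng.normal(size=(100, 16))     # stand-in for LSTM states at 100 bins of one mark
query = rng.normal(size=16)        # context vector (learned in practice, random here)
alpha, summary = attend(H, query)
print(int(np.argmax(alpha)), summary.shape)
```

The weights `alpha` are what one would inspect to see which bins of a mark the model attends to; a second attention level over marks would follow the same pattern.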

  30. 4. Predicting splicing from sequence

  31. Deciphering tissue-specific splicing code
      • Figure: alternatively spliced exon (exon2) between exon1 and exon3, with 300 nt windows
      • Feature set: RNA feature extraction of known motifs and transcript structure in target exon and adjacent exons
      • Inputs: RNA features and tissue type
      • Splicing code: 3-class softplus prediction model with outputs q_inc, q_exc, q_nc
      • Targets:
        – Exon inclusion: t_inc = 1, t_exc = 0, t_nc = 0
        – Exon exclusion: t_inc = 0, t_exc = 1, t_nc = 0
      [Barash et al., 2010]

  32. Bayesian neural network splicing code
      • Data: 1014 RNA features x 3665 exons
      • Output: 4 mouse tissues, each with 3 classes (i.e., 12 output units)
      • Bayesian neural network:
        – # hidden units follows Poisson(λ)
        – Network weights follow spike-and-slab prior Bern(1 − α)
        – Likelihood is cross-entropy
        – Network weights are sampled from the posterior
      [Xiong et al., 2011]

  33. Predicts disease-causing mutations from splicing code [Xiong et al., 2011]

  34. Predicts disease-causing mutations from splicing code
      Scoring splicing changes due to SNP, $\Delta\psi$:
      • Train splice code model on 10,689 exons to predict the 3 splicing classes over 16 human tissues using 1393 sequence features (motifs & RNA structures)
      • Score both the reference $\psi^{\mathrm{ref}}$ and alternative $\psi^{\mathrm{alt}}$ sequences harboring one of the 658,420 common variants
      • Calculate $\Delta\psi_t = \psi_t^{\mathrm{ref}} - \psi_t^{\mathrm{alt}}$ over each tissue t
      • Obtain largest absolute or aggregate $\Delta\psi_t$ to score effects of SNPs
      [Xiong et al., 2011]
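
The variant-scoring step can be sketched in a few lines, assuming a trained splicing model has already produced per-tissue ψ predictions for the reference and alternative sequences. The ψ values below are invented, the function name `score_variant` is hypothetical, and treating "aggregate" as a plain sum over tissues is an assumption; the sign convention (reference minus alternative) follows the slide.

```python
import numpy as np

def score_variant(psi_ref, psi_alt):
    """Per-tissue delta-psi plus two summary scores for one variant."""
    delta = np.asarray(psi_ref, dtype=float) - np.asarray(psi_alt, dtype=float)
    largest = delta[np.argmax(np.abs(delta))]   # signed value with the largest magnitude
    aggregate = delta.sum()                     # assumed aggregate: sum over tissues
    return delta, largest, aggregate

# Made-up predictions for three tissues.
psi_ref = [0.9, 0.8, 0.5]
psi_alt = [0.4, 0.75, 0.6]
delta, largest, aggregate = score_variant(psi_ref, psi_alt)
print(delta, largest, aggregate)
```

Keeping the sign of the largest change indicates whether the variant is predicted to decrease or increase exon inclusion in the most affected tissue.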
