6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences
Lecture 12: Predicting gene expression and splicing
- Prof. Manolis Kellis
http://mit6874.github.io
Slides credit: David Gifford, et al
RNA-Seq technology:
– Sequence mRNA fragments, map reads to the genome
– Quantify known gene models, or discover transcripts de novo in each experiment
– Count reads per gene to measure expression
Microarray technology:
– Detect transcripts by complementary hybridization to probes
– Sensitive even if few molecules / cell
Condition 1, Condition 2, Condition 3, …
– Experiment similarity questions
– Gene similarity questions
– Expression profile of a gene
Each experiment measures expression of thousands of genes.
[Heatmap: genes × conditions; Alizadeh, Nature 2000]
[Heatmap: genes × conditions; Alizadeh, Nature 2000]
– Proliferation genes in transformed cell lines
– B-cell genes in blood cell lines
– Lymph node genes in diffuse large B-cell lymphoma (DLBCL)
– Chronic lymphocytic leukemia
Goal of clustering: group similar items that likely come from the same category, and in doing so reveal hidden structure (illustrated below).
Goal of classification: extract features from the data that best assign new elements to one or more well-defined classes.
Known classes: Independent validation
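A minimal sketch of the two goals on synthetic data: unsupervised clustering of genes versus supervised classification of conditions into known classes. All sizes and labels are illustrative, not values from the lecture.

```python
# Minimal sketch: unsupervised clustering vs. supervised classification.
# Synthetic data; all sizes and labels are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # 100 genes x 20 conditions
X[:50] += 2.0                                  # two hidden gene groups

# Clustering: group similar genes, revealing hidden structure (no labels)
gene_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: assign conditions to known classes from labeled examples
y = np.array([0] * 10 + [1] * 10)              # known class per condition
clf = LogisticRegression(max_iter=1000).fit(X.T, y)
print(gene_clusters[:10], clf.score(X.T, y))
```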
Truncated SVD: set the smallest r − k singular values to zero:
$X_k = U \Sigma_k V^T$
Column notation (sum of rank-1 terms): $X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$
Eckart–Young theorem: $X_k$ is the best rank-k approximation of $X$:
$\min_{\operatorname{rank}(\hat X)=k} \|X - \hat X\|_2 = \sigma_{k+1}$, and in Frobenius norm $\min_{\operatorname{rank}(\hat X)=k} \|X - \hat X\|_F = \|X - X_k\|_F$
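A minimal numerical check of the rank-k approximation and the Frobenius-norm identity above; matrix dimensions are illustrative.

```python
# Numerical check of the truncated SVD: zero out the smallest r - k
# singular values and reconstruct X_k; dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))                  # e.g., genes x conditions

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]        # X_k = sum_i sigma_i u_i v_i^T

# Eckart-Young: no rank-k matrix is closer to X in Frobenius norm
err = np.linalg.norm(X - Xk, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))        # the two values coincide
```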
– scRNA-seq in 48 individuals, 84k cells (Nature, 2019)
– Brain hippocampus sub-structures: CA1, subiculum, dentate gyrus (DG), CA2-4
– 16 Sz / 16 BP / 16 controls, 300k cells
– scATAC-seq of 262k cells across 7 brain regions
[Hinton et al., 2006]
– Inverse of convolution (de-convolution)
– Transfer learning from a corpus of images
– Low-dim. re-projection to high-dim. image
https://arxiv.org/pdf/1902.06068.pdf
– Interpolating low-pass filter (e.g., FIR: finite impulse response)
– Low-dim. capture of a higher-dim. signal
– Nyquist rate (discrete) / frequency (continuous)
– Measure 1000 genes, infer the rest
– Rapid, cheap, reference assay
– Apply to millions of conditions
– Measure a few combinations of genes
– Better capture the high-dimensional vector (recovery sketch below)
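A hedged sketch of the composite-measurement idea: observe a few random combinations of genes, then recover the full expression vector by sparse (L1-regularized) regression. All dimensions, the sparsity level, and the lasso penalty are assumptions, not values from the lecture.

```python
# Hedged sketch: composite measurements y = Phi @ x of a sparse
# expression vector x, recovered by L1-regularized regression.
# Dimensions, sparsity, and the penalty are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_genes, n_measurements, n_active = 1000, 100, 10

x = np.zeros(n_genes)                          # mostly-zero expression vector
x[rng.choice(n_genes, n_active, replace=False)] = rng.normal(size=n_active)

Phi = rng.normal(size=(n_measurements, n_genes)) / np.sqrt(n_measurements)
y = Phi @ x                                    # few random gene combinations

x_hat = Lasso(alpha=0.01, max_iter=10000).fit(Phi, y).coef_
print(np.corrcoef(x, x_hat)[0, 1])             # near 1 when recovery succeeds
```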
– Post-sampling SR
– Pre-sampling super-resolution (SR)
– Progressive up-sampling
– Iterative up-and-down sampling
– Enables compression, re-upscaling, denoising
– Example: autoencoder bottleneck (high-low-high); see the sketch below
– Modification: de-compression, up-scaling (low-high)
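A minimal autoencoder sketch of the high-low-high bottleneck idea; layer sizes, input dimension, and the MSE objective are illustrative assumptions.

```python
# Minimal autoencoder sketch of the high-low-high bottleneck;
# layer sizes and training details are illustrative assumptions.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=1000, n_bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                     nn.Linear(256, n_bottleneck))
        self.decoder = nn.Sequential(nn.Linear(n_bottleneck, 256), nn.ReLU(),
                                     nn.Linear(256, n_in))

    def forward(self, x):
        z = self.encoder(x)               # compression (high -> low)
        return self.decoder(z)            # re-upscaling (low -> high)

model = Autoencoder()
x = torch.randn(64, 1000)                 # e.g., 64 expression profiles
loss = nn.functional.mse_loss(model(x), x)
loss.backward()                           # compression/denoising objective
```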
Sparse Module Activity Factorization (SMAF)
Predicting expression from histone marks (histone mark 1, …, mark k as input):
– Perceptron (MLP): alternating linear/non-linear layers
– Long short-term memory (LSTM) modules: interactions across marks (sketch below)
– Specific positions for specific marks
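A hedged sketch of the LSTM idea above: read binned histone-mark signals along a gene and predict its expression level. The bin count, number of marks, and layer sizes are assumptions, not the lecture's exact architecture.

```python
# Hedged sketch: an LSTM reads binned histone-mark signals along a gene
# and predicts expression. Bin count, mark count, and sizes are assumed.
import torch
import torch.nn as nn

class HistoneLSTM(nn.Module):
    def __init__(self, n_marks=5, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_marks, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 1)   # expression level

    def forward(self, x):                  # x: (batch, positions, marks)
        out, _ = self.lstm(x)              # interactions across positions/marks
        return self.head(out[:, -1])       # summary from the final state

model = HistoneLSTM()
x = torch.randn(8, 100, 5)                 # 8 genes, 100 bins, 5 marks
print(model(x).shape)                      # torch.Size([8, 1])
```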
Splicing code:
– Inputs: tissue type; alternatively spliced exon (exon1, exon2, exon3) with four flanking 300-nt regions
– Feature set: known motifs, transcript structure in target exon and adjacent exons
– 3-class softplus prediction model: q_inc, q_exc, q_nc (sketch after this slide)
– Targets: exon inclusion (t_inc=1, t_exc=0, t_nc=0); exon exclusion (t_inc=0, t_exc=1, t_nc=0)
– RNA feature extraction
[Barash et al., 2010]
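A hedged sketch of a 3-class splicing-code predictor with outputs (q_inc, q_exc, q_nc). Normalized softplus scores are one way to realize the softplus model named above; the hidden-layer size and training details are assumptions.

```python
# Hedged sketch of a 3-class splicing-code predictor: RNA features in,
# (q_inc, q_exc, q_nc) out. Normalized softplus outputs are one possible
# reading of the "softplus prediction model"; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplicingCode(nn.Module):
    def __init__(self, n_features=1014, hidden=30):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 3))

    def forward(self, x):
        q = F.softplus(self.net(x))            # positive class scores
        return q / q.sum(dim=1, keepdim=True)  # q_inc + q_exc + q_nc = 1

model = SplicingCode()
x = torch.randn(4, 1014)                   # RNA features for 4 exons
q = model(x)
t = torch.tensor([0, 1, 2, 0])             # targets: inc / exc / nc
loss = F.nll_loss(torch.log(q), t)         # cross-entropy on q
```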
Bayesian neural network:
– 1014 RNA features × 3665 exons
– 4 mouse tissues, each with 3 classes (i.e., 12 output units)
– … follows Poisson(λ); weights follow a spike-and-slab prior Bern(1 − α) (sampling sketch below)
– Predictions averaged over networks sampled from the posterior [Xiong et al., 2011]
[Xiong et al., 2011]
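A minimal sketch of sampling from the spike-and-slab prior above: each weight is exactly zero with probability α (the "spike"), else drawn from a Gaussian "slab". The slab variance is an assumption, not a value from the slides.

```python
# Hedged sketch of a spike-and-slab weight prior: each weight is exactly
# zero with probability alpha and otherwise drawn from a Gaussian slab.
# The slab standard deviation is an assumption.
import numpy as np

rng = np.random.default_rng(2)
alpha, slab_sd, n_weights = 0.8, 1.0, 10000

included = rng.random(n_weights) < (1 - alpha)   # Bern(1 - alpha) inclusion
weights = np.where(included, rng.normal(0, slab_sd, n_weights), 0.0)
print(included.mean(), weights[included][:5])    # ~20% nonzero weights
```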
Scoring splicing changes due to a SNP: $\Delta\psi = \psi_{\mathrm{ref}} - \psi_{\mathrm{alt}}$
– Predicts splicing classes over 16 human tissues using 1393 sequence features (motifs & RNA structures)
– Applied to exons harboring one of the 658,420 common variants (SNPs)
[Xiong et al., 2011]
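A minimal sketch of the Δψ score as defined above; the significance threshold is an illustrative assumption.

```python
# Minimal sketch of the delta-psi score: percent-spliced-in (psi) for the
# reference vs. alternative allele. The threshold is an assumption.
def delta_psi(psi_ref: float, psi_alt: float) -> float:
    """Splicing change attributed to the variant."""
    return psi_ref - psi_alt

def is_splice_altering(psi_ref, psi_alt, threshold=0.05):
    return abs(delta_psi(psi_ref, psi_alt)) > threshold

print(delta_psi(1.0, 0.5), is_splice_altering(1.0, 0.5))   # 0.5 True
```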
Architecture of the new network to predict alternative splicing between two tissues. It contains three hidden layers, with hidden variables that jointly represent genomic features and tissue types. [Leung et al., 2014]
Large number of parameters: (1393 inputs + 13 outputs) × 10 hidden units = 14,060 parameters
– A softplus/Dirichlet multivariate linear regression may achieve similar performance
– Features are pre-defined and thus may not completely reflect the underlying splicing mechanism
[Diagram: biological neuron with dendritic tree, cell body, axon hillock, axon]
Activation function
[Diagram: Boltzmann machine with visible units v1, v2 and hidden units h1, h2]
Boltzmann distribution (exponentiated negative energy): $P(v,h) = e^{-E(v,h)} / Z$, with energy function $E(v,h) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i$
where v = visible units, h = hidden units, w_ij = connection weights, s_i ∈ {0,1} = state of unit i, b_i = bias of unit i; the global $P(v) = \sum_h e^{-E(v,h)} / Z$ is an energy-dependent probability function (toy enumeration below).
Goal: given v, learn weights w_ij to maximize P(v). The Boltzmann machine then becomes a universal approximator.
– Advantages: local learning rules; infer each variable based on its neighbors only; no need for example annotations, no output function.
– Problem: difficult to train; dependencies between hidden units. [Ackley et al., 1985; Le Roux, Bengio, 2008]
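A toy enumeration of the Boltzmann distribution above: a 4-unit machine is small enough to compute the partition function Z exactly. Weights are random, purely for illustration.

```python
# Toy enumeration of the Boltzmann distribution: 4 units (v1, v2, h1, h2),
# symmetric random weights, exact partition function Z.
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 4                                          # units: v1, v2, h1, h2
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                              # symmetric connections
np.fill_diagonal(W, 0)                         # no self-connections
b = rng.normal(size=n)

def energy(s):
    """E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i."""
    s = np.asarray(s, dtype=float)
    return -0.5 * s @ W @ s - b @ s

states = list(itertools.product([0, 1], repeat=n))
Z = sum(np.exp(-energy(s)) for s in states)    # partition function
P = {s: np.exp(-energy(s)) / Z for s in states}
print(sum(P.values()))                         # sums to 1 (up to rounding)
```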
[Diagram: RBM with visible layer v1, v2 (input) and hidden layer h1, h2, h3]
[Hinton and Osindero, 2006]
However: $\langle \hat v_i, \hat h_j \rangle_{\mathrm{model}}$ is still too large to estimate; apply Markov Chain Monte Carlo (MCMC), i.e., Gibbs sampling (CD-1 sketch below).
Objective function: 1. Pre-training learns a sensible set of weights using unlabelled data.
2. Then use the pre-trained weights to perform backpropagation and classify labelled data.
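A hedged sketch of one contrastive-divergence (CD-1) pre-training update, the standard one-step Gibbs approximation to ⟨v_i h_j⟩_model referenced above; biases are omitted, and sizes and learning rate are illustrative.

```python
# Hedged sketch of one CD-1 update: a single Gibbs step stands in for
# <v_i h_j>_model. Biases omitted; sizes and learning rate illustrative.
import numpy as np

rng = np.random.default_rng(4)
n_v, n_h, lr = 6, 3, 0.1
W = 0.01 * rng.normal(size=(n_v, n_h))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.integers(0, 2, size=(1, n_v)).astype(float)   # unlabelled example
ph0 = sigmoid(v0 @ W)                                    # P(h = 1 | v0)
h0 = (rng.random(ph0.shape) < ph0).astype(float)         # sample hidden units
v1 = sigmoid(h0 @ W.T)                                   # reconstruct visibles
ph1 = sigmoid(v1 @ W)

W += lr * (v0.T @ ph0 - v1.T @ ph1)   # <v h>_data - <v h>_model (one step)
```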
[Diagram: network with visible units v1, v2 (input), hidden units h1, h2, h3, and output units a1, a2 with softplus output]
1st column: Sample from generative model with each label clamped on. 2nd column: 20 iterations of alternating Gibbs sampling in associative memory. etc… (Figure 9, Hinton et al., 2006).
Interactive visualization of network learning: http://www.cs.toronto.edu/~hinton/digits.html
RBMs for TCGA cancer integration: expression, miRNAs, methylation
– Hierarchical model integrates: expression, miRNAs, methylation
– Energy function combines multiple data types (sketch below)
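A hedged sketch of one way an energy function can combine multiple data types: each modality gets its own visible layer and weight matrix, all coupled through one shared hidden layer. The additive form is a standard multimodal-RBM assumption, not necessarily the exact model in the slides.

```python
# Hedged sketch of an energy function combining multiple data types:
# each modality (expression, miRNA, methylation) has its own visible
# layer and weight matrix, coupled to one shared hidden layer.
# The additive form is an assumption.
import numpy as np

rng = np.random.default_rng(5)
sizes = {"expression": 8, "mirna": 4, "methylation": 6}
n_h = 5
W = {m: 0.01 * rng.normal(size=(d, n_h)) for m, d in sizes.items()}

def joint_energy(v, h):
    """E(v, h) = -sum over modalities of v_m^T W_m h."""
    return -sum(v[m] @ W[m] @ h for m in sizes)

v = {m: rng.integers(0, 2, d).astype(float) for m, d in sizes.items()}
h = rng.integers(0, 2, n_h).astype(float)
print(joint_energy(v, h))
```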