The transcriptome and differential expression - PowerPoint PPT Presentation



SLIDE 1

Computational Systems Biology Deep Learning in the Life Sciences

6.802 6.874 20.390 20.490 HST.506

David Gifford Lecture 7 February 27, 2019

The transcriptome and differential expression

http://mit6874.github.io

SLIDE 2

What’s on tap today!

  • Recap of manifolds, KL divergence, t-SNE gradients

  • The transcriptome

– Exon splicing and isoform expression

  • Differential expression detection

– Embedded models and significance testing
– Multiple hypothesis correction
– Gene set enrichment analysis

  • Exon splicing code

SLIDE 3
  • 1. Manifolds, KL Divergence, KL gradients
SLIDE 4

What is a manifold mapping? Neighborhoods in high dimensional space are preserved in low dimensional space

SLIDE 5

KL Divergence is always non-negative

(Gibbs’ Inequality)

SLIDE 6

We can use gradient methods to find an embedding

SLIDE 7

The overall gradient on yi is the sum of gradients from all other points
SLIDE 8

Gradient between two points is proportional to their displacement

SLIDE 9

We can interpret a pair-wise gradient as a spring

SLIDE 10

We sum all of the gradients for a given point to update its location
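The spring picture can be sketched in plain Python. This is a minimal illustration of the t-SNE gradient, ∂C/∂yi = 4 Σj (pij − qij)(yi − yj)/(1 + ‖yi − yj‖²), not the lecture's implementation; the points and affinities below are made up, and a real t-SNE run would also use momentum and early exaggeration.

```python
import math

def tsne_gradient(Y, P):
    """Gradient of KL(P||Q) w.r.t. 2-D embedding points Y.

    Y: list of 2-D points; P: symmetric high-dimensional affinities.
    Each pairwise term acts like a spring along the displacement yi - yj.
    """
    n = len(Y)
    # Unnormalized Student-t affinities in the embedding: 1 / (1 + d^2)
    num = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((Y[i][k] - Y[j][k]) ** 2 for k in range(2))
                num[i][j] = 1.0 / (1.0 + d2)
    Z = sum(num[i][j] for i in range(n) for j in range(n))
    grads = []
    for i in range(n):
        g = [0.0, 0.0]
        for j in range(n):
            if i == j:
                continue
            q = num[i][j] / Z
            # Spring stiffness (p - q); direction is the displacement,
            # damped by 1/(1 + d^2) so distant points exert weaker forces.
            coeff = 4.0 * (P[i][j] - q) * num[i][j]
            for k in range(2):
                g[k] += coeff * (Y[i][k] - Y[j][k])
        grads.append(g)
    return grads

# One gradient-descent step: each point moves by the sum of its spring forces.
Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
P = [[0.0, 0.3, 0.1], [0.3, 0.0, 0.1], [0.1, 0.1, 0.0]]  # sums to 1
eta = 0.1
G = tsne_gradient(Y, P)
Y = [[y[k] - eta * g[k] for k in range(2)] for y, g in zip(Y, G)]
```

Because P and Q are symmetric, the pairwise forces cancel in pairs, so the centroid of the embedding does not drift during updates.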

SLIDE 11
  • 2. RNA-seq data has ~3,000 – 20,000 gene expression levels per sample

SLIDE 12

RNA-Seq characterizes RNA molecules

[Diagram: a gene with exons A, B, C is transcribed into pre-mRNA or ncRNA, spliced (here to the A–C isoform), and exported from the nucleus to the cytoplasm as mature mRNA]

High-throughput sequencing of RNAs at various stages of processing

Slide courtesy Cole Trapnell

SLIDE 13

RNA-Seq: millions of short reads from fragmented mRNAs

Pepke et al., Nature Methods, 2009

Extract RNA from cells/tissue + splice junctions

SLIDE 14

ET Wang et al. Nature 000, 1-7 (2008) doi:10.1038/nature07509

Pervasive tissue-specific regulation of alternative mRNA isoforms.

SLIDE 15

Sox2

One measure of expression is Reads Per Kilobase of gene per Million reads (RPKM)
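The RPKM computation is a one-liner: divide the read count by the gene length in kilobases and by the library size in millions of reads. The numbers below are hypothetical, chosen only to make the arithmetic easy to check.

```python
def rpkm(reads_for_gene, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    kb = gene_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return reads_for_gene / (kb * millions)

# Hypothetical example: 500 reads on a 2 kb gene in a 10M-read library
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

The double normalization makes expression values comparable across genes of different lengths and across libraries of different sequencing depths.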

SLIDE 16

[Figure: Smug1 locus with reads over exons and junction reads split between exons]

RNA-seq reads map to exons and across exons

SLIDE 17

Aligned reads reveal isoform possibilities

[Diagram: exons A, B, C] Identify candidate exons via genomic mapping, generate possible pairings of exons, and align reads to possible junctions.

Slide courtesy Cole Trapnell

SLIDE 18

We can use mapped reads to learn the isoform mixture y

[Diagram: exons A B C D E with reads assigned to candidate isoforms]

Slide courtesy Cole Trapnell

Isoform fractions: T1 – y1, T2 – y2, T3 – y3, T4 – y4

SLIDE 19

P(Ri | T=Tj) – Excluded reads

If a read pair Ri is structurally incompatible with transcript Tj (for example, it falls within an intron of Tj), then

P(R = Ri | T = Tj) = 0

Slide courtesy Cole Trapnell

SLIDE 20

P(Ri | T=Tj) – Paired-end reads

Assume our library fragments have a length distribution described by a probability density F. The probability of observing a particular paired alignment to transcript Tj, with implied fragment length lj(Ri), is:

P(R = Ri | T = Tj) = F(lj(Ri)) / lj

Slide courtesy Cole Trapnell

SLIDE 21

Estimating Isoform Expression

  • Find expression abundances y1,…,yn for a set of isoforms T1,…,Tn

  • Observations are the set of reads R1,…,Rm
  • Can estimate mRNA expression of each isoform using the total number of reads that map to a gene and y

P(R | Ψ) = ∏_{i=1}^{m} Σ_{j=1}^{n} Ψj · P(R = Ri | T = Tj)

Ψ̂ = argmax_Ψ L(Ψ | R), where L(Ψ | R) ∝ P(R | Ψ) P(Ψ)
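This mixture likelihood can be maximized with a few lines of EM. The sketch below is a toy version under simplifying assumptions: the read/transcript compatibility matrix is made up, and real tools such as Cufflinks fold fragment-length probabilities F(lj(Ri)) and effective transcript lengths into each P(Ri | Tj) entry.

```python
def em_isoform_fractions(compat, n_iters=200):
    """EM estimate of isoform mixture fractions Psi.

    compat[i][j] = P(R_i | T_j): likelihood of read i under transcript j
    (0 when read i is structurally incompatible with transcript j).
    """
    m, n = len(compat), len(compat[0])
    psi = [1.0 / n] * n                      # uniform starting mixture
    for _ in range(n_iters):
        # E-step: responsibility of each transcript for each read
        counts = [0.0] * n
        for row in compat:
            z = sum(psi[j] * row[j] for j in range(n))
            for j in range(n):
                counts[j] += psi[j] * row[j] / z
        # M-step: Psi_j is proportional to its expected read count
        psi = [c / m for c in counts]
    return psi

# Hypothetical compatibility matrix: 3 reads, 2 isoforms.
# Read 0 fits only T1, read 1 fits only T2, read 2 fits both equally.
compat = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
psi = em_isoform_fractions(compat)
print([round(p, 3) for p in psi])  # [0.5, 0.5]
```

The ambiguous read is split fractionally between the two isoforms in proportion to the current Ψ, which is exactly how the latent assignment in the likelihood above gets resolved.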

SLIDE 22
  • 3. The significance of differential expression

SLIDE 23
SLIDE 24
SLIDE 25
SLIDE 26
SLIDE 27
SLIDE 28

What is the right distribution for modeling read counts?

Poisson?

SLIDE 29

σ²ij = μij + sj² · vp( qip(j) )

[Plot of variance vs. mean read count: orange line – DESeq; dashed orange – edgeR; purple – Poisson]

Read count data is overdispersed for a Poisson. Use a Negative Binomial instead.
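Overdispersion is easy to see by simulation. A negative binomial can be sampled as a gamma-Poisson mixture; the sketch below uses a simple var = μ + α·μ² parameterization (not DESeq's per-condition variance function) with made-up parameters, and only stdlib routines.

```python
import math
import random

random.seed(0)

def poisson(lam):
    """Knuth's Poisson sampler (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def neg_binomial(mean, dispersion):
    """Gamma-Poisson mixture: var = mean + dispersion * mean^2."""
    shape = 1.0 / dispersion
    lam = random.gammavariate(shape, mean * dispersion)
    return poisson(lam)

counts = [neg_binomial(mean=5.0, dispersion=0.5) for _ in range(20_000)]
m = sum(counts) / len(counts)
var = sum((c - m) ** 2 for c in counts) / len(counts)
print(m, var)  # variance is well above the mean, unlike a Poisson
```

For a Poisson, variance equals the mean; here the analytic variance is 5 + 0.5·25 = 17.5, which is the overdispersion the slide's plot illustrates for real read counts.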

SLIDE 30

A Negative Binomial distribution is better (DESeq)

  • i – gene or isoform
  • p – condition
  • j – sample (experiment); p(j) – condition of sample j
  • m – number of samples
  • Kij – number of counts for isoform i in experiment j
  • qip – average scaled expression for gene i in condition p

Kij ~ NB( μij, σ²ij )

σ²ij = μij + sj² · vp( qip(j) )

μij = qip(j) · sj

q̂ip = (1 / # of replicates) · Σ (j in replicates) Kij / sj

SLIDE 31
SLIDE 32

Hypergeometric test for gene set overlap significance

N – total # of genes: 1000
n1 – # of genes in set A: 20
n2 – # of genes in set B: 30
k – # of genes in both A and B: 3

P(k) = C(n1, k) · C(N − n1, n2 − k) / C(N, n2)

P(x ≥ k) = Σ (i = k to min(n1, n2)) P(i)

For these values, P(3) = 0.017 and P(x ≥ 3) = 0.020.
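The hypergeometric overlap test can be computed directly with stdlib binomial coefficients, using the slide's numbers:

```python
from math import comb

def hypergeom_pmf(k, N, n1, n2):
    """P(exactly k shared genes) for sets of size n1, n2 drawn from N genes."""
    return comb(n1, k) * comb(N - n1, n2 - k) / comb(N, n2)

def hypergeom_tail(k, N, n1, n2):
    """P(x >= k): one-sided overlap enrichment p-value."""
    return sum(hypergeom_pmf(i, N, n1, n2) for i in range(k, min(n1, n2) + 1))

# Slide's numbers: 1000 genes, |A| = 20, |B| = 30, overlap = 3
print(round(hypergeom_pmf(3, 1000, 20, 30), 3))   # 0.017
print(round(hypergeom_tail(3, 1000, 20, 30), 3))  # 0.020
```

The tail sum, not the point probability, is the p-value to report: it asks how often an overlap at least this large would arise if the two gene sets were drawn independently.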

SLIDE 33

Bonferroni correction

  • Total number of rejections of the null hypothesis over all N tests is denoted by R. Pr(R > 0) ≈ Nα
  • Need to set α' = Pr(R > 0) to the required significance level over all tests. Referred to as the experimentwise error rate.
  • With 100 tests, to achieve an overall experimentwise significance level of α' = 0.05: 0.05 = 100α → α = 0.0005
  • Pointwise significance level of 0.05%.
SLIDE 34

Example - Genome wide association screens

  • Risch & Merikangas (1996).
  • 100,000 genes.
  • Observe 10 SNPs in each gene.
  • 1 million tests of null hypothesis of no association.
  • To achieve an experimentwise significance level of 5%, require a pointwise p-value less than 5 × 10⁻⁸

SLIDE 35

Bonferroni correction - problems

  • Assumes each test of the null hypothesis to be independent.
  • If not true, the Bonferroni correction to the significance level is conservative.
  • Loss of power to reject the null hypothesis.
  • Example: genome-wide association screen across linked SNPs – correlation between tests due to LD between loci.

SLIDE 36

Benjamini-Hochberg

  • Select a False Discovery Rate α
  • Number of tests is m
  • Sort p-values P(k) in ascending order (most significant first)
  • Reject the k most significant tests, where k is the largest index with P(k) ≤ (k/m)α
  • Assumes tests are uncorrelated or positively correlated
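The Benjamini-Hochberg procedure can be sketched in a few lines: sort the p-values, find the largest rank k with P(k) ≤ (k/m)α, and reject the k most significant tests. The p-values below are made up for illustration.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a reject/accept flag per test at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank k (1-based) whose sorted p-value clears (k/m) * alpha
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max most significant tests (in the original order)
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, alpha=0.05))
# [True, True, False, False, False, False, False, False]
```

Note that every test up to rank k_max is rejected even if some intermediate p-value fails its own threshold; that step-up behavior is what distinguishes BH from a per-test cutoff.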
SLIDE 37
  • 4. How can we predict splice isoforms from sequence?

SLIDE 38

[Konarska, Nature, 1985] The spliceosome, assembled from small nuclear ribonucleoproteins (snRNPs), binds the 5ʹ splice site, facilitating base pairing between the 5ʹ end of the intron and the downstream branch sequence to form a lariat. A hydroxyl (OH) group at the 3ʹ end of the upstream exon then attacks the phosphodiester bond at the 3ʹ splice site. The exons are covalently joined, and the lariat containing the intron is released.

RNA SPLICING

SLIDE 39
  • 1. PWM Models
  • 2. Hidden Markov Models
  • 3. Maximum Entropy Models
  • 4. Hybrid Networks

RNA SPLICING: MACHINE LEARNING HISTORY AND STATE OF THE ART

SLIDE 40

Computational Model: PWMs

Abril, Castelo, Guigó (2005). The simplest mechanism for summarizing observed splice site data into a machine learning model. The PWM stores a nucleotide frequency at each position, which may be convolved with a novel sequence to score potential splice sites.
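Scoring with a PWM is just a sliding log-likelihood-ratio sum. The matrix below is hypothetical (its consensus GTAG merely echoes the GT…AG splice signal), not one of the published donor-site models, and the background is taken as uniform.

```python
import math

# Hypothetical position weight matrix over a 4-position window;
# rows are frequencies of A, C, G, T at each position (columns sum to 1).
PWM = {
    "A": [0.1, 0.1, 0.7, 0.1],
    "C": [0.1, 0.1, 0.1, 0.1],
    "G": [0.7, 0.1, 0.1, 0.7],
    "T": [0.1, 0.7, 0.1, 0.1],
}
BACKGROUND = 0.25  # uniform background nucleotide frequency

def pwm_score(window):
    """Log2 likelihood ratio of the window under the PWM vs. background."""
    return sum(math.log2(PWM[base][pos] / BACKGROUND)
               for pos, base in enumerate(window))

def scan(sequence, width=4):
    """Slide ('convolve') the PWM along a sequence, scoring every window."""
    return [(i, pwm_score(sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

hits = scan("CCGTAGCC")
best_pos, best_score = max(hits, key=lambda h: h[1])
print(best_pos, round(best_score, 2))  # 2 5.94
```

A threshold on this score turns the scan into a candidate splice-site caller; the PWM's weakness, which the later models address, is that it treats every position as independent.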

SLIDE 41
  • 1. PWM Models
  • 2. Hidden Markov Models
  • 3. Maximum Entropy Models
  • 4. Hybrid Networks

RNA SPLICING: MACHINE LEARNING HISTORY AND STATE OF THE ART

SLIDE 42

Computational Model: HIDDEN MARKOV MODEL

HMM (Marji & Garg, 2013). Moves sequentially along a DNA sequence, emitting state transitions to predict switching between intron and exon states.

SLIDE 43
  • 1. PWM Models
  • 2. Hidden Markov Models
  • 3. Maximum Entropy Models
  • 4. Hybrid Networks

RNA SPLICING: MACHINE LEARNING HISTORY AND STATE OF THE ART

SLIDE 44

Computational Model: MAXIMUM ENTROPY

MAXENT (Yeo & Burge, 2003)

Creates a maximum entropy score, allowing higher-order dependencies than a simple, single-state Markov model. An improvement over previous models at the time (2003).

SLIDE 45
  • 1. PWM Models
  • 2. Hidden Markov Models
  • 3. Maximum Entropy Models
  • 4. Hybrid Networks

RNA SPLICING: MACHINE LEARNING HISTORY AND STATE OF THE ART

SLIDE 46

The COSSMO model directly predicts PSI (Bretschneider et al., 2018)

SLIDE 47

COSSMO LSTM Model (Bretschneider et al, 2018) COSSMO uses both convolutional and LSTM layers and outperforms MAXENT scan.

SLIDE 48

Computational Model: Deep Learning with “COSSMO”

(Bretschneider et al., 2018) COSSMO rediscovers known splicing motifs. Motifs are extracted by clustering input sequences that activate the network. Reference motifs are on the top and matching motifs learned by COSSMO are on the bottom.

SLIDE 49

https://blog.addgene.org/treating-muscular-dystrophy-with-crispr-gene-editing

Duchenne muscular dystrophy (DMD) is an X-linked recessive disorder occurring in approximately 1 in 5,000 males.

SLIDE 50

FIN - Thank You