CSEP 590 B Computational Biology Gene Expression Analysis 1 - - PowerPoint PPT Presentation

csep 590 b computational biology
SMART_READER_LITE
LIVE PREVIEW

CSEP 590 B Computational Biology Gene Expression Analysis 1 - - PowerPoint PPT Presentation

CSEP 590 B Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3 Microarrays 4 RNAseq DNA Sequencer map to genome, analyze 5 Goals of RNAseq #1: Which genes are being expressed? How? assemble reads


slide-1
SLIDE 1

CSEP 590 B
 Computational Biology

Gene Expression Analysis

1

slide-2
SLIDE 2

Assaying Gene Expression

3

slide-3
SLIDE 3

Microarrays

4

slide-4
SLIDE 4

RNAseq

DNA Sequencer ⬇ ⬇ ⬇ map to genome, analyze

5

slide-5
SLIDE 5

Goals of RNAseq

#1: Which genes are being expressed?

How? assemble reads (fragments of mRNAs) into (nearly) full-length mRNAs and/or map them to a reference genome

#2: How highly expressed are they?

How? count how many fragments come from each gene–expect more highly expressed genes to yield more reads, after correcting for biases like mRNA length

#3: What’s same/diff between 2 samples

E.g., tumor/normal

#4: ...

7

slide-6
SLIDE 6

2

intron

exon 5’ exon

Recall: splicing

slide-7
SLIDE 7

RNAseq Data Analysis

De novo Assembly

mostly deBruijn-based, but likely to change with longer reads more complex than genome assembly due to alt splicing, wide diffs in expression levels; e.g. often multiple “k’s” used pro: no ref needed (non-model orgs), novel discoveries possible, e.g. very short exons con: less sensitive to weakly-expressed genes

Reference-based (more later)

pro/con: basically the reverse

Both: subsequent bias correction, quantitation, differential expression calls, fusion detection, etc.

8

slide-8
SLIDE 8

“TopHat” (Ref based example)

! map reads to ref transcriptome (optional) ! map reads to ref genome ! unmapped reads remapped as 25mers ! novel splices = 25mers anchored 2 sides ! stitch original reads across these ! remap reads with minimal overlaps ! Roughly: 10m reads/hr, 4Gbytes


(typical data set 100m–1b reads)

BWA

9

slide-9
SLIDE 9

20 Scale chr19: FCGRT FCGRT 5 kb hg19 50,020,000 50,025,000 1yr-3

Day 20 1 Year

RNAseq Example

slide-10
SLIDE 10

RNAseq protocol (approx)

Extract RNA (maybe by polyA ↔ polyT) Reverse-transcribe into DNA (“cDNA”) Make double-stranded, maybe amplify Cut into, say, ~300bp fragments Add adaptors to each end Sequence ~100-175bp from one or both ends CAUTIONS: non-uniform sampling, sequence (e.g. G+C), 5’-3’, and length biases

6

slide-11
SLIDE 11

Bias Correction in RNAseq

Walter L. (Larry) Ruzzo

Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA

slide-12
SLIDE 12

Associate Editor: Alex Bateman ABSTRACT Motivation: Quantification of sequence abundance in RNA-Seq experiments is often conflated by protocol-specific sequence bias. The exact sources of the bias are unknown, but may be influenced by

These biases may adversely effect transcript discovery, as low level noise may be overreported in some regions, and in

  • thers, active transcription may be underreported. They render

untrustworthy comparisons of relative abundance between genes

2

slide-13
SLIDE 13

RNA seq

RNA → → Sequence → → Count

cDNA, fragment, end repair, A-tail, ligate, PCR, … QC filter, trim, map, …

slide-14
SLIDE 14

25 50 75 100 50 100 150 200

What we expect: Uniform Sampling

Uniform sampling of 4000 “reads” across a 200 bp “exon.” Average 20 ± 4.7 per position, min ≈ 9, max ≈33
 I.e., as expected, we see ≈ μ ± 3σ in 200 samples

slide-15
SLIDE 15

The bad news: random fragments are not so uniform.

What we get: highly non-uniform coverage

––––––––––– 3’ exon –––––––––

200 nucleotides Mortazavi data

E.g., assuming uniform, the 8 peaks above 100 are > +10σ above mean ~

25 50

Uniform Actual

slide-16
SLIDE 16

The bad news: random fragments are not so uniform.

The Good News: we can (partially) correct the bias

not perfect, but better: 38% reduction in LLR

  • f uniform model;

hugely more likely

What we get: highly non-uniform coverage

200 nucleotides

25 50

Uniform Actual

slide-17
SLIDE 17

Fitting a model of the sequence surrounding read starts lets us predict which positions have more reads.

Bias is ^ sequence-dependent

Reads

and platform/sample-dependent

(in part)

slide-18
SLIDE 18

(a) (b) (c) (d) (e)

M e t h

  • d

O u t l i n e

slide-19
SLIDE 19

Modeling Sequence Bias

Want a probability distribution over k-mers, k ≈ 40 Some obvious choices Full joint distribution: 4k-1 parameters PWM (0-th order Markov): (4-1)•k parameters Something intermediate Directed Bayes network

9

slide-20
SLIDE 20

One “node” per nucleotide, 
 ±20 bp of read start

  • Filled node means that

position is biased

  • Arrow i → j means letter at

position i modifies bias at j

  • For both, numeric

parameters say how much How–optimize:

ℓ=

n

  • i=1

logPr[xi|si]=

n

  • i=1

log Pr[si|xi]Pr[xi]

  • x∈{0,1}Pr[si|x]Pr[x]

Form of the models:

Directed Bayes nets

slide-21
SLIDE 21

NB:

  • Not just initial

hexamer

  • Span ≥ 19
  • All include

negative positions

  • All different,

even on same platform

Illumina ABI

slide-22
SLIDE 22
slide-23
SLIDE 23

Log Log

slide-24
SLIDE 24

Trapnell Data Kullback-Leibler Divergence

Result – Increased Uniformity

Jones Li et al Hansen et al

slide-25
SLIDE 25
slide-26
SLIDE 26

R2

* = p-value < 10-23

hypothesis test:
 “Is BN better than X?”

(1-sided Wilcoxon signed-rank test)

Result – Increased Uniformity

Fractional improvement in log-likelihood under uniform model across 1000 exons (R2=1-L’/L)

slide-27
SLIDE 27
slide-28
SLIDE 28

“First, do no harm”

Theorem: The probability of “false bias discovery,” i.e., 


  • f learning a non-empty model from n reads

sampled from unbiased data is less than 1 - (Pr(X < 3 log n))2h where h = number of nucleotides in the model and X is a random variable that (asymptotically in n) is !2 with 3 degrees of

  • freedom. (E[X] = 3)
slide-29
SLIDE 29

104

If > 10,000 reads are used, the probability

  • f a non-empty model < 0.0004

Prob(non-empty model | unbiased data) Number of training reads

“First, do no harm”

19

Theorem: The probability of “false bias discovery,” i.e., of

learning a non-empty model from n reads sampled from unbiased data, declines exponentially with n.

slide-30
SLIDE 30

Given: r-sided die, with probs p1...pr of each face. Roll it n=10,000 times; observed frequencies = q1, …, qr, (the MLEs for the unknown qi’s). How close is pi to qi? Fancy name, simple idea: H(Q||P) is just the expected per-sample contribution to log-likelihood ratio test for “was X sampled from H0: P vs H1: Q?”

how different are two distributions?

20

  • 1000 : 1

m H(Q||P) mH(Q||P) ≥ 1000 m ≥ 1000 H(Q||P)

  • k k

Q P

H(Q||P) =

  • i

qi qi pi qi pi Q P k pi ≈ qi

  • H qi pi

H

slide-31
SLIDE 31

21

  • P p1, . . . , pr

pi = 1 r r = 4k k X1, X2, . . . , Xr n =

i Xi P

qi = Xi

n ≈ pi

P Q E[qi] = E Xi n

  • = E[Xi]

n = npi n = pi 1/√n H(Q||P) =

r

  • i=1

qi qi pi =

r

  • i=1

qi

  • 1 + qi − pi

pi

slide-32
SLIDE 32

22

(1 + x) H(Q||P) ≈

r

  • i=1

qi

  • qi − pi

pi − 1 2 qi − pi pi 2 =

r

  • i=1

qi qi − pi pi − qi 2pi (qi − pi)2 pi r

i=1 qi = r i=1 pi = 1 r i=1 pi qi−pi pi

= 0 H(Q||P) ≈

r

  • i=1

qi qi − pi pi − pi qi − pi pi − qi 2pi (qi − pi)2 pi =

r

  • i=1

(qi − pi)2 pi

  • 1 − qi

2pi

  • ≈ 1

2

r

  • i=1

(qi − pi)2 pi qi ≈ pi n2/n2 H(Q||P) ≈ 1 2n

r

  • i=1

(nqi − npi)2 npi = 1 2n

r

  • i=1

(Xi − E[Xi])2 E[Xi]

slide-33
SLIDE 33

23

  • χ2

n → ∞ χ2 r − 1 r −1 Q P E[H(Q||P)] = r − 1 2n

  • 5

10 15 20

  • 20
  • 15
  • 10
  • 5

Relative Entropy, wrt Uniform, of Observed n balls in r bins

log2(n) log2(relative entropy) Each Circle is mean of 100 trials; Stars are theoretical estimates for n/r >= 1/4. r = 16384 * * * * * * * * * r = 1024 * * * * * * * * * * * * * r = 256 * * * * * * * * * * * * * * * r = 64 * * * * * * * * * * * * * * * * * r = 16 * * * * * * * * * * * * * * * * * * * r = 2 * * * * * * * * * * * * * * * * * * * * *

slide-34
SLIDE 34

24

5 10 15 20

  • 20
  • 15
  • 10
  • 5

Relative Entropy, wrt Uniform, of Observed n balls in r bins

log2(n) log2(relative entropy) Each Circle is mean of 100 trials; Stars are theoretical estimates for n/r >= 1/4. r = 16384 * * * * * * * * * r = 1024 * * * * * * * * * * * * * r = 256 * * * * * * * * * * * * * * * r = 64 * * * * * * * * * * * * * * * * * r = 16 * * * * * * * * * * * * * * * * * * * r = 2 * * * * * * * * * * * * * * * * * * * * *

  • E[H(Q||P)] = r − 1

2n

… and after a modicum of algebra: … which empirically is a good approximation:

LLR of error rises with number of parameters r, declines with size of training set n

slide-35
SLIDE 35

… while accuracy and runtime rise with n (empirically)

25

R2

  • 104

Training time: 104 reads in minutes; 
 105 reads in an hour

slide-36
SLIDE 36

26

Availability

h t t p : / / b i

  • c
  • n

d u c t

  • r

.

  • r

g / p a c k a g e s / r e l e a s e / b i

  • c

/ h t m l / s e q b i a s . h t m l

slide-37
SLIDE 37

Acknowledgements

Daniel Jones

Katze Lab

Michael Katze Xinxia Peng

P01 Labs

Tony Blau, Chuck Murry, Hannele Ruohola-Baker, Nathan Palpant, Kavitha Kuppusamy, …

Funding

NIGMS, NHGR, NIAID

I C ! !

slide-38
SLIDE 38

CSEP 590 B! Computational Biology

Course Wrap Up

slide-39
SLIDE 39

What is DNA? RNA? How many Amino Acids are there? Did human beings, as we know them, develop from earlier species of animals? What are stem cells? What did Viterbi invent? What is dynamic programming? What is a likelihood ratio test? What is the EM algorithm? How would you find the maximum of f(x) = ax3 + bx2 + cx +d in the interval -10<x<25?

slide-40
SLIDE 40

“High-Throughput ! BioTech”

Sensors

DNA sequencing Microarrays/Gene expression Mass Spectrometry/Proteomics Protein/protein & DNA/protein interaction

Controls

Cloning Gene editing/knock out/knock in RNAi

Floods of data! ! “Grand Challenge” problems

slide-41
SLIDE 41

CS Points of Contact

Scientific visualization

Gene expression patterns

Databases

Integration of disparate, overlapping data sources Distributed genome annotation in face of shifting underlying coordinates

AI/NLP/Text Mining

Information extraction from journal texts with inconsistent nomenclature, indirect interactions, incomplete/inaccurate models,…

Machine learning

System level synthesis of cell behavior from low-level heterogeneous data (DNA sequence, gene expression, protein interaction, mass spec, …)

Algorithms …

slide-42
SLIDE 42

Frontiers & Opportunities

New data:

Proteomics, SNP, arrays, CGH, comparative sequence information, epigenomics, chromatin structure, ncRNA, interactome, single-cell everything

New methods:

graphical models, rigorous filtering

Data integration

many, complex, noisy sources

Systems Biology

slide-43
SLIDE 43

Frontiers & Opportunities

Open Problems:

splicing, alternative splicing multiple sequence alignment (genome scale, w/ RNA etc.) protein & RNA structure interaction modeling regulation, at all levels network models RNA trafficing ncRNA discovery …

slide-44
SLIDE 44

Exciting Times! ! “Biology is to 21st Century ! as Physics was to 20th”

Lots to do Highly multidisciplinary You’ll be hearing a lot more about it I hope I’ve given you a taste of it

slide-45
SLIDE 45

Thanks!

PS: Please complete online course evaluation before 12/7