csep 590 b computational biology
play

CSEP 590 B Computational Biology Gene Expression Analysis 1 - PowerPoint PPT Presentation

CSEP 590 B Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3 Microarrays 4 RNAseq DNA Sequencer map to genome, analyze 5 Goals of RNAseq #1: Which genes are being expressed? How? assemble reads


  1. CSEP 590 B 
 Computational Biology Gene Expression Analysis 1

  2. Assaying Gene Expression 3

  3. Microarrays 4

  4. ⬇ RNAseq ⬇ ⬇ DNA Sequencer map to genome, analyze 5

  5. Goals of RNAseq #1: Which genes are being expressed? How? assemble reads (fragments of mRNAs) into (nearly) full-length mRNAs and/or map them to a reference genome #2: How highly expressed are they? How? count how many fragments come from each gene–expect more highly expressed genes to yield more reads, after correcting for biases like mRNA length #3: What’s same/diff between 2 samples E.g., tumor/normal #4: ... 7

  6. Recall: splicing exon 5 ’ intron exon 2

  7. RNAseq Data Analysis De novo Assembly mostly deBruijn-based, but likely to change with longer reads more complex than genome assembly due to alt splicing, wide diffs in expression levels; e.g. often multiple “k’s” used pro: no ref needed (non-model orgs), novel discoveries possible, e.g. very short exons con: less sensitive to weakly-expressed genes Reference-based (more later) pro/con: basically the reverse Both: subsequent bias correction, quantitation, differential expression calls, fusion detection, etc. 8

  8. “TopHat” (Ref based example) ! map reads to ref transcriptome (optional) BWA ! map reads to ref genome ! unmapped reads remapped as 25mers ! novel splices = 25 mers anchored 2 sides ! stitch original reads across these ! remap reads with minimal overlaps ! Roughly: 10m reads/hr, 4Gbytes 
 (typical data set 100m–1b reads) 9

  9. RNAseq Example 1yr-3 hg19 5 kb Scale chr19: 50,020,000 50,025,000 Day 20 1 Year FCGRT FCGRT 20

  10. RNAseq protocol (approx) Extract RNA (maybe by polyA ↔ polyT) Reverse-transcribe into DNA (“cDNA”) Make double-stranded, maybe amplify Cut into, say, ~300bp fragments Add adaptors to each end Sequence ~100-175bp from one or both ends CAUTIONS: non-uniform sampling, sequence (e.g. G+C), 5’-3’, and length biases 6

  11. Bias Correction in RNAseq Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA

  12. Associate Editor: Alex Bateman These biases may adversely effect transcript discovery, as ABSTRACT low level noise may be overreported in some regions, and in Motivation: Quantification of sequence abundance in RNA-Seq others, active transcription may be underreported. They render experiments is often conflated by protocol-specific sequence bias. untrustworthy comparisons of relative abundance between genes The exact sources of the bias are unknown, but may be influenced by 2

  13. RNA seq cDNA, fragment, QC filter, RNA → → S equence → → C ount end repair, A-tail, trim, map, ligate, PCR, … …

  14. What we expect: Uniform Sampling 100 75 50 25 0 0 50 100 150 200 Uniform sampling of 4000 “reads” across a 200 bp “exon.” Average 20 ± 4.7 per position, min ≈ 9, max ≈ 33 
 I.e., as expected, we see ≈ μ ± 3 σ in 200 samples

  15. What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. E.g., assuming uniform, the 8 peaks above 100 are > +10 σ above mean ~ Uniform 50 25 0 Actual ––––––––––– 3’ exon ––––––––– 200 nucleotides Mortazavi data

  16. What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. Uniform 50 25 0 Actual not perfect, but better: 38% reduction in LLR of uniform model; hugely more likely 200 nucleotides The Good News : we can (partially) correct the bias

  17. (in part) Bias is ^ sequence-dependent Reads and platform/sample-dependent Fitting a model of the sequence surrounding read starts lets us predict which positions have more reads.

  18. d o ( a ) h e t n e i M l t ( b ) u O ( c ) ( d ) ( e )

  19. Modeling Sequence Bias Want a probability distribution over k-mers, k ≈ 40 Some obvious choices Full joint distribution: 4 k -1 parameters PWM (0-th order Markov): (4-1)•k parameters Something intermediate Directed Bayes network 9

  20. Form of the models: Directed Bayes nets One “node” per nucleotide, 
 ±20 bp of read start • Filled node means that position is biased • Arrow i → j means letter at position i modifies bias at j • For both, numeric parameters say how much How–optimize: n n Pr [ s i | x i ] Pr [ x i ] � � ℓ = logPr [ x i | s i ]= log � x ∈ { 0 , 1 } Pr [ s i | x ] Pr [ x ] i = 1 i = 1

  21. NB: •Not just initial hexamer •Span ≥ 19 •All include Illumina ABI negative positions •All different, even on same platform

  22. Log Log

  23. Result – Increased Uniformity Kullback-Leibler Divergence Jones Li et al Hansen et al Trapnell Data

  24. Result – Increased Uniformity R 2 Fractional improvement in log-likelihood under hypothesis test: 
 uniform model across * = p-value < 10 -23 “Is BN better than X?” 1000 exons (R 2 =1-L’/L) (1-sided Wilcoxon signed-rank test)

  25. “First, do no harm” Theorem: The probability of “false bias discovery,” i.e., 
 of learning a non-empty model from n reads sampled from un biased data is less than 1 - (Pr( X < 3 log n )) 2 h where h = number of nucleotides in the model and X is a random variable that (asymptotically in n ) is ! 2 with 3 degrees of freedom. (E[ X ] = 3)

  26. “First, do no harm” Theorem: The probability of “false bias discovery,” i.e., of learning a non-empty model from n reads sampled from unbiased data, declines exponentially with n. Prob(non-empty model | unbiased data) If > 10,000 reads are used, the probability of a non-empty model < 0.0004 10 4 19 Number of training reads

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend