CSE 428 Spring 2018

Overview
• Course Web Pages: https://courses.cs.washington.edu/courses/cse428/18sp/
• TAs: Daniel Jones, Yue Zhang
• Group-project-oriented: typically teams of ~4 students
• I will offer some project ideas
• I am open to student-generated ideas: “computers” + “biology” (+ reasonable scope + something I can facilitate)

Project Challenges
• Organization & Scheduling
• Bio Jargon
• Tools from elsewhere
• Did I mention Organization & Scheduling?

What I hope you will learn
• See previous slide!
• You’ll see real DNA/RNA seq data in all of them, plus
• Some mixture of: data structures, algorithms, data analytics, statistics, biology, HCI, ML, …

Project Evaluation
• Weekly goals + progress reports
• Final written reports + oral presentations, including evaluation of code, test results, etc.
• Peer comments

Project Ideas 3 of my 4 suggestions grow out of “bias” in RNA sequencing, outlined in the following ~2 dozen slides. For today, at least, the details are not critical; key points I hope you get are that a) we can sequence RNA from cells b) it’s informative c) it’s quantitative d) technical artifacts bias that quantitative information e) we have software that ameliorates this bias, and f) there are unexplored issues surrounding this, hence, project ideas: visualizing and understanding the sources and extent of the biases and their impact on various downstream analyses.

Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA ruzzo@uw.edu

RNAseq
[Figure: RNA samples go to a DNA sequencer, yielding millions of reads of, say, 100 bp each. For one sample: map to genome, analyze. For two samples: map to genome, compare & analyze.]

Goals of RNAseq
1. Which genes are being expressed? How? Assemble reads (fragments of mRNAs) into (nearly) full-length mRNAs and/or map them to a reference genome.
2. How highly expressed are they? How? Count how many fragments come from each gene; expect more highly expressed genes to yield more reads per unit length.
3. What’s the same/different between 2 samples? E.g., tumor/normal.
4. ...
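Goal 2 can be illustrated with a minimal sketch of "reads per unit length." The gene names, read counts, and lengths below are made-up toy values, and `reads_per_kb` is a hypothetical helper, not part of any course tool.

```python
# Hypothetical toy data: gene name -> (mapped read count, gene length in bp).
genes = {
    "geneA": (500, 1000),
    "geneB": (500, 4000),
    "geneC": (50, 1000),
}


def reads_per_kb(read_count, length_bp):
    """Return reads per 1000 bp, so long and short genes are comparable."""
    return read_count / (length_bp / 1000.0)


# Length-normalized read density for each gene.
density = {name: reads_per_kb(n, length) for name, (n, length) in genes.items()}

# Rank genes from most to least highly expressed under this crude measure.
ranked = sorted(density, key=density.get, reverse=True)

for name in ranked:
    print(name, density[name])
```

Note that geneA and geneB have the same raw count, but geneA has the higher density because it is shorter. Real pipelines also normalize for sequencing depth, which this sketch leaves out.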

RNA seq
RNA → (cDNA, fragment, end repair, A-tail, ligate, PCR, …) → Sequence → (QC filter, trim, map, …) → Count
It’s so easy, what could possibly go wrong?

What we expect: Uniform Sampling
Uniform sampling of 4000 “reads” across a 200 bp “exon.” Average 20 ± 4.7 per position, min ≈ 9, max ≈ 33. I.e., as expected, we see ≈ μ ± 3σ in 200 samples.
[Figure: read-start counts (0 to 100) by position (0 to 200). Count reads starting at each position, not those covering each position.]
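The uniform-sampling expectation can be reproduced with a short simulation. This is a sketch under the slide's stated numbers (4000 reads, 200 positions). The function name and the seed are arbitrary choices, so the exact spread, min, and max will differ from the figures quoted above.

```python
import random
import statistics


def simulate_uniform_starts(n_reads=4000, exon_len=200, seed=0):
    """Drop n_reads read starts uniformly at random on exon_len positions."""
    rng = random.Random(seed)
    counts = [0] * exon_len
    for _ in range(n_reads):
        counts[rng.randrange(exon_len)] += 1
    return counts


counts = simulate_uniform_starts()
mean = statistics.mean(counts)   # exactly n_reads / exon_len
sd = statistics.pstdev(counts)   # spread of per-position counts

print(f"mean={mean} sd={sd:.2f} min={min(counts)} max={max(counts)}")
print(f"mean - 3 sd = {mean - 3 * sd:.1f}, mean + 3 sd = {mean + 3 * sd:.1f}")
```

Each position's count is approximately Poisson with mean 20, so a standard deviation near √20 ≈ 4.5 is expected, and peaks above 100 would be wildly improbable under this model.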

What we get: highly non-uniform coverage
The bad news: random fragments are not so uniform. E.g., assuming uniform, the 8 peaks above 100 are > +10σ above the mean.
[Figure: uniform vs. actual read-start counts across 200 nucleotides of a 3’ exon, Mortazavi data. Count reads starting at each position, not those covering each position.]
How to make it more uniform?
A: Math tricks like averaging/smoothing (e.g. “coverage”) or transformations (“log”), …, or
B: Try to model (aspects of) causation (& use increased uniformity of result as a measure of success). WE DO THIS.

What we get: highly non-uniform coverage
The bad news: random fragments are not so uniform.
The good news: we can (partially) correct the bias. Not perfect, but better: 38% reduction in LLR of uniform model; hugely more likely.
[Figure: uniform vs. actual read-start counts across 200 nucleotides.]

Bias is (in part) sequence-dependent, and platform/sample-dependent
[Figure: reads]
Fitting a model of the sequence surrounding read starts lets us predict which positions have more reads.

what causes bias?
No one knows in any great detail.
Speculations: all steps in the complex protocol may contribute. E.g.,
• primers in PCR-like amplification steps may have unequal affinities (“random hexamers”, e.g.)
• ligase enzyme sequence preferences
• potential RNA structures
• fragmentation biases
• mapping biases

some prior work
• Hansen, et al. 2010: “7-mer” method. Directly count foreground/background 7-mers at read starts, correct by ratio. $2 \times (4^7 - 1) = 32766$ free parameters.
• Li, et al. 2010: GLM (generalized linear model) and MART (multiple additive regression trees). Training requires gene annotations.
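The "count k-mers, correct by ratio" idea can be sketched as follows. This is a simplified illustration, not the published Hansen et al. procedure: it uses k = 3 on toy sequences, and it assumes the weight for each k-mer is background frequency divided by foreground frequency. All function names are hypothetical.

```python
from collections import Counter


def kmer_frequencies(seqs, k):
    """Relative frequency of the leading k-mer across a list of sequences."""
    counts = Counter(s[:k] for s in seqs if len(s) >= k)
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def kmer_weights(foreground, background, k):
    """Weight = background frequency / foreground frequency (an assumption).

    k-mers over-represented at read starts get weights below 1, and
    under-represented ones get weights above 1. k-mers missing from
    either set are skipped in this sketch.
    """
    fg = kmer_frequencies(foreground, k)
    bg = kmer_frequencies(background, k)
    return {kmer: bg[kmer] / fg[kmer] for kmer in fg if kmer in bg}


# Toy data: k-mers observed at read starts vs. at nearby non-start positions.
foreground = ["ACG", "ACG", "ACG", "TTG"]
background = ["ACG", "TTG", "TTG", "ACG"]

weights = kmer_weights(foreground, background, k=3)
print(weights)

# Free-parameter count quoted for the 7-mer method: two tables of 4^7 - 1.
print(2 * (4 ** 7 - 1))
```

Multiplying each read's contribution by the weight of its starting k-mer down-weights read starts that the protocol favors. With k = 7 the tables are large, which is where the parameter count above comes from.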

Method Outline
• sample foreground sequences
• sample (local) background sequences
• train Bayesian network, i.e., learn sequence patterns associated w/ high / low read counts
[Figure: panels (a) to (e)]

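defining bias
Data is unbiased if read is independent of sequence:
$$\Pr(\text{read at } i) = \Pr(\text{read at } i \mid \text{sequence at } i)$$
From Bayes rule:
$$\Pr(\text{read at } i \mid \text{seq at } i) = \Pr(\text{read at } i)\,\frac{\Pr(\text{seq at } i \mid \text{read at } i)}{\Pr(\text{seq at } i)}$$
We define “bias” to be the factor $\Pr(\text{seq at } i \mid \text{read at } i) / \Pr(\text{seq at } i)$.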

Modeling Sequence Bias
Want a probability distribution over k-mers, k ≈ 40?
Some obvious choices:
• Full joint distribution: $4^k - 1$ parameters
• PWM (0-th order Markov): $(4-1) \cdot k$ parameters
• Something intermediate: Directed Bayes network
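To see why something intermediate is attractive, the parameter counts can be worked out directly. The first two formulas are the ones on the slide. The Bayes-net count uses the standard fact that a nucleotide node with p parents needs 3 free parameters per parent configuration. The particular parent structure below is invented purely for illustration.

```python
def full_joint_params(k):
    """Free parameters of a full joint distribution over k-mers: 4^k - 1."""
    return 4 ** k - 1


def pwm_params(k):
    """Free parameters of a PWM (independent positions): (4 - 1) * k."""
    return (4 - 1) * k


def bayes_net_params(parent_counts):
    """Free parameters of a directed Bayes net over nucleotides.

    parent_counts[j] is the number of parents of node j. Each node needs
    (4 - 1) free parameters for each of its 4^p parent configurations.
    """
    return sum((4 - 1) * 4 ** p for p in parent_counts)


k = 40
# Hypothetical sparse structure: 20 nodes with no parents, 15 with one, 5 with two.
hypothetical_parents = [0] * 20 + [1] * 15 + [2] * 5

print("full joint:", full_joint_params(k))
print("PWM:", pwm_params(k))
print("sparse Bayes net:", bayes_net_params(hypothetical_parents))
```

A network with no arrows reduces to the PWM count, and every added arrow multiplies the cost of one node by 4, so a sparse network stays far below the full joint.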

Form of the models: Directed Bayes nets
One “node” per nucleotide, ±20 bp of read start
• Filled node means that position is biased
• Arrow i → j means letter at position i modifies bias at j
• For both, numeric parameters say how much
How: optimize
$$\ell = \sum_{i=1}^{n} \log \Pr[x_i \mid s_i] = \sum_{i=1}^{n} \log \frac{\Pr[s_i \mid x_i]\,\Pr[x_i]}{\sum_{x \in \{0,1\}} \Pr[s_i \mid x]\,\Pr[x]}$$
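The objective ℓ can be evaluated with a few lines of code. This sketch reads $s_i$ as the i-th training sequence and $x_i \in \{0, 1\}$ as its background/foreground label, which is an interpretation the slide does not spell out. It also substitutes a simple independent-positions model for $\Pr[s \mid x]$ in place of a Bayes net, to keep the example short. All names and numbers are hypothetical.

```python
import math


def seq_likelihood(seq, pwm):
    """Pr[s | x] under an independent-positions stand-in model.

    pwm[j][base] is the probability of `base` at position j.
    """
    p = 1.0
    for j, base in enumerate(seq):
        p *= pwm[j][base]
    return p


def conditional_log_likelihood(seqs, labels, models, prior):
    """Sum over i of log Pr[x_i | s_i], with the posterior from Bayes rule."""
    total = 0.0
    for s, x in zip(seqs, labels):
        joint = {c: seq_likelihood(s, models[c]) * prior[c] for c in (0, 1)}
        total += math.log(joint[x] / (joint[0] + joint[1]))
    return total


# Toy one-position models: foreground (1) prefers A, background (0) is flat.
models = {
    1: [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}],
    0: [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}],
}
prior = {0: 0.5, 1: 0.5}

seqs = ["A", "C"]
labels = [1, 0]

ell = conditional_log_likelihood(seqs, labels, models, prior)
print(ell)
```

Training would search over model structure and parameters to increase ℓ. A model that cannot tell foreground from background scores n · log 0.5, so anything above that reflects learned sequence bias.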

NB:
• Not just initial hexamer
• Span ≥ 19
• All include negative positions
• All different, even on same platform
[Figure: learned models for Illumina and ABI data.]

Result – Increased Uniformity
[Figure: Kullback-Leibler divergence; labels: Jones, Li et al, Hansen et al; Trapnell data.]
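One way to turn "uniformity" into a number is the Kullback-Leibler divergence between the observed distribution of read starts and the uniform distribution. The slide does not say exactly which distributions were compared, so the sketch below is an assumption about the general idea, with a hypothetical function name and toy counts.

```python
import math


def kl_from_uniform(counts):
    """KL divergence (in nats) from the uniform distribution to the
    empirical distribution of read starts given by per-position counts.

    Positions with zero reads contribute nothing (0 * log 0 is taken as 0).
    """
    total = sum(counts)
    n = len(counts)
    kl = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            kl += p * math.log(p * n)  # p * log(p / (1 / n))
    return kl


uniform_counts = [5, 5, 5, 5]
spiky_counts = [20, 0, 0, 0]
milder_counts = [8, 4, 4, 4]

print(kl_from_uniform(uniform_counts))
print(kl_from_uniform(spiky_counts))
print(kl_from_uniform(milder_counts))
```

Perfectly even counts give 0, and the value grows as reads pile up on fewer positions, so a bias correction that lowers this number has made coverage more uniform.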

some questions
• What is the chance that we will learn an incorrect model? E.g., learn a biased model from unbiased input?
• How does the amount of training data affect accuracy of the resulting model?
