Fostering sensitivity analysis for genome-scale inference

“Software into ideas”
Vince Carey, Ph.D.
Channing Division of Network Medicine, Harvard Medical School
PSB 2013/NSF BIGDATA Add-on

Road map of the talk
• Brief discussion of generalized linear models
• Examples of genome-scale inference
  – eQTL enumeration [modest volume]
  – dsQTL enumeration [high volume]
• Sensitivities and greedy tuning
• Holistic workflows: the burden of the past
• The MAMS principles (Multiply-Agnostic, Multiply-Scalable) for statistical algorithm deployments

GLM: A productive unification of statistical models, 1972
• Scalar outcome variable Y has mean value μ
• The mean is linked to a linear predictor: g(μ) = α + x_1 β_1 + … + x_p β_p
• The variance is a function of the mean
  – Var(Y) = φV(μ)
• Choices of g() and V() correspond to Gaussian, Logistic, Poisson, Gamma regression procedures
• Iteratively reweighted least squares can be used for estimation; asymptotically statistically efficient under mild assumptions
• Reprinted in “Breakthroughs in statistics”, along with works of Fisher, Student, Pearson, Wald, ….
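The estimation bullet can be made concrete with a small sketch of iteratively reweighted least squares. It assumes the Poisson case with log link, g(μ) = log(μ) and V(μ) = μ, and a single predictor so that each weighted least-squares step is a 2x2 solve. The function name `irls_poisson` and the toy counts are illustrative assumptions, not material from the talk; a general implementation such as R's glm() handles arbitrary families and design matrices.

```python
import math

def irls_poisson(x, y, n_iter=25, tol=1e-10):
    """Fit log(mu) = alpha + beta * x by iteratively reweighted least squares."""
    # Start from the intercept-only fit.
    alpha, beta = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        # Linear predictor and mean under the log link.
        eta = [alpha + beta * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working weights (dmu/deta)^2 / V(mu), which equal mu in this case.
        w = mu
        # Working response z = eta + (y - mu) * deta/dmu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Weighted least squares of z on (1, x) via the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_alpha = (swxx * swz - swx * swxz) / det
        new_beta = (sw * swxz - swx * swz) / det
        converged = abs(new_alpha - alpha) + abs(new_beta - beta) < tol
        alpha, beta = new_alpha, new_beta
        if converged:
            break
    return alpha, beta

# Toy counts, invented for illustration only.
alpha_hat, beta_hat = irls_poisson([0, 1, 2, 3, 4], [2, 2, 3, 4, 6])
print(alpha_hat, beta_hat)
```

Swapping in a different link and variance function changes only the lines that compute `mu`, `w` and `z`, which is the sense in which one algorithm template covers the whole model class.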

1992: deployment as glm()

GLM: 40 years of theory, extension, deployment
• GENSTAT, GLIM: Numerical Algorithms Group
• S, Splus – glm infrastructure includes robust() family
• R – stats::glm and biglm::bigglm address “standard” and high-volume fitting requirements (the latter with incremental QR)
• Additional tailored deployments in Bioconductor snpStats, limma, DESeq, edgeR confront genetic and genomic requirements
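The high-volume fitting idea can be sketched as follows: fit a model while seeing only one chunk of rows at a time. This is not the incremental QR used by biglm::bigglm. As a simpler stand-in, the sketch accumulates the cross-products needed for a Gaussian linear model with one predictor, which is less numerically stable than QR updating but shows the same "never hold all data in memory" pattern. The names `fit_in_chunks` and `row_chunks` are hypothetical.

```python
def fit_in_chunks(chunks):
    """Least-squares fit of y = alpha + beta * x from an iterable of (xs, ys) chunks."""
    n = sx = sxx = sy = sxy = 0.0
    for xs, ys in chunks:
        # Only running sums are kept; each chunk can be discarded after this loop.
        for xi, yi in zip(xs, ys):
            n += 1.0
            sx += xi
            sxx += xi * xi
            sy += yi
            sxy += xi * yi
    # Solve the 2x2 normal equations built from the accumulated sums.
    det = n * sxx - sx * sx
    alpha = (sxx * sy - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    return alpha, beta

def row_chunks(x, y, size):
    """Yield successive (xs, ys) slices; stands in for reading blocks from disk."""
    for start in range(0, len(x), size):
        yield x[start:start + size], y[start:start + size]

x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
print(fit_in_chunks(row_chunks(x, y, size=4)))
```

The fitted coefficients do not depend on the chunk size, since the accumulated sums are the same however the rows are partitioned (up to floating-point rounding).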

Why so much time on GLM?
• Illustrates an aspect of algorithmic “holism”: a single interface, focused infrastructure solves all of a class of problems formerly treated piecemeal
• Illustrates the idea of an algorithm template that can receive user-coded functions to modify operations
• Has been re-implemented too often, and examining causes for this can help define requirements for enduring deployments

Questions
• If statisticians had discovered GLM only today, what would be a reasonable approach to implementation? How to sidestep common assumptions
  – “all data in memory”
  – scalar execution of algorithm steps
  – inputs are (mostly) floating point numbers and integers
• What languages and environments will support streamlined implementations, maximizing efficient use of available hardware/software?
• How will interactive data analysis capabilities be achieved with high data volume and environment complexity?

Williams R et al. Genome Research 2007 vol. 17 (12) pp. 1707-1716

GSTT1 eQTL: Average expression varies by genotype at nearby SNPs – why? [N=90 CEU HM phase 2; Sanger GENEVAR]

Full chromosome scan for CPNE1 and view of the peak

Summary
• Transcriptome and SNP-ome are jointly measured on a number of individuals
  – ~20000 transcripts, ~10 million SNP, …
• Models for additive genetic effects on transcript levels are fit for all gene:snp pairs in cis
• Humps and peaks in the series of association statistics are found along the genome
• Reliability of the procedure, interpretation of results?
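The cis scan summarized above can be illustrated with a minimal sketch. It assumes the additive model is a simple linear regression of expression on minor-allele count (0, 1 or 2), scored by the usual t statistic for the slope, and that "in cis" means within a fixed radius of the gene position. The function names, the tuple layout for SNPs and the toy values are all hypothetical; production tools such as snpStats or GGtools use their own data structures and test statistics.

```python
import math

def additive_score(genotypes, expression):
    """Slope and t statistic for expression regressed on minor-allele count."""
    n = len(genotypes)
    gbar = sum(genotypes) / n
    ebar = sum(expression) / n
    sgg = sum((g - gbar) ** 2 for g in genotypes)
    sge = sum((g - gbar) * (e - ebar) for g, e in zip(genotypes, expression))
    if sgg == 0.0:
        # Monomorphic SNP: no genotype variation, so no association information.
        return 0.0, 0.0
    slope = sge / sgg
    rss = sum((e - ebar - slope * (g - gbar)) ** 2
              for g, e in zip(genotypes, expression))
    se = math.sqrt(rss / (n - 2) / sgg)
    t = slope / se if se > 0 else float("nan")
    return slope, t

def cis_scan(gene_pos, expression, snps, radius):
    """Score each SNP within `radius` bases of the gene.

    `snps` is a list of (name, position, genotypes) tuples.
    """
    results = []
    for name, pos, genotypes in snps:
        if abs(pos - gene_pos) <= radius:
            slope, t = additive_score(genotypes, expression)
            results.append((name, pos, slope, t))
    return results

# Toy data, invented for illustration only.
expr = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
snps = [("snpA", 1200, [0, 1, 2, 0, 1, 2]),
        ("snpB", 900000, [0, 0, 1, 1, 2, 2])]
for row in cis_scan(gene_pos=1000, expression=expr, snps=snps, radius=5000):
    print(row)
```

Repeating this over roughly 20000 transcripts and millions of SNPs is what produces the series of association statistics in which humps and peaks are sought.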

Tuned with: 100bp window; top 5% sensitivity; 4 PC removal

Greedy tuning for higher yield

Summary
• Feature space now a continuously scored tiling of the genome
  – Filtered to 1.5 million features but could be many more, could consider as many as 37 million 1KG SNP
• Scope of genetic regulation seems more limited: dropping cis search region from 40kb to 2kb does not drastically affect yield of dsQTL
• A number of ad hoc filtering steps might have more important impacts

[Figure: Distributions of norm. DHS over 70 individuals at most sensitive windows in vicinity of ORMDL3. Tracks: DHS; Chicago; Genes (GSDMB, ORMDL3, LRRC3C). Position axis: 38.06 mb to 38.09 mb.]

Greedy tuning of eQTL searches
• Yield can be affected by
  – Choice of cis-interval size
  – Depth of search into rare variants (lower bound on minor allele frequency)
  – Approach to removing non-biologic variation from expression assay results (Stegle, Durbin, RECOMB 2008)
• Management of a single search is difficult, but multiple searches or extensive metadata need to be retained so that various calling policies can be compared
• We’ll consider combined analysis of CEU and YRI founders (N=120)

Minor allele frequency determines reliability of association inference

[Figure: Permutation distribution of maximum association scores at 500kb cis radius. x-axis: MAF (0.0 to 0.5); y-axis: score (0 to 50).]

[Figure: # genes w/eQTL at FDR <= 0.05 (y-axis, 1750 to 2250), in panels by radius of cis search (5000, 50000, 250000). Top row: x-axis is lower bound on MAF (0.025 to 0.100), points keyed by npc (5 to 30). Bottom row: x-axis is # PC removed (10 to 30), points keyed by factor(MAF) (0.005, 0.01, 0.025, 0.03, 0.05, 0.075, 0.1).]

Upshots for eQTL
• Very large number of tests
• Evident sensitivity of yield to a number of tuning parameters
• Thorough investigations require exploration of the parameter space
• With GGtools R 2.15 the full 500kb radius, MAF > 0.05 search took 3h on 88 commodity cores
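Exploring the parameter space while retaining metadata for every search might be organized as in the sketch below. It assumes an exhaustive grid over the three tuning parameters named in these slides (cis radius, lower bound on MAF, number of PCs removed) and a caller-supplied `run_search` function that returns the number of genes with an eQTL at the chosen FDR. Both `explore` and `run_search` are hypothetical names, and `toy_search` is an arbitrary placeholder formula, not a model of real eQTL yield.

```python
from itertools import product

def explore(run_search, radii, maf_bounds, n_pcs):
    """Run one search per parameter combination and keep settings with results."""
    records = []
    for radius, maf, npc in product(radii, maf_bounds, n_pcs):
        n_genes = run_search(radius=radius, maf_lower=maf, n_pc=npc)
        # Store the settings next to the yield so calling policies can be compared later.
        records.append({"radius": radius, "maf_lower": maf,
                        "n_pc": npc, "n_genes": n_genes})
    return records

def toy_search(radius, maf_lower, n_pc):
    # Placeholder only: stands in for an hours-long genome-scale search.
    return round(1000 + radius // 1000 - 2000 * maf_lower - (n_pc - 20) ** 2)

records = explore(toy_search,
                  radii=[5000, 50000, 250000],
                  maf_bounds=[0.025, 0.05, 0.1],
                  n_pcs=[10, 20, 30])
best = max(records, key=lambda r: r["n_genes"])
print(len(records), best)
```

Because each combination is independent, the loop body is a natural unit to distribute across cores, which matters when a single search takes hours.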
