Removing Unwanted Variation in Machine Learning for Personalized Medicine
with Johann Gagnon-Bartsch and Laurent Jacob
European Marie Curie Network for MLPM. Barcelona, 20 May 2016
Photo: Bernard Gagnon
Photo: SBA73 from Sabadell, Catalunya
Over 500,000 thyroid nodule fine needle aspiration (FNA) procedures were performed in the US in 2011. FNA samples can be challenging to interpret and produce indeterminate results in 15% to 30% of cases. Guidelines recommended that most of these patients undergo a diagnostic thyroid surgery to assess whether the nodules are benign or malignant. 70%-80% of the time, the nodules prove to be benign. The Afirma Gene Expression Classifier (GEC) helps physicians reduce the number of avoidable surgeries by preoperatively identifying benign nodules among those that were classified by cytopathology as indeterminate.
I'm on the Scientific Advisory Board of Veracyte and receive money from them.
list goes on…
[Figure: sample principal component scores (PC1 vs PC2), separating batch 1 from batch 2. Gene expression data, adapted from Lazar C et al., Brief Bioinform 2013.]
Clustering
Unwanted variation can reduce precision and add bias (via confounding), leading to false positives and false negatives, poor classifiers and artificial clusters.
To discuss some new ways of removing unwanted variation.
m (10s-1,000s) samples, n (10s of 1,000s) genes, k (≤ m-p) UV factors,
where Y is a matrix of gene expression measurements, observed, and X carries the p factors of interest, observed in a training set.
With our Afirma-T example, we could put xi = -1 if sample i is benign, and xi = +1 if sample i is malignant. The wi for this example could capture batch effects in reagents, in chips, processing dates, operators, and other things (remember: we're treating them as unobserved).
[Diagram: Y (m×n) = X (m×p) β (p×n) + W (m×k) α (k×n) + ε (m×n).]
The εij are all (0, σj²), uncorrelated with each other and all else.
We resist the temptation to make assumptions about the {αj}.
That is, we have Y but don't know X (or W) for our test and target set samples. Before we get there, we'll discuss estimating β as we would in a training set with known X.
Ym×n = Xm×pβp×n + Wm×kαk×n + εm×n
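As a hypothetical numpy sketch of this model and its dimensions (all sizes and names are illustrative, not from the talk):

```python
import numpy as np

# A minimal simulation of the model Y = Xβ + Wα + ε; sizes are hypothetical.
rng = np.random.default_rng(0)
m, n, p, k = 50, 1000, 1, 3                 # samples, genes, wanted and unwanted factors

X = rng.choice([-1.0, 1.0], size=(m, p))    # observed factor of interest
W = rng.normal(size=(m, k))                 # unobserved unwanted factors
beta = rng.normal(scale=0.5, size=(p, n))   # effects of interest
alpha = rng.normal(size=(k, n))             # effects of unwanted factors
eps = rng.normal(scale=0.3, size=(m, n))    # mean-0, gene-specific noise

Y = X @ beta + W @ alpha + eps              # the observed data matrix
```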
Initial goal: to estimate β. Note: W is unobserved; otherwise this is a standard linear model. "Our" strategy: use factor analysis to estimate W.
(y1j, y2j) = (βj + αj + ε1j, xβj + wαj + ε2j)
Two samples: x1 = w1 = 1; x2 = x, w2 = w. Dots are genes.
For the control genes (βj = 0): (y1j, y2j) = (αj + ε1j, wαj + ε2j)
Negative controls: genes whose expression is not associated with the biological factors of interest embodied in X.
"controls" in this context means "controls w.r.t. differential expression"
For the control genes, βc = 0, so Yc = Wαc + εc.
Regressing Y on X and Ŵ gives an estimate of β adjusted for Ŵ.
There are many ways to do the factor analysis, but we just use the SVD: write Yc = UΛVᵀ, then put Ŵ = U(k) (the first k columns).
Issues: choice of k, and can we do better? Yes: RUV-4.
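This RUV-2 recipe (factor-analyze the control genes, then regress Y on X together with the estimated W) can be sketched in numpy; the function name and the simulation sizes below are hypothetical:

```python
import numpy as np

def ruv2_beta(Y, X, ctl, k):
    """Minimal RUV-2 sketch: estimate W by SVD of the control-gene columns,
    then estimate beta by OLS of Y on [X, W_hat]. `ctl` is a boolean mask of
    negative-control genes; k is the assumed number of unwanted factors."""
    Yc = Y[:, ctl]
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    W_hat = U[:, :k]                          # first k left singular vectors
    D = np.hstack([X, W_hat])                 # design: wanted + estimated unwanted
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return coef[: X.shape[1]]                 # rows corresponding to beta
```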
This is the generalized least squares estimator using a covariance matrix based on data from the negative control genes (others use all genes), but we estimate SEs differently. We start with RUV-4 (UCB Stat Tech Rep 820), and put k = m-1 (the largest possible value when p = 1). We don't need an SVD, and we find

β̂ = [Xᵀ(YcYcᵀ)⁻¹X]⁻¹ Xᵀ(YcYcᵀ)⁻¹ Y
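A numpy sketch of a GLS estimator of this form, assuming at least m control genes so that YcYcᵀ is invertible (the function name is hypothetical):

```python
import numpy as np

def ruvinv_beta(Y, X, ctl):
    """GLS sketch with covariance proxy Yc Yc' built from the negative-control
    columns. Requires at least m control genes so Yc Yc' (m x m) is invertible."""
    Yc = Y[:, ctl]
    S_inv = np.linalg.inv(Yc @ Yc.T)          # (Yc Yc')^{-1}, m x m
    A = np.linalg.inv(X.T @ S_inv @ X)        # [X'(YcYc')^{-1}X]^{-1}
    return A @ X.T @ S_inv @ Y                # the GLS estimate of beta
```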
[Figure: principal component 1 vs principal component 2. Light blue: 2 months; dark blue: 4 months; purple: 6 months; red: 8 months. Ideally we would have seen 4 tight groups of 3.]
rd1 is a mouse model of retinitis pigmentosa: loss of rod photoreceptors, followed by that of cone photoreceptors.
Differential expression was sought between 2 and 8 months (left volcano plot on the next slide). After adjusting for unwanted variation, we found significantly down-regulated retinal, even cone-specific, genes, which were later confirmed.
[Volcano plot: log2(fold change) 8m/2m vs -log10(p-value). Green dots: genes expressed in the retina.]
[Two volcano plots: log2(fold change) 8m/2m vs -log10(p-value). Green dots: genes expressed in the retina.]
training, test and target sets
Hypothetical example: Suppose we want to classify tumors into one of two types, A or E.
Suppose Asians tend to get type A, and Europeans type E. Is ethnicity "wanted" or "unwanted"? What if it is easy to classify by ethnicity, but hard or impossible to classify by tissue type per se? Same question, but now the "unwanted variation" is a lab effect, and one lab is in Agra and the other is in Essen.
It depends on how the classifier will be used, and how similar the target set is to the training set.
In most realistic applications, the new samples to be classified will come from a different "batch" than the training samples. What, if anything, can we do to guard against or deal with the possibility that new sources of unwanted variation will affect the new samples? More comments later.
The fact that the choice of negative controls depends on the purpose of the classifier is obviously important for applied work. But it is also interesting on a conceptual level. We see that the negative controls may be used not just to identify unwanted variation, but, in some sense, to define it.
Analogous assumptions hold for the target set; here X̃ is unobserved. The shared α and β (and p, k and n) constitute the weak stationarity assumption. Note that m and m̃ will in general differ. We assume ε and ε̃ are independent.
Goal: estimate Wα and W̃α, and subtract the estimates from Y and Ỹ respectively, to produce matrices P and P̃ (predictors) for the classification. P should be ≈ Xβ and P̃ should be ≈ X̃β.
We know* how to remove the unwanted variation from Y when X is known: we can use RUV-2, RUV-4 or RUV-inv to estimate W and α, and subtract Ŵα̂. How can we estimate and subtract W̃α̂ when X̃ is not known? We will describe two ways.
* We think we know. There is a catch!
Since Ỹc ≈ W̃αc, we can estimate W̃ by regressing Ỹc on an estimate of αc. Let UDVc′ be the SVD of Yc. Note that U ≈ W, and is an RUV-2 estimator of W, and that DVc′ ≈ αc, though it is not the RUV-2 estimator. Now U = YcVcD⁻¹, and so analogously we define W̃^ = ỸcVcD⁻¹. Then we find that W̃^ ≈ W̃.
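A numpy sketch of this target-set adjustment, with hypothetical helper and argument names, and with the α̂ for the retained factors taken as given from a training-set fit:

```python
import numpy as np

def method_a_adjust(Yc_train, Y_new, ctl, alpha_hat, k):
    """Sketch of adjusting new samples via the training controls' SVD.
    SVD of training control genes: Yc = U D Vc'. For new samples, estimate
    unwanted factors as W_new = Ynew_c Vc D^{-1} (first k factors), then
    subtract W_new @ alpha_hat. All names are illustrative."""
    U, d, Vct = np.linalg.svd(Yc_train, full_matrices=False)
    W_new = (Y_new[:, ctl] @ Vct[:k].T) / d[:k]   # new samples' factor scores
    return Y_new - W_new @ alpha_hat              # adjusted prediction data
```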
We can show that this expression is quite insensitive to violation of the negative control gene assumption, and so we can use all genes, giving an analogous all-gene estimator.
Method A starts with an estimate of αc based on Yc, and leads to P̃(A). Method B starts with an estimate of α, which might, but need not, be based on Y, and uses the W̃^ based on it to get P̃(B). If we get both estimates from RUV-inv, with the same controls, then we find that P̃(A) = P̃(B). But if things are done differently, they will diverge.
If we don't need to worry about control genes, we can use all genes. (We may still need control genes to get α̂ in the first place.) Using all genes gives us a richer estimate of W. If there are unwanted (e.g. biological) factors that affect a subset of genes, but not the negative control genes, these will not be adjusted for if we limit ourselves to control genes. But by using all genes as above, we can adjust for these factors.
P̃(B) is our test/target set prediction data. What is the analogue for the training set data? Doing the same thing with our training data, we find that it simplifies to Xβ̂: way too optimistic to be a realistic training set. We need another way.
Pi = Yi − Yi Y′−i (Y−i Y′−i)⁻¹ (Y−i − X−i β̂), i = 1,…,m.
Now the Pi, i = 1,…,m, are "not too clean".
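This leave-one-out formula transcribes directly into numpy (helper name hypothetical; it needs n ≥ m − 1 so that each Y−iY′−i is invertible):

```python
import numpy as np

def loo_training_predictors(Y, X, beta_hat):
    """Leave-one-out training predictors: for each sample i, remove unwanted
    variation using only the other m-1 samples, so P_i is 'not too clean'."""
    m = Y.shape[0]
    P = np.empty_like(Y)
    for i in range(m):
        idx = np.arange(m) != i
        Ymi, Xmi = Y[idx], X[idx]                 # data with sample i held out
        G = np.linalg.inv(Ymi @ Ymi.T)            # (Y_-i Y'_-i)^{-1}
        P[i] = Y[i] - Y[i] @ Ymi.T @ G @ (Ymi - Xmi @ beta_hat)
    return P
```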
Simulations have been carried out to compare Methods A and B, using "good", "bad" and "too good" controls, X and W uncorrelated or correlated, β and α uncorrelated or correlated, stationarity or not (column …), and RUV-inv for estimating β and α, with housekeeping (HK) or all genes as controls.
Overall, Methods A and B using RUV-inv are pretty similar, and best, but when the going gets tough, B wins. Not surprisingly, full outperforms HK in sims.
Here "simple" includes linear discriminant analysis (LDA) and diagonal linear discriminant analysis (ΔLDA). Below we compare them to support vector machines (SVM) and the elastic net logistic regression package glmnet (not-so-simple classifiers). We also use these classifiers with the only other method we know of that deals with unwanted variation: fSVA (Leek et al, 2012).
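As a sketch of the simplest of these, here is a minimal ΔLDA (diagonal LDA: Gaussian classes with a shared diagonal covariance), with ±1 labels as in the talk; the class and variable names are hypothetical:

```python
import numpy as np

class DiagonalLDA:
    """Diagonal LDA sketch: per-gene pooled variances, shared across classes,
    so the decision rule ignores between-gene covariances."""
    def fit(self, Y, x):
        self.mu_pos = Y[x == 1].mean(axis=0)
        self.mu_neg = Y[x == -1].mean(axis=0)
        resid = np.vstack([Y[x == 1] - self.mu_pos, Y[x == -1] - self.mu_neg])
        self.var = resid.var(axis=0) + 1e-8       # pooled per-gene variances
        return self

    def predict(self, Y):
        mid = (self.mu_pos + self.mu_neg) / 2     # midpoint between class means
        score = (Y - mid) @ ((self.mu_pos - self.mu_neg) / self.var)
        return np.where(score >= 0, 1, -1)
```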
Suppose that β and α are fixed, and that X, W, ε and their ~ counterparts are all random, and mutually independent. Assume that W and W̃ are iid N(0, Λ), and that X and X̃ are single-column matrices with entries −1 or +1 with probability π. Define Σ to be the n×n diagonal matrix whose diagonal entries are the variances σj². For illustration, we assume strong stationarity: that the pairs (Xi, Wi) and (X̃i, W̃i) are iid. Then we find that Yi | {Xi = −1} ~ N(−β, α′Λα + Σ) and Yi | {Xi = +1} ~ N(β, α′Λα + Σ), while Pi | {Xi = −1} ≈ N(−β, Σ) and Pi | {Xi = +1} ≈ N(β, Σ), and the same for P̃. Now Σ is diagonal. Discuss!
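The collapse of the class-conditional covariance from α′Λα + Σ to the diagonal Σ can be checked numerically. This sketch uses the oracle Wα for the removal; all names and sizes are hypothetical:

```python
import numpy as np

# Simulate the model, then compare the class-conditional covariance of Y
# (≈ α'Λα + Σ) with that of the adjusted P = Y − Wα (≈ Σ, diagonal).
rng = np.random.default_rng(1)
m, n, k = 20000, 4, 2
beta = rng.normal(size=(1, n))
alpha = rng.normal(size=(k, n))
sigma = np.full(n, 0.5)                       # diagonal entries of Σ

X = rng.choice([-1.0, 1.0], size=(m, 1))      # entries ±1, π = 0.5
W = rng.normal(size=(m, k))                   # iid N(0, I) rows, i.e. Λ = I
Y = X @ beta + W @ alpha + rng.normal(scale=np.sqrt(sigma), size=(m, n))

P = Y - W @ alpha                             # oracle removal of Wα
cov_Y = np.cov(Y[X[:, 0] == 1].T)             # ≈ α'Λα + Σ
cov_P = np.cov(P[X[:, 0] == 1].T)             # ≈ Σ (diagonal)
```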
(Vawter et al, Neuropsychopharmacology 2004.) Some arrays are missing, so there are just 84. We'll focus on gender, i.e. sex. There is lots of UV!
[Figure: 84 Affy chips; principal components 1 and 2 based on all genes, no preprocessing. Shape = lab; closed/open = chip v1/v2.]
Controls: we remove genes on the Y chromosome, XIST and DDX3X, for otherwise predicting gender is too easy. Training set 60, validation set 24. Results are averages over 100 random splits; in each case*, the classifiers are based on the top 10 ranked differentially expressed genes in the training set.
Avgs of 100 random        SVM   LDA   ΔLDA   Glmnet*
Unadjusted                .57   .58   .57    .71
fSVA (Leek et al, 2012)   .64   .64   .64    .72
RUV-inv only              .67   .68   .64     -
RUVB-inv (HK)             .87   .85   .88    .83
RUVB-inv (full)           .85   .83   .85    .84

*α = 1, own variable selection. ΔLDA = diagonal LDA.
… by normalizing a set of reference genes to a fixed distribution, a common strategy. This aspect of their algorithm, along with all others, is locked down as a Food & Drug Administration requirement.
… from there. If this is locked down, then the whole process can be locked down.
… becomes known, the estimate of β can be improved, but this would violate the locking.
Exploiting Negative Controls for High Dimensional Data Analysis
Johann A. Gagnon-Bartsch Laurent Jacob Terence P. Speed
10/12 of the book written, to be completed in the next few months; CUP-IMS monograph. This lecture was part of chapter 11.
Department of Statistics, University of Michigan
Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, CNRS, INRA, France