SLIDE 1

Leveraging prior information and group structure for false discovery rate control

Rina Foygel Barber

  • Dept. of Statistics, University of Chicago

http://www.stat.uchicago.edu/~rina/

SLIDE 2

Multiple comparisons & FDR control

When testing n different questions simultaneously, how do we determine which effects are significant?

  • False discovery proportion:

FDP = (# false discoveries) / (total # discoveries) = |H0 ∩ Ŝ| / |Ŝ|

  • False discovery rate:

FDR = E [FDP]

SLIDE 3

Multiple comparisons & FDR control

Benjamini-Hochberg (BH) procedure (1995): set a data-dependent threshold for rejecting p-values, to adapt to the amount of signal present in the data

  • If we reject all p-values below a fixed threshold t,

FDP(t) ≈ t · |H0| / #{i : Pi ≤ t} ≤ t · n / #{i : Pi ≤ t} = FDP̂(t)

  • Choose an adaptive threshold: the largest t with

FDP̂(t) ≤ α

  • Guaranteed to control FDR at level α

if p-values are independent or positively dependent (PRDS)

Benjamini & Hochberg 1995; Benjamini & Yekutieli 2001
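As a concrete reference, the adaptive-threshold rule just described can be sketched in a few lines of Python, using the estimate FDP̂(t) = t · n / #{i : Pi ≤ t} (a minimal illustration; the sketch and names are mine, not code from the talk):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Reject all p-values below the largest threshold t such that the
    estimated FDP, t * n / #{i : P_i <= t}, is at most alpha."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Equivalent BH form: find the largest k with P_(k) <= alpha * k / n
    below = sorted_p <= alpha * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)
    k = int(np.nonzero(below)[0].max()) + 1
    return np.sort(order[:k])  # indices of the rejected hypotheses
```

As stated above, this controls FDR at level α when the p-values are independent or PRDS.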

SLIDE 4

Multiple comparisons & FDR control

How can we incorporate additional information into the FDR control problem?

  • If some of the hypotheses are more likely to contain true signals, should we give them priority?

  • If the hypotheses have a grouped / clustered / hierarchical structure, how can we take this into account?

SLIDE 5

Outline

  • 1. Accumulation tests: testing a ranked list of hypotheses
  • Joint work with Ang Li
  • 2. The p-filter: FDR control across groups
  • Joint work with Aaditya Ramdas

SLIDE 6

Ordered hypothesis testing

Setting: a multiple comparisons problem with a pre-defined ordering.

p-values: P1, P2, P3, . . . , PN, ordered from most likely to be a true signal (selected first) to least likely to be a true signal (selected last)

SLIDE 7

Ordered hypothesis testing

Where does the ordering come from?

  • Data from related experiments: e.g. gene expression levels in a different tissue, with a related drug compound, etc.

  • Regression setting:

For sequential procedures (forward selection, LASSO, etc), recent work produces valid p-values for variables in the order that they are selected:

  • Post-selection inference

(Fithian, Taylor, Tibshirani, Tibshirani, Lockhart, . . . )

  • Knockoff method (Barber & Candès): one-bit p-values

SLIDE 8

Ordered hypothesis testing

SeqStep method (Barber & Candès):

[Plot: p-values vs. index for hypotheses 1–500, in the given ordering]

To estimate the # of nulls among the first k p-values: count how many of those p-values are > 0.5

SLIDE 9

Ordered hypothesis testing

Null p-values are equally likely to be above 0.5 or below 0.5
⇓
≈ half the null p-values, among the first k p-values, will be > 0.5
⇓
FDP(k) ≈ 2 · #{p-values > 0.5 among the first k} / k = FDP̂_SeqStep(k)

Then stop at k̂_SeqStep = the last time that FDP̂_SeqStep(k) ≤ α

SLIDE 10

Ordered hypothesis testing

A related method — ForwardStop (G'Sell et al 2013): to estimate the FDP among the first k p-values,

FDP̂_ForwardStop(k) = (1/k) · Σ_{i=1..k} log( 1 / (1 − Pi) )

Then stop at k̂_ForwardStop = the last time that FDP̂_ForwardStop(k) ≤ α

SLIDE 11

Accumulation tests

Accumulation test: reject the first k̂_h p-values, where

k̂_h = max{ k : FDP̂_h(k) ≤ α },

for FDP(k) = (# nulls among {1, . . . , k}) / k ≈ (h(P1) + · · · + h(Pk)) / k = FDP̂_h(k), the estimated FDP.

Here h is a function [0, 1] → [0, ∞] with:

  • ∫₀¹ h(t) dt = 1 ⇒ E[h(Pi)] = 1 for the nulls

  • h ≈ 0 near 0 ⇒ E[h(Pi)] ≈ 0 for strong signals
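A generic accumulation test takes only a few lines of Python. The sketch below is mine, not code from the paper; it includes the accumulation functions behind the SeqStep and ForwardStop estimates from the surrounding slides:

```python
import numpy as np

def accumulation_test(pvals, h, alpha=0.2):
    """Return k_hat = the largest k with (h(P_1)+...+h(P_k))/k <= alpha;
    the test then rejects the first k_hat p-values in the given order."""
    hvals = np.array([h(p) for p in pvals], dtype=float)
    fdp_hat = np.cumsum(hvals) / np.arange(1, len(hvals) + 1)
    passing = np.nonzero(fdp_hat <= alpha)[0]
    return 0 if len(passing) == 0 else int(passing[-1]) + 1

# Accumulation functions (each integrates to 1 over [0, 1]):
h_seqstep = lambda p: 2.0 * (p > 0.5)              # SeqStep
h_forwardstop = lambda p: np.log(1.0 / (1.0 - p))  # ForwardStop
```

With h_seqstep, FDP̂_h(k) reduces to the estimate 2 · #{Pi > 0.5 among the first k} / k from the SeqStep slide; with h_forwardstop, it reduces to the ForwardStop average.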

SLIDE 12

Accumulation tests

Existing & new choices for the function h:

[Plots of h(P) for P ∈ [0, 1], for three accumulation functions: SeqStep (knockoff paper), ForwardStop (G'Sell et al 2013), and HingeExp (new)]

SLIDE 13

Accumulation tests

Theorem

If h is an accumulation function bounded by C, then

E[ (# nulls among {1, . . . , k̂}) / (k̂ + C/α) ] ≤ α.

(See paper for a guarantee when h is unbounded.)

Advantage over BH & other multiple testing corrections: no dependence on n = # of hypotheses tested

SLIDE 14

Gene dosage data

  • Expression levels for n = 22283 genes, measured at different dosage levels. Sample size: 5 control (zero dose), 5 low dose, 5 high dose

  • Can we identify genes with differential expression at the lowest dosage level?

[Boxplots: expression levels for five example probes (1007_s_at, 121_at, 1053_at, 117_at, 1255_g_at) in the control, low dose, and high dose groups]

Data from Coser et al 2003 via the R GEOquery package (data set GDS2324)

SLIDE 15

Gene dosage data

  • Standard approach w/o high dose data:
  • 1. Two-sample test for control vs. low dose
  • 2. Then correct for multiple comparisons (BH & variants)

[Boxplots: control vs. low dose expression for the five example probes]

  • Our approach:
  • 1. Rank genes by comparing high dose vs. control/low dose
  • 2. Run accumulation test to compare control vs. low dose

[Boxplots: control/low dose vs. high dose for the ranking step, then control vs. low dose for the accumulation test, for the five example probes]

SLIDE 16

Gene dosage data

[Plot: # of discoveries vs. target FDR level α (from 0.0 to 0.9), comparing HingeExp, SeqStep, ForwardStop, and variants of the BH procedure (see paper for details)]

SLIDE 17

Outline

  • 1. Accumulation tests: testing a ranked list of hypotheses
  • Joint work with Ang Li
  • 2. The p-filter: FDR control across groups
  • Joint work with Aaditya Ramdas

SLIDE 18

Structured set of hypotheses

[Diagram: an array of hypotheses, indexed by timepoint (Time 1, Time 2, Time 3) and by location]

SLIDE 19

Structured set of hypotheses

  • n hypotheses with p-values P1, . . . , Pn

  • M “layers” = partitions of the hypotheses (e.g. entries, rows, columns in our array)

  • Goal: select a set Ŝ of discoveries such that the FDR is bounded simultaneously for layers 1, 2, . . . , M.

SLIDE 20

Structured set of hypotheses

Where do the groupings come from?

  • Natural structure in the set of hypotheses
  • Regression setting:

Clusters / correlations within the features; Hierarchical structure (e.g. due to interaction terms)

SLIDE 21

Multilayer FDR

How to define FDR for the mth layer?

  • Partition: [n] = A^m_1 ∪ · · · ∪ A^m_{G_m}

  • Null groups: H0_m = {g : A^m_g ⊆ H0}

  • Selected groups: Ŝ_m = {g : A^m_g ∩ Ŝ ≠ ∅}

  • FDR control: E[ |H0_m ∩ Ŝ_m| / |Ŝ_m| ] ≤ α_m ?
SLIDE 22

Multilayer FDR

A naive method:

  • For the mth layer,

— Calculate Simes p-values P^m_1, . . . , P^m_{G_m} (where P^m_g tests whether group A^m_g is all nulls)

— Run BH at level α_m on this list: reject the groups with P^m_g ≤ adaptive threshold t_m

  • Problem: results might not be consistent across the M layers
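The Simes p-value for a group has a one-line form (this is the standard definition; the sketch and names are mine):

```python
import numpy as np

def simes(pvals):
    """Simes p-value for the null that every hypothesis in the group is null:
    min over k of (n/k) * P_(k), where P_(k) is the k-th smallest p-value."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    return float(np.min(n * p / np.arange(1, n + 1)))
```

On the first group of the worked example that follows (p-values 0.03, 0.01, 0.18, 0.04, 0.08), this gives 0.05.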

SLIDE 23

Multilayer FDR

α_indiv = 0.1, α_group = 0.2

Individual p-values (one row per group), with each group's Simes p-value:

Group 1: 0.03 0.01 0.18 0.04 0.08 → Simes p-value 0.05
Group 2: 0.05 0.11 0.06 0.01 0.89 → Simes p-value 0.05
Group 3: 0.14 0.12 0.58 0.11 0.11 → Simes p-value 0.18
Group 4: 0.88 0.24 0.09 0.66 0.45 → Simes p-value 0.45

SLIDE 26

Multilayer FDR

The p-filter:

Ŝ(t_1, . . . , t_M) = the set of discoveries at thresholds t_1, . . . , t_M: Pi is selected if it belongs to a selected group in all M layers

  • Now estimate the FDP of Ŝ(t_1, . . . , t_M) in each layer:

FDP̂_m = t_m · G_m / |Ŝ_m(t_1, . . . , t_M)|   (numerator ≈ # false group discoveries; denominator = # group discoveries)

  • Choose the t_m's adaptively: maximize the t_m's subject to FDP̂_m ≤ α_m for all m.

SLIDE 27

Theoretical results

Theorem 1

This maximum is well-defined and can be computed efficiently.

Algorithm:

  • Initialize thresholds t_1 = α_1, . . . , t_M = α_M
  • Cycle through layers m = 1, . . . , M:

— Check whether FDP̂_m is low enough: t_m · G_m / |Ŝ_m(t_1, . . . , t_M)| ≤ α_m ?

— If not, reduce t_m until FDP̂_m ≤ α_m

  • . . . until there are no more changes.
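The cycling procedure above can be sketched as follows (my implementation of the slide's pseudocode, not the authors' released code; the `layers` argument and helper names are mine). A group is selected in layer m when its Simes p-value is ≤ t_m, and a hypothesis is a discovery when its group is selected in every layer:

```python
import numpy as np

def simes(pvals):
    """Simes p-value for a group of p-values."""
    p = np.sort(np.asarray(pvals, dtype=float))
    return float(np.min(len(p) * p / np.arange(1, len(p) + 1)))

def p_filter(pvals, layers, alphas):
    """Coordinate descent for the multilayer thresholds t_1, ..., t_M.
    layers: M partitions of {0, ..., n-1}, each a list of index lists.
    Returns the thresholds, one per layer."""
    pvals = np.asarray(pvals, dtype=float)
    t = [float(a) for a in alphas]  # initialize t_m = alpha_m
    # Each layer's group-level Simes p-values (the candidate thresholds)
    simes_p = [np.array([simes(pvals[g]) for g in groups]) for groups in layers]

    def n_selected_groups(m):
        # A hypothesis is selected iff its group passes in *every* layer;
        # count layer-m groups containing at least one selected hypothesis.
        keep = np.ones(len(pvals), dtype=bool)
        for mm, groups in enumerate(layers):
            ok = np.zeros(len(pvals), dtype=bool)
            for g, sp in zip(groups, simes_p[mm]):
                if sp <= t[mm]:
                    ok[g] = True
            keep &= ok
        return sum(1 for g in layers[m] if keep[g].any())

    changed = True
    while changed:
        changed = False
        for m, groups in enumerate(layers):
            G_m = len(groups)
            # Reduce t_m while FDP_hat_m = t_m * G_m / |S_m| exceeds alpha_m
            # (|S_m| is floored at 1, a convention for empty selections)
            while t[m] > 0 and t[m] * G_m > alphas[m] * max(n_selected_groups(m), 1):
                smaller = simes_p[m][simes_p[m] < t[m]]
                t[m] = float(smaller.max()) if len(smaller) else 0.0
                changed = True
    return t
```

With a single layer of singleton groups, the group Simes p-values are just the individual p-values, and the procedure recovers the BH selections.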

SLIDE 28

Theoretical results

PRDS assumption: for each i ∈ H0 and any increasing set A, P {P ∈ A | Pi = t} is an increasing function of t

Theorem 2

This procedure controls FDR for all layers:

FDR for layer m = E[ |H0_m ∩ Ŝ_m| / |Ŝ_m| ] ≤ α_m · |H0_m| / G_m   for all m.

Key lemma: If f(P) is a decreasing function of P, then

E[ 1{Pi ≤ f(P)} / f(P) ] ≤ 1.

SLIDE 29

Simulation results

Layers: entries; rows; columns. Target FDR: αentries = αrows = αcolumns = 0.2

SLIDE 30

Future work

  • Connection between ordered testing & online testing?
  • Create data-adaptive clusters?
  • An ordered testing approach for grouped hypotheses?

SLIDE 31

Thank you!

Accumulation tests (w/ Ang Li): http://www.stat.uchicago.edu/~rina/accumulationtests.html

Multi-FDR (w/ Aaditya Ramdas): http://www.stat.uchicago.edu/~rina/pfilter.html
