
Leveraging prior information and group structure for false discovery rate control


  1. Leveraging prior information and group structure for false discovery rate control
     Rina Foygel Barber
     Dept. of Statistics, University of Chicago
     http://www.stat.uchicago.edu/~rina/

  2. Multiple comparisons & FDR control
     When testing n different questions simultaneously, how to determine which effects are significant?
     • False discovery proportion: $\mathrm{FDP} = \frac{\#\text{ false discoveries}}{\text{total } \#\text{ discoveries}} = \frac{|\mathcal{H}_0 \cap \widehat{S}|}{|\widehat{S}|}$
     • False discovery rate: $\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}]$
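
The FDP definition above can be sketched in a few lines of Python. This is only an illustration: the function name, the example index sets, and the convention that FDP is 0 when nothing is selected are assumptions, not part of the slides.

```python
def false_discovery_proportion(selected, nulls):
    """FDP = |nulls ∩ selected| / |selected|; taken to be 0 when nothing is selected."""
    selected = set(selected)
    if not selected:
        return 0.0  # assumed convention for the 0/0 case
    return len(selected & set(nulls)) / len(selected)


# Hypothetical example: hypotheses 0..9, of which 5..9 are true nulls.
nulls = {5, 6, 7, 8, 9}
selected = {0, 1, 2, 5}  # three true signals and one null
fdp = false_discovery_proportion(selected, nulls)
```

The FDR is the expectation of this quantity over repeated experiments, so it cannot be computed from a single data set where the nulls are unknown.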

  3. Multiple comparisons & FDR control
     Benjamini-Hochberg (BH) procedure (1995): set a data-dependent threshold for rejecting p-values, to adapt to the amount of signal present in the data
     • If we reject all p-values below a fixed threshold $t$: $\frac{t \cdot |\mathcal{H}_0|}{\#\{i : P_i \le t\}} = \widehat{\mathrm{FDP}}(t) \approx \mathrm{FDP}(t)$
     • Choose adaptive threshold: max $t$ with $\widehat{\mathrm{FDP}}(t) \le \alpha$
     • Guaranteed to control FDR at level $\alpha$ if p-values are independent or positively dependent (PRDS)
     Benjamini & Hochberg 1995; Benjamini & Yekutieli 2001
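
A minimal sketch of the BH step-up rule, for illustration only. Since |H_0| is unknown in practice, the standard BH procedure replaces it with the upper bound n, which is what the code does; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return the (sorted) indices rejected by the BH step-up procedure at level alpha."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    # Find the largest k whose k-th smallest p-value satisfies n * P_(k) / k <= alpha,
    # i.e. the estimated FDP at threshold t = P_(k) is at most alpha.
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvalues[i] * n <= alpha * k:
            k_max = k
    return sorted(order[:k_max])


# Hypothetical p-values, not taken from the talk.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
```

The loop keeps the last k that passes rather than stopping at the first failure, which is what makes the procedure "step-up".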

  4. Multiple comparisons & FDR control
     How can we incorporate additional information into the FDR control problem?
     • If some of the hypotheses are more likely to contain true signals, should we give them priority?
     • If the hypotheses have a grouped / clustered / hierarchical structure, how can we take this into account?

  5. Outline
     1. Accumulation tests: testing a ranked list of hypotheses
        • Joint work with Ang Li
     2. The p-filter: FDR control across groups
        • Joint work with Aaditya Ramdas

  6. Ordered hypothesis testing
     Setting: a multiple comparisons problem with a pre-defined ordering.
     p-values: $P_1, P_2, P_3, \dots, P_N$, ordered from $P_1$ (select first / most likely to be a true signal) to $P_N$ (select last / least likely to be a true signal)

  7. Ordered hypothesis testing
     Where does the ordering come from?
     • Data from related experiments: e.g. gene expression levels in a different tissue, with a related drug compound, etc.
     • Regression setting: for sequential procedures (forward selection, LASSO, etc.), recent work produces valid p-values for variables in the order that they are selected:
       • Post-selection inference (Fithian, Taylor, Tibshirani, Tibshirani, Lockhart, ...)
       • Knockoff method (Barber & Candès): one-bit p-values

  8. Ordered hypothesis testing
     SeqStep method (Barber & Candès):
     [Figure: scatter plot of p-values (0 to 1) against index (0 to 500)]
     Want to estimate # nulls among first k p-values ⇒ count how many p-values are > 0.5

  9. Ordered hypothesis testing
     Null p-values are equally likely to be above 0.5 or below 0.5
     ⇓
     ≈ half the null p-values, among the first k p-values, will be > 0.5
     ⇓
     $\mathrm{FDP}(k) \approx \frac{2 \cdot (\#\text{ p-values} > 0.5 \text{, among first } k)}{k} = \widehat{\mathrm{FDP}}_{\mathrm{SeqStep}}(k)$
     Then stop at $\widehat{k}_{\mathrm{SeqStep}}$ = last time that $\widehat{\mathrm{FDP}}_{\mathrm{SeqStep}}(k) \le \alpha$
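
The SeqStep stopping rule above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code; the function name and the example p-values are hypothetical, and returning 0 when no k qualifies is an assumed convention.

```python
def seqstep(pvalues, alpha):
    """Return k_hat: the last k with 2 * #{i <= k : P_i > 0.5} / k <= alpha (0 if none)."""
    k_hat = 0
    count_large = 0  # number of p-values above 0.5 seen so far
    for k, p in enumerate(pvalues, start=1):
        if p > 0.5:
            count_large += 1
        # Same test as 2 * count_large / k <= alpha, written without division.
        if 2 * count_large <= alpha * k:
            k_hat = k
    return k_hat


# Hypothetical ordered p-values: small ones first, then several large ones.
ordered_pvals = [0.01, 0.02, 0.03, 0.001, 0.04, 0.02, 0.01,
                 0.03, 0.02, 0.01, 0.7, 0.6, 0.9, 0.8]
k_hat = seqstep(ordered_pvals, alpha=0.2)
```

The procedure rejects the first k_hat hypotheses in the given order, so a large p-value that falls before k_hat is still rejected.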

  10. Ordered hypothesis testing
      A related method: ForwardStop (G'Sell et al 2013).
      To estimate FDP among the first k p-values, $\widehat{\mathrm{FDP}}_{\mathrm{ForwardStop}}(k) = \frac{1}{k} \sum_{i=1}^{k} \log\left(\frac{1}{1 - P_i}\right)$
      Then stop at $\widehat{k}_{\mathrm{ForwardStop}}$ = last time that $\widehat{\mathrm{FDP}}_{\mathrm{ForwardStop}}(k) \le \alpha$
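
A sketch of the ForwardStop rule under the same conventions. The function name and example data are hypothetical; treating a p-value of exactly 1 as contributing infinity is an assumption made so that the logarithm stays defined.

```python
import math


def forwardstop(pvalues, alpha):
    """Return k_hat: the last k with (1/k) * sum_{i<=k} log(1/(1-P_i)) <= alpha (0 if none)."""
    k_hat = 0
    running_sum = 0.0
    for k, p in enumerate(pvalues, start=1):
        # log(1 / (1 - p)) equals -log1p(-p); it diverges as p approaches 1.
        running_sum += -math.log1p(-p) if p < 1.0 else math.inf
        if running_sum <= alpha * k:
            k_hat = k
    return k_hat


# Hypothetical ordered p-values.
k_hat = forwardstop([0.01, 0.02, 0.03, 0.9, 0.8], alpha=0.2)
```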

  11. Accumulation tests
      Accumulation test: reject the first $\widehat{k}_h$ p-values, where $\widehat{k}_h = \max\{k : \widehat{\mathrm{FDP}}_h(k) \le \alpha\}$, for
      $\mathrm{FDP}(k) = \frac{\#\text{ nulls among } \{1, \dots, k\}}{k} \approx \frac{h(P_1) + \dots + h(P_k)}{k} = \widehat{\mathrm{FDP}}_h(k)$ (the estimated FDP)
      $h$ is a function $[0, 1] \to [0, \infty]$ with
      • $\int_{t=0}^{1} h(t)\,dt = 1 \Rightarrow \mathbb{E}[h(P_i)] = 1$ for the nulls
      • $h \approx 0$ near 0 $\Rightarrow \mathbb{E}[h(P_i)] \approx 0$ for strong signals
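
A generic accumulation test can be sketched by passing h in as a function. This is an illustrative sketch only: the names `accumulation_test`, `h_seqstep`, and `midpoint_integral` are hypothetical, the example p-values are made up, and the midpoint rule is just a rough numerical check that h integrates to 1.

```python
def accumulation_test(pvalues, h, alpha):
    """Return k_hat = max{k : (h(P_1) + ... + h(P_k)) / k <= alpha}, or 0 if no k qualifies."""
    k_hat, total = 0, 0.0
    for k, p in enumerate(pvalues, start=1):
        total += h(p)
        if total <= alpha * k:
            k_hat = k  # keep the last k that passes, not the first failure
    return k_hat


def h_seqstep(p):
    """SeqStep accumulation function: 2 * 1{p > 0.5}."""
    return 2.0 if p > 0.5 else 0.0


def midpoint_integral(h, m=1000):
    """Midpoint-rule approximation of the integral of h over [0, 1]."""
    return sum(h((j + 0.5) / m) for j in range(m)) / m


# Hypothetical ordered p-values; the estimate exceeds alpha at k = 3 but recovers by k = 4.
pvals = [0.01, 0.02, 0.6, 0.01, 0.03, 0.9, 0.8, 0.7]
k_hat = accumulation_test(pvals, h_seqstep, alpha=0.5)
integral = midpoint_integral(h_seqstep)
```

Because the rule takes the maximum over k, a temporary excursion of the estimated FDP above α does not stop the procedure.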

  12. Accumulation tests
      Existing & new choices for the function h:
      [Figure: three panels plotting h(P) (0 to 4) against P (0.0 to 1.0): SeqStep (knockoff paper), ForwardStop (G'Sell et al 2013), HingeExp (new)]
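
For reference, the two accumulation functions that can be read off the SeqStep and ForwardStop slides are written out below, with a check that each integrates to 1. The subscripted names are notation introduced here for illustration. The HingeExp formula is not stated on the slides, so it is not reproduced.

```latex
\begin{align*}
h_{\mathrm{SeqStep}}(t) &= 2 \cdot \mathbf{1}\{t > 0.5\},
  & \int_0^1 h_{\mathrm{SeqStep}}(t)\,dt &= 2 \cdot \tfrac{1}{2} = 1, \\
h_{\mathrm{ForwardStop}}(t) &= \log\frac{1}{1 - t},
  & \int_0^1 \log\frac{1}{1 - t}\,dt &= \int_0^1 -\log u \,du = 1 .
\end{align*}
```

The SeqStep function is bounded by 2, while the ForwardStop function is unbounded as t approaches 1.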

  13. Accumulation tests
      Theorem. If $h$ is an accumulation function bounded by $C$, then
      $\mathbb{E}\left[\frac{\#\text{ nulls among } \{1, \dots, \widehat{k}_h\}}{\widehat{k}_h + C/\alpha}\right] \le \alpha.$
      (See paper for a guarantee when $h$ is unbounded.)
      Advantage over BH & other multiple testing corrections: no dependence on n = # of hypotheses tested

  14. Gene dosage data
      • Expression levels for n = 22283 genes measured at different dosage levels. Sample size: 5 control (zero dose), 5 low dose, 5 high dose
      • Can we identify genes with differential expression at the lowest dosage level?
      [Figure: expression levels (0 to 10) under control, low dose, and high dose for example genes 1007_s_at, 121_at, 1053_at, 117_at, 1255_g_at]
      Data from Coser et al 2003 via R GEOquery package (data set GDS2324)

  15. Gene dosage data
      • Standard approach w/o high dose data:
        1. Two-sample test for control vs. low dose
        2. Then correct for multiple comparisons (BH & variants)
        [Figure: two panels for the example genes: control, low dose, and high dose; then control and low dose only]
      • Our approach:
        1. Rank genes by comparing high dose vs. control/low dose
        2. Run accumulation test to compare control vs. low dose
        [Figure: three panels for the example genes: control, low dose, and high dose; then pooled control / low dose vs. high dose; then control vs. low dose]
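
The two-step approach above can be sketched on toy data. Everything specific here is an assumption made for illustration: the ranking score (absolute difference of means between high dose and pooled control/low dose), the toy expression values, the control-vs-low p-values (taken as given rather than computed), and the use of SeqStep as the accumulation test.

```python
from statistics import mean


def rank_by_high_dose(control, low, high):
    """Order gene indices by |mean(high) - mean(control + low)|, largest first (assumed score)."""
    def score(g):
        return abs(mean(high[g]) - mean(control[g] + low[g]))
    return sorted(range(len(high)), key=score, reverse=True)


def seqstep(pvalues, alpha):
    """Last k with 2 * #{i <= k : P_i > 0.5} / k <= alpha (0 if none)."""
    k_hat, count_large = 0, 0
    for k, p in enumerate(pvalues, start=1):
        count_large += p > 0.5
        if 2 * count_large <= alpha * k:
            k_hat = k
    return k_hat


# Toy expression values for 4 hypothetical genes (2 samples per group).
control = [[5.0, 5.1], [3.0, 3.2], [7.0, 7.1], [4.0, 4.1]]
low = [[5.2, 5.3], [3.1, 3.0], [7.9, 8.0], [4.0, 4.2]]
high = [[6.5, 6.6], [3.1, 3.1], [9.5, 9.6], [4.3, 4.4]]
# Hypothetical control-vs-low p-values, one per gene, assumed computed elsewhere.
pvals = [0.04, 0.8, 0.01, 0.6]

order = rank_by_high_dose(control, low, high)       # step 1: ranking from high dose data
k_hat = seqstep([pvals[g] for g in order], 0.2)     # step 2: accumulation test on the ranked list
discoveries = order[:k_hat]
```

The ranking uses only the high-dose comparison, while the p-values come from the control vs. low dose comparison, mirroring the split described on the slide.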

  16. Gene dosage data
      [Figure: # of discoveries (0 to 20000) against target FDR level α (0.0 to 0.9) for HingeExp, SeqStep, ForwardStop, and variants of the BH procedure (see paper for details)]

  17. Outline
      1. Accumulation tests: testing a ranked list of hypotheses
         • Joint work with Ang Li
      2. The p-filter: FDR control across groups
         • Joint work with Aaditya Ramdas

  18. Structured set of hypotheses
      [Figure: array of hypotheses arranged by location and timepoint (Time 1, Time 2, Time 3)]

  19. Structured set of hypotheses
      • n hypotheses with p-values $P_1, \dots, P_n$
      • M "layers" = partitions of the hypotheses (e.g. entries, rows, columns in our array)
      • Goal: select set $\widehat{S}$ of discoveries such that FDR is bounded simultaneously for layers $1, 2, \dots, M$.

  20. Structured set of hypotheses
      Where do the groupings come from?
      • Natural structure in the set of hypotheses
      • Regression setting: clusters / correlations within the features; hierarchical structure (e.g. due to interaction terms)

  21. Multilayer FDR
      How to define FDR for the m-th layer?
      • Partition $[n] = A^m_1 \cup \dots \cup A^m_{G_m}$
      • Nulls $\mathcal{H}^0_m = \{g : A^m_g \subseteq \mathcal{H}_0\}$
      • Selected set $\widehat{S}_m = \{g : A^m_g \cap \widehat{S} \neq \emptyset\}$
      • FDR control: $\mathbb{E}\left[\frac{|\mathcal{H}^0_m \cap \widehat{S}_m|}{|\widehat{S}_m|}\right] \le \alpha_m$ ?
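
The layer-wise definitions above can be sketched on a small array of hypotheses. The function name, the 2 x 3 example layout, and the convention that FDP is 0 when no group is selected are assumptions made for illustration.

```python
def layer_fdp(groups, nulls, selected):
    """Group-level FDP for one layer; `groups` is a list of sets partitioning the hypotheses."""
    nulls, selected = set(nulls), set(selected)
    # A group is null if all of its hypotheses are null.
    null_groups = {g for g, members in enumerate(groups) if members <= nulls}
    # A group is selected if it contains at least one selected hypothesis.
    selected_groups = {g for g, members in enumerate(groups) if members & selected}
    if not selected_groups:
        return 0.0  # assumed convention for the 0/0 case
    return len(null_groups & selected_groups) / len(selected_groups)


# Hypothetical 2 x 3 array of hypotheses, indexed 0..5 row by row.
entries = [{i} for i in range(6)]
rows = [{0, 1, 2}, {3, 4, 5}]
columns = [{0, 3}, {1, 4}, {2, 5}]
nulls = {2, 3, 4, 5}
selected = {0, 1, 3}

fdp_entries = layer_fdp(entries, nulls, selected)
fdp_rows = layer_fdp(rows, nulls, selected)
fdp_columns = layer_fdp(columns, nulls, selected)
```

The same selected set gives a different FDP in each layer, which is why the goal is stated as a simultaneous bound across layers.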
