SLIDE 1

Sailing Through Data: Discoveries and Mirages

Emmanuel Candès, Stanford University. 2018 Machine Learning Summer School, Buenos Aires, June 2018

SLIDE 2

Robustness

SLIDES 3-9

Robustness

[Figure: Power and FDR versus the relative Frobenius norm error of the covariance estimate used to construct the knockoffs, with one curve per estimate: Exact Cov, Graph. Lasso, 50% Emp. Cov, 62.5% Emp. Cov, 75% Emp. Cov, 87.5% Emp. Cov, 100% Emp. Cov.]

Figure: Covariates are AR(1) with autocorrelation coefficient 0.3; n = 800, p = 1500, and the target FDR is 10%. Y | X follows a logistic model with 50 nonzero entries.

SLIDE 10

Simulations with synthetic Markov chain

Markov chain covariates with 5 hidden states. Binomial response

[Figure: Power (left) and FDP (right) versus signal amplitude, 4-20.]

Figure: Power and FDP over 100 repetitions (true $F_X$); n = 1000, p = 1000, target FDR α = 0.1; $Z_j = |\hat{\beta}_j(\hat{\lambda}_{CV})|$, $W_j = Z_j - \tilde{Z}_j$.
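These statistics are easy to reproduce with off-the-shelf tools. Below is a minimal sketch (not the talk's own code) of the statistic $W_j = Z_j - \tilde{Z}_j$ for a binomial response using scikit-learn; the inputs `X`, `X_tilde`, and `y` are assumed given, and sklearn's `C` grid plays the role of $\hat{\lambda}_{CV}$ since C = 1/λ.

```python
# Sketch: lasso-coefficient-difference statistics W_j = Z_j - Z~_j, where
# Z_j = |beta^_j(lambda_CV)| from an L1-penalized logistic fit on the
# augmented design [X, X~]. In practice X and X_tilde should be treated
# symmetrically (e.g., randomize the column order) so the fit cannot favor
# originals over knockoffs.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def knockoff_statistics(X, X_tilde, y, n_folds=5):
    n, p = X.shape
    XX = np.hstack([X, X_tilde])          # n x 2p augmented design
    fit = LogisticRegressionCV(
        Cs=50, cv=n_folds, penalty="l1", solver="liblinear", max_iter=5000
    ).fit(XX, y)
    beta = fit.coef_.ravel()              # length 2p
    Z, Z_tilde = np.abs(beta[:p]), np.abs(beta[p:])
    return Z - Z_tilde                    # W_j = Z_j - Z~_j
```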

SLIDE 11

Robustness

Markov chain covariates with 5 hidden states. Binomial response

[Figure: Power (left) and FDP (right) versus signal amplitude, 4-20.]

Figure: Power and FDP over 100 repetitions (estimated $F_X$); n = 1000, p = 1000, target FDR α = 0.1; $Z_j = |\hat{\beta}_j(\hat{\lambda}_{CV})|$, $W_j = Z_j - \tilde{Z}_j$.

SLIDE 12

Simulations with synthetic HMM

HMM covariates with latent “clockwise” Markov chain. Binomial response

[Figure: Power (left) and FDP (right) versus signal amplitude, 3-20.]

Figure: Power and FDP over 100 repetitions (true $F_X$); n = 1000, p = 1000, target FDR α = 0.1; $Z_j = |\hat{\beta}_j(\hat{\lambda}_{CV})|$, $W_j = Z_j - \tilde{Z}_j$.

SLIDE 13

Robustness

HMM covariates with latent “clockwise” Markov chain. Binomial response

[Figure: Power (left) and FDP (right) versus signal amplitude, 3-20.]

Figure: Power and FDP over 100 repetitions (estimated $F_X$); n = 1000, p = 1000, target FDR α = 0.1; $Z_j = |\hat{\beta}_j(\hat{\lambda}_{CV})|$, $W_j = Z_j - \tilde{Z}_j$.

SLIDE 14

Out-of-sample parameter estimation

Inhomogeneous Markov chain covariates with 5 hidden states. Binomial response

[Figure: Power (left) and FDP (right) versus the number of unsupervised observations, 10-10000.]

Figure: Power and FDP over 100 repetitions (estimated $F_X$ from an independent dataset); n = 1000, p = 1000, target FDR α = 0.1; $Z_j = |\hat{\beta}_j(\hat{\lambda}_{CV})|$, $W_j = Z_j - \tilde{Z}_j$.

SLIDES 15-19

Model-X knockoff variables (robust version)

i.i.d. samples from $P_{XY}$

  • Distribution $P_X$ of $X$ only 'approximately' known
  • Distribution $P_{Y|X}$ of $Y \mid X$ completely unknown

Knockoffs w.r.t. the user input $Q_X$ (Barber, C. and Samworth, '18)

Originals $X = (X_1, \ldots, X_p)$, knockoffs $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$:

(1) Pairwise exchangeability w.r.t. $Q_X$: if $X \sim Q_X$, then $(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})$, e.g.
$(X_1, X_2, X_3, \tilde{X}_1, \tilde{X}_2, \tilde{X}_3)_{\mathrm{swap}(\{2,3\})} \overset{d}{=} (X_1, \tilde{X}_2, \tilde{X}_3, \tilde{X}_1, X_2, X_3)$

(2) Ignore $Y$ when constructing knockoffs: $\tilde{X} \perp\!\!\perp Y \mid X$

Only the conditionals $Q(X_j \mid X_{-j})$ are required, and they do not have to be compatible.
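The talk's simulations use Markov-chain and HMM constructions; requirement (1) is easiest to see in the Gaussian special case of Candès et al. ('18). A minimal sketch, assuming rows of X are i.i.d. $\mathcal{N}(\mu, \Sigma)$ with $\Sigma$ a known correlation matrix (so $Q_X = P_X$ and exchangeability holds exactly):

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, rng=None):
    """Sample exact Gaussian model-X knockoffs (equicorrelated construction)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    # Equicorrelated choice s_j = min(2 * lambda_min(Sigma), 1), valid when
    # Sigma has unit diagonal (correlation scale).
    s = np.full(p, min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0))
    Sinv_D = np.linalg.solve(Sigma, np.diag(s))            # Sigma^{-1} diag{s}
    # Conditional law of X~ given X for the joint Gaussian (X, X~):
    #   mean = mu + (X - mu)(I - Sigma^{-1} D),  cov = 2D - D Sigma^{-1} D
    cond_mean = mu + (X - mu) @ (np.eye(p) - Sinv_D)
    cond_cov = 2 * np.diag(s) - np.diag(s) @ Sinv_D
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))   # jitter: cov may be singular
    return cond_mean + rng.standard_normal((n, p)) @ L.T
```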

SLIDES 20-21

FDR control

$\hat{S} = \{j : W_j \ge \tau\}$, where

$\tau = \min\Big\{ t : \underbrace{\frac{1 + |\{j : W_j \le -t\}|}{1 \vee |\{j : W_j \ge t\}|}}_{\widehat{\mathrm{FDP}}(t)} \le q \Big\}$

Theorem (Barber and C., '15)
If the user input $Q_X$ is correct ($Q_X = P_X$), then for knockoff+,
$E\left[\dfrac{\#\text{ false positives}}{\#\text{ selections}}\right] \le q$.
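The selection rule is only a few lines of code. A minimal sketch of knockoff+ for an arbitrary vector of statistics W:

```python
import numpy as np

def knockoff_plus_select(W, q=0.10):
    """Knockoff+ selection: S^ = {j : W_j >= tau} at the data-dependent tau."""
    for t in np.sort(np.abs(W[W != 0])):               # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:                               # first t with FDP^(t) <= q
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)                     # no feasible threshold
```

The "+1" in the numerator is what distinguishes knockoff+ from the plain knockoff filter, and it is what yields the exact FDR bound in the theorem.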
SLIDE 22

Robustness of knockoffs?

Does exchangeability hold approximately when $Q_X \neq P_X$?

[Cartoon: signs (+/−) of the $W_j$ laid out along the $|W|$ axis.]

If $P_X = Q_X$, the coin flips are unbiased and independent. Problem: if $P_X \neq Q_X$, the coin flips may be (slightly) biased and (slightly) dependent.

SLIDE 23

KL divergence condition

The KL condition

$\widehat{\mathrm{KL}}_j := \sum_i \log \frac{P_j(X_{ij} \mid X_{i,-j})\, Q_j(\tilde{X}_{ij} \mid X_{i,-j})}{Q_j(X_{ij} \mid X_{i,-j})\, P_j(\tilde{X}_{ij} \mid X_{i,-j})} \le \epsilon$

$E[\widehat{\mathrm{KL}}_j]$ is the KL divergence between the distributions of $(X_j, \tilde{X}_j, X_{-j}, \tilde{X}_{-j})$ and $(\tilde{X}_j, X_j, X_{-j}, \tilde{X}_{-j})$.
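When both the true conditional $P_j$ and the user's guess $Q_j$ have tractable densities, $\widehat{\mathrm{KL}}_j$ can be evaluated directly. A minimal sketch for univariate Gaussian conditionals; the arrays `mu_P`, `sd_P`, `mu_Q`, `sd_Q` (per-observation conditional means and standard deviations computed from $X_{i,-j}$) are hypothetical inputs:

```python
import numpy as np
from scipy.stats import norm

def kl_hat_j(x_j, x_tilde_j, mu_P, sd_P, mu_Q, sd_Q):
    """Observed KL statistic for coordinate j with Gaussian conditionals."""
    # sum_i log[ P_j(x_ij) Q_j(x~_ij) / (Q_j(x_ij) P_j(x~_ij)) ]
    return np.sum(norm.logpdf(x_j, mu_P, sd_P) + norm.logpdf(x_tilde_j, mu_Q, sd_Q)
                  - norm.logpdf(x_j, mu_Q, sd_Q) - norm.logpdf(x_tilde_j, mu_P, sd_P))
```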

SLIDES 24-25

From KL condition to FDR control

Theorem (Barber, C. and Samworth, 2018)
For any $\epsilon \ge 0$,
$E\left[\dfrac{\#\{\text{false positives } j \text{ with } \widehat{\mathrm{KL}}_j \le \epsilon\}}{\#\text{ selections}}\right] \le q\, e^{\epsilon}$

Corollary
$\mathrm{FDR} \le \min_{\epsilon \ge 0}\ \Big\{ q\, e^{\epsilon} + P\Big(\max_{\text{null } j} \widehat{\mathrm{KL}}_j > \epsilon\Big) \Big\}$

For instance, with q = 10% and ε = 0.1 the first term is 0.1·e^0.1 ≈ 11%, so a small KL error inflates the bound only slightly.

  • Information-theoretically optimal
SLIDE 26

New directions

SLIDE 27

ML inspired knockoffs

Joint with S. Bates, Y. Romano, M. Sesia and J. Zhou

  • Knockoffs for graphical models
  • Knockoffs via restricted Boltzmann machines
  • Knockoffs via variational auto-encoders?
  • Knockoffs via generative adversarial networks?

SLIDE 28

Improving power?

Joint with Z. Ren and M. Sesia

SLIDE 29

Derandomization

Combine information from multiple knockoffs: who's consistently showing up?

[Figure: Cartoon representation of the W's from different sample realizations of the knockoffs; the ordering of variables 1-9 along $|W|$ changes from realization to realization.]

SLIDE 30

Knockoffs for Fixed Features

Joint with Barber

SLIDE 31

Linear model

$y = X\beta + z$ with $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$, $z \in \mathbb{R}^n$, i.e. $y \sim \mathcal{N}(X\beta, \sigma^2 I)$

  • Fixed design $X$
  • Noise level $\sigma$ unknown
  • Multiple testing: $H_j : \beta_j = 0$ (is the $j$th variable in the model?)
  • Identifiability $\Rightarrow$ $p \le n$
  • Inference (FDR control) will hold conditionally on $X$

SLIDES 32-34

Knockoff features (fixed X)

Originals $X$, knockoffs $\tilde{X}$:

$\tilde{X}_j' \tilde{X}_k = X_j' X_k$ for all $j, k$
$\tilde{X}_j' X_k = X_j' X_k$ for all $j \neq k$

  • No need for new data or a new experiment
  • No knowledge of the response $y$

SLIDES 35-38

Knockoff construction (n ≥ 2p)

Problem: given $X \in \mathbb{R}^{n \times p}$, find $\tilde{X} \in \mathbb{R}^{n \times p}$ s.t.

$[X\ \tilde{X}]' [X\ \tilde{X}] = \begin{pmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{pmatrix} := G \succeq 0$

$G \succeq 0 \iff \mathrm{diag}\{s\} \succeq 0$ and $2\Sigma - \mathrm{diag}\{s\} \succeq 0$

Solution

$\tilde{X} = X(I - \Sigma^{-1} \mathrm{diag}\{s\}) + \tilde{U} C$

where $\tilde{U} \in \mathbb{R}^{n \times p}$ has column space orthogonal to that of $X$, and $C'C$ is a Cholesky factorization of $2\,\mathrm{diag}\{s\} - \mathrm{diag}\{s\}\, \Sigma^{-1}\, \mathrm{diag}\{s\} \succeq 0$.

SLIDES 39-41

Knockoff construction (n ≥ 2p)

$\tilde{X}_j' X_j = 1 - s_j$ (standardized columns)

  • Equicorrelated knockoffs: $s_j = 2\lambda_{\min}(\Sigma) \wedge 1$; under equivariance, this minimizes the value of $|\langle X_j, \tilde{X}_j \rangle|$.
  • SDP knockoffs: minimize $\sum_j |1 - s_j|$ subject to $s_j \ge 0$ and $\mathrm{diag}\{s\} \preceq 2\Sigma$, a highly structured semidefinite program (SDP).
  • Other possibilities ...
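Slides 35-41 combine into a short recipe. A minimal sketch of the equicorrelated fixed-X construction, assuming X has full column rank and n ≥ 2p:

```python
import numpy as np

def fixed_X_knockoffs(X):
    """Equicorrelated fixed-X knockoffs; assumes full column rank and n >= 2p."""
    n, p = X.shape
    assert n >= 2 * p, "this construction needs n >= 2p"
    X = X / np.linalg.norm(X, axis=0)                  # standardize columns
    Sigma = X.T @ X                                    # unit-diagonal Gram matrix
    s = np.full(p, min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0))
    Sinv_D = np.linalg.solve(Sigma, np.diag(s))        # Sigma^{-1} diag{s}
    Q, _ = np.linalg.qr(X, mode="complete")
    U = Q[:, p:2 * p]                                  # col(U) orthogonal to col(X)
    A = 2 * np.diag(s) - np.diag(s) @ Sinv_D           # 2D - D Sigma^{-1} D
    C = np.linalg.cholesky(A + 1e-10 * np.eye(p)).T    # C'C = A (Cholesky)
    return X @ (np.eye(p) - Sinv_D) + U @ C            # X(I - Sigma^{-1}D) + U~ C
```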

SLIDES 42-43

Why?

For a null feature $X_j$ (so $\beta_j = 0$):

$X_j' y = X_j' X \beta + X_j' z \overset{d}{=} \tilde{X}_j' X \beta + \tilde{X}_j' z = \tilde{X}_j' y$

(The Gram conditions give $X_j' X \beta = \tilde{X}_j' X \beta$ exactly when $\beta_j = 0$, and $X_j' z$ and $\tilde{X}_j' z$ have the same Gaussian law because $\|X_j\| = \|\tilde{X}_j\|$.)

SLIDE 44

Why?

For any subset of nulls $T$:

$[X\ \tilde{X}]_{\mathrm{swap}(T)}'\, y \overset{d}{=} [X\ \tilde{X}]'\, y$ and $[X\ \tilde{X}]_{\mathrm{swap}(T)}' [X\ \tilde{X}]_{\mathrm{swap}(T)} = [X\ \tilde{X}]' [X\ \tilde{X}]$

SLIDES 45-46

Exchangeability of feature importance statistics

  • Sufficiency: $(Z, \tilde{Z}) = z\big([X\ \tilde{X}]'[X\ \tilde{X}],\ [X\ \tilde{X}]'y\big)$
  • Knockoff-agnostic: swapping originals and knockoffs swaps the $Z$'s: $z\big([X\ \tilde{X}]_{\mathrm{swap}(T)},\ y\big) = (Z, \tilde{Z})_{\mathrm{swap}(T)}$

Theorem (Barber and C., '15)
For any subset $T$ of nulls, $(Z, \tilde{Z})_{\mathrm{swap}(T)} \overset{d}{=} (Z, \tilde{Z})$, which implies FDR control (conditional on $X$).
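A concrete statistic satisfying both properties is the marginal-correlation pair $Z_j = |X_j'y|$, $\tilde{Z}_j = |\tilde{X}_j'y|$. The toy check below (an illustration, not from the talk) verifies the knockoff-agnostic property numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X, X_tilde = rng.standard_normal((n, p)), rng.standard_normal((n, p))
y = rng.standard_normal(n)

def z_stats(X, X_tilde, y):
    # Function of [X X~]' y only, so the sufficiency property holds trivially.
    v = np.abs(np.hstack([X, X_tilde]).T @ y)
    return v[:X.shape[1]], v[X.shape[1]:]

T = [1, 3]                                         # swap originals/knockoffs in T
Xs, Xs_tilde = X.copy(), X_tilde.copy()
Xs[:, T], Xs_tilde[:, T] = X_tilde[:, T], X[:, T]

Z, Z_tilde = z_stats(X, X_tilde, y)
Zs, Zs_tilde = z_stats(Xs, Xs_tilde, y)
assert np.allclose(Zs[T], Z_tilde[T]) and np.allclose(Zs_tilde[T], Z[T])
```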

SLIDE 47

Telling the effect direction

[...] in classical statistics, the significance of comparisons (e.g., θ1 − θ2) is calibrated using Type I error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications. [...] a more relevant framework in which a true comparison can be positive or negative, and, based on the data, you can state "θ1 > θ2 with confidence", "θ2 > θ1 with confidence", or "no claim with confidence".

  • A. Gelman & F. Tuerlinckx
SLIDE 48

Directional FDR

Are any effects exactly zero?

$\mathrm{FDR}_{\mathrm{dir}} = E\left[\dfrac{\#\text{ selections with wrong effect direction}}{\#\text{ selections}}\right]$

The ratio is the directional false discovery proportion; its expectation is the directional FDR.

  • Directional FDR (Benjamini & Yekutieli, '05)
  • Sign errors (Type S) (Gelman & Tuerlinckx, '00)

Important for misspecified models, where exact sparsity is unlikely.

SLIDES 49-50

Directional FDR control

$(X_j - \tilde{X}_j)'y \overset{\mathrm{ind}}{\sim} \mathcal{N}(s_j \beta_j,\ 2\sigma^2 s_j), \qquad s_j \ge 0$

Sign estimate: $\mathrm{sgn}\big((X_j - \tilde{X}_j)'y\big)$

Theorem (Barber and C., '16)
Exact same knockoff selection + sign estimate: FDR ≤ FDR_dir ≤ q
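Operationally, the directional guarantee costs nothing extra: keep the usual knockoff+ selections and attach the sign estimate. A minimal sketch, reusing the hypothetical `knockoff_plus_select` helper sketched earlier:

```python
import numpy as np

def directional_selections(X, X_tilde, y, W, q=0.10):
    """Knockoff+ selections plus a direction for each selected effect."""
    selected = knockoff_plus_select(W, q)              # unchanged selection rule
    signs = np.sign((X - X_tilde).T @ y)               # sgn((X_j - X~_j)' y)
    return [(j, int(signs[j])) for j in selected]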

SLIDE 51

Directional FDR control

[Cartoon: signs (+/−) of the $W_j$ along the $|W|$ axis, with null and non-null variables marked.]

Null coin flips are unbiased.

SLIDE 52

Directional FDR control

[Cartoon: signs (+/−) of the $W_j$ along the $|W|$ axis.]

Great subtlety: the coin flips are now biased.

SLIDE 53

Empirical results

Features $\sim \mathcal{N}(0, I_n)$; n = 3000, p = 1000; k = 30 variables with regression coefficients of magnitude 3.5; nominal level q = 20%.

Method                        FDR (%)   Power (%)   Theor. FDR control?
Knockoff+ (equivariant)         14.40       60.99   Yes
Knockoff (equivariant)          17.82       66.73   No
Knockoff+ (SDP)                 15.05       61.54   Yes
Knockoff (SDP)                  18.72       67.50   No
BHq                             18.70       48.88   No
BHq + log-factor correction      2.20       19.09   Yes
BHq with whitened noise         18.79        2.33   Yes

SLIDE 54

Effect of signal amplitude

Same setup with k = 30 (q = 0.2)

[Figure: FDR (%) and Power (%) versus amplitude A, 2.8-4.2, for Knockoff, Knockoff+, and BHq, with the nominal level marked.]

SLIDE 55

Effect of feature correlation

Features $\sim \mathcal{N}(0, \Theta)$ with $\Theta_{jk} = \rho^{|j-k|}$; n = 3000, p = 1000, k = 30, amplitude 3.5.

[Figure: FDR (%) and Power (%) versus feature correlation ρ, 0.0-0.8, for Knockoff, Knockoff+, and BHq, with the nominal level marked.]

SLIDE 56

Fixed Design Knockoff Data Analysis

SLIDE 57

HIV drug resistance

Drug type   # drugs   Sample size   # protease or RT positions genotyped   # mutations appearing ≥ 3 times in sample
PI          6         848           99                                      209
NRTI        6         639           240                                     294
NNRTI       3         747           240                                     319

Response y: log-fold-increase of lab-tested drug resistance. Covariate X_j: presence or absence of mutation #j.

Data from R. Shafer (Stanford), available at: http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/

SLIDE 58

HIV data

The TSM list contains mutations associated with the PI class of drugs in general; it is not specialized to the individual drugs in the class.

Results for PI-type drugs:

[Figure: for each PI drug, the number of HIV-1 protease positions selected by Knockoff and by BHq, split into positions that appear in the TSM list and positions that do not. Data set sizes: APV n=768, p=201; ATV n=329, p=147; IDV n=826, p=208; LPV n=516, p=184; NFV n=843, p=209; SQV n=825, p=208.]

SLIDE 59

HIV data

Results for NRTI-type and NNRTI-type drugs:

[Figure: number of HIV-1 RT positions selected by Knockoff and by BHq, split by TSM-list membership. NRTI drugs: X3TC n=633, p=292; ABC n=628, p=294; AZT n=630, p=292; D4T n=630, p=293; DDI n=632, p=292; TDF n=353, p=218. NNRTI drugs: DLV n=732, p=311; EFV n=734, p=318; NVP n=746, p=319.]

SLIDE 60

High-dimensional setting

n ≈ 5,000 subjects; p ≈ 330,000 SNPs/variants to test.

[Figure: Manhattan plot of −log10(P value) across chromosomes 1-22 for HDL cholesterol, with hits at GALNT2, LPL, ABCA1, MVK/MMAB, LIPC, LCAT, LIPG, CETP.]

$p > n$ means we cannot construct knockoffs as before:

$\tilde{X}_j' \tilde{X}_k = X_j' X_k\ \ \forall\, j, k$ and $\tilde{X}_j' X_k = X_j' X_k\ \ \forall\, j \neq k \implies \tilde{X}_j = X_j\ \ \forall\, j$

SLIDES 61-64

High dimensional knockoffs: screen and confirm

Split the original data set:

  • exploratory sample $(X^{(0)}, y^{(0)})$: screen on sample 1
  • confirmatory sample $(X^{(1)}, y^{(1)})$: inference on sample 2

Theory (Barber and C., '16). Safe data re-use to improve power (Barber and C., '16).
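A minimal sketch of screen-and-confirm for a continuous response: a lasso screen on the exploratory half, then fixed-X knockoff inference on the confirmatory half restricted to the screened columns. `fixed_X_knockoffs` and `knockoff_plus_select` are the hypothetical helpers sketched earlier, and the marginal-correlation statistic stands in for any valid choice:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def screen_and_confirm(X, y, q=0.10, rng=None):
    """Lasso screening on sample 1, fixed-X knockoff inference on sample 2."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[0]
    idx = rng.permutation(n)
    explore, confirm = idx[: n // 2], idx[n // 2:]
    # Screen: keep variables with nonzero lasso coefficients on the exploratory half.
    keep = np.flatnonzero(LassoCV(cv=5).fit(X[explore], y[explore]).coef_ != 0)
    # Confirm: knockoffs on the held-out half, screened columns only. The fixed-X
    # construction needs len(keep) <= len(confirm) / 2, so cap the screen size.
    X1 = X[confirm][:, keep]
    X1_tilde = fixed_X_knockoffs(X1)
    W = np.abs(X1.T @ y[confirm]) - np.abs(X1_tilde.T @ y[confirm])
    return keep[knockoff_plus_select(W, q)]
```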

SLIDE 65

Some extensions

$y = X_1 \beta_1 + X_2 \beta_2 + \cdots + \mathcal{N}(0, \sigma^2 I_n)$, where $X_g$ is $n \times p_g$.

  • Group sparsity — build knockoffs at the group-wise level (Dai & Barber 2015)
  • Identify key groups with PCA — build knockoffs only for the top PC in each group (Chen, Hou & Hou 2017)
  • Build knockoffs only for prototypes selected from each group (Reid & Tibshirani 2015)
  • Multilayer knockoffs to control FDR at the individual and group levels simultaneously (Katsevich & Sabatti 2017)

SLIDE 66

Learning from data is not trivial

  • 'Wrapper' around a black-box algorithm rigorously addresses the reproducibility issue
  • How to make valid knockoffs (controls)? Importance of correct statistical reasoning
  • Which level of significance is appropriate? Importance of mathematics (martingale theory)
  • Sensitivity to modeling assumptions? Importance of mathematics

SLIDES 67-69

Beyond replicability: grand challenges in data-driven science

  • Reducing our irreproducibility
  • Establishing causality
  • Guaranteeing fairness and robustness of AI systems

In some cases, variables with the property p(response | variable, others) ≠ p(response | others) are 'causal'. If a predictive algorithm uses causal variables, then it is likely to be fair.

SLIDE 70

This is not just about not being wrong (irreproducibility)

Robustness? We would want predictions to remain valid in different samples collected in different circumstances. "Constant conjunction" is a property of causal effects (Hume).

SLIDE 71

Fairness: can computer programs be racist and sexist?

[Image: Guido Rosa/Getty Images/Ikon Images]

Blind application of machine learning runs the risk of amplifying biases and prejudices. Identifying variables gives us a chance to scrutinize a model built from one sample: do we believe these variables are "structurally" important, or are they just reflecting a spurious association in this sample? Are we learning something about the world, or reifying our prejudices?