
On Robustness of Principal Component Regression Anish Agarwal - PowerPoint PPT Presentation



  1. On Robustness of Principal Component Regression. Anish Agarwal, Devavrat Shah, Dennis Shen, Dogyoon Song (MIT)

  2. What is PCR?

  3. What is PCR?

  4. What is PCR? Step 1: PCA

  5. What is PCR? Step 1: PCA (k components)

  6. What is PCR? Step 2: Regression (minimize)

  7. What is PCR? Step 3: Prediction
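
The three steps on slides 4 to 7 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pcr_fit` and `pcr_predict`, the synthetic data, and the choice not to center the covariates are all assumptions of this sketch.

```python
import numpy as np

def pcr_fit(Z, y, k):
    """Fit PCR on observed covariates Z (n x p) and responses y (n,)."""
    # Step 1: PCA via SVD, keeping the top-k principal directions.
    # (No centering is applied; that is an assumption of this sketch.)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V_k = Vt[:k].T                      # (p, k)
    scores = Z @ V_k                    # (n, k) projected covariates
    # Step 2: least squares on the k-dimensional scores.
    gamma, *_ = np.linalg.lstsq(scores, y, rcond=None)
    # Express the fitted coefficients in the original p-dimensional space.
    return V_k @ gamma

def pcr_predict(Z_new, beta):
    # Step 3: prediction is a linear map of the (possibly noisy) covariates.
    return Z_new @ beta

# Hypothetical synthetic example: rank-3 covariates observed with noise.
rng = np.random.default_rng(0)
n, p, r = 200, 50, 3
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))    # true low-rank covariates
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)
Z = X + 0.1 * rng.normal(size=(n, p))                     # noisy observations of X

beta_hat = pcr_fit(Z, y, k=r)
y_hat = pcr_predict(Z, beta_hat)
rel_err = np.linalg.norm(y_hat - X @ beta_true) / np.linalg.norm(X @ beta_true)
```

Here `rel_err` compares the PCR predictions with the noiseless signal `X @ beta_true`, which is one simple way to see the effect of projecting onto the top k components.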

  8. When & Why Use PCR

  9. Data Science Folklore: "IF DATA IS (APPROXIMATELY) LOW-DIMENSIONAL, USE PCR!" -- Anonymous Data Scientists. When exactly should we be using PCR?

  10. Key Questions We Answer: Theoretical properties of PCR? Is dimension reduction the only benefit of PCR?

  11. Our theoretical analysis of PCR helps answer the following questions: How low-rank do covariates need to be? How many principal components to pick? How well does PCR perform on test data (i.e., generalization properties)?

  12. Is Dimension Reduction the Only Benefit? NO!

  13. PCR (as is) works for a wide variety of settings: noisy, missing, mixed-valued, and sensitive covariates.

  14. Main Contribution of this Work: We show PCR is surprisingly robust to problems that plague large-scale modern datasets.

  15. Error-in-Variable Regression (Setting We Consider)

  16. Classical (high-dimensional) Regression

  17. Error-in-Variable (EIV) Regression: representative of modern datasets
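
For reference, a textbook error-in-variable model can be written as below. The notation (x_i, z_i, η_i, β, ε_i) is an assumption of this sketch rather than the slide's own, and it covers only additive covariate noise, not the missing or mixed-valued cases listed on slide 13.

```latex
\begin{aligned}
y_i &= \langle x_i, \beta \rangle + \varepsilon_i, && i = 1, \dots, n, \\
z_i &= x_i + \eta_i, && \text{only } (z_i, y_i) \text{ are observed.}
\end{aligned}
```

Classical regression corresponds to the special case η_i = 0, where the covariates x_i are observed exactly.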

  18. EIV: Surprising Number of Applications. Time Series Analysis (measurement noise); Causal Inference (Synthetic Control) (measurement noise); Differentially-private Regression (noise by design); Mixed Valued Regression (structural noise)

  19. EIV: Surprising Number of Applications. Time Series Analysis (measurement noise); Causal Inference (Synthetic Control) (measurement noise); Differentially-private Regression (noise by design); Mixed Valued Regression (structural noise)

  20. Formal Results

  21. Theorem (Informal): Training Error. If the number of principal components is chosen correctly (k = r), PCR implicitly denoises covariates! Terms labeled in the bound: number of covariates; fraction of observations; OLS minimax error rate (low-dimensional, noiseless, fully observed covariates).

  22. Theorem (Informal): Testing Error. If the number of principal components is not chosen correctly (k ≠ r). Terms labeled in the bound: test error; train error with PCR (k). PCR implicitly de-noises covariates and implicitly performs regularization. Choose k that minimizes the above.

  23. When to Use and Not to Use PCR? Look at the spectrum. [Figure: Cases 1 to 4, each plotting magnitude of singular values against singular values ordered by magnitude, grouped under "Use PCR!" and "Don't Use PCR!"]

  24. Exponentially decaying spectrum is ubiquitous in real-world data: GDP Trajectories (Macroeconomics)

  25. Exponentially decaying spectrum is ubiquitous in real-world data: Avito Ad-Click Dataset (E-Commerce)

  26. Exponentially decaying spectrum is ubiquitous in real-world data: Cricket Trajectories (Sports)
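
Inspecting the spectrum of a covariate matrix, as slides 23 to 26 suggest, takes only a singular value computation. The sketch below is illustrative: the helper names and the cumulative-energy threshold used to suggest k are assumptions of this example (a common heuristic), not the selection rule from the talk.

```python
import numpy as np

def singular_spectrum(Z):
    """Singular values of Z, ordered from largest to smallest."""
    return np.linalg.svd(Z, compute_uv=False)

def rank_by_energy(s, energy=0.99):
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum.

    Heuristic only; the threshold is an assumption of this sketch.
    """
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cum, energy)) + 1
    return min(k, len(s))

# Hypothetical example: a rank-3 matrix plus small noise has a sharply decaying spectrum.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 40))
Z = X + 0.01 * rng.normal(size=(100, 40))

s = singular_spectrum(Z)
k_hat = rank_by_energy(s, energy=0.99)
```

Plotting `s` against its index gives the kind of picture shown on the spectrum slides; a few large values followed by a sharp drop indicates an approximately low-rank covariate matrix.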

  27. Surprising Applications of PCR

  28. Applications of Error-in-Variable Regression: Time Series Analysis (measurement noise); Causal Inference (Synthetic Control) (measurement noise); Differentially-private Regression (noise by design); Mixed Valued Regression (structural noise)

  29. Data privacy is top-of-mind as we increasingly apply ML on sensitive user data (genetic data, purchase history, etc.)

  30. Standard Notion of Privacy in ML: ε-Differential Privacy. Intuitively, an algorithm is ε-differentially private if the outcome of a statistical query on a database cannot change by more than ε due to the presence/absence of any user data record. Example of Statistical Query: "Average Income of all users between ages 25 and 30"

  31. How to achieve ε-differential privacy? Laplace Mechanism: Laplacian noise added to the database.
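
As a rough illustration of the Laplace mechanism applied to the "average income" query from slide 30, the sketch below adds Laplace noise with scale sensitivity/ε to a clipped mean. The function names, the clipping bounds, and the neighboring-dataset convention (one record replaced, dataset size treated as public) are assumptions of this example, not details taken from the talk.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Noisy mean of `values` clipped to [lower, upper].

    Assumes the dataset size n is public and neighbors differ in one record,
    so the mean changes by at most (upper - lower) / n.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical incomes of users between ages 25 and 30.
rng = random.Random(0)
incomes = [42_000, 58_000, 61_000, 75_000, 120_000]
noisy_avg = private_mean(incomes, lower=0, upper=200_000, epsilon=1.0, rng=rng)
```

Smaller values of ε give stronger privacy and a larger noise scale, which is the accuracy versus privacy tradeoff the following slides discuss.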

  32. Predictive Accuracy vs. Privacy Tradeoff: Can we achieve good prediction error and still maintain privacy? Yes!

  33. Predictive Accuracy vs. Privacy Tradeoff: Can we achieve good prediction error and still maintain privacy? Step 1: Data Owner adds Laplacian Noise. Step 2: Analyst Performs PCR. Done!

  34. What is the sample complexity cost for ε-differential privacy? Prediction Error. Does the de-noising step (PCA) break privacy? No, PCA only de-noises covariates on average (with respect to a norm).

  35. Conclusion

  36. Inspect the spectrum of your covariate matrix. [Figure: magnitude of singular values against singular values ordered by magnitude.] Use PCR! Case 1: de-noises. Case 2: regularizes.

  37. Possible Implications for Modern ML. Linear Case: Step 1: Dimension Reduction via PCA. Linear low-dimensional covariate pre-processing has many implicit benefits (e.g. de-noising, regularizing). Non-Linear Case: GANs? Does non-linear covariate pre-processing (e.g. GANs) have similar benefits for unstructured data?

  38. Come Meet Us At Our Poster #3: East Exhibition Hall B + C, 5-7pm, Thursday. Shameless Plug :) PCR for Time Series Analysis: tspdb.mit.edu. PCR for Causal Inference: github.com/Romcos/SC_demo
