Learning from Untrusted Data
Moses Charikar, Jacob Steinhardt, Gregory Valiant
Symposium on the Theory of Computing, June 19, 2017
(Icon credit: Annie Lin)

Motivation: data poisoning attacks
Question: what concepts can be learned in the presence of arbitrarily corrupted data?

Related Work
- 60 years of work on robust statistics...
PCA:
- XCM ’10, CLMW ’11, CSPW ’11
Mean estimation:
- LRV ’16, DKKLMS ’16, DKKLMS ’17, L ’17, DBS ’17, SCV ’17
Regression:
- NTN ’11, NT ’13, CCM ’13, BJK ’15
Classification:
- FHKP ’09, GR ’09, KLS ’09, ABL ’14
Semi-random graphs:
- FK ’01, C ’07, MMV ’12, S ’17
Other:
- HM ’13, C ’14, C ’16, DKS ’16, SCV ’16
Problem Setting
- Observe n points x1, . . . , xn
- An unknown subset of αn points is drawn i.i.d. from p∗
- The remaining (1 − α)n points are arbitrary
Goal: estimate a parameter of interest θ(p∗)
- assuming p∗ ∈ P (e.g. bounded moments)
- θ(p∗) could be the mean, the best-fit line, a ranking, etc.
New regime: α ≪ 1

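The setting above can be simulated in a few lines. A minimal sketch in numpy, where the Gaussian p∗, the planted-cluster adversary, and all parameter values are illustrative assumptions (none come from the talk):

```python
import numpy as np

# Of n points in R^d, an unknown alpha-fraction is drawn i.i.d. from p*
# (here a Gaussian with mean mu); the remaining points are adversarial.
rng = np.random.default_rng(0)
n, d, alpha = 1000, 5, 0.2
mu = np.ones(d)                      # theta(p*): the parameter to estimate
n_good = int(alpha * n)

good = rng.normal(loc=mu, scale=1.0, size=(n_good, d))
# One simple adversary: a tight far-away cluster that drags the naive
# mean off course.
bad = np.full((n - n_good, d), 10.0)
x = rng.permutation(np.vstack([good, bad]))  # observer sees an unlabeled mix

naive_err = np.linalg.norm(x.mean(axis=0) - mu)
print(naive_err)   # large: with alpha << 1 the naive mean is useless
```

Since the adversary controls a (1 − α)-majority of the points, no amount of averaging or simple trimming recovers µ, which is what motivates the relaxed guarantees on the next slide.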
Why Is This Possible?
If e.g. α = 1/3, estimation seems impossible: the αn good points could be any one of three equal-sized groups. But we can narrow the answer down to 3 possibilities!
List-decodable learning [Balcan, Blum, Vempala ’08]
- output O(1/α) answers, one of which is approximately correct
Semi-verified learning
- observe O(1) verified points from p∗

Why Care?
Practical problem: data poisoning attacks
- How can we build learning algorithms that are provably secure against manipulation?
Fundamental problem in robust statistics
- What can be learned in the presence of arbitrary outliers?
Agnostic learning of mixtures
- When is it possible to learn about one mixture component, with no assumptions about the other components?

Main Theorem
Observed functions: f1, . . . , fn. Want to minimize an unknown target function f̄.
Key quantity: a spectral norm bound on a subset I:

    (1/√|I|) · max_{w ∈ ℝ^d} ‖[∇fi(w) − ∇f̄(w)]_{i∈I}‖_op ≤ S.

Meta-Theorem: Given a spectral norm bound on an unknown subset of αn functions, learning is possible:
- in the semi-verified model (for convex fi)
- in the list-decodable model (for strongly convex fi)
All results are direct corollaries of the meta-theorem!

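To make the key quantity concrete: for mean estimation one can take fi(w) = ½‖w − xi‖₂², so that ∇fi(w) − ∇f̄(w) = µ − xi independently of w, and S is the top singular value of the centered good data scaled by 1/√|I|. A small numpy check, under an assumed Gaussian p∗ with σ = 1 (the quadratic fi and the distribution are illustrative choices, not part of the theorem statement):

```python
import numpy as np

# With f_i(w) = 0.5 * ||w - x_i||_2^2 we have grad f_i(w) = w - x_i and
# grad fbar(w) = w - mu, so the deviation matrix [grad f_i - grad fbar]
# has columns mu - x_i.  S is its operator norm divided by sqrt(|I|).
rng = np.random.default_rng(0)
d, m = 5, 4000                       # m = |I|, the number of good points
mu = np.zeros(d)
good = rng.normal(loc=mu, scale=1.0, size=(m, d))

G = (mu - good).T                    # d x |I| gradient-deviation matrix
S = np.linalg.norm(G, ord=2) / np.sqrt(m)   # ord=2: top singular value
print(S)   # concentrates near sigma = 1 for i.i.d. data
```

For i.i.d. data with bounded covariance this quantity stays O(σ) no matter what the adversary does to the points outside I, which is what makes it a useful handle on the good subset.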
Corollary: Mean Estimation
Setting: a distribution p∗ on ℝ^d with mean µ and bounded 1st moments: E_{p∗}[|⟨x − µ, v⟩|] ≤ σ‖v‖₂ for all v ∈ ℝ^d.
Observe αn samples from p∗ and (1 − α)n arbitrary points; want to estimate µ.
Theorem (Mean Estimation): If αn ≥ d, it is possible to output estimates µ̂1, . . . , µ̂m of the mean µ such that
- m ≤ 2/α, and
- min_{j=1,…,m} ‖µ̂j − µ‖₂ = Õ(σ/√α) w.h.p.
Alternately, it is possible to output a single estimate µ̂ given a single verified point from p∗.

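A toy illustration of why a list of O(1/α) candidates is the right kind of guarantee (this is ordinary k-means on well-separated clusters, not the paper's algorithm; the cluster placement and all constants are assumptions for the demo): with α = 1/3 the good data could be any one of three clusters, so no single estimate can succeed, but one of three candidate means does.

```python
import numpy as np

# Three well-separated clusters of equal size; only the first is "real".
rng = np.random.default_rng(0)
d, n_per = 3, 200
mu = np.array([5.0, 0.0, 0.0])                    # true mean of p*
fakes = [np.array([-5.0, 0.0, 0.0]), np.array([0.0, 5.0, 0.0])]
x = np.vstack([rng.normal(c, 0.5, size=(n_per, d)) for c in [mu] + fakes])

# Farthest-point seeding, then Lloyd iterations with k = 3.
cand = [x[0]]
for _ in range(2):
    dists = np.min([np.linalg.norm(x - c, axis=1) for c in cand], axis=0)
    cand.append(x[np.argmax(dists)])
cand = np.stack(cand)
for _ in range(20):
    labels = np.argmin(((x[:, None] - cand[None]) ** 2).sum(-1), axis=1)
    cand = np.stack([x[labels == j].mean(axis=0) for j in range(3)])

best = min(np.linalg.norm(c - mu) for c in cand)
print(best)   # one of the 3 candidates is close to the true mean
```

A single verified point from p∗ then suffices to pick the right candidate from the list, which is exactly the semi-verified model.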
Comparisons
Mean estimation:

                 Bound        Regime     Assumption    Samples
    LRV ’16      σ√(1 − α)    α > 1 − c  4th moments   d
    DKKLMS ’16   σ(1 − α)     α > 1 − c  sub-Gaussian  d^3
    CSV ’17      σ/√α         α > 0      1st moments   d

Estimating mixtures:

                 Separation     Robust?
    AM ’05       σ(k + 1/√α)    no
    KK ’10       σk             no
    AS ’12       σ√k            no
    CSV ’17      σ/√α           yes

Other Results
Stochastic block model (sparse regime; cf. GV ’14, LLV ’15, RT ’15, RV ’16):

                 Average Degree   Robust?
    GV ’14       1/α^4            no
    AS ’15       1/α^2            no
    CSV ’17      1/α^3            yes

Others:
- discrete product distributions
- exponential families
- ranking

Proof Overview (Mean Estimation)
Recall the goal: given n points x1, . . . , xn, of which αn are drawn from p∗, estimate the mean µ of p∗.
Key tension: balance adversarial and statistical error.
High-level strategy: solve a convex optimization problem
- if the cost is low, estimation succeeds (spectral norm bound)
- if the cost is high, identify and remove outliers

Algorithm
First pass: minimize over µ: Σ_{i=1}^n ‖xi − µ‖₂²
Second pass: minimize over µ1, . . . , µn: Σ_{i=1}^n ‖xi − µi‖₂²
Final pass: minimize over µ1, . . . , µn: Σ_{i=1}^n ‖xi − µi‖₂² + λF(µ1, . . . , µn)
(more generally, Σ_{i=1}^n fi(µi) + λF(µ1, . . . , µn) for losses fi)
Choices for F:
- nuclear norm: error σ/α
- maximum nuclear norm over subsets: error σ/√α (intractable)
- minimum trace ellipsoid: error σ/√α (tractable)
Clean-up: remove outliers, cluster the µi, output the cluster means
- padded decompositions [FRT ’03]

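As a rough sketch of the final pass, consider the first (σ/α-error) choice F = nuclear norm with squared loss; the minimizer then has a closed form via singular-value soft-thresholding. Everything below — the toy adversary, the λ choice, and the diagnostics — is illustrative; the paper's tractable regularizer is the minimum-trace ellipsoid, followed by the clean-up step:

```python
import numpy as np

# min over M of ||X - M||_F^2 + lam * ||M||_* is solved by soft-
# thresholding the singular values of X at lam / 2 (the prox of the
# nuclear norm, applied to the matrix M whose rows are the mu_i).
def nuclear_norm_prox(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam / 2.0, 0.0)) @ Vt

rng = np.random.default_rng(0)
n, d, alpha = 300, 10, 0.5
mu = np.full(d, 3.0)
n_good = int(alpha * n)
X = np.vstack([rng.normal(mu, 1.0, size=(n_good, d)),        # good points
               rng.normal(-mu, 1.0, size=(n - n_good, d))])  # toy adversary

M = nuclear_norm_prox(X, lam=2.0 * np.sqrt(n))   # heuristic lam choice
# The rows mu_i of M are pulled toward low-rank structure; the average of
# the good rows of M lands far closer to mu than the naive mean of X.
err_reg = np.linalg.norm(M[:n_good].mean(axis=0) - mu)
err_naive = np.linalg.norm(X.mean(axis=0) - mu)
print(err_reg, err_naive)
```

Clustering the rows µi (the clean-up step above) would then surface µ as one of the cluster means, without ever knowing which points were good.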
Summary
A method for robustness to a large fraction of adversarial data
Can handle arbitrary convex loss functions
- based on a spectral norm bound on the gradients
Strong bounds in many concrete settings
- mixtures, stochastic block model
Open questions:
- Can larger amounts of verified data yield stronger bounds?
- Can we exploit strong convexity / gradient bounds in other norms?
- Can we obtain guarantees in the online setting?

Main Theorem
Meta-Theorem: Let f1, . . . , fn : ℝ^d → ℝ be a collection of κ-strongly convex functions, and let f̄ : ℝ^d → ℝ be an unknown target function minimized at w∗. Suppose there is an (unknown) subset I ⊆ [n] of size αn such that

    (1/√|I|) · max_{w ∈ ℝ^d} ‖[∇fi(w) − ∇f̄(w)]_{i∈I}‖_op ≤ S.

Then there is an algorithm outputting m = 2/α candidates ŵ1, . . . , ŵm such that min_{j=1,…,m} ‖ŵj − w∗‖₂ = Õ(S/(κ√α)).
- Strong convexity can be removed (semi-verified model)