SLIDE 1 Making Generalization Robust
Katrina Ligett HUJI & Caltech
joint with Rachel Cummings, Kobbi Nissim, Aaron Roth, and Steven Wu
SLIDE 2
A model for science…
SLIDE 3
A model for science…
SLIDE 4 Learning Alg
Hypothesis
- domain X: contains all possible examples
- hypothesis h: X -> {0,1}, labels examples
- learning algorithm: samples labeled examples, returns a hypothesis
SLIDE 5 Learning Alg
Hypothesis
The goal of science: find a hypothesis that has low true error on the distribution D: err_D(h) = Pr_{x~D}[h(x) ≠ h*(x)]
SLIDE 6
Why does science work?
SLIDE 7
Why does science work?
SLIDE 8 Learning Alg
Hypothesis
The goal of science: find a hypothesis that has low true error on the distribution D: err_D(h) = Pr_{x~D}[h(x) ≠ h*(x)]. Idea: find a hypothesis that has low empirical error on the sample S, plus a guarantee that findings on the sample generalize to D.
SLIDE 9 Learning Alg
Hypothesis
Empirical error: err_S(h) = (1/n) ∑_{x ∈ S} 1[h(x) ≠ h*(x)]
Generalization: output h s.t. Pr[ |err_S(h) − err_D(h)| ≤ α ] ≥ 1 − β
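To make these quantities concrete, here is a minimal numpy sketch (illustrative, not from the slides): a 1-D threshold target h*, a slightly-off hypothesis h, the empirical error on a sample S, and an estimate of the true error under D from a large fresh sample.

```python
# Minimal sketch (illustrative names): empirical vs. true error of a hypothesis,
# for a 1-D domain with threshold target h*(x) = 1[x > 0].
import numpy as np

rng = np.random.default_rng(0)

def h_star(x):          # unknown target labeling the examples
    return (x > 0.0).astype(int)

def h(x):               # a learned hypothesis (slightly shifted threshold)
    return (x > 0.1).astype(int)

# Sample S of n points drawn i.i.d. from D = N(0, 1)
n = 200
S = rng.normal(size=n)

# Empirical error: err_S(h) = (1/n) * sum_{x in S} 1[h(x) != h*(x)]
emp_err = np.mean(h(S) != h_star(S))

# True error err_D(h), estimated on a very large fresh sample from D
fresh = rng.normal(size=1_000_000)
true_err = np.mean(h(fresh) != h_star(fresh))

print(f"err_S(h) = {emp_err:.3f}, err_D(h) ≈ {true_err:.3f}")
# Generalization asks that |err_S(h) - err_D(h)| <= alpha with prob >= 1 - beta over S.
```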
SLIDE 10 taken from Understanding Machine Learning, Shai Shalev-Shwartz and Shai Ben-David
SLIDE 11
Problem solved!
SLIDE 12
Science doesn’t happen in a vacuum.
Problem solved?
SLIDE 13
One thing that can go wrong: post-processing
SLIDE 14
- Learning an SVM: the output encodes the Support Vectors (actual sample points)
- This output could be post-processed to obtain a non-generalizing hypothesis: “10% of all data points are x_k” (see the sketch below)
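A toy illustration of this post-processing attack (not from the slides; assumes scikit-learn is available): the learner's output literally contains sample points, and an adversary can turn them into a predicate that holds on S but has essentially zero mass under D.

```python
# Toy sketch (assumes scikit-learn): an SVM's output contains raw sample points
# (its support vectors), and post-processing them yields a claim that "fits" S
# but not the underlying distribution D.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 200
S = rng.normal(size=(n, 2))
y = (S[:, 0] + S[:, 1] > 0).astype(int)      # h*(x) = 1[x1 + x2 > 0]

clf = SVC(kernel="linear").fit(S, y)
sv = clf.support_vectors_                     # raw sample points exposed in the output

# Post-processed "hypothesis": the predicate "x is one of these memorized points".
def memorized(X):
    return np.array([any(np.allclose(x, v) for v in sv) for x in X])

print("fraction of S matching the predicate:", memorized(S).mean())      # > 0
fresh = rng.normal(size=(n, 2))                                           # fresh draw from D
print("fraction of fresh sample matching:   ", memorized(fresh).mean())   # ≈ 0
```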
SLIDE 15 Oh, man. Our approach on this Kaggle competition really failed on the test data. Oh well, let’s try again. Did you see that paper published by the Smith lab? Yeah, I bet they’d see an even bigger effect if they accounted for sunspots! The journal requires open access to the data—let’s try it and see!
SLIDE 16 A second big problem: adaptive composition
[diagram: an analyst adaptively issues queries q1, q2, … to a dataset and receives answers a1, a2, …]
SLIDE 17 A second big problem: adaptive composition
SLIDE 18 A second big problem: adaptive composition
Adaptive composition can cause overfitting! Generalization guarantees don’t “add up”
SLIDE 19 A second big problem: adaptive composition
Adaptive composition can cause overfitting! Generalization guarantees don’t “add up”
- Pick parameters; fit model
SLIDE 20 A second big problem: adaptive composition
Adaptive composition can cause overfitting! Generalization guarantees don’t “add up”
- Pick parameters; fit model
- ML competitions
SLIDE 21 A second big problem: adaptive composition
Adaptive composition can cause overfitting! Generalization guarantees don’t “add up”
- Pick parameters; fit model
- ML competitions
- Scientific fields that share one dataset
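A small simulation (illustrative, not from the slides) of the overfitting-from-adaptivity problem above: both features and labels are pure noise, yet features selected adaptively for their empirical correlation on the sample combine into a classifier that looks accurate on the very sample used to choose them.

```python
# Sketch: adaptive reuse of one sample causes overfitting. Data and labels are
# pure noise, but adaptively selected "significant" features combine into a
# classifier that looks good on the sample used to select them.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 1000
X = rng.choice([-1, 1], size=(n, d))      # features: pure noise
y = rng.choice([-1, 1], size=n)           # labels: pure noise, independent of X

# Round 1 of "analysis": query the sample for features correlated with y.
corr = X.T @ y / n
selected = np.abs(corr) > 0.2             # adaptively chosen based on the answers

# Round 2: fit a model using only the selected features.
w = np.sign(corr) * selected
pred = np.sign(X @ w)

print("accuracy on the reused sample:", np.mean(pred == y))   # well above 0.5
X_new = rng.choice([-1, 1], size=(n, d))
y_new = rng.choice([-1, 1], size=n)
print("accuracy on fresh data:       ", np.mean(np.sign(X_new @ w) == y_new))  # ≈ 0.5
```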
SLIDE 22 Some basic questions
- Is it possible to get good learning algorithms that are also robust to post-processing? To adaptive composition?
- How do we construct them? What about existing algorithms? How much extra data do they need?
- Accuracy + generalization + post-processing-robustness = ?
- Accuracy + generalization + adaptive composition = ?
- What composes with what? How well (how quickly does generalization degrade)? Why?
SLIDE 23 Notice: generalization doesn’t require correct hypotheses, just that they perform the same on the sample as on the distribution. Generalization alone is easy. What’s interesting: generalization + accuracy.
SLIDE 24
“no adversary can use the output to find a hypothesis that overfits”
information-theoretic (one could also consider a computational variant)
Generalization + post-processing robustness
SLIDE 25
Robust Generalization
SLIDE 26 Robust Generalization
- Robust to post-processing
- Somewhat robust to adaptive composition (more on this later)
SLIDE 27
Do Robustly-Generalizing Algs Exist?
SLIDE 28
Yes!
Do Robustly-Generalizing Algs Exist?
SLIDE 29
Yes!
Do Robustly-Generalizing Algs Exist?
SLIDE 30 Yes!
- This paper: Compression Schemes -> Robust Generalization
Do Robustly-Generalizing Algs Exist?
SLIDE 31 Yes!
- This paper: Compression Schemes -> Robust Generalization
- [DFHPRR15a]: Bounded description length -> Robust Generalization
Do Robustly-Generalizing Algs Exist?
SLIDE 32 Yes!
- This paper: Compression Schemes -> Robust Generalization
- [DFHPRR15a]: Bounded description length -> Robust Generalization
- [BNSSSU16]: Differential privacy -> Robust Generalization
Do Robustly-Generalizing Algs Exist?
SLIDE 33 Yes!
- This paper: Compression Schemes -> Robust Generalization
- [DFHPRR15a]: Bounded description length -> Robust Generalization
- [BNSSSU16]: Differential privacy -> Robust Generalization
Do Robustly-Generalizing Algs Exist?
SLIDE 34 Compression schemes
Hypothesis
SLIDE 35 Compression schemes
Hypothesis
SLIDE 36 Robust Generalization via compression
[diagram: algorithm A]
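For concreteness, a minimal sketch (illustrative names, not from the slides) of a sample compression scheme for 1-D thresholds: the learner compresses the labeled sample to a single point, and the hypothesis is reconstructed from that compressed set alone.

```python
# Minimal sketch of a sample compression scheme for 1-D thresholds
# h_t(x) = 1[x >= t]: compress the labeled sample to one point, then
# reconstruct the hypothesis from the compressed set alone.
import numpy as np

def compress(S, y):
    """Keep one example: the smallest positively labeled point (or a default)."""
    pos = S[y == 1]
    return np.array([pos.min()]) if len(pos) else np.array([np.inf])

def reconstruct(kept):
    """Rebuild a hypothesis from the compressed set only."""
    t = kept[0]
    return lambda x: (x >= t).astype(int)

rng = np.random.default_rng(3)
S = rng.uniform(-1, 1, size=500)
y = (S >= 0.3).astype(int)                 # target threshold t* = 0.3

h = reconstruct(compress(S, y))
fresh = rng.uniform(-1, 1, size=100_000)
print("true error ≈", np.mean(h(fresh) != (fresh >= 0.3)))
```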
SLIDE 37 What Can be Learned under RG?
Theorem (informal; thanks to Shay Moran): the sample complexity of robustly generalizing learning is the same, up to log factors, as the sample complexity of PAC learning
SLIDE 38 Yes!
- This paper: Compression Schemes -> Robust Generalization
- [DFHPRR15a]: Bounded description length -> Robust Generalization
- [BNSSSU16]: Differential privacy -> Robust Generalization
Do Robustly-Generalizing Algs Exist?
SLIDE 39
SLIDE 40
Differential Privacy [DMNS ‘06]
SLIDE 41 Differential Privacy [DMNS ‘06]
- Robust to post-processing [DMNS ‘06] and adaptive composition [DRV ‘10]
- Necessarily randomized output
- No mention of how the samples are drawn!
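A minimal sketch (illustrative, not from the slides) of a differentially private answer to a statistical query via the Laplace mechanism: the fraction of a sample satisfying a predicate has sensitivity 1/n, so Laplace noise of scale 1/(nε) yields ε-DP.

```python
# Minimal sketch: answer "what fraction of the sample satisfies a predicate?"
# with epsilon-differential privacy via the Laplace mechanism.
import numpy as np

def dp_fraction(sample, predicate, epsilon, rng):
    n = len(sample)
    true_answer = np.mean([predicate(x) for x in sample])
    noise = rng.laplace(loc=0.0, scale=1.0 / (n * epsilon))   # Lap(sensitivity / epsilon)
    return true_answer + noise

rng = np.random.default_rng(4)
sample = rng.normal(size=1000)
print(dp_fraction(sample, lambda x: x > 0, epsilon=0.5, rng=rng))
```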
SLIDE 42
Does DP = RG?
SLIDE 43
Does DP = RG?
SLIDE 44
Does DP = RG?
SLIDE 45 Does DP = RG?
No “quick fix” to make an RG learner satisfy DP
SLIDE 46
- Robust generalization: “no adversary can use the output to find a hypothesis that overfits”
- Differential privacy [DMNS ‘06]: “similar samples should have the same output”
- Perfect generalization: “output reveals nothing about the sample”
Notions of generalization
SLIDE 47
Perfect Generalization
SLIDE 48
- Differential privacy gives privacy to the individual: changing one entry in the database shouldn’t change the output too much
- Perfect generalization gives privacy to the data provider (e.g., a school or hospital): changing the entire sample to something “typical” shouldn’t change the output too much
PG as a privacy notion
SLIDE 49
Exponential Mechanism [MT07]
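A sketch of the exponential mechanism (illustrative, not the slides' own implementation): it selects an output with probability proportional to exp(ε · score / (2Δ)), where the score measures quality on the sample and Δ bounds how much a score can change when one sample point changes.

```python
# Sketch of the exponential mechanism [MT07]: pick one of a finite set of
# outputs with probability proportional to exp(epsilon * score / (2 * sensitivity)).
import numpy as np

def exponential_mechanism(sample, outputs, score, sensitivity, epsilon, rng):
    scores = np.array([score(sample, r) for r in outputs])
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return outputs[rng.choice(len(outputs), p=probs)]

# Example: privately pick the most common value in a small categorical sample.
rng = np.random.default_rng(5)
sample = ["a", "a", "b", "a", "c", "b", "a"]
outputs = ["a", "b", "c"]
count = lambda s, r: s.count(r)           # score with sensitivity 1
print(exponential_mechanism(sample, outputs, count, sensitivity=1, epsilon=1.0, rng=rng))
```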
SLIDE 50
DP implies PG with worse parameters
SLIDE 51
PG implies DP…sort of
SLIDE 52
PG implies DP…sort of
SLIDE 53 PG implies DP…sort of
Problems that are solvable under PG are also solvable under DP
SLIDE 54
- Robust generalization: “no adversary can use the output to find a hypothesis that overfits”
- Differential privacy [DMNS ‘06]: “similar samples should have the same output”
- Perfect generalization: “output reveals nothing about the sample”
Notions of generalization
SLIDE 55 Some basic questions
- Is it possible to get good learning algorithms that are also robust to post-processing? To adaptive composition?
- How do we construct them? What about existing algorithms? How much extra data do they need?
- Accuracy + generalization + post-processing-robustness = ?
- Accuracy + generalization + adaptive composition = ?
- What composes with what? How well (how quickly does generalization degrade)? Why?
SLIDE 56 Making Generalization Robust
Katrina Ligett katrina.ligett@mail.huji.ac.il HUJI & Caltech
joint with Rachel Cummings, Kobbi Nissim, Aaron Roth, and Steven Wu