

SLIDE 1

Differentially Private Empirical Risk Minimization with Non-convex Loss Functions

Di Wang, Changyou Chen and Jinhui Xu

State University of New York at Buffalo

International Conference on Machine Learning 2019


SLIDE 2

Outline

1. Introduction
   - Problem Description
   - Result 1
   - Result 2
   - Result 3



SLIDE 4

Empirical Risk Minimization (ERM)

Given: a dataset D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where each (x_i, y_i) ∈ ℝ^d × ℝ is sampled from a distribution P.

Regularization r(·) : ℝ^d → ℝ; we use ℓ_2 regularization with r(w) = (λ/2)‖w‖_2².

For a loss function ℓ, the (regularized) Empirical Risk is

L̂^r(w; D) = (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i) + r(w),

and the (regularized) Population Risk is

L^r_P(w) = E_{(x,y)∼P}[ℓ(w; x, y)] + r(w).

Goal: Find w so as to minimize the empirical or population risk.

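To ground these definitions, here is a minimal numpy sketch of the regularized empirical risk; the squared loss and all parameter values are illustrative stand-ins, not choices made in the talk.

```python
import numpy as np

def empirical_risk(w, X, y, loss, lam=0.1):
    """Regularized empirical risk: (1/n) * sum_i loss(w; x_i, y_i) + (lam/2) * ||w||_2^2."""
    data_term = np.mean([loss(w, X[i], y[i]) for i in range(X.shape[0])])
    reg_term = 0.5 * lam * np.dot(w, w)   # r(w) = (lam/2) * ||w||_2^2
    return data_term + reg_term

# Squared loss (⟨w, x⟩ - y)^2, an arbitrary stand-in for the generic loss ℓ.
sq_loss = lambda w, x, y: (np.dot(w, x) - y) ** 2

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)   # n = 100 points in R^5
print(empirical_risk(np.zeros(5), X, y, sq_loss))
```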

SLIDE 5

(ε, δ)-Differential Privacy (DP)

Differential Privacy (DP) [Dwork et al., 2006]

We say that two datasets D and D′ are neighbors if they differ in only one entry, denoted D ∼ D′.

A randomized algorithm A is (ε, δ)-differentially private if, for all neighboring datasets D, D′ and all events S in the output space of A, we have Pr(A(D) ∈ S) ≤ e^ε Pr(A(D′) ∈ S) + δ.

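As a concrete (and standard, not paper-specific) instance of an (ε, δ)-DP algorithm, here is a minimal sketch of the Gaussian mechanism: add noise calibrated to the ℓ_2-sensitivity of the released statistic. The calibration σ = √(2 ln(1.25/δ)) Δ_2/ε is the classical bound, valid for ε < 1.

```python
import numpy as np

def gaussian_mechanism(data, f, l2_sensitivity, eps, delta, rng):
    """Release f(data) with (eps, delta)-DP by adding Gaussian noise
    scaled to the l2-sensitivity of f (calibration valid for eps < 1)."""
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / eps
    value = f(data)
    return value + rng.normal(scale=sigma, size=np.shape(value))

rng = np.random.default_rng(0)
data = rng.uniform(0, 1, size=1000)          # entries bounded in [0, 1]
mean_fn = lambda d: d.mean()                 # changing one entry moves the mean
sens = 1.0 / len(data)                       # by at most 1/n
print(gaussian_mechanism(data, mean_fn, sens, eps=0.5, delta=1e-5, rng=rng))
```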

SLIDE 6

DP-ERM

Determine a sample complexity n = n(1/ε, 1/δ, d, 1/α) such that there is an (ε, δ)-DP algorithm whose output w_priv achieves an α-error in the expected excess empirical risk:

Err^r_D(w_priv) = E[L̂^r(w_priv; D)] − min_{w∈ℝ^d} L̂^r(w; D) ≤ α,

or in the expected excess population risk:

Err^r_P(w_priv) = E[L^r_P(w_priv)] − min_{w∈ℝ^d} L^r_P(w) ≤ α.

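To make the error measure concrete, a small sketch that estimates Err_D of a given output by comparing its empirical risk against a minimum approximated over a finite candidate grid; the grid (standing in for the minimum over ℝ^d), the data, and w_priv are all illustrative assumptions.

```python
import numpy as np

def excess_empirical_risk(emp_risk, w_priv, candidates):
    """Err_D(w_priv) = L-hat(w_priv; D) - min_w L-hat(w; D), with the
    minimum approximated over a finite candidate grid."""
    return emp_risk(w_priv) - min(emp_risk(w) for w in candidates)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 1)), rng.normal(size=200)
emp_risk = lambda w: np.mean((X @ w - y) ** 2)           # unregularized L-hat
grid = [np.array([t]) for t in np.linspace(-2.0, 2.0, 81)]
w_priv = np.array([0.3])                                 # stand-in for a DP output
print(excess_empirical_risk(emp_risk, w_priv, grid))
```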


SLIDE 9

Motivation

Previous work on DP-ERM mainly focuses on convex loss functions. For non-convex loss functions, [Zhang et al., 2017] and [Wang and Xu, 2019] studied the problem and used, as the error measure, the ℓ_2 norm of the gradient at the private estimator, i.e., ‖∇L̂^r(w_priv; D)‖_2 and E_P‖∇ℓ(w_priv; x, y)‖_2.

Main Question: Can the excess empirical (population) risk be used to measure the error of non-convex loss functions in the differential privacy model?




SLIDE 13

Result 1

Theorem 1

If the loss function is L-Lipschitz, twice differentiable and M-smooth, then by using a private version of Gradient Langevin Dynamics (DP-GLD) we show that the excess empirical (or population) risk is upper bounded by Õ(d log(1/δ) / (ε² log n)).

The proof is based on some recent developments in Bayesian learning and the analysis of GLD. By using a finer analysis of the time-average error of the associated SDE, we show the following.

Theorem 2

For the excess empirical risk, there is an (ε, δ)-DP algorithm which satisfies lim_{T→∞} Err^r_D(w_T) ≤ Õ(C_0(d) log(1/δ) / (n^τ ε^τ)), where C_0(d) is a function of d and 0 < τ < 1 is some constant.

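As a rough illustration of the algorithmic template, a minimal sketch of gradient Langevin dynamics; the step size, inverse temperature, and iteration count are placeholder values, and the calibration that makes the injected noise yield (ε, δ)-DP (the actual content of DP-GLD) is elided.

```python
import numpy as np

def gld(grad_L, w0, eta=0.01, beta=100.0, T=1000, rng=None):
    """Gradient Langevin Dynamics on the empirical risk:
    w <- w - eta * grad_L(w) + sqrt(2 * eta / beta) * N(0, I).
    In DP-GLD this same Gaussian injection, with (eta, beta, T)
    calibrated as in the paper's analysis, provides (eps, delta)-DP."""
    rng = rng or np.random.default_rng(0)
    w, iterates = w0.copy(), []
    for _ in range(T):
        w = w - eta * grad_L(w) + np.sqrt(2 * eta / beta) * rng.normal(size=w.shape)
        iterates.append(w.copy())
    return iterates          # time averages of these appear in the risk bounds

# Toy non-convex objective (hypothetical): L(w) = ||w||^2 / 2 + cos(w_1).
grad = lambda w: w + np.array([-np.sin(w[0])] + [0.0] * (len(w) - 1))
print(gld(grad, np.ones(5))[-1])
```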



SLIDE 17

Result 2

Are these bounds tight? Based on the exponential mechanism, we have:

Empirical Risk

For any β < 1, there is an ε-differentially private algorithm whose output w_priv achieves an excess empirical risk Err^r_D(w_priv) ≤ Õ(d/(nε)) with probability at least 1 − β.

Population Risk

For Generalized Linear Models and Robust Regression (whose loss functions are ℓ(w; x, y) = (σ(⟨w, x⟩) − y)² and ℓ(w; x, y) = Φ(⟨w, x⟩ − y), respectively), under some reasonable assumptions there is an (ε, δ)-DP algorithm whose excess population risk is upper bounded by Err_P(w_priv) ≤ O((d ln(1/δ))^{1/4} / (√n ε)).

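For intuition, a minimal sketch of the exponential mechanism over a discretized candidate set with utility u(w) = −L̂(w; D); the grid and the bounded-loss sensitivity estimate are illustrative assumptions rather than the construction used in the paper.

```python
import numpy as np

def exponential_mechanism(candidates, utility, sensitivity, eps, rng):
    """Sample candidate w with Pr[w] proportional to exp(eps * u(w) / (2 * Du)),
    which is eps-DP when Du bounds the sensitivity of the utility u."""
    u = np.array([utility(w) for w in candidates])
    probs = np.exp(eps * (u - u.max()) / (2 * sensitivity))   # shift for stability
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 1)), rng.normal(size=200)
emp_risk = lambda w: np.mean((X @ w - y) ** 2)
grid = [np.array([t]) for t in np.linspace(-2.0, 2.0, 81)]    # discretized domain
# Utility = -L-hat; if each loss lies in [0, B], its sensitivity is B / n.
w_priv = exponential_mechanism(grid, lambda w: -emp_risk(w),
                               sensitivity=1.0 / 200, eps=1.0, rng=rng)
print(w_priv)
```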



SLIDE 23

Finding Approximate Local Minimum Privately

Finding the global minimum of a non-convex function is challenging! Recent research on deep learning and other non-convex problems shows that local minima, rather than arbitrary critical points, are sufficient. But finding local minima is still NP-hard. Fortunately, many non-convex functions are strict saddle, so it suffices to find a second-order stationary point (i.e., an approximate local minimum).

Definition

w is an α-second-order stationary point (α-SOSP) if ‖∇F(w)‖_2 ≤ α and λ_min(∇²F(w)) ≥ −√(ρα). (1)

Can we find an approximate local minimum that escapes saddle points while keeping the algorithm (ε, δ)-differentially private?

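As a sanity check on definition (1), a small numpy sketch that tests whether a point is an α-SOSP of a twice-differentiable F, assuming closed-form gradient and Hessian (ρ is the Hessian-Lipschitz constant); the toy objective is hypothetical.

```python
import numpy as np

def is_alpha_sosp(grad, hess, w, alpha, rho):
    """Check definition (1): ||grad F(w)||_2 <= alpha and
    lambda_min(hess F(w)) >= -sqrt(rho * alpha)."""
    g_ok = np.linalg.norm(grad(w)) <= alpha
    lam_min = np.linalg.eigvalsh(hess(w)).min()
    return g_ok and lam_min >= -np.sqrt(rho * alpha)

# Toy example: F(w) = (w_1^2 - 1)^2 + w_2^2, with local minima at (±1, 0).
grad = lambda w: np.array([4 * w[0] * (w[0] ** 2 - 1), 2 * w[1]])
hess = lambda w: np.diag([12 * w[0] ** 2 - 4, 2.0])
print(is_alpha_sosp(grad, hess, np.array([1.0, 0.0]), alpha=0.1, rho=12.0))  # True
print(is_alpha_sosp(grad, hess, np.array([0.0, 0.0]), alpha=0.1, rho=12.0))  # saddle: False
```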


SLIDE 26

Result 3

On one hand, (Ge et al., 2015) proposed an algorithm, noisy Stochastic Gradient Descent, to find approximate local minima. On the other hand, in the DP community a popular method for ERM is DP-SGD, which adds Gaussian noise in each iteration. Using DP-GD, we can show:

Theorem 4

If the data size n is large enough that n ≥ Ω̃(√(log(1/δ)) · d log(1/ζ) / (ε α²)), (2) then with probability 1 − ζ, one of the outputs is an α-SOSP of the empirical risk L̂(·; D).

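A minimal sketch of DP-GD in the spirit described above: full-batch gradients, ℓ_2 clipping to bound sensitivity, and per-iteration Gaussian noise. The clipping bound, noise scale, and iteration count are illustrative assumptions; the paper's analysis calibrates them so the returned iterates contain an α-SOSP while preserving (ε, δ)-DP.

```python
import numpy as np

def dp_gd(grad_L, w0, T=200, eta=0.1, clip=1.0, sigma=0.5, rng=None):
    """DP gradient descent: clip the empirical-risk gradient to bound its
    sensitivity, then perturb it with Gaussian noise each iteration.
    Returns all iterates; the theorem guarantees one is an alpha-SOSP."""
    rng = rng or np.random.default_rng(0)
    w, outputs = w0.copy(), []
    for _ in range(T):
        g = grad_L(w)
        g = g / max(1.0, np.linalg.norm(g) / clip)     # l2 clipping
        w = w - eta * (g + sigma * rng.normal(size=w.shape))
        outputs.append(w.copy())
    return outputs

# Same toy strict-saddle objective as before; the noise escapes the saddle at 0.
grad = lambda w: np.array([4 * w[0] * (w[0] ** 2 - 1), 2 * w[1]])
print(dp_gd(grad, np.zeros(2))[-1])   # near a local minimum (±1, 0)
```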

SLIDE 27

Thank you!
