CSC2412: Private Gradient Descent & Empirical Risk Minimization


SLIDE 1

CSC2412: Private Gradient Descent & Empirical Risk Minimization

Sasho Nikolov


SLIDE 2

Empirical Risk Minimization

SLIDE 3

Learning: Reminder

  • Known data universe $X$ and an unknown probability distribution $D$ on $X$
  • Known concept class $C$ and an unknown concept $c \in C$
  • We get a dataset $X = \{(x_1, c(x_1)), \dots, (x_n, c(x_n))\}$, where each $x_i$ is an independent sample from $D$.

Goal: Learn $c$ from $X$.


SLIDE 4

Loss

Binary loss:
$$\ell(c', (x, y)) = \begin{cases} 1 & c'(x) \neq y \\ 0 & c'(x) = y \end{cases}$$
$$L_{D,c}(c') = \mathbb{E}_{x \sim D}\big[\ell(c', (x, c(x)))\big] = \mathbb{P}_{x \sim D}\big(c'(x) \neq c(x)\big)$$
We want an algorithm $M$ that outputs some $c' \in C$ and satisfies $\mathbb{P}\big(L_{D,c}(M(X)) \le \alpha\big) \ge 1 - \beta$.

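As a minimal sketch, here is what these definitions look like in Python; the helper names (`zero_one_loss`, `empirical_risk`) and the toy threshold concept are illustrative assumptions, not from the slides.

```python
import numpy as np

def zero_one_loss(c_prime, x, y):
    """Binary loss: 1 if the hypothesis mislabels x, 0 if it agrees."""
    return 1 if c_prime(x) != y else 0

def empirical_risk(c_prime, data):
    """Average 0-1 loss of c_prime over a labeled dataset X."""
    return np.mean([zero_one_loss(c_prime, x, y) for x, y in data])

# Toy example: threshold concepts on [0, 1].
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=100)
c = lambda x: 1 if x > 0.5 else -1        # the unknown concept c
data = [(x, c(x)) for x in xs]            # X = {(x_i, c(x_i))}
c_prime = lambda x: 1 if x > 0.4 else -1  # a candidate hypothesis c'
print(empirical_risk(c_prime, data))      # an estimate of L_{D,c}(c')
```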

SLIDE 5

Agnostic learning

Maybe no concept gives 100% correct labels. Generally, we have a distribution $D$ on $X \times \{-1, +1\}$.
$$L_D(c) = \mathbb{E}_{(x,y) \sim D}\big[\ell(c, (x, y))\big] = \mathbb{P}_{(x,y) \sim D}\big(c(x) \neq y\big)$$
$D$ is unknown, but we are given i.i.d. samples $X = \{(x_1, y_1), \dots, (x_n, y_n)\}$. We want an algorithm $M$ that outputs some $c' \in C$ and satisfies
$$\mathbb{P}\Big(L_D(M(X)) \le \min_{c \in C} L_D(c) + \alpha\Big) \ge 1 - \beta.$$

Annotations (handwritten, reconstructed): this is the agnostic setting, as opposed to the realizable one; we get $n$ labeled examples $(x_i, y_i)$ drawn i.i.d. from the distribution $D$; $\min_{c \in C} L_D(c)$ is the best possible loss achievable by $C$.

SLIDE 6

Empirical risk minimization, again

Issue: We want to find $\arg\min_{c \in C} L_D(c)$, but we do not know $D$.

Solution: Instead we solve $\arg\min_{c \in C} L_X(c)$, where
$$L_X(c) = \frac{1}{n} \sum_{i=1}^{n} \ell(c, (x_i, y_i))$$
is the empirical error.

Theorem (Uniform convergence). Suppose that $n \ge \frac{\ln(|C|/\beta)}{2\alpha^2}$. Then, with probability $\ge 1 - \beta$,
$$\max_{c \in C} \big|L_X(c) - L_D(c)\big| \le \alpha.$$

Annotation (handwritten, reconstructed): stated here for the binary loss.
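The uniform convergence bound gives concrete sample sizes. A quick, hedged calculation in Python (the function name and the example numbers are illustrative):

```python
import math

def samples_needed(concept_class_size, alpha, beta):
    """Sample size from the bound n >= ln(|C| / beta) / (2 * alpha^2)."""
    return math.ceil(math.log(concept_class_size / beta) / (2 * alpha**2))

# e.g. |C| = 2^20 concepts, accuracy alpha = 0.1, failure probability beta = 0.05
print(samples_needed(2**20, alpha=0.1, beta=0.05))  # 843
```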

SLIDE 7

Example: Linear Separators

  • $X = [0, 1]^d$
  • $C$ is all functions of the type $c_\theta(x) = \mathrm{sign}(\langle x, \theta \rangle + \theta_0)$ for $\theta \in \mathbb{R}^d$, $\theta_0 \in \mathbb{R}$.

For convenience, replace $x$ by $(x, 1) \in [0, 1]^{d+1}$ and $\theta, \theta_0$ by $(\theta, \theta_0) \in \mathbb{R}^{d+1}$, so that $c_\theta(x) = \mathrm{sign}(\langle x, \theta \rangle)$.

Annotations (hand-drawn figure, reconstructed): $X$ is the unit cube in $\mathbb{R}^d$; $c_\theta(x) = +1$ if $x$ lies above the hyperplane and $-1$ if below; the figure contrasts the realizable and agnostic cases. From now on we will ignore $\theta_0$. Finding the best separator is generally computationally hard.
SLIDE 8

Logistic Regression

Sign → sigmoid: given $\theta$ and $x$, predict
$$\begin{cases} +1 & \text{w/ prob. } \frac{1}{1 + e^{-\langle x, \theta \rangle}} \\ -1 & \text{w/ prob. } \frac{1}{1 + e^{\langle x, \theta \rangle}} \end{cases}$$
Logistic loss:
$$\ell(\theta, (x, y)) = \log\left(\frac{1}{\mathbb{P}(\text{predict } y \text{ from } \langle x, \theta \rangle)}\right) = \log\big(1 + e^{-y \cdot \langle x, \theta \rangle}\big).$$
Logistic regression: Given $X = \{(x_1, y_1), \dots, (x_n, y_n)\}$, solve
$$\arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + e^{-y_i \cdot \langle x_i, \theta \rangle}\big).$$

Annotations (handwritten, reconstructed): the empirical logistic loss can be minimized efficiently; the connection between the empirical and population loss is covered in another lecture.
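A minimal sketch of the logistic loss and its empirical average in Python; the function names are illustrative, and `np.log1p` is used for numerical stability.

```python
import numpy as np

def logistic_loss(theta, x, y):
    """ell(theta, (x, y)) = log(1 + exp(-y * <x, theta>))."""
    return np.log1p(np.exp(-y * np.dot(x, theta)))

def empirical_logistic_loss(theta, X, Y):
    """L_X(theta) = (1/n) * sum_i log(1 + exp(-y_i * <x_i, theta>))."""
    margins = Y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins)))

def prob_predict_plus_one(theta, x):
    """The sigmoid: P(predict +1) = 1 / (1 + exp(-<x, theta>))."""
    return 1.0 / (1.0 + np.exp(-np.dot(x, theta)))
```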

SLIDE 9

(Private) Gradient Descent

SLIDE 10

Convex loss

The function
$$L_X(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + e^{-y_i \cdot \langle x_i, \theta \rangle}\big)$$
is convex in $\theta$.

Convex functions can be minimized efficiently.

  • for non-convex functions, it's complicated

Annotations (handwritten, reconstructed): convexity means $L_X(t\theta + (1-t)\theta') \le t\, L_X(\theta) + (1-t)\, L_X(\theta')$ for all $t \in [0, 1]$ and all $\theta, \theta'$; equivalently, for differentiable $L_X$, $L_X(\theta') \ge L_X(\theta) + \langle \nabla L_X(\theta), \theta' - \theta \rangle$.
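The convexity inequality is easy to spot-check numerically. A small sketch (the dataset and dimensions are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 3))
Y = rng.choice([-1, 1], size=50)

def L(theta):
    """Empirical logistic loss (1/n) * sum_i log(1 + exp(-y_i <x_i, theta>))."""
    return np.mean(np.log1p(np.exp(-Y * (X @ theta))))

# Check L(t*a + (1-t)*b) <= t*L(a) + (1-t)*L(b) on random points.
for _ in range(1000):
    a, b = rng.normal(size=3), rng.normal(size=3)
    t = rng.uniform()
    assert L(t * a + (1 - t) * b) <= t * L(a) + (1 - t) * L(b) + 1e-9
print("convexity inequality held on all spot-checks")
```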

SLIDE 11

Gradient descent

$$\arg\min_{\theta \in B_2^{d+1}(R)} \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + e^{-y_i \cdot \langle x_i, \theta \rangle}\big)$$

$\theta_0 = 0$
for $t = 1, \dots, T-1$ do
    $\tilde{\theta}_t = \theta_{t-1} - \eta \nabla L_X(\theta_{t-1})$
    $\theta_t = \tilde{\theta}_t / \max\{1, \|\tilde{\theta}_t\|_2 / R\}$
end for
Output $\frac{1}{T} \sum_{t=0}^{T-1} \theta_t$

Annotations (handwritten, reconstructed): $B_2^{d+1}(R) = \{\theta \in \mathbb{R}^{d+1} : \|\theta\|_2 \le R\}$; $-\nabla L_X(\theta)$ is the direction in which $L_X$ decreases the fastest; the second step projects $\tilde{\theta}_t$ back onto the ball; $\eta$ is the step-size parameter. (The private version below will add noise $Z_t$ to each gradient.)
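A minimal, non-private sketch of this projected gradient descent in Python; the function names are illustrative, and it assumes the rows of `X` already include the appended constant coordinate.

```python
import numpy as np

def grad_L(theta, X, Y):
    """Gradient of (1/n) * sum_i log(1 + exp(-y_i <x_i, theta>))."""
    margins = Y * (X @ theta)
    coeffs = -Y / (1.0 + np.exp(margins))  # -y / (1 + exp(y <x, theta>)), the scalar on x
    return (coeffs[:, None] * X).mean(axis=0)

def projected_gd(X, Y, R, T, eta):
    """Gradient descent over the ball B_2(R); returns the average iterate."""
    theta = np.zeros(X.shape[1])
    iterates = [theta]
    for _ in range(1, T):
        theta = theta - eta * grad_L(theta, X, Y)
        theta = theta / max(1.0, np.linalg.norm(theta) / R)  # project onto B_2(R)
        iterates.append(theta)
    return np.mean(iterates, axis=0)
```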

SLIDE 12

Advanced composition - warmup

Publish $k$ functions $f_1, \dots, f_k : X^n \to \mathbb{R}^d$ with $(\varepsilon, \delta)$-DP, where $\forall i:\ \Delta_2 f_i \le C$.

Annotations (handwritten, reconstructed): we want to release $f_1(X), \dots, f_k(X)$, and the sequence can be adaptive: $f_i$ may depend on $f_1(X) + Z_1, \dots, f_{i-1}(X) + Z_{i-1}$.

1) Apply the Gaussian mechanism $k$ times, releasing $f_i(X) + Z_i$ with each release $(\varepsilon/k, \delta/k)$-DP, and use basic composition: $(f_1(X) + Z_1, \dots, f_k(X) + Z_k)$ is $(\varepsilon, \delta)$-DP, with per-query noise on the order of $Ck\sqrt{\log(k/\delta)}/\varepsilon$.

2) Alternatively, stack the queries into a single $g(X) = (f_1(X), \dots, f_k(X))$ and use the Gaussian mechanism for $g$: since $\|g(X) - g(X')\|_2^2 = \sum_{i} \|f_i(X) - f_i(X')\|_2^2 \le kC^2$, we get $\Delta_2 g \le C\sqrt{k}$, so the noise scales only like $C\sqrt{k}$; but this option does not handle adaptive queries.

SLIDE 13

Advanced composition (for Gaussian noise)

Suppose we release $Y_1 = f_1(X) + Z_1, \dots, Y_k = f_k(X) + Z_k$, where each $f_i : X^n \to \mathbb{R}^d$ may depend also on $Y_1, \dots, Y_{i-1}$, and
$$Z_i \sim N\Big(0,\ \frac{(\Delta_2 f)^2}{\rho} \cdot I\Big).$$
Then the output $(Y_1, \dots, Y_k)$ satisfies $(\varepsilon, \delta)$-DP for
$$\varepsilon = k\rho + \sqrt{2k\rho \ln(1/\delta)}.$$

Annotations (handwritten, reconstructed): to achieve $(\varepsilon, \delta)$-DP overall, the noise per query scales like $\Delta_2 f \cdot \sqrt{k}$, the same as in the warmup, but now the queries are allowed to be adaptive.
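In practice one fixes a target $(\varepsilon, \delta)$ and solves for the per-query parameter $\rho$. A small sketch; the binary search is an assumption about how one might invert the formula, not something from the slides.

```python
import math

def eps_from_rho(k, rho, delta):
    """Composed privacy: eps = k*rho + sqrt(2*k*rho*ln(1/delta))."""
    return k * rho + math.sqrt(2 * k * rho * math.log(1 / delta))

def rho_for_eps(k, eps, delta):
    """Largest rho whose composed eps stays below the target (binary search)."""
    lo, hi = 0.0, eps / k  # at rho = eps/k the composed eps already exceeds eps
    for _ in range(100):
        mid = (lo + hi) / 2
        if eps_from_rho(k, mid, delta) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

rho = rho_for_eps(k=1000, eps=1.0, delta=1e-6)
print(rho, eps_from_rho(1000, rho, 1e-6))  # composed eps is just under 1.0
```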
SLIDE 14

Sensitivity of gradients

Suppose $X = [-1, +1]^{d+1}$. What is $\Delta_2 \nabla L_X(\theta)$? For a single example,
$$\nabla_\theta \log\big(1 + e^{-y \langle x, \theta \rangle}\big) = -\frac{1}{1 + e^{y \langle x, \theta \rangle}}\, yx.$$

Annotations (handwritten derivation, reconstructed): for any $\theta$, $y \in \{-1, +1\}$, and $x \in [-1, +1]^{d+1}$,
$$\big\|\nabla \log\big(1 + e^{-y \langle x, \theta \rangle}\big)\big\|_2 = \frac{1}{1 + e^{y \langle x, \theta \rangle}}\, \|yx\|_2 \le \|x\|_2 \le \sqrt{d+1}.$$
Neighboring datasets $X, X'$ differ in a single example, so the averaged gradients differ in a single term:
$$\Delta_2 \nabla L_X(\theta) \le \frac{2\sqrt{d+1}}{n}.$$

SLIDE 15

Private gradient descent

$\theta_0 = 0$
for $t = 1, \dots, T-1$ do
    $\tilde{\theta}_t = \theta_{t-1} - \eta \big(\nabla L_X(\theta_{t-1}) + Z_{t-1}\big)$
    $\theta_t = \tilde{\theta}_t / \max\{1, \|\tilde{\theta}_t\|_2 / R\}$
end for
Output $\frac{1}{T} \sum_{t=0}^{T-1} \theta_t$

Annotations (handwritten, reconstructed): think of $\nabla L_X(\theta_0), \dots, \nabla L_X(\theta_{T-1})$ as the adaptively chosen functions $f_1, \dots, f_T$ from the advanced composition theorem; take $Z_t \sim N\big(0, \frac{(\Delta_2 \nabla L_X)^2}{\rho} I\big)$ with $\rho$ chosen so that $T\rho + \sqrt{2T\rho \ln(1/\delta)} \le \varepsilon$. The whole algorithm is then $(\varepsilon, \delta)$-DP by advanced composition plus post-processing.
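Putting slides 13-15 together, here is a hedged Python sketch of private gradient descent for logistic regression; the function name and the choice to pass `rho` directly (computed from the target $(\varepsilon, \delta)$ as above) are assumptions for illustration.

```python
import numpy as np

def private_gd(X, Y, R, T, eta, rho, rng):
    """Noisy projected gradient descent on the empirical logistic loss.

    Each step perturbs the gradient with N(0, (Delta^2 / rho) I) noise, where
    Delta = 2*sqrt(d + 1)/n is the L2 sensitivity of the averaged gradient
    (slide 14) and rho is the per-query parameter from advanced composition.
    """
    n, dim = X.shape                      # dim = d + 1 (constant coordinate appended)
    sigma = (2 * np.sqrt(dim) / n) / np.sqrt(rho)
    theta = np.zeros(dim)
    iterates = [theta]
    for _ in range(1, T):
        margins = Y * (X @ theta)
        grad = ((-Y / (1.0 + np.exp(margins)))[:, None] * X).mean(axis=0)
        theta = theta - eta * (grad + rng.normal(scale=sigma, size=dim))
        theta = theta / max(1.0, np.linalg.norm(theta) / R)  # project onto B_2(R)
        iterates.append(theta)
    return np.mean(iterates, axis=0)      # only the averaged iterate is released
```

By post-processing, any function of the noisy gradients (here, the averaged iterate) keeps the same privacy guarantee.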
SLIDE 16

Accuracy analysis

Theorem. Suppose $\mathbb{E}\|\nabla L_X(\theta_t) + Z_t\|_2^2 \le B^2$ for all $t$. For $\eta = \frac{R}{B\sqrt{T}}$ we have
$$\mathbb{E}\left[L_X\Big(\frac{1}{T} \sum_{t=0}^{T-1} \theta_t\Big)\right] \le \min_{\theta \in B_2^{d+1}(R)} L_X(\theta) + \frac{RB}{\sqrt{T}}.$$

Annotations (handwritten): proof in the notes (optional). As $T \to \infty$, the bound approaches the optimal value.
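For a feel of the numbers, the excess-loss term is just $RB/\sqrt{T}$; a one-line sketch (the inputs are arbitrary):

```python
import math

def gd_excess_loss(R, B, T):
    """Excess empirical loss bound R*B/sqrt(T), with step size eta = R/(B*sqrt(T))."""
    return R * B / math.sqrt(T)

print(gd_excess_loss(R=1.0, B=5.0, T=10_000))  # 0.05
```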

SLIDE 17

Plugging in

$$\mathbb{E}\|\nabla L_X(\theta_t) + Z_t\|_2^2 = \|\nabla L_X(\theta_t)\|_2^2 + \mathbb{E}\|Z_t\|_2^2 \le (d+1) + \frac{4(d+1)^2}{n^2 \rho}$$

Annotations (handwritten, reconstructed): for any $\theta$,
$$\|\nabla L_X(\theta)\|_2 \le \frac{1}{n} \sum_{i=1}^{n} \big\|\nabla \log\big(1 + e^{-y_i \langle x_i, \theta \rangle}\big)\big\|_2 \le \sqrt{d+1},$$
and $\mathbb{E}\|Z_t\|_2^2 = (d+1) \cdot \frac{(\Delta_2 \nabla L_X)^2}{\rho} = \frac{4(d+1)^2}{n^2 \rho}$. With $\rho \approx \varepsilon^2 / (T \log(1/\delta))$ from advanced composition, plugging $B$ into the accuracy theorem gives an excess error on the order of
$$\frac{R\sqrt{d+1}}{\sqrt{T}} + \frac{R\,(d+1)\sqrt{\log(1/\delta)}}{\varepsilon n}.$$
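A concrete end-to-end check of these bounds in Python; every number here (dataset size, dimension, budget, iteration count) is an illustrative assumption.

```python
import math

# Assumed setting: n examples in dimension d+1, privacy budget (eps, delta),
# T iterations of private gradient descent over the ball of radius R.
n, dim, eps, delta, T, R = 10_000, 21, 1.0, 1e-6, 2_000, 1.0

# Per-query rho such that advanced composition stays within eps.
rho = eps**2 / (4 * T * math.log(1 / delta))
assert T * rho + math.sqrt(2 * T * rho * math.log(1 / delta)) <= eps

sens = 2 * math.sqrt(dim) / n              # Delta_2 of the averaged gradient (slide 14)
B = math.sqrt(dim + dim * sens**2 / rho)   # E||grad + Z||^2 <= (d+1) + (d+1)*Delta^2/rho
print(R * B / math.sqrt(T))                # excess empirical loss bound RB/sqrt(T)
```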