CSC2412: Private Gradient Descent & Empirical Risk Minimization
Sasho Nikolov
Empirical Risk Minimization
Learning: Reminder
Known: a data universe $\mathcal{X}$ and a concept class $C$. Unknown: a probability distribution $D$ on $\mathcal{X}$ and a concept $c \in C$. We are given $X = \{x_1, \dots, x_n\}$, an independent sample from $D$, together with the labels $c(x_1), \dots, c(x_n)$. Goal: learn $c$ from $X$.
Loss
The binary loss of a hypothesis $c'$ on a labeled example $(x, y)$:
$$\ell(c', (x, y)) = \begin{cases} 1 & c'(x) \neq y \\ 0 & c'(x) = y \end{cases}$$
The population loss of $c'$ with respect to $D$ and the true concept $c$:
$$L_{D,c}(c') = \mathbb{E}_{x \sim D}\left[\ell(c', (x, c(x)))\right] = \mathbb{P}_{x \sim D}(c'(x) \neq c(x))$$
We want an algorithm $M$ that outputs some $c' \in C$ and satisfies $\mathbb{P}(L_{D,c}(M(X)) \leq \alpha) \geq 1 - \beta$.
Agnostic learning
Maybe no concept gives 100% correct labels. Generally, we have a distribution $D$ on $\mathcal{X} \times \{-1, +1\}$ over labeled examples $(x_i, y_i)$:
$$L_D(c) = \mathbb{E}_{(x,y) \sim D}\left[\ell(c, (x, y))\right] = \mathbb{P}_{(x,y) \sim D}(c(x) \neq y)$$
$D$ is unknown, but we are given i.i.d. samples $X = \{(x_1, y_1), \dots, (x_n, y_n)\}$. We want an algorithm $M$ that outputs some $c' \in C$ and satisfies
$$\mathbb{P}\left(L_D(M(X)) \leq \min_{c \in C} L_D(c) + \alpha\right) \geq 1 - \beta.$$
Here $\min_{c \in C} L_D(c)$ is the best possible loss achievable by $C$; this is the agnostic setting, as opposed to the realizable one.
Empirical risk minimization, again
Issue: we want to find $\arg\min_{c \in C} L_D(c)$, but we do not know $D$.
Solution: instead we solve $\arg\min_{c \in C} L_X(c)$, where
$$L_X(c) = \frac{1}{n} \sum_{i=1}^{n} \ell(c, (x_i, y_i))$$
is the empirical error.

Theorem (Uniform convergence). Suppose that $n \geq \frac{\ln(|C|/\beta)}{2\alpha^2}$. Then, with probability $\geq 1 - \beta$,
$$\max_{c \in C} |L_X(c) - L_D(c)| \leq \alpha.$$
(For binary loss: a Hoeffding bound for each fixed $c$, then a union bound over $C$.)
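To make this concrete, here is a minimal sketch in Python (my own illustration, not from the slides; the threshold class and all names are hypothetical) of the empirical error $L_X$ and ERM over a small finite class:

```python
import numpy as np

def empirical_error(c, examples):
    """Empirical 0-1 loss L_X(c) of a hypothesis c on labeled examples."""
    return np.mean([c(x) != y for x, y in examples])

# A small finite concept class: 1-d thresholds c_t(x) = +1 iff x >= t.
thresholds = np.linspace(0.0, 1.0, 21)
concepts = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=200)
examples = [(x, 1 if x >= 0.4 else -1) for x in xs]  # labels from a true threshold

# ERM: output the concept minimizing the empirical error L_X.
erm = min(concepts, key=lambda c: empirical_error(c, examples))
print(empirical_error(erm, examples))  # close to 0 in this realizable example
```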
Example: Linear Separators
For convenience, replace $x$ by $(x, 1) \in [0, 1]^{d+1}$ (the unit cube in $\mathbb{R}^d$ plus a constant coordinate) and $\theta, \theta_0$ by $(\theta, \theta_0) \in \mathbb{R}^{d+1}$. Then
$$c_\theta(x) = \mathrm{sign}(\langle x, \theta \rangle),$$
i.e. $c_\theta$ labels $x$ with $+1$ if it lies "above" the hyperplane $\langle x, \theta \rangle = 0$ and with $-1$ if "below".

[Figure: labeled points and a separating hyperplane, in the realizable setting (perfectly separable) and in the agnostic setting (some points misclassified).]

Finding the best separator in the agnostic setting is generally computationally hard.
Logistic Regression
Swap the sign for a sigmoid: given $\theta$ and $x$, predict
$$+1 \text{ w/ prob. } \frac{1}{1 + e^{-\langle x, \theta \rangle}}, \qquad -1 \text{ w/ prob. } \frac{1}{1 + e^{\langle x, \theta \rangle}}.$$
Logistic loss:
$$\ell(\theta, (x, y)) = \log\left(\frac{1}{\mathbb{P}(\text{predict } y \text{ from } \langle x, \theta \rangle)}\right) = \log\left(1 + e^{-y \langle x, \theta \rangle}\right).$$
Logistic regression: given $X = \{(x_1, y_1), \dots, (x_n, y_n)\}$, solve
$$\arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \langle x_i, \theta \rangle}\right).$$
This empirical loss minimization can be solved efficiently; the connection to the population loss is covered in another class.
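A minimal sketch of these formulas in Python (my own, assuming NumPy; the function names are hypothetical):

```python
import numpy as np

def predict_prob_plus(theta, x):
    """Probability of predicting +1: the sigmoid 1 / (1 + exp(-<x, theta>))."""
    return 1.0 / (1.0 + np.exp(-np.dot(x, theta)))

def logistic_loss(theta, x, y):
    """Logistic loss log(1 + exp(-y <x, theta>)) for y in {-1, +1}."""
    return np.log1p(np.exp(-y * np.dot(x, theta)))

def empirical_logistic_loss(theta, X, Y):
    """L_X(theta) = (1/n) * sum_i log(1 + exp(-y_i <x_i, theta>))."""
    return np.mean([logistic_loss(theta, x, y) for x, y in zip(X, Y)])
```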
(Private) Gradient Descent
Convex loss
The function $L_X(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \langle x_i, \theta \rangle}\right)$ is convex in $\theta$.
Convex functions can be minimized efficiently.

Recall: $L_X$ is convex iff $L_X(\lambda \theta + (1-\lambda)\theta') \leq \lambda L_X(\theta) + (1-\lambda) L_X(\theta')$ for all $\theta, \theta'$ and $\lambda \in [0, 1]$; equivalently, when $L_X$ is differentiable, $L_X(\theta') \geq L_X(\theta) + \langle \nabla L_X(\theta), \theta' - \theta \rangle$.
Gradient descent
We want to solve
$$\arg\min_{\theta \in B_2^{d+1}(R)} \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \langle x_i, \theta \rangle}\right),$$
where $B_2^{d+1}(R) = \{\theta \in \mathbb{R}^{d+1} : \|\theta\|_2 \leq R\}$.

$\theta_0 = 0$
for $t = 1 \dots T-1$ do
    $\tilde{\theta}_t = \theta_{t-1} - \eta \nabla L_X(\theta_{t-1})$
    $\theta_t = \tilde{\theta}_t / \max\{1, \|\tilde{\theta}_t\|_2 / R\}$
end for
return $\frac{1}{T} \sum_{t=0}^{T-1} \theta_t$

The step moves against the gradient, the direction in which $L_X$ decreases the fastest, and the normalization projects $\tilde{\theta}_t$ back onto the ball $B_2^{d+1}(R)$.
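A sketch of the algorithm in Python (my own illustration; `grad_LX` implements the logistic-loss gradient from the sensitivity discussion later in the deck):

```python
import numpy as np

def grad_LX(theta, X, Y):
    """Gradient of the empirical logistic loss (X: n x (d+1) array, Y: +/-1 labels)."""
    # Per-example gradient: -y * x / (1 + exp(y <x, theta>))
    gs = [-y * x / (1.0 + np.exp(y * np.dot(x, theta))) for x, y in zip(X, Y)]
    return np.mean(gs, axis=0)

def projected_gd(X, Y, R, eta, T):
    """Projected gradient descent over the ball of radius R; returns the average iterate."""
    theta = np.zeros(X.shape[1])
    iterates = [theta]
    for _ in range(1, T):
        theta = theta - eta * grad_LX(theta, X, Y)
        theta = theta / max(1.0, np.linalg.norm(theta) / R)  # project onto the ball
        iterates.append(theta)
    return np.mean(iterates, axis=0)
```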
Advanced composition - warmup
We want to publish $k$ functions $f_1, \dots, f_k : \mathcal{X}^n \to \mathbb{R}^d$ with $(\varepsilon, \delta)$-DP, where $\Delta_2 f_i \leq C$ for all $i$. The queries can be adaptive: $f_i$ may depend on the earlier noisy answers $f_1(X) + Z_1, \dots, f_{i-1}(X) + Z_{i-1}$.

Option 1: apply the Gaussian mechanism $k$ times and use basic composition. Each release must be $(\varepsilon/k, \delta/k)$-DP on its own, which requires noise on the order of $\frac{C k \sqrt{\ln(k/\delta)}}{\varepsilon}$ per query.

Option 2: release all answers jointly, viewing $g(X) = (f_1(X), \dots, f_k(X))$ as a single function $g : \mathcal{X}^n \to \mathbb{R}^{kd}$. Then
$$\|g(X) - g(X')\|_2^2 = \sum_{i=1}^{k} \|f_i(X) - f_i(X')\|_2^2 \leq k C^2,$$
so $\Delta_2 g \leq \sqrt{k}\, C$, and the Gaussian mechanism for $g$ needs noise only on the order of $\frac{C \sqrt{k \ln(1/\delta)}}{\varepsilon}$ per query: a factor $\approx \sqrt{k}$ less. But this joint view requires the queries to be fixed in advance; the advanced composition theorem below recovers the same $\sqrt{k}$ savings even for adaptive queries.
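A quick numerical comparison of the two options (my own sketch, using the common Gaussian-mechanism calibration $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)}/\varepsilon$; the parameter values are arbitrary):

```python
import math

def gaussian_sigma(sensitivity, eps, delta):
    """Standard Gaussian mechanism noise scale for one (eps, delta)-DP release."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

k, C, eps, delta = 100, 1.0, 1.0, 1e-6

# Option 1: k separate releases, basic composition -> each is (eps/k, delta/k)-DP.
sigma_separate = gaussian_sigma(C, eps / k, delta / k)

# Option 2: one joint release of g = (f_1, ..., f_k) with sensitivity sqrt(k) * C.
sigma_joint = gaussian_sigma(math.sqrt(k) * C, eps, delta)

print(sigma_separate, sigma_joint)  # the joint release saves a factor ~ sqrt(k)
```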
Advanced composition (for Gaussian noise)
Suppose we release $Y_1 = f_1(X) + Z_1, \dots, Y_k = f_k(X) + Z_k$, where each $f_i : \mathcal{X}^n \to \mathbb{R}^d$ may depend also on $Y_1, \dots, Y_{i-1}$, and
$$Z_i \sim N\left(0, \frac{(\Delta_2 f)^2}{\rho} \cdot I\right).$$
Then the output $(Y_1, \dots, Y_k)$ satisfies $(\varepsilon, \delta)$-DP for
$$\varepsilon = k\rho + \sqrt{2 k \rho \ln(1/\delta)}.$$
In other words, even for adaptively chosen queries, achieving $(\varepsilon, \delta)$-DP needs noise per query scaling only like $\frac{\Delta_2 f \sqrt{k \ln(1/\delta)}}{\varepsilon}$, as if we had released a single query of sensitivity $\sqrt{k}\, \Delta_2 f$.
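A small helper (my own, not from the slides) that inverts the bound numerically: given a target $(\varepsilon, \delta)$ and $k$ adaptive queries, find the largest $\rho$ the theorem allows:

```python
import math

def adv_comp_epsilon(k, rho, delta):
    """Epsilon from the theorem: eps = k*rho + sqrt(2 k rho ln(1/delta))."""
    return k * rho + math.sqrt(2.0 * k * rho * math.log(1.0 / delta))

def rho_for(k, eps, delta, iters=100):
    """Bisection for the largest rho with adv_comp_epsilon(k, rho, delta) <= eps."""
    lo, hi = 0.0, eps / k  # at rho = eps/k the bound already exceeds eps
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if adv_comp_epsilon(k, mid, delta) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

print(rho_for(k=100, eps=1.0, delta=1e-6))
```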
Sensitivity of gradients
Suppose $\mathcal{X} = [-1, +1]^{d+1}$. What is $\Delta_2 \nabla L_X(\theta)$?

For a single example,
$$\nabla_\theta \log\left(1 + e^{-y \langle x, \theta \rangle}\right) = -\frac{1}{1 + e^{y \langle x, \theta \rangle}}\, y x,$$
so, since $y \in \{-1, +1\}$ and $x \in [-1, +1]^{d+1}$,
$$\left\|\nabla_\theta \log\left(1 + e^{-y \langle x, \theta \rangle}\right)\right\|_2 = \frac{e^{-y \langle x, \theta \rangle}}{1 + e^{-y \langle x, \theta \rangle}}\, \|x\|_2 \leq \|x\|_2 \leq \sqrt{d+1}.$$
Changing one example changes one term of $\nabla L_X(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(\theta, (x_i, y_i))$, so
$$\Delta_2 \nabla L_X(\theta) \leq \frac{2\sqrt{d+1}}{n}.$$
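A quick empirical sanity check of the per-example bound (my own sketch, assuming NumPy):

```python
import numpy as np

# Check that the per-example gradient norm stays within sqrt(d+1).
rng = np.random.default_rng(1)
d = 9
for _ in range(1000):
    theta = rng.normal(size=d + 1)
    x = rng.uniform(-1.0, 1.0, size=d + 1)   # x in [-1, +1]^(d+1)
    y = rng.choice([-1, 1])
    grad = -y * x / (1.0 + np.exp(y * np.dot(x, theta)))
    assert np.linalg.norm(grad) <= np.sqrt(d + 1)
print("all per-example gradients within sqrt(d+1)")
```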
Private gradient descent
$\theta_0 = 0$
for $t = 1 \dots T-1$ do
    $\tilde{\theta}_t = \theta_{t-1} - \eta \left(\nabla L_X(\theta_{t-1}) + Z_{t-1}\right)$
    $\theta_t = \tilde{\theta}_t / \max\{1, \|\tilde{\theta}_t\|_2 / R\}$
end for
return $\frac{1}{T} \sum_{t=0}^{T-1} \theta_t$

Think of $\nabla L_X(\theta_0), \nabla L_X(\theta_1), \dots$ as the adaptively chosen functions $f_1, f_2, \dots$ from the advanced composition theorem: each gradient depends on the earlier noisy answers through $\theta_{t-1}$. Taking $Z_t \sim N(0, \sigma^2 I)$ with $\sigma^2 = (\Delta_2 \nabla L_X)^2/\rho$ and $\rho$ chosen so that $T\rho + \sqrt{2 T \rho \ln(1/\delta)} \leq \varepsilon$, the entire algorithm is $(\varepsilon, \delta)$-DP by advanced composition and post-processing (the iterates are a function of the noisy gradients).
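A sketch of private gradient descent in Python (my own illustration; $\sigma$ must be calibrated via the advanced composition theorem as described above):

```python
import numpy as np

def private_projected_gd(X, Y, R, eta, T, sigma, seed=0):
    """Noisy projected GD on the empirical logistic loss. The caller must
    calibrate sigma via advanced composition: sigma^2 = (sens of grad L_X)^2 / rho."""
    rng = np.random.default_rng(seed)
    _, d = X.shape
    theta = np.zeros(d)
    iterates = [theta]
    for _ in range(1, T):
        # Gradient of the empirical logistic loss at the current iterate.
        gs = [-y * x / (1.0 + np.exp(y * np.dot(x, theta))) for x, y in zip(X, Y)]
        grad = np.mean(gs, axis=0)
        theta = theta - eta * (grad + rng.normal(scale=sigma, size=d))
        theta = theta / max(1.0, np.linalg.norm(theta) / R)  # project onto B(R)
        iterates.append(theta)
    return np.mean(iterates, axis=0)
```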
Accuracy analysis
Theorem. Suppose $\mathbb{E}\|\nabla L_X(\theta_t) + Z_t\|_2^2 \leq B^2$ for all $t$. For $\eta = \frac{R}{B T^{1/2}}$ we have
$$\mathbb{E}\left[L_X\left(\frac{1}{T} \sum_{t=0}^{T-1} \theta_t\right)\right] \leq \min_{\theta \in B_2^{d+1}(R)} L_X(\theta) + \frac{RB}{T^{1/2}}.$$
(Proof in the notes; optional.) For fixed $B$, the excess error $\frac{RB}{T^{1/2}}$ goes to $0$ as $T \to \infty$; with private noise, however, $B$ itself grows with $T$.
Plugging in
For any $t$, by the triangle inequality and the per-example bound from the sensitivity slide,
$$\|\nabla L_X(\theta_t)\|_2 \leq \frac{1}{n} \sum_{i=1}^{n} \left\|\nabla_\theta \log\left(1 + e^{-y_i \langle x_i, \theta_t \rangle}\right)\right\|_2 \leq \sqrt{d+1}.$$
Since $Z_t$ is mean zero and independent of $\theta_t$, the cross term vanishes, and
$$\mathbb{E}\|\nabla L_X(\theta_t) + Z_t\|_2^2 = \mathbb{E}\|\nabla L_X(\theta_t)\|_2^2 + \mathbb{E}\|Z_t\|_2^2 \leq (d+1) + (d+1)\sigma^2,$$
where $\sigma^2 = \frac{(\Delta_2 \nabla L_X)^2}{\rho} \leq \frac{4(d+1)}{n^2 \rho}$ is the per-coordinate variance of $Z_t \in \mathbb{R}^{d+1}$.
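To finish, a rough back-of-the-envelope continuation (my own, with constants elided; not from the slides), plugging the advanced-composition choice of $\rho$ into the accuracy theorem:

```latex
% With k = T and \rho \approx \varepsilon^2 / (T \ln(1/\delta)) in the advanced
% composition theorem (so that k\rho + \sqrt{2k\rho\ln(1/\delta)} \lesssim \varepsilon):
\sigma^2 \lesssim \frac{T\,(d+1)\ln(1/\delta)}{n^2 \varepsilon^2},
\qquad
B^2 \lesssim (d+1)\left(1 + \frac{T\,(d+1)\ln(1/\delta)}{n^2 \varepsilon^2}\right).
% Plugging into the accuracy theorem:
\frac{RB}{\sqrt{T}}
\lesssim R \sqrt{\frac{d+1}{T} + \frac{(d+1)^2 \ln(1/\delta)}{n^2 \varepsilon^2}}
% and choosing T \approx n^2 \varepsilon^2 / ((d+1)\ln(1/\delta)) to balance the terms:
\approx \frac{R\,(d+1)\sqrt{\ln(1/\delta)}}{n\,\varepsilon}.
```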