
CSC2412: Private Gradient Descent & Empirical Risk Minimization



  1. CSC2412: Private Gradient Descent & Empirical Risk Minimization. Sasho Nikolov

  2. Empirical Risk Minimization

  3. Learning: Reminder
     • Known data universe X and an unknown probability distribution D on X.
     • Known concept class C and an unknown concept c ∈ C.
     • We get a dataset X = {(x₁, c(x₁)), ..., (xₙ, c(xₙ))}, where each xᵢ is an independent sample from D.
     Goal: Learn c from X.

  4. Loss
     Binary loss:
       ℓ(c′, (x, y)) = 1 if c′(x) ≠ y, and 0 if c′(x) = y.
     Population loss:
       L_{D,c}(c′) = E_{x∼D}[ℓ(c′, (x, c(x)))] = P_{x∼D}(c′(x) ≠ c(x)).
     We want an algorithm M that outputs some c′ ∈ C and satisfies
       P(L_{D,c}(M(X)) ≤ α) ≥ 1 − β.
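
To make the definitions concrete, a minimal Python sketch (the helper names and the Monte Carlo estimator are assumptions, not from the slides):

```python
import numpy as np

def binary_loss(c_prime, x, y):
    """Binary (0-1) loss: 1 if the hypothesis mislabels x, else 0."""
    return int(c_prime(x) != y)

def estimate_population_loss(c_prime, c, sample_from_D, m=10_000):
    """Estimate L_{D,c}(c') by sampling x ~ D and comparing c'(x) to c(x)."""
    xs = [sample_from_D() for _ in range(m)]
    return np.mean([binary_loss(c_prime, x, c(x)) for x in xs])
```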

  5. Agnostic learning
     Maybe no concept gives 100% correct labels (the agnostic setting, as opposed to the realizable one). Generally, we have a distribution D on labeled examples X × {−1, +1}:
       L_D(c) = E_{(x,y)∼D}[ℓ(c, (x, y))] = P_{(x,y)∼D}(c(x) ≠ y).
     D is unknown, but we are given i.i.d. samples X = {(x₁, y₁), ..., (xₙ, yₙ)}.
     We want an algorithm M that outputs some c′ ∈ C and satisfies
       P(L_D(M(X)) ≤ min_{c∈C} L_D(c) + α) ≥ 1 − β,
     where min_{c∈C} L_D(c) is the best possible loss achievable by C.

  6. Empirical risk minimization, again
     Issue: We want to find argmin_{c∈C} L_D(c), but we do not know D.
     Solution: Instead we solve argmin_{c∈C} L_X(c), where
       L_X(c) = (1/n) Σ_{i=1}^n ℓ(c, (xᵢ, yᵢ))
     is the empirical error (for the binary loss, the fraction of mislabeled examples).
     Theorem (Uniform convergence). Suppose that n ≥ ln(|C|/β) / (2α²). Then, with probability ≥ 1 − β,
       max_{c∈C} |L_X(c) − L_D(c)| ≤ α.
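
A quick numeric illustration of the sample-size bound (the parameter values are mine):

```python
import math

# Uniform convergence sample size: n >= ln(|C|/beta) / (2 * alpha^2).
# Illustrative values: |C| = 2^20 hypotheses, accuracy alpha = 0.05,
# failure probability beta = 0.01.
C_size, alpha, beta = 2**20, 0.05, 0.01
n = math.log(C_size / beta) / (2 * alpha**2)
print(f"n >= {n:.0f} samples suffice")  # roughly 3694
```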

  7. Example: Linear Separators
     • X = [0, 1]^d (the unit cube in R^d).
     • C is all functions of the type c_θ(x) = sign(⟨x, θ⟩ + θ₀) for θ ∈ R^d, θ₀ ∈ R.
     For convenience, replace x by (x, 1) ∈ [0, 1]^{d+1} and θ, θ₀ by (θ, θ₀) ∈ R^{d+1}, so that
       c_θ(x) = sign(⟨x, θ⟩)
     (+1 "above" the plane, −1 "below"). We ignore θ₀ from now on.
     [Figure: a realizable instance, where some hyperplane separates the + and − examples, next to an agnostic instance, where no hyperplane is consistent with all labels.]
     Finding the best separator is generally computationally hard.
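
A small sketch of the lifting trick in NumPy (helper names are mine):

```python
import numpy as np

def augment(x):
    """Lift x in [0,1]^d to (x, 1) in [0,1]^{d+1} so the offset is absorbed."""
    return np.append(x, 1.0)

def linear_separator(theta, x):
    """c_theta(x) = sign(<augment(x), theta>), with theta in R^{d+1}."""
    return np.sign(augment(x) @ theta)

# With theta = (w, theta0): <(x, 1), (w, theta0)> = <x, w> + theta0,
# so this matches the original sign(<x, w> + theta0).
```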

  8. Logistic Regression
     Replace the hard sign with a sigmoid: given θ and x, predict
       +1 w/ prob. 1 / (1 + e^{−⟨x,θ⟩})
       −1 w/ prob. 1 / (1 + e^{⟨x,θ⟩})
     Logistic loss:
       ℓ(θ, (x, y)) = log(1 / P(predict y from ⟨x, θ⟩)) = log(1 + e^{−y·⟨x,θ⟩}).
     Logistic regression: Given X = {(x₁, y₁), ..., (xₙ, yₙ)}, solve
       argmin_{θ∈Θ} (1/n) Σ_{i=1}^n log(1 + e^{−yᵢ·⟨xᵢ,θ⟩}).
     The empirical loss can be minimized efficiently; its connection to the population loss is covered in another lecture.
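
A minimal NumPy sketch of the objective (the logaddexp trick for numerical stability is an implementation detail, not from the slides):

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Average logistic loss (1/n) * sum_i log(1 + exp(-y_i <x_i, theta>)).

    X: (n, d+1) array of augmented examples; y: (n,) array of +/-1 labels.
    np.logaddexp(0, -m) computes log(1 + exp(-m)) without overflow.
    """
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins))
```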

  9. (Private) Gradient Descent

  10. Convex loss
      The function L_X(θ) = (1/n) Σ_{i=1}^n log(1 + e^{−yᵢ·⟨xᵢ,θ⟩}) is convex in θ:
        L_X(αθ + (1 − α)θ′) ≤ α·L_X(θ) + (1 − α)·L_X(θ′) for all θ, θ′ and α ∈ [0, 1],
      or, equivalently (first-order condition),
        L_X(θ′) ≥ L_X(θ) + ⟨∇L_X(θ), θ′ − θ⟩ for all θ, θ′.
      Convex functions can be minimized efficiently.
      • for non-convex functions, it's complicated
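
A numerical spot check of convexity on random data (not a proof; all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.uniform(-1, 1, size=(n, d + 1))
y = rng.choice([-1.0, 1.0], size=n)

def L(theta):
    """Empirical logistic loss on the synthetic data."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

# Spot-check the chord definition of convexity at random points.
t1, t2 = rng.normal(size=d + 1), rng.normal(size=d + 1)
for a in (0.25, 0.5, 0.75):
    assert L(a * t1 + (1 - a) * t2) <= a * L(t1) + (1 - a) * L(t2) + 1e-12
```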

  11. Gradient descent
      Goal: argmin_{θ ∈ B₂^{d+1}(R)} (1/n) Σ_{i=1}^n log(1 + e^{−yᵢ·⟨xᵢ,θ⟩}),
      where B₂^{d+1}(R) is the Euclidean ball of radius R. The step size η is a parameter; the negative gradient is the direction in which the function decreases the fastest.

      θ₀ = 0
      for t = 1 ... T − 1 do
        θ̃ₜ = θ_{t−1} − η·∇L_X(θ_{t−1})
        θₜ = θ̃ₜ / max{1, ‖θ̃ₜ‖₂ / R}    (project back onto the ball of radius R)
      end for
      output (1/T) Σ_{t=0}^{T−1} θₜ
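
A minimal NumPy sketch of this loop (function names are mine; the gradient formula for the logistic loss is derived on the "Sensitivity of gradients" slide below):

```python
import numpy as np

def project_to_ball(theta, R):
    """Project theta onto the Euclidean ball of radius R."""
    return theta / max(1.0, np.linalg.norm(theta) / R)

def gradient_descent(X, y, R, eta, T):
    """Projected gradient descent on the logistic loss; returns the average iterate."""
    n, dim = X.shape
    theta = np.zeros(dim)
    iterates = [theta]
    for _ in range(1, T):
        margins = y * (X @ theta)
        # s_i = 1 / (1 + exp(margin_i)), computed via tanh to avoid overflow
        s = 0.5 * (1.0 - np.tanh(0.5 * margins))
        # gradient of (1/n) * sum_i log(1 + exp(-y_i <x_i, theta>))
        grad = -(X.T @ (y * s)) / n
        theta = project_to_ball(theta - eta * grad, R)
        iterates.append(theta)
    return np.mean(iterates, axis=0)
```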

  12. Advanced composition - warmup
      Goal: publish k functions f₁, ..., f_k : Xⁿ → R^d with (ε, δ)-DP, where Δ₂fᵢ ≤ C for all i; i.e., release f₁(X) + Z₁, ..., f_k(X) + Z_k. The queries could be adaptive: fᵢ may depend on f₁(X) + Z₁, ..., f_{i−1}(X) + Z_{i−1}.
      1) Apply the Gaussian mechanism k times with budget (ε/k, δ/k) per query; by the composition theorem, releasing (f₁(X) + Z₁, ..., f_k(X) + Z_k) is (ε, δ)-DP, with per-query noise on the order of C·k·√(log(k/δ)) / ε.
      2) For non-adaptive queries, use the Gaussian mechanism once on g(X) = (f₁(X), ..., f_k(X)): since ‖g(X) − g(X̃)‖₂² = Σᵢ ‖fᵢ(X) − fᵢ(X̃)‖₂² ≤ kC² for neighboring X, X̃, we have Δ₂g ≤ C√k, so noise on the order of C·√(k·log(1/δ)) / ε per query suffices.
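
A numeric comparison of the two options, using the textbook Gaussian-mechanism calibration σ = Δ·√(2 ln(1.25/δ)) / ε (this formula is assumed, not stated on the slide):

```python
import math

def gaussian_sigma(sensitivity, eps, delta):
    """Standard Gaussian-mechanism noise scale for an (eps, delta)-DP release."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps

C, k, eps, delta = 1.0, 100, 1.0, 1e-6

# Option 1: split the budget k ways and add noise to each query separately.
per_query_naive = gaussian_sigma(C, eps / k, delta / k)

# Option 2: one combined query g = (f_1, ..., f_k) with Delta_2 g <= C * sqrt(k).
combined = gaussian_sigma(C * math.sqrt(k), eps, delta)

print(per_query_naive, combined)  # ~610 vs ~53: naive scales like k, combined like sqrt(k)
```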

  13. Advanced composition (for Gaussian noise)
      Suppose we release Y₁ = f₁(X) + Z₁, ..., Y_k = f_k(X) + Z_k, where fᵢ : Xⁿ → R^d may depend also on Y₁, ..., Y_{i−1}, and
        Zᵢ ∼ N(0, ((Δ₂fᵢ)² / ρ) · I).
      Then the output (Y₁, ..., Y_k) satisfies (ε, δ)-DP for ε = kρ/2 + √(2kρ·ln(1/δ)).
      ⇒ To achieve (ε, δ)-DP for k adaptive queries, the noise per query only needs to grow like √k, the same as for non-adaptive queries in the warmup.
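
A small helper to evaluate the guarantee (assuming the ε formula as reconstructed above):

```python
import math

def advanced_composition_eps(k, rho, delta):
    """(eps, delta)-DP guarantee for k adaptive Gaussian queries,
    each with noise Z_i ~ N(0, (Delta_2 f_i)^2 / rho * I)."""
    return k * rho / 2 + math.sqrt(2 * k * rho * math.log(1 / delta))

# Example: with k = 1000 queries and rho = 3e-5 per query,
# the total privacy cost stays below eps = 1.
k, rho, delta = 1000, 3e-5, 1e-6
print(advanced_composition_eps(k, rho, delta))  # ≈ 0.93
```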

  14. Sensitivity of gradients
      ∇L_X(θ) = (1/n) Σ_{i=1}^n ∇ℓ(θ, (xᵢ, yᵢ)) = (1/n) Σ_{i=1}^n ∇log(1 + e^{−yᵢ·⟨xᵢ,θ⟩}).
      Suppose X = [−1, +1]^{d+1}. Since
        ∇log(1 + e^{−y·⟨x,θ⟩}) = −(e^{−y·⟨x,θ⟩} / (1 + e^{−y·⟨x,θ⟩}))·yx = −yx / (1 + e^{y·⟨x,θ⟩}),
      we have ‖∇log(1 + e^{−y·⟨x,θ⟩})‖₂ ≤ ‖yx‖₂ ≤ √(d+1). Changing one example changes a single term of the average, so
        Δ₂∇L_X(θ) = max ‖∇L_X(θ) − ∇L_{X̃}(θ)‖₂ ≤ 2√(d+1) / n,
      where the maximum is over neighboring datasets X, X̃.
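
A quick empirical spot check of the per-example gradient bound (random draws; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10

def per_example_grad(theta, x, y):
    """grad of log(1 + exp(-y <x, theta>)) = -y x / (1 + exp(y <x, theta>))."""
    return -y * x / (1.0 + np.exp(y * (x @ theta)))

# The per-example gradient norm stays below sqrt(d+1) for x in [-1,1]^{d+1},
# so swapping one of n examples moves the averaged gradient by at most
# 2*sqrt(d+1)/n in L2 norm.
theta = rng.normal(size=d + 1)
for _ in range(1000):
    x = rng.uniform(-1, 1, size=d + 1)
    y = rng.choice([-1.0, 1.0])
    assert np.linalg.norm(per_example_grad(theta, x, y)) <= np.sqrt(d + 1)
```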

  15. Private gradient descent
      Think of ∇L_X(θ₀), ..., ∇L_X(θ_{T−1}) as T adaptively chosen functions f₁, ..., f_T and perturb each with Gaussian noise:

      θ₀ = 0
      for t = 1 ... T − 1 do
        θ̃ₜ = θ_{t−1} − η·(∇L_X(θ_{t−1}) + Z_{t−1}),    Zₜ ∼ N(0, ((Δ₂∇L_X)² / ρ)·I)
        θₜ = θ̃ₜ / max{1, ‖θ̃ₜ‖₂ / R}
      end for
      output (1/T) Σ_{t=0}^{T−1} θₜ

      This is (ε, δ)-DP for ε = Tρ/2 + √(2Tρ·ln(1/δ)), by advanced composition + post-processing.
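
Putting the pieces together, a hedged sketch of the private variant (function names are mine; the sensitivity bound is from slide 14, the noise calibration and ε formula from slide 13):

```python
import math
import numpy as np

def private_gradient_descent(X, y, R, eta, T, rho, rng):
    """Noisy projected GD on the logistic loss over the ball of radius R.

    Each gradient query gets Gaussian noise calibrated to its L2
    sensitivity 2*sqrt(d+1)/n (examples in [-1,1]^{d+1}); by advanced
    composition the whole run is (eps, delta)-DP with
    eps = T*rho/2 + sqrt(2*T*rho*ln(1/delta)).
    """
    n, dim = X.shape
    sens = 2.0 * math.sqrt(dim) / n          # dim = d + 1
    sigma = sens / math.sqrt(rho)            # Z_t ~ N(0, sens^2/rho * I)
    theta = np.zeros(dim)
    iterates = [theta]
    for _ in range(1, T):
        margins = y * (X @ theta)
        s = 0.5 * (1.0 - np.tanh(0.5 * margins))   # 1/(1+exp(margin)), overflow-safe
        grad = -(X.T @ (y * s)) / n
        noisy_grad = grad + sigma * rng.normal(size=dim)
        theta = theta - eta * noisy_grad
        theta = theta / max(1.0, np.linalg.norm(theta) / R)  # project to ball
        iterates.append(theta)
    return np.mean(iterates, axis=0)   # averaging the iterates is post-processing
```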

  16. Accuracy analysis (optional; proof in the notes)
      Theorem. Suppose E‖∇L_X(θₜ) + Zₜ‖₂² ≤ B² for all t. For η = R / (B·T^{1/2}) we have
        E[L_X((1/T) Σ_{t=0}^{T−1} θₜ)] ≤ min_{θ∈B^{d+1}(R)} L_X(θ) + RB / T^{1/2}.
      → goes to the optimal value as T → ∞

  17. Plugging in
      B² = E‖∇L_X(θₜ) + Zₜ‖₂² = E‖∇L_X(θₜ)‖₂² + E‖Zₜ‖₂² ≤ (d + 1) + σ²(d + 1).
      First term: for any θ, by the triangle inequality and the per-example bound from slide 14,
        ‖∇L_X(θ)‖₂ ≤ (1/n) Σ_{i=1}^n ‖∇log(1 + e^{−yᵢ·⟨xᵢ,θ⟩})‖₂ ≤ √(d + 1).
      Second term: σ² = (2√(d+1)/n)² / ρ, and choosing ρ so that ε ≈ √(2Tρ·ln(1/δ)) makes E‖Zₜ‖₂² on the order of (d+1)²·T·ln(1/δ) / (n²ε²).
      Plugging into RB/T^{1/2}, the error is on the order of
        R√(d+1)/√T + R(d+1)·√(ln(1/δ)) / (εn),
      so for T large enough the excess empirical risk is O(R(d+1)·√(ln(1/δ)) / (εn)).
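
To get a feel for the numbers, a small script evaluating the bound RB/√T under the noise calibration above (all parameter values here are illustrative, not from the slides):

```python
import math

# Illustrative parameters: evaluate the error bound R*B/sqrt(T).
n, d, R, eps, delta = 10_000, 10, 1.0, 1.0, 1e-6
T = 100_000
rho = eps**2 / (2 * T * math.log(1 / delta))      # per-step budget
sigma2 = (2 * math.sqrt(d + 1) / n) ** 2 / rho    # per-coordinate noise variance
B = math.sqrt((d + 1) + sigma2 * (d + 1))         # E||grad + Z||_2^2 <= B^2
print(R * B / math.sqrt(T))                       # ≈ 0.016 for these values
```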
