
Semi-Cyclic SGD, by Hubert Eichner, Tomer Koren, Brendan McMahan, Kunal Talwar, and Nati Srebro (PowerPoint PPT presentation)



  1. Semi-Cyclic SGD. Hubert Eichner (Google), Tomer Koren (Google), Brendan McMahan (Google), Kunal Talwar (Google), Nati Srebro

  2. π‘₯ 𝑒+1 ← π‘₯ 𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯ 𝑒 , 𝑨 𝑒 π‘₯ π‘ˆ ෝ SGD is great ……

  3. π‘₯ 𝑒+1 ← π‘₯ 𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯ 𝑒 , 𝑨 𝑒 A b A g A z B e B o B y C h C l C o C o C u D a D e D i D o D r D y E d E f π‘₯ π‘ˆ ෝ E l E n E p E r E s E t E x F a F i F l F o F r F u G e G i SGD is great …… G l G m G r H a H i H o if you run on iid (randomly shuffled) data

  4. π‘₯ 𝑒+1 ← π‘₯ 𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯ 𝑒 , 𝑨 𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨 𝑒 ∼ 𝒠 𝑗 1 𝑛 Οƒ 𝑗 𝒠 𝑗 overall distribution: 𝒠 = π‘₯ π‘ˆ ෝ SGD is great …… if you run on iid (randomly shuffled) data Cyclically varying (not fully shuffled) data

  5. π‘₯ 𝑒+1 ← π‘₯ 𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯ 𝑒 , 𝑨 𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨 𝑒 ∼ 𝒠 𝑗 1 𝑛 Οƒ 𝑗 𝒠 𝑗 overall distribution: 𝒠 = π‘₯ π‘ˆ ෝ SGD is great …… if you run on iid (randomly shuffled) data Cyclically varying (not fully shuffled) data, e.g. in Federated Learning β€’ Train model by executing SGD steps on user devices when device available (plugged in, idle, on WiFi) β€’ Diurnal variations (e.g. Day vs night available devices; US vs UK vs India)

  6. π‘₯ 𝑒+1 ← π‘₯ 𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯ 𝑒 , 𝑨 𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨 𝑒 ∼ 𝒠 𝑗 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨 𝑒 ∼ 𝒠 𝑗 1 𝑛 Οƒ 𝑗 𝒠 𝑗 overall distribution: 𝒠 = π‘₯ π‘ˆ ෝ β€’ Train ෝ π‘₯ π‘ˆ by running block-cyclic SGD βž” could be MUCH slower, by an arbitrary large factor

  7. π‘₯ 𝑒+1 ← π‘₯ 𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯ 𝑒 , 𝑨 𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨 𝑒 ∼ 𝒠 𝑗 π‘₯ 1 ෝ π‘₯ 2 ෝ β€’ Train ෝ π‘₯ π‘ˆ by running block-cyclic SGD βž” could be MUCH slower, by an arbitrary large factor π‘₯ 𝑗 for each block 𝑗 = 1. . 𝑛 Pluralistic approach: learn different ෝ π‘₯ 𝑗 separately on data from that block (across all cycles) β€’ Train each ෝ βž” could be slower/less efficient by a factor of 𝑛

  8. π‘₯ 𝑒+1 ← π‘₯ 𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯ 𝑒 , 𝑨 𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨 𝑒 ∼ 𝒠 𝑗 π‘₯ 1 ΰ·₯ π‘₯ 2 ΰ·₯ β€’ Train ෝ π‘₯ π‘ˆ by running block-cyclic SGD βž” could be MUCH slower, by an arbitrary large factor π‘₯ 𝑗 for each block 𝑗 = 1. . 𝑛 Pluralistic approach: learn different ෝ π‘₯ 𝑗 separately on data from that block (across all cycles) β€’ Train each ෝ βž” could be slower/less efficient by a factor of 𝑛 π‘₯ 𝑗 using single SGD chain+ β€œ pluralistic averaging ” β€’ Our solution: train ΰ·₯ βž” exactly same guarantee as if using random shuffling (no degradation) βž” no extra comp. cost, no assumptions about 𝓔 𝒋 nor relatedness
