Semi-Cyclic SGD: Hubert Eichner, Tomer Koren, Brendan McMahan, Kunal Talwar, Nati Srebro (PowerPoint PPT Presentation)




SLIDE 1

Semi-Cyclic SGD

Hubert Eichner

Google

Tomer Koren

Google

Brendan McMahan

Google

Kunal Talwar

Google

Nati Srebro

SLIDE 2

π‘₯𝑒+1 ← π‘₯𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯𝑒, 𝑨𝑒

SGD is great…

x̂_U
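The update on this slide is ordinary SGD, producing a final point x̂_U after U steps. A minimal runnable sketch, assuming a quadratic loss f(x, z) = (x − z)²/2 (so ∇f(x, z) = x − z), a constant step size η = 0.05, and synthetic Gaussian data; these specifics are illustrative choices, not taken from the slides.

```python
import numpy as np

def sgd(grad, x0, samples, eta):
    """Plain SGD: x_{u+1} <- x_u - eta * grad(x_u, z_u); returns the last iterate."""
    x = x0
    for z in samples:
        x = x - eta * grad(x, z)
    return x

# Illustrative setup: estimate the mean of iid (shuffled) data with
# f(x, z) = (x - z)^2 / 2, whose gradient in x is simply x - z.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=2000)
x_hat = sgd(grad=lambda x, z: x - z, x0=0.0, samples=data, eta=0.05)
print(x_hat)  # settles near the true mean 3.0 on iid data
```

On a shuffled stream like this, the last iterate hovers around the minimizer of the expected loss; the later slides ask what happens when the stream is not shuffled.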

SLIDE 3


  • π‘₯𝑒+1 ← π‘₯𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯𝑒, 𝑨𝑒

SGD is great… if you run on iid (randomly shuffled) data

x̂_U

SLIDE 4

π‘₯𝑒+1 ← π‘₯𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯𝑒, 𝑨𝑒

SGD is great… if you run on iid (randomly shuffled) data

Cyclically varying (not fully shuffled) data

x̂_U

Samples in block j = 1..n are sampled as z_u ∼ 𝒟_j

  • Overall distribution: 𝒟 = (1/n) Σ_j 𝒟_j
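The block-cyclic data model above can be sketched as a sampler: within each cycle, block j contributes a run of consecutive samples drawn from its own 𝒟_j, and the overall 𝒟 is their uniform mixture, i.e. what a fully shuffled stream would look like. The Gaussian stand-ins for the 𝒟_j, the block size, and the cycle count below are illustrative; the slides make no assumption about the 𝒟_j.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B, cycles = 3, 4, 2          # blocks per cycle, samples per block, cycles
block_means = [0.0, 5.0, 10.0]  # parameterizes the illustrative Gaussian D_j

def block_cyclic_stream():
    """Yield (block_index, sample): data arrives block by block, cyclically."""
    for _ in range(cycles):
        for j in range(n):
            for _ in range(B):
                yield j, rng.normal(block_means[j])

stream = list(block_cyclic_stream())
print(len(stream))                 # n * B * cycles = 24 samples in total
print([j for j, _ in stream[:8]])  # [0, 0, 0, 0, 1, 1, 1, 1]: runs, not shuffled
```

The key point the sampler makes concrete: consecutive samples are highly correlated in their source block, so the stream is far from iid even though its long-run marginal is the mixture 𝒟.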

SLIDE 5

π‘₯𝑒+1 ← π‘₯𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯𝑒, 𝑨𝑒

SGD is great… if you run on iid (randomly shuffled) data

Cyclically varying (not fully shuffled) data, e.g. in Federated Learning

  • Train model by executing SGD steps on user devices when device available (plugged in, idle, on WiFi)

  • Diurnal variations (e.g. Day vs night available devices; US vs UK vs India)

x̂_U

Samples in block j = 1..n are sampled as z_u ∼ 𝒟_j

  • Overall distribution: 𝒟 = (1/n) Σ_j 𝒟_j

SLIDE 6

π‘₯𝑒+1 ← π‘₯𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯𝑒, 𝑨𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨𝑒 ∼ 𝒠𝑗

  • Train x̂_U by running block-cyclic SGD ➔ could be MUCH slower, by an arbitrarily large factor

x̂_U

Samples in block j = 1..n are sampled as z_u ∼ 𝒟_j

  • Overall distribution: 𝒟 = (1/n) Σ_j 𝒟_j

SLIDE 7

π‘₯𝑒+1 ← π‘₯𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯𝑒, 𝑨𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨𝑒 ∼ 𝒠𝑗

  • Train x̂_U by running block-cyclic SGD ➔ could be MUCH slower, by an arbitrarily large factor

Pluralistic approach: learn a different x̂_j for each block j = 1..n

  • Train each x̂_j separately on data from that block (across all cycles) ➔ could be slower/less efficient by a factor of n

x̂_2  x̂_1

SLIDE 8

π‘₯𝑒+1 ← π‘₯𝑒 βˆ’ πœƒβˆ‡π‘” π‘₯𝑒, 𝑨𝑒 Samples in block 𝑗 = 1. . 𝑛 are sampled from as 𝑨𝑒 ∼ 𝒠𝑗

  • Train x̂_U by running block-cyclic SGD ➔ could be MUCH slower, by an arbitrarily large factor

Pluralistic approach: learn a different x̂_j for each block j = 1..n

  • Train each x̂_j separately on data from that block (across all cycles) ➔ could be slower/less efficient by a factor of n

  • Our solution: train x̃_j using a single SGD chain + "pluralistic averaging" ➔ exactly the same guarantee as if using random shuffling (no degradation) ➔ no extra computation cost, no assumptions about the 𝒟_j nor their relatedness

x̃_2  x̃_1
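One way to read the single-chain idea on this slide: run ONE SGD chain over the whole block-cyclic stream, and form each per-block model x̃_j by averaging the chain's iterates over the steps during which block j's data was active. The plain per-block iterate averaging below is a simplified sketch of that idea (the paper's exact averaging scheme may weight iterates differently), and the Gaussian blocks, quadratic loss, and step size are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, B, cycles, eta = 2, 500, 3, 0.05  # blocks, samples per block, cycles, step size
block_means = [0.0, 8.0]             # illustrative Gaussian D_j

x = 0.0                              # a single SGD chain over all blocks
iterate_sums = [0.0] * n             # running sum of iterates seen per block
counts = [0] * n
for _ in range(cycles):
    for j in range(n):
        for _ in range(B):
            z = rng.normal(block_means[j])
            x -= eta * (x - z)       # one chain: gradient of (x - z)^2 / 2
            iterate_sums[j] += x     # credit this iterate to the active block
            counts[j] += 1

# Pluralistic average: per-block average of the single chain's iterates.
x_tilde = [s / c for s, c in zip(iterate_sums, counts)]
print([round(v, 2) for v in x_tilde])
```

Every sample updates the one shared chain, so no data is wasted, yet each x̃_j is built only from iterates computed while its own block was active, which is how the per-block specialization survives without running n separate chains.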