Semi-Cyclic SGD
Hubert Eichner
Tomer Koren
Brendan McMahan
Kunal Talwar
Nati Srebro

(Eichner, Koren, McMahan, and Talwar are with Google)
SGD is great…

x_{t+1} ← x_t − η ∇f(x_t, z_t)

Output: the learned model x̂
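On an iid stream the update can be sketched on a toy problem. Everything below, the squared loss f(x, z) = (x − z)²/2, the Gaussian data, and the averaged-iterate output, is an illustrative assumption, not the paper's setting:

```python
import random

def sgd(samples, eta=0.1, x0=0.0):
    """Plain SGD on the toy loss f(x, z) = (x - z)^2 / 2, whose
    gradient in x is (x - z)."""
    x = x0
    iterates = []
    for z in samples:
        x = x - eta * (x - z)  # x_{t+1} <- x_t - eta * grad_x f(x_t, z_t)
        iterates.append(x)
    # one standard choice of output: the averaged iterate (1/T) sum_t x_t
    return sum(iterates) / len(iterates)

random.seed(0)
data = [random.gauss(1.0, 0.5) for _ in range(2000)]  # iid samples
x_bar = sgd(data)  # approaches the minimizer of E[f], here the mean 1.0
```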
[Figure: a stream of incoming training examples]
SGD is great… if you run on iid (randomly shuffled) data
Cyclically varying (not fully shuffled) data:
samples in block i = 1..K are drawn as z_t ∼ D_i,
and the overall data distribution is the mixture (1/K) Σ_i D_i
E.g. in Federated Learning: each device participates only when it is available (plugged in, idle, on WiFi), so the distribution the samples come from varies cyclically
Learn a single x̂ by running block-cyclic SGD → could be MUCH slower, by an arbitrarily large factor
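A toy run illustrating the degradation (same illustrative quadratic loss; the numbers are made up): a single chain on the cyclic stream keeps getting pulled toward whichever block is currently active, so its final iterate sits near the last block's optimum rather than the mixture's.

```python
import random

def cyclic_sgd(stream, eta=0.1):
    """One SGD chain on f(x, z) = (x - z)^2 / 2 over the given stream."""
    x = 0.0
    for z in stream:
        x -= eta * (x - z)
    return x

rng = random.Random(3)
steps, cycles = 200, 5
# block 0 ~ N(-1, 0.1), block 1 ~ N(+1, 0.1), visited cyclically
stream = [rng.gauss(-1.0 if (t // steps) % 2 == 0 else 1.0, 0.1)
          for t in range(steps * 2 * cycles)]
x_last = cyclic_sgd(stream)
# the run ends inside block 1, so x_last lands near +1, far from the
# mixture optimum 0 that SGD on shuffled data would approach
```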
Pluralistic approach: learn a different x̂_i for each block i = 1..K, training each x̂_i separately on data from that block (across all cycles) → could be slower/less efficient by a factor of K
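A sketch of this pluralistic baseline under the same toy assumptions (quadratic loss, two Gaussian blocks): one SGD iterate per block, updated only on that block's samples.

```python
import random

def pluralistic_sgd(stream, K, steps_per_block, eta=0.1):
    """Maintain a separate iterate x_hat[i] per block i = 1..K; each is
    updated only on samples from its own block (across all cycles), so
    each model sees only a 1/K fraction of the stream."""
    x_hat = [0.0] * K
    for t, z in enumerate(stream):
        i = (t // steps_per_block) % K    # block currently active
        x_hat[i] -= eta * (x_hat[i] - z)  # SGD step on that block's model
    return x_hat

rng = random.Random(1)
K, steps, cycles = 2, 200, 5
stream = [rng.gauss(-1.0 if (t // steps) % K == 0 else 1.0, 0.1)
          for t in range(steps * K * cycles)]
x_hat = pluralistic_sgd(stream, K, steps)
# x_hat[0] fits block 0 (mean -1), x_hat[1] fits block 1 (mean +1)
```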
Learn x̃_i using a single SGD chain + "pluralistic averaging" → exactly the same guarantee as with random shuffling (no degradation), no extra computational cost, and no assumptions about the D_i or their relatedness
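A rough sketch of the single-chain idea under the same toy assumptions. Here "pluralistic averaging" is simplified to a plain within-block average of the shared chain's iterates; the paper's actual estimator and its guarantee differ, so treat this only as intuition.

```python
import random

def semi_cyclic_sgd(stream, K, steps_per_block, eta=0.1):
    """One shared SGD chain over the whole stream; for each block i,
    output x_tilde[i] = average of the iterates visited while block i
    was active (across all cycles). Simplified stand-in for the
    paper's pluralistic averaging."""
    x = 0.0
    sums, counts = [0.0] * K, [0] * K
    for t, z in enumerate(stream):
        i = (t // steps_per_block) % K
        x -= eta * (x - z)  # the single chain sees every sample
        sums[i] += x
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

rng = random.Random(2)
K, steps, cycles = 2, 200, 5
stream = [rng.gauss(-1.0 if (t // steps) % K == 0 else 1.0, 0.1)
          for t in range(steps * K * cycles)]
x_tilde = semi_cyclic_sgd(stream, K, steps)
# each x_tilde[i] is adapted to its own block, from one chain's work
```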