Semi-Cyclic SGD
Hubert Eichner (Google), Tomer Koren (Google), Brendan McMahan (Google), Kunal Talwar (Google), Nati Srebro
SGD update: x_{t+1} ← x_t − η ∇f(x_t, z_t); output x̂

SGD is great … if you run it on iid (randomly shuffled) data.
[Figure: a data stream ordered alphabetically by user (Ab, Ag, Az, Be, Bo, …) rather than randomly shuffled]

Block-cyclic data: samples in block i = 1..k are drawn as z_t ∼ D_i, while the overall distribution is D = (1/k) Σ_i D_i.
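As a concrete illustration, here is a minimal Python sketch (not from the paper) of this block-cyclic data model with a plain SGD chain running over it; the distributions D_i, the toy loss/gradient, and all names and hyperparameters (sample_from_block, grad, eta, T_block, n_cycles) are hypothetical placeholders.

```python
# Toy sketch of block-cyclic data and a single SGD chain over it (assumed setup,
# not the paper's experimental configuration).
import numpy as np

rng = np.random.default_rng(0)
k = 4            # number of blocks per cycle (e.g. times of day)
T_block = 250    # SGD steps taken within each block
n_cycles = 10    # number of cycles (e.g. days)
eta = 0.1        # learning rate

def sample_from_block(i):
    """Draw one sample z_t ~ D_i; each block gets its own mean as a toy stand-in."""
    return rng.normal(loc=float(i), scale=1.0)

def grad(x, z):
    """Gradient of the toy loss f(x, z) = 0.5 * (x - z)**2 with respect to x."""
    return x - z

x = 0.0
for cycle in range(n_cycles):
    for i in range(k):                  # blocks always arrive in the same cyclic order
        for _ in range(T_block):
            z = sample_from_block(i)    # z_t ~ D_i, not from the shuffled mixture D
            x = x - eta * grad(x, z)    # x_{t+1} <- x_t - eta * grad_f(x_t, z_t)
```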
Cyclically varying (not fully shuffled) data arises e.g. in Federated Learning:
• Train the model by executing SGD steps on user devices whenever a device is available (plugged in, idle, on WiFi)
• Diurnal variations (e.g. day vs. night available devices; US vs. UK vs. India)

• Train a single x̂ by running block-cyclic SGD → could be MUCH slower, by an arbitrarily large factor
• Pluralistic approach: learn a different x̂_i for each block i = 1..k, each trained separately on data from that block (across all cycles) → could be slower/less efficient by a factor of k (a sketch of this per-block training follows below)
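Continuing the toy setup from the sketch above, a minimal sketch of the pluralistic baseline: k independent SGD chains, where chain i only ever sees samples from block i, so each chain trains on roughly a 1/k fraction of the data.

```python
# Pluralistic baseline (assumed toy setup from the previous sketch): k independent
# SGD chains, chain i updated only on block-i samples.
x_blocks = [0.0] * k
for cycle in range(n_cycles):
    for i in range(k):
        for _ in range(T_block):
            z = sample_from_block(i)
            x_blocks[i] -= eta * grad(x_blocks[i], z)
# x_blocks[i] plays the role of the per-block model x̂_i served during block i.
```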
• Our solution: train each x̃_i using a single SGD chain + "pluralistic averaging"
  → exactly the same guarantee as with random shuffling (no degradation)
  → no extra computational cost, no assumptions about the D_i nor about their relatedness
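One way the single-chain + pluralistic-averaging idea could look, continuing the toy setup above: run one SGD chain over the block-cyclic stream and, for each block i, keep a running average of the iterates visited during block i's segments, serving that average as x̃_i. The equal-weight averaging over all cycles used here is an assumption of this sketch, not necessarily the paper's exact estimator.

```python
# Single SGD chain + per-block running averages of its iterates ("pluralistic
# averaging" in this sketch's reading); the chain itself is identical to plain
# block-cyclic SGD, so no extra training cost is incurred.
x = 0.0
avg = [0.0] * k       # running average of iterates visited during block i
count = [0] * k       # number of iterates contributing to avg[i]
for cycle in range(n_cycles):
    for i in range(k):
        for _ in range(T_block):
            z = sample_from_block(i)
            x -= eta * grad(x, z)               # same single-chain update as plain SGD
            count[i] += 1
            avg[i] += (x - avg[i]) / count[i]   # online mean of block-i iterates
# avg[i] is the pluralistic-averaged model x̃_i deployed for block i.
```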