
Semi-Cyclic SGD — Hubert Eichner, Tomer Koren, Brendan McMahan, Kunal Talwar, Nati Srebro (PowerPoint presentation)



  1. Semi-Cyclic SGD. Hubert Eichner (Google), Tomer Koren (Google), Brendan McMahan (Google), Kunal Talwar (Google), Nati Srebro.

  2. SGD is great … x_{u+1} ← x_u − θ ∇g(x_u, A_u); after U steps, output x̂_U.
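The plain SGD update on this slide can be sketched in a few lines. The least-squares loss, step size, and all names below are illustrative assumptions, not from the slides:

```python
import numpy as np

def sgd(grad_g, x0, samples, theta=0.1):
    """Plain SGD: x_{u+1} <- x_u - theta * grad_g(x_u, A_u)."""
    x = np.asarray(x0, dtype=float)
    for a in samples:
        x = x - theta * grad_g(x, a)
    return x

# Toy check (hypothetical setup): noise-free least squares,
# g(x, (a, y)) = 0.5 * (a @ x - y)**2, whose gradient in x
# is (a @ x - y) * a. On iid data, SGD recovers x_true.
rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0])
A = rng.normal(size=(500, 2))
y = A @ x_true
x_hat = sgd(lambda x, s: (s[0] @ x - s[1]) * s[0], np.zeros(2), zip(A, y))
```

On this shuffled, noise-free stream the iterate converges to `x_true`, which is exactly the "SGD is great on iid data" regime of the next slide.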

  3. x_{u+1} ← x_u − θ ∇g(x_u, A_u); output x̂_U. SGD is great … if you run on iid (randomly shuffled) data. [figure: a stream of training examples sorted alphabetically, i.e. not shuffled]

  4. But what about cyclically varying (not fully shuffled) data? Samples in block j = 1..n are drawn as A_u ∼ 𝒟_j; the overall distribution is 𝒟 = (1/n) Σ_j 𝒟_j.

  5. Cyclically varying (not fully shuffled) data arises e.g. in Federated Learning:
     • Train the model by executing SGD steps on user devices when a device is available (plugged in, idle, on WiFi)
     • Diurnal variations (e.g. day vs. night available devices; US vs. UK vs. India)
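The block-cyclic sampling model (A_u ∼ 𝒟_j while block j is active; overall 𝒟 = (1/n) Σ_j 𝒟_j) is easy to simulate. Taking 𝒟_j to be a Gaussian with a block-specific mean is purely an illustrative assumption, loosely standing in for day-vs-night user populations:

```python
import numpy as np

def block_cyclic_stream(rng, n_blocks, steps_per_block, n_cycles, block_means):
    """Yield (j, A_u) pairs where, within block j, A_u ~ D_j.

    Illustrative assumption: D_j = Normal(block_means[j], 1).
    The blocks repeat cyclically, as in the diurnal-variation setting.
    """
    for _ in range(n_cycles):
        for j in range(n_blocks):
            for _ in range(steps_per_block):
                yield j, rng.normal(loc=block_means[j])

rng = np.random.default_rng(0)
stream = list(block_cyclic_stream(rng, n_blocks=2, steps_per_block=1000,
                                  n_cycles=3, block_means=[-1.0, +1.0]))
vals = np.array([v for _, v in stream])
block0 = np.array([v for j, v in stream if j == 0])
```

Pooled over the whole stream the samples do follow the mixture 𝒟 (mean ≈ 0 here), but consecutive samples are strongly correlated through the block structure, which is exactly what breaks the iid analysis.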

  6. Option 1: train x̂_U by running block-cyclic SGD on this data as-is ➔ could be MUCH slower than with iid data, by an arbitrarily large factor.

  7. Option 2, the pluralistic approach: learn a different model x̂_j for each block j = 1..n.
     • Train each x̂_j separately on data from that block (across all cycles)
     ➔ could be slower/less efficient by a factor of n.
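The factor-n cost of this pluralistic baseline is visible in code: each of the n chains only ever consumes its own block's 1/n share of the stream. A sketch under an illustrative scalar-quadratic loss g(x, a) = 0.5 (x − a)², whose gradient in x is (x − a); the data layout and all names are assumptions for illustration:

```python
import numpy as np

def pluralistic_separate(grad_g, x0, cycles, theta=0.1):
    """Baseline from the slide: one independent SGD chain per block j,
    fed only block j's samples, pooled across all cycles.
    cycles: list of cycles; each cycle is a list of n per-block sample lists.
    Each chain sees only ~1/n of the data: the factor-n inefficiency.
    """
    n = len(cycles[0])
    xs = [np.asarray(x0, dtype=float).copy() for _ in range(n)]
    for cycle in cycles:
        for j, block_samples in enumerate(cycle):
            for a in block_samples:
                xs[j] = xs[j] - theta * grad_g(xs[j], a)
    return xs

# Two blocks with different optima: D_0 centered at -1, D_1 at +1.
rng = np.random.default_rng(2)
cycles = [[list(rng.normal(loc=m, size=200)) for m in (-1.0, 1.0)]
          for _ in range(5)]
xs = pluralistic_separate(lambda x, a: x - a, 0.0, cycles)
```

Each chain converges toward its own block's optimum, but pays for that specialization with n-fold fewer samples per model.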

  8. Our solution: train x̃_j using a single SGD chain + “pluralistic averaging”
     ➔ exactly the same guarantee as if using random shuffling (no degradation)
     ➔ no extra computational cost, no assumptions about the 𝒟_j nor their relatedness.
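The slide only names the recipe ("single SGD chain + pluralistic averaging"). One plausible reading, sketched below purely under that assumption, is to run a single chain over the block-cyclic stream and output, for each block j, the average of the iterates visited while block j's data was active (across all cycles). The scalar-quadratic loss and all names are illustrative:

```python
import numpy as np

def semi_cyclic_sgd(grad_g, x0, cycles, theta=0.1):
    """One SGD chain over block-cyclic data, plus per-block iterate
    averaging: x~_j is read here as the running average of the iterates
    produced while block j's data was being consumed.
    cycles: list of cycles; each cycle is a list of n per-block sample lists.
    """
    n = len(cycles[0])
    x = np.asarray(x0, dtype=float)
    sums = [np.zeros_like(x) for _ in range(n)]
    counts = [0] * n
    for cycle in cycles:
        for j, block_samples in enumerate(cycle):
            for a in block_samples:
                x = x - theta * grad_g(x, a)   # single shared chain
                sums[j] += x                   # but per-block averaging
                counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]

# Illustrative loss g(x, a) = 0.5*(x - a)^2, gradient (x - a);
# block 0 centered at -1, block 1 at +1.
rng = np.random.default_rng(1)
cycles = [[list(rng.normal(loc=m, size=200)) for m in (-1.0, 1.0)]
          for _ in range(5)]
x_tilde = semi_cyclic_sgd(lambda x, a: x - a, 0.0, cycles)
```

The single chain consumes every sample (no factor-n data loss), yet the averaged outputs x̃_0 and x̃_1 still separate toward their respective blocks' optima, which is the flavor of result the slide claims.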

