GPipe and GShard
Kaixin Luo
Motivation
The more computational power you spend, the better the model you get.

GPipe's Idea
Commonly known parallelisms: data parallelism and model parallelism.
Proposed: pipeline parallelism.
Pipeline Parallelism (GPipe)
Each layer i has a forward computation function f[i] and a cost estimation function c[i].
A partition runs the composition of f[i] to f[j]; its cost is the sum of the layer cost estimates plus the communication cost at its boundaries.
Each mini-batch is split into smaller micro-batches that are pipelined through the partitions; the gradients of all micro-batches are accumulated and applied in a single update, so training stays equivalent to plain mini-batch gradient descent.
An unbalanced partition will hurt the overall performance, since the slowest stage bounds pipeline throughput.
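The micro-batch point above can be checked directly: summing per-micro-batch gradients reproduces the full mini-batch gradient. A minimal sketch with a toy 1-D model (the model and function names here are illustrative, not GPipe's actual API):

```python
# Toy demonstration that accumulating gradients over micro-batches
# equals the gradient over the whole mini-batch (hypothetical names,
# not GPipe's API).

def grad_sse(w, xs, ys):
    """Gradient of the sum of squared errors for y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

def split_into_microbatches(xs, ys, num_micro):
    """Split one mini-batch into num_micro contiguous micro-batches."""
    size = (len(xs) + num_micro - 1) // num_micro
    return [(xs[i:i + size], ys[i:i + size]) for i in range(0, len(xs), size)]

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad_sse(w, xs, ys)          # gradient over the whole mini-batch
accum = 0.0
for mx, my in split_into_microbatches(xs, ys, num_micro=2):
    accum += grad_sse(w, mx, my)    # accumulate per-micro-batch gradients
```

Because the loss is a sum over examples, the two quantities match exactly, which is why GPipe can apply a single weight update after all micro-batches finish.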
GShard: Annotation APIs
Writing manually partitioned models is complex; GShard instead has the user annotate tensors and leaves the partitioning to the compiler. Each portion of a sharded tensor lives in a different place (device).
split(tensor, split_dimension, num_partitions): annotates the tensor to be partitioned along split_dimension into num_partitions portions; only the tensor is a variable, the rest are constants.
shard(tensor, device_assignment): a generalized version of split, allowing a multi-dimension split and specifying the placement of each partition.
Partial results are combined with AllReduce, whose cost is largely insensitive to the number of participants; it is efficient in the TPU device network.
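The split annotation is easiest to understand by spelling out its data layout. A minimal sketch of the semantics, assuming tensors are nested Python lists and that each returned portion would live on its own device (this is an illustration of the meaning, not GShard's implementation):

```python
# Illustrative semantics of split(tensor, split_dimension, num_partitions):
# partition a nested-list "tensor" along split_dimension into num_partitions
# equal portions, one per device. Not GShard's actual implementation.

def split(tensor, split_dimension, num_partitions):
    if split_dimension == 0:
        # Base case: slice the leading dimension into equal chunks.
        size = len(tensor) // num_partitions
        return [tensor[i * size:(i + 1) * size] for i in range(num_partitions)]
    # Recurse into each sub-tensor, then regroup so portion k collects
    # the k-th piece of every sub-tensor.
    parts = [split(sub, split_dimension - 1, num_partitions) for sub in tensor]
    return [[p[k] for p in parts] for k in range(num_partitions)]

t = [[1, 2, 3, 4],
     [5, 6, 7, 8]]

rows = split(t, 0, 2)  # partition along dimension 0 (one row per portion)
cols = split(t, 1, 2)  # partition along dimension 1 (column halves)
```

shard(tensor, device_assignment) generalizes this by letting several dimensions be split at once and by pinning each resulting portion to a specific device.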
References
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding