Soft Threshold Weight Reparameterization for Learnable Sparsity
Aditya Kusupati, Vivek Ramanujan*, Raghav Somani*, Mitchell Wortsman*, Prateek Jain, Sham Kakade, and Ali Farhadi
Motivation
Deep neural networks are highly accurate.
But they must run on constrained hardware: privacy-preserving smart glasses, billions of mobile devices.
An illustrative 3-layer network, pruned to the same overall sparsity by two different methods:

         Dense            Sparsity - Method 1   Sparsity - Method 2
Layer    #Params  FLOPs   #Params  FLOPs        #Params  FLOPs
1           20    100K       20    100K            10     50K
2          100    100K      100    100K            10     10K
3         1000     50K      100      5K           200     10K
Total     1120    250K      220    205K           220     70K

Both methods keep 220 parameters, yet their inference costs differ by ~3x (205K vs. 70K FLOPs): how sparsity is distributed across layers matters.
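To make this concrete, here is a minimal Python sketch that recomputes the totals from the illustrative numbers in the table above:

```python
# Per-layer (kept params, FLOPs) budgets from the table above.
dense    = [(20, 100_000), (100, 100_000), (1000, 50_000)]
method_1 = [(20, 100_000), (100, 100_000), (100, 5_000)]
method_2 = [(10, 50_000), (10, 10_000), (200, 10_000)]

for name, layers in [("dense", dense), ("method 1", method_1), ("method 2", method_2)]:
    params = sum(p for p, _ in layers)
    flops = sum(f for _, f in layers)
    print(f"{name:>8}: {params:5d} params, {flops:7d} FLOPs")

# Both pruned methods keep 220 parameters, but method 2 needs ~3x fewer
# FLOPs: where the sparsity sits across layers determines inference cost.
```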
Training regimes for sparsity:
- Dense-to-sparse training, e.g. gradual magnitude pruning (GMP; schedule sketched below): uniform or non-uniform sparsity; state-of-the-art accuracy, but dense training cost.
- Sparse-to-sparse training (magnitude/gradient based): non-uniform sparsity; lower training cost, but harder to train.
- Hybrid methods combine the two regimes.
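As a concrete example of the dense-to-sparse regime, GMP (Zhu & Gupta) ramps sparsity up with a hand-designed cubic schedule and masks the smallest-magnitude weights at each step. A minimal sketch, with illustrative hyperparameter names:

```python
def gmp_sparsity(step, s_init=0.0, s_final=0.9, t_start=0, t_end=100_000):
    """Cubic sparsity schedule used by gradual magnitude pruning (GMP).

    Sparsity ramps from s_init to s_final between steps t_start and t_end;
    at each pruning step, the smallest-magnitude weights are masked so the
    layer hits the scheduled sparsity.
    """
    if step < t_start:
        return s_init
    if step >= t_end:
        return s_final
    frac = (step - t_start) / (t_end - t_start)
    return s_final + (s_init - s_final) * (1.0 - frac) ** 3
```

Note that the final sparsity and its distribution across layers are fixed by hand here, which is exactly what STR replaces with learned, layer-wise thresholds.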
STR: Soft Threshold Reparameterization

Every weight is passed through a soft-threshold operator with a learnable, layer-wise threshold:

S_g(w; s) := sign(w) · ReLU(|w| − g(s))

The forward pass of layer l always uses the thresholded (sparse) weights:

y = S_g(W_l, s_l) · x

Here g is the sigmoid, so the effective threshold of layer l is α_l = g(s_l). Since ReLU is subdifferentiable, each s_l is trained by backpropagation along with the weights; the initialization s_init and the weight decay control the overall sparsity, so no per-layer budgets are specified by hand.
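A minimal PyTorch sketch of the reparameterization, assuming g = sigmoid; the class and parameter names are illustrative, and the paper's released code differs in details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STRLinear(nn.Module):
    """Linear layer with soft threshold weight reparameterization (sketch).

    The dense weight W and a per-layer threshold parameter s are both
    trainable; the effective sparse weight is
        S_g(W, s) = sign(W) * ReLU(|W| - g(s)),   g = sigmoid.
    """
    def __init__(self, d_in, d_out, s_init=-10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_out))
        # s_init is strongly negative so the threshold g(s) starts near 0
        # and training begins (almost) dense; -10.0 is an illustrative value.
        self.s = nn.Parameter(torch.tensor(s_init))

    def sparse_weight(self):
        alpha = torch.sigmoid(self.s)  # learned layer-wise threshold
        return torch.sign(self.weight) * F.relu(self.weight.abs() - alpha)

    def forward(self, x):
        # The forward pass always uses the thresholded (sparse) weights.
        return F.linear(x, self.sparse_weight(), self.bias)
```

Because the optimizer's weight decay also acts on s, the thresholds (and hence the layer-wise sparsities) emerge from training rather than from a hand-tuned schedule.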
Figure: overall sparsity vs. epochs (left) and layer-wise sparsity (right) for 90% sparse ResNet50 on ImageNet-1K.
Learned budgets for 90% sparse ResNet50 on ImageNet-1K:
- STR is sparser in the initial layers than the non-uniform sparsity baselines.
- STR keeps the final layers denser than all baselines, which may make the sparse models better backbones for transfer learning.
- STR distributes sparsity across layers such that it has lower total inference cost than the baselines.
Figure: layer-wise sparsity and FLOPs budgets for 90% sparse ResNet50 on ImageNet-1K.
On MobileNetV1, STR learns to keep the depth-wise separable conv layers denser than the rest of the layers, and uses 50% fewer FLOPs than GMP.
Figure: layer-wise sparsity and FLOPs budgets for 90% sparse MobileNetV1 on ImageNet-1K.
Figure: learning low-rank with STR. A typical low-rank parameterization fixes the rank up front: X = X̃₂ X̃₃. Instead, parameterize X = X₂ Σ X₃ and train with STR on the diagonal Σ: soft-thresholded diagonal entries become exactly zero, so the rank itself is learned.
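A minimal PyTorch sketch of this idea, assuming the soft threshold is applied to the diagonal entries of Σ (class and parameter names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STRLowRank(nn.Module):
    """Low-rank linear map X = X2 @ diag(sigma) @ X3 with STR on sigma (sketch)."""
    def __init__(self, d_in, d_out, max_rank, s_init=-10.0):
        super().__init__()
        self.X2 = nn.Parameter(torch.randn(d_out, max_rank) * 0.01)
        self.X3 = nn.Parameter(torch.randn(max_rank, d_in) * 0.01)
        self.sigma = nn.Parameter(torch.ones(max_rank))
        self.s = nn.Parameter(torch.tensor(s_init))  # learnable threshold parameter

    def forward(self, x):
        thr = torch.sigmoid(self.s)                            # alpha = g(s)
        sig = torch.sign(self.sigma) * F.relu(self.sigma.abs() - thr)
        # Diagonal entries that fall below the threshold become exactly zero,
        # so the effective rank of X2 @ diag(sig) @ X3 shrinks during training.
        return F.linear(x, self.X2 @ torch.diag(sig) @ self.X3)
```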
Prateek, Raghav*, Mitchell*, Vivek*, Aditya, Sham, Ali