Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization



SLIDE 1

Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization

Hesham Mostafa (Intel AI) Xin Wang (Intel AI, Cerebras Systems)

SLIDE 2

Easy: post-training (sparse) compression
Hard: direct training of sparse networks

[Figure: compression]
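For reference, the "easy" path in its simplest form is one-shot magnitude pruning of an already-trained dense tensor (Zhu & Gupta actually prune gradually during fine-tuning; the sketch below, with an assumed target_sparsity parameter, is only a simplified illustration):

    import numpy as np

    def compress(trained_W, target_sparsity=0.9):
        # Post-training compression sketch: keep the largest-magnitude
        # weights of a trained dense tensor and zero out the rest.
        W = trained_W.copy()
        k = int(round(target_sparsity * W.size))           # weights to remove
        threshold = np.partition(np.abs(W).ravel(), k)[k]  # k-th smallest magnitude
        W[np.abs(W) < threshold] = 0.0
        return W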

SLIDE 3

“Winning lottery tickets” (Frankle & Carbin 2018): post hoc identification of trainable sparse nets

[Figure: compression]
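A rough sketch of the lottery-ticket procedure, which finds the trainable sparse structure only post hoc; train and prune_mask are hypothetical stand-ins for a full training loop and magnitude pruning:

    def find_winning_ticket(W_init, train, prune_mask):
        # Post hoc identification (Frankle & Carbin): the sparse topology
        # comes from a fully trained dense net, and surviving weights are
        # rewound to their original initialization before retraining.
        W_trained = train(W_init)      # 1. train the dense network
        mask = prune_mask(W_trained)   # 2. magnitude-prune the trained weights
        W_ticket = W_init * mask       # 3. rewind survivors to their init values
        return train(W_ticket)         # 4. retrain the sparse "winning ticket"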

SLIDE 4

Dynamic sparse reparameterization (ours): training-time structural exploration

SLIDE 5

Can directly trained sparse nets generalize as well as post-training compression? YES

Are directly trained sparse nets “winning lottery tickets”? NO

SLIDE 6

Dynamic sparse reparameterization

1: for each sparse parameter tensor W_i do
2:     (W_i, k_i) ← prune_by_threshold(W_i, H)    ◃ k_i is the number of pruned weights
3:     l_i ← number_of_nonzero_entries(W_i)       ◃ l_i is the number of surviving weights after pruning
4: end for
5: (K, L) ← (Σ_i k_i, Σ_i l_i)                   ◃ total pruned and surviving weights across tensors
6: H ← adjust_pruning_threshold(H, K, δ)         ◃ adjust the pruning threshold
7: for each sparse parameter tensor W_i do
8:     W_i ← grow_back(W_i, (l_i / L) · K)       ◃ grow (l_i/L)·K zero-initialized weights at random in W_i
9: end for

[Diagram: prune → grow cycle]
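As a rough illustration of the pseudocode, here is one prune/grow step in NumPy over per-tensor boolean masks; target_K, the multiplicative threshold update, and the mask bookkeeping are sketch-level assumptions rather than the paper's exact implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    def reparameterize(weights, masks, H, target_K=20_000, delta=0.1):
        # One dynamic sparse reparameterization step over weight tensors W_i
        # and boolean masks M_i (True = active weight). Modifies in place.
        ks, ls = [], []
        # Prune: deactivate active weights whose magnitude falls below H.
        for W, M in zip(weights, masks):
            pruned = M & (np.abs(W) < H)
            ks.append(int(pruned.sum()))   # k_i: weights pruned in this tensor
            M &= ~pruned
            W[~M] = 0.0
            ls.append(int(M.sum()))        # l_i: surviving weights
        K, L = sum(ks), sum(ls)
        # Adjust H so that roughly target_K weights are pruned per step
        # (the multiplicative rule here is an illustrative assumption).
        if K < (1 - delta) * target_K:
            H *= 2.0
        elif K > (1 + delta) * target_K:
            H *= 0.5
        # Grow: redistribute the K freed parameters across tensors in
        # proportion to each tensor's surviving count l_i; regrown weights
        # are zero-initialized (their mask flips on, their value stays 0).
        for M, l in zip(masks, ls):
            free = np.flatnonzero(~M)
            n_grow = min(int(round(K * l / L)), free.size)
            if n_grow > 0:
                grown = rng.choice(free, size=n_grow, replace=False)
                M.flat[grown] = True
        return H

Running a step like this every few hundred training iterations lets the sparse topology migrate between layers while the total parameter count stays fixed.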

SLIDE 7

Closed the gap between post-training compression and direct training of sparse nets

ResNet-50 on ImageNet: top-1 / top-5 test accuracy (%), [difference from full dense]

Sparsity (# param)                      0.8 (7.3M)                  0.9 (5.1M)
Thin dense                              72.4 [-2.5] / 90.9 [-1.5]   70.7 [-4.2] / 89.9 [-2.5]
Static sparse                           71.6 [-3.3] / 90.4 [-2.0]   67.8 [-7.1] / 88.4 [-4.0]
DeepR (Bellec et al., 2017)             71.7 [-3.2] / 90.6 [-1.8]   70.2 [-4.7] / 90.0 [-2.4]
SET (Mocanu et al., 2018)               72.6 [-2.3] / 91.2 [-1.2]   70.4 [-4.5] / 90.1 [-2.3]
Dynamic sparse (ours)                   73.3 [-1.6] / 92.4 [0.0]    71.6 [-3.3] / 90.5 [-1.9]
Compressed sparse (Zhu & Gupta, 2017)   73.2 [-1.7] / 91.5 [-0.9]   70.3 [-4.6] / 90.0 [-2.4]

Full dense baseline: sparsity 0.0 (25.6M params), 74.9 / 92.4

[Plot: test accuracy (%) of WRN-28-2 on CIFAR10 (vs. number of parameters, K) and ResNet-50 on ImageNet (vs. global sparsity), comparing Static sparse, Compressed sparse, Dynamic sparse, Thin dense, SET, DeepR, and Full dense]

SLIDE 8

Directly trained sparse nets are not “winning tickets”: exploration of structural degrees of freedom is crucial

SLIDE 9

Visit our poster: Wednesday, Pacific Ballroom #248