Maximal Sparsity with Deep Networks?
Bo Xin1,2, Yizhou Wang1, Wen Gao1 and David Wipf2
1 Peking University 2 Microsoft Research, Beijing
Outline

- Background and motivation
- Unfolding iterative algorithms
- Theoretical analysis
The maximal sparsity problem:

min_x ||x||₀  s.t.  y = Φx,

where ||x||₀ counts the number of nonzero elements in x [Candès and Tao, 2004].

Numerous practical applications.

Fundamental weakness: the combinatorial problem is NP-hard, and close approximations are highly non-convex.

Practical alternatives: iterative algorithms such as IHT and ISTA, introduced below.
Restricted isometry property (RIP): a matrix Φ satisfies the RIP with constant δ_k[Φ] < 1 if

(1 − δ_k[Φ]) ||x||₂² ≤ ||Φx||₂² ≤ (1 + δ_k[Φ]) ||x||₂²

holds for all {x : ||x||₀ ≤ k} [Candès et al., 2006].

(Figure: a dictionary with nearly uncorrelated columns has a small RIP constant δ₂[Φ]; one with highly correlated columns has a large RIP constant δ₂[Φ].)
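The RIP constant is a supremum over all k-sparse vectors and is itself intractable to compute exactly. The following Monte Carlo sketch (my own illustration, not from the slides; dictionary sizes are arbitrary) at least gives a lower bound and exposes the qualitative gap between dictionaries:

```python
import numpy as np

def rip_constant_lower_bound(Phi, k, trials=2000, seed=None):
    """Monte Carlo lower bound on the RIP constant delta_k[Phi].

    Samples random k-sparse unit vectors x and records the largest
    deviation of ||Phi x||_2^2 from ||x||_2^2 = 1. The true constant is
    a supremum over all k-sparse x, so sampling only gives a lower bound.
    """
    rng = np.random.default_rng(seed)
    n = Phi.shape[1]
    delta = 0.0
    for _ in range(trials):
        support = rng.choice(n, size=k, replace=False)
        x = np.zeros(n)
        x[support] = rng.standard_normal(k)
        x /= np.linalg.norm(x)
        delta = max(delta, abs(np.linalg.norm(Phi @ x) ** 2 - 1.0))
    return delta

# Example: an iid Gaussian dictionary with variance 1/m has nearly
# uncorrelated columns, hence a small (estimated) RIP constant.
Phi = np.random.default_rng(0).standard_normal((100, 200)) / np.sqrt(100)
print(rip_constant_lower_bound(Phi, k=5, seed=1))
```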
[Blumensath and Davies, 2009]: Suppose there exists some x* such that

y = Φx*,  ||x*||₀ ≤ k,  δ_3k[Φ] < 1/√32.

Then the IHT iterations are guaranteed to converge to x*.

The bound on the RIP constant means only very small degrees of correlation can be tolerated.
Iterative hard thresholding (IHT) and iterative soft thresholding (ISTA):

Gradient step (linear operation):

u = x^(t) − μ ∇f(x)|_{x=x^(t)},  where  ∇f(x)|_{x=x^(t)} = ΦᵀΦx^(t) − Φᵀy.

Threshold step (nonlinear operation):

IHT:  x_i^(t+1) = u_i if |u_i| is among the k largest entries (in magnitude), 0 otherwise.
ISTA: x_i^(t+1) = sign(u_i)(|u_i| − λ) if |u_i| > λ, 0 otherwise.
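For concreteness, here is a minimal NumPy sketch of both iterations. The step size μ = 1/‖Φ‖₂² and all problem dimensions are illustrative assumptions, not choices from the slides:

```python
import numpy as np

def grad(Phi, x, y):
    # Gradient of f(x) = 0.5 * ||y - Phi x||_2^2.
    return Phi.T @ (Phi @ x) - Phi.T @ y

def iht(Phi, y, k, mu, iters=200):
    """Iterative hard thresholding: keep the k largest entries in magnitude."""
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        u = x - mu * grad(Phi, x, y)        # linear (gradient) step
        x = np.zeros_like(u)
        top = np.argsort(np.abs(u))[-k:]    # indices of the k largest |u_i|
        x[top] = u[top]                     # hard threshold (nonlinear step)
    return x

def ista(Phi, y, lam, mu, iters=200):
    """Iterative soft thresholding: shrink every entry toward zero by lam."""
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        u = x - mu * grad(Phi, x, y)
        x = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # soft threshold
    return x

# Recover a 5-sparse x* from noiseless measurements y = Phi x*.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 200)) / np.sqrt(100)
x_star = np.zeros(200)
x_star[rng.choice(200, 5, replace=False)] = rng.standard_normal(5)
y = Phi @ x_star
mu = 1.0 / np.linalg.norm(Phi, 2) ** 2   # step size from the spectral norm
print(np.linalg.norm(iht(Phi, y, k=5, mu=mu) - x_star))
```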
Basic DNN template:

x → f(W^(1)x + b^(1)) → f(W^(2)· + b^(2)) → … → f(W^(t)· + b^(t)) → …

i.e., x^(t+1) = f(W^(t)x^(t) + b^(t)): a linear filter followed by a nonlinearity/threshold.
Observation: many common iterative algorithms follow the exact same script.

Fast sparse encoders: [Gregor and LeCun, 2010], [Wang et al., 2015]. What's more?
Unfolding the IHT/ISTA iterations yields exactly this template (linear filter plus non-linearity), with every layer sharing weights fixed by the algorithm:

W = I − μΦᵀΦ,  b = μΦᵀy.

Question 1: So is there an advantage to learning the weights?
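To make the unfolding explicit, the sketch below (my own illustration) runs T IHT iterations as a T-layer network whose shared weights come from the formula above; learned variants such as LISTA [Gregor and LeCun, 2010] instead train separate W^(t), b^(t) per layer by backpropagation:

```python
import numpy as np

def hard_threshold(u, k):
    # Keep only the k largest-magnitude entries of u.
    x = np.zeros_like(u)
    top = np.argsort(np.abs(u))[-k:]
    x[top] = u[top]
    return x

def unfolded_iht(Phi, y, k, mu, T=20):
    """T iterations of IHT written as a T-layer network x <- f(W x + b),
    where every layer shares the algorithm-prescribed weights."""
    n = Phi.shape[1]
    W = np.eye(n) - mu * Phi.T @ Phi   # linear filter of every layer
    b = mu * Phi.T @ y                 # bias of every layer
    x = np.zeros(n)
    for _ in range(T):                 # one loop pass = one network layer
        x = hard_threshold(W @ x + b, k)
    return x
```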
Low correlation: easy. High correlation: hard.

Φ^(cor) = Φ^(uncor) + Δ,

where Δ is low rank and Φ^(uncor) has iid N(0, ν) entries. Then

Φ^(uncor) → δ_3k[Φ] < 1/√32 (small RIP constant),
Φ^(cor)  → δ_3k[Φ] ≫ 1/√32 (large RIP constant).
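A quick way to see the effect is an illustrative construction (reusing `iht` from the sketch above; the dimensions and the rank of Δ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, r = 50, 100, 5, 2

Phi_uncor = rng.standard_normal((m, n)) / np.sqrt(m)               # iid entries
Delta = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank r
Phi_cor = Phi_uncor + Delta                                        # correlated

def success_rate(Phi, trials=100):
    """Fraction of random k-sparse signals exactly recovered by IHT."""
    hits = 0
    mu = 1.0 / np.linalg.norm(Phi, 2) ** 2
    for _ in range(trials):
        x_star = np.zeros(n)
        x_star[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
        x_hat = iht(Phi, Phi @ x_star, k, mu)
        hits += np.linalg.norm(x_hat - x_star) < 1e-3 * np.linalg.norm(x_star)
    return hits / trials

print(success_rate(Phi_uncor), success_rate(Phi_cor))  # expect high vs. low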
Theorem 1 [Xin et al., 2016]: There will always exist layer weights W and biases b such that the effective RIP constant is reduced via

δ*_3k[Φ] ≜ inf_{Ψ,D} δ_3k[ΨΦD] ≤ δ_3k[Φ],

where Ψ is arbitrary and D is diagonal. It is therefore possible to reduce high RIP constants.
So we can "undo" low-rank correlations that would otherwise produce a high RIP constant …
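Theorem 1 does not specify the minimizing Ψ. As an illustration only, one feasible (generally suboptimal) choice is a whitening preconditioner, which flattens the spectrum of Φ and thereby suppresses a dominant low-rank component; the snippet reuses `Phi_cor` and `rip_constant_lower_bound` from the earlier sketches:

```python
import numpy as np

def whitening_precondition(Phi, eps=1e-6):
    """Psi = (Phi Phi^T + eps I)^{-1/2}: one feasible choice of Psi in
    Theorem 1 (not necessarily the minimizer), via eigendecomposition."""
    w, V = np.linalg.eigh(Phi @ Phi.T + eps * np.eye(Phi.shape[0]))
    Psi = V @ np.diag(w ** -0.5) @ V.T
    return Psi @ Phi

# Unit-normalizing the columns plays the role of the diagonal D.
Phi_pre = whitening_precondition(Phi_cor)
Phi_pre /= np.linalg.norm(Phi_pre, axis=0)
Phi_cor_n = Phi_cor / np.linalg.norm(Phi_cor, axis=0)
print(rip_constant_lower_bound(Phi_cor_n, k=15),
      rip_constant_lower_bound(Phi_pre, k=15))  # large vs. much smaller
```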
Theorem 2 [Xin et al., 2016]: Suppose we have a correlated dictionary formed via

Φ^(cor) = Φ^(uncor) + Δ,

with Φ^(uncor) having iid N(0, ν) entries and Δ sufficiently low rank. Then

E[δ*_3k(Φ^(cor))] ≈ E[δ_3k(Φ^(uncor))],

i.e., the large RIP constant of Φ^(cor) is reduced to the small RIP constant of an uncorrelated dictionary.
Question 2: Do independent weights (and activations) on each layer have the potential to do even better? Yes.

Theorem 3 [Xin et al., 2016]: With independent weights W^(t), b^(t) on each layer, it is often possible to obtain a nearly ideal RIP constant even when a full-rank Δ is present.
Multi-dictionary setting: Φ = [Φ₁, …, Φ_n], with Φ_i = Φ_i^(uncor) + Δ_i. The corresponding layer update uses adaptive, layer-wise activations:

x^(t+1) = f(Ω_on^(t), Ω_off^(t)) [W^(t) x^(t) + b^(t)],

where the nonlinearity f is parameterized by estimated "on" and "off" supports Ω_on^(t), Ω_off^(t).
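The slides give only the functional form of f. The sketch below fixes one plausible gating semantics purely for illustration; the pass/zero/soft-threshold behavior is my assumption, not the paper's definition:

```python
import numpy as np

def gated_layer(x, W, b, omega_on, omega_off, lam=0.05):
    """One layer of x^(t+1) = f(Omega_on, Omega_off)[W x + b].

    Assumed semantics (for illustration only): entries in the estimated
    'on' support pass through unshrunk, entries in the 'off' support are
    zeroed, and undecided entries are soft-thresholded as in ISTA.
    omega_on / omega_off are integer index arrays.
    """
    u = W @ x + b
    out = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # default: ISTA-style
    out[omega_on] = u[omega_on]    # confident nonzeros: no shrinkage
    out[omega_off] = 0.0           # confident zeros: forced off
    return out
```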
Summary: idealized deep network weights exist that improve RIP constants. Practical design choices:

- Loss: estimate the support of x* (multi-label classification; see the sketch after this list). A loss based on ||y − Φx||₂² would be unaware of the support and expend undue effort matching coefficient magnitudes.
- Architecture: layers with gated information flow (residual/LSTM-style).
- Training data: multi-scale sparse estimation problems, with low-rank correlation Δᵢ = (1/n²) vᵢwᵢᵀ, where vᵢ, wᵢ are iid N(0,1).

Question 3: Does deep learning really learn ideal weights as analyzed? Hard to say, but it DOES achieve strong empirical performance for maximal sparsity: strong estimation accuracy and robustness to training distributions.
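A minimal sketch of such a support-based loss, assuming the network emits one logit per coefficient (the function name and interface are hypothetical):

```python
import numpy as np

def support_loss(logits, x_star):
    """Multi-label support-classification loss (a sketch).

    Each coefficient gets a binary label: is index i in the support of
    x*? Elementwise sigmoid cross-entropy then penalizes support errors
    directly, unlike ||y - Phi x||_2^2, which also chases magnitudes.
    """
    labels = (x_star != 0).astype(float)
    p = 1.0 / (1.0 + np.exp(-logits))   # per-coefficient probabilities
    eps = 1e-12
    return -np.mean(labels * np.log(p + eps)
                    + (1.0 - labels) * np.log(1.0 - p + eps))
```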
Application: robust photometric stereo [Ikehata et al., 2012].

Per-pixel observations across lightings: y = Ln + x, where L is the lighting matrix, n the unknown surface normal, and x the sparse outliers (specular reflections, shadows, etc.).

Projecting onto the null space of Lᵀ removes the normal and leaves a sparse estimation problem:

z ≜ Proj_{Null(Lᵀ)} y = Proj_{Null(Lᵀ)}(Ln + x) = Proj_{Null(Lᵀ)} x.

Once the outliers are known, we can estimate n via least squares:

n̂ = (LᵀL)⁻¹ Lᵀ (y − x̂).
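Putting the pipeline together for a single pixel (a sketch with a hypothetical interface, reusing `iht` from above; any sparse solver, including the learned network, could take its place):

```python
import numpy as np

def robust_photometric_stereo(L, y, k):
    """Sketch of the outlier-projection pipeline for one pixel.

    L : (m, 3) lighting matrix; y : (m,) observed intensities;
    k : assumed number of outliers (shadows, specularities).
    """
    m = L.shape[0]
    P = np.eye(m) - L @ np.linalg.solve(L.T @ L, L.T)    # Proj onto Null(L^T)
    z = P @ y                       # z = P(Ln + x) = Px: sparse problem in x
    x_hat = iht(P, z, k, mu=1.0)    # ||P||_2 = 1, so mu = 1 is a safe step
    n_hat = np.linalg.solve(L.T @ L, L.T @ (y - x_hat))  # least squares for n
    return n_hat, x_hat
```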
Conclusions:

- Sparse estimation algorithms can be provably enhanced by learning.
- The analysis suggests how architecture and loss choices affect performance.
- Learning helps most when dictionary columns are correlated (i.e., high RIP constants).
- Network architecture: Residual LSTM.