Maximal Sparsity with Deep Networks? Bo Xin¹,², Yizhou Wang¹, Wen Gao¹ and David Wipf² — ¹Peking University, ²Microsoft Research, Beijing
Outline • Background and motivation • Unfolding iterative algorithms • Theoretical analysis • Practical designs and empirical support • Applications • Discussion
Maximal Sparsity

min_{x ∈ R^m} ‖x‖₀  s.t.  y = Φx,  where Φ ∈ R^{n×m}, y ∈ R^n, and ‖·‖₀ counts the number of non-zeros.
Maximal Sparsity is NP-hard

min_x ‖x‖₀  s.t.  y = Φx

Combinatorial and NP-hard, and close approximations are highly non-convex. Practical alternatives:
• ℓ₁-norm minimization
• Orthogonal matching pursuit (OMP)
• Iterative hard thresholding (IHT)
Pros and Cons

Numerous practical applications:
• Feature selection [Cotter and Rao, 2001; Figueiredo, 2002]
• Outlier removal [Candès and Tao, 2005; Ikehata et al., 2012]
• Compressive sensing [Donoho, 2006]
• Source localization [Baillet et al., 2001; Malioutov et al., 2005]
• Computer vision applications [Wright et al., 2009]

Fundamental weakness: if the Gram matrix ΦᵀΦ has high off-diagonal energy, estimation of x* can be extremely poor.
Restricted Isometry Property (RIP)

A matrix Φ satisfies the RIP with constant ε_k[Φ] < 1 if

(1 − ε_k[Φ]) ‖x‖₂² ≤ ‖Φx‖₂² ≤ (1 + ε_k[Φ]) ‖x‖₂²

holds for all {x : ‖x‖₀ ≤ k}. [Candès et al., 2006]

(Figure: dictionaries with small vs. large RIP constant ε_{2k}[Φ].)
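Evaluating ε_k[Φ] exactly requires maximizing over all k-sparse supports, which is combinatorial. As an illustrative sanity check (the function name `rip_estimate` and the sampling scheme are mine, not from the slides), one can compute a Monte Carlo lower bound:

```python
import numpy as np

def rip_estimate(Phi, k, trials=2000, rng=None):
    """Monte Carlo LOWER bound on the RIP constant eps_k[Phi].

    The true constant maximizes | ||Phi x||^2 - 1 | over ALL k-sparse unit
    vectors x; here we only sample random supports and coefficients."""
    rng = rng or np.random.default_rng(0)
    m = Phi.shape[1]
    worst = 0.0
    for _ in range(trials):
        S = rng.choice(m, size=k, replace=False)   # random k-sparse support
        x = rng.standard_normal(k)
        x /= np.linalg.norm(x)                     # unit-norm coefficients
        worst = max(worst, abs(np.linalg.norm(Phi[:, S] @ x) ** 2 - 1.0))
    return worst
```

An orthonormal Φ yields an estimate of 0, while a random Gaussian dictionary with more columns than rows yields a strictly positive value.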
Guaranteed Recovery with IHT

Suppose there exists some x* such that

y = Φx*,  ‖x*‖₀ ≤ k,  ε_{3k}[Φ] < 1/√32.

Then the IHT iterations are guaranteed to converge to x*. [Blumensath and Davies, 2009]

Only very small degrees of correlation can be tolerated.
Checkpoint

• Thus far:
  • Maximal sparsity is NP-hard
  • Practical alternatives suffer when the dictionary has a high RIP constant
• What's coming:
  • A deep-learning-based perspective
  • Technical analysis
Iterative Algorithms: Hard Thresholding (IHT) and Soft Thresholding (ISTA)

while not converged, do {
    ∇x = Φᵀ Φ x⁽ᵗ⁾ − Φᵀ y        (linear op)
    z = x⁽ᵗ⁾ − μ ∇x
    x⁽ᵗ⁺¹⁾ = thrsh(z)            (nonlinear op)
}

IHT: x_i⁽ᵗ⁺¹⁾ = z_i if |z_i| is among the k largest, 0 otherwise.

ISTA: x_i⁽ᵗ⁺¹⁾ = sign(z_i)(|z_i| − λ) if |z_i| > λ, 0 otherwise.
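The two algorithms share the gradient step and differ only in the thresholding nonlinearity. A minimal NumPy sketch (the function names and the conservative step size μ = 1/‖Φ‖² are my choices, not from the slides):

```python
import numpy as np

def iht(Phi, y, k, iters=500, mu=None):
    """Iterative hard thresholding: gradient step, then keep the k
    largest-magnitude entries and zero the rest."""
    if mu is None:
        mu = 1.0 / np.linalg.norm(Phi, 2) ** 2   # conservative step size
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ x - y)             # linear op
        z = x - mu * grad
        x = np.zeros_like(z)
        top = np.argsort(np.abs(z))[-k:]         # nonlinear op: hard threshold
        x[top] = z[top]
    return x

def ista_step(x, Phi, y, mu, lam):
    """One ISTA iteration: same gradient step, soft threshold instead."""
    z = x - mu * (Phi.T @ (Phi @ x - y))
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

On a well-conditioned Gaussian dictionary (small RIP constant), `iht` recovers the true sparse vector, consistent with the recovery guarantee above.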
Deep Network = Unfolded Optimization?

Basic DNN template: alternate a linear filter W⁽ˡ⁾x + b⁽ˡ⁾ with a nonlinearity/threshold, layer after layer.

Observation: many common iterative algorithms follow the exact same script:

x⁽ᵗ⁺¹⁾ = f(W⁽ᵗ⁾ x⁽ᵗ⁾ + b⁽ᵗ⁾)
Deep Network = Unfolded Optimization?

Fast sparse encoders: [Gregor and LeCun, 2010; Wang et al., 2015]

What's more?
Unfolded IHT Iterations

Each layer applies the linear filter and then the non-linearity, with

W = I − μ Φᵀ Φ,   b = μ Φᵀ y.

Question 1: So is there an advantage to learning the weights?
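The unfolded view can be made concrete: with fixed W = I − μΦᵀΦ and b = μΦᵀy, running IHT for T iterations is literally a T-layer network with tied weights. A sketch under those definitions (the function name and default depth are illustrative):

```python
import numpy as np

def unfolded_iht(Phi, y, k, depth=500, mu=None):
    """IHT rewritten as a depth-layer network with tied weights:
    every layer computes hard_threshold(W x + b) with the SAME W, b."""
    if mu is None:
        mu = 1.0 / np.linalg.norm(Phi, 2) ** 2
    m = Phi.shape[1]
    W = np.eye(m) - mu * Phi.T @ Phi       # linear filter, fixed by Phi
    b = mu * Phi.T @ y                     # bias, fixed by Phi and y
    x = np.zeros(m)
    for _ in range(depth):
        z = W @ x + b                      # affine layer
        x = np.zeros_like(z)               # hard-threshold nonlinearity
        top = np.argsort(np.abs(z))[-k:]
        x[top] = z[top]
    return x
```

Learning would replace the fixed (W, b) with trained, possibly layer-specific, parameters — which is exactly the question the slide poses.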
Effects of Correlation Structure

Low correlation (easy): Φ⁽ᵘⁿᶜᵒʳ⁾ with entries iid N(0, ν) — small RIP constant, ε_{3k}[Φ⁽ᵘⁿᶜᵒʳ⁾] < 1/√32.

High correlation (hard): Φ⁽ᶜᵒʳ⁾ = Φ⁽ᵘⁿᶜᵒʳ⁾ + Δ with Δ low rank — large RIP constant, ε_{3k}[Φ⁽ᶜᵒʳ⁾] > 1/√32.
Performance Bound with Learned Layer Weights

Theorem 1. There will always exist layer weights W and bias b such that the effective RIP constant is reduced via

ε*_{3k}[Φ] ≜ inf_{Ψ,D} ε_{3k}[ΨΦD] ≤ ε_{3k}[Φ],

where Ψ is arbitrary and D is diagonal; ε*_{3k}[Φ] is the effective RIP constant and ε_{3k}[Φ] the original one. [Xin et al., 2016]

It is therefore possible to reduce high RIP constants.
Practical Consequences

Theorem 2. Suppose we have a correlated dictionary formed via

Φ⁽ᶜᵒʳ⁾ = Φ⁽ᵘⁿᶜᵒʳ⁾ + Δ    (large RIP ← small RIP)

with Φ⁽ᵘⁿᶜᵒʳ⁾ having iid N(0, ν) entries and Δ sufficiently low rank. Then

ε*_{3k}[Φ⁽ᶜᵒʳ⁾] ≈ ε_{3k}[Φ⁽ᵘⁿᶜᵒʳ⁾].  [Xin et al., 2016]

So we can 'undo' low-rank correlations that would otherwise produce a high RIP constant.
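The mechanism behind Theorem 2 can be illustrated numerically: adding a low-rank Δ to a Gaussian dictionary raises its coherence (used here as a cheap proxy for the RIP constant, since computing ε_k exactly is combinatorial), and a projector Ψ that annihilates Δ's column space undoes the corruption exactly. The construction below, including the scaling of Δ, is my own illustration, not the paper's:

```python
import numpy as np

def coherence(A):
    """Max off-diagonal magnitude of the column-normalized Gram matrix
    (a cheap proxy for the RIP constant)."""
    A = A / np.linalg.norm(A, axis=0)
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(1)
n, m, r = 60, 100, 2
Phi_u = rng.standard_normal((n, m)) / np.sqrt(n)   # 'uncorrelated' dictionary
U = rng.standard_normal((n, r)) / np.sqrt(n)
V = 2.0 * rng.standard_normal((m, r))
Delta = U @ V.T                                    # low-rank correlation term
Phi_c = Phi_u + Delta                              # 'correlated' dictionary

# Psi projects onto the orthogonal complement of Delta's column space,
# so Psi @ Phi_c == Psi @ Phi_u: the low-rank correlation is undone exactly.
Q, _ = np.linalg.qr(U)
Psi = np.eye(n) - Q @ Q.T
```

Here Ψ is one feasible choice in the infimum of Theorem 1 (with D = I); the theorem's point is that learned layer weights can realize such transformations.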
Advantages of Independent Layer Weights (and Activations)

Theorem 3. With independent weights on each layer, it is often possible to obtain a nearly ideal RIP constant even when full-rank correlation is present. [Xin et al., 2016]

Question 2: Do independent weights (and activations) have the potential to do even better? Yes.
Advantages of Independent Layer Weights (and Activations)

Φ_i = Φ_i⁽ᵘⁿᶜᵒʳ⁾ + Δ_i,   Φ = [Φ_1, …, Φ_c]

x⁽ᵗ⁺¹⁾ = H_{Ω_on⁽ᵗ⁾, Ω_off⁽ᵗ⁾} [W⁽ᵗ⁾ x⁽ᵗ⁾ + b⁽ᵗ⁾]
Checkpoint

• Thus far: idealized deep network weights exist that improve RIP constants.
• What's coming:
  • Practical designs to facilitate success
  • Empirical results
  • Applications
Alternative Learning-Based Strategy

• Treat sparse recovery as a multi-label DNN classification problem to estimate the support of x*.
• The main challenge is estimating supp[x*]:
  • Once the support is obtained, computing the actual values is trivial.
  • A loss based on ‖y − Φx‖₂² would be unaware of this and expend undue effort matching coefficient magnitudes.
• Specifically, we learn to find supp[x*] using a multi-label softmax loss layer.
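The slides' exact multi-label softmax layer is not reproduced here; as an illustrative stand-in, the sketch below uses a per-coordinate binary cross-entropy on the support indicator, plus the "trivial" least-squares refit once the support is known (both function names are mine):

```python
import numpy as np

def support_loss(logits, support_mask):
    """Per-coordinate binary cross-entropy on the support indicator
    (an illustrative stand-in for the slides' multi-label loss layer)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -np.mean(support_mask * np.log(p + eps)
                    + (1.0 - support_mask) * np.log(1.0 - p + eps))

def refit_on_support(Phi, y, support):
    """Once the support is known, coefficient values follow from a
    least-squares fit restricted to the selected columns."""
    x = np.zeros(Phi.shape[1])
    x[support] = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
    return x
```

The refit step makes the slide's point concrete: given the correct support, the exact coefficients are recovered by ordinary least squares, so the network only needs to classify support membership.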
Alternative Learning-Based Strategy

• Adopt highway and gating mechanisms:
  • In relatively deep nets for challenging problems, such designs help with information flow.
  • Our analysis shows such designs seem natural for challenging multi-scale sparse estimation problems.
• The philosophy of generating training sets:
  • Generative perspective: x* → y = Φx*
  • Not: y → x* = optimization(y)
  • Unsupervised training: the x* are randomly generated.
Experiments

• We generate Φ = Σ_i (1/i²) u_i v_iᵀ, where u_i, v_i have iid N(0, 1) entries:
  • Super-linearly decaying singular values
  • Structured, but quite general
• Values of x*:
  • U-distribution: drawn from U[−0.5, 0.5] excluding U[−0.1, 0.1]
  • N-distribution: drawn from N(+0.3, 0.1) and N(−0.3, 0.1)
• Experiments:
  • Basic: U2U
  • Cross: U2N, N2U
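The data-generation recipe above can be sketched as follows (function names are mine; whether 0.1 in N(±0.3, 0.1) denotes the standard deviation or the variance is not stated on the slide — the sketch assumes standard deviation):

```python
import numpy as np

def make_dictionary(n, m, decay=2.0, rng=None):
    """Phi = sum_i i^(-decay) u_i v_i^T with iid N(0,1) factors,
    giving approximately super-linearly decaying singular values."""
    rng = rng or np.random.default_rng(0)
    Phi = np.zeros((n, m))
    for i in range(1, min(n, m) + 1):
        Phi += np.outer(rng.standard_normal(n),
                        rng.standard_normal(m)) / i ** decay
    return Phi

def sample_x_star(m, k, dist="U", rng=None):
    """k-sparse ground truth. 'U': uniform on [-0.5, 0.5] excluding
    [-0.1, 0.1]; 'N': N(+/-0.3, 0.1), std-dev 0.1 assumed."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)  # random support
    signs = rng.choice([-1.0, 1.0], size=k)
    if dist == "U":
        x[support] = signs * rng.uniform(0.1, 0.5, size=k)
    else:
        x[support] = signs * rng.normal(0.3, 0.1, size=k)
    return x
```

Pairing these two generators gives unlimited (x*, y = Φx*) training examples, matching the generative, unsupervised philosophy of the previous slide.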
Results

• Strong estimation accuracy
• Robust to training distributions

Question 3: Does deep learning really learn the ideal weights as analyzed? Hard to say, but it DOES achieve strong empirical performance for maximal sparsity.
Robust Surface Normal Estimation

• Input: …
• Per-pixel model: y = L n + e, where y holds the raw observations under different lightings, L is the lighting matrix, n is the unknown surface normal, and e collects sparse outliers (specular reflections, shadows, etc.).
• Can apply any sparse learning method to obtain the outliers. [Ikehata et al., 2012]
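As a toy stand-in for this sparse-outlier model (not the method of Ikehata et al. or of this paper), one can alternate a least-squares fit of the normal with flagging the largest residuals as outliers; all names and the fixed outlier count `k_out` are illustrative:

```python
import numpy as np

def robust_normal(L, y, k_out, iters=50):
    """Alternate: fit the surface normal by least squares on presumed
    inliers, then flag the k_out largest residuals as outliers
    (a toy stand-in for sparse outlier removal in photometric stereo)."""
    n_obs = len(y)
    inliers = np.ones(n_obs, dtype=bool)
    for _ in range(iters):
        nrm, *_ = np.linalg.lstsq(L[inliers], y[inliers], rcond=None)
        resid = np.abs(y - L @ nrm)
        new_in = np.ones(n_obs, dtype=bool)
        new_in[np.argsort(resid)[-k_out:]] = False   # flag worst residuals
        if np.array_equal(new_in, inliers):
            break                                    # support stabilized
        inliers = new_in
    return nrm, ~inliers
```

With gross outliers in only a few observations, the alternation typically locks onto the correct outlier support, after which the normal is recovered exactly from the clean rows.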