  1. Maximal Sparsity with Deep Networks? Bo Xin 1,2, Yizhou Wang 1, Wen Gao 1 and David Wipf 2 (1 Peking University, 2 Microsoft Research, Beijing)

  2. Outline
• Background and motivation
• Unfolding iterative algorithms
• Theoretical analysis
• Practical designs and empirical support
• Applications
• Discussion

  3. Maximal Sparsity
$\min_x \|x\|_0 \;\; \text{s.t.} \;\; y = \Phi x$, where $\Phi \in \mathbb{R}^{n \times m}$, $y \in \mathbb{R}^n$, $x \in \mathbb{R}^m$, and $\|\cdot\|_0$ counts the number of non-zeros.

  4. Maximal Sparsity is NP-hard
$\min_x \|x\|_0 \;\; \text{s.t.} \;\; y = \Phi x$ is combinatorial (NP-hard), and close approximations are highly non-convex. Practical alternatives:
• $\ell_1$-norm minimization
• orthogonal matching pursuit (OMP)
• iterative hard thresholding (IHT)

  5. Pros and Cons
Numerous practical applications:
• Feature selection [Cotter and Rao, 2001; Figueiredo, 2002]
• Outlier removal [Candès and Tao, 2005; Ikehata et al., 2012]
• Compressive sensing [Donoho, 2006]
• Source localization [Baillet et al., 2001; Malioutov et al., 2005]
• Computer vision applications [Wright et al., 2009]
Fundamental weakness: if the Gram matrix $\Phi^T \Phi$ has high off-diagonal energy, estimation of $x^*$ can be extremely poor.

  6. Restricted Isometry Property (RIP)
A matrix $\Phi$ satisfies the RIP with constant $\delta_k[\Phi] < 1$ if
$(1 - \delta_k[\Phi]) \|x\|_2^2 \;\le\; \|\Phi x\|_2^2 \;\le\; (1 + \delta_k[\Phi]) \|x\|_2^2$
holds for all $\{x : \|x\|_0 \le k\}$. [Candès et al., 2006]
(Figure: a dictionary with a small RIP constant $\delta_{2k}[\Phi]$ vs. one with a large RIP constant $\delta_{2k}[\Phi]$.)
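
To make the definition concrete, here is a small NumPy sketch (my own illustration, not from the talk) that estimates an optimistic lower bound on $\delta_k[\Phi]$ by sampling random $k$-sparse vectors; the exact constant would require checking all $\binom{m}{k}$ supports.

```python
import numpy as np

def empirical_rip(Phi, k, trials=2000, seed=0):
    """Monte-Carlo lower bound on the RIP constant delta_k[Phi].

    Samples random k-sparse vectors x and records how far ||Phi x||^2 / ||x||^2
    strays from 1.  The exact constant would require checking all supports, so
    this is only an optimistic estimate.
    """
    rng = np.random.default_rng(seed)
    n, m = Phi.shape
    worst = 0.0
    for _ in range(trials):
        x = np.zeros(m)
        support = rng.choice(m, size=k, replace=False)
        x[support] = rng.standard_normal(k)
        ratio = np.linalg.norm(Phi @ x) ** 2 / np.linalg.norm(x) ** 2
        worst = max(worst, abs(ratio - 1.0))
    return worst

# Example: an iid Gaussian dictionary with variance-1/n entries is well behaved.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 100)) / np.sqrt(50)
print(empirical_rip(Phi, k=5))
```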

  7. Guaranteed Recovery with IHT
Suppose there exists some $x^*$ such that $y = \Phi x^*$, $\|x^*\|_0 \le k$, and $\delta_{3k}[\Phi] < 1/\sqrt{32}$. Then the IHT iterations are guaranteed to converge to $x^*$. [Blumensath and Davies, 2009]
Only very small degrees of correlation can be tolerated.

  8. Checkpoint
• Thus far
  • Maximal sparsity is NP-hard
  • Practical alternatives suffer when the dictionary has a high RIP constant
• What's coming
  • A deep-learning-based perspective
  • Technical analysis

  9. Iterative algorithms
Iterative hard thresholding (IHT) and iterative soft thresholding (ISTA) share the same loop:
while not converged, do {
  $\nabla x = \Phi^T \Phi x^{(t)} - \Phi^T y$
  $z = x^{(t)} - \mu \nabla x$
  $x^{(t+1)} = \mathrm{thresh}(z)$
}
IHT (hard threshold): $x_i^{(t+1)} = z_i$ if $|z_i|$ is among the $k$ largest, and $0$ otherwise.
ISTA (soft threshold): $x_i^{(t+1)} = \mathrm{sign}(z_i)\,(|z_i| - \lambda)$ if $|z_i| > \lambda$, and $0$ otherwise.

  10. Iterative algorithms (continued)
In both IHT and ISTA, the gradient step $z = x^{(t)} - \mu\,(\Phi^T \Phi x^{(t)} - \Phi^T y)$ is a linear op, while the thresholding step $x^{(t+1)} = \mathrm{thresh}(z)$ is a non-linear op.
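
As a reference point for the unfolding discussion that follows, here is a minimal NumPy sketch of the two update rules above; the problem sizes, step-size choice, and iteration count are illustrative assumptions.

```python
import numpy as np

def iht_step(x, Phi, y, mu, k):
    """One IHT iteration: gradient step, then keep only the k largest magnitudes."""
    z = x - mu * (Phi.T @ (Phi @ x) - Phi.T @ y)
    x_next = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]      # indices of the k largest |z_i|
    x_next[keep] = z[keep]
    return x_next

def ista_step(x, Phi, y, mu, lam):
    """One ISTA iteration: gradient step, then soft-threshold at lambda."""
    z = x - mu * (Phi.T @ (Phi @ x) - Phi.T @ y)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Tiny usage sketch.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 50)) / np.sqrt(20)
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
y = Phi @ x_true
mu = 1.0 / np.linalg.norm(Phi, ord=2) ** 2     # step size below 1/||Phi||^2
x = np.zeros(50)
for _ in range(200):
    x = iht_step(x, Phi, y, mu, k=3)
```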

  11. Deep Network = Unfolded Optimization?
Basic DNN template: each layer applies a linear filter and bias followed by a nonlinearity/threshold,
$x^{(\ell+1)} = f(W^{(\ell)} x^{(\ell)} + b^{(\ell)})$, stacked for layers $\ell = 1, \dots, t$.
Observation: many common iterative algorithms follow the exact same script, $x^{(t+1)} = f(W x^{(t)} + b)$.

  12. Deep Network = Unfolded Optimization?
The same layered template $x^{(\ell+1)} = f(W^{(\ell)} x^{(\ell)} + b^{(\ell)})$ underlies fast sparse encoders: [Gregor and LeCun, 2010], [Wang et al., 2015]. What's more? …

  13. Unfolded IHT Iterations
Each unfolded layer applies the linear filter $W = I - \mu \Phi^T \Phi$ and bias $b = \mu \Phi^T y$, followed by the hard-thresholding non-linearity.
Question 1: So is there an advantage to learning the weights?
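
A minimal sketch of the unfolded view, assuming NumPy: the fixed "layer" weights $W = I - \mu\Phi^T\Phi$ and bias $b = \mu\Phi^T y$ are built from $\Phi$ and applied for $T$ layers; in the learned-network setting these same matrices become trainable parameters instead of being fixed by $\Phi$.

```python
import numpy as np

def unfolded_iht(Phi, y, k, T=20):
    """Run T unfolded IHT 'layers' with the fixed weights W = I - mu*Phi^T Phi
    and bias b = mu*Phi^T y prescribed by the algorithm."""
    n, m = Phi.shape
    mu = 1.0 / np.linalg.norm(Phi, ord=2) ** 2
    W = np.eye(m) - mu * Phi.T @ Phi      # shared linear filter of every layer
    b = mu * Phi.T @ y                    # bias carrying the observation y
    x = np.zeros(m)
    for _ in range(T):
        z = W @ x + b                     # linear op
        x = np.zeros(m)
        keep = np.argsort(np.abs(z))[-k:]
        x[keep] = z[keep]                 # hard-threshold non-linearity
    return x
```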

  14. Effects of Correlation Structure
Low correlation (easy): $\Phi^{(uncor)}$ with iid $N(0, \nu)$ entries has a small RIP constant, $\delta_{3k}[\Phi^{(uncor)}] < 1/\sqrt{32}$.
High correlation (hard): $\Phi^{(cor)} = \Phi^{(uncor)} + \Delta$ with $\Delta$ low rank has a large RIP constant, $\delta_{3k}[\Phi^{(cor)}] > 1/\sqrt{32}$.

  15. Performance Bound with Learned Layer Weights
Theorem 1: There will always exist layer weights $W$ and bias $b$ such that the effective RIP constant is reduced, via
$\delta^*_{3k}[\Phi] \;\equiv\; \inf_{\Psi, D} \delta_{3k}[\Psi \Phi D] \;\le\; \delta_{3k}[\Phi]$,
where $\Psi$ is arbitrary and $D$ is diagonal; the left-hand side is the effective RIP constant, the right-hand side the original one. [Xin et al., 2016]
It is therefore possible to reduce high RIP constants.

  16. Practical Consequences
Theorem 2: Suppose we have a correlated dictionary formed via $\Phi^{(cor)} = \Phi^{(uncor)} + \Delta$ (large RIP constant built from a small one), with $\Phi^{(uncor)}$ having iid $N(0, \nu)$ entries and $\Delta$ sufficiently low rank. Then the effective constant of the correlated dictionary essentially matches that of the uncorrelated one: $\delta^*_{3k}[\Phi^{(cor)}] \approx \delta_{3k}[\Phi^{(uncor)}]$. [Xin et al., 2016]
So we can 'undo' low-rank correlations that would otherwise produce a high RIP constant …
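
To see how a low-rank perturbation manufactures the problematic correlations, here is a small NumPy sketch; the dimensions, the rank $r$, the scalings, and the Gram-matrix "off-diagonal energy" proxy are all illustrative assumptions rather than the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 100, 3                      # r = rank of the correlating perturbation

Phi_uncor = rng.standard_normal((n, m)) / np.sqrt(n)          # iid N(0, 1/n) entries
Delta = rng.standard_normal((n, r)) @ rng.standard_normal((r, m)) / np.sqrt(n * r)
Phi_cor = Phi_uncor + Delta               # low-rank correlation structure

def offdiag_energy(Phi):
    """Mean |cosine similarity| between distinct columns -- a rough proxy for the
    'off-diagonal energy' of the Gram matrix mentioned on slide 5."""
    C = Phi / np.linalg.norm(Phi, axis=0)
    G = np.abs(C.T @ C)
    np.fill_diagonal(G, 0.0)
    return G.sum() / (G.size - len(G))

print(offdiag_energy(Phi_uncor))          # small: columns nearly uncorrelated
print(offdiag_energy(Phi_cor))            # noticeably larger: shared low-rank component
```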

  17. Advantages of Independent Layer Weights (and Activations)
Theorem 3: With independent weights $W^{(\ell)}$ and biases $b^{(\ell)}$ on each layer, it is often possible to obtain a nearly ideal RIP constant even when a full-rank $\Delta$ is present. [Xin et al., 2016]
Question 2: Do independent weights (and activations) have the potential to do even better? Yes.

  18. Advantages of Independent Layer Weights (and Activations)
Cluster-structured dictionary: $\Phi_i = \Phi_i^{(uncor)} + \Delta_i$, $\Phi = [\Phi_1, \dots, \Phi_c]$.
Layer-wise recursion with layer-dependent weights and activations:
$x^{(t+1)} = H_{\Omega_{on}^{(t)}}\big[ W^{(t)} x^{(t)} + b^{(t)} \big]$, with the complementary (inactive) set $\Omega_{off}^{(t)}$, where each layer has its own $W^{(t)}$ and $b^{(t)}$.
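
A minimal PyTorch sketch of the general recipe behind this slide: independent linear weights per layer followed by a hard top-$k$ activation. The class name, the sizes, and the simple top-$k$ gating are assumptions for illustration; the authors' actual network additionally uses highway/gating connections and a support-classification loss (slides 23–24).

```python
import torch
import torch.nn as nn

class UnrolledSparseNet(nn.Module):
    """Unrolled iterations with independent weights W^(t), b^(t) per layer and a
    hard top-k activation.  Illustrative sketch only."""

    def __init__(self, m, num_layers, k):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(m, m) for _ in range(num_layers))
        self.k = k

    def forward(self, b):
        # b plays the role of mu * Phi^T y; start from the zero estimate.
        x = torch.zeros_like(b)
        for layer in self.layers:
            z = layer(x) + b                      # independent linear op per layer
            topk = z.abs().topk(self.k, dim=-1)   # indices of the k largest magnitudes
            mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
            x = z * mask                          # hard-threshold activation (Omega_on)
        return x

net = UnrolledSparseNet(m=100, num_layers=10, k=5)
```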

  22. Checkpoint
• Thus far
  • Idealized deep network weights exist that improve RIP constants
• What's coming
  • Practical designs to facilitate success
  • Empirical results
  • Applications

  23. Alternative Learning-Based Strategy
• Treat sparse recovery as a multi-label DNN classification problem to estimate the support of $x^*$ (see the sketch below).
• The main challenge is estimating $\mathrm{supp}[x^*]$; once the support is obtained, computing the actual values is trivial (a least-squares fit on the support).
• A $\|y - \Phi x\|_2^2$-based loss is unaware of this and expends undue effort matching coefficient magnitudes.
• Specifically, we learn to find $\mathrm{supp}[x^*]$ using a multi-label softmax loss layer.
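
The support-classification idea can be sketched as follows (assuming PyTorch). The talk describes a multi-label softmax loss layer; this sketch uses `BCEWithLogitsLoss` as a common multi-label stand-in, and `recover_values` is a hypothetical helper showing the trivial least-squares refit once the support is known.

```python
import torch
import torch.nn as nn

# One logit per dictionary coefficient; targets are 1{i in supp[x*]}.
support_loss = nn.BCEWithLogitsLoss()

def recover_values(Phi, y, support):
    """Hypothetical helper: once supp[x*] is estimated, the coefficient values
    follow from a least-squares fit restricted to the selected columns."""
    Phi_s = Phi[:, support]
    vals = torch.linalg.lstsq(Phi_s, y.unsqueeze(-1)).solution.squeeze(-1)
    x_hat = torch.zeros(Phi.shape[1], dtype=Phi.dtype)
    x_hat[support] = vals
    return x_hat
```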

  24. Alternative Learning-Based Strategy
• Adopt highway and gating mechanisms
  • Relatively deep nets are needed for challenging problems, and such designs help with information flow
  • Our analysis shows such designs seem natural for challenging multi-scale sparse estimation problems
• The philosophy for generating training sets (see the sketch below)
  • Generative perspective: $x^* \rightarrow y = \Phi x^*$
  • Not $y \rightarrow x^* = \mathrm{optimization}(y)$
  • Unsupervised training: the $x^*$ are randomly generated
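
A sketch of the generative training-set construction, assuming NumPy; the function name, batch layout, and the exact value distribution (a U-distribution-style sampler, cf. slide 25) are illustrative assumptions.

```python
import numpy as np

def make_training_batch(Phi, k, batch_size, rng):
    """Generative data: sample sparse x* first, then form y = Phi x*.
    Network inputs are y; targets are the binary support of x*."""
    n, m = Phi.shape
    X = np.zeros((batch_size, m))
    for b in range(batch_size):
        support = rng.choice(m, size=k, replace=False)
        # U-distribution-style values: magnitudes in [0.1, 0.5] with random sign.
        X[b, support] = rng.uniform(0.1, 0.5, size=k) * rng.choice([-1.0, 1.0], size=k)
    Y = X @ Phi.T                        # observations y = Phi x*
    S = (X != 0).astype(np.float32)      # multi-label support targets
    return Y, S, X
```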

  25. Experiments
• We generate $\Phi = \sum_i \frac{1}{i^2} u_i v_i^T$, where $u_i$, $v_i$ are iid $N(0,1)$ (sketched below)
  • Super-linearly decaying singular values
  • Structured, but quite general
• Values of $x^*$
  • U-distribution: drawn from $U[-0.5, 0.5]$ excluding $U[-0.1, 0.1]$
  • N-distribution: drawn from $N(+0.3, 0.1)$ and $N(-0.3, 0.1)$
• Experiments
  • Basic: U2U
  • Cross: U2N, N2U
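
A NumPy sketch of this synthetic setup; the number of rank-one terms and the interpretation of 0.1 as a standard deviation are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 100

# Dictionary with super-linearly decaying singular values: Phi = sum_i i^{-2} u_i v_i^T.
Phi = sum((1.0 / i**2) * np.outer(rng.standard_normal(n), rng.standard_normal(m))
          for i in range(1, n + 1))

def sample_values_U(k, rng):
    """U-distribution: uniform on [-0.5, 0.5] excluding [-0.1, 0.1]."""
    return rng.uniform(0.1, 0.5, size=k) * rng.choice([-1.0, 1.0], size=k)

def sample_values_N(k, rng):
    """N-distribution: N(+0.3, 0.1) or N(-0.3, 0.1) per entry (0.1 taken as a std)."""
    signs = rng.choice([-1.0, 1.0], size=k)
    return rng.normal(0.3 * signs, 0.1)
```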

  26. Results
Strong estimation accuracy; robust to training distributions.
Question 3: Does deep learning really learn the ideal weights as analyzed? Hard to say, but it DOES achieve strong empirical performance for maximal sparsity.

  27. Robust Surface Normal Estimation
• Input: …
• Per-pixel model: $y = L\, n + x$, where $y$ are the raw observations under different lightings, $L$ is the lighting matrix, $n$ is the unknown surface normal, and $x$ collects the outliers (specular reflections, shadows, etc.)
• Can apply any sparse learning method to obtain the outliers [Ikehata et al., 2012]
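
A rough per-pixel sketch (assuming NumPy) of how a sparse solver plugs into this model: project out the lighting subspace, recover the sparse outlier vector with plain IHT, then refit the normal by least squares. The helper name, the fixed outlier count, and the use of IHT rather than a learned network are illustrative assumptions.

```python
import numpy as np

def robust_normal(L, y, k_outliers, iters=100):
    """Per-pixel robust photometric stereo sketch: y = L n + e with sparse outliers e.
    Any sparse solver -- including a learned deep network -- could replace the IHT loop."""
    q = L.shape[0]
    P = np.eye(q) - L @ np.linalg.pinv(L)        # projector onto the complement of col(L)
    z = P @ y                                    # z = P e, so recover sparse e from P
    mu = 1.0 / np.linalg.norm(P, ord=2) ** 2
    e = np.zeros(q)
    for _ in range(iters):
        g = e - mu * (P.T @ (P @ e) - P.T @ z)   # gradient step
        e = np.zeros(q)
        keep = np.argsort(np.abs(g))[-k_outliers:]
        e[keep] = g[keep]                        # hard threshold: keep k largest
    n_hat, *_ = np.linalg.lstsq(L, y - e, rcond=None)
    return n_hat, e
```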
