  1. Maximal Sparsity with Deep Networks? Bo Xin (1,2), Yizhou Wang (1), Wen Gao (1) and David Wipf (2). (1) Peking University, (2) Microsoft Research, Beijing

  2. Outline
  • Background and motivation
  • Unfolding iterative algorithms
  • Theoretical analysis
  • Practical designs and empirical support
  • Applications
  • Discussion

  3. Maximal Sparsity
  min_x ‖x‖₀  s.t.  y = Φx,  with x ∈ ℝ^m, Φ ∈ ℝ^{n×m}, y ∈ ℝ^n
  ‖·‖₀ counts the number of non-zero entries.

  4. Maximal Sparsity is NP-hard
  min_x ‖x‖₀  s.t.  y = Φx
  The problem is combinatorial (NP-hard), and close approximations are highly non-convex.
  Practical alternatives:
  • ℓ₁-norm minimization
  • Orthogonal matching pursuit (OMP)
  • Iterative hard thresholding (IHT)

  5. Pros and Cons
  Numerous practical applications:
  • Feature selection [Cotter and Rao, 2001; Figueiredo, 2002]
  • Outlier removal [Candès and Tao, 2005; Ikehata et al., 2012]
  • Compressive sensing [Donoho, 2006]
  • Source localization [Baillet et al., 2001; Malioutov et al., 2005]
  • Computer vision applications [Wright et al., 2009]
  Fundamental weakness:
  • If the Gram matrix ΦᵀΦ has high off-diagonal energy, estimation of x* can be extremely poor.

  6. Restricted Isometry Property (RIP)
  A matrix Φ satisfies the RIP with constant δₖ[Φ] < 1 if
  (1 − δₖ[Φ]) ‖x‖₂² ≤ ‖Φx‖₂² ≤ (1 + δₖ[Φ]) ‖x‖₂²
  holds for all {x : ‖x‖₀ ≤ k}. [Candès et al., 2006]
  (Figure: a dictionary with a small RIP constant δ₂ₖ[Φ] vs. one with a large RIP constant δ₂ₖ[Φ].)
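Computing δₖ[Φ] exactly is combinatorial, but the definition can be illustrated by sampling. The sketch below is an illustration under assumed sizes, not material from the slides: it draws random size-k supports and records the worst deviation of the squared singular values of the corresponding submatrices from 1, which lower-bounds δₖ[Φ].

```python
import numpy as np

def sampled_rip_constant(Phi, k, n_trials=2000, seed=0):
    """Monte-Carlo lower bound on the RIP constant delta_k[Phi].

    For each random support S with |S| = k, the extreme squared singular values
    of Phi[:, S] bound ||Phi x||_2^2 / ||x||_2^2 over all x supported on S, so
    the worst deviation from 1 over the sampled supports lower-bounds delta_k.
    """
    rng = np.random.default_rng(seed)
    m = Phi.shape[1]
    delta = 0.0
    for _ in range(n_trials):
        S = rng.choice(m, size=k, replace=False)
        s = np.linalg.svd(Phi[:, S], compute_uv=False)
        delta = max(delta, abs(s[0] ** 2 - 1.0), abs(s[-1] ** 2 - 1.0))
    return delta

# Example: iid Gaussian dictionary with (approximately) unit-norm columns.
rng = np.random.default_rng(1)
n, m, k = 20, 50, 3
Phi = rng.standard_normal((n, m)) / np.sqrt(n)
print("sampled lower bound on delta_k:", sampled_rip_constant(Phi, k))
```

For an iid Gaussian Φ with roughly unit-norm columns and k ≪ n, the sampled value stays small, which is the "small RIP constant" regime sketched above.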

  7. Guaranteed Recovery with IHT
  Suppose there exists some x* such that y = Φx*, ‖x*‖₀ ≤ k, and δ₃ₖ[Φ] < 1/√32.
  Then the IHT iterations are guaranteed to converge to x*. [Blumensath and Davies, 2009]
  Only very small degrees of correlation can be tolerated.

  8. Checkpoint
  Thus far:
  • Maximal sparsity is NP-hard.
  • The practical alternatives suffer when the dictionary has a high RIP constant.
  What's coming:
  • A deep-learning-based perspective
  • Technical analysis

  9. Iterative Algorithms: iterative hard thresholding (IHT) and iterative soft thresholding (ISTA)
  while not converged, do {
      ∇x |_{x=x^(t)} = ΦᵀΦ x^(t) − Φᵀy
      z = x^(t) − μ ∇x |_{x=x^(t)}
      x^(t+1) = thresh(z)
  }
  IHT:  x_i^(t+1) = z_i if |z_i| is among the k largest magnitudes, 0 otherwise.
  ISTA: x_i^(t+1) = sign(z_i)(|z_i| − λ) if |z_i| > λ, 0 otherwise.

  10. Iterative Algorithms (annotated)
  The same IHT/ISTA iterations, with each step labeled: the gradient update z = x^(t) − μ(ΦᵀΦ x^(t) − Φᵀy) is a linear op, and the thresholding x^(t+1) = thresh(z) is a non-linear op.
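For concreteness, here is a minimal NumPy sketch of these two iterations. The function names, the fixed iteration count, and the synthetic test problem are illustrative assumptions, not the authors' code.

```python
import numpy as np

def hard_threshold(z, k):
    """IHT non-linearity: keep the k largest-magnitude entries of z, zero the rest."""
    x = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    x[idx] = z[idx]
    return x

def soft_threshold(z, lam):
    """ISTA non-linearity: shrink every entry of z toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def iterative_threshold(Phi, y, mu, n_iters=300, k=None, lam=None):
    """Run IHT (if k is given) or ISTA (if lam is given) starting from x = 0."""
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ x - y)          # gradient of 0.5 * ||y - Phi x||^2 (linear op)
        z = x - mu * grad                     # gradient step
        x = hard_threshold(z, k) if k is not None else soft_threshold(z, lam)
    return x

# Tiny synthetic example with a well-conditioned random dictionary.
rng = np.random.default_rng(0)
n, m, k = 30, 60, 4
Phi = rng.standard_normal((n, m)) / np.sqrt(n)
x_true = np.zeros(m)
x_true[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
y = Phi @ x_true
mu = 1.0 / np.linalg.norm(Phi, 2) ** 2        # conservative step size
x_hat = iterative_threshold(Phi, y, mu, k=k)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```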

  11. Deep Network = Unfolded Optimization?
  Basic DNN template: a linear filter followed by a nonlinearity/threshold, stacked layer after layer: f(W^(1)x + b^(1)), f(W^(2)x + b^(2)), …, f(W^(t)x + b^(t)).
  Observation: many common iterative algorithms follow the exact same script, x^(t+1) = f(W x^(t) + b).

  12. Deep Network = Unfolded Optimization?
  Unfolding such iterations into this layered template gives fast sparse encoders: [Gregor and LeCun, 2010], [Wang et al., 2015].
  What's more?

  13. Unfolded IHT Iterations
  Each IHT iteration is exactly one layer of the template: linear filter W = I − μΦᵀΦ, bias b = μΦᵀy, and the hard-thresholding operator as the non-linearity.
  Question 1: So is there an advantage to learning the weights?
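The correspondence is easy to verify: building W and b as above and stacking identical layers reproduces plain IHT, and replacing those fixed quantities with learned values is what the rest of the talk analyzes. A minimal sketch, under the same kind of synthetic setup as the previous snippet:

```python
import numpy as np

def hard_threshold(z, k):
    """Keep the k largest-magnitude entries of z, zero the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def unfolded_iht(Phi, y, k, n_layers=200):
    """Run IHT written as a stack of identical layers x <- thresh(W x + b)."""
    mu = 1.0 / np.linalg.norm(Phi, 2) ** 2        # conservative step size (assumption)
    W = np.eye(Phi.shape[1]) - mu * Phi.T @ Phi   # linear filter shared by every layer
    b = mu * Phi.T @ y                            # bias shared by every layer
    x = np.zeros(Phi.shape[1])
    for _ in range(n_layers):                     # one pass = one network layer
        x = hard_threshold(W @ x + b, k)
    return x
```

A learned network keeps exactly this structure but fits W and b (and, later in the talk, layer-specific W^(t), b^(t) plus gated activations) from training data instead of fixing them by the formula above.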

  14. Effects of Correlation Structure
  Low correlation, easy: Φ^(uncor) with iid N(0, ν) entries has a small RIP constant, δ₃ₖ[Φ^(uncor)] < 1/√32.
  High correlation, hard: Φ^(cor) = Φ^(uncor) + Δ with Δ low rank has a large RIP constant, δ₃ₖ[Φ^(cor)] > 1/√32.
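One way to see this effect numerically is to build both dictionaries and compare a simple correlation proxy such as the mutual coherence (the largest off-diagonal entry of the column-normalized Gram matrix). The sizes, scaling, and perturbation rank below are illustrative assumptions:

```python
import numpy as np

def mutual_coherence(Phi):
    """Largest absolute off-diagonal entry of the column-normalized Gram matrix."""
    G = Phi / np.linalg.norm(Phi, axis=0)     # unit-norm columns
    C = np.abs(G.T @ G)
    np.fill_diagonal(C, 0.0)
    return C.max()

rng = np.random.default_rng(0)
n, m, r = 40, 80, 2                           # r = rank of the correlating perturbation
Phi_uncor = rng.standard_normal((n, m)) / np.sqrt(n)
Delta = 5.0 * (rng.standard_normal((n, r)) @ rng.standard_normal((r, m))) / np.sqrt(n * r)
Phi_cor = Phi_uncor + Delta                   # Phi^(cor) = Phi^(uncor) + low-rank Delta

print("coherence, uncorrelated:", mutual_coherence(Phi_uncor))
print("coherence, correlated:  ", mutual_coherence(Phi_cor))
```

The shared low-rank component pushes many column pairs toward alignment, which is exactly the high off-diagonal Gram energy that degrades IHT.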

  15. Performance Bound with Learned Layer Weights
  Theorem 1. There will always exist layer weights W and bias b such that the effective RIP constant is reduced:
  δ*₃ₖ[Φ] := inf_{Ψ, D} δ₃ₖ[ΨΦD] ≤ δ₃ₖ[Φ]   (effective RIP constant ≤ original RIP constant),
  where Ψ is arbitrary and D is diagonal. [Xin et al., 2016]
  It is therefore possible to reduce high RIP constants.

  16. Practical Consequences
  Theorem 2. Suppose we have a correlated dictionary formed via Φ^(cor) = Φ^(uncor) + Δ (a large RIP constant built from a small one), with Φ^(uncor) having iid N(0, ν) entries and Δ sufficiently low rank. Then
  δ*₃ₖ[Φ^(cor)] ≤ δ₃ₖ[Φ^(uncor)]. [Xin et al., 2016]
  So we can 'undo' low-rank correlations that would otherwise produce a high RIP constant.

  17. Advantages of Independent Layer Weights (and Activations)
  Question 2: Do independent weights (and activations) have the potential to do even better? Yes.
  Theorem 3. With independent weights W^(t), b^(t) on each layer, it is often possible to obtain a nearly ideal RIP constant even when a full-rank correlation Δ is present. [Xin et al., 2016]

  18. Advantages of Independent Layer Weights (and Activations)
  Clustered dictionary: Φ = [Φ_1, …, Φ_c] with Φ_i = Φ_i^(uncor) + Δ_i, i.e., a different low-rank correlation within each cluster.
  Layer-wise update with adaptive activations: x^(t+1) = H(Ω_on^(t), Ω_off^(t))[W^(t) x^(t) + b^(t)], where each layer has its own W^(t), b^(t) and gating sets Ω_on^(t), Ω_off^(t) that switch units on or off.

  22. Checkpoint
  Thus far: idealized deep network weights exist that improve RIP constants.
  What's coming:
  • Practical designs to facilitate success
  • Empirical results
  • Applications

  23. Alternative Learning-Based Strategy
  • Treat sparse recovery as a multi-label DNN classification problem that estimates the support of x*.
  • The main challenge is estimating supp[x*]; once the support is obtained, computing the actual values is trivial (a least-squares fit restricted to that support).
  • A loss based on ‖y − Φx‖₂² would be unaware of this and expend undue effort matching coefficient magnitudes.
  • Specifically, we learn to find supp[x*] using a multi-label softmax loss layer (a rough sketch follows below).
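As a rough illustration of this strategy, the sketch below trains a small feed-forward classifier to predict which dictionary columns are active. The architecture, sizes, and the use of a sigmoid/binary-cross-entropy multi-label objective are assumptions standing in for the talk's multi-label softmax loss and its deeper gated networks.

```python
import torch
import torch.nn as nn

class SupportNet(nn.Module):
    """Small feed-forward stand-in: maps an observation y to one logit per column of Phi."""
    def __init__(self, n, m, hidden=256, depth=3):
        super().__init__()
        layers, d = [], n
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, m))       # logits over candidate support entries
        self.net = nn.Sequential(*layers)

    def forward(self, y):
        return self.net(y)

n, m, k, batch = 20, 50, 3, 64
model = SupportNet(n, m)
loss_fn = nn.BCEWithLogitsLoss()             # multi-label surrogate objective (assumption)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on synthetic data: sample x*, form y = Phi x*,
# and use supp[x*] as the multi-label target.
Phi = torch.randn(n, m) / n ** 0.5
x_true = torch.zeros(batch, m)
idx = torch.randint(0, m, (batch, k))        # up to k non-zeros per example
x_true.scatter_(1, idx, torch.randn(batch, k))
y = x_true @ Phi.T
labels = (x_true != 0).float()

opt.zero_grad()
loss = loss_fn(model(y), labels)
loss.backward()
opt.step()
```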

  24. Alternative Learning-Based Strategy
  • Adopt highway and gating mechanisms: relatively deep nets are needed for challenging problems, and such designs help with information flow.
  • Our analysis shows that such designs are natural for challenging multi-scale sparse estimation problems.
  • Philosophy for generating training sets: take the generative perspective x* → y = Φx*, not y → x* = optimization(y).
  • Training is effectively unsupervised: the x* are randomly generated.

  25. Experiments
  • We generate Φ = Σ_i (1/i²) u_i v_iᵀ, where u_i, v_i are iid N(0, 1): super-linearly decaying singular values, structured but still quite general (see the sketch below).
  • Values of x*:
    • U-distribution: drawn from U[−0.5, 0.5] excluding U[−0.1, 0.1]
    • N-distribution: drawn from N(+0.3, 0.1) and N(−0.3, 0.1)
  • Experiments: basic (U2U) and cross-distribution (U2N, N2U).
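A minimal sketch of this data-generation recipe. The problem sizes, the number of rank-one terms, and reading N(±0.3, 0.1) as mean/standard-deviation pairs are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, batch = 20, 100, 5, 1000             # illustrative sizes (assumption)

# Dictionary with super-linearly decaying singular values: Phi = sum_i (1/i^2) u_i v_i^T.
Phi = sum((1.0 / i**2) * np.outer(rng.standard_normal(n), rng.standard_normal(m))
          for i in range(1, n + 1))           # number of rank-one terms is an assumption

def sample_x_star(dist):
    """k-sparse x* with non-zero values from the U- or N-distribution above."""
    X = np.zeros((batch, m))
    for row in X:
        support = rng.choice(m, size=k, replace=False)
        if dist == "U":                       # U[-0.5, 0.5] excluding U[-0.1, 0.1]
            vals = rng.uniform(0.1, 0.5, size=k) * rng.choice([-1.0, 1.0], size=k)
        else:                                 # N(+0.3, 0.1) and N(-0.3, 0.1), 0.1 read as std dev
            vals = rng.normal(0.3, 0.1, size=k) * rng.choice([-1.0, 1.0], size=k)
        row[support] = vals
    return X

X_star = sample_x_star("U")                   # training targets: supp[x*] and x* itself
Y = X_star @ Phi.T                            # generative direction: x* -> y = Phi x*
```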

  26. Results
  • Strong estimation accuracy
  • Robust to the training distributions
  Question 3: Does deep learning really learn the ideal weights from the analysis? Hard to say, but it DOES achieve strong empirical performance for maximal sparsity.

  27. Robust Surface Normal Estimation
  • Input: per-pixel observations under different lightings.
  • Per-pixel model: y = L n + x, with y the raw observations under different lightings, L the lighting matrix, n the unknown surface normal, and x sparse outliers (specular reflections, shadows, etc.).
  • Can apply any sparse learning method to obtain the outliers (a solver sketch follows). [Ikehata et al., 2012]
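As an illustration of how a sparse solver plugs in here, the sketch below follows the standard outlier-removal reduction: project y onto the orthogonal complement of col(L) so the normal drops out, recover the sparse outliers x from the projected system (here with plain IHT), then solve for n by least squares on the corrected observations. The solver choice, projector construction, and sparsity level are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def hard_threshold(z, k):
    """Keep the k largest-magnitude entries of z, zero the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def robust_normal(L, y, k_outliers, n_iters=300):
    """Estimate a unit normal from y = L n + x with at most k_outliers sparse outliers x."""
    m = L.shape[0]
    P = np.eye(m) - L @ np.linalg.pinv(L)     # projector onto the complement of col(L)
    # The normal drops out of P y = P x, so recover the sparse outliers x with IHT.
    x = np.zeros(m)
    for _ in range(n_iters):
        x = hard_threshold(x - (P @ x - P @ y), k_outliers)   # step size 1; ||P||_2 = 1
    # Least-squares normal from the outlier-corrected observations.
    n_hat = np.linalg.lstsq(L, y - x, rcond=None)[0]
    return n_hat / np.linalg.norm(n_hat), x

# Example: 20 lightings, one true normal, two corrupted measurements.
rng = np.random.default_rng(0)
L = rng.standard_normal((20, 3))
n_true = np.array([0.3, 0.4, 0.866]); n_true /= np.linalg.norm(n_true)
y = L @ n_true
y[[3, 11]] += np.array([2.0, -3.0])           # specular/shadow-like outliers
n_hat, outliers = robust_normal(L, y, k_outliers=2)
print("angle error (deg):", np.degrees(np.arccos(np.clip(n_hat @ n_true, -1, 1))))
```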
