Maximal Sparsity with Deep Networks? Bo Xin¹,², Yizhou Wang¹, Wen Gao¹ and David Wipf² — ¹Peking University, ²Microsoft Research, Beijing
Outline • Background and motivation • Unfolding iterative algorithms • Theoretical analysis • Practical designs and empirical support • Applications • Discussion
Maximal Sparsity

min_{x ∈ R^m} ‖x‖₀  s.t.  y = Φx,  where Φ ∈ R^{n×m}, y ∈ R^n, and ‖·‖₀ counts the number of non-zeros.
Maximal Sparsity is NP-hard

min_x ‖x‖₀  s.t.  y = Φx

Combinatorial and NP-hard, and close approximations are highly non-convex. Practical alternatives:
• ℓ₁-norm minimization
• Orthogonal matching pursuit (OMP)
• Iterative hard thresholding (IHT)
Pros and Cons

Numerous practical applications:
• Feature selection [Cotter and Rao, 2001; Figueiredo, 2002]
• Outlier removal [Candès and Tao, 2005; Ikehata et al., 2012]
• Compressive sensing [Donoho, 2006]
• Source localization [Baillet et al., 2001; Malioutov et al., 2005]
• Computer vision applications [Wright et al., 2009]

Fundamental weakness: if the Gram matrix ΦᵀΦ has high off-diagonal energy, estimation of x* can be extremely poor.
Restricted Isometry Property (RIP)

A matrix Φ satisfies the RIP with constant ε_k[Φ] < 1 if

(1 − ε_k[Φ]) ‖x‖₂² ≤ ‖Φx‖₂² ≤ (1 + ε_k[Φ]) ‖x‖₂²

holds for all {x : ‖x‖₀ ≤ k}. [Candès et al., 2006]

(Figure: dictionaries with small vs. large RIP constant ε_{2k}[Φ].)
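Evaluating ε_k[Φ] exactly requires maximizing over all k-sparse supports, which is combinatorial. As an illustrative sanity check (the function name `rip_estimate` and the sampling scheme are mine, not from the slides), one can compute a Monte Carlo lower bound:

```python
import numpy as np

def rip_estimate(Phi, k, trials=2000, rng=None):
    """Monte Carlo LOWER bound on the RIP constant eps_k[Phi].

    The true constant maximizes | ||Phi x||^2 - 1 | over ALL k-sparse unit
    vectors x; here we only sample random supports and coefficients."""
    rng = rng or np.random.default_rng(0)
    m = Phi.shape[1]
    worst = 0.0
    for _ in range(trials):
        S = rng.choice(m, size=k, replace=False)   # random k-sparse support
        x = rng.standard_normal(k)
        x /= np.linalg.norm(x)                     # unit-norm coefficients
        worst = max(worst, abs(np.linalg.norm(Phi[:, S] @ x) ** 2 - 1.0))
    return worst
```

An orthonormal Φ yields an estimate of 0, while a random Gaussian dictionary with more columns than rows yields a strictly positive value.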
Guaranteed Recovery with IHT

Suppose there exists some x* such that

y = Φx*,  ‖x*‖₀ ≤ k,  ε_{3k}[Φ] < 1/√32.

Then the IHT iterations are guaranteed to converge to x*. [Blumensath and Davies, 2009]

Only very small degrees of correlation can be tolerated.
Checkpoint

• Thus far:
  • Maximal sparsity is NP-hard
  • Practical alternatives suffer when the dictionary has a high RIP constant
• What's coming:
  • A deep-learning-based perspective
  • Technical analysis
Iterative Algorithms: Hard Thresholding (IHT) and Soft Thresholding (ISTA)

while not converged, do {
    ∇x = Φᵀ Φ x⁽ᵗ⁾ − Φᵀ y        (linear op)
    z = x⁽ᵗ⁾ − μ ∇x
    x⁽ᵗ⁺¹⁾ = thrsh(z)            (nonlinear op)
}

IHT: x_i⁽ᵗ⁺¹⁾ = z_i if |z_i| is among the k largest, 0 otherwise.

ISTA: x_i⁽ᵗ⁺¹⁾ = sign(z_i)(|z_i| − λ) if |z_i| > λ, 0 otherwise.
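The two algorithms share the gradient step and differ only in the thresholding nonlinearity. A minimal NumPy sketch (the function names and the conservative step size μ = 1/‖Φ‖² are my choices, not from the slides):

```python
import numpy as np

def iht(Phi, y, k, iters=500, mu=None):
    """Iterative hard thresholding: gradient step, then keep the k
    largest-magnitude entries and zero the rest."""
    if mu is None:
        mu = 1.0 / np.linalg.norm(Phi, 2) ** 2   # conservative step size
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ x - y)             # linear op
        z = x - mu * grad
        x = np.zeros_like(z)
        top = np.argsort(np.abs(z))[-k:]         # nonlinear op: hard threshold
        x[top] = z[top]
    return x

def ista_step(x, Phi, y, mu, lam):
    """One ISTA iteration: same gradient step, soft threshold instead."""
    z = x - mu * (Phi.T @ (Phi @ x - y))
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

On a well-conditioned Gaussian dictionary (small RIP constant), `iht` recovers the true sparse vector, consistent with the recovery guarantee above.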
Deep Network = Unfolded Optimization?

Basic DNN template: alternate a linear filter W⁽ˡ⁾x + b⁽ˡ⁾ with a nonlinearity/threshold, layer after layer.

Observation: many common iterative algorithms follow the exact same script:

x⁽ᵗ⁺¹⁾ = f(W⁽ᵗ⁾ x⁽ᵗ⁾ + b⁽ᵗ⁾)
Deep Network = Unfolded Optimization?

Fast sparse encoders: [Gregor and LeCun, 2010; Wang et al., 2015]

What's more?
Unfolded IHT Iterations

Each layer applies the linear filter and then the non-linearity, with

W = I − μ Φᵀ Φ,   b = μ Φᵀ y.

Question 1: So is there an advantage to learning the weights?
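The unfolded view can be made concrete: with fixed W = I − μΦᵀΦ and b = μΦᵀy, running IHT for T iterations is literally a T-layer network with tied weights. A sketch under those definitions (the function name and default depth are illustrative):

```python
import numpy as np

def unfolded_iht(Phi, y, k, depth=500, mu=None):
    """IHT rewritten as a depth-layer network with tied weights:
    every layer computes hard_threshold(W x + b) with the SAME W, b."""
    if mu is None:
        mu = 1.0 / np.linalg.norm(Phi, 2) ** 2
    m = Phi.shape[1]
    W = np.eye(m) - mu * Phi.T @ Phi       # linear filter, fixed by Phi
    b = mu * Phi.T @ y                     # bias, fixed by Phi and y
    x = np.zeros(m)
    for _ in range(depth):
        z = W @ x + b                      # affine layer
        x = np.zeros_like(z)               # hard-threshold nonlinearity
        top = np.argsort(np.abs(z))[-k:]
        x[top] = z[top]
    return x
```

Learning would replace the fixed (W, b) with trained, possibly layer-specific, parameters — which is exactly the question the slide poses.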
Effects of Correlation Structure

Low correlation (easy): Φ⁽ᵘⁿᶜᵒʳ⁾ with entries iid N(0, ν) — small RIP constant, ε_{3k}[Φ⁽ᵘⁿᶜᵒʳ⁾] < 1/√32.

High correlation (hard): Φ⁽ᶜᵒʳ⁾ = Φ⁽ᵘⁿᶜᵒʳ⁾ + Δ with Δ low rank — large RIP constant, ε_{3k}[Φ⁽ᶜᵒʳ⁾] > 1/√32.
Performance Bound with Learned Layer Weights

Theorem 1. There will always exist layer weights W and bias b such that the effective RIP constant is reduced via

ε*_{3k}[Φ] ≜ inf_{Ψ,D} ε_{3k}[ΨΦD] ≤ ε_{3k}[Φ],

where Ψ is arbitrary and D is diagonal; ε*_{3k}[Φ] is the effective RIP constant and ε_{3k}[Φ] the original one. [Xin et al., 2016]

It is therefore possible to reduce high RIP constants.
Practical Consequences

Theorem 2. Suppose we have a correlated dictionary formed via

Φ⁽ᶜᵒʳ⁾ = Φ⁽ᵘⁿᶜᵒʳ⁾ + Δ    (large RIP ← small RIP)

with Φ⁽ᵘⁿᶜᵒʳ⁾ having iid N(0, ν) entries and Δ sufficiently low rank. Then

ε*_{3k}[Φ⁽ᶜᵒʳ⁾] ≈ ε_{3k}[Φ⁽ᵘⁿᶜᵒʳ⁾].  [Xin et al., 2016]

So we can 'undo' low-rank correlations that would otherwise produce a high RIP constant.
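The mechanism behind Theorem 2 can be illustrated numerically: adding a low-rank Δ to a Gaussian dictionary raises its coherence (used here as a cheap proxy for the RIP constant, since computing ε_k exactly is combinatorial), and a projector Ψ that annihilates Δ's column space undoes the corruption exactly. The construction below, including the scaling of Δ, is my own illustration, not the paper's:

```python
import numpy as np

def coherence(A):
    """Max off-diagonal magnitude of the column-normalized Gram matrix
    (a cheap proxy for the RIP constant)."""
    A = A / np.linalg.norm(A, axis=0)
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(1)
n, m, r = 60, 100, 2
Phi_u = rng.standard_normal((n, m)) / np.sqrt(n)   # 'uncorrelated' dictionary
U = rng.standard_normal((n, r)) / np.sqrt(n)
V = 2.0 * rng.standard_normal((m, r))
Delta = U @ V.T                                    # low-rank correlation term
Phi_c = Phi_u + Delta                              # 'correlated' dictionary

# Psi projects onto the orthogonal complement of Delta's column space,
# so Psi @ Phi_c == Psi @ Phi_u: the low-rank correlation is undone exactly.
Q, _ = np.linalg.qr(U)
Psi = np.eye(n) - Q @ Q.T
```

Here Ψ is one feasible choice in the infimum of Theorem 1 (with D = I); the theorem's point is that learned layer weights can realize such transformations.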
Advantages of Independent Layer Weights (and Activations)

Theorem 3. With independent weights on each layer, it is often possible to obtain a nearly ideal RIP constant even when full-rank correlation is present. [Xin et al., 2016]

Question 2: Do independent weights (and activations) have the potential to do even better? Yes.
Advantages of Independent Layer Weights (and Activations)

Φ_i = Φ_i⁽ᵘⁿᶜᵒʳ⁾ + Δ_i,   Φ = [Φ_1, …, Φ_c]

x⁽ᵗ⁺¹⁾ = H_{Ω_on⁽ᵗ⁾, Ω_off⁽ᵗ⁾} [W⁽ᵗ⁾ x⁽ᵗ⁾ + b⁽ᵗ⁾]
Checkpoint

• Thus far: idealized deep network weights exist that improve RIP constants.
• What's coming:
  • Practical designs to facilitate success
  • Empirical results
  • Applications
Alternative Learning-Based Strategy

• Treat sparse recovery as a multi-label DNN classification problem to estimate the support of x*.
• The main challenge is estimating supp[x*]:
  • Once the support is obtained, computing the actual values is trivial.
  • A loss based on ‖y − Φx‖₂² would be unaware of this and expend undue effort matching coefficient magnitudes.
• Specifically, we learn to find supp[x*] using a multi-label softmax loss layer.
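The slides' exact multi-label softmax layer is not reproduced here; as an illustrative stand-in, the sketch below uses a per-coordinate binary cross-entropy on the support indicator, plus the "trivial" least-squares refit once the support is known (both function names are mine):

```python
import numpy as np

def support_loss(logits, support_mask):
    """Per-coordinate binary cross-entropy on the support indicator
    (an illustrative stand-in for the slides' multi-label loss layer)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -np.mean(support_mask * np.log(p + eps)
                    + (1.0 - support_mask) * np.log(1.0 - p + eps))

def refit_on_support(Phi, y, support):
    """Once the support is known, coefficient values follow from a
    least-squares fit restricted to the selected columns."""
    x = np.zeros(Phi.shape[1])
    x[support] = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
    return x
```

The refit step makes the slide's point concrete: given the correct support, the exact coefficients are recovered by ordinary least squares, so the network only needs to classify support membership.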
Alternative Learning-Based Strategy

• Adopt highway and gating mechanisms:
  • In relatively deep nets for challenging problems, such designs help with information flow.
  • Our analysis shows such designs seem natural for challenging multi-scale sparse estimation problems.
• The philosophy of generating training sets:
  • Generative perspective: x* → y = Φx*
  • Not: y → x* = optimization(y)
  • Unsupervised training: the x* are randomly generated.
Experiments

• We generate Φ = Σ_i (1/i²) u_i v_iᵀ, where u_i, v_i have iid N(0, 1) entries:
  • Super-linearly decaying singular values
  • Structured, but quite general
• Values of x*:
  • U-distribution: drawn from U[−0.5, 0.5] excluding U[−0.1, 0.1]
  • N-distribution: drawn from N(+0.3, 0.1) and N(−0.3, 0.1)
• Experiments:
  • Basic: U2U
  • Cross: U2N, N2U
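The data-generation recipe above can be sketched as follows (function names are mine; whether 0.1 in N(±0.3, 0.1) denotes the standard deviation or the variance is not stated on the slide — the sketch assumes standard deviation):

```python
import numpy as np

def make_dictionary(n, m, decay=2.0, rng=None):
    """Phi = sum_i i^(-decay) u_i v_i^T with iid N(0,1) factors,
    giving approximately super-linearly decaying singular values."""
    rng = rng or np.random.default_rng(0)
    Phi = np.zeros((n, m))
    for i in range(1, min(n, m) + 1):
        Phi += np.outer(rng.standard_normal(n),
                        rng.standard_normal(m)) / i ** decay
    return Phi

def sample_x_star(m, k, dist="U", rng=None):
    """k-sparse ground truth. 'U': uniform on [-0.5, 0.5] excluding
    [-0.1, 0.1]; 'N': N(+/-0.3, 0.1), std-dev 0.1 assumed."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)  # random support
    signs = rng.choice([-1.0, 1.0], size=k)
    if dist == "U":
        x[support] = signs * rng.uniform(0.1, 0.5, size=k)
    else:
        x[support] = signs * rng.normal(0.3, 0.1, size=k)
    return x
```

Pairing these two generators gives unlimited (x*, y = Φx*) training examples, matching the generative, unsupervised philosophy of the previous slide.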
Results

• Strong estimation accuracy
• Robust to training distributions

Question 3: Does deep learning really learn the ideal weights as analyzed? Hard to say, but it DOES achieve strong empirical performance for maximal sparsity.
Robust Surface Normal Estimation

• Input: …
• Per-pixel model: y = L n + e, where y holds the raw observations under different lightings, L is the lighting matrix, n is the unknown surface normal, and e collects sparse outliers (specular reflections, shadows, etc.).
• Can apply any sparse learning method to obtain the outliers. [Ikehata et al., 2012]
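As a toy stand-in for this sparse-outlier model (not the method of Ikehata et al. or of this paper), one can alternate a least-squares fit of the normal with flagging the largest residuals as outliers; all names and the fixed outlier count `k_out` are illustrative:

```python
import numpy as np

def robust_normal(L, y, k_out, iters=50):
    """Alternate: fit the surface normal by least squares on presumed
    inliers, then flag the k_out largest residuals as outliers
    (a toy stand-in for sparse outlier removal in photometric stereo)."""
    n_obs = len(y)
    inliers = np.ones(n_obs, dtype=bool)
    for _ in range(iters):
        nrm, *_ = np.linalg.lstsq(L[inliers], y[inliers], rcond=None)
        resid = np.abs(y - L @ nrm)
        new_in = np.ones(n_obs, dtype=bool)
        new_in[np.argsort(resid)[-k_out:]] = False   # flag worst residuals
        if np.array_equal(new_in, inliers):
            break                                    # support stabilized
        inliers = new_in
    return nrm, ~inliers
```

With gross outliers in only a few observations, the alternation typically locks onto the correct outlier support, after which the normal is recovered exactly from the clean rows.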