SLIDE 1

Maximal Sparsity with Deep Networks?

Bo Xin (1,2), Yizhou Wang (1), Wen Gao (1) and David Wipf (2)

(1) Peking University   (2) Microsoft Research, Beijing

SLIDE 2

Outline

  • Background and motivation
  • Unfolding iterative algorithms
  • Theoretical analysis
  • Practical designs and empirical support
  • Applications
  • Discussion
SLIDE 3

Maximal Sparsity

  min_x ‖x‖₀   s.t.   y = Φx

  Φ ∈ ℝ^(m×n),   y ∈ ℝ^m,   ‖·‖₀ = number of nonzero entries

SLIDE 4

Maximal Sparsity is NP-hard

  min_x ‖x‖₀   s.t.   y = Φx

Practical alternatives:

  • ℓ1-norm minimization
  • orthogonal matching pursuit (OMP)
  • iterative hard thresholding (IHT)

The combinatorial problem is NP-hard, and even close approximations are highly non-convex.
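To make the combinatorial nature concrete, here is a minimal brute-force sketch (my own illustration, not from the slides; all names are assumptions) that enumerates supports of growing size; the C(n, k) search space is exactly what makes the problem intractable:

    import itertools
    import numpy as np

    def l0_exhaustive(Phi, y, max_k=3, tol=1e-8):
        """Exhaustive l0 search: try every support of size 1, 2, ..., max_k.
        The number of candidate supports, C(n, k), grows combinatorially."""
        n = Phi.shape[1]
        for k in range(1, max_k + 1):
            for S in itertools.combinations(range(n), k):
                S = list(S)
                # least-squares fit restricted to the candidate support
                xs, *_ = np.linalg.lstsq(Phi[:, S], y, rcond=None)
                if np.linalg.norm(Phi[:, S] @ xs - y) < tol:
                    x = np.zeros(n)
                    x[S] = xs
                    return x
        return None  # no exact representation with <= max_k nonzeros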

SLIDE 5

Pros and Cons

Numerous practical applications:

  • Feature selection [Cotter and Rao 2001; Figueiredo, 2002]
  • Outlier removal [Candes and Tao 2005; Ikehata et al. 2012]
  • Compressive sensing [Donoho, 2006]
  • Source localization [Baillet et al. 2001; Malioutov et al. 2005]
  • Computer vision applications [Wright et al., 2009]

Fundamental weakness:

  • If the Gram matrix ΦᵀΦ has high off-diagonal energy,
  • estimation of x* can be extremely poor.
SLIDE 6

Restricted Isometry Property (RIP)

A matrix Φ satisfies the RIP with constant δ_k[Φ] < 1 if

  (1 − δ_k[Φ]) ‖x‖₂² ≤ ‖Φx‖₂² ≤ (1 + δ_k[Φ]) ‖x‖₂²

holds for all {x : ‖x‖₀ ≤ k}.

[Candès et al., 2006]: small RIP constant δ_2k[Φ] → recovery is tractable; large RIP constant δ_2k[Φ] → recovery is hard.
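Computing δ_k exactly is itself combinatorial, but for tiny problems it can be done by enumeration. A hedged numpy sketch (my illustration, not the authors' code; columns are assumed roughly unit-norm so that δ_k reflects correlation rather than scaling):

    import numpy as np
    from itertools import combinations

    def rip_constant(Phi, k):
        """delta_k[Phi]: worst deviation of the squared singular values of
        any m x k column submatrix of Phi from 1 (feasible only for small n, k)."""
        n = Phi.shape[1]
        delta = 0.0
        for S in combinations(range(n), k):
            s = np.linalg.svd(Phi[:, list(S)], compute_uv=False)
            delta = max(delta, s[0] ** 2 - 1.0, 1.0 - s[-1] ** 2)
        return delta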

SLIDE 7

Guaranteed Recovery with IHT

[Blumensath and Davies, 2009]: Suppose there exists some x* such that

  y = Φx*,   ‖x*‖₀ ≤ k,   δ_3k[Φ] < 1/√32.

Then the IHT iterations are guaranteed to converge to x*.

Only very small degrees of correlation can be tolerated.

SLIDE 8
Checkpoint

  • Thus far
    • Maximal sparsity is NP-hard
    • Practical alternatives suffer when the dictionary has a high RIP constant
  • What's coming
    • A deep-learning-based perspective
    • Technical analysis

SLIDE 9

Iterative algorithms

Both methods take a gradient step

  a = x^(t) − μ ∇f(x)|_(x=x^(t)),   where ∇f(x)|_(x=x^(t)) = ΦᵀΦ x^(t) − Φᵀy

and then threshold, while not converged:

Iterative hard thresholding (IHT):

  x_j^(t+1) = a_j   if |a_j| is among the k largest, otherwise 0

Iterative soft thresholding (ISTA):

  x_j^(t+1) = sign(a_j)(|a_j| − λ)   if |a_j| > λ, otherwise 0

SLIDE 10

๐‘จ = ๐‘ฆ(๐‘ข) โˆ’ ๐œˆ๐›ผ๐‘ฆ

๐‘ฆ=๐‘ฆ(๐‘ข)

๐›ผ๐‘ฆ

๐‘ฆ=๐‘ฆ(๐‘ข) = ฮฆ๐‘ˆฮฆ๐‘ฆ(๐‘ข) โˆ’ ฮฆ๐‘ˆ๐‘ง

Iterative algorithms

It Iterative har ard th threshold ldin ing (IH (IHT) It Iterative so soft th threshold lding (IS (ISTA) ๐‘ฆ(๐‘ข+1) = ๐‘ขโ„Ž๐‘ ๐‘กโ„Ž(๐‘จ) ๐‘ฅโ„Ž๐‘—๐‘š๐‘“ ๐‘œ๐‘๐‘ข ๐‘‘๐‘๐‘œ๐‘ค๐‘“๐‘ ๐‘•๐‘“, ๐‘’๐‘ { }

๐‘ฆ๐‘—

(๐‘ข+1) = ๐‘จ๐‘—

if |๐‘จ๐‘—| ๐‘๐‘›๐‘๐‘œ๐‘• ๐‘™ ๐‘š๐‘๐‘ ๐‘•๐‘“๐‘ก๐‘ข ๐‘๐‘ขโ„Ž๐‘“๐‘ ๐‘ฅ๐‘—๐‘ก๐‘“ ๐‘ฆ๐‘—

(๐‘ข+1) = ๐‘ก๐‘—๐‘•๐‘œ ๐‘จ๐‘— |๐‘จ๐‘—| โˆ’ ๐œ‡

๐‘—๐‘” ๐‘จ๐‘— > ๐œ‡ ๐‘๐‘ขโ„Ž๐‘“๐‘ ๐‘ฅ๐‘—๐‘ก๐‘“ lin inear op

  • p

no none lin inear op

  • p
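Both updates are a few lines of numpy; this sketch follows the slide's decomposition into a linear op plus a thresholding nonlinearity (step size mu and threshold lam are illustrative parameter names, not values from the slides):

    import numpy as np

    def iht(Phi, y, k, mu=0.5, iters=200):
        """Iterative hard thresholding: gradient step, then keep the k
        largest-magnitude entries and zero out the rest."""
        x = np.zeros(Phi.shape[1])
        for _ in range(iters):
            a = x - mu * (Phi.T @ (Phi @ x - y))   # linear op
            x = np.zeros_like(a)
            top = np.argsort(np.abs(a))[-k:]       # nonlinear op (hard threshold)
            x[top] = a[top]
        return x

    def ista(Phi, y, lam=0.1, mu=0.5, iters=200):
        """Iterative soft thresholding for the l1-relaxed problem."""
        x = np.zeros(Phi.shape[1])
        for _ in range(iters):
            a = x - mu * (Phi.T @ (Phi @ x - y))               # linear op
            x = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)  # nonlinear op (soft threshold)
        return x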
SLIDE 11

Deep Network = Unfolded Optimization?

Basic DNN template: linear filter + nonlinearity/threshold

  x^(t+1) = f(W^(t) x^(t) + b^(t))

  … → W^(1)x + b^(1) → W^(2)x + b^(2) → … → W^(t)x + b^(t) → …

Observation: many common iterative algorithms follow the exact same script.

SLIDE 12

Deep Network = Unfolded Optimization?

Basic DNN template:

  … → W^(1)x + b^(1) → W^(2)x + b^(2) → … → W^(t)x + b^(t) → …

Fast sparse encoders: [Gregor and LeCun, 2010], [Wang et al., 2015]. What's more?

SLIDE 13

Unfolded IHT Iterations

IHT fits the template (linear filter plus non-linearity) with shared layer weights

  W = I − μ ΦᵀΦ,   b = μ Φᵀy

  … → Wx + b → Wx + b → … → Wx + b → …

Question 1: So is there an advantage to learning the weights?
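A sketch of the unfolding in PyTorch (my illustration; the module name, depth, and training details are assumptions, not the authors' released code). Each layer is initialized at the IHT update and the weights are then free to be learned:

    import torch
    import torch.nn as nn

    class UnfoldedIHT(nn.Module):
        """T layers of x <- hard_threshold_k(W x + b), with W and b
        initialized from IHT: W = I - mu*Phi^T Phi, b = mu*Phi^T y."""
        def __init__(self, Phi, k, T=10, mu=0.5):   # Phi: float torch.Tensor, m x n
            super().__init__()
            n = Phi.shape[1]
            W0 = torch.eye(n) - mu * Phi.T @ Phi
            self.k = k
            # independent trainable weights per layer (cf. Theorem 3 later)
            self.W = nn.ParameterList([nn.Parameter(W0.clone()) for _ in range(T)])
            self.B = nn.Parameter(mu * Phi.T.clone())  # maps y to the bias b

        def hard(self, a):
            # keep each row's k largest-magnitude entries, zero the rest
            idx = a.abs().topk(self.k, dim=-1).indices
            return torch.zeros_like(a).scatter(-1, idx, a.gather(-1, idx))

        def forward(self, y):        # y: batch x m
            b = y @ self.B.T
            x = torch.zeros_like(b)
            for W in self.W:
                x = self.hard(x @ W.T + b)
            return x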

SLIDE 14

Effects of Correlation Structure

[Figure: Gram matrices ΦΦᵀ for an uncorrelated vs. a correlated (low-rank-corrupted) dictionary]

Low correlation: easy
  • Φ(uncor) → iid N(0, ν) entries
  • small RIP constant: δ_3k[Φ] < 1/√32

High correlation: hard
  • Φ(cor) = Φ(uncor) + Δ, with Δ low rank
  • large RIP constant: δ_3k[Φ] ≫ 1/√32

SLIDE 15

Performance Bound with Learned Layer Weights

Theorem 1 [Xin et al., 2016]: There will always exist layer weights W and bias b such that the effective RIP constant is reduced via

  δ*_3k[Φ] ≜ inf_(Ψ,D) δ_3k[ΨΦD] ≤ δ_3k[Φ]

where Ψ is arbitrary and D is diagonal.

Since the effective RIP constant never exceeds the original one, it is therefore possible to reduce high RIP constants.
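As a concrete, hedged illustration of Theorem 1: one natural candidate for Ψ is an inverse square root that whitens ΦΦᵀ, with column renormalization playing the role of the diagonal D. This is my example of a feasible (Ψ, D) pair, not necessarily the infimum:

    import numpy as np

    def whitened_dictionary(Phi, eps=1e-3):
        """Apply Psi = (Phi Phi^T + eps*I)^(-1/2), then renormalize columns
        (a diagonal D); this tends to shrink the RIP constant of Psi Phi D."""
        G = Phi @ Phi.T + eps * np.eye(Phi.shape[0])
        w, V = np.linalg.eigh(G)                 # G is symmetric positive definite
        Psi = V @ np.diag(w ** -0.5) @ V.T       # inverse matrix square root
        A = Psi @ Phi
        return A / np.linalg.norm(A, axis=0, keepdims=True)

On small instances one can check with the rip_constant sketch above that the whitened dictionary has a smaller δ_3k than Φ itself.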

SLIDE 16

Practical Consequences

Theorem 2 [Xin et al., 2016]: Suppose we have a correlated dictionary formed via

  Φ(cor) = Φ(uncor) + Δ

with Φ(uncor) → iid N(0, ν) entries and Δ sufficiently low rank. Then

  E[δ*_3k(Φ(cor))] ≈ E[δ_3k(Φ(uncor))]
      (large RIP)        (small RIP)

So we can 'undo' low-rank correlations that would otherwise produce a high RIP constant.
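A small generator for the dictionaries Theorem 2 describes (the parameterization and constants are my assumptions):

    import numpy as np

    def correlated_dictionary(m, n, rank, nu=1.0, scale=5.0, seed=0):
        """Return (Phi_cor, Phi_uncor) with Phi_cor = Phi_uncor + Delta,
        Phi_uncor iid N(0, nu), and Delta a random matrix of the given low rank."""
        rng = np.random.default_rng(seed)
        Phi_uncor = rng.normal(0.0, np.sqrt(nu), (m, n))
        Delta = scale * rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
        return Phi_uncor + Delta, Phi_uncor

Passing Phi_cor through whitened_dictionary above approximately recovers the well-conditioned regime of Phi_uncor, which is the content of the theorem.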

SLIDE 17

Advantages of Independent Layer Weights (and Activations)

  … → W^(1)x + b^(1) → W^(2)x + b^(2) → … → W^(t)x + b^(t) → …

Question 2: Do independent weights (and activations) have the potential to do even better? Yes [Xin et al., 2016].

Theorem 3: With independent weights on each layer, it is often possible to obtain a nearly ideal RIP constant even when a full-rank Δ is present.

SLIDE 18

Advantages of Independent Layer Weights (and Activations)

Clustered dictionary: Φ = [Φ₁, …, Φ_c], with each Φ_i = Φ_i(uncor) + Δ_i

Adaptive, gated update:

  x^(t+1) = f(Ω_on^(t), Ω_off^(t)) [ W^(t) x^(t) + b^(t) ]
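A sketch of one such adaptive layer (the sigmoid gate here is my stand-in for the unspecified activation f(Ω_on, Ω_off); module and parameter names are assumptions):

    import torch
    import torch.nn as nn

    class GatedUnfoldedLayer(nn.Module):
        """One layer x <- f(gates)[W x + b(y)]: independent weights per layer
        plus a learned per-coordinate on/off gate."""
        def __init__(self, n, m):
            super().__init__()
            self.W = nn.Linear(n, n, bias=False)   # layer-specific W^(t)
            self.B = nn.Linear(m, n, bias=False)   # computes b^(t) from y
            self.gate = nn.Linear(n, n)            # produces the on/off gates

        def forward(self, x, y):
            a = self.W(x) + self.B(y)
            g = torch.sigmoid(self.gate(a))        # soft Omega_on / Omega_off
            return g * a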

SLIDE 22
Checkpoint

  • Thus far: idealized deep network weights exist that improve RIP constants.
  • What's coming
    • Practical designs to facilitate success
    • Empirical results
    • Applications

SLIDE 23
Alternative Learning-Based Strategy

  • The main challenge is estimating supp[x*]; once the support is obtained, computing the actual values is trivial.
  • A ‖y − Φx‖₂-based loss is unaware of this and will expend undue effort matching coefficient magnitudes.
  • Instead, treat support recovery as a multi-label DNN classification problem: learn to find supp[x*] using a multi-label softmax loss layer (see the sketch below).
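A hedged sketch of the support-classification loss; binary cross-entropy over per-coordinate membership stands in here for the multi-label softmax named on the slide:

    import torch
    import torch.nn.functional as F

    def support_loss(logits, x_true):
        """Score each coordinate's membership in supp[x*]; coefficient
        magnitudes are deliberately ignored."""
        labels = (x_true != 0).float()
        return F.binary_cross_entropy_with_logits(logits, labels)

Given a predicted support S, the values follow by least squares on Φ[:, S], which is the 'trivial' step mentioned above.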

SLIDE 24
Alternative Learning-Based Strategy

  • Adopt highway and gating mechanisms
    • Relatively deep nets are needed for challenging problems, and such designs help with information flow
    • Our analysis shows such designs seem natural for challenging multi-scale sparse estimation problems
  • The philosophy of generating training sets
    • Generative perspective: x* → y = Φx*
    • Not y → x* = optimization(y)
    • 'Unsupervised' training: the x* are randomly generated

SLIDE 25
Experiments

  • We generate Φ = Σ_i i⁻² u_i v_iᵀ, where u_i, v_i are iid N(0,1)
    • super-linearly decaying singular values
    • structured, but quite general
  • Values of x*
    • U-distribution: drawn from U[−0.5, 0.5] excluding [−0.1, 0.1]
    • N-distribution: drawn from N(+0.3, 0.1) and N(−0.3, 0.1)
  • Experiments (training-to-test distribution pairs)
    • Basic: U2U
    • Cross: U2N, N2U

A sketch of this data pipeline follows below.
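A minimal sketch of the generative training setup (column normalization and the problem sizes are my assumptions):

    import numpy as np

    def make_dictionary(m, n, rng):
        """Phi = sum_i i^(-2) u_i v_i^T with u_i, v_i iid N(0,1):
        super-linearly decaying singular values."""
        Phi = np.zeros((m, n))
        for i in range(1, min(m, n) + 1):
            Phi += i ** -2.0 * np.outer(rng.standard_normal(m), rng.standard_normal(n))
        return Phi / np.linalg.norm(Phi, axis=0, keepdims=True)

    def sample_x(n, k, rng, dist="U"):
        """Sparse x* with k nonzeros from the U- or N-distribution."""
        x = np.zeros(n)
        idx = rng.choice(n, k, replace=False)
        if dist == "U":   # U[-0.5, 0.5] excluding [-0.1, 0.1]
            x[idx] = rng.uniform(0.1, 0.5, k) * rng.choice([-1.0, 1.0], k)
        else:             # mixture of N(+0.3, 0.1) and N(-0.3, 0.1)
            x[idx] = rng.normal(0.3, 0.1, k) * rng.choice([-1.0, 1.0], k)
        return x

    # generative training pairs: x* -> y = Phi x*
    rng = np.random.default_rng(0)
    Phi = make_dictionary(20, 100, rng)
    x_star = sample_x(100, 5, rng, dist="U")
    y = Phi @ x_star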

SLIDE 26

Results

  • Strong estimation accuracy
  • Robust to training distributions

Question 3: Does deep learning really learn the ideal weights as analyzed? Hard to say, but it DOES achieve strong empirical performance for maximal sparsity.

SLIDE 27

Robust Surface Normal Estimation

[Ikehata et al., 2012]

  • Input: observations under different lightings
  • Per-pixel model: y = L n + x
    • L: lighting matrix; n: raw unknown surface normal; x: specular reflections, shadows, etc. (outliers)
  • Can apply any sparse learning method to obtain the outliers

SLIDE 28

Convert to Sparse Estimation Problem

Project y onto the null space of Lᵀ to eliminate the unknown normal:

  Proj_Null(Lᵀ)(y) = Proj_Null(Lᵀ)(L n) + Proj_Null(Lᵀ)(x) = Proj_Null(Lᵀ)(x)

so with ỹ ≜ Proj_Null(Lᵀ)(y) and Φ ≜ Proj_Null(Lᵀ), solve

  min_x ‖x‖₀   s.t.   ỹ = Φx     [Candès and Tao, 2004]     (‖·‖₀ = # of nonzero elements)

Once the outliers x̂ are known, can estimate n via least squares:

  n̂ = (LᵀL)⁻¹ Lᵀ (y − x̂)
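End to end, the per-pixel pipeline looks like this (a sketch under the model above; sparse_solver is any of the methods discussed, e.g. the iht function from earlier):

    import numpy as np

    def estimate_normal(L, y, sparse_solver, k):
        """Project out the normal, solve for the sparse outliers, then
        recover the normal by least squares on the cleaned observations."""
        m = L.shape[0]
        P = np.eye(m) - L @ np.linalg.pinv(L)    # projector onto Null(L^T)
        y_tilde = P @ y
        x_hat = sparse_solver(P, y_tilde, k)     # e.g. lambda A, b, k: iht(A, b, k)
        n_hat = np.linalg.lstsq(L, y - x_hat, rcond=None)[0]
        return n_hat / np.linalg.norm(n_hat)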

SLIDE 29

Real-time photometric stereo

SLIDE 30
Conclusions

  • First rigorous analysis of how unfolded iterative algorithms can be provably enhanced by learning.
  • Detailed characterization of how different architecture choices affect performance.
  • Narrow benefit: the first ultra-fast method for obtaining optimal sparse representations with correlated designs (i.e., high RIP constants).
  • Broad benefit: general insights into why DNNs can outperform hand-crafted algorithms.

SLIDE 31

Thank you!

SLIDE 32

Network structure

Residual LSTM