
SLIDE 1

Latent Structure Beyond Sparse Codes

Benjamin Recht Department of EECS and Statistics University of California, Berkeley

SLIDE 2

Sparse Codes

Which mathematical representations can be learned robustly?

[Figure: learned dictionary at 2.5x redundancy, Gabor-like thingies... illustrating robustness and sparsity]

SLIDE 3

Sparse Approximation

Compressed Sensing

  • Use the fact that images are sparse in a wavelet basis to reduce the number of measurements required for signal acquisition.

Lasso

  • n_patients << n_peaks
  • If very few peaks are needed for diagnosis, search for a sparse set of markers.

SLIDE 4

Cardinality Minimization

  • PROBLEM: Find the vector of lowest cardinality that

satisfies/approximates the underdetermined linear system

  • NP-HARD:

– Reduce to EXACT-COVER – Hard to approximate – Known exact algorithms require enumeration

  • HEURISTIC: Replace cardinality with the ℓ1 norm

Φx = y,  Φ : R^p → R^n
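For concreteness (not part of the slides), the ℓ1 heuristic is a linear program; a minimal sketch with NumPy/SciPy, assuming a random Gaussian Φ and splitting x into positive and negative parts:

```python
# Minimal sketch of the l1 heuristic for cardinality minimization,
# posed as a linear program: x = x+ - x-, both parts nonnegative.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, s = 40, 100, 5                     # measurements, dimension, sparsity
Phi = rng.standard_normal((n, p))
x_true = np.zeros(p)
x_true[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
y = Phi @ x_true

# minimize sum(x+) + sum(x-)  subject to  Phi (x+ - x-) = y, x+, x- >= 0
c = np.ones(2 * p)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
x_hat = res.x[:p] - res.x[p:]
print("recovery error:", np.linalg.norm(x_hat - x_true))
```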

SLIDE 5

  • Rank of the data matrix: Recommender Systems
  • Rank of the density matrix: Quantum Tomography
  • Rank of the Gram matrix: Geometric Structure
  • Rank of the unfolded tensor: Seismic Imaging

SLIDE 6

Affine Rank Minimization

  • PROBLEM: Find the matrix of lowest rank that

satisfies/approximates the underdetermined linear system

  • NP-HARD:

– Reduce to solving polynomial equations – Hard to approximate – Exact algorithms are awful

  • HEURISTIC: Replace rank with the nuclear norm

Φ(X) = y,  Φ : R^{p1×p2} → R^n
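A hedged sketch of the nuclear-norm heuristic, here specialized to matrix completion and assuming CVXPY is available; sizes and the sampling rate are made up:

```python
# Sketch of the nuclear-norm heuristic for affine rank minimization,
# with Phi taken to be entrywise sampling (matrix completion).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
p1, p2, r = 20, 20, 2
M = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))  # rank-r target
Omega = (rng.random((p1, p2)) < 0.5).astype(float)               # observed mask

X = cp.Variable((p1, p2))
# minimize ||X||_*  subject to agreeing with the observed entries
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                  [cp.multiply(Omega, X) == Omega * M])
prob.solve()
print("rank of solution:", np.linalg.matrix_rank(X.value, tol=1e-6))
```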

SLIDE 7

Heuristic: Gradient Descent

Factor M ≈ L R*, with M of size p1 × p2, L of size p1 × r, and R* of size r × p2.

  • Step 1: Pick (i,j) and compute the residual:

e = Li Rjᵀ − Mij

  • Step 2: Take a mixture of the current model and the corrected model (α, β > 0):

Li ← α Li − β e Rj
Rj ← α Rj − β e Li

IDEA: Replace rank with the nuclear norm:

minimize ‖X‖∗ subject to Φ(X) = y

Some guy on livejournal, 2006; Fazel, Parrilo, Recht, 2007; Candès and Recht, 2008

Succeeds when the number of samples is Õ(r(p1 + p2)).
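A minimal sketch of the two-step update, with made-up sizes and step parameters α, β:

```python
# Stochastic-gradient matrix-factorization heuristic (the "mixture" update
# above); alpha slightly below 1 adds shrinkage, alpha = 1 is plain SGD.
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r = 50, 40, 3
M = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))  # rank-r target
obs = [(i, j) for i in range(p1) for j in range(p2) if rng.random() < 0.5]

L = 0.1 * rng.standard_normal((p1, r))
R = 0.1 * rng.standard_normal((p2, r))
alpha, beta = 1.0, 0.01
for _ in range(50):
    for i, j in obs:
        e = L[i] @ R[j] - M[i, j]        # Step 1: residual at entry (i, j)
        # Step 2: simultaneous update of row i of L and row j of R
        L[i], R[j] = alpha * L[i] - beta * e * R[j], alpha * R[j] - beta * e * L[i]

print("relative error:", np.linalg.norm(L @ R.T - M) / np.linalg.norm(M))
```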

SLIDE 8

System Identification: find a dynamical model that agrees with time series data

  • All linear systems are combinations of single pole filters.
  • Leverage this structure for new algorithms and analysis.

Observe a time series y1, y2, ..., yT driven by the input u1, u2, ..., uT.

What is a principled way to build a parsimonious model for the input-output responses?

Na et al, 2012 Shah, Bhaskar, Tang, and Recht 2012

SLIDE 9

Linear Inverse Problems

  • Find me a solution of y = Φx
  • Φ is n × p, with n < p
  • Of the infinite collection of solutions, which one should we pick?
  • Leverage structure: sparsity, rank, smoothness, symmetry
  • How do we design algorithms to solve underdetermined systems with priors?

SLIDE 10

Sparsity

‖x‖₁ = Σ_{i=1}^{p} |x_i|

  • 1-sparse vectors of Euclidean norm 1
  • Convex hull is the unit ball of the ℓ1 norm

[Figure: the cross-polytope in R², vertices at ±1 on each axis]

SLIDE 11

minimize ‖x‖₁ subject to Φx = y

[Figure: the ℓ1 ball touching the affine space Φx = y in the (x1, x2) plane]

Compressed Sensing: Candès, Romberg, Tao, Donoho, Tanner, etc.

SLIDE 12
Rank

  • 2×2 symmetric matrices [[x, y], [y, z]], plotted in 3d
  • rank-1 matrices of unit Euclidean norm: x² + z² + 2y² = 1

SLIDE 13
Rank

  • 2×2 symmetric matrices [[x, y], [y, z]], plotted in 3d
  • rank-1 matrices of unit Euclidean norm: x² + z² + 2y² = 1
  • Convex hull: the unit ball of the nuclear norm

‖X‖∗ = Σᵢ σᵢ(X)
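A two-line NumPy check of the nuclear-norm formula, on a made-up matrix:

```python
# The nuclear norm is the sum of singular values.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5]])
nuc = np.linalg.svd(X, compute_uv=False).sum()
print("||X||_* =", nuc)  # equals np.linalg.norm(X, 'nuc')
```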

SLIDE 14
  • 2×2 matrices plotted in 3d

Nuclear Norm Heuristic

Rank Minimization / Matrix Completion: Fazel 2002; R, Fazel, and Parrilo 2007

‖X‖∗ = Σᵢ σᵢ(X)

SLIDE 15
Integer Programming

  • Integer solutions: all components of x are ±1
  • Convex hull is the unit ball of the ℓ∞ norm

[Figure: the square with vertices (1,1), (1,−1), (−1,1), (−1,−1)]

SLIDE 16

minimize ‖x‖∞ subject to Φx = y

[Figure: the ℓ∞ ball touching the affine space Φx = y in the (x1, x2) plane]

Donoho and Tanner 2008; Mangasarian and Recht 2009.
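A hedged sketch of this ℓ∞ program as a linear program with an epigraph variable (the sizes are made up):

```python
# Minimizing ||x||_inf subject to Phi x = y as an LP:
# add t with constraints -t <= x_i <= t and minimize t.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p = 40, 60
Phi = rng.standard_normal((n, p))
x_true = rng.choice([-1.0, 1.0], size=p)      # integer (+-1) solution
y = Phi @ x_true

c = np.zeros(p + 1); c[-1] = 1.0              # variables: [x, t]; minimize t
A_eq = np.hstack([Phi, np.zeros((n, 1))])
A_ub = np.block([[np.eye(p), -np.ones((p, 1))],    #  x_i - t <= 0
                 [-np.eye(p), -np.ones((p, 1))]])  # -x_i - t <= 0
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * p), A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * p + [(0, None)])
print("max deviation from +-1 solution:", np.abs(res.x[:p] - x_true).max())
```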

SLIDE 17
Parsimonious Models

  • Search for the best linear combination of the fewest atoms
  • “rank” = fewest atoms needed to describe the model

[Figure: model = Σ weights × atoms; “rank” counts the atoms used]

SLIDE 18

Atomic Norms

  • Given a basic set of atoms A, define the function

‖x‖_A = inf { Σ_{a∈A} |c_a| : x = Σ_{a∈A} c_a a }

equivalently,

‖x‖_A = inf { t > 0 : x ∈ t·conv(A) }

  • When A is centrosymmetric, we get a norm
  • When can we compute this?
  • When does this work?

IDEA:

minimize ‖z‖_A subject to Φz = y
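When A is a finite set, ‖x‖_A is itself a small linear program; a sketch using signed basis vectors as an assumed toy atomic set, so the answer should match the ℓ1 norm:

```python
# For finite centrosymmetric A, ||x||_A = min sum(c) over c >= 0
# with x = sum_a c_a * a; a small LP.
import numpy as np
from scipy.optimize import linprog

atoms = np.hstack([np.eye(3), -np.eye(3)])     # columns are atoms +-e_i
x = np.array([1.0, -2.0, 0.5])

c = np.ones(atoms.shape[1])
res = linprog(c, A_eq=atoms, b_eq=x, bounds=[(0, None)] * atoms.shape[1])
print("||x||_A =", res.fun, " l1 norm =", np.abs(x).sum())  # should agree
```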

SLIDE 19

Union of Subspaces

  • X has structured sparsity: a linear combination of elements from a set of subspaces {Ug}
  • Atomic set: unit-norm vectors living in one of the Ug

Permutations and Rankings

  • X a sum of a few permutation matrices
  • Examples: Multiobject Tracking, Ranked elections, BCS
  • Convex hull of permutation matrices: doubly stochastic matrices.
SLIDE 20
  • Moments: convex hull of [1, t, t², t³, t⁴, ...], t ∈ T, some basic set
  • System Identification, Image Processing, Numerical Integration, Statistical Inference
  • Solve with semidefinite programming

  • Cut matrices: sums of rank-one sign matrices
  • Collaborative Filtering, Clustering in Genetic Networks, Combinatorial Approximation Algorithms
  • Approximate with semidefinite programming

  • Low-rank Tensors: sums of rank-one tensors
  • Computer Vision, Image Processing, Hyperspectral Imaging, Neuroscience
  • Approximate with alternating least-squares

SLIDE 21

Atomic norms in sparse approximation

  • Greedy approximations (a sketch follows below)
  • Best n-term approximation to a function f in the convex hull of A:

‖f − f_n‖_{L2} ≤ c₀ ‖f‖_A / √n

  • Maurey, Jones, and Barron (1980s–90s)
  • DeVore and Temlyakov (1996)
  • Random Feature Heuristics (Rahimi and R, 2007)
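A toy matching-pursuit-style sketch of greedy n-term approximation over a finite dictionary (the dictionary and sizes are assumptions, not from the talk):

```python
# Greedy n-term approximation: repeatedly add the best-correlated atom.
import numpy as np

rng = np.random.default_rng(0)
d, num_atoms, n_terms = 64, 200, 10
A = rng.standard_normal((d, num_atoms))
A /= np.linalg.norm(A, axis=0)            # unit-norm atoms (columns)
f = A[:, :5] @ rng.standard_normal(5)     # target built from a few atoms

residual, f_n = f.copy(), np.zeros(d)
for _ in range(n_terms):
    k = np.argmax(np.abs(A.T @ residual))     # best-correlated atom
    coef = A[:, k] @ residual
    f_n += coef * A[:, k]
    residual -= coef * A[:, k]
print("relative L2 error:", np.linalg.norm(f - f_n) / np.linalg.norm(f))
```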

SLIDE 22
Tangent Cones

  • The set of directions that decrease the norm from x forms a cone:

T_A(x) = { d : ‖x + αd‖_A ≤ ‖x‖_A for some α > 0 }

  • x is the unique minimizer of “minimize ‖z‖_A subject to Φz = y” if the intersection of this cone with the null space of Φ equals {0}

[Figure: the sublevel set {z : ‖z‖_A ≤ ‖x‖_A} touching the affine space Φz = y at x]
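A small numeric probe of the tangent-cone definition, using the ℓ1 norm as the atomic norm (the α grid is an assumption):

```python
# Test whether a direction d lies in the l1-norm tangent cone at x
# by probing small alpha > 0.
import numpy as np

def in_tangent_cone(x, d, alphas=np.logspace(-6, 0, 25)):
    base = np.abs(x).sum()
    return any(np.abs(x + a * d).sum() <= base for a in alphas)

x = np.array([1.0, 0.0])
print(in_tangent_cone(x, np.array([-1.0, 0.5])))   # True: shrinks the norm
print(in_tangent_cone(x, np.array([1.0, 0.0])))    # False: grows the norm
```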

SLIDE 23

Mean Width

Support function: S_C(d) = sup_{x∈C} d′x

S_C(d) + S_C(−d) measures the width of C when projected onto the span of d.

Mean width: w(C) = ∫_{S^{p−1}} S_C(u) du
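A hedged Monte Carlo estimate of the mean width for C = conv(A), where the support function is just a maximum over atoms; the cross-polytope atoms and sample count are toy choices:

```python
# Estimate w(C) by averaging S_C(u) = max_a <a, u> over random unit
# directions, approximating the (normalized) sphere integral.
import numpy as np

rng = np.random.default_rng(0)
p = 3
atoms = np.hstack([np.eye(p), -np.eye(p)])        # cross-polytope atoms
u = rng.standard_normal((100_000, p))
u /= np.linalg.norm(u, axis=1, keepdims=True)     # uniform on S^{p-1}
support = (u @ atoms).max(axis=1)                 # S_C(u)
print("estimated mean width w(C):", support.mean())
```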

SLIDE 24
  • When does a random subspace U in R^p intersect a convex cone C only at the origin?
  • Gordon (1988): with high probability if

codim(U) ≥ p · w(C ∩ S^{p−1})²

where w(C ∩ S^{p−1}) = ∫_{S^{p−1}} S_C(u) du is the mean width.

  • Corollary: for inverse problems, if Φ is a random Gaussian matrix with n rows, exact recovery of x needs

n ≥ p · w(T_A(x) ∩ S^{p−1})²

SLIDE 25
Rates

  • Hypercube: n ≥ p/2
  • Sparse vectors, dimension p, sparsity s: n ≥ 2s log(p/s) + 5s/4
  • Block sparse, M groups (possibly overlapping), maximum group size B, k active groups: n ≥ k(√(2 log(M − k)) + √B)² + kB
  • Low-rank matrices, p1 × p2 (p1 < p2), rank r: n ≥ 3r(p1 + p2 − r)
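Plugging toy sizes into these formulas (the numbers are illustrative only):

```python
# Quick numeric read of the sample-complexity rates above.
import numpy as np

p, s = 10_000, 50
print("sparse:", 2 * s * np.log(p / s) + 1.25 * s)   # a few hundred measurements

p1, p2, r = 1000, 1000, 10
print("low rank:", 3 * r * (p1 + p2 - r))            # ~60,000 measurements
```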

SLIDE 26
Robust Recovery (deterministic)

  • Suppose we observe y = Φx + w with ‖w‖₂ ≤ δ
  • If x̂ is an optimal solution of

minimize ‖z‖_A subject to ‖Φz − y‖ ≤ δ

then ‖x̂ − x‖ ≤ 2δ/ε, provided that

n ≥ p · w(T_A(x) ∩ S^{p−1})² / (1 − ε)²
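A hedged sketch of the constrained recovery problem with the ℓ1 norm standing in for a generic atomic norm, assuming CVXPY is available:

```python
# Noise-constrained recovery: minimize ||z||_1 s.t. ||Phi z - y||_2 <= delta.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, s, delta = 80, 200, 5, 0.1
Phi = rng.standard_normal((n, p))
x = np.zeros(p); x[:s] = 1.0
w = rng.standard_normal(n); w *= delta / np.linalg.norm(w)   # ||w||_2 <= delta
y = Phi @ x + w

z = cp.Variable(p)
prob = cp.Problem(cp.Minimize(cp.norm(z, 1)),
                  [cp.norm(Phi @ z - y, 2) <= delta])
prob.solve()
print("recovery error:", np.linalg.norm(z.value - x))
```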

SLIDE 27
Robust Recovery (statistical)

  • Suppose we observe y = Φx + w
  • If x̂ is an optimal solution of

minimize ‖Φz − y‖² + μ‖z‖_A

with μ ≥ E_w[‖Φ∗w‖_A∗] (the dual atomic norm), then

‖Φx − Φx̂‖₂ ≤ √(μ‖x‖_A)

  • And under an additional “cone condition” on cone{ u : ‖x + u‖_A ≤ ‖x‖_A + γ‖u‖ },

‖x − x̂‖₂ ≤ η(x, A, Φ, γ)·μ

Bhaskar, Tang, and Recht 2011
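For the special case Φ = I with the ℓ1 norm as the atomic norm, the regularized problem has a closed-form soft-thresholding solution; a sketch (the 0.5 scaling of the loss and the threshold choice are my assumptions):

```python
# min 0.5*||z - y||^2 + mu*||z||_1 is solved coordinatewise by
# soft-thresholding at level mu.
import numpy as np

def denoise_l1(y, mu):
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

rng = np.random.default_rng(0)
p, s, sigma = 500, 10, 0.5
x = np.zeros(p); x[:s] = 5.0
y = x + sigma * rng.standard_normal(p)
mu = sigma * np.sqrt(2 * np.log(p))      # the usual universal threshold
x_hat = denoise_l1(y, mu)
print("mean squared error:", np.mean((x_hat - x) ** 2))
```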

SLIDE 28
Denoising Rates (re-derivations)

  • Sparse vectors, dimension p, sparsity s:

(1/p)·‖x̂ − x⋆‖₂² = O(σ² s log(p) / p)

  • Low-rank matrices, p1 × p2 (p1 < p2), rank r:

(1/(p1 p2))·‖x̂ − x⋆‖_F² = O(σ² r / p1)

SLIDE 29

Atomic Norm Minimization

IDEA:

minimize ‖z‖_A subject to Φz = y

  • Generalizes existing, powerful methods
  • Rigorous formula for developing new analysis algorithms
  • Tightest bounds on the number of measurements needed for model recovery in all common models
  • One algorithm prototype for many data-mining applications

Chandrasekaran, Recht, Parrilo, and Willsky 2010

SLIDE 30
Learning representations

  • ASSUME: very sparse vectors, s < N^{1/2}/log(N); a very incoherent dictionary (much more than RIP), so that ⟨Φx, Φz⟩ ≈ ⟨x, z⟩; the number of observations is much bigger than N
  • The Gram matrix of the observed y vectors indicates overlapping support
  • Use graph algorithms to identify single dictionary elements, one at a time (see the sketch below)

Arora, Ge, and Moitra; Agarwal, Anandkumar, and Netrapalli
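A toy sketch of the Gram-matrix idea: samples whose sparse codes share a dictionary element have noticeably correlated observations, so thresholding the Gram matrix yields an overlap graph; the sizes, the ±1 coefficients, and the 0.5 threshold are all assumptions:

```python
# Threshold the Gram matrix of observations to estimate support overlap.
import numpy as np

rng = np.random.default_rng(0)
N, p, s, m = 50, 100, 3, 200                 # dict size, dim, sparsity, samples
Phi = rng.standard_normal((p, N)) / np.sqrt(p)   # incoherent-ish dictionary
codes = np.zeros((m, N))
for t in range(m):
    codes[t, rng.choice(N, s, replace=False)] = rng.choice([-1.0, 1.0], s)
Y = codes @ Phi.T                            # observations y = Phi x

G = Y @ Y.T                                  # Gram matrix: <y_i, y_j> ~ <x_i, x_j>
overlap_graph = np.abs(G) > 0.5              # assumed threshold
true_overlap = (codes @ codes.T) != 0
print("edge agreement:", (overlap_graph == true_overlap).mean())
```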

SLIDE 31

Extended representations

C = π(K ∩ L)

convex body = linear map applied to (cone ∩ affine space)

SLIDE 32

C = π(K ∩ L)

  • ℓ1 ball (cross-polytope, vertices (±1, 0), (0, ±1)): K = R^{2d}_+, L = { y : Σ_{i=1}^{2d} y_i = 1 }, π = [I  −I]
  • Hypercube (vertices (±1, ±1)): K = R^{2d}_+, L = { y : y_i + y_{i+d} = 1, 1 ≤ i ≤ d }, π = [I  −I]
  • Nuclear-norm ball: K = S^{d1+d2}_+, L = { Z : trace(Z) = 1 }, π([[A, B], [Bᵀ, C]]) = B
  • Moment body: K = S^{d+1}_+, L = { Z = [[T, x], [xᵀ, u]] : T Toeplitz, T₁₁ = u = 1 }, π([[T, x], [xᵀ, u]]) = x
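A numeric check of the first lift in the list, verifying that the cross-polytope is the image of the simplex under π = [I −I]; the slack-splitting helper is mine, not from the slides:

```python
# Any x with ||x||_1 <= 1 lifts to y >= 0 with sum(y) = 1 and x = y[:d] - y[d:].
import numpy as np

def lift_l1(x):
    # split into positive/negative parts, then pad with slack so sum(y) = 1;
    # adding equal slack to a +/- pair cancels under pi = [I, -I]
    yp, yn = np.maximum(x, 0), np.maximum(-x, 0)
    slack = (1.0 - np.abs(x).sum()) / 2.0      # requires ||x||_1 <= 1
    yp[0] += slack; yn[0] += slack
    return np.concatenate([yp, yn])

x = np.array([0.3, -0.5, 0.1])
y = lift_l1(x)
d = len(x)
assert np.all(y >= 0) and np.isclose(y.sum(), 1.0)   # y in K ∩ L
assert np.allclose(y[:d] - y[d:], x)                 # pi recovers x
print("lift verified")
```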

SLIDE 33

Extended representations

C = π(K ∩ L): linear map π, cone K, affine space L

C has a lift into K if there are maps A : C → K, B : C∗ → K∗ such that for all extreme points x ∈ C and y ∈ C∗ (the polar body, C∗ = { y : ⟨x, y⟩ ≤ 1 for all x ∈ C }):

1 − ⟨x, y⟩ = ⟨A(x), B(y)⟩

Gouveia, Parrilo, and Thomas

Representation learning becomes matrix factorization.

SLIDE 34

Learning extended representations?

C = π(K ∩ L)

convex body, linear map, cone, affine space

  • Learning the representation through NMF?
  • Ties immediately with Gaussian width analysis
  • Could obviate graph-structured arguments
  • What are the right features?