

SLIDE 1

Greedy Sparsity-Constrained Optimization

Sohail Bahmani, with Petros Boufounos and Bhiksha Raj

45th Asilomar Conference, Nov. 2011

SLIDE 2

Outline

  • Background
  • Compressed Sensing
  • Problem Formulation
  • Generalizing Compressed Sensing
  • Example
  • Prior Work
  • GraSP Algorithm
  • Main Result
  • Required Conditions
  • Example: ℓ2-regularized Logistic Regression

SLIDE 3

Compressed Sensing (1)

  • Applications: biomedical imaging, image denoising, image segmentation, filter design, system identification, etc.

Linear Inverse Problem

  • Sparse signal $\mathbf{y}^\star \in \mathbb{R}^q$
  • Measurement matrix $\mathbf{B} \in \mathbb{R}^{o \times q}$
  • Measurements $\mathbf{z} = \mathbf{B}\mathbf{y}^\star + \mathbf{f}$, with noise $\mathbf{f} \in \mathbb{R}^o$
  • Given $\mathbf{z}$ and $\mathbf{B}$ with $o \ll q$, estimate $\mathbf{y}^\star$
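As a concrete instance of this measurement model, here is a minimal NumPy sketch that generates a synthetic $t$-sparse signal and noisy Gaussian measurements. The sizes, the Gaussian matrix, and the noise level are illustrative assumptions, not from the slides; later sketches reuse these names.

```python
import numpy as np

rng = np.random.default_rng(0)
o, q, t = 50, 200, 5                 # o measurements, ambient dimension q, sparsity t

y_star = np.zeros(q)                 # t-sparse ground truth
support = rng.choice(q, size=t, replace=False)
y_star[support] = rng.standard_normal(t)

B = rng.standard_normal((o, q)) / np.sqrt(o)   # measurement matrix
f = 0.01 * rng.standard_normal(o)              # noise
z = B @ y_star + f                             # measurements z = B y* + f
```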

SLIDE 4

Compressed Sensing (2)

ℓ0-minimization (L0): $\arg\min_{\mathbf{y}} \|\mathbf{y}\|_0$ subject to $\|\mathbf{B}\mathbf{y} - \mathbf{z}\|_2 \le \vartheta$

ℓ1-minimization (L1): $\arg\min_{\mathbf{y}} \|\mathbf{y}\|_1$ subject to $\|\mathbf{B}\mathbf{y} - \mathbf{z}\|_2 \le \vartheta$

ℓ0-constrained LS (C0): $\arg\min_{\mathbf{y}} \|\mathbf{B}\mathbf{y} - \mathbf{z}\|_2^2$ subject to $\|\mathbf{y}\|_0 \le t$

ℓ1-constrained LS (C1): $\arg\min_{\mathbf{y}} \|\mathbf{B}\mathbf{y} - \mathbf{z}\|_2^2$ subject to $\|\mathbf{y}\|_1 \le S$

Here $\|\mathbf{y}\|_1 = \sum_{j=1}^{q} |y_j|$ and $\|\mathbf{y}\|_0 = |\mathrm{supp}(\mathbf{y})| = \sum_{j=1}^{q} \mathbb{1}[y_j \ne 0]$.

Two ways around the intractable ℓ0 problems:
  • (1) Convexify: use the ℓ1-norm as a proxy for the ℓ0-pseudonorm, turning (L0) into (L1) and (C0) into (C1)
  • (Greedy) approximate solvers that attack (L0)/(C0) directly
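To make the convexification route concrete, a minimal sketch of program (L1) using cvxpy (an assumed dependency; any generic convex solver would do), with `B` and `z` from the earlier synthetic sketch and `eps` playing the role of $\vartheta$:

```python
import cvxpy as cp

def l1_min(B, z, eps):
    """Solve (L1): minimize ||y||_1 subject to ||B y - z||_2 <= eps."""
    y = cp.Variable(B.shape[1])
    problem = cp.Problem(cp.Minimize(cp.norm1(y)),
                         [cp.norm2(B @ y - z) <= eps])
    problem.solve()
    return y.value
```

The greedy route is the subject of the GraSP algorithm later in the deck.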

SLIDE 5

Generalizing Compressed Sensing

  • Common assumptions in CS:
  • The relation between the input and the response is linear: $\mathbf{z} = \mathbf{B}\mathbf{y} + \mathbf{f}$
  • The error is measured by the squared error: $g(\mathbf{y}) = \|\mathbf{B}\mathbf{y} - \mathbf{z}\|_2^2$
  • Instead, consider nonlinear relations and other measures of fidelity

General Formulation: Let $g : \mathbb{R}^q \to \mathbb{R}$ be a cost function. Approximate the solution to
$$\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \; g(\mathbf{y}) \quad \text{subject to} \quad \|\mathbf{y}\|_0 \le t.$$

  • For $g(\mathbf{y}) = \|\mathbf{B}\mathbf{y} - \mathbf{z}\|_2^2$ we recover the ℓ0-constrained least-squares formulation (C0) of CS
  • We will see the ℓ2-regularized logistic loss as another example of $g(\mathbf{y})$
  • More generally, $g(\mathbf{y})$ can be the empirical loss associated with some observations in a statistical estimation problem
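In code, instantiating the general formulation just means supplying a cost and its gradient. A sketch for the least-squares special case (function names are illustrative); the GraSP sketch on Slide 8 uses exactly this gradient:

```python
def g_ls(y, B, z):
    """Squared-error cost g(y) = ||B y - z||_2^2 from the CS setting."""
    r = B @ y - z
    return r @ r

def grad_g_ls(y, B, z):
    """Gradient of g_ls: 2 B^T (B y - z)."""
    return 2.0 * B.T @ (B @ y - z)
```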

SLIDE 6

Example

  • Gene selection problem
  • Data points $\mathbf{b} \in \mathbb{R}^q$: gene expression coefficients obtained from tissue samples
  • Labels $z \in \{0, 1\}$: indicate healthy ($z = 0$) vs. cancer ($z = 1$) samples
  • Observations: $o$ iid instances $\{(\mathbf{b}_j, z_j)\}_{j=1}^{o}$
  • Restriction: fewer samples than dimensions, i.e., $o < q$
  • Goal: find $t \ll q$ entries (i.e., variables) of the data points $\mathbf{b}$ from which the label $z$ can be predicted with the least "error"

  • MLE
  • $z \mid \mathbf{b}$ has a likelihood function that depends on a $t$-sparse parameter vector $\mathbf{y}$
  • Minimize the loss (equivalent to maximizing the joint likelihood) to estimate the true parameter $\mathbf{y}^\star$
  • Nonlinearity enters through the empirical loss: $g(\mathbf{y}) = \frac{1}{o} \sum_{j=1}^{o} -\log m(\mathbf{y};\, \mathbf{b}_j, z_j)$

SLIDE 7

Prior Work

  • In the statistical estimation framework: convex $g$ + ℓ1-regularization
  • Kakade et al. [AISTATS'09]: loss functions from the exponential family
  • Negahban et al. [NIPS'09]: M-estimators and "decomposable" norms
  • Agarwal et al. [NIPS'10]: projected gradient descent with an ℓ1-constraint
  • Issue: optimal sparsity cannot be guaranteed, because
  • nonlinearity causes solution-dependent error bounds that can become very large
  • ℓ1-regularization is merely a proxy for inducing sparsity
  • We instead consider a greedy algorithm for the problem
  • The algorithm enforces sparsity directly
  • It generally has lower computational complexity

SLIDE 8

Algorithm: Gradient Support Pursuit (GraSP)

Input: $g(\cdot)$ and $t$. Output: $\hat{\mathbf{y}}$.

  0. Initialize $\hat{\mathbf{y}} = \mathbf{0}$
  Repeat:
  1. Compute gradient: $\mathbf{A} = \nabla g(\hat{\mathbf{y}})$
  2. Identify coordinates: $\Omega = \mathrm{supp}(\mathbf{A}_{2t})$, the support of the $2t$ largest-magnitude gradient entries
  3. Merge supports: $\mathcal{U} = \mathrm{supp}(\hat{\mathbf{y}}) \cup \Omega$
  4. Find crude estimate: $\mathbf{c} = \arg\min_{\mathbf{y}} g(\mathbf{y})$ s.t. $\mathbf{y}|_{\mathcal{U}^{\mathrm{c}}} = \mathbf{0}$
  5. Prune: $\hat{\mathbf{y}} = \mathbf{c}_t$, keeping the $t$ largest-magnitude entries
  Until the halting condition holds

Inspired by the CoSaMP algorithm [Needell & Tropp '09]. Step 4 is tractable because $g$ obeys certain conditions. A sketch of the loop in code follows.
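A compact NumPy sketch of the loop for the least-squares cost of Slide 5. The function name, the fixed-iteration halting rule, and solving step 4 by restricted least squares are illustrative assumptions; for a general $g$, step 4 would call an inner solver restricted to $\mathcal{U}$.

```python
import numpy as np

def grasp_ls(B, z, t, iters=20):
    """GraSP sketch for g(y) = ||B y - z||_2^2 (step 4 via restricted LS)."""
    q = B.shape[1]
    y_hat = np.zeros(q)                               # 0. initialize
    for _ in range(iters):                            # halting rule: fixed count (assumed)
        A = 2.0 * B.T @ (B @ y_hat - z)               # 1. gradient of the LS cost
        omega = np.argsort(np.abs(A))[-2 * t:]        # 2. top-2t gradient coordinates
        U = np.union1d(omega, np.flatnonzero(y_hat))  # 3. merge supports
        sol, *_ = np.linalg.lstsq(B[:, U], z, rcond=None)  # 4. minimize g over supp in U
        c = np.zeros(q)
        c[U] = sol
        keep = np.argsort(np.abs(c))[-t:]             # 5. prune to the t largest entries
        y_hat = np.zeros(q)
        y_hat[keep] = c[keep]
    return y_hat
```

With the synthetic data from the Slide 3 sketch, `grasp_ls(B, z, t)` should closely recover `y_star`.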

SLIDE 9

Main Result

Theorem: If $g$ satisfies certain properties, then the estimate obtained at the $j$-th iteration of GraSP obeys
$$\|\hat{\mathbf{y}}^{(j)} - \mathbf{y}^\star\|_2 \le \lambda^j \|\mathbf{y}^\star\|_2 + D \,\big\|\nabla g(\mathbf{y}^\star)|_{\mathcal{I}}\big\|_2,$$
where $\mathcal{I}$ contains the indices of the $3t$ largest coordinates of $\nabla g(\mathbf{y}^\star)$ in magnitude.

  • For $\lambda < 1$ (i.e., a contraction factor) we get a linear rate of convergence up to an approximation error
  • In statistical estimation problems, $\nabla g(\mathbf{y}^\star)|_{\mathcal{I}}$ can be related to the statistical precision of the estimator
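The geometric error decay the theorem predicts can be observed numerically. A small check under the assumptions of the earlier sketches (reusing `B`, `z`, `y_star`, `t` from Slide 3 and `grasp_ls` from Slide 8; it simply re-runs from scratch for each iteration count):

```python
# Trace ||y^(j) - y*||_2 as the iteration budget j grows.
for j in range(1, 8):
    err = np.linalg.norm(grasp_ls(B, z, t, iters=j) - y_star)
    print(f"iter {j}: ||y_hat - y_star||_2 = {err:.3e}")
```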

SLIDE 10

Required Conditions

Definition (Stable Hessian Property): For $g : \mathbb{R}^q \to \mathbb{R}$ with Hessian $\nabla^2 g(\cdot)$, let
$$B_l(\mathbf{y}) := \sup \big\{ \mathbf{E}^{\mathsf{T}} \nabla^2 g(\mathbf{y})\, \mathbf{E} \;:\; |\mathrm{supp}(\mathbf{y}) \cup \mathrm{supp}(\mathbf{E})| \le l, \; \|\mathbf{E}\|_2 = 1 \big\},$$
$$C_l(\mathbf{y}) := \inf \big\{ \mathbf{E}^{\mathsf{T}} \nabla^2 g(\mathbf{y})\, \mathbf{E} \;:\; |\mathrm{supp}(\mathbf{y}) \cup \mathrm{supp}(\mathbf{E})| \le l, \; \|\mathbf{E}\|_2 = 1 \big\}.$$
Then we say $g$ satisfies SHP of order $l$ with constant $\nu_l$ if $\frac{B_l(\mathbf{y})}{C_l(\mathbf{y})} \le \nu_l$ for all $l$-sparse vectors $\mathbf{y}$.

  • SHP basically says that restrictions of the Hessian to sparse supports are well-conditioned
  • For $g(\mathbf{y}) = \frac{1}{2} \|\mathbf{B}\mathbf{y} - \mathbf{z}\|_2^2$, as in CS, SHP implies the Restricted Isometry Property:
$$\frac{1 + \varepsilon_l}{1 - \varepsilon_l} \le \nu_l \;\Rightarrow\; \varepsilon_l \le \frac{\nu_l - 1}{\nu_l + 1}$$
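For the quadratic CS cost the Hessian is the constant matrix $\mathbf{B}^{\mathsf{T}}\mathbf{B}$, so $\nu_l$ reduces to a ratio of extreme eigenvalues over $l$-column submatrices and can be brute-forced on toy problems. A sketch (hypothetical helper; exponential in $q$, so only feasible for small instances):

```python
from itertools import combinations

import numpy as np

def shp_constant_ls(B, l):
    """Brute-force nu_l for g(y) = 0.5*||B y - z||^2, whose Hessian is B^T B:
    ratio of the extreme eigenvalues over all l-column principal submatrices."""
    q = B.shape[1]
    hi, lo = 0.0, np.inf
    for J in combinations(range(q), l):
        cols = list(J)
        eig = np.linalg.eigvalsh(B[:, cols].T @ B[:, cols])  # ascending order
        hi, lo = max(hi, eig[-1]), min(lo, eig[0])
    return hi / lo
```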

SLIDE 11

Example: ℓ2-regularized Logistic Regression

  • Logistic model: $z \mid \mathbf{b}; \mathbf{y} \sim \mathrm{Bernoulli}\left(\frac{1}{1 + e^{-\langle \mathbf{b}, \mathbf{y} \rangle}}\right)$
  • For iid observation pairs $\{(\mathbf{b}_j, z_j)\}_{j=1}^{o}$, write the logistic loss as
$$\mathcal{L}(\mathbf{y}) := \frac{1}{o} \sum_{j=1}^{o} \Big[ \log\big(1 + e^{\langle \mathbf{b}_j, \mathbf{y} \rangle}\big) - z_j \langle \mathbf{b}_j, \mathbf{y} \rangle \Big].$$
  • ℓ2-regularized logistic regression with sparsity constraint:
$$\arg\min_{\mathbf{y}} \; g(\mathbf{y}) = \mathcal{L}(\mathbf{y}) + \frac{\theta}{2} \|\mathbf{y}\|_2^2 \quad \text{subject to} \quad \|\mathbf{y}\|_0 \le t.$$
  • We can show $\nu_l \le 1 + \frac{\beta_l}{4\theta}$, where $\beta_l = \max_{\mathcal{J} : |\mathcal{J}| \le l} \mu_{\max}(\mathbf{B}_{\mathcal{J}})$.
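A minimal NumPy sketch of this regularized objective and its gradient, e.g., for a GraSP variant whose step 4 runs an inner solver on the merged support (argument names are illustrative; `Bmat` stacks the $\mathbf{b}_j^{\mathsf{T}}$ as rows):

```python
import numpy as np

def logistic_cost(y, Bmat, zvec, theta):
    """g(y) = L(y) + (theta/2)*||y||_2^2, returned with its gradient."""
    o = Bmat.shape[0]
    u = Bmat @ y                                   # <b_j, y> for all j
    loss = np.mean(np.logaddexp(0.0, u) - zvec * u) + 0.5 * theta * (y @ y)
    p = 1.0 / (1.0 + np.exp(-u))                   # Bernoulli means
    grad = Bmat.T @ (p - zvec) / o + theta * y
    return loss, grad
```

The $\theta > 0$ term makes the Hessian at least $\theta \mathbf{I}$, so $C_l$ stays bounded away from zero, consistent with the $\nu_l$ bound above.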

SLIDE 12

Main Result Revisited

Theorem: If $g$ satisfies SHP of order $4t$ with constant $\nu_{4t} < \sqrt{2}$ and $C_{4t}(\mathbf{y}) > \vartheta$, then the estimate obtained at the $j$-th iteration of GraSP obeys
$$\|\hat{\mathbf{y}}^{(j)} - \mathbf{y}^\star\|_2 \le \big(\nu_{4t}^2 - 1\big)^j \|\mathbf{y}^\star\|_2 + \frac{2\nu_{4t} + 2}{\vartheta \big(2 - \nu_{4t}^2\big)} \big\|\nabla g(\mathbf{y}^\star)|_{\mathcal{I}}\big\|_2,$$
where $\mathcal{I}$ contains the indices of the $3t$ largest coordinates of $\nabla g(\mathbf{y}^\star)$ in magnitude. This instantiates the generic bound of Slide 9 with $\lambda = \nu_{4t}^2 - 1$ and $D = \frac{2\nu_{4t} + 2}{\vartheta(2 - \nu_{4t}^2)}$; the condition $\nu_{4t} < \sqrt{2}$ makes $\lambda < 1$ and keeps the denominator positive.

SLIDE 13

Summary

  • Extend CS results to nonlinear models and different error measures
  • ℓ1-regularization may not yield sufficiently sparse solutions because of the type of cost functions introduced by nonlinearities in the model
  • GraSP algorithm
  • A greedy method that always yields a sparse solution
  • Accuracy is guaranteed for the class of functions that satisfy SHP
  • Linear rate of convergence up to the approximation error
  • Some interesting problems to study
  • Deterministic results, e.g., using an equivalent of incoherence
  • Relaxing SHP to an entirely local condition