
Analyzing Task Driven Learning Algorithms: Final Presentation (Mike Pekala)



  1. Analyzing Task Driven Learning Algorithms
     Final Presentation
     Mike Pekala, May 1, 2012
     Advisor: Prof. Doron Levy (dlevy at math.umd.edu)
     UMD Dept. of Mathematics & Center for Scientific Computation and Mathematical Modeling (CSCAMM)

  2. Overview
     - Project Overview
     - Existing Algorithm Implementation/Validation
       - Sparse Reconstruction: Least Angle Regression (LARS) [Efron et al., 2004]; Feature-Sign [Lee et al., 2007]; non-negative and incremental Cholesky variants
       - Dictionary Learning: Task-Driven Dictionary Learning (TDDL) [Mairal et al., 2010]
     - Application/Analysis to New (Publicly Available) Datasets
       - Hyperspectral Imagery: Urban [US Army Corps of Engineers, 2012]; USGS Hyperspectral Library [Clark et al., 2007]

  3. Topic: Sparse Reconstruction
     [Diagram: a signal x represented as x = Φα, with α sparse.]

  4. Sparse Reconstruction: Penalized Least Squares
     Recall the Lasso: given Φ = [φ_1, ..., φ_p] ∈ R^{m×p} and t ∈ R_+, solve
         min_α ||x − Φα||_2^2   s.t.   ||α||_1 ≤ t
     which has an equivalent unconstrained formulation
         min_α ||x − Φα||_2^2 + λ||α||_1
     for some scalar λ ≥ 0. The L1 penalty improves upon OLS by introducing parsimony (feature selection) and regularization (improved generality).
     Many ways to solve this problem, e.g.
     1. Directly, via convex optimization (can be expensive)
     2. Iterative techniques: forward selection ("matching pursuit"), forward stagewise, and others, including
        - Least Angle Regression (LARS) [Efron et al., 2004]
        - Feature-Sign [Lee et al., 2007]
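
Not part of the original slides: a minimal proximal-gradient (ISTA) sketch of the unconstrained Lasso above, assuming NumPy. The 1/2 scaling of the quadratic term and the function names are my own choices; the project's implementations use LARS and Feature-Sign instead.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(Phi, x, lam, n_iter=500):
    """Solve min_a 0.5*||x - Phi a||_2^2 + lam*||a||_1 by proximal gradient (ISTA)."""
    p = Phi.shape[1]
    alpha = np.zeros(p)
    L = np.linalg.norm(Phi, 2) ** 2           # Lipschitz constant of the smooth gradient
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ alpha - x)      # gradient of the least-squares term
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```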

  5. Sparse Reconstruction: LARS Properties
     Full details in [Efron et al., 2004]. Why is it good?
     - Less aggressive than some greedy techniques; less likely to eliminate useful predictors when predictors are correlated.
     - More efficient than Forward Selection, which can take thousands of tiny steps towards a final model.
     Some properties:
     - (Theorem 1) Assuming covariates are added/removed one at a time from the active set, the complete LARS solution path yields all Lasso solutions.
     - (Sec. 3.1) With a change to the covariate selection rule, LARS can be modified to solve the Positive Lasso problem.
     - (Sec. 7) The cost of LARS is comparable to that of a least squares fit on m variables.
     - The LARS sequence incrementally generates a Cholesky factorization of Φ^T Φ in a very specific order.
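
The last property (the incremental Cholesky factorization of Φ^T Φ) can be illustrated with a rank-one column update. This is only a sketch in NumPy, not the author's implementation, and `chol_add_column` is a hypothetical helper name; adding a covariate to the active set costs one triangular solve.

```python
import numpy as np

def chol_add_column(R, Phi_active, phi_new):
    """Given upper-triangular R with R.T @ R = Phi_active.T @ Phi_active,
    return the updated factor after appending column phi_new to the active set."""
    if R is None:                                    # first column entering the active set
        return np.array([[np.sqrt(phi_new @ phi_new)]])
    b = Phi_active.T @ phi_new
    w = np.linalg.solve(R.T, b)                      # forward solve R^T w = b
    d = np.sqrt(phi_new @ phi_new - w @ w)           # new diagonal entry
    k = R.shape[0]
    R_new = np.zeros((k + 1, k + 1))
    R_new[:k, :k] = R
    R_new[:k, k] = w
    R_new[k, k] = d
    return R_new
```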

  6. Sparse Reconstruction: LARS Relationship to OLS
     (Eq. 2.22) Successive LARS estimates µ̂_k always approach, but never reach, the OLS estimate x̄_k (except possibly on the final iteration).
     [Figure: in the plane spanned by φ_1 and φ_2, the LARS estimate µ̂_2 approaches the OLS solution x̄_2.]
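
A quick numerical check of this relationship using scikit-learn's `lars_path` (my assumption; the slides rely on the author's own Matlab code): the full LAR path should terminate at the OLS fit.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)

_, _, coefs = lars_path(X, y, method="lar")        # columns are coefficients along the path
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(coefs[:, -1], beta_ols))         # final LARS step reaches the OLS solution
```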

  7. Sparse Reconstruction: LARS Implementation/Validation
     [Figure: diabetes validation test, coefficient paths α_i plotted against ||α||_1 for the 10 predictors; n = 10, m = 442.]
     Matches Figure 1 in [Efron et al., 2004].
     Also validated by comparing orthogonal designs with the theoretical result.
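
For reference, the same diabetes coefficient-path figure can be reproduced with scikit-learn and matplotlib (my assumption; this is not the validation code behind the slide):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)          # 442 samples, 10 predictors
_, _, coefs = lars_path(X, y, method="lasso")  # Lasso-modified LARS path

l1_norms = np.abs(coefs).sum(axis=0)
plt.plot(l1_norms, coefs.T)
plt.xlabel("||alpha||_1")
plt.ylabel("coefficients")
plt.title("LARS/Lasso path on the diabetes data (cf. Figure 1, Efron et al., 2004)")
plt.show()
```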

  8. Sparse Reconstruction: Feature-Sign Properties
     Full details in [Lee et al., 2007]. Why is it good?
     - Very efficient; reported performance gains over LARS.
     - Can be initialized with arbitrary starting coefficients.
     - Simple to implement.
     - One half of a two-part algorithm for matrix factorization.
     Some properties:
     - Tries to search for, or "guess", the signs of the coefficients. Knowing the signs reduces the Lasso to an unconstrained quadratic program (QP) with a closed-form solution.
     - Iteratively refines these sign guesses; involves an intermediate line search.
     - The objective function strictly decreases.
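
The "closed-form solution once the signs are known" step can be written in a few lines. This is a sketch assuming NumPy, with the active set and sign vector theta given; the sign-guessing and line-search parts of the full Feature-Sign algorithm are omitted.

```python
import numpy as np

def signed_qp_solution(Phi_active, x, lam, theta):
    """Closed-form minimizer of ||x - Phi_active a||_2^2 + lam * theta.T @ a,
    i.e. the Lasso restricted to a fixed active set with known coefficient signs theta."""
    G = Phi_active.T @ Phi_active
    return np.linalg.solve(G, Phi_active.T @ x - lam * theta / 2.0)
```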

  9. Sparse Reconstruction: Feature-Sign Implementation/Validation
     - Implemented the non-negative extension. Performance hit (at least with my implementation), since the unconstrained QP becomes a constrained QP; solved using Matlab's quadprog().
     - Validated by comparing results with LARS.
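
As an alternative to the quadprog-based constrained QP described above, a non-negative sparse code can also be obtained with a simple projected-gradient iteration. This is only an illustrative sketch (assumed NumPy, hypothetical function name), not the author's Matlab implementation.

```python
import numpy as np

def nonneg_lasso_pg(Phi, x, lam, n_iter=1000):
    """Projected gradient for min_{a >= 0} 0.5*||x - Phi a||_2^2 + lam * sum(a)."""
    p = Phi.shape[1]
    alpha = np.zeros(p)
    L = np.linalg.norm(Phi, 2) ** 2              # Lipschitz constant of the quadratic part
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ alpha - x) + lam   # on the nonnegative orthant, ||a||_1 = sum(a)
        alpha = np.maximum(alpha - grad / L, 0.0)  # project back onto a >= 0
    return alpha
```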

  10. Topic: Dictionary Learning
      [Diagram: x = Φα, with the dictionary Φ now learned from the data.]

  11. Dictionary Learning for Sparse Reconstruction
      Following the notation/development of [Mairal et al., 2010].
      Given: a training data set of signals X = [x_1, ..., x_n] in R^{m×n}.
      Goal: design a dictionary D in R^{m×p} (possibly with p > m, i.e. an overcomplete dictionary) by minimizing the empirical cost function
          g_n(D) ≜ (1/n) Σ_{i=1}^n ℓ_u(x_i, D)
      where the unsupervised loss function ℓ_u is small when D is "good" at representing x_i sparsely.
      In [Mairal et al., 2010], the authors use the elastic-net formulation:
          ℓ_u(x, D) ≜ min_{α ∈ R^p} (1/2)||x − Dα||_2^2 + λ_1||α||_1 + (λ_2/2)||α||_2^2        (1)
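
A minimal sketch of evaluating the inner elastic-net problem in (1) by proximal gradient, assuming NumPy; in the project, α⋆ is computed with LARS or Feature-Sign, so this is only illustrative.

```python
import numpy as np

def elastic_net_code(D, x, lam1, lam2, n_iter=500):
    """Sparse code minimizing 0.5*||x - D a||_2^2 + lam1*||a||_1 + 0.5*lam2*||a||_2^2,
    i.e. equation (1), by proximal gradient; the l2 term is folded into the smooth part."""
    p = D.shape[1]
    alpha = np.zeros(p)
    L = np.linalg.norm(D, 2) ** 2 + lam2              # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x) + lam2 * alpha   # gradient of the smooth terms
        v = alpha - grad / L
        alpha = np.sign(v) * np.maximum(np.abs(v) - lam1 / L, 0.0)   # soft-threshold step
    return alpha
```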

  12. Dictionary Learning for Sparse Reconstruction
      To prevent artificially improving ℓ_u by arbitrarily scaling D, one typically constrains the set of permissible dictionaries:
          D ≜ { D ∈ R^{m×p} s.t. ||d_j||_2 ≤ 1 for all j ∈ {1, ..., p} }
      Optimizing the empirical cost g_n can be very expensive when the training set is large (as is often the case in dictionary learning problems). However, in reality one usually wants to minimize the expected loss
          g(D) ≜ E_x[ℓ_u(x, D)] = lim_{n→∞} g_n(D)   a.s.
      (where the expectation is taken with respect to the unknown distribution of data objects, p(x)).
      In these cases, online stochastic techniques have been shown to work well [Mairal et al., 2009].
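
Enforcing the constraint set D after each dictionary update amounts to projecting every column onto the unit Euclidean ball; a one-function sketch (assumed NumPy, hypothetical name):

```python
import numpy as np

def project_dictionary(D):
    """Project each column of D onto the unit Euclidean ball, enforcing ||d_j||_2 <= 1."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)   # columns already inside the ball are left unchanged
```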

  13. Dictionary Learning: Classification and Sparse Reconstruction
      Consider the classification task.
      Given: a fixed dictionary D, an observation x ∈ X ⊆ R^m, and a sparse representation α⋆(x, D) of the observation (so x ≈ Dα⋆).
      Goal: identify the associated label y ∈ Y, where Y is a finite set of labels (it would be a subset of R^q for regression).
      Assume D is fixed and α⋆(x, D) will be used as the features for predicting y. The classification problem is to learn the model parameters W by solving
          min_{W ∈ W} f(W) + (ν/2)||W||_F^2
      where f(W) ≜ E_{y,x}[ℓ_s(y, W, α⋆(x, D))] and ℓ_s is a convex loss function (e.g. logistic) adapted to the supervised learning problem.
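
With D fixed, this step is ordinary regularized linear classification on the sparse codes. A sketch assuming scikit-learn and the `elastic_net_code` helper sketched earlier; the mapping from ν to LogisticRegression's C is only approximate (it ignores sample-size scaling).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_classifier_on_codes(D, X, y, lam1, lam2, nu):
    """With D fixed, encode each column of X and fit an l2-regularized logistic model."""
    A = np.column_stack([elastic_net_code(D, X[:, i], lam1, lam2) for i in range(X.shape[1])])
    clf = LogisticRegression(C=1.0 / nu)   # l2 penalty standing in for (nu/2)||W||_F^2
    clf.fit(A.T, y)                        # rows = samples (sparse codes), labels y
    return clf
```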

  14. Dictionary Learning: Task-Driven Dictionary Learning for Classification
      Now we wish to jointly learn D and W:
          min_{D ∈ D, W ∈ W} f(D, W) + (ν/2)||W||_F^2                                  (2)
      where f(D, W) ≜ E_{y,x}[ℓ_s(y, W, α⋆(x, D))].
      Example (two loss functions: the 0-1 loss and its logistic surrogate):
      - Binary classification: Y = {−1, +1}
      - Linear model: w ∈ R^p
      - Prediction: sign(w^T α⋆(x, D))
      - Logistic loss: ℓ_s = log(1 + e^{−y w^T α⋆})
      [Figure: the logistic loss plotted alongside the 0-1 loss.]
      The joint problem becomes
          min_{D ∈ D, w ∈ R^p} E_{y,x}[ log(1 + e^{−y w^T α⋆(x, D)}) ] + (ν/2)||w||_2^2        (3)
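
For the logistic example in (3), the loss and the gradients needed later (with respect to w and to α⋆) can be written as a small NumPy sketch; the function name is mine, not from the slides.

```python
import numpy as np

def logistic_loss_and_grads(y, w, alpha):
    """Logistic loss log(1 + exp(-y * w.T alpha)) for y in {-1, +1}, with its
    gradients in w and in alpha (the latter feeds the beta* computation later)."""
    u = w @ alpha                                  # linear score w^T alpha
    loss = np.logaddexp(0.0, -y * u)               # log(1 + exp(-y*u)), numerically stable
    s = -y * np.exp(-np.logaddexp(0.0, y * u))     # dL/du = -y / (1 + exp(y*u))
    grad_w = s * alpha
    grad_alpha = s * w
    return loss, grad_w, grad_alpha
```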

  15. Dictionary Learning: Solving the Problem
      Stochastic gradient descent is often used to minimize functions whose gradients are expectations. The authors of [Mairal et al., 2010] show that, under suitable conditions, equation (2) is differentiable on D × W, and that
          ∇_W f(D, W) = E_{y,x}[ ∇_W ℓ_s(y, W, α⋆) ]
          ∇_D f(D, W) = E_{y,x}[ −D β⋆ α⋆ᵀ + (x − D α⋆) β⋆ᵀ ]
      where β⋆ ∈ R^p is defined by the properties
          β⋆_{Λᶜ} = 0   and   β⋆_Λ = (D_Λᵀ D_Λ + λ_2 I)^{−1} ∇_{α_Λ} ℓ_s(y, W, α⋆)
      and Λ is the set of indices of the nonzero coefficients of α⋆(x, D).
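
The β⋆ defined above is cheap to compute once the active set Λ is known; a sketch assuming NumPy, with `grad_alpha_loss` standing for ∇_α ℓ_s(y, W, α⋆):

```python
import numpy as np

def compute_beta_star(D, alpha, grad_alpha_loss, lam2):
    """beta*: zero off the active set Lambda of alpha*, and
    (D_L^T D_L + lam2*I)^{-1} grad_{alpha_L} l_s on it."""
    p = D.shape[1]
    Lam = np.flatnonzero(alpha)                 # active set Lambda
    beta = np.zeros(p)
    if Lam.size:
        D_L = D[:, Lam]
        A = D_L.T @ D_L + lam2 * np.eye(Lam.size)
        beta[Lam] = np.linalg.solve(A, grad_alpha_loss[Lam])
    return beta
```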

  16. Algorithm: SGD for Task-Driven Dictionary Learning [Mairal et al., 2010]
      Input: p(y, x) (a way to draw samples i.i.d. from p); regularization parameters λ_1, λ_2, ν ∈ R; initial dictionary D_0 ∈ D; initial model W_0 ∈ W; number of iterations T; learning rate parameters t_0, ρ ∈ R.
      for t = 1 to T do
         1. Draw (y_t, x_t) from p(y, x) (mini-batch of size 200)
         2. Compute α⋆ via sparse coding (LARS, Feature-Sign)
         3. Determine the active set Λ and β⋆
         4. Update the learning rate ρ_t
         5. Take a projected gradient descent step
      end
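
Putting the pieces together, a compressed sketch of this loop for the binary logistic case, reusing the `elastic_net_code`, `logistic_loss_and_grads`, and `compute_beta_star` sketches above; `draw_batch` and the decaying learning-rate schedule are my assumptions, not the author's code.

```python
import numpy as np

def tddl_sgd(draw_batch, D, w, lam1, lam2, nu, T, t0, rho):
    """Stochastic projected gradient for TDDL (binary logistic case).
    draw_batch() is assumed to return a list of (y, x) pairs drawn from p(y, x)."""
    for t in range(1, T + 1):
        rho_t = min(rho, rho * t0 / t)                   # decaying learning rate
        grad_D = np.zeros_like(D)
        grad_w = np.zeros_like(w)
        batch = draw_batch()
        for y, x in batch:
            alpha = elastic_net_code(D, x, lam1, lam2)   # step 2: sparse coding
            _, gw, galpha = logistic_loss_and_grads(y, w, alpha)
            beta = compute_beta_star(D, alpha, galpha, lam2)   # step 3
            grad_w += gw
            grad_D += -D @ np.outer(beta, alpha) + np.outer(x - D @ alpha, beta)
        n = len(batch)
        w -= rho_t * (grad_w / n + nu * w)               # gradient step on w (with l2 term)
        D -= rho_t * grad_D / n                          # gradient step on D
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # project columns onto the unit ball
    return D, w
```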

  17. Dictionary Learning: TDDL Implementation/Validation
      Matched experimental results on the USPS [Hastie et al., 2009] data set with those reported in [Mairal et al., 2010]:

      Digit | # in D_0 | ρ    | λ | Runtime (h) | Accuracy
      0     | 10       | .150 | 5 | 8.2         | .926
      1     | 10       | .225 | 7 | 7.1         | .990
      2     | 10       | .225 | 7 | 6.8         | .972
      3     | 10       | .225 | 7 | 7.4         | .968
      4     | 10       | .225 | 4 | 7.6         | .971
      5     | 10       | .225 | 4 | 7.2         | .972
      6     | 10       | .225 | 2 | 7.5         | .969
      7     | 10       | .175 | 5 | 7.9         | .983
      8     | 10       | .200 | 3 | 8.5         | .951
      9     | 10       | .200 | 3 | 8.1         | .969

      Mean accuracy: .967 (reported: .964)

  18. Topic: Hyperspectral Imaging
      [Figure: "Smith Island, Near IR" image; the hyperspectral data cube is indexed by (x, y, λ), and an example pixel spectrum is plotted against wavelength (µm), roughly 0.6 to 2.4 µm.]
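
To feed a hyperspectral (x, y, λ) cube to the sparse-coding machinery above, one typically flattens it so that each pixel spectrum becomes a column; a small, assumed NumPy helper (not from the slides):

```python
import numpy as np

def cube_to_spectra(cube):
    """Flatten an (nx, ny, n_bands) hyperspectral cube into a matrix X in R^{m x n}
    whose columns are pixel spectra (m = n_bands, n = nx*ny)."""
    nx, ny, m = cube.shape
    return cube.reshape(nx * ny, m).T
```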
