

SLIDE 1

Analyzing Task Driven Learning Algorithms

Final Presentation Mike Pekala May 1, 2012

Advisor: Prof. Doron Levy (dlevy at math.umd.edu) UMD Dept. of Mathematics & Center for Scientific Computation and Mathematical Modeling (CSCAMM)

SLIDE 2

Overview

Project Overview

Existing Algorithm Implementation/Validation
  - Sparse Reconstruction
      - Least Angle Regression (LARS) [Efron et al., 2004]
      - Feature-Sign [Lee et al., 2007]
      - Non-negative and incremental Cholesky variants
  - Dictionary Learning
      - Task-Driven Dictionary Learning (TDDL) [Mairal et al., 2010]

Application/Analysis to New (Publicly Available) Datasets
  - Hyperspectral Imagery
      - Urban [US Army Corps of Engineers, 2012]
      - USGS Hyperspectral Library [Clark et al., 2007]

SLIDE 3

Sparse Reconstruction

Topic: Sparse Reconstruction


(Figure: the sparse reconstruction model x = Φα.)

SLIDE 4

Sparse Reconstruction

Penalized Least Squares

Recall the Lasso: given Φ = [φ1, . . . , φp] ∈ R^{m×p} and t ∈ R+, solve

    min_α ||x − Φα||_2^2   s.t.   ||α||_1 ≤ t

which has an equivalent unconstrained formulation

    min_α ||x − Φα||_2^2 + λ||α||_1

for some scalar λ ≥ 0. The L1 penalty improves upon OLS by introducing parsimony (feature selection) and regularization (improved generality). There are many ways to solve this problem, e.g.

1. Directly, via convex optimization (can be expensive)
2. Iterative techniques (a minimal soft-thresholding sketch follows this list)
   - Forward selection ("matching pursuit"), forward stagewise, others
   - Least Angle Regression (LARS) [Efron et al., 2004]
   - Feature-Sign [Lee et al., 2007]
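For intuition only, here is a minimal iterative soft-thresholding (ISTA) sketch in Matlab for the unconstrained formulation. This is an illustrative stand-in, not one of the solvers implemented in this project, and the step size, iteration count, and synthetic data are arbitrary choices.

    % Minimal ISTA sketch for  min_a ||x - Phi*a||_2^2 + lambda*||a||_1
    % (illustration only; not the LARS / Feature-Sign solvers discussed here)
    m = 64; p = 256;
    Phi = randn(m, p);                          % synthetic dictionary
    a_true = zeros(p, 1);
    idx = randperm(p); a_true(idx(1:5)) = randn(5, 1);
    x = Phi * a_true + 0.01 * randn(m, 1);      % sparse signal plus noise

    lambda = 0.1;
    L = 2 * norm(Phi)^2;                        % Lipschitz constant of the smooth term's gradient
    alpha = zeros(p, 1);
    for k = 1:500
        g = 2 * Phi' * (Phi * alpha - x);       % gradient of ||x - Phi*alpha||_2^2
        z = alpha - g / L;                      % gradient step
        alpha = sign(z) .* max(abs(z) - lambda / L, 0);   % soft threshold
    end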

SLIDE 5

Sparse Reconstruction LARS

LARS Properties

Full details in [Efron et al., 2004]

Why is it good?
  - Less aggressive than some greedy techniques; less likely to eliminate useful predictors when predictors are correlated.
  - More efficient than Forward Selection, which can take thousands of tiny steps towards a final model.

Some Properties
  - (Theorem 1) Assuming covariates are added/removed one at a time from the active set, the complete LARS solution path yields all Lasso solutions. (Sec. 3.1)
  - With a change to the covariate selection rule, LARS can be modified to solve the Positive Lasso problem. (Sec. 7)
  - The cost of LARS is comparable to that of a least squares fit on m variables.
  - The LARS sequence incrementally generates a Cholesky factorization of Φ^T Φ in a very specific order (see the sketch below).
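To make the incremental Cholesky idea concrete, here is a generic column-append update in Matlab. It is a textbook sketch of the linear algebra, not the project's LARS code: when a covariate joins the active set, the factor of the Gram matrix is extended rather than recomputed.

    % Extend an upper-triangular Cholesky factor R (R'*R = PhiA'*PhiA) when a
    % new column phi_new joins the active set.  Generic sketch, not project code.
    m = 100;
    PhiA    = randn(m, 5);            % current active covariates
    phi_new = randn(m, 1);            % covariate entering the active set

    R = chol(PhiA' * PhiA);           % existing factor, R'*R = PhiA'*PhiA
    w = R' \ (PhiA' * phi_new);       % solve R'*w = PhiA'*phi_new
    r = sqrt(phi_new' * phi_new - w' * w);
    R_new = [R, w; zeros(1, size(R, 2)), r];

    % check: R_new'*R_new should equal the enlarged Gram matrix
    G = [PhiA, phi_new]' * [PhiA, phi_new];
    norm(R_new' * R_new - G, 'fro')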

SLIDE 6

Sparse Reconstruction LARS

LARS Relationship to OLS

(2.22) Successive LARS estimates µ̂_k always approach, but never reach, the OLS estimate x̄_k (except possibly on the final iteration).

(Figure: in the plane spanned by φ1 and φ2, the LARS estimate µ̂_2 approaches the OLS solution x̄_2.)

SLIDE 7

Sparse Reconstruction LARS

LARS Implementation/Validation

(Figure: diabetes validation test, coefficient paths for the 10 predictors plotted against ||β||_1.)

n = 10, m = 442; matches Figure 1 in [Efron et al., 2004]. Also validated by comparing orthogonal designs against the theoretical result.

SLIDE 8

Sparse Reconstruction Feature-Sign

Feature-Sign Properties

Full details in [Lee et al., 2007]

Why is it good?
  - Very efficient; reported performance gains over LARS.
  - Can be initialized with arbitrary starting coefficients.
  - Simple to implement.
  - One half of a two-part algorithm for matrix factorization.

Some Properties
  - Tries to search for, or "guess", the signs of the coefficients. Knowing the signs reduces the Lasso to an unconstrained quadratic program (QP) with a closed-form solution (see the sketch below).
  - Iteratively refines these sign guesses; involves an intermediate line search.
  - The objective function strictly decreases.
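To make the "known signs" observation concrete, here is the closed-form minimizer of the sign-restricted subproblem in Matlab. The active set and sign vector below are hypothetical placeholders, and this is only the key subproblem, not the full Feature-Sign iteration (which also refines the sign guesses with a line search).

    % For the objective ||x - Phi*alpha||_2^2 + lambda*||alpha||_1, suppose the
    % active set A and the signs theta of alpha(A) are known.  The restricted
    % problem is an unconstrained QP with the closed-form solution below.
    m = 50; p = 20;
    Phi = randn(m, p);  x = randn(m, 1);  lambda = 0.5;

    A     = [2 7 11];                 % hypothetical active set
    theta = [1; -1; 1];               % hypothetical sign guesses for alpha(A)

    PhiA   = Phi(:, A);
    alphaA = (PhiA' * PhiA) \ (PhiA' * x - (lambda / 2) * theta);
    % If sign(alphaA) disagrees with theta, Feature-Sign would line-search
    % toward alphaA and update the sign guesses before re-solving.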

SLIDE 9

Sparse Reconstruction Feature-Sign

Feature-Sign Implementation/Validation

  - Implemented the nonnegative extension. There is a performance hit (at least with my implementation), since the unconstrained QP becomes a constrained QP, solved using Matlab's quadprog() (see the sketch below).
  - Validated by comparing results with LARS.
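A sketch of what the nonnegative subproblem looks like as a quadprog() call (Optimization Toolbox); the data here is synthetic, and this is not the project's exact formulation of the constrained step.

    % Nonnegative Lasso as a constrained QP:
    %   min_a ||x - Phi*a||_2^2 + lambda*sum(a)   s.t.  a >= 0
    % which in quadprog form is  min_a  a'*H*a/2 + f'*a  with  a >= 0.
    m = 50; p = 20; lambda = 0.5;
    Phi = randn(m, p);  x = randn(m, 1);

    H  = 2 * (Phi' * Phi);
    f  = -2 * Phi' * x + lambda * ones(p, 1);
    lb = zeros(p, 1);
    a  = quadprog(H, f, [], [], [], [], lb, []);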

SLIDE 10

Dictionary Learning

Topic: Dictionary Learning


(Figure: the sparse model x = Φα.)

SLIDE 11

Dictionary Learning

Dictionary Learning for Sparse Reconstruction

Following the notation/development of [Mairal et al., 2010].

Given: a training data set of signals X = [x1, . . . , xn] ∈ R^{m×n}
Goal: design a dictionary D ∈ R^{m×p} (possibly with p > m, i.e. an overcomplete dictionary) by minimizing the empirical cost function

    g_n(D) := (1/n) Σ_{i=1}^{n} ℓ_u(x_i, D)

where ℓ_u, the unsupervised loss function, is small when D is "good" at representing x_i sparsely. In [Mairal et al., 2010], the authors use the elastic-net formulation:

    ℓ_u(x, D) := min_{α ∈ R^p} (1/2)||x − Dα||_2^2 + λ1||α||_1 + (λ2/2)||α||_2^2        (1)

SLIDE 12

Dictionary Learning

Dictionary Learning for Sparse Reconstruction

To prevent artificially improving ℓ_u by arbitrarily scaling D, one typically constrains the set of permissible dictionaries (a small projection sketch follows below):

    𝒟 := { D ∈ R^{m×p}  s.t.  ||d_j||_2 ≤ 1  for all j ∈ {1, . . . , p} }

Optimizing the empirical cost g_n can be very expensive when the training set is large (as is often the case in dictionary learning problems). However, in reality, one usually wants to minimize the expected loss

    g(D) := E_x [ℓ_u(x, D)] = lim_{n→∞} g_n(D)   a.s.

(where the expectation is taken with respect to the unknown distribution of data objects p(x)). In these cases, online stochastic techniques have been shown to work well [Mairal et al., 2009].
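A minimal Matlab sketch of the corresponding projection (rescale any column whose norm exceeds one). This is the standard Euclidean projection onto the constraint set above, not necessarily the exact code used in the project.

    % Project D onto { D : ||d_j||_2 <= 1 for all j } by rescaling long columns.
    D = randn(30, 10);                               % example dictionary
    colnorm = sqrt(sum(D.^2, 1));                    % column norms (1 x p)
    D = D ./ repmat(max(colnorm, 1), size(D, 1), 1);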

SLIDE 13

Dictionary Learning

Classification and Sparse Reconstruction

Consider the classification task.

Given: a fixed dictionary D, an observation x ∈ X ⊆ R^m, and a sparse representation α⋆(x, D) of the observation
Goal: identify the associated label y ∈ Y, where Y is a finite set of labels (it would be a subset of R^q for regression)

Assume D is fixed and that α⋆(x, D) will be used as the features for predicting y. The classification problem is to learn the model parameters W by solving

    min_{W ∈ 𝒲} f(W) + (ν/2)||W||_F^2

where f(W) := E_{y,x} [ℓ_s(y, W, α⋆(x, D))] and ℓ_s is a convex loss function (e.g. logistic) adapted to the supervised learning problem.

SLIDE 14

Dictionary Learning

Task Driven Dictionary Learning for Classification

Now, we wish to jointly learn D and W:

    min_{D ∈ 𝒟, W ∈ 𝒲} f(D, W) + (ν/2)||W||_F^2        (2)

where f(D, W) := E_{y,x} [ℓ_s(y, W, α⋆(x, D))].

Example: binary classification (a two-line sketch follows below)
  - Labels: Y = {−1, +1}
  - Linear model: w ∈ R^p
  - Prediction: sign(w^T α⋆(x, D))
  - Logistic loss: ℓ_s = log(1 + e^{−y w^T α⋆})

(Figure: comparison of two loss functions, the 0-1 loss and the logistic loss.)

    min_{D ∈ 𝒟, w ∈ R^p} E_{y,x} [ log(1 + e^{−y w^T α⋆(x,D)}) ] + (ν/2)||w||_2^2        (3)
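The prediction and loss for this binary example are only a couple of lines in Matlab; the sketch below assumes the sparse code α⋆ has already been produced by the coding step, and the values are placeholders.

    % Binary prediction and logistic loss for the linear model on sparse codes.
    p = 40;
    alpha_star = zeros(p, 1);
    alpha_star([3 8 21]) = [0.7; -1.2; 0.4];        % stand-in sparse code
    w = randn(p, 1);                                % model parameters
    y = -1;                                         % true label in {-1, +1}

    y_hat = sign(w' * alpha_star);                  % predicted label
    loss  = log(1 + exp(-y * (w' * alpha_star)));   % logistic loss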

SLIDE 15

Dictionary Learning

Solving the Problem

Stochastic gradient descent is often used to minimize functions whose gradients are expectations. The authors of [Mairal et al., 2010] show that, under suitable conditions, equation (2) is differentiable on 𝒟 × 𝒲, and that

    ∇_W f(D, W) = E_{y,x} [ ∇_W ℓ_s(y, W, α⋆) ]
    ∇_D f(D, W) = E_{y,x} [ −D β⋆ α⋆^T + (x − D α⋆) β⋆^T ]

where β⋆ ∈ R^p is defined by the properties

    β⋆_{Λᶜ} = 0   and   β⋆_Λ = (D_Λ^T D_Λ + λ2 I)^{−1} ∇_{α_Λ} ℓ_s(y, W, α⋆)

and Λ is the set of indices of the nonzero coefficients of α⋆(x, D).
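For a logistic loss, β⋆ is cheap to form once the active set is known. The Matlab sketch below is illustrative only (synthetic D, w, and a hand-picked sparse code), not the project's implementation.

    % Compute beta_star for the TDDL gradient with logistic loss
    %   l_s = log(1 + exp(-y * w'*alpha)).
    m = 30; p = 60; lambda2 = 0.01;
    D = randn(m, p);  w = randn(p, 1);  y = +1;
    alpha_star = zeros(p, 1); alpha_star([4 17 33]) = [0.5; -0.9; 0.2];

    Lambda  = find(alpha_star);                               % active set
    grad_ls = -(y * w) ./ (1 + exp(y * (w' * alpha_star)));   % grad of l_s w.r.t. alpha
    beta_star = zeros(p, 1);
    DL = D(:, Lambda);
    beta_star(Lambda) = (DL' * DL + lambda2 * eye(numel(Lambda))) \ grad_ls(Lambda);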

SLIDE 16

Dictionary Learning

Algorithm: SGD for task-driven dictionary learning

[Mairal et al., 2010]

Input: p(y, x) (a way to draw i.i.d. samples from p), λ1, λ2, ν ∈ R (regularization parameters), D_0 ∈ 𝒟 (initial dictionary), W_0 ∈ 𝒲 (initial model), T (number of iterations), t_0, ρ ∈ R (learning rate parameters)

1. for t = 1 to T do
2.     Draw (y_t, x_t) from p(y, x) (mini-batch of size 200)
3.     Compute α⋆ via sparse coding (LARS, Feature-Sign)
4.     Determine the active set Λ and β⋆
5.     Update the learning rate ρ_t
6.     Take a projected gradient descent step
7. end
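A condensed Matlab sketch of this loop for binary classification with a logistic loss is shown below. It is illustrative only: the sparse coding step is a single soft-thresholding pass rather than the project's LARS/Feature-Sign solvers, the data is synthetic, there is no mini-batching, and the learning-rate schedule is a simple placeholder.

    % Condensed TDDL stochastic gradient loop (binary labels, logistic loss).
    m = 20; p = 40; T = 200;
    lambda1 = 0.15; lambda2 = 0.01; nu = 1e-4; rho = 0.1; t0 = T / 10;

    D = randn(m, p);  D = D ./ repmat(sqrt(sum(D.^2, 1)), m, 1);   % unit columns
    w = zeros(p, 1);

    for t = 1:T
        % 1. draw a labeled sample (synthetic stand-in for p(y, x))
        y = sign(randn);  x = randn(m, 1) + 0.5 * y;

        % 2. sparse code (crude stand-in: one gradient + soft-threshold pass)
        a = D' * x / norm(D)^2;
        a = sign(a) .* max(abs(a) - lambda1, 0);

        % 3. active set and beta_star
        L = find(a);
        g = -(y * w) ./ (1 + exp(y * (w' * a)));       % grad of loss w.r.t. alpha
        beta = zeros(p, 1);
        DL = D(:, L);
        beta(L) = (DL' * DL + lambda2 * eye(numel(L))) \ g(L);

        % 4. decreasing learning rate (placeholder schedule)
        rho_t = min(rho, rho * t0 / t);

        % 5. projected gradient step on w and D
        gw = -(y * a) ./ (1 + exp(y * (w' * a)));      % grad of loss w.r.t. w
        w  = w - rho_t * (gw + nu * w);
        D  = D - rho_t * (-D * beta * a' + (x - D * a) * beta');
        D  = D ./ repmat(max(sqrt(sum(D.^2, 1)), 1), m, 1);   % project columns
    end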

SLIDE 17

Dictionary Learning

TDDL Implementation/Validation

Matched experimental results on the USPS [Hastie et al., 2009] data set with those reported in [Mairal et al., 2010]:

Digit   ρ    λ      # in D0   Runtime (h)   Accuracy
0       10   .150   5         8.2           .926
1       10   .225   7         7.1           .990
2       10   .225   7         6.8           .972
3       10   .225   7         7.4           .968
4       10   .225   4         7.6           .971
5       10   .225   4         7.2           .972
6       10   .225   2         7.5           .969
7       10   .175   5         7.9           .983
8       10   .200   3         8.5           .951
9       10   .200   3         8.1           .969

mean accuracy: .967    reported: .964

SLIDE 18

Hyperspectral Imaging

Topic: Hyperspectral Imaging


(Figures: hyperspectral data cube with spatial/spectral axes (x, y, λ); Smith Island near-IR spectra plotted against wavelength (µm).)

SLIDE 19

Hyperspectral Imaging

Spectral Unmixing

Material heterogeneity and environmental interference mean that one never measures "pure" pixels/spectra. Instead, "spectral unmixing" is often used to determine the materials present at some pixel x ∈ R^m:

    x = Σ_{k=1}^{n} φ_k α_k + ε

where {φ_k} is a spectral library, {α_k} are scalar mixture coefficients, and ε is noise. Recent results suggest sparse coding may apply to the spectral unmixing problem, and also to inferring HSI-resolution data from lower resolution measurements [Charles et al., 2011].
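A toy Matlab illustration of this linear mixing model, using a synthetic library and the built-in nonnegative least squares routine; the project's experiments use the sparse coding machinery described above instead.

    % Toy mixing example: build a synthetic pixel from a small spectral library
    % and recover nonnegative abundances with lsqnonneg.
    m = 200; n = 6;
    Phi = abs(randn(m, n));                  % synthetic spectral library
    a_true = zeros(n, 1); a_true([2 5]) = [0.7; 0.3];
    x = Phi * a_true + 0.001 * randn(m, 1);  % mixed pixel plus noise

    a_hat = lsqnonneg(Phi, x);               % nonnegative abundance estimate
    [a_true, a_hat]                          % compare true and estimated mixtures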

SLIDE 20

Hyperspectral Imaging

Mixture Element Detection

Original plan: analysis of single pixel classification problems for objects in the scene comprised of ≥ 1 pixel
  - The problem is very easy in some cases (baseline algorithms have no difficulty)
  - In the opinion of one HSI expert, a more relevant problem today is sub-pixel detection

Modified plan: a "mixture element detection" problem
  - Select a single spectral signature as the target
  - Generate mixtures of s spectral "ingredients", some containing the target signature, some without (see the sketch below)
  - Binary classification problem: identify the mixtures containing the target signature

Used TDDL + the nonnegative Feature-Sign solver; various baselines for comparison.
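A sketch of how such synthetic mixtures might be generated in Matlab. The library, abundance range, and noise level here are placeholders, not the exact experimental recipe used in the results that follow.

    % Generate one "target" mixture and one "clutter" mixture from a library.
    m = 1200;  nLib = 44;  s = 5;
    lib = abs(randn(m, nLib));              % stand-in for a spectral library
    targetIdx = 1;                          % designated target signature
    nrm = @(v) v / sum(v);                  % helper: make abundances sum to one

    perm   = randperm(nLib);
    others = perm(perm ~= targetIdx);       % library entries other than the target

    % target mixture: target gets 5-25% abundance, remainder split among others
    pctTarget = 0.05 + 0.20 * rand;
    idx = [targetIdx, others(1:s-1)];
    wts = [pctTarget; (1 - pctTarget) * nrm(rand(s-1, 1))];
    xTarget = lib(:, idx) * wts + sqrt(0.001) * randn(m, 1);

    % clutter mixture: s ingredients drawn without the target signature
    idxC = others(1:s);
    xClutter = lib(:, idxC) * nrm(rand(s, 1)) + sqrt(0.001) * randn(m, 1);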

SLIDE 21

Hyperspectral Imaging

Urban

[US Army Corps of Engineers, 2012]

- Urban scene in Texas
- 307 × 307 pixels
- 210 spectral bands (162 valid)
- wavelengths: 412-2390 nm
- radiance data
- no "standard" ground truth
- freely available

SLIDE 22

Hyperspectral Imaging

Manual Ground Truth

(Figure: manual ground truth; average raw spectra for classes 1-4 plotted against wavelength.)

SLIDE 23

Hyperspectral Imaging

Mixture Element Detection, 1-vs-all Classification

Classification Accuracy

      LR     kNN1   kNN3   LR-SC  kNN1-SC  kNN3-SC  TD-10
M1    84.0   83.2   82.0   86.8   77.0     78.4     82.4
M2    82.6   79.2   76.6   75.8   72.8     77.4     81.2
M3    76.8   74.6   75.0   74.0   75.2     69.0     81.4
M4    86.4   87.8   82.6   84.0   77.6     78.6     88.2

Parameter             Value
# Target Mixtures     500
# Clutter Mixtures    500
% Training            50
# Ingredients         3
Min. % Target         5
Max. % Target         25
Noise Variance
TDDL Iterations       10000

- Not clear that any one approach is significantly better
- Only 4 total ingredients in the library; the signatures are fairly distinct

SLIDE 24

Hyperspectral Imaging

USGS Spectral Library

[Clark et al., 2007]

- Freely available library of 1365 different spectra (minerals, mixtures, coatings, volatiles, man-made, vegetation)
- Focus on a subset of 44 spectra from the vegetation category (∼0.3-2.5 µm, ∼1200 valid wavelengths)

SLIDE 25

Hyperspectral Imaging

Mixture Element Detection, 1-vs-all Classification

Classification Accuracy

      LR     kNN1   kNN3   LR-SC  kNN1-SC  kNN3-SC  TD-50  TD-300
M1    60.2   63.4   65.8   67.7   57.4     61.6     70.2   70.2
M2    49.2   61.6   63.2   56.4   52.8     60.0     59.4   60.8
M3    52.8   56.2   54.0   56.4   52.2     50.4     54.6   55.0
M4    57.8   62.4   61.8   60.6   54.0     56.6     63.6   70.8
M5    55.0   67.6   68.6   66.6   59.6     63.2     64.2   73.34
M6    52.2   59.0   62.6   61.0   54.2     57.6     62.8   65.2*
M7    44.8   56.4   59.6   60.2   56.8     58.4     63.4   64.2*
M8    64.8   80.8   81.4   66.6   64.0     65.2     82.2   81.2

Parameter             Value
# Target Mixtures     500
# Clutter Mixtures    500
% Training            50
# Ingredients         5
Min. % Target         5
Max. % Target         25
Noise Variance        0.001
TDDL Iterations       1000

- More challenging mixture model
- LR suffers from noise; SC helps
- TDDL is a relatively strong performer
- kNN3 is pretty good, especially when given enough data

(* := TD-200)

SLIDE 26

Software

Processing

- Platform Load Sharing Facility (LSF) scheduler on 20 compute nodes (Intel Xeon X5650, 12 threads)
- Software includes scripts for various tasks (kfold CV, train/test)

$ lsload
HOST_NAME  status  r15s  r1m   r15m  ut   pg   ls  it     tmp    swp    mem
cn17       ok       0.0   0.0   0.0   0%  0.0   0  40576  8824M  2000M  22G
maul       ok       0.0   0.2   0.1   0%  0.0   7  15     19G    26G    20G
cn00       ok      12.0  12.2  11.8  99%  0.0   0  10528  8824M  2000M  22G
cn08       ok      12.0  12.5  12.0  99%  0.0   0  3e+05  8824M  2000M  22G
cn12       ok      12.0  12.2  11.7  99%  0.0   0  3e+05  8824M  2000M  22G
cn07       ok      12.0  12.3  11.8  99%  0.0   0  3e+05  8824M  2000M  22G
...        (remaining compute nodes similar: ~98-99% utilization during experiments)
cn06       ok      12.6  11.5  11.5  99%  0.0   0  3e+05  8824M  2000M  22G

SLIDE 27

Software

Deliverables

Software/Data Sets
  - Solvers (LARS, F-S, TDDL): ∼2000 lines of Matlab
      - Diabetes data set downloaded from the LARS author's website; removed header (provided)
      - Test matrices constructed on-the-fly by unit tests (provided)
      - Limited doxygen documentation (requires doxygen and "Using Doxygen with Matlab" from Matlab Central to regenerate)
  - Analysis experiments: ∼1500 lines of Matlab, ∼140 lines of bash
      - URLs to HSI data sets provided in the references
  - USGS Viewer: ∼500 lines of Matlab

Presentations (9/22/2011, 12/6/2011, 3/15/2012, 5/1/2012)

Final report and software tarball to be delivered by May 11

SLIDE 28

Software

Doxygen

SLIDE 29

Summary

Summary

Project Goals Met
  - Implemented algorithms from three papers (LARS, Feature-Sign, TDDL)
  - Validated using data sets with existing/known results (diabetes, orthogonal designs, USPS)
  - Conducted new experiments with hyperspectral data sets (Urban, USGS)

Thanks!!
  - Drs. Levy, Balan, Ide, Wang, Banerjee for guidance and help throughout the course!
  - AMSC663/4 for great questions and for enduring four presentations on this topic

SLIDE 30

Summary

Bibliography I

Adam S. Charles, Bruno A. Olshausen, and Christopher J. Rozell. Learning sparse codes for hyperspectral imagery. J. Sel. Topics Signal Processing, 5(5):963-978, 2011.

R.N. Clark, G.A. Swayze, R. Wise, E. Livo, T. Hoefen, R. Kokaly, and S.J. Sutley. USGS digital spectral library splib06a: U.S. Geological Survey, Digital Data Series 231, 2007. http://speclab.cr.usgs.gov/spectral.lib06/.

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Datasets for "The Elements of Statistical Learning", 2009. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/data.html.

SLIDE 31

Summary

Bibliography II

Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In NIPS, pages 801-808, 2007.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 689-696, New York, NY, USA, 2009. ACM.

Julien Mairal, Francis Bach, and Jean Ponce. Task-driven dictionary learning. Rapport de recherche RR-7400, INRIA, 2010.

US Army Corps of Engineers, Army Geospatial Center, 2012. http://www.agc.army.mil/hypercube/.
