Esteban García-Cuesta, Researcher at Universidad Carlos III, Spain
WHEN "LESS IS MORE": TECHNIQUES AND APPLICATIONS
Pittsburgh, February 24th, 2010
"Less is More"
[Figure: reducing data from 3D to 2D]
Esteban García-Cuesta, Universidad Carlos III de Madrid
Summary
This talk is about:
- High-dimensional datasets
- Two proposals developed during my PhD studies
- How each proposal's point of view can help in a robotics context
It is not specifically about:
- A machine learning algorithm
- Computer vision
- Data mining
Outline
- Introduction to dimensionality reduction
- Feature selection using eigenvector coefficients (Part I)
  - Introduction: Principal Component Analysis
  - How to use the PCA coefficients for feature selection
  - Application to a remote sensing scenario
- Feature extraction models (Part II)
  - Graphs and embedding graphs
  - Homogeneous structures
  - Remote sensing application
  - Facial motion feature points selection
  - Map building without localization by DR
- Recent trends in dimensionality reduction
Introduction to Dimensionality Reduction
- Motivation
- Problems related with high-dimensional data
Motivation
- Modern technologies routinely produce massive amounts of data
- Scientific progress now heavily depends on the ability to process and analyze high-dimensional data
- The heart of these analyses is the reduction of dimensionality, either by selecting a subset of the original features or by obtaining a well-chosen combination of them
Problems Related with High Dimensionality
Most machine learning and data mining techniques are not effective on high-dimensional datasets, because of:
- Irrelevant features
- Redundant features
- The so-called "curse of dimensionality" (CoD) [Bellman'61]
CoD
- The number of training instances needed to 'populate' a space grows exponentially with dimensionality
- Unexpected properties: the Euclidean distances between uniformly sampled points concentrate (their relative spread tends to zero), and norms show Gaussian behavior
- What can we do?
(Credit: Laurens van der Maaten, DM Summer School 2008, Maastricht)
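The distance-concentration effect mentioned above is easy to check numerically. The sketch below (my own illustration, not from the talk) samples points uniformly in the unit hypercube and measures the relative spread (std/mean) of pairwise Euclidean distances; it shrinks as the dimensionality grows.

```python
# Illustration of the "curse of dimensionality": pairwise Euclidean
# distances between uniformly sampled points concentrate as the
# dimensionality grows, so their relative spread (std/mean) tends to zero.
import numpy as np

def distance_spread(dim, n_points=100, seed=0):
    """Relative spread of pairwise distances for uniform samples in [0,1]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_points, dim))
    # Squared pairwise distances via the |a-b|^2 = |a|^2 + |b|^2 - 2 a.b trick
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    d = np.sqrt(np.clip(d2, 0.0, None))
    d = d[np.triu_indices(n_points, k=1)]  # each pair once, no self-distances
    return d.std() / d.mean()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  std/mean={distance_spread(dim):.3f}")
```

The printed ratio decreases monotonically with the dimension, which is why nearest-neighbor style reasoning degrades in high dimensions.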
Dimensionality Reduction
- The "intrinsic" dimensionality may be smaller than the number of features
- Def: the minimum number of features necessary to preserve the data properties
- Other reasons for dimensionality reduction: comprehensibility, compressing data, visualizing high-dimensional data
Feature Selection: only a subset of the original features is selected (discrete)
Feature Extraction: all features are used (continuous)
Remote Sensing Application
FORWARD MODEL: given the medium (CO2, H2O) and its temperature, the radiative transfer equation (RTE) predicts the spectrum of energy measured by an infrared sensor
INVERSE MODEL: from the measured spectrum, retrieve the temperature profile along the path length
[Figures: intensity (a.u.) vs. wavelength (cm-1); temperature vs. length]
Machine Learning Approach
We have gathered a dataset X:
- N data samples (different flame observations)
- D features/variables/dimensions (each one of the wavelengths)
We want to 'learn' the inverse of the RTE from this data (spectrum of energy → temperature profile): a regression problem
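The regression view of the inverse RTE can be sketched with fully synthetic data. Everything below is a stand-in: the forward model is a random linear map (the real RTE is not linear), the "temperature profiles" are random, and the learner is ridge regression rather than the MLP the talk uses; the point is only the shape of the problem, N spectra of dimension D mapped to L-point profiles.

```python
# Hedged sketch: learn an inverse model mapping spectra (D wavelengths)
# to temperature profiles (L points). Synthetic stand-in data throughout;
# the talk uses real flame spectra and a multilayer perceptron.
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 500, 128, 20                 # samples, wavelengths, profile points
T = rng.random((N, L))                 # hypothetical temperature profiles
W = rng.standard_normal((L, D))        # stand-in linear "forward model"
X = T @ W + 0.01 * rng.standard_normal((N, D))  # observed spectra + noise

# Fit the inverse model by ridge regression (regularized least squares)
lam = 1e-3
A = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ T)  # (D, L) weights
T_hat = X @ A
rmse = np.sqrt(np.mean((T_hat - T) ** 2))
print(f"train RMSE: {rmse:.4f}")
```

With a well-conditioned linear forward model the inverse is recovered almost exactly; the real problem is harder because the RTE is nonlinear and the wavelengths are strongly collinear, which motivates the feature selection that follows.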
Why It Is Important to Solve the Inverse RTE
Combustion contributes to global warming and is hazardous to health. The goal is automatic control and diagnosis of combustion, in order to obtain energy efficiently and minimize pollutant emissions.
Feature Selection Using the Eigenvector Coefficients
- Introduction: Principal Component Analysis
- How to use the PCA to select a subset of original features
- Application to remote sensing data
Feature Selection
Def: a process that chooses an optimal subset of features according to an objective function
Objectives:
- To reduce dimensionality and remove noise
- To improve mining performance: speed of learning, accuracy, simplicity and comprehensibility of the data
Supervised: exploits input-output relations; wrapper approach; unstable due to multicollinearity; there are many candidate subsets
Unsupervised: feature ranking based on a quality metric, e.g. variance and separability (PCA)
Subset Search Problem [Kohavi & John '97]
Feature Selection in High-Dimensional Data
- A large number of features to work with
- Many irrelevant features and, more importantly, many redundant ones
- Individual feature evaluation (filter approach): focuses on identifying relevant features, without handling feature redundancy or feature relations
- Feature subset selection (wrapper approach): relies on the evaluation of each subset to handle redundancy (too many possibilities)
Multicollinearity
PCA
Its main objective is to reduce the dimensionality while conserving the total variance.
Notation:
- X: input data matrix (one observation per column)
- Σ = XX^T: [p x p] covariance matrix
- A: [p x k] eigenvector matrix, whose k-th column is the eigenvector α_k
- Λ: [k x k] diagonal eigenvalue matrix
- Z = A^T X: k-dimensional projection
Eigendecomposition: Σ A = A Λ
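The PCA step above can be written out in a few lines. This is a minimal sketch of the standard eigendecomposition route (rows as observations, which is the NumPy convention, unlike the slide's column-observation notation): center the data, form the covariance matrix, take the top-k eigenvectors, and project.

```python
# Minimal PCA via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, k):
    """X: (n, p) data, one observation per row. Returns the (n, k)
    projection Z, the (p, k) eigenvector matrix A, and the top-k eigenvalues."""
    Xc = X - X.mean(axis=0)             # center each feature
    sigma = np.cov(Xc, rowvar=False)    # (p, p) covariance matrix
    vals, vecs = np.linalg.eigh(sigma)  # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]  # keep the k largest-variance directions
    A = vecs[:, order]                  # (p, k) eigenvector matrix
    return Xc @ A, A, vals[order]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))
Z, A, lam = pca(X, 2)
print(Z.shape, A.shape)
```

The columns of A are orthonormal, so the projection preserves as much total variance as any k-dimensional linear map can.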
PCA Coefficients
[Figure: eigenvector 1, highlighting the coefficients of feature i]
PCA Coefficients
Key idea: coefficients with a high absolute value mean more influence, so relevant features have high absolute value coefficients.
B4 Method [Jolliffe'02]
- Very appealing because of its simplicity
- But it lacks redundancy control
[Figure: eigenvector coefficients α_k (a.u.) vs. feature number]
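A sketch of the B4 idea as I read it from the slide: for each of the first k eigenvectors, keep the original feature with the largest absolute coefficient. The toy data below (my own, with feature 3 deliberately made a near-copy of feature 0) also shows why the slide flags the lack of redundancy control: the method has no mechanism to avoid picking correlated features.

```python
# Sketch of B4-style selection: per leading eigenvector, keep the feature
# with the largest |coefficient|. No redundancy control.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))
X[:, 3] = X[:, 0] + 0.01 * rng.standard_normal(200)  # redundant copy of feat. 0
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
A = vecs[:, np.argsort(vals)[::-1]]  # eigenvectors, highest variance first

def b4_select(A, k):
    """For each of the first k eigenvectors (columns of A), pick the
    feature index with the largest absolute coefficient."""
    return [int(np.argmax(np.abs(A[:, j]))) for j in range(k)]

print(b4_select(A, 3))
```

Because features 0 and 3 are nearly identical, the leading eigenvector loads almost equally on both; B4 picks one of them but nothing prevents a later component from picking its redundant twin.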
Analysis of PCA Coefficients
Key idea: similar absolute coefficient values mean high correlation between the associated features.
- Irrelevant features: coefficients ≈ 0
- Redundant features: similar coefficients
- At the other extreme, very independent features have maximum distance between their coefficients
- Different eigenvectors form uncorrelated bases
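This key idea is easy to see on a toy dataset of my own construction: two nearly identical features and one near-constant noise feature. In the leading eigenvector, the two redundant features get almost equal coefficients and the irrelevant one gets a coefficient near zero.

```python
# Demo of the slide's key idea: redundant features receive similar
# coefficients in the leading eigenvector; irrelevant ones get ~0.
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal(1000)
X = np.column_stack([
    base,                                     # feature 0
    base + 0.01 * rng.standard_normal(1000),  # feature 1: redundant with 0
    0.01 * rng.standard_normal(1000),         # feature 2: irrelevant noise
])
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
alpha1 = vecs[:, np.argmax(vals)]             # leading eigenvector
print(np.round(np.abs(alpha1), 3))
```

Features 0 and 1 end up with |coefficients| close to 1/sqrt(2) each, while feature 2 is near zero, matching the "similar coefficients = high correlation" reading of the eigenvector.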
Redundancy Control
- Select the feature with the highest coefficient value within different ranges
- It is difficult to choose the threshold
[Figure: eigenvector coefficients α_k (a.u.) vs. feature number]
Using A Priori Specific Knowledge
- Adjacent wavelengths/features carry similar spatial information
[Figure: infrared sensor; emission and absorption bands; wavelength (cm-1) vs. X-space]
Guided Feature Selection [García-Cuesta'08]
"Multilayer perceptron as inverse model in a ground-based remote sensing temperature retrieval problem", Eng. Appl. Artif. Intell., Vol. 21(1):26-34, February 2008.
- Similar features have similar information
- Select features with high and mutually different coefficient values
- Locally find features with high coefficient values
[Figure: eigenvector coefficients α_k (a.u.) vs. feature number]
Algorithm
1. Calculate the input covariance matrix Σ = XX^T (PCA)
2. Obtain the eigenvectors α and the eigenvalues Λ of Σ, and select α_q
3. Select a subset of features by applying a maximum-value algorithm to α_q (guided feature selection)
4. Use the selected subset of features as input to a machine learning algorithm
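Step 3 can be sketched as follows. This is my own reading of the "maximum value algorithm applied to α_q": exploit the prior knowledge that adjacent wavelengths carry similar information by splitting the feature axis into contiguous ranges and keeping, per range, the feature with the largest absolute eigenvector coefficient. The equal-width split and the toy oscillating coefficients are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of guided feature selection: one locally maximal
# |coefficient| per contiguous range of adjacent features.
import numpy as np

def guided_feature_selection(alpha_q, n_ranges):
    """alpha_q: (p,) coefficients of the selected eigenvector.
    Returns one feature index per contiguous range (equal-width split)."""
    p = len(alpha_q)
    bounds = np.linspace(0, p, n_ranges + 1).astype(int)
    return [int(lo + np.argmax(np.abs(alpha_q[lo:hi])))
            for lo, hi in zip(bounds[:-1], bounds[1:]) if hi > lo]

alpha = np.sin(np.linspace(0, 6 * np.pi, 120))  # toy oscillating coefficients
selected = guided_feature_selection(alpha, 6)
print(selected)
```

Compared with B4, this guarantees the selected wavelengths are spread across the spectrum, which is the redundancy control the previous slides were missing.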
Guided Feature Selection
[Figure: subset of selected original features; eigenvector (a.u.) vs. wavelength number (cm-1)]
Remote Sensing Application: Results
- An MLP neural network has been used for estimation purposes
- Cross-validation, with trials over different numbers of hidden neurons
- The proposed GFS improves on B4 and converges faster
- The error increases when adding more features
[Figure: error (K) vs. number of selected features (20-100), for B4 and GFS]
Remote Sensing Application: Results (cont.)
Feature Selection: Conclusions
- We developed a feature selection method based on PCA that reveals the dependency between features
- It allows introducing a priori knowledge
- Selecting original features makes it possible to design specific sensors:
  - Reduce the cost of the equipment
  - Reduce the cost of massive data storage