9. General Improvement Techniques

9.1 Preprocessing

(Machine Learning, J. Denzinger. The following is based on slides by Michael M. Richter.)

There are several concepts/techniques to improve learning results that are general enough to be applicable to several learning concepts, often using task- or application-specific knowledge to achieve their results. These techniques can be grouped into
- preprocessing
- ensemble/combination
- cooperation
- postprocessing

Data preprocessing

Data preprocessing aims to solve/lessen the following problems:
- data quality not sufficient
- too large a number of features
- too large a number of examples
- wrong representation of data

Types of data preprocessing

Cleaning of data:
- removing of errors and fixing of missing data
- elimination of noise
Data integration and transformation:
- creating examples out of several databases (projection and join)
- change of representation
Data reduction:
- eliminating features and/or examples

Dealing with insufficient data quality

- Deleting examples with missing feature values
- Deleting a particular feature (where the values are often missing) from all examples
- Manually adding/correcting feature values
- Extending the values of a feature by the new value "unknown"
- Adding a "default" value for a missing feature value
- Trying to predict a missing feature value → constitutes its own learning problem
All need substantial application-specific knowledge (a code sketch of these options follows below).

Dealing with insufficient data quality due to noisy data

Identifying inconsistent values and outliers by
- checking semantic consistency conditions
- using set-partitioning techniques
Treating detected examples by
- the previous methods
- using the value from the nearest cluster
- binning for numerical values

Dealing with too many features

Also known as reduction of dimensionality or attribute (feature) selection. Aims at deleting
- irrelevant features and
- redundant features
Identifying irrelevant features (see the second sketch below):
- small variance
- little or no correlation with the classification feature
- not important for the classification feature, e.g. by learning a decision tree from a small subset of examples and checking that the feature does not occur in the tree
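As a concrete illustration of the missing-value strategies listed above, here is a minimal sketch using pandas; the table, the column names (age, gender, income) and the chosen default values are assumptions made for illustration, not part of the original slides.

    import pandas as pd

    # Hypothetical example set with missing feature values.
    df = pd.DataFrame({
        "age":    [34, None, 51, None],
        "gender": ["male", "female", None, "male"],
        "income": [52000.0, 48000.0, None, 61000.0],
    })

    # Deleting examples (rows) with missing feature values.
    cleaned = df.dropna()

    # Deleting a particular feature whose values are often missing.
    reduced = df.drop(columns=["income"])

    # Extending the values of a symbolic feature by the new value "unknown".
    df["gender"] = df["gender"].fillna("unknown")

    # Adding a "default" value for a missing numerical feature value
    # (here: the median of the observed values).
    df["age"] = df["age"].fillna(df["age"].median())

    # Trying to predict the missing value would constitute its own learning
    # problem, e.g. a regressor trained on the complete rows (not shown).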

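The variance and correlation criteria for identifying irrelevant features can be sketched as follows; the thresholds and the assumption that the classification feature is numeric (so a simple correlation can be computed) are illustrative choices, not prescribed by the slides.

    import pandas as pd

    def irrelevant_features(X: pd.DataFrame, y: pd.Series,
                            var_threshold: float = 1e-3,
                            corr_threshold: float = 0.05) -> list:
        """Flag features with (almost) no variance or (almost) no
        correlation with the classification feature y."""
        flagged = []
        for col in X.columns:
            small_variance = X[col].var() < var_threshold
            low_correlation = abs(X[col].corr(y)) < corr_threshold
            if small_variance or low_correlation:
                flagged.append(col)
        return flagged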
Dealing with too many features (cont.)

Identifying redundant features:
- performing semantic analysis of the data (for example, realizing that there are both birth year and age as features)
- features that have a high correlation with other features
But: sometimes redundant features can be relevant! For example, features that have been computed out of others obviously have a high correlation with those features. We do not want to remove those!

Dealing with too many examples

Perform data sampling! Different sampling methods (see the code sketch below):
- random sampling
- cluster sampling: randomly put examples into clusters and randomly select the clusters to sample from
- stratified sampling: create simple clusters (for example by clustering according to the different values of one feature) and randomly select examples out of each cluster

Transformation of data

Used to get data into a form that makes it more suitable for a chosen learning method. Includes
- methods for noise reduction (as looked at before)
- changing the type of certain data (i.e. the value space of a feature)
- applying generalization / using taxonomies
- aggregation of several examples into one
- construction of new features out of other ones

Changing the type of certain data

Aimed at changing the representation but not the information content of the data (see the second sketch below).
- Normalizing numerical feature value spaces
  - into [0..1]
  - into their logarithms
- Transformation of integer codes into symbolic values (and vice versa)
  - like 1 -> male, 2 -> female
- Conversion of types
  - like a string into a symbolic value

Generalization

Reduction of the information content of a feature.
- Exchange of a numerical feature for a symbolic one (that still has some quantitative meaning)
  - for example: high, medium, low
- Combination of several feature values into one value
  - for example: specific diseases into disease types → use of a taxonomy
- Clustering of the values of a feature can also be used to find good groups to combine.

Aggregation

Combining several examples fulfilling a certain condition. New feature values for this new example need to be generated, for example by
- summing the "old" values up
- creating the average of the "old" values
- counting entries
- ...
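A small sketch of the three sampling methods named above, using only the Python standard library; the function signatures and parameter names are made up for illustration.

    import random
    from collections import defaultdict

    def random_sampling(examples, n):
        """Draw n examples uniformly at random, without replacement."""
        return random.sample(examples, n)

    def cluster_sampling(examples, n_clusters, n_selected):
        """Randomly put examples into clusters, then keep the examples of
        n_selected randomly chosen clusters."""
        clusters = defaultdict(list)
        for ex in examples:
            clusters[random.randrange(n_clusters)].append(ex)
        chosen = random.sample(list(clusters), min(n_selected, len(clusters)))
        return [ex for c in chosen for ex in clusters[c]]

    def stratified_sampling(examples, feature, per_stratum):
        """Cluster examples by the value of one feature and randomly select
        examples out of each cluster."""
        strata = defaultdict(list)
        for ex in examples:
            strata[ex[feature]].append(ex)
        sample = []
        for group in strata.values():
            sample.extend(random.sample(group, min(per_stratum, len(group))))
        return sample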

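The representation changes on this page (normalization into [0..1], integer codes to symbols, generalization into high/medium/low, and aggregation) could look roughly as follows; the code table, the thresholds and the aggregation rule are invented for illustration.

    def normalize_01(values):
        """Normalize a numerical feature value space into [0..1]."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi != lo else 0.0 for v in values]

    # Transformation of integer codes into symbolic values (and vice versa).
    CODE_TO_SYMBOL = {1: "male", 2: "female"}          # assumed coding
    SYMBOL_TO_CODE = {s: c for c, s in CODE_TO_SYMBOL.items()}

    def generalize_income(value, low=30_000, high=70_000):
        """Exchange a numerical feature for a symbolic one that still has
        some quantitative meaning (thresholds are illustrative)."""
        if value < low:
            return "low"
        if value < high:
            return "medium"
        return "high"

    def aggregate(examples, condition, numeric_keys):
        """Combine several examples fulfilling a certain condition into one
        example, summing up the 'old' values and counting the entries."""
        matching = [ex for ex in examples if condition(ex)]
        merged = {k: sum(ex[k] for ex in matching) for k in numeric_keys}
        merged["count"] = len(matching)
        return merged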
Construction of new features

The new feature should be relevant for the goal of the learning (→ application knowledge required).
Note that the difference to aggregation is that the feature value computation is done within one example, not out of several examples.
Example: profit := income - expenses (sketched in code below)
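A short version of the slide's example, to make the contrast with aggregation explicit: the new value is computed within each single example, not across several examples. The example records are made up.

    examples = [
        {"income": 120_000, "expenses": 90_000},
        {"income":  80_000, "expenses": 95_000},
    ]

    # Construction of a new feature out of other ones, within each example:
    for ex in examples:
        ex["profit"] = ex["income"] - ex["expenses"]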
