
Data Preparation

[Figure: the data analysis process as a flow chart. The phases project understanding, data understanding, data preparation, modeling, evaluation and deployment are connected by feedback loops, with decision points such as "does the data suit the problem?", "is the technical quality improvable?" and "is the business objective achieved?" leading back to a revised objective, to a cancelled project, or to project close-down.]

Guiding questions along the process:

Project understanding: What exactly is the problem, the expected benefit? What would a solution look like? What is known about the domain?

Data understanding: What data do we have available? Is the data relevant to the problem? Is it valid? Does it reflect our expectations? Are data quality, quantity and recency sufficient?

Data preparation: Which data should we concentrate on? How is the data best transformed for modeling? How may we increase the data quality?

Modeling: What kind of model architecture suits the problem best? What is the best technique/method to get the model?

Evaluation: How well does the model perform technically? How good is the model in terms of the project requirements? What have we learned from the project?

Deployment: How is the model best deployed? How do we know that the model is still valid?



Data understanding vs Data preparation

Data understanding provides general information about the data, such as the existence (and partly also the character) of missing values and outliers, the character of the attributes, and dependencies between attributes.

Data preparation uses this information to select attributes, reduce the dimension of the data set, select records, treat missing values, treat outliers, integrate, unify and transform data, and improve data quality.


Feature extraction

Feature extraction refers to constructing (new) features from the given attributes.

Example

Find the best workers in a company. Attributes:

• the tasks a worker has finished within each month,
• the number of hours he has worked each month,
• the number of hours that are normally needed to finish each task.

These attributes contain information about the efficiency of the worker, but instead of using these three "raw" attributes it might be more useful to define a new attribute:

efficiency = (hours actually spent to finish the tasks) / (hours normally needed to finish the tasks)
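As a concrete sketch, such a derived attribute is easy to compute with pandas; the data frame and column names below are hypothetical examples, not taken from the slides:

```python
# A minimal sketch of extracting the "efficiency" feature with pandas.
# The data frame and column names are hypothetical examples.
import pandas as pd

df = pd.DataFrame({
    "worker":       ["A", "B", "C"],
    "hours_spent":  [160, 150, 170],  # hours actually spent on the finished tasks
    "hours_normal": [140, 160, 150],  # hours normally needed for those tasks
})

# One derived attribute replaces the three raw ones for modeling.
df["efficiency"] = df["hours_spent"] / df["hours_normal"]
print(df)
```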


Feature selection

Feature selection refers to techniques for choosing a subset of the features (attributes) that is as small as possible and sufficient for the data analysis. It includes removing (more or less) irrelevant features and removing redundant features.


Feature selection techniques

• Selecting the top-ranked features. Choose the features with the best evaluation when single features are evaluated.

• Selecting the top-ranked subset. Choose the subset of features with the best performance. This requires an exhaustive search and is infeasible for larger numbers of features. (For 20 features there are already more than one million possible subsets.)

• Forward selection. Start with the empty set of features and add features one by one. In each step, add the feature that yields the best improvement in performance. (A sketch follows after this list.)

• Backward elimination. Start with the full set of features and remove features one by one. In each step, remove the feature that leads to the smallest decrease in performance.
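As an illustration, here is a minimal greedy forward-selection sketch, assuming scikit-learn is available; the classifier and the data set are placeholders, not prescribed by the slides:

```python
# Greedy forward selection using cross-validated accuracy as the
# performance measure. Classifier and data set are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
selected, remaining, best = [], list(range(X.shape[1])), -np.inf

while remaining:
    # Evaluate each remaining feature added to the current subset.
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best:      # stop when no feature improves performance
        break
    best = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("selected features:", selected, "cv accuracy: %.3f" % best)
```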


Record selection

Reasons for using only a subsample:

• Faster computation.

• Cross-validation with training and test sets.

• Timeliness. Data which is outdated can be removed.

• Representativeness. Does the given sample match the whole population? If not, and we have information about the true distribution, select a representative subsample. (E.g. there are more women than men in a questionnaire for computer scientists.)

• Rare events. Deliberately select more of the rare events to model them better (see the sketch below).
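For the rare-events point, a subsample can deliberately over-represent the rare class; a minimal pandas sketch, with a hypothetical binary label column:

```python
# Deliberately over-representing a rare class in a subsample.
# The label column "fraud" is a hypothetical example.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"fraud": rng.random(10_000) < 0.01})  # ~1% rare events

rare   = df[df["fraud"]]
common = df[~df["fraud"]].sample(n=4 * len(rare), random_state=0)
sub = pd.concat([rare, common]).sample(frac=1, random_state=0)  # shuffle

print(f"rare share: full {df['fraud'].mean():.1%}, subsample {sub['fraud'].mean():.1%}")
```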


Data cleansing

Data cleansing or data scrubbing refers to detecting / correcting / removing inaccurate, incorrect or incomplete records from a data set.


Improve data quality

• Turn all characters into capital letters to level case sensitivity.

• Remove spaces and nonprinting characters.

• Fix the format of numbers, date and time (including the decimal point).

• Split fields that carry mixed information into separate attributes, e.g. "Chocolate, 100g" into "Chocolate" and "100.0"; such mixed fields are known as field overloading. (See the sketch after this list.)

• Use a spell-checker or stemming to normalize spelling in free-text entries.

• Replace abbreviations by their long forms (with the help of a dictionary).


• Normalize the writing of addresses and names, possibly ignoring the order of title, surname, forename, etc., to ease their re-identification.

• Convert numerical values into standard units, especially if data from different sources (and different countries) are used.

• Use dictionaries containing all possible values of an attribute, if available, to ensure that all values comply with the domain knowledge.
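A minimal sketch of a few of these string-level steps with pandas; the "product" column and its "Name, <grams>g" format are hypothetical examples:

```python
# String-level cleaning: strip spaces, level case, split an overloaded field.
import pandas as pd

df = pd.DataFrame({"product": [" chocolate, 100g", "COFFEE, 250g ", "Tea,50g"]})

s = df["product"].str.strip().str.upper()   # remove spaces, level case
# Split the overloaded field into a name and a numeric weight.
parts = s.str.extract(r"^(?P<name>[^,]+),\s*(?P<grams>\d+)G$")
df["name"]  = parts["name"]
df["grams"] = parts["grams"].astype(float)  # "100" -> 100.0
print(df)
```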


Missing values

• Ignorance/Deletion. Delete the whole record.

• Imputation. Replace the missing values by some estimate, e.g. the mean, the median or the mode of the attribute (see the sketch below).

• Explicit value. Use a specific value to mark entries as missing for the model (e.g. −1 when the domain contains only positive numbers).
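The three strategies in a minimal pandas sketch; the "age" column is a hypothetical example:

```python
# Three ways to treat missing values in a hypothetical "age" column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan]})

deleted  = df.dropna(subset=["age"])               # ignorance/deletion
imputed  = df.fillna({"age": df["age"].median()})  # imputation (median)
explicit = df.fillna({"age": -1})                  # explicit "missing" marker
```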


Transformation of data

Some models can only handle numerical attributes, others only categorical attributes.

Categorical ⇒ Numerical:

• Binary attribute: numerical attribute with the values 0 and 1.

• Ordinal attribute ("sortable"): enumerate the values in the correct order as 1, . . . , k.

• Categorical attribute (not ordinal) with more than two values, say a_1, . . . , a_k: it should not be turned into a single numerical attribute, but into k attributes A_1, . . . , A_k with values 0 and 1, where a_i is represented by A_i = 1 and A_j = 0 for j ≠ i (see the 1-of-k sketch below).
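A minimal sketch of this 1-of-k (one-hot) encoding with pandas; the categorical "color" column is a hypothetical example:

```python
# 1-of-k encoding: one binary attribute per category value.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
onehot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(onehot)   # columns color_blue, color_green, color_red with values 0/1
```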


Transformation of data: Discretization techniques

Splitting a numerical range into a number of bins: Numerical ⇒ Categorical.

• Equi-width discretization. Splits the range into intervals (bins) of the same length.

• Equi-frequency discretization. Splits the range into intervals such that each interval (bin) contains (roughly) the same number of records.

• V-optimal discretization. Minimizes ∑_i n_i·V_i, where n_i is the number of data objects in the i-th interval and V_i is the sample variance of the data in this interval.

• Minimal entropy discretization. Minimizes the entropy. (Only applicable in the case of classification problems.)

A sketch of the first two techniques follows below.
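Equi-width and equi-frequency binning in a minimal pandas sketch on an arbitrary random sample:

```python
# Equi-width (pd.cut) vs. equi-frequency (pd.qcut) discretization.
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).exponential(size=100))

equi_width = pd.cut(x, bins=4)   # 4 intervals of equal length
equi_freq  = pd.qcut(x, q=4)     # 4 intervals with ~25 records each

print(equi_width.value_counts().sort_index())
print(equi_freq.value_counts().sort_index())
```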


Transformation of data: Discretization

[Figure: the same data set discretized with equi-width, equi-frequency, V-optimal and minimal entropy binning.]


Normalisation/Standardisation

For some data analysis techniques (e.g. PCA, MDS, cluster analysis) the influence of an attribute depends on its scale or measurement unit.

[Figure: the same data plotted twice, with the time axis once labelled in hours (0h to 1h) and once in minutes (0 min to 60 min); the apparent structure changes with the unit.]

To guarantee impartiality, some kind of standardisation or normalisation should be applied.


For a numerical attribute X:

• Min-max normalization: n : dom(X) → [0, 1], x ↦ (x − min_X) / (max_X − min_X).

• z-score standardization: with sample mean μ̂_X and empirical standard deviation σ̂_X, s : dom(X) → ℝ, x ↦ (x − μ̂_X) / σ̂_X.

• Decimal scaling: with s the smallest integer larger than log10(max_X), d : dom(X) → [0, 1], x ↦ x / 10^s.
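The three mappings in a minimal NumPy sketch on an arbitrary sample:

```python
# Min-max normalization, z-score standardization and decimal scaling.
import numpy as np

x = np.array([12.0, 45.0, 7.0, 88.0, 53.0])

min_max = (x - x.min()) / (x.max() - x.min())  # values in [0, 1]
z_score = (x - x.mean()) / x.std(ddof=1)       # empirical (sample) std deviation
s = int(np.floor(np.log10(x.max()))) + 1       # smallest integer > log10(max)
decimal = x / 10**s                            # values in [0, 1]
```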

Compendium slides for "Guide to Intelligent Data Analysis", Springer 2011. © Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Adä.