Learning From Data Lecture 27: Learning Aides


  1. Learning From Data Lecture 27: Learning Aides
     Input Preprocessing
     Dimensionality Reduction and Feature Selection
     Principal Components Analysis (PCA)
     Hints, Data Cleaning, Validation, ...
     M. Magdon-Ismail, CSCI 4100/6100

  2. Learning Aides
     Additional tools that can be applied to all techniques:
     - Preprocess the data to account for arbitrary choices made during data collection (input normalization).
     - Remove irrelevant dimensions that can mislead learning (PCA).
     - Incorporate known properties of the target function (hints and invariances).
     - Remove detrimental data (deterministic and stochastic noise).
     - Better ways to validate, i.e. to estimate E_out, for model selection.

  3. Nearest Neighbor
     Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).
     (Age in years, Income in $ x 1,000):   Mr. Good (47, 35)    Mr. Bad (22, 40)
     Mr. Unknown, who has "coordinates" (21 yrs, $36K), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?
     What if income is measured in dollars instead of "K" (thousands of dollars)?
     [Plot: the three applicants in the (Age (yrs), Income (K)) plane.]

  4. Nearest Neighbor Uses Euclidean Distance
     Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).
     (Age in years, Income in $):   Mr. Good (47, 35000)    Mr. Bad (22, 40000)
     Mr. Unknown, who has "coordinates" (21 yrs, $36000), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm?
     With income measured in dollars rather than thousands of dollars, the income axis dominates the Euclidean distance, and the nearest neighbor changes.
     [Plot: the same three applicants with income in $; on equal axis scales the age differences become invisible.]
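A quick numerical check of slides 3 and 4 (a minimal sketch; the helper function and the +1/-1 labels are mine, not from the lecture):

```python
import numpy as np

def nearest_neighbor_label(x, X, y):
    """Label of the training point closest to x under Euclidean distance."""
    return y[np.argmin(np.linalg.norm(X - x, axis=1))]

labels = np.array([+1, -1])                     # +1 = Mr. Good, -1 = Mr. Bad

# Income measured in thousands of dollars
X_thousands = np.array([[47.0, 35.0], [22.0, 40.0]])
print(nearest_neighbor_label(np.array([21.0, 36.0]), X_thousands, labels))    # -1 (deny)

# The same people, with income measured in dollars
X_dollars = np.array([[47.0, 35000.0], [22.0, 40000.0]])
print(nearest_neighbor_label(np.array([21.0, 36000.0]), X_dollars, labels))   # +1 (grant)
```

The same applicant is classified differently purely because of the units chosen for income.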

  5. Uniform Treatment of Dimensions
     Most learning algorithms treat every dimension equally:
     - Nearest neighbor: $d(x, x') = \|x - x'\|$
     - Weight decay: $\Omega(w) = \lambda\, w^t w$
     - SVM: the margin is defined using Euclidean distance
     - RBF: the bump function decays with Euclidean distance
     Input Preprocessing: unless you want to emphasize certain dimensions, preprocess the data so that every dimension is on an equal footing.

  6. Input Preprocessing is a Data Transform
     $X = \begin{bmatrix} x_1^t \\ x_2^t \\ \vdots \\ x_N^t \end{bmatrix}$,  transform each input  $x_n \mapsto z_n = \Phi(x_n)$,  and learn on the transformed data:  $g(x) = \tilde{g}(\Phi(x))$.
     The raw $\{x_n\}$ have (for example) arbitrary scalings in each dimension; the $\{z_n\}$ will not.

  7. Centering
     [Figure: raw data in the $(x_1, x_2)$ plane and the centered data in the $(z_1, z_2)$ plane.]
     Centering:  $z_n = x_n - \bar{x}$,  where  $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$.
     After centering, the data have zero mean:  $\bar{z} = 0$.

  8. Normalizing
     [Figure: raw, centered, and normalized data.]
     Normalizing (after centering):  $z_n = D x_n$  with  $D_{ii} = \frac{1}{\sigma_i}$,  where  $\sigma_i^2 = \frac{1}{N}\sum_{n=1}^{N} x_{n,i}^2$.
     After normalizing, every coordinate has unit variance:  $\tilde{\sigma}_i = 1$.

  9. Whitening
     [Figure: raw, centered, normalized, and whitened data.]
     Whitening (after centering):  $z_n = \Sigma^{-1/2} x_n$,  where  $\Sigma = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^t = \frac{1}{N} X^t X$.
     After whitening, the coordinates are uncorrelated with unit variance:  $\frac{1}{N} Z^t Z = I$.
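A minimal numpy sketch of the three transforms on slides 7-9; the function names are mine, and $\Sigma^{-1/2}$ is computed here via an eigendecomposition:

```python
import numpy as np

def center(X):
    """Centering: z_n = x_n - xbar, so the transformed data has zero mean."""
    return X - X.mean(axis=0)

def normalize(X):
    """Normalizing (after centering): z_n = D x_n with D_ii = 1/sigma_i,
    so every coordinate has unit variance."""
    Xc = center(X)
    sigma = np.sqrt((Xc ** 2).mean(axis=0))
    return Xc / sigma

def whiten(X):
    """Whitening (after centering): z_n = Sigma^{-1/2} x_n, so (1/N) Z^t Z = I."""
    Xc = center(X)
    Sigma = Xc.T @ Xc / len(Xc)
    eigvals, V = np.linalg.eigh(Sigma)                 # Sigma = V diag(eigvals) V^t
    Sigma_inv_sqrt = V @ np.diag(eigvals ** -0.5) @ V.T
    return Xc @ Sigma_inv_sqrt

# Sanity check on random, correlated data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
Z = whiten(X)
print(np.round(Z.T @ Z / len(Z), 3))                   # approximately the identity matrix
```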

  10. Only Use Training Data For Preprocessing
      WARNING! Transforming the data into a more convenient format has a hidden trap that leads to data snooping. When using a test set, determine the input transformation from the training data only.
      Rule: lock away the test data until you have your final hypothesis.
      [Diagram: $\mathcal{D} \to \mathcal{D}_{\text{train}} \to$ input preprocessing $z = \Phi(x) \to g(x) = \tilde{g}(\Phi(x))$; $\mathcal{D}_{\text{test}}$ is used only at the end to compute $E_{\text{test}}$.]
      [Plot: Cumulative Profit (%) versus Day over 500 days; the "snooping" curve appears far more profitable than the "no snooping" curve.]
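A sketch of the rule on this slide, using the normalization from slide 8 and an illustrative train/test split (the data and function names are made up): every preprocessing parameter is estimated from the training set and then applied unchanged to the test set.

```python
import numpy as np

def fit_normalizer(X_train):
    """Learn the centering/scaling parameters from the training data ONLY."""
    mean = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return mean, sigma

def apply_normalizer(X, mean, sigma):
    """Apply the training-set parameters to any data (train, validation, or test)."""
    return (X - mean) / sigma

rng = np.random.default_rng(1)
X = rng.normal(loc=[40.0, 35000.0], scale=[10.0, 8000.0], size=(200, 2))
X_train, X_test = X[:150], X[150:]

mean, sigma = fit_normalizer(X_train)        # the test data never touches this step
Z_train = apply_normalizer(X_train, mean, sigma)
Z_test = apply_normalizer(X_test, mean, sigma)
```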

  11. Principal Components Analysis
      [Figure: Original Data $\to$ Rotated Data.]
      Rotate the data so that it is easy to:
      - identify the dominant directions (information);
      - throw away the smaller dimensions (noise).
      $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \to \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \to \begin{pmatrix} z_1 \end{pmatrix}$

  12. Projecting the Data to Maximize Variance
      (Always center the data first.)
      [Figure: original data with a candidate direction $v$.]
      Project each input onto a direction $v$ (a unit vector):  $z_n = x_n^t v$.
      Find the $v$ that maximizes the variance of $z$.

  13. Maximizing the Variance
      [Figure: original (centered) data.]
      $\text{var}[z] = \frac{1}{N}\sum_{n=1}^{N} z_n^2
                     = \frac{1}{N}\sum_{n=1}^{N} v^t x_n x_n^t v
                     = v^t \left( \frac{1}{N}\sum_{n=1}^{N} x_n x_n^t \right) v
                     = v^t \Sigma v.$
      Choose $v = v_1$, the top eigenvector of $\Sigma$: the top principal component.

  14. The Principal Components
      $z_1 = x^t v_1,\quad z_2 = x^t v_2,\quad z_3 = x^t v_3,\ \ldots$
      where $v_1, v_2, \ldots, v_d$ are the eigenvectors of $\Sigma$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$.
      [Figure: original data with the top principal direction $v$.]
      Theorem [Eckart-Young]. These directions give the best reconstruction of the data; they also capture the maximum variance.
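A minimal sketch of the computation the slides describe: center the data, form $\Sigma$, take its eigenvectors sorted by decreasing eigenvalue, and build the PCA features $z_n$ (the function names are mine):

```python
import numpy as np

def principal_components(X):
    """Eigenvectors of Sigma = (1/N) Xc^t Xc (as columns), sorted so that
    lambda_1 >= lambda_2 >= ... >= lambda_d, together with the eigenvalues."""
    Xc = X - X.mean(axis=0)                      # always center first
    Sigma = Xc.T @ Xc / len(Xc)
    eigvals, V = np.linalg.eigh(Sigma)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return V[:, order], eigvals[order]

def project(X, V, k):
    """PCA features: z_n = (x_n^t v_1, ..., x_n^t v_k)."""
    Xc = X - X.mean(axis=0)
    return Xc @ V[:, :k]

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
V, lam = principal_components(X)
Z = project(X, V, k=2)                           # top-2 PCA features
print(lam)                                       # lambda_1 >= lambda_2 >= ...
```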

  15. PCA Features for Digits Data
      [Left plot: % Reconstruction Error versus the number of components $k$ (0 to 200).]
      [Right plot: the digits data projected onto the top two PCA features $(z_1, z_2)$, with "1" versus "not 1" marked.]
      PCA features are automated.
      They capture the dominant directions of the data, but may not capture the dimensions that matter most for $f$.
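The reconstruction-error curve in the left plot can be reproduced with a short sketch; the digits data are not included here, so a random correlated dataset stands in for them, and the reconstruction follows the Eckart-Young form $\hat{x} = \bar{x} + V_k V_k^t (x - \bar{x})$:

```python
import numpy as np

def principal_components(X):
    """Eigenvectors of Sigma = (1/N) Xc^t Xc, sorted by decreasing eigenvalue."""
    Xc = X - X.mean(axis=0)
    eigvals, V = np.linalg.eigh(Xc.T @ Xc / len(Xc))
    return V[:, np.argsort(eigvals)[::-1]]

def reconstruction_error_pct(X, V, k):
    """Percent of the centered data's squared norm lost when keeping only the
    top-k components: residual = (x - xbar) - V_k V_k^t (x - xbar)."""
    Xc = X - X.mean(axis=0)
    Vk = V[:, :k]
    residual = Xc - Xc @ Vk @ Vk.T
    return 100.0 * np.sum(residual ** 2) / np.sum(Xc ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20)) @ rng.normal(size=(20, 20))   # stand-in for the digits features
V = principal_components(X)
for k in (1, 2, 5, 10, 20):
    print(k, round(reconstruction_error_pct(X, V, k), 1))    # reaches 0% at k = d
```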

  16. Other Learning Aides
      1. Nonlinear dimension reduction.
         [Figure: three panels of data in the $(x_1, x_2)$ plane.]
      2. Hints (invariances and prior information): rotational invariance, monotonicity, symmetry, ...
      3. Removing noisy data.
         [Figure: three panels of the digits data in the (intensity, symmetry) plane.]
      4. Advanced validation techniques: Rademacher and permutation penalties.
         More efficient than CV; more convenient and accurate than VC.
