Learning From Data Lecture 27 Learning Aides
Input Preprocessing
Dimensionality Reduction and Feature Selection
Principal Components Analysis (PCA)
Hints, Data Cleaning, Validation, ...
- M. Magdon-Ismail
CSCI 4100/6100
Learning Aides
Preprocess data to account for arbitrary choices during data collection (input normalization)
Remove irrelevant dimensions that can mislead learning (PCA)
Incorporate known properties of the target function (hints and invariances)
Remove detrimental data (deterministic and stochastic noise)
Better ways to validate (estimate $E_{\text{out}}$) for model selection
© AML Creator: Malik Magdon-Ismail
Nearest neighbor
Data points (Age in years, Income in $ × 1,000): (47, 35) and (22, 40).
Should the BoL give him credit, according to the nearest neighbor algorithm? What if income is measured in dollars instead of "K" (thousands of dollars)?
[Figure: the data plotted with axes Age (yrs) and Income (K).]
Nearest neighbor uses Euclidean distance
Data points (Age in years, Income in $): (47, 35000) and (22, 40000).
Should the BoL give him credit, according to the nearest neighbor algorithm? What if income is measured in dollars instead of "K" (thousands of dollars)?
[Figure: the data plotted with axes Age (yrs) and Income ($).]
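The effect of the income unit on the nearest neighbor can be checked numerically. The sketch below uses the two data points from the slides. The query point (21 years, $36K) and the helper `nearest_index` are hypothetical, chosen here only for illustration.

```python
import numpy as np

# The two data points from the slides, as (age in years, income in $K).
points_k = np.array([[47.0, 35.0], [22.0, 40.0]])

# Hypothetical query point (an assumption, not given in the slides).
query_k = np.array([21.0, 36.0])


def nearest_index(points, query):
    """Return the index of the point closest to query in Euclidean distance."""
    distances = np.linalg.norm(points - query, axis=1)
    return int(np.argmin(distances))


# Rescale only the income column from $K to $; age is unchanged.
to_dollars = np.array([1.0, 1000.0])
points_usd = points_k * to_dollars
query_usd = query_k * to_dollars

print(nearest_index(points_k, query_k))      # 1: (22, 40) is nearest in $K
print(nearest_index(points_usd, query_usd))  # 0: (47, 35000) is nearest in $
```

With income in dollars, the income coordinate dominates the Euclidean distance, so the nearest neighbor changes even though the data are the same.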
Algorithms treat dimensions uniformly
Nearest neighbor: $d(x, x') = \|x - x'\|$
Weight decay: $\Omega(w) = \lambda w^{t} w$
SVM: margin defined using Euclidean distance
RBF: bump function decays with Euclidean distance
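Input preprocessing is a data transform
Each input is transformed, $x_n \mapsto z_n = \Phi(x_n)$ for $n = 1, 2, \ldots, N$, and the final hypothesis is $g(x) = \tilde{g}(\Phi(x))$.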
Centering
[Figure: raw data (axes x1, x2) and centered data (axes z1, z2).]
$z_n = x_n - \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$
Normalizing
[Figure: raw data (axes x1, x2), then centered and normalized data (axes z1, z2).]
$z_n = D x_n$ for centered $x_n$, where $D$ is diagonal with $D_{ii} = \frac{1}{\sigma_i}$
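Whitening
[Figure: raw data (axes x1, x2), then centered, normalized, and whitened data (axes z1, z2).]
$z_n = \Sigma^{-1/2} x_n$ for centered $x_n$, where $\Sigma = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^{t} = \frac{1}{N} X^{t} X$, so that $\frac{1}{N} Z^{t} Z = I$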
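The three transforms can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's own code: the function names are made up here, data points are assumed to be the rows of `X`, and whitening assumes $\Sigma$ has full rank so that $\Sigma^{-1/2}$ exists.

```python
import numpy as np


def center(X):
    """Subtract the mean of each coordinate (rows of X are data points)."""
    return X - X.mean(axis=0)


def normalize(X):
    """Center, then scale each coordinate to unit standard deviation."""
    Xc = center(X)
    sigma = Xc.std(axis=0)  # uses 1/N, matching the slides' convention
    return Xc / sigma


def whiten(X):
    """Center, then apply Sigma^{-1/2} so that Z^t Z / N = I."""
    Xc = center(X)
    N = Xc.shape[0]
    Sigma = Xc.T @ Xc / N
    # Sigma is symmetric, so eigh gives Sigma = V diag(lam) V^t.
    lam, V = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = V @ np.diag(lam ** -0.5) @ V.T
    # Rows are z_n^t = x_n^t Sigma^{-1/2}, since Sigma^{-1/2} is symmetric.
    return Xc @ Sigma_inv_sqrt


# Synthetic correlated data, for illustration only.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
X = rng.normal(size=(400, 3)) @ A + np.array([5.0, -2.0, 1.0])

Z = whiten(X)
print(np.round(Z.T @ Z / Z.shape[0], 6))  # approximately the identity matrix
```

Normalizing removes the arbitrary scale of each coordinate. Whitening additionally removes the correlations between coordinates.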
Only use training data
When using a test set, determine the input transformation from training data only.
Rule: lock away the test data until you have your final hypothesis.
$\mathcal{D} \to \mathcal{D}_{\text{train}} \to$ input preprocessing $z = \Phi(x) \to g(x) = \tilde{g}(\Phi(x))$
$\mathcal{D} \to \mathcal{D}_{\text{test}} \to E_{\text{test}}$
[Figure: Cumulative Profit % versus Day, comparing snooping with no snooping.]
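One way to follow the rule in code is to estimate the transform's parameters on the training split and reuse them unchanged on the test split. The sketch below is an illustration under assumptions: the data are synthetic, the 150/50 split is arbitrary, and `fit_standardizer` and `apply_standardizer` are hypothetical helper names.

```python
import numpy as np


def fit_standardizer(X_train):
    """Estimate the per-coordinate mean and standard deviation from training data only."""
    return X_train.mean(axis=0), X_train.std(axis=0)


def apply_standardizer(X, mean, sigma):
    """Apply a previously fitted transform z = (x - mean) / sigma."""
    return (X - mean) / sigma


# Synthetic (age, income in $) data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=[40.0, 40000.0], scale=[10.0, 8000.0], size=(200, 2))
X_train, X_test = X[:150], X[150:]

# The test data play no role in choosing the transform.
mean, sigma = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, sigma)
Z_test = apply_standardizer(X_test, mean, sigma)
```

The test inputs will generally not have exactly zero mean and unit variance after this transform. That is expected, because the transform is part of the hypothesis learned from the training data.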
PCA
[Figure: original data and rotated data.]
Identify the dominant directions (information)
Throw away the smaller dimensions (noise)
$(x_1, x_2) \to (z_1, z_2) \to z_1$
Projecting the data
Project each data point onto a direction $v$: $z_n = x_n^{t} v$
[Figure: original data with the direction v.]
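Maximizing the variance
For centered data, the variance of the projections $z_n = x_n^{t} v$ is
$\operatorname{var}[z] = \frac{1}{N}\sum_{n=1}^{N} z_n^2 = \frac{1}{N}\sum_{n=1}^{N} v^{t} x_n x_n^{t} v = v^{t} \left( \frac{1}{N}\sum_{n=1}^{N} x_n x_n^{t} \right) v = v^{t} \Sigma v$
Over unit vectors $v$, this is maximized by the top eigenvector of $\Sigma$.
[Figure: original data with the direction v.]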
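The claim that the top eigenvector of $\Sigma$ maximizes the projected variance can be checked numerically. This is a minimal sketch on synthetic data. The function name `top_principal_direction` is an assumption made here, and `numpy.linalg.eigh` is used because $\Sigma$ is symmetric.

```python
import numpy as np


def top_principal_direction(X):
    """Return (v1, lambda1): the top eigenvector and eigenvalue of Sigma for centered X."""
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1]


# Synthetic 2-D data stretched along one direction, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

v1, lam1 = top_principal_direction(X)
Xc = X - X.mean(axis=0)
z = Xc @ v1  # projections z_n = x_n^t v1

# The variance of the projection equals the top eigenvalue, v1^t Sigma v1.
print(np.var(z), lam1)

# Random unit directions never achieve a larger projected variance.
directions = rng.normal(size=(100, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
other_vars = np.var(Xc @ directions.T, axis=0)
print(other_vars.max() <= lam1 + 1e-12)
```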
PCA features for digits data
[Figure: % Reconstruction Error versus k.]
[Figure: digits data plotted in the PCA features z1 and z2, "1" versus "not 1".]
Captures dominant directions of the data.
May not capture dominant dimensions for f.
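The reconstruction error as a function of k can be computed from the singular value decomposition of the centered data. The sketch below uses synthetic data, not the digits data from the slide, and `reconstruction_error_percent` is a hypothetical helper name. It takes the error to be the squared reconstruction error as a percentage of the total squared norm of the centered data, which is an assumption about how the plotted quantity is defined.

```python
import numpy as np


def reconstruction_error_percent(X, k):
    """Percent of squared norm lost when keeping only the top k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T            # d x k matrix of the top k directions
    Z = Xc @ Vk              # N x k PCA features
    X_hat = Z @ Vk.T         # reconstruction from the k features
    return 100.0 * np.sum((Xc - X_hat) ** 2) / np.sum(Xc ** 2)


# Synthetic 10-dimensional data with decreasing scale per coordinate.
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.5, 10)
X = rng.normal(size=(300, 10)) * scales

errors = [reconstruction_error_percent(X, k) for k in range(1, 11)]
print(np.round(errors, 2))  # decreases toward 0 as k approaches d = 10
```

A small reconstruction error says the retained directions describe the inputs well. It says nothing about whether those directions are the ones the target f depends on.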
Other Learning Aides
[Figure: three panels with axes x1 and x2.]
Hints: rotational invariance, monotonicity, symmetry, ...
Data cleaning: [Figure: three panels with axes intensity and symmetry.]
Validation: methods that are more efficient than cross-validation (CV), and more convenient and accurate than the VC bound.