Learning From Data, Lecture 27: Learning Aides

  • Input Preprocessing
  • Dimensionality Reduction and Feature Selection
  • Principal Components Analysis (PCA)
  • Hints, Data Cleaning, Validation, . . .

M. Magdon-Ismail

CSCI 4100/6100

Learning Aides

Additional tools that can be applied to all techniques

  • Preprocess data to account for arbitrary choices made during data collection (input normalization)
  • Remove irrelevant dimensions that can mislead learning (PCA)
  • Incorporate known properties of the target function (hints and invariances)
  • Remove detrimental data (deterministic and stochastic noise)
  • Better ways to validate, i.e. estimate Eout, for model selection


Nearest Neighbor

  • Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).

    (Age in years, Income in $ × 1,000)
  • Mr. Good: (47, 35)
  • Mr. Bad: (22, 40)

  • Mr. Unknown, who has “coordinates” (21 yrs, $36K), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm? What if income is measured in dollars instead of “K” (thousands of dollars)?

[Figure: the three applicants plotted with Age (yrs) against Income (K); in these units, Mr. Unknown is nearest to Mr. Bad.]

Nearest Neighbor Uses Euclidean Distance

  • Mr. Good and Mr. Bad were both given credit cards by the Bank of Learning (BoL).

    (Age in years, Income in $)
  • Mr. Good: (47, 35000)
  • Mr. Bad: (22, 40000)

  • Mr. Unknown, who has “coordinates” (21 yrs, $36,000), applies for credit. Should the BoL give him credit, according to the nearest neighbor algorithm? Measuring income in dollars instead of “K” changes the answer: the income axis now dominates the Euclidean distance, so Mr. Unknown is nearest to Mr. Good rather than Mr. Bad. An arbitrary choice of units flips the decision.

[Figure: the same three applicants plotted with Age (yrs) against Income ($); with income in raw dollars, Mr. Unknown is nearest to Mr. Good.]
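To make the unit sensitivity concrete, here is a minimal Python sketch (my own illustration, not from the lecture; the numbers follow the example above) that runs the nearest neighbor comparison in both unit systems:

```python
import numpy as np

# Credit histories: (age in years, income in $K)
good = np.array([47.0, 35.0])      # Mr. Good: credit card worked out
bad = np.array([22.0, 40.0])       # Mr. Bad: credit card did not
unknown = np.array([21.0, 36.0])   # Mr. Unknown: the new applicant

def nearest(query, points, names):
    """Return the name of the point nearest to query in Euclidean distance."""
    dists = [np.linalg.norm(query - p) for p in points]
    return names[int(np.argmin(dists))]

names = ["Mr. Good", "Mr. Bad"]

# Income in $K: both axes have comparable scale -> nearest is Mr. Bad (deny).
print(nearest(unknown, [good, bad], names))

# Income in raw dollars: the income axis dominates -> nearest is Mr. Good (approve).
scale = np.array([1.0, 1000.0])
print(nearest(unknown * scale, [good * scale, bad * scale], names))
```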


Uniform Treatment of Dimensions

Most learning algorithms treat every dimension the same:

  • Nearest neighbor: d(x, x′) = ‖x − x′‖
  • Weight decay: Ω(w) = λwᵀw
  • SVM: the margin is defined using Euclidean distance
  • RBF: the bump function decays with Euclidean distance

Input Preprocessing: unless you want to emphasize certain dimensions, preprocess the data so that each dimension is on an equal footing.


Input Preprocessing is a Data Transform

Each input is transformed, xₙ → zₙ = Φ(xₙ), so the data matrix

    X = [x₁ᵀ; x₂ᵀ; … ; x_Nᵀ]   becomes   Z = [z₁ᵀ; z₂ᵀ; … ; z_Nᵀ],

and the final hypothesis is g(x) = g̃(Φ(x)).

The raw {xₙ} have (for example) arbitrary scalings in each dimension; the transformed {zₙ} will not.


Centering

[Figure: raw data (x₁, x₂) vs. centered data (z₁, z₂).]

Centering:   zₙ = xₙ − x̄,  where x̄ = (1/N) Σₙ₌₁ᴺ xₙ   ⇒   z̄ = 0.
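A minimal numpy sketch of centering (my own toy data, not from the lecture):

```python
import numpy as np

# Synthetic raw data: 100 points with arbitrary offset and scale per dimension.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -3.0], scale=[2.0, 0.5], size=(100, 2))

xbar = X.mean(axis=0)    # sample mean of each dimension
Z = X - xbar             # centering: zn = xn - xbar

print(Z.mean(axis=0))    # ~[0, 0]: the centered data has mean zero
```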

Normalizing

[Figure: raw data (x₁, x₂) → centered (z₁, z₂) → normalized (z₁, z₂).]

Normalizing (applied after centering):   zₙ = Dxₙ,  where Dᵢᵢ = 1/σᵢ and σᵢ is the standard deviation of dimension i   ⇒   z̄ = 0 and σ̃ᵢ = 1.

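The same sketch extended to normalizing; since D is diagonal, applying it is an elementwise rescale (again my own toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -3.0], scale=[2.0, 0.5], size=(100, 2))  # toy raw data

Z = X - X.mean(axis=0)     # center first
d = 1.0 / Z.std(axis=0)    # diagonal entries of D: D_ii = 1/sigma_i
Z = Z * d                  # normalizing: zn = D zn, an elementwise rescale

print(Z.mean(axis=0))      # ~[0, 0]: still centered
print(Z.std(axis=0))       # ~[1, 1]: unit scale in every dimension
```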


Whitening

[Figure: raw data (x₁, x₂) → centered → normalized → whitened (z₁, z₂).]

Whitening (applied after centering):   zₙ = Σ^(−1/2)xₙ,  where Σ = (1/N) Σₙ₌₁ᴺ xₙxₙᵀ = (1/N)XᵀX   ⇒   Σ̃ = (1/N)ZᵀZ = I.

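For whitening, Σ^(−1/2) can be obtained from the eigendecomposition of the sample covariance: Σ = VΛVᵀ gives Σ^(−1/2) = VΛ^(−1/2)Vᵀ. A sketch on synthetic correlated data of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.7], [0.0, 0.5]])  # correlated raw data
X = X - X.mean(axis=0)                 # center first

Sigma = (X.T @ X) / len(X)             # Sigma = (1/N) X^T X
vals, vecs = np.linalg.eigh(Sigma)     # Sigma = V diag(vals) V^T
Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

Z = X @ Sigma_inv_sqrt                 # rows zn = Sigma^(-1/2) xn
print((Z.T @ Z) / len(Z))              # ~ identity: (1/N) Z^T Z = I
```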

Only Use Training Data For Preprocessing

WARNING!

Transforming the data into a more convenient format has a hidden trap that leads to data snooping.

When using a test set, determine the input transformation from the training data only. Rule: lock away the test data until you have your final hypothesis.

    D → D_train → input preprocessing z = Φ(x) → g(x) = g̃(Φ(x))
    D_test is set aside untouched and used only once, to compute E_test.

[Figure: cumulative profit (%) over 500 trading days for the same system with and without snooping in the input preprocessing.]
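A sketch of the rule in code (my own illustration; the data and split sizes are arbitrary): every preprocessing statistic is fit on the training split only, and the test split is merely transformed:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=3.0, size=(500, 4))
train, test = X[:400], X[400:]

# Fit the preprocessing on the training data ONLY.
mean = train.mean(axis=0)
std = train.std(axis=0)

Z_train = (train - mean) / std   # used to learn g
Z_test = (test - mean) / std     # same Phi applied unchanged; test data stays locked away

# Snooping would be: normalizing with X.mean(0) and X.std(0) before splitting.
```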

Principal Components Analysis

Original Data → Rotated Data

Rotate the data so that it is easy to

  • identify the dominant directions (information), and
  • throw away the smaller dimensions (noise).

[Figure: the original data in (x₁, x₂) beside the rotated data in (z₁, z₂).]

Projecting the Data to Maximize Variance

(Always center the data first.) Project each point onto a direction v:

    zₙ = xₙᵀv

Find the unit vector v that maximizes the variance of z.

[Figure: the original data with a candidate direction v drawn through it.]


Maximizing the Variance

var[z] = (1/N) Σₙ₌₁ᴺ zₙ²
       = (1/N) Σₙ₌₁ᴺ vᵀxₙxₙᵀv
       = vᵀ ( (1/N) Σₙ₌₁ᴺ xₙxₙᵀ ) v
       = vᵀΣv.

[Figure: the original data with the variance-maximizing direction v overlaid.]

Choose v = v₁, the top eigenvector of Σ: the top principal component.

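A numpy sketch (mine, on synthetic data) that computes v₁ as the top eigenvector of Σ and checks that the projected variance equals the top eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)                    # always center the data first

Sigma = (X.T @ X) / len(X)                # sample covariance
vals, vecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
v1 = vecs[:, -1]                          # top principal component

z = X @ v1                                # zn = xn^T v1
print(z.var(), vals[-1])                  # projected variance = top eigenvalue
```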

The Principal Components

z₁ = xᵀv₁,  z₂ = xᵀv₂,  z₃ = xᵀv₃,  …

where v₁, v₂, …, v_d are the eigenvectors of Σ with eigenvalues λ₁ ≥ λ₂ ≥ · · · ≥ λ_d.

Theorem [Eckart–Young]. These directions give the best reconstruction of the data; they also capture the maximum variance.

[Figure: the original data with the principal directions overlaid.]

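To illustrate the reconstruction view (again my own sketch on synthetic data, not the lecture's code): keep the top k principal components, reconstruct, and measure the % reconstruction error, as plotted for the digits data on the next slide:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 50))  # correlated 50-d data
X = X - X.mean(axis=0)

Sigma = (X.T @ X) / len(X)
vals, vecs = np.linalg.eigh(Sigma)
V = vecs[:, ::-1]                       # columns v1, v2, ... by decreasing eigenvalue

for k in (1, 5, 20, 50):
    Vk = V[:, :k]
    Xhat = X @ Vk @ Vk.T                # reconstruct from the top-k features z = X Vk
    err = np.sum((X - Xhat) ** 2) / np.sum(X ** 2)
    print(k, f"{100 * err:.1f}% reconstruction error")
```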

PCA Features for Digits Data

[Figures: left, % reconstruction error vs. the number of components k; right, the digits data plotted in its top two PCA features (z₁, z₂), labeled “1” vs. “not 1”.]

Principal components are automated: they capture the dominant directions of the data. Those directions may not be the dominant dimensions for predicting the target f.


Other Learning Aides

  • 1. Nonlinear dimension reduction:

[Figure: three (x₁, x₂) panels illustrating a nonlinear dimension reduction.]

  • 2. Hints (invariances and prior information):

rotational invariance, monotonicity, symmetry, . . .

  • 3. Removing noisy data:

[Figure: three panels of the digits data plotted by intensity and symmetry, illustrating the removal of noisy data.]

  • 4. Advanced validation techniques: Rademacher and permutation penalties.

    More efficient than cross-validation; more convenient and accurate than the VC bound.
