Learning From Data, Lecture 8: Linear Classification and Regression

M. Magdon-Ismail
CSCI 4100/6100

Recap: Approximation Versus Generalization

VC Analysis: Eout ≤ Ein + Ω(dvc)
  1. Did you fit your data well enough (Ein)?
  2. Are you confident your Ein will generalize to Eout?

Bias-Variance Analysis: Eout = bias + var
  1. How well can you fit your data (bias)?
  2. How close to that best fit can you get (var)?

[Figure: Error versus VC dimension dvc; in-sample error falls and model complexity rises as dvc grows, and the out-of-sample error is minimized at d∗vc.]

The VC Insurance Co.

The VC warranty had conditions for becoming void:
  • You can't look at your data before choosing H.
  • Data must be generated i.i.d. from P(x).
  • Data and test case must come from the same P(x) (the same bin).

[Figure: fitting sin(x) with two models; each panel shows the data (x, y), the average hypothesis ḡ(x), and the target sin(x), for H0 and for H1.]

H0: bias = 0.50, var = 0.25, Eout = 0.75
H1: bias = 0.21, var = 1.69, Eout = 1.90
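These numbers can be reproduced numerically. A minimal Monte Carlo sketch, assuming the Learning From Data setup behind this example (target f(x) = sin(πx), x uniform on [−1, 1], N = 2 points per data set; the slide abbreviates the target as sin(x)):

```python
import numpy as np

# Monte Carlo estimate of bias and var for H0 (constant) and H1 (line).
# Assumed setup: f(x) = sin(pi*x), x uniform on [-1, 1], N = 2 points.
rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
xs = np.linspace(-1, 1, 201)        # grid for the expectation over x
trials = 10_000

g0 = np.empty((trials, xs.size))    # H0: h(x) = b, midpoint of the two y's
g1 = np.empty((trials, xs.size))    # H1: h(x) = ax + b, line through both points
for t in range(trials):
    x = rng.uniform(-1, 1, size=2)
    y = f(x)
    g0[t] = y.mean()
    a = (y[1] - y[0]) / (x[1] - x[0])
    g1[t] = a * xs + (y[0] - a * x[0])

for name, g in [("H0", g0), ("H1", g1)]:
    gbar = g.mean(axis=0)                    # average hypothesis g_bar(x)
    bias = np.mean((gbar - f(xs)) ** 2)      # Ex[(g_bar(x) - f(x))^2]
    var = np.mean((g - gbar) ** 2)           # Ex,D[(g(x) - g_bar(x))^2]
    print(f"{name}: bias={bias:.2f}  var={var:.2f}  Eout={bias + var:.2f}")
```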


Recap: Decomposing the Learning Curve

VC Analysis:
[Figure: learning curve, expected error versus number of data points N, showing Ein, Eout, and the generalization error between them.]
Pick H that can generalize and has a good chance to fit the data.

Bias-Variance Analysis:
[Figure: learning curve, expected error versus number of data points N, with Eout decomposed into bias and variance.]
Pick (H, A) to approximate f and not behave wildly after seeing the data.


Three Learning Problems

Credit Analysis:

  Classification:        Approve or Deny           y = ±1
  Regression:            Amount of Credit          y ∈ R
  Logistic Regression:   Probability of Default    y ∈ [0, 1]

  • Linear models are perhaps the most fundamental models.
  • The linear model is the first model to try.


The Linear Signal

s = wtx
  • linear in x: gives the line/hyperplane separator
  • linear in w: makes the algorithms work

x is the augmented vector: x ∈ {1} × Rd


The Linear Signal

The same signal s = wtx feeds all three models:

  • Classification:        h(x) = sign(wtx) ∈ {−1, +1}
  • Regression:            h(x) = wtx ∈ R
  • Logistic Regression:   h(x) = θ(wtx) ∈ [0, 1]
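A minimal sketch of these three uses of the same signal, assuming the logistic θ(s) = e^s/(1 + e^s); the weights and input below are made up for illustration:

```python
import numpy as np

def signal(w, x):
    """The linear signal s = w^t x (x already augmented with x0 = 1)."""
    return np.dot(w, x)

def classify(w, x):                 # classification: output in {-1, +1}
    return np.sign(signal(w, x))

def regress(w, x):                  # regression: output in R
    return signal(w, x)

def predict_probability(w, x):      # logistic regression: output in [0, 1]
    s = signal(w, x)                # theta(s) = e^s / (1 + e^s)
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.5, -1.0, 2.0])      # made-up weights, w0 is the bias term
x = np.array([1.0, 0.3, 0.8])       # made-up augmented input, x0 = 1
print(classify(w, x), regress(w, x), predict_probability(w, x))
```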


Linear Classification

Hlin = {h(x) = sign(wtx)}

1. Ein ≈ Eout, because dvc = d + 1:

   Eout(h) ≤ Ein(h) + O(√((d/N) log N)).

2. If the data is linearly separable, PLA will find a separator ⇒ Ein = 0.

   w(t + 1) = w(t) + x∗y∗,  where (x∗, y∗) is a misclassified data point.

Ein = 0 ⇒ Eout ≈ 0 (f is well approximated by a linear fit).

What if the data is not separable (Ein = 0 is not possible)? → the pocket algorithm.
How do we ensure Ein ≈ 0 is possible? → select good features.
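A minimal sketch of PLA with the update rule above (data layout assumed: X is the N × (d + 1) augmented data matrix, y the ±1 labels):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm.
    X: N x (d+1) augmented data matrix (x0 = 1), y: labels in {-1, +1}.
    On linearly separable data (given enough iterations) it reaches Ein = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        mis = np.where(np.sign(X @ w) != y)[0]   # indices of misclassified points
        if mis.size == 0:
            break                                # separator found: Ein = 0
        n = mis[0]                               # any misclassified point works
        w = w + y[n] * X[n]                      # w(t+1) = w(t) + x* y*
    return w
```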


Non-Separable Data


The Pocket Algorithm

Minimizing Ein directly is a hard combinatorial problem.

The Pocket Algorithm:
  – Run PLA.
  – At each step, keep the best Ein (and its w) so far.

(It's not rocket science, but it works.)

(Other approaches: linear regression, logistic regression, linear programming, . . . )
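A minimal sketch of the pocket idea, reusing numpy and the data layout from the PLA sketch above:

```python
def pocket(X, y, max_iters=1000):
    """Run PLA updates, but keep the best w seen so far 'in the pocket'."""
    w = np.zeros(X.shape[1])
    best_w, best_ein = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(max_iters):
        mis = np.where(np.sign(X @ w) != y)[0]
        if mis.size == 0:
            return w                             # data was separable: Ein = 0
        w = w + y[mis[0]] * X[mis[0]]            # ordinary PLA update
        ein = np.mean(np.sign(X @ w) != y)       # Ein of the new w
        if ein < best_ein:
            best_w, best_ein = w.copy(), ein     # pocket the improvement
    return best_w
```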


Digits Data

Each digit is a 16 × 16 image.


[Raw input: the 16 × 16 grid of grayscale pixel values in [−1, 1] for one example digit.]
x = (1, x1, · · · , x256) ← input
w = (w0, w1, · · · , w256) ← linear model

dvc = 257


Intensity and Symmetry Features

feature: an important property of the input that you think is useful for classification.

(dictionary.com: a prominent or conspicuous part or characteristic)

x = (1, x1, x2) ← input
w = (w0, w1, w2) ← linear model

dvc = 3
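A minimal sketch of computing these two features, assuming each digit arrives as a 16 × 16 numpy array of intensities in [−1, 1]; the slides do not pin down the symmetry formula, so the left-right asymmetry used here is one common choice:

```python
import numpy as np

def intensity(img):
    """Average pixel intensity of a 16 x 16 digit image."""
    return img.mean()

def symmetry(img):
    """Negative left-right asymmetry: 0 for a perfectly symmetric image."""
    return -np.mean(np.abs(img - np.fliplr(img)))

def to_feature_vector(img):
    """Map a 16 x 16 image to the augmented feature vector (1, x1, x2)."""
    return np.array([1.0, intensity(img), symmetry(img)])
```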


PLA on Digits Data

[Figure: PLA on digits data; Ein and Eout (log scale, 1% to 50%) versus iteration number t (250 to 1000).]


Pocket on Digits Data

[Figure: two panels, PLA versus Pocket; Ein and Eout (log scale, 1% to 50%) versus iteration number t (250 to 1000).]


Linear Regression

age            32 years
gender         male
salary         40,000
debt           26,000
years in job   1 year
years at home  3 years
...            ...

Classification: Approve/Deny. Regression: Credit Line (dollar amount).

regression ≡ y ∈ R

h(x) = Σ_{i=0}^{d} wi xi = wtx


Least Squares Linear Regression

[Figure: regression data; one panel plots a single feature x against y, the other plots two features (x1, x2) against y.]


y = f(x) + ε ← noisy target P(y|x)

in-sample error:       Ein(h) = (1/N) Σ_{n=1}^{N} (h(xn) − yn)²

out-of-sample error:   Eout(h) = Ex[(h(x) − y)²]

h(x) = wtx


Using Matrices for Linear Regression

X = [—x1—; —x2—; · · · ; —xN—] ← data matrix, N × (d + 1)

y = (y1, y2, · · · , yN)t ← target vector

ŷ = (ŷ1, ŷ2, · · · , ŷN)t = (wtx1, wtx2, · · · , wtxN)t = Xw ← in-sample predictions

Ein(w) = (1/N) Σ_{n=1}^{N} (ŷn − yn)²
       = (1/N) ||ŷ − y||²
       = (1/N) ||Xw − y||²
       = (1/N) (wtXtXw − 2wtXty + yty)
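A quick numerical check of this chain of identities on random data (all names below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # N x (d+1), x0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

ein_direct = np.sum((X @ w - y) ** 2) / N                  # (1/N)||Xw - y||^2
ein_expanded = (w @ X.T @ X @ w - 2 * w @ (X.T @ y) + y @ y) / N
print(np.isclose(ein_direct, ein_expanded))                # True
```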


Linear Regression Solution

Ein(w) = (1/N) (wtXtXw − 2wtXty + yty)

Vector calculus: to minimize Ein(w), set ∇wEin(w) = 0. Using ∇w(wtAw) = (A + At)w and ∇w(wtb) = b, with A = XtX and b = Xty:

∇wEin(w) = (2/N) (XtXw − Xty)

Setting ∇wEin(w) = 0:

XtXw = Xty ← normal equations
wlin = (XtX)−1Xty ← when XtX is invertible
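Continuing the numerical sketch above (same X, y, N): solve the normal equations and check that the gradient vanishes at wlin; this assumes XtX is invertible:

```python
# Reuses X, y, N from the sketch above.
w_lin = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations, no explicit inverse

grad = (2 / N) * (X.T @ X @ w_lin - X.T @ y)
print(np.allclose(grad, 0))                  # gradient vanishes at w_lin: True
```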


Linear Regression Algorithm

Linear Regression Algorithm:

1. Construct the matrix X and the vector y from the data set (x1, y1), · · · , (xN, yN), where each x includes the x0 = 1 coordinate:
   X = [—x1—; —x2—; · · · ; —xN—] (data matrix), y = (y1, · · · , yN)t (target vector).

2. Compute the pseudo-inverse X† of the matrix X. If XtX is invertible,
   X† = (XtX)−1Xt.

3. Return wlin = X†y.
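A direct sketch of these three steps in numpy; np.linalg.pinv also covers the case where XtX is not invertible:

```python
import numpy as np

def linear_regression(X_raw, y):
    """Steps 1-3 above: X_raw is N x d (one row per data point), y has length N."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])  # step 1: add the x0 = 1 coordinate
    X_dagger = np.linalg.pinv(X)             # step 2: pseudo-inverse of X
    return X_dagger @ y                      # step 3: w_lin = X† y
```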


Generalization

The linear regression algorithm gets the smallest possible Ein, in one step. Generalization is also good.

One can obtain a regression version of dvc. There are other bounds too, for example:

E[Eout(h)] = E[Ein(h)] + O(d/N)

[Figure: learning curve for linear regression, expected Ein and Eout versus the number of data points N; both approach the noise level σ², with a gap that scales with d + 1.]
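A small simulation sketch of this learning curve; the noisy linear target and all parameters below are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, trials = 5, 1.0, 2000
w_true = rng.normal(size=d + 1)              # made-up linear target

for N in [10, 20, 40, 80]:
    ein = eout = 0.0
    for _ in range(trials):
        X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
        y = X @ w_true + sigma * rng.normal(size=N)
        w = np.linalg.pinv(X) @ y
        ein += np.mean((X @ w - y) ** 2)
        Xt = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
        yt = Xt @ w_true + sigma * rng.normal(size=N)
        eout += np.mean((Xt @ w - yt) ** 2)  # fresh data estimates Eout
    # Ein approaches sigma^2 from below, Eout from above, as N grows.
    print(f"N={N:3d}  Ein~{ein / trials:.3f}  Eout~{eout / trials:.3f}")
```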


Linear Regression for Classification

Linear regression can learn any real-valued target function.

For example, yn = ±1 (±1 are real values!). Use linear regression to get w with wtxn ≈ yn = ±1; then sign(wtxn) will likely agree with yn = ±1. These can also be good initial weights for a classification algorithm, as in the sketch below.
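A minimal sketch of this warm-start idea, combining the regression solution with pocket-style PLA updates (reuses numpy and the augmented data layout from the earlier sketches):

```python
def classify_with_regression_init(X, y, max_iters=1000):
    """Warm-start classification: regression weights, refined pocket-style."""
    w = np.linalg.pinv(X) @ y                 # regression fit: wtxn ~ yn = +/-1
    best_w, best_ein = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(max_iters):                # pocket-style PLA refinement
        mis = np.where(np.sign(X @ w) != y)[0]
        if mis.size == 0:
            return w
        w = w + y[mis[0]] * X[mis[0]]
        ein = np.mean(np.sign(X @ w) != y)
        if ein < best_ein:
            best_w, best_ein = w.copy(), ein
    return best_w
```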

Example: classifying "1" versus "not 1" (multiclass → 2-class), using the average intensity and symmetry features.
