Data Mining Techniques CS 6220 - Section 3 - Fall 2016 - Lecture 2: Regression



SLIDE 1

Data Mining Techniques

CS 6220 - Section 3 - Fall 2016

Lecture 2: Regression

Jan-Willem van de Meent (credit: Yijun Zhao, Marc Toussaint, Bishop)

SLIDE 2

Administrativa

Instructor: Jan-Willem van de Meent
Email: j.vandemeent@northeastern.edu
Phone: +1 617 373-7696
Office Hours: 478 WVH, Wed 1.30pm - 2.30pm

Teaching Assistants:
Yuan Zhong
E-mail: yzhong@ccs.neu.edu
Office Hours: WVH 462, Wed 3pm - 5pm
Kamlendra Kumar
E-mail: kumark@zimbra.ccs.neu.edu
Office Hours: WVH 462, Fri 3pm - 5pm

SLIDE 3

Administrativa

Course Website: http://www.ccs.neu.edu/course/cs6220f16/sec3/
Piazza: https://piazza.com/northeastern/fall2016/cs622003/home
Project Guidelines (vote next week): http://www.ccs.neu.edu/course/cs6220f16/sec3/project/

SLIDE 4

Question

What would you like to get out of this course?

SLIDE 5

Linear Regression

SLIDE 6

Regression Examples

x ⇒ y

Features ⇒ Continuous Value

  • {age, major, gender, race} ⇒ GPA
  • {income, credit score, profession} ⇒ Loan Amount
  • {college,major,GPA} ⇒ Future Income
SLIDE 7

Example: Boston Housing Data

UC Irvine Machine Learning Repository (good source for project datasets)
https://archive.ics.uci.edu/ml/datasets/Housing

SLIDE 8

Example: Boston Housing Data

  • 1. CRIM: per capita crime rate by town
  • 2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  • 3. INDUS: proportion of non-retail business acres per town
  • 4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • 5. NOX: nitric oxides concentration (parts per 10 million)
  • 6. RM: average number of rooms per dwelling
  • 7. AGE: proportion of owner-occupied units built prior to 1940
  • 8. DIS: weighted distances to five Boston employment centres
  • 9. RAD: index of accessibility to radial highways
  • 10. TAX: full-value property-tax rate per $10,000
  • 11. PTRATIO: pupil-teacher ratio by town
  • 12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of African Americans by town
  • 13. LSTAT: % lower status of the population
  • 14. MEDV: Median value of owner-occupied homes in $1000's
SLIDE 9

Example: Boston Housing Data

CRIM: per capita crime rate by town

SLIDE 10

Example: Boston Housing Data

CHAS: Charles River dummy variable 
 (= 1 if tract bounds river; 0 otherwise)

SLIDE 11

Example: Boston Housing Data

MEDV: Median value of owner-occupied homes in $1000's

SLIDE 12

Example: Boston Housing Data

N data points, D features

SLIDE 13

Regression: Problem Setup

Given N observations {(x1, y1), (x2, y2), ..., (xN, yN)}, learn a function f such that yi = f(xi) for all i = 1, 2, ..., N, and for a new input x* predict y* = f(x*).

SLIDE 14

Linear Regression

Assume f is a linear combination of D features:

f(x) = w0 + w1 x1 + ... + wD xD = wᵀx

where x = (1, x1, ..., xD)ᵀ and w = (w0, w1, ..., wD)ᵀ.

Learning task: estimate w from the N training points.

SLIDE 15

Linear Regression

SLIDE 16

Error Measure

Mean Squared Error (MSE):

E(w) = (1/N) Σ_{n=1}^N (wᵀxn − yn)² = (1/N) ‖Xw − y‖²

where X is the N × D matrix with rows x1ᵀ, x2ᵀ, ..., xNᵀ and y = (y1, y2, ..., yN)ᵀ.

SLIDE 17

Minimizing the Error

E(w) = (1/N) ‖Xw − y‖²

∇E(w) = (2/N) Xᵀ(Xw − y) = 0

XᵀXw = Xᵀy  ⇒  w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X

SLIDE 18

Minimizing the Error

E(w) = (1/N) ‖Xw − y‖²

∇E(w) = (2/N) Xᵀ(Xw − y) = 0

XᵀXw = Xᵀy  ⇒  w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X

Matrix Cookbook (on course website)

SLIDE 19

Ordinary Least Squares

Construct the matrix X and the vector y from the dataset {(x1, y1), (x2, y2), ..., (xN, yN)} (each x includes x0 = 1):

X = [ x1ᵀ ; x2ᵀ ; ... ; xNᵀ ],   y = (y1, y2, ..., yN)ᵀ

Compute X† = (XᵀX)⁻¹Xᵀ

Return w = X†y
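A minimal NumPy sketch of this procedure (not from the slides; the function name and synthetic data are illustrative). It assumes the design matrix already contains the bias column x0 = 1, and solves the normal equations XᵀXw = Xᵀy directly rather than forming the pseudo-inverse explicitly.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: w minimizing (1/N) ||Xw - y||^2.

    X : (N, D) design matrix whose first column is all ones (x0 = 1)
    y : (N,) target vector
    """
    # Solve the normal equations X^T X w = X^T y  (equivalent to w = X† y)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(ols_fit(X, y))   # close to w_true
```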

SLIDE 20

Gradient Descent

[Figure: contours of E(w) over (w0, w1), illustrating gradient descent steps]

SLIDE 21

Least Mean Squares

Initialize the weights w(0) at time t = 0
for t = 0, 1, 2, ... do
    Compute the gradient: gt = ∇E(w(t))
    Set the direction to move: vt = −gt
    Update the weights: w(t+1) = w(t) + η vt
    Iterate until it is time to stop
Return the final weights w

(a.k.a. gradient descent)
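A minimal sketch of the loop above for the MSE objective E(w) = (1/N) ‖Xw − y‖² (my own translation; the fixed step size η and the fixed iteration budget as the stopping rule are simplifying assumptions).

```python
import numpy as np

def lms_fit(X, y, eta=0.05, n_iters=1000):
    """Batch gradient descent on E(w) = (1/N) ||Xw - y||^2."""
    N, D = X.shape
    w = np.zeros(D)                         # w(0)
    for t in range(n_iters):
        g = (2.0 / N) * X.T @ (X @ w - y)   # gradient g_t = grad E(w(t))
        v = -g                              # direction to move
        w = w + eta * v                     # w(t+1) = w(t) + eta * v_t
    return w
```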

SLIDE 22

Question

When would you want to use OLS, when LMS?

SLIDE 23

Computational Complexity

Ordinary Least Squares (OLS)

Least Mean Squares (LMS)

SLIDE 24

Computational Complexity

Ordinary Least Squares (OLS)

Least Mean Squares (LMS)

OLS is expensive when D is large (forming XᵀX costs O(N D²) and solving the normal equations costs O(D³); each LMS update costs only O(D) per example).

SLIDE 25

Effect of step size

SLIDE 26

Choosing Stepsize

Large gradient ⇒ large step? Small gradient ⇒ small step?

Set the step size proportional to ∇f(x)?

SLIDE 27

Choosing Stepsize

Large gradient ⇒ large step? Small gradient ⇒ small step?

Set the step size proportional to ∇f(x)?

Two commonly used techniques

  • 1. Stepsize adaptation
  • 2. Line search
SLIDE 28

Stepsize Adaptation

Input: initial x ∈ Rⁿ, functions f(x) and ∇f(x), initial stepsize α, tolerance θ
Output: x

repeat
    y ← x − α ∇f(x) / |∇f(x)|
    if f(y) ≤ f(x) then        // step is accepted
        x ← y
        α ← 1.2 α              // increase stepsize
    else                       // step is rejected
        α ← 0.5 α              // decrease stepsize
    end if
until |y − x| < θ  [perhaps for 10 iterations in sequence]

(1.2 and 0.5 are "magic numbers")
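A direct Python transcription of the pseudocode above (a sketch; f and grad_f are assumed to be user-supplied callables, and the convergence test is applied to the proposed step, which for the normalized update amounts to α < θ).

```python
import numpy as np

def stepsize_adaptation(f, grad_f, x0, alpha=1.0, theta=1e-8, max_iters=10_000):
    """Gradient descent with multiplicative stepsize adaptation."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        y = x - alpha * g / np.linalg.norm(g)   # normalized gradient step
        if np.linalg.norm(y - x) < theta:       # proposed step is tiny: stop
            break
        if f(y) <= f(x):                        # step is accepted
            x = y
            alpha *= 1.2                        # increase stepsize
        else:                                   # step is rejected
            alpha *= 0.5                        # decrease stepsize
    return x

# Illustrative usage: minimize a simple quadratic
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x
print(stepsize_adaptation(f, grad_f, x0=[3.0, -2.0]))   # close to [0, 0]
```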

SLIDE 29

Second Order Methods

Compute Hessian matrix of second derivatives

SLIDE 30

Second Order Methods

  • Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

Input: initial x ∈ Rⁿ, functions f(x) and ∇f(x), tolerance θ
Output: x

initialize H⁻¹ = Iₙ
repeat
    compute Δ = −H⁻¹ ∇f(x)
    perform a line search minα f(x + αΔ)
    Δ ← αΔ
    y ← ∇f(x + Δ) − ∇f(x)
    x ← x + Δ
    update H⁻¹ ← (I − yΔᵀ / (Δᵀy))ᵀ H⁻¹ (I − yΔᵀ / (Δᵀy)) + ΔΔᵀ / (Δᵀy)
until ‖Δ‖∞ < θ

Memory-limited version: L-BFGS
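In practice one rarely hand-codes BFGS; the sketch below (my own example, not from the slides) uses SciPy's L-BFGS-B implementation to minimize the regression error E(w) = (1/N) ‖Xw − y‖² with an analytic gradient.

```python
import numpy as np
from scipy.optimize import minimize

def fit_lbfgs(X, y):
    """Minimize E(w) = (1/N) ||Xw - y||^2 with the L-BFGS quasi-Newton method."""
    N, D = X.shape

    def f(w):                                     # objective E(w)
        r = X @ w - y
        return (r @ r) / N

    def grad_f(w):                                # gradient of E(w)
        return (2.0 / N) * X.T @ (X @ w - y)

    result = minimize(f, x0=np.zeros(D), jac=grad_f, method="L-BFGS-B")
    return result.x
```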

SLIDE 31

Stochastic Gradient Descent

What if N is really large?

Batch gradient descent: evaluates the gradient on all data
Minibatch gradient descent: evaluates the gradient on a random subset
Converges under the Robbins-Monro conditions on the step sizes
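A minimal minibatch SGD sketch for the same objective (my own; the decaying step size ηt = η0 / (1 + t) is one simple choice that satisfies the Robbins-Monro conditions Σ ηt = ∞, Σ ηt² < ∞).

```python
import numpy as np

def sgd_fit(X, y, batch_size=32, eta0=0.1, n_epochs=50, seed=0):
    """Minibatch stochastic gradient descent on E(w) = (1/N) ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    for _ in range(n_epochs):
        perm = rng.permutation(N)                        # reshuffle each epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            g = (2.0 / len(idx)) * X[idx].T @ (X[idx] @ w - y[idx])
            w -= (eta0 / (1.0 + t)) * g                  # Robbins-Monro steps
            t += 1
    return w
```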

SLIDE 32

Probabilistic Interpretation

SLIDE 33

Normal Distribution

[Figure: example histograms labeled Right Skewed, Left Skewed, and Random]

SLIDE 34

Normal Distribution

y ∼ N(μ, σ²)

Density: N(y | μ, σ²) = (1 / √(2πσ²)) exp( −(y − μ)² / (2σ²) )

SLIDE 35

Central Limit Theorem

[Figure: distribution of the average of N samples, for N = 1, 2, 10]

If y1, …, yn are

  • 1. independent and identically distributed (i.i.d.), and
  • 2. have finite variance 0 < σy² < ∞,

then the distribution of the sample mean ȳ = (1/n) Σᵢ yᵢ approaches a normal distribution as n → ∞.
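A tiny NumPy experiment in the spirit of the N = 1, 2, 10 panels above (the uniform source distribution is my assumption): the distribution of the sample mean tightens like 1/√N and looks increasingly Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1, 2, 10):
    # 100,000 sample means, each the average of N uniform(0, 1) draws
    means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    print(f"N = {N:2d}: mean = {means.mean():.3f}, std = {means.std():.3f}")
# A histogram of `means` (e.g. with matplotlib) becomes bell-shaped as N grows.
```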
SLIDE 36

Multivariate Normal

Density: N(y | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −½ (y − μ)ᵀ Σ⁻¹ (y − μ) )

SLIDE 37

Regression: Probabilistic Interpretation

SLIDE 38

Regression: Probabilistic Interpretation

SLIDE 39

Regression: Probabilistic Interpretation

Joint probability of N independent data points

SLIDE 40

Regression: Probabilistic Interpretation

Log joint probability of N independent data points

SLIDE 41

Regression: Probabilistic Interpretation

Log joint probability of N independent data points

SLIDE 42

Regression: Probabilistic Interpretation

Log joint probability of N independent data points

SLIDE 43

Regression: Probabilistic Interpretation

Log joint probability of N independent data points

Maximum Likelihood
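The formulas on these slides are in the figures; as a hedged reconstruction, assume the noise model yn = wᵀxn + εn with εn ∼ N(0, σ²). Then log p(y | X, w) = −(N/2) log(2πσ²) − (1/(2σ²)) Σn (yn − wᵀxn)², so maximizing the likelihood in w is exactly minimizing the mean squared error. A small numerical check (illustrative code, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ w_true + sigma * rng.normal(size=200)

def neg_log_likelihood(w):
    # -log p(y | X, w) for Gaussian noise with fixed variance sigma^2
    r = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2) + (r @ r) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x   # maximum likelihood
w_ols = np.linalg.solve(X.T @ X, X.T @ y)                # least squares
print(np.allclose(w_mle, w_ols, atol=1e-3))              # True: same estimator
```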

SLIDE 44

Basis function regression

Linear regression: y = w0 + w1 x1 + ... + wD xD = wᵀx

Basis function regression: y = Σ_j wj φj(x) = wᵀφ(x)

Polynomial regression (scalar x): φj(x) = x^j
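Basis function regression reuses the linear-regression machinery after mapping each input through φ. A minimal sketch for the polynomial basis φj(x) = x^j on scalar inputs (the helper name is mine):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Map scalar inputs x to polynomial features [1, x, x^2, ..., x^M]."""
    return np.vander(x, N=M + 1, increasing=True)

# Fit a degree-M polynomial by ordinary least squares on the expanded features
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)

M = 3
Phi = polynomial_design_matrix(x, M)           # 30 x (M + 1) design matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)    # same normal equations as before
```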

SLIDE 45

Polynomial Regression

[Figure: polynomial fits of order M = 0, 1, 3, 9 to data points (x, t)]

SLIDE 46

Polynomial Regression

[Figure: polynomial fits of order M = 0, 1, 3, 9 to data points (x, t)]

Underfit

SLIDE 47

Polynomial Regression

[Figure: polynomial fits of order M = 0, 1, 3, 9 to data points (x, t)]

Overfit

SLIDE 48

Regularization

L2 regularization (ridge regression) minimizes:

E(w) = (1/N) ‖Xw − y‖² + λ ‖w‖²,   where λ ≥ 0 and ‖w‖² = wᵀw

L1 regularization (LASSO) minimizes:

E(w) = (1/N) ‖Xw − y‖² + λ |w|₁,   where λ ≥ 0 and |w|₁ = Σ_{i=1}^{D} |wi|

SLIDE 49

Regularization

SLIDE 50

Regularization

L2: closed-form solution w = (XᵀX + λI)⁻¹Xᵀy

L1: no closed-form solution. Use quadratic programming:
minimize ‖Xw − y‖²   subject to   ‖w‖₁ ≤ s
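A sketch of both solutions (illustrative; for simplicity the bias weight is penalized along with the rest, and the LASSO uses scikit-learn's coordinate-descent solver rather than an explicit quadratic program; note that scikit-learn scales the squared-error term by 1/(2N), so its alpha is not numerically identical to the λ above):

```python
import numpy as np
from sklearn.linear_model import Lasso

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def lasso_fit(X, y, lam):
    """L1-regularized regression; no closed form, solved iteratively."""
    model = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
    return model.coef_   # many coefficients become exactly zero for large lam
```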

SLIDE 51

Review: Bias-Variance Trade-off

Maximum likelihood estimator

Bias-variance decomposition (expected value over possible data points)

SLIDE 52

Bias-Variance Trade-off

Often: low bias ⇒ high variance low variance ⇒ high bias Trade-off:

SLIDE 53

K-fold Cross-Validation

  • 1. Divide dataset into K “folds”
  • 2. Train on all except k-th fold
  • 3. Test on k-th fold
  • 4. Minimize test error w.r.t. λ
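A minimal sketch of the four steps above, used to pick λ for ridge regression (the random fold assignment and the candidate λ grid are my own choices):

```python
import numpy as np

def kfold_select_lambda(X, y, lambdas, K=5, seed=0):
    """Return the lambda with the lowest average held-out MSE over K folds."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    folds = rng.permutation(N) % K                 # 1. assign points to K folds
    avg_errors = []
    for lam in lambdas:
        fold_errors = []
        for k in range(K):
            train, test = folds != k, folds == k
            # 2. train on all folds except the k-th (closed-form ridge)
            w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(D),
                                X[train].T @ y[train])
            # 3. test on the k-th fold
            fold_errors.append(np.mean((X[test] @ w - y[test]) ** 2))
        avg_errors.append(np.mean(fold_errors))
    # 4. minimize the cross-validation error w.r.t. lambda
    return lambdas[int(np.argmin(avg_errors))]
```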
SLIDE 54

K-fold Cross-Validation

  • Choices for K: 5, 10, N (leave-one-out)
  • Cost of computation: K × (number of λ values) model fits
SLIDE 55

Learning Curve

SLIDE 56

Learning Curve

SLIDE 57

Loss Functions