Data Mining Techniques
CS 6220 - Section 3 - Fall 2016
Lecture 2: Regression
Jan-Willem van de Meent (credit: Yijun Zhao, Marc Toussaint, Bishop)
Administrativa
Instructor: Jan-Willem van de Meent
Email: j.vandemeent@northeastern.edu
Phone: +1 617 373-7696
Office Hours: 478 WVH, Wed 1.30pm - 2.30pm

Teaching Assistants:
Yuan Zhong
E-mail: yzhong@ccs.neu.edu
Office Hours: WVH 462, Wed 3pm - 5pm

Kamlendra Kumar
E-mail: kumark@zimbra.ccs.neu.edu
Office Hours: WVH 462, Fri 3pm - 5pm
Course Website: http://www.ccs.neu.edu/course/cs6220f16/sec3/
Piazza: https://piazza.com/northeastern/fall2016/cs622003/home
Project Guidelines (vote next week): http://www.ccs.neu.edu/course/cs6220f16/sec3/project/
Regression: predict a continuous value from features.
UC Irvine Machine Learning Repository (good source for project datasets) https://archive.ics.uci.edu/ml/datasets/Housing
CRIM: per capita crime rate by town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
MEDV: Median value of owner-occupied homes in $1000's
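As an illustrative sketch (not part of the slides): loading the Housing data with pandas, assuming the classic whitespace-separated housing.data file is still hosted at the UCI path below.

```python
import pandas as pd

# Sketch: load the UCI Housing data. Assumes the 14-column
# whitespace-separated housing.data format; the URL may have moved.
cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
        "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
df = pd.read_csv(url, sep=r"\s+", header=None, names=cols)
X = df.drop(columns="MEDV").to_numpy()  # features
y = df["MEDV"].to_numpy()               # target: median home value ($1000s)
```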
The data form a matrix with N data points (rows) and D features (columns).
Given $N$ observations $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, learn a function $f$ such that $y_i = f(x_i)$ for all $i = 1, 2, \ldots, N$, and for a new input $x^*$ predict $y^* = f(x^*)$.
Assume $f$ is a linear combination of the $D$ features:
$$f(x) = w^\top x = w_0 + w_1 x_1 + \ldots + w_D x_D$$
where $x = (1, x_1, \ldots, x_D)^\top$ and $w = (w_0, w_1, \ldots, w_D)^\top$. Learning task: estimate $w$ from the $N$ data points.
Mean Squared Error (MSE):
$$E(w) = \frac{1}{N} \sum_{n=1}^{N} \left(w^\top x_n - y_n\right)^2 = \frac{1}{N}\,\|Xw - y\|^2$$
where
$$X = \begin{bmatrix} \text{—}\; x_1^\top \;\text{—} \\ \text{—}\; x_2^\top \;\text{—} \\ \vdots \\ \text{—}\; x_N^\top \;\text{—} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
$$E(w) = \frac{1}{N}\,\|Xw - y\|^2$$
$$\nabla E(w) = \frac{2}{N}\,X^\top (Xw - y) = 0$$
$$X^\top X w = X^\top y \quad\Rightarrow\quad w = X^\dagger y$$
where $X^\dagger = (X^\top X)^{-1} X^\top$ is the 'pseudo-inverse' of $X$.
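A minimal NumPy sketch of this closed-form solution (illustrative, not from the slides). Solving the normal equations directly is cheaper and numerically more stable than forming the inverse explicitly.

```python
import numpy as np

# Sketch: ordinary least squares via the normal equations
# X^T X w = X^T y (np.linalg.solve avoids computing the inverse).
def ols_fit(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage with synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # bias + 2 features
true_w = np.array([1.0, 2.0, -0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)
w = ols_fit(X, y)  # close to true_w
```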
Matrix Cookbook (on course website)
[Figure: contours of $E(w)$ in the $(w_0, w_1)$ plane]
Ordinary least squares (OLS)
Least Mean Squares (LMS)
OLS is expensive when $D$ is large: it requires solving a $D \times D$ linear system (inverting $X^\top X$).
Large gradient ⇒ take a large step? Small gradient ⇒ take a small step? Should we set the step size proportional to $\nabla f(x)$?
Two commonly used techniques: stepsize adaptation, and second-order (quasi-Newton) methods.
Gradient descent with stepsize adaptation:

Input: initial $x \in \mathbb{R}^n$, functions $f(x)$ and $\nabla f(x)$, initial stepsize $\alpha$, tolerance $\theta$
Output: $x$
1: repeat
2:   $y \leftarrow x - \alpha\, \nabla f(x) / |\nabla f(x)|$
3:   if $f(y) \leq f(x)$ then [step is accepted]
4:     $x \leftarrow y$
5:     $\alpha \leftarrow 1.2\,\alpha$ // increase stepsize
6:   else [step is rejected]
7:     $\alpha \leftarrow 0.5\,\alpha$ // decrease stepsize
8:   end if
9: until $|y - x| < \theta$ [perhaps for 10 iterations in sequence]
(1.2, 0.5, and 10 are "magic numbers")
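A Python sketch of this pseudocode (one possible reading: the "10 iterations in sequence" caveat becomes a patience counter, and f/grad_f are user-supplied callables):

```python
import numpy as np

# Sketch of gradient descent with stepsize adaptation, following the
# pseudocode above; 1.2, 0.5, and the patience of 10 are its "magic numbers".
def gradient_descent_adaptive(x, f, grad_f, alpha=1.0, theta=1e-6, patience=10):
    small_steps = 0
    while small_steps < patience:
        g = grad_f(x)
        y = x - alpha * g / np.linalg.norm(g)  # normalized gradient step
        step = np.linalg.norm(y - x)
        if f(y) <= f(x):          # step is accepted
            x = y
            alpha *= 1.2          # increase stepsize
        else:                     # step is rejected
            alpha *= 0.5          # decrease stepsize
        small_steps = small_steps + 1 if step < theta else 0
    return x

# Usage on a simple quadratic
x_min = gradient_descent_adaptive(np.array([3.0, -2.0]),
                                  f=lambda x: x @ x,
                                  grad_f=lambda x: 2 * x)
```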
Newton's method computes the Hessian matrix of second derivatives; quasi-Newton methods such as BFGS instead build up an approximation to its inverse from gradient evaluations.
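As a minimal sketch (not from the slides): a Newton step rescales the gradient by the inverse Hessian. For the quadratic MSE objective the Hessian is $H = \frac{2}{N} X^\top X$, so a single step from any starting point lands on the least-squares solution.

```python
import numpy as np

# Sketch: one Newton step x <- x - H^{-1} grad f(x).
def newton_step(x, grad, hessian):
    return x - np.linalg.solve(hessian, grad)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)
grad = 2 / len(y) * X.T @ (X @ w0 - y)  # gradient of the MSE at w0
H = 2 / len(y) * X.T @ X                # Hessian of the MSE (constant)
w1 = newton_step(w0, grad, H)           # equals the least-squares solution
```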
Broyden-Fletcher-Goldfarb-Shanno (BFGS):

Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $\nabla f(x)$, tolerance $\theta$
Output: $x$
1: initialize $H^{-1} = I_n$
2: repeat
3:   compute $\Delta = -H^{-1} \nabla f(x)$
4:   perform a line search $\min_\alpha f(x + \alpha\Delta)$
5:   $\Delta \leftarrow \alpha\Delta$
6:   $y \leftarrow \nabla f(x + \Delta) - \nabla f(x)$
7:   $x \leftarrow x + \Delta$
8:   update $H^{-1} \leftarrow \left(I - \frac{y\,\Delta^\top}{\Delta^\top y}\right)^{\!\top} H^{-1} \left(I - \frac{y\,\Delta^\top}{\Delta^\top y}\right) + \frac{\Delta\,\Delta^\top}{\Delta^\top y}$
9: until $\|\Delta\|_1 < \theta$
Memory-limited version: L-BFGS
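For illustration, SciPy's optimizer exposes L-BFGS; a sketch minimizing the MSE objective (the synthetic data and zero starting point are arbitrary choices, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

# MSE objective and its gradient, as derived earlier
def mse(w, X, y):
    r = X @ w - y
    return r @ r / len(y)

def mse_grad(w, X, y):
    return 2 / len(y) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=200)

# L-BFGS-B is SciPy's memory-limited BFGS implementation
result = minimize(mse, np.zeros(5), args=(X, y),
                  jac=mse_grad, method="L-BFGS-B")
w_hat = result.x  # close to [1, 2, 3, 4, 5]
```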
What if N is really large?
Batch gradient descent (evaluates all data)
Minibatch gradient descent (evaluates a subset); converges under the Robbins-Monro conditions
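A sketch of minibatch gradient descent for the MSE objective, with a decaying step size $\alpha_t = a/(b+t)$ chosen to satisfy the Robbins-Monro conditions $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$ (the constants here are illustrative):

```python
import numpy as np

# Sketch: minibatch SGD for linear regression with a Robbins-Monro
# step-size schedule alpha_t = a / (b + t).
def sgd_linear_regression(X, y, batch_size=32, epochs=50, a=0.5, b=10.0):
    rng = np.random.default_rng(0)
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    for _ in range(epochs):
        # shuffle, then sweep over minibatches
        for idx in np.array_split(rng.permutation(N), max(N // batch_size, 1)):
            grad = 2 / len(idx) * X[idx].T @ (X[idx] @ w - y[idx])
            w -= a / (b + t) * grad  # decaying step size
            t += 1
    return w
```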
[Figure: right-skewed, left-skewed, and random distributions]
Density: $\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$
[Figure: histograms of the mean of $N$ uniformly distributed numbers for $N = 1, 2, 10$; the distribution becomes increasingly Gaussian as $N$ grows]
If $y_1, \ldots, y_N$ are independent and identically distributed, each with density $\mathcal{N}(y_n \mid \mu, \sigma^2)$:
Joint probability of $N$ independent data points:
$$p(y_1, \ldots, y_N \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mu, \sigma^2)$$
Log joint probability of $N$ independent data points:
$$\log p(y_1, \ldots, y_N \mid \mu, \sigma^2) = -\frac{N}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mu)^2$$
Maximum Likelihood: choose the parameters $\mu$ and $\sigma^2$ that maximize the log joint probability.
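A quick numerical check (not from the slides) that the sample mean and biased sample variance maximize the Gaussian log joint probability:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1000)

mu_ml = y.mean()                    # ML estimate of mu
var_ml = ((y - mu_ml) ** 2).mean()  # ML estimate of sigma^2 (biased)

def log_joint(mu, var):
    # sum of log N(y_n | mu, var) over the data
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (y - mu) ** 2 / (2 * var))

# The log joint probability is largest at the ML estimates
assert log_joint(mu_ml, var_ml) > log_joint(mu_ml + 0.1, var_ml)
assert log_joint(mu_ml, var_ml) > log_joint(mu_ml, 1.1 * var_ml)
```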
Linear regression:
$$y = w_0 + w_1 x_1 + \ldots + w_D x_D = w^\top x$$
Basis function regression, e.g. polynomial regression:
$$y = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$$
[Figure: polynomial fits of degree $M = 0, 1, 3, 9$ to the same data; $M = 0$ and $M = 1$ underfit, $M = 9$ overfits]
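A sketch reproducing this behavior, assuming a toy setup (as in Bishop's example) of 10 noisy samples of $\sin(2\pi x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)

def poly_fit(x, t, M):
    # Vandermonde feature matrix with columns 1, x, x^2, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

w_m1 = poly_fit(x, t, 1)  # M = 1: underfits (a straight line)
w_m9 = poly_fit(x, t, 9)  # M = 9: interpolates the noise (overfits)
```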
L2 regularization (ridge regression) minimizes:
$$E(w) = \frac{1}{N}\,\|Xw - y\|^2 + \lambda\,\|w\|^2$$
where $\lambda \geq 0$ and $\|w\|^2 = w^\top w$.
L1 regularization (LASSO) minimizes:
$$E(w) = \frac{1}{N}\,\|Xw - y\|^2 + \lambda\,\|w\|_1$$
where $\lambda \geq 0$ and $\|w\|_1 = \sum_{i=1}^{D} |w_i|$.
L2: closed-form solution $w = (X^\top X + \lambda I)^{-1} X^\top y$
L1: no closed-form solution; use quadratic programming: minimize $\|Xw - y\|^2$ subject to $\|w\|_1 \leq s$
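A NumPy sketch of the ridge closed form (for the LASSO one would instead reach for an iterative solver, e.g. the coordinate descent used by scikit-learn's Lasso):

```python
import numpy as np

# Sketch: ridge regression, w = (X^T X + lambda I)^{-1} X^T y,
# again solved as a linear system rather than via an explicit inverse.
def ridge_fit(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```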
Maximum likelihood estimator: bias-variance decomposition (expected value taken over possible datasets):
$$\text{expected loss} = \text{bias}^2 + \text{variance} + \text{noise}$$
Often: low bias ⇒ high variance, and low variance ⇒ high bias. Trade-off: choose the model complexity (e.g. the regularization strength $\lambda$) to balance the two.
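A Monte Carlo sketch (illustrative setup, not from the slides) that estimates the bias and variance of a polynomial fit at one test point by refitting over many simulated datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)  # true function (assumed toy target)
x0, M, runs = 0.5, 3, 500

preds = []
for _ in range(runs):
    # draw a fresh dataset and refit a degree-M polynomial
    x = rng.uniform(0, 1, size=25)
    t = f(x) + 0.2 * rng.normal(size=25)
    w = np.linalg.lstsq(np.vander(x, M + 1, increasing=True), t, rcond=None)[0]
    preds.append(np.vander(np.array([x0]), M + 1, increasing=True) @ w)

preds = np.concatenate(preds)
bias_sq = (preds.mean() - f(x0)) ** 2  # squared bias at x0
variance = preds.var()                 # variance across datasets
```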