Applied Machine Learning
Linear Regression
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
1
Learning objectives: the linear model; evaluation criteria; how to find the optimal parameters.
2
Effect of income inequality on health and social problems
source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/
History: the method of least squares was invented by Legendre and Gauss (1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).
3.1
Vectors are assumed to be column vectors: $x = [x_1, x_2, \ldots, x_D]^\top$.
We assume N instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$.
Each instance has D features, indexed by d; for example, $x_d^{(n)} \in \mathbb{R}$ is feature d of instance n.
4.1
$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$
Each row of the design matrix X is a datapoint; each column is a feature.
4.2
Example: microarray data. X contains gene expression levels, with one row per patient (n) and one column per gene (d); the label y for each patient can be {cancer / no cancer}.
4.3
Linear model (scalar output for now; we will generalize y to a vector later):
$\hat{y} = f_w(x) = w_0 + w_1 x_1 + \cdots + w_D x_D$
$w_1, \ldots, w_D$ are the model parameters or weights; $w_0$ is the bias or intercept.
Absorbing the bias by prepending a constant feature, $x = [1, x_1, \ldots, x_D]^\top$, gives $\hat{y} = w^\top x$:

yh_n = np.dot(w, x)  # prediction for one instance (x includes the constant 1)
5
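A small illustration of the bias trick (a sketch with made-up numbers; X_raw and the weight values are hypothetical):

import numpy as np

# hypothetical toy data: N = 4 instances, D = 2 features
X_raw = np.array([[2., 3.], [1., 0.], [4., 1.], [0., 5.]])
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # prepend x_0 = 1 for the bias
w = np.array([0.5, 1.0, -2.0])                     # [w_0, w_1, w_2]
yh = X @ w                                         # predictions for all N instances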
Minimize a measure of difference between $\hat{y}^{(n)} = f_w(x^{(n)})$ and $y^{(n)}$.
Squared error loss (a.k.a. L2 loss), for a single instance (a function of the labels):
$\frac{1}{2}\left(\hat{y}^{(n)} - y^{(n)}\right)^2$ (the factor $\frac{1}{2}$ is for future convenience)
For the whole dataset, the sum of squared errors cost function:
$J(w) = \frac{1}{2}\sum_{n=1}^{N}\left(y^{(n)} - w^\top x^{(n)}\right)^2$
loss (per instance) versus cost (whole dataset)
6.1
[Figure: datapoints $(x^{(1)}, y^{(1)}), \ldots, (x^{(4)}, y^{(4)})$ with the fitted line $\hat{y} = w_0^* + w_1^* x$; the vertical distance of $(x^{(3)}, y^{(3)})$ from the line is its residual.]
$w^* = \arg\min_w \frac{1}{2}\sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$
6.2
With two features ($x \in \mathbb{R}^2$) the optimal fit $\hat{y} = w_0^* + w_1^* x_1 + w_2^* x_2$ is a plane, again with $w^* = \arg\min_w \frac{1}{2}\sum_n (y^{(n)} - w^\top x^{(n)})^2$.
6.3
Vectorizing: $w^\top x^{(n)}$ is $(1 \times D)(D \times 1) \in \mathbb{R}$, and stacking all instances, $Xw$ is $(N \times D)(D \times 1) = N \times 1$, so
$J(w) = \frac{1}{2}\lVert y - Xw \rVert_2^2 = \frac{1}{2}(y - Xw)^\top (y - Xw)$
i.e., half the squared L2 norm of the residual vector.

yh = np.dot(X, w)
cost = np.sum((yh - y)**2)/2.  # or: cost = np.mean((yh - y)**2)/2.
7
Weight space versus data space (image: Grosse, Farahmand, Carrasquilla).
The objective is a smooth function of w: find the minimum by setting the partial derivatives to zero.
8.1
Assume first that both x and y are scalars:
$J(w) = \frac{1}{2}\sum_n \left(w x^{(n)} - y^{(n)}\right)^2$
$\frac{dJ}{dw} = \sum_n x^{(n)}\left(w x^{(n)} - y^{(n)}\right) = 0 \;\Rightarrow\; w = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n x^{(n)2}}$
This is the global minimum because the cost is smooth and convex (more on convexity later).
8.2
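A minimal numpy sketch of this scalar closed form (the synthetic data and true slope 3.0 are my own choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + rng.normal(0, 1, size=100)  # noisy line through the origin

w = np.sum(x * y) / np.sum(x ** 2)        # the closed form above
print(w)                                  # close to 3.0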
For a multivariate function $J(w_1, \ldots, w_D)$ we use partial derivatives instead of the derivative:
$\frac{\partial}{\partial w_1} J(w_1, w_2) \triangleq \lim_{\epsilon \to 0} \frac{J(w_1 + \epsilon, w_2) - J(w_1, w_2)}{\epsilon}$
(a partial derivative is the derivative when the other variables are held fixed)
Critical point: all partial derivatives are zero.
Gradient: the vector of all partial derivatives, $\nabla J = \left[\frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_D}\right]^\top$.
8.3
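The limit definition suggests a numerical sanity check (a sketch; J and num_grad are hypothetical helper names):

import numpy as np

def J(w, X, y):                     # the sum-of-squared-errors cost
    return 0.5 * np.sum((X @ w - y) ** 2)

def num_grad(w, X, y, eps=1e-6):    # finite-difference partial derivatives
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (J(w + e, X, y) - J(w, X, y)) / eps
    return g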
Setting the partial derivatives $\frac{\partial}{\partial w_i} J(w)$ to zero (the cost is a smooth and convex function of w):
$\frac{\partial}{\partial w_i} J(w) = \frac{\partial}{\partial w_i} \frac{1}{2}\sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$
Using the chain rule, $\frac{\partial J}{\partial w_i} = \frac{dJ}{df_w}\frac{\partial f_w}{\partial w_i}$, we get
$\sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_i^{(n)} = 0$
8.4
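Stacking all D such equations gives the gradient in one vectorized line (a sketch with arbitrary toy values; it should match num_grad above):

import numpy as np

X = np.random.normal(size=(5, 3))  # toy design matrix
y = np.random.normal(size=5)
w = np.zeros(3)
grad = X.T @ (X @ w - y)           # entry d equals sum_n (w^T x^(n) - y^(n)) x_d^(n)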
$\sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_d^{(n)} = 0$ for every d: a system of D linear equations.
Matrix form (using the design matrix): $X^\top (y - Xw) = 0$, with $X^\top$ of size $D \times N$ and $(y - Xw)$ of size $N \times 1$; each row enforces one of the D equations.
These are the normal equations: for the optimal w, the residual vector is normal (orthogonal) to the column space of the design matrix.
8.5
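This orthogonality can be checked numerically (a sketch; the toy data is my own):

import numpy as np

X = np.random.normal(size=(20, 3))
y = np.random.normal(size=20)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(X.T @ (y - X @ w_star))      # ~ zeros: residual is normal to every column of X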
We can get a closed-form solution!
$X^\top (y - Xw) = 0 \;\Rightarrow\; X^\top X w = X^\top y \;\Rightarrow\; w^* = (X^\top X)^{-1} X^\top y$
with dimensions $D \times D$, $D \times N$, and $N \times 1$ for $(X^\top X)^{-1}$, $X^\top$, and $y$.
$(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X, and $X (X^\top X)^{-1} X^\top$ is the projection matrix onto the column space of X.

w = np.linalg.lstsq(X, y)[0]
8.6
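A quick sketch checking the closed form against the library solver (synthetic data and true weights are my own choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1., -2., .5]) + rng.normal(0, .1, size=50)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least squares
print(np.allclose(w_closed, w_lstsq))           # True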
Complexity of $w^* = (X^\top X)^{-1} X^\top y$:
$X^\top y$ is $O(ND)$: D elements, each using N operations.
$X^\top X$ is $O(D^2 N)$: D x D elements, each requiring N multiplications.
Matrix inversion is $O(D^3)$.
Total complexity for N > D: $O(ND^2 + D^3)$.
In practice we don't directly use matrix inversion (it is numerically unstable).
9
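A rough illustration of the $O(ND^2)$ term (a sketch; the sizes are arbitrary and timings are machine-dependent):

import time
import numpy as np

N = 10_000
for D in (100, 200, 400):                # doubling D should roughly 4x the time
    X = np.random.normal(size=(N, D))
    t0 = time.perf_counter()
    G = X.T @ X                          # the O(N D^2) step
    print(D, time.perf_counter() - t0)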
Multiple targets: instead of $y \in \mathbb{R}^{N}$ we have $Y \in \mathbb{R}^{N \times D'}$, so the model is $\hat{Y} = XW$ with $(N \times D)(D \times D') = N \times D'$.
This is a different weight vector for each target: each column of Y is associated with a column of W.
$W^* = (X^\top X)^{-1} X^\top Y$, with dimensions $D \times D$, $D \times N$, and $N \times D'$.

w = np.linalg.lstsq(X, Y)[0]
10
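A shape check for the multi-target case (the sizes, including D' = 2, are arbitrary):

import numpy as np

N, D, Dp = 50, 3, 2                      # Dp targets per instance
X = np.random.normal(size=(N, D))
Y = np.random.normal(size=(N, Dp))
W = np.linalg.lstsq(X, Y, rcond=None)[0]
print(W.shape)                           # (3, 2): one weight column per column of Y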
So far we learned a linear function $f_w(x) = \sum_d w_d x_d$. Nothing changes if we use nonlinear bases:
$f_w(x) = \sum_d w_d \phi_d(x)$
where each $\phi_d$ is a (nonlinear) feature. The solution simply becomes $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, replacing X with
$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$
11.1
Some choices of bases:
polynomial bases $\phi_k(x) = x^k$
Gaussian bases $\phi_k(x) = \exp\left(-\frac{(x - \mu_k)^2}{s^2}\right)$
sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$
11.2
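For instance, a degree-3 polynomial fit follows the same recipe (a sketch; the toy data and degree are my own choices):

import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(0, .1, 100)       # toy 1D data

Phi = np.stack([x**k for k in range(4)], axis=1)   # bases 1, x, x^2, x^3
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = Phi @ w                                       # fitted curve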
Fitting with Gaussian bases $\phi_k(x) = \exp\left(-\frac{(x - \mu_k)^2}{s^2}\right)$:

import numpy as np
import matplotlib.pyplot as plt

# x: N   y: N   (the 1D training data)
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 10)          # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])   # N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

[Plot: the noisy data, the curve before adding noise, and the fitted curve.]
11.3
Likewise for sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$:

# x: N   y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: 1/(1 + np.exp(-(x - mu)))
mu = np.linspace(0, 10, 10)          # 10 sigmoid bases
Phi = phi(x[:, None], mu[None, :])   # N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')
11.4
Practical issues with $W^* = (X^\top X)^{-1} X^\top Y$:
What if we have a large dataset ($N > 100{,}000{,}000$)? Use stochastic gradient descent (sketched below).
What if $X^\top X$ is not invertible? Then the columns of X (features) are not linearly independent (either redundant features or D > N), and W* is not unique. Make it unique by removing redundant features or by regularization (later!). Decomposition-based methods (not discussed) still work:

w = np.linalg.lstsq(X, Y)[0]

or use gradient descent (later!).
12
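A minimal stochastic gradient descent sketch for this cost (the function name, step size, and step count are illustrative choices; the step size may need tuning for real data):

import numpy as np

def sgd_linreg(X, y, lr=0.01, steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(steps):
        n = rng.integers(N)                  # sample one instance at random
        w -= lr * (X[n] @ w - y[n]) * X[n]   # gradient of 1/2 (w^T x - y)^2
    return w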
Recall that the total cost of the closed-form solution is $O(ND^2 + D^3)$.
13