Machine Learning Review

1 Linear Regression

Assume a set of training data is denoted by $\{x^{(i)}, y^{(i)}\}_{i=1,\cdots,m}$, where $x^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \mathbb{R}$. Our aim is to find an optimal $\theta$ such that the hypothesis function $h_\theta : \mathbb{R}^n \to \mathbb{R}$ fits the training data best. In particular, our problem can be formulated under the Euclidean norm as follows:

$$\min_{\theta} \; J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where $J(\theta)$ is the so-called cost function. One choice to solve the above problem is the Gradient Descent (GD) algorithm (see pp. 7 in our slides). Another solution to the linear regression problem is to directly compute the closed-form solution (see pp. 13-17 in our slides). Assuming

$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

$J(\theta)$ can be rewritten as

$$J(\theta) = \frac{1}{2} (y - X\theta)^T (y - X\theta)$$

It can be minimized by setting its derivative w.r.t. $\theta$ to zero, i.e.,

$$\nabla_\theta J(\theta) = X^T X \theta - X^T y = 0$$

Therefore, we have

$$\theta = (X^T X)^{-1} X^T y$$
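As a quick illustration of the closed-form solution, here is a minimal NumPy sketch (the function name is my own; solving the linear system instead of forming the explicit inverse is a numerical choice, not something the slides prescribe):

```python
import numpy as np

def fit_linear_regression(X, y):
    # Normal equation: theta = (X^T X)^{-1} X^T y.
    # np.linalg.solve avoids explicitly inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)
```

This assumes $X^T X$ is invertible; otherwise a pseudo-inverse (`np.linalg.pinv`) would be needed.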

2 Logistic Regression

In this section, we look at the classification problem where $y^{(i)} \in \{0, 1\}$ (the so-called label for the training example). Basically, we use a logistic function (or sigmoid function) $g(z) = 1/(1 + e^{-z})$ to continuously approximate the discrete classification. In particular, we look at the probability that a given data point $x$ belongs to the category $y = 0$ (or $y = 1$). The hypothesis function is defined as

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$


Assuming

$$p(y = 1 \mid x; \theta) = h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}, \qquad p(y = 0 \mid x; \theta) = 1 - h_\theta(x) = \frac{1}{1 + \exp(\theta^T x)}$$

we have that

$$p(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}$$

Given a set of training data $\{x^{(i)}, y^{(i)}\}_{i=1,\cdots,m}$, we define the likelihood function as

$$L(\theta) = \prod_i p(y^{(i)} \mid x^{(i)}; \theta) = \prod_i (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}$$

To simplify the computations, we take the logarithm of the likelihood function (the so-called log-likelihood)

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$

We can use the gradient ascent algorithm to maximize the above objective function (see pp. 10 in our slides). Alternatively, we can use Newton's method (see pp. 11-13 in our slides).
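A minimal gradient-ascent sketch in NumPy (the learning rate, iteration count, and function names are illustrative assumptions; the update uses the standard gradient $\nabla_\theta \ell(\theta) = X^T (y - h_\theta(X\theta))$, which follows from the log-likelihood above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, lr=0.1, iters=1000):
    # Batch gradient ascent on l(theta):
    # grad = sum_i (y_i - h_theta(x_i)) x_i = X^T (y - sigmoid(X theta))
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += lr * X.T @ (y - sigmoid(X @ theta))
    return theta
```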

3 Regularization and Bayesian Statistics

We first take the linear regression problem as an example. To solve the overfitting problem, we introduce a regularization term (or penalty) in the objective function, i.e.,

$$\min_{\theta} \; \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

A gradient descent based solution is given on pp. 7 in our slides. Also, the closed-form solution can be found on pp. 9 (a sketch of the standard form is given at the end of this subsection). As shown on pp. 10, a similar method can be applied to the logistic regression problem.

We then look at Maximum Likelihood Estimation (MLE). Assume data are generated via a probability distribution

$$d \sim p(d \mid \theta)$$

Given a set of independent and identically distributed (i.i.d.) data samples $D = \{d_i\}_{i=1,\cdots,m}$, our goal is to estimate the $\theta$ that best models the data. The log-likelihood function is defined by

$$\ell(\theta) = \log \prod_{i=1}^{m} p(d_i \mid \theta) = \sum_{i=1}^{m} \log p(d_i \mid \theta)$$

Our problem becomes

$$\theta_{MLE} = \arg\max_{\theta} \; \ell(\theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log p(d_i \mid \theta)$$
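As referenced above, the well-known closed form for the regularized least-squares objective is $\theta = (X^T X + \lambda I)^{-1} X^T y$; a minimal NumPy sketch (note this version regularizes every coordinate including the intercept, whereas the slides' pp. 9 variant may exclude $\theta_0$ from the penalty):

```python
import numpy as np

def ridge_regression(X, y, lam):
    # Closed-form minimizer of (1/2m)[ ||X theta - y||^2 + lam ||theta||^2 ]:
    # theta = (X^T X + lam I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```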


In Maximum-a-Posteriori Estimation (MAP), we treat $\theta$ as a random variable. Our goal is to choose the $\theta$ that maximizes the posterior probability of $\theta$ (i.e., its probability in light of the observed data). According to Bayes' rule,

$$p(\theta \mid D) = \frac{p(\theta) \, p(D \mid \theta)}{p(D)}$$

Note that $p(D)$ is independent of $\theta$. In MAP, our problem is formulated as

$$\begin{aligned}
\theta_{MAP} &= \arg\max_{\theta} \; p(\theta \mid D) = \arg\max_{\theta} \; \frac{p(\theta) \, p(D \mid \theta)}{p(D)} \\
&= \arg\max_{\theta} \; p(\theta) \, p(D \mid \theta) = \arg\max_{\theta} \; \log p(\theta) \, p(D \mid \theta) \\
&= \arg\max_{\theta} \; \left( \log p(\theta) + \log p(D \mid \theta) \right) = \arg\max_{\theta} \; \left( \log p(\theta) + \sum_{i=1}^{m} \log p(d_i \mid \theta) \right)
\end{aligned}$$

The detailed application of MLE and MAP to linear regression and logistic regression can be found in pp. 15-22.
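One standard worked step that connects MAP to the penalty of this section (a textbook result, stated here as an illustration rather than as what pp. 15-22 derive): with unit-variance Gaussian noise and a zero-mean Gaussian prior on $\theta$, MAP for linear regression is exactly L2-regularized least squares.

```latex
% Assume d_i = (x^{(i)}, y^{(i)}) with y^{(i)} = \theta^T x^{(i)} + \epsilon,
% \epsilon \sim N(0, 1), and a prior p(\theta) = N(0, (1/\lambda) I). Then
\theta_{MAP}
  = \arg\max_{\theta} \Big( \log p(\theta)
      + \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) \Big)
  = \arg\min_{\theta} \Big( \frac{1}{2} \sum_{i=1}^{m}
      \big( \theta^T x^{(i)} - y^{(i)} \big)^2
      + \frac{\lambda}{2} \|\theta\|^2 \Big)
```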

4 Naive Bayes and EM Algorithm

Given a set of training data, the features and labels are represented by random variables $\{X_j\}_{j=1,\cdots,n}$ and $Y$, respectively. The key assumption of Naive Bayes is that $X_j$ and $X_{j'}$ are conditionally independent given $Y$, for all $j \neq j'$. Hence, given a data sample $x = [x_1, x_2, \cdots, x_n]$ and its label $y$, the chain rule and this assumption give

$$P(X_1 = x_1, X_2 = x_2, \cdots, X_n = x_n \mid Y = y) = \prod_{j=1}^{n} P(X_j = x_j \mid X_1 = x_1, \cdots, X_{j-1} = x_{j-1}, Y = y) = \prod_{j=1}^{n} P(X_j = x_j \mid Y = y)$$

Then the Naive Bayes model can be formulated as

$$P(Y = y \mid X_1 = x_1, \cdots, X_n = x_n) = \frac{P(X_1 = x_1, \cdots, X_n = x_n \mid Y = y) \, P(Y = y)}{P(X_1 = x_1, \cdots, X_n = x_n)} = \frac{P(Y = y) \prod_{j=1}^{n} p(X_j = x_j \mid Y = y)}{P(X_1 = x_1, \cdots, X_n = x_n)}$$

where there are two sets of parameters (denoted by $\Omega$): $p(Y = y)$ (or $p(y)$) and $p(X_j = x \mid Y = y)$ (or $p_j(x \mid y)$). Since the denominator $P(X_1 = x_1, \cdots, X_n = x_n)$ does not depend on $y$ or on these parameters, we have that

$$p(y \mid x_1, \cdots, x_n) \propto p(y) \prod_{j=1}^{n} p_j(x_j \mid y)$$


Given a set of training data $\{x^{(i)}, y^{(i)}\}$, the log-likelihood function is

$$\ell(\Omega) = \sum_{i=1}^{m} \log p(x^{(i)}, y^{(i)}) = \sum_{i=1}^{m} \log \left( p(y^{(i)}) \prod_{j=1}^{n} p_j(x_j^{(i)} \mid y^{(i)}) \right) = \sum_{i=1}^{m} \log p(y^{(i)}) + \sum_{i=1}^{m} \sum_{j=1}^{n} \log p_j(x_j^{(i)} \mid y^{(i)})$$

The maximum-likelihood estimates are then the parameter values $p(y)$ and $p_j(x_j \mid y)$ that maximize

$$\ell(\Omega) = \sum_{i=1}^{m} \log p(y^{(i)}) + \sum_{i=1}^{m} \sum_{j=1}^{n} \log p_j(x_j^{(i)} \mid y^{(i)})$$

subject to the following constraints (a sketch of the resulting count-based estimates is given after this list):

- $p(y) \geq 0$ for all $y \in \{1, 2, \cdots, k\}$
- $\sum_{y=1}^{k} p(y) = 1$
- $p_j(x \mid y) \geq 0$ for all $y \in \{1, \cdots, k\}$, $j \in \{1, \cdots, n\}$, $x \in \{0, 1\}$
- $\sum_{x \in \{0,1\}} p_j(x \mid y) = 1$ for all $y \in \{1, \cdots, k\}$, $j \in \{1, \cdots, n\}$
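The well-known solution to this constrained problem is the empirical frequencies; a minimal NumPy sketch (the function name and 0-indexed labels are my conventions, and no Laplace smoothing is applied, which the slides' solution on pp. 49 may include):

```python
import numpy as np

def naive_bayes_mle(X, y, k):
    # X: (m, n) matrix of binary features, y: (m,) labels in {0, ..., k-1}.
    # ML estimates: p(y=c) = count(y=c)/m,
    #               p_j(x=1 | y=c) = mean of feature j within class c.
    p_y = np.array([(y == c).mean() for c in range(k)])
    p_x1_given_y = np.array([X[y == c].mean(axis=0) for c in range(k)])  # (k, n)
    return p_y, p_x1_given_y
```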

Solutions can be found on pp. 49 in our slides.

In the Expectation-Maximization (EM) algorithm, we assume there is no label for any training data. By introducing a latent variable $z$, the log-likelihood function can be defined as

$$\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z} p(x^{(i)}, z; \theta)$$

The basic idea of the EM algorithm (to maximize $\ell(\theta)$) is to repeatedly construct a lower bound on $\ell$ (E-step), and then optimize that lower bound (M-step). For each $i \in \{1, 2, \cdots, m\}$, let $Q_i$ be some distribution over the $z$'s:

$$\sum_{z} Q_i(z) = 1, \qquad Q_i(z) \geq 0$$

We have

$$\ell(\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \geq \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$


since

$$\log \mathbb{E}_{z^{(i)} \sim Q_i} \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \geq \mathbb{E}_{z^{(i)} \sim Q_i} \left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$

according to Jensen's inequality (see pp. 25 for the details of Jensen's inequality). Therefore, for any set of distributions $Q_i$, $\ell(\theta)$ has the lower bound

$$\sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

In order to tighten the lower bound (i.e., to let the equality hold in Jensen's inequality), $p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})$ should be a constant (such that $\mathbb{E}[p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})] = p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})$). Therefore,

$$Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$$

Since $\sum_{z} Q_i(z) = 1$, one natural choice is

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$

In the EM algorithm, we repeat the following steps until convergence:

- (E-step) For each $i$, set
  $$Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$$
- (M-step) Set
  $$\theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

The convergence proof of the EM algorithm can be found on pp. 95 in our slides. An example of applying the EM algorithm to Mixtures of Gaussians is given on pp. 98-105 (a sketch follows below). Also, as shown on pp. 106-115, the EM algorithm can be applied to the Naive Bayes model.
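A compact illustration of the two steps for a one-dimensional mixture of Gaussians (a sketch under my own conventions, not the slides' pp. 98-105 derivation; the initialization and fixed iteration count are simplifications):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # N(x; mu, var), broadcasting over mixture components.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, iters=100, seed=0):
    # x: (m,) samples. Fits mixing weights pi, means mu, variances var.
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: responsibilities Q_i(z) = p(z | x_i; theta), shape (m, k)
        q = pi * gaussian_pdf(x[:, None], mu, var)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of the lower bound
        nk = q.sum(axis=0)
        pi = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```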

5 SVM

In SVM, we use a hyperplane ($w^T x + b = 0$) to separate the given training data. We first assume the training data is linearly separable. Given a training sample $x^{(i)}$, the margin $\gamma^{(i)}$ is the distance between $x^{(i)}$ and the hyperplane: since the point $x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|}$ lies on the hyperplane,

$$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0 \;\;\Rightarrow\;\; \gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}$$


If $y^{(i)} = 1$, $\gamma^{(i)} \geq 0$; otherwise (when $y^{(i)} = -1$), $\gamma^{(i)} < 0$. In SVM, we actually look at the geometric margin, obtained by removing the sign. Then $\gamma^{(i)}$ is redefined as

$$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right) = \frac{y^{(i)} (w^T x^{(i)} + b)}{\|w\|}$$

Note that scaling $(w, b)$ does not change $\gamma^{(i)}$ (and thus $\gamma$). For the whole training set, the (geometric) margin is written as

$$\gamma = \min_{i} \gamma^{(i)} = \frac{\min_i \{ y^{(i)} (w^T x^{(i)} + b) \}}{\|w\|}$$

Then, SVM can be formulated as

$$\max_{\gamma, w, b} \; \gamma \quad \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \geq \gamma \|w\|, \; \forall i$$

We scale $(w, b)$ such that $\min_i \{ y^{(i)} (w^T x^{(i)} + b) \} = 1$; therefore, $\gamma = 1/\|w\|$. Then, the above formulation becomes

$$\max_{w, b} \; 1/\|w\| \quad \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \geq 1, \; \forall i$$

Maximizing $1/\|w\|$ is equivalent to minimizing $\|w\|^2 = w^T w$:

$$\min_{w, b} \; w^T w \quad \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \geq 1, \; \forall i$$

The above QP problem can be solved by state-of-the-art QP solvers. Unfortunately, generic QP solvers are inefficient, especially in the face of a large training set. We hereby apply Lagrange duality theory to the SVM problem; details can be found on pp. 37-45 (for illustration, a direct primal-QP sketch is given at the end of this section).

When the training data is linearly non-separable, we can use the kernel method (pp. 46-63) or the regularized version of SVM (pp. 64-72).
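For small data sets, the primal QP above can be handed directly to a generic solver; a minimal sketch using the cvxpy modeling library (my choice of tool, not from the slides, which pursue the Lagrange dual instead):

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    # Solve: min_{w,b} w^T w  s.t.  y_i (w^T x_i + b) >= 1 for all i.
    # X: (m, n) data matrix, y: (m,) labels in {-1, +1}.
    m, n = X.shape
    w, b = cp.Variable(n), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```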

6 K-Means

In each iteration, we first (re-)assign each example $x^{(i)}$ to its closest cluster center (based on the smallest Euclidean distance):

$$C_{k^*} = \{ x^{(i)} : k^* = \arg\min_{k} \| x^{(i)} - \mu_k \|^2 \}$$

i.e., $C_k$ is the set of examples assigned to cluster $k$ with center $\mu_k$. We then update the cluster means

$$\mu_k = \text{mean}(C_k) = \frac{1}{|C_k|} \sum_{x \in C_k} x$$

The iterations are stopped when the cluster means no longer change by much.
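The two alternating steps in a minimal NumPy sketch (initializing centers by sampling data points and stopping on exact convergence are my simplifications; the sketch also assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # X: (m, d) data matrix. Returns cluster centers and assignments.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(iters):
        # Assignment step: index of the closest center for each example
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, k)
        assign = dist.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned examples
        new_mu = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):  # stop when means no longer change
            break
        mu = new_mu
    return mu, assign
```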


7 Principal Component Analysis (PCA)

The PCA algorithm includes the following steps (a code sketch follows the list):

1. Center the data (subtract the mean $\mu = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$ from each data point): $x^{(i)} \leftarrow x^{(i)} - \mu$
2. Compute the covariance matrix $S = \frac{1}{N} \sum_{i=1}^{N} x^{(i)} (x^{(i)})^T = \frac{1}{N} X X^T$
3. Do an eigendecomposition of the covariance matrix $S$.
4. Take the first $K$ leading eigenvectors $\{u_l\}_{l=1,\cdots,K}$ with eigenvalues $\{\lambda_l\}_{l=1,\cdots,K}$.
5. The final $K$-dimensional projection of the data is given by $Z = U^T X$, where $U$ is $D \times K$ and $Z$ is $K \times N$.
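A direct transcription of the five steps into NumPy (following this note's convention that data points are the columns of $X$, so $X$ is $D \times N$):

```python
import numpy as np

def pca(X, K):
    # X: (D, N) with data points as columns. Returns Z = U^T X of shape (K, N).
    mu = X.mean(axis=1, keepdims=True)      # step 1: compute the mean
    Xc = X - mu                             # step 1: center the data
    S = (Xc @ Xc.T) / X.shape[1]            # step 2: covariance matrix (D x D)
    lam, V = np.linalg.eigh(S)              # step 3: eigendecomposition
    U = V[:, np.argsort(lam)[::-1][:K]]     # step 4: top-K eigenvectors (D x K)
    return U.T @ Xc                         # step 5: projection Z (K x N)
```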