Supervised Learning
Liyao Xiang http://xiangliyao.cn/ Shanghai Jiao Tong University
EE226 Big Data Mining Lecture 4
Reference and Acknowledgement: Most of the course materials are credited to Andrew Ng's CS229 lecture notes.
Running example: predict the price of an apartment from its size.
[Figure: scatter plot of price (million RMB) against size (m²) for the training data.]
Notation:
- x^{(i)} ∈ X: input variables
- y^{(i)} ∈ Y: output variables
- (x^{(i)}, y^{(i)}): a training example
- {(x^{(i)}, y^{(i)}); i = 1, …, m}: the training set
- h: X ↦ Y: the hypothesis (假设函数), which maps a testing example to a predicted value
Let's consider a richer dataset in which we also know the number of bedrooms in each apartment:

Size (m²)    #bedrooms    Price (million ¥)
40                        1.2
65           1            1.9
80           2            2.2
89           2            3.3
120          3            5.3
…            …            …

x_1^{(i)}: the size of the i-th apartment in the training set
x_2^{(i)}: the number of bedrooms of the i-th apartment in the training set
Linear function: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2

More generally, with the convention x_0 = 1 for the intercept term,

h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

Why a linear function?
Given the training set, how do we pick the parameters θ? One reasonable approach: make h(x) close to y for the training examples!
h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

Define the (least-squares) cost function

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

Why a least-squares cost?
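As a concrete illustration, here is a minimal NumPy sketch of the hypothesis and the least-squares cost; the function and variable names are illustrative, and X is assumed to be a design matrix whose first column is all ones (x_0 = 1).

```python
import numpy as np

def h(theta, X):
    """Linear hypothesis h_theta(x) = theta^T x, applied to every row of X."""
    return X @ theta

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = h(theta, X) - y
    return 0.5 * np.dot(residual, residual)

# Toy data in the spirit of the apartment example (size in m^2 -> price in million RMB).
X = np.array([[1.0, 65.0], [1.0, 80.0], [1.0, 89.0], [1.0, 120.0]])
y = np.array([1.9, 2.2, 3.3, 5.3])
print(cost(np.zeros(2), X, y))  # cost of the all-zero initial guess
```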
Start with some initial "guess" for θ, and repeatedly apply the gradient descent (梯度下降) algorithm to make J(θ) smaller:

\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}

Here α is the learning rate, and the negative partial derivative is the direction of steepest decrease of J. For a single training example this gives the least mean square (LMS) update rule

\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}

where (y^{(i)} − h_θ(x^{(i)})) is the error term.
Batch gradient descent: scans every example in the entire training set before taking a single step.
Stochastic gradient descent: updates θ using the gradient of the error w.r.t. a single training example at a time.
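A sketch of the two variants for the least-squares cost, under the same assumptions as the snippet above (design matrix with a leading column of ones); the learning rate and iteration counts are placeholders that would need tuning in practice.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1e-5, iters=10000):
    """Batch GD: every update uses the gradient over the whole training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)       # gradient of J(theta)
        theta -= alpha * grad
    return theta

def stochastic_gradient_descent(X, y, alpha=1e-5, epochs=100):
    """SGD with the LMS rule: each update uses a single training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            error = y[i] - X[i] @ theta    # error term for example i
            theta += alpha * error * X[i]  # theta_j := theta_j + alpha * error * x_j^(i)
    return theta
```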
Unlike many optimization problems, linear regression has only one global minimum, to which gradient descent always converges. This is because the cost function J is a convex quadratic function (二次凸函数).
[Figure: contour plot (等高线) of the cost J over θ; the gradient descent trajectory reaches the global minimum.]
Normal equations: instead of an iterative algorithm, take the derivatives of J with respect to θ, set them to 0, and solve the equations! Using matrix derivative notation, for f: R^{m×n} ↦ R and an m × n matrix A, ∇_A f(A) denotes the matrix of partial derivatives ∂f/∂A_{ij}. Applying the properties of matrix derivatives and traces to J(θ) = \frac{1}{2}(X\theta - y)^T(X\theta - y) gives

\nabla_\theta J(\theta) = X^T X \theta - X^T y = 0

so θ = (X^T X)^{-1} X^T y.
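A minimal NumPy sketch of this closed-form solution (names are illustrative); solving the linear system is preferable to forming an explicit inverse.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: solve X^T X theta = X^T y.
    Assumes X^T X is invertible; np.linalg.lstsq is more robust in practice."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```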
Probabilistic interpretation: assume the targets and inputs are related via

y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}

where ε^{(i)} is an error term capturing unmodeled effects and random noise. Assume the ε^{(i)} are i.i.d. (independently and identically distributed, 独立同分布) and ε^{(i)} ∼ N(0, σ²). Then

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

Maximum likelihood: choose θ so as to give the observed data as high probability as possible. Maximizing the log-likelihood is equivalent to minimizing \frac{1}{2}\sum_{i=1}^{m}(y^{(i)} - \theta^T x^{(i)})^2, which is exactly the least-squares cost, so we can be minimizing this term instead!
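To make the equivalence explicit, the log-likelihood can be expanded directly under the i.i.d. Gaussian noise assumption above:

\ell(\theta) = \log \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)
= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2

Hence maximizing ℓ(θ) over θ is the same as minimizing the least-squares cost J(θ).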
Consider fitting hypotheses of increasing complexity:

y = \theta_0 + \theta_1 x \qquad y = \theta_0 + \theta_1 x + \theta_2 x^2 \qquad y = \sum_{j=0}^{5} \theta_j x^j

The more features we add, the better we can fit the training data. However, there is also a risk in adding too many features: a hypothesis that is too simple underfits the data, while one with too many features can overfit it.
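As an illustration of this trade-off, one can fit polynomials of different degrees to the same small dataset; this sketch uses NumPy and the toy apartment prices (degree 1 tends to underfit, degree 5 to overfit).

```python
import numpy as np

def polynomial_fit(x, y, degree):
    """Least-squares fit of y = sum_{j=0}^{degree} theta_j * x^j."""
    X = np.vander(x, N=degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

x = np.array([40.0, 65.0, 80.0, 89.0, 120.0])
y = np.array([1.2, 1.9, 2.2, 3.3, 5.3])
for degree in (1, 2, 5):
    print(degree, polynomial_fit(x, y, degree))
```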
Locally weighted regression: when predicting at a query point x, weight the error term of each training example and fit θ to minimize

\sum_{i} w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2, \qquad w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)

then output θ^T x. This gives a higher weight to the training examples close to the testing point x. It is a non-parametric algorithm: we keep the entire training dataset when making predictions.
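A minimal sketch of locally weighted linear regression, assuming a design matrix X with a leading column of ones and a bandwidth tau picked by hand (names are illustrative):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=10.0):
    """Locally weighted linear regression prediction at x_query.

    Solves min_theta sum_i w^(i) (y^(i) - theta^T x^(i))^2 with Gaussian
    weights, then returns theta^T x_query."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal equations
    return x_query @ theta
```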
Recall linear regression:

h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x, \qquad J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
Binary classification: if y = 1, the example belongs to the positive class; otherwise, it is a member of the negative class. A linear classifier predicts the label depending on the sign of θ^T x, i.e., sign(θ^T x) = y ⟺ y θ^T x > 0. The larger θ^T x is in magnitude (more negative or more positive), the stronger the belief that y is negative (or positive).

To learn θ, choose a loss function that measures how well the classifier fits the training data. The loss value should be small if y^{(i)} θ^T x^{(i)} > 0 and large if y^{(i)} θ^T x^{(i)} < 0 (and the loss should be convex, so that gradient descent can converge to the global minimum!), with Loss(y^{(i)} θ^T x^{(i)}) → ∞ as y^{(i)} θ^T x^{(i)} → −∞. Common choices:

Loss_{logistic}(z) = \log(1 + e^{-z})   (logistic regression)
Loss_{exp}(z) = e^{-z}   (boosting)
Loss_{hinge}(z) = \max\{1 - z, 0\}   (support vector machines)
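These margin-based losses are simple to write down; a small sketch (function names are mine), where z stands for the margin y θ^T x:

```python
import numpy as np

def logistic_loss(z):
    return np.log(1.0 + np.exp(-z))    # logistic regression

def exp_loss(z):
    return np.exp(-z)                  # boosting

def hinge_loss(z):
    return np.maximum(1.0 - z, 0.0)    # support vector machines

# Each loss is small when the margin z >> 0 and grows as z -> -infinity.
```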
Logistic regression minimizes the average logistic loss over the training set, which hopefully yields a θ such that y^{(i)} θ^T x^{(i)} > 0 for most training examples:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Loss}_{logistic}(y^{(i)} \theta^T x^{(i)}) = \frac{1}{m} \sum_{i=1}^{m} \log\!\left(1 + \exp(-y^{(i)} \theta^T x^{(i)})\right)

The sigmoid (logistic) function g(z) = \frac{1}{1 + e^{-z}} satisfies g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞.
The sigmoid function can also be used to define the probability model for binary classification, and to refine the hypothesis class as

p(Y = y \mid x; \theta) = g(y x^T \theta) = \frac{1}{1 + e^{-y x^T \theta}}, \qquad h_\theta(x) = \frac{1}{1 + e^{-x^T \theta}}

Under this model, maximizing the likelihood of the training labels is equivalent to minimizing the logistic loss.
Gradient of the logistic loss:

\mathrm{Loss}_{logistic}(z) = \log(1 + e^{-z}), \qquad \frac{d}{dz}\mathrm{Loss}_{logistic}(z) = \frac{1}{1 + e^{-z}} \cdot \frac{d}{dz} e^{-z} = -\frac{e^{-z}}{1 + e^{-z}} = -g(-z)

where g is the sigmoid function. Hence

\frac{\partial}{\partial \theta_k} \mathrm{Loss}_{logistic}(y x^T \theta) = -g(-y x^T \theta) \frac{\partial}{\partial \theta_k}(y x^T \theta) = -g(-y x^T \theta)\, y x_k

and the stochastic gradient descent update is

\theta^{t+1} = \theta^{t} - \alpha_t \cdot \nabla_\theta \mathrm{Loss}_{logistic}(y^{(i)} x^{(i)T} \theta^{t})

Note that g(−y x^T θ) is the model's probability of the incorrect label, so the update moves θ further on examples the model is more likely to get wrong.
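A sketch of this stochastic gradient update for labels y ∈ {−1, +1}, with a fixed learning rate and illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.1, epochs=100):
    """SGD on the average logistic loss; labels y must be in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(X.shape[0]):
            margin = y[i] * (X[i] @ theta)
            # gradient of Loss_logistic(y_i x_i^T theta) is -g(-margin) * y_i * x_i
            theta += alpha * sigmoid(-margin) * y[i] * X[i]
    return theta
```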
Alternatively, with labels y ∈ {0, 1}:

P(y = 1 \mid x; \theta) = h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

so p(y \mid x; \theta) = (h_\theta(x))^{y} (1 - h_\theta(x))^{1-y}. Maximizing the log-likelihood by gradient ascent gives an update similar to the least mean square update rule, except that h is now non-linear!
Newton's method: minimizing the loss corresponds to finding the points where its first derivative is 0. Newton's method typically converges faster than gradient descent, and requires many fewer iterations to get very close to the minimum. For a scalar parameter,

\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}

and in the vector case

\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)

where H is the Hessian matrix of second derivatives.
Recap of logistic regression: minimize the average logistic loss \mathrm{Loss}_{logistic}(z) = \log(1 + e^{-z}), either by stochastic gradient descent, \theta^{t+1} = \theta^{t} - \alpha_t \cdot \nabla_\theta \mathrm{Loss}_{logistic}(y^{(i)} x^{(i)T} \theta^{t}), or by Newton's method, \theta := \theta - \ell'(\theta)/\ell''(\theta).
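A sketch of Newton's method applied to the average logistic loss for labels in {−1, +1}; the Hessian expression follows from differentiating the gradient once more, and the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    """Newton's method for the average logistic loss; labels y in {-1, +1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        margins = y * (X @ theta)
        p_wrong = sigmoid(-margins)            # probability of the incorrect label
        grad = -(X.T @ (p_wrong * y)) / m      # gradient of J(theta)
        s = p_wrong * (1.0 - p_wrong)
        H = (X.T * s) @ X / m                  # Hessian of J(theta)
        theta -= np.linalg.solve(H, grad)      # theta := theta - H^{-1} grad
    return theta
```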
Generalized linear models: is there a general recipe for constructing the hypothesis? Both models we have seen are linear models:

regression: y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2), \quad h_\theta(x) = \theta^T x
classification: y \mid x; \theta \sim \mathrm{Bernoulli}(\phi), \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

Both the Gaussian and the Bernoulli are exponential family distributions:

p(y; \eta) = b(y) \exp\!\left(\eta^T T(y) - a(\eta)\right)

A fixed choice of T, a, and b defines a family of distributions parameterized by the natural parameter η.
Classification (Bernoulli):

p(y; \phi) = \phi^{y}(1 - \phi)^{1-y} = \exp\!\left(y \log \phi + (1 - y)\log(1 - \phi)\right) = \exp\!\left(\left(\log \frac{\phi}{1 - \phi}\right) y + \log(1 - \phi)\right)

Regression (Gaussian, with σ² = 1):

p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}(y - \mu)^2\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}y^2\right) \times \exp\!\left(\mu y - \frac{1}{2}\mu^2\right)

In the Gaussian case, the first factor plays the role of b(y) and the exponent μy matches η^T T(y) with η = μ and T(y) = y. Gaussian, Bernoulli, and many other distributions belong to the exponential family.
To construct a GLM, we make three assumptions:

Assumption 1: y | x; θ follows an exponential family distribution with natural parameter η.
Assumption 2: given x, the goal is to predict the expected value of T(y), i.e., h(x) = E[T(y) | x]. For instance, in logistic regression h_\theta(x) = p(y = 1 \mid x; \theta) = 0 \cdot p(y = 0 \mid x; \theta) + 1 \cdot p(y = 1 \mid x; \theta) = E[y \mid x], so indeed h_\theta(x) = E[y \mid x; \theta].
Assumption 3: the natural parameter depends linearly on the inputs, η = θ^T x.
Linear regression as a GLM: choose the Gaussian to model the conditional distribution, y | x; θ ∼ N(μ, σ²). Then

h_\theta(x) = E[y \mid x; \theta]   (Assumption 2)
= \mu   (mean of the Gaussian)
= \eta   (Assumption 1: writing the Gaussian in the exponential family form p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}y^2) \times \exp(\mu y - \frac{1}{2}\mu^2) shows η = μ)
= \theta^T x   (Assumption 3)

which recovers the hypothesis of ordinary least squares.
Logistic regression as a GLM: choose the Bernoulli exponential family distribution to model the conditional distribution, y | x; θ ∼ Bernoulli(φ). Then

h_\theta(x) = E[y \mid x; \theta]   (Assumption 2)
= \phi   (mean of the Bernoulli distribution)
= \frac{1}{1 + e^{-\eta}}   (Assumption 1: from p(y; \phi) = \exp\!\left(\left(\log \frac{\phi}{1 - \phi}\right) y + \log(1 - \phi)\right), the natural parameter is η = \log \frac{\phi}{1 - \phi}, so φ = 1/(1 + e^{-η}))
= \frac{1}{1 + e^{-\theta^T x}}   (Assumption 3)

which recovers the sigmoid hypothesis of logistic regression.
Softmax regression (multiclass classification): y can take one of k values, with φ_1, …, φ_k denoting the probability of each outcome and \sum_{i=1}^{k} \phi_i = 1. Define T(y) by (T(y))_i = 1\{y = i\}, so that E[(T(y))_i] = P(y = i) = \phi_i. The multinomial distribution can be written as

p(y; \phi) = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}} = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{\,1 - \sum_{i=1}^{k-1} 1\{y=i\}}

using \sum_{i=1}^{k} \phi_i = 1. Writing this in exponential family form and inverting the natural parameters gives

\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}

the probability of class i! This mapping from η to the class probabilities is known as the softmax function.
Combining Assumption 2, the multinomial distribution expressed in exponential family form, and Assumption 3 (η_i = θ_i^T x), our hypothesis outputs the estimated probability p(y = i | x; θ) for every i ∈ {1, …, k}.
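A small sketch of the softmax response function and the resulting hypothesis, where Theta is a matrix holding one parameter vector θ_i per class (the names and toy values are illustrative):

```python
import numpy as np

def softmax(eta):
    """phi_i = exp(eta_i) / sum_j exp(eta_j), computed in a numerically stable way."""
    z = eta - np.max(eta)
    e = np.exp(z)
    return e / e.sum()

def softmax_hypothesis(Theta, x):
    """Estimated probabilities p(y = i | x; theta), i = 1..k.
    Theta has shape (k, n): one parameter vector per class, eta_i = theta_i^T x."""
    return softmax(Theta @ x)

Theta = np.array([[0.5, -1.0],
                  [0.0,  0.2],
                  [-0.3, 0.8]])   # 3 classes, 2 features (toy values)
x = np.array([1.0, 2.0])
print(softmax_hypothesis(Theta, x))  # three probabilities summing to 1
```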
Summary: linear regression and least squares; logistic regression, gradient methods, and Newton's method; generalized linear models built from exponential family distributions, where the hypothesis follows from the relationship between the expected response variable and the natural parameter η (with T(y) = y in most cases).