6. Linear & logistic regressions
Foundations of Machine Learning, CentraleSupélec, Fall 2017
Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Learning objectives
- Density estimation:
  – Define parametric methods.
  – Define the maximum likelihood estimator and compute it for Bernoulli, multinomial and Gaussian densities.
  – Define the Bayes estimator and compute it for normal priors.
- Supervised learning:
  – Compute the maximum likelihood estimator / least-squares fit solution for linear regression.
  – Compute the maximum likelihood estimator for logistic regression.
Density estimation
Parametric methods
- Parametric estimation:
  – Assume a form for p(x|θ), e.g. the Gaussian density $\mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$.
  – Goal: estimate θ from a sample $X = \{x^{(1)}, \dots, x^{(n)}\}$.
  – Usually assume the samples are independent and identically distributed (iid).
Maximum likelihood estimation
- Find θ such that X is the most likely to be drawn.
- Likelihood of θ given the iid sample X: $L(\theta|X) = p(X|\theta) = \prod_{i=1}^n p(x^{(i)}|\theta)$
- Log likelihood: $\ell(\theta|X) = \log L(\theta|X) = \sum_{i=1}^n \log p(x^{(i)}|\theta)$
- Maximum likelihood estimation (MLE): $\theta_{MLE} = \arg\max_\theta \ell(\theta|X)$
Bernoulli density
- Two states: failure / success: $p(x) = p_0^x (1 - p_0)^{1 - x}$, $x \in \{0, 1\}$.
- MLE estimate of $p_0$:
  – Log likelihood: $\ell(p_0|X) = \sum_{i=1}^n \left[ x^{(i)} \log p_0 + (1 - x^{(i)}) \log(1 - p_0) \right]$
  – Maximize the likelihood: set its gradient to 0.
    $\frac{d\ell}{dp_0} = \sum_{i=1}^n \left[ \frac{x^{(i)}}{p_0} - \frac{1 - x^{(i)}}{1 - p_0} \right] = 0 \;\Rightarrow\; \hat{p}_0 = \frac{1}{n} \sum_{i=1}^n x^{(i)}$
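A minimal numerical sketch of this result (assuming NumPy; the sample size and true parameter are illustrative): the closed-form MLE, i.e. the sample mean, agrees with a brute-force maximization of the log likelihood.

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # iid Bernoulli(p0 = 0.3) sample

# Closed-form MLE: the sample mean.
p_mle = x.mean()

# Brute-force check: the log likelihood peaks at the same value.
grid = np.linspace(0.01, 0.99, 99)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
print(p_mle, grid[np.argmax(loglik)])  # both close to 0.3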
Multinomial density
- Consider K mutually exclusive and exhaustive classes:
  – each class occurs with probability $p_k$;
  – $x_1, x_2, \dots, x_K$ are indicator variables: $x_k = 1$ if the outcome is class k, and 0 otherwise.
- The MLE of $p_k$ is the empirical frequency of class k: $\hat{p}_k = \frac{1}{n} \sum_{i=1}^n x_k^{(i)}$
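A minimal sketch of this estimator (assuming NumPy), building the indicator variables explicitly:

import numpy as np

rng = np.random.default_rng(0)
K = 4
p_true = np.array([0.1, 0.2, 0.3, 0.4])
labels = rng.choice(K, size=1000, p=p_true)  # n iid multinomial draws

# One-hot indicator variables x_k, as on the slide.
X = np.eye(K)[labels]

# MLE: average of the indicators = empirical class frequencies.
p_mle = X.mean(axis=0)
print(p_mle)  # close to p_true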
Gaussian distribution
- Gaussian distribution = normal distribution:
  $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
- MLE estimates of μ and σ (obtained, as above, by setting the gradient of the log likelihood to 0):
  $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x^{(i)}, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left( x^{(i)} - \hat{\mu} \right)^2$
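A minimal sketch (assuming NumPy); note that the MLE of σ² divides by n rather than n−1, which makes it a biased estimator, a point the next slide takes up:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # N(mu = 2, sigma = 1.5)

mu_mle = x.mean()                          # MLE of mu: the sample mean
sigma2_mle = ((x - mu_mle) ** 2).mean()    # MLE of sigma^2 (divides by n)
print(mu_mle, np.sqrt(sigma2_mle))         # close to 2.0 and 1.5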
Bias-variance tradeoff
- Mean squared error of the estimator $\hat{\theta}$ of θ:
  $\mathrm{MSE}(\hat{\theta}) = E\left[ (\hat{\theta} - \theta)^2 \right] = \underbrace{\left( E[\hat{\theta}] - \theta \right)^2}_{\text{bias}^2} + \underbrace{E\left[ \left( \hat{\theta} - E[\hat{\theta}] \right)^2 \right]}_{\text{variance}}$
- A biased estimator may achieve better MSE than an unbiased one.
Bayes estimator
- Treat θ as a random variable with prior p(θ).
- Bayes rule: $\underbrace{p(\theta|X)}_{\text{posterior}} = \dfrac{\overbrace{p(X|\theta)}^{\text{likelihood}} \; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}}}$
- Density estimation at x: $p(x|X) = \int p(x|\theta) \, p(\theta|X) \, d\theta$
Bayes estimator
- Treat θ as a random variable with prior p(θ).
- Maximum likelihood estimate (MLE): $\theta_{MLE} = \arg\max_\theta p(X|\theta)$
- Maximum a posteriori (MAP) estimate: $\theta_{MAP} = \arg\max_\theta p(\theta|X)$
- Bayes estimate: $\theta_{Bayes} = E[\theta|X] = \int \theta \, p(\theta|X) \, d\theta$
Bayes estimator: Normal prior
- Prior: $\theta \sim \mathcal{N}(\mu, \sigma^2)$; n data points (iid): $x^{(i)} | \theta \sim \mathcal{N}(\theta, \sigma_0^2)$.
- MLE of θ: the sample mean $m = \frac{1}{n} \sum_{i=1}^n x^{(i)}$.
- Compute the Bayes estimator of θ. Hint: compute p(θ|X) and show that it follows a normal distribution with
  – mean $\dfrac{n/\sigma_0^2}{n/\sigma_0^2 + 1/\sigma^2} \, m + \dfrac{1/\sigma^2}{n/\sigma_0^2 + 1/\sigma^2} \, \mu$
  – variance $\left( n/\sigma_0^2 + 1/\sigma^2 \right)^{-1}$
- Bayes estimator: the posterior mean, a weighted average of the sample mean m and the prior mean μ; the weight of m is large when σ is large or when n is large.
- When n ↗, θ_Bayes gets closer to the sample average (uses information from the sample).
- When σ is small, θ_Bayes gets closer to μ (little uncertainty about the prior).
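A minimal numerical sketch (assuming NumPy) of this shrinkage behaviour; the constants are illustrative:

import numpy as np

rng = np.random.default_rng(0)

# Prior: theta ~ N(mu, sigma^2); data: x_i | theta ~ N(theta, sigma0^2).
mu, sigma = 0.0, 1.0      # prior mean and standard deviation
sigma0 = 2.0              # noise standard deviation
theta_true, n = 1.5, 50
x = rng.normal(theta_true, sigma0, size=n)

m = x.mean()              # MLE: the sample mean
w = (n / sigma0**2) / (n / sigma0**2 + 1 / sigma**2)
theta_bayes = w * m + (1 - w) * mu   # posterior mean: m shrunk towards mu
print(m, theta_bayes)     # theta_bayes lies between mu and m; w -> 1 as n grows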
Linear regression
Linear regression: MLE
- Assume the error is Gaussian distributed: $y = g(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, i.e. $p(y|x) = \mathcal{N}(g(x), \sigma^2)$.
- Replace g with its estimator f: $E[y|x] = \beta x + \beta_0$.
[Figure: the regression line E[y|x] = βx + β₀, with the Gaussian density p(y|x*) of y around the predicted value at a fixed point x*.]
MLE under Gaussian noise
- Maximize the (log) likelihood:
  $\ell(\beta) = \sum_{i=1}^n \log p(y^{(i)}|x^{(i)}) = \underbrace{-\frac{n}{2} \log(2\pi\sigma^2)}_{\text{independent of } \beta} - \frac{1}{2\sigma^2} \sum_{i=1}^n \left( y^{(i)} - \beta^\top x^{(i)} \right)^2$
- Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the sum of squared residuals.
Linear regression: least-squares fit
- Minimize the residual sum of squares:
  $RSS(\beta) = \sum_{i=1}^n \left( y^{(i)} - \beta^\top x^{(i)} \right)^2 = \| y - X\beta \|_2^2$
- Historically:
  – Carl Friedrich Gauss (to predict the location of Ceres);
  – Adrien-Marie Legendre.
- Estimate β by setting the gradient of the RSS to 0. Assuming X has full column rank (and hence XᵀX is invertible):
  $\hat{\beta} = (X^\top X)^{-1} X^\top y$
- If X is rank-deficient, use a pseudo-inverse.
  A pseudo-inverse of A is a matrix G s.t. AGA = A.
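A minimal sketch (assuming NumPy) of the closed-form fit, together with the rank-deficiency-safe alternatives based on the Moore-Penrose pseudo-inverse:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Full column rank: solve the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Rank-deficiency-safe alternatives: lstsq / pseudo-inverse.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_pinv = np.linalg.pinv(X) @ y
print(beta_hat, beta_lstsq, beta_pinv)   # all close to beta_true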
Gauss-Markov Theorem
- Under the assumption that the errors have zero mean, equal variance and are uncorrelated ($E[\epsilon] = 0$, $\mathrm{Var}(\epsilon) = \sigma^2 I$),
  the least-squares estimator of β is its (unique) best linear unbiased estimator.
- Best Linear Unbiased Estimator (BLUE):
  $\mathrm{Var}(\beta^*) - \mathrm{Var}(\hat{\beta})$ is positive semi-definite for any β* that is a linear unbiased estimator of β.
- Proof sketch: write $\beta^* = Cy$ with $C = (X^\top X)^{-1} X^\top + D$. Unbiasedness (true for all β) forces DX = 0, so
  $\mathrm{Var}(\beta^*) = \sigma^2 \left[ (X^\top X)^{-1} + DD^\top \right]$,
  where $DD^\top$ is psd and minimal for D = 0.
Correlated variables
- If the variables are decorrelated:
  – each coefficient can be estimated separately;
  – interpretation is easy: "a change of 1 in x_j is associated with a change of β_j in Y, while everything else stays the same."
- Correlations between variables cause problems (see the simulation sketch below):
  – the variance of all coefficients tends to increase;
  – interpretation is much harder: when x_j changes, so does everything else.
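A small simulation sketch (assuming NumPy; the correlation levels and sample size are illustrative) of the variance inflation: the same model is refit on many samples, with decorrelated versus strongly correlated features.

import numpy as np

rng = np.random.default_rng(0)
n = 200
beta_true = np.array([1.0, 1.0])

def coefficient_spread(rho, n_repeats=500):
    """Empirical std of the least-squares coefficients when the two
    features have correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    betas = []
    for _ in range(n_repeats):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = X @ beta_true + rng.normal(size=n)
        betas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.std(betas, axis=0)

print(coefficient_spread(0.0))   # small coefficient std
print(coefficient_spread(0.95))  # much larger std for correlated features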
Logistic regression
What about classification?
- Model P(Y=1|x) as a linear function?
  – Problem: P(Y=1|x) must be between 0 and 1.
  – Non-linearity:
    - if P(Y=1|x) is close to 1 or 0, x must change a lot for y to change;
    - if P(Y=1|x) is close to 0.5, that's not the case.
  – Hence: use a logit transformation, $\mathrm{logit}(p) = \log \frac{p}{1 - p}$, and model $\mathrm{logit}\left( P(Y=1|x) \right) = \beta^\top x + \beta_0$, i.e.
    $P(Y=1|x) = \frac{1}{1 + e^{-(\beta^\top x + \beta_0)}}$
  → Logistic regression.
[Figure: the sigmoid curve f(x) mapping βᵀx + β₀ to a probability p.]
Maximum likelihood estimation of logistic regression coefficients
- Log likelihood for n observations, writing $p^{(i)} = P(Y=1|x^{(i)})$:
  $\ell(\beta) = \sum_{i=1}^n \left[ y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log(1 - p^{(i)}) \right]$
Maximum likelihood estimation of logistic regression coefficients
- Gradient of the log likelihood:
  $\nabla_\beta \ell(\beta) = \sum_{i=1}^n \left( y^{(i)} - p^{(i)} \right) x^{(i)}$
- To maximize the likelihood:
  – set the gradient to 0;
  – this cannot be solved analytically;
  – $-\ell$ is convex, so we can use gradient descent (no local minima).
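A minimal sketch (assuming NumPy) of this procedure on simulated data, using the gradient formula above; the learning rate and iteration count are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0])
y = rng.binomial(1, sigmoid(X @ beta_true))

# Gradient ascent on the log likelihood (= descent on the convex -l).
beta = np.zeros(p)
lr = 0.5
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ beta)) / n  # averaged gradient from the slide
    beta += lr * grad
print(beta)  # close to beta_true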
Summary
- MAP estimate: $\theta_{MAP} = \arg\max_\theta p(\theta|X)$
- MLE: $\theta_{MLE} = \arg\max_\theta p(X|\theta)$
- Bayes estimate: $\theta_{Bayes} = E[\theta|X]$
- Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the RSS.
- Linear regression MLE: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
- Logistic regression MLE: no analytical solution; solve with gradient descent.
References
- A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
  – Least-squares regression: Chap. 7.6
- The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
  – Least-squares regression: Chap. 2.2.1, 3.1, 3.2.1
  – Gauss-Markov theorem: Chap. 3.2.3
class GradientDescentOptimizer():
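A minimal sketch of such a class, assuming a simple interface (a gradient callback, a fixed learning rate, a fixed iteration budget); the actual lab skeleton may differ:

import numpy as np

class GradientDescentOptimizer:
    """Plain (batch) gradient descent on a differentiable objective."""
    def __init__(self, learning_rate=0.1, n_iter=1000):
        self.learning_rate = learning_rate
        self.n_iter = n_iter

    def minimize(self, grad, w0):
        # `grad(w)` is assumed to return the gradient of the objective at w.
        w = np.asarray(w0, dtype=float).copy()
        for _ in range(self.n_iter):
            w -= self.learning_rate * grad(w)
        return w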
class LeastSquaresRegr()
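A minimal sketch, assuming a scikit-learn-like fit/predict interface; np.linalg.lstsq falls back on the pseudo-inverse when X is rank-deficient:

import numpy as np

class LeastSquaresRegr:
    """Linear regression by the closed-form least-squares fit."""
    def fit(self, X, y):
        # Prepend a column of ones so the intercept beta_0 is learned too.
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        self.coef_, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return self

    def predict(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        return Xb @ self.coef_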
class seq_LeastSquaresRegr()
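A sketch under the assumption that the seq_ prefix means a sequential, one-sample-at-a-time (stochastic gradient) fit of the least-squares objective; both this reading of the name and the interface are assumptions:

import numpy as np

class seq_LeastSquaresRegr:
    """Least-squares regression fit sequentially, one sample at a time
    (stochastic gradient descent on the squared residuals)."""
    def __init__(self, learning_rate=0.01, n_epochs=50):
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs

    def fit(self, X, y):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        self.coef_ = np.zeros(Xb.shape[1])
        for _ in range(self.n_epochs):
            for xi, yi in zip(Xb, y):
                residual = yi - xi @ self.coef_
                # Gradient of (yi - xi.beta)^2 wrt beta is -2 * residual * xi.
                self.coef_ += self.learning_rate * residual * xi
        return self

    def predict(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        return Xb @ self.coef_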
class LogisticRegr()
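A minimal sketch with the same assumed fit/predict interface, implementing the gradient ascent on the log likelihood derived earlier:

import numpy as np

class LogisticRegr:
    """Logistic regression fit by gradient ascent on the log likelihood."""
    def __init__(self, learning_rate=0.5, n_iter=2000):
        self.learning_rate = learning_rate
        self.n_iter = n_iter

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        self.coef_ = np.zeros(Xb.shape[1])
        n = Xb.shape[0]
        for _ in range(self.n_iter):
            p = self._sigmoid(Xb @ self.coef_)
            grad = Xb.T @ (y - p) / n       # averaged gradient of the log likelihood
            self.coef_ += self.learning_rate * grad
        return self

    def predict_proba(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        return self._sigmoid(Xb @ self.coef_)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)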