6. Linear & logistic regressions


SLIDE 1

  • 6. Linear & logistic regressions

Foundations of Machine Learning CentraleSupélec — Fall 2017 Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

SLIDE 2

Learning objectives

  • Density estimation:

    – Define parametric methods.
    – Define the maximum likelihood estimator and compute it for Bernoulli, multinomial and Gaussian densities.
    – Define the Bayes estimator and compute it for normal priors.

  • Supervised learning:

    – Compute the maximum likelihood estimator / least-squares fit solution for linear regression.
    – Compute the maximum likelihood estimator for logistic regression.

SLIDE 3

Density estimation

SLIDE 4

Parametric methods

  • Parametric estimation:

    – Assume a form for p(x|θ), e.g. a Gaussian with θ = (μ, σ²).
    – Goal: estimate θ using the sample X.
    – Usually assume that the samples are independent and identically distributed (iid).

SLIDE 5

Maximum likelihood estimation

  • Find θ such that X is the most likely to be drawn.
  • Likelihood of θ given the i.i.d. sample X: $L(\theta|X) = \prod_{t=1}^{N} p(x^t|\theta)$
  • Log likelihood: $\mathcal{L}(\theta|X) = \sum_{t=1}^{N} \log p(x^t|\theta)$
  • Maximum likelihood estimation (MLE): $\hat{\theta} = \arg\max_\theta \mathcal{L}(\theta|X)$
SLIDE 6

Bernoulli density

  • Two states: failure / success. $x \in \{0, 1\}$, $P(x) = p_0^x (1 - p_0)^{1-x}$

MLE estimate of $p_0$:

SLIDE 7

Bernoulli density

  • Two states: failure / success. $x \in \{0, 1\}$, $P(x) = p_0^x (1 - p_0)^{1-x}$

MLE estimate of $p_0$:

  • Log likelihood: ?
SLIDE 8

Bernoulli density

  • Two states: failure / success. $x \in \{0, 1\}$, $P(x) = p_0^x (1 - p_0)^{1-x}$

MLE estimate of $p_0$:

  • Log likelihood: $\mathcal{L}(p_0|X) = \left(\sum_t x^t\right) \log p_0 + \left(N - \sum_t x^t\right) \log(1 - p_0)$
  • Maximize the likelihood: ?
SLIDE 9

Bernoulli density

  • Two states: failure / success. $x \in \{0, 1\}$, $P(x) = p_0^x (1 - p_0)^{1-x}$

MLE estimate of $p_0$:

  • Log likelihood: $\mathcal{L}(p_0|X) = \left(\sum_t x^t\right) \log p_0 + \left(N - \sum_t x^t\right) \log(1 - p_0)$
  • Maximize the likelihood: set the gradient to 0.

?


SLIDE 11

Bernoulli density

  • Two states: failure / success. $x \in \{0, 1\}$, $P(x) = p_0^x (1 - p_0)^{1-x}$

MLE estimate of $p_0$: $\hat{p}_0 = \frac{1}{N} \sum_{t=1}^{N} x^t$

  • Log likelihood: $\mathcal{L}(p_0|X) = \left(\sum_t x^t\right) \log p_0 + \left(N - \sum_t x^t\right) \log(1 - p_0)$
  • Maximize the likelihood: set its gradient to 0.
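For reference, the standard computation (setting the derivative of the log likelihood to zero):

$$\frac{\partial \mathcal{L}}{\partial p_0} = \frac{\sum_t x^t}{p_0} - \frac{N - \sum_t x^t}{1 - p_0} = 0 \quad\Longrightarrow\quad \hat{p}_0 = \frac{1}{N} \sum_{t=1}^{N} x^t$$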
SLIDE 12

Multinomial density

  • Consider K mutually exclusive and exhaustive classes:

    – Each class occurs with probability $p_k$.
    – $x_1, x_2, \dots, x_K$ are indicator variables: $x_k = 1$ if the outcome is class k and 0 otherwise.

  • The MLE of $p_k$ is $\hat{p}_k = \frac{1}{N} \sum_{t=1}^{N} x_k^t$
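For reference, the standard derivation maximizes the log likelihood subject to $\sum_k p_k = 1$ with a Lagrange multiplier λ:

$$\sum_k \Big(\sum_t x_k^t\Big) \log p_k + \lambda \Big(1 - \sum_k p_k\Big) \quad\Longrightarrow\quad \hat{p}_k = \frac{1}{\lambda} \sum_t x_k^t, \quad \lambda = N$$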
SLIDE 13

Gaussian distribution

  • Gaussian distribution = normal distribution: $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Compute the MLE estimates of μ and σ.


SLIDE 15

Gaussian distribution

  • Gaussian distribution = normal distribution: $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Compute the MLE estimates of μ and σ.
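For reference, the standard Gaussian MLE results:

$$\hat{\mu} = \frac{1}{N} \sum_{t=1}^{N} x^t, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{t=1}^{N} (x^t - \hat{\mu})^2$$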

SLIDE 16

Bias-variance tradeoff

  • Mean squared error of the estimator: $\mathrm{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta_0)^2\big]$

A biased estimator may achieve better MSE than an unbiased one.

(Figure: the distribution of the estimator $\hat{\theta}$ around its mean $\mathbb{E}[\hat{\theta}]$, against the true value $\theta_0$, illustrating bias and variance.)
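For reference, the standard decomposition behind this tradeoff:

$$\mathrm{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta_0)^2\big] = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta_0\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{\theta} - \mathbb{E}[\hat{\theta}]\big)^2\Big]}_{\text{variance}}$$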

SLIDE 17

Bayes estimator

  • Treat θ as a random variable with prior p(θ)
  • Bayes rule: $p(\theta|X) = \frac{p(X|\theta)\, p(\theta)}{p(X)}$ (posterior = likelihood × prior / evidence)
  • Density estimation at x: $p(x|X) = \int p(x|\theta)\, p(\theta|X)\, d\theta$

SLIDE 18

Bayes estimator

  • Treat θ as a random variable with prior p(θ)
  • Bayes rule: $p(\theta|X) = \frac{p(X|\theta)\, p(\theta)}{p(X)}$
  • Density estimation at x: $p(x|X) = \int p(x|\theta)\, p(\theta|X)\, d\theta$
  • Maximum likelihood estimate (MLE): $\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta p(X|\theta)$
  • Bayes estimate:

?
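For reference, the three estimators side by side (standard definitions):

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta\, p(X|\theta), \qquad \hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta\, p(\theta|X), \qquad \hat{\theta}_{\mathrm{Bayes}} = \mathbb{E}[\theta|X] = \int \theta\, p(\theta|X)\, d\theta$$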

SLIDE 20

Bayes estimator: Normal prior

  • n data points (iid): $x^t \sim \mathcal{N}(\theta, \sigma^2)$; prior: $\theta \sim \mathcal{N}(\mu, \sigma_0^2)$
  • MLE of θ: the sample mean $m = \frac{1}{n} \sum_{t=1}^{n} x^t$

Compute the Bayes estimator of θ. Hint: compute p(θ|X) and show that it follows a normal distribution.

SLIDE 24

Bayes estimator: Normal prior

  • n data points (iid): $x^t \sim \mathcal{N}(\theta, \sigma^2)$; prior: $\theta \sim \mathcal{N}(\mu, \sigma_0^2)$
  • MLE of θ: the sample mean m

Compute the Bayes estimator of θ. p(θ|X) follows a normal distribution with

    – mean $\frac{n/\sigma^2}{n/\sigma^2 + 1/\sigma_0^2}\, m + \frac{1/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}\, \mu$
    – variance $\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}$


SLIDE 26

Bayes estimator: Normal prior

  • n data points (iid)
  • MLE of θ: the sample mean m
  • Bayes estimator: $\hat{\theta}_{\mathrm{Bayes}} = \frac{n/\sigma^2}{n/\sigma^2 + 1/\sigma_0^2}\, m + \frac{1/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}\, \mu$, a weighted average of the sample mean m and the prior mean μ.

SLIDE 27

Bayes estimator: Normal prior

  • n data points (iid)
  • MLE of θ: the sample mean m
  • Bayes estimator: $\hat{\theta}_{\mathrm{Bayes}} = \frac{n/\sigma^2}{n/\sigma^2 + 1/\sigma_0^2}\, m + \frac{1/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}\, \mu$

The weight of the sample mean is large when n is ...? The weight of the prior mean is large when σ is ...?

?

SLIDE 28

Bayes estimator: Normal prior

  • n data points (iid)
  • MLE of θ: the sample mean m
  • Bayes estimator: $\hat{\theta}_{\mathrm{Bayes}} = \frac{n/\sigma^2}{n/\sigma^2 + 1/\sigma_0^2}\, m + \frac{1/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}\, \mu$
  • When n ↗, $\hat{\theta}_{\mathrm{Bayes}}$ gets closer to the sample average (uses information from the sample).
  • When $\sigma_0$ is small, $\hat{\theta}_{\mathrm{Bayes}}$ gets closer to μ (little uncertainty about the prior).
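A minimal numerical sketch of this behavior (not from the slides; all names are illustrative): the posterior mean interpolates between the prior mean and the sample mean, and the sample term wins as n grows.

import numpy as np

# Sketch (not the course's code): Bayes estimator of the mean theta under
# x ~ N(theta, sigma^2) with prior theta ~ N(mu, sigma0^2).
rng = np.random.default_rng(0)

sigma, mu, sigma0 = 1.0, 0.0, 0.5      # known noise sd, prior mean, prior sd
theta_true = 2.0

for n in (1, 10, 1000):
    x = rng.normal(theta_true, sigma, size=n)
    m = x.mean()                        # MLE: the sample mean
    w = (n / sigma**2) / (n / sigma**2 + 1 / sigma0**2)
    theta_bayes = w * m + (1 - w) * mu  # posterior mean: shrinks m toward mu
    print(n, round(m, 3), round(theta_bayes, 3))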

SLIDE 29

Linear regression

SLIDE 30

Linear regression

SLIDE 31

Linear regression: MLE

  • Assume the error is Gaussian distributed.
  • Replace g with its estimator f.

(Figure: the regression line $\mathbb{E}[y|x] = \beta x + \beta_0$, with the conditional density $p(y|x^*)$ centered at $\mathbb{E}[y|x^*]$.)

SLIDE 32

MLE under Gaussian noise

  • Maximize the (log) likelihood (its normalization term is independent of β).

SLIDE 33

MLE under Gaussian noise

  • Maximize the (log) likelihood (its normalization term is independent of β).

?

SLIDE 34

MLE under Gaussian noise

  • Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the sum of squared residuals.
  • Maximize the (log) likelihood (its normalization term is independent of β).
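For reference, the standard computation: with $y^t = \beta^\top x^t + \epsilon^t$ and $\epsilon^t \sim \mathcal{N}(0, \sigma^2)$,

$$\log L(\beta) = \underbrace{-\frac{N}{2} \log(2\pi\sigma^2)}_{\text{independent of } \beta} - \frac{1}{2\sigma^2} \sum_{t=1}^{N} \big(y^t - \beta^\top x^t\big)^2$$

so maximizing over β amounts to minimizing the sum of squared residuals.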

SLIDE 35

Linear regression: least-squares fit

  • Minimize the residual sum of squares: $\mathrm{RSS}(\beta) = \sum_{t=1}^{N} \big(y^t - \beta^\top x^t\big)^2 = \|y - X\beta\|_2^2$
SLIDE 36

Linear regression: least-squares fit

  • Minimize the residual sum of squares.

Historically:

    – Carl Friedrich Gauss (to predict the location of Ceres)
    – Adrien-Marie Legendre

SLIDE 37

Linear regression: least-squares fit

  • Minimize the residual sum of squares.

Estimate β. What condition do you need to verify?

SLIDE 38

Linear regression: least-squares fit

  • Minimize the residual sum of squares.
  • Assuming X has full column rank (and hence $X^\top X$ is invertible): $\hat{\beta} = (X^\top X)^{-1} X^\top y$

SLIDE 39

Linear regression: least-squares fit

  • Minimize the residual sum of squares.
  • Assuming X has full column rank (and hence $X^\top X$ is invertible): $\hat{\beta} = (X^\top X)^{-1} X^\top y$
  • If X is rank-deficient, use a pseudo-inverse.

A pseudo-inverse of A is a matrix G s.t. AGA = A.
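A minimal NumPy sketch of this fit (not the course's code; names are illustrative), using the closed form when $X^\top X$ is invertible and a pseudo-inverse otherwise:

import numpy as np

# Sketch: least-squares fit via the normal equations, with a
# Moore-Penrose pseudo-inverse fallback when X is rank-deficient.
def least_squares(X, y):
    XtX = X.T @ X
    if np.linalg.matrix_rank(XtX) == XtX.shape[0]:  # X has full column rank
        return np.linalg.solve(XtX, X.T @ y)        # beta = (X^T X)^{-1} X^T y
    return np.linalg.pinv(X) @ y                    # pseudo-inverse solution

# Usage: y = 3 x1 - 2 x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)
print(least_squares(X, y))  # close to [3, -2]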

SLIDE 40

Gauss-Markov Theorem

  • Under the assumption that $y = X\beta + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$, the least-squares estimator of β is its (unique) best linear unbiased estimator.

SLIDE 41

Gauss-Markov Theorem

  • Under the assumption that $y = X\beta + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$, the least-squares estimator of β is its (unique) best linear unbiased estimator.
  • Best Linear Unbiased Estimator (BLUE): $\mathrm{Var}(\hat{\beta}) < \mathrm{Var}(\beta^*)$ for any β* that is a linear unbiased estimator of β.


SLIDE 44

Gauss-Markov Theorem

  • Best Linear Unbiased Estimator (BLUE): $\mathrm{Var}(\hat{\beta}) < \mathrm{Var}(\beta^*)$ for any β* that is a linear unbiased estimator of β.
  • Sketch of the argument: write any linear unbiased estimator as $\beta^* = \big((X^\top X)^{-1} X^\top + D\big)\, y$ with $DX = 0$; then $\mathrm{Var}(\beta^*) - \mathrm{Var}(\hat{\beta}) = \sigma^2 D D^\top$, which is psd and minimal for D = 0.

SLIDES 45–48

(Worked proof of the Gauss-Markov theorem; the unbiasedness condition $\mathbb{E}[\beta^*] = \beta$ must hold true for all β.)

SLIDE 49

Correlated variables

  • If the variables are decorrelated:

    – Each coefficient can be estimated separately.
    – Interpretation is easy: “a change of 1 in xj is associated with a change of βj in Y, while everything else stays the same.”

  • Correlations between variables cause problems (see the simulation sketch below):

    – The variance of all coefficients tends to increase.
    – Interpretation is much harder: when xj changes, so does everything else.
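A minimal simulation sketch of the variance inflation (not from the slides; all names are illustrative):

import numpy as np

# Sketch: the spread of least-squares coefficients across repeated samples
# inflates when two features are strongly correlated.
rng = np.random.default_rng(0)

def coef_std(rho, n_rep=500, n=100):
    betas = []
    for _ in range(n_rep):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        X = np.column_stack([x1, x2])
        y = x1 + x2 + rng.normal(size=n)          # true coefficients: (1, 1)
        betas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return np.std(betas, axis=0)                  # spread of each coefficient

print(coef_std(0.0))   # decorrelated: small coefficient variance
print(coef_std(0.99))  # strongly correlated: much larger variance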

SLIDE 50

Logistic regression

SLIDE 51

What about classification?

SLIDE 52

What about classification?

  • Model P(Y=1|x) as a linear function?

?

SLIDE 53

What about classification?

  • Model P(Y=1|x) as a linear function?

    – Problem: P(Y=1|x) must be between 0 and 1.
    – Non-linearity:

      • If P(Y=1|x) is close to 1 or 0, x must change a lot for y to change;
      • If P(Y=1|x) is close to 0.5, that's not the case.

    – Hence: use a logit transformation → logistic regression.

(Figure: the logistic curve, plotting the probability p against f(x).)
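For reference, the logit transformation written out (standard logistic model):

$$\log \frac{P(Y=1|x)}{1 - P(Y=1|x)} = \beta^\top x + \beta_0 \quad\Longleftrightarrow\quad P(Y=1|x) = \frac{1}{1 + e^{-(\beta^\top x + \beta_0)}}$$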

SLIDE 54

Maximum likelihood estimation of logistic regression coefficients

  • Log likelihood for n observations: ?
SLIDE 56

Maximum likelihood estimation of logistic regression coefficients

  • Log likelihood for n observations: $\mathcal{L}(\beta) = \sum_{i=1}^{n} \big[ y^i \log p(x^i) + (1 - y^i) \log(1 - p(x^i)) \big]$, where $p(x) = P(Y=1|x)$
SLIDE 57

Maximum likelihood estimation of logistic regression coefficients

  • Gradient of the log likelihood:

?

SLIDE 59

Maximum likelihood estimation of logistic regression coefficients

  • Gradient of the log likelihood: $\nabla_\beta \mathcal{L} = \sum_{i=1}^{n} \big(y^i - p(x^i)\big)\, x^i$
  • To maximize the likelihood:

    – set the gradient to 0;
    – this cannot be solved analytically;
    – but −L is convex, so we can use gradient descent (no local minima).
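For reference, the resulting update (gradient ascent on L, equivalently gradient descent on −L; the step size η is not fixed by the slides):

$$\beta \leftarrow \beta + \eta \sum_{i=1}^{n} \big(y^i - p(x^i)\big)\, x^i$$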

SLIDE 61

Summary

  • MAP estimate: $\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta p(\theta|X)$
  • MLE: $\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta p(X|\theta)$
  • Bayes estimate: $\hat{\theta}_{\mathrm{Bayes}} = \mathbb{E}[\theta|X]$
  • Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the RSS.
  • Linear regression MLE: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
  • Logistic regression MLE: solve with gradient descent.
SLIDE 62

References

  • A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf

    – Least-squares regression: Chap. 7.6

  • The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/

    – Least-squares regression: Chap. 2.2.1, 3.1, 3.2.1
    – Gauss-Markov theorem: Chap. 3.2.3

SLIDE 63

class GradientDescentOptimizer():
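The body of the class was not captured in the transcript; a minimal sketch of a plain gradient descent optimizer consistent with the name (the implementation details are assumptions):

import numpy as np

# Sketch (not the course's actual code): fixed-step gradient descent.
class GradientDescentOptimizer:
    def __init__(self, lr=0.01, max_iter=1000, tol=1e-6):
        self.lr, self.max_iter, self.tol = lr, max_iter, tol

    def minimize(self, grad, w0):
        """Minimize a function given its gradient `grad`, starting from w0."""
        w = np.asarray(w0, dtype=float)
        for _ in range(self.max_iter):
            g = grad(w)
            if np.linalg.norm(g) < self.tol:  # stop when the gradient vanishes
                break
            w = w - self.lr * g               # step along the negative gradient
        return w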

SLIDE 64

class LeastSquaresRegr()
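A minimal closed-form sketch consistent with the class name (the actual slide code was not captured; details are assumptions):

import numpy as np

# Sketch (not the course's actual code): least-squares regression
# solved in closed form, matching slide 38.
class LeastSquaresRegr:
    def fit(self, X, y):
        Xb = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
        self.beta = np.linalg.lstsq(Xb, y, rcond=None)[0]
        return self

    def predict(self, X):
        Xb = np.column_stack([np.ones(len(X)), X])
        return Xb @ self.beta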

SLIDE 65

class seq_LeastSquaresRegr()
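A sketch under the assumption that the `seq_` prefix means the fit is done sequentially (one gradient step per sample, Widrow-Hoff/LMS-style) rather than in closed form; the class name comes from the slide, the body is guessed:

import numpy as np

# Sketch (assumption: sequential least squares = per-sample gradient steps).
class seq_LeastSquaresRegr:
    def __init__(self, lr=0.01, n_epochs=50):
        self.lr, self.n_epochs = lr, n_epochs

    def fit(self, X, y):
        Xb = np.column_stack([np.ones(len(X)), X])
        self.beta = np.zeros(Xb.shape[1])
        for _ in range(self.n_epochs):
            for xi, yi in zip(Xb, y):
                # per-sample squared-error gradient step (Widrow-Hoff update)
                self.beta += self.lr * (yi - xi @ self.beta) * xi
        return self

    def predict(self, X):
        return np.column_stack([np.ones(len(X)), X]) @ self.beta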


SLIDE 68

class LogisticRegr()
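A minimal sketch consistent with the class name (details are assumptions): logistic regression fit by gradient ascent on the log likelihood, matching the update on slide 59.

import numpy as np

# Sketch (not the course's actual code): logistic regression via
# gradient ascent on the log likelihood.
class LogisticRegr:
    def __init__(self, lr=0.1, max_iter=5000):
        self.lr, self.max_iter = lr, max_iter

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        Xb = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
        beta = np.zeros(Xb.shape[1])
        for _ in range(self.max_iter):
            p = self._sigmoid(Xb @ beta)
            # gradient of the log likelihood: sum_i (y_i - p_i) x_i
            beta += self.lr * Xb.T @ (y - p) / len(y)
        self.beta = beta
        return self

    def predict_proba(self, X):
        return self._sigmoid(np.column_stack([np.ones(len(X)), X]) @ self.beta)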
