Introduction to Machine Learning: Linear Regression Models



SLIDE 1

Introduction to Machine Learning Linear Regression Models

[Figure: example line with intercept θ0 = 1 and slope θ1 = 0.5 (moving 1 unit in x increases y by 0.5); axes x and y.]

Learning goals

• Know the hypothesis space of the linear model
• Understand the risk function that follows with L2 loss
• Understand how optimization works for the linear model
• Understand how outliers affect the estimated model differently when using L1 or L2 loss

SLIDE 2

LINEAR REGRESSION: HYPOTHESIS SPACE

We want to predict a numerical target variable by a linear transformation of the features x ∈ ℝᵖ. So with θ ∈ ℝᵖ, this mapping can be written as:

$$y = f(x) = \theta_0 + \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_p x_p$$

This defines the hypothesis space H as the set of all linear functions in θ:

$$\mathcal{H} = \left\{\theta_0 + \theta^T x \mid (\theta_0, \theta) \in \mathbb{R}^{p+1}\right\}$$
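A minimal sketch of one such hypothesis in NumPy (not from the slides; the toy feature vector and parameter values here are made up):

```python
import numpy as np

def f(x, theta0, theta):
    """Linear model f(x) = theta0 + theta^T x for a feature vector x in R^p."""
    return theta0 + theta @ x

# One element of H for p = 2: intercept 1, slopes (0.5, -0.2)
x = np.array([2.0, 3.0])
print(f(x, theta0=1.0, theta=np.array([0.5, -0.2])))  # 1 + 1.0 - 0.6 = 1.4
```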

SLIDE 3

LINEAR REGRESSION: HYPOTHESIS SPACE

[Figure: the same example line with intercept θ0 = 1 and slope θ1 = 0.5 (1 unit in x ⇒ 0.5 units in y); axes x and y.]

$$y = \theta_0 + \theta \cdot x$$

SLIDE 4

LINEAR REGRESSION: HYPOTHESIS SPACE

Given observed labeled data D, how do we find (θ0, θ)? This is called learning or parameter estimation; the learner does exactly this via empirical risk minimization. NB: From now on, we assume that θ0 is included in θ.

SLIDE 5

LINEAR REGRESSION: RISK

We could measure training error as the sum of squared prediction errors (SSE). This is the risk that corresponds to L2 loss:

$$\mathcal{R}_{emp}(\theta) = \text{SSE}(\theta) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) = \sum_{i=1}^n \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

[Figure: data points with a regression line; axes x and y.]

Minimizing the squared error is computationally much simpler than minimizing the absolute differences (L1 loss).
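As a sketch (assuming NumPy, with θ0 folded into θ via a leading column of ones, as the later slides do), the SSE risk of a candidate θ can be computed directly:

```python
import numpy as np

def sse(theta, X, y):
    """Empirical risk with L2 loss: sum of squared residuals y - X theta."""
    residuals = y - X @ theta
    return np.sum(residuals ** 2)

# Toy data: 4 points; the first design-matrix column is the intercept
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([1.5, 2.0, 2.5, 3.0])
print(sse(np.array([1.0, 0.5]), X, y))  # 0.0: theta = (1, 0.5) fits exactly
```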

SLIDE 6

LINEAR MODEL: OPTIMIZATION

We want to find the parameters θ of the linear model, i.e., an element of the hypothesis space H that fits the data optimally. So we evaluate different candidates for θ. A first (random) try yields a rather large SSE (Evaluation):

[Figure: candidate line with θ = (1.8, 0.3); SSE: 16.85.]

SLIDE 7

LINEAR MODEL: OPTIMIZATION

We want to find the parameters θ of the linear model, i.e., an element of the hypothesis space H that fits the data optimally. So we evaluate different candidates for θ. Another line yields an even bigger SSE (Evaluation). Therefore, this line is even worse in terms of empirical risk.

[Figure: candidate lines with θ = (1.8, 0.3), SSE: 16.85, and θ = (1, 0.1), SSE: 24.3.]

SLIDE 8

LINEAR MODEL: OPTIMIZATION

We want to find the parameters θ of the linear model, i.e., an element of the hypothesis space H that fits the data optimally. So we evaluate different candidates for θ. Another line yields an even bigger SSE (Evaluation). Therefore, this line is even worse in terms of empirical risk. Let's try again:

[Figure: candidate lines with θ = (1.8, 0.3), SSE: 16.85; θ = (1, 0.1), SSE: 24.3; θ = (0.5, 0.8), SSE: 10.61.]

SLIDE 9

LINEAR MODEL: OPTIMIZATION

Since every θ results in a specific value of Remp(θ), and we try to find arg min_θ Remp(θ), let's look at what we have so far:

[Figure: the three candidate lines with their SSE values (16.85, 24.3, 10.61), shown as points on the SSE surface over intercept ∈ (−2, 2) and slope ∈ (0, 1.5).]
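This guess-and-evaluate loop is easy to replicate; a sketch using the slides' candidate (intercept, slope) pairs on made-up toy data (the slides' dataset is not given, so the printed SSE values will differ):

```python
import numpy as np

# Made-up toy data, roughly matching the slides' setting
rng = np.random.default_rng(0)
x = rng.uniform(0, 8, size=20)
y = -1.7 + 1.3 * x + rng.normal(scale=0.5, size=20)
X = np.column_stack([np.ones_like(x), x])

# The three candidates (intercept, slope) evaluated so far
for theta in [np.array([1.8, 0.3]), np.array([1.0, 0.1]), np.array([0.5, 0.8])]:
    print(theta, "SSE:", round(np.sum((y - X @ theta) ** 2), 2))
```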

SLIDE 10

LINEAR MODEL: OPTIMIZATION

Instead of guessing, we use optimization to find the best θ:

[Figure: SSE surface over intercept ∈ (−2, 2) and slope ∈ (0, 1.5).]

SLIDE 12

LINEAR MODEL: OPTIMIZATION

Instead of guessing, we use optimization to find the best θ:

[Figure: four candidate lines with θ = (1.8, 0.3), SSE: 16.85; θ = (1, 0.1), SSE: 24.3; θ = (0.5, 0.8), SSE: 10.61; θ = (−1.7, 1.3), SSE: 5.88, shown as points on the SSE surface.]
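One generic way to do this numerically is gradient descent on the SSE; a minimal sketch, not the slides' method (the data is made up, and the step size is derived from the data rather than taken from the slides):

```python
import numpy as np

# Same made-up toy data as above
rng = np.random.default_rng(0)
x = rng.uniform(0, 8, size=20)
y = -1.7 + 1.3 * x + rng.normal(scale=0.5, size=20)
X = np.column_stack([np.ones_like(x), x])

# Gradient of R_emp(theta) = ||y - X theta||^2 is -2 X^T (y - X theta);
# a step size of 1 / (2 * lambda_max(X^T X)) keeps the iteration stable.
lr = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())
theta = np.zeros(2)
for _ in range(5000):
    theta -= lr * (-2.0 * X.T @ (y - X @ theta))
print(theta)  # converges to the SSE-minimizing (intercept, slope)
```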

SLIDE 13

LINEAR MODEL: OPTIMIZATION

For L2 regression, we can find this optimal value analytically:

$$\hat{\theta} = \arg\min_\theta \mathcal{R}_{emp}(\theta) = \arg\min_\theta \sum_{i=1}^n \left(y^{(i)} - \theta^T x^{(i)}\right)^2 = \arg\min_\theta \|y - X\theta\|_2^2$$

where

$$X = \begin{pmatrix} 1 & x_1^{(1)} & \dots & x_p^{(1)} \\ 1 & x_1^{(2)} & \dots & x_p^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(n)} & \dots & x_p^{(n)} \end{pmatrix}$$

is the n × (p + 1) design matrix.

This yields the so-called normal equations for the LM:

$$\frac{\partial}{\partial \theta} \mathcal{R}_{emp}(\theta) = 0 \implies \hat{\theta} = \left(X^T X\right)^{-1} X^T y$$
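A sketch of this closed-form solution in NumPy on made-up toy data; `solve` implements the formula literally, while `lstsq` is the numerically safer equivalent (it avoids forming X^T X):

```python
import numpy as np

# Made-up toy data
rng = np.random.default_rng(1)
x = rng.uniform(0, 8, size=50)
y = -1.7 + 1.3 * x + rng.normal(scale=0.5, size=50)
X = np.column_stack([np.ones_like(x), x])   # n x (p+1) design matrix

# Normal equations: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Numerically preferable equivalent
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat, theta_lstsq)               # both near (-1.7, 1.3)
```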

SLIDE 14

EXAMPLE: REGRESSION WITH L1 VS L2 LOSS

We could also minimize the L1 loss. This changes the risk and the optimization steps:

$$\mathcal{R}_{emp}(\theta) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) = \sum_{i=1}^n \left|y^{(i)} - \theta^T x^{(i)}\right| \qquad \text{(Risk)}$$

[Figure: L1 loss surface (sum of absolute errors, 5–20) and L2 loss surface (SSE, 20–100) over intercept ∈ (−2, 2) and slope ∈ (0, 1.5).]

L1 loss is harder to optimize, but the model is less sensitive to outliers.
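Since the L1 risk has no closed-form minimizer, one option is a derivative-free numerical optimizer; a sketch using `scipy.optimize.minimize` with Nelder–Mead on made-up data (the L1 risk is not differentiable wherever a residual is zero, so a gradient-free method is a safe choice):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up toy data
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 1.0 + 0.8 * x + rng.normal(scale=0.4, size=40)
X = np.column_stack([np.ones_like(x), x])

def l1_risk(theta):
    """Empirical risk with L1 loss: sum of absolute residuals."""
    return np.sum(np.abs(y - X @ theta))

res = minimize(l1_risk, x0=np.zeros(2), method="Nelder-Mead")
print(res.x)  # L1-fitted (intercept, slope), near (1.0, 0.8)
```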

SLIDE 15

EXAMPLE: REGRESSION WITH L1 VS L2 LOSS

[Figure: L1 and L2 regression fits on data without an outlier; axes x1 and y. Caption: L1 vs L2 Without Outlier.]

SLIDE 16

EXAMPLE: REGRESSION WITH L1 VS L2 LOSS

Adding an outlier (highlighted red) pulls the line fitted with L2 toward the outlier:

[Figure: L1 and L2 regression fits on data with an outlier; axes x1 and y. Caption: L1 vs L2 With Outlier.]
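This effect can be checked numerically; a sketch on made-up data with one corrupted point appended, comparing the fitted slopes under the two losses:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.append(rng.uniform(0, 10, size=30), 9.0)               # last point: outlier x
y = np.append(1.0 + 0.8 * x[:-1] + rng.normal(scale=0.3, size=30), 100.0)
X = np.column_stack([np.ones_like(x), x])

theta_l2, *_ = np.linalg.lstsq(X, y, rcond=None)              # L2 (least squares) fit
theta_l1 = minimize(lambda t: np.sum(np.abs(y - X @ t)),
                    x0=theta_l2, method="Nelder-Mead").x      # L1 fit
print("L2 slope:", theta_l2[1])  # pulled up by the single outlier
print("L1 slope:", theta_l1[1])  # stays close to the true 0.8
```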

SLIDE 17

LINEAR REGRESSION

Hypothesis Space: Linear functions x^T θ of the features x ∈ X.
Risk: Any regression loss function.
Optimization: Direct analytical solution for L2 loss; numerical optimization for L1 and others.
