Linear Models Overview: Topic Introduction & Justification
J. McNames, Portland State University, ECE 4/557, Ver. 1.27


Notation

  • Data set is of fixed size n
  • Consider the ith set of input-output points in the data set
  • Input variables

– The inputs of the ith set can be collected into a vector xi
– xi is a column vector with p − 1 elements, xi ∈ R^{p−1}
– Thus, there are p − 1 model inputs
– Individual inputs of a vector of inputs will be written as xi,j

  • Notation for vectors

– Sometimes xi will represent a column vector of all the model inputs in the ith set of input-output pairs in the data set
– Other times xi will represent the ith point (scalar) in the input vector x
– This may also be denoted x·,i
– The distinction will be clear from context and from the use of boldface for vectors and matrices


Linear Models Overview

  • Introduction & Model Assumptions
  • Matrix Representation
  • Coefficient Estimation
  • Least Squares
  • Properties
  • Analysis of Variance
  • Performance Coefficients
  • Statistical Inferences
  • Prediction Errors

Notation Continued

  • Output Variables

– Assume a single model output
– This does not reduce the generality
– Could create a separate linear model for each output
– A more computationally efficient method exists (see me if interested)
– yi: output of the ith set of input-output points

  • Will also use ambiguous notation for outputs

– Sometimes y is a vector of all the outputs in the data set
– Other times y is a scalar output due to a single input vector x
– Will try to keep clear by context and use of boldface for vectors and matrices

  • Note that in this set of notes, random variables are not represented with capital letters (e.g., y, ε)


Topic Introduction & Justification

We are discussing linear models for the following reasons:

  • In many practical situations, the assumptions underlying linear models do apply
  • The theory and methods are very rich
  • These simple models often work reasonably well on difficult problems
  • They demonstrate the broad range of information that can be obtained from a model. We would like similar information from nonlinear models.
  • Linear models form the core of many nonlinear models
  • “Many” relationships in data are simple (monotonic)


Where Does ε Come From?

[Block diagram: a process with observed inputs x1, . . . , xp−1 and unobserved inputs z1, . . . , znz produces the output y]

  • In general, y = f(x1, . . . , xp−1, z1, . . . , znz)
  • We only have access to x1, . . . , xp−1
  • For this topic, we are assuming that the output has the following relationship:

        y = ω0 + Σ_{j=1}^{p−1} ωj xj + fz(z1, . . . , znz)

  • Here ε = fz(z1, . . . , znz) accounts for the effect of all these unknown variables
  • If ε is of the form ε = Σ_{j=1}^{nz} αj zj, the central limit theorem applies (approximately, when nz is large and the zj are independent) and ε ∼ N(0, σ²ε)
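A minimal MATLAB sketch (not from the original notes) can illustrate this claim; nz, the weights αj, and the uniform distribution of the zj are arbitrary choices for illustration:

% Sketch: a weighted sum of many independent, non-Gaussian unobserved
% variables looks approximately Gaussian (CLT). All constants are arbitrary.
nz = 50;                 % number of unobserved variables (assumed)
alpha = randn(nz,1);     % arbitrary fixed weights alpha_j
M = 10000;               % number of realizations of epsilon
Z = rand(M,nz) - 0.5;    % zero-mean uniform z's (deliberately non-Gaussian)
e = Z*alpha;             % epsilon = sum_j alpha_j z_j, one per row
hist(e,50);              % histogram is approximately bell-shaped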


Statistical Model for the Data: Assumptions

    yi = ω0 + ( Σ_{j=1}^{p−1} ωj xi,j ) + εi

With linear models, we make the following key assumptions:

  • The output variable is related to the input variables by the linear relationship shown above
  • All of the inputs xi,j are known exactly (non-random)
    – Fortunately, the optimal estimator for the random-inputs case is the same
  • ε is a random error term
    – εi has zero mean: E[εi] = 0
    – ε has constant variance: E[εi²] = E[εk²] = σ²ε for all i, k
    – εi and εk are uncorrelated for i ≠ k
    – We will sometimes assume εi ∼ N(0, σ²ε) ∀i


Linear Model Performance Limits

    y = ω0 + Σ_{j=1}^{p−1} ωj xj + ε        ŷ = w0 + Σ_{j=1}^{p−1} wj xj

  • Even in the best-case scenario wj = ωj, our model will not be perfect: y − E[y; x] = y − ŷ = ε
  • The random component of y is not predictable
  • ε represents the net contribution of the unmeasured effects on the output y
  • ε is unpredictable given the data set and the model inputs
  • Since ε is the only random variable in the expression for y, it is easy to show that var(y) = σ²ε
  • Recall that var(c + X) = var(X) where X is a random variable and c is a constant


Linear Model Comments

    Statistical model:  y = ω0 + ( Σ_{j=1}^{p−1} ωj xj ) + ε
    Regression model:   ŷ = w0 + Σ_{j=1}^{p−1} wj xj

  • The p model parameters (ωj) are unknown
  • Our goal: estimate y given an input vector x and a data set
  • The noise term, ε, is independent of x and therefore unpredictable
  • E[y] = ω0 + Σ_{j=1}^{p−1} ωj xj
  • Our estimate (regression model) will be of the same form
  • If wj = ωj, then ŷ = E[y]
  • E[y] is the optimal model in the sense of minimum MSE
  • Our goal is equivalent to estimating the model parameters ωj


Example 1: MATLAB Code

function [] = LinearSurface();
close all;
FigureSet(1,4.5,2.8);
N = 20;
x = rand(N,1);
y = 2*x - 1 + randn(N,1);
xh = 0:0.01:1;
A = [ones(N,1) x];
b = y;
w = pinv(A)*b;
yh = w(1) + w(2)*xh; % y_hat
yt = -1 + 2*xh;      % y_true
h = plot(x,y,'ko',xh,yh,'b',xh,yt,'r');
set(h(1),'MarkerSize',2);
set(h(1),'MarkerFaceColor','k');
set(h(1),'MarkerEdgeColor','k');
set(h,'LineWidth',1.5);
xlabel('Input x');
ylabel('Output y');
title('1-Dimensional Response Surface Example');
set(gca,'Box','Off');
grid on;
AxisSet(8);
legend('Data Set','Approximation','True Model',2);
print -depsc LinearSurface;


Geometric Interpretation

  • The process parameter ωi represents how much influence the model input xi (scalar) has on the process output y: ∂y/∂xi = ωi
  • This is useful information that many nonlinear models lack
  • Geometrically, ŷ = f(x) represents a hyperplane
  • Also called the response surface
  • For a single input, the response surface is a line

Practical Application

  • Often the assumptions we make will be invalid
  • People (engineers, researchers, etc.) use these models profusely anyway
  • In most ways, linear models are robust
  • They usually still generate reasonable fits even if the assumptions are incorrect
  • Nonlinear models make fewer assumptions, but have other problems
  • There is no perfect method for modeling

Example 1: Linear Surface

[Figure: 1-Dimensional Response Surface Example. Data set, approximation, and true model plotted as output y versus input x.]



Nominal Data & Classification

  • Linear models can be used with nominal data as well as interval data
  • Use a simple transformation
  • Example:

        xi = +1 if male, −1 if female

  • Can also be used for classification
  • For example, if the data set outputs are encoded as follows,

        y = +1 if success, −1 if failure

    then the decision rule for predicted outputs during application of the model could be: If y < 0.5, declare failure. Otherwise, declare success.
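As a minimal sketch (not from the original notes), the encoding and decision rule above might be used as follows; the inputs, targets, and class boundary here are synthetic:

% Sketch: binary classification with a linear model and the decision
% rule above. All data are synthetic.
n = 200;
x = randn(n,2);                   % two interval-valued inputs
y = 2*(x(:,1) + x(:,2) > 0) - 1;  % encoded outputs: +1 (success) or -1
A = [ones(n,1) x];                % data matrix with a bias column
w = A\y;                          % least squares coefficients
yh = A*w;                         % real-valued model outputs
class = ones(n,1);
class(yh < 0.5) = -1;             % declare failure when the output < 0.5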


Process Diagram

[Block diagram: inputs x1, x2, . . . , xp−1 are weighted by ω1, ω2, . . . , ωp−1 and summed together with the bias ω0; the noise ε is then added to produce y]

  • Often pseudo block-diagrams are used to show the flow of computation in the statistical model or regression model
  • The diagram above illustrates how the output of the statistical model is generated
  • Our primary goal is to estimate y
  • An equivalent goal is to estimate the parameters ωi, if the statistical model is accurate


Efficient Nominal Data Encoding

  • Suppose the process output (or input) is one of four colors: red, blue, yellow, or green
  • The best way to encode this is to declare a model output for each category:

        Category    y1    y2    y3    y4
        Red         +1    −1    −1    −1
        Blue        −1    +1    −1    −1
        Yellow      −1    −1    +1    −1
        Green       −1    −1    −1    +1

  • This example requires constructing 4 models
  • For new inputs, the output of the model will not be binary
  • For example, y1 = 0.6, y2 = −1.1, y3 = 0.8, y4 = 0
  • How do you choose what the final nominal output is? (one option is sketched after this list)
  • This yields additional information about the inputs
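A minimal sketch of the four-model encoding; the "largest output wins" decision rule is one common choice and is my addition (the notes leave the rule open), and all data are synthetic:

% Sketch: one linear model per category with the +1/-1 targets above.
n = 400;
x = randn(n,2);                      % synthetic inputs
labels = randi(4,n,1);               % synthetic true categories 1..4
Y = -ones(n,4);                      % one target column per category
Y(sub2ind([n 4],(1:n)',labels)) = 1; % +1 in the true category's column
A = [ones(n,1) x];
W = A\Y;                             % fits all four models at once
[mx,predicted] = max(A*W,[],2);      % decide: largest model output wins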

Model Diagram

[Block diagram: inputs x1, x2, . . . , xp−1 are weighted by w1, w2, . . . , wp−1 and summed together with the bias w0 to produce ŷ]

  • Can draw a similar diagram for the regression model
  • We want wj ≈ ωj so that ŷ ≈ E[y; x]
  • We cannot estimate ε from knowledge of x


Process Representation

The process can also be represented in matrix terms:

    y = Aω + ε

where

    y (n×1) = [y1 y2 · · · yn]ᵀ
    ω (p×1) = [ω0 ω1 · · · ωp−1]ᵀ
    ε (n×1) = [ε1 ε2 · · · εn]ᵀ

    y ∈ R^{n×1}    A ∈ R^{n×p}    ω ∈ R^{p×1}    ε ∈ R^{n×1}


Matrix Representations

  • We can represent the entire data set by a matrix and a vector
  • Convenient

– Compact
– Enables us to use linear algebra
– Elegant
– MATLAB processes matrices and vectors more efficiently than for loops

  • The matrix A will be called the data matrix
  • The vector y will be the output vector

Direct Results

By our assumptions we have

    E[y] = Aω

    var(ε) = diag(σ²ε, σ²ε, . . . , σ²ε) = σ²ε I

    var(y) = var(ε) = σ²ε I


Matrix Representations

    A (n×p) = ⎡ 1  x1,1  x1,2  · · ·  x1,p−1 ⎤
              ⎢ 1  x2,1  x2,2  · · ·  x2,p−1 ⎥
              ⎢ ⋮   ⋮     ⋮     ⋱      ⋮    ⎥
              ⎣ 1  xn,1  xn,2  · · ·  xn,p−1 ⎦

    w (p×1) = [w0 w1 · · · wp−1]ᵀ

Note: the column of 1's accounts for the constant coefficient w0. If we define x·,0 = 1, the notation becomes a little easier.

    A ∈ R^{n×p}    w ∈ R^{p×1}

    ŷ = Aw = [ w0 + Σ_{j=1}^{p−1} wj x1,j,   w0 + Σ_{j=1}^{p−1} wj x2,j,   . . . ,   w0 + Σ_{j=1}^{p−1} wj xn,j ]ᵀ = [ŷ1 ŷ2 · · · ŷn]ᵀ
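A minimal MATLAB sketch of building A and evaluating ŷ = Aw; the sizes and weights below are arbitrary:

% Sketch: data matrix with a leading column of 1's for the constant w0.
n = 5; p = 3;       % arbitrary sizes
X = rand(n,p-1);    % n sets of p-1 inputs, X(i,j) = x_{i,j}
A = [ones(n,1) X];  % A is n-by-p
w = [1; -2; 0.5];   % arbitrary weights [w0; w1; w2]
yh = A*w;           % yh(i) = w0 + sum_j w(j+1)*X(i,j)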



Error Measures Plotted

[Figure: Various Measures of Model Performance. Squared error, absolute error, and root absolute error plotted versus the error y − ŷ.]


Coefficient Estimation

  • In order to use our model, we need to pick values for w
  • The obvious choice is w = ω, but ω is unknown
  • We only have data: A and y
  • Popular approach: pick the coefficients that maximize some measure of model accuracy
  • For this class, we will always use the prediction error

        PE = E[ Σ_{i=1}^{n} p(yi, ŷi) ]

  • Here p(yi, ŷi) is some measure of the error


Error Measures MATLAB Code

function [] = ErrorMeasures();
close all;
e = -3:0.01:3;
FigureSet(1,4.5,2.8);
h = plot(e,e.^2,'b',e,abs(e),'r',e,sqrt(abs(e)),'g');
set(h(3),'Color',[0 0.8 0]);
set(h,'LineWidth',1.5);
set(gca,'Box','Off');
grid on;
axis([-3 3 0 9]);
xlabel('Error: y - y_{hat}');
ylabel('Error Measures');
title('Various Measures of Model Performance');
AxisSet(8);
legend('Squared Error','Absolute Error','Root Absolute Error');
print -depsc ErrorMeasures;


Picking an Error Measure

  • The vast majority of applications use squared error: p(yi, ŷi) = (yi − ŷi)²
  • The average (estimate of the PE) then becomes

        ASE = 1/(n − p) Σ_{i=1}^{n} (yi − ŷi)²  ≈  MSE = E[(yi − ŷi)²]

  • There are many other possible error measures
  • People choose MSE/ASE because
    1. It is the easiest to solve for the optimal solution
    2. If the error terms have a normal distribution, it is also the maximum likelihood solution
  • Squared error also has some disadvantages
    – It gives much more weight to large errors than small errors
    – It is therefore sensitive to outliers



Normal Equations

    e = y − ŷ
    Ee = ‖e‖² = eᵀe
       = (y − Aw)ᵀ(y − Aw)
       = (yᵀ − wᵀAᵀ)(y − Aw)
       = yᵀy − wᵀAᵀy − yᵀAw + wᵀAᵀAw

  • Ee is a nonlinear function of y, A, and w
  • It is a quadratic function of each of these components
  • Ee is the sum of squared errors
  • If we take the average of the squared errors (ASE), we have an estimate of the mean squared error = the estimated power of the error

        P̂e = (1/n) Ee = (1/n) Σ_{i=0}^{n−1} |ei|²


Least Squares Solution

If we use an estimate of the average squared error as our measure of model performance,

    ASE = 1/(n − p) Σ_{i=1}^{n} (yi − ŷi)²
        ∝ Σ_{i=1}^{n} (yi − ŷi)²
        = (y − ŷ)ᵀ(y − ŷ)
        = (y − Aw)ᵀ(y − Aw)
        = (yᵀ − wᵀAᵀ)(y − Aw)
        = yᵀy − wᵀAᵀy − yᵀAw + wᵀAᵀAw


Normal Equation Components

Let us define:

    Average squared error (ASE):    P̂e (1×1) ≜ (1/n)‖e‖² = (1/n) Σ_{i=0}^{n−1} |ei|²
    Average output variance:        P̂y (1×1) ≜ (1/n)‖y‖² = (1/n) Σ_{i=0}^{n−1} |yi|²
    Average correlation:            R̂ (p×p) ≜ (1/n) AᵀA = (1/n) Σ_{i=0}^{n−1} xi xiᵀ
    Average cross-correlation:      d̂ (p×1) ≜ (1/n) Aᵀy = (1/n) Σ_{i=0}^{n−1} xi yi

If we replaced the sample average (1/n) Σ_{i=0}^{n−1} (·) with expectation, E[·], each of these terms would be the mean ensemble value rather than the sample average estimate.


Least Squares Solution and Gradients

The gradient of a function of multiple parameters is just a vector of the partial derivatives of that function with respect to each parameter:

    ∇w f(w) = [∂f(w)/∂w1  ∂f(w)/∂w2  · · ·  ∂f(w)/∂wp]ᵀ

The least squares solution is found by taking the gradient of the ASE and setting it equal to zero:

    ∇w ASE = −2Aᵀy + 2AᵀAw = 0
    (AᵀA)w = Aᵀy
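A minimal sketch verifying the normal equations numerically on synthetic data; pinv is what the course examples use, and backslash is MATLAB's preferred least squares solver:

% Sketch: three equivalent ways to compute the least squares weights.
n = 50;
x = rand(n,1);
y = 2*x - 1 + 0.1*randn(n,1); % synthetic data
A = [ones(n,1) x];
w1 = (A'*A)\(A'*y);           % solve the normal equations directly
w2 = pinv(A)*y;               % pseudo-inverse, as in the examples
w3 = A\y;                     % QR-based solve (numerically preferred)
max(abs([w1-w2; w1-w3]))      % all three agree to roughly machine precision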



Least Squares Properties

    w = (AᵀA)⁻¹Aᵀy

If ε ∼ N(0, σ²ε), then w = ω̂ has the following properties:

  • Unbiased: E[w] = ω
  • Minimum variance among all unbiased estimators: varε{w} ≤ varε{w*} for any other unbiased estimator w*
  • Consistent: lim_{n→∞} Pr{|ω̂ − ω| ≥ ǫ} = 0 for any ǫ > 0
  • Sufficient: the conditional joint probability of the sample observations, given ω̂, does not depend on ω: f(y | ω̂, ω, A) = f(y | ω̂, A)
  • Minimizes the predictive squared loss
  • If ε ∼ N(0, σ²ε), it is the maximum likelihood estimate

  • The nonlinear models that we study later rarely have these properties


Completing the Square

Then

    P̂e = (1/n)(yᵀy − wᵀAᵀy − yᵀAw + wᵀAᵀAw)
       = P̂y − wᵀd̂ − d̂ᵀw + wᵀR̂w

If we assume R̂ > 0, we can complete the square:

    P̂e(w) = P̂y − wᵀd̂ − d̂ᵀw + wᵀR̂w + d̂ᵀR̂⁻¹d̂ − d̂ᵀR̂⁻¹d̂
          = P̂y − d̂ᵀR̂⁻¹d̂ + (w − R̂⁻¹d̂)ᵀ R̂ (w − R̂⁻¹d̂)
          = P̂y − d̂ᵀR̂⁻¹d̂ + (R̂w − d̂)ᵀ R̂⁻¹ (R̂w − d̂)
          = P̂e,min + (R̂w − d̂)ᵀ R̂⁻¹ (R̂w − d̂)


Pseudo-Inverses

    w = (AᵀA)⁻¹Aᵀy

  • The matrix (AᵀA)⁻¹Aᵀ is called the left pseudo-inverse of the matrix A
  • If A is a square matrix (n = p), this is the usual inverse
  • Generally n > p
  • As a rule of thumb, n should be at least 10p
  • In words, you should have 10 times as many points as you have free parameters
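A minimal sketch of the first two bullets; the matrices below are arbitrary:

% Sketch: the left pseudo-inverse reduces to the ordinary inverse for
% square nonsingular A, and equals (A'A)^(-1)A' for tall full-rank A.
As = [2 1; 1 3];                       % square, nonsingular
max(max(abs(pinv(As) - inv(As))))      % essentially zero
At = [ones(6,1) (1:6)'];               % tall matrix, n > p
max(max(abs(pinv(At) - (At'*At)\At'))) % essentially zero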


Relating LSE and MMSE Estimation

    P̂e(w) = P̂y − d̂ᵀR̂⁻¹d̂ + (R̂w − d̂)ᵀ R̂⁻¹ (R̂w − d̂)

  • We arrived at the optimal solution for the weights two different ways
  • The solution is the same:

        w = (AᵀA)⁻¹Aᵀy

  • A very rare case in which the optimization problem has a closed-form solution


Analysis of Variance (ANOVA)

SSTO: Total Sum of Squares. SSR: Regression Sum of Squares. SSE: Error Sum of Squares.

    SSTO = Σ_{i=1}^{n} (yi − ȳ)²  = yᵀy − (1/n) yᵀJy
    SSR  = Σ_{i=1}^{n} (ŷi − ȳ)²  = wᵀAᵀy − (1/n) yᵀJy
    SSE  = Σ_{i=1}^{n} (yi − ŷi)² = eᵀe = yᵀy − wᵀAᵀy = yᵀ(I − H)y

where J is a matrix of 1s of the appropriate dimension and ȳ is the sample average ȳ = (1/n) Σ_{i=1}^{n} yi.


The Hat Matrix

If we let w be the least squares solution, w = (AᵀA)⁻¹Aᵀy, then we can write

    ŷ = Aw = A[(AᵀA)⁻¹Aᵀy] = [A(AᵀA)⁻¹Aᵀ] y = Hy

where H = A(AᵀA)⁻¹Aᵀ.

  • H is called the hat matrix
  • It has a number of useful properties
    – H ∈ R^{n×n}
    – Rank: ≤ p
    – Rank = p if the columns of A are linearly independent
    – Idempotent: HH = H
    – Symmetric: Hᵀ = H
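A minimal sketch checking these properties numerically; the sizes are arbitrary:

% Sketch: form the hat matrix and verify its properties.
n = 20; p = 3;
A = [ones(n,1) rand(n,p-1)]; % columns almost surely independent
H = A*((A'*A)\A');           % H = A (A'A)^(-1) A'
max(max(abs(H*H - H)))       % idempotent: HH = H (essentially zero)
max(max(abs(H' - H)))        % symmetric: H' = H (essentially zero)
rank(H)                      % equals p here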


ANOVA Continued

The deviation of the output from its average can be written as

    yi − ȳ             =  ŷi − ȳ                             +  yi − ŷi
    (total deviation)     (deviation of model from average)     (deviation of process from the model)

Remarkably,

    Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} (yi − ŷi)²
    SSTO = SSR + SSE

In words, the total variation of y is equal to the model variation plus the residual (think error) variation. Nonlinear models do not have this property.


Residuals

The residuals are defined as

    e = y − ŷ = y − Hy = (I − H)y

The variance matrix of the residuals can then be written as

    var{e} (n×n) = σ²ε (I − H) ≈ ASE · (I − H)

The ASE is defined more precisely later.
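A minimal Monte Carlo sketch of this result; the data matrix, true weights, and σε below are arbitrary choices:

% Sketch: the sample covariance of the residuals over repeated noise
% draws approaches sigma^2*(I - H). All constants are arbitrary.
n = 15; sigma = 0.5; M = 20000;
A = [ones(n,1) (1:n)'/n];                  % fixed data matrix
H = A*((A'*A)\A');                         % hat matrix
Y = A*[1; 2]*ones(1,M) + sigma*randn(n,M); % M realizations of y
E = (eye(n) - H)*Y;                        % residuals, one column per draw
Ce = cov(E');                              % n-by-n sample covariance
max(max(abs(Ce - sigma^2*(eye(n)-H))))     % small for large M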



Coefficient of Multiple Determination

    R² = SSR/SSTO = 1 − SSE/SSTO = [Σ (yi − ȳ)(ŷi − ȳ)]² / [Σ (yi − ȳ)² Σ (ŷi − ȳ)²] = σ̂²_{y,ŷ} / (σ̂²_y σ̂²_ŷ)

  • Measures the proportion of total variance reduction associated with the model
  • 0 ≤ R² ≤ 1
  • The first two expressions only work with linear least squares regression models


Further ANOVA Definitions & Ambiguity

  • Regression Mean Square: MSR = SSR/(p − 1)
  • Average Squared Error: ASE = SSE/(n − p)
  • Note: ASE as defined here is just an estimate of the true MSE = E[(y − ŷ)²]
  • It's actually a very good estimate
    – Unbiased: E[ASE] = σ²ε = MSE
    – Minimum variance
  • Note that the denominator is n − p since we use p degrees of freedom to estimate w from the data


Coefficient of Multiple Determination Comments

    R² = SSR/SSTO = 1 − SSE/SSTO = [Σ (yi − ȳ)(ŷi − ȳ)]² / [Σ (yi − ȳ)² Σ (ŷi − ȳ)²] = σ̂²_{y,ŷ} / (σ̂²_y σ̂²_ŷ)

  • Interpretations
    – R² = 1: Perfect fit; all residuals are zero
    – R² = 0: Means w = 0; no discernible linear relationship of the inputs to the output
  • Limitations
    – R² may be an inaccurate indicator if n is close to p
    – R² is a measure of the fit to the data, not the PE
    – R² can only increase as p increases


Measures of Model Performance

Intuitively, the model is accurate if most of the variation is in the regression: SSTO = SSR + SSE.

           Accurate    Inaccurate
    SSR    Large       Small
    SSE    Small       Large

Note that 0 ≤ SSR ≤ SSTO and 0 ≤ SSE ≤ SSTO.



Example 2: MATLAB Code

function [] = PerformanceCoefficients();
close all;
FigureSet(1,'LTX');
n = 50;
p = 2;
x = rand(n,1);
y = 2*x - 1 + 0.5*randn(n,1);
b = y;
A = [ones(n,1) x];
w = pinv(A)*b;
yh = w(1) + w(2)*x; % y_hat
ym = mean(y);
SSTO = sum((y -ym).^2);
SSR  = sum((yh-ym).^2);
SSE  = sum((y -yh).^2);
MSO = SSTO/(n-1);
MSR = SSR/(p-1);
ASE = SSE/(n-p);
fprintf('SSTO   : %7.3f\n',SSTO);
fprintf('SSR    : %7.3f\n',SSR);
fprintf('SSE    : %7.3f\n',SSE);
fprintf('SSR+SSE: %7.3f\n',SSR+SSE);
fprintf('MSR    : %7.3f\n',MSR);
fprintf('ASE    : %7.3f\n',ASE);
fprintf('\n');
R2 = 1 - SSE/SSTO;
Ra = 1 - ASE/MSO;


Alternate Coefficients of Multiple Determination

  • Adjusted Coefficient of Multiple Determination:

        R²a = 1 − [SSE/(n − p)] / [SSTO/(n − 1)]

  • Coefficient of Multiple Correlation:

        R = √R²


R = sqrt(R2);
fprintf('R2     : %7.3f\n',R2);
fprintf('Ra     : %7.3f\n',Ra);
fprintf('R      : %7.3f\n',R );
%disp([1-SSE/SSTO,SSR/SSTO,sum((y-ym).*(yh-ym)).^2/(SSTO*SSR)])
s2 = ASE*inv(A'*A);
alpha = 0.05; % Level of significance
w0l = w(1) - tinv(1-alpha/2,n-p)*sqrt(s2(1,1)); % Lower confidence boundary
w0u = w(1) + tinv(1-alpha/2,n-p)*sqrt(s2(1,1)); % Upper confidence boundary
fprintf('%7.3f <= w0 <= %7.3f Estimate:%7.3f True:%7.3f Significance:%4.2f\n',...
    w0l,w0u,w(1),-1,alpha);
w1l = w(2) - tinv(1-alpha/2,n-p)*sqrt(s2(2,2)); % Lower confidence boundary
w1u = w(2) + tinv(1-alpha/2,n-p)*sqrt(s2(2,2)); % Upper confidence boundary
fprintf('%7.3f <= w1 <= %7.3f Estimate:%7.3f True:%7.3f Significance:%4.2f\n',...
    w1l,w1u,w(2),2,alpha);
alpha = 0.01; % Level of significance
w0l = w(1) - tinv(1-alpha/2,n-p)*sqrt(s2(1,1)); % Lower confidence boundary
w0u = w(1) + tinv(1-alpha/2,n-p)*sqrt(s2(1,1)); % Upper confidence boundary
fprintf('%7.3f <= w0 <= %7.3f Estimate:%7.3f True:%7.3f Significance:%4.2f\n',...
    w0l,w0u,w(1),-1,alpha);
w1l = w(2) - tinv(1-alpha/2,n-p)*sqrt(s2(2,2)); % Lower confidence boundary
w1u = w(2) + tinv(1-alpha/2,n-p)*sqrt(s2(2,2)); % Upper confidence boundary
fprintf('%7.3f <= w1 <= %7.3f Estimate:%7.3f True:%7.3f Significance:%4.2f\n',...
    w1l,w1u,w(2),2,alpha);
xh = 0:0.01:1;
yh = w(1) + w(2)*xh; % y_hat
yt = -1 + 2*xh;      % y_true


Example 2: Performance Coefficients

[Figure: data set, approximation, and true model plotted as output y versus input x; R²: 0.618, R²a: 0.610, R: 0.786]



Model Coefficient Inferences

    E[w] = ω
    var(w) (p×p) = σ²ε (AᵀA)⁻¹
    s²[w] = ASE · (AᵀA)⁻¹ ≈ var(w)

  • The estimated confidence interval for the coefficients is given by:

        wk ± t(1 − α/2; n − p) s[wk]

  • In words, the probability that ωk lies outside this range is α under the normal error model assumptions


h = plot(x,y,'ko',xh,yh,'b',xh,yt,'r');
set(h(1),'MarkerSize',2);
set(h(1),'MarkerFaceColor','k');
set(h(1),'MarkerEdgeColor','k');
set(h,'LineWidth',1.5);
xlabel('Input x');
ylabel('Output y');
st = sprintf('R^2:%5.3f R_a^2:%5.3f R:%5.3f',R2,Ra,R);
title(st);
set(gca,'Box','Off');
grid on;
AxisSet(8);
legend('Data Set','Approximation','True Model',2);
print -depsc PerformanceCoefficients;


Model Coefficient Inference MATLAB Code The following was appended to the previous code


Example 2: Performance Coefficients MATLAB Output

SSTO   :  25.655
SSR    :  15.845
SSE    :   9.810
SSR+SSE:  25.655
MSR    :  15.845
ASE    :   0.204
R2     :   0.618
Ra     :   0.610
R      :   0.786



The Bias Input

  • Our model output is generated by

        ŷ = [1 xᵀ] w = w0 + Σ_{j=1}^{p−1} xj wj

  • This vector notation is awkward
  • Without loss of generality, I will assume that the first element of the input vector is the constant 1: x0 = 1
  • Thus the newly defined input vector x′ can be written as

        x′ = [1 x1 x2 . . . xp−1]

  • I will still use x to denote the (expanded) input vector
  • The difference should be clear from context
  • This means the output of our linear model can be written simply as ŷ = xᵀw


Model Coefficient Inference MATLAB Output

-1.539 <= w0 <= -0.944   Estimate: -1.242   True: -1.000   Significance: 0.05
 1.726 <= w1 <=  2.747   Estimate:  2.237   True:  2.000   Significance: 0.05
-1.638 <= w0 <= -0.845   Estimate: -1.242   True: -1.000   Significance: 0.01
 1.555 <= w1 <=  2.918   Estimate:  2.237   True:  2.000   Significance: 0.01


Expected Predictions

  • Suppose that we have constructed our model using our data set
  • In other words, we have calculated the weight vector w
  • Now suppose that we receive a new input vector x and we wish to predict the process output y
  • The best prediction is ŷ = xᵀw
  • It can be shown that E[ŷ] = E[y] = xᵀω
  • This estimate is unbiased and minimum variance
  • But how good is it?

Model Coefficient Hypothesis Tests

We can also test to see if some inputs are useless.

Hypotheses:
    H0: ωk = 0   Null hypothesis
    H1: ωk ≠ 0   Alternate hypothesis

Test statistic:
    T = wk / s{wk}

Decision rule: If |T| ≤ t(1 − α/2; n − p), accept H0. Otherwise, reject H0.
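A minimal sketch of this test on synthetic data; the extra input x2 (which truly has no effect) is my addition, and the setup otherwise mirrors the earlier examples:

% Sketch: t-test for omega_2 = 0. The output does not depend on x2,
% so H0 should usually be accepted for x2's coefficient.
n = 50; p = 3; alpha = 0.05;
x1 = rand(n,1); x2 = rand(n,1);
y = 2*x1 - 1 + 0.5*randn(n,1);     % x2 has no effect on y
A = [ones(n,1) x1 x2];
w = A\y;
ASE = sum((y - A*w).^2)/(n - p);
s2w = ASE*inv(A'*A);               % s^2[w], as in the earlier examples
T = w(3)/sqrt(s2w(3,3));           % test statistic for the x2 coefficient
abs(T) <= tinv(1 - alpha/2, n - p) % 1 (true) means accept H0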



Example 3: Confidence Intervals MATLAB

function [] = ConfidenceIntervals();
close all;
rand('state',1);
randn('state',3);
N = 50;                    % Number of points
P = 2;                     % Number of parameters
x = rand(N,1);             % Data set inputs
y = 2*x - 1 + randn(N,1);  % Data set outputs
A = [ones(N,1) x];         % A matrix
w = pinv(A)*y;             % Estimated weights
yh = A*w;                  % Model outputs
SSE = sum((y -yh).^2);     % Error sum of squares
ASE = SSE/(N-P);           % Average squared error
s2 = ASE*inv(A'*A);        % Covariance matrix of weight vector
xt = (-1:0.01:2)';         % Test points
NT = length(xt);           % No. test points
At = [ones(NT,1) xt];      % Test matrix
yh = w(1) + w(2)*xt;       % Model output at test points
yt = -1 + 2*xt;            % Process output at test points
sy2 = zeros(NT,1);         % s^2[y_hat]
for cnt = 1:NT,
    sy2(cnt) = At(cnt,:)*s2*At(cnt,:)';
end;
sy = sy2.^(1/2);           % s[y_hat]
sp2 = sy2 + ASE;           % s^2[y_hat + e]
sp = sp2.^(1/2);           % s[y_hat + e]
%========================================================
% Expected Point Prediction Boundaries


Expected Prediction Confidence Intervals

  • Ideally we would like our model to
    – Generate good estimated outputs ŷ for new inputs
    – Give us a confidence interval on the estimate
  • It can be shown that

        var(ŷ) = σ²ε xᵀ(AᵀA)⁻¹x = xᵀ [σ²ε (AᵀA)⁻¹] x = xᵀ var(w) x

  • The estimated variance is then given by

        s²[ŷ] = ASE · xᵀ(AᵀA)⁻¹x = xᵀ s²[w] x

  • The 1 − α confidence limits for E[y] are

        ŷ ± t(1 − α/2; n − p) s[ŷ]


%========================================================
figure;
FigureSet(1,4.5,2.8);
yl95 = yh - tinv(1-0.05/2,N-P)*sy; % Lower 95% confidence interval
yu95 = yh + tinv(1-0.05/2,N-P)*sy; % Upper 95% confidence interval
yl99 = yh - tinv(1-0.01/2,N-P)*sy; % Lower 99% confidence interval
yu99 = yh + tinv(1-0.01/2,N-P)*sy; % Upper 99% confidence interval
h = plot(x,y,'b.',xt,yt,'b',xt,yh,'c',...
    xt,yl95,'g',xt,yl99,'r',...
    xt,yu95,'g',xt,yu99,'r');
set(h(1),'MarkerSize',15);
set(h([4 6]),'Color',[0.0 0.7 0.0]);
set(h,'LineWidth',1.5);
xlabel('Input x');
ylabel('Output y');
title('Confidence Intervals for E[y]');
set(gca,'Box','Off');
grid on;
axis([min(xt) max(xt) -5 5]);
AxisSet(8);
legend('Data Set','True','Model','95% CI','99% CI',2);
print -depsc ExpectedConfidenceInterval;
%========================================================
% Point Prediction Boundaries
%========================================================
figure;
FigureSet(1,4.5,2.8);
yl95 = yh - tinv(1-0.05/2,N-P)*sp; % Lower 95% confidence interval
yu95 = yh + tinv(1-0.05/2,N-P)*sp; % Upper 95% confidence interval
yl99 = yh - tinv(1-0.01/2,N-P)*sp; % Lower 99% confidence interval
yu99 = yh + tinv(1-0.01/2,N-P)*sp; % Upper 99% confidence interval
h = plot(x,y,'b.',xt,yt,'b',xt,yh,'c',...
    xt,yl95,'g',xt,yl99,'r',...
    xt,yu95,'g',xt,yu99,'r');


Expected Prediction Confidence Interval Comments

The 1 − α confidence limits for E[y] are

    ŷ ± t(1 − α/2; n − p) s[ŷ]

  • Recall that

        y = xᵀω + ε        E[y] = xᵀω

  • The above confidence interval is for the expected value of y, not for y itself
  • In words, if all of the normal error model assumptions hold, the expected value of y will fall in the above confidence interval 100(1 − α) percent of the time
  • As α decreases, the confidence interval gets larger


Example 3: Expected Confidence Intervals Plot

[Figure: Confidence Intervals for E[y]. Data set, true model, fitted model, and 95% and 99% confidence bands plotted as output y versus input x.]


set(h(1),'MarkerSize',15);
set(h([4 6]),'Color',[0.0 0.7 0.0]);
set(h,'LineWidth',1.5);
xlabel('Input x');
ylabel('Output y');
title('Confidence Intervals for y');
set(gca,'Box','Off');
grid on;
axis([min(xt) max(xt) -5 5]);
AxisSet(8);
legend('Data Set','True','Model','95% CI','99% CI',2);
print -depsc PointConfidenceInterval;
%========================================================
% Surface Boundaries
%========================================================
figure;
FigureSet(1,4.5,2.8);
W95 = sqrt(P*finv(1-0.05,P,N-P)); % W^2 = p F(1-alpha; p, n-p)
W99 = sqrt(P*finv(1-0.01,P,N-P));
yl95 = yh - W95*sy; % Lower 95% confidence interval
yu95 = yh + W95*sy; % Upper 95% confidence interval
yl99 = yh - W99*sy; % Lower 99% confidence interval
yu99 = yh + W99*sy; % Upper 99% confidence interval
h = plot(x,y,'b.',xt,yt,'b',xt,yh,'c',...
    xt,yl95,'g',xt,yl99,'r',...
    xt,yu95,'g',xt,yu99,'r');
set(h(1),'MarkerSize',15);
set(h([4 6]),'Color',[0.0 0.7 0.0]);
set(h,'LineWidth',1.5);
xlabel('Input x');
ylabel('Output y');
title('Surface Confidence Intervals');
set(gca,'Box','Off');
grid on;
axis([min(xt) max(xt) -5 5]);


Prediction Confidence Intervals

  • We can also approximate the confidence interval for y, rather than E[y]:

        ŷ ± t(1 − α/2; n − p) s[y]    where    s²[y] = ASE + s²[ŷ] = ASE (1 + xᵀ(AᵀA)⁻¹x)

  • In words, if all of the normal error model assumptions hold, the observed values of y will fall in the above confidence interval 100(1 − α) percent of the time
  • This is a wider confidence interval than for E[y] because it accounts for the noise term ε


AxisSet(8);
legend('Data Set','True','Model','95% CI','99% CI',2);
print -depsc SurfaceConfidenceInterval;



Multiple Confidence Intervals Continued 1

  • The distinction between confidence intervals for a single point and a group of points is crucial
  • Consider an experiment in which two fair coins are flipped
  • The following two statements are true (statistically)
    – The probability of receiving a heads on coin A is 50%
    – The probability of receiving a heads on coin B is 50%
  • However, the following statement is NOT true (statistically)
    – The probability of receiving a heads on coin A and a heads on coin B is 50%
  • In fact, we know this probability is 25%
  • This is a common conceptual mistake that has led many to false conclusions


Example 4: Prediction Confidence Intervals Plot

[Figure: Confidence Intervals for y. Data set, true model, fitted model, and 95% and 99% confidence bands plotted as output y versus input x.]


Multiple Confidence Intervals Continued 2

  • Suppose that we have 1000 new inputs x1, x2, . . . , x1000
  • Suppose that we find the individual 95% confidence intervals for each yi:

        yi ∈ ŷi ± t(1 − α/2; n − p) s[yi]

  • It should be apparent to you that, on average, 50 of the observed values of yi will be outside of their confidence intervals
  • Thus, clearly the probability that all 1000 of the observed outputs are within the 1000 specified intervals is dramatically less than 95%
  • Fortunately, for the normal error model there are confidence intervals for multiple predictions (see me if interested)


Multiple Confidence Intervals

  • Suppose we have two new inputs (not in the data set), x1 and x2
  • What is the 1 − α confidence interval for y1 and y2?
  • It is NOT simply the two individual confidence intervals

        y1 ∈ ŷ1 ± t(1 − α/2; n − p) s[y1]
        y2 ∈ ŷ2 ± t(1 − α/2; n − p) s[y2]

  • These are separate confidence intervals
  • We cannot be certain that 100(1 − α) percent of the time, y1 will be in the above confidence interval AND y2 will be in the above confidence interval
  • We can make these claims independently
  • We can't claim they will both be true with 1 − α confidence


Example 5: MATLAB Code (the surface confidence interval code is appended to Example 3's code, shown earlier)


Surface Confidence Interval

    xᵀω ∈ ŷ ± W s[ŷ]  ∀x,    where W² = p F(1 − α; p, n − p)

  • Confidence intervals can also be estimated for the entire regression surface
  • This means the hyperplane defined by xᵀω is within the specified interval with 1 − α confidence
  • These confidence intervals tend to be wider than those for E[y] or for y


Statistical Tests

There are a host of statistical hypothesis tests that can be applied to linear models under the normal error assumption:

  • Are the errors εi normally distributed?
  • Is the error variance σ²ε constant?
  • Is the model appropriate?

        H0: E[ŷ] = xᵀω        H1: E[ŷ] ≠ xᵀω

  • See me if interested

Example 5: Surface Confidence Intervals Plot

[Figure: Surface Confidence Intervals. Data set, true model, fitted model, and 95% and 99% confidence bands plotted as output y versus input x.]



Example 6: MATLAB Code

function [] = ScatterPlotEx();
close all;
rand('state',1);
randn('state',3);
N = 100;                            % Number of points
x1 = randn(N,1);                    % First input
x2 = 5*rand(N,1);                   % Second input
x3 = x1+1*x2+2*randn(N,1);          % Third input
x4 = x1-2*x2-3*x3+3*rand(N,1);      % Fourth input
y = 3 + 2*x1 - 3*x2 + 5*randn(N,1); % Output
D = [x1 x2 x3 x4 y];                % Data matrix
FigureSet(1,'LTX');
[h,ha] = plotmatrix(D);
set(h,'MarkerSize',2);
set(get(ha(5,1),'XLabel'),'String','x_1')
set(get(ha(5,2),'XLabel'),'String','x_2')
set(get(ha(5,3),'XLabel'),'String','x_3')
set(get(ha(5,4),'XLabel'),'String','x_4')
set(get(ha(5,5),'XLabel'),'String','y ')
set(get(ha(1,1),'YLabel'),'String','x_1')
set(get(ha(2,1),'YLabel'),'String','x_2')
set(get(ha(3,1),'YLabel'),'String','x_3')
set(get(ha(4,1),'YLabel'),'String','x_4')
set(get(ha(5,1),'YLabel'),'String','y ')
AxisSet(8);
print -depsc ScatterPlot;


Scatter Plots

  • Scatter plots of a data set (both inputs and outputs) can be useful in identifying relationships
  • These are easy to generate in MATLAB
  • Use the plotmatrix command

Scaling Problems: Correlation Transformation

  • Numerically, you can run into problems if you're not careful
  • This normally occurs when you try to calculate the inverse of AᵀA
  • If the inputs are scaled very differently, it may help to apply a correlation transformation:

        ´xi = (1/√(n − 1)) (xi − x̄i)/sxi        ´y = (1/√(n − 1)) (y − ȳ)/sy

  • Then it can be easily shown that the correlation matrix and cross-correlation vector are given by

        rxx ((p−1)×(p−1)) = ´Aᵀ´A        rxy ((p−1)×1) = ´Aᵀ´y

  • When using transformed variables, there is no need to include a constant term
  • All sample averages are zero, so the bias term is equal to zero

Example 6: Scatter Plot

[Figure: 5×5 scatter-plot matrix of x1, x2, x3, x4, and y]



Final Remarks

  • The statistical theory supporting linear models is very rich
  • For well-behaved data sets, the “optimal” least squares solution works very well
  • In this case we can form confidence intervals and answer many useful and interesting questions
  • For data sets with “useless” inputs or highly correlated inputs, the least squares solution can be very inaccurate
  • In this case, biased solutions can be much better
  • We cannot calculate confidence intervals for the biased solutions
  • We can efficiently estimate the prediction error, but this is much less information than for the least squares solution
  • This was only an introduction to linear modeling
  • Most books on the topic will contain much more information

Example 7: Correlation Matrices

Rxx =
    1.0000    0.1829    0.3357   -0.2219
    0.1829    1.0000    0.5585   -0.7269
    0.3357    0.5585    1.0000   -0.9646
   -0.2219   -0.7269   -0.9646    1.0000

Rxy =
    0.3164
   -0.6085
   -0.1641
    0.3394


Example 7: MATLAB Code

function [] = CorrelationMatrices();
close all;
rand('state',1);
randn('state',3);
N = 100;                            % Number of points
x1 = randn(N,1);                    % First input
x2 = 5*rand(N,1);                   % Second input
x3 = x1+1*x2+2*randn(N,1);          % Third input
x4 = x1-2*x2-3*x3+3*rand(N,1);      % Fourth input
y = 3 + 2*x1 - 3*x2 + 5*randn(N,1); % Output
A = [x1 x2 x3 x4];                  % Data matrix
An = zeros(size(A));                % Normalized data matrix
for cnt = 1:size(A,2)
    x = A(:,cnt);
    x = (x - mean(x))/(sqrt(N-1)*std(x));
    An(:,cnt) = x;
end;
yn = (y - mean(y))/(sqrt(N-1)*std(y));
Rxx = An'*An
Rxy = An'*yn
