Analysis of variance and regression November 22, 2007 - - PowerPoint PPT Presentation

analysis of variance and regression november 22 2007
SMART_READER_LITE
LIVE PREVIEW

Analysis of variance and regression November 22, 2007 - - PowerPoint PPT Presentation

Analysis of variance and regression November 22, 2007 Parametrisations : Choice of parameters Comparison of models Test for linearity Linear splines Lene Theil Skovgaard, Dept. of Biostatistics, Institute of Public Health,


slide-1
SLIDE 1

Analysis of variance and regression November 22, 2007

slide-2
SLIDE 2

Parametrisations:

  • Choice of parameters
  • Comparison of models
  • Test for linearity
  • Linear splines
slide-3
SLIDE 3

Lene Theil Skovgaard,

  • Dept. of Biostatistics,

Institute of Public Health, University of Copenhagen e-mail: L.T.Skovgaard@biostat.ku.dk http://staff.pubhealth.ku.dk/~lts/regression07_2

slide-4
SLIDE 4

Parametrisations, November 2007 1

Parameter: unknown quantity that we want to estimate (provide a good guess)

  • the decrease in blood pressure following treatment A
  • r the difference in decrease for treatment A and placebo
  • the increase in insulin growth factor (IGF-1) with age

Parametrisation: choice of which parameters are to enter the model description Re-parametrisation: shift to a new set of parameters

slide-5
SLIDE 5

Parametrisations, November 2007 2

Most well known choice of parametrisation:

  • Change of scale/units

Do we measure height in cm or m? Take the relation of lung capacity versus height: fev1= α + β× height If we change from measuring height in cm to m, we also change the regression coefficient (the parameter) from β to β∗ = 100β

  • Change of origin/intercept

– choice of another reference group in ANOVA – subtracting e.g. 170cm from all height measurements Re-parametrisations do not change the model as such!

  • same fitted values
  • same confidence- and prediction limits
  • – but a possibility for interpretations of specific interest
slide-6
SLIDE 6

Parametrisations, November 2007 3

What makes us choose a specific parametrisation?

  • Ease
  • the program has some default parametrisations
  • Estimation of specific quantities:
  • the potency of a drug, ED50 or ED90
  • Test of specific hypotheses
  • difference between treatment and placebo
  • difference in height for boys and girls at the age of 14
slide-7
SLIDE 7

Parametrisations, November 2007 4

In the more advanced situations (beyond linearity) – non-linear regression, logistic regression, correlated observations:

  • Knowledge of distributional assumptions:
  • Some parameter estimates may be more normally distributed

than others (and we like to be able to construct symmetric confidence intervals, using the standard error) In linear models the estimates have exact normal distributions (provided the model assumptions are met, of course...)

slide-8
SLIDE 8

Parametrisations, November 2007 5

Example: A group consisting of 45 patients with Reumatoid Arthritis are randomised to one out of 6 possible treatments (treat):

  • Placebo
  • Aspirin
  • One of 4 doses (dose) of an active anti-inflammatory drug

which we shall denote X. Outcome: An index (Index) summing up the effectiveness of the treatment (decrease in various symptoms)

slide-9
SLIDE 9

Parametrisations, November 2007 6

Outcome: Index-values: Reference: Woolson, R.F. & Clarke, W.R.: Statistical methods for the analysis of biomedical data. 2ed., Wiley, 2002. (Exercise 10.4 page 409)

slide-10
SLIDE 10

Parametrisations, November 2007 7

How do we represent these data in SAS?

Obs group type dose index 1 placebo placebo 6.2 2 placebo placebo 5.8 3 placebo placebo 9.5 4 placebo placebo 10.2 5 placebo placebo 8.3 6 placebo placebo 7.9 7 placebo placebo 9.2 38 x20 active 20 29.5 39 x20 active 20 34.6 40 x20 active 20 31.9 41 x25 active 25 41.8 42 x25 active 25 45.2 43 x25 active 25 43.2 44 x25 active 25 46.5 45 x25 active 25 41.7

slide-11
SLIDE 11

Parametrisations, November 2007 8

Summary statistics

The MEANS Procedure Analysis Variable : index N group Obs N Mean Std Dev Minimum Maximum

  • aspirin

11 11 23.2545455 2.9561338 17.1000000 26.6000000 placebo 9 9 8.6222222 1.8369661 5.8000000 11.6000000 x10 5 5 5.9600000 0.7635444 5.2000000 6.9000000 x15 9 9 17.9444444 1.0607911 16.4000000 19.5000000 x20 6 6 33.1333333 3.0051068 29.5000000 37.2000000 x25 5 5 43.6800000 2.1182540 41.7000000 46.5000000

slide-12
SLIDE 12

Parametrisations, November 2007 9

We start by looking at the 4 X-groups only: Below, the outcome Index is plotted against Dose group.

slide-13
SLIDE 13

Parametrisations, November 2007 10

Comparison of 4 dose groups: One-way ANOVA Model written as a multiple regression: Y = β0 + β1x1 + β2x2 + β3x3 + ǫ where the x’s are socalled ”dummy”variables: x1 is 1 if subject i belongs to the first group, and 0 otherwise x2 is 1 if subject i belongs to the second group, and 0 otherwise x3 is 1 if subject i belongs to the third group, and 0 otherwise With this parametrisation, β0 will correspond to the level for the last group (the reference group, here group 4); β1 will be the difference in level between group 1 and group 4 β2 will be the difference in level between group 2 and group 4 and so on...

slide-14
SLIDE 14

Parametrisations, November 2007 11

Traditional One-way ANOVA in SAS: proc glm data=drug; where type=’active’; class group; model index=group / solution; run; which yields the output:

The GLM Procedure Class Level Information Class Levels Values group 4 x10 x15 x20 x25 Number of Observations Used 25

slide-15
SLIDE 15

Parametrisations, November 2007 12 The GLM Procedure Dependent Variable: index Sum of Source DF Squares Mean Square F Value Pr > F Model 3 4391.364444 1463.788148 412.97 <.0001 Error 21 74.435556 3.544550 Corrected Total 24 4465.800000 R-Square Coeff Var Root MSE index Mean 0.983332 7.734994 1.882698 24.34000 Source DF Type I SS Mean Square F Value Pr > F group 3 4391.364444 1463.788148 412.97 <.0001 Source DF Type III SS Mean Square F Value Pr > F group 3 4391.364444 1463.788148 412.97 <.0001

slide-16
SLIDE 16

Parametrisations, November 2007 13 Standard Parameter Estimate Error t Value Pr > |t| Intercept 43.68000000 B 0.84196796 51.88 <.0001 group x10

  • 37.72000000 B

1.19072251

  • 31.68

<.0001 group x15

  • 25.73555556 B

1.05011855

  • 24.51

<.0001 group x20

  • 10.54666667 B

1.14003001

  • 9.25

<.0001 group x25 0.00000000 B . . . NOTE: The X’X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter ’B’ are not uniquely estimable.

The ’B’ to the right of the estimates is explained in the NOTE It simply means: By renaming the group levels/names, we may get a different parametrisation!

slide-17
SLIDE 17

Parametrisations, November 2007 14

We here disregard the problem of variance heterogeneity:

proc glm data=drug; where type=’active’; class group; model index=group / noint solution; means group / hovtest=levene; run;

from which we get

Levene’s Test for Homogeneity of index Variance ANOVA of Squared Deviations from Group Means Sum of Mean Source DF Squares Square F Value Pr > F group 3 192.7 64.2320 5.92 0.0043 Error 21 228.0 10.8585

A clear indication that the variance increases with dose. Logarithms?

slide-18
SLIDE 18

Parametrisations, November 2007 15

Same model, now parametrised with one level for each group: proc glm data=drug; where type=’active’; class group; model index=group / noint solution; run; now yielding instead:

Dependent Variable: index Sum of Source DF Squares Mean Square F Value Pr > F Model 4 19202.25444 4800.56361 1354.35 <.0001 Error 21 74.43556 3.54455 Uncorrected Total 25 19276.69000

slide-19
SLIDE 19

Parametrisations, November 2007 16 R-Square Coeff Var Root MSE index Mean 0.983332 7.734994 1.882698 24.34000 Source DF Type I SS Mean Square F Value Pr > F group 4 19202.25444 4800.56361 1354.35 <.0001 Source DF Type III SS Mean Square F Value Pr > F group 4 19202.25444 4800.56361 1354.35 <.0001 Standard Parameter Estimate Error t Value Pr > |t| group x10 5.96000000 0.84196796 7.08 <.0001 group x15 17.94444444 0.62756587 28.59 <.0001 group x20 33.13333333 0.76860808 43.11 <.0001 group x25 43.68000000 0.84196796 51.88 <.0001

The tests now refer to the hypothesis of a zero level (which is not interesting)

slide-20
SLIDE 20

Parametrisations, November 2007 17

Parametrisations in One-way ANOVA

  • One level (µ4) for the reference group

(the last, numerically or alphabetically), supplemented with differences from this reference group to each

  • f the remaining groups (β1, β2, β3)

Ygi = µ4 + βg + εgi, – good for testing of identity and certain pairwise comparisons βi = µi − µ4

  • One level for each group

Ygi = µg + εgi – good for estimation, not suited for testing!!

slide-21
SLIDE 21

Parametrisations, November 2007 18

Estimate statements in GLM If we want to compare dose 10 with dose 15:

proc glm data=drug; where type=’active’; class group; model index=group / noint solution; estimate ’dose 15 vs. dose 10’ group -1 1 0 0; run; from which we get Standard Parameter Estimate Error t Value Pr > |t| dose 15 vs. dose 10 11.9844444 1.05011855 11.41 <.0001

slide-22
SLIDE 22

Parametrisations, November 2007 19

We return to the scatter plot, now with a linear regression line Can we use a simple model, saying that the dose effect is linear?

slide-23
SLIDE 23

Parametrisations, November 2007 20

We look at the simple linear regression model: Ygi = α + βXgi + εgi, εgi ∼ N(0, σ2) where Ygi denotes Index, Xgi denotes Dose, α is the intercept (with the vertical axis), β is the slope of the line and εgi is the distance from observation to line, measured vertically (the residual), for individual i in drug group g Program statements: proc reg data=drug; where type=’active’; model index=dose / clb; run;

slide-24
SLIDE 24

Parametrisations, November 2007 21

The output becomes:

The REG Procedure Dependent Variable: index Number of Observations Used 25 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 1 4368.67598 4368.67598 1034.55 <.0001 Error 23 97.12402 4.22278 Corrected Total 24 4465.80000 Root MSE 2.05494 R-Square 0.9783 Dependent Mean 24.34000 Adj R-Sq 0.9773 Coeff Var 8.44265

slide-25
SLIDE 25

Parametrisations, November 2007 22 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1

  • 20.11437

1.44191

  • 13.95

<.0001 dose 1 2.58456 0.08035 32.16 <.0001 Parameter Estimates Variable DF 95% Confidence Limits Intercept 1

  • 23.09719
  • 17.13155

dose 1 2.41833 2.75078

  • Is this a reasonable description?
  • Can we test the linearity?
  • Is this model ’almost as good as’

– a quadratic model – the ANOVA model

slide-26
SLIDE 26

Parametrisations, November 2007 23

Quadratic fit:

Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1

  • 22.91640

5.05646

  • 4.53

0.0002 dose 1 2.93221 0.60603 4.84 <.0001 dose2 1

  • 0.00987

0.01705

  • 0.58

0.5685

slide-27
SLIDE 27

Parametrisations, November 2007 24

Model reduction

  • F test for comparison of explained variation

We have to compare two models: Model 1 The ANOVA model with 4 separate groups Model 2 The linear regression model with actual dose as covariate Note: The models have to be “nested”, i.e. one is derived from the

  • ther, typically be demanding some parameters to be equal to each
  • ther or equal to zero (e.g. “remove effects”).

Indeed, the linear regression model is a special case of the ANOVA model, in which all contrast (µ2 − µ1, µ3 − µ2, µ4 − µ3) are equal.

slide-28
SLIDE 28

Parametrisations, November 2007 25

Sums of squares Model Sum Sq(Model):

i(ˆ

yi − ¯ y)2 Explained variation: How much do the predicted values vary? (the bigger the better) Residuals Sum Sq(Residual):

i(yi − ˆ

yi)2 Unexplained variation (residual variation): How large are the deviations from the model? (the smaller the better)

slide-29
SLIDE 29

Parametrisations, November 2007 26

Look at changes in the explained variation Sum Sq(Model). How much less is explained by the simpler model? ∆Sum Sq = Sum Sq(Model1) − Sum Sq(Model2) > 0, always (since more parameters can always explain more of the variation).

slide-30
SLIDE 30

Parametrisations, November 2007 27

∆Mean Sq = ∆Sum Sq/∆df F = ∆Mean Sq Mean Sq(Residual) How large can this get, just by chance/coincidence? F ∼ F(d f2 − d f1, d f1) where d f1 and d f2 denote the degrees of freedom for the Mean Sq(Residual) for the two models.

slide-31
SLIDE 31

Parametrisations, November 2007 28

Direct calculation of F-test statistic: Sum Sq(Model) for ANOVA model: 4391.36444, with 3 df Mean Sq(Residual) for ANOVA model: 3.54455, with 21 df (d f1 = 21) Sum Sq(Model) for linear model: 4368.67598, with 1 df > ((4391.36444-4368.67598)/2)/3.54455 [1] 3.200471 3.20 ∼ F(2, 21) corresponds to a P-value of 0.06 If we have many groups, the test is not very powerful

slide-32
SLIDE 32

Parametrisations, November 2007 29

We may trick SAS into performing our comparison directly by including dose simultaneously as a linear effect and as a class effect:

proc glm data=drug; where type=’active’; class group; model index=group dose / solution; run;

In this way, the term group represents the variation of dose group means around a straight line

slide-33
SLIDE 33

Parametrisations, November 2007 30

The output becomes:

Dependent Variable: index Sum of Source DF Squares Mean Square F Value Pr > F Model 3 4391.364444 1463.788148 412.97 <.0001 Error 21 74.435556 3.544550 Corrected Total 24 4465.800000 Source DF Type I SS Mean Square F Value Pr > F dose 1 4368.675979 4368.675979 1232.51 <.0001 group 2 22.688466 11.344233 3.20 0.0612

slide-34
SLIDE 34

Parametrisations, November 2007 31 Source DF Type III SS Mean Square F Value Pr > F dose 0.00000000 . . . group 2 22.68846585 11.34423293 3.20 0.0612 Standard Parameter Estimate Error t Value Pr > |t| Intercept

  • 9.053333333 B

5.10994328

  • 1.77

0.0910 dose 2.109333333 B 0.22800600 9.25 <.0001 group x10

  • 6.080000000 B

2.97680629

  • 2.04

0.0539 group x15

  • 4.642222222 B

1.86166122

  • 2.49

0.0211 group x20 0.000000000 B . . . group x25 0.000000000 B . . .

Note: The effect group now has only 2 degrees of freedom, since we are not testing equality, but linearity!

slide-35
SLIDE 35

Parametrisations, November 2007 32

’Contrast’ statement in GLM for test of linearity (a bit tricky)

proc glm data=drug; where type=’active’; class group; model index=group / solution; contrast ’dev linearity’ group 1 -2 1 0, group 0 1 -2 1; run;

gives an extra line of output:

Contrast DF Contrast SS Mean Square F Value Pr > F dev linearity 2 22.688466 11.344233 3.20 0.0612

slide-36
SLIDE 36

Parametrisations, November 2007 33

Explanation for this contrast: The linear regression model is a special case of the ANOVA model, in which all contrast (µ2 − µ1, µ3 − µ2, µ4 − µ3) are equal. µ2 − µ1 = µ3 − µ2 ⇒ (µ3 − µ2) − (µ2 − µ1) = µ1 − 2µ2 + µ3 = 0 µ3 − µ2 = µ4 − µ3 ⇒ (µ4 − µ3) − (µ3 − µ2) = µ2 − 2µ3 + µ4 = 0

slide-37
SLIDE 37

Parametrisations, November 2007 34

Example with comparison of regression lines Analysis of covariance, ANCOVA (just a name used for a model with

  • ne Class variable and one quantitative variable.)

Relate weight to age and gender, in the well known Juul-data,

tanner=1 The GLM Procedure Dependent Variable: weight Sum of Source DF Squares Mean Square F Value Pr > F Model 3 12527.57848 4175.85949 183.19 <.0001 Error 457 10417.45375 22.79530 Corrected Total 460 22945.03223

slide-38
SLIDE 38

Parametrisations, November 2007 35 R-Square Coeff Var Root MSE weight Mean 0.545982 15.94097 4.774443 29.95076 Source DF Type I SS Mean Square F Value Pr > F sex 1 701.27395 701.27395 30.76 <.0001 age 1 11639.86222 11639.86222 510.63 <.0001 age*sex 1 186.44231 186.44231 8.18 0.0044 Source DF Type III SS Mean Square F Value Pr > F sex 1 132.83848 132.83848 5.83 0.0162 age 1 10905.95329 10905.95329 478.43 <.0001 age*sex 1 186.44231 186.44231 8.18 0.0044

slide-39
SLIDE 39

Parametrisations, November 2007 36 Standard Parameter Estimate Error t Value Pr > |t| Intercept 3.818464396 B 1.46473900 2.61 0.0094 sex female 5.234784674 B 2.16850068 2.41 0.0162 sex male 0.000000000 B . . . age 2.991010992 B 0.15710180 19.04 <.0001 age*sex female

  • 0.691706685 B

0.24186467

  • 2.86

0.0044 age*sex male 0.000000000 B . . . NOTE: The X’X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter ’B’ are not uniquely estimable.

Interpretation: Boys increase their weight more rapidly than girls:

  • approx. 692g more per year, CI=(216, 1167)
slide-40
SLIDE 40

Parametrisations, November 2007 37

Weight related to age: Again: Problems with variance homogeneity.....

slide-41
SLIDE 41

Parametrisations, November 2007 38

Estimate the weight difference at the age of 14:

proc glm data=juul; where tanner=1; class sex; model weight=sex age age*sex / solution clparm; estimate ’difference at 14’ sex -1 1 age*sex -14 14; run;

with the output:

Standard Parameter Estimate Error t Value Pr > |t| difference at 14 4.44910891 1.34342207 3.31 0.0010 Parameter 95% Confidence Limits difference at 14 1.80905820 7.08915963

slide-42
SLIDE 42

Parametrisations, November 2007 39

Simultaneous test of both sex-effects proc glm data=juul; where tanner=1; class sex; model weight=sex age age*sex / solution clparm; contrast ’sex and sex*age’ sex -1 1, age*sex -1 1; run; which yields:

Contrast DF Contrast SS Mean Square F Value Pr > F sex and sex*age 2 263.5387428 131.7693714 5.78 0.0033

slide-43
SLIDE 43

Parametrisations, November 2007 40

SAS-computation of both lines simultaneously Subgroup analysis

  • Keep the interaction term sex*age
  • ’Leave out’ the marginal effect age

– this will merge the marginal effect into the interaction term

  • ’Leave out’ the intercept (use option noint in textsfModel)

– this will merge the intercept into the sex effect proc glm data=juul; where tanner=1; class sex; model weight=sex age*sex / noint solution clparm; run;

slide-44
SLIDE 44

Parametrisations, November 2007 41

Output:

The GLM Procedure Dependent Variable: weight Sum of Source DF Squares Mean Square F Value Pr > F Model 4 426066.6962 106516.6741 4672.75 <.0001 Error 457 10417.4538 22.7953 Uncorrected Total 461 436484.1500 R-Square Coeff Var Root MSE weight Mean 0.545982 15.94097 4.774443 29.95076 Source DF Type I SS Mean Square F Value Pr > F sex 2 414240.3917 207120.1959 9086.09 <.0001 age*sex 2 11826.3045 5913.1523 259.40 <.0001

slide-45
SLIDE 45

Parametrisations, November 2007 42 Source DF Type III SS Mean Square F Value Pr > F sex 2 885.61069 442.80534 19.43 <.0001 age*sex 2 11826.30453 5913.15226 259.40 <.0001 Standard Parameter Estimate Error t Value Pr > |t| sex female 9.053249070 1.59904186 5.66 <.0001 sex male 3.818464396 1.46473900 2.61 0.0094 age*sex female 2.299304308 0.18389546 12.50 <.0001 age*sex male 2.991010992 0.15710180 19.04 <.0001 Parameter 95% Confidence Limits sex female 5.910862396 12.195635744 sex male 0.940005468 6.696923324 age*sex female 1.937918735 2.660689880 age*sex male 2.682279484 3.299742501

slide-46
SLIDE 46

Parametrisations, November 2007 43

Same model, 2 different parametrisations:

proc glm data=juul; where tanner=1; class sex; model weight=sex age age*sex / solution; run;

  • An intercept for the reference group (sex=’male’)
  • A difference from sex=’female’ to sex=’male’
  • An effect of age (slope) for the reference group
  • A difference in slopes from sex=’female’ to sex=’male’

proc glm data=juul; where tanner=1; class sex; model weight=sex age*sex / noint solution; run;

  • An intercept for each group (sex)
  • An effect of age (slope) for each group (sex)
slide-47
SLIDE 47

Parametrisations, November 2007 44

Now we include the Placebo group as Dose=0. Does the Placebo treatment fit in here? No, obviously not. Either we have a placebo effect or a threshold effect. But how could we make a formal test?

slide-48
SLIDE 48

Parametrisations, November 2007 45

data drug; set drug; active_drug=(dose>0); run; proc glm data=drug; where group ne ’aspirin’; class active_drug; model index= dose active_drug / solution; run;

Class Level Information Class Levels Values active_drug 2 0 1 Number of Observations Used 34

slide-49
SLIDE 49

Parametrisations, November 2007 46 Dependent Variable: index Sum of Source DF Squares Mean Square F Value Pr > F Model 2 6003.556011 3001.778006 749.72 <.0001 Error 31 124.119577 4.003857 Corrected Total 33 6127.675588 R-Square Coeff Var Root MSE index Mean 0.979744 9.915869 2.000964 20.17941 Source DF Type I SS Mean Square F Value Pr > F dose 1 4635.140701 4635.140701 1157.67 <.0001 active_drug 1 1368.415310 1368.415310 341.77 <.0001 Source DF Type III SS Mean Square F Value Pr > F dose 1 4368.675979 4368.675979 1091.12 <.0001 active_drug 1 1368.415310 1368.415310 341.77 <.0001

slide-50
SLIDE 50

Parametrisations, November 2007 47 Standard Parameter Estimate Error t Value Pr > |t| Intercept

  • 20.11437309 B

1.40403637

  • 14.33

<.0001 dose 2.58455657 0.07824389 33.03 <.0001 active_drug 0 28.73659531 B 1.55441023 18.49 <.0001 active_drug 1 0.00000000 B . . . NOTE: The X’X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter ’B’ are not uniquely estimable.

Interpretation: The placebo group (active_drug=0) lies an estimated 28.74 above the expected 0-dose level.

slide-51
SLIDE 51

Parametrisations, November 2007 48

Example: Serum IGF-1 for boys We have seen different age dependencies in separate Tanner stages. Is this simply due to a nonlinear age effect?

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 222 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

slide-52
SLIDE 52

Parametrisations, November 2007 49

How can we model non-linear effects?

  • Polynomials tend to be very wiggly

(use ’i=rq’ or ’i=rc’ in the symbol-statement)

  • Splines

piecewise interpolations, often linear or cubic

slide-53
SLIDE 53

Parametrisations, November 2007 50

Separate lines for each age group:

  • Subdivide age into groups, using appropriate thresholds
  • Fit linear effect of age in each age group

The result is a 5 unconnected lines:

slide-54
SLIDE 54

Parametrisations, November 2007 51

Linear splines:

  • Subdivide age into groups, using appropriate thresholds
  • Fit linear effect of age in each age group
  • Make the linear pieces ’meet’ at the thresholds

The result is a bended line:

slide-55
SLIDE 55

Parametrisations, November 2007 52 data juul; set juul; /* define new intercept value */ age5=age-5; /* define number of years above certain threshold ages */ extra_age1=max(age-10,0); extra_age2=max(age-12,0); extra_age3=max(age-13,0); extra_age4=max(age-15,0); run; proc sort data=juul; by sex; run; proc reg data=juul; where age ge 5 and age le 20; by sex; model ssigf1=age5 extra_age10 extra_age12 extra_age13 extra_age15; run;

slide-56
SLIDE 56

Parametrisations, November 2007 53

Definition af extra_age:

age_ extra_ extra_ extra_ extra_ Obs age group age10 age12 age13 age15 215 9.92 1 0.00 216 10.03 2 0.03 217 10.03 2 0.03 218 10.04 2 0.04 300 11.99 2 1.99 0.00 301 12.01 3 2.01 0.01 0.00 302 12.01 3 2.01 0.01 0.00 343 12.93 3 2.93 0.93 0.00 344 13.01 4 3.01 1.01 0.01 345 13.03 4 3.03 1.03 0.03 406 14.94 4 4.94 2.94 1.94 0.00 407 15.01 5 5.01 3.01 2.01 0.01 408 15.13 5 5.13 3.13 2.13 0.13 610 54.00 5 44.00 42.00 41.00 39.00 611 55.24 5 45.24 43.24 42.24 40.24 612 56.71 5 46.71 44.71 43.71 41.71

slide-57
SLIDE 57

Parametrisations, November 2007 54

sex=male The REG Procedure Dependent Variable: ssigf1 Number of Observations Read 506 Number of Observations Used 362 Number of Observations with Missing Values 144 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 5 3959.67230 791.93446 92.21 <.0001 Error 356 3057.38141 8.58815 Corrected Total 361 7017.05370 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 11.80524 0.80661 14.64 <.0001 age5 1 0.70448 0.21803 3.23 0.0013 extra_age10 1

  • 0.16250

0.55581

  • 0.29

0.7702 extra_age12 1 2.38356 1.25125 1.90 0.0576 extra_age13 1

  • 1.15156

1.28119

  • 0.90

0.3694 extra_age15 1

  • 2.41836

0.53872

  • 4.49

<.0001

slide-58
SLIDE 58

Parametrisations, November 2007 55

Can we reduce the model to a simpler one?

  • To simple linearity ..... not reasonable
  • Quadratic age effect ... Why?

proc glm data=juul; where age ge 5 and age le 20; by sex; model ssigf1=age5 extra_age10 extra_age12 extra_age13 extra_age15 / solution; contrast ’all’ extra_age10 1, extra_age12 1, extra_age13 1, extra_age15 1; run; Contrast DF Contrast SS Mean Square F Value Pr > F all 4 738.1629865 184.5407466 21.49 <.0001

slide-59
SLIDE 59

Parametrisations, November 2007 56

Test of adequacy of linear spline? We have to test it against a more complicated model, e.g.

  • separate regressions for each age group
  • Inclusion of tanner as a Class-variable
  • ...
slide-60
SLIDE 60

Parametrisations, November 2007 57

Dose-response curves Example of a typical dose-response relation, for moderate doses We have almost linearity in this dose range:

slide-61
SLIDE 61

Parametrisations, November 2007 58

For extreme doses we see a clear deviation from linearity and: smaller variation in the ends

slide-62
SLIDE 62

Parametrisations, November 2007 59

Y axis: Logit transformed response, i.e. log(response/(100 − response)) X axis: Logarithmic transformed dose

proc gplot; plot logit*dose / haxis=axis1 vaxis=axis2 frame; axis1 logbase=2 logstyle=expand value=(H=2) minor=NONE label=(H=3 ’Dose on log scale’); axis2 value=(H=2) minor=NONE label=(A=90 R=0 H=3 ’Logit transformed response’); symbol v=circle i=sm60 h=3 c=BLACK l=1 w=2; run; /* ’i=sm60’ indicates a 60% smoothing */

We get a reasonable linearity on this scale

slide-63
SLIDE 63

Parametrisations, November 2007 60

Theoretical dose response relation: This looks like – a cumulative normal distribution – a sigmoid shape – a logistic curve

slide-64
SLIDE 64

Parametrisations, November 2007 61

Example from anaesthesia: 47 patients to be operated with two different anesthetics

  • Halothane
  • Neurolept

Y: Twitch response at the ulnar nerve (at the thumb), in % X: Dose of muscle relaxantia

group=halothane patient dose depression 1 22.2 58 2 22.8 27 3 22.2 67 . . . . . . . . . 13 34.9 94 14 35.9 96 15 35.4 37 group=neurolept patient dose depression 1 22.9 27 2 25.0 48 3 24.6 12 . . . . . . . . . 30 37.1 93 31 37.7 85 32 37.1 70

slide-65
SLIDE 65

Parametrisations, November 2007 62

slide-66
SLIDE 66

Parametrisations, November 2007 63

Neurolept

slide-67
SLIDE 67

Parametrisations, November 2007 64

Transformation to linearity in the usual way: y : logit_twitch = log

  • twitch

100 − twitch

  • x : logdose = log(dose)

Linear relation: logit_twitch = α + β logdose produces estimates of α and β, with corresponding standard errors (s.e.)

slide-68
SLIDE 68

Parametrisations, November 2007 65 group=neurolept The REG Procedure Dependent Variable: logity Number of Observations Used 32 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 1 39.62074 39.62074 34.06 <.0001 Error 30 34.90174 1.16339 Corrected Total 31 74.52248 Root MSE 1.07861 R-Square 0.5317 Dependent Mean 0.42477 Adj R-Sq 0.5161 Coeff Var 253.92655

slide-69
SLIDE 69

Parametrisations, November 2007 66 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1

  • 20.61514

3.61037

  • 5.71

<.0001 logdose 1 6.18841 1.06043 5.84 <.0001 Parameter Estimates Variable DF 95% Confidence Limits Intercept 1

  • 27.98850
  • 13.24177

logdose 1 4.02273 8.35409

slide-70
SLIDE 70

Parametrisations, November 2007 67

We get estimates of α and β from the equation logit_twitch = α + β logdose But: What about the parameters of interest, i.e. ED50 and ED90? ˆ ED50 = exp

  • − ˆ

α ˆ β

  • How do we calculate s.e.( ˆ

ED50) ? Reparametrisation: γ1= log(ED50) γ2= log(ED90)

slide-71
SLIDE 71

Parametrisations, November 2007 68

The model may then be written as: y= logit_twitch = logit(0.9) × x−γ1

γ2−γ1 = 2.197 × x−γ1 γ2−γ1

This function is nonlinear in γ1 and γ2! Direct estimation of γ1 and γ2 using non-linear regression ... more about this next week