R Packages for Rank-based Estimates John Kloke 1 Joseph McKean 2 1 - - PowerPoint PPT Presentation

r packages for rank based estimates
SMART_READER_LITE
LIVE PREVIEW

R Packages for Rank-based Estimates John Kloke 1 Joseph McKean 2 1 - - PowerPoint PPT Presentation

R Packages for Rank-based Estimates John Kloke 1 Joseph McKean 2 1 University of Wisconsin - Madison 2 Western Michigan University 10 July 2013 Kloke, McKean R Packages for Rank-based Estimates Outline Rank-based (R) estimation for linear


slide-1
SLIDE 1

R Packages for Rank-based Estimates

John Kloke1 Joseph McKean2

1University of Wisconsin - Madison 2Western Michigan University

10 July 2013

Kloke, McKean R Packages for Rank-based Estimates

slide-2
SLIDE 2

Outline

◮ Rank-based (R) estimation for linear models ◮ Rfit ◮ Joint ranking (JR) estimation for cluster correlated data ◮ jrfit

Kloke, McKean R Packages for Rank-based Estimates

slide-3
SLIDE 3

Notation

Usual iid Linear Model yi = α + xT

i β + ei for i = 1, . . . , n

Assume ei

iid

∼ F, f are continuous errors with I(f ) < ∞. Matrix Formulation y = α1n + Xβ + e WLOG assume X is centered (X T1n = 0) and has full rank and design conditions hold.

Kloke, McKean R Packages for Rank-based Estimates

slide-4
SLIDE 4

Rank Estimation

Jureˇ ckov´ a (1971) & Jaeckel (1972)

The rank-based estimator of β is defined as

  • βϕ

= Argminy − Xβϕ = Argmin

n

  • i=1

a[R(yi − xT

i β)](yi − xT i β) ◮ vϕ = n i=1 a[R(vi)]vi

v ∈ Rn, is Jaeckel’s (1972) dispersion function.

◮ R denotes Rank. ◮ Scores are generated as a[t] = ϕ

  • t

(n+1)

  • .

◮ Score function: ϕ(u) is a nondecreasing, square-integrable

function defined on (0, 1) such that

  • ϕ(u) du = 0 and
  • ϕ2(u) du = 1.

Kloke, McKean R Packages for Rank-based Estimates

slide-5
SLIDE 5

Asymptotic Theory and Inference

ˆ βϕ ˙ ∼Np

  • β, τ 2

ϕ(X TX)−1

where τϕ is a scale parameter which depends on the error distribution (f ) and the score function (ϕ). A consistent estimate

  • f τϕ was proposed by Koul, Sievers, McKean (1987).

Inference

◮ Simple confidence intervals and tests of hypothesis for the

slope parameters should be based on a tn−p−1 where p = ncol(X) (see Hettmansperger, McKean 2009).

◮ A reduction in dispersion which is akin to the least squares

reduced model test is based on the statistic FRD = (y − ˆ yFϕ − y − ˆ yR)/q ˆ τϕ/2 where ˆ yF is the full model fit, ˆ yR is the reduced model fit, and q is the number of constraints.

Kloke, McKean R Packages for Rank-based Estimates

slide-6
SLIDE 6

Rfit

Kloke, McKean 2012

> args(rfit.default) function (formula, data = list(), yhat0 = NULL, scores = ws symmetric = FALSE, intercept = NULL, TAU = "F0", ...)

◮ yhat0: initial fit. Defaults to an L1 estimate (quantreg,

Koenker 2013).

◮ scores: object of class ‘scores’. Several available. Defaults to

Wilcoxon.

◮ symmetric: use median (F) or Hodges-Lehmann (T) to

estimate intercept

◮ intercept: should an intercept be added to the model?

Default is yes unless already in column space of the design matrix.

◮ TAU: F0 = fortran; R = R; N = NA

Kloke, McKean R Packages for Rank-based Estimates

slide-7
SLIDE 7

Example 1: Uric Acid & Cardiovascular Risk Factors

Heritier, et. al. 2009

◮ Investigating the association between uric acid levels and

various cardiovascular risk factors in developing countries.

◮ Outcome variable: Uric Acid (µmol/L) ◮ Risk Factors: body mass index (bmi), systolic blood pressure

(sys), low-density lipoprotein cholesterol (ldl), total cholesterol (choles), male (sex)

◮ 998 Participants (474 men & 524 women) aged 25 to 64.

Kloke, McKean R Packages for Rank-based Estimates

slide-8
SLIDE 8

> summary(rfit(Uric~bmi+sys+choles+ldl+sex,data=CardioRiskFactor Call: rfit.default(formula = Uric ~ bmi + sys + choles + ldl + sex, data = CardioRiskFactors) Coefficients: Estimate Std. Error t.value p.value

  • 122.54244

19.10680 -6.4136 2.197e-10 *** bmi 5.16547 0.49141 10.5115 < 2.2e-16 *** sys 0.67100 0.10480 6.4027 2.352e-10 *** choles 60.37128 5.10748 11.8202 < 2.2e-16 *** ldl

  • 53.30736

5.33107 -9.9994 < 2.2e-16 *** sex 128.74061 5.00197 25.7380 < 2.2e-16 ***

  • Signif. codes:

0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Multiple R-squared (Robust): 0.4688407 Reduction in Dispersion Test: 175.1226 p-value: 0

Kloke, McKean R Packages for Rank-based Estimates

slide-9
SLIDE 9

> fitF<-rfit(Uric~bmi+sys+choles+ldl+sex +smok+alco+apoa+trig+age,data=CardioRiskFactors) > fitR<-rfit(Uric~bmi+sys+choles+ldl+sex, data=CardioRiskFactors) > drop.test(fitF,fitR) Drop in Dispersion Test F-Statistic p-value 82.193 0.000

Kloke, McKean R Packages for Rank-based Estimates

slide-10
SLIDE 10

Wilcoxon Studentized Residuals

> qqnorm(rstudent(fit))

−3 −2 −1 1 2 3 −5 5 10 15 20

Normal Q−Q Plot

Theoretical Quantiles Sample Quantiles

Kloke, McKean R Packages for Rank-based Estimates

slide-11
SLIDE 11

Cluster Correlated Data

Consider an experiment done over m blocks or clusters. For block k (k = 1, . . . , m) we have the linear model yk = α1nk + X kβ + ek

◮ Assume e1, . . . , em are independent random vectors. ◮ N = m k=1 nk.

Matrix Formulation y = α1N + Xβ + e where X = [X T

1 · · · X T m]T is a N × p design matrix

Kloke, McKean R Packages for Rank-based Estimates

slide-12
SLIDE 12

JR Estimator

Kloke, McKean, Rashid 2009

Assume equal marginals: ekj ∼ f , F for k = 1, . . . , m, j = 1, . . . nk The joint rankings (JR) estimator of β is ˆ βJR = Argminy − Xβϕ, where vϕ =

N

  • t=1

a(R(vt))vt is Jaeckel’s (1972) dispersion function.

Kloke, McKean R Packages for Rank-based Estimates

slide-13
SLIDE 13

Asymptotic Distribution

Using a result from Brunner and Denker (1994) ˆ βJR ˙ ∼Np

  • β, τ 2

ϕ

  • X TX

−1 V ϕ

  • X TX

−1

◮ V ϕ = m k=1 X T k Σϕ,kX k ◮ Σϕ,k = var(ϕ(F(ek))) ◮ ϕ(F(ek)) = [ϕ(F(ek1)), . . . , ϕ(F(eknk))]T

Kloke, McKean R Packages for Rank-based Estimates

slide-14
SLIDE 14

Estimates of V ϕ = m

k=1 X T k ΣϕX k

  • 1. Assume Σϕ is compound symmetric. Then
  • Σϕ = (1 − ˆ

ρϕ)I + ˆ ρϕJ where ˆ ρϕ =

1 M−p

m

k=1

  • i>j a(R(ˆ

eki))a(R(ˆ ekj)), M = m

k=1

nk

2

  • .
  • 2. Sandwich estimator. m

k=1 X T k a(R(ˆ

ek))a(R(ˆ ek))TX k

  • 1. Valid under exchangeable within block errors.
  • 2. Requires a large number of clusters (large m).

Kloke, McKean R Packages for Rank-based Estimates

slide-15
SLIDE 15

jrfit

> args(jrfit) function (x, y, block, yhat0 = NULL, scores = wscores, fitint = NULL, var.type = "sandwich", fitblock = NULL, ...)

◮ x: N × p design matrix ◮ y: N × 1 vector of responses ◮ block: N × 1 vector denoting block membership ◮ var.type: ‘sandwich’, ‘cs’, or ‘user’. ◮ fitblock: should block be fit as fixed effect? Default T for

‘cs’ and F for ‘sandwich’

Kloke, McKean R Packages for Rank-based Estimates

slide-16
SLIDE 16

Example 2: Simulated Cluster-Correlated

◮ Model: yij = α + wi∆ + xijβ + bj + eij ◮ m = 160 blocks ◮ n = 4 observations (repeated measures) per block ◮ ∆ = 0.5 is the treatment effect ◮ β = 0 is the effect of a (baseline) covariate ◮ eij ∼ t5 and bj ∼ t3

Kloke, McKean R Packages for Rank-based Estimates

slide-17
SLIDE 17

> fit<-jrfit(cbind(x,w),y,block) > summary(fit) Coefficients: Estimate

  • Std. Error t-ratio

p.value x 0.0904889 0.1390482 0.6507737 0.5161260 w 0.7094427 0.2578163 2.7517367 0.0066124 > > fit<-jrfit(cbind(x,w),y,block,var.type=’cs’,fitblock=F) > summary(fit) Coefficients: Estimate

  • Std. Error t-ratio

p.value x 0.0904889 0.1182933 0.7649543 0.4454256 w 0.7094427 0.2642782 2.6844538 0.0080298

Kloke, McKean R Packages for Rank-based Estimates

slide-18
SLIDE 18

Summary

◮ Rfit provides robust rank-based inference for iid linear model.

Available at CRAN.

◮ jrfit provides robust rank-based inference for linear model

w/ cluster correlated errors. Available at http://www.biostat.wisc.edu/~kloke/.

Kloke, McKean R Packages for Rank-based Estimates

slide-19
SLIDE 19

Future Work

◮ Regression through the origin ◮ High-breakdown R estimates (+ diagnostics) ◮ Generalized JR estimate based on working covariance of

var(yi).

◮ “Nonparametric Statistical Methods Using R” (2014) -

Chapman Hall

Kloke, McKean R Packages for Rank-based Estimates

slide-20
SLIDE 20

References

◮ Brunner, E. and Denker, M. (1994), Rank statistics under dependent

  • bservations and applications to factorial designs, Journal of Statistical

Inference and Planning, 42, 353-378. ◮ Hettmansperger, T. P. and McKean, J. W. (2011), Robust Nonparametric Statistical Methods, 2nd Edition, Boca Raton, FL: CRC Press. ◮ Heritier, S. Cantoni, E., Copt, S. and Victoria-Feser, M. (2009), Robust Methods in Biostatistics, New York: Wiley. ◮ Jaeckel, L. A. (1972), Estimating regression coefficients by minimizing the dispersion of the residuals, Annals of Mathematical Statistics, 43, 1449-1458. ◮ Jureˇ akov´ a, J. (1971), Nonparametric estimate of regression coefficients, Annals

  • f Mathematical Statistics, 42, 1328:1338.

◮ Kloke, JD, McKean, JW (2012). Rfit: Rank-based Estimation for Linear

  • Models. R Journal, 4/2, 57-64.

◮ Kloke, J.D., McKean, J.W., and Rashid, M. (2009), Rank-Based Estimation and Associated Inferences for Linear Models with Cluster Correlated Errors, Journal

  • f the American Statistical Association, 104, 485:384-390.

◮ Koenker, R. (2013). quantreg: Quantile Regression. R package version 4.98. http://CRAN.R-project.org/package=quantreg. ◮ Koul, H.L., Sievers, G.L. and McKean, J.W. (1987). An estimator of the Scale parameter for the Rank Analysis of Linear Models under General Score

  • Functions. Scand J. Statist.

14, 131 - 141.

Kloke, McKean R Packages for Rank-based Estimates