Practicalities ENAR Spring Meeting Pittsburgh Short tutorial (105 - - PowerPoint PPT Presentation

practicalities
SMART_READER_LITE
LIVE PREVIEW

Practicalities ENAR Spring Meeting Pittsburgh Short tutorial (105 - - PowerPoint PPT Presentation

Peter Dalgaard Practicalities ENAR Spring Meeting Pittsburgh Short tutorial (105 minutes) March 2004 High coverage, not great depth Little time for


slide-1
SLIDE 1

Peter Dalgaard ENAR Spring Meeting Pittsburgh March 2004

An Introduction to the R Environment

1–51

Practicalities

  • Short tutorial (105 minutes)
  • High coverage, not great depth
  • Little time for interaction, but there should be made room

for clarification.

  • Short break (5–10 minutes) in the middle.
  • Script of demos in

http://www.biostat.ku.dk/~pd/ENAR-demos.R

1

Plan

  • Elementary things about R, simple interactive demo
  • Modeling tools
  • R packages
  • Dealing with the R workspace
  • Graphics in R
  • Ad-hoc programming
  • Not covering: Installation, Data summaries, GUIs,

Computing on the language, C/Fortran interface, Advanced analysis...

2

The R environment

  • Built around the programming language R, an Open

Source dialect of the S language

  • R is Free Software, and runs on a variety of platforms (I’ll

be using Linux here, mainly to avoid technical surprises).

  • Command-line execution based on function calls
  • Extensible with user functions
  • Workspace containing data and functions
  • Various graphics devices (interactive and non-interactive)

3

slide-2
SLIDE 2

The basic vector types

  • Numeric (integer/double)
  • Character (strings)
  • Logical
  • Factor (really integer + level attribute)
  • Lists (generic vectors)

4

Basic operations

  • Standard arithmetic (x + y, etc.)
  • Recycling: If operating on two vectors of different length,

the shorter one is replicated (with warning if it is not an even multiple)

  • c — concatenate
  • seq — sequences
  • rep — replication
  • sum, mean, range, ...

5

Demo 1

x <- round(rnorm(10,mean=20,sd=5)) # simulate data x mean(x) m <- mean(x) x - m # notice recycling (x - m)^2 sum((x - m)^2) sqrt(sum((x - m)^2)/9) sd(x)

6

Smart indexing

  • a[5] single element
  • a[5:7] several elements
  • a[-6] all except the 6th
  • a[b>200] logical index
  • a["name"] by name
  • a$b list elements

7

slide-3
SLIDE 3

Matrices/tables/arrays

  • Used in matrix calculus and as input to, e.g.,

chisq.test(). Results of tabulation.

  • Vectors with dimensions
  • Dimnames
  • Matrices: Generate with matrix
  • Indexing methods include [i,j], [i,], [,j]

8

Data frames

  • Like data set in other packages
  • Technically: Lists of vectors/factors of same length
  • Row names (must be unique)
  • Indexed like matrices (Beware, though: Data frames are

not matrices)

  • Generate from read operation or with data.frame
  • Many sample data frames are avalilable using data()

9

Demo 2

data(airquality) airquality$Month airquality[airquality$Month==5,]

  • z <- airquality[airquality$Month==5,]$Ozone

mean(oz) mean(oz, na.rm=TRUE) attach(airquality) mean(Ozone, na.rm=TRUE) tapply(Ozone, Month, mean, na.rm=TRUE) detach()

10

Some standard procedures

  • Continuous data by group: t.test, wilcox.test,
  • neway.test, kruskal.test
  • Categorical data: prop.test, chisq.test,

fisher.test

  • Correlations: cor.test, with options for nonparametrics

11

slide-4
SLIDE 4

Demo 3

library(ISwR) data(intake) attach(intake) t.test(pre, post, paired=TRUE) detach() data(caesarean) # loads a table caesar.shoe chisq.test(caesar.shoe) fisher.test(caesar.shoe)

12

Modeling tools: Overview

  • Model formulas
  • Model objects and summaries
  • Comparing models
  • Evaluating model fit
  • Generalized linear models

13

Model formulas

  • Linear model, y

ǫ

  • In practice something like

y

  • β0

β1

height

β2

1

type

2

✆ ✁

β3

1

type

3

✆ ✁

ǫ

  • Wilkinson-Rogers formulas:

y

  • height

type (Interpretation depends on whether variables are categorical or continuous)

14

Model formulas in R

  • R representation y ~ height + type where type is a

factor

  • Interactions a:b, a*b = a + b + a:b
  • Algebra (a:(b + c) = a:b + a:c etc.)
  • Notice special interpretation of operators
  • Special items: offset, -1

15

slide-5
SLIDE 5

Fitting linear models

data(airquality) aq <- transform(airquality, Month=factor(Month)) fit.aq <- lm(log(Ozone) ~ Solar.R + Wind + Temp + Month, data=aq)

  • lm generates a fitted model object
  • Extract information from model object
  • Fit other models based on model object

16

Inspecting model objects

  • Extract information about the fit
  • summary(fit.aq)
  • fitted, resid
  • anova(model1, model2)
  • plot – diagnostics
  • predict(fit.aq, newdata)

17

Model search

  • anova(model) “Type I” sum of squares
  • drop1 (“Type III”), add1
  • step (AIC/BIC) criteria
  • update

18

Demo 4

data(airquality) aq <- transform(airquality, Month=factor(Month)) fit.aq <- lm(log(Ozone) ~ Solar.R + Wind + Temp + Month, data=aq) fit.aq2 <- update(fit.aq, ~ . - Month) summary(fit.aq) plot(fit.aq) drop1(fit.aq, test="F") anova(fit.aq, fit.aq2)

19

slide-6
SLIDE 6

Generalized linear models

  • Statistical distribution (exponential) family
  • Link function transforming mean to linear scale
  • Deviance
  • Examples; Binomial, Poisson, Gaussian (σ known — in

principle)

  • Canonical link functions
  • Fit using glm in R

20

Demo 5

no.yes <- c("No","Yes") smoking <- gl(2, 1, 8, no.yes)

  • besity <- gl(2, 2, 8, no.yes)

snoring <- gl(2, 4, 8, no.yes) n.tot <- c(60,17,8,2,187,85,51,23) n.hyp <- c(5,2,1,0,35,13,15,8) data.frame(smoking,obesity,snoring,n.tot,n.hyp) hyp.tbl <- cbind(n.hyp,n.tot-n.hyp) glm.hyp <- glm(hyp.tbl~smoking+obesity+snoring, family=binomial("logit")) summary(glm.hyp) 0.87194 + qnorm(c(.025,.975))*0.39757 library(MASS) confint(glm.hyp)

21

Likelihood-based inference

  • Wald approximations ( ˆ

β

  • s.e.

ˆ β

, etc.) can be badly inaccurate in small samples

  • Likelihood-based inference is preferable
  • Use drop1(model, test="Chisq") (for

binomial/Poisson)

  • profile (in MASS for glm) investigates behaviour of

likelihood around maximum

  • plot(profile(model)) shows signed LR statistic

sign

β

ˆ β

✂ ☎

Q when varying each parameter.

  • confint gives likelihood-based confidence intervals

22

R packages

  • Collections of R functions, data, and compiled code
  • Well-defined format that ensures easy installation, a basic

standard of documentation, and enhances portability and reliability,

  • You can write your own packages! It is not entirely trivial,

but tools are there to help you.

23

slide-7
SLIDE 7

Packages that come with R

  • Standard R (1.8.1 - reorganized in 1.9.0) loads with these

packages available – ctest — Classical tests – mva — Multivariate analysis – modreg — Modern Regression – nls — Nonlinear regression – ts — Time series (elementary)

  • 8 further packages are available in core R, but not

automatically loaded (tcltk, lqs, ...)

  • 10 further packages and package bundles are maintained

separately, but included with source and binary distributions (survival, nlme, MASS, ...).

24

CRAN

  • The Comprehensive R Archive Network
  • Collection of servers mirroring a central server in Vienna.

Modeled on CTAN and CPAN (for TEX and Perl code)

  • http://cran.us.r-project.org
  • Maintains a curated collection of R packages as well as the

source and binary distributions R itself

  • About 320 packages available
  • Unix/Linux variants generally install packages from
  • sources. Windows and MacOSX have binary package

formats which are even easier to install

  • See also: Bioconductor,

http://www.bioconductor.org

25

Demo 6

# Cheat for offline demo: # Pretend CRAN is local directory

  • ptions(CRAN="file:/home/pd/cran.r-project.org")

# Manipulate install path .libPaths("~/Rlibrary") .libPaths() # Source install (gives harmless warning) install.packages("mvtnorm") library(mvtnorm) library(help=mvtnorm)

26

Practical issues

  • Dealing with the workspace
  • Reading data
  • Saving and restoring data and results

27

slide-8
SLIDE 8

The workspace

  • The global environment contains R objects created on the

command line.

  • There is an additional search path of loaded packages and

attached data frames.

  • The search path is maintained by library(), attach(),

and detach()

  • This determines the way R looks up objects by name
  • Notice that objects in the global environment may mask
  • bjects in packages.

28

Demo 7

search() data(intake) # From ISwR ls() attach(intake) search() ls("intake") # show variables in data frame post - pre rm(intake) # remove data frame detach() # remove from search path

29

Reading data

  • Simple data vectors can be read using scan()
  • Data frames can be read from most reasonably structured

text file formats (space separated columns, tab- and comma-delimited files) using read.table() or read.delim().

  • The read functions have many useful arguments. Notice in

particular colClasses, which allows you setup ad-hoc conversion to any desired object class.

  • The foreign package can read files from Stata, SAS

export libraries, SPSS, and Epi-Info, Minitab, and some S-PLUS versions.

30

Getting organized

Several possibilities:

  • Save/restore entire workspace (objects only)
  • Save selected objects and load them
  • source() script files
  • Batch processing (R CMD BATCH file.R)
  • ESS – Emacs Speaks Statistics: Integrated environment for

maintaining scripts, running R, saving results, etc.

31

slide-9
SLIDE 9

R graphics

  • The standard interface
  • Customizing plots
  • Graphics parameters
  • Math on plots
  • Grid and lattice

32

Standard R graphics

  • Ink on paper model; once something is drawn it cannot be

erased.

  • Sensible default plots
  • Arguments can override defaults
  • Options to turn off various elements of plots (e.g. the axes)
  • Functions to add elements.

33

Basic x-y plots

  • The plot function with one or two numeric arguments
  • Scatterplot or line plot (or both) depending on type

argument: "l" for lines, "p" for points (the default), "b" for both, plus quite a few more.

  • Functions for adding to a plot: lines, points,

segments, abline, text, mtext, axis

  • Also: formula interface, plot(y~x)

34

Graphical parameters

  • Arguments to plot et al. (67 possibilities!)
  • The par function can be used to set most of them
  • persistently. Most info is found via help(par)
  • Look them up! Here are some of the more commonly used:

– Point and line characteristics: pch, col, lty, lwd – Multiframe layout: mfrow, mfcol – Axes: xlim, ylim, xaxt, yaxt, log

35

slide-10
SLIDE 10

Specific plots

  • Histograms — hist(x)
  • Density plots — plot(density(x))
  • Boxplots — boxplot(x)
  • Barplots — barplot(x) (x can be a matrix)
  • Pies — pie()
  • Matrix plots (multiple y columns) — matplot()

36

Demo 8

data(intake) par(mfrow=c(2,2)) matplot(intake) matplot(t(intake)) matplot(t(intake),type="b") matplot(t(intake),type="b",pch=1:11,col="black", lty="solid", xaxt="n") axis(1,at=1:2,labels=names(intake))

37

Math on plots

  • Sort of like TeX
  • Works on unevaluated expressions (quote(alpha),

expression(alpha))

  • Special conventions: ˆ,[] sub/superscript, special names

alpha, sum, int

  • See help(plotmath)

38

Grid and Lattice graphics

  • Standard R graphics allow graphs to be arranged in an

m

n gridded layout.

  • The grid package allows arbitrary viewports and create

graph objects (“grobs”) which can be modified before they are printed.

  • The lattice package uses grid for a structural approach

to multiframe graphs

  • Model formulas, y~x|g1*g2*...
  • Shingles: Partially overlapping intervals used for

conditioning plots

  • Panel functions — potentially user codable

39

slide-11
SLIDE 11

Demo 9

library(lattice) data(airquality) lset(theme = col.whitebg()) myplot <- xyplot(log(Ozone)~Solar.R | equal.count(Temp), group=Month, data=airquality, ylab=list(label=expression("log"*O[3]),cex=2), xlab=list(cex=2)) myplot # OBS: no plot until object is printed!

40

Ad-hoc programming

  • This had better be brief ...
  • What does an R function look like
  • Flow control
  • Matrix algebra
  • Optimizers

41

Simple functions

  • logit <- function(p) log(p/(1-p))
  • logit(0.5)
  • Formal arguments
  • Actual arguments
  • Positional matching: plot(x,y)
  • Keyword matching: t.test(x ~ g, mu=2,

alternative="less")

  • Partial matching: t.test(x ~ g, mu=2, alt="l")

42

Flow control

  • if/else
  • ifelse()
  • switch()
  • for loops
  • repeat, while
  • break

43

slide-12
SLIDE 12

Apply-functions/loop avoidance

  • lapply – list-apply
  • sapply – simplifying apply
  • tapply – tabulating apply
  • apply, sweep – along slices of tables
  • replicate – repeat expression

44

Demo 10

# these examples all do the same pval <- numeric(1000) for (i in 1:1000) pval[i] <- t.test(rexp(25),mu=1)$p.value f <- function(i) t.test(rexp(25),mu=1)$p.value pval <- sapply(1:1000, f) pval <- replicate(1000, t.test(rexp(25),mu=1)$p.value) qqplot(ppoints(1000), pval, pch=".")

45

Matrix algebra

  • R contains a pretty full set of primitives for matrix calculus
  • A %*% B for matrix multiplication
  • solve(A, b) for solving linear equations. (solve(A)

for matrix inverse)

  • Various special products and decompositions

46

Demo 11

Preliminaries, the interesting bits are on the next slide data(airquality) attach(transform(airquality,Month=factor(Month))) fit.aq <- lm(log(Ozone)~Solar.R+Temp+Month) # This is a bit hardcore: Find terms for Month # and extract contrast matrix where <- fit.aq$assign==match("Month", attr(fit.aq$terms,"term.labels")) cnt <- contrasts(fit.aq$model[["Month"]])

47

slide-13
SLIDE 13

Demo, continued

# From regression coef to per-month levels # (determinable up to additive constant) cf <- drop(cnt %*% coef(fit.aq)[where]) M <- vcov(fit.aq)[where,where] V <- cnt %*% M %*% t(cnt) dif <- outer(cf, cf, "-") d <- diag(V) vdiff <- outer(d, d, "+") - 2*V dif/sqrt(vdiff) # pairwise t tests

48

Optimization tools

  • 1-dimensional: optimize
  • nlm, Newton-style optimizer
  • ptim, wrapper for several advanced optimizers

49

Demo 12

mll <- function(theta) -dbinom(4, 10, theta, log=TRUE) plot(mll)

  • ptimize(mll, c(0,1))

nlm(mll, .5)

  • ptim(.5, mll, method="BFGS")

50

Summary

So what have we seen?

  • R is a versatile working environment
  • There is a very flexible toolkit for building graphics

displays

  • You can handle simple tasks quite easily
  • Complicated task can be handled via ad hoc

programming, often elegantly

  • Extensions can be made to integrate seamlessly and a large

body of such extensions is available from CRAN

51