SLIDE 1

An Introduction to Statistical Computing in R

K2I Data Science Boot Camp - Day 1 AM Session

May 15, 2017

Statistical Computing in R May 15, 2017 1 / 55

SLIDE 2

AM Session Outline

Intro to R Basics
Plotting in R
Data Manipulation

SLIDE 3

R Basics

Here we will give a quick overview of the R language and the RStudio IDE. Our emphasis will be on the most used features of R, especially those used in later courses. This won’t cover all the details, but it will cover the most important parts.

SLIDE 4

Working with RStudio

Before beginning with R let’s orient ourselves with RStudio.

SLIDE 5

Our initial view of RStudio is:

SLIDE 6

Go to: File -> New File -> R Script. This gives:

SLIDE 7

SLIDE 8

Try It Out

Type the following into the console:
?lm
??linear
plot(1:20, 1:20)

SLIDE 9

There are several useful shortcut keys in RStudio. A few popular ones:
Ctrl+Enter - when pressed in the editor, sends the current line to the console
Ctrl+1, Ctrl+2 - switch between the editor and the console
Ctrl+Shift+Enter - run the entire script in the console
Tab completion - this is perhaps the most used feature
For vim/emacs users, Tools -> Global Options -> Code -> Keybindings will give you your preferred bindings.

SLIDE 10

It’s important to know our working directory. Given a file name, R will assume it is located in your current working directory. R will also save output to the working directory by default. It is important to set your working directory to the correct location or specify full path names.

SLIDE 11

Try out the following in the console window:
getwd()
list.files()
To change your working directory go to: Session -> Set Working Directory -> Choose Directory
Alternatively,
setwd("/path/to/directory")
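
The same switch can be made from a script. The following is a minimal sketch, assuming a throwaway directory from tempdir() stands in for a real project folder; the file name scratch.txt is made up for illustration. setwd() returns the previous directory invisibly, which makes it easy to switch back afterwards.

```r
# Remember where we started; setwd() returns the old directory invisibly
old.dir <- setwd(tempdir())
# Files written with a bare name now land in the temporary directory
writeLines("hello", "scratch.txt")   # "scratch.txt" is a hypothetical file name
in.new.dir <- file.exists(file.path(getwd(), "scratch.txt"))
# Clean up and go back to the original working directory
file.remove("scratch.txt")
setwd(old.dir)
```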

SLIDE 12

Reading, Writing, Saving, and Loading

Here we’ll look at bringing data into R and getting it out.
We’ll also see how to save R objects and environments.

SLIDE 13

Reading In Data

read.table
read.csv
read.fwf
Check out the options for each:
?read.table

SLIDE 14

Syntax

?read.table
?read.csv
read.table("/path/to/your/file.ext", header=TRUE, sep=",", stringsAsFactors = FALSE)

SLIDE 15

Most Common Options

sep tells how fields/variables are separated. Common values are:
"," (comma)
" " (single space)
"\t" (tab escape character)
stringsAsFactors tells whether to treat non-numeric values as factor/categorical variables.
header tells whether the first line of the file has variable names.
na.strings tells how missing values are encoded in the file.
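
To see these options working together, here is a small sketch that writes a temporary file and reads it back. The file contents, the column names, and the use of the word "missing" as the missing-value code are all made up for illustration.

```r
# Write a tiny semicolon-separated file to a temporary location
# (contents and column names are hypothetical)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;score;group",
             "1;3.5;a",
             "2;missing;b",
             "3;4.0;a"), tmp)
# header: first line holds names; sep: fields split on ';'
# na.strings: the text "missing" is read as NA
dat <- read.table(tmp, header = TRUE, sep = ";",
                  na.strings = "missing", stringsAsFactors = FALSE)
str(dat)
unlink(tmp)
```

Because "missing" is declared as a missing-value code, the score column can still be read as numeric rather than as text.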

SLIDE 16

Standard Procedure

Open the file in a text editor.
Check items relevant to the options. Header? Separator type?
For big files, Linux tools are helpful:
head -n10 BigFile.txt > OpenMe

SLIDE 17

Try it Out

Let’s read the ReadMeInX.txt files into R. Try it on your own before looking at the answer on the next slides. Example workflow:

1 Set your working directory to the directory containing the files.
2 Examine the files in a text editor to check for common options (header, separator, etc.)

SLIDE 18

# read.table's default separator is fine for this one
set0 <- read.table("ReadMeIn0.txt", header=TRUE)
# specify a new separator
set1 <- read.table("ReadMeIn1.txt", header=TRUE, sep=',')
# or use read.csv
set1 <- read.csv("ReadMeIn1.txt", header=TRUE)

SLIDE 19

# another change of separator
set2 <- read.table("ReadMeIn2.txt", header=TRUE, sep=';')
# check for missing values
set3 <- read.table("ReadMeIn3.txt", header=FALSE, sep=',', na.strings = '')

SLIDE 20

Writing Data

write.table
write.csv

SLIDE 21

Syntax and Common Options

?write.csv
write.csv(myRObject, file="/path/to/save/spot/file.csv", row.names=FALSE)
Options are largely the same as their read counterparts.
row.names = FALSE is helpful to avoid having 1,2,3,... as a variable/column.
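
The effect of row.names is easiest to see in a round trip. This is a sketch using a made-up data frame and a temporary file; it writes the same data twice and compares the column names that come back.

```r
# A small made-up data frame
df.out <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")
# Default: the row labels are written and come back as an extra column
write.csv(df.out, file = tmp)
with.rows <- read.csv(tmp, stringsAsFactors = FALSE)
# row.names = FALSE: only the real variables are written
write.csv(df.out, file = tmp, row.names = FALSE)
no.rows <- read.csv(tmp, stringsAsFactors = FALSE)
names(with.rows)
names(no.rows)
unlink(tmp)
```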

SLIDE 22

Try It Out

Write out one of the files you imported. Try varying options like sep and quote.

SLIDE 23

Saving Objects

saveRDS/readRDS are used to save (a compressed version of) individual R objects.

# save our data set
saveRDS(set1, file="TstObj.rds")
# get it back
newtst <- readRDS("TstObj.rds")
# can save any R object. Try a vector
my.vector <- c(1,8,-100)
saveRDS(my.vector, file="JustAVector.rds")
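
A quick way to convince yourself that nothing is lost is to compare the object before and after. The sketch below uses a made-up list and a temporary file; note that readRDS returns the object, so you choose the name it is assigned to.

```r
obj <- list(a = 1:3, b = "text")      # made-up object
tmp <- tempfile(fileext = ".rds")
saveRDS(obj, file = tmp)
# readRDS returns the object; the original variable name is not stored
restored <- readRDS(tmp)
same <- identical(obj, restored)
unlink(tmp)
```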

SLIDE 24

Saving Environment

We can save all variables in the current R workspace with save.image.
We can load a saved workspace with load.
R will ask you to save your work when you exit.
# Save all our work
save.image("AllMyWork.RData")
# Reload it
load("AllMyWork.RData")
# name given to the default save
load(".RData")
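
When only a few objects matter, the related save function (not shown on the slide) writes just the objects you name. The sketch below uses made-up objects and a temporary file, and loads into a fresh environment so that nothing in the current workspace is overwritten.

```r
a.num <- 42          # hypothetical objects to save
a.chr <- "keep me"
tmp <- tempfile(fileext = ".RData")
save(a.num, a.chr, file = tmp)
# load() restores objects under their original names and returns those names;
# a separate environment keeps them apart from the current workspace
restored <- new.env()
loaded.names <- load(tmp, envir = restored)
restored$a.num
unlink(tmp)
```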

SLIDE 25

The Basics of R

Let’s do a whirlwind tour of R: its syntax and data structures.
This won’t cover all the details, but it will cover the most important parts.

SLIDE 26

Basic R Data Types

# numeric types: integer, double
348
# character
"my string"
# logical
TRUE
FALSE
# arithmetic as you'd expect
43 + 1 * 2^4
# so too logical operators/comparison
TRUE | FALSE
1 + 7 != 7
# Other logical operators:
# &, |, !
# <, >, <=, >=, ==, !=

SLIDE 27

Data Types Cont.

# variable assignment is done with the <- operator
my.number <- 483
# the '.' above does nothing. we could have done:
# mynumber <- 483
# instead
# it's an Rism to use .'s in variable names.
# typeof() tells us the type
typeof(my.number)
## [1] "double"
# we can convert between types
my.int <- as.integer(my.number)
typeof(my.int)
## [1] "integer"
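
Two details about integers are worth a short sketch: a bare number is a double unless it carries the L suffix, and as.integer drops the fractional part rather than rounding. The variable names below are made up for illustration.

```r
typeof(483)       # a bare number is stored as a double
typeof(483L)      # the L suffix requests an integer
# as.integer() truncates toward zero rather than rounding
trunc.pos <- as.integer(3.9)
trunc.neg <- as.integer(-3.9)
# round first when rounding is what you want
rounded <- as.integer(round(3.9))
```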

SLIDE 28

R Data Structures - Vectors

# the vector is the most important data structure
# create it with c()
my.vec <- c(1,2,67,-98)
# get some properties
str(my.vec)
## num [1:4] 1 2 67 -98
length(my.vec)
## [1] 4
# access elements with []
my.vec[3]
## [1] 67
my.vec[c(3,4)]
## [1] 67 -98
# can do assignment too
my.vec[5] <- 41.2

SLIDE 29

Vectors - Cont.

# other ways to create vectors
x <- 1:6
y <- seq(7,12,by=1)
# Operations get recycled through the whole vector
x + 1
## [1] 2 3 4 5 6 7
x > 3
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
# Can do component-wise operations between vectors
x * y
## [1] 7 16 27 40 55 72
x / y
## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000
y %/% x
## [1] 7 4 3 2 2 2

SLIDE 30

Try It Out

# Try to guess what the following lines will do
# Will it run at all? If so, what will it give?
# Think about it and run to confirm
7 -> w
w <- z <- 44
1 + TRUE
0 | 15 & 3
my.vec[2:4]
my.vec[-2]
my.vec[c(TRUE,FALSE,FALSE,TRUE,FALSE)]
my.vec[ sum( c(TRUE,FALSE,FALSE,TRUE,TRUE) ) ] <- TRUE
my.vec[3] <- "I'm a string"
as.numeric(my.vec)
x[x>3]
x + c(1,2)
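
Two rules sit behind several of these puzzles: type coercion and recycling. The sketch below illustrates both on separate, made-up vectors (a and b are hypothetical names) so that the exercise itself is left for you to run.

```r
# Mixing types in c() coerces everything to the most general type:
# logical -> numeric -> character
typeof(c(TRUE, 2))
typeof(c(TRUE, 2, "three"))
# Recycling: the shorter vector is repeated to match the longer one
a <- c(10, 20, 30, 40)
b <- c(1, 2)
recycled <- a + b   # same as a + c(1, 2, 1, 2)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.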

SLIDE 31

Matrices

# matrices are 2d vectors.
# create using matrix()
my.matrix <- matrix(rnorm(20),nrow=4,ncol=5)
# rnorm(20) draws 20 random samples from a N(0,1) distribution
my.matrix
##            [,1]        [,2]       [,3]      [,4]       [,5]
## [1,]  0.5351131  1.08710882  0.5670939 0.2800755 -0.8050743
## [2,] -1.9263838  0.86267009  0.7318280 0.4177110 -0.9576529
## [3,] -1.2931770 -1.03381286 -0.9035750 1.9787516  0.3747967
## [4,] -2.6190953 -0.04829205  1.3157181 1.2562005  0.1131199
# note matrices are filled by column
# Get details
dim(my.matrix)
## [1] 4 5
nrow(my.matrix)
## [1] 4
ncol(my.matrix)
## [1] 5

SLIDE 32

Matrices - Cont.

# Indexing is similar to vectors but with 2 dimensions
# get second row
my.matrix[2,]
## [1] -1.9263838 0.8626701 0.7318280 0.4177110 -0.9576529
# get first and fourth columns of row three
my.matrix[3,c(1,4)]
## [1] -1.293177 1.978752
# transposing is done with t()
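
The slide mentions t() without showing it, so here is a small sketch on a deterministic matrix (m is a made-up example, not the random my.matrix above). It also shows drop = FALSE, which keeps a single row as a matrix instead of collapsing it to a vector.

```r
m <- matrix(1:6, nrow = 2)   # filled by column: first column is 1, 2
mt <- t(m)
dim(m)
dim(mt)
# element [i, j] of the transpose is element [j, i] of the original
same.entry <- mt[3, 1] == m[1, 3]
# drop = FALSE keeps a single row as a 1-row matrix instead of a vector
row.as.matrix <- m[1, , drop = FALSE]
```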

SLIDE 33

Lists

# lists are similar to vectors but can contain different types
# create with list()
my.list <- list("just a string", 44, my.matrix, c(TRUE,TRUE,FALSE))
# access items via double brackets [[]]
my.list[[4]]
## [1] TRUE TRUE FALSE
# access multiple items
my.list[1:2]
## [[1]]
## [1] "just a string"
##
## [[2]]
## [1] 44
# list items can be named too
named.list <- list(Item1="my string", Item2=my.list)
# access of a named item is via the dollar sign operator
# [[]] also works
c(named.list$Item1,named.list[[1]])
## [1] "my string" "my string"
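
The difference between single and double brackets is a common stumbling block. In this sketch (demo.list and its contents are made up), single brackets return a shorter list while double brackets return the stored value itself.

```r
demo.list <- list(name = "iris", n = 150, flags = c(TRUE, FALSE))  # made-up contents
sub.list <- demo.list["n"]     # single brackets: a list of length one
element  <- demo.list[["n"]]   # double brackets: the stored value itself
class(sub.list)
class(element)
```

Arithmetic such as element + 1 works, whereas sub.list + 1 is an error because sub.list is still a list.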

SLIDE 34

Putting it together

Let’s practice with R data types by doing PCA on the iris data.
data("iris")
head(iris)
str(iris)
Note that iris is a data.frame, which is essentially a list of equal-length columns.

SLIDE 35

PCA outline

Save the numeric columns of iris as a matrix. (Hint: ?as.matrix)
Center and scale the matrix. (Hint: ?scale)
Compute the correlation matrix
R = \frac{1}{n-1} X^T X
Here X is our (centered and scaled) data matrix, n is the number of rows/observations in our data, and X^T is the transpose of X. (Hint: t(X) is the transpose operator and A %*% B performs matrix multiplication on the matrices A and B)

SLIDE 36

PCA outline cont.

Obtain the two leading eigenvectors of the correlation matrix R. Denote these as v_1, v_2. (Hint: ?eigen)
Compute the first and second principal components via
z_1 = X v_1
z_2 = X v_2
Produce a scatter plot of z_1 vs z_2. (Hint: ?plot)
Take a few moments to try it yourself before looking at the answers on the next slides.

SLIDE 37

PCA from scratch

data("iris")
# get numeric portions of list and make a matrix
X <- as.matrix(iris[1:4])
# center and scale
X <- scale(X,center = TRUE,scale=TRUE)
# get the number of rows
n <- nrow(X)
# compute correlation matrix
R <- (1/(n-1))*t(X)%*%X
# perform eigen decomposition
Reig <- eigen(R)
# get eigenvectors
Reig.vecs <- Reig$vectors
# create principal components
pc1 <- X%*%Reig.vecs[,1]
pc2 <- X%*%Reig.vecs[,2]

SLIDE 38

PCA from scratch cont.

# compare to R's PCA function
their.pcs <- prcomp(iris[1:4],center = TRUE,scale. = TRUE)
head(their.pcs$x[,1:2])
##            PC1        PC2
## [1,] -2.257141 -0.4784238
## [2,] -2.074013  0.6718827
## [3,] -2.356335  0.3407664
## [4,] -2.291707  0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
# our result
head(cbind(pc1,pc2))
##           [,1]       [,2]
## [1,] -2.257141 -0.4784238
## [2,] -2.074013  0.6718827
## [3,] -2.356335  0.3407664
## [4,] -2.291707  0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
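
The comparison can also be made programmatically. The sketch below recomputes the pieces so it runs on its own; eig and pca are made-up variable names. It relies on two facts: the eigenvalues of R are the variances of the principal components, and eigenvectors are only determined up to sign, so a sign flip relative to prcomp would not indicate an error.

```r
X <- scale(as.matrix(iris[1:4]))            # center and scale the numeric columns
R <- crossprod(X) / (nrow(X) - 1)           # same as t(X) %*% X / (n - 1)
eig <- eigen(R)
pca <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)
# eigenvalues of R equal the squared standard deviations from prcomp
vals.match <- isTRUE(all.equal(eig$values, pca$sdev^2))
# compare eigenvectors in absolute value because their signs are arbitrary
vecs.match <- isTRUE(all.equal(abs(eig$vectors), abs(unname(pca$rotation))))
# R is also what cor() returns for the original columns
cor.match <- isTRUE(all.equal(R, cor(iris[1:4])))
```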

SLIDE 39

PCA from scratch cont.

plot(pc1,pc2,col=iris$Species)

[Figure: scatter plot of pc2 (vertical axis) against pc1 (horizontal axis), points colored by species]

SLIDE 40

Factors

# Factors are like vectors, but with predefined allowed values called levels
# Factors are used to represent categorical variables in R
# create a factor
factor1 <- factor(c('Good','Bad','Ugly'))
# find its levels
levels(factor1)
## [1] "Bad" "Good" "Ugly"
# below gives a warning, but not an error
factor1[4] <- 17
## Warning in `[<-.factor`(`*tmp*`, 4, value = 17): invalid factor level, NA generated
# see what happened
factor1
## [1] Good Bad Ugly <NA>
## Levels: Bad Good Ugly
factor1[4] <- 'Bad'
# get the breakdown
table(factor1)
## factor1
## Bad Good Ugly
## 2 1 1
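
By default the levels are sorted alphabetically, as the output above shows. The following sketch (ratings and codes are made-up names) sets the level order explicitly and adds a new level before assigning a value that was not originally allowed, which avoids the NA warning.

```r
# made-up ratings; levels fixes both the allowed values and their order
ratings <- factor(c("Good", "Bad", "Ugly", "Bad"),
                  levels = c("Good", "Bad", "Ugly"))
levels(ratings)
# add a level before assigning a value that was not allowed originally
levels(ratings) <- c(levels(ratings), "Great")
ratings[2] <- "Great"
table(ratings)
# as.integer() exposes the underlying integer codes
codes <- as.integer(ratings)
```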

SLIDE 41

Note that in one of our previous examples R filled in the improper factor value with NA.
NA is R’s way of specifying missing data.
Note that missing data is handled differently than ordinary values, as we will see as we go along.
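
A first taste of that different handling, as a sketch on a made-up vector v: comparisons with NA produce NA rather than TRUE or FALSE, so is.na is the tool for locating missing values.

```r
v <- c(4, NA, 9)      # made-up vector with one missing value
v > 5                 # the comparison involving NA gives NA, not TRUE/FALSE
v == NA               # always NA, so this cannot be used to find missing values
missing.pos <- is.na(v)            # use is.na() instead
complete.vals <- v[!missing.pos]   # keep only the observed values
```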

SLIDE 42

Questions

What will the following lines of code do?
my.matrix[3:4,1:2] <- c(4,5)
my.matrix[4,5] <- 'string'
mf.strings <- c('F','F','M','F')
factor2 <- as.factor(mf.strings)
c(factor1, factor2)
factor1 == 'Ugly'
my.list[[3]][2,]
sum(c(1,2,3,NA))
sum(c(1,2,3,NA),na.rm = TRUE)

SLIDE 43

Data Frames

The data.frame is how R represents data sets. They are simply lists, with a few additional restrictions.
# create your own
my.df <- data.frame(
  age = c(45,27,19,59,71,13,5),
  gender = factor(c('M','M','M','F','M','F','F'))
)
str(my.df)
## 'data.frame': 7 obs. of 2 variables:
## $ age : num 45 27 19 59 71 13 5
## $ gender: Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1

SLIDE 44

Data Frames - Cont.

Individual variables can be accessed via the $ operator

my.df$age
## [1] 45 27 19 59 71 13 5
summary(my.df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    5.00   16.00   27.00   34.14   52.00   71.00
table(my.df$gender)
##
## F M
## 3 4
# data frames are really just lists
my.df[[2]]
## [1] M M M F M F F
## Levels: F M

SLIDE 45

Data Frames - Cont.

# data.frames can be subsetted like matrices
my.df[1:3,c("age")]
## [1] 45 27 19
# logical subsetting is especially useful for data.frames
# get ages over 40
age.logic <- my.df$age > 40
# take a subset of these rows
my.df[age.logic,]
##   age gender
## 1  45      M
## 4  59      F
## 5  71      M
# create a new variable age.sq
my.df$age.sq <- my.df$age^2
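
Two further subsetting patterns, sketched on a made-up data frame (demo.df is a hypothetical name): conditions can be combined with &, and drop = FALSE keeps a single selected column as a data frame instead of a plain vector.

```r
demo.df <- data.frame(age = c(45, 27, 19, 59),
                      gender = factor(c("M", "M", "F", "F")))  # made-up rows
# combine conditions with & to select rows
older.f <- demo.df[demo.df$age > 20 & demo.df$gender == "F", ]
# a single column comes back as a vector unless drop = FALSE is given
age.vec <- demo.df[1:2, "age"]
age.df  <- demo.df[1:2, "age", drop = FALSE]
```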

SLIDE 46

Try It Out

Let’s use R’s internal iris data set to practice with data frames
my.iris <- iris
my.iris

1 Create two new variables Length.Sum and Width.Sum which are the sum of Sepal and Petal length/width respectively.
2 Use subsetting and R’s mean function to find the average Length.Sum of the setosa species.

SLIDE 47

my.iris$Length.Sum <- my.iris$Sepal.Length + my.iris$Petal.Length
my.iris$Width.Sum <- my.iris$Sepal.Width + my.iris$Petal.Width
setosa.inds <- my.iris$Species == 'setosa'
mean(my.iris[setosa.inds,]$Length.Sum)
## [1] 6.468

SLIDE 48

Control Structures

R has all the typical control structures:
if-else statements
for loops
while loops

SLIDE 49

Syntax

if(logical_expression){
  execute_code
} else{
  execute_other_code
}
for(value in sequence){
  work_with_value
}
while(expression_is_true){
  execute_code
}
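
The templates above become concrete in the following sketch, which uses made-up variable names (labels, x, steps). The for loop with if-else labels the numbers 1 to 5 as odd or even, and the while loop counts how many halvings bring 100 below 1.

```r
# classify the numbers 1..5 as even or odd with for and if-else
labels <- character(0)
for (value in 1:5) {
  if (value %% 2 == 0) {
    labels <- c(labels, "even")
  } else {
    labels <- c(labels, "odd")
  }
}
# while loop: count how many halvings bring 100 below 1
x <- 100
steps <- 0
while (x >= 1) {
  x <- x / 2
  steps <- steps + 1
}
```

Note that else sits on the same line as the closing brace of the if block; at the top level of a script, R would otherwise treat the if statement as finished.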

SLIDE 50

Functions

Defining functions in R is easy
# use the function keyword with assignment <-
my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # the value of the last expression gets returned
  # (invisibly here, because the last expression is an assignment)
  return.me <- sum / length(input.vector)
}
my.mean(1:10)

SLIDE 51

Functions cont.

my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # returns 1 now
  return.me <- sum / length(input.vector)
  1
}
my.mean(1:10)
## [1] 1
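
To avoid depending on which expression happens to come last, the value can be handed back explicitly with return(). This is a sketch, with my.mean2 and total as made-up names chosen so that the built-in sum function is not shadowed.

```r
my.mean2 <- function(input.vector) {   # my.mean2 is a hypothetical name
  total <- 0
  for (val in input.vector) {
    total <- total + val
  }
  # return() makes the returned value explicit and visible
  return(total / length(input.vector))
}
result <- my.mean2(1:10)
```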

SLIDE 52

Try It Out

Create a function my.summary which takes a vector x as input, calculates the mean, standard deviation, max, and min of x, and returns these in a list.
Try out R’s internal functions mean, sd, max, min.

SLIDE 53

my.summary <- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    max = max(x),
    min = min(x)
  )
}

SLIDE 54

Try It Out cont.

Loop through the variables in my.iris, evaluating my.summary on each (provided the variable is numeric) and printing the maximum. Hint: Use is.numeric to test each variable before applying my.summary

SLIDE 55

for(var in my.iris) {
  if(is.numeric(var)){
    tmp <- my.summary(var)
    print(tmp$max)
  }
}
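
As an alternative to the explicit loop (not part of the original slides), the same maxima can be collected with Filter and sapply. The sketch below works on the built-in iris data directly so it runs on its own; numeric.cols and col.max are made-up names.

```r
# keep only the numeric columns, then take the max of each
numeric.cols <- Filter(is.numeric, iris)
col.max <- sapply(numeric.cols, max)
col.max
```

The result is a named vector, so each maximum stays attached to the variable it came from.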
