An Introduction to Statistical Computing in R
K2I Data Science Boot Camp - Day 1 AM Session
May 15, 2017
Statistical Computing in R May 15, 2017 1 / 55
AM Session Outline
Intro to R Basics Plotting In R Data Manipulation Statistical
1 Set your working directory to the directory containing the files.
2 Examine the files in a text editor to check for common options
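The two steps above can be sketched in R as follows. This is only an illustration: the file name example.csv, its contents, and the use of a temporary directory are assumptions made so that the code runs anywhere; they are not the files used in the session.

```r
# Hypothetical setup: write a tiny CSV file into a temporary directory
data.dir <- tempdir()
writeLines(c("age,gender", "45,M", "27,M", "59,F"),
           file.path(data.dir, "example.csv"))

# Step 1: set the working directory to the directory containing the files
old.wd <- setwd(data.dir)

# Step 2: look at the first raw lines to spot the header row and separator
readLines("example.csv", n = 2)

# Read the file with the options observed above (header present, comma separated)
example.df <- read.csv("example.csv", header = TRUE, sep = ",")
str(example.df)

# Restore the previous working directory
setwd(old.wd)
```

getwd() reports the current working directory, which is a quick way to confirm that step 1 did what was intended.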
# the vector is the most important data structure
# create it with c()
my.vec <- c(1,2,67,-98)
# get some properties
str(my.vec)
## num [1:4] 1 2 67 -98
length(my.vec)
## [1] 4
# access elements with []
my.vec[3]
## [1] 67
my.vec[c(3,4)]
## [1] 67 -98
# can do assignment too
my.vec[5] <- 41.2
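A few more indexing behaviours are worth trying alongside the vector example. The sketch below uses a separate, hypothetical vector v so that it does not disturb my.vec; it shows what happens when assigning past the end of a vector, plus negative and logical indexing.

```r
v <- c(1, 2, 67, -98)

# Assigning beyond the current length pads the gap with NA
v[6] <- 10
v
## [1]   1   2  67 -98  NA  10
length(v)
## [1] 6

# A negative index drops elements instead of selecting them
v[-1]
## [1]   2  67 -98  NA  10

# A logical vector keeps the positions that are TRUE;
# !is.na(v) guards against the NA created above
positive <- v[!is.na(v) & v > 0]
positive
## [1]  1  2 67 10
```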
# other ways to create vectors
x <- 1:6
y <- seq(7,12,by=1)
# Operations get recycled through whole vector
x + 1
## [1] 2 3 4 5 6 7
x > 3
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
# Can do component wise operations between vectors
x * y
## [1] 7 16 27 40 55 72
x / y
## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000
y %/% x
## [1] 7 4 3 2 2 2
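Recycling also applies when the two vectors have different lengths, which is an easy source of silent bugs. The following sketch reuses the same x and y values; the vector named short is a hypothetical addition for illustration. It also shows %%, the remainder operator that pairs with the integer division %/% used above.

```r
x <- 1:6
y <- seq(7, 12, by = 1)

# A shorter vector is repeated until it matches the longer one
short <- c(10, 20)
recycled <- x + short   # short is used as 10 20 10 20 10 20
recycled
## [1] 11 22 13 24 15 26

# %% gives the remainder left over by integer division
remainder <- y %% x
remainder
## [1] 0 0 0 2 1 0

# quotient * divisor + remainder recovers the original values
all((y %/% x) * x + remainder == y)
## [1] TRUE
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.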
# matrices are 2d vectors.
# create using matrix()
my.matrix <- matrix(rnorm(20),nrow=4,ncol=5)
# rnorm() draws 20 random samples from a N(0,1) distribution
my.matrix
##            [,1]        [,2]       [,3]      [,4]       [,5]
## [1,]  0.5351131  1.08710882  0.5670939 0.2800755 -0.8050743
## [2,] -1.9263838  0.86267009  0.7318280 0.4177110 -0.9576529
## [3,] -1.2931770 -1.03381286 -0.9035750 1.9787516  0.3747967
## [4,] -2.6190953 -0.04829205  1.3157181 1.2562005  0.1131199
# note matrices are filled by column
# Get details
dim(my.matrix)
## [1] 4 5
nrow(my.matrix)
## [1] 4
ncol(my.matrix)
## [1] 5
# Indexing is similar to vectors but with 2 dimensions
# get second row
my.matrix[2,]
## [1] -1.9263838 0.8626701 0.7318280 0.4177110 -0.9576529
# get first and fourth columns of row three
my.matrix[3,c(1,4)]
## [1] -1.293177 1.978752
# transposing done with t()
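Column-wise filling and t() are easier to see with small integers than with random draws. In this sketch the names m and m.byrow are hypothetical; byrow = TRUE is the matrix() argument that switches to row-wise filling.

```r
# Default: values fill the matrix one column at a time
m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# byrow = TRUE fills one row at a time instead
m.byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m.byrow
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

# t() swaps rows and columns, so a 2 x 3 matrix becomes 3 x 2
m.t <- t(m)
dim(m.t)
## [1] 3 2
```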
# lists similar to vectors but contain different types
# create with list
my.list <- list("just a string", 44, my.matrix, c(TRUE,TRUE,FALSE))
# access items via double brackets [[]]
my.list[[4]]
## [1] TRUE TRUE FALSE
# access multiple items
my.list[1:2]
## [[1]]
## [1] "just a string"
##
## [[2]]
## [1] 44
# list items can be named too
named.list <- list(Item1="my string", Item2=my.list)
# access of named item is via dollar sign operator
# [[]] also works
c(named.list$Item1,named.list[[1]])
## [1] "my string" "my string"
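The difference between single and double brackets on a list is a common stumbling block, so a small sketch may help. The list example.list and its item names are hypothetical and chosen only for illustration.

```r
example.list <- list(label = "just a string",
                     value = 44,
                     flags = c(TRUE, TRUE, FALSE))

# Single brackets return a (shorter) list
single <- example.list["value"]
class(single)
## [1] "list"

# Double brackets return the item itself
double <- example.list[["value"]]
class(double)
## [1] "numeric"

# So arithmetic works on the [[ ]] result
double + 1
## [1] 45

# names() lists the item names that can be used with $ or [[ ]]
names(example.list)
## [1] "label" "value" "flags"
```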
# compare to R's PCA function
their.pcs <- prcomp(iris[1:4],center = TRUE,scale. = TRUE)
head(their.pcs$x[,1:2])
##            PC1        PC2
## [1,] -2.257141 -0.4784238
## [2,] -2.074013  0.6718827
## [3,] -2.356335  0.3407664
## [4,] -2.291707  0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
# our result
head(cbind(pc1,pc2))
##           [,1]       [,2]
## [1,] -2.257141 -0.4784238
## [2,] -2.074013  0.6718827
## [3,] -2.356335  0.3407664
## [4,] -2.291707  0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
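The code that produced pc1 and pc2 is not part of this text, so the following is only one possible way to compute principal component scores by hand, not necessarily the approach used in the session. It standardizes the four numeric iris columns, takes the eigenvectors of their covariance matrix, and projects the data onto them. The names iris.scaled, eig, scores, sketch.pc1 and sketch.pc2 are hypothetical. Eigenvectors are only defined up to sign, so the comparison with prcomp() is made on absolute values.

```r
# Standardize the four numeric columns (mean 0, standard deviation 1)
iris.scaled <- scale(iris[, 1:4], center = TRUE, scale = TRUE)

# Eigen-decomposition of the covariance matrix of the scaled data
# (this equals the correlation matrix of the original columns)
eig <- eigen(cov(iris.scaled))

# Project the scaled data onto the eigenvectors to get the scores
scores <- iris.scaled %*% eig$vectors
sketch.pc1 <- scores[, 1]
sketch.pc2 <- scores[, 2]

# Reference result from R's built-in PCA function
reference <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Compare up to sign, since each component may be flipped
scores.match <- all.equal(abs(unname(scores[, 1:2])),
                          abs(unname(reference$x[, 1:2])))

# The square roots of the eigenvalues match prcomp's standard deviations
sdev.match <- all.equal(sqrt(eig$values), reference$sdev)
```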
[Figure: scatter plot of pc1 and pc2]
# Factors are like vectors, but with predefined allowed values called levels
# Factors are used to represent categorical variables in R
# create a factor
factor1 <- factor(c('Good','Bad','Ugly'))
# find its levels
levels(factor1)
## [1] "Bad" "Good" "Ugly"
# below gives warning, but not error
factor1[4] <- 17
## Warning in `[<-.factor`(`*tmp*`, 4, value = 17): invalid factor level, NA generated
# see what happened
factor1
## [1] Good Bad Ugly <NA>
## Levels: Bad Good Ugly
factor1[4] <- 'Bad'
# get the breakdown
table(factor1)
## factor1
## Bad Good Ugly
## 2 1 1
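The warning above arises because 17 is not one of the factor's levels. A minimal sketch of two ways to allow a new value follows; the level "Great" and the names ratings and f are hypothetical and used only for illustration.

```r
# Option 1: declare all allowed levels when the factor is created
ratings <- factor(c("Good", "Bad", "Ugly"),
                  levels = c("Bad", "Good", "Ugly", "Great"))
ratings[4] <- "Great"   # no warning, "Great" is an allowed level

# Option 2: extend the levels of an existing factor before assigning
f <- factor(c("Good", "Bad", "Ugly"))
levels(f) <- c(levels(f), "Great")
f[4] <- "Great"

# Each level now appears once
f.counts <- table(f)
f.counts
```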
my.df$age
## [1] 45 27 19 59 71 13 5
summary(my.df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    5.00   16.00   27.00   34.14   52.00   71.00
table(my.df$gender)
##
## F M
## 3 4
# data frames are really just lists
my.df[[2]]
## [1] M M M F M F F
## Levels: F M
# data.frames can be subsetted like matrices
my.df[1:3,c("age")]
## [1] 45 27 19
# logical subsetting especially useful for data.frames
# get ages over 40
age.logic <- my.df$age > 40
# take a subset of these rows
my.df[age.logic,]
##   age gender
## 1  45      M
## 4  59      F
## 5  71      M
# create a new variable age.sq
my.df$age.sq <- my.df$age^2
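The line that created my.df is not included in this text. The sketch below builds a data frame with the same age and gender values shown in the output above, under the hypothetical name example.df, and then combines two logical conditions in one subsetting step. How the original my.df was constructed is an assumption.

```r
# Rebuild a data frame with the values printed in the slides
example.df <- data.frame(
  age    = c(45, 27, 19, 59, 71, 13, 5),
  gender = factor(c("M", "M", "M", "F", "M", "F", "F"))
)

# Combine conditions with & (and) or | (or) inside the row index
older.men <- example.df[example.df$age > 40 & example.df$gender == "M", ]
older.men
##   age gender
## 1  45      M
## 5  71      M

# Subset one column by a condition on another, then summarize it
mean.age.m <- mean(example.df$age[example.df$gender == "M"])
mean.age.m
## [1] 40.5
```

The comma after the row condition matters: leaving the column position empty keeps all columns.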
1 Create two new variables Length.Sum and Width.Sum which are the
2 Use subsetting and R’s mean function to find the average