SLIDE 1 UGent

STATS

VM

Applied Statistics and Data Modeling

An introduction to R

Luc Duchateau (1), Paul Janssen (2)

(1) Faculty of Veterinary Medicine, Ghent University, Belgium
(2) Center for Statistics, Hasselt University, Belgium

2020

  • L. Duchateau & P.Janssen

(UH & UG) Applied Statistics and Data Modeling 2020 1 / 38

SLIDE 2

Overview

1. R and RStudio
     What is R and RStudio?
     Installation of R and RStudio
     Using RStudio
2. R as a calculator
3. Some R concepts
     R help
     Objects
4. R functions
5. Data
     What are data?
     Reading in data
     Exploring data
6. The function lm

SLIDE 3

R and RStudio What is R and RStudio?

What is R?

  • Programming language
  • Open source
  • Software environment
  • 8 basic packages + 14574 other packages available
  • Packages installed via install.packages("package name")
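A minimal sketch of installing a package only when it is missing. The package name used here ("stats") is just an illustrative choice: it ships with R, so nothing is downloaded when the block is run.

```r
# Illustrative package name; "stats" is part of every R installation.
pkg <- "stats"

# requireNamespace() returns FALSE when the package is not installed.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)  # only reached when the package is missing
}

# Attach the package so that its functions can be called directly.
library(pkg, character.only = TRUE)
```

Replacing "stats" by the name of any CRAN package would trigger a download on first use.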

SLIDE 4

R and RStudio What is R and RStudio?

What is RStudio?

  • Alternative interface to R: an integrated development environment (IDE) that runs an installed R
  • Packages can be installed via Tools > Install Packages

SLIDE 5

R and RStudio Installation of R and RStudio

Installation of R

https://cran.r-project.org/

SLIDE 6

R and RStudio Installation of R and RStudio

Installation of RStudio

https://www.rstudio.com/

SLIDE 7

R and RStudio Using RStudio

Interface of RStudio

SLIDE 8

R and RStudio Using RStudio

Script in RStudio

SLIDE 9

R and RStudio Using RStudio

Run command in RStudio

SLIDE 10

R as a calculator

R as calculator

2+3
## [1] 5
(5+11)/2-9
## [1] -1
2^3
## [1] 8
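As a small addition to the calculator examples, the sketch below illustrates operator precedence and the integer-division and remainder operators. The variable names are only there to keep the results.

```r
a <- 2 + 3 * 4     # multiplication before addition: 14
b <- (2 + 3) * 4   # parentheses change the order: 20
d <- -2^2          # exponentiation before unary minus: -4
q <- 7 %/% 2       # integer division: 3
r <- 7 %% 2        # remainder: 1

print(c(a, b, d, q, r))
```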

SLIDE 11

Some R concepts R help

R help

Built-in help
help(mean)
?mean

Online help
  • StackOverflow
  • StackExchange
  • R-bloggers
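Two further built-in helpers can be useful when the exact function name is not known. This is a short sketch, not part of the original slides: apropos() searches object names by pattern, and args() shows the arguments of a function.

```r
# All visible objects whose name starts with "mean".
hits <- apropos("^mean")
print(hits)

# Argument list of the standard deviation function.
print(args(sd))
```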

SLIDE 12

Some R concepts Objects

Scalars

Objects: scalars, vectors, datasets, ...

Creating objects: assignment operator (<-)
height <- 173
height
## [1] 173

Case sensitive
height <- 173
Height <- 186
height
## [1] 173
Height
## [1] 186

SLIDE 13

Some R concepts Objects

Scalars

Calculations with objects
height <- 173
weight <- 63
BMI <- weight/(height/100)^2
BMI
## [1] 21.04982

Text objects
Greeting <- "Hello world!"
Greeting
## [1] "Hello world!"

SLIDE 14

Some R concepts Objects

Vectors

Vectors: function c()

A numeric vector
x <- c(1, 1, 2, 3, 5, 8)
x
## [1] 1 1 2 3 5 8

A character vector
y <- c("Belgium", "Portugal", "Italy")
y
## [1] "Belgium"  "Portugal" "Italy"

SLIDE 15

Some R concepts Objects

Vectors

Calculating with vectors
x*2
## [1]  2  2  4  6 10 16
x^2
## [1]  1  1  4  9 25 64
x*x
## [1]  1  1  4  9 25 64
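The calculations above work element by element. The sketch below shows the same idea with two vectors of equal length; the second vector w is a made-up example.

```r
x <- c(1, 1, 2, 3, 5, 8)
w <- c(10, 20, 30, 40, 50, 60)  # hypothetical second vector

s <- x + w      # element-wise sum:     11 21 32 43 55 68
p <- x * w      # element-wise product: 10 20 60 120 250 480
total <- sum(x) # sum of all elements:  20
n <- length(x)  # number of elements:   6

print(s)
print(p)
```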

SLIDE 16

R functions

Functions

We already used one function to create a vector: c()
x <- c(1, 1, 2, 3, 5, 8)
x
## [1] 1 1 2 3 5 8

A function has a name and a list of arguments separated by commas.
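To illustrate the name-plus-arguments structure, here is a sketch of a user-defined function. The function bmi() is a hypothetical helper, not from the slides; it repeats the BMI calculation shown earlier.

```r
# Hypothetical helper: body mass index from weight (kg) and height (cm).
bmi <- function(weight, height) {
  weight / (height / 100)^2
}

by_position <- bmi(63, 173)                   # arguments matched by position
by_name     <- bmi(height = 173, weight = 63) # arguments matched by name

print(by_position)
print(round(x = by_position, digits = 2))     # built-in function, named arguments
```

Naming the arguments makes the call independent of their order.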

SLIDE 17

R functions

Math functions

Trigonometric:
sin(pi/2)
## [1] 1
cos(0)
## [1] 1
tan(0)
## [1] 0
asin(1)
## [1] 1.570796
acos(1)
## [1] 0
atan(0)
## [1] 0

SLIDE 18

R functions

Math functions

Rounding
round(8.6178,2)
## [1] 8.62
floor(8.6178)
## [1] 8
signif(8.6178,2)
## [1] 8.6
sign(8.6178)
## [1] 1
sign(-8.6178)
## [1] -1
abs(-8.6178)
## [1] 8.6178

SLIDE 19

R functions

Math functions

Logarithms & exponentials
exp(0)
## [1] 1
log(1)
## [1] 0
log10(1000)
## [1] 3
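Note that log() is the natural logarithm. As a short sketch, other bases can be requested through the base argument, and exp() undoes log():

```r
l2  <- log(8, base = 2)     # logarithm to base 2: 3
l2b <- log2(8)              # shortcut for base 2: 3
l10 <- log(100, base = 10)  # same as log10(100): 2
e   <- exp(1)               # Euler's number, about 2.718282
back <- exp(log(5))         # 5, up to floating-point rounding

print(c(l2, l2b, l10, e, back))
```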

SLIDE 20

R functions

Math functions

Others
sqrt(25)
## [1] 5
factorial(4)
## [1] 24

SLIDE 21

R functions

Statistical functions

x <- c(1, 3, 4, 6, 2, 8)
mean(x)
## [1] 4
var(x)
## [1] 6.8
sd(x)
## [1] 2.607681
quantile(x)
##   0%  25%  50%  75% 100%
## 1.00 2.25 3.50 5.50 8.00
sort(x)
## [1] 1 2 3 4 6 8
rank(x)
## [1] 1 3 4 5 2 6
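As a check on what var() and sd() compute, the sketch below recalculates the sample variance by hand, dividing the sum of squared deviations by n - 1. The intermediate names (n, dev, v, s) are chosen for illustration.

```r
x <- c(1, 3, 4, 6, 2, 8)

n   <- length(x)        # 6 observations
dev <- x - mean(x)      # deviations from the mean: -3 -1 0 2 -2 4
v   <- sum(dev^2) / (n - 1)  # 34 / 5 = 6.8
s   <- sqrt(v)          # standard deviation

print(c(v, var(x)))
print(c(s, sd(x)))
```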

SLIDE 22

R functions

Using functions on vectors

x <- c(4, 16, 9, 25)
sqrt(x)
## [1] 2 4 3 5
log(x)
## [1] 1.386294 2.772589 2.197225 3.218876
exp(sqrt(x))
## [1]   7.389056  54.598150  20.085537 148.413159

SLIDE 23

Data What are data?

Dataset

  breed         size    litters  weight
1 Maine coon    large   2        5.1
2 Russian blue  small   0        3.9
3 Bengal        medium  0        4.5
4 Ragdol        medium  1        4.8
5 Chartreux     large   1        5.2
6 Siamese       small   2        4.1
7 Persian       medium  2        4.2
8 Maine coon    large   3        4.8


Annotations on the table: columns are variables, rows are observations. Variable types shown: nominal, ordinal, discrete, continuous.
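One possible way to represent these four variable types in R is sketched below, using the first three rows of the cats data. The object name cats_demo and the choice of an ordered factor for size are assumptions made for illustration.

```r
cats_demo <- data.frame(
  breed   = c("Maine coon", "Russian Blue", "Bengal"),  # nominal: text labels
  size    = factor(c("large", "small", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE),                     # ordinal: ordered factor
  litters = c(2L, 0L, 0L),                              # discrete: integer counts
  weight  = c(5.1, 3.9, 4.5)                            # continuous: numeric
)

# Ordered factors can be compared: which cats are smaller than "large"?
smaller <- cats_demo$size < "large"
print(cats_demo)
print(smaller)
```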

SLIDE 24

Data Reading in data

Reading in data

Different formats: .xls(x), .csv, .txt, ...

Most important distinguishing properties:

  • header: does the first row contain column names?
  • column separator: comma, semicolon, tab?
  • decimal sign: point, comma?

General function in R to read in data: read.table()

args(read.table)
## function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
##     numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
##     col.names, as.is = !stringsAsFactors, na.strings = "NA",
##     colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
##     fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
##     comment.char = "#", allowEscapes = FALSE, flush = FALSE,
##     stringsAsFactors = default.stringsAsFactors(), fileEncoding = "",
##     encoding = "unknown", text, skipNul = FALSE)
## NULL
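The three distinguishing properties map directly onto the header, sep and dec arguments. The sketch below uses made-up file content with semicolon separators and decimal commas, passed through the text argument so that no file is needed.

```r
# Hypothetical file content: semicolon-separated, decimal comma.
txt <- "breed;weight
Bengal;4,5
Siamese;4,1"

# header = TRUE: the first row holds the column names
# sep = ";":     columns are separated by semicolons
# dec = ",":     the decimal sign is a comma
demo <- read.table(text = txt, header = TRUE, sep = ";", dec = ",")

print(demo)
print(mean(demo$weight))
```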

SLIDE 25

Data Reading in data

Reading in data

Specific functions for specific formats

Function                  Format  Header  Column separator  Decimal sign
read.csv()                .csv    TRUE    ","               "."
read.csv(file, sep=";")   .csv    TRUE    ";"               "."
read.csv2()               .csv    TRUE    ";"               ","
read.delim()              .txt    TRUE    tab               "."
read.delim2()             .txt    TRUE    tab               ","

SLIDE 26

Data Reading in data

Reading in cats data

In this course we use .csv files: cats.csv

First open the csv file in Notepad

breed;size;litters;weight
Maine coon;large;2;5.1
Russian Blue;small;0;3.9
Bengal;medium;0;4.5
British Shorthair;medium;1;4.8
Chartreux;large;1;5.2
Siamese;small;2;4.1
Persian;medium;2;4.2
Maine Coon;large;3;4.8

SLIDE 27

Data Reading in data

Reading in cats data

The content of cats.csv (listed on the previous slide) shows:

  • separator: semicolon
  • decimal sign: point

SLIDE 28

Data Reading in data

Reading in cats data

  • separator: semicolon
  • decimal sign: point

Function                  Format  Header  Column separator  Decimal sign
read.csv()                .csv    TRUE    ","               "."
read.csv(file, sep=";")   .csv    TRUE    ";"               "."
read.csv2()               .csv    TRUE    ";"               ","
read.delim()              .txt    TRUE    tab               "."
read.delim2()             .txt    TRUE    tab               ","

Most appropriate function: read.csv(file, sep=";")
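The sketch below illustrates why the choice matters, using the first rows of the cats data passed in through the text argument (an assumption made to keep the example self-contained). With read.csv2() the separator is right but the decimal sign is wrong, so weight is not read as a number.

```r
txt <- "breed;size;litters;weight
Maine coon;large;2;5.1
Russian Blue;small;0;3.9
Bengal;medium;0;4.5"

ok  <- read.csv(text = txt, sep = ";")  # separator ";", decimal sign "."
bad <- read.csv2(text = txt)            # separator ";", decimal sign ","

print(is.numeric(ok$weight))   # weight read as numbers
print(is.numeric(bad$weight))  # weight read as text
```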

SLIDE 29

Data Reading in data

Reading in cats data

Find out what your working directory is
getwd()
## [1] "c:/users/lduchate/docs/OC/onderwijs/ADEKUS2020"

Set the working directory to the directory where the file is located
setwd("c:/users/lduchate/docs/OC/onderwijs/ADEKUS2020/data")
# this path will be different for every student

Read in the data and store them in an object cats.
cats <- read.csv("cats.csv", sep=";")
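An alternative to changing the working directory is to build the full path to the file. The sketch below writes a small demo file to R's temporary directory and reads it back; the file name cats_demo.csv and the two data rows are assumptions for illustration only.

```r
# Two rows of the cats data, written to a temporary demo file.
csv_lines <- c("breed;size;litters;weight",
               "Maine coon;large;2;5.1",
               "Siamese;small;2;4.1")
path <- file.path(tempdir(), "cats_demo.csv")
writeLines(csv_lines, path)

# Check that the file exists before reading it.
found <- file.exists(path)

# Read the file using its full path; no setwd() needed.
cats_demo <- read.csv(path, sep = ";")
print(cats_demo)
```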

SLIDE 30

Data Exploring data

Exploring the cats data

Print the data in case of a small dataset
cats
##               breed   size litters weight
## 1        Maine coon  large       2    5.1
## 2      Russian Blue  small       0    3.9
## 3            Bengal medium       0    4.5
## 4 British Shorthair medium       1    4.8
## 5         Chartreux  large       1    5.2
## 6           Siamese  small       2    4.1
## 7           Persian medium       2    4.2
## 8        Maine Coon  large       3    4.8

SLIDE 31

Data Exploring data

Exploring the cats data

Print a few lines in case of a large dataset
head(cats,n=5L)
##               breed   size litters weight
## 1        Maine coon  large       2    5.1
## 2      Russian Blue  small       0    3.9
## 3            Bengal medium       0    4.5
## 4 British Shorthair medium       1    4.8
## 5         Chartreux  large       1    5.2

SLIDE 32

Data Exploring data

Exploring the cats data

Checking the data structure
str(cats)
## 'data.frame':    8 obs. of  4 variables:
##  $ breed  : Factor w/ 8 levels "Bengal","British Shorthair",..: 4 7 1 2 3 8 6 5
##  $ size   : Factor w/ 3 levels "large","medium",..: 1 3 2 2 1 3 2 1
##  $ litters: int  2 0 0 1 1 2 2 3
##  $ weight : num  5.1 3.9 4.5 4.8 5.2 4.1 4.2 4.8
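The output above shows breed and size as factors, which is the default in R versions before 4.0.0. From R 4.0.0 on, read.csv() keeps text columns as character vectors unless stringsAsFactors = TRUE is given. The sketch below reproduces the factor version; the data are passed through the text argument only to keep the example self-contained.

```r
txt <- "breed;size;litters;weight
Maine coon;large;2;5.1
Russian Blue;small;0;3.9
Bengal;medium;0;4.5
British Shorthair;medium;1;4.8
Chartreux;large;1;5.2
Siamese;small;2;4.1
Persian;medium;2;4.2
Maine Coon;large;3;4.8"

# Explicitly ask for factors, independent of the R version.
cats_fct <- read.csv(text = txt, sep = ";", stringsAsFactors = TRUE)

str(cats_fct)
print(levels(cats_fct$size))  # levels are sorted alphabetically
```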

SLIDE 33

Data Exploring data

Exploring the cats data

Extracting one column from a dataset
cats[[4]]
## [1] 5.1 3.9 4.5 4.8 5.2 4.1 4.2 4.8
cats$weight
## [1] 5.1 3.9 4.5 4.8 5.2 4.1 4.2 4.8
cats[["weight"]]
## [1] 5.1 3.9 4.5 4.8 5.2 4.1 4.2 4.8
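Rows can be extracted in a similar way, by putting a condition before the comma inside square brackets. This is a sketch that goes slightly beyond the slide; the object names large_cats and heavy_breeds are illustrative.

```r
cats <- read.csv(text = "breed;size;litters;weight
Maine coon;large;2;5.1
Russian Blue;small;0;3.9
Bengal;medium;0;4.5
British Shorthair;medium;1;4.8
Chartreux;large;1;5.2
Siamese;small;2;4.1
Persian;medium;2;4.2
Maine Coon;large;3;4.8", sep = ";")

# All columns for the large cats only.
large_cats <- cats[cats$size == "large", ]

# Breed names of the cats heavier than 4.5.
heavy_breeds <- as.character(cats[cats$weight > 4.5, "breed"])

print(large_cats)
print(heavy_breeds)
```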

SLIDE 34

Data Exploring data

Exploring the cats data

Calculating means
mean(cats$weight)
## [1] 4.575
with(cats, tapply(weight, size, FUN=mean))
##    large   medium    small
## 5.033333 4.500000 4.000000

Calculating standard deviations
sd(cats$weight)
## [1] 0.4773438
with(cats, tapply(weight, size, FUN=sd))
##     large    medium     small
## 0.2081666 0.3000000 0.1414214
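The same group means can also be obtained with aggregate(), which uses a formula and returns a data frame instead of a named vector. This is an alternative sketch, not the approach used on the slide.

```r
cats <- read.csv(text = "breed;size;litters;weight
Maine coon;large;2;5.1
Russian Blue;small;0;3.9
Bengal;medium;0;4.5
British Shorthair;medium;1;4.8
Chartreux;large;1;5.2
Siamese;small;2;4.1
Persian;medium;2;4.2
Maine Coon;large;3;4.8", sep = ";")

# Mean weight per size class, returned as a data frame.
agg <- aggregate(weight ~ size, data = cats, FUN = mean)
print(agg)

# Same numbers as the tapply() call on the slide.
tap <- with(cats, tapply(weight, size, FUN = mean))
print(tap)
```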

SLIDE 35

Data Exploring data

Exploring the cats data

Scatterplot and boxplot
par(mfrow=c(1,2), pty="s")
#plot.default(weight~size, data=cats, type="p")
par(mar=c(4,4,0,0))
with(cats, plot.default(weight~size))
boxplot(weight~size, data=cats)

SLIDE 36

Data Exploring data

[Figure: left, scatterplot of weight against size with size coded numerically (1 to 3); right, boxplot of weight by size (large, medium, small).]

SLIDE 37

The function lm

The function lm

Fitting a linear model: the function lm
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr",
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
##     contrasts = NULL, offset, ...)
## NULL

SLIDE 38

The function lm

Linear model for the cats data

lm(weight~size, cats)
##
## Call:
## lm(formula = weight ~ size, data = cats)
##
## Coefficients:
## (Intercept)   sizemedium    sizesmall
##      5.0333      -0.5333      -1.0333
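The coefficients can be related to the group means computed earlier: the intercept is the mean weight of the reference level (large), and the other two coefficients are the differences of the medium and small means from it. The sketch below checks this; the object names fit and means are illustrative.

```r
cats <- read.csv(text = "breed;size;litters;weight
Maine coon;large;2;5.1
Russian Blue;small;0;3.9
Bengal;medium;0;4.5
British Shorthair;medium;1;4.8
Chartreux;large;1;5.2
Siamese;small;2;4.1
Persian;medium;2;4.2
Maine Coon;large;3;4.8", sep = ";")

fit   <- lm(weight ~ size, data = cats)
means <- with(cats, tapply(weight, size, FUN = mean))

print(coef(fit))
print(means["large"])                     # equals the intercept
print(means["medium"] - means["large"])   # equals sizemedium
print(means["small"]  - means["large"])   # equals sizesmall
```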