Introduction to data frames Steve Bagley somgen223.stanford.edu 1 - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Introduction to data frames

Steve Bagley

somgen223.stanford.edu 1

slide-2
SLIDE 2

Using packages from the tidyverse


slide-3
SLIDE 3

Need to install the tidyverse set of packages

  • Type this:

install.packages("tidyverse")

  • “tidyverse” is a coherent set of packages for operating on a kind of data called the “data frame.”
  • It is not built-in, so you need to install it (once), then load it each time you restart R.
  • Put library(tidyverse) or library("tidyverse") at the top of every script file.
  • More about packages later.
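
One way to follow the advice above in a script is to guard the one-time install and then load the package. This is only a sketch: the CRAN mirror URL is an assumption, and any mirror should work.

```r
# Install the tidyverse only if it is not already installed (one-time step).
# The mirror URL below is an assumption; any CRAN mirror can be used.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load it at the top of every script (needed after each restart of R).
library(tidyverse)
```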


slide-4
SLIDE 4

Data frame: a two-dimensional data structure

A data frame is one of the most powerful features in R.

  • It is a rectangular data structure that can contain different types of data, similar to an Excel spreadsheet.
  • Typically, each row in a data frame contains information about one instance of some (real-world) object.
  • Each column can be thought of as a variable, containing the values for the corresponding instances.
  • All the values in one column should be of the same type, but different columns can be of different types.
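
To make the row/column description concrete, here is a small sketch that builds a data frame by hand with tibble(). The patients data and its column names are made up for illustration and are not part of the course data.

```r
library(tidyverse)

# Hypothetical example data: one row per patient, one column per variable.
patients <- tibble(
  id      = c("p01", "p02", "p03"),   # character column
  age     = c(34, 51, 27),            # numeric (double) column
  treated = c(TRUE, FALSE, TRUE)      # logical column
)

patients
dim(patients)            # number of rows, then number of columns
sapply(patients, class)  # the type of each column
```

Each column holds a single type, while the three columns differ in type from one another.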


slide-5
SLIDE 5

Data frame example

bod <- as_tibble(BOD)
bod
# A tibble: 6 x 2
   Time demand
  <dbl>  <dbl>
1     1    8.3
2     2   10.3
3     3   19
4     4   16
5     5   15.6
6     7   19.8

  • This data set contains data on biological oxygen demand.
  • A tibble is a kind of data frame. This one has 6 rows and 2 columns.
  • Across the top is the name of each column. The next row shows the type of data in the column: <dbl> means double-precision floating point number.
  • The row numbers are added for printing. They are not part of the data frame.


slide-6
SLIDE 6

Data frame example

bod
# A tibble: 6 x 2
   Time demand
  <dbl>  <dbl>
1     1    8.3
2     2   10.3
3     3   19
4     4   16
5     5   15.6
6     7   19.8

Check claim on earlier slide:

  • Typically, each row in a data frame describes an instance of some (real-world) object. (Yes: one row for each subject, time, or measurement.)
  • Each column contains the values of a variable for the corresponding instance. (Yes: one column for each variable.)


slide-7
SLIDE 7

Data frame functions

  • The rest of this section shows data frame functions (“verbs”).
  • Each function takes a data frame and produces a new data frame.


slide-8
SLIDE 8

Data frame function: filtering rows

filter(bod, Time < 4)
# A tibble: 3 x 2
   Time demand
  <dbl>  <dbl>
1     1    8.3
2     2   10.3
3     3   19

  • This produces (and prints out) a new tibble, which contains all the rows where the Time value in that row is less than 4.
  • There are only 3 rows in this data frame.
  • There are still 2 columns.
  • filter does not modify bod, which still has 6 rows and 2 columns.
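
A short sketch of the last point: the filtered result has to be saved under a name if you want to keep it, and the original tibble is left alone. The name early is an arbitrary choice for this example.

```r
library(tidyverse)
bod <- as_tibble(BOD)

early <- filter(bod, Time < 4)  # save the filtered result under a new name

nrow(early)  # 3: only the rows with Time < 4
nrow(bod)    # 6: the original tibble is unchanged
```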


slide-9
SLIDE 9

Combining constraints in filter

filter(bod, Time < 4, demand <= 16)
# A tibble: 2 x 2
   Time demand
  <dbl>  <dbl>
1     1    8.3
2     2   10.3

  • This filters by the conjunction of the two constraints—both must be satisfied.
  • Constraints appear as second (and third, …) arguments, separated by commas.


slide-10
SLIDE 10

Disjunction with filter

filter(bod, Time < 4 | demand <= 16)
# A tibble: 5 x 2
   Time demand
  <dbl>  <dbl>
1     1    8.3
2     2   10.3
3     3   19
4     4   16
5     5   15.6

  • This filters by the disjunction of the two constraints: at least one must be satisfied.
  • More about logical operators later.
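
As a quick comparison of the two ways of combining constraints, the sketch below shows that comma-separated constraints behave like an explicit & ("and"), while | ("or") keeps more rows. The variable names are arbitrary.

```r
library(tidyverse)
bod <- as_tibble(BOD)

both_comma <- filter(bod, Time < 4, demand <= 16)   # comma-separated constraints
both_and   <- filter(bod, Time < 4 & demand <= 16)  # explicit "and"
either     <- filter(bod, Time < 4 | demand <= 16)  # "or"

all.equal(both_comma, both_and)  # TRUE: the two forms select the same rows
nrow(both_and)                   # 2
nrow(either)                     # 5
```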


slide-11
SLIDE 11

Filtering out all rows

filter(bod, Time > 10)
# A tibble: 0 x 2
# ... with 2 variables: Time <dbl>, demand <dbl>

  • If the constraint is too severe, then you will produce a tibble with zero rows.


slide-12
SLIDE 12

Data frame function: select columns

select(bod, demand)
# A tibble: 6 x 1
  demand
   <dbl>
1    8.3
2   10.3
3   19
4   16
5   15.6
6   19.8

  • The select function will return a subset of the tibble, using only the requested columns in the order specified.


slide-13
SLIDE 13

Use - to leave out a column

select(bod, -Time)
# A tibble: 6 x 1
  demand
   <dbl>
1    8.3
2   10.3
3   19
4   16
5   15.6
6   19.8

  • The - operator can be used to leave out a column.
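
A brief sketch of both uses of select: naming several columns returns them in the order given, and - drops a column. The names swapped and no_time are arbitrary.

```r
library(tidyverse)
bod <- as_tibble(BOD)

swapped <- select(bod, demand, Time)  # both columns, demand first
names(swapped)                        # "demand" "Time"

no_time <- select(bod, -Time)         # everything except Time
names(no_time)                        # "demand"
```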


slide-14
SLIDE 14

Data frame function: arrange to sort rows

arrange(bod, demand)
# A tibble: 6 x 2
   Time demand
  <dbl>  <dbl>
1     1    8.3
2     2   10.3
3     5   15.6
4     4   16
5     3   19
6     7   19.8

  • arrange takes a data frame and a column, and sorts the rows by the values in that column in ascending order.


slide-15
SLIDE 15

Use desc to sort from high to low

arrange(bod, desc(demand))
# A tibble: 6 x 2
   Time demand
  <dbl>  <dbl>
1     7   19.8
2     3   19
3     4   16
4     5   15.6
5     2   10.3
6     1    8.3
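
Because each of these functions takes a data frame and returns a new one, the result of one can be passed to the next. The following is a minimal sketch that filters and then sorts; the intermediate names are arbitrary.

```r
library(tidyverse)
bod <- as_tibble(BOD)

early        <- filter(bod, Time < 4)         # keep the first three time points
early_sorted <- arrange(early, desc(demand))  # highest demand first

early_sorted$Time  # 3 2 1
```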


slide-16
SLIDE 16

Interlude


slide-17
SLIDE 17

Arguments by position

1:5
[1] 1 2 3 4 5
5:1
[1] 5 4 3 2 1
seq(1, 5)
[1] 1 2 3 4 5
seq(5, 1)
[1] 5 4 3 2 1

  • seq is the function equivalent of the colon operator.
  • Arguments can be specified by position, with one supplied argument for each name in the function parameter list, and in the same order.
  • Arguments are separated by commas.


slide-18
SLIDE 18

Arguments by name

seq(from = 1, to = 5)
[1] 1 2 3 4 5
seq(to = 5, from = 1)
[1] 1 2 3 4 5

  • Arguments can be supplied by name using the syntax name = value.
  • When using names, the order of the named arguments does not matter.
  • You can mix positional and named arguments (carefully).
  • Do not use <- in place of = when specifying named arguments.
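
The last two points can be illustrated with a few calls. This sketch uses seq's by argument (the step size), which the slides have not introduced, and shows what goes wrong when <- is used instead of =.

```r
# Positional first argument, named second argument.
seq(1, to = 5)             # 1 2 3 4 5

# Two positional arguments plus the named argument `by` (step size).
seq(1, 10, by = 3)         # 1 4 7 10

# Why `<-` is wrong here: it assigns variables in your workspace and then
# passes the values by position, so the intended names are ignored.
seq(to <- 2, from <- 5)    # 2 3 4 5, and now variables `to` and `from` exist
```

In the last call the values 2 and 5 are matched to from and to by position, which is the opposite of what the names suggest.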


slide-19
SLIDE 19

Using the correct argument name

seq(1, 5)
[1] 1 2 3 4 5
seq(from = 1, to = 5)
[1] 1 2 3 4 5
seq(begin = 1, end = 5)
Warning: In seq.default(begin = 1, end = 5) :
 extra arguments 'begin', 'end' will be disregarded
[1] 1

  • You have to use the correct argument name.


slide-20
SLIDE 20

How to find the names of a function’s arguments

  • How can you figure out the names of seq’s arguments?
  • Answer: the arguments are listed in the R documentation of the function.

## Try this:
?seq
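
Besides the help page, the argument names can be inspected from the console. The sketch below looks at seq.default, the method named in the warning on the previous slide, because seq itself is a generic function whose own argument list is just "...".

```r
# Print the argument list of the default seq method.
args(seq.default)

# Just the argument names, as a character vector.
names(formals(seq.default))
```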


slide-21
SLIDE 21

How to assign and evaluate in one line

x <- 10
x
[1] 10
(x <- 10)
[1] 10

  • When you are typing to R, it is very common to want to assign a variable and return the value of the assignment.
  • But ordinarily, R does not return a visible result from an assignment expression.
  • You can force it to do so in a single expression by putting parentheses, ( ), around the assignment.
  • This is used throughout the R documentation and some of the slides for this course, so you should be familiar with this bit of syntax.
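
If you want to see the difference rather than take it on faith, base R's withVisible() reports whether the value of an expression would be printed. This is only a small illustration and is not needed in everyday use.

```r
# withVisible() reports whether an expression's value would be auto-printed.
withVisible(x <- 10)$visible    # FALSE: plain assignment returns its value invisibly
withVisible((x <- 10))$visible  # TRUE: the parentheses make the value visible

# The assignment still happens either way.
x                               # 10
```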


slide-22
SLIDE 22

Back to data frame functions


slide-23
SLIDE 23

Data frame function: mutate to compute new column

(bod2 <- mutate(bod, inv_demand = 1 / demand))
# A tibble: 6 x 3
   Time demand inv_demand
  <dbl>  <dbl>      <dbl>
1     1    8.3     0.120
2     2   10.3     0.0971
3     3   19       0.0526
4     4   16       0.0625
5     5   15.6     0.0641
6     7   19.8     0.0505

  • This uses mutate to add a new column to bod which is the reciprocal of demand.
  • Note use of = to assign to new column. Do not use <- here!
  • The result of this is a new data frame with the new column. You must assign to a new (or old) name to save the result.
  • Note use of ( ) to assign and evaluate on a single line.
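
The need to assign the result can be checked directly. In this sketch the first mutate call only prints its result, so bod keeps two columns until the result is assigned back to it.

```r
library(tidyverse)
bod <- as_tibble(BOD)

mutate(bod, inv_demand = 1 / demand)  # printed, but not saved anywhere
names(bod)                            # still "Time" "demand"

bod <- mutate(bod, inv_demand = 1 / demand)  # assign to keep the new column
names(bod)                                   # "Time" "demand" "inv_demand"
```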


slide-24
SLIDE 24

Exercise: change units in orange trees

orange <- as_tibble(Orange)
head(orange)
# A tibble: 6 x 3
  Tree    age circumference
  <ord> <dbl>         <dbl>
1 1       118            30
2 1       484            58
3 1       664            87
4 1      1004           115
5 1      1231           120
6 1      1372           142

  • Add a new column to orange called circum_in which is the circumference in inches, not in millimeters.
  • Hint: 1 in = 2.54 cm


slide-25
SLIDE 25

Answer: change units in orange trees

orange <- mutate(orange, circum_in = circumference/(10 * 2.54))
orange
# A tibble: 35 x 4
   Tree    age circumference circum_in
   <ord> <dbl>         <dbl>     <dbl>
 1 1       118            30      1.18
 2 1       484            58      2.28
 3 1       664            87      3.43
 4 1      1004           115      4.53
 5 1      1231           120      4.72
 6 1      1372           142      5.59
 7 1      1582           145      5.71
 8 2       118            33      1.30
 9 2       484            69      2.72
10 2       664           111      4.37
# ... with 25 more rows


slide-26
SLIDE 26

Use read_csv to read csv file (from file or url)

cw <- read_csv("https://somgen223.stanford.edu/data/cw.csv")

  • We will need that directory many times in this course, so use str_c to construct the location for read_csv:

data_dir <- "https://somgen223.stanford.edu/data/"
cw <- read_csv(str_c(data_dir, "cw.csv"))

  • data_dir holds the first part of the url as a string
  • str_c glues all of its arguments together into a single string, like so:

str_c("abc", "def")
[1] "abcdef"
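
The same pattern works for a local file. The sketch below does not depend on the course server: it writes the built-in BOD data to a temporary csv file and reads it back. The directory and the file name bod_example.csv are made up for this example.

```r
library(tidyverse)

# Hypothetical local directory and file name, used only for this sketch.
data_dir <- str_c(tempdir(), "/")
path <- str_c(data_dir, "bod_example.csv")

write_csv(as_tibble(BOD), path)  # create a small csv file to read back
bod_from_file <- read_csv(path)  # read_csv accepts a file path as well as a url

bod_from_file
```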


slide-27
SLIDE 27

Reading

  • Read: 5 Data transformation | R for Data Science (sections 5.1 to 5.5)
