Introduction to data frames

Steve Bagley, somgen223.stanford.edu


  1. Introduction to data frames. Steve Bagley, somgen223.stanford.edu

  2. Using packages from the tidyverse

  3. Need to install the tidyverse set of packages
     • Type this:

         install.packages("tidyverse")

     • “tidyverse” is a coherent set of packages for operating on a kind of data called the “data frame.”
     • It is not built in, so you need to install it (once), then load it each time you restart R.
     • Put library(tidyverse) or library("tidyverse") at the top of every script file.
     • More about packages later.
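The install-once, load-every-session pattern described above could be sketched as follows. The requireNamespace() guard is an addition that is not on the slide; it is one common way to avoid reinstalling a package that is already present.

```r
# Install the tidyverse only if it is missing (the one-time step).
# The requireNamespace() check is an assumption added for this sketch.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it; this line belongs at the top of every script file.
library(tidyverse)
```

Running the script a second time skips the install step and only loads the packages.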

  4. Data frame: a two-dimensional data structure
     A data frame is one of the most powerful features in R.
     • It is a rectangular data structure that can contain different types of data, similar to an Excel spreadsheet.
     • Typically, each row in a data frame contains information about one instance of some (real-world) object.
     • Each column can be thought of as a variable, containing the values for the corresponding instances.
     • All the values in one column should be of the same type, but different columns can be of different types.
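As a small illustration of the points above, the sketch below builds a data frame by hand with tibble(). The data and the names subjects, id, age, and treated are made up for this example and do not come from the course.

```r
library(tidyverse)

# Hypothetical data: one row per subject, one column per variable.
subjects <- tibble(
  id      = c("s1", "s2", "s3"),   # character column
  age     = c(34, 51, 27),         # double (numeric) column
  treated = c(TRUE, FALSE, TRUE)   # logical column
)

subjects         # prints the tibble with a type label under each column name
nrow(subjects)   # number of rows (instances)
ncol(subjects)   # number of columns (variables)
```

Each column holds a single type, while the three columns differ in type from one another.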

  5. Data frame example

         bod <- as_tibble(BOD)
         bod
         # A tibble: 6 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3
         3     3   19
         4     4   16
         5     5   15.6
         6     7   19.8

     • This data set contains data on biological oxygen demand.
     • A tibble is a kind of data frame. This one has 6 rows and 2 columns.
     • Across the top is the name of each column. The next row shows the type of data in the column. <dbl> means double-precision floating-point number.
     • The row numbers are added for printing. They are not part of the data frame.

  6. Data frame example

         bod
         # A tibble: 6 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3
         3     3   19
         4     4   16
         5     5   15.6
         6     7   19.8

     Check claim on earlier slide:
     • Typically, each row in a data frame describes an instance of some (real-world) object. (Yes: one row for each subject, time, or measurement.)
     • Each column contains the values of a variable for the corresponding instance. (Yes: one column for each variable.)

  7. Data frame functions
     • The rest of this section shows data frame functions (“verbs”).
     • Each function takes a data frame and produces a new data frame.

  8. Data frame function: filtering rows

         filter(bod, Time < 4)
         # A tibble: 3 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3
         3     3   19

     • This produces (and prints out) a new tibble, which contains all the rows where the Time value in that row is less than 4.
     • The new data frame has only 3 rows.
     • It still has 2 columns.
     • filter does not modify bod, which still has 6 rows and 2 columns.
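To see that filter leaves its input alone, the result can be stored under a new name and the row counts compared. This is a minimal sketch; the name early is made up here.

```r
library(tidyverse)

bod <- as_tibble(BOD)

# Keep the filtered rows under a new (hypothetical) name.
early <- filter(bod, Time < 4)

nrow(early)   # 3: only the rows with Time < 4
nrow(bod)     # 6: the original tibble is unchanged
```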

  9. Combining constraints in filter

         filter(bod, Time < 4, demand <= 16)
         # A tibble: 2 x 2
            Time demand
           <dbl>  <dbl>
         1     1    8.3
         2     2   10.3

     • This filters by the conjunction of the two constraints: both must be satisfied.
     • Constraints appear as second (and third, …) arguments, separated by commas.
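Separating constraints with commas gives the same rows as joining them with the logical "and" operator &. The sketch below compares the two forms; the names with_comma and with_and are made up for this example.

```r
library(tidyverse)

bod <- as_tibble(BOD)

# Two ways to require both constraints at once.
with_comma <- filter(bod, Time < 4, demand <= 16)
with_and   <- filter(bod, Time < 4 & demand <= 16)

nrow(with_comma)   # 2
nrow(with_and)     # 2

# Both results contain the same Time and demand values.
all(with_comma$Time == with_and$Time)
all(with_comma$demand == with_and$demand)
```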

  10. Disjunction with filter

          filter(bod, Time < 4 | demand <= 16)
          # A tibble: 5 x 2
             Time demand
            <dbl>  <dbl>
          1     1    8.3
          2     2   10.3
          3     3   19
          4     4   16
          5     5   15.6

      • This filters by the disjunction of the two constraints: at least one must be satisfied.
      • More about logical operators later.

  11. Filtering out all rows

          filter(bod, Time > 10)
          # A tibble: 0 x 2
          # ... with 2 variables: Time <dbl>, demand <dbl>

      • If the constraint is too severe, then you will produce a tibble with zero rows.

  12. Data frame function: select columns

          select(bod, demand)
          # A tibble: 6 x 1
            demand
             <dbl>
          1    8.3
          2   10.3
          3   19
          4   16
          5   15.6
          6   19.8

      • The select function will return a subset of the tibble, using only the requested columns in the order specified.
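The phrase "in the order specified" matters when more than one column is requested. As a minimal sketch, asking for demand before Time returns the columns in that order; the name reordered is made up here.

```r
library(tidyverse)

bod <- as_tibble(BOD)

# Request both columns, with demand first.
reordered <- select(bod, demand, Time)

names(bod)         # "Time"   "demand"
names(reordered)   # "demand" "Time"
```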

  13. Use - to leave out a column

          select(bod, -Time)
          # A tibble: 6 x 1
            demand
             <dbl>
          1    8.3
          2   10.3
          3   19
          4   16
          5   15.6
          6   19.8

      • The - operator can be used to leave out a column.

  14. Data frame function: arrange to sort rows

          arrange(bod, demand)
          # A tibble: 6 x 2
             Time demand
            <dbl>  <dbl>
          1     1    8.3
          2     2   10.3
          3     5   15.6
          4     4   16
          5     3   19
          6     7   19.8

      • arrange takes a data frame and a column, and sorts the rows by the values in that column in ascending order.

  15. Use desc to sort from high to low

          arrange(bod, desc(demand))
          # A tibble: 6 x 2
             Time demand
            <dbl>  <dbl>
          1     7   19.8
          2     3   19
          3     4   16
          4     5   15.6
          5     2   10.3
          6     1    8.3

  16. Interlude

  17. Arguments by position

          1:5
          [1] 1 2 3 4 5
          5:1
          [1] 5 4 3 2 1
          seq(1, 5)
          [1] 1 2 3 4 5
          seq(5, 1)
          [1] 5 4 3 2 1

      • seq is the function equivalent of the colon operator.
      • Arguments can be specified by position, with one supplied argument for each name in the function parameter list, and in the same order.
      • Arguments are separated by commas.

  18. Arguments by name

          seq(from = 1, to = 5)
          [1] 1 2 3 4 5
          seq(to = 5, from = 1)
          [1] 1 2 3 4 5

      • Arguments can be supplied by name using the syntax name = value.
      • When using names, the order of the named arguments does not matter.
      • You can mix positional and named arguments (carefully).
      • Do not use <- in place of = when specifying named arguments.
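The last two points can be checked directly. The sketch below first mixes a positional and a named argument, then shows what goes wrong with <-: each `name <- value` is an ordinary assignment whose value is passed by position, so the names are not matched to seq's parameters and two variables are created as a side effect. The names mixed and by_position are made up for this example.

```r
# Mixing: the first argument is positional (from), the second is named.
mixed <- seq(1, to = 5)
mixed          # 1 2 3 4 5

# Using <- instead of = does NOT name the arguments.
# The values 5 and 1 are passed by position, as if seq(5, 1) had been typed.
by_position <- seq(to <- 5, from <- 1)
by_position    # 5 4 3 2 1

# Side effect: variables called `to` and `from` now exist.
to             # 5
from           # 1
```

Compare with seq(to = 5, from = 1) on the slide, which counts up from 1 to 5.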

  19. Using the correct argument name

          seq(1, 5)
          [1] 1 2 3 4 5
          seq(from = 1, to = 5)
          [1] 1 2 3 4 5
          seq(begin = 1, end = 5)
          Warning: In seq.default(begin = 1, end = 5) :
            extra arguments 'begin', 'end' will be disregarded
          [1] 1

      • You have to use the correct argument name.

  20. How to find the names of a function’s arguments

          ## Try this:
          ?seq

      • How can you figure out the names of seq’s arguments?
      • Answer: the arguments are listed in the R documentation of the function.
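Besides the help page, argument names can also be inspected from code. This sketch goes beyond the slide: it assumes that seq.default is the method doing the work (it is the function named in the warning message from seq(begin = 1, end = 5)), and uses the base functions args() and formals(). The name arg_names is made up here.

```r
# Print the argument list of the default method for seq.
args(seq.default)

# Collect just the argument names as a character vector.
arg_names <- names(formals(seq.default))
arg_names

"from" %in% arg_names    # TRUE: a valid argument name
"begin" %in% arg_names   # FALSE: not an argument, hence the warning
```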

  21. How to assign and evaluate in one line

          x <- 10
          x
          [1] 10
          (x <- 10)
          [1] 10

      • When you are typing to R, it is very common to want to assign a variable and return the value of the assignment.
      • But ordinarily, R does not return a visible result from an assignment expression.
      • You can force it to do so in a single expression by putting parentheses, ( ), around the assignment.
      • This is used throughout the R documentation and some of the slides for this course, so you should be familiar with this bit of syntax.

  22. Back to data frame functions

  23. Data frame function: mutate to compute new column

          (bod2 <- mutate(bod, inv_demand = 1 / demand))
          # A tibble: 6 x 3
             Time demand inv_demand
            <dbl>  <dbl>      <dbl>
          1     1    8.3     0.120
          2     2   10.3     0.0971
          3     3   19       0.0526
          4     4   16       0.0625
          5     5   15.6     0.0641
          6     7   19.8     0.0505

      • This uses mutate to compute a new column, inv_demand, which is the reciprocal of demand.
      • Note use of = to assign to the new column. Do not use <- here!
      • The result is a new data frame with the new column; bod itself is not changed. You must assign to a new (or old) name to save the result.
      • Note use of ( ) to assign and evaluate on a single line.
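The need to save the result can be checked by calling mutate without assigning, then looking at the column names. A minimal sketch, reusing the names bod and bod2 from the slide:

```r
library(tidyverse)

bod <- as_tibble(BOD)

# Without assignment, the new tibble is printed and then discarded.
mutate(bod, inv_demand = 1 / demand)
names(bod)    # "Time" "demand": bod has no inv_demand column

# Assigning the result keeps the new column.
bod2 <- mutate(bod, inv_demand = 1 / demand)
names(bod2)   # "Time" "demand" "inv_demand"
```

Assigning back to the old name (bod <- mutate(bod, ...)) would instead replace the original tibble.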

  24. Exercise: change units in orange trees

          orange <- as_tibble(Orange)
          head(orange)
          # A tibble: 6 x 3
            Tree    age circumference
            <ord> <dbl>         <dbl>
          1 1       118            30
          2 1       484            58
          3 1       664            87
          4 1      1004           115
          5 1      1231           120
          6 1      1372           142

      • Add a new column to orange called circum_in which is the circumference in inches, not in millimeters.
      • Hint: 1 in = 2.54 cm = 25.4 mm

  25. Answer: change units in orange trees

          orange <- mutate(orange, circum_in = circumference / (10 * 2.54))
          orange
          # A tibble: 35 x 4
             Tree    age circumference circum_in
             <ord> <dbl>         <dbl>     <dbl>
           1 1       118            30      1.18
           2 1       484            58      2.28
           3 1       664            87      3.43
           4 1      1004           115      4.53
           5 1      1231           120      4.72
           6 1      1372           142      5.59
           7 1      1582           145      5.71
           8 2       118            33      1.30
           9 2       484            69      2.72
          10 2       664           111      4.37
          # ... with 25 more rows

  26. Use read_csv to read a csv file (from a file or URL)

          cw <- read_csv("https://somgen223.stanford.edu/data/cw.csv")

      • We will need that directory many times in this course, so use str_c to construct the location for read_csv:

          data_dir <- "https://somgen223.stanford.edu/data/"
          cw <- read_csv(str_c(data_dir, "cw.csv"))

      • data_dir holds the first part of the URL as a string.
      • str_c glues all of its arguments together into a single string, like so:

          str_c("abc", "def")
          [1] "abcdef"
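Because str_c adds no separator between its arguments, data_dir has to end with a slash for the pieces to form a valid location. The sketch below builds the location without downloading anything; the name cw_url is made up here.

```r
library(tidyverse)

# First part of the URL; note the trailing "/".
data_dir <- "https://somgen223.stanford.edu/data/"

# Glue the directory and the file name into one string.
cw_url <- str_c(data_dir, "cw.csv")
cw_url
# [1] "https://somgen223.stanford.edu/data/cw.csv"
```

Passing cw_url to read_csv is then equivalent to the read_csv(str_c(data_dir, "cw.csv")) call on the slide.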

  27. Reading
      • Read: R for Data Science, chapter 5, “Data transformation” (sections 5.1 to 5.5)
