Introduction to data frames
Steve Bagley
somgen223.stanford.edu

Using packages from the tidyverse

Need to install the tidyverse set of packages
• Type this:
    install.packages("tidyverse")
• “tidyverse” is a coherent set of packages for operating on a kind of data called the “data frame.”
• It is not built-in, so you need to install it (once), then load it each time you restart R.
• Put library(tidyverse) or library("tidyverse") at the top of every script file.
• More about packages later.

Data frame: a two-dimensional data structure
A data frame is one of the most powerful features in R.
• It is a rectangular data structure that can contain different types of data, similar to an Excel spreadsheet.
• Typically, each row in a data frame contains information about one instance of some (real-world) object.
• Each column can be thought of as a variable, containing the values for the corresponding instances.
• All the values in one column should be of the same type, but different columns can be of different types.
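The row and column description above can be illustrated with a small hand-built tibble. This is only a sketch: the column names and values are made up for illustration and are not part of the course data.

```r
library(tibble)

# Hypothetical data: one row per subject, one column per variable.
subjects <- tibble(
  id     = c("s1", "s2", "s3"),   # character column
  age    = c(34, 51, 27),         # double column
  smoker = c(FALSE, TRUE, FALSE)  # logical column
)

# Each column holds a single type; different columns hold different types.
sapply(subjects, class)

# Number of rows (instances) and columns (variables).
dim(subjects)
```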

Data frame example
    bod <- as_tibble(BOD)
    bod
    # A tibble: 6 x 2
       Time demand
      <dbl>  <dbl>
    1     1    8.3
    2     2   10.3
    3     3   19
    4     4   16
    5     5   15.6
    6     7   19.8
• This data set contains data on biological oxygen demand.
• A tibble is a kind of data frame. This one has 6 rows and 2 columns.
• Across the top is the name of each column. The next row shows the type of data in the column. <dbl> means double-precision floating point number.
• The row numbers are added for printing. They are not part of the data frame.

Data frame example
    bod
    # A tibble: 6 x 2
       Time demand
      <dbl>  <dbl>
    1     1    8.3
    2     2   10.3
    3     3   19
    4     4   16
    5     5   15.6
    6     7   19.8
Check claim on earlier slide:
• Typically, each row in a data frame describes an instance of some (real-world) object. (Yes: one row for each subject, time, or measurement.)
• Each column contains the values of a variable for the corresponding instance. (Yes: one column for each variable.)

Data frame functions
• The rest of this section shows data frame functions (“verbs”).
• Each function takes a data frame and produces a new data frame.

Data frame function: filtering rows
    filter(bod, Time < 4)
    # A tibble: 3 x 2
       Time demand
      <dbl>  <dbl>
    1     1    8.3
    2     2   10.3
    3     3   19
• This produces (and prints out) a new tibble, which contains all the rows where the Time value in that row is less than 4.
• There are only 3 rows in the new data frame.
• There are still 2 columns.
• filter does not modify bod, which still has 6 rows and 2 columns.

Combining constraints in filter
    filter(bod, Time < 4, demand <= 16)
    # A tibble: 2 x 2
       Time demand
      <dbl>  <dbl>
    1     1    8.3
    2     2   10.3
• This filters by the conjunction of the two constraints: both must be satisfied.
• Constraints appear as second (and third, …) arguments, separated by commas.
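As a small sketch of what conjunction means here, the comma-separated form can be compared with the same two constraints joined by the logical AND operator, &. The variable names both_comma and both_amp are made up for this example.

```r
library(dplyr)
library(tibble)

bod <- as_tibble(BOD)

# Two constraints as separate arguments ...
both_comma <- filter(bod, Time < 4, demand <= 16)

# ... and the same two constraints joined explicitly with & (logical AND).
both_amp <- filter(bod, Time < 4 & demand <= 16)

# Both forms keep the same rows.
nrow(both_comma)
nrow(both_amp)
identical(both_comma$Time, both_amp$Time)
```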

Disjunction with filter
    filter(bod, Time < 4 | demand <= 16)
    # A tibble: 5 x 2
       Time demand
      <dbl>  <dbl>
    1     1    8.3
    2     2   10.3
    3     3   19
    4     4   16
    5     5   15.6
• This filters by the disjunction of the two constraints: at least one must be satisfied.
• More about logical operators later.

Filtering out all rows
    filter(bod, Time > 10)
    # A tibble: 0 x 2
    # ... with 2 variables: Time <dbl>, demand <dbl>
• If the constraint is too severe, then you will produce a tibble with zero rows.

Data frame function: select columns
    select(bod, demand)
    # A tibble: 6 x 1
      demand
       <dbl>
    1    8.3
    2   10.3
    3   19
    4   16
    5   15.6
    6   19.8
• The select function will return a subset of the tibble, using only the requested columns in the order specified.

Use - to leave out a column
    select(bod, -Time)
    # A tibble: 6 x 1
      demand
       <dbl>
    1    8.3
    2   10.3
    3   19
    4   16
    5   15.6
    6   19.8
• The - operator can be used to leave out a column.
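Because select returns the columns in the order requested, it can also be used to reorder columns. A minimal sketch using the same bod tibble (the names reordered and no_time are made up for this example):

```r
library(dplyr)
library(tibble)

bod <- as_tibble(BOD)

# Ask for both columns, but in the opposite order.
reordered <- select(bod, demand, Time)
names(reordered)

# Leaving out Time gives the same columns as asking for demand alone.
no_time <- select(bod, -Time)
names(no_time)
```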

Data frame function: arrange to sort rows
    arrange(bod, demand)
    # A tibble: 6 x 2
       Time demand
      <dbl>  <dbl>
    1     1    8.3
    2     2   10.3
    3     5   15.6
    4     4   16
    5     3   19
    6     7   19.8
• arrange takes a data frame and a column, and sorts the rows by the values in that column in ascending order.

Use desc to sort from high to low
    arrange(bod, desc(demand))
    # A tibble: 6 x 2
       Time demand
      <dbl>  <dbl>
    1     7   19.8
    2     3   19
    3     4   16
    4     5   15.6
    5     2   10.3
    6     1    8.3

Interlude

Arguments by position
    1:5
    [1] 1 2 3 4 5
    5:1
    [1] 5 4 3 2 1
    seq(1, 5)
    [1] 1 2 3 4 5
    seq(5, 1)
    [1] 5 4 3 2 1
• seq is the function equivalent of the colon operator.
• Arguments can be specified by position, with one supplied argument for each name in the function parameter list, and in the same order.
• Arguments are separated by commas.

Arguments by name
    seq(from = 1, to = 5)
    [1] 1 2 3 4 5
    seq(to = 5, from = 1)
    [1] 1 2 3 4 5
• Arguments can be supplied by name using the syntax name = value.
• When using names, the order of the named arguments does not matter.
• You can mix positional and named arguments (carefully).
• Do not use <- in place of = when specifying named arguments.
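One common way to mix the two styles is to give the first arguments by position and any later, optional ones by name. A short sketch using two of seq's documented arguments, by and length.out (the variable names odds and four_points are made up for this example):

```r
# from and to are given by position; the step size is given by name.
odds <- seq(1, 10, by = 2)
odds
# [1] 1 3 5 7 9

# The same positional arguments with a different named argument:
# ask for 4 evenly spaced values instead of a fixed step.
four_points <- seq(1, 10, length.out = 4)
four_points
# [1]  1  4  7 10
```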

Using the correct argument name
    seq(1, 5)
    [1] 1 2 3 4 5
    seq(from = 1, to = 5)
    [1] 1 2 3 4 5
    seq(begin = 1, end = 5)
    Warning: In seq.default(begin = 1, end = 5) :
      extra arguments 'begin', 'end' will be disregarded
    [1] 1
• You have to use the correct argument name. Here seq does not recognize begin or end, so it warns, disregards them, and returns only 1.

How to find the names of a function’s arguments
• How can you figure out the names of seq’s arguments?
• Answer: the arguments are listed in the R documentation of the function.
    ## Try this:
    ?seq

How to assign and evaluate in one line
    x <- 10
    x
    [1] 10
    (x <- 10)
    [1] 10
• When you are typing to R, it is very common to want to assign a variable and return the value of the assignment.
• But ordinarily, R does not return a visible result from an assignment expression.
• You can force it to do so in a single expression by putting parentheses, ( ), around the assignment.
• This is used throughout the R documentation and some of the slides for this course, so you should be familiar with this bit of syntax.

Back to data frame functions

Data frame function: mutate to compute new column
    (bod2 <- mutate(bod, inv_demand = 1 / demand))
    # A tibble: 6 x 3
       Time demand inv_demand
      <dbl>  <dbl>      <dbl>
    1     1    8.3     0.120
    2     2   10.3     0.0971
    3     3   19       0.0526
    4     4   16       0.0625
    5     5   15.6     0.0641
    6     7   19.8     0.0505
• This uses mutate to compute a new column, the reciprocal of demand, and add it to a copy of bod.
• Note use of = to assign to new column. Do not use <- here!
• The result of this is a new data frame with the new column. You must assign to a new (or old) name to save the result.
• Note use of ( ) to assign and evaluate on a single line.
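The point about saving the result can be checked directly. This is a minimal sketch using the same bod tibble; the names before and after are made up for this example.

```r
library(dplyr)
library(tibble)

bod <- as_tibble(BOD)

# Calling mutate without assigning computes the new column,
# but the result is discarded; bod itself is unchanged.
mutate(bod, inv_demand = 1 / demand)
before <- names(bod)
before
# [1] "Time"   "demand"

# Assigning the result back to the old name keeps the new column.
bod <- mutate(bod, inv_demand = 1 / demand)
after <- names(bod)
after
# [1] "Time"       "demand"     "inv_demand"
```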

Exercise: change units in orange trees
    orange <- as_tibble(Orange)
    head(orange)
    # A tibble: 6 x 3
      Tree    age circumference
      <ord> <dbl>         <dbl>
    1 1       118            30
    2 1       484            58
    3 1       664            87
    4 1      1004           115
    5 1      1231           120
    6 1      1372           142
• Add a new column to orange called circum_in which is the circumference in inches, not in millimeters.
• Hint: 1 in = 2.54 cm (= 25.4 mm)

Answer: change units in orange trees
    orange <- mutate(orange, circum_in = circumference / (10 * 2.54))
    orange
    # A tibble: 35 x 4
       Tree    age circumference circum_in
       <ord> <dbl>         <dbl>     <dbl>
     1 1       118            30      1.18
     2 1       484            58      2.28
     3 1       664            87      3.43
     4 1      1004           115      4.53
     5 1      1231           120      4.72
     6 1      1372           142      5.59
     7 1      1582           145      5.71
     8 2       118            33      1.30
     9 2       484            69      2.72
    10 2       664           111      4.37
    # ... with 25 more rows

Use read_csv to read csv file (from file or url)
    cw <- read_csv("https://somgen223.stanford.edu/data/cw.csv")
• We will need that directory many times in this course, so use str_c to construct the location for read_csv
    data_dir <- "https://somgen223.stanford.edu/data/"
    cw <- read_csv(str_c(data_dir, "cw.csv"))
• data_dir holds the first part of the url as a string
• str_c glues all of its arguments together into a single string, like so:
    str_c("abc", "def")
    [1] "abcdef"
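Since the directory is reused, the same str_c call can build several locations at once, because str_c works element by element on vectors. In this sketch "other.csv" is a hypothetical file name used only for illustration (only cw.csv appears in these slides), and nothing is downloaded.

```r
library(stringr)

data_dir <- "https://somgen223.stanford.edu/data/"

# "other.csv" is a made-up name; only "cw.csv" comes from the slides.
files <- c("cw.csv", "other.csv")

# str_c pairs data_dir with each file name in turn.
urls <- str_c(data_dir, files)
urls
```

Each element of urls could then be passed to read_csv in the same way as the single location above.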

Reading
• Read: R for Data Science, chapter 5, “Data transformation” (sections 5.1 to 5.5)
