SLIDE 1

Files, pathnames

Steve Bagley

somgen223.stanford.edu 1

SLIDE 2

Files have contents and a name

  • A file that contains R code will have a name such as test.r.
  • The base name is test.
  • The extension is .r.
  • The extension usually signals the file type.
  • You should use .r or .R for files that contain R code, and .rmd or .Rmd for files that contain R Markdown.
  • Other common extensions are .csv (comma-separated values) and .txt or .text (plain text).

SLIDE 3

File systems are hierarchical collections of directories and files

  • Files live in directories, also called folders.
  • Each directory can contain zero or more files, and zero or more directories.
  • Each directory, except the top, lives inside a single (parent) directory.
  • This produces a hierarchical, tree-like, organization.
  • Computer trees grow upside down.

SLIDE 4

Naming files, macOS and Unix

  • Files are given full names using a path notation.
  • A sample path: /Users/betty/Documents/test.r
  • This file has base name test, with extension r.
  • It is located in the directory Documents, which is located in the directory betty, which is located in the directory Users.

  • The leading / refers to the highest level, or root, of the filesystem.

SLIDE 5

Naming files, Windows

  • Files are given full names using a path notation.
  • C:\Users\betty\Documents\test.r
  • This file has base name test, with extension r.
  • It is located in the directory Documents, which is located in the directory betty, which is located in the directory Users.
  • The leading C:\ refers to the root of drive C:. Each drive has its own root.
  • When typing paths to R as strings, you must double the backslashes, or convert to forward slashes:
  • "C:\\Users\\sbagley\\Documents\\test.r"
  • "C:/Users/sbagley/Documents/test.r"
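
A minimal sketch of why the backslashes must be doubled, using only base R. The variable names are illustrative, and the raw-string form on the last lines assumes R 4.0.0 or later.

```r
# Inside an R string, "\\" is an escape sequence for ONE backslash character.
doubled <- "C:\\Users\\sbagley\\Documents\\test.r"
forward <- "C:/Users/sbagley/Documents/test.r"

# cat() shows the actual characters stored in the string (single backslashes).
cat(doubled, "\n")

# Converting backslashes to forward slashes yields the second spelling.
# fixed = TRUE treats the pattern as a literal backslash, not a regex.
converted <- gsub("\\", "/", doubled, fixed = TRUE)
identical(converted, forward)

# Assumption: R >= 4.0.0. A raw string lets you paste a Windows path as-is.
raw_form <- r"(C:\Users\sbagley\Documents\test.r)"
identical(raw_form, doubled)
```

Both spellings name the same file on Windows. The forward-slash form is usually easier to type and read.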

SLIDE 6

The working directory

  • It is annoying to have to always type the full path to the directory.
  • In R and RStudio, you can set the working directory, which then becomes the default directory for all file operations.
  • When you type a path that does not start with a slash (or, on Windows, a drive letter), it is interpreted relative to the current working directory.
  • If you don’t set the working directory, it is (probably) your home directory, or the directory from which R was started.
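
A small sketch of inspecting and changing the working directory with base R's getwd() and setwd(). The temporary directory is used only so the example runs anywhere; in practice you would point setwd() at your project directory.

```r
# Remember where we started so we can restore it afterwards.
old <- getwd()

# setwd() changes the working directory for all later file operations.
setwd(tempdir())
inside <- getwd()
print(inside)

# Restore the original working directory.
setwd(old)
```

In RStudio, the Session menu offers the same operation without typing code.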

SLIDE 7

Using relative pathnames

  • A relative pathname does not start at the root directory.
  • It is interpreted with respect to the current (working) directory.
  • Example: data/file1.csv.
  • In this example, the file should be located in the data subdirectory of the current directory.
  • If you use relative pathnames, then you can easily move or rename the top-level project directory without breaking things.
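
The following sketch illustrates how a relative pathname is resolved against the working directory. The project directory name demo_project and the file contents are made up for the example; only base R functions are used.

```r
# Build a throwaway project with a data subdirectory (hypothetical names).
project <- file.path(tempdir(), "demo_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("gene,value", "ABC123,12.5"),
           file.path(project, "data", "file1.csv"))

# Make the project the working directory; setwd() returns the previous one.
old <- setwd(project)

# The relative path is interpreted with respect to the working directory.
found <- file.exists("data/file1.csv")
print(found)

# Restore the original working directory.
setwd(old)
```

Because the code only mentions data/file1.csv, it keeps working if demo_project is moved or renamed, as long as the working directory is set to its new location.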

SLIDE 8

Package fs for working with files and directories

library(fs)

  • The easiest way to write code that manipulates files and directories is to use the fs package. It is usually installed as part of tidyverse, but you have to load it separately.

SLIDE 9

dir_ls: list all files and directories in a directory

dir_ls()

  • dir_ls takes a number of optional arguments to control which files to return.
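
As a sketch of two of those optional arguments, the example below uses type and recurse on a small temporary directory tree. The directory and file names are invented for the illustration, and the fs package is assumed to be installed.

```r
library(fs)

# Build a small tree: two files and one subdirectory holding a third file.
demo_dir <- path(tempdir(), "dir_ls_demo")
dir_create(path(demo_dir, "sub"))
file_create(path(demo_dir, c("a.csv", "b.txt")))
file_create(path(demo_dir, "sub", "c.csv"))

# Default: files and directories directly inside demo_dir.
top_level <- dir_ls(demo_dir)

# type = "file" drops the subdirectory from the listing.
files_only <- dir_ls(demo_dir, type = "file")

# recurse = TRUE also descends into subdirectories.
everything <- dir_ls(demo_dir, recurse = TRUE)

path_file(top_level)
path_file(files_only)
path_file(everything)
```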

SLIDE 10

Find all csv files in a directory

dir_ls(path = "/Users/sbagley/temp", glob = "*.csv")

  • glob is jargon for a file name pattern. * matches any sequence of characters.
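
To make the glob idea concrete, this sketch uses base R's glob2rx(), which translates a glob into the equivalent regular expression. The example file names are hypothetical.

```r
# Translate the glob "*.csv" into a regular expression.
rx <- glob2rx("*.csv")
print(rx)

# Test the pattern against some made-up file names.
candidates <- c("a.csv", "a.csv.bak", "acsv")
matches <- grepl(rx, candidates)

# Only names that END in ".csv" match the glob.
candidates[matches]
```

The same translated pattern can be passed to base R's list.files(pattern = ...) when fs is not available.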

SLIDE 11

Getting the parts of a path

p1 <- path("/Users/sbagley/temp/test.csv")
path_dir(p1)
[1] "/Users/sbagley/temp"
path_file(p1)
[1] "test.csv"
path_ext(p1)
[1] "csv"
path_ext_set(p1, "tsv")
/Users/sbagley/temp/test.tsv

  • Using these functions is much easier than writing your own functions to manipulate filenames as strings.

SLIDE 12

Your home directory

p2 <- path("~/temp/test.csv")
path_expand(p2)
/Users/sbagley/temp/test.csv

  • The tilde ~ stands for your home directory.

SLIDE 13

Combining all files in a directory

  • Sometimes your data are spread across multiple files in a directory. For example, there is one file for each of multiple runs of an experiment.

  • You want to combine them together into a single data frame.
  • You want some indicator of which run they came from.

SLIDE 14

Step 1: listing all the files in a directory

library(fs)
my_dir <- "~/sync/teaching/somgen223/website/data/multiple_runs"
my_files <- dir_ls(my_dir)
## these pathnames are very long. to view them here, remove the
## directory part
path_file(my_files)
[1] "file1.csv" "file2.csv"

SLIDE 15

Inspect those files

read_csv(my_files[1])
# A tibble: 2 x 2
  gene   value
  <chr>  <dbl>
1 ABC123  12.5
2 DEF234 333

read_csv(my_files[2])
# A tibble: 2 x 2
  gene  value
  <chr> <dbl>
1 DKK7    2.2
2 LEM9    9

SLIDE 16

Step 2: read and combine into single data frame

(new_df <- map_df(my_files, read_csv))
# A tibble: 4 x 2
  gene   value
  <chr>  <dbl>
1 ABC123  12.5
2 DEF234 333
3 DKK7     2.2
4 LEM9     9

  • map_df applies the second argument, a function, to each element of the first argument, here a vector of file paths.

  • Each call to read_csv produces a data frame.
  • These data frames are glued together (stacked vertically) to form the answer.
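
To show what map_df is doing conceptually, here is a rough base R equivalent: apply a reader to each file, then stack the results. This is a sketch, not how map_df is implemented. It swaps in read.csv, lapply and rbind for read_csv and map_df, and it recreates the two example files in a temporary directory with a made-up name.

```r
# Recreate the two example files in a temporary directory.
run_dir <- file.path(tempdir(), "multiple_runs_demo")
dir.create(run_dir, showWarnings = FALSE)
writeLines(c("gene,value", "ABC123,12.5", "DEF234,333"),
           file.path(run_dir, "file1.csv"))
writeLines(c("gene,value", "DKK7,2.2", "LEM9,9"),
           file.path(run_dir, "file2.csv"))

# Step 1: list the files (full.names = TRUE keeps the directory part).
files <- list.files(run_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2a: apply the reader to each file, giving a list of data frames.
frames <- lapply(files, read.csv)

# Step 2b: stack the data frames vertically into one.
combined <- do.call(rbind, frames)
print(combined)
```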

SLIDE 17

Elaboration: save the filename in the data frame

  • This first solution glues together all the rows, but now you don’t know which file each row came from. (If you are lucky, there will already be a column that indicates this.)
  • So we need to write a function that reads the csv from a file, and adds the filename, which contains the run number as part of the name, as a new column.

SLIDE 18

How to define your own function

  • R has many built-in functions, but sometimes it is useful to define your own.
  • A function encapsulates a set of expressions that perform some conceptually meaningful chunk of work. This is especially useful when you need to repeat the chunk of work multiple times.

SLIDE 19

Defining a function

add1 <- function(v){
  1 + v
}
add1(0)
[1] 1
add1(-1:1)
[1] 0 1 2
x <- 1:5
add1(x)
[1] 2 3 4 5 6

SLIDE 20

How to evaluate a function call

add1(x) [1] 2 3 4 5 6

  1. Evaluate the argument, here, x.
  2. Evaluate the function body with the function’s parameter v temporarily bound to the value of the argument.
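
The sketch below illustrates what "temporarily bound" means: the parameter exists only while the function body runs, and the caller's variable is left unchanged. The function add1_verbose is a hypothetical variant of add1 written for this illustration.

```r
# A variant of add1 that reassigns its parameter inside the body.
add1_verbose <- function(v) {
  v <- v + 1   # changes only the local binding of v
  v
}

x <- 1:5
result <- add1_verbose(x)

print(result)  # the values of x, each increased by 1
print(x)       # x itself is unchanged by the call

# The parameter v does not leak into the calling environment
# (assuming no variable named v was defined there beforehand).
exists("v")
```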

SLIDE 21

A function to add the file name as a column

read_and_record_filename <- function(filename){
  read_csv(filename) %>%
    mutate(filename = path_file(filename))
}

SLIDE 22

Try map_df again

(new_df2 <- map_df(my_files, read_and_record_filename))
# A tibble: 4 x 3
  gene   value filename
  <chr>  <dbl> <chr>
1 ABC123  12.5 file1.csv
2 DEF234 333   file1.csv
3 DKK7     2.2 file2.csv
4 LEM9     9   file2.csv

SLIDE 23

Now convert filename to run number

new_df2 %>%
  mutate(run_number = as.numeric(str_remove_all(filename, "(file)|(\\.csv)")),
         ## remove filename column after we are done with it
         filename = NULL)
# A tibble: 4 x 3
  gene   value run_number
  <chr>  <dbl>      <dbl>
1 ABC123  12.5          1
2 DEF234 333            1
3 DKK7     2.2          2
4 LEM9     9            2

SLIDE 24

Convert filename to run number, version 2

new_df2 %>%
  mutate(run_number = as.numeric(str_extract(filename, "[0-9]+")),
         ## remove filename column after we are done with it
         filename = NULL)
# A tibble: 4 x 3
  gene   value run_number
  <chr>  <dbl>      <dbl>
1 ABC123  12.5          1
2 DEF234 333            1
3 DKK7     2.2          2
4 LEM9     9            2
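
The two versions behave differently when a file name does not follow the exact file<N>.csv pattern. The sketch below compares them using base R stand-ins (gsub for str_remove_all, regmatches with regexpr for str_extract). The third file name is invented to show the difference.

```r
# Two names from the slides plus one hypothetical, differently named file.
filenames <- c("file1.csv", "file2.csv", "run12_final.csv")

# Version 1: remove the known prefix and suffix, convert what is left.
# "run12_final" is not a number, so the third result is NA (with a warning).
v1 <- suppressWarnings(as.numeric(gsub("(file)|(\\.csv)", "", filenames)))

# Version 2: extract the first run of digits, wherever it appears.
# Note: regmatches() drops names with no digits at all; here every name has some.
v2 <- as.numeric(regmatches(filenames, regexpr("[0-9]+", filenames)))

print(v1)
print(v2)
```

Extracting the digits makes fewer assumptions about the rest of the file name, which may be why a second version is worth having.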

SLIDE 25

Reading data in other formats

SLIDE 26

How to read different formats

  • This course has focused on csv-formatted files: the items on each row are separated by commas, and the header row, if present, follows the same format.
  • But there are other formats.

SLIDE 27

readr package

Function     Notes
read_csv     separator is ","
read_csv2    separator is ";", decimal point is ","
read_tsv     separator is "\t" (TAB)
read_delim   general case
read_fwf     reads fixed-width fields
read_table   separator is whitespace

SLIDE 28

How to read a csv file that does not have a header line

df <- read_csv("file1.csv", col_names = c("id", "weight", "height"))

SLIDE 29

How to read a csv file that has a header line that you don’t want

df <- read_csv("file1.csv", col_names = c("id", "weight", "height"), skip = 1)
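
A runnable sketch of replacing an unwanted header, assuming the readr package is installed. The file contents and the old header names are made up, and the file is written to a temporary location so the example is self-contained.

```r
library(readr)

# Write a small csv whose header line we do not want (hypothetical contents).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("old_id,old_w,old_h",
             "1,70.5,180",
             "2,55.2,165"),
           csv_path)

# skip = 1 discards the existing header line;
# col_names supplies the names to use instead.
df <- read_csv(csv_path,
               col_names = c("id", "weight", "height"),
               skip = 1)

names(df)
nrow(df)
```

Without skip = 1, the old header line would be read as a row of data, and every column would be read as text.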

SLIDE 30

How to skip over some lines at the beginning of the csv file

df <- read_csv("file1.csv", skip = 5)

  • This will skip the first 5 lines and start reading at line 6, which is then treated as the header line.

SLIDE 31

Other read_csv options

  • n_max is the maximum number of rows to read. This can be useful if you want to first work with just part of a very large file.
  • skip_empty_rows (TRUE or FALSE) controls whether to skip completely empty rows. If not skipped, they appear as rows of NA values.
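
A brief sketch of n_max, assuming readr is installed. The tiny file here stands in for a very large one, and its contents are invented for the example.

```r
library(readr)

# A stand-in for a large file: a header plus three data rows.
big_path <- tempfile(fileext = ".csv")
writeLines(c("gene,value",
             "ABC123,12.5",
             "DEF234,333",
             "DKK7,2.2"),
           big_path)

# Read only the first two data rows; the header line is not counted.
preview <- read_csv(big_path, n_max = 2)

nrow(preview)
```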

SLIDE 32

readxl package

  • The function read_excel can read either xls or xlsx format files.
  • You will need to specify the sheet if you have a multi-sheet spreadsheet.
  • You can read out of a cell range. Very useful!
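
A sketch of the sheet and range arguments, assuming the readxl package is installed. It uses datasets.xlsx, an example workbook that ships with readxl, rather than a file from this course.

```r
library(readxl)

# Path to an example workbook bundled with readxl.
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets available in the workbook.
excel_sheets(xlsx_path)

# Read a cell range from one named sheet.
# The first row of the range (row 1) is used as the header.
block <- read_excel(xlsx_path, sheet = "mtcars", range = "A1:C4")

dim(block)
```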

SLIDE 33

Reading other data formats

Format              Package
XML                 xml2
JSON                jsonlite
SAS, SPSS, Stata    haven (or foreign)
SQL                 RSQLite, many others

SLIDE 34

Reading

  • Skim only if interested: 19 Functions | R for Data Science
  • Read: 11 Data import | R for Data Science
