Files, pathnames Steve Bagley somgen223.stanford.edu 1
Files have contents and a name
• A file that contains R code will have a name such as test.r.
• The base name is test.
• The extension is .r.
• The extension usually signals the file type.
• Use .r or .R for files that contain R code, and .rmd or .Rmd for files that contain R Markdown. Other common extensions are .csv (for comma-separated values) and .txt or .text (for text files).
File systems are hierarchical collections of directories and files
• Files live in directories, also called folders.
• Each directory can contain zero or more files, and zero or more directories.
• Each directory, except the top, lives inside a single (parent) directory.
• This produces a hierarchical, tree-like organization.
• Computer trees grow upside down: the root is drawn at the top.
Naming files, macOS and Unix
• Files are given full names using a path notation.
• A sample path: /Users/betty/Documents/test.r
• This file has base name test, with extension .r.
• It is located in the directory Documents, which is located in the directory betty, which is located in the directory Users.
• The leading / refers to the highest level, or root, of the filesystem.
Naming files, Windows
• Files are given full names using a path notation.
• A sample path: C:\Users\betty\Documents\test.r
• This file has base name test, with extension .r.
• It is located in the directory Documents, which is located in the directory betty, which is located in the directory Users.
• The leading C:\ refers to the highest level, or root, of the C: drive. Each drive has its own root.
• When typing paths to R as strings, you must double the backslashes, or convert to forward slashes:
    C:\\Users\\sbagley\\Documents\\test.r
    C:/Users/sbagley/Documents/test.r
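The backslash rule can be checked directly in base R. This is a minimal sketch; the path is the sample path from the slide and does not need to exist on your machine, and the variable names are only illustrative.

```r
# Inside an R string, "\\" stands for ONE backslash character.
win_path <- "C:\\Users\\betty\\Documents\\test.r"

# writeLines() shows the string as R stores it, with single backslashes.
writeLines(win_path)

# The forward-slash spelling of the same path.
fwd_path <- "C:/Users/betty/Documents/test.r"

# Converting backslashes to forward slashes (fixed = TRUE turns off regex rules).
converted <- gsub("\\", "/", win_path, fixed = TRUE)
identical(converted, fwd_path)
```

The forward-slash form is usually the easier one to type and read.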
The working directory
• It is annoying to always have to type the full path to the directory.
• In R and RStudio, you can set the working directory, which then becomes the default directory for all file operations.
• When you type a path that does not start with a slash, it is interpreted relative to the current working directory.
• If you don't set the working directory, it is (probably) your home directory, or the directory from which R was started.
Using relative pathnames
• A relative pathname does not start at the root directory.
• It is interpreted with respect to the current (working) directory.
• Example: data/file1.csv
• In this example, the file should be located in the data subdirectory of the current directory.
• If you use relative pathnames, you can easily move or rename the top-level project directory without breaking things.
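A small base R sketch of how the working directory and a relative pathname interact. The project directory my_project is hypothetical and is created inside R's temporary directory so the example does not touch your own files.

```r
# Hypothetical project layout: <tempdir>/my_project/data/file1.csv
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines("gene,value", file.path(project_dir, "data", "file1.csv"))

# setwd() changes the working directory and returns the previous one.
old_wd <- setwd(project_dir)

# The relative path is now resolved against project_dir.
found <- file.exists("data/file1.csv")
found

# Restore the original working directory.
setwd(old_wd)
```

Because the code refers only to data/file1.csv, it keeps working if my_project is moved or renamed, as long as the working directory is set to wherever the project now lives.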
Package fs for working with files and directories
• The easiest way to write code that manipulates files and directories is to use the fs package. It is usually installed as part of tidyverse, but you have to load it separately.
    library(fs)
dir_ls: list all files and directories in a directory
    dir_ls()
• dir_ls takes a number of optional arguments to control which files to return.
Find all csv files in a directory
    dir_ls(path = "/Users/sbagley/temp", glob = "*.csv")
• glob is jargon for a file name pattern. * matches anything.
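To try the glob argument without depending on a particular directory, here is a minimal sketch that builds a throwaway directory first. The directory name glob_demo and the file names are made up for illustration; dir_create, file_create, dir_ls, path, and path_file are fs functions.

```r
library(fs)

# Hypothetical demo directory with two csv files and one text file.
demo_dir <- path(tempdir(), "glob_demo")
dir_create(demo_dir)
file_create(path(demo_dir, c("a.csv", "b.csv", "notes.txt")))

# Keep only the paths that match the pattern *.csv.
csv_files <- dir_ls(demo_dir, glob = "*.csv")

# Show just the file names, without the long directory part.
csv_names <- sort(as.character(path_file(csv_files)))
csv_names
```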
Getting the parts of a path
    p1 <- path("/Users/sbagley/temp/test.csv")
    path_dir(p1)
    [1] "/Users/sbagley/temp"
    path_file(p1)
    [1] "test.csv"
    path_ext(p1)
    [1] "csv"
    path_ext_set(p1, "tsv")
    /Users/sbagley/temp/test.tsv
• Using these functions is much easier than writing your own functions to manipulate filenames as strings.
Your home directory
• The tilde ~ stands for your home directory.
    p2 <- path("~/temp/test.csv")
    path_expand(p2)
    /Users/sbagley/temp/test.csv
Combining all files in a directory
• Sometimes your data are spread across multiple files in a directory. For example, there is one file for each of multiple runs of an experiment.
• You want to combine them into a single data frame.
• You want some indicator of which run each row came from.
Step 1: listing all the files in a directory
    library(fs)
    my_dir <- "~/sync/teaching/somgen223/website/data/multiple_runs"
    my_files <- dir_ls(my_dir)
    ## these pathnames are very long. to view them here, remove the
    ## directory part
    path_file(my_files)
    [1] "file1.csv" "file2.csv"
Inspect those files
    read_csv(my_files[1])
    # A tibble: 2 x 2
      gene   value
      <chr>  <dbl>
    1 ABC123  12.5
    2 DEF234 333
    read_csv(my_files[2])
    # A tibble: 2 x 2
      gene  value
      <chr> <dbl>
    1 DKK7    2.2
    2 LEM9    9
Step 2: read and combine into single data frame
    (new_df <- map_df(my_files, read_csv))
    # A tibble: 4 x 2
      gene   value
      <chr>  <dbl>
    1 ABC123  12.5
    2 DEF234 333
    3 DKK7     2.2
    4 LEM9     9
• map_df applies the second argument, a function, to all the things in the first argument, here a list of files.
• Each call to read_csv produces a data frame.
• These data frames are glued together (stacked vertically) to form the answer.
Elaboration: save the filename in the data frame
• This first solution glues together all the rows, but now you don't know which file each row came from. (If you are lucky, there will already be a column that indicates this.)
• So we need to write a function that reads the csv from a file, and adds the filename, which contains the run number as part of the name, as a new column.
How to define your own function
• R has many built-in functions, but sometimes it is useful to define your own.
• A function encapsulates a set of expressions that perform some conceptually meaningful chunk of work. This is especially useful when you need to repeat the chunk of work multiple times.
Defining a function
    add1 <- function(v) {
      1 + v
    }
    add1(0)
    [1] 1
    add1(-1:1)
    [1] 0 1 2
    x <- 1:5
    add1(x)
    [1] 2 3 4 5 6
How to evaluate a function call
    add1(x)
    [1] 2 3 4 5 6
1. Evaluate the argument, here, x.
2. Evaluate the body of the function with the function's parameter v temporarily bound to the value of that argument.
A function to add the file name as a column
    read_and_record_filename <- function(filename) {
      read_csv(filename) %>%
        mutate(filename = path_file(filename))
    }
Try map_df again
    (new_df2 <- map_df(my_files, read_and_record_filename))
    # A tibble: 4 x 3
      gene   value filename
      <chr>  <dbl> <chr>
    1 ABC123  12.5 file1.csv
    2 DEF234 333   file1.csv
    3 DKK7     2.2 file2.csv
    4 LEM9     9   file2.csv
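The whole read-and-combine step can be tried end to end with a self-contained sketch. The demo directory multiple_runs_demo and its two files are written by the code itself and are stand-ins for the course data, with the same values as shown in the slides. Passing col_types to read_csv is an addition here, used only to keep the output quiet.

```r
library(fs)
library(readr)
library(dplyr)
library(purrr)

# Hypothetical stand-in for the course data: two run files in a temp directory.
demo_dir <- path(tempdir(), "multiple_runs_demo")
dir_create(demo_dir)
write_csv(data.frame(gene = c("ABC123", "DEF234"), value = c(12.5, 333)),
          path(demo_dir, "file1.csv"))
write_csv(data.frame(gene = c("DKK7", "LEM9"), value = c(2.2, 9)),
          path(demo_dir, "file2.csv"))

# Read one file and record which file the rows came from.
read_and_record_filename <- function(filename) {
  read_csv(filename,
           col_types = cols(gene = col_character(), value = col_double())) %>%
    mutate(filename = path_file(filename))
}

# Apply the function to every csv file and stack the results.
demo_files <- dir_ls(demo_dir, glob = "*.csv")
combined <- map_df(demo_files, read_and_record_filename)
combined
```

Recent versions of purrr mark map_df as superseded; it still works, and the slides use it, so the sketch does too.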
Now convert filename to run number
    new_df2 %>%
      mutate(run_number = as.numeric(str_remove_all(filename, "(file)|(\\.csv)")),
             filename = NULL) ## remove filename column after we are done with it
    # A tibble: 4 x 3
      gene   value run_number
      <chr>  <dbl>      <dbl>
    1 ABC123  12.5          1
    2 DEF234 333            1
    3 DKK7     2.2          2
    4 LEM9     9            2
Convert filename to run number, version 2
    new_df2 %>%
      mutate(run_number = as.numeric(str_extract(filename, "[0-9]+")),
             filename = NULL) ## remove filename column after we are done with it
    # A tibble: 4 x 3
      gene   value run_number
      <chr>  <dbl>      <dbl>
    1 ABC123  12.5          1
    2 DEF234 333            1
    3 DKK7     2.2          2
    4 LEM9     9            2
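The two versions can be compared on a plain character vector, outside of any data frame. The file names below are hypothetical; str_remove_all and str_extract come from the stringr package.

```r
library(stringr)

# Hypothetical file names that follow the fileN.csv pattern from the slides.
filenames <- c("file1.csv", "file2.csv", "file10.csv")

# Version 1: delete the known prefix and extension, leaving the digits.
run_v1 <- as.numeric(str_remove_all(filenames, "(file)|(\\.csv)"))

# Version 2: pull out the first run of digits directly.
run_v2 <- as.numeric(str_extract(filenames, "[0-9]+"))

run_v1
run_v2
```

Version 2 does not depend on the exact prefix or extension, so it is likely to be the more robust choice if the naming scheme changes, for example to run3.csv.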
Reading data in other formats
How to read different formats
• This course has focused on csv-formatted files: the items on each row are separated by commas, and the header row, if present, follows the same format.
• But there are other formats...
readr package
    Function      Notes
    read_csv      separator is ","
    read_csv2     separator is ";", decimal point is ","
    read_tsv      separator is "\t" (TAB)
    read_delim    general case
    read_fwf      reads fixed-width fields
    read_table    separator is whitespace
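A short sketch of three of these readers, using small literal strings instead of files (readr treats a string that contains a newline as the data itself). The column names a and b and the "|" delimiter are made up for illustration, and col_types = cols() is used only to keep the output quiet.

```r
library(readr)

# Semicolon-separated, with "," as the decimal mark: 1,5 is read as 1.5.
semi <- read_csv2("a;b\n1,5;2\n", col_types = cols())

# Tab-separated.
tabbed <- read_tsv("a\tb\n1\t2\n", col_types = cols())

# General case: any single-character delimiter, here "|".
piped <- read_delim("a|b\n1|2\n", delim = "|", col_types = cols())

semi
tabbed
piped
```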
How to read a csv file that does not have a header line
    df <- read_csv("file1.csv",
                   col_names = c("id", "weight", "height"))
How to read a csv file that has a header line that you don't want
    df <- read_csv("file1.csv",
                   col_names = c("id", "weight", "height"),
                   skip = 1)
How to skip over some lines at the beginning of the csv file
    df <- read_csv("file1.csv", skip = 5)
• This will skip the first 5 lines, and start reading on line 6.
Other read_csv options
• n_max is the maximum number of rows to read. This can be useful if you want to first work with just part of a very large file.
• skip_empty_rows (TRUE or FALSE) controls whether to skip completely empty rows. If not skipped, they'll produce NA values.
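The options above can be combined in one call. This sketch reads from a literal string rather than a file; the header ID,W,H and the data values are invented for illustration.

```r
library(readr)

# Hypothetical data with a header line we want to replace.
raw <- "ID,W,H\n1,70,180\n2,65,170\n3,80,175\n"

df <- read_csv(raw,
               col_names = c("id", "weight", "height"),  # our own column names
               skip = 1,                                 # drop the unwanted header
               n_max = 2,                                # read at most 2 rows
               col_types = cols())                       # guess types quietly
df
```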