
Files, pathnames (Steve Bagley, somgen223.stanford.edu)



  1. Files, pathnames. Steve Bagley, somgen223.stanford.edu

  2. Files have contents and a name
     • A file that contains R code will have a name such as test.r.
     • The base name is test.
     • The extension is .r.
     • The extension usually signals the file type.
     • Use .r or .R for files that contain R code, and .rmd or .Rmd for files that contain R Markdown. Other common extensions are .csv (for comma-separated values) and .txt or .text (for text files).
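
As a rough illustration of the naming scheme above, base R can split a file name into its base name and extension. This is only a sketch: `fname` and `is_r_code` are names made up for the example, and the helpers come from the `tools` package that ships with R.

```r
# Split a file name into base name and extension (base R only).
fname <- "test.r"                         # example name from the slide

ext  <- tools::file_ext(fname)            # extension without the dot: "r"
base <- tools::file_path_sans_ext(fname)  # name without the extension: "test"

# The convention accepts both .r and .R, so compare case-insensitively.
is_r_code <- tolower(ext) == "r"

print(c(base = base, ext = ext))
print(is_r_code)
```

Note that `tools::file_ext()` returns the extension without the leading dot.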

  3. File systems are hierarchical collections of directories and files
     • Files live in directories, also called folders.
     • Each directory can contain zero or more files, and zero or more directories.
     • Each directory, except the top, lives inside a single (parent) directory.
     • This produces a hierarchical, tree-like organization.
     • Computer trees grow upside down: the root is at the top.

  4. Naming files, macOS and Unix
     • Files are given full names using a path notation.
     • A sample path: /Users/betty/Documents/test.r
     • This file has base name test, with extension .r.
     • It is located in the directory Documents, which is located in the directory betty, which is located in the directory Users.
     • The leading / refers to the highest level, or root, of the filesystem.
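
The nesting described above can be walked with base R's `basename()` and `dirname()`. A minimal sketch using the sample path from the slide; the variable names are illustrative only, and the file does not need to exist because these functions only manipulate the string.

```r
# Walk up the directory tree of the sample path, one level at a time.
p <- "/Users/betty/Documents/test.r"

file_name <- basename(p)            # "test.r"
in_dir    <- dirname(p)             # "/Users/betty/Documents"
parent    <- dirname(in_dir)        # "/Users/betty"
top       <- dirname(parent)        # "/Users"
root      <- dirname(top)           # "/"

# Splitting on "/" shows the components; the leading empty string
# comes from the root slash at the start of the path.
parts <- strsplit(p, "/", fixed = TRUE)[[1]]
print(parts)
```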

  5. Naming files, Windows
     • Files are given full names using a path notation.
     • A sample path: C:\Users\betty\Documents\test.r
     • This file has base name test, with extension .r.
     • It is located in the directory Documents, which is located in the directory betty, which is located in the directory Users.
     • The leading C:\ refers to the highest level, or root, of the filesystem on drive C:.
     • When typing paths to R as strings, you must double the backslashes, or convert them to forward slashes:
         C:\\Users\\sbagley\\Documents\\test.r
         C:/Users/sbagley/Documents/test.r
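
A small sketch of why the backslashes are doubled: the doubling is only an escape in R source code, and the string that R stores contains single backslashes. The variable names are made up for the example, and nothing here touches the disk, so it runs on any operating system.

```r
# Two ways to write the same Windows path as an R string.
p_doubled <- "C:\\Users\\sbagley\\Documents\\test.r"
p_forward <- "C:/Users/sbagley/Documents/test.r"

# cat() shows the string as stored: single backslashes.
cat(p_doubled, "\n")

# Both spellings have the same number of characters, because each
# doubled backslash in the source is one character in the string.
same_length <- nchar(p_doubled) == nchar(p_forward)

# Convert backslashes to forward slashes (fixed = TRUE: no regex).
converted <- gsub("\\", "/", p_doubled, fixed = TRUE)
print(identical(converted, p_forward))
```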

  6. The working directory
     • It is annoying to have to always type the full path to the directory.
     • In R and RStudio, you can set the working directory, which then becomes the default directory for all file operations.
     • When you type a path that does not start with a slash (or a drive letter on Windows), it is interpreted relative to the current working directory.
     • If you don’t set the working directory, it is (probably) your home directory, or the directory from which R was started.
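
In base R the working directory is read with `getwd()` and changed with `setwd()`. The sketch below assumes a scratch directory from `tempdir()` stands in for a real project folder; `note.txt` is a hypothetical file name.

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()
print(old_wd)

# Assumption: tempdir() stands in for a real project folder.
new_wd <- tempdir()
setwd(new_wd)

# "note.txt" has no leading slash, so it is created inside new_wd.
writeLines("hello", "note.txt")
found <- file.exists(file.path(new_wd, "note.txt"))
print(found)

# Clean up and restore the original working directory.
unlink("note.txt")
setwd(old_wd)
```

Restoring the old directory at the end keeps the rest of a session unaffected.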

  7. Using relative pathnames
     • A relative pathname does not start at the root directory.
     • It is interpreted with respect to the current (working) directory.
     • Example: data/file1.csv.
     • In this example, the file should be located in the data subdirectory of the current directory.
     • If you use relative pathnames, then you can easily move or rename the top-level project directory without breaking things.
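
The last point can be checked with a small experiment. This is a sketch under stated assumptions: the project directories are hypothetical scratch folders created with `tempfile()`, and the csv contents are invented for the example.

```r
old_wd <- getwd()

# Hypothetical project layout: <project>/data/file1.csv
project <- tempfile("project_a_")
dir.create(file.path(project, "data"), recursive = TRUE)
writeLines(c("id,weight", "1,70"), file.path(project, "data", "file1.csv"))

# A relative pathname: no leading slash, resolved against the working directory.
rel_path <- file.path("data", "file1.csv")

setwd(project)
before <- file.exists(rel_path)

# Rename the top-level project directory. Step out of it first,
# since some systems refuse to rename the current directory.
setwd(old_wd)
renamed <- tempfile("project_b_")
moved <- file.rename(project, renamed)

# The same relative pathname still works from the renamed project.
setwd(renamed)
after <- file.exists(rel_path)

setwd(old_wd)
unlink(renamed, recursive = TRUE)
print(c(before = before, after = after))
```

An absolute path to `file1.csv` would have stopped working after the rename.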

  8. Package fs for working with files and directories
         library(fs)
     • The easiest way to write code that manipulates files and directories is to use the fs package. It is usually installed as part of tidyverse, but you have to load it separately.

  9. dir_ls: list all files and directories in a directory
         dir_ls()
     • dir_ls takes a number of optional arguments to control which files to return.

  10. Find all csv files in a directory
          dir_ls(path = "/Users/sbagley/temp", glob = "*.csv")
      • glob is jargon for a file name pattern. * matches any sequence of characters.
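
To try globbing without depending on a particular folder, the sketch below builds a scratch directory first. It assumes the fs package is installed; `glob_demo` and the three file names are made up for the example.

```r
library(fs)

# Hypothetical scratch directory with two csv files and one text file.
demo_dir <- path(path_temp(), "glob_demo")
dir_create(demo_dir)
file_create(path(demo_dir, c("a.csv", "b.csv", "notes.txt")))

# Keep only the paths that match the glob pattern.
csv_files <- dir_ls(demo_dir, glob = "*.csv")
print(path_file(csv_files))

# A base R counterpart: glob2rx() turns the glob into a regular expression.
csv_base <- list.files(demo_dir, pattern = utils::glob2rx("*.csv"))
print(csv_base)

dir_delete(demo_dir)
```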

  11. Getting the parts of a path
          p1 <- path("/Users/sbagley/temp/test.csv")
          path_dir(p1)
          [1] "/Users/sbagley/temp"
          path_file(p1)
          [1] "test.csv"
          path_ext(p1)
          [1] "csv"
          path_ext_set(p1, "tsv")
          /Users/sbagley/temp/test.tsv
      • Using these functions is much easier than writing your own functions to manipulate filenames as strings.

  12. Your home directory
          p2 <- path("~/temp/test.csv")
          path_expand(p2)
          /Users/sbagley/temp/test.csv
      • The tilde ~ stands for your home directory.

  13. Combining all files in a directory
      • Sometimes your data are spread across multiple files in a directory. For example, there is one file for each of multiple runs of an experiment.
      • You want to combine them into a single data frame.
      • You want some indicator of which run each row came from.

  14. Step 1: listing all the files in a directory
          library(fs)
          my_dir <- "~/sync/teaching/somgen223/website/data/multiple_runs"
          my_files <- dir_ls(my_dir)
          ## these pathnames are very long. to view them here, remove the
          ## directory part
          path_file(my_files)
          [1] "file1.csv" "file2.csv"

  15. Inspect those files
          read_csv(my_files[1])
          # A tibble: 2 x 2
            gene   value
            <chr>  <dbl>
          1 ABC123  12.5
          2 DEF234 333
          read_csv(my_files[2])
          # A tibble: 2 x 2
            gene  value
            <chr> <dbl>
          1 DKK7    2.2
          2 LEM9    9

  16. Step 2: read and combine into single data frame
          (new_df <- map_df(my_files, read_csv))
          # A tibble: 4 x 2
            gene   value
            <chr>  <dbl>
          1 ABC123  12.5
          2 DEF234 333
          3 DKK7     2.2
          4 LEM9     9
      • map_df applies the second argument, a function, to all the things in the first argument, here a vector of file paths.
      • Each call to read_csv produces a data frame.
      • These data frames are glued together (stacked vertically) to form the answer.
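
The read-then-stack idea can be spelled out in base R. This is a conceptual sketch, not how `map_df` is implemented: it uses `lapply`, `read.csv` and `rbind` in place of the purrr and readr functions, and it recreates the two example files in a hypothetical scratch directory.

```r
# Recreate the two example files in a scratch directory (hypothetical location).
run_dir <- tempfile("multiple_runs_")
dir.create(run_dir)
writeLines(c("gene,value", "ABC123,12.5", "DEF234,333"),
           file.path(run_dir, "file1.csv"))
writeLines(c("gene,value", "DKK7,2.2", "LEM9,9"),
           file.path(run_dir, "file2.csv"))

my_files <- sort(list.files(run_dir, pattern = "\\.csv$", full.names = TRUE))

# Step A: apply the reading function to every path (one data frame per file).
pieces <- lapply(my_files, read.csv)

# Step B: stack the data frames vertically.
new_df <- do.call(rbind, pieces)
print(new_df)

unlink(run_dir, recursive = TRUE)
```

In purrr 1.0.0 and later, `map_df` is marked as superseded; the suggested spelling there is `map()` followed by `list_rbind()`.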

  17. Elaboration: save the filename in the data frame
      • This first solution glues together all the rows, but now you don’t know which file each row came from. (If you are lucky, there will already be a column that indicates this.)
      • So we need to write a function that reads the csv from a file, and adds the filename, which contains the run number as part of the name, as a new column.

  18. How to define your own function
      • R has many built-in functions, but sometimes it is useful to define your own.
      • A function encapsulates a set of expressions that perform some conceptually meaningful chunk of work. This is especially useful when you need to repeat the chunk of work multiple times.

  19. Defining a function
          add1 <- function(v){
            1 + v
          }
          add1(0)
          [1] 1
          add1(-1:1)
          [1] 0 1 2
          x <- 1:5
          add1(x)
          [1] 2 3 4 5 6

  20. How to evaluate a function call
          add1(x)
          [1] 2 3 4 5 6
      1. Evaluate the argument, here, x.
      2. Evaluate the body of the function with the function’s parameter v temporarily bound to the value of that argument.
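
The two steps can be imitated by hand, which may make the word "temporarily" more concrete. This is a simplified model for illustration only: `env` and `manual` are names invented here, and real R evaluates arguments lazily.

```r
add1 <- function(v) {
  1 + v
}

x <- 1:5
result <- add1(x)

# Imitate the call by hand:
# 1. take the value of the argument x,
# 2. bind it to the name v in a fresh environment,
# 3. evaluate the function body there.
env <- new.env()
assign("v", x, envir = env)
manual <- eval(body(add1), envir = env)

print(result)
print(identical(result, manual))
```

The binding of `v` lives only in `env`, so it does not leak into the workspace, much as a real call leaves no `v` behind.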

  21. A function to add the file name as a column
          read_and_record_filename <- function(filename){
            read_csv(filename) %>%
              mutate(filename = path_file(filename))
          }

  22. Try map_df again
          (new_df2 <- map_df(my_files, read_and_record_filename))
          # A tibble: 4 x 3
            gene   value filename
            <chr>  <dbl> <chr>
          1 ABC123  12.5 file1.csv
          2 DEF234 333   file1.csv
          3 DKK7     2.2 file2.csv
          4 LEM9     9   file2.csv

  23. Now convert filename to run number
          new_df2 %>%
            mutate(run_number = as.numeric(str_remove_all(filename, "(file)|(\\.csv)")),
                   filename = NULL) ## remove filename column after we are done with it
          # A tibble: 4 x 3
            gene   value run_number
            <chr>  <dbl>      <dbl>
          1 ABC123  12.5          1
          2 DEF234 333            1
          3 DKK7     2.2          2
          4 LEM9     9            2

  24. Convert filename to run number, version 2
          new_df2 %>%
            mutate(run_number = as.numeric(str_extract(filename, "[0-9]+")),
                   filename = NULL) ## remove filename column after we are done with it
          # A tibble: 4 x 3
            gene   value run_number
            <chr>  <dbl>      <dbl>
          1 ABC123  12.5          1
          2 DEF234 333            1
          3 DKK7     2.2          2
          4 LEM9     9            2
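
The two versions can be compared directly on a vector of file names. A sketch assuming the stringr package is installed; the third name, `run_12.csv`, is a hypothetical file added to show how the approaches differ when a name does not follow the `fileN.csv` pattern.

```r
library(stringr)

# The first two names are from the slides; the third is hypothetical.
filenames <- c("file1.csv", "file2.csv", "run_12.csv")

# Version 1: remove the known prefix and suffix, convert what is left.
# "run_12" is not a number, so the third result is NA (warning suppressed).
v1 <- suppressWarnings(
  as.numeric(str_remove_all(filenames, "(file)|(\\.csv)"))
)

# Version 2: extract the first run of digits, whatever surrounds it.
v2 <- as.numeric(str_extract(filenames, "[0-9]+"))

print(v1)
print(v2)
```

Extracting the digits makes fewer assumptions about the rest of the file name, but it picks up only the first run of digits.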

  25. Reading data in other formats

  26. How to read different formats
      • This course has focused on csv-formatted files: the items on each row are separated by commas, and the header row, if present, follows the same format.
      • But there are other formats.

  27. readr package
      Function      Notes
      read_csv      separator is “,”
      read_csv2     separator is “;”, decimal point is “,”
      read_tsv      separator is “\t” (TAB)
      read_delim    general case
      read_fwf      reads fixed-width fields
      read_table    separator is whitespace
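
As a quick check of how these readers relate, the sketch below writes the same one-row table in three formats and reads each back. It assumes the readr package is installed; the temporary files and their contents are made up for the example.

```r
library(readr)

# The same table written with three different separators.
f_csv  <- tempfile(fileext = ".csv")
f_csv2 <- tempfile(fileext = ".csv")
f_tsv  <- tempfile(fileext = ".tsv")
writeLines(c("gene,value",  "ABC123,12.5"),  f_csv)
writeLines(c("gene;value",  "ABC123;12,5"),  f_csv2)  # decimal comma
writeLines(c("gene\tvalue", "ABC123\t12.5"), f_tsv)

a <- read_csv(f_csv)                  # separator ","
b <- read_csv2(f_csv2)                # separator ";", decimal point ","
c_tsv <- read_tsv(f_tsv)              # separator TAB
d <- read_delim(f_tsv, delim = "\t")  # general case, separator given explicitly

print(c(a$value, b$value, c_tsv$value, d$value))

unlink(c(f_csv, f_csv2, f_tsv))
```

All four calls should recover the same number, 12.5, from differently formatted text.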

  28. How to read a csv file that does not have a header line
          df <- read_csv("file1.csv", col_names = c("id", "weight", "height"))

  29. How to read a csv file that has a header line that you don’t want
          df <- read_csv("file1.csv",
                         col_names = c("id", "weight", "height"),
                         skip = 1)
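
The reason for `skip = 1` shows up when it is left out: the unwanted header line is then read as a row of data. A sketch assuming readr is installed; the file contents and the header names `ID,Wt,Ht` are invented for the example.

```r
library(readr)

# Hypothetical file whose header line uses names we do not want.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,Wt,Ht", "1,70,180", "2,65,170"), tmp)

my_names <- c("id", "weight", "height")

# Without skip: the old header becomes the first data row,
# so every column is read as character.
raw <- read_csv(tmp, col_names = my_names)

# With skip = 1: the old header is dropped and the columns are numeric.
clean <- read_csv(tmp, col_names = my_names, skip = 1)

print(raw)
print(clean)

unlink(tmp)
```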

  30. How to skip over some lines at the beginning of the csv file
          df <- read_csv("file1.csv", skip = 5)
      • This will skip the first 5 lines and start reading on line 6, which is then treated as the header line.

  31. Other read_csv options
      • n_max is the maximum number of rows to read. This can be useful if you want to first work with just part of a very large file.
      • skip_empty_rows (TRUE or FALSE) controls whether to skip completely empty rows. If not skipped, they’ll produce rows of NA values.
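
A short sketch combining `skip` and `n_max`, assuming readr is installed. The file is hypothetical: two lines of preamble, a header line, then three data rows.

```r
library(readr)

# Hypothetical export: two lines of preamble before the real header.
tmp <- tempfile(fileext = ".csv")
writeLines(c("exported by instrument",
             "run 1",
             "id,weight,height",
             "1,70,180",
             "2,65,170",
             "3,80,175"), tmp)

# Skip the two preamble lines, use the next line as the header,
# and read at most two data rows.
df <- read_csv(tmp, skip = 2, n_max = 2)
print(df)

unlink(tmp)
```

Reading only the first few rows this way is a cheap check that the column names and types look right before loading a large file in full.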
