Finding packages, project organization (Steve Bagley)

SLIDE 1

Finding packages, project organization

Steve Bagley

somgen223.stanford.edu 1

SLIDE 2

How to find R packages

  • There are over 15,000 packages available for R.
  • That’s great, but how do you find what you want?
  • Task Views (https://cran.r-project.org/web/views/): human-curated lists of packages for a given area.
  • METACRAN (https://r-pkg.org/): provides some more organization to CRAN.

SLIDE 3

What is stored where

  • R servers (usually CRAN, but also Bioconductor). These contain the packages. Some packages also live on the developer’s website.
  • Your computer. Contains the packages you have installed.
  • Your (project) directory. Contains script (program) files, data files, output files.
  • The workspace. Contains the current variable and function bindings, and packages that have been loaded since starting R.
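
Each of the local locations above can be inspected from within R. The following is a minimal sketch using only base R functions; the variable names lib_dirs and installed are chosen here for illustration and are not from the slides.

```r
# Where installed packages live on this computer (library directories).
lib_dirs <- .libPaths()
print(lib_dirs)

# Names of the packages installed in those libraries.
installed <- rownames(installed.packages())
head(installed)

# The current working (project) directory.
getwd()

# Objects currently bound in the workspace (global environment).
ls()

# Packages attached in this session, as they appear on the search path.
search()
```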

SLIDE 4

Where to put files for your project

  • The most natural organization of a project uses the tree-like structure of a hierarchical file system.
  • For each project, put all the scripts/code in one directory or sub-directory.
  • R (and RStudio) have a notion of the current working directory, which can be set through the graphical user interface, or using R commands (setwd, getwd).
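
As a rough illustration of getwd and setwd, the sketch below switches into a throwaway directory and back. The directory name "dolphin" and the file name "example.csv" are placeholders, and a temporary directory is used so that nothing in a real project is touched.

```r
# Remember where we started so we can return afterwards.
old_dir <- getwd()

# A throwaway directory standing in for a real project directory (hypothetical).
project_dir <- file.path(tempdir(), "dolphin")
dir.create(project_dir, showWarnings = FALSE)

# Make it the current working directory; relative paths now resolve against it.
setwd(project_dir)
writeLines("id,length", "example.csv")  # written inside project_dir
file_found <- file.exists("example.csv")

# Restore the original working directory.
setwd(old_dir)
```

Restoring the original directory at the end keeps later relative paths from silently pointing somewhere unexpected.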

SLIDE 5

The workspace

  • The workspace contains all the functions and variables that you have defined in it (but not deleted from it).
  • You can save the workspace contents, close R, and then restart it, restoring all of the workspace contents.
  • When you restart, all of your data will be there. But you will still need to reload all the packages you use.
  • You might not want to rely on saving the workspace, for two reasons:
      • It may be easier to start over with a fresh workspace than to try undoing some complicated error.
      • You want a written record of reproducible commands (scripts) to create the state, not just the state itself.
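
A small sketch of saving and restoring workspace contents with base R follows. It saves two named objects with save() rather than the whole workspace (save.image() would do that); the object names, the file name, and the script name in the last comment are all illustrative assumptions.

```r
# Define something in the workspace.
x <- c(1, 2, 3)
double_it <- function(v) 2 * v

# Save the named objects to a file (save.image() would save the whole workspace).
ws_file <- file.path(tempdir(), "workspace.RData")
save(x, double_it, file = ws_file)

# Simulate a fresh session by removing the objects, then restore them.
rm(x, double_it)
load(ws_file)
double_it(x)

# The reproducible alternative: keep the commands in a script and rerun it.
# source("make_state.R")   # hypothetical script name
```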

SLIDE 6

Project organization: RStudio

  • Use the Project menu (upper right corner) to create/open a project.

SLIDE 7

Project organization: directories

  1. Make a project directory: .../dolphin/
  2. Make a subdirectory for the input data files: .../dolphin/data
  3. Make a subdirectory for your code/script files: .../dolphin/src
  4. Make a subdirectory for the output files: .../dolphin/output
  5. Make a subdirectory for all PDFs: .../dolphin/figures
  6. Make a subdirectory for any papers: .../dolphin/papers
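
The layout above could be created from R itself. This is a sketch only: the project root is placed in a temporary directory (an assumption made so the example is safe to run), and in practice you would substitute your own parent path.

```r
# Hypothetical project root; a temporary directory keeps the sketch safe to run.
project_root <- file.path(tempdir(), "dolphin")

# Subdirectories named in the list above.
subdirs <- c("data", "src", "output", "figures", "papers")

# Create the root and each subdirectory (recursive = TRUE creates parents as needed).
for (d in file.path(project_root, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```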

SLIDE 8

How to approach a new dataset

  1. Whenever you get a new dataset, record when you got it and where you got it from.
  2. Read the raw data from a file or URL.
  3. Fix column names to make all subsequent manipulation easier.
  4. Figure out the meaning of the data in each column. You may have received a description of the data (“metadata”), or something called a “data dictionary”. If not, you may need to apply your knowledge of the domain and some common sense.
  5. Start testing your assumptions about the data (and about the metadata, which can be wrong). Look for illegal values (completely out of bounds), outliers (possible, but unlikely), missing values, typos, coding errors, inconsistencies.
  6. In general, try to fix the problems by writing a sequence of R expressions (script or R Markdown). This makes your work reproducible: you can rerun the script, or use it on the next version of the data. Try never to modify by hand the source files containing the original data.
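
Steps 2, 3, and 5 might look roughly like the sketch below. The data are made up for illustration, the column-name cleaning rule is just one possible convention, and the 0 to 120 age range is an assumed plausibility bound, not something given in the slides.

```r
# A small made-up dataset standing in for a raw file (all values are illustrative).
raw_text <- "Subject ID,Age (years),Weight.KG
1,34,70.5
2,-5,82.0
3,41,
4,290,65.2"

# Step 2: read the raw data; check.names = FALSE keeps the original headers.
raw <- read.csv(text = raw_text, check.names = FALSE)

# Step 3: fix column names (lower case, runs of non-alphanumerics become "_").
clean <- raw
names(clean) <- tolower(names(clean))
names(clean) <- gsub("[^a-z0-9]+", "_", names(clean))
names(clean) <- gsub("^_|_$", "", names(clean))

# Step 5: test assumptions. Here: ages should lie between 0 and 120 (assumed range).
bad_age <- clean$age_years < 0 | clean$age_years > 120
n_missing <- colSums(is.na(clean))

# Rows with out-of-bounds ages, and the count of missing values per column.
clean[bad_age, ]
n_missing
```

Because the cleaning lives in code and the raw text is never edited, the same steps can be rerun unchanged on a later version of the data.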

SLIDE 9

How to approach data visualization

  • Compared to what: decide how to make a meaningful comparison.
  • This could be: treatment vs control, compared to baseline, compared to some simple null model, trend over time, trend over space.

  • Then display the data to make this comparison visually salient.

SLIDE 10

How to start the exploration

  • Make some assumptions, even very simple or straightforward ones, about the data. Sometimes these are explicitly stated by whoever gives you the data. (They might be wrong.)
  • See if those assumptions hold true.
  • Iterate, trying to build up an explanation (model) in your head.
  • Focus on understanding; make the graph pretty later.

SLIDE 11

Saving figures

  • Use R Markdown to make a computational lab notebook. This will show your entire analysis workflow, and can include data frame tables and figures.

  • You can write a figure out to a file.

SLIDE 12

Saving figures

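## Load ggplot2, which provides ggplot().
library(ggplot2)
## This opens a pdf file for writing.
pdf("../figures/fig27.pdf")
## This plot is sent to the file. print() is needed when the code is run with source().
print(ggplot(iris, aes(Petal.Width, Petal.Length)) +
  geom_point(aes(color = Species)))
## This closes the file.
dev.off()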

  • ../ is the parent directory of the current directory.
  • ../figures/ is the sibling directory, assuming we are in src.
  • pdf writes pdf files.
  • postscript writes postscript files.
  • png writes png files.
  • jpeg writes jpeg files.
  • tiff writes tiff files.
  • svg writes svg files.
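
All of these device functions follow the same open, plot, close pattern. The sketch below uses pdf() with base graphics so that it runs without extra packages; the output path in a temporary directory and the width and height values are assumptions for illustration. Note that pdf() and svg() take sizes in inches, while png(), jpeg(), and tiff() default to pixels.

```r
# Hypothetical output file in a temporary directory.
fig_file <- file.path(tempdir(), "fig27.pdf")

# Open the PDF device; width and height are in inches.
pdf(fig_file, width = 6, height = 4)

# Everything plotted now goes to the file (base graphics used here).
plot(iris$Petal.Width, iris$Petal.Length,
     col = iris$Species,
     xlab = "Petal.Width", ylab = "Petal.Length")

# Close the device so the file is completely written.
dev.off()
```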

SLIDE 13

Kinds of data frames

SLIDE 14

data.frame vs tibble vs data.table

Type        Package     Notes
data.frame  built-in    slow for big data, some odd defaults
tibble      tidyverse   used throughout tidyverse, fast enough
data.table  data.table  very fast, syntax is powerful/complex
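
The three kinds can be converted into one another. A minimal sketch, assuming the tibble and data.table packages may or may not be installed (hence the requireNamespace() guards); the example data are made up.

```r
# Base data.frame: always available.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
class(df)

# tibble and data.table are add-on packages; convert only if they are installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb))
}
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  print(class(dt))
}
```

Both tibbles and data.tables also inherit from data.frame, so functions written for plain data frames generally accept them.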
