GETTING STARTED
WE WANT TO DRAW GOOD DATA GRAPHICS REPRODUCIBLY
Abstraction in Software
Less abstraction: Easy things are awkward. Hard things are straightforward. Really hard things are doable.
More abstraction: Easy things are trivial. Hard things are really awkward. Really hard things are impossible.
Excel D3 Stata Grid ggplot
Two ways to use R and ggplot
1. Do Everything in R
Raw Data -> Read in, Clean & Analyze -> ggplot -> Figures
2. Just use ggplot
Stata, SAS, etc -> Tidy results -> (Read in, likely with some filtering/transformation) -> ggplot -> Figures
THE RIGHT FRAME OF MIND
TYPE OUT YOUR CODE BY HAND
RSTUDIO
ORGANIZE YOUR PROJECTS
USE RMARKDOWN TO REPRODUCE YOUR OWN WORK
This is what we want to end up with: nicely formatted text, plots, and tables.
1. Lorem Ipsum
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
In a Literate Programming approach to documents, chunks of code are processed and replaced with their output
library(ggplot2)
tea <- rnorm(100)
biscuits <- tea + rnorm(100, 0, 1.3)
data <- data.frame(tea, biscuits)
p <- ggplot(data, aes(x = tea, y = biscuits)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(x = "Tea", y = "Biscuits") +
    theme_bw()
print(p)

# Lorem Ipsum
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do *eiusmod tempor* incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
An Rmd document lets you keep your code and notes together in plain text, and produce good-looking output in a range of formats.
notes.Rmd
# Report
We can see this *relationship* in a scatterplot. As you can see, this plot looks pretty nice.
```{r my-code}
p <- ggplot(data, mapping)
p + geom_point()
```
knit in R
notes.html, notes.docx
Report
We can see this relationship in a scatterplot. As you can see, this plot looks pretty nice.
[Scatterplot of y against x]
Markdown puts formatting instructions in plain-text documents
Markdown
# Header
Plain text *italics* **bold** `verbatim`
Footnote.[^1]
[^1]: The footnote.
1. List
2. List
- Bullet 1
- Bullet 2
## Subhead

Output
Header
Plain text italics bold verbatim
Footnote. (1)
1. List
2. List
° Bullet 1
° Bullet 2
Subhead
(1) The footnote.

A Markdown processor turns the marked-up plain text into formatted output in HTML, PDF, DOCX, or other file types.
Header section provides metadata and sets options
Code chunk
Text with Markdown formatting
In RStudio, code chunks can be "played" one at a time
Chunks are replaced by their output when the document is made
Code chunks can have their own names and options
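The parts labelled above can be put together in one small file. The sketch below writes a minimal Rmd document from R. The file name notes.Rmd and the chunk name my-code come from the earlier slide; the title, the chunk options, and the use of the mpg data that ships with ggplot2 are illustrative assumptions.

```r
## Build a minimal Rmd document line by line (contents are illustrative).
rmd_lines <- c(
  "---",                         # header section: metadata and options
  "title: \"Report\"",
  "output: html_document",
  "---",
  "",
  "# Report",                    # text with Markdown formatting
  "",
  "We can see this *relationship* in a scatterplot.",
  "",
  "```{r my-code, fig.width = 8, fig.height = 5}",  # named chunk with options
  "library(ggplot2)",
  "p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))",
  "p + geom_point()",
  "```"
)

## Save it to a temporary folder so nothing in your project is touched.
rmd_file <- file.path(tempdir(), "notes.Rmd")
writeLines(rmd_lines, rmd_file)

## Knitting replaces the chunk with its output. Uncomment this line if
## the rmarkdown package and pandoc are installed:
## rmarkdown::render(rmd_file)
```

In RStudio you would normally create this file through the editor and press the Knit button instead of calling render() yourself.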
RStudio will do all the work for you when it comes to processing your document—i.e., getting it from plain-text Rmd to HTML, Word, or PDF.
GETTING ORIENTED
library(tidyverse)
The Tidyverse
Loading tidyverse: ggplot2    Draw graphs
Loading tidyverse: tibble     Nicer data tables
Loading tidyverse: tidyr      Tidy your data
Loading tidyverse: readr      Get data into R
Loading tidyverse: purrr      Cool functional programming stuff
Loading tidyverse: dplyr      Action verbs for manipulating data
library(socviz)
Course-Specific Library
CODE YOU CAN TYPE AND RUN
## Inside chunks of code, lines beginning with
## the hash character are comments
my_numbers <- c(1, 1, 4, 1, 1, 4, 1)
my_numbers
OUTPUT
## [1] 1 1 4 1 1 4 1
What R Looks Like
FOUR THINGS TO KNOW ABOUT R
1. Everything has a Name
Examples: my_numbers, data, p
Some names are forbidden: FALSE, TRUE, Inf, for, if, break, function
2. Everything is an Object
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
You create objects by assigning a thing to a name
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
named thing "gets" this stuff
The assignment operator, <-, performs the action of creating objects. Use the keyboard shortcut to type it:
Option + - (Mac)
Alt + - (Windows)
3. You do things using functions and operators
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
named thing "gets" this stuff
c() is a function that takes comma-separated numbers or strings and joins them together into a vector.
Functions take inputs, perform actions, produce outputs
mean()
Functions have parentheses at the end of their name. This is where the inputs, or arguments, go.
mean(x = my_numbers)
Named argument. These names are internal to functions. "Input is this object. Calculate the mean of it."
Functions
mean(my_numbers)
If you just write the name of the input, R assigns it to the function’s arguments in the order given.
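A small sketch of the two calling styles side by side, using the my_numbers vector from the slides. The trim argument at the end is a real argument of mean(), but the value 0.2 is an arbitrary choice for illustration.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

## Named argument: the input is matched to mean()'s `x` by name.
by_name <- mean(x = my_numbers)

## Unnamed argument: the input is matched to `x` by position.
by_position <- mean(my_numbers)

## Both calls compute the same thing.
identical(by_name, by_position)

## Arguments after the first are usually clearer when named.
## `trim` drops a fraction of the values from each end before averaging.
mean(my_numbers, trim = 0.2)
```

Typing ?mean at the console shows the argument names a function expects and the order they come in.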
You can assign a function's output to a named object
my_summary <- summary(my_numbers)
my_sd <- sd(my_numbers)
my_summary
my_sd
Objects you create exist until you overwrite or delete them
rm(my_numbers)
my_numbers
## Error: object 'my_numbers' not found
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
Objects are of different classes
class(my_numbers)
Vectors: numeric, character, factor
Arrays: matrix, data.frame, tibble
Models: lm, glm
Things to try on Objects
class(my_numbers)
table(my_numbers)
Notice that these are functions.
x <- c(my_numbers, 5)
y <- c(my_numbers, "hello")
How do x and y differ?
mean(c(my_numbers, my_numbers))
Functions can be nested, and will be evaluated from the inside out.
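One way to explore the question about x and y, and to watch a nested call being evaluated from the inside out. This is a minimal sketch using only base R; the intermediate names inner, stepwise, and nested are made up for illustration.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

x <- c(my_numbers, 5)
y <- c(my_numbers, "hello")

class(x)  # a numeric vector
class(y)  # a character vector: adding "hello" turned every number into text

## A nested call, done in two steps: the inner c() runs first ...
inner <- c(my_numbers, my_numbers)
## ... and its result is then handed to mean().
stepwise <- mean(inner)

## The same thing written as one nested call.
nested <- mean(c(my_numbers, my_numbers))

identical(stepwise, nested)
```

A vector can hold only one type of value, which is why a single string changes the class of the whole vector.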
Some operators
Arithmetic: +, -, *, /, ^
Assignment ("gets"): <- or =
Logical: &, &&, |, ||, !
Relational: <, >, <=, >=, ==, !=
Special: %*%, %in%, %>%
The pipe operator
mean(my_numbers)
my_numbers %>% mean()
This will be very convenient later on
round(mean(my_numbers))
my_numbers %>% mean() %>% round()
Read %>% as "and then"
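The equivalence between nesting and piping can be checked directly. This sketch loads magrittr, the package that defines %>%, on the assumption that it is installed; library(tidyverse) makes the same operator available.

```r
library(magrittr)  # defines %>%

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

## Nested: read from the inside out.
nested <- round(mean(my_numbers))

## Piped: read left to right.
## "Take my_numbers, and then take the mean, and then round it."
piped <- my_numbers %>% mean() %>% round()

identical(nested, piped)
```

The piped version gives the same answer; it only changes the order in which you read the steps.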
R will be Frustrating
We’re going to be adding a lot of objects together.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point()
The "+" goes here, at the end of the line
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
    + geom_point()
not here, at the start of the next line
LET’S GO
library(gapminder)
gapminder
# A tibble: 1,704 x 6
       country continent  year lifeExp      pop gdpPercap
        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
 7 Afghanistan      Asia  1982  39.854 12881816  978.0114
 8 Afghanistan      Asia  1987  40.822 13867957  852.3959
 9 Afghanistan      Asia  1992  41.674 16317921  649.3414
10 Afghanistan      Asia  1997  41.763 22227415  635.3414
# ... with 1,694 more rows
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Named thing gets ... the output of this function ... using these arguments
p
p + geom_point()
Objects created by ggplot() are unusual in that you can add things to them, and they will work as though you wrote all the code at once.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()
Make Some Graphs
ggplot wants you to feed it TIDY DATA
gdp  lifexp  pop  continent
340  65      31   Euro
227  51      200  Amer
909  81      80   Euro
126  40      20   Asia

  country      year  cases   population
1 Afghanistan  1999  745     19987071
2 Afghanistan  2000  2666    20595360
3 Brazil       1999  37737   172006362
4 Brazil       2000  80488   174504898
5 China        1999  212258  1272915272
6 China        2000  213766  1280428583

  country      year  rate
1 Afghanistan  1999  745/19987071
2 Afghanistan  2000  2666/20595360
3 Brazil       1999  37737/172006362
4 Brazil       2000  80488/174504898
5 China        1999  212258/1272915272
6 China        2000  213766/1280428583

  country      1999    2000
1 Afghanistan  745     2666
2 Brazil       37737   80488
3 China        212258  213766

  country      1999        2000
1 Afghanistan  19987071    20595360
2 Brazil       172006362   174504898
3 China        1272915272  1280428583
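The wide tables, with one column per year, are not in the shape ggplot wants. As a rough sketch of how one of them could be tidied, the code below rebuilds the wide table of case counts by hand and reshapes it with pivot_longer() from tidyr. The object names wide_cases and tidy_cases are made up for illustration, and this assumes tidyr version 1.0.0 or later.

```r
library(tidyr)

## The wide table of case counts: one column per year.
wide_cases <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999` = c(745, 37737, 212258),
  `2000` = c(2666, 80488, 213766),
  check.names = FALSE  # keep the year column names exactly as typed
)

## Gather the year columns into two variables: year and cases.
## Each row of the result is one country in one year.
tidy_cases <- pivot_longer(
  wide_cases,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

tidy_cases
```

In the tidy version, year and cases are ordinary columns, so each can be mapped to a feature of a plot.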
GETTING YOUR DATA INTO R
## read_dta(), read_spss(), and read_sas() come from the haven package
library(haven)
read_dta(file = "data/my_stata_file.dta")
read_spss(file = "data/my_spss_file.sav")
read_sas(data_file = "<NAME>", catalog_file = "<NAME>")
my_data <- read_csv(file = "data/organdonation.csv")   ## Field delimiter is ,
read_csv2(file = "data/my_csv_file.csv")               ## Field delimiter is ;
read_table(file = "<NAME>")                            ## Structured but not delimited
Remote URL
url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
organs <- read_csv(file = url)
Local File Path
organs <- read_csv(file = "data/organdonation.csv")
engmort <- read_table(file = "data/mortality.txt", skip = 2, na = ".")
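To see the difference between the comma and semicolon readers without needing any of the course files, here is a small self-contained sketch. It writes two tiny files to a temporary folder and reads each back with readr. The file names are made up, and the single data row is borrowed from the tidy table shown earlier.

```r
library(readr)

## Write the same one-row table twice, with different field delimiters.
comma_file <- file.path(tempdir(), "comma.csv")
semi_file  <- file.path(tempdir(), "semicolon.csv")
writeLines(c("country,year,cases", "Brazil,1999,37737"), comma_file)
writeLines(c("country;year;cases", "Brazil;1999;37737"), semi_file)

## read_csv() expects commas; read_csv2() expects semicolons.
## suppressMessages() hides the column-type report readr prints.
from_comma <- suppressMessages(read_csv(file = comma_file))
from_semi  <- suppressMessages(read_csv2(file = semi_file))

from_comma
from_semi
```

Both calls should return the same one-row table; using the wrong reader for a file typically leaves everything crammed into a single column.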
HOW ggplot WORKS
ggplot’s FLOW OF ACTION
1. Tidy Data
ggplot(data = gapminder)
gdp  lifexp  pop  continent
340  65      31   Euro
227  51      200  Amer
909  81      80   Euro
126  40      20   Asia
2. Mapping
ggplot(mapping = aes(x = …))
x=gdp y=lifexp size=pop color=continent
3. Geom
geom_point()
4. Coordinate System
x, y
5. Scales
y, log10 x
6. Labels & Guides
[Figure: "A Gapminder Plot". Life Expectancy against log GDP, with legends for Continent (Asia, Euro, Amer) and Population (0-35, 36-100, >100).]
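The six steps can be sketched in code using the four-row toy table from the slide. This is an illustration under assumptions, not the code behind the figure: the object names toy, p, and p_full are made up, and population is mapped as a continuous size here, whereas the slide's legend shows it in three bins.

```r
library(ggplot2)

## 1. Tidy data: the toy table from the slide.
toy <- data.frame(
  gdp = c(340, 227, 909, 126),
  lifexp = c(65, 51, 81, 40),
  pop = c(31, 200, 80, 20),
  continent = c("Euro", "Amer", "Euro", "Asia")
)

## 2. Mapping: link variables to visible features of the plot.
p <- ggplot(data = toy,
            mapping = aes(x = gdp, y = lifexp,
                          size = pop, color = continent))

## 3. Geom, 4. coordinate system, 5. scales, 6. labels and guides.
p_full <- p +
  geom_point() +
  coord_cartesian() +
  scale_x_log10() +
  labs(x = "log GDP", y = "Life Expectancy",
       title = "A Gapminder Plot",
       color = "Continent", size = "Population")

p_full
```

coord_cartesian() is the default coordinate system, so it is written out here only to make step 4 visible.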
PIECE BY PIECE
head(gapminder)
## # A tibble: 6 × 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
dim(gapminder)
## [1] 1704    6
p <- ggplot(data = gapminder)
Create a ggplot object. The data is the gapminder table.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
mapping: tell ggplot the variables you want represented by features of the plot
- The mapping = aes(...) instruction links variables to things you will see on the plot.
- The x and y values are the most obvious ones.
- Other aesthetic mappings can include, e.g., color, shape, and size.
Mappings do not directly specify the particular colors, shapes, or line styles that will appear on the plot. Rather, they establish which variables in the data will be represented by which visible features on the plot.
p + geom_point()
Add a geom layer to the plot
p + geom_smooth()
Try a different geom
This process is literally additive
p + geom_point() +
    geom_smooth() +
    scale_x_log10(labels = scales::dollar)
p + geom_point() +
    geom_smooth(method = "lm")
Every geom is a function. Functions take arguments.
Keep Layering
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
    geom_smooth(method = "lm") +
    scale_x_log10(labels = scales::dollar)
p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::dollar) +
    labs(x = "GDP Per Capita",
         y = "Life Expectancy in Years",
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Data source: Gapminder")
MAPPING vs SETTING AESTHETICS
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = "purple"))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()
What has gone wrong here?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(color = "purple") +
    geom_smooth(method = "loess") +
    scale_x_log10()
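One way to see what went wrong is to ask ggplot which colors it actually computed for the points. Inside aes(), "purple" is treated as data: a new variable with the single value "purple", which gets a default palette color and a legend. Outside aes(), it is a fixed property of the geom. The sketch below uses the mpg data that ships with ggplot2 as a stand-in for gapminder so that it runs on its own; the object names p_mapped and p_set are made up.

```r
library(ggplot2)

## Mapped: aes() treats "purple" as a variable with one value.
p_mapped <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy, color = "purple")) +
  geom_point()

## Set: the color is a fixed property of the geom, outside aes().
p_set <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "purple")

## layer_data() returns the values ggplot computed for a layer.
mapped_colors <- unique(layer_data(p_mapped, 1)$colour)
set_colors <- unique(layer_data(p_set, 1)$colour)

mapped_colors  # a default palette color, not purple
set_colors     # "purple"
```

The rule of thumb: variables go inside aes(); fixed values go outside it, in the geom.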
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
## In ggplot2 3.4.0 and later, use linewidth = 2 instead of size = 2 for lines
p + geom_point(alpha = 0.3) +
    geom_smooth(color = "orange", se = FALSE, size = 2, method = "lm") +
    scale_x_log10()
Here, some aesthetics are mapped, and some are set
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent, fill = continent))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()
MAP or SET AESTHETICS per geom
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()
PAY CLOSE ATTENTION TO HOW SCALES ARE DRAWN, AND WHY
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent, fill = continent))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()
REMEMBER: EVERY MAPPED VARIABLE HAS A SCALE
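The difference between mapping in ggplot() and mapping inside a single geom can be checked by counting how many separate smoothers get drawn. This sketch again uses the built-in mpg data as a stand-in for gapminder, with drv (three kinds of drive train) playing the role of continent, and method = "lm" to keep the fit simple. The object names are made up.

```r
library(ggplot2)

## Color mapped in ggplot(): every geom inherits it,
## so geom_smooth() fits one line per group.
p_all <- ggplot(data = mpg,
                mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm")

## Color mapped in geom_point() only: the smoother sees no groups
## and fits a single line through all the points.
p_one <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(method = "lm")

## Layer 2 is geom_smooth(); count the groups it was computed for.
groups_all <- length(unique(layer_data(p_all, 2)$group))
groups_one <- length(unique(layer_data(p_one, 2)$group))

groups_all
groups_one
```

Both plots still carry a color scale and legend, because in both a variable is mapped to color somewhere.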
Saving Your Work
With ggsave()
ggsave("figures/my_figure.png")
ggsave("my_figure.pdf")
ggsave("my_figure.pdf", plot = p5, scale = 1.2)
ggsave("figures/my-figure.pdf", plot = p5, width = 8, height = 5)
With pdf() or other graphics devices
pdf(file = "plot.pdf", height = 5, width = 5)   ## Open device (sizes are in inches) ...
print(p4)
dev.off()                                       ## ... and close when done
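Both approaches can be tried without touching your project folder by saving to a temporary directory. This is a sketch under assumptions: the plot is a stand-in built from the mpg data that ships with ggplot2, and the object names ggsave_file and device_file are made up.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

## With ggsave(): the file extension picks the format,
## and width and height are in inches by default.
ggsave_file <- file.path(tempdir(), "my_figure.pdf")
ggsave(ggsave_file, plot = p, width = 8, height = 5)

## With a graphics device: open it, print the plot, close it.
device_file <- file.path(tempdir(), "plot.pdf")
pdf(file = device_file, height = 5, width = 5)
print(p)
dev.off()

## Both files should now exist.
file.exists(c(ggsave_file, device_file))
```

If you leave out the plot argument, ggsave() saves the most recently displayed plot.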
Within an Rmd chunk
```{r my_plot, fig.cap="My Plot", fig.width=9, fig.height=8}
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()
```
Set defaults in your first code chunk
knitr::opts_chunk$set(fig.width=8, fig.height=5)