GETTING STARTED WE WANT TO DRAW GOOD DATA GRAPHICS REPRODUCIBLY - - PowerPoint PPT Presentation



slide-1
SLIDE 1

GETTING STARTED

slide-2
SLIDE 2

WE WANT TO DRAW GOOD DATA GRAPHICS REPRODUCIBLY

slide-3
SLIDE 3

Abstraction in Software

Less abstraction: easy things are awkward, hard things are straightforward, really hard things are doable.

More abstraction: easy things are trivial, hard things are really awkward, really hard things are impossible.

Excel D3 Stata Grid ggplot

slide-4
SLIDE 4

Two ways to use R and ggplot

slide-5
SLIDE 5
1. Do Everything in R

Raw Data -> Read in, Clean & Analyze -> ggplot -> Figures

slide-6
SLIDE 6
2. Just use ggplot

Stata, SAS, etc. -> Tidy results -> (Read in, likely with some filtering/transformation) -> ggplot -> Figures

slide-7
SLIDE 7

THE RIGHT FRAME OF MIND

slide-8
SLIDE 8

TYPE OUT YOUR CODE BY HAND

slide-9
SLIDE 9

RSTUDIO

slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

ORGANIZE YOUR PROJECTS

slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
slide-16
SLIDE 16

Use RMarkdown TO REPRODUCE YOUR OWN WORK

slide-17
SLIDE 17

This is what we want to end up with: nicely formatted text, plots, and tables.

  • 1. Lorem Ipsum

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enimad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

slide-18
SLIDE 18

In a Literate Programming approach to documents, chunks of code are processed and replaced with their output

library(ggplot2)

tea <- rnorm(100)
biscuits <- tea + rnorm(100, 0, 1.3)
data <- data.frame(tea, biscuits)

p <- ggplot(data, aes(x = tea, y = biscuits)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(x = "Tea", y = "Biscuits") +
    theme_bw()
print(p)

# Lorem Ipsum

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do *eiusmod tempor* incididunt ut labore et dolore magna aliqua. Ut enimad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
slide-19
SLIDE 19

In a Literate Programming approach to documents, chunks of code are processed and replaced with their output

  • 1. Lorem Ipsum

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enimad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

slide-20
SLIDE 20

An Rmd document lets you keep your code and notes together in plain text and produce good-looking output in a range of formats.
slide-21
SLIDE 21

An Rmd document lets you keep your code and notes together in plain text and produce good-looking output in a range of formats.

notes.Rmd

# Report

We can see this *relationship* in a scatterplot. As you can see, this plot looks pretty nice.

```{r my-code}
p <- ggplot(data, mapping)
p + geom_point()
```

knit in R

notes.html

Report

We can see this relationship in a scatterplot. As you can see, this plot looks pretty nice.

[Figure: scatterplot with x and y axes]
slide-22
SLIDE 22

An Rmd document lets you keep your code and notes together in plain text and produce good-looking output in a range of formats.

notes.Rmd

# Report

We can see this *relationship* in a scatterplot. As you can see, this plot looks pretty nice.

```{r my-code}
p <- ggplot(data, mapping)
p + geom_point()
```

knit in R

notes.docx

Report

We can see this relationship in a scatterplot. As you can see, this plot looks pretty nice.

[Figure: scatterplot with x and y axes]
slide-23
SLIDE 23

Markdown puts formatting instructions in plain-text documents

Markdown

# Header

Plain text *italics* **bold** `verbatim`

Footnote.[^1]

[^1]: The footnote.

1. List
2. List

- Bullet 1
- Bullet 2

## Subhead

Output

Header

Plain text italics bold verbatim

Footnote. (1)

1. List
2. List

• Bullet 1
• Bullet 2

Subhead

(1) The footnote.

A Markdown processor turns the marked-up plain text into actually formatted output in HTML, PDF, DOCX, or other file types.

slide-24
SLIDE 24

Header section provides metadata and sets options

Code chunk

Text with Markdown formatting

In RStudio, code chunks can be "played" one at a time

Chunks are replaced by their output when the document is made

Code chunks can have their own names and options
slide-25
SLIDE 25

RStudio will do all the work for you when it comes to processing your document—i.e., getting it from plain-text Rmd to HTML, Word, or PDF.
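The same processing step can also be run from the R console. The following is a minimal sketch, assuming the rmarkdown package and pandoc are installed. The file name and its contents are invented for illustration, and the render step is skipped when those tools are missing.

```r
# Build a tiny Rmd file in a temporary folder (contents are illustrative).
fence <- strrep("`", 3)  # three backticks, built here so the example stays readable
rmd_path <- file.path(tempdir(), "notes.Rmd")
writeLines(c(
  "---",
  "title: \"Report\"",
  "---",
  "",
  "We can see this *relationship* in a scatterplot.",
  "",
  paste0(fence, "{r my-code}"),
  "plot(1:10, (1:10)^2)",
  fence
), rmd_path)

# Render to HTML only if rmarkdown and pandoc are available on this machine.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  out_file <- rmarkdown::render(rmd_path, output_format = "html_document",
                                quiet = TRUE)
  print(out_file)  # path to the rendered HTML file
}
```

Swapping "html_document" for "word_document" or "pdf_document" should give the other formats; PDF output additionally needs a LaTeX installation.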

  • 1. Lorem Ipsum

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enimad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

slide-26
SLIDE 26

GETTING ORIENTED

slide-27
SLIDE 27

library(tidyverse)

The Tidyverse

Loading tidyverse: ggplot2    Draw graphs
Loading tidyverse: tibble     Nicer data tables
Loading tidyverse: tidyr      Tidy your data
Loading tidyverse: readr      Get data into R
Loading tidyverse: purrr      Cool functional programming stuff
Loading tidyverse: dplyr      Action verbs for manipulating data

library(socviz)               Course-specific library

slide-28
SLIDE 28

CODE YOU CAN TYPE AND RUN

## Inside chunks of code, lines beginning with
## the hash character are comments
my_numbers <- c(1, 1, 4, 1, 1, 4, 1)

my_numbers
## [1] 1 1 4 1 1 4 1

OUTPUT

What R Looks Like

slide-29
SLIDE 29

FOUR THINGS TO KNOW ABOUT R

slide-30
SLIDE 30

1: Everything has a Name

FALSE TRUE Inf for if break function

Some names are forbidden

my_numbers data p

slide-31
SLIDE 31
2: Everything is an Object

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

You create objects by assigning a thing to a name

named thing "gets" this stuff

slide-32
SLIDE 32

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

You create objects by assigning a thing to a name

The assignment operator performs the action of creating objects. Use the keyboard shortcut to type it:

Option + - (Mac)

Alt + - (Windows)

slide-33
SLIDE 33
3: You do things using functions and operators

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

named thing "gets" this stuff

c() is a function that takes comma-separated numbers or strings and joins them together into a vector

slide-34
SLIDE 34

take inputs, perform actions, produce outputs

mean()

Functions have parentheses at the end of their name. This is where the inputs, or arguments, go.

mean(x = my_numbers)

Named argument. These names are internal to functions. "Input is this object. Calculate the mean of it."

Functions

slide-35
SLIDE 35

mean(my_numbers)

If you just write the name of the input, R assigns it to the function’s arguments in the order given.
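A short sketch of the two calling styles side by side, using only base R. The `trim` argument of `mean()` is brought in here purely to show that named arguments can be given in any order; it is not part of the original slides.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# Named argument: the input is matched to x explicitly.
by_name <- mean(x = my_numbers)

# Unnamed argument: R matches it to the first argument, which is x.
by_position <- mean(my_numbers)

identical(by_name, by_position)

# When arguments are named, their order does not matter.
trim_a <- mean(x = my_numbers, trim = 0.2)
trim_b <- mean(trim = 0.2, x = my_numbers)
identical(trim_a, trim_b)
```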

take inputs, perform actions, produce outputs

Functions

slide-36
SLIDE 36

You can assign a function's output to a named object

my_summary <- summary(my_numbers)
my_sd <- sd(my_numbers)

my_summary
my_sd

slide-37
SLIDE 37

Objects you create exist until you overwrite or delete them

rm(my_numbers)
my_numbers
## Error: object 'my_numbers' not found

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

slide-38
SLIDE 38

Objects are of different classes

class(my_numbers)

numeric character factor

Vectors

matrix data.frame tibble

Arrays

lm glm

Models

slide-39
SLIDE 39

Things to try on Objects

class(my_numbers)
table(my_numbers)
x <- c(my_numbers, 5)
y <- c(my_numbers, "hello")
mean(c(my_numbers, my_numbers))

Notice that these are functions

How do x and y differ?

Functions can be nested, and will be evaluated from the inside out.
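One way to explore the question about x and y is to ask R for the class of each. This base R sketch also unpacks the nested call into separate steps, to show that the inner function runs first; the names `inner` and `outer_result` are made up for the illustration.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
x <- c(my_numbers, 5)
y <- c(my_numbers, "hello")

class(x)  # "numeric"
class(y)  # "character": one string turns every element into text
y[1]      # "1"

# A nested call is evaluated from the inside out:
# c() builds the vector first, then mean() summarizes it.
inner <- c(my_numbers, my_numbers)
outer_result <- mean(inner)
identical(outer_result, mean(c(my_numbers, my_numbers)))
```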

slide-40
SLIDE 40

Some operators

Arithmetic: +, -, *, /, ^

Assignment ("gets"): <- or =

Logical: &, &&, |, ||, !

Relational: <, >, <=, >=, ==, !=

Special: %*%, %in%, %>%
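A few of these operators in action on the `my_numbers` vector, as a base R sketch. The object names other than `my_numbers` are made up for the illustration.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# Relational operators compare element by element.
big <- my_numbers > 2

# & combines logical vectors element by element;
# && is meant for single TRUE/FALSE values.
mid_range <- (my_numbers > 2) & (my_numbers < 10)
has_any_big <- any(big) && length(my_numbers) > 0

# %in% tests membership.
25 %in% my_numbers
4 %in% my_numbers

# %*% is matrix multiplication (here, a dot product).
dot <- c(1, 2) %*% c(3, 4)
```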

slide-41
SLIDE 41

The pipe operator

mean(my_numbers)
my_numbers %>% mean()

round(mean(my_numbers))
my_numbers %>% mean() %>% round()

%>% reads as "and then"

This will be very convenient later on

slide-42
SLIDE 42

R will be Frustrating

We’re going to be adding a lot of objects together.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point()

"+" goes here, at the end of the line

ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
    + geom_point()

not here, at the start of the next line

slide-43
SLIDE 43

LET’S GO

slide-44
SLIDE 44

library(gapminder) gapminder

# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952  28.801  8425333  779.4453
 2 Afghanistan Asia       1957  30.332  9240934  820.8530
 3 Afghanistan Asia       1962  31.997 10267083  853.1007
 4 Afghanistan Asia       1967  34.020 11537966  836.1971
 5 Afghanistan Asia       1972  36.088 13079460  739.9811
 6 Afghanistan Asia       1977  38.438 14880372  786.1134
 7 Afghanistan Asia       1982  39.854 12881816  978.0114
 8 Afghanistan Asia       1987  40.822 13867957  852.3959
 9 Afghanistan Asia       1992  41.674 16317921  649.3414
10 Afghanistan Asia       1997  41.763 22227415  635.3414
# ... with 1,694 more rows

slide-45
SLIDE 45

p + geom_point()

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

p

Named thing gets … the output of this function … using these arguments.

Objects created by ggplot() are unusual in that you can add things to them, and they will work as though you wrote all the code at once.

slide-46
SLIDE 46

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()

slide-47
SLIDE 47

[Figure: scatterplot of lifeExp (y-axis) against gdpPercap (x-axis)]
slide-48
SLIDE 48

Make Some Graphs

slide-49
SLIDE 49

ggplot wants you to feed it TIDY DATA

slide-50
SLIDE 50

gdp  lifexp  pop  continent
340  65      31   Euro
227  51      200  Amer
909  81      80   Euro
126  40      20   Asia

slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54
slide-55
SLIDE 55
slide-56
SLIDE 56

  country      year  cases   population
1 Afghanistan  1999  745     19987071
2 Afghanistan  2000  2666    20595360
3 Brazil       1999  37737   172006362
4 Brazil       2000  80488   174504898
5 China        1999  212258  1272915272
6 China        2000  213766  1280428583

slide-57
SLIDE 57

   country      year  key         value
 1 Afghanistan  1999  cases       745
 2 Afghanistan  1999  population  19987071
 3 Afghanistan  2000  cases       2666
 4 Afghanistan  2000  population  20595360
 5 Brazil       1999  cases       37737
 6 Brazil       1999  population  172006362
 7 Brazil       2000  cases       80488
 8 Brazil       2000  population  174504898
 9 China        1999  cases       212258
10 China        1999  population  1272915272
11 China        2000  cases       213766
12 China        2000  population  1280428583
slide-58
SLIDE 58
slide-59
SLIDE 59

  country      year  rate
1 Afghanistan  1999  745/19987071
2 Afghanistan  2000  2666/20595360
3 Brazil       1999  37737/172006362
4 Brazil       2000  80488/174504898
5 China        1999  212258/1272915272
6 China        2000  213766/1280428583

slide-60
SLIDE 60

  country      1999    2000
1 Afghanistan  745     2666
2 Brazil       37737   80488
3 China        212258  213766

  country      1999        2000
1 Afghanistan  19987071    20595360
2 Brazil       172006362   174504898
3 China        1272915272  1280428583
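A wide table like the first one above can be reshaped so that each row is one country-year. The slides do not name a function for this; the sketch below assumes the tidyr package and its `pivot_longer()` function, and the object names `cases_wide` and `cases_long` are made up for the illustration.

```r
library(tidyr)

# The wide table of case counts, typed in from the slide.
cases_wide <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999` = c(745, 37737, 212258),
  `2000` = c(2666, 80488, 213766),
  check.names = FALSE  # keep the year column names as they are
)

# Move the two year columns into a year column and a cases column.
cases_long <- pivot_longer(
  cases_wide,
  cols = c("1999", "2000"),
  names_to = "year",
  values_to = "cases"
)

cases_long
```

The result has six rows and three columns (country, year, cases), which is the shape ggplot expects.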

slide-61
SLIDE 61
slide-62
SLIDE 62

GETTING YOUR DATA INTO R

slide-63
SLIDE 63

read_dta(file = "data/my_stata_file.dta")
read_spss(file = "data/my_spss_file.sav")
read_sas(data_file = "<NAME>", catalog_file = "<NAME>")

my_data <- read_csv(file = "data/organdonation.csv")    # Field delimiter is ,
read_csv2(file = "data/my_csv_file.csv")                # Field delimiter is ;
read_table(file = "<NAME>")                             # Structured but not delimited

slide-64
SLIDE 64

url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"

organs <- read_csv(file = url)                         # Remote URL
organs <- read_csv(file = "data/organdonation.csv")    # Local file path

slide-65
SLIDE 65

engmort <- read_table(file = "data/mortality.txt", skip = 2, na = ".")
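To see what `skip` and `na` do, here is a small sketch assuming the readr package. The file written below is invented for the illustration (it is not the course's mortality.txt): two lines of notes come before the column names, and a lone "." marks a missing value.

```r
library(readr)

# Write a small whitespace-separated file with made-up values.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Hypothetical mortality file",
  "All values below are invented",
  "Year Age Rate",
  "1841 0 0.136",
  "1841 1 .",
  "1842 0 0.142"
), path)

# skip = 2 drops the two note lines; na = "." reads "." as missing.
demo <- read_table(file = path, skip = 2, na = ".")
demo
```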

slide-66
SLIDE 66

HOW ggplot WORKS

slide-67
SLIDE 67

ggplot’s FLOW OF ACTION

slide-68
SLIDE 68
slide-69
SLIDE 69

gdp  lifexp  pop  continent
340  65      31   Euro
227  51      200  Amer
909  81      80   Euro
126  40      20   Asia

slide-70
SLIDE 70

[Figure: "A Gapminder Plot". x-axis: log GDP; y-axis: Life Expectancy; legends: Continent (Asia, Euro, Amer) and Population (0-35, 36-100, >100)]

slide-71
SLIDE 71

1. Tidy Data

gdp  lifexp  pop  continent
340  65      31   Euro
227  51      200  Amer
909  81      80   Euro
126  40      20   Asia

ggplot(data = gapminder)

2. Mapping

x = gdp, y = lifexp, size = pop, color = continent

ggplot(mapping = aes(x = …))

3. Geom

geom_point()

slide-72
SLIDE 72

4. Coordinate System: x, y

5. Scales: y, log10 x

6. Labels & Guides: title "A Gapminder Plot"; axis labels "log GDP" and "Life Expectancy"; legends for Continent (Asia, Euro, Amer) and Population (0-35, 36-100, >100)
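The six steps can be sketched in one piece of code, using the four-row table from the slides. This assumes the ggplot2 package; the object names `toy` and `p_toy` are made up for the illustration, and `coord_cartesian()` is written out only to make the coordinate step visible (it is ggplot's default).

```r
library(ggplot2)

# 1. Tidy data: the small table from the slides.
toy <- data.frame(
  gdp = c(340, 227, 909, 126),
  lifexp = c(65, 51, 81, 40),
  pop = c(31, 200, 80, 20),
  continent = c("Euro", "Amer", "Euro", "Asia")
)

# 2. Mapping: link variables to visible features of the plot.
p_toy <- ggplot(data = toy,
                mapping = aes(x = gdp, y = lifexp,
                              size = pop, color = continent)) +
  geom_point() +        # 3. Geom
  coord_cartesian() +   # 4. Coordinate system (the default)
  scale_x_log10() +     # 5. Scales
  labs(x = "log GDP", y = "Life Expectancy",   # 6. Labels & guides
       title = "A Gapminder Plot",
       color = "Continent", size = "Population")

p_toy
```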

slide-73
SLIDE 73

Asia Euro Amer 0-35 36-100 >100

log GDP

Life Expectancy

A Gapminder Plot

Continent Population

slide-74
SLIDE 74

PIECE BY PIECE

slide-75
SLIDE 75

head(gapminder)

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952  28.801  8425333  779.4453
## 2 Afghanistan Asia       1957  30.332  9240934  820.8530
## 3 Afghanistan Asia       1962  31.997 10267083  853.1007
## 4 Afghanistan Asia       1967  34.020 11537966  836.1971
## 5 Afghanistan Asia       1972  36.088 13079460  739.9811
## 6 Afghanistan Asia       1977  38.438 14880372  786.1134

dim(gapminder)
## [1] 1704 6

slide-76
SLIDE 76

p <- ggplot(data = gapminder)

Create a ggplot object

Data is the gapminder table

slide-77
SLIDE 77

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

mapping: tell ggplot the variables you want represented by features of the plot

slide-78
SLIDE 78
  • The mapping = aes(...) instruction links variables to things you will see on the plot.
  • The x and y values are the most obvious ones.
  • Other aesthetic mappings can include, e.g., color, shape, and size.

slide-79
SLIDE 79

Mappings do not directly specify the particular colors, shapes, or line styles that will appear on the plot. Rather, they establish which variables in the data will be represented by which visible features on the plot.

slide-80
SLIDE 80

p + geom_point()

Add a geom layer to the plot

slide-81
SLIDE 81
slide-82
SLIDE 82

p + geom_smooth()

Try a different geom

slide-83
SLIDE 83
slide-84
SLIDE 84

This process is literally additive

p + geom_point() +
    geom_smooth() +
    scale_x_log10(labels = scales::dollar)

slide-85
SLIDE 85

p + geom_point() + geom_smooth(method = "lm")

Every geom is a function. Functions take arguments.

slide-86
SLIDE 86
slide-87
SLIDE 87

Keep Layering

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
    geom_smooth(method = "lm") +
    scale_x_log10(labels = scales::dollar)

slide-88
SLIDE 88
slide-89
SLIDE 89

p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::dollar) +
    labs(x = "GDP Per Capita",
         y = "Life Expectancy in Years",
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Data source: Gapminder")

slide-90
SLIDE 90
slide-91
SLIDE 91

MAPPING vs SETTING AESTHETICS

slide-92
SLIDE 92

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp, color = "purple"))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

slide-93
SLIDE 93

What has gone wrong here?

slide-94
SLIDE 94

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(color = "purple") +
    geom_smooth(method = "loess") +
    scale_x_log10()
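The difference between the two versions can be checked by looking at the colors ggplot actually computes. This is a sketch assuming the ggplot2 package; it uses the `mpg` data that ships with ggplot2 rather than gapminder, and the names `p_mapped` and `p_set` are made up for the illustration.

```r
library(ggplot2)

# Mapped: inside aes(), "purple" is treated as a one-value variable,
# so ggplot assigns it a color from its default scale and adds a legend.
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = "purple")) +
  geom_point()

# Set: outside aes(), "purple" is passed to the geom as a fixed color.
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "purple")

# layer_data() returns the values ggplot computed for a layer.
mapped_colours <- unique(layer_data(p_mapped, 1)$colour)
set_colours <- unique(layer_data(p_set, 1)$colour)

mapped_colours  # a color chosen by the scale, not "purple"
set_colours     # "purple"
```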

slide-95
SLIDE 95
slide-96
SLIDE 96

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) +
    geom_smooth(color = "orange", se = FALSE, size = 2, method = "lm") +
    scale_x_log10()

Here, some aesthetics are mapped, and some are set

slide-97
SLIDE 97
slide-98
SLIDE 98

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp,
                          color = continent, fill = continent))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

slide-99
SLIDE 99
slide-100
SLIDE 100

MAP or SET AESTHETICS per geom

slide-101
SLIDE 101

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()

slide-102
SLIDE 102
slide-103
SLIDE 103

PAY CLOSE ATTENTION TO HOW SCALES ARE DRAWN, AND WHY

slide-104
SLIDE 104

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp,
                          color = continent, fill = continent))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

slide-105
SLIDE 105

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()
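The two versions differ in how many smoothers get drawn. A sketch of that comparison, assuming the ggplot2 package and using its built-in `mpg` data (with `drv` standing in for continent) so that it runs without gapminder; `p_global` and `p_local` are made-up names, and a linear smoother is used only to keep the example light.

```r
library(ggplot2)

# Color mapped in ggplot(): every geom inherits it,
# so geom_smooth() fits one line per group.
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Color mapped only in geom_point(): the smoother sees no grouping
# and fits a single line to all the data.
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Count the groups in the smoother layer (layer 2) of each plot.
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local <- length(unique(layer_data(p_local, 2)$group))
c(global = n_global, local = n_local)
```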

slide-106
SLIDE 106

REMEMBER: EVERY MAPPED VARIABLE HAS A SCALE

slide-107
SLIDE 107

Saving Your Work

slide-108
SLIDE 108

ggsave()

ggsave("figures/my_figure.png")
ggsave("my_figure.pdf")
ggsave("my_figure.pdf", plot = p5, scale = 1.2)
ggsave("figures/my-figure.pdf", plot = p5, width = 8, height = 5)
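A runnable version of the last call, as a sketch assuming the ggplot2 package. The plot `p5` is built here from ggplot2's `mpg` data just so there is something to save, and the file goes to a temporary folder rather than a `figures/` directory.

```r
library(ggplot2)

# A stand-in plot object; any ggplot object would do.
p5 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Save to a temporary folder; width and height are in inches by default.
out_path <- file.path(tempdir(), "my-figure.pdf")
ggsave(out_path, plot = p5, width = 8, height = 5)

file.exists(out_path)
```

Naming the plot explicitly with `plot =` avoids relying on ggsave's default of saving whichever plot was displayed last.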

With ggsave

slide-109
SLIDE 109

With pdf() or other graphics devices

pdf(file = "plot.pdf", height = 5, width = 5)    # Open device (dimensions are in inches) …
print(p4)
dev.off()                                        # … and close when done

slide-110
SLIDE 110

Within an Rmd chunk

```{r my_plot, fig.cap="My Plot", fig.width=9, fig.height=8}
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()
```

Set defaults in your first code chunk

knitr::opts_chunk$set(fig.width=8, fig.height=5)

slide-111
SLIDE 111

Getting Help

slide-112
SLIDE 112