Tidyverse wrapup Steve Bagley somgen223.stanford.edu 1 Making - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Tidyverse wrapup

Steve Bagley

somgen223.stanford.edu 1

slide-2
SLIDE 2

Making numbers into factors using numeric ranges


slide-3
SLIDE 3

Making numbers into factors using numeric ranges

  • We use factors for grouping, but numbers themselves do not make very good groups. Would you want to group together all subjects with a weight of exactly 12.5?
  • Instead, we set up non-overlapping intervals, and use those as the factor values.
  • Example: 0–10, 10–20, 20–30 (a value on a boundary, such as 10, is assigned to only one of the intervals).
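As a minimal sketch of the interval idea, base R's cut function can build a factor from explicit break points. The weight values here are made up for illustration; by default each interval is open on the left and closed on the right.

```r
# Hypothetical weights to be grouped into the ranges 0-10, 10-20, 20-30.
weight <- c(5, 10, 12.5, 25)

# breaks gives the interval boundaries; intervals are (a, b] by default,
# so the boundary value 10 lands in (0,10] and not in (10,20].
weight_group <- cut(weight, breaks = c(0, 10, 20, 30))

levels(weight_group)   # the three interval labels
table(weight_group)    # how many values fall in each interval
```

Because the result is a factor, it can be used directly for grouping.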


slide-4
SLIDE 4

Example

y <- c(1, 2, 3, 4, 5)
cut_number(y, n = 2)
[1] [1,3] [1,3] [1,3] (3,5] (3,5]
Levels: [1,3] (3,5]

  • cut_number tries to create n bins with approximately the same number of values in each bin.
  • It returns a factor vector using a special symbolic code for the ranges.
  • The interval (a,b] spans from a to b, open on the left end, and closed on the right. This does not include a, but does include b.
  • Note the levels of the factor.


slide-5
SLIDE 5

Example

tibble(y = y, y_cut = cut_number(y, n = 2))
# A tibble: 5 x 2
      y y_cut
  <dbl> <fct>
1     1 [1,3]
2     2 [1,3]
3     3 [1,3]
4     4 (3,5]
5     5 (3,5]


slide-6
SLIDE 6

cut_interval

z <- c(1, 1, 1, 2, 4, 5)
cut_number(z, n = 2)
[1] [1,1.5] [1,1.5] [1,1.5] (1.5,5] (1.5,5] (1.5,5]
Levels: [1,1.5] (1.5,5]
cut_interval(z, n = 2)
[1] [1,3] [1,3] [1,3] [1,3] (3,5] (3,5]
Levels: [1,3] (3,5]

  • cut_interval makes n intervals with the same range (width).


slide-7
SLIDE 7

cut_width

cut_width(z, width = 1)
[1] [0.5,1.5] [0.5,1.5] [0.5,1.5] (1.5,2.5] (3.5,4.5] (4.5,5.5]
Levels: [0.5,1.5] (1.5,2.5] (2.5,3.5] (3.5,4.5] (4.5,5.5]

  • cut_width makes intervals of the specified width.
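One way to see how the three functions differ is to count the values that land in each bin. This is only a sketch: it assumes the ggplot2 package (which provides all three functions) is installed and reuses the same z vector as the slides.

```r
library(ggplot2)  # provides cut_number(), cut_interval(), cut_width()

z <- c(1, 1, 1, 2, 4, 5)

# Equal counts per bin: the bin edges move to balance the counts.
n_by_count <- table(cut_number(z, n = 2))

# Equal-width bins: the counts can be unbalanced.
n_by_range <- table(cut_interval(z, n = 2))

# Fixed bin width: the number of bins depends on the data range,
# and a bin may be empty.
n_by_width <- table(cut_width(z, width = 1))

n_by_count
n_by_range
n_by_width
```

Which function fits best depends on whether balanced group sizes or comparable numeric ranges matter more for the analysis.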


slide-8
SLIDE 8

Graphics example

iris %>%
  mutate(petal_length = cut_number(Petal.Length, n = 4)) %>%
  ggplot(aes(petal_length, Petal.Width)) +
  geom_boxplot()

[Figure: boxplots of Petal.Width (y-axis, 0.0 to 2.5) for each petal_length bin (x-axis): [1,1.6], (1.6,4.35], (4.35,5.1], (5.1,6.9]]


slide-9
SLIDE 9

Formatting numbers


slide-10
SLIDE 10

round

x <- c(1.4234, 1.5, 1.6234, 2.4, 2.5, 10.6)
round(x)
[1]  1  2  2  2  2 11
round(x, digits = 1)
[1]  1.4  1.5  1.6  2.4  2.5 10.6
round(x, digits = -1)
[1]  0  0  0  0  0 10

  • round creates a new, rounded, number.
  • At 0.5 it rounds to the even digit.
  • You can specify the number of digits. Negative values of digits round to multiples of a power of 10 (digits = -1 rounds to the nearest 10).
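A short sketch of both behaviors, using made-up values: ties at exactly one half go to the even neighbor, and a negative digits argument rounds to tens, hundreds, and so on.

```r
# Values that sit exactly halfway between two integers.
halves <- c(0.5, 1.5, 2.5, 3.5)
rounded_halves <- round(halves)   # ties go to the even integer: 0 2 2 4

# digits = -2 rounds to the nearest hundred; digits = 2 keeps two decimals.
to_hundreds <- round(1234.5678, digits = -2)
to_cents    <- round(1234.5678, digits = 2)

rounded_halves
to_hundreds
to_cents
```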


slide-11
SLIDE 11

signif

x
[1]  1.4234  1.5000  1.6234  2.4000  2.5000 10.6000
signif(x)
[1]  1.4234  1.5000  1.6234  2.4000  2.5000 10.6000
signif(x, digits = 1)
[1]  1  2  2  2  2 10
signif(x, digits = 5)
[1]  1.4234  1.5000  1.6234  2.4000  2.5000 10.6000

  • signif creates a new number, rounded to the specified number of significant digits.
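The difference between the two functions is easiest to see side by side. In this sketch (with made-up values), round counts digits after the decimal point, while signif counts digits from the first non-zero digit.

```r
v <- c(123.456, 0.012345)

# Two digits after the decimal point.
by_decimals <- round(v, digits = 2)

# Two significant digits, wherever the decimal point is.
by_sigfigs <- signif(v, digits = 2)

by_decimals
by_sigfigs
```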


slide-12
SLIDE 12

data frame example

library(scales) # for the number function
(d <- tibble(x = c(123400, 12340, 1234, 123.4, 12.34, 1.234, 0.1234, 0.01234)) %>%
   mutate(rounded = round(x, digits = 2),
          signifed = signif(x, digits = 2),
          number1 = number(x, accuracy = 1),
          number2 = number(x, accuracy = 0.1)))
# A tibble: 8 x 5
            x   rounded   signifed number1 number2
        <dbl>     <dbl>      <dbl> <chr>   <chr>
1 123400      123400    120000     123 400 123 400.0
2  12340       12340     12000     12 340  12 340.0
3   1234        1234      1200     1 234   1 234.0
4    123.        123.      120     123     123.4
5     12.3        12.3      12     12      12.3
6      1.23        1.23      1.2   1       1.2
7      0.123       0.12      0.12          0.1
8      0.0123      0.01      0.012 0       0.0


slide-13
SLIDE 13

set option

options(pillar.sigfig = 1)
d
# A tibble: 8 x 5
          x   rounded  signifed number1 number2
      <dbl>     <dbl>     <dbl> <chr>   <chr>
1 123400    123400    120000    123 400 123 400.0
2  12340     12340     12000    12 340  12 340.0
3   1234      1234      1200    1 234   1 234.0
4    123.      123.      120    123     123.4
5     12.       12.       12    12      12.3
6      1.        1.        1.   1       1.2
7      0.1       0.1       0.1          0.1
8      0.01      0.01      0.01 0       0.0

  • This sets a print option for tibbles.
  • The default value is 3.
  • A value you set stays in place until you change it (or quit R).
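Since the setting persists, it can be convenient to change it only temporarily. A minimal sketch using base R's options function, which returns the previous values so they can be restored later (the variable names here are only illustrative):

```r
# Remember the current setting (NULL if it has never been set).
before <- getOption("pillar.sigfig")

# options() returns the old values, so keep them for later.
old <- options(pillar.sigfig = 5)
during <- getOption("pillar.sigfig")

# ... print tibbles here with 5 significant digits ...

# Restore whatever was in place before.
options(old)
after <- getOption("pillar.sigfig")
```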


slide-14
SLIDE 14

set option

options(pillar.sigfig = 5)
d
# A tibble: 8 x 5
             x   rounded   signifed number1 number2
         <dbl>     <dbl>      <dbl> <chr>   <chr>
1 123400       123400    120000     123 400 123 400.0
2  12340        12340     12000     12 340  12 340.0
3   1234         1234      1200     1 234   1 234.0
4    123.4        123.4     120     123     123.4
5     12.34        12.34     12     12      12.3
6      1.234        1.23      1.2   1       1.2
7      0.1234       0.12      0.12          0.1
8      0.01234      0.01      0.012 0       0.0


slide-15
SLIDE 15

sprintf

sprintf("The value of x is approximately: %.2f", 1.23456)
[1] "The value of x is approximately: 1.23"

  • sprintf inserts values into a format string, which contains both literal text and format codes, starting with %.

  • The result is of type character. You can print this (or save it).
  • For more about the many format codes, see the help page.
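A few further format codes, as a sketch with made-up values: %d formats an integer, %s a string, a width and 0 flag pad the result, and %% produces a literal percent sign. sprintf is also vectorized over its arguments.

```r
# Zero-padded integers, one result per input value.
padded_ids <- sprintf("id_%02d", 1:3)

# Width 5, one decimal place, followed by a literal percent sign.
percent <- sprintf("%5.1f%%", 12.345)

# Several values inserted into one string.
sentence <- sprintf("%s has %d rows", "d", 8L)

padded_ids
percent
sentence
```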


slide-16
SLIDE 16

controlling how data frames print

print(d, n = 2, width = 20)
# A tibble: 8 x 5
       x rounded
   <dbl>   <dbl>
1 123400  123400
2  12340   12340
# ... with 6 more
#   rows, and 3
#   more variables:
#   signifed <dbl>,
#   number1 <chr>,
#   number2 <chr>

  • This will print 2 rows and limit the output to 20 characters per line; columns that do not fit are listed below the table.


slide-17
SLIDE 17

printing the entire data frame

print(d, n = +Inf)

  • This will print all rows.
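To see the effect, the sketch below compares the default print with a full print on a made-up 30-row tibble. It assumes the tibble package is installed; width = Inf additionally asks for all columns to be shown.

```r
library(tibble)

# Hypothetical data: more rows than a tibble prints by default.
big <- tibble(id = 1:30, value = id^2)

# capture.output() collects the printed lines so they can be counted.
default_lines <- capture.output(print(big))
all_lines     <- capture.output(print(big, n = Inf, width = Inf))

length(default_lines)  # header plus only the first rows
length(all_lines)      # header plus all 30 rows
```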


slide-18
SLIDE 18

Row vs column operations


slide-19
SLIDE 19

Exercise: Sum along all the columns

(d1 <- tibble(x = 1:3, y = 11:13, z = 100:102))
# A tibble: 3 x 3
      x     y     z
  <int> <int> <int>
1     1    11   100
2     2    12   101
3     3    13   102

  • How would you create a new row that contains the column sums?


slide-20
SLIDE 20

Answer: Sum all the columns

d1 %>% summarize_all(sum)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     6    36   303

  • This applies the sum function to every one of the columns.
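For readers on dplyr 1.0.0 or later, the same column sums can be written with across, which that release introduced as the general way to apply a function to several columns. This is an alternative sketch, not the form used on the slide, and it assumes a dplyr version that provides across.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across()

d1 <- tibble(x = 1:3, y = 11:13, z = 100:102)

# everything() selects all columns; sum is applied to each one.
col_sums <- d1 %>%
  summarize(across(everything(), sum))

col_sums
```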


slide-21
SLIDE 21

Include the sum as the last row

bind_rows(d1, summarize_all(d1, sum))
# A tibble: 4 x 3
      x     y     z
  <int> <int> <int>
1     1    11   100
2     2    12   101
3     3    13   102
4     6    36   303

  • This includes the row with the summed values as the bottom row.


slide-22
SLIDE 22

Exercise: Sum across all the rows

d1
# A tibble: 3 x 3
      x     y     z
  <int> <int> <int>
1     1    11   100
2     2    12   101
3     3    13   102

  • How would you create a new column with the sum of all the previous columns?
  • This is a bit more complicated: each column is a vector, but each row is not.


slide-23
SLIDE 23

Answer: Sum across all the rows

d1 %>% mutate(row_sum = rowSums(.))
# A tibble: 3 x 4
      x     y     z row_sum
  <int> <int> <int>   <dbl>
1     1    11   100     112
2     2    12   101     115
3     3    13   102     118

  • rowSums is built-in.
  • There is also a rowMeans function.
  • But what if we want a different calculation?


slide-24
SLIDE 24

Answer: Sum across all the rows

d1 %>% mutate(row_sum = reduce(., `+`))
# A tibble: 3 x 4
      x     y     z row_sum
  <int> <int> <int>   <int>
1     1    11   100     112
2     2    12   101     115
3     3    13   102     118

  • + is the binary addition operator; reduce applies it to the columns in turn, computing (x + y) + z.
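The reduce step can be tried on a plain list of vectors, which is how a data frame's columns are stored. This sketch assumes the purrr package is installed; swapping in another vectorized two-argument function, such as pmax, gives a different row-wise calculation.

```r
library(purrr)

# The same three columns as d1, held in a plain named list.
cols <- list(x = 1:3, y = 11:13, z = 100:102)

# reduce() combines the elements pairwise: (x + y) + z.
row_sum <- reduce(cols, `+`)

# Any vectorized binary function works, e.g. pmax for the row-wise maximum.
row_max <- reduce(cols, pmax)

row_sum
row_max
```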


slide-25
SLIDE 25

Answer: Sum across all the rows

d1 %>% mutate(row_sum = flatten_dbl(pmap(., sum)))
# A tibble: 3 x 4
      x     y     z row_sum
  <int> <int> <int>   <dbl>
1     1    11   100     112
2     2    12   101     115
3     3    13   102     118

  • This is a more complex approach using functions from the purrr package.
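The pmap approach generalizes to any function of one row's values. As a sketch (assuming purrr is installed), pmap_dbl calls the function once per row and collects the results in a numeric vector, so no separate flattening step is needed. The row mean is used here only as an example calculation.

```r
library(purrr)

d1 <- data.frame(x = 1:3, y = 11:13, z = 100:102)

# The function receives one row at a time, with arguments matched
# to the column names x, y, and z.
row_mean <- pmap_dbl(d1, function(x, y, z) mean(c(x, y, z)))

row_mean
```

For sums and means the built-in rowSums and rowMeans are simpler; pmap is mainly useful when no such built-in exists.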


slide-26
SLIDE 26

How to combine multiple plots together


slide-27
SLIDE 27

package patchwork

## install.packages("patchwork")
library(patchwork)

  • This package allows you to easily combine multiple ggplot plots into a single graphic.


slide-28
SLIDE 28

Create two graphs

g1 <- tibble(x = 1:3, y = 1:3) %>%
  ggplot(aes(x, y)) +
  geom_point(size = 5)
g2 <- tibble(x = 1:3, y = 3:1) %>%
  ggplot(aes(x, y)) +
  geom_point(size = 5)


slide-29
SLIDE 29

Combine using patchwork

g1 | g2

[Figure: g1 and g2 side by side; each panel plots y against x, both axes running from 1.0 to 3.0]

  • Use “|” to place plots side by side.


slide-30
SLIDE 30

Combine using patchwork

g1 / g2

[Figure: g1 above g2; each panel plots y against x, both axes running from 1.0 to 3.0]

  • Use “/” to place one plot on top of another.


slide-31
SLIDE 31

Combine using patchwork

(g1 | g2) / (g2 | g1)

[Figure: a two-by-two arrangement with g1 and g2 in the top row and g2 and g1 in the bottom row; each panel plots y against x]

  • Use “( )” for grouping


slide-32
SLIDE 32

Combine using patchwork

g1 | g2 | g1 | g2

[Figure: four panels in a single row (g1, g2, g1, g2); each panel plots y against x]


slide-33
SLIDE 33

Label the plots

g1 + g2 + g1 + g2 +
  plot_annotation(tag_levels = "a", tag_prefix = "(", tag_suffix = ")")

[Figure: four panels, each plotting y against x, tagged (a), (b), (c), and (d)]
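When plots are combined with +, patchwork chooses the grid for you. The sketch below shows one way to control it, using plot_layout to set the number of columns and plot_annotation to add an overall title. It assumes ggplot2 and patchwork are installed; the title text is made up.

```r
library(ggplot2)
library(patchwork)

g1 <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) + geom_point(size = 5)
g2 <- ggplot(data.frame(x = 1:3, y = 3:1), aes(x, y)) + geom_point(size = 5)

# plot_layout() sets the grid; plot_annotation() adds a title and panel tags.
combined <- g1 + g2 + g1 + g2 +
  plot_layout(ncol = 4) +
  plot_annotation(title = "Four panels in one row", tag_levels = "a")

combined  # printing the object draws the combined graphic
```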


slide-34
SLIDE 34

More ggplot themes


slide-35
SLIDE 35

More ggplot themes

  • The package ggthemes has a large collection of themes, some designed to match well-known styles (The Economist, The Wall Street Journal, Stata, Excel).
  • It also has some colorblind-friendly color scales.
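As a sketch of the colorblind-friendly scales, scale_colour_colorblind replaces the default colors of a plot, and colorblind_pal returns the underlying palette function. This assumes ggplot2 and ggthemes are installed; the iris plot is only an illustration.

```r
library(ggplot2)
library(ggthemes)

# Color the points by species using the colorblind-friendly palette.
p <- ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  scale_colour_colorblind()

# The palette itself: ask for as many colors as there are groups.
palette3 <- colorblind_pal()(3)

p
palette3
```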


slide-36
SLIDE 36

Excel theme

library(ggthemes)
ggplot(BOD, aes(Time, demand)) +
  geom_point() +
  theme_excel_new()

[Figure: scatter plot of BOD demand against Time, drawn with the Excel theme]

slide-37
SLIDE 37

Economist theme

ggplot(BOD, aes(Time, demand)) + geom_point() + theme_economist()

[Figure: scatter plot of demand against Time, drawn with the Economist theme]

slide-38
SLIDE 38

Wall Street Journal theme

ggplot(BOD, aes(Time, demand)) + geom_point() + theme_wsj()

[Figure: scatter plot of BOD demand against Time, drawn with the Wall Street Journal theme]

slide-39
SLIDE 39

Updates to tidyverse packages


slide-40
SLIDE 40

forcats package

  • This package contains functions for working with factors, especially reordering, renaming, and grouping them.
  • One kind of grouping is “lumping” of factors: combining relatively rarely occurring factor levels into a single Other category.
  • Details here: forcats 0.5.0 - Tidyverse
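A minimal sketch of lumping, using fct_lump_n (one of the lumping functions available from forcats 0.5.0) on a made-up factor: the n most frequent levels are kept and the rest are combined into Other.

```r
library(forcats)  # assumes forcats >= 0.5.0 for fct_lump_n()

# Hypothetical factor: "a" and "b" are common, "c" and "d" are rare.
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels; everything else becomes "Other".
lumped <- fct_lump_n(f, n = 2)

table(lumped)
```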


slide-41
SLIDE 41

dplyr package

  • New version (1.0.0) is coming in the next couple of months. It should have some improvements and simplifications of some of the verbs we learned about.

  • Details here: dplyr 1.0.0 is coming soon - Tidyverse


slide-42
SLIDE 42

Resources

  • There is a list of resources at the bottom of the course webpage.
