BootcampR: An Introduction to R, by Jason A. Heppler, PhD



slide-1
SLIDE 1

BootcampR

AN INTRODUCTION TO R

Jason A. Heppler, PhD
University of Nebraska at Omaha
March 3, 2020
@jaheppler

slide-2
SLIDE 2
Hi. I'm Jason.

I like to gesture at screens.

  • Digital Engagement Librarian, University of Nebraska at Omaha
  • Mentor, Mozilla Open Leaders
  • Researcher, Humanities+Design, Stanford University

slide-3
SLIDE 3

Schedule

  • March 10, 1:30-3: Making Networks in CL 232
  • March 17, 1:30-3: Making Maps in CL 112
  • March 31, 1:30-3: Clustering and Classifying in CL 112

slide-4
SLIDE 4

Today's plan

  • Aesthetics and design
  • Intro to ggplot
  • The grammar of graphics
  • Hands-on!

Open up RStudio. We'll start doing a few things together soon.

slide-5
SLIDE 5

"The bad news is that whenever you learn a new skill you’re going to suck. It’s going to be frustrating. The good news is that is typical and happens to everyone and it is only temporary. You can’t go from knowing nothing to becoming an expert without going through a period of great frustration and great suckiness."

—Hadley Wickham

slide-6
SLIDE 6

ggplot is part of the tidyverse

A highly functional package for reasoning about the creation of statistical charts and graphics.

slide-7
SLIDE 7

Edward Tufte suggests that graphical excellence is defined by "that which gives the viewer the greatest number of ideas, in the shortest time, with the least ink, the smallest space, and which tells the truth about data."

Edward Tufte, The Visual Display of Quantitative Information (Graphics Press, 1983)

slide-8
SLIDE 8

  1. For the communication of information and results
  2. For the exploration of data and evidence

slide-9
SLIDE 9

Types of Visualization

  • Information visualization: statistical charts and graphs to represent data

slide-10
SLIDE 10
slide-11
SLIDE 11

Types of Visualization

  • Information visualization: statistical charts and graphs to represent data
  • Scientific visualization: scientific data that has close ties to real-world objects with spatial properties

slide-12
SLIDE 12
slide-13
SLIDE 13

Types of Visualization

  • Information visualization: statistical charts and graphs to represent data
  • Scientific visualization: scientific data that has close ties to real-world objects with spatial properties
  • Infographic: combining various statistics and visualizations with narrative

slide-14
SLIDE 14
slide-15
SLIDE 15

Cognitive and Social Aspects of Visualization
slide-16
SLIDE 16

Gestalt Principles of Data Visualization
slide-17
SLIDE 17

Gestalt Principles of Data Visualization

Gestalt psychology is an old practice of understanding how humans perceive patterns. The principles of Gestalt psychology attempt to explore how we view separate visual elements as a whole.

slide-18
SLIDE 18

Gestalt Principles of Data Visualization

  • Similarity. Objects that are visually similar (the same color, for example) are perceived as part of the same group.

slide-21
SLIDE 21

Gestalt Principles of Data Visualization

  • Proximity. Humans perceive objects close together as being part of a single group.

slide-22
SLIDE 22
slide-23
SLIDE 23

Gestalt Principles of Data Visualization

  • Enclosure. Surrounding a group of related elements with a visual element (a border or shaded region, for example) signals that they belong together.

slide-24
SLIDE 24

Gestalt Principles of Data Visualization

  • Closure. Humans tend to fill in the blanks when presented with missing information. When viewing a shape with missing segments, we perceive it as a single unit.

slide-25
SLIDE 25

Effective design of complex visualizations must consider these principles and the intentional and unintentional signals our graphics send to our readers.

slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32

In the mid-1980s, statisticians William Cleveland and Robert McGill ran experiments with human volunteers to study the perception of quantitative information encoded by different cues.

slide-33
SLIDE 33

Let's get started.

ggplot2 is a data visualization package that

  • uses a grammar of graphics: breaking up graphs into components
  • is a popular method for creating explanatory and exploratory graphics

Supplementary packages for ggplot are available for more customization and function, for example:

  • gganimate: create animations
  • gghighlight: highlight lines and points
  • ggrepel: automatic adjustment of text labels
  • ggbeeswarm: add non-overlapping points

...and many more.

slide-34
SLIDE 34

Grammar of graphics

ggplot has three essential components:

  1. data: the dataset you are visualizing
  2. aesthetic mappings: identify coordinates (which columns to map to x and y) and assign variables to visual elements (color, shape, size, etc.)
  3. geometric layer: the type of graphic (point, line, boxplot, bars, maps, etc.)
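The three components can be seen in a minimal sketch like the one below. It assumes the `mpg` example data frame that ships with ggplot2 rather than the workshop's own data, and the choice of columns is purely illustrative.

```r
library(ggplot2)

# 1. data: mpg, an example data frame bundled with ggplot2 (an assumption;
#    any data frame would do)
# 2. aesthetic mappings: engine size on x, highway mileage on y,
#    vehicle class mapped to color
# 3. geometric layer: points, which gives a scatterplot
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

print(p)
```

Swapping `geom_point()` for another geom changes the chart type while the data and mappings stay the same.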

slide-35
SLIDE 35

Grammar of graphics

  • geom_*: type of graphic
  • stat_*: statistical representation of the data
  • scale_*: visual values (axis scale, color scale)
  • facet_*: divide plot into subplots
  • theme(): adjust background colors, grid lines, font sizes, etc.
  • labs(): add labels like title, x and y labels, subtitles, captions, etc.
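As a rough illustration of how these families combine, the sketch below uses one function from each in a single plot. The `mpg` data frame (bundled with ggplot2) and the specific palette, facet variable, and labels are assumptions chosen for the example, not part of the workshop materials.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +                                    # geom_*: draw the points
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                         # stat_*: linear fit per color group
  scale_color_brewer(palette = "Dark2") +           # scale_*: pick the color scale
  facet_wrap(~ year) +                              # facet_*: one panel per model year
  theme(legend.position = "bottom") +               # theme(): non-data appearance
  labs(title = "Engine size and highway mileage",   # labs(): title and axis labels
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       color = "Drive train")

print(p)
```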
slide-36
SLIDE 36

Grammar of graphics

A visualization concept created by Leland Wilkinson (1999) to define the elements of statistical graphics:

"... describes the meaning of what we do when we construct statistical graphics ... More than a taxonomy ... Computational system based on the underlying mathematics of representing statistical functions of data."

Adapted by the creator of ggplot, Hadley Wickham, in 2009. ggplot offers a:

  • consistent and simple syntax for describing statistical graphics, and is
  • highly modular, breaking graphs into semantic components.

See Hadley Wickham, "A Layered Grammar of Graphics," Journal of Computational and Graphical Statistics vol. 19 no. 1 (2010): 3--28 http://vita.had.co.nz/papers/layered-grammar.pdf.
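One way to see the modularity is to build a plot in stages, storing each stage as its own object. The sketch below is only an illustration, using the `mpg` data frame bundled with ggplot2; the object names are hypothetical.

```r
library(ggplot2)

# Data plus aesthetic mapping only: no layer yet, so nothing is drawn
base <- ggplot(mpg, aes(x = class))

# Add one semantic component at a time with `+`
bars     <- base + geom_bar()                      # geometric layer
labelled <- bars + labs(title = "Vehicles by class",
                        x = "Class", y = "Count")  # labels
flipped  <- labelled + coord_flip()                # coordinate system

print(flipped)
```

Each intermediate object is a complete plot specification, so components can be reused or swapped without rewriting the rest.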

slide-37
SLIDE 37

Grammar of graphics

    library(tidyverse)
    # devtools::install_github("hepplerj/superfundr")
    library(superfundr)
    data(superfunds)

    # Let's look at the first five rows
    superfunds %>% head(5)

    # A tibble: 5 x 20
      site_name epa_id city  county state zipcode region npl_status superfund_agree…
      <chr>     <chr>  <chr> <chr>  <chr> <chr>    <dbl> <chr>      <chr>
    1 ATLAS TA… MAD00… FAIR… BRIST… MA    02719        1 Currently… N
    2 ATLAS TA… MAD00… FAIR… BRIST… MA    02719        1 Currently… N
    3 ATLAS TA… MAD00… FAIR… BRIST… MA    02719        1 Currently… N
    4 ATLAS TA… MAD00… FAIR… BRIST… MA    02719        1 Currently… N
    5 ATLAS TA… MAD00… FAIR… BRIST… MA    02719        1 Currently… N
    # … with 11 more variables: federal_facility <chr>, op_unit_no <dbl>, seq_id <dbl>,
    #   decision_type <chr>, completion_date <dttm>, fiscal_year <dbl>, media <chr>,
    #   contaminant <chr>, address <fct>, latitude <dbl>, longitude <dbl>

slide-38
SLIDE 38

Grammar of graphics

    library(tidyverse)
    # devtools::install_github("hepplerj/superfundr")
    library(superfundr)
    data(superfunds)

    # Let's look at the first five rows
    superfunds %>% head(5)

    # distinct() lets us identify unique values
    # and the .keep_all argument returns all data
    # that matches
    superfunds_subset <- superfunds %>%
      distinct(site_name, .keep_all = TRUE)

    ggplot(data = superfunds_subset, aes(x = state)) +
      geom_bar() +
      labs(title = "U.S. Superfund Sites", x = "State", y = "Count")
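To see what `distinct()` and `.keep_all` do without installing superfundr, here is a small sketch on made-up data. The `decisions` tibble and its values are hypothetical; it only mimics the shape of a table where each site appears on several rows.

```r
library(dplyr)

# Hypothetical toy data: one row per cleanup decision, so site names repeat
decisions <- tibble(
  site_name   = c("Site A", "Site A", "Site B", "Site C", "Site C"),
  state       = c("MA", "MA", "NE", "NE", "NE"),
  fiscal_year = c(1990, 1995, 1988, 2001, 2004)
)

# Without .keep_all, only the column named in distinct() is returned
distinct(decisions, site_name)

# With .keep_all = TRUE, the first row for each site is kept with all columns
sites <- distinct(decisions, site_name, .keep_all = TRUE)

# Counting rows per state now counts sites rather than decisions
count(sites, state)
```

Note that `.keep_all = TRUE` keeps only the first row it encounters for each site, so the other rows for that site are dropped.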

slide-39
SLIDE 39

Grammar of graphics

slide-40
SLIDE 40

Grammar of graphics

Let's break down what we did with the ggplot code. The code for our previous bar chart looked like:

    ggplot(data = superfunds_subset, aes(x = state)) +
      geom_bar() +
      labs(title = "U.S. Superfund Sites", x = "State", y = "Count")

slide-41
SLIDE 41

Grammar of graphics

Let's break down what we did with the ggplot code. The code for our previous bar chart looked like:

    ggplot(data = superfunds_subset, aes(x = state)) +
      geom_bar()

slide-42
SLIDE 42

Grammar of graphics

Let's break down what we did with the ggplot code. The code for our previous bar chart looked like:

    ggplot(data = superfunds_subset, aes(x = state)) +
      geom_bar()

ggplot needs:

  1. mapping of data
  2. to aesthetic attributes
  3. using geometric objects
  4. with data statistically transformed
  5. and, if needed, mapped onto a facet or coordinate system
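The five pieces can be written out explicitly, as in the sketch below. It assumes the `mpg` data frame bundled with ggplot2, and the facet and coordinate choices are only there to make item 5 visible; the bar chart above relies on the defaults instead.

```r
library(ggplot2)

p <- ggplot(data = mpg) +          # 1. data
  geom_bar(                        # 3. geometric object: bars
    mapping = aes(x = class),      # 2. aesthetic attribute: class on the x axis
    stat = "count"                 # 4. statistical transformation (geom_bar's default)
  ) +
  facet_wrap(~ drv) +              # 5. facets: one panel per drive train
  coord_flip()                     # 5. coordinate system: swap x and y

print(p)
```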
slide-43
SLIDE 43

Grammar of graphics

Let's break down what we did with the ggplot code. The code for our previous bar chart looked like:

    ggplot(data = superfunds_subset, aes(x = state)) +
      geom_bar()

  1. mapping of data

    data = superfunds_subset

  2. to aesthetic attributes

    aes(x = state)

slide-45
SLIDE 45

Grammar of graphics

Let's break down what we did with the ggplot code. The code for our previous bar chart looked like:

    ggplot(data = superfunds_subset, aes(x = state)) +
      geom_bar()

  3. using geometric objects

    geom_bar()

See the ggplot geom_bar() documentation for the differences in the stat argument. By default, geom_bar() uses stat = "count", which makes the height of each bar proportional to the number of cases in each group. That is what we want here, since we are counting sites per state. If you instead want the height of the bars to represent values already in the data, use stat = "identity" (or geom_col()) and map a variable to the y aesthetic.
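The difference between the two stats is easiest to see side by side. The sketch below uses two small made-up data frames (`sites` and `totals` are hypothetical names and values): one with a row per site, and one that has already been tallied.

```r
library(ggplot2)

# Hypothetical raw data: one row per site
sites <- data.frame(state = c("MA", "MA", "NE", "NE", "NE", "CA"))

# stat = "count" (the default): ggplot2 tallies the rows in each group
p_count <- ggplot(sites, aes(x = state)) +
  geom_bar()

# Hypothetical pre-aggregated data: the counts are already a column
totals <- data.frame(state = c("CA", "MA", "NE"), n = c(1, 2, 3))

# stat = "identity": bar height is taken directly from the y aesthetic
p_identity <- ggplot(totals, aes(x = state, y = n)) +
  geom_bar(stat = "identity")

# geom_col() is ggplot2's shorthand for the same thing
p_col <- ggplot(totals, aes(x = state, y = n)) +
  geom_col()
```

All three plots draw the same bars; what changes is whether ggplot2 or you did the counting.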

slide-46
SLIDE 46

Grammar of graphics

There are several geometric objects (think: chart types) available in ggplot. We could, for example, look at a histogram of the fiscal years in which sites were proposed for Superfund.

    # the distinct() function lets me grab just unique names in
    # the data frame so we're not counting duplicate entries.
    superfunds %>%
      distinct(site_name, .keep_all = TRUE) %>%
      ggplot(aes(x = fiscal_year)) +
      geom_histogram() +
      labs(title = "U.S. Superfund Sites", x = "Fiscal Year", y = "Count")
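Called with no arguments, `geom_histogram()` falls back to 30 bins and prints a message suggesting you choose a bin width yourself. A hedged sketch of doing that follows; the `years` data frame is made up for illustration, and the five-year bin width is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical fiscal years, standing in for the superfunds data
years <- data.frame(
  fiscal_year = c(1983, 1984, 1984, 1986, 1990, 1990, 1991, 1995, 2001, 2004)
)

p <- ggplot(years, aes(x = fiscal_year)) +
  # binwidth sets how many years each bar covers;
  # boundary aligns the bin edges to multiples of five starting from 1980
  geom_histogram(binwidth = 5, boundary = 1980) +
  labs(x = "Fiscal Year", y = "Count")

print(p)
```

Trying a few different bin widths is usually worthwhile, since the shape of a histogram can change noticeably with the binning.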

slide-47
SLIDE 47

Grammar of graphics

slide-48
SLIDE 48

R makes the production of these graphics simple. Note that we were able to create a bar chart and a histogram with just two lines of code.

slide-49
SLIDE 49

Let's do some more hands-on

Head on over to https://tinyurl.com/unobootcamp and select the worksheet for this week.

slide-50
SLIDE 50

Questions? Troubleshooting?

Next workshop: March 10, 1:30p-3p: Making Networks (CL 232)