

slide-1
SLIDE 1

Data Visualization in R

May 15, 2017

Data Visualization in R May 15, 2017 1 / 40

slide-2
SLIDE 2

Jumping In

Let’s get started right away by loading the ggplot2 package and reading in our dataset.

### Install packages if you don't have them yet
### Typical install:
# install.packages('ggplot2')
# install.packages('dplyr')
### Install personal copy (no admin rights)
# install.packages('ggplot2', lib="/path/to/myfolder")
# install.packages('dplyr', lib="/path/to/myfolder")
### Load packages
library(ggplot2)
library(dplyr)
# Load personal copy
# library(ggplot2, lib.loc="/path/to/myfolder")
# library(dplyr, lib.loc="/path/to/myfolder")
### Read in data
auto.data <- read.csv("AutoData.csv", header = TRUE)
# tbl_df() isn't necessary here
# It helps to display the data more clearly
auto.data <- tbl_df(auto.data)
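A note for readers on newer package versions: more recent dplyr releases mark tbl_df() as deprecated and point to as_tibble() instead. The sketch below assumes such a release. It writes a tiny stand-in CSV with placeholder values (not the real AutoData.csv) so that it runs on its own; csv.path is a name introduced only for this illustration.

```r
library(dplyr)

# Write a tiny stand-in CSV (placeholder values) so the block runs on its own.
# In the workshop the file would be AutoData.csv.
csv.path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(curb.weight = c(2500, 2800, 2300),
             price       = c(13000, 16500, 14000)),
  csv.path,
  row.names = FALSE
)

auto.data <- read.csv(csv.path, header = TRUE)

# as_tibble() plays the role of tbl_df(): same data, tidier printing.
auto.data <- as_tibble(auto.data)
print(auto.data)
```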

slide-3
SLIDE 3

Glimpse at the data

Run the following to get a quick glimpse of the data:
# Find the dimensions
dim(auto.data)
# Look at the structure
str(auto.data)
# Examine the top
head(auto.data)
# Find out about a function
?str


slide-4
SLIDE 4

Data Exploration

When looking at a new data set, exploration is key. What types of variables do we have? What types of relationships do you expect to see between variables? Does your intuition check out? If not, why not? Do we observe anomalous behavior?


slide-5
SLIDE 5

Scatter Plots

One of the simpler plots we can make is a scatter plot between two continuous variables. Try it out:
# qplot is a convenient front end for the more powerful,
# but slightly more complicated, ggplot() function.
qplot(curb.weight, price, data=auto.data)


slide-6
SLIDE 6

[Figure: scatter plot of price (y-axis) against curb.weight (x-axis)]


slide-7
SLIDE 7

qplot is an easy-to-use front end to the main ggplot function. It has several reasonable defaults for plotting data. As we just saw, two continuous inputs get us back a scatter plot.
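For comparison, here is a sketch of the same scatter plot written with the full ggplot() function, which is worth knowing because recent ggplot2 releases mark qplot() as deprecated. The auto.data built here is a stand-in with made-up values, used only so the block runs without AutoData.csv.

```r
library(ggplot2)

# Stand-in for auto.data: made-up values so the block runs on its own.
set.seed(1)
auto.data <- data.frame(curb.weight = runif(50, 1500, 4000))
auto.data$price <- 5 * auto.data$curb.weight + rnorm(50, sd = 2000)

# Full-form equivalent of qplot(curb.weight, price, data = auto.data):
# the data, the aesthetic mapping, and the geom are each spelled out.
p <- ggplot(auto.data, aes(x = curb.weight, y = price)) +
  geom_point()
print(p)
```

The qplot defaults are simply made explicit: two continuous inputs in aes() plus geom_point() give the scatter plot.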


slide-8
SLIDE 8

The true power of ggplot comes from its ability to easily visualize relationships between many variables. The main ingredients we’ll be using are:

1. aesthetics
2. facets
3. geoms

slide-9
SLIDE 9

Aesthetics

Aesthetics control many of the plot’s visual properties. Importantly, these visual properties may be mapped directly to variables.


slide-10
SLIDE 10

Aesthetics Example

Try the following:
# map color to factor/categorical variable
qplot(curb.weight, price, data=auto.data, color=num.of.cylinders)
# map color to continuous variable
qplot(curb.weight, price, data=auto.data, color=bore)


slide-11
SLIDE 11

[Figure: two scatter plots of price against curb.weight. In the first, color is mapped to num.of.cylinders (legend: eight, five, four, six, three, twelve); in the second, color is mapped to bore on a continuous scale.]


slide-12
SLIDE 12

There are many other aesthetics besides color. Some we’ll encounter are:

1. color
2. size
3. shape
4. fill

Not all aesthetics work with both categorical and continuous variables (as color did). Also, only a certain subset of aesthetics will be available for each plot type (geom).
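To make the point about variable types concrete, the sketch below maps shape first to a categorical variable and then to a continuous one. It is written with ggplot() rather than qplot, and auto.data is a stand-in with made-up values. The names p.ok, p.bad, and shape.failed are introduced only for this illustration.

```r
library(ggplot2)

# Stand-in for auto.data: made-up values so the block runs on its own.
set.seed(2)
auto.data <- data.frame(
  curb.weight  = runif(60, 1500, 4000),
  horsepower   = runif(60, 50, 250),
  drive.wheels = sample(c("4wd", "fwd", "rwd"), 60, replace = TRUE)
)
auto.data$price <- 5 * auto.data$curb.weight + rnorm(60, sd = 2000)

base <- ggplot(auto.data, aes(curb.weight, price))

# shape mapped to a categorical variable: one symbol per category.
p.ok <- base + geom_point(aes(shape = drive.wheels))
print(p.ok)

# shape mapped to a continuous variable: ggplot2 has no continuous shape
# scale, so building the plot fails. tryCatch() records the failure
# instead of stopping the script.
p.bad <- base + geom_point(aes(shape = horsepower))
shape.failed <- tryCatch({
  ggplot_build(p.bad)
  FALSE
}, error = function(e) TRUE)
print(shape.failed)
```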


slide-13
SLIDE 13

Try It Out

See how the following aesthetics behave with the scatter plot. Feel free to change the variables in the scatter plot.
qplot(curb.weight, price, data=auto.data, size=horsepower)
qplot(curb.weight, price, data=auto.data, shape=drive.wheels)


slide-14
SLIDE 14

Facets

Facets represent another way of visualizing the effect of factor/categorical variables. Facets enable us to get a separate plot for each level/category.


slide-15
SLIDE 15

Facet Example

Let’s try out a faceting example:
qplot(curb.weight, price, data=auto.data) + facet_wrap(~aspiration)


slide-16
SLIDE 16

[Figure: scatter plots of price against curb.weight, one panel per aspiration level (std, turbo)]


slide-17
SLIDE 17

Note that facet_wrap gives a separate plot for each category. Also note how we incorporated the behavior of facet_wrap: via the + operator.

This is one of the main strengths of ggplot: plots are built up in intuitive layers
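The layering idea can be sketched by storing a plot in a variable and adding one piece at a time with +. This uses ggplot() directly and a stand-in auto.data with made-up values. The axis labels in the last step are an extra layer chosen for illustration, not something from the slides.

```r
library(ggplot2)

# Stand-in for auto.data: made-up values so the block runs on its own.
set.seed(3)
auto.data <- data.frame(
  curb.weight = runif(40, 1500, 4000),
  aspiration  = rep(c("std", "turbo"), each = 20)
)
auto.data$price <- 5 * auto.data$curb.weight + rnorm(40, sd = 2000)

p <- ggplot(auto.data, aes(curb.weight, price))  # data and mapping only
p <- p + geom_point()                            # add the points
p <- p + facet_wrap(~aspiration)                 # one panel per category
p <- p + labs(x = "Curb weight", y = "Price")    # axis labels
print(p)
```

Each + returns a new plot object, so any intermediate p can be printed to see what a given layer contributes.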


slide-18
SLIDE 18

Also available is facet_grid for examining the interaction between two categorical variables:
qplot(curb.weight, price, data=auto.data) + facet_grid(drive.wheels~num.of.doors)


slide-19
SLIDE 19

[Figure: grid of scatter plots of price against curb.weight, with rows for drive.wheels (4wd, fwd, rwd) and columns for num.of.doors (four, two)]


slide-20
SLIDE 20

Try out the following:
qplot(curb.weight, price, data=auto.data) + facet_grid(.~drive.wheels)
qplot(curb.weight, price, data=auto.data) + facet_grid(drive.wheels~.)
qplot(curb.weight, price, data=auto.data, color=num.of.doors) + facet_grid(drive.wheels~.)


slide-21
SLIDE 21

geoms

The final way we’ll look at to control ggplots is via geoms. The geom controls the type of plot that is displayed. We’ve already looked at one: geom_point. We could rewrite our scatter plot code more explicitly as:
qplot(curb.weight, price, data=auto.data, geom='point')


slide-22
SLIDE 22

Let’s check out another geom: geom_histogram
# geom_histogram operates with a single continuous variable.
# Let's look at price
qplot(price, data=auto.data, geom='histogram')
# or via qplot's defaults
qplot(price, data=auto.data)


slide-23
SLIDE 23

[Figure: histogram of price, with count on the y-axis]


slide-24
SLIDE 24

Note the warning concerning binwidth. The binwidth chosen can dramatically impact how we visually interpret the distribution. It’s best to experiment with values to get a feel for the data. We can alter the binwidth by passing the option to qplot:
qplot(price, data=auto.data, geom='histogram', binwidth=20000)
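One way to experiment is to build the same histogram with two binwidths and compare how many bars each produces. The sketch below assumes a stand-in auto.data with simulated, right-skewed prices (made-up values), and uses ggplot() with geom_histogram() instead of qplot. The binwidth of 1000 is an arbitrary choice for contrast.

```r
library(ggplot2)

# Stand-in for auto.data: simulated right-skewed prices, made up for illustration.
set.seed(4)
auto.data <- data.frame(price = rlnorm(200, meanlog = 9.3, sdlog = 0.5))

narrow <- ggplot(auto.data, aes(price)) + geom_histogram(binwidth = 1000)
wide   <- ggplot(auto.data, aes(price)) + geom_histogram(binwidth = 20000)

# layer_data() returns one row per bin, so the row count is the number of bars.
bins.narrow <- nrow(layer_data(narrow))
bins.wide   <- nrow(layer_data(wide))
cat("bars with binwidth 1000: ", bins.narrow, "\n")
cat("bars with binwidth 20000:", bins.wide, "\n")

print(narrow)
print(wide)
```

Both plots count the same observations; only the grouping changes, which is why the visual impression can differ so much.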


slide-25
SLIDE 25

[Figure: histogram of price with binwidth=20000, with count on the y-axis]

This tells a very different story than the original!


slide-26
SLIDE 26

Note that our price distribution is a bit skewed. Perhaps we are not interested in higher-priced cars (≥ 20,000, say). We can limit our plot to cars with lower prices by setting limits:
qplot(price, data=auto.data, geom='histogram', binwidth=450) + xlim(4000, 20000)
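A detail worth knowing about xlim(): it sets limits on the scale, so observations outside the range are discarded (with a warning) rather than merely hidden. If the goal is only to zoom in, coord_cartesian() is the usual alternative. The sketch below compares the two on a stand-in auto.data with simulated prices (made-up values), using ggplot() rather than qplot.

```r
library(ggplot2)

# Stand-in for auto.data: simulated right-skewed prices, made up for illustration.
set.seed(5)
auto.data <- data.frame(price = rlnorm(200, meanlog = 9.3, sdlog = 0.5))

base <- ggplot(auto.data, aes(price)) + geom_histogram(binwidth = 450)

# xlim() sets scale limits: prices outside 4000-20000 are dropped
# before the bins are counted.
p.xlim <- base + xlim(4000, 20000)

# coord_cartesian() only zooms the view: every price is still binned.
p.zoom <- base + coord_cartesian(xlim = c(4000, 20000))

# suppressWarnings() hides the "Removed ... rows" warning that xlim() triggers.
n.xlim <- suppressWarnings(sum(layer_data(p.xlim)$count))
n.zoom <- sum(layer_data(p.zoom)$count)
cat("observations binned with xlim():           ", n.xlim, "\n")
cat("observations binned with coord_cartesian():", n.zoom, "\n")
```

For a plain histogram the visible bars look similar either way; the difference matters more once summaries such as smoothers or boxplots are computed from the data.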


slide-27
SLIDE 27

[Figure: histogram of price restricted to the range 4000 to 20000, with count on the y-axis]


slide-28
SLIDE 28

Just like our point geom, histogram too has aesthetics. Try the following:
qplot(price, data=auto.data, color=drive.wheels)
qplot(price, data=auto.data, fill=drive.wheels)


slide-29
SLIDE 29

[Figure: two histograms of price broken down by drive.wheels (4wd, fwd, rwd), the first using color and the second using fill, with count on the y-axis]

Which one do you like best? Do you like either? How might we make it better?


slide-30
SLIDE 30

The colors help, but the figure is a bit busy. We can try faceting instead:
qplot(price, data=auto.data) + facet_wrap(~drive.wheels)


slide-31
SLIDE 31

[Figure: histograms of price, one panel per drive.wheels category (4wd, fwd, rwd), sharing a common count axis]


slide-32
SLIDE 32

This lets us separate out the categories much more easily. Note that the counts vary quite a bit among the different classes, yet the count axis is the same for all. We can change this by modifying the facet_wrap call:
qplot(price, data=auto.data) + facet_wrap(~drive.wheels, scales = 'free_y')


slide-33
SLIDE 33

[Figure: histograms of price, one panel per drive.wheels category (4wd, fwd, rwd), each with its own count axis]


slide-34
SLIDE 34

[Figure: histograms of price with the drive.wheels panels (4wd, fwd, rwd) arranged in three rows, each with its own count axis]

See ?facet_wrap for more useful options. For example, the plot above uses nrow=3.


slide-35
SLIDE 35

Try it out

Take some time to explore the data using various geoms, aesthetics, and facets. How are the other variables related to price? Are some of the relationships stronger than others?


slide-36
SLIDE 36

More geoms

There are many other geoms besides point and histogram. Try ??geom to see a list. Different geoms operate with different (combinations of) data types (i.e. categorical or continuous). As is characteristic of ggplot, geoms can be layered to create plots of increasing detail/complexity.


slide-37
SLIDE 37

Try out the following:
qplot(price, data=auto.data, geom='density')
qplot(price, ..density.., # don't use counts
      data=auto.data, geom='histogram') + geom_density()
qplot(height, price, data=auto.data, geom='density2d')
qplot(height, price, data=auto.data) + geom_density2d()
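The ..density.. notation tells the histogram to plot density instead of counts, so that its bars share a scale with the density curve. Newer ggplot2 releases spell this after_stat(density). The sketch below assumes such a release and a stand-in auto.data with simulated prices (made-up values). The choice of bins = 30 is arbitrary.

```r
library(ggplot2)

# Stand-in for auto.data: simulated right-skewed prices, made up for illustration.
set.seed(6)
auto.data <- data.frame(price = rlnorm(200, meanlog = 9.3, sdlog = 0.5))

# after_stat(density) puts the bars on the density scale, matching the
# scale of the curve drawn by geom_density().
p <- ggplot(auto.data, aes(price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  geom_density()
print(p)

# On the density scale, the bar areas (height times width) sum to one.
bars <- layer_data(p, 1)
bar.area <- sum(bars$density * (bars$xmax - bars$xmin))
print(bar.area)
```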


slide-38
SLIDE 38

Can you guess the geom for creating a boxplot? Create a boxplot displaying price for each of the drive.wheels categories


slide-39
SLIDE 39

qplot(drive.wheels, price, data=auto.data, geom='boxplot')


slide-40
SLIDE 40

References and Additional Info

ggplot2 documentation: http://docs.ggplot2.org/current/
Hadley’s ggplot2 book: http://ggplot2.org/book/
RStudio ggplot cheatsheet: http://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.png
