DATA VISUALIZATION WITH GGPLOT2
Introduction
Data Visualization with ggplot2
Chapter 1
[Figure: density of price by cut (Fair, Good, Very Good, Premium, Ideal)]
[Figure: bimodal density]
[Figure: sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore)]
Chapter 2
[Figure: ternary plot of Silt, Sand and Clay]
[Figure: price against carat]
[Figure: blood types A−, A+, AB−, AB+, B−, B+, O−, O+]
[Figure: Residuals vs Fitted for lm(Volume ~ Girth), with observations 31, 20 and 19 labelled]
Chapter 3
[Figure: map with labels Brandenburger Tor, Potsdamer Platz, Victory Column, Checkpoint Charlie, Reichstag, Alexander Platz]
Chapter 4
- Introduction to grid
- Manipulating graphical objects
- ggplot_build()
- gridExtra
Chapter 5
[Figure: group1 and group2 with 95% CI range]
[Figure: temp against new_day, showing current year, past record high, past record low, new record high, new record low]
[Figure: temp against new_day in four panels: PARIS, REYKJAVIK, NEW YORK, LONDON]
Let’s practice!
Box Plots
Statistical plots
- Academic audience
- 2 common types
  - Box plots
  - Density plots
- Case study: 2D box plots
Box plot
- John Tukey - Exploratory Data Analysis
- Visualizing the 5-number summary
[Figure: box plot of values, annotated step by step]
- Mean and standard deviation: not robust!
- 5-number summary: (1) minimum, (2) Q1, (3) Q2 = median, (4) Q3, (5) maximum
- IQR = interquartile range
- Each of the four intervals between consecutive summary values holds 25% of the observations
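The five numbers above can be computed directly in base R. This is a minimal sketch on a made-up vector `values` (an assumption for illustration, not the data behind the slides); it also shows why the mean is labelled "not robust".

```r
# Hypothetical sample; not the data behind the slides
values <- c(-2.1, -0.8, -0.3, 0.1, 0.4, 0.9, 1.6)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(values)
names(five) <- c("minimum", "Q1", "Q2 (median)", "Q3", "maximum")
print(five)

# Interquartile range: Q3 - Q1, the height of the box
iqr <- IQR(values)
print(iqr)

# Add one extreme value: the mean shifts a lot, the median only slightly
with_outlier <- c(values, 50)
print(c(mean = mean(with_outlier), median = median(with_outlier)))
```

Note that `fivenum()` reports Tukey's hinges, which can differ slightly from the default `quantile()` quartiles for some sample sizes.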
[Figure: box plot of values with individual points, repeated over several slides]
Let’s practice!
Density Plots
Density plot
- Distribution of univariate data
- Statistics
- Probability Density Function
  - Theoretical: based on formula
  - Empirical: based on data
[Figure: four theoretical densities, f(x) against x: Standard Normal Curve, t (8), chi−sq (1), F (2,18)]
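The four theoretical curves named above are "based on formula", so they can be evaluated with base R's density functions. A minimal sketch follows; the grids and the list name `theoretical` are assumptions chosen for illustration.

```r
# x grids; the positive grid starts just above 0 because the
# chi-squared(1) density is unbounded at x = 0
x_sym <- seq(-3, 3, length.out = 201)
x_pos <- seq(0.01, 4, length.out = 200)

theoretical <- list(
  normal = data.frame(x = x_sym, fx = dnorm(x_sym)),                  # standard normal
  t8     = data.frame(x = x_sym, fx = dt(x_sym, df = 8)),             # t with 8 df
  chisq1 = data.frame(x = x_pos, fx = dchisq(x_pos, df = 1)),         # chi-squared, 1 df
  f2_18  = data.frame(x = x_pos, fx = df(x_pos, df1 = 2, df2 = 18))   # F with (2, 18) df
)

# Peak heights of the two symmetric curves; the t curve is lower at 0
# because more of its mass sits in the tails
print(c(normal = max(theoretical$normal$fx), t8 = max(theoretical$t8$fx)))
```

Each data frame can be passed to `plot(fx ~ x, data = ..., type = "l")` or to a line geom to redraw the corresponding panel.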
Kernel Density Estimate (KDE)
A sum of 'bumps' placed at the observations. The kernel function determines the shape of the bumps, while the window width, h, determines their width.
Source: Brian S. Everitt and Torsten Hothorn, A Handbook of Statistical Analyses Using R
Example
> x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)
> x
[1] 0.0 1.0 1.1 1.5 1.9 2.8 2.9 3.5
[Figure: the observations marked along the x axis]
Bumps
[Figure: one bump per observation, values against x]
Sum of bumps
[Figure: the bumps added into a single curve, values against x]
- Many overlapping lines -> higher value -> higher density
- Empirical Probability Density Function
- Mode = value at which the probability density function has its maximum value
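The "sum of bumps" idea can be written out by hand. The sketch below assumes Gaussian bumps and a bandwidth of 0.4 (both assumptions for illustration), uses the `x` vector from the example slide, and cross-checks the result against R's built-in `density()`.

```r
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)  # observations from the example slide
h <- 0.4                                          # bandwidth, assumed for illustration

grid <- seq(-2, 5.5, by = 0.01)

# One Gaussian bump per observation (one matrix column each), scaled by 1/n
bumps <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = h) / length(x))

# The kernel density estimate is the sum of the bumps at every grid point
kde <- rowSums(bumps)

# Mode: the grid value at which the estimate is largest
mode_x <- grid[which.max(kde)]
print(mode_x)

# Cross-check against the built-in estimator over the same range
builtin <- density(x, bw = h, kernel = "gaussian", from = -2, to = 5.5, n = 512)
print(c(manual_peak = max(kde), builtin_peak = max(builtin$y)))
```

Because each bump is a probability density scaled by 1/n, the summed curve integrates to approximately 1 over a sufficiently wide grid.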
Bandwidth - h
[Figure: density of the x values with bw = 0.4]
[Figure: density of the x values with bw = 0.69]
[Figure annotations: 0.279, 0.355]
Remember: Density plots are representations of the underlying distribution!
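The two bandwidths on this slide can be compared numerically. The sketch below reuses the example vector `x`; the helper `kde_peak()` is hypothetical and assumes a Gaussian kernel. For this `x`, R's default rule of thumb `bw.nrd0()` gives roughly 0.69, and the computed peak heights land close to the 0.355 and 0.279 printed on the slide, which suggests (the slides do not say so explicitly) that 0.69 is the default bandwidth and the two numbers are the curve maxima.

```r
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)  # observations from the example slide

# Default rule-of-thumb bandwidth used by density(); about 0.69 for this x
bw_default <- bw.nrd0(x)
print(round(bw_default, 2))

# Hypothetical helper: maximum of an exact Gaussian KDE with bandwidth h
kde_peak <- function(h, grid = seq(-2, 5.5, by = 0.001)) {
  est <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))
  max(est)
}

# A narrower bandwidth gives a taller, bumpier curve
peaks <- c(narrow = kde_peak(0.4), default = kde_peak(bw_default))
print(round(peaks, 3))
```

The choice of bandwidth changes the picture, not the data, which is the point of the reminder that a density plot is only a representation of the underlying distribution.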
Intermediate steps
[Figure: density of the x values with bw = 0.4; the plot extends beyond the limits of the data]
[Figure: density with bw = 0.4, restricted to the range of the data]
geom_density(): area ≠ 1. This happens for every bandwidth!
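The missing area can be reproduced with base R's `density()`. This sketch assumes the example vector `x` and bw = 0.4; `area_under()` is a hypothetical helper that applies the trapezoidal rule to the estimate.

```r
x <- c(0.0, 1.0, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)  # observations from the example slide

# Hypothetical helper: trapezoidal approximation of the area under a density() result
area_under <- function(d) sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Default: the evaluation grid extends 3 bandwidths beyond the data on each side
full <- density(x, bw = 0.4)

# Restricted: the same estimate evaluated only between min(x) and max(x)
restricted <- density(x, bw = 0.4, from = min(x), to = max(x))

print(c(full = area_under(full), restricted = area_under(restricted)))
```

The bumps centred on the smallest and largest observations each lose about half of their mass when the curve is cut at the data range, so the restricted area falls short of 1 whatever the bandwidth.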
Let’s practice!
Multiple Groups/Variables
Groups
Levels within a factor variable
> head(mammals)
       vore sleep_total
1 Carnivore        12.1
2  Omnivore        17.0
3 Herbivore        14.4
4  Omnivore        14.9
5 Herbivore         4.0
6 Herbivore        14.4
> levels(mammals$vore)
[1] "Carnivore"   "Herbivore"   "Insectivore" "Omnivore"
Jittered points
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_point(position = position_jitter(0.2))
[Figure: jittered points, sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore)]
Box plot
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_boxplot()
[Figure: box plots of sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore)]
A box built from only 5 observations is meaningless!
Box plot (2)
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_boxplot(varwidth = TRUE)
[Figure: box plots of sleep_total by vore, with box width varying by group size]
Density plots
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35)
[Figure: density of sleep_total, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore)]
One group looks abundant, but has only 5 observations!
> # Add weights: each group's share of all observations
> library(dplyr)
> mammals <- mammals %>%
+   group_by(vore) %>%
+   mutate(n = n() / nrow(mammals))
Weighted
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(aes(weight = n), col = NA, alpha = 0.35)
[Figure: weighted density of sleep_total, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore)]
Violin plot
ggplot(mammals, aes(x = vore, y = sleep_total)) +
  geom_violin()
[Figure: violin plots of sleep_total by vore (Carnivore, Herbivore, Insectivore, Omnivore)]
Weighted
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin(aes(weight = n), col = NA)
[Figure: weighted violin plots of sleep_total by vore, filled by vore (Carnivore, Herbivore, Insectivore, Omnivore)]
Compare separate variables
> dim(faithful)
[1] 272   2
> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
First look
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()
[Figure: scatter plot of eruptions against waiting]
2D density plot
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_density_2d()
[Figure: 2D density contours of eruptions against waiting]
2D density plot
# after_stat(density) needs ggplot2 >= 3.3.0; older versions use ..density..
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "tile", aes(fill = after_stat(density)), contour = FALSE)
[Figure: tiles of eruptions against waiting, filled by density (legend from 0.005 to 0.025)]
Viridis
library(viridis)
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "tile", aes(fill = after_stat(density)), contour = FALSE) +
  scale_fill_viridis()
[Figure: tiles of eruptions against waiting, filled by density on the viridis scale (legend from 0.005 to 0.025)]
Grid of circles
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  stat_density_2d(geom = "point", aes(size = after_stat(density)), n = 20, contour = FALSE) +
  scale_size(range = c(0, 9))
[Figure: grid of circles, eruptions against waiting, with circle size mapped to density (legend from 0.005 to 0.020)]