Comparing Distributions, by Nick Strayer (Instructor, DataCamp)

SLIDE 1

DataCamp Visualization Best Practices in R

Comparing Distributions

VISUALIZATION BEST PRACTICES IN R

Nick Strayer

Instructor

SLIDE 2

Why compare distributions?

Verify balanced groups
For comparison's sake

SLIDE 3

Why not facet histograms?

library(ggplot2)

ggplot(md_speeding, aes(x = speed_over)) +
  geom_histogram() +
  facet_grid(vehicle_color ~ .)

SLIDE 4

The box plot

SLIDE 5

Box plot pros

Familiar
Lots of good summary statistics
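As a rough illustration of the summary statistics a box plot encodes, the sketch below computes them by hand in base R. The data are synthetic (the `speed_over` vector here is made up, not the course's `md_speeding` data), and the 1.5 × IQR whisker rule is the common convention, assumed here rather than taken from the slides.

```r
# Synthetic stand-in for a numeric variable; values are illustrative only
set.seed(1)
speed_over <- round(rnorm(100, mean = 12, sd = 4))

# Quartiles: the box spans q1..q3 and the line inside it is the median
q <- unname(quantile(speed_over, probs = c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
lower_whisker <- min(speed_over[speed_over >= lower_fence])
upper_whisker <- max(speed_over[speed_over <= upper_fence])

# Anything beyond the fences is drawn as an individual outlier point
outliers <- speed_over[speed_over < lower_fence | speed_over > upper_fence]

box_summary <- c(
  lower_whisker = lower_whisker,
  q1            = q[1],
  median        = q[2],
  q3            = q[3],
  upper_whisker = upper_whisker
)
print(box_summary)
```

Five numbers plus a handful of outliers is all the plot shows, which is both its strength and the limitation the next slide raises.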

SLIDE 6

Boxplot cons

Show me the data!

SLIDE 7

A simple addition

geom_jitter() shows raw points jostled to avoid overlap.

Layer under your geom_boxplot.

library(dplyr)
library(ggplot2)

md_speeding %>%
  filter(vehicle_color == 'BLUE') %>%
  ggplot(aes(x = gender, y = speed)) +
  # Draw points behind
  geom_jitter(alpha = 0.3, color = 'steelblue') +
  # Make the box transparent so the points show through
  geom_boxplot(alpha = 0) +
  labs(title = 'Distribution of speed for blue cars by gender')
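To make "jostled" concrete, here is a minimal base R sketch of what jittering does to point positions. It uses base R's `jitter()` on made-up group positions rather than ggplot2 itself, and the 0.2 amount is an arbitrary choice for illustration.

```r
# Two groups plotted at x = 1 and x = 2; without jitter, the 50 points
# in each group share one x position and overplot
set.seed(7)
x <- rep(c(1, 2), each = 50)

# jitter() adds uniform random noise in [-amount, amount] to each value
x_jittered <- jitter(x, amount = 0.2)

# Every point stays within 0.2 of its group's position
max(abs(x_jittered - x))
```

The noise carries no information; it only spreads points apart so they can be seen. In ggplot2, the `width` argument of `geom_jitter()` plays a similar role for the horizontal spread.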

SLIDE 8

SLIDE 9

Let's compare some distributions

VISUALIZATION BEST PRACTICES IN R

SLIDE 10

Boxplot alternatives

VISUALIZATION BEST PRACTICES IN R

Nick Strayer

Instructor

SLIDE 11

Limitations of the boxplot with jitter

Jostling points can only deal with so much overlap
Hard to get an idea of data density

SLIDE 12

What are some other options?

Beeswarm plots
Violin plots

SLIDE 13

Beeswarm plots

'Smart' jittering
Individual points are clumped together as close to the axis as possible
Handily included as geom_beeswarm in the ggbeeswarm package.

library(ggbeeswarm)

ggplot(data, aes(y = y, x = group)) +
  geom_beeswarm(color = 'steelblue')

SLIDE 14

SLIDE 15

Beeswarm pros

Individual datapoints
Distributional shape

SLIDE 16

Beeswarm cons

Gets hard to read with lots of data
Arbitrary stacking

SLIDE 17

Violin plots

Kernel density estimate (KDE) reflected to be symmetric
Just replace geom_boxplot with geom_violin.

ggplot(data, aes(y = y, x = group)) +
  geom_violin(fill = 'steelblue')
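The "KDE reflected to be symmetric" idea can be sketched by hand in base R: estimate the density once, then mirror it around a centre line. This shows the idea only and is not how ggplot2 is implemented; the data are synthetic and the 0.4 scaling factor is an arbitrary choice.

```r
# Synthetic bimodal data; values are illustrative only
set.seed(42)
y <- c(rnorm(200, mean = 10, sd = 2), rnorm(100, mean = 18, sd = 1.5))

# Kernel density estimate of y (Gaussian kernel by default)
kde <- density(y)

# Scale the density so the widest point is 0.4 units from the centre line,
# then mirror it to get the left and right edges of the violin
half_width <- kde$y / max(kde$y) * 0.4
violin <- data.frame(
  y     = kde$x,
  left  = -half_width,
  right = half_width
)

# Draw the outline as a closed polygon: up the left edge, down the right
plot(NA, xlim = c(-0.5, 0.5), ylim = range(violin$y),
     xlab = "", ylab = "y", xaxt = "n")
polygon(c(violin$left, rev(violin$right)),
        c(violin$y, rev(violin$y)),
        col = "steelblue")
```

Both halves carry the same information; the mirroring only makes the shape easier to read.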

SLIDE 18

SLIDE 19

Violin pros

Every datapoint is heard
Not every datapoint is seen, so good for lots of data.

SLIDE 20

Violin cons

Kernel width choice
Not every datapoint is seen
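The kernel width concern can be seen directly with base R's `density()`. The sketch below uses synthetic two-peaked data and arbitrary `adjust` values; `adjust` multiplies the default bandwidth, and `geom_violin()` accepts an `adjust` argument that serves a comparable purpose.

```r
# Synthetic data with two clear peaks; values are illustrative only
set.seed(3)
y <- c(rnorm(150, mean = 8, sd = 1), rnorm(150, mean = 14, sd = 1))

# Same data, three kernel widths
d_narrow  <- density(y, adjust = 0.5)  # half the default bandwidth
d_default <- density(y)                # default bandwidth rule
d_wide    <- density(y, adjust = 2)    # double the default bandwidth

# The bandwidth actually used by each estimate
c(narrow = d_narrow$bw, default = d_default$bw, wide = d_wide$bw)

# Overlay the three curves to compare their shapes
plot(d_narrow, main = "Same data, different kernel widths")
lines(d_default, lty = 2)
lines(d_wide, lty = 3)
```

A narrow kernel tends to follow noise in the sample, while a wide one can blur real structure such as the gap between the two peaks, so the shape a reader sees depends partly on a choice the plot does not display.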

SLIDE 21

Let's try some more advanced comparisons!

VISUALIZATION BEST PRACTICES IN R

SLIDE 22

Comparing spatially related distributions

VISUALIZATION BEST PRACTICES IN R

Nick Strayer

Instructor

SLIDE 23

What are 'spatially connected axes'?

There is an underlying ordering of the classes. E.g. months of the year: Jan < Feb < Mar < ...
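In R, an underlying ordering like this is usually made explicit with an ordered factor; otherwise character values sort alphabetically (Apr, Aug, Dec, ...) on the axis. A minimal sketch, assuming month names are stored as three-letter abbreviations (the `months_raw` vector is made up for illustration):

```r
# Hypothetical raw month labels, in no particular order
months_raw <- c("Mar", "Jan", "Feb", "Jan", "Dec")

# month.abb is base R's built-in vector "Jan", "Feb", ..., "Dec";
# using it as the levels fixes calendar order instead of alphabetical order
month <- factor(months_raw, levels = month.abb, ordered = TRUE)

# Comparisons and sorting now respect the calendar
month[2] < month[3]   # Jan < Feb
sort(month)
```

Plotting functions that place one group per axis position generally follow factor level order, so setting the levels once keeps the groups in calendar order.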

SLIDE 24

The ridgeline plot

library(ggridges) # gives us geom_density_ridges()

ggplot(md_speeding, aes(x = speed_over, y = month)) +
  geom_density_ridges(bandwidth = 2) +
  xlim(1, 35)

SLIDE 25

Ridgeline pros

SLIDE 26

Ridgeline cons

SLIDE 27

SLIDE 28

Let's make some ridgelines!

VISUALIZATION BEST PRACTICES IN R

SLIDE 29

Congratulations!

VISUALIZATION BEST PRACTICES IN R

Nick Strayer

Instructor

SLIDE 30

SLIDE 31

SLIDE 32

SLIDE 33

SLIDE 34

Going further

Flowing data: curated list of data visualizations and R-based tutorials.
Datawrapper Blog: articles that dig deep into visualization techniques and mistakes.
Twitter (#datavis): an ongoing stream of cool projects and inspiration.
Books! Data Visualization, Andy Kirk; The Functional Art and The Truthful Art by Alberto Cairo.

SLIDE 35

Thank You!

VISUALIZATION BEST PRACTICES IN R