Always plot your data first!
"Always." - Severus Snape
Why?
- Outliers and impossible values
- Determine correct statistical approach
- Assumptions and diagnostics
- Discover new relationships
The Visualization Paradox
- Often the most informative aspect of analysis
- Communicates the "data story" the best
- Most abused area of quantitative science
- Figures can be very misleading
Misleading Graphs
Much better
Keys to Good Viz's
- Graphical method should match level of measurement
- Label all axes and include a figure caption
- Simplicity and clarity
- Avoid 'chartjunk'
- Unless there are 3 or more variables, avoid 3D figures (and even then, avoid them)
- Black & white or grayscale/pattern is fine for most simple figures
Data Visualizations
Takes practice: try a bunch of stuff
Resources
- Edward Tufte's books
- "R for Data Science" by Grolemund and Wickham
- "Data Visualization for Social Science" by Healy
Frequency Distributions
- Counting the number of occurrences of unique events
- Categorical or continuous, just like with tableF() and table1()
- Can see central tendency (continuous data) or most common value (categorical data)
- Can see range and extremes
──────────────────────────────────────────────────────
 x       Freq CumFreq Percent CumPerc Valid  CumValid
 1       265  265     26.50%  26.50%  27.32% 27.32%
 2       222  487     22.20%  48.70%  22.89% 50.21%
 3       242  729     24.20%  72.90%  24.95% 75.15%
 4       241  970     24.10%  97.00%  24.85% 100.00%
 Missing 30   1000    3.00%   100.00%
──────────────────────────────────────────────────────
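To make the table's columns concrete, here is a minimal base R sketch of how such a frequency table could be built by hand. The vector `x` and the name `freq_table` are made up for illustration, and this is not how `tableF()` is implemented internally. It only shows the arithmetic: Percent divides by all responses, while Valid divides by the non-missing responses.

```r
# Hypothetical responses on a 1-4 scale; NA marks a missing answer
x <- c(1, 2, 2, 3, 4, 4, 4, NA, 1, NA)

# Count each unique value, keeping missing responses as their own row
counts <- table(x, useNA = "ifany")

freq_table <- data.frame(
  x    = names(counts),          # the NA row has a missing name
  Freq = as.vector(counts)
)

n_total <- length(x)             # all responses, missing included
n_valid <- sum(!is.na(x))        # non-missing responses only

freq_table$CumFreq <- cumsum(freq_table$Freq)
freq_table$Percent <- 100 * freq_table$Freq / n_total
freq_table$CumPerc <- cumsum(freq_table$Percent)

# Valid percent: same counts, but divided by the non-missing total
freq_table$Valid    <- ifelse(is.na(freq_table$x), NA, 100 * freq_table$Freq / n_valid)
freq_table$CumValid <- cumsum(freq_table$Valid)

freq_table
```

The cumulative valid percent reaches 100 on the last non-missing row, which matches the pattern in the table above.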
Frequencies and Viz's Together ❤
Bar Graph
Histogram
What does DISTRIBUTION mean?
The way that the data points are scattered
For Continuous
- General shape
- Exceptions (outliers)
- Modes (peaks)
- Center & spread (chap 3)
- Histogram
For Categorical
- Counts of each
- Percent or Rate (adjusts for an 'out of' to compare)
- Bar chart
- Pie chart (avoid!)
Let's Apply This To the Ihno Dataset
Reminder
Read in the Data
library(tidyverse)   # the easy button
library(rio)         # read in Excel files
library(furniture)   # nice tables
data_raw <- rio::import("Ihno_dataset.xls") %>%
  dplyr::rename_all(tolower)   # converts all variable names to lower case
And Clean It
data_clean <- data_raw %>%
  dplyr::mutate(majorF = factor(major,
                                levels = c(1, 2, 3, 4, 5),
                                labels = c("Psychology", "Premed", "Biology",
                                           "Sociology", "Economics"))) %>%
  dplyr::mutate(coffeeF = factor(coffee,
                                 levels = c(0, 1),
                                 labels = c("Not a regular coffee drinker",
                                            "Regularly drinks coffee")))
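The cleaning step relies on `factor()` to turn numeric codes into labeled categories. As a small sketch of what that call does on its own, the example below uses a made-up vector named `major` (not the real dataset) with the same levels and labels as above.

```r
# Hypothetical numeric codes, as they might appear in a raw data file
major <- c(1, 3, 2, 5, 1, 4)

# levels = the codes found in the data; labels = the names to show instead
majorF <- factor(major,
                 levels = c(1, 2, 3, 4, 5),
                 labels = c("Psychology", "Premed", "Biology",
                            "Sociology", "Economics"))

majorF           # prints the labels rather than the codes
table(majorF)    # counts per label, in the order given by levels
```

Keeping the original numeric column and adding a new one with an `F` suffix, as the slides do, leaves both versions available for later use.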
Frequency Distributions
data_clean %>%
  furniture::tableF(majorF)
##
## ─────────────────────────────────────────
##  majorF     Freq CumFreq Percent CumPerc
##  Psychology 29   29      29.00%  29.00%
##  Premed     25   54      25.00%  54.00%
##  Biology    21   75      21.00%  75.00%
##  Sociology  15   90      15.00%  90.00%
##  Economics  10   100     10.00%  100.00%
## ─────────────────────────────────────────
data_clean %>%
  furniture::tableF(phobia)
##
## ─────────────────────────────────────
##  phobia Freq CumFreq Percent CumPerc
##  0      12   12      12.00%  12.00%
##  1      15   27      15.00%  27.00%
##  2      12   39      12.00%  39.00%
##  3      16   55      16.00%  55.00%
##  4      21   76      21.00%  76.00%
##  5      11   87      11.00%  87.00%
##  6      1    88      1.00%   88.00%
##  7      4    92      4.00%   92.00%
##  8      4    96      4.00%   96.00%
##  9      1    97      1.00%   97.00%
##  10     3    100     3.00%   100.00%
## ─────────────────────────────────────
Frequency Viz's
For viz's, we will use ggplot2
This provides the most powerful, beautiful framework for data visualizations
- It is built on making layers
- Each plot has a "geom" function
- e.g. geom_bar() for bar charts, geom_histogram() for histograms, etc.
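As a rough illustration of the layering idea, the sketch below builds a bar chart in two steps. It uses R's built-in `mtcars` data as a stand-in (an assumption for the example, not the course dataset), and the object names `p` and `p_bar` are made up.

```r
library(ggplot2)

# Step 1: the data and the aesthetic mapping (which variable goes on x)
p <- ggplot(mtcars) +
  aes(factor(cyl))

# Step 2: the geom layer decides how the mapped data are drawn
p_bar <- p + geom_bar()

p_bar   # printing the object draws the plot
```

Swapping `geom_bar()` for a different geom changes how the data are drawn without touching the data or the mapping, which is the main appeal of building plots in layers.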
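Bar Charts
# Step 1: map majorF to the x-axis; with no geom yet, this draws an empty plot
data_clean %>%
  ggplot() +
  aes(majorF)
# Step 2: add a geom_bar() layer to draw one bar per major
data_clean %>%
  ggplot() +
  aes(majorF) +
  geom_bar()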
Bar Charts
data_clean %>%
  ggplot() +
  aes(coffee) +
  geom_bar()
Histograms
data_clean %>%
  ggplot() +
  aes(phobia) +
  geom_histogram()
Histograms (change number of bins)
data_clean %>%
  ggplot() +
  aes(phobia) +
  geom_histogram(bins = 8)
Histograms (change bins to size 5)
data_clean %>%
  ggplot() +
  aes(phobia) +
  geom_histogram(binwidth = 5)
Histograms
data_clean %>%
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4)
Histograms -by- a Factor (columns)
data_clean %>%
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4) +
  facet_grid(. ~ coffeeF)
Histograms -by- a Factor (rows)
data_clean %>%
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4) +
  facet_grid(coffeeF ~ .)
Deciles (break into 10% chunks)
data_clean %>%
  dplyr::pull(statquiz) %>%
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90))
## 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 4.0 6.0 6.0 7.0 7.0 8.0 8.0 8.0 8.1
Deciles - with missing values
data_clean %>%
  dplyr::pull(mathquiz) %>%
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90))
Error in quantile.default(., probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, : missing values and NaN's not allowed if 'na.rm' is FALSE
Deciles - na.rm = TRUE
data_clean %>%
  dplyr::pull(mathquiz) %>%
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90),
           na.rm = TRUE)
##  10%  20%  30%  40%  50%  60%  70%  80%  90%
## 15.0 21.0 25.2 28.0 30.0 32.0 33.8 37.2 41.0
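The effect of `na.rm` can be seen on a tiny example. The sketch below uses a made-up vector called `scores` (not the real `mathquiz` column) and wraps the failing call in `tryCatch()` so that the script keeps running. It also uses `seq()` as a shorter way to write the nine decile probabilities.

```r
# Hypothetical quiz scores with one missing value
scores <- c(12, 18, 25, NA, 30, 41)

# seq() builds the same probabilities as typing .10, .20, ..., .90 by hand
decile_probs <- seq(0.10, 0.90, by = 0.10)

# Without na.rm = TRUE this call stops with an error, so capture the outcome
result <- tryCatch(quantile(scores, probs = decile_probs),
                   error = function(e) "error")

# With na.rm = TRUE the missing value is dropped before computing
deciles <- quantile(scores, probs = decile_probs, na.rm = TRUE)
deciles
```

Dropping missing values changes the sample the deciles describe, so it is worth checking how many values are missing before relying on the result.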
Quartiles (break into 4 chunks)
data_clean %>%
  dplyr::pull(statquiz) %>%
  quantile(probs = c(0, .25, .50, .75, 1))
##   0%  25%  50%  75% 100%
##    1    6    7    8   10
Percentiles
data_clean %>%
  dplyr::pull(statquiz) %>%
  quantile(probs = c(.01, .05, .173, .90))
##    1%    5% 17.3%   90%
##  2.98  3.00  5.00  8.10
Questions?
Next Topic
Center and Spread
Data Visualization
Cohen Chapter 2
EDUC/PSY 6600