201ab Quantitative methods: Visualization. Ed Vul | UCSD Psychology



SLIDE 1

ED VUL | UCSD Psychology

201ab Quantitative methods Visualization

SLIDE 2

  • Visualization failure modes
  • Cool vs informative visualizations
  • Ways graphs can mislead
  • Making a graph pretty
  • ggplot: grammar of graphics
SLIDE 3

Entirely made up.

SLIDE 4

Nonsense variables.

SLIDE 5

Graph independent of data.

SLIDE 6

Multiple variables graphed as one.

SLIDE 7

Credit: xkcd

SLIDE 8

Not labeled (or mislabeled).

SLIDE 9

Misleading or useless axis scales.

SLIDE 10

Misleading binning.

SLIDE 11

Illegible

SLIDE 12

Credit: xkcd

SLIDE 13

Visualization failure modes

  • Completely made up.
  • Nonsense variables/relationships.
  • Graph independent of data.
  • Multiple variables treated as one.
  • Not labeled, or mislabeled.
  • Misleading / unusable scales.
  • Misleading binning.
  • Illegible.
  • Crazy mapping from variables -> visual properties.
SLIDE 14

SLIDE 15

SLIDE 16

SLIDE 17

SLIDE 18

  • Visualization failure modes
  • Cool vs scientific visualizations
  • Making a graph pretty
  • ggplot: grammar of graphics
  • How to graph common data types.
SLIDE 19

SLIDE 20

From dynamicdiagrams.com

SLIDE 21

From dynamicdiagrams.com

SLIDE 22

From dynamicdiagrams.com

SLIDE 23

From dynamicdiagrams.com

SLIDE 24

This one:

  • Looks cooler!
  • Provides a visual puzzle.
  • Misrepresents magnitudes.
  • Does not adhere to (modern!) convention.
  • Makes it difficult to make quantitative comparisons, or extract numbers.

This is a bad scientific data display, but it is a cool visualization.

This one:

  • Looks a bit more boring.
  • Is much easier to parse and understand.
  • Accurately, quantitatively represents magnitudes.
  • Adheres to modern convention.
  • Makes it easy to make quantitative comparisons, and extract numbers.

This is a good scientific data display, but might not be as interesting a visualization.

SLIDE 25

SLIDE 26

SLIDE 27

  • Visualization failure modes
  • Cool vs scientific visualizations
  • Making a graph pretty
  • ggplot: grammar of graphics
  • How to graph common data types.
SLIDE 28

SLIDE 29

May have gone a bit overboard into “visualization” territory – looks good, but starts violating some conventions:

  • No Y axis
  • Y axis label used as title
SLIDE 30

  • Visualization failure modes
  • Cool vs informative visualizations
  • Making a graph pretty
  • ggplot: grammar of graphics
  • Graphs for common types of data.
SLIDE 31

library(ggplot2)
Fig <- ggplot(data=..., mapping=aes(...)) +
  facet_*() +
  geom_*() +
  stat_*() +
  scale_*() +
  theme*()

Basic operation:

  • Take a tidy data frame.
  • Map variables onto different aesthetic variables (e.g., x, y, color, fill, size, shape, alpha, group).
  • Draw some geom(etric entity) according to that mapping (e.g., point, line, tile, area, ribbon, etc.)
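
A minimal sketch of how the template might be filled in. It uses R's built-in mtcars data purely as a stand-in (it is not data from these slides), and the variable choices, palette, and theme are illustrative assumptions:

```r
library(ggplot2)

# mtcars ships with R; it stands in for "a tidy data frame" here.
cars <- mtcars
cars$cyl <- factor(cars$cyl)                                    # cylinder count as categorical
cars$am  <- factor(cars$am, labels = c("automatic", "manual"))  # 0 = automatic, 1 = manual

fig <- ggplot(data = cars,
              mapping = aes(x = wt, y = mpg, color = cyl)) +    # variables -> aesthetics
  facet_wrap(~ am) +                                            # facet_*(): one panel per transmission
  geom_point(size = 2) +                                        # geom_*(): draw points
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +     # stat_*(): fitted line per color group
  scale_color_brewer(palette = "Dark2", name = "Cylinders") +   # scale_*(): control the color mapping
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()                                               # theme*(): overall look
```

Printing `fig` (or calling `print(fig)` in a script) draws the plot.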

SLIDE 32

SLIDE 33

SLIDE 34

  • Visualization failure modes
  • Cool vs informative visualizations
  • Making a graph pretty
  • ggplot: grammar of graphics
  • Graphs for common types of data.
  • Practice in R.
  • More exotic graph types / considerations
SLIDE 35

Goal: show how response/dependent variable(s) change with explanatory/independent variable(s).

What kind of variables? Categorical? Numerical?

It helps to think of the relationship as an abstract formula of sorts. For example, how does height (numerical response) vary across sex (categorical), nationality (categorical), and parents’ income (numerical)?

numerical ~ 2*categorical + numerical

This abstraction helps you pick starting points for graphs.
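
One way the height example could translate into a starting-point graph: the response goes on y, the numerical explanatory variable on x, and the two categorical variables go to color and facets. The data below are simulated and every number is made up; the column names and units are assumptions.

```r
library(ggplot2)
set.seed(1)

# Simulated stand-in data: every value below is invented for illustration.
n <- 200
people <- data.frame(
  sex         = sample(c("female", "male"), n, replace = TRUE),
  nationality = sample(c("A", "B", "C"), n, replace = TRUE),
  income      = runif(n, 20, 150)   # parents' income, hypothetical units (thousands)
)
people$height <- 160 + 12 * (people$sex == "male") +
  0.03 * people$income + rnorm(n, sd = 6)

# numerical ~ 2*categorical + numerical
fig <- ggplot(people, aes(x = income, y = height, color = sex)) +
  facet_wrap(~ nationality) +
  geom_point(alpha = 0.6) +
  labs(x = "Parents' income (thousands)", y = "Height (cm)")
```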

SLIDE 36

categorical ~ 0

(1 categorical response variable, with 0 explanatory variables)

Pie chart

  - Hardest comparisons
  ++ Easiest proportion
  - Waste of ink
  - Considered tacky

Histogram / barplot of counts

  ++ Easiest comparisons
  - Hardest proportion

Stacked bar plot

  + Easy-ish comparisons
  + Easy-ish proportion
  + Socially acceptable pie chart

Data: http://vulstats.ucsd.edu/data/spsp.demographics.cleaned.csv
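
A sketch of the three options in ggplot2. The data frame is hypothetical (it is not the SPSP demographics file), and the category names and counts are invented:

```r
library(ggplot2)

# Hypothetical categorical responses; counts are made up.
resp <- data.frame(
  stage = factor(rep(c("Undergrad", "Grad", "Postdoc", "Faculty"),
                     times = c(12, 40, 18, 30)),
                 levels = c("Undergrad", "Grad", "Postdoc", "Faculty")))

# Bar plot of counts: one bar per category.
bars <- ggplot(resp, aes(x = stage)) +
  geom_bar() +
  labs(x = "Career stage", y = "Count")

# Stacked bar: a single bar split by category, rescaled to proportions.
stacked <- ggplot(resp, aes(x = "", fill = stage)) +
  geom_bar(position = "fill", width = 0.5) +
  labs(x = NULL, y = "Proportion", fill = "Career stage")

# Pie chart: the same stacked bar drawn in polar coordinates.
pie <- stacked + coord_polar(theta = "y")
```

Switching `position = "fill"` to `position = "stack"` shows counts instead of proportions in the stacked bar.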

SLIDE 37

categorical ~ 0

(1 categorical response variable, with 0 explanatory variables)

Data: http://vulstats.ucsd.edu/data/spsp.demographics.cleaned.csv

Counts: highlight sample size when n is small.

Proportions: easier interpretation.

SLIDE 38

numerical ~ 0

(1 numerical response variable, with 0 explanatory variables)

Smoothed density

  - Obscures noisiness
  + Not too sensitive to reasonable kernel width

Histogram

  + Portrays noisiness
  - Impression sensitive to bins

Data: http://vulstats.ucsd.edu/data/cal1020.cleaned.Rdata
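
To see the bin sensitivity directly, the sketch below draws the same simulated sample (made up, not the cal1020 data) with two bin widths and as a smoothed density. The specific widths are arbitrary choices.

```r
library(ggplot2)
set.seed(2)

# Simulated numerical response: a mixture of two normal clusters.
d <- data.frame(x = c(rnorm(150, 50, 8), rnorm(50, 75, 5)))

# Same data, two bin widths: the visual impression changes with the bins.
hist_narrow <- ggplot(d, aes(x = x)) + geom_histogram(binwidth = 1)
hist_wide   <- ggplot(d, aes(x = x)) + geom_histogram(binwidth = 10)

# Smoothed density: hides the sample noise; `adjust` rescales the kernel width.
dens <- ggplot(d, aes(x = x)) + geom_density(adjust = 1)
```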

SLIDE 39

numerical ~ 0

(1 numerical response variable, with 0 explanatory variables)

SLIDE 40

numerical ~ categorical

(1 numerical response variable, with 1 categorical explanatory variable)

Panels: mean + error; boxplot; jitter (useful when n is small); violin (useful when n is large); densities (coords flipped); empirical CDF (coords flipped).

Notes: Best when coords not flipped. Best for few categories (<4?). Easy stat. comparison.

SLIDE 41

Credit: xkcd

SLIDE 42

numerical ~ categorical

(1 numerical response variable, with 1 categorical explanatory variable)

  – Always put error bars on bar charts (std. error or CI are fine).
  – Look at rawer data (e.g., strip charts) before going to more compressed plots.
  – By removing the solid bar from a bar chart, you can add a good visualization of data distribution. This is better.

SLIDE 43

numerical ~ categorical

(my suggestions)

With small n: Show all the data points with jitter (here, data are subsampled to generate a low-n scenario).

With large n: Show distribution with violin or density.
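
A sketch of both suggestions, using simulated scores for three hypothetical groups. The small-n case is produced by subsampling, mirroring the approach described on the slide; group names, sample sizes, and colors are assumptions.

```r
library(ggplot2)
set.seed(3)

# Simulated scores for three hypothetical groups (made-up data).
big <- data.frame(
  group = rep(c("a", "b", "c"), each = 500),
  score = rnorm(1500, mean = rep(c(10, 12, 11), each = 500), sd = 3))
small <- big[sample(nrow(big), 45), ]   # subsample to mimic a small-n study

# Small n: every data point (jittered sideways only), plus mean +/- 1 std. error.
small_n <- ggplot(small, aes(x = group, y = score)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.6) +
  stat_summary(fun.data = mean_se, geom = "pointrange", color = "red")

# Large n: distribution shape per group, plus mean +/- 1 std. error.
large_n <- ggplot(big, aes(x = group, y = score)) +
  geom_violin(fill = "grey85") +
  stat_summary(fun.data = mean_se, geom = "pointrange", color = "red")
```

Setting `height = 0` in `geom_jitter()` keeps the jitter horizontal, so the plotted y values remain the actual data.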

SLIDE 44

numerical ~ categorical

(eclectic plots, useful with large n, weird distributional differences)

Overlaid densities/histograms: With large n can show weird differences.

Cumulative distribution functions: Highlight differences in the tails. Only useful with really large n (so tails aren’t just noise).
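
A sketch of both plot types with simulated data: two made-up groups that share a center but differ in the tails. The group names, sample size, and axis limits are illustrative assumptions.

```r
library(ggplot2)
set.seed(4)

# Simulated: normal vs. heavier-tailed t distribution (made-up data, large n).
d <- data.frame(
  group = rep(c("narrow", "heavy-tailed"), each = 5000),
  value = c(rnorm(5000), rt(5000, df = 3)))

# Overlaid densities: semi-transparent fills keep both groups visible.
dens <- ggplot(d, aes(x = value, fill = group)) +
  geom_density(alpha = 0.4) +
  coord_cartesian(xlim = c(-6, 6))

# Empirical CDFs: tail differences appear as gaps near 0 and 1.
cdf <- ggplot(d, aes(x = value, color = group)) +
  stat_ecdf() +
  coord_cartesian(xlim = c(-6, 6)) +
  labs(y = "Cumulative proportion")
```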

SLIDE 45

numerical ~ numerical

(1 numerical response variable, with 1 numerical explanatory variable)

Scatterplot: Best option with small n. Hard to make legible with large n.

2D histogram heatmap: Useless for small n. Best option with large n.

2 x numerical ~ 0
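
A sketch contrasting the two options on simulated data. The helper `make_xy()` is hypothetical and exists only for this illustration; the sample sizes and bin count are arbitrary.

```r
library(ggplot2)
set.seed(5)

# Hypothetical helper: simulate n correlated (x, y) pairs.
make_xy <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 0.6 * x + rnorm(n, sd = 0.8))
}
small <- make_xy(40)
large <- make_xy(50000)

# Small n: a plain scatterplot shows every observation.
scatter <- ggplot(small, aes(x = x, y = y)) + geom_point()

# Large n: 2D histogram; fill color encodes the count in each cell.
heat <- ggplot(large, aes(x = x, y = y)) +
  geom_bin2d(bins = 60) +
  scale_fill_viridis_c(name = "Count")
```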

SLIDE 46

numerical ~ numerical

(1 numerical response variable, with 1 numerical explanatory variable)

Conditional means: This will require binning by x.

Fitted conditional means: Very rarely should you show these on their own, without the raw data.

Generally: use method=lm, rather than loess.

SLIDE 47

Credit: xkcd

SLIDE 48

numerical ~ numerical

(my recommendation)

My recommendation: Show data, show fit.
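
A sketch of "show data, show fit" on simulated data (the true slope of 0.5 and the sample size are made up for the example). The linear fit is drawn over the raw points rather than on its own:

```r
library(ggplot2)
set.seed(6)

# Simulated linear relationship with noise (made-up data).
d <- data.frame(x = runif(120, 0, 10))
d$y <- 2 + 0.5 * d$x + rnorm(120)

fig <- ggplot(d, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +                     # show data
  geom_smooth(method = "lm", formula = y ~ x)   # show fit, with its confidence band
```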

SLIDE 49

numerical ~ numerical

(1 numerical response variable, with 1 numerical explanatory variable)

Normalization by x is useful when you don’t care about the distribution over x.

Note: you are unlikely to luxuriate in this much data.

SLIDE 50

numerical ~ numerical + categorical

(1 numerical response, with numerical & categorical explanatory variables)

Color-coded scatterplot: Hard to parse with lots of data.

Fitted lines / conditional means: Show error bars. If y is smooth in x, show conditional means (as in here). Bin width matters.

Note importance of explanatory variable on the x axis!
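
A sketch of binned conditional means with error bars, one line per category. The data are simulated (the slopes and noise level are made up), and the choice of 10 bins is arbitrary, which matters because bin width changes the picture.

```r
library(ggplot2)
set.seed(7)

# Simulated: y rises with x, more steeply in group "b" (made-up data).
n <- 2000
d <- data.frame(x = runif(n, 0, 10),
                group = sample(c("a", "b"), n, replace = TRUE))
d$y <- ifelse(d$group == "a", 0.3, 0.8) * d$x + rnorm(n, sd = 2)

# Conditional means: bin x, then plot mean +/- 1 std. error of y in each bin.
# Explanatory variable on the x axis; the categorical variable maps to color.
fig <- ggplot(d, aes(x = x, y = y, color = group)) +
  stat_summary_bin(fun.data = mean_se, geom = "pointrange", bins = 10) +
  stat_summary_bin(fun = mean, geom = "line", bins = 10)
```

Adding `facet_wrap(~ group)` would split the groups into separate panels, which suits dense scatterplots; keeping them in one panel makes the lines easier to compare.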

SLIDE 51

numerical ~ numerical + categorical

(1 numerical response, with numerical & categorical explanatory variables)

If scatterplots are important, split into facets with large n. If line comparison is important, keep in same panel.

SLIDE 52

General pointers

SLIDE 53

General pointers

  • Label your axes.
  • Follow conventions.
      – Explanatory variable on x axis.
      – Don’t get creative – respect variable types.
      – Don’t make visualization puzzles.
  • Convey information clearly, numerically.
  • Represent uncertainty! (distribution, error, confidence)
  • Be wary of binning artifacts / thresholding.
  • Cool visualizations are not good science graphs.
SLIDE 54

Graph priorities

  • Interpretable without requiring caption or puzzle
      – Label all axes, legends, etc. intuitively.
      – No spiffy visualization puzzles.
  • Facilitate quantitative interpretation and comparison
      – Easy to estimate numbers from graph.
      – Be wary of binning/thresholding.
  • Permit inferential statistics by eye
      – Represent distribution/variability, uncertainty/error.
  • Follow conventions for the relationship/data presented
  • Graphs should not waste ink and should look pretty
SLIDE 55

SLIDE 56

  • Visualization failure modes
  • Cool vs informative visualizations
  • Making a graph pretty
  • ggplot: grammar of graphics
  • Graphs for common types of data.
  • Practice in R.
  • More esoteric graph types / considerations
SLIDE 57

http://vulstats.ucsd.edu/data/duckworth-grit-scale-data/data-coded.csv

Make plots to…

  1. Compare males and females on the big 5 personality traits:
       • extroversion
       • neuroticism
       • agreeableness
       • conscientiousness
       • openness
  2. Evaluate the relationship between conscientiousness and grit.
       • Does this relationship vary with sex?
SLIDE 58

  • Visualization failure modes
  • Cool vs informative visualizations
  • Making a graph pretty
  • ggplot: grammar of graphics
  • Graphs for common types of data.
  • Practice in R.
  • More esoteric graph types / considerations
SLIDE 59

2 x categorical ~ 0

(2 categorical response variables, with 0 explanatory variables)

SLIDE 60

categorical ~ categorical

(1 categorical response variable, with 1 categorical explanatory variable)

SLIDE 61

categorical ~ numerical

(1 categorical response variable, with 1 numerical explanatory variable)

Stacked area charts. Generally, must round/bin numerical variable.

  • Stacked counts show the distribution of numerical variable.
  • Proportions show how categorical variable changes.
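
A sketch of the counts-versus-proportions contrast. It swaps in binned stacked bars (`geom_histogram`) rather than a true area chart, since the binning step is the same. The data are simulated, and the variable names and 5-year bin width are assumptions.

```r
library(ggplot2)
set.seed(8)

# Simulated: probability of answering "yes" rises with age (made-up data).
n <- 5000
d <- data.frame(age = round(runif(n, 18, 80)))
d$answer <- ifelse(runif(n) < plogis((d$age - 50) / 10), "yes", "no")

# Stacked counts: total bar height shows the distribution of age.
counts <- ggplot(d, aes(x = age, fill = answer)) +
  geom_histogram(binwidth = 5, position = "stack")

# Proportions: each bin rescaled to 1, showing how the answer shifts with age.
props <- ggplot(d, aes(x = age, fill = answer)) +
  geom_histogram(binwidth = 5, position = "fill") +
  labs(y = "Proportion")
```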

SLIDE 62

categorical ~ numerical

(with small n, binning must be very coarse; most useful with large n)

SLIDE 63

  • num. ~ cat. vs
  • cat. ~ num.

Same data, but they invite different comparisons and interpretations.

SLIDE 64

numerical ~ 2 x categorical

(1 numerical response variable, with 2 categorical explanatory variables)

SLIDE 65

numerical ~ 2 x categorical

(1 numerical response variable, with 2 categorical explanatory variables)

Notes:

  • Can’t show error, so it better be tiny (as in here, with enormous n).
  • Which comparisons jump out is determined by the number -> color mapping, so be careful.
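
A sketch of one plot that fits these notes, assuming the slide shows a heat map of cell means: two categorical variables on the axes and the mean response as fill color. The data are simulated, and the variable names (`day`, `shift`, `rt`) are hypothetical.

```r
library(ggplot2)
set.seed(9)

# Simulated response times for every day x shift cell (made-up data).
d <- expand.grid(day   = c("Mon", "Tue", "Wed", "Thu", "Fri"),
                 shift = c("morning", "evening"),
                 rep   = 1:200)
d$rt <- rnorm(nrow(d),
              mean = 500 + 10 * as.integer(d$day) + 30 * (d$shift == "evening"),
              sd = 50)

# Collapse to one mean per cell; the tiles cannot display error bars.
cell_means <- aggregate(rt ~ day + shift, data = d, FUN = mean)

heat <- ggplot(cell_means, aes(x = day, y = shift, fill = rt)) +
  geom_tile() +
  scale_fill_viridis_c(name = "Mean RT (ms)")   # the number -> color mapping
```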

SLIDE 66

numerical ~ 2 x numerical

(1 numerical response variable, with 2 numerical explanatory variables)

Heat map or surface plot: Generally your data need to be complete, smooth, and abundant.

Bubble chart: Comparisons across dot size are not easy, so that shouldn’t be a very important variable.

SLIDE 67

2 x numerical ~ numerical

(2 numerical response variables, with 1 numerical explanatory variable)

Double-axis plot. Usually a terrible idea.

SLIDE 68

  • Visualization failure modes
  • Cool vs informative visualizations
  • Making a graph pretty
  • ggplot: grammar of graphics
  • Graphs for common types of data.
  • Practice in R.
  • More esoteric graph types / considerations