CS 109a: Data Science: Effective Exploratory Data Analysis and Visualization



slide-1
SLIDE 1

CS 109a: Data Science

Effective Exploratory Data Analysis and Visualization

Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

slide-2
SLIDE 2

Ask an interesting question.

  • What is the scientific goal?
  • What would you do if you had all the data?
  • What do you want to predict or estimate?

Get the data.

  • How were the data sampled?
  • Which data are relevant?
  • Are there privacy issues?

Explore the data.

  • Plot the data.
  • Are there anomalies?
  • Are there patterns?

Model the data.

  • Build a model.
  • Fit the model.
  • Validate the model.

Communicate and visualize the results.

  • What did we learn?
  • Do the results make sense?
  • Can we tell a story?
slide-3
SLIDE 3


VISUALIZE THE DATA

slide-4
SLIDE 4

https://www.autodeskresearch.com/publications/samestats
slide-5
SLIDE 5

https://www.autodeskresearch.com/publications/samestats
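
The linked "same stats" page makes the point that datasets can share their summary statistics and still look nothing alike once plotted. A minimal sketch of that idea follows, using two small made-up samples (not the data from the linked page) that were constructed to have the same mean and variance:

```python
import numpy as np

# Two made-up samples built to share mean and (population) variance.
# They are illustrative only and are not taken from the linked page.
clustered = np.array([1.0, 1.0, 5.0, 5.0])  # two tight clusters
spread = np.array([3 - 2 * np.sqrt(2), 3.0, 3.0, 3 + 2 * np.sqrt(2)])  # center plus two extremes

for name, sample in [("clustered", clustered), ("spread", spread)]:
    print(f"{name:>9}: mean={sample.mean():.3f}  var={sample.var():.3f}  "
          f"values={np.round(sample, 2)}")
```

The printed means and variances agree, yet a plot of either sample would show a different shape, which is the argument for plotting before summarizing.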
slide-6
SLIDE 6

Example: Antibiotics Will Burtin, 1951

slide-7
SLIDE 7

Data

slide-8
SLIDE 8

Data

Genus, Species

slide-9
SLIDE 9

Data

Genus, Species

slide-10
SLIDE 10

Data

Genus, Species +

slide-11
SLIDE 11

Data

Genus, Species

  • Min. Inhibitory Concentration [ml/g]

slide-12
SLIDE 12

What Questions?

slide-13
SLIDE 13
  • M. Bostock, Protovis

after W. Burtin, 1951

Gram Positive Gram Negative

slide-14
SLIDE 14
  • M. Bostock, Protovis

after W. Burtin, 1951

How effective are the drugs?

Gram Positive Gram Negative

slide-15
SLIDE 15
  • M. Bostock, Protovis

after W. Burtin, 1951

How effective are the drugs?

If the bacteria are Gram-positive, penicillin and neomycin are most effective.
If the bacteria are Gram-negative, neomycin is most effective.

Gram Positive Gram Negative

slide-16
SLIDE 16

Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

slide-17
SLIDE 17

Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

How do the bacteria compare?

slide-18
SLIDE 18

Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

How do the bacteria compare?

slide-19
SLIDE 19

Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

Not a streptococcus! (realized ~30 years later)
Really a streptococcus! (realized ~20 years later)

How do the bacteria compare?

slide-20
SLIDE 20

Wainer & Lysen, “That’s funny...” American Scientist, 2009

How do the bacteria compare?

slide-21
SLIDE 21

Wainer & Lysen, “That’s funny...” American Scientist, 2009

How do the bacteria compare?

slide-22
SLIDE 22

“The greatest value of a picture is when it forces us to notice what we never expected to see.” John Tukey

Exploratory Data Analysis

slide-23
SLIDE 23

Visualization Goals

Communicate (Explanatory)

  • Present data and ideas
  • Explain and inform
  • Provide evidence and support
  • Influence and persuade

Analyze (Exploratory)

  • Explore the data
  • Assess a situation
  • Determine how to proceed
  • Decide what to do

slide-24
SLIDE 24

Communicate

New York Times

slide-25
SLIDE 25

Explore

slide-26
SLIDE 26

EDA Workflow

1. Build a DataFrame from the data (ideally, put all data in this object).

2. Clean the DataFrame. It should have the following properties:

  • Each row describes a single object
  • Each column describes a property of that object
  • Columns are numeric whenever appropriate
  • Columns contain atomic properties that cannot be further decomposed

3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.

4. Explore group properties. Use groupby and small multiples to compare subsets of the data.
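
The four workflow steps can be sketched in pandas roughly as follows. The records, column names, and the compound `wt_hp` column are hypothetical and made up for illustration; only the overall sequence (build, clean, global summary, groupby) comes from the slide.

```python
import pandas as pd

# Hypothetical raw records with made-up values, standing in for a real source.
raw_records = [
    {"model": "car_a", "mpg": "30.0", "cyl": 4, "wt_hp": "2.0/90"},
    {"model": "car_b", "mpg": "28.0", "cyl": 4, "wt_hp": "2.2/95"},
    {"model": "car_c", "mpg": "20.0", "cyl": 6, "wt_hp": "3.0/120"},
    {"model": "car_d", "mpg": "22.0", "cyl": 6, "wt_hp": "3.1/115"},
]

# 1. Build a DataFrame: one row per car, one column per property.
cars = pd.DataFrame(raw_records)

# 2. Clean: make columns numeric and split the compound column into atomic ones.
cars["mpg"] = pd.to_numeric(cars["mpg"])
cars[["wt", "hp"]] = cars["wt_hp"].str.split("/", expand=True).astype(float)
cars = cars.drop(columns="wt_hp")

# 3. Global properties: summary statistics over the whole table.
global_summary = cars[["mpg", "wt", "hp"]].describe()

# 4. Group properties: compare subsets with groupby.
mpg_by_cyl = cars.groupby("cyl")["mpg"].mean()

print(global_summary)
print(mpg_by_cyl)
```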

slide-27
SLIDE 27

Viz options

  • Pandas visualization module
  • Matplotlib
  • Seaborn
  • The above three are inter-mixable
  • Be lazy (to an extent…)
  • Other options: Bokeh, Vega, Vincent, Altair

slide-28
SLIDE 28

Cars Dataset

Basic Pandas/matplotlib

slide-29
SLIDE 29

You can set limits, tick styles, and scales, and add lines, annotations, titles, and legends.

Seaborn provides a different visual style and lots of canned plots.
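
As a rough sketch of those customizations, the example below starts from a pandas plot and then adjusts the matplotlib Axes it returns. The data values and labels are made up; they stand in for two columns of the cars dataset, which is not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for two columns of a cars table.
cars = pd.DataFrame({"wt": [1.8, 2.2, 2.9, 3.4, 4.1],
                     "mpg": [33.0, 29.0, 22.0, 19.0, 14.0]})

# pandas draws onto a matplotlib Axes, which can then be customized directly.
ax = cars.plot.scatter(x="wt", y="mpg", label="cars")
ax.set_xlim(1, 5)                       # limits
ax.set_yscale("log")                    # scale
ax.set_yticks([10, 20, 40])             # tick positions
ax.axhline(cars["mpg"].mean(), linestyle="--", color="gray",
           label="mean mpg")            # reference line
ax.annotate("heaviest car", xy=(4.1, 14.0), xytext=(3.2, 11.0),
            arrowprops={"arrowstyle": "->"})  # annotation
ax.set_title("Fuel economy vs. weight (made-up data)")
ax.legend()
```

Calling `ax.figure.savefig(...)` with a file name of your choice would write the result to disk.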

slide-30
SLIDE 30

Effective Visualizations

slide-31
SLIDE 31

Not Effective...

Sources: US Treasury and WHO reports
slide-32
SLIDE 32

Effective EDA Viz

  • 1. Have graphical integrity
  • 2. Keep it simple
  • 3. Use the right display
  • 4. Use color sensibly
slide-33
SLIDE 33
  • 1. Graphical Integrity
slide-34
SLIDE 34
slide-35
SLIDE 35

Graphical Integrity

Flowing Data
slide-36
SLIDE 36

Scale Distortions

Flowing Data
slide-37
SLIDE 37
slide-38
SLIDE 38

Scale Distortions

slide-39
SLIDE 39

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

“Double the axes, double the mischief”

(Quote from Gary Smith’s Standard Deviations) http://www.thefunctionalart.com/2015/10/double-axes-double-mischief.html

Graphic from Robert Reich’s Saving Capitalism

slide-40
SLIDE 40

Be Proportional

slide-41
SLIDE 41

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-42
SLIDE 42

2012

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-43
SLIDE 43

Include Uncertainty

slide-44
SLIDE 44

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-45
SLIDE 45

What you show

8PM Saturday 8PM Sunday 8PM Monday 8PM Tuesday

Hurricane

(category 5)

CAIRO

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-46
SLIDE 46

What non-scientists are not aware of (cone is just 66% probability)

8PM Saturday 8PM Sunday 8PM Monday 8PM Tuesday

Hurricane

(category 5)

CAIRO

2/3 1/3

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-47
SLIDE 47

What we could be showing instead

Hurricane

(category 5)

CAIRO

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-48
SLIDE 48

Plot all your data

slide-49
SLIDE 49

Counties with the LOWEST kidney cancer death rates (1980-1989)
Counties with the HIGHEST kidney cancer death rates (1980-1989)

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-50
SLIDE 50

Counties with the LOWEST kidney cancer death rates (1980-1989)
Counties with the HIGHEST kidney cancer death rates (1980-1989)

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo
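
A common reading of these two maps is that small counties show up at both extremes simply because rates computed from few people are noisy. The simulation below is a sketch of that reading under stated assumptions: every county has the same underlying rate, and the populations and rate are invented, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: every county has the SAME underlying death rate and only
# the population differs. All numbers are illustrative, not real data.
true_rate = 1e-4
populations = np.array([1_000] * 500 + [1_000_000] * 500)

deaths = rng.poisson(true_rate * populations)
observed_rate = deaths / populations

order = np.argsort(observed_rate)
lowest, highest = order[:20], order[-20:]

print("median population, 20 lowest rates: ", np.median(populations[lowest]))
print("median population, 20 highest rates:", np.median(populations[highest]))
```

Plotting only the lowest or only the highest counties would hide this pattern, which is one argument for plotting all the data.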

slide-51
SLIDE 51
slide-52
SLIDE 52
  • 2. Keep It Simple
slide-53
SLIDE 53

Avoid Chartjunk

ongoing, Tim Bray

Extraneous visual elements that distract from the message

slide-54
SLIDE 54


slide-55
SLIDE 55

Don’t!

matplotlib gallery Excel Charts Blog
slide-56
SLIDE 56
  • 3. Use The Right Display

slide-57
SLIDE 57

http://extremepresentation.typepad.com/blog/files/choosing_a_good_chart.pdf

slide-58
SLIDE 58

Comparisons

slide-59
SLIDE 59

Bars vs. Lines

Zacks 1999
slide-60
SLIDE 60

Proportions

slide-61
SLIDE 61

Pie Charts

slide-62
SLIDE 62

Stacked Bar Chart

  • S. Few
slide-63
SLIDE 63

Stacked Area Chart

  • S. Few
slide-64
SLIDE 64

Correlations

slide-65
SLIDE 65

Scatterplots

http://xkcd.com/388/
slide-66
SLIDE 66

From Edward Tufte, Visual and Statistical Thinking

London Cholera Epidemic

slide-67
SLIDE 67

Don’t!

matplot3d tutorial

slide-68
SLIDE 68

Trends

slide-69
SLIDE 69

Yahoo! Finance

slide-70
SLIDE 70
slide-71
SLIDE 71

Distributions

slide-72
SLIDE 72

Histogram

ggplot2

slide-73
SLIDE 73

Bin Width

binwidth = 0.1 binwidth = 0.01

ggplot2
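
To see why the two bin widths give such different pictures, the sketch below counts the same values with both widths. The data are synthetic (the dataset behind the slide's ggplot2 figures is not given), and the helper `histogram_counts` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed values in [0, 3]; a stand-in for the plotted variable.
values = 3 * rng.beta(2, 5, size=1000)

def histogram_counts(values, binwidth, lo=0.0, hi=3.0):
    """Count values in equal-width bins of the requested width over [lo, hi]."""
    n_bins = int(round((hi - lo) / binwidth))
    counts, edges = np.histogram(values, bins=n_bins, range=(lo, hi))
    return counts, edges

coarse_counts, _ = histogram_counts(values, binwidth=0.1)
fine_counts, _ = histogram_counts(values, binwidth=0.01)

print("binwidth=0.1 :", len(coarse_counts), "bins, tallest bin =", coarse_counts.max())
print("binwidth=0.01:", len(fine_counts), "bins, tallest bin =", fine_counts.max())
```

Wider bins smooth the shape but can hide fine structure; narrower bins reveal detail but each bin holds few points, so the outline gets noisier.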

slide-74
SLIDE 74

Density Plots

slide-75
SLIDE 75

https://www.autodeskresearch.com/publications/samestats

slide-76
SLIDE 76

https://www.autodeskresearch.com/publications/samestats

slide-77
SLIDE 77

https://www.autodeskresearch.com/publications/samestats

slide-78
SLIDE 78

https://www.autodeskresearch.com/publications/samestats

slide-79
SLIDE 79

GROUP getting complex…

slide-80
SLIDE 80

Faceting and Small Multiples

Use seaborn or multiple plots in matplotlib
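
A minimal sketch of the "multiple plots in matplotlib" route is shown below, with one panel per group and shared axes so the panels can be compared directly. The three groups are made-up random data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Made-up groups: the same two variables measured in three subsets.
groups = {
    name: rng.normal(loc=center, scale=1.0, size=(50, 2))
    for name, center in [("group A", 0.0), ("group B", 2.0), ("group C", 4.0)]
}

# One panel per group; shared axes keep the panels directly comparable.
fig, axes = plt.subplots(1, len(groups), figsize=(9, 3), sharex=True, sharey=True)
for ax, (name, points) in zip(axes, groups.items()):
    ax.scatter(points[:, 0], points[:, 1], s=10)
    ax.set_title(name)
fig.tight_layout()
```

Seaborn's `FacetGrid` offers a higher-level way to build the same kind of layout from a DataFrame column.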

slide-81
SLIDE 81

Small multiples

slide-82
SLIDE 82

SPLOM

slide-83
SLIDE 83

Design Exercise

Hands-On Exercise

slide-84
SLIDE 84

How do you feel about doing science?

Interest             Before (%)   After (%)
Excited                  19          38
Kind of interested       25          30
OK                       40          14
Not great                 5           6
Bored                    11          12

Data courtesy of Cole Nussbaumer
slide-85
SLIDE 85

Come up with multiple visualizations. Pen and Paper Only.

slide-86
SLIDE 86

Pie Side by side bar

slide-87
SLIDE 87

Stacked bar, not very useful Data Transposed Bar Chart

slide-88
SLIDE 88

Difference Bar Chart

slide-89
SLIDE 89

Slopegraph

slide-90
SLIDE 90

After the pilot program, 68% of kids expressed interest towards science, compared to 44% going into the program.
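
The two headline numbers can be recovered from the survey table on the "How do you feel about doing science?" slide. The sketch below assumes that "expressed interest" means the top two categories, which is the reading that reproduces 68% and 44%.

```python
# Survey percentages from the "How do you feel about doing science?" table.
before = {"Bored": 11, "Not great": 5, "OK": 40, "Kind of interested": 25, "Excited": 19}
after = {"Bored": 12, "Not great": 6, "OK": 14, "Kind of interested": 30, "Excited": 38}

# Assumption: "expressed interest" means the top two categories.
interested = ["Kind of interested", "Excited"]
before_interested = sum(before[k] for k in interested)
after_interested = sum(after[k] for k in interested)

# Per-category change, the quantity a difference bar chart or slopegraph shows.
change = {k: after[k] - before[k] for k in before}

print(f"interested before: {before_interested}%, after: {after_interested}%")
print(change)
```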

slide-91
SLIDE 91

Perceptual Effectiveness

slide-92
SLIDE 92

  • Stevens’ Power Law, 1961
  • J. Bertin, 1967
  • Cleveland / McGill, 1984
  • J. Mackinlay, 1986
  • Heer / Bostock, 2010
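
Stevens' power law models perceived magnitude as a power of the physical stimulus, so a true ratio between two stimuli is perceived as that ratio raised to an exponent. The sketch below illustrates the relationship; the exponents are illustrative assumptions chosen for the example, not values given in the slides.

```python
def perceived_ratio(true_ratio: float, exponent: float) -> float:
    """Perceived ratio under Stevens' power law, psi = k * I**exponent.

    The constant k cancels when two stimuli are compared.
    """
    return true_ratio ** exponent

# Exponents below are illustrative assumptions, not values from the slides.
ASSUMED_EXPONENTS = {"length": 1.0, "area": 0.7}

print(perceived_ratio(4, ASSUMED_EXPONENTS["length"]))  # a bar that is 4x longer
print(perceived_ratio(10, ASSUMED_EXPONENTS["area"]))   # a shape with 10x the area
```

With an exponent below 1, large differences are underestimated, which is one reason length tends to be a safer encoding than area.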
slide-93
SLIDE 93

How much longer?

A B

slide-94
SLIDE 94

How much longer?

A B

4x

slide-95
SLIDE 95

How much steeper slope?

A B

slide-96
SLIDE 96

How much steeper slope?

A B

4x

slide-97
SLIDE 97

How much larger area?

A B

slide-98
SLIDE 98

How much larger area?

A B

10x

slide-99
SLIDE 99

How much darker?

A B

slide-100
SLIDE 100

How much darker?

A B

2x

slide-101
SLIDE 101

How much bigger value?

A = 2, B = 16

slide-102
SLIDE 102

How much bigger value?

A = 2, B = 16

8x

slide-103
SLIDE 103

Most Efficient Least Efficient

  • C. Mulbrandon
VisualizingEconomics.com
slide-104
SLIDE 104

Most Efficient Least Efficient

  • C. Mulbrandon
VisualizingEconomics.com

Quantitative Ordered Categories


slide-105
SLIDE 105

Most Effective

VisualizingEconomics.com

slide-106
SLIDE 106

Less Effective

VisualizingEconomics.com

slide-107
SLIDE 107

Pie vs. Bar Charts

slide-108
SLIDE 108

Cliff Mass

Least Effective

slide-109
SLIDE 109

2012

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

slide-110
SLIDE 110

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

Data visualization and visual encoding

Figures represented in all these graphics: 22%, 25%, 34%, 29%, 32%

Panels (each encoding values A to E): Length or height, Position, Angle/area, Line weight, Hue and shade, Area
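
As a rough sketch of two of these encodings, the example below draws the five listed figures once as bar lengths and once as pie wedges. The assignment of the values to the labels A to E in this order is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E"]
values = [22, 25, 34, 29, 32]  # percentages listed on the slide; label order assumed

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(8, 3))

# Length encoding: bars start at zero so length stays proportional to value.
bars = ax_bar.bar(labels, values)
ax_bar.set_ylim(0, 40)
ax_bar.set_title("Length or height")

# Angle/area encoding: the same values as wedges (pie() rescales them to sum to 1).
ax_pie.pie(values, labels=labels)
ax_pie.set_title("Angle/area")
fig.tight_layout()
```

Values as close as 29 and 32 are easy to rank from the bars and much harder to rank from the wedges.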

slide-111
SLIDE 111
  • 4. Use Color Strategically

slide-112
SLIDE 112

Colors for Categories

Do not use more than 5-8 colors at once

Ware, “Information Visualization”

slide-113
SLIDE 113

Colors for Ordinal Data

Zeileis et al., 2009, “Escaping RGBland: Selecting Colors for Statistical Graphics”

Vary luminance and saturation

slide-114
SLIDE 114

Colors for Quantitative Data

Rogowitz and Treinish, Why should engineers and scientists be worried about color?

Hue (Rainbow) Luminance Luminance & Hue

slide-115
SLIDE 115

Rainbow Colormap

slide-116
SLIDE 116

Rainbow Colormap

  • R. Simmon

Perceptually nonlinear

slide-117
SLIDE 117

Avoid Rainbow Colors!

matplotlib gallery
slide-118
SLIDE 118

Gray

slide-119
SLIDE 119

Color Blindness

Deuteranope Protanope Tritanope

Based on slide from Stone

Red / green deficiencies Blue / Yellow deficiency

slide-120
SLIDE 120

Color Blindness

Normal Protanope Deuteranope Lightness

Based on slide from Stone
slide-121
SLIDE 121

Viridis
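
One way to see the difference between a rainbow colormap and viridis is to check whether lightness changes steadily along the map. The sketch below uses matplotlib's `jet` as the rainbow example (an assumption, since the slides do not name the colormap) and a crude weighted-RGB proxy for lightness rather than a proper perceptual color model.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def approx_lightness(cmap_name: str, n: int = 11) -> np.ndarray:
    """Rough lightness along a colormap: weighted sum of RGB (Rec. 709 weights)."""
    rgba = plt.get_cmap(cmap_name)(np.linspace(0.0, 1.0, n))
    return rgba[:, :3] @ np.array([0.2126, 0.7152, 0.0722])

def is_monotonic(x: np.ndarray) -> bool:
    """True if the sequence only increases or only decreases."""
    steps = np.diff(x)
    return bool(np.all(steps > 0) or np.all(steps < 0))

for name in ["viridis", "jet"]:
    print(name, "lightness monotonic:", is_monotonic(approx_lightness(name)))
```

A map whose lightness rises and then falls can make mid-range values look more prominent than the extremes, and it loses its ordering when printed in gray.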

slide-122
SLIDE 122

Color Brewer

Cynthia Brewer, Color Use Guidelines for Data Representation

Nominal Ordinal

slide-123
SLIDE 123
slide-124
SLIDE 124

Diverging Palette for Quantitative or Ordinal
Sequential Palette for Densities

slide-125
SLIDE 125

Effective Visualizations

  • 1. Have graphical integrity
  • 2. Keep it simple
  • 3. Use the right display
  • 4. Use color strategically
slide-126
SLIDE 126

Further Reading

slide-127
SLIDE 127

Edward Tufte

slide-128
SLIDE 128

Stephen Few

slide-129
SLIDE 129

I’ve always believed in the power of data visualization (the representation of information by means of charts, diagrams, maps, etc.) to enable understanding

2012 2016

Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo