Data Viz
April 2, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter
Announcements
Videos on if you can!
Use raise-hand feature for questions.
Any questions/concerns
Reduction, Classification, Regularization)
Hypothesis: CS students sleep less than Brown students in general
Viz #1: Quick side-by-side histogram of CS students’ sleep
Means + CIs
Run linear regression, control for various things, find large coefficient on whether student has two concentrations
Viz #2: Quick histograms (or box-whiskers maybe) of hours of sleep vs. number of concentrations
Viz #3: Quick histogram of number
CS vs. non-CS students
Viz #4: Final polished visualizations for poster/paper/report
Viz #1: Quick side-by-side histogram of CS students’ sleep
Means + CIs
Run linear regression, control for various things, find large coefficient on whether student has two concentrations
Viz #ia: Quick histograms (or box-whiskers maybe) of hours of sleep vs. number of concentrations
Viz #ib: Quick histogram of number
CS vs. non-CS students
Viz #N+1: Final polished visualizations for poster/paper/report
while not converged
trends I am seeing
results
More important (matplotlib, Excel, whatever is easy)
Most attention, because it's fun ;) (D3, etc.)
You are the main audience. Goal is to make sure you understand what you are looking at.
Everyone else is the main audience. Goal is to make your point as clearly and concisely as possible.
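As a rough illustration of Viz #1 in the sleep example (side-by-side histograms with means and CIs), here is a minimal matplotlib sketch. The function names, the normal-approximation interval (z = 1.96), the bin count, and the axis labels are all assumptions made for this example; none of them come from the slides.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% CI."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, z * sem


def plot_sleep_histograms(cs_hours, all_hours):
    """Draw two histograms on shared axes, each marked with its mean and CI."""
    # Imported here so mean_ci can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(8, 3))
    groups = [("CS students", cs_hours), ("All students", all_hours)]
    for ax, (label, hours) in zip(axes, groups):
        mean, half = mean_ci(hours)
        ax.hist(hours, bins=12)
        ax.axvline(mean, color="black")  # group mean
        ax.axvspan(mean - half, mean + half, alpha=0.3, color="gray")  # CI band
        ax.set_title(f"{label}: mean {mean:.1f} ± {half:.1f} h")
        ax.set_xlabel("Hours of sleep per night")
    axes[0].set_ylabel("Number of students")
    fig.tight_layout()
    return fig
```

Calling `plot_sleep_histograms(cs_hours, all_hours)` with two lists of hours returns a figure that can be shown or saved. Sharing both axes keeps the two groups directly comparable, which is the point of a quick first look.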
Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort.
Don’t obfuscate the data or Hide the prOcess you used to come to your coNclusions. GivE people enough data So that They can disagree with You if they want to.
Minimalism: Substance over style. Make your point concisely, without redundant or distracting information or ornamentation.
Ellie rants about culture for 2 seconds. Indulge me….
Edward Tufte—dogma of data viz
Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort.
[Figure: Learning curve. y-axis: Classification Accuracy (%); x-axis: Training Size]
[Figure: Frequency vs. Age for Population 1 and Population 2, first with Frequency and then with Log Frequency on the y-axis]
Sometimes can use logs (but say you did so…)
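One way to follow the "say you did so" advice is to put the log scale into the axis label itself. The sketch below assumes ages are binned by a fixed width; the function names, the bin width, and the label wording are assumptions made for this example.

```python
from collections import Counter


def age_histogram(ages, bin_width=20):
    """Count ages per bin; each key is the lower edge of its bin."""
    return Counter((age // bin_width) * bin_width for age in ages)


def plot_log_histogram(populations, bin_width=20):
    """Plot binned counts for each population on a log-scaled y-axis."""
    # Imported here so age_histogram can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, ages in populations.items():
        counts = age_histogram(ages, bin_width)
        edges = sorted(counts)
        ax.plot(edges, [counts[e] for e in edges], marker="o", label=name)
    ax.set_yscale("log")
    ax.set_xlabel("Age (lower edge of bin)")
    ax.set_ylabel("Frequency (log scale)")  # state the transformation explicitly
    ax.legend()
    return fig
```

Here `populations` is assumed to be a dict mapping a population name to a list of ages.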
[Figure: Frequency vs. Age, first with age ticks up to 100, then restricted to ages up to 20]
Sometimes can remove outliers (but say you did so…)
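A small sketch of removing outliers while reporting it on the chart itself. The cutoff values, helper names, and title wording are assumptions made for this example; the right cutoffs depend on the data.

```python
def drop_outliers(values, lower, upper):
    """Keep values within [lower, upper]; also return how many were dropped."""
    kept = [v for v in values if lower <= v <= upper]
    return kept, len(values) - len(kept)


def outlier_note(n_removed, lower, upper):
    """Build a short disclosure string for the chart title or caption."""
    return f"{n_removed} values outside [{lower}, {upper}] removed"


def plot_trimmed_histogram(ages, lower=0, upper=20):
    """Histogram of the trimmed data, with the trimming stated in the title."""
    # Imported here so the helpers above can be used without matplotlib.
    import matplotlib.pyplot as plt

    kept, n_removed = drop_outliers(ages, lower, upper)
    fig, ax = plt.subplots()
    ax.hist(kept, bins=16)
    ax.set_xlabel("Age")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Age distribution ({outlier_note(n_removed, lower, upper)})")
    return fig
```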
[Figure: Frequency vs. Age, then split into separate panels for ages 4 to 20 and ages 90 to 100]
Sometimes better to analyze separately. (Look at your data!)
[Figure: a single chart, then five separate charts with the same axis ranges]
Sometimes better to split into multiple charts…
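Splitting into multiple charts might look like the following sketch, which draws one small panel per group on shared axes so the panels stay comparable. The `(group, x, y)` record format and the function names are assumptions made for this example.

```python
from collections import defaultdict


def group_points(records):
    """Group (group_name, x, y) records into {group_name: [(x, y), ...]}."""
    groups = defaultdict(list)
    for name, x, y in records:
        groups[name].append((x, y))
    return dict(groups)


def plot_small_multiples(records):
    """Draw one line chart per group, side by side, with shared axes."""
    # Imported here so group_points can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    groups = group_points(records)
    fig, axes = plt.subplots(
        1, len(groups), sharex=True, sharey=True,
        figsize=(3 * len(groups), 3), squeeze=False,
    )
    for ax, (name, points) in zip(axes[0], sorted(groups.items())):
        xs, ys = zip(*sorted(points))
        ax.plot(xs, ys)
        ax.set_title(name)
    fig.tight_layout()
    return fig
```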
[Figure: Company Earnings by Year (in millions), 2012 to 2017; value labels 2.3, 2.1, 2.0, 2.1, 1.7, 1.3]
Not really interpretable as “parts of a whole”…
[Figure: Company Earnings by Year (in millions), replotted with years 2012 to 2017 along the axis]
[Figure: Earnings Gap in Canada is Smaller. Earnings for US and Canada, College vs. No College; then Earnings Gap plotted directly for US and Canada]
[Figure: States I have lived in. Years for Michigan, Maryland, Pennsylvania, New York, Rhode Island]
Don’t obfuscate the data or Hide the prOcess you used to come to your coNclusions. GivE people enough data So that They can disagree with You if they want to.
[Figure: Average Performance (over 5-fold validation) for Fancy Model1, Fancy Model2, Fancy Model3, first with y-axis ticks from 76 to 86, then with ticks up to 100 and Random Guessing added]
Help calibrate how easy/hard the problem is, what types of numbers to expect a priori…
[Figure: Average Performance (over 5-fold validation) for Baseline, Old Model, New Model]
Sometimes can include error bars/confidence intervals…
Even better, just show all the data alongside the summary stats.
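The points above (a chance-level reference, error bars, and the raw per-fold scores next to the summary) can be combined in one chart. This is a sketch under assumptions: the input is a dict of per-fold accuracies in percent, the error bars are one sample standard deviation, and all names are hypothetical.

```python
import statistics


def summarize(fold_scores):
    """Map each model name to (mean, sample standard deviation) over folds."""
    return {
        name: (statistics.mean(scores), statistics.stdev(scores))
        for name, scores in fold_scores.items()
    }


def plot_model_comparison(fold_scores, chance_level):
    """Bars with error bars, raw fold scores as points, and a chance line."""
    # Imported here so summarize can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    summary = summarize(fold_scores)
    names = list(fold_scores)
    positions = list(range(len(names)))
    fig, ax = plt.subplots()
    ax.bar(
        positions,
        [summary[n][0] for n in names],
        yerr=[summary[n][1] for n in names],
        capsize=4,
        color="lightgray",
    )
    for i, name in enumerate(names):
        scores = fold_scores[name]
        ax.scatter([i] * len(scores), scores, color="black", zorder=3)
    ax.axhline(chance_level, linestyle="--", color="black", label="Random guessing")
    ax.set_xticks(positions)
    ax.set_xticklabels(names)
    ax.set_ylim(0, 100)  # full range an accuracy in percent can take
    ax.set_ylabel("Accuracy (%)")
    ax.set_title("Average performance (5-fold validation)")
    ax.legend()
    return fig
```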
[Figure: Population1 vs. Population2]
Be careful about showing smoothed/estimated/aggregated trends only
Whenever possible, show underlying data
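A minimal sketch of showing a fitted trend together with the points it was estimated from. The ordinary least-squares line is just one possible choice of trend; the function names and input format are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def plot_trend_with_data(populations):
    """Scatter the raw points for each population and overlay its fitted line."""
    # Imported here so fit_line can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, (xs, ys) in populations.items():
        slope, intercept = fit_line(xs, ys)
        points = ax.scatter(xs, ys, alpha=0.5, label=name)  # underlying data
        ends = [min(xs), max(xs)]
        ax.plot(ends, [slope * x + intercept for x in ends],
                color=points.get_facecolor()[0])  # trend in the same color
    ax.legend()
    return fig
```

Here `populations` is assumed to map a name to a pair of lists `(xs, ys)`. Keeping the points visible lets a reader judge how well the line actually describes them.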
[Figure: Percent Accuracy for Baseline, Old Model, New Model, first with y-axis ticks from 80.1 to 80.9, then with ticks from 25 to 100]
Rescale to a meaningful range (use full range expected)
And/or include error bars
Minimalism: Substance over style. Make your point concisely, without redundant or distracting information or ornamentation.
[Figure: F1 Score for Model1, Model2, Model3]
Don’t use colors/decorations unless they add new information
Just look how pretty that is
[Figure: Model Performance (F1 Score) for Model 1, Model 2, Model 3]
Just…don’t
[Figure: Company Earnings by Year (in millions), 2012 to 2017]
Type of Hypothesis | First plot I’d make
Group A differs from Group B according to metric C | side-by-side histograms, with means and CIs
X affects Y | scatter plot, with correlation
Prediction Tasks, Recommendations | dim. reduction of feature matrix to 2D, then scatter and color by label/group
Any of the above | correlation matrices between all features
Any of the above | counts of all features (broken down by groups/labels if relevant)
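For the "correlation matrices between all features" row, a first-pass heatmap might be sketched as below. The input is assumed to be a dict mapping feature names to equal-length lists of numbers; the function names and the color map are choices made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def correlation_matrix(features):
    """Return (names, matrix) with matrix[i][j] = corr(names[i], names[j])."""
    names = list(features)
    matrix = [[pearson(features[a], features[b]) for b in names] for a in names]
    return names, matrix


def plot_correlation_matrix(features):
    """Heatmap of pairwise correlations on a fixed -1 to 1 color scale."""
    # Imported here so the helpers above can be used without matplotlib.
    import matplotlib.pyplot as plt

    names, matrix = correlation_matrix(features)
    fig, ax = plt.subplots()
    image = ax.imshow(matrix, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=45, ha="right")
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    return fig
```

Fixing the color scale to the full range from -1 to 1 keeps weak correlations from looking strong just because they are the largest in the matrix.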
labels on points
control
streamlines process for making complex charts (e.g. large grids/side-by-sides) but harder to tweak little things
use this for messy scatter plots)
for doing your homeworks)