Data Viz April 2, 2020 Data Science CSCI 1951A Brown University - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Viz

April 2, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter

1

slide-2
SLIDE 2

Announcements

  • Videos on if you can! Use raise-hand feature for questions.

  • Any questions/concerns logistically?
  • Extra Office Hours tomorrow

2

slide-3
SLIDE 3

Today

  • Questions from previous lectures? (Dimensionality Reduction, Classification, Regularization)

  • Data Viz tips and best practices

3

slide-4
SLIDE 4

When do I do data viz during a project?

4

slide-10
SLIDE 10

When do I do data viz during a project?

10

Hypothesis: CS students sleep less than Brown students in general

Viz #1: Quick side-by-side histogram of CS students’ sleep vs. the rest. Means + CIs

Run linear regression, control for various things, find large coefficient on whether student has two concentrations

Viz #2: Quick histograms (or box-whiskers maybe) of hours of sleep vs. number of concentrations

Viz #3: Quick histogram of number of concentrations for CS vs. non-CS students

Viz #4: Final polished visualizations for poster/paper/report

slide-11
SLIDE 11

When do I do data viz during a project?

11

Hypothesis: CS students sleep less than Brown students in general

Viz #1: Quick side-by-side histogram of CS students’ sleep vs. the rest. Means + CIs

while not converged:

    Run linear regression, control for various things, find large coefficient on whether student has two concentrations

    Viz #ia: Quick histograms (or box-whiskers maybe) of hours of sleep vs. number of concentrations

    Viz #ib: Quick histogram of number of concentrations for CS vs. non-CS students

Viz #N+1: Final polished visualizations for poster/paper/report

slide-12
SLIDE 12

When do I do data viz during a project?

  • At the very start of analysis, to find out wth is going on in my data
  • Periodically throughout, to vet the quantitative trends I am seeing
  • At the very end of a project, to showcase the results

12

slide-13
SLIDE 13

When do I do data viz during a project?

  • At the very start of analysis, to find out wth is going on in my data
  • Periodically throughout, to vet the quantitative trends I am seeing
  • At the very end of a project, to showcase the results

13

More important (matplotlib, excel, whatever is easy)

slide-14
SLIDE 14

When do I do data viz during a project?

  • At the very start of analysis, to find out wth is going on in my data
  • Periodically throughout, to vet the quantitative trends I am seeing
  • At the very end of a project, to showcase the results

14

Most attention, ’cause it’s fun ;) (D3, etc.)

slide-15
SLIDE 15

When do I do data viz during a project?

  • At the very start of analysis, to find out wth is going on in my data
  • Periodically throughout, to vet the quantitative trends I am seeing
  • At the very end of a project, to showcase the results

15

You are the main audience, goal is to make sure you understand what you are looking at

slide-16
SLIDE 16

When do I do data viz during a project?

  • At the very start of analysis, to find out wth is going on in my data
  • Periodically throughout, to vet the quantitative trends I am seeing
  • At the very end of a project, to showcase the results

16

Everyone else is the main audience. Goal is to make point as clearly and concisely as possible.

slide-17
SLIDE 17

So many bad figures…

17

Diane Neil Maggie

slide-18
SLIDE 18

My “three pillars”* of Data Viz

*:)

18

slide-19
SLIDE 19

My “three pillars” of Data Viz

— Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort

19

slide-20
SLIDE 20

My “three pillars” of Data Viz

— Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort

Don’t obfuscate the data or Hide the prOcess you used to come to your coNclusions. GivE people enough data So that They can disagree with You if they want to.

20

slide-21
SLIDE 21

My “three pillars” of Data Viz

— Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort

Don’t obfuscate the data or Hide the prOcess you used to come to your coNclusions. GivE people enough data So that They can disagree with You if they want to.

Minimalism — Substance over style. Make your point concisely, without redundant or distracting information or ornamentation.

21

slide-22
SLIDE 22

“form follows function”

Ellie rants about culture for 2 seconds. Indulge me….

22

slide-23
SLIDE 23

Great tangent to go on…

Edward Tufte—dogma of data viz

23

slide-24
SLIDE 24

My “three pillars” of Data Viz

— Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort

24

slide-25
SLIDE 25

Missing or Cryptic Labels

[Figure: chart titled "Learning curve"; neither axis is labeled]

25

slide-26
SLIDE 26

Missing or Cryptic Labels

[Figure: chart titled "Learning curve"; y-axis: Classification Accuracy (%); x-axis: Training Size]

26

slide-27
SLIDE 27

Skewed or Crunched Data

[Figure: Frequency (y-axis, ticks up to 10000) vs. Age (x-axis, 20 to 100); series: Population 1, Population 2]

27

slide-28
SLIDE 28

Skewed or Crunched Data

[Figure: Log Frequency (y-axis, ticks up to 10) vs. Age (x-axis, 20 to 100); series: Population 1, Population 2]

Sometimes can use logs (but say you did so…)
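A log-scaled frequency axis like the one on this slide could be sketched in matplotlib as follows. The two populations are synthetic and every size and parameter is made up for illustration; the point is only the log scale plus an axis label that says so.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic ages, made up for illustration: one large group, one small group.
pop1 = rng.normal(40, 12, size=10_000).clip(0, 100)
pop2 = rng.normal(70, 8, size=300).clip(0, 100)

bins = np.arange(0, 105, 5)  # 5-year age bins from 0 to 100
fig, ax = plt.subplots()
ax.hist([pop1, pop2], bins=bins, label=["Population 1", "Population 2"])
ax.set_yscale("log")  # keeps the small group from being crunched at the bottom
ax.set_xlabel("Age")
ax.set_ylabel("Frequency (log scale)")  # say that you used logs
ax.legend()
```

Call `plt.show()` or `fig.savefig(...)` to view the result.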

28

slide-29
SLIDE 29

Skewed or Crunched Data

[Figure: Frequency (y-axis, ticks up to 90) vs. Age (x-axis, 20 to 100)]

29

slide-30
SLIDE 30

Skewed or Crunched Data

[Figure: Frequency (y-axis, ticks up to 40) vs. Age (x-axis, 4 to 20)]

Sometimes can remove outliers (but say you did so…)
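One way to remove outliers while still saying you did so is to put the count of dropped values in the title. This is a rough sketch with made-up ages; the `cutoff` value is a hypothetical threshold that you would choose by looking at your own data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 200 ages between 4 and 20, plus three very large values.
ages = np.concatenate([rng.integers(4, 21, size=200), np.array([91, 94, 97])])

cutoff = 25  # hypothetical threshold, picked by eye for this toy data
keep = ages <= cutoff
n_removed = int((~keep).sum())

fig, ax = plt.subplots()
ax.hist(ages[keep], bins=np.arange(4, 22))  # one bin per year of age
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")
# State the filtering on the figure itself so readers can disagree with it.
ax.set_title(f"Age distribution ({n_removed} values above {cutoff} removed)")
```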

30

slide-31
SLIDE 31

Skewed or Crunched Data

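[Figure: Frequency (y-axis, ticks up to 90) vs. Age (x-axis, 20 to 100)]

Sometimes better to analyze separately. (Look at your data!)

[Figure: two separate panels, one with x-axis ticks 4 to 20 and one with x-axis ticks 90 to 100, each with y-axis ticks up to 40]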

31

slide-32
SLIDE 32

Skewed or Crunched Data

[Figure: a single chart with unlabeled axes]

32

slide-33
SLIDE 33

Skewed or Crunched Data

[Figure: five panels with unlabeled axes, all sharing the same tick ranges]

Sometimes better to split into multiple charts…
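Splitting one crowded chart into several panels might look like the sketch below. The five series are synthetic random walks invented for illustration; `sharex` and `sharey` keep the panels on the same scale so they stay comparable.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(20, 101, 5)
# Five synthetic series (random walks), made up for illustration.
series = 50 + rng.normal(0, 5, size=(5, x.size)).cumsum(axis=1)

# One panel per series instead of five overlapping lines on one chart.
fig, axes = plt.subplots(1, 5, figsize=(15, 3), sharex=True, sharey=True)
for i, ax in enumerate(axes):
    ax.plot(x, series[i])
    ax.set_title(f"Series {i + 1}")
    ax.set_xlabel("x")
axes[0].set_ylabel("y")
```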

33

slide-34
SLIDE 34

Chart/Data Type Mismatch

[Figure: "Company Earnings by Year (in millions)"; categories 2012 to 2017; data labels 2.3, 2.1, 2.0, 2.1, 1.7, 1.3]

34

slide-35
SLIDE 35

Chart/Data Type Mismatch

[Figure: "Company Earnings by Year (in millions)"; categories 2012 to 2017; data labels 2.3, 2.1, 2.0, 2.1, 1.7, 1.3]

Not really interpretable as “parts of a whole”…

35

slide-36
SLIDE 36

Chart/Data Type Mismatch

[Figure: "Company Earnings by Year (in millions)"; y-axis ticks 1.3 to 2.3; x-axis: 2012 to 2017]

36

slide-37
SLIDE 37

Chart/Data Type Mismatch

37

[Figure: "Earnings Gap in Canada is Smaller"; y-axis: Earnings; groups: US, Canada; series: College, No College]

slide-38
SLIDE 38

Chart/Data Type Mismatch

38

[Figure: "Earnings Gap in Canada is Smaller"; y-axis: Earnings Gap; categories: US, Canada]
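If the claim is about the gap, one option is to compute the gap and plot that quantity directly. The sketch below uses made-up earnings figures; only the category and series names come from the slides.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt

# Made-up earnings, for illustration only.
earnings = {
    "US": {"College": 68.0, "No College": 52.0},
    "Canada": {"College": 60.0, "No College": 53.0},
}
# Plot the quantity the title talks about instead of making readers subtract.
gap = {country: v["College"] - v["No College"] for country, v in earnings.items()}

fig, ax = plt.subplots()
ax.bar(list(gap), list(gap.values()))
ax.set_ylabel("Earnings Gap (College minus No College)")
ax.set_title("Earnings Gap in Canada is Smaller")
```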

slide-39
SLIDE 39

Clicker Question! What is the biggest problem with this?

[Figure: chart titled "States I have lived in"; y-axis: Years; categories: Michigan, Maryland, Pennsylvania, New York, Rhode Island]

(a) Crunched/Skewed Data
(b) Missing/Cryptic Labels
(c) Chart/Data Type Mismatch
(d) It’s just ugly

39

slide-42
SLIDE 42

My “three pillars” of Data Viz

Don’t obfuscate the data or Hide the prOcess you used to come to your coNclusions. GivE people enough data So that They can disagree with You if they want to.

42

slide-43
SLIDE 43

No Point of Comparison

[Figure: "Average Performance (over 5-fold validation)"; y-axis ticks 76 to 86; categories: Fancy Model1, Fancy Model2, Fancy Model3]

43

slide-44
SLIDE 44

No Point of Comparison

[Figure: "Average Performance (over 5-fold validation)"; y-axis ticks 25 to 100; categories: Fancy Model1, Fancy Model2, Fancy Model3, Random Guessing]

Help calibrate how easy/hard the problem is, what types of numbers to expect a priori…
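Adding a point of comparison can be as simple as one extra bar. The sketch below assumes a balanced 4-class problem, so random guessing scores 100 / 4 = 25%; the class count and the model scores are hypothetical and not taken from the slides.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt

# Made-up accuracies for illustration.
scores = {"Fancy Model1": 80.2, "Fancy Model2": 82.5, "Fancy Model3": 84.1}
n_classes = 4                # hypothetical: balanced 4-class classification
chance = 100.0 / n_classes   # expected accuracy of uniform random guessing

names = list(scores) + ["Random Guessing"]
values = list(scores.values()) + [chance]

fig, ax = plt.subplots()
ax.bar(names, values)
ax.set_ylim(0, 100)  # full accuracy range, so the baseline is visible
ax.set_ylabel("Accuracy (%)")
ax.set_title("Average Performance (over 5-fold validation)")
```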

44

slide-45
SLIDE 45

Summary Stats Only

[Figure: "Average Performance (over 5-fold validation)"; categories: Baseline, Old Model, New Model]

45

slide-46
SLIDE 46

Summary Stats Only

[Figure: "Average Performance (over 5-fold validation)"; categories: Baseline, Old Model, New Model]

Sometimes can include error bars/confidence intervals…

46

slide-47
SLIDE 47

Summary Stats Only

[Figure: "Average Performance (over 5-fold validation)"; categories: Baseline, Old Model, New Model]

Even better, just show all the data alongside the summary stats.
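Showing the data alongside the summary might look like this: a bar at the mean with a 95% confidence interval, plus one dot per fold. The per-fold scores are made up, and the interval uses a t critical value hard-coded for 5 folds (4 degrees of freedom), which is an assumption about the setup.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up per-fold accuracies (5-fold validation), for illustration only.
fold_scores = {
    "Baseline": [48, 52, 50, 47, 53],
    "Old Model": [60, 66, 58, 71, 63],
    "New Model": [62, 75, 59, 70, 68],
}
T_CRIT = 2.776  # two-sided 95% t critical value for 4 degrees of freedom

means, half_widths = {}, {}
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(fold_scores.items()):
    vals = np.asarray(vals, dtype=float)
    means[name] = vals.mean()
    half_widths[name] = T_CRIT * vals.std(ddof=1) / np.sqrt(vals.size)
    # Summary stat: bar at the mean with a 95% confidence interval.
    ax.bar(i, means[name], yerr=half_widths[name], capsize=4, color="lightgray")
    # Underlying data: one dot per fold, drawn on top of the bar.
    ax.scatter(np.full(vals.size, i), vals, color="black", zorder=3)
ax.set_xticks(range(len(fold_scores)))
ax.set_xticklabels(list(fold_scores))
ax.set_ylabel("Accuracy (%)")
ax.set_title("Average Performance (over 5-fold validation)")
```

With only five points per model, the dots make it easy to see when two intervals overlap because of a single unusual fold.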

47

slide-48
SLIDE 48

Summary Stats Only

[Figure: chart with unlabeled axes; series: Population1, Population2]

48

slide-49
SLIDE 49

Summary Stats Only

Be careful about showing smoothed/ estimated/aggregated trends only

[Figure: chart with unlabeled axes; series: Population1, Population2]

49

slide-50
SLIDE 50

Summary Stats Only

[Figure: chart with unlabeled axes; series: Population1, Population2]

Whenever possible, show underlying data
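To show the underlying data next to a smoothed trend, one option is to draw the raw points faintly and overlay the smoothed line. This sketch uses a single synthetic series and a simple moving average; the window size is an arbitrary choice, and the slide's two populations are not reproduced.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
# Synthetic noisy observations, made up for illustration.
x = np.linspace(0, 20, 80)
y = 3 + 0.4 * x + rng.normal(0, 2, size=x.size)

window = 9  # arbitrary odd window size for the moving average
kernel = np.ones(window) / window
smooth = np.convolve(y, kernel, mode="valid")   # only fully covered windows
x_smooth = x[window // 2 : -(window // 2)]      # x at the centre of each window

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.4, label="observations")  # the underlying data
ax.plot(x_smooth, smooth, color="black", label=f"{window}-point moving average")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```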

50

slide-51
SLIDE 51

Misleading/Badly Scaled Axes

[Figure: y-axis: Percent Accuracy, ticks 80.1 to 80.9; categories: Baseline, Old Model, New Model]

51

slide-52
SLIDE 52

Misleading/Badly Scaled Axes

[Figure: y-axis: Percent Accuracy, ticks 25 to 100; categories: Baseline, Old Model, New Model]

Rescale to a meaningful range (use full range of values that are interesting/expected)
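The effect of the axis range can be seen by drawing the same numbers twice. In this sketch the accuracies are made up; the left panel zooms into a 0.8-point window and the right panel uses the full 0 to 100 range.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt

# Made-up accuracies that differ by well under one point.
acc = {"Baseline": 80.2, "Old Model": 80.5, "New Model": 80.8}
names, values = list(acc), list(acc.values())

fig, (ax_zoom, ax_full) = plt.subplots(1, 2, figsize=(9, 3))

ax_zoom.bar(names, values)
ax_zoom.set_ylim(80.1, 80.9)  # tiny window: differences look huge
ax_zoom.set_title("Zoomed axis (misleading)")

ax_full.bar(names, values)
ax_full.set_ylim(0, 100)      # full range of possible accuracies
ax_full.set_title("Full range")

for ax in (ax_zoom, ax_full):
    ax.set_ylabel("Percent Accuracy")
```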

52

slide-53
SLIDE 53

Misleading/Badly Scaled Axes

[Figure: y-axis: Percent Accuracy, ticks 25 to 100; categories: Baseline, Old Model, New Model]

And/or include error bars

53

slide-54
SLIDE 54

My “three pillars” of Data Viz

Minimalism — Substance over style. Make your point concisely, without redundant or distracting information or ornamentation.

54

slide-55
SLIDE 55

Redundancy

[Figure: y-axis: F1 Score; categories: Model1, Model2, Model3]

55

slide-56
SLIDE 56

Redundancy

[Figure: y-axis: F1 Score; categories: Model1, Model2, Model3]

Don’t use colors/decorations unless they add new information

56

slide-57
SLIDE 57

Redundancy

[Figure: y-axis: F1 Score; categories: Model1, Model2, Model3]

Just look how pretty that is

57

slide-58
SLIDE 58

Superfluousness

[Figure: "Model Performance (F1 Score)"; categories: Model 1, Model 2, Model 3]

58

slide-59
SLIDE 59

Superfluousness

[Figure: "Model Performance (F1 Score)"; categories: Model 1, Model 2, Model 3]

Don’t use colors/decorations unless they add new information

59

slide-60
SLIDE 60

Superfluousness

[Figure: "Model Performance (F1 Score)"; y-axis: F1 Score; categories: Model1, Model2, Model3]

Just look how pretty that is

60

slide-61
SLIDE 61

Why is that chart 3D?

61

slide-62
SLIDE 62

Why is that chart 3D?

Just…don’t

62

slide-63
SLIDE 63

Why is that chart 3D?

Just look how pretty that is

[Figure: "Company Earnings by Year (in millions)"; y-axis ticks up to 2.3; x-axis: 2012 to 2017]

63

slide-64
SLIDE 64

…😒

64

slide-65
SLIDE 65

Tips for quick-and-helpful data viz

65

slide-66
SLIDE 66

Tips for quick-and-helpful data viz

66

Type of Hypothesis | First plot I’d make
Group A differs from Group B according to metric C | side-by-side histograms, with means and CIs
X affects Y | scatter plot, with correlation
Prediction Tasks, Recommendations | dim. reduction of feature matrix to 2D, then scatter and color by label/group
Any of the above | correlation matrices between all features
Any of the above | counts of all features (broken down by groups/labels if relevant)
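For the "Group A differs from Group B" case, a first plot might look like the sketch below. The sleep data is synthetic, and `mean_ci` is a hypothetical helper that uses a normal approximation (z = 1.96) for the 95% confidence interval, which is only reasonable for fairly large samples.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

def mean_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    half = z * values.std(ddof=1) / np.sqrt(values.size)
    return mean, mean - half, mean + half

rng = np.random.default_rng(4)
# Synthetic hours of sleep, made up for illustration.
groups = [("CS", rng.normal(6.5, 1.2, 300)), ("Non-CS", rng.normal(7.2, 1.1, 300))]

bins = np.arange(2, 12.5, 0.5)  # same bins for both groups
fig, axes = plt.subplots(1, 2, figsize=(9, 3), sharex=True, sharey=True)
for ax, (label, hours) in zip(axes, groups):
    mean, lo, hi = mean_ci(hours)
    ax.hist(hours, bins=bins)
    ax.axvline(mean, color="black")             # group mean
    ax.axvspan(lo, hi, color="gray", alpha=0.3)  # 95% CI around the mean
    ax.set_title(f"{label}: mean {mean:.2f} h (95% CI {lo:.2f} to {hi:.2f})")
    ax.set_xlabel("Hours of sleep")
axes[0].set_ylabel("Number of students")
```

Sharing the bins and both axes keeps the two histograms directly comparable.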

slide-67
SLIDE 67

Tips for quick-and-helpful data viz

67

  • Histograms:
      • Play with bin size/normalization until you can see clearly
      • Sometimes I use box plots if the variation is low (but always overlay the points themselves)
  • Scatters:
      • Apply jitter or transparency to scatter plots so you can see overlapping points
      • Add labels (or use plotly for interaction) so you can see labels on points
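The jitter and transparency tip can be sketched as follows. The data is synthetic and integer-valued so that many points overlap exactly; `jitter` is a hypothetical helper, and its 0.2 width is an arbitrary choice. The correlation is computed on the original values, not the jittered ones.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
# Made-up integer data: many points land on exactly the same coordinates.
x = rng.integers(1, 4, size=400)    # e.g. number of concentrations (1 to 3)
y = rng.integers(4, 11, size=400)   # e.g. hours of sleep (4 to 10)

def jitter(a, width=0.2):
    """Add small uniform noise so overlapping points separate visually."""
    return a + rng.uniform(-width, width, size=a.shape)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation on the un-jittered data
xj, yj = jitter(x), jitter(y)

fig, ax = plt.subplots()
ax.scatter(xj, yj, alpha=0.3, s=15)  # transparency shows dense regions
ax.set_xlabel("Number of concentrations (jittered)")
ax.set_ylabel("Hours of sleep (jittered)")
ax.set_title(f"Pearson r = {r:.2f}")
```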

slide-68
SLIDE 68

Tips for quick-and-helpful data viz

68

  • Matplotlib: https://matplotlib.org/ my <3, because I am old-school. Not super streamlined but does give you a lot of control
  • Seaborn: https://seaborn.pydata.org/ plays well with numpy, streamlines process for making complex charts (e.g. large grids/side-by-sides) but harder to tweak little things
  • Plotly: https://plotly.com/ good for quick interactive charts (I use this for messy scatter plots)
  • D3: https://d3js.org/ good for making very flashy plots (and for doing your homeworks)

slide-69
SLIDE 69

Clicker Question! (a) A (b) B (c) C (d) F

69

Look at this chart I made!!

slide-74
SLIDE 74

Clicker Question! (a) A (b) B (c) C (d) F

A watercolor painting celebrating that event hangs today in the Chenango Museum in Norwich. The canal itself was also utilized for recreation. In the summer months it supported swimming, boating and fishing. In the winter months, after the surface froze over, ice skating and even horse racing became favorite pastimes. Before the Chenango Canal was built, much of the Southern Tier and Central New York was still considered to be frontier. In the summer months it supported swimming, picnicking and fishing.

74

Look at this chart I made!!