Perry Watts, Stakana Analytics, Elkins Park, PA; Nate Derby, Stakana Analytics, Seattle, WA



SLIDE 1

Perry Watts, Stakana Analytics, Elkins Park, PA
Nate Derby, Stakana Analytics, Seattle, WA

SLIDE 2

The Challenge

An Effective Graph

Is one that reveals "patterns, differences and uncertainty" in the underlying data.

But

What if your data map to crowded displays with overlapping points, lines, or other obstructions that interfere with pattern detection?

Our Examples are Challenging

Framingham Heart Study: Overlapping points (n=5,209)
Airlines Data: Many overlapping lines (n=6,100)
Barley Data: Unreadable response axis (n=120)
Stock Data: Untraceable interleaving lines (n=699)

SLIDE 3

Our Approach

Incremental: Go from preliminary graphs that are less than optimal to output that conveys its message more effectively.

Along the Way:

Point out problems and issues.
Solutions offered take advantage of new features in ODS statistical graphics and the insights of William S. Cleveland.
Show why GTL must be used instead of a more convenient SG PROC to produce the graph you are looking at.
We don't spend a lot of time on SAS code, however. Our goal is to define graphics problems and show how to solve them.

SLIDE 4

Framingham Heart Study (sashelp.heart)

SAS Sample #35172 deals with dense data by using 95% transparency in the scatter plot, stretching the graph out, and including marginal histograms.

SLIDE 5

Framingham Heart Study (sashelp.heart)

Code Outline for SAS Sample #35172

    PROC TEMPLATE;
      DEFINE STATGRAPH scatterhist;
        BEGINGRAPH / DESIGNWIDTH=600px DESIGNHEIGHT=400px;
          ENTRYTITLE "Two Continuous Variables";
          LAYOUT LATTICE / ROWS=2 COLUMNS=2;
            LAYOUT OVERLAY; HISTOGRAM Xvar; ENDLAYOUT;
            LAYOUT OVERLAY; ENTRY 'NOBS: ' ...; ENDLAYOUT;
            LAYOUT OVERLAY; SCATTERPLOT Y=Yvar X=Xvar; ENDLAYOUT;
            LAYOUT OVERLAY; HISTOGRAM Yvar; ENDLAYOUT;
          ENDLAYOUT; /*LATTICE*/
        ENDGRAPH; /*END GRAPH BLOCK*/
      END; /*END DEFINE BLOCK*/
    RUN;
    PROC SGRENDER DATA=sashelp.heart TEMPLATE=scatterhist;
    RUN;

SLIDE 6

Framingham Heart Study (sashelp.heart)

Why PROC SGPANEL Doesn't Work

    LAYOUT LATTICE / ROWS=2 COLUMNS=2
        ROWWEIGHTS=(.2 .8) COLUMNWEIGHTS=(.8 .2);

Panels must have equal dimensions in PROC SGPANEL.

SLIDE 7

Framingham Heart Study (sashelp.heart)

What's missing from the definition for NOBS?

    LAYOUT OVERLAY / BORDER=true;
      ENTRY 'NOBS: ' EVAL(N(xvar)) / ...;
    ENDLAYOUT;

In a scatter plot each point references an X and a Y coordinate. (Neither can be missing.)

SLIDE 8

Framingham Heart Study (sashelp.heart)

Changing the code gives the right answer

    LAYOUT OVERLAY / BORDER=true;
      ENTRY 'NOBS: ' EVAL(N(xvar + yvar)) / ...;
    ENDLAYOUT;

The '+' operator works because it returns a missing value whenever XVAR or YVAR (or both) is missing. (The SUM function won't work, because it ignores missing arguments.)
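The difference between '+' and SUM can be checked outside the template with a short DATA step. This is only a sketch: it assumes Height and Weight are the two plotted variables, and the counter names n_plus and n_sum are made up for illustration.

```sas
data _null_;
  set sashelp.heart(keep=height weight) end=last;
  /* height + weight is missing when either value is missing,
     so this counts only complete (X, Y) pairs */
  n_plus + (not missing(height + weight));
  /* SUM ignores missing arguments, so one valid value is enough
     and incomplete pairs are counted too */
  n_sum + (not missing(sum(height, weight)));
  if last then put n_plus= n_sum=;
run;
```

If any observation has exactly one of the two values missing, n_sum will be larger than n_plus, which is the overcount the slide warns about.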

SLIDE 9

Framingham Heart Study (sashelp.heart)

ODS Statistical Graphics Axis Format

From William S. Cleveland: "make the data rectangle slightly smaller than the scale-line rectangle".

Graph annotations: 11%, 14.5%

SLIDE 10

Framingham Heart Study (sashelp.heart)

Conventional SAS/GRAPH Axis Format

Data points can't appear above the axis maximum tick value.

Graph annotations: 11%, 14.5%

SLIDE 11

Framingham Heart Study (sashelp.heart)

The Revised Graph: Histogram Fixes

The graph is squared off to eliminate bin-width distortion due to stretching.
Marginal histogram bin heights are now comparable, because VIEWMAX is set to 15%.
Borders are removed to make marginal histogram bin ranges more visible.
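The histogram fixes, the unequal cell weights, and the NOBS expression from the earlier slides might be combined roughly as follows. This is a sketch and not the authors' template: the template name, the 500px design size, the title, and the hard-coded Height and Weight columns are all assumptions.

```sas
proc template;
  define statgraph scatterhist_sketch;   /* hypothetical template name */
    /* equal width and height square off the graph */
    begingraph / designwidth=500px designheight=500px;
      entrytitle "Weight by Height";
      layout lattice / rows=2 columns=2
                       rowweights=(.2 .8) columnweights=(.8 .2);
        /* top left: histogram of X, response axis capped at 15% */
        layout overlay / yaxisopts=(linearopts=(viewmin=0 viewmax=15));
          histogram height / scale=percent;
        endlayout;
        /* top right: count of complete (X, Y) pairs */
        layout overlay;
          entry 'NOBS: ' eval(n(height + weight));
        endlayout;
        /* bottom left: scatter plot with 95% transparency */
        layout overlay;
          scatterplot x=height y=weight / datatransparency=0.95;
        endlayout;
        /* bottom right: horizontal histogram of Y, same 15% cap */
        layout overlay / xaxisopts=(linearopts=(viewmin=0 viewmax=15));
          histogram weight / scale=percent orient=horizontal;
        endlayout;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=sashelp.heart template=scatterhist_sketch;
run;
```

Using the same VIEWMAX on both response axes is what makes bar heights comparable between the two marginal histograms.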

SLIDE 12

Framingham Heart Study (sashelp.heart)

We Still Have a Problem with the Scatter Plot

SLIDE 13

Framingham Heart Study (sashelp.heart)

Try Rounding related to Cleveland's Jittering

Jittering adds "random noise" to each point for a slight separation.
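For comparison with rounding, jittering itself can be sketched in a DATA step. The seed, the noise widths (0.5 for Height, 2 for Weight), and the names heart_jitter, height_j, and weight_j are assumptions chosen for illustration; they do not come from the presentation.

```sas
data heart_jitter;
  set sashelp.heart(keep=height weight);
  /* fix the random stream once so the jitter is reproducible */
  if _n_ = 1 then call streaminit(12345);
  /* add uniform noise centered on zero to each coordinate */
  height_j = height + (rand('uniform') - 0.5) * 0.5;
  weight_j = weight + (rand('uniform') - 0.5) * 2;
run;
```

Plotting height_j against weight_j separates tied points slightly, at the cost of displaying values that are no longer the recorded ones.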

SLIDE 14

Framingham Heart Study (sashelp.heart)

Try Rounding related to Cleveland's Jittering

SLIDE 15

Framingham Heart Study (sashelp.heart)

Rounding for a 3rd Dimension based on Frequency

SLIDE 16

Framingham Heart Study (sashelp.heart)

Rounding for a 3rd Dimension based on Frequency

SQUAREFILLED markers in the scatter plot line up better with histogram bins.
The legend makes the graph less square. Compensate by labeling histogram axes tick marks.
With solid color plotting symbols, it is easier to line up histogram end bins with the blue data outliers.
Continuous legends are only available in GTL.
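One way to build a rounded, frequency-colored scatter plot of this kind is sketched below. The rounding units (1 for Height, 5 for Weight) and the names rounded, tied, height_r, weight_r, and roundedfreq are assumptions; the presentation does not state which values the authors used.

```sas
/* Round X and Y so that nearby points become exact ties */
data rounded;
  set sashelp.heart(keep=height weight);
  where not missing(height) and not missing(weight);
  height_r = round(height, 1);   /* assumed rounding unit */
  weight_r = round(weight, 5);   /* assumed rounding unit */
run;

/* COUNT in the output data set is the number of ties per rounded point */
proc freq data=rounded noprint;
  tables height_r*weight_r / out=tied;
run;

proc template;
  define statgraph roundedfreq;   /* hypothetical template name */
    begingraph;
      layout overlay;
        scatterplot x=height_r y=weight_r / name="sc"
          markercolorgradient=count
          markerattrs=(symbol=squarefilled);
        continuouslegend "sc" / title="Frequency";
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=tied template=roundedfreq;
run;
```

The CONTINUOUSLEGEND statement is the part that requires GTL, which matches the last point on the slide.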

SLIDE 17

Framingham Heart Study (sashelp.heart)

Create a Digitized Contour Plot with PROC KDE

Switch from raw data manipulation ("rounding") to statistical estimation where cell color is based on probability.

SLIDE 18

Framingham Heart Study (sashelp.heart)

A Rounded vs. Digitized KDE Contour Plot

Rounded: An adjusted raw data set is plotted. X and Y data values are "rounded". Z, rendered by color, is the count of tied observations at a given (rounded) point.

Digitized KDE: Output from PROC KDE is plotted. The plotting region is divided into a 60X60 grid of cells in X and Y variable units (3,600 obs). Z equals DENSITY, not frequency. COUNT, another variable, sums to 5,199.

SLIDE 19

Framingham Heart Study (sashelp.heart)

Generating the Digitized Plot from PROC KDE

    proc kde data=sashelp.heart;
      bivar Height Weight / PLOTS=NONE out=KDEGridded;
    run;
    proc sgrender template=xTmp data=KDEGridded(where=(count>0));
    run;
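The xTmp template is not shown on the slide. A minimal stand-in, assuming the OUT= data set from the BIVAR statement holds the grid coordinates in VALUE1 and VALUE2 alongside DENSITY and COUNT, could look like the sketch below. The template name kde_sketch and the axis labels are hypothetical.

```sas
proc template;
  define statgraph kde_sketch;   /* hypothetical stand-in for xTmp */
    begingraph;
      layout overlay / xaxisopts=(label="Height")
                       yaxisopts=(label="Weight");
        /* one filled square per grid cell, colored by estimated density */
        scatterplot x=value1 y=value2 / name="kde"
          markercolorgradient=density
          markerattrs=(symbol=squarefilled);
        continuouslegend "kde" / title="Density";
      endlayout;
    endgraph;
  end;
run;

proc kde data=sashelp.heart;
  bivar height weight / plots=none out=KDEGridded;
run;

/* keep only grid cells that contain at least one observation */
proc sgrender template=kde_sketch data=KDEGridded(where=(count>0));
run;
```

Filtering on COUNT>0 drops empty grid cells, so the colored region follows the shape of the data rather than filling the whole 60X60 rectangle.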

SLIDE 20

Framingham Heart Study (sashelp.heart)

Add the BMI to the Digitized Contour Plot

Complete source code can be found in the ZIP file referenced in the paper.

SLIDE 21

Airlines Data

A Progression of Time Series Plots

This is a progression of 100 series plots of flights, where each flight has a unique departure date. The X axis is the number of days before departure that a flight is booked. The Y axis is the cumulative number of bookings. Each flight accommodates 180 passengers.

SLIDE 22

Airlines Data

A Progression of Time Series Plots

Is there a Relationship between Days Before Departure and Departure Date?

SLIDE 23

Airlines Data

A Progression of Time Series Plots

Add a Color Dimension to see the Connection between Days Before Departure and Departure Dates

SLIDE 24

Airlines Data

A Progression of Time Series Plots

What's Different?

Time series plots should cumulate left to right. That means the X-axis needs to be reversed.
An inset replaces the legend, because the legend points to the group variable, Departure Date (100 values), not Date Range (6 values).
Inset text maps colors to plot lines. No legend line-to-line mapping is needed.

SLIDE 25

Airlines Data

A Progression of Time Series Plots

    LAYOUT OVERLAY / ... Xaxisopts=(... reverse=true);
      %do i = 1 %to 6;
        SERIESPLOT X=x&i Y=y&i / GROUP=ddate LINEATTRS=(COLOR=&&color&i);
      %end;
      LAYOUT GRIDDED / COLUMNS=1 ...;
        %do j = 1 %to 6;
          ENTRY TEXTATTRS=(WEIGHT=bold COLOR=&&color&j) "&&Range&j";
        %end;
      ENDLAYOUT; /*gridded*/
    ENDLAYOUT; /*overlay*/

SLIDE 26

Airlines Data

Using LAYOUT DATAPANEL

SLIDE 27

Airlines Data

Using LAYOUT DATAPANEL

    LAYOUT DATAPANEL classvars=(ByDdateLbl) /
        headerlabelattrs=(weight=bold ...)
        headerbackgroundcolor=CXBCB9E5
        columndatarange=union
        columnaxisopts=(... REVERSE=TRUE)
        rowaxisopts=( ... );
      layout prototype /...;
        seriesplot x=DaysLeft y=bookings / group=ddate ...;
      endlayout; /*prototype*/
    endlayout; /*dataPanel*/

SLIDE 28

Barley Data

Working with "Multi-Way" Dot Plots

William S. Cleveland, who introduced the dot plot, recommends it as a replacement for the horizontal bar chart.
The barley data "multi-way" dot plot is famous.
R.A. Fisher used the data to illustrate his ANOVA method of experimental design.
Years later, Cleveland discovered the data error that ANOVA missed.

SLIDE 29

The Barley Data "Multi-Way" Dot Plot

The Data Error

1931 and 1932 YIELDS are reversed at the MORRIS site.

SLIDE 30

The Barley Data "Multi-Way" Dot Plot

From the DOT Statement in PROC SGPANEL

Cleveland supplied the data on STATLIB.

    proc sgpanel data=barley;
      title1 "Canadian Barley Production";
      panelby site;
      dot variety / response=yield group=year;
    run;

SLIDE 31

The Barley Data "Multi-Way" Dot Plot

From the DOT Statement in PROC SGPANEL

Sites are in Minnesota.
Can't see patterns. SITE and VARIETY are ordered alphabetically, not by median.
To re-order, switch from DOT in SGPANEL to SCATTERPLOT in GTL.
Plot a 6X1 paneled graph to see the connection between SITE and YIELD.
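The re-ordering step can be prepared in the data before any GTL is written. The sketch below assumes the barley data set and its SITE, VARIETY, and YIELD variables as used in the PROC SGPANEL call; the output names variety_med, site_med, and barley_ordered are hypothetical, and this is not necessarily how the authors ordered their graph.

```sas
/* Median yield for each variety */
proc means data=barley noprint nway;
  class variety;
  var yield;
  output out=variety_med(keep=variety variety_median) median=variety_median;
run;

/* Median yield for each site */
proc means data=barley noprint nway;
  class site;
  var yield;
  output out=site_med(keep=site site_median) median=site_median;
run;

/* Attach both medians to every row and sort by them */
proc sql;
  create table barley_ordered as
  select b.*, v.variety_median, s.site_median
  from barley as b, variety_med as v, site_med as s
  where b.variety = v.variety and b.site = s.site
  order by s.site_median, v.variety_median;
quit;
```

A discrete axis that follows data order would then list sites and varieties by median rather than alphabetically. Whether data order is the default depends on the SAS release, so treat that as an assumption to verify.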

SLIDE 32

The Barley Data "Multi-Way" Dot Plot

From the SCATTERPLOT Statement in GTL

SLIDE 33

The Barley Data "Multi-Way" Dot Plot

GTL graph compared with Cleveland's graph from The Elements of Graphing Data

SLIDE 34

Stock Data (sashelp.stocks)

Working with Interleaving Time Series Plots

Stock trends are difficult to see in this graph. Overlaid dashed lines are difficult to track.

SLIDE 35

Stock Data (sashelp.stocks)

Working with Interleaving Time Series Plots

Use the Default Style.
Replace dashed lines with solid ones.
Use different line widths.
Use anti-aliasing to improve resolution.

SLIDE 36

Stock Data (sashelp.stocks)

Working with Interleaving Time Series Plots

Naomi Robbins says to place plot lines into separate panels to increase visibility. However, lines are then harder to compare.

SLIDE 37

Stock Data (sashelp.stocks)

Working with Interleaving Time Series Plots

Display stocks two at a time to increase comparability.

SLIDE 38

Stock Data (sashelp.stocks)

Working with Interleaving Time Series Plots

Add Band Plots for emphasis.
Schwartz says the bands represent the Area Under the Curve (AUC).
LIMITLOWER for both curves is set to $0, but this creates unwanted overlay (see arrows).

SLIDE 39

Stock Data (sashelp.stocks)

Working with Interleaving Time Series Plots

With "interleaved" band plots, the area between the curves (ABC) is emphasized.
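An area-between-curves band can be sketched by putting two stocks side by side and using one series as LIMITLOWER and the other as LIMITUPPER, instead of setting LIMITLOWER to 0. The choice of IBM and Microsoft, the use of CLOSE, the transparency value, and the names stocks_sorted, wide, and bandbetween are assumptions for illustration only.

```sas
/* One row per date, one column per stock (IBM, Intel, Microsoft) */
proc sort data=sashelp.stocks out=stocks_sorted;
  by date stock;
run;

proc transpose data=stocks_sorted out=wide;
  by date;
  id stock;
  var close;
run;

proc template;
  define statgraph bandbetween;   /* hypothetical template name */
    begingraph;
      layout overlay / yaxisopts=(label="Close");
        /* band spans the gap between the two curves, not down to $0 */
        bandplot x=date limitlower=Microsoft limitupper=IBM /
          datatransparency=0.6;
        seriesplot x=date y=IBM / name="ibm" legendlabel="IBM";
        seriesplot x=date y=Microsoft / name="ms" legendlabel="Microsoft";
        discretelegend "ibm" "ms";
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=wide template=bandbetween;
run;
```

Because the band is bounded by the two series, it cannot overlap the area under either curve the way two zero-based bands do.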

SLIDE 40

Summary

Heart Data: Overlapping Points (n=5,209)

Before and After (1): Color provides a 3rd Dimension for Frequency

SLIDE 41

Summary

Heart Data: Overlapping Points (n=5,209)

Before and After (2): Color provides a 3rd Dimension for Density

SLIDE 42

Summary

Airlines Data: Overlapping Lines (n=6,100)

Before and After (1): Color provides a 4th Dimension for the Departure Date Range

SLIDE 43

Summary

Airlines Data: Overlapping Lines (n=6,100)

Before and After (2): DATAPANEL now provides the 4th Dimension

SLIDE 44

Summary

Barley Data: Overlapping tick labels (n=120)

Before and After

Did not work in SAS 9.2

SLIDE 45

Summary

Stock Data: Interleaving lines (n=699)

Before and After

SLIDE 46

Contact Information

Perry Watts
Stakana Analytics
pwatts@stakana.com
www.PerryWatts.org

Nate Derby
Stakana Analytics
nderby@stakana.com
www.NDerby.org