Tables Zero-dimensional data with position R.W. Oldford University - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Tables

Zero-dimensional data with position R.W. Oldford

University of Waterloo

slide-2
SLIDE 2

Magnitude

Visual representations - labels

Recall from last day, the ordering of the elementary tasks from most accurate to least accurate:

1. Position along a common scale
2. Position on identical but nonaligned scales
3. Lengths (N.B. line segments were all oriented horizontally or vertically, though nonaligned)
4. Angle, slope (not close to 0, π/2, or π radians)
5. Area
6. Volume
7. Colour density, colour saturation
8. Colour hue

Missing from the list is the possibility of using “labels” which, provided there were not too many, were observed to work well with categorical data, e.g. “forty-two” or 42.

Because it is simply read (by the trained person), using the number itself as a “label” would probably have the most accurate (and fastest) decoding!

slide-3
SLIDE 3

Magnitude

Visual representations - numbers as labels

Which is faster to decode? To compare values?

Word numbers? Or symbolic numbers?

slide-4
SLIDE 4

Magnitude

Visual representations - numbers as labels

Which is easier to compare values?

left centre right decimal aligned rounded

slide-5
SLIDE 5

Magnitude

Visual representations - choosing positions

Which is easier to compare magnitudes?

unordered ascending descending

slide-6
SLIDE 6

Magnitude

Visual representations - choosing positions

Which is easier to compare magnitudes?

unordered ascending descending

slide-7
SLIDE 7

Magnitude

Visual representations - choosing positions

Horizontal versus vertical for comparing magnitudes?

slide-8
SLIDE 8

Modern numerals

A surprisingly recent innovation

◮ Early European forms
◮ Roman numerals indicate century.
◮ from G.F. Hill’s The Development of Arabic Numerals in Europe (1915), p. 28, as recorded in Cajori (1928), p. 49.

slide-9
SLIDE 9

Modern numerals

Important Characteristics

Characteristics permit ease of visual reasoning within the written system – hand calculation. E.g. consider

Multiplication Long division

At least for ‘natural’ everyday numbers.

slide-10
SLIDE 10

Modern numerals

Important Characteristics

Calculating-Table

by Gregor Reisch (c. 1467 - 1525)

A woodcut illustration from Reisch’s Margarita Philosophica (1503) featuring Arithmetica instructing an “algorist” (Boethius) on the left and an “abacist” (Pythagoras) on the right.

These are two different types of arithmetic (Typus Arithmeticae).

Algorism: “technique of performing basic arithmetic by writing numbers in place value form and applying a set of memorized rules and facts to the digits.” (Wikipedia)

slide-11
SLIDE 11

Modern numerals

Search for an ostensive definition

◮ from Cajori (1928), p. 65.
◮ Earnest but fanciful hypotheses, post-hoc rationalizing.
◮ “They serve merely as entertaining illustrations of the operation of a pseudo-scientific imagination, uncontrolled by all the known facts.” Cajori (1928), p. 68.
◮ they are not themselves an ostensive definition, or pictorial form
◮ rather they are a learned symbolic abstraction
◮ must be distinguishable, easily learned, and easily constructed by pen and paper

slide-12
SLIDE 12

Modern numerals

Even positional representation is a surprisingly recent innovation

◮ The nine numerals are enhanced with powers of 10 by surrounding them with that number of zeros.
◮ From Christoff Rudolff (Augsburg, 1574?) Künstliche Rechnung mit der Ziffer as taken from Cajori (1928), p. 56.

slide-13
SLIDE 13

Modern numerals

Cuneiform and positional representation

One of the earliest written number systems was that of cuneiform, used by the Babylonians (circa 2500 BCE).

A wedge-shaped (cuneus in Latin) tipped reed is pressed into wet clay and a single stroke indicates a single unit, the first number 1. (No zero.)

A simple visual representation with a standardized layout of the symbols: 1 2 3 4 5 6 7 8 9

Needs a compression for larger numbers: 10 20 30 40 50. Ten is like two hands pressed together. Note standardized layout. Ends at 50.

10 11 12 · · · 19

Which is nice up until 59. After that positional representation is used! E.g. 70 =
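The switch to positional representation after 59 can be illustrated with a small sketch. It assumes a plain base-60 place-value reading, in which 70 is one sixty followed by one ten; the function name `to_base` is hypothetical and only the arithmetic is shown, not the cuneiform symbols themselves.

```python
def to_base(n, base=60):
    """Return the positional digits of a non-negative integer, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        # Peel off the least significant base-60 "digit" each pass.
        n, remainder = divmod(n, base)
        digits.append(remainder)
        if n == 0:
            break
    return digits[::-1]


print(to_base(59))  # [59]: still a single group of symbols
print(to_base(70))  # [1, 10]: one sixty, then ten units
```

Each "digit" here runs from 0 to 59, so a reader must still decode up to 59 within a position, which is where the standardized layout of tens and units helps.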

slide-14
SLIDE 14

Modern numerals

Important Characteristics

◮ fixed and small base
    ◮ few characters to learn.
    ◮ Other bases, e.g. 12, might be even better. Babylonians (cuneiform) used base 60.
◮ positional
    ◮ character encodes value
    ◮ position encodes magnitude (increasing right to left)
◮ standardized size and layout
    ◮ align in columns
    ◮ sequences separate in groups of 3

Allows them to be used dynamically as visual aids to reasoning.

Visually executed algorithms rely on position (e.g. multiplication, long division, . . . )

Natural grouping by position

Several pass sorting (by first digit, then second, . . . )
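A minimal sketch of sorting in several passes by digit position, assuming non-negative integers padded to a common width: group by the first (leftmost) digit, then sort within each group by the second digit, and so on. The function names are hypothetical; this is one way to mechanize what the eye does with aligned columns, not a method taken from the slides.

```python
def digit_at(n, pos, width):
    """Digit of n at position pos (0 = leftmost) after zero-padding to width."""
    return int(str(n).zfill(width)[pos])


def sort_by_digits(numbers, pos=0, width=None):
    """Sort non-negative integers by first digit, then second, and so on."""
    numbers = list(numbers)
    if width is None:
        width = len(str(max(numbers))) if numbers else 0
    if len(numbers) <= 1 or pos >= width:
        return numbers
    # One bucket per possible digit value at this position.
    buckets = [[] for _ in range(10)]
    for n in numbers:
        buckets[digit_at(n, pos, width)].append(n)
    ordered = []
    for bucket in buckets:
        # Within a bucket, the next pass looks one position to the right.
        ordered.extend(sort_by_digits(bucket, pos + 1, width))
    return ordered


values = [98, 48, 75, 50, 92, 42, 75, 57, 101, 50, 100, 80]
print(sort_by_digits(values))
```

The zero-padding plays the role of column alignment: without it, the "first digit" of 98 and of 101 would not refer to the same place value.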

slide-15
SLIDE 15

Tables of numbers

A Babylonian innovation

(a) Balance sheet (b) Pythagorean triplets

◮ Rows (sometimes unaligned) and columns, gridlines, indexing, headings.
◮ For reference. Static.

slide-16
SLIDE 16

Modern Tables

Another layout of digits.

Tables are:

◮ for the record
◮ visual aids to reasoning.

The first is more prevalent, the second more important. Like other layouts of numbers, tables should take advantage of the visual characteristics of the digits they display.

slide-17
SLIDE 17

Modern Tables

Example: Acidity of Ontario Lakes

Background: One of the most pressing environmental problems facing large areas of North America is acidic precipitation. All of southern Ontario receives a steady bombardment of acids, acid forming gases and associated pollutants. The acids come down with rain, snow, fog, and small particles in the air. Man’s activities are responsible for the large majority of these acids. Smelters and coal-fired electric generating stations both in Canada and the United States spew millions of tonnes of sulphur dioxide into the atmosphere annually. Cars, trucks, and trains contribute more millions of tonnes of nitrogen dioxides. These gases react with sunlight, oxygen, ozone, water and other gases to form sulphuric and nitric acid – strong, corrosive acids.

In unpolluted areas, rain and snow are naturally slightly acidic since carbon dioxide, which is a natural component of the atmosphere, dissolves in water to form weak carbonic acid. Water quality of lakes has developed in response to weathering processes induced by this weak acid. Rocks and minerals react with carbonic acid to form bicarbonate, which is found in natural waters everywhere. The complex biological communities in lakes, streams and forests have adapted and evolved in equilibrium with these natural conditions and processes. However, acid rain has seriously disturbed this equilibrium.

There are over 250,000 lakes in Ontario. Thousands have been affected by acid rain, many of them in the Muskoka-Haliburton region, where there is a substantial cottage and tourist industry.

slide-18
SLIDE 18

Acidity of Ontario Lakes

Ontario Government Publication: “Acid Sensitivity Survey of Lakes in Ontario – 1989”.

slide-19
SLIDE 19

Acidity of Ontario Lakes

Rows ordered alphabetically by “County or District”.

Lots of redundant information.

Difficult to see patterns, if any.

Ontario Government Publication: “Acid Sensitivity Survey of Lakes in Ontario – 1989”.

slide-20
SLIDE 20

Acidity of Ontario Lakes

Arrange by region, remove uninteresting redundancy.

slide-21
SLIDE 21

Acidity of Ontario Lakes

Arrange by acidity.

slide-22
SLIDE 22

Acidity of Ontario Lakes

Arrange by acidity, remove further redundancy, annotate.

slide-23
SLIDE 23

Modern Tables

Conveying information visually

Modern tables:

◮ need no longer be for the record (databases are)
◮ should be displayed as visual aids to reasoning.

Like other layouts of numbers, tables should take advantage of the visual characteristics of the digits they display.

◮ Take advantage of rows.
◮ Align digits in columns.
◮ Show important individual numbers.
◮ Use white space to separate groups of numbers.
◮ Think hard about the information to be communicated.

slide-24
SLIDE 24

Tables

Analysis

Consider the following table of “Sales data”:

TABLE 1.1 Data in Four Areas and Eight Three-Month Periods in 1969-1970.

     13-15   16-18   19-21   22-24   25-27   28-30   31-33   34-36
A    97.62   92.24  100.90   90.39   95.69   94.44   91.13   97.81
B    48.29   42.31   49.98   39.09   46.38   49.74   41.74   37.39
C    75.23   75.16  100.11   74.23   74.23   76.97   71.66   76.47
D    49.69   57.21   80.19   51.09   52.88   49.41   59.32   52.56

Analysis:

◮ Goal is to see some patterns in the data.
◮ Develop a summary description (“model”) for the pattern.
◮ Assess the agreement of the pattern with the data.

Source: A.S.C. Ehrenberg (1975) Data Reduction: Analysing and Interpreting Statistical Data.

slide-25
SLIDE 25

Analysis of table data

Step 1

Separate row and column headings

     13-15   16-18   19-21   22-24   25-27   28-30   31-33   34-36

A    97.62   92.24  100.90   90.39   95.69   94.44   91.13   97.81
B    48.29   42.31   49.98   39.09   46.38   49.74   41.74   37.39
C    75.23   75.16  100.11   74.23   74.23   76.97   71.66   76.47
D    49.69   57.21   80.19   51.09   52.88   49.41   59.32   52.56

TABLE 1.1 Data in Four Areas and Eight Three-Month Periods in 1969-1970.

Lines and space.

Next: Assign meaningful labels. Columns are 3 month periods numbered from 1968.

slide-26
SLIDE 26

Analysis of table data

Step 2

Meaningful labels. Separate years.

               Quarters (1969)                   Quarters (1970)
Area         1       2       3       4 |       1       2       3       4
North    97.62   92.24  100.90   90.39 |   95.69   94.44   91.13   97.81
South    48.29   42.31   49.98   39.09 |   46.38   49.74   41.74   37.39
East     75.23   75.16  100.11   74.23 |   74.23   76.97   71.66   76.47
West     49.69   57.21   80.19   51.09 |   52.88   49.41   59.32   52.56

Gridlines added to separate years and define table. Table title unnecessary (a different one with different information might be added).

slide-27
SLIDE 27

Analysis of table data

Step 3

Focus on 1969. Reduce to significant digits.

               Quarters (1969)
Area         1       2       3       4
North    97.62   92.24  100.90   90.39
South    48.29   42.31   49.98   39.09
East     75.23   75.16  100.11   74.23
West     49.69   57.21   80.19   51.09

⇒

          Quarters (1969)
Area        1    2    3    4
North      98   92  101   90
South      48   42   50   39
East       75   75  100   74
West       50   57   80   51

It is a common mistake to present too many digits. Number of significant digits is typically 1, 2 or 3. It may require a change in units of measurement being displayed (e.g. 100,000s of dollars rather than 1000s).

Patterns?
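The reduction to whole units can be sketched in a few lines. The dictionary layout and variable names are illustrative assumptions; the figures are the 1969 values from the table.

```python
# 1969 sales figures by area, one value per quarter (from the table above).
sales_1969 = {
    "North": [97.62, 92.24, 100.90, 90.39],
    "South": [48.29, 42.31, 49.98, 39.09],
    "East": [75.23, 75.16, 100.11, 74.23],
    "West": [49.69, 57.21, 80.19, 51.09],
}

# Round every entry to the nearest whole unit.
rounded = {area: [round(v) for v in values] for area, values in sales_1969.items()}

# Print with right-aligned columns so the digits line up.
for area, values in rounded.items():
    print(f"{area:<6}" + "".join(f"{v:>5}" for v in values))
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer; none of these entries is an exact half, so the choice of rounding rule does not matter here.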

slide-28
SLIDE 28

Analysis of table data

On “significant digits”

Because we are looking for patterns in the table, reducing to “significant” digits may not always mean the usual “scientifically significant digits”. For example, had the table been, say

                        Quarters (1969)
Area              1            2            3            4
North   12345097.62  12345092.24  12345100.90  12345090.39
South   12345048.29  12345042.31  12345049.98  12345039.09
East    12345075.23  12345075.16  12345100.11  12345074.23
West    12345049.69  12345057.21  12345080.19  12345051.09

We still would have wanted the table below:

          Quarters (1969)
Area        1    2    3    4
North      98   92  101   90
South      48   42   50   39
East       75   75  100   74
West       50   57   80   51

Because the first five significant digits are identical, any pattern in the table will be in the remaining digits. Simply subtract 12,345,000 from every entry and report a location of 12,345,000 to accompany the table.
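A short sketch of removing a common location before rounding. The rule used to pick the location (the smallest entry rounded down to a multiple of 1000) is an assumption made for illustration; in practice the location is a judgment call.

```python
raw = {
    "North": [12345097.62, 12345092.24, 12345100.90, 12345090.39],
    "South": [12345048.29, 12345042.31, 12345049.98, 12345039.09],
    "East": [12345075.23, 12345075.16, 12345100.11, 12345074.23],
    "West": [12345049.69, 12345057.21, 12345080.19, 12345051.09],
}

# Assumed rule: take the smallest entry, rounded down to a multiple of 1000.
smallest = min(v for values in raw.values() for v in values)
location = int(smallest // 1000) * 1000

# Subtract the location, then round what remains to whole units.
reduced = {area: [round(v - location) for v in values] for area, values in raw.items()}

print("location:", location)
for area, values in reduced.items():
    print(f"{area:<6}" + "".join(f"{v:>5}" for v in values))
```

The reduced table is reported together with the location, so no information about the overall level is lost.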

slide-29
SLIDE 29

Analysis of table data

Step 4

Looking for patterns. Get column and row sums.

          Quarters (1969)
Area        1    2    3    4  Total
North      98   92  101   90    381
South      48   42   50   39    179
East       75   75  100   74    324
West       50   57   80   51    238
Total     271  266  331  254   1122

Patterns?

Sums are on a different scale (magnitude) than the points within the table.
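The margins are simple to compute; a sketch, assuming the table is held as a list of rows in the order North, South, East, West:

```python
areas = ["North", "South", "East", "West"]
table = [
    [98, 92, 101, 90],   # North
    [48, 42, 50, 39],    # South
    [75, 75, 100, 74],   # East
    [50, 57, 80, 51],    # West
]

row_totals = [sum(row) for row in table]          # one total per area
col_totals = [sum(col) for col in zip(*table)]    # one total per quarter
grand_total = sum(row_totals)

for area, row, total in zip(areas, table, row_totals):
    print(f"{area:<6}" + "".join(f"{v:>5}" for v in row) + f"{total:>7}")
print(f"{'Total':<6}" + "".join(f"{v:>5}" for v in col_totals) + f"{grand_total:>7}")
```

The totals have three or four digits while the entries have two or three, which is the change of scale the slide warns about.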

slide-30
SLIDE 30

Analysis of table data

Step 5

Looking for patterns. Get column and row averages or medians or . . . .

          Quarters (1969)
Area        1    2    3    4   Ave.
North      98   92  101   90     95
South      48   42   50   39     45
East       75   75  100   74     81
West       50   57   80   51     60
Average    68   67   83   64     70

Patterns?

Quarterly averages don’t differ much from overall average. Area averages differ more. Area values also do not differ much from the Areas averaged across quarters.
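A sketch of the averages, on the same scale as the entries. Several averages here fall exactly on a half (e.g. 59.5 and 66.5), and the table's figures are consistent with rounding halves upward, so the sketch uses a small `round_half_up` helper; that helper is an assumption about how the slide's figures were rounded, not something stated in the source.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with exact halves going up."""
    return math.floor(x + 0.5)


table = [
    [98, 92, 101, 90],   # North
    [48, 42, 50, 39],    # South
    [75, 75, 100, 74],   # East
    [50, 57, 80, 51],    # West
]

area_averages = [round_half_up(sum(row) / len(row)) for row in table]
quarter_averages = [round_half_up(sum(col) / len(col)) for col in zip(*table)]
n_cells = sum(len(row) for row in table)
overall_average = round_half_up(sum(sum(row) for row in table) / n_cells)

print(area_averages)     # averages for North, South, East, West
print(quarter_averages)  # averages for quarters 1 to 4
print(overall_average)
```

Python's built-in `round` would give 66 rather than 67 for the second quarter, because it rounds exact halves to the nearest even integer.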

slide-31
SLIDE 31

Analysis of table data

Step 6

Lack of variation across quarters is better seen if these figures align within columns.

          Quarters (1969)
Area        1    2    3    4   Ave.
North      98   92  101   90     95
South      48   42   50   39     45
East       75   75  100   74     81
West       50   57   80   51     60
Average    68   67   83   64     70

                        Area
1969       North  South   East   West   Ave.
Q1            98     48     75     50     68
Q2            92     42     75     57     67
Q3           101     50    100     80     83
Q4            90     39     74     51     64
Average       95     45     81     60     70

Note how much easier it is to scan down the North column to see the little variability in the leading digit. Exceptions now stand out more easily – the 100 in the East column and the 80 in the West. Similarly Q3 overall.

More widely spaced rows would diminish this visual advantage.

slide-32
SLIDE 32

Analysis of table data

Step 7

Column order is arbitrary. Reorder to better reveal patterns.

                        Area
1969       South   West   East  North   Ave.
Q1            48     50     75     98     68
Q2            42     57     75     92     67
Q3            50     80    100    101     83
Q4            39     51     74     90     64
Average       45     60     81     95     70

Columns ordered from smallest to largest. Much simpler to see the same increasing pattern occurs in all quarters.

Could rearrange rows, except: not much variation across quarters & the quarters have a meaningful time order.

For rearranged rows, place largest at top. It’s easier to look at subtractive difference between rows.
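Swapping rows with columns and reordering the columns by their averages can be sketched as follows. The data layout and names are illustrative assumptions.

```python
by_area = {
    "North": [98, 92, 101, 90],
    "South": [48, 42, 50, 39],
    "East": [75, 75, 100, 74],
    "West": [50, 57, 80, 51],
}
quarters = ["Q1", "Q2", "Q3", "Q4"]


def mean(values):
    return sum(values) / len(values)


# Order the areas (now columns) from smallest to largest average.
ordered_areas = sorted(by_area, key=lambda area: mean(by_area[area]))

# Transpose: one row per quarter, one column per area.
rows = [[by_area[area][q] for area in ordered_areas] for q in range(len(quarters))]

print(f"{'1969':<8}" + "".join(f"{area:>7}" for area in ordered_areas))
for label, row in zip(quarters, rows):
    print(f"{label:<8}" + "".join(f"{v:>7}" for v in row))
```

The quarters are left in time order, as the slide recommends, because that order is meaningful in itself.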

slide-33
SLIDE 33

Analysis of table data

Step 8

Could exclude exceptions in calculating row and column summaries.

                        Area
1969       South   West   East  North   Ave.
Q1            48     50     75     98     68
Q2            42     57     75     92     67
Q3            50     80    100    101     83
Q4            39     51     74     90     64
Average       45     53∗    75∗    95     67∗

∗ Excluding Q3 in West and East.

Exceptions coloured brown. Could also indicate with parentheses, i.e. 100 or (100).
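A sketch of the column averages with the two exceptional Q3 values left out. Which cells count as exceptions is decided by eye in the slides, so the `exceptions` set below is hard-coded as an assumption; the half-up rounding helper is also an assumption.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with exact halves going up."""
    return math.floor(x + 0.5)


columns = {
    "South": [48, 42, 50, 39],
    "West": [50, 57, 80, 51],
    "East": [75, 75, 100, 74],
    "North": [98, 92, 101, 90],
}

# (area, quarter index) pairs judged exceptional: Q3 (index 2) in West and East.
exceptions = {("West", 2), ("East", 2)}

model = {}
kept_all = []
for area, values in columns.items():
    kept = [v for q, v in enumerate(values) if (area, q) not in exceptions]
    model[area] = round_half_up(sum(kept) / len(kept))
    kept_all.extend(kept)

overall = round_half_up(sum(kept_all) / len(kept_all))

print(model)
print(overall)
```

Leaving the two values out moves the West average from 60 to 53 and the East average from 81 to 75, while South and North are unchanged.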

slide-34
SLIDE 34

Analysis of table data

How far have we come?

Compare where we started, with where we ended

Original 1969 data

     13-15   16-18   19-21   22-24
A    97.62   92.24  100.90   90.39
B    48.29   42.31   49.98   39.09
C    75.23   75.16  100.11   74.23
D    49.69   57.21   80.19   51.09

                        Area
1969       South   West   East  North   Ave.
Q1            48     50     75     98     68
Q2            42     57     75     92     67
Q3            50     80    100    101     83
Q4            39     51     74     90     64
Average       45     53∗    75∗    95     67∗

∗ Excluding Q3 in West and East.

slide-35
SLIDE 35

Analysis of table data

Summary Description

Four areas: North 95, East 75, West 53, and South 45.

There were two exceptionally high values in Q3 from the East and West regions, differing from their averages by about 25 units.

                        Area
1969       South   West   East  North   Ave.
Q1            48     50     75     98     68
Q2            42     57     75     92     67
Q3            50     80    100    101     83
Q4            39     51     74     90     64
Average       45     53∗    75∗    95     67∗

∗ Excluding Q3 in West and East.

slide-36
SLIDE 36

Guidelines for constructing tables

Things to note . . . Ehrenberg (1975), p. 14.

1. Base rules
   ◮ Reduce number of digits. (N.B. could require finding a common location.) Mental arithmetic is more difficult with more than two significant (varying) digits.
   ◮ Figures to be compared should be close together.
   ◮ Use memorable self-explanatory symbols and labels.
   ◮ Separate different types of items/groups with white space or gridlines.
2. Calculations
   ◮ Avoid introducing new variables or scales (e.g. totals) whenever possible.
   ◮ Use averages (or medians) to help focus the eye over the array.
   ◮ Note dramatically exceptional values and exclude them from pattern summary calculations.
3. If possible swap rows and columns, reorder rows and/or columns:
   ◮ Numbers that vary the least should appear in columns. Both regularities and exceptions are easier to spot.
   ◮ Rearrange rows to have large numbers appear above small numbers. Differences should be easier to detect following subtraction rules.
   ◮ Rearrange columns so that averages are strictly decreasing (or increasing) from left to right. Easier then to detect departures from this pattern within the table.

slide-37
SLIDE 37

Analysis of table data

Modelling the sales figures

South 45, West 53, East 75, and North 95.

This is the essential pattern that our analysis uncovered. These are, in some sense, our “model” numbers (i.e. reasonable summaries, but not the actual value for any quarter). We think they capture essential structure in these sales figures.

Sounds like a “pictorial form” – modelling entails the possibility of an underlying structure connecting the sales figures to our numerical picture.

The model effectively says that the different quarters need not be considered. But not all quarters had these values. . . . There was variation from these values.

Questions: What should we say about this variation? About this deviation from our model? Does it matter?

Answers: Model the deviation as well. Determine its characteristics.

slide-38
SLIDE 38

Analysis of table data

Step 9

Use model for areas, and look at deviations from model.

                        Area
1969       South   West   East  North   Ave.
Q1             3     −3      0      3      1
Q2            −3      4      0     −3      0
Q3             5     27     25      6      6∗
Q4            −6     −2     −1     −5     −4
Average        0      0∗     0∗     0      0∗

∗ Excluding Q3 in East and West.

Model has South 45, West 53, East 75, and North 95. Comments? Column averages must be zero, they are from the model. Note rounding effects.

slide-39
SLIDE 39

Analysis of table data

Step 9

Model has North 95, East 75, West 53, South 45

Deviations are:

                        Area
1969       South   West   East  North   Ave.
Q1             3     −3      0      3      1
Q2            −3      4      0     −3      0
Q3             5     27     25      6      6∗
Q4            −6     −2     −1     −5     −4
Average        0      0∗     0∗     0      0∗

∗ Excluding Q3 in West and East.

Summarize size of deviations: Use average absolute deviation.

                        Area
1969       South   West   East  North   Ave.
Ave. Dev.      4      3∗     0∗     4      3

∗ Excluding Q3 in West and East.
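The deviations and their average absolute size can be reproduced with a short sketch. The model values are taken from the slide; the set of excluded cells is hard-coded as an assumption, matching the footnote.

```python
model = {"South": 45, "West": 53, "East": 75, "North": 95}
columns = {
    "South": [48, 42, 50, 39],
    "West": [50, 57, 80, 51],
    "East": [75, 75, 100, 74],
    "North": [98, 92, 101, 90],
}
# Q3 (index 2) in West and East is treated as exceptional and left out of summaries.
exceptions = {("West", 2), ("East", 2)}

# Deviation of every cell from its area's model value.
deviations = {
    area: [v - model[area] for v in values] for area, values in columns.items()
}


def kept_abs(area):
    """Absolute deviations for one area, without the excluded cells."""
    return [abs(d) for q, d in enumerate(deviations[area]) if (area, q) not in exceptions]


ave_abs_dev = {area: sum(kept_abs(area)) / len(kept_abs(area)) for area in columns}
all_kept = [d for area in columns for d in kept_abs(area)]
overall_abs_dev = sum(all_kept) / len(all_kept)

print(deviations)
print({area: round(v) for area, v in ave_abs_dev.items()})
print(round(overall_abs_dev))
```

Using absolute values matters: the signed deviations average to roughly zero by construction, so only their magnitudes say how far the data sit from the model.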

slide-40
SLIDE 40

Analysis of table data

Summary Description

Four areas: North 95, East 75, West 53, and South 45. Deviation is about 3 units about these averages, with no regular pattern. There were two exceptionally high values in Q3 from the East and West regions, differing from their averages by about 25 units. Table has itself become redundant. Need no longer be included.

slide-41
SLIDE 41

Add to our guidelines

Things to note . . . Ehrenberg (1975), p. 14.

1. Base rules
   ◮ Reduce number of digits. Mental arithmetic is more difficult with more than two significant (varying) digits.
   ◮ Figures to be compared should be close together.
   ◮ Use memorable self-explanatory symbols and labels.
   ◮ Separate different types of items/groups with white space or gridlines.
2. Calculations
   ◮ Avoid introducing new variables or scales (e.g. totals) whenever possible.
   ◮ Use averages (or medians) to help focus the eye over the array.
   ◮ Note dramatically exceptional values and exclude them from pattern summary calculations.
3. If possible swap rows and columns, reorder rows and/or columns:
   ◮ Numbers that vary the least should appear in columns. Both regularities and exceptions are easier to spot.
   ◮ Rearrange rows to have large numbers appear above small numbers. Differences should be easier to detect following subtraction rules.
   ◮ Rearrange columns so that averages are strictly decreasing (or increasing) from left to right. Easier then to detect departures from this pattern within the table.
4. Summarize irregular aspects of the data statistically, e.g. by average deviations from appropriate averages.

slide-42
SLIDE 42

Generalizing

Making use of the model

We started with two years’ data, 1969 and 1970, but only built a model based on the first of these.

Can we apply what we learned about 1969 directly to 1970? If the model works for both years, we have in some sense validated it. The possibility of an underlying structure, described by our model, would seem to be connected with reality (cf. pictorial form).

Applying the model amounts to building a final table for 1970 according to the same table organization. For 1970 this yields

                        Area
1970       South   West   East  North   Ave.
Q1            46     53     74     96     67
Q2            50     49     77     94     68
Q3            42     59     72     91     66
Q4            37     53     76     98     66
Average       44     54     75     95     67

which tells essentially the same story.
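As a rough check, the same steps (round, average by area) can be applied to the 1970 columns of the original table and compared with the 1969 model values. The half-up rounding helper and the variable names are assumptions made for this sketch.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with exact halves going up."""
    return math.floor(x + 0.5)


# 1970 figures from the original table (periods 25-27 to 34-36).
sales_1970 = {
    "South": [46.38, 49.74, 41.74, 37.39],
    "West": [52.88, 49.41, 59.32, 52.56],
    "East": [74.23, 76.97, 71.66, 76.47],
    "North": [95.69, 94.44, 91.13, 97.81],
}
model_1969 = {"South": 45, "West": 53, "East": 75, "North": 95}

# Same organization as for 1969: round entries, then average the rounded values.
rounded_1970 = {area: [round(v) for v in values] for area, values in sales_1970.items()}
averages_1970 = {
    area: round_half_up(sum(values) / len(values)) for area, values in rounded_1970.items()
}
differences = {area: averages_1970[area] - model_1969[area] for area in model_1969}

print(averages_1970)
print(differences)
```

The area averages for 1970 differ from the 1969 model values by at most one unit, which is well inside the typical deviation of about 3.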

slide-43
SLIDE 43

Regularity

Model consistency

                        Area
1969       South   West   East  North   Ave.
Q1            48     50     75     98     68
Q2            42     57     75     92     67
Q3            50     80    100    101     83
Q4            39     51     74     90     64
Average       45     53∗    75∗    95     67∗

∗ Excluding Q3 in West and East.

                        Area
1970       South   West   East  North   Ave.
Q1            46     53     74     96     67
Q2            50     49     77     94     68
Q3            42     59     72     91     66
Q4            37     53     76     98     66
Average       44     54     75     95     67

Consistent results across both years.

slide-44
SLIDE 44

Irregularity

Deviations

                        Area
1969       South   West   East  North   Ave.
Q1             3     −3      0      3      1
Q2            −3      4      0     −3      0
Q3             5     27     25      6      6∗
Q4            −6     −2     −1     −5     −4
Average        0      0∗     0∗     0      0∗

∗ Excluding Q3 in West and East.

Area 1970 South West East North Ave.
Q1 2 −1 −1 1
Q2 6 −4 2 −1
Q3 −2 5 −3 −3 −1
Q4 −7 1 1 3
Average

No regular patterns, average absolute deviation about the same (≈ 3)

slide-45
SLIDE 45

Deviations

Deeper examination

Averaging the deviations over the two years:

Average of 1969 & ’70 Area South West East North Ave.
Q1 2 −2 2 1
Q2 2 1 −2
Q3 2 3∗ −2∗ 2 1
Q4 −6 −1 −2
Average

∗ Excluding Q3 in 1969.

Still no regular patterns, average absolute deviation about 2. (Note that 3/√2 ≈ 2.12.)
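One way to motivate the figure 3/√2 is sketched below. It assumes, beyond what the slides state, that the 1969 and 1970 deviations for a given cell are independent with a common standard deviation σ, and that the typical size of a deviation scales with σ. The symbols r_{1969}, r_{1970} and σ are introduced here for illustration only.

```latex
\operatorname{Var}\!\left(\frac{r_{1969} + r_{1970}}{2}\right)
  = \frac{\sigma^{2} + \sigma^{2}}{4}
  = \frac{\sigma^{2}}{2}
\quad\Longrightarrow\quad
\operatorname{sd}\!\left(\frac{r_{1969} + r_{1970}}{2}\right)
  = \frac{\sigma}{\sqrt{2}},
\qquad
\frac{3}{\sqrt{2}} \approx 2.12 .
```

Under these assumptions, a typical deviation of about 3 in each year would shrink to about 2 once the two years are averaged, which is what purely irregular deviations would be expected to do.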

slide-46
SLIDE 46

Generalizing

Essential features

1. Consistent model for averages (over both years):

                        Area
           South   West   East  North   Ave.
1969          45     53∗    75∗    95     67∗
1970          44     54     75     95     67

2. Consistently patternless, irregular deviations (over both years).
   ◮ Average deviation of zero. (Force of calculation.)
   ◮ Average absolute deviation the same over each year (about 3).
   ◮ Consistently patternless deviations over quarters and areas for each year and over both years.

It is this consistently irregular pattern of deviations that indicates the ability to generalize the consistent averages of areas to other, unseen, years.

Analysis of our tabular model suggests that it can be used to predict.

slide-47
SLIDE 47

Statistical modelling

Mathematical representation

The model can be given a more formal, symbolic mathematical, representation as follows.

Data (as we calculated for each year):

$$y_{ij} = \bar{y}_{i+} + r_{ij}$$

where $y_{ij}$ is the value for that year for Area $i$ and Quarter $j$, $\bar{y}_{i+}$ is the arithmetic average for Area $i$ summed over the Quarters or second index, as indicated by “+” in that index.
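The decomposition can be checked numerically. The sketch below reads r_ij as the deviation of each cell from its area average, which is an interpretation rather than something defined on this slide, and it uses the plain (unrounded, no exclusions) area averages of the rounded 1969 figures.

```python
# y[area][j]: value for a given area and quarter j (rounded 1969 figures).
y = {
    "South": [48, 42, 50, 39],
    "West": [50, 57, 80, 51],
    "East": [75, 75, 100, 74],
    "North": [98, 92, 101, 90],
}

# y-bar_{i+}: arithmetic average over the quarters for each area.
area_mean = {area: sum(values) / len(values) for area, values in y.items()}

# r_{ij}: what is left after removing the area average.
residual = {
    area: [value - area_mean[area] for value in values] for area, values in y.items()
}

for area, values in y.items():
    # The decomposition reproduces every cell exactly (up to floating point).
    for j, value in enumerate(values):
        assert abs(area_mean[area] + residual[area][j] - value) < 1e-9
    # Residuals sum to zero within each area, by construction.
    assert abs(sum(residual[area])) < 1e-9

print(area_mean)
```

The zero sums are the "force of calculation": they hold for any data, so they carry no evidence about the model. The evidence is in the size and patternlessness of the residuals.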

slide-48
SLIDE 48

Mathematical representation

Need to be able to generalize

We found that the model was the same over both years. The one we constructed for 1969 seemed to generalize to 1970.

We might consider building a representation for this model which explicitly showed the generalization (for any year).

We also didn’t really use the average for some regions in 1969 – we excluded the exceptional values in two areas in the third quarter.

We model the data patterns, and the process that might have generated the data. (These need not be identical.) The latter we do to generalize to other as yet unobserved data.