Beyond three dimensions Non Cartesian representations R.W. Oldford - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Beyond three dimensions

Non Cartesian representations R.W. Oldford

slide-2
SLIDE 2

Beyond three dimensions

Hey! I’m a starfish!

Moving from 0 to 1 to 2 to 3 dimensions: each step has been an advance that required an adjustment in our thinking. Each has allowed us to “visualize” more, and more complex, information. Each step requires training.

slide-3
SLIDE 3

Cartesian coordinates

Cartesian coordinates use orthogonal axes and a grid system to place each point in space. (Cartesian coordinates: René Descartes, 1596-1650.) What about ≥ 4d?

When we have n individual cases, each one with its own values for each of p variates, we typically represent the ith case mathematically as a vector x_i. The collection of data forms a “cloud” of n points in a p-dimensional space. Mathematics doesn’t care what value p > 0 takes. How can a point cloud be visualized in more than three dimensions?
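
As a small illustration of this notation, the sketch below treats R's built-in iris measurements as the n × p data. The names X and x_i are introduced here only for the example.

```r
# Sketch: a data set as a cloud of n points in p dimensions.
X <- as.matrix(iris[, 1:4])   # n x p matrix, one row per case
n <- nrow(X)                  # number of cases
p <- ncol(X)                  # number of variates (dimensions)
x_i <- X[1, ]                 # the i-th case (here i = 1), a vector of length p
```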

slide-4
SLIDE 4

Four dimensional data - example

Edgar Anderson’s Iris data: 150 flowers, 50 from each of three different species: Iris Virginica, Iris Versicolor, Iris Setosa.

Four measurements

◮ sepal width
◮ sepal height
◮ petal width
◮ petal height

slide-5
SLIDE 5

Four dimensional data - example

Learning from maps, we could use glyphs, especially ones that meaningfully convey the information in the data (provide an ostensive “pictorial form”).

Edgar Anderson proposed:

slide-6
SLIDE 6

Four dimensional data - example

Iris Setosa:

Centre glyph is the average.

slide-7
SLIDE 7

Four dimensional data - example

Iris Versicolor:

Centre glyph is the average.

slide-8
SLIDE 8

Four dimensional data - example

Using averages to compare and locate species

Anderson’s real interest lay not in discriminating between species but in whether the middle species (in terms of number of chromosomes and habitat) had relative distances of 1:2 from the two extreme species. This proportion roughly matched the chromosome ratios.
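
One way to look at this comparison numerically is sketched below, using R's built-in iris data. Two things are assumptions of the sketch rather than Anderson's method: Euclidean distance between the species averages, and versicolor as the middle species.

```r
# Sketch: distances between species averages (Euclidean distance is an assumption).
centres <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
d <- as.matrix(dist(centres[, -1]))
dimnames(d) <- list(as.character(centres$Species), as.character(centres$Species))

d_near <- d["versicolor", "virginica"]   # distance to the closer extreme
d_far  <- d["versicolor", "setosa"]      # distance to the farther extreme
ratio  <- d_far / d_near                 # compare with the 1:2 proportion
ratio
```
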
slide-9
SLIDE 9

Higher dimensional glyphs

Anderson’s nested rectangular glyphs were natural for his problem. They were, in a sense, cartoon versions of the flower parts themselves. So, what might be the most natural glyph for us? We know that, for a variety of reasons, possibly including

◮ distinguishing predators and prey,
◮ distinguishing friend from foe,
◮ identifying close relations,
◮ recognizing mood and reaction,
◮ social interaction of all kinds,

we have evolved to be very good at recognizing, distinguishing, and reading faces, and that parts of our brain are “wired” to deal with faces.

In 1973 Herman Chernoff suggested we use faces as glyphs.

slide-10
SLIDE 10

Chernoff faces

Herman Chernoff Constructing a Chernoff face

Variate values are assigned to different cartoon face features. A variety of faces have been implemented since Chernoff’s original proposal.

slide-11
SLIDE 11

Chernoff faces

Original Chernoff Faces, 1973.

slide-12
SLIDE 12

Chernoff faces

There are a couple of R packages that draw cartoon faces (i.e. variations on Chernoff faces). The first we will look at is the TeachingDemos package.

library(TeachingDemos)

The faces2(...) function can accommodate (up to) 18 dimensions. The face features are assigned to columns 1 to (at most) 18 as follows:

  1. Width of center
  2. Top vs. Bottom width (height of split)
  3. Height of Face
  4. Width of top half of face
  5. Width of bottom half of face
  6. Length of Nose
  7. Height of Mouth
  8. Curvature of Mouth (abs < 9)
  9. Width of Mouth
  10. Height of Eyes
  11. Distance between Eyes (.5-.9)
  12. Angle of Eyes/Eyebrows
  13. Circle/Ellipse of Eyes
  14. Size of Eyes
  15. Position Left/Right of Eyeballs/Eyebrows
  16. Height of Eyebrows
  17. Angle of Eyebrows
  18. Width of Eyebrows

slide-13
SLIDE 13

Chernoff faces

To demonstrate, we first scale the data to be between 0 and 1 for each variate

data <- iris[,1:4] # the four measurements on Anderson's irises
scale01 <- function (x) {
  xrange <- range(x)
  (x - min(xrange))/diff(xrange)
}
data <- apply(data, 2, scale01)
faces2(data[c(1, 51, 101), ], which=c(8, 13, 14, 17), scale = "none", nrows=1, ncols=3)
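
To make the variate-to-feature assignment explicit, the sketch below pairs each iris variate with the face feature selected by which=c(8, 13, 14, 17). The short feature names are labels chosen for this example (they follow the order of the 18 features), and the sketch assumes column j of the data goes to feature which[j].

```r
# Short labels for the 18 face features, in the order faces2() numbers them.
features <- c("CenterWidth", "TopBottomWidth", "FaceHeight", "TopWidth",
              "BottomWidth", "NoseLength", "MouthHeight", "MouthCurve",
              "MouthWidth", "EyesHeight", "EyesBetween", "EyeAngle",
              "EyeShape", "EyeSize", "EyeballPos", "EyebrowHeight",
              "EyebrowAngle", "EyebrowWidth")

which_features <- c(8, 13, 14, 17)   # the `which` argument used with faces2()
# Assumed pairing: data column j drives feature which_features[j].
mapping <- setNames(features[which_features], colnames(iris)[1:4])
mapping
```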

slide-14
SLIDE 14

Chernoff faces

With the appropriate packages, we can explore how variable these features are

library(tkrplot)
## Loading required package: tcltk
if(interactive()){
  tke2 <- rep( list(list('slider',from=0,to=1,init=0.5,resolution=0.1)), 18)
  names(tke2) <- c('CenterWidth','TopBottomWidth','FaceHeight','TopWidth',
                   'BottomWidth','NoseLength','MouthHeight','MouthCurve','MouthWidth',
                   'EyesHeight','EyesBetween','EyeAngle','EyeShape','EyeSize','EyeballPos',
                   'EyebrowHeight','EyebrowAngle','EyebrowWidth')
  tkfun2 <- function(...){
    tmpmat <- rbind(Min=0,Adjust=unlist(list(...)),Max=1)
    faces2(tmpmat, scale='none')
  }
  tkexamp( tkfun2, list(tke2), plotloc='left', hscale=2, vscale=2 )
}

slide-15
SLIDE 15

Chernoff faces

We could choose different variate-to-feature mappings

data <- iris[,1:4] # the four measurements on Anderson's irises
scale01 <- function (x) {
  xrange <- range(x)
  (x - min(xrange))/diff(xrange)
}
data <- apply(data, 2, scale01)
faces2(data[c(1, 51, 101), ], which=c(6, 7, 8, 12), scale = "none", nrows=1, ncols=3)

slide-16
SLIDE 16

Chernoff faces

For all of the irises (in random order)

n <- nrow(data)
faces2(data[sample(1:n, n, replace =FALSE), ], which=c(6, 7, 8, 12), nrows=10, ncols=15)

slide-17
SLIDE 17

Chernoff faces

The olive data (150 randomly selected from 572; 8 dimensions)

faces2(data[sample(1:nrow(data), 150, replace = FALSE), ], nrows=10, ncols=15, which=c(1, 2, 3, 4, 6, 7, 8, 12), scale = "none")

slide-18
SLIDE 18

Chernoff faces

Problems with faces

◮ Eye movements (saccades).
◮ Some facial features are more important than others.
◮ Grouping will depend on which variates are matched to features.
◮ Not so useful for exploratory work, OK (with care) for infographic.

slide-19
SLIDE 19

Florence Nightingale’s rose plot

Areas (centre to edge) are proportional to value. Time series.
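
To see what "area proportional to value" implies for the wedge radius, here is a minimal sketch. The values are made up purely for illustration, and the wedges are assumed to have equal angles.

```r
# Sketch: wedge radii chosen so that wedge AREA is proportional to value.
values <- c(2, 8, 18)                 # made-up magnitudes, for illustration only
theta  <- 2 * pi / length(values)     # equal angle for every wedge
radius <- sqrt(2 * values / theta)    # from area = theta * r^2 / 2
area   <- theta * radius^2 / 2        # recovers the original values
```

Quadrupling a value only doubles the radius, which is why drawing the radius proportional to the value would exaggerate large values.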

slide-20
SLIDE 20

Florence Nightingale’s rose plot

E.g. The Atlas of Britain and Northern Ireland, 1963. Note colour used to identify variate. Area proportional to magnitude.

slide-21
SLIDE 21

Florence Nightingale’s rose plot

There is a function in R called stars(...) which will produce Nightingale’s rose style glyphs.

n <- nrow(data)
stars(data[sample(1:n, n, replace =FALSE), ],
      nrow=10, ncol=15,
      main = "Nightingale Roses (almost)",
      draw.segments=TRUE, cex=1.2)

slide-22
SLIDE 22

Florence Nightingale’s rose plot

Iris data

Nightingale Roses (almost)

slide-23
SLIDE 23

Florence Nightingale’s rose plot

Olive data

Nightingale Roses (almost)

slide-24
SLIDE 24

Florence Nightingale’s rose plot

Note that our use here is very different from Nightingale’s

◮ Nightingale used overlapping segments (hence the rose)
◮ represented time series (typically strong positive correlation between neighbouring wedges)
    ◮ i.e. large beside/near large, small beside/near small
◮ very few roses

In contrast:

◮ our use is more like that of the 1963 British Atlas example,
◮ no overlapping segments
◮ different variates shown at each wedge
    ◮ no expectation of correlation’s sign
◮ many “stars” or “roses”
◮ looking to group like-looking shapes.

Nightingale’s Crimean War data could have been represented as stars with 3 segments (one for each cause of death) and a separate star for each month.

slide-25
SLIDE 25

Radial Axes

Sepal.Length Sepal.Width Petal.Length Petal.Width

Axes as equi-angular radii of a circle.

slide-26
SLIDE 26

Radial Axes

Sepal.Length Sepal.Width Petal.Length Petal.Width

Data scaled with dataset’s minimum value at centre, maximum at end (for each variate).

slide-27
SLIDE 27

Radial Axes

Sepal.Length Sepal.Width Petal.Length Petal.Width

One flower. Four measurements.
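
The construction on the last few slides can be sketched in base R as follows. The helper names (scale01 as a one-liner, radial_xy) and the choice of the first flower are assumptions made for this example.

```r
# Sketch: one flower on equi-angular radial axes.
scale01 <- function(x) (x - min(x)) / diff(range(x))
X <- apply(as.matrix(iris[, 1:4]), 2, scale01)   # each variate scaled to [0, 1]

p <- ncol(X)
angles <- 2 * pi * (seq_len(p) - 1) / p          # one axis per variate

# Position along each axis: minimum at the centre, maximum at the end.
radial_xy <- function(r) cbind(x = r * cos(angles), y = r * sin(angles))
xy <- radial_xy(X[1, ])                          # the first flower

if (interactive()) {
  plot(NA, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1,
       axes = FALSE, xlab = "", ylab = "")
  segments(0, 0, cos(angles), sin(angles), col = "grey")   # the axes
  polygon(xy, border = "steelblue")                        # the glyph
}
```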

slide-28
SLIDE 28

Radial Axes

Sepal.Length Sepal.Width Petal.Length Petal.Width

All of the setosa data at once. Correlations?

slide-29
SLIDE 29

Radial Axes

Sepal.Length Sepal.Width Petal.Length Petal.Width

All of the versicolor data at once. Correlations?

slide-30
SLIDE 30

Radial Axes

Sepal.Length Sepal.Width Petal.Length Petal.Width

All of the virginica data at once. Correlations?

slide-31
SLIDE 31

Radial Axes

Iris Setosa Iris Versicolor Iris Virginica

Axes (each panel): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

All three species. Comments?

slide-32
SLIDE 32

Radial Axes

Averages Iris Setosa Iris Versicolor Iris Virginica

Axes (each panel): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

All three species. Comments?

slide-33
SLIDE 33

Radial Axes

Iris Setosa Iris Versicolor Iris Virginica Altogether

Axes (each panel): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

All three species. Comments?

slide-34
SLIDE 34

Radial Axes

Averages and Individuals Iris Setosa Iris Versicolor Iris Virginica

Axes (each panel): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

All three species. Comments?

slide-35
SLIDE 35

Radial Axes

Data explorations using stars(...) again

stars(data[ran.order,], lwd=1.25, labels=NULL, radius=FALSE,
      col.lines = rep("steelblue", n),
      nrow=10, ncol = 15, key.loc=c(12,-0.5), main="Iris data")

slide-36
SLIDE 36

Radial Axes

Iris data

Sepal.Length Sepal.Width Petal.Length Petal.Width

slide-37
SLIDE 37

Radial Axes

radius = TRUE

Iris data

Sepal.Length Sepal.Width Petal.Length Petal.Width

slide-38
SLIDE 38

Radial Axes

Filled by setting col.stars = rep("steelblue", n)

Iris data

Sepal.Length Sepal.Width Petal.Length Petal.Width

slide-39
SLIDE 39

Radial Axes

Locating the glyphs on a plot:

stars(data,
      locations=data[, 1:2],  # locate using first two
      lwd=0.5,                # Narrow the line
      len=.05,                # NEED to adjust this one
      labels=NULL, radius=TRUE,
      col.lines = rep(adjustcolor("steelblue", 0.5), n),
      col.stars = rep(adjustcolor("steelblue", 0.5), n),
      nrow=10, ncol = 15, key.loc=c(12,-0.5), main="Iris data")

slide-40
SLIDE 40

Radial Axes

Locating the glyphs on a plot:

Iris data

slide-41
SLIDE 41

Radial Axes

The olive data

Olive data

palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic

slide-42
SLIDE 42

Radial Axes

Locating the glyphs on a plot:

Olive data

slide-43
SLIDE 43

Radial Axes

The effect of correlation can be seen with the following artificial set of data:

w <- runif(100) ; x <- runif(100); y <- x ; z <- -y
fakedata <- data.frame( w = w, x=x, y=y, z=z)

Fake data

w x y z

  cor.w.x.   cor.x.y.   cor.y.z.   cor.z.w.
-0.0304026          1         -1  0.0304026

Note that two correlations are missing: cor(x, z) and cor(w, y)
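
The four neighbouring correlations can be recomputed with the short sketch below. The seed is arbitrary (the slide's values came from a different random draw), so only the structure is expected to match: cor(x, y) is 1, cor(y, z) is -1, and cor(z, w) is the negative of cor(w, x).

```r
set.seed(1)   # arbitrary seed, so the numbers differ from those on the slide
w <- runif(100); x <- runif(100); y <- x; z <- -y

# Correlations between variates on neighbouring axes (w-x, x-y, y-z, z-w).
cors <- c(cor.w.x = cor(w, x), cor.x.y = cor(x, y),
          cor.y.z = cor(y, z), cor.z.w = cor(z, w))
cors
```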

slide-44
SLIDE 44

Radial Axes - overlaid

Overlaying all observations reveals the correlational patterns.

w x y z

  cor.w.x.   cor.x.y.   cor.y.z.   cor.z.w.
-0.0304026          1         -1  0.0304026

slide-45
SLIDE 45

Serial Axes

Cartesian coordinate systems are based on orthogonal axes

◮ every axis is orthogonal to every other axis
◮ every set of axes is orthogonal to every other distinct set
◮ which variates are matched to which axes doesn’t matter

In contrast, radial axes are an example of a serial axis system.

◮ variates are most easily compared if they appear beside each other.
◮ the order of the axes matters (in that the display produced is different)

With serial axes systems the axes follow an order and the order affects the display.

slide-46
SLIDE 46

Pairwise ordering of axes

Repeating axes/variates allows any variate to appear beside (and be compared with) any other. Consider the complete graph having variates as nodes.

◮ every path is a selection and ordering of axes
◮ an Eulerian will have every pair of variates appear beside one another
◮ an ordering of all axes with no repeats is a Hamiltonian

If we have some measure (e.g. correlation) on a pair of variates, we could use these as weights to choose an ordering. For example,

◮ the early part of a greedy Eulerian will tend to have small/large weights
◮ a weighted Hamiltonian could be used to produce an ordering of minimal/maximal total weight

The choices could effect very different displays.
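
For a handful of variates, a weighted Hamiltonian ordering can be found by brute force. The sketch below is one possible illustration, not the method used for these slides: it assumes absolute correlation as the weight and the iris data as input, and the helpers perms and path_weight are names made up for the example.

```r
# Sketch: maximal-weight Hamiltonian path by enumerating all orderings.
perms <- function(v) {               # all permutations of v, as a list
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], rest)
  }
  out
}

W <- abs(cor(iris[, 1:4]))           # weight on each pair of variates

# Total weight of an ordering: sum over neighbouring axes.
path_weight <- function(ord) sum(W[cbind(ord[-length(ord)], ord[-1])])

all_orders <- perms(1:4)
weights <- vapply(all_orders, path_weight, numeric(1))
best <- all_orders[[which.max(weights)]]
colnames(W)[best]                    # axis order of maximal total weight
```

Enumeration grows as p!, so this is only practical for small p. An ordering and its reverse have the same weight, and which.max simply returns the first one found.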

slide-47
SLIDE 47

Pairwise ordering of axes - Eulerian

On the fake data

w x y w.1 z x.1 y.1 z.1
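
The axis order above (with w.1, x.1, ... read as repeats of w, x, ...) can be checked to confirm that every pair of variates appears side by side at least once. The sketch below is only a verification of that order; the variable names are chosen for the example.

```r
# Sketch: does this axis order place every pair of variates beside each other?
ord <- c("w", "x", "y", "w", "z", "x", "y", "z")   # order read off the display

adjacent <- cbind(ord[-length(ord)], ord[-1])      # neighbouring axes
pair_id  <- apply(adjacent, 1, function(pr) paste(sort(pr), collapse = "-"))

all_pairs <- combn(c("w", "x", "y", "z"), 2, paste, collapse = "-")
covered   <- all(all_pairs %in% pair_id)           # TRUE if no pair is missing
repeated  <- unique(pair_id[duplicated(pair_id)])  # pairs that appear twice
```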

slide-48
SLIDE 48

Pairwise ordering of axes - Eulerian

The iris data

Iris data

Sepal.Length Sepal.Width Petal.Length Sepal.Length Petal.Width Sepal.Width Petal.Length Petal.Width

slide-49
SLIDE 49

Pairwise ordering of axes - Eulerian

The olive data

Olive data

Axis order: palmitic palmitoleic stearic palmitic oleic palmitoleic stearic oleic linoleic palmitic linolenic palmitoleic linoleic stearic linolenic oleic linoleic linolenic arachidic palmitic eicosenoic palmitoleic arachidic stearic eicosenoic oleic arachidic linoleic eicosenoic linolenic arachidic eicosenoic

slide-50
SLIDE 50

Pairwise ordering of axes - Scaling

Occasionally it makes sense to scale the data in other ways (not just by variate ranges). For example, for the iris data, scaling each flower by its largest measure (sepal length) compares the flower shapes and ignores flower sizes.

data <- iris[,-5] # the iris data
title <- "Iris data"
# Scale by row
data <- apply(data, 1, scale01)
# transpose to get the flowers as rows
data <- t(data)
# A function to get a colour for each species (not available to you)
cols <- get_cols(iris[,"Species"])
# First no repetition of the variates; preserve order
stars(data, labels=NULL, main=title, radius=TRUE, lwd=1.25,
      col.lines = cols, col.stars = cols,
      nrow=10, ncol = 15, key.loc=c(12,-0.5),
      ## data scaling used as is
      scale=FALSE )
# Now all pairs appear, again flowers appear in order
stars(data[,eseq(p)], labels=NULL, main=title, radius=TRUE, lwd=1.25,
      col.lines = cols, col.stars = cols,
      nrow=10, ncol = 15, key.loc=c(12,-0.5),
      ## data scaling used as is
      scale=FALSE )
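
For comparison, a literal reading of "scaling each flower by its largest measure" would divide each row by its maximum. This is a slightly different scaling from the row-wise scale01 in the code above, which also moves each flower's smallest measure to 0. The sketch is an alternative for illustration only, and the name shape is made up.

```r
# Sketch: divide each flower (row) by its own largest measurement.
X <- as.matrix(iris[, 1:4])
row_max <- apply(X, 1, max)      # largest measurement of each flower
shape <- X / row_max             # row i is divided by row_max[i]
head(shape, 3)
```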

slide-51
SLIDE 51

Pairwise ordering of axes - scaling

Comparing flower shapes. Flowers in order (50 in each group); no variates repeated.

Iris data

Sepal.Length Sepal.Width Petal.Length Petal.Width

slide-52
SLIDE 52

Pairwise ordering of axes - Scaling

Comparing flower shapes. Flowers in order (50 in each group); All pairs appear (Eulerian).

Iris data

Sepal.Length Sepal.Width Petal.Length Sepal.Length Petal.Width Sepal.Width Petal.Length Petal.Width

slide-53
SLIDE 53

Parallel axes

Axes might also be laid out equi-spaced in parallel to one another.

Sepal.Length Sepal.Width Petal.Length Petal.Width

These are also called parallel coordinate plots (Inselberg).

slide-54
SLIDE 54

Parallel axes

Axes might also be laid out equi-spaced in parallel to one another.

Sepal.Length Sepal.Width Petal.Length Petal.Width

Each point becomes a “curve” formed from the line segments connecting the values on each coordinate axis. Shown above is one Iris Setosa flower’s measurements.
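
A minimal base R sketch of this construction for a single flower follows. The one-line scale01 and the choice of the first row of iris are assumptions of the example.

```r
# Sketch: one flower as a piecewise-linear "curve" across parallel axes.
scale01 <- function(x) (x - min(x)) / diff(range(x))
X <- apply(as.matrix(iris[, 1:4]), 2, scale01)   # each variate scaled to [0, 1]

axis_pos   <- seq_len(ncol(X))    # equi-spaced horizontal positions of the axes
one_flower <- X[1, ]              # heights at which the curve crosses each axis

if (interactive()) {
  plot(axis_pos, one_flower, type = "l", ylim = c(0, 1), col = "steelblue",
       xaxt = "n", xlab = "", ylab = "scaled value")
  axis(1, at = axis_pos, labels = colnames(X))
  abline(v = axis_pos, col = "grey")             # the parallel axes
}
```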

slide-55
SLIDE 55

Parallel axes

Overlaying all Iris Setosa flowers:

Sepal.Length Sepal.Width Petal.Length Petal.Width

The collection of flowers shows positive and negative correlations.

slide-56
SLIDE 56

Parallel axes

Overlaying all Iris Versicolor flowers:

Sepal.Length Sepal.Width Petal.Length Petal.Width

Each collection of flowers shows correlations and distinctive shapes.

slide-57
SLIDE 57

Parallel axes

Overlaying all Iris Virginica flowers:

Sepal.Length Sepal.Width Petal.Length Petal.Width

Each collection of flowers shows correlations and distinctive shapes.

slide-58
SLIDE 58

Parallel axes

All three species together:

Sepal.Length Sepal.Width Petal.Length Petal.Width

Each collection of flowers shows correlations and distinctive shapes.

slide-59
SLIDE 59

Parallel axes

All three species together and all pairs of variates:

Sepal.Length Sepal.Width Petal.Length Sepal.Length Petal.Width Sepal.Width Petal.Length Petal.Width

Each collection of flowers shows correlations and distinctive shapes.

slide-60
SLIDE 60

Line-point duality

Each pair of parallel axes corresponds to a pair of orthogonal axes. Patterns seen in a scatterplot should also appear as patterns in the parallel coordinates. There is in fact a line-point duality that connects the geometry within an orthogonal coordinate pair system to the geometry of a parallel coordinate pair system.

◮ A point in a scatterplot (pair of orthogonal coordinates) is a line between parallel axes (a pair of parallel coordinates)
◮ A line in a scatterplot (pair of orthogonal coordinates) is a point of intersection of lines between parallel axes (a pair of parallel coordinates)
    ◮ intersection may occur outside the parallel axes, possibly at infinity
    ◮ i.e. if you continue the lines beyond the axes they will intersect
◮ two parallel lines (same slope) in a scatterplot correspond to two different intersection points one above the other in a parallel coordinate plot;
◮ lines with common intercepts in a scatterplot do NOT produce points of intersection in the parallel coordinate plot at the same vertical location; changing slope changes the horizontal and vertical position of the point of intersection.
◮ a curve in a scatterplot is a curve of intersection points in parallel coordinates (think tangent lines, or lines formed by successive points along the curve)
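
The duality can be checked numerically. The sketch below assumes the x-axis stands at horizontal position 0 and the y-axis at position 1; the function names are made up for the example. Under that layout, the segments for all points on the line y = a + b x meet at horizontal position 1/(1 - b) and height a/(1 - b), provided b is not 1.

```r
# Assumed layout: x-axis at horizontal position 0, y-axis at position 1.
dual_point <- function(a, b) {
  stopifnot(b != 1)   # slope 1: the segments are parallel (intersection at infinity)
  c(h = 1 / (1 - b), v = a / (1 - b))
}

# Height, at horizontal position h, of the segment for scatterplot point (x, y).
segment_height <- function(x, y, h) (1 - h) * x + h * y

a <- 2; b <- -0.5                      # an arbitrary line y = a + b * x
x <- c(-1, 0, 3, 10); y <- a + b * x   # four points on that line
pt <- dual_point(a, b)
heights <- segment_height(x, y, pt[["h"]])   # all equal to pt[["v"]]
```

Lines with the same slope share the horizontal position h and differ only in v, while changing the slope moves both h and v, in line with the bullets above.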

slide-61
SLIDE 61

Line-point duality

These points can be easily illustrated using loon.

library(loon)
oldDir <- getwd()
# Change directory to wherever you saved this code
DataViz <- "/Users/rwoldford/Documents/Admin/courses/Data Visualization/Lectures/Slides"
ThisLecture <- "4. Higher dimensional data/A. Non-Cartesian Representations"
Rcode <- paste(DataViz,ThisLecture,"R", sep="/")
setwd(Rcode)
source("parallelCoordPointLine.R")
setwd(oldDir)

Select each group of interest; invert the others and deactivate them; brush points and lines to observe connection. Select subsets of points and move them around. Jitter points.

Reactivate all points, select and return all to their original position. Repeat with different groups (select by colour).

slide-62
SLIDE 62

Parallel coordinates

data <- iris[,-5]
title <- "Iris data"
library(MASS)
parcoord(data, col=cols, main=title)

Iris data

Sepal.Length Sepal.Width Petal.Length Petal.Width

slide-63
SLIDE 63

Parallel coordinates

data <- olive[,-c(1,2)]
title <- "Olive data"
parcoord(data, col=get_cols(olive[,"Area"]),lwd=1, main=title)

Olive data

palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic

slide-64
SLIDE 64

Some structure

A structure in 7d space

Axes: dim 1 to dim 7.

A line in 7 dimensions is always a line (or a point) in every lower dimensional projection.

slide-65
SLIDE 65

Some structure

A structure in 7d space

Axes: dim 1 to dim 7.

Another line in 7 dimensions. Again it must appear as a line (or a point) in every lower dimensional projection.

slide-66
SLIDE 66

Some structure

A structure in 7d space

Axes: dim 1 to dim 7.

Several (actually 5) different lines in 7 dimensions, each given a different colour. More easily explored in loon.

slide-67
SLIDE 67

Some structure

A 3d helix embedded in 10d space

Axes: dim 1 to dim 10.

Clearly some structure is visible. Note projections which are nearly a line, and those where several parallel lines appear. In the latter there seem to be alternating slopes.

slide-68
SLIDE 68

Some structure

A paraboloid in 2d space

Axes: dim 1, dim 2.

Smooth curvature indicates some smooth curves in those dimensions.

slide-69
SLIDE 69

Some structure

A paraboloid in 6d space

Axes: dim 1 to dim 6.

Smooth curvature indicates some smooth curves in those dimensions.

slide-70
SLIDE 70

Some structure

A structure in 7d space

Axes: dim 1 to dim 7.

A plane; the display narrows (or the lines become more nearly parallel) the closer the projection is to edge-on.

slide-71
SLIDE 71

Parallel axes glyphs

From mountains.R (adapted from code by Dan Carr)

data <- iris[,1:4]
mountains(data)

slide-72
SLIDE 72

Interactive analysis

In loon, both radial axes and parallel axes are special cases of serial axes plots:

library(loon)
labels <- paste(olive[,"Region"], ":\n ", olive[, "Area"])
l_serialaxes(oliveAcids, linkingGroup = "olive",
             itemLabel = labels, showItemLabels = TRUE,
             showAxesLabels = FALSE, axesLayout = "parallel")
l_plot(oliveAcids, linkingGroup = "olive",
       itemLabel = labels, showItemLabels = TRUE)
areas <- as.numeric(olive[,"Area"])
l_hist(areas, linkingGroup = "olive")

slide-73
SLIDE 73

Example analysis

The vignette file minorities.Rmd from loon provides an interactive analysis which makes use of various serial axes plots.

slide-74
SLIDE 74

High dimensional data as glyphs

What if the dimensionality is very large? Can we still have glyphs which show all dimensions? One thought would be to use a square (or perhaps rectangular) “pixel” or “heatmap”.

◮ every point in the high-dimensional space is represented by an array of tiny squares (or pixels)
◮ each tiny square (or pixel) is an element of the array and its value is that of a specific variate for that point
◮ each tiny square is coloured so that its colour encodes the value of the variate

In this way high dimensional data (hundreds and even thousands of variates) appear as a single picture glyph for each point. N.B. Need to decide on the mapping of variates to elements of the array.

slide-75
SLIDE 75

High dimensional data as glyphs

Some desirable features of the mapping of variates to array locations:

◮ order of the variates be preserved
◮ variates that are close to one another in order be close to one another in the array (e.g. especially desirable when each point is a regular time series)
◮ in the case where the variates of each point are ordered in time, say, then if the resolution of time becomes finer (e.g. from days to hours, or hours to minutes), the positions in the array should not change much as the array becomes larger

Proposed solutions: regular and recursive “space filling” curves

◮ Hilbert curve
◮ Morton curve
◮ Keim’s recursive rectangles
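
To give a feel for how such a curve assigns variates to cells, here is a minimal sketch of the Morton (Z-order) mapping. It is not the glyphs package's implementation. It assumes zero-based variate indices, with the even-numbered bits of the index giving the column and the odd-numbered bits the row; the name morton_xy is made up for the example.

```r
# Sketch: Morton (Z-order) curve, mapping a variate index d = 0, 1, 2, ...
# to a cell (x, y) by de-interleaving the bits of d.
morton_xy <- function(d, bits = 8L) {
  x <- numeric(length(d))
  y <- numeric(length(d))
  for (k in 0:(bits - 1L)) {
    x <- x + ((d %/% 2^(2 * k))     %% 2) * 2^k   # even bits -> column
    y <- y + ((d %/% 2^(2 * k + 1)) %% 2) * 2^k   # odd bits  -> row
  }
  cbind(x = x, y = y)
}

cells <- morton_xy(0:15)   # the first 16 variates fill a 4 x 4 block
```

Because the first 4, 16, 64, ... indices always fill a square block, refining the resolution enlarges the array without moving the coarse positions much, which is the third desirable feature listed above.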

slide-76
SLIDE 76

High dimensional data as glyphs

Hilbert:

slide-77
SLIDE 77

High dimensional data as glyphs

Morton:

slide-78
SLIDE 78

High dimensional data as glyphs

Keim:

slide-79
SLIDE 79

High dimensional data as glyphs - S&P 500 data

For example, the SP500 data contains the S&P 500 constituents' stock data from 1962 (when at least one of the constituents is available) to 2015; where the data is not available, it is returned as missing.

The constituents information is on the following website: http://en.wikipedia.org/wiki/List_of_S%26P_500_companies. The data set comes from the qrmdata package on CRAN. See help(SP500_const) for more information.

library(qrmdata)
data("SP500_const") # load the constituents data from qrmdata

Here we will select three years of data around the financial crash of 2008.

time <- c("2007-01-03", "2009-12-31") # specify time period
data_sp500 <- SP500_const[paste0(time, collapse = "/"),] # grab out data

The data has dimensions dim(data_sp500) = 756, 505. Each column is a “point”; it corresponds to one of the stocks that constitute the S&P 500 index. Each row corresponds to a trading day and is a “variate”; the total number of trading days is therefore the dimensionality of each point (stock). The variate values for each stock are the adjusted close prices for the various trading days. The data has 756 dimensions, each corresponding to a single trading day from January 3, 2007 to December 31, 2009. (Note: there are only 5 trading days per week.)

slide-80
SLIDE 80

High dimensional data as glyphs - S&P 500 data

For example, some stock values in the first seven days (from January 3, 2007 to January 11, 2007 inclusive) are listed below.

  MMM    ABT   ABBV    ACN    ACE   ATVI   ADBE
61.93  18.41     NA  30.48  50.27   7.95  39.92
61.68  18.76     NA  31.16  49.61   8.02  40.82
61.26  18.76     NA  30.73  49.25   7.97  40.62
61.40  18.82     NA  31.17  49.28   7.97  40.45
61.47  18.99     NA  31.10  48.77   7.96  39.63
61.60  19.05     NA  31.28  48.76   8.13  39.22
62.24  19.04     NA  31.16  49.36   8.54  39.88

Each row is a dimension/variate/day and each column is headed by the “stock ticker” abbreviation for that stock (e.g. ABT is “Abbott Laboratories”). Note also that some data is missing (i.e. recorded as NA). We’ll remove all stocks with missing data from consideration, and split the data into a list of stocks (for the glyph-making functions) as follows:

cases <- complete.cases(t(data_sp500)) # identifies cases that have no NAs
x <- na.omit(t(data_sp500)) # omit the missing data
data_omitNA <- split(x,row(x)) # split the data into list
str(data_omitNA[1:3]) # present the first three stocks in the data list
## List of 3
## $ 1: num [1:756] 61.9 61.7 61.3 61.4 61.5 ...
## $ 2: num [1:756] 18.4 18.8 18.8 18.8 19 ...
## $ 3: num [1:756] 30.5 31.2 30.7 31.2 31.1 ...
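
The transpose, filter and split steps can be tried on a tiny made-up matrix to see what each one does. The tickers AAA, BBB and CCC and all values below are invented for illustration only.

```r
# Toy stand-in for data_sp500: rows are days (variates), columns are stocks.
toy <- matrix(c(1, 2, 3,  NA, 5, 6,  7, 8, 9), nrow = 3,
              dimnames = list(paste0("day", 1:3), c("AAA", "BBB", "CCC")))

cases  <- complete.cases(t(toy))   # TRUE for stocks with no missing days
x      <- na.omit(t(toy))          # stocks as rows, incomplete stocks dropped
stocks <- split(x, row(x))         # one numeric vector (time series) per stock
kept   <- colnames(toy)[cases]     # tickers of the stocks that were kept
```

The list elements are named "1", "2", ... rather than by ticker, so `cases` is needed later to line the list up with the ticker information.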

slide-81
SLIDE 81

High dimensional data as glyphs - S&P 500 data

Moreover, the S&P 500 data also provides the sectors to which stocks belong (see SP500_const_info). Below are the sectors and subsectors of the first few stocks:

knitr::kable(head(SP500_const_info, 7))

Ticker  Sector                  Subsector
MMM     Industrials             Industrial Conglomerates
ABT     Health Care             Health Care Equipment & Services
ABBV    Health Care             Pharmaceuticals
ACN     Information Technology  IT Consulting & Other Services
ACE     Financials              Property & Casualty Insurance
ATVI    Information Technology  Home Entertainment Software
ADBE    Information Technology  Application Software

For example, we might look at stocks from a few different subsectors (n.b. use of cases).

bank_loc <- which(SP500_const_info$Subsector[cases] == "Banks")
pharma_loc <- which(SP500_const_info$Subsector[cases] == "Pharmaceuticals")
oil_loc <- which(SP500_const_info$Subsector[cases] == "Oil & Gas Exploration & Production")
dataSubsectors <- data_omitNA[c(bank_loc, pharma_loc, oil_loc)]

slide-82
SLIDE 82

High dimensional data as glyphs - S&P 500 data

We could also calculate the averages over each subsector as

nBanks <- length(bank_loc)
bank_ave <- Reduce("+", data_omitNA[bank_loc])/nBanks
nPharmas <- length(pharma_loc)
pharma_ave <- Reduce("+", data_omitNA[pharma_loc])/nPharmas
nOils <- length(oil_loc)
oil_ave <- Reduce("+", data_omitNA[oil_loc])/nOils

and add them to the data at the end:

nStocks <- length(dataSubsectors)
dataSubsectors[seq(nStocks+1, nStocks +3)] <- list(bank_ave, pharma_ave, oil_ave)
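
The Reduce("+", ...) idiom adds the series element by element, so dividing by the number of series gives a day-by-day average. A toy check with made-up numbers:

```r
# Toy check: Reduce("+", list) / n is the element-wise average of the series.
series <- list(c(1, 2, 3), c(3, 4, 5), c(5, 6, 7))   # three made-up "stocks"
ave <- Reduce("+", series) / length(series)          # average series, day by day
same <- rowMeans(do.call(cbind, series))             # equivalent computation
```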

slide-83
SLIDE 83

High dimensional data as glyphs - S&P 500 data

Now we can begin to construct some glyphs (blue is high, red low value, grey is average).

library(colorspace)
cols <- rev(diverge_hcl(21)) # diverge from red to blue

and use the glyphs package. First we need to make sure that we are using the correct make_glyphs() (namely the one from glyphs and not loon).

unloadNamespace("glyphs")
library("glyphs")

The glyphs can now be created using the make_glyphs function from the glyphs package.

slide-84
SLIDE 84

High dimensional data as glyphs - S&P 500 data

Create all three types of encodings: Hilbert, Morton, and Keim’s recursive rectangles. First Hilbert and Morton:

SP500Hilbert <- make_glyphs(dataSubsectors, glyph_type = "Hilbert",
                            origin = "mean", cols = cols)

SP500Morton <- make_glyphs(dataSubsectors, glyph_type = "Morton",
                           origin = "mean", cols = cols)

With the Keim recursive rectangle, height and width parameters can be used to effect a layout that is more natural to this problem.

slide-85
SLIDE 85

High dimensional data as glyphs - S&P 500 data

With the Keim recursive rectangle, height and width parameters can be used to effect a layout that is more natural to this problem.

width = c(5, 1, 12, 1) # set the widths
height = c(1, 4, 1, 3) # and the heights for rectangles

This organizes weekdays into a week (5 columns of 1 row), weeks into a month (1 column of 4 rows), months into a year (12 columns of 1 row), and years into separate rows (1 column of 3 rows).
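
As a quick arithmetic check on this layout, the number of cells available at each level of nesting is the running product of width times height. The level names below are labels chosen for this sketch.

```r
# Sketch: how many cells each level of the recursive rectangle holds.
width  <- c(5, 1, 12, 1)
height <- c(1, 4, 1, 3)

cells_per_level <- width * height     # 5 days, 4 weeks, 12 months, 3 years
capacity <- cumprod(cells_per_level)  # cells in a week, month, year, whole glyph
names(capacity) <- c("week", "month", "year", "all")
capacity
```

The whole glyph has 720 cells, slightly fewer than the 756 trading days in the data, so the week/month/year structure is only approximate. How the remaining days are handled is up to the glyph-making function and is not covered by this sketch.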

slide-86
SLIDE 86

High dimensional data as glyphs - S&P 500 data

The Keim recursive rectangle with width = c(5, 1, 12, 1), height = c(1, 4, 1, 3)

[Figure: cell layout of the Keim recursive rectangle with width = c(5, 1, 12, 1) and height = c(1, 4, 1, 3), cells numbered in order.]

slide-87
SLIDE 87

High dimensional data as glyphs - S&P 500 data

So, we’ll use Keim recursive rectangle with width = c(5, 1, 12, 1), height = c(1, 4, 1, 3) for the S&P 500 data.

# With Keim's recursive rectangle we can
# approximate weeks, months, years
SP500Keim <- make_glyphs(dataSubsectors, glyph_type = "rectangle",
                         width = width, height = height,
                         origin = "mean", cols = cols)

slide-88
SLIDE 88

High dimensional data as glyphs - S&P 500 data

Plot these glyphs using this little ad hoc function

doit <- function (glyphs, main ="", labels = NULL, labelCol = "grey30") {
  x <- getGridXY(length(glyphs)) # get coordinates for each glyph
  plot_glyphs(x, glyphs = glyphs, axes = FALSE, xlab = "", ylab = "",
              glyphWidth = 0.8, glyphHeight = 0.6,
              main = main, cex.main = 0.8)
  if (!is.null(labels)) text(x, labels = labels, col = labelCol)
}

Then to look at the individual stocks via Hilbert encoding, simply execute

doit(SP500Hilbert[1:nStocks])

slide-89
SLIDE 89

High dimensional data as glyphs - S&P 500 data

To select the different sectors and averages, a number of indices will be handy

ave_indices <- (nStocks+1):(nStocks+3)
bank_indices <- 1:nBanks
pharma_indices <- (nBanks+1):(nBanks+nPharmas)
oil_indices <- (nBanks + nPharmas +1):nStocks

We now examine these for each encoding.

slide-90
SLIDE 90

High dimensional data as glyphs - S&P 500 data

doit(SP500Hilbert[ave_indices], main = "Hilbert - Subsector averages", labels = c("Banks", "Pharma", "Oil"))

Hilbert − Subsector averages

Banks Pharma Oil

slide-91
SLIDE 91

High dimensional data as glyphs - S&P 500 data

doit(SP500Morton[ave_indices], main = "Morton - Subsector averages", labels = c("Banks", "Pharma", "Oil"))

Morton − Subsector averages

Banks Pharma Oil

slide-92
SLIDE 92

High dimensional data as glyphs - S&P 500 data

doit(SP500Keim[ave_indices], main = "Keim's rectangle - Subsector averages", labels = c("Banks", "Pharma", "Oil"))

Keim's rectangle − Subsector averages

Banks Pharma Oil

slide-93
SLIDE 93

High dimensional data as glyphs - S&P 500 data

doit(SP500Keim[bank_indices], main = "Keim's rectangle - Banks", labels = as.vector(SP500_const_info[cases,][bank_loc,]$Ticker), labelCol="black")

Keim's rectangle − Banks

BAC BK BBT C CMA FITB HBAN JPM KEY MTB PNC STI USB WFC ZION

slide-94
SLIDE 94

High dimensional data as glyphs - S&P 500 data

doit(SP500Keim[pharma_indices], main = "Keim's rectangle - Pharmas", labels = as.vector(SP500_const_info[cases,][pharma_loc,]$Ticker), labelCol="black")

Keim's rectangle − Pharmas

AGN ENDP LLY MRK MYL PRGO PFE

slide-95
SLIDE 95

High dimensional data as glyphs - S&P 500 data

doit(SP500Keim[oil_indices], main = "Keim's rectangle - Oil", labels = as.vector(SP500_const_info[cases,][oil_loc,]$Ticker), labelCol="yellow")

Keim's rectangle − Oil

APC APA COG XEC COP DVN EOG EQT MRO NFX NBL OXY OKE PXD RRC SWN WMB