Beyond three dimensions: Non-Cartesian representations. R.W. Oldford
Beyond three dimensions
Hey! I’m a starfish!
Moving from 0 to 1 to 2 to 3 dimensions has, at each step, been an advance that required an adjustment in our thinking. Each advance has allowed us to “visualize” more, and more complex, information. Each step requires training.
Cartesian coordinates
Cartesian coordinates use orthogonal axes and a grid system to place each point in space. They are named after René Descartes (1596-1650). But what about four or more dimensions?
When we have n individual cases, each one with its own values for each of p variates, we typically represent the ith case mathematically as a vector x_i. The collection of data forms a “cloud” of n points in a p-dimensional space. Mathematics doesn’t care what value p > 0 takes. How can a point cloud be visualized in more than three dimensions?
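As a concrete illustration, the iris measurements that ship with R (and that are used throughout these slides) can be treated as such a point cloud. This is only a sketch; the names X, n, p and x_1 are chosen here for illustration.

```r
# The four iris measurements as an n x p data matrix: one row per case.
X <- as.matrix(iris[, 1:4])
n <- nrow(X)    # number of cases (points in the cloud)
p <- ncol(X)    # number of variates (dimensions of the space)
x_1 <- X[1, ]   # the first case as a vector in p-dimensional space
c(n = n, p = p)
```

Each row of X is one point; the question in the slide is how to draw all n rows when p exceeds three.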
Four dimensional data - example
Edgar Anderson’s Iris data: 150 flowers, 50 from each of three different species: Iris Setosa, Iris Versicolor, and Iris Virginica.
Four measurements
◮ sepal length
◮ sepal width
◮ petal length
◮ petal width
Four dimensional data - example
Learning from maps, we could use glyphs, especially ones that meaningfully convey the information in the data (provide an ostensive “pictorial form”).
Edgar Anderson proposed:
Four dimensional data - example
Iris Setosa:
Centre glyph is the average.
Four dimensional data - example
Iris Versicolor:
Centre glyph is the average.
Four dimensional data - example
Using averages to compare and locate species
Anderson’s real interest lay not in discriminating between species but in whether the middle species (in terms of number of chromosomes and habitat) had relative distance of 1:2 from the two extreme species. This proportion roughly matched the chromosome ratios.
Higher dimensional glyphs
Anderson’s nested rectangular glyphs were natural for his problem. In a sense they were cartoon versions of the flower parts themselves. So, what might be the most natural glyph for us? We know that, for a variety of reasons, possibly including
◮ distinguishing predators and prey,
◮ distinguishing friend from foe,
◮ identifying close relations,
◮ recognizing mood and reaction,
◮ social interaction of all kinds,
we have evolved to be very good at recognizing, distinguishing, and reading faces, and that parts of our brain are “wired” to deal with faces.
In 1973 Herman Chernoff suggested we use faces as glyphs.
Chernoff faces
[Photograph: Herman Chernoff.]
Constructing a Chernoff face
Variate values are assigned to different cartoon face features. A variety of faces have been implemented since Chernoff’s original proposal.
Chernoff faces
Original Chernoff Faces, 1973.
Chernoff faces
There are a couple of R packages that draw cartoon faces (i.e. variations on Chernoff faces). The first we will look at is the TeachingDemos package.
library(TeachingDemos)
The faces2(...) function can accommodate (up to) 18 dimensions. The face features are assigned to columns 1 to (at most) 18 as follows:
1. Width of center
2. Top vs. Bottom width (height of split)
3. Height of Face
4. Width of top half of face
5. Width of bottom half of face
6. Length of Nose
7. Height of Mouth
8. Curvature of Mouth (abs < 9)
9. Width of Mouth
10. Height of Eyes
11. Distance between Eyes (.5-.9)
12. Angle of Eyes/Eyebrows
13. Circle/Ellipse of Eyes
14. Size of Eyes
15. Position Left/Right of Eyeballs/Eyebrows
16. Height of Eyebrows
17. Angle of Eyebrows
18. Width of Eyebrows
Chernoff faces
To demonstrate, we first scale the data to be between 0 and 1 for each variate
data <- iris[,1:4] # the four measurements on Anderson's irises
scale01 <- function (x) {
  xrange <- range(x)
  (x - min(xrange))/diff(xrange)
}
data <- apply(data, 2, scale01)
faces2(data[c(1, 51, 101), ], which=c(8, 13, 14, 17), scale = "none", nrows=1, ncols=3)
Chernoff faces
With the appropriate packages, we can explore how variable these features are
library(tkrplot)
## Loading required package: tcltk
if(interactive()){
  tke2 <- rep(list(list('slider', from=0, to=1, init=0.5, resolution=0.1)), 18)
  names(tke2) <- c('CenterWidth','TopBottomWidth','FaceHeight','TopWidth',
                   'BottomWidth','NoseLength','MouthHeight','MouthCurve','MouthWidth',
                   'EyesHeight','EyesBetween','EyeAngle','EyeShape','EyeSize','EyeballPos',
                   'EyebrowHeight','EyebrowAngle','EyebrowWidth')
  tkfun2 <- function(...){
    tmpmat <- rbind(Min=0, Adjust=unlist(list(...)), Max=1)
    faces2(tmpmat, scale='none')
  }
  tkexamp(tkfun2, list(tke2), plotloc='left', hscale=2, vscale=2)
}
Chernoff faces
We could choose different variate-to-feature mappings
data <- iris[,1:4] # the four measurements on Anderson's irises
scale01 <- function (x) {
  xrange <- range(x)
  (x - min(xrange))/diff(xrange)
}
data <- apply(data, 2, scale01)
faces2(data[c(1, 51, 101), ], which=c(6, 7, 8, 12), scale = "none", nrows=1, ncols=3)
Chernoff faces
For all of the irises (in random order)
n <- nrow(data)
faces2(data[sample(1:n, n, replace = FALSE), ], which=c(6, 7, 8, 12), nrows=10, ncols=15)
Chernoff faces
The olive data (150 randomly selected from 572; 8 dimensions)
faces2(data[sample(1:nrow(data), 150, replace = FALSE), ], nrows=10, ncols=15, which=c(1, 2, 3, 4, 6, 7, 8, 12), scale = "none")
Chernoff faces
Problems with faces
◮ Eye movements (saccades).
◮ Some facial features are more important than others.
◮ Grouping will depend on which variates are matched to features.
◮ Not so useful for exploratory work, OK (with care) for infographics.
Florence Nightingale’s rose plot
Areas (centre to edge) are proportional to value. Time series.
Florence Nightingale’s rose plot
E.g. The Atlas of Britain and Northern Ireland, 1963. Note colour used to identify variate. Area proportional to magnitude.
Florence Nightingale’s rose plot
There is a function in R called stars(...) which will produce Nightingale’s rose style glyphs.
n <- nrow(data)
stars(data[sample(1:n, n, replace = FALSE), ], nrow=10, ncol=15, main = "Nightingale Roses (almost)", draw.segments=TRUE, cex=1.2)
Florence Nightingale’s rose plot
[Figure: “Nightingale Roses (almost)” for the iris data.]
Florence Nightingale’s rose plot
[Figure: “Nightingale Roses (almost)” for the olive data.]
Florence Nightingale’s rose plot
Note that our use here is very different from Nightingale’s:
◮ Nightingale used overlapping segments (hence the rose)
◮ represented time series (typically strong positive correlation between neighbouring wedges)
  ◮ i.e. large beside/near large, small beside/near small
◮ very few roses
In contrast:
◮ our use is more like that of the 1963 British Atlas example,
◮ no overlapping segments
◮ different variates shown at each wedge
  ◮ no expectation of correlation’s sign
◮ many “stars” or “roses”
◮ looking to group like-looking shapes.
Nightingale’s Crimean War data could have been represented as stars with 3 segments (one for each cause of death) and a separate star for each month.
Radial Axes
[Figure: four radial axes labelled Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]
Axes as equi-angular radii of a circle.
Radial Axes
Data scaled with dataset’s minimum value at centre, maximum at end (for each variate).
Radial Axes
One flower. Four measurements.
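The construction just described (equi-angular axes, each variate scaled so that the dataset minimum sits at the centre and the maximum at the end) can be sketched in base R. The helper radial_vertices is a name invented here, and the choice of starting angle is an assumption; stars() may orient its axes differently.

```r
# Scale each variate to [0, 1] over the whole data set (minimum at the centre).
scale01 <- function(x) (x - min(x)) / diff(range(x))
scaled <- apply(iris[, 1:4], 2, scale01)

# Vertices of the radial-axes glyph for one case: axis j points in direction
# theta_j = 2 * pi * (j - 1) / p and the scaled value is the distance from the centre.
radial_vertices <- function(values) {
  p <- length(values)
  theta <- 2 * pi * (seq_len(p) - 1) / p
  cbind(x = values * cos(theta), y = values * sin(theta))
}

v <- radial_vertices(scaled[1, ])   # the first flower
if (interactive()) {
  plot(v, type = "n", asp = 1, xlim = c(-1, 1), ylim = c(-1, 1), xlab = "", ylab = "")
  polygon(v, border = "steelblue")
}
```

Joining the vertices in axis order gives the polygon for one flower; overlaying the polygons for many flowers gives the displays on the following slides.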
Radial Axes
All of the setosa data at once. Correlations?
Radial Axes
All of the versicolor data at once. Correlations?
Radial Axes
All of the virginica data at once. Correlations?
Radial Axes
[Figure panels: Iris Setosa, Iris Versicolor, Iris Virginica.]
All three species. Comments?
Radial Axes
[Figure panels: averages for Iris Setosa, Iris Versicolor, Iris Virginica.]
All three species. Comments?
Radial Axes
[Figure panels: Iris Setosa, Iris Versicolor, Iris Virginica, and all together.]
All three species. Comments?
Radial Axes
[Figure panels: averages and individuals for Iris Setosa, Iris Versicolor, Iris Virginica.]
All three species. Comments?
Radial Axes
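Data explorations using stars(...) again
n <- nrow(data)
ran.order <- sample(1:n, n, replace = FALSE) # assumed: a random ordering of the flowers
stars(data[ran.order,], lwd=1.25, labels=NULL, radius=FALSE, col.lines = rep("steelblue", n), nrow=10, ncol = 15, key.loc=c(12,-0.5), main="Iris data")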
Radial Axes
[Figure: “Iris data” stars with key: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]
Radial Axes
radius = TRUE
[Figure: “Iris data” stars drawn with their radii.]
Radial Axes
Filled by setting col.stars = rep("steelblue", n)
[Figure: “Iris data” filled stars.]
Radial Axes
Locating the glyphs on a plot:
stars(data,
      locations=data[, 1:2], # locate using first two variates
      lwd=0.5,               # Narrow the line
      len=.05,               # NEED to adjust this one
      labels=NULL, radius=TRUE,
      col.lines = rep(adjustcolor("steelblue", 0.5), n),
      col.stars = rep(adjustcolor("steelblue", 0.5), n),
      nrow=10, ncol = 15, key.loc=c(12,-0.5), main="Iris data")
Radial Axes
Locating the glyphs on a plot:
[Figure: “Iris data” star glyphs located on a plot.]
Radial Axes
The olive data
[Figure: “Olive data” stars with key: palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic.]
Radial Axes
Locating the glyphs on a plot:
[Figure: “Olive data” star glyphs located on a plot.]
Radial Axes
The effect of correlation can be seen with the following artificial set of data:
w <- runif(100); x <- runif(100); y <- x; z <- -y
fakedata <- data.frame(w = w, x = x, y = y, z = z)
[Figure: “Fake data” stars with axes w, x, y, z.]
Correlations between neighbouring axes:
cor.w.x.     cor.x.y.   cor.y.z.   cor.z.w.
-0.0304026   1          -1         0.0304026
Note that two correlations are missing: cor(x, z) and cor(w, y)
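The neighbouring and the missing correlations can be computed directly. The sketch below regenerates data of the same form with an arbitrary seed (the seed is an assumption, so the values involving w will differ from those shown in the slide).

```r
set.seed(1)   # arbitrary seed; only the correlations involving w depend on it
w <- runif(100); x <- runif(100); y <- x; z <- -y
fakedata <- data.frame(w = w, x = x, y = y, z = z)
cors <- cor(fakedata)

# Pairs that are neighbours in the axis order w, x, y, z (z wraps around to w)
neighbours <- c(w.x = cors["w", "x"], x.y = cors["x", "y"],
                y.z = cors["y", "z"], z.w = cors["z", "w"])
# The two pairs that never sit beside each other in this order
missing <- c(x.z = cors["x", "z"], w.y = cors["w", "y"])
neighbours
missing
```

Because y is a copy of x and z is its negative, cor(x, z) is exactly -1 and cor(w, y) equals cor(w, x), yet neither pair can be read off adjacent axes in this ordering.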
Radial Axes - overlaid
Overlaying all observations reveals the correlational patterns.
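[Figure: all observations of the fake data overlaid on radial axes w, x, y, z.]
Correlations between neighbouring axes:
cor.w.x.     cor.x.y.   cor.y.z.   cor.z.w.
-0.0304026   1          -1         0.0304026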
Serial Axes
Cartesian coordinate systems are based on orthogonal axes
◮ every axis is orthogonal to every other axis
◮ every set of axes is orthogonal to every other distinct set
◮ which variates are matched to which axes doesn’t matter
In contrast, radial axes are an example of a serial axis system.
◮ variates are most easily compared if they appear beside each other.
◮ the order of the axes matters (in that the display produced is different)
With serial axes systems the axes follow an order and the order affects the display.
Pairwise ordering of axes
Repeating axes/variates allows any variate to appear beside (and be compared with) any other. Consider the complete graph having variates as nodes.
◮ every path is a selection and ordering of axes
◮ an Eulerian path will have every pair of variates appear beside one another
◮ an ordering of all axes with no repeats is a Hamiltonian path
If we have some measure (e.g. correlation) on a pair of variates, we could use these as weights to choose an ordering. For example,
◮ the early part of a greedy Eulerian will tend to have small/large weights
◮ a weighted Hamiltonian could be used to produce an ordering of minimal/maximal total weight
The choices could produce very different displays.
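To make the distinction concrete, the following base R sketch checks whether a given axis order places every pair of variates side by side. The function names are invented here for illustration; the Eulerian order used is the one that appears on the fake-data slide (w x y w z x y z, written as indices).

```r
# Unordered pairs of neighbouring axes in a given axis order.
adjacent_pairs <- function(ord) {
  a <- head(ord, -1)
  b <- tail(ord, -1)
  unique(paste(pmin(a, b), pmax(a, b), sep = "-"))
}

# TRUE if every one of the choose(p, 2) pairs appears side by side.
covers_all_pairs <- function(ord, p) {
  all_pairs <- apply(combn(p, 2), 2, paste, collapse = "-")
  all(all_pairs %in% adjacent_pairs(ord))
}

euler_order    <- c(1, 2, 3, 1, 4, 2, 3, 4)  # w x y w z x y z
hamilton_order <- 1:4                        # w x y z, no repeats

covers_all_pairs(euler_order, 4)
covers_all_pairs(hamilton_order, 4)
```

The Hamiltonian order shows each variate once but only three of the six pairs as neighbours; the Eulerian order repeats axes so that all six pairs appear. (On radial axes the last and first axes are also neighbours, which this sketch does not count.)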
Pairwise ordering of axes - Eulerian
On the fake data
[Figure: fake data with axes in Eulerian order: w, x, y, w, z, x, y, z.]
Pairwise ordering of axes - Eulerian
The iris data
[Figure: “Iris data” with axes in Eulerian order: Sepal.Length, Sepal.Width, Petal.Length, Sepal.Length, Petal.Width, Sepal.Width, Petal.Length, Petal.Width.]
Pairwise ordering of axes - Eulerian
The olive data
[Figure: “Olive data” with axes in Eulerian order: palmitic, palmitoleic, stearic, palmitic, oleic, palmitoleic, stearic, oleic, linoleic, palmitic, linolenic, palmitoleic, linoleic, stearic, linolenic, oleic, linoleic, linolenic, arachidic, palmitic, eicosenoic, palmitoleic, arachidic, stearic, eicosenoic, oleic, arachidic, linoleic, eicosenoic, linolenic, arachidic, eicosenoic.]
Pairwise ordering of axes - Scaling
Occasionally it makes sense to scale the data in other ways (not just by variate ranges). For example, for the iris data, scaling each flower by its largest measure (sepal length) compares the flower shapes and ignores flower sizes.
data <- iris[,-5] # the iris data
title <- "Iris data"
# Scale by row
data <- apply(data, 1, scale01)
# transpose to get the flowers as rows
data <- t(data)
# A function to get a colour for each species (not available to you)
cols <- get_cols(iris[,"Species"])
# First no repetition of the variates; preserve order
stars(data, labels=NULL, main=title, radius=TRUE, lwd=1.25,
      col.lines = cols, col.stars = cols,
      nrow=10, ncol = 15, key.loc=c(12,-0.5),
      ## data scaling used as is
      scale=FALSE)
# Now all pairs appear, again flowers appear in order
# (assumed here: p is the number of variates and eseq() comes from the PairViz package)
p <- ncol(data)
stars(data[,eseq(p)], labels=NULL, main=title, radius=TRUE, lwd=1.25,
      col.lines = cols, col.stars = cols,
      nrow=10, ncol = 15, key.loc=c(12,-0.5),
      ## data scaling used as is
      scale=FALSE)
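Since get_cols() is not supplied with the slides, a reader who wants to run the code needs some stand-in. The following is a hypothetical substitute, not the author's function: it simply assigns one colour per factor level, and both the name get_cols_sketch and the palette are assumptions.

```r
# Hypothetical stand-in for get_cols(): one colour per level of a grouping factor.
get_cols_sketch <- function(groups,
                            palette = c("firebrick", "steelblue", "forestgreen")) {
  groups <- as.factor(groups)
  stopifnot(nlevels(groups) <= length(palette))  # need a colour for every level
  palette[as.integer(groups)]
}

cols <- get_cols_sketch(iris[, "Species"])
table(cols)
```

The result is a character vector with one colour per flower, which is the form that col.lines and col.stars expect in stars().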
Pairwise ordering of axes - scaling
Comparing flower shapes. Flowers in order (50 in each group); no variates repeated.
[Figure: “Iris data” stars with key: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]
Pairwise ordering of axes - Scaling
Comparing flower shapes. Flowers in order (50 in each group); all pairs appear (Eulerian).
[Figure: “Iris data” stars with key in Eulerian order: Sepal.Length, Sepal.Width, Petal.Length, Sepal.Length, Petal.Width, Sepal.Width, Petal.Length, Petal.Width.]
Parallel axes
Axes might also be laid out equi-spaced in parallel to one another.
[Figure: four parallel axes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]
These are also called parallel coordinate plots (Inselberg).
Parallel axes
Each point becomes a “curve” formed from the line segments connecting the values on each coordinate axis.
[Figure: one Iris Setosa flower’s measurements drawn on the four parallel axes.]
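A minimal base R sketch of this construction follows. It assumes each variate is scaled to [0, 1] over the data set and that the axes sit at horizontal positions 1 to p; the names polyline_xy and axis_pos are chosen here for illustration.

```r
# Scale each variate to [0, 1] so that all axes share one vertical range.
scale01 <- function(x) (x - min(x)) / diff(range(x))
scaled <- apply(iris[, 1:4], 2, scale01)

p <- ncol(scaled)
axis_pos <- seq_len(p)                 # horizontal positions of the parallel axes
one_flower <- scaled[1, ]              # the first (setosa) flower

# The "curve" for one point: its value on axis j is plotted at (j, value_j).
polyline_xy <- cbind(x = axis_pos, y = one_flower)

if (interactive()) {
  plot(polyline_xy, type = "l", xaxt = "n", ylim = c(0, 1),
       xlab = "", ylab = "scaled value", col = "steelblue")
  axis(1, at = axis_pos, labels = colnames(scaled))
  abline(v = axis_pos, col = "grey")   # draw the axes themselves
}
```

Drawing one such polyline per row of the scaled data gives the overlaid displays.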
Parallel axes
Overlaying all Iris Setosa flowers:
[Figure: all Iris Setosa flowers on the four parallel axes.]
The collection of flowers shows positive and negative correlations.
Parallel axes
Overlaying all Iris Versicolor flowers:
[Figure: all Iris Versicolor flowers on the four parallel axes.]
Each collection of flowers shows correlations and distinctive shapes.
Parallel axes
Overlaying all Iris Virginica flowers:
[Figure: all Iris Virginica flowers on the four parallel axes.]
Each collection of flowers shows correlations and distinctive shapes.
Parallel axes
All three species together:
[Figure: all three species on the four parallel axes.]
Each collection of flowers shows correlations and distinctive shapes.
Parallel axes
All three species together and all pairs of variates:
[Figure: all three species on parallel axes in Eulerian order: Sepal.Length, Sepal.Width, Petal.Length, Sepal.Length, Petal.Width, Sepal.Width, Petal.Length, Petal.Width.]
Each collection of flowers shows correlations and distinctive shapes.
Line-point duality
Each pair of parallel axes corresponds to a pair of orthogonal axes. Patterns seen in a scatterplot should also appear as patterns in the parallel coordinates. There is in fact a line-point duality that connects the geometry within an orthogonal coordinate pair system to the geometry of a parallel coordinate pair system.
◮ A point in a scatterplot (pair of orthogonal coordinates) is a line between parallel axes (a pair of parallel coordinates)
◮ A line in a scatterplot (pair of orthogonal coordinates) is a point of intersection of lines between parallel axes (a pair of parallel coordinates)
  ◮ intersection may occur outside the parallel axes, possibly at infinity
  ◮ i.e. if you continue the lines beyond the axes they will intersect
◮ two parallel lines (same slope) in a scatterplot correspond to two different intersection points, one above the other, in a parallel coordinate plot;
◮ lines with common intercepts in a scatterplot do NOT produce points of intersection in the parallel coordinate plot at the same vertical location; changing slope changes the horizontal and vertical position of the point of intersection.
◮ a curve in a scatterplot is a curve of intersection points in parallel coordinates (think tangent lines, or lines formed by successive points along the curve)
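The duality can be verified numerically. The sketch below assumes the two parallel axes are vertical and sit at horizontal positions 0 and 1 (a convention chosen here, not stated in the slides), so that a scatterplot point (x, y) becomes the line through (0, x) and (1, y). For a scatterplot line y = m x + b with m ≠ 1, all such lines meet at horizontal position 1/(1 - m) and height b/(1 - m).

```r
# Height, at horizontal position t, of the parallel-coordinate line for point (x, y).
height_at <- function(x, y, t) x + t * (y - x)

m <- -2; b <- 3              # a scatterplot line y = m * x + b, with m != 1
x <- c(-1, 0, 0.5, 2)        # several points on that line
y <- m * x + b

t_star <- 1 / (1 - m)        # horizontal position of the common intersection
h_star <- b / (1 - m)        # vertical position of the common intersection
height_at(x, y, t_star)      # every line passes through height h_star here

# A parallel scatterplot line (same slope, different intercept):
b2 <- 6
t_star2 <- 1 / (1 - m)       # same horizontal position
h_star2 <- b2 / (1 - m)      # different height: "one above the other"
```

With a negative slope the intersection falls between the axes (0 < t < 1); as m approaches 1 the position 1/(1 - m) moves off to infinity, matching the bullet about intersections outside the axes.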
Line-point duality
These points can be easily illustrated using loon.
library(loon)
oldDir <- getwd()
# Change directory to wherever you saved this code
DataViz <- "/Users/rwoldford/Documents/Admin/courses/Data Visualization/Lectures/Slides"
ThisLecture <- "4. Higher dimensional data/A. Non-Cartesian Representations"
Rcode <- paste(DataViz, ThisLecture, "R", sep="/")
setwd(Rcode)
source("parallelCoordPointLine.R")
setwd(oldDir)
Select each group of interest; invert the others and deactivate them; brush points and lines to observe the connection. Select subsets of points and move them around. Jitter points.
Reactivate all points, select and return all to their original position. Repeat with different groups (select by colour).
Parallel coordinates
data <- iris[,-5]
title <- "Iris data"
library(MASS)
parcoord(data, col=cols, main=title)
[Figure: “Iris data” parallel coordinates; axes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]
Parallel coordinates
data <- olive[,-c(1,2)]
title <- "Olive data"
parcoord(data, col=get_cols(olive[,"Area"]), lwd=1, main=title)
[Figure: “Olive data” parallel coordinates; axes: palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic.]
Some structure
A structure in 7d space
[Figure: parallel axes dim 1 to dim 7.]
A line in 7 dimensions is always a line (or a point) in every lower dimensional projection.
Some structure
A structure in 7d space
[Figure: parallel axes dim 1 to dim 7.]
Another line in 7 dimensions. Again it must appear as a line (or a point) in every lower dimensional projection.
Some structure
A structure in 7d space
[Figure: parallel axes dim 1 to dim 7.]
Several (actually 5) different lines in 7 dimensions, each given a different colour. More easily explored in loon.
Some structure
A 3d helix embedded in 10d space
[Figure: parallel axes dim 1 to dim 10.]
Clearly some structure is visible. Note projections which are nearly a line, and those where several parallel lines appear. In the latter there seem to be alternating slopes.
Some structure
A paraboloid in 2d space
[Figure: parallel axes dim 1 and dim 2.]
Smooth curvature indicates some smooth curves in those dimensions.
Some structure
A paraboloid in 6d space
[Figure: parallel axes dim 1 to dim 6.]
Smooth curvature indicates some smooth curves in those dimensions.
Some structure
A structure in 7d space
[Figure: parallel axes dim 1 to dim 7.]
A plane; narrows (or becomes more parallel) the closer the projection is to edge on.
Parallel axes glyphs
From mountains.R (adapted from code by Dan Carr)
data <- iris[,1:4]
mountains(data)
Interactive analysis
In loon, both radial axes and parallel axes are special cases of serial axes plots:
library(loon)
labels <- paste(olive[,"Region"], ":\n ", olive[, "Area"])
l_serialaxes(oliveAcids, linkingGroup = "olive", itemLabel = labels,
             showItemLabels = TRUE, showAxesLabels = FALSE, axesLayout = "parallel")
l_plot(oliveAcids, linkingGroup = "olive", itemLabel = labels, showItemLabels = TRUE)
areas <- as.numeric(olive[,"Area"])
l_hist(areas, linkingGroup = "olive")
Example analysis
The vignette file minorities.Rmd from loon provides an interactive analysis which makes use of various serial axes plots.
High dimensional data as glyphs
What if the dimensionality is very large? Can we still have glyphs which show all dimensions?
One thought would be to use a square (or perhaps rectangular) “pixel” or “heatmap” glyph.
◮ every point in the high-dimensional space is represented by an array of tiny squares (or pixels)
◮ each tiny square (or pixel) is an element of the array and its value is that of a specific variate for that point
◮ each tiny square is coloured so that its colour encodes the value of the variate
In this way high dimensional data (hundreds and even thousands of variates) appear as a single picture glyph for each point.
N.B. Need to decide on the mapping of variates to elements of the array.
High dimensional data as glyphs
Some desirable features of the mapping of variates to array locations:
◮ order of the variates be preserved
◮ variates that are close to one another in order be close to one another in the array (e.g. especially desirable when each point is a regular time series)
◮ in the case where each point is ordered in time, say, then if the resolution of time becomes finer (e.g. from days to hours, or hours to minutes), the positions in the array should not change much as the array becomes larger
Proposed solutions: regular and recursive “space filling” curves
◮ Hilbert curve
◮ Morton curve
◮ Keim’s recursive rectangles
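As an illustration of one such mapping, here is a minimal base R sketch of the Morton (Z-order) curve, which interleaves the binary digits of the variate's position to obtain array coordinates. It assumes a 0-based position d and a square array with 2^bits cells per side; morton_xy is a name invented here, and the glyphs package may implement the curve differently.

```r
# Morton (Z-order) mapping: even binary digits of d give x, odd digits give y.
morton_xy <- function(d, bits = 4) {
  x <- numeric(length(d))
  y <- numeric(length(d))
  for (k in 0:(bits - 1)) {
    x <- x + ((d %/% 2^(2 * k))     %% 2) * 2^k   # digit 2k of d -> digit k of x
    y <- y + ((d %/% 2^(2 * k + 1)) %% 2) * 2^k   # digit 2k+1 of d -> digit k of y
  }
  cbind(x = x, y = y)
}

# Positions of the first 16 variates in a 4 x 4 array.
pos <- morton_xy(0:15, bits = 2)
pos
```

The first four variates fill one 2 x 2 block, the next four the neighbouring block, and so on, so variates that are close in order tend to stay close in the array; refining the resolution subdivides blocks rather than rearranging them.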
High dimensional data as glyphs
Hilbert:
High dimensional data as glyphs
Morton:
High dimensional data as glyphs
Keim:
High dimensional data as glyphs - S&P 500 data
For example, the SP500 data contains stock data for the S&P 500 constituents from 1962 (when at least one of the constituents is available) to 2015; where data is not available, it is recorded as missing.
The constituents information is on the following website: http://en.wikipedia.org/wiki/List_of_S%26P_500_companies. The data set comes from the qrmdata package on CRAN. See help(SP500_const) for more information.
library(qrmdata)
data("SP500_const") # load the constituents data from qrmdata
Here we will select three years of data around the financial crash of 2008.
time <- c("2007-01-03", "2009-12-31") # specify time period
data_sp500 <- SP500_const[paste0(time, collapse = "/"),] # grab out data
The data has dimensions dim(data_sp500) = 756, 505. Each column is a “point”; it corresponds to one of the stocks that constitute the S&P 500 index. Each row corresponds to a trading day and is a “variate”; the total number of trading days is therefore the dimensionality of each point (stock). The variate values for each stock are the adjusted close prices for the various trading days. The data has 756 dimensions, each corresponding to a single trading day from January 3, 2007 to December 31, 2009. (Note: there are only 5 trading days per week.)
High dimensional data as glyphs - S&P 500 data
For example, the values of some stocks over the first seven days (from January 3, 2007 to January 11, 2007 inclusive) are listed below.
  MMM    ABT  ABBV    ACN    ACE  ATVI   ADBE
61.93  18.41    NA  30.48  50.27  7.95  39.92
61.68  18.76    NA  31.16  49.61  8.02  40.82
61.26  18.76    NA  30.73  49.25  7.97  40.62
61.40  18.82    NA  31.17  49.28  7.97  40.45
61.47  18.99    NA  31.10  48.77  7.96  39.63
61.60  19.05    NA  31.28  48.76  8.13  39.22
62.24  19.04    NA  31.16  49.36  8.54  39.88
Each row is a dimension/variate/day and each column is headed by the “stock ticker” abbreviation for that stock (e.g. ABT is “Abbott Laboratories”). Note also that some data is missing (i.e. recorded as NA). We’ll remove all stocks with missing data from consideration, and split the remainder into a list of stocks (for the glyph-making functions) as follows:
cases <- complete.cases(t(data_sp500)) # identifies cases that have no NAs
x <- na.omit(t(data_sp500)) # omit the missing data
data_omitNA <- split(x,row(x)) # split the data into list
str(data_omitNA[1:3]) # present the first three stocks in the data list
## List of 3
## $ 1: num [1:756] 61.9 61.7 61.3 61.4 61.5 ...
## $ 2: num [1:756] 18.4 18.8 18.8 18.8 19 ...
## $ 3: num [1:756] 30.5 31.2 30.7 31.2 31.1 ...
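The split(x, row(x)) idiom may be unfamiliar; a toy example (made-up numbers and row names, purely for illustration) shows what it does to a small matrix.

```r
# A toy 2 x 3 matrix: two "stocks" (rows) observed on three "days" (columns).
x <- matrix(1:6, nrow = 2, dimnames = list(c("stockA", "stockB"), NULL))

# row(x) gives the row index of every cell, so splitting on it
# collects each row's values into one list element.
by_row <- split(x, row(x))
str(by_row)
```

Each list element is one row of the matrix as a plain vector. The elements are named by row number ("1", "2", ...) rather than by row name, which is why the str() output above shows $ 1, $ 2, $ 3 instead of ticker symbols.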
High dimensional data as glyphs - S&P 500 data
Moreover, the S&P 500 data also provides the sectors to which stocks belong (see SP500_const_info). The sectors and subsectors of the first few stocks are shown below.
knitr::kable(head(SP500_const_info, 7))
Ticker  Sector                  Subsector
MMM     Industrials             Industrial Conglomerates
ABT     Health Care             Health Care Equipment & Services
ABBV    Health Care             Pharmaceuticals
ACN     Information Technology  IT Consulting & Other Services
ACE     Financials              Property & Casualty Insurance
ATVI    Information Technology  Home Entertainment Software
ADBE    Information Technology  Application Software
For example, we might look at stocks from a few different subsectors (n.b. use of cases).
bank_loc <- which(SP500_const_info$Subsector[cases] == "Banks")
pharma_loc <- which(SP500_const_info$Subsector[cases] == "Pharmaceuticals")
oil_loc <- which(SP500_const_info$Subsector[cases] == "Oil & Gas Exploration & Production")
dataSubsectors <- data_omitNA[c(bank_loc, pharma_loc, oil_loc)]
High dimensional data as glyphs - S&P 500 data
We could also calculate the averages over each subsector as
nBanks <- length(bank_loc)
bank_ave <- Reduce("+", data_omitNA[bank_loc])/nBanks
nPharmas <- length(pharma_loc)
pharma_ave <- Reduce("+", data_omitNA[pharma_loc])/nPharmas
nOils <- length(oil_loc)
oil_ave <- Reduce("+", data_omitNA[oil_loc])/nOils
and add them to the data at the end:
nStocks <- length(dataSubsectors)
dataSubsectors[seq(nStocks+1, nStocks +3)] <- list(bank_ave, pharma_ave, oil_ave)
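The Reduce("+", ...) / n pattern computes an element-wise (day-by-day) average across a list of equal-length series. A toy example with made-up numbers:

```r
# Three toy "stocks", each observed on the same three days.
series <- list(c(1, 2, 3), c(3, 4, 5), c(5, 6, 10))

# Reduce("+", ...) adds the vectors element by element; dividing by the
# number of series gives the average series.
ave <- Reduce("+", series) / length(series)
ave

# The same result via a matrix with one column per series.
ave_check <- rowMeans(do.call(cbind, series))
```

The averaged series has the same length as each input, so it can be appended to the list and drawn as a glyph like any individual stock.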
High dimensional data as glyphs - S&P 500 data
Now we can begin to construct some glyphs (blue is high, red low value, grey is average).
library(colorspace)
cols <- rev(diverge_hcl(21)) # diverge from red to blue
and use the glyphs package. First we need to make sure that we are using the correct make_glyphs() (namely the one from glyphs and not loon).
unloadNamespace("glyphs")
library("glyphs")
The glyphs can now be created using the make_glyphs function from the glyphs package.
High dimensional data as glyphs - S&P 500 data
Create all three types of encodings: Hilbert, Morton, and Keim’s recursive rectangles. First Hilbert and Morton:
SP500Hilbert <- make_glyphs(dataSubsectors, glyph_type = "Hilbert",
                            origin = "mean", cols = cols)
SP500Morton <- make_glyphs(dataSubsectors, glyph_type = "Morton",
                           origin = "mean", cols = cols)
High dimensional data as glyphs - S&P 500 data
With the Keim recursive rectangle, height and width parameters can be used to effect a layout that is more natural to this problem.
width = c(5, 1, 12, 1) # set the widths
height = c(1, 4, 1, 3) # and the heights for rectangles
This organizes weekdays into a week (5 columns of 1 row), weeks into a month (1 column of 4 rows), months into a year (12 columns of 1 row), and years into separate rows (1 column of 3 rows).
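The nesting implied by these two vectors can be worked out with a little arithmetic. The sketch below is not the glyphs package's implementation; keim_levels is a hypothetical helper that only decomposes a trading-day number into its (day, week, month, year) position under the stated widths and heights.

```r
width  <- c(5, 1, 12, 1)
height <- c(1, 4, 1, 3)

level_size <- width * height     # 5 days, 4 weeks, 12 months, 3 years
capacity   <- prod(level_size)   # number of cells in the full rectangle

# Mixed-radix position of the d-th trading day (d = 1, 2, ...):
# day within week, week within month, month within year, year.
keim_levels <- function(d, sizes = level_size) {
  k <- d - 1
  out <- numeric(length(sizes))
  for (i in seq_along(sizes)) {
    out[i] <- k %% sizes[i] + 1
    k <- k %/% sizes[i]
  }
  names(out) <- c("day", "week", "month", "year")
  out
}

capacity
keim_levels(21)   # first day of the first week of the second "month"
```

The rectangle holds 5 x 4 x 12 x 3 = 720 cells while the data has 756 trading days, so the layout only approximates calendar weeks, months and years; how the glyphs package treats the extra days is not covered here.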
High dimensional data as glyphs - S&P 500 data
The Keim recursive rectangle with width = c(5, 1, 12, 1), height = c(1, 4, 1, 3)
[Figure: grid of cells laid out by the Keim recursive rectangle with width = c(5, 1, 12, 1) and height = c(1, 4, 1, 3); the cells are numbered in order with the repeating digits 1 to 9, 0.]
High dimensional data as glyphs - S&P 500 data
So, we’ll use the Keim recursive rectangle with width = c(5, 1, 12, 1), height = c(1, 4, 1, 3) for the S&P 500 data.
# With Keim's recursive rectangle we can
# approximate weeks, months, years
SP500Keim <- make_glyphs(dataSubsectors, glyph_type = "rectangle",
                         width = width, height = height,
                         origin = "mean", cols = cols)
High dimensional data as glyphs - S&P 500 data
Plot these glyphs using this little ad hoc function
doit <- function (glyphs, main ="", labels = NULL, labelCol = "grey30") {
  x <- getGridXY(length(glyphs)) # get coordinates for each glyph
  plot_glyphs(x, glyphs = glyphs, axes = FALSE, xlab = "", ylab = "",
              glyphWidth = 0.8, glyphHeight = 0.6, main = main, cex.main = 0.8)
  if (!is.null(labels)) text(x, labels = labels, col = labelCol)
}
Then to look at the individual stocks via Hilbert encoding, simply execute
doit(SP500Hilbert[1:nStocks])
High dimensional data as glyphs - S&P 500 data
To select the different subsectors and averages, a number of indices will be handy
ave_indices <- (nStocks+1):(nStocks+3)
bank_indices <- 1:nBanks
pharma_indices <- (nBanks+1):(nBanks+nPharmas)
oil_indices <- (nBanks + nPharmas +1):nStocks
We now examine these for each encoding.
High dimensional data as glyphs - S&P 500 data
doit(SP500Hilbert[ave_indices], main = "Hilbert - Subsector averages", labels = c("Banks", "Pharma", "Oil"))
[Figure: “Hilbert - Subsector averages”; glyphs labelled Banks, Pharma, Oil.]
High dimensional data as glyphs - S&P 500 data
doit(SP500Morton[ave_indices], main = "Morton - Subsector averages", labels = c("Banks", "Pharma", "Oil"))
[Figure: “Morton - Subsector averages”; glyphs labelled Banks, Pharma, Oil.]
High dimensional data as glyphs - S&P 500 data
doit(SP500Keim[ave_indices], main = "Keim's rectangle - Subsector averages", labels = c("Banks", "Pharma", "Oil"))
[Figure: “Keim's rectangle - Subsector averages”; glyphs labelled Banks, Pharma, Oil.]
High dimensional data as glyphs - S&P 500 data
doit(SP500Keim[bank_indices], main = "Keim's rectangle - Banks", labels = as.vector(SP500_const_info[cases,][bank_loc,]$Ticker), labelCol="black")
[Figure: “Keim's rectangle - Banks”; glyphs labelled BAC, BK, BBT, C, CMA, FITB, HBAN, JPM, KEY, MTB, PNC, STI, USB, WFC, ZION.]
High dimensional data as glyphs - S&P 500 data
doit(SP500Keim[pharma_indices], main = "Keim's rectangle - Pharmas", labels = as.vector(SP500_const_info[cases,][pharma_loc,]$Ticker), labelCol="black")
[Figure: “Keim's rectangle - Pharmas”; glyphs labelled AGN, ENDP, LLY, MRK, MYL, PRGO, PFE.]
High dimensional data as glyphs - S&P 500 data
doit(SP500Keim[oil_indices], main = "Keim's rectangle - Oil", labels = as.vector(SP500_const_info[cases,][oil_loc,]$Ticker), labelCol="yellow")
[Figure: “Keim's rectangle - Oil”; glyphs labelled APC, APA, COG, XEC, COP, DVN, EOG, EQT, MRO, NFX, NBL, OXY, OKE, PXD, RRC, SWN, WMB.]