Three-dimensional Radial Visualization of High-dimensional Continuous or Discrete Datasets
Fan Dai, Yifan Zhu and Ranjan Maitra
Department of Statistics, Iowa State University
{fd43,yifanzhu,maitra}@iastate.edu
Motivation
Multivariate datasets
agriculture, engineering, genetics, social science. . .
Complex data structure
datasets with many discrete, skewed or correlated features
image, voice, surveys. . . need advanced methods for analysis and summaries
Display distinct groups while also conveying inherent variability
Dai, Zhu & Maitra RadViz3D for High-dimensional Data 2 / 34
Example: Gamma Ray Bursts (GRBs)
Extremely energetic explosions observed in distant galaxies.
data from NASA’s Burst and Transient Source Experiment
1,599 GRBs with complete information on 9 parameters
times for 50% and 90% of the flux to arrive (T50, T90), peak fluxes in different channels, time-integrated fluences over time-points
Nine heavily-skewed “parameters” or attributes
use of logarithms to reduce skewness
astrophysics community argued long over 2 or 3 types
analysis based on summary exclusion of some heavily-correlated attributes
recent analysis shows all 9 features important for clustering
actually 5 ellipsoidal groups, not 2 or 3
smaller-dimensional 9D example used as a test case
Visualization tools for continuous multivariate data
pairwise scatter plots
Pairwise Scatterplots: Gamma Ray Bursts
Background and Current Work
Visualization tools for continuous multivariate data
pairwise scatter plots
limited in providing multivariate assessments
parallel coordinates plot (Inselberg ’85, Wegman ’90)
Parallel Coordinate Plots: Gamma Ray Bursts
[Figure: parallel coordinate plot of the GRB data. Horizontal axis (variable): T50, T90, F1, F2, F3, F4, P64, P256, P1024; vertical axis: value, roughly −5.0 to 2.5]
Represent multidimensional data using lines.
a vertical line represents each dimension or attribute
p − 1 line segments, connected at the appropriately scaled value on each axis, represent each observation
polar version provided by star plot
Background and Current Work
Many approaches to display continuous multivariate data
pairwise scatter plots
limited in providing multivariate assessments
parallel coordinates plot (Inselberg ’85, Wegman ’90)
placement order matters, unclear for large n, p
hard to identify groups/patterns with even moderate n
Andrews’ curves represent each observation via trigonometric series
Andrews’ Curves: Gamma Ray Bursts
Plot each $X = (X_1, X_2, \ldots, X_p)$ as a curve:
$f(t) = x_1 + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots, \quad t \in [-\pi, \pi]$
Entire curve displays one observation
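A minimal sketch of how one observation could be turned into its Andrews' curve, following the series as written on this slide (constant term x1, then alternating sine and cosine terms of increasing frequency). The function names are hypothetical, and the classical definition of Andrews' curves scales the first term as x1/√2, which this sketch does not do.

```python
import math


def andrews_curve(x, t):
    """Evaluate f(t) = x1 + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ..."""
    value = x[0]
    for k, coef in enumerate(x[1:], start=1):
        harmonic = (k + 1) // 2  # frequencies 1, 1, 2, 2, 3, 3, ...
        if k % 2 == 1:
            value += coef * math.sin(harmonic * t)  # x2, x4, ... are sine terms
        else:
            value += coef * math.cos(harmonic * t)  # x3, x5, ... are cosine terms
    return value


def andrews_curve_points(x, n_points=101):
    """Sample the curve of one observation on an even grid over [-pi, pi]."""
    grid = [-math.pi + 2 * math.pi * i / (n_points - 1) for i in range(n_points)]
    return [(t, andrews_curve(x, t)) for t in grid]
```

Each observation yields one list of (t, f(t)) pairs; plotting all lists on common axes gives the display.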
Background and Current Work
Many approaches to display continuous multivariate data
pairwise scatter plots
limited in providing multivariate assessments
parallel coordinates plot (Inselberg ’85, Wegman ’90)
placement order matters, unclear for large n, p
polar version provided by star plot
Andrews’ curves
order in which coordinate enters series important
very computationally intensive for larger p
Star coordinates plot
represents coordinate axes as equi-angled rays extending from center
order matters, optimized (van Long & Linsen ’11)
Use springs to display observation (radial visualization)
Two-dimensional radial visualization (RadViz2D)
Uses Hooke’s law to project data onto unit circle
place p springs (anchor points) on the rim
pull each spring by value relative to coordinate from center
observations w/ similar relative values in all attributes end up closer to
center, others are closer to the anchor points
order of placement of springs affects display
refinements to improve RadViz2D exist (see later)
RadViz2D Illustration
$X = (X_1, X_2, X_3, X_4, X_5) = (0.7, 0.5, 0.3, 0.2, 0.7)$
Maps $X \in \mathbb{R}^p$ to 2D point $\Psi^\bullet(X; U) = UX / \mathbf{1}_p' X$
U projection matrix, columns (anchor points) on $S^1$
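The map on this slide can be sketched in a few lines: the display point is the weighted average of the anchor points, with the coordinates of X as weights. The sketch below assumes equi-spaced anchors starting at angle 0 (the slide does not state the starting angle) and non-negative coordinates; the function names are hypothetical.

```python
import math


def equispaced_anchors(p):
    """p anchor points equally spaced on the unit circle (start angle 0 is an assumption)."""
    return [(math.cos(2 * math.pi * j / p), math.sin(2 * math.pi * j / p))
            for j in range(p)]


def radviz(x, anchors):
    """Map x to UX / (1'X), the x-weighted average of the anchor points."""
    total = sum(x)
    if total <= 0:
        raise ValueError("coordinates must be non-negative and not all zero")
    dim = len(anchors[0])
    return tuple(sum(xj * u[d] for xj, u in zip(x, anchors)) / total
                 for d in range(dim))


# The observation used on this slide, with five equi-spaced anchors.
x = (0.7, 0.5, 0.3, 0.2, 0.7)
y = radviz(x, equispaced_anchors(5))
```

Because `radviz` only averages anchor tuples, the same function works unchanged for anchors on a sphere.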
Two-dimensional radial visualization (RadViz2D)
Effective for sparse data, in evaluating distinct groups
Nonlinear map distorts, affects interpretability
High-dimensional observations more difficult to visualize
Can fully 3D extension improve performance?
Viz3D provides third dimension, constant for all observations (Artero &
de Oliveira, ’04)
Generalizing Radial Visualization
Allow anchor points in U on $S^q$, $q > 1$, not necessarily equi-spaced
p springs at $u_1, u_2, \ldots, u_p \in S^q$, with spring constants $X_1, X_2, \ldots, X_p$
equilibrium point $Y \in \mathbb{R}^{q+1}$ of system satisfies
$\sum_{j=1}^{p} X_j (Y - u_j) = 0$,
$Y = \Psi(X; U) = UX / \mathbf{1}_p' X$ solves the system.
$\Psi$ is line-, point-ordering- and convexity-invariant.
scaling every coordinate to be in [0,1] allows for $Y \in S^q$.
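The step from the equilibrium condition to the closed form is short and may be worth spelling out. The derivation below assumes the spring constants are non-negative and not all zero, so that the denominator is positive.

```latex
\begin{align*}
\sum_{j=1}^{p} X_j (Y - u_j) = 0
  &\iff Y \sum_{j=1}^{p} X_j = \sum_{j=1}^{p} X_j u_j \\
  &\iff Y = \frac{\sum_{j=1}^{p} X_j u_j}{\sum_{j=1}^{p} X_j}
        = \frac{UX}{\mathbf{1}_p' X}.
\end{align*}
```

Under that assumption Y is a convex combination of the anchor points, so $\|Y\| \le \max_j \|u_j\| = 1$.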
Placement of Anchor Points
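Suppose: coordinates of X are uncorrelated.
For $X_1, X_2 \in \mathbb{R}^p$, let $Y_i = \Psi(X_i; U)$, $i = 1, 2$.
Euclidean distance between $Y_1$ and $Y_2$ is
$\|Y_1 - Y_2\|^2 = \left( \frac{X_1}{\mathbf{1}_p' X_1} - \frac{X_2}{\mathbf{1}_p' X_2} \right)' U'U \left( \frac{X_1}{\mathbf{1}_p' X_1} - \frac{X_2}{\mathbf{1}_p' X_2} \right)$.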
$X_i$, $X_j$ very dissimilar, with perfect negative correlation, should be placed as far apart as possible (in opposite directions) in our radial visualization.
However, $\|Y_i - Y_j\|^2 \to 0$ as $\langle u_i, u_j \rangle \to 0$.
may create artificial visual correlation between ith and jth coordinates if $\langle u_i, u_j \rangle < \pi/2$
need $u_j$s as far from each other as possible; so evenly distributed
$S^q$: for larger q, can get larger angles between $u_j$s
Also place positively correlated coordinates close together
q > 1 has advantage in placing multiple coordinates together
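A small numerical illustration of the anchor-placement argument, under a simplifying assumption that is not from the slides: two observations that each load entirely on a different coordinate are mapped exactly onto the two corresponding anchors, so their squared display distance is 2 − 2 cos θ, where θ is the angle between the anchors. The function names are hypothetical.

```python
import math


def radviz(x, anchors):
    """Weighted average of anchor points, with the coordinates of x as weights."""
    total = sum(x)
    dim = len(anchors[0])
    return tuple(sum(xj * u[d] for xj, u in zip(x, anchors)) / total
                 for d in range(dim))


def squared_display_distance(theta):
    """Squared distance between the images of (1, 0) and (0, 1) when the
    two anchors are separated by angle theta on the unit circle."""
    anchors = [(1.0, 0.0), (math.cos(theta), math.sin(theta))]
    y1 = radviz((1.0, 0.0), anchors)
    y2 = radviz((0.0, 1.0), anchors)
    return sum((a - b) ** 2 for a, b in zip(y1, y2))
```

Evaluating `squared_display_distance` at decreasing angles shows the two maximally dissimilar observations collapsing together as the anchors approach each other, which is the distortion that well-separated anchors are meant to limit.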
Three-dimensional Radial Visualization
q = 2 in our generalization yields RadViz3D:
equi-spaced anchor points for 5 Platonic solids, p = 4, 6, 8, 12, 20.
closely related to Thomson problem in traditional molecular quantum chemistry (Atiyah & Sutcliffe ’03).
for other p, approximate through Fibonacci grid, jth anchor point:
$u_{j1} = \cos(2\pi j \varphi^{-1}) \sqrt{1 - u_{j3}^2}$,
$u_{j2} = \sin(2\pi j \varphi^{-1}) \sqrt{1 - u_{j3}^2}$,
$u_{j3} = \frac{2j - 1}{p} - 1$,
where $\varphi = (1 + \sqrt{5})/2$ is the golden ratio (González ’10).
distributes anchor points along generative spiral on $S^2$, with consecutive points as separated as possible
satisfies "well-separation" property (Saff & Kuijlaars ’97).
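The Fibonacci-grid formulas translate directly into code. The sketch below computes the p anchor points for j = 1, ..., p using u_j3 = (2j − 1)/p − 1 and the golden-ratio rotation 2πj/φ; the function name is hypothetical.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def fibonacci_anchors(p):
    """Approximately equi-spaced anchor points on the unit sphere S^2."""
    anchors = []
    for j in range(1, p + 1):
        z = (2 * j - 1) / p - 1               # u_j3, evenly spaced in (-1, 1)
        radius = math.sqrt(1 - z * z)         # radius of the circle at height z
        angle = 2 * math.pi * j / GOLDEN_RATIO
        anchors.append((radius * math.cos(angle), radius * math.sin(angle), z))
    return anchors
```

The heights are evenly spaced while the angle advances by a golden-ratio fraction of a full turn at each step, which is what traces the spiral.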
4D Examples simulated via MixSim package in R
[Figure: RadViz2D, Viz3D and RadViz3D displays (anchors x1 to x4) of two simulated 4D datasets, one with $\ddot\omega = 10^{-3}$ and one with $\ddot\omega = 10^{-2}$]
Higher-dimensional Datasets
Display p anchor points infeasible, even for moderate p
placement of equally-spaced anchor points built on not inducing spurious positive correlations in display
with increasing p, harder to guarantee such outcome
Project high-dimensional data to uncorrelated coordinates but preserve distinctiveness and variability in groups
Principal Components finds mutually orthogonal projections summarizing proportion of total variance, but does not account for groups.
Maximum-Ratio Projection (MRP)
Step 1: Obtain PCs (orthogonal $V_g$) for each group
Find orthogonal W closest to all $V_g$
Project X with W and then obtain MRP
Step 2: Obtain uncorrelated projections that maximize between-group sums of squares and cross products (SSCP) relative to the total SSCP.
Let T, B be (p.d.) total & between-group corrected SSCP.
$\hat{v}_j = T^{-1/2} \hat{w}_j / \|T^{-1/2} \hat{w}_j\|$, $j = 1, 2, \ldots, k$
$\hat{w}_j$, $j = 1, 2, \ldots, k$ are the eigenvectors corresponding, in decreasing order, to the k largest eigenvalues of $T^{-1/2} B T^{-1/2}$.
$k \le G - 1$, chosen by scree plot/quality of display
$G \le 4$ needs $4 - G + 1$ more projections w/ null contribution
needs p.d. T, does not hold if $p > \min_g n_g$
MRP maximizes separation between groups (in projected space) relative to total variability.
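Step 2 of the MRP can be sketched with NumPy as below. This is an illustrative reading of the slide, not the authors' implementation: it assumes T is positive definite (so Step 1 is not needed and is not shown), takes T as the total corrected SSCP and B as the between-group SSCP, and uses a hypothetical function name.

```python
import numpy as np


def max_ratio_projection(X, labels, k):
    """Return k unit-norm MRP directions (columns) and their eigenvalues."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    centered = X - grand_mean
    T = centered.T @ centered                     # total corrected SSCP
    B = np.zeros_like(T)
    for g in np.unique(labels):
        Xg = X[labels == g]
        d = (Xg.mean(axis=0) - grand_mean)[:, None]
        B += Xg.shape[0] * (d @ d.T)              # between-group SSCP

    # Symmetric inverse square root of T; requires T positive definite.
    t_vals, t_vecs = np.linalg.eigh(T)
    if t_vals.min() <= 1e-12 * t_vals.max():
        raise ValueError("T is not positive definite; reduce dimension first")
    T_inv_sqrt = t_vecs @ np.diag(t_vals ** -0.5) @ t_vecs.T

    # Leading eigenvectors of T^{-1/2} B T^{-1/2}, largest eigenvalue first.
    m_vals, m_vecs = np.linalg.eigh(T_inv_sqrt @ B @ T_inv_sqrt)
    order = np.argsort(m_vals)[::-1][:k]
    V = T_inv_sqrt @ m_vecs[:, order]
    V /= np.linalg.norm(V, axis=0)                # v_j = T^{-1/2} w_j / ||T^{-1/2} w_j||
    return V, m_vals[order]
```

The projected data are then `X @ V`. Each returned eigenvalue is the share of total variation along that direction that is due to differences between groups, and at most G − 1 of them are nonzero.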
500D Examples
[Figure: RadViz2D, Viz3D and RadViz3D displays (anchors x1 to x4) of two simulated 500D datasets, one with $\ddot\omega = 10^{-3}$ and one with $\ddot\omega = 10^{-2}$]
Datasets with Skewed Attributes
Consider a r.v. X with CDF FX(x).
$F_X(X) \sim U(0, 1) \Rightarrow Y = \Phi^{-1}[F_X(X)] \sim N(0, 1)$.
call the above (classical) Gaussianized Distributional Transform (CGDT)
marginal application of CGDT specifies distribution on X with desired marginal and correlation structure.
CGDT standardizing transform, more stringent than usual affine 0-mean, unit-variance inducing transform
CGDT matches all marginal quantiles to N(0,1)
Apply to skewed datasets or with unclear marginals
Before applying MRP and RadViz3D
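In practice the CDF of each attribute is unknown, so some estimate has to stand in for it. The sketch below is one possible empirical version of the CGDT for a single continuous attribute: it replaces F_X by rank/(n + 1), which is an assumption of this sketch (the slides state only the population transform), and it assumes no ties. The function name is hypothetical.

```python
from statistics import NormalDist


def empirical_cgdt(values):
    """Map a sample to normal scores: Phi^{-1}(rank / (n + 1))."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank                 # rank 1 = smallest value
    standard_normal = NormalDist()
    # Dividing by n + 1 keeps the estimated CDF strictly inside (0, 1).
    return [standard_normal.inv_cdf(rank / (n + 1)) for rank in ranks]
```

Applied to each attribute separately, this removes marginal skewness while preserving the ordering of the observations within that attribute.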
Applications: Gamma Ray Bursts Dataset
[Figure: RadViz2D, Viz3D and RadViz3D displays (anchors X1 to X4) of the GRB dataset. Legend: Groups 1, 2, 3, 4, 5]
Heavily skewed attributes, so CGDT appropriate
Results indicate 5 overlapping clusters
some suggestion of 2, 3 super-types of GRBs
Applications: Face Recognition
112×92 images of 6/40 faces at 10 light angles/conditions.
(20×14) DWT2 (LL band) of wavelet-transformed images with 280 features (Jadhav & Holambe, 2009)
Applications: Face Recognition
[Figure: RadViz2D, Viz3D and RadViz3D displays (anchors X1 to X4). Legend: Persons A, B, C, D, E, F]
marginals unclear: use CGDT
RadViz3D clarifies all 6 people the best
Datasets with Discrete Attributes
For discrete-valued variable X, CDF $F_X(X) \not\sim U(0, 1)$ because of discreteness.
CGDT currently not applicable
Note that the CDF is only right continuous
Solution proposed by Rüschendorf (2013) via the generalized distributional transform
Generalized Distributional Transform (GDT)
Definition
Let X be a real-valued RV with CDF $F_X(\cdot)$ and let $V \sim U(0, 1)$ be a RV independent of X. The generalized distributional transform of X is $U = F(X, V)$ where $F(x, \lambda) \doteq P(X < x) + \lambda P(X = x) = F_X(x-) + \lambda [F_X(x) - F_X(x-)]$ is the generalized CDF of X.
Theorem
Let $U = F(X, V)$ be the generalized distributional transform of X. Then $U \sim \mathrm{Uniform}(0, 1)$ and $X = F_X^{-1}(U)$ a.s.
where $F^{-1}(t) = \inf\{x \in \mathbb{R} : F_X(x) \ge t\}$ is the generalized inverse, or the quantile transform, of $F_X(\cdot)$.
Use $F(X, V)$ in place of $F_X(X)$, calculate GDT as before
use of GDT on non-discriminating coordinate can spuriously bestow it hyper-importance
suggest ANOVA test on each GDT-ed coordinate, control FDR
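For a single discrete attribute, the generalized distributional transform F(x, λ) = P(X < x) + λ P(X = x) can be sketched as follows. The probabilities are replaced by sample proportions, which is an assumption of this sketch, and the function names are hypothetical. Each observation gets its own independent uniform draw, so tied values are spread across the interval that their CDF jump covers.

```python
import random
from collections import Counter
from statistics import NormalDist


def _open_unit_draw(rng):
    """Uniform draw from the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def generalized_distributional_transform(values, rng=None):
    """U_i = P(X < x_i) + V_i * P(X = x_i), with sample proportions for P."""
    rng = rng or random.Random()
    n = len(values)
    counts = Counter(values)
    below = {}
    running = 0
    for v in sorted(counts):
        below[v] = running / n              # estimate of P(X < v)
        running += counts[v]
    return [below[v] + _open_unit_draw(rng) * counts[v] / n for v in values]


def gaussianized_gdt(values, rng=None):
    """Apply the standard normal quantile function to the transformed values."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(u)
            for u in generalized_distributional_transform(values, rng)]
```

For a balanced binary attribute, for instance, the zeros land in (0, 0.5) and the ones in (0.5, 1) before the normal quantile function is applied. Because the result depends on the random draws, fixing the seed of `rng` makes a display reproducible.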
Illustration: Simulated Binary Datasets
[Figure: RadViz2D, Viz3D and RadViz3D displays (anchors x1 to x4) of two simulated binary datasets, one with low and one with high clustering complexity]
Applications: Senate Voting Records
109th US Congress (2005-06) had 542 (Y/N/NR) Senate votes
55 Republicans, 44 Democrats, 1 (D-caucus) Independent (VT) (Banerjee et al, 2008)
combine N/NR to get dataset of binary attributes
Applications: Senate Voting Records
[Figure: RadViz2D, Viz3D and RadViz3D displays of the Senate voting data. Legend: Democratic, Republican]
G = 2 so only 1 MRP with positive eigenvalue
spring X1 pulls members of one party towards itself more
X2, X3, X4 pull senators from both parties with equal (non-discriminating) force
Applications: Handwritten Indic Scripts
(Map Acknowledgment: Surveyor- General of India)
Handwritten scripts from Bangla (east), Gujarati (west), Gurmukhi (north), Kannada and Malayalam (southern states of Karnataka and Kerala), Urdu (Persian script), with 116 mixed features (Obaidullah et al, 2017).
Applications: Handwritten Indic Scripts
[Figure: RadViz2D, Viz3D and RadViz3D displays (anchors X1 to X4). Legend: Bangla, Gujarati, Gurmukhi, Kannada, Malayalam, Urdu]
Viz3D (to a lesser extent RadViz2D) separates Urdu, Kannada and Gujarati, not the other 3 languages
RadViz3D best in classifying all the 6 scripts
also points to difficulty of problem
Applications: RNA Sequences
Gene expression levels, in FPKM, of RNA sequences from 13 human organs.
focus on 8 largest (in terms of the sample size) organs
esophagus (659), colon (339), thyroid (318), lung (313), breast (212), stomach (159), liver (115) and prostate (106)
p=20242 discrete features
some have many discrete values, essentially continuous
dataset of mixed attributes.
Display for distinctiveness of samples from each organ
Applications: RNA Sequences
[Figure: RadViz2D, Viz3D and RadViz3D displays (anchors X1 to X7). Legend: Breast, Colon, Esophagus, Liver, Lung, Prostate, Stomach, Thyroid]
RadViz2D, Viz3D poorer at separating organs
RadViz3D indicates clear separation between organs
colon and stomach have some marginal overlap.
Conclusions and Further Work
Visualization tool for HD datasets
RadViz3D for more comprehensive display of grouped data
MRP, GDT for discrete, mixed, skewed variates
displays distinct groups more accurately
R package https://github.com/fanne-stat/radviz3d
manuscript https://arxiv.org/abs/1904.06366
Number of issues merit further attention
MRP linear; non-linear projections better?
extend for categorical (non-binary) attributes
GDT/MRP with other tools for improved visualization