SLIDE 1

Three-dimensional Radial Visualization of High-dimensional Continuous or Discrete Datasets

Fan Dai, Yifan Zhu and Ranjan Maitra

Department of Statistics Iowa State University

{fd43,yifanzhu,maitra}@iastate.edu

SLIDE 2

Motivation

Multivariate datasets

agriculture, engineering, genetics, social science, ...

Complex data structure

datasets with many discrete, skewed or correlated features

images, voice, surveys, ...

need advanced methods for analysis and summaries

Display distinct groups while also showing their inherent variability

Dai, Zhu & Maitra RadViz3D for High-dimensional Data 2 / 34

SLIDE 3

Example: Gamma Ray Bursts (GRBs)

Extremely energetic explosions observed in distant galaxies.

data from NASA’s Burst and Transient Source Experiment

1,599 GRBs with complete information on 9 parameters

times for 50% and 90% of the flux to arrive (T50, T90), peak fluxes in different channels, time-integrated fluences over time-points

Nine heavily-skewed “parameters” or attributes

use of logarithms to reduce skewness

astrophysics community argued long over 2 or 3 types

analysis based on summary exclusion of some heavily-correlated attributes

recent analysis shows all 9 features important for clustering

actually 5 ellipsoidal groups, not 2 or 3

smaller-dimensional 9D example used as a test case

SLIDE 4

Visualization tools for continuous multivariate data

pairwise scatter plots

SLIDE 5

Pairwise Scatterplots: Gamma Ray Bursts

SLIDE 6

Background and Current Work

Visualization tools for continuous multivariate data

pairwise scatter plots

limited in providing multivariate assessments

parallel coordinates plot (Inselberg ’85, Wegman ’90)

SLIDE 7

Parallel Coordinate Plots: Gamma Ray Bursts

[Figure: parallel coordinate plot of the GRB data; horizontal axis: variable (T50, T90, F1, F2, F3, F4, P64, P256, P1024); vertical axis: value]

Represent multidimensional data using lines.

a vertical line represents each dimension or attribute.

p − 1 line segments, connected at the appropriately scaled value on each axis, represent each observation

polar version provided by star plot

SLIDE 8

Background and Current Work

Many approaches to display continuous multivariate data

pairwise scatter plots

limited in providing multivariate assessments

parallel coordinates plot (Inselberg ’85, Wegman ’90)

placement order matters, unclear for large n, p

hard to identify groups/patterns with even moderate n.

Andrews’ curves represent each observation via trigonometric series

SLIDE 9

Andrews’ Curves: Gamma Ray Bursts

Plot each $X = (X_1, X_2, \ldots, X_p)$ as a curve:

$$f(t) = x_1 + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots, \quad t \in [-\pi, \pi]$$

Entire curve displays one observation
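The curve above can be evaluated directly. The following is a minimal sketch in plain Python that follows the series exactly as written on this slide; the function names `andrews_curve` and `andrews_grid` and the grid size are illustrative choices, not part of the presentation.

```python
import math

def andrews_curve(x, t):
    """Evaluate f(t) = x1 + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ... at t."""
    value = x[0]
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2              # frequency of the term: 1, 1, 2, 2, 3, 3, ...
        if j % 2 == 1:
            value += xj * math.sin(k * t)
        else:
            value += xj * math.cos(k * t)
    return value

def andrews_grid(x, n_points=101):
    """Return (t values, curve values) on an equally spaced grid over [-pi, pi]."""
    ts = [-math.pi + 2.0 * math.pi * i / (n_points - 1) for i in range(n_points)]
    return ts, [andrews_curve(x, t) for t in ts]

# One observation with p = 5 attributes gives one curve.
ts, ys = andrews_grid([0.7, 0.5, 0.3, 0.2, 0.7])
```

Each observation needs one such curve, and the curve changes if the attributes are listed in a different order.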

SLIDE 10

Background and Current Work

Many approaches to display continuous multivariate data

pairwise scatter plots

limited in providing multivariate assessments

parallel coordinates plot (Inselberg ’85, Wegman ’90)

placement order matters, unclear for large n, p

polar version provided by star plot

Andrews’ curves

order in which coordinates enter the series is important

very computationally intensive for larger p

Star coordinates plot

represents coordinate axes as equi-angled rays extending from center

order matters; can be optimized (van Long & Linsen ’11)

Use springs to display observation (radial visualization)

SLIDE 11

Two-dimensional radial visualization (RadViz2D)

Uses Hooke’s law to project data onto unit circle

place p springs (anchor points) on the rim

pull each spring by value relative to coordinate from center

observations w/ similar relative values in all attributes end up closer to the center, others are closer to the anchor points

order of placement of springs affects display

refinements to improve RadViz2D exist (see later)

SLIDE 12

RadViz2D Illustration

$X = (X_1, X_2, X_3, X_4, X_5) = (0.7, 0.5, 0.3, 0.2, 0.7)$

Maps $X \in \mathbb{R}^p$ to the 2D point $\Psi^\bullet(X; U) = UX / \mathbf{1}_p' X$

U: projection matrix, columns (anchor points) on $S^1$
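As a rough illustration of the map on this slide, the sketch below computes the RadViz2D point for the example observation. It assumes p equi-spaced anchor points starting at angle 0, which is one possible orientation; the slide does not fix it, and the function names are hypothetical.

```python
import math

def circle_anchors(p):
    """p equi-spaced anchor points on the unit circle (assumed to start at angle 0)."""
    return [(math.cos(2.0 * math.pi * j / p), math.sin(2.0 * math.pi * j / p))
            for j in range(p)]

def radviz(x, anchors):
    """Weighted average of the anchor points, with weights x_j / sum(x)."""
    total = sum(x)
    dim = len(anchors[0])
    return tuple(sum(xj * u[d] for xj, u in zip(x, anchors)) / total
                 for d in range(dim))

x = [0.7, 0.5, 0.3, 0.2, 0.7]
y = radviz(x, circle_anchors(5))      # a point inside the unit circle
```

An observation with equal values in all attributes lands at the center, while one with a single non-zero attribute lands on the corresponding anchor point.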

SLIDE 13

Two-dimensional radial visualization (RadViz2D)

Uses Hooke’s law to project data onto unit circle

place p springs (anchor points) on the rim

pull each spring by value relative to coordinate from center

observations w/ similar relative values in all attributes end up closer to the center, others are closer to the anchor points

order of placement of springs affects display

refinements to improve RadViz2D exist (see later)

Effective for sparse data, in evaluating distinct groups

Nonlinear map distorts, affects interpretability

High-dimensional observations more difficult to visualize

Can a fully 3D extension improve performance?

Viz3D provides third dimension, constant for all observations (Artero & de Oliveira, ’04)

SLIDE 14

Generalizing Radial Visualization

Allow anchor points in U on $S^q$, $q > 1$, not necessarily equi-spaced

p springs at $u_1, u_2, \ldots, u_p \in S^q$, with spring constants $X_1, X_2, \ldots, X_p$.

equilibrium point $Y \in \mathbb{R}^{q+1}$ of the system satisfies

$$\sum_{j=1}^{p} X_j (Y - u_j) = 0,$$

$Y = \Psi(X; U) = UX / \mathbf{1}_p' X$ solves the system.

$\Psi(\cdot; U)$ is line-, point-ordering- and convexity-invariant.

scaling every coordinate to be in [0,1] allows for $Y \in S^q$.
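The equilibrium condition can be checked numerically. The sketch below uses arbitrary (randomly drawn) anchor points on $S^2$ rather than any placement from the presentation, and all function names are illustrative; it confirms that $UX / \mathbf{1}_p' X$ makes the net spring force vanish.

```python
import math
import random

def radial_map(x, anchors):
    """Return Y = U x / (1' x), with the anchor points as the columns of U."""
    total = sum(x)
    dim = len(anchors[0])
    return [sum(xj * u[d] for xj, u in zip(x, anchors)) / total for d in range(dim)]

def net_spring_force(x, anchors, y):
    """Return sum_j x_j (y - u_j); it is zero at the equilibrium point."""
    return [sum(xj * (y[d] - u[d]) for xj, u in zip(x, anchors))
            for d in range(len(y))]

def random_unit_vector(dim, rng):
    """Draw a point on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

rng = random.Random(0)
anchors = [random_unit_vector(3, rng) for _ in range(6)]   # p = 6 anchor points on S^2
x = [rng.random() for _ in range(6)]                        # coordinates scaled to [0, 1]
y = radial_map(x, anchors)
force = net_spring_force(x, anchors, y)                     # numerically zero
y_scaled = radial_map([3.0 * xj for xj in x], anchors)      # same point as y
```

Because the map divides by the sum of the coordinates, multiplying every coordinate by a common positive factor leaves Y unchanged.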

SLIDE 15

Placement of Anchor Points

Suppose: coordinates of X are uncorrelated.

For $X_1, X_2 \in \mathbb{R}^p$, let $Y_i = \Psi(X_i; U)$, $i = 1, 2$.

Squared Euclidean distance between $Y_1$ and $Y_2$ is

$$\|Y_1 - Y_2\|^2 = \left(\frac{X_1}{\mathbf{1}_p' X_1} - \frac{X_2}{\mathbf{1}_p' X_2}\right)' U'U \left(\frac{X_1}{\mathbf{1}_p' X_1} - \frac{X_2}{\mathbf{1}_p' X_2}\right),$$

$X_i$, $X_j$ very dissimilar, with perfect negative correlation, should be placed as far away as possible (in opposite directions) in our radial visualization.

However, $\|Y_i - Y_j\|^2 \to 0$ as $\langle u_i, u_j \rangle \to 0$.

may create artificial visual correlation between the ith and jth coordinates if $\langle u_i, u_j \rangle < \pi/2$.

need the $u_j$s as far from each other as possible; so evenly distributed.

$S^q$: for larger q, can get larger angles between the $u_j$s

Also place positively correlated coordinates close together

q > 1 has advantage in placing multiple coordinates together
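To make the point about larger angles concrete, the sketch below compares the smallest pairwise angle for p = 4 anchor points placed equi-spaced on the circle with that for the vertices of a regular tetrahedron on the sphere. The tetrahedron coordinates are a standard choice assumed here, and the helper names are illustrative.

```python
import math
from itertools import combinations

def angle(u, v):
    """Angle (in radians) between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))   # clamp against rounding error

def min_pairwise_angle(anchors):
    """Smallest angle between any two anchor points."""
    return min(angle(u, v) for u, v in combinations(anchors, 2))

# p = 4 equi-spaced anchor points on the unit circle.
circle4 = [(math.cos(math.pi * j / 2), math.sin(math.pi * j / 2)) for j in range(4)]

# p = 4 anchor points at the vertices of a regular tetrahedron on the unit sphere.
s = 1.0 / math.sqrt(3.0)
tetra4 = [(s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]

angle_2d = math.degrees(min_pairwise_angle(circle4))   # 90 degrees
angle_3d = math.degrees(min_pairwise_angle(tetra4))    # arccos(-1/3), about 109.47 degrees
```

With the same number of springs, every pair of anchor points on the sphere is separated by more than a right angle, whereas neighbors on the circle are exactly at a right angle.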

SLIDE 16

Three-dimensional Radial Visualization

q = 2 in our generalization yields RadViz3D:

equi-spaced anchor points for 5 Platonic solids, p = 4, 6, 8, 12, 20.

closely related to Thomson problem in traditional molecular quantum chemistry (Atiyah & Sutcliffe ’03).

for other p, approximate through Fibonacci grid, jth anchor point:

$$u_{j1} = \cos(2\pi j \varphi^{-1}) \sqrt{1 - u_{j3}^2}, \qquad u_{j2} = \sin(2\pi j \varphi^{-1}) \sqrt{1 - u_{j3}^2}, \qquad u_{j3} = \frac{2j - 1}{p} - 1,$$

where $\varphi = (1 + \sqrt{5})/2$ is the golden ratio (González ’10).

distributes anchor points along a generative spiral on $S^2$, with consecutive points as separated as possible

satisfies the "well-separation" property (Saff & Kuijlaars ’97).
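The Fibonacci grid is short to code. The sketch below assumes the third coordinate is $u_{j3} = (2j - 1)/p - 1$ for $j = 1, \ldots, p$, which is the reading of the flattened formula that keeps every point on the sphere; the function name is hypothetical.

```python
import math

def fibonacci_anchors(p):
    """Approximately equi-spaced anchor points on the unit sphere S^2."""
    phi = (1.0 + math.sqrt(5.0)) / 2.0           # golden ratio
    points = []
    for j in range(1, p + 1):
        z = (2.0 * j - 1.0) / p - 1.0            # heights equally spaced in (-1, 1)
        r = math.sqrt(1.0 - z * z)               # radius of the circle at height z
        theta = 2.0 * math.pi * j / phi          # golden-ratio turn between points
        points.append((math.cos(theta) * r, math.sin(theta) * r, z))
    return points

anchors = fibonacci_anchors(7)                   # p = 7 is not a Platonic-solid size
```

The heights are symmetric about zero, so the anchor points spiral evenly from near one pole to near the other.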

SLIDE 17

4D Examples simulated via MixSim package in R

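[Figure: RadViz2D, Viz3D and RadViz3D displays (anchor points x1, x2, x3, x4) for $\ddot\omega = 10^{-3}$]

[Figure: RadViz2D, Viz3D and RadViz3D displays (anchor points x1, x2, x3, x4) for $\ddot\omega = 10^{-2}$]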

SLIDE 18

Higher-dimensional Datasets

Display p anchor points infeasible, even for moderate p

placement of equally-spaced anchor points built on not inducing spurious positive correlations in display

with increasing p, harder to guarantee such outcome

Project high-dimensional data to uncorrelated coordinates but preserve distinctiveness and variability in groups

Principal Components finds mutually orthogonal projections summarizing proportion of total variance, but does not account for groups.

SLIDE 19

Maximum-Ratio Projection (MRP)

Step 1: Obtain PCs (orthogonal V g) for each group

Find orthogonal W closest to all V g

Project X with W and then obtain MRP

Step 2: Obtain uncorrelated projections that maximize between-group sums of squares and cross products (SSCP) relative to the total SSCP.

Let T, B be the (p.d.) total & between-group corrected SSCP.

$$\hat v_j = \frac{T^{-1/2} \hat w_j}{\|T^{-1/2} \hat w_j\|}, \quad j = 1, 2, \ldots, k,$$

where $\hat w_j$, $j = 1, 2, \ldots, k$, are the eigenvectors corresponding to, in decreasing order, the k largest eigenvalues of $T^{-1/2} B T^{-1/2}$.

$k \le G - 1$, chosen by scree plot/quality of display

$G \le 4$ needs $4 - G + 1$ more projections w/ null contribution

needs p.d. T, does not hold if $p > \min n_g$

MRP maximizes separation between groups (in projected space) relative to total variability.
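Step 2 can be sketched with NumPy as follows. This is only an illustration under stated assumptions: it covers Step 2 alone (the group-wise PCs and the common orthogonal matrix of Step 1 are not shown), it assumes T is positive definite, and the function name `max_ratio_projection` and the simulated data are hypothetical.

```python
import numpy as np

def max_ratio_projection(X, labels, k):
    """Return k unit-norm directions maximizing between-group over total SSCP."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    centered = X - grand_mean
    T = centered.T @ centered                          # total corrected SSCP
    B = np.zeros_like(T)
    for g in np.unique(labels):
        Xg = X[labels == g]
        d = (Xg.mean(axis=0) - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)                       # between-group SSCP
    evals, evecs = np.linalg.eigh(T)                   # T assumed positive definite
    T_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    lam, W = np.linalg.eigh(T_inv_half @ B @ T_inv_half)
    order = np.argsort(lam)[::-1][:k]                  # k largest eigenvalues
    V = T_inv_half @ W[:, order]
    V /= np.linalg.norm(V, axis=0)                     # normalize each direction
    return V, lam[order]

# Simulated example: G = 3 groups in p = 5 dimensions, so at most G - 1 = 2
# directions carry between-group information.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0, 0], [3, 0, 0, 0, 0], [0, 3, 0, 0, 0]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, size=(30, 5)) for m in means])
labels = np.repeat([0, 1, 2], 30)
V, lam = max_ratio_projection(X, labels, k=2)
scores = X @ V                                         # projected data for display
```

Asking for a third direction in this example returns an eigenvalue that is numerically zero, which corresponds to the "null contribution" projections mentioned above.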

SLIDE 20

500D Examples

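[Figure: RadViz2D, Viz3D and RadViz3D displays (anchor points x1, x2, x3, x4) for $\ddot\omega = 10^{-3}$]

[Figure: RadViz2D, Viz3D and RadViz3D displays (anchor points x1, x2, x3, x4) for $\ddot\omega = 10^{-2}$]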

SLIDE 21

Datasets with Skewed Attributes

Consider a r.v. X with CDF $F_X(x)$.

$F_X(X) \sim U(0, 1) \Rightarrow Y = \Phi^{-1}[F_X(X)] \sim N(0, 1)$.

call the above the (classical) Gaussianized Distributional Transform (CGDT)

marginal application of CGDT specifies distribution on X with desired marginal and correlation structure.

CGDT is a standardizing transform, more stringent than the usual affine 0-mean, unit-variance inducing transform

CGDT matches all marginal quantiles to N(0,1)

Apply to skewed datasets or those with unclear marginals

before applying MRP and RadViz3D
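In practice $F_X$ is unknown, so one simple option is to plug in the empirical CDF. The sketch below does this for a single attribute; it is an assumption-laden illustration, not the presentation's implementation: ranks are divided by n + 1 so that $\Phi^{-1}$ is never evaluated at 1, ties are not handled (the attribute is treated as continuous), and the function name is hypothetical.

```python
from statistics import NormalDist

def gaussianize(values):
    """Map one attribute to N(0, 1) scores through its empirical CDF."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank                          # rank 1 = smallest value
    std_normal = NormalDist()                    # mean 0, standard deviation 1
    # rank / (n + 1) lies strictly inside (0, 1), so inv_cdf is always defined.
    return [std_normal.inv_cdf(rank / (n + 1)) for rank in ranks]

skewed = [0.1, 5.0, 0.3, 100.0, 1.2]             # heavily right-skewed attribute
scores = gaussianize(skewed)
```

The transform keeps the ordering of the observations but replaces the skewed spacing with standard normal quantiles, applied one attribute at a time.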

SLIDE 22

Applications: Gamma Ray Bursts Dataset

[Figure: RadViz2D, Viz3D and RadViz3D displays of the GRB data (anchor points X1, X2, X3, X4), colored by Groups 1 to 5]

Heavily skewed attributes, so CGDT appropriate

Results indicate 5 overlapping clusters

some suggestion of 2, 3 super-types of GRBs

SLIDE 23

Applications: Face Recognition

112×92 images of 6 of 40 faces, at 10 light angles/conditions.

(20×14) DWT2 (LL band) of wavelet-transformed images with 280 features (Jadhav & Holambe, 2009)

SLIDE 24

Applications: Face Recognition

[Figure: RadViz2D, Viz3D and RadViz3D displays of the face data (anchor points X1, X2, X3, X4), colored by person: A, B, C, D, E, F]

marginals unclear: use CGDT

RadViz3D distinguishes all 6 people the best

SLIDE 25

Datasets with Discrete Attributes

For discrete-valued variable X, CDF $F_X(X) \not\sim U(0, 1)$ because of discreteness.

CGDT currently not applicable

Note that the CDF is only right continuous

Solution proposed by Rüschendorf (2013) via the generalized distributional transform

SLIDE 26

Generalized Distributional Transform (GDT)

Definition

Let X be a real-valued RV with CDF $F_X(\cdot)$ and let $V \sim U(0, 1)$ be a RV independent of X. The generalized distributional transform of X is $U = F(X, V)$, where

$$F(x, \lambda) \doteq P(X < x) + \lambda P(X = x) = F_X(x-) + \lambda [F_X(x) - F_X(x-)]$$

is the generalized CDF of X.

Theorem

Let $U = F(X, V)$ be the generalized distributional transform of X. Then $U \sim \mathrm{Uniform}(0, 1)$ and $X = F_X^{-1}(U)$ a.s., where $F_X^{-1}(t) = \inf\{x \in \mathbb{R} : F_X(x) \ge t\}$ is the generalized inverse, or the quantile transform, of $F_X(\cdot)$.

Use $F(X, V)$ in place of $F_X(X)$, calculate GDT as before

use of GDT on non-discriminating coordinate can spuriously bestow it hyper-importance

suggest ANOVA test on each GDT-ed coordinate, control FDR
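The definition translates into a few lines of code once a distribution for X is chosen. The sketch below plugs in the empirical distribution of the sample, which is an assumption made here for illustration; the function name and the simulated binary attribute are likewise hypothetical.

```python
import random
from collections import Counter

def generalized_distributional_transform(values, rng):
    """Return U = P(X < x) + V * P(X = x) for each x, with V ~ U(0, 1) independent."""
    n = len(values)
    counts = Counter(values)
    prob_below = {}
    running = 0
    for v in sorted(counts):
        prob_below[v] = running / n              # empirical P(X < v)
        running += counts[v]
    # Each observation gets its own uniform draw, spreading the point mass
    # at x over the jump of the CDF at x.
    return [prob_below[v] + rng.random() * counts[v] / n for v in values]

rng = random.Random(1)
x = [1 if rng.random() < 0.3 else 0 for _ in range(2000)]   # binary attribute
u = generalized_distributional_transform(x, rng)
p0 = x.count(0) / len(x)                                     # empirical P(X = 0)
```

Observations with x = 0 are spread over [0, p0] and those with x = 1 over [p0, 1], so the transformed values fill the unit interval and the original value can be read back from which side of p0 a value falls.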

SLIDE 27

Illustration: Simulated Binary Datasets

[Figure: RadViz2D, Viz3D and RadViz3D displays (anchor points x1, x2, x3, x4), low clustering complexity]

[Figure: RadViz2D, Viz3D and RadViz3D displays (anchor points x1, x2, x3, x4), high clustering complexity]

SLIDE 28

Applications: Senate Voting Records

109th US Congress (2005-06) had 542 (Y/N/NR) Senate votes

55 Republicans, 44 Democrats, 1 (D-caucus) Independent (VT) (Banerjee et al, 2008)

combine N/NR to get dataset of binary attributes

SLIDE 29

Applications: Senate Voting Records

[Figure: RadViz2D, Viz3D and RadViz3D displays of the Senate voting data (anchor points X1, X2, X3), colored by party: Democratic, Republican]

G = 2, so only 1 MRP with positive eigenvalue

spring X1 pulls members of one party towards itself more

X2, X3, X4 pull senators from both parties with equal (non-discriminating) force

SLIDE 30

Applications: Handwritten Indic Scripts

(Map Acknowledgment: Surveyor-General of India)

Handwritten scripts from Bangla (east), Gujarati (west), Gurmukhi (north), Kannada and Malayalam (southern states of Karnataka and Kerala), Urdu (Persian script), with 116 mixed features (Obaidullah et al, 2017).

SLIDE 31

Applications: Handwritten Indic Scripts

[Figure: RadViz2D, Viz3D and RadViz3D displays of the handwritten-script data (anchor points X1, X2, X3, X4), colored by script: Bangla, Gujarati, Gurmukhi, Kannada, Malayalam, Urdu]

Viz3D (and, to a lesser extent, RadViz2D) separates Urdu, Kannada and Gujarati, but not the other 3 scripts

RadViz3D best in distinguishing all 6 scripts

also points to difficulty of problem

SLIDE 32

Applications: RNA Sequences

Gene expression levels, in FPKM, of RNA sequences from 13 human organs.

focus on 8 largest (in terms of the sample size) organs

esophagus (659), colon (339), thyroid (318), lung (313), breast (212), stomach (159), liver (115) and prostate (106)

p=20242 discrete features

some have many discrete values, essentially continuous

dataset of mixed attributes.

Display for distinctiveness of samples from each organ

SLIDE 33

Applications: RNA Sequences

[Figure: RadViz2D, Viz3D and RadViz3D displays of the RNA sequence data (anchor points X1 to X7), colored by organ: Breast, Colon, Esophagus, Liver, Lung, Prostate, Stomach, Thyroid]

RadViz2D, Viz3D poorer at separating organs

RadViz3D indicates clear separation between organs

colon and stomach have some marginal overlap.

SLIDE 34

Conclusions and Further Work

Visualization tool for HD datasets

RadViz3D for more comprehensive display of grouped data

MRP, GDT for discrete, mixed, skewed variates

displays distinct groups more accurately

R package: https://github.com/fanne-stat/radviz3d

manuscript: https://arxiv.org/abs/1904.06366

Number of issues merit further attention

MRP linear; non-linear projections better?

extend for categorical (non-binary) attributes

GDT/MRP with other tools for improved visualization
