Information Visualization and Visual Analytics roles, challenges, - - PowerPoint PPT Presentation
Information Visualization and Visual Analytics: roles, challenges, and examples
Giuseppe Santucci
VisDis and the Database & User Interface
- The VisDis and the Database/Interface group background is about:
  – Visual Information Access
  – Data quality
  – Data integration
  – Adaptive Interfaces
  – User Centered Design
  – Usability and Accessibility
  – Infovis evaluation
  – Visual quality metrics
  – Visual Analytics
- Data sampling
- Density map optimization
Outline
- Information Visualization
– Main issues
- Data overloading
– Visual Analytics
– Automatic data analysis
– Three examples
- Projects and books
Information visualization !
1. Infovis is perfect for exploration, when we don’t know exactly what to look at: it supports vague goals
2. Infovis is perfect to explain complex data and to support decisions
- Other approaches to data analysis
– Statistics: strong verification, but does not support exploration and vague goals
– Data mining: actionable and reliable, but black box, not interactive, question-response style
– Visual analytics (formerly Visual Data Mining) is trying to join the two worlds
Canonical steps in infovis – STEP 1
DATA
Internal Representation
Encoding of values
- Univariate data
- Bivariate data
- Trivariate data
- Multidimensional data
Encoding of relations
- Temporal data
- Maps & Diagrams
- Graphs/Trees
- Data streams
[Example chart: univariate data over school subjects such as Sport, Literature, Mathematics, Physics, History, Geography, Art, Chemistry]
Canonical steps in infovis – STEP 2
Internal Representation
Space limitations
- Scrolling
- Overview + details
- Distortion
- Suppression
- Zoom & pan
- Semantic zoom
Time limitation
Perceptual issues
Cognitive issues
Presentation
SO WE ARE DONE! (?)
Outline
- Information Visualization
- Data overloading
– Visual Analytics
– Automatic data analysis
– Three examples
- Projects and books and conferences
Data size and complexity !
- 100 million FedEx transactions per day
- 150 million VISA credit card transactions per day
- 300 million long distance ATT calls per day
- 50 billion e-mails per day
- 600 billion IP packets per day
- 1 trillion (10^12) web pages (according to Google),
corresponding to about 3 petabytes of data
- Google processes 20 petabytes of data per day
- Data streams (sensor network, IP traffic, etc)
kilobyte, megabyte, gigabyte, terabyte, petabyte …
Rescuing information
- In different situations people need to exploit hidden information resting in unexplored large data sets:
  – decision-makers
  – analysts
  – engineers
  – emergency response teams
  – ...
- Several techniques exist devoted to this aim
– Automatic analysis techniques (e.g., data mining)
– Manual analysis techniques (e.g., information visualization)
- Petabyte datasets require a joint effort:
Visual Analytics
VA is highly interdisciplinary
Scientific & Information Visualisation, Data Management, Data Mining, Spatio-Temporal Data, Human Perception + Cognition, Infrastructure, Evaluation
Each component presents challenging issues
Visualization
- Scientific Visualization & Information Visualization
– interactivity & scalability issues
- Challenges: design of new scalable structures that support:
  – Visual abstractions (e.g., clustering, sampling, etc.)
  – Rapid update of visual displays for billion-record databases (10 frames per second)
Data Management
- Answering a query against a large data set is now possible
Among the other challenges:
- Integration of heterogeneous data such as numeric data,
graphs, text, audio and video signals, semi-structured data
- Data streams - In many applications data are continuously
produced (sensor data, stock market data, news data, etc.)
- Data provenance - Understanding where data come from
- Data reduction - Visualizing billion records is not possible.
We need to reduce and abstract the data to support interaction at different detail levels (see, e.g., Google Earth)
- ...
Data mining
- Methods to automatically extract insights
– Supervised learning from examples: using training samples to learn models for the classification (or prediction) of previously unseen data samples
– Cluster analysis, which aims to extract structure from unknown data, grouping data instances into classes based on mutual similarity, and to identify outliers
– Association rule mining (analysis of co-occurrence of data items) and dimensionality reduction
- Challenges come from:
– semi-structured and complex data (web data, documents)
– interaction with visualizations
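As a minimal illustration of cluster analysis, the sketch below runs a toy 1-D k-means on invented data; the points, k, and convergence settings are assumptions for the example, not tied to any specific VA system:

```python
# Minimal 1-D k-means sketch: group data instances by mutual similarity.
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster 1-D points into k groups; returns (centroids, labels)."""
    random.seed(seed)
    centroids = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Two well-separated groups of values (e.g., low vs. high CPU load).
data = [1.0, 1.2, 0.8, 9.7, 10.1, 10.3]
centroids, labels = kmeans(data, k=2)
```

On this data the algorithm converges to centroids near 1.0 and 10.0, grouping the first three and last three points together.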
Spatio - Temporal Data
- Data about time and space are widely spread
– geographic measurements
– GPS position data
– remote sensing applications (e.g., satellite data)
- Finding spatial relationships and patterns among this data is of
special interest
- The analysis of data with references both in space and in time is a challenging research topic:
  – scale: clusters and other phenomena may only occur at particular scales, which may not be the scale at which data is recorded
  – uncertainty: spatio-temporal data are often incomplete, interpolated, collected at different times, etc.
  – …
Perception and cognition
- A critical element is the human being
  – Visual analysis tasks require the careful design of apt human-computer interfaces
  – Challenges: the need to integrate Psychology, Sociology, Neurosciences, and Design issues
- user-centred analysis and modelling
- multimodal interaction techniques for visualization and exploration of large information spaces
- availability of improved display resources
- novel interaction algorithms
- perceptual, cognitive and graphical principles which in combination lead to improved visual communication of data and analysis results
[Figure: the action cycle – Form Intention, Form Action Plan, Execute Action, Perception, Interpretation, Evaluation]
Evaluation and Infrastructure
- How to assess (evaluate) the effectiveness of visual
analytics environments is a topic of lively debate
- The same happens for infrastructures: agreed solutions
are still under investigation
Both topics are still in the phase of workshop results... D3!
Back to the Automatic Data Analysis
We can classify the automatic activities in three main groups
- 1. Deriving new values from the dataset for ad-hoc visualization
- This is the least standard and most creative part of the process
- 2. Data reduction / data mining
- Clustering /classification /…
- Sampling / pixel oriented visualization
- Dimension reduction
- 3. Visualization improvement
- Data distribution
- Perceptual issues
- Cognitive issues
Example for group 1 Deriving new values from the dataset for ad-hoc visualization (you are going to visualize DERIVED data)
A Visual Analytics example (Group 1)
Deriving new values from the dataset for ad-hoc visualization
- How to visually compare J. London and M. Twain books?
- [D. A. Keim and D. Oelke. Literature Fingerprinting: A New Method for
Visual Literary Analysis. 2007 IEEE Symp. on Visual Analytics Science and Technology (VAST '07) ]
- 1. Split the book into several text blocks (e.g., pages, paragraphs,
sentences)
- 2. Measure, for each text block, a relevant feature (e.g.,
average sentence length, word usage, etc. )
- 3. Associate the relevant feature to a visual attribute (e.g.,
color)
- 4. Visualize it
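The four steps above can be sketched as follows; the sample text, block size, and grey-scale mapping are illustrative assumptions, not the exact features of the Keim and Oelke paper:

```python
# Literature-fingerprinting sketch: split a text into blocks, measure
# the average sentence length per block, and map it to a 0..255 grey value.
import re

def fingerprint(text, sentences_per_block=2):
    # Step 1: split the text into sentences, then into blocks.
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    blocks = [sentences[i:i + sentences_per_block]
              for i in range(0, len(sentences), sentences_per_block)]
    # Step 2: measure a feature per block (average sentence length in words).
    features = [sum(len(s.split()) for s in b) / len(b) for b in blocks]
    # Step 3: associate the feature to a visual attribute (grey level).
    lo, hi = min(features), max(features)
    span = (hi - lo) or 1.0
    greys = [round(255 * (f - lo) / span) for f in features]
    return features, greys

# Tiny made-up text: two blocks with different average sentence lengths.
features, greys = fingerprint("A b. C d e f. G. H i.")
```

Step 4 (rendering each grey value as one pixel per block) is then a plain raster drawing task.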
J.London vs M.Twain average sentence lengths
User interaction (a non uniform book?)
Details of a book
What about the Bible?
Example 2 Data reduction / data mining
Visual Analytics of Anomaly Detection in Large Data Streams (paper from Daniel Keim group)
- You have to monitor a network composed of 8 systems with
16 servers each
- Each server provides basic information
  – CPU % occupation
  – DISK % occupation
  – MEM % occupation
  – ...
  – That corresponds to 128 temporal data streams (overplotting !!)
[Chart: CPU % plotted against time]
Pixel oriented visualization
28 days (5-minute windows), about 8k observations
Each observation takes a pixel
The color codes the CPU %
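A sketch of this pixel-oriented layout, assuming one pixel per 5-minute observation and one row per day (288 observations/day); the load values are simulated:

```python
# Pixel-oriented layout sketch: one grey pixel per CPU% observation,
# one row per day.
def pixel_layout(observations, per_row=288):
    """Return rows of grey values (0..255), one row per day."""
    greys = [round(255 * cpu / 100.0) for cpu in observations]
    return [greys[i:i + per_row] for i in range(0, len(greys), per_row)]

# Two simulated days: day 1 at light load (20% CPU), day 2 busy (80% CPU).
obs = [20.0] * 288 + [80.0] * 288
rows = pixel_layout(obs)
```

A busy day then appears as a visibly brighter row, which is what makes anomalies pop out preattentively.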
The whole system
Color is preattentive!
Automated analysis
- Computing high CPU % clusters
- That selects hot time intervals
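A hypothetical sketch of hot-interval selection, using a simple threshold-and-run heuristic in place of the clustering used in the paper; the threshold, minimum length, and series are invented:

```python
# Hot-interval sketch: flag maximal runs of consecutive observations
# whose CPU% stays above a threshold for at least min_len samples.
def hot_intervals(cpu, threshold=80.0, min_len=3):
    intervals, start = [], None
    for i, v in enumerate(cpu):
        if v > threshold and start is None:
            start = i                      # a hot run begins here
        elif v <= threshold and start is not None:
            if i - start >= min_len:       # keep only sustained runs
                intervals.append((start, i))
            start = None
    if start is not None and len(cpu) - start >= min_len:
        intervals.append((start, len(cpu)))
    return intervals

# Simulated stream with two sustained spikes.
series = [20, 85, 90, 88, 30, 95, 96, 97, 99, 10]
hot = hot_intervals(series)
```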
Automated analysis...
- Detecting persistent anomalies
Looking for correlations
Example 3 Visualization improvement
A Visual Analytics example (Group 3 – Visualization improvement)
Data distribution and perceptual issues
Density maps
8x8 pixels
empty pixel
4 data items are plotted on the same pixel: d=4
we can map the density values to a 256-level grey or color scale
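Computing the densities themselves is straightforward; the sketch below bins 2-D points into an 8x8 pixel grid and counts items per pixel (the data points are invented):

```python
# Density-map sketch: bin 2-D points into a width x height pixel grid
# and count how many data items fall on each pixel (the density d).
def density_map(points, width=8, height=8):
    grid = [[0] * width for _ in range(height)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    for x, y in points:
        # Normalize each coordinate into a pixel index, clamped to the grid.
        col = min(int((x - x0) / (x1 - x0 + 1e-9) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0 + 1e-9) * height), height - 1)
        grid[row][col] += 1
    return grid

# Four coincident items (d=4) and one isolated item (d=1).
points = [(0, 0), (0, 0), (0, 0), (0, 0), (7, 7)]
grid = density_map(points)
```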
The case study (Infovis contest 2005)
- About 60,000 USA companies plotted on a 800x450 (360,000
pixels) scatter plot
- 126 distinct density values ranging on [1..1,633]
- 7,042 active pixels (i.e., hosting at least one company):
– 2,526 pixels (36%) host exactly one company (d=1)
– 1,182 pixels (17%) host two companies (d=2)
– ...
– 1 pixel (0.0001 %) hosts 1,633 companies (d=1633)
What is the problem?
- The choice of the right mapping is crucial, because the density frequency distribution presents a very skewed behaviour
[Chart: pixel number per density value (126 distinct values) – 36% at d=1, 17% at d=2, down to 0.001% at d=1,633]
The mapping
126 different data densities = { 1, 2, … , 1,633 }
256 color codes = { 0, 1, 2, … , 255 }
How to map the former onto the latter?
Available solutions
- Linear mapping
- Non linear mappings
Linear mapping
Most pixels share very low color codes
Few color codes are used (46 out of 256)
Different low density values are represented by the same color code: densities in [1..10] are mapped on codes {1,2}
ColorCode(d) = Round(255 · (d − dmin) / (dmax − dmin))
- Straightforward solution
- Useless in this situation
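A direct sketch of the linear mapping above, with the contest dataset's dmin=1 and dmax=1,633; with this rounding, all densities in [1..10] collapse onto just codes {0, 1}, which illustrates why the mapping is useless here:

```python
# Linear mapping sketch: densities in [dmin..dmax] mapped linearly
# onto colour codes 0..255. With dmax = 1633 the slope is so small
# that low densities collapse onto the first couple of codes.
def linear_code(d, dmin=1, dmax=1633):
    return round(255 * (d - dmin) / (dmax - dmin))

# All ten lowest densities share only two colour codes.
codes_low = {linear_code(d) for d in range(1, 11)}
```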
[Charts: transfer function and color code frequency distribution, showing color collisions]
Density function mapping
[Charts: transfer function (TF) and color code frequency distribution]
ColorCode(dj) = Round(255 · Σi=1..j DNAP(di) / NAP)
- Hermann et al. [HMM00]
- Quite similar to histogram equalization
- Better than linear mapping
Few color codes are used (39 out of 256)
Lowest color code unnecessarily high: codes ranging only on [91..255]
Different high density values are represented by the same color code: densities in [48..1,633] -> [250..255]
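The density-function mapping is a cumulative mapping: the colour code of density dj is proportional to the number of active pixels with density ≤ dj. A minimal sketch, with an invented density histogram:

```python
# Density-function (histogram-equalisation style) mapping sketch:
# ColorCode(dj) = Round(255 * cumulative_pixels(dj) / total_active_pixels).
def density_function_codes(density_counts):
    """density_counts: {density value: number of active pixels}.
    Returns {density value: colour code 0..255}."""
    total = sum(density_counts.values())
    codes, cum = {}, 0
    for d in sorted(density_counts):
        cum += density_counts[d]           # pixels with density <= d
        codes[d] = round(255 * cum / total)
    return codes

# Toy skewed histogram: most pixels at low densities, a few at d=10.
codes = density_function_codes({1: 50, 2: 30, 10: 20})
```

Note how even the lowest density already receives a high code, matching the "lowest color code unnecessarily high" criticism.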
Our proposal
We take into account that:
- densities and color codes are discrete and finite
- color codes that are too close are hardly distinguishable
(for human beings)
[E. Bertini, A. Di Girolamo, G.Santucci - See what you know: analyzing data distribution to improve density map visualization – Eurovis 2007 conference]
uniform scale mapping
We use a reduced color scale, e.g. with 15 codes (NL=15)
[Figure: reduced color scale c1, c2, c3, …, cNL mapped on codes 18, 36, 55, 73, 91, 109, 128, 146, 164, 182, 200, 219, 237, 255; each code should receive about NAP/NL pixels]
This implies that different density values will necessarily be represented by the same color code; to reduce the degradation, the mapping is performed through an algorithm that tries to assign the same number of pixels to each code
NDV>NL : uniform scale mapping
Color code frequency distribution
Because densities are discrete, the algorithm cannot guarantee exactly the NAP/NL value per code; through a peak analysis it minimizes the variance
Full color scale usage [0..255] All the color codes are used Maximum color code separation
ColorCode(dj) = DistributePixels(dj)
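A simplified greedy sketch of the uniform-scale idea: group sorted density values so that each of the NL colour codes covers roughly NAP/NL active pixels, then spread the codes over the full [0..255] range. This is a stand-in for illustration, not the paper's variance-minimizing peak-analysis algorithm; the histogram is invented:

```python
# Uniform-scale mapping sketch (greedy): each colour code should cover
# about N_AP / NL active pixels; codes are spread over the full 0..255 range.
def uniform_scale_codes(density_counts, nl=15):
    total = sum(density_counts.values())
    target = total / nl                     # ideal pixels per colour code
    codes, level, bucket = {}, 0, 0
    for d in sorted(density_counts):
        codes[d] = round(255 * level / (nl - 1))
        bucket += density_counts[d]
        # Move to the next colour code once this one holds ~target pixels.
        if bucket >= target and level < nl - 1:
            level += 1
            bucket = 0
    return codes

# Six densities, 10 active pixels each, mapped onto NL = 3 codes.
codes = uniform_scale_codes({d: 10 for d in range(1, 7)}, nl=3)
```

Here each code covers exactly 20 of the 60 pixels, and the codes span the full scale with maximum separation.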
Visual comparison
Linear mapping Density function mapping Uniform scale mapping
Visual comparison
The parcel dataset
Postal parcels plotted by weight (x) and volume (y)
Grey scale
Linear: CSU=0.53, CsAR=1, CS=2.83
Density Function: CSU=0.18, CsAR=0.62, CS=5.23
Uniform color scale: CSU=1, CsAR=1, CS=8.79
Conclusions
- Visual Analytics is a new (exciting) emerging research field
- Information visualization is a core component of VA
- Automated data analysis can be classified into three main
groups
  – Deriving new values (more creative)
  – Data reduction (sometimes creative)
  – Image improvement (very technical)
- It is highly interdisciplinary and requires a collaborative
approach
- It is more a METHODOLOGY / VISION than a technique
- However, a collection of available results / proposals is
quickly growing
The new (European) book on VA
- Illuminating the path : The
Research and Development Agenda for Visual Analytics
– 2005, focusing on USA homeland security
- Managing the Information Age:
Solving Problems with Visual Analytics (2010)
  – One of the major outcomes of VisMaster
  – Available for free at:
  – http://www.vismaster.eu/
5 books you HAVE to read (greedy order)
- Robert Spence - Information Visualization: Design for
Interaction (2nd Edition) - Addison-Wesley (ACM Press) - BASIC ISSUES
- Chaomei Chen - Information Visualization - Second Edition
- Springer - AN UPDATED OVERVIEW
- Managing the Information Age Solving Problems with
Visual Analytics (2010) VISMASTER BOOK
- Colin Ware - Information Visualization, Third Edition:
Perception for Design (Interactive Technologies) - Morgan Kaufmann - PERCEPTUAL ISSUES
- Card, Mackinlay, Shneiderman - Readings in Information
Visualization - 1999 - HISTORICAL
Visual Analytics projects
The Vismaster CA project
The Promise NoE project
PanopteSec Network Cyber Security
- A 3-year European IP project!
PanopteSec: Call for Master Thesis
- Design, implement, and test a
Visual Analytics environment for network security
- D3 framework
- It includes the Information