Information Visualization and Visual Analytics: roles, challenges, and examples



slide-1
SLIDE 1

Information Visualization and Visual Analytics roles, challenges, and examples Giuseppe Santucci

slide-2
SLIDE 2

VisDis and the Database & User Interface

  • The background of the VisDis and Database/User Interface group covers:

– Visual Information Access
– Data quality
– Data integration
– Adaptive Interfaces
– User Centered Design
– Usability and Accessibility
– Infovis evaluation
– Visual quality metrics
– Visual Analytics

  • Data sampling
  • Density map optimization
slide-3
SLIDE 3

Outline

  • Information Visualization

– Main issues

  • Data overloading

– Visual Analytics
– Automatic data analysis
– Three examples

  • Projects and books
slide-4
SLIDE 4

Information visualization !

1. Infovis is perfect for exploration, when we don’t know exactly what to look at. It supports vague goals
2. Infovis is perfect to explain complex data and to support decisions

  • Other approaches to data analysis

– Statistics: strong verification, but does not support exploration and vague goals
– Data mining: actionable and reliable, but black box, not interactive, question-response style
– Visual analytics (formerly Visual Data Mining) tries to join the two worlds

slide-5
SLIDE 5

Canonical steps in infovis – STEP 1

DATA

Internal Representation

Encoding of values: univariate data, bivariate data, trivariate data, multidimensional data
Encoding of relations: temporal data, maps & diagrams, graphs/trees, data streams

[Figure: example chart with the categories Sport, Literature, Mathematics, Physics, History, Geography, Art, Chemistry]

slide-6
SLIDE 6

Canonical steps in infovis – STEP 2

Internal Representation

Space limitations: scrolling, overview + details, distortion, suppression, zoom & pan, semantic zoom
Time limitation
Perceptual issues
Cognitive issues

Presentation

slide-7
SLIDE 7

SO WE ARE DONE! (?)

slide-8
SLIDE 8

Outline

  • Information Visualization
  • Data overloading

– Visual Analytics
– Automatic data analysis
– Three examples

  • Projects and books and conferences
slide-9
SLIDE 9

Data size and complexity !

  • 100 million FedEx transactions per day
  • 150 million VISA credit card transactions per day
  • 300 million long distance ATT calls per day
  • 50 billion e-mails per day
  • 600 billion IP packets per day
  • 1 trillion (10^12) web pages (according to Google), corresponding to about 3 petabytes of data

  • Google processes 20 petabytes of data per day
  • Data streams (sensor network, IP traffic, etc)

kilobyte, megabyte, gigabyte, terabyte, petabyte …

slide-10
SLIDE 10

Rescuing information

  • In many situations people need to exploit hidden information resting in large, unexplored data sets:

– decision-makers
– analysts
– engineers
– emergency response teams
– ...

  • Several techniques are devoted to this aim:

– Automatic analysis techniques (e.g., data mining)
– Manual analysis techniques (e.g., information visualization)

  • Petabyte datasets require a joint effort:
slide-11
SLIDE 11

Visual Analytics

slide-12
SLIDE 12

VA is highly interdisciplinary

[Diagram: components of Visual Analytics: Scientific & Information Visualisation, Data Management, Data Mining, Spatio-Temporal Data, Human Perception + Cognition, Infrastructure, Evaluation]

Each component presents challenging issues

slide-13
SLIDE 13

Visualization

  • Scientific Visualization & Information Visualization

– interactivity & scalability issues

  • Challenges: design of new scalable structures that support:

– Visual abstractions (e.g., clustering, sampling, etc.)
– Rapid update of visual displays for billion-record databases (10 frames per second)

slide-14
SLIDE 14

Data Management

  • Answering a query against a large data set is now possible

Among the other challenges:

  • Integration of heterogeneous data such as numeric data, graphs, text, audio and video signals, semi-structured data
  • Data streams - In many applications data are continuously produced (sensor data, stock market data, news data, etc.)
  • Data provenance - Understanding where data come from
  • Data reduction - Visualizing billions of records is not possible. We need to reduce and abstract the data to support interaction at different detail levels (see, e.g., Google Earth)

  • ...
slide-15
SLIDE 15

Data mining

  • Methods to automatically extract insights

– Supervised learning from examples: using training samples to learn models for the classification (or prediction) of previously unseen data samples
– Cluster analysis, which aims to extract structure from unknown data, grouping data instances into classes based on mutual similarity, and to identify outliers
– Association rule mining (analysis of co-occurrence of data items) and dimensionality reduction

  • Challenges come from:

– semi-structured and complex data (web data, documents)
– interaction with visualizations

slide-16
SLIDE 16

Spatio - Temporal Data

  • Data about time and space are widespread

– geographic measurements
– GPS position data
– remote sensing applications (e.g., satellite data)

  • Finding spatial relationships and patterns among these data is of special interest
  • The analysis of data with references both in space and in time is a challenging research topic:

– scale: clusters and other phenomena may only occur at particular scales, which may not be the scale at which data are recorded
– uncertainty: spatio-temporal data are often incomplete, interpolated, collected at different times, etc.
– …

slide-17
SLIDE 17

Perception and cognition

  • A critical element is the human being

– Visual analysis tasks require the careful design of apt human-computer interfaces
– Challenges: need to integrate Psychology, Sociology, Neurosciences, and Design issues

  • user-centred analysis and modelling
  • multimodal interaction techniques for visualization and exploration of large information spaces
  • availability of improved display resources
  • novel interaction algorithms
  • perceptual, cognitive and graphical principles which in combination lead to improved visual communication of data and analysis results

[Figure: interaction cycle with the stages Form Intention, Form Action Plan, Execute Action, Perception, Interpretation, Evaluation]
slide-18
SLIDE 18

Evaluation and Infrastructure

  • How to assess (evaluate) the effectiveness of visual analytics environments is a topic of lively debate
  • The same holds for infrastructures: agreed solutions are still under investigation

Both topics are still in the phase of workshop results... D3!

slide-19
SLIDE 19

Back to the Automatic Data Analysis

We can classify the automatic activities in three main groups

  • 1. Deriving new values from the dataset for ad-hoc visualization
– This is the least standard and most creative part of the process
  • 2. Data reduction / data mining
– Clustering / classification / …
– Sampling / pixel oriented visualization
– Dimension reduction
  • 3. Visualization improvement
– Data distribution
– Perceptual issues
– Cognitive issues
slide-20
SLIDE 20

Example for group 1 Deriving new values from the dataset for ad-hoc visualization (you are going to visualize DERIVED data)

slide-21
SLIDE 21

A Visual Analytics example (Group 1)

Deriving new values from the dataset for ad-hoc visualization

  • How to visually compare books by J. London and M. Twain?
  • [D. A. Keim and D. Oelke. Literature Fingerprinting: A New Method for Visual Literary Analysis. 2007 IEEE Symp. on Visual Analytics Science and Technology (VAST '07)]
  • 1. Split the book into several text blocks (e.g., pages, paragraphs, sentences)
  • 2. Measure, for each text block, a relevant feature (e.g., average sentence length, word usage, etc.)
  • 3. Associate the relevant feature with a visual attribute (e.g., color)
  • 4. Visualize it
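
The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the cited paper: the helper names (`split_blocks`, `avg_sentence_length`, `to_grey`, `fingerprint`), the naive regex sentence splitter, the block size, and the linear grey scale are all hypothetical choices. Step 4 is left to any plotting tool: each returned grey level would be drawn as one small square.

```python
import re

def split_blocks(text, sentences_per_block=3):
    """Step 1: split the text into blocks of consecutive sentences."""
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [sentences[i:i + sentences_per_block]
            for i in range(0, len(sentences), sentences_per_block)]

def avg_sentence_length(block):
    """Step 2: average number of words per sentence in one block."""
    return sum(len(s.split()) for s in block) / len(block)

def to_grey(value, lo, hi):
    """Step 3: linearly map a feature value to a grey level in [0, 255]."""
    if hi == lo:
        return 0
    return round(255 * (value - lo) / (hi - lo))

def fingerprint(text, sentences_per_block=3):
    """Steps 1-3: one grey level per text block, ready to be drawn as pixels."""
    features = [avg_sentence_length(b)
                for b in split_blocks(text, sentences_per_block)]
    lo, hi = min(features), max(features)
    return [to_grey(f, lo, hi) for f in features]

# Toy text: three short sentences followed by three longer ones.
sample = ("Short one. Tiny. Ok. "
          "This sentence is quite a bit longer than those. "
          "So is this one, by far. And here is another long one too.")
print(fingerprint(sample))
```

With a real book, the block size decides the level of detail: pages give a coarse fingerprint, sentences a very fine one.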
slide-22
SLIDE 22

J.London vs M.Twain average sentence lengths

slide-23
SLIDE 23

User interaction (a non uniform book?)

slide-24
SLIDE 24

Details of a book

slide-25
SLIDE 25

What about the Bible?

slide-26
SLIDE 26

Example 2 Data reduction / data mining

slide-27
SLIDE 27

Visual Analytics of Anomaly Detection in Large Data Streams (paper from Daniel Keim's group)

  • You have to monitor a network composed of 8 systems with 16 servers each
  • Each server provides basic information

– CPU % occupation
– DISK % occupation
– MEM % occupation
– ...

  • For a single measure such as CPU %, that corresponds to 8 x 16 = 128 temporal data streams (overplotting!)

time

CPU %

slide-28
SLIDE 28

Pixel oriented visualization

28 days (5 min windows), about 8k observations
Each observation takes a pixel
The color codes the CPU %
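
A short sketch of the arithmetic behind the "about 8k observations" figure, and of one possible pixel layout. The one-row-per-day arrangement and the linear CPU-to-grey scale are assumptions made for illustration; the slide does not say how the pixels are arranged or which color scale is used.

```python
MINUTES_PER_DAY = 24 * 60

def observations(days=28, window_min=5):
    """Number of fixed-size time windows in the monitored period."""
    return days * MINUTES_PER_DAY // window_min

def to_pixel_grid(stream, windows_per_row):
    """Arrange a 1-D stream into rows, one pixel per observation."""
    return [stream[i:i + windows_per_row]
            for i in range(0, len(stream), windows_per_row)]

def cpu_to_grey(cpu_percent):
    """Map a CPU % value in [0, 100] to a grey level in [0, 255] (assumed scale)."""
    return round(255 * cpu_percent / 100)

n = observations()                              # 28 days of 5-minute windows
windows_per_day = MINUTES_PER_DAY // 5          # 288 windows per day
grid = to_pixel_grid(list(range(n)), windows_per_day)
print(n, len(grid), len(grid[0]))               # total pixels, rows, columns
```

Under these assumptions a single server fits in a 288 x 28 pixel block, so the 128 servers of the example can be shown side by side without overplotting.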

slide-29
SLIDE 29

The whole system

Color is preattentive!

slide-30
SLIDE 30

Automated analysis

  • Computing high CPU % clusters
  • That selects hot time intervals
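
As a rough illustration of how hot time intervals could be selected automatically, the sketch below uses a plain threshold rule instead of the clustering mentioned on the slide: it reports every run of consecutive windows whose CPU % stays at or above a threshold for a minimum duration. The function name, the 80% threshold, and the minimum length are hypothetical parameters, not values from the paper.

```python
def hot_intervals(cpu, threshold=80.0, min_len=3):
    """Return (start, end) index pairs, end exclusive, of runs where
    CPU % >= threshold for at least min_len consecutive windows."""
    intervals = []
    start = None
    for i, value in enumerate(cpu):
        if value >= threshold:
            if start is None:
                start = i                      # a new hot run begins
        elif start is not None:
            if i - start >= min_len:
                intervals.append((start, i))   # keep runs that last long enough
            start = None
    # Close a run that is still open at the end of the stream.
    if start is not None and len(cpu) - start >= min_len:
        intervals.append((start, len(cpu)))
    return intervals

cpu = [10, 90, 95, 85, 20, 99, 30, 88, 91, 92, 97]
print(hot_intervals(cpu))
```

Raising `min_len` keeps only long-lasting runs, which is one simple way to separate persistent anomalies from short spikes.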
slide-31
SLIDE 31

Automated analysis...

  • Detecting persistent anomalies
slide-32
SLIDE 32

Looking for correlations

slide-33
SLIDE 33

Example 3 Visualization improvement

slide-34
SLIDE 34

A Visual Analytics example (Group 3 – Visualization improvement)

Data distribution and perceptual issues

Density maps

[Figure: an 8x8 pixel grid; some pixels are empty, and 4 data items are plotted on the same pixel (d=4)]

We can map the density values to a 256-level grey or color scale
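
A minimal sketch of how such a density map can be computed: every data item is assigned to the pixel it falls on, and the density d of a pixel is the number of items it hosts. The function name, the coordinate ranges, and the toy points are assumptions for illustration only.

```python
from collections import Counter

def density_map(points, width, height, x_range, y_range):
    """Count how many data items fall on each pixel of a width x height plot.
    Returns a Counter keyed by (row, col); empty pixels are simply absent."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        # Scale the coordinates to pixel indices; clamp the upper border.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        counts[(row, col)] += 1
    return counts

# Toy data on an 8x8 grid: four items collide on one pixel.
points = [(2.1, 3.2), (2.5, 3.9), (2.9, 3.0), (2.0, 3.5), (7.9, 0.1), (8.0, 8.0)]
density = density_map(points, 8, 8, (0, 8), (0, 8))
print(density[(3, 2)], len(density))   # density of the crowded pixel, active pixels
```

The number of keys in the result is the number of active pixels, and the multiset of its values is the density frequency distribution that the following slides try to map to colors.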

slide-35
SLIDE 35

The case study (Infovis contest 2005)

  • About 60,000 USA companies plotted on an 800x450 (360,000 pixels) scatter plot
  • 126 distinct density values ranging over [1..1,633]
  • 7,042 active pixels (i.e., hosting at least one company):

– 2,526 pixels (36%) host exactly one company (d=1)
– 1,182 pixels (17%) host two companies (d=2)
– ...
– 1 pixel (about 0.014%) hosts 1,633 companies (d=1633)

slide-36
SLIDE 36

What is the problem?

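  • The choice of the right mapping is crucial, because the density frequency distribution is very skewed

[Figure: density frequency distribution; x axis: density (126 distinct values, up to 1,633), y axis: number of pixels; 36% of the active pixels have d=1, 17% have d=2, and a single pixel has d=1,633]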

slide-37
SLIDE 37

The mapping

126 different data densities = { 1, 2, … , 1,633 }
?
256 color codes = { 0, 1, 2, … , 255 }

Available solutions

  • Linear mapping
  • Non linear mappings
slide-38
SLIDE 38

Linear mapping

  • Straightforward solution
  • Useless in this situation

$\mathrm{ColorCode}(d) = \mathrm{Round}\left(255 \cdot \frac{d - d_{min}}{d_{max} - d_{min}}\right)$

– Most pixels share very low color codes
– Few color codes are used (46 out of 256)
– Different low density values are represented by the same color code: densities in [1..10] are mapped on codes {1,2}

[Figure: transfer function and color code frequency distribution, showing color collisions]
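
The collision effect is easy to reproduce. The sketch below applies a linear mapping to the density range of the case study (1 to 1,633) and counts how many distinct codes the ten lowest densities receive. The rounding rule (round half up) is an assumption, and which two codes come out depends on it and on whether code 0 is reserved for empty pixels, which the slide does not specify.

```python
import math

def linear_color_code(d, d_min, d_max):
    """Linear mapping of a density d in [d_min, d_max] to a code in [0, 255]."""
    return math.floor(255 * (d - d_min) / (d_max - d_min) + 0.5)

D_MIN, D_MAX = 1, 1633
low_codes = {d: linear_color_code(d, D_MIN, D_MAX) for d in range(1, 11)}
print(low_codes)                          # densities 1..10 and their codes
print(len(set(low_codes.values())))       # number of distinct codes among them
```

Since more than half of the active pixels have density 1 or 2, almost the whole picture ends up in the darkest one or two codes.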

slide-39
SLIDE 39

Density function mapping

  • Hermann et al. [HMM00]
  • Quite similar to histogram equalization
  • Better than linear mapping

$\mathrm{ColorCode}(d_j) = \mathrm{Round}\left(255 \cdot \sum_{i=1}^{j} \frac{DN_{AP}(d_i)}{N_{AP}}\right)$

(reading $DN_{AP}(d_i)$ as the number of active pixels with density $d_i$ and $N_{AP}$ as the total number of active pixels)

– Few color codes are used (39 out of 256)
– Lowest color code unnecessarily high: codes range only over [91..255]
– Different high density values are represented by the same color code: densities in [48..1,633] -> [250,255]

[Figure: transfer function (TF) and color code frequency distribution]
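
A small sketch of a cumulative (histogram-equalization style) mapping, assuming that the code of a density is 255 times the share of active pixels whose density is less than or equal to it. The pixel counts for d=1 and d=2 come from the case study; the remaining 3,334 active pixels are lumped under a single placeholder density (d=3), which is purely hypothetical and only keeps the total at 7,042.

```python
import math

def density_function_mapping(pixel_counts):
    """pixel_counts: dict density -> number of active pixels with that density.
    Returns dict density -> color code based on the cumulative share of pixels."""
    n_ap = sum(pixel_counts.values())      # total number of active pixels
    codes, cumulative = {}, 0
    for d in sorted(pixel_counts):
        cumulative += pixel_counts[d]
        codes[d] = math.floor(255 * cumulative / n_ap + 0.5)
    return codes

# d=1 and d=2 as in the case study; d=3 is a hypothetical bucket for all the rest.
counts = {1: 2526, 2: 1182, 3: 3334}
codes = density_function_mapping(counts)
print(codes)
```

Under this reading the lowest density alone already receives code 91, which agrees with the slide's remark that the codes range only over [91..255].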

slide-40
SLIDE 40

Our proposal

We take into account that:

  • densities and color codes are discrete and finite
  • color codes that are too close are hard to distinguish (for human beings)

[E. Bertini, A. Di Girolamo, G.Santucci - See what you know: analyzing data distribution to improve density map visualization – Eurovis 2007 conference]

slide-41
SLIDE 41

uniform scale mapping

We use a reduced color scale, e.g. with 15 codes (NL=15)

[Figure: target color code frequency distribution; x axis: color codes c_1, c_2, c_3, …, c_NL (ticks at 18, 36, 55, 73, 91, 109, 128, 146, 164, 182, 200, 219, 237, 255); every code receives N_AP/N_L pixels]

This implies that different density values will necessarily be represented by the same color code. To reduce the degradation, the mapping is performed by an algorithm that tries to assign the same number of pixels to each code

slide-42
SLIDE 42

NDV>NL : uniform scale mapping

$\mathrm{ColorCode}(d_j) = \mathrm{DistributePixels}()$

Because densities are discrete, the algorithm cannot guarantee exactly NAP/NL pixels per code; through a peak analysis it minimizes the variance

– Full color scale usage [0..255]
– All the color codes are used
– Maximum color code separation

[Figure: color code frequency distribution]
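
The idea can be sketched with a simple greedy pass. This is not the peak-analysis algorithm of the Eurovis 2007 paper, only an assumed stand-in: densities are visited in increasing order and a code is closed as soon as it has received its share of the pixels still to be placed, or when the remaining densities are just enough to give every remaining code at least one. The names `reduced_scale` and `uniform_scale_mapping` and the toy counts are hypothetical.

```python
def reduced_scale(n_levels):
    """Evenly spaced color codes over [0, 255] (rounded half up)."""
    return [int(k * 255 / (n_levels - 1) + 0.5) for k in range(n_levels)]

def uniform_scale_mapping(pixel_counts, n_levels=15):
    """Greedy sketch: give each code roughly the same number of pixels.
    pixel_counts: dict density -> number of active pixels with that density."""
    scale = reduced_scale(n_levels)
    densities = sorted(pixel_counts)
    remaining_pixels = sum(pixel_counts.values())
    target = remaining_pixels / n_levels       # N_AP / N_L
    codes, level, in_level = {}, 0, 0
    for idx, d in enumerate(densities):
        codes[d] = scale[level]
        in_level += pixel_counts[d]
        remaining_pixels -= pixel_counts[d]
        densities_left = len(densities) - idx - 1
        levels_left = n_levels - level - 1
        # Close the current code when it is full, or when every remaining
        # density is needed to keep the remaining codes non-empty.
        if levels_left > 0 and (in_level >= target or densities_left <= levels_left):
            level += 1
            in_level = 0
            target = remaining_pixels / levels_left   # re-balance what is left
    return codes

toy_counts = {1: 50, 2: 20, 3: 10, 4: 10, 5: 5, 6: 5}   # skewed toy distribution
print(reduced_scale(15))
print(uniform_scale_mapping(toy_counts, n_levels=3))
```

With 15 levels, `reduced_scale` yields the 14 tick values shown on the slide plus code 0. Re-computing the target after each closed code is the design choice that stops one large peak (such as d=1) from consuming the share of several codes.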

slide-43
SLIDE 43

Visual comparison

Linear mapping Density function mapping Uniform scale mapping

slide-44
SLIDE 44

Visual comparison

slide-45
SLIDE 45
slide-46
SLIDE 46
slide-47
SLIDE 47

The parcel dataset

Postal parcels plotted by weight (x) and volume (y)

slide-48
SLIDE 48

Grey scale

Linear: CSU=0.53, CsAR=1, CS=2.83
Density function: CSU=0.18, CsAR=0.62, CS=5.23
Uniform color scale: CSU=1, CsAR=1, CS=8.79

slide-49
SLIDE 49

Conclusions

  • Visual Analytics is a new (exciting) emerging research field
  • Information visualization is a core component of VA
  • Automated data analysis can be classified into three main groups

– Deriving new values (more creative)
– Data reduction (sometimes creative)
– Image improvement (very technical)

  • Visual Analytics is highly interdisciplinary and requires a collaborative approach
  • It is more a METHODOLOGY / VISION than a technique
  • However, the collection of available results / proposals is quickly growing

slide-50
SLIDE 50

The new (European) book on VA

  • Illuminating the Path: The Research and Development Agenda for Visual Analytics

– 2005, focusing on USA homeland security

  • Mastering the Information Age: Solving Problems with Visual Analytics (2010)

– One of the major outcomes of Vismaster
– Available for free at: http://www.vismaster.eu/

slide-51
SLIDE 51

5 books you HAVE to read (greedy order)

  • Robert Spence - Information Visualization: Design for Interaction (2nd Edition) - Addison-Wesley (ACM Press) - BASIC ISSUES
  • Chaomei Chen - Information Visualization - Second Edition - Springer - AN UPDATED OVERVIEW
  • Mastering the Information Age: Solving Problems with Visual Analytics (2010) - VISMASTER BOOK
  • Colin Ware - Information Visualization, Third Edition: Perception for Design (Interactive Technologies) - Morgan Kaufmann - PERCEPTUAL ISSUES
  • Card, Mackinlay, Shneiderman - Readings in Information Visualization - 1999 - HISTORICAL

slide-52
SLIDE 52

Visual Analytics projects

slide-53
SLIDE 53

The Vismaster CA project

slide-54
SLIDE 54

The Promise NoE project

slide-55
SLIDE 55

PanopteSec Network Cyber Security

  • 3-year European IP project!
slide-56
SLIDE 56
slide-57
SLIDE 57

PanopteSec: Call for Master Thesis

  • Design, implement, and test a Visual Analytics environment for network security
  • D3 framework
  • It includes the Information Visualization homework