Interactive Model Learning from High-Dimensional Data: A Visual Analytics Approach



slide-1
SLIDE 1

Interactive Model Learning from High-Dimensional Data: A Visual Analytics Approach

Klaus Mueller

Computer Science Lab for Visual Analytics and Imaging (VAI) Stony Brook University

slide-2
SLIDE 2

Visual Analytics

slide-3
SLIDE 3

Visual Analytics (Layman’s View)

slide-13
SLIDE 13

Visual Analytics (Expert View)

Diagram labels:

  • Human: pattern recognition, mental model, creative thought, abstracted knowledge
  • Computer: computing hardware, formal model, algorithms, formatted knowledge
  • Visual Interface
  • Data
  • Connections: manage, formalized insight

slide-18
SLIDE 18

Visual Analytics (Expert View)

Diagram labels:

  • Human: pattern recognition, mental model, creative thought, abstracted knowledge
  • Computer: computing hardware, formal model, algorithms, formatted knowledge
  • Visual Interface: visual communication
  • Data
  • Connections: interact, update, manage, learn, visualize, apply/update

Mueller, et al. IEEE CG&A, 2011

slide-21
SLIDE 21

Visual Communication

Obviously, the better a communicator the computer is, the better the learnt model

  • computer communicates its current model via visualizations
  • analyst critiques it via visual interactions
  • computer learns a better model
  • and so on…

A key question is thus:

  • can computers master the art of communication?

Good visual design and interaction are important

Mueller, et al. IEEE CG&A, 2011

slide-23
SLIDE 23

Visual Model Sculpting

Some motivating quotes from Michelangelo:

  • “I saw the angel in the marble and carved until I set him free.”
  • “Every block of stone has a statue inside it and it is the task of the sculptor to discover it.”
  • “The marble not yet carved can hold the form of every thought the greatest artist has.”

Replace ‘angel’ or ‘statue’ with ‘model’ and you can be the Michelangelo of Visual Analytics.

slide-24
SLIDE 24

Differences

Michelangelo’s ‘data’ were 3-D blocks of marble

  • ours are N-D blocks of bytes

Michelangelo’s tools were chisels, etc.

  • ours are mouse, multi-touch devices, etc.

Michelangelo would say things like this:

  • “It is well with me only when I have a chisel in my hand.”
slide-27
SLIDE 27

High-D Visualization

Problems

  • comprehensive high-D visualizations can be very confusing
  • need to make high-D visualization user friendly and intuitive

Key elements towards these goals

  • interactive: allow users to playfully sculpt the knowledge
  • communicative: let the data tell their story
  • illustrative: abstract away irrelevant detail
  • grounded: maintain a reference to native data space

Four (somewhat) complementary paradigms

  • spectral plots → see high-D hierarchies
  • dynamic scatterplots → see high-D shapes
  • parallel coordinates → see high-D cause + effect
  • space embeddings → see high-D relationships
slide-28
SLIDE 28

Spectral Plots (SpectrumMiner)

Shown: 7,076 particles with 450-D mass spectra, acquired with a single-particle mass spectrometer (SPLAT)

slide-30
SLIDE 30

N-D Sculpting w/SpectrumMiner

  • reducing the effect of sodium (set weight = 0.1)
  • 3D PCA view

Garg, Nam, Ramakrishnan, Mueller, IEEE VAST 2008

slide-32
SLIDE 32

N-D Sculpting w/SpectrumMiner

  • reducing the effect of sodium (set weight = 0.1)
  • 3D PCA view
  • automated k-means, user chooses k=5
  • inspect more closely
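
The weighting and clustering steps named on this slide could be sketched roughly as follows. This is an illustration under stated assumptions, not SpectrumMiner's code: `scale` and `kmeans` are hypothetical helpers, the PCA view is left out, the toy data has only two dimensions (the first standing in for sodium), and k=2 replaces the slide's k=5.

```python
import math

def scale(points, weights):
    """Multiply each dimension by its user-set weight (e.g. 0.1 for sodium)."""
    return [[x * w for x, w in zip(p, weights)] for p in points]

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic start: the first k points are the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move every center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical data: dimension 0 ("sodium") dominates the raw distances.
points = [[0, 0], [10, 0.2], [0.5, 5], [9.5, 5.1]]
raw_labels = kmeans(points, 2)                       # grouped by sodium
sculpted_labels = kmeans(scale(points, [0.1, 1]), 2)  # grouped by dimension 1
```

Down-weighting the dominant dimension changes which points end up together, which is the effect the slide attributes to setting the sodium weight to 0.1.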

slide-33
SLIDE 33

N-D Sculpting w/SpectrumMiner

show dimension interactions in neighborhood map

Nam, Zelenyuk, Imre, Mueller, IEEE VAST 2007

slide-35
SLIDE 35

N-D Sculpting w/SpectrumMiner

  • show dimension interactions in neighborhood map
  • before merge, after merge
  • Support Vector Machine (SVM) model encodes this knowledge

slide-37
SLIDE 37

Scatterplots

Familiar for the display of bi-variate relationships

Multivariate relationships arranged in scatterplot matrices

  • not overly intuitive to perceive multivariate relationships
slide-39
SLIDE 39

Dynamic Scatterplots

Interaction to help ‘see’ N-D

  • user interface is key → N-D Navigator™

Motion parallax beats stereo for 3D shape perception

  • the same is true for N-D shape perception
  • help perception by illustrative motion blur
slide-41
SLIDE 41

Dynamic Scatterplots

Elemental component is the polygonal touchpad

  • allows navigation of projection plane in N-D space
  • get axis vectors using generalized barycentric interpolation

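Touchpad axes: x-axis, y-axis

$$p = \sum_{i=1}^{N} a_i v_i, \quad \text{where } a_i = \frac{w_i}{\sum_{k=1}^{N} w_k}$$

Weight for vertex $v_3$ (the cotangent arguments are angles marked in the slide figure):

$$w_3 = \frac{\cot(\cdot) + \cot(\cdot)}{\| p - v_3 \|^2}$$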

Garg, Nam, Ramakrishnan, Mueller, IEEE VAST 2008
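
A minimal sketch of generalized barycentric weights on a convex polygon, assuming the standard cotangent form in which the two angles for vertex v_i are measured at v_i between the direction to p and the two adjacent polygon edges. The function names are hypothetical, and how the resulting a_i are mapped to N-D axis vectors in the N-D Navigator is not shown on the slide.

```python
def cot_angle(u, v):
    """Cotangent of the angle (between 0 and pi) spanned by 2-D vectors u and v."""
    dot = u[0] * v[0] + u[1] * v[1]
    cross = abs(u[0] * v[1] - u[1] * v[0])
    return dot / cross

def barycentric_weights(p, vertices):
    """Normalized weights a_i for a point p strictly inside a convex polygon."""
    n = len(vertices)
    w = []
    for i in range(n):
        v = vertices[i]
        prev_v = vertices[i - 1]
        next_v = vertices[(i + 1) % n]
        to_p = (p[0] - v[0], p[1] - v[1])
        to_prev = (prev_v[0] - v[0], prev_v[1] - v[1])
        to_next = (next_v[0] - v[0], next_v[1] - v[1])
        dist_sq = to_p[0] ** 2 + to_p[1] ** 2
        # w_i = (cot(angle to previous edge) + cot(angle to next edge)) / ||p - v_i||^2
        w.append((cot_angle(to_p, to_prev) + cot_angle(to_p, to_next)) / dist_sq)
    total = sum(w)
    return [wi / total for wi in w]  # a_i = w_i / sum_k w_k

# A square touchpad with the pointer left of center.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
a = barycentric_weights((0.25, 0.5), square)
```

The weights sum to one and reproduce the pointer position as a combination of the vertices, so moving the pointer toward a vertex smoothly increases that vertex's share.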

slide-42
SLIDE 42

Application: Cluster Analysis

Step 1:

  • dimension reduction using subspace clustering

Step 2:

  • visit each subspace
  • initialize projective view using projection pursuit
  • set up touchpad

Step 3:

  • lift-off…

Nam, Mueller, (submitted) IEEE TVCG, 2010

slide-43
SLIDE 43

Video

slide-44
SLIDE 44

Locating Interesting Patterns – Dynamic Display

Initial view: all packets have source port 80.

Garg, Nam, Ramakrishnan, Mueller, VAST 2008

slide-45
SLIDE 45

Locating Interesting Patterns – Dynamic Display

Random Coloring

slide-46
SLIDE 46

Locating Interesting Patterns – Dynamic Display

Zooming

slide-47
SLIDE 47

Locating Interesting Patterns – Dynamic Display

Moving the Y axis between the Src_IP and Time dimensions

Same color: same Src_IP and Dest_IP

slide-48
SLIDE 48

Locating Interesting Patterns – Dynamic Display

To overcome the overlap, twist the X-axis a bit. Separate different packet groups.

slide-49
SLIDE 49

Locating Interesting Patterns – Dynamic Display

What are we looking for?

  • patterns for webpage loading
  • packets exchanged between the same Src IP and Dest IP in a short time period

slide-50
SLIDE 50

Locating Interesting Patterns – Dynamic Display

  • Select interesting packets
  • Highlight them

slide-51
SLIDE 51

Locating Interesting Patterns – Dynamic Display

Confirm that selected packets are spreading over time
slide-52
SLIDE 52

Locating Interesting Patterns – Dynamic Display

Twist the view to separate overlapped packets

slide-53
SLIDE 53

Locating Interesting Patterns - Full View

slide-54
SLIDE 54

Learn the Model

Use Inductive Logic Programming (Prolog) to formulate an initial model (rule):

webpage_load(X) :- same_src_ips(X), same_dest_ips(X), same_src_port(X,80), timeframe_upper(X,10).

Classify other data points with this rule and visualize the result.

Marking negative examples yields an updated/refined rule:

webpage_load(X) :- same_src_ips(X), same_dest_ips(X), same_src_port(X,80), timeframe_upper(X,10), length(X,L), greaterthan(L,8).

Garg, Nam, Ramakrishnan, Mueller, VAST 2008
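
To make the refined rule concrete, here is one possible Python reading of it as a predicate over a group of packets. It only evaluates the rule; it does not perform the inductive learning step. The dictionary keys (`src_ip`, `dest_ip`, `src_port`, `time`) and the interpretation of `timeframe_upper(X,10)` as "the group spans at most 10 time units" are assumptions made for this sketch.

```python
def webpage_load(packets, max_span=10, min_length=8):
    """Evaluate the refined rule for one candidate group of packets."""
    if not packets:
        return False
    times = [p["time"] for p in packets]
    return (
        len({p["src_ip"] for p in packets}) == 1        # same_src_ips(X)
        and len({p["dest_ip"] for p in packets}) == 1   # same_dest_ips(X)
        and all(p["src_port"] == 80 for p in packets)   # same_src_port(X,80)
        and max(times) - min(times) <= max_span         # timeframe_upper(X,10)
        and len(packets) > min_length                   # length(X,L), greaterthan(L,8)
    )

def make_group(n, src_port=80):
    """Hypothetical helper: n packets between one source and one destination."""
    return [{"src_ip": "10.0.0.1", "dest_ip": "10.0.0.2",
             "src_port": src_port, "time": t} for t in range(n)]

long_group = make_group(9)    # satisfies every condition
short_group = make_group(5)   # fails only the length condition added by refinement
```

The last condition is the part contributed by the negative examples: short bursts that matched the initial rule are now rejected.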

slide-55
SLIDE 55

Parallel Coordinates

a car as a 7-dimensional data point

slide-57
SLIDE 57

Illustrative Parallel Coordinates

  • Traditional parallel coordinates plot
  • Illustrative parallel coordinates plot

slide-58
SLIDE 58

Technique 1: Edge Bundling

Reduce clutter by replacing poly-lines with poly-curves (color indicates cluster membership):

McDonnell, Mueller, Computer Graphics Forum. 2008

slide-59
SLIDE 59

Edge Bundling (cont.)

The user can change the tension to control the amount of clutter reduction

Examples of low and medium tension, respectively:

slide-60
SLIDE 60

Technique 2: Cluster Rendering

In traditional PC, clusters are often rendered as heavy line segments on top of the dataset

  • in IPC we render the clusters as polygonal meshes
  • helps to show the ranges of each cluster along axes
slide-61
SLIDE 61

Technique 3: Opacity Hints

  • Allows context to be preserved
  • Important clusters can be made more opaque

slide-62
SLIDE 62

Technique 4: Branched Clusters

To illustrate the distribution of the data along each axis, it is possible to split the clusters

Branches provide an alternative to the display of histograms for visualizing data distributions

slide-63
SLIDE 63

Branched Clusters (cont.)

A parameter allows one to tune the visualization and change the minimum branch thickness

slide-64
SLIDE 64

Technique 5: Per-Cluster Histograms

Histograms are typically used in parallel coordinate plots to show distributions along individual axes

We introduce the idea of using histograms on a per-cluster basis to reveal the distribution within each cluster
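
As a rough illustration of the per-cluster idea, the sketch below bins the values along one axis separately for each cluster label. The function name, the equal-width binning, and the toy data are assumptions; the slide does not specify how the histograms are computed or drawn.

```python
from collections import defaultdict

def per_cluster_histograms(values, labels, bins, lo, hi):
    """Count the values of one axis per cluster, using equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins
    histograms = defaultdict(lambda: [0] * bins)
    for value, label in zip(values, labels):
        index = min(int((value - lo) / width), bins - 1)  # clamp hi into the last bin
        histograms[label][index] += 1
    return dict(histograms)

axis_values = [1, 2, 8, 9, 9.5]
cluster_labels = ["a", "a", "b", "b", "a"]
hist = per_cluster_histograms(axis_values, cluster_labels, bins=2, lo=0, hi=10)
```

A single histogram over all five values would hide that cluster "a" is spread across both bins while cluster "b" sits only in the upper one.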

slide-65
SLIDE 65

One More Flavor …

Lots of unstructured data on the web

We need to add structure to:

  • make it machine readable
  • reason with it

Humans can easily segment:

  • references into author, title, etc.
  • images into objects
  • videos into scenes
slide-66
SLIDE 66

Machine Learning Approaches

Supervised learning

  • requires large amounts of user-tagged data
  • further, data is dynamic
  • we might need to supplement the tagged data

Automatic learning [Raina 2007]

  • highly time intensive

slide-67
SLIDE 67

Semi-Automatic Visual Learning

Keep the user in the learning loop, but:

  • allow interaction with data as a whole

Use clustering methods to visually group similar objects

  • helps the user mark an entire set as one category

In absence of feature vectors for a given data set

  • identify important features
  • allow user to adjust relative weights

→ Visual Active Learning

Garg, Ramakrishnan, Mueller, VAST 2010

slide-68
SLIDE 68

A Good Feature Vector Is Key

Given a good feature vector:

  • similar points will be close-by in feature vector space

If tokens in a dataset don’t have an explicit feature vector, create one based on:

  • structure
  • context
  • location
  • semantics

Semantics can also simplify the problem

  • e.g. in an address dataset, all numbers of the same length are interchangeable
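
One simple way to realize the "numbers of the same length are interchangeable" idea is to map every digit to a placeholder before computing features. This is only a sketch of that simplification; the function name and the choice of `9` as placeholder are assumptions, not taken from the slides.

```python
import re

def normalize_numbers(token):
    """Replace every digit with '9' so that equal-length numbers share one form."""
    return re.sub(r"\d", "9", token)

street_numbers = [normalize_numbers(t) for t in ["1344", "2481", "403"]]
```

After normalization, "1344" and "2481" collapse to the same token, while "403" stays distinct because its length differs.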

slide-69
SLIDE 69
  • 1. Feature-vector calculation
  • 2. Distance matrix calculation
  • 3. Graph layout
  • 4. UI: Modify feature vector weights
  • 5. Cluster data
  • 6. UI: Sculpt clusters
  • 7. UI: Name clusters
  • 8. Train HMM
  • 9. UI: Resolve inconsistencies
  • 10. Re-cluster data

Stages: Preprocess data, Model initialization stage, Model refinement stage

Example categories: PHONE, STATE, COMPANY, STREET, CITY

slide-73
SLIDE 73

Hidden Markov Model (HMM)

Statistical model used for data segmentation

Contains

  • Set of (hidden) states S
  • Set of observations W
  • Transition model: P(s_t | s_{t−1})
  • Emission model: P(w | s)
slide-74
SLIDE 74

HMM

Baum-Welch algorithm learns the model given:

  • transition probabilities
  • emission probabilities
  • set of observations

Requires hand-tagged data

Gets infeasible with data size

Our solution:

  • cluster the data based on feature vectors
  • tag coherent data groups as a whole
  • tag ambiguous data one by one
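
Once tokens carry category tags (whether assigned group-wise or one by one), the transition and emission tables can be estimated by simple counting. The sketch below shows that supervised counting step; it is not Baum-Welch and not necessarily how the authors train their HMM. The function name, the `START` pseudo-state, and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sequences):
    """Relative-frequency estimates of P(s_t | s_{t-1}) and P(w | s) from tagged tokens."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sequence in tagged_sequences:
        previous = "START"
        for token, state in sequence:
            transitions[previous][state] += 1
            emissions[state][token] += 1
            previous = state

    def normalize(table):
        # Turn each row of counts into probabilities that sum to one.
        return {key: {item: count / sum(counts.values())
                      for item, count in counts.items()}
                for key, counts in table.items()}

    return normalize(transitions), normalize(emissions)

tagged = [
    [("Main", "STREET"), ("St", "STREET"), ("Bronx", "CITY")],
    [("Park", "STREET"), ("Yonkers", "CITY")],
]
trans_p, emit_p = estimate_hmm(tagged)
```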
slide-75
SLIDE 75

HMM: Text Segmentation

Viterbi algorithm

  • returns most probable sequence of states <COMPANY, STREET, CITY, STATE, PHONE>

Input:

  • The Grand America Hotel 555 South Main Street Salt Lake City UT (800)621-4505

Output:

  • The Grand America Hotel, 555 South Main Street, Salt Lake City, UT, (800)621-4505
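
For reference, a compact textbook-style Viterbi decoder is sketched below. The three-state model, its probabilities, and the `floor` value used for unseen transitions and emissions are all made up for illustration; they are not the parameters learned in the presented system.

```python
def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most probable state sequence for the given observations."""
    # best[t][s] = (probability of the best path ending in s at step t, predecessor)
    best = [{s: (start_p.get(s, floor) * emit_p[s].get(observations[0], floor), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][r][0] * trans_p[r].get(s, floor), r)
                             for r in states)
            column[s] = (prob * emit_p[s].get(obs, floor), prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

states = ["COMPANY", "STREET", "CITY"]
start_p = {"COMPANY": 1.0}
trans_p = {"COMPANY": {"COMPANY": 0.5, "STREET": 0.5},
           "STREET": {"STREET": 0.6, "CITY": 0.4},
           "CITY": {"CITY": 1.0}}
emit_p = {"COMPANY": {"Grand": 0.5, "Hotel": 0.5},
          "STREET": {"555": 0.4, "Main": 0.3, "Street": 0.3},
          "CITY": {"Salt": 0.3, "Lake": 0.3, "City": 0.4}}
tokens = ["Grand", "Hotel", "555", "Main", "Street", "Salt", "Lake", "City"]
path = viterbi(tokens, states, start_p, trans_p, emit_p)
```

Segment boundaries fall wherever the decoded state changes, which is how the commas in the output line above would be placed.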

slide-76
SLIDE 76

Preprocessing − Windowing Approach

Window 1 | Window 2 | Window 3 | Window 4 | Window 5

  • 1 Hour Auto Glass Inc 403 West St New York NY (212)
  • 4 Star Auto Sound & Sec Inc 2481 Central Park Ave Yonkers NY (914)
  • 1 Hour Photo & Copy Center 2140a White Plains Rd Bronx NY (718)
  • Westfield Agency Inc 105 E Main St Westfield NY (716)
  • A C P 65-09 Brook Av Deer Park NY (516)
  • A A M C A R 303 W 96th St New York NY (212)

slide-77
SLIDE 77

Windowing Approach

Window 1 | Window 2 | Window 3 | Window 4 | Window 5

  • 1 Hour Auto Glass Inc 403 West St New York NY (212)
  • 4 Star Auto Sound & Sec Inc 2481 Central Park Ave Yonkers NY (914)
  • 1 Hour Photo & Copy Center 2140a White Plains Rd Bronx NY (718)
  • Westfield Agency Inc 105 E Main St Westfield NY (716)
  • A C P 65-09 Brook Av Deer Park NY (516)
  • A A M C A R 303 W 96th St New York NY (212)

Highlighted: feature vector for “Auto”
slide-78
SLIDE 78

Feature Vectors in a Text Dataset

Structure

  • What type of characters does the token contain

Context

  • What type of words does it occur before/after

Location

  • At what positions (windows) does it occur in the dataset

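The final feature vector stores a summary of all the occurrences of a given token

Structure features (empty cells shown as -):

Word    | Has letter | Has digit | Has symbol | Has caps | All caps | Length
Liberty | 1          | -         | -          | 1        | -        | 7
1-2-3   | -          | 1         | 1          | -        | -        | 5

Context features for the neighbors of “Liberty”:

Word    | Neighbors   | Has letter | Has digit | Has symbol | Has caps | All caps | Length 1-3 | Length 4-6 | Length 7+
Liberty | Av.         | 1          | -         | 1          | 1        | -        | 1          | -          | -
        | Avenue      | 1          | -         | -          | 1        | -        | -          | 1          | -
        | 1344        | -          | 1         | -          | -        | -        | -          | 1          | -
        | A-1         | 1          | 1         | 1          | 1        | -        | 1          | -          | -
        | Final F-vec | 3          | 2         | 2          | 3        | -        | 2          | 2          | -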
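
The structure and context features in the tables could be computed along these lines. The function and key names are hypothetical, and "all caps" is taken here to mean that every character is an uppercase letter, which is an assumption chosen to be consistent with the example values on the slide.

```python
def structure_features(token):
    """Binary character-type flags plus the token length."""
    return {
        "has_letter": int(any(c.isalpha() for c in token)),
        "has_digit": int(any(c.isdigit() for c in token)),
        "has_symbol": int(any(not c.isalnum() for c in token)),
        "has_caps": int(any(c.isupper() for c in token)),
        "all_caps": int(bool(token) and all(c.isupper() for c in token)),
        "length": len(token),
    }

def context_features(neighbors):
    """Sum the binary flags and length bins over all neighbors of a token."""
    flags = ("has_letter", "has_digit", "has_symbol", "has_caps", "all_caps")
    summary = {name: 0 for name in flags}
    summary.update({"len_1_3": 0, "len_4_6": 0, "len_7_plus": 0})
    for neighbor in neighbors:
        features = structure_features(neighbor)
        for name in flags:
            summary[name] += features[name]
        if features["length"] <= 3:
            summary["len_1_3"] += 1
        elif features["length"] <= 6:
            summary["len_4_6"] += 1
        else:
            summary["len_7_plus"] += 1
    return summary

liberty = structure_features("Liberty")
liberty_context = context_features(["Av.", "Avenue", "1344", "A-1"])
```

With the four neighbors from the slide, the summed vector comes out as 3, 2, 2, 3, 0, 2, 2, 0, matching the "Final F-vec" row if its blanks are read as zeros.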

slide-79
SLIDE 79

Distance matrix

Given feature vectors, calculate all pairs of distances

User modifiable
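
A minimal sketch of an all-pairs distance computation with user-adjustable per-feature weights. The weighted Euclidean metric and the function name are assumptions; the slides do not state which distance is used or exactly what the user can modify.

```python
import math

def distance_matrix(vectors, weights):
    """Symmetric matrix of weighted Euclidean distances between feature vectors."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            squared = sum(w * (a - b) ** 2
                          for w, a, b in zip(weights, vectors[i], vectors[j]))
            matrix[i][j] = matrix[j][i] = math.sqrt(squared)
    return matrix

vectors = [[0, 0], [3, 4]]
equal_weights = distance_matrix(vectors, [1, 1])
first_feature_only = distance_matrix(vectors, [1, 0])
```

Setting a weight to zero removes that feature from the comparison, so tokens that differ only in that feature move together in a distance-based layout.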

slide-80
SLIDE 80

Token Visualization: Random Layout

slide-81
SLIDE 81

Token Visualization: Distance Based Layout

slide-82
SLIDE 82

Token Visualization: User Assigned Categories

Ambiguous data

slide-83
SLIDE 83

Token Visualization: Disambiguation

Window 1 | Window 2 | Window 3 | Window 4 | Window 5

  • Corte Salon 1019 U St NW 2nd Fl Washington DC 20001
  • Glover Park Hardware 2251 Wisconsin Ave NW Washington DC 20007
  • Laura Bee Designs 6418 20th Ave NW Seattle Washington 98107
  • Bob’s Quality Meats 4861 Rainier Avenue S Seattle Washington 98118

slide-84
SLIDE 84

Token Visualization: Disambiguation

slide-85
SLIDE 85

Results: Address Data Set

Segmenting an address dataset of NY businesses

slide-86
SLIDE 86

Initial Layout

slide-87
SLIDE 87

Layout After Tweaking Feature Vector Weights

slide-88
SLIDE 88

Zooming In

slide-89
SLIDE 89

Layout After Clustering Using Markov Cluster Algorithm

slide-90
SLIDE 90

Cluster Naming Using Inner Core

slide-91
SLIDE 91

Cluster Editing

If the clusters don’t lend themselves to categories

  • re-cluster using a different refinement level

The user can modify the clusters as follows:

  • merge clusters
  • split clusters
  • create a new cluster using nodes from multiple clusters
  • name the clusters
slide-92
SLIDE 92

Cluster Editing

slide-93
SLIDE 93

Cluster Editing

slide-94
SLIDE 94

Debugging

Show entries with ambiguously labeled tokens

This involves tokens that:

  • belong to multiple categories
  • occur on border of 2 categories

The visualization steps through the entry showing the class assigned to each token

slide-96
SLIDE 96

Current Work

Application to Health Analytics

  • decision support for emergency room physicians

Zhang, et al. VAHC 2010

slide-97
SLIDE 97

Thanks

Support from NSF, NIH, DOE, BNL, PNL, CEWIT

Collaborators:

  • Dr. Alla Zelenyuk, Dr. Dan Imre (formerly BNL, now PNL)
  • Dr. I.V. Ramakrishnan (Stony Brook University)
  • Dr. Kevin McDonnell (Dowling College)

MS/PhD Students

  • Peter Imrich, Yiping Han, Julia EunJu Nam, Supriya Garg, Hyunjung Lee, Zhiyuan Zhang

More information at http://www.cs.sunysb.edu/~mueller