VISUAL DATA MINING MODELS FOR ENHANCING THE KNOWLEDGE EXTRACTION

slide-1
SLIDE 1

1

VISUAL DATA MINING MODELS FOR ENHANCING THE KNOWLEDGE EXTRACTION

e-business intelligence lab www.e-bi.gr

  • Dr. Ioannis Kopanakis

kopanak@e-bi.gr Assistant Professor Head, Dept. of Commerce & Marketing Technological Educational Institute of Crete Scientific Director e-Business Intelligence Lab Center for Technological Research - Crete

slide-2
SLIDE 2

2

Visual Senses

  • The visual senses for humans have a unique status, offering a very broadband channel for information flow.
  • Visual approaches to analysis and mining attempt to

take advantage of our abilities to perceive pattern and structure in visual form and to make sense of, or interpret, what we see.

  • Visual data mining techniques have proven to be of

high value in exploratory data analysis and they also have a high potential for mining large databases.

  • In this presentation, we try to investigate the area of

visual data mining.

slide-3
SLIDE 3

3

From Data to Information

  • Having the right information at the right time is crucial for

making the right decisions.

  • Because of the fast technological progress, the amount of

information, which may be of interest for making decisions, increases very fast [Kei96a].

  • One reason for the ever increasing stream of data is the

automation of activities in all areas, including business, engineering, science, and government.

  • But finding the valuable information hidden in them is like searching for a needle in a haystack.

  • The process of searching and analyzing large amounts of data

is called “data mining”.

  • The large collections of data are the potential lodes of valuable

information but like in real mining, the search and extraction can be a difficult and exhaustive process [Kei94a].

slide-4
SLIDE 4

4

Data Mining

  • Data mining is a knowledge discovery process of

extracting previously unknown, actionable information from very large databases.

  • In detail, it is the nontrivial extraction of implicit,

previously unknown and potentially useful information from data.

  • In other words, it is the search for relationships and

global patterns that exist in large databases, but are “hidden” among the vast amounts of data.

  • These relationships represent valuable knowledge

about the database and objects in the world [Fra92].

slide-5
SLIDE 5

5

Data Mining II

  • Data mining is the efficient and possibly unsupervised

discovery of interesting, useful and previously unknown patterns in a data warehouse, which is a historical database, designed to facilitate analysis and knowledge discovery [Gan96].

  • Common patterns of interest include classification,

associations, clustering, and sequential patterns.

  • The success of the data mining process is critically dependent

upon the availability of user insights and biases, even though the process may use unsupervised learning algorithms.

  • In some sense, data mining is like the work of radiologists. It is

like scanning the database to identify phenomena that need to be looked at, showing the regular structure of the data but also helping to find anomalies.

slide-6
SLIDE 6

6

Data Mining Process

slide-7
SLIDE 7

7

Data Preparation Stage

  • improving the data quality
  • summarizing the data to facilitate the analysis and

discovery process.

  • on either operational databases or on a data

warehouse

  • The quality of the data in the data warehouse is

constantly monitored by data analysts.

  • Due to the heterogeneity and non-standard policies

enforced on data quality at the different source databases, the warehouse data is usually cleaned or standardized via data scrubbing.

slide-8
SLIDE 8

8

Model Derivation Stage

  • focuses on choosing learning samples, testing samples and

learning algorithms

  • a suitable sample set is selected which forms the training data

for the data mining algorithm.

  • The data mining process in this stage is viewed as the

derivation of an interesting representative knowledge model.

  • The algorithm for model derivation, together with the guidance

provided by the user, will generally produce several models of the information contained in the data

  • The data mining algorithms use guidance from the analyst to

decide various parameters of the model derived, such as its accuracy and prevalence, controlling the computational complexity of the learning process
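
The choice of learning and testing samples described above can be sketched as a seeded random split. This is a minimal illustration only; the function name `split_samples`, the 30% test fraction and the fixed seed are assumptions made for the example and do not come from the presentation.

```python
import random

def split_samples(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into learning and testing samples."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

learning, testing = split_samples(range(10), test_fraction=0.3)
print(len(learning), "learning records,", len(testing), "testing records")
```

Keeping the testing sample disjoint from the learning sample is what allows the derived model to be evaluated on records it has not seen.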

slide-9
SLIDE 9

9

Validation Stage

  • monitoring of database updates and continued validation of

patterns learned in the past.

  • not all the knowledge models generated will have business

applications.

  • continuously monitor the validity of the knowledge models in the

context of changes to data in the warehouse.

  • When the population in the warehouse shifts significantly, the

previously learned models will no longer be applicable, and new models will have to be derived.

  • We may also be able to learn new models incrementally from

the new data.
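
One simple way to notice that the population in the warehouse has shifted is to compare a summary statistic of new data against the data the model was learned from. The sketch below is an assumption made for illustration (a mean-shift rule with a hypothetical threshold of two standard deviations); the presentation does not prescribe a specific test.

```python
from statistics import mean, stdev

def population_shifted(reference, current, threshold=2.0):
    """Return True when the mean of `current` lies more than `threshold`
    reference standard deviations away from the mean of `reference`."""
    spread = stdev(reference)
    if spread == 0:
        # Degenerate reference data: any change of the mean counts as a shift.
        return mean(current) != mean(reference)
    return abs(mean(current) - mean(reference)) / spread > threshold

reference = [10, 11, 9, 10, 10]   # made-up attribute values seen at learning time
print(population_shifted(reference, [10, 10, 11]))
print(population_shifted(reference, [15, 16, 14]))
```

When such a check fires, the previously learned models would be candidates for re-derivation or incremental updating.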

slide-10
SLIDE 10

10

Visual Data Mining

  • Visual data mining can be related to all three data mining sub-modules.

Visual data mining involves the invention of visual representations that would enhance information and knowledge flow throughout each module of the data mining process.

slide-11
SLIDE 11

11

VDM on the Data Preparation Stage

  • enhance or carry out tasks of the pre-processing module in a

visual manner

  • visual manipulation of data and handle problems such as

missing data fields, data transformations, data sampling and pruning

  • visualization can be used as a summarization tool for the

human user to gain an overview of data sets

  • enables him to formulate accurate hypotheses and objectives
  • select carefully only the relevant and useful data to be sampled

and extracted for data pre-processing

  • Thus, visual data mining in the field of data preparation could serve as a channel for the inflow of relevant domain knowledge and human decisions, which could help optimize these otherwise laborious data pre-processing activities [Law01].
slide-12
SLIDE 12

12

VDM on the Model Derivation Stage

  • utilization of visualization techniques for enhancing the

constituting steps of this module:

– visual evaluation, monitoring and guidance of the model derivation module.

  • Evaluation

– validation of training samples, test samples, and learned models against the data in the database plus the appropriateness of data and learning algorithms for specific data mining situations.

  • Monitoring

– tracking the progress of the data mining algorithms, evaluating the continued relevance of learned patterns in the context of database updates, etc.

  • Guidance

– user-initiated biasing or altering of inputs, learned patterns and other system decisions
slide-13
SLIDE 13

13

VDM on the Model Derivation Stage

  • getting the insights needed for true understanding and

comprehension of the actual model derived

  • That can only be obtained by ensuring that humans can

actually visualize models in order to help understand the “black box” functions learned with neural nets and complex rule based classifiers [Fay01].

  • Visualizing a model should allow a user to discuss and explain the logic behind the model with colleagues, customers, and other users.

  • Getting buy-in on the logic or rationale is part of building users’ trust in the results.

  • If the user can understand what has been discovered in the

context of business issues, he/she will trust it and put it into use.

  • Unfortunately, users are often forced to trade off accuracy of a

model for understandability.

slide-14
SLIDE 14

14

VDM on the Model Derivation Stage

  • Advanced visualization techniques can greatly expand the

range of models that can be understood by domain experts

  • Three components are essential for understanding a model:

representation, interaction and integration.

  • Representation refers to the visual form in which the model appears. A good representation displays the model in terms of visual components that are already familiar to the user.

  • Interaction refers to the ability to see the model in action in real

time, to let the user play with the model as if it was a machine.

  • Integration refers to the ability to display relationships between

the model and alternate views of the data on which it is based.

  • Integration provides the user context [The01].
slide-15
SLIDE 15

15

VDM on the Validation Stage

  • assist the knowledge engineer to acquire enhanced

knowledge by the visualization of outcomes produced by data mining processes.

  • the results of data mining algorithms representing associations, relevancies and classifications are in a form that is difficult for humans to understand

  • In that context, visual data mining on the validation stage could be defined as the graphical presentation of data, whether the data is base data, summary data, or mined outcomes extracted from data.
  • This is a type of visual data analysis, where the

analytic component is offloaded to human perception [Kei95a].

slide-16
SLIDE 16

16

VDM on the Validation Stage

  • The basic idea of visual data mining in this field is to produce information-rich visualization outcomes that are easily perceived by humans.

  • The difficulty in producing new visualization models lies in balancing those two factors (information richness and ease of perception) while also increasing the amount and quality of the knowledge that is extracted and easily acquired.

  • In the context of data mining, an additional difficulty to deal with

is the size of the data set to be visualized.

  • Human beings look for structure, features, patterns, trends,

anomalies, and relationships in data.

  • Visual data mining should support this by presenting the data in

various forms and perspectives.

  • A visual representation can provide a qualitative overview and

can assist in identifying regions of interest and appropriate parameters for more focused quantitative analysis.

  • In an ideal system, visualization techniques should harness the perceptual capabilities of the human visual system [Fay01].

slide-17
SLIDE 17

17

Importance of VDM

  • In 1854, while searching for ideas to bring under control a cholera epidemic raging in London, Dr. J. Snow drew dots on a map of the neighbourhood at the locations of the recorded deaths.

  • The map also showed the positions of the drinking water wells. The concentration of deaths near just one of the wells was visually striking.

  • He had the handle of the suspect well's pump removed and the epidemic subsided. The disease was being transmitted through the well's contaminated water.

  • This true story is widely considered as an early

success of visualization [Ins01].

slide-18
SLIDE 18

18

Role of VDM I

  • Visual data mining is the link between the two most powerful

information processing systems: humans and the modern computer

  • Humans

– limited ability to handle scale
– easily overwhelmed by the volumes of data
– unmatched abilities of perception enable them to analyse complex events within a short time, recognize important information, and make decisions
– the perceptual system processes different types of data in a very flexible way, automatically recognizing unusual properties while at the same time ignoring well-known properties
– handle vague descriptions and imprecise knowledge and, using general knowledge, easily draw complex conclusions

slide-19
SLIDE 19

19

Role of VDM II

  • On the other hand, most data mining techniques work fully

automatically but need to have a-priori defined tasks.

  • The tasks are a specific type of hypothesis and the goal of the algorithms is to find quantitative rules that make the hypotheses more specific and allow the user to confirm or reject them.

  • Task-oriented data mining is important but it is also

important to develop techniques for data-driven hypotheses generation.

  • For this purpose, it is necessary to include the human in the data mining process and combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today’s computers.
slide-20
SLIDE 20

20

Human & Computer Abilities

Abilities of the computer: Data Storage, Numerical Computation, Searching
Human's abilities: Perception, Creativity, General Knowledge, Logic, Planning, Diagnosis, Prediction

slide-21
SLIDE 21

21

The Utilization of VDM

  • We recognize that visualization is a very effective means of incorporating human participation.

  • This will open up channels for the infusion of domain

knowledge into the knowledge discovery process whereby relevant, useful, and important background business information and domain heuristics can be fully exploited to guide KDD.

  • Visualization can be used as a data mining technique on its own, or it can be an effective front-end tool in synergy with other data mining methods, within a hybrid, interactive, and cooperative KDD framework in a complementary fashion.

  • The result is a more powerful and synergistic approach to data exploration and discovery.

slide-22
SLIDE 22

22

The Utilization of VDM II

  • Intermediate results generated by one data mining

technique will be analyzed by a human analyst, who will subsequently decide how to proceed, perhaps using another technique.

  • This may involve iterative reselection of data sets and

attributes, and any pre-processing steps necessary.

  • New human hypotheses may be postulated along the

way, based on initial rounds of data mining results, which may require further verification and testing through subsequent sessions of mining.

  • The iterative processes will terminate when the results

are deemed satisfactory

slide-23
SLIDE 23

23

Research Objectives

  • Harnessing the perceptual capabilities of humans,
  • include the human in the data mining process and
  • combine the flexibility, creativity, and general knowledge of the

human with the enormous storage capacity and the computational power of today’s computers.

  • Open up channels for the infusion of domain knowledge into the knowledge discovery process whereby relevant, useful, and important background business information and domain heuristics can be fully exploited to guide KDD through the utilization of Visual Data Mining.
  • Provide user friendly functionality which will not only be

restricted to the experts of the field

  • enhancing the general attempt to develop a data mining tool

that will be targeted to all interested users.

slide-24
SLIDE 24

24

VDM Techniques I

  • Line Graphs

– Line Graph, Multiple Line Graph.

  • Bar Charts

– Permutation Matrix, Survey Plots

  • Geometric Techniques

– Scatter Plots (Scatter Plot Matrix, HyperSlice), Prosection Views, Grand Tour, Parallel Coordinates, Landscapes, HyperBox

  • Icon-based Techniques

– Chernoff Faces, Star Glyphs, Stick Figures, Shape-Coding, Colour Icons

slide-25
SLIDE 25

25

VDM Techniques II

  • Pixel-oriented Techniques

– Query Independent Techniques: Simple, Space-Filling Curves, Recursive Pattern
– Query Dependent Techniques: Spiral, Axes, Circle Segment.

  • Hierarchical Techniques

– Dimensional Stacking, Worlds-within-Worlds, Treemap, Cone Trees, Info Cube

  • Graph-Based Techniques

– Basic Graphs: Straight-Line, Polyline, Curved-Line
– Specific Graphs: Cluster-Optimized, Directed-Acyclic, Orthogonal, Symmetry-Optimized
– 3D-Graphs: Ball, Torus

slide-26
SLIDE 26

26

VDM Techniques III

  • Hybrid & Dynamic Interaction Techniques

– Arbitrary combinations from above.
– Data-to-Visualization Mapping, Projections, Filtering (Selection, Querying), Linking & Brushing, Zooming, Detail on Demand, Data Pre-processing Techniques

  • Distortion Techniques

– Simple Distortion: Perspective Wall, Table Lens, Fisheye View.
– Complex Distortion: Hyperbolic Tree, 3D-Hyperbolic Representation

slide-27
SLIDE 27

27

Line Graphs

  • Line graphs are used for displaying single-valued or

piecewise continuous functions of one variable.

  • They are normally used for 2D (x, y) data representations

where the x value is not repeated.

  • A background grid can be used to help to determine the

absolute values of the points.

  • Multiple line graphs can be used or overlaid to show more

than two dimensions.

  • The fact that the first dimension (independent variable) is unique gives this dimension a special significance.

  • The independent variable typically represents the ordering of the data (e.g. time).
slide-28
SLIDE 28

28

Multiple Line Graph

  • multi-line graph of

the Car data set [Bla98].

  • The X-axis orders

the data records by type (American, Japanese and European).

  • Within each type,

the records are sorted by year.

  • The Y-axis represents the various dimensions
  • The offsets among the line plots are equal as the range of

each line plot is the same after the normalization
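
The normalization and equal offsets mentioned above can be sketched as follows. This is a minimal illustration, assuming min-max normalization to the range [0, 1] and an offset of one unit per line plot; the helper names and the sample values are hypothetical and are not taken from the Car data set.

```python
def min_max_normalize(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def stacked_line_plots(dimensions):
    """Map each dimension to y positions for a multi-line graph.

    Every dimension spans one unit after normalization, so shifting the
    i-th plot up by i units gives equal offsets between the line plots.
    """
    return {
        name: [i + y for y in min_max_normalize(values)]
        for i, (name, values) in enumerate(dimensions.items())
    }

# Made-up values for two dimensions.
plots = stacked_line_plots({"mpg": [10, 20, 30], "weight": [2000, 3000, 4000]})
print(plots)
```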

slide-29
SLIDE 29

29

  • Multiple bar graphs and

histograms can be used effectively in data mining.

  • You can use an array of

histograms to approximate the density functions of all dimensions of the data.

  • histogram matrix of the

Car data set [Bla98] dimensions for each class.

Bar Charts: Histogram Matrix
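
An array of histograms that approximates the density of each dimension can be sketched with equal-width bins. The function names, the bin count and the sample values below are assumptions made for illustration.

```python
def histogram(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

def histogram_array(columns, bins=5):
    """Build one histogram per dimension; `columns` maps a name to its values."""
    return {name: histogram(values, bins) for name, values in columns.items()}

print(histogram_array({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, bins=3))
```

Calling `histogram_array` once per class would give the rows of a histogram matrix like the one described above.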

slide-30
SLIDE 30

30

  • According to the

permutation matrix technique [Ber83] the heights of the bars correspond to the data values visualized.

Bar Charts: Permutation Matrix

the representation of the Iris Flower data set

  • By permuting or sorting the rows or columns, depending on

how the data is oriented, patterns in the data could be revealed.

slide-31
SLIDE 31

31

  • rotate the permutation matrix by 90 degrees, shorten each bar graph by 50%, and extend an equal bar graph on the other side of the axis.
  • adequate for the visualization of n-dimensional data
  • allows correlations between any two variables to be detected, especially when the data is sorted according to a particular criterion.
  • When colour designates the different classes, we can investigate which attributes are good at classifying the data.

Bar Charts: Survey Plots

  • The data is sorted by

the number of cylinders and then by the miles per gallon attributes
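
The construction of a survey plot (sort the records, then draw each value as a bar centred on its attribute axis) can be sketched as below. The function name, the record layout and the sample cars are hypothetical; only the sort order by cylinders and then miles per gallon follows the slide.

```python
def survey_plot(records, sort_keys, attributes):
    """Return the sorted records and, per record, centred bar extents.

    Each attribute value is min-max normalized to a bar length in [0, 1];
    half of the bar is drawn on each side of the attribute's axis.
    """
    ordered = sorted(records, key=lambda r: tuple(r[k] for k in sort_keys))
    ranges = {
        a: (min(r[a] for r in records), max(r[a] for r in records))
        for a in attributes
    }
    rows = []
    for record in ordered:
        row = {}
        for a in attributes:
            lo, hi = ranges[a]
            length = (record[a] - lo) / (hi - lo) if hi > lo else 0.0
            row[a] = (-length / 2, length / 2)
        rows.append(row)
    return ordered, rows

cars = [
    {"cylinders": 8, "mpg": 15},
    {"cylinders": 4, "mpg": 30},
    {"cylinders": 4, "mpg": 25},
]
ordered, rows = survey_plot(cars, ["cylinders", "mpg"], ["mpg"])
print([c["mpg"] for c in ordered], rows)
```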

slide-32
SLIDE 32

32

Line Graphs

  • The flexibility of those techniques makes them applicable in a variety of areas, including the data mining life cycle stages of data preparation and validation.

  • Possible cases, besides the direct visualization and mining of data, can be the sampling and pruning during the data preparation stage, or even the visualization of compatible data mining outcomes during the validation stage.

  • In the same context, an additional option that allows a much more flexible application framework is the expansion of those techniques, followed by the definition of analogous flexible mapping procedures that would correspondingly transform the mining outcomes.

slide-33
SLIDE 33

33

  • In the category of geometric projection techniques our

aim is to find “interesting” projections of multidimensional data sets [Spe99].

  • The class of geometric projection techniques includes

techniques of exploratory statistics such as principal component analysis, factor analysis, multidimensional scaling.

  • The basic underlying idea of any geometric projection

technique is to visualize geometric transformations and projections of the data in such a way that the structure, properties and patterns of the data set in the n-dimensional space will be revealed [Hin99] [Dhi98].

Geometric Projection Techniques

slide-34
SLIDE 34

34

  • Scatter plots are one of the oldest and most commonly used

methods to project high dimensional data to 2D.

  • pair-wise parallel projections are generated,
  • each one provides a general impression of the relationships

among the data visualized, within the context of the pair of dimensions selected

  • the projections are generally arranged in a grid structure to help

the user distinguish the dimensions associated

  • Many variations on the scatter plot have been developed to

increase the information content of the representation as well as provide tools to facilitate the data exploration

  • Some of these include rotating the data clouds, the use of different marking symbols to distinguish classes or overlapping points, and colour or shading for adding a third dimension.

Scatter-Plot

slide-35
SLIDE 35

35

  • A grid of 2D scatter plots is the standard means of

extending the scatter plot to higher dimensions.

  • If you have 10-dimensional data, a 10x10 array of

scatter plots is used to provide a visualization of each dimension versus every other dimension.

  • This is useful for looking at all possible two-way

interactions or correlations between dimensions.

  • The standard display quickly becomes inadequate for

high dimensions and user interactions of zooming and panning are needed to interpret the scatter plots effectively.

Scatter-Plot Matrix
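
The size of the grid can be checked with a short sketch: for n dimensions the full matrix has n × n panels, of which n(n-1)/2 show distinct pairs of dimensions. The function names and the dimension labels are hypothetical.

```python
from itertools import combinations

def scatter_matrix_cells(dimensions):
    """List the (row, column) dimension pair shown in every panel of the grid."""
    return [(row, col) for row in dimensions for col in dimensions]

def unique_pairs(dimensions):
    """List each unordered pair of distinct dimensions exactly once."""
    return list(combinations(dimensions, 2))

dims = [f"d{i}" for i in range(10)]
print(len(scatter_matrix_cells(dims)), "panels,", len(unique_pairs(dims)), "distinct pairs")
```

The quadratic growth in the number of panels is why the display becomes inadequate for high dimensions without zooming and panning.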

slide-36
SLIDE 36

36

  • scatter-plot matrix for the Car data set [Bla98].
  • Data for American cars is red, Japanese green and

European blue.

Scatter-Plot Matrix

  • Positive correlations

can be seen between horsepower and weight.

  • Negative correlations

can be seen between mpg, horsepower and weight.
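
The correlations read off the scatter-plot matrix can be quantified with the Pearson correlation coefficient. The sketch below uses made-up, perfectly linear values purely for illustration; they are not taken from the Car data set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical values chosen so that the sign of each correlation is obvious.
horsepower = [60, 90, 120, 150]
weight = [1800, 2400, 3000, 3600]
mpg = [38, 30, 22, 14]

print(pearson(horsepower, weight))  # positive
print(pearson(mpg, weight))         # negative
```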

slide-37
SLIDE 37

37

  • Advantages of scatter plots include ease of interpretation

and robustness to the size of the data set.

  • They help to find clusters, outliers, trends, and correlations

among the data visualized.

  • Brushing and coloured class points are used to gain

additional insights on the data.

  • Zooming, panning and jittering can be used to improve the visualization when too many points overlap or the resolution of the data causes many data points to lie at the same coordinate.

  • Glyphs, icons, colour and splatting are additional features or techniques that have been used to extend the usefulness of the scatter plot.

Comments on Scatter-Plots

slide-38
SLIDE 38

38

  • The 3D scatter plot with animation, different colours, different

shapes and interaction can extend the data mining capabilities to three, four, five or more dimensions.

  • Depending on the user interface, the insight into the higher

dimensions is rarely as good as with the standard 2D plot though.

  • One reason is that after two dimensions are used for the X and

Y axes, the other dimensions (Z axis, colour, shape, animation and so on) do not have equal effect on the visualization.

  • they do not have a robust behaviour as the dimensionality of

the data set increases.

  • The increasing dimensionality results in decreasing the screen

space provided for each projection.

Comments on Scatter-Plots II

slide-39
SLIDE 39

39

  • we map the n-dimensional

space onto the two display dimensions by using n equidistant axes, which are parallel to one of the display axes.

  • The axes correspond to the

dimensions and are linearly scaled from the minimum to the maximum value of the corresponding dimension.

Parallel Coordinate Technique

  • Each data item is presented as a polygonal line, intersecting each of

the axes at that point, which corresponds to the value of the considered dimension [Ins85] [Ins87].
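
The mapping of one data item to its polygonal line can be sketched as follows, assuming vertical axes placed at equal distances along the horizontal display dimension and a unit-sized drawing area. The function name and parameters are hypothetical.

```python
def parallel_coordinates_polyline(item, minima, maxima, width=1.0, height=1.0):
    """Return the (x, y) vertices of the polyline for one n-dimensional item.

    Axis i is a vertical line at x = i * spacing; the value of dimension i
    is scaled linearly from its minimum (y = 0) to its maximum (y = height).
    """
    n = len(item)
    spacing = width / (n - 1) if n > 1 else 0.0
    points = []
    for i, (value, lo, hi) in enumerate(zip(item, minima, maxima)):
        t = (value - lo) / (hi - lo) if hi > lo else 0.5
        points.append((i * spacing, t * height))
    return points

print(parallel_coordinates_polyline([5, 0, 10], [0, 0, 0], [10, 10, 10]))
```

Drawing one such polyline per data item, all against the same set of axes, produces the parallel coordinates display.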

slide-40
SLIDE 40

40

  • it is powerful in revealing a

wide range of data characteristics such as different data distributions and functional dependencies.

  • the number of the data items

that can be visualized concurrently is limited to about a thousand.

  • One nice property of this model is that the mapping of points from Cartesian coordinates to parallel coordinates is a highly structured mathematical transformation, and hence mathematical objects are mapped to mathematical objects

Parallel Coordinate Technique

slide-41
SLIDE 41

41

  • The basic idea of pixel-oriented techniques is to map each data

value to a coloured pixel and present the data values belonging to one attribute in separate windows [Kei96b] [Hin99].

  • Since in general these techniques use only one pixel per data

value, they allow the visualization of the largest amount of data (up to about 1,000,000 data values), possible on current displays.

  • If each data value is represented by one pixel, the main

question is how to arrange the pixels on the screen.

Pixel-Oriented Techniques

slide-42
SLIDE 42

42

  • All pixel-oriented techniques partition the screen into multiple

windows.

  • For data sets with n attributes (dimensions), the screen is

partitioned into n windows, one for each of the attributes.

  • In case of the query-dependent techniques, an additional

window is provided for the overall distance. Inside the windows, the data values are arranged according to the given overall sorting which may be

– data-driven for the query-independent techniques – or query-driven for the query-dependent techniques.

  • Correlations, functional dependencies, and other interesting

relationships among attributes may be derived by relating corresponding regions in the multiple windows.

Pixel-Oriented Techniques

slide-43
SLIDE 43

43

Query-Independent

  • The query-independent visualization techniques sort the data according to one or more attributes and use a screen-filling pattern to arrange the data values on the display [Kei96b].

  • Appropriate for data with a natural ordering (e.g. time series data)

slide-44
SLIDE 44

44

Query-Dependent

  • If there is no natural ordering of the data and the main goal is an interactive exploration of the database, we are more interested in feedback to some query.

  • In this case, we turn to the

query-dependent visualization techniques, which visualize the relevance of the data items with respect to a query.

  • The arrangement of the data items centers the most relevant

data items in the middle of the window, and less relevant data items are arranged in a spiral-shape to the outside
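
The spiral arrangement can be sketched by ranking the data items by their distance to the query and placing them along a rectangular spiral that starts at the centre of the window. The function names, the distance function and the sample values are assumptions made for illustration; this is not the exact layout of any particular tool.

```python
def spiral_positions(count):
    """Return `count` integer (x, y) offsets spiralling outward from (0, 0)."""
    x = y = 0
    dx, dy = 1, 0
    run_length, steps_in_run, turns = 1, 0, 0
    positions = []
    for _ in range(count):
        positions.append((x, y))
        x, y = x + dx, y + dy
        steps_in_run += 1
        if steps_in_run == run_length:
            steps_in_run = 0
            dx, dy = -dy, dx      # turn by 90 degrees
            turns += 1
            if turns % 2 == 0:    # the run grows after every second turn
                run_length += 1
    return positions

def arrange_by_relevance(items, distance):
    """Place the most relevant item (smallest distance) at the centre."""
    ranked = sorted(items, key=distance)
    return dict(zip(spiral_positions(len(ranked)), ranked))

# Query: values close to 4 are the most relevant.
layout = arrange_by_relevance([7, 3, 5, 4], lambda v: abs(v - 4))
print(layout)
```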

slide-45
SLIDE 45

45

The Circle Segments Technique

  • display the data dimensions as segments of a circle.
  • the circle is partitioned into n segments, each one representing

a data dimension.
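
The partitioning of the circle can be sketched by assigning each of the n dimensions an equal angular range. The function name is hypothetical, and equal-sized segments are an assumption of this example.

```python
from math import pi

def segment_angles(n):
    """Return the (start, end) angle in radians of each of n equal segments."""
    step = 2 * pi / n
    return [(i * step, (i + 1) * step) for i in range(n)]

for index, (start, end) in enumerate(segment_angles(4)):
    print(f"dimension {index}: {start:.3f} to {end:.3f} rad")
```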

slide-46
SLIDE 46

46

  • Cone Tree

Hierarchical Techniques

  • Treemap
  • hierarchical partitioning of the set into subspaces [Hin99].

strictly organize data according to a specific data structure, such as a tree or network, with progressive levels refining the display into subspaces.

slide-47
SLIDE 47

47

  • Directed Acyclic

Graph-Based Techniques

  • 3D Cluster-Optimized
  • attempt to produce “clever” representations of the graphs on

the screen. The mode of each technique depends on the features that we want to indicate and make clear.

slide-48
SLIDE 48

48

Hybrid Techniques

  • The visualization techniques presented could also be

combined, developing hybrid visual environments.

  • According to specific cases and requirements, we could

support a blend of selected techniques, encapsulated in a heterogeneous visualization system.

  • Hybrid visualization models attempt the integration

and use of multiple types of techniques in one or multiple windows, in order to enhance the expressiveness of each representation.

  • The overall solution of a visualization model covering all

stages of the data mining life cycle will actually be a hybrid combination of specific models.

slide-49
SLIDE 49

49

Distortion Techniques

  • In order to allow the representation of larger amounts of data, the visualization space is distorted, emphasizing only that portion of the image that has been selected [Hin99].

  • The distinctive advantage of these techniques is that we can zoom in on the area of interest without losing the whole view of the representation.

  • We actually manage to focus our attention on smaller portions of the information visualized in a bi-directional drill-down manner.

slide-50
SLIDE 50

50

Perspective Wall

  • The perspective wall integrates detailed and contextual views to support the visualisation of linearly structured information spaces, using interactive 3D animation to make the transition among the views smooth and coherent. The perspective wall folds a 2D layout into a 3D wall that smoothly integrates a central region for viewing details.

slide-51
SLIDE 51

51

Table Lens

  • provides a compact visualization of a table, spreadsheet or database with the possibility of viewing portions of the table in detail
  • The user, as if manipulating a magnifying lens, can produce zooming distortions of the selected rows

slide-52
SLIDE 52

52

  • A number of virtual reality (VR) paradigms have been applied to

information visualization.

  • Sometimes called Benediktine [Ben91] spaces (coined by

William Gibson in a science fiction story)

  • Benedikt talked about the mapping problem in terms of extrinsic

(three spatial coordinates and one time) and intrinsic (shape, colour, texture etc.) dimensions when creating worlds in cyberspace.

  • This mapping problem is even greater in the application of

virtual reality in information visualization, as there is no obvious mapping of the data dimensions to the virtual world’s spatial dimensions.

  • Interaction (movement in three dimensions) is a central part of

any VR visualization, requiring learning on the user’s part.

Virtual Reality

slide-53
SLIDE 53

53

  • Vineta, a VR display for document retrieval

PITS (Populated Information Terrains)

slide-54
SLIDE 54

54

  • RadViz

Radial Coordinate Technique

  • GridViz
slide-55
SLIDE 55

55

  • A circular version of the

parallel coordinates technique, in which the axes radiate from the centre of a circle and extend to the perimeter

  • Due to the asymmetry of

the lower data values from higher ones, certain patterns may be easier to detect with this visualization

Circular Parallel Coordinates

  • Iris flower data set
slide-56
SLIDE 56

56

  • each row represents one cluster, where the pie and the bar

charts represent active and supplementary fields.

IBM Intelligent Miner: Cluster Grouping - All Clusters View

  • Fields that have the greatest influence on forming the cluster are displayed on the left

  • The numbers down

the left side represent the cluster size as a percentage

slide-57
SLIDE 57

57

  • uses box plots arranged in rows and columns.
  • statistics about the clusters or groups

SGI MineSet: Cluster Visualizer

  • One row for each

attribute and each column represents a cluster.

  • The Population

column displays statistics for the dataset as a whole. Iris Flower data set

slide-58
SLIDE 58

58

  • Colour of the

arrow represents the lift of the rule,

  • arrow’s width

represents the confidence

  • the colour of the

node represents the support of the corresponding item set

IBM Intelligent Miner: Rules Graph

  • nodes to represent item sets and lines with arrows to represent

rules
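
The three quantities encoded in the rules graph (support, confidence and lift) can be computed from a list of transactions as sketched below, using their standard definitions. The function name and the sample transactions are hypothetical and are not related to IBM Intelligent Miner's own implementation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_c = sum(1 for t in transactions if c <= set(t))
    count_both = sum(1 for t in transactions if (a | c) <= set(t))
    if count_a == 0 or count_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = count_both / n                 # share of transactions with both item sets
    confidence = count_both / count_a        # share of antecedent transactions that also hold the consequent
    lift = confidence / (count_c / n)        # confidence relative to the consequent's base rate
    return support, confidence, lift

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(rule_metrics(baskets, {"bread"}, {"milk"}))
```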

slide-59
SLIDE 59

59

  • Each internal node represents a test on an attribute and each

branch represents the result of that test. Each leaf node represents a class

IBM Intelligent Miner: Classification Tree

  • classes in different

colours.

  • mouse over

reveals:

– split criterion – number of records – number of records corresponding to the class, – percentage of correctly classified records assigned to a node (purity)
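
The per-node figures listed above can be derived from the class labels of the records that reach a node. This sketch assumes that purity is the percentage of the node's records belonging to its majority class; the function name and the labels are hypothetical.

```python
from collections import Counter

def node_summary(labels):
    """Summarize the class labels of the records assigned to one tree node."""
    counts = Counter(labels)
    majority_class, class_records = counts.most_common(1)[0]
    return {
        "records": len(labels),
        "class": majority_class,
        "class_records": class_records,
        "purity": 100.0 * class_records / len(labels),
    }

print(node_summary(["a", "a", "a", "b"]))
```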

slide-60
SLIDE 60

60

SGI MineSet: Tree Visualizer

  • Displays hierarchical data in tree form, as a 3D landscape that reveals quantitative and relational characteristics.

  • Each level branches on the values of a different attribute.

  • Edges show the relationship of one set of data to its subsets.

  • Each node in the tree also shows a chart, which represents all the data in the sub-tree below.

  • The properties of that chart (height, colour, etc.) correspond to aggregations of data values, usually sums, averages, or counts.
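
The idea that every node summarises the whole sub-tree below it can be sketched with a recursive grouping. The record fields, the function name and the node layout are assumptions for illustration, and the sketch expects a non-empty list of records.

```python
def build_tree(records, attributes, measure):
    """Group records level by level; each node aggregates its whole sub-tree."""
    values = [r[measure] for r in records]
    node = {
        "count": len(values),
        "sum": sum(values),
        "average": sum(values) / len(values),
        "children": {},
    }
    if attributes:
        first, rest = attributes[0], attributes[1:]
        groups = {}
        for r in records:
            groups.setdefault(r[first], []).append(r)
        # Each distinct value of the attribute becomes one branch.
        for value, subset in groups.items():
            node["children"][value] = build_tree(subset, rest, measure)
    return node

# Invented sample records.
records = [
    {"region": "north", "product": "a", "sales": 10},
    {"region": "north", "product": "b", "sales": 30},
    {"region": "south", "product": "a", "sales": 20},
]
tree = build_tree(records, ["region", "product"], "sales")
```

The aggregates stored on each node are the kind of values that would drive the height or colour of the chart drawn at that node.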
slide-61
SLIDE 61

61

  • 3D tree models where nodes are represented as spheres.
  • The nodes have labels declaring the splitting attribute along with the percentage of the data belonging to that branch.

IBM Data Explorer: Decision Trees

  • Additional information is mapped to the size and colour of the node (sphere), depending on customization options.

  • The interactive 3D environment allows the exploration of the constructed tree representation, revealing detailed information.

slide-62
SLIDE 62

62

  • Data dimensions are assigned to the edges of a 3D cube, and scatter-plot projections are made accordingly on each side of the cube.

  • By rotating the cube we can explore the corresponding combination of attributes in each scatter-plot.

IBM Data Explorer: 3D Scatter-Plots

  • Adequate for exploring raw data.
slide-63
SLIDE 63

63

  • Scatter-plot matrix technique linked with simple bar and plot charts.

SAS Enterprise Miner: Scatter-Plots

  • Highly interactive interface and filtering functionality.

slide-64
SLIDE 64

64

  • Data points are represented in one-, two-, or three-dimensional scatter-plots.

  • Two more attributes may be mapped to sliders, allowing animation and fly-throughs.

SGI MineSet: Scatter Visualizer

  • An attribute can be mapped to a colour code to guide animations through potentially interesting combinations of values of the animation variables.

  • If we map one or two numeric variables to the sliders, we can animate the size, colour, or position of the entities.
slide-65
SLIDE 65

65

  • The data represents the sales of several companies over time.
  • If the time variable is mapped to a slider and the sales variable is mapped to size, then the entities grow or shrink as the time slider is animated.

SGI MineSet: Scatter Visualizer

  • Splat Visualizer
  • Playing back the animation path, we can watch the changing size, colour, and motion of the data points for trends or anomalies.

  • Navigating through the 3D landscape and scaling the values of variables for greater emphasis are also options.

  • Filtering the display to show only those entities that meet certain criteria can also be used to clarify the scene.
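
The sales example on this slide, with time on a slider and sales mapped to size, can be sketched as a sequence of frames. The function name, the size range and the sample figures are hypothetical; linear scaling over the whole data range is an assumption, not the tool's documented behaviour.

```python
def animation_frames(sales, min_size=5.0, max_size=50.0):
    """Yield (time, {company: marker_size}) for each slider position.

    `sales` maps company -> {time: value}. Sizes are scaled linearly
    between min_size and max_size over the whole data range.
    """
    all_values = [v for series in sales.values() for v in series.values()]
    lo, hi = min(all_values), max(all_values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    times = sorted({t for series in sales.values() for t in series})
    for t in times:
        yield t, {
            company: min_size + (series[t] - lo) / span * (max_size - min_size)
            for company, series in sales.items()
            if t in series
        }

# Invented sales figures for two companies.
sales = {"A": {2001: 100, 2002: 200}, "B": {2001: 300, 2002: 150}}
frames = list(animation_frames(sales))
```

Stepping through `frames` in order makes company A grow and company B shrink, which is the effect described for the animated time slider.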

slide-66
SLIDE 66

66

  • Similar to the Scatter Visualizer, with the distinction that data density is shown by using varying opacity.

  • The result approximates the effect of rendering each data point individually, and is particularly useful for datasets that may contain too many points to display.

SGI MineSet: Splat Visualizer

  • The Splat Visualizer uses graphical objects, called splats, which represent aggregates of data points.

  • The colour and opacity, but not the position, of the splats can change during animation.
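
A rough sketch of the aggregation idea: points are binned into fixed grid cells, and each cell gets an opacity proportional to how many points it holds. This is a simplified 2D stand-in with a hypothetical function name and cell size; it is not MineSet's actual splatting algorithm.

```python
from collections import Counter

def splats(points, cell_size=1.0):
    """Aggregate 2D points into grid cells; opacity grows with density."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    densest = max(counts.values())
    return {
        cell: {"count": n, "opacity": n / densest}
        for cell, n in counts.items()
    }

# Invented sample points: three fall in one cell, one in another.
points = [(0.1, 0.2), (0.4, 0.9), (0.7, 0.3), (2.5, 2.5)]
result = splats(points)
```

Because the cells are fixed, only the count (and hence opacity or colour) of a splat changes when the underlying data changes, while its position stays put.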

slide-67
SLIDE 67

67

  • 3D colour histogram with columns from the Adult data set [Bla98] mapped to axes, sliders, colour, and opacity.

  • The left axis of this example shows the occupations sorted by average income. The occupation executive-managerial, listed at the end of the axis, has the highest average income, providing a natural progression for the values.

SGI MineSet: Splat Visualizer

  • On the other hand, the ordering for the values of education (the right axis) is generally from low to high. In a few cases, though, there are anomalies in that order. This unexpected ordering might be interesting, as it points out places where the data does not agree with expectations.
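
Sorting axis categories by an average, and then spotting places where that order departs from the expected one, might be sketched as follows. The function names are hypothetical and the income figures are invented purely for illustration; they are not values from the Adult data set.

```python
def order_by_average(rows, category, measure):
    """Return (categories sorted by ascending average, the averages)."""
    totals = {}
    for row in rows:
        s, n = totals.get(row[category], (0.0, 0))
        totals[row[category]] = (s + row[measure], n + 1)
    averages = {c: s / n for c, (s, n) in totals.items()}
    return sorted(averages, key=averages.get), averages

def order_anomalies(observed, expected):
    """Adjacent pairs in `observed` that are reversed relative to `expected`."""
    rank = {c: i for i, c in enumerate(expected)}
    return [(a, b) for a, b in zip(observed, observed[1:]) if rank[a] > rank[b]]

# Invented figures (in thousands); one pair is deliberately out of order.
rows = [
    {"education": "HS-grad", "income": 30},
    {"education": "Some-college", "income": 28},
    {"education": "Bachelors", "income": 50},
    {"education": "Masters", "income": 60},
]
observed, averages = order_by_average(rows, "education", "income")
anomalies = order_anomalies(
    observed, ["HS-grad", "Some-college", "Bachelors", "Masters"]
)
```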

slide-68
SLIDE 68

68

Visual Data Mining on the Validation Stage

  • Applying visual data mining not just to the raw data but also to the outcomes produced by the data mining algorithms would address:

  • Association rules

– large resulting rule set, hard to understand, rule behaviour problems.

  • Relevance analysis

– number of attributes, temporal factor problems

  • Classification

– separation, similarity, coherence, formation problems.

slide-69
SLIDE 69

69

Visual Data Mining Models for Association Rules

  • Bar Chart Form Model

– Each sub-expression of a rule is visualized by a bar in the chart.
– The bar’s length, depth, colour and position are the features used in the representation.

slide-70
SLIDE 70

70

Visualizing a set of Association Rules

  • Neighbourhoods of Mining Interest
  • Similarity Arrangement

– Attributes participation table
– Rules similarity table
– Smart 2D placement
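
One plausible reading of the first two tables is sketched below: a 0/1 table recording which attributes take part in each rule, and a pairwise similarity table derived from it. The slide does not name a similarity measure, so the Jaccard coefficient used here is an assumption, as are the function names and the sample rules.

```python
def participation_table(rules, attributes):
    """One row per rule, one 0/1 column per attribute."""
    return [[1 if a in rule else 0 for a in attributes] for rule in rules]

def jaccard(row_a, row_b):
    """Shared attributes divided by attributes used by either rule."""
    both = sum(a and b for a, b in zip(row_a, row_b))
    either = sum(a or b for a, b in zip(row_a, row_b))
    return both / either if either else 1.0

def similarity_table(table):
    """Pairwise rule similarity (assumed measure: Jaccard)."""
    return [[jaccard(a, b) for b in table] for a in table]

# Hypothetical rules, each given as the set of attributes it mentions.
attributes = ["age", "education", "hours", "occupation"]
rules = [{"age", "education"}, {"age", "hours"}, {"occupation"}]
table = participation_table(rules, attributes)
similarity = similarity_table(table)
```

A 2D placement step could then put rules with high similarity next to each other; that step is not sketched here.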

slide-71
SLIDE 71

71

Grid Form Visual Data Mining Model for Association Rules

  • Cluster of cells, constructing a grid form.
  • Each column corresponds to an attribute and each row represents a rule.

  • The sub-expression’s characteristics are coded to the properties of the bar.

slide-72
SLIDE 72

72

Parallel Coordinates VDM for Association Rules

  • Each rule in the n-dimensional space is represented as a polygonal line that intersects each axis at the point corresponding to the characteristics of the considered sub-expression.
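
A minimal sketch of turning one rule into a polygonal line, assuming each sub-expression is a numeric interval and that the axis crossing point is the interval midpoint scaled to [0, 1]. Both choices, along with the names and the sample rule, are illustrative assumptions; the slide does not specify which characteristic is plotted.

```python
def rule_polyline(rule, axes):
    """Return [(axis_index, y)] where y in [0, 1] is the scaled interval midpoint.

    `rule` maps attribute -> (low, high); `axes` is an ordered list of
    (attribute, axis_min, axis_max). Attributes absent from the rule are skipped.
    """
    line = []
    for index, (attribute, axis_min, axis_max) in enumerate(axes):
        if attribute in rule:
            low, high = rule[attribute]
            midpoint = (low + high) / 2
            line.append((index, (midpoint - axis_min) / (axis_max - axis_min)))
    return line

# Hypothetical axes and rule.
axes = [("age", 0, 100), ("hours", 0, 80), ("education_years", 0, 20)]
rule = {"age": (20, 40), "education_years": (10, 20)}
line = rule_polyline(rule, axes)
```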

slide-73
SLIDE 73

73

Visualizing Relevance Analysis Outcomes

  • Solar Plexus Model

– Each attribute is represented by a circle.
– The closer a circle is to the centre, the more relevant the attribute is to the target attribute.
– 3D-snail placement of the spheres in the 3D world.
– Spambase case study
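
The placement rule, more relevant attributes closer to the centre, can be sketched in 2D as a spiral. The exact snail placement used in the model is not given on the slide, so the fixed angular step, the `1 - relevance` radius, the function name and the sample scores are all assumptions.

```python
import math

def solar_plexus_layout(relevance, turn=math.pi / 4):
    """Place attributes on a spiral: most relevant nearest the centre.

    `relevance` maps attribute -> score in [0, 1] (higher is more relevant).
    Returns attribute -> (x, y, radius).
    """
    layout = {}
    ordered = sorted(relevance, key=relevance.get, reverse=True)
    for step, attribute in enumerate(ordered):
        radius = 1.0 - relevance[attribute]
        angle = step * turn  # each next attribute is rotated by a fixed step
        layout[attribute] = (radius * math.cos(angle),
                             radius * math.sin(angle),
                             radius)
    return layout

# Invented attribute names and relevance scores.
scores = {"word_free": 0.9, "char_dollar": 0.7, "word_meeting": 0.2}
layout = solar_plexus_layout(scores)
```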

slide-74
SLIDE 74

74

3D Class-Preserving Projection Technique

  • Projection scheme that maps from R^n to R^3, preserving the properties of the high-dimensional data.

– Letter Image Recognition case study
– Maximizes the inter-class distances among the class means
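
For the special case of four classes, one simple way to keep the distances among class means as large as an orthogonal projection allows is to project onto the 3D subspace spanned by the differences between the means, since an orthogonal projection can never lengthen a distance. The sketch below does this with Gram-Schmidt; it assumes exactly four classes with affinely independent means, and all names and data are hypothetical. It is not necessarily the projection scheme used in the presentation.

```python
import math

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(directions, eps=1e-12):
    """Gram-Schmidt: orthonormalise the given direction vectors."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [x - c * y for x, y in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > eps:
            basis.append([x / norm for x in v])
    return basis

def class_preserving_projection(points_by_class):
    """Return a function mapping an n-D point to coordinates in the
    subspace spanned by the class-mean differences (3D for four classes)."""
    labels = sorted(points_by_class)
    means = [mean(points_by_class[label]) for label in labels]
    origin = means[0]
    basis = orthonormal_basis([sub(m, origin) for m in means[1:4]])

    def project(point):
        d = sub(point, origin)
        return [dot(d, b) for b in basis]

    return project

# Invented 5-D data for four classes.
points_by_class = {
    "A": [[0, 0, 0, 0, 0], [0, 0, 0, 0, 2]],
    "B": [[1, 0, 0, 0, 1]],
    "C": [[0, 2, 0, 0, 1]],
    "D": [[0, 0, 3, 1, 1]],
}
project = class_preserving_projection(points_by_class)
```

With more than four classes the subspace would have to be chosen by some optimisation criterion, which this sketch does not attempt.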

slide-75
SLIDE 75

75

Class-Preserving Projection Techniques & Association Rules

  • Each rule could be perceived as an n-D surface, enclosing a sub-space.

  • The boundaries of that sub-space are defined by the conditions of the rule’s sub-expressions, which pose the limits in each dimension.
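
Read this way, a rule with interval conditions is an axis-aligned box, and checking whether a record falls inside it is a per-dimension bounds test. The sketch assumes inclusive numeric intervals; the function name and sample values are hypothetical.

```python
def rule_covers(rule, record):
    """True if the record lies inside the box defined by the rule.

    `rule` maps attribute -> (low, high), inclusive. Attributes the rule
    does not mention are unconstrained in that dimension.
    """
    return all(low <= record[attr] <= high for attr, (low, high) in rule.items())

# Hypothetical rule and records.
rule = {"age": (20, 40), "hours": (35, 60)}
inside = rule_covers(rule, {"age": 30, "hours": 40, "education_years": 12})
outside = rule_covers(rule, {"age": 45, "hours": 40, "education_years": 12})
```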

slide-76
SLIDE 76

76

Visual Mining of Association Rules - Dermatology case study

  • Bar Chart Form model

  • Parallel Coordinates model

  • Grid Form model

  • Detailed Bar Chart Form model

slide-77
SLIDE 77

77

Visual Mining of Association Rules - Adult case study

  • Bar Chart Form model

  • Parallel Coordinates model

  • Grid Form model

slide-78
SLIDE 78

78

Summary

  • Our point of view: a synthesis of data mining, visualization and the human user to enhance the effectiveness of the overall data mining process.

  • We investigated the existing visualization techniques extensively.

  • Visual Data Mining Modelling Suggestions
slide-79
SLIDE 79

79

Accomplishments

  • Link the two most powerful information processing systems, humans and the modern computer.

  • Combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today’s computers.

  • Harness the perceptual, creative and exploratory capabilities of humans to provide visual insight into data.

  • Open up channels for the infusion of domain knowledge into the knowledge discovery process.

  • Support the effort to develop a user-friendly data mining tool that can be used by all interested users.

slide-80
SLIDE 80

80

The End

Any Questions?