Relational Data Mining and GUHA
Tomáš Karban
DATESO 2005, April 14, 2005

Data Mining
• AKA knowledge discovery in databases
• Practice of automatic search for patterns in large data stores
  - implicit, previously unknown, interesting, potentially useful
• Techniques from statistics, machine learning, pattern recognition, propositional logic, ...

Taxonomy of Methods/Areas
• Classification/prediction
  - create a model from a training data set and classify new examples (objects)
  - stress on accuracy
  - decision trees, decision rules, neural networks, Bayesian methods
• Descriptive methods
  - high-level description, stress on simplicity
  - clustering methods
• Search for “nuggets”
  - interesting patterns, details, rules, exceptions, ...
  - mining for association rules

Single Table Limit
• Most methods use a single data table (data matrix, flat file, attribute-value format)
  - rows = observations, objects, examples, items
  - columns = variables, properties, attributes, characteristics, features
• Real-world data are usually stored in several tables of a relational database ⇒ preprocessing to a single table
  - manual task, database joins, aggregations
  - more complex processing, e.g. time series analysis, linear regression, ...

Relational Data Mining
• Some methods or algorithms can be generalized to accept multiple data tables
  - relational classification rules, relational regression trees, relational association rules (WARMR)
• Methods of inductive logic programming (ILP) naturally use multiple data tables
• My doctoral thesis extends the GUHA method to mine association rules from multiple data tables

Association Rules (1)
• Express a relation between premise (antecedent) and consequence (succedent): ϕ ≈ ψ
• ϕ and ψ are Boolean attributes derived as conjunctions from columns of the studied data table
• ≈ stands for a quantifier: a truth condition based on the contingency table of ϕ and ψ
• Example: Smoking(> 20 cigs.) & PhysicalActivity(high) ⇒ 85% RespirationTroubles(yes)

Association Rules (2)
• Contingency table

         ψ    ¬ψ
    ϕ    a    b
    ¬ϕ   c    d

• Founded implication ⇒_{p,Base}: ϕ ⇒_{p,Base} ψ is true iff a / (a + b) ≥ p & a ≥ Base
• Various quantifiers available: implications, double implications, equivalence, statistical hypothesis tests, above/outside average relations, etc.
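The founded implication condition can be checked directly from the two frequencies a and b. The following is a minimal sketch; the function name and the sample numbers are illustrative assumptions, not part of the original slides.

```python
def founded_implication(a: int, b: int, p: float, base: int) -> bool:
    """Return True when a / (a + b) >= p and a >= base.

    a = objects satisfying both antecedent and succedent,
    b = objects satisfying the antecedent but not the succedent.
    """
    if a + b == 0:
        # No object satisfies the antecedent, so the rule has no support.
        return False
    return a / (a + b) >= p and a >= base


# Hypothetical frequencies: 90 of 100 antecedent objects also satisfy the succedent.
print(founded_implication(a=90, b=10, p=0.85, base=50))
```

Note that the frequencies c and d do not enter this particular quantifier; other quantifiers (equivalences, statistical tests) use the full table.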

GUHA Method
• Hájek, P., Havránek, T.: Mechanizing Hypothesis Formation: Mathematical Foundations for a General Theory. Springer-Verlag, 1978
• [Diagram: analyzed data + simple setting of many relevant hypotheses → generating and testing (antecedent ≈ succedent) → all valid hypotheses]

Effective Implementation
• Database is represented “vertically” in bit strings
  - a bit string represents a single value of a single attribute
  - bit 1 denotes that the object has that value, bit 0 otherwise
• Antecedent and succedent are constructed as conjunctions of literals (attributes or their negations)
  - using bitwise operations AND, NOT, OR
• Frequencies in the contingency table are counts of 1 bits in the bit strings B_ϕ ∧ B_ψ, B_ϕ ∧ B_¬ψ, ...
• Careful handling of missing information (negation, three-valued logic)
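The bit-string representation can be sketched with Python integers used as bit sets. This is only an illustration under simplifying assumptions: the data are made up, all names are hypothetical, and missing values (which the slide says need careful three-valued handling) are ignored entirely.

```python
# Hypothetical toy data: 8 objects, bit i of each integer describes object i.
N_OBJECTS = 8
ALL = (1 << N_OBJECTS) - 1  # mask with a 1 bit for every object


def popcount(bits: int) -> int:
    """Count the 1 bits in a bit string."""
    return bin(bits).count("1")


def contingency(b_phi: int, b_psi: int) -> tuple[int, int, int, int]:
    """Return the frequencies (a, b, c, d) of the contingency table of phi and psi."""
    not_phi = ~b_phi & ALL  # negation restricted to the existing objects
    not_psi = ~b_psi & ALL
    a = popcount(b_phi & b_psi)
    b = popcount(b_phi & not_psi)
    c = popcount(not_phi & b_psi)
    d = popcount(not_phi & not_psi)
    return a, b, c, d


# One bit string per attribute value (made-up values).
smoking_heavy = 0b00111101
activity_high = 0b01101111
troubles_yes = 0b00101110

# The antecedent is a conjunction of literals, so it is a bitwise AND.
antecedent = smoking_heavy & activity_high
print(contingency(antecedent, troubles_yes))  # (3, 1, 1, 3)
```

Each frequency costs one bitwise AND plus one bit count, which is why the vertical layout makes verifying many hypotheses cheap.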

An Alternative - APRIORI
• Agrawal, R. et al.: Fast Discovery of Association Rules. In Fayyad, U.M. et al.: Advances in Knowledge Discovery and Data Mining, pp. 307-328, AAAI Press / MIT Press, 1996
• Useful for market basket analysis (sparse data matrix)
• Transactions containing items A, B, C tend to contain item X as well (ABC → X)
  - measures: confidence, support
• Two phases
  - generating frequent itemsets
  - generating association rules
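The two phases can be sketched as follows. This is a simplified, unoptimized illustration of the level-wise idea, not the algorithm as published; the function names, the use of an absolute support count, and the restriction to single-item consequents are assumptions made for brevity.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Phase 1: level-wise search for itemsets occurring in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets; keep candidates whose (k-1)-subsets are all frequent.
        level = {
            x | y
            for x in current
            for y in current
            if len(x | y) == k
            and all(frozenset(s) in current for s in combinations(x | y, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Phase 2: derive rules X -> y (single-item consequent) from the frequent itemsets."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            x = itemset - {y}
            confidence = support / frequent[x]  # x is frequent by downward closure
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


# Hypothetical market baskets.
baskets = [set("ABCX"), set("ABCX"), set("ABC"), set("AX"), set("BC")]
found = frequent_itemsets(baskets, min_support=2)
for x, y, support, confidence in association_rules(found, min_confidence=0.6):
    print(sorted(x), "->", y, support, round(confidence, 2))
```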

Relational Association Rules
• We consider one data table as “the main”
• Additional tables are in a 1:N relation
  - foreign key constraint, “master-detail”, star schema
• Clients: Birth, Gender, MaritalStatus, Children, LoanQuality
• Transactions: Date, TransactionAmount, SourceAccount, TargetAccount

Example
• MaritalStatus(divorced) & Children(3) & SingleIncome(yes) & AvgIncome(< 1500) ⇒ 76% LoanQuality(bad)
• SingleIncome derived as: TransactionAmount(> 500) ⇒ 93% SourceAccount(acc345) / Client(ABC)
  - yes = strength of the hypothesis is greater than 90%
• AvgIncome derived as: AVG(SELECT SUM(TransactionAmount) WHERE (TransactionAmount > 0) GROUP BY YearMonth)

Adaptation to Relational DM
• Single-table DM can be described by the CRISP-DM methodology
  - ..., data preprocessing, modeling, ...
• Usually spiral development
  - after some success in modeling and evaluation, data are modified, prepared better, new run, ...
• Previously distinct steps now partially blend
  - some preprocessing is now specified as part of the modeling setup and can be done semi-automatically (heuristics)

Virtual Attributes
• The basic idea is to bring data in some form from detail tables into the main data table = create virtual attributes
• Three types:
  - aggregate attributes
  - existential attributes
  - association attributes (hypothesis attributes)
• In the ILP world this is called “propositionalization”
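The three types of virtual attributes can be illustrated on a toy detail table of transactions, loosely following the Clients/Transactions example. This is a sketch under assumptions: the rows, the helper names, and the way the association attribute picks the dominant source account are all hypothetical and do not describe the Rel-Miner implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical detail table rows: (client, year_month, amount, source_account)
transactions = [
    ("ABC", "2004-01", 1200, "acc345"),
    ("ABC", "2004-01", -300, "acc900"),
    ("ABC", "2004-02", 1300, "acc345"),
    ("ABC", "2004-02", 40, "acc777"),
    ("XYZ", "2004-01", 700, "acc111"),
    ("XYZ", "2004-01", 900, "acc222"),
]


def rows_of(client):
    """Select the detail rows belonging to one row of the main table."""
    return [row for row in transactions if row[0] == client]


def avg_income(rows):
    """Aggregate attribute: average of the monthly sums of positive amounts."""
    monthly = defaultdict(float)
    for _, month, amount, _ in rows:
        if amount > 0:
            monthly[month] += amount
    return mean(monthly.values()) if monthly else 0.0


def has_transaction_over(rows, limit):
    """Existential attribute: does at least one detail row exceed the limit?"""
    return any(amount > limit for _, _, amount, _ in rows)


def single_income(rows, min_amount=500, threshold=0.9):
    """Association attribute: TransactionAmount(> min_amount) => SourceAccount(x).

    Assumption: 'yes' when the most frequent source account covers more than
    `threshold` of the client's large transactions.
    """
    sources = [src for _, _, amount, src in rows if amount > min_amount]
    if not sources:
        return False
    best = max(sources.count(src) for src in set(sources))
    return best / len(sources) > threshold


for client in ("ABC", "XYZ"):
    rows = rows_of(client)
    print(client, avg_income(rows), has_transaction_over(rows, 1000), single_income(rows))
```

Each function yields one value per client, so the results can be appended as ordinary columns of the main table before rule mining.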

WARMR
• Extension to APRIORI: Itemsets → Atomsets
  - existentially quantified conjunction (Prolog query)
  - frequent atomsets
  - + user-specified theory for pruning the search space
• Example: likes(K, dogs) & has(K, A) ⇒ prefers(K, dogs, A)
  - If child K likes dogs and already has an arbitrary animal A, he/she definitely prefers having dogs over A.

Comparison of GUHA and WARMR
• WARMR belongs to “selective methods” because of its use of existentially quantified queries
  - suitable for structurally complex domains, e.g. molecular biology (“simple” data types, many tangled data tables)
  - association rules are structural patterns spanning many tables
• Rel-Miner belongs rather to “aggregating methods”
  - existential attributes are not so powerful; they are limited to one detail table
  - suitable for non-determinate domains, usually in business (many-valued categories, real numbers, simple database schema)
  - association rules are focused on the master table, which is enhanced by virtual attributes

Complexity of Relational Hypotheses
• Relational hypothesis space is enormous
  - it grows exponentially with the number of attributes (and their values)
  - number of virtual attributes is a sum of
    - meaningful aggregation attributes (low)
    - potentially useful association attributes
      - total number is exponential in the number of attributes in the detail table, which is too much
      - potentially useful = hypothesis is true for some part of objects (say between 10% and 90%)
• Complex hypotheses are hard to interpret
  - they are not “interesting” in a sense...

Reordering the Verification
• We give up the idea that the whole hypothesis space can be crawled and verified
• Start with the simplest hypotheses, go to more details
  - hypothesis complexity is vague
    - number of literals, user-defined importance of attributes
  - possible user interaction
    - interestingness of intermediate results, slight run-time modification of data mining task, user hints
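A simplest-first ordering can be sketched with a generator that yields antecedents by increasing number of literals. This is an assumption-laden illustration: complexity is measured only by the number of literals (the slide notes the measure is vague and may include user-defined importance), and the literal names are made up.

```python
from itertools import combinations, islice


def antecedents_by_length(literals, max_len):
    """Yield conjunctions of literals, shortest first, up to max_len literals."""
    for length in range(1, max_len + 1):
        for combo in combinations(literals, length):
            yield combo


# Hypothetical literals, one per attribute.
literals = ["MaritalStatus(divorced)", "Children(3)", "SingleIncome(yes)"]

# Because the generator is lazy, verification can stop at any time (e.g. on a
# user hint or a time budget) having covered the simplest hypotheses first.
for antecedent in islice(antecedents_by_length(literals, max_len=2), 5):
    print(" & ".join(antecedent))
```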

Distributed Computing
• One database, one data preparation engine
• Many data mining processors
• Task can be split into disjoint fragments (jobs)
  - visual projection of hypothesis space = high-dimensional cube
  - dimensions = attributes
  - fragments can be slices or mini-cubes
  - the whole task cube is “hollow” because of the limit on hypothesis length
• We can optimize task fragments to
  - take a small amount of input (low number of bit strings)
  - be computed optimally (common sub-expressions in hypotheses)

Amount of Output
• Usual drawback of association rules = too many hypotheses as a result
• User usually sorts them by some criterion that can be expressed as a real number
• Adopting a “TOP100” strategy, i.e. we can let the task self-modify as we get intermediate results
• Visualization: graph of the hypotheses lattice
  - nodes = hypotheses, fuzzy edges = similarity of hypotheses
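One way to read the “TOP100” strategy is to keep only the k best-scoring hypotheses seen so far and to use the weakest kept score as a rising threshold for the rest of the run. The sketch below assumes a single real-valued score per hypothesis; the class and method names are hypothetical.

```python
import heapq


class TopK:
    """Keep the k best (score, hypothesis) pairs seen so far."""

    def __init__(self, k=100):
        self.k = k
        self._heap = []  # min-heap: the weakest kept hypothesis is at index 0

    def threshold(self):
        """Score a new hypothesis must exceed to be kept (None while not full)."""
        return self._heap[0][0] if len(self._heap) == self.k else None

    def offer(self, hypothesis, score):
        """Add the hypothesis if it belongs to the current top k."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, hypothesis))
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, (score, hypothesis))

    def best(self):
        """Return the kept pairs, best first."""
        return sorted(self._heap, reverse=True)


# Hypothetical scored hypotheses arriving as intermediate results.
top = TopK(k=3)
for name, score in [("h1", 0.76), ("h2", 0.93), ("h3", 0.51), ("h4", 0.88), ("h5", 0.40)]:
    top.offer(name, score)
print(top.best())       # [(0.93, 'h2'), (0.88, 'h4'), (0.76, 'h1')]
print(top.threshold())  # 0.76
```

The value returned by `threshold()` is the piece a running task could feed back into its own settings, for example by raising the minimum strength required of newly generated hypotheses.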

Conclusion
• New data mining tool Rel-Miner is being developed
• Builds on the success of LISp-Miner
• It is different from the ILP approach
  - aggregations
  - more expressive rules and quantifiers
  - slightly different target application domain
  - heuristics to deal with the enormous hypothesis space
• Thank you!
